Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. To this end, we examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets, each comprising an independent derivation and validation sample collected from temporally distinct periods: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes, one based on each of the six learning methods, to simulate outcomes in the derivation and validation samples based on 33 and 28 predictors in the AMI and CHF datasets, respectively. We applied the six prediction methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples according to c-statistic, generalized R², Brier score, and calibration. While no method had uniformly superior performance across all six data-generating processes and eight performance metrics, (un)penalized logistic regression and boosted trees tended to have superior performance to the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large datasets.
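Two of the performance metrics named above, the c-statistic and the Brier score, can be illustrated with a minimal sketch. The function names and the toy data are assumptions made for illustration only; they are not taken from the study, which does not describe its implementation here.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Concordance (c) statistic for a binary outcome.

    Proportion of (event, non-event) pairs in which the event received the
    higher predicted probability; tied predictions count as half concordant.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1.0
        elif p_event == p_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy validation sample (illustrative values, not study data).
toy_outcomes = [0, 0, 1, 1]
toy_predictions = [0.1, 0.4, 0.35, 0.8]

toy_c = c_statistic(toy_outcomes, toy_predictions)
toy_brier = brier_score(toy_outcomes, toy_predictions)
```

A higher c-statistic indicates better discrimination, while a lower Brier score indicates better overall accuracy of the predicted probabilities. The pairwise loop is quadratic in sample size and is meant only to make the definition explicit.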
The four possible segments A, T, C and G that link together to form DNA molecules, and with their ordering encode genetic information, are not only different in name, but also in their physical and chemical properties. The result is that DNA molecules with different sequences have different physical behavior. For instance, one sequence may lead to a very flexible DNA molecule, another to a very stiff one. A DNA molecule with a given sequence may be straight, or intrinsically curved. This leads to an interplay between the information stored in a DNA molecule on one hand, and the physical properties of that molecule on the other. This is of great importance in our cells, where lengths of DNA far longer than the size of the cells that contain them need to be significantly folded up. The research presented in this thesis looks at how we can model this interplay, what its effects can be, and whether nature has made use of it to encode mechanical signals into real genomes.
Recent studies have provided experimental evidence for the existence of an encounter complex, a transient intermediate in the formation of protein complexes. We have used paramagnetic relaxation enhancement NMR spectroscopy in combination with Monte Carlo simulations to characterize and visualize the ensemble of encounter orientations in the short-lived electron transfer complex of yeast cytochrome c (Cc) and cytochrome c peroxidase (CcP). The complete conformational space sampled by the protein molecules during the dynamic part of the interaction was mapped experimentally. Our results demonstrate that the encounter complex is populated for 30% of the time, during which Cc samples only about 15% of the surface area of CcP. We have also shown that the occupancy of the encounter complex can be modulated across a broad range by single point mutations of interfacial residues. Thus, by adjusting the population of the encounter complex through a judicious choice of point mutations, we can remodel the energy landscape of a protein complex and tune its binding specificity. It has not been well established whether binding hot spots, which are frequently found in strong static complexes, also govern transient protein interactions. To address this issue, we have investigated an electron transfer complex of physiological partners from yeast: yeast Cc and yeast CcP. Using NMR spectroscopy and double mutant cycle analysis, we show that Cc R13 is a hot-spot residue, as the R13A mutation has a strong destabilizing effect on binding. Based on our analysis, we propose that binding energy hot spots, which are prevalent in static protein complexes, could also govern transient protein interactions.
We have also investigated the effect of interface mutations on the structure and dynamics of the horse Cc-yeast CcP complex using NMR spectroscopy and X-ray crystallography. Horse Cc forms a more dynamic complex with yeast CcP than native yeast Cc does, and the two Cc molecules adopt different orientations in complex with CcP. Interestingly, a single interface mutation makes the complex more specific, with the horse Cc in an orientation resembling that of the native yeast Cc.
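The double mutant cycle mentioned in this abstract is commonly summarized by a coupling free energy. The expression below is a sketch of the standard textbook form, not necessarily the exact convention used in the thesis. The symbols are introduced here as assumptions: $\Delta G_b = RT \ln K_D$ is the binding free energy, "wt" denotes the wild-type complex, "A" and "B" the two single mutants, and "A,B" the double mutant.

```latex
\Delta\Delta G_{\mathrm{int}}
  = \Delta G_b^{\mathrm{A,B}}
  - \Delta G_b^{\mathrm{A}}
  - \Delta G_b^{\mathrm{B}}
  + \Delta G_b^{\mathrm{wt}}
```

A value of $\Delta\Delta G_{\mathrm{int}}$ near zero suggests that the two mutated residues contribute to binding independently, whereas a value clearly different from zero suggests that they are energetically coupled.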