Modern-day technology allows us to collect and store more data than ever before. Although this abundance of data offers great opportunities for formulating and testing all kinds of hypotheses, and for discovering and modeling patterns that can be used for future prediction, these opportunities do not come without risks. If the data are explored too naively, there is a large probability that at least part of the findings will not be reproducible in future research. As the number of hypotheses, or the number of parameters within a model, increases, so does the chance of ending up in such a situation. Profiting from all available data while guarding against chance findings calls for advanced statistical techniques. The development of such techniques is the subject of this thesis. We focus both on multiple hypothesis testing and on the construction of prediction models. Throughout the thesis, emphasis is placed on the development of methods that are efficient, both in terms of statistical power and in terms of computational feasibility.
In model building and model evaluation, cross-validation is a frequently used resampling method. Unfortunately, it can be quite time consuming. In this article, we discuss an approximation method that is much faster and can be used in generalized linear models and in Cox's proportional hazards model with a ridge penalty term. Our approximation is based on a Taylor expansion around the estimate of the full model. In this way, all cross-validated estimates are approximated without refitting the model, and the tuning parameter can be optimized in much less time. The method is most accurate when approximating leave-one-out cross-validation results for large data sets, which is precisely the situation that is otherwise the most computationally demanding. To demonstrate the method's performance, we apply it to several microarray data sets. An R package, penalized, which implements the method, is available on CRAN.
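The core idea, obtaining all leave-one-out results from a single full-model fit instead of n refits, can be illustrated in the simplest setting. The sketch below is not the article's Taylor-expansion algorithm for GLMs and the Cox model; it shows the analogous, exact shortcut for *linear* ridge regression, where the LOO residuals follow directly from the full-model hat matrix (function names and the toy data are illustrative):

```python
import numpy as np

def ridge_loo_residuals(X, y, lam):
    """LOO cross-validation residuals from a single ridge fit.

    Uses the classical shortcut e_i / (1 - h_ii), where h_ii are the
    diagonal entries of the hat matrix H = X (X'X + lam*I)^{-1} X'.
    """
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    H = X @ np.linalg.solve(A, X.T)     # hat matrix of the full model
    resid = y - H @ y                    # full-model residuals
    return resid / (1.0 - np.diag(H))    # exact LOO residuals, no refitting

def ridge_loo_residuals_naive(X, y, lam):
    """Same quantity by explicitly refitting n times (slow reference)."""
    n, p = X.shape
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        A = X[mask].T @ X[mask] + lam * np.eye(p)
        beta = np.linalg.solve(A, X[mask].T @ y[mask])
        out[i] = y[i] - X[i] @ beta
    return out

# Toy data: 50 observations, 5 covariates.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(50)

fast = ridge_loo_residuals(X, y, lam=1.0)
slow = ridge_loo_residuals_naive(X, y, lam=1.0)
assert np.allclose(fast, slow)  # exact agreement in the linear case
```

In the linear case the shortcut is exact; for generalized linear models and the Cox model the likelihood is no longer quadratic, which is where a Taylor expansion around the full-model estimate comes in to approximate the left-out fits in the same refit-free spirit.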