Background: Missing or incomplete phenotypic information can severely reduce statistical power in epidemiological studies. High-throughput quantification of small molecules in bio-samples, i.e. 'metabolomics', is steadily gaining popularity, as it is highly informative for various phenotypic characteristics. Here we aim to leverage metabolomics to impute missing data in clinical variables routinely assessed in large epidemiological and clinical studies. Methods: To this end, we employed ~26,000 ¹H-NMR metabolomics samples from 28 Dutch cohorts collected within the BBMRI-NL consortium to create 19 metabolomics-based predictors for clinical variables, including diabetes status (AUC, 5-fold CV = 0.94) and lipid medication usage (AUC, 5-fold CV = 0.90). Findings: Subsequent application in independent cohorts confirmed that our metabolomics-based predictors can indeed be used to impute a wide array of missing clinical variables from a single metabolomics data resource. In addition, application highlighted the potential use of our predictors to explore the effects of totally unobserved confounders in omics association studies. Finally, we show that our predictors can be used to explore risk factor profiles contributing to mortality in older participants. Interpretation: To conclude, we provide ¹H-NMR metabolomics-based models to impute clinical variables routinely assessed in epidemiological studies and illustrate their merit in scenarios where phenotypic variables are partially incomplete or totally unobserved. Copyright (C) 2021 The Author(s). Published by Elsevier B.V.
Longitudinal categorical data are often collected using an experimental design where the interest is in the differential development of the treatment group compared to the control group. Such differential development is often assessed based on average growth curves but can also be based on transitions. For longitudinal multinomial data we describe a transitional methodology for the statistical analysis based on a distance model. Such a distance approach has two advantages compared to a multinomial regression model: (1) sparse data can be handled more efficiently; (2) a graphical representation of the model can be made to enhance interpretation. Within this approach it is possible to jointly model the observations and missing values by adding a new category to the response variable representing the missingness condition. This approach is investigated in a Monte Carlo simulation study. The results show this is a promising way to deal with missing data, although the mechanism is not yet completely understood in all cases. Finally, an empirical example is presented where the advantages of the modeling procedure are highlighted.