Background: Clinical prediction models should be validated before implementation in clinical practice. But is favorable performance at internal validation or one external validation sufficient to claim that a prediction model works well in the intended clinical context? Main body: We argue to the contrary because (1) patient populations vary, (2) measurement procedures vary, and (3) populations and measurements change over time. Hence, we have to expect heterogeneity in model performance between locations and settings, and across time. It follows that prediction models are never truly validated. This does not imply that validation is not important. Rather, the current focus on developing new models should shift to a focus on more extensive, well-conducted, and well-reported validation studies of promising models. Conclusion: Principled validation strategies are needed to understand and quantify heterogeneity, monitor performance over time, and update prediction models when appropriate. Such strategies will help to ensure that prediction models stay up-to-date and safe to support clinical decision-making.
Background: Clinical prediction models are often not evaluated properly in specific settings or updated, for instance, with information from new markers. These key steps are needed such that models are fit for purpose and remain relevant in the long term. We aimed to present an overview of methodological guidance for the evaluation (i.e., validation and impact assessment) and updating of clinical prediction models. Methods: We systematically searched nine databases from January 2000 to January 2022 for articles in English with methodological recommendations for the post-derivation stages of interest. Qualitative analysis was used to summarize the 70 selected guidance papers. Results: Key aspects of validation are the assessment of statistical performance using measures for discrimination (e.g., C-statistic) and calibration (e.g., calibration-in-the-large and calibration slope). For assessing impact or usefulness in clinical decision-making, recent papers advise using decision-analytic measures (e.g., the Net Benefit) over simplistic classification measures that ignore clinical consequences (e.g., accuracy, overall Net Reclassification Index). Commonly recommended methods for model updating are recalibration (i.e., adjustment of intercept or baseline hazard and/or slope), revision (i.e., re-estimation of individual predictor effects), and extension (i.e., addition of new markers). Additional methodological guidance is needed for newer types of updating (e.g., meta-model and dynamic updating) and machine learning-based models. Conclusion: Substantial guidance was found for model evaluation and more conventional updating of regression-based models. An important development in model evaluation is the introduction of a decision-analytic framework for assessing clinical usefulness. Consensus is emerging on methods for model updating.
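The three updating methods named above (recalibration, revision, and extension) can be sketched for a logistic regression model. This is an illustrative sketch on simulated data with made-up coefficients, not any specific published model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated validation set; intercept, coefficients, and data are hypothetical.
rng = np.random.default_rng(0)
X_val = rng.normal(size=(500, 3))
orig_intercept, orig_coefs = -1.0, np.array([0.8, -0.5, 0.3])
lp_val = orig_intercept + X_val @ orig_coefs          # original model's linear predictor
y_val = rng.binomial(1, 1 / (1 + np.exp(-lp_val)))    # observed outcomes

# (1) Recalibration: refit only the intercept and slope on the linear predictor.
recal = LogisticRegression().fit(lp_val.reshape(-1, 1), y_val)

# (2) Revision: re-estimate all individual predictor effects.
revised = LogisticRegression().fit(X_val, y_val)

# (3) Extension: add a new marker alongside the existing linear predictor.
new_marker = rng.normal(size=(500, 1))
extended = LogisticRegression().fit(
    np.hstack([lp_val.reshape(-1, 1), new_marker]), y_val
)
```

When the validation data come from the derivation population, as simulated here, the recalibration slope should land near 1.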
Endt, V.H.W. van der; Milders, J.; Vries, B.B.L.P. de; Trines, S.A.; Groenwold, R.H.H.; Dekkers, O.M.; ... ; Jong, Y. de 2022
Aims: Multiple risk scores to predict ischaemic stroke (IS) in patients with atrial fibrillation (AF) have been developed. This study aims to systematically review these scores, their validations and updates, assess their methodological quality, and calculate pooled estimates of the predictive performance. Methods and results: We searched PubMed and Web of Science for studies developing, validating, or updating risk scores for IS in AF patients. Methodological quality was assessed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). To assess discrimination, pooled c-statistics were calculated using random-effects meta-analysis. We identified 19 scores, which were validated and updated once or more in 70 and 40 studies, respectively, including 329 validations and 76 updates, nearly all on the CHA2DS2-VASc and CHADS2. Pooled c-statistics were calculated among 6 267 728 patients and 359 373 events of IS. For the CHA2DS2-VASc and CHADS2, pooled c-statistics were 0.644 [95% confidence interval (CI) 0.635-0.653] and 0.658 (0.644-0.672), respectively. Better discriminatory abilities were found in the newer risk scores, with the modified CHADS2 demonstrating the best discrimination [c-statistic 0.715 (0.674-0.754)]. Updates were found for the CHA2DS2-VASc and CHADS2 only, showing improved discrimination. Calibration was reasonable but available for only 17 studies. The PROBAST indicated a risk of methodological bias in all studies. Conclusion: Nineteen risk scores and 76 updates are available to predict IS in patients with AF. The guideline-endorsed CHA2DS2-VASc shows inferior discriminative abilities compared with newer scores. Additional external validations and data on calibration are required before considering the newer scores in clinical practice.
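Random-effects pooling of study-level c-statistics, as used in this review, is commonly implemented with the DerSimonian-Laird estimator of between-study variance. A minimal sketch on made-up study estimates (not the review's data):

```python
import numpy as np

def pool_dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate (DerSimonian-Laird) with a 95% CI."""
    w = 1.0 / variances                                  # fixed-effect weights
    theta_fe = np.sum(w * estimates) / np.sum(w)
    q = np.sum(w * (estimates - theta_fe) ** 2)          # Cochran's Q
    df = len(estimates) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                        # between-study variance
    w_re = 1.0 / (variances + tau2)                      # random-effects weights
    theta_re = np.sum(w_re * estimates) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return theta_re, (theta_re - 1.96 * se, theta_re + 1.96 * se)

# Illustrative c-statistics and their variances from four hypothetical validations.
c_stats = np.array([0.64, 0.66, 0.63, 0.70])
variances = np.array([0.0004, 0.0009, 0.0006, 0.0016])
pooled, ci = pool_dersimonian_laird(c_stats, variances)
```

With heterogeneity (tau2 > 0), the random-effects CI is wider than the fixed-effect one, reflecting between-study variation in discrimination.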
Background: Most risk assessment models for type 2 diabetes (T2DM) have been developed in Caucasians and Asians; little is known about their performance in other ethnic groups. Objective(s): We aimed to identify existing models for the risk of prevalent or undiagnosed T2DM and externally validate them in a multi-ethnic population currently living in the Netherlands. Methods: A literature search to identify risk assessment models for prevalent or undiagnosed T2DM was performed in PubMed until December 2017. We validated these models in 4,547 Dutch, 3,035 South Asian Surinamese, 4,119 African Surinamese, 2,326 Ghanaian, 3,598 Turkish, and 3,894 Moroccan origin participants from the HELIUS (Healthy Life in an Urban Setting) cohort study performed in Amsterdam. Model performance was assessed in terms of discrimination (C-statistic) and calibration (Hosmer-Lemeshow test). Results: We identified 25 studies containing 29 models for prevalent or undiagnosed T2DM. C-statistics varied between 0.77-0.92 in Dutch, 0.66-0.83 in South Asian Surinamese, 0.70-0.82 in African Surinamese, 0.61-0.81 in Ghanaian, 0.69-0.86 in Turkish, and 0.69-0.87 in Moroccan populations. The C-statistics were generally lower among the South Asian Surinamese, African Surinamese, and Ghanaian populations and highest among the Dutch. Calibration was poor (Hosmer-Lemeshow p < 0.05) for all models except one. Conclusions: Generally, risk models for prevalent or undiagnosed T2DM show moderate to good discriminatory ability in different ethnic populations living in the Netherlands, but poor calibration. Therefore, these models should be recalibrated before use in clinical practice and adapted to the population in which they are intended to be used.
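The two performance measures used in this validation, the C-statistic for discrimination and the Hosmer-Lemeshow test for calibration, can be computed as follows. This is a sketch on simulated predictions; the decile grouping and the chi-square reference with g - 2 degrees of freedom follow the usual convention, but implementations vary:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic over groups of predicted risk."""
    order = np.argsort(p)
    y, p = y[order], p[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y)), groups):
        obs, exp, n = y[idx].sum(), p[idx].sum(), len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))  # (O - E)^2 / (E(1 - E/n))
    return stat, chi2.sf(stat, groups - 2)

# Simulated, perfectly calibrated predictions: outcomes drawn from the risks.
rng = np.random.default_rng(1)
p_pred = rng.uniform(0.05, 0.9, size=1000)
y_obs = rng.binomial(1, p_pred)
auc = roc_auc_score(y_obs, p_pred)                 # discrimination (C-statistic)
hl_stat, hl_p = hosmer_lemeshow(y_obs, p_pred)     # calibration
```

Note that with large samples the Hosmer-Lemeshow test flags even trivial miscalibration, one reason calibration plots are often preferred.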
Over the past decades, the genome and proteome have been widely explored for biomarker discovery and personalized medicine. However, there is still a large need for improved diagnostics and stratification strategies for a wide range of diseases. Post-translational modification of proteins by glycosylation affects protein structure and function, and glycosylation has been implicated in many prevalent human diseases. Numerous proteins for which the plasma levels are nowadays evaluated in clinical practice are glycoproteins. While the glycosylation of these proteins often changes with disease, their glycosylation status is largely ignored in the clinical setting. Hence, the implementation of glycomic markers in the clinic is still in its infancy. This is for a large part caused by the high complexity of protein glycosylation itself and of the analytical techniques required for its robust quantification. Mass spectrometry-based workflows are particularly suitable for the quantification of glycans and glycoproteins, but still require advances for their transformation from a biomedical research setting to a clinical laboratory. In this review, we describe why and how glycomics is expected to find its role in clinical tests, and the status of current mass spectrometry-based methods for clinical glycomics.
Background: The T1 Mapping and Extracellular Volume (ECV) Standardization (T1MES) program explored T1 mapping quality assurance using a purpose-developed phantom with Food and Drug Administration (FDA) and Conformité Européenne (CE) regulatory clearance. We report T1 measurement repeatability across centers, describing sequence, magnet, and vendor performance. Methods: Phantoms batch-manufactured in August 2015 underwent 2 years of structural imaging, B0 and B1 assessment, and "reference" slow T1 testing. Temperature dependency was evaluated by the United States National Institute of Standards and Technology and by the German Physikalisch-Technische Bundesanstalt. Center-specific T1 mapping repeatability (maximum one scan per week to minimum one per quarter year) was assessed over a mean of 358 (maximum 1161) days on 34 1.5 T and 22 3 T magnets using multiple T1 mapping sequences. Image and temperature data were analyzed semi-automatically. Repeatability of serial T1 was evaluated in terms of the coefficient of variation (CoV), and linear mixed models were constructed to study the interplay of some of the known sources of T1 variation. Results: Over 2 years, phantom gel integrity remained intact (no rips/tears), B0 and B1 remained homogeneous, and "reference" T1 remained stable compared to baseline (% change at 1.5 T, 1.95 +/- 1.39%; 3 T, 2.22 +/- 1.44%). Per degree Celsius at 1.5 T, T1 (MOLLI 5s(3s)3s) increased by 11.4 ms in long native blood tubes and decreased by 1.2 ms in short post-contrast myocardium tubes. Agreement of estimated T1 times with "reference" T1 was similar across Siemens and Philips CMR systems at both field strengths (adjusted R2 ranges for both field strengths, 0.99-1.00). Over 1 year, many 1.5 T and 3 T sequences/magnets were repeatable, with mean CoVs below 1% and 2%, respectively. Repeatability was narrower at 1.5 T than at 3 T. Within T1MES, repeatability for native T1 was narrow for several sequences: for example, at 1.5 T, Siemens MOLLI 5s(3s)3s prototype number 448B (mean CoV = 0.27%) and Philips modified Look-Locker inversion recovery (MOLLI) 3s(3s)5s (CoV 0.54%), and at 3 T, Philips MOLLI 3b(3s)5b (CoV 0.33%) and Siemens shortened MOLLI (ShMOLLI) prototype 780C (CoV 0.69%). After adjusting for temperature and field strength, the T1 mapping sequence and scanner software version (both P < 0.001 at 1.5 T and 3 T), and to a lesser extent the scanner model (P = 0.011, 1.5 T only), had the greatest influence on T1 across multiple centers. Conclusion: The T1MES CE/FDA-approved phantom is a robust quality assurance device. In a multi-center setting, T1 mapping showed performance differences between field strengths, sequences, scanner software versions, and manufacturers. However, several specific combinations of field strength, sequence, and scanner are highly repeatable and thus have the potential to provide standardized assessment of T1 times for clinical use, although temperature correction is required for native T1 tubes at least.
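Repeatability as reported here is the coefficient of variation of serial measurements. A minimal sketch with hypothetical weekly phantom T1 readings (the values are invented for illustration):

```python
import numpy as np

def coefficient_of_variation(measurements):
    """CoV (%) = sample standard deviation / mean * 100."""
    m = np.asarray(measurements, dtype=float)
    return 100.0 * m.std(ddof=1) / m.mean()

# Hypothetical serial native-T1 phantom readings (ms) from a single scanner.
t1_series = [1052.1, 1049.8, 1053.4, 1050.6, 1051.9]
cov = coefficient_of_variation(t1_series)
```

A series this tight gives a CoV well below the 1% threshold quoted for repeatable 1.5 T sequences.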
Luijken, K.; Wynants, L.; Smeden, M. van; Calster, B. van; Steyerberg, E.W.; Groenwold, R.H.H. 2020
Objectives: The aim of this study was to quantify the impact of predictor measurement heterogeneity on prediction model performance. Predictor measurement heterogeneity refers to variation in the measurement of predictor(s) between the derivation of a prediction model and its validation or application. It arises, for instance, when predictors are measured using different measurement instruments or protocols. Study Design and Setting: We examined the effects of various scenarios of predictor measurement heterogeneity in real-world clinical examples, using previously developed prediction models for diagnosis of ovarian cancer, mutation carriers for Lynch syndrome, and intrauterine pregnancy. Results: Changing the measurement procedure of a predictor influenced the performance at validation of the prediction models in nine clinical examples. Notably, it induced model miscalibration. The calibration intercept at validation ranged from -0.70 to 1.43 (0 for good calibration), whereas the calibration slope ranged from 0.50 to 1.67 (1 for good calibration). The difference in C-statistic and scaled Brier score between derivation and validation ranged from -0.08 to +0.08 and from -0.40 to +0.16, respectively. Conclusion: This study illustrates that predictor measurement heterogeneity can substantially influence the performance of a prediction model, underlining that predictor measurements used in research settings should resemble those in clinical practice. Specification of measurement heterogeneity can help researchers explain discrepancies in predictive performance between derivation and validation settings.
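The calibration intercept and slope reported above are typically estimated by regressing observed outcomes on the logit of the predicted risks. A sketch on simulated, well-calibrated predictions; because sklearn has no offset term, the intercept is fitted with a small Newton iteration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_intercept_slope(y, p):
    """Calibration slope: logistic regression of outcomes on logit(predicted risk).
    Calibration intercept: the same regression with the slope fixed at 1 (offset model)."""
    lp = np.log(p / (1 - p))
    slope = LogisticRegression().fit(lp.reshape(-1, 1), y).coef_[0, 0]
    # One-parameter offset model fitted by Newton-Raphson: logit(mu) = lp + a.
    a = 0.0
    for _ in range(25):
        mu = 1 / (1 + np.exp(-(lp + a)))
        a += np.sum(y - mu) / np.sum(mu * (1 - mu))
    return a, slope

# Simulated well-calibrated predictions: intercept near 0, slope near 1.
rng = np.random.default_rng(3)
p_hat = rng.uniform(0.1, 0.9, size=800)
y_obs = rng.binomial(1, p_hat)
intercept_hat, slope_hat = calibration_intercept_slope(y_obs, p_hat)
```

Measurement heterogeneity at validation would shift these estimates away from 0 and 1, as the ranges in the abstract illustrate.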
Calster, B. van; McLernon, D.J.; Smeden, M. van; Wynants, L.; Steyerberg, E.W.; STRATOS Initiative 2019
Background: The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention. Main text: Herein, we argue that this needs to change immediately, because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing the balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice. Conclusion: Efforts are required to avoid poor calibration when developing prediction models, to evaluate calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.
Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Calster, B. van 2019
Aims: There is debate about the optimum algorithm for cardiovascular disease (CVD) risk estimation. We conducted head-to-head comparisons of four algorithms recommended by primary prevention guidelines, before and after 'recalibration', a method that adapts risk algorithms to take account of differences in the risk characteristics of the populations being studied. Methods and results: Using individual-participant data on 360 737 participants without CVD at baseline in 86 prospective studies from 22 countries, we compared the Framingham risk score (FRS), Systematic COronary Risk Evaluation (SCORE), pooled cohort equations (PCE), and Reynolds risk score (RRS). We calculated measures of risk discrimination and calibration, and modelled the clinical implications of initiating statin therapy in people judged to be at 'high' 10-year CVD risk. Original risk algorithms were recalibrated using the risk factor profile and CVD incidence of target populations. The four algorithms had similar risk discrimination. Before recalibration, FRS, SCORE, and PCE over-predicted CVD risk on average by 10%, 52%, and 41%, respectively, whereas RRS under-predicted by 10%. Original versions of the algorithms classified 29-39% of individuals aged >= 40 years as high risk. By contrast, recalibration reduced this proportion to 22-24% for every algorithm. We estimated that to prevent one CVD event, it would be necessary to initiate statin therapy in 44-51 such individuals using the original algorithms, in contrast to 37-39 individuals with the recalibrated algorithms. Conclusion: Before recalibration, the clinical performance of four widely used CVD risk algorithms varied substantially. By contrast, simple recalibration nearly equalized their performance and improved modelled targeting of preventive action to clinical need.
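In its simplest form ("recalibration-in-the-large"), recalibrating a risk algorithm to a target population shifts the model's baseline so that the mean predicted risk matches the observed incidence. A hypothetical sketch, with made-up linear predictors and target incidence, using bisection to find the shift:

```python
import numpy as np

def recalibrate_in_the_large(lp, target_incidence, tol=1e-8):
    """Find the constant shift of the linear predictor that makes the mean
    predicted risk equal the target population's observed incidence (bisection;
    mean risk is monotone increasing in the shift)."""
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_risk = np.mean(1 / (1 + np.exp(-(lp + mid))))
        if mean_risk < target_incidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical linear predictors in a target population with 10% CVD incidence.
rng = np.random.default_rng(2)
lp = rng.normal(-2.0, 1.0, size=5000)
delta = recalibrate_in_the_large(lp, target_incidence=0.10)
```

Full recalibration, as in the comparison above, would additionally use the target population's risk factor profile; this sketch only adjusts the baseline.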
Besselaar, A.M.H.P. van den; Witteveen, E.; Meer, F.J.M. van der 2013
Fluorescence bias in signals from individual SNP arrays can be calibrated using linear models. Given the data, the system of equations is very large, so a specialized symbolic algorithm was developed. These models are also used to illustrate that genomic waves do not exist, but are merely an artifact of commonly used methods. Furthermore, a new semi-parametric, single-array approach to SNP genotyping is introduced and shown to be both effective and efficient. A refined algorithm for copy number estimation, using a zero-exponent norm, is proposed and performs well, as illustrated by thorough comparisons with other methods. Indications that the signal calibration can improve (genotyping) results from lower-quality samples are also discussed. A software suite that implements the above is described and illustrated.
Background: Non-equivalence in serum creatinine (SCr) measurements across Dutch laboratories and the consequences hereof for chronic kidney disease (CKD) staging were examined. Methods: National data from the Dutch annual external quality organization of 2009 were used. 144 participating laboratories examined 11 pairs of commutable, value-assigned SCr specimens in the range 52–262 μmol/L, using Jaffe or enzymatic techniques. Regression equations were created for each participating laboratory (by regressing the values measured by participating laboratories on the target values of the samples sent by the external quality organization); areas under the curve were examined and used to rank laboratories. The 10th and 90th percentile regression equations were selected for each technique separately. To evaluate the impact of the variability in SCr measurements and its eventual clinical consequences in a real patient population, we used a cohort of 82,424 patients aged 19–106 years. The SCr measurements of these patients were entered into the 10th and 90th percentile regression equations. The newly calculated SCr values were used to calculate an estimated glomerular filtration rate (eGFR) using the 4-variable Isotope Dilution Mass Spectrometry-traceable Modification of Diet in Renal Disease (MDRD) formula. Differences in CKD staging were examined, comparing the stratification outcomes for Jaffe and enzymatic SCr techniques. Results: Jaffe techniques overestimated SCr by 21%, 12%, and 10% for SCr target values of 52, 73, and 94 μmol/L, respectively. For enzymatic assays these values were 0%, -1%, and -2%, respectively. eGFR using the MDRD formula and SCr measured by Jaffe techniques staged patients in a lower CKD category. Downgrading to a lower CKD stage occurred in 1–42%, 2–37%, and 12–78.9% of patients for the 10th and 90th percentile laboratories, respectively, in the CKD categories 45–60, 60–90, and >90 mL/min/1.73 m2. Using enzymatic techniques, downgrading occurred in only 2–4% of patients. Conclusions: Enzymatic techniques lead to less variability in SCr measurements than Jaffe techniques, and therefore result in more accurate staging of CKD. Enzymatic techniques are therefore preferred in clinical practice in order to generate more reliable GFR estimates.
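The 4-variable IDMS-traceable MDRD equation used for the eGFR calculations above is a published formula. The sketch below applies it with creatinine in μmol/L (converted to mg/dL by the factor 88.4) and illustrates how a Jaffe-style ~21% overestimation of a low creatinine value lowers the estimated GFR; the patient characteristics are invented:

```python
def egfr_mdrd(scr_umol_l, age, female, black=False):
    """4-variable IDMS-traceable MDRD eGFR (mL/min/1.73 m^2)."""
    scr_mg_dl = scr_umol_l / 88.4                      # umol/L -> mg/dL
    egfr = 175.0 * scr_mg_dl ** -1.154 * age ** -0.203
    if female:
        egfr *= 0.742                                  # female correction factor
    if black:
        egfr *= 1.212                                  # race correction factor
    return egfr

# A ~21% Jaffe overestimation at the low target value of 52 umol/L (see Results)
# shifts the eGFR downward for this hypothetical 60-year-old male patient.
enzymatic_like = egfr_mdrd(52, age=60, female=False)
jaffe_like = egfr_mdrd(52 * 1.21, age=60, female=False)
```

Because eGFR depends on creatinine through a negative power, overestimated SCr translates directly into underestimated GFR and hence downgraded CKD stages.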
Besselaar, A.M.H.P. van den; Zanten, A.P. van; Brantjes, H.M.; Elisen, M.G.L.M.; Meer, F.J.M. van der; Poland, D.C.W.; ... ; Castel, A. 2012
In the last decade, the deployment of simulation systems based on the finite element method (FEM) became a standard in the automotive industry for the evaluation of sheet metal forming processes. The technical and economic benefits of FEM-based simulation strongly depend on the accuracy of the computed prediction. The predictive capability of FEM-based simulations is mainly determined by the chosen physical theory and its numerical solution. This thesis focuses on the application of optimization algorithms for the identification and validation of material models, which belong to the group of constitutive laws. The objective of this thesis is threefold: firstly, the identification of potentials regarding the material models in order to maximize the benefit of FEM forming simulation; secondly, the development of an identification and validation procedure for material models; and finally, an investigation of the effect of deviations between the measured data and the true values of the calibration experiments on the predictive capability of material models.
The study of cosmic large-scale structure formation benefits from radio observations, because they provide an unbiased view of the early Universe. Distant radio galaxies and diffuse cluster sources generally have a steep spectrum, which implies increased brightness towards lower frequencies (below 300 MHz). The quality of low-frequency radio observations is compromised by propagation effects on cosmic radio waves passing through the ionosphere. In this thesis, we present a calibration method for low-frequency radio interferometric observations. This method significantly improves the quality of radio maps from archival observations compared to other existing calibration methods. The method was used to produce one of the deepest high-resolution surveys at 153 MHz to date, including the detection of 16 candidate distant radio galaxies. Furthermore, the method was used in a study of the diffuse radio sources in the merging galaxy cluster Abell 2256. These observations support the theory of the revival of old radio sources through cluster merger shock compression. Finally, we present a study of the cosmic large-scale structure near a radio galaxy in the early Universe, using an optical selection technique for galaxies. The projected galaxy distribution appears to trace the cosmic structure during the assembly of galaxy clusters.