The generalizability of predictive algorithms is of key relevance to application in clinical practice. We provide an overview of three types of generalizability, based on existing literature: temporal, geographical, and domain generalizability. These generalizability types are linked to their associated goals, methodology, and stakeholders.
Calster, B. van; Steyerberg, E.W.; Wynants, L.; Smeden, M. van 2023
Background: Clinical prediction models should be validated before implementation in clinical practice. But is favorable performance at internal validation or at a single external validation sufficient to claim that a prediction model works well in the intended clinical context? Main body: We argue to the contrary, because (1) patient populations vary, (2) measurement procedures vary, and (3) populations and measurements change over time. Hence, we have to expect heterogeneity in model performance between locations and settings, and across time. It follows that prediction models are never truly validated. This does not imply that validation is unimportant. Rather, the current focus on developing new models should shift to a focus on more extensive, well-conducted, and well-reported validation studies of promising models. Conclusion: Principled validation strategies are needed to understand and quantify heterogeneity, monitor performance over time, and update prediction models when appropriate. Such strategies will help ensure that prediction models stay up to date and safe to support clinical decision-making.
McLernon, D.J.; Giardiello, D.; Calster, B. van; Wynants, L.; Geloven, N. van; Smeden, M. van; ... ; STRATOS Initiative 2022
Risk prediction models need thorough validation to assess their performance. Validation of models for survival outcomes poses challenges due to the censoring of observations and the varying time horizon at which predictions can be made. This article describes measures to evaluate predictions and the potential improvement in decision making from survival models based on Cox proportional hazards regression. As a motivating case study, the authors consider the prediction of the composite outcome of recurrence or death (the "event") in patients with breast cancer after surgery. They developed a simple Cox regression model with 3 predictors, as in the Nottingham Prognostic Index, in 2982 women (1275 events over 5 years of follow-up) and externally validated this model in 686 women (285 events over 5 years). Improvement in performance was assessed after the addition of progesterone receptor as a prognostic biomarker. The model predictions can be evaluated across the full range of observed follow-up times or for the event occurring by the end of a fixed time horizon of interest. The authors first discuss recommended statistical measures that evaluate model performance in terms of discrimination, calibration, or overall performance. Further, they evaluate the potential clinical utility of the model to support clinical decision making according to a net benefit measure. They provide SAS and R code to illustrate internal and external validation. The authors recommend the proposed set of performance measures for transparent reporting of the validity of predictions from survival models.
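A minimal R sketch of the fixed-horizon validation workflow this abstract describes, using the survival package. The data frames `dev` and `val` and predictors `x1`-`x3` are hypothetical stand-ins, not the authors' published code:

```r
## Hypothetical example: dev and val are data frames with columns time (years),
## status (1 = event, 0 = censored), and predictors x1, x2, x3.
library(survival)

# Develop a simple Cox model on the development cohort
fit <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = dev)

# Linear predictor (prognostic index) for the validation cohort
lp_val <- predict(fit, newdata = val, type = "lp")

# Discrimination: concordance in the validation cohort
# (reverse = TRUE because a higher linear predictor means higher risk)
concordance(Surv(time, status) ~ lp_val, data = val, reverse = TRUE)

# Calibration slope: refit the linear predictor at validation;
# a coefficient near 1 means predictions are neither too extreme nor too moderate
coef(coxph(Surv(time, status) ~ lp_val, data = val))
```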
Geloven, N. van; Giardiello, D.; Bonneville, E.F.; Teece, L.; Ramspek, C.L.; Smeden, M. van; ... ; STRATOS Initiative 2022
Background: While clinical prediction models (CPMs) are increasingly used to guide patient care, the performance and clinical utility of these CPMs in new patient cohorts are poorly understood. Methods: We performed 158 external validations of 104 unique CPMs across 3 domains of cardiovascular disease (primary prevention, acute coronary syndrome, and heart failure). Validations were performed in publicly available clinical trial cohorts, and model performance was assessed using measures of discrimination, calibration, and net benefit. To explore potential reasons for poor model performance, CPM-clinical trial cohort pairs were stratified based on relatedness, a domain-specific set of characteristics used to qualitatively grade the similarity of derivation and validation patient populations. We also examined the model-based C-statistic to assess whether changes in discrimination were due to differences in case-mix between the derivation and validation samples. The impact of model updating on model performance was also assessed. Results: Discrimination decreased significantly between model derivation (0.76 [interquartile range 0.73-0.78]) and validation (0.64 [interquartile range 0.60-0.67], P<0.001), but approximately half of this decrease was due to narrower case-mix in the validation samples. CPMs had better discrimination when tested in related compared with distantly related trial cohorts. The calibration slope was also significantly higher in related trial cohorts (0.77 [interquartile range 0.59-0.90]) than in distantly related cohorts (0.59 [interquartile range 0.43-0.73], P=0.001). When considering the full range of possible decision thresholds between half and twice the outcome incidence, 91% of models had a risk of harm (net benefit below the default strategy) at some threshold; this risk could be reduced substantially by updating the model intercept, updating the calibration slope, or complete re-estimation. Conclusions: There are significant decreases in model performance when applying cardiovascular disease CPMs to new patient populations, resulting in a substantial risk of harm. Model updating can mitigate these risks. Care should be taken when using CPMs to guide clinical decision-making.
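The intercept and slope updates mentioned in the conclusions amount to logistic recalibration. A minimal R sketch, assuming a hypothetical validation data frame `val` with observed outcomes `y` (0/1) and the original CPM's predicted risks `p`:

```r
## Hypothetical example: val has observed outcomes y (0/1) and the original
## model's predicted risks p.
lp <- qlogis(val$p)  # back-transform risks to the linear predictor scale

# Update 1: re-estimate the intercept only (offset fixes the slope at 1)
m_int <- glm(y ~ offset(lp), family = binomial, data = val)

# Update 2: re-estimate both intercept and calibration slope
m_slope <- glm(y ~ lp, family = binomial, data = val)

coef(m_int)    # corrected intercept (0 would mean no correction needed)
coef(m_slope)  # slope < 1 indicates predictions were too extreme
```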
Heinze, G.; Smeden, M. van; Wynants, L.; Steyerberg, E.; Calster, B. van 2022
Thorough validation is pivotal for any prediction model before it can be advocated for use in medical practice. For time-to-event outcomes such as breast cancer recurrence, death from other causes is a competing risk. Model performance measures must account for such competing events. In this article, we present a comprehensive yet accessible overview of performance measures for this competing event setting, including the calculation and interpretation of statistical measures for calibration, discrimination, overall prediction error, and clinical usefulness by decision curve analysis. All methods are illustrated for patients with breast cancer, with publicly available data and R code.
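In the competing-events setting, observed risks must come from an estimator that accounts for the competing event. A minimal R sketch using the Aalen-Johansen estimator from the survival package; the data frame `val` and the column `pred_risk_5y` are hypothetical, not the paper's published code:

```r
## Hypothetical example: val has time (years), status (0 = censored,
## 1 = recurrence, 2 = death from other causes), and a column pred_risk_5y
## holding the model's predicted 5-year recurrence risks.
library(survival)

val$status_f <- factor(val$status, levels = 0:2,
                       labels = c("censor", "recurrence", "other_death"))

# Aalen-Johansen estimator: observed cumulative incidence accounting for
# the competing event (1 - Kaplan-Meier would overestimate it)
aj <- survfit(Surv(time, status_f) ~ 1, data = val)
obs <- summary(aj, times = 5)$pstate[, match("recurrence", aj$states)]

# Mean calibration at the 5-year horizon: observed vs mean predicted risk
c(observed = obs, predicted = mean(val$pred_risk_5y))
```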
Background: There are many clinical prediction models (CPMs) available to inform treatment decisions for patients with cardiovascular disease. However, the extent to which they have been externally tested, and how well they generally perform, has not been broadly evaluated. Methods: A SCOPUS citation search was run on March 22, 2017 to identify external validations of cardiovascular CPMs in the Tufts Predictive Analytics and Comparative Effectiveness CPM Registry. We assessed the extent of external validation and performance heterogeneity across databases, and explored factors associated with model performance, including a global assessment of the clinical relatedness between the derivation and validation data. Results: We identified 2030 external validations of 1382 CPMs. Eight hundred seven (58%) of the CPMs in the Registry have never been externally validated. On average, there were 1.5 validations per CPM (range, 0-94). The median external validation area under the receiver operating characteristic curve was 0.73 (interquartile range [IQR], 0.66-0.79), representing a median percent change in discrimination of -11.1% (IQR, -32.4% to +2.7%) compared with performance on derivation data. Of validations reporting the area under the receiver operating characteristic curve, 81% (n=1333) showed discrimination below that reported in the derivation dataset. 53% (n=983) of the validations reported some measure of CPM calibration. For CPMs evaluated more than once, there was typically a large range of performance. Of 1702 validations classified by relatedness, the percent change in discrimination was -3.7% (IQR, -13.2% to +3.1%) for closely related validations (n=123), -9.0% (IQR, -27.6% to +3.9%) for related validations (n=862), and -17.2% (IQR, -42.3% to 0%) for distantly related validations (n=717; P<0.001). Conclusions: Many published cardiovascular CPMs have never been externally validated, and for those that have, apparent performance during development is often overly optimistic. A single external validation appears insufficient to broadly understand the performance heterogeneity across different settings.
Verbakel, J.Y.; Steyerberg, E.W.; Uno, H.; Cock, B. de; Wynants, L.; Collins, G.S.; Calster, B. van 2020
Objectives: Receiver operating characteristic (ROC) curves show how well a risk prediction model discriminates between patients with and without a condition. We aim to investigate how ROC curves are presented in the literature, and to discuss and illustrate their potential limitations. Study Design and Setting: We conducted a pragmatic literature review of contemporary publications that externally validated clinical prediction models. We illustrated the limitations of ROC curves using a testicular cancer case study and simulated data. Results: Of 86 identified prediction modeling studies, 52 (60%) presented ROC curves without thresholds and one (1%) presented an ROC curve with only a few thresholds. We illustrate that ROC curves in their standard form withhold threshold information, have an unstable shape even for the same area under the curve (AUC), and are problematic for comparing model performance conditional on threshold. We compare ROC curves with classification plots, which show sensitivity and specificity conditional on risk thresholds. Conclusion: ROC curves do not offer more information than the AUC to indicate discriminative ability. To assess the model's performance for decision-making, results should be provided conditional on risk thresholds. Therefore, if discriminatory ability must be visualized, classification plots are attractive.
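A minimal base-R sketch of a classification plot, assuming hypothetical vectors `y` (observed 0/1 outcomes) and `p` (predicted risks); unlike a standard ROC curve, the risk threshold is explicit on the x-axis:

```r
## Hypothetical example: y is the observed 0/1 outcome, p the predicted risk.
thresholds <- seq(0.01, 0.99, by = 0.01)
sens <- sapply(thresholds, function(t) mean(p[y == 1] >= t))
spec <- sapply(thresholds, function(t) mean(p[y == 0] < t))

plot(thresholds, sens, type = "l", lty = 1, ylim = c(0, 1),
     xlab = "Risk threshold", ylab = "Proportion")
lines(thresholds, spec, lty = 2)
legend("right", c("Sensitivity", "Specificity"), lty = 1:2)
```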
Objective: We aimed to explore the added value of common machine learning (ML) algorithms for prediction of outcome after moderate and severe traumatic brain injury. Study Design and Setting: We performed logistic regression (LR), lasso regression, and ridge regression with key baseline predictors in the IMPACT-II database (15 studies, n = 11,022). ML algorithms included support vector machines, random forests, gradient boosting machines, and artificial neural networks, and were trained using the same predictors. To assess the generalizability of predictions, we performed internal, internal-external, and external validation on the recent CENTER-TBI study (patients with Glasgow Coma Scale <13, n = 1,554). Both calibration (calibration slope/intercept) and discrimination (area under the curve) were quantified. Results: In the IMPACT-II database, 3,332/11,022 (30%) died and 5,233 (48%) had unfavorable outcome (Glasgow Outcome Scale less than 4). In the CENTER-TBI study, 348/1,554 (29%) died and 651 (54%) had unfavorable outcome. Discrimination and calibration varied widely between the studies, and less so between the studied algorithms. The mean area under the curve was 0.82 for mortality and 0.77 for unfavorable outcomes in the CENTER-TBI study. Conclusion: ML algorithms may not outperform traditional regression approaches in a low-dimensional setting for outcome prediction after moderate or severe traumatic brain injury. Similar to regression-based prediction models, ML algorithms should be rigorously validated to ensure applicability to new populations.
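A minimal R sketch contrasting logistic regression with one of the studied penalized alternatives (ridge, via glmnet) and checking the calibration slope at validation; `x_dev`, `y_dev`, `x_val`, and `y_val` are hypothetical data, not the IMPACT or CENTER-TBI datasets:

```r
## Hypothetical example: x_dev/x_val are numeric predictor matrices,
## y_dev/y_val the corresponding 0/1 outcomes.
library(glmnet)

lr <- glm(y_dev ~ ., data = data.frame(x_dev), family = binomial)
rr <- cv.glmnet(x_dev, y_dev, family = "binomial", alpha = 0)  # alpha = 0: ridge

p_lr <- predict(lr, newdata = data.frame(x_val), type = "response")
p_rr <- as.vector(predict(rr, newx = x_val, s = "lambda.min", type = "response"))

# Calibration slope at validation (1 is ideal; < 1 suggests overfitted predictions)
cal_slope <- function(p, y) unname(coef(glm(y ~ qlogis(p), family = binomial))[2])
c(lr = cal_slope(p_lr, y_val), ridge = cal_slope(p_rr, y_val))
```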
Calster, B. van; Smeden, M. van; Cock, B. de; Steyerberg, E.W. 2020
When developing risk prediction models on datasets with limited sample size, shrinkage methods are recommended. Earlier studies showed that shrinkage results in better predictive performance on average. This simulation study aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome. We compared standard maximum likelihood with the following shrinkage methods: uniform shrinkage (likelihood-based and bootstrap-based), penalized maximum likelihood (ridge) methods, LASSO logistic regression, adaptive LASSO, and Firth's correction. In the simulation study, we varied the number of predictors and their strength, the correlation between predictors, the event rate of the outcome, and the events per variable. In terms of results, we focused on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). The results can be summarized into three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. In contrast to other shrinkage approaches, Firth's correction had a small shrinkage effect but showed low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative, with Firth's correction as the exception. We conclude that, despite improved performance on average, shrinkage often worked poorly in individual datasets, in particular when it was most needed. The results imply that shrinkage methods do not solve problems associated with small sample size or low number of events per variable.
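A minimal simulation sketch in R of the paper's central point, that shrinkage can improve the calibration slope on average while increasing its between-sample variability; the settings (6 predictors of strength 0.3, development samples of n = 150, ridge as the shrinkage method) are illustrative choices, not the study's design:

```r
## Illustrative settings, not the study design.
library(glmnet)
set.seed(1)

k <- 6; beta <- rep(0.3, k); n_dev <- 150; n_val <- 10000
x_val <- matrix(rnorm(n_val * k), n_val)
y_val <- rbinom(n_val, 1, plogis(x_val %*% beta))

cal_slope <- function(p) unname(coef(glm(y_val ~ qlogis(p), family = binomial))[2])

slopes <- replicate(100, {
  x <- matrix(rnorm(n_dev * k), n_dev)
  y <- rbinom(n_dev, 1, plogis(x %*% beta))
  ml <- glm(y ~ x, family = binomial)                    # maximum likelihood
  rr <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # ridge shrinkage
  p_ml <- plogis(cbind(1, x_val) %*% coef(ml))
  p_rr <- as.vector(predict(rr, newx = x_val, s = "lambda.min", type = "response"))
  c(ml = cal_slope(p_ml), ridge = cal_slope(p_rr))
})

# Average slope improves with ridge, but note its between-sample variability
apply(slopes, 1, function(s) c(mean = mean(s), sd = sd(s)))
```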
The role of P-values for null hypothesis testing is under debate. We aim to explore the impact of the significance threshold on estimates for the strengths of associations ("effects") and the implications for different types of epidemiological research. We consider situations with a normal distribution of a true effect, while varying the effect size. We confirm the occurrence of "testimation bias": estimating effect size only if the test was statistically significant leads to exaggerated results. The absolute bias is largest for true effects around 0.7 times the size of the standard error: +220% bias if effects are selected after testing with P < .05, and +335% if tested with a stricter threshold. The bias is smaller with more lenient thresholds such as P < .20 (+130%) and with larger true effect sizes. We conclude that a lower P-value threshold for declaring statistical significance implies more exaggeration in an estimated effect. This implies that if a low threshold is used, effect size estimation should not be attempted, for example in the context of selecting promising discoveries that need further validation. Confirmatory studies, such as randomized controlled trials, might stick to the 0.05 threshold if adequately powered, while prediction modelling studies should use an even higher threshold, such as 0.2, to avoid strongly biased effect estimates.
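Testimation bias is easy to reproduce by simulation. A minimal R sketch for a true effect of 0.7 standard errors, which should show exaggeration close to the figures quoted above:

```r
## Illustrative numbers: true effect 0.7 standard errors, SE = 1.
set.seed(1)
true_effect <- 0.7
est <- rnorm(1e6, mean = true_effect, sd = 1)   # sampled effect estimates
p <- 2 * pnorm(-abs(est))                       # two-sided P-value vs. null of 0

# Relative exaggeration when estimating only after a significant test
mean(est[p < 0.05]) / true_effect  # approx. 3.2, i.e. about +220% bias
mean(est[p < 0.20]) / true_effect  # approx. 2.3, i.e. about +130% bias
```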
Wynants, L.; Calster, B. van; Bonten, M.M.J.; Collins, G.S.; Debray, T.P.A.; Vos, M. de; ... ; Smeden, M. van 2020
OBJECTIVE: To review and critically appraise published and preprint reports of prediction models for diagnosing coronavirus disease 2019 (covid-19) in patients with suspected infection, for prognosis of patients with covid-19, and for detecting people in the general population at risk of being admitted to hospital for covid-19 pneumonia. DESIGN: Rapid systematic review and critical appraisal. DATA SOURCES: PubMed and Embase through Ovid, Arxiv, medRxiv, and bioRxiv up to 24 March 2020. STUDY SELECTION: Studies that developed or validated a multivariable covid-19 related prediction model. DATA EXTRACTION: At least two authors independently extracted data using the CHARMS (critical appraisal and data extraction for systematic reviews of prediction modelling studies) checklist; risk of bias was assessed using PROBAST (prediction model risk of bias assessment tool). RESULTS: 2696 titles were screened, and 27 studies describing 31 prediction models were included. Three models were identified for predicting hospital admission from pneumonia and other events (as proxy outcomes for covid-19 pneumonia) in the general population; 18 diagnostic models for detecting covid-19 infection (13 were machine learning models based on computed tomography scans); and 10 prognostic models for predicting mortality risk, progression to severe disease, or length of hospital stay. Only one study used patient data from outside of China. The most reported predictors of the presence of covid-19 in patients with suspected disease included age, body temperature, and signs and symptoms. The most reported predictors of severe prognosis in patients with covid-19 included age, sex, features derived from computed tomography scans, C reactive protein, lactic dehydrogenase, and lymphocyte count. C index estimates ranged from 0.73 to 0.81 in prediction models for the general population (reported for all three models), from 0.81 to more than 0.99 in diagnostic models (reported for 13 of the 18 models), and from 0.85 to 0.98 in prognostic models (reported for six of the 10 models). All studies were rated at high risk of bias, mostly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, and high risk of model overfitting. Reporting quality varied substantially between studies. Most reports did not include a description of the study population or the intended use of the models, and calibration of predictions was rarely assessed. CONCLUSION: Prediction models for covid-19 are quickly entering the academic literature to support medical decision making at a time when they are urgently needed. This review indicates that proposed models are poorly reported, at high risk of bias, and their reported performance is probably optimistic. Immediate sharing of well documented individual participant data from covid-19 studies is needed for collaborative efforts to develop more rigorous prediction models and validate existing ones. The predictors identified in included studies could be considered as candidate predictors for new models. Methodological guidance should be followed because unreliable predictions could cause more harm than benefit in guiding clinical decisions. Finally, studies should adhere to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) reporting guideline.
Luijken, K.; Wynants, L.; Smeden, M. van; Calster, B. van; Steyerberg, E.W.; Groenwold, R.H.H. 2020
Objectives: The aim of this study was to quantify the impact of predictor measurement heterogeneity on prediction model performance. Predictor measurement heterogeneity refers to variation in the measurement of predictor(s) between the derivation of a prediction model and its validation or application. It arises, for instance, when predictors are measured using different measurement instruments or protocols. Study Design and Setting: We examined the effects of various scenarios of predictor measurement heterogeneity in real-world clinical examples, using previously developed prediction models for diagnosis of ovarian cancer, mutation carriers for Lynch syndrome, and intrauterine pregnancy. Results: Changing the measurement procedure of a predictor influenced the performance at validation of the prediction models in nine clinical examples. Notably, it induced model miscalibration. The calibration intercept at validation ranged from -0.70 to 1.43 (0 for good calibration), whereas the calibration slope ranged from 0.50 to 1.67 (1 for good calibration). The difference in C-statistic and scaled Brier score between derivation and validation ranged from -0.08 to +0.08 and from -0.40 to +0.16, respectively. Conclusion: This study illustrates that predictor measurement heterogeneity can substantially influence the performance of a prediction model, underlining that predictor measurements used in research settings should resemble those used in clinical practice. Specifying measurement heterogeneity can help researchers explain discrepancies in predictive performance between the derivation and validation settings.
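A minimal R simulation sketch of the mechanism studied here: a model derived on a precisely measured predictor but applied where the same predictor is measured with extra noise, which induces miscalibration (all numbers are illustrative, not from the paper):

```r
## Illustrative numbers: one predictor, noisier measurement at validation.
set.seed(1)
n <- 5000
x_true <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x_true))

# Derivation: predictor measured precisely
fit <- glm(y ~ x_true, family = binomial)

# Validation/application: the same predictor measured with extra noise
x_noisy <- x_true + rnorm(n, sd = 1)
p <- predict(fit, newdata = data.frame(x_true = x_noisy), type = "response")

# Calibration slope well below 1: the predictions are now too extreme
coef(glm(y ~ qlogis(p), family = binomial))
```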
Background: The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention. Main text: Herein, we argue that this needs to change immediately, because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing the balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice. Conclusion: Efforts are required to avoid poor calibration when developing prediction models, to evaluate calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.
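A minimal R sketch of the calibration assessment described here, assuming hypothetical validation-set vectors `y` (observed 0/1 outcomes) and `p` (predicted risks):

```r
## Hypothetical example: y is the observed 0/1 outcome, p the predicted risk.
lp <- qlogis(p)
coef(glm(y ~ offset(lp), family = binomial))  # calibration intercept (ideal: 0)
coef(glm(y ~ lp, family = binomial))          # calibration slope (ideal: 1)

# Smoothed (loess) calibration curve; stable curves need fairly large samples
sm <- loess(y ~ p)
ord <- order(p)
plot(p[ord], predict(sm)[ord], type = "l",
     xlab = "Predicted risk", ylab = "Observed proportion (smoothed)")
abline(0, 1, lty = 2)  # diagonal = perfect calibration
```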
Wynants, L.; Smeden, M. van; McLernon, D.J.; Timmerman, D.; Steyerberg, E.W.; Calster, B. van; Topic Group 'Evaluating Diagnostic Tests and Prediction Models' of the STRATOS Initiative 2019