Prediction models can be a valuable tool in performing risk assessment of mortality in preterm infants.
To summarize prognostic models for predicting mortality in very preterm infants and to assess their quality.
Medline was searched for all articles (up to June 2020).
All developed or externally validated prognostic models for mortality prediction in liveborn infants born <32 weeks’ gestation and/or <1500 g birth weight were included.
Data were extracted by 2 independent authors. Risk of bias (ROB) and applicability assessments were performed by 2 independent authors using the Prediction model Risk of Bias Assessment Tool (PROBAST).
One hundred forty-four models from 36 studies reporting on model development and 118 models from 34 studies reporting on external validation were included. ROB assessment revealed high ROB in the majority of the models, most often because of inadequate (reporting of) analysis. Internal and external validation was lacking for 42% and 94% of these models, respectively. Meta-analyses revealed an average C-statistic of 0.88 (95% confidence interval [CI]: 0.83–0.91) for the Clinical Risk Index for Babies (CRIB) score, 0.87 (95% CI: 0.81–0.92) for the CRIB-II score, 0.86 (95% CI: 0.78–0.92) for the Score for Neonatal Acute Physiology Perinatal Extension II score, and 0.71 (95% CI: 0.61–0.79) for the NICHD model.
Occasionally, an external validation study was included but not the corresponding development study, because models developed in the presurfactant era or in a general NICU population were excluded.
Instead of developing additional mortality prediction models for preterm infants, the emphasis should be shifted toward external validation and subsequent adaptation of the existing prediction models.
Infants born very preterm (<32 completed weeks’ gestation) or with a very low birth weight (<1500 g) are at increased risk of mortality and neonatal morbidity and, as such, are considered a major challenge in perinatal health care.1 Very preterm birth occurs in 1.3% of all live births in developed regions.2 Despite this low prevalence, complications associated with preterm birth are responsible for 35% of the world’s annual neonatal deaths.3 Accurate risk assessment of postnatal death in very preterm infants can help caregivers and parents decide whether and when to intervene in a pregnancy or to adjust postnatal care.4 Prediction models can be a helpful tool in performing such risk assessment.5,6
In a 2011 systematic review, Medlock et al7 reported on the availability of >40 prediction models to assess the risk of neonatal mortality in infants born very preterm. First, when that review was conducted, no standardized tool for risk of bias (ROB) assessment of prediction models was yet available. The recently published Prediction model Risk of Bias Assessment Tool (PROBAST)8,9 provides the opportunity to formally assess the ROB of newly published models as well as of the models that were identified by Medlock et al.7 Second, the review has not been updated since its publication in 2011, whereas many development and validation studies of prognostic models for the prediction of mortality in preterm infants have been published since. Third, Medlock et al7 excluded external validation studies, thereby limiting the information on the external validity of identified models and the possibility of performing any quantitative analyses.
Therefore, the aim with this study was to update the systematic review of Medlock et al7 on prognostic models for predicting postnatal mortality in liveborn very preterm infants (PICOTS [population, index model, comparator model, outcome, timing, setting] framework presented in Supplemental Table 5) and to extend it by the addition of a ROB assessment using PROBAST, inclusion of studies externally validating existing models, and meta-analysis of model performance measures of the models most often validated.
Methods
Search Strategy
To update the systematic review by Medlock et al7 from 2011, the same search strategy was used. Medline was searched for all articles from May 2010 (the last search date of Medlock et al7) up to June 2020 by using a search that followed the general form: “prediction model AND preterm AND infant AND mortality.” The detailed search strategy is shown in Supplemental Table 6. The current review is reported according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses statement.10
Eligibility Criteria
Inclusion and exclusion criteria were similar to the criteria used in Medlock et al.7 The criteria following the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) checklist11 are given in Table 1. In brief, all prognostic models that aim to predict mortality at any time point in infants born at <32 weeks and/or <1500 g birth weight were included. Studies were classified as pre- or postsurfactant on the basis of the authors’ report of surfactant use, after which presurfactant studies were excluded. In studies in which surfactant use was not reported, surfactant was assumed to have been in routine use after 1990.
Criteria to Guide the Literature Search Following the CHARMS Checklist
Item | Criteria |
---|---|
Prognostic versus diagnostic prediction model | Prognostic prediction models |
Intended scope of the review | Purpose of the included models is to predict the probability of mortality, rather than to investigate a single specific risk factor. |
Type of prediction modeling studies | Prediction model development studies with or without external validation and external model validation studies, in which researchers report at least 1 measure of model performance on preterm infants |
Target population to whom the prediction model applies | Population of liveborn infants or admitted infants born at <32 wk gestational age and/or <1500 g birth wt; inclusion of prognostic models for gestational age– or birth wt–specific population; inclusion of studies in which researchers used a slightly broader definition of VLGA/BW or ELGA/BW; exclusion of models derived for a subpopulation with a specific disease or condition (eg, NEC); exclusion of models for general NICU population, unless separately reported performance for VLGA/BW infants; exclusion of studies from the presurfactant era |
Outcome to be predicted | Outcome that the model predicts is mortality or survival. Studies in which authors report on models developed for combined outcome measures (eg, morbidity and mortality) are excluded. |
Time span of prediction | Mortality at any time point |
Intended moment of using the model | Models using liveborn infants as well as admitted infants will be included, so intended moment of using the model depends on this population. |
NEC, necrotizing enterocolitis.
Titles and abstracts were independently screened in duplicate by 2 of 4 authors (P.E.v.B., P.A., W.O., and E.S.) and included if considered relevant. Before full text screening, all studies included in Medlock et al7 were added. In addition, Medlock was contacted for external validation studies that were excluded from their 2011 review, and these were added as well. Subsequently, full texts of the selected articles were screened in duplicate for final inclusion by 2 authors (P.E.v.B., P.A., W.O., and E.S.). Likewise, data extraction and ROB assessment were conducted in duplicate (P.E.v.B., P.A., W.O., and E.S.). In case of discrepancies, a third reviewer was involved to establish consensus.
Data Extraction and Critical Appraisal
Eligible articles were categorized into 2 groups: development studies and external validation studies, with separate data extraction forms for each group. Relevant items were extracted from each selected article by using the domains described in the CHARMS checklist, which included information on population, candidate predictors (only for development studies), outcome to be predicted, model development (only for development studies), and model performance.11 If an article described the development or external validation of multiple (existing) models, data were extracted separately for each model. Additionally, ROB and applicability assessments were performed by using PROBAST.8,9 PROBAST is organized into 4 domains (participants, predictors, outcome, and analysis) and contains a total of 20 signaling questions to facilitate structured judgment of ROB. Signaling questions are answered as “yes,” “probably yes,” “no,” “probably no,” or “no information.” A domain in which all signaling questions are answered as “yes” or “probably yes” should be judged as low ROB, whereas a “no” or “probably no” on 1 or more questions in a domain flags the potential for bias. Insufficient information on 1 or more questions might result in unclear ROB as well as in low or high ROB, depending on the judgment of the reviewers. Applicability of the study to the review question is assessed for 3 domains (participants, predictors, and outcomes) and is rated as low, high, or unclear, with low concern regarding applicability if the review question and the study are a good match. To achieve consistent data extraction and ROB assessment, the standardized data extraction forms were piloted, modified, and finalized after discussion with all authors. The full list of final extracted items is available on request.
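The PROBAST domain-level judgment rule described above can be sketched in code. This is an illustrative sketch only (PROBAST is a published checklist, not software), with the answer codes Y/PY/N/PN/NI as an assumed shorthand:

```python
# Illustrative sketch of the PROBAST domain judgment rule: all signaling
# questions "yes"/"probably yes" -> low ROB; any "no"/"probably no" ->
# potential for bias (high ROB); otherwise insufficient information, where
# the final judgment depends on the reviewers.

def judge_domain(answers):
    """answers: iterable of 'Y', 'PY', 'N', 'PN', or 'NI' per signaling question."""
    if any(a in ("N", "PN") for a in answers):
        return "high"
    if all(a in ("Y", "PY") for a in answers):
        return "low"
    return "unclear"  # 'NI' present: reviewers may still judge low or high
```

In practice, a domain answered Y, PY, Y would be judged low ROB, while a single PN answer flags the domain as high ROB.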
Differences Between the Protocol and Review
Details of the protocol for this systematic review were registered on PROSPERO (ID CRD42019141434).12 In the protocol, mortality 1 year after birth was registered as the maximum time span of prediction, but in the final review, no such maximum was used: all articles on mortality were included regardless of the time point at which mortality was predicted, giving a comprehensive overview of all available models. Furthermore, in the protocol, it was stated that the aim of the article was to give a narrative overview, but a quantitative analysis was added in the final review because, during the selection process, it became apparent that certain models had been externally validated frequently enough to allow quantitative (meta-)analysis of model performance.
Statistical Analysis
Results of development and external validation studies were summarized by using descriptive statistics. Prognostic models that were externally validated in at least 5 studies were analyzed quantitatively by using random effects meta-analyses. If researchers of a study performed multiple external validations of 1 model, the validation with characteristics most similar to the development study was used for meta-analysis. Furthermore, meta-analysis was performed in each subgroup separately, on the basis of whether the study population was extremely low gestational age or birth weight (ELGA/BW) (defined as a gestational age <28 weeks or birth weight <1000 g) or very low gestational age or birth weight (VLGA/BW) (applicable to all infants that were not ELGA/BW). Subgroup analysis was performed if at least 5 studies were included in a subgroup. If no C-statistic was reported despite presentation of the receiver operating characteristic curve, WebPlotDigitizer was used to reconstruct the curve and to calculate the area under the curve (ie, the C-statistic). 
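The area-under-the-curve reconstruction from digitized receiver operating characteristic points amounts to the trapezoidal rule. The sketch below is illustrative only (the review used WebPlotDigitizer, not this code), with hypothetical digitized points:

```python
# Illustrative sketch: computing the area under a reconstructed ROC curve
# with the trapezoidal rule, as one would do after digitizing (FPR, TPR)
# points from a published figure.

def auc_trapezoid(points):
    """points: list of (fpr, tpr) tuples digitized from the ROC curve."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})  # anchor the curve ends
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical digitized points from a published ROC figure:
roc_points = [(0.1, 0.55), (0.25, 0.78), (0.5, 0.92)]
c_statistic = auc_trapezoid(roc_points)  # ≈ 0.82
```

With sparse digitized points, the trapezoidal estimate only approximates the true C-statistic, so reconstruction accuracy depends on how many points can be read from the figure.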
Logit transformation for the C-statistics was used during meta-analyses to overcome the poor statistical properties of the normal distribution when the C-statistic was close to 0 or 1 or when sample sizes were relatively small.13 Between-study heterogeneity was quantified by using the I2 statistic.14,15 A rough guide to interpretation of the I2 statistic is as follows: 0% to 40% might not be important; 30% to 60% may represent moderate heterogeneity; 50% to 90% may represent substantial heterogeneity; 75% to 100% is considerable heterogeneity.16 Furthermore, 95% confidence intervals (CIs) were calculated to indicate the precision of the summary performance estimate, and 95% prediction intervals (PIs) were calculated to provide boundaries on the likely performance in future model validation studies that are comparable to the studies included in the meta-analysis and thus can be seen as an indication of model generalizability.17 In addition, we calculated the probability that the C-statistic of the validated models will be larger than 0.70 and 0.80 in future validation studies. All analyses were performed in R version 3.5.2.
Results
Results of the Search
The initial search yielded 2159 unique articles, as shown in the flowchart in Fig 1. After title and abstract screening, 62 articles were provisionally selected for full text screening. All 41 articles identified by Medlock et al plus an additional 18 articles reporting on the external validation of existing models were added for full text screening. Of those 121 articles, 60 articles, including 29 articles from Medlock et al, met the inclusion criteria and were selected for data extraction. The remaining 30 articles from Medlock et al were excluded because the article did not concern an individual prediction model (n = 7), the study was performed in the presurfactant era (n = 9), the population was not applicable to our research question (n = 8), the outcome was not applicable to our research question (n = 2), no full text was available (n = 3), or the article was written in a language other than English (n = 1). From the 36 studies reporting on model development, 144 unique models were identified (Table 2). In the 34 studies reporting on external validation, 118 models were validated (Table 3). Of these 34 studies reporting on external validation, 23 studies were used for meta-analysis of the Clinical Risk Index for Babies (CRIB) (n = 15), CRIB-II (n = 12), Score for Neonatal Acute Physiology Perinatal Extension (SNAPPE) II (n = 6), and National Institute of Child Health and Human Development (NICHD) calculator (n = 5) scores.
Studies reporting on model development.
Article | No. models | Differences between models caused by differences in | Inclusion criterion | Timing of death | % mortality |
---|---|---|---|---|---|
Pishevar, 2020 (ref 50) | 1 | NA | GA <27 | Unclear | 17% |
Rysavy, 2020 (ref 24) | 2 | Random intercept hospital variation | GA <26/BW <1000 | Discharge home | 37% |
Podda, 2018 (ref 51) | 6 | Modelling method | GA <30 & BW <1500 | Discharge home | 12% |
Oltman, 2018 (ref 52) | 1 | NA | GA <26 | 7 days | 20% |
Beltempo, 2018 (ref 53) | 6 | Timing of death; predictors | GA <29 | 7 days (3)/discharge home (3) | 6%/14% |
Cnattingius, 2017 (ref 54) | 7 | Predictors | GA <31 | 28 days | 7% |
Koller-Smith, 2017 (ref 55) | 2 | Inclusion criterion | GA <32/BW <1500 | Discharge home | 9%/7% |
Steurer, 2017 (ref 56) | 6 | Age at inclusion | GA <28 | 1 year | NR |
Sullivan, 2016 (ref 57) | 4 | Predictors | BW <1500 | Discharge home | 9% |
Jeschke, 2016 (ref 58) | 1 | NA | BW <1500 | 180 days | 11% |
Rüdiger, 2015 (ref 59) | 6 | Timing of death; predictors | GA <32 | 28 days (3)/discharge home (3) | 11% |
Vincer, 2014 (ref 60) | 1 | NA | GA <30 | 28 days | 12% |
Ravelli, 2014 (ref 61) | 1 | NA | GA <32 | 28 days | 9% |
Wu, 2014 (ref 62) | 3 | Predictors | BW <1500 | 7 days | 10% |
Manktelow, 2013 (ref 48) | 1 | NA | GA <32 | Discharge home | 8% |
Dong, 2012 (ref 63) | 1 | NA | BW <1500 | Discharge home | 29% |
Ambalavanan, 2012 (ref 64) | 4 | Predictors | BW <1000 | Discharge home | 6%-34% |
Lee, 2012 (ref 65) | 8 | Timing of death; predictors | GA <32 | 7 days (4)/discharge home (4) | NR |
Phillips, 2011 (ref 66) | 1 | NA | BW <1500 | Discharge home | 12% |
Schenone, 2010 (ref 67) | 1 | NA | GA <26 & BW <1397 | Discharge home | 35% |
Cole, 2010 (ref 68) | 6 | Predictors | GA <31 | Term age | 16%-17% |
Gargus, 2009 (ref 69) | 1 | NA | BW <1000 | 18-22 months | 34% |
Forsblad, 2008 (ref 70) | 2 | Inclusion criterion | GA =23/GA =24 | 180 days | 22% |
Zupancic, 2007 (ref 71) | 2 | Predictors | BW <1500 | Discharge home | 19%/14% |
Forsblad, 2007 (ref 72) | 1 | NA | GA <25 | 180 days | 22% |
Evans, 2006 (ref 73) | 2 | Age at inclusion | GA <32 & BW <1500 | Discharge home | 7% |
Marshall, 2005 (ref 74) | 1 | NA | BW <1500 | Discharge home | 27% |
Locatelli, 2005 (ref 75) | 1 | NA | BW <750 | 120 days | 49% |
Ambalavanan, 2005 (ref 76) | 10 | Age at inclusion; predictors | BW <1000 | Unclear | NR |
Parry, 2003 (ref 19) | 1 | NA | GA <32 | Discharge home | NR |
Janota, 2001 (ref 77) | 4 | Inclusion criterion; timing of death | GA <31 & BW <1500 | 28 days (2)/discharge home (2) | 11%/17% |
Ambalavanan, 2001 (ref 78) | 20 | Predictors; modelling method | BW <1000 | Discharge home | 34% |
Doyle, 2001 (ref 79) | 1 | NA | GA <27 | 5 years | 33% |
Pollack, 2000 (ref 80) | 10 | Predictors | BW <1500 | Discharge home | 14% |
Draper, 1999 (ref 81) | 2 | Inclusion criterion | GA <32 | Discharge home | 20%/9% |
Zernikow, 1998 (ref 82) | 17 | Predictors; modelling method | GA <32 & BW <1500 | 28 days | 9% |
NA = not applicable; GA = gestational age; BW = birth weight; NR = not reported. The number in parentheses in the column “Timing of death” represents the number of models with that timing of death.
Studies reporting on external validation.
Name model | Article | No. studies | No. models | C-statistic original article | C-statistics external validations (range) |
---|---|---|---|---|---|
CRIB | Intern. Network 1993 (ref 18) | 15 (refs 19,23,66,80,83–93) | 16 | 0.90 | Presented in meta-analysis (Fig 5A) |
CRIB-II | Parry 2003 (ref 19) | 12 (refs 23,62,66,90–92,94–99) | 18 | 0.92 | Presented in meta-analysis (Fig 5B) |
SNAPPE-II | Richardson 2001 (ref 20) | 6 (refs 71,89,90,93,96,99) | 7 | 0.85 | Presented in meta-analysis (Fig 5C) |
NICHD | Tyson 2008 (ref 21) | 5 (refs 24,51,100–102) | 13 | 0.75 | Presented in meta-analysis (Fig 5D) |
Apgar | Apgar 1953 (ref 22) | 5 (refs 54,59,98,103,104) | 21 | NA | Presented in Fig 5E |
SNAP-II | Richardson 2001 (ref 20) | 3 (refs 53,71,99) | 7 | NA | 0.68–0.82 |
SNAPPE | Richardson 1993 (ref 105) | 3 (refs 80,83,89) | 4 | 0.92 | 0.79–0.93 |
SNAP | Richardson 1993 (ref 106) | 1 (ref 83) | 1 | NA | 0.82 |
Other models | Podda 2018 (ref 51) | 1 (ref 51) | 8 | 0.91 | 0.77–0.91 |
BW+GA | 1 (ref 51) | 4 | NA | 0.72–0.89 | |
Manktelow 2013 (ref 48) | 1 (ref 51) | 4 | 0.86 | 0.69–0.86 | |
Zupancic 2007 (ref 71) | 1 (ref 51) | 4 | 0.85 | 0.76–0.90 | |
Gray 1992 (ref 107) | 1 (ref 62) | 3 | NA | 0.91–0.96 | |
Draper 1999 (ref 81) | 1 (ref 4) | 2 | NA | 0.82–0.92 | |
Maier 1997 (ref 108) | 1 (ref 87) | 1 | 0.86 | 0.82 | |
Horbar 1993 (ref 80) | 1 (ref 80) | 1 | 0.82 | 0.87 | |
Rysavy 2020-1 (ref 24) | 1 (ref 24) | 2 | NA | 0.73 | |
Rysavy 2020-2 (ref 24) | 1 (ref 24) | 2 | NA | 0.74 |
In total, 118 external validations were performed within 34 studies. The column “Article” reflects the original paper in which the model was published. The number of studies reflects the number of published papers presenting an external validation of the model; because a study might externally validate more than 1 model, the column total of the number of studies exceeds 34. The number of models reflects the number of external validations of the model, which might exceed the number of studies because a model can be validated multiple times within 1 study, for example, in different populations or with different time spans of the outcome. NA = not available.
Characteristics of the Included Model Development Studies
Table 4 shows key characteristics of the study design, sample size, predictors, outcome, modeling method, and predictive performance of the included model development studies. The majority of the included studies originated from registry or retrospective cohorts (n = 32, 89%). Of all 144 models, 60 (42%) used birth weight as their inclusion criterion, 52 (36%) used gestational age, and 32 (22%) used both birth weight and gestational age. The number of participants used for developing the models varied from 57 to 29 180 (median 828), and the number of events ranged between 16 and 4448 (median 171). The median mortality rate was 13% (interquartile range 9%–28%). The number of events per variable (EPV) could be calculated for 120 (83%) models, ranged from 0 to 426 (median 10), and was <10 for 51% of the models. Although the majority of prediction models focused on mortality during hospital admission (n = 72, 50%) or within 28 days after birth (n = 31, 22%), 7 other outcome measures were identified: mortality before term age and mortality within 7, 120, or 180 days, 1 year, 18 to 22 months, or 5 years. The C-statistic varied from 0.70 to 0.95, with a similar range in the VLGA/BW and ELGA/BW subgroups. Calibration was reported for 40 (28%) models, most often as the P value of a Hosmer–Lemeshow test (n = 35, 88%) and in 10 (25%) models as a calibration plot; both discrimination and calibration were reported for 36 (25%) models. In total, 84 of the 144 models (58%) were internally validated, most often by using a random split of the data into development and validation data sets (n = 42, 50%) or cross-validation (n = 18, 21%). For 64 (44%) models, insufficient information was presented to allow calculation of individual risks.
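The EPV metric used above is simply the number of outcome events divided by the number of candidate predictors, with EPV < 10 a conventional flag for overfitting risk. An illustrative sketch with hypothetical study numbers:

```python
# Illustrative sketch: events per variable (EPV), the number of outcome
# events divided by the number of candidate predictors considered during
# model development; EPV < 10 conventionally flags overfitting risk.

def events_per_variable(n_events, n_candidate_predictors):
    return n_events / n_candidate_predictors

# Hypothetical development study: 171 deaths, 12 candidate predictors.
epv = events_per_variable(171, 12)  # 14.25, i.e., not flagged as low EPV
```

A development cohort with, say, only 16 deaths and 8 candidate predictors would yield EPV = 2 and would fall in the <10 group that comprised 51% of the models here.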
Characteristics of the included model development studies and external validation studies.
Item | Categories | Development studies | External validation studies |
---|---|---|---|
Per study, total n = 44 | N=36 | N=34 | |
Study design and study population | |||
Years of publication (min-max) | 1998-2020 | 1994-2020 | |
Number of models per study | 2 (1–6) | 2 (1–4) | |
Data source | Registry | 20 (56) | 13 (38) |
Retrospective cohort | 12 (33) | 10 (29) | |
Prospective cohort | 2 (5.6) | 9 (27) | |
Other | 1 (2.8) | 1 (2.9) | |
Unclear | 1 (2.8) | 1 (2.9) | |
Country* | Europe | 16 (44) | 14 (41) |
North America | 17 (47) | 7 (21) | |
Oceania | 3 (8.3) | 4 (12) | |
Asia | 3 (8.3) | 8 (24) | |
South America | 1 (2.8) | 2 (5.9) | |
Africa | 0 (0.0) | 1 (2.9) | |
Per model, total n = 154 | N=144 | N=118 | |
Inclusion criteria | |||
Birth weight only | 60 (42) | 40 (34) | |
≤1000g | 36 (60) | 6 (15) | |
≤1500g | 24 (40) | 34 (85) | |
Gestational age only | 52 (36) | 51 (43) | |
≤28 weeks | 12 (23) | 12 (24) | |
≤32 weeks | 40 (77) | 39 (76) | |
Birth weight and gestational age | 32 (22) | 27 (23) | |
≤28 weeks/≤1000g | 1 (9.4) | 13 (48) | |
≤32 weeks/≤1500g | 29 (91) | 14 (52) | |
Sample size | |||
Number of participants | 828 (476–5,745) | 842 (267–3,378) | |
Number of events | 171 (53-411) | 81 (43-197) | |
Not reported | 18 (13) | 27 (23) | |
EPV | 10 (4-68) | NA | |
EPV <10 | 61 (51) | ||
EPV 10-20 | 6 (4.2) | ||
EPV >20 | 53 (44) | ||
Not possible to calculate | 24 (17) | ||
Predictors | NA | ||
No. candidate predictors | 12 (6-22) | ||
Not reported | 6 (4.2) | ||
No. predictors in final model | 7 (4-12) | ||
Outcome | |||
Mortality rate | 13% (9%–28%) | 12% (10%–23%) | |
Not reported | 26 (18) | 27 (23) | |
Time span of outcome | Discharge home | 72 (50) | 86 (73) |
28 postnatal days | 31 (22) | 14 (12) | |
7 postnatal days | 11 (7.6) | 9 (7.6) | |
Term age | 6 (4.2) | 0 (0.0) | |
1 year of age | 6 (4.1) | 0 (0.0) | |
180 postnatal days | 4 (2.8) | 0 (0.0) | |
18-22 postnatal months | 1 (0.7) | 0 (0.0) | |
120 postnatal days | 1 (0.7) | 0 (0.0) | |
5 years of age | 1 (0.7) | 0 (0.0) | |
2 years of age | 0 (0.0) | 1 (0.8) | |
2-3 years’ corrected age | 0 (0.0) | 5 (4.2) | |
Unclear | 11 (7.6) | 3 (2.5) | |
Modelling method and model presentation | NA | ||
Modelling method | Logistic regression | 102 (71) | |
Neural networks | 32 (22) | ||
Other | 4 (2.8) | ||
Unclear | 6 (4.2) | ||
Model presentation | Final model presented, including intercept | 45 (31) | |
Final model presented without intercept | 19 (13) | ||
Alternative presentation | 16 (11) | ||
Insufficient information to allow individual risk calculation | 64 (44) | ||
Predictive performance | |||
Discrimination | c-statistic range | 0.70 – 0.95 | 0.56–0.97 |
≤28 weeks/≤1000g | 0.71 – 0.89 | 0.56–0.95 | |
≤32 weeks/≤1500g | 0.70 – 0.95 | 0.67–0.97 | |
Not reported | 11 (7.6) | 9 (7.6) | |
Calibration* | Reported | 40 (28) | 31 (26) |
Hosmer-Lemeshow | 35 (88) | 20 (65) | |
Calibration plot | 10 (25) | 14 (45) | |
Observed-expected ratio | 2 (5.0) | 0 (0.0) | |
Both discrimination and calibration reported | 36 (25) | 26 (22) | |
Internal validation | NA | ||
Internally validated models | 84 (58) | ||
Method of validation* | Random split of data | 42 (50) | |
Cross-validation | 18 (21) | ||
Non-random split of data | 20 (24) | ||
Resampling | 3 (3.6) | ||
Other | 2 (2.4) |
Numbers are presented as n (%) or median (Q1–Q3), unless stated otherwise. If a missing/unclear row is not reported, the characteristic was available for all studies/models. Indented numbers are percentages calculated relative to the specific characteristic/category rather than to all studies/models. * Percentages do not add up to 100% because studies/models may belong to more than one category. EPV = events per variable; NA = not applicable.
Figure 2 summarizes all predictors included in the final models. Variables concerning size and maturity of the infant and variables concerning birth and delivery were most often included (in 77% and 64% of the final models, respectively).
Predictors included in the final development models. The bars reflect the percentage of the 153 models including this predictor; the number at the end of each bar reflects the absolute number of models including this predictor. The upper bar of each category shows the total number and percentage of models including a predictor in this category; subsequently, the categories are subdivided into the lighter-color bars showing the specific predictors in a certain category. Models might have included >1 predictor of a category. BPD, bronchopulmonary dysplasia; CPAP, continuous positive airway pressure; Fio2, fraction of inspired oxygen; GA, gestational age; IVH, intraventricular hemorrhage; NEC, necrotizing enterocolitis; NICHD, Eunice Kennedy Shriver National Institute of Child Health and Human Development; PPHN, persistent pulmonary hypertension of the newborn; PPROM, preterm prelabor rupture of membranes; SNAP, Score for Neonatal Acute Physiology; TRIPS-II, Transport Risk Index of Physiologic Stability, version II.
ROB and Applicability Assessment of the Included Model Development Studies
Figure 3 shows a summary of ROB and applicability for all models. Across nearly all models, ROB related to outcome and predictors was considered low. ROB related to the participants domain was high in 14% of the models because of inappropriate inclusion and exclusion criteria; for example, inclusion of <50% of the eligible infants or exclusion of infants who died beyond the prediction horizon. By contrast, ROB related to the statistical analysis was high in every model, mostly because of inappropriate handling of missing data (100%), not presenting all relevant performance measures (96%), a low number of participants with the outcome relative to the number of candidate predictors (51%), and no correction for overfitting when indicated (94%). In summary, the overall ROB was high across all models.
ROB and applicability assessment of developed models by using PROBAST.
The concern of the model not being applicable to our research question was high in 29% of the models, mainly because of inclusion of participants different from those in our research question (eg, studies excluding outborn infants).
External Validation Studies
Table 4 shows key characteristics of the study design, sample size, outcome, and predictive performance of the included external validation studies. Although 118 external validations were performed across 34 articles, the majority of the 144 developed models (n = 135, 94%) had not been externally validated. For some models, an external validation study was included in this review but not the original development study, because the model was developed in the presurfactant era or in a population outside the scope of this review yet was externally validated in an eligible time period or population. In total, 18 different models were externally validated (Table 3). The median mortality rate was 12% (interquartile range: 10%–23%), comparable to that in the development studies. The C-statistic was reported for 109 (92%) models, ranging from 0.56 to 0.97. For 26 (22%) models, both discrimination and calibration were reported, with 14 (45%) models presenting calibration by using a calibration plot and the majority presenting the resulting P value of a Hosmer–Lemeshow test (n = 20, 65%). Figure 4 shows a summary of ROB and applicability by domain. Across almost all models, ROB related to outcome, predictors, and participants was low. By contrast, ROB related to the analysis was high in almost all models, mostly because of inappropriate handling of missing data (95%) and not presenting a calibration plot (92%). This resulted in an overall high ROB for the validation of 114 (97%) models.
ROB and applicability assessment of externally validated models by using PROBAST.
Meta-analysis
The CRIB score18 was validated most often (n = 15), followed by the CRIB-II19 (n = 12), the SNAPPE-II20 (n = 6), the NICHD model21 (n = 5), and the Apgar score22 (n = 5) (Table 3). However, the Apgar score was unsuitable for meta-analysis because of substantial heterogeneity across its external validations, caused by differences in the moment of prediction, the prediction horizon, and the type of Apgar score (conventional, specified, or expanded). Results from the studies on external validation of the Apgar score are therefore presented without meta-analysis in Fig 5E.
A, Meta-analysis for the CRIB score, showing a forest plot with study-specific C-statistics, the average C-statistic (summary estimate), and the prediction interval. The first row of the table shows characteristics of the original development paper on the CRIB score.18 The CRIB score includes 6 parameters: birth weight, gestation, congenital malformations, maximum base excess, and minimum and maximum appropriate fraction of inspired oxygen in the first 12 hours. Although 16 studies externally validated the CRIB score, 1 study could not be used in the meta-analysis because the C-statistic was not presented, and 1 study could not be used because the 95% confidence interval could not be calculated owing to missing information on the number of outcomes, resulting in 14 studies used for the meta-analysis. Subgroup analysis was not applicable because all studies were performed in a VLGA/BW population. B, Meta-analysis for the CRIB-II score, showing a forest plot with study-specific C-statistics, the average C-statistic (summary estimate), and the prediction interval. The first row of the table shows characteristics of the original development paper on the CRIB-II score.19 The CRIB-II score includes 5 parameters: sex, birth weight, gestation, temperature at admission, and base excess. Results of subgroup analyses in VLGA/BW infants are shown in Fig 6A (online only). C, Meta-analysis for the SNAPPE-II score, showing a forest plot with study-specific C-statistics, the average C-statistic (summary estimate), and the prediction interval. The first row of the table shows characteristics of the original development paper on the SNAPPE-II score.20 The SNAPPE-II score includes 9 parameters: mean blood pressure, lowest temperature, Po2/Fio2 ratio, lowest serum pH, multiple seizures, urine output, birth weight, small for gestational age, and Apgar score at 5 minutes.
D, Meta-analysis for the NICHD model, showing a forest plot with study-specific C-statistics, the average C-statistic (summary estimate), and the prediction interval. The first row of the table shows characteristics of the original development paper on the NICHD model.21 The NICHD model includes 5 parameters: gestational age, birth weight, infant sex, singleton birth, and antenatal steroids. E, Results from all external validations of the Apgar score, without meta-analysis. Conventional = the original scoring system as introduced by Virginia Apgar in 1953,22 including 5 items: heart rate, respiratory effort, reflex irritability, muscle tone, and color; Specified = scoring the items of the conventional Apgar score independent of the interventions needed to achieve the condition; Expanded = scoring the interventions that are required to achieve a condition; Combined = scoring both the Specified and Expanded Apgar scores.
At meta-analysis, the estimated approximate average C-statistics across the included studies were 0.88 (95% CI: 0.83–0.91; I2 = 91%) for the CRIB score (Fig 5A), 0.87 (95% CI: 0.81–0.92; I2 = 94%) for the CRIB-II score (Fig 5B), 0.86 (95% CI: 0.78–0.92; I2 = 90%) for the SNAPPE-II score (Fig 5C), and 0.71 (95% CI: 0.61–0.79; I2 = 84%) for the NICHD model (Fig 5D). The 95% PIs were 0.63–0.97 (CRIB), 0.59–0.97 (CRIB-II), and 0.60–0.96 (SNAPPE-II). Based on the forest plot in Fig 5A, the study of Asker et al was an outlier in comparison with the other studies and as such may be a major source of heterogeneity. Exclusion of this study lowered the I2 to 69% and improved the 95% PI to 0.80–0.94. The probabilities that the scores would achieve a discrimination >0.7 and >0.8 in future validation studies were 93% and 78%, respectively, for the CRIB score; 92% and 78% for the CRIB-II score; 93% and 78% for the SNAPPE-II score; and 57% and 10% for the NICHD model. A calibration plot was presented for 2 external validation studies of the CRIB score, for 1 external validation study of the CRIB-II score, and for 1 external validation study of the NICHD model, showing poor and good calibration for the CRIB score,19,23 good calibration for the CRIB-II score,23 and moderate calibration for the NICHD model.24 Subgroup analyses in a VLGA/BW population for the CRIB-II and SNAPPE-II scores showed similar results (Supplemental Fig 6).
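The pooling approach described above can be sketched in a few lines. The following is an illustrative implementation, not the authors' code: it pools hypothetical study-level C-statistics on the logit scale with a DerSimonian–Laird random-effects model and derives an approximate 95% prediction interval; all input values are made up.

```python
# Illustrative sketch: random-effects meta-analysis of C-statistics on the
# logit scale, with a 95% CI and an approximate 95% prediction interval.
import numpy as np
from scipy import stats

def pool_c_statistics(c, se, alpha=0.05):
    """DerSimonian-Laird random-effects pooling of logit-transformed C-statistics."""
    c, se = np.asarray(c), np.asarray(se)
    y = np.log(c / (1 - c))               # logit-transformed C-statistics
    v = (se / (c * (1 - c))) ** 2         # delta-method variances on the logit scale
    w = 1 / v                             # inverse-variance (fixed-effect) weights
    y_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fe) ** 2)       # Cochran's Q
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)                 # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    se_mu = np.sqrt(1 / np.sum(w_re))
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (mu - z * se_mu, mu + z * se_mu)
    # Approximate prediction interval for a new validation study (t-distribution)
    t = stats.t.ppf(1 - alpha / 2, df=k - 2)
    half = t * np.sqrt(tau2 + se_mu**2)
    pi = (mu - half, mu + half)
    expit = lambda x: 1 / (1 + np.exp(-x))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return expit(mu), tuple(expit(np.array(ci))), tuple(expit(np.array(pi))), i2

# Hypothetical external validation studies: (C-statistic, standard error)
c_stats = [0.86, 0.90, 0.82, 0.91, 0.88, 0.79]
ses     = [0.02, 0.03, 0.04, 0.02, 0.03, 0.05]
c_avg, ci, pi, i2 = pool_c_statistics(c_stats, ses)
print(f"Average C: {c_avg:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}, "
      f"95% PI {pi[0]:.2f}-{pi[1]:.2f}, I2 = {i2:.0f}%")
```

Note how the prediction interval is wider than the confidence interval whenever between-study heterogeneity (tau2) is present, which mirrors the wide PIs reported above.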
Discussion
In this systematic review, we summarized all available prognostic models for mortality prediction in liveborn very preterm infants. In total, 144 models from 36 studies on model development and 118 models from 34 studies on external validation were identified, revealing that there is an abundance of mortality risk prediction models for very preterm infants. ROB assessment showed high ROB in the majority of the models, most often because of inadequate (reporting of the) analysis. Furthermore, internal and external validation of these models is often lacking.
Four main methodologic flaws identified within the analysis domain need addressing. First, at development, 61 (51%) models had a number of participants with the outcome relative to the number of candidate predictors (EPV) of <10, resulting in high ROB according to PROBAST because of the risk of overfitting. With such a small EPV, it is recommended to account for overfitting and optimism to decrease the ROB,9 but this was scarcely done in the included models. Historically, and as such in PROBAST, sample size considerations have been based on the EPV; however, it has recently been suggested to also include the total number of participants, the outcome incidence in the study population, and the expected predictive performance.25
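The EPV check itself is simple arithmetic; as a small illustration (the function names are ours, and the thresholds mirror the categories in the Table above):

```python
# Illustrative sketch of the events-per-variable (EPV) check used in PROBAST.
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """EPV = number of participants with the outcome / number of candidate predictors."""
    return n_events / n_candidate_predictors

def epv_category(epv: float) -> str:
    """Classify EPV into the categories reported in the Table above."""
    if epv < 10:
        return "<10 (high risk of overfitting per PROBAST)"
    return "10-20" if epv <= 20 else ">20"

# Median values from the development studies: 171 events, 12 candidate predictors
epv = events_per_variable(171, 12)
print(f"EPV = {epv:.2f} -> {epv_category(epv)}")   # EPV = 14.25 -> 10-20
```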
Second, none of the included studies handled participants with missing data correctly according to PROBAST. Use of missing data as an exclusion criterion or excluding enrolled participants with any missing data from the analysis leads to biased associations and model performance.26–35 Therefore, multiple imputation is recommended to handle missing data because it leads to the least biased results with correct SEs and P values.26–31,33–35
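A minimal sketch of the recommended multiple-imputation workflow, under stated assumptions: the data are simulated, the predictor names are hypothetical, scikit-learn's IterativeImputer (with posterior sampling) generates the imputed data sets, and Rubin's rules combine the logistic-regression estimates across them.

```python
# Minimal sketch (simulated data, hypothetical predictors) of multiple
# imputation with Rubin's rules for a logistic mortality model.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                         # e.g. standardized GA, birth weight, Apgar
logits = X @ np.array([1.0, 0.8, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
X_miss = X.copy()
X_miss[rng.random((n, 3)) < 0.15] = np.nan          # ~15% of values missing at random

m = 10                                              # number of imputed data sets
coefs, within_var = [], []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X_miss)
    fit = LogisticRegression(C=1e6, max_iter=1000).fit(X_imp, y)  # ~unpenalized
    coefs.append(fit.coef_[0])
    # Within-imputation variance from the inverse observed Fisher information
    p = fit.predict_proba(X_imp)[:, 1]
    Xd = np.column_stack([np.ones(n), X_imp])
    info = Xd.T @ (Xd * (p * (1 - p))[:, None])
    within_var.append(np.diag(np.linalg.inv(info))[1:])

coefs, within_var = np.array(coefs), np.array(within_var)
pooled = coefs.mean(axis=0)                         # Rubin's rules: pooled estimates
W = within_var.mean(axis=0)                         # mean within-imputation variance
B = coefs.var(axis=0, ddof=1)                       # between-imputation variance
se = np.sqrt(W + (1 + 1 / m) * B)                   # total standard error
print("pooled log-odds ratios:", np.round(pooled, 2), "SE:", np.round(se, 2))
```

The key point, as opposed to complete-case analysis, is that all participants contribute to the model, and the between-imputation variance B propagates the uncertainty due to missingness into the pooled standard errors.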
Third, information on both calibration and discrimination was presented for only 25% of the models. Calibration was most often assessed by using a Hosmer–Lemeshow test, even though this test indicates neither the presence nor the magnitude of any miscalibration and is known to be dependent on the sample size.9 Therefore, it is recommended to present a calibration plot instead, which unfortunately was hardly ever reported in the included articles.
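A calibration plot of the kind recommended here takes only a few lines to produce. The sketch below uses simulated predicted mortality risks (all names and data are illustrative) and scikit-learn's calibration_curve to plot observed event rates against predicted risk by risk decile.

```python
# Illustrative calibration plot: observed mortality rate vs predicted risk per
# risk decile, with the diagonal indicating perfect calibration (simulated data).
import numpy as np
import matplotlib
matplotlib.use("Agg")                               # render without a display
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
p_pred = rng.beta(2, 8, size=2000)                  # simulated predicted mortality risks
y = (rng.random(2000) < p_pred).astype(int)         # outcomes drawn from those risks

obs, pred = calibration_curve(y, p_pred, n_bins=10, strategy="quantile")
plt.plot(pred, obs, "o-", label="prediction model")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("Predicted mortality risk")
plt.ylabel("Observed mortality rate")
plt.legend()
plt.savefig("calibration_plot.png")
```

Unlike a Hosmer–Lemeshow P value, the resulting plot shows where across the risk range a model over- or underestimates mortality, and by how much.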
Fourth, 84 of the 144 models (58%) were internally validated, most often by using a random split of the data into development and validation data sets (n = 42, 50%). However, this has been shown to be an inefficient use of the data and an inadequate way to measure optimism.36,37 Instead, bootstrapping or cross-validation is recommended to quantify overfitting of the developed model and optimism in its predictive performance.38 Furthermore, the majority of the studies performing internal validation seemingly failed to replicate the exact model development procedure and thus may still underestimate the actual optimism and overestimate the actual performance of their model.39,40
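The bootstrap internal validation recommended above can be sketched as follows (simulated data; the essential point is that the entire modelling procedure is refitted in every bootstrap sample, and the average optimism is subtracted from the apparent performance):

```python
# Illustrative bootstrap optimism correction of the C-statistic (AUC):
# refit the model in each bootstrap sample, measure the drop from bootstrap-
# sample AUC to original-data AUC, and subtract the mean drop (simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n, k = 300, 8
X = rng.normal(size=(n, k))                        # 8 candidate predictors, 1 informative
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

optimisms = []
for _ in range(200):
    idx = rng.integers(0, n, n)                    # bootstrap resample with replacement
    m_b = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m_b.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m_b.predict_proba(X)[:, 1])
    optimisms.append(auc_boot - auc_orig)          # optimism in this replicate

corrected_auc = apparent_auc - np.mean(optimisms)
print(f"apparent AUC {apparent_auc:.3f}, optimism-corrected AUC {corrected_auc:.3f}")
```

Any variable selection or tuning performed during development would belong inside the bootstrap loop as well; leaving it outside is exactly the failure to replicate the development procedure criticized above.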
Methodologic flaws identified within the ROB assessment of the participants domain included using a nested case control design without correction for baseline risk, inclusion of <50% of the eligible infants, and exclusion of all infants who died after 7 days. Within the applicability assessment, issues raised included exclusion of outborn infants, a study conducted in a high altitude NICU, and exclusion of all infants who died within 72 hours.
This review reveals that development of new prediction models for mortality in preterm infants is an ongoing practice. However, many models are of unknown value for daily practice because of a lack of validation. Therefore, future emphasis should be shifted toward external validation and adaption of existing prediction models, which applies to the broader field of prediction modeling and has been stated before.41,42 Ideally, these validation studies are performed by using prospectively collected data, because validation studies have a higher potential for ROB when participant data come from existing sources collected for a purpose other than validation or updating of prediction models. Subsequently, impact studies are warranted to quantify the effect of a prognostic model on physicians’ behavior and patient outcomes.5
In the majority of development studies, participants, predictors, and outcome were described sufficiently clearly and did not introduce bias. By contrast, high ROB occurred in the analysis section of practically all studies because of inappropriate analysis methods or omission of important statistical considerations. Moreover, for almost 40% of the models, information to allow others to correctly apply the models in new individuals (ie, information on predictors and coefficients of the final developed model, including the intercept) was insufficient. Improvements in studies on mortality risk prediction in very preterm infants are needed and can be achieved through better (reporting of) analyses. A first step in that direction would be better adherence to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement43 and consideration of PROBAST.8,9
This review showed that variables concerning size and maturity of the infant (in 77% of the models), variables concerning birth and delivery (64%), and maternal variables (41%) were most often included. Specifically, gestational age, Apgar score, birth weight, sex, multiplicity, antenatal corticosteroids, and ethnicity were each used as predictors in >40 models, revealing the importance of these variables in mortality risk prediction in preterm infants. Nevertheless, because the vast majority of these models were considered of low quality and their calibration was not reported, their actual value in mortality risk prediction remains unclear.
At meta-analysis of the C-statistic, the CRIB, CRIB-II, and SNAPPE-II scores all revealed excellent discriminative performance (C-statistic >0.85), comparable to a recently published meta-analysis.44 However, considerable heterogeneity across the included studies was found (I2 ≥90% for all models), which can originate from differences between study populations and study designs.45,46 Important characteristics of the included studies, including inclusion criteria, moment of prediction, and time span of the outcome, are shown in Fig 5A–C and indicate substantial differences in study population. However, it is difficult to draw conclusions on the defining sources of heterogeneity, meaning further research will be necessary. Although the 3 models revealed great discriminative performance, information on calibration is largely lacking. To provide a complete and accurate judgment of the performance of these models, information on calibration, ideally in the form of a calibration plot, will be needed. Compared with these 3 models, the NICHD model lagged in its discriminative performance (average C-statistic: 0.71), which may be related to its study population of extremely preterm infants.
Health care decisions for individual patients should be informed by the best available evidence. Systematic reviews summarizing large amounts of information are powerful tools to facilitate clinical decision-making and to identify gaps in our knowledge or room for improvement. In our article, we clearly show a lack of evidence regarding the external validity of the majority of models, poor (reporting of) analyses, and absence of calibration plots in the majority of the models. The abundant availability of insufficiently validated models is not useful for clinical practice.47 In our systematic review, the extensive ROB assessment revealed that the model published by Manktelow et al48 had the highest quality among all 144 developed models. Furthermore, the external validity of the CRIB, CRIB-II, SNAPPE-II, and NICHD models has been assessed often, and these models show good discriminative performance.18–21 Unfortunately, information on their calibration is still lacking. On the basis of the currently available evidence, we consider these 5 prediction models to have the highest potential for use in clinical practice. A first step would be to (again) externally validate these models, but now with a focus on calibration as well. Presenting discrimination will be sufficient when the aim is to distinguish high- and low-risk populations, but for individual prediction, information on calibration is essential. During such external validation, the original model may require an update, thereby addressing the potential miscalibration associated with differences in mortality rate between the development and validation populations. Ideally, such external validations are followed by impact studies to quantify the effect of a prognostic model on physicians’ behavior and patient outcomes.
Since Medlock et al7 published their systematic review of models for the prediction of mortality in very premature infants in 2011, only 1 systematic review, in Spanish, has been published.49 Major improvements of our review over both existing reviews are (1) the use of a standard tool for ROB assessment, which is an essential step in any systematic review8,9; (2) the inclusion of articles externally validating models and meta-analysis of the models validated most often, giving additional insight into their quality and value for clinical practice; and (3) the coverage of the vast number of prediction models newly published since 2011, showing the need for an update to provide a comprehensive overview of prediction models for mortality in very preterm infants.
However, this study has several limitations, too. First, for some models, an external validation study was included, but not the development study, because studies developed in the presurfactant era or in the general NICU population that included very preterm born infants but also infants born >32 weeks’ gestational age were excluded. Second, PROBAST is a recently developed tool using contemporary expertise and knowledge, which was applied to models of which some were developed and published decades ago. Information currently necessary for assessment of bias (eg, calibration) was often not reported, leading to high ROB in the analysis domain across all models. Third, the majority of the included studies originated from developed countries, making this review less applicable to developing countries. In future research, validating prediction models in developing countries might require more attention because there is much to be gained with respect to postnatal mortality in preterm infants.
Conclusions
There is an abundance of mortality risk prediction models for very preterm infants. Improvement in studies on mortality risk prediction in very preterm infants can be achieved through better (reporting of) analyses. Many of the models are of unknown value for daily practice because of a lack of external validation. Meta-analyses of the widely used CRIB, CRIB-II, SNAPPE-II, and NICHD scores revealed good discriminative performance of these scores, but their calibration is currently unknown. Instead of developing additional mortality prediction models for preterm infants, the emphasis should be shifted toward external validation and consecutive adaption of the existing prediction models for mortality in preterm infants.
Dr van Beek designed the study, performed the literature search, conducted the study selection process, data extraction, and critical appraisal, analyzed the data, and wrote the first draft of the manuscript; Drs Andriessen and Onland conducted the study selection process, data extraction, and critical appraisal, provided critical feedback, and helped shape the research, analysis, and manuscript; Dr Schuit designed the study, conducted the study selection process, data extraction, and critical appraisal, provided critical feedback, helped shape the research, analysis, and manuscript, and supervised the project; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.
- CHARMS
critical appraisal and data extraction for systematic reviews of prediction modelling studies
- CI
confidence interval
- CRIB
clinical risk index for babies
- ELGA/BW
extremely low gestational age or birth weight
- EPV
events per variable
- PI
prediction interval
- PROBAST
prediction model risk of bias assessment tool
- ROB
risk of bias
- SNAPPE
score for neonatal acute physiology perinatal extension
- VLGA/BW
very low gestational age or birth weight
References
Competing Interests
POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.