BACKGROUND AND OBJECTIVES

Outcome prediction of preterm birth is important for neonatal care, yet prediction performance using conventional statistical models remains insufficient. Machine learning has a high potential for complex outcome prediction. In this scoping review, we provide an overview of the current applications of machine learning models in the prediction of neurodevelopmental outcomes in preterm infants, assess the quality of the developed models, and provide guidance for future application of machine learning models to predict neurodevelopmental outcomes of preterm infants.

METHODS

A systematic search was performed using PubMed. Studies were included if they reported on neurodevelopmental outcome prediction in preterm infants using predictors from the neonatal period and applying machine learning techniques. Data extraction and quality assessment were independently performed by 2 reviewers.

RESULTS

Fourteen studies were included, focusing mainly on very or extremely preterm infants, predicting neurodevelopmental outcome before age 3 years, and mostly assessing outcomes using the Bayley Scales of Infant Development. Predictors were most often based on MRI. The most prevalent machine learning techniques included logistic regression and neural networks. None of the studies met all newly developed quality assessment criteria. Studies least prone to inflated performance showed promising results, with areas under the curve up to 0.86 for classification and R2 values up to 91% in continuous prediction. A limitation was that only 1 data source was used for the literature search.

CONCLUSIONS

Studies least prone to inflated prediction results are the most promising. The provided evaluation framework may contribute to improved quality of future machine learning models.

Preterm birth is common, with a prevalence of 11% in live-born neonates.1 Short-term survival has increased considerably during the past decades.2 However, surviving infants across all degrees of prematurity may experience adverse neurodevelopmental outcomes. Many preterm infants face neonatal illnesses and may need intensified support of different organ systems, which might affect brain development and, in turn, lead to neurodevelopmental impairment.3–7

The current main known risk factors for adverse neurodevelopmental outcomes in preterm infants are, apart from socioeconomic status, those related to altered brain maturation (eg, lower gestational age [GA]), oxygenation problems (eg, bronchopulmonary dysplasia), or serious infections.6,8,9 Previous attempts to develop prediction models from these known risk factors using conventional statistical methods have thus far not been fruitful and provide unstable ground for clinical decision-making.9–11 Accurate prediction models might aid clinical care in several respects: such models (1) may aid clinical decision-making in the critical phase; (2) may reveal novel insights into the mechanisms underlying poor neurodevelopmental outcome, exposing potential targets for further improvement of neonatal care; and (3) may aid in selecting children at high risk of poor outcomes for targeted follow-up, allowing early intervention or even prevention of developmental challenges.12,13

The considerable body of literature on prematurity has shown that neurodevelopmental outcome of preterm infants is affected by a wide range of factors, including genetic, antenatal, perinatal, and neonatal characteristics; environmental influences; and treatment protocols.6,14,15 Given the large number of relevant factors in play, presumably with complex interacting effects, it is unlikely that the complexity of neurodevelopmental outcome prediction can be captured in linear models produced by conventional statistical procedures. Recent advances in artificial intelligence have shown that machine learning has the potential for complex outcome prediction.16 The field of machine learning includes statistical techniques for modeling complex, nonlinear relations and might, therefore, be particularly suitable for developing models to predict neurodevelopmental outcomes of preterm-born infants.17,18 Machine learning has already proven beneficial in several fields of medicine, performing at the level of, or even outperforming, clinicians in the interpretation of radiographic images, the diagnosis of retinopathy of prematurity, and the prediction of in-hospital cardiac arrest.19–21 This raises the question: to what extent can machine learning contribute to the prediction of neurodevelopmental outcomes in preterm infants?

Conventional statistics and machine learning are not easily distinguished by definition and are sometimes considered part of a continuum. One difference, however, is the hypothesis-driven approach of conventional statistics versus the data-driven approach of machine learning, which bypasses (potentially biased) human influence.22 The application of machine learning in medicine not only holds great promise but also carries the risk of improper use. The most prominent threat relates to the inherently exhaustive search for possible relationships, which may lead to nongeneralizable findings (due to overfitting of the model).18 Although some guidelines exist on the use of machine learning in medical research, their use is not yet common practice.17,18,23 Thus, critical appraisal of the quality of evidence on machine learning to predict neurodevelopmental outcomes after preterm birth is important.

In this scoping review, we provide an overview of the current use of machine learning techniques and their performance in the prediction of neurodevelopmental outcomes of preterm-born children. We provide a critical appraisal of the quality of the available studies, make recommendations on future neonatal prediction models, and provide guidance on the correct use of machine learning in the development of prediction models. We also present an evaluation framework for quality assessment of machine learning models suitable in a broader context of prediction in medicine.

This scoping review was conducted according to the Joanna Briggs Institute formal guidance for scoping reviews.24 No protocol was submitted before the review was conducted. A systematic search was performed in PubMed to identify studies that used machine learning in neurodevelopmental outcome prediction in preterm infants. Inclusion criteria were that (1) the study reported on prediction models for neurodevelopmental outcomes in preterm infants; (2) the mean GA in the study sample was <37 weeks; (3) predictors were assessed before or during the neonatal period (up to 44 weeks postmenstrual age); (4) neurodevelopmental outcomes included motor, neurocognitive, language, behavioral, or academic outcomes, assessed using standardized and validated tests; (5) prediction models were built using machine learning; and (6) the study was published in an English-language, peer-reviewed journal. To distinguish machine learning models from models based on conventional statistics, we defined machine learning as the use of a data-driven approach (ie, without manual feature selection) and/or the inclusion of more complex models (eg, neural networks). Studies published as part of conference proceedings were excluded. The search was conducted using combinations of simple and hierarchical terms for machine learning and preterm infants (last search March 23, 2021; the full search is available in the Supplemental Information). One author performed the study selection; in case of any uncertainty, a second author was consulted. Snowballing through the reference lists of included articles was performed to reveal any articles not captured by the initial search.

Two authors independently performed data extraction. The following information was extracted from each study (if available): study design, selection procedure, sample size, sample characteristics, predictors, outcome measurements, age at assessment, and information on the type and construction of the machine learning models used, including validation procedures and performance indicators. Then, for every article, the performance of the best-performing model per outcome domain was reported on the basis of the area under the receiver operating characteristic curve (AUC) or accuracy for classification and R2 for regression models. If possible, any missing performance outcomes were calculated from the presented data.
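Where a study reported only sensitivity, specificity, and the base rate, the remaining classification metrics follow from standard identities. A minimal illustration in Python (the input values are hypothetical and not taken from any included study):

```python
# Derive accuracy, PPV, and NPV from sensitivity, specificity, and base rate.
# Input values below are hypothetical, for illustration only.

def derived_metrics(sens: float, spec: float, base_rate: float) -> dict:
    """base_rate is the proportion of abnormal (positive) outcomes."""
    p, n = base_rate, 1.0 - base_rate
    accuracy = sens * p + spec * n                  # weighted by class prevalence
    ppv = sens * p / (sens * p + (1 - spec) * n)    # positive predictive value
    npv = spec * n / (spec * n + (1 - sens) * p)    # negative predictive value
    return {"accuracy": accuracy, "ppv": ppv, "npv": npv}

print(derived_metrics(sens=0.70, spec=0.69, base_rate=0.20))
# accuracy ~0.69, PPV ~0.36, NPV ~0.90
```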

The flexibility of machine learning models not only allows for the identification of complex patterns and relations but also increases the risk of overfitting (ie, modeling noise).17 Overfitted models typically show inflated prediction performance on the data used for model development (training data) while generalizing poorly to unseen data (test data).18 Consequently, overfitting severely threatens the clinical applicability of machine learning models. Although methods are available to prevent overfitting (eg, validation, cross-validation, penalizing predictors), there are also common pitfalls in the application of such methods. One common and serious pitfall is the leakage of test data into training data by tweaking model performance on the basis of test data (eg, through exhaustive predictor selection or hyperparameter tuning).17,18 Other common pitfalls include sample sizes that are too small, inappropriate validation methods, and inappropriate performance metrics in the case of skewed classes.18 In contrast, a large sample size combined with too few parameters is more likely to lead to underfitting when targeting complex predictions.17,18,23 Furthermore, attempts should be made to make a model interpretable, to enable better neurobiological insight into the model, and to simplify quality assessment and updates or improvements of the model.25,26 Finally, it is important to encourage open science, allowing for external validation or modification of the model.
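To make the distinction concrete, the following minimal sketch (Python with scikit-learn on synthetic data; not code from any included study) contrasts hyperparameter tuning that is scored on the same cross-validation folds it was tuned on with a properly nested procedure:

```python
# Leakage-free hyperparameter tuning via nested cross-validation (sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Leaky pattern: reporting the score of the folds used to pick C rewards
# hyperparameters that happen to fit the test folds (inflated estimate).
leaky = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5).fit(X, y)
print("inner-CV score (optimistic):", leaky.best_score_)

# Nested pattern: the inner loop tunes C; the outer loop scores on folds
# that played no part in the tuning, giving an honest estimate.
inner = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
print("nested-CV score (honest):", cross_val_score(inner, X, y, cv=5).mean())
```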

To assess the quality of the application of machine learning in the included studies, quality recommendations were gathered from different literature sources and were reshaped into quality assessment criteria. First, quality recommendations were categorized and summarized (Table 1). From these summarized recommendations, manageable criteria were formed that were suitable for quality assessment and subsequently reshaped into an evaluation framework (Table 2).

TABLE 1

Quality Features for Machine Learning Models

Topic | Explanation | Features to Extract
Sample size The sample size should be sufficiently large for the amount of data to prevent overfitting of the model (ie, sample should contain preferably at least several hundreds of observations and should match the complexity of the model).18 
In case of limited sample sizes, transfer learning is a method to increase the performance of a model. Transfer learning refers to the technique that allows a machine learning model to be informed with a priori knowledge extracted from another independent, often larger, data set representing a slightly or completely different population. Through this method, the model parameters are pretrained in the larger data set, and subsequently fine-tuned in the target population, thereby potentially reaching higher prediction efficiency.27  
Sample size (n)
Sample size of transfer cohort (n)
Participants High or unequal sample attrition might influence the performance of a model in a new sample because of low representativeness of the final sample in the model. Therefore, sample attrition, as well as the inclusion and exclusion criteria, should be clearly described, and representativeness should be investigated in the final sample.28  Clear inclusion and exclusion criteria (Y/N)
Clear description of possible sample attrition (Y/N)
Representative sample (Y/N) 
Data leakage It should be clear how many predictors were investigated and how many were included in the final model. If predictor selection or hyperparameter tuning was applied on the basis of the testing data, we speak of data leakage from testing to training data, resulting in inflated performance due to overfitting. To prevent this issue, external validation or nested cross-validation can be used.17,18,29  Number of predictors at start (n)
Number of predictors in final (best-predicting) model (n)
Predictor selection independent from validation performance (Y/N)
Number of combinations of hyperparameters tested (n, both within and outside the validation or cross-validation)
Data leakage (Y/N, through either predictor selection or hyperparameter tuning) 
Validation The predictive performance should be based on a sample independent from the training data to ensure generalizability of the model. This validation might be done in a completely independent sample (external validation) or with the use of internal (nested) cross-validation or splitting the data into dedicated training and testing parts. In (nested) cross-validation, the sample is split into a number of subsamples (k-folds), of which, in turn, 1-fold is held out as a test set, whereas the others function as the training data. k-fold cross-validation is preferred over leave-one-out cross-validation (LOOCV), in which the testing set is not representative of all the data.18,30  Internal validation (k-fold, LOOCV, nested)
External validation (Y/N) 
Performance metrics In case of skewed classes in classification problems, a model's performance might be overestimated when the wrong performance metrics are used. For example, in case of 10% abnormal outcomes, an accuracy of 90% might be achieved by predicting a normal outcome for all subjects. In that case, base rate–sensitive metrics should also be reported (ie, positive and negative predictive values or F-score).31
For regression problems, R2 should be calculated directly from predicted and observed values, not through squaring the correlation coefficient (Pearson r). If a correlation coefficient is presented, it should be accompanied by measures directly comparing predicted and observed values, such as prediction R2 or mean absolute error.17,32
Outcome type (classification, regression)
Base rate (% abnormal)
Type of performance metrics reported (eg, AUC, accuracy, sensitivity) 
Interpretability Some machine learning models are seen as black boxes because, in contrast with simple linear regression, it is very hard to understand the influence of the individual predictors and their relationships. However, there are multiple ways to give some insight into this black box.33
To interpret the added value of the new model, a comparison should be made with previously developed and published prediction models or clinical practice.17  
Interpretability of the model (Y/L/N)
Comparison with previous models (Y/L/N) 
Open science Sharing of code, models, and data is important for the purpose of external validation and open science. Furthermore, some type of decision support model should be made available to facilitate the implementation of a model into clinical practice.17,34  Sharing code, model, or data (C/M/D/N)
Decision support model (Y/N) 

C, code; D, data; L, limited; M, model; N, no; Y, yes.

TABLE 2

Evaluation Framework for Quality Assessment

Topic | Appropriate | Minor Deviation | Major Deviation
Participants Clear inclusion and exclusion criteria and clear description of sample attrition forming a representative sample Clear inclusion and exclusion criteria and/or clear sample attrition and/or representative sample All unclear 
Data leakage No data leakage: predictor selection and model selection completely independent from validated or cross-validated results Limited data leakage Large data leakage 
Validation procedure External validation, cross-validation, or splitting up data into dedicated training and testing sets LOOCV No independent validation 
Performance metrics Classification: AUC or accuracy combined with sensitivity and specificity
In case of skewed classes: combined with PPV/NPV or F-score or correction for skewed classes
Regression: prediction R2 or Pearson r combined with metrics of absolute measurements 
Classification: accuracy without sensitivity and specificity or no class-sensitive metrics (PPV/NPV) in case of skewed classes
Regression: only Pearson r without metrics of absolute measurements 
Classification: no AUC or accuracy provided and not able to be calculated from given data
Regression: no R2 or Pearson r given 
Interpretability Provides insight into the relations between predictors and outcomes and compares performance of the model with previously developed external models Provides partial insight into the relation between predictors and outcomes and/or provides limited model comparison with previous (conventional) statistical models or clinical practice Provides no insight into the relations between predictors and outcomes and/or no comparison with previous (conventional) statistical models or clinical practice 
Open science Free online availability of code, model, and data and/or provision of a decision support tool Limited sharing of code, model, or data (or only available upon request) No sharing of code, model, or data and no decision support tool 

NPV, negative predictive value; PPV, positive predictive value.

Although larger samples lead to better generalizability, the models were not judged on their sample size in the final quality assessment because sample size cannot be judged in isolation from the model type, the number of predictors, and the extent of hyperparameter tuning. Furthermore, appropriate sample size is still a subject of debate among data scientists.35 Consequently, the selected criteria focus on the quality of the machine learning application and the requirements for appropriate implementation in clinical practice.

The criteria were formed and selected through discussion within the author group of this review, representing a board of experienced data scientists, neuroscientists, and neonatologists. All included studies were then independently assessed for quality by 2 authors. Any disagreements between the authors were discussed, and if no consensus could be reached, a third author was consulted.

The systematic search resulted in 2448 unique records. After initial screening, 13 articles were found eligible for inclusion. Snowballing through reference lists of included articles resulted in the inclusion of 1 additional article that met the inclusion criteria. The complete flow diagram is presented in Fig 1.

FIGURE 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram.


An overview of the characteristics of the 14 included studies is presented in Table 3.

TABLE 3

Characteristics of Included Studies

Source | Sample of Infants | Predictor | Age at Predictiona | Outcome Assessment | Classification Threshold for Abnormal Outcome | Age at Outcome Assessment, Corrected Mo, Mean (SD) | Type of Machine Learning Models | Outcomes of Best-Predictive Model
Motor | Neurocognitive | Other
Ambalavanan et al36  <1000 g Clinical variables Near term BSID-1 (motor and neurocognitive) <68 13.3 (0.2) NN, LR LR: AUC, 0.69; sensitivity, 0.7; specificity, 0.69; PPV, 0.26; NPV 0.86 NN: AUC, 0.75; sensitivity, 0.7; specificity, 0.62; PPV, 0.23; NPV, 0.93 — 
Ambalavanan et al37  <1000 g Clinical variables 0, 3, and 8 d after birth NDI or death BSID score <70 or presence of cerebral palsy, vision <20/200, hearing loss, or death 18b CT — — NDI/death: Accuracy, 0.62; sensitivity, 0.53; specificity, 0.69; PPV, 0.58; NPV, 0.65 
Moeskops et al38  <28 wk GA MRI, clinical variables 30 and/or 40 BSID-3 (motor and neurocognitive) <85 29c SVM AUC, 0.85 AUC, 0.81 — 
Kawahara et al39  <32 wk GA DTI-MRI, clinical variables Shortly after birth and/or 40 BSID-3 (motor and neurocognitive) NA 18 CNN, NN, LR CNN: R2 = 0.096; MAE, 10.734; SDAE, 7.734 CNN: R2 = 0.035; MAE, 11.077; SDAE, 8.574 — 
He et al40  24–32 wk GA fcMRI, clinical variables 39.4 (1.3) BSID-3 (neurocognitive) <85 24 SVM — Accuracy, 0.706; AUC, 0.76; sensitivity, 0.701; specificity, 0.712 — 
Schadl et al41  ≤32 wk GA and ≤1500 g DTI MRI, clinical variables 36.6 (1.8) BSID-3 (motor and neurocognitive) <85 18–22b LR Accuracy, 0.867; AUC, 0.912; sensitivity, 0.90; specificity, 0.86; PPV, 0.56; NPV, 0.98; R2 = 0.317 Accuracy, 1.00; AUC, 1.00; sensitivity, 1.00; specificity, 1.00; PPV, 1.00; NPV, 1.00; R2 = 0.296 — 
Brown et al42  24–32 wk GA DTI-MRI 35.56 BSID-3 (motor and neurocognitive) ≤85 18b LR Accuracy, 0.725; R2 = 0.195; AOC, 14.01 Accuracy, 0.590; R2 = 0.196; AOC, 15.37 — 
Cahill-Rowley et al43  ≤32 wk GA and ≤1500 g DTI-MRI (automatically/manually analyzed) 36.5 (1.2) TDI (motor) < −1 SD 20.2 (1.0) LR Accuracy, 0.788; AUC, 0.83; sensitivity, 0.85; specificity, 0.75; R2 = 0.16 — —
Girault et al44  <37 wk GA DWI-MRI 40.4 (1.6) MSEL (neurocognitive) <110 (median for full-term cohort) 25.6 (0.8)c CNN-LR (combined) — Accuracy, 0.838; sensitivity, 0.86; specificity, 0.80; PPV, 0.86; NPV, 0.80; R2 = 0.914; MAE, 4.47 — 
Saha et al45  <31 wk GA DWI-MRI, PMA at scan 32 (median, range 23.1-3..8) NSMDA (motor) ≥1 (median, 24; range, 2.9–26.7) CNN, LR, SVM, DNN CNN: accuracy, 0.76; AUC, 0.74; sensitivity, 0.67; specificity, 0.82; F-score, 0.69 — —
Vassar et al46  ≤32 wk GA and ≤1500 g DTI-MRI, MRI near term BSID-3 (language) <85 18–22b LR — — Language: AUC, 0.916; sensitivity, 0.89; specificity, 0.86 
Janjic et al47  <32 wk GA H-MRS, DTI-MRI 40.4 (1.5) BSID-3 (motor and neurocognitive) <85 12.0 (0.8) NN Accuracy, 0.961; sensitivity, 0.769; specificity, 0.989; PPV, 0.909; NPV, 0.967 Accuracy, 0.991; sensitivity, 0.857; specificity, 1.00; PPV, 1.00; NPV, 0.991 — 
He et al48  ≤32 wk GA MRI, clinical variables 40.3 (median, range 39.3-41.4) BSID-3 (motor, neurocognitive, and language) ≤85 24b NN Accuracy, 0.739; AUC, 0.84; sensitivity, 0.760; specificity, 0.717 Accuracy, 0.851; AUC, 0.86; sensitivity, 0.740; specificity, 0.889 Language: accuracy, 0.689; AUC, 0.66; sensitivity, 0.600; specificity, 0.778 
Chen et al49  <32 wk GA DTI-MRI 40.4 (0.6) BSID-3 (neurocognitive) <90 24b CNN, LR, SVM, DNN — CNN: accuracy, 0.745; AUC, 0.75; sensitivity, 0.702; specificity, 0.787; R2 = 0.221; MAE, 16.2; SDAE, 9.5 — 

—, not applicable; AOC, area over the regression error characteristic curve; CNN, convolutional neural network; CT, classification tree; DNN, deep neural network; DTI, diffusion tensor imaging; DWI, diffusion-weighted imaging; fcMRI, functional connectivity MRI; H-MRS, proton magnetic resonance spectroscopy; LR, logistic regression; MAE, mean absolute error; MSEL, Mullen Scales of Early Learning; NA, not available; NDI, neurodevelopmental impairment; NN, neural network; NPV, negative predictive value; NSMDA, Neuro-Sensory Motor Developmental Assessment; PMA, postmenstrual age; PPV, positive predictive value; SDAE, SD of absolute error; SVM, support vector machine; TDI, Toddler Development Index.

a

Data are mean (SD) weeks PMA unless otherwise indicated.

b

No mean/median available, only the intended age at outcome measurement.

c

Uncorrected age.

Study Characteristics

Thirteen studies (93%) included exclusively very preterm infants (GA <32 weeks) in their model. Models on neurocognitive outcome were investigated in 10 studies (71%), motor outcome in 9 (64%), and language outcome in 2 (14%). One study aimed to predict neurodevelopmental impairment, defined as adverse outcomes in ≥1 developmental domains (ie, motor, neurocognitive, hearing loss, blindness).37 None of the studies investigated behavioral or academic outcomes. Regarding the type of predictors, 12 studies (86%) investigated the predictive power of MRI-derived parameters, of which 6 (43%) combined MRI with clinical variables. Two studies (14%) investigated clinical variables only. The age at prediction varied from the day of birth to near-term age, with 3 studies (21%) investigating models at multiple prediction moments. Regarding neurodevelopmental outcome measures, the most commonly used measurement was the Bayley Scales of Infant Development (BSID), versions 1 to 3 (11 studies, 79%). The age at predicted outcome varied between 12 and 29 months.

Model Types

In 13 studies (93%), classification models were developed in which a dichotomous outcome was predicted (normal versus abnormal outcome). The most common threshold for an abnormal outcome using the BSID was a score of 85, which equals −1 SD (7 studies, 70%). In 5 of the studies with a classification model, additional regression models were developed aimed at the prediction of a continuous outcome score. Only 1 study (7%) developed solely a regression model. The machine learning models most often used were logistic regression with exhaustive feature selection (8 studies, 57%), support vector machines (4 studies, 29%), and neural networks (7 studies, 50%). Exhaustive feature selection involved data-driven selection of predictors out of a large number of variables on the basis of optimal model performance. Five studies (36%) reported on comparisons of different types of machine learning models. A brief explanation of the models used in the included studies is given in Table 4.

TABLE 4

Description of Machine Learning Models

Machine Learning Model | Description
Logistic regression Logistic regression analysis using a logistic function to predict a binary outcome. Often seen as a conventional statistical method because of its use in combination with hypothesis-driven selection of predictors. However, in this review, only studies were included that used a data-driven approach, eg, through exhaustive feature selection or using regularization to limit model complexity. 
Decision tree Flow chart–like structured model in which each node represents a decision based on a number of categories of values of a predictor (eg, predictor value yes or no, predictor value ≤5 or >5). The tree is built by starting with the most informative node. Branches with nodes are created until a stop criterion is met. When reaching the bottom of the tree, the outcome is a predicted category (classification tree), a value (regression tree), or a regression model (model tree). Decision trees are appreciated for their simplicity and interpretability. 
Support vector machine Initially designed for binary classification, the model aims to find the best boundary between the 2 classes by finding a linear separator that maximizes the distance between data points of one class and the other class (the so-called support vectors). These models have been extended with kernel functions to allow for more complex separations between classes. An extension to regression problems (support vector regression) is also available. 
Neural network Model structure is inspired by neural networks in the brain, consisting of multiple neurons (nodes) connected to each other. The nodes are typically arranged into layers in which the first layer consists of input nodes (variables), ≥1 inner layers that represent specific states determined by a combination of predictors, and an output layer (outcome). The nodes between layers are fully connected in the case of so-called feed-forward neural networks, making it possible to capture extensive variable interactions in the model. Connections between nodes are associated with a connection strength (weight) that is learned from the available data. 
Deep neural network A deep neural network is an extensive form of a neural network. Deep neural networks are characterized by multiple inner layers, allowing for capture of more complex interactions between predictors while keeping the number of additional weights reasonably limited. There is no clear distinction made between neural networks and deep neural networks. 
Convolutional neural network Subtype of a deep neural network in which the model takes advantage of the hierarchical pattern in the data by applying so-called convolutional filters on the data that extract the most useful parts of the data by using a more limited number of weights compared with fully connected layers. This process leads to the reduction of model overfitting, increasing generalizability of the prediction to unseen data. Convolutional neural networks are often used in image analysis. 

Model Performance

The performance of the best-performing model of each study, based on validated or cross-validated results, is presented in the last column of Table 3. For classification problems, AUC values on the test data ranged from 0.66 to 1.00 (mean, 0.81; SD, 0.09), and accuracy values ranged from 0.59 to 1.00 (mean, 0.80; SD, 0.12). For regression problems, R2 ranged from 4% to 91% (mean, 28%; SD, 27%). The performances of the models without data leakage (see quality assessment) ranged from 0.69 to 0.86, 0.62 to 0.85, and 22% to 91% for AUC, accuracy, and R2, respectively.

The extracted data used for the quality assessment of the models are displayed in Table 5.

TABLE 5

Features Extracted for Quality Assessment

Reference No.
36 37 38 39 40 41 42 43 44 45 46 47 48 49
Sample size, n 218 1046 173 115 28 60 115 52 37 77 59 103 (m)/115 (c) 33 80 
Size of transfer cohort, n — — — — 884 — — — 75 — — — 1425 1.2 million 
Clear inclusion/exclusion criteria, Y/N 
Representative sample, Y/N 
Clear sample attrition, Y/N 
Predictors at start, n 21 65a 14 4005 4005 396 4005 396 3003 Not givenb 396 42 4089 8100 
Predictors at final model, n 21 10 (m)/12 (c) 4005 4005 4005 3003 Not givenb 4089 8100 
Predictor selection independent from test set, Y/N — — — — — — — — 
Hyperparameters tuned independently from test set, n Not given Not given 21 2016 557 >1000a >35a 
Hyperparameters tuned based on test set, n 10 30 1000 
Data leakage, Y/N 
Internal validation 2:1 3:1 with 10-fold CV within the training set Leave 5% out in 500 rounds CV 3-fold CV 10-fold CV LOOCV Leave 2 out in 1000 rounds CV LOOCV 10-fold CV 9:1 with 10-fold CV within the 90% training set LOOCV 4-fold CV 5-fold CV in 50 iterations 5-fold CV and 7:3 training/validation within training set 
External validation, Y/N 
Outcome type, classification/regression Classification Classification Classification Regression Classification Both Both Both Both Classification Classification Classification Classification Both 
Base rate, % abnormal 22 (m)/14 (c) 41.6 25 (m)/24 (c) — 50 17 (m)/14 (c) 18 (m)/10 (c) 38 59 38c 33.7 11.8 (m)/5.6 (c) 15 (m)/15 (c)/21 (l) 
Type of performance metrics reported AUC, sens, spec, PPV, NPV Accuracy, sens, spec, PPV, NPV AUC Pearson r, MAE, SDAE Accuracy, AUC, sens, spec AUC, sens, spec, R2d Accuracy, Pearson r, AOC AUC, sens, spec, R2 Accuracy, Pearson r, MAE, SDAEd Accuracy, AUC, sens, spec, F1-score AUC, sens, spec Sens, spec, PPV, NPVd Accuracy, AUC, sens, spec, LR, FPR Accuracy, AUC, sens, spec, Pearson r, MAE, SDAE 
Interpretability of the model, Y/L/N 
Comparison with literature, Y/L/N 
Sharing code, model, or data, C/M/D/N 
Decision support model, Y/N 

—, not applicable; AOC, area over the regression error characteristic curve; C, code; (c), cognitive outcome; CV, cross-validation; D, data; FPR, false positive rate; (l), language outcome; LR, likelihood ratio; M, model; (m), motor outcome; MAE, mean absolute error; N, no; NPV, negative predictive value; PPV, positive predictive value; sens, sensitivity; spec, specificity; SDAE, SD of absolute error; Y, yes.

a

At least, probably more, but not able to retrieve from the article.

b

Scans were split into patches of size 26 × 26 × 26 voxels, with 50% overlap. Average number of patches per scan was 189. Number of patches/voxels at start was equal to number in final model.

c

Data set was balanced using data augmentation techniques.

d

Additional performance metrics could be calculated from data presented in the article.

Sample Size

Sample size ranged from 33 to 1046 infants (mean, 158; SD, 261). Only 1 study (7%) had a sample size above several hundred participants. Eight studies (57%) had a sample size <100 infants, of which 4 applied the concept of transfer learning.27
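None of these 4 studies published their training code; purely to illustrate the concept described in Table 1, the sketch below shows a generic fine-tuning step in PyTorch, with an entirely hypothetical architecture, weight file, and cohort:

```python
# Generic transfer learning sketch (hypothetical; not any included study's
# pipeline): a network pretrained on a large transfer cohort is reused by
# freezing its feature layers and fitting only a new output head on the
# small target cohort.
import torch
import torch.nn as nn

pretrained = nn.Sequential(                  # stand-in for a pretrained feature extractor
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)
# pretrained.load_state_dict(torch.load("transfer_cohort.pt"))  # hypothetical weights

for param in pretrained.parameters():        # freeze the pretrained layers
    param.requires_grad = False

model = nn.Sequential(pretrained, nn.Linear(32, 1))  # new task-specific head

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(20, 100)                     # small synthetic target cohort
y = torch.randint(0, 2, (20, 1)).float()
for _ in range(10):                          # fine-tune only the new head
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```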

Participants

Twelve studies (86%) reported clear inclusion and exclusion criteria. Three studies (21%) were considered to have a sample that was not representative of the population, either because the selection of the study population was not in line with the aim of prediction or because of unclear inclusion and exclusion criteria. Authors of 4 articles (29%) did not report on sample attrition.

Data Leakage

Six studies (43%) applied (exhaustive) predictor selection; however, in only 1 of these studies was the predictor selection completely independent from the performance results, through the application of nested validation.37 In 9 studies (64%), models were optimized through hyperparameter tuning. In 3 of them,39,40,42 this optimization was based on final validated or cross-validated results, thereby causing data leakage (ie, training the model on test results). In total, 8 studies (57%) suffered from data leakage, 5 through predictor selection and 3 through hyperparameter tuning. These 8 studies were ranked on the severity of data leakage, on the basis of the number of combinations tested, and then evenly divided into limited or large data leakage.
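The leakage-free alternative keeps the data-driven selection step inside each training fold. A minimal sketch with scikit-learn (univariate selection stands in here for the exhaustive searches used in the studies; data are synthetic):

```python
# Predictor selection with and without leakage (sketch, synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=400, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky variant: selecting on the full data set before cross-validation lets
# the test folds influence which predictors survive, inflating the estimate.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
print("leaky:", cross_val_score(clf, X_leaky, y, cv=5).mean())

# Leakage-free variant: the selector sits inside a Pipeline and is re-fit on
# each training fold only, independent of the held-out fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("clf", clf)])
print("leakage-free:", cross_val_score(pipe, X, y, cv=5).mean())
```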

Validation

All studies used some form of validation. Nine studies (64%) applied some form of k-fold cross-validation, of which 3 (33%) applied leave-one-out cross-validation (LOOCV), in which k = n. The holdout method, forming 2 exclusive subsets (training versus testing), was used in 1 study (7%). Four studies (29%) applied nested validation by combining k-fold cross-validation with the holdout method (3 studies) or external validation (1 study), thereby preventing any form of data leakage during hyperparameter tuning.
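For reference, the 3 internal validation schemes encountered in the included studies can be sketched as follows (scikit-learn, synthetic data):

```python
# Holdout, k-fold cross-validation, and LOOCV (sketch, synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=60, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Holdout: one dedicated training/testing split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print("holdout:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: every observation is tested exactly once.
print("5-fold:", cross_val_score(clf, X, y, cv=KFold(n_splits=5)).mean())

# LOOCV (k = n): each single-case test set is unrepresentative of the sample.
print("LOOCV:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```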

Performance Metrics

In 6 (43%) of the 13 studies in which classification models were developed (normal versus abnormal outcome), outcomes were clearly skewed, defined as <25% or >75% abnormal outcomes. Three of these studies (50%) neither provided base rate–sensitive metrics nor applied data set–balancing techniques.38,41,48
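The base rate problem is easy to demonstrate: at a 10% abnormal base rate, a trivial model that labels every infant as normal reaches 90% accuracy while detecting no abnormal outcome at all (synthetic illustration):

```python
# Why accuracy misleads with skewed classes (synthetic data).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% abnormal outcomes
y_pred = np.zeros_like(y_true)                  # trivial "all normal" model

print("accuracy:", accuracy_score(y_true, y_pred))                    # ~0.90
print("sensitivity:", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("F-score:", f1_score(y_true, y_pred, zero_division=0))          # 0.0
```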

In the 6 studies developing continuous regression models, only 2 provided the explained variance (R2) calculated directly from the observed and predicted values.41,43 The other 4 studies reported Pearson's r, the correlation between the observed and predicted values, which is insensitive to the absolute difference between them. However, these 4 studies did provide other metrics that directly compare the observed and predicted values (ie, mean absolute error, area over the regression error curve).
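The distinction matters because a systematically biased model can correlate perfectly with the observed scores while predicting them poorly. A minimal illustration with hypothetical scores:

```python
# Prediction R2 vs squared Pearson r (hypothetical values).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

y_obs = np.array([70.0, 80.0, 90.0, 100.0, 110.0])
y_pred = 0.5 * y_obs + 60.0   # perfectly correlated but systematically biased

r, _ = pearsonr(y_obs, y_pred)
print("squared Pearson r:", r**2)                 # 1.0 (misleading)
print("prediction R2:", r2_score(y_obs, y_pred))  # -0.375 (penalizes the bias)
```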

Interpretability

Studies were judged on the interpretability of their constructed models, as well as on the interpretation of the model in relation to previously published models. Eleven studies (79%) made sufficient attempts to provide insight into the model (ie, the relation between individual predictors and outcome), most often through visualization (in MRI studies) using Circos plots. Only 1 study mentioned a specific method for analyzing the influence of specific predictors in the model (gradient-weighted class activation mapping).49
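As an illustration of one generic, model-agnostic option (not the method used in the included studies), permutation importance quantifies how much validated performance drops when a single predictor is shuffled:

```python
# Permutation importance as a peek into a black-box model (sketch).
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = MLPClassifier(max_iter=2000, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:3]:  # top 3 predictors
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```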

Comparison with previous (conventional or machine learning) statistical models was absent or limited in most studies (71%). In only 4 studies (29%), extensive comparisons were made, making it possible to judge model performance in the context of previous literature.

Open Science

Six studies (43%) provided sufficient information on the model properties to allow reproduction of the model. Nevertheless, researchers of none of the studies shared code or data, except for 1 study whose researchers stated that the data were available upon request. Likewise, none of the articles reported attempts to provide a tool enabling implementation of the developed models into clinical practice.

Overall Quality Assessment

The overall quality assessment is displayed in Table 6. None of the included studies met all criteria; however, some came close. Five studies (36%) scored as appropriate on 4 of 6 criteria. The categories of data leakage, interpretability, and open science left the most room for improvement, scoring as appropriate in only 43%, 29%, and 0% of the studies, respectively.

TABLE 6

Quality Assessment

Source | Participants | Data Leakage | Validation Procedure | Performance Metrics | Interpretability | Open Science
Ambalavanan et al36  XXX XX 
Ambalavanan et al37  XX XX 
Moeskops et al38  XX XXX XX XX XXX 
Kawahara et al39  XX XX XX XX 
He et al40  XX XX XXX 
Schadl et al41  XX XXX XX XX XXX 
Brown et al42  XXX XX XX XXX 
Cahill-Rowley et al43  XXX XX XX XX 
Girault et al44  XX XX XX XX 
Saha et al45  XX XXX 
Vassar et al46  XXX XX XX XXX 
Janjic et al47  XX XX XXX XX 
He et al48  XX XX 
Chen et al49  XX XXX 

X, appropriate; XX, minor deviation; XXX, major deviation.

In this scoping review, we aimed to provide an overview of the current applications of machine learning models in the prediction of neurodevelopmental outcomes in preterm infants, to assess the quality of the developed models, and to provide guidance for the future application of such models. Most of the 14 included studies aimed to predict outcomes at the age of 1 to 2 years, mainly assessed through the BSID, predicting either normal versus abnormal development or an actual individual test score. MRI features were often used as predictors, sometimes combined with clinical data. The machine learning models used most often were logistic regression and neural networks. None of the studies completely met our developed quality recommendations; however, those at low risk of inflated results showed promising performance, with AUCs up to 0.86, accuracy up to 0.85, and R2 up to 91%.

Most of the models were directed at very or extremely preterm infants and were based on predictors gathered in the late neonatal period, mostly around term date. This limits the use of the models to prediction after the critically ill period and, thereby, their contribution to clinical decision-making in the neonatal phase. However, the developed models are still helpful in providing earlier insight into the expected outcome and might thereby offer future opportunities for early intervention programs after discharge from the hospital. Regrettably, half of the included studies used only MRI-based predictors, thereby leaving aside the predictive potential of many other factors known to influence neurodevelopmental outcome. The relevance of models that included MRI predictors is also bound to the routine clinical use of MRI in preterm infants; currently, MRI is not part of relevant national and international guidelines. The use of MRI features is much more common in the included studies than in previous conventional prediction models.9,11 This finding might be explained by the suitability of machine learning algorithms for the analysis of comprehensive MRI data without the need for human interpretation, which carries a risk of bias and information loss. This suitability not only points out one of the strengths of machine learning as a predictive tool, namely the ability to analyze complex and large data sets such as MRI, but also holds potential for longitudinal data, such as time series of vital parameter recordings or laboratory assessments. In terms of outcome measures, the common use of the BSID in the studies limits the predictive power of the models for long-term outcomes because the associations between BSID scores and later neurodevelopmental outcomes are limited, with the percentage of explained variance reaching only 37%.50

Our developed evaluation framework was used to judge the included studies. None of the studies completely met the quality recommendations. Most of the studies seemed predominantly invested in the statistical part of the model rather than in the methodology of the development of the prediction model as a whole (eg, selection of participants). However, to ensure generalizability of prediction models to clinical practice, close attention should also be paid to sample formation, to ensure that the developed model is applicable to the intended population. Data leakage was present in more than half of the studies. This problem could be overcome by applying some form of nested validation, selecting features or hyperparameters on data completely independent from the test data. Three models applied LOOCV instead of the preferred k-fold cross-validation. Although LOOCV creates a larger amount of training data, its use might lead to negative R2 values because the single held-out test case is not representative of the whole sample, causing anticorrelation between the training and test data.18 Three studies that developed classification models did not act on their skewed classes; the predictive performance of these models is hard to interpret because no information was given about the predictive value of an individual test result.18 In almost all studies, improvements could be made in both the interpretation of the model and the sharing of the code and data used, which would facilitate external validation and quality assessment of the model. Taken together, more than half of the included studies were likely to report inflated performance due to methodologic limitations, most often related to data leakage. It is likely that these findings also hold for machine learning models in general (pediatric) medicine. Therefore, readers should always critically assess the quality of a study before assuming its results to be generalizable.

Although we did not include sample size in the quality assessment, as explained earlier, most of the studies had relatively small samples, with 8 having a sample size of <100 infants. The combination of small samples, large numbers of predictors, extensive hyperparameter tuning, and nonlinear models is prone to overfitting: it can inflate the apparent predictive performance when data leakage is present or, when leakage is absent, yield disappointing performance because the model does not generalize to the test data (illustrated in the simulation below). The small samples in the included studies are rather disappointing, considering the high prevalence of preterm birth.1  This prevalence, in fact, makes the preterm infant population well suited for machine learning approaches, and prospective standardized data collection would yield valuable data offering promising opportunities for improved prediction models. Although some initiatives for gathering big data on large patient groups in neonatal care have already been undertaken, these have not yet proven fruitful for long-term outcome prediction.51,52  The same applies to general (pediatric) care, in which advanced standardized data collection also deserves more attention.
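The following small simulation (again our own illustration under assumed numbers, not data from any included study) shows how this combination inflates performance in the presence of leakage: selecting the 10 "best" of 2000 pure-noise predictors on the full sample of 60 before cross-validating yields a clearly above-chance AUC, whereas repeating the selection within each training fold returns the estimate to chance level.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(42)
    X = rng.normal(size=(60, 2000))  # small sample, many predictors, no true signal
    y = rng.integers(0, 2, size=60)  # outcome unrelated to X by construction

    cv = StratifiedKFold(5, shuffle=True, random_state=0)

    # Leaky protocol: features are chosen on ALL data, including future test folds.
    X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
    leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=cv, scoring="roc_auc").mean()

    # Leakage-free protocol: selection is refit inside each training fold.
    honest_pipe = make_pipeline(SelectKBest(f_classif, k=10),
                                LogisticRegression(max_iter=1000))
    honest = cross_val_score(honest_pipe, X, y, cv=cv, scoring="roc_auc").mean()

    print(f"leaky AUC: {leaky:.2f}, leakage-free AUC: {honest:.2f}")
    # On noise, the leaky AUC typically lands well above 0.5; the honest one near 0.5.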

Although the quality assessment revealed a risk of inflated results in most of the studies, those with a low risk (ie, without signs of data leakage) reported promising model performances. These performances compare favorably with those of prediction models developed with conventional statistics, such as the Nursery Neurobiologic Risk Score, the Clinical Risk Index for Babies, and the Score for Neonatal Acute Physiology, and with manually extracted MRI features. The conventional statistical models reported AUCs from 0.59 to 0.82 and an R2 of ∼20%, whereas manually extracted MRI features only reach sensitivity and specificity of 72% and 62% or lower, respectively; moreover, these performances were not externally validated or cross-validated and are thereby likely overfitted.9,11,53-56  Because the included machine learning models were trained on relatively small data sets, larger data sets are likely to further increase their predictive performance. Thus, the results of the studies reviewed here are promising for further machine learning model development, provided that future studies observe the rules on nonbiased patient groups and correct use of machine learning, use larger sample sizes, and expand the predictors to clinical and environmental features, including high-density data such as vital parameter time series. Such models could provide outcome predictions earlier in the neonatal period and thereby aid clinical decision-making in the critical phase.

This scoping review has some limitations. We investigated only 1 literature source (PubMed). However, PubMed is a comprehensive search engine covering many medical journals, and snowballing was applied to identify any missed studies. Furthermore, the evaluation framework leaves room for some degree of subjectivity. In addition, we were not able to include sample size in the evaluation framework, as explained above. Nevertheless, the presented evaluation framework is the highest-quality guideline currently available on this topic. Strengths of this study include the careful assessment of the available machine learning models for the prediction of neurodevelopmental outcome, enabling accurate judgment of their performance. We provide readers with a framework for quality assessment of machine learning models and present recommendations for future neonatal prediction models of neurodevelopmental outcome.

Machine learning is a relatively new field in clinical science, with considerable differences from conventional statistical approaches. Therefore, existing quality assessment tools are insufficient for proper evaluation of machine learning models. Although valuable attempts have been made in reporting guidelines on proper machine learning use and in extensions of currently available risk-of-bias tools, there are currently no guidelines for judging the technical aspects of machine learning application. As a result, the risk of model overfitting or underfitting is overlooked in existing quality guidelines.23,57  In this study, we have made a first attempt to provide guidelines that assist in the (technical) quality assessment of machine learning applications. Future studies using machine learning should pay more attention to correct application of the technique. Furthermore, studies may contribute to the prediction of neurodevelopmental outcomes of preterm infants by applying machine learning to the wide variety of data available in the antenatal and neonatal period, including continuous vital parameters, environmental factors, and imaging data. In addition, models should focus on predicting earlier in the neonatal period (using predictors available in the acute phase), as well as on predicting neurodevelopmental outcomes assessed beyond age 2 years.

This scoping review not only reveals promising results in the prediction of neurodevelopmental outcomes in preterm infants but also highlights common pitfalls in the application of machine learning techniques. The provided evaluation framework may contribute to improving the quality of future machine learning models in general.

Dr van Boven conceptualized and designed the study, reviewed and selected the literature, performed the data collection, developed the quality guidelines, appraised the quality of the studies, drafted the initial manuscript, and reviewed and revised the manuscript; Dr Henke reviewed the quality guidelines, appraised the quality of the studies, and reviewed and revised the manuscript; Dr Leemhuis and Prof van Kaam conceptualized and designed the study, were involved in the interpretation and analysis of data, reviewed the quality guidelines, and reviewed and revised the manuscript; Prof Hoogendoorn conceptualized and designed the study, was involved in the interpretation and analysis of data, developed the quality guidelines, and reviewed and revised the manuscript; Dr Königs and Prof Oosterlaan conceptualized and designed the study, supervised the literature selection, data collection, and quality appraisal of the study, developed the quality guidelines, and reviewed and revised the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.

FUNDING: No external funding.

CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no potential conflicts of interest to disclose.

ABBREVIATIONS

AUC: area under the receiver operating characteristic curve
BSID: Bayley Scales of Infant Development
GA: gestational age
LOOCV: leave-one-out cross-validation

REFERENCES

1. Blencowe H, Cousens S, Oestergaard MZ, et al. National, regional, and worldwide estimates of preterm birth rates in the year 2010 with time trends since 1990 for selected countries: a systematic analysis and implications. Lancet. 2012;379(9832):2162-2172
2. Field DJ, Dorling JS, Manktelow BN, Draper ES. Survival of extremely premature babies in a geographically defined population: prospective cohort study of 1994-9 compared with 2000-5. BMJ. 2008;336(7655):1221-1223
3. Stoll BJ, Hansen NI, Bell EF, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network. Trends in care practices, morbidity, and mortality of extremely preterm neonates, 1993-2012. JAMA. 2015;314(10):1039-1051
4. Allotey J, Zamora J, Cheong-See F, et al. Cognitive, motor, behavioural and academic performances of children born preterm: a meta-analysis and systematic review involving 64 061 children. BJOG. 2018;125(1):16-25
5. Oskoui M, Coutinho F, Dykeman J, Jetté N, Pringsheim T. An update on the prevalence of cerebral palsy: a systematic review and meta-analysis. Dev Med Child Neurol. 2013;55(6):509-519
6. Twilhaar ES, Wade RM, de Kieviet JF, van Goudoever JB, van Elburg RM, Oosterlaan J. Cognitive outcomes of children born extremely or very preterm since the 1990s and associated risk factors: a meta-analysis and meta-regression. JAMA Pediatr. 2018;172(4):361-367
7. van Noort-van der Spek IL, Franken MC, Weisglas-Kuperus N. Language functions in preterm-born children: a systematic review and meta-analysis. Pediatrics. 2012;129(4):745-754
8. Salas AA, Carlo WA, Ambalavanan N, et al; Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network. Gestational age and birthweight for risk assessment of neurodevelopmental impairment or death in extremely preterm infants. Arch Dis Child Fetal Neonatal Ed. 2016;101(6):F494-F501
9. Van’t Hooft J, van der Lee JH, Opmeer BC, et al. Predicting developmental outcomes in premature infants by term equivalent MRI: systematic review and meta-analysis. Syst Rev. 2015;4:71
10. Latal B. Prediction of neurodevelopmental outcome after preterm birth. Pediatr Neurol. 2009;40(6):413-419
11. Crilly CJ, Haneuse S, Litt JS. Predicting the outcomes of preterm neonates beyond the neonatal intensive care unit: what are we missing? Pediatr Res. 2021;89(3):426-445
12. Jeukens-Visser M, Koldewijn K, van Wassenaer-Leemhuis AG, Flierman M, Nollet F, Wolf M-J. Development and nationwide implementation of a postdischarge responsive parenting intervention program for very preterm born children: the TOP program. Infant Ment Health J. 2021;42(3):423-437
13. Hadders-Algra M. Early diagnosis and early intervention in cerebral palsy. Front Neurol. 2014;5:185
14. Roberts D, Brown J, Medley N, Dalziel SR. Antenatal corticosteroids for accelerating fetal lung maturation for women at risk of preterm birth. Cochrane Database Syst Rev. 2017;(3):CD004454
15. Benavente-Fernández I, Synnes A, Grunau RE, et al. Association of socioeconomic status and brain injury with neurodevelopmental outcomes of very preterm children. JAMA Netw Open. 2019;2(5):e192914
16. Lonsdale H, Jalali A, Ahumada L, Matava C. Machine learning and artificial intelligence in pediatric research: current state, future prospects, and examples in perioperative and critical care. J Pediatr. 2020;221(suppl):S3-S10
17. Scheinost D, Noble S, Horien C, et al. Ten simple rules for predictive modeling of individual differences in neuroimaging. Neuroimage. 2019;193:35-45
18. Poldrack RA, Huckins G, Varoquaux G. Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry. 2020;77(5):534-540
19. Reid JE, Eaton E. Artificial intelligence for pediatric ophthalmology. Curr Opin Ophthalmol. 2019;30(5):337-346
20. Choy G, Khalilzadeh O, Michalski M, et al. Current applications and future impact of machine learning in radiology. Radiology. 2018;288(2):318-328
21. Kwon JM, Lee Y, Lee Y, Lee S, Park J. An algorithm based on deep learning for predicting in-hospital cardiac arrest. J Am Heart Assoc. 2018;7(13):e008678
22. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317-1318
23. Luo W, Phung D, Tran T, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res. 2016;18(12):e323
24. Peters MDJ, Godfrey CM, Khalil H, McInerney P, Parker D, Soares CB. Guidance for conducting systematic scoping reviews. Int J Evid Based Healthc. 2015;13(3):141-146
25. Shen X, Finn ES, Scheinost D, et al. Using connectome-based predictive modeling to predict individual behavior from brain connectivity. Nat Protoc. 2017;12(3):506-518
26. Yang G, Ye Q, Xia J. Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: a mini-review, two showcases and beyond. Inf Fusion. 2022;77:29-52
27. Romero M, Interian Y, Solberg T, Valdes G. Targeted transfer learning to improve performance in small medical physics datasets. Med Phys. 2020;47(12):6246-6256
28. Wolff RF, Moons KGM, Riley RD, et al; PROBAST Group. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170(1):51-58
29. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009
30. Varoquaux G, Raamana PR, Engemann DA, Hoyos-Idrobo A, Schwartz Y, Thirion B. Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. Neuroimage. 2017;145(pt B):166-179
31. Thai-Nghe N, Gantner Z, Schmidt-Thieme L. Cost-sensitive learning methods for imbalanced data. In: Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN); July 18-23, 2010; Barcelona, Spain. 1-8
32. Alexander DLJ, Tropsha A, Winkler DA. Beware of R(2): simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models. J Chem Inf Model. 2015;55(7):1316-1322
33. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016; San Francisco, CA. 1135-1144
34. Nichols TE, Das S, Eickhoff SB, et al. Best practices in data analysis and sharing in neuroimaging using MRI. Nat Neurosci. 2017;20(3):299-303
35. Balki I, Amirabadi A, Levman J, et al. Sample-size determination methodologies for machine learning in medical imaging research: a systematic review. Can Assoc Radiol J. 2019;70(4):344-353
36. Ambalavanan N, Nelson KG, Alexander G, Johnson SE, Biasini F, Carlo WA. Prediction of neurologic morbidity in extremely low birth weight infants. J Perinatol. 2000;20(8 pt 1):496-503
37. Ambalavanan N, Baibergenova A, Carlo WA, Saigal S, Schmidt B, Thorpe KE; Trial of Indomethacin Prophylaxis in Preterms (TIPP) Investigators. Early prediction of poor outcome in extremely low birth weight infants by classification tree analysis. J Pediatr. 2006;148(4):438-444
38. Moeskops P, Išgum I, Keunen K, et al. Prediction of cognitive and motor outcome of preterm infants based on automatic quantitative descriptors from neonatal MR brain images. Sci Rep. 2017;7(1):2163
39. Kawahara J, Brown CJ, Miller SP, et al. BrainNetCNN: convolutional neural networks for brain networks; towards predicting neurodevelopment. Neuroimage. 2017;146:1038-1049
40. He L, Li H, Holland SK, Yuan W, Altaye M, Parikh NA. Early prediction of cognitive deficits in very preterm infants using functional connectome data in an artificial neural network framework. Neuroimage Clin. 2018;18:290-297
41. Schadl K, Vassar R, Cahill-Rowley K, Yeom KW, Stevenson DK, Rose J. Prediction of cognitive and motor development in preterm children using exhaustive feature selection and cross-validation of near-term white matter microstructure. Neuroimage Clin. 2017;17:667-679
42. Brown CJ, Miller SP, Booth BG, et al. Predictive connectome subnetwork extraction with anatomical and connectivity priors. Comput Med Imaging Graph. 2019;71:67-78
43. Cahill-Rowley K, Schadl K, Vassar R, Yeom KW, Stevenson DK, Rose J. Prediction of gait impairment in toddlers born preterm from near-term brain microstructure assessed with DTI, using exhaustive feature selection and cross-validation. Front Hum Neurosci. 2019;13:305
44. Girault JB, Munsell BC, Puechmaille D, et al. White matter connectomes at birth accurately predict cognitive abilities at age 2. Neuroimage. 2019;192:145-155
45. Saha S, Pagnozzi A, Bourgeat P, et al. Predicting motor outcome in preterm infants from very early brain diffusion MRI using a deep learning convolutional neural network (CNN) model. Neuroimage. 2020;215:116807
46. Vassar R, Schadl K, Cahill-Rowley K, Yeom K, Stevenson D, Rose J. Neonatal brain microstructure and machine-learning-based prediction of early language development in children born very preterm. Pediatr Neurol. 2020;108:86-92
47. Janjic T, Pereverzyev S Jr, Hammerl M, et al. Feed-forward neural networks using cerebral MR spectroscopy and DTI might predict neurodevelopmental outcome in preterm neonates. Eur Radiol. 2020;30(12):6441-6451
48. He L, Li H, Wang J, et al. A multi-task, multi-stage deep transfer learning model for early prediction of neurodevelopment in very preterm infants. Sci Rep. 2020;10(1):15072
49. Chen M, Li H, Wang J, et al. Early prediction of cognitive deficit in very preterm infants using brain structural connectome with transfer learning enhanced deep convolutional neural networks. Front Neurosci. 2020;14:858
50. Luttikhuizen dos Santos ES, de Kieviet JF, Königs M, van Elburg RM, Oosterlaan J. Predictive value of the Bayley scales of infant development on development of very preterm/very low birth weight children: a meta-analysis. Early Hum Dev. 2013;89(7):487-496
51. Shirwaikar RD, Acharya UD, Makkithaya K, Mallayaswamy S, Lewis LES. Design framework for a data mart in the neonatal intensive care unit. Crit Rev Biomed Eng. 2018;46(3):221-243
52. Spitzer AR, Ellsbury D, Clark RH. The Pediatrix BabySteps® Data Warehouse: a unique national resource for improving outcomes for neonates. Indian J Pediatr. 2015;82(1):71-79
53. Fowlie PW, Gould CR, Tarnow-Mordi WO, Strang D. Measurement properties of the Clinical Risk Index for Babies: reliability, validity beyond the first 12 hours, and responsiveness over 7 days. Crit Care Med. 1998;26(1):163-168
54. Fowlie PW, Tarnow-Mordi WO, Gould CR, Strang D. Predicting outcome in very low birthweight infants using an objective measure of illness severity and cranial ultrasound scanning. Arch Dis Child Fetal Neonatal Ed. 1998;78(3):F175-F178
55. Lefebvre F, Grégoire MC, Dubois J, Glorieux J. Nursery Neurobiologic Risk Score and outcome at 18 months. Acta Paediatr. 1998;87(7):751-757
56. Eriksson M, Bodin L, Finnström O, Schollin J. Can severity-of-illness indices for neonatal intensive care predict outcome at 4 years of age? Acta Paediatr. 2002;91(10):1093-1100
57. Langerhuizen DWG, Janssen SJ, Mallee WH, et al. What are the applications and limitations of artificial intelligence for fracture detection and classification in orthopaedic trauma imaging? A systematic review. Clin Orthop Relat Res. 2019;477(11):2482-2491
