Outcome prediction of preterm birth is important for neonatal care, yet prediction performance using conventional statistical models remains insufficient. Machine learning has a high potential for complex outcome prediction. In this scoping review, we provide an overview of the current applications of machine learning models in the prediction of neurodevelopmental outcomes in preterm infants, assess the quality of the developed models, and provide guidance for future application of machine learning models to predict neurodevelopmental outcomes of preterm infants.
A systematic search was performed using PubMed. Studies were included if they reported on neurodevelopmental outcome prediction in preterm infants using predictors from the neonatal period and applying machine learning techniques. Data extraction and quality assessment were independently performed by 2 reviewers.
Fourteen studies were included, focusing mainly on very or extreme preterm infants, predicting neurodevelopmental outcome before age 3 years, and mostly assessing outcomes using the Bayley Scales of Infant Development. Predictors were most often based on MRI. The most prevalent machine learning techniques included linear regression and neural networks. None of the studies met all newly developed quality assessment criteria. Studies least prone to inflated performance showed promising results, with areas under the curve up to 0.86 for classification and R2 values up to 91% in continuous prediction. A limitation was that only 1 data source was used for the literature search.
Studies least prone to inflated prediction results are the most promising. The provided evaluation framework may contribute to improved quality of future machine learning models.
Preterm birth is common, with a prevalence of 11% in live-born neonates.1 Short-term survival has increased considerably during the past decades.2 However, surviving infants in all degrees of prematurity may experience adverse neurodevelopmental outcomes. Many preterm infants face neonatal illnesses and may need intensified support in different organ systems that might affect brain development and, in turn, lead to neurodevelopmental impairment.3–7
The current main known risk factors for adverse neurodevelopmental outcomes in preterm infants are, apart from socioeconomic status, those related to altered brain maturation (eg, lower gestational age [GA]), oxygenation problems (eg, bronchopulmonary dysplasia), or serious infections.6,8,9 Previous attempts to develop prediction models from these known risk factors using conventional statistical methods have thus far not been fruitful and provide unstable ground for clinical decision-making.9–11 Accurate prediction models might aid clinical care in several respects: they may (1) support clinical decision-making in the critical phase; (2) reveal novel insights into the mechanisms underlying poor neurodevelopmental outcome, exposing potential targets for further improvement of neonatal care; and (3) help select children at high risk of poor outcomes for targeted follow-up, allowing early intervention or even prevention of developmental challenges.12,13
The considerable body of literature on prematurity has shown that neurodevelopmental outcome of preterm infants is affected by a wide range of factors, including genetic, antenatal, perinatal, and neonatal characteristics; environmental influences; and treatment protocols.6,14,15 Given the large number of relevant factors in play, presumably with complex interacting effects, it is unlikely that the complexity of neurodevelopmental outcome prediction can be captured in linear models produced by conventional statistical procedures. Recent advances in artificial intelligence have shown that machine learning has the potential for complex outcome prediction.16 The field of machine learning includes statistical techniques for modeling complex, nonlinear relations and might, therefore, be particularly suitable for developing models to predict neurodevelopmental outcomes of preterm-born infants.17,18 Machine learning has already proven beneficial in several fields of medicine, performing at the level of, or even outperforming, clinicians in the interpretation of radiographic images, the diagnosis of retinopathy of prematurity, and the prediction of in-hospital cardiac arrest.19–21 This raises the question: to what extent can machine learning contribute to the prediction of neurodevelopmental outcomes in preterm infants?
Conventional statistics and machine learning are not easily distinguished by definition and are sometimes considered part of a continuum. One difference, however, is the hypothesis-driven approach of conventional statistics versus the data-driven approach of machine learning, which bypasses (potentially biased) human influence.22 The application of machine learning in medicine not only holds great promise but also carries the risk of improper use. The most prominent threat is the inherently inexhaustible search for possible relationships, which may lead to nongeneralizable findings (due to overfitting of the model).18 Although some guidelines exist on the use of machine learning in medical research, their use is not yet common practice.17,18,23 Thus, critical appraisal of the quality of evidence on machine learning to predict neurodevelopmental outcomes after preterm birth is important.
In this scoping review, we provide an overview of the current use of machine learning techniques and their performance in the prediction of neurodevelopmental outcomes of preterm-born children. We provide a critical appraisal of the quality of the available studies, make recommendations on future neonatal prediction models, and provide guidance on the correct use of machine learning in the development of prediction models. We also present an evaluation framework for quality assessment of machine learning models suitable in a broader context of prediction in medicine.
Methods
Search Strategy and Selection Criteria
This scoping review was conducted according to the Joanna Briggs Institute formal guidance for scoping reviews.24 No protocol was submitted before the review was conducted. A systematic search was performed in PubMed to identify studies that used machine learning in neurodevelopmental outcome prediction in preterm infants. Inclusion criteria were that (1) the study reported on prediction models for neurodevelopmental outcomes in preterm infants; (2) the mean GA in the study sample was <37 weeks; (3) predictors were assessed before or during the neonatal period (up to 44 weeks postmenstrual age); (4) neurodevelopmental outcomes included motor, neurocognitive, language, behavioral, or academic outcomes, assessed using standardized and validated tests; (5) prediction models were built using machine learning; and (6) the study was published in an English-language, peer-reviewed journal. In an attempt to distinguish machine learning models from models based on conventional statistics, we defined machine learning as the use of a data-driven approach (ie, without manual feature selection) and/or the inclusion of more complex models (eg, neural networks). Studies published as part of conference proceedings were excluded. The search was conducted using combinations of simple and hierarchical terms for machine learning and preterm infants (last search March 23, 2021, full search available in the Supplemental Information). One author performed the study selection. In case of any uncertainty, a second author was consulted. Snowballing through the reference list of included articles was performed to reveal any articles that were not captured by the initial search.
Data Extraction
Two authors independently performed data extraction. The following information was extracted from each study (if available): study design, selection procedure, sample size, sample characteristics, predictors, outcome measurements, age at assessment, and information on type and construction of machine learning models used, including validation procedures and performance indicators. Then, for every article, the performance of the best-performing model per outcome domain was reported on the basis of the receiver operating characteristic area under curve (AUC) or accuracy for classification and R2 for regression models. If possible, any missing performance outcomes were calculated using the presented data.
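Where a study reported sensitivity, specificity, and the base rate of abnormal outcome but not accuracy or the predictive values, the missing metrics follow arithmetically. A minimal sketch of such a calculation is given below; the numbers are hypothetical and not taken from any included study.

```python
# Hedged illustration: derive accuracy, PPV, and NPV from sensitivity, specificity,
# and prevalence (base rate of abnormal outcome). All numbers are hypothetical.
def derived_metrics(sensitivity: float, specificity: float, base_rate: float) -> dict:
    tp = sensitivity * base_rate              # true-positive fraction of the cohort
    fn = (1 - sensitivity) * base_rate        # false-negative fraction
    tn = specificity * (1 - base_rate)        # true-negative fraction
    fp = (1 - specificity) * (1 - base_rate)  # false-positive fraction
    return {
        "accuracy": tp + tn,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical example: sensitivity 0.70, specificity 0.69, 22% abnormal outcomes
print(derived_metrics(0.70, 0.69, 0.22))
# {'accuracy': 0.69, 'ppv': ~0.39, 'npv': ~0.89}
```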
Quality Assessment
The flexibility of machine learning models not only allows for the identification of complex patterns and relations but also increases the risk of overfitting (ie, modeling noise).17 Overfitted models typically show inflated prediction performance on the data used for model development (training data) while generalizing poorly to unseen data (test data).18 Consequently, overfitting severely threatens the clinical applicability of machine learning models. Although methods are available to prevent overfitting (eg, validation, cross-validation, penalizing predictors), there are also common pitfalls in the application of such methods. One common and serious pitfall is the leakage of test data into training data by tweaking model performance based on test data (eg, through exhaustive predictor selection or hyperparameter tuning).17,18 Other common pitfalls include sample sizes that are too small, inappropriate validation methods, or inappropriate performance metrics in the case of skewed classes.18 Conversely, a large sample size combined with too few parameters is more likely to lead to underfitting of the model when targeting complex predictions.17,18,23 Furthermore, attempts should be made to make a model interpretable, to enable better neurobiological insight into the model, and to simplify quality assessment and updates or improvement of the model.25,26 It is also important to encourage open science, allowing for external validation or modification of the model.
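To make the data leakage pitfall concrete, the sketch below shows nested cross-validation with scikit-learn: hyperparameters are tuned only on inner-loop training folds, so the outer test folds never influence model selection. The cohort size, predictors, and settings are synthetic placeholders and are not taken from any included study.

```python
# Minimal sketch of nested cross-validation to avoid data leakage during tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a neonatal cohort: 200 infants, 30 candidate predictors,
# ~20% abnormal outcomes (skewed classes).
X, y = make_classification(n_samples=200, n_features=30, weights=[0.8, 0.2], random_state=0)

# Inner loop: hyperparameter tuning (here, the SVM regularization strength C).
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tuned_svm = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: performance is estimated on folds never seen during tuning.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
auc_per_fold = cross_val_score(tuned_svm, X, y, scoring="roc_auc", cv=outer_cv)
print(f"Nested cross-validated AUC: {auc_per_fold.mean():.2f} (SD {auc_per_fold.std():.2f})")
```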
To assess the quality of the application of machine learning in the included studies, quality recommendations were gathered from different literature sources and were reshaped into quality assessment criteria. First, quality recommendations were categorized and summarized (Table 1). From these summarized recommendations, manageable criteria were formed that were suitable for quality assessment and subsequently reshaped into an evaluation framework (Table 2).
Quality Features for Machine Learning Models
Topic | Explanation | Features to Extract
---|---|---
Sample size | The sample size should be sufficiently large for the amount of data to prevent overfitting of the model (ie, sample should contain preferably at least several hundreds of observations and should match the complexity of the model).18 In case of limited sample sizes, transfer learning is a method to increase the performance of a model. Transfer learning refers to the technique that allows a machine learning model to be informed with a priori knowledge extracted from another independent, often larger, data set representing a slightly or completely different population. Through this method, the model parameters are pretrained in the larger data set, and subsequently fine-tuned in the target population, thereby potentially reaching higher prediction efficiency.27 | Sample size (n) Sample size of transfer cohort (n) |
Participants | High or unequal sample attrition might influence the performance of a model in a new sample because of low representativeness of the final sample in the model. Therefore, sample attrition, as well as the inclusion and exclusion criteria, should be clearly described, and representativeness should be investigated in the final sample.28 | Clear inclusion and exclusion criteria (Y/N) Clear description of possible sample attrition (Y/N) Representative sample (Y/N) |
Data leakage | It should be clear how many predictors were investigated and how many were included in the final model. If predictor selection or hyperparameter tuning was applied on the basis of the testing data, we speak of data leakage from testing to training data, resulting in inflated performance due to overfitting. To prevent this issue, external validation or nested cross-validation can be used.17,18,29 | Number of predictors at start (n) Number of predictors in final (best-predicting) model (n) Predictor selection independent from validation performance (Y/N) Number of combinations of hyperparameters tested (n, both within and outside the validation or cross-validation) Data leakage (Y/N, through either predictor selection or hyperparameter tuning) |
Validation | The predictive performance should be based on a sample independent from the training data to ensure generalizability of the model. This validation might be done in a completely independent sample (external validation) or with the use of internal (nested) cross-validation or splitting the data into dedicated training and testing parts. In (nested) cross-validation, the sample is split into a number of samples (k-folds), of which by turn, 1-fold is held out as a test set, whereas the others function as the training data. The k-fold cross-validation is preferred over LOOCV, in which the testing set is not representative of all the data.18,30 | Internal validation (k-fold, LOOCV, nested) External validation (Y/N) |
Performance metrics | In case of skewed classes in classification problems, a model might be overestimated using the wrong performance metrics. For example, in case of 10% abnormal outcomes, an accuracy of 90% might be achieved by predicting a normal outcome for all subjects. In that case, base rate–sensitive metrics should also be reported (ie, positive and negative predictive values or F-score).31 For regression problems, R2 should be calculated directly from predicted and observed values, not through squaring the correlation coefficient (Pearson r). If a correlation coefficient is presented, it should be accompanied by measures directly comparing predicted and observed values, such as prediction R2 or mean absolute error.17,32 | Outcome type (classification, regression) Base rate (% abnormal) Type of performance metrics reported (eg, AUC, accuracy, sensitivity) |
Interpretability | Some machine learning models are seen as black boxes because in contrast with simple linear regression, it is very hard to understand the influences of the individual predictors and their relationships. However, there are multiple ways to give some insight into this black box.33 To interpret the added value of the new model, a comparison should be made with previously developed and published prediction models or clinical practice.17 | Interpretability of the model (Y/L/N) Comparison with previous models (Y/L/N) |
Open science | Sharing of code, models, and data is important for the purpose of external validation and open science. Furthermore, some type of decision support model should be made available to facilitate the implementation of a model into clinical practice.17,34 | Sharing code, model, or data (C/M/D/N) Decision support model (Y/N) |
C, code; D, data; L, limited; M, model; N, no; Y, yes.
Evaluation Framework for Quality Assessment
Topic | Appropriate | Minor Deviation | Major Deviation
---|---|---|---
Participants | Clear inclusion and exclusion criteria and clear description of sample attrition forming a representative sample | Clear inclusion and exclusion criteria and/or clear sample attrition and/or representative sample | All unclear |
Data leakage | No data leakage: predictor selection and model selection completely independent from validated or cross-validated results | Limited data leakage | Large data leakage |
Validation procedure | External validation, cross-validation, or splitting up data into dedicated training and testing sets | LOOCV | No independent validation |
Performance metrics | Classification: AUC or accuracy combined with sensitivity and specificity In case of skewed classes: combined with PPV/NPV or F-score or correction for skewed classes Regression: prediction R2 or Pearson r combined with metrics of absolute measurements | Classification: accuracy without sensitivity and specificity or no class-sensitive metrics (PPV/NPV) in case of skewed classes Regression: only Pearson r without metrics of absolute measurements | Classification: no AUC or accuracy provided and not able to be calculated from given data Regression: no R2 or Pearson r given |
Interpretability | Provides insight into the relations between predictors and outcomes and compares performance of the model with previously developed external models | Provides partial insight into the relation between predictors and outcomes and/or provides limited model comparison with previous (conventional) statistical models or clinical practice | Provides no insight into the relations between predictors and outcomes and/or no comparison with previous (conventional) statistical models or clinical practice |
Open science | Free online availability of code, model, and data and/or provision of a decision support tool | Limited sharing of code, model, or data (or only available upon request) | No sharing of code, model, or data and no decision support tool |
NPV, negative predictive value; PPV, positive predictive value.
Although larger samples lead to better generalizability, the models were not judged on their sample size in the final quality assessment, because sample size cannot be judged in isolation from the model type, the number of predictors, and the extent of hyperparameter tuning. Furthermore, appropriate sample size is still a subject of debate among data scientists.35 Consequently, the selected criteria focus on the quality of the machine learning application and the requirements for appropriate implementation in clinical practice.
The criteria were formed and selected through discussion within the author group of this review, representing a board of experienced data scientists, neuroscientists, and neonatologists. All included studies were then independently assessed for quality by 2 authors. Any disagreements between the authors were discussed, and if no consensus could be reached, a third author was consulted.
Results
Study Selection
The systematic search resulted in 2448 unique records. After initial screening, 13 articles were found eligible for inclusion. Snowballing through reference lists of included articles resulted in the inclusion of 1 additional article that met the inclusion criteria. The complete flow diagram is presented in Fig 1.
Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram.
Summary of Study Findings
An overview of the characteristics of the 14 included studies is presented in Table 3.
Characteristics of Included Studies
Source | Sample of Infants | Predictor | Age at Prediction^a | Outcome Assessment | Classification Threshold for Abnormal Outcome | Age at Outcome Assessment, Corrected Mo, Mean (SD) | Type of Machine Learning Models | Best-Predictive Model: Motor | Best-Predictive Model: Neurocognitive | Best-Predictive Model: Other
---|---|---|---|---|---|---|---|---|---|---
Ambalavanan et al36 | <1000 g | Clinical variables | Near term | BSID-1 (motor and neurocognitive) | <68 | 13.3 (0.2) | NN, LR | LR: AUC, 0.69; sensitivity, 0.7; specificity, 0.69; PPV, 0.26; NPV 0.86 | NN: AUC, 0.75; sensitivity, 0.7; specificity, 0.62; PPV, 0.23; NPV, 0.93 | — |
Ambalavanan et al37 | <1000 g | Clinical variables | 0, 3, and 8 d after birth | NDI or death | BSID score <70 or presence of cerebral palsy, vision <20/200, hearing loss, or death | 18b | CT | — | — | NDI/death: Accuracy, 0.62; sensitivity, 0.53; specificity, 0.69; PPV, 0.58; NPV, 0.65 |
Moeskops et al38 | <28 wk GA | MRI, clinical variables | 30 and/or 40 | BSID-3 (motor and neurocognitive) | <85 | 29c | SVM | AUC, 0.85 | AUC, 0.81 | — |
Kawahara et al39 | <32 wk GA | DTI-MRI, clinical variables | Shortly after birth and/or 40 | BSID-3 (motor and neurocognitive) | NA | 18 | CNN, NN, LR | CNN: R2 = 0.096; MAE, 10.734; SDAE, 7.734 | CNN: R2 = 0.035; MAE, 11.077; SDAE, 8.574 | — |
He et al40 | 24–32 wk GA | fcMRI, clinical variables | 39.4 (1.3) | BSID-3 (neurocognitive) | <85 | 24 | SVM | — | Accuracy, 0.706; AUC, 0.76; sensitivity, 0.701; specificity, 0.712 | — |
Schadl et al41 | ≤32 wk GA and ≤1500 g | DTI MRI, clinical variables | 36.6 (1.8) | BSID-3 (motor and neurocognitive) | <85 | 18–22b | LR | Accuracy, 0.867; AUC, 0.912; sensitivity, 0.90; specificity, 0.86; PPV, 0.56; NPV, 0.98; R2 = 0.317 | Accuracy, 1.00; AUC, 1.00; sensitivity, 1.00; specificity, 1.00; PPV, 1.00; NPV, 1.00; R2 = 0.296 | — |
Brown et al42 | 24–32 wk GA | DTI-MRI | 35.56 | BSID-3 (motor and neurocognitive) | ≤85 | 18b | LR | Accuracy, 0.725; R2 = 0.195; AOC, 14.01 | Accuracy, 0.590; R2 = 0.196; AOC, 15.37 | — |
Cahill-Rowley et al43 | ≤32 wk GA and ≤1500 g | DTI-MRI (automatically/manually analyzed) | 36.5 (1.2) | TDI (motor) | < −1 SD | 20.2 (1.0) | LR | Accuracy, 0.788; AUC 0.83; sensitivity, 0.85; specificity, 0.75; R2 = 0.16 | — | — |
Girault et al44 | <37 wk GA | DWI-MRI | 40.4 (1.6) | MSEL (neurocognitive) | <110 (median for full-term cohort) | 25.6 (0.8)c | CNN-LR (combined) | — | Accuracy, 0.838; sensitivity, 0.86; specificity, 0.80; PPV, 0.86; NPV, 0.80; R2 = 0.914; MAE, 4.47 | — |
Saha et al45 | <31 wk GA | DWI-MRI, PMA at scan | 32 (median, range 23.1-3..8) | NSMDA (motor) | ≥1 | median, 24; range, 2.9–26.7) | CNN, LR, SVM, DNN | CNN: accuracy, 0.76; AUC, 0.74; sensitivity, 0.67; specificity, 0.82; F-score, 0.69 | — | — |
Vassar et al46 | ≤32 wk GA and ≤1500 g | DTI-MRI, MRI | near term | BSID-3 (language) | <85 | 18–22b | LR | — | — | Language: AUC, 0.916; sensitivity, 0.89; specificity, 0.86 |
Janjic et al47 | <32 wk GA | H-MRS, DTI-MRI | 40.4 (1.5) | BSID-3 (motor and neurocognitive) | <85 | 12.0 (0.8) | NN | Accuracy, 0.961; sensitivity, 0.769; specificity, 0.989; PPV, 0.909; NPV, 0.967 | Accuracy, 0.991; sensitivity, 0.857; specificity, 1.00; PPV, 1.00; NPV, 0.991 | — |
He et al48 | ≤32 wk GA | MRI, clinical variables | 40.3 (median, range 39.3-41.4) | BSID-3 (motor, neurocognitive, and language) | ≤85 | 24b | NN | Accuracy, 0.739; AUC, 0.84; sensitivity, 0.760; specificity, 0.717 | Accuracy, 0.851; AUC, 0.86; sensitivity, 0.740; specificity, 0.889 | Language: accuracy, 0.689; AUC, 0.66; sensitivity, 0.600; specificity, 0.778 |
Chen et al49 | <32 wk GA | DTI-MRI | 40.4 (0.6) | BSID-3 (neurocognitive) | <90 | 24b | CNN, LR, SVM, DNN | — | CNN: accuracy, 0.745; AUC, 0.75; sensitivity, 0.702; specificity, 0.787; R2 = 0.221; MAE, 16.2; SDAE, 9.5 | — |
—, not applicable; AOC, area over the regression error characteristic curve; CNN, convolutional neural network; CT, classification tree; DNN, deep neural network; DTI, diffusion tensor imaging; DWI, diffusion-weighted imaging; fcMRI, functional connectivity MRI; H-MRS, proton magnetic resonance spectroscopy; LR, logistic regression; MAE, mean absolute error; MSEL, Mullen Scales of Early Learning; NA, not available; NDI, neurodevelopmental impairment; NN, neural network; NPV, negative predictive value; NSMDA, Neuro-Sensory Motor Developmental Assessment; PMA, postmenstrual age; PPV, positive predictive value; SDAE, SD of absolute error; SVM, support vector machine; TDI, Toddler Development Index.
^a Data are mean (SD) weeks PMA unless otherwise indicated.
^b No mean/median available, only the intended age at outcome measurement.
^c Uncorrected age.
Study Characteristics
Thirteen studies (93%) included exclusively very preterm infants (GA <32 weeks) in their model. Models on neurocognitive outcome were investigated in 10 studies (71%), motor outcome in 9 (64%), and language outcome in 2 (14%). The aim of 1 study was to predict neurodevelopmental impairment, defined as adverse outcomes in ≥1 developmental domains (ie, motor, neurocognitive, hearing loss, blindness).37 None of the studies investigated behavioral or academic outcomes. Regarding the type of predictors, 12 studies (86%) investigated the predictive power of MRI-derived parameters, of which 6 (43%) combined MRI with clinical variables. Two studies (14%) investigated clinical variables only. The age at prediction varied from the day of birth to near-term age, with 3 studies (21%) investigating models at multiple prediction moments. Regarding neurodevelopmental outcome measures, the most commonly used instrument was the Bayley Scales of Infant Development (BSID), versions 1 to 3 (11 studies, 79%). The age at the predicted outcome varied between 12 and 29 months.
Model Types
In 13 studies (93%), classification models were developed in which a dichotomous outcome was predicted (normal versus abnormal outcome). The most common threshold for an abnormal outcome using the BSID was a score of <85, which equals −1 SD (7 studies, 70%). In 5 of the studies with a classification model, additional regression models were developed aimed at predicting a continuous outcome score. Only 1 study (7%) developed solely a regression model. The machine learning models used most often were logistic regression with exhaustive feature selection (8 studies, 57%), support vector machines (4 studies, 29%), and neural networks (7 studies, 50%). Exhaustive feature selection refers to the data-driven selection of predictors out of a large number of variables based on optimal model performance. Five studies (36%) compared different types of machine learning models. A brief explanation of the models used in the included studies is given in Table 4.
Description of Machine Learning Models
Machine Learning Model | Description
---|---
Logistic regression | Logistic regression analysis using a logistic function to predict a binary outcome. Often seen as a conventional statistical method because of its use in combination with hypothesis-driven selection of predictors. However, in this review, only studies were included that used a data-driven approach, eg, through exhaustive feature selection or using regularization to limit model complexity. |
Decision tree | Flow chart–like structured model in which each node represents a decision based on a number of categories of values of a predictor (eg, predictor value yes or no, predictor value ≤5 or >5). The tree is built by starting with the most informative node. Branches with nodes are created until a stop criterion is met. When reaching the bottom of the tree, the outcome is a predicted category (classification tree), a value (regression tree), or a regression model (model tree). Decision trees are appreciated for their simplicity and interpretability. |
Support vector machine | Initially designed for binary classification, the model aims to find the best boundary between the 2 classes through finding a linear separator that maximizes the distance between data points of one class and the other class (the so-called support vectors). These models have been extended with kernel function to allow for more complex separations between classes. An extension to work for regression problems (support vector regression) is also available. |
Neural network | Model structure is inspired by neural networks in the brain, consisting of multiple neurons (nodes) connected to each other. The nodes are typically arranged into layers in which the first layer consists of input nodes (variables), ≥1 inner layers that represent specific states determined by a combination of predictors, and an output layer (outcome). The nodes between layers are fully connected in the case of so-called feed-forward neural networks, making it possible to capture extensive variable interactions in the model. Connections between nodes are associated with a connection strength (weight) that is learned from the available data. |
Deep neural network | A deep neural network is an extensive form of a neural network. Deep neural networks are characterized by multiple inner layers, allowing for capture of more complex interactions between predictors while keeping the number of additional weights reasonably limited. There is no clear distinction made between neural networks and deep neural networks. |
Convolutional neural network | Subtype of a deep neural network in which the model takes advantage of the hierarchical pattern in the data by applying so-called convolutional filters on the data that extract the most useful parts of the data by using a more limited number of weights compared with fully connected layers. This process leads to the reduction of model overfitting, increasing generalizability of the prediction to unseen data. Convolutional neural networks are often used in image analysis. |
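Purely as an illustration of the model families described in Table 4, the sketch below fits a regularized logistic regression, a decision tree, a support vector machine, and a small neural network on synthetic data and reports cross-validated AUCs. None of the settings or data are taken from the included studies.

```python
# Hypothetical comparison of model families on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic regression (L1-regularized)": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "support vector machine": SVC(kernel="rbf", C=1.0),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

for name, estimator in models.items():
    # Standardization and the classifier are wrapped in one pipeline so that scaling
    # is refit on each training fold (no leakage from the test folds).
    auc = cross_val_score(make_pipeline(StandardScaler(), estimator), X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: mean AUC {auc.mean():.2f}")
```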
Model Performance
The performance of the best-performing model of each study, based on validated or cross-validated results, is presented in the last column of Table 3. For classification problems, AUC values on the test data ranged from 0.66 to 1.00 (mean, 0.81; SD, 0.09), and accuracy values ranged from 0.59 to 1.00 (mean, 0.80; SD, 0.12). For regression problems, R2 ranged from 4% to 91% (mean, 28%; SD, 27%). The performances of the models without data leakage (see quality assessment) ranged from 0.69 to 0.86, 0.62 to 0.85, and 22% to 91% for AUC, accuracy, and R2, respectively.
Quality Assessment
The extracted data used for the quality assessment of the models are displayed in Table 5.
Features Extracted for Quality Assessment
Feature / Reference No. | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Sample size, n | 218 | 1046 | 173 | 115 | 28 | 60 | 115 | 52 | 37 | 77 | 59 | 103 (m) 115 (c) | 33 | 80 |
Size of transfer cohort, n | — | — | — | — | 884 | — | — | — | 75 | — | — | — | 1425 | 1.2 million |
Clear inclusion/exclusion criteria, Y/N | Y | Y | Y | N | Y | Y | N | Y | Y | Y | Y | Y | Y | Y |
Representative sample, Y/N | Y | Y | N | Y | Y | N | N | Y | Y | Y | Y | Y | Y | Y |
Clear sample attrition, Y/N | Y | Y | Y | N | Y | Y | N | Y | N | N | Y | Y | Y | Y |
Predictors at start, n | 21 | 65a | 14 | 4005 | 4005 | 396 | 4005 | 396 | 3003 | Not givenb | 396 | 42 | 4089 | 8100 |
Predictors at final model, n | 21 | 3 | 10 (m) 12 (c) | 4005 | 4005 | 3 | 4005 | 3 | 3003 | Not givenb | 3 | 7 | 4089 | 8100 |
Predictor selection independent from test set, Y/N | — | Y | N | — | — | N | — | N | — | — | N | N | — | — |
Hyperparameters tuned independently from test set, n | Not given | Not given | 0 | 0 | 21 | 0 | 0 | 0 | 2016 | 557 | 0 | 0 | >1000a | >35a |
Hyperparameters tuned based on test set, n | 0 | 0 | 0 | 10 | 30 | 0 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Data leakage, Y/N | N | N | Y | Y | Y | Y | Y | Y | N | N | Y | Y | N | N |
Internal validation | 2:1 | 3:1 with 10-fold CV within the training set | Leave 5% out in 500 rounds CV | 3-fold CV | 10-fold CV | LOOCV | Leave 2 out in 1000 rounds CV | LOOCV | 10-fold CV | 9:1 with 10-fold CV within the 90% training set | LOOCV | 4-fold CV | 5-fold CV in 50 iterations | 5-fold CV and 7:3 training/validation within training set |
External validation, Y/N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N |
Outcome type, classification/regression | Classification | Classification | Classification | Regression | Classification | Both | Both | Both | Both | Classification | Classification | Classification | Classification | Both |
Base rate, % abnormal | 22 (m) 14 (c) | 41.6 | 25 (m) 24 (c) | — | 50 | 17 (m) 14 (c) | 18 (m) 10 (c) | 38 | 59 | 38c | 33.7 | 11.8 (m) 5.6 (c) | 15 (m) 15 (c) 21 (l) | |
Type of performance metrics reported | AUC, sens, spec, PPV, NPV | Accuracy, sens, spec, PPV, NPV | AUC | Pearson r, MAE, SDAE | Accuracy, AUC, sens, spec | AUC, sens, spec, R2d | Accuracy, Pearson r, AOC | AUC, sens, spec, R2 | Accuracy, Pearson r, MAE, SDAEd | Accuracy, AUC, sens, spec, F1-score | AUC, sens, spec | Sens, spec, PPV, NPVd | Accuracy, AUC, sens, spec, LR, FPR | Accuracy, AUC, sens, spec, Pearson r, MAE, SDAE |
Interpretability of the model, Y/L/N | N | Y | Y | Y | Y | Y | Y | L | Y | Y | Y | N | Y | Y |
Comparison with literature, Y/L/N | N | N | N | L | N | Y | Y | N | L | Y | N | N | Y | N |
Sharing code, model, or data, C/M/D/N | M | M | N | M | N | N | N | D | M | N | N | M | M | N |
Decision support model, Y/N | N | N | N | N | N | N | N | N | N | N | N | N | N | N |
—, not applicable; AOC, area over the regression error characteristic curve; C, code; (c), cognitive outcome; CV, cross-validation; D, data; FPR, false positive rate; (l), language outcome; LR, likelihood ratio; M, model; (m), motor outcome; MAE, mean absolute error; N, no; NPV, negative predictive value; PPV, positive predictive value; sens, sensitivity; spec, specificity; SDAE, SD of absolute error; Y, yes.
^a At least, probably more, but not able to retrieve from the article.
^b Scans were split into patches of size 26 × 26 × 26 voxels, with 50% overlap. Average number of patches per scan was 189. Number of patches/voxels at start was equal to number in final model.
^c Data set was balanced using data augmentation techniques.
^d Additional performance metrics could be calculated from data presented in the article.
Sample Size
Sample size ranged from 33 to 1046 infants (mean, 158; SD, 261). Only 1 study (7%) had a sample size of several hundred participants or more. Eight studies (57%) had a sample size of <100 infants, of which 4 applied the concept of transfer learning.27
Participants
Twelve studies (86%) reported clear inclusion and exclusion criteria. Three studies (21%) were considered to have a sample that was not representative of the population, either because the selection of the study population was not in line with the aim of prediction or because of unclear inclusion and exclusion criteria. Authors of 4 articles (29%) did not report on sample attrition.
Data Leakage
Six studies (43%) applied (exhaustive) predictor selection; however, in only 1 of these studies was the predictor selection completely independent from the performance results through the application of nested validation.37 In 9 studies (64%), models were optimized through hyperparameter tuning. In 3 of them,39,40,42 this optimization was based on final validated or cross-validated results, thereby causing data leakage (ie, training the model on test results). In total, 8 studies (57%) suffered from data leakage, 5 through predictor selection and 3 through hyperparameter tuning. These 8 studies were ranked on the severity of data leakage, based on the number of combinations tested, and then evenly divided into limited and strong data leakage.
Validation
All studies used some form of validation. Nine studies (64%) applied some form of k-fold cross-validation, of which 3 (33%) applied leave-one-out cross-validation (LOOCV), in which k = n. The holdout method was used in 1 study (7%), forming 2 exclusive subsets (training versus testing). Four studies (29%) applied nested validation, combining k-fold cross-validation with the holdout method (3 studies) or with external validation (1 study), thereby preventing any form of data leakage during hyperparameter tuning.
Performance Metrics
In 6 (43%) of the 13 studies in which classification models were developed (normal versus abnormal outcome), outcomes were clearly skewed, defined as <25% or >75% abnormal outcomes. Three of these studies (50%) neither provided base rate–sensitive metrics nor applied data set–balancing techniques.38,41,48
In the 6 studies that developed continuous regression models, only 2 provided the explained variance (R2) calculated directly from the observed and predicted values.41,43 The other 4 studies reported the Pearson r, the correlation between observed and predicted values, which by itself is insensitive to the absolute difference between observed and predicted values. However, these 4 studies did provide other metrics that directly compared the observed and predicted values (ie, mean absolute error, area over the regression error curve).
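The distinction matters in practice: the minimal sketch below, using synthetic scores rather than data from any included study, shows that predictions can correlate strongly with observed values while the prediction R2, computed directly from observed and predicted values, is poor or even negative.

```python
# Illustrative contrast between prediction R^2 and the squared Pearson correlation.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
observed = rng.normal(100, 15, size=60)                      # e.g., composite outcome scores
predicted = 0.5 * observed + 70 + rng.normal(0, 5, size=60)  # correlated but systematically biased

print(f"Squared Pearson r: {pearsonr(observed, predicted)[0] ** 2:.2f}")  # ignores the bias
print(f"Prediction R^2: {r2_score(observed, predicted):.2f}")             # penalizes the bias
```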
Interpretability
Studies were judged on the interpretability of their constructed models, as well as on the interpretation of the model in relation to previously published models. Eleven studies (79%) made sufficient attempts to provide insight into the model (ie, the relation between individual predictors and outcome), most often through visualization (in MRI studies) using Circos plots. Only 1 study mentioned a specific method for analyzing the influence of specific predictors in the model (gradient-weighted class activation mapping).49
Comparison with previous (conventional or machine learning) statistical models was absent or limited in most studies (71%). In only 4 studies (29%), extensive comparisons were made, making it possible to judge model performance in the context of previous literature.
Open Science
Six studies (43%) provided sufficient information on the model properties to allow reproduction of the model. Nevertheless, none of the researchers made attempts to share code or data, except for 1 study in which the researchers stated that the data were available upon request. Likewise, none of the articles reported attempts to provide a tool enabling implementation of the developed models into clinical practice.
Overall Quality Assessment
The overall quality assessment is displayed in Table 6. None of the included studies met all criteria; however, some came close. Five studies (36%) scored as appropriate on 4 of 6 criteria. The categories of data leakage, interpretability, and open science in particular left room for improvement, scoring as appropriate in only 43%, 29%, and 0% of the studies, respectively.
Quality Assessment
Source . | Participants . | Data Leakage . | Validation Procedure . | Performance Metrics . | Interpretability . | Open Science . |
---|---|---|---|---|---|---|
Ambalavanan et al36 | X | X | X | X | XXX | XX |
Ambalavanan et al37 | X | X | X | X | XX | XX |
Moeskops et al38 | XX | XXX | X | XX | XX | XXX |
Kawahara et al39 | XX | XX | X | X | XX | XX |
He et al40 | X | XX | X | X | XX | XXX |
Schadl et al41 | XX | XXX | XX | XX | X | XXX |
Brown et al42 | XXX | XX | X | XX | X | XXX |
Cahill-Rowley et al43 | X | XXX | XX | X | XX | XX |
Girault et al44 | XX | X | X | XX | XX | XX |
Saha et al45 | XX | X | X | X | X | XXX |
Vassar et al46 | X | XXX | XX | X | XX | XXX |
Janjic et al47 | X | XX | X | XX | XXX | XX |
He et al48 | X | X | X | XX | X | XX |
Chen et al49 | X | X | X | X | XX | XXX |
X, appropriate; XX, minor deviation; XXX, major deviation.
Discussion
In this scoping review, we aimed to provide an overview of the current applications of machine learning models in the prediction of neurodevelopmental outcomes in preterm infants, to assess the quality of the developed models, and to provide guidance for the future application of machine learning models to predict neurodevelopmental outcomes of preterm infants. Most of the 14 included studies aimed to predict outcomes at the age of 1 to 2 years, mainly assessed with the BSID, predicting either normal versus abnormal development or an actual individual test score. MRI features were often used as predictors, sometimes combined with clinical data. The machine learning models used most often were linear regression and neural networks. None of the studies completely met our quality recommendations; however, those with a low risk of inflated results showed promising performance, with AUCs up to 0.86, accuracy up to 0.85, and R2 up to 91%.
Most of the models were directed at very or extremely preterm infants and were based on predictors gathered in the late neonatal period, mostly around term-equivalent age. This limits the use of the models to prediction after the critically ill period and, thereby, their contribution to clinical decision-making in the neonatal phase. However, the developed models are still helpful to provide earlier insight into the expected outcome and might thereby provide future opportunities for early intervention programs after discharge from the hospital. Regrettably, half of the included studies used only MRI-based predictors, thereby leaving aside the predictive potential of many other factors known to influence neurodevelopmental outcome. The relevance of models that include MRI predictors is also bound to the routine clinical use of MRI in preterm infants; currently, MRI is not part of relevant national and international guidelines. MRI features were used much more commonly in the included studies than in previous conventional prediction models.9,11 This finding might be explained by the suitability of machine learning algorithms for analyzing comprehensive MRI data without the need for human interpretation, which carries a risk of bias and information loss. This suitability not only illustrates one of the strengths of machine learning as a predictive tool, namely the ability to analyze large and complex data sets such as MRI, but also points to its potential for longitudinal data, such as time series of vital parameter recordings or laboratory assessments. In terms of outcome measures, the common use of the BSID in the studies limits the predictive power of the models for long-term outcomes because the associations between BSID scores and later neurodevelopmental outcomes are limited, with the percentage of explained variance reaching only 37%.50
Our evaluation framework was used to judge the included studies. None of the studies completely met the quality recommendations. Most studies seemed predominantly invested in the statistical part of the model rather than in the methodology of prediction model development as a whole (eg, selection of participants). However, to ensure generalizability of prediction models to clinical practice, close attention should also be paid to sample formation, so that the developed model is applicable to the intended population. Data leakage was present in more than half of the studies. This problem could be overcome by applying some form of nested validation, in which features and hyperparameters are selected on data completely independent from the test data (a minimal sketch is given below). Three models applied LOOCV instead of the preferred k-fold cross-validation. Although LOOCV leaves more data for training, it might lead to negative R2 values because the single held-out observation is not representative of the whole sample, causing anticorrelation between the training and test data.18 Three studies that developed classification models did not address their skewed classes. The predictive performance of these models is hard to interpret because no information was given about the predictive value of an individual test result.18 In almost all studies, improvements could be made in both the interpretation of the model and the sharing of the code and data used, which would facilitate external validation and quality assessment. Taken together, more than half of the included studies were likely to report inflated performances due to methodologic limitations, most often related to data leakage. These findings likely also hold for machine learning models in general (pediatric) medicine. Therefore, readers should always critically assess the quality of a study before assuming its results to be generalizable.
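As an illustration of such nested validation, the following minimal sketch (Python with scikit-learn, run on purely synthetic data; the cohort size, predictor count, and parameter grid are hypothetical and not taken from any of the reviewed studies) tunes feature selection and hyperparameters inside the training folds only, so that the outer test folds remain untouched, while stratified folds and class weighting additionally address skewed classes.

```python
# Hypothetical sketch of nested cross-validation with scikit-learn on synthetic
# data; none of these choices come from the reviewed studies.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small cohort: 100 "infants", 50 candidate predictors,
# skewed outcome (roughly 1 in 4 with an adverse classification).
X, y = make_classification(n_samples=100, n_features=50, n_informative=8,
                           weights=[0.75, 0.25], random_state=0)

# Scaling, feature selection, and the classifier sit in one pipeline, so each
# step is refit on the training folds only, preventing data leakage.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.01, 0.1, 1.0]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # evaluation

tuned_model = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")

# The outer loop scores the tuned model on folds it has never seen, giving an
# approximately unbiased AUC estimate despite the small sample.
nested_auc = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested cross-validated AUC: {nested_auc.mean():.2f} +/- {nested_auc.std():.2f}")
```

The key design choice is that nothing outside the inner loop ever sees the outer test folds; the outer estimate therefore reflects how the full modeling procedure, including tuning, would perform on new infants.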
Although we did not include sample size in the quality assessment, as explained earlier, most of the studies had relatively small samples, with 8 having a sample size of <100 infants. The combination of small samples, large numbers of predictors, extensive hyperparameter tuning, and nonlinear models is prone to overfitting; it can inflate predictive performance when data leakage is present, or yield disappointing performance when data leakage is absent, because the model does not generalize to the test data (the sketch below illustrates this inflation). The small samples in the included studies are rather disappointing, considering the high prevalence of preterm birth.1 This prevalence, in fact, makes the preterm infant population well suited for machine learning approaches, and prospective standardized data collection would yield valuable data offering promising opportunities for improved prediction models. Although some initiatives for gathering big data on large patient groups in neonatal care have already been undertaken, these have not yet been fruitful for long-term outcome prediction.51,52 The above also applies to general (pediatric) care, in which advanced standardized data collection deserves more attention as well.
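To make the inflation concrete, the following hypothetical sketch (Python with scikit-learn, on purely synthetic noise data that do not come from any reviewed study) compares a leaky analysis, in which features are selected on the full data set before cross-validation, with a pipeline that performs the selection inside each training fold. With few samples and many predictors, the leaky estimate typically appears well above chance even though the outcome is unrelated to every predictor.

```python
# Hypothetical sketch: with few samples and many predictors, selecting features
# on the full data set before cross-validation inflates the apparent AUC, even
# when the outcome is pure noise. Synthetic data, for illustration only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))       # 60 "infants", 500 pure-noise predictors
y = rng.integers(0, 2, size=60)      # outcome unrelated to any predictor

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky analysis: the 10 "best" features are chosen using all labels, and only
# the classifier is cross-validated afterwards.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=cv, scoring="roc_auc").mean()

# Leakage-free analysis: feature selection is refit inside each training fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LogisticRegression(max_iter=1000))])
auc_clean = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

# The leaky AUC typically lands well above 0.5; the pipeline AUC stays near chance.
print(f"AUC with leakage: {auc_leaky:.2f}   AUC without leakage: {auc_clean:.2f}")
```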
Although the quality assessment revealed a risk of inflated results in most of the studies, those with a low risk (ie, without signs of data leakage) reported promising model performances. These reported performances compare favorably with those of prediction models developed with conventional statistics, such as the Nursery Neurobiologic Risk Score, the Clinical Risk Index for Babies, and the Score for Neonatal Acute Physiology, or with manually extracted MRI features. The conventional statistical models reported AUCs from 0.59 to 0.82 and an R2 of ∼20%, whereas manually extracted MRI features reached sensitivity and specificity of only 72% and 62% or lower, respectively; moreover, these performances were not externally validated or cross-validated and are therefore likely overfitted.9,11,53–56 Because the included machine learning models were trained on relatively small data sets, larger data sets are likely to increase their predictive performance even further. Thus, the results of the studies reviewed here are promising for further machine learning model development, provided that attention is paid to nonbiased patient groups, correct use of machine learning, larger sample sizes, and expansion of predictors to clinical and environmental features, including high-density data such as vital parameter time series. Such models could provide outcome predictions earlier in the neonatal period that can aid clinical decision-making in the critical phase.
This scoping review has some limitations. We only investigated 1 literature source (PubMed). However, PubMed is a comprehensive search engine covering many medical journals, and snowballing was applied to identify any missing studies. Furthermore, the evaluation framework leaves room for some degree of subjectivity. In addition, we were not able to include sample size in the evaluation framework, as explained earlier. Nevertheless, the presented evaluation framework is the highest-quality guideline currently available on this topic. Strengths of this study include the careful assessment of the available machine learning models for the prediction of neurodevelopmental outcome, enabling accurate judgment of their performance. In this study, we provide readers with a framework for quality assessment of machine learning models and present recommendations for future neonatal prediction models on neurodevelopmental outcome.
Machine learning is a relatively new field in clinical science, with considerable differences from conventional statistical approaches. Therefore, existing quality assessment tools are insufficient for proper evaluation of machine learning models. Although valuable attempts have been made in reporting guidelines on proper machine learning use and in extensions of currently available risk-of-bias tools, there are currently no guidelines for judging the technical aspects of machine learning application. As a result, the risk of model overfitting or underfitting is overlooked in existing quality guidelines.23,57 In this study, we have made a first attempt to provide guidelines to assist in the (technical) quality assessment of machine learning applications. Future studies using machine learning should pay more attention to the correct application of the technique. Furthermore, studies may contribute to the prediction of neurodevelopmental outcomes of preterm infants by applying machine learning to the wide variety of data available in the antenatal and neonatal period, including continuous vital parameters, environmental factors, and imaging data. In addition, models should focus on predicting earlier in the neonatal period (using predictors available in the acute phase), as well as on predicting neurodevelopmental outcomes assessed beyond age 2 years.
Conclusions
This scoping review not only reveals promising results in the prediction of neurodevelopmental outcomes in preterm infants but also highlights common pitfalls in the application of machine learning techniques. The provided evaluation framework may contribute to improving the quality of future machine learning models in general.
Drs van Boven conceptualized and designed the study, reviewed and selected the literature, performed the data collection, developed the quality guidelines, appraised the quality of the studies, drafted the initial manuscript, and reviewed and revised the manuscript; Drs Henke reviewed the quality guidelines, appraised the quality of the studies, and reviewed and revised the manuscript; Dr Leemhuis and Prof van Kaam conceptualized and designed the study, were involved in the interpretation and analysis of data, reviewed the quality guidelines, and reviewed and revised the manuscript; Prof Hoogendoorn conceptualized and designed the study, was involved in the interpretation and analysis of data, developed the quality guidelines, and reviewed and revised the manuscript; Dr Königs and Prof Oosterlaan conceptualized and designed the study, supervised the literature selection, data collection, and quality appraisal of the study, developed the quality guidelines, and reviewed and revised the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
FUNDING: No external funding.
CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no potential conflicts of interest to disclose.