Retinopathy of prematurity (ROP) is a leading cause of childhood blindness. Screening and treatment reduce this risk but require multiple examinations of infants, most of whom will not develop severe disease. Previous work has suggested that artificial intelligence may be able to detect incident severe disease (treatment-requiring retinopathy of prematurity [TR-ROP]) before clinical diagnosis. We aimed to build a risk model that combined artificial intelligence with clinical demographics to reduce the number of examinations without missing cases of TR-ROP.
Infants undergoing routine ROP screening examinations (1579 total eyes, 190 with TR-ROP) were recruited from 8 North American study centers. A vascular severity score (VSS) was derived from retinal fundus images obtained at 32 to 33 weeks’ postmenstrual age. Seven ElasticNet logistic regression models were trained on all combinations of birth weight, gestational age, and VSS. The area under the precision-recall curve was used to identify the highest-performing model.
The gestational age + VSS model had the highest performance (mean ± SD area under the precision-recall curve: 0.35 ± 0.11). On 2 different test data sets (n = 444 and n = 132 infants), sensitivity was 100% (positive predictive value: 28.1% and 22.6%, respectively) and specificity was 48.9% and 80.8%, respectively (negative predictive value: 100.0% in both).
Using a single examination, this model identified all infants who developed TR-ROP, on average nearly 1 month before diagnosis, with moderate to high specificity. This approach could lead to earlier identification of incident severe ROP, reducing late diagnosis and treatment while simultaneously reducing the number of ROP examinations and unnecessary physiologic stress for low-risk infants.
Retinopathy of prematurity (ROP) screenings are an essential service in NICUs; however, current risk models subject infants to multiple physiologically stressful examinations. Previous work has revealed that an artificial intelligence–derived vascular severity score may prove useful for identifying severe disease.
We developed an image-based risk model that, using a single retinal photograph, accurately detects severe ROP nearly 1 month before diagnosis. Implementation of this screening approach could result in a paradigm shift toward neonatology-led ROP screenings.
Retinopathy of prematurity (ROP) is a leading cause of childhood blindness, although visual impairment can be prevented with appropriate screening and treatment.1–4 Among prematurely born infants, the epidemiology of ROP is directly related to 2 primary factors: neonatal mortality and exposure to supraphysiologic oxygen for resuscitation.1,5 Primary prevention of ROP through careful oxygen titration effectively reduces the incidence of treatment-requiring retinopathy of prematurity (TR-ROP); however, there is a delicate balance: a lower fraction of inspired oxygen reduces the probability of developing ROP but increases the probability of mortality, and vice versa.5 To err on the side of caution, a higher fraction of inspired oxygen is supplied, and NICUs are responsible for ensuring that secondary prevention, through timely ROP screening, occurs for all at-risk neonates.1,4,5 The risk of blindness can be reduced, but not eliminated, with optimal primary and secondary prevention; however, because adverse outcomes are at times preventable, ROP is a leading cause of medicolegal liability in ophthalmology.6,7
ROP screenings help identify eyes progressing to TR-ROP so that timely treatment can be provided. However, screening guidelines must balance the risk of missing cases of TR-ROP against the discomfort and potentially life-threatening events caused by the screenings themselves.3–5 In the United States, screenings are recommended on the basis of demographic criteria (gestational age [GA] <31 weeks or birth weight [BW] <1501 g).4 Examinations begin at either 4 weeks’ chronological age or 31 weeks’ postmenstrual age (PMA) (whichever is later) and are repeated every 1 to 2 weeks until the retina is fully developed or until ROP requires treatment.2,4 On average, infants who meet screening criteria receive 3 to 8 examinations, yet <10% develop TR-ROP. Thus, current screening guidelines, although highly sensitive, are not specific and subject low-risk infants to examinations that would be unnecessary if high-risk infants could be better identified.1–3,8,9 Researchers have developed numerous risk models that attempt to add specificity by incorporating comorbidities, but many of these comorbidities are rare or are confounded by BW and GA.10,11 The best-performing models have shown promise but, thus far, have not generalized well to larger, more diverse populations.10,12,13 Ultimately, these models have not gained traction because they have either failed to ensure 100% sensitivity or have been clinically impractical to implement.10–13
Herein, we explore whether the specificity of risk models can be improved by including biometric information. Deep learning (DL) has shown promise for objective diagnosis of ROP and may be useful for screening.14–19 Previous work using the Imaging and Informatics in Retinopathy of Prematurity Deep Learning (i-ROP DL) algorithm has suggested that a DL-derived vascular severity score (VSS) may identify infants progressing to TR-ROP weeks before treatment.16,17 To test whether the VSS adds predictive value to demographic risk factors, we incorporated the output of the i-ROP DL algorithm in a predictive risk model for incident TR-ROP. We hypothesized that adding biometric information relevant to ROP would add specificity to risk models based only on demographic variables without sacrificing TR-ROP detection sensitivity.
Methods
Imaging and Informatics in Retinopathy of Prematurity Study Details
This study was approved by the institutional review boards at the coordinating center (Oregon Health & Science University) and at each of the 7 study centers (Columbia University, University of Illinois Chicago, William Beaumont Hospital, Children’s Hospital Los Angeles, Cedars-Sinai Medical Center, University of Miami, and Weill Cornell Medical Center) and was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from parents of all enrolled infants.
As part of the multicenter Imaging and Informatics in Retinopathy of Prematurity (i-ROP) cohort study, 842 unique patients (BW <1501 g or GA <31 weeks) were screened multiple times for ROP between January 2012 and July 2020. During each examination, retinal fundus images were captured with a RetCam (Natus, Pleasanton, CA). In addition to bedside clinical examinations, patients received image-based ROP diagnoses, which were determined by a consensus of 3 ROP experts using the full International Classification of ROP criteria.4 Images were included only if the experts agreed by consensus that image quality was acceptable for diagnosis; 33 images did not meet this criterion. Clinical comorbidities and demographics were recorded for all patients’ examinations (Table 1, Supplemental Table 4). Where applicable, statistical significance was determined by using Welch’s 2-sample t test and was defined as P ≤ .05.
i-ROP and Salem Data Set Demographics and Clinical Outcomes
Study Patient Characteristics | Not Treated | Treated | P |
---|---|---|---|
i-ROP training data set | | | |
BW, g, mean ± SD | 944.5 ± 248.3 | 673.0 ± 206.3 | <.001 |
GA, wk, mean ± SD | 26.7 ± 1.7 | 24.7 ± 1.4 | <.001 |
VSS, mean ± SD | 1.4 ± 0.9 | 2.9 ± 1.9 | <.001 |
Total patients, n (%) | 345 (91.8) | 31 (8.2) | — |
Total eyes, n (%) | 660 (91.9) | 58 (8.1) | — |
i-ROP test data set | | | |
BW, g, mean ± SD | 930.6 ± 275.8 | 632.6 ± 136.1 | <.001 |
GA, wk, mean ± SD | 26.9 ± 2.1 | 24.3 ± 1.1 | <.001 |
VSS, mean ± SD | 1.8 ± 1.4 | 3.9 ± 2.6 | <.001 |
Total patients, n (%) | 377 (84.9) | 67 (15.1) | — |
Total eyes, n (%) | 729 (84.7) | 132 (15.3) | — |
Salem data set | | | |
BW, g, mean ± SD | 1265.4 ± 281.5 | 823.0 ± 200.9 | .052 |
GA, wk, mean ± SD | 29.2 ± 2.2 | 25.0 ± 0.7 | <.001 |
VSS, mean ± SD | 1.6 ± 0.5 | 2.3 ± 0.9 | .029 |
Total patients, n (%) | 125 (94.7) | 7 (5.3) | — |
Total eyes, n (%) | 248 (94.7) | 14 (5.3) | — |
—, not applicable.
VSS and Data Set Preparation
Each eye examination was represented by a single RetCam image centered on the macula, which approximately corresponds to the field of view of zone I. Images were analyzed by i-ROP DL, an algorithm developed to detect plus disease (a manifestation of severe ROP).14 For each image, i-ROP DL output softmax probabilities for normal, preplus, and plus disease vasculature (ie, an estimated probability, P(·), for each class; each value ranges from 0.0 to 1.0, and the 3 values sum to 1.0). From these values, a VSS ranging from 1.0 to 9.0 was calculated as VSS = P(normal) + 5 × P(preplus) + 9 × P(plus).
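The VSS formula can be sketched in Python as follows. This is a minimal illustration assuming the 3 class probabilities are ordered (normal, preplus, plus); the `softmax` helper and the raw scores passed to it are hypothetical stand-ins, not the actual i-ROP DL outputs.

```python
import numpy as np

# Weights from the VSS definition: normal = 1, preplus = 5, plus = 9.
VSS_WEIGHTS = np.array([1.0, 5.0, 9.0])


def softmax(logits):
    """Convert raw class scores (hypothetical model outputs) to probabilities."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def vascular_severity_score(probs):
    """VSS = P(normal) + 5*P(preplus) + 9*P(plus); ranges from 1.0 to 9.0.

    `probs` is ordered (normal, preplus, plus) and may be a single vector
    or a 2D array with one row per image.
    """
    p = np.asarray(probs, dtype=float)
    if not np.allclose(p.sum(axis=-1), 1.0):
        raise ValueError("class probabilities must sum to 1.0")
    return p @ VSS_WEIGHTS


# Boundary cases: all mass on "normal" gives 1.0; all mass on "plus" gives 9.0.
print(vascular_severity_score([1.0, 0.0, 0.0]))  # 1.0
print(vascular_severity_score([0.0, 0.0, 1.0]))  # 9.0
print(vascular_severity_score(softmax([2.0, 0.5, -1.0])))
```

Because the weights increase monotonically from normal to plus, any shift of probability mass toward preplus or plus raises the score, which is what makes it usable as a continuous severity measure.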
The VSS has been shown to correlate independently with more posterior disease (zone), higher stage, and greater extent of stage 3 ROP, in addition to plus disease (ie, all the components of the International Classification of ROP criteria).15–19 On the basis of previous work, the 32 to 33 weeks’ PMA imaging window was identified as potentially predictive of TR-ROP.16,17 Thus, the first eye examination in this window was used for each patient. Because the goal was to develop a predictive (rather than diagnostic) model, infants who were diagnosed with TR-ROP within this window were excluded from the training data set (specifically, infants who developed TR-ROP within 7 days of their first examination in the 32 to 33 weeks’ PMA window). The held-out test data set (a subset of examinations from the i-ROP data set that was used only for model evaluation) contained all infants eligible for ROP screening, regardless of whether or when they developed TR-ROP. No patient appeared in both the training (n = 376 patients) and test (n = 444 patients) data sets. The training data set contained 58 eyes that eventually developed TR-ROP and 660 eyes that did not.
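The examination-selection and exclusion rules above can be sketched with pandas as follows. This is an illustration under stated assumptions: the column names (`eye_id`, `exam_date`, `pma_weeks`, `tr_rop_date`) are hypothetical, the 32 to 33 weeks’ PMA window is interpreted as 32.0 ≤ PMA < 34.0, and the 7-day exclusion is applied per eye (the text describes it per infant, which would require an additional grouping by patient).

```python
import pandas as pd


def first_exam_in_window(exams, lo=32.0, hi=34.0):
    """Keep the earliest examination per eye with lo <= PMA (weeks) < hi.

    Assumed columns: eye_id, exam_date (datetime), pma_weeks (float),
    tr_rop_date (datetime, NaT if the eye never developed TR-ROP).
    """
    in_window = exams[(exams["pma_weeks"] >= lo) & (exams["pma_weeks"] < hi)]
    return (
        in_window.sort_values(["eye_id", "exam_date"])
        .drop_duplicates("eye_id", keep="first")
    )


def training_eligible(first_exams, horizon_days=7):
    """Drop eyes diagnosed with TR-ROP within `horizon_days` of the index exam,
    so the training set contains only predictions made ahead of diagnosis."""
    cutoff = first_exams["exam_date"] + pd.Timedelta(days=horizon_days)
    diagnosed_early = first_exams["tr_rop_date"].notna() & (
        first_exams["tr_rop_date"] <= cutoff
    )
    return first_exams[~diagnosed_early]
```

Only the training data would pass through `training_eligible`; per the text, the held-out test data keep every screening-eligible infant, including those diagnosed shortly after the index examination.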
Risk Model Development
BW, GA, and VSS were evaluated via recursive feature elimination by using multiple ElasticNet models trained with scikit-learn in Python.20 ElasticNet logistic regression combines L1 and L2 regularization.21 L1 regularization is useful for feature selection, and L2 regularization is useful when collinear or codependent features are included in a model; both help to improve model generalizability. The ElasticNet mixing parameter (L1 ratio) was tuned via fivefold cross-validation over 11 evenly spaced values from 0.0 to 1.0, where 1.0 corresponds to pure L1 regularization and 0.0 to pure L2 regularization. Because of the class imbalance (ie, few eyes eventually developed TR-ROP relative to those that did not), the area under the precision-recall curve (AUPR) rather than the area under the receiver operating characteristic curve (AUROC) was the primary measure of model performance. Under class imbalance, the AUROC may be overly optimistic: a random classifier theoretically has an AUROC of 0.5 but an AUPR equal only to the prevalence of positive cases (the number of positive cases divided by the total number of cases).
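A minimal scikit-learn sketch of this search follows. Rather than eliminating features recursively, it enumerates all 7 non-empty subsets of BW, GA, and VSS (the combinations reported in the cross-validation results) and tunes the L1 ratio over 11 values with fivefold stratified cross-validation. Feature standardization, the default regularization strength (`C=1.0`), the `saga` solver, and the shuffling seed are assumptions not stated in the text, and scikit-learn’s `average_precision` scorer is a step-wise AUPR estimate that may differ slightly from other AUPR implementations.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ("BW", "GA", "VSS")


def feature_subsets(features=FEATURES):
    """Yield every non-empty subset of the candidate predictors (7 for 3 features)."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def fit_elasticnet(X, y, seed=0):
    """Tune the L1 ratio over 11 values (0.0 = pure L2, 1.0 = pure L1) by
    fivefold CV, scoring each candidate by average precision (AUPR estimate)."""
    pipe = make_pipeline(
        StandardScaler(),  # assumption: features are standardized before fitting
        LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
    )
    grid = {"logisticregression__l1_ratio": np.linspace(0.0, 1.0, 11)}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return GridSearchCV(pipe, grid, scoring="average_precision", cv=cv).fit(X, y)


def compare_subsets(columns, y):
    """columns: dict mapping a feature name to a 1D array (one value per eye).

    Returns {subset: (mean AUPR, SD of AUPR across folds, best L1 ratio)}.
    """
    results = {}
    for subset in feature_subsets():
        X = np.column_stack([columns[name] for name in subset])
        search = fit_elasticnet(X, y)
        i = search.best_index_
        results[subset] = (
            search.best_score_,
            search.cv_results_["std_test_score"][i],
            search.best_params_["logisticregression__l1_ratio"],
        )
    return results
```

The subset with the highest mean AUPR would then be carried forward; stratified folds keep the small number of TR-ROP eyes spread across all 5 folds.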
Operating Point Selection
The performance of the model with the highest AUPR was assessed via the Fβ score by using fivefold cross-validation across 101 evenly spaced operating points from 0.00 to 1.00. The F1 score (β = 1) weights false-negatives and false-positives equally; in general, the Fβ score weights false-negatives β² times as heavily as false-positives, so increasing β (eg, F2, F3) increasingly prioritizes minimizing false-negatives over minimizing false-positives. The F2 score is commonly used to slightly prioritize the minimization of false-negatives. To minimize false-negatives more strongly, β was set to 4. The operating point that maximized the F4 score was averaged across folds, and this mean minus 1 SD was selected and used to evaluate both test data sets.
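The operating-point rule can be sketched as follows, assuming each cross-validation fold supplies the true labels and predicted probabilities for its validation eyes. The use of the sample SD (`ddof=1`) and breaking ties toward the lowest threshold (which favors sensitivity) are assumptions.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# 101 evenly spaced candidate operating points from 0.00 to 1.00.
THRESHOLDS = np.linspace(0.0, 1.0, 101)


def best_threshold(y_true, y_prob, beta=4.0, thresholds=THRESHOLDS):
    """Return (threshold, F-beta) maximizing F-beta on one validation fold."""
    y_prob = np.asarray(y_prob, dtype=float)
    scores = [
        fbeta_score(y_true, (y_prob >= t).astype(int), beta=beta, zero_division=0)
        for t in thresholds
    ]
    i = int(np.argmax(scores))  # argmax returns the first, ie, lowest, tied threshold
    return float(thresholds[i]), float(scores[i])


def select_operating_point(folds, beta=4.0):
    """Mean of the per-fold best thresholds minus 1 SD (sample SD assumed).

    folds: iterable of (y_true, y_prob) pairs, one per validation fold.
    """
    best = np.array([best_threshold(y, p, beta)[0] for y, p in folds])
    return float(best.mean() - best.std(ddof=1))
```

With the reported per-fold summary (0.33 ± 0.08), this rule gives 0.33 − 0.08 = 0.25, the operating point applied to the test data sets. Subtracting 1 SD deliberately shifts the threshold toward flagging more infants, trading specificity for sensitivity.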
Model Evaluation
This model was then evaluated on the held-out i-ROP test data set and on an independent data set collected between September 2015 and June 2018 from 132 unique patients born at a hospital in Salem, Oregon (Table 1). Data collection and exclusion criteria were similar to those for the i-ROP data set. Retrospective evaluation of these data was performed under a waiver of consent from the Oregon Health & Science University Institutional Review Board. Because patients (not individual eyes) are referred for treatment, test data set evaluations were conducted at the patient level (ie, if 1 or both eyes were predicted to develop TR-ROP, the patient was labeled as such). The i-ROP test data set contained 74 patients (132 eyes) who eventually developed TR-ROP and 370 patients (729 eyes) who did not. The Salem data set contained 7 patients (14 eyes) who developed TR-ROP and 125 patients (248 eyes) who did not. The main outcome measures were sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and their corresponding 95% confidence intervals (CIs), each calculated by using the conservative Clopper-Pearson method, as suggested by Ying et al.22
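Patient-level labeling can be sketched as below: a patient counts as predicted positive if either eye is predicted positive and, by assumption (consistent with treatment in 1 or both eyes), as truly positive if either eye developed TR-ROP. The input row format is hypothetical.

```python
from collections import defaultdict


def to_patient_level(eye_rows):
    """Collapse eye-level rows to one (predicted, truth) pair per patient.

    eye_rows: iterable of (patient_id, predicted_tr_rop, developed_tr_rop),
    one row per eye. A patient is positive if either eye is positive.
    """
    predicted = defaultdict(bool)
    truth = defaultdict(bool)
    for patient_id, pred, true in eye_rows:
        predicted[patient_id] |= bool(pred)
        truth[patient_id] |= bool(true)
    return {pid: (predicted[pid], truth[pid]) for pid in predicted}


def confusion_counts(patient_results):
    """Return (tp, fp, tn, fn) from {patient_id: (predicted, truth)}."""
    pairs = list(patient_results.values())
    tp = sum(p and t for p, t in pairs)
    fp = sum(p and not t for p, t in pairs)
    tn = sum(not p and not t for p, t in pairs)
    fn = sum(not p and t for p, t in pairs)
    return tp, fp, tn, fn
```

The "either eye" rule can only add positives at the patient level, so it cannot lower sensitivity relative to eye-level scoring, which matches the aim of not missing any infant who needs referral.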
Secondary Analysis of Positive Cases
In a secondary analysis, the maximum VSS between eyes (maximum intereye VSS) was tracked over time for patients in the i-ROP test data set who screened positive. Previous work suggests that the VSS may be useful as a monitoring tool to detect disease progression. The change in VSS over time for patients who screened positive and eventually developed TR-ROP was compared with that for patients who screened positive but did not develop TR-ROP. Statistical significance was set at a cutoff of P ≤ .05 and was determined by using an analysis of variance and Welch’s 2-sample t test.
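The per-visit group comparison can be sketched with SciPy’s Welch t test (`scipy.stats.ttest_ind` with `equal_var=False`). The input layout is hypothetical, and the analysis of variance is omitted because the text does not specify its design (factors and levels).

```python
import numpy as np
from scipy import stats


def max_intereye_vss(vss_right, vss_left):
    """Per-visit maximum VSS across a patient's 2 eyes."""
    return np.maximum(np.asarray(vss_right, float), np.asarray(vss_left, float))


def compare_groups_by_visit(by_visit, alpha=0.05):
    """Welch 2-sample t test at each follow-up visit.

    by_visit maps a visit label to (treated_values, untreated_values), where
    each value is one screen-positive patient's maximum intereye VSS.
    Returns {visit: (p_value, significant_at_alpha)}.
    """
    out = {}
    for visit, (treated, untreated) in by_visit.items():
        res = stats.ttest_ind(treated, untreated, equal_var=False)
        out[visit] = (float(res.pvalue), bool(res.pvalue <= alpha))
    return out
```

Welch’s version is used here because it does not assume equal variances, which is reasonable when the treated group is small and its VSS values are more dispersed.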
Results
Data Sets
Table 1 displays the relevant demographics, VSSs at 32 to 33 weeks’ PMA, and clinical outcomes in the i-ROP and Salem data sets used in this study. In both data sets, eyes that developed TR-ROP tended to have higher VSSs at 32 to 33 weeks’ PMA, and infants who required treatment in 1 or both eyes tended to have lower BW and GA.
Risk Model Development
ElasticNet was tuned via fivefold cross-validation for all combinations of BW, GA, and VSS. An ElasticNet model using GA and VSS as predictors, with an L1 ratio of 0.4, had the highest AUPR (0.35 ± 0.11; Table 2, Fig 1). A random classifier would have an AUPR of approximately 0.08 (the proportion of TR-ROP cases in the training data set).
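The random-classifier baseline can be checked numerically: with 58 TR-ROP eyes and 660 non-TR-ROP eyes in the training data set, the expected AUPR of random scores is the prevalence, 58/718 ≈ 0.081. The simulation below is an illustrative sketch; a single draw on a data set this small fluctuates around that value.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def prevalence(n_pos, n_neg):
    """Expected AUPR of a random classifier: the fraction of positive cases."""
    return n_pos / (n_pos + n_neg)


def random_classifier_aupr(n_pos, n_neg, seed=0):
    """Average precision of uniformly random scores (one simulated draw)."""
    rng = np.random.default_rng(seed)
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
    return average_precision_score(y, rng.random(y.size))


# Training-set counts from the text: 58 TR-ROP eyes and 660 non-TR-ROP eyes.
print(round(prevalence(58, 660), 3))  # 0.081
print(random_classifier_aupr(58, 660))  # one noisy draw near the prevalence
```

Against this baseline, the GA + VSS model’s mean AUPR of 0.35 is roughly 4 times the chance level, whereas its AUROC of 0.82 is compared against a fixed chance level of 0.5.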
AUPR and AUROC for the GA + VSS model. A, Fivefold cross-validation precision-recall (PR) curve. B, Fivefold cross-validation receiver operating characteristic (ROC) curve. The means ± SDs of the AUPR (A) and AUROC (B) were 0.35 ± 0.11 and 0.82 ± 0.07, respectively, on fivefold cross-validation.
Fivefold Cross-Validation Results for Every Combination of BW, GA, and VSS
Variables | AUPR, mean ± SD | AUROC, mean ± SD | L1 Ratio |
---|---|---|---|
BW | 0.21 ± 0.14 | 0.77 ± 0.12 | 0.0 |
GA | 0.23 ± 0.20 | 0.79 ± 0.09 | 1.0 |
VSS | 0.29 ± 0.05 | 0.76 ± 0.03 | 0.0 |
BW + GA | 0.23 ± 0.20 | 0.78 ± 0.10 | 0.0 |
BW + VSS | 0.32 ± 0.13 | 0.82 ± 0.11 | 0.0 |
GA + VSS | 0.35 ± 0.11 | 0.82 ± 0.07 | 0.4 |
BW + GA + VSS | 0.31 ± 0.11 | 0.81 ± 0.11 | 0.0 |
VSS at 32–33 wk PMA. L1 Ratio, weighting of L1 versus L2 regularization in ElasticNet.
Mean ± SD results from fivefold cross-validation.
The operating point was tuned for increased sensitivity (so that all cases of TR-ROP would be identified) before we evaluated performance on the test data sets. The maximum F4 score ± SD (0.74 ± 0.12) occurred at an operating point of 0.33 ± 0.08. To further increase sensitivity, this operating point was lowered by 1 SD to 0.25.
Model Evaluation
The model was then evaluated on the held-out test data set from the i-ROP database (Table 3). It identified all infants who eventually required treatment (sensitivity: 100.0% [CI 95.1%–100.0%]; PPV: 28.1% [CI 22.8%–34.0%]) while correctly identifying nearly half of the infants who did not (specificity: 48.9% [CI 43.7%–54.1%]; NPV: 100.0% [CI 98.0%–100.0%]). For infants who developed TR-ROP, the mean ± SD time from prediction to TR-ROP diagnosis was 3.7 ± 2.7 weeks (range: 0.1–11.0 weeks).
Confusion Matrix of the Model Compared With the Ground Truth in 2 Test Data Sets
Model Predictions | i-ROP Test Data Set, True Label: Not Treated | i-ROP Test Data Set, True Label: Treated | Salem Test Data Set, True Label: Not Treated | Salem Test Data Set, True Label: Treated |
---|---|---|---|---|
Predicted not treated | 181 (TN) | 0 (FN) | 101 (TN) | 0 (FN) |
Predicted treated | 189 (FP) | 74 (TP) | 24 (FP) | 7 (TP) |
FN, false-negative; FP, false-positive; TN, true-negative; TP, true-positive.
The model was also evaluated on an independent test data set collected from a hospital in Salem, Oregon (Table 3). Again, it correctly identified all infants who eventually required treatment (sensitivity: 100.0% [CI 59.0%–100.0%]; PPV: 22.6% [CI 9.6%–41.1%]), and specificity increased to 80.8% (CI 72.8%–87.3%) (NPV: 100.0% [CI 96.4%–100.0%]). The mean ± SD time from prediction to TR-ROP diagnosis was 3.4 ± 2.1 weeks (range: 0.1–5.0 weeks).
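The exact Clopper-Pearson intervals can be computed from the confusion-matrix counts with SciPy’s beta distribution, as in the sketch below. For proportions of 100%, the lower bound reduces to (α/2)^(1/n), which gives the reported lower bounds of 95.1% (n = 74) and 59.0% (n = 7) for sensitivity; the other intervals should be close to the reported values, although every rounding is not guaranteed to match.

```python
from scipy.stats import beta


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion k/n."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return float(lower), float(upper)


def screening_metrics(tp, fp, tn, fn, alpha=0.05):
    """Point estimate and exact CI for each patient-level screening metric."""
    counts = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, *clopper_pearson(k, n, alpha)) for name, (k, n) in counts.items()}


# Counts (tp, fp, tn, fn) from the confusion matrix for the 2 test data sets.
for label, cm in {"i-ROP": (74, 189, 181, 0), "Salem": (7, 24, 101, 0)}.items():
    for name, (est, lo, hi) in screening_metrics(*cm).items():
        print(f"{label} {name}: {est:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The wide sensitivity interval for the Salem data set (59.0% to 100.0%) reflects its only 7 treated infants; the exact method stays conservative at such small counts, where normal approximations break down.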
Secondary Analysis of Positive Cases
Among positive predictions in the i-ROP test data set (Table 3), the mean maximum intereye VSS was tracked over time. Patients who developed TR-ROP appeared to have a greater change in VSS than those who screened positive but never required treatment (P ≤ .05), suggesting that specificity could be further improved by analyzing change in VSS over time (Fig 2).
Change in maximum intereye VSS over time among patients who screened positive, by treatment group. Among patients who screened positive by the optimal model, patients who developed TR-ROP had higher maximum intereye VSSs at every subsequent follow-up (P ≤ .05). * P ≤ .05.
Discussion
We tested whether incorporating an artificial intelligence–based assessment of vascular severity could improve the performance of ROP risk prediction models. We found that using just GA and VSS (obtained during a single eye examination at 32 to 33 weeks’ PMA) can identify all infants who are at risk for developing TR-ROP nearly 1 month before diagnosis while simultaneously ruling out roughly half or more of the low-risk population (48.9% and 80.8% in the 2 test data sets). With further validation, implementation of this model could reduce the number of ROP examinations and associated physiologic stress for low-risk infants. Finally, quantitative monitoring of vascular severity may lead to earlier and more consistent diagnosis of TR-ROP in infants who are at the highest risk, thus minimizing the overall risk of adverse outcomes.
This hypothesis was based on previous work revealing that a DL-derived VSS may identify high-risk eyes as early as 1 month before TR-ROP diagnosis.16,17 Our results supported this: the AUPR of the VSS univariate model for predicting TR-ROP was 0.06 to 0.08 points higher than that of the BW or GA univariate models or their combination (Table 2). This suggested that predictive performance might improve if the VSS were combined with GA and/or BW in a risk model. After the operating point of the highest-performing model (GA + VSS) was optimized for increased sensitivity (to avoid missing cases of TR-ROP), the model correctly identified 100% of infants who developed TR-ROP in 2 separate populations.
The intended use population and the potential impact of the PPV and NPV in each target population must also be considered. In the i-ROP data set, consistent with a population of infants from academic medical centers (who may be at higher risk than those in the average NICU), the specificity of the model was 48.9%, compared with 80.8% in the Salem, Oregon, hospital, where the incidence of TR-ROP was lower. Even in the higher-risk population (i-ROP), these results suggest that, by 32 to 33 weeks’ PMA, nearly half of the infants who will not develop TR-ROP could be accurately identified as low risk and no longer require frequent examinations. The Salem, Oregon, population suggests that this proportion may be substantially higher in community ROP screening programs.
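How the same model can yield different PPVs in the 2 cohorts follows from Bayes’ rule: PPV = sens × prev / (sens × prev + (1 − spec) × (1 − prev)). The short sketch below, using the patient-level test-set counts, reproduces the reported PPVs of 28.1% (i-ROP) and 22.6% (Salem) and shows that NPV stays at 100% whenever sensitivity is 100%.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' rule."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Patient-level counts: (treated, not treated, true negatives) per test data set.
cohorts = {"i-ROP": (74, 370, 181), "Salem": (7, 125, 101)}
for name, (n_treated, n_untreated, tn) in cohorts.items():
    prev = n_treated / (n_treated + n_untreated)
    spec = tn / n_untreated
    print(f"{name}: prevalence {prev:.1%}, PPV {ppv(1.0, spec, prev):.1%}, "
          f"NPV {npv(1.0, spec, prev):.1%}")
```

The Salem PPV is lower despite its higher specificity because TR-ROP is less prevalent there; in a screening context, the 100% NPV is the quantity that determines whether low-risk infants can safely receive fewer examinations.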
We also found that using the VSS to monitor disease progression may further enhance early detection of incident TR-ROP in infants who screen positive (Fig 2). This is consistent with previous work revealing that quantitative monitoring of vascular severity may be useful not only for screening but also for quantitative diagnosis and for determining whether the disease is stable, progressing, or regressing.14–19 This could lay the groundwork for a new model of ROP screening in which low-risk infants receive fewer examinations and high-risk infants receive earlier and more precise diagnoses. It may also be worth investigating the roles of oxygen exposure, intraventricular hemorrhage, sepsis, necrotizing enterocolitis, thrombocytopenia, and other previously associated risk factors to further increase specificity, although these may complicate the model and/or introduce confounding effects.
This model may also be easier to implement than previous ROP risk models. The performance of the GA + VSS model is comparable to the initial performance of the Children’s Hospital of Philadelphia ROP model, which used a combination of BW, GA, and weight gain to predict future occurrences of type II ROP and TR-ROP.12 Both models achieved 100.0% sensitivity in predicting TR-ROP and had similar specificities. However, when the Children’s Hospital of Philadelphia ROP model was applied to an external validation cohort of infants admitted to 30 hospitals across North America, the operating point had to be lowered to achieve 100.0% sensitivity, which reduced specificity to just 6.8%, too low to have a substantive impact on screening protocols.13 Another advantage of the proposed model is that it requires data from only a single examination. In general, GA is known with high precision, except in low- and middle-income countries (LMICs), where pregnancy dating may be less reliable. In these settings, it may be worth exploring a model that uses BW + VSS instead, because Table 2 suggests nearly comparable performance (AUPR: 0.32 ± 0.13 vs 0.35 ± 0.11 for GA + VSS). However, the model also requires a retinal fundus photograph obtained at 32 to 33 weeks’ PMA, and herein lies the main barrier to implementation at this time. Images are not part of the standard of care, and digital fundus cameras can be expensive, so images are often not obtained.2,4 As cameras drop in price and smartphone-based cameras become viable alternatives, future studies validating this concept may reveal that the clinical benefit of earlier detection of high-risk infants, along with the reduced screening burden, outweighs the cost of implementing routine imaging.23–25 Nonetheless, this remains a barrier to implementation and the main disadvantage of this method.
Additionally, this model is not likely to generalize well “out of the box” to populations different from the North American screening population. In many LMICs, the epidemiology and demographic risk factors are different, and the model would need to be retuned on the basis of local disease epidemiology.9,26,27 For example, high-risk infants could be less premature and a time point other than 32 to 33 weeks’ PMA may be more predictive. There is, however, evidence that the i-ROP DL system accurately diagnoses TR-ROP in an Indian ROP telemedicine program, suggesting that the technology is effective in that context and thus may be translatable.19
Regardless, this model has the potential to create a paradigm shift from ophthalmology-led to neonatology-led ROP screening, because the only required inputs are GA and a fundus photograph (not a complete ophthalmoscopic examination). In addition to reducing the number of examinations needed for low-risk infants, such a shift could dramatically reduce the number of examinations that require an ophthalmologist. This could lead to better use of scarce resources, especially in rural regions and LMICs, where resource scarcity is a significant issue.26,27
Conclusions
We have trained and optimized an interpretable, parsimonious model for the prediction of TR-ROP. In 2 separate validation cohorts, we demonstrated that a single examination at 32 to 33 weeks’ PMA detected all infants who eventually developed TR-ROP while correctly classifying 48.9% and 80.8% of those who did not as low risk. Implementation of this model could lead to substantially fewer ROP examinations for low-risk infants, better use of ROP screening resources, and earlier recognition of TR-ROP disease progression. Future work will validate this concept in LMICs, where the potential added value may be even greater given the increasing prevalence of disease and scarcity of resources, with the goal of reducing or eliminating blindness due to ROP.
Drs Coyner, Chiang, Campbell, and Kalpathy-Cramer were involved in all aspects of the study, including conceptualizing and designing the study, analyzing the data, drafting the initial manuscript, and reviewing and revising the manuscript; Mr Chen, Dr Singh, and Ms Anderson assisted with analyzing the data and reviewing and revising the manuscript; Drs Schelonka, Jordan, McEvoy, Sonmez, and Erdogmus were involved in critically revising the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
Deidentified individual participant data will not be made available.
FUNDING: Supported by grants T15 LM007088, R01 EY19474, R01 EY031331, R21 EY031883, and P30 EY10572 from the National Institutes of Health (Bethesda, MD) and by unrestricted departmental funding and a Career Development Award (to Dr Campbell) from Research to Prevent Blindness (New York, NY). Funded by the National Institutes of Health (NIH).
COMPANION PAPER: A companion to this article can be found online at www.pediatrics.org/cgi/doi/10.1542/peds.2021-053255.
A risk model based on GA and an artificial intelligence–based assessment of disease severity predicts TR-ROP nearly 1 month before treatment.
- AUPR: area under the precision-recall curve
- AUROC: area under the receiver operating characteristic curve
- BW: birth weight
- CI: confidence interval
- DL: deep learning
- GA: gestational age
- i-ROP: Imaging and Informatics in Retinopathy of Prematurity
- i-ROP DL: Imaging and Informatics in Retinopathy of Prematurity Deep Learning
- LMICs: low- and middle-income countries
- NPV: negative predictive value
- PMA: postmenstrual age
- PPV: positive predictive value
- ROP: retinopathy of prematurity
- TR-ROP: treatment-requiring retinopathy of prematurity
- VSS: vascular severity score
References
Competing Interests
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.
POTENTIAL CONFLICT OF INTEREST: Dr Chan is on the Scientific Advisory Board for Phoenix Technology Group (Pleasanton, CA) and is a consultant for Novartis (Basel, Switzerland) and Alcon (Ft Worth, TX). Dr Chiang was previously a consultant for Novartis (Basel, Switzerland) and is an equity owner of Inteleretina (Honolulu, HI). Drs Chiang, Campbell, Chan, and Kalpathy-Cramer receive research support from Genentech. Dr Chan receives research support from Regeneron. The Imaging and Informatics in Retinopathy of Prematurity Deep Learning system has been licensed to Boston Artificial Intelligence laboratories by Oregon Health & Science University, Massachusetts General Hospital, Northeastern University, and the University of Illinois Chicago, which may result in royalties to Drs Chan, Campbell, and Kalpathy-Cramer in the future. Dr Campbell is a consultant to Boston AI labs; the other authors have indicated they have no financial relationships relevant to this article to disclose.