BACKGROUND AND OBJECTIVES

Retinopathy of prematurity (ROP) is a leading cause of childhood blindness. Screening and treatment reduce this risk but require multiple examinations of infants, most of whom will not develop severe disease. Previous work has suggested that artificial intelligence may be able to detect incident severe disease (treatment-requiring retinopathy of prematurity [TR-ROP]) before clinical diagnosis. We aimed to build a risk model that combined artificial intelligence with clinical demographics to reduce the number of examinations without missing cases of TR-ROP.

METHODS

Infants undergoing routine ROP screening examinations (1579 total eyes, 190 with TR-ROP) were recruited from 8 North American study centers. A vascular severity score (VSS) was derived from retinal fundus images obtained at 32 to 33 weeks’ postmenstrual age. Seven ElasticNet logistic regression models were trained on all combinations of birth weight, gestational age, and VSS. The area under the precision-recall curve was used to identify the highest-performing model.

RESULTS

The gestational age + VSS model had the highest performance (mean ± SD area under the precision-recall curve: 0.35 ± 0.11). On 2 different test data sets (n = 444 and n = 132), sensitivity was 100% (positive predictive value: 28.1% and 22.6%) and specificity was 48.9% and 80.8% (negative predictive value: 100.0%).

CONCLUSIONS

Using a single examination, this model identified all infants who developed TR-ROP, on average, >1 month before diagnosis with moderate to high specificity. This approach could lead to earlier identification of incident severe ROP, reducing late diagnosis and treatment while simultaneously reducing the number of ROP examinations and unnecessary physiologic stress for low-risk infants.

What’s Known on This Subject:

Retinopathy of prematurity (ROP) screenings are an essential service in NICUs; however, current risk models subject infants to multiple physiologically stressful examinations. Previous work has revealed that an artificial intelligence–derived vascular severity score may prove useful for identifying severe disease.

What This Study Adds:

We developed an image-based risk model that, using a single retinal photograph, accurately detects severe ROP 1 month before diagnosis. Implementation of this screening approach could result in a paradigm shift toward neonatology-led ROP screenings.

Retinopathy of prematurity (ROP) is a leading cause of childhood blindness, although visual impairment can be prevented with appropriate screening and treatment.1–4 In the context of prematurely born infants, the epidemiology of ROP is directly related to 2 primary factors: neonatal mortality and exposure to supraphysiologic oxygen for resuscitation.1,5 Primary prevention of ROP, through careful oxygen titration, effectively reduces the incidence of treatment-requiring retinopathy of prematurity (TR-ROP); however, there exists a delicate balance: a lower fraction of inspired oxygen reduces the probability of developing ROP but consequently increases the probability of mortality, and vice versa.5 To err on the side of caution, a higher fraction of inspired oxygen is supplied, and NICUs are responsible for ensuring that secondary prevention, through timely ROP screenings, occurs for all at-risk neonates.1,4,5 The risk of blindness can be reduced, but not eliminated, with optimal primary and secondary prevention; however, because adverse outcomes are at times preventable, ROP is a leading cause of medicolegal liability in ophthalmology.6,7

ROP screenings help identify eyes progressing to TR-ROP so that timely treatments may be provided. However, screening guidelines must balance the risk of missing cases of TR-ROP with the risks of discomfort and potentially life-threatening events from the screenings themselves.3–5 In the United States, screenings are recommended on the basis of demographic criteria (gestational age [GA] <31 weeks or birth weight [BW] <1501 g).4 Examinations begin at either 4 weeks’ chronological age or 31 weeks’ postmenstrual age (PMA) (whichever is later) and are repeated every 1 to 2 weeks until the retina is fully developed or until ROP requires treatment.2,4 On average, infants who meet screening criteria receive 3 to 8 examinations, yet <10% develop TR-ROP. Thus, current screening guidelines, although highly sensitive, are not specific and subject low-risk infants to examinations that would not be necessary if high-risk infants could be better identified.1–3,8,9 Using numerous risk models, researchers have attempted to add specificity by incorporating comorbidities, but many of these comorbidities are rare or are confounded by BW and GA.10,11 The best-performing models have shown promise but, thus far, have not generalized well to larger, more diverse populations.10,12,13 Ultimately, these models have not gained traction because they have either failed to ensure 100% sensitivity or have been clinically impractical to implement.10–13

Herein, we explore whether the specificity of risk models can be improved by including biometric information. Deep learning (DL) has shown promise for objective diagnosis of ROP and may be useful for screening.14–19 Previous work using the Imaging and Informatics in Retinopathy of Prematurity Deep Learning (i-ROP DL) algorithm has suggested that a DL-derived vascular severity score (VSS) may identify infants progressing to TR-ROP weeks before treatment.16,17 To address this gap in knowledge, we incorporated the output of the i-ROP DL algorithm in a predictive risk model for incident TR-ROP. We hypothesize that adding biometric information relevant to ROP may add specificity to risk models based only on demographic variables without sacrificing TR-ROP detection sensitivity.

This study was approved by the institutional review boards at the coordinating center (Oregon Health & Science University) and at each of the 7 study centers (Columbia University, University of Illinois Chicago, William Beaumont Hospital, Children’s Hospital Los Angeles, Cedars-Sinai Medical Center, University of Miami, and Weill Cornell Medical Center) and was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from parents of all enrolled infants.

As part of the multicenter Imaging and Informatics in Retinopathy of Prematurity (i-ROP) cohort study, 842 unique patients (BW <1501 g or GA <31 weeks) were screened multiple times for ROP between January 2012 and July 2020. During each examination, retinal fundus images were captured via a RetCam (Natus, Pleasanton, CA). Patients were clinically examined at the bedside but also received image-based ROP diagnoses, which were determined by a consensus of 3 ROP experts using the full International Classification of ROP criteria.4  Patients’ retinal images were required to have expert consensus agreement that their quality was acceptable for diagnosis; 33 images did not meet this criterion. Clinical comorbidities and demographics were recorded for all patients’ examinations (Table 1, Supplemental Table 4). Statistical significance, as applicable, was determined by using Welch’s 2-sample t test and was defined at a cutoff of P ≤ .05.

TABLE 1

i-ROP and Salem Data Set Demographics and Clinical Outcomes

| Study Patient Characteristics | Not Treated | Treated | P |
|---|---|---|---|
| i-ROP training data set | | | |
| BW, g, mean ± SD | 944.5 ± 248.3 | 673.0 ± 206.3 | <.001 |
| GA, wk, mean ± SD | 26.7 ± 1.7 | 24.7 ± 1.4 | <.001 |
| VSS, mean ± SD | 1.4 ± 0.9 | 2.9 ± 1.9 | <.001 |
| Total patients, n (%) | 345 (91.8) | 31 (8.2) | — |
| Total eyes, n (%) | 660 (91.9) | 58 (8.1) | — |
| i-ROP test data set | | | |
| BW, g, mean ± SD | 930.6 ± 275.8 | 632.6 ± 136.1 | <.001 |
| GA, wk, mean ± SD | 26.9 ± 2.1 | 24.3 ± 1.1 | <.001 |
| VSS, mean ± SD | 1.8 ± 1.4 | 3.9 ± 2.6 | <.001 |
| Total patients, n (%) | 377 (84.9) | 67 (15.1) | — |
| Total eyes, n (%) | 729 (84.7) | 132 (15.3) | — |
| Salem data set | | | |
| BW, g, mean ± SD | 1265.4 ± 281.5 | 823.0 ± 200.9 | .052 |
| GA, wk, mean ± SD | 29.2 ± 2.2 | 25.0 ± 0.7 | <.001 |
| VSS, mean ± SD | 1.6 ± 0.5 | 2.3 ± 0.9 | .029 |
| Total patients, n (%) | 125 (94.7) | 7 (5.3) | — |
| Total eyes, n (%) | 248 (94.7) | 14 (5.3) | — |

—, not applicable.
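To make the group comparisons in Table 1 concrete, the following minimal sketch computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from summary statistics. The function name `welch_t` is hypothetical, and the use of patient-level counts as sample sizes is an assumption made only for illustration; the source does not state the unit of analysis for each P value, and the P value itself (which requires the t distribution) is not computed here.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# BW in the i-ROP training data set (Table 1), assuming patient-level n
t_stat, dof = welch_t(944.5, 248.3, 345, 673.0, 206.3, 31)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Under these assumptions the statistic is large (roughly 6.9 on about 38 degrees of freedom), which is in line with the reported P < .001 for BW.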

Each eye examination was represented by a single RetCam image centered on the macula, which is approximately the field of view of zone I. Images were analyzed by i-ROP DL, an algorithm developed to detect plus disease (a manifestation of severe ROP).14  i-ROP DL provided a softmax probability of each image having normal, preplus, or plus disease vasculature (ie, it approximated the probability [P()] of each class, in which values range between 0.0 and 1.0 but must sum to 1.0 across all classes). From these values, a VSS, ranging from 1.0 to 9.0, was developed: VSS = P(normal) + 5 × P(preplus) + 9 × P(plus).
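The VSS formula above can be sketched in a few lines. This is an illustration of the stated weighting only, not the i-ROP DL implementation; the function name and the input check are assumptions.

```python
def vascular_severity_score(p_normal, p_preplus, p_plus):
    """VSS = P(normal) + 5 * P(preplus) + 9 * P(plus), ranging from 1.0 to 9.0."""
    total = p_normal + p_preplus + p_plus
    if abs(total - 1.0) > 1e-6:
        # Softmax outputs must sum to 1.0 across the three classes
        raise ValueError("class probabilities must sum to 1.0")
    return 1.0 * p_normal + 5.0 * p_preplus + 9.0 * p_plus

# An image judged entirely normal scores 1.0; entirely plus disease scores 9.0
print(vascular_severity_score(1.0, 0.0, 0.0))
print(vascular_severity_score(0.0, 0.0, 1.0))
# A mixed prediction falls in between
print(round(vascular_severity_score(0.5, 0.3, 0.2), 2))
```

Because the probabilities sum to 1.0, the score is a weighted average of the class anchors 1, 5, and 9, which is why it cannot fall outside the 1.0 to 9.0 range.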

The VSS has been shown to independently correlate with more posterior disease (zone), higher stage, and higher extent of stage 3 ROP, in addition to plus disease (all the components of the International Classification of ROP criteria).15–19 On the basis of previous work, the 32 to 33 weeks’ PMA imaging window was identified as potentially predictive of TR-ROP.16,17 Thus, the first eye examination in this window was used for each patient. Because the goal was to develop a predictive (rather than diagnostic) model, infants who were diagnosed with TR-ROP within this window were excluded from the training data set (specifically, if they developed TR-ROP within 7 days of the first examination to occur within the 32 to 33 weeks’ PMA window). The held-out test data set (a subset of examinations from the i-ROP data set that were only used for model evaluation) contained all infants eligible for ROP screening, regardless of if or when they developed TR-ROP. Patients were mutually exclusive to the training (n = 376 patients) and test (n = 444 patients) data sets. The training data set contained 58 eyes that eventually developed TR-ROP and 660 eyes that did not.

BW, GA, and VSS were evaluated via recursive feature elimination by using multiple ElasticNet models trained by using scikit-learn in Python.20 As used here, ElasticNet is a type of logistic regression in which a mixture of L1 and L2 regularization is used.21 L1 regularization is useful for feature selection, L2 regularization is useful when collinear or codependent features are included in a model, and both help to improve model generalizability. The ElasticNet mixing parameter was tuned via fivefold cross-validation by using 11 evenly distributed operating points from 0.0 to 1.0; values of 1.0 and 0.0 correspond to pure L1 and pure L2 regularization, respectively. Because of the class imbalance (ie, eyes that eventually developed TR-ROP versus those that did not), the area under the precision-recall curve (AUPR), rather than the area under the receiver operating characteristic curve (AUROC), was the primary measure of model performance. The AUROC may be too optimistic in this setting: a random classifier theoretically has an AUROC of 0.5 but an AUPR equal only to the proportion of positive cases (ie, the number of positive cases divided by the total number of cases).
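The search space described above is small enough to enumerate directly. The sketch below lists the 7 predictor combinations and the 11 candidate mixing values using only the standard library; the variable names are illustrative. Model fitting is deliberately omitted. In practice it would be delegated to a library (for example, an elastic-net-penalized logistic regression in scikit-learn scored by average precision), but the exact estimator and settings used in the study are not stated in the text, so none are assumed here.

```python
from itertools import combinations

predictors = ["BW", "GA", "VSS"]

# Every non-empty subset of the 3 predictors: 3 singles + 3 pairs + 1 triple = 7
feature_sets = [
    subset
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]

# 11 evenly spaced mixing values from 0.0 (pure L2) to 1.0 (pure L1)
l1_ratios = [i / 10 for i in range(11)]

for subset in feature_sets:
    print(" + ".join(subset))
print(l1_ratios)
```

Each feature set would be cross-validated once per mixing value, giving 7 × 11 = 77 candidate configurations under this reading of the Methods.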

The performance of the model with the highest AUPR was assessed via the Fβ score by using fivefold cross-validation across 101 evenly distributed operating points from 0.00 to 1.00. Whereas the F1 score (β = 1) attempts to balance the proportion of false-negatives to false-positives, increasing β (eg, F2, F3, etc) prioritizes minimizing false-negatives over minimizing false-positives. The F2 score is commonly used to slightly prioritize minimization of false-negatives. To minimize false-negatives, β was set to 4. The mean operating point (minus 1 SD) that maximized the F4 score was selected and used to evaluate both test data sets.
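The effect of β can be illustrated with a short sketch. The Fβ definition below is the standard one; the helper names and the toy probabilities are assumptions for illustration and do not reproduce the cross-validated procedure used in the study.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score; larger beta weights false-negatives more heavily."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def best_operating_point(probs, labels, beta=4.0):
    """Scan 101 thresholds (0.00 to 1.00) and keep the first one maximizing F-beta."""
    best_t, best_f = 0.0, -1.0
    for i in range(101):
        t = i / 100
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        score = f_beta(tp, fp, fn, beta)
        if score > best_f:
            best_t, best_f = t, score
    return best_t, best_f

# Toy example (hypothetical predicted probabilities and outcomes)
toy_probs = [0.1, 0.2, 0.3, 0.6, 0.8]
toy_labels = [0, 0, 1, 0, 1]
print(best_operating_point(toy_probs, toy_labels, beta=4.0))

# Same counts, different beta: 74 true-positives, 189 false-positives, 0 false-negatives
print(round(f_beta(74, 189, 0, beta=1), 3), round(f_beta(74, 189, 0, beta=4), 3))
```

With no false-negatives, raising β from 1 to 4 lifts the score substantially for the same counts, which shows why F4 favors operating points that preserve sensitivity even at the cost of many false-positives.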

This model was then evaluated on the held-out i-ROP test data set and on an independent data set that was collected between September 2015 and June 2018 from 132 unique patients born at a hospital in Salem, Oregon (Table 1). Data collection and exclusion criteria were similar to those for the i-ROP data set. Retrospective evaluation of these data was performed under a waiver of consent from the Oregon Health & Science University Institutional Review Board. Because patients are referred for treatment (not individual eyes), test data set evaluations were conducted at the patient level (ie, if 1 or both eyes were predicted to develop TR-ROP, the patient was labeled as such). The i-ROP test data set contained 74 patients (132 eyes) who eventually developed TR-ROP and 370 patients (729 eyes) who did not. The Salem data set contained 7 patients (14 eyes) who developed TR-ROP and 125 patients (248 eyes) who did not. The main outcome measures were sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and their corresponding 95% confidence intervals (CIs), evaluated independently by using the conservative Clopper-Pearson method, as suggested by Ying et al.22
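For readers unfamiliar with the Clopper-Pearson ("exact") interval, the following sketch inverts the binomial tail probabilities by bisection using only the standard library. It is one way to obtain the interval, not necessarily the authors' implementation (statistical packages usually use beta-distribution quantiles instead), and the function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space for numerical stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for a proportion x/n."""
    def upper_tail(p):  # P(X >= x), increasing in p
        return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

    def lower_tail(p):  # P(X <= x), decreasing in p
        return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

    def solve(f, target, increasing):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection on [0, 1]
            mid = (lo + hi) / 2
            if (f(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(upper_tail, alpha / 2, True)
    upper = 1.0 if x == n else solve(lower_tail, alpha / 2, False)
    return lower, upper

# Sensitivity of 74/74: the lower bound reduces to 0.025 ** (1 / 74), about 0.951
print(clopper_pearson(74, 74))
# Sensitivity of 7/7 in a much smaller sample gives a far wider interval
print(clopper_pearson(7, 7))
```

When every case is detected (x = n), the lower bound has the closed form (α/2)^(1/n), which makes clear why a sample of 7 treated patients cannot support a narrow interval even with 100% observed sensitivity.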

In a secondary analysis, the maximum VSS between eyes for patients in the i-ROP test data set who screened positive was monitored over time. On the basis of previous work, the VSS has potential for use as a monitoring tool to detect disease progression. The change in VSS over time for patients who screened positive and eventually developed TR-ROP was compared with that for those who screened positive but did not develop TR-ROP. Statistical significance was set at a cutoff of P ≤ .05 and was determined by using an analysis of variance and a Welch’s 2-sample t test.

Table 1 displays the relevant demographics and VSSs at 32 to 33 weeks’ PMA and clinical outcomes in the 2 data sets used in this study. In both data sets, eyes that developed TR-ROP tended to have higher VSSs at 32 to 33 weeks, and infants who required treatment in 1 or both eyes tended to have lower BW and GA.

ElasticNet was tuned via fivefold cross-validation for all combinations of BW, GA, and VSS. An ElasticNet model with an L1 ratio of 0.4, by using the predictors GA and VSS, had the highest AUPR (0.35 ± 0.11; Table 2, Fig 1). A random classifier would have an AUPR approximately equal to 0.08 (the proportion of TR-ROP cases in the training data set).
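The random-classifier baseline quoted above follows directly from the training-set eye counts. A minimal worked check (variable names are illustrative):

```python
# Eye-level counts in the i-ROP training data set
tr_rop_eyes = 58
non_tr_rop_eyes = 660

# A random classifier's expected AUPR equals the prevalence of positives
baseline_aupr = tr_rop_eyes / (tr_rop_eyes + non_tr_rop_eyes)

# Cross-validated mean AUPR of the GA + VSS model
observed_aupr = 0.35
lift = observed_aupr / baseline_aupr

print(round(baseline_aupr, 3), round(lift, 1))
```

On this reading, the GA + VSS model's mean AUPR is a little more than 4 times the chance level, although the fold-to-fold SD of 0.11 means the margin varies considerably across folds.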

FIGURE 1

AUPR and AUROC for the GA + VSS model. A, Fivefold cross-validation precision-recall (PR) curve. B, Fivefold cross-validation receiver operating characteristic (ROC) curve. The means ± SDs of the AUPR (A) and AUROC (B) were 0.35 ± 0.11 and 0.82 ± 0.07, respectively, on fivefold cross-validation.

TABLE 2

Fivefold Cross-Validation Results for Every Combination of BW, GA, and VSS

| Variables | AUPR (a) | AUROC (a) | L1 Ratio |
|---|---|---|---|
| BW | 0.21 ± 0.14 | 0.77 ± 0.12 | 0.0 |
| GA | 0.23 ± 0.20 | 0.79 ± 0.09 | 1.0 |
| VSS | 0.29 ± 0.05 | 0.76 ± 0.03 | 0.0 |
| BW + GA | 0.23 ± 0.20 | 0.78 ± 0.10 | 0.0 |
| BW + VSS | 0.32 ± 0.13 | 0.82 ± 0.11 | 0.0 |
| GA + VSS | 0.35 ± 0.11 | 0.82 ± 0.07 | 0.4 |
| BW + GA + VSS | 0.31 ± 0.11 | 0.81 ± 0.11 | 0.0 |

VSS at 32–33 wk PMA. L1 Ratio, weighting of L1 versus L2 regularization in ElasticNet.

(a) Mean ± SD results from fivefold cross-validation.

The operating point was tuned for increased sensitivity (so that all cases of TR-ROP would be identified) before we evaluated performance on the test data sets. The maximum F4 score ± SD (0.74 ± 0.12) occurred at an operating point of 0.33 ± 0.08. To further increase sensitivity, this operating point was lowered by 1 SD to 0.25.
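The threshold adjustment and the patient-level decision rule from the Methods can be sketched together as follows. The function name is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption; the source does not specify how ties were handled.

```python
MEAN_OPERATING_POINT = 0.33  # mean threshold maximizing F4 across folds
SD_OPERATING_POINT = 0.08    # SD of that threshold across folds

# Lowering the threshold by 1 SD trades specificity for extra sensitivity
threshold = round(MEAN_OPERATING_POINT - SD_OPERATING_POINT, 2)

def patient_screens_positive(eye_probabilities, cutoff=threshold):
    """A patient screens positive if either eye meets or exceeds the cutoff."""
    return any(p >= cutoff for p in eye_probabilities)

print(threshold)
print(patient_screens_positive([0.10, 0.30]))  # one eye above the cutoff
print(patient_screens_positive([0.10, 0.20]))  # both eyes below the cutoff
```

Aggregating with "either eye" mirrors clinical practice as described in the Methods, in which patients rather than individual eyes are referred for treatment.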

The model was then evaluated on the held-out test data set from the i-ROP database (Table 3). It identified all infants who eventually required treatment (sensitivity: 100.0% [CI 95.1%–100.0%]; PPV: 28.1% [CI 22.8%–34.0%]) while correctly identifying nearly half of the infants who never would (specificity: 48.9% [CI 43.7%–54.1%]; NPV: 100.0% [CI 98.0%–100.0%]). For infants who developed TR-ROP, the average number of weeks ± SD to TR-ROP diagnosis was 3.7 ± 2.7 weeks (range: 0.1–11.0 weeks) after prediction.

TABLE 3

Confusion Matrix of the Model Compared With the Ground Truth in 2 Test Data Sets

| Model Predictions | i-ROP Test Data Set, True Label: Not Treated | i-ROP Test Data Set, True Label: Treated | Salem Test Data Set, True Label: Not Treated | Salem Test Data Set, True Label: Treated |
|---|---|---|---|---|
| Predicted not treated | 181 (TN) | 0 (FN) | 101 (TN) | 0 (FN) |
| Predicted treated | 189 (FP) | 74 (TP) | 24 (FP) | 7 (TP) |

FN, false-negative; FP, false-positive; TN, true-negative; TP, true-positive.

The model was also evaluated on an independent test data set collected from a hospital located in Salem, Oregon (Table 3). Again, it correctly identified all infants who eventually required treatment (sensitivity: 100.0% [CI 59.0%–100.0%]; PPV: 22.6% [CI 9.6%–41.1%]), and specificity increased to 80.8% (CI 72.8%–87.3%) (NPV: 100.0% [CI 96.4%–100.0%]). The average time ± SD to TR-ROP diagnosis, after prediction, was 3.4 ± 2.1 weeks (range: 0.1–5.0 weeks).
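The point estimates reported for both test data sets can be recomputed from the counts in Table 3. The sketch below uses the standard definitions; the function name is illustrative.

```python
def screening_metrics(tn, fn, fp, tp):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Patient-level counts from Table 3
irop = screening_metrics(tn=181, fn=0, fp=189, tp=74)
salem = screening_metrics(tn=101, fn=0, fp=24, tp=7)

for name, metrics in (("i-ROP", irop), ("Salem", salem)):
    summary = ", ".join(f"{k}: {100 * v:.1f}%" for k, v in metrics.items())
    print(f"{name} -> {summary}")
```

The low PPV in both cohorts reflects the low prevalence of TR-ROP combined with a threshold chosen to avoid false-negatives, so a positive screen indicates a need for continued examination rather than a diagnosis.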

Among positive predictions in the i-ROP test data set (Table 3), the average VSS was monitored over time. Patients who developed TR-ROP appeared to have a greater change in average VSS compared with those who screened positive but never required treatment (P ≤ .05), suggesting that specificity could be further improved by analyzing change in VSS over time (Fig 2).

FIGURE 2

Change in maximum intereye VSS over time among patients who screened positive, by treatment group. Among patients who screened positive by the optimal model, patients who developed TR-ROP had higher maximum intereye VSSs at every subsequent follow-up (P ≤ .05). * P ≤ .05.

We tested whether incorporation of an artificial intelligence–based assessment of vascular severity could improve the performance of ROP risk prediction models. We found that using just GA and VSS (obtained during a single eye examination at 32 to 33 weeks’ PMA) can identify all infants who are at risk for developing TR-ROP nearly 1 month before diagnosis and simultaneously rule out more than half of the low-risk population. With further validation, implementation of this model could reduce the number of ROP examinations and associated physiologic stress for low-risk infants. Finally, quantitative monitoring of vascular severity may lead to earlier and more consistent diagnosis of TR-ROP in infants who are at the highest risk, thus minimizing the overall risk of adverse outcomes.

This hypothesis was based on previous work that revealed that a DL-derived VSS may identify high-risk eyes as early as 1 month before TR-ROP diagnosis.16,17 This proved to be accurate because the AUPR of the VSS at predicting TR-ROP was 0.07 points higher than the BW or GA univariate models, or the combination thereof (Table 2). This suggested that diagnostic prediction might be higher if a combination of the VSS and GA and/or BW were to be used in a risk model. After optimizing the operating point of the highest-performing algorithm (GA + VSS) for increased sensitivity (to avoid missing cases of TR-ROP), the model correctly identified 100% of infants who developed TR-ROP in 2 separate populations.

The intended use population and the potential impact of the PPV and NPV in each target population must also be considered. In the i-ROP data set, consistent with a population of infants from academic medical centers (who may be higher risk than those in the average NICU), the specificity of the model was 48.9%, compared to 80.8% in the Salem, Oregon, hospital, where the incidence of TR-ROP was lower. Even in the higher-risk population (i-ROP), these results suggest that, by 32 to 33 weeks’ PMA, half the population could be accurately identified as low risk and no longer require frequent examinations. The Salem, Oregon, population suggests that this proportion may be substantially higher in community ROP screening programs.

We also found that using the VSS to monitor disease progression may further enhance early detection of incident TR-ROP in infants who screen positive (Fig 2). This is consistent with previous work revealing that quantitative monitoring of vascular severity may be useful not only for screening but also for quantitative diagnosis and determining if the disease is stable, progressing, or regressing.14–19 This could lay the framework for a new model of ROP screening in which low-risk infants receive fewer examinations and high-risk infants receive earlier and more precise diagnoses. To this point, it may be worth investigating the roles of oxygen exposure, intraventricular hemorrhages, sepsis, necrotizing enterocolitis, thrombocytopenia, and other previously associated risk factors to further increase specificity, although they may complicate this model and/or introduce confounding effects.

This model may also be easier to implement than previous ROP risk models. The performance of the GA + VSS model is comparable to the initial performance measurements of the Children’s Hospital of Philadelphia ROP model, which used a combination of BW + GA + weight gain to predict future occurrences of type II ROP and TR-ROP.12 Both models achieved 100.0% sensitivity in predicting TR-ROP and had similar specificities. However, when the Children’s Hospital of Philadelphia ROP model was applied to an external validation cohort of infants admitted to 30 hospitals across North America, the operating point had to be lowered to achieve 100.0% sensitivity, consequently reducing specificity to just 6.8%, which is too low to have a substantive impact on screening protocols.13 Another advantage of the proposed model is that it only requires data from a single examination. In general, GA is known with high precision, except in low- and middle-income countries (LMICs), where dating pregnancies may be less reliable. In these settings, it may be worth exploring a model that uses BW + VSS instead because Table 2 suggests almost comparable performance. However, a retinal fundus photograph obtained at 32 to 33 weeks’ PMA is also required, and herein lies the main barrier to implementation at this time. Images are not part of the standard of care, and digital fundus cameras can be expensive, so images are not often obtained.2,4 As cameras drop in price and smartphone-based cameras become viable alternatives, it may be that future studies validating this concept reveal that the clinical benefit of earlier detection of high-risk infants, along with the reduced screening burden, outweighs the cost of implementing routine imaging.23–25 Nonetheless, this remains a barrier to implementation and the main disadvantage of this method.

Additionally, this model is not likely to generalize well “out of the box” to populations different from the North American screening population. In many LMICs, the epidemiology and demographic risk factors are different, and the model would need to be retuned on the basis of local disease epidemiology.9,26,27  For example, high-risk infants could be less premature and a time point other than 32 to 33 weeks’ PMA may be more predictive. There is, however, evidence that the i-ROP DL system accurately diagnoses TR-ROP in an Indian ROP telemedicine program, suggesting that the technology is effective in that context and thus may be translatable.19 

Regardless, this model has potential to create a paradigm shift, transitioning from ophthalmology-led to neonatology-led ROP screenings, because the only required inputs are GA and a fundus photograph (not a complete ophthalmoscopic examination). Such a paradigm shift could, in addition to reducing the number of examinations needed for low-risk infants, dramatically reduce the number of examinations for which an ophthalmologist is needed. This could lead to better use of scarce resources, especially in rural regions and LMICs, where this is a significant issue.26,27 

We have trained and optimized an interpretable, parsimonious model for the prediction of TR-ROP. In 2 separate validation cohorts, we demonstrated that a single examination at 32 to 33 weeks’ PMA detected all infants who eventually developed TR-ROP and more than half of those who did not. Implementation of this model could lead to significantly fewer ROP examinations for low-risk infants, better use of ROP screening resources, and earlier recognition of TR-ROP disease progression. Future work will validate this concept in LMICs, where the potential added value may be even greater given the increasing prevalence of disease and scarcity of resources, with the goal of reducing or eliminating blindness due to ROP.

Drs Coyner, Chiang, Campbell, and Kalpathy-Cramer were involved in all aspects of the study, including conceptualizing and designing the study, analyzing the data, drafting the initial manuscript, and reviewing and revising the manuscript; Mr Chen, Dr Singh, and Ms Anderson assisted with analyzing the data and reviewing and revising the manuscript; Drs Schelonka, Jordan, McEvoy, Sonmez, and Erdogmus were involved in critically revising the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.

Deidentified individual participant data will not be made available.

FUNDING: Supported by grants T15 LM007088, R01 EY19474, R01 EY031331, R21 EY031883, and P30 EY10572 from the National Institutes of Health (Bethesda, MD) and by unrestricted departmental funding and a Career Development Award (to Dr Campbell) from Research to Prevent Blindness (New York, NY). Funded by the National Institutes of Health (NIH).

COMPANION PAPER: A companion to this article can be found online at www.pediatrics.org/cgi/doi/10.1542/peds.2021-053255.

A risk model based on GA and an artificial intelligence–based assessment of disease severity predicts TR-ROP 1 month before treatment.

AUPR: area under the precision-recall curve
AUROC: area under the receiver operating characteristic curve
BW: birth weight
CI: confidence interval
DL: deep learning
GA: gestational age
i-ROP: Imaging and Informatics in Retinopathy of Prematurity
i-ROP DL: Imaging and Informatics in Retinopathy of Prematurity Deep Learning
LMICs: low- and middle-income countries
NPV: negative predictive value
PMA: postmenstrual age
PPV: positive predictive value
ROP: retinopathy of prematurity
TR-ROP: treatment-requiring retinopathy of prematurity
VSS: vascular severity score

1. Good WV, Hardy RJ, Dobson V, et al; Early Treatment for Retinopathy of Prematurity Cooperative Group. The incidence and course of retinopathy of prematurity: findings from the early treatment for retinopathy of prematurity study. Pediatrics. 2005;116(1):15–23
2. Fierson WM; American Academy of Pediatrics Section on Ophthalmology; American Academy of Ophthalmology; American Association for Pediatric Ophthalmology and Strabismus; American Association of Certified Orthoptists. Screening examination of premature infants for retinopathy of prematurity [published correction appears in Pediatrics. 2019;143(3):e20183810]. Pediatrics. 2018;142(6):e20183061
3. Early Treatment For Retinopathy Of Prematurity Cooperative Group. Revised indications for the treatment of retinopathy of prematurity: results of the early treatment for retinopathy of prematurity randomized trial. Arch Ophthalmol. 2003;121(12):1684–1694
4. International Committee for the Classification of Retinopathy of Prematurity. The International Classification of Retinopathy of Prematurity revisited. Arch Ophthalmol. 2005;123(7):991–999
5. Lawn JE, Davidge R, Paul VK, et al. Born too soon: care for the preterm baby. Reprod Health. 2013;10(suppl 1):S5
6. Braverman RS, Enzenauer RW. Socioeconomics of retinopathy of prematurity in-hospital care. Arch Ophthalmol. 2010;128(8):1055–1058
7. Braverman RS, Enzenauer RW. Socioeconomics of retinopathy of prematurity care in the United States. Am Orthopt J. 2013;63(1):92–96
8. Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res. 2013;74(suppl 1):35–49
9. Quinn GE. Retinopathy of prematurity blindness worldwide: phenotypes in the third epidemic. Eye Brain. 2016;8:31–36
10. Kim SJ, Port AD, Swan R, Campbell JP, Chan RVP, Chiang MF. Retinopathy of prematurity: a review of risk factors and their clinical significance. Surv Ophthalmol. 2018;63(5):618–637
11. Hutchinson AK, Melia M, Yang MB, VanderVeen DK, Wilson LB, Lambert SR. Clinical models and algorithms for the prediction of retinopathy of prematurity: a report by the American Academy of Ophthalmology. Ophthalmology. 2016;123(4):804–816
12. Binenbaum G, Ying GS, Quinn GE, et al. The CHOP postnatal weight gain, birth weight, and gestational age retinopathy of prematurity risk model. Arch Ophthalmol. 2012;130(12):1560–1565
13. Binenbaum G, Ying G-S, Tomlinson LA; Postnatal Growth and Retinopathy of Prematurity (G-ROP) Study Group. Validation of the Children’s Hospital of Philadelphia Retinopathy of Prematurity (CHOP ROP) model. JAMA Ophthalmol. 2017;135(8):871–877
14. Brown JM, Campbell JP, Beers A, et al; Imaging and Informatics in Retinopathy of Prematurity (i-ROP) Research Consortium. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–810
15. Campbell JP, Kim SJ, Brown JM, et al; Imaging and Informatics in Retinopathy of Prematurity Consortium. Evaluation of a deep learning-derived quantitative retinopathy of prematurity severity scale. Ophthalmology. 2021;128(7):1070–1076
16. Taylor S, Brown JM, Gupta K, et al; Imaging and Informatics in Retinopathy of Prematurity Consortium. Monitoring disease progression with a quantitative severity scale for retinopathy of prematurity using deep learning. JAMA Ophthalmol. 2019;137(9):1022–1028
17. Bellsmith KN, Brown J, Kim SJ, et al. Aggressive posterior retinopathy of prematurity: clinical and quantitative imaging features in a large North American cohort. Ophthalmology. 2020;127(8):1105–1112
18. Greenwald MF, Danford ID, Shahrawat M, et al. Evaluation of artificial intelligence-based telemedicine screening for retinopathy of prematurity. J AAPOS. 2020;24(3):160–162
19. Campbell JP, Singh P, Redd TK, et al. Applications of artificial intelligence for retinopathy of prematurity screening. Pediatrics. 2021;147(3):e2020016618
20. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(85):2825–2830
21. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2017
22. Ying GS, Maguire MG, Glynn RJ, Rosner B. Calculating sensitivity, specificity, and predictive values for correlated eye data. Invest Ophthalmol Vis Sci. 2020;61(11):29
23. Wang S, Jin K, Lu H, Cheng C, Ye J, Qian D. Human visual system-based fundus image quality assessment of portable fundus camera photographs. IEEE Trans Med Imaging. 2016;35(4):1046–1055
24. Raju B, Raju NSD, Akkara JD, Pathengay A. Do it yourself smartphone fundus camera - DIYretCAM. Indian J Ophthalmol. 2016;64(9):663–667
25. Nazari Khanamiri H, Nakatsuka A, El-Annan J. Smartphone fundus photography. J Vis Exp. 2017;(125):55958
26. Gilbert C, Fielder A, Gordillo L, et al; International NO-ROP Group. Characteristics of infants with severe retinopathy of prematurity in countries with low, moderate, and high levels of development: implications for screening programs. Pediatrics. 2005;115(5).
27. Gilbert C. Retinopathy of prematurity: a global perspective of the epidemics, population of babies at risk and implications for control. Early Hum Dev. 2008;84(2):77–82

Competing Interests

FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.

POTENTIAL CONFLICT OF INTEREST: Dr Chan is on the Scientific Advisory Board for Phoenix Technology Group (Pleasanton, CA) and is a consultant for Novartis (Basel, Switzerland) and Alcon (Ft Worth, TX). Dr Chiang was previously a consultant for Novartis (Basel, Switzerland) and is an equity owner of Inteleretina (Honolulu, HI). Drs Chiang, Campbell, Chan, and Kalpathy-Cramer receive research support from Genentech. Dr Chan receives research support from Regeneron. The Imaging and Informatics in Retinopathy of Prematurity Deep Learning system has been licensed to Boston Artificial Intelligence laboratories by Oregon Health & Science University, Massachusetts General Hospital, Northeastern University, and the University of Illinois Chicago, which may result in royalties to Drs Chan, Campbell, and Kalpathy-Cramer in the future. Dr Campbell is a consultant to Boston AI labs; the other authors have indicated they have no financial relationships relevant to this article to disclose.