OBJECTIVES

Early warning scores for detecting clinical deterioration in pediatric inpatients vary widely in performance and rely on a limited number of clinical features. This study developed a machine learning model leveraging multiple static and dynamic clinical features from the electronic health record to predict the composite outcome of unplanned transfer to the ICU within 24 hours or inpatient mortality within 48 hours in hospitalized children.

METHODS

Using a retrospective development cohort of 17 630 encounters across 10 388 patients, 2 machine learning models (light gradient boosting machine [LGBM] and random forest) were trained on 542 features and compared with our institutional Pediatric Early Warning Score (I-PEWS).

RESULTS

The LGBM model significantly outperformed I-PEWS on area under the receiver operating characteristic curve (AUROC) for the composite outcome of ICU transfer or mortality in both the internal validation and temporal validation cohorts (temporal validation AUROC 0.785, 95% confidence interval [0.780–0.791], vs 0.708 [0.701–0.715]) as well as on lead time before deterioration events (median 11 hours vs 3 hours; P = .004). However, LGBM performance as evaluated by the precision-recall curve was worse in the temporal validation cohort, with a lower positive predictive value (6% vs 29%) and a higher number needed to evaluate (17 vs 3) compared with I-PEWS.

CONCLUSIONS

Our electronic health record-based machine learning model demonstrated improved AUROC and lead time in predicting clinical deterioration in pediatric inpatients 24 to 48 hours in advance compared with I-PEWS. Further work is needed to improve the model's positive predictive value before integration into clinical practice.

Children admitted to inpatient hospital settings are at risk for clinical deterioration, which can lead to unanticipated escalation of care or even mortality. Early escalation of care for patients at risk for clinical deterioration is associated with reduced morbidity and improved survival.1–3 To help detect hospitalized children at risk for deterioration, several different pediatric early warning scores (PEWS) have been developed that primarily score patients based on their vital signs and physical exam findings.4,5 Scores are frequently calculated with each nursing bedside assessment, and elevated scores lead to increased monitoring or escalation of care.4,5 Although early warning scores have been validated to predict transfer to the ICU, rapid response team activation, and code events, implementation of Bedside PEWS in a large, international, randomized controlled trial did not decrease mortality among hospitalized pediatric patients.6–9

Most currently implemented PEWS exclusively use a patient’s dynamic features, such as vital signs and clinical appearance. However, static features, such as medical history, can also be potent predictors of deterioration.10 Combining dynamic and static features may improve score performance but may also make score calculation more time-consuming for bedside nurses.11,12 Automated, machine learning-based scoring systems provide several advantages over nurse-calculated scores. They can leverage the rich data in the electronic health record (EHR), such as medication administrations and laboratory results, as well as demographic information and vital signs. Additionally, they can integrate both current values and trends into the scoring system, which can improve model performance.13 Further, they relieve nursing staff of the burden imposed by manual scoring.

In recent studies, machine learning models based primarily on vital signs, or on vital signs combined with laboratory values, have outperformed standard early warning scores at predicting clinical deterioration.14–16 These works demonstrate the powerful potential of machine learning algorithms, but they focused only on evaluation at a single time point or on each patient’s first deterioration event. No prior machine learning model has been optimized for repeated predictions of pediatric deterioration, including repeat events throughout an admission, while also incorporating medication administration and demographic information. The primary objective of the current study is to develop and validate a machine learning algorithm to repeatedly predict pediatric inpatient clinical deterioration throughout a child’s hospital course using both static and dynamic patient features. The secondary objective is to compare performance of the newly developed model with our current institutional bedside scoring system (I-PEWS), which incorporates parental and nursing concerns into standard PEWS and is based on the previously validated Cardiac Children’s Hospital Early Warning Score (C-CHEWS).17

This single center retrospective study was performed at a 190-bed children’s hospital-within-a-hospital at a quaternary, academic healthcare system in the southeastern United States. The study was approved by our Institutional Review Board, which waived the requirement for informed consent for the use of identifiable data (Pro00102839). This study followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis reporting guideline for prognostic studies.18 

Data from pediatric inpatient encounters were extracted from the EHR. Pediatric encounters were defined as all patients younger than 18 years of age as well as patients younger than 25 years of age who were admitted to an inpatient pediatric hospital service rather than an adult service, which commonly occurs for young adults still cared for primarily by pediatric subspecialists. Patients were excluded if their admission was limited to the labor and delivery or neonatal wards. We did not explicitly exclude patients with Do Not Attempt Resuscitation orders as mortality occurs almost exclusively in the ICU for these patients at our institution. The development cohort included all encounters between October 1, 2014 and August 31, 2018, which was divided into a train-test split with 90% of data used for training and 10% for testing, each containing mutually exclusive patients. Encounters between January 1, 2019 and December 31, 2019 were used for temporal validation to allow for assessment of model performance on a distinct cohort of patients from the train-test cohort.
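The patient-disjoint 90/10 train-test split described above can be sketched as follows. This is a minimal illustration only; the DataFrame and its column names (encounter_id, patient_id) are hypothetical, not taken from the study's data model.

```python
# Sketch: hold out ~10% of patients (not encounters) so the train and
# test splits contain mutually exclusive patients.
import numpy as np
import pandas as pd

encounters = pd.DataFrame({
    "encounter_id": range(1, 11),
    "patient_id": ["A", "A", "B", "C", "D", "E", "F", "G", "H", "H"],
})

rng = np.random.default_rng(0)
patients = encounters["patient_id"].unique()
rng.shuffle(patients)

n_test = max(1, round(0.10 * len(patients)))   # ~10% of patients held out
test_patients = set(patients[:n_test])

test = encounters[encounters["patient_id"].isin(test_patients)]
train = encounters[~encounters["patient_id"].isin(test_patients)]

# No patient's encounters leak from training into testing.
assert set(train["patient_id"]).isdisjoint(set(test["patient_id"]))
```

Splitting at the patient level rather than the encounter level prevents a patient with multiple admissions from appearing in both splits, which would inflate test performance.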

A deterioration event was defined as an unplanned transfer to the ICU or inpatient mortality. We performed a sensitivity analysis using the outcome of critical deterioration, defined as use of mechanical ventilation or a vasoactive medication, or death, within 12 hours of ICU transfer.19 Admissions to the ICU from the emergency department or operating room and direct transfers from labor and delivery to the neonatal ICU were not considered deterioration events. All deaths occurring between the time of hospital admission and 72 hours after the time of discharge were considered inpatient mortality events. An encounter could include more than 1 transfer to the ICU. Patients could contribute multiple inpatient encounters during the study timeframe if they had multiple admissions. No clustering was performed at the patient level, although prior ICU admission was a model input.
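The hourly outcome labeling implied by these windows (ICU transfer within the next 24 hours, death within the next 48 hours) can be sketched as below. Timestamps and column names are illustrative assumptions, not the study's actual pipeline.

```python
# Sketch: label each hourly row positive if an unplanned ICU transfer
# falls in (t, t + 24 h] or inpatient death falls in (t, t + 48 h].
import pandas as pd

hours = pd.DataFrame({
    "hour": pd.date_range("2019-01-01 00:00", periods=72, freq="h"),
})
icu_transfers = [pd.Timestamp("2019-01-02 06:00")]  # unplanned ICU transfer times
death_time = pd.Timestamp("2019-01-03 12:00")       # inpatient mortality time (or None)

def deterioration_label(t):
    # Any ICU transfer within the next 24 hours?
    icu = any(pd.Timedelta(0) < e - t <= pd.Timedelta(hours=24)
              for e in icu_transfers)
    # Death within the next 48 hours?
    death = (death_time is not None
             and pd.Timedelta(0) < death_time - t <= pd.Timedelta(hours=48))
    return int(icu or death)

hours["label"] = hours["hour"].apply(deterioration_label)
```

Because an encounter can contain multiple ICU transfers, every transfer contributes its own 24-hour positive window rather than only the first event.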

A total of 227 data elements were selected for inclusion in the model development process based on a combination of expert opinion, literature review, and univariate analysis identifying data elements associated with deterioration events (see Supplemental Information for full details). We did not include race and ethnicity as model inputs but assessed outcomes and model performance by race and ethnicity, because previous work has demonstrated disparate patient outcomes based on the social construct of race.20 Race and ethnicity were recorded in the EHR according to parent or guardian report or self-report.

Data elements used in model development included demographic information, comorbidities, prior inpatient encounters, vital signs, laboratory results, flowsheet measurements, orders, and medication administrations during the encounter. For vital signs and laboratory measurements, 6 features were used: the observed value, a flag indicating a new measurement event, the change in value since the last hour’s measurement, and the 24-hour rolling mean, minimum, and maximum values. We then trained 2 machine learning algorithms on the development cohort using the Python scikit-learn library: light gradient boosting machine (LGBM) and random forest (RF).21 These models differ in their training procedures, but both allow for interpretation of feature importance in the trained model, which provides insight into the patient factors that contribute most to model output. Random forest models build many independent classification trees, with the final result aggregating the individual trees’ predictions, whereas gradient boosted machine models build classification trees sequentially, with each tree learning from previous iterations. This difference can affect a model’s ability to fit the initial dataset as well as its performance in classifying new data.
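The 6 per-signal features described above can be sketched for a single vital sign as follows. This is a minimal, single-encounter illustration with hypothetical column names; hourly rows carry NaN where no new value was charted.

```python
# Sketch: derive the 6 features (value, new-measurement flag, 1-hour
# delta, and 24-hour rolling mean/min/max) for one vital sign.
import pandas as pd

df = pd.DataFrame({"heart_rate": [120.0, None, 118.0, None, None, 140.0]})

raw = df["heart_rate"]
df["hr_new_measurement"] = raw.notna().astype(int)  # flag: new value charted this hour
value = raw.ffill()                                 # carry last observation forward
df["hr_value"] = value
df["hr_delta_1h"] = value - value.shift(1)          # change since last hour's value

roll = value.rolling(window=24, min_periods=1)      # 24-hour rolling window
df["hr_mean_24h"] = roll.mean()
df["hr_min_24h"] = roll.min()
df["hr_max_24h"] = roll.max()
# (In a multi-encounter table, the forward fill, shift, and rolling
# operations would all be grouped per encounter.)
```

The new-measurement flag lets the model distinguish a freshly charted value from one carried forward, which matters because charting frequency itself can signal clinical concern.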

Data are summarized with counts and percentages for categorical variables and medians with interquartile values for continuous variables. Statistical significance was assessed with the Wilcoxon rank sum test and χ² tests. We considered P < .05 significant and did not adjust for multiple comparisons. We evaluated the hourly performance of the machine learning models and I-PEWS in predicting the primary outcome of unplanned ICU transfer within 24 hours or inpatient mortality within 48 hours for both the internal validation and 2019 temporal validation cohorts. We used the area under the receiver operating characteristic curve (AUROC) to evaluate overall ability to differentiate between true positive and true negative events. However, because AUROC is not affected by overall event incidence, we also used the area under the precision-recall curve (AUPRC) to evaluate the precision (positive predictive value) of the model, which balances sensitivity against false positives, an important consideration given the overall low incidence of patient deterioration. We calculated 95% confidence intervals for AUROC and AUPRC using bootstrapping (10 000 iterations with replacement).22 We further calculated the positive predictive value (PPV), number needed to evaluate (NNE), and lead time before an event at different sensitivity thresholds. Lead time was defined as the earliest consecutive hour before deterioration at which the model would flag positive. We performed a sensitivity analysis of model and I-PEWS performance in detecting only critical deterioration events.19
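The evaluation scheme above can be sketched as follows: bootstrapped 95% confidence intervals for AUROC and AUPRC, plus PPV and NNE (= 1/PPV) at one operating threshold. The synthetic labels and scores are stand-ins for real model output, and the scikit-learn metric functions are an assumption consistent with the study's tooling.

```python
# Sketch: bootstrap CIs for AUROC/AUPRC and threshold-based PPV/NNE.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.05, size=2000)                    # ~5% event rate
score = np.clip(rng.normal(0.3 + 0.3 * y, 0.15), 0, 1)  # events score higher

def bootstrap_ci(metric, y, s, n_iter=1000):
    stats = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(y), size=len(y))      # resample with replacement
        if 0 < y[idx].sum() < len(idx):                 # need both classes present
            stats.append(metric(y[idx], s[idx]))
    return np.percentile(stats, [2.5, 97.5])

auroc_ci = bootstrap_ci(roc_auc_score, y, score)
auprc_ci = bootstrap_ci(average_precision_score, y, score)

flagged = score >= 0.5          # one example operating threshold
ppv = y[flagged].mean()         # precision among flagged patient-hours
nne = 1.0 / ppv                 # evaluations required per true event
```

With a ~5% event rate, a model can have a high AUROC yet a low PPV, which is why the NNE at a clinically usable sensitivity is reported alongside the curve-based metrics.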

The development cohort included 17 630 encounters across 10 388 patients. The median age at admission was 6.6 years (interquartile values: 1.6–13.6) (Table 1). Fifty-four percent were male. Among this cohort, 864 (8.3%) unique patients experienced 1330 unplanned ICU transfer events over the course of 1049 (5.8%) unique inpatient encounters. A total of 312 events (23%) occurred in patients who had previously deteriorated during the same admission. Deterioration occurred a median of 101.1 hours (22.2–535.6) after admission. One hundred nine (1.0%) patients died while admitted to the hospital, 27 of whom did not experience an unplanned ICU transfer during the encounter. Patients who deteriorated were younger (3.2 years [0.6–11.2] vs 6.8 years [1.6–13.6], P < .001) and experienced a longer hospital stay (14.2 days [5.9–39.1] vs 3.1 days [1.8–5.3], P < .001). They additionally were more likely to have been admitted nonelectively (P < .001) and more likely to be nonwhite (P = .017).

TABLE 1

Development Cohort Demographics and Encounter-Level Features

Characteristic; Encounters Without Deterioration (N = 16 581); Encounters With Deterioration (N = 1049); P
Number of unique patients 10 017 886  
Age at admission (years) 6.8 [1.7–13.7] 3.2 [0.6–11.2] <.001 
Length of stay (days) 3.1 [1.8–5.3] 14.2 [5.9–39.1] <.001 
Sex 
 Male 8875 (53.5) 588 (56.1) .119 
Race 
 Caucasian or white 8012 (48.3) 473 (45.1) .017 
 African American or Black 5242 (31.6) 340 (32.4)  
 Asian or Pacific Islander 439 (2.6) 26 (2.5)  
 American Indian or Alaskan Native 165 (1.0) 16 (1.5)  
 2 or more races 1201 (7.2) 84 (8.0)  
 Other 1083 (6.5) 91 (8.7)  
 Not reported or declined 439 (2.6) 19 (1.8)  
Ethnicity 
 Hispanic or Latino 1761 (10.6) 118 (11.2) .219 
 Not Hispanic or Latino 14 246 (85.9) 905 (86.3)  
 Not reported or declined 569 (3.4) 26 (2.5)  
Admission source   <.001 
 Home 12 069 (72.8) 632 (60.2)  
 Transfer 2971 (17.9) 299 (28.5)  
 Clinic 1126 (6.8) 64 (6.1)  
 Newborn 370 (2.2) 53 (5.1)  
 Other 45 (0.3) 1 (0.1)  
Admission type 
 Emergency admission 7092 (42.8) 493 (47.0) <.001 
 Routine elective admission 5232 (31.6) 172 (16.4)  
 Urgent admission 3906 (23.6) 333 (31.7)  
 Newborn 348 (2.1) 51 (4.9)  
 Other 3 (0.0) 0 (0)  
Number of prior encounters 0.0 [0.0–1.0] 0.0 [0.0–2.0] <.001 
Days since last encounter 57.9 [20.9–169.1] 59.0 [25.6–149.7] .696 
Inpatient mortality — 109 (10.4) — 
All-cause mortality 603 (3.6) 190 (18.1) <.001 

Data presented as count (percentage) and median [interquartile values]. This table was prepared using the Python tableone library.36  —, N/A.

A total of 2 370 688 hours were represented in the training dataset, 28 302 (1.2%) of which fell within the prediction window for the combined deterioration outcome. As assessed by AUROC, the LGBM model (AUROC 0.847, 95% confidence interval [0.840–0.854]) outperformed the RF model (0.814 [0.806–0.822]) and I-PEWS (0.690 [0.687–0.693]) in the internal validation cohort (Fig 1). The AUPRC was also higher for the LGBM model (0.082 [0.076–0.090]) compared with the RF model (0.067 [0.061–0.075]) and I-PEWS (0.066 [0.063–0.069]). I-PEWS had a high positive predictive value (PPV) at low-sensitivity thresholds but declined sharply in PPV as sensitivity increased, whereas the LGBM model maintained a higher PPV across sensitivity values.

FIGURE 1

Model receiver operator characteristic (ROC) and precision-recall (PRC) curves from internal validation (test data from the development cohort). The corresponding values of area under the curve (AUC) for each model are shown in the legend.


The 25 most important features in the 2 models are compared in Supplemental Fig 4. Vital sign-based features were most commonly represented (48% of the combined top features across the 2 models), with features based on comorbidities (24%), laboratory values (16%), and demographics (10%) also common. The 24-hour minimum pulse oximetry saturation was the most important feature for the RF model and second most important for the LGBM model. Hour of encounter was the most important feature for the LGBM model, which corresponded to the peak in both the number and rate of deterioration events within the first 12 hours of admission, with a notable decline after 24 hours (Supplemental Fig 5). Comorbid respiratory failure and cardiac dysrhythmias were prominent features in both models, indicating that patients with a prior history of these conditions were at higher risk for deterioration.

The 2019 temporal validation cohort included 4534 encounters from 3265 unique patients (Table 2). Fifty-seven events (17%) occurred in patients who had previously deteriorated during the same admission. Overall demographics were similar to the training cohort. Patients who deteriorated were similarly younger (3.0 years [0.5–11.1] vs 6.1 years [1.2–13.3], P < .001) and more likely to be nonwhite (P = .002) and admitted nonelectively (P < .001) than those without events. Again, the LGBM model outperformed the other methods as measured by AUROC (0.785 [0.780–0.791]) compared with RF (0.743 [0.738–0.749]) and I-PEWS (0.708 [0.701–0.715]) (Fig 2). LGBM AUROC performance remained high across racial groups (Supplemental Table 4). The AUPRC for I-PEWS was significantly higher than for LGBM (I-PEWS: 0.068 [0.063–0.073]; LGBM: 0.046 [0.043–0.048]) (Fig 2). Although the LGBM model identified patients earlier than I-PEWS (median lead time before deterioration 11 hours [interquartile values 2–19] vs 3 hours [1–11]; P = .004), it conferred a lower positive predictive value (6% vs 29%) at the sensitivity corresponding to the I-PEWS threshold for activating a rapid response (9% sensitivity) (Table 3). This lower positive predictive value corresponds to an additional 13 patient evaluations for each true deterioration event.

TABLE 2

Temporal Validation Cohort Demographics and Encounter-Level Features

Characteristic; Encounters Without Deterioration (N = 4265); Encounters With Deterioration (N = 269); P
Number of unique patients 3 135 244  
Age at admission (years) 6.1 [1.2–13.3] 3.0 [0.5–11.1] <.001 
Length of stay (days) 3.1 [1.9–5.3] 10.2 [5.2–29.9] <.001 
Sex 
 Male 2165 (50.8) 151 (56.1) .1 
Race 
 Caucasian or white 2110 (49.4) 110 (40.1) .002 
 African American or Black 1225 (28.7) 101 (37.5)  
 Asian or Pacific Islander 125 (2.9) 15 (5.6)  
 American Indian or Alaskan Native 30 (0.7) 1 (0.0)  
 2 or more races 381 (8.9) 18 (6.7)  
 Other 236 (5.5) 19 (7.1)  
 Not reported or declined 158 (3.7) 5 (1.9)  
Ethnicity 
 Hispanic or Latino 489 (11.5) 30 (11.1) .8 
 Not Hispanic or Latino 3565 (83.6) 228 (84.8)  
 Not reported or declined 211 (4.9) 11 (4.1)  
Admission source 
 Home 3084 (72.3) 179 (66.5) .008 
 Transfer 789 (18.5) 69 (25.7)  
 Clinic 258 (6.0) 10 (3.7)  
 Newborn 123 (2.9) 11 (4.1)  
 Other 11 (0.3) 0 (0)  
Admission type 
 Emergency admission 2227 (52.2) 168 (62.4) <.001 
 Routine elective admission 1330 (31.1) 51 (19.0)  
 Urgent admission 585 (13.7) 39 (14.5)  
 Newborn 123 (2.9) 11 (4.1)  
Inpatient mortality — 9 (3.3) — 
All-cause mortality 28 (0.6) 12 (4.5) <.001 

Data given as count (percentage) and median [interquartile values]. —, N/A.

FIGURE 2

Model receiver operator characteristic (ROC) and precision-recall (PRC) curves from temporal validation. The corresponding values of area under the curve (AUC) for each model are shown in the legend.

TABLE 3

Performance at Different Sensitivity Thresholds of I-Pews and LGBM Model

Columns: Sensitivity; I-PEWS (PEWS Score, Specificity, PPV, NNE, Leadtime [h]); LGBM (Specificity, PPV, NNE, Leadtime [h])
0.9 — — — — — 0.43 0.02 50 24 (13–24) 
0.8 — — — — — 0.62 0.02 50 24 (12–24) 
0.7 — — — — — 0.75 0.03 33 24 (10–24) 
0.6 0.76 0.03 33 24 (13–24) 0.82 0.04 25 22 (8.25–24) 
0.33 0.96 0.08 13 19 (4–24) 0.93 0.06 17 18 (7–24) 
0.16 0.99 0.19 8 (2–22) 0.97 0.06 17 14 (4–22) 
0.09 4 1 0.29 3 3 (1–11.25) 0.98 0.06 17 11 (2–19) 
0.06 0.31 2 (1–4.25) 0.99 0.05 20 8.5 (2–16.25) 
0.04 0.32 1 (1–3) 0.99 0.07 14 10 (3–17) 

Leadtime in hours of patient detection before deterioration event presented as median (interquartile values). I-PEWS score of 4 is current institution threshold for initiating a rapid response. —, N/A.

In a sensitivity analysis evaluating performance for predicting critical deterioration, the LGBM model outperformed I-PEWS by AUROC for both the internal validation (0.808 [0.792–0.823] vs 0.716 [0.710–0.723]) and 2019 temporal validation cohorts (0.827 [0.816–0.838] vs 0.753 [0.739–0.767]) as well as in lead time (median 17 hours [2–11] vs 2 hours [1–9]). However, it had a lower AUPRC (0.017 [0.013–0.021] vs 0.020 [0.017–0.025]) and positive predictive value (2% vs 6%) in the 2019 temporal validation cohort (Supplemental Figs 6 and 7, Supplemental Table 7).

We have developed a machine learning model that demonstrated increased AUROC and greater lead time in predicting pediatric clinical deterioration compared with our current standard of care, I-PEWS; however, it also has notable limitations, including a lower positive predictive value and an associated higher number needed to evaluate. Our model generates hourly predictions of clinical deterioration by incorporating numerous patient features from the EHR, including medication administration records as well as trends in vital sign and laboratory value data. Although such a complex model has the potential to improve predictions of deterioration, further work will be needed to enhance the model before it can be a clinically effective tool.

Previous machine learning models for pediatric patients have demonstrated the ability to predict ICU transfer within a fixed window of 24 hours from admission or within a shorter horizon of 6 to 8 hours before the event.14,23–25 More recently, models trained on vital signs alone, or on vital signs combined with laboratory values, have shown improved sensitivity, specificity, and positive predictive value compared with their respective institutional early warning scores.15,16 Compared with these more recent models, we explicitly evaluated all deterioration events during an admission rather than only the first such event. Notably, 22% of deterioration events occurred in patients with prior deterioration, indicating that this is a substantial group of patients to capture. Additionally, our model incorporated greater complexity: we included medication administration, demographic, and comorbidity data, and we used 24-hour trends for vital sign and laboratory values rather than single values. For example, the 24-hour peak temperature and respiratory rate and the 24-hour lowest platelet count were all among the 10 most important features for the LGBM model. Previous studies have shown the benefits of these additional features in model development.10,13,26–28 Finally, we trained our model on the timeline of 24 hours before an event. Reflecting this, our model predicted deterioration a median of 11 hours before an event and 17 hours before a critical event, potentially providing a clinically meaningful interval for therapy to improve the patient’s trajectory. However, this increased sensitivity may come with the tradeoff of decreased positive predictive value.

The broader aim of constructing an hourly model is to integrate the technology in a hospital setting and use real-time data on admitted patients to provide hourly predictions of future clinical deterioration. Accordingly, effective early warning systems must be adequately sensitive to detect deteriorating health status in advance while maintaining adequate specificity to minimize the additional workload of evaluating false positive patients who are not at risk for deterioration. Prior implementation of an early warning score for adult patients did not improve outcomes, likely because of low positive predictive value and alarm fatigue among frontline staff.29 Although our developed LGBM model outperformed I-PEWS by several metrics, it demonstrated an inferior positive predictive value at the currently used I-PEWS sensitivity level. This translated to 13 additional patient evaluations to detect 1 deterioration event. The time and cost of these additional assessments, as well as the potentially associated alarm fatigue, were not evaluated but may be substantial. Before model implementation, we will need to ensure an adequate balance of sensitivity, lead time before an event, and positive predictive value. Additionally, machine learning models must balance the number of features against model performance. Simpler models with fewer variables may be more generalizable and easier to implement across health systems.30 The performance of our model may not yet justify its additional complexity.

The standard for comparison in this study was I-PEWS (Supplemental Fig 3). At our institution, I-PEWS is measured as part of the nursing assessment at least every 4 hours. In the current workflow, if a patient scores a 2 in any category or has a total score of 3, the primary team and charge nurse are notified, and a bedside evaluation is typically requested. If a patient scores a 3 in any category or has a total score of 4 or higher, a pediatric rapid response team is activated to assess the patient. This is similar to the score and escalation-of-care algorithm used by C-CHEWS, which has previously been validated to demonstrate both higher discrimination and longer early warning time than the Brighton PEWS.31 However, neither I-PEWS nor C-CHEWS was designed to predict clinical deterioration as far in advance as 24 or 48 hours. The median time of an elevated C-CHEWS score (≥3) before critical deterioration was 11.1 hours, and for a critical C-CHEWS score (≥5), the time horizon was only 3.8 hours.31 Similarly, the median lead time for I-PEWS was 3 hours in our 2019 temporal validation dataset and only 2 hours before a critical deterioration. These short warning periods may limit the opportunity for intervention and reflect the need for improved methods of detecting deterioration.

Our study identified that the incidence of clinical deterioration differed by patient race, raising concern that this social construct may affect patient outcomes. This discrepancy may stem from differences in prehospital care, initial patient triage, or care provided in the hospital. Prior studies have identified that patient race and ethnicity may affect the prehospital care children receive after out-of-hospital cardiac arrest as well as the length of stay for common pediatric admissions.20,32 Before deploying a machine learning-based model, it is critical to ensure that it improves patient equity rather than preserving disparities.33 Our LGBM model had a high AUROC across racial groups in our 2019 temporal validation cohort, but we will need to ensure continued high performance when we deploy any model prospectively.

This study has several limitations. The model would not accurately capture scenarios in which clinical concern or an elevated I-PEWS score triggers a patient evaluation that leads to a necessary intervention but not, ultimately, to ICU transfer. These events would be labeled false positives by the model even though the patient may have received appropriate, timely care. Conversely, in addition to objective measures, I-PEWS incorporates nursing and parental concern, both subjective measures that could trigger rapid response evaluation and ICU transfer even though the patient may not have experienced a true deterioration event. We attempted to control for potentially unnecessary ICU transfers with the critical deterioration outcome, but real-time, prospective evaluation will be needed to fully assess model performance.

Model performance in the 2019 temporal validation cohort was notably decreased compared with the internal validation cohort. This may reflect small changes in patient population or clinical practice over time, and the difference highlights the need for prospective validation on an independent cohort before model implementation. Further, our model was trained and evaluated at a single center, and additional optimization and external validation would be required for use at a different institution. Additionally, because these models were trained on a general cohort of pediatric patients, they may not be optimally calibrated for specific patient subgroups. For example, decompensation can be difficult to recognize in children with congenital heart disease because of baseline abnormalities in underlying physiology.17,34,35 Future work could apply a similar modeling approach trained in these specific populations to provide patient-specific, real-time predictions for these children. Similarly, because our institution’s pediatric rapid response team workflow targets patients on the pediatric intermediate and stepdown units, 4 major patient populations were excluded from the development cohort: healthy neonates, pediatric obstetric patients, ICU patients, and emergency department patients. The models in this study would not be indicated for use in these populations, each of which would likely require a different risk stratification model because of its unique physiology.

In this study, using a wide range of clinical data available from the EHR, we developed a machine learning model with improved AUROC and lead time before an event in predicting clinical deterioration in pediatric inpatients within 24 to 48 hours compared with our current institutional standard of care, I-PEWS. Although the model had a lower positive predictive value and a correspondingly higher number needed to evaluate, it demonstrates the potential to improve patient detection and lead time before a deterioration event, potentially mitigating morbidity and mortality in pediatric patients. Future work should study how to efficiently integrate such models into clinical workflows to improve the outcomes of hospitalized children.

The authors would like to acknowledge Duke Pediatrics Research Scholars.

Dr Foote conceptualized and designed the study, performed data analysis and interpretation, and helped draft the initial manuscript; Dr Shaikh conceptualized and designed the study, developed and trained the machine learning models, performed data analysis and interpretation, and helped draft the initial manuscript; Drs Witt and Sendak, and Mr Shen, Mr Ratliff, Mr Shi, Mr Gao, Mr Nichols, and Mr Balu conceptualized and designed the study, developed and trained the machine learning models, and performed data analysis and interpretation; Ms Osborne conceptualized and designed the study; Drs Kumar, Jackson, McCrary, and Li conceptualized and designed the study and supervised data analysis and interpretation; and all authors reviewed and revised the manuscript and approved the final manuscript as submitted.

FUNDING: This project received funding from the Duke Institute for Health Innovation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding organization. The funding organization had no role in the design, preparation, review, or approval of this paper.

CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no potential conflicts of interest to disclose.

1. Young MP, Gooder VJ, McBride K, et al. Inpatient transfer to the intensive care unit. J Gen Intern Med. 2003;18(2):77–83.
2. Tibballs J, Kinney S. Reduction of hospital mortality and of preventable cardiac arrest and death on introduction of a pediatric medical emergency team. Pediatr Crit Care Med. 2009;10(3):306–312.
3. Mehta SD, Muthu N, Yehya N, et al. Leveraging EHR data to evaluate the association of late recognition of deterioration with outcomes. Hosp Pediatr. 2022;12(5):447–460.
4. Parshuram CS, Hutchison J, Middaugh K. Development and initial validation of the Bedside Paediatric Early Warning System score. Crit Care. 2009;13(4):R135.
5. Monaghan A. Detecting and managing deterioration in children. Paediatr Nurs. 2005;17(1):32–35.
6. Tucker KM, Brewer TL, Baker RB, Demeritt B, Vossmeyer MT. Prospective evaluation of a pediatric inpatient early warning scoring system. J Spec Pediatr Nurs. 2009;14(2):79–85.
7. Akre M, Finkelstein M, Erickson M, Liu M, Vanderbilt L, Billman G. Sensitivity of the pediatric early warning score to identify patient deterioration. Pediatrics. 2010;125(4):e763–e769.
8. Parshuram CS, Duncan HP, Joffe AR, et al. Multicentre validation of the bedside paediatric early warning system score: a severity of illness score to detect evolving critical illness in hospitalised children. Crit Care. 2011;15(4):R184.
9. Parshuram CS, Dryden-Palmer K, Farrell C, et al; Canadian Critical Care Trials Group and the EPOCH Investigators. Effect of a pediatric early warning system on all-cause mortality in hospitalized pediatric patients: the EPOCH randomized clinical trial. JAMA. 2018;319(10):1002–1012.
10. Bonafide CP, Holmes JH, Nadkarni VM, Lin R, Landis JR, Keren R. Development of a score to predict clinical deterioration in hospitalized children. J Hosp Med. 2012;7(4):345–349.
11. Duncan H, Hutchison J, Parshuram CS. The Pediatric Early Warning System score: a severity of illness score to predict urgent medical need in hospitalized children. J Crit Care. 2006;21(3):271–278.
12. Robson MA, Cooper CL, Medicus LA, Quintero MJ, Zuniga SA. Comparison of three acute care pediatric early warning scoring tools. J Pediatr Nurs. 2013;28(6):e33–e41.
13. Churpek MM, Adhikari R, Edelson DP. The value of vital sign trends for detecting clinical deterioration on the wards. Resuscitation. 2016;102:1–5.
14. Zhai H, Brady P, Li Q, et al. Developing and evaluating a machine learning based algorithm to predict the need of pediatric intensive care unit transfer for newly hospitalized children. Resuscitation. 2014;85(8):1065–1071.
15. Mayampurath A, Jani P, Dai Y, Gibbons R, Edelson D, Churpek MM. A vital sign-based model to predict clinical deterioration in hospitalized children. Pediatr Crit Care Med. 2020;21(9):820–826.
16. Mayampurath A, Sanchez-Pinto LN, Hegermiller E, et al. Development and external validation of a machine learning model for prediction of potential transfer to the PICU. Pediatr Crit Care Med. 2022;23(7):514–523.
17. McLellan MC, Gauvreau K, Connor JA. Validation of the Cardiac Children’s Hospital Early Warning Score: an early warning scoring tool to prevent cardiopulmonary arrests in children with heart disease. Congenit Heart Dis. 2014;9(3):194–202.
18. Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162(1):W1-73.
19. Bonafide CP, Localio AR, Roberts KE, Nadkarni VM, Weirich CM, Keren R. Impact of rapid response system implementation on critical deterioration events in children. JAMA Pediatr. 2014;168(1):25–33.
20. Harrington Y, Rauch DA, Leary JC. Racial and ethnic disparities in length of stay for common pediatric diagnoses: trends from 2016 to 2019. Hosp Pediatr. 2023;13(4):275–282.
21. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(85):2825–2830.
22. Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat Med. 2000;19(9):1141–1164.
23. Shah S, Ledbetter D, Aczon M, et al. Early prediction of patient deterioration using machine learning techniques with time series data. Crit Care Med. 2016;44(12):87.
24. Wellner B, Grand J, Canzone E, et al. Predicting unplanned transfers to the intensive care unit: a machine learning approach leveraging diverse clinical elements. JMIR Med Inform. 2017;5(4):e45.
25. Rubin J, Potes C, Xu-Wilson M, et al. An ensemble boosting model for predicting transfer to the pediatric intensive care unit. Int J Med Inform. 2018;112:15–20.
26. Huang EJ, Bonafide CP, Keren R, Nadkarni VM, Holmes JH. Medications associated with clinical deterioration in hospitalized children. J Hosp Med. 2013;8(5):254–260.
27. Bedoya AD, Futoma J, Clement ME, et al. Machine learning for early detection of sepsis: an internal and temporal validation study. JAMIA Open. 2020;3(2):252–260.
28. Ruiz VM, Goldsmith MP, Shi L, et al. Early prediction of clinical deterioration using data-driven machine-learning modeling of electronic health records. J Thorac Cardiovasc Surg. 2022;164(1):211–222.e3.
29. Bedoya AD, Clement ME, Phelan M, Steorts RC, O’Brien C, Goldstein BA. Minimal impact of implemented early warning score and best practice alert for patient deterioration. Crit Care Med. 2019;47(1):49–55.
30. Sanchez-Pinto LN, Bennett TD. Evaluation of machine learning models for clinical prediction problems. Pediatr Crit Care Med. 2022;23(5):405–408.
31. McLellan MC, Gauvreau K, Connor JA. Validation of the Children’s Hospital Early Warning System for critical deterioration recognition. J Pediatr Nurs. 2017;32:52–58.
32. Naim MY, Griffis HM, Burke RV, et al. Race/ethnicity and neighborhood characteristics are associated with bystander cardiopulmonary resuscitation in pediatric out-of-hospital cardiac arrest in the United States: a study from CARES. J Am Heart Assoc. 2019;8(14):e012637.
33. Rojas JC, Fahrenbach J, Makhni S, et al. Framework for integrating equity into machine learning models: a case study. Chest. 2022;161(6):1621–1627.
34. Olive MK, Owens GE. Current monitoring and innovative predictive modeling to improve care in the pediatric cardiac intensive care unit. Transl Pediatr. 2018;7(2):120–128.
35. Rusin CG, Acosta SI, Shekerdemian LS, et al. Prediction of imminent, severe deterioration of children with parallel circulations using real-time processing of physiologic data. J Thorac Cardiovasc Surg. 2016;152(1):171–177.
36. Pollard TJ, Johnson AEW, Raffa JD, Mark RG. tableone: an open source Python package for producing summary statistics for research papers. JAMIA Open. 2018;1(1):26–31.
37. Kansal A, Gao M, Balu S, et al. Impact of diagnosis code grouping method on clinical prediction model performance: a multi-site retrospective observational study. Int J Med Inform. 2021;151:104466.