Recent decision rules for the management of febrile infants support the identification of infants at higher risk of serious bacterial infections (SBIs) without the performance of routine lumbar puncture. We derive and validate a model to identify febrile infants ≤60 days of age at low risk for SBIs using supervised machine learning approaches.
We conducted a secondary analysis of a multicenter prospective study of febrile infants performed between December 2008 and May 2013. Our outcome was SBI (culture-positive urinary tract infection, bacteremia, and/or bacterial meningitis). We developed and validated 4 supervised learning models: logistic regression, random forest, support vector machine, and a single-hidden layer neural network.
A total of 1470 patients were included (1014 >28 days old). One hundred thirty-eight (9.3%) had SBIs (122 urinary tract infections, 20 bacteremia, and 8 meningitis; 11 with concurrent SBIs). Using 4 features (urinalysis, white blood cell count, absolute neutrophil count, and procalcitonin), the random forest model demonstrated the highest specificity (74.9%, 95% confidence interval: 71.5%–78.2%) with a sensitivity of 98.6% (95% confidence interval: 92.2%–100.0%) in the validation cohort. One patient with bacteremia was misclassified. Among 1240 patients who received a lumbar puncture, this model could have prevented 849 (68.5%) such procedures.
We derived and internally validated a supervised learning model for the risk-stratification of febrile infants. Although computationally complex, lacking parameter cutoffs, and in need of external validation, this strategy may allow for reductions in unnecessary procedures, hospitalizations, and antibiotics while maintaining excellent sensitivity.
Decision rules using serum procalcitonin for the risk-stratification of well-appearing febrile infants ≤60 days of age for serious bacterial infection have been described. The diagnostic value of machine learning models for this purpose has not been reported.
We evaluated machine learning algorithms for the identification of serious bacterial infections from a multicenter prospective study. The random forest model demonstrated the greatest accuracy, with a sensitivity of 99% and specificity of 75% in an internally validated cohort.
The emergency department (ED) management of febrile infants ≤60 days of age remains an area of ongoing investigation. Decision rules prioritize high sensitivity in the risk-stratification of infants for serious bacterial infections (SBIs; urinary tract infection [UTI], bacterial meningitis, and/or bacteremia).1–5 Recent rules report sensitivities >90% and specificities between 30% and 60%, without routine use of lumbar puncture.6–8 Some recent strategies use procalcitonin, a biomarker with high diagnostic accuracy for SBI.9,10
Although 10% of febrile infants ≤60 days of age have an SBI,7,11–13 a large proportion without SBI are classified as false-positives by using past decision rules,1–3,6–8 leading to unnecessary lumbar punctures, hospitalizations, antimicrobial use, and parental anxiety.14,15 Historic criteria were developed by expert opinion.1–5,8 Some recent rules were developed by using methods such as logistic regression or recursive partitioning.6,7 The characteristics of machine learning–based models to classify infants with SBI remain unexplored. Supervised learning, in which computer algorithms are used to create models to assign input parameters toward a preassigned outcome, is one commonly performed application of machine learning.16 In pediatrics, such approaches have been used in several arenas, including in the classification of patients with appendicitis17 and in the ED triage of children.18 Machine learning approaches may demonstrate superior classification accuracy compared with conventional statistical techniques, as demonstrated in one investigation evaluating pediatric patients with traumatic head injury.19,20 Applied toward febrile infants, these models may be able to maintain high sensitivity toward the identification of infants with SBI with improved specificity.
We evaluated the test characteristics of machine learning models to predict outcomes of SBI among infants presenting to the ED with fever and compared these models with previously established clinical prediction rules.6–8
Methods
Study Design and Setting
We performed a secondary analysis of a multicenter prospective study performed between December 2008 and May 2013 by the Pediatric Emergency Care Applied Research Network (PECARN) in 26 EDs. The current study was performed by using a public use data set obtained from the parent study and was designated nonhuman subjects research by our institutional review board.21 Analyses were conducted by using R version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria).
Selection of Participants
Inclusion criteria of the parent study have been published.22 A convenience sample of patients ≤60 days of age with fever (≥38°C in the preceding 24 hours) evaluated in study site EDs was included. Patients were excluded from the parent study for clinical sepsis, recent antibiotic use, prematurity, preexisting medical conditions, or soft tissue infections. We additionally excluded encounters that were missing any of the following: blood or urine cultures, serum procalcitonin (retaining levels performed on the first or second day of admission), white blood cell count (WBC), absolute neutrophil count (ANC), or urinalysis. We excluded encounters lacking cerebrospinal fluid (CSF) cultures or a phone or medical record review performed 8 to 14 days after the encounter to verify that discharged patients without lumbar puncture did not have bacterial meningitis. We developed our model with procalcitonin given the high diagnostic accuracy reported for this predictor of SBI.9,10,23,24 Similar inclusion criteria were used in the PECARN decision rule.7
Outcome and Predictors of Interest
Our outcome was SBI, defined as bacterial meningitis, bacteremia, and/or UTI. Cultures were reviewed by using standardized criteria from the parent study.25 UTI was defined as ≥1000 colony-forming units (CFUs) per milliliter from urine culture obtained via suprapubic aspiration, ≥50 000 CFU/mL from catheterization, or 10 000 to 49 999 CFU/mL from catheterization with a positive urinalysis (leukocyte esterase, nitrite, or >5 WBC per high-power field).22 Candidate features included age, attending or fellow clinical suspicion for risk of SBI (dichotomized as 1%–10% and >10%), sex, fever location (in enrolling ED or elsewhere), WBC, ANC, serum procalcitonin, and urinalysis.
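The colony-count criteria above can be expressed as a small classification rule. The following sketch encodes them in Python (the study's analyses were performed in R; the function and parameter names here are illustrative, not from the study code):

```python
def is_uti(cfu_per_ml: int, method: str, urinalysis_positive: bool) -> bool:
    """Classify a urine culture as UTI under the study's thresholds.

    method: "spa" (suprapubic aspiration) or "cath" (catheterization).
    """
    if method == "spa":
        # >=1000 CFU/mL from suprapubic aspiration qualifies.
        return cfu_per_ml >= 1_000
    if method == "cath":
        # >=50 000 CFU/mL alone, or 10 000-49 999 CFU/mL with a
        # positive urinalysis (leukocyte esterase, nitrite, or pyuria).
        if cfu_per_ml >= 50_000:
            return True
        return 10_000 <= cfu_per_ml <= 49_999 and urinalysis_positive
    return False

# An intermediate catheterized colony count needs urinalysis support:
print(is_uti(20_000, "cath", False))  # False
print(is_uti(20_000, "cath", True))   # True
```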
Feature Selection
We performed 2 steps to identify and retain features that were maximally predictive of the outcome while removing features that would limit accuracy. First, we performed univariate logistic regression to test for association of predictors with SBI, retaining variables with P < .10. Second, we performed principal component analysis (PCA), in which an orthogonal transformation is applied to linearly uncorrelate variables and identify features that discriminated positive from negative cases of SBI. In PCA biplots, a small angle between vectors implies a positive correlation, a large angle suggests a negative correlation, and a 90° angle indicates no correlation between characteristics. The farther these vectors lie from the principal component origin, the greater the influence they carry in PCA.26
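As an illustration of the PCA step (the study used R; this scikit-learn sketch uses simulated data and illustrative names), the feature loadings on PC1 and PC2 play the role of the biplot vectors: large-magnitude loadings correspond to vectors far from the origin.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a feature matrix whose 4 columns might
# represent, e.g., WBC, ANC, procalcitonin, and urinalysis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Standardize features, then project onto the first 2 principal components.
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)

# Loadings: each feature's weight on PC1/PC2. On a biplot, features with
# large-magnitude loadings appear as long vectors far from the origin.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.shape)  # (4, 2)
```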
Supervised Learning Models
We randomized patients into derivation and validation cohorts with equal partitions, balanced to ensure equal numbers of SBIs. We developed 4 supervised learning models.27 First, we performed multivariable logistic regression. Logistic regression uses a logistic function to model a binary outcome, resulting in the generation of coefficients that can be used to predict the logarithm of the odds of SBI. This model was optimized to obtain the lowest Akaike information criterion using bidirectional stepwise regression. Second, we performed random forest modeling, an ensemble method (in which multiple algorithms are used) whereby numerous decision trees are developed on random subsets of features during model development. These are then run individually and averaged to calculate probabilities of an outcome. The hyperparameter (a parameter used to tune the training process) for the number of variables available for splitting at each tree node was derived from features identified during PCA and verified by using a 10-fold cross-validation repeated 3 times. Third, we performed modeling using a support vector machine (SVM) with a radial kernel. SVM is an algorithm in which a boundary is created with the widest margins between each type of predetermined classification. Hyperparameters were identified using 10-fold cross-validation. Fourth, we fitted a single-hidden layer neural network as a feed-forward multilayer perceptron. The neural network consists of 3 "layers" in which data are input, analyzed through a series of transformations, then output into results. We used bootstrapped resampling performed 25 times to identify final values for size and decay functions. Analyses were performed using the MASS (version 7.3-51.4), randomForest (version 4.6-14), e1071 (version 1.7-2), and nnet (version 7.3-12) packages.
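A minimal sketch of the 4 model classes, using scikit-learn stand-ins for the R packages named above (the data, hyperparameters, and names here are illustrative, not the study's tuned values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the 4 retained features and a binary SBI label.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 1).astype(int)

# Equal split, stratified so both halves carry the same outcome rate.
X_der, X_val, y_der, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=1),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=1),
    "neural_net": MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                                random_state=1),
}

# Fit on the derivation half; score probabilities on the validation half.
probs = {name: m.fit(X_der, y_der).predict_proba(X_val)[:, 1]
         for name, m in models.items()}
```

In practice, each probability vector would then be thresholded (see the cut-point estimation below) before computing sensitivity and specificity.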
Model Validation and Cut-Point Estimation
Models were assessed by using the validation cohort. We estimated the area under the receiver operating characteristic curve (AUROC). Predictions from logistic regression were converted to probabilities using the inverse logit formula. Optimal thresholds were estimated for each model by using a misclassification cost algorithm in which the relative cost of false-negatives was 100 times greater than that of false-positives.7 For each model, we reported (with 95% confidence intervals [CIs]) parameters of diagnostic accuracy, the area under the precision-recall curve (AUPRC), and the F1 score. We additionally compared models with 3 recently published models: PECARN, Step-by-Step, and Aronson (Supplemental Table 6).6–8 For the Step-by-Step model, we determined a priori that an infant with >10% physician suspicion of SBI met the criteria for "ill-appearing," and we classified intermediate-risk patients as high risk.8 For the Aronson model, we used a score of ≥2 to classify patients into high- and low-risk groups.6 We assessed the lumbar punctures that may have been avoided by low-risk group identification from each model.
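The cost-based threshold search can be sketched as follows (an illustrative implementation, not the study's code): every observed predicted probability is tried as a cutoff, and the cutoff minimizing 100 × false-negatives + 1 × false-positives is kept.

```python
import numpy as np

def cost_optimal_threshold(y_true, p_hat, fn_cost=100.0, fp_cost=1.0):
    """Pick the probability cutoff minimizing total misclassification cost,
    with false-negatives weighted fn_cost:fp_cost (100:1 in the study)."""
    best_t, best_cost = 0.5, np.inf
    for t in np.unique(p_hat):
        pred = p_hat >= t
        fn = np.sum((y_true == 1) & ~pred)  # missed SBIs
        fp = np.sum((y_true == 0) & pred)   # over-called non-SBIs
        cost = fn_cost * fn + fp_cost * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

y = np.array([0, 0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.6, 0.7, 0.9])
print(cost_optimal_threshold(y, p))  # 0.7: no false-negatives, no false-positives
```

The heavy false-negative weight pushes the chosen cutoff low whenever lowering it rescues even one missed case, which is how these models maintain near-perfect sensitivity.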
Additional Analysis
We performed exploratory analyses using modified inclusion criteria. First, patients with complete and interpretable culture data were considered for inclusion. Predictors of interest that were missing in >15% of cases were removed, because imputation on data with high proportions of missingness can be a source of bias.28 For the remaining cases, we imputed missing values by using predictive mean matching and then validated the highest-performing machine learning model. In a post hoc analysis, imputation was performed for all missing data from patients with complete and interpretable cultures. Second, we performed an analysis using an outcome limited to bacterial meningitis and bacteremia. Third, we redeveloped the model with the highest accuracy using a modified outcome in which patients diagnosed with UTI in the primary analysis by colony counts of 10 000 to 49 999 CFU/mL with a positive urinalysis were considered non-SBIs; our outcome remained SBI. Fourth, given the limited availability of procalcitonin in North America,29,30 we performed an analysis creating a machine learning model without this test. We additionally evaluated the performance of such a model using only objective measures.
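Predictive mean matching can be sketched as below (a simplified single-imputation version with illustrative names; the study's implementation details are not specified): a regression fit on complete cases yields predicted means, and each missing value is filled with the observed value of a nearby donor, so imputed values are always plausible observed measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(x, predictors, k=5, rng=None):
    """Single predictive-mean-matching draw.

    Fit a regression on complete cases, then fill each missing entry of x
    with an observed value whose predicted mean is among the k closest.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    obs = ~np.isnan(x)
    model = LinearRegression().fit(predictors[obs], x[obs])
    pred_all = model.predict(predictors)
    filled = x.copy()
    for i in np.where(~obs)[0]:
        # Distance in predicted-mean space to every observed case.
        d = np.abs(pred_all[obs] - pred_all[i])
        donors = x[obs][np.argsort(d)[:k]]
        filled[i] = rng.choice(donors)  # donate an actually observed value
    return filled
```

Because donors are real observations, PMM avoids impossible imputed values (e.g., negative cell counts) that a plain regression prediction could produce.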
Results
Patient Inclusion and Descriptive Data
Of 7335 encounters, 1470 were included (Fig 1). A total of 1014 patients (69.0%) were >28 days old. A total of 138 of 1470 (9.3%) patients had SBIs, including 122 (8.3%), 20 (1.4%), and 8 (0.5%) with UTI, bacteremia, and bacterial meningitis, respectively. Eleven (0.7%) had concurrent SBIs. A total of 1240 (84.4%) patients had a lumbar puncture performed, and 1050 (71.4%) were admitted.
Patient inclusion and randomization. aFor a sensitivity analysis, 4247 encounters with complete and interpretable cultures were used, and imputation was performed for cases with missing variables. CBC, complete blood count.
Candidate Variable Selection
Predictors significantly associated with SBI were age, suspicion of SBI, sex, WBC, ANC, positive urinalysis, and procalcitonin (Table 1). In the evaluation of non-SBI cases on the PCA biplot (Supplemental Fig 4), age, urinalysis, and sex strongly influenced the second principal component (PC2), whereas WBC, ANC, and procalcitonin influenced the first principal component (PC1). In particular, ANC and WBC were positively correlated, with a small angle between their vectors in the PC1 dimension. For positive SBI cases, urinalysis, procalcitonin, physician impression of SBI, WBC, and ANC separated PC1; age and sex influenced PC2. Four variables (urinalysis, procalcitonin, ANC, and WBC) were retained for modeling on the basis of the distances of their vectors from the origin and the correlations among variables on the PCA biplot.
Differences Among Patients With and Without SBI
| Variable | Patients Without SBI (n = 1332) | Patients With SBI (n = 138) | P |
|---|---|---|---|
| Age, d, median (IQR) | 38 (26–48) | 31 (20–45) | .001a |
| Male, n (%) | 767 (57.6) | 92 (66.7) | .040a |
| Elevated temperature in enrolling ED, n (%) | 808 (60.7) | 89 (64.5) | .380 |
| Clinical suspicion for SBI, n (%) |  |  | <.001a |
| 1%–10% | 1262 (94.7) | 114 (82.6) |  |
| >10% | 70 (5.3) | 24 (17.4) |  |
| WBC, per μL, median (IQR) | 9200 (6900–12 100) | 14 500 (1500–18 000) | <.001a |
| ANC, per μL, median (IQR) | 2970 (1910–4710) | 7200 (5020–10 500) | <.001a |
| Positive urinalysis, n (%) | 107 (8.0) | 116 (84.1) | <.001a |
| Procalcitonin level, ng/mL, median (IQR) | 0.20 (0.15–0.28) | 0.74 (0.31–3.39) | <.001a |
IQR, interquartile range.
a P value significant by univariable binary logistic regression.
Derivation and Validation Cohorts
A total of 735 patients were placed into each of the derivation and validation cohorts. Sixty-nine (9.4%) patients in each cohort had SBIs. The groups were similar with respect to predictor and outcome variables, with the exception of WBC (Table 2).
Differences Among Patients in Derivation and Validation Cohorts
| Variable | Derivation Cohort (n = 735) | Validation Cohort (n = 735) | P |
|---|---|---|---|
| Predictors |  |  |  |
| Age, d, median (IQR) | 37 (25–48) | 38 (26–48) | .664 |
| Male, n (%) | 423 (57.6) | 436 (59.3) | .491 |
| Elevated temperature in enrolling ED, n (%) | 443 (60.3) | 454 (61.8) | .556 |
| Unstructured clinical suspicion for SBI, n (%) |  |  | 1.000 |
| 1%–10% | 688 (93.6) | 688 (93.6) |  |
| >10% | 47 (6.4) | 47 (6.4) |  |
| WBC, per μL, median (IQR) | 9200 (6600–12 300) | 9700 (7300–13 000) | .024a |
| ANC, per μL, median (IQR) | 3180 (1940–5090) | 3220 (2040–5180) | .390 |
| Positive urinalysis, n (%) | 110 (15.0) | 113 (15.4) | .827 |
| Procalcitonin level, ng/mL, median (IQR) | 0.21 (0.16–0.33) | 0.20 (0.15–0.29) | .488 |
| Outcomes |  |  |  |
| Any SBIb | 69 (9.4) | 69 (9.4) | 1.000 |
| Positive urine culture | 61 (8.3) | 61 (8.3) | 1.000 |
| Positive blood culture | 9 (1.2) | 11 (1.5) | .653 |
| Positive CSF culture | 4 (0.5) | 4 (0.5) | 1.000 |
IQR, interquartile range.
a P value significant by univariable binary logistic regression.
b Some patients had concomitant SBIs. In the derivation cohort, there were 2 patients with bacteremia and meningitis and 3 patients with bacteremia and UTI. In the validation cohort, there were 2 patients with concomitant bacteremia and meningitis, 3 patients with UTI and bacteremia, and 1 patient with bacteremia, meningitis, and UTI.
Model Characteristics
The AUROCs of the models in the validation cohort were high, with the highest found in the random forest model (AUROC: 0.96; 95% CI: 0.93–0.98) (Fig 2). This model also had the highest specificity at its selected threshold (74.9%; 95% CI: 71.5%–78.2%) (Table 3) and the highest AUPRC and F1 statistics (Table 4). The relative importance of features contained within the random forest model is provided in Fig 3. The random forest model demonstrated a sensitivity comparable to the PECARN, Aronson, and Step-by-Step models but with a specificity above the upper limit of the CI for each comparison model. Using the random forest model to classify infants into high- and low-risk groups revealed that 849 of 1240 (68.5%) patients who underwent a lumbar puncture could potentially have avoided the procedure. Applying the PECARN, Step-by-Step, and Aronson rules to this study sample would have resulted in 673 of 1240 (54.3%), 460 of 1240 (37.1%), and 319 of 1240 (25.7%) fewer lumbar punctures, respectively.
Receiver operator curves of all evaluated models on the validation cohort (N = 735). GLM, generalized logistic model.
Model Characteristics Among the Derivation and Validation Cohorts
| Method | Cohort | AUC | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | LR (+) | LR (−) |
|---|---|---|---|---|---|---|---|---|
| Stepwise logistic regression | Derivation | 0.95 (0.92–0.98) | 98.6 (92.2–100.0) | 49.2 (45.4–53.1) | 16.7 (13.2–20.7) | 99.7 (98.3–100.0) | 1.94 (1.79–2.10) | 0.03 (0.00–0.21) |
|  | Validation | 0.95 (0.93–0.98) | 100.0 (94.8–100.0) | 50.2 (46.3–54.0) | 17.2 (13.6–21.3) | 100.0 (98.9–100.0) | 2.01 (1.86–2.16) | 0.00 (0.00) |
| Random foresta | Derivation | 1.00 (0.99–1.00) | 100.0 (94.8–100.0) | 81.8 (78.7–84.7) | 36.3 (29.5–43.6) | 100.0 (99.3–100.0) | 5.50 (4.68–6.47) | 0.00 (—) |
|  | Validation | 0.96 (0.93–0.98) | 98.6 (92.2–100.0) | 74.9 (71.5–78.2) | 28.9 (23.2–35.2) | 99.8 (98.9–100.0) | 3.93 (3.44–4.50) | 0.02 (0.00–0.14) |
| SVM | Derivation | 0.94 (0.90–0.98) | 97.1 (89.9–99.6) | 47.6 (43.7–51.5) | 16.1 (12.7–20.0) | 99.4 (97.8–99.9) | 1.85 (1.71–2.01) | 0.06 (0.02–0.24) |
|  | Validation | 0.93 (0.89–0.97) | 97.1 (89.9–99.6) | 52.4 (48.5–56.3) | 17.4 (13.8–21.6) | 99.4 (98.0–99.9) | 2.04 (1.87–2.23) | 0.06 (0.01–0.22) |
| Single-hidden layer neural network | Derivation | 0.96 (0.94–0.99) | 95.7 (87.8–99.1) | 68.6 (64.9–72.1) | 24.0 (19.1–29.5) | 99.3 (98.1–99.9) | 3.05 (2.70–3.45) | 0.06 (0.02–0.19) |
|  | Validation | 0.95 (0.93–0.97) | 98.6 (92.2–100.0) | 70.4 (66.8–73.9) | 25.7 (20.5–31.4) | 99.8 (98.8–100.0) | 3.33 (2.95–3.76) | 0.02 (0.00–0.14) |
| PECARN Rule7 | Derivation | — | 97.1 (89.9–99.6) | 64.0 (60.2–67.6) | 21.8 (17.3–26.9) | 99.5 (98.3–99.9) | 2.69 (2.42–3.01) | 0.05 (0.01–0.18) |
|  | Validation | — | 98.6 (92.2–100.0) | 60.2 (56.4–64.0) | 20.4 (16.2–25.2) | 99.8 (98.6–100.0) | 2.48 (2.25–2.73) | 0.02 (0.00–0.17) |
| Step-by-Step8 | Derivation | — | 94.2 (85.8–98.4) | 67.4 (63.7–71.0) | 23.0 (18.3–28.4) | 99.1 (97.8–99.8) | 2.89 (2.55–3.27) | 0.09 (0.03–0.22) |
|  | Validation | — | 92.8 (83.9–97.6) | 67.6 (63.9–71.1) | 22.9 (18.1–28.2) | 98.9 (97.5–99.6) | 2.86 (2.52–3.25) | 0.11 (0.05–0.25) |
| Aronson6 | Derivation | — | 98.6 (92.2–100.0) | 31.7 (28.2–35.4) | 13.0 (10.2–16.2) | 99.5 (97.4–100.0) | 1.44 (1.36–1.53) | 0.05 (0.01–0.32) |
|  | Validation | — | 100.0 (94.8–100.0) | 30.8 (27.3–34.4) | 13.0 (10.3–16.2) | 100.0 (98.2–100.0) | 1.44 (1.37–1.52) | 0.00 (0.00) |
Numbers in parenthesis represent 95% CIs. AUC, area under the curve; LR (+), positive likelihood ratio; LR (−), negative likelihood ratio; NPV, negative predictive value; PPV, positive predictive value; —, not applicable.
a Model with highest specificity in threshold analysis performed on the validation cohort.
AUPRC and F1 Scores for Evaluated Models
| Method | AUPRC (Validation) | F1 Score (Validation) | AUPRC (Derivation) | F1 Score (Derivation) |
|---|---|---|---|---|
| Stepwise logistic regression | 0.74 | 0.66 | 0.74 | 0.67 |
| Random forest | 0.97 | 0.90 | 0.78 | 0.86 |
| SVM | 0.79 | 0.64 | 0.71 | 0.69 |
| Single-hidden layer neural network | 0.82 | 0.81 | 0.72 | 0.83 |
Feature (variable) importance of the random forest model, as classified using the mean decrease in accuracy. Features with a large mean decrease in accuracy carry greater importance for classification of data.
Missed Cases of SBI
No more than 4 SBIs were missed with any machine learning model. The random forest and logistic regression models missed 1 patient. The SVM and neural network models missed 4 cases (Table 5). The PECARN, Step-by-Step, and Aronson models had 3, 9, and 1 false-negatives, respectively.
Missed (False-Negative) Cases of SBI Using Each Methodology
| Source | Age, d/Sex | Unstructured Suspicion of SBI | Elevated Temperature in ED | WBC, Cells per μL | ANC, Cells per μL | Bands, Cells per μL | Urinalysis | Procalcitonin, ng/mL | Pathogen | Models That Missed the Case |
|---|---|---|---|---|---|---|---|---|---|---|
| Derivation | 21/male | High | Yes | 7730 | 1500 | 10 | Positive | 0.24 | Escherichia coli UTI | SVM |
| Derivation | 21/male | Low | No | 8220 | 1320 | 0 | Positive | 0.2 | Klebsiella pneumoniae UTI | SVM |
| Derivation | 26/female | Low | Yes | 7000 | 5390 | 3780 | Negative | 3.61 | Group B Streptococcus bacteremia | NN |
| Derivation | 27/male | Low | No | 10 700 | 5960 | 0 | Negative | 0.24 | E coli UTI | Step-by-Step |
| Derivation | 36/male | Low | No | 2300 | 920 | 280 | Negative | 0.16 | Pseudomonas aeruginosa UTI | PECARN, LR, Aronson, Step-by-Step |
| Derivation | 37/female | Low | Yes | 2940 | 260 | 260 | Negative | 1.77 | Group B Streptococcus meningitis | NN |
| Derivation | 52/female | Low | Yes | 21 080 | 6750 | 0 | Negative | 0.42 | E coli UTI | NN, Step-by-Step |
| Derivation | 55/female | Low | Yes | 3800 | 2200 | 110 | Negative | 0.2 | E coli UTI | PECARN, Step-by-Step |
| Validation | 30/male | Low | Yes | 6700 | 2680 | 0 | Negative | 0.14 | Enterobacter cloacae bacteremia | PECARN, RF, NN, Step-by-Step |
| Validation | 36/male | Low | Yes | 12 710 | 5320 | 30 | Negative | 0.13 | Staphylococcus aureus bacteremia | SVM, Step-by-Step |
| Validation | 42/female | Low | Yes | 11 000 | 7260 | 33 | Negative | 0.23 | Enterococcus UTI | Step-by-Step |
| Validation | 50/male | Low | No | 10 620 | 6370 | 0 | Negative | 0.27 | Staphylococcus aureus bacteremia | Step-by-Step |
| Validation | 50/male | Low | Yes | 17 000 | 5950 | 0 | Negative | 0.19 | Klebsiella pneumoniae UTI | Step-by-Step |
| Validation | 54/male | Low | No | 10 520 | 1580 | 110 | Positive | 0.17 | E coli UTI | SVM |

LR, logistic regression; NN, single-hidden layer neural network; RF, random forest.
Exploratory Analyses
In our first exploratory analysis, we used imputation to test model performance on a cohort of patients that had missing data elements in <15% of cases, keeping other inclusion criteria the same. A total of 1537 encounters were included, with 158 (10.3%) having SBIs (Supplemental Table 7). Application of the random forest model to this data set resulted in one missed case, with high sensitivity and specificity (Supplemental Table 8). In our second exploratory analysis, the model performed similarly to results reported in the primary analysis when using an outcome limited to bacterial meningitis and bacteremia (Supplemental Table 9). In our third exploratory analysis using a modified definition for UTI, 17 UTIs with lower colony counts were reclassified as non-UTIs. Model performance was similar in this analysis as for the primary study (Supplemental Table 10). In a post hoc analysis, we performed imputation for all missing data, including procalcitonin, while using the full SBI definition. This model demonstrated a sensitivity of 94.9% (95% CI: 92.3%–96.8%) and specificity of 74.3% (95% CI: 72.9%–75.7%; Supplemental Table 11).
In creation of a model without procalcitonin, 3989 patients were included (Supplemental Fig 5). Three hundred ninety (9.8%) had an SBI (335 [8.4%] UTIs, 73 [1.8%] bacteremia, and 21 [0.5%] bacterial meningitis; 38 [1.0%] with concurrent infections). The variables of physician suspicion of SBI, age, sex, urinalysis, and ANC were selected from the PCA biplot. This model achieved a sensitivity of 98.0% (95% CI: 94.9%–99.4%) and specificity of 42.4% (95% CI: 40.1%–44.7%) (Supplemental Tables 12 and 13). In development of a random forest model including only objective measures (ANC, urinalysis, patient age, and sex, omitting the clinician suspicion for SBI), the specificity from the validation cohort was lower compared with the model that included clinical assessment (Supplemental Tables 14 and 15).
Discussion
We developed machine learning models to risk-stratify infants ≤60 days of age for SBI. We identified a random forest model that demonstrated high sensitivity (99%) and specificity (75%) compared with previously published models.1–8,31 This model missed 1 patient out of 138 with SBI (0.7%). Machine learning models for the risk-stratification of febrile infants demonstrated high accuracy and may help support clinical decision-making to minimize unnecessary hospitalizations, antibiotics, and lumbar punctures.
Kuppermann et al,7 in their recent model, demonstrated a sensitivity of 98% and a specificity of 60% using recursive partitioning. In our study, we used the public-use data set from that investigation, which may be a subset of the parent study's sample. We also used slightly different inclusion criteria, such as the retention of procalcitonin levels performed within the first 2 days of admission. The Step-by-Step model, another recent predictive tool using ANC, urinalysis, procalcitonin, and C-reactive protein, demonstrated a sensitivity of 92% and specificity of 47%.8 A model reported by Aronson et al,6 for invasive bacterial infections, revealed a sensitivity of 99% and a specificity of 31%. Similar to these models, our model does not require CSF for risk-stratification. This model also demonstrates superior test characteristics compared with older models, including the Philadelphia (sensitivity 97%–99%, specificity 39%–42%)2,31 and Rochester criteria (sensitivity 82%–97%, specificity 40%–50%).3,5,8
The use of supervised learning techniques carries potential to decrease medical interventions. Applying the random forest rule, for example, may have allowed additional lumbar punctures to be avoided compared with previous rules.6–8 Although incremental, this reduction in procedures represents a meaningful improvement. Beyond procedures, more accurate classification of lower-risk infants may also reduce empirical antimicrobial therapy, hospitalizations, false-positive cultures, and nosocomial infection risk.32
Notably, one patient with Enterobacter cloacae bacteremia was classified as a false-negative by the random forest model. This patient was also misclassified as a false-negative by the PECARN and Step-by-Step rules.7,8 According to Kuppermann et al,7 a repeated blood culture before the administration of antibiotics was negative, and the patient had a benign hospitalization. This misclassification underscores the continued need for more robust methods to identify febrile infants with SBI.
The random forest model achieved the highest diagnostic accuracy among the machine learning models investigated, while using variables similar to those in previous decision rules. Although a strength of a recursive partitioning approach is that it can be used by bedside clinicians without calculations, it does not fully exploit the computational power available in modern health care settings. In contrast, the random forest algorithm is computationally intense but likely better able to leverage the data, particularly with respect to continuous variables. The unbalanced nature of this data set, in which 10% of patients had SBIs, may also have contributed to this result. With unbalanced data, standard classifiers such as SVMs, logistic regression, and decision trees tend to bias toward the majority (or negative) class and are further limited by smaller sample sizes.33 In contrast, the random forest model, which provides predicted probabilities aggregated from multiple decision trees, may be better suited to these types of data.
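The mechanism described above can be sketched with scikit-learn on synthetic data. This is not the authors' code: the data below are simulated with an approximately 10% positive class to mirror the SBI prevalence, and the hyperparameters are illustrative.

```python
# Minimal sketch of a random forest producing continuous risk
# estimates on imbalanced data (simulated, ~10% positives).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1500, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# predict_proba averages the class votes of the individual trees,
# yielding a continuous risk score rather than a single hard cutoff.
risk = forest.predict_proba(X)[:, 1]
```

Because the output is a probability rather than a binary label, the operating threshold can be tuned downward to favor sensitivity over specificity, which is the relevant trade-off when missing an SBI is far costlier than an unnecessary work-up.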
Despite demonstrating superior test characteristics, our models carry important limitations compared with previously published guidelines. Unlike rules that provide parameter cutoffs,1–8 these models require the use of algorithms that are not intuitive. A sample tree, provided in Supplemental Fig 6, demonstrates one decision tree from the random forest model; the final model contains 5000 such trees, the results of which are aggregated to provide a final probability. Integration of these findings into health information systems, including through interactive web applications or electronic medical record systems, may mitigate this disadvantage. For example, a validated model was recently developed into a web application to support real-time risk assessment of UTI in children 2 to 23 months of age.34 Other investigators have reported on the use of sepsis alerts to notify clinicians of at-risk patients using machine learning data.35–37 A model to identify at-risk febrile infants may allow for greater diagnostic accuracy by leveraging computational resources available through integration with the electronic medical record. Importantly, before clinical deployment, validation of such a model is paramount. Models may demonstrate a decline in performance when tested on other data sets,38 a concern that is especially relevant for machine learning models.39
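In a deployed system such as those described above, the aggregated probability must still be converted to a low-risk/not-low-risk classification. One common approach, sketched below with invented scores and labels (this is not the study's procedure), is to choose the highest threshold that still captures a target fraction of true SBIs:

```python
# Hypothetical illustration: derive a classification threshold
# from a continuous risk score under a sensitivity constraint.
def pick_threshold(scores, labels, target_sensitivity=0.98):
    """Highest threshold keeping at least target_sensitivity of positives."""
    positives = sorted(s for s, y in zip(scores, labels) if y == 1)
    # Number of positives permitted to fall below the threshold.
    allowed_misses = int(len(positives) * (1 - target_sensitivity))
    return positives[allowed_misses]

scores = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90, 0.95]  # invented risk scores
labels = [0,    0,    0,    1,    0,    1,    1   ]  # 1 = SBI
thr = pick_threshold(scores, labels)
low_risk = [s < thr for s in scores]  # candidates for reduced testing
```

Fixing sensitivity first and accepting whatever specificity results mirrors how the clinical rules compared in this article are evaluated: near-perfect capture of SBIs is the constraint, and specificity is the quantity the models compete on.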
Our findings are subject to limitations, including the use of a convenience sample and reliance on cultures to determine true-positives. However, rates of SBI in the parent study appear similar to those documented elsewhere, suggesting that this is unlikely to be a major source of error.8,13 Models were developed on a smaller data set. We excluded patients without procalcitonin levels, which resulted in a large number of exclusions; this step was similarly performed in the PECARN model.7 We did not use the exact inclusion criteria that were used in the PECARN model, or the same derivation and validation cohorts, which resulted in slightly different study numbers. Our models may be overfit and require external validation to assess generalizability.40 Despite these limitations, the findings from this analysis suggest that machine learning models have the potential to perform well in the identification of SBI among populations of well-appearing febrile infants.
Conclusions
We evaluated machine learning algorithms for the risk-stratification of well-appearing febrile infants ≤60 days old from a multicenter prospective study. Although external validation is needed, our findings suggest that ensemble machine learning algorithms maintain the high sensitivity of recently published decision tools while providing higher specificity. This, in turn, would allow for more accurate identification of patients without disease, leading to fewer invasive procedures, antimicrobial agents, and hospitalizations.
Dr Ramgopal contributed to conceptualization and design of the study, methodology, investigation, formal analysis, and drafting of the initial manuscript; Dr Horvat contributed to methodology, formal analysis, and editing of the manuscript for intellectually important content; Dr Alpern contributed to conceptualization and design of the study, methodology, and editing of the manuscript for intellectually important content; Dr Yanamala contributed to formal analysis and editing of the manuscript for intellectually important content; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
FUNDING: Dr Horvat is sponsored by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (1K23HD099331-01A1). Funded by the National Institutes of Health (NIH).
- ANC: absolute neutrophil count
- AUPRC: area under the precision-recall curve
- AUROC: area under the receiver operating characteristic curve
- CFU: colony-forming unit
- CI: confidence interval
- CSF: cerebrospinal fluid
- ED: emergency department
- PC1: first principal component
- PC2: second principal component
- PCA: principal component analysis
- PECARN: Pediatric Emergency Care Applied Research Network
- SBI: serious bacterial infection
- SVM: support vector machine
- UTI: urinary tract infection
- WBC: white blood cell count
Competing Interests
POTENTIAL CONFLICT OF INTEREST: The authors have indicated they have no potential conflicts of interest to disclose.
FINANCIAL DISCLOSURE: The authors have indicated they have no financial relationships relevant to this article to disclose.