To develop an institutional machine-learning (ML) tool that uses demographic, socioeconomic, and medical information to stratify risk for 7-day readmission after hospital discharge; to assess the validity and reliability of the tool; and to evaluate its discriminatory capacity to predict readmissions.
We performed a single-center study of pediatric hospitalists that combined cross-sectional and prospective designs. The cross-sectional analyses used questionnaire Likert scale responses to assess the face and content validity of the developed readmission ML tool. Prospectively, we compared the discriminatory capacity of provider-assessed readmission risk versus the ML tool for predicting 7-day readmissions using area under the receiver operating characteristic curve analyses.
Overall, 80% (12 of 15) of responding hospitalists reported being somewhat to very confident in their ability to accurately predict readmission risk; 53% reported that an ML tool would influence clinical decision-making (face validity). The ML tool variable exhibiting the highest content validity was history of previous 7-day readmission. Prospective provider assessment of risk for 413 discharges showed minimal agreement with the ML tool (κ = 0.104 [95% confidence interval 0.028–0.179]). Both provider gestalt and ML calculations poorly predicted 7-day readmissions (area under the receiver operating characteristic curve: 0.67 vs 0.52; P = .11).
An ML tool for predicting 7-day hospital readmissions after discharge from the general pediatric ward had limited face and content validity among pediatric hospitalists. Both provider and ML-based determinations of readmission risk were of limited discriminatory value. Before incorporating similar tools into real-time discharge planning, model calibration efforts are needed.
Hospital readmissions are increasingly used as a marker of inpatient care quality,1,2 are costly to the health system,3,4 and are stressful for patients and their caregivers.2 As a result of insurer nonpayment for same-cause readmissions in adult populations, there is increasing pressure at both the state and federal levels to reduce hospital readmissions.1 In 2018, the Children’s Hospital Association Solutions for Patient Safety (a collaborative including >100 facilities) identified the reduction of 7-day readmissions as a metric to improve the quality of pediatric inpatient care.5 However, identifying pediatric populations at risk for readmission, and actionable variables to reduce readmission frequency, remains challenging for clinicians and the health care system.6
Although risk scores developed using predictive models have shown potential to identify pediatric patients at risk for readmission,7 there remain significant gaps in core test domain assessments (eg, validity) to support their use in everyday management of hospitalized children. For example, even if a model exhibits high usability and construct validity for identifying readmissions, the score may not be accepted by clinicians or may introduce unconscious bias in decision-making if it lacks face validity (the qualitative judgment that a score is “good” or not by experts and frequent users).8 In contrast, models with high face validity hinge on presumptions of content validity (ie, that the model captures all factors contributing to a readmission).8 The validity and usability of a model for identifying patients at risk for readmission further rely on clinicians’ trust and understanding of the algorithm and role for machine learning (ML) in augmenting clinical decision-making.9
To begin to address these knowledge gaps, we aimed to develop and then measure the face and content validity of an institutional ML predictive modeling tool for identifying pediatric inpatients at increased risk of 7-day readmission after discharge from the general pediatric ward. We secondarily assessed the interrater reliability in assigning a level of risk for readmission and the discriminatory capacity to predict readmission of both attending hospitalists and the tool (ie, construct validity).10
Methods
Overall Study Design
The study was conducted in 3 phases. First, we developed an ML tool to stratify a patient’s risk for readmission (this tool remains under investigation and is not currently used for clinical care). Next, we performed a cross-sectional, survey-based study to assess the face and content validity of the developed institutional tool. Finally, we performed a single-center, prospective, observational cohort study of pediatric inpatients to assess the discriminatory ability (ie, construct validity) of the tool to predict 7-day readmissions. This study was approved by our institutional review board.
Readmission Model Data Source and Study Sample
The ML model was developed from data readily accessible in our institution’s electronic health record. Encounter data used to train, validate, and test the model were extracted from encounters classified as inpatient or observation status and included patients discharged by a hospital medicine (HM) attending from July 2016 to June 2019. This data set comprised 4947 patients, of whom 127 (2.5%) were readmitted within 7 days and 323 (6.5%) were readmitted between 7 and 30 days after the index discharge.
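For illustration, outcome labels of this kind could be derived from encounter-level discharge data roughly as sketched below. This is a minimal Python sketch under assumed conventions; the column names (patient_id, admit_date, discharge_date) are hypothetical and do not reflect the institution’s actual data model.

```python
# Sketch: derive 7-day and 7-to-30-day readmission labels from encounter data.
# Column names are hypothetical placeholders, not the study's schema.
import pandas as pd

def label_readmissions(encounters: pd.DataFrame) -> pd.DataFrame:
    # Sort each patient's encounters chronologically
    df = encounters.sort_values(["patient_id", "admit_date"]).copy()
    # Days from each index discharge to that patient's next admission
    next_admit = df.groupby("patient_id")["admit_date"].shift(-1)
    days_to_next = (next_admit - df["discharge_date"]).dt.days
    df["readmit_7d"] = days_to_next.between(0, 7)       # within 7 days
    df["readmit_7_30d"] = days_to_next.between(8, 30)   # between 7 and 30 days
    return df
```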
Model Variables
Variables readily accessible from the electronic health record data structure and updated daily in institutional patient census reports were considered for inclusion in the model. We aimed to incorporate readily available variables from previously published readmission models related to demographics, insurance, primary care physician, location of hospitalization, and severity of illness.7,11–13 Demographic data included age, sex, self-identified race and ethnicity, home Zip code, and county of residence. Self-identified race and ethnicity were included in the model to allow for understanding of how these contextual factors could influence disparities in readmission. Insurance status was classified as private, public, or uninsured. Hospital unit and medical team data (eg, hospital medicine, critical care) at the time of admission and discharge, length of hospitalization, and number of previous 7-day readmissions were recorded. Illness severity was assessed using the diagnosis-related group severity of illness classification.14
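As a rough illustration of how such variables might be assembled into model inputs, the sketch below one-hot encodes the categorical variables and standardizes the numeric ones using scikit-learn. The column names are hypothetical stand-ins for the variables listed above; the study’s actual preprocessing pipeline is not described at this level of detail.

```python
# Illustrative preprocessing of the candidate variables; column names are
# hypothetical placeholders for the demographic, insurance, unit/team,
# length-of-stay, prior-readmission, and severity variables described above.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["sex", "race_ethnicity", "insurance", "admit_unit",
               "discharge_team", "drg_severity", "zip_code", "county"]
numeric = ["age_years", "length_of_stay_days", "prior_7d_readmissions"]

preprocess = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])
```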
Model and Clinical Decision Support Tool Development
The data set was divided into training (70% of eligible patients) and test (30%) subsets to develop and assess the performance of the model, respectively. The 2 data sets were created using stratified k-fold cross-validation for data splitting to reduce the influence of bias in the training and testing sets.15 The training set included both inpatient (N = 2641) and observation (N = 821) encounters. For both the 7- and 30-day readmission outcomes, we developed a deep neural network with 3 hidden layers16 and used Bayesian optimization17 to identify optimal hyperparameters. Batch normalization was used to scale data between hidden layers. Overfitting was mitigated using the dropout technique and L2 regularization of hidden layers.18 A 30-day readmission model was developed in parallel to inform understanding of the performance of the 7-day readmission model, given that 30-day models typically have higher event rates and have been evaluated to a greater extent.2,12,19–25
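A minimal Keras sketch consistent with the architecture described above (3 hidden layers with batch normalization, dropout, and L2 regularization) is shown below. The layer widths, dropout rate, and optimizer settings are illustrative placeholders only; the study selected hyperparameters with Bayesian optimization (eg, with a tuner such as KerasTuner) rather than fixing them as done here.

```python
# Sketch of a 3-hidden-layer readmission classifier with batch normalization,
# dropout, and L2 regularization; all hyperparameter values are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(n_features: int, l2: float = 1e-4, dropout: float = 0.3):
    model = tf.keras.Sequential([layers.Input(shape=(n_features,))])
    for units in (64, 32, 16):                        # 3 hidden layers
        model.add(layers.Dense(units, activation="relu",
                               kernel_regularizer=regularizers.l2(l2)))
        model.add(layers.BatchNormalization())        # scale data between layers
        model.add(layers.Dropout(dropout))            # guard against overfitting
    model.add(layers.Dense(1, activation="sigmoid"))  # readmission probability
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auroc")])
    return model
```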
Because the output from an ML model is not itself intuitive, we created a practical visualization using probability outputs of the model (Supplemental Fig 4) built in Tableau (v 2019.2, Mountain View, CA) to present real-time risk scores to clinicians (ie, clinical decision support tool). Daily readmission risk probability scores for each patient were calculated by the model for children admitted to the HM service and were a priori assigned in the tool as high risk (>80th percentile of daily scores), medium risk (>50th to 80th percentile), or low risk (≤50th percentile) for 7-day readmission. Percentile groupings were chosen by invoking the Pareto principle to create meaningful groupings of continuous probability scores.26,27 The output was visualized using a color scale with preestablished color meanings (ie, red = high, yellow = medium, green = low).28–31
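The percentile-based grouping can be sketched as follows; the thresholds follow the text, and the function name and color mapping are otherwise illustrative.

```python
# Sketch: convert daily probability scores into high/medium/low risk groups
# (>80th, >50th-80th, <=50th percentile) with a red/yellow/green color scale.
import numpy as np

COLORS = {"high": "red", "medium": "yellow", "low": "green"}

def stratify(daily_scores: np.ndarray):
    p50, p80 = np.percentile(daily_scores, [50, 80])
    labels = np.where(daily_scores > p80, "high",
             np.where(daily_scores > p50, "medium", "low"))
    return labels, [COLORS[label] for label in labels]
```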
Face and Content Validity
We first conducted a cross-sectional survey of pediatric HM attending physicians (ie, “experts”) to determine the face validity of the tool.32–34 This included perceptions regarding confidence in assigning risk for readmission to the general pediatric unit and the potential acceptance of a tool developed from ML methods. Using a Likert scale, respondents rated their confidence in assessing a patient’s risk of 7-day readmission and the likelihood that a tool’s assessment would influence their clinical practice. Content validity of the tool was assessed by having respondents rate the importance for readmission risk of the 10 variables with the greatest weight in the model. The voluntary, anonymous survey was electronically distributed (Qualtrics, Provo, UT) to all pediatric hospitalists and analyzed in aggregate.
Attending Assessment of Readmission Risk
The prospective assessment of readmission risk was conducted from December 2019 to June 2020 in a 40-bed general pediatric medicine unit within a freestanding, 259-bed, university-affiliated, quaternary pediatric referral center. Inpatients aged 0 to 21 years admitted to the HM service were eligible for inclusion. Those admitted to the same geographic unit under the primary care of the pulmonology (patients with cystic fibrosis only) or hematology/oncology subspecialty services were excluded because hospitalists do not care for these patients. Attending hospitalists were approached by a study team member, verbally consented, and instructed to stratify a convenience sample of patients under their care with anticipated discharge within 24 hours into categories of low, medium, or high risk for 7-day readmission, using the same percentile stratification as the tool. The determination of risk was based solely on professional opinion, and attending physicians were blinded to the output from the tool.
Statistical Analysis
Overall performance of the ML model using the training data set was assessed via measurement of the area under the receiver operating characteristic curve (AUROC). Performance of the classification algorithm on the test data set, by predicted and actual readmission status, was displayed in a confusion matrix. Accuracy (proportion of all patients correctly classified), recall (proportion of readmitted patients correctly predicted to be readmitted), and precision (proportion of patients predicted to be readmitted who were readmitted) were calculated to further summarize the performance of the model on the test data set.35
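A minimal sketch of this performance summary using scikit-learn is shown below; the arrays are toy placeholders standing in for the held-out test labels and the model’s predicted probabilities.

```python
# Sketch: confusion matrix, accuracy, recall, precision, and AUROC on toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1])                  # placeholder labels
y_prob = np.array([0.1, 0.4, 0.8, 0.2, 0.3, 0.6, 0.1, 0.9])  # placeholder scores
y_pred = (y_prob >= 0.5).astype(int)                          # example threshold

print(confusion_matrix(y_test, y_pred))                # predicted vs actual status
print("accuracy :", accuracy_score(y_test, y_pred))    # correctly classified
print("recall   :", recall_score(y_test, y_pred))      # readmissions identified
print("precision:", precision_score(y_test, y_pred))   # flagged that readmitted
print("AUROC    :", roc_auc_score(y_test, y_prob))
```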
Descriptive statistics are reported as medians (interquartile range), and categorical comparisons of risk assessment between provider and machine were evaluated using Fisher’s exact test. Agreement between physician- and machine-predicted risk of readmission (interrater reliability) was assessed using Cohen’s κ statistic, with the level of agreement interpreted as poor (<0.00), slight (0.00–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), or almost perfect (0.81–1.00).36 The discriminatory ability of the provider and machine assessments of high risk in predicting readmission to the hospital (construct validity) was examined using AUROC analyses. Significance for all statistical comparisons was set at an α of 0.05.
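The agreement and discrimination analyses could be computed along the following lines; the example uses scikit-learn with toy data, with risk coded ordinally (0 = low, 1 = medium, 2 = high), and the resulting κ would then be interpreted against the thresholds above.

```python
# Sketch: Cohen's kappa for provider-vs-tool agreement and AUROC of each
# rater's ordinal risk assignment against observed 7-day readmission (toy data).
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

provider = np.array([2, 1, 0, 1, 0, 2, 0, 1])   # 0 = low, 1 = medium, 2 = high
tool     = np.array([1, 1, 0, 0, 0, 2, 1, 0])
readmit  = np.array([0, 0, 0, 1, 0, 1, 0, 0])   # observed 7-day readmission

print("kappa         :", cohen_kappa_score(provider, tool))
print("provider AUROC:", roc_auc_score(readmit, provider))
print("tool AUROC    :", roc_auc_score(readmit, tool))
```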
Results
Readmission Model Development
The mean AUROC for all developed models (no readmission, 7-day readmission, and 30-day readmission) was 0.76, with AUROC for detection of 7-day readmission of 0.77 (95% confidence interval [CI] 0.69–0.83) (Supplemental Fig 5). The confusion matrix for the test data set is presented in Supplemental Table 3. The model’s ability to identify 7-day readmissions from the test data set (recall) was 0.55 and the accuracy of the model was 0.82.
Face and Content Validity of Clinical Decision Support Tool
Overall, 75% (15 of 20) of hospitalists responded to the survey and reported a mean (range) of 12.7 (5–25) years of experience as an attending and 0.6 (0.1–1.0) full-time equivalent clinical effort. Eighty percent (12 of 15) of providers were at least somewhat confident in their ability to accurately assess a patient’s risk for 7-day readmission (Fig 1A). Fifty-three percent (8 of 15) of respondents perceived that the tool would at least somewhat influence their clinical decision-making (face validity) (Fig 1B). Rated importance of the 10 ML variables with the greatest weight in the model, as reported by attending physicians (content validity), is summarized in Fig 2. Only length of hospital stay, county of primary residence, history of previous 7-day readmission, unit of admission, and insurance class had median importance scores of 3 of 5 or higher.
Provider confidence with identifying patients at risk for readmission to the hospital within 7 days of discharge (A), and likelihood of incorporating an ML readmission risk clinical decision support tool into clinical practice (B).
Box plot of provider ranking of importance (scale of 1–5) of 10 variables with greatest weight in readmission risk model (A). Relative importance of individual variables within the model for predicting patients at risk for readmission to the hospital within 7 days of discharge (B). LOS, length of stay; PCP, primary care provider.
Provider Agreement With Readmission Risk
During the prospective study period, there were a total of 1529 discharges from the general pediatric ward, of which 33 patients (2.2%) were readmitted within 7 days. Attending physicians provided a risk assessment for 432 of 1529 (28.3%) discharges. A risk assessment was not calculated by the tool for 19 of the attending assessments. Accordingly, 413 of 1529 (27.1%) physician-tool pairings, provided by a total of 17 attending hospitalists (median 24 [interquartile range 16–30.5] responses per hospitalist), were included in the final analysis.
The frequencies of each risk stratification (high/medium/low) differed between attending physicians and the tool (P = .001) (Fig 3). The provider assessment of risk matched that of the tool for 52.8% (218 of 413) of assessments, resulting in no to slight agreement between provider and tool (Table 1; κ = 0.104; 95% CI 0.028–0.179). Discordance between assignments of low risk and high risk for readmission occurred in 7.0% (29 of 413) of discharges. In a sensitivity analysis dichotomizing the assessment to low risk or high/medium risk, the agreement between provider and tool remained at none to slight (κ = 0.153 [95% CI 0.058–0.248]).
Distribution of risk assignments by provider and tool (A). AUROC of readmission risk stratification for children readmitted to the hospital within 7 days of discharge, for provider and ML tool (B).
Performance Characteristics of Provider and Clinical Decision Support Tool
Performance characteristics of the high-risk assessment by provider and tool to predict 7-day readmissions are summarized in Table 2. The AUROC analysis of both the provider (AUROC = 0.67, 95% CI 0.55–0.79) and tool (AUROC = 0.52, 95% CI 0.38–0.67) demonstrated poor to no discriminatory ability among the risk categories for identifying children readmitted to the hospital (P = .11). Overall, a high-risk assessment for readmission by the provider (0.14, 95% CI 0.02–0.43) and tool (0.29, 95% CI 0.08–0.58) had low sensitivity for predicting readmission. The specificity of the high-risk assessment was similar for the provider (0.91, 95% CI 0.87–0.94) and tool (0.90, 95% CI 0.87–0.93).
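For reference, test characteristics such as those reported in Table 2 follow directly from a standard 2 × 2 table of high-risk assessment versus observed readmission. The sketch below computes them from placeholder counts, which are not the study’s actual counts.

```python
# Sketch: sensitivity, specificity, PPV, NPV, and likelihood ratios from a
# 2x2 table of high-risk assessment vs observed readmission (placeholder counts).
def test_characteristics(tp: int, fp: int, fn: int, tn: int) -> dict:
    sens = tp / (tp + fn)          # readmitted patients flagged high risk
    spec = tn / (tn + fp)          # non-readmitted patients not flagged
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }

print(test_characteristics(tp=2, fp=40, fn=12, tn=359))  # placeholder counts
```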
Provider and Machine Learning Tool Assignments of High, Medium, and Low Risk for 7-Day Readmission
| Provider Assessment | Machine Learning Tool: High | Machine Learning Tool: Medium | Machine Learning Tool: Low |
| --- | --- | --- | --- |
| High | 9 (2%) | 9 (2%) | 12 (3%) |
| Medium | 21 (5%) | 34 (8%) | 84 (20%) |
| Low | 17 (4%) | 52 (13%) | 175 (42%) |
Distribution of Risk Assignments by Provider and Machine Learning Tool for Patients Readmitted Within 7 Days of Discharge (N = 14) and Performance Characteristics of a High-Risk Assessment in Predicting 7-Day Readmissions
| Risk Assignment | Provider, N (%) | Machine, N (%) |
| --- | --- | --- |
| High risk | 2 (14) | 5 (36) |
| Medium risk | 8 (57) | 1 (7) |
| Low risk | 4 (29) | 8 (57) |

| Performance Characteristic | Provider | Machine |
| --- | --- | --- |
| Sensitivity (95% CI) | 0.14 (0.02–0.43) | 0.29 (0.08–0.58) |
| Specificity (95% CI) | 0.91 (0.87–0.94) | 0.90 (0.87–0.93) |
| PPV (95% CI) | 0.07 (0.02–0.21) | 0.11 (0.05–0.23) |
| NPV (95% CI) | 0.96 (0.95–0.97) | 0.97 (0.96–0.98) |
| +LR (95% CI) | 1.53 (0.40–5.77) | 2.92 (1.21–7.05) |
| −LR (95% CI) | 0.95 (0.76–1.17) | 0.79 (0.57–1.10) |
−LR, negative likelihood ratio; +LR, positive likelihood ratio; NPV, negative predictive value; PPV, positive predictive value.
Discussion
In this study, we assessed the performance characteristics of an institutional ML clinical decision support tool for identifying pediatric patients at risk for 7-day readmission. Overall, pediatric hospitalists expressed limited self-confidence in identifying patients at risk for readmission and only approximately half expressed a likelihood of incorporating the tool into practice. The most heavily weighted variables in the model lacked significant content validity among the surveyed hospitalists, and agreement in risk stratification between provider and tool was low. Furthermore, both the provider and tool lacked adequate discriminatory ability (construct validity) for identifying patients who are readmitted. Our results are among the first to explore core test domains of an ML tool for predicting pediatric readmissions and compare the model’s performance to provider intuition.
Although children have lower readmission rates than adults and their readmissions are potentially less often preventable, readmissions remain an indicator of quality of care, particularly those deemed preventable.37–40 Novel strategies to reduce readmissions are necessary, given otherwise stagnant historical readmission rates.41 The use of ML is one promising approach to harnessing large, complex data sets to inform and support real-time clinical decisions.42 However, a better understanding of ML model performance characteristics is necessary to guide successful implementation of these models into clinical practice.
To preliminarily address this knowledge gap, we measured the face and content validity of our tool.8 Pediatric hospitalists in our study were confident with their assignment of risk of readmission but would not universally use the tool to support their clinical decisions. The acceptance and perceived benefit of an ML tool are important for implementation in a clinical setting. Clinicians utilizing ML in practice are challenged with aligning data-driven predictions with the needs and desires of individual patients.43 A lack of face or content validity may prevent the attainment of trust necessary to use a tool in an effective manner.9
One crucial factor influencing clinician acceptance of the tool is the general agreement with the tool’s risk stratification. In our prospective assessment of >400 discharges, attending physicians showed slight to no agreement with the stratification of the tool (κ = 0.104), despite the model’s accuracy with the training and testing data sets. Although previously published readmission models have not assessed a clinician’s agreement with the risk of readmission, studies involving other conditions have shown variable results. Developmental pediatricians exhibited moderate agreement (κ = 0.73) with an ML algorithm used to identify children with autism spectrum disorder.44 Similarly, in a study of an ML-based, early warning system for identifying adults with sepsis, only 40% of providers and 12% of nurses felt the patient actually had sepsis within 6 hours of the trigger.45
There are several important possible explanations for the lack of agreement. First, we suspect it is difficult for providers to conceptualize the differences between low to medium and medium to high risk. However, the agreement with the tool improved only slightly when dichotomizing the risk to low risk or medium to high risk. Second, although providers were informed of the variables included in the model, the interplay between variables and their relative impact on determining risk for readmission may be difficult to comprehend. In our study, we purposefully measured the providers’ assignment of risk in a blinded fashion to reduce any bias from the tool. However, clinical practice may be best augmented if the design of a tool includes transparency to the context of the model and how it may fall short in its predictive efforts.46 Clinicians might ultimately agree more with the tool when not blinded to the visualized output or with the inclusion of more physiologically relevant variables, such as time off of supplemental oxygen before discharge, time since last abnormal vital sign, or if focused on a specific diagnosis (eg, bronchiolitis). Finally, clinicians may hesitate to assign patients as high risk (because, then why discharge?) or low risk for readmission (subconsciously acknowledging there may be factors that the provider cannot identify).
The ability of both a provider and our tool to accurately predict readmissions (construct validity) was also limited. The positive predictive values of the high-risk assessment by the provider (6.7%) and the machine (10.9%) were both low, reflecting the difficulty of predicting the uncommon event of a readmission, similar to other published models.7,12 The conclusions from these results are twofold. First, providers using intuition alone have a limited ability to predict readmission to the hospital, despite reporting confidence in assessing risk. Second, our predictive tool alone would not augment this ability, despite adequate performance with historical training data. These findings challenge implementing ML tools directly into practice solely on the basis of performance with retrospective data. Furthermore, our data highlight an interesting dilemma in assessing content validity of the tool among providers who themselves exhibit limited ability to predict a readmission. One strength of ML is identifying unique relationships among variables that may not be immediately apparent to a clinician; thus, variables in an ML model that are highly predictive of an event may not have obvious content validity. However, incorporating variables that solely optimize a model’s AUROC without sufficient input and buy-in from end users could result in the dismissal of, or mistrust in, a model or tool.46–48
One key challenge facing the integration of ML readmission tools into practice is how best to guide clinician behavior when a readmission is predicted to occur. Even an accurate model with high validity and reliability does not automatically provide solutions to prevent a readmission event, particularly when built using static patient-level and institutional factors. In the adult inpatient setting, incorporating readmission risk prediction into daily multidisciplinary team discussions has successfully reduced readmissions, even in the absence of interventions suggested directly by the model; a similar approach may also be effective for pediatric patients.49 Alternatively, the strong negative predictive value of an ML tool may be used to avoid directing valuable and limited resources to patients at low risk for readmission, as has been described in predicting rare transfers to the PICU within 24 hours of admission.50 A better understanding of evidence-informed interventions incorporating ML tool predictions is necessary to prevent the universal adoption of potentially low-value solutions (such as unnecessarily prolonging a patient’s hospitalization when a readmission is predicted).
This investigation had several limitations. The study was conducted at a single freestanding children’s hospital with a research development team, which limits generalizability. Similarly, surveying only hospitalists at our institution, together with the associated small sample size, limits the generalizability of the validity and reliability assessments to providers outside our institution. In addition, we captured paired data on only 27% of discharges because of convenience sampling, which may introduce sampling bias. The provision of the model’s variables to respondents for assessing content validity during the cross-sectional phase could have influenced their decision-making in stratifying risk of readmission during the prospective phase. However, given the slight agreement between provider and model, we would not expect any such bias to drastically change the results. Finally, because of the low incidence of readmissions, the performance characteristics of the high-risk assessment lack precision. As such, direct comparisons of our model’s performance to others should be limited.
Conclusions
An ML tool developed at our institution exhibited limited face, content, and construct validity among pediatric hospitalists, despite acceptable accuracy on a training data set. Furthermore, both provider and tool assignments of high risk for readmission had low sensitivity for predicting readmission. Future efforts should focus on evaluating the test characteristics and real-time performance of ML tools in identifying targeted outcomes before deployment in a clinical setting. Reliance on historical training and validation data sets is insufficient for understanding the acceptability and utility of ML tools to providers.
Acknowledgments
We thank Dr Ali Jalali for his assistance with development of the ML model.
FUNDING: No external funding.
CONFLICT OF INTEREST DISCLAIMER: Dr Goldenberg is a site investigator for a clinical trial investigating rivaroxaban for treatment of venous thromboembolism funded by Bayer and receives personal consultant fees from Anthos, Bristol Myers Squibb, Bayer, Daiichi-Sankyo, Pfizer, and Novartis. The remainder of the authors have indicated they have no conflicts relevant to this article to disclose.
Dr Morrison conceptualized and designed the study, and drafted the initial manuscript; Dr Ahumada developed the analytical model, and reviewed and revised the manuscript; Drs Casey, Dudas, Sochet, Rehman, Goldenberg, and Dees carried out the initial analyses, and critically reviewed and revised the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.