This study aimed to develop and evaluate an algorithm to reduce the chart review burden of improvement efforts by automatically labeling antibiotic selection as either guideline-concordant or -discordant based on electronic health record data for patients with community-acquired pneumonia (CAP).
We developed a 3-part algorithm using structured and unstructured data to assess adherence to an institutional CAP clinical practice guideline. The algorithm was applied to retrospective data for patients seen with CAP from 2017 to 2019 at a tertiary children’s hospital. Performance metrics included positive predictive value (precision), sensitivity (recall), and F1 score (the harmonic mean of precision and recall), with weighted macro averages. Two physician reviewers independently assigned “actual” labels based on manual chart review.
Of 1345 patients with CAP, 893 were included in the training cohort and 452 in the validation cohort. Overall, the model correctly labeled 435 of 452 (96%) patients. Of the 286 patients who met guideline inclusion criteria, 193 (67%) were labeled as having received guideline-concordant antibiotics, 48 (17%) were labeled as likely in a scenario in which deviation from the clinical practice guideline was appropriate, and 45 (16%) were given the final label of “possibly discordant, needs review.” The sensitivity was 0.96, the positive predictive value was 0.97, and the F1 score was 0.96.
An automated algorithm that uses structured and unstructured electronic health record data can accurately assess the guideline concordance of antibiotic selection for CAP. This tool has the potential to improve the efficiency of improvement efforts by reducing the manual chart review needed for quality measurement.
Appropriate antimicrobial selection is critically important as antimicrobial resistance becomes both more common and more difficult to overcome.1 Despite this imperative, >50% of hospital antimicrobial prescribing remains incorrect or inappropriate.2 Community-acquired pneumonia (CAP) is one of the most common indications for antibiotic use in pediatrics.3,4 The unnecessary use of broad-spectrum antibiotics for CAP contributes to both increased antimicrobial resistance and an increased risk of adverse events for patients.5 To aid prescribers in choosing the optimal antibiotic regimen for CAP, national organizations and many institutions have created clinical practice guidelines (CPGs) that typically provide first-line antibiotic recommendations, as well as appropriate alternatives in certain clinical scenarios.6–8 CPGs have been shown to improve guideline-concordant care and patient outcomes.9–11 Unfortunately, nonadherence or deviation from CPGs remains frequent, and multiple interventions are often required to improve guideline uptake.12–15
Although identifying patients with CAP is feasible based solely on structured electronic health record (EHR) data, such as diagnosis codes, assessing CPG adherence often requires more nuanced unstructured data, such as clinical note text. For this reason, most improvement efforts rely on manual chart review, which can limit large-scale evaluation efforts because it is resource- and time-intensive.2,16 We aimed to reduce the chart review burden of quality measurement by creating an automated, usable, and flexible algorithm to label antibiotic selection for patients with CAP as guideline-concordant or -discordant based on both structured and unstructured EHR data elements. The objective of this study was to test the algorithm’s performance in labeling antibiotic selection for CAP as guideline-concordant or -discordant.
Methods
Setting, Study Design, and Data Collection
This was a single-center quality improvement research study conducted at a large, freestanding tertiary care pediatric hospital. We queried our enterprise data warehouse to extract retrospective data on all patients seen in the emergency department (ED) or admitted to the hospital from 2017 to 2019 who had an International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10) billing diagnosis of CAP and received at least 1 dose of an antibiotic at our institution (Supplemental Information Part 1). Per standard practice, two-thirds of the data were used to build and refine the algorithm and one-third was reserved for validation of the algorithm. The study was conducted by a multidisciplinary team that included pediatric hospitalists, clinical informaticists, and infectious diseases and antimicrobial stewardship specialists. The Boston Children’s Hospital Institutional Review Board reviewed and approved this study.
Algorithm Overview
We developed a 3-part algorithm to assess adherence to the antibiotic recommendations in our institutional CAP CPG (summarized in Fig 1). We applied the algorithm only to the first antibiotic administered during a patient’s encounter, given that the CPG focuses on initial empirical antibiotic selection; if a patient was seen in the ED and then transferred to an inpatient unit, only the first antibiotic given during the entire encounter (likely in the ED) was included. In part 1 of the algorithm, we used structured data with logical rules to identify patients who received guideline-concordant therapy, accounting for drug allergies, comorbid conditions, other concurrent infections, and age. Structured data are defined as data recorded as discrete values or codes in EHR fields (eg, age, sex, allergies, diagnosis codes) rather than as free text. In part 2, we applied regular expressions to unstructured data from clinical note text to classify patients who were likely in scenarios necessitating alternative antibiotic choices: a moderate-to-large or complicated pleural effusion, outpatient treatment failure, aspiration pneumonia, critical illness, under-immunization, or immunocompromise or immunodeficiency. In part 3, we summarized the algorithm output into tables that facilitated rapid review of the algorithm label, including the note context surrounding any identified regular expressions and relevant details for each patient. Figure 2 provides a detailed diagram of the algorithm with a summary of the logical rules and label outputs at each step. The algorithm was created in R version 4.2.2 (Vienna, Austria),17,18 and natural language processing was done with the stringr package version 1.5.0 (Boston, MA).19 Both R and stringr are open-source tools that are freely available to the public.
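For illustration, restricting the analysis to the first administered antibiotic can be done with a simple grouping step, as in the minimal sketch below; the column names (encounter_id, med_name, admin_time) are illustrative assumptions and do not reflect the published script.

```r
# Minimal sketch, assuming hypothetical column names: keep only the first
# antibiotic administered per encounter, because the CPG addresses initial
# empirical therapy.
library(dplyr)

abx <- tibble::tribble(
  ~encounter_id, ~med_name,     ~admin_time,
  "E1",          "ceftriaxone", as.POSIXct("2019-03-01 22:10"),
  "E1",          "ampicillin",  as.POSIXct("2019-03-02 06:00"),
  "E2",          "ampicillin",  as.POSIXct("2019-05-14 09:30")
)

first_abx <- abx %>%
  arrange(encounter_id, admin_time) %>%
  group_by(encounter_id) %>%
  slice_head(n = 1) %>%   # first dose across the ED and inpatient portions of the encounter
  ungroup()
```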
Antibiotic selection pathway for CAP. This algorithm summarizes the portions of the institutional clinical practice guideline focused on antibiotic selection. We have excluded much of the other clinical guidance, including diagnostic evaluation, and this figure is not intended to guide clinical decision-making.
Algorithm logical rules and predicted labels summary. H&P, History and Physical
Part 1: Classification of Guideline Concordance Using Structured Data
Structured data were used to determine patient eligibility on the basis of drug allergies, comorbidities, other concurrent infections, and age. We first applied the CPG exclusion criteria: age <3 months or an excluded comorbidity (eg, chronic lung disease other than asthma) at the time of the encounter (Supplemental Information Part 2). We additionally chose to exclude patients who were admitted to the ICU, the high-acuity unit, or a subspecialty service (eg, oncology). We did include patients seen in and discharged from the ED. Finally, although not explicitly mentioned in our CPG, we also chose to exclude patients who had another common infection at the time of the encounter (eg, a urinary tract infection) because those patients may have required an alternative antibiotic.
To identify patients with excluded concurrent infections or comorbid conditions, we used a combination of specific ICD-10 codes and prefixes.20 For example, we used “J95.851” to exclude ventilator-associated pneumonia and “N39*” (in which * is a wildcard matching anything that follows N39) to exclude urinary tract infection (Supplemental Information Part 2). We additionally used regular expression matching to exclude other conditions based on the ICD-10 code description (Supplemental Information Part 2). Regular expressions use specific character sequences to search free text for terms that match keywords or phrases. For example, we searched for “transplant” to exclude any ICD-10 code with “transplant” in the description (eg, Z94.1: Heart transplant status; T86.00: Unspecified complication of bone marrow transplant).
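A minimal stringr sketch of these two exclusion checks is shown below; the code and keyword lists are only the examples quoted above, and the complete lists are in Supplemental Information Part 2.

```r
# Illustrative sketch of the exclusion checks described above: exact codes or
# prefixes (eg, N39*) and keyword matches on code descriptions. The lists here
# are examples only.
library(stringr)

excluded_exact    <- c("J95.851")     # ventilator-associated pneumonia
excluded_prefixes <- c("N39")         # urinary tract infection codes
excluded_keywords <- c("transplant")  # matches Z94.1, T86.00, etc.

is_excluded_dx <- function(code, description) {
  code %in% excluded_exact |
    str_detect(code, paste0("^(", paste(excluded_prefixes, collapse = "|"), ")")) |
    str_detect(str_to_lower(description), paste(excluded_keywords, collapse = "|"))
}

is_excluded_dx("N39.0", "Urinary tract infection, site not specified")  # TRUE
is_excluded_dx("Z94.1", "Heart transplant status")                      # TRUE
is_excluded_dx("J18.9", "Pneumonia, unspecified organism")              # FALSE
```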
After the exclusion criteria were applied, we applied logical rules based on the patient’s drug allergies and age at the time of the encounter. For example, if a patient had a recorded penicillin allergy and received an approved alternative antibiotic (Fig 1), the prescription was labeled “guideline concordant.” Atypical pneumonia coverage was considered guideline-concordant only in patients aged 5 years or older. We similarly accounted for macrolide allergies in these patients, but we did not apply further logical rules governing atypical coverage because we found that clinical decision-making on this point was rarely documented.
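A minimal sketch of this kind of part 1 rule is shown below. The antibiotic lists are illustrative placeholders only; Fig 1 and the institutional CPG define the actual first-line and allergy-alternative agents.

```r
# Minimal sketch of a part-1 logical rule (illustrative antibiotic lists only).
first_line      <- c("ampicillin", "amoxicillin")
pcn_alternative <- c("ceftriaxone", "cefdinir")   # example alternatives for penicillin allergy
atypical_agents <- c("azithromycin")

label_structured <- function(first_abx, has_pcn_allergy, age_years) {
  if (first_abx %in% first_line) {
    "Guideline concordant"
  } else if (has_pcn_allergy && first_abx %in% pcn_alternative) {
    "Guideline concordant: correct alternative for penicillin allergy"
  } else if (first_abx %in% atypical_agents && age_years >= 5) {
    "Guideline concordant: atypical coverage, age >= 5 years"
  } else {
    "Needs note review"   # passed on to part 2
  }
}

label_structured("ceftriaxone", has_pcn_allergy = TRUE,  age_years = 3)  # allergy alternative
label_structured("azithromycin", has_pcn_allergy = FALSE, age_years = 2) # "Needs note review"
```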
Part 2: Classification of Guideline Concordance Using Unstructured Data
Part 2 of the algorithm leverages unstructured data from clinical note text. To analyze this text, the algorithm utilizes natural language processing (NLP), which is an area of computer science that focuses on interpreting free text. NLP can include machine learning approaches, as well as regular expression approaches.
We applied regular expressions to clinical note text to classify patients who were likely in certain clinical scenarios necessitating alternative antibiotic choices. To generate the regular expressions, 2 authors (JY and JH) reviewed the CPG and identified reasons for deviation from the first-line antibiotic recommendation that were not captured in the structured data listed above. The reasons included (1) outpatient treatment failure, (2) aspiration pneumonia, (3) moderate-to-large or complicated pleural effusion, (4) critical illness, (5) un- or under-immunized status, and (6) immunodeficiency (Table 1). Notably, both aspiration pneumonia and immunodeficiency are also exclusion criteria per our CPG, so patients with ICD-10 codes for these conditions were already excluded in part 1. These conditions were nevertheless included in the unstructured data analysis because the authors suspected that the appropriate ICD-10 codes may frequently be missing from the records of patients with these conditions, such that these exclusions would be missed in part 1 of the algorithm.
Example Output of Note Context for Rapid Manual Note Review
| Patient ID | Note Type | Note DT | Phrase Identified | Context |
|---|---|---|---|---|
| Patient 1 | Pediatrics admission MD | XXX | Failed outpatient treatment | A 2 y-old otherwise healthy girl presenting with dehydration and failed outpatient treatment of CAP 1 wk ago developed cough post-tussive emesis progressively worsening cough fevers ×4 |
| Patient 2 | Emergency MD | XXX | Critically ill | Sign-out and transferred care to accepting inpatient attending Dr XXX. I have spent 60 minutes at the bedside of the critically ill patient at risk for life-threatening cardiac respiratory and neurologic decompensation XXX, MD |
| Patient 3 | Emergency MD | XXX | Aspiration pneumonia | On examination in the PCP’s office several days ago, and he was spiking higher fevers. Differential diagnosis also includes aspiration pneumonia or pneumonitis given his choking event that was complicated by emesis during mother’s interventions retained foreign |
DT, date time; PCP, primary care provider.
We listed as many common phrases as possible for each reason for deviation and iteratively refined them using the training cohort. The regular expressions additionally included logic that prevented the algorithm from matching a phrase if it was preceded or followed by certain negating phrases (eg, “no concern for aspiration” or “complex effusion was not visualized”). Similarly, the regular expressions did not match if a phrase was preceded by certain historical context phrases, including both family and personal medical history (eg, “family history of immunodeficiency” or “history of aspiration pneumonia”). See Supplemental Information Part 3 for a full list of phrases in each regular expression category. Each category of phrases was refined by 1 author (JY), who compared the algorithm’s classifications with classifications based on manual review of the training and refinement cohort data. This refinement continued until manual review identified no further phrases that could reasonably be incorporated into the regular expressions. The full list of phrases for each regular expression category and the final R script are available via the link in Supplemental Information Part 4.
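As an illustration of this approach, the sketch below pairs a small subset of category phrases (illustrative only; the complete lists are in Supplemental Information Part 3) with a simple check of the text window preceding each match for negating or historical language. The published regular expressions encode the negation and history logic directly, so the windowed check here is an assumption made for brevity.

```r
# Illustrative sketch only: a few category phrases plus a preceding-context
# check for negating or historical language (not the published regexes).
library(stringr)

deviation_patterns <- c(
  "Outpatient treatment failure" = "fail(ed|ure)? (of )?outpatient",
  "Aspiration pneumonia"         = "aspiration (pneumonia|pneumonitis)",
  "Critically ill"               = "critically ill"
)
negators <- c("no concern for", "denies", "family history of", "history of")

classify_note <- function(note, window = 40) {
  note <- str_to_lower(note)
  hits <- vapply(deviation_patterns, function(pattern) {
    spans <- str_locate_all(note, pattern)[[1]]
    if (nrow(spans) == 0) return(FALSE)
    # Keep a match only if no negating phrase appears in the window before it
    any(vapply(spans[, "start"], function(s) {
      before <- str_sub(note, max(1, s - window), s - 1)
      !any(str_detect(before, fixed(negators)))
    }, logical(1)))
  }, logical(1))
  names(deviation_patterns)[hits]
}

classify_note("Failed outpatient treatment of CAP 1 wk ago.")        # "Outpatient treatment failure"
classify_note("No concern for aspiration pneumonia at this time.")   # character(0)
```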
Part 3: Summary Output for Review
The output from the algorithm included 3 spreadsheet files to facilitate rapid manual review and annotation. The first spreadsheet summarized each excluded patient and the reason for exclusion. The second spreadsheet listed all included patients with their predicted guideline concordance (eg, “Guideline concordant” or “Possibly discordant, needs review”) and an additional label indicating the reason for the algorithm’s label assignment. An example of a label assigned in part 1 using structured data is “Correct alternative in the setting of a cephalosporin allergy,” whereas an example of a label assigned in part 2 using unstructured data is “Outpatient treatment failure.” Finally, a third spreadsheet contained the NLP results, including details about the clinical note, the exact phrase identified (eg, “failed outpatient high-dose amoxicillin”), the category to which that phrase belonged (eg, “Outpatient treatment failure”), and 300 characters of note context before and after the phrase. This spreadsheet enabled rapid review of the regular expression phrases identified by the algorithm. The algorithm additionally generated summary tables that included the total number of patients in the cohort, the number of excluded patients, and the specific reasons for exclusion with their corresponding counts.
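A minimal sketch of how the note context for this third spreadsheet might be extracted is shown below; the function and column names are illustrative assumptions, not those of the published script.

```r
# Sketch of the part-3 review output: capture each matched phrase plus
# surrounding note text and write it to a spreadsheet-style CSV.
library(stringr)

extract_context <- function(note_text, pattern, chars = 300) {
  spans <- str_locate_all(note_text, pattern)[[1]]
  if (nrow(spans) == 0) return(NULL)
  data.frame(
    phrase_identified = str_sub(note_text, spans[, "start"], spans[, "end"]),
    context = str_sub(note_text,
                      pmax(1, spans[, "start"] - chars),
                      pmin(str_length(note_text), spans[, "end"] + chars))
  )
}

hits <- extract_context(
  "...presenting with dehydration and failed outpatient treatment of CAP 1 wk ago...",
  "failed outpatient"
)
# write.csv(hits, "nlp_review_output.csv", row.names = FALSE)
```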
Usability and Flexibility
To ensure that the algorithm was agnostic to EHR vendor and institution, we used RStudio to standardize the data columns, formats, and types. We defined all variables and user inputs upfront, with explanatory text describing the expected information in each variable. Additionally, we included comments throughout the code with plain-language descriptions and examples of functions and regular expressions (Supplemental Information Part 4).
Performance Evaluation
To evaluate algorithm performance, we calculated precision, recall, and F1 score for the algorithm overall and for each category of regular expression phrases. Precision can be interpreted similarly to positive predictive value, recall can be interpreted similarly to sensitivity, and the F1 score is the harmonic mean of precision and recall. The metrics are calculated as precision = true positives/(true positives + false positives), recall = true positives/(true positives + false negatives), and F1 = 2 × [(precision × recall)/(precision + recall)]. We additionally calculated a macro average and a weighted macro average, given that our classes were imbalanced (ie, some labels were assigned to many patients and others to only a few).21,22 Two physicians (JY and SK or EK) independently reviewed every patient in the validation cohort and manually assigned each patient to a category. This category assignment was considered the “actual” label.
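For illustration, the sketch below computes these per-class metrics and the weighted macro average from vectors of predicted and actual labels; the toy labels are for demonstration only and do not correspond to study data.

```r
# Sketch of per-class precision, recall, F1, and the weighted macro average
# (weights are class prevalence in the actual labels). Toy labels only.
per_class_metrics <- function(predicted, actual) {
  classes <- sort(unique(c(predicted, actual)))
  out <- lapply(classes, function(cl) {
    tp        <- sum(predicted == cl & actual == cl)
    precision <- ifelse(sum(predicted == cl) == 0, NA, tp / sum(predicted == cl))
    recall    <- ifelse(sum(actual == cl) == 0, NA, tp / sum(actual == cl))
    f1        <- 2 * (precision * recall) / (precision + recall)
    data.frame(class = cl, precision, recall, f1, support = sum(actual == cl))
  })
  do.call(rbind, out)
}

m <- per_class_metrics(
  predicted = c("concordant", "concordant", "needs review", "aspiration"),
  actual    = c("concordant", "needs review", "needs review", "aspiration")
)
weighted_macro <- sapply(m[, c("precision", "recall", "f1")],
                         function(x) weighted.mean(x, m$support, na.rm = TRUE))
```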
Disagreements between reviewers were discussed among all reviewers until a consensus was achieved. Cohen’s κ statistic was calculated to assess interrater reliability. We performed all statistical analyses in R version 4.2.2 (Vienna, Austria).17
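The manuscript does not specify how Cohen’s κ was computed; one common option in R is the irr package, sketched below with toy reviewer labels as an assumption.

```r
# Assumed approach: unweighted Cohen's kappa for two raters via irr::kappa2().
# install.packages("irr")
library(irr)

reviewer_labels <- data.frame(
  reviewer1 = c("concordant", "needs review", "aspiration", "concordant"),
  reviewer2 = c("concordant", "needs review", "aspiration", "needs review")
)
kappa2(reviewer_labels)  # returns the kappa estimate and test statistic
```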
Results
We identified 1345 patients with CAP, of whom 893 were included in the training cohort and 452 in the validation cohort. Figure 2 displays the algorithm flow diagram with patient counts for the validation cohort for each step of the algorithm. Among the 166 patients excluded from the validation cohort, the most common reasons for exclusion were admission to an excluded clinical service (n = 110) or having an excluded comorbid diagnosis (n = 32). After applying the remaining rules in part 1, 193 of the 286 patients in the validation cohort (67%) were labeled as having received guideline-concordant antibiotics. After applying part 2 of the algorithm, an additional 48 patients (17%) were labeled as likely in one of the scenarios in which deviation from the CPG was appropriate, and 45 patients (16%) were given the final label of “possibly discordant, needs review.” Of these 45 patients, 33 (73%) actually received guideline-discordant therapy. The subcategories of labels are summarized in Fig 2. The summary output of the phrases identified in the notes is provided in Table 1.
In total, 435 of the 452 cases in the validation cohort (96%) were labeled correctly. Table 2 presents recall, precision, and F1 score for the overall algorithm and for a sub-analysis of only the unstructured data elements. Figure 3 graphically depicts the agreement between the algorithm-predicted labels and the actual labels in a Sankey diagram. Of all cases labeled by the algorithm in the validation dataset, only 17 (3.8%) were labeled incorrectly. Of the regular expression categories, the lowest recall was for outpatient treatment failure: 6 of 26 cases were labeled by the algorithm as “possibly discordant, needs review” instead of outpatient treatment failure. On chart review, we found that each of these notes used phrasing that was either unusually specific or atypically worded (eg, “failure of outpatient PO abx”). The category with the lowest precision was immunosuppression, for which 1 of 2 cases should have been labeled as “possibly discordant, needs review.” Overall, “possibly discordant, needs review” accounted for the majority of incorrect labels, with other labels incorrectly assigned in only 5 cases. Cohen’s κ statistic for interrater reliability in generating the actual labels was 0.95.
Overall Algorithm Performance Metrics
| Class | Recall^a | Precision^b | F1^c | Predicted (n) | Actual (n) |
|---|---|---|---|---|---|
| Structured data | | | | | |
| Excluded from CAP guideline | 0.98 | 1.00 | 0.99 | 166 | 169 |
| Guideline concordant | 1.00 | 1.00 | 1.00 | 193 | 193 |
| Unstructured data | | | | | |
| Aspiration pneumonia | 0.87 | 0.81 | 0.84 | 16 | 15 |
| Critically ill | 1.00 | 0.67 | 0.84 | 3 | 2 |
| Immunocompromised/immunodeficiency | 1.00 | 0.50 | 0.67 | 2 | 1 |
| Large/complicated pleural effusion | 1.00 | 1.00 | 1.00 | 7 | 7 |
| Outpatient treatment failure | 0.77 | 1.00 | 0.87 | 20 | 26 |
| Possibly discordant, needs review | 0.87 | 0.73 | 0.80 | 45 | 38 |
| Un- or under-immunized | 0.00 | N/A | N/A | 0 | 1 |
| Unstructured data weighted macro average | 0.84 | 0.83 | 0.83 | — | — |
| Overall weighted macro average^d | 0.96 | 0.97 | 0.96 | — | — |
^a Recall can be interpreted similarly to sensitivity.
^b Precision can be interpreted similarly to positive predictive value.
^c F1 score is the harmonic mean of precision and recall: F1 = 2 × [(precision × recall)/(precision + recall)].
^d Weighted macro average is used when classes are imbalanced (some labels were assigned to many patients, but others were assigned to only a few patients).
Discussion
We describe the creation of an automated algorithm that used both structured and unstructured EHR data as inputs and accurately labeled antibiotic selection for CAP as guideline-concordant or possibly discordant. We found excellent recall, precision, and F1 score for the overall algorithm and satisfactory performance for a sub-analysis of only the unstructured data elements. This suggests that an automated algorithm can substantially reduce the amount of chart review needed for quality improvement efforts assessing CAP guideline adherence, an approach that, to our knowledge, has not previously been described in the literature. Although the authors of previous studies have used NLP to analyze note text, those studies were primarily limited to extracting diagnoses or identifying the presence or absence of parts of the medical history in note documentation.23–25 Additionally, most previous work has used either structured or unstructured data elements alone, although including both types of data has been shown to improve performance.26,27 Our approach is unique in that we used both structured and unstructured data elements and incorporated clinical logic into an automated algorithm to assess the guideline concordance of antibiotic selection with high accuracy.
We believe this algorithm, or a version of it adapted to a given institution’s CAP CPG, could provide significant time and resource savings for the institution’s quality improvement teams or Antimicrobial Stewardship Program. If we presume that quality measurement without the algorithm would require chart review for all patients who did not receive the first-line antibiotic (n = 302 in our validation cohort), use of the algorithm could afford a greater than 6-fold reduction in the number of charts requiring manual review, because only 45 charts were labeled as “possibly discordant, needs review.” When the algorithm identifies clinical note text suggesting a scenario that warrants an alternative antibiotic choice, it streamlines review of the assigned label through a spreadsheet that summarizes the note context. This facilitates rapid review of the algorithm’s assignments for teams that need to maximize label accuracy. Overall, the greater efficiency of the automated algorithm over manual chart review could make it feasible to assess institutional guideline adherence across many guidelines at once and over more sustained time periods.
The algorithm was designed to be user-friendly, making it relatively easy for a new user to learn, modify, and deploy at their institution. We strove to optimize usability by having additional users test the algorithm for understandability, including extensive comments and examples throughout, and using only rules-based NLP. The use of a rules-based NLP model (as opposed to a machine learning model) renders the algorithm more interpretable and makes its output easy to understand and explain.28 This design approach also advances the goal of flexibility because the rules can easily be modified to accommodate a different health system or diagnosis. In addition, by modifying or replacing the current ICD-10 codes, regular expressions, and medications, the algorithm could be adapted relatively easily to changes in the current CPG or applied to a CPG for an entirely different condition. We have made the full R script available as supplemental information.
This study has several limitations. It was conducted at a large academic medical center with extensive informatics and technical resources, which may limit generalizability to centers with fewer resources. Additionally, our algorithm focuses only on the first antibiotic ordered, which does not capture the entirety of antibiotic use in practice. We relied on ICD-10 codes for diagnosis data, which may be inaccurate or incomplete. The use of NLP also has limitations because clinical documentation invariably includes unusual spelling or phrasing that may prevent accurate labeling. Finally, our actual label is not truly a gold-standard label because we relied exclusively on chart review to generate these labels. Chart review reflects a physician’s decision-making at the time, but that decision-making could be flawed; for example, a physician may misinterpret a guideline when defining outpatient treatment failure or aspiration risk.
Conclusions
Our study demonstrates the potential for an automated algorithm to accurately label antibiotic selection as likely guideline-concordant or -discordant using both structured and unstructured EHR data. This tool could save substantial time for quality improvement teams or Antimicrobial Stewardship Programs, and we encourage other institutions to consider adapting our algorithm or developing similar algorithms to reduce manual chart review. Further research could explore ways to improve the accuracy and generalizability of our algorithm and to incorporate additional antibiotic selection evaluation criteria beyond the first antibiotic ordered.
Dr Yarahuan conceptualized and designed the study, designed the data collection methods, created and refined the algorithm, participated in validation, conducted data analysis, drafted the initial manuscript; Drs Kim and Kisvarday participated in validation; Dr Yan critically reviewed, tested, and de-bugged all algorithm code; Dr Hron conceptualized and designed the study, assisted with algorithm refinement, and reviewed and advised on data analysis; Drs Nakamura and Jones assisted with study conceptualization and assisted with the acquisition and interpretation of antimicrobial stewardship recommendations; all authors critically reviewed and revised the manuscript, approved the final manuscript as submitted, and agree to be accountable for all aspects of the work.
FUNDING: Dr Yarahuan was supported by a National Library of Medicine (NLM) training grant: 5T15LM007092-30. The NLM had no role in the design and conduct of the study.
CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no potential conflicts of interest relevant to this article to disclose.