Detection of delayed diagnosis using administrative databases may illuminate the healthcare settings at highest risk. A method for detecting delays in claims data has been validated in children’s hospitals. We sought to further validate this method in community emergency departments (EDs).
We studied patients <21 years old diagnosed with appendicitis from 2008 to 2019 in 8 eastern Massachusetts EDs. Eligible patients had 2 ED encounters within 7 days, the second with an appendicitis diagnosis. Delayed diagnosis was evaluated in the medical records by trained reviewers. A previously validated trigger tool was applied to participants’ electronic medical record data. The tool used only data elements available in administrative data, including initial encounter diagnoses, time between encounters, presence of medical complexity, and ultimate length of stay, and assigned a probability of delayed diagnosis to each patient. Test characteristics at 4 confidence thresholds were determined, and the area under the receiver operating characteristic curve was calculated.
We analyzed 68 children with 2 encounters leading to a diagnosis of appendicitis (i.e., possible delay). When assigning a delayed diagnosis prediction at 4 thresholds of confidence (>0%, >50%, >75%, and >90%), the positive predictive values were 74%, 89%, 92%, and 89%, respectively; the negative predictive values at the latter 3 thresholds were 54%, 52%, and 33%, respectively (not calculable at >0%, at which no case was predicted undelayed). The area under the receiver operating characteristic curve was 0.837 (95% confidence interval 0.719–0.954).
A trigger tool that identifies delayed diagnosis using only administrative data has a high positive predictive value for true delay in community EDs and may be applied in those settings.
Appendicitis is common in children but can be difficult to diagnose.1–3 Timely diagnosis can prevent complications, including perforated appendicitis, sepsis, and, rarely, a need for bowel resection.4 Approximately 5% to 10% of children with appendicitis have a related visit preceding diagnosis, and complications are more likely after delayed treatment.5–8 Understanding the systems factors that predict delays in diagnosis would help reduce the rate of delayed diagnosis. Identifying delayed diagnosis is challenging, however, because case review is time consuming and requires significant expertise. Thus, a useful approach to identifying delayed diagnosis would be highly accurate and would not require manual case review.
We recently developed and validated a method to identify delayed diagnosis in large administrative databases, which contain patient demographics and healthcare claim (i.e., billing) information generated in the course of patient care.9 The advantage of a method that uses only administrative data is that it allows for the study of children who visit community hospitals, which account for most childhood emergency department (ED) visits.10 Such hospitals do not ordinarily share data for research, so administrative data are the only feasible source of information for large-scale studies of their care. The method consists of a trigger tool that assigns a probability that delayed diagnosis occurred for each child with appendicitis in the database. Because the method was created and validated using children’s hospital data, it is unclear how well it generalizes to community hospitals, where case mix, acuity, reasons for delayed diagnosis, and diagnosis coding differ. The ability to use the tool with community hospital data would allow for broad study of the rates and consequences of diagnostic delays across all hospital types.
Our objective was to externally validate an approach for retrospectively detecting delayed diagnosis of appendicitis in administrative data from general hospitals, extending our prior work in children’s hospitals. Successful validation would indicate that the approach could be used in all types of hospitals.
Methods
We performed a retrospective, cross-sectional study to test a trigger tool that incorporates only variables typically included in administrative data (see Trigger Tool below). The tool predicts the presence of delayed diagnosis of appendicitis, which we compared with the criterion standard of detailed electronic health record (EHR) review. Participants were <21 years old, visited 1 of 8 general EDs in eastern Massachusetts (3–127 children with appendicitis per site per year), had a first-time diagnosis of appendicitis, and had an ED visit at any of the sites in the preceding 7 days. EHRs became available at different sites in different years, ranging from 2008 to 2017. The data were originally collected as part of a study of diagnostic error rates across several diseases.11
The ED encounter associated with the appendicitis diagnosis was designated the “diagnosis encounter,” and the preceding encounter the “initial encounter.” For patients with more than 1 previous encounter, the most recent was designated the initial encounter. Cases were identified for inclusion using diagnosis codes (International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM] 540.×, 541, and 542 and ICD-10-CM K35.×–K37.×). Patients were excluded if insufficient medical records existed to determine whether a delayed diagnosis occurred, if no record of a prior encounter existed, if the patient left the ED without being seen, or if the patient was transferred at the conclusion of the initial ED visit (which made determination of a delayed diagnosis impossible). This study was approved by the facilities’ institutional review boards under a waiver of informed consent.
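For illustration, the cohort-selection logic can be sketched as follows. This is a minimal sketch, assuming encounter-level data in a pandas DataFrame with hypothetical columns (patient_id, encounter_time, dx_codes); it is not the authors’ implementation.

```python
import pandas as pd

# Appendicitis code families from the inclusion criteria
# (ICD-9-CM 540.x, 541, 542; ICD-10-CM K35.x-K37.x).
APPENDICITIS_PREFIXES = ("540", "541", "542", "K35", "K36", "K37")

def is_appendicitis(dx_codes) -> bool:
    """True if any diagnosis code belongs to the appendicitis code families."""
    return any(str(code).startswith(APPENDICITIS_PREFIXES) for code in dx_codes)

def find_possible_delays(encounters: pd.DataFrame) -> pd.DataFrame:
    """Pair each first-time appendicitis diagnosis encounter with the most
    recent ED encounter by the same patient in the preceding 7 days."""
    enc = encounters.sort_values(["patient_id", "encounter_time"])
    pairs = []
    for pid, grp in enc.groupby("patient_id"):
        grp = grp.reset_index(drop=True)
        for i in range(1, len(grp)):
            if is_appendicitis(grp.loc[i, "dx_codes"]):
                gap = grp.loc[i, "encounter_time"] - grp.loc[i - 1, "encounter_time"]
                if pd.Timedelta(0) < gap <= pd.Timedelta(days=7):
                    pairs.append({
                        "patient_id": pid,
                        "initial_time": grp.loc[i - 1, "encounter_time"],
                        "diagnosis_time": grp.loc[i, "encounter_time"],
                    })
                break  # first-time diagnosis only: stop at the first appendicitis encounter
    return pd.DataFrame(pairs)
```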
All data were drawn from the hospitals’ EHRs, including the administrative components (diagnosis and procedure codes, timestamps, and demographics) and clinical components (operative and clinical notes, medication administration records, and test results).
Outcome
The reference standard primary outcome was delayed diagnosis as determined by manual case review of the EHR. Delay was defined as appendicitis being present at the initial encounter. Reviewers rated the likelihood that appendicitis was present as “near-definitely not,” “probably not,” “possibly,” “probably,” or “near-definitely,” using the same definitions as in the prior validation study.9 The definitions were originally developed by a multispecialty consensus panel.12 The reviewer assessment was dichotomized as delayed diagnosis (“probably” or “near-definitely”) or no delayed diagnosis (“possibly,” “probably not,” or “near-definitely not”). A subset of cases (34%) was evaluated by a second reviewer.
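As a concrete illustration of this outcome coding (and of the sensitivity analysis described under Analysis, which reassigns “possibly” to the delayed category), a minimal sketch with hypothetical names:

```python
# Primary outcome coding: 1 = delayed diagnosis, 0 = no delayed diagnosis.
PRIMARY = {"near-definitely not": 0, "probably not": 0, "possibly": 0,
           "probably": 1, "near-definitely": 1}

# Sensitivity analysis: "possibly" recategorized as a true delay.
SENSITIVITY = {**PRIMARY, "possibly": 1}

def dichotomize(rating: str, mapping: dict = PRIMARY) -> int:
    """Collapse a 5-level reviewer rating into the binary outcome."""
    return mapping[rating]
```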
Trigger Tool
The goal of the trigger tool was to assign a probability that a patient’s administrative data (i.e., their billing records) represented a real clinical delay in diagnosis. It was originally developed using administrative data from children’s hospitals and validated through chart review.9 The tool is a logistic regression model that takes a patient’s administrative data as input and outputs the probability that a real delay in diagnosis occurred. The inputs are age; sex; history of a complex chronic condition;13 revisit interval (days between the initial and diagnosis encounters); a diagnosis code for perforated appendicitis (ICD-9-CM 540.0–540.1; ICD-10-CM K35.2×, K35.32–33); length of stay of the diagnosis encounter (0–1, 2–3, 4–7, or >7 days); and the presence or absence of specific diagnoses at the initial encounter: abdominal pain, constipation, dehydration, fever, gastroenteritis, genitourinary condition, head, ear, eye, nose, or throat condition, leukocytosis, urinary tract infection, viral infection, or none of the above. The trigger tool was not modified from its original form; thus, this study represents an external cohort validation.
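The general form of the tool can be sketched as follows. This is a sketch only: the coefficient values below are placeholders for illustration, not the published model weights (which are reported in the derivation study9), and the feature names are hypothetical.

```python
import math

# Placeholder coefficients; the real weights come from the derivation
# study (reference 9) and are NOT reproduced here.
INTERCEPT = -1.0
WEIGHTS = {
    "age_years": 0.02,
    "female": 0.10,
    "complex_chronic_condition": 0.30,
    "revisit_interval_days": 0.25,
    "perforation_code": 0.80,           # ICD-9-CM 540.0-540.1; ICD-10-CM K35.2x, K35.32-33
    "los_4_to_7_days": 0.60,            # one indicator per length-of-stay category
    "initial_dx_abdominal_pain": 0.50,  # one indicator per initial-encounter diagnosis
}

def delay_probability(features: dict) -> float:
    """Logistic model: P(delay) = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    z = INTERCEPT + sum(WEIGHTS[name] * x for name, x in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Example: flag a patient if the predicted probability exceeds 75%.
patient = {"age_years": 14, "female": 1, "complex_chronic_condition": 0,
           "revisit_interval_days": 2, "perforation_code": 1,
           "los_4_to_7_days": 1, "initial_dx_abdominal_pain": 1}
flagged = delay_probability(patient) > 0.75
```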
Analysis
The prevalence of delayed diagnosis was determined in the whole cohort. We constructed a receiver operating characteristic (ROC) curve to illustrate the tradeoff between the sensitivity and specificity of the trigger tool in correctly classifying delayed diagnosis, and we computed the area under the ROC curve (AUC). Sensitivity, specificity, positive predictive value, negative predictive value, and accuracy were determined at several thresholds of delayed diagnosis likelihood: >0%, >50%, >75%, and >90%. Test characteristics were reported as percentages with 95% binomial exact confidence intervals. We determined interrater reliability using Cohen’s κ.
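A sketch of this analysis follows, assuming y_true holds the chart-review reference standard (1 = delayed) and y_prob the trigger tool’s predicted probabilities; it uses scikit-learn and statsmodels, and all variable names are illustrative rather than the authors’ code.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score, roc_curve
from statsmodels.stats.proportion import proportion_confint

def test_characteristics(y_true, y_prob, threshold):
    """Sensitivity, specificity, PPV, and NPV at one probability threshold,
    each with a 95% binomial exact (Clopper-Pearson) confidence interval."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(y_prob) > threshold
    tp = int(np.sum(pred & y_true))
    fp = int(np.sum(pred & ~y_true))
    fn = int(np.sum(~pred & y_true))
    tn = int(np.sum(~pred & ~y_true))

    def rate(k, n):
        if n == 0:
            return None  # undefined, e.g., NPV at >0%, where no case is predicted negative
        lo, hi = proportion_confint(k, n, alpha=0.05, method="beta")
        return k / n, (lo, hi)

    return {"sensitivity": rate(tp, tp + fn), "specificity": rate(tn, tn + fp),
            "ppv": rate(tp, tp + fp), "npv": rate(tn, tn + fn)}

# auc = roc_auc_score(y_true, y_prob)        # area under the ROC curve
# fpr, tpr, _ = roc_curve(y_true, y_prob)    # points for the ROC plot
# kappa = cohen_kappa_score(rater1, rater2)  # interrater reliability
# for t in (0.0, 0.5, 0.75, 0.9):
#     print(t, test_characteristics(y_true, y_prob, t))
```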
We conducted a sensitivity analysis by varying the threshold for determination of the outcome by recategorizing “possible delayed diagnosis” cases as true delays in diagnosis.
Results
There were 2777 children with appendicitis. Among them, we included 72 (2.6%) children with possible delayed diagnosis of appendicitis based on having at least 2 ED encounters leading to an appendicitis diagnosis. Four were excluded: 1 for insufficient records to perform the case review, 2 for leaving without being seen, and 1 for being transferred out after the initial encounter. We analyzed 68 (94%) cases arising from the 8 hospitals (Table 1).
Demographic Characteristics of the 68 Analyzed Patients
| | All Patients, N = 68, n (%) | Patients With Probable or Near-Definite Delayed Diagnosis on Manual Review, N = 50, n (%) |
|---|---|---|
| Age, y, median (IQR) | 14.5 (10.5–17.2) | 14.2 (9.8–16.6) |
| Female | 33 (48.5) | 25 (50.0) |
| Race | | |
| Asian | 3 (10.3) | 2 (9.1) |
| Hispanic | 8 (27.6) | 8 (36.4) |
| Non-Hispanic Black | 3 (10.3) | 1 (4.5) |
| Non-Hispanic White | 15 (51.7) | 11 (50.0) |
| Primarily English speaking | 60 (88.2) | 42 (84.0) |
| Primary insurance | | |
| Private | 42 (61.8) | 33 (66.0) |
| Public | 16 (23.5) | 11 (22.0) |
| Other | 10 (14.7) | 6 (12.0) |
| Complex chronic condition | 11 (16.2) | 7 (14.0) |
| Perforated appendicitis at time of diagnosis | 17 (28.3) | 17 (34.0) |
Counts may not sum to column totals because of missing data; percentages are of patients with nonmissing data. IQR, interquartile range.
The prevalence of true delayed diagnosis was 50 of 68 (74%), of whom 10 were classified as probable delay and 40 were classified as near-definite delay. The receiver operating characteristic curve for the trigger tool prediction of delayed diagnosis is shown in Fig 1. The AUC was 0.84 (95% confidence interval [CI] 0.72–0.96). Test characteristics are shown in Table 2 at varying thresholds of confidence in the trigger tool prediction of delayed diagnosis.
Receiver operating characteristic curve for trigger tool prediction of delayed diagnosis of appendicitis at varying prediction thresholds. Prespecified cutoffs for evaluating test characteristics included a delayed diagnosis likelihood of >0%, >50%, >75%, and >90%.
Test Characteristics of the Trigger Tool’s Prediction of Delayed Diagnosis Using the Criterion Standard of Electronic Health Record Review
| Trigger Tool Predicted Delay Threshold | Sensitivity, % (95% CI) | Specificity, % (95% CI) | PPV, % (95% CI) | NPV, % (95% CI) |
|---|---|---|---|---|
| >0% | 100 (93–100) | 0 (0–19) | 74 (61–83) | NA |
| >50% | 78 (64–88) | 72 (47–90) | 89 (75–96) | 54 (33–74) |
| >75% | 72 (58–84) | 83 (59–96) | 92 (79–98) | 52 (33–71) |
| >90% | 34 (21–49) | 89 (65–99) | 89 (67–99) | 33 (20–48) |
NA, not available; NPV, negative predictive value; PPV, positive predictive value.
Twenty-three cases (34%) were reviewed in duplicate to assess interrater reliability. The reviewers agreed in 21 of 23 (91%) cases, and Cohen’s κ was 0.78, representing moderate agreement.14
The sensitivity analysis involved recategorizing patients judged on review to have a possible delayed diagnosis. After reassigning such cases to be considered as having a delayed diagnosis, the proportion with delay increased to 54 of 68 (79%). The AUC improved to 0.93 (95% CI 0.87–0.99). The positive predictive value of the trigger tool at a threshold of >75% was 97% (95% CI 87–100). Cohen’s κ was 1.0, representing perfect agreement.
Discussion
In a cohort of 68 children with possible delayed diagnosis of appendicitis, a previously validated trigger tool accurately distinguished between children with and without true delays. At a trigger tool confidence threshold of >75%, the positive predictive value was 92%, indicating that flagged cases nearly always represent true delays. Trigger tool sensitivity was reasonable (72%). Taken together, these findings suggest that this trigger tool can produce reasonably accurate counts of children with delayed diagnosis.
The goal of the trigger tool is to enable population-level research on the rates of, and systems risk factors for, delayed diagnosis. The tool would also be useful to quality and safety managers, who could use it to monitor for and identify cases of delayed diagnosis within health systems, enabling common cause or root cause analysis and feedback to clinicians. Such tools reduce the number of case reviews needed for quality assurance work to a manageable level.15
The trigger tool uses only information available in claims data, so no human review is required to determine whether a delayed diagnosis occurred.15 Because the trigger tool has now been validated in both pediatric and general EDs, it can be used on large claims datasets to evaluate rates and predictors of delayed diagnosis. In the future, the trigger tool is intended to be used at a prespecified threshold of 75%. However, the tool is flexible; a user may choose a lower threshold if greater sensitivity is desired, or a higher threshold if greater specificity is needed.
In the sensitivity analysis, children reviewed as having a possible delayed diagnosis were categorized as having a true delay. This recategorization improved model performance significantly, with a nearly perfect positive predictive value of 97% and perfect interrater reliability. This indicates that false positive results from the trigger tool are largely because of children with possible delayed diagnosis, some of whom are likely to have experienced delay. It also highlights the challenges of human review of delayed diagnosis. Assigning a level of confidence to the determination of delayed diagnosis is inherently subjective, particularly for the cases that exist in a gray area (i.e., cases with a “possible” delayed diagnosis).
Study limitations include the restricted geography (eastern Massachusetts only) and the use of administrative data derived from EHRs rather than true insurance claims.
In conclusion, a trigger tool that identifies delays in diagnosis using only health claims in community EDs has a high positive predictive value for true delayed diagnosis. The tool may be applied in community EDs to evaluate diagnostic quality.
FUNDING: Dr Michelson received funding from CRICO and through award K08HS026503 from the Agency for Healthcare Research and Quality.
CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no conflicts of interest relevant to this article to disclose.
COMPANION PAPERS: Companions to this article can be found online at www.hosppeds.org/cgi/doi/10.1542/hpeds.2023-007157 and www.hosppeds.org/cgi/doi/10.1542/hpeds.2023-007249.
The data are available upon reasonable request.
Dr Michelson conceptualized and designed the study, performed statistical analysis, drafted and approved the final manuscript; Mr McGarghan collected data, designed the data collection instruments, and revised and approved the final manuscript; Drs Waltzman and Samuels-Kalow directed regulatory and data collection efforts at study sites and revised and approved the final manuscript; and Dr Bachur supervised the design and analysis of the study and revised and approved the final manuscript.