OBJECTIVES. Our goal was to determine whether (1) preterm children were referred, identified, and received early-intervention/special education services at rates equivalent to term children after implementation of a universal, periodic Ages & Stages Questionnaire screening and surveillance system, (2) pediatricians sufficiently lowered their screening thresholds with preterm children, and (3) quality-improvement opportunities exist.
PATIENTS AND METHODS. Secondary analysis was performed on 64 lower-risk, predominantly late-preterm and 1363 term children who originally presented for their 12- or 24-month well-child visits. Higher-risk premature infants already involved with an early-intervention agency or identified with a delay were excluded. Board-certified pediatricians (n = 18) and nurse practitioners (n = 2), who were blinded to the Ages & Stages Questionnaire results, were secondary participants. Differences between preterm and term early-intervention agency referrals were examined by comparing the pediatric developmental impression to the Ages & Stages Questionnaire under natural clinic conditions, using a combined in-office or mail-back data-collection protocol. Follow-up outcomes were obtained from medical charts and county early-intervention or special education agency records at 36 to 60 months.
RESULTS. Preterm referral rates were 9.5% (term: 5.6%) with the pediatric developmental impression and 26.2% (term: 8.1%) with the Ages & Stages Questionnaire. At follow-up, 37.5% of preterm and 20.8% of term children received referrals, of whom 50.0% of preterm and 42.4% of term children were eligible for services; 54.2% of preterm children were identified with a developmental-behavioral disorder, and 29.2% of preterm and 20.8% of term children did not follow up. For the Ages & Stages Questionnaire, only preterm referrals (55.6%) were subsequently identified with an eligible delay, a disorder, or both. Preterm children were ∼2 times more likely to be eligible than term children.
CONCLUSIONS. Combined referral, quality-improvement, and outcome data suggest that clinicians should lower their threshold for administering a quality developmental screening instrument when providing surveillance for premature infants. Quality-improvement opportunities exist in diligent developmental surveillance and a standardized, reliable, yet more interpersonal referral process.
Interpreting measures of agreement
This paper reports a secondary analysis of a very important data set describing the outcomes at ages 3 to 5 years of 1363 term and 64 preterm children referred for evaluation of developmental delay on the basis of screening by the clinician's impression (the pediatric developmental impression, PDI) and the age-appropriate ASQ at 12 and 24 months of age. The authors report that the agreement between the PDI and the ASQ for terms (82.4%) was significantly higher than for preterms (66.4%) and later note that "52.9% of referred preterm cases would not have promptly occurred without the ASQ." The authors interpret these differences as evidence that clinicians are less aware of delay in preterms than in terms.
This would be an important finding if substantiated. However, this type of interpretation requires the information usually presented in 2x2 tables to rule out plausible rival interpretations. No relevant 2x2 tables are presented in the paper, nor can they be calculated from the information provided. The presentation of results is confusing, important information is sometimes missing or hard to find, and numbers sometimes differ between the narrative, the table, and the figure.
Without the information for the 2x2 tables, there is no way for the reader to determine whether the observed difference truly reflects the pediatricians' failure to recognize cases, or whether it is an artifact of the high false-positive rate of the ASQ or, more likely, of the difference in the chance of disagreement between terms and preterms. Methodological discussions of measuring agreement for categorical data (1,2) note that analysis of raw agreement is inappropriate and that whatever measure is used should be accompanied by the 2x2 tables of data. The method most commonly recommended for this analysis is Cohen's kappa, which corrects for chance. The chance of agreement differs for terms and preterms: 23% (14/65) of preterms were verified as EI eligible for service or monitoring vs. 12% (159/1363) of terms. (These numbers increase to 32% and 14%, respectively, when adjusted for verification bias.) A significant difference between the kappas would support the authors' interpretation. However, one would still need to look at the 2x2 tables, because kappa can be misleading when the distribution is skewed, as seems likely with 86-88% of the terms in one category. Meade et al (3) describe the problem as follows: "...when the proportion of positive ratings is extreme, the possible agreement above chance agreement is small and it is difficult to achieve even a moderate value of kappa." Meade et al (3) and McGinn et al (2) also describe the use of φ, a chance-independent measure of agreement based on the odds ratio. This measure is not influenced by extremes in the distribution of positive and negative results and has several mathematical advantages over kappa. Regardless of the method used, the 2x2 tables need to be presented to support the interpretation.
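The distinction can be sketched numerically. The short Python example below computes raw agreement, Cohen's kappa, and the odds-ratio-based φ for two invented 2x2 tables that share the same odds ratio but differ in marginal skew; the tables and the function name are illustrative assumptions, not the study's data (which the article does not report).

```python
# Illustration: raw agreement and kappa shift with marginal skew, while an
# odds-ratio-based measure (phi) does not. The 2x2 tables below are invented
# for illustration; they are NOT the study's data.

def agreement_measures(a, b, c, d):
    """Agreement statistics for a 2x2 table of two raters.

    a = both positive, b = only rater 1 positive,
    c = only rater 2 positive, d = both negative.
    """
    n = a + b + c + d
    po = (a + d) / n                                       # raw (observed) agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # agreement expected by chance
    kappa = (po - pe) / (1 - pe)                           # Cohen's kappa
    odds_ratio = (a * d) / (b * c)
    # Chance-independent agreement based on the odds ratio,
    # as described by Meade et al (3) and McGinn et al (2).
    phi = (odds_ratio ** 0.5 - 1) / (odds_ratio ** 0.5 + 1)
    return po, kappa, phi

# Balanced marginals: half the sample positive on each rating.
balanced = agreement_measures(40, 10, 10, 40)
# Skewed marginals: same odds ratio (16), but ~96% of cases negative.
skewed = agreement_measures(4, 10, 10, 400)

print(balanced)  # po = 0.80, kappa = 0.60, phi = 0.60
print(skewed)    # po ≈ 0.95, kappa ≈ 0.26, phi = 0.60
```

Despite higher raw agreement, the skewed table yields a much lower kappa, while φ is identical for both tables, which is precisely the behavior the Meade et al quotation describes.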
One should also consider how the difference in verification rates between the PDI and the ASQ contributed to the difference in agreement rates between terms and preterms. In the original publication (4), referrals were verified 96% of the time for the PDI and 71% of the time for the ASQ. This information is not explicit in the present article, but it raises the possibility that disagreement between the two measures reflects the greater inaccuracy of the ASQ.
Another problem is the apparent disparagement of the clinicians' decisions. Anecdotal data in this article elaborate on the 4 or 5 (the specific number is not clear) preterms whose "disorder" was "missed by the board certified pediatrician at 12 or 24 months" but identified by the ASQ so that referral "more promptly occurred," as judged by status assessed 1 to 4 years later. It is difficult to assess this kind of statement when data are presented with confusing (Table 1) or unclear (Figure 1) labels; however, it appears, at least in Figure 1, that referrals made by the PDI alone are reflected in the numbers for referrals made when the ASQ was not returned. When the ASQ was returned, 40% (17/43) of preterms were referred and 82% (14/17, adjusted) were verified; when the ASQ was not returned, 32% (7/22) of preterms were referred (presumably by the PDI alone) and 86% (6/7, adjusted) were verified. Among terms, 22% (161/733) were referred when the ASQ was returned, with a verification rate of 66% (106/161, adjusted), whereas 19% (122/630) were referred by the PDI alone, with a verification rate of 70% (86/122, adjusted). Viewed from this perspective, the results are much more balanced than the narrative would suggest. What useful purpose to science is served by such one-sided reporting? Surely the most important point is to find the best, most feasible combination of procedures to identify children early. One of the most important findings reported in this article is that even "low-risk" preterms are recognized to have a significantly higher rate of qualifying for developmental services than term infants as early as 12 and 24 months. What is not clear is how the other information can be interpreted to guide the most appropriate use of the available resources.
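The rates quoted above follow directly from the counts read off Figure 1; the sketch below simply redoes that arithmetic. The `rates` helper and the group labels are ours for illustration, and the verified counts are the adjusted figures quoted in the text.

```python
def rates(referred, screened, verified):
    """Return (referral rate, verification rate among referred) for one group."""
    return referred / screened, verified / referred

# Counts as read from Figure 1 of the article: (referred, screened, verified),
# with verified counts adjusted for verification bias as quoted in the text.
groups = {
    "preterm, ASQ returned":     (17, 43, 14),
    "preterm, ASQ not returned": (7, 22, 6),      # presumably PDI alone
    "term, ASQ returned":        (161, 733, 106),
    "term, ASQ not returned":    (122, 630, 86),  # PDI alone
}

for name, counts in groups.items():
    referred_pct, verified_pct = rates(*counts)
    print(f"{name}: referred {referred_pct:.0%}, verified {verified_pct:.0%}")
```

Run side by side, the four groups show referral rates of 40%, 32%, 22%, and 19% and verification rates of 82%, 86%, 66%, and 70%, which is the balance the paragraph above describes.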
This is impossible without a clearer and more complete presentation of the data, such as appropriate 2x2 tables and verification rates for the PDI alone, the ASQ alone, and the combination of ASQ and PDI.

Bonnie W. Camp, MD, PhD
Professor Emeritus of Pediatrics and Psychiatry
University of Colorado School of Medicine
James R. Murphy, PhD
Professor of Biostatistics
Head, Division of Biostatistics and Bioinformatics, National Jewish Health
Adjunct Professor, Colorado School of Public Health
Reference List
(1) Altman DG. Some common problems in medical research. In: Practical statistics for medical research. New York, NY: Chapman and Hall; 1991. p. 396-438.
(2) McGinn T, Guyatt G, Cook R, Korenstein D, Meade M. Measuring agreement beyond chance. In: Guyatt G, Rennie D, Meade MO, Cook DJ, editors. Users' Guides to the Medical Literature. 2nd ed. Chicago, IL: McGraw-Hill Professional; 2008. p. 481-9.
(3) Meade MO, Cook RJ, Guyatt G, Groll R, Kachura JR, Bedard M, et al. Interobserver variation in interpreting chest radiographs for the diagnosis of acute respiratory distress syndrome. Am J Respir Crit Care Med 2000;161:85-90.
(4) Hix-Small H, Marks K, Squires J, Nickel R. Impact of implementing developmental screening at 12 and 24 months in a pediatric practice. Pediatrics 2007 Aug;120(2):381-9.
Conflict of Interest:
None declared