This article describes a 2-phase process implemented by the American Board of Pediatrics in 2021 to investigate and remove potential bias on its General Pediatrics Certifying Examination at the item (question) level based on gender or race and ethnicity. Phase 1 used a statistical technique known as differential item functioning (DIF) analysis to identify items in which 1 subgroup of the population outperformed another subgroup after controlling for overall knowledge level. Phase 2 involved a review of items flagged for statistical DIF by the American Board of Pediatrics’ Bias and Sensitivity Review (BSR) panel, a diverse group of 12 voluntary subject matter experts tasked with identifying language or other characteristics of those items that may have contributed to the observed performance differences. Results indicated that no items on the 2021 examination were flagged for DIF by gender and 2.8% of the items were flagged for DIF by race and ethnicity. Of those items flagged for race and ethnicity, 14.3% (0.4% of total items administered) were judged by the BSR panel to contain biased language that may have undermined what the item was intending to measure and were therefore recommended to be removed from operational scoring. In addition to removing potentially biased items from the current pool of items, we hope that repeating the DIF/BSR process after each examination cycle will increase our understanding of how language nuances and other characteristics impact item performance so that we can improve our guidelines for developing future items.

The American Board of Pediatrics (ABP) seeks to “improve child health by certifying pediatricians who meet standards of excellence and are committed to continuous learning and improvement.”1  Those standards include passage of an initial General Pediatrics Certifying Examination (GP Exam) by a board-eligible physician. As part of their 2017 strategic plan, the ABP started collecting self-reported race and ethnicity information in 2018 from trainees, candidates, and board-certified pediatricians to help achieve 2 specific objectives: (1) to enhance the tracking and reporting of pediatric training and workforce trends2,3  and (2) to identify and eliminate any unintended bias in ABP assessments. This manuscript focuses on efforts pertaining to the latter objective, with a specific focus on investigating potential bias in individual examination questions (hereafter “items”).

The ABP aims to ensure that all its assessment scores, including those from its GP Exam, are valid and accurate reflections of each test taker’s knowledge. An assessment is said to be culturally biased if it systematically underestimates or overestimates what it intends to measure based on a cultural variable such as gender or race and ethnicity.4  If observed performance differences reflect real differences in relevant knowledge between individuals or groups, however, then the assessment is deemed not biased.

Consider, for example, 2 groups who have unequal opportunities to acquire knowledge in a specific content area. Those 2 groups would likely perform differently on an item measuring knowledge in that content area, but the item would not be considered biased because the differential performance reflects real differences in knowledge. In these circumstances, factors that may be contributing to unequal opportunities to learn include structural racism and discrimination, socioeconomic factors, geographic location of training, and many others.

In contrast, a biased item is one where 2 groups have equal levels of knowledge in a specific content area, but something about the way the item is written (eg, unclear phrasing, regional terminology, microaggressions) produces differential performance. Biased items are important to identify because they undermine what the assessment intends to measure.4  Identifying and addressing item bias is separate from, and narrower in scope than, the larger effort of understanding and addressing factors that may be contributing to examination-level performance differences. Although the ABP is committed to both, in this manuscript, we focus specifically on the process we implemented to investigate and remove cultural item bias from the GP Exam.

The GP Exam, administered in October of each year, is a 7-hour examination consisting of 335 multiple choice items. For security purposes, the ABP develops multiple versions of the GP Exam, which are randomly administered to test takers. All versions are developed to be equivalent with respect to content by adhering to the General Pediatrics Content Outline,5  although individual test takers may receive different items. For example, each version of the GP Exam contains items pertaining to “anticipatory guidance” (ie, subdomain 1.E within the content outline), but the specific items will likely vary from 1 version to another. Although versions are also constructed to be equivalent with respect to difficulty level, statistical analyses are conducted after each administration to determine the relative difficulty level of each version as an added layer of quality assurance. If necessary, scores are adjusted to ensure all test takers are held to the same standard and that no test taker is advantaged or disadvantaged by receiving a slightly easier or more difficult version of the GP Exam.6 
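The ABP does not disclose its equating method, but the type of score adjustment described above can be illustrated with a simple linear equating sketch. All numbers and the function name below are hypothetical; this is an illustration of the general idea, not the ABP's procedure.

```python
from statistics import mean, pstdev

def linear_equate(score, form_scores, base_scores):
    """Map a raw score on one exam version onto the scale of a base version.

    Linear equating places a score at the same z-score position within the
    base form's distribution, so a test taker is not advantaged or
    disadvantaged by receiving a slightly easier or harder version.
    Illustrative only; the ABP's actual adjustment is not disclosed.
    """
    mu_form, sd_form = mean(form_scores), pstdev(form_scores)
    mu_base, sd_base = mean(base_scores), pstdev(base_scores)
    return mu_base + sd_base * (score - mu_form) / sd_form

# A slightly harder version (lower mean) maps its scores upward onto the
# base scale (hypothetical score distributions).
harder_form = [60, 64, 68, 72, 76]   # mean 68
base_form   = [62, 66, 70, 74, 78]   # mean 70
print(linear_equate(68, harder_form, base_form))  # 70.0
```

A score at the harder form's mean lands at the base form's mean, preserving each test taker's relative standing across versions.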

The process implemented by the ABP for investigating and addressing potential item bias on the GP Exam consists of 2 major phases (Fig 1). Phase 1 includes a statistical technique known as differential item functioning (DIF) analysis7  that is used to identify items in which 1 subgroup of the population, based on gender or race and ethnicity, outperformed another subgroup after controlling for overall knowledge level, as measured by the total examination score. This process is conducted following the administration of the GP Exam because it requires item response data from test takers. Phase 2 includes a review of items by the Bias and Sensitivity Review (BSR) panel, a volunteer panel of subject matter experts. The BSR panel reviews any items flagged for DIF and identifies language or other characteristics of those items that may have contributed to observed performance differences.

FIGURE 1

Process to identify and remove potential item bias from the GP Exam.

The ABP first piloted DIF analysis using items from the 2019 GP Exam to examine the potential number of items exhibiting differences in performance by gender or race and ethnicity subgroups. The BSR panel was established in 2020, and a 2-phase analysis and review approach was incorporated into the operational scoring of the 2020 GP Exam and repeated in 2021 and 2022 with minor modifications. At the time this manuscript was drafted, 2021 represented the most recent iteration, so the process and results from the 2021 GP Exam are reported in this article. This study was deemed exempt by the ABP’s institutional review board of record.

Data were obtained from the ABP database and fell into 2 categories: (1) examination data pertaining to test takers’ performance on the 2021 GP Exam and (2) demographic data pertaining to the 4206 individual test takers who took the 2021 GP Exam.

The examination data used in these analyses included each test taker’s response to each individual item, along with each test taker’s overall examination score. Because multiple versions were administered in 2021, the total number of items administered was greater than 335. For security purposes, the ABP does not disclose the number of versions developed each year nor the exact number of unique items that compose all versions.

Demographic data, specifically gender and race and ethnicity, were self-reported by resident trainees and board-eligible candidates through examination registration, maintenance of certification enrollment, and other surveys. Gender had traditionally been limited to 2 options (male and female). Changes were made in 2021 to add “nonbinary” and “prefer not to answer” options. The race and ethnicity question used by the ABP for the 2021 examination was based on recommendations from the US Census Bureau’s 2015 National Content Test Race and Ethnicity Analysis Report8,9  (Supplemental Fig 3).

DIF analysis refers to statistical methodology that compares the item performance of a reference group (typically the largest group) with that of a focal group (typically a smaller group). The specific method used in these analyses was the Mantel-Haenszel χ2 test,7,10  which first matches test takers based on overall ability (as measured by their overall examination scores) and then determines whether “matched” test takers from the reference and focal groups perform similarly on a given item. To ensure sufficient statistical power, sample sizes were required to be at least 50 for both the reference and focal groups for an item to be included in the analysis.11  Even with this minimum sample size requirement, DIF statistics based on smaller samples may suffer from larger levels of random error. To reduce the likelihood of falsely flagging acceptable items as biased based on statistical noise (ie, false positives), items were flagged at a significance level of 0.01 when both group sample sizes exceeded 200 and at a stricter level of 0.005 when either group fell below that threshold. In addition, items that more than 90% of all test takers answered correctly were excluded from further review. The software program Winsteps12  was used to conduct all analyses.
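The ABP conducted its analyses in Winsteps, but the Mantel-Haenszel statistic itself can be sketched in a few lines. In this illustrative version (not the ABP's code), test takers are stratified by total score, and each stratum contributes a 2 × 2 table of group membership by item correctness; the flagging rules mirror those described in the text.

```python
def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square for DIF, with continuity correction.

    `strata` is a list of (a, b, c, d) tuples, one per total-score stratum:
      a = reference-group correct, b = reference-group incorrect,
      c = focal-group correct,     d = focal-group incorrect.
    """
    sum_dev, sum_var = 0.0, 0.0
    for a, b, c, d in strata:
        n_ref, n_foc = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        total = n_ref + n_foc
        # Expected reference-group correct count under no DIF
        expected_a = n_ref * m_correct / total
        sum_dev += a - expected_a
        sum_var += (n_ref * n_foc * m_correct * m_incorrect
                    / (total ** 2 * (total - 1)))
    return (abs(sum_dev) - 0.5) ** 2 / sum_var

def flag_item(chi2, n_ref, n_foc, p_correct):
    """Apply flagging rules like those described in the text (illustrative).

    A stricter alpha (.005) applies when either group has 200 or fewer
    test takers; items answered correctly by >90% of test takers are
    excluded from review.
    """
    if p_correct > 0.90:
        return False
    # 1-df chi-square critical values: 6.635 (p = .01), 7.879 (p = .005)
    critical = 6.635 if n_ref > 200 and n_foc > 200 else 7.879
    return chi2 > critical

# Matched groups performing equally in every stratum -> small statistic.
fair = [(80, 20, 40, 10), (80, 20, 40, 10)]
# Reference group outperforming the matched focal group -> large statistic.
dif = [(90, 10, 30, 20), (90, 10, 30, 20)]
print(mantel_haenszel_chi2(fair) < 1)     # True
print(mantel_haenszel_chi2(dif) > 6.635)  # True
```

Because matching is done on total score, a large statistic indicates that test takers of equal overall ability performed differently on the item, which is what distinguishes DIF from a simple group difference in knowledge.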

DIF analysis was used to investigate the performance of both gender and race and ethnicity subgroups. Because female test takers and white test takers constituted the largest gender and racial/ethnic groups, respectively, they were treated as the reference groups. Males represented the sole gender focal group because the sample size for nonbinary (N = 1) was insufficient for analysis. Asian; Black or African American; Hispanic, Latino, or Spanish origin; and Middle Eastern or North African represented the race and ethnicity focal groups. American Indian or Alaska Native and Native Hawaiian or Other Pacific Islander test takers were excluded because of insufficient sample sizes. For analysis purposes, those who self-reported multiple race and ethnicity identities were classified into 1 or more focal groups. For example, if a test taker selected “white” and “Asian,” they were classified as “Asian” in the Asian versus white analysis. If a test taker selected “Asian” and “Black or African American,” they were included in both the Asian versus white and Black or African American versus white analyses. Item performance for each gender and racial/ethnic focal group was independently compared with that of the associated reference group using pairwise comparisons.
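The multi-select classification rule described above can be sketched as follows. The group labels come from the text; the function name and return format are hypothetical.

```python
# Focal groups named in the text, in a fixed analysis order
FOCAL_GROUPS = [
    "Asian",
    "Black or African American",
    "Hispanic, Latino, or Spanish origin",
    "Middle Eastern or North African",
]

def pairwise_analyses(selections):
    """Return the focal-versus-reference comparisons a test taker enters.

    A test taker who selects multiple identities is included in every
    pairwise analysis for which they selected a focal-group identity;
    a test taker who selects only "white" serves as part of the
    reference group in all comparisons. Illustrative sketch only.
    """
    focal = [g for g in FOCAL_GROUPS if g in selections]
    if focal:
        return [f"{g} vs white" for g in focal]
    return ["reference group (white)"] if "white" in selections else []

print(pairwise_analyses({"white", "Asian"}))
# ['Asian vs white']
print(pairwise_analyses({"Asian", "Black or African American"}))
# ['Asian vs white', 'Black or African American vs white']
```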

The ABP selected 12 total panelists (10 pediatricians and 2 nonpediatricians) for the BSR panel to allow for adequate diversity while also providing each panelist the opportunity to meaningfully participate. Prospective volunteers were identified through 2 channels, described in the following section. Regardless of how they were identified, all prospective volunteers received an e-mail outlining the purpose and responsibilities of the panel, along with a link to an electronic survey. The survey contained 4 questions: (1) “Why are you interested in volunteering for the Bias and Sensitivity Review Panel?” (free text); (2) “How comfortable are you discussing potentially sensitive topics, especially those pertaining to racial/ethnic or gender bias?” (1 to 5 scale, 1 being “not at all comfortable” and 5 being “very comfortable”); (3) “How would you rate your ability to identify potentially offensive or biased content?” (1 to 5 scale, 1 being “not at all qualified” and 5 being “very qualified”); and (4) “Please briefly describe any relevant experiences you have had pertaining to bias, sensitivity, diversity, and/or inclusion” (free text).

The first group of prospective volunteers recruited included 52 pediatricians identified through the ABP’s volunteer database, a directory of pediatricians who had previously expressed interest in volunteering for the ABP. These 52 pediatricians met the following criteria: currently certified in general pediatrics, not certified in a pediatric subspecialty, and self-reported racial/ethnic identity was non-white. From this group, 9 survey responses were returned. A second group of prospective volunteers was recruited through the American Academy of Pediatrics’ Minority Health Equity and Inclusion listserv. Because no demographic data were available for these individuals, all members of the listserv received the invitation, which resulted in an additional 102 survey responses. In addition to the 4 survey questions described previously, this version of the survey also asked respondents to self-report their race and ethnicity.

A rubric developed by ABP staff was used to rank survey responses; 10 pediatricians were selected for the panel. Two academicians (nonpediatricians) from the Office of Intercultural Engagement at the University of North Carolina at Greensboro were also invited to serve on the panel, bringing the total number of panelists to 12. The panel consisted of 8 (67%) female and 4 (33%) male panelists, and the racial/ethnic composition was as follows: 4 (33%) Asian or Indian (2 Indian, 2 Asian); 3 (25%) Black or African American; 3 (25%) Hispanic, Latino, or Spanish origin; 1 (8%) Middle Eastern or North African; and 1 (8%) white. No panelists self-identified as more than 1 racial or ethnic category. Panelists were not paid for their participation, but each pediatrician received 10 points of lifelong learning and self-assessment (Part 2) credit that could be applied toward their maintenance of certification requirements.13 

An initial 2-hour training webinar was facilitated by an ABP psychometrician shortly before the administration of the 2021 GP Exam. During the training webinar, all BSR panelists received an overview of project objectives and timelines as well as general information about how items are flagged statistically using DIF analysis. Panelists also reviewed example items that had been flagged in previous administrations and engaged in discussions about the types of language or other issues that may unfairly disadvantage certain subgroups of the population.

Immediately following the administration of the 2021 GP Exam, BSR panelists participated in 2 additional virtual meetings with an individual review assignment between meetings. During the first of these meetings, ABP staff provided an overview of the DIF analysis results, including the total number of flagged items and details about how many items were flagged for each reference-focal group comparison. Panelists also reviewed and discussed 3 flagged items as a training exercise ahead of receiving their individual assignments. A detailed report summarizing the statistical performance of all flagged items was subsequently provided to each panelist. Panelists were able to refer to this statistical report while conducting their independent review of flagged items. Panelists were specifically asked to leave comments for any item containing content (eg, wording or other item characteristics) that may have caused the differential item performance. If a panelist left a comment, other panelists were able to view the comment and the commenter’s name.

At the final meeting, all items receiving at least 1 comment were reviewed and discussed by the entire BSR panel. Following the discussion of each item, the panel made a final recommendation for how the item should be treated: (1) leave the item as scored and allow it to remain available for future examination selection (ie, no problematic content detected); (2) leave the item as scored but send it back to the General Pediatrics Exam Committee14  with suggested revisions (ie, fair to include in scoring but it should be revised before future use); or (3) remove the item from scoring and send it back to the General Pediatrics Exam Committee with suggested revisions (ie, problematic content identified and unfair to include in scoring).

Following the final BSR meeting, those items that were recommended to be unscored by the BSR panel (option 3 listed previously) were referred to the General Pediatrics Oversight Committee (GPOC).14  The GPOC comprises practicing general pediatricians experienced in item development who are tasked with making final scoring and other policy-related decisions for the GP Exam. The GPOC reviewed the items, associated comments from the BSR panel, and relevant data on how removing the items from scoring would impact test takers. The GPOC then made a final scoring decision for each item.

At the conclusion of the review process, all items receiving at least 1 comment during the BSR review, regardless of scoring decisions, were returned to the GP Exam Committee, which is responsible for drafting, reviewing, and approving items for the GP Exam. The BSR panel comments were used to guide revisions to the flagged items from the 2021 GP Exam. Going forward, this feedback will also provide generalizable knowledge that should help this committee and others when developing future items.

A total of 4206 candidates took the 2021 GP Exam. Figure 2 provides the distribution of all 2021 test takers by both gender and race and ethnicity. The DIF analysis was completed during the first week after regular administration to avoid delays in scoring and subsequent score reporting. The dataset at the time of analysis included 3916 test takers. Data were unavailable for 290 test takers whose examinations were scheduled later for health or other approved personal reasons15  or who rescheduled their examinations because of issues at the testing centers. A total of 3618 test takers were included in the DIF analyses investigating racial/ethnic differences because of the exclusion of 286 test takers with unknown status (ie, those who did not answer or who selected “other” or “prefer not to say”) and 13 test takers from subgroups with fewer than 100 test takers per item (ie, nonbinary, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander).

FIGURE 2

Distribution of 2021 GP Exam test takers by gender and race/ethnicity.

Because the ABP does not disclose the number of examination versions that are developed nor the total number of unique items that compose all versions of the examination within a given administration window, results are only reported as percentages. All items administered as part of the 2021 GP Exam were included in the analysis, and of those items, a total of 2.8% were flagged for exhibiting a statistically significant level of DIF because of race and ethnicity. No items were flagged because of gender. Flagged items were provided to the BSR panel for each member to independently review.

Of the 2.8% of items flagged for DIF, just over half (57.1%) received at least 1 comment by a BSR panelist and were subsequently reviewed by the full BSR panel. These items represented just 1.6% of all items administered. Of the items reviewed, most (75.0%) were judged to be fair to include in the set of scored items, whereas the remaining 25.0% were recommended for removal from scoring. The items that were ultimately recommended for removal represented 14.3% of all items flagged for DIF and 0.4% of all items administered. The GPOC approved the BSR panel’s recommendation for all items that were recommended for removal.
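The reported percentages chain together arithmetically, which can be confirmed with a quick sanity check (values taken directly from the text; rounding to 1 decimal place as reported):

```python
flagged = 2.8                 # % of all administered items flagged for DIF
commented_of_flagged = 57.1   # % of flagged items receiving a BSR comment
removed_of_reviewed = 25.0    # % of reviewed items removed from scoring
removed_of_flagged = 14.3     # % of flagged items removed from scoring

# Chain the conditional percentages back to the base of all items administered
commented = flagged * commented_of_flagged / 100
removed = commented * removed_of_reviewed / 100

print(round(commented, 1))  # 1.6 -> matches "1.6% of all items administered"
print(round(removed, 1))    # 0.4 -> matches "0.4% of all items administered"
# The alternative path (14.3% of flagged items) gives the same figure
print(round(flagged * removed_of_flagged / 100, 1))  # 0.4
```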

Table 1 provides additional detail regarding the DIF analysis results and scoring recommendations for each racial/ethnic pairwise comparison, including the percentage of items flagged for statistically significant DIF, the percentage of items receiving at least 1 comment from a BSR panelist, and the percentage of items that were removed from scoring. Because some items exhibited significant DIF for multiple pairwise comparisons (eg, Asian versus white and Black or African American versus white), they are included in multiple columns; consequently, the percentages in the “Total” column are not the sum of the pairwise comparison columns.

TABLE 1

DIF Analysis Results and Scoring Decisions for Racial/Ethnic Pairwise Comparisons

                                  White vs  White vs  White vs   White vs
                                  Asian     Black     Hispanic   Middle Eastern  Total
Items exhibiting significant DIF  0.4%      1.3%a     0.7%       1.1%b           2.8%
Items receiving a BSR comment     0.4%      0.5%      0.5%       0.9%            1.6%
Items removed from scoring        0.3%      0.0%      0.3%       0.1%            0.4%

a 1.2% of the DIF items advantaged white test takers; 0.1% of the DIF items advantaged Black or African American test takers.

b 1.0% of the DIF items advantaged white test takers; 0.1% of the DIF items advantaged Middle Eastern or North African test takers.

Three operational items from the 2021 GP Exam are provided in Table 2 to show the types of items that were flagged, the nature of comments left by BSR panelists, and resulting recommendations. One example has been selected for each of 3 possible outcomes for flagged items: (1) the item was retained in scoring because no reason for the differential item performance was identified; (2) the BSR panel identified cultural reasons for the differential item performance, but the item was deemed to be fair and measuring important pediatric knowledge and was therefore retained in scoring; and (3) the item was removed from scoring because of language or other item characteristics deemed to be culturally biased. All 3 of the example items in Table 2 have been retired from future use to allow for inclusion in this article.

TABLE 2

BSR Comments and Scoring Decisions for 3 Example Items

Item 1
The predisposition of infants and young children to acute otitis media is principally attributed to which of the following?
  • Decreasing immunoglobulin concentrations over the first several years of age
  • Functional and anatomic differences in the eustachian tubes (key)
  • Frequent sinusitis
  • Sensitivity to milk protein
  • Adenotonsillar hypertrophy
DIF result: white vs Asian (P < .01); white vs Middle Eastern or North African (P < .01)
Sample BSR comments: No comments.
Scoring decision and rationale: This item was retained in scoring. The BSR panelists did not identify any problematic content for this item, so the differential item performance was judged to be statistical noise.

Item 2
A mother who has just been diagnosed with breast cancer and has tested positive for the BRCA1 gene requests genetic testing for her 9-y-old daughter. Which of the following is the most appropriate response?
  • Order genetic testing for the daughter
  • Ask the daughter if she would like to be tested
  • Refer the daughter to a gynecologist
  • Recommend that genetic testing be deferred until the daughter is able to make an informed decision (key)
  • Reassure the mother that breast cancer is unlikely in prepubertal girls
DIF result: white vs Asian (P < .01)
Sample BSR comments:
  • Problematic because of accepted cultural differences in the perspective of informed consent in a minor.
  • Problematic because of cultural differences in a child's role, as well as the nuances surrounding ability to make an informed decision.
  • Problematic, in regard to cultural differences, and in regard to a child needing to make informed consent versus what's best for the child.
Scoring decision and rationale: This item was retained in scoring. Although it was acknowledged that cultural differences exist with respect to parental roles and responsibilities in making healthcare decisions for their children, the principles of informed consent and assent are well established in pediatric practice (citations), and all pediatricians, regardless of cultural background, are expected to possess knowledge of these principles and demonstrate evidence-based care.

Item 3
A 19-y-old college student is receiving follow-up evaluation for attention deficit hyperactivity disorder. Review of his medical record shows that he was an excellent student in high school, during which he was using a consistent dose of dextroamphetamine. He has requested replacement of 2 lost prescriptions in the past 3 mo. He reports that his current dose is inadequate for maintaining his attention and requests an increased dosage. Which of the following is the most appropriate next step in management?
  • Increase the dose
  • Prescribe a different stimulant medication
  • Arrange a psychiatry referral
  • Contact his parents to confirm history
  • Discuss the possibility he is selling amphetamines (key)
DIF result: white vs Asian (P < .01); white vs Middle Eastern or North African (P < .01)
Sample BSR comments:
  • Problematic because in the Middle Eastern culture it is considered inappropriate to directly “accuse” someone of something without proof/evidence and can be viewed as insulting and potentially damaging to the doctor-patient relationship.
  • Problematic because of the automatic assumption of malintent when other parts of the situation should be explored; these assumptions are more real for some test takers than others and can often be triggering as well.
Scoring decision and rationale: This item was removed from scoring. Although pediatricians need to be able to recognize and intervene appropriately if a patient, family member, or caregiver may be misusing or selling amphetamines (option E), it was acknowledged that this item had the potential to be perceived as accusatory and disproportionately triggering for some cultures, which may have impacted performance and undermined the utility of this item from an assessment perspective.

These example items help illustrate the primary task of the BSR panel, which was (1) to review flagged items and determine whether anything in each item could explain why certain racial/ethnic groups performed differently on it after controlling for overall knowledge level and (2) to judge whether the differential performance reflected a true difference in knowledge or the impact of another factor unrelated to the content the item intended to assess.

In the case of item 1, the BSR panel could not identify a reason for the differential performance, so it was assumed to be statistical noise, and the item was retained in scoring. Items 2 and 3, however, illustrate the key distinction between items that were retained or removed from scoring. For both items, the BSR panel identified cultural differences that may have resulted in differential item performance. Item 2 was retained in scoring because the BSR panel determined that there were no language or other characteristics that changed what the item intended to measure (principles of consent and assent); thus, differences in subgroup performance were deemed to be primarily from differences in knowledge of an important and relevant topic. Item 3 was intended to measure a test taker’s knowledge of how to recognize and intervene in cases of suspected misuse of amphetamines, but the panel determined that it had the potential to be perceived as accusatory and disproportionately triggering for some cultures. As a result, the item was judged to be measuring something unintended (for some cultures) in addition to what was intended and was therefore recommended to be removed from scoring.

Overall, the DIF/BSR process resulted in 0.4% of all items administered being flagged for review and ultimately removed from the final calculation of test takers’ total scores. This finding suggests that the vast majority of items do not exhibit the type of statistical bias that impacts test takers’ performance. Although data are not yet available, the ABP has introduced a mechanism to track the performance of items that have been flagged and revised using feedback from the BSR panel to determine whether the revisions have reduced or eliminated observed subgroup item performance differences on future examinations. The number of revised items included on future examinations is likely to be small, so it may take several years to obtain a useful analytic dataset, but this work is being incorporated into the ABP’s quality improvement efforts.

We should note that DIF analysis is a post hoc measure: it detects statistical item bias only after the items have been administered. The ABP, however, is also continually taking steps to improve its item development processes. Recent examples include incorporating emerging guidance on inclusive and antibiased language (eg, “Words Matter: AAP Guidance on Inclusive, Anti-biased Language,” 202116; AAP perspective: race-based medicine, 202117) into the item writing orientation and training materials. Such measures are designed to prevent offensive item content regardless of whether it leads to statistical performance differences.
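For readers unfamiliar with how a post hoc DIF screen operates, the following is a minimal sketch of the Mantel-Haenszel procedure, one widely used DIF detection method (and the one cited in this article’s references). It is illustrative only: the ABP’s exact implementation, matching criteria, and flagging thresholds are not specified here, and the A/B/C labels below use the commonly cited ETS delta cutoffs rather than any ABP-specific rule.

```python
# Illustrative Mantel-Haenszel DIF screen. Assumptions (not from the article):
# the matching variable is the total test score, and flagging uses the
# commonly cited ETS delta classification (|delta| >= 1.5 -> "C", large DIF).
import math
from collections import defaultdict

def mh_dif_delta(scores, groups, item_resp):
    """Mantel-Haenszel ETS delta for one item.

    scores    -- total test score per examinee (the matching variable)
    groups    -- 'ref' or 'foc' per examinee (reference / focal subgroup)
    item_resp -- 1 (correct) or 0 (incorrect) on the studied item
    """
    # Stratify examinees by total score; build a 2x2 table per stratum.
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for s, g, r in zip(scores, groups, item_resp):
        cell = strata[s]
        if g == "ref":
            cell["A" if r else "B"] += 1   # reference right / wrong
        else:
            cell["C" if r else "D"] += 1   # focal right / wrong
    num = den = 0.0
    for cell in strata.values():
        n = cell["A"] + cell["B"] + cell["C"] + cell["D"]
        if n == 0:
            continue
        num += cell["A"] * cell["D"] / n
        den += cell["B"] * cell["C"] / n
    alpha = num / den                      # MH common odds ratio estimate
    return -2.35 * math.log(alpha)         # ETS delta metric

def flag(delta):
    """ETS A/B/C labels: A negligible, B moderate, C large DIF."""
    d = abs(delta)
    return "C" if d >= 1.5 else ("B" if d >= 1.0 else "A")

# Example: identical subgroup performance within the score stratum
# yields delta = 0 and a negligible ("A") classification.
d = mh_dif_delta([10] * 8, ["ref"] * 4 + ["foc"] * 4, [1, 1, 0, 0, 1, 1, 0, 0])
print(flag(d))  # → A
```

In practice a statistical significance test accompanies the delta (items can look large on tiny samples by chance), which is consistent with the article’s observation that small subgroup sizes limit what can be flagged reliably.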

One limitation of this study is that per the ABP’s item development guidelines, patient demographic characteristics (eg, gender, age, race) are only included in an item if they are deemed to be relevant and critical. No mechanism currently exists for tracking whether items contain various demographic characteristics, however, so we cannot evaluate the impact of their inclusion on an item’s likelihood to be flagged for DIF.

The ABP has also begun to investigate whether item-level scoring differences by subgroups are more prevalent or pronounced in certain knowledge domains within the General Pediatrics Content Outline,3  which may reflect differences in pediatric training and/or experience. These small but meaningful improvements, along with repeating the DIF/BSR process and providing BSR panelist comments to item writers after each examination cycle, will hopefully increase our understanding of subtle language and other characteristics that should be avoided.

One natural question that arises in discussions pertaining to DIF analysis is, “Should all items that exhibit statistical bias (ie, DIF) be automatically removed from scoring?” Although it is tempting to consider removing all such items, doing so could produce unintended negative consequences. Potential differences in relevant pediatric knowledge may exist because of economic, social, or other structural factors that disproportionately impact certain groups. Differences in knowledge could also be due to limited exposure as a result of training in geographic regions with different prevalence and epidemiology of childhood disorders. In these cases, automatically removing all items exhibiting statistical DIF would likely mask or minimize those knowledge differences,4  leading to less effective identification of knowledge gaps and implementation of strategies aimed at closing those gaps or addressing structural factors that contribute to those gaps. Thus, the BSR panel’s role is critical for ensuring the validity of scores by only removing those items in which the knowledge that is intended to be assessed is perceived by the panel to have been undermined by problematic language or content.

There are a few additional limitations of this research. For the population of GP Exam test takers, race and ethnicity status may be confounded with other factors that impact performance, such as language of origin, quality of training, and access to resources and test preparation materials that may reflect systemic disparities in opportunity for trainees. These factors may partially account for the item performance differences identified during DIF analysis. Investigating the effects of other confounding variables at the item level and at the overall examination level using multivariate modeling is an important next step in this line of research and in the ABP’s efforts to address possible bias in its examinations.

One additional limitation pertains to the small size of the BSR panel and the raw number of panelists in each racial and ethnic subgroup. Only 14% of flagged items were recommended by the panel to be removed from scoring, but a larger panel with more representatives from each subgroup may have been better equipped to identify complex and nuanced content bias. On the other hand, the large percentage of items recommended to be retained in scoring is an indication that many DIF items may have been falsely flagged. Another indication that DIF analysis may have been overflagging items is that no items were flagged in the gender DIF analysis, which had larger sample sizes and greater statistical power. On a similar note, small sample sizes limited the ability to examine groups with fewer than 50 measurements, namely nonbinary, American Indian or Alaskan Native, and Native Hawaiian or Other Pacific Islander candidates. Pooling item responses over several years, where possible, may provide additional opportunities to investigate cultural item bias for these groups.

This article outlines measures that the ABP has taken to investigate and eliminate cultural item bias in the GP Exam, an important step in ensuring that the GP Exam is a fair and accurate assessment of pediatric knowledge and therefore useful for making valid certification decisions. To help establish this process, we have drawn from assessment standards and guidelines18,19 and from the work of other certifying boards.20,21 We recognize the limitations of this effort, however, and our hope is that sharing our methods and results with the pediatric community and with other certifying boards will foster increased transparency and continued improvement in pursuit of examinations that are valid and fair for all test takers.

We acknowledge many individuals, both internal staff members and volunteer pediatricians, who helped design, implement, and improve the process described in this paper. First and foremost, we wish to thank all past and current BSR panelists for their valuable contributions and dedication to improving our certification exams, including Sarah Ann Anderson-Burnett, Guru V. Bhoojhawon, Patricia R. Castillo, Sneha Daya, Jorge F. Ganem, Courtney James, Denise H. Kung, Camila M. Mateo, Asha Morrow, Marwa Moustafa, Joanna Perdomo, Jonathan Tolentino, Coral Yap, Augusto E. Peña, and Amberlina Alston. We sincerely appreciate the efforts of other ABP staff involved in the process, including Amy Olson, who helped recruit BSR panelists and coordinate operational meetings; Jake Cho, who helped identify the most appropriate DIF methodology, execute the operational analysis, and summarize the results for staff and BSR panelists; and Drs Judy Schaechter and Suzanne Woods for their editorial reviews. Last, we thank Ndidi Unaka for her critical review of the manuscript.

Dr Dwyer conceptualized and designed the study, provided implementation oversight, drafted the initial manuscript, and critically reviewed and revised the manuscript; Dr Brucia conceptualized and designed the study, facilitated the work of the Bias and Sensitivity Review Panel, drafted the initial manuscript, and critically reviewed and revised the manuscript; Dr Du conceptualized and designed the study, carried out the initial analyses, drafted the initial manuscript, and critically reviewed and revised the manuscript; Dr Althouse conceptualized and designed the study and critically reviewed and revised the manuscript; Drs Leslie and Turner critically reviewed and revised the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.

FUNDING: All phases of this work were performed by employees of the American Board of Pediatrics. No external funding.

CONFLICT OF INTEREST DISCLOSURES: All authors are employed by the American Board of Pediatrics and receive salary compensation for their work at the ABP.

ABBREVIATIONS

ABP: The American Board of Pediatrics
BSR: Bias and Sensitivity Review panel
DIF: differential item functioning
GP Exam: General Pediatrics Certifying Examination
GPOC: General Pediatrics Oversight Committee

REFERENCES

1. The American Board of Pediatrics. Vision and mission. Available at: https://www.abp.org/content/vision-and-mission. Accessed September 19, 2022
2. Turner AL, Gregg CJ, Leslie LK. Race and ethnicity of pediatric trainees and the board-certified pediatric workforce. Pediatrics. 2022;150(3):e2021056084
3. The American Board of Pediatrics. Latest race and ethnicity data for pediatricians and pediatric trainees.
4. Reynolds CR, Suzuki LA. Bias in psychological assessment: an empirical review and recommendations. In: Reynolds CR, Suzuki LA, eds. Handbook of Psychology. 2nd ed. Hoboken, NJ; 2012
5. The American Board of Pediatrics. General Pediatrics Content Outline. Available at: https://www.abp.org/sites/abp/files/gp_contentoutline_2017.pdf. Accessed September 19, 2022
6. The American Board of Pediatrics. Scoring. Available at: https://www.abp.org/content/scoring-0. Accessed September 19, 2022
7. Dorans N, Holland P. DIF detection and description: Mantel-Haenszel and standardization. ETS Research Report Series. 1992;1992(1):i–40
8. US Census Bureau. 2015 national content test: race and ethnicity analysis report.
9. The American Board of Pediatrics. Pediatric physicians workforce: methodology summary.
10. Belzak WCM. Testing differential item functioning in small samples. Multivariate Behav Res. 2020;55(5):722–747
11. Michaelides MP. An illustration of a Mantel-Haenszel procedure to flag misbehaving common items in test equating. Pract Assess Res Eval. 2008;13(1):7
12. Linacre JM. Winsteps v5.3.1. Available at: https://www.winsteps.com. Accessed March 28, 2023
13. The American Board of Pediatrics. The four parts of MOC. Available at: https://www.abp.org/content/four-parts-moc. Accessed February 28, 2023
14. The American Board of Pediatrics. Current committees. Available at: https://www.abp.org/content/current-committees. Accessed September 19, 2022
15. The American Board of Pediatrics. ABP corporate policy.
16. American Academy of Pediatrics. Words matter: AAP guidance on inclusive, anti-biased language.
17. American Academy of Pediatrics Board of Directors and Executive Committee. AAP perspective: race-based medicine. Pediatrics. 2021;148(4):e2021053829
18. Educational Testing Service. ETS Guidelines for Fairness Review of Assessments. Princeton, NJ: Educational Testing Service; 2009
19. Hambleton R, Rogers J. Item bias review. Pract Assess Res Eval. 1994;4(1):6
20. O’Neill TR, Peabody MR, Puffer JC. The ABFM begins to use differential item functioning. Ann Fam Med. 2013;11(6):578–579
21. O’Neill TR, Wang T, Newton WP. The American Board of Family Medicine’s 8 years of experience with differential item functioning. J Am Board Fam Med. 2022;35(1):18–25