Detecting disparities in health care requires special statistical consideration to assess meaningful differences in exposure, process, or outcome between 2 or more groups on the basis of race, ethnicity, or language. Statistical tests with resulting P values need to be contextualized and thresholds of significance selected carefully before drawing conclusions.

Achieving equitable health for patients and families requires identifying disparities, conducting root cause analyses of the drivers of disparities, and designing and implementing interventions to address the root causes in collaboration with patients/families.1 An important first step in this framework, detecting disparities, requires special statistical consideration to assess meaningful differences in exposure, process, or outcome that adversely affect groups that are socially disadvantaged.2

Statistics are flexible tools for drawing conclusions from data in many different situations, but applying the same standards in every situation may not be necessary or even desirable. Rather, the context in which conclusions from the analysis will be used should inform the choice of the most appropriate statistical approach. For example, a local, equity-focused quality improvement activity may establish a different statistical threshold than national policy-generating research.

Statistical hypothesis tests are designed to scientifically evaluate whether the available data support a particular hypothesis. In the context of detecting disparities, we may test the null hypothesis (ie, the statement we believe is true unless the data disprove it) that there is no difference in an outcome between 2 or more groups based on race, ethnicity, or language against the alternative hypothesis that there is a difference in outcomes (ie, a disparity is present). For illustrative purposes, we will compare outcomes between Black patients and white patients. In statistical nomenclature:

Null (H0): outcome in Black patients = outcome in white patients (ie, no disparity present).

Alternative (H1): outcome in Black patients ≠ outcome in white patients (ie, disparity present).

We use the data we have collected and apply an appropriate statistical test to see if there is enough evidence to reject H0 and conclude that there is a disparity. Typically, we decide we have sufficient evidence of a difference if the P value of the statistical test is below a certain threshold (eg, P < .05, a commonly used, albeit arbitrary, threshold). There is ongoing debate in the scientific community regarding the use of P value thresholds for making decisions about the hypotheses3–5 and some helpful guides to using P values appropriately.6–8 In this article, we discuss important considerations when using a P value to detect disparities in health care and question whether the commonly used level of vigilance is necessary in this context, so that important disparities are not overlooked and appropriate action is taken.
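To make this decision rule concrete, the sketch below applies a χ2 test to a hypothetical 2 × 2 table of outcomes by group and compares the resulting P value with a prespecified threshold. The counts, the Python/scipy implementation, and the .05 threshold are illustrative assumptions, not data or methods from this article.

```python
# A minimal sketch of the hypothesis test described above. The 2 x 2 table
# is hypothetical; in practice, these would be observed patient counts.
from scipy.stats import chi2_contingency

# Rows: Black patients, white patients. Columns: outcome present, outcome absent.
observed = [[30, 970],   # hypothetical: 30 of 1000 Black patients had the outcome
            [18, 982]]   # hypothetical: 18 of 1000 white patients had the outcome

ALPHA = 0.05  # significance threshold, chosen before examining the results

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value < ALPHA:
    print(f"P = {p_value:.3f} < {ALPHA}: reject H0 (evidence of a disparity)")
else:
    print(f"P = {p_value:.3f} >= {ALPHA}: do not reject H0")
```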

The P value is commonly thought of as a measure of the strength of the evidence against H0. The smaller the P value, the more evidence we have from our data that H0 should be rejected in favor of H1. Because the “p” in P value stands for probability, the P value always ranges between 0 and 1. So, the closer the P value is to 0, the more evidence there is to reject H0. But how do we know how small the P value must be for us to reject H0?

Typically, in the Methods section of an article, there will be a statement along the lines of “P < .05 was considered statistically significant.” But why .05? Although it developed largely by convention as the standard threshold for statistical significance across many disciplines, it does have credibility when we consider the possible errors in deciding whether to reject H0.

There are 2 types of errors that we can make when testing a hypothesis: we can reject H0 when it is true (Type I), or we can fail to reject H0 when it is false (Type II). In our example above, a Type I error would be concluding that the outcome for Black and white patients is different when, in fact, there is no difference; a Type II error would be concluding that the outcome for Black and white patients is essentially the same when, in fact, a disparity is present (Table 1).

TABLE 1

Errors That Can Be Committed When Drawing Conclusions From the Data

                                      Reality
What We Conclude From Our Data        H0 Is True (There Is No Disparity)    H0 Is False (There Is a Disparity)
Accept H0 (there is no disparity)     Correct                               Type II error
Reject H0 (there is a disparity)      Type I error                          Correct

The .05 that we compare the P value against is the probability that we will commit a Type I error and reject H0 when it is true. As an example, in a study of medication effectiveness, this error would lead researchers to conclude that the medication improved outcomes when, in fact, it did not. Sometimes, researchers will guard even further against committing a Type I error by setting the threshold more conservatively, using P < .01 for statistical significance.
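The meaning of this threshold can be checked empirically. The simulation sketch below, which uses hypothetical event rates and group sizes, estimates both error rates from Table 1: when H0 is true, every rejection is a Type I error; when H0 is false, every failure to reject is a Type II error.

```python
# A simulation sketch with hypothetical rates: estimate Type I and Type II
# error rates for a chi-square test at a fixed significance threshold.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
N, ALPHA, SIMS = 1000, 0.05, 2000  # patients per group, threshold, replications

def rejects_h0(rate_a, rate_b):
    """Simulate one 2-group study and test H0: the event rates are equal."""
    events_a, events_b = rng.binomial(N, rate_a), rng.binomial(N, rate_b)
    table = [[events_a, N - events_a], [events_b, N - events_b]]
    _, p, _, _ = chi2_contingency(table)
    return p < ALPHA

# H0 true: both groups share a 5% event rate, so any rejection is a Type I error.
type1 = np.mean([rejects_h0(0.05, 0.05) for _ in range(SIMS)])
# H0 false: 8% vs 5%, so any failure to reject is a Type II error.
type2 = np.mean([not rejects_h0(0.08, 0.05) for _ in range(SIMS)])
print(f"Estimated Type I error rate:  {type1:.2f}")  # should sit near ALPHA
print(f"Estimated Type II error rate: {type2:.2f}")
```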

One important property of the probabilities of committing Type I and Type II errors is that they are inversely related. Consequently, if we increase the allowable probability of committing a Type I error from .05 to .10, we simultaneously decrease the probability of committing a Type II error. Context is therefore important when weighing the trade-offs between the 2 types of errors in detecting disparities. In some contexts, committing a Type II error (ie, concluding the absence of a disparity when it is actually present) may be considered as egregious as committing a Type I error (ie, concluding the presence of a disparity when it is actually absent).

This trade-off raises the question of whether the “usual” level of vigilance against a Type I error (ie, P < .05) is necessary when detecting disparities in health care. Would we be sufficiently convinced that there is a disparity if P = .10? This would mean that we still have only a 10% chance of erroneously concluding the presence of a disparity when there is, in fact, not one, while simultaneously lowering the chance that we conclude no disparity exists when there is one. What about P = .20?
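One way to put numbers on this trade-off is a rough power calculation. The sketch below uses a standard normal approximation for comparing 2 proportions, with hypothetical rates and group sizes; it is meant only to show that loosening the Type I threshold shrinks the Type II error rate, not to serve as a power analysis tool.

```python
# Approximate power of a two-sided, two-proportion test at several alpha
# levels (normal approximation; rates and sample size are hypothetical).
from math import sqrt
from scipy.stats import norm

p1, p2, n = 0.08, 0.05, 1000                      # hypothetical true rates, per-group n
se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)  # SE of the difference in proportions

for alpha in (0.05, 0.10, 0.20):
    z_crit = norm.ppf(1 - alpha / 2)              # two-sided critical value
    power = norm.cdf(abs(p1 - p2) / se - z_crit)  # approximate power (1 - Type II)
    print(f"alpha = {alpha:.2f}: power ~ {power:.2f}, Type II error ~ {1 - power:.2f}")
```

With these hypothetical inputs, each step up in alpha buys a lower Type II error rate, which is exactly the trade-off described above.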

Although there is certainly a role for statistical testing in deciding whether there is adequate evidence of a disparity, we believe that detection of disparities does not warrant the same level of scrutiny as detection of differences in other covariates or outcomes (ie, P < .05), because committing a Type II error can erroneously exclude the presence of a disparity and miss an important opportunity to improve disparate outcomes for minoritized populations. Teams and stakeholders should discuss the trade-offs of committing Type I versus Type II errors and decide how much of each error they are willing to tolerate before settling on a specific threshold of significance for the P value. These discussions also provide an opportunity to decide what constitutes a meaningful difference between groups, paying close attention to the lived experience of the group experiencing the disparity.

P values are driven, in part, by how much data are available (ie, the sample size). The larger the sample size, the more likely that small differences will be deemed statistically significant. We illustrate this principle by comparing mortality rates between Black and white patients using samples of different sizes. In this example, mortality is 1.03% in Black patients and 1.00% in white patients. We want to test the following hypotheses:

H0: mortality in Black patients = mortality in white patients (ie, no disparity present).

H1: mortality in Black patients ≠ mortality in white patients (ie, disparity present).

We randomly create 100 data sets with 1000 Black patients (mortality = 1.03%) and 1000 white patients (mortality = 1.00%). For each of these 100 data sets, we run a χ2 test to decide whether we should reject H0 (ie, P < .05) or not. With only 1000 patients in each group, P < .05 only 6 out of 100 times (ie, we would reject H0 in only 6 of the 100 simulations). However, if we increase the size of the data sets to 100 000 Black patients and 100 000 white patients using the same mortality percentages, P < .05 in 59 of 100 instances. So, as the sample size increased, we were much more likely (59% vs 6%) to conclude that there was a disparity, even though the disparity is quite small (1.03% vs 1.00%).
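The same principle can be shown deterministically with a variation on the simulation above: hold the observed mortality percentages fixed at 1.03% and 1.00% and rerun the χ2 test as the number of patients per group grows. The specific group sizes below are illustrative and differ from those used in the simulated data sets.

```python
# A sketch of the sample-size effect: identical observed proportions, but
# the P value shrinks as the number of patients grows. Group sizes are
# illustrative, not the article's simulated data sets.
from scipy.stats import chi2_contingency

for n in (10_000, 1_000_000):                 # patients per group
    deaths_black = round(n * 0.0103)          # observed mortality 1.03%
    deaths_white = round(n * 0.0100)          # observed mortality 1.00%
    table = [[deaths_black, n - deaths_black],
             [deaths_white, n - deaths_white]]
    _, p, _, _ = chi2_contingency(table)
    print(f"n = {n:>9,} per group: P = {p:.3f}")
# The 0.03-percentage-point difference is unchanged; only the larger sample
# reaches "statistical significance" at P < .05.
```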

When establishing “significance” in disparities, it is important to meaningfully engage patient, family, and/or community stakeholders from the disenfranchised groups being evaluated. The P value should not be the final answer. Rather, the magnitude and seriousness of the disparity should be contextualized by the people who best understand the lived experience and are most directly affected by the disparity. For example, when considering birth outcomes, what does a disparity of .01 days in length of stay mean to a Latino family that requires interpreter services, even though a study with a very large sample size has deemed it statistically significant? A longer length of stay may be expected in order to provide the appropriate level of interpretation throughout the hospitalization and at discharge. As another example, what does a P value of .10 mean for a disparity affecting Black children in a rare but serious outcome (eg, infant mortality or central line-associated bloodstream infections) when there are not enough data to establish statistical significance at a stricter threshold? The lived experience of patients or family members from those groups can help to establish context and meaning as they relate to understanding impact.

When reporting disparities, presenting results as odds ratios, rate ratios, or effect sizes may be more informative, and the precision of these estimates can be conveyed with confidence intervals rather than P values alone. Additionally, we believe it is important to quantify the magnitude of disparities in terms of actual counts, in addition to the percentages, averages, and rates typically reported. In our example, this means expressing the mortality disparity as the number of additional Black patients who died who would not have died if they were white. A simple calculation achieves this: (proportion of Black patients who died − proportion of white patients who died) × number of Black patients.
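A minimal sketch of this calculation appears below, using the mortality percentages from the example above and a hypothetical group size; the rate ratio confidence interval uses a standard log-scale normal approximation, and equal group sizes are assumed for simplicity.

```python
# Sketch: express the disparity as a count of excess deaths, then as a
# rate ratio with a 95% CI. The group size of 100 000 is hypothetical.
from math import exp, log, sqrt

n_per_group = 100_000                 # hypothetical size of each group
p_black, p_white = 0.0103, 0.0100     # mortality proportions from the example

# Excess deaths: (proportion Black - proportion white) x number of Black patients.
excess_deaths = (p_black - p_white) * n_per_group
print(f"Excess deaths among Black patients: {excess_deaths:.0f}")  # -> 30

# Rate ratio with a 95% CI (normal approximation on the log scale).
deaths_b = round(p_black * n_per_group)   # 1030 deaths
deaths_w = round(p_white * n_per_group)   # 1000 deaths
rr = p_black / p_white
se_log_rr = sqrt(1 / deaths_b - 1 / n_per_group + 1 / deaths_w - 1 / n_per_group)
lo, hi = exp(log(rr) - 1.96 * se_log_rr), exp(log(rr) + 1.96 * se_log_rr)
print(f"Rate ratio: {rr:.2f} (95% CI, {lo:.2f}-{hi:.2f})")
```

With these hypothetical numbers, the disparity corresponds to roughly 30 excess deaths even though the confidence interval for the rate ratio crosses 1, illustrating why counts and context matter alongside significance tests.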

Assessing disparities requires us to carefully evaluate the role of statistical tests. We recommend that teams avoid relying solely on P values to decide whether disparities exist. Although a P value can be a useful metric for summarizing experimental results, teams should consider what differences may be meaningful for minoritized populations. Statistical tests can be underpowered (often because of insufficient numbers of nonmajority patients) or overpowered on the basis of the volume of data available, so P values need to be interpreted carefully in the context of the magnitude and severity of the observed differences among groups. We recommend using statistical tests and the resulting P values as a compass pointing in a suggested direction, but not as the final word.

In addition, consider increasing the threshold of significance from an overly conservative P < .05 to at least P < .10, thus reducing the likelihood of committing a Type II error and overlooking true disparities. For quality improvement, an even higher threshold may be considered to make sure that disparities are not overlooked and that appropriate action is taken. Regardless of the threshold chosen, teams should discuss the trade-offs of controlling Type I (overcalling disparities) versus Type II (undercalling disparities) errors, particularly for minoritized populations, before deciding on a threshold of significance.

Finally, clearly presenting the results of disparities analyses is critical. Although presenting statistics such as odds ratios, rate ratios, or effect sizes (all with appropriate confidence intervals) is important, we also recommend calculating the actual number of people affected by a disparity to make the information more patient-centered, representative of the impact on specific groups, and real to the consumer. It may be important to report meaningful differences even when there is no “statistical significance.”

FUNDING: No external funding.

CONFLICT OF INTEREST DISCLAIMER: The authors have indicated they have no conflicts of interest relevant to this article to disclose.

Dr Hall conceptualized and designed the study, drafted the initial manuscript, and critically reviewed, revised, and approved the final manuscript; Drs Tieder, Richardson, Parikh, and Shah assisted in the study design and critically reviewed, revised, and approved the final manuscript; all authors agree to be accountable for all aspects of the work.

1. Chin MH. Advancing health equity in patient safety: a reckoning, challenge and opportunity [published online ahead of print December 29, 2020]. BMJ Qual Saf. 2020. doi:10.1136/bmjqs-2020-012599

2. Braveman PA, Kumanyika S, Fielding J, et al. Health disparities and health equity: the issue is justice. Am J Public Health. 2011;101(Suppl 1):S149–S155

3. McShane BB, Gal D, Gelman A, Robert C, Tackett JL. Abandon statistical significance. Am Stat. 2019;73(Suppl 1):235–245

4. Hurlbert SH, Levine RA, Utts J. Coup de grâce for a tough old bull: “statistically significant” expires. Am Stat. 2019;73(Suppl 1):352–357

5. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–307

6. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “P < .05.” Am Stat. 2019;73(Suppl 1):1–19

7. Altman N, Krzywinski M. Interpreting P values. Nat Methods. 2017;14(3):213–214

8. Harrington D, D’Agostino RB Sr, Gatsonis C, et al. New guidelines for statistical reporting in the Journal. N Engl J Med. 2019;381(3):285–286