Provider- and claims-focused administrative databases are powerful tools for conducting health services research, and studies based on them often have good generalizability owing to the diversity of hospitals from which samples are derived. In this research methods article, we describe administrative data and how available provider- and claims-focused administrative databases can be used to conduct health services research. We describe common observational study designs using administrative data and provide real-world examples. We highlight the strengths and weaknesses of studies conducted using administrative data and describe methodological considerations to reduce bias and improve the rigor of observational studies using administrative data. Finally, we provide guidance on the types of study questions suitable for observational study designs using administrative data.

The Agency for Healthcare Research and Quality (AHRQ) has defined health services research (HSR) as a “multidisciplinary field of scientific investigation that studies how social factors, financing systems, organizational structures and processes, health technologies, and personal behaviors affect access to health care, the quality and cost of health care, and ultimately, our health and well-being.”1  Health services researchers need standardized multi-institutional data to understand systems of care and improve the generalizability of research results. Administrative data sets are ideal for this purpose because of the breadth of included demographic, diagnostic, financial, and other data. However, researchers must appreciate that these data were not collected primarily for research. Consequently, while there is strength in the quantity of data, there can be limitations in data quality. Herein, we describe the importance, availability, uses, strengths, and limitations of administrative data for HSR.

In the context of health care, administrative data are the data collected by health systems primarily for the purpose of operations (eg, billing, administration, legal). These data can also be secondarily leveraged for other reasons, including research, benchmarking, and health care quality assessments.

Administrative data are generated for every health care encounter. When a patient interacts with a health care provider, data collected include demographics and payer information, clinical information from provider documentation (eg, diagnoses, procedures), resource utilization (eg, laboratory studies, diagnostic imaging, medication administration), and facility fees. National standards set by the National Uniform Billing Committee2 dictate the use of a common language to capture these data and send them to payers for reimbursement. These data are also sent to other stakeholders (eg, state health agencies, AHRQ, Children’s Hospital Association) for inclusion in administrative databases. Provider-focused administrative databases collect all encounters in a specific clinical setting (eg, emergency department, inpatient settings), regardless of the patient’s payer. Data sent to individual payers for reimbursement (eg, Medicaid) are a special type of administrative data, commonly referred to as “claims data.” Some of the most frequently used databases for HSR and a summary of available content are highlighted below and in Table 1.

TABLE 1.

Characteristics of Common Administrative Databases Used for Health Services Research.

Column groups: Availability; Access (a); Demographics; Clinical Characteristics; Hospital Characteristics; Financial Data; Additional Information
Columns: Age; Sex; Race/Ethnicity; Median Household Income; Geographic Level; Diagnosis and Procedure Codes; Severity and Comorbidity Measures; Dates of Service; Admission and Readmission Linkage; Observation Stays; Laboratory Testing; Diagnostic Imaging; Medication; Other Billed Resources; Hospital Type/Ownership; Teaching Status; Hospital Location; Bed Size; Hospital Charges; Claim Type; Claim Status
Provider-Focused Administrative Databases
AHRQ HCUP Databases
Kids' Inpatient Database
- All-payer, US inpatient database for children <21 y of age that allows for weighted national estimates
- Includes a 10% sample of normal newborn discharges and an 80% sample of complicated newborn and other pediatric discharges from AHA-designated community hospitalsb
1997–present approximately every 3 y Publicly available + ZIP code quartiles + NCHS code NA NA NA NA NA NA + Region NA NA Detailed information can be found at: https://www.hcup-us.ahrq.gov/db/nation/kid/kiddbdocumentation.jsp 
National Inpatient Sample
- All-payer, US inpatient database for all ages that allows for regional and national estimates
- A 20% sample of all discharges from AHA-designated community hospitalsb 
1988–present annual Publicly available + ZIP code quartiles + NCHS code NA NA NA NA NA NA + Census division/region NA NA Detailed information can be found at: https://www.hcup-us.ahrq.gov/db/nation/nis/nisdbdocumentation.jsp 
Nationwide Readmissions Database
- All-payer, US readmission database for all ages that allows for national readmission estimates
- Sample derived from individual patient encounters within American Hospital Association-designated community hospitalsb 
2010–present annual Publicly available NA + ZIP code quartiles + NCHS code NA NA NA NA NA + UIC NA NA Detailed information can be found at: https://www.hcup-us.ahrq.gov/db/nation/nrd/nrddbdocumentation.jsp 
State Inpatient Database
- All-payer inpatient database for all ages from individual US states
- A 100% sample of all discharges from AHA-designated community hospitalsb 
1990–present annual Publicly available + ZIP Code quartiles + NCHS code +d NA NA NA NA NA NA NA Detailed information can be found at: https://www.hcup-us.ahrq.gov/db/state/siddbdocumentation.jsp 
Other Provider-Focused Administrative Databases 
Pediatric Health Information System
- All-payer database that includes emergency department, inpatient, and observation discharges from participating tertiary and quaternary care US children's hospitals
- A 100% sample of all discharges from approximately 50 children's hospitals 
1992–presentc Proprietary, access for participating institutions + Zip Code NA + US census region NA NA Detailed information can be found at: https://www.childrenshospitals.org/content/analytics/product-program/pediatric-health-information-system
Premier Healthcare Database
- All-payer database that includes emergency department, inpatient, and observation discharges from participating US hospitals 
2000–presentc Proprietary, two-year licensing NA + US census region + US census region NA NA Detailed information can be found at: https://premierinc.com/about 
Claims-Focused Administrative Databases 
IBM MarketScan Research Databases
- Claims database that includes a large convenience sample of over 273 million unique patients since 1995
- Captures the continuum of care including ambulatory visits, hospital stays, pharmacy utilization, and carve-out care 
1995–present Proprietary, but publicly available NA NA +e NA NA NA NA NA NA NA NA Detailed information can be found at: https://www.ibm.com/products/marketscan-research-databases
Transformed Medicaid Statistical Information System
- Claims database of Medicaid and CHIP enrollees 
2014–present Publicly available + MSA +e NA NA NA Detailed information can be found at: https://www.medicaid.gov/medicaid/data-systems/macbis/transformed-medicaid-statistical-information-system-t-msis/index.html

AHRQ HCUP, Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project Databases; CHIP, Children's Health Insurance Program; MSA, metropolitan statistical area; NA, not available; NCHS, National Center for Health Statistics Urban Rural Classification System; UIC, urban influence codes.

a. Most databases require data use agreements and/or training prior to purchase and use.

b. AHA-designated community hospitals are defined as, “non-federal, short-term, general, and other specialty hospitals, excluding hospital units of institutions.” Veterans Hospitals and other federal facilities (Department of Defense and Indian Health Service) are excluded. Of note, the definition of community hospitals within the AHRQ HCUP databases has varied slightly over time. Specific details can be found on their website.

c. Data availability may lag by up to 6 months.

d. Some individual SID databases allow tracking of individuals across encounters.

e. Linkage may be limited by data availability (e.g., states contributing to MarketScan vary by year).

The AHRQ developed a suite of administrative databases as part of the Healthcare Cost and Utilization Project (HCUP). The HCUP databases are publicly available for purchase and include all payers but are focused on individual hospital settings or on aspects of utilization (eg, readmissions). The HCUP databases most relevant to pediatric inpatient research include the Kids’ Inpatient Database (KID), National Inpatient Sample, Nationwide Readmissions Database (NRD), and State Inpatient Database (SID). The KID, National Inpatient Sample, and NRD are samples of hospitalizations, but sampling weights are provided to enable generation of national estimates. Sampling weights correct for the over- or undersampling of different groups, improving the representativeness of samples to the broader United States population. The SID includes all hospitalizations that occur within a state. Each HCUP database generally includes demographic and hospital characteristics, diagnostic and procedure codes, and utilization data (eg, length of stay, charges) (Table 1). Notably, individual sampling strategies, frequencies of data release, and availability of data elements (eg, NRD does not publish race/ethnicity data) vary by database. The Children’s Hospital Association (Lenexa, KS), a membership organization of children’s hospitals, owns and operates the proprietary Pediatric Health Information System (PHIS), which includes all emergency department, ambulatory surgery, and inpatient/observation encounters within ∼50 tertiary and quaternary care referral children’s hospitals. PHIS includes the same data elements as the HCUP databases and additionally provides more detailed geographic and billing data, daily resource use (eg, laboratory and radiology charges), and the ability to track patients returning to the same hospital across years.
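As a rough illustration of how sampling weights such as those supplied with the KID, National Inpatient Sample, and NRD correct for unequal sampling, the sketch below applies weights to a handful of made-up discharge records. The field names, the records, and the simplifying assumption that each weight equals 1 divided by the sampling rate are hypothetical; actual HCUP weights are assigned within sampling strata, and the database documentation should be followed for real analyses.

```python
# Hypothetical sampled discharges. Each weight is the number of discharges in
# the full population that one sampled record stands for (1 / sampling rate).
sample = (
    [{"type": "normal_newborn", "weight": 10.0}] * 2      # sampled at 10%
    + [{"type": "other_pediatric", "weight": 1.25}] * 8   # sampled at 80%
)


def weighted_total(records):
    """Estimated number of discharges in the population."""
    return sum(r["weight"] for r in records)


def weighted_proportion(records, predicate):
    """Estimated population share of records satisfying `predicate`."""
    matching = sum(r["weight"] for r in records if predicate(r))
    return matching / weighted_total(records)


def is_newborn(record):
    return record["type"] == "normal_newborn"


# The unweighted share understates newborns because they were undersampled.
unweighted = sum(is_newborn(r) for r in sample) / len(sample)
weighted = weighted_proportion(sample, is_newborn)

print(f"Estimated population discharges: {weighted_total(sample):.0f}")
print(f"Newborn share, unweighted: {unweighted:.2f}; weighted: {weighted:.2f}")
```

This sketch covers point estimates only. Standard errors for weighted estimates depend on the survey design and are not addressed here.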

Other important provider-focused databases include the Premier Healthcare Database, which is similar to PHIS, but includes both adult and pediatric encounters drawn from over 1100 facilities in the United States. Additionally, numerous commercial electronic health records (EHR; eg, Cerner, Epic, Intermountain Healthcare, Kaiser Permanente) allow for individual hospital data analysis and in some instances multisite data analysis (eg, Cerner Health Facts). EHR-based databases, while generally more challenging to work with due to lack of standardization, can provide more robust clinical data and allow for tracking individuals between inpatient and outpatient settings.

Claims-focused databases collect all claims made to a specific payer regardless of the clinical setting. Claims-focused databases are advantageous for investigations across health care settings (eg, hospital and clinic) and investigations where detailed payment and/or retail pharmacy data are needed. The IBM MarketScan Research Databases (Commercial, Medicare, and Multi-State Medicaid) include health data from a sample of over 273 million patients across the continuum of care, including inpatient, emergency department, ambulatory, and pharmacy visits. Patients can be tracked across clinical settings and years using a unique enrollee ID. Another claims-focused administrative database, the Transformed Medicaid Statistical Information System, includes information on beneficiary and provider enrollment, service utilization, claims, and expenditure data for Medicaid and the Children’s Health Insurance Program (CHIP). Individual state-based all-payer claims databases also may be a potential data source for HSR, though availability varies by state.3 
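To make the idea of tracking patients across settings with a unique enrollee ID more concrete, the following minimal sketch groups claims by enrollee and checks whether a hospital stay was followed by an ambulatory visit within a chosen window. The field names (`enrollee_id`, `setting`, `service_date`), the claims, and the 14-day window are illustrative assumptions and do not reflect the actual layout of any claims database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claims; real claims files use their own variable names.
claims = [
    {"enrollee_id": "A1", "setting": "inpatient", "service_date": date(2020, 1, 5)},
    {"enrollee_id": "A1", "setting": "ambulatory", "service_date": date(2020, 1, 12)},
    {"enrollee_id": "B2", "setting": "inpatient", "service_date": date(2020, 3, 1)},
    {"enrollee_id": "B2", "setting": "ambulatory", "service_date": date(2020, 5, 1)},
]


def encounters_by_enrollee(rows):
    """Group claims by enrollee and order each history by service date."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["enrollee_id"]].append(row)
    for history in grouped.values():
        history.sort(key=lambda row: row["service_date"])
    return dict(grouped)


def had_followup_within(history, index_setting, followup_setting, days):
    """True if an index encounter is followed by the other setting within `days`."""
    for i, first in enumerate(history):
        if first["setting"] != index_setting:
            continue
        for later in history[i + 1:]:
            gap = (later["service_date"] - first["service_date"]).days
            if later["setting"] == followup_setting and gap <= days:
                return True
    return False


histories = encounters_by_enrollee(claims)
followup = {
    enrollee: had_followup_within(history, "inpatient", "ambulatory", 14)
    for enrollee, history in histories.items()
}
print(followup)
```

The same grouping step would not be possible in a database that lacks a patient identifier persisting across encounters.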

TABLE 2

Take Home Points.

1. Administrative data are collected by hospitals and health systems primarily for purposes other than research (e.g., billing), though they are a rich resource for health services researchers.
2. A variety of provider-focused and claims-focused administrative databases exist. Sampling and available data elements vary across each database. 
3. Observational study designs (e.g., cross-sectional, case-control, and cohort) are typically used for investigations using administrative data. 
4. Researchers need to be aware of the strengths (e.g., fast, inexpensive, good generalizability, ability to assess rare conditions) and limitations (e.g., risk of misclassification bias, limited covariates) inherent to administrative database studies.
5. No administrative database is perfect. Research questions and/or database selection should be tailored to best address the research objectives. 

Investigations utilizing administrative data almost exclusively use observational study designs. Observational study designs include both descriptive (ie, describing characteristics of a population) and analytic designs (ie, examining associations between variables in a population). The three major types of observational study designs are cross-sectional, case-control, and cohort studies. Among these three designs, cross-sectional studies generally are lowest on the hierarchy of evidence, with cohort studies having the highest quality of evidence.

Cross-sectional studies are used to describe characteristics of a population at a single moment in time or over a very short period. Cross-sectional studies can be used to assess prevalence, or the proportion of a population with a disease/outcome at a given moment. Due to their restricted timeframe, cross-sectional studies are not able to estimate incidence, or the proportion of a population that develops a disease/outcome over time. Examples of recent cross-sectional studies using administrative data include the examination of outcomes among children’s hospitals with and without dedicated observation units using PHIS4  and an examination of pediatric hospitalizations in the United States using KID.5 

Case-control studies are analytic studies in which the investigator identifies a sample of patients with the disease/outcome of interest (ie, cases) and a sample without the disease/outcome (ie, controls). The investigator then assesses for differences in the predictor variables between these two populations to identify which predictors are associated with development of the disease/outcome. Case-control studies are especially helpful for investigations of rare diseases. Examples of case-control studies include one using PHIS to understand the effects of diabetes and obesity on outcomes from cardiac surgery6  and another one using MarketScan to examine health care costs among children with anorectal malformations.7 
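The comparison of predictor frequency between cases and controls is commonly summarized as an odds ratio. The sketch below computes an odds ratio with an approximate 95% confidence interval (Wald method on the log scale) from a 2x2 table. The counts are invented for illustration and do not come from any study cited here.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio and approximate 95% CI from a 2x2 case-control table."""
    estimate = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls
    )
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(
        1 / exposed_cases
        + 1 / unexposed_cases
        + 1 / exposed_controls
        + 1 / unexposed_controls
    )
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


# Hypothetical counts: 100 cases (30 exposed) and 100 controls (15 exposed).
estimate, lower, upper = odds_ratio(30, 70, 15, 85)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The formula assumes no cell count is zero; sparse tables need a different approach.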

Cohort studies are analytic studies in which a group of individuals is identified from the broader population based on the presence or absence of an exposure of interest. Cohort studies can be prospective (ie, a group of individuals is recruited, baseline measures are obtained, and the group is followed into the future for the development of a disease/outcome) or retrospective (ie, the exposure/predictors and disease/outcome status of a group of individuals have already been collected, and associations between these factors are examined). While administrative data can be leveraged for prospective designs, retrospective cohort designs are more common. Since cohort studies examine groups over time, incidence can be estimated along with relative risk. Examples of cohort studies include the assessment of the contribution of children with medical complexity to pediatric hospitalizations using KID8  and an examination of discordant antibiotic choice for urinary tract infection and length of stay using PHIS.9 
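The incidence and relative risk mentioned above can be worked through with a small example. The sketch assumes a closed cohort with complete follow-up, so that incidence is simply the number of new events divided by the number at risk; the counts are hypothetical.

```python
def cumulative_incidence(events, at_risk):
    """Proportion of an at-risk group that develops the outcome during follow-up."""
    return events / at_risk


def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of cumulative incidence in the exposed vs the unexposed group."""
    return cumulative_incidence(events_exposed, n_exposed) / cumulative_incidence(
        events_unexposed, n_unexposed
    )


# Hypothetical retrospective cohort: 200 exposed and 500 unexposed patients.
risk_exposed = cumulative_incidence(40, 200)
risk_unexposed = cumulative_incidence(50, 500)
rr = relative_risk(40, 200, 50, 500)

print(f"Incidence, exposed: {risk_exposed:.2f}; unexposed: {risk_unexposed:.2f}")
print(f"Relative risk: {rr:.2f}")
```

With these counts the exposed group has twice the cumulative incidence of the unexposed group.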

Administrative databases contain a plethora of standardized patient-level information, often derived from a diverse set of patients, all collected as part of ongoing hospital operations. Consequently, observational studies using administrative data are generally fast, inexpensive, and have improved generalizability compared with study designs utilizing primary data collection. Importantly, administrative data can enhance the ability to evaluate rare conditions that may be difficult to investigate within a single institution. Since administrative databases contain numerous encounters, sample size insufficiency is generally not of concern, unless examining a very rare outcome.

Several important limitations may curtail the strength of evidence in study designs utilizing administrative data. Misclassification bias (ie, incorrect assignment of patient population, exposure, or outcome) is a primary concern in designing studies using administrative data.10  Utilizing validated definitions and/or performing primary chart review on a sample of the population can reduce the impact of misclassification bias. Investigators have limited control over sampling and the availability of data elements. Important data such as vital signs, exam findings, and diagnostic results are frequently unavailable. Designing studies using administrative data requires a thorough understanding of available data elements and consideration of whether important covariates can be accounted for. Researchers should carefully consider and acknowledge how bias in the results may be introduced by these issues and the direction of the bias (ie, toward or away from the null hypothesis).
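One way to gauge misclassification is to compare a code-based definition against chart review in a sample of patients. The sketch below tallies agreement and reports positive predictive value, sensitivity, and specificity. The record structure and counts are hypothetical, and chart review is treated as the reference standard purely for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return None instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else None


def validation_metrics(reviewed):
    """Compare code-based classification with chart review (reference standard)."""
    tp = sum(1 for r in reviewed if r["code"] and r["chart"])
    fp = sum(1 for r in reviewed if r["code"] and not r["chart"])
    fn = sum(1 for r in reviewed if not r["code"] and r["chart"])
    tn = sum(1 for r in reviewed if not r["code"] and not r["chart"])
    return {
        "ppv": safe_ratio(tp, tp + fp),          # coded patients who truly have it
        "sensitivity": safe_ratio(tp, tp + fn),  # true patients captured by codes
        "specificity": safe_ratio(tn, tn + fp),  # unaffected patients left uncoded
    }


# Hypothetical chart-review sample of 100 encounters.
reviewed = (
    [{"code": True, "chart": True}] * 45
    + [{"code": True, "chart": False}] * 5
    + [{"code": False, "chart": True}] * 15
    + [{"code": False, "chart": False}] * 35
)

metrics = validation_metrics(reviewed)
print(metrics)
```

Estimating sensitivity and specificity requires reviewing charts of patients without the code as well as those with it.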

Augmenting administrative data with chart review, other administrative databases, or data registries can supply otherwise unavailable covariates and help validate the accuracy of billing codes.11–15 However, for these techniques to be successful, the databases need to have common data elements on which the files can be merged directly (eg, medical record numbers) or indirectly (eg, using patient characteristics). Such linkage may also require additional Institutional Review Board considerations beyond what is typically afforded to projects using administrative data only.
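The direct and indirect merging described above can be sketched as a two-step lookup: match on a shared identifier first, then fall back to a combination of patient characteristics. The field names (`mrn`, `birth_date`, `sex`, `admit_date`, `registry_id`) and the records are hypothetical, and the choice of characteristics for the indirect key is an assumption for illustration only.

```python
from collections import Counter


def trait_key(row):
    """Indirect linkage key built from patient characteristics."""
    return (row["birth_date"], row["sex"], row["admit_date"])


def link_records(admin_rows, registry_rows):
    """Link administrative rows to registry rows, directly when possible."""
    by_mrn = {r["mrn"]: r for r in registry_rows if r.get("mrn")}
    # Keep only indirect keys that identify exactly one registry row.
    counts = Counter(trait_key(r) for r in registry_rows)
    by_traits = {
        trait_key(r): r for r in registry_rows if counts[trait_key(r)] == 1
    }
    linked = []
    for row in admin_rows:
        match, method = by_mrn.get(row.get("mrn")), "direct"
        if match is None:
            match, method = by_traits.get(trait_key(row)), "indirect"
        if match is not None:
            linked.append({**row, **match, "link_method": method})
    return linked


admin = [
    {"mrn": "100", "birth_date": "2015-03-02", "sex": "F",
     "admit_date": "2020-01-10", "charges": 12000},
    {"mrn": None, "birth_date": "2012-07-19", "sex": "M",
     "admit_date": "2020-02-01", "charges": 8000},
    {"mrn": "300", "birth_date": "2018-11-30", "sex": "F",
     "admit_date": "2020-03-15", "charges": 5000},
]
registry = [
    {"mrn": "100", "birth_date": "2015-03-02", "sex": "F",
     "admit_date": "2020-01-10", "registry_id": "R1"},
    {"mrn": None, "birth_date": "2012-07-19", "sex": "M",
     "admit_date": "2020-02-01", "registry_id": "R2"},
]

linked = link_records(admin, registry)
for row in linked:
    print(row["registry_id"], row["link_method"])
```

Indirect keys shared by more than one registry row are discarded here rather than guessed at, which trades some linkage yield for fewer false matches.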

Careful consideration of an appropriate analytic approach is required to draw meaningful conclusions from administrative database studies. Although these studies can establish associations between predictor and outcome variables, the ability to derive causal inferences is limited. Statistical strategies, including modeling, that account for important confounding variables can improve the strength of causal inferences.
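As one illustration of accounting for a confounder (chosen here as an example; the article does not prescribe a specific method), the sketch below compares a crude odds ratio with a Mantel-Haenszel odds ratio pooled across strata of a confounding variable. The counts are invented so that the exposure has no association with the outcome within either stratum, yet the crude analysis suggests a strong one.

```python
def crude_odds_ratio(tables):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio across strata, each weighted by its size."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Each table: (exposed with outcome, exposed without outcome,
#              unexposed with outcome, unexposed without outcome).
strata = [
    (10, 90, 40, 360),    # hypothetical low-risk stratum, few exposed
    (200, 200, 50, 50),   # hypothetical high-risk stratum, mostly exposed
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"Crude OR: {crude:.2f}; stratum-adjusted OR: {adjusted:.2f}")
```

Stratification can only address confounders that are recorded in the database, which is why the availability of covariates matters when choosing a data source.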

There are no perfect databases. Since administrative data are not collected primarily for research, choosing a database for a research project requires a careful assessment of the research questions to get as close as possible to addressing them. This may mean that research questions will need to be adjusted to be answerable with the available data. While not always ideal, addressing these revised (and often simpler) questions may provide essential information to motivate future investigations with more sophisticated (and costly) study designs (eg, clinical trials).

When selecting a database, some important considerations include whether linked inpatient and outpatient data are needed and, if so, what type of outpatient data are required (eg, emergency department, ambulatory clinic visits, or both). As outpatient data are not typically available in provider-based databases (with some exceptions), these studies often require claims-based databases, which include data across the continuum of care.

Another important consideration is how much detail is needed regarding inpatient resources. The HCUP databases and most claims-based databases provide procedures performed, length of stay, and charges but generally lack information regarding diagnostic testing (eg, laboratory tests, diagnostic imaging) and other billed treatments. If the research question requires understanding diagnostic test utilization and treatments provided to patients, PHIS or the Premier Healthcare Database may be good options.

Yet another consideration is whether population estimates are needed. Understanding the sampling strategy of a database is essential to determine if population estimates can be derived. Databases such as PHIS allow for a detailed description of care delivered at freestanding children’s hospitals; however, an estimated 75% of children receive care in other settings (eg, nonfreestanding children’s hospitals). Consequently, deriving population estimates for all United States children using PHIS data would result in erroneous conclusions. Provider-focused databases such as KID and NRD have specifically been developed to include sampling weights for deriving population estimates.
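The three considerations above (linked outpatient data, detailed inpatient resource data, and population estimates) can be encoded as a simple filter. The sketch below is a deliberately coarse simplification of the preceding paragraphs; the feature labels are hypothetical, each database is tagged only with the capability the text attributes to it, and actual selection should rely on each database's documentation.

```python
# Simplified capability tags drawn from the discussion above (an assumption
# for illustration, not a complete description of any database).
DATABASES = {
    "KID": {"population_estimates"},
    "NIS": {"population_estimates"},
    "NRD": {"population_estimates"},
    "PHIS": {"detailed_inpatient_resources"},
    "Premier": {"detailed_inpatient_resources"},
    "MarketScan": {"linked_outpatient_data"},
    "T-MSIS": {"linked_outpatient_data"},
}


def candidate_databases(needs):
    """Return databases whose tagged capabilities cover every stated need."""
    return sorted(name for name, features in DATABASES.items() if needs <= features)


national = candidate_databases({"population_estimates"})
both = candidate_databases({"population_estimates", "detailed_inpatient_resources"})

print("National estimates:", national)
print("National estimates plus detailed resources:", both)
```

An empty result for a combined set of needs reflects the point made earlier that the research question may have to be adjusted to fit the available data.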

FUNDING: No external funding.

Dr Markham drafted the initial manuscript. Drs Markham, Hall, Stephens, Richardson, and Gay all reviewed and revised the manuscript, and approved the final manuscript as submitted.

1. Agency for Healthcare Research and Quality. An Organizational Guide to Building Health Services Research Capacity.
2. National Uniform Billing Committee. Available at: https://www.nubc.org. Accessed March 26, 2022
3. APCD Council. Interactive State Report Map. Available at: https://www.apcdcouncil.org/state/map. Accessed March 26, 2022
4. Macy ML, Hall M, Alpern ER, et al. Observation-status patients in children’s hospitals with and without dedicated observation units in 2011. J Hosp Med. 2015;10(6):366–372
5. Leyenaar JK, Ralston SL, Shieh MS, Pekow PS, Mangione-Smith R, Lindenauer PK. Epidemiology of pediatric hospitalizations at general hospitals and freestanding children’s hospitals in the United States. J Hosp Med. 2016;11(11):743–749
6. Shamszad P, Rossano JW, Marino BS, Lowry AW, Knudson JD. Obesity and Diabetes Mellitus Adversely Affect Outcomes after Cardiac Surgery in Children’s Hospitals. Congenit Heart Dis. 2016;11(5):409–414
7. Rollins MD, Bucher BT, Wheeler JC, Horns JJ, Paudel N, Hotaling JM. Healthcare Burden and Cost in Children with Anorectal Malformation During the First 5 Years of Life. J Pediatr. 2022;240:122–128.e2
8. Berry JG, Ash AS, Cohen E, Hasan F, Feudtner C, Hall M. Contributions of Children With Multiple Chronic Conditions to Pediatric Hospitalizations in the United States: A Retrospective Cohort Analysis. Hosp Pediatr. 2017;7(7):365–372
9. Jerardi KE, Auger KA, Shah SS, et al. Discordant antibiotic therapy and length of stay in children hospitalized for urinary tract infection. J Hosp Med. 2012;7(8):622–627
10. Hall M, Attard TM, Berry JG. Improving Cohort Definitions in Research Using Hospital Administrative Databases-Do We Need Guidelines? JAMA Pediatr. 2022
11. Godown J, Thurm C, Dodd DA, et al. A unique linkage of administrative and clinical registry databases to expand analytic possibilities in pediatric heart transplantation research. Am Heart J. 2017;194:9–15
12. Aronson PL, Williams DJ, Thurm C, et al; Febrile Young Infant Research Collaborative. Accuracy of diagnosis codes to identify febrile young infants using administrative data. J Hosp Med. 2015;10(12):787–793
13. Aplenc R, Fisher B, Huang Y, et al. Merging of the National Cancer Institute-funded cooperative oncology group data with an administrative data source to develop a more effective platform for clinical trial analysis and comparative effectiveness research: a report from the Children’s Oncology Group. Pharmacoepidemiol Drug Saf. 2012;21(suppl 2):37–43
14. Tieder JS, Sullivan E, Stephans A, et al; Brief Resolved Unexplained Event Research and Quality Improvement Network. Risk Factors and Outcomes After a Brief Resolved Unexplained Event: A Multicenter Study. Pediatrics. 2021;148(1):e2020036095
15. Keren R, Shah SS, Srivastava R, et al; Pediatric Research in Inpatient Settings Network. Comparative effectiveness of intravenous vs oral antibiotics for postdischarge treatment of acute osteomyelitis in children. JAMA Pediatr. 2015;169(2):120–128

Competing Interests

POTENTIAL CONFLICT OF INTEREST: Troy Richardson and Matt Hall are employed by Children’s Hospital Association, the proprietor of the Pediatric Health Information System database.

FINANCIAL DISCLOSURE: The authors have no financial relationships relevant to this article to disclose.