“Everything is related to everything else, but near things are more related than distant things,” Waldo Tobler (1970)
Our patients live near others who are demographically similar to them and typically access resources (eg, stores, health care) that are close to them. Their behavior, health, and life expectancy can be measured locally and are impacted by the environment of their geographic community, including but not limited to structural and social determinants of health (SDOH) that vary dramatically by neighborhood.1 Therefore, in addition to studying health outcomes among individual children, it is meaningful to spatially study pediatric populations to understand how community characteristics are associated with health outcomes that vary by place (location). The former informs the implementation of effective individual-level interventions, whereas the latter informs community-level solutions.
Studying people by virtue of where they live is a basic tenet of epidemiologic research. The concept of geocoding, a centuries-old technique beginning with John Snow’s mapping of cholera deaths to prevent further spread, is the basis of this population-level work and is an important and necessary first step for individuals seeking to contextualize health outcomes with community-level data. Geocoding allows for community characteristics to contextualize our patients’ health, similar to vital signs.2 These “community vital signs” enhance what we know about our patients and can enrich individual-level work.2 The knowledge of where our patients live is critical to any research, clinical, or community engagement efforts seeking to understand associations between communities and health.
What is Geocoding and Why is it Important for Geospatial Analysis?
Geocoding refers to the process of converting a person or place’s street address into spatial data. Spatial data can then be displayed as features on a map. It is a technique that falls under the broader research methodology of geospatial analysis. Beyond visualization, geocoding helps advance cartography (conception, production, dissemination, and study of maps), geostatistics (analyzing and predicting spatial and spatiotemporal phenomena), and spatial analysis (evaluating and modeling spatial data). Importantly, geocoded data can be used to visualize the distribution of diseases over time and throughout space, as well as the density of resources in communities and their proximity to a specific patient. By facilitating the identification of geographic clusters of disease and assessing the effectiveness of public health interventions, geocoding has become an important tool for understanding the complex associations between the environment (both physical and social) and health outcomes.
Table 1 outlines the steps of geocoding. The first step is to clean and validate a patient’s address. Next, the validated address is applied to a reference spatial database and matched to geographic coordinates (ie, latitude and longitude). Using these coordinates, the address is then projected as a point on a map. Points can then be linked to the larger administrative boundaries to which an address belongs, such as census tracts or zip codes.3 Once geocoded, advanced techniques within spatial software programs like ArcGIS (ESRI, Redlands, California) or GeoDa (Luc Anselin) allow users to apply various tools to analyze the data. Analysis may include calculating a patient’s commuting distance to a hospital or identifying statistically significant clusters of patients with a particular disease. Table 2 provides a summary of commonly used geospatial analytic tools, the questions they can answer, the data needed, and their pitfalls/considerations.
Geocoding Steps
Step | Example |
---|---|
1. Decide if the question requires geocoding. | What are the SDOH of the community where my hospital is located? |
2. Extract the needed address(es). | Children’s National Hospital’s address: 111 Michigan Ave Northeast, Washington, DC 20010 |
3. Review the address for accuracy and format for the geocoding software as needed. | 111 Michigan Ave Northwest, Washington, DC 20010 |
4. Geocode and review results.ᵃ | [Image of geocoding results not shown] |
5. Import your base map layer and plot the newly geocoded address on this base map.ᵇ | [Image of base map with plotted address not shown] |
6. Identify layers of community data needed for the analysis. | There are myriad sources (in the form of spreadsheets or shapefiles) of SDOH data. In this example, we will use the Child Opportunity Index (https://www.diversitydatakids.org/), which is a composite neighborhood measure of SDOH. |
7. Enrich health data by adding a layer of community data. | This example shows a choropleth map of Washington, District of Columbia, where the census tracts are color-coded by quintile of the Child Opportunity Index (eg, red indicates very low opportunity). [Choropleth map image not shown] |
ᵃ The geocoder used in this example is Decentralized Geomarker Assessment for Multisite Studies (DeGAUSS) (Brokamp C, Wolfe C, Lingren T, Harley J, Ryan P. Decentralized and reproducible geocoding and characterization of community and environmental exposures for multisite studies. Journal of the American Medical Informatics Association. 2018;25(3):309–314).
ᵇ Maps in this example were created with ArcMap (Version 10.8.1). This base map layer is from Open Data DC (opendatadc.gov).
Examples of Frequently Used Geospatial Analytic Tools
Sample Question | Geospatial Tool to Use | Tool Explanation | Data Needed | Pitfalls/Considerations |
---|---|---|---|---|
Which neighborhoods have the highest asthma hospitalization at-risk rates? | Choropleth map | Uses different colors, or shades of colors, to demonstrate varying values | • Addresses of asthma hospitalizations • Defined geocoded boundaries (eg, neighborhoods, census tracts, zip codes) | • Count data may reflect bias. • When possible, normalize data by using a denominator. For this example, you could use the number of children with asthma in the same geocoded boundaries. |
Are there clusters of neighborhoods with elevated asthma prevalence? | Hot spot analysis | Identifies statistically significant clusters of high values (ie, hot spots), as well as low values (ie, cold spots) | • Addresses of children with asthma • Defined geocoded boundaries • Population counts of all children for geocoded boundaries | • To create rates, ideally, the denominator (populations of children) matches the age range of the numerator (children with asthma). • Small numbers for numerators and denominators can bias clustering. • Hot spot analysis can be done on continuous values (eg, rates) and individual geocoded addresses (eg, points). |
How close does my patient live to the hospital? | Near tool or point distance | Measures the distance from 1 feature (eg, point or polygon) to another | • Addresses of patients • Address of hospital | • Consider driving or commuting distance. • Consider public transportation. |
How many pharmacies are within 1 mile of my patient? | Buffer tool | Creates a buffer (ring) that specifies a certain distance around a feature (eg, a patient’s address) | • Addresses of patients • Addresses of pharmacies | • Consider that pharmacy accessibility is complicated by hours of availability. • Use distances for buffers that reflect accessibility for both walking and driving distances. |
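To make the last 2 rows of the table concrete, the following is a minimal Python sketch of a point distance and a buffer count. It is not how ArcGIS or GeoDa implement these tools. It assumes straight-line (great-circle) distance computed with the haversine formula, so it does not capture the driving, commuting, or public transportation considerations noted in the table. The function names and all coordinates are hypothetical and for illustration only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle (straight-line) distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_within_buffer(origin, places, radius_miles):
    """Count places whose straight-line distance from origin is within the radius."""
    return sum(
        1
        for lat, lon in places
        if haversine_miles(origin[0], origin[1], lat, lon) <= radius_miles
    )


# Hypothetical coordinates (latitude, longitude), not real patient data.
patient = (38.9300, -77.0150)
hospital = (38.9270, -77.0140)
pharmacies = [(38.9310, -77.0160), (38.9500, -77.0500), (38.9000, -77.0000)]

distance_to_hospital = haversine_miles(*patient, *hospital)
nearby_pharmacies = count_within_buffer(patient, pharmacies, radius_miles=1.0)
```

A straight-line buffer is the simplest choice and tends to overstate accessibility. Network-based travel distance or time would be closer to the lived experience but requires road or transit data.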
When to Consider Using Geocoding and Geospatial Analysis
For patients, place-based data may help optimize the decisions we make about their health. For communities, geocoding and geospatial analysis can help us to identify patterns and trends (eg, disease clustering or intervention planning), which in turn may reveal the disparate impact of policy on health and structural causes of health inequity. Although there is documented evidence of place-based disparities across all health outcomes, it is essential to be deliberate in our consideration of when it is appropriate and meaningful to use geospatial analysis as a tool to enhance research, clinical practice, and community engagement efforts.
Research
Geocoding and geospatial analysis can be used to understand the spatial distribution of health outcomes and how community factors may be related to these disparities. For example, our team applied geocoding to create at-risk rates for pediatric asthma morbidity (emergency department and hospitalization encounters divided by the population of children with asthma) for each census tract in Washington, District of Columbia.4 We found select census tracts with disproportionately high rates of pediatric asthma morbidity and associations with increased violent crime and decreased educational attainment.4 Patel et al applied geocoded community-level data with machine learning to predict the risk of hospitalization among children presenting to the emergency department with asthma.5 For additional examples, the Centers for Disease Control and Prevention’s “GIS Snapshots” is a valuable repository of studies using geocoding and various geospatial analysis methods.6 Results from this research can identify areas at higher risk of certain health issues, inform efficient allocation of resources, and help to focus interventions on specific populations.5 Findings from individual studies can be aggregated across different locations and synthesized to increase our understanding of diseases (see Tyris et al for a pediatric asthma morbidity example), which can inform policies to improve community SDOH or enrich care guidelines.7
Clinical Practice
Electronic medical records (EMRs) have the capability to geocode each patient’s address and link it to various neighborhood data and SDOH indices. Bazemore et al outlined a roadmap for how practitioners can use this information for real-time decision support to enhance clinical care and contextualize care plans.2 For example, a hospitalized patient living in an area with few, or low-quality, pharmacies may benefit from receiving medications from an onsite pharmacy or by delivery.8 Lindau et al also highlight how EMRs enabled with geocoding can facilitate the identification of nearby resources, which can be provided to patients and families during health care encounters to assist with social needs.9
Community Engagement
Geocoding transforms addresses into geographic coordinates, which facilitates the mapping of various community stakeholders, resources, and issues. Thus, geocoding can help identify neighborhoods with children who have increased morbidity and inform hospital–community partnerships that focus on interventions to improve health outcomes.10 Beck et al provide an exemplar of how understanding the geographic distribution of a health care system’s patient population (eg, neighborhoods with the highest rates of pediatric hospitalizations) can serve as the foundation for community engagement efforts and collaborative interventions that successfully improve health outcomes locally.10
It is also important to highlight that geocoded data and geospatial analysis can be incorporated into community health needs assessments (required every 3 years for 501[c][3] organizations). For example, our institution recently used the Child Opportunity Index (1 composite measure of place-based SDOH that is comprehensive, child focused, and geocoded at the census tract and zip code level).1 By understanding the communities that our institution primarily serves, the index was applied to visualize these communities’ strengths and weaknesses.11 These findings were reviewed and enhanced through community engagement activities to inform plans to optimize pediatric health.11
Practicalities to Consider for Geocoding and Geospatial Analysis
There are various practicalities to consider when incorporating geocoding into practice.
Software is required, and sometimes separate options are needed for geocoding data versus geospatial analysis. There are alternative options for institutions where EMRs have not implemented automated geocoding. Certain departments of health have location-specific geocoding software, whereas other geocoders available for public use (eg, degauss.org) are not unique to 1 location. Importantly, do not upload any patient data to free Web-based geocoders, because doing so risks breaching patient privacy. Software options to perform geospatial analysis can be accessed as open source (eg, GeoDa, R, or QGIS) or purchased for use by individuals, departments, or enterprises (eg, ArcGIS).
Geocoding requires street addresses. Street addresses are messy data, often pulled from the EMR with errors (eg, misspellings, incorrect zip code). Addresses must be reviewed, cleaned of errors, and possibly formatted before being geocoded. If not manually corrected, spelling or typographical errors can leave many addresses ungeocoded, which may introduce error or bias. Additionally, sourcing address data from the EMR assumes that the recorded address is where the child lived at the time of the health condition of interest. However, this may not be the case. It is important to know how and when those data were collected. For instance, the registration team may always enter the guarantor’s address into the EMR, the address may not be updated with each visit, or it may be a nonresidential address. There are also time lags between residential development (creation of a street address) and updates for reference spatial databases, especially in rural areas. These scenarios impact how accurately an address is geocoded. Matching accuracy can help identify these issues and is evaluated by using the “matching rate” (step 4 in Table 1) in geocoding software.
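The cleaning and matching-rate steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behavior of any particular geocoder: the abbreviation tables are deliberately tiny, the function names are hypothetical, and a geocoder result of `None` is assumed to mean "no match." Real geocoding software defines its own input format and its own match scoring.

```python
import re

# Illustrative abbreviation tables only; follow the formatting rules of the
# geocoder you actually use.
DIRECTIONALS = {"NORTHWEST": "NW", "NORTHEAST": "NE", "SOUTHWEST": "SW", "SOUTHEAST": "SE"}
SUFFIXES = {"AVENUE": "AVE", "STREET": "ST", "ROAD": "RD"}


def clean_address(raw):
    """Uppercase, drop punctuation (except hyphens), collapse whitespace,
    and standardize a few common street terms."""
    text = re.sub(r"[^\w\s-]", " ", raw.upper())
    tokens = [DIRECTIONALS.get(t, SUFFIXES.get(t, t)) for t in text.split()]
    return " ".join(tokens)


def match_rate(results):
    """Share of addresses that were matched, assuming None marks a non-match."""
    if not results:
        return 0.0
    matched = sum(1 for r in results if r is not None)
    return matched / len(results)


cleaned = clean_address("111 Michigan Avenue  Northwest, Washington, DC 20010")

# Hypothetical geocoder output: (latitude, longitude) or None when unmatched.
geocoded = [(38.927, -77.014), None, (38.930, -77.015), (38.905, -77.036)]
rate = match_rate(geocoded)
```

Standardizing text in this way cannot detect a wrong but valid address (eg, Northeast entered instead of Northwest), so manual review of a sample of addresses remains useful even when the matching rate is high.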
Correct cartographic techniques are essential for accurate location of mapped data. This includes using the correct coordinate system to accurately display a map (eg, World Geodetic System 1984) and to project geocoded data onto that map (eg, North American Datum 1983 for mapping data in North America).
Once addresses are geocoded and mapped, it is important to be intentional about how to best visualize the data. Small counts of point data (single addresses) for a given area often need to be suppressed using that area’s population denominator to protect patient privacy. Geomasking, a technique to displace data from its exact location while maintaining geographic associations, is another method to protect patient privacy when displaying point data.12
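Both privacy protections mentioned above can be illustrated with a short Python sketch. It assumes a simple count threshold for suppression and a "donut" style of geomasking, in which each point is moved a random distance (between a minimum and a maximum) in a random direction. The threshold, the distances, the function names, and the coordinates are all hypothetical, and the degree-to-meter conversion is a flat approximation that is only reasonable over short distances. Institutional policy should determine the actual values.

```python
import math
import random

METERS_PER_DEGREE_LAT = 111_320  # approximate


def suppress_small_counts(counts_by_area, threshold):
    """Replace counts below the threshold with None so they are not displayed."""
    return {
        area: (count if count >= threshold else None)
        for area, count in counts_by_area.items()
    }


def donut_mask(lat, lon, min_m, max_m, rng):
    """Displace a point by a random distance in [min_m, max_m] meters
    in a random direction (a simple form of donut geomasking)."""
    distance = rng.uniform(min_m, max_m)
    bearing = rng.uniform(0, 2 * math.pi)
    dlat = distance * math.cos(bearing) / METERS_PER_DEGREE_LAT
    dlon = distance * math.sin(bearing) / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + dlat, lon + dlon


# Hypothetical counts per area and a hypothetical threshold.
display_counts = suppress_small_counts({"tract_A": 42, "tract_B": 3}, threshold=10)

# Hypothetical point; a seeded generator makes the example repeatable.
rng = random.Random(0)
masked_lat, masked_lon = donut_mask(38.9300, -77.0150, min_m=100, max_m=500, rng=rng)
```

The minimum distance keeps a masked point from landing on the true address, and the maximum keeps it close enough to preserve neighborhood-level patterns. A masked point can still cross a boundary (eg, into an adjacent census tract), so any linkage to area-level data should be done before masking.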
For research using geocoding to link health and community data, there are additional considerations. First, advanced technology (geographic information systems and geospatial software), combined with the availability of community data linked to geographic boundaries, now allows researchers and clinicians to geocode and link relevant data more easily. For example, layers of high-quality spatial community data are often publicly available through websites for departments of health, the US Census Bureau, Esri ArcGIS, and local government agencies. Second, when health data are geocoded to a geographic boundary, 1 pitfall to avoid is the use of count data as an outcome measure. Instead, it is important to identify a denominator for that given geographic boundary and evaluate rates as the outcome. This method can help normalize outlier data (eg, patients with extremely high counts of encounters), as can accounting for multiple encounters per patient in analyses. Alternatively, outliers can be analyzed separately or excluded altogether to reduce bias.
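The difference between counts and rates is easy to see in a small worked example. The Python sketch below divides encounter counts by a population denominator for each geographic area and expresses the result per 1000 children. The function name, area labels, and numbers are hypothetical. The choice of "per 1000" is also an assumption made for readability.

```python
def rates_per_1000(encounters_by_area, population_by_area):
    """Encounters per 1000 people for each area with a usable denominator.

    Areas with a zero or missing denominator get None instead of a rate.
    """
    rates = {}
    for area, population in population_by_area.items():
        if not population or population <= 0:
            rates[area] = None
            continue
        encounters = encounters_by_area.get(area, 0)
        rates[area] = 1000 * encounters / population
    return rates


# Hypothetical data: both areas have the same count of asthma encounters,
# but very different numbers of children with asthma.
encounters = {"tract_A": 30, "tract_B": 30}
children_with_asthma = {"tract_A": 200, "tract_B": 1000, "tract_C": 0}

rates = rates_per_1000(encounters, children_with_asthma)
```

Here the 2 areas look identical on a map of counts, yet the rate in the first is 5 times the rate in the second, which is the pattern a count-based outcome would hide.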
Once health and community data are ascertained, it is important that they are geocoded to the same geographic boundary to be accurately linked together. For example, if using census tract community data, then addresses for the health data must be geocoded to census tract, as well, rather than zip code. Although no area measure reflects a completely homogeneous population, census tracts (or even smaller census block groups if the sample size per group is high enough) are preferred for population analyses because they were designed to reflect smaller and likely more homogeneous populations compared with zip codes.13 This is best summarized in Hogan et al’s recent commentary and illustrated by Krieger et al, who compared associations between health outcomes and community data among 3 different geographic areas: For certain health outcomes, associations at the zip code level were flipped (or nonexistent) compared with those at the census tract or census block group.14,15
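A minimal Python sketch of linking on a shared boundary follows. It relies on the US Census convention that a census tract GEOID has 11 digits (2 for state, 3 for county, 6 for tract) and that a block group GEOID adds 1 more digit, so a block group can be rolled up to its tract by keeping the first 11 digits. Zip codes do not nest within census tracts, so no comparable shortcut exists for them. The GEOID values, field names, and community labels below are hypothetical.

```python
def tract_geoid(geoid):
    """Return the 11-digit census tract GEOID from a tract or block group GEOID."""
    if len(geoid) not in (11, 12) or not geoid.isdigit():
        raise ValueError(f"unexpected GEOID: {geoid!r}")
    return geoid[:11]


def link_to_community(patients, community_by_tract):
    """Attach tract-level community data to each patient record.

    Records whose tract is absent from the community table get None,
    so unlinked records stay visible instead of being silently dropped.
    """
    linked = []
    for patient in patients:
        tract = tract_geoid(patient["geoid"])
        linked.append(
            {**patient, "tract": tract, "community": community_by_tract.get(tract)}
        )
    return linked


# Hypothetical records: one geocoded to a block group, one to a tract.
patients = [
    {"patient_id": "p1", "geoid": "110010023011"},
    {"patient_id": "p2", "geoid": "11001009902"},
]
# Hypothetical tract-level community measure.
community_by_tract = {"11001002301": "Very Low"}

linked = link_to_community(patients, community_by_tract)
unlinked = [row["patient_id"] for row in linked if row["community"] is None]
```

GEOIDs are kept as strings because some state codes begin with 0, and storing them as numbers would drop that leading digit and break the link.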
The Limitations of Geocoding and Geospatial Analysis
Geocoded data and geospatial analysis are fraught with population bias, area distortion, spatial variability, and data aggregation issues, which can result in misinterpretation or misleading conclusions. In addition to these factors, research using geospatial analysis is largely observational. Thus, 1 significant limitation of this work is the inability to infer causation from findings relating a patient’s community to their health outcomes.
A second, related limitation is that, whenever geocoding is used to link health- and place-based data, the bias of ecological fallacy must be considered.16 This concept refers to the possibility that aggregated administrative-level data may not accurately reflect the lived experience of each individual living within that community.16 This concept is well demonstrated by Gottlieb et al, who found discordance between community-level social disadvantage and patient-reported social risks.17 Using smaller areas (ie, census tracts) may help to mitigate this limitation. Some states provide neighborhood-level data, which may better reflect one’s lived experience. This consideration also emphasizes the importance of considering interventions and strategies that address health disparities at multiple, complementary levels (eg, individual patients and whole communities).
Conclusions
Geocoded health data are truly a “sixth” vital sign that should not be ignored.2 Knowing where our patients live facilitates our ability to visualize and contextualize our patients’ health with an understanding of their community and its characteristics. Apart from the universal adoption of EMR-based automation of geocoding patient data, there is a need to improve how we collect address data with time stamps and ask for the address where a patient lives or spends most of their time. Improving data linkages to community-informed neighborhood boundaries will also improve the accuracy of the assumptions made by geocoding. We anticipate that soon, it will be standard practice to include geocoded-linked data, both in research and in our day-to-day clinical care of patients. Thus, we suggest that geocoding be considered a uniquely powerful and necessary technique that is a gateway to myriad health disparities and health equity efforts, whether they are based in research, clinical practice, or community engagement.
Take-Home Points
Geocoding is a technique that facilitates the precision of spatial data and is a necessary first step for geospatial analysis or any work evaluating health disparities by linking health and location data.
Geocoding can be used in research, clinical practice, and community engagement.
Potential limitations of geocoding are related to accuracy of primary address data, temporal mismatches of patient and administration data, and ecological fallacy.
Dr Tyris conceptualized and drafted the initial manuscript, and revised and reviewed the manuscript; Ms Dwyer, and Drs Parikh, Gourishankar, and Patel conceptualized the manuscript, and critically revised and reviewed the manuscript; and all authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.
FUNDING: No external funding.
CONFLICT OF INTEREST DISCLOSURES: Dr Parikh is supported by grant 1R03HS028484-01A1 from the Agency for Healthcare Research and Quality and 1R01HL161665-01 from the National Institutes of Health/National Heart, Lung, and Blood Institute. Dr Tyris is supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development through the Pediatric Scientist Development Program. These do not present a conflict of interest for this work. The remaining authors have indicated they have no conflicts of interest relevant to this article to disclose.