Brief Notes on the Modifiable Areal Unit Problem (MAUP) in Spatial Analysis: The Case of the Zip Code

[Note: Here, I refer to “zip codes” as used in the U.S. This does not apply to zip codes in the E.U. Thanks to Michiel Meeteren for prompting this clarification!]

 

So this September 2016 article by Prof. Richard Casey Sadler (Michigan State University, Division of Public Health) is making the rounds again on Twitter. Entitled “How ZIP codes nearly masked the lead problem in Flint,” it offers an instructive case study of how our units and scales of analysis matter and have consequences, sometimes life-altering or life-shortening ones.

The problem with zip codes

I had previously tweeted threads about why zip codes are not suitable as areal units of analysis, which boiled down to:

(1) zip codes are based on postal service routes, not neighborhoods as residents live/know them

(2) zip codes vary quite a bit in size and population density, making them unsuitable- to a degree- for comparative analyses

(3) year-to-year comparisons of zip codes are methodologically unsound b/c zip codes’ geographical areas change- e.g. road construction, demographic shifts (depopulation, population growth), or simply restructuring of postal service routes amid attacks on the public sector’s capacity

However, I never explicitly named the “problem.” In geography, we call it the “Modifiable Areal Unit Problem” or MAUP. In his 1984 paper entitled “The Modifiable Areal Unit Problem,” geographer Stan Openshaw observed that “the areal units (zonal objects) used in many geographical studies are arbitrary, modifiable, and subject to the whims and fancies of whoever is doing, or did, the aggregating.” He identified the MAUP as a type of ecological fallacy, wherein inferences drawn at one scale of analysis differ from inferences drawn at another. There are twin problems associated with the MAUP: (1) the scale or aggregation effect [transforming data from one areal unit to another (e.g. zip codes to census tracts, or census tracts to counties) can change statistical inferences b/c it may obscure or average out spatial relationships], and (2) the grouping or zoning effect [even at a fixed scale, where the boundaries are drawn matters, so two zoning schemes with similarly sized units can yield different results].
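To make the two effects concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the 12 values per variable are hypothetical units along a line (not real data), and the helper names `pearson` and `zone_means` are my own. It averages the same values into zones of different sizes (the scale effect), then shifts where the zone boundaries fall at a fixed size (the zoning effect), and computes the correlation each time.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def zone_means(values, size, offset=0):
    """Average consecutive units into zones of `size` units.

    `offset` shifts where the zone boundaries fall; leftover units at
    either end form their own, smaller zones.
    """
    starts = sorted({0, *range(offset, len(values), size)})
    ends = starts[1:] + [len(values)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(starts, ends)]

# Hypothetical values for 12 basic units along a line (NOT real data).
exposure = [1, 9, 2, 8, 3, 7, 4, 6, 5, 10, 2, 9]
outcome = [3, 6, 1, 7, 4, 9, 2, 5, 6, 8, 3, 7]

# (zone size, boundary offset) for each aggregation scheme.
schemes = {
    "12 units (raw)": (1, 0),
    "zones of 2": (2, 0),             # scale effect
    "zones of 3": (3, 0),             # scale effect
    "zones of 3, shifted by 1": (3, 1),  # zoning effect
}

results = {}
for label, (size, offset) in schemes.items():
    r = pearson(zone_means(exposure, size, offset),
                zone_means(outcome, size, offset))
    results[label] = r
    print(f"{label:>26}: r = {r:+.2f}")
```

The underlying units never change between runs; only the aggregation scheme does. Any difference in the printed coefficients therefore comes from the choice of areal unit, which is the MAUP in miniature.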

I agree with Dr. Sadler’s conclusion that we should not throw zip codes out with the bath water. They are, after all, still useful for local analyses (of second-order effects, such as spatial dependence), but we should recognize that smaller study areas come at a cost to spatial precision because they are more homogeneous. That’s not to say that larger areas are necessarily better! Larger study areas can have larger standard errors due to the effects of population size within the study area.

Background

First, I have to address the basics of spatial thinking and analysis in GIS. Zooming out, we can recognize that Geographic Information Systems (and Science) and spatial thinking are complementary ways to represent the world. In this case, we represent identified objects, phenomena, or processes with layers of spatial data. Spatial data come in 4 types, which are discussed in detail here:

  1. Point (or Event) – Zero-dimensional objects representing locations in space, given as longitude/latitude (x, y) coordinates with a spatial reference frame. Other terms for “point” include “vertex” and “node.”
  2. Network – vector lines that connect edges and junctions. Unlike early “spaghetti” models, network data always have a topology, or “a set of rules how points, lines and polygons share their geometry” [source and some useful lecture slides]. These rules can be summed as: connectivity (features connect or touch), adjacency (common boundaries), and contiguity (feature is contained within or between another or other features- e.g. a vector line between edges and junctions). Network data are often used to model traffic and water flows, or simply to represent roads, rivers, utility pipes, and more.
  3. Area/polygon – Two-dimensional objects, or areas bounded by vector lines. Human geographers often use standardized administrative areal units (which may be changed over time, as is the case with zip codes and PUMAs*- user beware!). Polygons can be used to represent features such as city boundaries, lakes, rivers, and much more. Polygons are implicitly or explicitly conceptualized as “containers,” which introduce complexities in our analyses as we aggregate finer scale or point data into areal units- including but not limited to “edge effects” and the Modifiable Areal Unit Problem (MAUP).
  4. Field/raster – a data model that uses cells or pixels with associated values to represent features in a scalar field where every object has a magnitude [source]

The first three are classified as “vector” data, while field data is also called “raster data.” Vector data formats are appropriate for discrete objects, while raster data typically represent continuous values, like precipitation, temperature, or elevation. That said, rasters are also commonly used to represent discrete or categorical values such as vegetation or land use classifications.

Now, for a bit of the theory that underpins spatial analysis. The first law of geography is as follows: “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). Here, Tobler referred to both (1) spatial dependence and (2) pattern and process. The first point is core to spatial statistics, because spatial data violate the traditional statistical assumption of independence: areas that are closer together are likely to be similar.
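One common way to quantify spatial dependence is global Moran’s I (listed in the Glossary below). Here is a minimal sketch, assuming binary weights and a toy layout in which areas sit in a row and each one neighbors only the areas immediately beside it. The values and the helper names `morans_i` and `chain_neighbors` are hypothetical, for illustration only.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary (0/1) spatial weights.

    `neighbors[i]` lists the indices of the areas adjacent to area i.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # S0: the sum of all weights (one per directed neighbor pair).
    s0 = sum(len(nbrs) for nbrs in neighbors)
    # Cross-products of deviations over every neighboring pair.
    cross = sum(dev[i] * dev[j]
                for i, nbrs in enumerate(neighbors) for j in nbrs)
    return (n / s0) * cross / sum(d * d for d in dev)

def chain_neighbors(n):
    """Toy adjacency: areas in a row, each touching the ones beside it."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

# Hypothetical values for six areas in a row (NOT real data).
clustered = [1, 1, 1, 9, 9, 9]     # similar values sit next to each other
alternating = [1, 9, 1, 9, 1, 9]   # neighbors are always dissimilar

w = chain_neighbors(6)
i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
print(f"clustered:   I = {i_clustered:+.2f}")
print(f"alternating: I = {i_alternating:+.2f}")
```

The same six values produce a positive I when like values are neighbors and a negative I when they alternate. That is the sense in which spatial data are not independent draws: the arrangement itself carries information.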

In tension with the first law, Michael Goodchild’s (2004) second law of geography is the principle of spatial heterogeneity (or non-stationarity). Here, context and specificity of “place” is important for any spatial analytic project because there is no “average” place that can represent everywhere else (why? because averaging obscures heterogeneity in the total study area). For example, processes might be ‘place-dependent’, differing by “where.” This also highlights the issues of sampling as researchers collect, select, or aggregate spatial data. The second law of geography, I think, foregrounds the importance of having background knowledge about the process or phenomenon that you’re studying and the study area. Here, we can think of how processes and parameters vary in space.

Now, we’re getting a bit into the weeds. When working with polygons, like zip codes, PUMAs, or census tracts, we might want to “scale up” to a larger areal unit, like a county. In the case of zip codes or PUMAs, the areal units violate the assumption of contiguity (feature is contained within another feature).

For example, below, I have mapped Chicagoland county boundaries (black) and the city of Chicago’s boundaries (orange) overlaid with zip code boundaries (purple).

Example of Zip Code Violation of Contiguity
Map by Arrianna M. Planey (20 Sept 2018) [Image Description: A map of Chicagoland counties (black) with zip code boundaries overlaid (in purple). The city of Chicago’s boundaries are highlighted in orange to point out that zip codes- which reflect postal workers’ work routes- are not contiguous or contained within counties or city boundaries.]
Zip codes- which reflect postal workers’ work routes- are not contiguous or contained within counties or city boundaries. So, if you were looking at facility locations or disease cases at the zip code level within Chicago or Cook County, you may well be including facilities or cases in neighboring cities or counties. If you are able, zoom in on the map at the northwest corner of Cook County, or at Chicago’s southwest, where the zip codes are not at all contained by the county or city boundaries. For example, residents in zip code 60638 are split across 4 cities- Chicago, and its neighboring suburbs Forest View, Bedford Park, and Stickney.

Example of Zip Code Violation of Contiguity - Close up
Map by Arrianna M. Planey [Image Description: A map showing the southwest corner of Chicago (orange) and the zip code boundaries (purple). You can see that residents in zip code 60638 are split across 4 cities- Chicago, and its neighboring suburbs Forest View, Bedford Park, and Stickney]
Now, what about aggregating zip code data at the census tract level instead? Converting from zip codes to census tracts is much easier if you have point data. If you have point data for the phenomenon or process you are studying, then you can aggregate them at multiple scales or by different types of areal units. If the point data are not available, there are ways of estimating the population (e.g. from census block groups) and calculating the risk at the census tract level based on how that population is distributed across zip codes. Still, it is preferable to have the un-aggregated data to work with.
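As a rough illustration of that estimation step, here is a sketch of population-weighted apportionment: each zip code’s case count is split among the tracts it overlaps, in proportion to the share of the zip code’s population living in each overlap. The zip and tract labels, counts, and populations are all hypothetical, and the `apportion` helper is my own naming. The big assumption, which often will not hold, is that cases are spread evenly across a zip code’s population.

```python
from collections import defaultdict

def apportion(zip_cases, overlap_pop):
    """Split zip-level case counts across tracts by population share.

    zip_cases:   {zip_code: number of cases}
    overlap_pop: {(zip_code, tract): population living in that overlap}
    ASSUMPTION: cases are evenly distributed over each zip's population.
    """
    zip_pop = defaultdict(float)
    for (zip_code, _tract), pop in overlap_pop.items():
        zip_pop[zip_code] += pop

    tract_cases = defaultdict(float)
    for (zip_code, tract), pop in overlap_pop.items():
        share = pop / zip_pop[zip_code]
        tract_cases[tract] += zip_cases[zip_code] * share
    return dict(tract_cases)

# Hypothetical inputs (NOT real data). Zip Z1 spans tracts T1 and T2;
# zip Z2 spans tracts T2 and T3.
zip_cases = {"Z1": 30, "Z2": 10}
overlap_pop = {
    ("Z1", "T1"): 2000,
    ("Z1", "T2"): 1000,
    ("Z2", "T2"): 3000,
    ("Z2", "T3"): 1000,
}

tract_cases = apportion(zip_cases, overlap_pop)

# Tract populations, summed from the same overlap table.
tract_pop = defaultdict(float)
for (_zip_code, tract), pop in overlap_pop.items():
    tract_pop[tract] += pop

for tract in sorted(tract_cases):
    rate = 1000 * tract_cases[tract] / tract_pop[tract]
    print(f"{tract}: {tract_cases[tract]:.1f} cases, {rate:.2f} per 1,000")
```

The estimated tract counts add back up to the zip code totals, but they are only as good as the even-distribution assumption. That is why the un-aggregated point data are preferable whenever you can get them.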

Unlike zip codes, census tracts are nested administrative areal units. This U.S. Census Bureau hierarchy map illustrates how it works, from nation > regions >  states > counties > census tracts > block groups > census blocks.

Moreover, as you scale up or down the hierarchy and aggregate data, there’s the “change of support” problem. That is, the size, shape, orientation, and volume of each spatial measurement changes as we transform the data. This is a pretty serious consideration in any statistical analysis of spatial processes. For example, as you move down the hierarchy, the population of each areal unit gets smaller, which means you are more likely to get unstable rates as you calculate, say, proportions of facilities, providers, or cases per population. [For more information, I recommend these slides by Prof. Peter Craigmile (Ohio State, Statistics) on spatial change-of-support and misalignment problems.]
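To see why small populations give unstable rates, here is a small sketch using the standard error of a proportion under a simple binomial model, sqrt(p(1 - p) / n). The population sizes and the 1% rate are hypothetical round numbers chosen to span the hierarchy, not actual census figures, and `rate_se` is my own helper name.

```python
from math import sqrt

def rate_se(rate, population):
    """Standard error of a proportion under a simple binomial model."""
    return sqrt(rate * (1 - rate) / population)

# Hypothetical population sizes (NOT actual census figures).
units = {"county": 500_000, "census tract": 4_000, "block group": 1_000}
true_rate = 0.01  # assumed to be the same underlying rate everywhere

standard_errors = {}
for name, pop in units.items():
    se = rate_se(true_rate, pop)
    standard_errors[name] = se
    # Approximate 95% interval, floored at zero since rates can't be negative.
    low = max(0.0, true_rate - 1.96 * se)
    high = true_rate + 1.96 * se
    print(f"{name:>12} (n={pop:>7,}): "
          f"rate {true_rate:.3f}, approx. 95% interval {low:.4f} to {high:.4f}")
```

With the underlying rate held fixed, the interval widens as the population shrinks. A handful of extra cases in a small unit can therefore look like a hot spot when it is only noise.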

I think I’ll stop there for now. I might do a follow-up with a short tutorial on fitting a basic spatial regression model to area data using GeoDa (which you can download here).

 

 

Glossary

Global vs. Local Analysis – (see: Anselin, 1995; Getis & Ord, 1996) “Global” refers to 1st-order effects (variation in the mean) and trends over the study area (what is referred to as “spatial heterogeneity” in the 2nd Law of Geography). “Local” refers to 2nd-order effects, or spatial dependence in the study area (1st Law of Geography). “Local” analyses typically have a model structure that accounts for spatial autocorrelation by varying locally.

  • Global Methods: Geary’s C, Moran’s I, Bivariate Moran’s I
  • Local Methods: Local Moran’s I, Getis-Ord Hot-Spot Analysis

Spatial Data Transformation – There are many ways to transform spatial data across scales and data types. The below are a sample of the options available to use through GIS.

  • Point to Polygon: Thiessen (or Voronoi) polygons, aggregated counts of events within an area
  • Polygon to Point: Centroid (e.g. population-weighted)
  • Point to Raster: Kernel Density Estimation, Kriging (most temperature, elevation, pollution, or noise data collection uses this approach)
  • Polygon to Raster: Interpolation, Kernel Density Estimation
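As one small example of these transformations, here is a sketch of the polygon-to-point case: a population-weighted centroid. It assumes you already have the coordinates and populations of smaller units (say, blocks) inside the polygon, and that the coordinates are in a projected, planar system. The numbers and the `weighted_centroid` helper are hypothetical.

```python
def weighted_centroid(units):
    """Population-weighted centroid of (x, y, population) tuples.

    ASSUMPTION: x and y are planar (projected) coordinates, so a
    weighted average of them is meaningful.
    """
    total = sum(pop for _x, _y, pop in units)
    cx = sum(x * pop for x, _y, pop in units) / total
    cy = sum(y * pop for _x, y, pop in units) / total
    return cx, cy

# Hypothetical blocks inside one polygon: (x, y, population). NOT real data.
blocks = [(0.0, 0.0, 100), (10.0, 0.0, 300), (10.0, 10.0, 600)]

unweighted = weighted_centroid([(x, y, 1) for x, y, _pop in blocks])
weighted = weighted_centroid(blocks)
print("unweighted mean of block locations:", unweighted)
print("population-weighted centroid:     ", weighted)
```

The weighted point is pulled toward where people actually live. This matters when you use a single point to stand in for a large or unevenly settled polygon, for instance when measuring distance to the nearest facility.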

*PUMAs – Public Use Microdata Areas, or “collection of counties or tracts within counties with more than 100,000 people, based on the decennial census population counts.” These are redrawn after each decennial census to reflect population change. For example, the PUMAs for the Milwaukee Metro area differ between 2000 and 2010. [source]

Topology – The spatial relationships that facilitate GIS as a technology (GISystems) and a technique (GIScience). These include connectivity, adjacency, nestedness (or contiguity), and directionality.

 

 

References

Alcalde, M.G. (2018). Zip Codes Don’t Kill People—Racism Does. Health Affairs Blog. DOI: 10.1377/hblog20181127.606916

Goodchild, M.F. (2004) GIScience, geography, form, and process. Annals of the Association of American Geographers 94(4): 709–714.

Grubesic, T.H. (2008). Zip codes and spatial analysis: Problems and prospects. Socio-Economic Planning Sciences. 42(2), pp 129-149, DOI: https://doi.org/10.1016/j.seps.2006.09.001

Krieger, N., Waterman, P., Chen, J.T., Soobader, M.J., Subramanian, S.V., and Carson, R. (2002). Zip Code Caveat: Bias Due to Spatiotemporal Mismatches Between Zip Codes and US Census–Defined Geographic Areas—The Public Health Disparities Geocoding Project. American Journal of Public Health. 92(7). 1100-1102 [link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1447194/ ]

Openshaw, S. (1984). Ecological Fallacies and the Analysis of Areal Census Data. Environment & Planning A. 16(1) https://doi.org/10.1068%2Fa160017

Openshaw, S., (1984). The modifiable areal unit problem. CATMOG (38). [URL: https://www.scribd.com/document/343456450/Openshaw-1984-MAUP ]

Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography. 46, 234-240

 

Recommended Texts & Resources

  • Cromley E, and McLafferty S. (2002). GIS and Public Health. New York: Guilford Press
  • Waller L, Gotway C (2004)  Applied Spatial Statistics for Public Health Data.  New York:  Wiley.
  • Prof. Mei-Po Kwan’s website summarizing her work on the Uncertain Geographic Context Problem (UGCoP), which is distinct from the MAUP http://www.meipokwan.org/UGCOP.html
  • Prof. Mei-Po Kwan’s website summarizing her work on the Neighborhood Effect Averaging Problem (NEAP) http://meipokwan.org/NEAP.html

 
