All About Census Geography and Summary Levels
Census data is almost all summary data. It usually starts with a survey, but the results are provided as summaries, and the first and universal stratifier of such summaries is a geographic area. We do not get census data telling us that, according to the latest ACS, Joe and Mary Miller have a household income of $15,000 and are therefore below the poverty level. What we do get is information regarding the median household income for the state of Illinois or the percentage of persons below the poverty level for the city of Chicago. It's always a summary statistic and it always describes a geographic area. The subject of this essay is the extensive collection of geographic area types for which we can get census data, with particular emphasis on the coding scheme used to identify all the types and the way in which they relate to one another. We shall provide examples of how all this is reflected in the Missouri Census Data Center's archive datasets and will conclude the module with a somewhat extended example of how an understanding of the geography and summary level codes is crucial to doing a custom data query.
The Basic Geographies
If you use the Bureau's American FactFinder web-based data access system, you have probably followed the link they feature on that site optimistically labeled "Explain Census Geography". It takes you to the following chart with explanation:
We count 27 different geographic entities on this chart, which they claim represent all the "types for which data are available in FactFinder". There are two basic categories for the entities shown on these charts: those that have legal status and are not controlled by the Bureau (counties, places (incorporated cities), congressional districts, state legislative districts, school districts, etc.); and those which the Bureau is responsible for defining (census tracts, block groups, blocks, public use microdata areas, ZCTAs, etc.).
What is not at all apparent in the Bureau's explanation page is just how complicated the world of census geography really can be. The thing to keep in mind is that census geography, once you get far beyond the very basics, is way too complicated to cover in a web page or two. There are gotchas and footnotes and OMB Tech Docs associated with just about every one of the geographies listed on the Bureau's diagram. One of the most complicating factors, which is not addressed by this diagram and explanation, is the time dimension. Regions, Divisions and States are no problem, since they tend to stay put over time. You might think Counties would fall into that category as well, but not so. There are small changes going on to counties all the time. Since the 2000 census we have had county changes in Colorado, Alaska and Virginia. (See Substantial Changes to Counties and County Equivalent Entities: 1970-Present for details.)
The down-the-middle census geography hierarchy of Census Tracts - Block Groups - Blocks is redefined every ten years. So the entities in use when accessing a 1990 STF1 dataset are not the same ones used for tabulating a 2000 SF1 dataset. Same concept, and in many areas you'll see a lot of unchanged tracts, but absolutely a different geographic layer. The 2010 tract-BG-block geography has been mostly defined (the Bureau makes final tweaks based on what they find in taking the census) and will be unveiled next March (or so) when the first results of the 2010 census are released. The Public Use Microdata Areas ("PUMAs") are similar, although the 2010 PUMAs will not be defined until after the 2010 population counts become available, probably in 2012 or 2013.
Blocks As Atomic Units
In the explanatory text portion of the Bureau's geography chart they say: Notice that many lines radiate from blocks, indicating that most geographic types can be described as a collection of blocks.
Actually, there is only one entity shown on the chart that is not built from blocks, and that is ZIP codes. But one could argue that ZIP codes do not belong on the chart, since the Bureau does not really publish any data at the ZIP code level, strictly speaking. They use ZCTAs as a proxy to provide data for users who would really like to have it by true ZIP codes but understand that ZCTAs are close enough.
So, blocks are very important geographic entities. Not for data reporting, per se — there is very little data available at the block level, mostly just basic census counts from the decennial census. But everything-is-made-of-blocks turns out to be a very useful situation for someone trying to do analyses that involve relationships between the various geographic layers. It turns out to be the basis for building a large set of block-level geographic-lookup tables that we call the master area block level equivalency (MABLE) files. These tables, when combined with the Geocorr web applications, provide a tool that lets users generate reports/spreadsheets documenting and measuring geographic relationships (for example, what counties a ZIP code intersects with and the population of those intersections).
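To make the everything-is-made-of-blocks idea concrete, here is a minimal Python sketch of how a block-level equivalency table can be rolled up to measure the overlap between two larger geographies. It is only an illustration of the principle, not the actual MABLE or Geocorr code. The record layout, the function names and every block ID, ZCTA code and population count are made up for the example.

```python
from collections import defaultdict

# Hypothetical block-level equivalency records:
# (block id, ZCTA, county FIPS, block population). All values are invented.
BLOCK_RECORDS = [
    ("290190001001000", "65201", "29019", 120),
    ("290190001001001", "65201", "29019", 80),
    ("290070002002000", "65201", "29007", 50),
    ("290190003001000", "65203", "29019", 200),
]


def intersect_populations(records):
    """Sum block populations for every (ZCTA, county) pair that occurs."""
    totals = defaultdict(int)
    for _block_id, zcta, county, pop in records:
        totals[(zcta, county)] += pop
    return dict(totals)


def allocation_factors(records):
    """Share of each ZCTA's population that falls in each intersecting county."""
    pair_pop = intersect_populations(records)
    zcta_pop = defaultdict(int)
    for (zcta, _county), pop in pair_pop.items():
        zcta_pop[zcta] += pop
    return {
        pair: (pop / zcta_pop[pair[0]] if zcta_pop[pair[0]] else 0.0)
        for pair, pop in pair_pop.items()
    }


if __name__ == "__main__":
    for (zcta, county), share in sorted(allocation_factors(BLOCK_RECORDS).items()):
        print(zcta, county, round(share, 3))
```

Because every block belongs to exactly one ZCTA and exactly one county, a simple group-and-sum over blocks is enough to describe the relationship between the two layers. No polygon overlay is needed.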
History buffs will be interested in knowing that this situation of defining census blocks in such a way that they are not split by any other census-recognized geographic entities did not start until 1990. Lots of things have changed about blocks over the decades. They started out as 3-digit codes, then became 3-digit with a 1-character alpha suffix, and then finally the current 4-digit numbers. We believe they are going to get 5-character values for 2010.
Geography Over Time
The time factor is particularly troublesome for entities such as places (cities), school districts and ZIP codes, which tend to change all the time. While the Bureau will always tabulate population estimates for places using the latest available boundaries, the data from the previous census will always be frozen to reflect the city's boundaries as of January 1 of the census year. (Which many of us think is a good thing.) We mention ZIP codes, but the Bureau really does not publish very much (nothing?) based on true ZIP codes, since they are not the kind of entity that lends itself to use as a geographic area. Many ZIP codes represent buildings, not communities. The Bureau came up with the ZCTA concept to allow them to tabulate data for units that approximate ZIP codes but differ from them in ways that are not always trivial, depending on the application. See our All About ZIP Codes page for a more detailed discussion of these.
Congressional Districts are subject to change every two years, but for the most part only undergo major changes every ten years following a post-decennial-census reapportionment. But smaller changes can and do occur throughout the decade (especially in Texas) and getting data for the instances of this coverage can be difficult or impossible.
"Metropolitan areas" is an umbrella term that actually covers a number of different geographic entities. If you take the term as it has been used over time then there are numerous other entities that could be included. On the alternate chart we mentioned above, they have an entry labeled "core based statistical areas" instead of metropolitan areas. The CBSA terminology, used to describe a new system for defining these kinds of urban media-market type areas that was developed around 2000, has been discouraged by the Bureau and is not seen too often any more on their web site or in their documentation. A CBSA is either a metropolitan statistical area or a micropolitan statistical area, depending on the size of the core area. The CBSA entities were created as a replacement for the previous generation of comparable entities used from about 1982 through 2000, which were referred to as MSAs, CMSAs (Consolidated MSAs) and PMSAs (Primary MSAs). MSAs were simple stand-alone metro areas like St. Louis and Kansas City, while CMSAs were much larger urban entities that were made up of adjoining PMSAs. (For example, the Dallas-Fort Worth, TX CMSA is made up of the Dallas and Ft. Worth PMSAs.) It was a little bit complicated and confusing, and the CBSAs were an attempt to improve on the concept, especially by introducing the Micropolitan Statistical Areas, which were just like the MSAs but on a smaller scale. Accompanying the new CBSA entities came two other related entities: Combined Statistical Areas (CSAs, which are combinations of adjoining CBSAs) and Metropolitan Divisions (sub-areas of CBSAs, which are roughly equivalent to the PMSAs of the earlier system).
Beyond just trying to keep up with the various entities and the summary levels and codes that go with them, there is also the variation over time, which makes using these geographic areas challenging. The CBSAs can and do change year by year. In most years the changes are rather few and somewhat small, but not always. You add a county here or there; occasionally (rarely) a county gets subtracted. New CBSAs can be created at any time during the decade and occasionally one gets decommissioned because it no longer meets the criteria. These changes do not draw much publicity and are easy to miss. OMB issues the formal bulletins that signal when changes occur. The Bureau then incorporates those changes into a text file with complete definitions (sometimes several months following the official OMB release). We currently use the definitions posted on the Bureau's web site, which advertises itself as being updated as of November, 2008.
The 2000 Census Summary File 3 Summary Level Sequence Chart
A familiar sight to anyone who has ventured into the field of Census Bureau technical documentation for their data products (specifically, for the "Summary (Tape) File" products) is the document summarizing the geographic levels being summarized on these files. Here is a partial snapshot (only about a fourth of the entire 2-page document) of the Summary Level Sequence Chart provided with the 2000 Census SF3 data product.
Some things to note about the SLSC:
- The Geographic Component column alerts the user to the existence of special summaries that only consider a subpopulation of the area. For example a geographic component code of 01 indicates an "Urban portion" summary, 02 means an "Urban in central place of Urban Area" summary, etc. For most users all this means is that they are probably going to want to filter out those rows where the value of the Geographic Component code is not 00 (the code indicating that this is not a geographic component summary, but rather a summary of the entire geographic area).
- The summary level codes and their meanings are indicated in the rightmost Summary level column. The first row entry of 040 State says we have state level summaries on this file indicated by a summary level code of 040. The second line indicates County level summaries with a code of 050. Note the indentation, which is critical to understanding these charts, as it indicates a hierarchy of entities. Counties are obviously nested within states. The footnote reference is provided to remind or inform the user that when we say "County," we include other entities that serve as county equivalents, such as parishes in Louisiana or boroughs in Alaska.
- Note the use of hyphens and slashes in the Summary level description, as explained above the table. In the row for level 070, for example, we have 3 dashes which means a 4-level hierarchy (starting with State and ending with Place/Remainder). The "/" in the last level says that this can either be a place (city or Census Designated Place) or it can be the "Remainder" of a county — the portion not within any place.
- The four codes 070, 080, 085, and 090 form a classic census geography hierarchy in which each level is subordinate to the previous one, all of them contained within the County Subdivision level. In Missouri a County Subdivision is called a "township" (in New England it is called a "town", and other names apply in other states). These levels are referred to as split (or hierarchal) geographies, i.e. as "split place" (070), "split tract" (080), and "split block group" (090). There are other summary levels (160, 140 and 150, appearing just below in the chart) which are the "un-split" summaries for these geographies.
- Being from Missouri, I have a certain bias regarding what summary levels are most useful and which ones we could almost get by without. Because it is rare for Missourians to care about township geography, there is not much interest in data tabulated to any of the geographies in the 070 to 090 hierarchy. These are very voluminous levels which typically can occupy a large majority of the space on a census summary file, and yet people in Missouri almost never care to use them. The exception to the rule is the 090 ("split block group") summaries. These are important summaries not because anyone cares about such data per se, but because they serve as building blocks when aggregating census data to other geographic levels. This is because these are the smallest geographic areas on Summary File 3, which means the smallest unit for long-form (aka "sample") data in the census. If you have only the 090 level summaries and the right programmer/software you can just about recreate through aggregation any of the component geographies (e.g. place, complete census tract, township, etc.).
- When people refer to "tract" and "block group" level data they are almost always referring to the un-split versions: summary levels 140 and 150, respectively. We (and the Census Bureau) refer to these un-split levels as inventory levels and the split versions as hierarchal levels. This can somewhat explain why within the MCDC data archives you will sometimes see data files with names such as moi and moh. These contain inventory and hierarchal summaries, respectively. In 2000 there were 12,631 inventory summaries for Missouri and 30,172 hierarchal summaries. The moh dataset is 2.5 times larger than the moi dataset and about 1/10th as useful. See this page for a report indicating the SumLev variable values for the moh dataset and how often each occurs on the dataset.
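Two of the points in the list above (dropping geographic component rows, and using the 090 summaries as building blocks) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field names (sumlev, geocomp, county, tract, pop), the function names and all of the counts are invented, and a real summary file carries many more identifiers and table cells.

```python
from collections import defaultdict

# Hypothetical summary rows; every number here is made up for illustration.
ROWS = [
    {"sumlev": "040", "geocomp": "00", "county": "", "tract": "", "pop": 1300},
    {"sumlev": "040", "geocomp": "01", "county": "", "tract": "", "pop": 900},
    {"sumlev": "140", "geocomp": "00", "county": "019", "tract": "000100", "pop": 1000},
    {"sumlev": "090", "geocomp": "00", "county": "019", "tract": "000100", "pop": 400},
    {"sumlev": "090", "geocomp": "00", "county": "019", "tract": "000100", "pop": 600},
    {"sumlev": "090", "geocomp": "00", "county": "019", "tract": "000200", "pop": 300},
]


def whole_area_rows(rows):
    """Keep only rows that summarize an entire area (geographic component 00)."""
    return [r for r in rows if r["geocomp"] == "00"]


def aggregate_to_tracts(rows):
    """Rebuild complete-tract totals by summing the 090 (split block group) rows."""
    totals = defaultdict(int)
    for r in whole_area_rows(rows):
        if r["sumlev"] == "090":
            totals[(r["county"], r["tract"])] += r["pop"]
    return dict(totals)


if __name__ == "__main__":
    print(aggregate_to_tracts(ROWS))
```

In this toy data the rebuilt total for tract 000100 matches the published 140 row. That is the property that makes the 090 level useful even when nobody wants to read it directly.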
Order Matters: Summary Levels 390 and 381 (for example)
If you follow the link just above to the complete summary level sequence chart for SF3, you will notice that it actually comprises two charts, labeled "A: State Summary File 3" and "B: National Summary File 3". Experienced census data users will recognize the Bureau's convention of releasing summary file products as a series of alpha-coded "files", such as "Stf 3, File A" and "Stf 3, File B". The different files usually contain the same tabular data, but they do it for different geographic universes and summary units. In the case of the 2000 census, Summary File 3, there was a set of state-universe files that formed the "File A" series and a single national file ("File B") that presented data for mostly larger geographic areas but for the entire U.S.
On the second page of the A file SLSC you should see the entry:
390 State-Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area
and on the second page of the B file SLSC the entry:
380 Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area
381 Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area-State
You'll notice that the only difference between the description for the 390 and 381 summary levels is the order in which the component geographies are listed. On the A file State comes first, then the MSA/CMSA, while on the B file it is reversed. Of course we are really talking about the same kind of geographic entity: the state portion of a metro area (which may or may not be in multiple states, by the way). It's not a significant difference unless you think you know what the code is for such an entity and you try to plug the value into your Dexter filter spec and have it fail because you are accessing the national file and using the state-file code. We have used this code-pair as an example of how this works. It also applies to other codes for geographic entities that can cross state lines such as Urbanized Areas, Core Based Statistical Areas and even ZIP (ZCTA) codes.
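One way to avoid the failed-filter problem described above is to look the code up by file series instead of remembering a single value. The Python sketch below is only an illustration of that habit. The two codes come from the sequence charts quoted above, while the dictionary and function names are our own invention and are not part of any Bureau or MCDC software.

```python
# Summary level code for "state portion of an MSA/CMSA" on the 2000 SF3 files,
# keyed by file series ("A" = state files, "B" = national file).
STATE_PART_OF_METRO = {"A": "390", "B": "381"}


def metro_state_part_code(file_series):
    """Return the code to use in a filter for the given SF3 file series."""
    try:
        return STATE_PART_OF_METRO[file_series.upper()]
    except KeyError:
        raise ValueError(f"unknown SF3 file series: {file_series!r}") from None


if __name__ == "__main__":
    print(metro_state_part_code("A"))  # 390
    print(metro_state_part_code("B"))  # 381
```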
Summary Levels and Area Names
There are rigorously followed conventions for how the Bureau attaches names to geographic entities so that they can be readily identified. The name always describes the last entity for a hierarchical summary, sometimes followed by the notation "(part)". So here is what we get when we extract rows from the 2000 SF3 national ("B") file for the St. Louis metropolitan area:
Notice the areaname values for the two 381 state portion summaries: you get the name of the state rather than the name of the metro area or a combination thereof. When viewed in context as it is here it makes pretty good sense. But when you work with the file and extract all the state-portion summaries and try to do analysis on them, it becomes a problem not having the name of the metro area on the records. Even when the MSA is contained entirely within a single state, the Areaname identifies the state rather than the metro area — we get "Texas (part)" instead of "Abilene, Texas Metropolitan Statistical Area".
Here is a listing of all the 390 level summaries on the Missouri "A" file:
The names here are much more informative. They even use the parenthetical "(part)" notation to inform the user that the metro area spans states and this is only the part within Missouri. On the national file 381 summaries the word "(part)" is always appended, regardless of whether or not the MSA spans states.
Notice the GeoCode column shown in these extracts. This is not a Census Bureau field; it is one that the Missouri Census Data Center tries to add to most of our multi-summary level datasets. It contains the codes for all the fields that comprise the summary level, separated by dashes. Thus the 381 level has 7040-29 for the Missouri portion of the St. Louis MSA, while the corresponding 390 level value is 29-7040.
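The GeoCode convention just described is easy to express in code. The following Python sketch shows the idea for the two summary levels discussed here. The field names (state, msacmsa) and the function name are assumptions made for the example, and the table covers only these two levels, so it should not be read as the MCDC's actual implementation.

```python
# Order of the component fields for each summary level. Only the two levels
# discussed in the text are listed; the field names are hypothetical.
FIELD_ORDER = {
    "381": ("msacmsa", "state"),  # national file: metro area, then state
    "390": ("state", "msacmsa"),  # state file: state, then metro area
}


def build_geocode(sumlev, codes):
    """Join the component codes for a summary level with dashes, in hierarchy order."""
    return "-".join(codes[name] for name in FIELD_ORDER[sumlev])


if __name__ == "__main__":
    st_louis_mo_part = {"state": "29", "msacmsa": "7040"}
    print(build_geocode("381", st_louis_mo_part))  # 7040-29
    print(build_geocode("390", st_louis_mo_part))  # 29-7040
```

The same pair of codes yields two different GeoCode values, which mirrors the way the same area gets two different summary level codes depending on the file.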
Summary Levels by Size and Type
The summary level code for a place (the Bureau uses the term "place" to mean an incorporated city or town, or a census designated place, which is a census-defined entity that has no legal definition but is used as a unit for data reporting) is 160. At least that is the code used on the 2000 SF3 files (per the Summary Level Sequence chart shown above). When the summary is for the portion of a place within a county the summary level code is 155.
So when the Bureau releases their "sub-county" population estimates, which include estimates for multiple geography types including places and minor civil divisions (county subdivisions that are legally recognized governmental units), they use summary level codes to identify the various levels summarized. Both complete place and place-within-county summaries are provided. You might expect to see these codes (155 and 160) used on these files. But here is what we get:
The expected 160 codes are 162s and the expected 155s are 157s. The explanation is that the Bureau uses a different code for these geographic units because the estimates do not include any CDPs (census designated places). A similar situation exists for the data estimated at the MCD level. MCDs are county subdivisions so you might expect 060 summary level codes. But you will see only 061 codes instead. These are a subset of the entities included within the 060 category.
This same thing happens when there is a geographic size limitation. In the 1990 national summary file they reported data at the place (city) level only for cities of at least 10,000 population. The summary level code used on those files was 161 rather than 160.
The moral of the story is that there may be more than one summary level code used to describe a geographic entity, depending on the context of where the summary data appears.
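If you write programs that have to cope with files from several sources, one practical response to this moral is to group the variant codes under the concept they describe. The Python sketch below does that for just the codes mentioned in this section. The concept labels and function name are our own, and the table is deliberately incomplete.

```python
# Variant summary level codes mentioned in this section, grouped by the kind of
# entity they summarize. The concept labels are informal, not Bureau terms.
SUMLEV_CONCEPT = {
    "160": "place",              # complete place (2000 SF3)
    "161": "place",              # 1990 national file: places of 10,000 or more
    "162": "place",              # sub-county estimates: no CDPs
    "155": "place within county",
    "157": "place within county",  # sub-county estimates: no CDPs
    "060": "county subdivision",
    "061": "county subdivision",   # sub-county estimates: MCDs only
}


def codes_for(concept):
    """Return every known summary level code for a concept, in sorted order."""
    return sorted(code for code, c in SUMLEV_CONCEPT.items() if c == concept)


if __name__ == "__main__":
    print(codes_for("place"))  # ['160', '161', '162']
```

A filter written as "SumLev in codes_for('place')" keeps working when you move from a decennial file to an estimates file, whereas a hard-coded 160 would silently return nothing.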
The concept of a geographic summary level code has been around at least as long as the Census Bureau has been producing summary files (or "Counts" as they were called in the 1970 decennial census). For the 1970 and 1980 data products a 2-digit code was used. 01 was the code for a national total, 02 for a region, 03 for a division, 04 for a state and, you might expect, 05 might be a county. But actually 11 was the code for a county summary and all the rest are of no particular logical relationship to the new 3-digit codes that went into effect with the 1990 census products. Like a lot of historical facts, this does not have a lot of practical application. Unless maybe you happen to be required to go back and use some original census summary files from that earlier era. If you use data stored in the MCDC data archive, such as the stf803 and stf803x2 data directories ("filetypes"), you will see 3-digit SumLev variables that we created by converting the original 2-digit codes.
The Definitive Master List
Somewhat surprisingly, there does not appear to be a (public) master list of all the summary level codes used by the Census Bureau. We have attempted to compile a complete list of all the ones we know about. See the master list. Note that this list also includes codes that are not numeric (i.e. that contain all or some alpha characters); these are not official Census Bureau codes. They are codes that we have used on our datasets when we created our own new geographic level (e.g. a regional planning commission or a U of Missouri Extension region), or when we were unable to find out what code the Bureau was using for something (e.g., we use 61c as a code for the county portion of a state upper-chamber legislative district; we are pretty sure the Bureau has a code for this, but at the time we did not know what it was). It would really be nice if the Bureau would publish a complete and annotated list of all these codes.
Summary Levels and Geoids
With the advent of the American FactFinder online data retrieval system, the Census Bureau has developed a new universal geographic-entity code called a geoid. The idea is fundamental data warehouse methodology — the entities that your warehouse is describing are required to have a unique identifier. The codes are alphanumeric and of varying length and content depending upon the geographic level.
For example, the code for Springfield, MO is 16000US2970000. The first 3 characters of the ID are "160". Springfield is a city ("place") and 160 is the summary level code for a complete place. The next 2 characters in the geoid are "00" and this turns out to be the geographic component code. This is followed by the characters "US" (presumably the Bureau either currently or in the future does data for other countries so that this becomes relevant), and then by the relevant geographic codes. In this case we see "29" indicating the state, and "70000" indicating the place code (see data application links for Springfield).
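The layout just described is regular enough to take apart with a few string slices. Here is a minimal Python sketch, assuming only what the paragraph above states: 3 characters of summary level, 2 of geographic component, the literal "US", then the geographic codes. The function and class names are ours, and the place-level split (2-digit state, 5-digit place) is shown only for summary level 160, since other levels have different code layouts.

```python
from typing import NamedTuple


class Geoid(NamedTuple):
    sumlev: str    # 3-character summary level code
    geocomp: str   # 2-character geographic component code
    geocodes: str  # remaining level-specific geographic codes


def parse_geoid(geoid):
    """Split a geoid into summary level, geographic component and geocodes."""
    if len(geoid) < 7 or geoid[5:7] != "US":
        raise ValueError(f"not a recognizable geoid: {geoid!r}")
    return Geoid(geoid[:3], geoid[3:5], geoid[7:])


def place_codes(parsed):
    """For a complete-place (160) geoid, return (state code, place code)."""
    if parsed.sumlev != "160":
        raise ValueError("place_codes only handles summary level 160")
    return parsed.geocodes[:2], parsed.geocodes[2:]


if __name__ == "__main__":
    springfield = parse_geoid("16000US2970000")
    print(springfield)               # Geoid(sumlev='160', geocomp='00', geocodes='2970000')
    print(place_codes(springfield))  # ('29', '70000')
```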
So the Census Bureau is using these geoid codes to keep track of geographic entities within their database and these codes start with a 3-digit summary level code. So what? So nothing, for most folks. But if you are one of us who like to know how things work, or who need to process raw ACS data or who might even be tempted to figure out what AFF query files look like and code their own then this is very useful information.
Summary Levels and Keyvals in Dexter
The ability to understand and use summary level codes is fundamental to being able to use the Missouri Census Data Center's archive via Uexplore and Dexter. The large majority of our datasets are summary data describing geographic entities, and the majority of these contain more than one type of geography. To be an efficient user of these data requires that you be able to distinguish between a state level summary, a county level summary or a city level summary. By this we mean that you be able to code a Dexter query that specifies which levels you are interested in having included in your extracted output. We recently ran a check on our metadata to see how many of our datasets contained a summary level code variable which was designated as a "key variable" for the dataset. We found that over half (52%) were so designated. Of course there are additional datasets that contain only a single level of geography and therefore do not require a summary level code. But many/most of these single-level datasets will still have a SumLev code present. They come in handy in cases where you join multiple datasets with different geographic levels.
To see how it works we'll walk you through a typical example. Let's say you are accessing an ACS Profile report on our site. You have navigated the menu front ends and chosen Chicago and the state of Illinois as the geographic areas. You get a page that looks like this:
Note the strange secret code in parentheses after the name of the Area (16000US1714000) — recognize it? It is the geoid code for the entity being summarized (Chicago). A common way to access one of our datasets is to be referred to Dexter with a specific dataset already selected by an application such as this one (acsprofiles). The "reference" is provided as a link in the top right of the output page:
The highlighted link can be clicked on to take you to a Dexter data extract form that lets you access a dataset — not just any dataset but the one that was used by the current application. Maybe you wanted to see how some of these table items compared to other large U.S. cities (New York, LA, Philadelphia, etc.) or maybe how it compared to other cities in the state of Illinois. For the sake of our example we'll say that you are interested in looking at various data related to poverty for all cities in the United States with a population of at least 500,000. We'll walk you through that and you'll see how knowing about summary level values helps you choose the data of interest.
Here is some of what you should see when you follow the link from the ACS Profiles page:
The highlighted link to detailed metadata can be followed to see a page containing:
This provides you with key variables links. We are particularly interested in the link to sumlev. Click it and you'll see this:
If you've been reading this document carefully you should already know that the code used to indicate a place (same as city, in Census lingo) is 160. Actually it is not the code but just a code that is used; we saw that in some instances they used codes like 161 and 162 to indicate subsets of places. It is not important or necessary that you memorize all the code values. All that you need to remember is what the codes are used for and where to look for this kind of metadata to help you see what levels are present on a dataset in which you have interest.
Armed with the information about 160 being the code for place-level summaries, you should now be able to make the following entries on the Dexter form:
The first line implements the population size condition, and the second line is the condition that eliminates (filters) all those rows that summarize some entity other than a place. We only keep rows where SumLev = 160. We complete the query by choosing what variables/columns we want in section 3. Something such as:
We click the nearest Extract data button. The result is a CSV (comma-separated value) file which our browser knows can be opened in Excel. It should look something like this:
Want to see a bunch of summary level values and some related geographic identifiers, as used by the Bureau in their latest American Community Survey datasets? See a good working example of summary level values that we extracted from the acs2008 data directory, dataset allgeos3yr. This dataset has the geographic identifier data for all 14,536 entities that qualified for summary data based on 3 years of data ending in 2008. We kept the report to a more reasonable size by filtering out any geographic-component other than 01 Rural and any state-specific areas except for Colorado. The report is sorted by geoid, which means by SumLev and then Geocomp and then whatever specific geocodes apply.
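For readers who think more easily in code than in web forms, the two-condition Dexter filter built earlier in this section (places only, population of at least 500,000) amounts to the following Python sketch. It is not how Dexter itself is implemented. The column names (sumlev, areaname, totpop, pctpoor), the function name and every data value are invented for the illustration.

```python
# Hypothetical rows standing in for a multi-level ACS profile dataset.
# Names and numbers are invented; only the filter logic matters here.
ROWS = [
    {"sumlev": "040", "areaname": "Example State", "totpop": 12_000_000, "pctpoor": 12.0},
    {"sumlev": "160", "areaname": "Example City A", "totpop": 2_700_000, "pctpoor": 20.5},
    {"sumlev": "160", "areaname": "Example City B", "totpop": 150_000, "pctpoor": 15.1},
    {"sumlev": "050", "areaname": "Example County", "totpop": 5_200_000, "pctpoor": 14.3},
    {"sumlev": "160", "areaname": "Example City C", "totpop": 600_000, "pctpoor": 23.9},
]


def large_places(rows, min_pop=500_000):
    """Keep rows that are place-level summaries (SumLev 160) with enough population."""
    return [r for r in rows if r["sumlev"] == "160" and r["totpop"] >= min_pop]


if __name__ == "__main__":
    for row in large_places(ROWS):
        print(row["areaname"], row["totpop"], row["pctpoor"])
```

Without the summary level condition the state and county rows would also pass the population test, which is exactly the mistake the SumLev filter is there to prevent.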
For More Information
For more information about Census Bureau geography in general, see the Bureau's Guidance for Geography Users page.
For a comparable page maintained by the Missouri Census Data Center see our Geography page.