The American Community Survey (ACS) is a very important new source of data of the type usually associated with a decennial census, but based on survey data that are less than two years old instead of over six years old. However, it has a number of characteristics that users need to be aware of before attempting to use these data. This list attempts to make users aware of some of the more important potential "gotchas" that go with these data. Much of what we say here is specific to the 2005 edition of the data and may not apply to data released in previous or future years.
We have tried to avoid getting into statistical technicalities, but it is not always possible to avoid such topics, since many of the items cited here are the result of statisticians doing things to reshape the data. In such cases, even if you cannot follow the details of why something might be the way that it is, at least know that the problem exists.
The 2005 ACS data provide us with more information about our population and housing stock than we have ever had in our history outside of a decennial census year. The results of the survey for 2005 have been tabulated in a very detailed fashion, with over 1,000 detailed ("base") tables, to go along with a series of custom profile, ranking and subject tables, geographic comparison tables, and even (coming soon) narrative profiles summarizing the data for a geographic area. These data are easily accessed by most users via the Census Bureau's American FactFinder web application, which has undergone a very significant upgrade that was made available with the first release of the 2005 ACS data in August 2006.
In addition to making the data itself readily available, the Bureau has also provided the usual access to excellent metadata and background information. There is even a series of online tutorials and PowerPoint presentations that provide excellent introductions to the data, geared to new users. See the ACS Home Page for more info.
The 2005 survey did not include persons in group quarters (i.e., those living in dormitories, nursing homes, prisons, military barracks, etc.). They will be included next year for the 2006 data year (and, hopefully, all future years). This limitation makes it difficult to cite any trend based on comparing 2000 decennial census data and the 2005 ACS data. The different survey universes used must always be taken into consideration. This limitation is noted in footnotes on the American FactFinder web site but is not mentioned in table labels. The label says "Total Population", with the implicit qualification "included in this year's sample universe".
The 65,000-population threshold for publishing data for a geographic area is a somewhat arbitrary magic number, designed to avoid creating tables where the sample size would result in large standard errors (sampling error). Of course, many tables that are published in the ACS have universes well below that threshold. Tables come with confidence interval sizes (MOE, for margin of error) to alert users to the reliability of the numbers. These MOE values are often quite large, especially when dealing with detailed subpopulations; just because an item gets published does not mean it can or should be used without noting the significant uncertainty involved.
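As a rough illustration of how a published MOE can be put to use, the sketch below converts a margin of error into a standard error, a confidence interval, and a coefficient of variation. It assumes that the MOE corresponds to a 90 percent confidence level (a z value of about 1.645), which is not stated above; the function names and the example figures are hypothetical and are not taken from any ACS table.

```python
# Assumption: published MOEs correspond to a 90 percent confidence level.
Z_90 = 1.645


def standard_error(moe, z=Z_90):
    """Convert a margin of error into a standard error."""
    return moe / z


def confidence_interval(estimate, moe):
    """Return the (lower, upper) bounds implied by the estimate and its MOE."""
    return (estimate - moe, estimate + moe)


def coefficient_of_variation(estimate, moe, z=Z_90):
    """Standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100.0 * standard_error(moe, z) / estimate


# Hypothetical cell: an estimate of 4,000 persons with an MOE of 1,200.
low, high = confidence_interval(4000, 1200)
cv = coefficient_of_variation(4000, 1200)
print(f"Interval: {low} to {high}; CV: {cv:.1f}%")
```

A cell whose interval is this wide relative to the estimate is the kind of number that should only be quoted together with its uncertainty.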
We will not see any numbers for smaller geographic areas until 2008. In that year, the Bureau plans to publish tables for geographic areas of at least 20,000 population. These tables will be based on combining the survey data for three consecutive calendar years, 2005 through 2007. In 2010, the Bureau will publish tables for geographic areas of essentially any size (down to census tracts and even block groups) based on data collected over the previous five calendar years (2005 through 2009). These tables based on multiple years of survey results are commonly referred to as moving average tables. Note that for larger areas, you will be able to choose from different sets of tables starting in 2008. For example, in 2010 there will be three sets of data for Boone County, MO (over 65,000 population): single-year tables based on just the 2009 surveys; three-year period estimate tables based on survey data for 2007 through 2009; and five-year period estimate tables based on survey data for 2005 through 2009. The tradeoff will be between larger sample size and more current data.
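The release schedule described above can be sketched as a small lookup. This is only an illustration of the planned schedule as stated here (single-year data first released in 2006 for areas of 65,000 or more, three-year data from 2008 for areas of 20,000 or more, five-year data from 2010 for areas of essentially any size); the function and constant names are hypothetical, and a minimum of 0 stands in for "any size".

```python
# Planned first release year and minimum area population for each product,
# keyed by the number of survey years combined (assumed from the text above).
FIRST_RELEASE = {1: 2006, 3: 2008, 5: 2010}
MIN_POPULATION = {1: 65_000, 3: 20_000, 5: 0}


def period_years(release_year, span):
    """Return (first, last) survey years covered by a product released in release_year."""
    last = release_year - 1
    return (last - span + 1, last)


def available_products(population, release_year):
    """List the products an area of the given population would get in release_year."""
    products = []
    for span in (1, 3, 5):
        if release_year >= FIRST_RELEASE[span] and population >= MIN_POPULATION[span]:
            products.append((f"{span}-year", period_years(release_year, span)))
    return products


# An area of 150,000 (hypothetical figure) in the 2010 release gets all three sets.
for label, (first, last) in available_products(150_000, 2010):
    print(label, first, "through", last)
```

For an area over the 65,000 threshold in 2010, this reproduces the three choices listed for Boone County: 2009 alone, 2007 through 2009, and 2005 through 2009.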
If you are looking forward to getting new and improved data regarding the number of Hispanics (or African Americans, or foreign-born persons, or poor persons, etc.), don't get your hopes up. The ACS does not provide any new data regarding the counts of persons or households. This is because the Census Bureau does not weight ACS survey returns the way they do with decennial census surveys. In the decennial census, the Bureau assigns weights based upon their master address list, which is assumed to be definitive and complete. This is not the case with the MAF (master address file) used for the ACS. Although an initial weight may be assigned based on the number of households found in an area on the MAF, the person record weights are adjusted so that total population counts at the county level by certain age, race, gender, and Hispanic cohorts will match numbers published in the Bureau's detailed county-level demographic estimates. The result is that the number of cases (persons) in a table is really just a reflection of those estimates, and the data collected in the ACS simply controls the apportioning of those cases (total persons in households, households with Hispanic head, total males living in households, etc.) based on characteristics. So the ACS may tell us what portion of African Americans are classified as living in poverty in a county, but the actual number of such persons is the result of applying that portion to the number of African Americans that are estimated (some would say "guesstimated") in the Bureau's estimates program. To make matters worse, the Bureau also adjusts the weights at the household level separate from the weights for persons.
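The arithmetic behind this point can be shown in a few lines. The sketch below is a deliberate simplification of the weighting process, not the Bureau's actual procedure: it treats a published count as the ACS-measured share multiplied by a population total taken from the estimates program. The function name and all figures are hypothetical.

```python
def implied_count(acs_share, controlled_total):
    """Count implied by applying a survey-based share to an externally controlled total."""
    return round(acs_share * controlled_total)


# Hypothetical county: the survey says 25% of a group is below the poverty line.
share_in_poverty = 0.25

# The same share yields different counts depending on the population estimate
# that the weights are controlled to.
count_a = implied_count(share_in_poverty, 12_000)
count_b = implied_count(share_in_poverty, 13_000)
print(count_a, count_b)
```

The share is what the survey contributes; the difference between the two counts comes entirely from the population estimate.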
This is a surprising, rather frustrating, and unintended result that the Bureau does not yet fully understand. It has to do with how the questions regarding income are asked on the two survey instruments. The decennial survey asks a person about their income in the previous calendar year, whereas the ACS survey asks about income in the previous 12 months. Everything gets adjusted for inflation, but when the Bureau looks at test results, they have strong evidence indicating that income reported with the ACS version of the question is consistently lower (by about 4.4% nationwide and 5.2% in Missouri). See the Bureau's rather readable 16-page paper about this issue by Nelson, Welniak, and Posey. The official Bureau stance is that users should exercise caution when trying to do trend analysis regarding income or poverty measures using the decennial census vs. the ACS data.
It should be noted that when the Bureau held a press conference at the time the 2005 ACS income data were released, all of the economic data trends that they cited were based on the data from the Current Population Survey (which coincidentally were released on the very same day) and not from the ACS.
This has caused considerable grief among journalists and other data analysts who were chomping at the bit to publish articles regarding trends related to these hot topics, only to be told (somewhat belatedly) that they should probably not do it (that is what "exercise caution" really means in this context), since the data were not truly comparable.
The income comparability problem is just one rather dramatic instance of an item being collected in the ACS that has issues of comparability with the same subject area as measured in the decennial census.
Users of decennial census data who have been around long enough to remember the problems we had with the 1970 and 1980 summary data sets because of data suppression will be disappointed to find out data suppression is back for the ACS. It happens at the base table level for the 1- and 3-year data products, but will not be done for the 5-year data to be released for all geographic areas starting in 2010.
The Bureau applies what they refer to as their Data Release Rules to the base tables in order to protect us from tables whose reliability is unacceptable. Some of us are not impressed with these rules, which seem somewhat arbitrary and which suppress entire tables rather than just the unreliable cells within the tables (and, conversely, allow the publishing of very unreliable cells within tables whose overall reliability is deemed acceptable).
We do want to warn users about some of the unfortunate consequences of this approach by citing an example. Base table B17010 deals with the poverty status of families. It breaks the data down by type of family and presence of related children. The table has 41 cells in it. Many of these cells pertain to rather uncommonly-occurring categories such as "Non family, male-headed family households with no related children < 18". Because of this detail, and because the Bureau's algorithm for suppressing tables is designed to protect us from tables with small cell counts, this table winds up being suppressed for 4 of the 16 Missouri counties for which we have ACS data for 2005. The way this is supposed to work is that when a table is too detailed like this, then there will be a comparable C table with less detail. But there is no table C17010. So, you might think that at least we can go to the economic profile table (D03), which has an item telling us what percentage of all families in an area are below the poverty line. But it turns out that the Bureau does not go back to the original data to generate the profiles, but instead just derives/copies them from the base tables. This results in a missing value for the percent poor families item on the economic profile for Cape Girardeau county, MO. This, in spite of the fact that there are almost 19,000 family households in that county. And in the very same profile a poverty estimate appears for related children < 5 years, even though the number of children under 5 in the county is less than 4,000.
The decennial census takes a snapshot of the population and housing stock based on a single day — April 1 of the decennial year. But ACS surveys are distributed year-round, so we have January data and December data. This can be a key factor in interpreting differences in data between the census and the ACS, especially so in areas that have seasonal populations, such as resort areas or college towns.
In the decennial census, you are counted where you are residing on April 1 (with a very few exceptions, such as a person who was on a trip that day and fills out the form when they get back home). With the ACS it is more complicated; where you get counted is based on where you reside when you get the survey (unless you are only staying there temporarily, defined as less than two months). This should mean increased populations for places like Lake of the Ozarks (resort area with a large summer-only population) and lower populations for places like Lawrence, KS (college town, where most students are there on April 1 but not in the summer, when many will be away for more than two months). However, since the population counts are then adjusted so that they sum to the numbers from the estimates program, maybe not. It may wind up affecting the characteristics (educational attainment goes down in Lawrence) without affecting the actual head counts.
The data products were released in waves over the late summer and early fall of 2006. As of mid-September, waves 1 and 2 had been released, covering subjects in the first three DP categories. The housing data were released the first week of October; the narrative profile data products along with some more detailed data regarding population subgroups were due in November.
The public use microdata sample (PUMS) data allow users to access a 1% sample of ACS surveys. This represents about 40% of the available data, since the overall ACS sample is about 1-in-40 or 2.5% of all households within a given year. Researchers who are comfortable with the statistical aspects of analyzing such data (typically with a commercial statistical software package such as SAS or SPSS) can create their own custom tables. The smallest unit of geography on these files is the PUMA, or public use microdata area, the same units identified on the 2000 Census PUMS files. Care must be taken when using PUMS files because of the small sample size.
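To give a sense of what creating a custom table from microdata involves, here is a minimal sketch of a weighted tabulation in plain Python. It assumes each person record carries a geography code and a person weight; the field names PUMA and PWGTP are assumptions about the file layout, and the three records and their values are made up for illustration.

```python
from collections import defaultdict


def weighted_counts(records, group_key, weight_key="PWGTP"):
    """Sum person weights within each group to estimate population counts."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[group_key]] += rec[weight_key]
    return dict(totals)


# Made-up person records; real files would be read from the PUMS data sets.
records = [
    {"PUMA": "00101", "PWGTP": 95},
    {"PUMA": "00101", "PWGTP": 110},
    {"PUMA": "00102", "PWGTP": 102},
]

counts = weighted_counts(records, "PUMA")
print(counts)
```

Each sampled person stands in for roughly a hundred people, so a cell built from only a handful of records carries a great deal of sampling error.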
The MCDC has a complete collection of the ACS PUMS data, which is kept in its own separate data directory (filetype) called acspums. This directory contains such files for multiple years. These datasets are basically just copies of the datasets as released by the Census Bureau; we did not have to convert them.
One of the things people are used to doing with data from the Census Bureau is creating thematic maps or summary reports that show spatial distributions of data within their state or region. This sort of thing is not generally doable with the ACS data (yet) because of the limited geography available (e.g., you cannot do a state-wide map or report by county, because many or most of the counties have no published data as yet). There are two levels of geography where the data at these levels is available for all areas, covering the state; they are congressional districts and PUMAs. The former tend to be too large for mapping purposes, while the latter are considerably smaller and hence better suited for a mapping application.
Users who are not familiar with PUMAs may find it worth their while to become more familiar. To learn more, you can start with a set of PDF base maps accessible from the Bureau's web site. When you get to the PDF document, be sure to note that the first page is an index page that displays entities called Super PUMAs. These are not the PUMAs you want. The PUMAs you do want are sometimes referred to as 5% PUMAs, because they were the geography used on the 5% sample PUMS files in 2000, whereas the super PUMAs (also known as 1% PUMAs) were the ones used on the 1% PUMS files in 2000. The key to using these maps is to understand that the 5% PUMAs nest within the super PUMAs, and these PDF files have one or more inset maps showing more detail for metropolitan areas within the state, and then one page for each super PUMA showing the boundaries of the 5% PUMAs. The maps also show relevant place and county boundaries to help you see what geographic areas correspond to the PUMAs.
For example, look at the 3rd page of the PDF file for Colorado. You can see from this page that PUMA 00101 comprises a series of rural counties in the northwestern corner of the state, going across the northern border as far east as Larimer county. We see that the PUMA is made up of five complete counties (Moffat, Rio Blanco, Garfield, Routt, and Jackson) as well as parts of Mesa and Larimer.
A more precise and easier way of seeing the relationships of PUMAs to other geographic entities, such as counties, is to use one of the MCDC's Geocorr web applications. For example, we invoked the application and specified that we wanted:
Try replicating these specs yourself. You should get a pair of output files, one a CSV file that can be used for importing into Excel, and the other a report file (text or HTML). Here is what you should see on the first few lines of that report:
Each line represents the intersection of the 00101 PUMA with a Colorado county. The 4th column shows the 2000 census pop count for the intersection (the portion of the county within the PUMA), and is followed by 2 columns of allocation factors. The first allocation factor says what portion of the PUMA's total population is in the county (43.8% of persons living in PUMA 00101 also live in Garfield county), while the second indicates what portion of the county's population also resides in the PUMA (100% of Garfield county residents live in PUMA 00101, and only about 5% of Larimer countians reside in that PUMA).
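The two allocation factors are simple ratios, and the sketch below shows how they could be derived from intersection populations. This is not the Geocorr implementation; the function name is hypothetical, and the PUMA codes, county names, and population figures are made up to keep the arithmetic easy to follow.

```python
from collections import defaultdict


def allocation_factors(intersections):
    """For each (puma, county, pop) intersection, compute the share of the
    PUMA's population in the county and the share of the county's population
    in the PUMA."""
    puma_totals = defaultdict(int)
    county_totals = defaultdict(int)
    for puma, county, pop in intersections:
        puma_totals[puma] += pop
        county_totals[county] += pop
    return [
        (puma, county, pop, pop / puma_totals[puma], pop / county_totals[county])
        for puma, county, pop in intersections
    ]


# Made-up data: County A lies entirely within PUMA P1; County B is split.
rows = allocation_factors([
    ("P1", "County A", 40_000),
    ("P1", "County B", 10_000),
    ("P2", "County B", 190_000),
])
for puma, county, pop, puma_share, county_share in rows:
    print(f"{puma} {county:9} {pop:>8,} {puma_share:6.3f} {county_share:6.3f}")
```

In this made-up case County A plays the role of a county wholly contained in the PUMA (second factor of 1.0), while County B resembles a county with only a small portion inside it.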
For more information regarding PUMA geography see the MCDC's page describing PUMAs in considerable detail, including a link to a custom report that shows all the PUMA codes in the U.S. along with their Super PUMAs and what counties and major cities are contained within each.
The author acknowledges the valuable contribution of Leonard Gaines of the New York State Data Center, who reviewed an early version of the page and made several valuable suggestions and corrections.