Overview

These files provide detailed population counts from the 2000 decennial census. The data in these files are tabulated from information collected on the census short form, which was sent to all households (as opposed to the long form, which went to only about one in six households). The U.S. Census Bureau has documented these files in great detail in a 600-page PDF document available at http://www.census.gov/prod/cen2000/doc/sf1.pdf. (The MCDC has broken this document down into a directory of smaller PDF documents that can be accessed in the Techdoc subdirectory of the sf12000 data directory.) Everything you will need to know, and more, is contained in that document. Our purpose here is to complement that information with details of what we have done to the data here at the Missouri Census Data Center.
Briefly, what we have done is transform the Bureau's collection of 40 ASCII files containing the data for a state into a series of database files (aka tables, datasets, etc.). We are going to assume that you have read enough of the Bureau's Technical Documentation to understand what kind of information you can expect to find in these database files. What we want to do here is provide pointers to help you find what you are looking for.
We have processed the complete set of data files for only a relatively small number of states (Missouri and some of its neighbors). However, we have downloaded and converted the geographic header files for all states and have stored the results in the xxgeos subdirectory. These sets contain just a minimal amount of actual census data (total population and housing unit counts) but are useful as geographic reference sets.
Alternative Data Sources

For many users who want the information contained in Summary File 1, learning about and trying to access these rather complex data files may be more than they really need or want to deal with. You should be aware of alternative sources for accessing these data in more end-user-friendly formats. These include a number of sites that offer profiles and other reports on the web, most notably the Census Bureau's American FactFinder site. The MCDC offers its own set of reports that are referenced below in the section on related reports. We also offer an alternate data collection ("filetype"), which we refer to as "sf12000x" and which is stored in its own data directory. The data in the sf12000x collection are distilled from the data in these full sf12000 files. Instead of dealing with over 8000 variables (in sf12000 files) with names like "pct12i34" that represent cells of multi-dimensional tables, in sf12000x files you will be dealing with just over 200 variables with names such as TotPop, Over65, Families and pct_Chinese.
Prior to releasing the SF1 data starting in June of 2001, the Census Bureau released a sort of "preview product" in May. This product was simply called the "Demographic Profiles", and consisted primarily of nicely formatted summary reports posted on the web in PDF format (see http://www.census.gov/Press-Release/www/2001/demoprofile.html for details on this product). There were also comma-delimited data files containing the data values used in the reports. An important limitation of these products is that they were available only for governmental units (states, counties, cities, some county subdivisions, etc.) but NOT for census tracts, block groups, etc. The MCDC has a complete collection of these data stored as the separate filetype sf1prof.
Data Files: What Goes Where

NOTE: We are in the process of changing our strategy for converting and naming these files (as of 1/14/2002). The description that follows describes the "new" strategy and there may be a short period where some of our data does not match the following description.
We have created a consistent set of data files (or datasets -- we'll use the two terms interchangeably) for each state that we process (with "us" serving as a pseudo-state for any national collection we might process). We identify the state that each of our files relates to by using the state postal abbreviation as the first 2 characters of the file name. So any file that begins with "mo" contains data relevant to the state of Missouri, while any file that begins with "ks" contains Kansas data. A file whose name starts with "us" would be a national collection.

We could have made it very simple and just put all the data for a state into a single dataset. But that would have been far too wasteful of storage space and -- more importantly -- of the time required to access the data. So instead we have broken the data down into smaller subsets, and have tried to segregate some of the most frequently used data into their own datasets. If all you need to access, for example, are the data in the "identifiers" section (the "geos" file), then you can limit your access to the XXgeos dataset. For many applications you should be able to get by with accessing just the XXph dataset (moph for Missouri); this has the basic tables - P and H - for all geographic summary levels except the census block. If you need summaries for blocks, look in the XXblks dataset, which contains summaries for blocks and blocks alone. (Here "XX" stands for the state postal code - substitute your state's code here.)
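The naming convention can be sketched in a few lines of Python. This is only an illustration: the helper `dataset_name` is hypothetical, and the suffixes are the ones that appear in the Missouri file list in this section (which uses `phblks` for the block-level file).

```python
# Hypothetical helper illustrating the "postal code + suffix" naming rule.
# Suffixes are copied from the Missouri file list; nothing here is MCDC code.
SUFFIXES = {"geos", "ph", "phblks", "pct", "pct12r"}


def dataset_name(state: str, suffix: str) -> str:
    """Build a file name such as 'moph.sas7bdat' from a postal code and suffix."""
    state = state.lower()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"expected a 2-letter postal code, got {state!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix {suffix!r}")
    return f"{state}{suffix}.sas7bdat"


print(dataset_name("MO", "ph"))    # Missouri P and H tables
print(dataset_name("ks", "geos"))  # Kansas geographic identifiers
```

Passing "us" as the state would produce the name of a national collection under the same rule.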
For the state of Missouri (for example) we have the following data files:
SF1 Data Files For Typical Universe (State of Missouri)

- mogeos.sas7bdat: Contains geographic codes and other identifiers only. For all geographic entities, incl. blocks.
- moph.sas7bdat: Contains the P and H table cell values as well as the geos data; for all geographic entities except blocks.
- mophblks.sas7bdat: Contains the P and H table cell values as well as the geos data for just census blocks. The id variables geocode and AreaName are excluded from this set. Very large dataset.
- mopct.sas7bdat: Contains the PCT table cell values as well as the geos data for all geographic areas for which PCT tables are available (so nothing at the block or block group levels). This does NOT include the collection of race/hispanic qualified PCT12<r> tables, which are stored separately.
- mopct12r.sas7bdat: Contains the PCT12a, PCT12b, ..., PCT12o table cell values as well as the geos data for all geographic areas for which PCT tables are available (so nothing at the block or block group levels). These are very large, tedious-to-access tables, each with 209 cells.
- moageracesexhng.sas7bdat: An alternative way of storing the information contained in the mopct12r tables. We create this almost microdata-like view of the data with 1 obs per data cell. The "ng" stands for "no geocodes". There is only a single geo_id variable that links this data to the mogeos dataset. See detailed explanation, below.
- moageracesexh.sas7bvew: A SAS view that just merges the data in the above dataset with the mogeos, making it look as though we had all the geographic identifiers stored.
As you might be able to guess from the names, these are stored as SAS data files. SAS is the proprietary software package that we use to process and store the data. Like most database packages, SAS stores data in its own special format, which only it can directly create and read. Of course, with our uexplore/xtract software you can indirectly access it, once you know what you are looking for. So what are these seven SAS data files all about?
If you have read the SF1 Technical Documentation (which you really ought to at least skim before attempting to access the data here), you'll know that the data is organized into tables and that there are a number of different kinds of tables. There are "P" tables that contain basic Population data (age, race, sex, household type, etc.), and "H" tables that have data related to housing subjects (tenure, vacancy status, etc.). There are also "PCT" tables that are like P tables except that they are not available for geographic summaries below the census tract level, i.e. for census blocks and block groups. (This is a new idea from the Census Bureau for 2000; they never varied the tables within a summary file by geographic summary level in any prior census.) In Missouri, there are 279,300 records (observations) on the state's SF1 file, and of these 256,811, about 92%, are either census block or block group summaries.

The PCT tables are typically rather detailed. One of the tables, PCT12, provides 209 cells of information giving a summary of persons by sex and single years of age (103 age categories altogether). In addition to all this detail, SF1 also contains a series of tables named PCT12A, PCT12B, ... PCT12O (that's an "O" as in "Overkill", not a zero -- there are 15 of these) that have the same tabulations, but each for a different race or hispanic subgroup. In total, if you add up the cells in all the tables for one geographic area you have over 8,000 cells of data! Of these, about 5,000 data cells are in the less-frequently used PCT and PCT12R tables. (Those of you with any experience at all with databases may see now why storing all the data in a single, simple "rectangular" dataset would be so wasteful, since over half (5/8) of the data cells would be missing for over 90% of the rows!)
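The arithmetic behind these figures can be checked with a short Python sketch. The record counts and the approximate cell totals (8,000 and 5,000) are the ones quoted in this document; the 3 subtotal cells per PCT12 table come from the description of the restructured PCT12 data elsewhere on this page. The variable names are ours.

```python
# Cells in one PCT12-style table: 103 ages x 2 sexes, plus 3 subtotal cells.
cells_per_pct12 = 103 * 2 + 3

# The 15 race/hispanic variants PCT12A ... PCT12O.
pct12r_cells = 15 * cells_per_pct12

# Share of Missouri SF1 records that are block or block group summaries.
total_records = 279_300
block_or_bg_records = 256_811
share_below_tract = block_or_bg_records / total_records

# Rough fraction of a single rectangular table that would be empty:
# about 5,000 of about 8,000 cells are PCT cells, which do not exist
# for block and block group rows.
empty_fraction = share_below_tract * 5_000 / 8_000

print(f"cells per PCT12 table:       {cells_per_pct12}")
print(f"cells in the 15 PCT12<r>:    {pct12r_cells}")
print(f"block/block group share:     {share_below_tract:.1%}")
print(f"empty cells, one big table:  {empty_fraction:.1%}")
```

Under these approximate inputs, well over half of a single all-in-one table would consist of missing values, which is the motivation for splitting the data into separate datasets.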
In addition to the 8,000 or so table matrix data items on each SF1 summary record, SF1 also has a wide array (about 70) of geographic and various other "header" items: geographic codes, internal point coordinates, various area measurements, etc.
Restructuring the PCT12 Detailed Age by Sex by Race Tables

The moageracesexh data sets are a restructuring of the data contained in all the PCT12 tables. That includes table PCT12, which has 103 ages by 2 sexes plus 3 subtotals, or 209 cells, for the total population, and the collection of 15 PCT12<r> tables, where <r> is a letter from a to o indicating that the counts are for some race/hispanic subgroup, such as "Persons reporting White Only for Race" (PCT12a) or "Persons Reporting Multiple Races and Not Hispanic" (PCT12o). The 15 PCT12<r> tables have 209 cells each, so there are over 3000 variables in all.

In the alternative data set, the observations (rows, records) represent individual cells in a 5-dimensional table. The 5 dimensions are geographic area (the census tract within place and county subdivision, the smallest geographic area for which PCT tables are available; the exact number of such entities is large and varies by state), age (103), race (7), sex (2) and hispanic origin (2). A single numeric variable, Persons, contains the population count for that cell. Only non-zero cells are stored. A typical observation might report the number of persons in Boone County, tract 20, city of Centralia, township of Centralia (geographic dimension), who are (or were on April 1, 2000) exactly 5 years old, female, reporting white alone for race and indicating that they are not hispanic.

This is a small bit of information, to be sure, and not one that many people would be interested in per se. What many people should find of interest, however, is the relative ease with which you can use this data set as input to a tabulation procedure that can generate a report like the one we created at http://mcdc2.missouri.edu/webrepts/misc/mo/age_by_race_hispanic_for_the_state.html using less than 20 lines of SAS code (which can be viewed by following the link in the footnote of the report page).
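The idea of storing one observation per non-zero cell can be sketched in Python. This is a toy illustration, not the conversion code actually used: the input layout, the key order, and the coded values ("white", "N") are assumptions made for the example. Only the names geo_id and Persons are taken from this page.

```python
from typing import Dict, Iterator, Tuple

# Toy "wide" input: all cells for one geographic area, keyed by
# (age, sex, race, hispanic). The real files use coded variable names
# instead, so this layout is purely illustrative.
Cell = Tuple[int, str, str, str]


def to_long(geo_id: str, cells: Dict[Cell, int]) -> Iterator[dict]:
    """Yield one record per non-zero cell (1 obs per data cell)."""
    for (age, sex, race, hispanic), persons in cells.items():
        if persons == 0:
            continue  # only non-zero cells are stored
        yield {
            "geo_id": geo_id,
            "age": age,
            "sex": sex,
            "race": race,
            "hispanic": hispanic,
            "Persons": persons,
        }


wide = {
    (5, "F", "white", "N"): 12,
    (5, "M", "white", "N"): 0,   # dropped: zero cell
    (6, "F", "white", "N"): 9,
}

long_rows = list(to_long("area-001", wide))
for row in long_rows:
    print(row)
```

Once the data are in this shape, a report by age and race becomes a simple group-and-sum over the Persons variable, which is why the restructured set is so convenient for tabulation.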
An important fact to keep in mind when processing this data set is that the observations are of two basic types, and you should rarely use both types in a single query:
- Observations where the dimension variables race and hispanic are blank
- Observations where neither race nor hispanic are blank.
The first group should be used for looking at total population (by age, sex and geography), while the latter group should be used when you need detail by race and hispanic origin. If you select both kinds and do an aggregation you will probably get answers that are twice the actual values.
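The double-counting risk can be illustrated with a small Python sketch on made-up rows. The variable names race, hispanic and Persons follow this page; the coded values and counts are invented for the example.

```python
# Made-up rows for one geographic area and one age.
rows = [
    # Type 1: totals, with race and hispanic blank.
    {"age": 5, "sex": "F", "race": "", "hispanic": "", "Persons": 30},
    {"age": 5, "sex": "M", "race": "", "hispanic": "", "Persons": 28},
    # Type 2: detail, with both race and hispanic filled in.
    {"age": 5, "sex": "F", "race": "white", "hispanic": "N", "Persons": 20},
    {"age": 5, "sex": "F", "race": "black", "hispanic": "N", "Persons": 10},
    {"age": 5, "sex": "M", "race": "white", "hispanic": "N", "Persons": 28},
]


def is_total(row: dict) -> bool:
    """True for observations where race and hispanic are both blank."""
    return row["race"] == "" and row["hispanic"] == ""


def is_detail(row: dict) -> bool:
    """True for observations where neither race nor hispanic is blank."""
    return row["race"] != "" and row["hispanic"] != ""


from_totals = sum(r["Persons"] for r in rows if is_total(r))
from_detail = sum(r["Persons"] for r in rows if is_detail(r))
mixed = sum(r["Persons"] for r in rows)  # the mistake: both types together

print("from total rows: ", from_totals)
print("from detail rows:", from_detail)
print("mixing both:     ", mixed)
```

Either type on its own gives the same population total; summing both gives twice that figure, so a query should filter to one type before aggregating.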
As a user (or at least potential user) of uexplore/xtract, what does all this mean to you? Basically, that when you are looking for data you need to know what table or tables you are interested in. The data have been broken into 3 subsets based on grouping of the tables (ph, pct and pct12r). The geographic header information is stored in a separate dataset as well as in the various table-cell datasets. (In an earlier version, we tried to omit the geography data from the cell-based datasets and use something called SAS views to link them together, but this turned out to have some unanticipated problems and we have abandoned that strategy.)
The xxgeos Geographic Headers Collection

The special subdirectory xxgeos is used to hold geographic headers data only, i.e. there are no demographic tables here. All we did was to download the Census Bureau's complete collection of geographic header files (these .zip files are stored in the subdirectory, at least for now) and then convert these to SAS data sets. In doing the conversion, we created 2 data sets per state, one for census block headers only and the other with header info for all other geographic levels. We thus have 51 x 2, or 102, state-level SAS data sets in this directory.
We have also created a series of national-level data sets, where we have selected specific geographic entities from all states and combined them into "us"-level sets. Thus far we have created these national sets for counties (uscntys), 5-digit ZIP codes (uszctas) and places (usplaces). For the zctas and places we have included headers for both the complete area levels and the area-within-county summaries. Thus the uszctas set has levels 871 and 881, while usplaces has levels 155 and 160.
Warning About Size

The sf12000 data directory contains some of the largest files in our data archive. Specifically and especially, you need to watch out for the block level summary data sets. The mophblks.sas7bdat file, for example, contains the P and H tables for all 241,532 census blocks in Missouri (even the ones that have no population and no housing units). This file is about 374 megabytes, and that is using the SAS compress option, which is the only reason it is smaller than a gigabyte. A SAS data step (which includes a Uexplore/xtract access) may take over a minute of real time. You can make it go faster if you use a where filter to specify one or more counties (since this dataset is indexed by county). All of which means you have to be careful and sometimes patient when accessing these files. Because of the tremendous size, you never want to run an extract on one of these without a filter. The result will be far too big to handle, and you will hit one of the filesize limits built into the xtract program.
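The "filter first" advice can be sketched in Python. This is not how uexplore/xtract works internally; it is only a minimal illustration, with a hypothetical `filter_by_county` helper, assumed field names (`county`, `block`, `pop`) and toy county codes, of refusing to scan block-level rows without a county filter.

```python
from typing import Iterable, Iterator, Set


def filter_by_county(rows: Iterable[dict], counties: Set[str]) -> Iterator[dict]:
    """Lazily keep only rows whose county code is in `counties`.

    Raises ValueError if no county is given, mirroring the advice never
    to run a block-level extract without a filter.
    """
    if not counties:
        raise ValueError("refusing to scan block-level data without a county filter")
    return (row for row in rows if row["county"] in counties)


# Toy block-level rows; the field names and codes are assumptions.
blocks = [
    {"county": "019", "block": "1000", "pop": 42},
    {"county": "019", "block": "1001", "pop": 0},
    {"county": "510", "block": "2000", "pop": 17},
]

selected = list(filter_by_county(blocks, {"019"}))
print(len(selected), "of", len(blocks), "toy block rows kept")
```

Because the filter is applied lazily, rows for other counties are never collected into the result, which is the same reason a where filter on an indexed county variable keeps an extract small.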
Geographic Coverage

The Census Bureau publishes SF1 data for the entire United States and Puerto Rico. You can access that data via the American FactFinder web site at http://factfinder.census.gov/java_prod/dads.ui.homePage.HomePage. The Missouri Census Data Center will be making its version of the data available here for the states of Missouri, Illinois, Kansas and Delaware, as well as a national collection of higher level geographic summaries. Data for other states is possible, but not probable. Organizations that have access to the SAS(r) software package and would be interested in creating their own datasets comparable to what we have here can access the SAS code we used in the Tools subdirectory of this directory. Specifically, they should study the code in the SAS source file http://mcdc2.missouri.edu/data/sf12000/Tools/cnvtsf1.sas.
Geographic Summary Levels

Anyone who does any work at all with census summary files knows that the most important section of the 600-page Technical Documentation is something called the Summary Level Sequence Chart. This is Chapter 4 in the manual. It is 6 pages long, with 2 pages for each of 3 versions of SF1: (A) State Summary, (B) Advanced National and (C) Final National. Most of the data you will see on this site, for now at least, will be from the (A) version of the file.
The Summary Level Sequence Chart (SLSC) is just a way of displaying all the different geographic levels for which data is aggregated on a summary file. As you can see if you look at the chart for SF1, there are a lot of levels available. However, most users tend to be interested in just a few. Here is a list of the levels that we have found are most frequently used, along with their summary level codes:
SUMLEV Code   Summary Level
040           State (see also GEOCOMP section)
050           County (or county equivalent)
060           County subdivision (MCD, township, CCD)
140           Census Tract
150           Block Group
101           Census Block
155           Place (within county)
160           Place (complete)
390           Metropolitan Statistical Area (MSA) or CMSA within state
395           Primary MSA within state
500           Congressional District (106th)
851           3-digit ZIP Code Tabulation Area
871           5-digit ZIP Code Tabulation Area
881           5-digit ZIP Code Tabulation Area within County

Hierarchical Census Geography
070           Place within MCD
080           Tract within Place and MCD
091           Block Group within Place and MCD
158           Tract within place
Note that, by definition, county subdivisions (MCDs) nest within counties, as do census tracts. Block groups nest within census tracts and are composed of census blocks. Census blocks are atomic units, meaning they nest within all other geographies.
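Selecting records by summary level can be sketched in Python as a simple lookup and filter. The codes and labels below are copied from the list above; the `select_level` helper, the row layout and the sample rows are hypothetical. Codes are kept as strings so that leading zeros such as "040" are preserved.

```python
# A subset of the summary level codes listed above.
SUMLEV = {
    "040": "State",
    "050": "County (or county equivalent)",
    "060": "County subdivision (MCD, township, CCD)",
    "140": "Census Tract",
    "150": "Block Group",
    "101": "Census Block",
    "155": "Place (within county)",
    "160": "Place (complete)",
    "871": "5-digit ZIP Code Tabulation Area",
}


def select_level(rows: list, code: str) -> list:
    """Return the rows for one summary level; reject codes not in the table."""
    if code not in SUMLEV:
        raise KeyError(f"unknown summary level code {code!r}")
    return [row for row in rows if row["sumlev"] == code]


# Toy header rows; the field names and values are assumptions.
rows = [
    {"sumlev": "040", "AreaName": "Missouri"},
    {"sumlev": "050", "AreaName": "Boone County"},
    {"sumlev": "050", "AreaName": "Adair County"},
    {"sumlev": "160", "AreaName": "Centralia city"},
]

counties = select_level(rows, "050")
print(SUMLEV["050"], "->", [r["AreaName"] for r in counties])
```

Filtering on a single summary level like this is usually the first step in any extract, since a dataset mixes rows from many levels and summing across levels would count the same people more than once.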
Custom Aggregations to Other Geographic Units

You might think that all the summary levels described in the previous section would be enough. But you'd be wrong. Users are interested in many other layers of geography for which the Bureau has not provided any summaries. The Missouri Census Data Center specializes in doing data allocations/aggregation to create such custom summarizations. When we do this we create a pair of data sets for each geographic universe-unit pair, one with the ph tables and the other with the pct tables. We do not create a pct12r data set, and we do not split the geographic variables into a separate set and use views to combine them with the data.
To date, we have created the following custom aggregations:
- Missouri School Districts: moschls[ph/pct] are summaries for complete districts, while moschlcos[ph/pct] are summaries for school districts split by county. SUMLEV values on these data sets are "sdu" for unified districts and "sde" for elementary. Both kinds are included.
- Missouri State Legislative Districts: mosenate[ph/pct] are summaries for MO state senate districts, while mohouse[ph/pct] are summaries for the state House districts. SUMLEV values on these data sets are the ones established by the Bureau. These are for the districts as they were defined when the SF1 files were generated, i.e. as of 2000. We have also been able to generate similar summaries for the legislative districts as they have been redefined by the 2001 Missouri redistricting effort, creating new political geography that is effective starting with the elections of 2002. These datasets are named with "02" suffixes in their dataset names to distinguish them from the older geographic versions.
Codebook - Description of the Variables/Columns in Each Data File

Our favorite tools for seeing what variables contain what information are the SAS source code modules that provide labels for the variables. View these in the Tools subdirectory.
This Page Last Modified: 3/14/2007 8:52AM jgb