Summary File 3 — 2000 Census: MCDC Standard Extract


A complete set of summary tables from SF3 for a geographic area at or above the census tract level comprises over 16,000 data cells. While it is good to have all this detail available when needed, it is far too much information for the casual browser. The standard extract datasets in this directory are an attempt to distill this ocean of data into a more comprehensible collection of just over 200 data items (not counting derived percentages). The full SF3 datasets consist mostly of multi-dimensional matrices of data, e.g. Age of Householder (7 categories) by Household Income (16 categories). Many of the tables are even repeated for a set of 9 different race/Hispanic origin categories. This extract, by contrast, contains no detailed cross-tabulations and no detailed racial data other than the basic counts of persons. Many of the items here are means, medians and percentages, which are suitable for comparing areas even when the areas are of very different sizes.

If you are interested in median family income by the age and/or race of the householder, this is not the place to look. Here, all you will find is the simple median family income. You will also see race and age data, but not cross-classified. We believe that a relatively small, carefully chosen subset of the available data can be used to answer a large percentage of users' questions. So we have spent considerable time considering which variables to include in these extracts. It is like creating a Greatest Hits collection; no one is going to agree with all your choices. We had direct input from a number of people regarding what to include here. If we had included every item that at least one person thought should be included, we would easily have had over a thousand variables. We wanted to keep our count under 250. At last count (we reserve the right to tweak these files by adding new variables from time to time -- we'll never drop one, however) we had 217 independent variables and another 188 derived percentages.

Just because a variable is not included on this dataset does not mean it is not an important piece of information. Census data are used by so many people for so many different kinds of applications that there is no way you can create an extract that serves everyone's needs for all requests. That is not the intention. We still have the more detailed table files to go to, and we have created tools for providing links directly from this extracted data to the more detailed "parent" tables. (Look for these links on the corresponding profile reports, described below.)

Related Profile Reports ("dp3_2k" and "dp3_2kt")

The primary intended use of this collection is to generate profile reports based upon the data. These profiles should provide a basic overview of the area. What kind of people live there? What are their ages, their racial breakdown, their income levels and poverty status, and their propensity to own versus rent? How old is the housing, how long have people lived there, how many have PhDs and how many never made it through high school? Access to these reports is available using the MCDC's menu-driven dp3_2k web application.

Another series of profile reports combines data from these 2000 demographic profile datasets with comparable data from the 1990 census. These reports are available through the MCDC's menu-driven dp3_2kt web application, one of the most popular on the MCDC web site.

Sample vs. Complete Count

Users of any census data should be aware of the difference between data based on the short form questionnaire (called "complete count" data, because it is based on data received from everybody) and data based on the long form questionnaire, which was sent only to a sample of households and persons (called "sample" data). The 7 questions on the short form were also asked on the long form. So we have counts of persons by age, race and sex (for example) that are based on the complete count (both short- and long-form results are used) and on the sample count (only long-form responses are used). The complete count data are tabulated on Summary File 1. The long form data are tabulated on Summary File 3. SF1 data are more accurate, especially for small geographic areas, because they contain no sampling error. But SF3 data are more interesting, because they are based on many more questions regarding things such as income, education, housing values, etc.

The variables in this extract that are included in Tables 2-5 are also available in our corresponding SF1 standard extract datasets. (See the Variables.pdf data dictionary file to see how the variables are assigned to "Tables"; these correspond to where they are used in the profile reports.) There is a variable there named Age20_24, just as there is one on this dataset with that name; both are attempts at reporting the number of people in their early 20s residing in an area. The value of this variable on this dataset (sf32000x) for Washington, MO is 721, while the corresponding value on the sf12000x dataset is 765. The difference is what statisticians call "sampling error". Here is the first part of Table 1 for Washington, from our dp3_2k standard profile report:

1. Population Basics - Universe: Total Population
Subject | Number | Percent | SF3 Table
Total Persons (Sample Est) | 13,092 | | P1
Unweighted Sample Count of Persons | 1,598 | | P2
Total Persons (100% Count) | 13,243 | | P3
Pct Persons Sampled | 12.1 | | P4

The "Total Persons (Sample Est)" field is the population of the city based on the sample estimate. The actual enumerated population was 13,243, and 1,598 of these people (the unweighted sample count) got the long form questionnaire, which was 12.1 percent of the total. All the SF3 "sample" data are based on the responses of these 1,598 people. The estimate is off by 151 people, or about 1.1%. The sampling error tends to be higher for smaller geographic areas. It is not a serious problem for Washington, but it could be for a city of fewer than 1,000 people. The Bureau actually oversamples in places of under 2,500, but sampling error can still be a problem there.
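The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that re-derives the percentages from the three counts quoted in the table; the variable names are ours, not names from the dataset.

```python
# Counts quoted from Table 1 of the Washington, MO profile report.
sample_estimate = 13_092   # Total Persons (Sample Est), SF3 table P1
sample_count = 1_598       # Unweighted Sample Count of Persons, P2
full_count = 13_243        # Total Persons (100% Count), P3

# Share of the enumerated population that received the long form.
pct_sampled = 100 * sample_count / full_count

# Gap between the complete count and the sample estimate.
difference = full_count - sample_estimate
pct_difference = 100 * difference / full_count

print(f"Pct persons sampled: {pct_sampled:.1f}")
print(f"Estimate is off by {difference} people ({pct_difference:.1f}%)")
```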

Each person who fills out the long form is assigned a "sampling weight" value. The average sampling weight for persons in Washington was about 8.25 (the 12.1% sampling rate is about 1 in 8.25 people). That SF3 count of persons aged 20-24 was derived by counting each long-form respondent who reported an age in that interval not as one person but as 8.25 persons (on average -- the sampling weights actually vary from person to person; it is a very complex sampling scheme). The bottom line here is that the figures on SF1 are somewhat more accurate than those on SF3 (complete count data are "better" than sample-based data).

The problem with using the SF1 data is one of consistency when trying to analyze an area. It can be pretty confusing to users when two tables describing the total population, or even some subset of it, do not add up to the same totals. For example, when we report the Marital Status data in Table 6 (sample data not available on SF1) we use as our universe persons aged 15 and over. This is the sample estimate of such persons, consistent with the five marital status counts that follow it. These differences tend to disappear for higher level summaries (states, larger cities and counties) but can be very significant for smaller geographic areas.

The simple fact of the matter is that you really have to take sample census data for areas of fewer than a few thousand people with a grain of salt. The sampling errors for such areas can be significant. Sample data for areas with fewer than 100 people are basically worthless as far as knowing the characteristics of those areas. They do have considerable value as building blocks to aggregate to larger entities such as school districts or 10-mile radii of proposed nuclear reactor sites. The sampling error you get when you aggregate 100 areas of 100 persons each is equivalent to the error for a place with 10,000 people, i.e. not bad in most cases. But watch out for tables with small universes.
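The idea of a sampling weight described earlier in this section can be illustrated with a small sketch. The respondent records and weights below are invented for illustration only; the real weights come from the Bureau's sampling scheme and are not reproduced here.

```python
# Hypothetical long-form respondents as (age, sampling weight) pairs.
# These values are made up; real weights vary from person to person.
respondents = [
    (22, 8.0),
    (23, 9.0),
    (41, 8.5),
    (20, 7.5),
    (67, 8.0),
]


def weighted_count(records, low, high):
    """Sum the sampling weights of respondents whose age is in [low, high]."""
    return sum(weight for age, weight in records if low <= age <= high)


# Each sampled person counts as "weight" persons, not as one person.
age20_24_estimate = weighted_count(respondents, 20, 24)
all_ages_estimate = weighted_count(respondents, 0, 200)
print(age20_24_estimate, all_ages_estimate)
```

Three respondents fall in the 20-24 interval, but the estimate is the sum of their weights rather than the number 3.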

Users might want to view the Census Bureau's note regarding this matter as it relates to 2000 census data.

Uexplore Access

We expect that many or most readers will arrive at this web page via a link on the MCDC's uexplore web application page. But in case you did not, note that the data (and some related metadata, including this page) are accessed via uexplore. A lot of the information provided on this Readme page assumes that you are familiar with the uexplore application and that you are interested in using it to extract data from the MCDC's sf32000x data collection. If you are new to uexplore and Dexter, you may want to at least look at the uexplore overview page before continuing here. (Note: Dexter is the actual extraction program; uexplore is the navigation program that lets you select datasets from which to extract.)

By far the easiest way to access the various datasets within the directory is to click on the Datasets.html page within the sf32000x directory. Here you will find a more logical ordering of the datasets along with much more detailed descriptions and metadata references.

To access datasets with complete SF3 table files you need to access the sf32000 filetype, which means using the same URL as for this sf32000x filetype and just dropping the final x. You may want to access the MCDC's SF3 home page (which is also the Readme file for the sf32000 directory.) That file will provide more general background about the Summary File 3 2000 data, with access to complete Technical Documentation. We understand that for many -- perhaps the great majority -- of users, all that information will be a lot more than they may have the time or interest to digest. This standard extract has been created mostly for those people.

The Observations (Rows)

The datasets in the sf32000x directory, like most datasets in the MCDC data archive, consist of observations (aka records or rows) that summarize geographic areas. The complete names of these datasets have 2 parts, like files in your Windows directories. The extension portion for files that contain extractable data will be sas7bdat or sas7bvew. These are SAS data files, and the underlying Dexter program is written in SAS, but rest assured that you do not have to know anything about SAS to use the application.

The first observation in the moi dataset, for example, contains a summary of the long-form census data collected for the entire state of Missouri. There are 115 observations on this dataset containing summaries for the counties in Missouri, another 1,040 summarizing ZIP codes, etc. Altogether, 12,631 geographic entities are summarized on the moi dataset. The SumLevs.html report page in the sf32000x directory provides a codebook with information that will be vital to your being able to access these datasets. You just don't know it yet. Here is a small excerpt from the full table displayed on that page:

SumLev code | SumLev Meaning | Inventory or Hierarchal | Frequency Count
040 | State | i | 94
050 | County | i | 460
060 | County Subdivision | i | 1,379
070 | County Subdivision-Place/Remainder | h | 2,681

The key to selecting observations from these datasets (and most other datasets based on census data) is the SumLev variable. This 3-digit code is provided by the Census Bureau on all their summary files to allow users to distinguish the type of geographic entity being summarized. Per the report we see that the code indicating a state level summary is 040 (leading zeroes matter here), while the code indicating a county summary is 050. The "Inventory or Hierarchal" column indicates how the specified level is classified. The levels indicated as being "inventory" (basically these are complete areas, while hierarchal areas are created by intersecting inventory areas) will be found on datasets such as "moi" and "ili", etc. Those indicated as hierarchal will be found on datasets such as "moh" and "ilh". Inventory summaries are by far the most commonly used, but hierarchal summaries are more numerous and take up a lot more space. That is why we like to segregate them: it makes using the inventory data go a lot faster.
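As a rough illustration of why the leading zeroes matter, here is a minimal Python sketch of selecting rows by SumLev once data have been extracted. The rows and the AreaName column are hypothetical stand-ins; only the SumLev codes themselves come from the excerpt above. Dexter does this filtering for you through its own menus, so this is not how the application itself works.

```python
# Hypothetical extracted rows; AreaName is an assumed column name.
rows = [
    {"SumLev": "040", "AreaName": "Missouri"},
    {"SumLev": "050", "AreaName": "Example County A"},
    {"SumLev": "050", "AreaName": "Example County B"},
    {"SumLev": "060", "AreaName": "Example Township"},
]


def select_level(records, sumlev):
    """Return the records whose SumLev equals the given 3-character code."""
    # Keep the code as text: the number 50 or the string "50" is not "050".
    if not (isinstance(sumlev, str) and len(sumlev) == 3 and sumlev.isdigit()):
        raise ValueError("SumLev must be a 3-character digit string such as '050'")
    return [r for r in records if r["SumLev"] == sumlev]


counties = select_level(rows, "050")
print([r["AreaName"] for r in counties])
```

Treating the code as a string rather than a number is the design choice that keeps "050" from silently turning into 50.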

The Frequency Count shown in the report is based on how many times each level occurs on the original sf3 data files for Missouri. There are 94 state level summaries because the Bureau has something called a geographic component summary. On sf3, you get a summary not only for the whole state but also for "geographic components" such as "Urban", "Rural", "Urban in Rural Cluster", "Rural Farm", etc., to name just a few of the more interesting ones. Most users hate "geocomps" because they just cause confusion. For that reason, we have omitted them from the standard extract datasets, at least for now. We await a groundswell of user interest to see if we need to make them available. But for now, the moi.sas7bdat dataset on sf32000x will have only a single 040 summary observation. (A complete list of geocomp codes is available if you are interested. The really important one is '00'.)

The Variables (Columns)

See the definitive documentation in the Variables.pdf file of the sf32000x data directory (or Variables.html, if you have a problem accessing pdf files). You can even use uexplore to access a SAS dataset, variables, where the metadata shown in the Variables report documents are stored. You probably won't want to, but you can.

Notice that the variables are organized into a series of 29 subgroups called tables. These are the same tables as labeled in the dp3_2k profile reports. The table numbers and titles are shown as subheaders in the Variables report, and the report is sorted in table number order. Here is a sample of the report - describing the variables comprising the Educational Attainment table:

Table=13. Educational Attainment Universe=Persons Over 25
Variable Name | Label | Definition - Code used to derive | Comment | Universe Variable | Weight Variable
Over25 | Over 25 Yrs of Age | p37i1 | | TotPop |
LessThan9th | Less Than 9th Grade | sum(of P37i3-P37i6 P37i20-P37i23) | | Over25 |
SomeHighSchool | 9th thru 12th grade, No Diploma | sum(of P37i7-P37i10 P37i24-P37i27) | | Over25 |
HighSchool | High School Grad or GED | P37i11 + P37i28 | These are people with nothing beyond High School | Over25 |
NoCollege | Did Not Attend College | LessThan9th + SomeHighSchool + HighSchool | | Over25 |
SomeCollege | Some College, no degree | sum(of P37i12-P37i14 P37i29-P37i31) | This is not the complement of NoCollege. It is people with some college but no degree except maybe an associates | Over25 |
Bachelors | Bachelors | P37i15 + P37i32 | People with a Bachelors and no more | Over25 |
Masters | Masters | P37i16 + P37i33 | People with a Masters and no more. NA on STF3 in 1990. | Over25 |
ProfPHD | Prof School Degree or PhD | P37i17 + P37i18 + P37i34 + P37i35 | NA on STF3 in 1990 | Over25 |
GradProf | Graduate or Professional Degree | Masters + ProfPHD | Added together for compatibility with 1990 STF3 | Over25 |

The value displayed in bold in the Variable Name column is the name on the dataset. These are the names you will see displayed on the drop-down variables menu list when running the Dexter program. To run an extract that includes the educational attainment table you would just select these 10 variables. Note that the first variable in the table, Over25, is a special table-universe variable. It is not really an education item per se, but it is included because of its importance as a denominator used in calculating Pct variables.

What are those? A footnote at the bottom of the report gives a brief explanation, but we need to make it clearer. Most variables on this dataset represent counts of things with a certain property. For example, the variable LessThan9th is the count of persons over 25 with less than a 9th grade education. We also generate a variable, PctLessThan9th, containing the percentage this is of the universe. The "Universe Variable" column tells you what variable, if any, we use as the denominator to create the corresponding Pct variable. Notice that the Over25 row has an entry of "TotPop" in the Universe Variable column. This tells us that the dataset has a variable named PctOver25 and that the value of this variable is the Over 25 population as a percentage of the total population, i.e. PctOver25=100*Over25/TotPop. Note that these percentage variables are the source of the values that appear in the Percent column of the Demographic Profile 3 (dp3_2k) reports. (See sample).
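The Pct rule just described can be sketched in Python as follows. The counts are invented for illustration, and returning None for an empty universe is our assumption; the text does not say how the extract handles a zero denominator.

```python
def pct(count, universe):
    """Return count as a percentage of its universe variable.

    Returning None when the universe is zero or missing is an assumption
    made for this sketch, not documented behavior of the extract.
    """
    if not universe:
        return None
    return 100 * count / universe


# Hypothetical values for one geographic area.
area = {"TotPop": 10_000, "Over25": 6_500, "LessThan9th": 325}

# PctOver25 uses TotPop as its universe; PctLessThan9th uses Over25.
pct_over25 = pct(area["Over25"], area["TotPop"])
pct_less_than_9th = pct(area["LessThan9th"], area["Over25"])
print(pct_over25, pct_less_than_9th)
```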

The Label column contains a more extended description of the variable and corresponds to the Label item stored on the SAS dataset. It will appear in the second row of Dexter-generated CSV files and as the column label on html output.
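If you read a Dexter-generated CSV file with your own program, the label row needs to be separated from the data rows. The following Python sketch assumes a layout with variable names in the first row and labels in the second, as described above; the CSV text itself, including the label shown for SumLev, is a made-up example rather than real Dexter output.

```python
import csv
import io

# Made-up CSV text: row 1 = variable names, row 2 = labels, then data rows.
csv_text = (
    "SumLev,Over25,HighSchool\n"
    "Summary level,Over 25 Yrs of Age,High School Grad or GED\n"
    "040,1000,300\n"
    "050,200,80\n"
)

reader = csv.reader(io.StringIO(csv_text))
names = next(reader)    # first row: variable names
labels = next(reader)   # second row: variable labels
label_for = dict(zip(names, labels))

# Remaining rows are data; csv keeps every value as text, so "040" stays "040".
records = [dict(zip(names, row)) for row in reader]
print(label_for["HighSchool"], records[0]["SumLev"])
```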

The Definition column is for people who want to know precisely how we derived the variable. It is a SAS numeric expression that was used on the right side of a SAS assignment statement to define the variable. For example, the SAS program that creates these extract datasets (by accessing the full-table sf32000-filetype datasets) contains the statement: HighSchool=P37i11 + P37i28; You can verify that the formula is correct by browsing the Plabels.txt file in the Varlabs subdirectory of the sf32000 data directory. That file contains the following text:

/* 25 YEARS AND OVER [35] */
/* Universe: Population 25 years and over */

 P37i1='Total:'     /* P037001 */
 P37i2=' Male:'     /* P037002 */
 P37i3=' No schooling completed'   /* P037003 */
 P37i4=' Nursery to 4th grade'   /* P037004 */
 P37i5=' 5th and 6th grade'    /* P037005 */
 P37i6=' 7th and 8th grade'    /* P037006 */
 P37i7=' 9th grade'    /* P037007 */
 P37i8=' 10th grade'    /* P037008 */
 P37i9=' 11th grade'    /* P037009 */
 P37i10=' 12th grade, no diploma'   /* P037010 */
 P37i11=' High school graduate (includes equivalency)'  /* P037011 */
 P37i12=' Some college, less than 1 year'  /* P037012 */
 P37i13=' Some college, 1 or more years, no degree'  /* P037013 */
 P37i14=' Associate degree'    /* P037014 */
 P37i15=' Bachelor''s degree'    /* P037015 */
 P37i16=' Master''s degree'    /* P037016 */
 P37i17=' Professional school degree'   /* P037017 */
 P37i18=' Doctorate degree'    /* P037018 */
 P37i19=' Female:'     /* P037019 */
 P37i20=' No schooling completed'   /* P037020 */
 P37i21=' Nursery to 4th grade'   /* P037021 */
 P37i22=' 5th and 6th grade'    /* P037022 */
 P37i23=' 7th and 8th grade'    /* P037023 */
 P37i24=' 9th grade'    /* P037024 */
 P37i25=' 10th grade'    /* P037025 */
 P37i26=' 11th grade'    /* P037026 */
 P37i27=' 12th grade, no diploma'   /* P037027 */
 P37i28=' High school graduate (includes equivalency)'  /* P037028 */
 P37i29=' Some college, less than 1 year'  /* P037029 */
 P37i30=' Some college, 1 or more years, no degree'  /* P037030 */
 P37i31=' Associate degree'    /* P037031 */
 P37i32=' Bachelor''s degree'    /* P037032 */
 P37i33=' Master''s degree'    /* P037033 */
 P37i34=' Professional school degree'   /* P037034 */
 P37i35=' Doctorate degree'    /* P037035 */
You can see from this that P37i11 is the male high school graduates and P37i28 is the female high school grads. So the sum is the total high school graduates, as advertised. Users are encouraged to examine these definitions carefully and to report any errors to the author. In some cases, the definition may not be in error exactly, but it may be less than perfectly clear from the name and label what it really represents. This is where the Definition column can be very helpful.
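For readers who do not use SAS, the Definition expressions can be read as follows. This Python sketch mimics the SAS range notation sum(of P37i3-P37i6 ...) with a small helper; the helper name and the cell values are ours. Each cell simply holds its own item number so the sums are easy to check by hand, so these are not real counts.

```python
# Placeholder cells for table P37: cell P37iN holds the value N.
p37 = {f"P37i{i}": i for i in range(1, 36)}


def sum_range(cells, prefix, first, last):
    """Sum cells prefix<first> through prefix<last>, inclusive."""
    return sum(cells[f"{prefix}{i}"] for i in range(first, last + 1))


# Male cells plus the matching female cells, as in the Definition column.
less_than_9th = sum_range(p37, "P37i", 3, 6) + sum_range(p37, "P37i", 20, 23)
some_high_school = sum_range(p37, "P37i", 7, 10) + sum_range(p37, "P37i", 24, 27)
high_school = p37["P37i11"] + p37["P37i28"]

# NoCollege is built from the three derived variables above.
no_college = less_than_9th + some_high_school + high_school
print(less_than_9th, some_high_school, high_school, no_college)
```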

The Comment column is pretty obvious. Where we felt there was some need to clarify something about the variable, we added it here. Thus the comment for SomeCollege warns users that it does not cover all people who have at least some college experience, but rather only those with some college experience but no degree. There is a lot of this "fine print" that has to be understood when dealing with census data.

The Universe Variable column has already been discussed above. If it is blank then there will not be a corresponding Pct variable. If a value appears, it means it was used as the denominator in generating a Pct variable.

The Weight Variable column is blank for all entries in our sample Educational Attainment table. It will only have a value for variables that can be aggregated by taking a weighted average. If, for example, the Bureau had reported (which they did not) total years of school completed then this table might have had an entry labeled "Average Years of School". The weight variable for such an item would have been the universe variable Over25. An example of this that actually exists on the dataset occurs in Table 21. The PCI variable (Per Capita Income) has TotPop listed as the weight variable. This means when you aggregate a dataset containing the PCI variable you need to take a weighted average of the variable, using the total population count as the weight.
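A weighted average of this kind can be sketched in Python as shown here. The two areas and their values are hypothetical; only the variable names PCI and TotPop come from the dataset.

```python
# Two hypothetical areas to be combined into one larger area.
areas = [
    {"TotPop": 1_000, "PCI": 20_000},
    {"TotPop": 3_000, "PCI": 30_000},
]


def weighted_mean(records, value, weight):
    """Average records[value], weighting each record by records[weight]."""
    total_weight = sum(r[weight] for r in records)
    if total_weight == 0:
        return None  # no population, so no meaningful average
    return sum(r[value] * r[weight] for r in records) / total_weight


combined_pci = weighted_mean(areas, "PCI", "TotPop")
simple_mean = sum(r["PCI"] for r in areas) / len(areas)
print(combined_pci, simple_mean)
```

The simple mean treats both areas equally and gives a different answer, which is why the weight variable matters when aggregating.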

SF32000: The Source

In case you missed it, all the data in these sf32000x extracts are direct derivatives of the complete tabular data stored in the parent filetype, sf32000. Additional information regarding SF3 is available in the SF3 Readme file, which doubles as the MCDC's "home page" for Summary File 3. You can, of course, also use uexplore to extract data from the sf32000 data directory.