Ten Things to Know about the OSEDA/MCDC Data Archive

The OSEDA/Missouri Census Data Center Archive is a collection of data files created over a period of more than 30 years. Most of the work has been done by programmers working at OSEDA (Office of Social and Economic Data Analysis, part of the University of Missouri) under contract with the Missouri Census Data Center (part of the Missouri State Library within the office of the Missouri Secretary of State). This informal document is meant to provide some insight for those who want to know what the archive contains and how they might be able to access it.

1. It's a big collection of data.

At latest count we had about 450 gigabytes of data in the archive. That includes more than 100,000 database files (SAS datasets). We have tried to use directories and file naming conventions to make it easy (or at least easier) for us — both the programmers creating it and the users using it — to find things.

2. The data are organized into about 100 categories (filetypes).

The names associated with these filetypes are limited to eight characters. This is because these filetypes correspond to entities called data libraries in SAS, and the names of such libraries are limited to eight characters. While somewhat cryptic, the names are not as totally unfathomable as they might first appear.

For example, one filetype is called sf12010x. The archive directory page names this "Summary file 1, standard extract". Summary file 1 contains detailed tables based on data collected via the short-form census questionnaire. So, this sf12010x is a collection of data that is a "standard extract" based on summary file 1 in the 2010 decennial census. Once you figure out what that means, then you'll feel comfortable when you see a filetype of sf32000x, which is a standard extract based on the summary file 3 data product from the 2000 decennial census.
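The naming pattern described above can be sketched in a few lines of Python. This is only an illustration: the regular expression and the describe_filetype function are hypothetical, and they cover just the "sf" + file number + year + optional "x" names discussed here, not the other filetypes in the archive.

```python
import re

# Hypothetical pattern covering only the "sfNYYYY[x]" names discussed above;
# other filetypes in the archive follow other patterns.
_SF_PATTERN = re.compile(r"^sf(\d)(\d{4})(x?)$")


def describe_filetype(name):
    """Return a short description of an sfNYYYY[x] filetype, or None."""
    match = _SF_PATTERN.match(name.lower())
    if match is None:
        return None
    file_number, year, extract_flag = match.groups()
    kind = "standard extract" if extract_flag else "complete tables"
    return f"Summary file {file_number}, {year} decennial census, {kind}"


for name in ("sf12010x", "sf32000x", "sf12010"):
    print(name, "->", describe_filetype(name))
```

Names that do not fit the pattern simply return None, so the sketch makes no claim about how other filetypes are named.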

3. Most of the datasets in the archive contain geographic area summaries.

The main datasets are from the Census Bureau and contain data about people, households, and housing units. However, the rows (observations) that make up these datasets do not directly describe any of these three entities; they describe geographic areas (states, counties, cities, etc.). Each row consists of columns (variables) that identify or summarize the geographic area. The variables that help identify the entity are referred to as identifiers. These fields have no numeric significance and are stored in the datasets as character strings.

The true numeric variables contain summary characteristics. This means that if you look at data about household income, you should not expect to find a variable called HouseholdIncome. It would not make sense: there is no HouseholdIncome variable for the state of Missouri. What you can get for the state is a median household income, a mean household income, or perhaps a count of households with income over a certain dollar amount; that is, statistical summaries. There are some important exceptions to this general rule. For example, the various public use microdata sample ("PUMS") filetypes (acspums, pums2000, etc.) contain datasets that describe individual persons or housing units.
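To make the distinction concrete, here is a minimal Python sketch of how household-level values collapse into a single row of area-level summaries. The incomes are made up, and the variable names (Households, MedianHHInc, MeanHHInc, HHIncOverThreshold) are hypothetical, not names taken from the archive.

```python
from statistics import mean, median

# Made-up household incomes for one hypothetical county (illustration only).
household_incomes = [18_000, 32_500, 47_000, 51_200, 76_000, 120_000]


def summarize_incomes(incomes, threshold):
    """Collapse household-level incomes into one row of area-level summaries."""
    return {
        "Households": len(incomes),
        "MedianHHInc": median(incomes),
        "MeanHHInc": round(mean(incomes)),
        "HHIncOverThreshold": sum(1 for x in incomes if x > threshold),
    }


row = summarize_incomes(household_incomes, threshold=75_000)
print(row)
```

The summary datasets hold rows like the one produced here; only the microdata (PUMS) filetypes hold something like the input list.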

4. Become comfortable with the SumLev and State variables.

This data archive is an informal collection of datasets that have some key naming conventions and coding consistencies that make it much easier to understand and navigate. There are naming conventions for datasets and for the columns/variables that comprise the rows/observations within the datasets.

The most important set of conventions pertains to the variables that identify the geographic entities being summarized in a dataset. By far the two most important are SumLev, a three-character code that identifies the kind of geographic entity being summarized on an observation, and State, which contains the two-character FIPS state code for that entity.

The identifier variable SumLev (geographic summary level) will always contain a three-character code. (See the Sumlev.sas format in our SAS formats library to see a master list of these codes.) There are many places where you can see the state codes, including our state.sas format library entry.

Examples: Once you know that 050 is the summary level code indicating a county-level summary and 140 the code for census tract summaries, you can pull just the county or tract rows from (e.g.) any of the 51 [SS]selectedinv.sas7bdat datasets (where [SS] is the state postal abbreviation) in the sf12010x data directory. Knowing that 06 is the FIPS code for the state of California means that all you ever need to do to get the California subset of any nationwide dataset is to code the filter State = 06.
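For readers working with an extract outside of Dexter, the same filtering idea can be sketched in Python. The rows below are made up for illustration, and select_rows is a hypothetical helper. The point to notice is that SumLev and State are character strings, so the codes are compared as "050" and "06", leading zeros included.

```python
# Made-up rows for illustration; real datasets carry many more variables.
rows = [
    {"SumLev": "040", "State": "06", "AreaName": "California", "TotPop": 100},
    {"SumLev": "050", "State": "06", "AreaName": "Alameda County", "TotPop": 40},
    {"SumLev": "140", "State": "06", "AreaName": "Census Tract 4001", "TotPop": 3},
    {"SumLev": "050", "State": "29", "AreaName": "Boone County", "TotPop": 20},
]


def select_rows(rows, state=None, sumlev=None):
    """Keep rows matching the given State and/or SumLev codes (string comparison)."""
    return [
        r for r in rows
        if (state is None or r["State"] == state)
        and (sumlev is None or r["SumLev"] == sumlev)
    ]


california_counties = select_rows(rows, state="06", sumlev="050")
print([r["AreaName"] for r in california_counties])
```

Passing the number 6 instead of the string "06" would match nothing in this sketch, which is the practical consequence of storing identifiers as character strings.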

5. Datasets.html files are your friends.

Not every filetype directory contains one of these special files, but almost all of the newer and more important ones do. These pages appear as you navigate the archive using Uexplore. For example, compare the sf12010x archive page with the sf12010x Datasets.html page. With the former you get:

[Screenshot: sf12010x archive directory page]

With the Datasets.html page you get:

[Screenshot: sf12010x Datasets.html page]

6. It's not just Missouri data.

Most of our archive dataset collections cover the entire U.S. That's not always the case with some of our older data, although we do have national collections of STF3 files for both 1980 and 1990. We often do custom aggregations for Missouri just to get data by various regions within the state. And sometimes we are able to generate custom tabulations for Missouri only because we have access to Missouri data that we do not have for others. For example, we knew the boundaries of the post-2010 state legislative districts for Missouri almost two years ahead of when the Bureau released them for the entire U.S. So we had 2012 state legislative districts, for Missouri only, for almost two years.

7. Knowledge of SAS is not necessary.

You don't have to know about SAS in order to understand and extract the data. The great majority of our users likely know little or nothing about SAS. However, knowledge of SAS will be of some benefit to advanced users, particularly in the Dexter data extraction application.

8. The most popular datasets are those based on the 2010 census and the American Community Survey.

We also have two popular datasets in our georef (geographic reference) collection. Here are the top 25 most-accessed datasets over the period Oct. 2012 to Nov. 2014:

Rank  Filetype  Dataset            Times accessed
   1  georef    zcta_master                 1,603
   2  sf12010x  moselectedinv                 872
   3  acs2011   usstcnty5yr                   716
   4  georef    zipcodes                      631
   5  pl942010  uscounties                    415
   6  acs2011   ustracts5yr                   409
   7  acs2012   uszctas5yr                    392
   8  acs2011   uszctas5yr                    356
   9  acs2012   usmcdcprofiles3yr             348
  10  acs2012   usstcnty5yr                   344
  11  acs2012   usmcdcprofiles                322
  12  sf12010x  usstcnty                      315
  13  acs2011   usbgs5yrtemp                  284
  14  corrlst   zip07_cbsa06                  235
  15  sf12010   uszips                        233
  16  sf32000x  ustracts                      229
  17  sf12010x  moblocks                      205
  18  acs2012   usbgs5yr                      201
  19  sf12010   uscounties                    198
  20  sf12010x  uszips871                     178
  21  sf32000   usgeos                        172
  22  sf12010   moinventory                   162
  23  acs2012   uscdslds5yr                   158
  24  corrlst   us_stzcta5_county             157
  25  corrlst   uscdslds2012                  155

Over half (56%) of the dataset accesses made through Dexter during this period were to one of the four filetypes sf12010, sf12010x, acs2011 and acs2012. To see what any of these datasets are all about, you can navigate via Uexplore to the directory page for the filetype.

9. Detailed tables vs. standard extracts.

Some of our most popular filetypes occur in pairs, such as sf12010 and sf12010x, sf32000 and sf32000x, etc. The Census Bureau creates data products called summary tape files that comprise large collections of detailed summary tables. To see how such files are related, let's look at the two filetypes stf903 and stf903x. The entry for the stf903 filetype on the Uexplore/Dexter home (directory) page reads as follows:

stf903 — 1990 Summary tape file 3
Each dataset here contains over 3,300 cells of pre-tabulated data based on the 1990 census long-form questionnaires. Each observation contains data for a single geographic area. We have complete A files for Missouri, Illinois and Kansas plus a few other states; we also have the complete C file (national) with summaries for the country, states, counties, and larger cities. And, we have the B file — ZIP level summaries. This filetype has been made accessible at the table level from Dexter. As with any of the census summary file filetypes, you really need to have access to the technical documentation — available in the stf903/Docs subdirectory of this archive — before attempting to use these data. The stf903x and stf903x2 filetypes are derived from these files and are appropriate for quick overviews or access to frequently-used variables.

The datasets in this collection can be thought of as data tables, with each row a geographic area and each column containing either a geographic identifier, a count of persons or households, or a mean or median measure of some sort. Instead of variables with easy-to-read names like TotPop, Age0_4, or Hispanic, these datasets include variable names such as P6I1, P6I2, P6I3 and P6I4. These particular variables correspond to the STF3 table called "P6". The letter "I" in the variable names stands for "item", so the variable P8I4 would be the 4th cell in table P8.
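The table-cell naming convention lends itself to a small decoding helper. The following Python sketch is hypothetical (split_cell_name is not part of any archive tooling) and assumes names of the simple form shown above: a table name made of letters followed by digits, then "I", then the item number.

```python
import re

# Assumed form: letters + digits (the table name), "I", then the item number.
_CELL_PATTERN = re.compile(r"^([A-Z]+\d+)I(\d+)$")


def split_cell_name(name):
    """Split a name like 'P8I4' into ('P8', 4); return None if it does not fit."""
    match = _CELL_PATTERN.match(name.upper())
    if match is None:
        return None
    table, item = match.groups()
    return table, int(item)


for name in ("P6I1", "P8I4", "TotPop"):
    print(name, "->", split_cell_name(name))
```

Easy-to-read extract names such as TotPop do not fit the pattern and come back as None, which is one quick way to tell the two styles of variable apart.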

So how do you know what the variable names mean? That's where the Docs subdirectory becomes important: It contains the technical documentation of the STF3 data product as distributed by the Census Bureau. These tech docs are long and complex, but they are consistent in structure and content for all summary data files. So, once you know what a "summary level sequence chart" is and where to look for the table matrix outline information, you should be able to find these key reference sections.

For example, say you actually need to access these data: you want to look at the distribution of divorced females by county for the state of Missouri, and you know about Table P27. Now you need to find a dataset that has county-level summaries for Missouri. The Datasets.html metadata page shows two possibilities. The first is moi, which has "Inventory" as its value of units; the standard list of inventory summary levels includes counties. (The second is the uscntys dataset, which has only county-level summaries but covers the entire country. You would use that dataset to do this same extract for any other state.) Click the moi dataset name to invoke Dexter. You can then view detailed metadata for the set and get the key values for the slvl variable. The key values report shows that 050 is the code for county, so you can now go back and code your filter in Section II of the form: slvl Equals 050. In Section III, the identifiers choice is pretty simple: just go with FIPCO and AREANAME. On the right, choose your desired table (P27). In this example, all you really need is the last cell in this table (the count of divorced females), but you can always take the whole table and throw away what you don't need.
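The logic of that extract (filter on the summary level, keep two identifiers, keep one table) can be sketched in Python as follows. This is not how Dexter works internally. The rows are made up, table P27 is truncated to two cells with invented values, the five-digit FIPCO values are an assumption about that variable's format, and extract_table is a hypothetical helper.

```python
# Made-up rows standing in for the moi dataset; all cell values are invented
# and table P27 is truncated to two cells for brevity.
rows = [
    {"slvl": "040", "FIPCO": "", "AREANAME": "Missouri",
     "P27I1": 500, "P27I2": 60, "P6I1": 900},
    {"slvl": "050", "FIPCO": "29001", "AREANAME": "Adair County",
     "P27I1": 30, "P27I2": 4, "P6I1": 50},
    {"slvl": "050", "FIPCO": "29019", "AREANAME": "Boone County",
     "P27I1": 70, "P27I2": 9, "P6I1": 110},
]


def extract_table(rows, slvl, id_vars, table):
    """Keep rows at one summary level, with the identifiers and one table's cells."""
    prefix = table + "I"  # cells of table P27 are named P27I1, P27I2, ...
    result = []
    for row in rows:
        if row["slvl"] != slvl:
            continue
        kept = {name: row[name] for name in id_vars}
        kept.update({k: v for k, v in row.items() if k.startswith(prefix)})
        result.append(kept)
    return result


counties = extract_table(rows, "050", ["FIPCO", "AREANAME"], "P27")
for county in counties:
    print(county)
```

Taking the whole table and discarding unneeded cells afterwards, as suggested above, amounts to one more dictionary lookup on each returned row.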

So now what about the companion filetype, stf903x? In brief, the data in these datasets are not tables — they are variables derived from those tables. These are the data we use for our profile applications — the data that can be used to answer 80% of the questions asked with only a fraction of the number of data items to slog through. Most users, for most applications, will be using an extract collection rather than a complete table collection. Look back at the list of most-frequently-accessed datasets, above, and you'll see only three table summary data sets (all in the sf12010 filetype collection): uscounties, uszips and moinventory.

You'll note that the metadata provided for the extract collections is a lot shorter and is not written by anyone at the Census Bureau. These extracts are certainly based on census data files, and you really need to understand the "parent" complete-table STF collection in order to understand the extract, but we keep the two levels of documentation separate.

10. ACS summary tables do not have a separate filetype.

We decided that since we were already having to deal with a new set of ACS data every year now instead of every 10 years, we wanted to avoid having multiple new filetypes for each new year. So, instead of a new filetype, we created subdirectories of the acsYYYY data directories. If you are looking for summary tables based on 2012 vintage ACS data, then you need to look at the two subdirectories (of acs2012) base_tables and base_tables_5yr.

The margin-of-error measures in the ACS files require that we have these measures (columns, variables) as well as the usual table entries. Our naming convention for these is similar to our table-cell naming convention; we just use the letter "m" instead of "i". So, on the five-year ACS base tables datasets, we have the variables b01001i2 and b01001m2, where we store the second cell of table B01001 ("Males") and the corresponding margin-of-error value.
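The pairing of estimate and margin-of-error names can be illustrated with a short Python sketch. The helpers parse_acs_name and moe_name are hypothetical, and the pattern assumes names of exactly the form shown above: "b", five digits, "i" or "m", then the cell number.

```python
import re

# Assumed form: "b" + five digits (table), "i" or "m" (kind), then the cell number.
_ACS_PATTERN = re.compile(r"^(b\d{5})([im])(\d+)$")


def parse_acs_name(name):
    """Split a name like 'b01001m2' into its table, kind, and cell number."""
    match = _ACS_PATTERN.match(name.lower())
    if match is None:
        return None
    table, kind, cell = match.groups()
    return {
        "table": table.upper(),
        "kind": "estimate" if kind == "i" else "margin of error",
        "cell": int(cell),
    }


def moe_name(estimate_name):
    """Return the margin-of-error variable that pairs with an estimate variable."""
    match = _ACS_PATTERN.match(estimate_name.lower())
    if match is None or match.group(2) != "i":
        raise ValueError(f"not an estimate variable name: {estimate_name!r}")
    return f"{match.group(1)}m{match.group(3)}"


print(parse_acs_name("b01001i2"))
print(moe_name("b01001i2"))
```

Keeping the two names mechanically related means an extract can request the estimate and its margin of error together without a lookup table.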

Because of the large number of tables (and table cells) on the ACS base tables, we partitioned them based on topics. This gives us dataset names such as usstcnty17_20, which contains, for every state and county in the U.S., all tables associated with topics 17, 18, 19, and 20. The first two digits of a table name comprise the topic code; we have numerous TableTopicCodes.txt files in our base_tables and base_tables_5yr directories. If it is your first time accessing our base tables, it is strongly suggested that you read the Readme.html file in the acs2012/btabs5yr directory.
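To find which partitioned dataset holds a given table, the topic-range suffix can be decoded as in the Python sketch below. It assumes, based only on the usstcnty17_20 example, that a dataset name ends in an inclusive two-digit topic range; the helper names are hypothetical.

```python
import re


def topics_in_dataset(dataset_name):
    """Return the topic codes covered by a name ending in '_'-separated AA_BB."""
    match = re.search(r"(\d{2})_(\d{2})$", dataset_name)
    if match is None:
        return []
    first, last = int(match.group(1)), int(match.group(2))
    return [f"{code:02d}" for code in range(first, last + 1)]


def table_topic(table_name):
    """Return the topic code: the first two digits of the table name."""
    match = re.search(r"\d{2}", table_name)
    return match.group(0) if match else None


def dataset_has_table(dataset_name, table_name):
    """True if the table's topic falls inside the dataset's topic range."""
    return table_topic(table_name) in topics_in_dataset(dataset_name)


print(topics_in_dataset("usstcnty17_20"))
print(dataset_has_table("usstcnty17_20", "b19013"))
print(dataset_has_table("usstcnty17_20", "b01001"))
```

The TableTopicCodes.txt files remain the authoritative source for what each topic code means; the sketch only handles the arithmetic of the range.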

For more information...

See the various training modules related to Uexplore/Dexter. Note that there is a PowerPoint module specifically on the topic of the MCDC Data Archive. Most of the other modules focus on how to use the software to access the data.