Ten Things to Know about the OSEDA/MCDC Data Archive
The OSEDA/Missouri Census Data Center Archive is a collection of data files created over a period of more than 30 years. Most of the work has been done by programmers working at OSEDA (Office of Social and Economic Data Analysis, part of the University of Missouri) under contract with the Missouri Census Data Center (part of the Missouri State Library within the office of the Missouri Secretary of State). This informal document provides some insight for those who want to know what the archive contains and how they can access it.
1. It's a big collection of data.
At latest count we had about 450 gigabytes of data in the archive. That includes more than 100,000 database files (SAS datasets). We have tried to use directories and file naming conventions to make it easy (or at least easier) for us — both the programmers creating it and the users using it — to find things.
2. The data are organized into about 100 categories (filetypes).
The names associated with these filetypes are limited to eight characters. This is because these filetypes correspond to entities called data libraries in SAS, and the names of such libraries are limited to eight characters. While somewhat cryptic, the names are not as totally unfathomable as they might first appear.
For example, one filetype is called sf12010x. The archive directory page names this "Summary file 1, standard extract". Summary file 1 contains detailed tables based on data collected via the short-form census questionnaire. So, this sf12010x is a collection of data that is a "standard extract" based on summary file 1 in the 2010 decennial census. Once you figure out what that means, then you'll feel comfortable when you see a filetype of sf32000x, which is a standard extract based on the summary file 3 data product from the 2000 decennial census.
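The pattern is regular enough to sketch in code. Here is a minimal illustration in Python (the decode_filetype function and its regular expression are my own sketch of the naming convention for these two summary-file filetypes, not MCDC code):

```python
import re

# Hypothetical decoder for filetype names such as "sf12010x" or "sf32000x".
# Assumed pattern: product ("sf1" or "sf3"), a 4-digit census year, and an
# optional trailing "x" marking a "standard extract".
def decode_filetype(name):
    m = re.match(r"^(sf[13])(\d{4})(x?)$", name)
    if not m:
        return None
    product, year, extract = m.groups()
    return {
        "product": f"Summary File {product[-1]}",
        "census_year": int(year),
        "standard_extract": extract == "x",
    }

print(decode_filetype("sf12010x"))
# {'product': 'Summary File 1', 'census_year': 2010, 'standard_extract': True}
```

Other filetypes (georef, corrlst, etc.) follow different conventions, so this sketch covers only the summary-file family.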
3. Most of the datasets in the archive contain geographic area summaries.
The main datasets are from the Census Bureau and contain data about people, households, and housing units. But the rows (observations) that comprise these datasets do not directly describe any of these three entities; rather, they describe geographic areas (states, counties, cities, etc.). Each row is made up of columns (variables) that either identify or summarize the geographic area. The variables that help identify the entity are referred to as identifiers. These fields have no numeric significance and are stored in the datasets as character strings.
The true numeric variables contain summary characteristics. This means that if you look at data about household income, you should not expect to find a variable called HouseholdIncome. It would not make sense. There is no HouseholdIncome variable for the state of Missouri. What you can get is a median household income, a mean household income, or perhaps a count of households with income over a certain dollar amount for the state, i.e., statistical summaries. There are some important exceptions to this general rule. For example, the various public use microdata sample ("PUMS") filetypes (acsums, pums2000, etc.) contain datasets that describe individual persons or housing units.
4. Become comfortable with the SumLev and State variables.
This data archive is an informal collection of datasets that have some key naming conventions and coding consistencies that make it much easier to understand and navigate. There are naming conventions for datasets and for the columns/variables that comprise the rows/observations within the datasets.
The most important set of conventions pertains to the variables that identify the geographic entities being summarized on a dataset. By far the two most important are SumLev — a three-character code used to identify the kind of geographic entity being summarized on this observation — and State, containing the two-character FIPS state code for that entity.
The identifier variable SumLev (geographic summary level) will always contain a three-character code. (See the Sumlev.sas format in our SAS formats library to see a master list of these codes.) There are many places where you can see the state codes, including our state.sas format library entry.
Examples: Once you know that 050 is the summary level code indicating a county-level summary and 140 the code for census tract summaries, you can access (e.g.) any of the 51 [SS]selectedinv.sas7bdat datasets (where [SS] is the state postal abbreviation) in the sf12010x data directory. Knowing that 06 is the FIPS code for the state of California means that all you ever need to do to get the California subset of any nationwide dataset is to code the filter
State = '06' (quoted, because identifiers such as State are stored as character strings).
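To make the SumLev/State conventions concrete, here is a small illustration in Python (the rows and values below are made up; the real files are SAS datasets, which could be loaded with a tool of your choice, but the filtering logic is the same):

```python
# Illustrative only: a tiny stand-in for a nationwide summary dataset.
rows = [
    {"SumLev": "040", "State": "06", "AreaName": "California"},
    {"SumLev": "050", "State": "06", "AreaName": "Los Angeles County"},
    {"SumLev": "050", "State": "29", "AreaName": "Boone County"},
    {"SumLev": "140", "State": "29", "AreaName": "Tract 1"},
]

# SumLev and State are character codes, so filter on strings, not numbers:
ca_counties = [r for r in rows if r["SumLev"] == "050" and r["State"] == "06"]
print([r["AreaName"] for r in ca_counties])   # ['Los Angeles County']
```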
5. Datasets.html files are your friends.
Not every filetype directory contains one of these special files. But almost all of the newer and more important ones do. These pages appear as you navigate the archive using Uexplore. For example, compare the sf12010x archive page vs. the sf12010x Datasets.html page. With the former you get:
- Access to all the public files and subdirectories for the filetype.
- All the data sets as well as other metadata and non-database files.
- Varying display of descriptive material to help you know what each of the files is about. This will vary considerably with filetype. Older filetypes that predate the introduction of Datasets.html files will tend to have more file description entries than more current ones.
- For datasets that have been released recently, there will sometimes be entries here that have not yet been entered on the Datasets.html page.
With the Datasets.html page you get:
- Only the database files, organized by geographic universe and (where applicable) type of summary. In most cases, files with Missouri as the universe occupy the first rows of the metadata matrix, with national ("us") files displayed next.
- The Geographic Universe and Units columns can be quickly scanned to see if there is a dataset with the summary level(s) you need for the geographic universe (such as state) of interest.
- Links to Details give quick access to more detailed metadata regarding most datasets. Not all datasets have metadata entries (and these are intentionally displayed last in the table). Typically such datasets are not current or of wide general interest.
6. It's not just Missouri data.
Most of our archive dataset collections cover the entire U.S. That's not always the case with some of our older data, although we do have national collections of STF3 files for both 1980 and 1990. We often do custom aggregations for Missouri just to get data by various regions within the state. And sometimes we are able to generate custom tabulations for Missouri only because we have access to Missouri data that we do not have for others. For example, we knew the boundaries of the post-2010 state legislative districts for Missouri almost two years ahead of when the Bureau released them for the entire U.S. So we had 2012 state legislative districts, for Missouri only, for almost two years.
7. Knowledge of SAS is not necessary.
You don't have to know about SAS in order to understand and extract the data. The great majority of our users likely know little or nothing about SAS. However, knowledge of SAS will be of some benefit to advanced users, particularly in the Dexter data extraction application:
- One of the output format options (in section I of the Dexter query form) is "SAS dataset (Windows)". Whenever data are converted from one format to another, there is often some loss of information; since the archive data are already stored as SAS datasets, users who take the data in SAS format lose very little. There is also the (debatable) advantage of using SAS as your tool for accessing and analyzing the data: from viewing the data in FSBROWSE or VIEWTABLE windows, to summarizing it with power tools such as proc tabulate, proc summary, proc report, or proc SQL, to processing it with the SAS macro and data step languages, the skilled SAS user is in an excellent position to turn the data into knowledge.
- Section V-c of the Dexter query form lets you take advantage of SAS variable attribute statements. Even though these statements are pretty simple and are described in the online documentation, there is a clear advantage to already being familiar with these types of SAS statements.
- When specifying the variables to keep on output, instead of the usual select lists (ID and numeric variables, section III), there is the option of entering a list of variables. Once again, the online documentation gives a mini-tutorial on the various ways of specifying lists of variables in SAS.
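As a flavor of what that tutorial covers, here is a sketch of how one common SAS list form, the numbered range list (e.g. P6I1-P6I4), expands into individual variable names. The expand_numbered_range function is illustrative Python, not part of Dexter, and it handles only this one list form:

```python
import re

# Expand a SAS numbered range list such as "P6I1-P6I4" into
# ["P6I1", "P6I2", "P6I3", "P6I4"]. Any other spec is returned as-is.
def expand_numbered_range(spec):
    m = re.match(r"^([A-Za-z_]\w*?)(\d+)-\1(\d+)$", spec)
    if not m:
        return [spec]          # not a numbered range; pass the name through
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{n}" for n in range(lo, hi + 1)]

print(expand_numbered_range("P6I1-P6I4"))
# ['P6I1', 'P6I2', 'P6I3', 'P6I4']
```

SAS also supports name-range lists (a--b) and special lists like _NUMERIC_, which this sketch ignores.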
8. The most popular datasets are those based on the 2010 census and the American Community Survey.
We also have two popular datasets in our georef (geographic reference) collection. Here are the top 25 most-accessed datasets over the period Oct. 2012 to Nov. 2014:
Rank  Filetype  Dataset            # Times Accessed
  1   georef    zcta_master                   1,603
  2   sf12010x  moselectedinv                   872
  3   acs2011   usstcnty5yr                     716
  4   georef    zipcodes                        631
  5   pl942010  uscounties                      415
  6   acs2011   ustracts5yr                     409
  7   acs2012   uszctas5yr                      392
  8   acs2011   uszctas5yr                      356
  9   acs2012   usmcdcprofiles3yr               348
 10   acs2012   usstcnty5yr                     344
 11   acs2012   usmcdcprofiles                  322
 12   sf12010x  usstcnty                        315
 13   acs2011   usbgs5yrtemp                    284
 14   corrlst   zip07_cbsa06                    235
 15   sf12010   uszips                          233
 16   sf32000x  ustracts                        229
 17   sf12010x  moblocks                        205
 18   acs2012   usbgs5yr                        201
 19   sf12010   uscounties                      198
 20   sf12010x  uszips871                       178
 21   sf32000   usgeos                          172
 22   sf12010   moinventory                     162
 23   acs2012   uscdslds5yr                     158
 24   corrlst   us_stzcta5_county               157
 25   corrlst   uscdslds2012                    155
Over half (56%) of the datasets accessed by Dexter during this period were in one of the four filetypes sf12010, sf12010x, acs2011, and acs2012. To see what any of these datasets are all about, you can navigate via Uexplore to the directory page for the filetype.
9. Detailed tables vs. standard extracts.
Some of our most popular filetypes occur in pairs, such as sf12010 and sf12010x, sf32000 and sf32000x, etc. The Census Bureau creates data products called summary tape files that comprise large collections of detail summary tables. To see how such files are related, let's look at the two filetypes stf903 and stf903x. The entry for the stf903 filetype on the Uexplore/Dexter home (directory) page reads as follows:
- stf903 — 1990 Summary tape file 3
- Each dataset here contains over 3,300 cells of pre-tabulated data based on the 1990 census long-form questionnaires. Each observation contains data for a single geographic area. We have complete A files for Missouri, Illinois and Kansas plus a few other states; we also have the complete C file (national) with summaries for the country, states, counties, and larger cities. And, we have the B file — ZIP level summaries. This filetype has been made accessible at the table level from Dexter. As with any of the census summary file filetypes, you really need to have access to the technical documentation — available in the stf903/Docs subdirectory of this archive — before attempting to use these data. The stf903x and stf903x2 filetypes are derived from these files and are appropriate for quick overviews or access to frequently-used variables.
The datasets in this collection can be thought of as data tables, with each row a geographic area and each column containing either a geographic identifier or a count of persons or households or a mean or median measure of some sort. Instead of variables with easy-to-read names like TotPop, Age0_4, or Hispanic, these datasets use variable names such as P6I1, P6I2, P6I3, and P6I4. These particular variables correspond to the SF3 table called "P6". The letter "I" in the variable names stands for "item", so the variable P8I4 would be the 4th cell in table P8.
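The naming rule is mechanical enough to express in a few lines. Here is an illustrative sketch (parse_cell_name is my own helper, not MCDC code):

```python
import re

# Split a table-cell variable name such as "P8I4" into its table id and
# item number, following the table-"I"-item convention described above.
def parse_cell_name(name):
    m = re.match(r"^([A-Z]+\d+)I(\d+)$", name)
    if not m:
        return None
    return {"table": m.group(1), "item": int(m.group(2))}

print(parse_cell_name("P8I4"))   # {'table': 'P8', 'item': 4}
```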
So how do you know what the variable names mean? That's where the Docs subdirectory becomes important: It contains the technical documentation of the STF3 data product as distributed by the Census Bureau. These tech docs are long and complex, but they are consistent in structure and content for all summary data files. So, once you know what a "summary level sequence chart" is and where to look for the table matrix outline information, you should be able to find these key reference sections.
For example, say you actually need to access these data; you want to look at the distribution of divorced females by county for the state of Missouri; and you know about Table P27. Now you need to find a dataset that has county-level summaries for Missouri. The Datasets.html metadata page shows two possibilities. One is the moi dataset, whose Units value is "Inventory"; the standard list of inventory summary levels includes counties. (You could also use the uscntys dataset, which has only county-level summaries but covers the entire country; you would use that dataset to do this same extract for any other state.) Click the moi dataset name to invoke Dexter. You can then view detailed metadata for the set and get the key values for the slvl variable. The key values report shows that 050 is the code for county, so you can now go back and code your filter in Sec. II of the form:
slvl Equals 050. In section III, the identifiers choice is pretty simple: just go with FIPCO and AREANAME. On the right, choose your desired table (P27). In this example, all you really need is the last cell in this table (the count of divorced females), but you can always take the whole table and throw away what you don't need.
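The whole extraction can be sketched in plain Python (this is not Dexter itself, just the equivalent logic; the FIPCO values, the counts, and the choice of P27I10 as the table's last cell are all made up for the demo):

```python
# Made-up rows mimicking a dataset with mixed summary levels; P27I10 is a
# hypothetical "last cell of table P27" (count of divorced females).
rows = [
    {"slvl": "040", "FIPCO": "",      "AREANAME": "Missouri",         "P27I10": 500000},
    {"slvl": "050", "FIPCO": "29019", "AREANAME": "Boone County",     "P27I10": 6800},
    {"slvl": "050", "FIPCO": "29189", "AREANAME": "St. Louis County", "P27I10": 48000},
]

# Filter to county-level rows, keep the identifiers plus the one cell wanted:
extract = [
    {"FIPCO": r["FIPCO"], "AREANAME": r["AREANAME"], "DivorcedF": r["P27I10"]}
    for r in rows if r["slvl"] == "050"
]
for rec in extract:
    print(rec)
```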
So now what about the companion filetype, stf903x? In brief, the data in these datasets are not tables — they are variables derived from those tables. These are the data we use for our profile applications — the data that can be used to answer 80% of the questions asked with only a fraction of the number of data items to slog through. Most users, for most applications, will be using an extract collection rather than a complete table collection. Look back at the list of most-frequently-accessed datasets, above, and you'll see only three table summary data sets (all in the sf12010 filetype collection): uscounties, uszips and moinventory.
You'll note that the metadata provided for the extract collections is a lot shorter and is not written by anyone at the Census Bureau. These extracts are certainly based on census data files and you really need to understand about the "parent" complete-table STF collection in order to understand the extract, but we keep the two levels of documentation separate.
10. ACS summary tables do not have a separate filetype.
We decided that since we were already having to deal with a new set of ACS data every year now instead of every 10 years, we wanted to avoid having multiple new filetypes for each new year. So, instead of a new filetype, we created subdirectories of the acsYYYY data directories. If you are looking for summary tables based on 2012 vintage ACS data, then you need to look at the two subdirectories (of acs2012) base_tables and base_tables_5yr.
The margin-of-error measures in the ACS files require that we carry these measures (columns, variables) as well as the usual table entries. Our naming convention for these is similar to our table-cell naming convention; we just use the letter "m" instead of "i". So, on the five-year ACS base tables datasets, we have the variables b01001i2 and b01001m2, which store the second cell of table B01001 ("Male") and the corresponding margin-of-error value.
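The estimate-to-MOE pairing follows directly from that rule. A one-line illustration (this assumes the simple i/m convention just described; moe_name is my own helper and may not cover every table-id spelling):

```python
# Derive the margin-of-error variable name from an estimate variable name
# by swapping the final "i" marker for "m": "b01001i2" -> "b01001m2".
def moe_name(est_name):
    head, _, item = est_name.rpartition("i")
    return f"{head}m{item}"

print(moe_name("b01001i2"))   # b01001m2
```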
Because of the large number of tables (and table cells) in the ACS base tables, we partitioned them based on topics. This gives us dataset names such as usstcnty17_20, which contains, for every state and county in the U.S., all tables associated with topics 17, 18, 19, and 20. The first two digits of a table name comprise the topic code; we have numerous TableTopicCodes.txt files in our base_tables and base_tables_5yr directories. If it is your first time accessing our base tables, it is strongly suggested that you read the Readme.html file in the acs2012/base_tables_5yr directory.
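A tiny illustration of the topic-code rule (the table names here are real ACS table ids, but the helpers and the 17-20 bounds are just the usstcnty17_20 example from the text, coded up as a sketch):

```python
# Topic code = first two digits of a table name (e.g. B17001 -> topic 17).
def topic_code(table_name):
    digits = "".join(ch for ch in table_name if ch.isdigit())
    return int(digits[:2])

# Would this table land in a partition covering topics lo..hi inclusive?
def in_partition(table_name, lo, hi):
    return lo <= topic_code(table_name) <= hi

print(topic_code("B17001"))            # 17 (poverty tables)
print(in_partition("B19013", 17, 20))  # True (income is topic 19)
```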
For more information...
See the various training modules related to Uexplore/Dexter. Note that there is a PowerPoint module specifically on the topic of the MCDC Data Archive. Most of the other modules focus on how to use the software to access the data.