Ten Things to Know about the OSEDA/MCDC Data Archive

The OSEDA/Missouri Census Data Center Archive is a collection of data files created over a period of more than 30 years. Most of the work has been done by programmers working at OSEDA (Office of Social and Economic Data Analysis, part of the University of Missouri Columbia campus) under contract with the Missouri Census Data Center (part of the Missouri State Library within the office of the Missouri Secretary of State). This informal document is intended to provide some insight for those who want to know what the archive contains and how they might access it.

  1. It's a big collection of data. Not that size really matters that much in judging the quality/usefulness of a data collection. Bigger could just be an indication of inefficiency, or of an unwillingness to delete or archive (using an alternate sense of the word) obsolete, redundant, or otherwise useless data. At latest count we had about 366 gigabytes of data in the active archive. That includes just over 6,000 database files (i.e. files that can be queried via Dexter, stored as SAS datasets), plus almost 30,000 files of other kinds (most of them irrelevant to most users).

    When you have this much data you have to be at least a little bit organized or you'll never be able to find anything. So we have tried to use directories and file naming conventions to make it easy (or at least easier) for us (both the programmers creating it and the users using it, which also includes those programmers) to find things.

  2. The data are organized into about 100 categories that we refer to as "filetypes". The names associated with these filetypes are limited to eight characters. The reason is that the archive was designed in the mid-1990s by programmers who were used to living with names of things that do not really describe what they are. These filetypes/data directories correspond to entities called data libraries in the world of SAS(r) programming, and the names of such libraries were (and still are) limited to 8 characters. The programmers who designed the archive thought it would be cool to name the directories in such a way that those names would also double as SAS data library names. So that is our lame excuse for the less-than-meaningful names of these fundamental archive divisions. People over the age of 50 tend not to let this bother them. People who spend time with the archive come to recognize that, while somewhat cryptic, the names are not as unfathomable as they might first appear. For example, one of the filetypes is called sf12010x. If you read the paragraph describing this data category on the archive directory page you'll see that this gets expanded to a more descriptive "Summary File 1 Standard Extract" label. Then you get a brief paragraph expanding on what this might mean. That paragraph makes reference to "Standard practice and naming conventions for all our decennial summary (tape) filetypes". The "standard practice" is not described there, but here is how it works. The Census Bureau makes available data products called "Summary Files" based on the results of each decennial census (since 1970, at least, although in 1970 they were called "Counts" instead of Summary Files, and prior to 2000 they were called "Summary Tape Files" because they were so large that they were almost always delivered on round reel tapes).

    People who want to access "census data" know (or need to know) that this almost always means accessing these summary files, or something based upon them. Summary files always have numbers (1 through 4 per decade, typically). Summary File 1 is the first of the SF products to be released, and contains detailed tables based on the data collected on the short form questionnaire in the census. So it turns out that this "sf12010x" is a collection of data that is a "standard eXtract" based on this Summary File 1 collection of detailed tables based on the short form in the 2010 decennial census. Once you figure out what that means, then you'll feel comfortable when you see a filetype of sf32000x which you will not be surprised to learn is a standard extract based on the "Summary File 3" data product from the 2000 decennial census.
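    The decoding rule just described is mechanical, and it may help to see it spelled out. Here is a small Python sketch (purely illustrative; decode_filetype is not an archive tool) that splits one of these decennial summary-file filetype names into its parts:

```python
import re

def decode_filetype(name):
    """Split a filetype name like 'sf12010x' into its parts:
    product ('sf1'), census year ('2010'), and an optional
    trailing 'x' marking a standard extract."""
    m = re.fullmatch(r"(sf\d)(\d{4})(x?)", name)
    if not m:
        return None   # not a decennial summary-file filetype
    product, year, extract = m.groups()
    return {
        "product": product.upper(),          # e.g. "SF1"
        "census_year": year,                 # e.g. "2010"
        "standard_extract": extract == "x",  # trailing "x" = extract
    }
```

So decode_filetype("sf32000x") reports product SF3, census year 2000, standard extract; a name like "georef" simply doesn't fit the pattern.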

  3. Most of the datasets in the archive (over 90%) contain geographic area summaries. Most of the most popular datasets are from the Census Bureau and contain data about people, households and housing units. But the rows (aka "observations") that comprise these datasets do not directly describe any of these three entities. What they describe are geographic areas (states, counties, cities, etc.). The rows are comprised of columns (aka "variables") that identify or summarize the geographic area. The variables that provide information to help identify the (usually geographic) entity being summarized we refer to as "Identifiers". These fields have no numeric significance (even though many are stored as numeric-looking codes such as FIPS state and county codes, or ZIP/ZCTA codes) and are stored in the datasets as character strings. The true numeric variables (the kind it makes sense to aggregate or take a mean of) contain summary characteristics. Note that this means that if you want to look at data regarding household income in a typical summary-data dataset, you should not expect to find a variable called HouseholdIncome. It would not make sense. There is no HouseholdIncome variable for the state of Missouri. What you can get is a Median Household Income, a Mean Household Income, or perhaps a count of households with income over a certain dollar amount for the state, i.e. statistical summaries.

    There are some important exceptions to this general rule. For example, the various Public Use Microdata Sample ("PUMS") filetypes (acsums, pums2000, etc.) contain data sets that describe individual persons or housing units.
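    The point about identifiers being character strings has a practical side worth a two-line illustration (Python, purely illustrative; the value is hypothetical):

```python
# FIPS and similar codes are stored as character strings because,
# read as numbers, they silently lose their leading zeros.
state_fips = "06"               # California, as the archive stores it
as_number = int(state_fips)     # becomes plain 6
round_tripped = str(as_number)  # "6" -- the leading zero is gone

# So comparisons against archive identifiers should use strings:
is_california = (state_fips == "06")
```

The same hazard applies to ZIP/ZCTA codes like "00601"; spreadsheet software that helpfully converts such columns to numbers causes the same damage.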

  4. Become comfortable with the SumLev and State variables. This data archive is by no means a formal database. It is an informal collection of datasets with some key naming conventions and coding consistencies that make it much easier to understand and navigate. There are naming conventions for datasets and for the columns/variables that comprise the rows/observations within the datasets. The most important set of conventions are those pertaining to the variables that identify the geographic entities being summarized on a data set. By far the two most important examples of these are SumLev--a 3-character code used to identify the kind of geographic entity being summarized on this observation--and State, containing the 2-character FIPS state code for that entity. The identifier variable SumLev ("Geographic Summary Level") will always contain a 3-character code. See the Sumlev.sas module in our SAS formats library for a master list of these codes. (Don't worry if you don't know what a SAS formats library is; it just reads like a simple text file.) There are many places where you can see the state codes, including our state.sas format library entry. Once you know that 050 is the summary level code indicating a county level summary and 140 the code for census tract summaries, and that you can access any of the 51 SSselectedinv.sas7bdat datasets (where SS is the state postal abbreviation) in the sf12010x data directory, you are half-way home in being able to create useful extracts from these datasets using Uexplore/Dexter. Knowing that 06 is the FIPS code for the state of California means that all you ever need to do to get the California subset of any nationwide dataset (almost always one with a name beginning with "us") is to code the filter:
    State   Equals   06
    The summary level variable appears under the name slvl in many of our 1980 and 1990 decennial census datasets.
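    To make the convention concrete, here is a small Python sketch (not MCDC code; the rows are made up) of what a SumLev filter combined with a State filter accomplishes on a nationwide dataset:

```python
# Hypothetical rows from a nationwide summary dataset. Identifiers
# (SumLev, State) are character strings, per the archive convention.
rows = [
    {"SumLev": "040", "State": "06", "AreaName": "California"},
    {"SumLev": "050", "State": "06", "AreaName": "Alameda County"},
    {"SumLev": "050", "State": "29", "AreaName": "Adair County"},
    {"SumLev": "140", "State": "06", "AreaName": "Tract 4001"},
]

def county_subset(rows, state_fips):
    """County-level rows (SumLev 050) for one state, by FIPS code."""
    return [r for r in rows
            if r["SumLev"] == "050" and r["State"] == state_fips]

ca_counties = county_subset(rows, "06")  # just Alameda County here
```

This is exactly what the Dexter filter row does for you; the sketch just shows why both the summary level and the state code usually need to be specified.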

  5. Datasets.html files are your friends. Not every filetype directory contains one of these special files, but almost all of the newer and more important ones do. Compare the directory pages displayed as you navigate the archive using Uexplore with these special pages (see, for example, the sf12010x uexplore page vs. the sf12010x Datasets.html page): the former is a simple directory listing, while the latter is a matrix of the datasets with links to Dexter and to detailed metadata.

  6. It's not just Missouri data. Most of our archive dataset collections these days cover the entire U.S. That is not always the case with some of our older data (but we do have national collections of STF3 files for both 1980 and 1990). We often do custom aggregations for Missouri only, to get data by various regions within the state. And sometimes we are able to generate custom tabulations for Missouri only because we have access to data for our state that we do not have for others. For example, we knew the boundaries of the post-2010 state legislative districts for Missouri almost two years before the Bureau released them for the entire U.S. So we had "2012" state legislative district data for Missouri only for almost two years. But now we have that data for the entire U.S.

  7. Knowledge of SAS is not necessary. However.... The second most prevalent misconception about this archive (after the only-for-Missouri myth) is that you have to know SAS in order to understand and extract the data. That is certainly not true. The great majority of our users know little or nothing about SAS; many, I suspect, don't even know what SAS is. However, the word "SAS" does appear in two locations on the Dexter query form, and those are the basis for two of the three "however" items describing why a knowledge of SAS would be of some benefit to archive users.

  8. The most popular datasets in the archive are mostly those based on the 2010 decennial census and the American Community Survey. We also have two datasets in our georef (geographic reference) collection that made the top 5 most-accessed datasets during a recent 2-year period. Here are the top 25 most-accessed datasets over the period Oct 2012 to Nov 2014:
    Rank  filetype  dset               # Times Accessed
       1  georef    zcta_master                   1,603
       2  sf12010x  moselectedinv                   872
       3  acs2011   usstcnty5yr                     716
       4  georef    zipcodes                        631
       5  pl942010  uscounties                      415
       6  acs2011   ustracts5yr                     409
       7  acs2012   uszctas5yr                      392
       8  acs2011   uszctas5yr                      356
       9  acs2012   usmcdcprofiles3yr               348
      10  acs2012   usstcnty5yr                     344
      11  acs2012   usmcdcprofiles                  322
      12  sf12010x  usstcnty                        315
      13  acs2011   usbgs5yrtemp                    284
      14  corrlst   zip07_cbsa06                    235
      15  sf12010   uszips                          233
      16  sf32000x  ustracts                        229
      17  sf12010x  moblocks                        205
      18  acs2012   usbgs5yr                        201
      19  sf12010   uscounties                      198
      20  sf12010x  uszips871                       178
      21  sf32000   usgeos                          172
      22  sf12010   moinventory                     162
      23  acs2012   uscdslds5yr                     158
      24  corrlst   us_stzcta5_county               157
      25  corrlst   uscdslds2012                    155

    Over half (56%) of the datasets accessed by Dexter during this period were in one of these four filetypes: sf12010, sf12010x, acs2011 and acs2012.
    To see what any of these datasets is all about, you can navigate via uexplore to the directory page for the filetype. (Hint: follow the link to the MCDC Data Archive home page, which you'll find near the top of the navy blue navigation box that appears at the upper left of most MCDC web pages.) From there you can use the Major Category Index to help locate the filetype, or you can do a text search ("find") to locate it. Once you have found and followed the link and are on that filetype directory page (for example, sf12010x) you can select (click on) the Datasets.html file entry and from that special directory page locate the dataset of interest. For example, if you were looking for information regarding the usstcnty dataset in the sf12010x filetype directory you would find it in the 15th row of the matrix displayed on the Datasets.html page. You can click on the (leftmost) Dexter link column to go directly to Dexter with this dataset selected, or you can use the "Link to Details" link to go straight to the metadata for the chosen dataset. If you opt for the former (going straight to Dexter), you can use the link near the top of the Dexter query form, "See detailed metadata for this dataset", to access the metadata.

  9. Detailed Tables vs. Standard Extracts. You should notice that some of our (most popular) filetypes occur in pairs, such as sf12010 and sf12010x, sf32000 and sf32000x, stf903 and stf903x/stf903x2, etc. The Census Bureau creates data products called Summary [Tape] Files that are comprised of rather large collections of detailed summary tables. (The word "Tape" was dropped starting in 2000 because almost nobody was writing these things to round reel tapes any more.) To see how such files are related, let's look at the two filetypes stf903 and stf903x.

    The entry for the stf903 filetype on the Uexplore/Dexter home (directory) page reads as follows:

    stf903/ 1990 Summary Tape File 3 Each dataset here contains over 3300 cells of pre-tabulated data based on the 1990 census long-form questionnaires. Each observation contains data for a single geographic area. We have complete "A" files for Missouri, Illinois and Kansas plus a few other states; we also have the complete "C" file (national) with summaries for the country, states, counties and larger cities. And, we have the "B" file - ZIP level summaries. This filetype has been made accessible at the table level from Dexter. As with any of the census summary file filetypes, you really need to have access to the technical documentation -- available in the stf903/Docs subdirectory of this archive -- before attempting to use these data. The stf903x and stf903x2 filetypes are derived from these files and are appropriate for quick overviews or access to frequently-used variables.

    That's a bit more information than we typically provide for a filetype, but there was a time when this was one of our most frequently accessed filetypes and we wanted to provide users with some guidance. The important thing to understand here is that the datasets in this collection can be thought of as data tables, with each row a geographic area and each column containing either some kind of geographic identifier info or a count of persons or households or a mean or median measure of some sort. Each of the latter is actually a cell of one of the Summary File tables. So we have tables within tables. Instead of variables with mnemonic names such as TotPop, Age0_4, or Hispanic you have variable names such as P6I1, P6I2, P6I3 and P6I4. These 4 variables correspond to the cells of the SF3 table called P6. The letter "I" in the variable names stands for "Item"; so the variable P8I4 would be the 4th cell in table P8. So how do you know what the tables are? That's where the Docs subdirectory becomes important. This subdirectory contains the complete official technical documentation of the STF3 data product as distributed by the Census Bureau. We have created a series of files representing the chapters and appendices of the original 464-page pdf document distributed by the Bureau. These "tech docs" are long and complex, but they are consistent in terms of structure and content for all the Bureau's summary data files. So once you figure out what a "Summary Level Sequence Chart" is about (telling you what geographic entities are summarized on various "files") and where to look for the table matrix outline information, you should be able to find these key sections and use them for reference. In this case you can access an index.html file that makes it quite easy to follow links to the various components - such as Chapter 6 - Summary Level Sequence Charts and Chapter 5 - Table Outlines. We have also provided a set of "ascii" (plain text) files within the Docs subdirectory. The tbl_mtx.asc file is probably the most valuable single file in this collection of documents. It lets you see what tables are available and helps you see what the variable names corresponding to the cells of those tables are going to be. For example:

    P27. SEX(2) BY MARITAL STATUS(6) [12]
        Universe:  Persons 15 years and over
          Never married                        P0270001    9      N         1,1
          Now married:
             Married, spouse present           P0270002    9      N         1,2
             Married, spouse absent:
                Separated                      P0270003    9      N         1,3
                Other                          P0270004    9      N         1,4
          Widowed                              P0270005    9      N         1,5
          Divorced                             P0270006    9      N         1,6
          (Repeat MARITAL STATUS)              P0270007     54    N         2,1
    is part of this file and defines Table P27. The column containing the database names (as used by the Bureau on their CD-ROM database files) can be rather easily translated into the variable names used on our datasets. If you wanted to get a count of divorced persons in an area you would want to access the 6th cell of this table, and that variable would be P0270006 (Census name) or p27i6 (our name). To get the number of divorced females you would need to access cell (variable) p27i12 (the female counts are in cells 7 to 12, so you add 6 to the corresponding Males cell). If you go back up one directory level to the stf903 data directory you will find files Varlabs.sas and Varlabs.txt that look like this:
     P27I1    /* MALE:NEVER MARRIED */
     P27I4    /* :::OTHER */
     P27I5    /* :WIDOWED */
     P27I6    /* :DIVORCED */
     P27I7    /* FEMALE:NEVER MARRIED */
     P27I10   /* :::OTHER */
     P27I11   /* :WIDOWED */
     P27I12   /* :DIVORCED */
    This is pretty terse, but you might see how it could help locate names for cells once you had viewed the table outline matrix. (This reflects a not fully mature approach to presenting table outline metadata; we do better in our later files for the 2000 and 2010 censuses and for the ACS summary tables.)
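    The renaming rule is mechanical enough to express in a few lines. This Python sketch (illustrative only; nothing like it ships with the archive) converts a Census CD-ROM name such as P0270006 into the corresponding MCDC name:

```python
import re

def census_to_mcdc(census_name):
    """Translate a Census CD-ROM variable name -- a letter prefix,
    a 3-digit table number, and a 4-digit cell number -- into the
    MCDC convention: lowercased prefix, unpadded table number,
    'i' (for Item), unpadded cell number."""
    m = re.fullmatch(r"([A-Za-z]+)(\d{3})(\d{4})", census_name)
    if not m:
        raise ValueError("unrecognized Census variable name: " + census_name)
    prefix, table, cell = m.groups()
    return "%s%di%d" % (prefix.lower(), int(table), int(cell))

divorced_males   = census_to_mcdc("P0270006")  # "p27i6"
divorced_females = census_to_mcdc("P0270012")  # "p27i12" (cell 6 + 6)
```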

    So let's say you actually needed to access these data. You wanted to look at the distribution of divorced females by county for the state of Missouri. Let's say you know about Table P27 (you went to the Docs directory and did some searching to find this on your own). So now you need to find a dataset that has county level summaries for Missouri. There are two possibilities: the moi dataset, which has "Inventory" as the value of Units in the Datasets.html metadata (the standard list of inventory summary levels includes counties), and the uscntys dataset, which has only county level summaries but has them for the entire country. (You would use that data set to do this same extract for any other state.) So I go ahead and click on the moi data set to invoke Dexter and access it. I can then access the link to "detailed metadata" for the set and then follow the link to get the "key values" for the slvl variable (remember, we said that we used this alternate name for the SumLev variable on earlier datasets). I see from the key values report that 050 is the code for county, so I can now go back and code my filter in Sec. II of the form:

    Slvl  Equals  050
    Now comes the tricky part, Section III, where I get to choose my variables. The Identifiers choice is pretty simple: just go with FIPCO and AREANAME. And on the right I get to choose my numeric variables. But - surprise! - the select list on the right does not have the usual "Numerics" label at the top -- instead it says "Tables". And the entries are not variables, they are table descriptions. This means that the system has flagged this as a table-based summary file and has made access available at the table rather than the variable level. That makes it a little easier, since all I have to do is click on the P27 entry. Of course all I really need is the last cell in this table (the count of divorced females), but I can always take the whole table and throw away what I don't need once I get it into Excel.
    Note: Most, but not all, table-based filetypes will present you with a Tables select menu rather than a Variables menu. It takes a bit of extra work on our part to make a filetype accessible at the table level, and for some files we have not had the resources or motivation to do this.

    So now what about the companion filetype, stf903x? In the interest of keeping it brief (too late?) we'll just say that the data in these datasets are not tables - they are variables derived from those tables. These are the data we use for our Profile applications, the data that can be used to answer 80% of the questions asked with only a fraction of the number of data items to slog through. Most users, for most applications, will be using an extract collection rather than a complete table collection. Look back at the list of most-frequently-accessed datasets, above, and you'll see only 3 table summary data sets (all in the sf12010 filetype collection): uscounties, uszips and moinventory.

    You'll note that the metadata provided for the extract collections is a lot shorter and is not written by anyone at the Census Bureau. These extracts are certainly based on census data files and you really need to understand about the "parent" complete-table STF collection in order to understand the extract, but we keep the two levels of documentation separate.

  10. ACS Summary Tables Do Not Have A Separate Filetype. That seems pretty inconsistent. Since the ACS Summary Files have superseded what we used to get in collections such as sf32000, shouldn't we have something like acssf12 where we store the ACS summary files for vintage 2012 data? Maybe we should, but that's not how it turned out. We decided that since we were already having to deal with a new set of ACS data every year now, instead of every 10 years, we wanted to avoid having multiple new filetypes for each new year. So, instead of a new filetype, what we created were subdirectories of the acsYYYY data directories ("filetypes"). If you are looking for summary tables based on 2012 vintage ACS data then you need to look at the two subdirectories (of acs2012) basetbls and btabs5yr. We could have named these smrytbls but we chose to use the "base" name because that seemed to be consistent with the Bureau's references to these data. Base and Summary tables are the same thing. Why do we have a separate subdirectory for "5yr" base tables? Because they are humongous and we want to be able to archive them independently as they become out of date. For the 2012 collection, in the btabs5yr data directory (containing the base/summary tables based on 2008-2012 five-year period estimates) we have over 80 data sets comprising about 19 gigabytes of data. This is more than double the size of the basetbls subdirectory where we keep all the 1- and 3-year estimates. At some point our plan is to keep only the most recent pair of non-overlapping 5-year interval data, moving all others to some archival area.

    The margin of error measures in the ACS files require that we have these measures (columns, variables) as well as the usual table entries. Our naming convention for these is similar to our table-cell naming convention; we just use the letter "m" instead of "i". So on the 5yr ACS base tables data sets we have the variables b01001i2 and b01001m2, where we store the second cell of table B01001 ("Males") and the corresponding margin-of-error value.
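    The i-to-m rule is simple enough to state in code. A hypothetical helper (Python, not something the archive supplies) that derives the margin-of-error name from an estimate name:

```python
def moe_name(cell_name):
    """Derive the margin-of-error variable name from a table-cell
    estimate name by replacing the final 'i' with 'm':
    e.g. 'b01001i2' pairs with 'b01001m2'."""
    head, sep, cell = cell_name.rpartition("i")
    if not sep or not cell.isdigit():
        raise ValueError("not a table-cell estimate name: " + cell_name)
    return head + "m" + cell
```

Note the use of rpartition rather than a simple replace, since the letter "i" can also occur earlier in a table name.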

    Because of the large number of tables (and table cells) in the ACS base tables, we partitioned them based on topics. This gives us dataset names such as usstcnty17_20, which contains, for every state and county in the U.S., all the tables associated with topics 17, 18, 19 and 20. The first 2 digits of a table number constitute the topic code, and we have numerous TableTopicCodes.txt files in our basetbls/btabs5yr directories. If this is your first time accessing our base tables, we strongly suggest you read the Readme.html file in the acs2012/btabs5yr directory. (We may or may not create a new version for each new ACS vintage.)
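    The topic arithmetic can be sketched in a few lines (Python, with hypothetical helper names; only the first-two-digits rule comes from the text above):

```python
def table_topic(table_name):
    """Topic code of an ACS table: the first 2 digits of the
    table number, e.g. 'B17001' -> 17."""
    return int(table_name[1:3])

def in_partition(table_name, lo, hi):
    """Would this table land in a dataset covering topics lo..hi,
    e.g. usstcnty17_20 covering 17 through 20?"""
    return lo <= table_topic(table_name) <= hi
```

So a table whose number begins with 17 belongs in usstcnty17_20, while one beginning with 01 would be found in whichever partition covers topic 1.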

For more information...

See the various training modules related to Uexplore/Dexter at mcdc.missouri.edu/tutorials/uexploreDexter (which is linked to from the Uexplore/Dexter home page). Note that there is a PowerPoint module specifically on the topic of the MCDC Data Archive. Most of the other modules focus on how to use the software to access the data.