The MCDC data archive is a collection of datasets (files) created over a period of more than 30 years by programmers at the University of Missouri under contract with the Missouri Census Data Center (part of the Missouri State Library within the office of the Missouri Secretary of State). This page provides some information about the archive's contents and how to access it.
As of 2025, the collection contains more than 600 gigabytes of data, including roughly 10,000 to 15,000 SAS datasets (database files). We try to use directory and file-naming conventions to make it easier to find things.
The archive's filetypes correspond to SAS entities called libraries, whose names are limited to eight characters. While somewhat cryptic, the names are less unfathomable than they might first appear.
For example, one filetype is called sf12010x. The archive directory page names this "Summary file 1 standard extract". Summary file 1 contains detailed tables based on data collected via the short-form census questionnaire. So sf12010x is a collection of "standard extract" data based on summary file 1 from the 2010 decennial census. Other filetypes use similar naming conventions.
The main datasets from the Census Bureau contain data about people, households, and housing units. But the rows (observations) in these datasets do not directly describe any of those three entities; rather, they describe geographic areas (states, counties, cities, etc.). Each row consists of columns (variables) that identify or summarize the geographic area. The variables that identify the entity are called identifiers or ID variables. These fields have no numeric significance and are stored in the datasets as character strings.
The numeric variables contain summary characteristics. Most characteristics are provided as statistical summaries (aggregates) — counts, means, medians, etc. — although there are some important exceptions to this general rule. For example, the various public use microdata sample ("PUMS") filetypes (acsums, pums2000, etc.) contain datasets that describe individual persons or housing units.
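To make that layout concrete, here is a small, purely hypothetical SAS step that builds a dataset shaped the same way: character ID variables naming the geographic area and numeric variables holding summary measures. The variable names follow typical archive conventions, but the dataset name and the values are illustrative only and not taken from any actual archive file.

    * Toy dataset shaped like an archive summary file: one row per geographic area. ;
    data toy_summary;
       length sumlev $3 state $2 areaname $40;
       input sumlev $ state $ areaname & $40. totpop medianage;
       datalines;
    050 06 Alameda County, CA  1680000  37.8
    050 06 Alpine County, CA  1200  46.5
    040 06 California  39500000  37.0
    ;
    run;

    * The ID variables are character; the summary measures are numeric. ;
    proc print data=toy_summary noobs;
    run;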
This data archive uses naming conventions both for the datasets themselves and for the columns/variables that make up the rows/observations within each dataset. The most important conventions concern the variables that identify the geographic entities being summarized in a dataset. The two most important of these ID variables are SumLev, a three-character code identifying the kind of geographic entity being summarized, and State, containing the two-character FIPS state code for each entity. See the Sumlev.sas and state.sas formats for a master list of these codes.
For example, counties have a summary level code of 050, and California's state FIPS code is 06. To get all California counties from a national dataset, use Dexter to filter the dataset on Sumlev = 050 and State = 06.
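If you download an entire national dataset and prefer to do the subsetting yourself in SAS rather than in Dexter, the equivalent filter is a WHERE clause. Because both ID variables are character, the comparison values are quoted and keep their leading zeros. The library path and the dataset name uscounties below are placeholders, not actual archive names.

    * Hypothetical library pointing at a local copy of an archive dataset. ;
    libname mcdc 'C:\data\mcdc';

    * Keep only California counties: summary level 050, state FIPS 06. ;
    data ca_counties;
       set mcdc.uscounties;             /* placeholder dataset name */
       where sumlev = '050' and state = '06';
    run;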
Most of our archive dataset collections cover the entire U.S. We often do custom aggregations for Missouri just to get data by various regions within the state, and sometimes we can generate Missouri-only custom tabulations because we have access to Missouri data that we do not have for other states.
You don't have to know about SAS in order to understand and extract the data. Most of our users likely know little or nothing about SAS. However, knowledge of SAS will be of some benefit to advanced users, particularly in the Dexter data extraction application.
For example, one of the output format options (in section I of the Dexter query form) is "SAS dataset (Windows)". Whenever data are converted from one format to another, there is often some loss of information. Since the archive data are already stored as SAS datasets, users who request the data in SAS format lose little or nothing. There is also the potential advantage of using SAS procedures such as proc tabulate, proc summary, proc report, or proc SQL. Users experienced with SAS may also take advantage of SAS variable attribute statements and the various ways of specifying which variables to keep on output.
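As a rough sketch of that advantage, suppose you chose the "SAS dataset (Windows)" output option and saved the extract locally. The library path, the dataset name extract, and the variable names below are illustrative assumptions, not guaranteed to match any particular filetype.

    * Point a library at the folder holding the downloaded SAS dataset. ;
    libname dex 'C:\downloads';

    * Ordinary SAS procedures then work directly on the extract. ;
    proc means data=dex.extract n sum mean maxdec=1;
       class state;                     /* one row of statistics per state */
       var totpop;                      /* illustrative numeric variable   */
    run;

    proc sql;
       /* The same summary expressed as a query. */
       select state, sum(totpop) as state_total
          from dex.extract
          group by state;
    quit;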
Many filetypes in the MCDC collection occur in pairs, such as sf12010 and sf12010x, sf32000 and sf32000x, etc.
To see how such files are related, let's look at the two filetypes dhc2020 and dhc2020x. The entry for the dhc2020 filetype on the Uexplore/Dexter home (directory) page reads as follows:
dhc2020 — Demographic and Housing Characteristics (DHC), complete tables
This is the primary data product based on the 2020 decennial census, released in the summer of 2023. As in 2010, there was only a short-form questionnaire in 2020, so these tables contain just basic demographics (age, sex, race, Hispanic origin, household types, etc.). The DHC tables replace summary file 1 (in earlier censuses) and contain many of the same tables and variables as SF1. See dhc2020x, above, for the standard extracts based on these complete tables.
The datasets in this collection can be thought of as tables, with each row a geographic area and each column containing a geographic identifier or a count, mean, or median measure of some sort concerning persons or households. Instead of variables with easy-to-read names like TotPop, Age0_4, or Hispanic, these datasets use names such as P6I1, P6I2, P6I3, and P6I4. These particular variables correspond to the table named "P6". The letter "I" in the variable names stands for "item", so a variable named P8I4 would be the 4th cell in table P8. The MCDC data collection usually includes technical documentation for each data product that lists the table names, the variable names, and what each one means. These tech docs are long and complex, but they are consistent in structure and content across all summary data files.
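One practical payoff of the strict cell-naming pattern is that SAS variable range lists can select a whole table at once. The sketch below keeps the ID variables plus every cell of table P8, assuming (hypothetically) that the table has four cells named p8i1 through p8i4; the library path, the dataset name mo050, and the exact list of ID variables are likewise placeholders.

    libname dhc 'C:\data\dhc2020';      /* hypothetical location of a downloaded dataset */

    * p8i1-p8i4 is a SAS numbered range list: p8i1, p8i2, p8i3, p8i4. ;
    data table_p8;
       set dhc.mo050 (keep=sumlev state county areaname p8i1-p8i4);
    run;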
The companion filetype (dhc2020x in this example) does not contain the original data found in dhc2020. Rather, it contains variables derived from the summary tables. These are the data we use for our profile applications, and the data that can be used to answer most questions. Most users, for most applications, will be using an extract collection rather than a complete summary table collection. The extracts are based on the census data files; you really need some understanding of the "parent" complete-table summary file collection in order to understand an extract, but we keep the two levels of documentation separate.
American Community Survey data are organized differently. ACS summary file data are released every year in two groups: one-year and five-year data. Because of this, we don't use separate top-level filetypes for the two groups. Rather, for each ACS year we include a directory for the profiles (e.g., acs2019) with two subdirectories for the two collections of base tables: basetbls for the one-year tables and btabs5yr for the five-year tables.
The margin-of-error measures in the ACS files require that we carry these measures (columns, variables) as well as the usual table entries. Our naming convention for these is similar to our table-cell naming convention; we just use the letter "m" instead of "i". So, on the five-year ACS base tables datasets, we have the variables b01001i002 and b01001m002, which store the second cell of table B01001 ("Males") and the corresponding margin-of-error value.
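Because every estimate has a parallel MOE variable, derived figures can carry error measures of their own. The sketch below keeps the total and male counts from table B01001 along with the male MOE and computes an approximate coefficient of variation, using the standard ACS convention that published MOEs reflect a 90% confidence level (standard error = MOE/1.645). The library path, the dataset name btabs_sample, and the exact variable spellings are assumptions for illustration.

    libname acs 'C:\data\acs5yr';       /* hypothetical location */

    data male_reliability;
       set acs.btabs_sample (keep=sumlev state county areaname
                                  b01001i001 b01001i002 b01001m002);
       pct_male = 100 * b01001i002 / b01001i001;      /* males as a percent of total */
       se_male  = b01001m002 / 1.645;                 /* MOE at 90% confidence -> SE */
       if b01001i002 > 0 then
          cv_male = 100 * se_male / b01001i002;       /* coefficient of variation, % */
    run;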
Because of the large number of tables and cells in the ACS base tables, they are partitioned by topic. This results in dataset names such as usstcnty17_20, which contains, for every state and county in the U.S., all tables associated with topics 17, 18, 19, and 20. The first two digits of a table name comprise the topic code; see our general ACS filetype documentation for a list of topic codes and groups that we use here.
See the Uexplore/Dexter tutorials and examples page for much more information on how to use the data archive and Dexter.