Ten More Things to Know About the American Community Survey

Item 8: Downloading Custom Extracts from MCDC Site

The Missouri Census Data Center maintains a large collection of public data files, most of them from the Census Bureau or other government agencies. We have created a web-based set of software tools to assist users in navigating this collection and accessing specific datasets in order to do custom extractions. The navigation tool is called Uexplore, and the tool for doing the custom extractions is called Dexter (for data extracter). There is a data archive overview page that describes the system, and a data archive home page where you can start exploring the system. But such a start-from-the-top approach is too much for our purposes here. Instead, we'll focus on the less daunting task of just accessing the ACS data.

The Missouri Census Data Center ACS data collection can be broken into three basic categories:

data directly downloaded from the Census Bureau web site corresponding to the Census Bureau data profiles
data we use in our own MCDC ACS data profiles, derived primarily from the data in the Bureau data profiles, augmented with some items derived from the base tables
Base tables, the 1200+ set of tables corresponding to the ones you can see in data.census.gov when you select the detailed tables option for an ACS dataset.

We store the census profiles and the base tables data in their own separate subdirectories, leaving just our own MCDC custom profiles data to be in the main ACS data directories. We have a complete national collection (excluding Puerto Rico) of data for the survey years 2006 through 2008 with the 3-year period estimates for 2005-2007 and 2006-2008. We keep the 3-year period estimates in the same data directory with the 1-year data for the last year of the period (referred to sometimes as the vintage year). We plan to continue this pattern; so in late 2010 you can expect to see vintage 2009 data stored in an acs2009 data directory.

Where normally someone needing to access one of these datasets would come in through the archive home page and navigate their way to the directory, we provide a shortcut that will take you not only to that directory, but will also choose the relevant dataset and drop you into a web application: the Dexter query form.

At the top right corner of an ACS profiles report you should see a set of links:

The third link is to Variable metadata. Right-click on this link and open it in a new window or browser tab. You can peruse it a bit now if you like, but then return to the profile page window/tab. We'll want to return to the metadata when we are doing our extraction step, below. The final links is Extract data via Dexter; clicking on this takes you straight to the Dexter form, which should look something like this:

Click here to see the Dexter form "live", instead of just this snapshot of the top portion.

The dataset being accessed here is called usmcdcprofiles3yr, stored in the directory acs2008. The mnemonic dataset name (usmcdcprofiles3yr) can be parsed. The "us" at the beginning of the name tells you that this is a national dataset (has data for geographic areas covering the entire country); the "mcdcprofiles" part tells us what type of dataset we have (Missouri Census Data Center standard profiles data); and "3yr" indicates that this is a set of 3-year period estimates. According to the info displayed at the top of the form, this dataset has 14,265 rows/observations and 1,034 columns/variables. Each row corresponds to a geographic area, while each column/variable contains one piece of information regarding that area. These variables correspond to the data items in the MCDC ACS profile reports.

Let's say we want to look at indicators of economic well being (income and poverty measures and housing values/rents) for all counties in your state. For our example, we'll use New York as the state, but you should be able to easily substitute your own state. We want to put these data into an Excel spreadsheet so that we can do various manipulations of the data once we have them in house.

The first section of the query form is titled I. Output Formats. If we leave everything alone here, we shall be asking the application to give us a comma-delimited (CSV) file. A comma-delimited file (with filename extension of ".csv") is readily imported and converted to an Excel file. And it helps if the first row or two contain column identifier information. This is exactly what Dexter will produce for you, by default.

The second section of the query form is titled II. Filter Rows. This is the most challenging part of the form to complete, because it requires some understanding of the dataset being queried and often requires knowledge of coded values. In this case, we need to know that each row of the data set represents a geographic entity: the nation, a state, a county, etc. Try clicking on the Quick Look button at the top and it will display a page like this:

We don't see too much in the screen shot here, but if you do it live and scroll down to see the first 100 rows you should be able to recognize a pattern. The rows begin with a complete United States summary, followed by an urban portion of the U.S., a rural portion, etc. There are 16 rows/observations that are the entire U.S. geographic component summaries. These are followed by summaries for the 4 U.S. regions (rows 18-21) and then by geographic components of those regions (rows 22-65, 11 components per region).

The key here is to recognize the meaning and importance of the two fields/variables SumLev and Geocomp. One indicates the kind of geographic area (e.g., 020 is the code for a region) and the other for a special component of the geographic area, such as the urban or rural portion. We want to choose rows where the value of SumLev indicates a county-level summary, and where the value of Geocomp indicates "Not a geographic component". It's pretty obvious that the code for the latter is "00", but what about the SumLev code to indicate a county level summary? These codes tend to be well known by experienced census data users but not by casual or firt-time users. We need to have a method for helping users access a codebook of some sort to help them with such coded values. This is where the link to detailed metadata near the top of the form (just above the Quick Look button) comes in. Clicking on this link takes you to this page (abridged):

There is a lot of useful information on this page, but for now see the three highlighted links in the Key Variables section of the page. Clicking on any of these key variable names takes you to a key values page such as:

What you learn from this page is that there are 1,821 rows/observations on this dataset where the value of the variable SumLev is "050" and that this code means county. Clicking on the geocomp and state key variable links will produce similar "Values with frequencies..." pages that will know the codes for those two variables. You can now proceed to complete the data filter as described.

We define a filter with three conditions. A row will be selected if and only if all three of the conditions are true. The first condition is that the SumLev variable must have a value indicating a county-level summAry. The second condition is that the row not be a special geographic component — we just want the "Not a geographic component" summaries. The third and final condition is the state we are interested in. The variable State contains FIPS states codes. The code for New York is 36. We click on the down arrows and select values for the variable names and the logical Operator values, and we type in the desired values in the text boxes. Here is what the filter section looks like:

Note that we leave the last two rows as-is and we do NOT do anything with the check boxes for inserting parentheses. These are rarely used.

We now proceed to the III. Choose columns (variables) section. This is where you will probably want to refer back to the metadata file linked to from the profile report (which you accessed above), where you'll see:

It's a table with rows that parallel those in the profile reports. The key fields here are the variable name and label columns. So what is a denominator? It is the variable that we use as the denominator in calculating a percentage variable. For example, the variable Age0_4 has a denominator variable of TotPop; this tells us that the corresponding percent variable (which will be called PctAge0_4) is derived by taking Age0_4 as a percentage of TotPop. You should use this report to help develop a list of the variables you are interested in keeping on your extract.

Section III of the Dexter form is where you specify what columns/variables you wish to keep. Initially it will look like this:

Note that Dexter knows enough about the data to be able to segregate identifiers (mostly geographic and time-period codes) from statistical data so that you select these item from separate lists. The Numerics select list is displaying variables that are closely related to the ones shown in the metadata codebook. In addition to the base variables (i.e. the actual counts) and the corresponding percent variables, you also see variables with "_MOE" suffixes in their names; these are the margin of error measures.

The numerics list is very long (about 1000 entries) and it can be pretty tedious to scroll through and find just the items you want. There are some tools that can be used to make this process a bit easier. Notice the box labeled Filter by regular expression — here you can enter strings that the will be used to modify the numerics select list, keeping only entries that match the expression you enter. The simplest and most useful kind of expression to enter is a series of keywords separated by vertical bar symbols (without blanks). For example, we can limit our search to variables that have a name or label containing any of the words income, poverty, poor, or population by entering "income|poverty|poor|population" (without quotation marks).

There are two sections of Dexter customizing parameters left on the form, but none of them are essential (things like report titles, sort order, variables formats or labels, etc.) So just go ahead and this point and click on the Extract data button at the bottom of Section III. If all goes well you should be presented with an intermedit Output Menu page like so:

What happens when you click on the Delimited File link is browser dependent. Some browsers may display the CSV as text in a browser window, while others may download the link to your desktop or whever you choose. We recommend that the right-click on this link and ask to open the link in a new window or tab. That way you still have your output menu page displayed, and you can use the browser Back button to return to the form, which is important should you decided to modify your query.

We have a file in CSV format, with the top two rows clearly different from the rest. The first row contains variable names, the second row contains variable labels. Rows 3 and over contain the actual data. The values are separated by commas and in most cases enclosed in double quotes (values that contain commas are so enclosed). We can save this file to our desktop and then open it with Excel:

This completes our example of an extraction from an MCDC profiles (standard extract) dataset. We also provide access to the detailed tables via the same Uexplore/Dexter tools. We shall not be doing a detailed description of how this would be done, but we can tell you where to start: at the Uexplore home page on the MCDC web site. Click on American Community Survey in the index list at the top of the page, then click on the acs2008 link (or choose a different vintage year if you prefer), which invokes Uexplore and lets you browse the ACS data collection (directory) for that year.

Click on the basetbls link to access the subdirectory containing the detailed base tables. In that directory you will see the names of our base table datasets such as ustabs21_24.sas7bdat and ustabs21_243yr.sas7bdat . These datasets will contain, respectively, all base/detailed tables in the 21 to 24 topic series for all geographies in the U.S., single year (2008) and 3-year period estimates (2006-2008).

The important difference between doing an extraction from one of these base tables datasets vs. a smaller mcdcprofiles dataset is that it is flagged as a table-based dataset, which means that in Section III of the Dexter query form you get a table select list where you would otherwise expect a numeric variables list. So you get to select entire tables rather than the individual variables comprising those tables.

We have created a separate Powerpoint tutorial, Accessing Large Table Files With Dexter, to help users get started with table-level access. The Dexter mechanics are really not much different.

Return to Ten More Things ... page.