Author: John Blodgett, OSEDA/MCDC
Summary: We came across this article on the MCDC web site - Measures of Income in the Census which includes a nice table at the end of it that does a comparitive summary of seven different common measures of economic well being. We are interested in seeing how we are doing within our state (Missouri, of course, but this can easily be modified to be whatever state you are interested in) in terms of some common measures of economic well being. We'd like to get a report and an Excel spreadsheet that display these seven indicators for each city and county in the state, using the most recently available public data. We understand that the Census Bureau has begun releasing such data based on the American Community Survey so that this is a reasonable request. In addition to the 7 economic variables, we would also like to see data on the number of persons, households and families in each area, since the economic measures are based on these universes. Finally, we would like to see these data for the state (totals) and the nation for comparison purposes.
Data Set Accessed: acs2008.usmcdcprofiles3yr (Easily adapted to work with acs2009).
Inputs: Dexter Query Form (live and modifiable!) | Detailed metadata page | SumLev values
Outputs: Dexter Summary log | csv (comma-delimited) | report(html) |
Related Queries: none
Degree of difficulty: 3 or 4. The filter in Sec II requires use of parentheses for logical grouping which may frighten some users. And the use of the "third way" of specifying variables in Sec III is not trivial. We do use some Advanced options to enhance the report format but these are not hard.
What You Need to Know (Broad, Common Knowledge)
The Census Bureau is generally accepted as the best source of such data, at least in the public domain. There are many commercial purveyors of demographic / economic data who will be glad to sell you such data. But if you are like us - poor and skeptical - you would prefer something coming almost straight from the horse's mouth and with no fees involved. Until the advent of the American Community Survey, most such data were only available based on decennial census surveys. There were Per Capita Income estimates for governmental units when they were required for allocating federal revenue sharing dollars not too long ago. And there have been Median Household Income estimates for states and counties coming from the Census Bureau's SAIPE (Small Area Income and Poverty Estimates) group for quite some time. But these were/are largely viewed as statistical best-guess figures and only available for limited kinds of geography. With the flow of data from the American Community Survey this has totally changed.
There are entire web sites and conferences, hundreds if not thousands of published articles and books on the subject of the American Community Survey (ACS). You do not need to be an expert on the ACS in order to be able to access and make good use of the data produced as a result of it. Unless you are a data professional with a need to access such data on a regular basis, you may find that trying to wade through all the information on the various web sites (such as ours - see the American Community Survey link in the Navy Blue navigation bar to the left on our MCDC home page; or go straight to the official site at the Census Bureau - http://www.census.gov/acs/www/ ) . What you may discover if you spend a few hours or weeks exploring the ACS web site and then the American FactFinder site trying to access ACS data, is that while there are lots of ways to access the results (i.e. the published summary data) of these surveys, there are no simple tools created by the Bureau or otherwise widely available that can handle this data request. At least none that we are aware of (in July of 2010; the Bureau is always adding capabilities to their FactFinder site so we expect this to change some day.) It is relatively easy to access these data one-geography-at-a-time via the Census Bureau's (or the MCDC's) ACS Profile products. But being able to get the specific data items for specific geographic areas or types of areas is a different matter. This is where Dexter can help.
Navigating the Archive (how to get to this Dexter page) You can arrive here via the usual Uexplore navigation path by starting at the Uexplore/Dexter home page and following the links to the American Community Survey major catetgory and the acs2008 specific filetype (data directory) and finally to the Datasets.html page for that subcollection. But there is another way. Many (probably most) users will come across ACS data on the MCDC web site by following the link to our ACS Profile Reports web application, found as the first line of the Quick Links menu box on the right side of most major MCDC web pages, including the home page. So we can navigate from there:
Click on the Extract Data Via Dexter link (or, better yet, right click to open it in a new tab or window so you can establish a live session to follow along in). The data set to be "queried" has the geeky name
usmcdcprofiles3yr. (In case your care: us indicates a national file, mcdcprofiles tells us what kind of data we are dealing with, and 3yr is used to indicate that the dataset contains 3-year period estimates.)
What's In The Data Set
The data set has over 14,000 rows and 1050 columns (variables). Each row represents a geographic area (the nation, the rural portion of the nation, a state, a county, a city, etc.), and each column contains some information about that area. The key to getting a better understanding is to take advantage of the detailed metadata page that has been created for it. Right click on the detailed metadata link at the top of the page and then choose the
Open in New Window (or tab) option so you follow along with this discussion.
The top of the metadata page has the short label for the data set: MCDC profile extract for all US areas with 20,000+ population based on 2006-2008 surveys . This is meant to convey to you that this data set is the source of the MCDC ACS Profile reports and contains data for all ACS areas in the entire U.S. based on data collected over the 2006-2008 time period. Notice that the entry for
Units: ends with etc , meaning there are too many to mention here. But you can get detailed information about what kind of geographic entities are included in the data set by going to the Key variables section and clicking on the entry there labeled sumlev. Which displays this valuable web page:
What does it mean?
010=United States (N=17) tells us that the variable SumLev contains the value
010 on 17 observations in this data set. The code's value label is "United States"; so when a row has the value
010 for this variable it indicates a nation-level summary. The 4th line on this page tells us that the code for state level summaries is 040 and this code occurs 599 times on the data set, and the line below that says it contain 1821 county level summaries as indicated by a sumlev code value of
It is important that you understand the purpose of the Key variables section of the metadata pages; they are the keys to helping you code your query filters in most cases. They tell you what values you need to enter in the critical third column ("Value") in Section II.
Usage Note(s) section of the metadata page has a number of items worth noting. We recommend you read them carefully before proceeding. Reading these and then following the link near the bottom of the metadata page to the Variables.pdf file should give you a good handle on what to expect from this data set. Notice that while there are about 1050 variables, 14 of these are identifiers (geographic and time-period) and the numeric variables mostly appear in triplets (the estimate variable, the estimate percentage and the estimate Margin-of-error). For example:
Age0_4 PctAge0_4 Age0_4_MOE. You will see this pattern consistently as you go through the data set. What this means is that the data set contains about 340 distinct data items, most of which have 3 variables asscociated with them (variables such as means and medians do not have corresponding Pct variables).
Coding the Query Filter
Note that the complete dexter data query form (DQF) is contained in the dqf (upper right) frame. You can modify this query if you'd like and then hit at Extract Data button to run the query ("live").
Section II of the Dexter query form is where we can specify which rows/observations we want to include in our extract. Since the rows in this data set correspond to geographic areas it means this is where we get to specify which of those we want. Which is:
This would be a pretty straighforward filter were it not for the extra added complications having to do with Geographic Component summaries. Because these special special sub-geography summaries occur at the state and nation level on this data set and because we do not want to see them for this extract, we need to add something to filter them out. Our filter is going to consist of 3 parts:
- United States totals
- State totals (Missouri)
- County and place level data (Missouri)
We use four of the five available rows in Section II of the DQF.
- Select a row if it is a nation level summary.
- Select a row if it is either a state, county or place level summary AND it is in the state of Missouri.
- Add the condition that it NOT to be a geographic component summary. This condition applies in conjunction with each of the other 2 conditions.
The trickiest part of the condition involves the use of an extra set of parentheses to specify the order in which Dexter is to combine the pieces of the logical query. Normally complex conditions (those that specify multiple conditions) are evaluated in a top-to-bottom order except that "AND'ing" always precedes "OR'ing". So normally it would combine (using the AND operator) the conditions from rows 3 and 4 before it would OR that result with the result of ANDing rows 1 and 2. Which would give us all cases which were Missouri state, county or place summaries and would also give us Nation level summaries that were not geographic components. This is close to what we want but not exactly. It turns out that there are also geographic component summaries at the state level, so allowing Dexter to do the default logic processing would mean that the geographic component "filter" would not apply to the Missouri selections. The way to get around this is to use parentheses to control the order of logical expression evaluation. You should note that there is a checkbox to the left and right of each row, displayed next to a left or right parenthesis. By checking the left parenthesis box in row 1 and the right parenthesis box at the end of row 3 we are specifying that the logical expression based on these 3 rows should be evaluated first, and then combined with the expression from row 4. Which is exactly what we want.
- In the first row we specify that the State code should be that of Missouri (
State Equal to (=) 29.
- In the second row we add the condition that the summary level code must be one of the 3 values specified in the Value text box (
SumLev In List 040:050:160).
- The 3rd row is where we add the condition saying we are interested in the nation level summary (
SumLev Equal To (=) 010). Note that we have also checked the
Orradio box between between the 2nd and 3rd rows. This directs Dexter to evaluate the conditions in rows 1 and 2 to see of both are true (since the default AND logical connector applies between rows 1 and 2); it then evaluates the expression in the 3rd row and "ors" that with the previous compound condition. The result is TRUE (and the row is selected) provided either of the conditions is true: either it is a Missouri state, county or place level summary, OR it is a nation level summary.
- The 4th row is where we add the special overall qualifier that says that we do NOT want a geographic component summary. The geocomp variable contains the geographic component code, and a value of 00 means it is not really a compnents. So:
Geocomp Equal To (=) 00
Conceptually simple but practically tedious we come to Section III of the DQF where we want to select our variables. What do we need? From the ID variables list we get to select those data items that will allow us to identify the rows of our output tables. We see that we do not select the years and period variables (these are constants but could be chosen to carry along as extra documentation) and then choose the next 3: SumLev, geoid and AreaName. We know about SumLev from the previous discussions. Geoid is the geographic key provided by the Census Bureau to uniquely identify each geographic area. It actually includes the SumLev code as its first 3 characters. The name of the geographic area being summarized is what most users will use to make sense of the data. Choosing from the Numerics list is where things are not nearly so simple.
Looking at our specs, we have specified seven economic indicator variables and three more universe-size measures of interest. How do we know where these variables are within the select list of over 1000 variables? One good way to know ahead of time the names of interest on these data sets is to take advantage of the linkable metadata while viewing the ACS Profile report. When you see an item within a table of that report that you want you can right click on the sub-table heading and be taken to the detailed metdata page and specifically to the portion of that page dealing with that table. From there it is easy to see and record the variable names associated with the items of interest. You can also go the Variables.pdf file and do searches on it to rather easily find what you want in most cases. Old-timers might even resort to printing off this report and circling the items of interest with a felt tip pen. However you choose to do it, you should be able to come up with a list of the names of the varibles you want. Or, you can always guess. You could guess that the variables for Median Household Income would have the word "Median" in its name and/or label, and that the poverty rate would contain the word "poverty" or "poor". We took advantage of the
Filter by regular expression:feature located just below the Numerics select list to help us find the names of the variables we wanted. We just entered keywords separated by OR symbolds (vertical bars) and then clicked the Filter button. Doing so resulted in getting a fresh select list that showed only those variables matching the expression entered. We could then use the cntl key and the mouse to select our variables off the menu. But this would require that we get all 10 or our variables displayed at once and would require careful clicking. And the result would not be viewable in the scrollable image documenting this query. So, we opted to take advantage of the seldom-used "3rd way" of specifying what variables we wanted. We simply typed in their names in the text box provided just below the variable select lists. Of course, the first time we did this we entered
totpopand so we had to go back and fix that. This takes only a couple of extra seconds provided the browser remembers our specs when we use the Back button to return to it (which seems to be always be the case with Firefox, but not always with IE -- sometimes we have to start over because IE does not retain our previous selections).
A good hybrid approach to doing a complex variable list is to choose as many as you can from the Select lists in Section III. Run the query to view your results. Follow the link to the Summary Log page (note we provide a link to this page for this query in the Outputs: section of this page.) This file contains a list of the variables you selected; you can do a copy-paste on this list, copying from the Summary Log page and pasting into the "3rd way" text box at the bottom of Section III of the dqf (Dexter Query Form).
As usual, nothing is too difficult in Section IV). Just a title and subtitle to label the output report. If you decide to tweak the query by changing Missouri to another state or states don't forget to modify these titles to reflect your changes. If, for example, you change the state code in the first row of Section II from 29 to 06 then be sure to change the word "Missouri" in the title value to "California".
In the Advanced Report Formatting Options section (V-b) we checked the Use variable labels as column headers in reports box. That causes report column header to say "Median Family Income" instead of MedFamInc, etc. We also entered values for the next two text boxes in section V-b, specifying By variables and ID variables for the report. You can see how this works pretty much by looking at the report. The specified By variable was Sumlev and you see how it gets displayed in a "by line" rather than as a column in the report. Note that the file has to be sorted by these variables in order for it to work. This dataset is already sorted by sumlev (if it were not you could make it so by specifying SumLev in the Sort option text box in Section IV). The two specified ID variables are shown at the far left of the report lines and have special color-background formatting to highlight them, These are the variables that identify the geographic area being summarized on the line.
Questions, Exercises (with alternate outputs)
- Modify the setup to do a comparable report for your state.
- Modify the setup so that the report would be sorted by descending Mean Poverty Ratio (or any of the other six key measures) within each geographic summary level.
- Modify the query to look at data for the nation and for state level summaries excluding Puerto Rico. Use single-year data from 2008 instead of 3-year period estimates. Output to a pdf file with landscape orientation and sort by descending Mean Poverty Ratio. Select the state code instead of geoid and use it as an ID variable in the report. Add the mean poverty ratio variable to the ID variables list. See outputs: report (pdf) | Saved query file
- Generate a report showing these indicators for each of the PUMAs in your state. If you do this for Missouri you'll see that the PUMA areas have been assigned meaningful labels to help the user see the data. Consider creating your own set of PUMA AreaName values that you could combine with the output of such a query to create PUMA level reports that someone might actually be interested in reading. report (html) | Saved query file
Note that you can use either single-year or three-year period estimates for this query. For our sample solution we accessed the single-year data as you'll see if you load the saved query file via the link, just above. Notice that it shows the dataste name as
usmcdcprofiles3yrsuffix that would indicate the 3-year period estimates) but the observation count is the same 14,265 value that we saw for the 3yr dataset (there are only 7246 observations on the single-year dataset). So what gives? It is because we took advantage of the new (2010) feature in dexter that allows you to edit the name of the dataset being accessed (at your own risk - see the Online Doc for the limitations on this). When you do this is tells the program to pass a different dataset name to Dexter but it does not edit anything shown on the page, all of which was generated based on characteristics of the orignal dataset, and this includes the row and observation counts displayed at the top of the form. If we had not done it this way - if we had gone back out and selected the new dataset via uexplore and come back in we would then have correct counts, etc. but what we could not have is all the choices that we had already made on the form. We'd be back to an unimproved dqf and would have had to start over specifying what we wanted from scratch.
- Where would you go to look for equivalent data from the 2000 decennial census so you could look at trends?
- Census Geography and Summary Levels from the MCDC "All About..." series.
- Measures of Income in the Census, also from the MCDC "All About..." series.
- The MCDC American Community Survey home page.
- Metadata for ACS Profiles as linked to from the ACS Profiles reports and describing the variables available in the usmcdcprofiles datasets.
- X Samples Index Page