Dexter Xsample: Detailed Query DescriptionQuery ID: acsbasetabs1 | ||||||||||||||||||||||||||||||||||||||||||||||||
Before You Can Begin With Uexplore/Dexter you have to know how to find the data you are looking for. That may be the most challenging aspect of utilizing these data - although once you find out about the various tools that are available to assist you, it is not nearly as hard as it may seem. What you are dealing with here is American Community Survey detailed (also referred to as "base") tables. There are over 1100 tables in the collection, and they can be fairly complex. Many of them have several "dimensions", such as a table summarizing the population by age, race and sex (3 dimensions). Tables also have universes associated with them that must be considered before deciding if they are providing information about the persons or housing units or ancestry reportings in which we are interested. You also need to be sure you know precisely what data you need (this is pretty much always the case, of course, but it can be particularly important when dealing with this kind of demographic detail). In this particular example, the problem statement is fairly simple. You do need to consider what definition of "white" and "black" you'll use, since there are at least two ways to look at these categories under the new rules that allow users to identify themselves as being more than one race. For the sake of this example (and because that is how the data are available) we shall use the more exclusive "race alone" definitions (as opposed to the inclusive "race alone or in combination" versions). We'll use the American FactFinder web site to search for relevant tables. If you are not familiar with this site it would be worth a few hours of your time to explore its many features. The URL is http://factfinder.census.gov/home/saff/main.html?_lang=en. To access detailed data from the ACS choose the "Data Sets" option on the menu down the left side of page, and then American Community Survey from the resulting drop-down. You'll then want to accept the default choice of the most recent set of data (2006). While you can do a lot of things with this tool, all we are going to talk about here is how FactFinder can be used to assist you in finding tables related to specific subjects or that have been tagged with certain keywords. Having chosen a specific data set in AFF you will see a light gray (background) area on the right side of the screen containing a "Select from the following" menu. One of the menu options is Detailed Tables; that is what you should choose (click on).
There are 3 tabs across the top of the screen allowing you to "Choose a table selection method". The default is "Show all tables", which is both the default and the least useful of the 3 (unless you already know what table you need, which is rare). Clicking on the "By subject" tab lets you choose a subject from a drop-down list (which does not appear immediately - you have to click on the little down arrow at the right of the select box first). From this drop-down you get to select a subject - try choosing "Poverty (Families and Individuals)" and then clicking the Search button. This results in the display of another drop-down select list of tables relating to the chosen subject (poverty); this list appears in the select window just below the Search window. There will be over 100 tables to choose from; they report poverty in a lot of different ways and in conjunction with a lot of other characteristics such as age, race, tenure (rent vs. own), disability status, education, etc. So this is still a pretty big and complicated list to have to deal with. Fortunately, there is a 3rd tab - "by keyword". Choosing this tab causes the screen to display a blank text box with the instruction "Enter a keyword and click 'Search" (referring to the Search button just to the right of the text box). There is also a link to "Search Tips" just to the right of the Search button, and you really need to click on that link to find out exactly how this works. It's fairly simple most of the time. Because most of the time what you are looking for is a table that has data crossed within two or more categories. Try typing in the box and clicking Search. In response FactFinder will search its metadata, looking for tables that contain these keywords in the table title or one of the row headers. In this instance it should find only two tables that have been tagged (in the Bureau's metadata file which we have found to be very good, though not perfect) with the 4 requested keywords:
Note that as you search for the appropriate table(s) you can view more detailed information regarding exactly what a table contains by highlighting it within the Selected tables box and then clicking the What's this?
button. A new window will open ("pop up") and will display a detailed description of the table that looks like this:
(Not shown is the vertical scrollbar - this is not the entire table; the pattern repeats itself for persons above the poverty level). You could at this point go ahead and select the two tables in FactFinder and then go back (by clicking on the "Geography" bread crumbs link across the top of the screen) and choosing your geography. (It is strongly recommended that you use these bread crumb links to navigate between the various screens of the FactFinder query process and that you avoid using your browser's Back button to return to previous screens.) You would be able to first display and then save-as the results to a csv or an Excel spreadsheet file. You would then do your "postprocessing" in Excel. But we are going to go ahead and show how to extract the data via Dexter. Navigating the MCDC Archive Via Uexplore to Access the Data
Annotated Dexter Query Form (As filled out to define this query) Section I: We select both CSV (the default) and SAS dataset as output formats. We can then do our postprocessing in either Excel or SAS, depending upon software preference. Note the link to Detailed metadata if we need it. (To take us to page that define the values of key variables such as the summary level and geographic compondent codes.)
Section II: The filter specifies that we only want summary at the nation and state levels. The summaries at these levels include special summaries for geographic components, which we are not interested in. So our filter has 2 conditions:
Section III: We select the desired data elements (variables/ columns). For identifiers all we need is the Name of the area (Areaname and Name are the same value unless we are dealing with a geographic component summary, and we have filtered those out for this query). Selecting the items from the Numerics list is the more challenging task. This is where we need to recall what we discovered during our "pre-Uexplore/Dexter" step visiting American FactFinder to search for relevant tables. We know from that step that we need tables B17001A and B17001B. Scrolling through the list of numeric variables (with over 7500 table cell items) note that they are in archive standard table-naming convention and are sorted by the table ID. So we can scroll down looking for the entries for table B17001A. Once we find the first cell (variable) in that table we click on it to select it. Then we keep scrolling down through the 59 items of that table and come to item B17001Bi1 (the 1st item in table B17001B) and then keep scrolling to the end of that table. We hold down the shift key and click on the last item in this table - B17001Bi59, which selects all the variables in the list starting with B17001Ai1 and ending with B17001Bi59. The screen snapshot only shows part of the selected items.
(Feel free to ignore this discussion of an alternate way of choosing the variables). Section IV and Section V: These sections are not used for this query. We use the Extract Data button at the bottom of Section III to submit the query. Output The query takes about 10-15 seconds to run, since it has to access a fairly large data set (about 290M). The next thing you should see on your screen is the standard Dexter output menu with links to the Summary log file and to the two requested output files: a csv file and a SAS dataset file. What we want to do now is discuss how you might complete this example by postprocessing the results using your favorite software tool. For the large majority of you, that will probably by Excel and for an important minority it might be SAS. If neither is your favorite tool then you may well want to stop reading at this point. Postprocessing with SASOur assumption is that if you are reading this section that you are interested in processing these kinds of data using the SAS software package. We assume you have access to SAS and that you are familiar with (or are willing to become familiar with) the tabulate procedure in SAS. This is one of the most powerful tools within SAS for handling the kind of data that can be extracted from detailed Census files. We shall be employing a strategy that we commonly use, whereby we take the summary data (pre-defined tables) and use a data step to convert the aggregate data into a form that looks more like microdata. In this new dataset, each observation represents a cell of data with identifiers corresponding to the catgegories of interest with a single numeric variable containing the count of persons in that cell. Having done this we then pass this "micro-fied" data to the tabulate procedure and let it do the work of turning it back into a table with just the categories and statistics of interest.The process begins, of course, with capturing the Dexter-generated SAS dataset on our local system so we can access it with our local SAS software. To this we simply right click on the
To access the data from SAS you need to specify the directory where the data are located as a SAS data library. We code the following statement to do this:
Now that we know what variables we have and what they mean, we can think about how we can write some code that will turn all these cell counts into a form that can be readily re-tabulated into the exact categories of interest. We basically just need to get rid of a lot of age detail. Also, what we want are just poverty rates -- defined as the number of persons below the poverty level as a percent of all persons for whom poverty status was determined. (You get the latter -- the denominator used in the poverty rate calculation -- by adding the persons above the poverty level to those below it.) Here is (some of) the code for the SAS data step that takes our extracted data and creates the tabulate input dataset:
This program is a classic example of using SAS to "reshape" a data set. What we have are 2 very large tables with cells containing counts of persons. Most of these cells have associated with them an age (single year), sex and poverty status (above or below the poverty level). Each of the two tables (b17001a and b17001b) has a race associated with it. And, of course, each observation has a particular geography associated with it (either it's the US totals or the totals for a specific state). We use a subsetting if statement to delete the summary observation at the nation level. We said that "most of" the cells have the various attributes associated with them; some of the cells represent subtotals and thus are lacking one of more of the characteristics. We ignore these subtotal cells (SAS variables) and focus only upon those that indicate an age, a gender and a poverty status. What we now do is create 4 new variables (age, race, sex and poor) that are stored as character strings (except for poor) and represent categories, and a single new numeric variable, Count, that contains the count of persons associated with these categories. The variable poor is a special category variable that is assigned a value of 1 to indicate a cell of persons who are classified as being poor, and a value of 0 if they are not poor. The reason we use a numeric variable here instead of a $1 character variable has to do with a coding technique that we use in our Proc Tabulate step. Note that the varible Age gets assigned values of "1" through "3" representing the three cohorts we are interested in. "1" represents the 0 to 17 cohort, "2" the 18 to 64 cohort and "3" the 65+ group. We assign race='w' to indicate we shall be outputting counts of persons from table b170001a, which contains counts of white persons only. We then assign sex='m' to indicate we are going to output a count of white males. Finally we assign poor=1 to indicate that we are dealing with persons who are indeed poor. The result of these 4 assignments says that we are talking about white males aged 0-17 who are poor. How many such persons are there? If your understand the coding of the summary tables on the acs base tables dataset then you should understand that the variable b17001ai4 is the number of white males aged 0-4 who are poor, and b17001ai5 is the number of white males aged 5 who are poor, etc. (see the Sasout.Xtract Properties window image above; the person who was writing this code had that window or its equivalent open and was using it as they coded this statements).
The data step in this program basically takes the date from the two tables apart and creates a dataset that looks almost like microdata except that it has a Count variable that says this is how many we have in these 4 catgories (5 if you count State). The Proc Tabulate step lets us take this "micro-filed" data and put it back together again so we can see the bigger picture but with less detail regarding age and with poverty rates calculated which we did not have before.
Here is the code we use:
Think about a simple case where you were looking at 50 cases within a category grouping and of those 50 cases Poor had a value of 1 in 15 of them and a value of 0 in the other 35. What percentage of those 50 cases were poor? (We are ignoring the use of the Freq variable here but that doesn't change the logic.) The answer is pretty clearly 30% (15 out of 50) and if you take the mean of the variable Poor you get a value of .15 (mean=the sum of the values divided by the number of them so you sum the 1's and divide by the total number of cases; this is equivalent to getting the decimal portion of the time that the variable is true (has a value of 1). Converting this to a percentage is just a matter of shifting the decimal point and maybe displaying it with a "%" symbol. This is exactly what happens when we use the percent format code as in First we must define our ranges for each category. Tables 1 and 2 below shows the
ranges we will use for each ethnic/age group and poverty status.
Now in the spreadsheet, move to a clear cell to the right. I picked
column DR. In row 1 and row 2 add the headers I created in Figure 1
starting in column DR.
Figure 1
Next, we will define the formulas on row 3 (United
States) in the excel spreadsheet. (make sure you are in row 3, this
is important!).
Now type
=SUM(E3:J3)/(SUM(AH3:AM3)+SUM(E3:J3)) in that cell and press the
enter key as shown in Figure 2.
Figure 2
If all went well, the value of 0.1332342
should now up as shown in Figure 3. If that value does not show up, or you
get an error, check the syntax to see that you have an equals sign at the
beginning, and that you have parenthesis in the right places.
The cell that contains 0.1332342 contains the poverty rate for all white
males aged 0 to 17 in the United States.
Figure 3
Now let me explain what this formula does. An element such as E3, refers
to the Eth column and the 3rd row. The equal sign states that we want to
write a formula. SUM(E3:J3) says to add the values in cells E3, F3, G3,
H3, I3,
and J3 together. The / states we want to divide.
(SUM(AH3:AM3)+SUM(E3:J3)) says we want to sum up all values in the cells
AH3, AI3, AJ3, AK3, AL3, and AM3 together, and to sum those with the
values in cells E3, F3, G3, H3, I3, and J3. Because we have parenthesis
around SUM(AH3:AM3) + SUM(E3:J3), this addition takes priority and is
calculated before the division occurs.
Move over to the right one cell (still in row 3), and type
=SUM(K3:03)/(SUM(AN3:AR3)+SUM(K3:03)) which is
the next formula in the poverty rate row in Table 1 above and refers to
white males aged 18 to 64. If you typed it right, you should
have the value of 0.0846298 in the cell as shown in figure 4. If not,
please use the advice above to check for typos.
Figure 4
Use the formulas in the Poverty Rate row in Table 1 and 2 to calculate the
poverty rates for the rest of the categories for the United States. When
finished your spreadsheet should look similar to Figure 5.
Figure 5
Once you finish typing the formulas in the cells for row 3, we can
easily calculate the values for the rest of the rows. Right click on the
cell with 0.1332342 (the cell that contains the
poverty rate for white males aged 0 to 17.) and select copy as shown in
figure 6.
Figure 6
Now select the cell right below it in row 4 and do a group select by
pressing and holding the shift key, and press the down arrow until all
cells directly under the cell containing 0.1332 are selected as
shown in figure 7 until you reach row 53 as shown in figure 8.
Figure 7
Figure 8
Now do a right click and select paste from the menu as shown in figure 9.
Figure 9
All the values in the selected cells should be displayed as shown in
figure 10. What occurred here, is
that the formula was copied and was pasted to the other cells. Excel was
smart enough to know to look at the cells in the adjacent columns to do
the calculations itself.
Figure 10
Now repeat this process (of copying and pasting) for
the other 5 cells in which we have established formulas in row 3. When
you are finished your results should look similar to what is shown in
figure 11.
Figure 11
To format the columns by adding a percent sign, etc, we need to select all
cells we just created and right click and select format as shown in
figure 12.
Figure 12
A window
appears, select Percentage as shown in figure 13, and press the okay
button and the results
should be similar to what you see in figure 14.
Figure 13
Figure 14
We now need to hide the data columns B through DQ. To do this we need
to select those columns as shown in Figure 15.
Figure 15
Now right click the mouse and choose hide from the popup menu. If you do
this correctly you should get something similar to what we see in figure
16.
Figure 16
I leave it as an exercise for you to calculate the poverty rates for
black males and females. The steps are the same, the only thing that
changes are the columns you will use in your formulas. Use the results
from the SAS demo to see if your answers are correct.
Exercises
|
||||||||||||||||||||||||||||||||||||||||||||||||
This file last modified Wednesday January 16, 2008, 08:46:53
|