Dexter Xsample: Detailed Query Description
Query ID: casrh05
Technical Note: The dataset used in this demonstration is no longer the current data of its type. Normally, the dataset used in the example would be moved to an archives directory and would not be accessible. But we have specially kept this dataset in the active directory so that users who want to run the xsample code will see results comparable to what is shown in these screen shots. But any user wanting to actually run a comparable extract should understrand that they will be looking for a dataset such as mocasrh07 rather than mocasrh05 (depending on when you are reading this.)
SummaryWe use a rather complex filter to select several summary types from this somewhat complex data set. The summary types are all the possible combinations of the 4 demographic categories: Age, Race, Sex and Hispanic. Made a bit more confusing by the fact that there are really 2 sets of race categories: the race-alone and the race-(alone or in combination) categories. Accessing this data set is a bit like accessing a microdata file. But the rows are geographic summaries of counties for the specified demographic combinations. Here is the probem statement: Create a summary report at the state level (i.e. sum across all the counties) with 2000 and 2005 population estimates for the total population, for the race-alone categories, the total population aged 65 and over, and the 65+ population by the 7 race-alone categories. When displaying the variables SumType and Race use custom format codes to display the meaning of the codes rather than the codes themselves. The key to understanding how to code the query, as is often the case, is to understand the input data set. There is pretty good metadata for the data set, but it still takes some careful perusing of the data to really see how things work. Use of the Quick Look button is suggested as a helpful way to gain insight.
Filetype accessed: popests - recent population estimates from the Census Bureau.
Data Set accessed: mocasrh05 - this is the Missouri data set, one in a collection of 51 state level data sets. Users who want to see something similar for Pennsylvania, New York or Florida (to name just 3) have only to change the data set to pacasrh05, nycasrh05 or flcasrh05, respectively. (And when the 2006 numbers are released just change all these "05"s to "06"s to get the more current estimates.) See detailed metadata for this data set.
Dexter features used: Quick Look (to preview the data) ; complex filter utiltilizing the parentheses check boxes to control logic; the aggregation feature to sum across geography and age; specifying custom formats for variables to appear in the output reports so the user does not have to look up their meaning in a codebook.
Can be readily modified to:
Degree of Difficulty: 5 - this is a good test of both your Dexter skills and your ability to make sense of a rather complex input data set.
A good way to get a handle on what these data look like is to use the "Quick Look" button near the top of the page after entering a value of 1000 in the box near the bottom of Section II, limiting the number of observations to be displayed to 1000 (actually, the default value of this parameter when using Quick Look is 100, so we are using it to increase rather than limit the amount of output.) We cannot fit all 1000 lines here but it starts out looking like this:
One of the Usage Notes entries on the metadata page has a reference (link) to the SAS value-labels code. You should right click and open that link in a new window so you can easily reference it. You should spend as much time as you need studying the Quick Look display to make sure you understand what each row of the data set as displayed is summarizing.
Annotated Dexter Query Form (As filled out to define this query)
Section I: We select the HTML option in the Listing/Report row and indicate we do not need a Delimited file by specifying "none" in that row.
Section II: The filter here is the most challenging part of this query. We actually want data for 4 summary types:
Another important aspect of this filter is that for two of the SumType categories - t and r - we want all rows that have those values. But for the other two SumType categories (the two involving the Age dimension, a and ar, we only want those rows when the Age variable indicates a summary for persons who are aged 65 or older.
Our filter then has two parts to it:
Here is what the filter looks like:
Notice the peculiar value of 14 entered as the Value in the 3rd row of the filter. We want to filter so that when age is relevant we only keep values for age 65 and over (not age 14). But you need to be aware that Age is not an actual numeric age value but instead is a character string code that indicates an age interval. You need to look at the codebook page to see that a value of 14 for Age means we have a summary for persons in the age interval 65 to 69. The codes 15, 16, 17 and 18 represent the higher intervals. Thus choosing codes with values 14 and over is equivalent to choosing age categories of 65 and over.
Notice the choice of the Or button between the first two rows, indicating that we have two sufficient rather than necessary conditions here; if either one is true, we keep the row. Note also the check marks in rows 2 and 3 of the parentheses boxes to indicate that the And condition between rows 2 and 3 is to be done before the Or condition between rows 1 and 2.
Section III: Here we select our variables/columns for the report. Not much to keep; just two identifiers and two estimate values. Note that we are not keeping Age; we use it as a filter but it does not appear in the report. We shall have to use titles to alert the user regarding what we have done to filter the data based on age.
Section IV The subtitle is used to convey to the user that the two SumType categories involving age have been filtered to include only the 65 and over population.
Section V The box is used to specify that we want our data aggregated by the summary type and then by race within the type. Since there are no geography variables included in this list and since all of the observations are county level summaries for Missouri the results of the aggregation will be state totals. Dexter takes care of sorting the data by these variables prior to doing the aggregation, so the output will be sorted by the Aggby variables.
In part B we specify that the variable sumtype is to be used as a "By variable" for the report. Instead of appearning as a column of the report, the values of sumtype will be displayed only when they change and in the form a a "by line". See the output image, below, to see what a by line looks like.
In part C we have filled in the Format text box. What we want to accomplish here is to have the two code items SumType and Race display not as codes but instead as descriptive labels that are the meanings of those codes. The casrh data sets are unusual in that we have a series of custom format codes that can be associated with their variables (this is not generally the case within the data archive). The format names correspond to the variables names, except that must be preceded with a "$" and must end with a period. The $ indicates a character-type format and the period is a syntax convention that lets the program distinguish between variable names and format names.
Here is what the resulting report looks like (slightly shrunken to fit - use the run it link, above, to generate and view the actual report). Very compact. Lots of rows were processed but the aggregation caused things to be collapsed into a very short report. Notice that the numbers for the first two SumTypes pertain only to the 65-and-over population, while the figures for the total and Race alone categories pertain to persons of all ages.