Missouri Census Data Center

Dexter Xsample: Detailed Query Description

Query ID: casrh05

Technical Note: The dataset used in this demonstration is no longer the current data of its type. Normally, the dataset used in the example would be moved to an archives directory and would not be accessible. But we have specially kept this dataset in the active directory so that users who want to run the xsample code will see results comparable to what is shown in these screen shots. But any user wanting to actually run a comparable extract should understrand that they will be looking for a dataset such as mocasrh07 rather than mocasrh05 (depending on when you are reading this.)

Summary

We use a rather complex filter to select several summary types from this somewhat complex data set. The summary types are all the possible combinations of the 4 demographic categories: Age, Race, Sex and Hispanic. Made a bit more confusing by the fact that there are really 2 sets of race categories: the race-alone and the race-(alone or in combination) categories. Accessing this data set is a bit like accessing a microdata file. But the rows are geographic summaries of counties for the specified demographic combinations. Here is the probem statement:
Create a summary report at the state level (i.e. sum across all the counties) with 2000 and 2005 population estimates for the total population, for the race-alone categories, the total population aged 65 and over, and the 65+ population by the 7 race-alone categories. When displaying the variables SumType and Race use custom format codes to display the meaning of the codes rather than the codes themselves.
The key to understanding how to code the query, as is often the case, is to understand the input data set. There is pretty good metadata for the data set, but it still takes some careful perusing of the data to really see how things work. Use of the Quick Look button is suggested as a helpful way to gain insight.

Filetype accessed: popests - recent population estimates from the Census Bureau.

Data Set accessed: mocasrh05 - this is the Missouri data set, one in a collection of 51 state level data sets. Users who want to see something similar for Pennsylvania, New York or Florida (to name just 3) have only to change the data set to pacasrh05, nycasrh05 or flcasrh05, respectively. (And when the 2006 numbers are released just change all these "05"s to "06"s to get the more current estimates.) See detailed metadata for this data set.

Dexter features used: Quick Look (to preview the data) ; complex filter utiltilizing the parentheses check boxes to control logic; the aggregation feature to sum across geography and age; specifying custom formats for variables to appear in the output reports so the user does not have to look up their meaning in a codebook.

Can be readily modified to:

  • Work for different states (see above) or levels of geography (add county and/or cbsa to the aggby list).
  • Adding more demographic dimensions, such as Hispanic and Sex would not be too difficult (though the results might not be all that useful). In general, there is a lot of flexibility when using this data set, allowing you to get many different kinds of demographic combinations.

Degree of Difficulty: 5 - this is a good test of both your Dexter skills and your ability to make sense of a rather complex input data set.

Saved Query File: View It      Run It


A good way to get a handle on what these data look like is to use the "Quick Look" button near the top of the page after entering a value of 1000 in the box near the bottom of Section II, limiting the number of observations to be displayed to 1000 (actually, the default value of this parameter when using Quick Look is 100, so we are using it to increase rather than limit the amount of output.) We cannot fit all 1000 lines here but it starts out looking like this:

Screen image of Quick Look output (1st page)
The key to understanding this data set is the SumType variable. The rows with a value of R for for this variable have no values (are blank) for the key variables Age, Sex and Hispanic and take on the values 07 thru 11 for Race. If you view the values for Race as referenced on the Detailed metadata page you'll note that 07 is the code for 'White alone or in combination'. 08 means 'Black alone or in combination', etc. Following the 5 R summaries (corresponding to the 5 race-alone-or-in-combination categories), we see 10 Rh summaries. These are summaries that add the Hispanic dimension to Race. The row with SumType=Rh, Race=07 and Hispanic=1 has data for the subpopulation that is all or partly white (Race=07) and that is non-hispanic (Hispanic=1).

One of the Usage Notes entries on the metadata page has a reference (link) to the SAS value-labels code. You should right click and open that link in a new window so you can easily reference it. You should spend as much time as you need studying the Quick Look display to make sure you understand what each row of the data set as displayed is summarizing.

Annotated Dexter Query Form (As filled out to define this query)

Section I: We select the HTML option in the Listing/Report row and indicate we do not need a Delimited file by specifying "none" in that row.

Screen image of dexter query form, part 1



Section II: The filter here is the most challenging part of this query. We actually want data for 4 summary types:
  1. Total population (SumType=t)
  2. Total population for the race-alone categories (Sumtype=r)
  3. Population for the population over the age of 65 (can be aggregated using Sumtype=a summaries)
  4. Population over 65 by race-alone (can be aggregated using SumType=ar summaries)
An important detail to understand here is the distinction between upper and lower-case "r"s in the SumType codes. The lower-case r indicates a summary for a race-alone category, while an upper-case R indicates a summary for a race-alone-or-in-combination category. We want the lower-case r categories here. This means that we shall need to enter values for selecting on SumType that are case sensitive.

Another important aspect of this filter is that for two of the SumType categories - t and r - we want all rows that have those values. But for the other two SumType categories (the two involving the Age dimension, a and ar, we only want those rows when the Age variable indicates a summary for persons who are aged 65 or older.

Our filter then has two parts to it:

  1. If the SumType is t or r we want it without any further qualification.
  2. If the SumType is a or ar then we may want it, depending on whether or not Age has a value indicating 65 or older.
Each of these two conditions is sufficient, meaning that if they are true we want the row. So we need to define the two conditions and use the logical operator Or to connect them (instead of the usual And that we use to connect necessary conditions.

Here is what the filter looks like:

Screen image of dexter query form, part 2
Notice the tildes entered in the Value columns in the first 2 rows of the filter. This is how we indicate to Dexter that we want these values to be case-sensitive. Normally when you enter a value that is to be compared with a character-type variable Dexter will uppercase everything so that case does not matter. If we left off the tilde in the first row of the filter here then Dexter would generate this condition:

upcase(SumType) in ('T','R')
It would thus be comparing the uppercased value of SumType with the uppercased values as entered on the form. And that would cause it to select cases where the value of SumType was an uppercase R and that is not what we want. So we use the tilde.

Notice the peculiar value of 14 entered as the Value in the 3rd row of the filter. We want to filter so that when age is relevant we only keep values for age 65 and over (not age 14). But you need to be aware that Age is not an actual numeric age value but instead is a character string code that indicates an age interval. You need to look at the codebook page to see that a value of 14 for Age means we have a summary for persons in the age interval 65 to 69. The codes 15, 16, 17 and 18 represent the higher intervals. Thus choosing codes with values 14 and over is equivalent to choosing age categories of 65 and over.

Notice the choice of the Or button between the first two rows, indicating that we have two sufficient rather than necessary conditions here; if either one is true, we keep the row. Note also the check marks in rows 2 and 3 of the parentheses boxes to indicate that the And condition between rows 2 and 3 is to be done before the Or condition between rows 1 and 2.



Section III: Here we select our variables/columns for the report. Not much to keep; just two identifiers and two estimate values. Note that we are not keeping Age; we use it as a filter but it does not appear in the report. We shall have to use titles to alert the user regarding what we have done to filter the data based on age.
Screen image of dexter query form, part 3, initial view



Section IV The subtitle is used to convey to the user that the two SumType categories involving age have been filtered to include only the 65 and over population.

Screen image of dexter query form, part 4



Section V The box is used to specify that we want our data aggregated by the summary type and then by race within the type. Since there are no geography variables included in this list and since all of the observations are county level summaries for Missouri the results of the aggregation will be state totals. Dexter takes care of sorting the data by these variables prior to doing the aggregation, so the output will be sorted by the Aggby variables.

In part B we specify that the variable sumtype is to be used as a "By variable" for the report. Instead of appearning as a column of the report, the values of sumtype will be displayed only when they change and in the form a a "by line". See the output image, below, to see what a by line looks like.

In part C we have filled in the Format text box. What we want to accomplish here is to have the two code items SumType and Race display not as codes but instead as descriptive labels that are the meanings of those codes. The casrh data sets are unusual in that we have a series of custom format codes that can be associated with their variables (this is not generally the case within the data archive). The format names correspond to the variables names, except that must be preceded with a "$" and must end with a period. The $ indicates a character-type format and the period is a syntax convention that lets the program distinguish between variable names and format names.

Screen image of dexter query form, part 5



Output

Here is what the resulting report looks like (slightly shrunken to fit - use the run it link, above, to generate and view the actual report). Very compact. Lots of rows were processed but the aggregation caused things to be collapsed into a very short report. Notice that the numbers for the first two SumTypes pertain only to the 65-and-over population, while the figures for the total and Race alone categories pertain to persons of all ages.

Screen image of output report





Exercises

  1. Modify the filter in Section II so that it selects only summaires for the total population, the population that is Hispanic, and the Hispanic population by age. These will be 3 SumType values. Aggregate by SumType, Hispanic and Age.

  2. Generate a csv file that has breakdowns of the population by age and sex for the each county in Missouri for the most recent 3 years. Use formats so that the age and sex variables display meaningful category descriptions rather than codes.

  3. Repeat the above but instead of all counties in Missouri do it for all counties in the Illinois portion of the Chicago-Naperville-Joliet MSA (CBSA).

  4. Repeat the previous exercise but aggregate across the counties to produce data for the total Illinois portion of the Chicage CBSA.


This file last modified Monday August 07, 2017, 13:14:34
The Missouri Census Data Center is a sponsored program of the Missouri State Library within the office of the Missouri Secretary of State. The MCDC has been a partner in the U.S. Census Bureau's State Data Center program since 1979.

Questions/Comments regarding this page or this web site can be sent to