Notes for Query nchsbri_aggbysex (link to the query)

    Jgb, rev. 11/24/2015

Summary

 This sample query is based on a request we had from a user (Holly Groover with the South Carolina Department of Public Safety).  She was using our Estimates by Age web application to get data regarding the population of South Carolina aged 16 and under.  She was able to get data for this cohort by race and by sex (separately) but wondered if there was a way she could get them crossed, i.e. white males, white females, black males, black females, etc.   The answer was Yes, but it would not be as easy as using the web app.   It would require using Dexter to access the detailed nchsbri data and the aggregation feature, part of the Advanced section on the Dexter query form. 

The nchsbri data are stored in the nchsbri subdirectory of the popests data directory (“filetype”).   This sub-collection is important enough to merit its own separate Readme file .   In this Readme file it explains that we have 7  data sets per state, five of them with historical data and two with current data,  The two with current data are the original detail data (as downloaded from the NCHS web site) and a _sumry version that is tabulated to handle most requests.  But for this request we need the detailed data set where each row/observation corresponds to a 5-dimensional data matrix.  The dimensions are county, age (single year), race (4 “bridged” categories), sex, and Hispanic. The dimensions we are interested in are age, race and sex.  We shall select a data set with just South Carolina data and we’ll specify an filter by Age to just get the 16-and-under cohort.   Then we’ll need to aggregate by the race and sex categories.   Our output should be just 8 records/rows: the 4 bridged race categories crossed with the two sex categories.

The basic stored query here is pretty simple.   In the Variations section we’ll look at a somewhat more sophisticated extract. 

     

Top of the Page

 .

We see that we are accessing data in the popests/nchsbri subdirectory, data set is scnchsbridged201x.  Best way to get here is to navigate to the popests directory, then to the nchsbri directory and then use the Datasets.html directory file.   Note the link to the detailed metadata for this data set.   That page includes a very short “codebook” explaining the codes used for age, race, sex and Hispanic.    If we hit the Quick Look button it will show the first 100 observations of this data sete (and because we have already made our variable choices in Sec. III of the form it will only display those variables).    To have a better feel for what the input data set looks like I chose all the ID variables (in Sec.III) and then hit Quick Look to get something like this:

The only columns that change here are Age and the 2 estimate items.   Obs 1 is the count of persons aged 0 (i.e. < 1)  in Abbeville County who are white (race=1), male (sex=1), and non Hispanic (Hispanic=1).   What we’ll do here is add up all these population estimate cells based on just the race and sex categories, and filtering out all observations which do not satisfy our Age <= 16 filter.

Sec I: Output Formats

   

We choose to have output only in HTML format.  Note that we have checked the option that says to skip the intermediate Output Menu page and just pipe the single output report directly back to the browser. 

Sec. II: Choose rows

 

 Our filter tells Dexter to only process rows/observations where the value of Age is <= 16.   If you wanted to do multiple cohorts you would have to do multiple runs with multiple age-based filters.  

Sec. III: Choose columns

 

State is a constant on this data set but not a bad idea to carry it along as an ID variable.  Too bad there is no variable to document the age cohort.  The user actually just wanted the latest estimate (2014 at the time) but we chose the 2010 estimate as well just to show we could aggregate more than one variable.

Sec. IV: Other Options

 

Nothing too complicated here.  Just titles and a footnote to document the report and point users to this Note module.   Note that we do not have to specify any Sort option.

Sec. V: Advanced Options

This is what came for – the Advanced stuff.  The Aggby text box is where we enter a list of variables to “aggregate by”.  It tells Dexter to sort the selected rows by the values of these 3 variables and then go through and create one output row/observation for all distinct combinations of the 3 variables.   For example, we checked and there were over 1500 observations on the filtered data set (i.e. not counting age>16) that had state=45, Race=1 and Sex=1: white males living in South Carolina.  The output file has a single row/observation with value for two population estimates (2010 and 2014) around 333,000 - each of them a sum of those 1500+ values.  

The Agg Level item has been left with default value of 1.  When doing aggregations you can request summaries at multiple levels.  See the online help for details on this. Similarly, we have not altered the default value for the Variables to drop option.   If you want to see how many observations were used in creating each output summary observation then you should omit _nag_  from this parameter spec.

 (Note the nearest Extract Data button is just above the Sec.V header.)

Output Menu Page

Does not apply because we asked that the resulting html output be piped directly back to the browser.  So what you should see next in your browser after hitting the Extract Data button is this:

We have eight aggregated summaries, four race values by two sex categories.  The 4th row, for example, tells us that there were approximately 171,000 black (Race=2) females (Sex=2) under 17 in 2014, down slightly from the 171,858 estimate for July, 2010.  

 

Variations and Enhancements

To change the geographic universe you need to look at the comparable data for the other 4 states as well as SC.  You can do this in at least two ways.  The hard way would involve running the query five times and just editing the name of the input data set at the top of the form each time.  You could also edit the title each time so you are not saying South Carolina when the data are for Georgia.   You could do it that way, but you should not.  A new feature added to Dexter late in 2015 lets you access multiple data sets in one run as long as the names of those data sets follow a rigorous pattern.  That is the case here.  The input data set, scnchsbridged201x, is part of a national collection, with one data set per state.  The comparable data for North Carolina is in ncnchsbridged201x .   You could access the entire collection by using the new “wild card” feature.  You replace the state abbreviation portion of the data set name (first text box at top of page) like so:

 

The “sc” part has been selected so I can now just type over it, inserting the two underscores needed to trigger wild card processing.

Dexter will now concatenate all data sets in the current directory with names matching the specified pattern.  So it will be accessing alnchsbridged201x, aknchsbridged201x, … (through wycnchsbridged201x). 

But we don’t really want data for all those states – just the five in the region.    We take care of this by coding a filter in Sec. II of the form like so:

We have coded a second condition for selection.  In addition to having an Age value <= 16 we now add that the variable State (state FIPS code) must be in the specified list of 5 values.  (If you did not know the state values you could have used multiple sources to get them, such as http://mcdc.missouri.edu/sasfmats/Sstate.sas ).  So Dexter is going to actually consider (“read”) 51 data sets and only keep data from 5 of them.  This turns out to be faster/more efficient than you might think because of indexing and the way the SAS where statement works.

The other thing we said we would do for this variation is use formats to provide better labeling of our data.   We want to specify format codes for 3 of our output variables: state, race and sex.    We do this in the Advanced Options section of the form, part C as follows:

 This Format text box entry lists each variable that we want to have “formatted” (i.e. to have a SAS format applied so that it displays meaningful value labels instead of the codes stored on the data set).   The format names are the things that start with $ and end with periods.  State is a variable name, and $state. is a format code.  The special format codes $race. and $sex. only work for this particular data directory. $state works for any data set where the State variable contains a 2-character FIPS code (which it almost always will).  

We also tweak the title to identify the report as coming from this variation.  And, finally, we specify a couple of Advanced items to specify a By variable for the report and a pair of ID variables.   This will result in State being displayed on a “by line” instead of in a column, and cause Race and Sex to appear as the first columns of the report with special color formatting.   This is what we specify:

Here is a partial image of the resulting report.   Notice the data for the other selected states.  Notice the “State=Georgia” byline.  Notice that Race and Sex appear in blue and are the left most (“Id”) columns of the report (replacing “OBS”) and that State displays as Georgia instead of 13, Race displays as White instead of 1, and Sex displays as Male instead of 1.  These are the result of using formats.  

 

Exercise

Start with this variation and change the query in the following ways:

Questions/Comments

Address these to John Blodgett at OSEDA.