THE AGG MACRO
Rev.  2-14-00

John Blodgett
Office of Social & Economic Data Analysis
University of Missouri Outreach & Extension


 

ABSTRACT

The AGG SAS(r) macro is a utility for aggregating SAS data sets. It is particularly useful for handling situations where you need to aggregate to several levels of a hierarchal sequence of BY variables, and for handling the problems of "summing" mean, median, percentage and ratio variables.

INTRODUCTION

The SAS system provides a number of tools for aggregating data. PROCs like MEANS and SUMMARY handle most aggregation tasks easily and relatively efficiently. However, when dealing with data which is to be aggregated at a series of levels in a nested hierarchy or when dealing with the problems associated with aggregating non-summable variables such as means, medians and percentages, special programming is required.   Such variables have to handled by taking weighted averages rather than sums.   Prior to SAS Version 7 this special programming involved extra pre- and post-processing steps to weight the non-summable items to weight the variables prior to summing and then then to do the divisions after the summing to arrive at the final weighted averages.
 For example, suppose that you have a SAS data set where each observation represents the summary of births recorded in a ZIP code in a single year. Variables on the data set include total births (TOTBRTHS), percentage of births to black mothers (PCTBLACK), total births to black mothers (BBIRTHS), the median age of the mothers (MEDAGE) and a distribution of mothers by age stored as percentages (PCT1-PCT4), where PCT1 is the pct of mothers under 20, PCT2 is the pct of mothers 20-29, PCT3 the pct of mothers 30-39 and PCT4 the pct of mothers aged 40-49. Lets assume also that a STATE and COUNTY code have been assigned to each observation, and that the data set is sorted by YEAR, STATE, COUNTY and ZIP. You would like to summarize the data by COUNTY (within YEAR and STATE), by STATE (within YEAR), by YEAR, and by the entire data set (all years and all geographies combined.) This is an example of a "nested hierarchy" of class variables. PROC SUMMARY will calculate summaries for a such a hierarchy, but it will also calculate summaries for ALL combinations of the variables, which when they are nested like this may not make sense (a YEAR and COUNTY summary would be meaningless in this example since COUNTY is only meaningful when qualified by STATE) or simply may not be desired (a summary by ZIP alone in this example would make sense but may well not be wanted.) You would have to run SUMMARY or MEANS four  times to get just the set of summaries desired. And how would you handle aggregation of the median age of the mothers? And of the percentage distribution? These are the specific kinds of situations that AGG has been written to address. We'll come back to this specific problem as our first example in the SAMPLE APPLICATIONS section.

 Note: Most of what we said in the preceding  paragraph regarding the ability of the SAS summary procedures MEANS and SUMMARY to handle aggregation of means and to be selective with regarding to output summary levels has been changed with the release of Versions 7 and 8.
 

AGG PARAMETERS

The AGG macro uses a series of parameters which the user specifies when invoking AGG to tell it what to do, what data sets to do it with, what variables to "aggregate", etc. In this section we describe each parameter in general terms. Refer to the SAMPLE APPLICATIONS section for examples of how parameter values affect the actions of AGG. Notice that most of the parameters have "aliases", i.e. alternate names. These aliases are indicated in parentheses after the name of the parameter. If the user specifies a value for an alias parm name, that value will be copied as the value of the "correct" parm name. Never specify a value for both a parm and its alias in the same invocation of a macro.

SETIN (AGGIN)=<SAS data set name>

Specifies the name of the input data set containing the data to be aggregated. The default value is _LAST_, i.e. if you do not specify a value for this parm, it is assumed that you have just created a SAS data set and that is the one you want to aggregate. SETIN can be a one or two-level name. The data set referenced by the SETIN parm must be sorted by the aggregation variables (see AGGBY parm) and must contain all the variables referenced in any of the other AGG input variable-list parms (AGGVARS, NUMS, DENOMS, MEANS, MWEIGHTS AND IDVARS). SETOUT (AGGOUT)=<SAS data set name> Specifies the name of the output SAS data set to be created by AGG. The default value is AGGOUT. AGGBY (BYVARS)=<list of hierarchal classification variables> Specifies a list of one or more variables on the input data set(s) that define the aggregation groups. The order of the variables in this list is critical. You would probably NOT want to specify AGGBY=COUNTY STATE, since this would say to create totals for states within counties. The input data set must be sorted in ascending order by the AGGBY variables. Do not use interval lists (e.g. H1-H3) when specifying this list. AGGLVL (AGGLVLS)=<numeric value from 1 to # of vars in AGGBY list> This parameter tells AGG how many hierarchal aggregation levels are to be calculated and written to the AGGOUT data set. You may need to specify an AGGBY list of several variables even when you only want to aggregate at one level, because the codes used are hierarchical. For example, you might have AGGBY=STATE COUNTY,AGGLVL=1 to say that the input data set is sorted by county within state but you are only interested in county level summaries, not state level. The default value for AGGLVL is 1, except when GRAND=ONLY is specified, in which case it defaults to 0 (meaning no aggregation for class variables will be done - only one summary record for the entire data set).  If you specify a value of "STATE COUNTY TRACT BLKGRP" for AGGBY and a value of 3 for AGGLVL then your output data set will contains summaries at 3 levels: level 1 for BLKGRP (within all the other aggby categories); level 2 for tract; and level 3 for county.  ( Changing AGGLVL to 4 would result in also getting summaries at the state level.) GRAND=<ONLY | Y(ES) | N(O) | 1 | 0> This parameter controls the creation of grand totals, i.e. totals for all observations on the input data set. The value you enter will be upcased. Use ONLY to indicate that you only want grand totals, and no others. This will result in a single-observation data set for AGGOUT. Specify a value of Y(ES)  or 1 to indicate that you want grand totals, or a value of N(O) or 0 to indicate that you do not want them. The default value is NO.   Values entered are converted to uppercase. AGGVARS (VARS)=<list of numeric vars on SETIN to be aggregated> Comparable to the VARIABLES list in PROC MEANS or PROC SUMMARY. These are the variables you want to add up (sum). Variables included in the NUMS or DENOMS lists should be included in the AGGVARS list if and only if you are interested in summing them and keeping the results on the output data set. Oftentimes, NUMS and DENOMS are only used as a means to calculate the corresponding PCTS or RATIOS variables, and are not needed for any other purpose. Both single and double-dash lists/sublists can be included in this list. All these variables must be present on the input data set and must be numeric. MEANS (MEDIANS) =<list of non-aggregatable vars>

MWEIGHTS = <list of universe vars used to weight means>

These two related lists support "aggregation" of numeric variables such as means, medians, indexes, ratios, etc. In order to be able to aggregate such a variable AGG needs to be able to "weight" each of the means variables during the aggregation phase. A weighted average of the values is then calculated just prior to output. Processing is similar to that of PCTS/RATIOS (see below). If, for any observation, either the mean variable or its weight variable is missing, then neither variable is added to the special totals kept for these variables.  Variables specified in the MEANS list must also be included in the AGGVARS list (or else they will not be kept on the output data set.)

The MWEIGHTS list has the same characteristics as the NUMS and DENOMS lists: the variables must be numeric and on the input data set; the variables may be (but do not have to be) included in the AGGVARS list (but should be included in either the AGGVARS or DROPVARS lists); and a list of a single variable is allowed  to indicate that all of the variables in the MEANS list are to be weighted using the same variable (otherwise the MEANS and MWEIGHTS lists must contain exactly the same number of variables and must be specified in the proper order so that the n'th variable in the MWEIGHTS list is used as the weight variable for the corresponding n'th variable in the MEANS list.)

We mentioned medians in our list of the types of variables that can be handled using MEANS and MWEIGHTS.  This is not really the case since a weighted average of a median does not yield the true median of the aggregate area.  However, the weighted mean is in many cases a reasonable estimate of the aggregate median and is certainly a lot closer than taking the sum.  The only truly reliable way to aggregate a median is to also aggregate a distribution table corresponding to the median and then post-process the aggout data set and estimate the median from the aggregated distribution table using a tool such as the %median macro in this same sasmacro directory as agg.
 
 

IDVARS (IDS) = <list of identifier variables on input> Serves the same purpose as an ID list in PROC SUMMARY. These variables are written to the output data set and will have the value on the output data set corresponding to the last observation on the input data set used in the aggregation. Usually IDVARS variables are linked to AGGBY variables (e..g. AGGBY=state, IDVARS=statenam ).  All these variables are set to missing on the grand totals output observation. DROPVARS (DROP) = <list of variables to be dropped from output> If a variable is needed for aggregation purposes, but not wanted on the output data set then the DROPVARS parm can be used to specify a list of such variables. The special variables _LVL_ and _NAG_ can also be included in this list. Variables that appear in the DENOMS, NUMS or MWEIGHTS lists which are NOT included in the AGGVARS list should be included in this list; otherwise they will be kept on the output data set but their values will be from the last observation used in the aggregation (they will not be sums.) NODUPS = <1> Specify this option if you want to eliminate redundant summaries when the value of AGGLVLS is greater than 1. For example if you using AGGBY=REGION OFFICE, and the data had only one value of OFFICE within a REGION then you would get the _LVL_=1 summary for the OFFICE, but the _LVL_=2 summary for the REGION would be suppressed. Default is to output the redundant summary observations. SLCTEDIT (FILTER) = <name of user-defined macro to insert> This option allows the user to do custom editing and filtering (select / editing) of the input data. You have to code a macro prior to invoking AGG. To "filter" input observations you MUST use a WHERE statement; otherwise you run the risk of deleting the first or last observation of a BY group which would play havoc with the AGG macro's logical flow. Do NOT include a % symbol as part of the parameter value. (Users of Version 6.04 or earlier cannot use the WHERE statement.) You can also use the macro specified by this parm to specify things such as LENGTHS, FORMATS and LABELs for variables created by the AGG macro (such as PCTS/RATIOS variables.) FACVAR = <numeric variable used to "factor" the observations> This is a rarely-used feature that allows "weighted" aggregation of observations. All variables being summed in the observation will first be multiplied by the value of the FACVAR variable. If the FACVAR variable has a missing value then it is assumed to be 1. If its value is 0, then that input observation is ignored. If its value is negative (typically with a value of -1), then that observation is in effect subtracted from the aggregate totals. This option can be used to handle applications that involve taking the "remainder" of a geographic area (e.g. all of Boone County minus the city of Columbia), or when doing allocations (e.g. Senate District 2 is to be defined as the sum of all of Tract 10 plus half of Tract 11 plus all of Tract 12 except Block Group 2.)


NUMS=<list of numerator variables>

DENOMS (DENS)=<list of denominator variables>

PCTS (PERCENTS)=<list of pctage variables to be calculated>

RATIOS=<list of ratio variables to be calculated>

These four rarely used  parameters are closely tied and will be described as a group.  (Note that most of what you can do with these parameters can be more easily done using the MEANS and MWEIGHTS parameter lists.)   Each of these 4 parameters is a variable list and thus the restrictions described in VARIABLE LIST SPECIFICATIONS applies to each. NUMS and DENOMS are lists of numeric variables on the input data set; they are a tandem set - you cannot specify one without the other. PCTS and RATIOS are mutually exclusive options that must be specified in conjunction with NUMS and DENOMS. These are variables that will be calculated. They may or may not exist on the input data set; if they do they will be ignored (i.e. they will not appear in the KEEP variable list generated for the &SETIN data set.) The only difference between PCTS and RATIOS is that if PCTS is specified then the aggregated values of the NUMS variables will be multiplied by 100 before being divided by the corresponding DENOMS variables. The NUMS and DENOMS variables may be (but are not required to be) included in the AGGVARS list; PCTS/RATIOS variables should NOT be included in the AGGVARS list. AGG will sum the numerators and denominators, and then will calculate the ratios/percentages at each break requiring output. Special logic is used in creating the sums that are used in calculating the PCTS/RATIOS values; whenever either of the variables in a matching NUMS-DENOMS pair is missing then neither is summed for that observation.

The NUMS and DENOMS list can both have variables repeated. In fact, repeated values in the DENOMS list are quite common. A special case that frequently occurs is that the same denominator variable applies for ALL the numerators. AGG allows for this special case by allowing you to specify a single variable for DENOMS. With this one exception the 3 lists (NUMS, DENOMS and RATIOS/PCTS) must all have the same number of variables.

We mentioned that NUMS and DENOMS variables do not have to appear in the AGGVARS list if their sums are not required on the output data set. If they are not included in the AGGVARS list then they should be included in the DROPVARS list to avoid having variables on the output dataset that are not aggregate values. If a variable is included in both the NUMS (or DENOMS) list and the AGGVARS list then it will actually be treated as separate items by %AGG for the sake of aggregation; the sums of the variable will be aggregated the usual way, summing all non-missing values to determine the total value that will be assigned to the variable on the AGGOUT data set; this will be handled separately from the special-summing logic described above that is used for the calculation of PCTS/RATIOS.


DEBUG = <code to control debugging displays>

The program will display all parameters specified if you DEBUG=1 .  Specifying a value of 2 will result in lengthy display and will require a LIST file to be allocated. This option is mostly for the use of the author. FILLBY=<value used to "fill" irrelevant AGGBY variables> Suppose you specify AGGBY=STATE COUNTY. AGG would generate output summaries for all values of the COUNTY variable within the first value of STATE. It would then output the observation that summarizes all the observations for this first value of STATE. On this state-level output observation the value of the COUNTY variables is "irrelevant", i.e. it does not participate in identifying the area summarized. It is sometimes desirable to assign values to such values other than the default blank (missing) values. For example, it is frequently desirable to have the output data set still be in ascending sort order by the AGGBY variables. This can be accomplished by assigning a value for the FILLBY option which will be used to define the values for these irrelevant AGGBYs. Specifying FILLBY=9, for example, results in all such variables being "9 filled". This would typically result in these variables having values which are "high" in the sort sequence and will keep the output data set in the desired sort order. The special value of HIGH can be specified to cause the special unprintable character of X'FF' being used to fill the values of these variables.

Note the FILLBY option determines the values of all AGGBY variables when generating the grand total observation.
 
 

VARIABLE LIST SPECIFICATIONS

Many of the parameters used by AGG are lists of variables. In this section we describe some general rules to be followed, some things to be avoided, and some useful shortcuts to use when specifying these variable lists.

The simplest form of a variable list is a sequence of SAS variable names, each separated from its predecessor by one or more blanks. The SAS system also allows variable lists in some other special formats, not all of which are allowed in all contexts. In the context of coding AGG macro parameter specifications, you can always use lists such as X1-X5 ("single dash" lists), but not (in general -- see the important exception below) A05--LAST ("double dash" lists.) This restriction is due to the fact that AGG does not open your input data set to see the location of the variables, so it would have no way of expanding a "double dash" list. The same thing would apply to lists that use the _NUMERIC_ or _CHARACTER_ notation.

An important exception to this rule has been made for the AGGVARS list. Because AGG is frequently used for processing files with very large numbers of variables and because it can be very tedious and error-prone to have to code very long lists of variables AGG will now except double-dash lists only in the AGGVARS list. You can then specify the NAGGVARS parm - which will not be checked for accuracy. If you specify a double-dash list (for example, AGGVARS=P1I1--H51I3) without a value for NAGGVARS then AGG will generate an extra DATA step in which it will actually open the input data set and will count the variables in the AGGVARS list; this value will then be passed (via SYMPUT) as a parm to be used by the DATA step that does the aggregation. This can incur some extra overhead and if the input data is on tape it will require use of a RETAIN specification in the JCL to make sure the tape remains mounted.

In general, AGG cannot verify ahead of time that your lists are valid; i.e. it will not know if you misspell a name or specify a character variable where a numeric is required. Such problems will only become apparent when the program generated by AGG is executed by SAS.

While variable lists containing intervals (e.g. x1-x50) are permitted, they cause AGG to generate extra code in order to expand these lists and count their elements. You can make the macro run somewhat faster by specifying the size (number of entries) in a list as a parameter. Each of the variable lists except AGGBY (which does not allow intervals) has a corresponding parm which takes its name by preceding the list parm name with the letter "N". Thus the parm NAGGVARS can be used to specify the number of variables in the AGGVARS list. AGG will not verify the value so if you specify it you are responsible for making sure that it is correct. Examples of this feature:

AGGVARS=A B X1-X9 Y4-Y9, NAGGVARS=17, ...

MEANS=M1-M5 MEDIANX, MWEIGHTS=U1-U5 X, NMEANS=6, NMWTS=6, ...

Note that a separate N parm is required for MEANS and MWEIGHTS and that the N parm name for MWEIGHTS is NMWTS (because NMWEIGHTS would be too many characters for a SAS macro paramter.)

OUTPUT

AGG produces no printed output. It produces a single SAS data set containing the aggregations of the variables on the input data set. This output data set will be called AGGOUT by default, or whatever name is specified for the SETOUT (AGGOUT) parm. It will contain the variables described below. Keep in mind that we are assuming here that the DROPVARS parameter was not specified. If it was, then you obviously need to delete any variables coded in that list.

_LVL_

This is a numeric code (length 3 bytes) that tells the aggregation level of the observation. If _LVL_=1 it means this is a summary of all observations for the current values of all the AGGBY variables; if _LVL_=2 it means that its a summary for all AGGBY variables except the last (rightmost) one. That rightmost variable will be assigned a value based on the FILLBY option on the _LVL=2 observations. _LVL_=99 is used to denote a grand total. _NAG_ Used to count the number of observations that were aggregated to get this aggregate summary record. A numeric field of length 4 bytes. AGGBY Variables The variables used to define the aggregation cohorts. If a variable was not used to define the cohort summarized on a particular output observation then it will have a missing value (be default, or whatever value was specified by the FILLBY option). All AGGBY variables are "filled" on a grand total record. If an AGGBY variable contains missing values on the input data set then AGG will generate a summary observation for all those observations with that missing value and the AGGBY variable will be missing in the output observation. AGGVARS variables The variables specified to be aggregated. The values here are the totals (sums) of the non-missing values for all observations within the cohort defined by _LVL_ and the AGGBY variables. At level 1 (only) if all the values for a variable were missing then the corresponding variable in the AGGOUT data set will also be missing; for higher levels, the value will be 0. If variables in NUMS, DENOMS, MEANS or MWEIGHTS statements are also included in the AGGVARS lists, then the normal totals for those variables will appear in the output data set. The "special" aggregations done on those variables in order to handle special case where a mean and its weight are not both missing are not retained on the output data set. IDVARS Identifier variables as specified will appear with their values copied as-is from the last observation in the AGGBY-variables-defined cohort. All id variables are set to missing (FILLBY) on a grand total observation. NUMS, DENOMS, RATIOS/PCTS If it were easy to implement we would force variables in the NUMS and DENOMS lists that were not on the AGGVARS list to be on the DROPVARS list. If you fail to put a NUMS/DENOMS variables on either the AGGVARS or DROPVARS list then that variable will be kept on the output data set, but its values will not be a sum, but just the value taken from the last observation in the cohort. This will not bother the values of RATIOS/PCTS which will still be calculated properly. The value of RATIO(i)= sum(of NUMS(i))/sum(of DENOM(i)), where the "sums" indicated are the totals of the numerator/denominator variables for all those observations in the cohort where both variables were not missing. The formula for PCTS(i) would be similar but would involve multiplying the numerator by 100. Remember that RATIOS and PCTS are mutually exclusive parms. MEANS, MWEIGHTS Once again, what can be said for NUMS, DENOMS, RATIOS/PCTS can be readily translated to apply to these two categories. Each means variable on the output data set was calculated by taking the product of the mean variable times the corresponding weight variable for all observations in the cohort for which both those values were non-missing. For those same observations (only) the MWEIGHTS variables were summed (in a special temporary array) and the quotient of these two sums defines the "weighted mean" represented in the AGGOUT data set. The MWEIGHTS variables are like the DENOMS variables: if you do not include them in your AGGVARS list then you should put them in a DROPVARS list, because the result in the output data set would not be an aggregate. OTHER VARIABLES Variables appearing in the AGGOUT data set not mentioned above should be from one of 2 sources: 1. A SLCTEDIT macro was used and it created a variable, or 2. The macro author was making a quick "improvement" to the software and inadvertently used a variable he forgot to DROP. SAMPLE APPLICATIONS

Rather than including examples of the macro in this document we have created a special directory of examples.  These can be accessed via the WWW at  http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc/sasmacro/aggsamps@secure .  Each example is represented by a trio of related files: a .sas source file showing the sample SAS code, and the corresponding .log and .lst SAS log and listing files showing the results.
The agg0.sas sample shows the creation of the sample input data set that is used in all the examples.



Questions and comments should be addressed to riceg@missouri.edu.