THE AGG MACRO
Rev. 2-14-00
John Blodgett
Office of Social & Economic Data Analysis
University of Missouri Outreach & Extension
ABSTRACT
The AGG SAS(r) macro is a utility for aggregating SAS data sets.
It is particularly useful for handling situations where you need to aggregate
to several levels of a hierarchal sequence of BY variables, and for handling
the problems of "summing" mean, median, percentage and ratio variables.
INTRODUCTION
The SAS system provides a number of tools for aggregating data. PROCs
like MEANS and SUMMARY handle most aggregation tasks easily and relatively
efficiently. However, when dealing with data which is to be aggregated
at a series of levels in a nested hierarchy or when dealing with the problems
associated with aggregating non-summable variables such as means, medians
and percentages, special programming is required. Such variables
have to handled by taking weighted averages rather than sums.
Prior to SAS Version 7 this special programming involved extra pre- and
post-processing steps to weight the non-summable items to weight the variables
prior to summing and then then to do the divisions after the summing to
arrive at the final weighted averages.
For example, suppose that you have a SAS data set where each
observation represents the summary of births recorded in a ZIP code in
a single year. Variables on the data set include total births (TOTBRTHS),
percentage of births to black mothers (PCTBLACK), total births to black
mothers (BBIRTHS), the median age of the mothers (MEDAGE) and a distribution
of mothers by age stored as percentages (PCT1-PCT4), where PCT1 is the
pct of mothers under 20, PCT2 is the pct of mothers 20-29, PCT3 the pct
of mothers 30-39 and PCT4 the pct of mothers aged 40-49. Lets assume also
that a STATE and COUNTY code have been assigned to each observation, and
that the data set is sorted by YEAR, STATE, COUNTY and ZIP. You would like
to summarize the data by COUNTY (within YEAR and STATE), by STATE (within
YEAR), by YEAR, and by the entire data set (all years and all geographies
combined.) This is an example of a "nested hierarchy" of class variables.
PROC SUMMARY will calculate summaries for a such a hierarchy, but it will
also calculate summaries for ALL combinations of the variables, which when
they are nested like this may not make sense (a YEAR and COUNTY summary
would be meaningless in this example since COUNTY is only meaningful when
qualified by STATE) or simply may not be desired (a summary by ZIP alone
in this example would make sense but may well not be wanted.) You would
have to run SUMMARY or MEANS four times to get just the set of summaries
desired. And how would you handle aggregation of the median age of the
mothers? And of the percentage distribution? These are the specific kinds
of situations that AGG has been written to address. We'll come back to
this specific problem as our first example in the SAMPLE APPLICATIONS section.
Note: Most of what we said in the preceding paragraph
regarding the ability of the SAS summary procedures MEANS and SUMMARY to
handle aggregation of means and to be selective with regarding to output
summary levels has been changed with the release of Versions 7 and 8.
AGG PARAMETERS
The AGG macro uses a series of parameters which the user specifies when
invoking AGG to tell it what to do, what data sets to do it with, what
variables to "aggregate", etc. In this section we describe each parameter
in general terms. Refer to the SAMPLE APPLICATIONS section for examples
of how parameter values affect the actions of AGG. Notice that most of
the parameters have "aliases", i.e. alternate names. These aliases are
indicated in parentheses after the name of the parameter. If the user specifies
a value for an alias parm name, that value will be copied as the value
of the "correct" parm name. Never specify a value for both a parm and its
alias in the same invocation of a macro.
SETIN (AGGIN)=<SAS data set name>
Specifies the name of the input data set containing the data to be
aggregated. The default value is _LAST_, i.e. if you do not specify a value
for this parm, it is assumed that you have just created a SAS data set
and that is the one you want to aggregate. SETIN can be a one or two-level
name. The data set referenced by the SETIN parm must be sorted by the aggregation
variables (see AGGBY parm) and must contain all the variables referenced
in any of the other AGG input variable-list parms (AGGVARS, NUMS, DENOMS,
MEANS, MWEIGHTS AND IDVARS).
SETOUT (AGGOUT)=<SAS data set name>
Specifies the name of the output SAS data set to be created by AGG.
The default value is AGGOUT.
AGGBY (BYVARS)=<list of hierarchal classification variables>
Specifies a list of one or more variables on the input data set(s)
that define the aggregation groups. The order of the variables in this
list is critical. You would probably NOT want to specify AGGBY=COUNTY STATE,
since this would say to create totals for states within counties. The
input data set must be sorted in ascending order by the AGGBY variables.
Do not use interval lists (e.g. H1-H3) when specifying this list.
AGGLVL (AGGLVLS)=<numeric value from 1 to # of vars in AGGBY list>
This parameter tells AGG how many hierarchal aggregation levels are
to be calculated and written to the AGGOUT data set. You may need to specify
an AGGBY list of several variables even when you only want to aggregate
at one level, because the codes used are hierarchical. For example, you
might have AGGBY=STATE COUNTY,AGGLVL=1 to say that the input data set is
sorted by county within state but you are only interested in county level
summaries, not state level. The default value for AGGLVL is 1, except when
GRAND=ONLY is specified, in which case it defaults to 0 (meaning no aggregation
for class variables will be done - only one summary record for the entire
data set). If you specify a value of "STATE COUNTY TRACT BLKGRP"
for AGGBY and a value of 3 for AGGLVL then your output data set will contains
summaries at 3 levels: level 1 for BLKGRP (within all the other aggby categories);
level 2 for tract; and level 3 for county. ( Changing AGGLVL to 4
would result in also getting summaries at the state level.)
GRAND=<ONLY | Y(ES) | N(O) | 1 | 0>
This parameter controls the creation of grand totals, i.e. totals
for all observations on the input data set. The value you enter will be
upcased. Use ONLY to indicate that you only want grand totals, and no others.
This will result in a single-observation data set for AGGOUT. Specify a
value of Y(ES) or 1 to indicate that you want grand totals, or a
value of N(O) or 0 to indicate that you do not want them. The default value
is NO. Values entered are converted to uppercase.
AGGVARS (VARS)=<list of numeric vars on SETIN to be aggregated>
Comparable to the VARIABLES list in PROC MEANS or PROC SUMMARY. These
are the variables you want to add up (sum). Variables included in the NUMS
or DENOMS lists should be included in the AGGVARS list if and only if you
are interested in summing them and keeping the results on the output data
set. Oftentimes, NUMS and DENOMS are only used as a means to calculate
the corresponding PCTS or RATIOS variables, and are not needed for any
other purpose. Both single and double-dash lists/sublists can be included
in this list. All these variables must be present on the input data set
and must be numeric.
MEANS (MEDIANS) =<list of non-aggregatable vars>
MWEIGHTS = <list of universe vars used to weight means>
These two related lists support "aggregation" of numeric variables
such as means, medians, indexes, ratios, etc. In order to be able to aggregate
such a variable AGG needs to be able to "weight" each of the means variables
during the aggregation phase. A weighted average of the values is then
calculated just prior to output. Processing is similar to that of PCTS/RATIOS
(see below). If, for any observation, either the mean variable or its weight
variable is missing, then neither variable is added to the special totals
kept for these variables. Variables specified in the MEANS list must
also be included in the AGGVARS list (or else they will not be kept on
the output data set.)
The MWEIGHTS list has the same characteristics as the NUMS and DENOMS
lists: the variables must be numeric and on the input data set; the variables
may be (but do not have to be) included in the AGGVARS list (but should
be included in either the AGGVARS or DROPVARS lists); and a list of a single
variable is allowed to indicate that all of the variables in the
MEANS list are to be weighted using the same variable (otherwise the MEANS
and MWEIGHTS lists must contain exactly the same number of variables and
must be specified in the proper order so that the n'th variable in the
MWEIGHTS list is used as the weight variable for the corresponding n'th
variable in the MEANS list.)
We mentioned medians in our list of the types of variables that can
be handled using MEANS and MWEIGHTS. This is not really the case
since a weighted average of a median does not yield the true median of
the aggregate area. However, the weighted mean is in many cases a
reasonable estimate of the aggregate median and is certainly a lot closer
than taking the sum. The only truly reliable way to aggregate a median
is to also aggregate a distribution table corresponding to the median and
then post-process the aggout data set and estimate the median from the
aggregated distribution table using a tool such as the %median macro
in this same sasmacro directory as agg.
IDVARS (IDS) = <list of identifier variables on input>
Serves the same purpose as an ID list in PROC SUMMARY. These variables
are written to the output data set and will have the value on the output
data set corresponding to the last observation on the input data set used
in the aggregation. Usually IDVARS variables are linked to AGGBY variables
(e..g. AGGBY=state, IDVARS=statenam ). All these variables are set
to missing on the grand totals output observation.
DROPVARS (DROP) = <list of variables to be dropped from output>
If a variable is needed for aggregation purposes, but not wanted on
the output data set then the DROPVARS parm can be used to specify a list
of such variables. The special variables _LVL_ and _NAG_ can also be included
in this list. Variables that appear in the DENOMS, NUMS or MWEIGHTS lists
which are NOT included in the AGGVARS list should be included in this list;
otherwise they will be kept on the output data set but their values will
be from the last observation used in the aggregation (they will not be
sums.)
NODUPS = <1>
Specify this option if you want to eliminate redundant summaries when
the value of AGGLVLS is greater than 1. For example if you using AGGBY=REGION
OFFICE, and the data had only one value of OFFICE within a REGION then
you would get the _LVL_=1 summary for the OFFICE, but the _LVL_=2 summary
for the REGION would be suppressed. Default is to output the redundant
summary observations.
SLCTEDIT (FILTER) = <name of user-defined macro to insert>
This option allows the user to do custom editing and filtering (select
/ editing) of the input data. You have to code a macro prior to invoking
AGG. To "filter" input observations you MUST use a WHERE statement; otherwise
you run the risk of deleting the first or last observation of a BY group
which would play havoc with the AGG macro's logical flow. Do NOT include
a % symbol as part of the parameter value. (Users of Version 6.04 or earlier
cannot use the WHERE statement.) You can also use the macro specified by
this parm to specify things such as LENGTHS, FORMATS and LABELs for variables
created by the AGG macro (such as PCTS/RATIOS variables.)
FACVAR = <numeric variable used to "factor" the observations>
This is a rarely-used feature that allows "weighted" aggregation of
observations. All variables being summed in the observation will first
be multiplied by the value of the FACVAR variable. If the FACVAR variable
has a missing value then it is assumed to be 1. If its value is 0, then
that input observation is ignored. If its value is negative (typically
with a value of -1), then that observation is in effect subtracted from
the aggregate totals. This option can be used to handle applications that
involve taking the "remainder" of a geographic area (e.g. all of Boone
County minus the city of Columbia), or when doing allocations (e.g. Senate
District 2 is to be defined as the sum of all of Tract 10 plus half of
Tract 11 plus all of Tract 12 except Block Group 2.)
NUMS=<list of numerator variables>
DENOMS (DENS)=<list of denominator variables>
PCTS (PERCENTS)=<list of pctage variables to be calculated>
RATIOS=<list of ratio variables to be calculated>
These four rarely used parameters are closely tied and will
be described as a group. (Note that most of what you can do with
these parameters can be more easily done using the MEANS and MWEIGHTS parameter
lists.) Each of these 4 parameters is a variable list and thus
the restrictions described in VARIABLE LIST SPECIFICATIONS applies to each.
NUMS and DENOMS are lists of numeric variables on the input data set; they
are a tandem set - you cannot specify one without the other. PCTS and RATIOS
are mutually exclusive options that must be specified in conjunction with
NUMS and DENOMS. These are variables that will be calculated. They may
or may not exist on the input data set; if they do they will be ignored
(i.e. they will not appear in the KEEP variable list generated for the
&SETIN data set.) The only difference between PCTS and RATIOS is that
if PCTS is specified then the aggregated values of the NUMS variables will
be multiplied by 100 before being divided by the corresponding DENOMS variables.
The NUMS and DENOMS variables may be (but are not required to be) included
in the AGGVARS list; PCTS/RATIOS variables should NOT be included in the
AGGVARS list. AGG will sum the numerators and denominators, and then will
calculate the ratios/percentages at each break requiring output. Special
logic is used in creating the sums that are used in calculating the PCTS/RATIOS
values; whenever either of the variables in a matching NUMS-DENOMS pair
is missing then neither is summed for that observation.
The NUMS and DENOMS list can both have variables repeated. In fact,
repeated values in the DENOMS list are quite common. A special case that
frequently occurs is that the same denominator variable applies for ALL
the numerators. AGG allows for this special case by allowing you to specify
a single variable for DENOMS. With this one exception the 3 lists (NUMS,
DENOMS and RATIOS/PCTS) must all have the same number of variables.
We mentioned that NUMS and DENOMS variables do not have to appear in
the AGGVARS list if their sums are not required on the output data set.
If they are not included in the AGGVARS list then they should be included
in the DROPVARS list to avoid having variables on the output dataset that
are not aggregate values. If a variable is included in both the NUMS (or
DENOMS) list and the AGGVARS list then it will actually be treated as separate
items by %AGG for the sake of aggregation; the sums of the variable will
be aggregated the usual way, summing all non-missing values to determine
the total value that will be assigned to the variable on the AGGOUT data
set; this will be handled separately from the special-summing logic described
above that is used for the calculation of PCTS/RATIOS.
DEBUG = <code to control debugging displays>
The program will display all parameters specified if you DEBUG=1 .
Specifying a value of 2 will result in lengthy display and will require
a LIST file to be allocated. This option is mostly for the use of the author.
FILLBY=<value used to "fill" irrelevant AGGBY variables>
Suppose you specify AGGBY=STATE COUNTY. AGG would generate output
summaries for all values of the COUNTY variable within the first value
of STATE. It would then output the observation that summarizes all the
observations for this first value of STATE. On this state-level output
observation the value of the COUNTY variables is "irrelevant", i.e. it
does not participate in identifying the area summarized. It is sometimes
desirable to assign values to such values other than the default blank
(missing) values. For example, it is frequently desirable to have the output
data set still be in ascending sort order by the AGGBY variables. This
can be accomplished by assigning a value for the FILLBY option which will
be used to define the values for these irrelevant AGGBYs. Specifying FILLBY=9,
for example, results in all such variables being "9 filled". This would
typically result in these variables having values which are "high" in the
sort sequence and will keep the output data set in the desired sort order.
The special value of HIGH can be specified to cause the special unprintable
character of X'FF' being used to fill the values of these variables.
Note the FILLBY option determines the values of all AGGBY variables
when generating the grand total observation.
VARIABLE LIST SPECIFICATIONS
Many of the parameters used by AGG are lists of variables. In this section
we describe some general rules to be followed, some things to be avoided,
and some useful shortcuts to use when specifying these variable lists.
The simplest form of a variable list is a sequence of SAS variable names,
each separated from its predecessor by one or more blanks. The SAS system
also allows variable lists in some other special formats, not all of which
are allowed in all contexts. In the context of coding AGG macro parameter
specifications, you can always use lists such as X1-X5 ("single dash" lists),
but not (in general -- see the important exception below) A05--LAST ("double
dash" lists.) This restriction is due to the fact that AGG does not open
your input data set to see the location of the variables, so it would have
no way of expanding a "double dash" list. The same thing would apply to
lists that use the _NUMERIC_ or _CHARACTER_ notation.
An important exception to this rule has been made for the AGGVARS
list. Because AGG is frequently used for processing files with very large
numbers of variables and because it can be very tedious and error-prone
to have to code very long lists of variables AGG will now except double-dash
lists only in the AGGVARS list. You can then specify the NAGGVARS
parm - which will not be checked for accuracy. If you specify a double-dash
list (for example, AGGVARS=P1I1--H51I3) without a value for NAGGVARS then
AGG will generate an extra DATA step in which it will actually open the
input data set and will count the variables in the AGGVARS list; this value
will then be passed (via SYMPUT) as a parm to be used by the DATA step
that does the aggregation. This can incur some extra overhead and if the
input data is on tape it will require use of a RETAIN specification in
the JCL to make sure the tape remains mounted.
In general, AGG cannot verify ahead of time that your lists are valid;
i.e. it will not know if you misspell a name or specify a character variable
where a numeric is required. Such problems will only become apparent when
the program generated by AGG is executed by SAS.
While variable lists containing intervals (e.g. x1-x50) are permitted,
they cause AGG to generate extra code in order to expand these lists and
count their elements. You can make the macro run somewhat faster by specifying
the size (number of entries) in a list as a parameter. Each of the variable
lists except AGGBY (which does not allow intervals) has a corresponding
parm which takes its name by preceding the list parm name with the letter
"N". Thus the parm NAGGVARS can be used to specify the number of variables
in the AGGVARS list. AGG will not verify the value so if you specify it
you are responsible for making sure that it is correct. Examples of this
feature:
AGGVARS=A B X1-X9 Y4-Y9, NAGGVARS=17, ...
MEANS=M1-M5 MEDIANX, MWEIGHTS=U1-U5 X, NMEANS=6, NMWTS=6, ...
Note that a separate N parm is required for MEANS and MWEIGHTS and that
the N parm name for MWEIGHTS is NMWTS (because NMWEIGHTS would be too many
characters for a SAS macro paramter.)
OUTPUT
AGG produces no printed output. It produces a single SAS data set containing
the aggregations of the variables on the input data set. This output data
set will be called AGGOUT by default, or whatever name is specified for
the SETOUT (AGGOUT) parm. It will contain the variables described below.
Keep in mind that we are assuming here that the DROPVARS parameter was
not specified. If it was, then you obviously need to delete any variables
coded in that list.
_LVL_
This is a numeric code (length 3 bytes) that tells the aggregation
level of the observation. If _LVL_=1 it means this is a summary of all
observations for the current values of all the AGGBY variables; if _LVL_=2
it means that its a summary for all AGGBY variables except the last (rightmost)
one. That rightmost variable will be assigned a value based on the FILLBY
option on the _LVL=2 observations. _LVL_=99 is used to denote a grand total.
_NAG_
Used to count the number of observations that were aggregated to get
this aggregate summary record. A numeric field of length 4 bytes.
AGGBY Variables
The variables used to define the aggregation cohorts. If a variable
was not used to define the cohort summarized on a particular output observation
then it will have a missing value (be default, or whatever value was specified
by the FILLBY option). All AGGBY variables are "filled" on a grand total
record. If an AGGBY variable contains missing values on the input data
set then AGG will generate a summary observation for all those observations
with that missing value and the AGGBY variable will be missing in the output
observation.
AGGVARS variables
The variables specified to be aggregated. The values here are the
totals (sums) of the non-missing values for all observations within the
cohort defined by _LVL_ and the AGGBY variables. At level 1 (only) if all
the values for a variable were missing then the corresponding variable
in the AGGOUT data set will also be missing; for higher levels, the value
will be 0. If variables in NUMS, DENOMS, MEANS or MWEIGHTS statements are
also included in the AGGVARS lists, then the normal totals for those variables
will appear in the output data set. The "special" aggregations done on
those variables in order to handle special case where a mean and its weight
are not both missing are not retained on the output data set.
IDVARS
Identifier variables as specified will appear with their values copied
as-is from the last observation in the AGGBY-variables-defined cohort.
All id variables are set to missing (FILLBY) on a grand total observation.
NUMS, DENOMS, RATIOS/PCTS
If it were easy to implement we would force variables in the NUMS
and DENOMS lists that were not on the AGGVARS list to be on the DROPVARS
list. If you fail to put a NUMS/DENOMS variables on either the AGGVARS
or DROPVARS list then that variable will be kept on the output data set,
but its values will not be a sum, but just the value taken from the last
observation in the cohort. This will not bother the values of RATIOS/PCTS
which will still be calculated properly. The value of RATIO(i)= sum(of
NUMS(i))/sum(of DENOM(i)), where the "sums" indicated are the totals of
the numerator/denominator variables for all those observations in the cohort
where both variables were not missing. The formula for PCTS(i) would be
similar but would involve multiplying the numerator by 100. Remember that
RATIOS and PCTS are mutually exclusive parms.
MEANS, MWEIGHTS
Once again, what can be said for NUMS, DENOMS, RATIOS/PCTS can be
readily translated to apply to these two categories. Each means variable
on the output data set was calculated by taking the product of the mean
variable times the corresponding weight variable for all observations in
the cohort for which both those values were non-missing. For those same
observations (only) the MWEIGHTS variables were summed (in a special temporary
array) and the quotient of these two sums defines the "weighted mean" represented
in the AGGOUT data set. The MWEIGHTS variables are like the DENOMS variables:
if you do not include them in your AGGVARS list then you should put them
in a DROPVARS list, because the result in the output data set would not
be an aggregate.
OTHER VARIABLES
Variables appearing in the AGGOUT data set not mentioned above should
be from one of 2 sources: 1. A SLCTEDIT macro was used and it created a
variable, or 2. The macro author was making a quick "improvement" to the
software and inadvertently used a variable he forgot to DROP.
SAMPLE APPLICATIONS
Rather than including examples of the macro in this document we have
created a special directory of examples. These can be accessed via
the WWW at http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc/sasmacro/aggsamps@secure
. Each example is represented by a trio of related files: a .sas
source file showing the sample SAS code, and the corresponding .log and
.lst SAS log and listing files showing the results.
The agg0.sas sample shows the creation of the sample input data set
that is used in all the examples.
Questions and comments should be addressed to the author at blodgettj@umsystem.edu
.