We have now (Sept 2016) downloaded and converted the data for tax year 2014. The same basic processing was done to create a series of eight data sets.
Each of these data sets can be classified in three ways:
- Geographic level: county or ZIP.
- AGI detail: "noagi" sets just have summaries for all the returns, while "allagi" sets have summaries broken out by Adjusted Gross Income (AGI) category.
- Original or enhanced ("plus") versions. We encourage use of the plus versions, which feature mnemonic variable names and many calculated means and percentages.
What Data Are Reported
These data are aggregated summaries of U.S. federal tax returns (Form 1040 et al.) for specific tax years and geographic areas.
The data source is the IRS web site at http://www.irs.gov/uac/SOI-Tax-Stats-County-Data-Downloads (county level data) and
http://www.irs.gov/uac/SOI-Tax-Stats-Individual-Income-Tax-Statistics-2011-ZIP-Code-Data-(SOI) (ZIP level data).
There are basically four files per tax year covering the entire U.S. Two files contain data summarizing all the tax returns, regardless of adjusted gross income level: one for states and counties, and the other for ZIPs. A corresponding pair of files summarizes the data based on six adjusted gross income (AGI) levels. These files have been converted "as-is" and stored as four data sets (per tax year) using the variable names used in the corresponding technical documentation (available in Word documents downloaded along with the data and stored in this data directory).
For tax year 2014 these four data sets are:
- uscntysallagi14 and uscntysnoagi14 (county level original data with and without AGI detail)
- uszipsallagi14 and uszipsnoagi14 (ZIP code level original data with and without AGI detail).
The variable names on these sets conform to the tech doc but are not very mnemonic. N1 is the variable for Total Number of Returns; N00200 and A00200 are the number and aggregate dollar amount (in thousands) of the Salaries and Wages line on the returns. In other words, N00200 tells us how many of the total (N1) returns reported having some Salary and Wage income, while A00200 tells us the aggregate total of those reported values. Most of the non-ID variables on these data sets are of the form Nnnnnn and Annnnn, where nnnnn is a 5-digit line-item code (see the technical documentation for what those codes mean). We have assigned labels to each of the variables on these data sets that indicate the meaning of the code. For example, the label for variable A00300 is "Taxable Interest Amt".
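To make the N/A naming convention concrete, here is a minimal Python sketch (not MCDC code; the record values are invented for illustration) showing how a name like A00200 decodes and how an N/A pair yields an average and a percent-of-returns figure:

```python
# Sketch of the IRS SOI variable-name convention: Nnnnnn = count of returns
# reporting item nnnnn, Annnnn = aggregate dollars (in thousands) for it.

def item_code(varname):
    """Return the 5-digit line-item code from a name like 'N00200' or 'A00200'."""
    if varname and varname[0] in ("N", "A") and varname[1:].isdigit():
        return varname[1:]
    return None

# One hypothetical geographic summary record (values invented):
row = {"N1": 1000, "N00200": 800, "A00200": 40000}  # A00200 is in $1000s

# Average salary/wage amount among just the returns that reported any:
avg_wages = row["A00200"] * 1000 / row["N00200"]
# Share of all (N1) returns reporting salaries and wages:
pct_wages = 100 * row["N00200"] / row["N1"]

print(item_code("A00200"), avg_wages, pct_wages)
```

Here 800 of 1,000 returns reported wages totaling $40,000 thousand, so the average is $50,000 per reporting return and 80% of returns reported the item.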
We processed the 2013 (tax year) data in February 2016. There were 19 new form items added for these data sets. These have all been included on our data sets, including the "plus" versions, where we have assigned mnemonic names and calculated corresponding averages and percents.
Enhanced Versions of Original Data Sets
While the data as downloaded provide the researcher with the basic information needed to do interesting analysis, the numbers as reported are not very meaningful per se. An aggregate dollar amount reported in a category is really not that useful when trying to see what it means or how it compares to other geographic areas. Likewise, the N values indicating how many returns reported the item are not all that informative in isolation. What would make these raw values more useful is pairing them with means (averages) of the amounts reported and percentages (of total returns) for the N values. Then, instead of just knowing the total Social Security income reported and the number of returns reporting such income, we would have the average dollar amount of Social Security received (just for the returns reporting it) and the percentage of all returns reporting such income. It would also help anyone wanting to write code to access these data if the variable (field) names were more meaningful - e.g. "AGI" instead of "A00100". This is why we created a parallel set of four enhanced versions of these data (per tax year). We encourage users to access these data sets. You will recognize them by the word "plus" in the data set names and the word "Enhanced" in the descriptions that appear on the uexplore contents page and in the Datasets.html metadata.
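The kind of enhancement described above can be sketched in Python as follows. This is not the MCDC's actual code, and the mnemonic field names (NumReturns, NSocSec, AmtSocSec) and values are invented for illustration:

```python
# Sketch: for each Amt/N item pair, derive an average (dollars per
# reporting return) and a percent of total returns, as the "plus"
# data sets do. Field names and values here are illustrative only.

def enhance(rec):
    out = dict(rec)
    total = rec["NumReturns"]          # total returns (N1 in the raw data)
    for name in list(rec):
        if name.startswith("Amt"):
            item = name[3:]            # e.g. "SocSec"
            n = rec.get("N" + item, 0)
            if n:                      # guard against divide-by-zero
                out["Avg" + item] = rec[name] * 1000 / n   # Amt is in $1000s
                out["Pct" + item] = 100 * n / total
    return out

rec = {"NumReturns": 500, "NSocSec": 100, "AmtSocSec": 1500}
print(enhance(rec))
```

With these invented numbers, 100 of 500 returns reported $1,500 thousand of Social Security income, giving an average of $15,000 per reporting return and a 20% reporting rate.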
We have also added a SumLev (summary level) code variable to the enhanced data sets to make it easier to distinguish the state total observations from the county or ZIP level summaries. A value of 040 indicates a state summary; values of 050 and 871 indicate a county or ZIP-within-state level summary, respectively. Note that the IRS refuses to deal with the reality that some ZIP codes cross state lines, at least a little bit. The tech doc seems to indicate that returns where the state does not match the IRS's ZIP-to-state lookup are omitted from the aggregations. That should not be too significant, since there are very few addresses where this occurs.
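A quick Python sketch of how the SumLev variable lets you separate state totals from the detail rows (not MCDC code; the records and the assumption that SumLev is stored as a character value are for illustration only):

```python
# Sketch: splitting state-total observations from lower-level summaries
# using the SumLev code. Records are invented; SumLev is assumed to be
# stored as a character value here.

records = [
    {"SumLev": "040", "Areaname": "Missouri"},   # state summary
    {"SumLev": "871", "Areaname": "63101"},      # ZIP-within-state summary
    {"SumLev": "871", "Areaname": "65201"},
]

state_totals = [r for r in records if r["SumLev"] == "040"]
detail_rows  = [r for r in records if r["SumLev"] != "040"]

print(len(state_totals), len(detail_rows))
```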
Handling of "Special" .0001 Values
Anyone taking the time to study the technical documentation files (referenced below) will come across an Endnotes section that begins (at least in the 11zpdocs-1 version) like this:
 For complete individual income tax tabulations at the State level, see the historic table posted to Tax Stats at http://www.irs.gov/uac/SOI-Tax-Stats---Historic-Table-2.
 Values of 0.0001 indicate an AGI class that has been combined with another AGI class within the same ZIP code or moved to the “other” (99999) category, where applicable.
[Our bold added.] Basically, this means that if no data were reported for a category then, instead of a simple 0 value, you will get the pseudo-value .0001.
We decided to leave these values as they are distributed by the IRS on the original data sets, but we have replaced them with 0 values on our "enhanced" data sets (the ones with "plus" in the name).
In the original creation of our enhanced data sets in early March 2014, we mistakenly assigned values to our new Avg variables in cases where there were 0 (or .0001) returns with data for a category. For example, if no data were reported in the UnempComp category then both the number and dollar amount values for UnempComp had values of .0001. Our code that defined the corresponding Avg value looked like this:
if NUnempComp ne 0 then AvgUnempComp = AmtUnempComp * 1000 / NUnempComp;
What we were expecting here was that the condition (NUnempComp ne 0) would be false when no data were reported, so the assignment statement would not execute and the Avg variable would get a missing value, which is what you want in that situation. But with the .0001 values stored in what appeared to be 0 values, what we got was a bunch of Avg variables with values of 1000 (since .0001 * 1000 / .0001 = 1000). We were alerted to this problem by a user (Hoa Nguyen) on April 10, 2014, and it was fixed that morning. We decided to leave the original data sets as-is, with the .0001 values lurking. (There are no average variables on those data sets, so there was nothing to fix there.) But on our data sets we checked for these .0001 values and replaced them with 0's. Our Avg variable calculation then worked as intended, and we no longer have the bogus $1,000 values showing up when there are no data for the category.
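The bug and its fix can be reproduced in a few lines of Python (a sketch, not the actual SAS code; the function name is ours):

```python
# Sketch: the .0001 "no data" sentinel defeats a not-equal-to-zero guard,
# producing a bogus average of exactly 1000; replacing the sentinel with 0
# lets the guard work, leaving the average missing (None).

SENTINEL = 0.0001

def avg_unemp(n, amt, clean=False):
    if clean and n == SENTINEL and amt == SENTINEL:
        n, amt = 0, 0                  # what the enhanced data sets now do
    if n != 0:
        return amt * 1000 / n          # Amt fields are in $1000s
    return None                        # missing value, as intended

print(avg_unemp(SENTINEL, SENTINEL))              # the bug: 1000
print(avg_unemp(SENTINEL, SENTINEL, clean=True))  # after the fix: None
```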
Names Added to ZIP Code Data
Another enhancement you'll find in the "zipsplus" data sets is an Areaname variable, included to give users not familiar with a ZIP code some idea of where it is located. This name is usually the name of the city with the largest intersection with the ZIP code area. It is not an official name but simply one that may help when presenting results.
Be sure to review the technical documentation files that provide further detail regarding these data. These Word files are provided by the IRS and were used by the MCDC to read and document the raw data files, which were downloaded from the same source. For the 2013 tax year the two files are 13incydocguide.doc and 13zpdoc.doc. There is very little difference between the two; the line-item variables are the same regardless of geography.
Getting Data for Just Your State
If you are put off by the fact that all these data sets are "us" (have data for the entire country) when all you want is data for a single state (or perhaps just 2 or 3), you should know how easy it can be to create state-based subsets. The key is knowing the state FIPS codes and knowing how to use them to code the filter in Section II of the Dexter Query Form (DQF). You can get the state codes in lots of places, such as:
- The MCDC's geographic codes lookup for the United States.
Of course, you can also get them by following the link to "Details" on the Datasets.html page, or the link to "Detailed metadata" at the top of the Dexter Query Form when you are accessing one of these data sets.
Once you have used one of these resources to find (for example) the codes for New York, New Jersey, and Connecticut, you can easily specify this 3-state subset by coding Section II (of the DQF) as follows:
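As an illustration only (the variable name State and the character-coded values are our assumptions - check the Detailed metadata link for the actual variable name and type), the filter condition would look something like:

```
State in ('36', '34', '09')   /* 36 = New York, 34 = New Jersey, 09 = Connecticut */
```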
Or, for those who just need data for a single state (like Missouri) you could use:
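Along the same (illustrative, assumption-laden) lines as above, a single-state filter could look like:

```
State = '29'   /* 29 = Missouri */
```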
You could also use state postal abbreviations (variable Stab) to do your filtering, but that would be too easy. Also, the data sets are indexed by the state code, so filtering on that variable is very fast. (This has more to do with nerd aesthetics than any noticeable difference in performance.)
More Data for More Years
The IRS started releasing these data (for tax year 2011 only) in early 2014. Their web site indicated that additional recent years (2009 and 2010) were to be released shortly. We downloaded and processed the 2012 data later in 2014, and the 2013 data in early 2016 (February, although those data had already been available for a while by then). The 2014 data were added in September 2016. We would expect future years to be released circa September each year.