Program overview
================

cnvtsf3.sas

Requirements
============

This code was written for SAS Version 8 or later. It uses names longer
than 8 characters and labels longer than 40, which were not supported in
earlier versions of SAS. Although it was written and tested on a Unix
(IBM AIX) system, very little of the code should be affected by platform.
That is, we expect it to run pretty much as-is in a Windows environment.
MVS or CMS might be a little more of a challenge.

In terms of computer resources, these files are quite large. You will
need considerable space for storing the input files (although we try to
minimize this by reading directly from zipped versions of the files). You
will also need space if you are saving the output data sets. Based on the
test data we have to work with at the moment (late July, 2002) we can
only say that we have tried to minimize storage by using length
statements and the compress=yes option on the data sets.

Invoking
========

As long as nothing goes wrong, and as long as you do not want to create a
custom version of the select macro, the person invoking this conversion
setup should not need to know much about SAS: just enough to point to the
appropriate directories and then invoke the program. The program was
debugged and tested in a batch environment. No user input is required (or
allowed) while the program is running. Typically, you will save this
program in a .sas file somewhere on your system, edit the parameters as
described below, and then invoke the SAS system in batch mode, pointing
to this program as the input file. In Unix this would typically be as
simple as

    sas cnvtsf3 &

You can also run the program interactively from Display Manager, of
course. The nice thing about a batch invocation is that you can go away
and come back later. Conversion runs for large states or on slower
systems could take a while.
There are over 16,000 data cells for each geographic area on these data
files (fewer for sub-tract geographies), so there is a lot of work to be
done.

Inputs
======

There are 77 input files: a geographic headers/id file and 76 table cell
files. All are zipped ASCII files. The geography headers file is fixed
format, and the remaining data segment files are all csv (comma separated
value) files (though the Bureau does not use that standard extension in
naming their files). This setup reads the 76 data files using infile
pipes to decompress them on the fly. We use the unzip program in the
pipe. This program is compatible with pkunzip, so Windows users with that
program should be able to substitute pkunzip for unzip in the code. A
Windows version of unzip is included in this Tools directory (as
unzip.zip) which can be downloaded and used. (Alternatively, you can
always go ahead and unzip all input files and alter the infile statement
within the readloop macro to point to the decompressed files rather than
using the pipes.)

Outputs
=======

The large DATA step creates 4 SAS data sets in the sf32000 data library.
(The output library is expected to be permanent, so that you only have to
run the conversion once, and can then access it as often as you wish.
This has the advantage of being *much* faster than having to convert the
raw data files each time you want to access the data. But if you prefer
to reconvert each time you can easily point the sf32000 library to a
temporary directory.) Each output SAS dataset name begins with the state
postal abbreviation, stored in the global macro variable stab. So, for
example, when converting the file for New York, you would be creating the
data sets:

  nygeos   Geographic header and id data only (except it does contain
           Pop100 and HU100, the total pop and HU counts).

  nyph     Population and housing data - "P" and "H" tables - available
           at all geo. levels.
           This set and the following 2 all contain the complete
           geographic headers variables as well as the specified table
           cells.

  nyphct   Pop and housing data - PCT and HCT tables (except those in the
           next set, see below) - with data only for geographic levels
           census tract and above. So for block group level summaries
           there WILL BE NO OBSERVATIONS in this data set.

  nyphctr  This data set contains data for those PCT and HCT tables with
           a 1-character racial indicator in their names. Examples:
           PCT62A thru PCT62I and HCT29A thru HCT29I. Like the previous
           data set, this one contains no observations for block group
           (sub-tract) level summaries.

Since each of these files is sorted by the unique key logrecno, it is
very easy to put the pieces back together again when using the data sets
in an application. For example, to combine all the data sets for New York
you could code:

    data nyall;
      merge sf32000.nyph sf32000.nyphct sf32000.nyphctr;
      by logrecno;
      where sumlev in ('050','060') and county='36001';  *<---optional filter---;
    run;

You might even want to define a SAS view that combines all the data into
one data file using:

    proc sql;
      create view sf32000.de as
        select *
        from sf32000.deph ph, sf32000.dephct pct, sf32000.dephctr pctr
        where ph.logrecno=pct.logrecno and ph.logrecno=pctr.logrecno;
    quit;

This creates the view de (yes, we switched from New York to Delaware
here) containing *all* the sf3 tables for the state. This is convenient
in some cases, but be cautioned that the number of variables in this view
is so large (over 16,000) that you bump into some SAS limits, such as the
100-screen limit in fsedit/fsbrowse.

Related SAS code
================

This setup references several other SAS source modules directly (see the
"%include curpath...." statements). These modules contain rather lengthy
SAS LABEL statements. The code will run perfectly well if you decide you
do not need these labels, in which case you can simply delete the
%include statements.
These modules should be distributed along with the cnvtsf3.sas module and
should be stored in the same directory. Their file names are PLabels.sas,
HLabels.sas, PCTLabels.sas, and HCTLabels.sas. The significant benefit of
using these modules is that SAS users can use tools such as Proc Contents
or the Variables window under Display Manager to see these labels. The
labels can also be displayed as column headers in some procedure output,
including proc print. Finally, these SAS modules are extremely useful to
have around as codebook modules, since they not only contain variable
names and labels but also have the complete table structure information
included as comments in between the assignment portions of the giant
LABEL statements.

SAS Format Codes
================

The conversion code makes reference to three custom SAS format codes (the
kind you create with Proc Format). The source code for these format codes
is stored in the formats.sas module, which should be distributed in the
same directory as the cnvtsf3.sas module. This code is referenced via a
%include statement within the conversion module. You may want to edit
this module, especially the part that defines the $fplace code, since it
is quite long. Many users may opt not to use the $fplace format code, as
noted in the source code where it is referenced. We use it to assign a
different value to Areaname for 155 level summaries (place within
county). We prefer to have the place name rather than the county name on
these summaries, so we use this format to "look up" the name of the
place. Even if you plan to use this conversion, you may decide you do not
need a complete format code for all states.

The other two format codes in the file are trivially small. The $fipstab
code is only used when converting a national file, while the $sumlev code
is used only to label values of the sumlev codes in the post-conversion
proc tabulate step.

Parameters, etc.
================

Invoking this program for a specific set of files (presumably one state's
worth) requires assigning some global macro variables and tweaking some
other code.

*****************************IMPORTANT NOTE*****************************
We have flagged parts of the code that may need to be altered with
comments beginning with the characters "*<===". Be sure to search for all
occurrences of this string to see what code you might need to change.

You need to specify the state being converted by changing the line with
the %let statements defining the global parms stab and state. To run the
program for Illinois, for example, make sure the line reads:

    %let stab=il; %let state=17;

For consistency we strongly recommend using lowercase for the value of
stab.

The maxobs parm can be assigned a value that will cause the program to
run in test mode, stopping after processing the first &maxobs records off
the input files. Use this for testing, or when you want to do a temporary
convert and know where the last record of interest is on the input files.
For example, specify

    %let maxobs=1;

to convert only the first (state summary) records.

Assign a value to the inpath parameter to specify the path (directory)
where the input files and output SAS data sets are stored. If you do not
wish to store the SAS data sets in the same directory as the input files
then you need to edit the "libname sf32000 ..." statement.

A select macro is provided as a model and is referenced (invoked) within
the conversion step. The code in this macro is executed after reading the
geographic headers data, but **before** any data files are read. So it
can access only the fields/variables in the header record. It needs to
assign a value of 0 or 1 to the variable _keep. The program will
interpret a value of _keep=0 to mean you want to skip processing this
geographic area.
It will handle the logic of skipping over the records in all 76 input
files (76 at most; fewer for sub-tract geography). Use this to code a
"selective convert" of specific geographic entities. For example, if you
coded:

    if sumlev='160' and Pop100 ge 10000 then _keep=1;
    else _keep=0;

you would be converting only place-level summaries for places with at
least 10,000 total population. Note that if you tried to reference the
variable P1i1 here instead of Pop100 it would NOT work. That is because,
at the point where the select filter is invoked, we have not yet read any
of the data for the area, only the geographic header information. You
can, of course, add your own filter code at the bottom of the data step
after all the data files have been read. But this would somewhat defeat
the purpose, which is to save the program from doing a lot of work
reading data records that are not going to be used.

John Blodgett, Missouri Census Data Center, July 31, 2002.
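
As an illustration of the select macro described under "Parameters,
etc.", a custom version might look something like the sketch below. The
macro as actually shipped in cnvtsf3.sas is not reproduced here, so this
assumes it is a parameterless macro whose body is ordinary DATA step code;
the particular filter (counties plus larger places) is made up for the
example.

```sas
/* Sketch only: a hypothetical replacement for the select macro.        */
/* Assumes the macro takes no parameters and that its body runs inside  */
/* the DATA step after the geographic header fields have been read.     */
%macro select;
  /* Keep all county (050) summaries, plus place (160) summaries for   */
  /* places with a total population of 10,000 or more.                 */
  if sumlev='050' then _keep=1;
  else if sumlev='160' and Pop100 ge 10000 then _keep=1;
  else _keep=0;
%mend select;
```

Only geographic header variables such as sumlev and Pop100 should be
referenced in a filter like this, since no table cells have been read at
the point where the macro is invoked.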