PROC SUMMARY in SAS procedures allows us to explore our data not only in terms of counts and distributions but also statistically.
PROC SUMMARY DATA=libref.filename options; BY (or CLASS) variable-list; ID variable-list; VAR variable-list; TYPES variable-list; WAYS n; OUTPUT OUT=libref.filename(options) output statistic=variable-list/options; RUN;
Options (partial list):
👉 MISSING – Treat missing values as a valid subgroup value
👉 NWAY – Calculate only the highest level of interaction
👉 DESCENDING – Arrange lowest summary levels first (by default they are arranged ascending)
👉 ORDER= – Specify sort order of CLASS variables
Optional statements (partial list):
👉 CLASS variable(s);
👉 ID variables(s);
👉 VAR variable(s);
👉 BY variable(s);
👉 WHERE where condition(s);
👉 OUTPUT OUT= SASdataset options=variable(s);
Available options on the
OUTPUT OUT=: N NMISS PRT VAR MEAN RANGE CSS MIN SKEWNESS STD USS CV SUMWGT MAX KURTOSIS
Aggregating Data with PROC SUMMARY
The CLASS statement in PROC SUMMARY names the character or numeric variables in which the data will be classified. The variables listed on the CLASS statement should be categorical variables, that is, they should have a few numbers of discrete values.
The VAR statement, on the other hand, names the numeric variables which will be analyzed.
The ID statement is used to specify any character variables not specified in the BY or CLASS variable list that you want to retain in the output SAS data set. Remember that variables that you specify with the ID statement are not summarized. Rather, the value of the ID variable on the last row summarized is retained. This is useful when a variable must be retained in the output file, but whose value is the same for each combination of the BY or CLASS variable list. Variables listed with an ID statement could be added to the BY or CLASS variable list to produce the same result.
The OUTPUT OUT = statement names the output SAS data set. It also defines which statistics will be used with which variables and what they will be called on the output SAS data set.
Automatic Variables in Proc Summary
The SUMMARY procedure creates two variables automatically: _FREQ_ and _TYPE_.
The _FREQ_ variable stores the number of rows from the input SAS dataset that is summarized into every single row.
The _TYPE_ variable contains a numeric value identifying the level of interaction between the variables in the CLASS list.
When a BY statement is used, the _TYPE_ variable will always equal 0. When the CLASS statement is used, the _TYPE_ variable will contain 0 for a grand total row and values of 1 through n for various levels of interaction between the variables in the CLASS list.
Options in Proc Summary
There are two important SUMMARY procedure options: MISSING and NWAY.
The MISSING option instructs the SUMMARY procedure to considers missing values in a class variable when creating summary rows. If you omit the MISSING option, the SUMMARY procedure excludes any rows with a missing value in a CLASS variable from the resulting output SAS data set.
The MISSING option can either be placed on the PROC SUMMARY statement or following a “/” at the end of the CLASS statement.
The NWAY options instruct the SUMMARY procedure to only create rows with a combination of all class variables. These options are available only when used with the CLASS statement, not with the BY statement. When the SUMMARY procedure is used with the BY statement it will produce the same output file as when used with the CLASS statement combined with the NWAY option.
See an example of NWAY later in this tutorial.
The below code snippet produces summary statistics like – Total Sales, Average Sales, Minimum and Maximum Sales for each region from the SASHELP.SHOES dataset.
proc summary data=sashelp.shoes; class Region; var sales; OUTPUT OUT=SUMDS SUM=TOTSales MEAN=AvgSales MIN= MAX= /autoname autolabel; run;
The Options AUTONAME and AUTOLABEL request the procedure creates unique and meaningful column names and labels for the results.
How are the TYPE Values Useful?
The TYPE variable is a part of the output SAS data set built (meaning if you did a PROC CONTENTS you would see it on there). This also means that you can query against this variable’s value and use it to create different reports, each of which contains different information, with different levels of detail and summarization.
The below code snippet is used to group sashelp.cars dataset by Origin,Type and Drivetrain.
proc summary data=sashelp.cars; class origin type drivetrain; var msrp; output out=cars_summary sum= mean= /autolabel autoname; run;
TYPE =0 Represents the entire data set
TYPE =1 Origin (across all Origins)
TYPE =2 Represents Type(across all Types withen Origins)
TYPE =3 Represents Drivetrain within Types and origins
After a PROC SUMMARY, a series of PROC PRINTs could be coded to select off different TYPE values and to create several different reportS each with a different level of information:
Suppose you want to find the Total MSRP by Region
title "Summary Statistics by Origin"; proc print data=cars_suummary; where _type_=4; run;
title "Summary Statistics by Origin and Type"; proc print data=cars_suummary; where _type_=6; run;
Figuring Out TYPE Values
There are two ways to see your _TYPE_values:
Apply PROC PRINT to the output SAS data set but if you have several CLASS variables this may result in a LOT of output.
Another way is to manipulate the data to print only the first value of each _TYPE_ variable, which is enough to find out which _TYPE_ values you need (be careful of doing this if you have used the MISSING option on the PROC SUMMARY statement)
DATA cars; set cars_summary; BY _TYPE_; IF FIRST. _TYPE_; RUN; PROC PRINT DATA=cars; T1TLE1 'PRINTING ONLY THE FIRST OCCURRENCE OF EACH TYPE VALUE'; RUN;
Using the NWAY and LEVELS option
As discussed earlier, with NWAY option you can have only the combination of class variables in the output. To demonstrate this see the example below where I have used the NWAY option to find the MSRP sum of Cars by Region and Type.
proc summary data=sashelp.cars nway; class origin type; var msrp; output out=summary_nway sum=msrp; run;
Until now, you have seen that adding class variable increases the _TYPE_ variable and eventually also increases the number of output. Using, NWAY option you can have only the combination of Origin and Type class variables.
The WAYS statement is used to specify the number of ways to make combinations of class variables. This causes the SUMMARY procedure to create summary rows for only the combinations specified. The WAYS statement does not work with the BY statement. You use this statement by specifying one or more integers that define the number of class variables to combine. For example, if you want only the row combination of Origin , type and Drivetrain then use the following WAYS statement:
proc summary data=sashelp.cars; class origin type drivetrain; var msrp; ways 1; output out=summary_nway sum=msrp; run; title "Using WAYS statement"; proc print;
You can also request multiple ways. For example, if you want all the rows representing a combination of two class variables – and you want the row representing the grand total (no combination of any class variables), you can use the following WAYS statement:
WAYS 0 2;
You can use the LEVELS and WAYS options to the OUTPUT statement to include in the the _LEVEL_ and _WAY_ variables in the output.
The _LEVEL_ variable contains a value from 1 to n that indicates the combination of class variables.
The _WAY_ variable contains a value from 1 to the maximum number of class variables that indicates how many class variables the SUMMARY procedure combines to create a row in the output SAS data set.
To use these options, add them to the OUTPUT statement after a “/”.
proc summary data=sashelp.cars; class origin type; var msrp; output out=summary_type sum=msrp /levels ways; run; title "Using Levels/ways"; proc print;
The TYPES statement creates summary rows for combinations of variables that you specify in the CLASS statement. The TYPES statement does not work with the BY statement. You use this statement by specifying each combination of class variables you want to be included in the summary ouput by stating the class variables separated by an asterisk.
For example, if you want only the row combination of Origin and type, and you want the rows for DIV only, then use the following TYPES statement:
TYPES Origin*Type Origin;
proc summary data=sashelp.cars; class origin type; var msrp; types origin*type Origin; output out=summary_type sum=msrp /levels ways; run;
If you also want the grand total row, use the syntax TYPES()
PROC SUMMARY Without a VAR Statement
You can use the PROC SUMMARY procedure without a VAR statement. In this case, it displays the counts or the number of occurrences of your CLASS variables’ values. This gives PROC SUMMARY the same functionality that we find in PROC FREQ.
PROC SUMMARY DATA=sashelp.cars; CLASS origin drivetrain; OUTPUT OUT=cars1; RUN; PROC PRINT DATA=cars1; TITLE 'RESULT WITHOUT USING A VAR'; RUN;
PROC SUMMARY Without a CLASS Statement
PROC SUMMARY does not need to have a CLASS statement. PROC SUMMARY must have either a CLASS or a VAR statement, but it does not need to contain both. When no CLASS statement is provided, only a _TYPE_ =0 record is produced and no other levels of variables are created.
PROC SUMMARY DATA=sashelp.cars; VAR msrp; OUTPUT OUT=cars2 SUM(msrp)= /autoname /autolabel; RUN;