PROC MEANS, PROC SUMMARY and PROC FREQ in SAS are used to evaluate quantitative data and to create summary reports for analysis. With PROC MEANS, you can compute descriptive statistics such as the mean, standard deviation, minimum and maximum, along with many others.
Applications of PROC MEANS
- Describing quantitative data for analysis.
- Describing the means of numeric variables by group.
- Identifying outliers and extreme values.
PROC MEANS DATA=<dataset-name> <options> <statistic keywords>; <statements>
The most commonly used options in PROC MEANS are:
- MAXDEC – Determines the number of decimal places to print in the output.
- NOPRINT – Suppresses the output of descriptive statistics.
- ALPHA – Sets the level for confidence limits (default is 0.05)
Statistic keywords select the measures to compute, such as MEAN, MEDIAN and STD (standard deviation). The full list of statistic keywords is available on the SAS documentation website.
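As an illustration of how options and statistic keywords combine on the PROC statement, here is a minimal sketch. The choice of variables and of ALPHA=0.1 (90% confidence limits) is only an assumption for the example.

```sas
/* N, mean, median, standard deviation and 90% confidence limits,
   printed with two decimal places */
proc means data=sashelp.class maxdec=2 alpha=0.1 n mean median std clm;
  var height weight;
run;
```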
The main difference between PROC MEANS and PROC SUMMARY is whether a printed table is produced by default:
By default MEANS always creates a table to be printed. If you do not want a printed table you must explicitly turn it off (NOPRINT option).
On the other hand, the SUMMARY procedure never creates a printed table unless it is specifically requested (PRINT option).
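The two defaults can be compared side by side. This is a small sketch; the output data set name means_out and the variable name meanHeight are made up for the example.

```sas
/* PROC MEANS prints by default; NOPRINT leaves only the output data set */
proc means data=sashelp.class noprint;
  var height;
  output out=means_out mean=meanHeight;
run;

/* PROC SUMMARY prints nothing by default; PRINT requests the table */
proc summary data=sashelp.class print;
  var height;
run;
```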
Using the CLASS statement
The CLASS statement is used in both the MEANS and SUMMARY procedures. It can be used as a single statement or in a series of CLASS statements.
The order of variables in the CLASS statement determines the order of classification of variables.
Options can be applied in the CLASS statement by preceding the option with a slash.
The MISSING Option
By default, observations with a missing value for any classification variable are excluded from the analysis. When the MISSING option is specified on the PROC statement, it applies to all of the classification variables.
By using multiple CLASS statements along with the MISSING option on the CLASS statement, you can choose which classification variables are to utilise the MISSING option.
data class; set sashelp.class; if age < 14 then age=.; run;
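Using the data set class created above, the two placements of MISSING might look like this. It is a sketch only; the statistics requested are arbitrary.

```sas
/* MISSING on the PROC statement: applies to every class variable */
proc means data=class missing n mean;
  class age sex;
  var height;
run;

/* MISSING on one CLASS statement: only AGE keeps its missing level */
proc means data=class n mean;
  class age / missing;
  class sex;
  var height;
run;
```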
The CLASS Statement
The CLASS statement in PROC MEANS names the variables whose values define the subgroups for the analysis. The options below can be used on the CLASS statement, where they follow a slash after the list of classification variables. Several of them control the order in which the classification levels are displayed.
The ASCENDING/DESCENDING Options
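These options set the direction in which the classification levels are displayed. Levels are normally shown in ascending order, and DESCENDING reverses that. With ORDER=FREQ, levels are listed by descending frequency by default, and ASCENDING switches to ascending frequency.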
proc means data=class; class age /order=freq ascending; run;
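GROUPINTERNAL and EXCLUSIVE
These options control how formats associated with CLASS variables are used when forming groups.
When a classification variable is associated with a format, that format is used in forming the groups by default. The GROUPINTERNAL option overrides this and groups on the unformatted (internal) values instead. The EXCLUSIVE option, which requires PRELOADFMT on the same CLASS statement, limits the analysis to the levels defined in the preloaded format.
In the following example, the format weightclass is used to classify students (for example Underweight, Normal, Overweight) based on their BMI.
proc format;
  /* ranges are contiguous so that no BMI value falls into a gap */
  value weightClass
    low  -   18.5 = 'Underweight'
    18.5 <-< 25   = 'Normal'
    25   -<  30   = 'Overweight'
    30   -   high = 'Obese';
run;
data class2;
  set sashelp.class;
  /* BMI from weight in pounds and height in inches */
  bmi = weight*703/(height**2);
  format bmi weightclass.;
run;
proc means data=class2 noprint;
  class bmi;
  var height weight;
  output out=class_summary mean=MeanHT MeanWT;
run;
The resulting data set shows that the MEANS procedure has used the format to collapse the individual BMI values into the formatted classification levels. Adding the GROUPINTERNAL option to the CLASS statement (class bmi / groupinternal;) would instead produce one group for each distinct BMI value.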
Multilevel formats allow you to have overlapping formatted levels.
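A multilabel format is activated with the MLF option on the CLASS statement. The sketch below assumes a hypothetical format agegrp whose last range overlaps the first two, so each student is counted in two levels.

```sas
proc format;
  /* hypothetical multilabel format: the last range overlaps the others */
  value agegrp (multilabel)
    11 - 13 = '11 to 13'
    14 - 16 = '14 to 16'
    11 - 16 = 'All ages';
run;

proc means data=sashelp.class n mean;
  class age / mlf;       /* MLF tells the procedure to use all labels */
  format age agegrp.;
  var height;
run;
```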
The ORDER= option controls the order in which the classification variable levels are displayed. It accepts the following values:
- DATA – order is based on the order of incoming data
- FORMATTED – Values are formatted first and then ordered.
- FREQ – the order is based on the frequency of class level.
- UNFORMATTED – values are ordered by their unformatted (internal) values. INTERNAL is an alias for UNFORMATTED.
proc means data=class2; class age bmi/order=freq; var height weight; run;
The difference between BY and CLASS Statements
The BY statement provides summaries for the groups created by the combination of all BY variables, and it requires the input data to be sorted (or indexed) by those variables. The CLASS statement needs no sorting and provides summarized values for each class variable separately as well as for each possible combination of class variables, unless you use the NWAY option.
You can also use the CLASS and BY statements together to analyze the data by the levels of class variables within BY groups.
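A minimal sketch of BY and CLASS used together follows. The sorted copy class_sorted is a name chosen for the example.

```sas
/* BY processing needs the data sorted by the BY variable */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

proc means data=class_sorted n mean;
  by sex;        /* one separate report per SEX value */
  class age;     /* AGE levels summarised within each BY group */
  var height;
run;
```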
OUTPUT options in PROC SUMMARY
The OUTPUT statement with the OUT= option stores the summary statistics in a SAS data set. Other options can be added to the OUTPUT statement after a slash.
- AUTONAME – Lets MEANS and SUMMARY determine the names of the generated variables.
- AUTOLABEL – Lets MEANS and SUMMARY apply a label to each generated variable.
- LEVELS – Adds the _LEVEL_ variable, which numbers the combinations of class variable values, to the summary data set.
- WAYS – Adds the _WAY_ variable, which records how many class variables were combined to form each _TYPE_ value, to the summary data set.
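The options listed above could be combined as in the following sketch. The output data set name class_stats is an assumption for the example; with AUTONAME the statistic variables get names such as Height_Mean and Weight_Max.

```sas
proc summary data=sashelp.class;
  class sex age;
  var height weight;
  /* options follow the slash on the OUTPUT statement */
  output out=class_stats mean= max= / autoname autolabel levels ways;
run;

proc print data=class_stats;
run;
```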
Identifying Extreme Values
To get a correct analysis, excluding the observation containing the extreme lowest or extreme highest values is often necessary.
These extreme values are automatically displayed in PROC UNIVARIATE but must be explicitly specified in PROC MEANS and PROC SUMMARY procedures.
The MAX and MIN statistics show the extreme highest and lowest values, but they do not identify the observations that contain these values.
MAXID and MINID
The two options MAXID and MINID, when used in the OUTPUT statement, identify the observations with the extreme values.
proc summary data=sashelp.class; class age; var height; output out=stats max=maxHeight maxid(height(name))=maxStudentName; run;
The above example creates an output data set containing the maximum height for each age group (the class variable), together with the name of the student who has that height.
Using the IDGROUP Option
The IDGROUP option captures a group of extreme values, unlike MAXID and MINID, which capture only a single extreme value.
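As a sketch of IDGROUP, the step below keeps the two tallest students in each age group. The names top2, maxHeight and maxName are chosen for the example; SAS appends _1 and _2 to the output variable names.

```sas
proc summary data=sashelp.class;
  class age;
  var height;
  /* OUT[2] keeps the two largest HEIGHT values and the matching NAME */
  output out=top2
         idgroup(max(height) out[2] (height name)=maxHeight maxName);
run;
```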
The PERCENTILE to create subsets
The percentile statistics can be used to set bounds for identifying potential outliers. This helps to determine whether any observation falls outside a chosen percentile, such as the 1st or the 5th.
A percentile is the value below which a given percentage of the observations fall.
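The step that computes the percentile bounds is not shown in this article. One way it could look, assuming a hypothetical input data set sampledata with a numeric variable age (the values below are invented purely for illustration), is:

```sas
/* hypothetical input data; only the AGE variable matters here */
data sampledata;
  input age @@;
  datalines;
23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 2 95
;
run;

/* AUTONAME produces the variables age_P1 and age_P99 in STATS */
proc summary data=sampledata;
  var age;
  output out=stats p1= p99= / autoname;
run;
```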
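/* Read the percentile bounds once, then scan every row of SAMPLEDATA */
data outlier;
  set stats(keep=age_p1 age_p99);
  do until(EOF);
    set sampledata end=EOF;
    /* keep rows at or outside the 1st and 99th percentiles */
    if age_p1 ge age or age ge age_p99 then output outlier;
  end;
run;
options nobyline;
proc print data=outlier;
  by age_p1 age_p99;
run;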
The 1st and 99th percentiles of age are calculated and saved in the data set STATS. The IF condition checks whether age is at or below the 1st percentile or at or above the 99th percentile.
In this example, observations C, J and K lie outside the 1st-to-99th percentile range of the data.
The automatic _TYPE_ variable
The _TYPE_ variable is automatically included in the summary data set. It is a numeric variable that helps to track the level of summarization and to distinguish the groups of statistics.
proc summary data=sashelp.class; class age; output out=c1; run;
With one class variable (Age), _TYPE_ takes the values 0 and 1. It is always 0 if the procedure has no CLASS variables.
The first observation has _TYPE_ = 0, meaning there is no classification and the statistics are calculated over all observations.
The next observations, with _TYPE_ = 1, tell us that statistics such as the frequency have been calculated for each age.
As you add classification variables, the range of _TYPE_ values grows: with n class variables, _TYPE_ runs from 0 to 2^n − 1. With two class variables, for example class sex age;, two more values appear.
For _TYPE_ = 2, the statistics are computed for each SEX level (Male and Female). The classification by the AGE variable is not considered here.
For _TYPE_ = 3, the statistics are computed for each combination of SEX and AGE values. It tells us that both classification variables are used.
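The _TYPE_ values for two class variables can be inspected with a sketch like the one below. The data set name c2 and the variable name meanHeight are made up for the example.

```sas
proc summary data=sashelp.class;
  class sex age;
  var height;
  output out=c2 mean=meanHeight;
run;

/* _TYPE_ = 0 overall, 1 by AGE, 2 by SEX, 3 by SEX and AGE */
proc print data=c2;
  where _type_ = 2;   /* keep only the per-SEX rows */
run;
```

Filtering on _TYPE_ in a WHERE statement is a simple way to pull one level of summarization out of a data set that contains all of them.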
Using the NWAY option
Use the NWAY option if you want statistics only for the full combination of all class variables, not for each individual classification or for the overall total.
proc summary data=sashelp.class nway; class sex age; output out=c1; run;
The NWAY option keeps only the observations with the highest TYPE value.