PROC MEANS computes descriptive statistics for numeric variables, such as the mean, standard deviation, minimum, and maximum, along with many other statistical measures.
Applications of PROC MEANS
- Describing quantitative data for analysis.
- Describing the means of numeric variables by group
- Identifying outliers and extreme values.
PROC MEANS DATA=<dataset-name> <options> <statistics keywords>; <statements>
The most commonly used options in PROC MEANS are:
- MAXDEC – Determines the number of decimal places to print in the output.
- NOPRINT – Suppresses the output of descriptive statistics.
- ALPHA – Sets the level for confidence limits (default is 0.05)
Statistical keywords are used to calculate statistical measures like mean, median and standard deviation. You can find the list of Statistical keywords on the SAS documentation website.
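As a minimal sketch of how the options and statistic keywords combine on the PROC MEANS statement, the step below requests the mean and its confidence limits for HEIGHT. The choice of variable, the 90% confidence level, and the two decimal places are illustrative assumptions, not taken from the original examples.

```sas
/* MAXDEC=2 limits printed values to two decimal places.          */
/* ALPHA=0.1 requests 90% confidence limits instead of 95%.       */
/* MEAN and CLM are statistic keywords (mean, confidence limits). */
proc means data=sashelp.class maxdec=2 alpha=0.1 mean clm;
    var height;
run;
```

Adding NOPRINT to the same statement would suppress the printed table, which is mainly useful together with an OUTPUT statement.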
A simple example of Proc Means
You want to analyze the Mean, Maximum, Minimum, and Standard deviation of the student age in SASHELP.CLASS dataset.
In the DATA= option, you specify the dataset you want to use. In the VAR statement, you list the numeric variables you want to analyze. You cannot list character variables in the VAR statement.
If you omit the VAR statement, PROC MEANS generates the statistics for all the numeric variables in the dataset.
proc means data=sashelp.class; var age; run;
By default, PROC MEANS generates N, Mean, Standard Deviation, Minimum and Maximum statistics.
The most frequently used statistic keywords in PROC MEANS are listed below with their descriptions.
| Keyword | Description |
|---|---|
| N | Number of non-missing observations |
| NMISS | Number of missing observations |
| SUM | Sum of observations |
| USS | Uncorrected sum of squares |
| CSS | Corrected sum of squares |
| T | Student's t value for testing H0: population mean = 0 |
| PRT | P-value associated with the t-test above |
| SUMWGT | Sum of the WEIGHT variable values |
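To illustrate how these keywords are requested, here is a small sketch that lists several of them on the PROC MEANS statement. The choice of HEIGHT and WEIGHT as analysis variables is an assumption made for the example.

```sas
/* Listing keywords replaces the default set (N, Mean, Std Dev, Min, Max). */
proc means data=sashelp.class n nmiss sum uss css t prt;
    var height weight;
run;
```

Only the statistics named on the statement appear in the output, so N has to be listed explicitly if you still want it.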
Limit Descriptive Statistics
Suppose you want to see only two statistics – Minimum and Maximum
proc means data=sashelp.class min max; var age; run;
Using the CLASS statement
The CLASS statement is used in both the MEANS and SUMMARY procedures. It can be used as a single statement or in a series of CLASS statements.
The order of variables in the CLASS statement determines the order of classification of variables.
Options are applied in the CLASS statement by placing them after a slash (/) that follows the variable list; several of these options are discussed below.
proc means data=sashelp.class; class sex; var age; run;
The BY statement produces separate statistics for each BY group. Suppose you want to analyze the age variable grouped by sex (male or female) and want the output for each level of SEX in a separate table. In this case, you can use the BY statement. The data must first be sorted by the BY variable.
proc sort data=sashelp.class out=class;
    by sex;
run;
title "Using By Statement";
proc means data=class;
    by sex;
    var age;
run;
The MISSING Option
By default, observations with missing values for any classification variable are excluded from the analysis.
The MISSING option instructs SAS to treat missing values of the CLASS variables as valid classification levels.
When MISSING is specified on the PROC MEANS statement, it applies to all classification variables.
By using multiple CLASS statements and placing the MISSING option on only some of them, you can choose which classification variables it applies to.
data class; set sashelp.class; if age < 14 then age=.; run; proc means data=class; class age /missing; run;
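The selective use of MISSING with multiple CLASS statements could be sketched as follows. The dataset name class_miss and the blanked-out SEX value are hypothetical, introduced only so that both classification variables contain missing values.

```sas
/* Hypothetical test data: missing AGE for younger students, */
/* and one missing SEX value.                                */
data class_miss;
    set sashelp.class;
    if age < 14 then age = .;
    if name = 'Alfred' then sex = ' ';
run;

proc means data=class_miss;
    class sex;             /* no MISSING: the row with blank SEX is dropped */
    class age / missing;   /* MISSING: missing AGE is kept as its own level */
    var height;
run;
```

Because an observation is excluded when any CLASS variable without MISSING has a missing value, the row with the blank SEX drops out of the analysis, while the rows with missing AGE remain.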
The ORDER= Option
This option specifies the order in which PROC MEANS arranges the levels of the classification variables in the output it generates. The arguments to the ORDER= option are:
- DATA – order is based on the order of incoming data
- FORMATTED – Values are formatted first and then ordered.
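- FREQ – The order is based on the frequency of each class level (highest frequency first by default).
- INTERNAL – The same as UNFORMATTED or GROUPINTERNAL; levels are ordered by their unformatted values. This is the default.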
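/* class2, which adds the formatted BMI variable, is created in the GROUPINTERNAL example further down */
proc means data=class2; class age bmi/order=freq; var height weight; run;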
The ASCENDING/DESCENDING Options
DESCENDING is not an argument of the ORDER= option, but a "standalone" option within the CLASS statement. ASCENDING and DESCENDING control whether the levels of the classification variables are arranged in ascending or descending order. By default, the levels are arranged in ascending order.
Note that with ORDER=FREQ the levels are arranged in descending order of frequency by default; adding ASCENDING, as in the example below, lists the least frequent age group first.
proc means data=class max; var weight; class age /order=freq ascending; run;
With this output, we can say that there are 5 students who are of age 12 and the maximum weight in the age 12 group is 128.
GROUPINTERNAL and EXCLUSIVE
With these options, you can control how the formats associated with CLASS variables are used when forming groups.
When a classification variable is associated with a format, that format is used in the formation of the groups.
In the following example, the format weightClass is used to classify students (Underweight, Normal, Overweight, Obese) based on their BMI. The ranges use the -< operator so that no BMI value falls into a gap between two ranges.
proc format;
    value weightClass
        low -< 18.5 = 'Underweight'
        18.5 -< 25  = 'Normal'
        25 -< 30    = 'Overweight'
        30 - high   = 'Obese';
run;
data class2;
    set sashelp.class;
    bmi=weight*703/(height**2);
    format bmi weightclass.;
run;
proc means data=class2 noprint;
    class bmi;
    var height weight;
    output out=class_summary mean=MeanHT MeanWT;
run;
The resulting output shows that the MEANS procedure has used the format to collapse the individual values of BMI into the levels of the formatted classification variable.
When the GROUPINTERNAL option is added (class bmi / groupinternal;), the format is not used to form the groups, so each distinct unformatted BMI value becomes its own level, and the output would look as below.
OUTPUT options in PROC SUMMARY
The OUTPUT statement with the OUT= option is used to store the summary statistics in a SAS dataset. There are other options that you can use on the OUTPUT statements.
- AUTONAME – Lets MEANS and SUMMARY determine names for the generated variables.
- AUTOLABEL – Lets MEANS and SUMMARY apply a label to each generated variable.
- LEVELS – Adds the _LEVEL_ variable to the summary dataset.
- WAYS – Adds the _WAY_ variable to the summary dataset.
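A short sketch of these OUTPUT options used together is shown below. The output dataset name class_stats and the choice of statistics are assumptions made for illustration.

```sas
proc summary data=sashelp.class;
    class sex;
    var height weight;
    /* MEAN= and STD= without explicit names rely on AUTONAME, which */
    /* builds names such as Height_Mean and Weight_StdDev.           */
    output out=class_stats mean= std= / autoname autolabel levels ways;
run;

proc print data=class_stats;
run;
```

Without AUTONAME, requesting two statistics with no explicit variable names would cause a naming conflict, since both would try to reuse the analysis variable names.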
Identifying extreme values
To get a correct analysis, it is often necessary to exclude the observation containing the extreme lowest or extreme highest values.
These extreme values are automatically displayed in PROC UNIVARIATE but must be explicitly specified in PROC MEANS and PROC SUMMARY procedures.
The MAX and MIN statistics show the highest and lowest values, but they do not identify the observations that contain these extreme values.
MAXID and MINID
The two options, MAXID and MINID, when used in the OUTPUT statement, identify the observations with the extreme values.
proc summary data=sashelp.class; class age; var height; output out=stats max=maxHeight maxid(height(name))=maxStudentName; run;
In the above example, the output contains the maximum height for each age group (the class variable), together with the name of the student who has that height. MINID can be used in the same way to identify the minimum.
Using the IDGROUP Option
The IDGROUP option captures a group of extreme values, unlike MAXID and MINID, which capture only a single extreme value.
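As a hedged sketch of IDGROUP, the step below keeps the two tallest students for each sex. The output dataset name tallest and the variable prefixes maxHeight and maxName are hypothetical names chosen for this example.

```sas
proc summary data=sashelp.class nway;
    class sex;
    var height;
    /* MAX(height) selects on the largest heights, OUT[2] keeps the   */
    /* top two, and (height name) lists the values to carry along.    */
    output out=tallest
        idgroup(max(height) out[2] (height name)=maxHeight maxName);
run;

proc print data=tallest;
run;
```

Because two values are requested, the generated variables should carry numeric suffixes (maxHeight_1, maxHeight_2, maxName_1, maxName_2), one set per extreme observation.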
Using percentiles to create subsets
The percentile statistics can be used to define bounds for identifying potential outliers. This helps us find out whether any observation falls outside a chosen percentile range, such as below the 1st or above the 99th percentile.
A percentile is the value below which a given percentage of the observations fall.
data outlier; set stats(keep=age_p1 age_p99); do until(EOF); set sampledata end=EOF; if age_p1 ge age or age ge age_p99 then output outlier; end; run; options nobyline; proc print data=outlier; by age_p1 age_p99; run;
The 1st and 99th percentiles are calculated and saved in the dataset STATS. The IF condition checks whether age is at or below the 1st percentile, or at or above the 99th percentile.
We can say that observations C, J and K lie outside the central 1% to 99% range of the data.
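The step that creates the STATS dataset is not shown above. One way it could be produced, assuming the same input dataset sampledata with a numeric variable age, is sketched here.

```sas
proc summary data=sampledata;
    var age;
    /* P1= and P99= request the 1st and 99th percentiles; AUTONAME */
    /* names the results age_P1 and age_P99.                       */
    output out=stats p1= p99= / autoname;
run;
```

The resulting single-row dataset is what the DATA step reads first, so the two percentile values are available while every row of sampledata is checked.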
The automatic _TYPE_ variable
The _TYPE_ variable is automatically included in the summary dataset. It is a numeric variable that helps to track the level of summarization and to distinguish the groups of statistics.
proc summary data=sashelp.class; class age; output out=c1; run;
With a single class variable (AGE), _TYPE_ takes the values 0 and 1. _TYPE_ is always 0 if the procedure does not have any CLASS variables.
The first observation has _TYPE_ = 0, which means there is no classification and the statistics are calculated across all observations.
The next observations, with _TYPE_ = 1, tell us that statistics such as the frequency have been calculated for each age.
As you add classification variables, the number of possible _TYPE_ values increases. For example, with CLASS SEX AGE, _TYPE_ = 1 corresponds to AGE alone.
For _TYPE_ = 2, the statistics are computed for each of the SEX levels (male and female). The classification by AGE is not considered here.
For _TYPE_ = 3, the statistics are computed for each combination of SEX and AGE values. It tells us that both classification variables are used.
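The two-variable case described above can be reproduced with a sketch like the following. The output dataset name c2 and the analysis variable HEIGHT are assumptions made for the example.

```sas
proc summary data=sashelp.class;
    class sex age;      /* AGE is rightmost, so AGE alone is _TYPE_ = 1 */
    var height;
    output out=c2 mean=meanHeight;
run;

/* _TYPE_ = 0: overall, 1: by AGE, 2: by SEX, 3: by SEX and AGE */
proc print data=c2;
    var _type_ _freq_ sex age meanHeight;
run;
```

The value of _TYPE_ follows the position of each variable in the CLASS statement, so reversing the order to CLASS AGE SEX would swap the meaning of types 1 and 2.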
Using the NWAY option
You can use the NWAY option if you want the statistics for a combination of variables rather than individual classification.
proc summary data=sashelp.class nway; class sex age; output out=c1; run;
The NWAY option keeps only the observations with the highest _TYPE_ value, that is, the rows for the full combination of all CLASS variables.