# Using PROC MEANS for detailed analysis of data

PROC MEANS, PROC SUMMARY, and PROC FREQ in SAS are used to evaluate quantitative data and to create a summary report for analysis.

With PROC MEANS, you can compute statistics such as the mean, standard deviation, minimum, and maximum, along with many other statistical measures.

## Applications of PROC MEANS

• Describing quantitative data for analysis.
• Describing the means of numeric variables by group
• Identifying outliers and extreme values.

Syntax:

`PROC MEANS DATA=<dataset-name> <options> <statistics keywords>; <statements>`

The most commonly used options in PROC MEANS are:

• MAXDEC – Determines the number of decimal places to print in the output.
• NOPRINT – Suppresses the output of descriptive statistics.
• ALPHA – Sets the level for confidence limits (default is 0.05)
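As a minimal sketch of these options, the step below prints statistics rounded to one decimal place and requests 90% confidence limits for the mean. The `CLM` keyword and the choice of `height` as the analysis variable are assumptions made for illustration; they are not taken from the examples in this article.

```sas
/* MAXDEC=1 prints one decimal place; ALPHA=0.1 sets 90% confidence limits. */
/* CLM is the statistic keyword that requests the confidence limits.        */
proc means data=sashelp.class maxdec=1 alpha=0.1 mean clm;
    var height;
run;
```

NOPRINT is normally paired with an OUTPUT statement, because a step that neither prints nor writes a dataset produces nothing visible.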

Statistical keywords are used to calculate statistical measures like mean, median and standard deviation. You can find the list of Statistical keywords on the SAS documentation website.

### A simple example of PROC MEANS

Suppose you want to analyze the mean, maximum, minimum, and standard deviation of student age in the SASHELP.CLASS dataset.

In the DATA= option, you specify the dataset you want to use. In the VAR statement, you list the numeric variables you want to analyze. You cannot refer to character variables in the VAR statement.

If you omit the `VAR` statement, PROC MEANS generates the statistics for all the numeric variables in the dataset.

```sas
proc means data=sashelp.class;
    var age;
run;
```

By default, PROC MEANS generates N, Mean, Standard Deviation, Minimum and Maximum statistics.

The most frequently used statistic keywords in PROC MEANS are listed below with their descriptions.

| Statistic keyword | Description |
| --- | --- |
| N | Number of non-missing observations |
| NMISS | Number of missing observations |
| MEAN | Arithmetic average |
| STD | Standard deviation |
| MIN | Minimum |
| MAX | Maximum |
| SUM | Sum of observations |
| MEDIAN | 50th percentile |
| P1 | 1st percentile |
| P5 | 5th percentile |
| P10 | 10th percentile |
| P90 | 90th percentile |
| P95 | 95th percentile |
| P99 | 99th percentile |
| Q1 | First quartile |
| Q3 | Third quartile |
| VAR | Variance |
| RANGE | Range |
| USS | Uncorrected sum of squares |
| CSS | Corrected sum of squares |
| STDERR | Standard error of the mean |
| T | Student's t value for testing H0: population mean = 0 |
| PRT | P-value associated with the t-test above |
| SUMWGT | Sum of the WEIGHT variable values |
| QRANGE | Interquartile range (Q3 - Q1) |

### Limit Descriptive Statistics

Suppose you want to see only two statistics, the minimum and the maximum. List the corresponding keywords on the PROC MEANS statement.

```sas
proc means data=sashelp.class min max;
    var age;
run;
```

## Using the CLASS statement

The CLASS statement is used in both the MEANS and SUMMARY procedures. It can be used as a single statement or in a series of CLASS statements.

The order of the variables in the CLASS statement determines the order in which the classification variables are applied.

Options are applied in the CLASS statement by placing them after a slash (/). Several of these options are discussed below.

```sas
proc means data=sashelp.class;
    class sex;
    var age;
run;
```

## BY Statement

The BY statement produces separate statistics for each BY group. Suppose you want to analyze the age variable separately for male and female students and want the output for each level of SEX in its own table. In this case, you can use the BY statement, provided the data is first sorted by the BY variable.

```sas
/* The data must be sorted by the same variable that the BY statement uses. */
proc sort data=sashelp.class out=class;
    by sex;
run;

title "Using By Statement";
proc means data=class;
    by sex;
    var age;
run;
```

### Difference between BY and CLASS Statements

The input dataset must be sorted (or indexed) by the BY variables, whereas no sorting is required for CLASS variables.

The BY statement provides summaries for the groups created by combining all BY variables. In contrast, with the CLASS statement the output dataset contains summarized values for each class variable separately as well as for each possible combination of class variables, unless you use the NWAY option.

You can also use the CLASS and BY statements to analyze the data by the levels of class variables within BY groups.
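A minimal sketch of combining the two statements is shown here. The dataset name `class_by_sex` and the choice of statistics are assumptions made for illustration.

```sas
/* BY requires the input to be sorted (or indexed) by the BY variable. */
proc sort data=sashelp.class out=class_by_sex;
    by sex;
run;

/* One table per SEX value; within each table, one row per AGE level. */
proc means data=class_by_sex mean min max;
    by sex;
    class age;
    var height;
run;
```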

### The MISSING Option

By default, observations with a missing value for any classification variable are excluded from the analysis.

The MISSING option instructs SAS to treat missing values as valid levels of the variables in the CLASS statement.

When the MISSING option is specified on the PROC MEANS statement, it applies to all classification variables.

By using multiple CLASS statements and placing the MISSING option on only some of them, you can choose which classification variables treat missing values as a level.
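The sketch below illustrates the per-variable use of MISSING. The dataset `class_miss` and the rules used to blank out values are hypothetical and exist only to create some missing class levels to work with.

```sas
/* Hypothetical test data: blank out SEX for two students and AGE for one. */
data class_miss;
    set sashelp.class;
    if name in ('Alfred', 'Alice') then sex = ' ';
    if name = 'Barbara' then age = .;
run;

/* MISSING applies only to SEX here, so a blank SEX becomes its own level. */
/* Observations with a missing AGE are still excluded from the analysis.   */
proc means data=class_miss n mean;
    class sex / missing;
    class age;
    var height;
run;
```

Moving `missing` to the PROC MEANS statement instead would keep the observation with the missing AGE as well.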

```sas
data class;
set sashelp.class;

if age
```

### The ORDER= Option

This option specifies the order in which PROC MEANS groups the levels of the classification variables in the output it generates. The arguments to the ORDER= option are:

• DATA - levels are ordered as they appear in the incoming data.
• FORMATTED - values are formatted first and then ordered by their formatted values.
• FREQ - levels are ordered by the frequency of each class level, most frequent first by default.
• INTERNAL - the same as UNFORMATTED: levels are ordered by their unformatted (internal) values.

```sas
/* CLASS2, with its formatted BMI variable, is created in the GROUPINTERNAL example below. */
proc means data=class2;
    class age bmi / order=freq;
    var height weight;
run;
```

### The ASCENDING/DESCENDING Options

DESCENDING is not an argument to the ORDER= option but a standalone option on the CLASS statement. ASCENDING and DESCENDING let you arrange the levels of the class variables in ascending or descending order. By default, the levels are arranged in ascending order.

Note that ORDER=FREQ on its own lists the levels in descending order of frequency. Adding ASCENDING, as in the example below, reverses this so that the least frequent age group comes first.

```sas
proc means data=class max;
    var weight;
    class age / order=freq ascending;
run;
```

From this output, we can say that there are 5 students of age 12 and that the maximum weight in the age 12 group is 128.

### GROUPINTERNAL and EXCLUSIVE

These options control how the formats associated with CLASS variables are used when forming groups.

When a classification variable is associated with a format, that format is used by default in the formation of the groups. The GROUPINTERNAL option tells PROC MEANS to form the groups from the unformatted (internal) values instead.

In the following example, the format WEIGHTCLASS classifies students (Underweight, Normal, Overweight, Obese) based on their BMI.

```sas
proc format;
    /* "-<" excludes the upper bound, so the ranges leave no gaps between them. */
    value weightClass low -< 18.5 = 'Underweight'
                      18.5 -< 25  = 'Normal'
                      25 -< 30    = 'Overweight'
                      30 - high   = 'Obese';
run;

data class2;
    set sashelp.class;
    /* BMI from weight in pounds and height in inches. */
    bmi = weight * 703 / (height ** 2);
    format bmi weightclass.;
run;

proc means data=class2 noprint;
    class bmi / groupinternal;
    var height weight;
    output out=class_summary mean=MeanHT MeanWT;
run;
```

Because GROUPINTERNAL is specified, the groups in CLASS_SUMMARY are formed from the individual unformatted BMI values rather than from the formatted levels.

Without the `GROUPINTERNAL` option, the MEANS procedure uses the format to collapse the individual values of BMI into the levels of the formatted classification variable.

## OUTPUT options in PROC SUMMARY

The OUTPUT statement with the OUT= option stores the summary statistics in a SAS dataset. Other options can be specified on the OUTPUT statement after a slash:

• AUTONAME - Lets MEANS and SUMMARY determine names for the generated variables.
• AUTOLABEL - Lets MEANS and SUMMARY apply a label to each generated variable.
• LEVELS - Adds the _LEVEL_ column to the summary dataset.
• WAYS - Adds the _WAY_ column to the summary dataset.
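The OUTPUT statement options listed above can be combined in one step. The following is a minimal sketch; the output dataset name `class_stats` and the choice of statistics are assumptions made for illustration.

```sas
/* AUTONAME builds names of the form variable_statistic, such as height_Mean. */
proc summary data=sashelp.class;
    class sex age;
    var height weight;
    output out=class_stats mean= max= / autoname autolabel levels ways;
run;

/* _WAY_ and _LEVEL_ appear alongside the automatic _TYPE_ and _FREQ_ columns. */
proc print data=class_stats;
run;
```

Leaving the statistic names empty (`mean=` and `max=`) is what allows AUTONAME to choose the variable names.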

## Identifying extreme values

To get a correct analysis, it is often necessary to exclude the observation containing the extreme lowest or extreme highest values.

These extreme values are automatically displayed in PROC UNIVARIATE but must be explicitly specified in PROC MEANS and PROC SUMMARY procedures.

The MAX and MIN statistics show the extreme lowest and highest values, but they do not identify the observations that contain these extreme values.

## MAXID and MINID

The two options, `MAXID` and `MINID`, identify the observations with extreme values when used in the OUTPUT statement.

```sas
proc summary data=sashelp.class;
    class age;
    var height;
    output out=stats max=maxHeight maxid(height(name))=maxStudentName;
run;
```

In the above example, the output dataset contains the maximum height for each age group (the class variable), together with the name of the student who has that height.

## Using the IDGROUP Option

The `IDGROUP` option displays a group of extreme values, unlike MAXID and MINID, which capture only a single extreme value.
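A minimal sketch of IDGROUP is shown here, keeping the two tallest students in each age group. The output dataset name `tallest` and the prefixes `maxHeight` and `maxStudentName` are assumptions made for illustration.

```sas
/* For each age group, keep the two largest HEIGHT values and the matching names. */
proc summary data=sashelp.class nway;
    class age;
    var height;
    output out=tallest
        idgroup(max(height) out[2] (height name)=maxHeight maxStudentName);
run;

proc print data=tallest;
run;
```

With `out[2]`, SAS should append a numeric suffix to each prefix (for example `maxHeight_1` and `maxHeight_2`). Age groups with a single student leave the second set of variables missing.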

## The PERCENTILE to create subsets

The percentile statistics are used to create search bounds for potential outlier boundaries. This can help us to find out if any observation falls outside of the defined percentile like 1% or 5%.

A percentile is the value below which a given percentage of the observations fall.
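The DATA step that follows reads `age_p1` and `age_p99` from a dataset named STATS, whose creation is not shown. One way it might be built is sketched here, assuming `sampledata` is any dataset with a numeric `age` variable; the use of AUTONAME to produce the variable names is also an assumption.

```sas
/* SAMPLEDATA stands in for any dataset with a numeric AGE variable. */
proc summary data=sampledata;
    var age;
    /* AUTONAME yields age_P1 and age_P99 (SAS variable names are case-insensitive). */
    output out=stats p1= p99= / autoname;
run;
```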

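```sas
data outlier;
    /* Read the single row of percentile bounds once. */
    set stats(keep=age_p1 age_p99);

    /* Loop over every observation of SAMPLEDATA within the same iteration. */
    do until(EOF);
        set sampledata end=EOF;

        /* Keep observations at or beyond either bound. */
        if age_p1 ge age or age ge age_p99 then
            output outlier;
    end;
run;

/* Suppress the default BY line above each group in the listing. */
options nobyline;
proc print data=outlier;
    by age_p1 age_p99;
run;
```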

The 1st and 99th percentiles are calculated and saved in the dataset STATS. The IF condition checks whether age is at or below the 1st percentile or at or above the 99th percentile.

We can say that observations C, J and K lie outside the 1% to 99% range of the data.

## The automatic _TYPE_ variable

The _TYPE_ variable is automatically included in the summary dataset. It is a numeric variable that helps to track the level of summarization and to distinguish the groups of statistics.

```sas
proc summary data=sashelp.class;
    class age;
    output out=c1;
run;
```

With one class variable, AGE, _TYPE_ takes the values 0 and 1. If the procedure has no CLASS variables, _TYPE_ is always 0.

The first observation has _TYPE_ = 0, which means that no classification is applied and the statistics are calculated across all observations.

The following observations, with _TYPE_ = 1, tell us that statistics such as the frequency have been calculated for each age.

As you add classification variables, the number of _TYPE_ values increases. Suppose the statement is changed to CLASS SEX AGE. Then _TYPE_ = 1 still identifies the statistics for each AGE level.

For _TYPE_ = 2, the statistics are computed for each of the SEX levels (male and female). The classification by the AGE variable is not considered here.

For _TYPE_ = 3, the statistics are computed for each combination of SEX and AGE values. It tells us that both classification variables are used.
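The two-variable case can be inspected with a short sketch like the one below. The output dataset name `c2` and the analysis variable `height` are assumptions made for illustration.

```sas
/* Two class variables produce _TYPE_ values 0 through 3. */
proc summary data=sashelp.class;
    class sex age;
    var height;
    output out=c2 mean=meanHeight;
run;

/* _TYPE_ 0 = overall, 1 = by AGE, 2 = by SEX, 3 = by SEX and AGE. */
proc print data=c2;
    var _type_ _freq_ sex age meanHeight;
run;
```

The rightmost class variable corresponds to the lowest bit of _TYPE_, which is why AGE alone is 1 and SEX alone is 2.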

## Using the NWAY option

You can use the NWAY option if you want statistics only for the combination of all class variables rather than for each individual classification as well.

```sas
proc summary data=sashelp.class nway;
    class sex age;
    output out=c1;
run;
```

The NWAY option keeps only the observations with the highest _TYPE_ value.

### Difference between Proc Means and Proc Summary

The difference between PROC MEANS and PROC SUMMARY lies in their default output.

By default, MEANS always creates a table to be printed. If you do not want a printed table, you must explicitly turn it off (NOPRINT option).

On the other hand, the SUMMARY procedure never creates a printed table unless specifically requested (PRINT option).
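The two defaults can be compared with a minimal sketch. The output dataset name `age_stats` and the variable name `meanAge` are assumptions made for illustration.

```sas
/* PROC SUMMARY prints a table only when PRINT is requested. */
proc summary data=sashelp.class print;
    var age;
run;

/* PROC MEANS prints by default; NOPRINT suppresses the table, */
/* so the results are available only in the output dataset.    */
proc means data=sashelp.class noprint;
    var age;
    output out=age_stats mean=meanAge;
run;
```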


Subhro Kar is an Analyst with over five years of experience. As a programmer specializing in SAS (Statistical Analysis System), Subhro also offers tutorials and guides on how to approach the coding language. His website, 9to5sas, offers students and new programmers useful easy-to-grasp resources to help them understand the fundamentals of SAS. Through this website, he shares his passion for programming while giving back to up-and-coming programmers in the field. Subhro’s mission is to offer quality tips, tricks, and lessons that give SAS beginners the skills they need to succeed.
