PROC UNIVARIATE provides a wider variety of statistics and graphs than PROC MEANS. It helps you discover key information about the distribution of each variable, such as:
- Whether the data are approximately normally distributed
- Whether the data contain outliers
- Whether the distribution of a variable differs by group
The commonly used options for PROC UNIVARIATE include:
- DATA= – Specifies the data set to use
- NORMAL – Produces tests of normality
- FREQ – Produces a frequency table
- PLOT – Produces a stem-and-leaf plot, a box plot, and a normal probability plot
The commonly used statements in PROC UNIVARIATE include:
- BY variable list;
- VAR variable list;
- OUTPUT OUT = dataset name;
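The BY statement causes PROC UNIVARIATE to calculate statistics separately for each group of observations (for example, one set of statistics per treatment group). The input data set must be sorted by the BY variables.
The OUTPUT OUT= statement saves the statistics you request (such as means or percentiles) to a new data set.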
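As a minimal sketch of how the BY and OUTPUT statements work together, the following example computes statistics for weight separately for each value of sex in sashelp.class. The data set names class_sorted and class_stats and the output variable names are illustrative choices, not part of the original text.

```sas
/* BY-group processing requires data sorted by the BY variable */
proc sort data=sashelp.class out=class_sorted;
    by sex;
run;

/* NOPRINT suppresses the printed tables; only the output data set is created */
proc univariate data=class_sorted noprint;
    by sex;
    var weight;
    output out=class_stats n=n_weight mean=mean_weight std=std_weight;
run;

/* One row per BY group */
proc print data=class_stats noobs;
run;
```

The resulting data set has one observation per group, which makes it convenient for merging the statistics back onto the original data or for reporting.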
Understanding the Proc Univariate output
PROC UNIVARIATE DATA=sashelp.class; var weight; RUN;
The first table generated is the Moments table. It provides a list of descriptive statistics for the variable weight.
- N is the sample size
- Sum Weights is the sum of the values of the weight variable. The WEIGHT statement identifies a separate variable that contains a weight for each observation. If you do not specify a WEIGHT statement, as in our case, SAS uses a default weight of 1 for each observation, so Sum Weights is the same as the number of observations.
- Mean is the arithmetic mean or the average.
- Sum of Observations is the total sum of all the data values.
- Std Deviation is the standard deviation.
- Variance is the measure of the spread of the distribution and is the square of standard deviation.
- Skewness measures the asymmetry of the distribution.
- Kurtosis measures how heavy the tails of the distribution are relative to a normal distribution.
- Uncorrected SS is the sum of the squared data values.
- Corrected SS is the sum of the squared deviations from the mean; it is used in the calculation of the variance, the standard deviation, and other statistics.
- Coef Variation is the coefficient of variation and is a unitless measure of variability.
- Std Error Mean is the standard error of the mean. It is calculated by dividing the standard deviation by the square root of N.
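The quantities in the Moments table are related to one another. The formulas below are a sketch that assumes no WEIGHT statement and the default variance divisor of n - 1 (VARDEF=DF); the symbols x_i, n, and s are notation introduced here for illustration.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{(Mean)} \\
\mathrm{USS} &= \sum_{i=1}^{n} x_i^{2}
  && \text{(Uncorrected SS)} \\
\mathrm{CSS} &= \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}
  && \text{(Corrected SS)} \\
s^{2} &= \frac{\mathrm{CSS}}{n-1}, \qquad s = \sqrt{s^{2}}
  && \text{(Variance, Std Deviation)} \\
\mathrm{CV} &= 100 \cdot \frac{s}{\bar{x}}
  && \text{(Coeff Variation, in percent)} \\
\mathrm{SE}_{\bar{x}} &= \frac{s}{\sqrt{n}}
  && \text{(Std Error Mean)}
\end{align*}
```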
Basic Statistical Measures
The second table from PROC UNIVARIATE provides several measures of the central tendency and spread of the data.
- Median is the middle value of the ordered data.
- Mode is the most frequently occurring value in the data.
- Range is the maximum value minus the minimum value.
- Interquartile range is the difference between the 75th and 25th percentiles of the data.
Tests for Location
The Tests for Location table is used to determine whether the center of the data is significantly different from 0 or another hypothesized value.
- Student’s t-test is a one-sample t-test of the null hypothesis that the mean of the data is equal to the hypothesized value.
- Sign test is a test of the null hypothesis that the median equals the hypothesized value, that is, that a value is equally likely to fall above or below it.
- Signed rank test is a non-parametric test often used instead of Student’s t-test when the data are not normally distributed and the sample size is small.
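By default the hypothesized value is 0. As a hedged sketch, the MU0= option changes it; the value 100 used here is an arbitrary illustrative choice, and the ODS SELECT statement restricts the output to the Tests for Location table.

```sas
/* Show only the Tests for Location table */
ods select TestsForLocation;

/* Test whether the center of weight differs from a hypothesized value of 100 */
proc univariate data=sashelp.class mu0=100;
    var weight;
run;
```

All three tests in the table are then carried out against the same hypothesized value, which makes it easy to compare the parametric and non-parametric results.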
The Quantiles table lists the commonly used quantiles of the data.
The quantiles provide information about the tails of the distribution and include the five-number summary for each variable: the minimum, lower quartile, median, upper quartile, and maximum.
You can also calculate custom percentiles with the PCTLPTS= option. The example below requests the 10th, 20th, 30th, 40th, and 50th percentiles, along with the third quartile (Q3, the 75th percentile).
proc univariate data=sashelp.iris; var sepallength; output out=pctds pctlpts=10 to 50 by 10 q3=quartile3 pctlpre=pct_ pctlname=P10 P20; run;
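PCTLPRE= specifies the prefix for the names of the variables that contain the PCTLPTS= percentiles, and PCTLNAME= specifies the suffixes appended to that prefix. Here the first two variables are named pct_P10 and pct_P20; the remaining percentiles take default suffixes based on the percentile value. The OUTPUT statement saves the values in a SAS data set.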
The Extreme Observations table lists the smallest and largest values in the data set (five of each by default). This is useful for locating outliers in the data.
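As a small sketch of how this table can be tailored, the NEXTROBS= option changes how many extreme values are listed, and the ID statement labels each one with an identifying variable. The choice of 3 values and of name as the ID variable is illustrative.

```sas
/* Show only the Extreme Observations table */
ods select ExtremeObs;

/* List the 3 lowest and 3 highest weights, labeled by student name */
proc univariate data=sashelp.class nextrobs=3;
    var weight;
    id name;
run;
```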
Test for Normality
Many statistical techniques assume that the data are normally distributed. It is important to check this assumption before running a model.
There are multiple ways to check normality:
- Plot Histogram
- Calculate Skewness
- Normality Test
Creating Histogram using Proc Univariate
A histogram is a commonly used plot for visually examining the distribution of a set of data. You can create a histogram in PROC UNIVARIATE with the following statement.
The normal option creates a superimposed normal curve.
proc univariate data=sashelp.shoes NOPRINT; var sales; HISTOGRAM / NORMAL (COLOR=RED); run;
Skewness is a measure of the degree of asymmetry of a distribution. If skewness is close to 0, the distribution is roughly symmetric, which is consistent with (but does not by itself prove) normality.
If skewness > 0, the data are positively skewed, which means there are a few extreme values or outliers with large values. In positively skewed data, the mean is typically greater than the median, and the median is typically greater than the mode.
If skewness < 0, the data are negatively skewed, which means there are a few outliers with small values.
In the above example, skewness is close to 0, which suggests that the distribution is roughly symmetric and consistent with normality.
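To compare the mean, median, and skewness side by side, you can save them to a data set. This is a minimal sketch; the data set name shape_stats and the output variable names are illustrative.

```sas
/* Save shape-related statistics for sales to a data set */
proc univariate data=sashelp.shoes noprint;
    var sales;
    output out=shape_stats
           mean=mean_sales
           median=median_sales
           skewness=skew_sales
           kurtosis=kurt_sales;
run;

/* A mean well above the median together with positive skewness
   points to a long right tail */
proc print data=shape_stats noobs;
run;
```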
A test for normality is another way to assess whether the data are normally distributed. Four test statistics are displayed in the table:
- Shapiro-Wilk
- Kolmogorov-Smirnov
- Cramer-von Mises
- Anderson-Darling
The NORMAL option is included in the PROC UNIVARIATE statement to test for the normality of the data.
Shapiro-Wilk and Kolmogorov-Smirnov are the two most commonly used tests. The p-values in the table test the null hypothesis that the variable is normally distributed. If the p-value is greater than 0.05, you fail to reject the null hypothesis, and it is reasonable to treat the data as approximately normal; a p-value below 0.05 is evidence against normality.
proc univariate data = sashelp.iris normal; var sepallength; run;
Shapiro-Wilk Test
The Shapiro-Wilk test gives you a W statistic between 0 and 1. Small values of W lead to rejection of the null hypothesis of normality. PROC UNIVARIATE computes this test only when the sample size is 2000 or less.
The Kolmogorov test is also known as the Kolmogorov-Smirnov (KS) test, and it can handle large sample sizes.
The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples.
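In symbols, the one-sample statistic described above can be written as follows. The notation is introduced here for illustration: F_n is the empirical distribution function of the n observations and F is the reference distribution function, which for a normality test is a normal distribution whose mean and standard deviation are estimated from the sample.

```latex
\[
D_n = \sup_{x} \left| F_n(x) - F(x) \right|,
\qquad
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}
\]
```

A large value of D_n means the empirical distribution is far from the reference distribution at some point, which is evidence against normality.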
Winsorized and Trimmed Means
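The ordinary mean is extremely sensitive to outliers; even a single extreme value can shift it noticeably. Winsorized and trimmed means are robust alternatives that are useful when the data are highly skewed or contain outliers: a percentage of the most extreme values is removed (trimmed) or replaced (Winsorized) before the mean is calculated.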
The percentage tells you what percentage of data to remove. For example, with a 5% trimmed mean, the lowest 5% and highest 5% of the data are excluded. The mean is calculated from the remaining 90% of data points.
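Trimmed Mean: A trimmed mean is calculated by removing a fixed proportion of the smallest and largest values and then averaging the remaining values. For a 10% trimmed mean, the values below roughly the 10th percentile and above roughly the 90th percentile are dropped.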
proc univariate trimmed= 0.1 data=sashelp.shoes; var sales; histogram / normal; run;
In the example above, we calculate a 10% trimmed mean: 10% of the values are trimmed from each tail, which here means 40 values are removed from the lower tail and 40 from the upper tail.
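The trimmed mean can be written compactly as follows. The notation is introduced here for illustration: x_(1) ≤ x_(2) ≤ ... ≤ x_(n) are the ordered observations and k is the number of values trimmed from each tail.

```latex
\[
\bar{x}_{tk} = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} x_{(i)}
\]
```

With k = 0 this reduces to the ordinary mean, and as k grows toward n/2 it approaches the median.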
Winsorized Mean: A Winsorized mean replaces the extreme values (smallest and largest) with the observations closest to them, and then the mean is calculated. It is the same as the trimmed mean except that, instead of removing the extreme values, a percentage of values at both ends of the data is capped.
proc univariate winsorized= 0.2 data=sashelp.shoes; var sales; histogram / normal; run;
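To compare the two robust estimates directly, both options can be requested in the same step. This is a minimal sketch; using 0.1 for both options is an illustrative choice, and the ODS SELECT statement limits the output to the two relevant tables.

```sas
/* Show only the Trimmed Means and Winsorized Means tables */
ods select TrimmedMeans WinsorizedMeans;

/* 10% trimmed mean and 10% Winsorized mean of sales */
proc univariate data=sashelp.shoes trimmed=0.1 winsorized=0.1;
    var sales;
run;
```

Each table also reports the number of observations affected in each tail, so you can confirm how many values were trimmed or replaced.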
PROC UNIVARIATE with the PLOT option generates the following plots:
- Stem-and-Leaf plot (for some SAS versions)
- Box-and-whiskers Plot
- Normal Probability Plot
proc univariate data = sashelp.shoes plot; var sales; run;
The horizontal histogram (top-left) is a visual representation of the distribution of the sales values. In normally distributed data, the peak is in the middle with equal tails on either side.
The box-and-whiskers plot (top-right) is a graphical representation of the quartiles of the data. The box represents the middle 50% of the data, and the whiskers cover the remaining data on each side (roughly the lowest and highest 25%), apart from any points plotted individually as extreme values. The centre line represents the median, which is the 50th percentile. The diamond symbol ◇ indicates the mean. The circles ◦ at the top of the box plot indicate extreme values.
The normal probability plot (bottom) plots the ordered data values against the corresponding quantiles of a normal distribution. If the data are normally distributed, the points lie in a tight scatter around the reference (diagonal) line.
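If you want the normal probability comparison as a standalone graph rather than as part of the PLOT panel, one option is the QQPLOT statement. The sketch below assumes ODS Graphics is available in your SAS session; MU=EST and SIGMA=EST ask SAS to draw the reference line using the sample mean and standard deviation.

```sas
/* Normal Q-Q plot of sales with a reference line based on
   the estimated mean and standard deviation */
proc univariate data=sashelp.shoes noprint;
    var sales;
    qqplot sales / normal(mu=est sigma=est);
run;
```

Points that curve away from the reference line at one end indicate a heavier tail on that side than a normal distribution would have.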