Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics, you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone.
The mean in descriptive statistics is the average of the data. It is calculated by adding all the data and dividing by the number of data points.
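As a worked illustration, the definition can be written as a formula and applied to the eight values used in the median example further down. Reusing that data set here is only for illustration.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{245 + 326 + 180 + 226 + 305 + 195 + 220 + 295}{8} = \frac{1992}{8} = 249
$$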
Limitations
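The mean is strongly affected by outliers: a single extreme value can pull it far from where most of the data lie. For skewed data or data with extreme values, it can therefore be a misleading measure of the center, and the median is usually a better choice.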
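A small sketch of this effect, using the eight values from the median example further down (245, 326, 180, 226, 305, 195, 220, 295) and a hypothetical data-entry error that turns 326 into 3260:

$$
\bar{x}_{\text{original}} = \frac{1992}{8} = 249
\qquad
\bar{x}_{\text{with outlier}} = \frac{1992 - 326 + 3260}{8} = \frac{4926}{8} = 615.75
$$

The median of both versions is 235.5, because the two middle values of the sorted data (226 and 245) do not change.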
The median in descriptive statistics is the value that divides the data into two equal parts. To find the median, arrange the $n$ observations in ascending order. When $n$ is odd, the median is the value at position $(n+1)/2$.
When $n$ is even, the median is the average of the $(n/2)$th and $(n/2 + 1)$th observations in the sorted data.
Example:
| 245 | 326 | 180 | 226 | 305 | 195 | 220 | 295 |
|---|---|---|---|---|---|---|---|
Step 1: Arrange the data in ascending order
| 180 | 195 | 220 | 226 | 245 | 295 | 305 | 326 |
|---|---|---|---|---|---|---|---|
Step 2: Find the position of the median
Since $n = 8$ (an even number), the median is the average of the 4th and 5th values: $(226 + 245)/2 = 235.5$.
The mode in descriptive statistics is the value that appears most frequently in a data set.
Example:
| 245 | 326 | 180 | 226 | 305 | 195 | 220 | 295 | 245 | 180 |
|---|---|---|---|---|---|---|---|---|---|
In the above data, the values 245 and 180 appear twice, while all other values appear only once. Therefore, this data set has two modes: 245 and 180.
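In SAS, the tool used later in this article, a data set with several modes needs some care: the mode reported in summary output is the lowest one when there are ties. A minimal sketch, assuming a hypothetical dataset name `mode_demo`, that uses the `MODES` option of PROC UNIVARIATE to list every mode:

```sas
/* Hypothetical dataset holding the ten values from the example above */
DATA mode_demo;
    INPUT value @@;
    DATALINES;
245 326 180 226 305 195 220 295 245 180
;
RUN;

/* MODES requests a table of all modes, not only the lowest one */
PROC UNIVARIATE DATA=mode_demo MODES;
    VAR value;
RUN;
```

The table of modes should list both 180 and 245, each with a count of 2.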
The range in descriptive statistics is the difference between the largest and smallest values in a data set.
Example:
For the data set 180, 195, 220, 226, 245, 295, 305, 326, the range is $326 - 180 = 146$.
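The hand calculations for this eight-value data set can be cross-checked in SAS. The sketch below assumes a hypothetical dataset name `example_data`; the statistic keywords are standard PROC MEANS keywords.

```sas
/* Hypothetical dataset holding the eight example values */
DATA example_data;
    INPUT value @@;
    DATALINES;
245 326 180 226 305 195 220 295
;
RUN;

/* Request the count, mean, median, and range for the single variable */
PROC MEANS DATA=example_data N MEAN MEDIAN RANGE;
    VAR value;
RUN;
```

For these values the output should show N = 8, a mean of 249, a median of 235.5, and a range of 146.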
Standard deviation in descriptive statistics measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
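The formula for the sample standard deviation is:

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

Where $x_i$ is the $i$th observation, $\bar{x}$ is the mean, and $n$ is the number of observations. For a full population, the divisor is $n$ instead of $n - 1$. SAS procedures such as PROC MEANS use the $n - 1$ divisor by default.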
Variance in descriptive statistics is the square of the standard deviation. It measures how far a set of numbers is spread out from their average value.
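As a worked illustration, here are the variance and standard deviation of the eight values from the median and range examples (180, 195, 220, 226, 245, 295, 305, 326), whose mean is 249. This sketch assumes the sample form with divisor $n - 1$, which is the SAS default. With divisor $n$ the results would be slightly smaller.

$$
\begin{aligned}
\sum_{i=1}^{8}(x_i-\bar{x})^2 &= (-69)^2+(-54)^2+(-29)^2+(-23)^2+(-4)^2+46^2+56^2+77^2 \\
&= 4761+2916+841+529+16+2116+3136+5929 = 20244 \\
s^2 &= \frac{20244}{8-1} = 2892 \\
s &= \sqrt{2892} \approx 53.78
\end{aligned}
$$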
Skewness in descriptive statistics is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.
Kurtosis in descriptive statistics is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Kurtosis describes the shape of a probability distribution.
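Both shape measures can be requested in SAS with the `SKEWNESS` and `KURTOSIS` keywords of PROC MEANS. The sketch below uses the `sashelp.class` dataset that appears in the later examples. The choice of `height` and `weight` is only for illustration.

```sas
/* Request the count, mean, and the two shape statistics */
PROC MEANS DATA=sashelp.class N MEAN SKEWNESS KURTOSIS;
    VAR height weight;
RUN;
```

SAS reports excess kurtosis, so a normal distribution corresponds to a value near 0 rather than 3.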
SAS provides several procedures for calculating descriptive statistics, including PROC MEANS, PROC SUMMARY, and PROC UNIVARIATE.
PROC MEANS is used to calculate summary statistics for numeric variables. By default, it reports the number of nonmissing observations (N), the mean, the standard deviation, the minimum, and the maximum.
Basic Syntax:
```sas
PROC MEANS DATA=dataset-name;
    VAR variable-name;
RUN;
```
Example:
```sas
PROC MEANS DATA=sashelp.class;
    VAR age height weight;
RUN;
```
Output:
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Age | 19 | 13.32 | 1.49 | 11.00 | 16.00 |
| Height | 19 | 62.34 | 5.13 | 51.30 | 72.00 |
| Weight | 19 | 100.03 | 22.77 | 50.50 | 150.00 |
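The values in this table are rounded to two decimal places, while the default listing prints more digits. One way to get the rounded display directly is the `MAXDEC=` option. This is a sketch, and it assumes the rounding is wanted in the printed output only.

```sas
/* MAXDEC=2 limits the printed statistics to two decimal places */
PROC MEANS DATA=sashelp.class MAXDEC=2;
    VAR age height weight;
RUN;
```

`MAXDEC=` affects only how the statistics are displayed, not how they are computed.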
PROC SUMMARY computes the same statistics as PROC MEANS, but by default it does not print any results. It is typically used to write summary statistics to an output dataset.
Basic Syntax:
```sas
PROC SUMMARY DATA=dataset-name;
    VAR variable-name;
    OUTPUT OUT=output-dataset;
RUN;
```
Example:
```sas
PROC SUMMARY DATA=sashelp.class;
    VAR age height weight;
    OUTPUT OUT=summary_stats;
RUN;
```
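Because PROC SUMMARY writes its results to a dataset rather than to the listing, a separate step is needed to look at them. A minimal sketch that repeats the example and then prints the `summary_stats` dataset with PROC PRINT:

```sas
/* Write the default statistics to a dataset; nothing is printed here */
PROC SUMMARY DATA=sashelp.class;
    VAR age height weight;
    OUTPUT OUT=summary_stats;
RUN;

/* Display the contents of the output dataset */
PROC PRINT DATA=summary_stats;
RUN;
```

When the OUTPUT statement names no statistics, the dataset holds one row per default statistic (N, MIN, MAX, MEAN, STD), identified by the `_STAT_` column, alongside the automatic `_TYPE_` and `_FREQ_` columns.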
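PROC UNIVARIATE provides detailed descriptive statistics for numeric variables, including moments (such as the mean, variance, skewness, and kurtosis), basic measures of location and variability, quantiles, and extreme observations.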
Basic Syntax:
```sas
PROC UNIVARIATE DATA=dataset-name;
    VAR variable-name;
RUN;
```
Example:
```sas
PROC UNIVARIATE DATA=sashelp.class;
    VAR age height weight;
RUN;
```
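PROC UNIVARIATE prints several tables for every variable, which can get long. If only some of them are needed, one option is an `ODS SELECT` statement before the procedure. The sketch below keeps the `Moments` and `Quantiles` tables. Restricting the example to `height` is only for illustration.

```sas
/* Keep only the Moments and Quantiles tables from the next procedure */
ODS SELECT Moments Quantiles;

PROC UNIVARIATE DATA=sashelp.class;
    VAR height;
RUN;
```

The selection applies to the procedure step that follows it, so later steps print their full output again.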
The CLASS statement is used to group the analysis by categorical variables.
```sas
PROC MEANS DATA=sashelp.class;
    CLASS sex;
    VAR age height weight;
RUN;
```
The BY statement is similar to CLASS but requires the data to be sorted first.
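```sas
/* Sort by the grouping variable first */
PROC SORT DATA=sashelp.class OUT=class_sorted;
    BY sex;
RUN;

/* Then summarize each BY group separately */
PROC MEANS DATA=class_sorted;
    BY sex;
    VAR age height weight;
RUN;
```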
You can request specific statistics using the appropriate keywords:
```sas
PROC MEANS DATA=sashelp.class N MEAN MEDIAN MODE RANGE STD;
    VAR age height weight;
RUN;
```
You can create output datasets with summary statistics:
```sas
/* NOPRINT suppresses the listing; results go only to the dataset */
PROC MEANS DATA=sashelp.class NOPRINT;
    VAR age height weight;
    OUTPUT OUT=stats
        N=n
        MEAN=mean_age mean_height mean_weight
        STD=std_age std_height std_weight;
RUN;
```
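Typing a name for every statistic and variable becomes tedious as the list grows. One alternative is the `AUTONAME` option of the OUTPUT statement, which builds the names automatically. The sketch below assumes a hypothetical output dataset name `stats_auto`.

```sas
/* Leave the name lists empty and let AUTONAME generate them */
PROC MEANS DATA=sashelp.class NOPRINT;
    VAR age height weight;
    OUTPUT OUT=stats_auto MEAN= STD= / AUTONAME;
RUN;

/* Inspect the generated variable names */
PROC PRINT DATA=stats_auto;
RUN;
```

The generated names combine the analysis variable and the statistic, in the form variable_statistic.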
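```sas
/* PLOTS requests the default plots, such as a box plot and a normal probability plot */
PROC UNIVARIATE DATA=sashelp.class PLOTS;
    VAR age height weight;
RUN;
```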
```sas
/* NORMAL requests tests for normality */
PROC UNIVARIATE DATA=sashelp.class NORMAL;
    VAR age height weight;
RUN;
```
```sas
/* HISTOGRAM is a statement, not an option on the PROC line */
PROC UNIVARIATE DATA=sashelp.class;
    VAR age height weight;
    HISTOGRAM age height weight;
RUN;
```
```sas
/* Create sample data */
DATA test_data;
    INPUT ID Score @@;
    DATALINES;
1 85 2 92 3 78 4 95 5 88
6 76 7 91 8 84 9 89 10 93
;
RUN;

/* Calculate descriptive statistics */
PROC MEANS DATA=test_data N MEAN MEDIAN MODE STD RANGE;
    VAR Score;
RUN;
```
```sas
/* Create sample data with groups */
DATA grouped_data;
    INPUT Group $ Value @@;
    DATALINES;
A 25 A 30 A 35 A 40 A 45
B 15 B 20 B 25 B 30 B 35
;
RUN;

/* Calculate statistics by group */
PROC MEANS DATA=grouped_data;
    CLASS Group;
    VAR Value;
RUN;
```
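```sas
PROC UNIVARIATE DATA=sashelp.class NORMAL PLOTS;
    VAR height;
    /* Overlay a fitted normal curve on the histogram */
    HISTOGRAM / NORMAL;
    /* Add a box of summary statistics in the top-left corner */
    INSET N = 'N' MEAN = 'Mean' STD = 'Std Dev' / POS = NW;
RUN;
```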
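When you run PROC MEANS or PROC UNIVARIATE, the output reports the statistics described above: N is the number of nonmissing observations, Mean is the average, Std Dev is the sample standard deviation, and Minimum and Maximum are the smallest and largest values.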
Descriptive statistics provide the foundation for understanding your data. SAS offers powerful procedures like PROC MEANS, PROC SUMMARY, and PROC UNIVARIATE to calculate these statistics efficiently. By understanding both the concepts and the SAS implementation, you can effectively summarize and describe your data, which is crucial for making informed decisions in any data analysis project.
Remember that descriptive statistics are just the first step in data analysis. They help you understand what your data looks like, but for making inferences about populations, you’ll need to move on to inferential statistics.