Descriptive statistics is about finding “what has happened” by summarizing past data using statistical methods and queries.
In short, descriptive statistics are summary values that describe a given dataset. The input data can represent either an entire population or a subset (sample) of a population.
Descriptive statistics fall into three groups: measures of central tendency, measures of variation, and measures of shape.
The tools and techniques used for describing or summarizing the data in descriptive statistics are:
|Measures of Central Tendency|Mean, Median and Mode|
|Measures of Variation|Range, Inter-Quartile Distance, Variance, Standard Deviation|
|Measures of Shape|Skewness and Kurtosis|
Measures of Central Tendency
In descriptive statistics, measures of central tendency describe the data using a single value, which helps users summarize the data.
Mean is one of the measures of central tendency. It is the arithmetic average of the data, calculated by adding all observations and dividing by the number of observations.
For $n$ observations $x_1, x_2, \dots, x_n$ (for example, the ages of a sample of students), the mean is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The mean is significantly affected by the presence of outliers. When the data contain extreme values, it can be a misleading basis for decisions, and the median is usually the more robust choice.
The median is the value that divides the ordered data into two equal parts. To find it, arrange the data in ascending order. When the number of observations $n$ is odd, the median is the value at position $(n+1)/2$. When $n$ is even, the median is the average of the $(n/2)$th and $(n/2 + 1)$th observations.
For example, the data in ascending order are 180, 226, 245, 260, 319, 326, 445.
With $n = 7$, the median is the 4th value, which is 260.
Calculating Mean and Median in SAS
PROC MEANS can be used to find the mean and median in a SAS dataset.
title 'Table of Mean and Median for Students age';
proc means data=sashelp.class mean median maxdec=2;
  var age;
run;
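To connect the hand calculation with the procedure, the sketch below applies the same PROC MEANS step to the seven values from the median example. The dataset names `values` and `values_outlier` and the replacement value 4450 are invented for illustration and are not part of the original article.

```sas
/* Hypothetical dataset holding the seven values from the median example */
data values;
  input x @@;
  datalines;
180 226 245 260 319 326 445
;
run;

title 'Mean and median of the seven example values';
proc means data=values n mean median maxdec=2;
  var x;
run;

/* Same data, with the largest value replaced by an invented extreme value */
data values_outlier;
  set values;
  if x = 445 then x = 4450;
run;

title 'Mean and median after introducing one extreme value';
proc means data=values_outlier n mean median maxdec=2;
  var x;
run;
title;
```

By hand, the first mean is 2001 / 7, about 285.86, and the second is 6006 / 7 = 858, while the median stays at 260 in both cases. This illustrates why the median is less sensitive to outliers than the mean.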
The mode is the value that occurs most frequently in the data set.
Calculating Mode in SAS
The mode of a dataset can be calculated in SAS using PROC UNIVARIATE.
title 'Table of Modes for Student age';
ods select Modes;
proc univariate data=sashelp.class modes;
  var age;
run;
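As a cross-check on the mode, a frequency table sorted by descending count puts the most frequent value in the first row. This is only a sketch of an alternative view using PROC FREQ, not the method the article relies on.

```sas
/* List each age with its count, most frequent age first */
title 'Frequency of each age, most frequent first';
proc freq data=sashelp.class order=freq;
  tables age / nocum;
run;
title;
```

If two or more ages share the highest count, they appear together at the top of the table, which makes a multi-modal variable easy to spot.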
PERCENTILE, DECILE, AND QUARTILE
Percentiles, deciles, and quartiles are frequently used to identify the position of an observation within a dataset.
A percentile identifies the position of a value in a group. It is denoted $P_x$, the value below which $x$ percent of the data lie.
For example, $P_{10}$ denotes the value below which 10 percent of the data lie. To find $P_x$, arrange the data in ascending order; the position of $P_x$ in the ordered data is calculated using the formula below.
Position of $P_x \approx \frac{x(n+1)}{100}$, where $n$ is the number of observations in the data.
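As a small worked example, assume the common position rule $x(n+1)/100$ and the seven ordered values used earlier for the median (180, 226, 245, 260, 319, 326, 445). With $n = 7$, the positions for $P_{25}$, $P_{50}$ and $P_{75}$ are 2, 4 and 6, which gives 226, 260 and 326. The sketch below asks PROC UNIVARIATE for the same percentiles with `PCTLDEF=4`, which, as far as I understand the SAS definitions, is the one based on position $(n+1)p$. The dataset names `values` and `Pctls` are invented for illustration.

```sas
/* Hypothetical dataset holding the seven example values */
data values;
  input x @@;
  datalines;
180 226 245 260 319 326 445
;
run;

/* PCTLDEF=4 requests the weighted-average definition aimed at position (n+1)p */
proc univariate data=values pctldef=4 noprint;
  var x;
  output out=Pctls pctlpts=25 50 75 pctlpre=P;
run;

proc print data=Pctls noobs;
run;
```

SAS uses a different percentile definition by default (`PCTLDEF=5`), so results for other datasets can differ slightly from the hand formula unless the definition is set explicitly.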
Calculating percentile in SAS
Frequently used percentiles (such as the 5th, 25th, 50th, 75th, and 95th percentiles) can be calculated using PROC MEANS. The STACKODSOUTPUT option, introduced in SAS 9.3, lets you create an output data set that contains the percentiles for multiple variables.
proc means data=sashelp.cars StackODSOutput P5 P25 P75 P95;
  var mpg_city mpg_highway;
  ods output summary=Percentiles;
run;
A decile is a percentile that divides the data into 10 equal parts. 10% of the data lie at or below the first decile, 20% at or below the second decile, and so on.
Quartiles divide the data into 4 equal parts. 25% of the data lie at or below the first quartile (Q1), 50% at or below the second quartile (Q2), which is also the median, and 75% at or below the third quartile (Q3).
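Deciles and quartiles can be requested in the same way as other percentiles. The sketch below is one possible approach: PROC UNIVARIATE writes the nine deciles of `mpg_city` to a dataset, and PROC MEANS prints the three quartiles. The output dataset name `Deciles` and the prefix `D` are arbitrary choices made for this example.

```sas
/* Deciles: the 10th, 20th, ..., 90th percentiles, stored as D10-D90 */
proc univariate data=sashelp.cars noprint;
  var mpg_city;
  output out=Deciles pctlpts=10 to 90 by 10 pctlpre=D;
run;

proc print data=Deciles noobs;
run;

/* Quartiles: Q1, the median (Q2) and Q3 */
proc means data=sashelp.cars q1 median q3 maxdec=2;
  var mpg_city;
run;
```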
Measures of Variation
In descriptive statistics, measures of variation help us understand the variability in the data. Predictive analytics techniques such as regression explain variation in the outcome variable (Y) using the predictor variable (X). Variability in the data is measured using the following techniques.
- RANGE
- INTER-QUARTILE DISTANCE (IQD)
- VARIANCE
- STANDARD DEVIATION
RANGE captures the data spread and is the difference between the maximum and minimum value of the data.
Calculating RANGE in SAS
PROC MEANS can be used with the RANGE option to calculate the range of any numeric variable.
proc means data=sashelp.class range maxdec=2;
  var age weight;
run;
INTER-QUARTILE DISTANCE (IQD)
Inter-Quartile Distance, also known as Inter-Quartile Range (IQR), is the distance between Quartile 1 (Q1) and Quartile 3 (Q3), that is, IQD = Q3 - Q1.
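The quartiles and their difference can be read directly from PROC MEANS. The sketch below uses the `Q1`, `Q3` and `QRANGE` statistic keywords on the `age` variable, so the reported QRANGE can be checked against Q3 minus Q1.

```sas
/* Q1, Q3 and the inter-quartile range (QRANGE = Q3 - Q1) of age */
title 'Quartiles and inter-quartile range of age';
proc means data=sashelp.class q1 q3 qrange maxdec=2;
  var age;
run;
title;
```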
IQD can also be used for identifying outliers in the data. Outliers are observations that lie far away (on either side) from the bulk of the data.
Values below Q1 - 1.5 * IQD or above Q3 + 1.5 * IQD are classified as outliers.
SAS program to find Outliers using the IQD
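/* Create a copy of sashelp.class with three artificial outlier ages */
data class;
  set sashelp.class;
  if name in ('Alice', 'Carol', 'Henry') then age = 110 + rand("Uniform");
run;

/* Compute Q1, Q3 and the inter-quartile range of age */
proc univariate data=class noprint;
  var age;
  output out=ClassStats qrange=iqr q1=q1 q3=q3;
run;

/* Store the three statistics in macro variables */
data _null_;
  set ClassStats;
  call symputx('iqr', iqr);
  call symputx('q1', q1);
  call symputx('q3', q3);
run;

/* Keep only the observations at or beyond the outlier limits */
data outliers;
  set class;
  if (age le &q1 - 1.5 * &iqr) or (age ge &q3 + 1.5 * &iqr);
run;

proc print data=outliers;
run;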
In the above program, we modified the sashelp.class dataset by replacing the ages of three students with outlier values.
Then QRANGE, Q1 and Q3 are calculated for the modified dataset.
A macro variable is created for each of the three statistics.
Finally, the conditions are applied to the modified dataset ‘class’ to select the outliers.
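Variance is a measure of the variability of the data around the mean value.
The variance of a population, $\sigma^2$, is calculated using $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the population mean and $N$ is the population size.
The variance of a sample, $s^2$, is calculated using $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean and $n$ is the sample size.
Note that the deviations from the mean are squared because the sum of the deviations from the mean always adds up to 0.
For the sample variance, the sum of squared deviations is divided by $n - 1$ rather than $n$. This is known as Bessel’s correction.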
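The choice of divisor can be seen in SAS through the `VARDEF=` option of PROC MEANS. The sketch below, offered as an illustration rather than a recommendation, computes the variance of `age` twice: once with the default divisor $n - 1$ (sample variance) and once with divisor $n$ (population variance).

```sas
/* Default: VARDEF=DF, the sum of squared deviations is divided by n - 1 */
title 'Sample variance of age (divisor n - 1)';
proc means data=sashelp.class n var maxdec=4;
  var age;
run;

/* VARDEF=N: the sum of squared deviations is divided by n */
title 'Population variance of age (divisor n)';
proc means data=sashelp.class n var vardef=n maxdec=4;
  var age;
run;
title;
```

The second result is always the first multiplied by $(n - 1)/n$, so the difference shrinks as the number of observations grows.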
Standard deviation is the square root of the variance. It is also a measure of how spread out the values are around the mean.
Why do we need Standard Deviation?
Since variance is based on squared deviations, it does not have the same unit of measurement as the original values.
For example, lengths measured in metres (m) have a variance measured in metres squared (m²).
Taking the square root of the variance returns to the units of the original scale; this is known as the standard deviation.
The formula for the standard deviation of a population is $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
and for a sample it is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
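A short worked example may make the two divisors concrete. The eight values 2, 4, 4, 4, 5, 5, 7, 9 are invented for illustration; they are treated first as a whole population (divisor $N$) and then as a sample (divisor $n - 1$).

```latex
\begin{align*}
\text{mean} &= \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \\
\text{sum of squared deviations} &= 9+1+1+1+0+0+4+16 = 32 \\
\sigma^2 &= \frac{32}{8} = 4, \qquad \sigma = \sqrt{4} = 2 \\
s^2 &= \frac{32}{7} \approx 4.571, \qquad s = \sqrt{s^2} \approx 2.138
\end{align*}
```

Both standard deviations are in the same unit as the original values, and the sample version is slightly larger because of the smaller divisor.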
Properties of standard deviation
- Standard deviation is used to measure the spread of data around the mean.
- Standard deviation can never be negative as it is a measure of distance (and distances can never be negative numbers).
- Standard deviation is significantly affected by outliers.
- For data with approximately the same mean, the greater the spread is, the greater the standard deviation.
- The standard deviation is zero (its smallest possible value) if all values of a dataset are the same, because each value is then equal to the mean.
Calculating Variance and Standard Deviation in SAS
Variance and standard deviation can be calculated using the VAR and STDDEV options in PROC MEANS.
proc means data=sashelp.class var stddev maxdec=2;
  var age;
run;
MEASURES OF SHAPE – SKEWNESS AND KURTOSIS
SKEWNESS is a measure of symmetry or lack of symmetry. A data set is symmetrical when the proportion of data at an equal distance from the mean is equal.
Measures of skewness are used to identify whether the distribution is left-skewed or right-skewed.
The value of Skewness will be 0 when the data is symmetrical. A positive value indicates a positive skewness whereas a negative value indicates negative skewness.
KURTOSIS is a measure of the shape of the tails of a distribution, i.e. whether the tails are heavy or light.
Kurtosis indicates whether the tails of a given distribution contain extreme values.
Excess kurtosis compares the kurtosis of a distribution with that of a normal distribution by subtracting the latter. The kurtosis of a normal distribution is 3. Therefore, the excess kurtosis is found using the formula below:
Excess Kurtosis = Kurtosis – 3
Types of Kurtosis
The types of kurtosis are determined by the excess kurtosis of a particular distribution. The excess kurtosis can be positive, negative or 0.
A distribution with a kurtosis value of more than 3 is called leptokurtic. It has positive excess kurtosis and heavy tails on either side, which indicates a greater chance of large outliers.
A distribution with a kurtosis value of less than 3 is called platykurtic. It has negative excess kurtosis and light tails, which indicates that extreme outliers are less likely than under a normal distribution.
A distribution with a kurtosis value equal to 3 is called mesokurtic; its excess kurtosis is 0 or close to 0.
Calculating Kurtosis and Skewness in SAS
Skewness and kurtosis are calculated using PROC UNIVARIATE in SAS.
proc univariate data=sashelp.class;
  var age;
run;
In the above example, the skewness is close to 0, which means the age values are approximately symmetric about the mean. PROC UNIVARIATE reports excess kurtosis, so the negative kurtosis value means that the distribution of age is flatter, with lighter tails, than a normal curve having the same mean and standard deviation.
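PROC UNIVARIATE prints many tables by default. If only the shape measures are of interest, the output can be narrowed. The sketch below shows two options that should give the same values: restricting PROC UNIVARIATE to its Moments table (the ODS table name is given here from memory and is worth confirming with `ods trace on`), and requesting the `SKEWNESS` and `KURTOSIS` statistics from PROC MEANS.

```sas
/* Option 1: keep only the Moments table, which includes skewness and kurtosis */
title 'Skewness and kurtosis of age';
ods select Moments;
proc univariate data=sashelp.class;
  var age;
run;

/* Option 2: request the two statistics directly from PROC MEANS */
proc means data=sashelp.class n skewness kurtosis maxdec=4;
  var age;
run;
title;
```

Both procedures report excess kurtosis, so a value near 0 (not 3) corresponds to a mesokurtic distribution.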