PROC FREQ in SAS is a procedure for analyzing count data. It produces frequency counts for one or more individual variables and builds two-way tables (cross-tabulations) from pairs of variables.
PROC FREQ in SAS can also perform statistical tests on count data.
PROC MEANS is another SAS procedure; use it to compute descriptive statistics such as the mean, standard deviation, minimum, and maximum.
The general syntax of PROC FREQ is:
PROC FREQ <options>;
   <statements>
   TABLES requests </ options>;
The TABLES statement in PROC FREQ returns the frequency (count) of each value of the specified variables.
You can specify options for the TABLES statement after a slash (/).
The most basic use of PROC FREQ is to count the observations in each category of a variable. The following code counts the students in each category of sex in the sashelp.class data set:
proc freq data=sashelp.class;
   table sex;
run;
Options follow the slash. For example, the statement
TABLES age*weight / CHISQ;
requests that the chi-square and related statistics be reported for the cross-tabulation age*weight.
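As a minimal sketch of how options after the slash change the output, the step below reuses the sashelp.class data set from the earlier example. NOCUM and NOPERCENT are standard TABLES options; the choice of variable is only for illustration.

```sas
/* Illustration only: a one-way table with the cumulative and
   percent columns suppressed */
proc freq data=sashelp.class;
   tables sex / nocum nopercent;   /* leaves only the Frequency column */
run;
```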
ONE-WAY Frequency Tables
You can use PROC FREQ in SAS to produce tables of frequencies (counts) by category and to perform statistical tests on those counts.
Using the ORDER=FREQ option
proc freq data=data.flighttravelers order=freq;
   table day_of_booking;
run;
- PROC FREQ in the above example includes the ORDER=FREQ option.
- The ORDER=FREQ option helps you quickly analyse which categories have the most and fewest counts.
- The “Frequency” column gives the number of times the day_of_booking variable takes the value shown in the first column.
- The “Percent” column gives that count as a percent of the total.
- The “Cumulative Frequency” and “Cumulative Percent” columns report running totals of the count and percent across the values of day_of_booking.
Use this output to learn how the categories are distributed in your data set.
For example, in these data, 28 people have booked flights on Sunday.
Using the ORDER=FORMATTED option
With the ORDER=FORMATTED option, the categories are displayed in ascending order of their formatted values, so a custom format lets you control how the categories are grouped and ordered in the table.
Before using this option, you need to create a custom format to define the order you want in the output.
proc format;
   /* Map both weekend days to one label and every other day to another */
   value $dayfmt "Saturday", "Sunday" = "Weekends"
                 other                = "Weekdays";
run;

proc freq order=formatted data=data.flighttravelers;
   tables day_of_booking;
   title "Example of PROC FREQ with Formatted Values";
   format day_of_booking $dayfmt.;
run;
Creating a ONE-WAY Frequency table from summarized data
If the data are already summarized, use the WEIGHT statement to name the variable that holds the counts.
DATA COINS;
   /* CATEGORY is read from columns 1-9 and COUNT from columns 10-12 */
   INPUT CATEGORY $9. COUNT 3.;
   DATALINES;
CENTS    152
CENTS    100
NICKELS  49
DIMES    59
QUARTERS 21
HALF     44
DOLLARS  21
;
PROC FREQ;
   WEIGHT COUNT;
   TITLE 'Reading Summarized Count data';
   TABLES CATEGORY;
RUN;
WEIGHT COUNT tells PROC FREQ that the values of the variable COUNT are counts. Even though there are two records for CENTS, the procedure adds their weights together into a single CENTS category (252).
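One way to check the combined totals is to save the one-way table to a data set and print it. The sketch below assumes the COINS data set from the step above exists in the WORK library; the output data set name coin_totals is hypothetical. OUT= is a TABLES option that writes the table's counts and percents to a data set.

```sas
/* Assumes WORK.COINS was created by the DATA step above;
   coin_totals is a hypothetical output data set name */
proc freq data=coins;
   weight count;
   tables category / out=coin_totals;   /* save the frequency table */
run;

/* One row per category; the two CENTS records appear as a single row */
proc print data=coin_totals;
run;
```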
Testing Goodness of Fit in a One-Way Table
A goodness-of-fit test for a single population determines whether the distribution of observed frequencies in the sample data matches the number of occurrences expected for the population.
The test assumes that the number of observations is fixed.
The hypotheses being tested are as follows:
H0: The population follows the hypothesized distribution.
Ha: The population does not follow the hypothesized distribution.
A chi-square test is one of the goodness-of-fit tests. A decision can be made based on the p-value associated with the chi-square statistic.
A low p-value indicates that the data do not follow the hypothesized or theoretical distribution.
If the p-value is sufficiently low (usually <0.05), you reject the null hypothesis.
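For reference, the Pearson chi-square statistic behind this test can be written as follows. The symbols are notation introduced here for illustration: O_i is the observed count in category i, p_i the hypothesized proportion, n the total sample size, and k the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = n\,p_i,
\qquad \text{df} = k - 1
```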
The syntax to perform a goodness-of-fit test is as follows:
PROC FREQ; TABLES variable / CHISQ TESTP=(list of ratios);
An airline operates daily flights to several Indian cities. One of the airline’s problems is catering to passengers’ food preferences. Captain Cook, the operations manager, believes that 35% of passengers prefer vegetarian food, 40% prefer non-vegetarian food, 20% prefer low-calorie food, and 5% request diabetic food.
A sample of 500 passengers was randomly chosen to analyse the food preferences, and the data is shown below.
We will conduct a chi-square test to check whether Captain Cook’s belief holds at α = 0.05.
|Food Type|Vegetarian|Non-Vegetarian|Low-Calorie|Diabetic|
|Number of Passengers|190|185|90|35|
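The PROC FREQ step that follows reads a data set named airlines with the variables foodtype and no_of_passengers. A minimal sketch of a DATA step that could create it is shown here; the category labels are assumptions, and only the four counts come from the table above.

```sas
/* Hypothetical input step: the foodtype labels are assumed,
   the counts are taken from the table above */
data airlines;
   length foodtype $ 14;
   input foodtype $ no_of_passengers;
   datalines;
Vegetarian 190
NonVegetarian 185
LowCalorie 90
Diabetic 35
;
run;
```

The rows are entered in the same order as the hypothesized proportions, which matters because the analysis relies on ORDER=DATA.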
proc freq data=airlines order=data;
   weight no_of_passengers;
   title 'goodness of fit analysis';
   tables foodtype / nocum chisq testp=(0.35 0.40 0.20 0.05);
run;
- The WEIGHT statement tells PROC FREQ that no_of_passengers holds the count for each food type.
- The ORDER=DATA option keeps the categories in the order in which they appear in the input data set. Frequencies are based on the variable foodtype.
- The CHISQ and TESTP= options of the TABLES statement request the goodness-of-fit test.
- The test ratios are the proportions of passengers expected in each of the four categories.
- The NOCUM option requests a table without the cumulative column.
Note: You must use the ORDER=DATA option to ensure that the hypothesised ratios listed in the TESTP= statement match up correctly with the categories in the input data.
The chi-square statistic is 7.4107 with 3 degrees of freedom. Its p-value (approximately 0.06) is greater than the significance level (0.05), so we fail to reject the null hypothesis: the data are consistent with Captain Cook’s belief about food preferences.
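The reported statistic can be reproduced by hand from the observed counts and the hypothesized proportions. The worked arithmetic below is a sketch that assumes the four counts 190, 185, 90, and 35 line up with the proportions 0.35, 0.40, 0.20, and 0.05 in that order.

```latex
\begin{aligned}
E &= 500 \times (0.35,\ 0.40,\ 0.20,\ 0.05) = (175,\ 200,\ 100,\ 25) \\
\chi^2 &= \frac{(190-175)^2}{175} + \frac{(185-200)^2}{200}
        + \frac{(90-100)^2}{100} + \frac{(35-25)^2}{25} \\
       &= 1.2857 + 1.1250 + 1.0000 + 4.0000 = 7.4107
\end{aligned}
```

With 3 degrees of freedom, the critical value at the 0.05 level is 7.815, so a statistic of 7.4107 does not lead to rejection.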
CHI SQUARE Test of Independence – Analyzing TWO-WAY Tables
The chi-square test of independence tests whether two categorical variables are statistically independent.
A TABLES statement with two variables separated by an asterisk creates a cross-tabulation table relating the two variables.
The cross-tabulation table is often called a contingency table.
A cross-tabulation counts the occurrences in a sample across two grouping variables.
In the example below, we want to determine the relationship between crime and drinking alcohol.
The independent variable is CRIME, and the dependent variable is DRINKER.
So, the cross-tabulation request will be crime*drinker.
The null and alternative hypotheses in this case are:
H0: The variables are independent, which means there is no association between crime and drinking alcohol.
Ha: The variables are dependent, which means there is an association between crime and drinking alcohol.
proc freq data=drinkers;
   weight count;
   tables crime*drinker / chisq expected norow nocol nopercent;
   title 'chi square analysis of a contingency table';
run;
By default, each cell of the table shows four numbers: the frequency, the overall percent, the row percent, and the column percent.
The EXPECTED option adds the expected values to the table, and NOROW, NOCOL, and NOPERCENT tell SAS to leave the row, column, and overall percents out of the table.
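The expected values reported by EXPECTED are the counts that would occur if the two variables were independent. In the notation used here for illustration, O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, n is the overall total, and r and c are the numbers of rows and columns.

```latex
E_{ij} = \frac{R_i\,C_j}{n},
\qquad
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{df} = (r-1)(c-1)
```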
Observe the statistics in the output. The chi-square value is 49.5660, and its p-value falls below the 0.05 significance level.
Thus, you reject the null hypothesis of no association (independence) and conclude that there is evidence of a relationship between drinking status and the type of crime committed.
Most of the expected values are close to the observed values, while in the case of Fraud, the observed value (63) is well below the expected value (109.14).
This information leads to the conclusion that those involved in fraud are less likely to drink alcohol.
Calculating Relative Risk
Two-by-two contingency tables are often used while examining a measure of risk. In a medical test, these tables are constructed when one variable represents the presence or absence of disease and the other indicates some risk factor.
A measure of this risk in a case-control study is called the odds ratio (OR).
In a case-control study, a researcher selects subjects with and without the disease and looks back in time for exposure or non-exposure to a risk factor.
In a cohort study, subjects are selected by the presence or absence of a risk factor and then observed over time to see if they develop an outcome; the measure of this risk is called relative risk (RR).
The odds ratio is the odds of exposure among subjects with the disease divided by the odds of exposure among subjects without the disease.
Relative risk indicates how many times more or less likely an exposed person is to develop an outcome relative to an unexposed person.
In either case, a risk measure (OR or RR) equal to 1 indicates no difference in risk between the groups, and a risk measure different from 1 indicates a difference. Assuming the outcome studied is undesirable:
- Risk measure >1 indicates an increased risk of the outcome.
- Risk measure <1 indicates a reduced risk of the outcome.
- Risk measure =1 indicates no difference in risk.
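For a two-by-two table, both measures can be written in terms of the four cell counts. The layout assumed here is for illustration: a and b are the counts in the first row, c and d the counts in the second row, with rows as the two groups being compared and the first column as the outcome of interest.

```latex
\text{OR} = \frac{a/b}{c/d} = \frac{a\,d}{b\,c},
\qquad
\text{RR} = \frac{a/(a+b)}{c/(c+d)}
```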
In PROC FREQ, the option to calculate the values for OR or RR is RELRISK and appears as an option to the TABLES statement as shown here:
TABLES CHOLESTROLDIET*OUTCOME / CHISQ RELRISK;
proc freq data=HeartDisese order=data;
   title 'Case-Control Study of High Fat/Cholesterol Diet';
   TABLES CHOLESTROLDIET*OUTCOME / CHISQ RELRISK;
   exact pchi or;
   weight Total;
run;
Frequency tells us how many subjects fall into each combination of diet (LOW or HIGH cholesterol) and heart disease outcome (NO or YES).
Interpreting the first row, 6 subjects on the LOW cholesterol diet have NO heart disease, while 2 subjects on the LOW cholesterol diet have heart disease.
Expected is the count that would be expected in the cell if diet and outcome were independent; compare it with the observed frequency.
Percent is the overall percentage: 26.09% of all subjects are on a LOW cholesterol diet and do not have heart disease.
Row Percent is the percentage within the row: of the 8 subjects on the LOW cholesterol diet, 6 have NO heart disease, i.e. 75% of subjects on the LOW cholesterol diet don’t have heart disease.
Col Percent is the percentage within the column: of the 10 subjects with NO heart disease, 6 are on the LOW cholesterol diet, which is 60%. At the same time, 11 of the 13 subjects with heart disease (84.62%) are on the HIGH cholesterol diet.
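The frequencies and percentages quoted above are consistent with cell counts of 6, 2, 4, and 11: 6 is 26.09% of a total of 23, and 6 of the 10 subjects without heart disease leaves 4 on the HIGH diet. A hypothetical DATA step that would produce such a table is sketched below; the value labels LOW/HIGH and NO/YES are assumptions, while the data set and variable names come from the PROC FREQ step above.

```sas
/* Hypothetical reconstruction: value labels are assumed and the
   counts are inferred from the frequencies and percents in the text */
data HeartDisese;
   input CHOLESTROLDIET $ OUTCOME $ Total;
   datalines;
LOW NO 6
LOW YES 2
HIGH NO 4
HIGH YES 11
;
run;
```

Entering LOW before HIGH and NO before YES keeps the row and column order described in the text when ORDER=DATA is used.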
The chi-square test assesses whether the two variables are associated by comparing the observed counts with the counts expected under independence.
The chi-square statistic is 4.9597 with a p-value of 0.0259. Because the p-value is less than 0.05, the test indicates an association between diet and heart disease.
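The statistic can be checked by hand with the shortcut formula for a two-by-two table. This sketch assumes cell counts of 6 and 2 in the LOW row and 4 and 11 in the HIGH row, which reproduce the frequencies and percentages quoted in the text.

```latex
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
       = \frac{23\,(6 \cdot 11 - 2 \cdot 4)^2}{8 \cdot 15 \cdot 10 \cdot 13}
       = \frac{23 \cdot 3364}{15600}
       \approx 4.9597
```

With 1 degree of freedom, the critical value at the 0.05 level is 3.841, so the statistic exceeds it.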
One of the chi-square test assumptions is that the expected count in each cell should be at least 5. In the above example, two cells have small observed counts (4 and 2), and the expected counts in the LOW cholesterol row (3.48 and 4.52) are below 5. In these cases, using Fisher’s exact test is more appropriate.
Fisher’s exact test gives a p-value of 0.0393, which is statistically significant at the 5% level, so we can say there is an association, and perhaps a HIGH-fat diet is associated with a HIGH risk of heart disease.
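The two-sided Fisher p-value can be reproduced from the hypergeometric distribution. The sketch below assumes the same cell counts as before (6, 2, 4, 11), so the margins are 8 and 15 for the rows and 10 and 13 for the columns, and k denotes the count in the first cell. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one (k = 6).

```latex
\begin{aligned}
P(k) &= \frac{\binom{10}{k}\binom{13}{8-k}}{\binom{23}{8}},
\qquad \binom{23}{8} = 490314 \\
p &= P(0) + P(6) + P(7) + P(8)
   = \frac{1287 + 16380 + 1560 + 45}{490314}
   = \frac{19272}{490314} \approx 0.0393
\end{aligned}
```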
In the EXACT statement, the PCHI keyword requests an exact p-value for the Pearson chi-square test, which appears in the table below.
The odds ratio is 8.25, reported with 95% confidence limits. It means the odds of heart disease for subjects on the HIGH cholesterol diet are 8.25 times the odds for subjects on the LOW cholesterol diet.
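The relative risk of 2.81 (column 1) indicates that subjects on the LOW cholesterol diet are 2.81 times as likely to be free of heart disease as subjects on the HIGH cholesterol diet.
The relative risk of 0.34 (column 2) indicates that subjects on the LOW cholesterol diet have 0.34 times the risk of heart disease of subjects on the HIGH cholesterol diet (decreased risk).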
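These measures can be worked out by hand. The sketch below assumes cell counts of 6 (LOW, NO), 2 (LOW, YES), 4 (HIGH, NO), and 11 (HIGH, YES), which are consistent with the frequencies and percentages quoted earlier, and compares the LOW row with the HIGH row for each outcome column.

```latex
\begin{aligned}
\text{OR} &= \frac{6 \cdot 11}{2 \cdot 4} = \frac{66}{8} = 8.25 \\
\text{RR}_{\text{column 1 (NO)}} &= \frac{6/8}{4/15} = \frac{0.75}{0.2667} \approx 2.81 \\
\text{RR}_{\text{column 2 (YES)}} &= \frac{2/8}{11/15} = \frac{0.25}{0.7333} \approx 0.34
\end{aligned}
```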
Since we specified OR in the EXACT statement, the last table in the output reports exact results for the odds ratio.
The odds ratio is the same as above (8.25), but this table also gives the exact confidence limits.