PROC FREQ in SAS, as its name suggests, gives us frequency counts as well as other statistics that help in analyzing the data. But PROC FREQ can do more than just give counts.
What is PROC FREQ?
PROC FREQ is a procedure for analyzing the count of data. It is used to obtain frequency counts for one or more individual variables or to create two-way tables (cross-tabulations) from two variables.
PROC FREQ is also used to perform statistical tests on count data.
PROC MEANS is another SAS procedure, which you can use to compute descriptive statistics like finding the mean, standard deviation, minimum and maximum values, and a lot more statistical calculations.
PROC FREQ may be the first procedure you think of when the result needed is a count. For this tutorial, I use the SASHELP.HEART dataset. Here, PROC FREQ can be used to find the number of subjects with bp_status='Normal', or to count the number of male and female subjects in the dataset.
But what if the result needed is not necessarily a count? Proc freq can be a helpful tool for purposes other than counting. The following are just some of the common types of questions that may arise that fit into this category:
- What percentage of Male and Female have heart disease?
- What is the count of subjects who died due to high cholesterol?
- What is the death cause of the subjects?
- Who are the impacted subjects?
- Do any subjects have non-unique records?
Proc freq might not be the first method thought to answer questions such as these, but it may be a very quick and efficient option to use.
Why might proc freq be a good candidate for these quick responses? First, in these cases, there is less typing than other methods. This can save time as well as reduce typos.
Additionally, there may already be some freqs in the works to check into an issue when further questions arise. So getting the results from the existing frequency counts rather than data processing may be a logical next step.
Finally, using the output datasets provided by proc freq, there are readily available lists of unique values, combinations, and counts.
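As a minimal sketch of how an output dataset can answer a question such as "Do any subjects have non-unique records?", the step below counts each value of NAME in SASHELP.CLASS and prints only values that occur more than once. The choice of SASHELP.CLASS and the dataset name name_counts are assumptions made for illustration.

```sas
/* Count how often each NAME occurs and save the counts to a dataset. */
proc freq data=sashelp.class noprint;
    tables name / out=name_counts;   /* OUT= dataset holds NAME, COUNT, PERCENT */
run;

/* Any row with COUNT > 1 is a non-unique value. */
proc print data=name_counts;
    where count > 1;
run;
```

If the PROC PRINT step lists no rows, every NAME occurs exactly once.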
Syntax of PROC FREQ
PROC FREQ <options>;
    BY variables;
    EXACT statistic-options </ computation-options>;
    OUTPUT <OUT=SAS-data-set> output-options;
    TABLES requests </ options>;
    TEST options;
    WEIGHT variable </ option>;
PROC FREQ Statements
- BY Provides separate analyses for each BY group
- EXACT Requests exact tests.
- OUTPUT Requests an output data set.
- TABLES Specifies tables and requests analyses.
- TEST Requests tests for measures of association and agreement.
- WEIGHT Identifies a weight variable.
PROC FREQ Options
PROC FREQ has several different possible output requests. The most common is the OUT= option that is used to create a SAS dataset of a particular tabulation.
Options for the TABLES statement follow a slash. The "/" resolves any ambiguity about whether a word is a variable name or a SAS keyword: it tells SAS that the table requests have ended and that whatever follows is an option that modifies the default behaviour of the statement.
Another example is the NOCOL option, which tells SAS not to show column percentages.
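The two options mentioned above can be combined in a single TABLES statement. The following is a small sketch using SASHELP.HEART; the output dataset name sex_by_bp is a hypothetical name chosen for illustration.

```sas
/* Cross-tabulate sex by blood pressure status.               */
/* Everything after the slash is an option, not a variable.   */
proc freq data=sashelp.heart;
    tables sex*bp_status / nocol out=sex_by_bp;  /* hide column percents, save counts */
run;
```

The printed table omits column percentages, and sex_by_bp holds one row per combination of sex and bp_status with its COUNT and PERCENT.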
Example of PROC FREQ
Below is a basic use of PROC FREQ in SAS, to obtain counts of the number of Students observed in each category of Gender.
proc freq data=sashelp.class; table sex; run;
TABLES age*weight / CHISQ;
The above statement requests that the chi-square and related statistics be reported for the cross-tabulation age*weight.
Scenario 1. What is the count of Males and Females with bp_status of High and Weight Status of Overweight?
For these types of scenarios, PROC SORT and data steps could be used, or PROC SQL is another option. A data step might be the first solution that comes to mind. This involves first sorting the data with PROC SORT, then selecting the records of interest and the unique values of sex in a data step, and finally printing the result with PROC PRINT.
But proc freq can answer this question in one block of code without a data step! A proc freq can be used with a WHERE statement to subset to the bp and weight status.
proc freq data=sashelp.heart; where bp_status='High' and weight_status='Overweight'; tables sex /nocum; run;
A TABLES statement of sex can then be used to get a unique list of the gender included in that subset. The NOCUM option is used to exclude Cumulative frequency from the result.
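For comparison, here is a sketch of the PROC SQL alternative mentioned above. It should return the same counts as the PROC FREQ step, but without the percentage column; the column alias n is an arbitrary name.

```sas
/* Same subset and grouping as the PROC FREQ step, written in SQL. */
proc sql;
    select sex, count(*) as n
    from sashelp.heart
    where bp_status = 'High' and weight_status = 'Overweight'
    group by sex;
quit;
```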
Scenario 2: Which Products got affected by the prediction value?
For this scenario, I have used the SASHELP.PRDSALE dataset. Imagine that it has been found that there are some products where the Actual Sales price is greater than the predicted price.
The first question is likely: Which are those products which have Actual greater than the Predicted?
But this is the type of question where the results may lead to many follow-up questions and the scope of the investigation and analysis may expand requiring additional coding steps to get to the root cause of the issue.
Regardless of the method followed, the investigation will start with a dataset containing both the actual and the predicted sales values for comparison.
This question may be answered by working within a data step or by using the proc freq method.
Data step method:
Working within a data step, if the dataset is not already sorted, a proc sort would come first and then a data step to select the unique product values of interest.
proc sort data=sashelp.prdsale out=sale; by product; run;
data unique_product(keep=product);
    set sale(where=(actual > predict));  /* keep only records with the issue */
    by product;
    if first.product;                    /* one row per product */
run;
Proc freq method:
Working with PROC FREQ, we can simply put PRODUCT in the TABLES statement and use the WHERE statement to subset to the records with the issue:
proc freq data=sashelp.prdsale; where actual gt predict; tables product /list missing nocum; run;
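If the list of affected products is needed for follow-up questions, the same step can write it to a dataset instead of (or in addition to) printing it. This is a sketch; affected_products is a hypothetical dataset name.

```sas
/* Save the unique affected products and their record counts. */
proc freq data=sashelp.prdsale noprint;
    where actual gt predict;
    tables product / out=affected_products;  /* one row per product */
run;

proc print data=affected_products;
run;
```

The resulting dataset has one row per product, which plays the same role as unique_product from the data step method but also carries the number of affected records.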
One-Way Frequency Tables
You can use PROC FREQ to produce tables of frequencies (counts) by category and to perform analyses on the counts.
Using the ORDER=FREQ option
proc freq data=data.flighttravelers order=freq; table day_of_booking; run;
- PROC FREQ in the above example includes the ORDER=FREQ option.
- Using the ORDER=FREQ option helps you quickly perform an analysis of which categories have the most and fewest counts.
- The “Frequency” column gives the count of the number of times the day_of_booking variable takes on the value in the column.
- The “Percent” column is the percent of the total.
- The "Cumulative Frequency" and "Cumulative Percent" columns report the running totals of the count and percent across the values of day_of_booking.
You use this type of insight to learn about the distribution of the categories in your data set.
For example, in these data, 28 people have booked flights on Sunday.
Using the ORDER=FORMATTED option
Using the ORDER=FORMATTED option you can control the order in which the categories will be displayed in the table.
Prior to using this option, you need to create a custom format to define the order that you want in the output.
proc freq order=formatted data=data.flighttravelers;
    tables day_of_booking;
    title "Example of PROC FREQ with Formatted Values";
    format day_of_booking $dayfmt.;   /* custom format defines the display order */
run;
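The definition of $dayfmt is not shown in this tutorial. One way it might be written, assuming day_of_booking stores full day names such as 'Monday', is to prefix each label with a digit so that the ascending sort of the formatted values gives weekday order. The labels below are assumptions, and the PROC FORMAT step has to run before the PROC FREQ step that uses the format.

```sas
/* Hypothetical format: leading digits force Monday-to-Sunday order */
/* because ORDER=FORMATTED sorts by the formatted (displayed) text. */
proc format;
    value $dayfmt
        'Monday'    = '1 Mon'
        'Tuesday'   = '2 Tue'
        'Wednesday' = '3 Wed'
        'Thursday'  = '4 Thu'
        'Friday'    = '5 Fri'
        'Saturday'  = '6 Sat'
        'Sunday'    = '7 Sun';
run;
```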
Creating a One-Way Frequency table from a summarized data
If the data is already summarized you can use the WEIGHT statement to indicate the variables that represent the count.
proc freq; weight count; title 'Reading Summarized Count data'; tables category; run;
WEIGHT COUNT tells PROC FREQ that the data for the variable COUNT are counts. Even though there are two records for CENTS, the program is able to combine the WEIGHT into a single CENTS category (252 CENTS).
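The summarized input data are not listed above. The sketch below shows what such a dataset might look like; the dataset name coins and all of the counts are made up for illustration, with the two CENTS records chosen so that they add up to the 252 mentioned in the text.

```sas
/* Hypothetical pre-summarized data: one row per (category, count). */
data coins;
    input category $ count;
    datalines;
CENTS 152
CENTS 100
NICKELS 49
DIMES 59
QUARTERS 21
;
run;

/* WEIGHT adds up COUNT within each category, so CENTS appears once. */
proc freq data=coins;
    weight count;
    tables category;
run;
```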
Testing Goodness of Fit using PROC FREQ in SAS
PROC FREQ in SAS can also be used to test goodness of fit for a one-way table. A goodness-of-fit test of a single population is used to determine whether the distribution of observed frequencies in the sample data matches the expected number of occurrences for the population, assuming that the number of observations is fixed.
The hypotheses being tested are as follows:
- H0: The population follows the hypothesized distribution.
- Ha: The population does not follow the hypothesized distribution.
A chi-square test is one of the goodness-of-fit tests. A decision can be made based on the p-value associated with that statistic.
A low p-value indicates that the data do not follow the hypothesized, or theoretical, distribution.
If the p-value is sufficiently low (usually <0.05), you will reject the null hypothesis.
The syntax to perform a goodness-of-fit test is as follows:
TABLES variable / CHISQ TESTP=(list of ratios);
An airline operates daily flights to several Indian cities. One of the problems for this airline is the food preferences of the passengers. The Captain Cook operations manager believes that 35% of their passengers prefer vegetarian food, 40% prefer non-vegetarian food, 20% prefer low-calorie food, and 5% request diabetic food.
A sample of 500 passengers was randomly chosen to analyze the food preferences and the data is shown below.
We will conduct a chi-square test to check whether Captain Cook's belief is true at α = 0.05.
|Food type|Vegetarian|Non-vegetarian|Low-calorie|Diabetic|
|Number of Passengers|190|185|90|35|
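The airlines dataset used in the next step is not listed. A sketch of how it could be entered is shown below, using the variable names from the PROC FREQ code (foodtype and no_of_passengers). The category labels, and the assumption that the four counts follow the same order as the percentages in the text, are mine.

```sas
/* Summarized sample of 500 passengers: one row per food type. */
/* Row order matters because the analysis uses ORDER=DATA.     */
data airlines;
    input foodtype :$15. no_of_passengers;
    datalines;
Vegetarian 190
Non-vegetarian 185
Low-calorie 90
Diabetic 35
;
run;
```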
proc freq data=airlines order=data;
    weight no_of_passengers;
    title 'goodness of fit analysis';
    tables foodtype / nocum chisq testp=(0.35 0.40 0.20 0.05);
run;
- The data are already summarised, so the WEIGHT statement names no_of_passengers as the variable that holds the counts.
- ORDER=DATA keeps the categories in the order in which they appear in the input dataset. Frequencies are based on the variable foodtype.
- The CHISQ and TESTP= options request the goodness-of-fit test.
- The test proportions are the percentages that the manager expects in each of the four categories.
- The NOCUM option requests a table without the cumulative column.
Note: You must use the ORDER=DATA option to ensure that the hypothesized ratios listed in the TESTP= statement match up correctly with the categories in the input data.
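The chi-square statistic is 7.4107 with 3 degrees of freedom, which is below the 5% critical value of 7.815. Equivalently, the p-value (about 0.06) is greater than α = 0.05, so we fail to reject the null hypothesis: the sample does not contradict Captain Cook's belief about food preferences.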
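The reported statistic can be checked by hand. With 500 passengers, the hypothesized proportions give expected counts of 175, 200, 100 and 25, and the chi-square statistic is the sum of the squared differences between observed and expected counts, each divided by the expected count:

```latex
\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(190-175)^2}{175} + \frac{(185-200)^2}{200}
       + \frac{(90-100)^2}{100} + \frac{(35-25)^2}{25}
       \approx 1.2857 + 1.1250 + 1.0000 + 4.0000 = 7.4107
```

The diabetic-food category contributes the most to the statistic, even though its observed and expected counts differ by only 10 passengers, because its expected count is small.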
Chi-Square Test of Independence – Analyzing Two-way tables using PROC FREQ in SAS
Using PROC FREQ in SAS we can perform the chi-square test of independence, which tests whether two categorical variables are statistically independent.
The TABLES statement with two or more variables listed and separated by an asterisk creates a cross-tabulation table for relating two variables.
The cross-tabulation table is often called a contingency table.
The count of the number of occurrences in a sample across two grouping variables creates a cross-tabulation.
In the below example we want to determine the relationship between crime and drinking alcohol.
The independent variable is CRIME and the dependent variable is the DRINKER.
So, the cross-tabulation request will be TABLES crime*drinker.
The null and alternative hypotheses in this case are:
- H0: The variables are independent, which means there is no association between crime and drinking of alcohol
- Ha: The variables are dependent, which means the type of crime is associated with the drinking of alcohol
proc freq data=drinkers;
    weight count;
    tables crime*drinker / chisq expected norow nocol nopercent;
    title 'chi square analysis of a contingency table';
run;
By default, each cell of the table shows four numbers: the frequency, the overall percent, the row percent, and the column percent.
The EXPECTED specifies that expected values are to be included in the table, and NOROW, NOCOL, and NOPERCENT tell SAS to exclude these values from the table.
Observe the statistics from the output. The chi-square value is 49.5660, and its p-value is below the 0.05 significance level.
Thus, you reject the null hypothesis of no association (independence) and conclude that there is evidence of a relationship between drinking status and type of crime committed.
Most of the Expected values are close to the observed values, while in the case of Fraud the observed value(63) is different from what was expected(109.14).
This information leads to the conclusion that those involved in fraud are less likely to drink alcohol.
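For reference, the expected value in each cell is computed under the assumption of independence from the row total, the column total and the overall total N, and the chi-square statistic sums the scaled squared differences over all cells. Using the Fraud figures quoted above, that single cell contributes:

```latex
E_{ij} = \frac{(\text{row}_i \text{ total}) \times (\text{column}_j \text{ total})}{N},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\frac{(63 - 109.14)^2}{109.14} \approx 19.51
```

So this one cell accounts for a large share of the overall statistic of 49.5660.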
Calculating Relative Risk using PROC FREQ in SAS
Two-by-two contingency tables are often used while examining a measure of risk. In a medical test, these types of tables are constructed when one variable represents the presence or absence of disease and the other indicates some risk factor.
A measure of this risk in a case-control study is called the odds ratio (OR).
In a case-control study, a researcher selects subjects by outcome (cases with the disease and controls without it) and looks back in time for exposure or non-exposure to a risk factor.
In a cohort study, subjects are selected by the presence or absence of a risk factor and then observed over time to see if they develop an outcome. The measure of this risk is called relative risk (RR).
The odds ratio compares the odds of finding an exposure in someone with the disease to the odds of finding the exposure in someone without the disease.
Relative Risk indicates how many times more or less likely an exposed person develops an outcome relative to an unexposed person.
In either case, a risk measure (OR or RR) equal to 1 indicates no difference in risk between the groups.
A risk measure different from 1 indicates a difference in risk. Assuming the outcome studied is undesirable:
- Risk measure > 1 indicates an increased risk of the outcome.
- Risk measure < 1 implies a reduced risk of the outcome.
- Risk measure = 1 indicates no difference in risk.
In PROC FREQ, the option to calculate the values for OR or RR is RELRISK and appears as an option to the TABLES statement as shown here:
TABLES CHOLESTROLDIET*OUTCOME / CHISQ RELRISK;
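proc freq data=HeartDisese order=data;
    title 'Case-Control Study of High Fat/Cholesterol Diet';
    TABLES CHOLESTROLDIET*OUTCOME / CHISQ RELRISK;
    exact pchi or;    /* exact chi-square p-value and exact limits for the odds ratio */
    weight Total;     /* Total holds the cell counts */
run;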
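The HeartDisese dataset is not listed in this tutorial. The sketch below is one way it could be entered as summarized counts. The variable names come from the code above; the value labels are assumptions, and the HIGH-diet counts (4 and 11) are not stated directly but inferred from the figures quoted in the interpretation (10 subjects without heart disease in total, and 6 subjects making up 26.09% of the sample).

```sas
/* Summarized 2x2 table: one row per diet/outcome combination.      */
/* With ORDER=DATA, this row order sets the table layout            */
/* (LOW diet first, NO heart disease first), which is an assumption. */
data HeartDisese;
    input CHOLESTROLDIET $ OUTCOME $ Total;
    datalines;
LOW NO 6
LOW YES 2
HIGH NO 4
HIGH YES 11
;
run;
```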
Frequency tells us how many subjects we have in the LOW Cholesterol diet with NO/YES Heart Disease outcome.
Interpreting the first row, we have 6 subjects with LOW Cholesterol who have NO heart disease while 2 subjects with LOW cholesterol have HEART Disease.
Expected is the count that would be expected in each cell if diet and outcome were independent, for comparison with the observed count.
The percent is the Overall percentage which indicates that among all subjects 26.09 % of people are in the LOW cholesterol diet and do not have Heart Disease.
Row Percent tells us the percentage of subjects on the LOW cholesterol diet who have NO heart disease, out of the 8 subjects on that diet: 6 out of 8, i.e. 75% of people on the LOW cholesterol diet don't have heart disease.
Col Percent tells us the percentage of subjects without heart disease who are on the LOW cholesterol diet: 6 out of 10, which is 60%. Similarly, about 85% (11 out of 13) of the people who have heart disease are on the HIGH cholesterol diet.
The chi-square test tells us whether the two variables are associated by comparing what was observed with what would be expected under independence.
The chi-square statistic is 4.9597 with a p-value of 0.0259, which is less than 0.05, so there is evidence of an association between diet and heart disease.
A common rule of thumb for the chi-square test is that the expected count in each cell should be at least 5. In the above example the counts are small (two cells have observed counts of 4 and 2). In these cases, it is more appropriate to use Fisher's exact test.
The p-value of Fisher's exact test (0.0393) is statistically significant at the 5% level, so we can say there is an association and perhaps a HIGH-fat diet is associated with a HIGH risk of heart disease.
The PCHI option in the EXACT statement requests the exact p-value for the Pearson chi-square test, which is reported in the table below.
The odds ratio is 8.25, reported with 95% confidence limits, which means the odds of heart disease for people on the HIGH cholesterol diet are about 8 times the odds for people on the LOW cholesterol diet.
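The odds ratio can be verified from the cell counts. The LOW-diet row (6 without and 2 with heart disease) is given above; the HIGH-diet counts of 4 and 11 are inferred from the column total of 10 and the overall percentage of 26.09%, so treat them as an assumption. The ratio is the cross-product of the table:

```latex
\mathrm{OR} = \frac{6 \times 11}{2 \times 4} = \frac{66}{8} = 8.25
```

The same value results whichever diet is listed first, as long as the two diagonals of the table are kept together.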
The relative risk of 2.88 indicates that heart disease is 2.88 times as likely in the HIGH-fat group (increased risk).
The relative risk of 0.34 tells us that subjects on the LOW cholesterol diet have a decreased risk of heart disease (0.34 times the risk).
Since we have specified OR in the EXACT statement, we get one more table at the end of the output.
The odds ratio is the same as above (8.25), but this table also gives the exact confidence limits.