There are many scenarios where you may need to know the metadata of a SAS dataset. It is always essential to understand fully and explore the inputs to SAS-created output.
There are various possibilities in determining the position of variables or columns (and other essential details comprising metadata) in SAS.
Below are some of the methods by which you can view the metadata information of a SAS dataset.
- The CONTENTS Procedure
- The DATASETS PROCEDURE
- SAS Dictionary Tables (used with PROC SQL – DICTIONARY.COLUMNS)
- SASHELP Views (can be used with PROC SQL and with the SAS DATA step VMACRO,VCOLUMN)
- %SYSFUNC (through MACRO functions )
- Variable functions (VTYPE,VFORMAT)
- SAS File I/O (ATTRN)
In this article, I will specifically describe the PROC CONTENTS procedure.
Summarizing Data with PROC CONTENTS
With the PROC CONTENTS procedure, you can generate summary information about the contents of a dataset, such as:
- The variables’ names, types, and attributes (including formats, informats, and labels)
- How many observations are in the dataset?
- How many variables are in the dataset?
- When was the dataset created or last modified?
- How many datasets exist in a library?
- Find the Character and Numeric variables in a SAS dataset.
PROC CONTENT is useful if you have imported your data from a file and want to check that your variables have been read correctly and have the appropriate variable type and format.
For example, you may wish to review there is no truncation of character variables or date variables are correctly read.
PROC CONTENTS in SAS Example
To view the metadata of a SAS dataset, specify the SAS dataset name in the DATA= option. The dataset name is optional, but specifying it brings clarity to the code.
If you do not specify a dataset, SAS will use the most recently created dataset by default.
data class(label='Copy of SASHELP.CLASS dataset' alter="wclass" type=data genmax=4); length sex $6.; set sashelp.class; label name="Student Name"; format sex $genderfmt.; format weight z5.2; run; /*Assigning formats and labels*/ proc format; value $genderFmt 'M'='Male' 'F'='Female' Other='Error'; run; /*Sorting dataset*/ proc sort data=class(alter=wclass) nodupkey force; by age weight height; quit; proc sql; create unique index name on class(alter=wclass); quit; proc contents data=class; run;
Observations – It is the number of observations (or rows) in the dataset. In the example of SASHELP.CLASS dataset it contains 19 observations.
Variables – The number of variables (or columns) in the dataset. In the example of SASHELP.CLASS dataset it contains five columns.
Indexes – The number of Indexes in the SAS dataset.
Observation Length – It displays the record size in bytes. The length of observation might not be the sum of the total of the variables’ lengths. There can be padding depending on the type of variables.
proc contents data=class noprint out=class_contents; run; proc print; var name varnum npos type length; run;
PROC CONTENTS shows an observation length of 40. PROC PRINT displays the internal layout of the variables within the observation where NPOS is the physical position of the first character of the variable in the data set.
Numeric variables have a default length of 8 bytes whereas the length of character variables is set in the program. If you take a look at the program, we had the length of Sex as 6 and the remaining variables have a length of 8.
The total=8+8+8+8+6=38 bytes, but the observation length is always shown as the multiples of 8 in the results so it is 40 instead of 38.
The difference is the 2 bytes of padding so that the next observation is aligned on a double-byte boundary within the disk page buffer.
Also Read: Length and Precision of SAS Variables
Deleted Observations – This attribute shows how many observations are marked for deletion in the data set. If an observation is deleted, SAS marks that observation as no longer existing—it is not actually removed from the data set at that time. If it were, the data set would be rewritten every time an observation was deleted. Instead, the data can no longer be accessed.
The count for
Deleted Observations shows a missing value if you use the COMPRESS=YES option with one or both of the REUSE=YES and POINTOBS=NO options.
If your data set has a large number of deleted observations, it uses disk space unnecessarily and SAS might take longer to access your active observations. Deleted observations must still be read from the disk and logic must be used to skip them when returning active data. To remove the deleted observations from the data set, you can use a method such as PROC COPY to copy the data set and retain other data set attributes that you need.
Use this code to determine how many deleted observations are in each data set in a library:
ods output attributes=deleted(keep=member label2 nvalue2 where=(label2 contains 'Deleted' and nvalue2 > 0) ); proc contents data=lib1._all_; run; proc print data=deleted; format nvalue2 1.; run;
Compressed – Compression can decrease the amount of disk space required to store a data set. Compressing data means that fewer reads or writes are required to get or put the data from or to the disk. indicates whether the data set is compressed.
If the data set is compressed, the output includes an additional item,
Reuse Space (with a value of YES or NO). This item indicates whether to reuse space that is made available when observations are deleted.
Sorted – This field indicates whether the data set is sorted. If you sort the data set with PROC SORT, PROC SQL, or specify sort information with the SORTEDBY= data set option, a value of YES appears here, and there is an additional section to the output.
Last Modified – The Last modification date and time of the dataset.
Protection – It shows whether the dataset is Password protected or not.
Max Generation – The Maximum number of generations the dataset holds. Read – SAS Generation Datasets
Label – The Descriptive Label of the dataset.
Data Representation – It is the format in which data is represented on computer architecture or in an operating environment. For example, on an IBM PC, character data is represented by its ASCII encoding and byte-swapped integers. Native data representation refers to an environment for which the data representation compares with the CPU that is accessing the file. For example, a file that is in Windows data representation is native to the Windows operating environment.
Encoding – It is the encoding value. Encoding is a set of characters (letters, logograms, digits, punctuation, symbols, control characters, and so on) that have been mapped to numeric values (called code points) that can be used by computers. The code points are assigned to the characters in the character set when you apply an encoding method. Knowing the encoding allows the code points to be interpreted correctly.
This part of the output lists the dataset’s variables and their attributes.
(#): The original order of the variable in the columns of the dataset. (PROC CONTENTS prints the variables in alphabetical order to name, instead of in the order that they appear in the dataset.)
- Type: Whether the variable is numeric (Num) or character (Char).
- Len: Short for “Length”; represents the width of the variable.
- Format: The format for the values of the variables when printed on the output window.
- Informat: The format of the variables used to read the data in SAS.
- Label: The name of the variable when printed in the output window. If your variables do not have labels, this column will be the same as the Variable column.
- # indicates the number of each index. The indexes are numbered sequentially as they are defined.
- Index – displays the name of each index. For simple indexes, the name of the index is the same as a variable in the data set.
- Unique Option – indicates whether the index must have unique values. If the column contains YES, the combination of values of the index variables is unique for each observation.
- Nomiss Option – If there is a NOMISS option specified while creating an index, this field will show whether the index excludes missing values for all index variables. If the column contains YES, the index does not contain observations with missing values for all index variables.
- # of Unique Values – It gives the number of unique values in the index.
- Variables names – It displays the variables in a composite index.
- Sortedby- indicates how the data are currently sorted. This field contains either the variables and options that you use in the BY statement in PROC SORT, the column name in PROC SQL, or the values that you specify in the SORTEDBY= option.
- Validated- indicates whether the data was sorted using PROC SORT or SORTEDBY. If PROC SORT or PROC SQL sorted the data set, the value is YES. If you assigned the sort indicator with the SORTEDBY= data set option, the value is NO.
- Character Set – is the character set used to sort the data. The value for this field can be ASCII, EBCDIC, or PASCII.
- Collating Sequence- is the collating sequence used to sort the data set, which can be a translation table name, an encoding value, or LINGUISTIC if the data set is sorted linguistically. This field does not appear if you do not specify a collating sequence that is different from the character set. If the data set is sorted linguistically, additional linguistic collating sequence information appears after Collating Sequence, such as the locale, collation style, and so on. For a list of the collation rules that can be specified for linguistic collation.
- Sort Option indicates whether PROC SORT used the NODUPKEY or NODUPREC option when sorting the data set. This field does not appear if you did not use this option in a PROC SORT statement (not shown).
Proc Content options
Describing a SAS dataset gives us every information of the SAS dataset. There are various options that you can use with the PROC Contents procedure to customize the information you need.
Using the _ALL_ Keyword
Instead of running PROC CONTENTS separately for each data set, we can use the _ALL_ keyword to run PROC CONTENTS once and get all the variable information for all the data sets in the library.
The below code snippet will show all the datasets that are available in the WORK library.
proc contents data=work._all_; run;
The NODS option
The NODS option (which stands for “no details”) suppresses the printing of detailed information about each file when you specify the _ALL_ option.
proc contents data=work._all_ nods; run;
Using the DIRECTORY and DETAILS Options
Until now, we have asked PROC CONTENTS for information about only one SAS data set. SAS provides the ability to display the contents information about all the files in our data library using _ALL_ in place of the data file name.
When we do so, PROC CONTENTS also provides useful information about our SAS data library. If we want this useful data library information while wanting data file information about only one data file (that is, when we don’t use the _all_ feature), we can use the DIRECTORY option.
DIRECTORY option in proc contents procedure prints a list of all SAS files that are in the specified SAS library. The DETAILS option includes information in the output about the number of rows, number of variables, and table labels.
proc contents data=macas.cars directory details; title 'Using the DIRECTORY and DETAILS Options'; run;
Below is the additional details you get using the directory option.
DETAILS | NODETAILS
These two options are used along with the DIRECTORY option. It includes information in the output about the number of rows, number of variables, and table labels.
If you do not specify the DETAILS or NODETAILS option, the default for the CONTENTS procedure is the system options setting, which is NODETAILS; for the CONTENTS statement.
The NOPRINT option is used to suppress the printing of the output.
By default, variables are listed alphabetically. The varnum option requests that SAS display the listing of the variables in the order in which they appear in the data set:
Below is an example.
proc contents data=sashelp.class varnum;
ORDER=COLLATE | CASECOLLATE | IGNORECASE | VARNUM
- COLLATE prints a list of variables in alphabetical order beginning with uppercase and then lowercase names.
- CASECOLLATE prints a list of variables in alphabetical order even if they include mixed-case names and numerics.
- IGNORECASE prints a list of variables in alphabetical order ignoring the case of the letters.
- VARNUM is the same as the VARNUM option.
SHORT option prints only the list of variable names for the table. The list will be truncated if the variables are more than 32,767 characters, and a WARNING is written to the log. The alphabetical listing – COLLATE will give you a complete list of variables.
proc contents data=class short; title 'Using the SHORT option'; run;
Using this option will print the length of the informat or format. If the length for format or informat is not specified, it will not appear in the PROC CONTENTS output unless you use the FMTLEN option.
Centiles gives you a nice overview of the distribution of your data. Print centiles information for indexed variables.
Sometimes we will want to document all of the files in our data warehouse of a certain type. For example, we might want the directory and data set information about only the files that are view.
PROC CONTENTS allows us to do so using the MEMTYPE= option. There three member types, DATA, INDEX, and VIEW. We simply set the option to a specific member type (in our example, a “VIEW”).
The syntax will look something like the following: proc contents data=mylib._all_ memtype=view;
proc contents data=work._all_ memtype=data;
Creating Output Datasets
You can create output datasets using the OUT and OUT2 options in the contents Procedure. Below is an example.
OUT=table-name names an output table. Note that OUT= does not suppress the printed output from the statement. Use the NOPRINT option to suppress the printed output.
proc contents data=sashelp.class noprint out=varnames (keep=name varnum); run;
OUT2= Use the OUT2= option if you want to print only the information about constraints and other datasets related information.
proc contents data=class noprint out2=DatasetInfo; run; proc transpose data=DatasetInfo out=DatasetInfo_Transposed(rename=(_name_=Variables)) label=Description prefix=Values; var _all_; run; proc print data=DatasetInfo_Transposed; run;
Below is the transposed output of datsetInfo.
Exporting Proc Contents output
ODS EXCEL FILE="/home/outputs/contents.xlsx" options (sheet_interval='PAGE'); ods exclude directory; ods exclude members; proc contents data=work._all_ memtype=data; run; ODS EXCEL CLOSE;