An index in SAS is used to logically sort your data without physically sorting it. If you are sorting and then re-sorting data to accomplish merges, you may find that indexes to be useful.
Creating an Index in SAS
Indexes are created using PROC DATASETS or through PROC SQL and even they can also be created in a DATA step.
Indexing is storing the order of the data like physically sorting If there is an index for a dataset, SAS will be able to access it and allows us to use the data set with the appropriate BY statement without sorting the data.
SAS Indexes are stored in a separate file, and the size of this file can be substantial large especially as the number of indexes, observations, and variables used to form the indexes are more. However, Indexes can substantially speed up processes.
Creating an index using proc datasets
PROC DATASETS LIBRARY=libref; MODIFY SAS-data-set; INDEX CREATE varlist / UNIQUE NOMISS UPDATECENTILES = ALWAYS | NEVER | integer;
- libref is the SAS data library that is to be modified
INDEX CREATEstatement is the list of variables for which you want to create an index.
UNIQUEoption specifies that key variable values must be unique within the SAS data set.
NOMISSoption specifies that no index entries are to be built for observations with missing key variable values.
UPDATECENTILESoption allows you to override when SAS updates the index’s centiles.
The UNIQUE, NOMISS, and UPDATECENTILES options are optional.
There are two types of Indexes- Simple and Composite.
Simple index consistes of one variable whereas composite index contains multiple variables.
data subjects; length subject_name $3; do subject_id=100 to 100000; do j=1 to 3; substr(subject_name, j)=byte(int(65+26*ranuni(0))); end; output; end; run; /*Creating single index*/ proc datasets library=work; modify subjects; index create subject_id / unique; run; /*Creating multiple Index*/ proc datasets library=work; modify subjects; index create idname=(subject_id subject_name) / nomiss; run;
Using PROC SQL
CREATE INDEX index-name ON data-set-name(varlist)
Note that the UNIQUE, NOMISS, and UPDATECENTILES options are not available while creating an index using PROC SQL.
/*Single Index*/ proc sql; create unique index subject_id on work.subjects; quit; /*Multiple Index*/ proc sql; create index idname on work.subject(subject_id subject_name); quit;
DATA data-set-name(INDEX=(varlist / <UNIQUE><NOMISS><UPDATECENTILES = | ALWYAYS | NEVER integer> ));
- Data-set-name is the name of the new SAS data set name. Varlist is the name of the key variable.
- The UNIQUE, NOMISS, and UPDATECENTILES options can be used while creating an index in the data step.
/*Single Index*/ data SubjectIndexed(index=(subject_id / unique)); set work.subject; /*SAS Statements */ run; /*Multiple Index*/ data SubjectIndexed(index=(idname=(subject_id subject_name) / nomiss)); set work.subjects; /* SAS Statements…*/ run;
Using the index in SAS
Once, an Index is created by any of the above methods, you will want to use them. You can use the index in any of the four places described below.
- WHERE statement in a DATA or PROC step
- BY statement in a DATA or PROC step
- KEY option on a SET or MODIFY statement
Using Index in a WHERE statement
The WHERE statement can be used in DATA and PROC steps as in the below example.
data subject1; set subjects; where subject_id eq 5678; run;
Using the BY Statement
Using an index in a BY statement is simple. When the BY variable is an index, the index is automatically used, and the data does not need to be sorted.
data class(index=(age)); set sashelp.class; run; data class2; set class; by age; if first.age; run;
Using the KEY= Option
KEY= option can be used to look up a value when an index exists on only the data set that contains the values to be looked up. The KEY= option on the SET statement identifies an index that is to be used.
Please refer to the article below which has an example of using the KEY= option for performing table lookup.
More on Indexes
It is not necessary that SAS will use an existing index just because Index has been created even when using a WHERE or BY statement.
SAS first analyzes if using an index will be more efficient than reading the entire data set sequentially.
There are a lot of factors which are considered before using an index. Some of these are dataset size, the availability of index and centile information.
Hence, If SAS predicts that it will be more efficient to use a specific index to return observations than to read the entire data set, then it will use that index. otherwise, it will read the entire data set, sequentially, to fetch the observations.
Mostly SAS makes a good decision to use an index or not. However, its internal algorithms can sometimes make a decision to not use Index considering the resources consumed when reading a large subset of data via an index are greater than reading the entire SAS data set.
Override the SAS System decision about whether to use an index.
The IDXNAME= and IDXWHERE= dataset option are used to override SAS’ default index usage.
IDXNAME – Using the IDXNAME dataset option will prevent SAS from an attempt to determine if the specified index is the best one or if a sequential search might be more resource-efficient.
IDXWHERE= This option enables you to override the SAS System decision about whether to use an index.
You can use any one of the above but not both.
data subject1; set subjects (idxname=idname); where idname < 100; run;
Preventing SAS from using an Index
To prevent SAS from using an index, you can use the IDXWHERE=NO option
data subject1; set subjects (idxwhere=NO); where idname < 100; run;
SAS will automatically use an index when you specify the KEY option on either a SET statement or a MODIFY statement.
How to find if SAS is using the index
Using the below system option, SAS will print the Index used for any operation.
Message in the LOG
INFO: Index idname selected for WHERE clause optimization.
Check for the presence of Index in SAS
Index of datasets information is available in DICTIONARY.INDEX and SASHELP.VINDEX from which you can view to get all the useful information about indexes.
Alternatively, you can also use the proc content procedure to check if an index exists for a dataset.
PROC SQL outobs=5; SELECT * FROM dictionary.indexes; QUIT; DATA TEST; SET SASHELP.VINDEX; run;
Dropping an Index in SAS
To drop an in index specify the name of the index after the DROP statement as below.
proc sql; drop index idanme from students; quit;
Important Points on Index in SAS
- An index cannot be created on a SAS view.
- Since resources are required to create indexes and these resources should be taken into consideration. It is good to keep the number of indexes to a minimum to reduce disk storage and update costs.
- It is not beneficial to create an index for small tables since sequential access is faster on small tables.
- Avoid creating an index on variables which have low cardinality or a small number of distinct values. For, example Gender will have two values of Male and Female.
- Use indexes for queries that retrieve a relatively small subset of rows—that is, less than 15%.
- Do not create more than one index that is based on the same column as the primary key.