SAS has two powerful functions for dividing a string into words. Words can be characters separated by blanks or other delimiters that you specify.
SCAN and SCANQ split strings into words. Both functions are similar. However, the SCANQ function has some additional features, and there is a difference in the default delimiter used in these two functions.
This article contains the SCAN function’s primary usage and real use cases of the SCAN function.
SAS SCAN Function
SAS SCAN function extracts a specified word from a character expression, where the word is the characters separated by a set of specified delimiters.
The length of the returned variables is 200 unless previously defined.
SCAN(character-value, n-word <,’delimiter-list’>,<modifiers>)
- The n-word is the nth “word” in the string.
- An ‘n’ value greater than the number of words returns a value that contains no characters.
- For negative ‘n’ values, the character value is scanned from right to left, and a value of zero is invalid.
List of Modifiers
Below are some of the most used modifiers with the SCAN function.
👉 a adds alphabetic characters to the list of characters.
👉 b scans backward from right to left instead of from left to right, regardless of the sign of the count argument.
👉 d adds digits to the list of characters.
👉 i ignores the case of the characters.
👉 k causes all characters that are not in the list of characters to be treated as delimiters.
👉 m With the m modifier you can specify multiple consecutive delimiters, and delimiters at the beginning or end of the string argument, refer to words that have a length of zero.
👉 q modifier is used when you want to ignore delimiters that are inside substrings that are enclosed in quotation marks.
👉 t trims trailing blanks from the string and charlist arguments. If you want to remove trailing blanks from only one character argument instead of both character arguments, use the TRIM function instead of the SCAN function with the T modifier.
For the complete list of delimiters, refer to the SAS documentation.
Example: Using Scan function to find the nth word – Left to Right Scan
data _null_; text = "Kenny Green flies brown kites"; third_word = scan(text,3); put third_word=; run;
Using Scan function to find the Second Last Word – Right to Left Scan
Using the SCAN function, you also can read from right to left, effectively allowing you to capture the last word in a character string.
To tell SAS to read from right to left, change the count argument to a negative number to indicate the word number you would like to read, starting from the right and moving left.
So, to select the word “brown” in our TEXT variable, we can use a count of -2, as shown here:
data _null_; text = "Kenny Green flies brown kites"; second_last_word = scan(text,-2); put second_last_word=; run;
Alternatively, you can use the “b” modifier available with the SCAN function rather than using a negative count. By specifying the “b” argument with the SCAN function, you can tell SAS to read from right to left instead of the default left to right.
data _null_; text="Kenny Green flies brown kites"; second_last_word=scan(text,2," ","b"); put second_last_word=; run;
Points to remember while using SCAN Function:
- If the length is not defined previously, it defaults to 200 bytes for the created variable.
- A missing value is returned if there are fewer than n words in the string.
- delimiters at the beginning or end of the string argument are ignored
- Two or more contiguous delimiters are treated as a single delimiter.
- When the SCAN function is used, Any character or set of characters can serve as delimiters.
- If n is negative, SCAN selects the word in the character string starting from the string’s end.
Handling Different Word Delimiters
The default word delimiter for SCAN function is the space but the SCAN function still works even with commas as the delimiter.
data _null_; text = "Kenny,Green,flies,brown,kites"; third_word = scan(text,3); put third_word=; run;
The reason these still works is that by default, with any computer using ASCII characters, the SCAN function will automatically check for any of the following characters as delimiters:
blank ! $ % & ( ) * + , – . / ; < ^ :
When your data contains a delimiter between words not found in the default list, you can use the third argument with the SCAN function to specify your custom delimiter.
For example, if the words in your character string with a plus sign (+), you need to enclose the plus sign in quotations as the third argument to the scan function.
The syntax below demonstrates how to select the fifth word from a plus sign delimited character string:
data _null_; text = "Kenny+Green+flies+brown+kites"; fifth_word = scan(text,5,"+"); put fifth_word=; run;
How to handle multiple Delimeters?
By default, SAS will use not just one but all of the delimiters in the default list. The result might not come as expected when your data contains multiple delimiters.
So, you may also want to force SAS to use only one of the default delimiters in some cases.
In the below dataset, the names variable contains a list of first, last, and middle names.
The structure is as follows: <last name><blank><middlename><comma><firstname>. You would like to create first name from this data.
Since commas and spaces are default delimiters, if we use the SCAN function without specifying our delimiter in firstname1, SAS will consider space as the delimiter and “Slyke” will become the first word.
To correct this, we can tell SAS only to use the comma as a delimiter so that “Van Slyke” will become the last name and Andy will be the given name:
data one; input names $25.; firstname1=scan(names,2); firstname=scan(names,2,","); datalines; Van Slyke, Andy Thomas,Andres Robidoux, Billy Jo Mr. Bruce,Brenly Bob,Horner ; run;
Now that the blanks are no longer considered delimiters and only the commas are, we get the desired result in our output data with “Andy” now in the firstname variable.
Using SCAN with DO LOOPS to Parse Long Character Strings
When combined with a simple DO LOOP and a SAS, the SCAN function makes it easy to parse out each word from a character string into separate variables.
In the below example dataset, you would like to parse out each word from the letters variable into five separate variables.
data letter; input letter $50.; call symputx('count', count(letter, ",") + 1); datalines; A,B,C,D,E,F,G ; run;
A macro variable is created within the data step to hold the count of letters. This macro variable is later used to define the array elements dynamically.
The code below uses a DO LOOP to scan the letters variable and then create the variables letter1 to letter7.
data one; set letter; array letters[&count] $15 letter1-letter&count; do i=1 to dim(letters); letters[i]=scan(letter, i, ", "); end; drop i; run;
As you can see in the output data shown partially below, we now have five new MODEL variables, with one word per variable:
Separating the comma-separated list horizontally
Sometimes you may need to create separate rows from one value in a column. For example, consider a list of comma-separated email id’s in a variable. You may need to create send separate emails to each of the email ids.
The SCAN function can be used in this scenario where each of the email ID can be separated as a single row.
Below is a simple example where I have used the letter dataset to create separate letters as a row in the dataset.
data two(keep=new_letter); set letter; do i = 1 by 1 while (scan(letter,i,',') ^= ' '); new_letter = scan(letter,i,','); output; End; run;
Similar to the SCAN function, SCANQ is also used to extract a specified word from a character expression.
SCANQ(character-value, n-word <,’delimiter-list’>)
Difference between SCAN and SCANQ
Below are some of the differences between SCAN and SCANQ
- Default Delimeter set – If you omit delimiter, SCANQ uses white space characters (blank, horizontal and vertical tab, carriage return, line feed, and form feed) as the default delimiters.
- A value of 0 for the word count does not result in an error message. when you use the SCANQ
- SCANQ also ignores delimiters enclosed in quotation marks.
data _null_; text="Kenny,'Green,flies',brown,kites"; third_word1=scan(text, 3); third_word2=scan(text, 3,",","q"); third_word3=scanq(text, 3,","); put third_word1=third_word2=third_word3=; run;
third_word1=flies’ third_word2=brown third_word3=brown
If the delimiter is a comma, you have to use the delimiter argument in the SCANQ function.
To ignore the delimiter inside the quotation mark, the q argument can be used with the SCAN function.
Difference between SCAN and SUBSTR
SCAN is used to extract words from a list of words that are separated by delimiters. SUBSTR is used to extract a part of the word by specifying a starting location and the length of the part to be extracted.
When to use the SCAN function?
- Use the scan function when you know the order of the words in the character value.
- When you know that the starting position of the word varies.
- The words are separated by some delimiter.