SAS has two powerful functions for dividing a string into words. Words can be characters separated by blanks or other delimiters that you specify.
SCAN and SCANQ split strings into words. Both functions are similar. However, the SCANQ function has some additional features. There is a difference in the default delimiter used in these two functions.
This article contains the SCAN function’s primary usage and actual use cases of the SCAN function.
SAS SCAN Function
THE SAS SCAN function extracts a specified word from a character expression. The word is the characters separated by a set of specified delimiters.
The length of the returned variables is 200 unless previously defined.
Syntax of the SCAN Function:
SCAN(character-value, n-word <,'delimiter-list'>,<modifiers>)
- The n-word is the nth “word” in the string.
- An ‘n’ value greater than the number of words returns a value with no characters.
- For negative ‘n’ values, the character value is scanned from right to left, and a value of zero is invalid.
List of Modifiers for the SCAN Function
The below table contains the list of modifiers that can be used with the SCAN function.
|a||adds alphabetic characters to the list of characters.|
|b||scans backward from right to left instead of left to right, regardless of the sign of the count argument.|
|c||adds control characters to the list of characters.|
|d||adds digits to the list of characters.|
|f||adds an underscore and English letters to the list of characters.|
|g||adds graphic characters to the list of characters. Graphic characters are characters that, when printed, produce an image on paper.|
|h||adds a horizontal tab to the list of characters.|
|i||ignores the case of the characters.|
|k||causes all characters not in the list of characters to be treated as delimiters.|
|l||adds lowercase letters to the list of characters.|
|m||Multiple consecutive delimiters can be specified with the m modifier, and delimiters at the start or end of the string argument refer to zero-length words.|
|n||adds digits, an underscore, and English letters to the list of characters.|
|o||processes the charlist and modifier arguments only once, rather than every time the SCAN function is called.|
|p||adds punctuation marks to the list of characters.|
|q||q modifier is used when you want to ignore delimiters inside substrings enclosed in quotation marks.|
|r||removes leading and trailing blanks from the word that SCAN returns.|
|s||adds space characters to the list of characters (blank, horizontal tab, vertical tab, carriage return, line feed, and form feed).|
|t||trims trailing blanks from the string and charlist arguments.|
|u||adds uppercase letters to the list of characters.|
|w||adds printable (writable) characters to the list of characters.|
|x||adds hexadecimal characters to the list of characters.|
When using a modifier with the SCAN function, specify the modifier as the fourth argument and explicitly specify the delimiter as the third argument. SAS will not use your modifier as the delimiter!
Using the Scan function to find the nth word – Left to Right Scan.
data _null_; text = "Kenny Green flies brown kites"; third_word = scan(text,3); put third_word=; run;
Using the Scan function to find the Second Last Word – Right to Left Scan
Using the SCAN function, you also can read from right to left, effectively allowing you to capture the last word in a character string.
To tell SAS to read from right to left, change the count argument to a negative number to indicate the word number you would like to read, starting from the right and moving left.
So, to select the word “brown” in our TEXT variable, we can use a count of -2, as shown here:
data _null_; text = "Kenny Green flies brown kites"; second_last_word = scan(text,-2); put second_last_word=; run;
Alternatively, you can use the “b” modifier available with the SCAN function rather than using a negative count. By specifying the “b” argument with the SCAN function, you can tell SAS to read from right to left instead of the default left to right.
data _null_; text="Kenny Green flies brown kites"; second_last_word=scan(text,2," ","b"); put second_last_word=; run;
Points to remember while using the SCAN Function:
- If the length is not defined previously, it defaults to 200 bytes for the created variable.
- A missing value is returned if fewer than n words are in the string.
- delimiters at the beginning or end of the string argument are ignored
- Two or more delimiters that are contiguous are handled as a single delimiter.
- When the SCAN function is used, Any character or set of characters can serve as delimiters.
- If n is negative, SCAN selects the word in the character string starting from the string’s end.
Handling Different Word Delimiters in the SCAN Function
The default word delimiter for the SCAN function is the space, but the SCAN function still works even with commas as the delimiter.
data _null_; text = "Kenny,Green,flies,brown,kites"; third_word = scan(text,3); put third_word=; run;
The SCAN function will automatically check for any of the following characters as delimiters:
blank ! $ % & ( ) * + , – . / ; < ^ :
You can use the third argument with the SCAN function to specify a custom delimiter when your data contains a delimiter between words that aren’t listed in the default list.
For example, if the words in your character string with a plus sign (+), you need to enclose the plus sign in quotations as the third argument to the scan function.
The syntax below demonstrates how to select the fifth word from a plus sign delimited character string:
data _null_; text = "Kenny+Green+flies+brown+kites"; fifth_word = scan(text,5,"+"); put fifth_word=; run;
How to handle multiple Delimeters in the SCAN Function?
SAS will use not just one but all delimiters in the default list by default. Therefore, the result might not be expected when your data contains multiple delimiters.
So, you may also want to force SAS to use only one of the default delimiters in some cases. In the below dataset, the names variable contains a list of first, last, and middle names.
The structure is as follows: <last name><blank><middlename><comma><firstname>. You would like to create a first name from this data.
Because commas and spaces are default delimiters, if we run the SCAN function without specifying a delimiter in firstname1, SAS will use “Slyke” as the first word.
To correct this, we can tell SAS only to use the comma as a delimiter so that “Van Slyke” will become the last name and Andy will be the given name:
data one; input names $25.; firstname1=scan(names,2); firstname=scan(names,2,","); datalines; Van Slyke, Andy Thomas,Andres Robidoux, Billy Jo Mr. Bruce,Brenly Bob,Horner ; run;
We receive the desired result in our output data with “Andy” in the firstname variable now that the blanks are no longer regarded delimiters and just the commas are.
Using SCAN with DO LOOPS to Parse Long Character Strings
When combined with a simple DO LOOP and SAS, the SCAN function makes it easy to parse each word from a character string into separate variables.
In the below example dataset, you would like to parse out each word from the letters variable into five separate variables.
data letter; input letter $50.; call symputx('count', count(letter, ",") + 1); datalines; A,B,C,D,E,F,G ; run;
A macro variable is created within the data step to hold the count of letters. This macro variable is later used to define the array elements dynamically.
The code below uses a DO LOOP to scan the letters variable and then create the variables letter1 to letter7.
data one; set letter; array letters[&count] $15 letter1-letter&count; do i=1 to dim(letters); letters[i]=scan(letter, i, ", "); end; drop i; run;
As you can see in the output data shown partially below, we now have five new MODEL variables, with one word per variable:
Separating the comma-separated list horizontally using SCAN Function
Sometimes you may need to create separate rows from one value in a column. For example, consider a single variable holding comma-separated email ids.
In this scenario, you can use the SCAN function, where each email ID can be separated as a single row.
data email_list; infile datalines truncover; input emails $100.; datalines; [email protected],[email protected],[email protected],[email protected],[email protected] ; run;
data email_list2(keep=new_emails); set email_list; do i=1 by 1 while (scan(emails, i, ',') ^=' '); new_emails=scan(emails, i, ','); output; End; run;
SCANQ is also used to extract a specified word from a character expression like the SCAN function.
SCANQ(character-value, n-word <,'delimiter-list'>)
Differences between SCAN and SCANQ
Below are some of the differences between SCAN and SCANQ
- Default Delimeter set – If you don’t specify a delimiter, SCANQ will utilise white space characters as default delimiters (blank, horizontal and vertical tab, carriage return, line feed, and form feed).
- A value of 0 for the word count does not result in an error message. when you use the SCANQ
- SCANQ also ignores delimiters enclosed in quotation marks.
data _null_; text="Kenny,'Green,flies',brown,kites"; third_word1=scan(text, 3); third_word2=scan(text, 3,",","q"); third_word3=scanq(text, 3,","); put third_word1=third_word2=third_word3=; run;
third_word1=flies' third_word2=brown third_word3=brown
If the delimiter is a comma, you have to use the delimiter argument in the SCANQ function.
The q argument can be used with the SCAN function to ignore the delimiter inside the quotation mark.
Difference between SCAN and SUBSTR Function
SCAN is used to extract words from a list of words that are separated by delimiters. SUBSTR is used to extract a part of the word by specifying a starting location and the part’s length.
When to use the SCAN function?
Use the scan function when you know the order of the words in the character value.
When you know that the starting position of the word varies.
Some delimiter separates the words.
So, this was our guide on SCAN function. We hope that you must have found it helpful. Do you have any tips to add? Let us know in the comments.
Thanks for reading!