Objectives of
this Chapter:
- Learn SAS commonly used functions for
data processing.
- Learn how to MERGE datasets.
- Learn conditional processing using
WHERE, IF/ELSE and DO-END
A SAS function performs a
computation or manipulation on variables (arguments) and returns a
value. Most functions use arguments supplied by the user. SAS functions
are mainly used in DATA step programming statements.
data function_test;
Max_var= max(100,101);
Length_var = length(Max_var);
This_month = month(today());
run;
proc print;run;
The syntax of a function is :
Function-name (argument-1, . .
.,argument-n>)
In the above example, we have shown how MAX, LENGTH and MONTH functions
are used. The table below has some frequently used functions in our
programs.
|
Function Name
|
Description
|
|
LENGTH
|
Returns the length of an argument
|
|
LOWCASE
|
Converts all letters in an argument
to lowercase
|
|
SCAN
|
Selects a given word from a
character expression
|
|
SUBSTR
(right of =)
|
Extracts a substring from an
argument
|
|
UPCASE
|
Converts all letters in an argument
to uppercase
|
|
DATEPART
|
Extracts the date from a SAS
datetime value
|
|
DAY
|
Returns the day of the month from a
SAS date value
|
|
INTCK
|
Returns the integer number of time
intervals in a given time span
|
|
INTNX
|
Advances a date, time, or datetime
value by a given interval, and returns a date, time, or datetime value
|
|
MONTH
|
Returns the month from a SAS date
value
|
|
QTR
|
Returns the quarter of the year from
a SAS date value
|
|
TODAY
|
Returns the current date as a SAS
date value
|
|
YEAR
|
Returns the year from a SAS date
value
|
|
MAX
|
Returns the largest value
|
|
MEAN
|
Returns the arithmetic mean
(average)
|
|
MIN
|
Returns the smallest value
|
|
SUM
|
Returns the sum of the nonmissing
arguments
|
|
CALL
SYMPUT
|
Assigns DATA step information to a
macro variable
|
|
LOG
|
Returns the natural (base e)
logarithm
|
|
MOD
|
Returns the remainder value
|
|
SQRT
|
Returns the square root of a value
|
|
RANUNI
|
Returns a random variate from a
uniform distribution
|
|
INPUT
|
Returns the value produced when a
SAS expression that uses a specified informat expression is read
|
|
LAG
|
Returns values from a queue
|
|
PUT
|
Returns a value using a specified
format
|
|
ZIPSTATE
|
Converts ZIP codes to state postal
codes
|
|
CEIL
|
Returns the smallest integer that is
greater than or equal to the argument
|
|
FLOOR
|
Returns the largest integer that is
less than or equal to the argument
|
|
ROUND
|
Rounds to the nearest round-off unit
|
|
TRUNC
|
Truncates a numeric value to a
specified length
|
|
VLABEL
|
Returns the label that is associated
with the specified variable
|
|
VNAME
|
Returns the name of the specified
variable
|
|
VTYPE
|
Returns the type (character or
numeric) of the specified variable
|
MERGE
–Combining datasets
One of the features
frequently used while SAS programming is combining one or many
datasets. For example to combine the application data with performance
data we use MERGE utility in SAS.
Most of the time a match
merging is done on SAS datasets. For example, if we are interested in
the performance of the accounts that are opened only in Jan 2004, we
try to match only those accounts while doing the performance data
merging. In consumer Finance, account number or Account Key is usually
used for match merging.
Let us look at the
following examples. We have two record sets that have information about
credit lines of the accounts. Intial_cl holds data for all initial
credit lines assigned and new_cl has all increased credit lines. Now we
need to combine these into one dataset so that it will show the latest
credit lines for each account.
DATA intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
;
DATA credit_limit;
MERGE intial_cl new_cl;
BY account;
RUN;
PROC PRINT;
RUN;
Output looks like as follows:
credit_
Obs
account limit
_______________________________________________________
1
1002 3000
2
1003 4000
3
1004 5000
So a simple merge statement
combined the datasets and the BY statement made sure that its
updated with the latest information.
What if we put ‘new_cl’
data set first in the MERGE statement? Note that the data overwritten
on the first dataset.
Let us introduce some more
complexity to reflect the way we use MERGE .
DATA credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF a and b;
RUN;
The above data step introduces a conditional processing using IF . Also
a handle to the dataset is declared after the dataset name (in=a, in=b)
. Using these handles we can combine the datasets conditionally.
‘IF a and b’ combines the
dataset if the BY statement finds a match in the second dataset. Only
those records with a match are combined. This is different from the
first example where all records are output into the resultant dataset.
Output looks like as follows
Account Sample
credit_
Obs
account limit
____________________________________________
1
1002 3000
2
1004 5000
There are various combinations that suits to various requirements
If a ; = combines the data with data updated for all accounts in the
first dataset
If b; = combines the data with data updated for all accounts in the
second data set
If a or b; combines both the datasets with data updated into the first
dataset specified (Outer join)
Also it is possible to merge more than two data sets at a time. PROC
SQL is also used to merge datasets. Please go through the Appendix-I
for more details.
Conditional
Processing with WHERE , IF-ELSE , DO-END
When working with data we
come across situations where data need to be filtered or we may have to
apply several conditions before we have the final data for analysis or
modeling. For example, When we want to report out the number of active
accounts and total $ outstanding, we need to subset the entire data so
that only accounts that are active are included. Suppose the
active definition says “ An active account is an account where
current $ outstanding is greater than 100 and Status_code is not
Z”. Let us look at an example now.
DATA account_perf;
INPUT account current_os status_code $;
cards;
1002 300 A
1003 20 A
1004 1200 .
1005 800 Z
1006 450 D
1007 560 Z
1008 450 A
1009 900 C
1110 300 C
;run;
Data perf;
set account_perf;
where current_os >100 and status_code ne 'Z';
run;
Here dataset ‘perf’ is a subset of account_perf. Conditions
behind creating this subset were the activity definitions we mentioned
before. Note that WHERE statement is used to evaluate two variables
with and ‘and’ condition. The same way we can use OR also.
IF can also be used
instead of WHERE . (You can use IF current_os >100 and status_code
ne 'Z';). But WHERE is more efficient when we do the sub-setting.
IF ELSE conditions are
commonly used to flag accounts or create new variables based on certain
conditions. Suppose we need to flag the accounts into Good and Bad
based on certain conditions, IF/ELSE can be used.
The example below shows the
usage of IF/ELSE
Data perf;
set account_perf;
length status $20.;
if status_code eq 'Z' then status= "Bad-Charged Off";
else if status_code eq 'C' then status = "Cancelled";
else Status= "Good Account";
run;
Dataset used is same as the
previous example. Here we are categorizing the data into three
segments. Bad Charged off, Cancelled and Good Accounts. A new variable
(status) is created based on the condition.
current_ status_
Obs account
os
code status
1
1002
300
A Good Account
2
1003
20
A Good Account
3
1004
1200
Good Account
4
1005
800
Z Bad-Charged Off
5 1006
450
D Good Account
6
1007
560
Z Bad-Charged Off
7
1008
450
A Good Account
8
1009
900
C Cancelled
9
1110
300
C Cancelled
When there is multiple
processing done conditionally, then DO-END loop is used. For example we
want to categorize the good and bad accounts and also want to compute
the write-off amount based on a same condition ie Status_code =Z, this
can be achieved using a DO-END loop. Let us have a look at the example
below.
Data perf;
set account_perf;
length status $20.;
if status_code eq 'Z' then do ;
status= "Bad-Charged Off";
wo_amount = current_os ;
end;
else do;
Status= "Good Account";
wo_amount = 0 ;
end;
run;
proc print
;run;
The output looks like the
following:
current_ status_
Obs
account
os
code
status wo_amount
1
1002
300
A Good
Account
0
2
1003
20
A Good
Account
0
3
1004
1200
Good
Account
0
4
1005
800
Z Bad-Charged
Off 800
5
1006
450
D Good Account
0
6
1007
560
Z Bad-Charged
Off 560
7
1008
450
A Good
Account
0
8
1009
900
C Good
Account
0
9
1110
300
C Good
Account
0
Now let us summarize:
WHERE, IF-ELSE, DO-END are used for conditional processing in SAS data
steps. WHERE and IF conditions are also used in many SAS PROCs to
subset data and filter out unwanted data.
Exercises:
Create two data sets as showed below and try out the following
combination of merging. Write down your observations on how each
merging worked.
DATA
intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA
new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
1005 2500
;
DATA
credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF a ;
RUN;
PROC PRINT;
RUN;
DATA
credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF b ;
RUN;
PROC PRINT;
RUN;
DATA
credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF a or b ;
RUN;
PROC PRINT;
RUN;
DATA
credit_limit;
MERGE intial_cl(in=a) new_cl(in=b) ;
BY account;
IF a=b ;
RUN;
PROC PRINT;
RUN;