Objectives of
this Chapter:
- Learn how to use KEEP and DROP
Statements
- Learn ARRAY Variables
- Learn how to LABEL and FORMAT data.
KEEP and
DROP statements.
KEEP and DROP statements
are used often to control the number of variables (fields) read into
and output into the datasets. During the data processing we
create several variables but need to save only select ones in the final
dataset.
If you want to restrict the
number of columns in output data set, use the following method. This
will ensure that output dataset is created with required variables
only.
Data target_data (keep =
var1 var2 var3 etc);
Set base_data;
Run;
Alternately you can
specify the first statement as follows:
Data target_data ;
Set base_data;
keep = var1 var2 var3 ;
Run;
If you are reading a big
dataset into SAS and require only a few variables from it, use
the following statements in the program.
Data target_data ;
Set base_data((keep = var1 var2 var3 etc);
Run;
In the first case, SAS reads the entire data set base_data, even though
you only intend to use three variables. In the second case, SAS reads
from disk only the three variables you intend to keep. Please note that
we have to use such efficient methods to restrict the data read into
the system to optimize the system resources such as SASWORK
and shared drives.
The same way DROP
statement can also be specified based on the data requirement of the
user.
Unlike other programming
languages, SAS array variable is a set of similar variables grouped
together with a name in an ARRAY statement. We use
arrays very often when we work with datasets that are arranged as a
Time Series. Let us look at simple program to understand.
Data array_test;
Set weight_data;
array weight{50} wt1-wt50;
do i=1 to 50;
if weight{i}=999 then weight{i}= .;
end;
run;
As we have seen earlier one of the basic use of array is to group
similar variables . In the above example ‘weight_data has weight
measured at fifty different growth stages and wherever data is missing
that data point is updated with 999. Assume it’s a very large data set
and we want to replace all 999 with a ‘.’ for missing. If we are to do
this data step, we will have to write 50 statements ( if wt1 =999 then
wt1=. ; like that for each column) . Using array variable we can
simplify this.
Line 3: Here ‘array’
is the key word that tells SAS the following name (weight) is an array.
The closed bracket {50} specifies the number of elements in the array
followed by the values of array elements*wt1, wt2, wt3….wt50). For
example weight{40} = ‘wt40’
Line 4: Do loop 1 to
50 to hold the record to scan through 50 different fields
specified.
Line 5: Remember
weight{1} will have a value wt1 and that field will be evaluated in the
following IF condition to check whether it has a value of 999.
Similarly when i=2, weight {2} will have a value of wt2. Like this 50
times the loop is executed for each row of data and the update is made
for fifty fields (columns)
Line:6 Ending the loop .
End of the execution, all
the fields are evaluated and updated with ‘.’ for 999.
Now think about
substituting fields with missing. Values with some other value.
Time Series Example
Now let us look at an
example to know the way array is used in our environment (This program
does not run as it requires some datasets)
Data Balance_due;
set TS_49;
array st_month {6}
stmonth15 stmonth25 stmonth35 stmonth45 stmonth55;
array bal_due {6} baldue15 baldue25 baldue35 baldue45
baldue55 ;
do i=1 to 5;
Statement_month =st_month (i);
balance_due = bal_due(i);
output;
end;
run;
Proc summary
data = Balance_due;
class Statement_month;
var balance_due ;
run;
The above program reads a
dataset (TS_49) where the variables are arranged in a time series
format. (For example, baldue15 stands for balance due for the month of
Jan 2005 and so on) The user wants to see the balance due for all
accounts by statement month for 5 specific months , using
SAS ARRAY (s)he could achieve that easily. Further, (s)he could
use PROC summary to do the summarization as data is now in a format
that PROC Summary understands.
No let us look at the ARRAY statement.
array st_month {5}
stmonth15 stmonth25 stmonth35 stmonth45 stmonth55
Here ‘array’ is the key
word that tells SAS the following name (st_month) is an array. The
closed bracket (5) specifies the number of elements in the array
followed by the values of array elements. For example st_month(5) =
‘stmonth55’
In other words, ARRAY help
us to group a set of variables so that programming could be made short
and flexible. If we need to do the same action repetitively on same
group of variables, declaring an array would solve the same.
A SAS Label describes a
variable. When labels are assigned in the data step they are available
for all procedures that use that data set. When we produce reports we
could print labels instead of variable names.
Let us modify our earlier example:
Data var_test;
length id $ 10;
length NProducts 4;
id ='JK';
NProducts= 6;
pro_price = 4.55;
tot_cost = NProducts*pro_price;
final_price = tot_cost;
LABEL id
="Vendor Name"
NProducts
="Number of Products"
pro_price = "Product Price"
tot_cost ="Total Cost"
final_price ="Final Price"
;
run;
proc contents
data =var_test;run;
Let us look at the output
of PROC Contents:
-----Alphabetic List of Variables and Attributes-----
#
Variable Type
Len Pos Label
-------------------------------------------------------------------
2 NProducts
Num
4
24 Number of Products
5 final_price
Num
8
16 Final Price
1 id Char 10 28 Vendor Name
3
pro_price
Num
8
0 Product Price
4
tot_cost
Num
8
8 Total Cost
The PROC output now show
‘Label’, which gives more information about the variable.
When you use PROC PRINT or some other Procedures you can print the
labels instead of variable names. For example the code below will print
labels for the variables in var_test.
Proc print data = var_test label;
run;
Additionally, the labels can be created with PROC FORMAT procedure,
which is not discussed in this guide.
A format
is an instruction that SAS uses to write data values. You use formats
to control the written appearance of data values. Note that a format
does not change the original value of a variable. Since formats are
primarily used to format output, we will look at how we can use
existing SAS internal formats using the FORMAT statement in PROCs and
Data steps.
The following example shows how to use a format statement in a data
step.
data format_test;
num_test = 1250;
today =today();
dollar_amt =13400.5;
Name = 'J KURIAN';
format num_test words40. name $reverj7. dollar_amt
dollar10.2 today monyy7.;
;
run;
proc print
data = format_test;run;
The above example
formats four variables – num_test with words40., name with reverj7.,
dollar_amt with dollar10.2 and today with monyy7. formats. Please have
a look at the following table to understand what each format does to
the display of the variable. The below list is not exhaustive- Do
consult the SAS manual for a complete list of formats.
Syntax for the format statement is:
Format <variable name> <format name> ;
There are three categories
of formats. Character, date and time and numeric. Lists of frequently
used formats are provided below.
Category |
Format
|
Description
|
Character
|
$CHARw.
|
Writes standard
character data
|
|
|
$QUOTEw.
|
Writes data values
that are enclosed in double quotation marks
|
|
|
$REVERJw.
|
Writes character data
in reverse order and preserves blanks
|
|
|
$UPCASEw.
|
Converts character
data to uppercase
|
|
|
$w.
|
Writes standard
character data
|
|
Date and Time
|
DATEw.
|
Writes date values in
the form ddmmmyy or ddmmmyyyy
|
|
|
DATETIMEw.d
|
Writes datetime
values in the form ddmmmyy:hh:mm:ss.ss
|
|
|
DAYw.
|
Writes date values as
the day of the month
|
|
|
DDMMYYw.
|
Writes date values in
the form ddmmyy or ddmmyyyy
|
|
|
DDMMYYxw.
|
Writes date values in
the form ddmmyy or ddmmyyyy with a
specified separator
|
|
|
MMYYxw.
|
Writes date values as
the month and the year and separates them with a character
|
|
|
MONNAMEw.
|
Writes date values as
the name of the month
|
|
|
MONTHw.
|
Writes date values as
the month of the year
|
|
|
MONYYw.
|
Writes date values as
the month and the year in the form mmmyy or mmmyyyy
|
|
|
QTRw.
|
Writes date values as
the quarter of the year
|
|
|
QTRRw.
|
Writes date values as
the quarter of the year in Roman numerals
|
|
|
WEEKDATEw.
|
Writes date values as
the day of the week and the date in the form day-of-week,
month-name dd, yy
(or yyyy)
|
|
|
WEEKDAYw.
|
Writes date values as
the day of the week
|
|
|
WORDDATEw.
|
Writes date values as
the name of the month, the day, and the year in the form month-name
dd, yyyy
|
|
|
YEARw.
|
Writes date values as
the year
|
|
|
YYMMxw.
|
Writes date values as
the year and month and separates them with a character
|
|
Numeric
|
BESTw.
|
SAS chooses the best
notation
|
|
|
COMMAw.d
|
Writes numeric values
with commas and decimal points
|
|
|
COMMAXw.d
|
Writes numeric values
with periods and commas
|
|
|
DOLLARw.d
|
Writes numeric values
with dollar signs, commas, and decimal points
|
|
|
FLOATw.d
|
Generates a native
single-precision, floating-point value by multiplying a number by 10
raised to the dth power
|
|
|
NUMXw.d
|
Writes numeric values
with a comma in place of the decimal point
|
|
|
SSNw.
|
Writes Social
Security numbers
|
|
|
w.d
|
Writes standard
numeric data one digit per byte
|
|
|
WORDFw.
|
Writes numeric values
as words with fractions that are shown numerically
|
|
|
WORDSw.
|
Writes numeric values
as words
|