Structured Query Language
(SQL) is a standardized, widely used language that retrieves and
updates data in relational tables and databases.
The SQL procedure is SAS’
implementation of Structured Query Language. PROC SQL is part of Base
SAS software, and you can use it with any SAS data set (table). Often,
PROC SQL can be an alternative to other SAS procedures or the DATA
step. You can use SAS language elements such as global statements, data
set options, functions, informats, and formats with PROC SQL just as
you can with other SAS procedures.
PROC SQL is used in
analytics to :
- Retrieve data from database tables or
views (Oracle or SQL Server)
- Combine SAS datasets from tables or
views (MERGE)
- Create datasets and indexes
- Compute statistics and Generate reports
We
often use PROC SQL to extract data from various warehouses. Below is an
example reproduced from a previous chapter.
PROC SQL;
Connect To ORACLE(User=901115644
Password=ypasswd Buffsize=10000 Path=BDCIT1
Preserve_Comments );
CREATE TABLE acct_status AS SELECT *
FROM Connection To ORACLE
(SELECT current_account_nbr AS
account_number, external_status_reason_code AS
ext_rcode,external_status AS estatus,
billing_cycle_day AS billing_cycle_day
FROM
ACCOUNT_DIM
WHERE CLIENT_ID='BROOK BROS'
AND
nvl(EXTERNAL_STATUS_REASON_CODE,'0')
<>'98');
Disconnect From ORACLE;
Quit;
In the above example PROC
SQL use a connect string “Connect To ORACLE(User=901115644
Password=ypasswd Buffsize=10000 Path=BDCIT1 Preserve_Comments )” to
identify the oracle database (BDCITI) and use the user name and
passwords specified to read data from it. Further, CREATE
TABLE statement creates a SAS Dataset from the output of SELECT
statement.
When querying a data
warehouse, PROC SQL automatically converts the field formats in to SAS
formats. For example, a date field in Oracle will be converted into SAS
date and an Oracle Varchar field will be converted into character.
PROC SQL can perform some
of the operations that are provided by the DATA step and the PRINT,
SORT, and SUMMARY procedures.
Let us create a dataset and
then we will see how PROC SQL works like a DATA step.
DATA account_perf;
INPUT client $ account fico_seg $
current_os tot_payment ;
cards;
Cmart 1002 401-500 300 100
Cmart 1003 501-600 200 150
Cmart 1004 601-700 1200 180
Cmart 1005 701-800 800 190
Cmart 1006 801-900 450 200
GIA 1007 401-500
560 210
GIA 1008 501-600
450 180
GIA 1009 601-700
900 145
GIA 1110 701-800
300 148
;
run;
PROC SQL;
title "Summary of O/S and Payments by
Client";
select client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
from account_perf
group by client
order by tot_os descending ;
quit;
The above program block
shows how a PROC SQL is substituting a PROC PRINT, PROC SORT and PROC
SUMMARY.
Line 2: Assigns a title to
the output of SELECT statement that follows
Line 3: Select statement
with group function SUM used to summarize the data. GROUP BY clause is
used to compute the SUM for each distinct group in the database. ORDER
BY is used to sort the output and DESCENDING keyword is to control the
sort order. It is also possible to SORT by multiple variables.
A PROC SQL statement ends
with ‘QUIT\;’ and it terminates the procedure. Output is always
printed to the screen (like PROC PRINT) and it’s also possible to
create a SAS dataset from the output. To create a dataset from the
output the above program can be modified as follows:
PROC SQL;
title "Summary of O/S and Payments by
Client";
create table summary as select client,
sum(current_os) as tot_os , sum(tot_payment)as tot_payment
from account_perf
group by client
order by tot_os descending ;
quit;
Line 3: Note the CREATE
TABLE <table name> AS statement.
We have seen above how
SELECT statements are used in PROC SQL. Most of data retrieval and data
combining are done using select statement. We will see some sample
select statements and how it is used conditionally to work with data.
In the example above the simple SELECT statement is shown below.
SELECT client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
FROM account_perf
The SELECT statement must
contain a SELECT clause and a FROM clause, both of which are required
in a PROC SQL query. Other clauses added to SLECT statements to
restrict the data retrieval or conditional processing . Those clauses
are WHERE, ORDER BY, GROUP BY and HAVING.
The WHERE clause restrict
the data that you retrieve by specifying a condition that each row of
the table must satisfy. In our example below, clients are restricted to
GIA and Cmart only.
SELECT client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
FROM account_perf
WHERE client in ('GIA', 'CMART')
ORDER BY sorts the output
in ascending or descending order as specified. In our example
below total outstanding in sorted in a descending order.
SELECT client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
FROM account_perf
WHERE client in ('GIA', 'CMART')
ORDER BY tot_os descending ;
GROUP BY computes the
statistics for each category of values in the specified variable. A
summary or group function like average or SUM in SELECT statement is
followed by GROUP BY clause to instruct SAS that the statistics should
be computed for each group of data. Let us look at our modified example
to see how total outstanding and total payments are computed client
wise.
PROC SQL;
SELECT client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
FROM account_perf
GROUP BY client;
quit;
The HAVING clause works
with the GROUP BY clause to restrict the groups in a query’s results
based on a given condition. PROC SQL applies the HAVING condition after
grouping the data and applying aggregate functions. For example, the
following query restricts the groups to include only the client GIA.
PROC SQL;
SELECT client, sum(current_os) as tot_os
, sum(tot_payment)as tot_payment
FROM account_perf
GROUP BY client
HAVING Client='GIA';
quit;
Using the dataset in above
example, we will demonstrate how to retrieve data from a single table
and how to create SAS datasets from resultant output.
I. SELECT – All columns in a Table
PROC SQL;
SELECT * FROM account_perf;
quit;
II. SELECT- Specific columns in a Table
PROC SQL;
SELECT client,current_os
FROM account_perf;
quit;
III. SELECT-How to create dataset from
SELECT statements
As we have seen from the
previous examples, just add a CREATE TABLE <Table Name> AS before
the SELECT statements. So the above example so modified would look like
as follows
PROC SQL;
CREATE TABLE Sample_Dataset AS
SELECT client,current_os
FROM account_perf;
quit;
IV. SELECT- Eliminating duplicate rows
PROC SQL;
SELECT DISTINCT client
FROM account_perf;
quit;
V. SELECT-Computing values
PROC SQL;
SELECT client,(current_os/1000)as
OS_in_1000 FROM account_perf;
quit;
Note that row level
computing can be done using the formula or multiple columns.
VI. SELECT-Assigning Column Alias and
formatting it.
We have seen earlier in our
examples that a new column name is formed with ‘AS’ statement in SELECT
statement. Its also possible to specify the format of that variable in
PROC SQL. Let us have a look at how OS_in_1000 variable is formed.
PROC SQL;
SELECT client,(current_os/1000)as
OS_in_1000 format =4.2 FROM account_perf;
quit;
VII. SELECT – Conditional Assignment
using CASE
Using CASE statement for
conditional processing is a powerful feature of SAS Data step and PROC
SQL. Here is an example where in SELECT statement CASE statement is
used to create a new field Risk_Category based on certain conditions.
PROC SQL;
SELECT client,current_os,
CASE
WHEN current_os <= 300
THEN 'Low Risk'
WHEN current_os <= 800
THEN 'Med Risk'
ELSE 'High Risk'
END AS Risk_category
from account_perf;
quit;
Note that unlike DATA step
CASE constructs, in PROC SQL each line does not end with a semi column.
Also ‘AS’ key word is logically follows after the END of the loop.
VIII.
SELECT-Specifying COLUMN attributes
You can specify the
following column attributes, which determine how SAS data is displayed:
FORMAT=
INFORMAT=
LABEL=
LENGTH=
If you do not specify these
attributes, then PROC SQL uses attributes that are already saved in the
table or, if no attributes are saved, then it uses the default
attributes. Let us have look at an example:
PROC SQL;
SELECT client,current_os format =4.2
label ='Current Outstanding' FROM account_perf;
quit;
IX. SELECT – Using Sub queries
SUB Queries and Queries
inside a Query. Instances where the WHERE clause evaluates the output
of another SELECT statement, the second SLECT statement is known as a
SUB Query. Here is an example:
PROC SQL;
SELECT client,current_os
FROM account_perf
WHERE client IN (SELECT
distinct client FROM account_perf
where tot_payment >180);
quit;
Though not a real life
scenario, the above example demonstrates how a Sub-Query is used. The
sub-query returns a list of clients that had at least one payment more
than $180 and their current outstanding is listed for all accounts. In
real life, esp when we work with multiple tables, sub queries are very
useful to frame the right WHERE clauses for data retrieval. Now
let us look at some conditional operators used in a WHERE clause of a
SELECT Statement. Exercise for you is to frame a query using these
operators. Consult some online help for syntax help.
|
Operator
|
Definition
|
|
ANY
|
Specifies that at least one of a set
of values obtained from a sub query must satisfy a given condition
|
|
ALL
|
Specifies that all of the values
obtained from a Sub query must satisfy a given condition
|
|
BETWEEN-AND
|
Tests for values within an inclusive
range
|
|
CONTAINS
|
Tests for values that contain a
specified string
|
|
EXISTS
|
Tests for the existence of a set of
values obtained
From a sub query
|
|
IN
|
Tests for values that match one of a
list of values
|
|
IS NULL or IS MISSING
|
Tests for missing values
|
|
LIKE
|
Tests for values that match a
specified pattern
|
X. SELECT- GROUP FUNCTIONS
There are a lot of group
functions we can use in SELECT Statement of a PROC SQL. When you use an
aggregate function, PROC SQL applies the function to the entire table,
unless you use a GROUP BY clause. Here is an example:
PROC SQL;
SELECT client, avg(current_os) as avg_os
, avg(tot_payment)as avg_payment
FROM account_perf
GROUP BY client;
quit;
PROC SQL;
SELECT client, avg(current_os) as avg_os
, avg(tot_payment)as avg_payment
FROM account_perf;
quit;
When you execute these
program blocks, the first PROC SQL computes the averages for each
client group and the second PROC SQL computes them for the entire
table. Having seen how a group function is used, below
given is a list of group functions you can use in a PROC SQL statement.
Note that all of these are substitutes for a PROC SUMMMARY or PROC
MEANS statistics.
|
Function
|
Definition
|
|
AVG, MEAN
|
Mean or average of values
|
|
COUNT, FREQ, N
|
Number of nonmissing values
|
|
CV
|
Coefficient of variation (percent)
|
|
MAX
|
Largest value
|
|
MIN
|
Smallest value
|
|
NMISS
|
Number of missing values
|
|
RANGE
|
Range of values
|
|
STD
|
Standard deviation
|
|
STDERR
|
Standard error of the mean
|
|
SUM
|
Sum of values
|
|
VAR
|
Variance
|
XI.
SELECT – Joining the tables
When
we work with multiple SAS tables we often will have to join them for
data processing. We commonly use SAS DATA step and MERGE method to
achieve this task. PROC SQL can be used as a simpler substitute for the
MERGE data step method. Let us create another dataset so that we can
demonstrate the example.
DATA account_perf2;
INPUT account fico_seg $
prev_os prev_payment ;
cards;
1002 401-500 400 100
1003 501-600 300 150
1004 601-700 400 180
1005 701-800 900 190
2009 601-700 600 145
1007 401-500 660 210
1008 501-600 750 180
2006 801-900 550 200
2009 601-700 600 145
2110 701-800 400 148
;run;
Now to join these tables
for all common accounts (equi-join), we use DATA step and MERGE
statement as follows. Note that in the data step, dataset should be
sorted before we MERGE them.
proc sort
data=account_perf out=account_perf;
by account;
run;
proc sort
data=account_perf2 out =account_perf2;
by account;
run;
Data merged1;
merge account_perf(in=a)
account_perf2(in=b);
by account;
if a=b;
run;
Now the same results can be achieved
using PEOC SQL as follows.
Proc SQL;
create table merged2 as Select a.*,
b.* from account_perf a,
account_perf2 b
where
a.account=b.account;
quit;
The above example
demonstrates that how PROC SQL can simplify the coding. Not only
that we could avoid the data sort, now we can make use of the powerful
WHERE clause to exactly tell SAS various conditions of merging. With
the use of Sub-queries and conditional processing (like IN, NOT IN ,
LIKE , CONTAIN) in WHERE clause, we can achieve any
combination of data merging.
PROC SQL in SAS also
provides direct merging of multiple tables with RIGHT JOIN, LEFT JOIN
and FULL JOIN keywords for various Outer joins. Readers are requested
to explore them as well.