ppt - Research Triangle SAS Users Group

Download Report

Transcript ppt - Research Triangle SAS Users Group

SAS® PROC SQL or Vanilla Flavor
Cecilia Mauldin
January 31 2007
What is Special in SAS PROC
SQL?
• Things you already know from data
step programming
• Some differences between data step
and proc sql syntax
• Order of execution
• How good is my code?
The Data
DATA dogs;
LENGTH name breed $20
dob
$10;
INPUT name breed dob;
CARDS;
Nala
Nina
Jadzia
Rocco
Bruja
Chacal
Chata
Tyson
;
RUN;
Airdale
AMBR
Pointer
MMBNR
Rottweiler
Rottweiler
Boxer
Rottweiler
11jan1990
22apr1997
15feb1999
08may1980
15aug2000
29jan1990
UNK
15mar2000
DATA parents;
LENGTH parent name $20;
INPUT parent name;
CARDS;
Ina
Ina
Ina
Ina
Chavo
Cecilia
Cecilia
;
RUN;
Nala
Chata
Bruja
Fidonga
Rocco
Nina
Jadzia
What is Special in SAS PROC
SQL?
• Things you already know from data
step programming
• Some differences between data step
and Proc SQL syntax
• Order of execution
• How good is my code?
Things You Already Know from
Data Step Programming
• Our old friends: where, keep, drop
and rename
• In data step
– How do we use them?
– Why do we use them?
• In SAS PROC SQL
– How do we use them?
– Why do we use them?
Things You Already Know from
Data Step Programming
DATA new_data (KEEP=
RENAME=( )
DROP=
WHERE=( ));
SET old_data (KEEP=
RENAME=()
DROP=
WHERE=( ));
SAS data steps … ;
RUN;
It is written from the
PDV when specified
on an output data set.
Is applied when the
dataset OLD_DATA is
read into the PDV.
Data Step: Why Do We Use Our
Old Friends?
DATA new_data (KEEP=
RENAME= ( )
DROP=
WHERE=( ));
SET old_data (KEEP=
RENAME=()
DROP=
WHERE=( ));
SAS data steps … ;
RUN;
No need to keep
observations and/or
variables that are not
used in the output
data set.
No need to carry
observations or
variables that are not
going to be used in
the PDV.
Proc SQL: How Do We Use
Them?
PROC SQL SQL_OPTIONS;
CREATE TABLE example (KEEP= ….
RENAME=( ….)
DROP=
WHERE=( ….))
AS SELECT *
FROM data_set (KEEP= var1 var2….
RENAME=(var1 = newvar1
var2 = newvar2
QUIT;
DROP=
WHERE=(….));
….)
Why Do We Use Them?
• Typing convenience
• Right, left and full joins
Typing Convenience
• Your dataset has 50 variables, you
need to use 45
– Do you want to type 45 variables?
– Do you carry 5 variables that you don’t need?
Typing Convenience cont’d
SELECT var1, var2, ∙∙∙, var45
FROM data_set;
SELECT *
FROM data_set (DROP=var46 … var50);
Typing Convenience cont’d
SELECT
name AS firstname,
breed
FROM dogs
;
First name
Nala
Nina
Jadzia
Rocco
Bruja
Chacal
Chata
Tyson
SELECT *
FROM dogs (DROP=dob
RENAME=(name=firstname))
;
Breed
Airdale
AMBR
Pointer
MMBNR
Rottweiler
Rottweiler
Boxer
Rottweiler
Joins
• Avoid ‘warning’ messages when
creating a data set because two
variables in two different datasets
have the same name
• Limit the records that come in a join
Joins cont’d
• Avoid WARNING messages when 2
variables have the same name
CREATE TABLE final
AS SELECT a.*, b.*
FROM data_set1 a, data_set2 b
WHERE a.var1=b.var1
AND
a.var2=b.var2
AND
a.var3=b.var3
;
Joins cont’d
• Avoid WARNING messages when 2
variables have the same name
CREATE TABLE family AS
SELECT a.*, b.*
FROM dogs a, parent b
WHERE a.name=b.name
;
WARNING: Variable Name already exists on file WORK.FAMILY.
NOTE: Table WORK.FAMILY created, with 6 rows and 4 columns.
Joins cont’d
• Avoid WARNING messages when 2 variables
have the same name
CREATE TABLE final (DROP=VAR1B VAR2B VAR3B)
AS SELECT a.*, b.*
FROM data_set1 A,
data_set2 (RENAME=(var1=var1b
WHERE a.var1=b.var1b
AND
a.var2=b.var2b
AND
a.var3=b.var3b;
var2=var2b
var3=var3b)) B
Joins cont’d
CREATE TABLE family (DROP=nickname)
AS SELECT a.*, b.*
FROM dogs a, parents (RENAME=(name=nickname)) b
WHERE a.name=b.nickname
;
Name
Nala
Nina
Jadzia
Rocco
Bruja
Chata
breed
Airdale
AMBR
Pointer
MMBNR
Rottweiler
Boxer
DOB
11JAN1990
22APR2001
15FEB1999
08MAY2001
15AUG2000
.
Parent
Ina
Cecilia
Cecilia
Chavo
Ina
Ina
Joins cont’d
SELECT a.*, b.*
FROM dogs a LEFT JOIN parents b
Do we get a
warning?
ON a.name=b.name
AND a.breed='Rottweiler'
;
Name Breed
Bruja
Chacal
Chata
Jadzia
Nala
Nina
Rocco
Tyson
Rottweiler
Rottweiler
Boxer
Pointer
Airdale
AMBR
MMBNR
Rottweiler
DOB
15AUG2000
29JAN1990
.
15FEB1999
11JAN1990
22APR2001
08MAY2001
15MAR2000
Parent name
Ina
Bruja
Joins cont’d
• Limit records
CREATE TABLE final AS SELECT a.*, b.*
FROM data_set1 (WHERE=(var1=’keep’)) a
LEFT JOIN data_set2 b
ON
;
a.var1=B.var1b
AND
a.var2=B.var2b
AND
a.var3=B.var3b
Joins cont’d
SELECT a.*, b.*
FROM dogs (WHERE=(breed='Rottweiler'))a
LEFT JOIN parents b
ON a.name=b.name
;
Name
Bruja
Chacal
Tyson
breed
Rottweiler
Rottweiler
Rottweiler
DOB
Parent name
15AUG2000 Ina
Bruja
29JAN1990
15MAR2000
What is special in SAS PROC
SQL?
•Things you already know from data
step programming
• Some differences between data step
and SQL syntax
•Order of execution
•How good is my code?
Some Differences Between
Data Step and SQL Syntax
• Reserved word
• Format modifiers
• Descending sorting
Reserved Word
• CASE
– Why to use it
• Is there something similar in data step?
– How to use it
– What can be the problem
– How to avoid a problem
CASE: Why to Use It
• There is no IF or SELECT statements in SQL
• Is similar to the SELECT clause in data step
CASE: Why to Use It
DATA final;
SET parents;
SELECT (name);
WHEN (‘Nina') education =‘Novice';
WHEN (‘Nala') education =‘None';
OTHERWISE education =‘There is hope’;
END;
CASE: How to Use It
SELECT var1, var2,
CASE [vark]
WHEN (“value1”) THEN [‘constant value’ | function]
WHEN (‘value3’) THEN [‘other constant’ | function]
…
ELSE [‘last value’ | function ]
END AS newvar
FROM data_set
;
CASE: How to Use It cont’d
SELECT
name,
Name
Nala
Chata
Bruja
Fidonga
Rocco
Nina
Jadzia
Education
None
There is hope
There is hope
There is hope
There is hope
Novice
There is hope
CASE name
WHEN ('Nala') THEN 'None'
WHEN ('Nina') THEN 'Novice'
ELSE
'There is hope'
END AS education LENGTH = 15 FORMAT $15.,
parent
FROM parent
;
Parent
Ina
Ina
Ina
Ina
Chavo
Cecilia
Cecilia
CASE: How to Use It cont’d
SELECT name,breed,
CASE
WHEN INDEX (breed,'MB') THEN PUT(breed,$UKC.)
ELSE
breed
END AS lbreed label "Long Breed"
FROM dogs; Name Breed
Long breed
Bruja
Chacal
Chata
Jadzia
Nala
Nina
Rocco
Tyson
Rottweiler
Rottweiler
Boxer
Pointer
Airdale
AMBR
MMBNR
Rottweiler
Rottweiler
Rottweiler
Boxer
Pointer
Airdale
American Mixed Breed Registry
Mexican Mixed Breed Non-Registered
Rottweiler
CASE: What Can Be the
Problem
DATA new;
LENGTH case $20;
SET parents;
IF parent ='Ina'
THEN case='None';
ELSE IF parent ='Chavo' THEN case='Spoiled';
ELSE
case='Work';
RUN;
PROC PRINT DATA=new;
RUN;
Obs
1
2
3
4
5
6
7
Case
None
None
None
None
Spoiled
Work
Work
Parent
Ina
Ina
Ina
Ina
Chavo
Cecilia
Cecilia
Name
Nala
Chata
Bruja
Fidonga
Rocco
Nina
Jadzia
CASE: What Can Be the Problem
cont’d
SELECT * FROM new WHERE parent='Ina';
Case
None
None
None
None
Parent
Ina
Ina
Ina
Ina
Name
Nala
Chata
Bruja
Fidonga
CASE: What Can Be the Problem
cont’d
SELECT * FROM new WHERE case='None';
select * from new WHERE case='None';
+
22
+
76
+ERROR 22-322: Syntax error, expecting one of the following:
+
a name, a quoted string, a numeric constant,
+
a datetime constant, a missing value, (, +, -,
+
BTRIM, CALCULATED, CASE, EXISTS, INPUT, LOWER,
+
NOT, PUT, SELECT, SUBSTRING, TRANSLATE, UPPER,
+
USER, WHEN, ^, ~.
+ERROR 76-322: Syntax error, statement will be ignored.
CASE: How to Avoid a Problem
SELECT *
FROM new (RENAME=(case=education))
WHERE education='None';
Some Differences Between
Data Step and SQL Syntax
• Reserved word
• Format modifiers
• Descending sorting
Format Modifiers
• What are format modifiers?
• Format modifiers in data step
• Format modifiers in proc SQL
Format Modifiers in Data Step
DATA ndogs;
SET dogs;
idate=INPUT(dob,date9.);
IF _ERROR_=1 THEN PUT _ALL_;
RUN;
NOTE: Invalid argument to function INPUT at line 196 column 9.
name=Chata breed=Boxer dob=UNK idate=. _ERROR_=1 _N_=3
NOTE: Mathematical operations could not be performed at the
following places. The results of the operations have been set to
missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 196:9
Format Modifiers in Data Step
cont’d
DATA ndogs;
SET dogs;
idate=INPUT(dob, ?? date9.);
IF _ERROR_=1 THEN PUT _ALL_;
RUN;
NOTE: There were 8 observations read from the
data set WORK.DOGS.
NOTE: The data set WORK.NDOGS has 8
observations and 4 variables.
Format Modifiers in PROC SQL
CREATE TABLE ndogs AS SELECT
name, dob,
INPUT(dob, date9.) AS idate FORMAT mmddyy8.
FROM dogs;
+ERROR: Invalid day value
NOTE: Table WORK.NDOGS created, with 8
rows and 3 columns.
Format Modifiers in PROC SQL
cont’d
CREATE TABLE ndogs AS SELECT
name, dob,
INPUT(dob,? date9.) AS idate FORMAT mddyy8.
FROM dogs;
NOTE: Table WORK.NDOGS created, with 8
rows and 3 columns.
Some Differences Between
Data Step and SQL Syntax
• Reserved word
• Format modifiers
• Descending sorting
Descending Sorting
• Data step and PROCS
PROC SORT DATA = ndogs;
BY DESCENDING idate;
RUN;
• PROC SQL
SELECT *
FROM ndogs ORDER BY idate descending;
Descending Sorting cont’d
• A little lagniappe
(pronounced nańap)
SELECT name, dob,
INPUT( dob,? date9.) AS idate FORMAT date9.
FROM dogs
ORDER BY 3 DESC
Name
DOB
Idate
Bruja
15Aug2000
15AUG2000
;
Tyson
Jadzia
Nina
Chacal
Nala
Rocco
Chata
15Mar2000
15Feb1999
22APR1997
29Jan1990
11jan1990
08may1980
UNK
15MAR2000
15FEB1999
22APR1997
29JAN1990
11JAN1990
08MAY1980
.
What is Special in SAS PROC
SQL?
• The program data vector (PDV)
• Some differences between data step
and SQL syntax
• Order of execution
• How good is my code?
The Order of Execution
SELECT VAR1, VAR2, …
FROM data_set
[ FULL | LEFT | RIGHT | INNER ] JOIN TABLE2
ON
AND
condition 1
condition 2
WHERE
AND
condition k
condition l
GROUP BY VAR1, VAR….
HAVING condition p
AND condition q
ORDER BY var1, var2…
;
The Order of Execution cont’d
PROC SQL NUMBER;
SELECT a.breed,
INPUT(a.dob,? date9.) AS ndob format date9.,
COALESE(a.name, b.name) AS name, b.parent
FROM dogs a FULL JOIN parents b
ON a.name=b.name; Breed
NDOB
Name
Parent
1 Rottweiler
2 Rottweiler
3 Boxer
4
5 Pointer
6 Airdale
7 AMBR
8 MMBNR
15AUG2000
29JAN1990
.
15FEB1999
11JAN1990
22APR1997
08MAY1980
Bruja
Chacal
Chata
Fidonga
Jadzia
Nala
Nina
Rocco
Ina
Ina
Ina
Cecilia
Ina
Cecilia
Chavo
The Order of Execution cont’d
SELECT a.breed,
INPUT( a.dob, ? date9.) AS ndob FORMAT date9.,
COALESCE (a.name, b.name) AS name,
b.parent
FROM dogs a FULL JOIN parents b
ON
a.name=b.name
AND breed='Rottweiler';
Breed
1 Rottweiler
2 Rottweiler
3 Boxer
4
5 Pointer
6 Airdale
7 AMBR
8 MMBNR
NDOB
15AUG2000
29JAN1990
.
15FEB1999
11JAN1990
22APR1997
08MAY1980
Name
Bruja
Chacal
Chata
Fidonga
Jadzia
Nala
Nina
Rocco
Parent
Ina
Ina
Ina
Cecilia
Ina
Cecilia
Chavo
The Order of Execution cont’d
SELECT a.breed,b.parent,
COALESCE ( a.name, b.name)
FROM dogs a FULL JOIN parent b
ON
a.name=b.name
WHERE breed='Rottweiler';
SELECT
a.breed, b.parent,
COALESCE (b.name, a.name)
FROM dogs a FULL JOIN parent b
ON
a.breed='Rottweiler'
WHERE a.name=b.name;
Breed
Parent
Rottweiler Ina
Bruja
Rottweiler
Chacal
Rottweiler
Tyson
Breed
Parent
Rottweiler Ina
Bruja
The Order of Execution cont’d
SELECT
a.breed,
INPUT(a.dob, ? date9.) AS ndob format date9.,
COALESCE(a.name, b.name) AS name,
b.parent
FROM dogs a FULL JOIN parent b
ON
a.name=b.name
WHERE breed='Rottweiler'
HAVING ndob GT '01jan2000'd;
Breed
Rottweiler
Rottweiler
NDOB
15AUG2000
15MAR2000
Name Parent
Bruja Ina
Tyson
How Good is My Code?
• Does the options used in data step
make my code faster?
• Is speed important?
How Good is My Code? cont’d
• Do data step options make my code faster?
data all;
merge dogs
parent;
by name;
do i=1 to 100;
do j=1 to 100;
output;
end;
end;
run;
NOTE: There were 8 observations read from the data set WORK.DOGS.
NOTE: There were 7 observations read from the data set WORK.PARENT.
NOTE: The data set WORK.ALL has 891000 observations and 6 variables.
How Good is My Code? cont’d
• Does the statements used to control the
size of my input data set make my code
faster?
proc sql stimer;
NOTE: SQL statement used:
real time
0.01 seconds
cpu time
0.02 seconds
How Good is My Code? cont’d
• Does the statements used to control the size of
my input data set make my code faster?
SELECT * FROM all WHERE breed='Rottweiler';
NOTE: SQL statement used:
real time
10.82 seconds
cpu time
10.50 seconds
SELECT * FROMall (WHERE=(breed='Rottweiler'));
NOTE: SQL statement used:
real time
10.85 seconds
cpu time
10.61 seconds
How Good is My Code? cont’d
• Does the statements used to control the size of
my input data set make my code faster?
SELECT DISTINCT name, breed, dob
FROM all
real time 4.90 seconds
WHERE breed='Rottweiler';
cpu time 4.35 seconds
SELECT DISTINCT name, breed, dob
FROM all ( WHERE=(breed='Rottweiler'));
real time 9.42 seconds
cpu time 4.57 seconds
SELECT DISTINCT name, breed, dob
FROM all (WHERE=(breed='Rottweiler')
real time 5.89 seconds
KEEP=name breed dob);
cpu time 4.30 seconds
Conclusions
• This presentation showed examples
on how and when to use statements
that we already know
– KEEP
– RENAME
– DROP
– WHERE
(things you already know)
Conclusions
•
Possible language problems (line
mines)
– Reserved word (CASE)
– Format modifiers
– Descending order
(things you think you already know)
Conclusions cont’d
• Be careful with the order of the
conditions
• The order of the conditions affect the
result
Conclusions cont’d
• When you use SQL, consider if you
need all your variables and
observations
• If you don’t need them consider:
– It there is a significant difference in computer
use
– Are you a good typist?
Conclusions cont’d
• You can have plain vanilla SAS PROC
SQL or you can SASsy it up with data
step statements
Acknowledgements
• PPD management
– Kim Sturgen and Bonnie Duncan
– Andy Barnett
– Darrel Edgley
– Neil Howard, Larry Sleeper and Craig Mauldin
– For their time, although if there are mistakes,
they are all mine; I made modifications and
additions after they read the paper
Contact Information
Angelina Cecilia Mauldin
Abraxis BioScience
4505 Emperor Blvd. #400
Durham, NC 27703
919 433 8400
[email protected]