Document 7510756


“The DATA step is your most powerful programming tool.

So understand and use it well.”

Socrates


Objectives

understand DATA step:

processes

internals

defaults

compilation of DATA step source code

execution of resultant machine code

compile and execute phases of:

INPUT (non-SAS data)

SET (SAS data)

Compile Time Activities

syntax scan

source code translation to machine language

definition of input and output files

Compile Time Activities (continued)

creation of:

input buffer

LPDV (logical program data vector)

data set descriptor information

Creation of LPDV

Variables are added in the order seen by the compiler during parsing and interpretation of source statements.
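The ordering rule for the LPDV can be checked with a small step like the one below. This is only a sketch: the data set name order_demo and its variables are invented for illustration and do not come from the slides.

```sas
/* Hypothetical example: data set and variable names are illustrative only. */
data order_demo;
   length city $ 12;        /* compiler sees CITY first: position 1   */
   zip   = 27514;           /* ZIP is seen second: position 2         */
   state = 'NC';            /* STATE is seen third: position 3        */
   city  = 'Chapel Hill';   /* CITY is already in the LPDV; no change */
run;

/* VARNUM lists variables by position in the data set, which reflects */
/* the order in which the compiler first encountered them.            */
proc contents data=order_demo varnum;
run;
```

Moving the LENGTH statement below the two assignments should move CITY to the last position, which is one way to see that order is decided at compile time rather than at execution time.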

Compile Time Statements

location critical:

BY

WHERE

ARRAY

ATTRIB

FORMAT

INFORMAT

LENGTH

location irrelevant:

DROP

KEEP

LABEL

RENAME

RETAIN
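As an illustration of why the location of a compile-time statement can matter, the sketch below places a LENGTH statement after the variable has already been defined. The data set length_demo and its variables are hypothetical and are not part of the original presentation.

```sas
/* Hypothetical example: names are illustrative only. */
data length_demo;
   code = 'AB';             /* first reference: CODE is character, length 2    */
   length code $ 10;        /* too late: the length was fixed by the line      */
                            /* above, so SAS is expected to log a warning      */
   code = 'ABCDEFGH';       /* value is truncated to the existing length of 2  */
   extra = 1;
   drop extra;              /* DROP works the same wherever it sits in the step */
   put code=;               /* writes the truncated value to the log           */
run;
```

Moving the LENGTH statement to the top of the step gives CODE a length of 10 and keeps the full value, while moving the DROP statement makes no difference to the output data set.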

Retained Variables

all SAS special variables (_N_, _ERROR_)

all variables in a RETAIN statement

all variables from SET, MERGE, or UPDATE

accumulator variables in SUM statement(s)

Variables Not Retained

variables from the INPUT statement

user-defined variables (other than SUM statement accumulators)
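The difference between retained and non-retained variables can be watched in the log with PUT _ALL_, the same tool used later in these slides. The step below is a sketch; retain_demo, flag, count, and seen_max are hypothetical names chosen for the example.

```sas
/* Hypothetical example: names are illustrative only. */
data retain_demo;
   put 'top:    ' _all_;        /* state right after initialization          */
   input x;                     /* X comes from INPUT: not retained          */
   if x = 2 then flag = 1;      /* plain assignment: FLAG is not retained    */
   count + 1;                   /* SUM statement: COUNT is retained, from 0  */
   retain seen_max 0;           /* RETAIN statement: SEEN_MAX is retained    */
   seen_max = max(seen_max, x);
   put 'bottom: ' _all_;        /* state just before the implied output      */
   datalines;
1
2
3
;
run;
```

In the "top" line for the third iteration, X and FLAG should be back to missing even though FLAG was set to 1 in the second iteration, while COUNT and SEEN_MAX should still hold the values they had at the bottom of the second iteration.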

Type and Length of Variables

determined at compile time

by the first reference the compiler sees (in the DATA step)

Numerics:

length is 8 during DATA step processing

length is an output property
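A short sketch of these rules follows. The data set type_demo and its variables are hypothetical; the point is only that the first reference fixes type and length, and that a numeric LENGTH affects storage in the output data set rather than processing in the step.

```sas
/* Hypothetical example: names are illustrative only. */
data type_demo;
   length small 3;          /* numeric: 3 bytes when written to the data set */
   small   = 100;           /* still handled as an 8-byte value in the step  */
   initial = 'A';           /* first reference: character, length 1          */
   initial = 'ABC';         /* truncated to fit the length already set       */
   amount  = 0;             /* first reference: numeric                      */
   put _all_;               /* write the LPDV to the log                     */
run;

/* The Len column shows the stored lengths chosen at compile time. */
proc contents data=type_demo;
run;
```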

INPUT statement

reading non-SAS data

Compile Loop and LPDV

data a ;
  put _all_ ;   *write LPDV to LOG;
  input idnum
        diagdate: mmddyy8.
        sex $
        rx_grp $ 10. ;
  time = intck ('year', diagdate, today() ) ;
  put _all_;    *write LPDV to LOG;
cards ;
1 09-09-52 F placebo
2 11-15-64 M 300 mg.
3 04-07-48 F 600 mg.
run;

input buffer

logical program data vector

idnum    diagdate  sex   rx_grp  time
numeric  numeric   char  char    numeric
8        8         8     10      8

Building descriptor portion of SAS data set

logical program data vector

        idnum    diagdate  sex   rx_grp  time     _N_   _ERROR_
type    numeric  numeric   char  char    numeric
length  8        8         8     10      8
DKR*    keep     keep      keep  keep    keep     drop  drop

*Drop/keep/rename

Execution of a DATA Step

[Flowchart: initialization of LPDV -> read input file -> end of file? If yes, the step terminates and the next step begins. If no, process statements in step -> implied output -> _N_ + 1 -> return to initialization of LPDV.]

DATA Step Execution

DATA Step Execution

Implied read/write loop, stopped by:

no more data to read

explicit STOP

no input data

some execution time errors
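Two of these stopping conditions can be seen in a short sketch. The data set names nums and first_two are hypothetical and were chosen only for this example.

```sas
/* Hypothetical example: names are illustrative only. */

/* No input source: the step body executes exactly once. The DO loop */
/* with an explicit OUTPUT writes five observations in that one pass. */
data nums;
   do i = 1 to 5;
      output;
   end;
run;

/* Explicit STOP: the implied loop would normally end when SET finds  */
/* no more observations; STOP ends it on the third iteration, before  */
/* the implied output, so only two observations are written.          */
data first_two;
   set nums;
   if _n_ > 2 then stop;
run;

proc print data=first_two;
run;
```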


Execution Time Activities

execute initialize-to-missing (ITM)

read from input source

modify data using user-controlled statements

supply values of variables to LPDV

output observation to SAS data set

Initialization

_N_ set to loop count

_ERROR_ set to 0

user variables set to missing

Execution Loop - raw data

data a ;
  put _all_ ;   *write LPDV to LOG;
  input idnum
        diagdate: mmddyy8.
        sex $
        rx_grp $ 10. ;
  time = intck ('year', diagdate, today() ) ;
  put _all_;    *write LPDV to LOG;
cards ;
1 09-09-52 F placebo
2 11-15-64 M 300 mg.
3 04-07-48 F 600 mg.
run;
proc contents; run;
proc print; run;

LPDV

IDNUM  DIAGDATE  SEX  RX_GRP   TIME  _N_
.      .                       .     1
1      -2670     F    placebo  48    1
.      .                       .     2
2      1780      M    300 mg.  36    2
.      .                       .     3
3      -4286     F    600 mg.  52    3
.      .                       .     4

(over all executions of DATA step……..)

2    data a ;
3      put _all_ ; *write LPDV to LOG;
4      input idnum
5            diagdate: mmddyy8.
6            sex $
7            rx_grp $ 10. ;
8      time = intck ('year', diagdate, today() ) ;
9      put _all_; *write LPDV to LOG;
10   cards ;

IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=1
IDNUM=1 DIAGDATE=-2670 SEX=F RX_GRP=placebo TIME=49 _ERROR_=0 _N_=1
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=2
IDNUM=2 DIAGDATE=1780 SEX=M RX_GRP=300 mg. TIME=37 _ERROR_=0 _N_=2
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=3
IDNUM=3 DIAGDATE=-4286 SEX=F RX_GRP=600 mg. TIME=53 _ERROR_=0 _N_=3
IDNUM=. DIAGDATE=. SEX= RX_GRP= TIME=. _ERROR_=0 _N_=4
NOTE: The data set WORK.A has 3 observations and 5 variables.
NOTE: The DATA statement used 0.59 seconds.

14   run;
15
16   proc contents; run;
NOTE: The PROCEDURE CONTENTS used 0.39 seconds.

Data Set Name: WORK.A                             Observations:         3
Member Type:   DATA                               Variables:            5
Engine:        V612                               Indexes:              0
Created:       11:18 Saturday, January 20, 2001   Observation Length:   42
Last Modified: 11:18 Saturday, January 20, 2001   Deleted Observations: 0
Protection:                                       Compressed:           NO
Data Set Type:                                    Sorted:               NO
Label:

-----Engine/Host Dependent Information-----

Data Set Page Size:        8192
Number of Data Set Pages:  1
File Format:               607
First Data Page:           1
Max Obs per Page:          194
Obs in First Data Page:    3

-----Alphabetic List of Variables and Attributes-----

#  Variable  Type  Len  Pos
---------------------------
5  TIME      Num     8   34
2  DIAGDATE  Num     8    8
1  IDNUM     Num     8    0
4  RX_GRP    Char   10   24
3  SEX       Char    8   16

PROC PRINT

IDNUM  DIAGDATE  SEX  RX_GRP   TIME
1      -2670     F    placebo  48
2      1780      M    300 mg.  36
3      -4286     F    600 mg.  52

SET statement

reading existing SAS data

DATA Step Compile

no input buffer

compiler reads descriptor portion of input SAS data set to build the LPDV

returns same variables/attributes, including new variables


SET

determine which SAS data set to be read

identify next observation to be read

copy variable values to LPDV

Execution Loop - SAS data

data sas_a ;
  put _all_ ;
  set a ;
  tot_rec + 1 ;
  put _all_ ;
run;

Building LPDV from descriptor portion of old SAS data set

logical program data vector

idnum    diagdate  sex   rx_grp  time     tot_rec
numeric  numeric   char  char    numeric  numeric
8        8         8     10      8        8

Building descriptor portion of new SAS data set

LPDV

IDNUM  DIAGDATE  SEX  RX_GRP   TIME  TOT_REC  _N_
.      .                       .     0        1
1      -2670     F    placebo  48    1        1
1      -2670     F    placebo  48    1        2
2      1780      M    300 mg.  36    2        2
2      1780      M    300 mg.  36    2        3
3      -4286     F    600 mg.  52    3        3
3      -4286     F    600 mg.  52    3        4

(over all executions of DATA step……..)

LOG

idnum=. diagdate=. sex= rx_grp= time=. tot_rec=0 _ERROR_=0 _N_=1
idnum=1 diagdate=-2670 sex=F rx_grp=placebo time=48 tot_rec=1 _ERROR_=0 _N_=1
idnum=1 diagdate=-2670 sex=F rx_grp=placebo time=48 tot_rec=1 _ERROR_=0 _N_=2
idnum=2 diagdate=1780 sex=M rx_grp=300 mg. time=36 tot_rec=2 _ERROR_=0 _N_=2
idnum=2 diagdate=1780 sex=M rx_grp=300 mg. time=36 tot_rec=2 _ERROR_=0 _N_=3
idnum=3 diagdate=-4286 sex=F rx_grp=600 mg. time=52 tot_rec=3 _ERROR_=0 _N_=3
idnum=3 diagdate=-4286 sex=F rx_grp=600 mg. time=52 tot_rec=3 _ERROR_=0 _N_=4

PROC PRINT

IDNUM  DIAGDATE  SEX  RX_GRP   TIME  TOT_REC
1      -2670     F    placebo  48    1
2      1780      M    300 mg.  36    2
3      -4286     F    600 mg.  52    3

Logic of a MERGE

compile

execute

data left;
  input ID X Y ;
cards;
1 88 99
2 66 77
3 44 55
;
data right;
  input ID A $ B $ ;
cards;
1 A14 B32
3 A53 B11
;

proc sort data=left; by ID; run;
proc sort data=right; by ID; run;

data both;
  merge left (in=inleft)
        right (in=inright) ;
  by ID ;
run;

logical program data vector

first iteration: MATCH

ID  X   Y   A    B    INLEFT  INRIGHT  _N_  _ERROR_
1   88  99  A14  B32  1       1        1    0

second iteration: NO MATCH

ID  X   Y   A    B    INLEFT  INRIGHT  _N_  _ERROR_
2   66  77            1       0        2    0

third iteration: MATCH

ID  X   Y   A    B    INLEFT  INRIGHT  _N_  _ERROR_
3   44  55  A53  B11  1       1        3    0
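Because the IN= variables sit in the LPDV on every iteration, they can be tested to route observations. The sketch below reuses the left and right data from the slides; the output data set names matched and left_only are hypothetical additions for illustration.

```sas
/* LEFT and RIGHT repeat the data from the slides; both are already   */
/* in ID order, so no PROC SORT is needed before the BY-group merge.  */
data left;
   input ID X Y;
   datalines;
1 88 99
2 66 77
3 44 55
;

data right;
   input ID A $ B $;
   datalines;
1 A14 B32
3 A53 B11
;

/* MATCHED and LEFT_ONLY are hypothetical names for this sketch. */
data matched left_only;
   merge left  (in=inleft)
         right (in=inright);
   by ID;
   if inleft and inright then output matched;   /* IDs found in both */
   else if inleft then output left_only;        /* IDs only in LEFT  */
run;
```

Given the three iterations shown above, IDs 1 and 3 go to matched and ID 2 goes to left_only. The IN= variables themselves are not written to either data set.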

Let’s try this again…………………

data left;
  input ID X Y ;
cards;
1 88 99
2 66 77
3 44 55
;
data right;
  input ID A $ B $ ;
cards;
1 A14 B32
3 A53 B11
;

proc sort data=left; by ID; run;
proc sort data=right; by ID; run;

data both;
  merge left (in=inleft)
        right (in=inright) ;
  ***** by ID (one-on-one merge);
run;

logical program data vector

first iteration: 1:1 “MATCH”

ID  X   Y   A    B    _N_  _ERROR_
1   88  99  A14  B32  1    0

ID = 1 OVERWRITTEN – value came from data set “right”

second iteration: 1:1 “MATCH”

ID  X   Y   A    B    _N_  _ERROR_
3   66  77  A53  B11  2    0

ID = 3 OVERWRITTEN – value came from data set “right” (data set “left” supplied ID = 2)

third iteration: 1:1 “NO MATCH”

ID  X   Y   A    B    _N_  _ERROR_
3   44  55            3    0

A and B MISSING – no values from “right”

Output SAS data set

ID  X   Y   A    B
1   88  99  A14  B32
3   66  77  A53  B11
3   44  55

DATA Step Conclusions

Understanding internals and default activities allows you to:

make informed coding decisions

write flexible and efficient code

debug and test effectively

interpret results readily


Remember

We have discussed DEFAULTS

As soon as you add options, statements, features, etc., the default actions change; TEST them!

You can use these same tools to track what’s happening.
