presentation

Download Report

Transcript presentation

PhUSE 2011: Brighton
TS09
Rectifying Irregular Text Data
a Case for Using Regular Expressions in SAS
Jayshree Garade
Manjusha Gode
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
2
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
3
Problem: Physical abnormalities
SUBJID
TRT
ABNORMALITY
01-011
B
ANEMIA
01-036
D
ANAEMIA
01-026
C
ANEMEA
01-014
B
ANEMIC
4
Problem: Time point variable …
USUBJID
VISIT
VSDT
1
1
17-Oct-08
1
2
1
PRSDTLTM
VNTR_RT
VNTRTUN
Per 1 D01 Predose
47
/min
3-Nov-08
Per 1 D01 .5 hr
58
/min
2
3-Nov-08
Per 1 D 01 01 hr
51
/min
1
2
3-Nov-08
Per 1d01 02hr
49
/min
1
3
4-Nov-08
day1
53
/min
1
90
3-Feb-09
Poststudy
56
/min
5
…Problem: Time point variable
USUBJID
VISIT
VSDT
1
1
17-Oct-08
1
2
1
PRSDTLTM
VNTR_RT
VNTRTUN
Per 1 D01 Predose
47
/min
3-Nov-08
Per 1 D01 .5 hr
58
/min
2
3-Nov-08
Per 1 D 01 01 hr
51
/min
1
2
3-Nov-08
Per 1d01 02hr
49
/min
1
3
4-Nov-08
day1
53
/min
1
90
3-Feb-09
Poststudy
56
/min
6
…Problem: Time point variable
USUBJID VISIT
VSDT
PRSDTLTM
Per 1 D01
Predose
VNTR_RT
VNTRTUN
Time_desc
47
/min
Predose
1
1
17-Oct-08
1
2
3-Nov-08 Per 1 D01 .5 hr
58
/min
Day 1, 0.5 Hour
1
2
3-Nov-08 Per 1 D 01 01 hr
51
/min
Day 1, 1 Hour
1
2
3-Nov-08 Per 1d01 02hr
49
/min
Day 1, 2 Hours
1
3
4-Nov-08 day1
53
/min
Day 1
1
90
3-Feb-09 Poststudy
56
/min
Poststudy
7
…Problem: Time point variable
PRSDTLTM
Time_desc
D01
Day 1
D 01
Day 1
d01
Day 1
day1
Day 1
8
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
9
…Ways to approach the problem
• Traditional --- Using SAS String Functions
INDEX TRANWRD SUBSTR ANYALNUM ANYALPHA
ANYDIGIT ANYSPACE NOTALNUM NOTALPHA ANYALNUM
NOTUPPER ANYALPHA FIND ANYDIGIT FINDC ANYPUNCT
ANYSPACE INDEXC NOTALNUM INDEXW NOTALPHA VERIFY
NOTDIGIT CALL CATS CALL CATT CALL CATX TRANSLATE SCAN
SCANQ CALL SCAN CALL SCANQ COMPARE COMPLEV CALL
COMPCOST SOUNDEX COMPGED SPEDIS MISSING RANK
REPEAT REVERSE…………
10
Alternative Approach to Problem
Introducing REGULAR EXPRESSIONS!!
11
Introduction – Regular Expressions
• Powerful technique for searching and manipulating
text data.
• A mini programming language - pattern matching.
• 2 types – pattern matching functions in SAS
SAS Regular Expressions – SAS Version 6.12
PERL Regular Expressions – SAS Version 9
12
Steps to use Regular Expressions…
Problem
Required Portion
Required
Portion
Pattern
Regular Expressions
Locate Reqd. Portion
Process Data
13
Step1 - Identify the problem …
Problem
Required Portion
USUB VISIT
JID
Locate Reqd.
Portion
Process Data
PRSDTLTM
VNTR_ VNTR
RT
TUN
time_desc
1
1
17-Oct- Per 1 D01
08
Predose
47
/min Predose
1
2
3-Nov- Per 1 D01 .5
08
hr
58
/min Day 1, 0.5 Hour
1
2
3-Nov- Per 1 D 01
08
01 hr
51
/min Day 1, 1 Hour
1
2
3-Nov- Per 1d01 02
08
hr
49
/min Day 1, 2 Hours
1
3
4-Nov- Day1
08
53
/min Day 1
1
90
3-Feb- Poststudy
09
56
/min Poststudy
Pattern
Regular
Expressions
VSDT
14
Step2 – Visualize the “Required Portion”
within the source text
Problem
Required Portion
PRSDTLTM
Per 1 D01 Predose
Per 1 D01 .5 hr
Pattern
Regular
Expressions
Per 1 D 01 01 hr
Per 1d01 02 hr
Locate Reqd.
Portion
Process Data
Day1
Poststudy
15
Step 3 – Identify a pattern
Problem
Required Portion
EXTRACT
Preceding
Blank
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
‘D’ or ‘d’
Following
Following
Blank
Blank
PRSDTLTM
Per 1 D01 Predose
Per 1 D01 .5 hrOne/more
digits
Per 1 D 01 01 hr
2- Non
Digits
Per 1d01 02 hr
Day1
Poststudy
16
Regular Expressions Syntax...at a glance
Metacharacter Description
*
Matches the previous sub expression zero or more times
+
Matches the previous sub expression one or more times
?
Matches the previous sub expression zero or one times
\d
Matches a digit (0-9)
\D
Matches a non-digit
\w
Matches a word character (upper or lower case letter, blank,
or underscore)
[abc]
\(
Matches any of the characters in the brackets
Matches (
19
Step 4 – Write the Regular Expression
for the pattern
("/ ?
Problem
/")
PRSDTLTM
Required Portion
Per 1 D01 Predose
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
Per 1 D01 .5 hr
Preceding
Blank
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
20
Step 4 – Write the Regular Expression
for the pattern
("/ ? [Dd]
Problem
PRSDTLTM
Required Portion
Per 1 D01 Predose
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
/")
Per 1 D01 .5 hr
‘D’ or ‘d’
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
21
Step 4 – Write the Regular Expression
for the pattern
("/ ? [Dd] (\D\D)?
Problem
PRSDTLTM
Required Portion
Per 1 D01 Predose
Pattern
Per 1 D01 .5 hr
Regular
Expressions
Locate Reqd.
Portion
Process Data
/")
Per 1 D 01 01 hr
2-Non
Digits
Per 1d01 02 hr
Day1
Poststudy
22
Step 4 – Write the Regular Expression
for the pattern
("/ ? [Dd] (\D\D)? ?
Problem
Required Portion
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
/")
PRSDTLTM
Following
Blank
Per 1 D01 Predose
Per 1 D01 .5 hr
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
23
Step 4 – Write the Regular Expression
for the pattern
("/ ? [Dd] (\D\D)? ? \d+
Problem
PRSDTLTM
Required Portion
Per 1 D01 Predose
Pattern
Per 1 D01 .5 hr
Regular
Expressions
Locate Reqd.
Portion
Process Data
/")
Per 1 D 01 01 hr
One/more
digits
Per 1d01 02 hr
Day1
Poststudy
24
Step 4 – Write the Regular Expression
for the pattern
("/ ? [Dd] (\D\D)? ? \d+ + /")
Problem
PRSDTLTM
Required Portion
Per 1 D01 Predose
Pattern
Per 1 D01 .5 hr
Regular
Expressions
Locate Reqd.
Portion
Process Data
Per 1 D 01 01 hr
Following
blank
Per 1d01 02 hr
Day1
Poststudy
25
Step 4 – Write the Regular Expression
for the pattern
Problem
Required Portion
("/ ?[Dd](\D\D)? ?\d+ +/")
PRSDTLTM
Per 1 D01 Predose
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
Per 1 D01 .5 hr
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
26
Step 4 – Write the Regular Expression
for the pattern
Problem
Required Portion
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
/* Extracting the Day Text portion*/
data day_txt;
set lb.ecg(keep = PRSDTLTM);
retain day_exp;
if _n_ = 1 then
do ;
* defined to describe the day text
pattern;
day_exp
=PRXPARSE ("/ ?[Dd](\D\D)? ?\d+ +/");
end;
Metacharacters
run;
27
Recap… Steps to use Regular Expressions…
Problem
Required Portion
Required
Portion
Pattern
Regular Expressions
Locate Reqd. Portion
Process Data
28
Recap… Steps to use Regular Expressions…
Problem
Required
Portion
Required
Portion
Pattern
Regular Expressions
Locate Reqd. Portion
Process Data
29
Recap… Steps to use Regular Expressions…
Problem
Required Portion
Required
Portion
Pattern
Regular Expressions
Locate Reqd. Portion
Process Data
30
Recap… Steps to use Regular Expressions…
Problem
Required Portion
Required
Portion
Pattern
Regular Expressions
Locate Reqd. Portion
Process Data
31
Step 5 – Locate the “Required Portion”
Problem
/* Extracting the Day Text portion*/
data day_txt;
Required Portion
Pattern
Regular
Expressions
Locate Reqd.
Portion
Process Data
set lb.ecg(keep = PRSDTLTM);
retain day_exp day_nexp;
if _n_ = 1 then
do ;
* defined to describe the day text pattern;
day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
end;
*Locating the day text pattern
in the PRSDTLTM
Source
var;
Pattern defn
Variable
CALL
PRXSUBSTR(day_exp,PRSDTLTM,dayst,dayln);
run;
Stores Start
position of
matched string
Stores length of matched
string
32
Step 6 – Use other SAS text functions to
further process data
Problem
Required Portion
Pattern
Regular
Expressions
/* Extracting the Day Text portion*/
data day_txt;
set lb.ecg(keep = PRSDTLTM);
retain day_exp day_nexp;
if _n_ = 1 then
do ;
* defined to describe the day text pattern;
day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
end;
* Locating the day text pattern in the PRSDTLTM var;
CALL PRXSUBSTR(day_exp,PRSDTLTM, dayst, dayln);
Locate Reqd.
Portion
* Extracting the day text pattern;
day_txt =
substrn(PRSDTLTM,dayst,dayln);
Process Data
run;
Source
Variable
Starting
Position
Length of
matched pattern
33
…Output
PRSDTLTM
Per 1 D01 Predose
day_txt
Extracted string
Per 1 D01 .5 hr
D01
Per 1 D 01 01 hr
D 01
Per 1d01 02 hr
d01
Day1
Day1
Poststudy
34
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
35
Advantages…
• Compact solution
• Tremendous flexibility
Concise description.
Highly unstructured data streams.
Multiple matching patterns in one step.
36
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
37
Look before you leap
 Document thoroughly.
 Understand patterns.
 Define before use.
 Define only once in a data step.
38
Outline
• Problems
• Solutions & Introducing Regular Expressions
• Advantages over SAS String Functions
• Points to note while using Regular Expressions
• References
39
…References
 Support.sas.com
 Paper TU02An Introduction to Regular Expressions with Examples from Clinical
Data - Richard F. Pless, Ovation Research Group, Highland Park, IL
 SUGI 29-Tutorials - Paper 265-29
An Introduction to Perl Regular Expressions in SAS 9
Ron Cody, Robert Wood Johnson Medical School, Piscataway, NJ
 An Introduction to PERL Regular Expression in SAS®
James J. Van Campen, SRI International, Menlo Park, CA
40
Q&A
Contact :
[email protected]
[email protected]
41
Thank you
42