Pattern Matching on Strings
using Regular Expressions
Num = 0 | [1-9][0-9]*
Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*
Claus Brabrand, IT University of Copenhagen
Jakob G. Thomsen, Aarhus University
C. Brabrand & J. G. Thomsen
REGULAR EXPRESSIONS
COPLAS DIKU, Denmark
[1]
May 11, 2010
Outline
Pattern Matching (intro & motiv):
The Chomsky Hierarchy (1956)
Regular Expressions:
The Recording Construction
Ambiguity:
Disambiguation
Type Inference
Usage and Examples
Evaluation and Conclusion
Introduction & Motivation
Pattern matching is an indispensable problem
Many applications need to "parse" dynamic input
1) URLs:
http://first.dk/index.php?id=141&view=details
protocol
host
path
query-string (list of key-value pairs)
2) Log Files:
13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html
3) DBLP:
<article>
<title>Three Models for the...</title>
<author>Noam Chomsky</author>
<year>1956</year>
</article>
The Chomsky Hierarchy (1956)
Language classes (+formalisms):
Type-3 regular expressions "enough" for:
URLs, log files, DBLP, ...
"Trade" (excess) expressivity for:
declarativity, simplicity, and static safety !
Type-0: java.net.URL
Turing-Complete programming (e.g., Java)
[ "unrestricted grammars" (e.g., rewriting systems) ]
Cyclomatic complexity (of official "java.net.URL"):
88 bug reports on Sun's Bug Repository !
Bug reports span more than a decade !
Type-1: Context-Sensitivity
Not widely used (or studied?) formalism
Presumably because:
Restricts expressivity w/o offering extra safety?
Type-2: Context-Free Grammars
Conceptually harder than regexps
(conjecture!)
Essentially (Type-3) Regular Expressions + recursion
The ultimate end-all scientific argument:
We Googled:
regexps 12 times more popular !
Type-?: Regexp Capture Groups
Capturing groups (Perl, PHP, Java regex, ...):
Syntax:
(R)
(i.e., in parentheses)
Back-references:
Syntax:
\7
(i.e., "index of" capturing group)
Beyond regularity !:
(a*)b\1 is non-regular: $\{ a^n b a^n \mid n \geq 0 \}$
In fact, not even context-free !!!:
(.*).\1 is non-context-free: $\{ \omega c \omega \mid c \in \Sigma, \omega \in \Sigma^* \}$
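The back-reference example can be tried out directly with java.util.regex. The following is a minimal sketch; the class name `BackrefDemo` and the helper `matches` are made up for illustration and are not part of the talk.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // (a*)b\1 : group 1 records a run of a's; \1 demands the very same run after the b.
    static final Pattern AN_B_AN = Pattern.compile("(a*)b\\1");

    // True iff the whole string has the form a^n b a^n.
    static boolean matches(String s) {
        return AN_B_AN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aabaa"));  // true
        System.out.println(matches("aabaaa")); // false
        System.out.println(matches("b"));      // true (n = 0)
    }
}
```

Recognizing this language requires remembering n, which is exactly what a finite automaton cannot do.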
Type-?: Regexp Capture Groups
Interpretation with back-tracking:
NP-complete (exponential worst-case) :-(
regexp "$(a?)^n a^n$" vs. string "$a^n$":
1 minute (with back-tracking) vs. 0.02 msecs (without)
3,000,000:1 on strings of length 29 !!!
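To experiment with this family of regexps, the pattern and the input can be generated for any n. This is only a sketch: `BacktrackDemo`, `pattern`, and `input` are illustrative names, and the timings printed depend on the machine and the regex engine, so no particular numbers are claimed here.

```java
import java.util.Collections;
import java.util.regex.Pattern;

class BacktrackDemo {
    // Builds the regexp a?^n a^n: n optional a's followed by n mandatory a's.
    static String pattern(int n) {
        return String.join("", Collections.nCopies(n, "a?"))
             + String.join("", Collections.nCopies(n, "a"));
    }

    // Builds the input string a^n.
    static String input(int n) {
        return String.join("", Collections.nCopies(n, "a"));
    }

    public static void main(String[] args) {
        // Kept small on purpose: a back-tracking matcher may need time exponential in n.
        for (int n = 5; n <= 20; n += 5) {
            long start = System.nanoTime();
            boolean ok = Pattern.matches(pattern(n), input(n));
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("n=" + n + " matched=" + ok + " in " + micros + " us");
        }
    }
}
```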
Type-3: Regular Expressions
Simple !
Declarative !
Closure properties:
Union
Concatenation
Iteration
Restriction
Intersection
Complement
...
Safe !
Decidability properties:
...
Containment: $L(R) \subseteq L(R')$
Ambiguity
...
Regular Expressions
Syntax:
Semantics:
where:
$L_1 \cdot L_2$ is concatenation (i.e., $\{ \omega_1 \omega_2 \mid \omega_1 \in L_1, \omega_2 \in L_2 \}$)
$L^* = \bigcup_{i \geq 0} L^i$ where $L^0 = \{ \varepsilon \}$ and $L^i = L \cdot L^{i-1}$
Common Extensions (sugar)
Any character (aka, dot): "." as c1|c2|...|cn, $c_i \in \Sigma$
Character ranges: "[a-z]" as a|b|...|z
One-or-more regexps: "R+" as RR*
Optional regexp: "R?" as ε|R
Various repetitions; e.g.: "R{2,3}" as RRR?
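The desugarings above can be spot-checked with java.util.regex by comparing each sugared form against its expansion on a few sample strings. This is a sanity check on samples, not a proof of equivalence; `SugarDemo`, `SAMPLES`, and `agree` are names invented for this sketch, and ε is written as an empty alternative.

```java
import java.util.regex.Pattern;

class SugarDemo {
    static final String[] SAMPLES = { "", "a", "ab", "abab", "ababab", "abababab", "ba" };

    // True if both regexps accept exactly the same strings among SAMPLES.
    static boolean agree(String sugared, String desugared) {
        for (String s : SAMPLES) {
            if (Pattern.matches(sugared, s) != Pattern.matches(desugared, s)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree("[a-c]", "a|b|c"));             // character range
        System.out.println(agree("(ab)+", "(ab)(ab)*"));         // R+  as RR*
        System.out.println(agree("(ab)?", "|ab"));               // R?  as ε|R
        System.out.println(agree("(ab){2,3}", "(ab)(ab)(ab)?")); // R{2,3} as RRR?
    }
}
```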
Recording
Syntax:
"x " is a recording identifier
(it "remembers" the substring it matches)
Semantics:
NB: cannot use DFAs / NFAs !
- only recognition (yes / no)
- not how (i.e., "the structure")
Example (simplified emails):
<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
Matching against string: "obama@whitehouse.gov"
yields: user = "obama" & domain = "whitehouse.gov"
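For comparison, plain java.util.regex can approximate these two recordings with named capturing groups. This is a rough analogue and not the construction from the talk; `EmailDemo` and its `match` helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailDemo {
    // <user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* > written with named groups.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(\\.[a-z]+)*)");

    // Returns {user, domain}, or null if the whole string is not a (simplified) email.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        return m.matches() ? new String[] { m.group("user"), m.group("domain") } : null;
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0] + " & domain = " + r[1]);
        // prints: user = obama & domain = whitehouse.gov
    }
}
```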
Recording (structured)
Another example (with nested recordings):
<date = <day = [0-9]{2} > "/"
        <month = [0-9]{2} > "/"
        <year = [0-9]{4} > >
Matching against string: "26/06/1992"
yields:
date = 26/06/1992
date.day = 26
date.month = 06
date.year = 1992
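Nested recordings can be imitated in java.util.regex by nesting named groups. Java group names are flat, so the dotted names below are assembled by hand; `DateDemo` and the map layout are assumptions made for this sketch only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateDemo {
    // The outer group "date" encloses the inner groups "day", "month", and "year".
    static final Pattern DATE = Pattern.compile(
        "(?<date>(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4}))");

    // Returns the recorded substrings keyed by their (hand-built) dotted path.
    static Map<String, String> match(String s) {
        Map<String, String> rec = new LinkedHashMap<>();
        Matcher m = DATE.matcher(s);
        if (m.matches()) {
            rec.put("date", m.group("date"));
            rec.put("date.day", m.group("day"));
            rec.put("date.month", m.group("month"));
            rec.put("date.year", m.group("year"));
        }
        return rec;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992"));
        // prints: {date=26/06/1992, date.day=26, date.month=06, date.year=1992}
    }
}
```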
Recording (structured, lists)
Yet another example (yielding lists):
<name = [a-z]+ > " & " <name = [a-z]+ >
( <name = [a-z]+ > "\n" )*
<name = [a-z]+ > (" & " <name = [a-z]+ > )*
Matching against string: "obama & bush"
yields a list structure: name = [obama,bush]
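In java.util.regex, a capturing group inside a repetition keeps only its last match, so a list has to be collected by hand. One possible workaround, sketched below under that assumption, validates the whole string first and then gathers every name with `find()`; `NameListDemo` and `names` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameListDemo {
    // Shape of the third regexp above: a name followed by any number of " & " name.
    static final Pattern WHOLE = Pattern.compile("[a-z]+( & [a-z]+)*");
    static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns all names in order, or an empty list if the string has the wrong shape.
    static List<String> names(String s) {
        List<String> result = new ArrayList<>();
        if (!WHOLE.matcher(s).matches()) return result;
        Matcher m = NAME.matcher(s);
        while (m.find()) result.add(m.group());
        return result;
    }

    public static void main(String[] args) {
        System.out.println("name = " + names("obama & bush"));
        // prints: name = [obama, bush]
    }
}
```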
Abstract Syntax Trees (ASTs)
Ambiguity
Definition:
R ambiguous iff
$\exists\, T, T' \in AST_R : T \neq T' \wedge \|T\| = \|T'\|$
where $\|\cdot\| : AST \to \Sigma^*$ (the flattening) is:
Characterization of Ambiguity
Theorem:
NB: sound & complete !
R unambiguous
iff
R* = ε | RR*
Examples
Ambiguous:
a|a: $L(a) \cap L(a) = \{ a \} \neq \emptyset$
a*a*: $L(a^*) \mathbin{\text{overlap}} L(a^*) = \{ a^n \} \neq \emptyset$
Unambiguous:
a|aa: $L(a) \cap L(aa) = \emptyset$
a*ba*: $L(a^*) \mathbin{\text{overlap}} L(ba^*) = \emptyset$
Ambiguity Examples
a?b+|(ab)*
*** ambiguous choice: a?b+ <-|-> (ab)*
shortest ambiguous string: "ab"
(a|ab)(ba|a)
*** ambiguous concatenation: (a|ab) <--> (ba|a)
shortest ambiguous string: "aba"
(aa|aaa)*
*** ambiguous star: (aa|aaa)*
shortest ambiguous string: "aaaaa"
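The three reports above can be cross-checked by brute force: count how many distinct parse trees a regexp has for a given string, and call the string ambiguous when the count exceeds one. The sketch below is not the syntax-directed analysis the tool performs; `ParseCounter` and its combinators are invented for illustration, and `star` assumes its argument does not match the empty string (otherwise the true count would be infinite).

```java
class ParseCounter {
    /** A regexp node: count(s, i, j) is the number of parse trees flattening to s.substring(i, j). */
    interface Rx { long count(String s, int i, int j); }

    static Rx chr(char c) { return (s, i, j) -> (j == i + 1 && s.charAt(i) == c) ? 1 : 0; }
    static Rx eps() { return (s, i, j) -> i == j ? 1 : 0; }
    static Rx alt(Rx a, Rx b) { return (s, i, j) -> a.count(s, i, j) + b.count(s, i, j); }

    // Concatenation: sum over every split point k of (trees on the left) * (trees on the right).
    static Rx cat(Rx a, Rx b) {
        return (s, i, j) -> {
            long n = 0;
            for (int k = i; k <= j; k++) n += a.count(s, i, k) * b.count(s, k, j);
            return n;
        };
    }

    // Star: either zero iterations, or one non-empty iteration followed by the star again.
    static Rx star(Rx a) {
        return new Rx() {
            public long count(String s, int i, int j) {
                if (i == j) return 1;
                long n = 0;
                for (int k = i + 1; k <= j; k++) n += a.count(s, i, k) * this.count(s, k, j);
                return n;
            }
        };
    }

    static long parses(Rx r, String s) { return r.count(s, 0, s.length()); }

    public static void main(String[] args) {
        Rx a = chr('a'), b = chr('b');
        Rx choice = alt(cat(alt(eps(), a), cat(b, star(b))), star(cat(a, b))); // a?b+|(ab)*
        Rx concat = cat(alt(a, cat(a, b)), alt(cat(b, a), a));                 // (a|ab)(ba|a)
        Rx starred = star(alt(cat(a, a), cat(a, cat(a, a))));                  // (aa|aaa)*
        System.out.println(parses(choice, "ab"));     // 2
        System.out.println(parses(concat, "aba"));    // 2
        System.out.println(parses(starred, "aaaaa")); // 2
    }
}
```

Each reported string has two parse trees, one per side of the ambiguity; "aaaa" under (aa|aaa)* has only one, which is consistent with "aaaaa" being the shortest witness.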
Disambiguation
1) Manual rewriting:
Always possible :-)
Tedious :-(
Error-prone :-(
Not structure-preserving :-(
2) Restriction:
R1 - R2
And then encode...:
R^C as: Σ* - R
R1 & R2 as: (R1^C|R2^C)^C
3) Disambiguators:
From characterization:
concat: 'L', 'R'
choice: '|L', '|R'
star: '*L', '*R'
(partial-order on ASTs)
4) Default disamb:
concat, choice, and star are all left-biased (by default) !
(Our tool does this)
Type Inference
Type Inference:
R : (L,S)
Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
compile (our tool):
class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
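The generated code itself is not shown on the slide. As a hand-written stand-in with the same shape (a `String` field, an `int` field, and a static `match`), one might write something like the following on top of java.util.regex. The class name `PersonSketch` and every implementation detail are assumptions for illustration, not output of the tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonSketch {
    // <name = [a-z]+ > " (" <age = [0-9]+ > ")" written with named groups.
    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name; // [a-z]+ is recorded as a String
    final int age;     // [0-9]+ is recorded as an int

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns the parsed person, or null if s does not match.
    // Note: Integer.parseInt throws NumberFormatException for digit strings beyond the int range.
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) return null;
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = PersonSketch.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // prints: obama is 48y old
    }
}
```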
Examples (Type Inference)
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*
compile (our tool):
class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)\nbush (63)\n";
People p = People.match(s);
println("Second name is " + p.name[1]);
Examples (Type Inference)
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;
compile (our tool):
class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age; }
  ... }
Usage:
String s = "obama (48)\nbush (63)\n";
People people = People.match(s);
for (People.Person p : people.person) println(p.name);
URLs
URLs:
"http://www.google.com/search?q=record&hl=en"
protocol
host
path
query-string (list of key-value pairs)
Regexp:
Host  = <host = [a-z]+ ("." [a-z]+ )* > ;
Path  = <path = [a-z/.]* > ;
Query = <query = [a-z&=]* > ;
URL   = "http://" $Host "/" $Path "?" $Query ;
Query string further structured (list of key-value pairs):
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query  = $KeyVal ("&" $KeyVal)* ;
URLs (Usage Example)
Regexp:
Host   = <host = [a-z]+ ("." [a-z]+ )* > ;
Path   = <path = [a-z/.]* > ;
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query  = $KeyVal ("&" $KeyVal)* ;
URL    = "http://" $Host "/" $Path "?" $Query ;
Usage (example):
String s = "http://www.google.com/search?q=record";
URL url = URL.match(s);
print("Host is: " + url.host);
if (url.key.length>0) print("1st key: " + url.key[0]);
for (String val : url.val) println("value = " + val);
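A rough java.util.regex counterpart of this usage is sketched below. It uses the looser `[a-z&=]*` form of the query and splits the key-value pairs by hand, since a repeated Java group would only keep its last match. `UrlDemo` and its fields are hypothetical names, not the generated `URL` class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDemo {
    // "http://" $Host "/" $Path "?" $Query, with the query captured as one string.
    static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(\\.[a-z]+)*)/(?<path>[a-z/.]*)\\?(?<query>[a-z&=]*)");

    final String host;
    final String path;
    final List<String> keys = new ArrayList<>();
    final List<String> vals = new ArrayList<>();

    private UrlDemo(String host, String path) {
        this.host = host;
        this.path = path;
    }

    // Returns the parsed URL, or null if s does not match the (simplified) URL regexp.
    static UrlDemo match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;
        UrlDemo url = new UrlDemo(m.group("host"), m.group("path"));
        for (String pair : m.group("query").split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) { // skip fragments without "="
                url.keys.add(kv[0]);
                url.vals.add(kv[1]);
            }
        }
        return url;
    }

    public static void main(String[] args) {
        UrlDemo url = UrlDemo.match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host); // prints: Host is: www.google.com
        System.out.println(url.keys + " " + url.vals); // prints: [q, hl] [record, en]
    }
}
```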
Log Files
Format:
13/02/2010 66.249.65.107 /support.html
20/02/2010 42.116.32.64 /search.html
...
Regexp:
Date  = <date = <day = $Day > "/"
                <month = $Month > "/"
                <year = [0-9]{4} > > ;
IP    = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} > ;
Entry = <entry = $Date " " $IP " " $Path "\n" > ;
Log   = $Entry * ;
Usage:
Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP# " + e.ip);
Log Files (cont'd, ambiguity)
Assume we forgot "/" (between day & month):
Regexp:
Day   = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
Month = 0?[1-9] | 10 | 11 | 12 ;
Date  = <date = <day = $Day >          // no slash !
                <month = $Month > "/"
                <year = [0-9]{4} > > ;
Ambiguity:
*** ambiguous concatenation: <day> <--> <month>
shortest ambiguous string: "101"
i.e. "1/01" (January 1) vs. "10/1" (January 10)
Error :-)
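The reported ambiguity is easy to reproduce by trying every way of cutting a string into a day part and a month part. This brute-force check only illustrates the symptom and is not how the tool detects it; `DayMonthSplit` and `splits` are names made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DayMonthSplit {
    static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    // Lists every split of s into day + month, written "day/month".
    static List<String> splits(String s) {
        List<String> result = new ArrayList<>();
        for (int k = 0; k <= s.length(); k++) {
            String day = s.substring(0, k);
            String month = s.substring(k);
            if (DAY.matcher(day).matches() && MONTH.matcher(month).matches()) {
                result.add(day + "/" + month);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splits("101")); // prints: [1/01, 10/1]
        System.out.println(splits("11"));  // prints: [1/1]
    }
}
```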
DBLP (Format)
DBLP (XML) Format:
<article>
<author>Noam Chomsky</author>
<title>Three Models for the Description of Language</title>
<year>1956</year>
<journal>IRE Transactions on Information Theory</journal>
</article>
<article>
<author>Claus Brabrand</author>
<author>Jakob G Thomsen</author>
<title>Typed and Unambiguous Pattern Matching
on Strings using Regular Expressions</title>
<year>2010</year>
<note>Submitted</note>
</article>
...
DBLP (Regexp)
DBLP Regexp:
Author  = "<author>" <author = [a-z]* > "</author>" ;
Title   = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP    = <pub = $Article > * ;
Ambiguity !:
*** ambiguous star: <pub>*
shortest ambiguous string:
"<article><title></title></article><article><title></title></article>"
EITHER 2 publications (.* = "")
OR 1 publication (.* = "</article><article><title></title>") !!!
DBLP (Disambiguated)
DBLP Regexp:
Author  = "<author>" <author = [a-z]* > "</author>" ;
Title   = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP    = <pub = $Article > * ;
Disambiguated (using "(R1-R2)"):
Article = "<article>"
$Author* $Title (.* - (.* "</article>" .*))
"</article>" ;
Unambiguous! :-)
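java.util.regex has no restriction operator, but a similar effect is commonly obtained with a negative lookahead that forbids "</article>" inside the body. The sketch below swaps in that technique to show the difference between the unrestricted and the restricted `.*`; `ArticleSplitDemo` and `count` are illustrative names, and the element content is simplified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleSplitDemo {
    // Unrestricted body: .* may run across "</article>" into the next article.
    static final Pattern GREEDY =
        Pattern.compile("<article>.*</article>", Pattern.DOTALL);

    // Restricted body: each character is consumed only if "</article>" does not start there,
    // a lookahead approximation of (.* - (.* "</article>" .*)).
    static final Pattern RESTRICTED =
        Pattern.compile("<article>(?:(?!</article>).)*</article>", Pattern.DOTALL);

    // Counts the successive, non-overlapping matches of p in s.
    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String two = "<article><title>a</title></article>"
                   + "<article><title>b</title></article>";
        System.out.println(count(GREEDY, two));     // 1
        System.out.println(count(RESTRICTED, two)); // 2
    }
}
```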
DBLP (Usage Example)
DBLP Regexp:
Author  = "<author>" <author = [a-z]* > "</author>" ;
Title   = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP    = <article = $Article > * ;
Usage (example):
DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
for (Article a: dblp.article)
print("Title: " + a.title);
Evaluation
Evaluation summary:
[ Frisch&Cardelli'04 ]
[ NP-Complete ]
[ MatMult ]
Also, (Type-3) regexps expressive "enough"
for: URLs, Log files, DBLP, ...
Type-3 vs. Type-0 (URLs)
Regexps vs. Java:
Regexps are 8 times more concise !
java.util.regex vs. Our approach
Efficiency (on DBLP):
java.util.regex:
Exponential $O(2^{|\omega|})$
2,500 chars in 2 mins !
In contrast; ours:
Linear (on DBLP)
1,200,000 chars in 6 secs !
Related Work
Recording (with lists in general):
"x as R" in XDuce; "x::R" in CDuce; and
"x@R" in Scala and HaRP
Ambiguity:
[Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce
but indirectly via NFAs, not directly (syntax-directed)
Disambiguation:
[Vansummeren'06] but with global, not local disambiguation
Type inference:
Exact type inference in XDuce & CDuce
(soundness+completeness proof in [Vansummeren'06])
but not for stand-alone and non-intrusive usage (Java)
Conclusion
For string pattern matching, it is possible to:
"trade (excess) expressivity for safety+simplicity"
In conclusion:
We conclude that if regular expressions are sufficiently
expressive, they provide a simple, declarative, and safe
means for pattern matching on strings, capable of extracting
highly structural information in a statically type-safe and
unambiguous manner.
i.e., ambiguity checking and type inference !
+ stand-alone & non-intrusive language integration (Java) !
</Talk>
[ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ]
Questions ? Complaints ?