Proposed technology plan Toro phase I: augment Cognos with
Download
Report
Transcript Proposed technology plan Toro phase I: augment Cognos with
Regular Expression Learning
for Information Extraction
Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*,
Shivakumar Vaithyanathan*, H. V. Jagadish○
*IBM
Almaden Research Center
○
University of Michigan
http://www.almaden.ibm.com/cs/projects/avatar/
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
Importance of Regular Expression (Regex)
Regex is essential to many information extraction (IE) tasks
Email addresses
Web collections
Software names
Credit card numbers
Email compliance
Social security numbers
Gene and Protein names
bioinformatics
….
But … writing regexes for an IE task is not straightforward
© 2008 IBM Corporation
Phone Number Extraction
A simple pattern:
blocks of digits separated by non-word character:
R0 = (\d+\W)+\d+
Identifies valid phone numbers (e.g. 800-865-1125, 725-1234)
Produces invalid matches (e.g. 123-45-6789, 10/19/2002, 1.25 …)
Misses valid phone numbers (e.g. (800) 865-CARE)
© 2008 IBM Corporation
Software Name Extraction
A simple pattern:
blocks of capitalized words followed by version number:
R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)
Misses valid software names (e.g. Windows XP)
© 2008 IBM Corporation
Conventional Regex Writing Process for IE
Regex3210
Sample
Documents
Match 1
Match 2
…
(\d+\W)+\d{4}
(\d+\W)+\d+
(\d{3}[\.\s\-])+\d{4}
(\d+[\.\s\-])+\d{4}
800-865-1125
725-1234
…
123-45-6789
10/19/2002
1.25
…
Y
Good Enough?
N
© 2008 IBM Corporation
Regexfinal
Our goal - Learning Regexfinal automatically
Regex0
Sample
Documents
Match 1
Match 2
…
Labeled Matches
© 2008 IBM Corporation
NegMatch 1
…
NegMatch m0
PosMatch 1
…
PosMatch n0
ReLIE
Regexfinal
Intuition
([A-Z] [a-z] {1,10}\s){1,5}\s*( [a-zA-z] {0,2}\d[\.]?){1,4}
…
…
Compute F-measure
([A-Z] [a-z] {1,10}\s){1,5} \s*( \d {0,2}\d[\.]?){1,4}
F1
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\ [a-zA-Z] {0,2}\d[\.]?){1,4}
([A-Z] [a-z] {1,10}\s) {1,2} \s*(\\w{0,2}\d[\.]?){1,4}
…
…
…
…
R’
…
([A-Z] [a-z] {1,10}\s){1,5} \s*(\\w{0,2}\d[\.]?){1,3}
([A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}
…………..
F7
([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
F8
(((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)
[A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}
…
…
…
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
…
R0
([A-Z] [a-z] {1,10}\s){1,5}\s*
(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
F34
((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)
[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
…
…
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*
(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
F35
F48
• Generate candidate regular expressions by modifying current regular expression
• Select the “best candidate” R’
• If R’ has better than current regular expression, repeat the process
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
Regex Learning Problem
Ideally:
find the best Rf among all possible regexes
How do we define the best?
Highest F-measure over a document collection D.
We can only compute F-measure based on the labeled data
Must limited Rf such that any match of Rf is also a match of R0
© 2008 IBM Corporation
Regex Learning as a Search Problem
-
-
-
-
+
- - ++ + + +
+
+ +
+ + + + + +
-
-
-
+
- -++
+
+ +
+
+
- +
- - +
+ +
+ +
+ +
+
M(R
,
D)
+
- --M(R
f
- 0, D) +
+
+ + + +
+ + ++
+
++
+ ++
+
- +
+
+ +
- +
+
+
+
+ +
+
+
+
+
+
+ ++
- - - +
- -
M(R, D): Matches of R over document collection D.
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
Two Regex Transformations
Drop-disjunct Transformation:
R = Ra(R1| R2|… Ri| Ri+1|…| Rn) Rb R’ = Ra (R1| … Ri|…) Rb
Include-Intersect Transformation
R = RaXRb R’ = Ra(X Y) Rb
where Y
© 2008 IBM Corporation
Applying Drop-Disjunct Transformation
Character Class Restriction
E.g. To restrict the matching of non-word characters
(\d+\W)+\d+ (\d+[\.\s\-])+\d+
Quantifier Restriction
E.g. To restrict the number of digits in a block
(\d+\W)+\d+ (\d{3}\W)+\d+
© 2008 IBM Corporation
Applying Include-Intersect Transformation
Negative Dictionaries
Disallow certain words from matching specific portions of the regex
E.g. a simple pattern for software name extraction:
blocks of capitalized words followed by version number:
R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
Identifies valid software name (e.g. Eclipse 3.2, Windows 2000)
Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)
([A-Z]\w*\s*)+[Vv]?(\d+\.?)+ (((?! ENGLISH|Room|Chapter) [A-Z]\w*\s*)+[Vv]?(\d+\.?)+
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
ReLIE Algorithm
([A-Z] [a-z] {1,10}\s){1,5}\s*( [a-zA-z] {0,2}\d[\.]?){1,4}
…
…
Compute F-measure
([A-Z] [a-z] {1,10}\s){1,5} \s*( \d {0,2}\d[\.]?){1,4}
F1
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\ [a-zA-Z] {0,2}\d[\.]?){1,4}
([A-Z] [a-z] {1,10}\s) {1,2} \s*(\\w{0,2}\d[\.]?){1,4}
([A-Z] [a-z] {1,10}\s){1,5} \s*(\\w{0,2}\d[\.]?){1,3}
…
Quantifier
restrictions
([A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}
…
…
…
…
R’
…………..
F7
([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
F8
…
Negative
dictionary
(((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)
[A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}
…
Quantifier
restrictions
…
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
…
R0
([A-Z] [a-z] {1,10}\s){1,5}\s*
(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
F34
((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)
[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*
(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
F35
…
…
Negative
dictionary
F48
• Generate candidate regular expressions by applying a single transformation
• Select the “best candidate” R’ based on F-measure on training corpus
• If R’ has better F-measure than current regular expression, repeat the process
• Use validation set to avoid over-fitting
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
Experimental Set Up
Data Set
EWeb: 50K web pages from IBM intranet
AWeb: 50K web pages from University of Michigan web site.
• AWeb-S: subset of 10K pages from AWeb
Email: 10K emails from Enron collection
Extraction Tasks
SoftwareNameTask
PhoneNumberTask
CourseNumberTask
URLTask
Comparison Study
ReLIE
Conditional Random Fields (CRF):
• Base feature set
– matches corresponding to the input regex
– three adjacent words to each side of the matches
© 2008 IBM Corporation
Extraction Quality
(b) CourseNumberTask
(a) SoftwareNameTask
1
1
0.9
F-Measure
F-Measure
0.9
0.8
Program repeatedly failed
at training phrase.
0.7
0.8
0.7
0.6
0.6
ReLIE
CRF
0.5
10%
40%
10%
80%
CRF
40%
80%
Percentage of Data Used for Training
(d) PhoneNumberTask
Percentage(c)ofURLTask
Data Used for Training
1
1
0.9
0.9
F-Measure
F-Measure
ReLIE
0.5
0.8
0.7
0.8
0.7
0.6
0.6
ReLIE
CRF
0.5
ReLIE
CRF
0.5
10%
40%
Percentage of Data Used for Training
80%
10%
40%
80%
Percentage of Data Used for Training
ReLIE performs comparably with CRF with a slight edge with limited training data
© 2008 IBM Corporation
Cross-domain Evaluation
(a) SoftwareNameTask
training: EWeb, testing: AWeb
1
(b) CourseNameTask is not tested,
as course names exist only in AWeb.
F-Measure
0.8
0.6
0.4
ReLIE
0.2
CRF
0
10%
40%
80%
(c) URLTask
training: Aweb-S, testing: Email
(d) PhoneNumberTask
training: Email testing: AWeb
1
1
F-Measure
F-Measure
0.8
0.6
0.4
0.2
ReLIE
0
CRF
10%
40%
80%
Percentage of Data Used for Training
ReLIE
0.8
0.6
0.4
0.2
0
10%
40%
Percentage of Data Used for Training
ReLIE significantly outperforms CRF for all three tasks
© 2008 IBM Corporation
CRF
80%
Performance
Average Training/Testing Time (sec)(with 40% data for training)
ReLIE is an order of magnitude faster than CRF for both training and testing
© 2008 IBM Corporation
What has ReLIE learned?
Patterns learned by ReLIE are similar to features manually given to CRF
© 2008 IBM Corporation
ReLIE as Feature Extractor for CRF
C+RL: CRF + features learned by ReLIE
• Token level features learned by ReLIE
• helpful when the training data is small
• Character level features learned by ReLIE
• always helpful
© 2008 IBM Corporation
Outline
Motivation
Regex Learning Problem
Regex Transformations
ReLIE Search Algorithm
Experiments
Summary
© 2008 IBM Corporation
ReLIE
Effective for learning regexes for certain classes of IE
Particularly useful when
cross-domain, or
limited training data
Potentially becoming a powerful feature extractor for CRF and
other machine learning algorithms.
© 2008 IBM Corporation
http://www.almaden.ibm.com/cs/projects/avatar/
© 2008 IBM Corporation