Regular Expression Learning
for Information Extraction
Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*,
Shivakumar Vaithyanathan*, H. V. Jagadish○
*IBM Almaden Research Center
○University of Michigan
http://www.almaden.ibm.com/cs/projects/avatar/
© 2008 IBM Corporation
Outline
 Motivation
 Regex Learning Problem
 Regex Transformations
 ReLIE Search Algorithm
 Experiments
 Summary
Importance of Regular Expression (Regex)
 Regex is essential to many information extraction (IE) tasks
 Web collections: email addresses, software names
 Email compliance: credit card numbers, social security numbers
 Bioinformatics: gene and protein names
 ….
But … writing regexes for an IE task is not straightforward
Phone Number Extraction
 A simple pattern:
blocks of digits separated by non-word character:
R0 = (\d+\W)+\d+
 Identifies valid phone numbers (e.g. 800-865-1125, 725-1234)
 Produces invalid matches (e.g. 123-45-6789, 10/19/2002, 1.25 …)
 Misses valid phone numbers (e.g. (800) 865-CARE)
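The behaviour of R0 on the examples above can be checked directly. The following is a minimal sketch using Python's `re` module; treating each example as a standalone string and testing it with `fullmatch` is an assumption made for illustration, and the helper name `classify` is hypothetical.

```python
import re

# R0 from the slide: blocks of digits separated by a non-word character.
R0 = re.compile(r"(\d+\W)+\d+")

# Examples from the slide, labeled True if they are real phone numbers.
examples = {
    "800-865-1125": True,
    "725-1234": True,
    "123-45-6789": False,    # social-security-style number
    "10/19/2002": False,     # date
    "1.25": False,           # decimal number
    "(800) 865-CARE": True,
}

def classify(regex, labeled):
    """Split labeled strings into correct matches, invalid matches, and misses."""
    hits, false_hits, misses = [], [], []
    for text, is_phone in labeled.items():
        matched = regex.fullmatch(text) is not None
        if matched and is_phone:
            hits.append(text)
        elif matched:
            false_hits.append(text)
        elif is_phone:
            misses.append(text)
    return hits, false_hits, misses

hits, false_hits, misses = classify(R0, examples)
print("valid matches:  ", hits)
print("invalid matches:", false_hits)
print("missed:         ", misses)
```

The three lists correspond to the three bullets on the slide: what R0 gets right, what it wrongly accepts, and what it cannot reach at all.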
Software Name Extraction
 A simple pattern:
blocks of capitalized words followed by version number:
R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
 Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
 Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)
 Misses valid software names (e.g. Windows XP)
Conventional Regex Writing Process for IE
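[Figure: manual refinement loop. Regex0 is run over the sample documents to produce matches (Match 1, Match 2, …). If the matches are not good enough (N), the regex is revised by hand (Regex1, Regex2, Regex3) and run again; once they are good enough (Y), the result is Regexfinal.]
Regexes shown: (\d+\W)+\d+, (\d+\W)+\d{4}, (\d+[\.\s\-])+\d{4}, (\d{3}[\.\s\-])+\d{4}
Sample matches shown: 800-865-1125, 725-1234, …, 123-45-6789, 10/19/2002, 1.25, …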
Our goal - Learning Regexfinal automatically
[Figure: Regex0 is run over the sample documents to produce matches (Match 1, Match 2, …). The labeled matches (NegMatch 1 … NegMatch m0, PosMatch 1 … PosMatch n0) are the input to ReLIE, which outputs Regexfinal.]
Intuition
[Figure: search tree rooted at R0. Each candidate regex is scored by F-measure (F1, …, F7, F8, …, F34, F35, …, F48); the best candidate is R’.]
R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidate regexes shown in the tree include:
([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
• Generate candidate regular expressions by modifying current regular expression
• Select the “best candidate” R’
• If R’ has a better F-measure than the current regular expression, repeat the process
Regex Learning Problem
 Ideally:
 find the best Rf among all possible regexes
 How do we define the best?
 Highest F-measure over a document collection D.
 We can only compute F-measure based on the labeled data
 Rf must be restricted so that any match of Rf is also a match of R0
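Because every match of Rf must already be a match of R0, the F-measure of a candidate can be estimated from the labeled matches of R0 alone. The sketch below is one way to do that in Python, under two assumptions that are not stated on the slide: a labeled match counts as kept when the candidate regex fully matches its text, and recall is measured relative to the positive matches of R0. The function name `f_measure` and the sample data are illustrative.

```python
import re

def f_measure(regex, positives, negatives):
    """Estimate F-measure of `regex` over the labeled matches of R0.

    positives / negatives: texts of R0 matches labeled correct / incorrect.
    Assumption: a labeled match is kept if `regex` fully matches its text.
    """
    pattern = re.compile(regex)
    kept_pos = sum(1 for s in positives if pattern.fullmatch(s))
    kept_neg = sum(1 for s in negatives if pattern.fullmatch(s))
    if kept_pos == 0:
        return 0.0
    precision = kept_pos / (kept_pos + kept_neg)
    # Recall is relative to R0's positives: the candidate can only keep
    # a subset of R0's matches, never add new ones.
    recall = kept_pos / len(positives)
    return 2 * precision * recall / (precision + recall)

# Labeled matches of R0 = (\d+\W)+\d+, taken from the phone number slide.
positives = ["800-865-1125", "725-1234"]
negatives = ["123-45-6789", "10/19/2002", "1.25"]

f_r0 = f_measure(r"(\d+\W)+\d+", positives, negatives)
f_restricted = f_measure(r"(\d{3}\W)+\d+", positives, negatives)
print(f_r0, f_restricted)
```

On this toy data R0 keeps all five labeled matches, while the restricted regex keeps only the two positives, so its estimated F-measure is higher.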
Regex Learning as a Search Problem
[Figure: labeled matches plotted as positive (+) and negative (-) points. The region M(Rf, D) lies inside the region M(R0, D).]
M(R, D): Matches of R over document collection D.
Two Regex Transformations
 Drop-disjunct Transformation:
R = Ra(R1|R2|…|Ri|Ri+1|…|Rn)Rb → R’ = Ra(R1|…|Ri|…)Rb
 Include-Intersect Transformation:
R = RaXRb → R’ = Ra(X ∩ Y)Rb
where Y ≠ ∅
Applying Drop-Disjunct Transformation
 Character Class Restriction
E.g. To restrict the matching of non-word characters
(\d+\W)+\d+ → (\d+[\.\s\-])+\d+
 Quantifier Restriction
E.g. To restrict the number of digits in a block
(\d+\W)+\d+ → (\d{3}\W)+\d+
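Both restrictions only ever remove matches, which is what keeps the learned regex inside the match set of R0. A small Python sketch can make this concrete; the sample strings come from the phone number slide, and the use of `fullmatch` on isolated strings and the helper name `matches` are assumptions for illustration.

```python
import re

R0 = r"(\d+\W)+\d+"
CHAR_CLASS_RESTRICTED = r"(\d+[\.\s\-])+\d+"  # \W narrowed to '.', whitespace, '-'
QUANTIFIER_RESTRICTED = r"(\d{3}\W)+\d+"      # \d+ narrowed to exactly three digits

samples = ["800-865-1125", "725-1234", "123-45-6789", "10/19/2002", "1.25"]

def matches(regex, strings):
    """Return the subset of `strings` that `regex` matches in full."""
    pattern = re.compile(regex)
    return {s for s in strings if pattern.fullmatch(s)}

m0 = matches(R0, samples)
m_cc = matches(CHAR_CLASS_RESTRICTED, samples)
m_q = matches(QUANTIFIER_RESTRICTED, samples)

# Each restricted regex matches a subset of what R0 matches.
print(m_cc <= m0, m_q <= m0)
print("dropped by character class restriction:", m0 - m_cc)
print("dropped by quantifier restriction:     ", m0 - m_q)
```

On these samples the character class restriction removes only the date, while the quantifier restriction removes all three invalid matches and keeps both phone numbers.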
Applying Include-Intersect Transformation
 Negative Dictionaries
 Disallow certain words from matching specific portions of the regex
E.g. a simple pattern for software name extraction:
blocks of capitalized words followed by version number:
R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
 Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
 Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)
([A-Z]\w*\s*)+[Vv]?(\d+\.?)+ → ((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+
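A negative dictionary can be expressed with a negative lookahead placed in front of the portion of the regex it constrains. The Python sketch below builds such a regex from a word list; the helper name `with_negative_dictionary` is hypothetical, the word list is assumed to be non-empty, and checking isolated strings with `fullmatch` is a simplification.

```python
import re

R0 = r"([A-Z]\w*\s*)+[Vv]?(\d+\.?)+"

def with_negative_dictionary(words):
    """Return the software-name regex with `words` disallowed as capitalized words.

    Assumes `words` is non-empty; an empty lookahead would reject everything.
    """
    blocked = "|".join(re.escape(w) for w in words)
    return rf"((?!{blocked})[A-Z]\w*\s*)+[Vv]?(\d+\.?)+"

restricted = with_negative_dictionary(["ENGLISH", "Room", "Chapter"])

candidates = ["Eclipse 3.2", "Windows 2000", "ENGLISH 123", "Room 301", "Chapter 1.2"]
for text in candidates:
    before = re.fullmatch(R0, text) is not None
    after = re.fullmatch(restricted, text) is not None
    print(f"{text!r}: R0={before}, with dictionary={after}")
```

The lookahead is checked at the start of every capitalized word, so a blocked word is rejected wherever it appears in the block of capitalized words, not only at the beginning.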
ReLIE Algorithm
[Figure: search tree rooted at R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}. Branches are labeled by the kind of transformation that produced them (quantifier restrictions, negative dictionary). Each candidate is scored by F-measure (F1, …, F7, F8, …, F34, F35, …, F48), and the best one becomes R’.]
Quantifier restrictions, e.g.:
([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
Negative dictionary, e.g.:
((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
• Generate candidate regular expressions by applying a single transformation
• Select the “best candidate” R’ based on F-measure on training corpus
• If R’ has better F-measure than current regular expression, repeat the process
• Use validation set to avoid over-fitting
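The four steps above describe a greedy search. The Python sketch below illustrates that loop on the phone number example; it is not the ReLIE implementation. The candidate generator is a hypothetical stand-in that applies fixed textual rewrites to the regex string, the labeled data is toy data, and the rule of rejecting a candidate that lowers the validation F-measure is an assumed reading of "use validation set".

```python
import re

def f_measure(regex, positives, negatives):
    """F-measure over labeled matches (kept = regex fully matches the text)."""
    pattern = re.compile(regex)
    tp = sum(1 for s in positives if pattern.fullmatch(s))
    fp = sum(1 for s in negatives if pattern.fullmatch(s))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(positives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical stand-ins for single transformations: (old, new) text rewrites.
REWRITES = [
    (r"\W", r"[.\s-]"),        # character class restriction
    (r"(\d+", r"(\d{3}"),      # quantifier restriction on each digit block
    (r")+\d+", r")+\d{4}"),    # quantifier restriction on the last block
]

def candidates(regex):
    """Yield regexes obtained by applying exactly one rewrite."""
    for old, new in REWRITES:
        if old in regex:
            yield regex.replace(old, new, 1)

def learn(r0, train, valid):
    """Greedy search: keep taking the best candidate while F-measure improves."""
    current = r0
    while True:
        scored = [(f_measure(c, *train), c) for c in candidates(current)]
        if not scored:
            return current
        best_f, best = max(scored, key=lambda pair: pair[0])
        if best_f <= f_measure(current, *train):
            return current  # no candidate improves on the training set
        if f_measure(best, *valid) < f_measure(current, *valid):
            return current  # assumed over-fitting guard
        current = best

# Toy labeled matches as (positives, negatives); invented for illustration.
train = (["800-865-1125", "725-1234", "408.927.1234"],
         ["123-45-6789", "10/19/2002", "1.25"])
valid = (["415 555 2671"], ["3.14", "2008/10/25"])

learned = learn(r"(\d+\W)+\d+", train, valid)
print(learned)
```

On this toy data the first round selects the quantifier restriction on the digit blocks; the second round finds no candidate with a strictly better training F-measure, so the search stops.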
Experimental Set Up
 Data Set
 EWeb: 50K web pages from IBM intranet
 AWeb: 50K web pages from University of Michigan web site.
• AWeb-S: subset of 10K pages from AWeb
 Email: 10K emails from Enron collection
 Extraction Tasks
 SoftwareNameTask
 PhoneNumberTask
 CourseNumberTask
 URLTask
 Comparison Study
 ReLIE
 Conditional Random Fields (CRF):
• Base feature set
– matches corresponding to the input regex
– three adjacent words to each side of the matches
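The base feature set described above could be computed along the following lines. This is only a sketch of the idea in Python: the function name `base_features`, the whitespace tokenization, and the dictionary layout are assumptions, and the slides do not say how the features were actually encoded for the CRF.

```python
import re

def base_features(text, regex, window=3):
    """For each match of `regex`, collect the match and its neighbouring words.

    Assumption: words are whitespace-separated tokens, and `window` words
    are taken on each side of the match.
    """
    features = []
    for m in re.finditer(regex, text):
        left = text[:m.start()].split()[-window:]
        right = text[m.end():].split()[:window]
        features.append({"match": m.group(0), "left": left, "right": right})
    return features

feats = base_features(
    "Please call 800-865-1125 before noon on Friday",
    r"(\d+\W)+\d+",
)
print(feats)
```

When fewer than three words are available on a side, as on the left here, the sketch simply returns the words that exist.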
Extraction Quality
[Figure: F-measure (y-axis, 0.5 to 1) versus percentage of data used for training (x-axis: 10%, 40%, 80%) for ReLIE and CRF, in four panels: (a) SoftwareNameTask, (b) CourseNumberTask, (c) URLTask, (d) PhoneNumberTask. One panel carries the annotation "Program repeatedly failed at training phase."]
ReLIE performs comparably to CRF, with a slight edge when training data is limited
Cross-domain Evaluation
[Figure: F-measure (y-axis, 0 to 1) versus percentage of data used for training (x-axis: 10%, 40%, 80%) for ReLIE and CRF, in three panels: (a) SoftwareNameTask (training: EWeb, testing: AWeb); (c) URLTask (training: AWeb-S, testing: Email); (d) PhoneNumberTask (training: Email, testing: AWeb). (b) CourseNumberTask is not tested, as course numbers exist only in AWeb.]
ReLIE significantly outperforms CRF for all three tasks
Performance
Average Training/Testing Time (sec), with 40% of the data used for training
ReLIE is an order of magnitude faster than CRF for both training and testing
What has ReLIE learned?
Patterns learned by ReLIE are similar to features manually given to CRF
ReLIE as Feature Extractor for CRF
C+RL: CRF + features learned by ReLIE
• Token-level features learned by ReLIE
– helpful when the training data is small
• Character-level features learned by ReLIE
– always helpful
ReLIE
 Effective for learning regexes for certain classes of IE
 Particularly useful when
 cross-domain, or
 limited training data
 Potentially becoming a powerful feature extractor for CRF and
other machine learning algorithms.