Transcript Segue

Declarative Information Extraction
The Avatar Group
IBM Almaden Research Center
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss,
Shivakumar Vaithyanathan, and Huaiyu Zhu
Sonoma State University Computer Science Colloquium
03/06/2008
© 2008 IBM Corporation
Motivation
John and Jane are going to a salsa party tonight! But …
“Where is the party?”
“Hmmm… I don’t know. Let me check my email.”
Where is the party?
Search “salsa address”: 0 emails found
Search “salsa”: 100 emails found
Search “address”: 0 emails found
One of the emails:
Hi guys,
We are planning a salsa party tonight starting at
10:00pm for our class at Miami Beach Club,
175 San Pedro Square
San Jose, CA 95109
Whoever who is interested, please let me know
so we can organize some car-pooling.
PS: you can call me at 408.123.4567
if needed.
-Juan
The address of the party! (175 San Pedro Square, San Jose, CA 95109)
But the email itself does not contain the word “address”!
Information Extraction

Distill structured data from unstructured and semi-structured text
– E.g. extracting phone numbers from emails, extracting person names from the web

Exploit the extracted data in your applications
– E.g. for search, for advertisement
Email:
Hi guys,
We are planning a salsa party tonight starting at 10:00pm for
our salsa class at Miami Beach Club,
175 San Pedro Square
San Jose, CA 95109
Whoever who is interested, please let me know so we can
organize some car-pooling.
-Juan
PS: you can call me at 408.123.4567 if needed.
Extracted EVENTS table:
Event | Address
salsa party | 175 San Pedro Square ...
... | ...
Query:
Select Address
From EVENTS
Where event = ‘salsa party’
Result: 175 San Pedro Square …
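The extract-then-query idea on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the Avatar annotators work: the three regular expressions and the `extract_event` helper are hypothetical and tuned only to this one sample email, and the in-memory `EVENTS` table simply mirrors the two columns shown on the slide.

```python
import re
import sqlite3

EMAIL = """Hi guys,
We are planning a salsa party tonight starting at 10:00pm for
our salsa class at Miami Beach Club,
175 San Pedro Square
San Jose, CA 95109
Whoever who is interested, please let me know so we can
organize some car-pooling.
-Juan
PS: you can call me at 408.123.4567 if needed."""

# Hypothetical patterns, written only for this example email.
STREET = re.compile(r"^\d+ [A-Z][\w ]+$", re.MULTILINE)
CITY = re.compile(r"^[A-Z][\w ]+, [A-Z]{2} \d{5}$", re.MULTILINE)
EVENT = re.compile(r"\b(\w+ party)\b")


def extract_event(text):
    """Return (event, address), or None if any piece is missing."""
    event = EVENT.search(text)
    street = STREET.search(text)
    city = CITY.search(text)
    if not (event and street and city):
        return None
    return event.group(1), f"{street.group(0)}, {city.group(0)}"


# Store the extracted record as structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EVENTS (event TEXT, address TEXT)")
record = extract_event(EMAIL)
if record is not None:
    conn.execute("INSERT INTO EVENTS VALUES (?, ?)", record)

# The query from the slide, now answerable even though the email
# never contains the word "address".
rows = conn.execute(
    "SELECT address FROM EVENTS WHERE event = ?", ("salsa party",)
).fetchall()
print(rows)
```

The point of the sketch is the split of responsibilities: extraction runs once per document, and the application then works against ordinary structured data.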
Revisit: Where is the Party?
salsa address
Lotus Notes 8.01 Live Text
San Jose, CA 95109
Other Commercial Applications
And many others
 Literature Citations / Research Communities
– DBLife
– Google Scholar
 Terminology Extraction
 Document Summarization
 Life Science
– E.g. Gene Sequence Extraction, Protein Interaction Extraction
……
As the amount of data in text explodes, information extraction is becoming increasingly important!
Basic Terminology
Annotator: a program used to extract structured data
Annotations: the structured data extracted by annotators
[Figure: documents → Annotator, Annotator, …, Annotator → annotations → Data Repository → Higher Level Applications]
Background: Avatar
 Working on information extraction (IE) since 2003
 Main goals:
– Extract structured information from text
– Build a system that can scale IE to real enterprise apps
– Build new enterprise applications that leverage IE
Evolution of the Avatar IE System
Timeline, 2004 to 2008, with the evolutionary triggers that led to each next system:
2004: Custom Code
– Trigger: Large number of annotators
2005: RAP (CPSL-style cascading grammar system)
– Triggers: Diverse data sets, Complex extraction tasks
2006: RAP++ (RAP + Extensions outside the scope of grammars)
– Triggers: Performance, Expressivity
2007: System T (algebraic information extraction system)
2008
The Custom Code Era
Extracting Information with Custom Code
 “It’s just pattern matching”
– Use scripts and regular expressions
 Then reality sets in…
– Dozens of rules, even for simple concepts
– Many special cases
– Convoluted logic
– Painfully slow code
The Age of Cascading Grammars
Historical Perspective
 MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA
– Shared data sets and performance metrics
• News articles, Radio transcripts, Military telegraphic messages
 Classical IE Tasks
– Entity and Relationship/Link extraction
– Event detection, sentiment mining etc.
– Entity resolution/matching
 Several IE systems were built
– FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS [Appelt96],
LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Cascading Finite-state Grammars
Most IE systems share a common formalism
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens
– Several levels of processing → Cascading Grammars
 CPSL
– A standard language for specifying cascading grammars
– Created in 1998
 Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
• Part of the GATE NLP framework
• Under active consideration for commercial use by several companies
Cascading Grammars By Example
Level 0 (Tokenize)
… sed John Smith at 555-1212 hendrerit …
Level 1
Token[~ “John | Smith| …”]+ → Name
… <Name> at 555-1212 …
Token[~ “[1-9]\d{2}-\d{4}”] → Phone
… John Smith at <Phone> …
Result of Level 1: … sed <Name> at <Phone> hendrerit …
Level 2
Name Token[~ “at”] Phone → PersonPhone
… sed <PersonPhone>, hendrerit …
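The cascade above can be sketched as three passes in Python. This is a simplified illustration, not CPSL or JAPE: the tokenizer, the two-entry name dictionary, and the function names `tokenize`, `level1` and `level2` are assumptions made for the example. Each level consumes the complete output of the level below it.

```python
import re

NAME_DICT = {"John", "Smith"}                 # hypothetical name dictionary
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")    # phone pattern from the slide


def tokenize(text):
    """Level 0: split text into tokens, keeping hyphenated runs together."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def level1(tokens):
    """Level 1: label tokens as Name or Phone; merge adjacent Name tokens
    (the '+' in the Name rule). Everything else stays a plain Token."""
    out = []
    for tok in tokens:
        if tok in NAME_DICT:
            if out and out[-1][0] == "Name":
                out[-1] = ("Name", out[-1][1] + " " + tok)
            else:
                out.append(("Name", tok))
        elif PHONE_RE.fullmatch(tok):
            out.append(("Phone", tok))
        else:
            out.append(("Token", tok))
    return out


def level2(annots):
    """Level 2: Name Token[~ "at"] Phone -> PersonPhone."""
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        if (len(window) == 3
                and window[0][0] == "Name"
                and window[1] == ("Token", "at")
                and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(t for _, t in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out


text = "sed John Smith at 555-1212 hendrerit"
result = level2(level1(tokenize(text)))
print(result)
```

Note that every level walks the whole sequence produced by the previous one, whether or not anything in it can match.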
Experiences with Cascading Grammars
 Benefits
– Big step forward from custom code
– Can express many simple concepts
 Drawbacks
– Expressiveness
• Dealing with overlap
• Building complex structures
– Performance
Sequencing Overlapping Input Annotations
Example rule from the Band Review:
<ProperNoun> <within 30 characters> <Instrument>
– ProperNoun: Regular Expression Match <[A-Z]\w+(\s[A-Z]\w+)?>
– Instrument: Dictionary Match <d1|d2|…dn>
Example inputs, with overlapping ProperNoun and Instrument annotations:
Marco Doe on the Hammond organ
John Pipe plays the guitar
Sequencing Overlapping Input Annotations
 Possible options
– Pre-specified disambiguation rules (e.g., pick earlier annotation)
– Supply tie-breaking rules for every possible overlap scenario
– Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..)
Which of the two should we pick? Prefer ProperNoun over Instrument?
Over 4.5M blog entries, a choice one way or another on a single rule would change the number of annotations by +/- 25%.
There is no magic!
John Pipe plays the guitar
– John Pipe | plays | the | guitar: ProperNoun Token Token Instrument
– John | Pipe | plays | the | guitar: ProperNoun Instrument Token Token Instrument
Marco Doe on the Hammond organ
– Marco Doe | on | the | Hammond organ: ProperNoun Token Token Instrument
– Marco Doe | on | the | Hammond | organ: ProperNoun Token Token ProperNoun Token
Complex Structures Example: Signature Annotator
Example signature, with annotation labels in brackets:
Laura Haas, PhD [Person]
Distinguished Engineer and Director, Computer Science
Almaden Research Center [Organization]
408-927-1700 [Phone]
http://www.almaden.ibm.com/cs [URL]
Rule:
– Start with Person
– Within 50 tokens
– At least 1 Phone
– At least 2 of {Phone, Organization, URL}
– End with one of these.
Complex Structures: Existing Solutions
 Approximate using regular expressions
 Example: Signature
– Rule: (Person Token{,25} Phone (Token{,25} Contact)+) |
(Person (Token{,25} Contact)+ Token{,25} Phone
(Token{,25} Contact)*)
– Problems:
• Need to enumerate all possible orders of sub-annotations
– What if you want at least one phone and one email?
• Does not restrict total token count
Performance
[Figure: the Level 0 / Level 1 / Level 2 cascade from “Cascading Grammars By Example”, repeated]
Each level in a cascading grammar looks at each character in each document
Dawn of Declarative Information
Extraction
System-T Architecture
AQL Language: Specify annotator semantics declaratively
↓ Annotation Algebra
Optimizer: Choose an efficient execution plan that implements semantics
↓
Operator Runtime
Declarative Information Extraction: AQL
 SQL-like language for defining annotators
 Declarative
– Define basic patterns and the relationships between them
– Let the system worry about the order of operations
AQL Example
Rule: <ProperNoun> (Regular Expression Match) <within 30 characters> <Instrument> (Dictionary Match)
select
    CombineSpans(name.match, instrument.match)
        as annot
from
    Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
    Dictionary("instr.dict", DocScan.text) instrument
where
    Follows(0, 30, name.match, instrument.match);
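To make the semantics of the query concrete, here is a rough Python analogue. It is only a sketch: `regex_spans`, `dictionary_spans`, `follows` and `combine_spans` are hypothetical stand-ins named after the AQL constructs, the three-entry list replaces the contents of `instr.dict` (which the slides do not show), and the nested-loop join is the naive plan that an optimizer would be free to replace.

```python
import re
from itertools import product

PROPER_NOUN = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")   # regex from the AQL example
INSTRUMENTS = ["pipe", "guitar", "hammond organ"]    # assumed dictionary entries


def regex_spans(pattern, text):
    """Regex(...): all non-overlapping matches as (begin, end) spans."""
    return [m.span() for m in pattern.finditer(text)]


def dictionary_spans(entries, text):
    """Dictionary(...): case-insensitive, word-bounded matches as spans."""
    spans = []
    for entry in entries:
        pattern = r"\b" + re.escape(entry) + r"\b"
        spans.extend(m.span() for m in re.finditer(pattern, text, re.IGNORECASE))
    return sorted(spans)


def follows(min_chars, max_chars, first, second):
    """Follows(...): second begins min..max characters after first ends."""
    return min_chars <= second[0] - first[1] <= max_chars


def combine_spans(first, second):
    """CombineSpans(...): span from the start of first to the end of second."""
    return (first[0], second[1])


def band_review(text):
    names = regex_spans(PROPER_NOUN, text)
    instruments = dictionary_spans(INSTRUMENTS, text)
    # Naive nested-loop join; the declarative query does not require this order.
    return [
        text[slice(*combine_spans(n, i))]
        for n, i in product(names, instruments)
        if follows(0, 30, n, i)
    ]


for doc in ["John Pipe plays the guitar", "Marco Benevento on the Hammond organ"]:
    print(band_review(doc))
```

Because overlapping candidates such as "Pipe" and "Hammond" are all kept as tuples and filtered by the join predicate, no tie-breaking between ProperNoun and Instrument is needed.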
Annotation Algebra
 Each Operator in the algebra…
– …operates on one or more tuples of annotations
– …produces tuples of annotations
 “Document at a time” execution model
– Algebra expression is defined over
• the current document d
• annotations defined over d
 Algebra expression is evaluated over each
document in the corpus individually
Basic Single-Argument Operator
[Figure: Input Tuple (Document) → Operator (with Parameters) → Output Tuple 1 (Document, Annotation 1), Output Tuple 2 (Document, Annotation 2)]
Comparison with Cascading Grammars
Grammar:
…John Smith at 555-1212…
– Apply Name Rule, Apply Phone Rule: …<Name> at <Phone>…
– Apply PersonPhone: …<PersonPhone>…
Algebra:
…John Smith at 555-1212…
– Dictionary: John, Smith
– Block: John Smith
– Regex: 555-1212
– Join: John Smith at 555-1212
Fewer passes over the documents
Revisit Problem of Sequencing Annotations
Example inputs, with overlapping ProperNoun and Instrument annotations:
John Pipe plays the guitar
Marco Benevento on the Hammond organ
Algebra expression for the Rule from Band Review
(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)
Rule: <ProperNoun> (Regular Expression Match) <within 30 characters> <Instrument> (Dictionary Match)
Algebra expression:
Join (followed within 30 characters)
– Regular expression → ProperNoun
– Dictionary → Instrument
ProperNoun <0-30 chars> Instrument
Input documents:
John Pipe plays the guitar
Marco Benevento on the Hammond organ
Regex output (doc, ProperNoun): John Pipe; Marco Benevento; Hammond
Dictionary output (doc, Instrument): Pipe; guitar; Hammond organ
Join output (doc, ProperNoun, Instrument): (John Pipe, guitar); (Marco Benevento, Hammond organ)
How is aggregation handled
Example signature, with annotation labels in brackets:
Laura Haas, PhD [Person]
Distinguished Engineer and Director, Computer Science
Almaden Research Center [Organization]
408-927-1700 [Phone]
http://www.almaden.ibm.com/cs [URL]
Rule:
– Start with Person
– Within 50 tokens
– At least 1 Phone
– At least 2 of {Phone, Organization, URL}
– End with one of these.
Back to signature
[Figure: operator tree. Organization (Org), Phone and URL annotations feed a Union; the Union feeds a Block; a Join combines Person with the Block output to produce Signature.]
Cleaner and potentially faster
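The Union, Block and Join plan for signatures can be approximated in Python as follows. This is a sketch under several assumptions that are not in the slides: spans carry invented token offsets, `block` keeps only maximal runs of nearby spans, and the `max_gap` parameter is hypothetical. Only the rule itself (start with Person, within 50 tokens, at least 1 Phone, at least 2 of Phone, Organization and URL) comes from the deck.

```python
from collections import namedtuple

Span = namedtuple("Span", "kind begin end")   # token offsets, end exclusive


def union(*inputs):
    """Union: merge several annotation lists, ordered by position."""
    return sorted((s for spans in inputs for s in spans), key=lambda s: s.begin)


def block(spans, max_gap, min_size):
    """Block: maximal runs of spans whose successive gaps are at most
    max_gap tokens, kept only if the run has at least min_size members."""
    runs, current = [], []
    for s in spans:
        if current and s.begin - current[-1].end > max_gap:
            runs.append(current)
            current = []
        current.append(s)
    if current:
        runs.append(current)
    return [r for r in runs if len(r) >= min_size]


def signatures(persons, contact_blocks, max_tokens=50):
    """Join: a Person followed by a contact block, the whole signature within
    max_tokens, with at least one Phone inside the block."""
    out = []
    for p in persons:
        for blk in contact_blocks:
            starts_after = blk[0].begin >= p.end
            short_enough = blk[-1].end - p.begin <= max_tokens
            has_phone = any(s.kind == "Phone" for s in blk)
            if starts_after and short_enough and has_phone:
                out.append(Span("Signature", p.begin, blk[-1].end))
    return out


# Invented token offsets for the Laura Haas example, plus one stray URL.
persons = [Span("Person", 0, 2)]
orgs = [Span("Organization", 12, 15)]
phones = [Span("Phone", 15, 16)]
urls = [Span("URL", 16, 17), Span("URL", 200, 201)]

contacts = block(union(orgs, phones, urls), max_gap=5, min_size=2)
result = signatures(persons, contacts)
print(result)
```

Unlike the regular-expression approximation, nothing here enumerates the possible orders of Phone, Organization and URL, and the total token count is checked directly.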
Performance
 Performance issues with grammars
– Complete pass through tokens for each rule
– Many of these passes are wasted work
 Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
 Algebraic approach: Build a query optimizer!
Optimizations
 Query optimization is a familiar topic in databases
 What’s different in text?
– Operations over sequences and texts
– Document boundaries
– Costs concentrated in extraction operators (dictionary,
regular expression)
 Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
Optimization Example
Rule: <ProperNoun> <within 30 characters> <Instrument>
Sample text: … non ante. John Pipe played the guitar. Aliquam erat volutpat. …
[Figure: the sample text with a Regex match and a Dictionary match highlighted, 0-30 characters apart]
Classic Query Optimization
Plan A: Join (Followed within 30 characters) of <ProperNoun> and <Instrument>
Plan B: Start from <ProperNoun>. Consider text to the right. Find <Instrument> within 30 characters
Plan C: Start from <Instrument>. Consider text to the left. Find <ProperNoun> within 30 characters
Example of Text-Specific Optimization:
 Conditional Evaluation (CE)
– Leverage document-at-a-time processing
– Don’t evaluate the inner operand of a join if the outer has no results
– Costing plans is challenging
[Figure: CEJoin over “…John Smith at 555-1212…”. Dictionary produces John Smith, Regex produces 555-1212, and CEJoin produces John Smith at 555-1212.]
Don’t evaluate this Regex when there are no dictionary matches.
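Conditional evaluation can be sketched in Python as a join that checks its outer input before touching the inner one. The names `ce_join`, `dictionary_matches` and `phone_matches`, the two-entry dictionary, the `max_gap` predicate and the call counter are all assumptions for illustration; the slides only state the principle.

```python
import re

NAMES = ["John Smith", "Jane Doe"]            # hypothetical dictionary
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")
regex_calls = 0                               # counts inner evaluations


def dictionary_matches(text):
    """Outer operand: dictionary matches as (begin, end) spans."""
    return [m.span() for name in NAMES for m in re.finditer(re.escape(name), text)]


def phone_matches(text):
    """Inner operand: regex matches as (begin, end) spans."""
    global regex_calls
    regex_calls += 1
    return [m.span() for m in PHONE.finditer(text)]


def ce_join(text, max_gap=10):
    """Join name and phone spans; skip the regex when there are no names."""
    names = dictionary_matches(text)
    if not names:
        # No outer tuples for this document, so the join result is empty
        # and the inner operand never needs to run.
        return []
    phones = phone_matches(text)
    return [text[n[0]:p[1]] for n in names for p in phones
            if 0 <= p[0] - n[1] <= max_gap]


docs = ["Lorem ipsum 555-0000 dolor", "call John Smith at 555-1212 today"]
results = [ce_join(d) for d in docs]
print(results, regex_calls)
```

The shortcut is only valid because the expression is evaluated one document at a time: an empty outer input for this document guarantees an empty join result for this document.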
Experimental Results (Band Review Annotator)
[Figure: bar chart “Annotator Running Time”. Y-axis: Running Time (sec), 0 to 30000. Bars: GRAMMAR, ALGEBRA (Baseline), ALGEBRA (Optimized). Callouts: Classical query optimization; Text-specific optimizations.]
IOPES: Extracting Relationships and Composite Entities
 IOPES = IBM Omnifind Personal Email Search
 Extract entities such as email address, url
 Associations such as name ↔ phone number
 Complex entities like conference schedules, directions, signature
blocks
Thank you!
 For more information…
– Try out IOPES
• http://www.alphaworks.ibm.com/tech/emailsearch
– Avatar Project home page
• http://almaden.ibm.com/cs/projects/avatar/
– Contact me
• [email protected]
Backup Slides
Block Operator (b)
[Figure: sample text in which several nearby Input annotations are grouped into one Block]
– Constraint on distance between inputs
– Constraint on number of inputs
Conditional Evaluation (CE)
 Leverage document-at-a-time processing
 Don’t evaluate the inner operand of a join if the outer has no results
 Costing plans is challenging
[Figure: CEJoin over “…John Smith at 555-1212…”. Dictionary produces John Smith, Regex produces 555-1212, and CEJoin produces John Smith at 555-1212.]
Don’t evaluate this Regex when there are no dictionary matches.
Restricted Span Evaluation
 Leverage the sequential nature of text
 Only evaluate the inner on the relevant portions of the document
 Limited applicability (compared with CE)
– Only certain operands and predicates
[Figure: RSEJoin over “…John Smith at 555-1212…”. Regex produces 555-1212, Dictionary produces John Smith.]
Only look for dictionary matches in the vicinity of a phone number.
Implementing Restricted Span Evaluation (RSE)
 RSE join operator
 RSE extraction operator
 Pass join bindings down to the inner of a join
 Requires special physical operators at edges of plan
[Figure: join of R1 with Dict(D, s2) on predicate p(s1,s2). The RSE join passes each s1 binding from R1 down to RSEDict (the RSE Dictionary Operator, with parameters p and D), which returns the s2’s that satisfy p(binding, s2).]
RSE Dictionary Operator
[Figure: sample text with two highlighted ranges. The second range extends the first by the length of the longest dictionary entry.]
To find dictionary matches that end in this range… …need to examine this range.
 RSE version of an operator must produce the exact same answer
– Ongoing work: RSE Regular Expression operator
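The range arithmetic behind the RSE dictionary operator can be sketched in Python. The function names `rse_dictionary` and `full_dictionary`, the exact-string matching, and the sample text are assumptions for illustration. The only idea taken from the slide is that finding matches ending in a range requires examining that range extended to the left by the length of the longest dictionary entry, and that the restricted operator must return the same answer as the unrestricted one.

```python
import re


def rse_dictionary(text, entries, lo, hi):
    """Matches whose end offset falls in (lo, hi], found by scanning only
    text[lo - longest : hi] instead of the whole document."""
    longest = max(len(e) for e in entries)
    start = max(0, lo - longest)
    window = text[start:hi]
    found = []
    for entry in entries:
        pos = window.find(entry)
        while pos != -1:
            begin, end = start + pos, start + pos + len(entry)
            if lo < end <= hi:
                found.append((begin, end))
            pos = window.find(entry, pos + 1)
    return sorted(found)


def full_dictionary(text, entries, lo, hi):
    """Reference version: scan the entire document, then filter by end offset."""
    found = []
    for entry in entries:
        pos = text.find(entry)
        while pos != -1:
            if lo < pos + len(entry) <= hi:
                found.append((pos, pos + len(entry)))
            pos = text.find(entry, pos + 1)
    return sorted(found)


text = "Smith left. Later, call John Smith at 555-1212."
entries = ["John Smith", "Smith"]

# Only look for names that end within 10 characters before the phone number.
phone = re.search(r"[1-9]\d{2}-\d{4}", text)
lo, hi = max(0, phone.start() - 10), phone.start()

hits = rse_dictionary(text, entries, lo, hi)
print([text[b:e] for b, e in hits])
```

The leading "Smith" at the start of the document is never examined by the restricted version, yet both versions agree on the result, which is the equivalence the slide requires.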
Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)
[Figure: approaches to information extraction: Regular Expressions and Custom Code; Cascading Grammars (CPSL, AFST); Workflows (UIMA, GATE); System T; DBLife]
DBLife: In the context of Project Cimple. Search for “cimple wisc”
Delving deeper into System T versus DBLife
[Figure: optimization techniques compared across System T and DBLife: Restricted Span Evaluation; Shared Dictionary Matching; Pushing Down Text Properties; Scoping Extractions; Pattern Matching; Conditional Evaluation]
Cascading Grammar Reality
 Set of simple grammar rules for person name recognition
Pre-processing step: Tokenization of the document text
Tokenize(Document Text) → Sequence of <Token>
Level 1: Rules that look for patterns in each token to produce corresponding annotations
Token[~ “Mr. | Mrs. | Dr. | …”] → Salutation
Token[~ “Ph.D | MBA | …”] → Qualification
Token[~ “[A-Z][a-z]*”] → CapsWord
Token[~ “Michael | Richard | Smith| …”] → PersonDict
Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons
PersonDict PersonDict → Person (e.g., Richard Smith)
Salutation CapsWord CapsWord → Person (e.g., Dr. Laura Haas)
CapsWord CapsWord Token[~“,”]? Qualification → Person (e.g., Laura Haas, Ph.D)
IOPES: Extracting Relationships and Composite Entities
 IOPES = IBM Omnifind Personal Email Search
 Entities like addresses, person names
 Relationships like name ↔ phone number
 Complex entities like conference schedules, directions, signature
blocks
Extracting Entities in Notes 8.01 Live Text
 Leverages Information Extraction Techniques
 Names, addresses, phone numbers…
 Ships with Lotus Notes 8.01