An Algebraic Approach to Information Extraction
June 12, 2008
The Avatar Group
(Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss,
Shivakumar Vaithyanathan, and Huaiyu Zhu)
IBM Almaden Research Center
June 12, 2008
© 2008 IBM Corporation
Information Extraction (IE)
 Distill structured data from unstructured and semi-structured text
 Exploit the extracted data in your applications
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

(from Cohen’s IE tutorial, 2003)
The Avatar Group at IBM Almaden
 Working on information extraction (IE) since 2003
 Main goals:
– Extract structured information from text
– Build a system that can scale IE to real enterprise apps
– Build new enterprise applications that leverage IE
Extracting Entities in Notes 8.01 Live Text
 Names, addresses, phone
numbers…
 Leverages the technologies
discussed here
 Ships with Lotus Notes 8.01
IOPES: Extracting Relationships and Composite Entities
 IOPES = IBM Omnifind Personal Email Search
 Associations like name ↔ phone number
 Complex entities like conference schedules, directions,
signature blocks
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
Evolution of the Avatar Project
2004: Custom Code
      Evolutionary trigger: large number of annotators
2005: RAP (CPSL-style cascading grammar system)
      Evolutionary trigger: diverse data sets, complex extraction tasks
2006: RAP++ (RAP + extensions outside the scope of grammars)
      Evolutionary trigger: performance, expressivity
2007: System T (algebraic information extraction system)
2008
Historical Perspective: Information Extraction
 MUC (Message Understanding Conference) – 1987
to 1997
– Competition-style conferences organized by DARPA
 Many different systems from this community
– FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS
[Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX
[Embley05]
 Recent interest from database/search community
– [Agichtein03] [Ipeirotis06] [Ramakrishnan06] [Shen07]
An Aside: Rule-Based vs. Machine Learning
 Two dominant approaches to information
extraction (IE)
– Rule-Based: Define a set of extraction rules
– Machine Learning Based: Learn a parametric model
 Focus of our work: Rule-based IE
Cascading Finite-state Grammars
 Most rule-based IE systems share a common
formalism
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns
over these tokens
 Several levels of processing → Cascading Grammars
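The cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the toy `NAMES` dictionary, and the function names `level0`, `level1`, `level2`, and `cascade` are all hypothetical, and the rules imitate the Name / Phone / PersonPhone example rather than real CPSL syntax.

```python
import re

# Assumed toy resources; a real grammar system would load these from rules.
NAMES = {"John", "Smith"}
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")


def level0(text):
    """Level 0: view the input as a sequence of (label, text) tokens."""
    return [("Token", word) for word in text.split()]


def level1(tokens):
    """Level 1: Token[~phone] -> Phone and Token[~name]+ -> Name."""
    out, i = [], 0
    while i < len(tokens):
        word = tokens[i][1]
        if PHONE.fullmatch(word):
            out.append(("Phone", word))
            i += 1
        elif word in NAMES:
            j = i
            while j < len(tokens) and tokens[j][1] in NAMES:
                j += 1
            out.append(("Name", " ".join(t[1] for t in tokens[i:j])))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out


def level2(tokens):
    """Level 2: Name Token[~"at"] Phone -> PersonPhone."""
    out, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(t[1] for t in window)))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def cascade(text):
    """Each level makes a full pass over the output of the level below."""
    return level2(level1(level0(text)))
```

Note that every level walks the whole token sequence, which is the cost the later performance slides refer to.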
Cascading Grammars By Example
Level 0 (Tokenize)
… sed John Smith at 555-1212 hendrerit …
Level 1
Token[~ “John | Smith| …”]+ → Name
Token[~ “[1-9]\d{2}-\d{4}”] → Phone
… sed <Name> at <Phone> hendrerit …
Level 2
Name Token[~ “at”] Phone → PersonPhone
… sed <PersonPhone>, hendrerit …
Common Pattern Specification Language (CPSL)
 CPSL
– A standard language for specifying cascading grammars
– Created in 1998
 Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
• Part of the GATE NLP framework
• Under active consideration for commercial use by several companies
[Figure: cascading grammar example (Level 0 tokens, Level 1 Name and Phone rules, Level 2 PersonPhone rule)]
Experiences with Cascading Grammars
 Benefits
– Big step forward from custom code
– Can express many simple concepts
 Drawbacks
– Expressiveness
• Multiple tokenizations
• Dealing with overlap
• Building complex structures
– Performance
Example Task: Finding informal reviews in blogs
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas turpis. Proin
nam ac ligula a lectus suscipit porttitor. Fusce non tellus sed urna pulvinar
tincidunt.
We went to a OTIS concert last Thursday. Suspendisse malesuada est vel risus.
Aenean sed ante fermentum dolor placerat rutrum. John Pipe plays guitar, id
pellentesque pede felis a erat. Felis Marco Benevento on the Hammond organ.
Curabitur sollicitudin porta velit. Donec scelerisque. Donec a magna sed sem
accumsan sodales. It was SO MUCH FUN! Hes accumsan sed, aliquam eget,
ornare et, metus. Integer eleifend tellus dictum nisi.
Etiam in enim. In blandit mi sit amet lectus. Nullam adipiscing fringilla odio. In hac
habitasse platea dictumst. Cum sociis natoque penatibus et magnis dis parturient
montes, nascetur ridiculus mus. Ut elementum quam eget justo. In arcu leo,
Overlapping Annotations Example: Band Review Annotator
(pipe | guitar | hammond organ |…) → Instrument
1-2 capitalized words → Person
Person 0-5 tokens Instrument → PersonPlaysInstrument

Candidate annotations for “John Pipe plays the guitar”:
  John     Pipe        plays   the     guitar
  Person   Person      Token   Token   Instrument
  Person   Instrument  Token   Token   Instrument
  [John Pipe] Person                   [guitar] Instrument
Complex Structures Example: Signature Annotator
Example signature block:
Laura Haas, PhD                                          (Person)
Distinguished Engineer and Director, Computer Science
Almaden Research Center                                  (Organization)
408-927-1700                                             (Phone)
http://www.almaden.ibm.com/cs                            (URL)

Pattern (Person, Organization, Phone, URL):
Start with Person
Within 50 tokens
At least 1 Phone
At least 2 of {Phone, Organization, URL, Email, Address}
End with one of these.
Performance
[Figure: cascading grammar example repeated: Level 0 tokens, Level 1 Name and Phone rules, Level 2 PersonPhone rule]
Performance: Existing Solutions
 Performance issues
– Complete pass through tokens for each rule
– Many of these passes are wasted work
 Dominant approach: Make each pass go faster
– Faster finite state machines
– Batch processing
– Parallel processing
 Doesn’t solve root problem!
The Algebraic Approach
 A different way of thinking
 Identify the most basic operations
 Create an operator for each basic operation
 Compose operators to build complex annotators
Example: Regular Expression Extraction Operator
Input Tuple: Document = “… You can reach me at 555-1212 or 358-1237. …”
Regex: \d{3}-\d{4}
Output Tuple 1: Document, Span 1 (555-1212)
Output Tuple 2: Document, Span 2 (358-1237)
Some Example Operators
 Regex
– Find all matches of a character-based regular
expression
 Dictionary
– Find all matches of an exhaustive dictionary of terms
 Join
– Find pairs of sub-annotations that match a predicate
 Block
– Identify contiguous blocks of lower-level matches
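A rough Python sketch of these four operators, composed into the Name / Phone example, may make the idea concrete. Everything here is an assumption for illustration: spans are modeled as `(begin, end)` character offsets, the function names are hypothetical, and `block_op` only merges maximal runs of nearby spans, which is simpler than the Block operator in the talk (that one retains overlapping matches).

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # [begin, end) character offsets into the document


def regex_op(pattern: str, doc: str) -> List[Span]:
    """Regex: all matches of a character-based regular expression."""
    return [m.span() for m in re.finditer(pattern, doc)]


def dictionary_op(terms: Set[str], doc: str) -> List[Span]:
    """Dictionary: all single-word, case-insensitive matches of the terms."""
    lowered = {t.lower() for t in terms}
    return [m.span() for m in re.finditer(r"\w+", doc)
            if m.group().lower() in lowered]


def block_op(spans: List[Span], max_gap: int) -> List[Span]:
    """Block (simplified): merge runs of spans at most max_gap chars apart."""
    blocks: List[Span] = []
    for begin, end in sorted(spans):
        if blocks and begin - blocks[-1][1] <= max_gap:
            blocks[-1] = (blocks[-1][0], max(end, blocks[-1][1]))
        else:
            blocks.append((begin, end))
    return blocks


def join_op(left: List[Span], right: List[Span],
            pred: Callable[[Span, Span], bool]) -> List[Tuple[Span, Span]]:
    """Join: all pairs of spans that satisfy the predicate."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def follows(max_chars: int) -> Callable[[Span, Span], bool]:
    """Predicate: right span starts within max_chars after left span ends."""
    return lambda l, r: 0 <= r[0] - l[1] <= max_chars
```

For example, over the document `"John Smith at 555-1212"`, a dictionary of first and last names followed by `block_op` yields one name span, `regex_op` yields the phone span, and `join_op` with `follows(5)` pairs them.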
Comparison with Cascading Grammars
Grammar:
…John Smith at 555-1212…
  Apply Name Rule, Apply Phone Rule
…<Name> at <Phone>…
  Apply PersonPhone
…<PersonPhone>…

Algebra:
…John Smith at 555-1212…
  Dictionary → John, Smith
  Block → John Smith
  Regex → 555-1212
  Join → John Smith at 555-1212
Overlapping Annotations Example: Band Review Annotator
(pipe | guitar | hammond organ |…) → Instrument
1-2 capitalized words → Person
Person 0-5 tokens Instrument → PersonPlaysInstrument

Candidate annotations for “John Pipe plays the guitar”:
  John     Pipe        plays   the     guitar
  Person   Person      Token   Token   Instrument
  Person   Instrument  Token   Token   Instrument
  [John Pipe] Person                   [guitar] Instrument
Overlapping Annotations
Input: John Pipe plays the guitar
Regex → John, Pipe
Block → John, Pipe, John Pipe
Dictionary → Pipe, guitar
Join → (John, guitar), (Pipe, guitar), (John Pipe, guitar)
  Retain overlapping matches by default
Consolidate → John Pipe, guitar
  Explicitly remove overlap with Consolidate operator
Complex Structures Example: Signature Annotator
Example signature block:
Laura Haas, PhD                                          (Person)
Distinguished Engineer and Director, Computer Science
Almaden Research Center                                  (Organization)
408-927-1700                                             (Phone)
http://www.almaden.ibm.com/cs                            (URL)

Pattern (Person, Organization, Phone, URL):
Start with Person
Within 250 characters
At least 1 Phone
At least 2 of {Phone, Organization, URL, Email, Address}
End with one of these.
Complex Structures Example: Signature Annotator
Plan, bottom to top:
Inputs: Person, Org, Phone, URL
Union: Organization, Phone, URL
Block: find blocks of two or more “contact info” patterns
Join: Person with contact-info block; join predicates enforce additional constraints
Output: Signature (Person, Organization, Phone, URL)
Performance
 Performance issues with grammars
– Complete pass through tokens for each rule
– Many of these passes are wasted work
 Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
 Algebraic approach: Build a query optimizer!
An Aside: Relational Query Optimization
 Central concept in relational databases
– User specifies what she is looking for
– System decides how to find it
– Greatly reduces development and maintenance costs
 Basic approach
– Enumerate many equivalent relational algebra
expressions
– Estimate the cost of each one
– Choose the fastest
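The enumerate / estimate / choose loop can be sketched as follows. This is a toy model, not the optimizer from the talk: the cardinalities in `CARD`, the fixed one-in-ten join selectivity, and the nested-loop cost formula are all assumptions made up for the illustration.

```python
from itertools import permutations

# Hypothetical number of tuples each input produces per document.
CARD = {"Person": 50, "PhoneNumber": 2, "Sentence": 20}
SELECTIVITY_DENOM = 10  # assume one in ten joined pairs survives


def plan_cost(order):
    """Toy nested-loop cost of joining the inputs left to right."""
    rows = CARD[order[0]]
    cost = rows
    for name in order[1:]:
        cost += rows * CARD[name]  # compare every row with every new tuple
        rows = max(1, rows * CARD[name] // SELECTIVITY_DENOM)
    return cost


def choose_plan(inputs):
    """Enumerate all join orders and keep the cheapest one."""
    return min(permutations(inputs), key=plan_cost)
```

Under these made-up numbers the cheapest order starts from the smallest input (`PhoneNumber`), which matches the usual intuition of joining selective inputs first.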
Optimizations
 Query optimization is a familiar topic in databases
 What’s different in text?
– Operations over sequences and spans
– Document boundaries
– Costs concentrated in extraction operators (dictionary,
regular expression)
 Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
Example: Restricted Span Evaluation (RSE)
 Leverage the sequential nature of text
– Join predicates on character or token distance
 Only evaluate the inner operand on the relevant portions of the document
 Limited applicability
– Need to guarantee exact same results
Example plan over …John Smith at 555-1212…
Regex → 555-1212
Dictionary → John Smith (only look for dictionary matches in the vicinity of a phone number)
RSEJoin → John Smith at 555-1212
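A minimal sketch of the idea, assuming a single-word name dictionary and a character-distance join predicate: the phone regex runs over the whole document, and dictionary matching runs only over the window in front of each phone match. The names `rse_join`, `dictionary_matches`, `max_dist`, and `max_entry_len` are hypothetical, and the window logic assumes no dictionary entry is longer than `max_entry_len` characters.

```python
import re

PHONE = re.compile(r"\d{3}-\d{4}")
WORD = re.compile(r"[A-Za-z]+")


def _is_letter(ch):
    return ch.isascii() and ch.isalpha()


def dictionary_matches(terms, text, offset=0):
    """Spans of single-word dictionary matches, shifted by offset."""
    return [(offset + m.start(), offset + m.end())
            for m in WORD.finditer(text) if m.group() in terms]


def rse_join(doc, names, max_dist=30, max_entry_len=20):
    """Pair each phone with names ending at most max_dist chars before it."""
    results = []
    for phone in PHONE.finditer(doc):
        stop = phone.start()
        lo = max(0, stop - max_dist - max_entry_len)
        # Do not start inside a word, or a word suffix could match by accident.
        while 0 < lo < stop and _is_letter(doc[lo - 1]) and _is_letter(doc[lo]):
            lo += 1
        for begin, end in dictionary_matches(names, doc[lo:stop], lo):
            if 0 <= stop - end <= max_dist:
                results.append(((begin, end), phone.span()))
    return results
```

The boundary adjustment is what keeps the restricted evaluation faithful to the full one: without it, slicing the document mid-word could produce matches that a full pass would never report.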
Experimental Results (Band Review Annotator)
[Figure: Annotator Running Time. Bar chart of running time (sec), axis from 0 to 30000, for GRAMMAR, ALGEBRA (Baseline), and ALGEBRA (Optimized). Callouts: Classical query optimization; Text-specific optimizations.]
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
System T
 Next-generation information extraction system
 Makes developing annotators like developing
other enterprise software
– AQL rule language
• Declarative language for building annotators
– Development environment
• Provides support for building complex annotators
– Runtime environment
• Deploy to corporate PCs or server farms
System T Block Diagram
Development Environment: User Interface, Execution Engine, Sample Documents
Rules (AQL) → Optimizer → Plan (Algebra)
Runtime Environment: Input Document Stream → Annotated Document Stream
AQL
 Declarative language for defining
annotators
– Compiles into our algebra
 Main features
– Separates semantics from performance
– Familiar syntax
– Full expressive power of algebra
AQL By Example
Pattern: <Person>, then 0-30 chars, then <PhoneNum>
– Text in between contains “phone” or “at”
– Within a single sentence

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where
  Follows(P.name, N.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, N.number)
  and ContainsRegex(/\b(phone|at)\b/,
        SpanBetween(P.name, N.number));
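To make the predicates in the `where` clause concrete, here is a hedged Python rendering of what the view computes. It is an interpretation, not System T code: spans are assumed to be `(begin, end)` character offsets, the helper functions are hypothetical stand-ins for the AQL built-ins of the same names, and the triple loop is the naive plan that an optimizer would be free to replace.

```python
import re
from itertools import product


def follows(a, b, lo, hi):
    """b begins between lo and hi characters after a ends."""
    return lo <= b[0] - a[1] <= hi


def contains(outer, inner):
    """inner lies entirely inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def span_between(a, b):
    """The text span from the end of a to the start of b."""
    return (a[1], b[0])


def contains_regex(pattern, span, doc):
    """The text covered by span contains a match of pattern."""
    return re.search(pattern, doc[span[0]:span[1]]) is not None


def person_phone(doc, persons, numbers, sentences):
    """Naive evaluation of the PersonPhone view over precomputed spans."""
    return [
        (p, n)
        for p, n, s in product(persons, numbers, sentences)
        if follows(p, n, 0, 30)
        and contains(s, p)
        and contains(s, n)
        and contains_regex(r"\b(phone|at)\b", span_between(p, n), doc)
    ]
```

The declarative form leaves the evaluation order open, which is the point of separating semantics from performance.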
AQL: Status
 Compiler and optimizer implemented in 2007
– First generation: Heuristic optimizer
– Second generation: Basic cost-based optimizer
– Third generation in progress
 Transitioning to several IBM products
– Used in Lotus Notes 8.01 (GA in March 2008)
– Next release of IOPES will be AQL-based (Notes 8.5,
Q4 2008)
– Several other products in development
System T Development Environment
 Create and edit AQL
annotators
 Manage dictionaries and
document collections
 Test annotators and view
results
 Downloadable demo!
– (IBM internal only)
Ongoing Work: Pattern Discovery
 The Problem:
– Building dictionaries and other basic building
blocks is a major part of the development process
• 80% or more of the work
 Solution:
– Providing tools to analyze annotations and their
context to discover useful low-level patterns
Example: Building a Phone Number Annotator
Initial “rough” regular expression: [\d()-\.]{7,15}
Run over sample documents, then cluster results.
Examples to help improve original pattern:
(123)4568909
1-800-124-2456
123-890-8990
789.890.8980
345-678-9012
123.345.7890
1-890-890-0890
(408)123-7898
123.456.789.189
10.50-100.00
10.10.2008
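One simple way to cluster such matches is by their "shape", replacing every digit with a placeholder so that numbers with the same layout fall into the same group. The sketch below is an assumption about how this could be done, not the tool from the talk; the character class is a tightened variant of the rough pattern, and `shape` and `cluster_matches` are hypothetical names.

```python
import re
from collections import defaultdict

# Tightened variant of the rough pattern: digits, parentheses, dots, hyphens.
ROUGH = re.compile(r"[\d().\-]{7,15}")


def shape(match_text):
    """Replace each digit with 'd' so matches with the same layout collide."""
    return re.sub(r"\d", "d", match_text)


def cluster_matches(docs):
    """Run the rough regex over sample documents and group matches by shape."""
    clusters = defaultdict(list)
    for doc in docs:
        for m in ROUGH.finditer(doc):
            clusters[shape(m.group())].append(m.group())
    return dict(clusters)
```

A developer can then inspect each cluster and decide whether its shape belongs in the refined pattern (for example `ddd.ddd.dddd`) or signals a false positive such as a price range.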
Example: Building a Phone Number Annotator
Initial “rough” regular expression: [\d()-\.]{7,15}
Run over sample documents, cluster results, then cluster the text to the left (or right) of the matches.
Left Context
Identify contextual “clues” that can improve confidence…
Phone #: (123)4568909
Phone #: 1-800-124-2456
Telephone #: 123-890-8990
Tel #: 789.890.8980
phone number is 345-678-9012
cell number is 123.345.7890
call me at 1-890-890-0890
call my office at (408)123-7898
…or indicate false positives
IP address is 123.456.789.189
Price range: 10.50-100.00
Open on 10.10.2008
Ongoing Work: Interface for Building Custom Annotators
 Problem:
– Customers need to build
– AQL is too powerful
 Solution:
– Simpler language with
compact syntax
– GUI annotator builder
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
Named Entity Annotators
 Developed using System T and AQL
 Shipping with Lotus Notes 8.01
 Will ship with IOPES, other IBM products
 Statistics:
– 8 types of entities
– 327 AQL statements
– Throughput: 800+ kb/sec/core (on my laptop)
Entities Currently Extracted
 Complex entities
– Person
– Address
– Organization
 “Simple” entities
– Phone Number
– Email address
– URL
– Time
– Date
Languages Supported
 Already supported:
– English
– German
 Can support with straightforward extensions:
– Spanish
– French
– other Indo-European languages
 Extensions needed (ongoing work):
– Japanese (with Tokyo Research Lab)
– Hebrew (with Haifa?)
– Chinese
– Korean
High-Level Dataflow Diagram
Person
Organization
Address
…
Stage 4
Handle overlap
Stage 3
Filter false positives
Identify lists
Stage 2
Find composite
patterns
Stage 1
Extract basic features
Quality
              Precision   Recall
Person        >90%         90%
Address       >95%         90%
Organization  >90%         90%
Phone Number  >95%        >95%
Performance: Laptop (Intel Core 2 Duo 2.33 GHz)
[Figure: two charts of throughput (kb/sec) versus number of threads (1 and 2), one for “Just Person and Organization” and one for “All Named Entities”.]
Performance: Server (4×quad-core AMD Opteron)
[Figure: two charts of throughput (kb/sec) versus number of threads (1 to 16), one for “Just Person and Organization” and one for “All Named Entities”.]
Thank you!
 For more information…
– Read our ICDE 2008 paper (“An Algebraic Approach to Rule-Based Information Extraction”)
– Try out IOPES
• http://www.alphaworks.ibm.com/tech/emailsearch
– Avatar Project home page
• http://almaden.ibm.com/cs/projects/avatar/
– Download System T (IBM only)
• http://fisher.almaden.ibm.com:8080/systemt
– Contact me
• [email protected]
Backup Slides
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
Extracting Information with Custom Code
 “It’s just pattern matching”
– Use scripts and regular expressions
 Then reality sets in…
– Dozens of rules, even for simple concepts
– Many special cases
– Convoluted logic
– Painfully slow code
Operators in the Algebra
 Currently 44 operators
 Categories:
– Relational: Selection, Cross product, Join, Union, …
– Span extraction: Regular expression, Dictionary,
Sentence, Part of Speech…
– Span aggregation: Consolidation, Block
– Specialized: Detag HTML
– Input/Output: Document Scan, Annotation Scan,
ToHTML, ToAOM, …
Multiple Tokenizations Example
Extraction Task            Ideal Tokenization
Identify company names     I.B.M.
Identify abbreviations     I.B.M.
Find sentence boundaries   I.B.M.
Tokenization on Demand
Input: …J.T. Smith works at I.B.M.…
First and Middle Initials: Regex → J.T.
Dictionary → Smith
Join with embedded tokenizer (tokenize between “J.T.” and “Smith”) → J.T. Smith
Company Names: Dictionary with no tokenization → I.B.M.
Punctuation: Regex → each “.”
Overlapping Annotations Example: Band Review Annotator
(pipe | guitar | hammond organ |…) → Instrument
1-2 capitalized words → Person
Person 0-5 tokens Instrument → PersonPlaysInstrument

“John Pipe plays the guitar”
  John Pipe (Person)  plays (Token)  the (Token)  guitar (Instrument)
  John (Person)  Pipe (Instrument)  plays (Token)  the (Token)  guitar (Instrument)
“Marco Benevento on the Hammond organ”
  Marco Benevento (Person) … Hammond organ (Instrument)
Overlapping Annotations: Existing Solutions
 Explicit rule priority
– Higher-priority rules in a level dominate lower-priority ones
– Complex interactions between rules
– Not enough information available in low-level rules
Two priority orders compared on “John Pipe plays the guitar” and “Marco Benevento on the Hammond organ”:
– Person dominates Instrument
– Instrument dominates Person
[Figure: resulting Person and Instrument labels for each sentence under each priority order]
Overlapping Annotations: Person <0-5 tokens> Instrument
Input documents: “John Pipe plays the guitar”, “Marco Benevento on the Hammond organ”
Regex (CapitalizedWord / ProperNoun) → John Pipe, Marco Benevento, Hammond
Dictionary (Instrument) → Pipe, guitar, Hammond organ
Join → (John Pipe, guitar), (Marco Benevento, Hammond organ)
Overlapping Annotations Example: Band Review Annotator
(pipe | guitar | hammond organ |…) → Instrument
1-2 capitalized words → Person
Person 0-5 tokens Instrument → PersonPlaysInstrument

When John Pipe plays the guitar, the crowd…
CPSL standard: Which ones to retain?
Consolidation
 Operator that removes overlap
 Several different policies
– Exact match
– Longest match
– Left-to-right longest
– …
 Consolidate only when enough information is available
[Diagram: Set of spans + Policy → Consolidate → Nonoverlapping subset]
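Two of the policies can be sketched in Python as follows. These are one plausible reading of "longest match" and "left-to-right longest", stated as assumptions rather than the exact System T semantics; spans are `(begin, end)` character offsets and the function names are hypothetical.

```python
def consolidate_longest(spans):
    """Drop every span that is contained in a different, larger span."""
    unique = set(spans)
    return sorted(
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    )


def consolidate_left_to_right(spans):
    """Scan left to right, keep the longest span at each start, skip overlaps."""
    kept, last_end = [], -1
    for begin, end in sorted(set(spans), key=lambda s: (s[0], -s[1])):
        if begin >= last_end:
            kept.append((begin, end))
            last_end = end
    return kept
```

On the candidate spans for "When John Pipe" (each word plus "When John" and "John Pipe"), the containment policy keeps both two-word spans, which still overlap, while the left-to-right policy keeps "When John" and "Pipe". This is why the choice of policy, and the point in the plan where it is applied, matters.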
Second “John Pipe” Example
Input: When John Pipe plays the guitar, the crowd…
Regex (find capitalized words) → When, John, Pipe
Block → When, John, Pipe, When John, John Pipe
Select (filter out stop-words) → John, Pipe, John Pipe
Consolidate (remove overlap) → John Pipe
Complex Structures: Existing Solutions
 Approximate using regular expressions
 Example: Signature
– Rule: (Person Token{,25} Phone (Token{,25} Contact)+) |
(Person (Token{,25} Contact)+ Token{,25} Phone
(Token{,25} Contact)*)
– Problems:
• Need to enumerate all possible orders of sub-annotations
– What if you want at least one phone and one email?
• Does not restrict total token count
Types of Operator
 Select, project, join…
 Extraction operators
– Identify basic pattern matches in text
– Several subtypes: Regex, Dictionary, Sentence…
 Block
– Group together simpler annotations to produce complex ones
 Consolidation
– Decide between overlapping matches
Multiple Tokenizations: Existing Solutions
 Use a “lowest common denominator” tokenizer
– Makes rules much more complicated
 Use a configurable tokenizer
– Can still need two different tokenizations
– Need to keep tokenization(s) in sync with rules
 Use character-based regular expressions
– Rules need to deal with whitespace, punctuation
Shared Dictionary Matching (SDM)
 Dictionary matching has 3 steps:
– Tokenize text
– Hash each token
– Generate matches based on hash table entry
 Can share the first two steps among many dictionaries
Before: D1 → Dict → subplan; D2 → Dict → subplan
After: D1 and D2 → SDMDict (SDM Dictionary Operator) → both subplans
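The three steps and the sharing of the first two can be sketched as below. This is a simplified illustration under assumptions: dictionary entries are single words, matching is case-insensitive, a Python dict stands in for the hash table, and `build_shared_index` and `sdm_match` are hypothetical names.

```python
import re
from collections import defaultdict


def build_shared_index(dictionaries):
    """Map each term to the names of all dictionaries that contain it."""
    index = defaultdict(set)
    for name, terms in dictionaries.items():
        for term in terms:
            index[term.lower()].add(name)
    return index


def sdm_match(doc, index):
    """Tokenize once, hash each token once, fan matches out per dictionary."""
    matches = defaultdict(list)
    for m in re.finditer(r"\w+", doc):
        for name in sorted(index.get(m.group().lower(), ())):
            matches[name].append(m.span())
    return dict(matches)
```

With separate dictionary operators the document would be tokenized and hashed once per dictionary; here that work is done once regardless of how many dictionaries share the index.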
Conditional Evaluation (CE)
 Leverage document-at-a-time processing
 Don’t evaluate the inner operand of a join if the outer has no results
 Costing plans is challenging
Example plan over …John Smith at 555-1212…
Dictionary → John Smith
Regex → 555-1212 (don’t evaluate this Regex when there are no dictionary matches)
CEJoin → John Smith at 555-1212
Implementing Restricted Span Evaluation (RSE)
 RSE join operator
 RSE extraction operator
 Pass join bindings down to the inner of a join
 Requires special physical operators at edges of plan
Example: R1 joined on p(s1, s2) with Dict(D, s2)
– The RSE join takes s1 from R1 and passes the s1 binding down to RSEDict (the RSE Dictionary Operator over D)
– RSEDict returns the s2’s that satisfy p(binding, s2)
RSE Dictionary Operator
[Figure: a range of document text. To find dictionary matches that end in this range, need to examine this range extended to the left by the length of the longest dictionary entry.]
 RSE version of an operator must produce the exact same answer
– Ongoing work: RSE Regular Expression operator
Separating Performance from Semantics
AQL Language: specify annotator semantics declaratively
Optimizer: choose an efficient execution plan that implements semantics
Operator Runtime
Historical Perspective: Information Extraction
 MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA
– Shared data sets and performance metrics
• News articles, Radio transcripts, Military telegraphic messages
 Classical IE Tasks
– Entity and Relationship/Link extraction
– Entity resolution/matching
– Event detection (Identify a complex event such as a merger or meeting involving multiple
entities)
 Several IE systems were built by this community
– FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS
[Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Common Pattern Specification Language (CPSL)
 CPSL
– A standard language for specifying cascading grammars
– Created in 1998
 Example: rules applied level by level to sample text
– Level 0 (raw text): … sed John Smith at 555-1212 hendrerit …
– Level 1 rules
• Token[~ “[1-9]\d{2}-\d{4}”] → Phone
• Token[~ “John | Smith | …”]+ → Name
– Level 1 output: … sed <Name> at <Phone> hendrerit …
– Level 2 rule
• Name Token[~ “at”] Phone → PersonPhone
– Level 2 output: … sed <PersonPhone>, hendrerit …
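The cascade on this slide can be mimicked with a short Python sketch. It is not a CPSL engine: whitespace tokenization, the tiny name list, and the function names `level1` and `level2` are assumptions made for illustration. Level 1 labels Phone tokens and runs of Name tokens; Level 2 combines Name, the token "at", and Phone into PersonPhone.

```python
import re

PHONE = re.compile(r"[1-9]\d{2}-\d{4}")
NAME_WORDS = {"John", "Smith"}  # stand-in for the slide's "John | Smith | ..." list


def level1(tokens):
    """Level 1: label Phone tokens and runs of one or more Name tokens."""
    out, i = [], 0
    while i < len(tokens):
        if PHONE.fullmatch(tokens[i]):
            out.append(("Phone", tokens[i]))
            i += 1
        elif tokens[i] in NAME_WORDS:
            j = i
            while j < len(tokens) and tokens[j] in NAME_WORDS:
                j += 1
            out.append(("Name", " ".join(tokens[i:j])))
            i = j
        else:
            out.append(("Token", tokens[i]))
            i += 1
    return out


def level2(annots):
    """Level 2: Name, the token "at", Phone -> PersonPhone."""
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(text for _, text in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out
```

Each level consumes only the output of the level below it, which is the defining property of a cascading grammar.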
Execution Time Breakdown
[Chart: share of execution time (0% to 100%) spent in Dictionary, Regular Expression, Join, and Other operators, for the Naïve Plan and the Optimized plan]
AQL Syntax
<ProperNoun> (Regular Expression Match), <within 30 characters>, <Instrument> (Dictionary Match)
select
    CombineSpans(name.match, instrument.match)
        as annot,
    name.match as name,
    instrument.match as instr
from
    Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
    Dictionary("instr.dict", DocScan.text) instrument
where
    Follows(0, 30, name.match, instrument.match);
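For readers unfamiliar with AQL, the query above can be read roughly as the following Python sketch. The helper names mirror the AQL functions but are hypothetical re-implementations written for this example; case-insensitive whole-word dictionary matching and the in-memory instrument list (in place of "instr.dict") are assumptions.

```python
import re

PROPER_NOUN = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")


def regex_spans(pattern, text):
    """Spans (begin, end) of all regex matches."""
    return [m.span() for m in pattern.finditer(text)]


def dictionary_spans(entries, text):
    """Spans of all whole-word, case-insensitive dictionary entry matches."""
    spans = []
    for entry in entries:
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text, re.IGNORECASE):
            spans.append(m.span())
    return sorted(spans)


def follows(min_gap, max_gap, first, second):
    """True if `second` begins min_gap..max_gap characters after `first` ends."""
    return min_gap <= second[0] - first[1] <= max_gap


def combine_spans(first, second):
    """Smallest span covering both inputs, assuming `first` precedes `second`."""
    return (first[0], second[1])


def name_instrument(text, instruments):
    rows = []
    for name in regex_spans(PROPER_NOUN, text):
        for instr in dictionary_spans(instruments, text):
            if follows(0, 30, name, instr):
                rows.append({"annot": combine_spans(name, instr),
                             "name": name, "instr": instr})
    return rows
```

The nested loops are just the most literal reading of the `from` and `where` clauses; the query itself does not prescribe an evaluation order.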
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
Annotation Development Cycle
[Diagram: a development cycle with the stages Define, Develop, Test, Deploy, and Identify Problems]
Annotation Development Cycle
[Diagram: the same cycle (Define, Develop, Test, Deploy, Identify Problems), divided between the Runtime Environment and the Annotator Development Environment]
Ease of development and maintenance
System T Block Diagram
[Diagram: in the Development Environment, Rules (AQL) and Representative Documents go to the Optimizer, which produces a Plan (Algebra); in the Runtime Environment, the plan turns an Input Document Stream into an Annotated Document Stream]
UIMA
 Where does UIMA fit in all of this?
– UIMA is a software framework for NLP
• Allows complex annotators to be composed as a pipeline of smaller building blocks
– What UIMA is not…
• Does not specify how an annotator performs its extraction task
• Does not provide a rule language nor a rule-matching engine
– Orthogonal to the focus of this talk
 However
– The AQL runtime can be embedded inside a UIMA annotator.
[Diagram: Rules (AQL) pass through the Optimizer to produce a Plan (Algebra) for the AQL Runtime, which is wrapped in Java UIMA code as Annotator A and sits in a pipeline with UIMA Annotators B and C]
The AQL Rule Language
Pattern: <Person>, then 0-30 chars, then <PhoneNum>
– The text in between contains “phone” or “at”
– Both fall within a single sentence
Custom Code
    // Span and Pair are the slide's helper types.
    // Needs java.util.HashSet, java.util.Set, java.util.regex.Pattern.
    Set<Pair<Span>> extractPersonPhoneCandidate(String text)
    {
        Set<Span> Person = extractPersons(text);
        Set<Span> PhoneNum = extractPhoneNumber(text);
        Set<Span> Sentence = extractSentence(text);
        Set<Pair<Span>> PersonPhoneCandidate = new HashSet<Pair<Span>>();
        for (Span P : Person) {
            for (Span N : PhoneNum) {
                if (Follows(P, N, 0, 30)) {
                    String textBetween = text.substring(P.end, N.begin);
                    Pattern R = Pattern.compile("\\b(phone|at)\\b");
                    if (R.matcher(textBetween).find()) {
                        PersonPhoneCandidate.add(new Pair<Span>(P, N));
                    }
                }
            }
        }
        // Keep only candidates that lie within a single sentence
        Set<Pair<Span>> PersonPhone = new HashSet<Pair<Span>>();
        for (Pair<Span> C : PersonPhoneCandidate) {
            for (Span S : Sentence) {
                if (S.contains(C)) {
                    PersonPhone.add(C);
                }
            }
        }
        return PersonPhone;
    }
    boolean Follows(Span first, Span second, int min, int max) {
        int distance = second.begin - first.end;
        return (distance >= min) && (distance <= max);
    }
AQL
    create view PersonPhone as
    select P.name as person, N.number as phone
    from Person P, PhoneNumber N, Sentence S
    where
        Follows(P.name, N.number, 0, 30)
        and Contains(S.sentence, P.name)
        and Contains(S.sentence, N.number)
        and ContainsRegex(/\b(phone|at)\b/,
            SpanBetween(P.name, N.number));
 Development costs
 Maintenance costs
 Performance
 Correctness
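As a cross-check on the PersonPhone rule, the four predicates can be written as small Python functions over (begin, end) offsets. This is a hedged sketch: the function names echo the AQL predicates but are hypothetical re-implementations, and the Person, PhoneNumber, and Sentence spans are assumed to be supplied by earlier extraction steps.

```python
import re


def follows(first, second, min_gap, max_gap):
    """True if `second` begins min_gap..max_gap characters after `first` ends."""
    return min_gap <= second[0] - first[1] <= max_gap


def contains(outer, inner):
    """True if `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def span_between(first, second):
    """The span of text strictly between two spans."""
    return (first[1], second[0])


def contains_regex(pattern, text, span):
    """True if the text covered by `span` contains a match of `pattern`."""
    return re.search(pattern, text[span[0]:span[1]]) is not None


def person_phone(text, persons, phones, sentences):
    """All (person, phone) pairs satisfying the four predicates of the rule."""
    return [
        (p, n)
        for p in persons for n in phones for s in sentences
        if follows(p, n, 0, 30)
        and contains(s, p) and contains(s, n)
        and contains_regex(r"\b(phone|at)\b", text, span_between(p, n))
    ]
```

Each condition is stated once, independently of the order in which the pairs are enumerated.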
Example: Building a Phone Number Annotator
Candidate pattern: [\d()-\.]{7,15}, shown with the right context of each match

Candidate match      Right context
(123)4568909         Ext 12345
1-800-124-2456       x1235
123-890-8990         ext-1230
789.890.8980         .
345-678-9012         .
123.345.7890         .
1-890-890-0890       .
(408)123-7898        \n
123.456.789.189      or
10.50-100.00         $
10.10.2008           10:00am

Additional patterns for Phone Number: Extension Numbers
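A minimal Python sketch of the idea: find candidates with a character-class regex, then inspect the right context for an extension. The extension pattern, the 12-character context width, and the function name `phone_numbers` are assumptions for illustration; the sketch does not filter out non-phone candidates such as dates or price ranges.

```python
import re

# Candidate pattern in the spirit of the slide: 7 to 15 digits or separators.
CANDIDATE = re.compile(r"[\d().\-]{7,15}")
# Hypothetical extension pattern applied to the right context of a candidate.
EXTENSION = re.compile(r"\s*(?:ext|x)[\s.\-]*(\d{3,5})", re.IGNORECASE)


def phone_numbers(text, context_chars=12):
    """Return (candidate, extension or None) for each candidate match."""
    results = []
    for m in CANDIDATE.finditer(text):
        right_context = text[m.end():m.end() + context_chars]
        ext = EXTENSION.match(right_context)
        results.append((m.group(), ext.group(1) if ext else None))
    return results
```

Keeping the extension rule separate from the candidate pattern means new right-context patterns can be added without touching the core number regex.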
Road Map
An Algebraic Approach
to Information Extraction
System T and the AQL Language
Annotators built with AQL
Person Annotator
 Names appear in widely varying contexts
– Mr. Dabrowski received a Bachelor degree…
– Dr. Jean L. Rouleau Dean of Medicine University…
– …met Peter and Katie Lawton who have…
– …lives in Riverdale, NY, with his wife Marie-Jeanne. He has two married sons, James and Michael.
– The Honorable Carol Boyd Hallett - Of Counsel…
– Kimberly Purdy Lloyd received a Bachelor of Science degree from the University of Texas…
 Additional Challenges
– Avoiding person names inside or overlapping with other entities
• Organization, Address
– Lists of person names
• Attendees Ida White, Bridget McBean, Volker Hauck
Currently supports names from > 8 countries, including Israel
Person Annotator Outline
 Stage 1: Identify individual features
• <FirstName>, <LastName>, <Salutation>, <CapsPerson>, <Initial> …
– Dictionaries, Regular expressions
 Stage 2: Identify candidate persons based on strong patterns
• <FirstName>(<CapsPerson>|<Initial>)?<LastName>
• <Salutation>(<CapsPerson>|<Initial>)?<CapsPerson>
• <LastName>, <FirstName>
• …
– Joins, Selection predicates, Block
 Stage 3: Eliminate weaker matches, handle lists
• Delete annotations generated by lower priority rules
– Consolidation, Minus, Selection predicates
 Stage 4: Remove matches within other entities
– Consolidation, Minus, Selection predicates
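Stages 1 to 3 can be sketched in Python as follows. Everything here is an assumption made for illustration: the tiny dictionaries, the token regex, the two rule families with their priorities, and the greedy consolidation policy. Stage 4 (removing matches inside other entities) is omitted.

```python
import re

FIRST_NAMES = {"Jean", "Peter", "Katie"}   # toy dictionaries
LAST_NAMES = {"Rouleau", "Lawton"}
TOKEN = re.compile(r"(?:Mr|Dr|Ms)\.|[A-Z]\.|[A-Za-z]+")


def features(text):
    """Stage 1: attach a set of feature kinds to each token span."""
    feats = []
    for m in TOKEN.finditer(text):
        word, kinds = m.group(), set()
        if word in ("Mr.", "Dr.", "Ms."):
            kinds.add("Salutation")
        elif re.fullmatch(r"[A-Z]\.", word):
            kinds.add("Initial")
        elif re.fullmatch(r"[A-Z][a-z]+", word):
            kinds.add("CapsPerson")
            if word in FIRST_NAMES:
                kinds.add("FirstName")
            if word in LAST_NAMES:
                kinds.add("LastName")
        feats.append((kinds, m.start(), m.end()))
    return feats


# Stage 2 rules: (priority, allowed kinds per token); a lower number is stronger.
RULES = [
    (1, [{"FirstName"}, {"LastName"}]),
    (1, [{"FirstName"}, {"CapsPerson", "Initial"}, {"LastName"}]),
    (2, [{"Salutation"}, {"CapsPerson"}]),
    (2, [{"Salutation"}, {"CapsPerson", "Initial"}, {"CapsPerson"}]),
]


def candidates(text):
    """Stage 2: match each rule against whitespace-separated token windows."""
    feats, found = features(text), []
    for priority, rule in RULES:
        for i in range(len(feats) - len(rule) + 1):
            window = feats[i:i + len(rule)]
            kinds_ok = all(tok[0] & allowed for tok, allowed in zip(window, rule))
            adjacent = all(text[a[2]:b[1]].isspace() for a, b in zip(window, window[1:]))
            if kinds_ok and adjacent:
                found.append((priority, window[0][1], window[-1][2]))
    return found


def consolidate(found):
    """Stage 3: keep stronger, longer candidates; drop any that overlap a kept one."""
    kept = []
    for cand in sorted(found, key=lambda c: (c[0], -(c[2] - c[1]))):
        if not any(cand[1] < k[2] and k[1] < cand[2] for k in kept):
            kept.append(cand)
    return sorted((b, e) for _, b, e in kept)


def persons(text):
    return [text[b:e] for b, e in consolidate(candidates(text))]
```

In the "Dr. Jean L. Rouleau" example, the weaker salutation match "Dr. Jean" overlaps the stronger first-name/last-name match and is dropped during consolidation.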
Address Annotator
 USAddress has a well-defined pattern
– <StreetAddress> <SecondaryUnit>? <City> <State> <Zipcode>?
– 1515 Pioneer Drive Harrison, AR 72601
– 3607 Church Street, Suite 300 · Cincinnati, Ohio 45244
– 101 S. Webster Street . PO Box 7921 . Madison, Wisconsin 537077921
 Challenges
– Multiple parts to the Address
– Some parts are optional (e.g., Secondary Unit, Zipcode)
– <City> cannot be identified using Dictionary due to resource restrictions
– Handling ambiguous abbreviations
• Ms, MA, In: state names
• Dr., Row: street suffixes
Currently supports U.S. and German addresses
Address Annotator Outline
 Stage 1 : Primary features identified
• <StreetAddress>, <SecondaryUnit>, <State>, <Zipcode>
– Regular Expressions, Dictionaries, Joins
 Stage 2 : Complete StreetAddress identified
• <StreetAddress> <SecondaryUnit>?
– Join, Union
 Stage 3 : StreetAddress combined with State information
• <StreetAddress><SecondaryUnit>?<City><State>
– Join, Union, Selection Predicates
 Stage 4 : Combining with Zipcode
• <StreetAddress><SecondaryUnit>?<City><State><Zipcode>?
– Join, Union, Selection Predicates
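The four stages can be sketched in Python as a sequence of joins over spans. The regular expressions, the three-entry state list, and the function name `us_addresses` are assumptions for illustration only. Since the slides note that <City> cannot come from a dictionary, the sketch treats the capitalized words between the street part and the state as the city.

```python
import re

# Stage 1 features (toy patterns and a toy state dictionary).
STREET = re.compile(r"\d+\s+(?:[A-Z][\w.]*\s+)+?(?:Street|Drive|Avenue|Road)\b")
SECONDARY = re.compile(r"(?:Suite|Apt\.?|PO Box)\s+\d+")
STATE = re.compile(r"\b(?:AR|Ohio|Wisconsin)\b")
ZIPCODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
SEPARATOR = re.compile(r"[\s,.·]*")
CITY_GAP = re.compile(r"[\s,.·]*[A-Z][a-z]+(?:\s[A-Z][a-z]+)*,?\s*")


def spans(pattern, text):
    return [m.span() for m in pattern.finditer(text)]


def us_addresses(text):
    streets = spans(STREET, text)
    # Stage 2: union of plain streets and streets joined with a secondary unit.
    complete = list(streets)
    for s in streets:
        for u in spans(SECONDARY, text):
            if u[0] >= s[1] and SEPARATOR.fullmatch(text[s[1]:u[0]]):
                complete.append((s[0], u[1]))
    # Stage 3: join with a state when the gap in between looks like a city.
    with_state = []
    for s in complete:
        for st in spans(STATE, text):
            if st[0] >= s[1] and CITY_GAP.fullmatch(text[s[1]:st[0]]):
                with_state.append((s[0], st[1]))
    # Stage 4: extend with a zipcode if one follows after whitespace only.
    results = []
    for a in with_state:
        zips = [z for z in spans(ZIPCODE, text)
                if z[0] >= a[1] and text[a[1]:z[0]].isspace()]
        end = zips[0][1] if zips else a[1]
        results.append(text[a[0]:end])
    return results
```

The selection predicate in Stage 3 rejects the plain street span when a secondary unit sits between it and the city, so only the completed street address survives.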
Organization Annotator
 Organization names appear in a wide range of contexts
– is a graduate of Hofstra University
– He joined Interactive Data in 2003
– President of Foley & Lardnear LLP
– Received her B.S in English from University of Wisconsin
– The bill at the Savoy Hotel
 Additional Challenges
– Long organization names (Q: where do they begin and end?)
• The Chartered Institute of Public Finance and Accountancy
– May contain a list of person names
• Squar, Milner, Peterson, Miranda & Williamson, LLP
• John Ortiz, James & James Ltd
– Adjacent organization names
• University of Michigan Ross School of Business
– Multiple representations of the same organization and its subdivisions
• Enron, Enron Corp., Enron Corporation, Enron Metals & Commodity Corp.
Organization Annotator Outline
 Stage 1: Identify individual features
• CommonOrganization, Suffix, Prefix, IndustryType, CapsOrg, PrepOrg
– Dictionaries, Regular expressions
 Stage 2: Identify candidate organizations based on strong patterns
• (<The><CapsOrg>{1,3}<Conj>)?<CapsOrg>{1,3}<Suffix>|<IndustryType>
• <CapsOrg>{1,3}<Prefix><PrepOrg><CapsOrg>{1,2}(<Conj><CapsOrg>{1,2})?
• <CommonOrganization>(<CapsOrg>{1,3}(<Suffix>|<IndustryType>))?
• …
– Joins, Selection predicates, Block
 Stage 3: Eliminate weaker matches, handle lists
• Delete annotations generated by lower priority rules
– Consolidation, Minus, Selection predicates
 Stage 4: Remove matches within other entities
– Consolidation, Minus, Selection predicates