Machine Learning for Information Extraction: An Overview
Kamal Nigam, Google Pittsburgh
With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea.
Example: A Problem
– Mt. Baker, the school district
– Baker Hostetler, the company
– Baker, a job opening
– Genomics job

Example: A Solution
Job Openings:
– Category = Food Services
– Keyword = Baker
– Location = Continental U.S.

Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services

Potential Enabler of Faceted Search

Lots of Structured Information in Text

IE from Research Papers

What is Information Extraction?
• Recovering structured data from formatted text
– Identifying fields (e.g. named entity recognition)
– Understanding relations between fields (e.g. record association)
– Normalization and deduplication
• Today, focus mostly on field identification & a little on record association

IE Posed as a Machine Learning Task
• Training data: documents marked up with ground truth
• In contrast to text classification, local features are crucial. Features of:
– Contents
– Text just before item
– Text just after item
– Begin/end boundaries
Example: … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
(prefix / contents / suffix)

Good Features for Information Extraction
Creativity and Domain Knowledge Required!
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Other example features:
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

Good Features for Information Extraction
Creativity and Domain Knowledge Required!
– Is Capitalized
– Is Mixed Caps
– Is All Caps
– Initial Cap
– Contains Digit
– All lowercase
– Is Initial
– Punctuation: Period, Comma, Apostrophe, Dash
– Preceded by HTML tag
– Character n-gram classifier says string is a person name (80% accurate)
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list; segmented by P(name)
– In Census firstname list; segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list ("J. C. Penny")
– In list of company suffixes (Inc, & Associates, Foundation)
Word Features
– lists of job titles
– lists of prefixes
– lists of suffixes
– 350 informative phrases
HTML/Formatting Features
– {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
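A minimal sketch (not from the slides) of how a few of these word-level features might be computed in Python; the lexicons, feature names, and example sentence below are illustrative assumptions, not part of the original tutorial.

```python
CITY_NAMES = {"pittsburgh", "boston", "seattle"}   # toy lexicons, illustrative only
HONORIFICS = {"mr", "mrs", "dr", "sen"}

def token_features(tokens, i):
    """A few of the word-level features listed above, computed for token i."""
    w = tokens[i]
    feats = {
        "identity=" + w.lower(): 1,
        "is_all_caps": int(w.isupper()),
        "first_alpha_is_capitalized": int(w[:1].isupper()),
        "contains_digit": int(any(c.isdigit() for c in w)),
        "ends_in_ski": int(w.lower().endswith("ski")),
        "in_city_list": int(w.lower() in CITY_NAMES),
        "in_honorific_list": int(w.lower().rstrip(".") in HONORIFICS),
    }
    # Features of past & future: look at the neighboring tokens too.
    if i > 0:
        feats["prev_identity=" + tokens[i - 1].lower()] = 1
    if tokens[i + 1:i + 3] == ["and", "Associates"]:
        feats["next_two_words=and_Associates"] = 1
    return feats

tokens = "Speaker : Dr. Sebastian Thrun , Wean Hall".split()
print(token_features(tokens, 2))   # features for "Dr."
```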
IE History
Pre-Web
• Mostly news articles
– De Jong's FRUMP [1982]: a hand-built system to fill Schank-style "scripts" from news wire
– Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
– E.g. SRI's FASTUS, hand-built FSMs
– But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
• AAAI '94 Spring Symposium on "Software Agents"
– Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
• Tom Mitchell's WebKB, '96
– Build KBs from the Web
• Wrapper induction
– Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Landscape of ML Techniques for IE
(Each technique is illustrated on the sentence "Abraham Lincoln was born in Kentucky.")
• Classify candidates: a classifier asks "which class?" for each candidate
• Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes
• Boundary models: classifiers mark BEGIN and END positions
• Finite state machines: what is the most likely state sequence?
• Wrapper induction: learn and apply a pattern for a website, e.g. <b> <i> PersonName from "<b><i>Abraham Lincoln</i></b> was born in Kentucky."
Any of these models can be used to capture words, formatting or both.

Sliding Windows & Boundary Detection

Information Extraction by Sliding Window
E.g. looking for the seminar location ("3:30 pm 7500 Wean Hall") in a CMU UseNet seminar announcement:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

A candidate window slides across this text, and each window is classified as the target field or not.
Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
A candidate is a window of tokens w_t … w_{t+n} (the contents), together with a prefix w_{t-m} … w_{t-1} and a suffix w_{t+n+1} … w_{t+n+m}:
  … 00 : pm Place : [ Wean Hall Rm 5409 ] Speaker : Sebastian Thrun …
     (prefix)          (contents)           (suffix)
• Standard supervised learning setting
– Positive instances: candidates with the real label
– Negative instances: all other candidates
– Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well, e.g.:
  courseNumber(X) :-
      tokenLength(X, =, 2),
      every(X, inTitle, false),
      some(X, A, <previousToken>, inTitle, true),
      some(X, B, <>, tripleton, true)

Rule-learning approaches to sliding-window classification: Summary
• Representations for classifiers allow restriction of the relationships between tokens, etc.
• Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
• Use of these "heavyweight" representations is complicated, but seems to pay off in results
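To make the sliding-window formulation above concrete, here is a rough Python sketch (not from the talk) that enumerates candidate windows and scores each one from prefix/contents/suffix features; the `score_fn` argument is a placeholder for whatever classifier or rule set is actually learned.

```python
def window_candidates(tokens, max_len=5):
    """Enumerate all (start, end) token windows up to max_len tokens long."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end

def window_features(tokens, start, end, context=2):
    """Features of the candidate contents plus its prefix and suffix."""
    feats = {"length=" + str(end - start): 1}
    for w in tokens[start:end]:
        feats["contents=" + w.lower()] = 1
    for w in tokens[max(0, start - context):start]:
        feats["prefix=" + w.lower()] = 1
    for w in tokens[end:end + context]:
        feats["suffix=" + w.lower()] = 1
    return feats

def extract_field(tokens, score_fn, threshold=0.5, max_len=5):
    """Return the highest-scoring window above threshold, or None.
    score_fn maps a feature dict to a probability; it stands in for any
    trained model (naive Bayes, a rule set, logistic regression, ...)."""
    best, best_score = None, threshold
    for start, end in window_candidates(tokens, max_len):
        score = score_fn(window_features(tokens, start, end))
        if score > best_score:
            best, best_score = (start, end), score
    return best
```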
IE by Boundary Detection
E.g. looking for the seminar location ("3:30 pm 7500 Wean Hall") in the same CMU UseNet seminar announcement, this time by separately detecting the start and end boundaries of the field.

BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic classifiers:
– START(i) = Prob(position i starts a field)
– END(j) = Prob(position j ends a field)
– LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)
• LEN(k) is estimated from a histogram

BWI: Learning to detect boundaries
• BWI uses boosting to find "detectors" for START and END
• Each weak detector has a BEFORE and an AFTER pattern (on tokens before/after position i)
• Each "pattern" is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
• The weak learner for "patterns" uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
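A minimal sketch of the START * END * LEN scoring rule described above, assuming the per-position boundary probabilities and the length histogram have already been estimated. This is illustrative code, not BWI itself, and it uses inclusive token indices, so the length term is j - i + 1.

```python
def score_extraction(i, j, start_prob, end_prob, len_hist):
    """Score a field that starts at token i and ends at token j (inclusive):
    START(i) * END(j) * LEN(j - i + 1)."""
    return start_prob[i] * end_prob[j] * len_hist.get(j - i + 1, 0.0)

def best_extraction(start_prob, end_prob, len_hist, max_len=10):
    """Exhaustively score all spans up to max_len tokens and return the best."""
    n = len(start_prob)
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            s = score_extraction(i, j, start_prob, end_prob, len_hist)
            if s > best_score:
                best, best_score = (i, j), s
    return best, best_score

# Toy usage with made-up probabilities for an 8-token document:
start_prob = [0.01, 0.02, 0.90, 0.05, 0.01, 0.01, 0.01, 0.01]
end_prob   = [0.01, 0.01, 0.05, 0.10, 0.85, 0.02, 0.01, 0.01]
len_hist   = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}
print(best_extraction(start_prob, end_prob, len_hist))   # -> ((2, 4), 0.306)
```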
BWI: Learning to detect boundaries (results)
Field         F1
Person Name   30%
Location      61%
Start Time    98%

Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other.
– A naïve Bayes sliding window may predict a "seminar end time" before the "seminar start time".
– It is possible for two overlapping windows to both be above threshold.
– In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Finite State Machines

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Graphical model / finite state model: states S = {s1, s2, …} with transitions generate a state sequence and an observation sequence o1 o2 … o|o|:
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Parameters, for all states:
– Start state probabilities: P(s_t)
– Transition probabilities: P(s_t | s_{t-1})
– Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior)

IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi):
  \arg\max_s P(s, o)
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Lawrence Saul

Generative Extraction with HMMs [McCallum, Nigam, Seymore & Rennie '00]
• Parameters: {P(s_t | s_{t-1}), P(o_t | s_t)} for all states s_t and words o_t
• The parameters define the generative model:
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

HMM Example: "Nymble" [Bikel et al '97]
Task: named entity extraction, with states start-of-sentence, end-of-sentence, Person, Org, Other (and five other name classes). Trained on 450k words of news wire text.
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).
Results:
Language   Case    F1
English    Mixed   93%
English    Upper   91%
Spanish    Mixed   90%
Other examples of HMMs in IE: [Leek '97; Freitag & McCallum '99; Seymore et al. '99]
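For concreteness, here is a standard textbook Viterbi decoder for computing argmax_s P(s, o), of the kind the HMM extractors above rely on. The states, probability tables and tokens below are toy values, not the models from the cited papers.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence under an HMM.
    start_p[s], trans_p[s_prev][s], emit_p[s][o] are probabilities."""
    # V[t][s] = log-probability of the best path ending in state s at time t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-12)))
            back[t][s] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy example: tag tokens as person-name vs. other.
states = ["person", "other"]
start_p = {"person": 0.2, "other": 0.8}
trans_p = {"person": {"person": 0.5, "other": 0.5},
           "other": {"person": 0.1, "other": 0.9}}
emit_p = {"person": {"Lawrence": 0.3, "Saul": 0.3},
          "other": {"Yesterday": 0.2, "spoke": 0.2}}
obs = ["Yesterday", "Lawrence", "Saul", "spoke"]
print(viterbi(obs, states, start_p, trans_p, emit_p))
# -> ['other', 'person', 'person', 'other']
```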
Regrets from the Atomic View of Tokens
Would like a richer representation of text: multiple overlapping features, whole chunks of text.
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity

Problems with a Richer Representation and a Generative Model
• These arbitrary features are not independent:
– Overlapping and long-distance dependencies
– Multiple levels of granularity (words, characters)
– Multiple modalities (words, formatting, layout)
– Observations from past and future
• HMMs are generative models of the text: they model P(s, o).
• Generative models do not easily handle these non-independent features. Two choices:
– Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
– Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
• We would prefer a conditional model, P(s | o), instead of P(s, o):
– It can examine features, but is not responsible for generating them.
– We don't have to explicitly model their dependencies.
– We don't "waste modeling effort" trying to generate what we are given at test time anyway.
• If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.

Conditional Markov Models
– Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
– MaxEnt POS Tagger [Ratnaparkhi, 1996]
– SNoW-based Markov Model [Punyakanok & Roth, 2000]
Generative (traditional HMM):
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Conditional:
  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.

Exponential Form for the "Next State" Function
Capture the dependency on s_{t-1} with |S| independent functions P_{s_{t-1}}(s_t | o_t). Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state:
  P(s_t | s_{t-1}, o_t) = P_{s_{t-1}}(s_t | o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\left( \sum_k \lambda_k f_k(o_t, s_t) \right)
where the \lambda_k are weights and the f_k are features.
Recipe:
– Labeled data is assigned to transitions.
– Train each state's exponential model by maximum entropy.

Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
  Pr(0123 | rib) = 1    Pr(0453 | rob) = 1
• But then:
  Pr(0123 | rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
  Pr(0453 | rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
Because each state's next-state distribution is locally normalized, states with a single outgoing transition pass all their probability mass along regardless of the observation, so the model cannot prefer the correct path.

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]
From HMMs to MEMMs to CRFs, with s = s_1, s_2, … s_n and o = o_1, o_2, … o_n:
HMM (a special case of MEMMs and CRFs):
  P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
MEMM:
  P(s | o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}, o_t) = \prod_{t=1}^{|o|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)
CRF:
  P(s | o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\left( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \right)
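To make the per-state "next-state classifier" concrete, here is a small sketch of the locally normalized exponential form above; the feature functions and weights are toy assumptions, not values from any of the cited systems.

```python
import math

def next_state_distribution(prev_state, obs, states, weights, feature_fns):
    """MEMM-style P(s_t | s_{t-1}, o_t): one locally normalized exponential
    model per previous state. weights[prev_state][k] pairs with feature_fns[k]."""
    scores = {}
    for s in states:
        scores[s] = math.exp(sum(w * f(obs, s)
                                 for w, f in zip(weights[prev_state], feature_fns)))
    z = sum(scores.values())   # per-step normalizer Z(o_t, s_{t-1})
    return {s: scores[s] / z for s in states}

# Toy example: two states and two hand-written feature functions.
states = ["person", "other"]
feature_fns = [
    lambda o, s: 1.0 if s == "person" and o[:1].isupper() else 0.0,
    lambda o, s: 1.0 if s == "other" and not o[:1].isupper() else 0.0,
]
weights = {"person": [1.5, 1.0], "other": [2.0, 0.5]}
print(next_state_distribution("other", "Lawrence", states, weights, feature_fns))
# -> roughly {'person': 0.88, 'other': 0.12}
```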
Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]
Graphical structure: a linear chain of states S_t, S_{t+1}, S_{t+2}, …, all conditioned on the entire observation sequence O = O_t, O_{t+1}, O_{t+2}, …
Markov on s, with conditional dependency on o:
  P(s | o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right)
The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph. Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

Training CRFs
Maximize the log-likelihood of the parameters given the training data {<o, s>^{(i)}}:
  L(\{\lambda_k\} | \{<o, s>^{(i)}\})
The log-likelihood gradient is the feature count using the correct labels, minus the feature count using labels assigned by the current parameters, minus a smoothing penalty:
  \frac{\partial L}{\partial \lambda_k} = \sum_i C_k(s^{(i)}, o^{(i)}) - \sum_i \sum_s P_{\{\lambda_k\}}(s | o^{(i)}) C_k(s, o^{(i)}) - \frac{\lambda_k}{\sigma^2}
where C_k(s, o) = \sum_t f_k(o, t, s_{t-1}, s_t).
Methods:
• iterative scaling (quite slow)
• conjugate gradient (much faster)
• conjugate gradient with preconditioning (super fast)
• limited-memory quasi-Newton methods (also super fast)
Complexity is comparable to standard Baum-Welch. [Sha & Pereira 2002], [Malouf 2002]

Sample IE Applications of CRFs
• Noun phrase segmentation [Sha & Pereira 03]
• Named entity recognition [McCallum & Li 03]
• Protein names in bio abstracts [Settles 05]
• Addresses in web pages [Culotta et al. 05]
• Semantic roles in text [Roth & Yih 05]
• RNA structural alignment [Sato & Sakakibara 05]

Examples of Recent CRF Research
• Semi-Markov CRFs [Sarawagi & Cohen 05]
– Token-level decisions are awkward for segments
– A segment sequence model alleviates this
– Two-level model with sequences of segments, which are themselves sequences of tokens
• Stochastic Meta-Descent [Vishwanathan 06]
– Stochastic gradient optimization for training
– Take gradient steps with small batches of examples
– An order of magnitude faster than L-BFGS
– Same resulting accuracies for extraction

Further Reading about CRFs
Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006. http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
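To connect the training equations above to code, here is a generic sketch (not from the tutorial) of the forward recursion that computes the normalizer log Z_o for a linear-chain CRF; the expected feature counts in the gradient come from the forward-backward extension of the same recursion. The potential tensor below is assumed to have been built already from the features f_k and weights lambda_k.

```python
import numpy as np

def log_partition(log_potentials):
    """log Z_o for a linear-chain CRF.

    log_potentials[t, i, j] = sum_k lambda_k * f_k(s_t = j, s_{t-1} = i, o, t),
    i.e. the log-score of moving from state i to state j at position t.
    Shape (T, S, S); the t = 0 slice is assumed to fold in any start scores
    via a fixed dummy start state 0. This is the same O(T * S^2) dynamic
    program that makes exact inference tractable on a chain."""
    T, S, _ = log_potentials.shape
    # alpha[j] = log-sum of scores of all prefixes ending in state j.
    alpha = log_potentials[0, 0, :]
    for t in range(1, T):
        # logsumexp over the previous state, for every next state.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_potentials[t], axis=0)
    return np.logaddexp.reduce(alpha)

# Toy example: 4 positions, 3 states, random feature scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 3, 3))
print(log_partition(scores))
```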