
Information Extraction,
Conditional Random Fields,
and Social Network Analysis
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with
Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li,
Andres Corrada, Xuerui Wang
Goal:
Mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Category = High Tech
Keyword = Java
Location = U.S.
Job Openings:
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents spanning several millennia:
- Qing Dynasty archives
- memos
- newspaper articles
- diaries
IE from Cargo Container Ship Manifests
Cargo Tracking Div.
US Navy
IE from Research Papers
[McCallum et al ‘99]
Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers,
Smyth, 2004]
What is “Information Extraction”?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted and associated entities:
Person            Role      Organization
Bill Gates        CEO       Microsoft Corporation
Gates                       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Software Foundation
Larger Context

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Outline
• Examples of IE and Data Mining. ✓
• Conditional Random Fields and Feature Induction.
• Joint inference: Motivation and examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Two example projects
– Email, contact address book, and Social Network Analysis
– Research Paper search and analysis
Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

Finite state / graphical model: hidden states S_{t-1}, S_t, S_{t+1}, … emit observations O_{t-1}, O_t, O_{t+1}, …. The model generates a state sequence and an observation sequence o_1 o_2 o_3 …:

P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Parameters, for all states S = {s_1, s_2, …}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (with prior).
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background.

Find the most likely state sequence (Viterbi):

Yesterday Rich Caruana spoke this example sentence.

Any words generated by the designated “person name” state are extracted as a person name:

Person name: Rich Caruana
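To make the decoding step concrete, here is a minimal Viterbi sketch in Python for a toy tagger with the slide's three states; all transition and emission numbers are invented for illustration, not taken from any trained model.

```python
import numpy as np

# Toy HMM with the slide's three states; all probabilities are invented.
states = ["background", "person", "location"]
start = np.log([0.8, 0.1, 0.1])
trans = np.log([[0.8, 0.1, 0.1],   # from background
                [0.4, 0.5, 0.1],   # from person
                [0.5, 0.1, 0.4]])  # from location

def emission_logprob(word, state):
    """Stand-in emission model: capitalized words are more likely
    under 'person'/'location' than under 'background' (invented numbers)."""
    if word[0].isupper():
        return np.log({"background": 0.05, "person": 0.3, "location": 0.2}[state])
    return np.log({"background": 0.3, "person": 0.01, "location": 0.01}[state])

def viterbi(words):
    n, k = len(words), len(states)
    delta = np.full((n, k), -np.inf)   # best log-prob of a path ending in state j at t
    back = np.zeros((n, k), dtype=int)
    for j in range(k):
        delta[0, j] = start[j] + emission_logprob(words[0], states[j])
    for t in range(1, n):
        for j in range(k):
            scores = delta[t - 1] + trans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + emission_logprob(words[t], states[j])
    path = [int(np.argmax(delta[-1]))]      # backtrace the best final state
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[j] for j in reversed(path)]

print(viterbi("Yesterday Rich Caruana spoke this example sentence .".split()))
```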
We Want More than an Atomic View of Words

We would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
- identity of word
- ends in “-ski”
- is capitalized
- is part of a noun phrase
- is “Wisniewski”
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are “and Associates”
Problems with Richer Representation and a Joint Model

These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future

Two choices:

1. Model the dependencies. Each state would have its own Bayes net, but we are already starved for training data!

2. Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes), a big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
• We prefer a model trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
– It can examine features without being responsible for generating them.
– We don’t have to explicitly model feature dependencies.
– We don’t “waste modeling effort” trying to generate what we are given at test time anyway.
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

s = s_1, s_2, …, s_n    o = o_1, o_2, …, o_n

Joint:
P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Conditional:
P(s | o) = \frac{1}{P(o)} \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
         = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t)

where \Phi_o(t) = \exp\big( \sum_k \lambda_k f_k(s_t, o_t) \big)

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using an optimization method on the gradient of the log-likelihood L.
Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

1. FSM special case: linear chain among unknowns, parameters tied across time steps.

States S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}; observations o = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.

P(s | o) = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \exp\big( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \big)

2. In general: CRFs = "conditionally-trained Markov network", with arbitrary structure among the unknowns.

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:
Parameters tied across hits from SQL-like queries ("clique templates")
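To ground the linear-chain formula, here is a brute-force Python sketch that computes P(s|o) exactly as (1/Z(o)) Π_t exp(Σ_k λ_k f_k(s_t, s_{t-1}, o, t)). The label set, feature functions, and weights are invented for illustration, and enumerating Z(o) is only feasible at toy sizes; real implementations use forward-backward.

```python
import itertools, math

LABELS = ["O", "PER"]

# Invented binary features f_k(s_t, s_{t-1}, o, t) and weights lambda_k.
def features(s_t, s_prev, obs, t):
    return {
        "capitalized->PER": float(obs[t][0].isupper() and s_t == "PER"),
        "PER->PER": float(s_prev == "PER" and s_t == "PER"),
        "lowercase->O": float(obs[t].islower() and s_t == "O"),
    }

weights = {"capitalized->PER": 1.5, "PER->PER": 0.8, "lowercase->O": 1.0}

def score(labels, obs):
    """Unnormalized log-score: sum_t sum_k lambda_k f_k(s_t, s_{t-1}, o, t)."""
    total, prev = 0.0, "START"
    for t, s_t in enumerate(labels):
        f = features(s_t, prev, obs, t)
        total += sum(weights[k] * f[k] for k in f)
        prev = s_t
    return total

def prob(labels, obs):
    """Exact P(s|o): enumerate all label sequences to get Z(o)."""
    Z = sum(math.exp(score(s, obs))
            for s in itertools.product(LABELS, repeat=len(obs)))
    return math.exp(score(labels, obs)) / Z

obs = ["Yesterday", "Rich", "Caruana", "spoke"]
print(prob(("O", "PER", "PER", "O"), obs))
```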
Training CRFs

Maximize the log-likelihood of the parameters given the training data:
L(\{\lambda_k\} ; \{\langle o, s \rangle^{(i)}\})

Log-likelihood gradient:
\frac{\partial L}{\partial \lambda_k} = \sum_i C_k(s^{(i)}, o^{(i)}) - \sum_i \sum_s P_{\{\lambda_k\}}(s | o^{(i)}) \, C_k(s, o^{(i)}) - \frac{\lambda_k}{\sigma^2}

where C_k(s, o) = \sum_t f_k(o, t, s_{t-1}, s_t)

That is: (feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty).
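The gradient above is just empirical counts minus expected counts minus the Gaussian-prior penalty; here is a self-contained brute-force sketch under the same toy assumptions (invented features, exact enumeration in place of dynamic programming, assumed prior variance).

```python
import itertools, math
from collections import defaultdict

LABELS = ["O", "PER"]
SIGMA2 = 10.0  # Gaussian prior variance (an assumption for illustration)

def features(s_t, s_prev, obs, t):
    # Invented features for illustration.
    yield ("cap->PER", float(obs[t][0].isupper() and s_t == "PER"))
    yield ("PER->PER", float(s_prev == "PER" and s_t == "PER"))

def counts(labels, obs):
    """C_k(s, o) = sum_t f_k(o, t, s_{t-1}, s_t)."""
    c, prev = defaultdict(float), "START"
    for t, s_t in enumerate(labels):
        for k, v in features(s_t, prev, obs, t):
            c[k] += v
        prev = s_t
    return c

def gradient(weights, data):
    grad = defaultdict(float)
    for obs, gold in data:
        # Empirical term: feature counts on the correct labeling.
        for k, v in counts(gold, obs).items():
            grad[k] += v
        # Expected term: all labelings, weighted by P(s|o).
        seqs = list(itertools.product(LABELS, repeat=len(obs)))
        scores = [sum(weights.get(k, 0.0) * v for k, v in counts(s, obs).items())
                  for s in seqs]
        Z = sum(math.exp(x) for x in scores)
        for s, x in zip(seqs, scores):
            p = math.exp(x) / Z
            for k, v in counts(s, obs).items():
                grad[k] -= p * v
    # Smoothing penalty from the Gaussian prior.
    for k in weights:
        grad[k] -= weights[k] / SIGMA2
    return dict(grad)

data = [(["Rich", "Caruana", "spoke"], ("PER", "PER", "O"))]
print(gradient({"cap->PER": 0.5, "PER->PER": 0.2}, data))
```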
Linear-chain CRFs vs. HMMs
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
– Parameters need not fully specify generation of observations; can require less training data
– Easy to incorporate domain knowledge
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers

Method                                                            Field-level F1
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]  75.6
Support Vector Machines (SVMs) [Han, Giles, et al., 2003]         89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]           93.9

CRFs reduce error by 40% relative to SVMs.
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95

Year   Milk Cows 1/   Per Milk Cow          Pct. of Fat in      Total
       (1,000 head)   Milk    Milkfat       All Milk Produced   Milk      Milkfat
                      (pounds)              (percent)           (million pounds)
1993   9,589          15,704  575           3.66                150,582   5,514.4
1994   9,500          16,175  592           3.66                153,664   5,623.7
1995   9,461          16,451  602           3.66                155,644   5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov

[Figure: lines of the milk report, each labeled by the CRF]

Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method                   Line labels, percent correct   Table segments, F1
HMM                      65%                            64%
Stateless MaxEnt         85%                            -
CRF w/out conjunctions   52%                            68%
CRF                      95%                            92%
Feature Induction for CRFs
[McCallum, 2003, UAI]

1. Begin with knowledge of atomic features, but no features yet in the model.
2. Consider many candidate features, including atomic features and conjunctions.
3. Evaluate each candidate feature.
4. Add to the model some of those ranked highest.
5. Train the model.
Candidate Feature Evaluation
[McCallum, 2003, UAI]

Common method, information gain:
InfoGain(C, F) = H(C) - \sum_{f \in F} P(f) \, H(C | f)

True optimization criterion, likelihood of training data:
LikelihoodGain(f, \lambda) = L_{\Lambda + \{f, \lambda\}} - L_\Lambda

The technical meat is in how to calculate this efficiently for CRFs:
• Mean-field approximation
• Emphasize error instances (related to boosting)
• Newton's method to set \lambda
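As a sketch of the common InfoGain ranking above for binary candidate features: the token data and candidate features here are invented. Note this implements only the information-gain baseline, not the paper's likelihood-gain criterion with its mean-field approximation.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """InfoGain(C, F) = H(C) - sum_f P(f) H(C | f)."""
    n = len(labels)
    gain = entropy(labels)
    for f in set(feature_values):
        subset = [c for v, c in zip(feature_values, labels) if v == f]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Invented token-level data: (capitalized?, in-person-lexicon?) -> label
tokens = [(True, True, "PER"), (True, False, "ORG"), (False, False, "O"),
          (False, False, "O"), (True, True, "PER"), (False, True, "O")]
labels = [t[2] for t in tokens]
candidates = {"capitalized": [t[0] for t in tokens],
              "in-person-lexicon": [t[1] for t in tokens]}

# Rank candidate features by information gain, highest first.
ranked = sorted(candidates, key=lambda f: info_gain(candidates[f], labels),
                reverse=True)
for f in ranked:
    print(f, round(info_gain(candidates[f], labels), 3))
```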
Named Entity Recognition

CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels and examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (first-mention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                                               F1
HMMs (BBN's Identifinder)                            73%
CRFs w/out Feature Induction                         83%
CRFs with Feature Induction based on LikelihoodGain  90%
Outline
• Examples of IE and Data Mining. ✓
• Conditional Random Fields and Feature Induction. ✓
• Joint inference: Motivation and examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Two example projects
– Email, contact address book, and Social Network Analysis
– Research Paper search and analysis
Problem

Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge

Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
Solution: Uncertainty Info

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Now IE passes uncertainty info forward into the database and data mining, and emerging patterns flow back from data mining to IE.
Solution: Unified Model

Document collection → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Discriminatively-trained undirected graphical models:
– Conditional Random Fields [Lafferty, McCallum, Pereira]
– Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex inference and learning: just what we researchers like to sink our teeth into!
1. Jointly labeling cascaded sequences: Factorial CRFs
[Sutton, Khashayar, McCallum, ICML 2004]

Layered labels over the same input (bottom to top):
English words → Part-of-speech → Noun-phrase boundaries → Named-entity tag

In a pipeline of separately-trained stages, errors cascade: every stage must be near-perfect to do well. Jointly predicting part-of-speech and noun-phrase labels in newswire instead matches the cascade's accuracy with only 50% of the training data.

Inference: tree reparameterization BP [Wainwright et al, 2002]
2. Jointly labeling distant mentions: Skip-chain CRFs
[Sutton, McCallum, SRL 2004]

… Senator Joe Green said today … . Green ran for …

A linear chain ignores the dependency among similar, distant mentions. Adding skip edges between them yields a 14% reduction in error on the most-repeated field in email seminar announcements.

Inference: tree reparameterization BP [Wainwright et al, 2002]
3. Joint co-reference among all pairs: Affinity Matrix CRF
("Entity resolution", "Object correspondence")

Mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." are connected pairwise by Y/N coreference variables with affinities (45, 99, 11 in the figure).

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: correlational clustering graph partitioning
[Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
Coreference Resolution
AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"

Input: a news article with named-entity "mentions" tagged:

Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
#1: Secretary of State Colin Powell, he, Mr. Powell, Powell
#2: Condoleezza Rice, she, Rice
#3: President Bush, Bush
Inside the Traditional Solution

Pairwise affinity metric between Mention (3) ". . . Mr Powell . . ." and Mention (4) ". . . Powell . . ." → Y/N?

Feature                                            Fires?   Weight
Two words in common                                N        29
One word in common                                 Y        13
"Normalized" mentions are string identical         Y        39
Capitalized word in common                         Y        17
> 50% character tri-gram overlap                   Y        19
< 25% character tri-gram overlap                   N        -34
In same sentence                                   Y        9
Within two sentences                               Y        8
Further than 3 sentences apart                     N        -1
"Hobbs Distance" < 3                               Y        11
Number of entities in between two mentions = 0     N        12
Number of entities in between two mentions > 4     N        -3
Font matches                                       Y        1
Default                                            Y        -19

OVERALL SCORE = 98 > threshold = 0
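The traditional decision rule is just a thresholded weighted sum of the fired features. Here is a sketch using a few of the table's features; the weights are copied from the slide, but the feature detectors are simplified stand-ins (e.g., "normalized string identical" is approximated by last-token equality).

```python
# A few of the slide's features as (detector, weight) pairs; detectors simplified.
AFFINITY_FEATURES = [
    (lambda a, b: len(set(a.split()) & set(b.split())) == 1, 13),  # one word in common
    (lambda a, b: a.split()[-1] == b.split()[-1], 39),             # "normalized" identical
    (lambda a, b: any(w[0].isupper()                               # capitalized word
                      for w in set(a.split()) & set(b.split())), 17),
]
DEFAULT = -19  # the table's "Default" weight always fires

def affinity(m1, m2):
    """Overall score = default + sum of weights for features that fire."""
    return DEFAULT + sum(w for test, w in AFFINITY_FEATURES if test(m1, m2))

def coreferent(m1, m2, threshold=0):
    return affinity(m1, m2) > threshold

print(affinity("Mr Powell", "Powell"), coreferent("Mr Powell", "Powell"))
```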
The Problem

. . . Mr Powell . . . ↔ . . . Powell . . . : affinity = 98 → Y
. . . Powell . . . ↔ . . . she . . . : affinity = 11 → Y
. . . Mr Powell . . . ↔ . . . she . . . : affinity = 104 → N

Pair-wise merging decisions are being made independently of each other.

Affinity measures are noisy and imperfect. The decisions should be made in relational dependence with each other.
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML]

Mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with pairwise Y/N variables and edge scores 45, 30, 11.

Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles:

P(y | x) = \frac{1}{Z_x} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)

Example assignments from the build slides: labeling the three edges (Y, N, Y) creates an inconsistent triangle, and the triangle potential drives the score to -infinity; labeling them (Y, N, N) is consistent and scores 64.
Inference in these MRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Mentions: . . . Mr Powell . . . , . . . Powell . . . , . . . Condoleezza Rice . . . , . . . she . . . ; pairwise edge weights in the figure: 45, 106, 30, 134, 11, 10.

\log P(y | x) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \,\text{within partitions}} w_{ij} - \sum_{i,j \,\text{across partitions}} w_{ij}

Different partitionings score differently: the figure's two candidate partitionings score 22 and 314.
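The objective rewards positive weights kept within a partition and penalizes positive weights cut. A greedy agglomerative sketch follows; this is not the approximation algorithm of Bansal et al., and the edge weights below (including the negative one) are illustrative values loosely based on the slide's example, not read off the figure.

```python
def partition_score(partition, weights):
    """Sum of w_ij within partitions minus sum of w_ij across partitions."""
    score = 0.0
    for (i, j), w in weights.items():
        same = any(i in p and j in p for p in partition)
        score += w if same else -w
    return score

def greedy_coref(mentions, weights):
    """Greedily merge the pair of clusters whose merge most improves the score."""
    partition = [{m} for m in mentions]
    improved = True
    while improved and len(partition) > 1:
        improved = False
        base = partition_score(partition, weights)
        best = None
        for a in range(len(partition)):
            for b in range(a + 1, len(partition)):
                trial = [p for k, p in enumerate(partition) if k not in (a, b)]
                trial.append(partition[a] | partition[b])
                gain = partition_score(trial, weights) - base
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, trial)
        if best:
            partition, improved = best[1], True
    return partition

# Illustrative edge weights (log-potentials; the sign encodes affinity).
weights = {("Mr Powell", "Powell"): 45, ("Powell", "she"): 11,
           ("Mr Powell", "she"): -30, ("Condoleezza Rice", "she"): 106}
print(greedy_coref(["Mr Powell", "Powell", "she", "Condoleezza Rice"], weights))
```

On this toy graph the greedy merges stop at {Mr Powell, Powell} and {she, Condoleezza Rice}, since merging everything would pay the negative cross-edges.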
Co-reference Experimental Results
[McCallum & Wellner, 2003]

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories:
Method                     Partition F1   Pair F1
Single-link threshold      16%            18%
Best prev match [Morton]   83%            89%
MRFs                       88%            92%
(error reduction: 30% partition, 28% pair)

DARPA MUC-6 newswire article corpus, 30 stories:
Method                     Partition F1   Pair F1
Single-link threshold      11%            7%
Best prev match [Morton]   70%            76%
MRFs                       74%            80%
(error reduction: 13% partition, 17% pair)
4. Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]

Extraction from and matching of research paper citations, e.g.:

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Model variables: observed citations (o), segmentation (s), citation attributes (c), co-reference decisions (y), database field values, and world knowledge.

35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.

Inference: a variant of Iterated Conditional Modes [Besag, 1986];
see also [Marthi, Milch, Russell, 2003]
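Iterated Conditional Modes simply cycles through the variables, setting each to its highest-scoring value given the current setting of the others. Below is a generic sketch on pairwise coreference variables with a triangle-consistency penalty, not the paper's exact joint segmentation-plus-coreference variant; the mentions, affinities, and penalty value are all invented.

```python
import itertools

mentions = ["Laurel, B.", "Brenda Laurel.", "Smyth, P."]
pairs = list(itertools.combinations(range(len(mentions)), 2))
affinity = {(0, 1): 2.0, (0, 2): -1.5, (1, 2): -1.0}  # invented log-odds
TRIANGLE_PENALTY = -100.0  # discourage inconsistent (Y, Y, N) triangles

def local_score(y, pair):
    """Score of the terms involving this pair: affinity plus triangle terms."""
    i, j = pair
    s = affinity[pair] if y[pair] else 0.0
    for k in range(len(mentions)):
        if k in pair:
            continue
        tri = [y[tuple(sorted(e))] for e in ((i, j), (i, k), (j, k))]
        if sum(tri) == 2:  # exactly two "coreferent" edges: inconsistent
            s += TRIANGLE_PENALTY
    return s

def icm(y):
    """Cycle through variables until no single flip improves the local score."""
    changed = True
    while changed:
        changed = False
        for pair in pairs:
            best = max((local_score({**y, pair: v}, pair), v)
                       for v in (False, True))
            if best[1] != y[pair]:
                y[pair] = best[1]
                changed = True
    return y

print(icm({p: False for p in pairs}))
```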
4. Joint segmentation and co-reference: Joint IE and Coreference from Research Paper Citations

Textual citation mentions (noisy, with duplicates) → paper database with fields, clean, duplicates collapsed:

AUTHORS              TITLE       VENUE
Cowell, Dawid…       Probab…     Springer
Montemerlo, Thrun…   FastSLAM…   AAAI…
Kjaerulff            Approxi…    Technic…
Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

1) Segment citation fields.
2) Resolve coreferent citations (pairwise Y / ? / N decisions).
3) Form the canonical database record, resolving conflicts:

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform all three steps jointly.
IE + Coreference Model

– A CRF segmentation s labels each observed citation x, e.g. "J Besag 1986 On the…" → AUT AUT YR TITL TITL …
– Citation mention attributes c are derived from the segmentation: AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…"
– There is one such structure for each citation mention, e.g. "Smyth . 2001 Data Mining…"
– Binary coreference variables (y / n) connect each pair of mentions.
– Research paper entity attribute nodes aggregate the coreferent mentions: AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…", ...
Outline
• Examples of IE and Data Mining. ✓
• Conditional Random Fields and Feature Induction. ✓
• Joint inference: Motivation and examples ✓
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Two example projects
– Email, contact address book, and Social Network Analysis
– Research Paper search and analysis
Managing and Understanding Connections of People in our Email World

Workplace effectiveness ~ ability to leverage network of acquaintances.
But filling a Contacts DB by hand is tedious, and incomplete.

Email Inbox → (automatically) → Contacts DB, enriched from the WWW
System Overview

Email → Person Name Extraction (CRF) → Name Coreference → names → Homepage Retrieval (WWW) → Contact Info and Person Name Extraction → Keyword Extraction → Social Network Analysis
An Example

To: "Andrew McCallum" [email protected]
Subject: ...

(Search for new people.)

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor's Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network, …
Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuiness    Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Summary of Results

Contact info and name extraction performance (25 fields):
Method   Token Acc   Field Prec   Field Recall   Field F1
CRF      94.50       85.73        76.33          80.76

Expert Finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)

Social Network Analysis: understand the social structure of your organization. Suggest structural changes for improved efficiency.
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]

Example topics induced from a large collection of text [Tenenbaum et al]:

JOB: WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
SCIENCE: STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
BALL: GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
FIELD: MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
STORY: STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
MIND: WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
DISEASE: BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
WATER: FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
From LDA to the Author-Recipient-Topic (ART) Model

Inference and estimation via Gibbs sampling:
- Easy to implement
- Reasonably fast
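As a flavor of the Gibbs sampling used here, below is a minimal collapsed Gibbs sampler for plain LDA (the slide's starting point), not for the full ART model; the toy corpus, number of topics, and hyperparameters are invented.

```python
import random
from collections import defaultdict

random.seed(0)
docs = [["code", "mallet", "java", "bug"], ["dinner", "weekend", "visit", "house"],
        ["java", "code", "cvs", "release"], ["kids", "dinner", "sunday", "house"]]
K, ALPHA, BETA = 2, 0.5, 0.1  # topics and Dirichlet hyperparameters (invented)
vocab = sorted({w for d in docs for w in d})

# z[d][i]: topic of word i in doc d; counts for the collapsed distributions.
ndk = defaultdict(int)  # (doc, topic) counts
nkw = defaultdict(int)  # (topic, word) counts
nk = defaultdict(int)   # topic counts
z = []
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[d].append(t)
        ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]  # remove this word's current assignment
            ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # P(topic | rest) proportional to (n_dk + a)(n_kw + b)/(n_k + |V|b)
            p = [(ndk[d, k] + ALPHA) * (nkw[k, w] + BETA)
                 / (nk[k] + len(vocab) * BETA) for k in range(K)]
            t = random.choices(range(K), weights=p)[0]
            z[d][i] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

for k in range(K):
    print("topic", k, sorted(vocab, key=lambda w: -nkw[k, w])[:4])
```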
Enron Email Corpus
• 250k email messages
• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an
electronic copy of our final draft? Are you OK with this? If
so, the only version I have is the original draft without
revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
[email protected]
Topics, and prominent sender/receivers discovered by ART:

Beck = "Chief Operations Officer"
Dasovich = "Government Relations Executive"
Shapiro = "Vice President of Regulatory Affairs"
Steffes = "Vice President of Government Affairs"
Comparing Role Discovery

Connection strength (A, B) is computed from:
- Traditional SNA: the distribution over recipients
- ART: the distribution over authored topics
- Author-Topic: the distribution over authored topics
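The three methods differ only in which per-person distribution is compared. A sketch measuring connection strength as negative Jensen-Shannon divergence between two such distributions; the toy distributions are invented for illustration.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, finite divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Traditional SNA compares distributions over *recipients*;
# ART / Author-Topic compare distributions over *authored topics*.
recipients_a = [0.7, 0.2, 0.1]   # invented: who A writes to
recipients_b = [0.6, 0.3, 0.1]
topics_a = [0.9, 0.05, 0.05]     # invented: what A writes about
topics_b = [0.1, 0.1, 0.8]

print("SNA connection strength:", -js(recipients_a, recipients_b))  # similar recipients
print("ART connection strength:", -js(topics_a, topics_b))          # different topics
```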
Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
(Geaconne = "Secretary", McCarty = "Vice President")
- Traditional SNA: similar roles
- ART: different roles
- Author-Topic: different roles
Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
(Geaconne = "Secretary", Hayslett = "Vice President & CTO")
- Traditional SNA: different roles
- ART: not very similar
- Author-Topic: very similar
Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
(Blair = "Gas pipeline logistics", Watson = "Pipeline facilities planning")
- Traditional SNA: different roles
- ART: very similar
- Author-Topic: very different
Comparing Group Discovery: Enron TransWestern Division
- Traditional SNA: block structured
- ART: not
- Author-Topic: not
McCallum Email Corpus 2004
• January - October 2004
• 23k email messages
• 825 people
From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]
There is pertinent stuff on the first yellow folder that is
completed either travel or other things, so please sign that
first folder anyway. Then, here is the reminder of the things
I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
McCallum Email Blockstructure
Four most prominent topics
in discussions with ____?
Two most prominent topics
in discussions with ____?
Word       Prob
love       0.030514
house      0.015402
time       0.013659
great      0.012351
hope       0.011334
dinner     0.011043
saturday   0.00959
left       0.009154
ll         0.009154
visit      0.009009
evening    0.008282
stay       0.008137
bring      0.008137
weekend    0.007847
road       0.007701
sunday     0.007411
kids       0.00712
flight     0.006829
…          0.006539
…          0.006539

Word       Prob
today      0.051152
tomorrow   0.045393
time       0.041289
ll         0.039145
meeting    0.033877
week       0.025484
talk       0.024626
meet       0.023279
morning    0.022789
monday     0.020767
back       0.019358
call       0.016418
free       0.015621
home       0.013967
won        0.013783
day        0.01311
hope       0.012987
leave      0.012987
office     0.012742
tuesday    0.012558
Topic 37

Word       Prob
jean       0.032349
james      0.02684
lab        0.017229
space      0.016292
ciir       0.015471
students   0.015354
bruce      0.014885
office     0.014768
staff      0.013244
funding    0.012072
mike       0.011017
adam       0.010666
move       0.010666
grad       0.01008
ll         0.009962
rod        0.009494
time       0.008556
asked      0.008556
student    0.008556
today      0.008322

Sender             Recipient          Prob
jean               mccallum           0.111932
mccallum           jean               0.072785
jean               jean               0.056024
mccallum           allan              0.054266
jean               allan              0.035162
allan              mccallum           0.027778
mccallum           croft              0.025668
mccallum           grupen             0.021917
mccallum           stowell            0.01887
mccallum           [email protected]    …
grupen             mccallum           0.018167
gwking             mccallum           0.016057
[email protected]    mccallum           0.014299
mccallum           saunders           0.014299
mccallum           jensen             0.007267
mccallum           culotta            0.006915
gauthier           mccallum           0.006564
grupen             grupen             0.006446
corrada            mccallum           0.006212
mccallum           pal                0.005626
Topic 40

Word       Prob
code       0.060565
mallet     0.042015
files      0.029115
al         0.024201
file       0.023587
version    0.022113
java       0.021499
test       0.020025
problem    0.018305
run        0.015356
cvs        0.013391
add        0.012776
directory  0.012408
release    0.012285
output     0.011916
bug        0.011179
source     0.010197
ps         0.009705
log        0.008968
created    0.0086

Sender             Recipient          Prob
hough              mccallum           0.067076
mccallum           hough              0.041032
mikem              mccallum           0.028501
culotta            mccallum           0.026044
saunders           mccallum           0.021376
mccallum           saunders           0.019656
pereira            mccallum           0.017813
casutton           mccallum           0.017199
mccallum           ronb               0.013514
mccallum           pereira            0.013145
hough              melinda.gervasio   0.013022
mccallum           casutton           0.013022
fuchun             mccallum           0.010811
mccallum           culotta            0.009828
ronb               mccallum           0.009705
westy              hough              0.009214
xuerui             corrada            0.0086
ghuang             mccallum           0.008354
khash              mccallum           0.008231
melinda.gervasio   mccallum           0.008108
Pairs with highest
rank difference between ART & SNA
5 other professors
3 other ML researchers
Role-Author-Recipient-Topic Models

Main Application Project: mining the research literature, with entities (Research Paper, Person, Grant, Venue, University, Groups) and relations (Cites, Expertise).
Summary

• Conditional Random Fields combine the benefits of
– Conditional probability models (arbitrary features)
– Markov models (for sequences or other relations)
• Success in
– Factorial finite state models
– Jointly labeling distant entities
– Coreference analysis
– Segmentation uncertainty aiding coreference
• Current projects
– Email, contact management, expert-finding, SNA
– Mining the scientific literature
End of Talk