Automatically Building Special Purpose Search Engines

Toward Unified Graphical Models of Information Extraction and Data Mining

Andrew McCallum

Computer Science Department University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh, Wei Li

Goal:

Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings


Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather

Department of Terrestrial Systems, Chinese Academy of Sciences. 200k+ documents, several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries.

IE from Research Papers

[McCallum et al ‘99]

IE from Research Papers

Mining Research Papers

[Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]


[Giles et al]

What is “Information Extraction”?

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted entities and relations: Microsoft Corporation (CEO: Bill Gates), Microsoft (VP: Bill Veghte), Free Software Foundation (founder: Richard Stallman).


Larger Context

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support)

Outline

• Brief review of Conditional Random Fields
• Joint inference: Motivation and examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Efficiently training large, multi-component models: Piecewise Training

Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

Finite state model / graphical model: states S_{t-1}, S_t, S_{t+1}, … linked by transitions, each emitting an observation O_t.

Generates: a state sequence and an observation sequence, with joint probability

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Parameters, for all states S = {s_1, s_2, …}:
• Start state probabilities: P(s_1)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t)

Training: emissions are usually a multinomial over an atomic, fixed alphabet; maximize the probability of the training observations (with a prior).

IE with Hidden Markov Models

Given a sequence of observations: Yesterday Rich Caruana spoke this example sentence.

and a trained HMM:

person name location name background

Find the most likely state sequence: (Viterbi) Yesterday Rich Caruana spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name: Person name: Rich Caruana
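The Viterbi decoding step described above can be sketched in a few lines. This is a minimal illustration with made-up toy parameters, not the model from the talk; state names and probabilities are invented for the example.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence under an HMM."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 1e-12)
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

With a toy two-state model in which a “person” state strongly emits the name tokens, this recovers the person-name span in “Yesterday Rich Caruana spoke”.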

We Want More than an Atomic View of Words

Would like a richer representation of text: many arbitrary, overlapping features of the words:
• identity of word
• word ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is “Wisniewski”
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”



From HMMs to Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

Let s = s_1, s_2, …, s_n be the state sequence and o = o_1, o_2, …, o_n the observation sequence.

Joint (HMM):

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Conditional:

P(s | o) = (1 / P(o)) ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
         = (1 / Z(o)) ∏_{t=1}^{|o|} Φ_s(s_t, s_{t-1}) Φ_o(o_t, s_t)

where Φ_o(t) = exp( Σ_k λ_k f_k(s_t, o_t) )

Set parameters by maximum likelihood, using an optimization method on the (log-)likelihood L.

(A super-special case of Conditional Random Fields.)



Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

1. FSM special case: linear chain among unknowns S_t, S_{t+1}, …, parameters tied across time steps, conditioned on the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}:

P(s | o) = (1 / Z(o)) ∏_{t=1}^{|o|} exp( Σ_k λ_k f_k(s_t, s_{t-1}, o, t) )

2. In general: CRFs = "conditionally-trained Markov network," arbitrary structure among unknowns.

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").

Linear-chain CRFs vs. HMMs

• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; can require less training data
• Easy to incorporate domain knowledge
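The “arbitrary, overlapping features” point can be made concrete with a small sketch. The feature names, weights, and the helper functions below are invented for illustration; they are in the spirit of the slides, not the talk's actual feature set.

```python
import math

def word_features(obs, t):
    """Arbitrary, overlapping features of the observations at position t
    (hypothetical examples in the spirit of the slides)."""
    w = obs[t]
    feats = {
        "word=" + w.lower(): 1.0,
        "is_capitalized": float(w[:1].isupper()),
        "ends_in_ski": float(w.endswith("ski")),
    }
    if t + 2 < len(obs):
        feats["next_two=and_Associates"] = float(
            obs[t + 1] == "and" and obs[t + 2] == "Associates")
    return feats

def potential(weights, obs, t, s_prev, s_cur):
    """Clique potential exp( sum_k lambda_k f_k(s_{t-1}, s_t, o, t) )."""
    score = weights.get(("trans", s_prev, s_cur), 0.0)
    for name, val in word_features(obs, t).items():
        score += weights.get((s_cur, name), 0.0) * val
    return math.exp(score)
```

Note the features can look arbitrarily far ahead in the observation sequence (e.g. the “next two words” test), which an HMM's generative emission model does not allow.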

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 ------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------ Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :----------------- : : Milk : Milkfat : Milk Produced : Milk : Milkfat ------------------------------------------------------------------------------- : 1,000 Head --- Pounds -- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 ------------------------------------------------------------------------------- 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.

Table Extraction from Government Reports

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 ------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ ------------------------------------------------------ Year : of : Per Milk Cow : Percentage : Total -------------------: of Fat in All :----------------- : : Milk : Milkfat : Milk Produced : Milk : Milkfat ------------------------------------------------------------------------------- Pounds -- Percent Million Pounds CRF Labels:

CRF labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• …

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with previous line
• …
• Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}
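Per-line features like these are simple to compute. The sketch below uses invented names, and the offset helper is a simplified stand-in for the slide's feature conjunctions; it is not the paper's implementation.

```python
def line_features(line):
    """Per-line features like those on the slide (names are made up here)."""
    n = max(len(line), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "gap5": "     " in line,  # contains 5+ consecutive spaces
    }

def offset_features(lines, t, offsets=((0, 0), (-1, 0), (0, 1), (1, 2))):
    """Features of neighboring lines at the slide's time offsets,
    tagged with their offset (a simplified stand-in for conjunctions)."""
    feats = {}
    for lo, hi in offsets:
        for d in range(lo, hi + 1):
            if 0 <= t + d < len(lines):
                for k, v in line_features(lines[t + d]).items():
                    feats["%s@%d" % (k, d)] = v
    return feats
```

On a data row from the milk table, the digit percentage is high and the 5+-space gap feature fires, which is exactly what separates table rows from running prose.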

Table Extraction Experimental Results

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method             Line labels (% correct)   Table segments (F1)
HMM                65%                       64%
Stateless MaxEnt   85%                       –
CRF                95%                       92%

IE from Research Papers

[McCallum et al ‘99]

IE from Research Papers

Field-level F1:

Method                                                             F1
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]   75.6
Support Vector Machines (SVMs) [Han, Giles, et al, 2003]           89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]            93.9

(CRFs: 40% error reduction over SVMs.)

Named Entity Recognition

CRICKET - MILLNS SIGNS FOR BOLAND. CAPE TOWN 1996-08-22. South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one-year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels: PER, ORG, LOC, MISC
Examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally

Named Entity Extraction Results

[McCallum & Li, 2003, CoNLL]

Method                                                  F1
HMMs: BBN's IdentiFinder                                73%
CRFs w/out Feature Induction                            83%
CRFs with Feature Induction (based on likelihood gain)  90%

Outline

• Joint inference: Motivation and examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Efficiently training large, multi-component models: Piecewise Training

Larger Context

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support)

Problem:

Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (discover patterns: entity types, links/relations, events) → Actionable knowledge

Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.

1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

2) IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution:

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining → Actionable knowledge (prediction, outlier detection, decision support), with uncertainty info passed from IE into the database, and emerging patterns fed back from data mining to IE.

Solution:

Spider → Filter → Unified Model: a single probabilistic model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (discover patterns: entity types, links/relations, events).

Discriminatively-trained undirected graphical models: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…].

Complex inference and learning: just what we researchers like to sink our teeth into!

→ Actionable knowledge: prediction, outlier detection, decision support.

Larger-scale Joint Inference for IE

• What model structures will capture salient dependencies?

• Will joint inference improve accuracy?

• How to do inference in these large graphical models?

• How to efficiently train these models, which are built from multiple large components?

1. Jointly labeling cascaded sequences Factorial CRFs

[Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words


1. Jointly labeling cascaded sequences Factorial CRFs

[Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words But errors cascade--must be perfect at every stage to do well.

1. Jointly labeling cascaded sequences Factorial CRFs

[Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.

Inference: Tree reparameterization BP [Wainwright et al, 2002]

2. Jointly labeling distant mentions Skip-chain CRFs

[Sutton, McCallum, SRL 2004]

Senator Joe Green said today … . Green ran for … The dependency among similar, distant mentions is ignored.

2. Jointly labeling distant mentions Skip-chain CRFs

[Sutton, McCallum, SRL 2004]

Senator Joe Green said today … . Green ran for … 14% reduction in error on most repeated field in email seminar announcements.

Inference: Tree reparameterization BP [Wainwright et al, 2002]

3. Joint co-reference among all pairs Affinity Matrix CRF

“Entity resolution,” “object correspondence”: pairwise Y/N coreference variables among mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .”, with affinities (45, 99, 11).

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002] [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution

AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"

Input: news article, with named-entity “mentions” tagged: Today Secretary of State Colin Powell met with … he … Condoleezza Rice … Mr Powell … she … Powell … President Bush … Rice … Bush …

Output: number of entities, N = 3
#1 Secretary of State Colin Powell; he; Mr. Powell; Powell
#2 Condoleezza Rice; she; Rice
#3 President Bush; Bush

Inside the Traditional Solution

Pair-wise Affinity Metric: Mention (3) “. . . Mr Powell . . .” vs. Mention (4) “. . . Powell . . .” → Y/N?

Feature (weight):
• Two words in common (29)
• One word in common (13)
• "Normalized" mentions are string identical (39)
• Capitalized word in common (17)
• > 50% character tri-gram overlap (19)
• < 25% character tri-gram overlap (−34)
• In same sentence (9)
• Within two sentences (8)
• Further than 3 sentences apart (−1)
• "Hobbs Distance" < 3 (11)
• Number of entities in between two mentions = 0 (12)
• Number of entities in between two mentions > 4 (−3)
• Font matches (1)
• Default (−19)

OVERALL SCORE = 98 > threshold = 0
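The affinity metric is just a weighted sum of fired features compared against a threshold. A minimal sketch, using the slide's weights (the abbreviated feature names are ours):

```python
# Weights from the slide's affinity metric (feature names abbreviated here)
WEIGHTS = {
    "two_words_in_common": 29,
    "one_word_in_common": 13,
    "normalized_string_identical": 39,
    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50": 19,
    "trigram_overlap_lt_25": -34,
    "same_sentence": 9,
    "within_two_sentences": 8,
    "more_than_3_sentences_apart": -1,
    "hobbs_distance_lt_3": 11,
    "entities_between_eq_0": 12,
    "entities_between_gt_4": -3,
    "font_matches": 1,
}

def affinity(active_features, weights=WEIGHTS):
    """Pair-wise affinity: weighted sum of the features that fire for a pair."""
    return sum(weights[f] for f in active_features)

def merge(active_features, threshold=0):
    """Merge the pair if the affinity clears the threshold (0 on the slide)."""
    return affinity(active_features) > threshold
```

For example, a pair that shares two words, is string-identical after normalization, shares a capitalized word, and sits in the same sentence scores 29 + 39 + 17 + 9 = 94 and is merged.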

The Problem

Pairwise decisions among “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .”: Y (affinity = 98), N (affinity = 104), Y (affinity = 11).

Affinity measures are noisy and imperfect.

Pair-wise merging decisions are being made independently from each other. They should be made in relational dependence with each other.

A Generative Model Solution

[Russell 2001], [Pasula et al 2002], [Milch et al 2003], [Marthi et al 2003] (Applied to citation matching, and object correspondence in vision)

Model variables: N, id, context, words, surname, distance, fonts, gender, age, …



A Markov Random Field for Co-reference

(MRF)

[McCallum & Wellner, 2003, ICML]

Mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .”, with pairwise Y/N variables and edge weights (45, −30, 11).

Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.

P(y | x) = (1 / Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + λ′ Σ_{i,j,k} f′(y_ij, y_jk, y_ik) )






A Markov Random Field for Co-reference

(MRF)

[McCallum & Wellner, 2003]

Assignment: Mr Powell ↔ Powell: N (45); Mr Powell ↔ she: N (−30); Powell ↔ she: Y (11). Objective: −45 + 30 + 11 = −4.

P(y | x) = (1 / Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + λ′ Σ_{i,j,k} f′(y_ij, y_jk, y_ik) )



A Markov Random Field for Co-reference

(MRF)

[McCallum & Wellner, 2003]

Assignment: Mr Powell ↔ Powell: Y (45); Mr Powell ↔ she: N (−30); Powell ↔ she: Y (11). The triangle is inconsistent, so the triangle term drives the objective to −∞.

P(y | x) = (1 / Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + λ′ Σ_{i,j,k} f′(y_ij, y_jk, y_ik) )



A Markov Random Field for Co-reference

(MRF)

[McCallum & Wellner, 2003]

Assignment: Mr Powell ↔ Powell: Y (45); Mr Powell ↔ she: N (−30); Powell ↔ she: N (11). Objective: 45 + 30 − 11 = 64.

P(y | x) = (1 / Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + λ′ Σ_{i,j,k} f′(y_ij, y_jk, y_ik) )



Inference in these MRFs = Graph Partitioning

[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Mentions: Mr Powell, Powell, Condoleezza Rice, she. Edge weights: Mr Powell ↔ Powell: 45; Mr Powell ↔ she: −30; Powell ↔ she: 11; Mr Powell ↔ Condoleezza Rice: −106; Powell ↔ Rice: −134; Rice ↔ she: 10.

log P(y | x) ∝ Σ_l Σ_{i,j} λ_l f_l(x_i, x_j, y_ij) = Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij
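The partition objective (within-partition weights minus across-partition weights) is easy to evaluate directly. A minimal sketch; the edge signs below are assumptions chosen so that the slide's example total of 314 is reproduced, not values stated explicitly in the transcript:

```python
def partition_score(edge_weights, clusters):
    """log P(y|x) up to a constant: sum of edge weights within partitions
    minus sum of edge weights across partitions."""
    label = {m: i for i, cluster in enumerate(clusters) for m in cluster}
    score = 0
    for (a, b), w in edge_weights.items():
        score += w if label[a] == label[b] else -w
    return score

# Edge weights as on the slide (signs assumed)
w = {("Mr Powell", "Powell"): 45, ("Mr Powell", "she"): -30,
     ("Powell", "she"): 11, ("Mr Powell", "Rice"): -106,
     ("Powell", "Rice"): -134, ("Rice", "she"): 10}
```

Under these assumptions, `partition_score(w, [{"Mr Powell", "Powell"}, {"Rice", "she"}])` gives 314, matching the slide's best partitioning; finding the maximizing partitioning is the correlational-clustering problem.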



Inference in these MRFs = Graph Partitioning

[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Same mentions (Mr Powell, Powell, Condoleezza Rice, she) and edge weights (45, −106, −30, −134, 11, 10). For the candidate partitioning shown on the slide:

log P(y | x) ∝ Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij = 22



Inference in these MRFs = Graph Partitioning

[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Same mentions (Mr Powell, Powell, Condoleezza Rice, she) and edge weights (45, −106, −30, −134, 11, 10). For the partitioning {Mr Powell, Powell} and {Condoleezza Rice, she}:

log P(y | x) ∝ Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w′_ij = 314

Co-reference Experimental Results

[McCallum & Wellner, 2003]

Proper noun co-reference.

DARPA ACE broadcast news transcripts, 117 stories:

Method                     Partition F1   Pair F1
Single-link threshold      16%            18%
Best prev match [Morton]   83%            89%
MRFs                       88%            92%

(MRFs: 30% error reduction in Partition F1, 28% in Pair F1.)

DARPA MUC-6 newswire article corpus, 30 stories:

Method                     Partition F1   Pair F1
Single-link threshold      11%            7%
Best prev match [Morton]   70%            76%
MRFs                       74%            80%

(MRFs: 13% error reduction in Partition F1, 17% in Pair F1.)


4. Joint segmentation and co-reference

Extraction from and matching of research paper citations.

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Model variables: s = segmentation, c = citation attributes, y = co-reference decisions, o = observed citations; plus database field values and world knowledge.

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.

Inference: Variant of Iterated Conditional Modes [Besag, 1986] [Wellner, McCallum, Peng, Hay, UAI 2004] see also [Marthi, Milch, Russell, 2003]

4. Joint segmentation and co-reference

Joint IE and Coreference from Research Paper Citations: textual citation mentions (noisy, with duplicates) → paper database, with fields, clean, duplicates collapsed:

AUTHORS              TITLE       VENUE
Cowell, Dawid…       Probab…     Springer
Montemerlo, Thrun…   FastSLAM…   AAAI…
Kjaerulff            Approxi…    Technic…


Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

Y ?

N

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields
2) Resolve coreferent citations

Y ?

N

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

1) Segment citation fields
2) Resolve coreferent citations
3) Form canonical database record, resolving conflicts

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform 1) segmentation, 2) coreference resolution, and 3) canonical record formation jointly.

IE + Coreference Model

CRF Segmentation Observed citation s

AUT AUT YR TITL TITL

x

J Besag 1986 On the…

IE + Coreference Model

Citation mention attributes CRF Segmentation Observed citation

AUTHOR = “J Besag” YEAR = “1986” TITLE = “On the…”

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Structure for each citation mention

Smyth . 2001 Data Mining…

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Binary coreference variables for each pair of mentions

Smyth . 2001 Data Mining…

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Binary coreference variables for each pair of mentions

y n n Smyth . 2001 Data Mining…

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining… AUTHOR = “P Smyth” YEAR = “2001” TITLE = “Data Mining…” ...

Research paper entity attribute nodes

y n n Smyth . 2001 Data Mining…

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Research paper entity attribute node

y y y Smyth . 2001 Data Mining…

c s x

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining… Smyth . 2001 Data Mining… y n n

c s x

J Besag 1986 On the…

Such a highly connected graph makes exact inference intractable, so…

Approximate Inference 1

• Loopy Belief Propagation: messages m_i(v_j) passed between nodes v_1, …, v_6 of the loopy graph.

Approximate Inference 1

m 1

(

v 2

) • Loopy Belief Propagation

v 1 m 2

(

v 1

)

v 2 m 2

(

v 3

)

m 3

(

v 2

)

v 3

messages passed between

nodes

v 4 v 5 v 6

• Generalized Belief Propagation

v 1 v 2 v 3 v 4 v 5 v 6

messages passed between

regions

v 7 v 8 v 9

Here, a

message

is a conditional probability table passed among nodes

.

But when

message size grows exponentially

with size of overlap between regions!

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]: visit each variable in turn and set it to its most probable value given all the others:

v_6^{(i+1)} = argmax_{v_6} P(v_6 | v \ v_6), all other variables held constant

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986], continued: the same update is applied in turn to each variable:

v_5^{(j+1)} = argmax_{v_5} P(v_5 | v \ v_5), all others held constant
v_4^{(k+1)} = argmax_{v_4} P(v_4 | v \ v_4), all others held constant

Structured inference scales well here, but it is greedy and easily falls into local minima.
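The ICM sweep is a few lines of code. A minimal sketch (the variable names and the toy scoring function in the example are invented; `score` returns an unnormalized conditional score):

```python
def icm(domains, score, init, max_sweeps=100):
    """Iterated Conditional Modes [Besag 1986]: repeatedly set each variable
    to the value maximizing its conditional score, all others held constant."""
    state = dict(init)
    for _ in range(max_sweeps):
        changed = False
        for v, domain in domains.items():
            best = max(domain, key=lambda val: score(v, val, state))
            if best != state[v]:
                state[v] = best
                changed = True
        if not changed:  # no variable moved: a local optimum
            break
    return state
```

Because each update only ever increases the objective given the current neighbors, the sweep converges quickly, but, as the slide notes, only to a local optimum of whatever basin the initialization falls in.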

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]: v_4^{(k+1)} = argmax_{v_4} P(v_4 | v \ v_4)

• Iterated Conditional Sampling (ICS) (our name): instead of selecting only the argmax, keep a sample of high-scoring values of P(v_4 | v \ v_4), e.g. an N-best list (the top N values), with all other variables held constant.

Can use a “generalized version” of this, doing exact inference on a region of several nodes at once. Here, a “message” grows only linearly with overlap region size and N!

Features of this Inference Method

1) Structured or “factored” representation (à la GBP)
2) Uses samples to approximate density
3) Closed-loop message-passing on loopy graph (à la BP)

Related Work

• Beam search: “forward”-only inference
• Particle filtering, e.g. [Doucet 1998]: usually on tree-shaped graph, or “feedforward” only
• MC Sampling / Embedded HMMs [Neal, 2003]: sample from high-dimensional continuous state space; do forward-backward
• Sample Propagation [Paskin, 2003]: messages = samples, on a junction tree
• Fields to Trees [Hamze & de Freitas, UAI 2003]: Rao-Blackwellized MCMC, partitioning G into non-overlapping trees
• Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002]: combination of particle filtering and Boyen-Koller for DBNs

IE + Coreference Model

Smyth , P Data mining…

Exact inference on these linear-chain regions From each chain pass an N-best List into coreference

Smyth . 2001 Data Mining… J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Approximate inference by graph partitioning… …integrating out uncertainty in samples of extraction

Smyth . 2001 Data Mining…

Make scale to 1M citations with

Canopies

[McCallum, Nigam, Ungar 2000]

J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Exact (exhaustive) inference over entity attributes

y n n Smyth . 2001 Data Mining… J Besag 1986 On the…

IE + Coreference Model

Smyth , P Data mining…

Revisit exact inference on IE linear chain, now conditioned on entity attributes

y n n Smyth . 2001 Data Mining… J Besag 1986 On the…

Parameter Estimation

Separately for different regions

- IE linear-chain: exact MAP
- Coref graph edge weights: MAP on individual edges
- Entity attribute potentials: MAP, pseudo-likelihood

In all cases: Climb MAP gradient with quasi-Newton method

Experimental Results

• Set of citations from CiteSeer:
  – 1500 citation mentions, to 900 paper entities
• Hand-labeled for coreference and field-extraction
• Divided into 4 subsets, each on a different topic:
  – RL, face detection, reasoning, constraint satisfaction
  – Within each subset many citations share authors, publication venues, publishers, etc.
• 70% of the citation mentions are singletons

Coreference Results

Coreference cluster recall:

N             Reinforce   Face   Reason   Constraint
1 (Baseline)  0.946       0.96   0.94     0.96
3             0.95        0.98   0.96     0.96
7             0.95        0.98   0.95     0.97
9             0.982       0.97   0.96     0.97
Optimal       0.99        0.99   0.99     0.99

• Average error reduction is 35%.

• “Optimal” makes best use of N-best list by using true labels.

• Indicates that even more improvement can be obtained.

Information Extraction Results

Segmentation F1:

             Baseline   w/ Coref   Err. Reduc.   P-value
Reinforce    .943       .949       .101          .0442
Face         .908       .914       .062          .0014
Reason       .929       .935       .090          .0001
Constraint   .934       .943       .142          .0001

• Error reduction ranges from 6-14%.

• Small, but significant at 95% confidence level (p-value < 0.05)

Biggest limiting factor in both sets of results: data set is small, and does not have large coreferent sets.


Outline

– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Iterated Conditional Samples)

• Efficiently training large, multi-component models: Piecewise Training

Piecewise Training

Piecewise Training with NOTA
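The core piecewise idea can be sketched briefly: each factor ("piece") of the model is trained on its own locally-normalized likelihood, ignoring the rest of the graph, and the learned factors are combined at test time. The two-label transition factor, toy data, and plain gradient ascent below are hypothetical stand-ins for the real features and the quasi-Newton optimizer.

```python
import math

LABELS = ["B", "I"]  # hypothetical tag set for illustration

def factor_log_lik(w, data):
    """Local log-likelihood of one transition factor; w[(a, b)] is the
    score of label pair (a, b), normalized over this piece alone."""
    ll = 0.0
    for a, b in data:
        z = sum(math.exp(w[(p, q)]) for p in LABELS for q in LABELS)
        ll += w[(a, b)] - math.log(z)
    return ll

def train_piece(data, steps=200, lr=0.1):
    """Gradient ascent on the piece's local objective: gradient is
    empirical counts minus the piece's expected counts."""
    w = {(p, q): 0.0 for p in LABELS for q in LABELS}
    for _ in range(steps):
        grad = {k: 0.0 for k in w}
        for a, b in data:
            z = sum(math.exp(w[k]) for k in w)
            for k in w:
                grad[k] -= math.exp(w[k]) / z  # expected count
            grad[(a, b)] += 1.0                # empirical count
        for k in w:
            w[k] += lr * grad[k] / len(data)
    return w

data = [("B", "I")] * 8 + [("I", "I")] * 2
w = train_piece(data)
print(max(w, key=w.get))  # the most frequent transition gets the top weight
```

Because each piece's partition function ranges only over that factor's own variables, training is far cheaper than the global normalization a full CRF requires, which is the source of the speedups in the tables below.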

Experimental Results

Named entity tagging (CoNLL-2003)
Training set = 15k newswire sentences, 9 labels

Model    Test F1   Training time
MEMM     88.90     1 hour
CRF      89.87     9 hours
CRF-PT   90.50     5.3 hours

CRF-PT improvement is statistically significant at p = 0.001.

Experimental Results 2

Part-of-speech tagging (Penn Treebank, small subset)
Training set = 1154 newswire sentences, 45 labels

Model    Test F1   Training time
MEMM     88.1      2 hours
CRF      88.1      14 hours
CRF-PT   88.8      2.5 hours

CRF-PT improvement is statistically significant at p = 0.001.

“Parameter Independence Diagrams”

Graphical models = formalism for representing independence assumptions among variables.

Here we represent independence assumptions among parameters (in a factor graph).

“Parameter Independence Diagrams”

Train some in pieces, some globally

Piecewise Training Research Questions

• How to select the boundaries of “pieces”?

• What choices of limited interaction are best?

• How to sample sparse subsets of NOTA instances?

• Application to simpler models (classifiers)
• Application to more complex models (parsing)

Main Application Project:


Main Application Project:

Cites Research Paper

Main Application Project:

Grant Cites Research Paper Person Expertise Venue University Groups

Summary

• Conditional Random Fields
  – Conditional probability models of structured data
• Data mining complex unstructured text suggests the need for joint inference: IE -> DM
• Early examples
  – Factorial finite-state models
  – Jointly labeling distant entities
  – Coreference analysis
  – Segmentation uncertainty aiding coreference
• Piecewise Training
  – Faster + higher accuracy by making independence assumptions

End of Talk

4. Joint segmentation and co-reference


[Wellner, McCallum, Peng, Hay, UAI 2004] Extraction from and matching of research paper citations.

Laurel, B Design Wesley, . Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface , B. Laurel (ed), Addison 1990 .


World Knowledge

Co-reference decisions

Brenda Laurel . Interface Agents: Metaphors with Character , in Laurel, The Art of Human-Computer Interface Design , 355-366, 1990 .

Diagram variables: y = database field values, c = citation attributes, s = segmentation, o = observations.

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.

Inference: Variant of Iterated Conditional Modes [Besag, 1986]
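A generic Iterated Conditional Modes loop is short enough to sketch: visit each variable in turn, set it to the value maximizing its local score given the current values of its neighbors, and repeat until nothing changes. The chain, potentials, and labels below are hypothetical, chosen only to show the mechanics, not the variant actually used in the paper.

```python
def icm(y, labels, neighbors, node_pot, edge_pot, max_sweeps=20):
    """Coordinate-ascent MAP inference [Besag, 1986]."""
    y = list(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            def score(lab):
                return node_pot(i, lab) + sum(
                    edge_pot(lab, y[j]) for j in neighbors[i])
            best = max(labels, key=score)
            if best != y[i]:
                y[i], changed = best, True
        if not changed:  # reached a local optimum
            break
    return y

# 4-node chain: node potentials pull toward noisy observed labels,
# edge potentials pull neighbors toward agreement.
obs = ["a", "b", "a", "a"]
node_pot = lambda i, lab: 1.0 if lab == obs[i] else 0.0
edge_pot = lambda a, b: 0.8 if a == b else 0.0
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(icm(["a", "a", "a", "a"], ["a", "b"], nbrs, node_pot, edge_pot))
```

Each update only ever increases the global score, so ICM converges quickly, though only to a local optimum; that is the price paid for avoiding exact joint inference over segmentation and coreference.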

Label Bias Problem in Conditional Sequence Models

• Example (after Bottou '91): a finite-state machine with two branches from the start state, r→o→b ("rob") and r→i→b ("rib").

• Bias toward states with few “siblings”.

• Per-state normalization in MEMMs does not allow “probability mass” to transfer from one branch to the other.
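The effect can be reproduced numerically on the rob/rib example. Below, a state's score is 2.0 when its letter matches the observation and 0.0 otherwise (hypothetical weights): CRF-style inference normalizes once over whole-path scores, while MEMM-style inference normalizes at each state, where a state with a single successor assigns it probability 1 regardless of the observation.

```python
import math

paths = {"rob": list("rob"), "rib": list("rib")}
obs = list("rib")  # the observed sequence clearly spells "rib"

def match(state, o):
    return 2.0 if state == o else 0.0

# CRF-style: one global normalization over complete path scores.
path_score = {w: sum(match(s, o) for s, o in zip(p, obs))
              for w, p in paths.items()}
z = sum(math.exp(s) for s in path_score.values())
crf = {w: math.exp(s) / z for w, s in path_score.items()}

# MEMM-style: per-state normalization. The branch is chosen at the
# first step; after that, each state has exactly one successor, so its
# locally-normalized probability is 1 whatever the observation says.
first = {w: math.exp(match(p[0], obs[0])) for w, p in paths.items()}
zf = sum(first.values())
memm = {w: first[w] / zf for w in paths}  # later steps multiply by 1

print("CRF :", crf)    # strongly prefers "rib"
print("MEMM:", memm)   # stuck at 0.5 / 0.5
```

The middle observation favors "rib", but under per-state normalization that evidence cannot flow back to the branching decision; with whole-sequence normalization it can.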

Proposed Solutions

• Determinization: – not always possible – state-space explosion start • Use fully-connected models: – lacks prior structural knowledge.

r o i • Our solution:

Conditional random fields

(CRFs): – Probabilistic conditional models generalizing MEMMs.

– Allow some transitions to

vote

more strongly than others in computing state sequence probability.

Whole sequence

rather than per-state normalization.
