Co-reference in Content Analysis


Introduction to
Cross-Document Coreference
Amit Bagga
StreamSage/Comcast
[email protected]
Outline
• Motivation and Definition
• Comparison with Within-Document Coreference,
WSD and other NL tasks
• Methodologies for Entity Cross-Document
Coreference
• Other types of Cross-Document Coreference
– Concept Cross-Document Coreference
– Event Cross-Document Coreference
– Cross-Media Coreference
– Cross-Language, Cross-Document Coreference
• Scoring Methodologies
Motivation
• Proper names comprise approximately 10%
of news text (Coates-Stephens, 1992)
• Names are often ambiguous across
documents
– increasingly becoming a challenge for NLP
systems as collection size and generality grow
– also as systems break the “document boundary”
Definition
• Cross-Document Coreference (CDC) for
entities, in broad terms, asks
– how can one computationally disambiguate the
intended referent of a name
• Winchester & Lee 2002
– for example, it asks, which ‘John Smith’ is
meant by a particular occurrence of the string
“John Smith”
Comparison with Within-Document
Coreference
• Within a document
– Identical or similarly named entities seldom
appear in the same context
• when they do, writers distinguish them explicitly
• i.e. it is usually the case that we have one referent
per discourse
– Variant forms of the same name generally obey
certain predictable regularities
• For example: Michael Jordan may be referred to by
the following – Michael, Mr. Jordan, Jordan, etc.
• Across documents
– Assumption that same or similar names refer to
same entity is not valid
– Linguistic theories do not apply
– The only way to distinguish between these
entities is to examine context
Comparison with WSD
• CDC can be thought of as disambiguating the
“sense” of usage of a name
• In WSD:
– Usually possible to enumerate a priori all possible
senses of a word
– Number of possible senses of a word is small (1-10)
• In CDC:
– A large corpus can contain 10s or 100s of entities with the
same name, which are impossible to enumerate a priori
– From a linguistic perspective, all entities are equally
plausible
The Role of Context
• Similar to WSD, context is vital for CDC
– context can be of different sizes
• window of words centered around a name, sentence
containing name, group of sentences, or even whole
document
– modeling context can be done in many different
ways
• bag of words, set of phrases, set of entities, set of
relations, etc.
• All CDC systems use context in one form or
another
Bag of Words Approach
– Bagga and Baldwin, 1998
– Within-document coreference system is used to identify all
mentions of entity
– Sentences containing mention are extracted from each
document
• “summaries” with respect to entity
– Set of summaries compared using VSM (tf*idf)
– Single-link clustering used
– Version 2 (1999) eliminates use of the within-document
coreference system
• sentences containing any variant of name extracted
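A minimal sketch of this pipeline, assuming the per-document "summaries" have already been extracted; the toy summaries, the 0.2 threshold, and the helper names below are illustrative, not details of the original system:

```python
# Sketch of a Bagga & Baldwin-style pipeline: tf*idf vectors over per-document
# entity "summaries", cosine similarity, and single-link clustering.
# All data, names, and the threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def single_link_clusters(summaries, threshold=0.2):
    """Group documents whose entity summaries are similar enough."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(summaries)
    sim = cosine_similarity(tfidf)

    # Single-link clustering via union-find: one link above threshold between
    # any two members is enough to merge their clusters.
    parent = list(range(len(summaries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(summaries)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Each string stands for the sentences mentioning "John Smith" in one document.
summaries = [
    "John Smith, chairman of General Motors, announced record profits.",
    "GM chairman John Smith said the company would expand.",
    "John Smith pitched three innings for the Red Sox.",
]
print(single_link_clusters(summaries))   # e.g. [[0, 1], [2]] for this toy data
```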
Corpus, Evaluation, and Results
– 197 articles containing “John Smith” extracted from
2 years of New York Times data
• 35 different John Smiths
– B-CUBED algorithm used
– Version 2 results
• 84% F-Measure
• 90% Precision, 78% Recall
• < 1% F-Measure drop when compared to original system
Minimizing Context Matches
• Kazi and Ravin, 2000
• Problem with Bagga and Baldwin, 1998
– Prohibitively expensive in terms of storage and n-to-n comparisons
(especially in a large corpus)
• Use IBM’s Nominator for named entity identification and
within document coreference (non-pronominal)
• CDC task is merging canonical names from different
documents that refer to same entity
• Context analysis done by use of a Context Thesaurus
– Given a name, returns a ranked list of terms that are related to
name in the corpus
# Docs   Nominator Output
17       Bush (unspecified gender)
1        Christopher Bush (male)
1        Douglas Bush (male)
26       George Bush (male)
2        George Bush; President Bush (male)
1        George W. Bush; Gov. George W. Bush; President George Bush (male)
1        Mr. Bush (male)
2        President Bush (male)
7        Vannevar Bush (unspecified gender)
[Diagram: exclusives (E) and mergeables (M) for the "Bush" name family]
• Exclusives: E1 = Christopher Bush, E2 = Douglas Bush, E3 = George W. Bush,
E4 = Vannevar Bush
• Mergeables:
– George Bush (M1), mergeable with E3 (first name and gender)
– Mr. Bush, mergeable with E1-E4
– President Bush, mergeable with E1-E4
– Bush, mergeable with E1-E4
• E = Exclusives – i.e. no merging possible
• M = Mergeables – i.e. compatible with some or all
exclusives
• Tables are created by analyzing two lists sorted by
ambiguity
– PERS names
• George Walker Bush > George W. Bush > George Bush > G.
Bush > Bush
– PLACE names
• Albany, NY > Albany
• Merging steps
– Merge identical canonical strings >= 2 words
• Merges the 28 George Bush, 2 President Bush, and 7 Vannevar Bush
articles into 3 equivalence classes
– Between mergeables and exclusives, combine if any
variants share a common prefix
• Merges E3, M1 and M3 (common prefix = President)
• Reduces # of context matches from 58x58 to 7x4
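A rough sketch of these two merging steps; the record format and the first-word prefix test are simplifying assumptions, not the exact Nominator logic:

```python
# Illustrative sketch: (1) collapse identical multi-word canonical names across
# documents, (2) treat a "mergeable" as compatible with an "exclusive" when
# their variant lists share a common leading word (e.g. "President").
from collections import defaultdict

def merge_identical_canonicals(records):
    """records: list of (doc_id, canonical_name, variants) tuples."""
    classes = defaultdict(list)
    for doc_id, canonical, variants in records:
        if len(canonical.split()) >= 2:       # only multi-word canonical names
            classes[canonical].append(doc_id)
    return dict(classes)

def shares_prefix(variants_a, variants_b):
    """True if some variant of A and some variant of B start with the same word."""
    first_a = {v.split()[0] for v in variants_a}
    first_b = {v.split()[0] for v in variants_b}
    return bool(first_a & first_b)

# Toy check: "President Bush" (a mergeable) is compatible with the exclusive
# whose variants include "President George Bush".
print(shares_prefix(["President Bush"],
                    ["George W. Bush", "Gov. George W. Bush", "President George Bush"]))
```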
Corpus, Evaluation, and Results
• Corpus – 1998 editions of New York Times
• 15 name families
– For example: Berger, Black, Brown, Bush, Clinton,
Gore, etc.
• B-CUBED algorithm for scoring
• Without context comparisons:
– Avg Precision = 98.5%
– Avg Recall = 72.85%
• No results reported when context comparisons are
used (Ravin and Kazi, 1999)
3 Models of Similarity
• Gooi and Allan, 2004
• Methodology similar to Bagga and Baldwin
– extract 55-word snippets centered at the name or its variant
• Problem with Bagga and Baldwin
– sharp drop off in F-Measure around threshold
• 3 different models of similarity
– Incremental Vector Space
• tf*idf, but with average link clustering
– KL divergence
• snippets are represented as probability distribution of words
• similarity = “distance” between two probability distributions
– Agglomerative Vector Space
• tf*idf with bottom-up, complete-link clustering
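As one illustration, a small sketch of the KL-divergence comparison between two snippets; the add-one smoothing and the toy snippets are assumptions, not details from the paper:

```python
# Sketch: represent two snippets as smoothed word distributions and compute
# KL divergence as an (asymmetric) "distance" between them.
import math
from collections import Counter

def kl_divergence(snippet_a, snippet_b):
    words_a, words_b = snippet_a.lower().split(), snippet_b.lower().split()
    vocab = set(words_a) | set(words_b)
    ca, cb = Counter(words_a), Counter(words_b)

    def prob(counter, total, w):
        return (counter[w] + 1) / (total + len(vocab))   # add-one smoothing

    ta, tb = len(words_a), len(words_b)
    return sum(prob(ca, ta, w) * math.log(prob(ca, ta, w) / prob(cb, tb, w))
               for w in vocab)

print(kl_divergence("john smith chairman of general motors",
                    "gm chairman john smith announced earnings"))
```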
Corpus
• John Smith corpus (Bagga and Baldwin)
• Person-x corpus
– created by querying TREC collection with
queries like arts, business, sports, etc.
– BBN’s IdentiFinder used for named entity
recognition
– one name (and its corresponding variants)
randomly replaced with phrase Person-x
– 34,404 documents; 14,767 actual unique
entities
Evaluation and Results
• B-CUBED algorithm used for scoring
• Agglomerative VS best
– 88.2% F-Measure for John Smith corpus
– 83% F-Measure for Person-x corpus
• When run on each sub-corpus (arts, sports, etc.) of
Person-x corpus
– F-Measure drops to 77%
– shows that a more homogeneous corpus is more difficult
• Results for Agglomerative VS degrade much more
smoothly around threshold than others
Second Order Co-Occurrence
• Three methods – independently published
• Bagga, Baldwin, and Ramesh, 2001 - 2-pass
algorithm
– First pass: as before
– Second pass:
• for each chain, compute set of most frequent overlapping
words in chain (signature words for chain)
• for each singleton document after pass 1, compare to each
chain
– use signature words to extract additional sentences
– compare enhanced summary to every summary in chain
– merge if similarity > threshold
– if not merged with any chain, remains singleton
• Winchester and Lee, 2001
– named entity detection and conflation within documents
is done as pre-processing step
– based on Schutze’s (1998) algorithm for context-group
discrimination
– 3 types of vectors are created
• Term Vectors – formed for each name occurring in context of
entity of interest and its variants
– stores co-occurrence stats for term across whole corpus
• Context Vectors – formed for entity of interest by summing all
term vectors associated with its context
– term vectors are weighted with their idf scores before sum
• Entity Vectors – for each entity, it is centroid of set of context
vectors
– entity disambiguation is done by comparing Entity
Vectors using VSM with single-link clustering
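A compressed sketch of the three vector types; the toy co-occurrence counts and idf weights are invented, and the real system builds these from corpus statistics:

```python
# Sketch of Schutze-style second-order co-occurrence vectors:
#   term vectors    - corpus-wide co-occurrence counts for each context term
#   context vectors - idf-weighted sum of the term vectors in one context
#   entity vectors  - centroid of the context vectors for one candidate entity
import numpy as np

def context_vector(context_terms, term_vectors, idf):
    """Sum the idf-weighted term vectors for the terms in one context."""
    dim = len(next(iter(term_vectors.values())))
    vec = np.zeros(dim)
    for t in context_terms:
        if t in term_vectors:
            vec += idf.get(t, 1.0) * term_vectors[t]
    return vec

def entity_vector(context_vectors):
    """Centroid of all context vectors associated with one candidate entity."""
    return np.mean(context_vectors, axis=0)

# Toy data: 3-dimensional co-occurrence vectors for two context terms.
term_vectors = {"coach": np.array([3.0, 0.0, 1.0]), "nba": np.array([5.0, 1.0, 0.0])}
idf = {"coach": 2.1, "nba": 1.7}
ctx = context_vector(["coach", "nba", "unknown"], term_vectors, idf)
print(entity_vector([ctx, ctx * 0.5]))
```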
Corpus, Evaluation, Results
• Bagga, Baldwin, and Ramesh
– John Smith corpus, B-CUBED scoring
– new F-Measure 91% (+7 from before)
• Winchester and Lee
– 30 name sets; 10 each of PER, LOC, ORG
– from 6000 WSJ articles
– B-CUBED scoring
– discovered that selective creation of 3 types of vectors
boosts performance
• for example, LOC helps disambiguate other LOC
• Birmingham, Alabama vs UK; John Smith associated with
Pocahontas
– overall F-Measure 78.5%
• NAM – 90.3%, LOC – 79.2%, ORG – 72.5%
• Guha and Garg, 2004
– mine descriptions associated with entity of interest
(sketch)
• descriptions are other entities + professions that are in close
proximity
– comparing descriptions
• different weights given to different descriptions given type of
entity of interest and entity-type of description
– for example: location is more likely to be disambiguated by
another location than by the name of a person
– Corpus and Evaluation
• 26 entities (names + places), 2-6 instances identified of each
• sent as queries to search engines, top 150 results collated and
manually tagged for truth
• best F-Measure = 90.3%
Maximum Entropy Model
• Fleischman and Hovy, 2004 – use ME to determine if two
concept/instance pairs are same entity
– concept/instance pairs – ACL dataset (2M pairs)
• John Edwards/lawyer and John Edwards/politician
– Name features: NAME-COMMON (census), NAME-FAME (ACL
dataset), WEB-FAME (Google)
– Web features: based on # of Google hits with name plus headwords of
concepts used as queries
– Overlap features: based on # words overlapping in context of names and
concepts
– Semantic features: based on semantic relatedness of concepts (WordNet)
• for example: lawyers are more likely to become politicians
– Estimated Statistics features: probabilities that a name is associated with a
particular concept (computed over entire ACL dataset)
• Disambiguation using group-average agglomerative clustering
• Tested on set of 31 concept/instance pairs (1875 used for training)
– 20 had a single referent
– F-Measure = 93.9%
– baseline (all in same chain) = 92.4%
Robust Reading Approach
• Li, Morie, and Roth, 2004
– a global probabilistic view of how documents are generated and
how entities are “sprinkled” into them
• Model 1 (simplest – no notion of author; see the toy sketch at the end of this section)
– entities are present in a document with a prior probability,
independent of other entities
– mentions (references) are selected according to probability
distribution P(mj|ei)
– i.e. entity referenced by a mention is not dependent on other
mentions
• Model 2 (more expressive)
– # of entities in doc and # of mentions follow uniform distribution
– entities enter doc with a prior probability, independent of others
– representative (canonical form) for each entity is selected
according to P(rj|ei)
– for each representative, mentions are selected by P(mk|rj)
– i.e. entity referenced by a mention depends on other mentions in
the same document
• Model 3 (least relaxation)
– # of entities based on uniform distribution – but not independent of each
other
– entities in doc viewed as nodes in a weighted directed graph with edges
labeled as P(ej|ei)
– entities inserted in document via a random walk starting at an entity with
prior probability P(ek)
– representatives and mentions follow the same probabilities as Model 2
– i.e. entity referenced by a mention depends on other mentions in same
document, but also on other entities in entire corpus
• Models learned using truncated EM algorithm
• Evaluation
– 300 NYT articles from TREC corpus
– 8000 mentions corresponding to 2000 entities (people, locations,
organizations)
– compared to SOFT-TF-IDF and baseline (entities with identical writing
are same)
– overall F-Measure = 89% (model 2)
– baseline = 70.7% and SOFT-TF-IDF = 79.8%
• Model 3 does not perform best because
– global dependencies enforce restrictions over groupings of similar
mentions
– because of limited document set, estimating global dependency is
inaccurate
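For intuition about Model 1 above, a toy sketch of how a document's mention-to-entity assignment would be scored; the probability tables are invented, and the real models are trained with truncated EM as noted above:

```python
# Toy sketch of Model 1: entities enter a document with a prior probability,
# and each mention is generated from its entity via P(m | e), independently of
# the other mentions. All numbers below are invented for illustration.
import math

prior = {"e_gwbush": 0.02, "e_vbush": 0.001}
p_mention = {
    ("George Bush", "e_gwbush"): 0.4,
    ("Bush", "e_gwbush"): 0.5,
    ("Bush", "e_vbush"): 0.3,
    ("Vannevar Bush", "e_vbush"): 0.6,
}

def log_likelihood(assignment):
    """assignment: list of (mention, entity) pairs for one document."""
    entities = {e for _, e in assignment}
    score = sum(math.log(prior[e]) for e in entities)      # entity priors
    score += sum(math.log(p_mention.get((m, e), 1e-6))     # P(mention | entity)
                 for m, e in assignment)
    return score

# Which entity better explains "Bush" alongside "George Bush" in the same doc?
print(log_likelihood([("George Bush", "e_gwbush"), ("Bush", "e_gwbush")]))   # higher
print(log_likelihood([("George Bush", "e_gwbush"), ("Bush", "e_vbush")]))    # lower
```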
Using IE Features
• 3 different methods published
• Mann and Yarowsky, 2003
– use unsupervised learning to learn patterns from corpus
that capture biographical features
• birth day, birth year, birth place and occupation
– use bottom-up centroid agglomerative clustering for
disambiguation
– vectors for each document are generated by using the
following
• all words (plain) or proper nouns (nnp)
• most relevant words (mi and tf-idf)
• basic biographical features (feat)
• extended biographical features (extfeat)
Corpus, Evaluation, and Results
• Mann and Yarowsky
– Pseudoname corpus
• query Google with names of 8 people
– take 28 possible pairs and replace with different pseudonames
– Naturally occurring corpus
• query for 4 naturally occurring polysemous names
– example: Jim Clark
• 60 articles for each name
• 3-way classification (top 2 occurring people + “others”)
– Disambiguating accuracy for Pseudonames
• 86.4% with nnp+feat+tf-idf
– For naturally occurring corpus
• using mutual information 88% Precision and 73% Recall
• Niu, Li, and Srihari, 2004 - use 3 different
categories of contextual features
– set of 50 words centered around name (or alias)
– other entities occurring in 50 word context of name (or
alias)
– automatically extracted relationships (25 possible)
• birth day, age, affiliation, title, address, degree, etc.
– features combined using Maximum Entropy Model
• Evaluation using B-CUBED algorithm
– 4 sets of 4 famous names mixed together using
pseudonames
• 88% F-Measure achieved
– 2 naturally occurring sets
• Peter Sutherland – 96% F-Measure
• John Smith – 85% F-Measure
• Dozier and Zielund, 2004
– CDC for people in legal domain
• attorneys, judges, and expert witnesses
– Combine IE techniques with record linkage techniques
• biographical records for attorneys and judges created manually
from Westlaw Legal Directory
• biographical record for expert witnesses created through text
mining
• IE techniques extract templates associated with each type from
document
• record linkage part uses Bayesian network to match templates
with biographical records
– Evaluation
• for docs with stereotypical syntax and full names – 98%
precision and 95% recall
• Otherwise, 95% precision and 60% recall
Baseline
• Guha and Garg, 2004
– established baseline when full docs were
compared using TF-IDF without considering
context for 26 entities (names and places)
– 2-6 instances of each entity considered
– for each instance, top 10 results evaluated
– 22.5% accuracy overall
Types of CDC
• Named Entities
– described earlier
• Terms or Concept
– Kazi and Ravin, 2000
• Events
– Bagga and Baldwin, 1999
• Cross-Media and/or Multimedia Coreference
– Between text and pictures for names (Bagga and Hu, unpublished)
– Between text and video for names (Satoh and Kanade, 1997)
– Between video streams (using image and text) for events (Bagga,
Hu, and Zhong, 2002)
• Cross-Language, Cross-Document Coreference
– parallel corpus (Harabagiu and Maiorano, 2000)
– non-parallel corpus – open problem, although manual results
encouraging (Bagga and Baldwin, unpublished)
Term or Concept CDC
• Single or multi-word terms refer to concepts
occurring in domain
• Multi-word terms
– identified by Terminator (rule-based)
• form subset of noun phrases in document
– discard those that occur only once in document
• for example: "price rose", where "rose" is mistakenly identified as a
noun
– discard those that are found only as proper sub-strings
• for example: dimension space (part of lower dimension space)
– are seldom ambiguous and are merged across
documents
Single Word Terms
• Capitalized single words are most common
sources of ambiguity
– for example: Wired – name of magazine and an
adjective that is first word in sentence
• Within-doc categorization of single words
– If capitalized word occurs in lowercase in document –
consider as regular word
– If capitalized word appears as capitalized in middle of
sentence – consider as name
– If no lowercase occurrences and word appears at
beginning of sentence or in title/header - consider as
term
– All other single words not identified as part of name or
multi-word terms – consider as lower-case term
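The cascade above can be sketched as a small rule-based classifier; the document fields below are assumptions standing in for the analysis the real system performs:

```python
# Toy sketch of the within-document categorization cascade for a capitalized
# single word; the fields of 'doc' are illustrative assumptions.
def categorize(word, doc):
    if word.lower() in doc["lowercase_occurrences"]:
        return "regular word"                       # also seen in lowercase
    if word in doc["capitalized_mid_sentence"]:
        return "name"                               # capitalized mid-sentence
    if word in doc["sentence_initial_or_title"]:
        return "term"                               # only sentence-initial/title
    if word not in doc["name_or_multiword_term_parts"]:
        return "lower-case term"                    # residual category
    return "part of a name or multi-word term"

doc = {
    "lowercase_occurrences": {"wired"},             # "wired" also appears lowercase
    "capitalized_mid_sentence": {"Microsoft"},
    "sentence_initial_or_title": {"Enliven"},
    "name_or_multiword_term_parts": set(),
}
print(categorize("Wired", doc), categorize("Microsoft", doc), categorize("Enliven", doc))
```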
Disambiguating Single Words
Across Documents
[Table: examples of single-word ambiguity across documents]
• Lowercase term vs. uncategorized name: enliven/Enliven, finds/Finds,
bush/Bush, loss/Loss, wired/Wired
• Uppercase term vs. uncategorized name: Allied/Allied,
Microsoft/Microsoft (name is a variant of Microsoft Corp.),
N.Y. (name is a variant of New York)
• Unambiguous cases – no merging
• Ambiguous cases – merge only if just the name or just the lower-case term
is found in the corpus
Upper-case Term   Uncat. Name   # Docs   # occurrences within doc
Find              Find          2        1
Please            Please        8        1
Met               Met           5        2-3
Sun               Sun           12       1-3
Apple             Apple         203      1-46
• Single occurrences of single capitalized terms can be
merged with occurrences of corresponding names if names
occur more than once in at least one document
• No evaluation was performed
Event CDC
• Bagga and Baldwin, 1999
– similar approach to entity-based CDC
• Two events are coreferent iff the players, time, and
location are the same
• Event CDC system extracts as “summaries”
sentences which contain:
– main event verb (for example: resign)
– nominalization of main verb (for example: resignation)
– synonyms (for example: quit)
• Summaries are clustered using single-link
clustering and VSM similarity
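A rough sketch of the summary-extraction step for an event verb like "resign"; using WordNet for synonyms and nominalizations is an assumption here, since the slide does not name the lexical resources:

```python
# Sketch: build a trigger-word set for an event verb from WordNet synonyms and
# derivationally related nouns, then keep only the sentences containing a
# trigger as the document's event "summary".
# Requires a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def trigger_words(verb):
    triggers = {verb}
    for syn in wn.synsets(verb, pos=wn.VERB):
        for lemma in syn.lemmas():
            triggers.add(lemma.name().replace("_", " "))        # synonyms, e.g. "quit"
            for rel in lemma.derivationally_related_forms():
                triggers.add(rel.name().replace("_", " "))      # e.g. "resignation"
    return {t.lower() for t in triggers}

def event_summary(document_sentences, verb="resign"):
    triggers = trigger_words(verb)
    return [s for s in document_sentences
            if any(t in s.lower() for t in triggers)]

print(event_summary(["The CEO announced her resignation on Monday.",
                     "Shares closed slightly higher."]))
```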
Evaluation and Results
• Articles chosen for 3 events: resignations, elections, and
espionage
– 2 years of New York Times data
• B-CUBED algorithm used for scoring
Event          # docs   F-Measure   Precision   Recall   F-Measure (2-pass algorithm)
resignations   219      84          95          75       84
elections      135      43          50          37       45
espionage      184      76          79          74       81
Analysis
• Events are harder than entities:
– no within-document coreference
– no explicit references
– are at times spread over the entire document
• Analysis of Elections event
– elections are temporal in nature
• disambiguating phrases largely use temporal references (for example
– upcoming fall elections, elections last year, next elections, etc)
• exposes weakness of using a bag of words approach
– presence of sub-events
• US General election consists of both Presidential elections and
Congressional elections
– “players” are the same due to high rate of incumbency
– descriptions of events are very similar
• issues in every election are similar (inflation, unemployment,
economy)
Cross-Media Coreference – Between
Text and Video (Names)
• Satoh and Kanade, 1997
• Association of face and name in video
– given unknown face, infer name or,
– given name, guess faces which are likely to
have that name
• Use closed caption transcripts and video
images for correlation
• Face extraction: neural-network based face
detector to locate faces in images
• Name candidate extraction: use Oxford Text
Archive dictionary (appx 70k words)
– Word is considered to be a proper noun if
• annotated as one in dictionary
• not found in dictionary
• Face similarity: eigenvector based method
to compute distance between two faces
• Face and name co-occurrence: use a co-occurrence factor
– captures how well name and face co-occur in
time
Corpus, Evaluation, and Results
• No large scale
evaluation done
• Problem with
technique: false
positives
– especially for famous people
– Clinton mentioned by
news anchor repeatedly
– name gets associated
with news anchor
Between Text and Pictures (Names)
• Bagga and Hu, unpublished (2004)
• Algorithm
– Use text and image based features to identify
coreference
– Tested on web pages
• Text narrowed by extracting sentences containing name
variants of entity
• Image features computed by analyzing distribution of colors in
L*a*b perceptual color space
– Across URLs, first compute text similarity (VSM) and
image similarity (L*a*b) and then combine
Preliminary Results
[Images: example clusters found for "John Smith" web pages]
• Portraits of Captain John Smith
• Maps related to Captain John Smith's explorations
• Captain John Smith as portrayed in the movie Pocahontas
Cross-Media Coreference
• Goal: identify and track “important” news
events in broadcast news video
• Observations:
– “important” stories of the day are repeated
within/across stations
– common footage scenes can be used as
representative clips for these stories
Structure of Broadcast News
[Diagram: a news broadcast is divided into stories and commercial segments;
each story consists of story segments, each story segment consists of scenes,
and each scene carries images, sound, and closed caption text.]
Methodology
• For each video source, use closed caption text:
– to identify segment boundaries (>> signs indicate
speaker change)
– identify and eliminate commercial segments (based
upon text-tiling method)
– cluster story segments into stories
• Use complete link, hierarchical clustering to
identify overlapping stories between programs
– identify common footage scenes between each pair of
overlapping stories
Common Footage Detection
[Diagram: for each pair of overlapping stories, key frames and closed caption
text of scenes from video source 1 and video source 2 are compared; visual
similarity and text similarity are combined (combined-media clustering) to
detect common footage.]
Examples – Found by System
• News conference on Iraqi bombing (CBS 2829, CBS 3873, NBC 3885, NBC 5061)
• Flood rescue -> rescue school bus (CBS 38805, NBC 20805)
• US submarine -> US submarine incident (CBS 13833, NBC 16317)
• Topic: US/Iraq -> US bombing of Iraq (CBS 4125, CBS 4257, NBC 7377)
More Examples
• Found by algorithm, but missed by human subjects (same stories and similar
key-frame images, but not really identical footage):
– UN cars -> UN inspectors leaving Iraq (CBS 5193, NBC 30021)
– Night at Baghdad -> night bombing at Iraq (CBS 2253, NBC 4173)
– Iraqi map (CBS 2001, NBC 3177)
• Missed by system: US submarine incident (CBS 501, CBS 13305, NBC 16977)
– missed because of a weak text link and an image intensity change
• False positive: Death of Dale Earnhardt
Results
• System achieves on average 71% recall, 37%
precision
– 4 test sets
– each set consisted of 2 thirty minute news programs
from CBS and NBC (same day)
• Majority of false positives occur due to presence
of studio scenes
• If studio scenes are eliminated from results (when
stories are the same)
– precision increases to 87%
Cross-Language CDC:
Parallel Corpus
• Harabagiu and Maiorano, 2000
• Use a parallel corpus of English and Romanian
– Romanian obtained by manually translating MUC-6
and MUC-7 corpora
• Within-document coreference system run within
each language
• Parallelism used to improve coreference in each
language by using features/coreference chain
information from the other
– English precision increases from 84% to 87% while
preserving recall
– Romanian precision increases from 72% to 76% while
preserving recall
Cross-Language CDC:
Non-Parallel Corpus
• Bagga and Baldwin (unpublished)
• Algorithm evaluated manually on a small set of documents
in English and Korean
– for each document, extract sentences containing mentions of entity
(name variants only) – “summary”
– translate each summary from non-English language to English
using a bi-lingual dictionary (word for word translation, without
regard for sense)
– Compare “approximate translations” with English summaries using
VSM
• Initial results were promising with limited decline in F-Measure
• Identification of transliterated names is a major problem
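A toy sketch of the "approximate translation" comparison described above; the tiny bilingual dictionary and summaries are invented, and a real system would use a full dictionary:

```python
# Sketch: translate a non-English summary word for word with a bilingual
# dictionary (ignoring word sense), then compare it to English summaries
# with a tf*idf vector space model. Untranslatable words (e.g. the
# transliterated name) pass through unchanged, which illustrates the
# name-transliteration problem noted above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def approximate_translation(summary, bilingual_dict):
    return " ".join(bilingual_dict.get(w, w) for w in summary.split())

korean_summary = "김철수 사장 사임 발표"                         # toy source summary
bilingual_dict = {"사장": "president", "사임": "resignation", "발표": "announced"}
english_summaries = [
    "John Smith announced his resignation as president",
    "John Smith pitched three innings for the Red Sox",
]

translated = approximate_translation(korean_summary, bilingual_dict)
tfidf = TfidfVectorizer().fit_transform([translated] + english_summaries)
print(cosine_similarity(tfidf[0], tfidf[1:]))   # higher score for the first summary
```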
Cross-Language CDC:
Arabic Non-Parallel Corpus
• Sayeed, et al., 2009
• Based on Bagga and Baldwin, 1998
– Use BBN’s Serif for computing Within-Document Coreference chains
– for each document, extract windows of 50 words around mentions of
entity – “summary”
– One variation of the system tries to address the name transliteration
problem by
• a) translating the longest names in each document into English
• b) correlating which ones are “similar” in English, and
• c) attempting to find xdoc coreference between these discovered pairs
– A baseline system identified xdoc coreference when longest matching
names were exact matches
• Tested on 412 document set from ACE 2008 corpus
– Baseline B-Cubed F-Measure = 40.6 (best F-Measure for task = 69)
– F-Measure for System without name translation = 40.6
– F-Measure for system with name translation = 41.3
Evaluation Methodologies
• MUC-6/7 algorithm
– Vilain, et al., 1996
– originally developed for within-document coreference
• B-CUBED
– Bagga and Baldwin, 1998
• Clustering
– Treat CDC as a clustering problem
• ACE – Automatic Content Extraction Program
– developed for Entity Detection and Tracking (EDT)
task (currently, used for within-document EDT only)
The MUC Scorer: Example
[Figure: the truth (key) contains three chains, of sizes 5, 2, and 5 (twelve
mentions in all); Response A incorrectly links the two-mention chain with one
five-mention chain; Response B incorrectly links the two five-mention chains.]
MUC Scoring Algorithm
• Precision Error is determined by asking:
– How many links must be added to truth (key) to
have the same equivalence classes as the
response?
• For recall error, reverse the roles above.
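A small sketch of this link-based computation, following the Vilain et al. formulation; the toy chains assume the example's truth contains chains of sizes 5, 2, and 5, which is consistent with the scores on the next slide:

```python
# Sketch of the MUC link-based score: a key chain S is split by the response
# into p(S) pieces (mentions missing from the response count as singletons);
# |S| - |p(S)| links are found out of the |S| - 1 needed. Precision swaps the
# roles of key and response.
def muc_recall(key_chains, response_chains):
    numer = denom = 0
    for chain in key_chains:
        pieces = {frozenset(r) & frozenset(chain)
                  for r in response_chains if set(r) & set(chain)}
        covered = set().union(*pieces) if pieces else set()
        n_pieces = len(pieces) + len(set(chain) - covered)
        numer += len(chain) - n_pieces
        denom += len(chain) - 1
    return numer / denom

def muc_precision(key_chains, response_chains):
    return muc_recall(response_chains, key_chains)

truth = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11, 12]]
resp_a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12]]        # merges chains 2 and 3
resp_b = [[1, 2, 3, 4, 5, 8, 9, 10, 11, 12], [6, 7]]        # merges chains 1 and 3
for resp in (resp_a, resp_b):
    print(muc_precision(truth, resp), muc_recall(truth, resp))   # both: 0.9, 1.0
```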
Problem: All Errors are Equal
• For response A:
– Precision = 9/10
– Recall = 9/9
• For response B:
– Precision = 9/10
– Recall = 9/9
• Unintuitive results in the extreme cases
– N = # of entities
– m = # of chains (truth)
– All entities in same chain:
P = (N - m) / (N - 1)
– P -> 1, if N >> m
An Intuition for Scoring
Differently
[Figure: the same truth and responses as above]
• Response A: a mistake!
• Response B: a bigger mistake!!
B-CUBED Algorithm: An entity
based approach
• For each entity i:

Precision_i = (# of correct elements in the output chain containing element_i)
              / (# of elements in the output chain containing element_i)

Recall_i = (# of correct elements in the output chain containing element_i)
           / (# of elements in the truth chain containing element_i)

Final Precision = (1/N) * sum over i = 1..N of Precision_i
(Final Recall is the analogous average of the Recall_i)
Example: Precision
Response A:
Precision = (1/12) * (5 * (5/5) + 2 * (2/7) + 5 * (5/7)) = 76%

Response B:
Precision = (1/12) * (5 * (5/10) + 2 * (2/2) + 5 * (5/10)) = 58%

Recall for both responses is 100%
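A compact sketch of the B-CUBED computation over the same toy chains; it reproduces the precision figures above:

```python
# Sketch of B-CUBED: per-mention precision/recall against the output and truth
# chains containing that mention, averaged over all mentions.
def b_cubed(truth_chains, response_chains):
    truth_of = {m: set(c) for c in truth_chains for m in c}
    resp_of = {m: set(c) for c in response_chains for m in c}
    precision = recall = 0.0
    for m in truth_of:
        correct = len(truth_of[m] & resp_of[m])
        precision += correct / len(resp_of[m])
        recall += correct / len(truth_of[m])
    n = len(truth_of)
    return precision / n, recall / n

truth = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11, 12]]
resp_a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12]]
resp_b = [[1, 2, 3, 4, 5, 8, 9, 10, 11, 12], [6, 7]]
print(b_cubed(truth, resp_a))   # (0.76..., 1.0)
print(b_cubed(truth, resp_b))   # (0.58..., 1.0)
```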
ACE Scoring Algorithm
• Types of errors: miss and false alarm
• Score is calculated as a function of “cost”
• Cost depends on
– entity type
• person, organization, geo-political entity, location, and facility
– entity level
• name, nominal reference, and pronominal reference
• used for evaluation purposes only
• No published CDC evaluation using this algorithm
The cost of a single miss or false alarm

Type/Level   PER     ORG     GPE     LOC      FAC
NAM          1       0.5     0.25    0.1      0.05
NOM          0.2     0.1     0.05    0.02     0.01
PRO          0.04    0.02    0.01    0.004    0.002
CEDT(S) = [ sum over type t and level l of
            { CMiss(t, l)*NMiss(t, l) + CFA(t, l)*NFA(t, l) } ]
          / [ sum over type t and level l of { CMiss(t, l)*NRef(t, l) } ]

NRef = total number of reference entities in source S
The denominator is a normalization factor = the cost when no entities are output
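A small sketch of how the cost would be computed from the table above; the miss and false-alarm counts are invented, and the sketch assumes the same table supplies both CMiss and CFA, as the slide's title suggests:

```python
# Sketch of the ACE EDT cost: per-type, per-level costs of misses and false
# alarms, normalized by the cost of missing every reference entity.
COST = {  # (level, type) -> cost of one miss or one false alarm (table above)
    ("NAM", "PER"): 1.0,   ("NAM", "ORG"): 0.5,  ("NAM", "GPE"): 0.25,
    ("NAM", "LOC"): 0.1,   ("NAM", "FAC"): 0.05,
    ("NOM", "PER"): 0.2,   ("NOM", "ORG"): 0.1,  ("NOM", "GPE"): 0.05,
    ("NOM", "LOC"): 0.02,  ("NOM", "FAC"): 0.01,
    ("PRO", "PER"): 0.04,  ("PRO", "ORG"): 0.02, ("PRO", "GPE"): 0.01,
    ("PRO", "LOC"): 0.004, ("PRO", "FAC"): 0.002,
}

def edt_cost(n_miss, n_fa, n_ref):
    """Each argument maps (level, type) -> count for one source."""
    numer = sum(COST[k] * (n_miss.get(k, 0) + n_fa.get(k, 0)) for k in COST)
    denom = sum(COST[k] * n_ref.get(k, 0) for k in COST)
    return numer / denom

# Toy counts: 2 missed PER names, 1 false-alarm ORG name, against 20 PER and
# 10 ORG reference names.
print(edt_cost({("NAM", "PER"): 2}, {("NAM", "ORG"): 1},
               {("NAM", "PER"): 20, ("NAM", "ORG"): 10}))   # 0.1
```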
Applications
• IR, EDT, and TDT
• Name Matching Problem (Patman and Thomson, 2003)
– When are different name strings potential references to the same
entity? (Qaddafi, Gadafi, Gaddafi, Kaddafi, Qaddafy, etc.)
• Cross-Document IE and Information Fusion
– increases chances of a pattern match
– information may be more explicit in one or more articles
– the set of articles may contain more information than any one article
• Multi-Document Summarization
– 2002 DUC evaluation – earthquakes
– systems had difficulty distinguishing between earthquakes
• Question Answering
– When was Kennedy born? – which Kennedy is being referred to?
• Link Analysis
– linking entities is a first step towards identifying more complex
relationships across documents
Conclusions
• CDC is a feasible task
– context (text/images/video) around entity/event provides enough
information to disambiguate
• Entity-based CDC
– many different methods/models
– performance over different, large corpora is consistently in the mid-80s
• Other types of CDC
– simple models/methods have been tried
– plenty of opportunity to explore more sophisticated contextual
models
• Evaluation Methodologies
– several different ones exist; no consensus on best one
• Applications
– time is ripe for integrating entity-based CDC in higher level
applications