CS276B
Text Information Retrieval, Mining, and
Exploitation
Lecture 13
Text Mining II
Feb 27, 2003
(includes slides borrowed from J. Allan, G. Doddington, G. Neumann,
M. Venkataramani, and D. Radev)
Today’s Topics
First story detection (FSD)
Summarization
Coreference resolution
First Story Detection
First Story Detection
Automatically identify the first story on a
new event from a stream of text
Topic Detection and Tracking – TDT
“Bake-off” sponsored by US government
agencies
Applications
Intelligence services
Finance: Be the first to trade a stock
Examples
2002 Presidential Elections
Thai Airbus Crash (11.12.98)
On topic: stories reporting details of the crash, injuries and deaths; reports on the
investigation following the crash; policy changes due to the crash (new runway lights
were installed at airports).
Euro Introduced (1.1.1999)
On topic: stories about the preparation for the common currency (negotiations about
exchange rates and financial standards to be shared among the member nations);
official introduction of the Euro; economic details of the shared currency; reactions
within the EU and around the world.
First Story Detection
Other technologies don’t work for this
Information retrieval
Text classification
Why?
The First-Story Detection Task
To detect the first story that discusses a topic,
for all topics.
[Timeline diagram: stories arrive over time, each marked as Topic 1 or Topic 2; the first story on each topic is flagged as a first story, later stories on the same topic are not.]
There is no supervised topic training
(like Topic Detection)
Definitions
Event: A reported occurrence at a specific
time and place, and the unavoidable
consequences. Specific elections, accidents,
crimes, natural disasters.
Activity: A connected set of actions that have
a common focus or purpose - campaigns,
investigations, disaster relief efforts.
Topic: a seminal event or activity, along with
all directly related events and activities
Story: a topically cohesive segment of news
that includes two or more DECLARATIVE
independent clauses about a single topic.
TDT Tasks
First story detection (FSD)
Detect the first story on a new topic
Topic tracking
Once a topic has been detected, identify
subsequent stories about it
Standard text classification task
However, very small training set (initially: 1!)
First Story Detection (FSD)
First story detection is an unsupervised learning task.
On-line vs. Retrospective
On-line: Flag onset of new events from live news feeds as
stories come in
Retrospective: Detection consists of identifying first story
looking back over longer period
Lack of advance knowledge of new events, but have
access to unlabeled historical data as a contrast set
FSD input: stream of stories in chronological order
simulating real-time incoming document stream
FSD output: YES/NO decision per document
Patterns in Event Distributions
News stories discussing the same event tend to be
temporally proximate
A time gap between burst of topically similar stories
is often an indication of different events
Different earthquakes
Airplane accidents
A significant vocabulary shift and rapid changes in
term frequency are typical of stories reporting a new
event, including previously unseen proper nouns
Events are typically reported in a relatively brief time
window of 1-4 weeks
Similar Events over Time
TDT: The Corpus
TDT evaluation corpora consist of text and
transcribed news from the 1990s.
A set of target events (e.g., 119 in TDT2) is used for
evaluation
Corpus is tagged for these events (including first
story)
TDT2 consists of 60,000 news stories (Jan-June
1998); about 3,000 are “on topic” for one of the 119
topics
Stories are arranged in chronological order
Ideas?
Approach 1: KNN
On-line processing of each incoming story
Compute similarity to all previous stories
Cosine similarity
Language model
Prominent terms
Extracted entities
If similarity is below threshold: new story
If similarity is above threshold for previous
document d: assign to topic of d
Optimal threshold can be chosen based on historical
data
Threshold is not topic specific!
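A minimal sketch of this threshold test, assuming TF-IDF document vectors have already been built (the helper names and the threshold value are illustrative, not taken from the lecture):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense term-weight vectors.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def online_fsd(story_vectors, threshold=0.2):
    """Process stories in chronological order; return one YES/NO
    first-story decision (True/False) per story."""
    decisions, seen = [], []
    for vec in story_vectors:
        best = max((cosine(vec, old) for old in seen), default=0.0)
        decisions.append(best < threshold)   # no close earlier story => first story
        seen.append(vec)
    return decisions
```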
Variant: Single Pass Clustering
Assign each incoming document to one of a
set of topic clusters
A topic cluster is represented by its centroid
(vector average of members)
For incoming story compute similarity s with
centroid
As before:
s>θ: add document to corresponding cluster
s<θ: first story!
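A sketch of the single-pass clustering variant, reusing cosine() from the previous snippet; each cluster's centroid is kept as a running sum and count (the threshold is again illustrative):

```python
import numpy as np

def single_pass_fsd(story_vectors, threshold=0.2):
    clusters = []   # each cluster: (vector_sum, member_count); centroid = sum / count
    flags = []
    for vec in story_vectors:
        sims = [cosine(vec, s / n) for s, n in clusters]
        if sims and max(sims) > threshold:
            i = int(np.argmax(sims))          # s > theta: join the closest cluster
            s, n = clusters[i]
            clusters[i] = (s + vec, n + 1)
            flags.append(False)
        else:                                 # s < theta: first story, start a new cluster
            clusters.append((vec.copy(), 1))
            flags.append(True)
    return flags
```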
Approach 2: KNN + Time
Only consider documents in a (short) time
window
Compute similarity in a time-weighted fashion, so
that more recent documents in the window count
more than older ones (m: number of documents in
the window, d_i: i-th document in the window)
Time weighting significantly increases
performance.
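The lecture's exact decay formula is not reproduced above; as an assumed example, one simple choice is to scale cosine similarity linearly by position in the window, so older stories in the window count less:

```python
def time_weighted_fsd(story_vectors, window=2000, threshold=0.2):
    # Assumed linear decay: the i-th of m documents in the window
    # (1 = oldest, m = newest) contributes cosine(d, d_i) * i / m.
    flags, seen = [], []
    for vec in story_vectors:
        recent = seen[-window:]
        m = len(recent)
        best = max((cosine(vec, d) * (i + 1) / m for i, d in enumerate(recent)),
                   default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags
```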
FSD - Results
UMass, CMU: Single-Pass Clustering
[Chart: FSD error vs. classification error]
Discussion
Hard problem
Becomes harder the more topics need to be
tracked. Why?
Second story detection is much easier than
first story detection
Example: retrospective detection of first
9/11 story easy, on-line detection hard
Summarization
What is a Summary?
Informative summary
Purpose: replace original document
Example: executive summary
Indicative summary
Purpose: support decision: do I want to read
original document yes/no?
Example: Headline, scientific abstract
Why Automatic Summarization?
The algorithm for reading in many genres is:
1) read summary
2) decide whether relevant or not
3) if relevant: read whole document
Summary is gate-keeper for large number of
documents.
Information overload
Often the summary is all that is read.
Example from last quarter: summaries of
search engine hits
Human-generated summaries are expensive.
Summary Length (Reuters)
[Chart; Goldstein et al. 1999]
Summary Compression (Reuters)
[Chart; Goldstein et al. 1999]
Summarization Algorithms
Natural language understanding / generation
Build knowledge representation of text
Generate sentences summarizing content
Hard to do well
Keyword summaries
Display most significant keywords
Easy to do
Hard to read, poor representation of content
Sentence extraction
Extract key sentences
Medium hard
Summaries often don’t read well
Good representation of content
Sentence Extraction
Represent each sentence as a feature vector
Compute score based on features
Select n highest-ranking sentences
Present them in the order in which they occur in the text.
Postprocessing to make the summary more
readable/concise
Eliminate redundant sentences
Replace anaphors/pronouns
Delete subordinate clauses, parentheticals
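A minimal sketch of the extraction loop, with placeholder feature weights standing in for a trained model:

```python
def extract_summary(sentences, features, weights, n=3):
    """sentences: sentence strings in document order;
    features[i]: dict of feature values for sentence i;
    weights: per-feature weights (illustrative; learned in practice)."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for feats in features]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]   # keep original document order
```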
Oracle Context
Sentence Extraction: Example
SIGIR ’95 paper on summarization by Kupiec, Pedersen, Chen
Trainable sentence extraction
Proposed algorithm is applied to its own description (the paper)
Sentence Extraction: Example
Feature Representation
Fixed-phrase feature
Certain phrases indicate summary, e.g. “in summary”
Paragraph feature
Paragraph initial/final more likely to be important.
Thematic word feature
Repetition is an indicator of importance
Uppercase word feature
Uppercase often indicates named entities. (Taylor)
Sentence length cut-off
Summary sentence should be > 5 words.
Feature Representation (cont.)
Sentence length cut-off
Summary sentences have a minimum length.
Fixed-phrase feature
True for sentences with an indicator phrase (“in summary”, “in conclusion”, etc.)
Paragraph feature
Paragraph initial/medial/final
Thematic word feature
Do any of the most frequent content words occur?
Uppercase word feature
Is an uppercase thematic word introduced?
Training
Hand-label sentences in training set
(good/bad summary sentences)
Train classifier to distinguish good/bad
summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and
show top n to user.
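A sketch of this kind of Naive Bayes sentence ranking, using scikit-learn for illustration (the binary feature matrices are assumed to encode the features above; this is not the Kupiec et al. code):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def rank_sentences(train_X, train_y, doc_X, sentences, n=3):
    """train_X, train_y: binary feature matrix and 0/1 good-summary labels
    for hand-labeled training sentences; doc_X: features of a new document's
    sentences. Returns the n sentences with highest P(good | features)."""
    clf = BernoulliNB().fit(train_X, train_y)
    scores = clf.predict_proba(doc_X)[:, 1]     # column 1 = class "1" (good)
    top = np.argsort(-scores)[:n]
    return [sentences[i] for i in sorted(top)]  # present in document order
```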
Evaluation
Compare extracted sentences with sentences
in abstracts
Evaluation
Baseline (choose first n sentences): 24%
Overall performance (42-44%) is not very good.
However, there is more than one good
summary.
Multi-Document (MD)
Summarization
Summarize more than one document
Why is this harder?
But benefit is large (can’t scan 100s of docs)
To do well, need to adopt a more specific
strategy depending on the document set.
Other components needed for a production
system, e.g., manual postediting.
DUC: government-sponsored bake-off
200- or 400-word summaries
Longer -> easier
Types of MD Summaries
Single event/person tracked over a long time
period
Elizabeth Taylor’s bout with pneumonia
Give extra weight to character/event
May need to include outcome (dates!)
Multiple events of a similar nature
Marathon runners and races
More broad brush, ignore dates
An issue with related events
Gun control
Identify key concepts and select sentences
accordingly
Determine MD Summary Type
First, determine which type of summary to
generate
Compute all pairwise similarities
Very dissimilar articles -> multi-event
(marathon)
Mostly similar articles
Is most frequent concept named entity?
Yes -> single event/person (Taylor)
No -> issue with related events (gun control)
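A sketch of that decision rule, reusing cosine() from earlier; the similarity cutoff and the named-entity check are illustrative assumptions:

```python
import numpy as np

def md_summary_type(doc_vectors, top_concept_is_entity, sim_cutoff=0.3):
    """doc_vectors: TF-IDF vectors of the document set;
    top_concept_is_entity: is the most frequent concept a named entity?"""
    sims = [cosine(a, b)
            for i, a in enumerate(doc_vectors)
            for b in doc_vectors[i + 1:]]
    avg = np.mean(sims) if sims else 1.0
    if avg < sim_cutoff:
        return "multi-event"                    # e.g. marathon races
    return ("single event/person"               # e.g. Elizabeth Taylor
            if top_concept_is_entity
            else "issue with related events")   # e.g. gun control
```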
MultiGen Architecture
(Columbia)
Intersection
Find concepts that occur repeatedly in a time
chunk
Generation
Ordering according to date
Sentence generator
Processing
Selection of good summary sentences
Elimination of redundant sentences
Replace anaphors/pronouns with noun
phrases they refer to
Need coreference resolution
Delete non-central parts of sentences
Performance (Columbia System)
(1) Precision and recall on “model units”
(facts)
(2) Coherence, grammaticality, readability
Newsblaster (Columbia)
Query-Specific Summarization
So far, we’ve looked at generic summaries.
A generic summary makes no assumption
about the reader’s interests.
Query-specific summaries are specialized for
a single information need, the query.
Summarization is much easier if we have a
description of what the user wants.
Recall from last quarter:
Google-type excerpts – simply show keywords
in context
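A toy sketch of keyword-in-context excerpting of this kind (the tokenization and window size are illustrative):

```python
import re

def kwic_excerpts(text, query_terms, window=5):
    """Return a short '... context keyword context ...' snippet per query hit."""
    tokens = re.findall(r"\w+", text)
    terms = {t.lower() for t in query_terms}
    snippets = []
    for i, tok in enumerate(tokens):
        if tok.lower() in terms:
            lo, hi = max(0, i - window), i + window + 1
            snippets.append("... " + " ".join(tokens[lo:hi]) + " ...")
    return snippets
```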
Genre
Some genres are easy to summarize
Newswire stories
Inverted pyramid structure
The first n sentences are often the best
summary of length n
Some genres are hard to summarize
Poems
Long documents (novels, the Bible)
Trainable summarizers are genre-specific.
Non-Text Summaries
Summarization also important for non-text:
Speech (phone conversations, radio)
Video (surveillance, TV)
Similar techniques are used.
Text is easier to scan than speech/video.
Discussion
Correct parsing of document format is
critical.
Need to know headings, sequence, etc.
Limits of current technology
Some good summaries require natural
language understanding
Example: President Bush’s nominees for
ambassadorships
Contributors to Bush’s campaign
Veteran diplomats
Others
Coreference Resolution
Coreference
Two noun phrases referring to the same
entity are said to corefer.
Example: Transcription from RL95-2 is
mediated through an ERE element at the 5' flanking region of the gene.
Coreference resolution is important for
many text mining tasks:
Information extraction
Summarization
First story detection
Types of Coreference
Noun phrases: Transcription from RL95-2 …
the gene …
Pronouns: They induced apoptosis.
Possessives: … induces their rapid
dissociation …
Demonstratives: This gene is responsible for
Alzheimer’s
Preferences in Pronoun
Interpretation
Recency: John has an Integra. Bill has a
Legend. Mary likes to drive it.
Grammatical role: John went to the Acura
dealership with Bill. He bought an Integra.
Non-ambiguity: John and Bill went to the
Acura dealership. He bought an Integra.
Repeated mention: John needed a car to go
to his new job. He decided that he wanted
something sporty. Bill went to the Acura
dealership with him. He bought an Integra.
Copyright: D. Radev
Preferences in Pronoun
Interpretation
Parallelism: Mary went with Sue to the Acura
dealership. Sally went with her to the Mazda
dealership.
??? Mary went with Sue to the Acura
dealership. Sally told her not to buy
anything.
Verb semantics: John telephoned Bill. He lost
his pamphlet on Acuras. John criticized Bill.
He lost his pamphlet on Acuras.
Copyright: D. Radev
Algorithm for Coreference
Resolution
Two steps: discourse model update and
pronoun resolution.
Salience values are introduced when a noun
phrase that evokes a new entity is
encountered.
Salience factors: set empirically.
Copyright: D. Radev
Salience Weights (Lappin & Leass)
Sentence recency: 100
Subject emphasis: 80
Existential emphasis: 70
Accusative emphasis: 50
Indirect object and oblique complement emphasis: 40
Non-adverbial emphasis: 50
Head noun emphasis: 80
Copyright: D. Radev
Lappin&Leass (cont’d)
Recency: weights are cut in half after each sentence
is processed.
Examples:
An Acura Integra is parked in the lot.
There is an Acura Integra parked in the lot.
John parked an Acura Integra in the lot.
John gave Susan an Acura Integra.
In his Acura Integra, John showed Susan his new CD
player.
Copyright: D. Radev
Algorithm (Lappin & Leass)
1. Collect the potential referents (up to four sentences back).
2. Remove potential referents that do not agree in number or gender with the pronoun.
3. Remove potential referents that do not pass intrasentential syntactic coreference constraints.
4. Compute the total salience value of the referent by adding any applicable values for role parallelism (+35) or cataphora (-175).
5. Select the referent with the highest salience value. In case of a tie, select the closest referent.
Copyright: D. Radev
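A simplified sketch of the salience bookkeeping (factor detection is left abstract; the weights follow the table above, and the halving step mirrors the recency rule):

```python
WEIGHTS = {"recency": 100, "subject": 80, "existential": 70, "accusative": 50,
           "indirect_obj": 40, "non_adverbial": 50, "head_noun": 80}

def halve_salience(model):
    # After each sentence is processed, cut every referent's salience in half.
    for referent in model:
        model[referent] /= 2.0

def add_mention(model, referent, factors):
    # factors: the salience factors this mention satisfies, e.g.
    # ["recency", "subject", "non_adverbial", "head_noun"] for "John".
    model[referent] = model.get(referent, 0.0) + sum(WEIGHTS[f] for f in factors)

def resolve_pronoun(model, candidates):
    # Among number/gender-compatible candidates, pick the most salient;
    # tie-breaking by closeness is omitted in this sketch.
    return max(candidates, key=lambda r: model.get(r, 0.0))
```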
Example
John saw a beautiful Acura Integra at the dealership
last week. He showed it to Bob. He bought it.
Referent     Rec   Subj  Exist  Obj  Ind Obj  Non-Adv  Head N  Total
John         100   80    -      -    -        50       80      310
Integra      100   -     -      50   -        50       80      280
dealership   100   -     -      -    -        50       80      230
Copyright: D. Radev
Example (cont’d)
Referent     Phrases                          Value
John         {John}                           165
Integra      {a beautiful Acura Integra}      140
dealership   {the dealership}                 115
Copyright: D. Radev
Example (cont’d)
Referent     Phrases                          Value
John         {John, he1}                      475
Integra      {a beautiful Acura Integra}      140
dealership   {the dealership}                 115
Copyright: D. Radev
Example (cont’d)
Referent     Phrases                              Value
John         {John, he1}                          475
Integra      {a beautiful Acura Integra, it}      400
dealership   {the dealership}                     115
Copyright: D. Radev
Example (cont’d)
Referent     Phrases                              Value
John         {John, he1}                          475
Integra      {a beautiful Acura Integra, it}      400
Bill         {Bill}                               270
dealership   {the dealership}                     115
Copyright: D. Radev
Example (cont’d)
Referent     Phrases                              Value
John         {John, he1}                          237.5
Integra      {a beautiful Acura Integra, it1}     200
Bill         {Bill}                               135
dealership   {the dealership}                     57.5
Copyright: D. Radev
Observations
Lappin & Leass: tested on computer
manuals; 86% accuracy on unseen data.
Centering (Grosz, Joshi, Weinstein):
additional concept of a “center”.
Centering has not been automatically tested
on actual data.
Copyright: D. Radev
MUC Information Extraction:
State of the Art c. 1997
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Resources
UMass at TDT 2000. Allan, Lavrenko, Frey, Khandelwal (UMass, 2000).
Learning Approaches for Detecting and Tracking News Events. Yang, Carbonell, Brown (CMU, 1999).
A Study on Retrospective and On-line Event Detection. Yang, Pierce, Carbonell.
http://www.cs.columbia.edu/nlp/newsblaster/
A Trainable Document Summarizer (1995). Julian Kupiec, Jan Pedersen, Francine Chen. Research and Development in Information Retrieval.
The Columbia Multi-Document Summarizer for DUC 2002. K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, S. Sigelman, Columbia University.
Coreference: detailed discussion of the term:
http://www.ldc.upenn.edu/Projects/ACE/PHASE2/Annotation/guidelines/EDT/coreference.shtml