CS276B
Text Information Retrieval, Mining, and
Exploitation
Lecture 13
Text Mining II
Feb 27, 2003
(includes slides borrowed from J. Allan, G. Doddington, G. Neumann,
M. Venkataramani, and D. Radev)
Today’s Topics



First story detection (FSD)
Summarization
Coreference resolution
First Story Detection
First Story Detection


Automatically identify the first story on a
new event from a stream of text
Topic Detection and Tracking – TDT


“Bake-off” sponsored by US government
agencies
Applications


Intelligence services
Finance: Be the first to trade a stock
Examples


2002 Presidential Elections
Thai Airbus Crash (11.12.98)


On topic: stories reporting details of the crash, injuries and deaths; reports on the
investigation following the crash; policy changes due to the crash (new runway lights
were installed at airports).
Euro Introduced (1.1.1999)

On topic: stories about the preparation for the common currency (negotiations about
exchange rates and financial standards to be shared among the member nations);
official introduction of the Euro; economic details of the shared currency; reactions
within the EU and around the world.
First Story Detection

Other technologies don’t work for this



Information retrieval
Text classification
Why?
The First-Story Detection Task
To detect the first story that discusses a topic,
for all topics.
[Diagram: a stream of stories over time; the first story for each topic (Topic 1, Topic 2) is marked as a first story, later stories on the same topics are not first stories]

There is no supervised topic training (as in topic detection)
Definitions




Event: A reported occurrence at a specific
time and place, and the unavoidable
consequences. Specific elections, accidents,
crimes, natural disasters.
Activity: A connected set of actions that have
a common focus or purpose - campaigns,
investigations, disaster relief efforts.
Topic: a seminal event or activity, along with
all directly related events and activities
Story: a topically cohesive segment of news
that includes two or more DECLARATIVE
independent clauses about a single topic.
TDT Tasks

First story detection (FSD)


Detect the first story on a new topic
Topic tracking



Once a topic has been detected, identify
subsequent stories about it
Standard text classification task
However, very small training set (initially: 1!)
First Story Detection (FSD)


First story detection is an unsupervised learning task.
On-line vs. Retrospective





On-line: Flag onset of new events from live news feeds as
stories come in
Retrospective: Detection consists of identifying first story
looking back over longer period
Lack of advance knowledge of new events, but have
access to unlabeled historical data as a contrast set
FSD input: stream of stories in chronological order
simulating real-time incoming document stream
FSD output: YES/NO decision per document
Patterns in Event Distributions


News stories discussing the same event tend to be
temporally proximate
A time gap between bursts of topically similar stories is often an indication of different events




Different earthquakes
Airplane accidents
A significant vocabulary shift and rapid changes in
term frequency are typical of stories reporting a new
event, including previously unseen proper nouns
Events are typically reported in a relatively brief time window of 1-4 weeks
Similar Events over Time
TDT: The Corpus





TDT evaluation corpora consist of text and transcribed news from the 1990s.
A set of target events (e.g., 119 in TDT2) is used for
evaluation
Corpus is tagged for these events (including first
story)
TDT2 consists of 60,000 news stories (Jan-June 1998); about 3,000 are “on topic” for one of the 119 topics
Stories are arranged in chronological order
Ideas?
Approach 1: KNN


On-line processing of each incoming story
Compute similarity to all previous stories







Cosine similarity
Language model
Prominent terms
Extracted entities
If similarity is below threshold: new story
If similarity is above threshold for previous
document d: assign to topic of d
Optimal threshold can be chosen based on historical
data

Threshold is not topic specific!
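
A minimal sketch of this thresholded nearest-neighbor approach is given below. It uses scikit-learn's TfidfVectorizer and cosine similarity; the toy stream, the up-front fitted vocabulary, and the threshold value are illustrative assumptions, not the settings of any TDT system (a true on-line system would also update term statistics incrementally).

```python
# Sketch of threshold-based first story detection (illustrative only).
# Assumptions: a toy three-story stream, TF-IDF fitted on the whole stream
# up front (a simplification), and a hand-picked threshold of 0.2.
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stream = [
    "Strong earthquake shakes the capital overnight",
    "Rescue teams search rubble after the overnight earthquake",
    "European Union officially introduces the euro currency",
]

vectorizer = TfidfVectorizer(stop_words="english").fit(stream)
threshold = 0.2        # would be tuned on historical data

seen = None            # matrix of vectors of all previously seen stories
for story in stream:
    v = vectorizer.transform([story])
    if seen is None or cosine_similarity(v, seen).max() < threshold:
        print("FIRST STORY:", story)              # similarity below threshold
    else:
        print("same topic as an earlier story:", story)
    seen = v if seen is None else vstack([seen, v])
```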
Variant: Single Pass Clustering




Assign each incoming document to one of a
set of topic clusters
A topic cluster is represented by its centroid
(vector average of members)
For incoming story compute similarity s with
centroid
As before:


s>θ: add document to corresponding cluster
s<θ: first story!
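
A corresponding sketch of the single-pass clustering variant, again with made-up data and an assumed threshold θ = 0.2; centroids are kept as dense vector averages of cluster members.

```python
# Single-pass clustering sketch: compare each incoming story to cluster
# centroids; below the threshold it starts a new cluster (= first story).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stream = [
    "Strong earthquake shakes the capital overnight",
    "Rescue teams search rubble after the overnight earthquake",
    "European Union officially introduces the euro currency",
]
vectorizer = TfidfVectorizer(stop_words="english").fit(stream)
theta = 0.2

centroids, counts = [], []               # one centroid and member count per topic
for story in stream:
    v = vectorizer.transform([story]).toarray()[0]
    sims = cosine_similarity([v], centroids)[0] if centroids else np.zeros(0)
    if sims.size == 0 or sims.max() < theta:
        print("first story ->", story)
        centroids.append(v)
        counts.append(1)
    else:
        best = int(sims.argmax())        # add the story to the closest cluster
        n = counts[best]
        centroids[best] = (centroids[best] * n + v) / (n + 1)   # running average
        counts[best] += 1
```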
Approach 2: KNN + Time




Only consider documents in a (short) time
window
Compute similarity in a time-weighted fashion, where m is the number of documents in the window and d_i is the i-th document in the window
Time weighting significantly increases
performance.
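
The slide's time-weighting formula is not reproduced above, so the sketch below uses one common choice: a linear decay that multiplies sim(q, d_i) by i/m, so the oldest document in the window contributes least. Both the decay and the threshold are assumptions made for illustration.

```python
# Time-weighted FSD over a sliding window (illustrative sketch).
# Assumption: linear decay i/m over the m documents in the window,
# with d_m the most recent document.
from collections import deque

def jaccard(a, b):
    """Toy word-overlap similarity standing in for cosine/LM similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def time_weighted_score(query, window):
    m = len(window)
    return max((jaccard(query, d) * (i + 1) / m for i, d in enumerate(window)),
               default=0.0)

window = deque(maxlen=50)        # only the most recent 50 stories are compared
for story in ["quake hits capital", "rescue teams after quake", "euro launch day"]:
    words = set(story.lower().split())
    if time_weighted_score(words, window) < 0.3:     # threshold: an assumption
        print("first story:", story)
    window.append(words)
```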
FSD - Results
UMass, CMU: Single-Pass Clustering
FSD Error vs. Classification Error
Discussion




Hard problem
Becomes harder the more topics need to be
tracked. Why?
Second Story Detection is much easier than
First Story Detection
Example: retrospective detection of first
9/11 story easy, on-line detection hard
Summarization
What is a Summary?

Informative summary



Purpose: replace original document
Example: executive summary
Indicative summary


Purpose: support decision: do I want to read
original document yes/no?
Example: Headline, scientific abstract
Why Automatic Summarization?

The algorithm for reading many genres is:
1) read summary
2) decide whether relevant or not
3) if relevant: read whole document
=> Summary is the gate-keeper for a large number of documents.
Information overload



Often the summary is all that is read.
Example from last quarter: summaries of
search engine hits
Human-generated summaries are expensive.
Summary Length (Reuters) - Goldstein et al. 1999
Summary Compression (Reuters) - Goldstein et al. 1999
Summarization Algorithms

Natural language understanding / generation
  Build knowledge representation of text
  Generate sentences summarizing content
  Hard to do well

Keyword summaries
  Display most significant keywords
  Easy to do
  Hard to read, poor representation of content

Sentence extraction
  Extract key sentences
  Medium hard
  Summaries often don’t read well
  Good representation of content
Sentence Extraction





Represent each sentence as a feature vector
Compute score based on features
Select n highest-ranking sentences
Present in order in which they occur in text.
Postprocessing to make summary more
readable/concise



Eliminate redundant sentences
Anaphors/pronouns
Delete subordinate clauses, parentheticals
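
A rough sketch of this extract-and-order pipeline, with deliberately simple stand-in features (a word-frequency score, a lead-sentence bonus, a length cut-off); the feature set, weights, and example text are illustrative assumptions rather than any published scorer.

```python
# Sentence extraction sketch: score sentences on simple features, take the
# n best, and output them in their original document order.
import re
from collections import Counter

def summarize(text, n=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))     # document word frequencies

    def score(i, sent):
        tokens = re.findall(r"\w+", sent.lower())
        thematic = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        lead_bonus = 1.0 if i == 0 else 0.0              # first sentence often matters
        long_enough = 1.0 if len(tokens) > 5 else 0.0    # sentence length cut-off
        return thematic + lead_bonus + long_enough

    best = sorted(range(len(sentences)),
                  key=lambda i: score(i, sentences[i]), reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(best))  # keep document order

print(summarize("The quake struck the capital at dawn and damaged many buildings. "
                "Hundreds of people were hurt. Officials promised quick aid. "
                "Several roads remained closed."))
```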

Oracle Context
Sentence Extraction: Example

SIGIR '95 paper on summarization by Kupiec, Pedersen, Chen
Trainable sentence extraction
Proposed algorithm is applied to its own description (the paper)
Sentence Extraction: Example
Feature Representation

Fixed-phrase feature
  Certain phrases indicate summary, e.g. “in summary”
Paragraph feature
  Paragraph initial/final more likely to be important.
Thematic word feature
  Repetition is an indicator of importance
Uppercase word feature
  Uppercase often indicates named entities. (Taylor)
Sentence length cut-off
  Summary sentence should be > 5 words.
Feature Representation (cont.)

Sentence length cut-off
  Summary sentences have a minimum length.
Fixed-phrase feature
  True for sentences with indicator phrase (“in summary”, “in conclusion”, etc.)
Paragraph feature
  Paragraph initial/medial/final
Thematic word feature
  Do any of the most frequent content words occur?
Uppercase word feature
  Is an uppercase thematic word introduced?
Training




Hand-label sentences in training set
(good/bad summary sentences)
Train classifier to distinguish good/bad
summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and
show top n to user.
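
A minimal sketch of this training step, assuming each sentence has already been turned into a binary feature vector (fixed phrase, paragraph position, thematic word, uppercase word, length cut-off) and labeled by whether it matches an abstract sentence; the tiny feature matrix and labels below are invented for illustration, and scikit-learn's BernoulliNB stands in for the paper's Naive Bayes model.

```python
# Naive Bayes sentence classifier sketch (features and labels are made up).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: fixed-phrase, paragraph, thematic-word, uppercase-word, length-cutoff
X_train = np.array([[1, 1, 1, 0, 1],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 0, 0],
                    [1, 0, 1, 1, 1]])
y_train = np.array([1, 0, 0, 1])        # 1 = good summary sentence

clf = BernoulliNB().fit(X_train, y_train)

X_new = np.array([[1, 1, 0, 0, 1],
                  [0, 0, 0, 1, 0]])
scores = clf.predict_proba(X_new)[:, 1]     # P(sentence belongs in summary)
ranking = scores.argsort()[::-1]            # rank sentences, show top n to user
print(scores, ranking)
```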
Evaluation

Compare extracted sentences with sentences
in abstracts
Evaluation



Baseline (choose first n sentences): 24%
Overall performance (42-44%) is not very good.
However, there is more than one good summary.
Multi-Document (MD)
Summarization






Summarize more than one document
Why is this harder?
But benefit is large (can’t scan 100s of docs)
To do well, need to adopt more specific
strategy depending on document set.
Other components needed for a production
system, e.g., manual postediting.
DUC: government-sponsored bake-off
  200- or 400-word summaries
  Longer -> easier
Types of MD Summaries

Single event/person tracked over a long time period
  Elizabeth Taylor’s bout with pneumonia
  Give extra weight to character/event
  May need to include outcome (dates!)
Multiple events of a similar nature
  Marathon runners and races
  More broad brush, ignore dates
An issue with related events
  Gun control
  Identify key concepts and select sentences accordingly
Determine MD Summary Type




First, determine which type of summary to
generate
Compute all pairwise similarities
Very dissimilar articles -> multi-event
(marathon)
Mostly similar articles



Is the most frequent concept a named entity?
Yes -> single event/person (Taylor)
No -> issue with related events (gun control)
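
A possible sketch of this decision procedure; the similarity threshold, the named-entity flag passed in by the caller, and the toy articles are all assumptions made for illustration.

```python
# Decide the multi-document summary type from average pairwise similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def md_summary_type(docs, top_concept_is_named_entity, sim_threshold=0.3):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(tfidf)
    n = len(docs)
    mean_pairwise = (sims.sum() - n) / (n * (n - 1))   # off-diagonal average
    if mean_pairwise < sim_threshold:
        return "multiple events of a similar nature"   # marathon-style
    if top_concept_is_named_entity:
        return "single event/person"                   # Taylor-style
    return "issue with related events"                 # gun-control-style

docs = ["Taylor hospitalized with pneumonia",
        "Taylor recovering in hospital, doctors say",
        "Actress Taylor released after pneumonia scare"]
print(md_summary_type(docs, top_concept_is_named_entity=True))
```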
MultiGen Architecture
(Columbia)
Generation


Ordering according to date
Intersection


Find concepts that occur repeatedly in a time
chunk
Sentence generator
Processing



Selection of good summary sentences
Elimination of redundant sentences
Replace anaphors/pronouns with noun
phrases they refer to


Need coreference resolution
Delete non-central parts of sentences
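
A small sketch of the redundancy-elimination step using a plain string-overlap measure from the standard library; the 0.8 cut-off and the example sentences are assumptions, and a real system would compare content vectors rather than raw strings.

```python
# Drop sentences that are near-duplicates of sentences already selected.
from difflib import SequenceMatcher

def drop_redundant(sentences, cutoff=0.8):
    kept = []
    for s in sentences:
        if all(SequenceMatcher(None, s.lower(), k.lower()).ratio() < cutoff
               for k in kept):
            kept.append(s)          # keep only sufficiently novel sentences
    return kept

print(drop_redundant([
    "The euro was introduced on January 1.",
    "The euro was introduced on Jan. 1.",
    "Markets reacted calmly to the new currency.",
]))
```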
Performance (Columbia System)


(1) Precision and recall on “model units”
(facts)
(2) Coherence, grammaticality, readability
Newsblaster (Columbia)
Query-Specific Summarization





So far, we’ve looked at generic summaries.
A generic summary makes no assumption
about the reader’s interests.
Query-specific summaries are specialized for
a single information need, the query.
Summarization is much easier if we have a
description of what the user wants.
Recall from last quarter:

Google-type excerpts – simply show keywords
in context
Genre

Some genres are easy to summarize
  Newswire stories
  Inverted pyramid structure
  The first n sentences are often the best summary of length n
Some genres are hard to summarize
  Poems
  Long documents (novels, the Bible)
Trainable summarizers are genre-specific.
Non-Text Summaries

Summarization also important for non-text:




Speech (phone conversations, radio)
Video (surveillance, TV)
Similar techniques are used.
Text is easier to scan than speech/video.
Discussion

Correct parsing of document format is
critical.


Need to know headings, sequence, etc.
Limits of current technology


Some good summaries require natural
language understanding
Example: President Bush’s nominees for
ambassadorships



Contributors to Bush’s campaign
Veteran diplomats
Others
Coreference Resolution
Coreference



Two noun phrases referring to the same
entity are said to corefer.
Example: Transcription from RL95-2 is mediated through an ERE element at the 5' flanking region of the gene.
Coreference resolution is important for
many text mining tasks:



Information extraction
Summarization
First story detection
Types of Coreference




Noun phrases: Transcription from RL95-2 …
the gene …
Pronouns: They induced apoptosis.
Possessives: … induces their rapid
dissociation …
Demonstratives: This gene is responsible for
Alzheimer’s
Preferences in Pronoun
Interpretation




Recency: John has an Integra. Bill has a Legend. Mary likes to drive it.
Grammatical role: John went to the Acura
dealership with Bill. He bought an Integra.
Non-ambiguity: John and Bill went to the
Acura dealership. He bought an Integra.
Repeated mention: John needed a car to go
to his new job. He decided that he wanted
something sporty. Bill went to the Acura
dealership with him. He bought an Integra.
Copyright: D. Radev
Preferences in Pronoun
Interpretation



Parallelism: Mary went with Sue to the Acura
dealership. Sally went with her to the Mazda
dealership.
??? Mary went with Sue to the Acura
dealership. Sally told her not to buy
anything.
Verb semantics: John telephoned Bill. He lost
his pamphlet on Acuras. John criticized Bill.
He lost his pamphlet on Acuras.
Copyright: D. Radev
Algorithm for Coreference
Resolution



Two steps: discourse model update and
pronoun resolution.
Salience values are introduced when a noun
phrase that evokes a new entity is
encountered.
Salience factors: set empirically.
Copyright: D. Radev
Salience Weights (Lappin & Leass)

Sentence recency: 100
Subject emphasis: 80
Existential emphasis: 70
Accusative emphasis: 50
Indirect object and oblique complement emphasis: 40
Non-adverbial emphasis: 50
Head noun emphasis: 80

Copyright: D. Radev
Lappin&Leass (cont’d)


Recency: weights are cut in half after each sentence
is processed.
Examples:





An Acura Integra is parked in the lot.
There is an Acura Integra parked in the lot.
John parked an Acura Integra in the lot.
John gave Susan an Acura Integra.
In his Acura Integra, John showed Susan his new CD
player.
Copyright: D. Radev
Algorithm (Lappin & Leass)

1. Collect the potential referents (up to four sentences back).
2. Remove potential referents that do not agree in number or gender with the pronoun.
3. Remove potential referents that do not pass intrasentential syntactic coreference constraints.
4. Compute the total salience value of the referent by adding any applicable values for role parallelism (+35) or cataphora (-175).
5. Select the referent with the highest salience value. In case of a tie, select the closest referent.

Copyright: D. Radev
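
A toy sketch of the salience bookkeeping behind steps 4-5, using the weight table above and the halving rule from the previous slide; the per-mention factor assignments are supplied by hand here (a real implementation would derive them from a syntactic parse).

```python
# Lappin & Leass-style salience bookkeeping (toy sketch, hand-annotated mentions).
WEIGHTS = {"recency": 100, "subject": 80, "existential": 70, "accusative": 50,
           "indirect_object": 40, "non_adverbial": 50, "head_noun": 80}

def salience(factors):
    return sum(WEIGHTS[f] for f in factors)

# Sentence 1: "John saw a beautiful Acura Integra at the dealership last week."
entities = {
    "John":       salience({"recency", "subject", "non_adverbial", "head_noun"}),
    "Integra":    salience({"recency", "accusative", "non_adverbial", "head_noun"}),
    "dealership": salience({"recency", "non_adverbial", "head_noun"}),
}
print(entities)     # {'John': 310, 'Integra': 280, 'dealership': 230}

# Before processing the next sentence, cut every salience value in half
# (recency decay); new mentions then add their own factor weights on top.
entities = {e: v / 2 for e, v in entities.items()}
```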
Example

John saw a beautiful Acura Integra at the dealership last week. He showed it to Bill. He bought it.

Referent     Rec   Subj  Exist  Obj  Ind Obj  Non-Adv  Head N  Total
John         100    80     -     -     -        50       80     310
Integra      100     -     -    50     -        50       80     280
dealership   100     -     -     -     -        50       80     230

Copyright: D. Radev
Example (cont’d)

Referent     Phrases                           Value
John         {John}                            165
Integra      {a beautiful Acura Integra}       140
dealership   {the dealership}                  115

Copyright: D. Radev
Example (cont’d)

Referent     Phrases                           Value
John         {John, he1}                       475
Integra      {a beautiful Acura Integra}       140
dealership   {the dealership}                  115

Copyright: D. Radev
Example (cont’d)

Referent     Phrases                           Value
John         {John, he1}                       475
Integra      {a beautiful Acura Integra, it}   400
dealership   {the dealership}                  115

Copyright: D. Radev
Example (cont’d)

Referent     Phrases                           Value
John         {John, he1}                       475
Integra      {a beautiful Acura Integra, it}   400
Bill         {Bill}                            270
dealership   {the dealership}                  115

Copyright: D. Radev
Example (cont’d)

Referent     Phrases                           Value
John         {John, he1}                       237.5
Integra      {a beautiful Acura Integra, it1}  200
Bill         {Bill}                            135
dealership   {the dealership}                  57.5

Copyright: D. Radev
Observations



Lappin & Leass: tested on computer manuals; 86% accuracy on unseen data.
Centering (Grosz, Joshi, Weinstein): additional concept of a “center”.
Centering has not been automatically tested on actual data.
Copyright: D. Radev
MUC Information Extraction:
State of the Art c. 1997
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Resources

UMass at TDT 2000. Allan, Lavrenko, Frey, Khandelwal (UMass, 2000)
Learning Approaches for Detecting and Tracking News Events. Yang, Carbonell, Brown (CMU, 1999)
A Study on Retrospective and On-line Event Detection. Yang, Pierce, Carbonell.
http://www.cs.columbia.edu/nlp/newsblaster/
A Trainable Document Summarizer (1995). Julian Kupiec, Jan Pedersen, Francine Chen. Research and Development in Information Retrieval.
The Columbia Multi-Document Summarizer for DUC 2002. K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, S. Sigelman, Columbia University.
Coreference: detailed discussion of the term: http://www.ldc.upenn.edu/Projects/ACE/PHASE2/Annotation/guidelines/EDT/coreference.shtml