Text summarization
Tutorial
ACM SIGIR
New Orleans, Louisiana
September 9, 2001
Dragomir R. Radev
School of Information, Department of Electrical Engineering and
Computer Science, and Department of Linguistics
University of Michigan
http://www.si.umich.edu/~radev
Part I
Introduction
The BIG problem
• Information overload: 1.39 Billion
URLs catalogued by Google
• Possible approaches:
– information retrieval
– document clustering
– information extraction
– visualization
– question answering
– text summarization
Some concepts
• Abstracts: “a concise summary of
the central subject matter of a
document” [Paice90].
• Indicative, informative, and critical
summaries
• Extracts (representative sentences)
Informative summaries
Lines sometimes blurred
Net Tax Moratorium Clears House
The House passed a bill to extend the current
moratorium on new Internet taxes until 2006. The
moratorium forbids states from trying to find new
ways of taxing Internet use, like imposing taxes on
monthly access charges for Internet service
providers.
http://www.nytimes.com/library/tech/00/05/biztech/articles/11tax.html
House Votes to Ban Internet Taxes for 5 More Years
By LIZETTE ALVAREZ
WASHINGTON, May 10 -- In a Republican bid to woo the high-technology industry and please
taxpayers, the House today rushed to the floor and then handily passed a bill to extend the
current moratorium on new Internet taxes until 2006.
The moratorium, which is due to expire in October 2001, forbids states to try to find new ways
of taxing Internet use, like imposing taxes on monthly access charges for Internet service
providers.
The legislation passed today, which faces an uncertain future in the Senate, does not directly
address the question of sales taxes; it would not stop states from trying to collect taxes for
goods sold on the Internet.
By failing to address sales taxes, however, the measure alarmed some traditional retailers, as
well as state governments that say they have found it nearly impossible to collect taxes for
goods sold online.
"The single largest contributor to our economic prosperity has been the growth of information
technology -- the Internet," said
Representative John R. Kasich, an Ohio Republican. "Why would we try to tax something, why
would we try to abuse something, why would we try to limit something that generates
unprecedented growth, wealth, opportunity and unprecedented individual power?"
Critics of the bill say the moratorium, while seemingly benign, ignores the thorny question of
how state and local governments can best collect taxes on the billions of dollars of
merchandise sold over the Internet each year. These taxes are expected to provide a crucial
future source of revenue for states, especially as more consumers buy goods online.
The bill's opponents -- a consortium of retailers, small-business groups and governors -- say
that consumers who buy merchandise over the Internet can easily circumvent the sales and "use"
taxes that would be collected automatically if the same merchandise is bought at a bricks-and-mortar retail store.
The National Governors' Association is working on the best way to collect electronic sales tax.
Estimates have put the loss in sales tax revenue to the states at $8 billion a year by 2004.
Retailers and small businesses have complained that the current system unfairly places at a
disadvantage the traditional retailers that do not sell their wares online and must charge
sales tax.
"It's easy to imagine how these kinds of losses can affect state and local governments' ability
to provide essential services," said Representative William D. Delahunt, a Massachusetts
Democrat, citing the concerns of many governors. "They will be compelled to cut back local
services or raise income taxes or property taxes."
The bill even drew criticism from a few Republicans. Representative Ernest J. Istook Jr. of
Oklahoma circulated a letter stating, "The Internet should not be singled out to be taxed, nor
to be freed from tax."
Still, the House voted overwhelmingly, 352 to 75, to pass the bill. A number of Democrats
approved the measure after they received assurance that Congress would hold hearings concerning
sales taxes and would try to come up with a solution.
The moratorium "has absolutely nothing to do with the sales tax -- we will have the opportunity
to have that debate," said Representative Robert Goodlatte, a Virginia Republican.
The House bill faces a murkier future in the Senate. Senator John McCain, chairman of the
Commerce Committee, who advocates a permanent tax moratorium, canceled a hearing on the bill
last month after Republican senators, some of them former governors, expressed reservations
about extending the moratorium.
The legislation also faces opposition from the Clinton administration, which signaled support
today for a two-year moratorium. The full House today rejected a two-year extension in a
separate vote.
Gov. George W. Bush, the likely Republican presidential nominee, has said he will support an
extension of the moratorium. But the governor must tread carefully around the issue because
Texas, which does not have a state income tax, would stand to lose substantial revenue if
sales taxes are not made workable on the Internet.
A spokesman for Al Gore said the vice president supported a two-year extension of the
moratorium "at a minimum." If a five-year moratorium is put into place, "it should include
flexibility" to adjust federal policies on Internet taxation "to take into account the fast-paced change in the Internet world."
Types of summaries
• dimensions
• genres
• context
Dimensions
• Single-document vs. multi-document
Genres
• headlines
• outlines
• minutes
• biographies
• abridgments
• sound bites
• movie summaries
• chronologies, etc.
[Mani and Maybury 1999]
Context
• Query-specific
• Query-independent
What does summarization
involve?
• Three stages (typically)
– content identification
– conceptual organization
– realization
Spärck Jones’s three sets of
factors
• Input factors (source form, subject
type, unit)
• Purpose factors (situation, audience,
use)
• Output factors (material, format,
style)
[Spärck Jones 99]
ProSum
http://transend.labs.bt.com/prosum/word/index.html
• Profile-based summarization
• Control of summarization length
• Retention of user-defined text
• Customizable heading treatment
• Customizable table treatment
• Customizable text differentiation
Example (New York Times)
Net Tax Moratorium Clears House
The House passed a bill to extend the current
moratorium on new Internet taxes until 2006.
The moratorium forbids states from trying to find
new ways of taxing Internet use, like imposing taxes
on monthly access charges for Internet service
providers.
Microsoft Autosummarize output
House Votes to Ban Internet Taxes for 5 More Years
The moratorium, which is due to expire in October 2001, forbids states to try to
find new ways of taxing Internet use, like imposing taxes on monthly access charges
for Internet service providers.
By failing to address sales taxes, however, the measure alarmed some traditional
retailers, as well as state governments that say they have found it nearly
impossible to collect taxes for goods sold online.
10% summary
Microsoft Autosummarize output
House Votes to Ban Internet Taxes for 5 More Years
The moratorium, which is due to expire in October 2001, forbids states to try to
find new ways of taxing Internet use, like imposing taxes on monthly access charges
for Internet service providers.
The legislation passed today, which faces an uncertain future in the Senate, does
not directly address the question of sales taxes; it would not stop states from
trying to collect taxes for goods sold on the Internet.
By failing to address sales taxes, however, the measure alarmed some traditional
retailers, as well as state governments that say they have found it nearly
impossible to collect taxes for goods sold online.
The National Governors' Association is working on the best way to collect
electronic sales tax. Representative Ernest J. Istook Jr. of Oklahoma circulated a
letter stating, "The Internet should not be singled out to be taxed, nor to be
freed from tax."
Senator John McCain, chairman of the Commerce Committee, who advocates a permanent
tax moratorium, canceled a hearing on the bill last month after Republican
senators, some of them former governors, expressed reservations about extending the
moratorium.
25% summary
Outline
I Introduction
II Traditional approaches
III Multi-document summarization
IV Knowledge-rich techniques
V Evaluation methods
VI The MEAD project
VII Language modeling
Part II
Traditional approaches
Human summarization and
abstracting
• What professional abstractors do
• Ashworth:
• “To take an original article, understand it
and pack it neatly into a nutshell without
loss of substance or clarity presents a
challenge which many have felt worth taking
up for the joys of achievement alone. These
are the characteristics of an art form”.
Borko and Bernier 75
• The abstract and its use:
– Abstracts promote current awareness
– Abstracts save reading time
– Abstracts facilitate selection
– Abstracts facilitate literature searches
– Abstracts improve indexing efficiency
– Abstracts aid in the preparation of
reviews
Cremmins 82, 96
• American National Standard for Writing
Abstracts:
– State the purpose, methods, results, and conclusions
presented in the original document, either in that order
or with an initial emphasis on results and conclusions.
– Make the abstract as informative as the nature of the
document will permit, so that readers may decide,
quickly and accurately, whether they need to read the
entire document.
– Avoid including background information or citing the
work of others in the abstract, unless the study is a
replication or evaluation of their work.
Cremmins 82, 96
– Do not include information in the abstract that is not
contained in the textual material being abstracted.
– Verify that all quantitative and qualitative information
used in the abstract agrees with the information
contained in the full text of the document.
– Use standard English and precise technical terms, and
follow conventional grammar and punctuation rules.
– Give expanded versions of lesser known abbreviations
and acronyms, and verbalize symbols that may be
unfamiliar to readers of the abstract.
– Omit needless words, phrases, and sentences.
Cremmins 82, 96
• Original version:
There were significant positive associations between the concentrations of the substance administered and mortality in rats and mice of both sexes.
There was no convincing evidence to indicate that endrin ingestion induced any of the different types of tumors which were found in the treated animals.
• Edited version:
Mortality in rats and mice of both sexes was dose related.
No treatment-related tumors were found in any of the animals.
Redundancy of English
• 75% redundancy of English
[Shannon 51]
• [Burton & Licklider 55] show that
humans are as good at guessing the
next letter after seeing 32 letters as
after 10,000 letters.
Morris et al. 92
• Reading comprehension of summaries
• Compare manual abstracts, Edmundson-style extracts, and full documents
• Extracts containing 20% or 30% of original
document are effective surrogates of
original document
• Performance on 20% and 30% extracts is
no different than informative abstracts
Extraction models
• Extracts vs. abstracts
• Linear model
• Text structure based
• New techniques
Information content
$$\text{Compression Ratio} = \frac{|S|}{|D|}$$
$$\text{Retention Ratio} = \frac{i(S)}{i(D)}$$
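The two ratios can be illustrated with a minimal Python sketch. It assumes that |S| and |D| are the lengths of the summary and the document in words, and it uses a hypothetical stand-in for the information measure i(·), the number of distinct words, because the slide does not say how information content is measured.

```python
def compression_ratio(summary_words, document_words):
    """|S| / |D|, with length measured in words (an assumption)."""
    return len(summary_words) / len(document_words)


def retention_ratio(summary_words, document_words, info):
    """i(S) / i(D) for a caller-supplied information measure `info`."""
    return info(summary_words) / info(document_words)


def distinct_word_count(words):
    """Hypothetical stand-in for i(.): the number of distinct lowercase words."""
    return len({w.lower() for w in words})


document = "the house passed a bill to extend the moratorium on new internet taxes".split()
summary = "house passed bill to extend moratorium".split()

cr = compression_ratio(summary, document)                      # 6 / 13
rr = retention_ratio(summary, document, distinct_word_count)   # 6 / 12
```

A good summary keeps the compression ratio low while keeping the retention ratio high; any other measure of i(·) can be passed in place of `distinct_word_count`.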
Text compaction techniques
Missam ad amicum pro consolatione epistolam,
dilectissime, vestram ad me forte quidam
nuper attulit.
Quam ex ipsa statim tituli fronte vestram esse
considerans, tanto ardentius eam cepi legere
quanto scriptorem ipsum karius amplector, ut
cuius rem perdidi verbis saltem tanquam eius
quadam imagine recreer.
Erant, memini, huius epistole fere omnia felle
et absintio plena, que scilicet nostre
conversionis miserabilem hystoriam et tuas,
unice, cruces assiduas referebant.
Complesti revera in epistola illa quod in
exordio eius amico promisisti, ut videlicet in
comparatione tuarum suas molestias nullas vel
parvas reputaret; ubi quidem expositis prius
magistrorum tuorum in te persequutionibus,
deinde in corpus tuum summe proditionis
iniuria, ad condiscipulorum quoque tuorum
Alberici videlicet Remensis et Lotulfi
Lumbardi execrabilem invidiam et
infestationem nimiam stilum contulisti.
Missam ad amicum pro consolatione epistolam,
dilectissime, vestram ad me forte quidam nuper
attulit.
Erant, memini, huius epistole fere omnia felle
et absintio plena, que scilicet nostre
conversionis miserabilem hystoriam et tuas,
unice, cruces assiduas referebant.
Text compaction techniques
Missam ad amicum pro consolatione epistolam,
dilectissime, vestram ad me forte quidam nuper
attulit.
Erant, memini, huius epistole fere omnia felle
et absintio plena, que scilicet nostre
conversionis miserabilem hystoriam et tuas,
unice, cruces assiduas referebant.
Missam vestram nuper attulit.
Erant, scilicet nostre conversionis miserabilem
hystoriam referebant.
Luhn 58
• Very first work in automated summarization
• Computes measures of significance
• Words:
– stemming
– bag of words
[Figure: Resolving power of significant words; axes: WORDS, FREQUENCY]
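Luhn 58
• Sentences:
– concentration of high-score words
• Cutoff values established in experiments with 100 human subjects
[Figure: a SENTENCE shown as a row of ALL WORDS with SIGNIFICANT WORDS marked (*); a bracketed span of 7 words contains 4 significant words]
$$\text{SCORE} = 4^2 / 7 \approx 2.3$$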
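The sentence score in the figure can be sketched in Python as follows. This is a minimal illustration, not Luhn's original procedure: the set of significant words is taken as given, and the `max_gap` parameter (how many non-significant words may separate two significant words inside one bracketed span) is an assumed value.

```python
def luhn_sentence_score(words, significant, max_gap=4):
    """Best span score: (significant words in span)^2 / span length in words."""
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Close enough to the previous significant word: extend the span.
            count += 1
        else:
            # Gap too large: score the finished span and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


# Same shape as the figure: a 7-word span holding 4 significant words.
words = ["x0", "s1", "a", "s2", "s3", "b", "c", "s4", "x1"]
significant = {"s1", "s2", "s3", "s4"}
score = luhn_sentence_score(words, significant)  # 4**2 / 7
```

Squaring the count rewards dense clusters: four significant words in a 7-word span score higher than the same four words spread over a longer span.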
Edmundson 69
• Cue method:
– stigma words (“hardly”, “impossible”)
– bonus words (“significant”)
• Key method:
– similar to Luhn
• Title method:
– title + headings
• Location method:
– sentences under headings
– sentences near beginning or end of document and/or paragraphs (also [Baxendale 58])
Edmundson 69
• Linear combination of four features:
$$\alpha_1 C + \alpha_2 K + \alpha_3 T + \alpha_4 L$$
• Manually labelled training corpus
• Key not important!
[Figure: bar chart comparing methods on a scale from 0 to 100%: C+T+L, C+K+T+L, LOCATION, CUE, TITLE, KEY, RANDOM]
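The linear combination of the Cue (C), Key (K), Title (T), and Location (L) features can be sketched in Python as below. The feature values and weights are hypothetical; in particular, the key weight is set to 0 only to mirror the "Key not important!" remark, not because the slide gives that value.

```python
def edmundson_score(features, weights):
    """Weighted sum a1*C + a2*K + a3*T + a4*L for one sentence."""
    return sum(weights[name] * features[name] for name in ("C", "K", "T", "L"))


def rank_sentences(feature_rows, weights):
    """Return sentence indices sorted from highest to lowest combined score."""
    scores = [edmundson_score(row, weights) for row in feature_rows]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical weights and per-sentence feature values.
weights = {"C": 1.0, "K": 0.0, "T": 1.0, "L": 1.0}
rows = [
    {"C": 0.0, "K": 3.0, "T": 0.0, "L": 0.0},   # only key words: score 0
    {"C": 1.0, "K": 0.0, "T": 2.0, "L": 1.0},   # bonus word, title words, good position: score 4
    {"C": -1.0, "K": 1.0, "T": 0.0, "L": 2.0},  # stigma word but good position: score 1
]
ranking = rank_sentences(rows, weights)
```

An extract is then formed by taking sentences from the top of `ranking` until the desired length is reached.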
Paice 90
• Survey up to 1990
• Techniques that (mostly) failed:
– syntactic criteria [Earl 70]
– indicator phrases (“The purpose of this article is to review…”)
• Problems with extracts:
– lack of balance
– lack of cohesion
• anaphoric reference
• lexical or definite reference
• rhetorical connectives
Paice 90
• Lack of balance
– later approaches based on text rhetorical structure
• Lack of cohesion
– recognition of anaphors [Liddy et al. 87]
• Example: “that” is
– nonanaphoric if preceded by a research-verb (e.g., “demonstrat-”),
– nonanaphoric if followed by a pronoun, article, quantifier,…,
– external if no later than 10th word,
else
– internal
Brandow et al. 95
• ANES: commercial news from 41 publications
• “Lead” achieves acceptability of 90% vs. 74.4% for “intelligent” summaries
• 20,997 documents
• words selected based on tf*idf
• sentence-based features:
– signature words
– location
– anaphora words
– length of abstract
Brandow et al. 95
• Sentences with no signature words are included if between two selected sentences
• Evaluation done at 60, 150, and 250 word length
• Non-task-driven evaluation: “Most summaries judged less-than-perfect would not be detectable as such to a user”
Lin & Hovy 97
• Optimum position policy
• Measuring yield of each sentence position against keywords (signature words) from Ziff-Davis corpus
• Preferred order
[(T) (P2,S1) (P3,S1) (P2,S2) {(P4,S1) (P5,S1) (P3,S2)} {(P1,S1) (P6,S1) (P7,S1) (P1,S3) (P2,S3) …]
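Applying a position policy of this kind to a new document can be sketched in Python as below. Only the first seven entries of the preferred order are used, and the braces (which group positions in the original) are flattened into a plain sequence; both choices are simplifying assumptions, and the function name is hypothetical.

```python
# Prefix of the preferred order from the slide. "T" is the title and
# (p, s) means sentence s of paragraph p, both counted from 1.
POLICY = ["T", (2, 1), (3, 1), (2, 2), (4, 1), (5, 1), (3, 2)]


def select_by_position(title, paragraphs, policy, max_units):
    """Pick up to max_units text units, following the policy order."""
    selected = []
    for entry in policy:
        if len(selected) == max_units:
            break
        if entry == "T":
            selected.append(title)
            continue
        p, s = entry
        # Skip positions that do not exist in this document.
        if p <= len(paragraphs) and s <= len(paragraphs[p - 1]):
            selected.append(paragraphs[p - 1][s - 1])
    return selected


paragraphs = [["p1s1", "p1s2"], ["p2s1", "p2s2"], ["p3s1"]]
picked = select_by_position("title", paragraphs, POLICY, max_units=3)
```

Positions missing from a short document are skipped, so the result may contain fewer than `max_units` items.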
Kupiec et al. 95
• Extracts of roughly 20% of original text
• Feature set:
– sentence length
• |S| > 5
– fixed phrases
• 26 manually chosen
– paragraph
• sentence position in paragraph
– thematic words
• binary: whether sentence is included in manual extract
– uppercase words
• not common acronyms
• Corpus:
• 188 document + summary pairs from scientific journals
Kupiec et al. 95
• Uses Bayesian classifier:
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$
• Assuming statistical independence:
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
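The independence form of the classifier can be sketched in Python as below. This is a minimal illustration under stated assumptions: features are discrete values, probabilities are estimated by plain counting with no smoothing, the training data must contain at least one in-summary sentence, and the names `train` and `score` are hypothetical.

```python
from collections import Counter


def train(feature_rows, labels):
    """Estimate P(s in S), P(F_j = v | s in S) and P(F_j = v) by counting."""
    n = len(labels)
    in_summary = [row for row, y in zip(feature_rows, labels) if y]
    k = len(feature_rows[0])
    prior = len(in_summary) / n
    cond = [Counter(row[j] for row in in_summary) for j in range(k)]
    marg = [Counter(row[j] for row in feature_rows) for j in range(k)]
    return prior, cond, marg, len(in_summary), n


def score(row, model):
    """P(s in S | F_1..F_k) under the independence assumption."""
    prior, cond, marg, n_pos, n = model
    numerator, denominator = prior, 1.0
    for j, value in enumerate(row):
        numerator *= cond[j][value] / n_pos   # P(F_j | s in S)
        denominator *= marg[j][value] / n     # P(F_j)
    return numerator / denominator if denominator else 0.0


# Toy training set: two binary features per sentence, label True = in the extract.
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = [True, True, False, False]
model = train(rows, labels)
```

Sentences are ranked by `score`, and the top-scoring ones are placed in the extract until the target length is reached.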
Kupiec et al. 95
• Performance:
– For 25% summaries, 84% precision
– For smaller summaries, 74%
improvement over Lead
Salton et al. 97
• document analysis based on semantic hyperlinks (among pairs of paragraphs related by a lexical similarity significantly higher than random)
• Bushy paths (or paths connecting highly connected paragraphs) are more likely to contain information central to the topic of the article
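The idea of a bushy path can be sketched in Python as below. The sketch makes several assumptions that the slide does not state: paragraphs are compared with cosine similarity over raw term counts, a fixed `threshold` stands in for "significantly higher than random", and the path is simply the `size` most connected paragraphs in text order.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bushy_path(paragraphs, threshold=0.2, size=2):
    """Indices of the `size` most connected paragraphs, in text order."""
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    n = len(vectors)
    # Bushiness of a paragraph = number of other paragraphs it is linked to.
    degree = [
        sum(1 for j in range(n) if j != i and cosine(vectors[i], vectors[j]) > threshold)
        for i in range(n)
    ]
    top = sorted(range(n), key=lambda i: (-degree[i], i))[:size]
    return sorted(top)


paragraphs = [
    "internet tax moratorium bill",
    "internet tax states revenue",
    "moratorium bill senate vote",
    "weather report sunny",
]
path = bushy_path(paragraphs)
```

Ties in bushiness are broken in favor of the earlier paragraph, which is another arbitrary choice made for this sketch.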
Salton et al. 97
Salton et al. 97
Overlap between manual extracts: 46%
Algorithm           Optimistic  Pessimistic  Intersection  Union
Global bushy        45.60%      30.74%       47.33%        55.16%
Global depth-first  43.98%      27.76%       42.33%        52.48%
Segmented bushy     45.48%      26.37%       38.17%        52.95%
Random              39.16%      22.07%       38.47%        44.24%
Marcu 97-99
• Based on RST (nucleus+satellite relations)
• text coherence
• 70% precision and recall in matching the most important units in a text
• Example: evidence
[The truth is that the pressure to smoke in junior high is greater than it will be any other time of one’s life:][we know that 3,000 teens start smoking each day.]
• N+S combination increases R’s belief in N [Mann and Thompson 88]
[Figure: rhetorical structure tree over the ten text units below; relations shown: Elaboration, Background, Justification, Example, Concession, Contrast, Evidence, Cause, Antithesis]
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) Most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.
Barzilay and Elhadad 97
• Lexical chains [Stairmand 96]
Mr. Kenny is the person that invented the anesthetic
machine which uses micro-computers to control
the rate at which an anesthetic is pumped into the
blood. Such machines are nothing new. But his
device uses two micro-computers to achieve
much closer monitoring of the pump feeding the
anesthetic into the patient.
Barzilay and Elhadad 97
• WordNet-based
• three types of relations:
– extra-strong (repetitions)
– strong (WordNet relations)
– medium-strong (link between synsets is
longer than one + some additional
constraints)
Barzilay and Elhadad 97
• Scoring chains:
– Length
– Homogeneity index:
= 1 - (# distinct words in chain) / Length
Score = Length * Homogeneity
• Strong chains:
Score > Average(Scores) + 2 * st.dev.(Scores)
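The chain-scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a chain is assumed to be a non-empty list of word occurrences (repetitions included), the function names are hypothetical, and the population standard deviation is an arbitrary choice where the slide only says "st.dev.".

```python
from statistics import mean, pstdev

def chain_score(chain):
    """Score = Length * Homogeneity for one non-empty lexical chain."""
    length = len(chain)                           # occurrences, repeats included
    homogeneity = 1 - len(set(chain)) / length    # 1 - distinct / length
    return length * homogeneity

def strong_chains(chains):
    """Keep chains whose score exceeds average + 2 standard deviations."""
    scores = [chain_score(c) for c in chains]
    threshold = mean(scores) + 2 * pstdev(scores)  # pstdev is an assumption
    return [c for c, s in zip(chains, scores) if s > threshold]
```

With only a handful of chains the threshold is rarely exceeded, so the rule is meaningful mainly when a text yields many chains.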
Other approaches
• Salience-based [Boguraev and
Kennedy 97]
• Computational linguistics papers
[Teufel and Moens 97]
Part III
Multi-document
summarization
Mani & Bloedorn 97,99
• Summarizing
differences and
similarities across
documents
• Single event or a
sequence of
events
• Text segments are
aligned
• Evaluation: TREC
relevance
judgments
• Significant
reduction in time
with no significant
loss of accuracy
Carbonell & Goldstein 98
• Maximal Marginal
Relevance (MMR)
• Query-based
summaries
• Law of diminishing
returns
C = doc collection
Q = user query
R = IR(C, Q, θ)
S = already retrieved documents
Sim = similarity metric used
\[ \mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \Big[ \lambda\, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big] \]
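A greedy selection loop is one common way to apply the MMR criterion. The sketch below assumes documents are plain strings and uses a word-level Jaccard measure for both Sim1 and Sim2. Both choices, and the names `mmr_select` and `jaccard`, are illustrative assumptions rather than anything specified on the slide.

```python
def jaccard(a, b):
    """Word-set overlap; a stand-in for Sim1 and Sim2 (assumption)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(candidates, query, sim1, sim2, lam=0.7, k=3):
    """Greedily pick k items from R, each time maximizing the MMR score."""
    selected = []                 # S: already selected
    remaining = list(candidates)  # R \ S
    while remaining and len(selected) < k:
        def mmr(d):
            # Redundancy is the largest similarity to anything already in S.
            redundancy = max((sim2(d, s) for s in selected), default=0.0)
            return lam * sim1(d, query) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam` close to 1 approaches plain relevance ranking, while lower values trade relevance for novelty, which is the "law of diminishing returns" idea on the slide.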
Radev et al. 00
• MEAD
• Centroid-based
• Based on sentence
utility
• Topic detection
and tracking
initiative [Allen et
al. 98, Wayne 98]
ARTICLE 18853: ALGIERS, May 20 (AFP)
1. Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday, adding that two shepherds were murdered earlier this week.
2. Security forces found the mass grave on Wednesday at Chbika, near Djelfa, 275 kilometers (170 miles) south of the capital.
3. It contained the bodies of people killed last year during a wedding ceremony, according to Le Quotidien Liberte.
4. The victims included women, children and old men.
5. Most of them had been decapitated and their heads thrown on a road, reported the Es Sahafa.
6. Another mass grave containing the bodies of around 10 people was discovered recently near Algiers, in the Eucalyptus district.
7. The two shepherds were killed Monday evening by a group of nine armed Islamists near the Moulay Slissen forest.
8. After being injured in a hail of automatic weapons fire, the pair were finished off with machete blows before being decapitated, Le Quotidien d'Oran reported.
9. Seven people, six of them children, were killed and two injured Wednesday by armed Islamists near Medea, 120 kilometers (75 miles) south of Algiers, security forces said.
10. The same day a parcel bomb explosion injured 17 people in Algiers itself.
11. Since early March, violence linked to armed Islamists has claimed more than 500 lives, according to press tallies.

ARTICLE 18854: ALGIERS, May 20 (UPI)
1. Algerian newspapers have reported that 18 decapitated bodies have been found by authorities in the south of the country.
2. Police found the ``decapitated bodies of women, children and old men, with their heads thrown on a road'' near the town of Jelfa, 275 kilometers (170 miles) south of the capital Algiers.
3. In another incident on Wednesday, seven people -- including six children -- were killed by terrorists, Algerian security forces said.
4. Extremist Muslim militants were responsible for the slaughter of the seven people in the province of Medea, 120 kilometers (74 miles) south of Algiers.
5. The killers also kidnapped three girls during the same attack, authorities said, and one of the girls was found wounded on a nearby road.
6. Meanwhile, the Algerian daily Le Matin today quoted Interior Minister Abdul Malik Silal as saying that ``terrorism has not been eradicated, but the movement of the terrorists has significantly declined.''
7. Algerian violence has claimed the lives of more than 70,000 people since the army cancelled the 1992 general elections that Islamic parties were likely to win.
8. Mainstream Islamic groups, most of which are banned in the country, insist their members are not responsible for the violence against civilians.
9. Some Muslim groups have blamed the army, while others accuse ``foreign elements conspiring against Algeria.''
Vector-based representation
[Figure: a document vector and a centroid vector drawn in a space with axes Term 1, Term 2, and Term 3]
Vector-based matching
• The cosine measure
\[ \mathrm{sim}(D, C) = \frac{\sum_k d_k\, c_k\, \mathrm{idf}(k)}{\sqrt{\sum_k d_k^2}\,\sqrt{\sum_k c_k^2}} \]
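As a rough illustration of the cosine measure, documents and centroids can be held as term-to-weight dictionaries. The function below is a sketch under that assumption. The `idf` dictionary and the default weight of 1.0 for unknown terms are hypothetical conveniences.

```python
import math

def sim(d, c, idf):
    """IDF-weighted cosine between document d and centroid c (term -> weight)."""
    shared = set(d) & set(c)   # only shared terms contribute to the numerator
    num = sum(d[k] * c[k] * idf.get(k, 1.0) for k in shared)
    den = math.sqrt(sum(v * v for v in d.values())) * \
          math.sqrt(sum(v * v for v in c.values()))
    return num / den if den else 0.0
```

A document would then be compared against each existing centroid and the result checked against the threshold T.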
CIDR
sim ≥ T
sim < T
Centroids
[Figure: sample cluster centroids, each listed with its terms and their weights]
C00022 (N=44) (10000): diana 1.93, princess 1.52
C00035 (N=22) (10000): airlines 1.45, finnair 0.45
C00031 (N=34) (10000): el 1.85, nino 1.56
C00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90
C10062 (N=161): microsoft 3.24, justice 0.93, departmen 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02
C00025 (N=19) (10000): albanians 3.00
C00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
MEAD
• INPUT: Cluster of d documents with n
sentences (compression rate = r)
• OUTPUT: (n * r) sentences from the
cluster with the highest values of
SCORE
\[ \mathrm{SCORE}(s) = \sum_i (w_c C_i + w_p P_i + w_f F_i) \]
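The input/output description above amounts to scoring every sentence and keeping the top n * r. The sketch below assumes each sentence already has three precomputed feature values (C, P, F), which in the MEAD papers are, as far as I recall, the centroid value, the positional value, and the overlap with the first sentence. The function name, the default weights, and the rounding of n * r are hypothetical.

```python
def mead_extract(sentences, features, r, wc=1.0, wp=1.0, wf=1.0):
    """Return round(n * r) sentences with the highest linear score, in text order.

    features[i] is a (C, P, F) triple for sentences[i]; the values are
    assumed to be computed elsewhere.
    """
    scored = [(wc * c + wp * p + wf * f, i)
              for i, (c, p, f) in enumerate(features)]
    keep = max(1, round(len(sentences) * r))        # at least one sentence
    top = sorted(scored, reverse=True)[:keep]
    return [sentences[i] for i in sorted(i for _, i in top)]
```

Returning the selected sentences in their original order keeps the extract readable.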
[Barzilay et al. 99]
• Theme intersection (paraphrases)
• Identifying common phrases across
multiple sentences:
– evaluated on 39 sentence-level
predicate-argument structures
– 74% of p-a structures automatically
identified
Other multi-document approaches
• Reformulation [McKeown et al. 99]
• Generation by Selection and Repair
[DiMarco et al. 97]
• Topic and event distinctions
[Fukumoto & Suzuki 00]
Part IV
Knowledge-rich
approaches
Overview
• Schank and Abelson 77
– scripts
• DeJong 79
– FRUMP (slot-filling from UPI news)
• Graesser 81
– Ratio of inferred propositions to those
explicitly stated is 8:1
• Young & Hayes 85
– banking telexes
Radev and McKeown 98
MESSAGE: ID                     TST3-MUC4-0010
MESSAGE: TEMPLATE               2
INCIDENT: DATE                  30 OCT 89
INCIDENT: LOCATION              EL SALVADOR
INCIDENT: TYPE                  ATTACK
INCIDENT: STAGE OF EXECUTION    ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY         TERRORIST ACT
PERP: INDIVIDUAL ID             "TERRORIST"
PERP: ORGANIZATION ID           "THE FMLN"
PERP: ORG. CONFIDENCE           REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION            "1 CIVILIAN"
HUM TGT: TYPE                   CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER                 1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT     DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER
Generating text from templates
On October 30, 1989, one civilian was killed in a
reported FMLN attack in El Salvador.
[Figure: system architecture. Input: cluster of templates T1, T2, ..., Tm. Conceptual combiner: combiner, paragraph planner, domain ontology, planning operators. Linguistic realizer: sentence planner, lexical chooser, lexicon, sentence generator (SURGE). Output: base summary]
Excerpts from four articles
1. JERUSALEM - A Muslim suicide bomber blew apart 18 people on a Jerusalem bus and wounded 10 in a mirror-image of an attack one week ago. The carnage could rob Israel's Prime Minister Shimon Peres of the May 29 election victory he needs to pursue Middle East peacemaking. Peres declared all-out war on Hamas but his tough talk did little to impress stunned residents of Jerusalem who said the election would turn on the issue of personal security.

2. JERUSALEM - A bomb at a busy Tel Aviv shopping mall killed at least 10 people and wounded 30, Israel radio said quoting police. Army radio said the blast was apparently caused by a suicide bomber. Police said there were many wounded.

3. A bomb blast ripped through the commercial heart of Tel Aviv Monday, killing at least 13 people and wounding more than 100. Israeli police say an Islamic suicide bomber blew himself up outside a crowded shopping mall. It was the fourth deadly bombing in Israel in nine days. The Islamic fundamentalist group Hamas claimed responsibility for the attacks, which have killed at least 54 people. Hamas is intent on stopping the Middle East peace process. President Clinton joined the voices of international condemnation after the latest attack. He said the ``forces of terror shall not triumph'' over peacemaking efforts.

4. TEL AVIV (Reuter) - A Muslim suicide bomber killed at least 12 people and wounded 105, including children, outside a crowded Tel Aviv shopping mall Monday, police said.
Sunday, a Hamas suicide bomber killed 18 people on a Jerusalem bus. Hamas has now killed at least 56 people in four attacks in nine days.
The windows of stores lining both sides of Dizengoff Street were shattered, the charred skeletons of cars lay in the street, the sidewalks were strewn with blood.
The last attack on Dizengoff was in October 1994 when a Hamas suicide bomber killed 22 people on a bus.
Four templates
Template 1
MESSAGE: ID             TST-REU-0001
SECSOURCE: SOURCE       Reuters
SECSOURCE: DATE         March 3, 1996 11:30
PRIMSOURCE: SOURCE
INCIDENT: DATE          March 3, 1996
INCIDENT: LOCATION      Jerusalem
INCIDENT: TYPE          Bombing
HUM TGT: NUMBER         "killed: 18", "wounded: 10"
PERP: ORGANIZATION ID

Template 2
MESSAGE: ID             TST-REU-0002
SECSOURCE: SOURCE       Reuters
SECSOURCE: DATE         March 4, 1996 07:20
PRIMSOURCE: SOURCE      Israel Radio
INCIDENT: DATE          March 4, 1996
INCIDENT: LOCATION      Tel Aviv
INCIDENT: TYPE          Bombing
HUM TGT: NUMBER         "killed: at least 10", "wounded: more than 100"
PERP: ORGANIZATION ID

Template 3
MESSAGE: ID             TST-REU-0003
SECSOURCE: SOURCE       Reuters
SECSOURCE: DATE         March 4, 1996 14:20
PRIMSOURCE: SOURCE
INCIDENT: DATE          March 4, 1996
INCIDENT: LOCATION      Tel Aviv
INCIDENT: TYPE          Bombing
HUM TGT: NUMBER         "killed: at least 13", "wounded: more than 100"
PERP: ORGANIZATION ID   "Hamas"

Template 4
MESSAGE: ID             TST-REU-0004
SECSOURCE: SOURCE       Reuters
SECSOURCE: DATE         March 4, 1996 14:30
PRIMSOURCE: SOURCE
INCIDENT: DATE          March 4, 1996
INCIDENT: LOCATION      Tel Aviv
INCIDENT: TYPE          Bombing
HUM TGT: NUMBER         "killed: at least 12", "wounded: 105"
PERP: ORGANIZATION ID
Fluent summary with
comparisons
Reuters reported that 18 people were killed on
Sunday in a bombing in Jerusalem. The next
day, a bomb in Tel Aviv killed at least 10
people and wounded 30 according to Israel
radio. Reuters reported that at least 12 people
were killed and 105 wounded in the second
incident. Later the same day, Reuters reported
that Hamas has claimed responsibility for the
act.
(OUTPUT OF SUMMONS)
Operators
• If there are two templates
AND
the location is the same
AND
the time of the second template is after the time of the
first template
AND
the source of the first template is different from the
source of the second template
AND
at least one slot differs
THEN
combine the templates using the contradiction operator...
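The precondition of this operator is a plain conjunction and can be checked mechanically. The sketch below is only an illustration of that rule: the `Template` class, its field names, and the function name are hypothetical, and the clock times in the usage example are invented for the demonstration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Template:
    source: str
    time: datetime
    location: str
    slots: dict = field(default_factory=dict)  # remaining slot -> value

def contradiction_applies(t1, t2):
    """True when the four preconditions listed on the slide all hold."""
    keys = t1.slots.keys() | t2.slots.keys()
    some_slot_differs = any(t1.slots.get(k) != t2.slots.get(k) for k in keys)
    return (t1.location == t2.location      # same location
            and t2.time > t1.time           # second report is later
            and t1.source != t2.source      # different sources
            and some_slot_differs)          # at least one slot differs
```

For instance, a Reuters template with "killed: at least 6" followed by an Associated Press template with "killed: 5" for the same location would satisfy the check.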
Operators: Change of
Perspective
Change of perspective
Precondition:
The same source reports a change in a small
number of slots
March 4th, Reuters reported that a bomb in Tel Aviv
killed at least 10 people and wounded 30. Later the
same day, Reuters reported that exactly 12 people
were actually killed and 105 wounded.
Operators: Contradiction
Contradiction
Precondition:
Different sources report contradictory values for
a small number of slots
The afternoon of February 26, 1993, Reuters reported
that a suspected bomb killed at least six people in the
World Trade Center. However, Associated Press
announced that exactly five people were killed in the
blast.
Operators: Refinement and
Agreement
Refinement
On Monday morning, Reuters announced that a
suicide bomber killed at least 10 people in Tel Aviv.
In the afternoon, Reuters reported that Hamas
claimed responsibility for the act.
Agreement
The morning of March 1st 1994, both UPI and
Reuters reported that a man was kidnapped in the
Bronx.
Operators: Generalization
Generalization
According to UPI, three terrorists were arrested in
Medellín last Tuesday. Reuters announced that the
police arrested two drug traffickers in Bogotá the
next day.
A total of five criminals were arrested in Colombia
last week.
Other conceptual methods
• Operator-based transformations
using terminological knowledge
representation [Reimer and Hahn 97]
• Topic interpretation [Hovy and Lin
98]
Part V
Evaluation techniques
Overview of techniques
• Extrinsic techniques (task-based)
• Intrinsic techniques
Hovy 98
• Can you recreate what’s in the original?
– the Shannon Game [Shannon 1947–50].
– but often only some of it is really important.
• Measure info retention (number of keystrokes):
– 3 groups of subjects, each must recreate
text:
• group 1 sees original text before starting.
• group 2 sees summary of original text before
starting.
• group 3 sees nothing before starting.
• Results (# of keystrokes; two different paragraphs):
Group 1: approx. 10
Group 2: approx. 150
Group 3: approx. 1100
Hovy 98
• Burning questions:
1. How do different evaluation methods compare for
each type of summary?
2. How do different summary types fare under different
methods?
3. How much does the evaluator affect things?
4. Is there a preferred evaluation method?
• Small Experiment
– 2 texts, 7 groups.
• Results:
– No difference!
– As other experiment…
– ? Extract is best?
[Table: ranks under three evaluation methods (Shannon, Q&A, Classification) for the summary types Background, Just-the-News, Regular, Keywords, Random, Original, Abstract, Extract, and No Text; the individual cell values are not recoverable here. Rank gaps shown: 1-2: 50%, 2-3: 50%; 1-2: 30%, 2-3: 20%, 3-4: 20%, 4-5: 100%]
Precision and Recall

                       Relevant   Non-relevant
System: relevant       A          B
System: non-relevant   C          D

\[ \text{Precision: } P = \frac{A}{A + B} \]
\[ \text{Recall: } R = \frac{A}{A + C} \]
\[ F = \frac{2 P R}{P + R} \]
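For sentence extraction, A, B, and C fall out of simple set operations between the system extract and the ideal extract. The helper below is a small sketch under that reading. The function name and the use of sentence identifiers are assumptions.

```python
def precision_recall_f(system, ideal):
    """P, R, F for two sets of selected sentence ids."""
    a = len(system & ideal)   # A: selected by both
    b = len(system - ideal)   # B: selected by the system only
    c = len(ideal - system)   # C: missed by the system
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

With an ideal extract {S1, S2}, a system choosing {S2, S4} gets P = R = F = 0.5, while a system choosing {S1, S2} gets 1.0 on all three.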
Jing et al. 98
• Small experiment
with 40 articles
• When summary
length is given,
humans are pretty
consistent in
selecting the same
sentences
• Percent agreement
• Different systems
achieved
maximum
performance at
different summary
lengths
• Human agreement
higher for longer
summaries
SUMMAC [Mani et al. 98]
• 16 participants
• 3 tasks:
– ad hoc: indicative,
user-focused
summaries
– categorization:
generic summaries,
five categories
– question-answering
• 20 TREC topics
• 50 documents per
topic (short ones
are omitted)
SUMMAC [Mani et al. 98]
• Participants
submit a fixed-length summary
limited to 10% and
a “best” summary,
not limited in
length.
• variable-length
summaries are as
accurate as full
text
• over 80% of
summaries are
intelligible
• technologies
perform similarly
Goldstein et al. 99
• Reuters, LA Times
• Manual summaries
• Summary length
rather than
summarization
ratio is typically
fixed
• Normalized version
of R & F.
\[ R' = \frac{A}{\min(A + B,\; A + C)} \]
\[ F' = \frac{2 P R'}{P + R'} \]
Goldstein et al. 99
• How to measure
relative
performance?
p = performance
b = baseline
g = “good” system
s = “superior” system
\[ p' = \frac{p - b}{1 - b} \]
\[ \frac{s' - g'}{g'} = \frac{s - g}{g - b} \]
Radev et al. 00
Summary sentence extraction method

       Ideal   System 1   System 2
S1     +       +          -
S2     +       +          +
S3     -       -          -
S4     -       -          +
S5     -       -          -
S6     -       -          -
S7     -       -          -
S8     -       -          -
S9     -       -          -
S10    -       -          -

Cluster-Based Sentence Utility
CBSU method

       Ideal   System 1   System 2
S1     10(+)   10(+)      5
S2     8(+)    9(+)       8(+)
S3     2       3          4
S4     7       6          9(+)

CBSU(system, ideal) = % of ideal utility
covered by system summary
Interjudge agreement
         Sentence 1   Sentence 2   Sentence 3   Sentence 4
Judge1   10           8            2            5
Judge2   10           9            3            6
Judge3   5            8            4            9
Relative utility
RU = 13 / 17 = 0.765
Judge 1's two-sentence extract (Sentences 1 and 2) is worth 5 + 8 = 13 under Judge 3's scores; Judge 3's own best two sentences (Sentences 2 and 4) are worth 8 + 9 = 17.
Normalized System Performance
          Judge 1   Judge 2   Judge 3   Average
Judge 1   1.000     1.000     0.765     0.883
Judge 2   1.000     1.000     0.765     0.883
Judge 3   0.722     0.789     1.000     0.756
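The off-diagonal entries of this table can be reproduced from the judges' utility scores. The sketch below assumes a two-sentence extract per judge (each judge's two highest-scored sentences) and divides the utility that extract earns under another judge's scores by the best utility achievable under those scores. The function names are hypothetical.

```python
# Utility scores per judge for Sentences 1-4 (from the interjudge example).
UTILITY = {
    "Judge1": [10, 8, 2, 5],
    "Judge2": [10, 9, 3, 6],
    "Judge3": [5, 8, 4, 9],
}

def top_k(scores, k):
    """Indices of the k highest-scored sentences."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def relative_utility(extract, reference_scores):
    """Utility of an extract relative to the best extract of the same size."""
    best = sum(sorted(reference_scores, reverse=True)[:len(extract)])
    return sum(reference_scores[i] for i in extract) / best

def judge_vs_judge(a, b, k=2):
    """Relative utility of judge a's extract under judge b's scores."""
    return relative_utility(top_k(UTILITY[a], k), UTILITY[b])
```

Under these assumptions `judge_vs_judge("Judge1", "Judge3")` gives 13/17, and the Judge 3 row comes out as 13/18 and 15/19, matching 0.722 and 0.789.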
Normalized system performance
\[ D = \frac{S - R}{J - R} \]
S = system performance
J = interjudge agreement
R = random performance
Random Performance
R = average of all
\[ \frac{n!}{(n(1-r))!\,(r \cdot n)!} \]
systems: {12} {13} {14} {23} {24} {34}
Examples
\[ D_{\{14\}} = \frac{S - R}{J - R} = \frac{0.833 - 0.732}{0.841 - 0.732} = 0.927 \]
\[ D_{\{24\}} = 0.963 \]
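The quantities in this example can be recomputed from the three judges' utility scores for four sentences. The sketch below rests on one assumption that the slides do not spell out: a system's performance S is its relative utility averaged over the three judges. With that reading, enumerating all two-sentence extracts reproduces S = 0.833 for {14} and R = 0.732. All function names are hypothetical.

```python
from itertools import combinations

# Utility scores of Judges 1-3 for Sentences 1-4.
JUDGES = [[10, 8, 2, 5], [10, 9, 3, 6], [5, 8, 4, 9]]

def ru(extract, scores):
    """Relative utility of an extract under one judge's scores."""
    best = sum(sorted(scores, reverse=True)[:len(extract)])
    return sum(scores[i] for i in extract) / best

def system_performance(extract):
    """S: relative utility averaged over all judges (assumed definition)."""
    return sum(ru(extract, j) for j in JUDGES) / len(JUDGES)

def random_performance(n=4, k=2):
    """R: average performance over every possible k-sentence extract."""
    systems = list(combinations(range(n), k))   # n! / ((n-k)! k!) systems
    return sum(system_performance(s) for s in systems) / len(systems)

def normalized(s, j, r):
    """D = (S - R) / (J - R)."""
    return (s - r) / (j - r)
```

Sentences are indexed from 0 here, so the system {14} is the tuple `(0, 3)`. Plugging the slide's rounded values into `normalized(0.833, 0.841, 0.732)` gives the 0.927 shown above.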
Normalized evaluation of {14}
[Figure: two scales side by side. Raw scale: R = 0.732, S = 0.833, J = 0.841. Normalized scale: R' = 0.0, S' = 0.927 = D, J' = 1.0]
Cross-sentence Informational
Subsumption and Equivalence
• Subsumption: If the information content of
sentence a (denoted as I(a)) is contained
within sentence b, then a becomes
informationally redundant and the content
of b is said to subsume that of a:
\[ I(a) \subset I(b) \]
• Equivalence: If \( I(a) \subset I(b) \wedge I(b) \subset I(a) \)
Example
(1) John Doe was found guilty of the
murder.
(2) The court found John Doe guilty of
the murder of Jane Doe last August
and sentenced him to life.
Cross-sentence Informational
Subsumption
     Article 1   Article 2   Article 3
S1   10          10          5
S2   8           9           8
S3   2           3           4
S4   7           6           9
Evaluation
Cluster   # docs   # sents   source                            news sources    topic
A         2        25        clari.world.africa.northwestern   AFP, UPI        Algerian terrorists threaten Belgium
B         3        45        clari.world.terrorism             AFP, UPI        The FBI puts Osama bin Laden on the most wanted list
C         2        65        clari.world.europe.russia         AP, AFP         Explosion in a Moscow apartment building (Sept. 9, 1999)
D         7        189       clari.world.europe.russia         AP, AFP, UPI    Explosion in a Moscow apartment building (Sept. 13, 1999)
E         10       151       TDT-3 corpus, topic 78            AP, PRI, VOA    General strike in Denmark
F         3        83        TDT-3 corpus, topic 67            AP, NYT         Toxic spill in Spain
[Figure: Inter-judge agreement versus compression. Agreement (J), from 0.75 to 1, plotted against compression rate (r), from 10 to 100, for Clusters A to F]
Evaluating Sentence
Subsumption
Sent   Judge 1   Judge 2   Judge 3   Judge 4   Judge 5   + score   - score
A1-1   -         A2-1      A2-1      -         A2-1      3
A1-2   A2-5      A2-5      -         -         A2-5      3
A1-3   -         -         -         -         A2-10               4
A1-4   A2-10     A2-10     A2-10     -         A2-10     4
A1-5   -         A2-1      -         A2-2      A2-4                2
A1-6   -         -         -         -         A2-7                4
A1-7   -         -         -         -         A2-8                4
Subsumption (Cont’d)
\[ \mathrm{SCORE}(s) = \sum_i (w_c C_i + w_p P_i + w_f F_i) - w_R R_s \]
R_s = cross-sentence word overlap
\[ R_s = \frac{2 \cdot (\#\text{ overlapping words})}{\#\text{ words in sentence 1} + \#\text{ words in sentence 2}} \]
\[ w_R = \max_s \mathrm{SCORE}(s) \]
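The redundancy term can be illustrated on the John Doe sentence pair from the subsumption example. The sketch below makes two assumptions that the slide leaves open: words are lowercased whitespace tokens, and "overlapping words" counts distinct shared words. The function names are hypothetical.

```python
def word_overlap(s1, s2):
    """R_s = 2 * (# overlapping words) / (# words in s1 + # words in s2)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    overlapping = len(set(w1) & set(w2))   # distinct shared words (assumption)
    total = len(w1) + len(w2)
    return 2 * overlapping / total if total else 0.0

def penalized_score(base_score, r_s, w_r):
    """SCORE(s) minus the redundancy penalty w_R * R_s."""
    return base_score - w_r * r_s
```

Because w_R is set to the maximum score, a sentence that fully repeats another (R_s = 1) is pushed to zero or below and drops out of the extract.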
Subsumption analysis
#judges     Cluster A   Cluster B   Cluster C   Cluster D   Cluster E   Cluster F
agreeing    +    -      +    -      +    -      +    -      +    -      +    -
5           0    7      0    24     0    45     0    88     1    73     0    61
4           1    6      3    6      1    10     9    37     8    35     0    11
3           3    6      4    5      4    4      28   20     5    23     3    7
2           1    1      2    1      1    0      7    0      7    0      1    0
Total: 558 sentences, full agreement on 292 (1+291), partial on 406 (23+383)
Of 80 sentences with some indication of subsumption, only 24 had agreement of 4 or more
judges.
Results
            10%     20%     30%     40%     50%     60%     70%     80%     90%
Cluster A   0.855   0.572   0.427   0.759   0.862   0.910   0.554   1.001   0.584
Cluster B   0.365   0.402   0.690   0.714   0.867   0.640   0.845   0.713   1.317
Cluster C   0.753   0.938   0.841   1.029   0.751   0.819   0.595   0.611   0.683
Cluster D   0.739   0.764   0.683   0.723   0.614   0.568   0.668   0.719   1.100
Cluster E   1.083   0.937   0.581   0.373   0.438   0.369   0.429   0.487   0.261
Cluster F   1.064   0.893   0.928   1.000   0.732   0.805   0.910   0.689   0.199
MEAD performed better than Lead in 29 (in bold) out of 54 cases.
MEAD+Lead performed better than the Lead baseline in 41 cases
Donaway et al. 00
• Sentence-rank based measures
– IDEAL={2,3,5}:
compare {2,3,4} and {2,3,9}
• Content-based measures
– vector comparisons of summary and
document
Proposed TIDES evaluation
• Creation of corpora
• Development of evaluation software
• TREC-style evaluation
• Intrinsic and extrinsic evaluations
• Multilingual summaries (over time)
• Question-answering evaluation
Part VI
The MEAD project
Background
• Summer 2001
• Eight weeks
• Johns Hopkins University
• Participants: Dragomir Radev, Simone Teufel,
Horacio Saggion, Wai Lam, Elliott Drabek, Hong
Qi, Danyu Liu, John Blitzer, and Arda Çelebi
Technical objectives
• Develop a summarization toolkit including
a modular state-of-the art summarizer:
single-document, multi-document,
generic, query-based
• Develop a summarization evaluation
toolkit allowing comparisons between
extractive and non-extractive summaries
• Produce an annotated corpus for further
research in text summarization
Sample scenarios
• Evaluate an existing summarizer
• Build a summarizer from scratch
• Test a summarization feature
• Test a new evaluation metric
• Test a machine translation system
Resources
• manual summaries (extracts and abstracts)
• baseline summaries
• automatic summaries
• manual and automatic relevance judgements
• XREF, lemmatized, tagged versions of the corpus
• manual and automatic query translations
• sentence segmentation
• sentence alignments
• XML DTDs, converters
• subsumption judgements
• guidelines for judges
• guidelines for building summarizers
• evaluation software
• modular, trainable summarizer
Sample English Query
<?xml version='1.0'?>
<!DOCTYPE QUERY SYSTEM "../../../dtd/query.dtd" >
<QUERY QID="Q-241-E" QNO="241" TRANSLATED="NO">
<TITLE>
Fire safety, building management concerns
</TITLE>
</QUERY>
Sample Chinese Query
<?xml version='1.0'?>
<!DOCTYPE QUERY SYSTEM "../../../dtd/query.dtd" >
<QUERY QID="Q-241-C" QNO="241" TRANSLATED="NO">
<TITLE>
防火意識,大廈管理
</TITLE>
</QUERY>
Sample Retrieval Result for Full-length
Documents
<?xml version='1.0'?>
<!DOCTYPE DOC-JUDGE SYSTEM "/export/ws01summ/dtd/docjudge.dtd" >
<DOC-JUDGE QID="Q-241-E" SYSTEM="SMART" LANG="ENG">
<D DID="D-20000126_008.e" RANK="1" SCORE="135.0000" CORR-DOC="D-20000126_012.c"/>
<D DID="D-19980625_007.e" RANK="2" SCORE="99.0000" CORR-DOC="D-19980625_006.c"/>
<D DID="D-19990126_017.e" RANK="3" SCORE="98.0000" CORR-DOC="D-19990126_018.c"/>
<D DID="D-19981007_018.e" RANK="4" SCORE="91.0000" CORR-DOC="D-19981007_023.c"/>
<D DID="D-19980121_004.e" RANK="5" SCORE="78.0000" CORR-DOC="D-19980121_009.c"/>
<D DID="D-19971016_004.e" RANK="6" SCORE="72.0000" CORR-DOC="D-19971016_005.c"/>
Sample Retrieval Result for Lead-Based
Summary (5%)
<?xml version='1.0'?>
<!DOCTYPE DOC-JUDGE SYSTEM "/export/ws01summ/dtd/docjudge.dtd" >
<DOC-JUDGE QID="Q-241-E" SYSTEM="SMART" LANG="ENG">
<D DID="D-20000126_008.e" RANK="1" SCORE="14.0000" CORR-DOC="D-20000126_012.c"/>
<D DID="D-19991214_002.e" RANK="2" SCORE="11.0000" CORR-DOC="D-19991214_001.c"/>
<D DID="D-19980810_006.e" RANK="3" SCORE="10.0000" CORR-DOC="D-19980810_003.c"/>
<D DID="D-19990505_028.e" RANK="4" SCORE="9.0000" CORR-DOC="D-19990505_034.c"/>
<D DID="D-19980115_009.e" RANK="4" SCORE="9.0000" CORR-DOC="D-19980115_013.c"/>
Single-document situation
[Figure: single-document evaluation flow. Components: query; SMART; IR results; ranked document list (one for full documents, one for extracts); correlation; document; LDC judges; summarizer; baselines; extract. Summary comparison by 1. Co-selection, 2. Similarity]
Multi-document situation
[Figure: multi-document evaluation flow. Components: document cluster; LDC judges; manual summaries; summarizer; baselines; extracts. Summary comparison by 1. Co-selection, 2. Similarity]
Summaries produced
• Single-document extracts
– automatic (135 runs on 18,146 documents
each): 10 compression rates, Word/Sentence,
English/Chinese/Xlingual, 10 summarization
methods
– manual (80 runs on 200 documents each): 10
compression rates, Word/Sentence, (3 judges
+ average)
Summaries produced
• Multi-document summaries
– 3 lengths, 3 judges, 14 queries (out of 40)
• Multi-document extracts
– automatic (160 extracts) = 8 compression rates
(5-40%,50-200AW) x 20 clusters
– manual (320 extracts) = 8 compression rates x
10 clusters x (3 judges + average)
List of summarizers
• MEAD, Websumm, Summarist,
LexChains, Align
• English, Chinese
• Single-document, Multi-document
MEAD architecture
[Figure: processing pipeline with a feature scorer (SVM), a relation scorer (subsumption), and an extractor]
Emergency relief by SWD
The Social Welfare Department has provided relief articles and hot meals to 114 people who were affected
by the rainstorm or mudslip throughout the territory. The people, comprising adults and children, come from
30 families. Some of them are taking temporary shelter at Lung Hang Estate Community Centre in Sha Tin,
and Shek Lei Estate Community Centre and Princess Alexandra Community Centre in Tsuen Wan. The
Regional Social Welfare Officer (New Territories East), Mrs Lily Wong, visited victims at Lung Hang State
Community Centre this (Thursday) afternoon to offer any necessary assistance. Six victims have so far
requested for Comprehensive Social Security Allowance and the applications are being processed. Social
workers also escorted an 88-year old man who was feeling unwell to the Prince of Wales hospital for
medical checkup.
WEBSUMM:
RANDOM:
Some of them are taking temporary shelter at Lung
Hang Estate Community Centre in Sha Tin, and
Shek Lei Estate Community Centre and Princess
Alexandra Community Centre in Tsuen Wan.
The Social Welfare Department has provided relief
articles and hot meals to 114 people who were
affected by the rainstorm or mudslip throughout the
territory. Some of them are taking temporary shelter
at Lung Hang Estate Community Centre in Sha Tin,
and Shek Lei Estate Community Centre and Princess
Alexandra Community Centre in Tsuen Wan.
MEAD:
The Social Welfare Department has provided relief
articles and hot meals to 114 people who were
affected by the rainstorm or mudslip throughout the
territory. The Regional Social Welfare Officer (New
Territories East), Mrs Lily Wong, visited victims at
Lung Hang State Community Centre this (Thursday)
afternoon to offer any necessary assistance.
LEAD:
The Social Welfare Department has provided relief
articles and hot meals to 114 people who were
affected by the rainstorm or mudslip throughout the
territory. The people, comprising adults and
children, come from 30 families.
Humans: Percent Agreement (20-cluster average) and compression
[Figure: % agreement (0 to 1) plotted against compression (5 to 90)]
Humans: precision/recall (cluster average) and compression
[Figure: p/r (0 to 1) plotted against compression (5 to 90); series: Random, Humans]
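Kappa
\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]
• N: number of items (index i)
• n: number of categories (index j)
• k: number of annotators
• m_ij: number of annotators who assign item i to category j
\[ P(A) = \frac{1}{N k (k - 1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^2 - \frac{1}{k - 1} \]
\[ P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{N k} \right)^2 \]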
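The Kappa definition translates directly into code once the annotations are arranged as a count matrix. The sketch below assumes `m[i][j]` holds the number of annotators who put item i in category j and that every item is judged by the same k annotators. The function name is hypothetical.

```python
def kappa(m):
    """Kappa for an N x n matrix of per-item category counts."""
    N = len(m)        # number of items
    n = len(m[0])     # number of categories
    k = sum(m[0])     # annotators per item (assumed constant)
    # Observed agreement P(A).
    p_a = sum(x * x for row in m for x in row) / (N * k * (k - 1)) - 1 / (k - 1)
    # Chance agreement P(E) from the overall category proportions.
    p_e = sum((sum(row[j] for row in m) / (N * k)) ** 2 for j in range(n))
    return (p_a - p_e) / (1 - p_e)
```

For sentence extraction the two categories would be "in the extract" and "not in the extract", with one row per sentence.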
Humans: Kappa and compression
[Figure: K (0 to 1) plotted against compression (5 to 90)]
Kappa, human agreement, 40%
[Figure: K (0 to 0.7) per cluster no; clusters: 2, 46, 54, 60, 61, 62, 112, 125, 199, 241, 323, 398, 447, 551, 827, 883, 885, 1014, 1018, 1197]
Multi-document summaries of length 50 words, kappa on 10 clusters
[Figure: K (0 to 0.7) per cluster no for MEAD and Humans; clusters: 112, 125, 199, 241, 323, 398, 551, 883, 1014, 1197]
Relative utility (upper and lower bounds), Q125, 5%
[Figure: bar chart of R and J for A to J; values below]
     A       B       C       D       E       F       G       H       I       J
R    0.648   0.65    0.652   0.465   0.626   0.727   0.509   0.497   0.644   0.566
J    0.715   0.666   0.859   0.726   0.876   0.944   0.909   0.776   0.71    0.869
Relative utility (upper and lower bounds), Q125, 20%
[Figure: bar chart of R and J for A to J; values below]
     A       B       C       D       E       F       G       H       I       J
R    0.69    0.685   0.679   0.523   0.642   0.741   0.541   0.553   0.699   0.595
J    0.827   0.73    0.866   0.828   0.838   0.913   0.861   0.876   0.736   0.874
Relative utility (upper and lower bounds), Q125, 40%
[Figure: bar chart of R and J for A to J; values below]
     A       B       C       D       E       F       G       H       I       J
R    0.74    0.738   0.724   0.653   0.695   0.77    0.647   0.679   0.764   0.664
J    0.836   0.754   0.878   0.954   0.91    0.952   0.919   0.954   0.811   0.904
Relative Utility (RU) per summarizer and compression rate (Single-document)
[Figure: RU (0.6 to 1) per summarizer against compression rate; values below]
Compression rate   5       10      20      30      40      50      60      70      80      90
J                  0.785   0.79    0.81    0.833   0.853   0.875   0.913   0.94    0.962   0.982
R                  0.636   0.65    0.68    0.711   0.738   0.765   0.804   0.84    0.896   0.961
WEBS               0.761   0.765   0.776   0.801   0.828
MEAD               0.748   0.756   0.764   0.782   0.808   0.834   0.863   0.895   0.921   0.968
LEAD               0.733   0.738   0.772   0.797   0.829   0.85    0.877   0.906   0.936   0.973
Relative Utility (RU) per compression rate (Multi-document)
[Figure: RU (0.61 to 0.81) against compression rate for R, S, and J; values below]
Compression rate   5        10       20       30
R                  0.6116   0.6302   0.6614   0.6894
S                  0.6928   0.7246   0.7476   0.766
J                  0.6886   0.7296   0.7582   0.7904
Relevance correlation (RC)
\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \]
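The formula is the standard Pearson correlation coefficient. A plausible reading, offered here as an assumption, is that x holds the retrieval scores of a set of documents when the full text is indexed and y holds the scores of the same documents when only their summaries are indexed. The sketch below computes r for two such lists. The function name is hypothetical.

```python
import math

def relevance_correlation(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
          math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den
```

A value near 1 would mean the summaries preserve the relative relevance ordering produced by the full documents.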
Relevance Preservation Value (RPV) as a function of compression rate (RANDOM)
[Figure: RPV (0.44 to 0.94) against summary length (%); values below]
Summary length (%)     5       10      20      30      40      50      60      70      80      90
Query 112              0.5     0.64    0.8     0.86    0.91    0.93    0.95    0.97    0.98    0.99
Query 125              0.44    0.66    0.78    0.87    0.91    0.91    0.96    0.97    0.98    0.99
Query 241              0.68    0.77    0.87    0.91    0.94    0.96    0.97    0.98    0.99    1
Query 323              0.63    0.78    0.85    0.9     0.93    0.95    0.97    0.98    0.99    1
Query 551              0.52    0.69    0.79    0.88    0.92    0.94    0.95    0.97    0.98    0.99
AVERAGE (10 queries)   0.553   0.687   0.8     0.874   0.912   0.932   0.956   0.973   0.984   0.992
Relevance Preservation Value (RPV) for different summarizers (English, 20% )
[Figure: RPV (0.77 to 0.97) by summarizer and query; values below]
Query      FD   MEAD    WEBS    LEAD    RAND    SUMM
Q125       1    0.92    0.82    0.8     0.78    0.79
Q551       1    0.9     0.88    0.81    0.79    0.81
AVG(10Q)   1    0.903   0.843   0.802   0.8     0.775
Q112       1    0.91    0.88    0.8     0.8     0.77
Q241       1    0.93    0.89    0.84    0.87    0.85
Q323       1    0.92    0.91    0.85    0.85    0.88
Relevance Preservation Value (RPV) for different summarizers (Chinese, 20% )
[Figure: RPV (0.58 to 0.98) by summarizer and query; values below]
Query      FD   MEAD   SUMM    ALGN    LEAD    RAND
Q112       1    0.87   0.76    0.74    0.72    0.71
Q323       1    0.66   0.84    0.59    0.58    0.6
Q551       1    0.91   0.75    0.72    0.75    0.74
AVG(10Q)   1    0.85   0.755   0.738   0.733   0.744
Q125       1    0.87   0.75    0.72    0.71    0.75
Q241       1    0.93   0.85    0.83    0.83    0.85
Relevance Preservation Value (RPV) per compression rate and summarizer (English, 5 queries)
[Figure: RPV (0.55 to 1) by summarizer and compression rate; values below]
Compression rate   FD   MEAD    WEBS    LEAD    SUMM    RAND
5%                 1    0.724   0.73    0.66    0.622   0.554
10%                1    0.834   0.804   0.73    0.71    0.708
20%                1    0.916   0.876   0.82    0.82    0.818
30%                1    0.946   0.912   0.88    0.848   0.884
40%                1    0.962   0.936   0.906   0.862   0.922
Relevance Preservation Value (RPV) with and without cutoff (English, 5% )
[Figure: RPV (0 to 0.8) by summarizer and correlation method; values below]
Correlation method   SUMM   LEAD   MEAD   RAND   WEBS
with cutoff          0.48   0.55   0.61   0.29   0.6
no cutoff            0.61   0.59   0.74   0.44   0.63
Relevance Preservation Value (RPV) with and without cutoff (English, 10% )
[Figure: RPV (0 to 0.9) by summarizer and correlation method; values below]
Correlation method   SUMM   LEAD   MEAD   RAND   WEBS
with cutoff          0.65   0.65   0.76   0.56   0.7
no cutoff            0.73   0.71   0.84   0.66   0.72
Relevance Preservation Value (RPV) with and without cutoff (English, 20% )
[Figure: RPV (0 to 1) by summarizer and correlation method; values below]
Correlation method   SUMM   LEAD   MEAD   RAND   WEBS
with cutoff          0.71   0.74   0.88   0.72   0.8
no cutoff            0.79   0.8    0.92   0.78   0.82
Relevance Preservation Value (RPV) per MEAD policy (5 queries)
[Figure: RPV (0.85 to 0.93) by MEAD policy and query; values below]
Query   MEADORIG   MEAD002   MEAD003   MEADS002   ASGEMEAD
Q551    0.88       0.9       0.89      0.89
Q112    0.86       0.91      0.9       0.9        0.9
Q-AVG   0.886      0.916     0.908     0.908      0.9125
Q125    0.87       0.92      0.91      0.91       0.91
Q323    0.89       0.92      0.91      0.91       0.91
Q241    0.93       0.93      0.93      0.93       0.93
Properties of evaluation metrics
[Table: an X marks each property that a metric family supports; the positions of the X marks are not recoverable here]
Metric families (columns): Kappa, P/R, accuracy; RU; Word overlap, cosine, lcs; Relevance preserv.
Properties (rows):
Agreement human extracts
Agreement human extracts – automatic extracts
Agreement human summaries/extracts
Non-binary decisions
Full documents vs. extracts
Systems with different sentence segm.
Multidocument extracts
Full corpus coverage
Part VII
Language modeling
Language modeling
• Source/target language
• Coding process
[Diagram: e → Noisy channel → f → Recovery → e*]
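$$e^* = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(e) \cdot p(f \mid e)$$
$$p(E) = p(e_1) \cdot p(e_2 \mid e_1) \cdot p(e_3 \mid e_1 e_2) \cdots p(e_n \mid e_1 \ldots e_{n-1})$$
$$p(E) = p(e_1) \cdot p(e_2 \mid e_1) \cdot p(e_3 \mid e_2) \cdots p(e_n \mid e_{n-1})$$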
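A minimal sketch of the two ideas on this slide: the bigram factorization of p(E) and the argmax over p(e) · p(f|e). It assumes add-alpha smoothing, a `<s>` start marker (so p(e1) is read as p(e1 | <s>)), and an explicit candidate list in place of a real search. The names `train_bigram`, `log_p_bigram`, and `decode`, and the `log_channel` callback standing in for log p(f|e), are illustrative and not from the tutorial.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts, and the vocabulary from tokenized sentences."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)
        vocab.update(tokens)
        context.update(padded[:-1])              # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # (previous, current) pairs
    return context, bigrams, vocab


def log_p_bigram(tokens, context, bigrams, vocab, alpha=1.0):
    """log p(E) as a sum of log p(e_i | e_{i-1}), with add-alpha smoothing."""
    padded = ["<s>"] + list(tokens)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = context[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total


def decode(candidates, f, log_channel, context, bigrams, vocab):
    """Return the candidate e that maximizes log p(e) + log p(f | e)."""
    return max(
        candidates,
        key=lambda e: log_p_bigram(e, context, bigrams, vocab) + log_channel(f, e),
    )


sentences = [
    ["the", "house", "passed", "a", "bill"],
    ["the", "senate", "passed", "a", "bill"],
]
context, bigrams, vocab = train_bigram(sentences)
fluent = ["the", "house", "passed"]
scrambled = ["passed", "house", "the"]
# With a flat channel model, the language model alone decides.
best = decode([scrambled, fluent], None, lambda f, e: 0.0, context, bigrams, vocab)
print(best)
```

Working in log space turns the product into a sum and avoids underflow on long sequences.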
Summarization using LM
• Source language: full document
• Target language: summary
Berger & Mittal 00
• Gisting (OCELOT)
$$g^* = \arg\max_{g} p(g \mid d) = \arg\max_{g} p(g) \cdot p(d \mid g)$$
• content selection (preserve frequencies)
• word ordering (single words, consecutive
positions)
• search: readability & fidelity
Berger & Mittal 00
• Limit on top 65K words
• word relatedness = alignment
• Training on 100K summary+document
pairs
• Testing on 1046 pairs
• Use Viterbi-type search
• Evaluation: word overlap (0.2-0.4)
• translingual gisting is possible
• No word ordering
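The slide reports word overlap in the 0.2-0.4 range but does not define the measure. The sketch below uses one plausible reading as an assumption: the fraction of words in the generated gist that also occur in the reference summary. The function name `word_overlap` and the whitespace tokenization are illustrative choices.

```python
def word_overlap(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference (assumed definition)."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = set(reference.lower().split())
    if not candidate_tokens:
        return 0.0
    hits = sum(1 for token in candidate_tokens if token in reference_tokens)
    return hits / len(candidate_tokens)


score = word_overlap(
    "house passes tax bill",
    "house votes to ban internet taxes",
)
print(score)
```

A measure of this kind ignores word order, so it fits a system that, as noted above, does no word ordering.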
Berger & Mittal 00
Sample output:
Audubon society atlanta area savannah georgia chatham
and local birding savannah keepers chapter of the audubon
georgia and leasing
Banko et al. 00
• Summaries shorter than 1 sentence
• headline generation
• zero-level model: unigram probabilities
• other models: Part-of-speech and position
• Sample output:
Clinton to meet Netanyahu Arafat Israel
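A rough sketch of what a zero-level (unigram) selection model could look like: estimate, for each word, how often it appears in the headline given that it appears in the document, then keep the highest-scoring document words. This is an illustration under stated assumptions, not the system described on the slide. The names `train_selection` and `headline_words`, the toy training pairs, and the choice to keep document order are all assumptions.

```python
from collections import Counter


def train_selection(pairs):
    """Estimate p(word in headline | word in document) from (document, headline) token lists."""
    in_document, in_both = Counter(), Counter()
    for document, headline in pairs:
        headline_set = set(headline)
        for word in set(document):
            in_document[word] += 1
            if word in headline_set:
                in_both[word] += 1
    return {word: in_both[word] / in_document[word] for word in in_document}


def headline_words(document, p_select, k=5):
    """Pick the k distinct document words with the highest selection probability, in document order."""
    distinct = []
    for word in document:
        if word not in distinct:
            distinct.append(word)
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(distinct, key=lambda w: p_select.get(w, 0.0), reverse=True)[:k]
    chosen = set(ranked)
    return [word for word in distinct if word in chosen]


pairs = [
    (["clinton", "will", "meet", "arafat", "today"], ["clinton", "meet", "arafat"]),
    (["the", "senate", "will", "vote", "today"], ["senate", "vote"]),
]
p_select = train_selection(pairs)
print(headline_words(["clinton", "will", "meet", "netanyahu", "today"], p_select, k=2))
```

Because each word is scored independently, the output reads as a bag of content words, which is consistent with the ungrammatical sample above.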
Knight and Marcu 00
• Use structured (syntactic)
information
• Two approaches:
– noisy channel
– decision based
• Longer summaries
• Higher accuracy
Conclusion
• Summarization is coming of age
• For general domains: sentence
extraction
• IR techniques not always
appropriate: NLP needed
• New challenges: language modeling,
multilingual summaries
APPENDIX
Conferences
• Dagstuhl Meeting, 1993 (Karen Spärck Jones,
Brigitte Endres-Niggemeyer)
• ACL/EACL Workshop, Madrid, 1997 (Inderjeet
Mani, Mark Maybury)
• AAAI Spring Symposium, Stanford, 1998
(Dragomir Radev, Eduard Hovy)
• ANLP/NAACL, Seattle, 2000 (Udo Hahn, Chin-Yew
Lin, Inderjeet Mani, Dragomir Radev)
• NAACL, Pittsburgh, 2001 (Jade Goldstein and
Chin-Yew Lin)
• DUC, 2001 (Donna Harman and Daniel Marcu)
Readings
Advances in Automatic Text
Summarization by Inderjeet Mani
and Mark T. Maybury (eds.)
http://mitpress.mit.edu/book-table-of-contents.tcl?isbn=0262133598
(A detailed bibliography is available
at the end of this handout)
1 Automatic Summarizing: Factors and Directions (K. Spärck-Jones)
2 The Automatic Creation of Literature Abstracts (H. P. Luhn)
3 New Methods in Automatic Extracting (H. P. Edmundson)
4 Automatic Abstracting Research at Chemical Abstracts Service (J. J. Pollock and A. Zamora)
5 A Trainable Document Summarizer (J. Kupiec, J. Pedersen, and F. Chen)
6 Development and Evaluation of a Statistically Based Document Summarization System (S. H. Myaeng and D. Jang)
7 A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques (C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen)
8 Automated Text Summarization in SUMMARIST (E. Hovy and C. Lin)
9 Salience-based Content Characterization of Text Documents (B. Boguraev and C. Kennedy)
10 Using Lexical Chains for Text Summarization (R. Barzilay and M. Elhadad)
11 Discourse Trees Are Good Indicators of Importance in Text (D. Marcu)
12 A Robust Practical Text Summarizer (T. Strzalkowski, G. Stein, J. Wang, and B. Wise)
13 Argumentative Classification of Extracted Sentences as a First Step Towards Flexible Abstracting (S. Teufel and M. Moens)
14 Plot Units: A Narrative Summarization Strategy (W. G. Lehnert)
15 Knowledge-based Text Summarization: Salience and Generalization Operators for Knowledge Base Abstraction (U. Hahn and U. Reimer)
16 Generating Concise Natural Language Summaries (K. McKeown, J. Robin, and K. Kukich)
17 Generating Summaries from Event Data (M. Maybury)
18 The Formation of Abstracts by the Selection of Sentences (G. J. Rath, A. Resnick, and T. R. Savage)
19 Automatic Condensation of Electronic Publications by Sentence Selection (R. Brandow, K. Mitze, and L. F. Rau)
20 The Effects and Limitations of Automated Text Condensing on Reading Comprehension Performance (A. H. Morris, G. M.
Kasper, and D. A. Adams)
21 An Evaluation of Automatic Text Summarization Systems (T. Firmin and M. J. Chrzanowski)
22 Automatic Text Structuring and Summarization (G. Salton, A. Singhal, M. Mitra, and C. Buckley)
23 Summarizing Similarities and Differences among Related Documents (I. Mani and E. Bloedorn)
24 Generating Summaries of Multiple News Articles (K. McKeown and D. R. Radev)
25 An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News (A. Merlino and M. Maybury)
26 Summarization of Diagrams in Documents (R. P. Futrelle)
Collections of papers
• Information Processing and
Management, 1995
• Computational Linguistics (in
progress), 2002
Web resources
http://www.summarization.com
http://www.cs.columbia.edu/~jing/summarization.html
http://www.dcs.shef.ac.uk/~gael/alphalist.html
http://www.csi.uottawa.ca/tanka/ts.html
http://www.ics.mq.edu.au/~swan/summarization/
Ongoing projects
• Columbia
• ISI
• JHU, Michigan
• CMU, JPRC, etc.
• Sheffield
• elsewhere ...
Existing companies/systems
• Microsoft
• British Telecom
• http://extractor.iit.nrc.ca/
• inXight
• http://www.islandsoft.com/products.html (IslandInTEXT)
• www.pertinence.net
Available corpora
– SUMMAC corpus
• send mail to [email protected]
– <Text+Abstract+Extract> corpus
• send mail to [email protected]
– Open directory project
• http://dmoz.org
– MEAD corpus
• send mail to [email protected]
Possible research topics
• Corpus creation and annotation
• MMM: Multidocument, Multimedia,
Multilingual
• Evolving summaries
• Personalized summarization
• Web-based summarization
Cross-document structure theory
Number  Relationship type           Level  Description
1       Identity                    Any    The same text appears in more than one location
2       Equivalence (paraphrasing)  S, D   Two text spans have the same information content
3       Translation                 P, S   Same information content in different languages
4       Subsumption                 S, D   One sentence contains more information than another
5       Contradiction               S, D   Conflicting information
6       Historical background       S      Information that puts current information in context
7       Cross-reference             P      The same entity is mentioned
8       Citation                    S, D   One sentence cites another document
9       Modality                    S      Qualified version of a sentence
10      Attribution                 S      One sentence repeats the information of another while adding an attribution
11      Summary                     S, D   Similar to Summary in RST: one sentence summarizes another
[Figure: DOC 1, DOC 2, and DOC 3 connected by word links, phrasal links, cross-sentential links, and cross-document links]
[Figure: levels of analysis: word level, phrase level, paragraph/sentence level, document level]
1. Clustering
2. Document Analysis
3. Link Analysis
4. Summarization
Principles of Summarization
• Put a disclaimer indicating that (automated)
summaries may not preserve the emphasis and
meaning of the document.
• Preserve attribution.
• Always give users a pointer to the original
document.
• Indicate that the summary has been generated
automatically.
• In case of conflicting sources, give all points of
view.
Bibliography
THE END