Text Summarization


Text Summarization
Jagadish M (07305050)
Annervaz K M (07305063)
Joshi Prasad (07305047)
Ajesh Kumar S (07305065)
Shalini Gupta (07305R02)
Introduction



Summary: a brief but accurate representation of the contents of a document
Goal: take an information source, extract the most important content from it, and present it to the user in a condensed form and in a manner sensitive to the user's needs
Compression: the ratio of the length of the summary to the length of the source
Example: MS Word AutoSummarize
Presentation Outline









Motivation
Different Genres
Simple Statistical Techniques
Degree Centrality
LexRank
Lexical/Co-reference Chains
Rhetorical Structure Theory
WordNet Based Methods
DUC/TAC
Motivation






Abstracts for scientific and other articles
News summarization (mostly multi-document summarization)
Classification of articles and other written data
Web pages for search engines
Web access from PDAs and cell phones
Question answering and data gathering
Genres





Indicative vs. informative
 used for quick categorization vs. content processing.
Extract vs. abstract
 lists fragments of text vs. re-phrases content coherently.
Generic vs. query-oriented
 provides the author's view vs. reflects the user's interest.
Background vs. just-the-news
 assumes the reader's prior knowledge is poor vs. up-to-date.
Single-document vs. multi-document source
 based on one text vs. fuses together many texts.
Statistical scoring

Scoring techniques




Word frequencies throughout the text (Luhn58)
Position in the text (Edmundson69)
Title method (Edmundson69)
Cue phrases in sentences (Edmundson69)
Luhn58


Important words occur fairly frequently in the text (a rough sketch follows)
Earliest work in the field
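As a rough illustration of Luhn's idea (not his exact algorithm), the sketch below scores each sentence by the density of frequently occurring content words it contains; the stop list, tokenizer, and the top_k cutoff are illustrative assumptions.

    from collections import Counter
    import re

    # Illustrative stop list; Luhn's original setup differs.
    STOP = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "that", "it", "for"}

    def luhn_scores(sentences, top_k=20):
        words = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
        freq = Counter(w for ws in words for w in ws if w not in STOP)
        significant = {w for w, _ in freq.most_common(top_k)}
        # Score each sentence by the density of significant words it contains.
        return [sum(w in significant for w in ws) / (len(ws) or 1) for ws in words]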
Statistical Approaches (contd.)



Degree Centrality
LexRank
Continuous LexRank
Degree Centrality

Problem Formulation



Represent each sentence by a vector
Denote each sentence as a node of a graph
Cosine similarity between sentence vectors determines the edges between nodes
Degree Centrality

Since we are interested in significant similarities, we can eliminate the low values in this matrix by defining a threshold.
Degree Centrality


Compute the degree of each sentence (node)
Pick the nodes (sentences) with high degree, as sketched below
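A minimal sketch of this pipeline, assuming plain bag-of-words sentence vectors and an illustrative similarity threshold of 0.1 (the slides fix neither choice):

    import numpy as np
    from collections import Counter

    def degree_centrality(sentences, threshold=0.1):
        vocab = sorted({w for s in sentences for w in s.lower().split()})
        index = {w: i for i, w in enumerate(vocab)}
        X = np.zeros((len(sentences), len(vocab)))
        for i, s in enumerate(sentences):
            for w, c in Counter(s.lower().split()).items():
                X[i, index[w]] = c
        X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
        sim = X @ X.T                           # cosine similarity matrix
        adj = (sim > threshold).astype(int)     # keep only the significant edges
        np.fill_diagonal(adj, 0)
        degrees = adj.sum(axis=1)               # degree of each sentence node
        return degrees.argsort()[::-1]          # sentence indices ranked by degree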
Degree Centrality

Disadvantage of the degree centrality approach: all edges count equally, so a sentence that is similar to many unimportant sentences can still score highly; the centrality of the neighboring sentences is ignored.
LexRank

The centrality vector p gives the LexRank of each sentence (similar to PageRank), defined by p(u) = Σ_{v ∈ adj(u)} p(v) / deg(v), i.e. in matrix form p = Bᵀp, where B is the adjacency matrix of the similarity graph with each row divided by the corresponding node's degree.
What Should B Satisfy?



Stochastic matrix, so that B is the transition matrix of a Markov chain
Irreducible
Aperiodic
Perron-Frobenius Theorem

An irreducible and aperiodic Markov
chain is guaranteed to converge to a
stationary distribution
Reducibility: a chain is irreducible if every state is reachable from every other state
Aperiodicity: a chain is aperiodic if the greatest common divisor of its return times to any state is 1
LexRank



B is a stochastic matrix
Is it also an irreducible and aperiodic matrix?
Damping (Page et al. 1998): p(u) = d/N + (1 - d) Σ_{v ∈ adj(u)} p(v)/deg(v)
Matrix form with damping: p = [d·U/N + (1 - d)·B]ᵀ p, where U is the all-ones matrix and N the number of sentences
Solve for p using the power method (sketched below)
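A sketch of solving the damped equation by the power method. B is assumed to be the row-stochastic matrix described above (thresholded adjacency divided by node degree); the damping factor value here is illustrative.

    import numpy as np

    def lexrank(B, d=0.15, tol=1e-6, max_iter=100):
        n = B.shape[0]
        # Damped transition matrix: jump to a random sentence with probability d,
        # otherwise follow an edge of the similarity graph.
        M = d / n * np.ones((n, n)) + (1 - d) * B
        p = np.ones(n) / n
        for _ in range(max_iter):
            p_next = M.T @ p                    # one power-method step
            if np.linalg.norm(p_next - p, 1) < tol:
                break
            p = p_next
        return p_next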
Continuous LexRank

Instead of thresholding the similarity graph, use the cosine similarity values themselves (row-normalized) as the edge weights in B
Linguistic/Semantic Methods


Co-reference / Lexical Chains
Rhetorical Analysis
Co-reference/Lexical Chains



Assumption/Observation: important parts of a text are more closely related under a semantic interpretation
Co-reference / lexical chains link related expressions (object-action, part-of, and semantically related terms)
Important sentences are traversed by a larger number of such chains, as sketched below
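A rough sketch of the idea, heavily simplified from Barzilay & Elhadad: nouns are grouped into chains when their WordNet synsets or direct hypernyms overlap, and each sentence is scored by the number of chains passing through it. The relatedness test and chain-building order are assumptions made for illustration, and the usual NLTK data downloads are required.

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet as wn

    def related(w1, w2):
        # Two nouns are "related" if their synsets or direct hypernyms overlap.
        s1, s2 = set(wn.synsets(w1, wn.NOUN)), set(wn.synsets(w2, wn.NOUN))
        h1 = {h for s in s1 for h in s.hypernyms()}
        h2 = {h for s in s2 for h in s.hypernyms()}
        return bool(s1 & s2 or s1 & h2 or s2 & h1)

    def chain_scores(sentences):
        chains = []                             # each chain is a set of nouns
        for s in sentences:
            nouns = [w.lower() for w, t in pos_tag(word_tokenize(s)) if t.startswith("NN")]
            for n in nouns:
                for chain in chains:
                    if any(related(n, m) for m in chain):
                        chain.add(n)
                        break
                else:
                    chains.append({n})
        # Score a sentence by the number of chains that traverse it.
        return [sum(any(w in chain for w in s.lower().split()) for chain in chains)
                for s in sentences]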
Co-reference/Lexical Chains

Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
Rhetorical Structure Theory


Mann & Thompson 88
Rhetorical relation
Holds between two non-overlapping text snippets
Nucleus: the core idea, the writer's purpose
Satellite: referred to in the context of the nucleus, for justifying, evidencing, contradicting, etc.
Rhetorical Structure Theory



The nucleus of a rhetorical relation is comprehensible independent of the satellite, but not vice versa
Not all rhetorical relations are nucleus-satellite relations; Contrast, for instance, is a multinuclear relation
Example (Evidence): [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one's life:] [we know that 3,000 teens start smoking each day.]
Rhetorical Structure Theory

Rhetoric Parsing



Breaks the text into elementary units
Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations
Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text
Example

[RS-tree figure: the elementary units below are linked by Elaboration, Background, Justification, Concession, Contrast, Evidence, Cause, and Antithesis relations, with unit (2) promoted to the root.]
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions.
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure.
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.
RST Based Summarization




Multiple RS-trees
A built RS-tree captures the relations in the text and can be used for high-quality summarization
Pick the K nodes (units) nearest to the root, as in the sketch after this list
Disadvantages
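A toy sketch of picking the K units nearest the root: each internal node promotes the units of its nucleus children, and a unit's importance is the depth of the shallowest node that promotes it. The nested-dict tree encoding is an assumption made for illustration, not a format given in the slides.

    def promotion(node):
        # Units promoted to this node: its own unit if it is a leaf,
        # otherwise the union of the promotions of its nucleus children.
        if "unit" in node:
            return {node["unit"]}
        return set().union(*(promotion(c) for c in node["children"] if c["nucleus"]))

    def unit_depths(node, depth=0, best=None):
        # Record, for every unit, the shallowest depth at which it is promoted.
        best = {} if best is None else best
        for u in promotion(node):
            best[u] = min(best.get(u, depth), depth)
        for c in node.get("children", []):
            unit_depths(c, depth + 1, best)
        return best

    def select_k(tree, k):
        depths = unit_depths(tree)
        return sorted(depths, key=depths.get)[:k]   # the k units nearest the root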
WordNet-based Approach for Summarization





Preprocessing of text
Constructing sub-graph from WordNet
Synset Ranking
Sentence Selection
Principal Component Analysis
Preprocessing





Break text into sentences
Apply POS tagging
Identify collocations in the text
Remove stop words
The sequence of these steps is important (see the sketch below)
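A minimal preprocessing sketch with NLTK, following the order above. Treating the highest-PMI bigrams as collocations is an illustrative choice (the slides do not fix a method), and the usual NLTK data packages are assumed to be installed.

    from nltk import sent_tokenize, word_tokenize, pos_tag
    from nltk.corpus import stopwords
    from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

    def preprocess(text):
        sentences = sent_tokenize(text)                          # 1. sentence splitting
        tagged = [pos_tag(word_tokenize(s)) for s in sentences]  # 2. POS tagging
        tokens = [w.lower() for sent in tagged for w, _ in sent]
        finder = BigramCollocationFinder.from_words(tokens)      # 3. collocations
        collocations = finder.nbest(BigramAssocMeasures.pmi, 10)
        stop = set(stopwords.words("english"))                   # 4. stop-word removal
        content = [[(w, t) for w, t in sent if w.lower() not in stop] for sent in tagged]
        return sentences, content, collocations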
Constructing sub-graph from
WordNet



Mark, in the WordNet graph, all the words and collocations that are present in the text
Traverse the generalization (hypernym) edges up to a fixed depth, marking the synsets visited
Construct a graph containing only the marked synsets (sketched below)
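A sketch of this construction using NLTK's WordNet interface, with hypernym edges standing in for the generalization edges; the fixed depth of 3 is an illustrative value.

    from nltk.corpus import wordnet as wn

    def wordnet_subgraph(words, depth=3):
        marked = set()
        # Start from the synsets of the words/collocations found in the text.
        frontier = {s for w in words for s in wn.synsets(w)}
        for _ in range(depth + 1):
            marked |= frontier
            frontier = {h for s in frontier for h in s.hypernyms()} - marked
        # Keep only the generalization edges between two marked synsets.
        edges = {(s, h) for s in marked for h in s.hypernyms() if h in marked}
        return marked, edges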
Synset Ranking



Rank synsets based on their relevance to the text
Construct a rank vector R with one entry per node of the graph, each initialized to 1/√n, where n is the number of nodes in the graph
Create an authority matrix A with A(i,j) = 1/num_of_predecessors(j) if j is a child of i, and 0 otherwise
Synset Ranking


Update the R vector iteratively (by repeated multiplication with the authority matrix A) until it converges; a sketch follows
A higher value implies a better rank and higher relevance
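A sketch of this update read as a power iteration over the authority matrix A; the renormalization and convergence test are assumptions made for illustration, and the cited paper's exact update rule may differ.

    import numpy as np

    def rank_synsets(A, tol=1e-6, max_iter=100):
        n = A.shape[0]
        R = np.full(n, 1 / np.sqrt(n))           # initialized to 1/sqrt(n)
        for _ in range(max_iter):
            R_next = A @ R                        # propagate rank through the graph
            norm = np.linalg.norm(R_next)
            R_next = R_next / norm if norm > 0 else R_next
            if np.linalg.norm(R_next - R) < tol:
                break
            R = R_next
        return R_next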
Sentence Selection



Construct a matrix M with m rows and n columns
m is the number of sentences and n is the number of nodes (synsets)
For each sentence Si:
Traverse the graph G, starting with the words present in Si and following generalization edges
Find the set of reachable synsets, SYi
For each syij ∈ SYi, set M[Si][syij] to the rank of syij calculated in the previous step (sketched below)
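A sketch of building M, reusing the hypernym traversal from the sub-graph step. The arguments node_index (synset to column index) and ranks (the R vector) are assumed to come from the previous steps, and the whitespace tokenization is a simplification.

    import numpy as np
    from nltk.corpus import wordnet as wn

    def build_M(sentences, node_index, ranks, depth=3):
        M = np.zeros((len(sentences), len(node_index)))     # m sentences x n synsets
        for i, sent in enumerate(sentences):
            reachable = set()
            frontier = {s for w in sent.lower().split() for s in wn.synsets(w)}
            for _ in range(depth + 1):
                reachable |= frontier
                frontier = {h for s in frontier for h in s.hypernyms()} - reachable
            for syn in reachable:
                if syn in node_index:                        # only synsets kept in the sub-graph
                    M[i, node_index[syn]] = ranks[node_index[syn]]
        return M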
Principal Component Analysis




Apply PCA to the matrix M to obtain a set of principal components (eigenvectors)
The eigenvalue of each eigenvector is a measure of that eigenvector's relevance to the meaning of the text
Sort the eigenvectors according to their eigenvalues
For each eigenvector, find its projection on each sentence
Principal Component Analysis



Select the top n_select sentences for each eigenvector
n_select is proportional to the eigenvalue of the eigenvector
n_select(i) = λi / Σj λj (as a fraction of the summary length), where λi is the eigenvalue corresponding to eigenvector i (see the sketch below)
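A sketch of the PCA-based selection: eigendecompose the covariance of M, project the sentences onto each component, and draw from each component a number of sentences proportional to its eigenvalue. Column-centering and the rounding of n_select are illustrative choices not fixed by the slides.

    import numpy as np

    def pca_select(M, summary_len):
        centered = M - M.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        order = np.argsort(eigvals)[::-1]                # sort components by eigenvalue
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        chosen = []
        for lam, vec in zip(eigvals, eigvecs.T):
            n_select = int(round(summary_len * lam / eigvals.sum()))
            projections = centered @ vec                 # projection on each sentence
            for idx in np.argsort(-np.abs(projections)):
                if n_select <= 0 or len(chosen) >= summary_len:
                    break
                if idx not in chosen:
                    chosen.append(int(idx))
                    n_select -= 1
        return chosen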
Document Understanding Conference (DUC)


Text Analysis Conference (TAC)
Interest and activity aimed at building powerful multi-purpose information systems
Evaluation results of various summarization techniques

www-nlpir.nist.gov/projects/duc/data.html
Human Summary of Our
Presentation :)



What is Text Summarization?
Why Text Summarization?
Methods for Summarization




LexRank
Lexical Chains
Rhetorical Structure Theory
WordNet-based
Challenges ahead..






Ensuring text coherence
Extracted sentences may have dangling anaphors
Summarizing non-textual data
Handling multiple sources effectively
High reduction rates are needed
Achieving human-quality summarization!
References



Erkan, G. and D. R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457-479.
Barzilay, R. and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10-17. Madrid, Spain.
Mann, W.C. and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243-281. Also available as USC/Information Sciences Institute Research Report RR-87-190.
References



Baldwin, B. and T. Morton. 1998. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds.), TIPSTER-SUMMAC Summarization Evaluation: Proceedings of the TIPSTER Text Phase III Workshop. Washington.
Marcu, D. 1998. Improving Summarization Through Rhetorical Parsing Tuning. In Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.
Ramakrishnan and Bhattacharya. 2003. Text Representation with WordNet Synsets. In Proceedings of the Eighth International Conference on Applications of Natural Language to Information Systems (NLDB 2003).
References



Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, and Ramakrishnan. 2004. Generic Text Summarization Using WordNet.
Mani, Inderjeet and Mark T. Maybury (eds.). 1999. Advances in Automatic Text Summarization. MIT Press. ISBN 0-262-13359-8.
www.wikipedia.com
Thank You