Corpora and Statistical Methods Lecture 12


Corpora and Statistical Methods
Lecture 12
Albert Gatt
Part 2
Automatic summarisation
The task
 Given a single document or collection of documents, return an
abridged version that distils the most important information
(possibly for a particular task/user)
 Summarisation systems perform:
1. Content selection: choosing the relevant information in the source document(s), typically in the form of sentences/clauses.
2. Information ordering
3. Sentence realisation: cleaning up the sentences to make them fluent.
 Note the similarity to NLG architectures.
 Main difference: summarisation input is text, whereas NLG input is non-linguistic data.
Types of summaries
 Extractive vs. Abstractive
 Extractive: select informative sentences/clauses in the source
document and reproduce them
 most current systems (and our focus today)
 Abstractive: summarise the subject matter (usually using new
sentences)
 much harder, as it involves deeper analysis & generation
 Dimensions
 Single-document vs. multi-document
 Context
 Query-specific vs. query-independent
Extracts vs Abstracts: Lincoln’s Gettysburg Address
[Figure: an extract and an abstract of the Gettysburg Address shown side by side]
Source: Jurafsky & Martin (2009), p. 823
A Summarization Machine
[Diagram: a generic summarisation machine. Input: a document, multiple documents, and possibly a query; a compression parameter (e.g. 10%, 50%, 100%) controls summary length. Output: extracts or abstracts, which may be very brief (headline), brief, or long; indicative or informative; generic or query-oriented; background or just-the-news. Intermediate representations include case frames, templates, core concepts, core events, relationships, clause fragments and index terms.]
Adapted from: Hovy & Marcu (1998). Automated text summarization. COLING-ACL Tutorial. http://www.isi.edu/~marcu/
The Modules of the Summarization Machine
[Diagram: the input document(s) pass through extraction and filtering to produce extracts; interpretation maps extracts to deeper representations (case frames, templates, core concepts, core events, relationships, clause fragments, index terms), from which generation produces abstracts.]
Unsupervised single-document summarisation I
“Bag of words” approaches
Basic architecture for single-doc
[Diagram: the basic pipeline. Content selection is the central task in single-document summarisation and can be supervised or unsupervised. Information ordering is less critical: since we have only one document, we can rely on the order in which sentences occur in the source itself.]
Unsupervised content selection I: Topic
Signatures
 Simplest unsupervised algorithm:
 Split document into sentences.
 Select those sentences which contain the most
salient/informative words.
 Salient term = a term in the topic signature (words that are
crucial to identifying the topic of the document)
 Topic signature detection:
 Represent sentences (documents) as word vectors
 Compute the weight of each word
 Weight sentences by the average weight of their (non-stop)
words.
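A minimal Python sketch of this sentence-weighting step. The stop list, the tokenisation and the summary length are illustrative assumptions, and term_weights stands for whatever topic-signature weighting (tf-idf, LLR) is used.

```python
# Sketch: score sentences by the average weight of their non-stop words,
# then keep the top-scoring sentences in their original order.
import re

STOPWORDS = {"the", "a", "an", "to", "it", "in", "and", "of", "is"}  # toy stop list (assumption)

def score_sentence(sentence, term_weights):
    """Average weight of the sentence's non-stop words."""
    words = [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]
    if not words:
        return 0.0
    return sum(term_weights.get(w, 0.0) for w in words) / len(words)

def select_sentences(sentences, term_weights, summary_len=3):
    """Return the highest-scoring sentences, preserving document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], term_weights),
                    reverse=True)
    chosen = sorted(ranked[:summary_len])
    return [sentences[i] for i in chosen]
```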
Vector space revisited
Document collection
 Doc 1: To make fried
chicken, take the chicken,
chop it up and put it in a
pan until golden. Remove
the fried chicken pieces
and serve hot.
 Doc 2: To make roast
chicken, take the chicken
and put in the oven until
golden. Remove the
chicken and serve hot.
Key terms × documents (rows = term frequencies, columns = documents):

            Doc 1   Doc 2
 fried        2       0
 roast        0       1
 chicken      3       3
 pan          1       0
 golden       1       1
 oven         0       1

NB: Stop list to remove very high-frequency words!
Term weighting: tf-idf
 Common term weighting
scheme used in the
information retrieval
literature.
 tf (term frequency) =
freq. of term in document
 idf (inverse document
frequency) = log(N/ni)
 N = no. of documents
 ni = no. of docs in which term
i occurs
$w_{i,j} = tf_{i,j} \times idf_i$

tf-idf weights for the toy collection (rows = terms, columns = documents):

            Doc 1   Doc 2
 fried       0.6     0
 roast       0       0.3
 chicken     0       0
 pan         0.3     0
 golden      0       0
 oven        0       0.3
Method:
1. Count frequency of term in the doc
being considered.
2. Count inverse doc frequency over
whole document collection
3. Compute tf-idf score
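A small Python sketch of the tf-idf computation on the toy collection above; raw term frequency and base-10 logarithms are assumptions (several tf-idf variants exist).

```python
# Sketch: tf-idf weights for a small document collection.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

doc1 = "fried chicken chicken chop pan golden fried chicken".split()
doc2 = "roast chicken chicken oven golden chicken".split()
print(tfidf([doc1, doc2])[0]["fried"])  # 2 * log10(2/1) ≈ 0.60, as in the table above
```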
Term weighting: log likelihood ratio
 Requirements:
 A background corpus
 In our case, for a term w, LLR is the ratio between:
 Prob. of observing w in the input corpus
 Prob. of observing w in the background corpus
 Since LLR is asymptotically chi-square distributed, if the LLR value is significant, we treat the term as a key term.
 Chi-square values tend to be significant at p = .001 if they are greater than 10.8

$$weight(w_i) = \begin{cases} 1 & \text{if } -2 \log \lambda(w_i) > 10.8 \\ 0 & \text{otherwise} \end{cases}$$

$$weight(s_i) = \frac{\sum_{w \in s_i} weight(w)}{|\{w \mid w \in s_i\}|}$$
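A hedged sketch of LLR-based term weighting, assuming token counts for the input and the background corpus are available as Counters; the binomial form of Dunning's -2 log λ is one standard way to compute it, and 10.8 is the cutoff quoted above.

```python
# Sketch: Dunning-style log-likelihood ratio for topic-signature terms.
import math

def _ll(k, n, p):
    """Binomial log-likelihood of k successes in n trials with probability p."""
    eps = 1e-12  # guard against log(0)
    return k * math.log(max(p, eps)) + (n - k) * math.log(max(1 - p, eps))

def llr(k1, n1, k2, n2):
    """-2 log lambda comparing a term's rate in the input vs. the background."""
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (_ll(k1, n1, p1) + _ll(k2, n2, p2) - _ll(k1, n1, p) - _ll(k2, n2, p))

def term_weight(word, input_counts, bg_counts, threshold=10.8):
    """weight(w) = 1 if the LLR value is significant, 0 otherwise."""
    n1, n2 = sum(input_counts.values()), sum(bg_counts.values())
    score = llr(input_counts[word], n1, bg_counts.get(word, 0), n2)
    return 1 if score > threshold else 0
```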
Sentence centrality
 Instead of weighting sentences by averaging individual term weights, we can compute the pairwise distance between sentences and choose those sentences which are closer to each other on average.
 Example: represent sentences as tf-idf vectors and compute the cosine for each sentence x in relation to all other sentences y:

$$centrality(x) = \frac{1}{K} \sum_{y} \mathit{tf\text{-}idf\text{-}cosine}(x, y)$$

 where K = total no. of sentences
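A small sketch of this centrality computation, assuming sentence vectors are {term: tf-idf weight} dictionaries (e.g. produced by the tfidf helper sketched earlier).

```python
# Sketch: average tf-idf cosine of each sentence to the other sentences.
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centrality(vectors):
    """centrality(x) = (1/K) * sum of cosines to all other sentences."""
    K = len(vectors)
    return [sum(cosine(x, y) for j, y in enumerate(vectors) if j != i) / K
            for i, x in enumerate(vectors)]
```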
Unsupervised single-document
summarisation II
Using rhetorical structure
Rhetorical Structure Theory
 RST (Mann and Thompson 1988) is a theory of text
structure
 Not about what texts are about but
 How bits of the underlying content of a text are structured so
as to hang together in a coherent way.
 The main claim of RST:
 Parts of a text are related to each other in predetermined ways.
 There is a finite set of such relations.
 Relations hold between two spans of text
 Nucleus
 Satellite
A small example
You should visit the new exhibition. It’s excellent. It got very good reviews.
It’s completely free.
[RST tree: "You should visit the new exhibition" is the nucleus; MOTIVATION relates it to "It's excellent", which is supported by "It got very good reviews" via EVIDENCE; "It's completely free" is linked via ENABLEMENT.]
An RST relation definition
MOTIVATION
 Nucleus represents an action which the hearer is meant to do at
some point in future.
 You should go to the exhibition
 Satellite represents something which is meant to make the hearer
want to carry out the nucleus action.
 It’s excellent. It got a good review.
 Note: the satellite need not be a single clause. In our example, the satellite has 2 clauses. They themselves are related to each other by the EVIDENCE relation.
 Effect: to increase the hearer’s desire to perform the nucleus
action.
RST relations more generally
 An RST relation is defined in terms of the
 Nucleus + constraints on the nucleus
 (e.g. Nucleus of motivation is some action to be performed by H)
 Satellite + constraints on satellite
 Desired effect.
 Other examples of RST relations:
 CAUSE: the nucleus is the result; the satellite is the cause
 ELABORATION: the satellite gives more information about the nucleus
 Some relations are multi-nuclear
 Do not relate a nucleus and satellite, but two or more nuclei (i.e. 2 pieces of
information of the same status).
 Example: SEQUENCE
 John walked into the room. He turned on the light.
Some more on RST
 RST relations are neutral with respect to their realisation.
 E.g. you can express EVIDENCE in lots of different ways:
[RST tree: EVIDENCE relates "It's excellent" (nucleus) to "It got very good reviews" (satellite).]
 It's excellent. It got very good reviews.
 You can see that it's excellent from its great reviews.
 Its excellence is evidenced by the good reviews it got.
 It must be excellent since it got good reviews.
RST for unsupervised content selection
1. Compute coherence relations between units (= clauses)
 Can use a discourse parser and/or rely on cue phrases
 Corpora annotated with RST relations exist
2. Use the intuition that the nucleus of a relation is more central to the content than the satellite to identify the set of salient units Sal:
 Base case: if n is a leaf node, then Sal(n) = {n}
 Recursive case: if n is a non-leaf node, then

$$Sal(n) = \bigcup_{c \in Nuc\text{-}Child(n)} Sal(c)$$

3. Rank nodes in Sal(n): the higher the node of which n is a nucleus, the more salient it is
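A minimal sketch of the Sal(n) recursion; the Node class is a hypothetical stand-in for whatever structure a discourse parser would return.

```python
# Sketch: recursively collect the units promoted through nuclear children.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    nucleus_children: list = field(default_factory=list)    # nuclear daughters
    satellite_children: list = field(default_factory=list)  # satellite daughters

    @property
    def is_leaf(self):
        return not (self.nucleus_children or self.satellite_children)

def sal(n):
    """Sal(n): {n} for a leaf, otherwise the union of Sal over n's nuclear children."""
    if n.is_leaf:
        return {n.label}
    units = set()
    for child in n.nucleus_children:
        units |= sal(child)
    return units
```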
Rhetorical structure: example
[Figure: an RST tree over a short text, with units ranked by salience: 2 > 8 > 3 ...]
Supervised content selection
Basic idea
 Input: a training set consisting of:
 Document + human-produced (extractive) summaries
 So sentences in each doc can be marked with a binary feature (1
= included in summary; 0 = not included)
 Train a machine learner to classify sentences as 1 (extract-
worthy) or 0, based on features.
Features
 Position: important sentences tend to occur early in a document (but this is genre-dependent). E.g. in news articles the most important sentence is the title.
 Cue phrases: sentences with phrases like to summarise give important summary info. (Again, genre-dependent: different genres have different cue phrases.)
 Word informativeness: words in the sentence which belong to the doc's topic signature
 Sentence length: we usually want to avoid very short sentences
 Cohesion: we can use lexical chains to compute how many words in a sentence also occur in the document's lexical chain
 Lexical chain: a series of words that are indicative of the document's topic
Algorithms
 Once we have the feature set F, we want to compute:

$$P(include(s) \mid F)$$
 Many methods we’ve discussed will do!
 Naive Bayes
 Maximum Entropy
 ...
Which corpus?
 There are some corpora with extractive summaries, but
often we come up against the problem of not having the right
data.
 Many types of text in themselves contain summaries, e.g.
scientific articles have abstracts
 But these are not purely extractive!
 (though people tend to include sentences in abstracts that are
very similar to the sentences in their text).
 Possible method: align sentences in an abstract with
sentences in the document, by computing their overlap (e.g.
using n-grams)
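A small sketch of this alignment idea; the bigram-overlap measure and the 0.5 threshold are assumptions.

```python
# Sketch: label a document sentence as extract-worthy if some abstract
# sentence shares enough n-grams with it.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b, n=2):
    """Fraction of the abstract sentence's n-grams that also occur in b."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A) if A else 0.0

def align(abstract_sents, doc_sents, threshold=0.5, n=2):
    """Return a 1/0 label per document sentence."""
    labels = []
    for d in doc_sents:
        best = max((overlap(a.split(), d.split(), n) for a in abstract_sents),
                   default=0.0)
        labels.append(1 if best >= threshold else 0)
    return labels
```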
Realisation
Sentence simplification
Realisation
 With single-doc summarisation, realisation isn't a big problem (we're reproducing sentences from the source document).
 But we may want to simplify (or compress) the sentences.
 Simplest method is to use heuristics, e.g.:
 Appositives: Rajam, 28, an artist who lives in Philadelphia, found
inspiration in the back of city magazines.
 Sentential adverbs: As a matter of fact, this policy will be ruinous.
 A lot of current research on simplification/compression, often
using parsers to identify dependencies that can be omitted with
little loss of information.
 Realisation is much more of an issue in multi-document
summarisation.
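A very rough sketch of the heuristic approach; the cue-phrase list and the comma-bounded appositive pattern are crude assumptions standing in for a parser-based method.

```python
# Sketch: heuristic sentence compression by dropping sentential adverbs
# and short comma-bounded appositives.
import re

SENTENTIAL_ADVERBS = ["as a matter of fact,", "in fact,", "of course,"]  # toy list (assumption)

def compress(sentence):
    s = sentence
    for cue in SENTENTIAL_ADVERBS:                       # drop sentential adverbs
        s = re.sub(re.escape(cue) + r"\s*", "", s, flags=re.IGNORECASE)
    s = re.sub(r"(?<=\w), [^,]{1,40},", "", s)           # drop a short appositive (very rough)
    return s[0].upper() + s[1:] if s else s

print(compress("As a matter of fact, this policy will be ruinous."))
print(compress("Rajam, 28, found inspiration in the back of city magazines."))
```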
Multi-document summarisation
Why multi-document
 Very useful when:
 queries return multiple documents from the web
 Several articles talk about the same topic (e.g. a disease)
 ...
 The steps are the same as for single-doc summarisation, but:
 We’re selecting content from more than one source
 We can’t rely on the source documents only for ordering
 Realisation is required to ensure coherence.
Content selection
 Since we have multiple docs, we have a problem with
redundancy: repeated info in several documents; overlapping
words, sentences, phrases...
 We can modify sentence scoring methods to penalise redundancy,
by comparing a candidate sentence to sentences already selected.
 Methods:
 Modify sentence score to penalise redundancy:

$$penalty(s) = \max_{s_i \in summary} Sim(s, s_i)$$

 (the sentence is compared to sentences already chosen for the summary)
 Use clustering to group related sentences, and then perform selection
on clusters.
 More on clustering next week.
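An MMR-style sketch of greedy, redundancy-penalised selection; the trade-off weight lam and the reuse of the cosine similarity sketched earlier are assumptions.

```python
# Sketch: greedily pick sentences, penalising similarity to those already chosen.
def select_nonredundant(vectors, scores, k=5, lam=0.7):
    """vectors: tf-idf dicts per sentence; scores: base relevance scores."""
    chosen = []
    candidates = list(range(len(vectors)))
    while candidates and len(chosen) < k:
        def penalised(i):
            penalty = max((cosine(vectors[i], vectors[j]) for j in chosen), default=0.0)
            return scores[i] - lam * penalty
        best = max(candidates, key=penalised)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```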
Information ordering
 If sentences are selected from multiple documents, we risk creating an incoherent
document.
1. Rhetorical structure:
 *Therefore, I slept. I was tired.
 I was tired. Therefore, I slept.
2. Lexical cohesion:
 *We had chicken for dinner. Paul was late. It was roasted.
 We had chicken for dinner. It was roasted. Paul was late.
3. Referring expressions:
 *He said that ... . George W. Bush was speaking at a meeting.
 George W. Bush said that ... . He was speaking at a meeting.
These heuristics can be combined.
We can also do information ordering during the content selection process itself.
Information ordering based on
reference
 Referring expressions (NPs that identify objects) include
pronouns, names, definite NPs...
 Centering Theory (Grosz et al 1995): every discourse segment
has a focus (what the segment is “about”).
 Entities are salient in discourse depending on their position in
the sentence: SUBJECT >> OBJECT >> OTHER
 A coherent discourse is one which, as far as possible, maintains
smooth transitions between sentences.
Information ordering based on lexical
cohesion
 Sentences which are “about” the same things tend to occur
together in a document.
 Possible method:
 use tf-idf cosine to compute pairwise similarity between
selected sentences
 attempt to order sentences to maximise the similarity between
adjacent pairs.
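A greedy sketch of this ordering idea, again reusing the cosine similarity from earlier; a nearest-neighbour walk is only an approximation, since finding the globally best order is a harder search problem.

```python
# Sketch: order selected sentences so that adjacent sentences are similar.
def order_by_cohesion(vectors):
    """vectors: tf-idf dicts for the selected sentences; returns an index order."""
    remaining = list(range(len(vectors)))
    order = [remaining.pop(0)]                 # start from the first selected sentence
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: cosine(vectors[last], vectors[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```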
Realisation
 Compare: [Figure: two versions of the same multi-document summary, before and after realisation]
Source: Jurafsky & Martin (2009), p. 835
Uses of realisation
 Since sentences come from different documents, we may end up with infelicitous NP orderings (e.g. a pronoun before a definite). One possible solution:
1. Run a coreference resolver on the extracted summary
2. Identify reference chains (NPs referring to the same entity)
3. Replace or reorder NPs if they violate coherence.
 E.g. use the full name before a pronoun
 Another interesting problem is sentence aggregation or fusion, where different phrases (from different sources) are combined into a single phrase.
Evaluating summarisation
Evaluation baselines
 Random sentences:
 If we’re producing summaries of length N, we use as baseline a
random extractor that pulls out N sentences.
 Not too difficult to beat.
 Leading sentences:
 Choose the first N sentences.
 Much more difficult to beat!
 A lot of informative sentences are at the beginning of documents.
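Both baselines are trivial to implement; a toy sketch:

```python
# Sketch: the two standard extractive baselines.
import random

def random_baseline(sentences, n):
    """Pull out n sentences at random."""
    return random.sample(sentences, min(n, len(sentences)))

def leading_baseline(sentences, n):
    """Choose the first n sentences."""
    return sentences[:n]
```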
Some terminology (reminder)
 Intrinsic evaluation: evaluation of output in its own right,
independent of a task (e.g. Compare output to human
output).
 Extrinsic evaluation: evaluation of output in a particular task
(e.g. Humans answer questions after reading a summary)
 We've seen the uses of BLEU (intrinsic) for realisation in NLG.
 A similar metric in summarisation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
BLEU vs ROUGE
BLEU
 Precision-oriented
 Looks at n-gram overlap for different values of n up to some maximum
 Measures the average n-gram overlap between an output text and a set of reference texts
ROUGE
 Recall-oriented
 N-gram length is fixed: ROUGE-1, ROUGE-2, etc. (for different n-gram lengths)
 Measures how many n-grams of the reference summaries an output summary contains
ROUGE

$$\mathit{ROUGE\text{-}2} = \frac{\sum_{s \in REF} \sum_{bigram \in s} Count_{match}(bigram)}{\sum_{s \in REF} \sum_{bigram \in s} Count(bigram)}$$
 Generalises easily to any n-gram length.
 Other versions:
 ROUGE-L: measures longest common subsequence between
reference summary and output
 ROUGE-SU: uses skip bigrams
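A small sketch of ROUGE-2 as defined above; whitespace tokenisation and clipping the matched bigram counts against the candidate are assumptions about details the formula leaves open.

```python
# Sketch: ROUGE-2 = matched reference bigrams / total reference bigrams.
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, references):
    """candidate: token list; references: list of token lists."""
    cand = bigrams(candidate)
    matched, total = 0, 0
    for ref in references:
        ref_bi = bigrams(ref)
        total += sum(ref_bi.values())
        matched += sum(min(c, cand.get(bg, 0)) for bg, c in ref_bi.items())
    return matched / total if total else 0.0
```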
Intrinsic vs. Extrinsic again
 Problem: ROUGE assumes that reference summaries are “gold
standards”, but people often disagree about summaries, including
wording.
 Same questions arise as for NLG (and MT):
 To what extent does this metric actually tell us about the effectiveness
of a summary?
 Some recent work has shown that the correlation between ROUGE
and a measure of relevance given by humans is quite low.
 See: Dorr et al. (2005). A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate? Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 1–8, Ann Arbor, June 2005.
The Pyramid method (Nenkova et al)
 Also intrinsic, but relies on semantic content units instead of n-
grams.
1. Human annotators label SCUs in sentences from human summaries.
 Based on identifying the content of different sentences, and grouping together sentences in different summaries that talk about the same thing.
 Goes beyond surface wording!
2. Find SCUs in the automatic summaries.
3. Weight SCUs.
4. Compute the ratio of the sum of weights of SCUs in the automatic summary to the weight of an optimal summary of roughly the same length.
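A minimal sketch of the final scoring step, assuming the SCU annotation has already been done and each SCU's weight is the number of human summaries it appears in; summary_size stands in for the number of SCUs an optimal summary of the same length would contain.

```python
# Sketch: pyramid score = achieved SCU weight / weight of an optimal summary
# containing the same number of SCUs.
def pyramid_score(summary_scu_weights, all_scu_weights, summary_size):
    achieved = sum(summary_scu_weights)
    optimal = sum(sorted(all_scu_weights, reverse=True)[:summary_size])
    return achieved / optimal if optimal else 0.0
```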