GIDA Project (David)

GIDA
IST-2000-31123
A Financial News Summarisation System based on Lexical Cohesion
Paulo Cesar Fernandes de Oliveira, Khurshid Ahmad, Lee Gillam
TKE Conference, 28 – 30 August 2002, Nancy - France
Author(s): PO, KA, LG
Type of Document: PPT
Organisation: University of Surrey
Date: 30.08.2002
Number of Annexes: 0
Version: 1
Number of pages:
Document Number:
GIDA Task:
This document is property of the GIDA consortium. It cannot be used, copied or referred to without prior written authorisation.
Introduction
“Stock market news has gone from
hard to find (in the 1970s and early
1980s), then easy to find (in the late
1980s), then hard to get away from”.
(From Peter Lynch (2000))
• Growth in the volume of financial news
• Consequence of this growth → the need for text summarisation
Introduction
Automatic Summarisation
Get an information source;
Extract some content from it;
Present the most important part to the user
Introduction
• What is a summary?
A summary is a text that is
produced from one or more texts,
that contains a significant portion
of the information in the original
text(s).
(From Hovy and Lin (1998))
Introduction
• What constitutes a good summary?
Mrs. Coolidge: what did the preacher discuss in his sermon?
President Coolidge: sin.
Mrs. Coolidge: what did he say?
President Coolidge: he said he was against it.
President Calvin Coolidge,
Grace Coolidge, and dog, Rob
Roy, c.1925. Plymouth Notch,
Vermont.
(Copyright © 2001 The MITRE Corporation)
Source: Bartlett, J. 1983. Collection of Familiar Quotations, 15th edition, Citadel Press, 1983. (noted by
Graeme Hirst)
Lexical Cohesion
Definition
Lexical cohesion is the tendency of the sentences in a text to
carry information about a certain topic through related words;
it gives the text a quality of unity.
Lexical Cohesion
• Halliday and Hasan (1976) looked at the question of cohesion in text. Their focus was on grammatical and lexical cohesion.
• I will deal only with lexical cohesion → ‘selecting the same lexical item twice, or selecting two that are closely related’ (p.12)
• Halliday and Hasan introduced new terminology:
  • Tie → ‘single instance of cohesion’ (p.3)
  • Texture → a property of ‘being a text’ (p.2)
Lexical Cohesion
• Hoey (1991) has looked at cohesion in text from a lexical perspective. He has suggested that cohesion ‘may be crudely defined as the way certain words of a sentence can connect that sentence to its predecessors (and successors) in a text’.
  • link – occurrence of an item in two separate sentences
  • bond – ‘connection between any two sentences by virtue of there being a sufficient number of links between them’ (p.91)
Lexical Cohesion
Links Example
Sentence 15:
"For the stock market this
move was so deeply
discounted that I don't think it
will have a major impact".
Sentence 42:
Lucent, the most active stock on
the New York Stock Exchange,
skidded 47 cents to $4.31, after
falling to a low at $4.30.
Sentence 23:
J&J's stock added 83 cents to
$65.49.
Sentence 26:
Flagging stock markets
kept merger activity and
new stock offerings on
the wane, the firm said.
Text title: U.S. stocks hold some gains.
Collected from Reuters’ Website on 20 March 2002.
Lexical Cohesion
Bonds Example
17.
In other news, Hewlett-Packard said preliminary
estimates showed shareholders had approved its purchase of
Compaq Computer -- a result unconfirmed by voting officials.
19.
In a related vote, Compaq shareholders are expected on
Wednesday to back the deal, catapulting HP into contention
against International Business Machines for the title of No. 1
computer company.
Text title: U.S. stocks hold some gains.
Collected from Reuters’ Website on 20 March 2002.
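The link and bond definitions can be illustrated on sentences 17 and 19 above. The following is a minimal sketch, not the SummariserPort implementation: the closed-class word list, the function names and the threshold of three links are assumptions made for illustration (the source only speaks of ‘a sufficient number of links’), and only identical word forms are matched.

```python
import re

# Hypothetical closed-class list, just large enough for this example.
CLOSED_CLASS = {"a", "an", "the", "in", "on", "of", "to", "by", "for",
                "its", "are", "had", "into", "against", "other"}

def content_words(sentence):
    """Return the set of lower-cased word tokens that are not closed-class words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {token for token in tokens if token not in CLOSED_CLASS}

def shared_items(first, second):
    """Items occurring in both sentences; each shared item counts as one link."""
    return sorted(content_words(first) & content_words(second))

def is_bonded(first, second, threshold=3):
    """A bond exists when the number of links reaches the assumed threshold."""
    return len(shared_items(first, second)) >= threshold

sentence_17 = ("In other news, Hewlett-Packard said preliminary estimates "
               "showed shareholders had approved its purchase of Compaq "
               "Computer -- a result unconfirmed by voting officials.")
sentence_19 = ("In a related vote, Compaq shareholders are expected on "
               "Wednesday to back the deal, catapulting HP into contention "
               "against International Business Machines for the title of "
               "No. 1 computer company.")

links = shared_items(sentence_17, sentence_19)
print(links)
print(is_bonded(sentence_17, sentence_19))
```

With this word list the two sentences share three items (compaq, computer, shareholders), so they count as bonded under the assumed threshold. The pair ‘voting’ and ‘vote’ is not matched here, because that would require handling more than identical forms.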
Lexical Cohesion
Simple Repetition: two identical items (e.g. bear – bear) or two similar items whose difference is ‘entirely explicable in terms of a closed grammatical paradigm’ (e.g. bear (N) – bears (N)) (p.53).
Complex Repetition: two items sharing a lexical morpheme but differing with respect to other morphemes or grammatical function (e.g. human (N) – human (Adj.), dampness – damp).
Simple Paraphrase: two different items of the same grammatical class which are ‘interchangeable in the context’ (p.69), that is, ‘whenever a lexical item may substitute for another without loss or gain in specificity and with no discernible change in meaning’ (p.62) (e.g. sedated – tranquillised).
Complex Paraphrase: two different items of the same or different grammatical class; this is restricted to three situations:
a) antonyms which do not share a lexical morpheme (e.g. hot – cold);
b) two items, one of which ‘is a complex repetition of the other, and also a simple paraphrase (or antonym) of a third’ (p.64) (e.g. a complex paraphrase is recorded for ‘finance’ (v) and ‘funds’ (n) if a simple paraphrase has been recorded for ‘finance’ (v) and ‘fund’ (v), and a complex repetition has been recorded for ‘fund’ (v) and ‘funds’ (n));
c) when there is the possibility of substituting an item for another (for instance, a complex paraphrase is recorded between ‘record’ and ‘discotheque’ if ‘record’ can be replaced with ‘disc’).
SummariserPort
• SummariserPort is a revised, object-oriented version of the TelePattan system developed at Surrey during 1994-1999 by Benbrahim and Tostevin.
• The TelePattan system was used by Trine Dahl, Bergen Business School, to investigate cohesion in technical texts.
• TelePattan was entered in the DARPA-sponsored SUMMAC (1997) competition, where its summaries were judged by independent evaluators to be amongst the best machine-produced summaries.
SummariserPort
• Parser
  • Reads the text file
  • Segments it into sentences
  • BreakIterator: Java class designed specifically to parse natural language into words and sentences
  • Features:
    • built-in knowledge of punctuation rules;
    • it does not require any special mark-up.
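The segmentation step can be sketched as follows. This is a rough stand-in written in Python with a single regular expression; it is not Java's BreakIterator and does not reproduce its punctuation rules. The boundary rule and the function name `split_sentences` are assumptions made for illustration.

```python
import re

# Naive boundary rule (an assumption): a sentence ends at ., ! or ? when it is
# followed by whitespace and an upper-case letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split plain text, with no mark-up, into a list of sentences."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]

text = ("J&J's stock added 83 cents to $65.49. "
        "Flagging stock markets kept merger activity and new stock "
        "offerings on the wane, the firm said. "
        "Lucent, the most active stock on the New York Stock Exchange, "
        "skidded 47 cents to $4.31, after falling to a low at $4.30.")

sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

A rule this simple leaves prices such as $65.49 intact but would wrongly split after abbreviations such as ‘Mrs.’, which is one reason to rely on a segmenter with built-in knowledge of punctuation rules.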
SummariserPort
• Patterns Extractor
  • Detects simple repetition
  • Pattern-matching operation
  • Includes an optional file of closed-class words and other non-lexical items (e.g. pronouns, prepositions, determiners, articles, conjunctions, some adverbs, etc.)
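A minimal sketch of this step, under stated assumptions: the default closed-class list, the one-word-per-line file format and the function names are hypothetical, and only identical word forms are matched (variants such as ‘market’ and ‘markets’ are not merged). The sentences are those of the Links Example.

```python
import re
from collections import defaultdict

# Hypothetical default list, used when no closed-class file is supplied.
DEFAULT_CLOSED_CLASS = {"a", "an", "the", "and", "or", "to", "on", "at", "for",
                        "of", "in", "it", "i", "this", "that", "was", "so",
                        "will", "have", "after", "most"}

def load_closed_class(path=None):
    """Read one closed-class word per line from an optional file (assumed format)."""
    if path is None:
        return set(DEFAULT_CLOSED_CLASS)
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def simple_repetitions(sentences, closed_class):
    """Map each repeated lexical item to the numbers of the sentences containing it."""
    occurrences = defaultdict(set)
    for number, sentence in sentences.items():
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token not in closed_class:
                occurrences[token].add(number)
    return {item: sorted(numbers)
            for item, numbers in occurrences.items() if len(numbers) > 1}

sentences = {
    15: "For the stock market this move was so deeply discounted that "
        "I don't think it will have a major impact.",
    23: "J&J's stock added 83 cents to $65.49.",
    26: "Flagging stock markets kept merger activity and new stock "
        "offerings on the wane, the firm said.",
    42: "Lucent, the most active stock on the New York Stock Exchange, "
        "skidded 47 cents to $4.31, after falling to a low at $4.30.",
}

repeated = simple_repetitions(sentences, load_closed_class())
print(repeated)
```

On these four sentences the item ‘stock’ is found in all of them. The sketch also pairs ‘new’ in ‘new stock offerings’ with ‘New’ in ‘New York’, which shows how purely form-based matching can over-generate.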
SummariserPort
• Morphological Rules
  • Detects complex repetition
  • Instances of complex repetition are looked up by means of a list of derivational suffixes encoded into the program. For the English language, it contains 75 morphology conditions that lead to approximately 2500 possible relations among words.
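The suffix lookup can be sketched as follows. The nine suffixes below are a hypothetical sample chosen for illustration and are not the 75 morphology conditions of the system; the function names and the minimum stem length are also assumptions.

```python
# Hypothetical sample of derivational suffixes, longest first so that the
# longest matching suffix is removed.
SUFFIXES = sorted(["ation", "ness", "ment", "ity", "ful", "ive", "er", "al", "ly"],
                  key=len, reverse=True)

MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words

def strip_suffix(word):
    """Remove the longest listed suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word

def is_complex_repetition(first, second):
    """True when two different word forms reduce to the same stem."""
    first, second = first.lower(), second.lower()
    if first == second:
        # Identical forms are simple repetition; telling human (N) from
        # human (Adj.) would need part-of-speech information.
        return False
    return strip_suffix(first) == strip_suffix(second)

print(is_complex_repetition("dampness", "damp"))
print(is_complex_repetition("summarisation", "summariser"))
print(is_complex_repetition("stock", "market"))
```

The first two pairs reduce to a common stem (‘damp’ and ‘summaris’) and are reported as complex repetition; the third pair is not.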
SummariserPort
• Output
  • Produces the results
  • Files created:
    • Summary File: whole text, summary
    • MoreInfo File: link matrix, bond matrix, word frequency list, list of sentences (TO, TC, MB)
SummariserPort
Link matrix (number of links between sentences i and j; upper triangle, each row lists j = i+1 … 20):
i = 1: 4 5 1 2 0 2 0 0 0 0 2 0 0 2 0 1 0 0 1
i = 2: 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
i = 3: 2 1 1 2 0 0 0 0 0 0 2 0 0 0 0 0 0
i = 4: 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
i = 5: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
i = 6: 1 0 0 0 0 0 0 1 0 0 0 0 0 0
i = 7: 0 0 0 0 0 0 2 0 0 0 0 0 0
i = 8: 0 0 0 0 0 0 0 0 0 0 0 0
[Rows 9 to 19 of the link matrix are garbled in the source and are omitted.]
…

Bond matrix (1 = bond between sentences i and j; upper triangle, each row lists j = i+1 … 20):
i = 1: 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i = 2 to 16: all 0
i = 17: 1 1 0
i = 18: 1 0
i = 19: 0
…
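Link and bond matrices of the kind shown on this slide can be derived from per-sentence sets of lexical items. The sketch below is illustrative only: the toy item sets, the function names and the threshold of three links are assumptions, and the final per-sentence bond count is one possible way of using the bond matrix rather than a description of what SummariserPort does.

```python
def link_matrix(item_sets):
    """links[i][j] = number of items shared by sentences i and j, for i < j."""
    size = len(item_sets)
    links = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            links[i][j] = len(item_sets[i] & item_sets[j])
    return links

def bond_matrix(links, threshold=3):
    """bonds[i][j] = 1 when the link count reaches the assumed threshold."""
    return [[1 if count >= threshold else 0 for count in row] for row in links]

def bonds_per_sentence(bonds):
    """Number of bonds each sentence has with any other sentence."""
    size = len(bonds)
    return [sum(bonds[min(i, j)][max(i, j)] for j in range(size) if j != i)
            for i in range(size)]

# Hypothetical item sets for four sentences (0-based indices).
item_sets = [
    {"stock", "market", "gain", "dow"},
    {"stock", "market", "dow", "nasdaq"},
    {"compaq", "shareholders", "vote"},
    {"stock", "gain"},
]

links = link_matrix(item_sets)
bonds = bond_matrix(links)
bond_counts = bonds_per_sentence(bonds)
print(links)
print(bonds)
print(bond_counts)
```

Only the upper triangle is filled, as on the slide, because links are symmetric. In this toy input the first two sentences share three items and are therefore the only bonded pair.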
Evaluation
• Question Game or Q&A Evaluation
  • Measures information content (retention)
  • Some people (questioners) see the text and create a set of questions about its content
  • Other people (answerers) see:
    1. Nothing, but must try to answer the questions (default knowledge)
    2. Summary, and must answer the same questions
    3. Full text, and must answer the same questions again
  • Compute the quality of the summaries (% answers correct)
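The scoring step can be sketched as a percentage of correct answers per condition. The question identifiers, the answer key and the answers below are invented placeholders for illustration, not results from the evaluation, and the function name is an assumption.

```python
def percent_correct(answers, answer_key):
    """Percentage of questions in the key that were answered correctly."""
    correct = sum(1 for question, expected in answer_key.items()
                  if answers.get(question) == expected)
    return 100.0 * correct / len(answer_key)

# Hypothetical answer key written by the questioners.
answer_key = {"q1": "a", "q2": "c", "q3": "b", "q4": "d", "q5": "a"}

# Hypothetical answers given under the three conditions.
answers_by_condition = {
    "nothing":   {"q1": "a", "q2": "a", "q3": "c", "q4": "a", "q5": "b"},
    "summary":   {"q1": "a", "q2": "c", "q3": "b", "q4": "a", "q5": "b"},
    "full text": {"q1": "a", "q2": "c", "q3": "b", "q4": "d", "q5": "a"},
}

scores = {condition: percent_correct(answers, answer_key)
          for condition, answers in answers_by_condition.items()}
print(scores)
```

Comparing the score for the summary with the scores for ‘nothing’ and for the full text indicates how much of the answerable information the summary retains.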
Evaluation
[Chart: Q&A Evaluation. Information Retention (0.0% to 100.0%) for Summary 1 to Summary 5; the plotted values are not recoverable.]
Conclusions
• We are very keen to devise strategies for independent and objective evaluation of our system.
• Human evaluation is continuing within the GIDA project – reviewed by project partners and EU-appointed evaluators.
• Machine-based evaluation, based on neural network classification of summarised and original texts, is also continuing.
Future Work
• Conduct further evaluation tests
• Implement Simple Paraphrase
• Conduct experiments in Brazilian Portuguese
  • Complex Repetition
  • Simple Paraphrase