Gross-grained RST through XML
Metadata for Multilingual
Document Generation
G. Barrutieta, J. Abaitua & J. Díaz
MT Summit VIII
Santiago de Compostela
Spain
Introduction
The web is full of documents
(web pages) and it can be
seen as a huge database
containing a lot of useful
information for a wide
range of users.
But it is difficult to find
relevant documents or
relevant information
within a document.
Problem: the web contains a
lot of data but the data is
unstructured [Sobrino]
Text-to-text generation (or
text regeneration). Is NLG
a selection problem? (8th
EWNLG, Toulouse, 2001)
The web is full of text that
can be used to generate
“new” text by taking bits
from here and there. This
approach requires
structured data.
The above is roughly what
the CourseViewGenerator
does.
A “view” is a new document
generated from parts of a
master document [Hirst et
al.]
Prototype's schema
[Diagram: the COURSE VIEW GENERATOR selects content and format in an
“intelligent” way for reader support.]
Inputs: user aspects (time, language, subject, ...) in xml-dtd; course
material (multilingual parallel corpus) in xml-dtd-xsl
Generation engine (Javascript-html): document generation
Web browser: document view
At least 3 research issues
• Where to generate from? Creation of the corpus.
This is the source of the generated documents.
This is the main focus of this paper.
• Content selection/determination algorithm – User
profiles or aspects to choose parts of the discourse
that are relevant to the students.
• Presentation selection algorithm – User profiles or
aspects to help students read and understand, to
motivate them and to involve them in the learning
process.
Multilingual parallel corpus –
Master document
• “Master document contains all the information,
including illustrations, that the system might wish
to include in any individual brochure, along with
annotations as to when each piece of information
is relevant.” [Hirst et al.] HealthDoc project.
• The same idea is used here. The master document
contains all the information about the subject
matter that the students, the administration and the
professor might need before, during and after the
course.
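As a rough illustration of the master-document idea, the Python sketch below picks out the pieces relevant to one reader. The slides do not say how the relevance annotations are encoded, so the `SEGMENT` element, the `audience` and `phase` attributes and the sample texts are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical master document: SEGMENT, audience and phase are assumptions
# made for this sketch, not the authors' actual markup.
MASTER = """
<COURSE>
  <SEGMENT audience="student professor" phase="before">Course syllabus.</SEGMENT>
  <SEGMENT audience="student" phase="during">Exercise 1.</SEGMENT>
  <SEGMENT audience="professor" phase="during">Solution to exercise 1.</SEGMENT>
  <SEGMENT audience="administration" phase="after">Grade report form.</SEGMENT>
</COURSE>
"""


def generate_view(master_xml, audience, phase):
    """Return the text of every segment annotated as relevant to the reader."""
    root = ET.fromstring(master_xml)
    return [
        seg.text.strip()
        for seg in root.iter("SEGMENT")
        if audience in seg.get("audience", "").split()
        and seg.get("phase") == phase
    ]


print(generate_view(MASTER, "student", "during"))
```

Each call yields one “view”: the same master document produces different documents for a student, the professor or the administration.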
Data and metadata –
Level of segmentation
• How is this information going to be represented in
the multilingual parallel corpus?
– The data is text encapsulated in XML [Bray et al.] tags
– The metadata (tags) is data about the text
• The metadata is RST discourse trees [Mann &
Thompson]
• What is the size of the text spans (segments) to be
encapsulated or the level of segmentation?
– The text spans are “typically” clauses (minimal text
spans or elementary units) with a discourse function
[Marcu et al.]
RST discourse tree
.
.
.
<EXPLANATION>
<RST>
<RST-N>
<RST>
<RST-N>
<S>
Darwin as a geologist
</S>
</RST-N>
<RST-S>
<CONCESSION>
<S>
He tends to be viewed now as a biologist,
</S>
</CONCESSION>
</RST-S>
</RST>
</RST-N>
<RST-S>
<EVIDENCE>
<S>
but in his five years on the Beagle his main work was geology
</S>
</EVIDENCE>
<EVIDENCE>
<S>
and he saw himself as a geologist.
</S>
</EVIDENCE>
<EVIDENCE>
<S>
His work contributed significantly to the field.
</S>
</EVIDENCE>
</RST-S>
</RST>
</EXPLANATION>
.
.
.
RST in XML
.
.
.
<!ELEMENT EXPLANATION (RST+ )>
<!ELEMENT RST (RST-S|RST-N)*>
<!ELEMENT RST-N (S|RST)*>
<!ELEMENT RST-S (
EVIDENCE|
CONCESSION)*>
<!ELEMENT EVIDENCE (S+)>
<!ELEMENT CONCESSION (S+)>
<!ELEMENT S (#PCDATA)>
.
.
.
DTD
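To make the encoding concrete, here is a minimal Python sketch that reads the Darwin fragment and lists each text span with its nuclearity and its nearest enclosing relation. It uses a well-formed copy of the fragment. The `segments` function and the "nearest enclosing relation" convention are assumptions of this sketch, not part of the prototype.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the Darwin fragment, indented for readability.
FRAGMENT = """
<EXPLANATION>
  <RST>
    <RST-N>
      <RST>
        <RST-N><S>Darwin as a geologist</S></RST-N>
        <RST-S>
          <CONCESSION><S>He tends to be viewed now as a biologist,</S></CONCESSION>
        </RST-S>
      </RST>
    </RST-N>
    <RST-S>
      <EVIDENCE><S>but in his five years on the Beagle his main work was geology</S></EVIDENCE>
      <EVIDENCE><S>and he saw himself as a geologist.</S></EVIDENCE>
      <EVIDENCE><S>His work contributed significantly to the field.</S></EVIDENCE>
    </RST-S>
  </RST>
</EXPLANATION>
"""


def segments(node, role=None, relation=None):
    """Yield (role, relation, text) for every S element, depth first.

    role is the nearest enclosing RST-N/RST-S; relation is the nearest
    enclosing relation element (EXPLANATION, CONCESSION, EVIDENCE, ...).
    """
    if node.tag == "S":
        yield role, relation, " ".join(node.text.split())
        return
    if node.tag == "RST-N":
        role = "nucleus"
    elif node.tag == "RST-S":
        role = "satellite"
    elif node.tag != "RST":
        relation = node.tag  # any other element names a rhetorical relation
    for child in node:
        yield from segments(child, role, relation)


for role, relation, text in segments(ET.fromstring(FRAGMENT)):
    print(f"{role:9} {relation:11} {text}")
```

A content selection step could, for instance, keep only the spans whose role is `nucleus`; that policy is one possibility and is not taken from the slides.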
Non-isomorphism
• Marcu et al. show examples of multilingual text
analyses that are not isomorphic and propose 4
possible solutions for further research. One of
them is explored later in this presentation.
• Moore & Pollack show examples of monolingual
RST text analysis that are not isomorphic either.
This ambiguity is due to intentional and
informational level discrepancies.
• This non-isomorphism implies that a separate content
selection/determination algorithm might be
necessary for each language. This is not workable.
Moore & Pollack's non-isomorphism
• Not a problem in this particular case since
the corpus is manually created. The
ambiguities are going to be addressed and
resolved by the human author.
Marcu et al.'s non-isomorphism
• They propose 4 possible solutions.
• One of them is “Derive a language-independent discourse structure and then
linearize it”.
• This idea in their paper triggered our
“gross-grained RST”.
Gross-grained RST
• The language-independent discourse structure is built from
bigger segments of text (bigger than clauses).
• Our segments are never smaller than sentences.
• Our segments are groups of sentences (at least
one) with a clear communicative goal.
• RST is flexible enough to allow this.
Of course, we lose rhetorical information within
those bigger chunks of text, but this approach
makes sense because we are not going to give the
students only a portion of an exercise, for
example.
Gross-grained RST in XML
<RST>
<RST-S>
<PREPARATION>
<S>
What is knowledge management?
</S>
</PREPARATION>
</RST-S>
<RST-N>
<S>
Knowledge, in a business context, is the
organizational
memory, what people know collectively
and individually
</S>
<S>
Management is the judicious use of
means to accomplish
an end
</S>
<S>
Knowledge management is the
combination of those
concepts, KM = knowledge +
management
</S>
</RST-N>
</RST>
<RST>
<RST-S>
<PREPARATION>
<S>
¿Qué es gestión del conocimiento?
</S>
</PREPARATION>
</RST-S>
<RST-N>
<S>
Conocimiento, en el contexto de los
negocios, es la
memoria de la organización, lo que la
gente sabe colectiva
e individualmente
</S>
<S>
Gestión es el uso juicioso de recursos
para alcanzar un fin
</S>
<S>
Gestión del conocimiento es la
combinación de esos dos
conceptos, GC = gestión + conocimiento
</S>
</RST-N>
</RST>
<RST>
<RST-S>
<PREPARATION>
<S>
Zer da ezagutzaren kudeaketa?
</S>
</PREPARATION>
</RST-S>
<RST-N>
<S>
Ezagutza, negozioetan, erakundearen
memoria
da, jendeak bakarka eta taldeka dakiena
</S>
<S>
Kudeaketak erabideen erabilera zuzena
du helburu
</S>
<S>
Ezagutzaren kudeaketa bi kontzeptu
hauen nahasketa da,
EK = ezagutza + kudeaketa
</S>
</RST-N>
</RST>
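The point of the three parallel fragments is that they share one discourse structure. The Python sketch below checks that property and then aligns the segments across languages. The sample strings are abridged from the fragments above, and the `shape`, `segment_texts` and `aligned_segments` functions are hypothetical names for this illustration, not the CourseViewGenerator's code.

```python
import xml.etree.ElementTree as ET

# Abridged copies of the three parallel fragments (one nucleus sentence each).
CORPUS = {
    "en": "<RST><RST-S><PREPARATION><S>What is knowledge management?</S>"
          "</PREPARATION></RST-S><RST-N><S>Management is the judicious use "
          "of means to accomplish an end</S></RST-N></RST>",
    "es": "<RST><RST-S><PREPARATION><S>¿Qué es gestión del conocimiento?</S>"
          "</PREPARATION></RST-S><RST-N><S>Gestión es el uso juicioso de "
          "recursos para alcanzar un fin</S></RST-N></RST>",
    "eu": "<RST><RST-S><PREPARATION><S>Zer da ezagutzaren kudeaketa?</S>"
          "</PREPARATION></RST-S><RST-N><S>Kudeaketak erabideen erabilera "
          "zuzena du helburu</S></RST-N></RST>",
}


def shape(node):
    """Tag skeleton of the discourse tree, ignoring the S (sentence) leaves."""
    return (node.tag, tuple(shape(c) for c in node if c.tag != "S"))


def segment_texts(tree):
    """One string per segment: the sentences grouped under a common parent."""
    texts = []
    for parent in tree.iter():
        sentences = [" ".join(s.text.split()) for s in parent.findall("S")]
        if sentences:
            texts.append(" ".join(sentences))
    return texts


def aligned_segments(corpus):
    """Return one {language: text} dict per segment, in document order."""
    trees = {lang: ET.fromstring(xml) for lang, xml in corpus.items()}
    if len({shape(tree) for tree in trees.values()}) != 1:
        raise ValueError("language versions are not isomorphic")
    columns = {lang: segment_texts(tree) for lang, tree in trees.items()}
    langs = list(columns)
    return [dict(zip(langs, row)) for row in zip(*(columns[l] for l in langs))]


for row in aligned_segments(CORPUS):
    print(row["en"], "|", row["es"], "|", row["eu"])
```

Because `shape` ignores the sentence leaves, two language versions may use a different number of sentences inside a segment and still count as isomorphic. That matches the idea that segments are groups of sentences rather than clauses.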
Discussion
• Tests are to be carried out in the near future. The
documents generated by the CourseViewGenerator
are going to be presented to the students to let
them judge.
• A lot of fine-tuning and refining will be required to
decide what the final user aspects and the final
content selection and presentation selection
algorithms are going to be. The refining will be guided
by the tests.
• Open questions:
– Is this approach applicable to other domains and
contexts?
– Will this approach work with bigger corpora?
Bibliography
All the bibliographical
references are in the
paper except the
following:
Moore, J.D. and Pollack,
M.E. (1992). A Problem
for RST: The Need for
Multi-Level Discourse
Analysis. Computational
Linguistics.