Topic IV-10. Theme and rheme. The starting point. Under

REFERENTIAL CHOICE:
FACTORS AND MODELING
Andrej A. Kibrik, Mariya V. Khudyakova,
Grigoriy B. Dobrov, and Anastasia S. Linnik
Night Whites SPb
February 28, 2014
Referential choice in discourse
• When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including:
• Full noun phrase
  • Proper name (e.g. Peter)
  • Description = common noun (with or without modifiers) (e.g. the tzar)
  • Mix: Peter the Great
• Reduced NP, particularly a third person pronoun (e.g. he)
2
Example
• The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window. Once inside, she spends nearly four hours Ø measuring and diagramming each room in the 80-year-old house, Ø gathering enough information to Ø estimate what it would cost to rebuild it. She snaps photos of the buckled floors and the plaster that has fallen away from the walls.
Legend: Description, Proper name, Pronoun, Zero (Ø)
3
Research question
How is referential choice made?
4
Why is this question important?
• Reference is among the most basic cognitive operations performed by language users
• Reference accounts for the lion’s share of all information in natural communication
• Consider text manipulation according to the method of Biber et al. 1999: 230-232
5
Referential expressions marked in green
• The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.
6
Referential expressions removed
• The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.
7
Referential expressions kept
• The Victorian house that Ms. Johnson is inspecting has been deemed unsafe by town officials. But she asks a workman toting the bricks from the lawn to give her a boost through an open first-floor window.
8
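The manipulation shown on the three preceding slides can be sketched in code: given a text and a list of its referential expressions, one variant blanks the expressions out and the other keeps only them. This is a minimal sketch. The `REFERENTIAL` list is a hand-picked illustration (an assumption), not the annotation shown in green on the slides, and the function names are hypothetical.

```python
import re

TEXT = ("The Victorian house that Ms. Johnson is inspecting has been deemed "
        "unsafe by town officials. But she asks a workman toting the bricks "
        "from the lawn to give her a boost through an open first-floor window.")

# Assumption: an illustrative, hand-picked list of referential expressions.
REFERENTIAL = ["The Victorian house", "Ms. Johnson", "town officials", "she",
               "a workman", "the bricks", "the lawn", "her",
               "an open first-floor window"]

def _pattern(expressions):
    # Try longer expressions first and match whole words only,
    # so that "her" is not found inside some longer word.
    ordered = sorted(expressions, key=len, reverse=True)
    return re.compile("|".join(
        r"(?<!\w)" + re.escape(e) + r"(?!\w)" for e in ordered))

def remove_referential(text, expressions):
    """Replace every referential expression with underscores of equal length."""
    return _pattern(expressions).sub(lambda m: "_" * len(m.group()), text)

def keep_referential(text, expressions):
    """Return only the referential expressions, in text order."""
    return [m.group() for m in _pattern(expressions).finditer(text)]

removed = remove_referential(TEXT, REFERENTIAL)
kept = keep_referential(TEXT, REFERENTIAL)
print(removed)
print(" / ".join(kept))
```

Comparing the two printed variants makes it easy to see how much of the passage is carried by the referential expressions alone.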
Types of referential devices: levels of granularity
[Figure: hierarchy of referential devices]
• We mostly concentrate on the two upper levels in this hierarchy
• REG (referring expression generation) tradition: most attention to varieties of descriptive full NPs
9
Multi-factorial character of referential choice
• Multiple factors of referential choice
  • Properties of the discourse context
    • Distance to antecedent
      • Along the linear discourse structure (Givón)
      • Along the hierarchical discourse structure (Fox, Kibrik)
    • Antecedent role (Centering theory)
  • Properties of the referent
    • Referent animacy (Dahl)
    • Protagonisthood (Grimes)
  • ...
10
Cognitive multi-factorial model of referential choice
[Diagram: factors of referential choice (discourse context, referent’s properties) → referent activation in working memory → referential choice]
11
Rhetorical distance
• Distance along the hierarchical discourse structure between
  • the current point in discourse, where referential choice is to be made
  • the antecedent
• Measured in elementary discourse units
  • roughly equaling clauses
• Rhetorical structure theory by Mann and Thompson (RST)
• Very important factor
• RST Discourse Treebank corpus (Marcu et al.)
12
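The contrast between linear and rhetorical distance can be illustrated with a small sketch. It assumes a simplified tree in which every elementary discourse unit (EDU) points to the unit it depends on, and it counts edges on the tree path between two units. The toy tree and the function names are invented for illustration, and the exact counting rules used for RefRhet are not reproduced here.

```python
# Assumption: each EDU points to the EDU it depends on (its nucleus);
# the root has parent None. This toy tree is invented for illustration.
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}

def ancestors(edu, parent):
    """Return the chain [edu, parent(edu), ..., root]."""
    chain = []
    while edu is not None:
        chain.append(edu)
        edu = parent[edu]
    return chain

def linear_distance(current, antecedent):
    """Distance along the linear sequence of EDUs."""
    return abs(current - antecedent)

def rhetorical_distance(current, antecedent, parent):
    """Number of edges on the tree path between the two EDUs."""
    up_current = ancestors(current, parent)
    up_antecedent = ancestors(antecedent, parent)
    depth_in_antecedent = {edu: d for d, edu in enumerate(up_antecedent)}
    for steps, edu in enumerate(up_current):
        if edu in depth_in_antecedent:
            # First shared node: steps up from both sides meet here.
            return steps + depth_in_antecedent[edu]
    raise ValueError("EDUs are not in the same tree")

print(linear_distance(5, 1), rhetorical_distance(5, 1, PARENT))
print(linear_distance(4, 3), rhetorical_distance(4, 3, PARENT))
```

In this toy tree, EDU 5 is four units away from EDU 1 in linear terms but directly attached to it in the hierarchy, which is the kind of case where the two measures diverge.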
Example of a rhetorical graph from RST Discourse Treebank
13
RefRhet and MoRA
• RST Discourse Treebank + our annotation = RefRhet corpus
• Subcorpus RefRhet 3 (2013-2014)
• Annotation scheme MoRA (Moscow Referential Annotation)
14
RefRhet 3
• 64 texts
• 6294 markables
• 1852 anaphor-antecedent pairs
  • 475 pronouns
  • 1377 full NPs
    • 706 descriptions
    • 671 proper names
15
Candidate factors of referential choice
[Figure: candidate factors, with the discourse context factors and the factor-predicted variable labeled]
• Some values are drawn from MoRA annotation
• Some others are computed automatically
16
Windows of the MMAX2 program
17
Some properties of the MoRA scheme
• Wide range of activation factors and their values
  • E.g. multiple values of the “grammatical role” factor
• Annotation of groups
  • complex markables serving as antecedents
    • and-coordinate
    • or-coordinate
    • prepositional (children with their parents)
    • discontinuous
18
A discontinuous group
19
Tasks for machine learning
• Candidate factors:
  • All potential parameters implemented in corpus annotation
• Factor-predicted variable:
  • Form of referential expression (np_form)
• Two-way task:
  • Full NP vs. pronoun
• Three-way task:
  • Definite description vs. proper name vs. pronoun
• Accuracy maximization:
  • Ratio of correct predictions to the overall number of instances
20
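The accuracy measure defined above, and the majority-class baseline it is compared against, can be written out as a short sketch. The label names `full_np` and `pronoun` and the function names are illustrative assumptions. The counts are the RefRhet 3 figures for the two-way task (1377 full NPs and 475 pronouns among 1852 anaphor-antecedent pairs).

```python
from collections import Counter

def accuracy(predicted, actual):
    """Ratio of correct predictions to the overall number of instances."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(actual):
    """Always predict the most common label; return it with its accuracy."""
    label, count = Counter(actual).most_common(1)[0]
    return label, count / len(actual)

# Two-way task, using the RefRhet 3 counts: 1377 full NPs, 475 pronouns.
labels = ["full_np"] * 1377 + ["pronoun"] * 475
label, base = majority_baseline(labels)
print(label, f"{base:.1%}")
```

The computed share, 1377 / 1852, rounds to 74.4%, which agrees with the two-way baseline reported in the results.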
Machine learning methods (Weka, a data mining system)
• Logical algorithms
  • Decision trees (C4.5)
  • Decision rules (JRip)
• Logistic regression
• Compositions
  • Boosting
  • Bagging
• Quality control: the cross-validation method
21
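The cross-validation idea can be sketched without Weka. The example below is not the setup used in the study. It is a standard-library illustration of k-fold cross-validation around a one-feature threshold rule (a decision stump, far simpler than C4.5). The toy data, the single `distance` feature, and all function names are assumptions made for the example.

```python
import random

def predict(threshold, distance):
    """Predict a pronoun for short distances, a full NP otherwise."""
    return "pronoun" if distance <= threshold else "full_np"

def train_stump(rows):
    """Pick the smallest threshold with the most correct training rows.

    rows is a list of (distance, label) pairs.
    """
    best_threshold, best_correct = 0, -1
    for threshold in sorted({d for d, _ in rows} | {0}):
        correct = sum(predict(threshold, d) == y for d, y in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def cross_validate(rows, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        threshold = train_stump(train)
        correct += sum(predict(threshold, d) == y for d, y in test_fold)
    return correct / len(rows)

# Invented toy data: pronouns at distance 1-2, full NPs beyond that.
TOY = [(d, "pronoun" if d <= 2 else "full_np") for d in range(1, 11)] * 3
print(f"cross-validated accuracy: {cross_validate(TOY):.1%}")
```

Each instance is tested exactly once by a model that never saw it during training, which is what makes the pooled accuracy a fair estimate.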
Results of machine learning on RefRhet 3 and MoRA

Algorithm | Accuracy, two-way | Accuracy, two-way (2012) | Accuracy, three-way
Baseline (frequency of the most common ref. option) | 74.4% | 74.4% | 37.9%
Logistic regression | 87.2% | |
Decision tree algorithm | 93.7% | 86.1% | 74.0%
Bagging | 89.4% | 88.0% | 76.1%
Boosting | 89.5% | 86.2% | 74.0%

71.3% (cell position not recoverable from the transcript)
22
Non-categorical referential choice (Kibrik 1999)
• Cognitive plane: graded variable
  • Referent activation, from min to max
• Linguistic plane: binary variable
  • full NP (Peter) at minimal activation, pronoun (he) at maximal activation
23
Non-categorical referential choice
• In many instances, more than one referential option can be used
• Referential choice is less than fully categorical (cf. Belz & Varges 2007, van Deemter et al. 2012: 173–179)
• In the intermediate activation instances both the original text author and the algorithm:
  • more or less randomly make a categorical decision at the linguistic plane
  • those decisions do not always have to coincide
• Therefore, no model can predict the actual referential choice with 100% accuracy
24
Experiment: Understanding (allegedly non-categorical) referential expressions
• 9 texts in which the algorithms diverged in their prediction from the original referential choice
• 9 original texts (proper name) and 9 altered texts (pronoun) distributed between 2 experimental lists
• 60 participants
• 1 experimental question + 2 control questions
• If the instances of divergence are explained by intermediate referent activation, the accuracy in experimental questions should not be lower than the accuracy in control questions
25
Experiment: results
• Control questions: 84%
• Questions about proper names: 84%
• Questions about pronouns: 75%
• If we exclude questions #2 and #5, the accuracy for questions about pronouns is 80%, not differing significantly from control and proper-name questions
• In general, the algorithm diverges from the original in the places where that is acceptable, that is, where referent activation is intermediate
26
Non-categorical referential choice
• Sometimes referential choice allows more than one option
• A proper model of referential choice must account for this property of human speakers
• Our modeling procedures actually conform to this requirement
27
Further studies
• Explore logistic regression’s ability to evaluate the certainty of prediction
  • and attempt to correlate that with human assessment of non-categorical referential choice
  • as well as with the theoretical notion of intermediate referent activation
• Cheap data modeling
• Secondary referential options, such as demonstrative descriptions
• Genres and referential choice
28
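The first item above, using logistic regression to gauge the certainty of a prediction, can be illustrated with a small sketch. The weights, the two features, and the `margin` parameter are hypothetical values chosen for the example, not a model fitted to RefRhet. The point is only that a probability near 0.5 can be read as a candidate for intermediate referent activation.

```python
import math

# Hypothetical weights for illustration only; not fitted to any corpus.
WEIGHTS = {"bias": 2.0, "rhetorical_distance": -0.9, "animate": 0.8}

def pronoun_probability(features):
    """Logistic model: probability that the referent is pronominalized."""
    z = WEIGHTS["bias"] + sum(
        WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, margin=0.2):
    """Return (choice, probability, uncertain).

    uncertain is True when the probability lies within `margin` of 0.5,
    i.e. when either referential option might be acceptable.
    """
    p = pronoun_probability(features)
    choice = "pronoun" if p >= 0.5 else "full_np"
    return choice, p, abs(p - 0.5) < margin

for feats in ({"rhetorical_distance": 1, "animate": 1},
              {"rhetorical_distance": 3, "animate": 1},
              {"rhetorical_distance": 5, "animate": 0}):
    choice, p, uncertain = classify(feats)
    print(feats, choice, round(p, 3), "uncertain" if uncertain else "confident")
```

Instances flagged as uncertain would be the natural ones to compare against human judgments of whether both a pronoun and a full NP are acceptable.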
Conclusions
• Multi-factorial approach
• Corpus large enough for machine-learning modeling
• Results of prediction close to theoretical maximum
• Account of the non-deterministic character of referential choice
• This approach can be applied to a wide range of other linguistic choices
29
Thank you
for your attention
30