Towards a model of formal and
informal address in English
Manaal Faruqui
Language Technologies Institute, CMU
(Work done at IIT Kharagpur, India)
Sebastian Padó
Univ. of Heidelberg, Germany
Formal and informal address
• Most languages distinguish formal (V) and informal (T)
address in direct speech (Brown & Gilman 1960)
• Formal address: Neutrality, distance, used for “superordinates”
• Informal address: used for friends, “subordinates”
• Variety of realizations in languages
• Frequently pronoun choice (French vous/tu, German Sie/du)
• Verbal inflection (e.g. Japanese)
T/V and English
• Contemporary English is conspicuous by not realizing
the T/V distinction
• Pronoun “you” is both formal and informal
• No differences in verbal inflection
• Does English really differ in such a fundamental way
from virtually all other related languages?
Main goals of this work
• Goal 1: Determine whether English distinguishes V and
T consistently, but using different indicators
• If yes, what are these indicators?
• Goal 2: Develop a computational model that labels
English sentences as T or V
• Ideally without spending effort on annotation
Methodology
• Use a parallel corpus to analyze aligned sentences with
overt (German) T/V choice and covert English T/V
choice
• For Goal 1: Compare German and English address
• For Goal 2: Project German labels onto English sentences
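A minimal sketch of the projection step for Goal 2, assuming the sentence alignment is already available as two index-aligned lists. The function name `project_labels` and the toy sentences are illustrative assumptions, not part of the original work.

```python
def project_labels(english_sentences, german_labels):
    """Copy T/V labels from aligned German sentences onto English ones.

    Both lists are assumed to be index-aligned (one German label per
    English sentence). Sentences whose German counterpart carries no
    T/V label are dropped.
    """
    if len(english_sentences) != len(german_labels):
        raise ValueError("inputs must be aligned one-to-one")
    return [
        (sentence, label)
        for sentence, label in zip(english_sentences, german_labels)
        if label in ("T", "V")
    ]


# Toy data (hypothetical): labels as they might come from the German side.
english = ["Can you help me?", "Did you see that?", "It rained all day."]
german_labels = ["T", "V", "NONE"]
projected = project_labels(english, german_labels)
```

The English sentences receive labels without any manual annotation, which is the point of the projection setup.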
Digression: Creation of a parallel corpus
• Current parallel corpora are not suitable
• EUROPARL: overwhelmingly formal (>99%)
• Newswire: no dialogue
• Creation of a new corpus: English-German literary texts
• 106 19th-century novels and stories (Project Gutenberg)
• Sentence-aligned: Gargantuan (Braune & Fraser 2010)
• POS-tagged (Schmid 1994)
• German sentences can be labeled as T, V or NONE
• Rules for labeling follow on the next slide
Labeling German Pronouns as T/V
• Du/du: Singular T
• Sie: Singular V (except for utterance initial positions)
• sie: Ignored
• Third person pronoun (she/they)
• ihr: Ignored
• Plural T address or archaic sing./plural V address
• Can be ideally distinguished by capitalization but errors present in the corpus
• Dative form of 3rd person “she” pronoun sie
• Neutral wrt T/V
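The rules above can be sketched as a small rule-based labeler. This is an illustration under stated assumptions: it covers only the pronoun forms listed on the slide (no inflected forms), the test for "utterance initial" is a simple punctuation heuristic, and the `BOTH` label for sentences with conflicting evidence is a hypothetical name.

```python
import re

# Characters after which a following "Sie" is treated as utterance-initial.
# This heuristic is an assumption, not the rule used in the original work.
UTTERANCE_BREAKS = '.!?:;"„“”»«'


def is_utterance_initial(sentence, start):
    """True if the token starting at `start` opens the sentence or a quote."""
    prefix = sentence[:start].rstrip()
    return prefix == "" or prefix[-1] in UTTERANCE_BREAKS


def label_german_sentence(sentence):
    """Label a German sentence as T, V, BOTH or NONE from its pronouns."""
    has_t = has_v = False
    for match in re.finditer(r"\w+", sentence):
        token = match.group()
        if token in ("Du", "du"):
            has_t = True  # singular T
        elif token == "Sie" and not is_utterance_initial(sentence, match.start()):
            has_v = True  # capitalized mid-utterance "Sie": singular V
        # "sie" and "ihr"/"Ihr" are deliberately ignored (ambiguous)
    if has_t and has_v:
        return "BOTH"
    if has_t:
        return "T"
    if has_v:
        return "V"
    return "NONE"
```

For example, `label_german_sentence("Haben Sie das gesehen?")` yields `V`, while `label_german_sentence("Sie ging nach Hause.")` yields `NONE` because the capitalization of the sentence-initial pronoun is uninformative.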
Goal 1: Compare German and English address
• Give English monolingual text to human annotators
• Ask for T/V judgment
• Their annotation provides the following information
• How well do annotators agree on English text?
• Does English monolingual text provide enough information
to identify T/V? (1a)
• How well do annotators agree with copied labels?
• Is there a direct correspondence? (1b)
• Only if this is the case is the copying of labels appropriate
Experiment 1: Human Annotation
• 200 randomly drawn English sentences
• Two annotators (“A1”, “A2”)
• Two conditions:
• No context: just one sentence
• In context: three sentences of preceding and three of following context
Results: Reliability
             No Context     In Context
A1 vs. A2    .75 (k=.49)    .79 (k=.58)
• Context improves reliability
• Many sentences cannot be tagged with T/V in isolation
“And she is a sort of relation of your lordship’s,” said Dawson.
“And perhaps sometime you may see her.”
• Reliability in context is reasonable: Goal 1a ✓
• English does provide strong (if imperfect) clues on T/V
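A sketch of how the two figures in each cell could be computed, assuming that the first number is raw agreement and that k denotes Cohen's kappa (the slide does not spell this out). The annotation lists are toy data.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently
    # according to their own label distributions.
    p_e = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if p_e == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Toy annotations (hypothetical), one label per sentence.
a1 = ["T", "T", "V", "V"]
a2 = ["T", "V", "V", "V"]
agreement = observed_agreement(a1, a2)
kappa = cohens_kappa(a1, a2)
```

Kappa is lower than raw agreement because part of the agreement is expected by chance when both annotators favor one label.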
Results: Correspondence
                            No Context     In Context
(A1 ∩ A2) vs. Projection    .67 (k=.34)    .79 (k=.58)
• Agreement with German projected labels again reasonable, but not perfect
Goal 1b ✓
• Error analysis showed strong influence of social norms
• Example: Lovers in 19th-century novels use V (!)
[...] she covered her face with the other to conceal her tears. “Corinne!”, said
Oswald, “Dear Corinne! My absence has then rendered you unhappy!”
Experiment 2: Prediction of T/V
• Copy German T/V labels onto English: No annotation
• Learn L2-regularized logit classifier on train set; optimize
on dev set; evaluate on test set
• Feature candidates:
• Lexical features (bag-of-words, χ² feature selection)
• Distributional semantic word classes
• 200 word classes clustered with the algorithm by Clark (2003)
• Politeness theory (Brown & Levinson 1987)
• Polite speech has specific features, which are inherited by V
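To make the classifier setup concrete, here is a minimal pure-Python sketch of an L2-regularized logistic regression over bag-of-words counts, trained by batch gradient descent. The hyperparameters (`l2`, `lr`, `epochs`), the function names and the toy sentences are assumptions for illustration; the original experiments presumably used an existing toolkit and tuned settings on the dev set.

```python
import math
from collections import Counter


def featurize(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocab]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(x, w, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=200):
    """Minimize mean log loss + (l2 / 2) * ||w||^2 by gradient descent."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0  # the bias term is not regularized
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy training data (hypothetical): tokenized English sentences with
# labels projected from German.
train = [
    (["yes", "sir"], "V"),
    (["good", "morning", "sir"], "V"),
    (["thank", "you", "madam"], "V"),
    (["yes", "thou"], "T"),
    (["come", "here", "thou"], "T"),
    (["art", "so", "late"], "T"),
]
vocab = sorted({tok for toks, _ in train for tok in toks})
X = [featurize(toks, vocab) for toks, _ in train]
y = [1.0 if label == "V" else 0.0 for _, label in train]
w, b = train_logreg(X, y)


def predict(tokens):
    """Return "V" if the model assigns probability >= 0.5 to V, else "T"."""
    return "V" if sigmoid(score(featurize(tokens, vocab), w, b)) >= 0.5 else "T"
```

In this toy setting the weight for "sir" ends up positive and the weight for "thou" negative, so `predict(["yes", "sir"])` gives `V` and `predict(["thou"])` gives `T`.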
Parallel Corpus: Some statistics
• German
• #Sent_V: 37K & #Sent_T: 28K
• Around 270 (<0.5%) sentences were both T & V
• Ignored!
• No error in manually verified randomly selected 300 German sentences
• English
• #Sent_V: 25K & #Sent_T: 18K
• Training data: 74 novels (26K)
• Development data: 19 novels (9K)
• Test data: 13 novels (8K)
• Corpus available at http://www.nlpado.de/
Politeness theory features
Context
• As shown by human annotation: Individual sentences
often insufficient for classification
• Simplest solution: Compute features over a window of
context sentences
• Problem: context typically includes non-speech sentences
“I am going to see his ghost!” Lorry quietly chafed the hands that held
his arm.
Context
• Our solution: A simple “direct speech” recognizer
• CRF-based sequence tagger (Mallet) trained on 1000 sentences
• Ideal results for 8 sentences of direct speech context
• +5% accuracy over no context
[Figure: speech context vs. sentence context]
B-SP: “I am going to see his ghost!”
O: Lorry quietly chafed the hands that held his arm.
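A sketch of how a direct-speech context window could be collected once each sentence carries a tag from the speech recognizer. It assumes that any tag other than `O` marks direct speech and that the window counts speech sentences on each side of the target; the slide does not say whether 8 refers to one side or to the total. Apart from the two sentences quoted on the slide, the example sentences are placeholders.

```python
def speech_context(sentences, tags, index, window=8):
    """Return the target sentence plus nearby direct-speech sentences.

    Non-speech sentences (tag "O") are skipped, so the window reaches
    further back and forward until `window` speech sentences are found
    on each side or the text ends.
    """
    def is_speech(i):
        return tags[i] != "O"

    before = [i for i in range(index - 1, -1, -1) if is_speech(i)][:window]
    after = [i for i in range(index + 1, len(sentences)) if is_speech(i)][:window]
    picked = sorted(before) + [index] + after
    return [sentences[i] for i in picked]


sentences = [
    '"I am going to see his ghost!"',
    "Lorry quietly chafed the hands that held his arm.",
    '"Second utterance."',   # placeholder
    "Narration in between.",  # placeholder
    '"Third utterance."',    # placeholder
]
tags = ["B-SP", "O", "B-SP", "O", "B-SP"]
context = speech_context(sentences, tags, index=2, window=1)
```

Features for the target sentence would then be computed over `context` instead of over the raw neighboring sentences, which keeps narration out of the window.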
Quantitative results
Model                      Accuracy (%)
Frequency BL (V)           59.1
Lexical features           67.0
Semantic class features    57.5
Politeness features        59.6
• Only lexical features yield significant improvement over frequency baseline
Goal 2 ✓
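For reference, a frequency baseline of the kind listed in the first row can be sketched as follows: always predict the label that is most frequent in the training data. The function name and the toy labels are illustrative assumptions.

```python
from collections import Counter


def frequency_baseline(train_labels, test_labels):
    """Predict the majority training label for every test item."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)


# Toy labels (hypothetical).
train_labels = ["V", "V", "V", "T", "T"]
test_labels = ["V", "T", "V", "V"]
majority_label, baseline_accuracy = frequency_baseline(train_labels, test_labels)
```

Any feature set is only useful to the extent that it beats this number, which is why the table reports it alongside the three models.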
Qualitative analysis: Lexical Features
• Top 10 most-associated words for V (left) and T (right)
• V: Titles, formulaic language
• T: mixed bag, mostly very infrequent
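One way to rank words by their association with T or V is the χ² statistic mentioned for feature selection in Experiment 2. The sketch below computes it from a 2×2 contingency table per word (word present or absent versus V or T); whether the top-10 lists were produced exactly this way is an assumption, and the toy sentences are invented.

```python
def chi2_scores(docs, labels):
    """Chi-square association between each word and the T/V label.

    docs: list of token lists; labels: parallel list of "T" / "V".
    """
    n = len(docs)
    n_v = sum(1 for label in labels if label == "V")
    n_t = n - n_v
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for word in vocab:
        a = sum(1 for d, l in zip(doc_sets, labels) if l == "V" and word in d)
        b = sum(1 for d, l in zip(doc_sets, labels) if l == "T" and word in d)
        c = n_v - a  # V sentences without the word
        d = n_t - b  # T sentences without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy corpus (hypothetical).
docs = [["yes", "sir"], ["good", "sir"], ["yes", "thou"], ["come", "thou"]]
labels = ["V", "V", "T", "T"]
scores = chi2_scores(docs, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The statistic measures strength of association only; the direction (V or T) has to be read off the counts, for instance by comparing how often the word occurs in V versus T sentences.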
Qualitative analysis: Semantic classes
No.     P(c|V) / P(c|T)    Words with highest P(w|V) / P(w|T)
1.      4.59               Mister, sir, Monsieur, sirrah
2.      2.36               Mlle., Mr., Herr, Dr., Mrs.
3.      1.60               Gentlemen, patients, rascals
…       …                  …
200.    0.02               believest, lovest, makest, couldst
• Only 3-4 of 200 classes are associated with T or V
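The ratios in the table can be illustrated with a short sketch that estimates P(c|V) and P(c|T) by relative token frequency. The estimation method, the class ids and the word-to-class mapping below are assumptions made for the example.

```python
from collections import Counter


def class_ratios(tokens_v, tokens_t, word_to_class):
    """Return P(c|V) / P(c|T) for every class seen in either token list.

    Probabilities are relative frequencies of class members among all
    tokens of the V (or T) sentences. Words without a class are ignored
    in the counts but still contribute to the totals.
    """
    def class_probs(tokens):
        counts = Counter(
            word_to_class[tok] for tok in tokens if tok in word_to_class
        )
        return {c: k / len(tokens) for c, k in counts.items()}

    p_v, p_t = class_probs(tokens_v), class_probs(tokens_t)
    ratios = {}
    for c in set(p_v) | set(p_t):
        if p_t.get(c, 0.0) == 0.0:
            ratios[c] = float("inf")  # class never seen in T sentences
        else:
            ratios[c] = p_v.get(c, 0.0) / p_t[c]
    return ratios


# Toy data (hypothetical classes: 1 = titles, 2 = archaic forms, 3 = other).
word_to_class = {"sir": 1, "mister": 1, "thou": 2, "come": 3}
tokens_v = ["sir", "mister", "come", "sir"]
tokens_t = ["thou", "come", "sir", "thou"]
ratios = class_ratios(tokens_v, tokens_t, word_to_class)
```

A ratio well above 1 marks a class as indicative of V, a ratio near 0 as indicative of T, and a ratio near 1 as uninformative.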
Qualitative analysis: Politeness features
• Politeness features failed to yield a good result
• Problem 1: Hand-built lists have insufficient coverage
• Difficult: what linguistic expressions convey “distance”?
• Problem 2: Features (at least in their current version) do
not distinguish well between T and V
• p(f|V)/p(f|T) values for all classes between 0.9 and 1.3
• For 13 of 16 features, p(f|V)/p(f|T) > 1: indicative of V
Conclusions
• Formal and informal language exists in English as well
• Indicators more dispersed across context
• Bootstrapping a T/V classifier for English possible
• Results still fairly modest
• Asymmetry: V more marked than T → better features
• Difficult to operationalize features with high recall
(sociolinguistic features, first names, …)
Future Work
• Learn social networks from the novel
• Change the scope of T/V from the sentence level to a
pair of interlocutors
References
• M. Faruqui & S. Padó. 2011. “I thou thee, thou traitor”: Predicting formal vs. informal address in English literature. ACL 2011.
• M. Faruqui & S. Padó. 2012. Towards a model of formal and informal address in English. EACL 2012.
• Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.
• Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.
• Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. COLING 2010.
• Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.
• Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Thank you!
Questions?
Please write to: [email protected]
[email protected]