Slide 1 - ACORN Aston Corpus Network

Collocations in translated text
issues, insights, implications
Silvia Bernardini
University of Bologna, Italy
Aston Corpus symposium
23 May 2008
Talk outline
• Collocations
– Corpus Linguistics
– Corpus-Based Translation Studies
• Research questions, methodology, results
– Fiction
– Open source software
• Implications
– Descriptive and applied
• Methodological follow up
• Future work
Background: Collocations in CL
– “Phraseology-oriented” approaches
• E.g. (Howarth 1996:47)
[Restricted collocations are] combinations in which
one component is used in its literal meaning, while
the other is used in a specialised sense. The
specialised meaning of one element can be
figurative, delexical or in some way technical and
is an important determinant of limited collocability
at the other. These combinations are, however,
fully motivated.
Background: Collocations in CL
• “Parameters” of collocation within
phraseology approaches
– Motivation/arbitrariness
– Commutability
– Non-literalness
– Transparency
– Unpredictability
Background: Collocations in CL
• “Frequency-oriented” approaches
– “Automatisation” is the result of repetition
– British school of linguistics (Firth)
• The statistical tendency of words to co-occur
(Hunston 2002: 12)
• “Significant” collocation is regular collocation
between items, such that they occur more often
than their respective frequencies and the length of
the text in which they occur would predict (Jones
and Sinclair 1974:19)
Searching for collocations in text
• “Keyword” method
– Starting from a (set of) keyword(s) and looking
left and right
• E.g. Sinclair 1998, Stubbs 2001, Danielsson 2001
• “Sequence” method
– Selecting all sequences of N words (or lemmas,
or POS tags) that recur a certain number of times
• E.g. Kjellmer 1994, Biber et al. 1999, Johansson 1993
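The "sequence" method can be illustrated with a minimal sketch. It assumes the text is already tokenised into a list of word forms; the function name `recurrent_sequences` and the toy sentence are hypothetical and not taken from any of the cited studies.

```python
from collections import Counter

def recurrent_sequences(tokens, n=2, min_freq=2):
    """Return every n-word sequence that occurs at least min_freq times."""
    # Slide a window of n tokens over the text and count each sequence.
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {seq: fq for seq, fq in counts.items() if fq >= min_freq}

# Toy input (hypothetical): a whitespace-tokenised sentence.
text = "the web site links to a web page and the web site loads".split()
recurring = recurrent_sequences(text, n=2, min_freq=2)
print(recurring)
```

The same function works on lemmas or POS tags if `tokens` holds those instead of word forms. The "keyword" method would instead fix one item and count what appears in a window to its left and right.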
Statistics
• MI, t-score, z-score, log-likelihood…
– P. Baker (2006), McEnery et al (2006)
• Bare frequency
– Krenn and Evert (2001)
• A mixture of both
– MI * log fq
• Kilgarriff and Tugwell (2001)
– frequency-based cut-offs
• Krenn (2000)
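The three ranking options listed above can give quite different orderings. The sketch below compares them on invented counts: the pairs, their frequencies and the corpus size `N` are all hypothetical, and the choice of natural logarithm for "log fq" is an assumption, since the slide does not specify a base.

```python
import math

N = 1_000_000  # hypothetical corpus size in tokens

# Hypothetical counts: pair -> (pair frequency, frequency of word 1, frequency of word 2)
counts = {
    ("web", "site"): (500, 4000, 3000),
    ("nasal", "twang"): (2, 40, 10),
    ("new", "page"): (300, 60000, 2000),
}

def mutual_information(f_xy, f_x, f_y, n=N):
    """log2 of the observed pair frequency over the chance expectation."""
    return math.log2(f_xy * n / (f_x * f_y))

def mi_log_fq(f_xy, f_x, f_y, n=N):
    """MI weighted by the log of the pair frequency (log base is an assumption)."""
    return mutual_information(f_xy, f_x, f_y, n) * math.log(f_xy)

def rank(score):
    """Order the pairs from highest to lowest score."""
    return sorted(counts, key=lambda pair: score(*counts[pair]), reverse=True)

by_frequency = rank(lambda f_xy, f_x, f_y: f_xy)
by_mi = rank(mutual_information)
by_mixture = rank(mi_log_fq)

for name, ranking in [("fq", by_frequency), ("MI", by_mi), ("MI*log fq", by_mixture)]:
    print(name, ranking)
```

With these invented numbers, bare frequency favours the frequent pairs, MI favours the rare but exclusive pair, and the mixture sits between the two.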
NN in ukWaC (bare fq, top 10)
fq        collocation
175642    web site
81127     case study
70514     search engine
66693     application form
65198     credit card
60626     web page
56721     car park
48833     health care
47655     climate change
46643     email address
Collocations in CBTS
applied perspectives
• Bahumaid (2006)
– Arab university lecturers translating sentences
containing collocations (make a noise, domino
effect) into English and into Arabic with any
reference tools available
• Less than 50% “correct” answers even when
translating into their L1
• Paraphrase most common strategy (40-48%)
Collocations in CBTS
applied perspectives
• Hatim and Mason (1997:205)
– Collocations should in general be neither less
unexpected (i.e. more banal) nor more unexpected
(i.e. demanding greater processing effort) than in the
ST
• Baker (1992: 56ff)
– Engrossing effect of source text patterning
– Tension between accuracy and naturalness
– The use of established patterns of collocation […]
helps to distinguish between a smooth translation,
one that reads like an original, and a clumsy
translation which sounds ‘foreign’.
Issues in descriptive CBTS
• Translation “norms” or “universals”
• Corpus research in TS should focus on the
identification of “features which typically occur in
translated text rather than original utterances and
which are not the result of interference from specific
linguistic systems”. (Baker 1993:243)
• E.g.: explicitation/explicitness, simplification, disambiguation, levelling
out (homogeneity), preference for conventional grammar, avoidance of
repetition, exaggeration of features of the target language,
normalisation/sanitisation…
Collocations in CBTS
descriptive perspectives
• Anecdotal evidence by Øverås (1998):
ST: Arket i skrivemaskinen var like skinnende nyfødt blankt som
da hun satte det inn i valsen for en time siden.
(newborn blank)
TT: The sheet of paper in her typewriter was as pristinely white
as when she had inserted it over an hour ago.
• Confirms Toury’s (1995) hypothesis that translators often
produce repertoremes in place of textemes, i.e. they
“produce ready-made, cliché structures”.
Collocations in CBTS
descriptive perspectives
• Kenny (2001)
– Normalisation/sanitisation in the translation of creative
lexical combinations
• Danielsson (2001)
– Automatic identification of collocations (keyword-based) in ST corpus and analysis of renderings in TT
corpus
• Dayrell (2007)
– Range of collocations employed in original vs.
translated language (monolingual comparable
comparison)
– 10 nouns with frequency >200 and their collocates in
a span ±4, fq4, MI4
Limits
• Kenny (2001)
– Habitual collocations not covered; method not scalable
• Danielsson (2001)
– Plagued by data-sparseness
• Only 2 units of meaning (of the ~12K identified in a large
monolingual corpus) occur 5 times in an 800K-word parallel corpus
• Dayrell (2007)
– Main issue investigated is lexical repetitiveness at the
collocational level
– Selective focus: collocations of frequent words only
– No cross-check with source texts
– Uncontrolled variable makes results difficult to interpret
An alternative approach
Research questions
– Are translated texts more/less collocational
than original texts in the same language?
• i.e., are their collocation types overall more/less
frequently attested and/or significant?
– If so, is this a consequence of the translation
process?
• i.e., can we identify shifts that could account for the
observed overall differences?
An alternative approach
Corpus resources
• Literary and specialised texts English/Italian
– Monolingual comparable corpora (MCC)
• Originals in Language A and comparable translations into
Language A
– Parallel corpora
• Originals in Language A and their translations into Language
B, usually combined with reference corpora
+ Reference corpora of English (BNC) and Italian
(Repubblica)
An alternative approach
Corpus resources
• Literary texts
– 8 English STs→ Italian TTs (samples)
– 7 Italian STs→ English TTs (samples)
– ~150K words per component
• Specialised texts
– Open-source software documentation
• 10 English STs→ Italian TTs (full texts)
• 6 Italian originals (full texts) (→ 1 English translation)
– ~250K words per component
Fiction texts sampled
English STs
• M. Atwood, The handmaid’s tale
• M. Atwood, Cat’s eye
• M. Cruz Smith, Gorky Park
• C. Fowler, Red bride
• N. Gordimer, My son’s story
• G. Greene, The tenth man
• D. Leavitt, A place I’ve never been
• R. Rendell, Kissing the gunner’s daughter
Italian STs
• F. Camon, La malattia chiamata uomo
• G. Celati, Narratori delle pianure
• C. Comencini, Le pagine strappate
• Luther Blissett, Q
• D. Maraini, Donna in guerra
• G. Pontiggia, Il giocatore invisibile
• G. Tomasi di Lampedusa, Il gattopardo
OSS texts sampled
English STs
• S. Frampton, Linux administration made easy
• L. Wirzenius, The Linux System Administrator’s Guide
• M. Cooper, The Advanced Bash-Scripting Guide
• G. Beekmans, Linux from scratch
• G. Short, 3-button mouse HOWTO
• D. Jarvis, 3D Graphics Modelling and Rendering mini HOWTO
• J. Tranter, Linux Amateur Radio AX.25 HOWTO
• E. Raymond, The DocBook Demystification HOWTO
• P. Gortmaker, Linux Ethernet HOWTO
• R. Russell, Linux IPCHAINS HOWTO
Italian originals
• A. Madesani, IDE e SoundBlaster 32 creative – HOWTO
• L. Pulici, Adaptec AVA 1505 mini-HOWTO
• G. Paolone, LDR Linux Domande e Risposte
• D. Medri, Linux facile
• G. Giusti, Programmare in PHP
• D. Giacomini, Appunti di informatica libera
Extracting collocations
• Target sequences
– Lexical collocations
– Made of two words
– Contiguous
• Pos-based extraction from study corpora
• JN, NN, VN, V * N, N * * N (types)
– Collection of token frequencies from
reference corpora (BNC and Repubblica)
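POS-based extraction of the patterns listed above can be sketched as follows. This is only an illustration: it assumes the corpus is already tagged with a simplified one-letter tagset (J, N, V, ...), treats `*` as "any single token", and keeps only the words in the non-wildcard slots. The names `PATTERNS` and `extract` and the example sentence are hypothetical.

```python
from collections import Counter

# Each pattern is a sequence of POS tags; "*" matches any single token.
PATTERNS = {
    "JN": ["J", "N"],
    "NN": ["N", "N"],
    "VN": ["V", "N"],
    "V*N": ["V", "*", "N"],
    "N**N": ["N", "*", "*", "N"],
}

def extract(tagged, pattern):
    """Count the lexical words of every window in `tagged` matching `pattern`."""
    found = Counter()
    width = len(pattern)
    for i in range(len(tagged) - width + 1):
        window = tagged[i:i + width]
        if all(p == "*" or p == tag for p, (_, tag) in zip(pattern, window)):
            # Keep only the words in the non-wildcard slots (an assumption).
            words = tuple(w for p, (w, _) in zip(pattern, window) if p != "*")
            found[words] += 1
    return found

# Hypothetical tagged sentence: (word, simplified POS tag)
tagged = [("she", "P"), ("closed", "V"), ("the", "D"), ("gate", "N"),
          ("with", "I"), ("barbed", "J"), ("wire", "N")]

for name, pattern in PATTERNS.items():
    print(name, dict(extract(tagged, pattern)))
```

The resulting types would then be looked up in a reference corpus to collect their token frequencies.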
Extracting collocations
• Calculate Mutual Information (MI)
• Rank sequences
• Take top
– Arbitrary cut-off point: MI>2 and fq>1
• Calculate significance of difference between
original and translated
– Mann-Whitney significance tests
Mutual Information
MI compares the probability of observing x and y
together (the joint probability) with the probabilities of
observing x and y independently (chance). If there is a
genuine association between x and y, then the joint
probability P(x,y) will be much larger than chance […].
(Church & Hanks 1990:77)
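\[
\mathrm{MI}(x;y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)} = \log_2 \frac{f(x,y)\cdot N}{f(x)\cdot f(y)}
\]
where f(·) is a frequency count in the reference corpus and N is the corpus size in tokens.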
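As a worked illustration, MI can be computed directly from reference-corpus counts and combined with the cut-off mentioned earlier (MI>2 and fq>1). This is a sketch under stated assumptions: `f_xy`, `f_x` and `f_y` are raw frequency counts, `n` is the corpus size in tokens, and the function names are hypothetical.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """log2 of the observed pair frequency over the frequency expected by chance."""
    expected = f_x * f_y / n  # pair frequency if x and y were independent
    return math.log2(f_xy / expected)

def passes_cutoff(f_xy, f_x, f_y, n, min_mi=2.0, min_fq=1):
    """Keep a pair only if fq > min_fq and MI > min_mi."""
    return f_xy > min_fq and mutual_information(f_xy, f_x, f_y, n) > min_mi

# Hypothetical counts in a 1024-token corpus: the pair occurs 8 times,
# its components 16 and 32 times. Chance predicts 0.5 co-occurrences.
print(mutual_information(8, 16, 32, 1024))
print(passes_cutoff(8, 16, 32, 1024))
print(passes_cutoff(1, 16, 32, 1024))  # a single occurrence fails fq > 1
```

A pair that occurs less often than chance predicts gets a negative MI and is dropped by the same cut-off.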
Mann-Whitney-Wilcoxon ranks test
• Confidence with which we can reject the null
hypothesis that two ranked sets of observations
are taken from the same population
• Non-parametric, i.e. makes no assumptions
about observations being normally distributed
• Used (and tested) by Kilgarriff (2001) in
comparisons of the LOB and Brown corpora and
of male and female speech in the BNC
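The test can be sketched in pure Python as follows. This is a minimal illustration, not the procedure used in the study: it uses the normal approximation with tie correction and no continuity correction, which is only reasonable for samples larger than the toy ones shown. The function name and the scores are hypothetical. Some statistics packages report the U statistic of the first sample as W.

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney test; returns (U for sample a, p value)."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    tie_term = 0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal curve
    return u_a, p

# Hypothetical MI scores for collocations in two sub-corpora.
translated = [5.1, 4.8, 6.2, 5.9, 4.4]
original = [3.9, 4.1, 4.6, 3.5, 4.0]
u, p = mann_whitney_u(translated, original)
print(u, p)
```

Because the test works on ranks rather than raw values, a handful of extreme MI scores cannot dominate the comparison.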
Rankings (top 10) for JN (eng)
Original fiction corpus
MI        collocation               fq (BNC)
7.0621    Shredded Wheat            9
6.4372    open-toed sandals         5
5.9465    beta carotene             5
5.7365    Milky Way                 80
5.5479    barbed wire               193
5.4172    floppy disks              63
5.3891    eternal damnation         14
5.3798    cursive script            18
5.3046    pearl necklace            14
5.2500    herbal teas               7
Translated fiction corpus
MI        collocation               fq (BNC)
6.2687    wall-to-wall carpeting    6
6.1698    vous plait                10
5.6773    pistachio nuts            10
5.3305    boric acid                5
5.2218    submachine gun            9
5.2170    Venetian blinds           16
5.2060    Neapolitan dialect        4
5.1170    nasal twang               2
5.0816    westering sun             4
5.0775    hard-boiled eggs          30
Results - Fiction
Pattern        W          p value (confidence)    MI/(LOG)FQ    Higher in
2NV lit ita    16974.5    0.008979                MI            Tr
2VN lit ita    9936.5     0.01088                 MI            Tr
2NV lit eng    23042      0.05452                 (LOG)FQ       Tr
2JN lit eng    856885     0.08781                 MI            Or
3JJ lit ita    621        0.06798                 MI            Tr
3NN lit ita    272195     0.007834                MI            Tr
3NN lit eng    42888      0.06323                 (LOG)FQ       Tr
4NN lit ita    3009.5     0.07768                 (LOG)FQ       Tr
Results - OSS
Pattern    W          p value (confidence)    MI/(LOG)FQ    Higher in
2JN ita    122618     0.002261                MI            Or
3NN ita    21683      0.02607                 MI            Or
3NN ita    22066.5    0.01029                 (LOG)FQ       Or
2NJ ita    78109.5    0.001134                MI            Or
2NN eng    19142.5    0.005172                (LOG)FQ       Tr
2RJ eng    7609       0.06921                 MI            Or
2RV eng    10458      0.04767                 MI            Or
2VR eng    2907       0.01517                 (LOG)FQ       Or
3VN eng    11904      0.05694                 MI            Or
3NN eng    1910.5     0.0429                  (LOG)FQ       Or
4VN eng    1027       0.06974                 (LOG)FQ       Or
Summing up
• Translated fiction texts (Italian and
English) tend to be (overall) richer in
salient collocations than original texts in
the same language
• Italian (and English) open source software
manuals however show the opposite
trend…
Implications for descriptive TS
• Norm/law-governed (rather than universal)
trends (Toury 1995)
– Law of interference
• Stronger in OSS translation
– Law of growing standardization
• Stronger in fiction translation
Implications for applied TS
• Parallel comparison (not discussed here)
highlights strategies displayed by
professional translators at the collocational
level
• Starting point for awareness-raising and
revision exercises focusing on:
– Normalization
– Rise in formality
– Explicitation
Methodological follow up
• Crucial role played by reference corpora
• What happens if we repeat the
calculations with MI data from different
reference corpora?
Adjective-Noun
(Italian OSS texts)
Reference corpus              Pattern    W         p value (confidence)    MI/(LOG)FQ
Repubblica (fq>1 and MI>2)    2JN ita    122618    0.002261                MI
itWaC (fq>10 and MI>1)        2JN ita    147350    0.675                   MI
Noun – prep|conj – Noun
(Italian fiction texts)
Reference corpus              Pattern               W         p value (confidence)    MI/(LOG)FQ
Repubblica (fq>1 and MI>2)    3N prep|conj N ita    272195    0.007834                MI
itWaC (fq>10 and MI>1)        3N prep|conj N ita    320480    0.001959                MI
Further work
– Bottom-up search for regularities
• Other genres?
– Source-oriented approach
• Starting from ST collocations
– Collocation extraction and reference corpora
• Evaluation of method
– Search for creative exploitation of collocations
• Can it be automatised?
Thank you