The Translational English Corpus:
A practical approach to corpus building
Outline
• TEC and new developments
– EDT Corpus
– Humanities Corpus
• Corpus design
– Representativeness
– Balance
– Size
• Corpus building
– Identifying material
– Scanning/Converting texts
– Tagging & Annotation
A corpus of contemporary English translations:
written texts translated into English from a
variety of source languages
http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
[Figure: bar chart of the number of books in each source language
for fiction and (auto)biography. Languages shown: French, Arabic,
German, Italian, Spanish, Brazilian…, Portuguese, Russian, Norwegian,
Welsh, Catalan, Japanese, Latin American…, Polish, Slovene, Swedish,
Tamil, Chinese, Finnish, Greek, Hebrew, Serbian, Vietnamese, Hopi.]
Set of software tools for the investigation of a
wide range of issues to do with the language of
translated texts.
Header File: contains meta‐data such as the title of
the text, author, publisher, etc.
Text File: contains the actual data to be analysed
– Sub-corpus Selection:
Allows you to select particular text files or groups of text
files to search.
– Sort Tool:
Allows you to sort concordances to the left or right, and
specify the number of words between the search keywords.
– Corpus Tree Viewer:
Allows you to “grow” a tree for various keywords. The size
of the text reflects frequency of occurrence in the corpus.
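The kind of concordance sorting described above can be sketched as follows. This is a minimal illustration, not the TEC tools themselves: the function names `concordance` and `sort_hits`, the tokenisation rule, and the sample sentence are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=4):
    """Return (left, node, right) token lists for every hit of `keyword`."""
    # Simple tokenisation (an assumption): lowercase words, optional apostrophe part.
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits, side="right", position=1):
    """Sort hits by the word `position` places to the left or right of the node."""
    def key(hit):
        left, _, right = hit
        # Reverse the left context so index 0 is the word nearest the node.
        context = right if side == "right" else left[::-1]
        return context[position - 1] if len(context) >= position else ""
    return sorted(hits, key=key)

# Hypothetical sample text, invented for illustration only.
sample = ("The translator said the text was difficult. "
          "A translator must choose. The translator always hesitates.")
hits = sort_hits(concordance(sample, "translator"), side="right", position=1)
for left, node, right in hits:
    print(" ".join(left).rjust(30), f"[{node}]", " ".join(right))
```

Sorting on the first word to the right groups lines that share a collocate, which is usually what makes patterns visible in a concordance.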
An electronic database of all material (to be) included in the TEC
for the subcorpora of fiction and (auto)biography.
The entry for each book includes not only most of the information
that is included in the header file, but also images of the covers
of the books.
• A corpus of discourses on translation for the investigation of
the way in which translation/translators are conceptualised
in society at different historical periods.
• No time, language or genre restriction: any material is
included as long as it is written in English.
• Two types of material
– Peritextual: material that accompanies the translation, e.g.
prefaces, introductions, afterwords, etc.
– Epitextual: published material (broadsheet and mainstream
newspapers, literary magazines, etc.)
• Link with TEC
A corpus of translations into English of works by theorists in
the humanities, e.g. philosophers, sociologists, literary
theorists, etc.
Temporality: translations date from 1900 onwards, but the
source texts do not have a time restriction.
* Multiple translations of the same book.
What is a corpus?
‘A collection of texts held in machine-readable form and capable
of being analysed automatically or semi-automatically’
(Baker 1995)…
….and has certain characteristics:
– Representativeness
– Balance
– Size
“a corpus is thought to be representative of the language variety
it is supposed to represent, if the findings based on its contents
can be generalised to the said language variety” (Leech 1991).
A corpus may focus on a particular genre/language/
author/translator, etc.
Decisions about criteria for selection of
texts
TEC Design
Material: English translations (whole texts)
Genres: Fiction, (auto)biography, in-flight magazines,
news articles
Time of publication: Late 80s onwards
Place of publication: UK and USA
“a balanced corpus covers a wide range of texts which are
supposed to be representative of the language variety under
question” (McEnery et al. 2006).
Also, ‘internal’ balance, e.g.
– Gender balance
– Source language balance
– Genre balance
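Internal balance can be checked by counting how texts are distributed across a metadata field. The sketch below assumes header information is available as simple records; the field names (`genre`, `source_language`, `translator_gender`) and the four sample records are hypothetical and do not describe the actual TEC holdings.

```python
from collections import Counter

# Hypothetical header records, invented for illustration only.
headers = [
    {"title": "Book A", "genre": "fiction", "source_language": "French", "translator_gender": "female"},
    {"title": "Book B", "genre": "fiction", "source_language": "Arabic", "translator_gender": "male"},
    {"title": "Book C", "genre": "biography", "source_language": "French", "translator_gender": "female"},
    {"title": "Book D", "genre": "fiction", "source_language": "German", "translator_gender": "male"},
]

def distribution(records, field):
    """Return the share of records for each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: round(n / total, 2) for value, n in counts.items()}

for field in ("genre", "source_language", "translator_gender"):
    print(field, distribution(headers, field))
```

A heavily skewed distribution for any field is a signal to look for more material in the under-represented categories before drawing conclusions from the corpus.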
A corpus needs to be adequate for the purposes for which it
is intended.
A bigger corpus is not necessarily more useful than a smaller
one.
Factors that affect corpus size:
– Purpose of the corpus
– Availability of data
– Copyright
• Research questions (purpose of the corpus)
– Specialised corpora and corpora intended for
morphosyntactic studies tend to be smaller than
general corpora and corpora intended for lexical
studies. Static corpora are also smaller than dynamic
ones.
• Availability of data
– The availability of suitable data (especially in machine-readable
form), as well as the ease with which they can be identified,
may affect the size of a corpus.
• Copyright
– Copyright clearance can impede corpus development
as well as the accessibility and availability of a corpus
to a wide audience.
– Copyright law varies internationally.
– Fair dealing: no permission needed for short extracts
not exceeding 400 words for prose (or a total of 800
words in a series of extracts, none exceeding 300
words).
– Out of copyright material: author’s / translator’s
lifetime + 70 years (UK).
– If you’re in doubt, seek permission! (McEnery et al.
2006)
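The word limits quoted above for fair dealing can be turned into a quick check before deciding whether to seek permission. This is a rough sketch of that rule of thumb only: the function name `within_fair_dealing` and the whitespace-based word count are assumptions, and the thresholds are simply the figures given on the slide.

```python
def within_fair_dealing(extracts):
    """Check prose extracts against the word limits quoted above.

    A single extract may have up to 400 words; a series of extracts
    may total up to 800 words, with no extract above 300 words.
    """
    # Assumption: words are counted by splitting on whitespace.
    counts = [len(extract.split()) for extract in extracts]
    if not counts:
        return True
    if len(counts) == 1:
        return counts[0] <= 400
    return sum(counts) <= 800 and max(counts) <= 300

print(within_fair_dealing(["word " * 350]))          # single extract
print(within_fair_dealing(["word " * 300] * 3))      # series of three extracts
```

Anything that fails the check, or any borderline case, falls under the advice above: seek permission.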
….We don't feel comfortable posting the entirety of both titles to
your database, but would be willing to make half of both books
available to your research center…We typically charge a fee of $150
per title for use of such a large portion.
…University Press is pleased to grant you non-exclusive,
English language, world rights to reprint limits of fair use
(under 300 words)…
We're delighted to learn of your interesting project, and
pleased to grant you general permission to use all book
reviews and blogs on our site. We'll be grateful if you can
include a link to the site in the pieces you use.
• Possible sources
• Publishers’ websites
• Search engines e.g. Farrar, Straus and Giroux, NYTimes
• Publishing houses specialising in translation
• Databases
• National databases e.g. Three Percent, LTI Korea
• Internet, archives, etc.
• Problems
• Search engine not well-designed e.g. The Telegraph
• Need for specific material
• In some cases, not indicated whether it is a translation or not
• For reviews: not always related to translation
• Scanning
• Flat-bed scanner – Document feeder
• Paper and print quality
• Scanner settings: Resolution and Colour vs Greyscale
• OCR (Optical Character Recognition) Process
• Language support
• Accuracy
• Font type
• Document format
• Text File
• Spelling errors
• Character recognition errors (e.g. Tm instead of I’m)
• Save as .txt file
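Part of the clean-up after OCR can be scripted. The sketch below applies a table of known misrecognitions and saves the result as a UTF-8 .txt file. The names `OCR_FIXES`, `clean_ocr` and `save_txt` are hypothetical; only the "Tm" correction comes from the example above, and any real table would have to be built by inspecting the scanned texts.

```python
import re
from pathlib import Path

# Hypothetical correction table; only this entry comes from the slide.
OCR_FIXES = {"Tm": "I'm"}

def clean_ocr(text, fixes=OCR_FIXES):
    """Replace known OCR misrecognitions and tidy spacing."""
    for wrong, right in fixes.items():
        # Whole-word match only, so "Tm" inside a longer word is left alone.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    # Collapse runs of spaces or tabs left over from the page layout.
    return re.sub(r"[ \t]+", " ", text).strip()

def save_txt(text, path):
    """Write the cleaned text as a plain UTF-8 .txt file."""
    Path(path).write_text(text, encoding="utf-8")

cleaned = clean_ocr("Tm  sure the   scan is readable.")
print(cleaned)
```

Automatic replacement only covers errors that are already known, so the output still needs proofreading against the printed page; `save_txt(cleaned, "book.txt")` would then store the checked text.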
Adds value to a corpus, makes it easier to extract information
and prepares texts to be used with corpus software
Factors that affect the extent of tagging/annotation (Olohan 2004):
• Purpose of the corpus
• Corpus software
• Accessibility of the corpus
• Technical expertise of the researcher
• POS (Part-of-Speech) Tagging
– Marks up a word in a corpus as corresponding to a particular part of
speech, based on both its definition and its context.
E.g. John_NP0 loves_VVZ Mary_NP0 ._.
• Lemmatisation
– Reduces the inflectional variants of words to their respective
lemmas, i.e. as they appear in a dictionary.
E.g. is, are, am -> BE
• Parsing
– Marks the syntactic structure of each sentence.
E.g. (S (NP (NNP John))
(VP (VBZ loves)
(NP (NNP Mary))))
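Annotated text in the word_TAG format shown above is easy to read back into a program. The sketch below splits such a line into (word, tag) pairs and looks up lemmas in a small table. The names `read_tagged`, `lemmatise` and `LEMMAS` are hypothetical, and the lemma table is a toy stand-in for the full lexicon a real lemmatiser would use.

```python
def read_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    # rsplit on the last underscore so a token such as '._.' still works.
    return [tuple(token.rsplit("_", 1)) for token in line.split()]

# Toy lemma table (an assumption for illustration).
LEMMAS = {"is": "be", "are": "be", "am": "be", "loves": "love"}

def lemmatise(word):
    """Return the lemma if the word is in the table, else the word itself."""
    return LEMMAS.get(word.lower(), word)

pairs = read_tagged("John_NP0 loves_VVZ Mary_NP0 ._.")
for word, tag in pairs:
    print(word, tag, lemmatise(word))
```

Keeping the tag next to each word makes it possible to search for grammatical patterns, for example every verb that follows a proper noun, rather than for word forms only.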
• Develop and use your own software
• Use existing corpus tools
– TEC Tools
For more information about how to use TEC Tools with local
corpora, you can download the tutorial from the TEC webpage.
– WordSmith Tools
A collection of corpus linguistics tools
– ParaConc
A bilingual or multilingual concordancer
– ….
“When a corpus is created, a compromise has often to be reached
between ideal design criteria and practical constraints. However,
while opportunistic choices may be justified, the limitations and
distortions they introduce in the makeup of a corpus should not
be forgotten when evaluating the results”. (Zanettin 2011)
TEC website
http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
TEC Email Address
[email protected]
Baker, Mona (1995) ‘Corpora in Translation Studies: An overview and some
suggestions for future research’, Target 7(2): 223-243.
Leech, Geoffrey (1991) ‘The State of the Art in Corpus Linguistics’, in Karin Aijmer
and Bengt Altenberg (eds) English Corpus Linguistics: Linguistic studies in
honour of Jan Svartvik, London: Longman, pp. 8-29.
McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-based Language
Studies, London and New York: Routledge.
Olohan, Maeve (2004) Introducing Corpora in Translation Studies, London and
New York: Routledge.
Zanettin, Federico (2011) ‘Translation and Corpus Design’, SYNAPS 26: 14-23.