Using Corpora and Translation

Corpus Linguistics and
Corpora
Corpus

Corpus, plural corpora: a collection
of linguistic data, either compiled as
written texts or as a transcription of
recorded speech. The main purpose
of a corpus is to verify a hypothesis
about language - for example, to
determine how the usage of a
particular sound, word, or syntactic
construction varies.
Corpus Linguistics


Corpus linguistics deals with the
principles and practice of using
corpora in language study. A
computer corpus is a large body of
machine-readable texts.
(cf. Crystal, David. 1992. An Encyclopedic Dictionary of Language
and Languages. Oxford, 85)
Corpus


CORPUS (13c: from Latin corpus
body. The plural is usually corpora)
(1) A collection of texts, especially if
complete and self-contained: the
corpus of Anglo-Saxon verse ...
(cf. McArthur, Tom 1992 "Corpus" , The Oxford Companion to the
English Language. Oxford, 265-266)
Chomsky 1957

"Any natural corpus will be skewed.
Some sentences won't occur because
they are obvious, others because
they are false, still others because
they are impolite. The corpus, if
natural, will be so wildly skewed that
the description [of language based
on the corpus] would be no more
than a mere list."
(Syntactic Structures. The Hague, 159)
Fillmore 1992


"I have two main observations to make.
The first is that I don't think there can be
any corpora, however large, that contain
information about all of the areas of
English lexicon and grammar that I want
to explore; all that I have seen are
inadequate.
The second observation is that every
corpus that I've had a chance to examine,
however small, has taught me facts that I
couldn't imagine finding out about in any
other way."
(cf. "Corpus linguistics" or "Computer-aided armchair linguistics",
in: Svartvik, Jan (ed.). Directions in Corpus Linguistics.
Berlin/New York, 35)
Types of corpus

Monolingual corpora - in which the
texts are all in the same language

Parallel and/or aligned corpora - in
which originals and translations are
aligned so that both texts are
synchronized to appear on the screen
together and it is easy to see how the
translator has translated the original.
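
A parallel concordance over aligned text can be sketched in a few lines of Python. This is only an illustration: the sentence pairs are invented, the function name `parallel_concordance` is hypothetical, and real tools such as COMPARA rely on proper sentence alignment and far richer queries.

```python
def parallel_concordance(pairs, keyword):
    """Return the aligned (source, target) pairs whose source contains `keyword`.

    `pairs` is assumed to be a list of already-aligned sentence pairs.
    Matching is a plain case-insensitive substring test, so "novel"
    would also match "novelty".
    """
    keyword = keyword.lower()
    return [(src, tgt) for src, tgt in pairs if keyword in src.lower()]


# Invented English/Portuguese pairs, for illustration only.
pairs = [
    ("The corpus is small.", "O corpus é pequeno."),
    ("She translated the novel.", "Ela traduziu o romance."),
]

for src, tgt in parallel_concordance(pairs, "novel"):
    print(src, "|", tgt)
```

Showing source and target side by side is what makes it easy to see how the translator handled each occurrence.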

Comparable corpora - in which a
selection of original texts has been
made in two or more languages dealing
with the same subject or genre.
Concurrent corpora - a term used to
describe texts taken from newspapers
on the same subject on approximately
the same dates.

Specialized corpora - texts on
specialized subjects. The principal
use for these corpora is the
extraction of terminology and
complementary explanatory
material - definitions, explanations,
semantic relations etc
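
One simple way to approach terminology extraction is to compare word frequencies in a specialized text with those in a general reference text. The sketch below assumes a crude relative-frequency ratio with add-one smoothing. The function names and sample texts are invented, and this is not the method of any particular tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_candidates(specialized, reference, min_count=2):
    """Rank words that are relatively more frequent in the specialized text."""
    spec = Counter(tokenize(specialized))
    ref = Counter(tokenize(reference))
    spec_total = sum(spec.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, count in spec.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        spec_rel = count / spec_total
        # Crude add-one smoothing so words absent from the reference
        # text still get a finite score.
        ref_rel = (ref[word] + 1) / (ref_total + 1)
        scores[word] = spec_rel / ref_rel
    return sorted(scores, key=scores.get, reverse=True)


specialized = "the enzyme binds the substrate and the enzyme releases the product"
reference = "the cat sat on the mat and the dog sat on the rug"
print(term_candidates(specialized, reference))
```

Words such as "the" occur often in both texts and score low, while subject-specific words rise to the top of the list.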

'Do-it-yourself ' corpora - a term coined by
those of us using small specialized corpora
for the purpose of teaching translation or
language
Disposable corpora - the same as 'do-it-yourself' corpora, but taking into account
that such corpora need to be disposed of
after use so that their users do not get into
trouble with copyright restrictions.
How do you search a corpus?
Concordancing
• Sentence level – see BNC
http://www.natcorp.ox.ac.uk
• COMPARA – parallel concordance
http://www.linguateca.pt/COMPARA
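
To make the idea of concordancing concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It assumes a plain-text corpus held in a single string. The function name `kwic` and the sample text are made up for illustration and do not reflect how the BNC or COMPARA interfaces work internally.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line for every match of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match of the search word.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


sample = ("A corpus is a collection of texts. Each corpus is built for a "
          "purpose, and a small corpus can still be useful.")
for line in kwic(sample, "corpus"):
    print(line)
```

Lining the keyword up in a central column is what lets the eye scan the words that typically appear to its left and right.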

The Survey of English Usage

60s - Randolph Quirk et al >
launched the Survey of English
Usage (SEU)
• "with the aim of collecting a large and
stylistically varied corpus as the basis
for a systematic description of spoken
and written English"
• Brown, Lancaster-Oslo/Bergen (LOB)
and London-Lund Corpus of Spoken
English
• See ICAME - International Computer
Archive of Modern and Medieval English
at the Norwegian Computing Centre for
the Humanities at
http://gandalf.aksis.uib.no/icame.html

Today at University College London (University of London) at
http://www.ucl.ac.uk/english-usage/
ICE - the International Corpus of
English
Download the sampler of this corpus
fully tagged and analysed from
http://www.ucl.ac.uk/english-usage/ice-gb/sampler/form.htm
Quality versus quantity



A small but fully analyzed and tagged corpus - e.g. early corpora and ICE (1 million
words)
British National Corpus – 100 million
words
Other corpora
• Bank of English - 450 million words

The Internet
Corpora, lexicography &
terminology


Lexicography BEFORE corpora
• Emphasis on etymology
• Complex definitions
• Usage based on intuitions of
lexicographers
Terminology BEFORE corpora
• Standardization > one word= one
concept, rigid definitions
• Paper dictionaries/glossaries

Lexicography & terminology
AFTER corpora
• Emphasis on modern usage in context
• Simple definitions
• Usage based on evidence in texts
• emphasis on establishing REAL rather
than IDEAL usage
COBUILD project





Begun in 1980
Collins, the well-known dictionary
publisher, and the University of
Birmingham – led by John Sinclair
A pioneering project
Objective > to build a corpus of
contemporary texts from which to extract
information on modern English usage
Work proceeded during the 80s - see Sinclair (ed.) 1987
COBUILD > Bank of English

Present site for COBUILD > Bank of
English http://www.titania.bham.ac.uk/docs/about.htm
British National Corpus (BNC) original


Oxford University Computing Service
at http://www.natcorp.ox.ac.uk/
This is completely free – but you only
get up to 50 results
Brigham Young University (BYU)
http://corpus.byu.edu/
Note:
• Corpus of American English
• BNC
• TIME corpus
• Corpus do Português
• Corpus del Español

PLEASE NOTE: You will need to
create a username and password to
use this – but it costs nothing
BNC – CQP version



Lancaster University
http://bncweb.lancs.ac.uk/bncwebSignup/
PLEASE NOTE: You will need to
create a username and password to
use this – but it costs nothing
Other large monolingual corpora



Portuguese > CETEMPUBLICO
http://www.linguateca.pt/cetempublico/
Spanish > Real Academia
German > Mannheimer corpus
Using corpora to study syntax

For example:
• whether certain nouns occur more often
in the singular than plural
• how pronouns are used in different
languages
• which verbs favour certain forms of
tense, aspect or mood
• how adjectives combine with nouns
• where adjuncts occur in sentences
• ETC
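
The first question in the list, whether a noun occurs more often in the singular or the plural, can be sketched as a simple count. The code assumes regular "-s" plurals and an untagged plain-text corpus. The helper name `singular_plural_counts` and the sample text are invented, and a tagged corpus would give far more reliable figures.

```python
import re
from collections import Counter


def singular_plural_counts(text, noun):
    """Count a noun's singular form and its regular '-s' plural in `text`."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for forms that never occur.
    return {"singular": tokens[noun], "plural": tokens[noun + "s"]}


sample = ("The translator checked one text. Other translators checked "
          "many texts and more texts.")
print(singular_plural_counts(sample, "text"))
print(singular_plural_counts(sample, "translator"))
```

Irregular plurals (for example "corpus" and "corpora") would need a lookup table or a lemmatizer, which this sketch deliberately leaves out.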
Monolingual corpora

General language corpora useful for
studying:
• Words in context
• Problems of COLLOCATION
• Relative usage of synonyms
• Syntactic structures
• Sentence structure
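
Collocation can be explored by counting the words that occur within a few positions of a chosen node word. The following is a rough sketch using raw co-occurrence counts. The function name `collocates` and the sample sentence are made up, and serious work would add a statistical association measure on top of the raw counts.

```python
import re
from collections import Counter


def collocates(text, node, span=2):
    """Count words occurring within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Tokens to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


sample = ("strong tea and strong coffee but powerful computers "
          "and strong tea")
print(collocates(sample, "strong", span=1).most_common())
```

Comparing the counts for near-synonyms such as "strong" and "powerful" is one way to see how their typical partners differ.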
Parallel Corpora - multilingual



European commission - Multilingual
http://ec.europa.eu/
EUROPARL - Multilingual
http://www.statmt.org/europarl/
ELDA
http://www.elda.org/sommaire.php
Parallel Corpora

COMPARA EN/PT
http://www.linguateca.pt/compara
Corpógrafo - LINGUATECA

An on-line suite of tools we have
developed for:
• Construction of corpora
• Semi-automatic extraction of
terminology
• Construction of terminology databases
• Terminology & corpora research
• Research into information retrieval and
knowledge engineering
CORPÓGRAFO




http://www.linguateca.pt/corpografo
FREE!
On-line!
For individual research
Bibliography



ICAME site at http://helmer.aksis.uib.no/icame.html
BIBER, Douglas, CONRAD, Susan & REPPEN, Randi. 1998.
Corpus Linguistics: Investigating Language Structure
and Use. Cambridge: Cambridge University Press.
BIBER, Douglas, JOHANSSON, Stig, LEECH, Geoffrey,
CONRAD, Susan & FINEGAN, Edward. 1999. Longman
Grammar of Spoken and Written English. Harlow:
Pearson Education Ltd.

HOEY, Michael. 1991. Patterns of Lexis in Text. Oxford:
Oxford University Press. ISBN 0 19 437142 5.
MCENERY, Tony & WILSON, Andrew. 2001. Corpus
Linguistics. 2nd Edition. Edinburgh: Edinburgh University
Press.
OAKES, Michael P. 1998. Statistics for Corpus Linguistics.
Edinburgh: Edinburgh University Press. ISBN 0 7486 0817 6
SINCLAIR, John (ed) 1987. Looking Up - An account of the
COBUILD project in lexical computing. Collins COBUILD.
Collins ELT: London and Glasgow.
STUBBS, Michael. 1996. Text and Corpus Analysis:
Computer-assisted Studies of Language and Culture. Oxford:
Blackwell Publications Ltd. ISBN 0-631-19512-2 (pbk).