Corpora and the Language Learner


Harnessing Corpora for real
and virtual ELT purposes
IFELT
Belinda Maia
FLUP
10/11.2003
What is a corpus?
• CORPUS (13c: from Latin corpus, ‘body’; plural
corpora)
• A body of texts, utterances or other specimens
considered more or less representative of a
language, stored as an electronic database.
• A corpus may store many millions of
running words
• A corpus can be tagged to identify and classify
words and other formations
• A corpus can be searched using concordancing
programmes
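As a rough illustration of what a concordancing programme does, the sketch below lists every occurrence of a search word with a fixed amount of context on either side (a keyword-in-context, or KWIC, display). The function name `kwic`, the `width` parameter and the sample text are assumptions made for this example; it is not how any particular concordancer is implemented.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # \b restricts matches to whole words; matching ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Hypothetical mini-corpus, adapted from the kind of sentences a corpus returns.
sample = (
    "Maybe with twists of bacon. "
    "They could cook vegetables and meat simply, deal with eggs and bacon and porridge. "
    "Three paintings of Innocent X by Francis Bacon."
)

for line in kwic(sample, "bacon"):
    print(line)
```

Because the search ignores case, the painter Francis Bacon turns up alongside the breakfast food, which is exactly the kind of ambiguity a learner has to sort out by reading the context.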
An example of concordancing
(from the BNC)
A0R 2231 Maybe with twists of bacon.
A35 256 This substantial, 15-minute orchestral movement was inspired by
three paintings of Innocent X by Francis Bacon, themselves based on
Velasquez.
A6N 1311 They could cook vegetables and meat simply, deal with eggs
and bacon and porridge, and they were able to bake and housekeep,
learning as they went along.
AAX 286 Sir Richard Body, MP Hirohito, shy god who liked bacon &
eggs.
ABB 67 Remembering bacon and ham, the versatility of the pig can be
stretched to pies, sandwiches and ham, egg and chips.
ABB 236 The Smoked Trout & Parma Ham Mousse (see p18) is merely
decorated with slices of the ham and the Carbonnade of Beef is
enriched by using diced ham instead of bacon.
An example of concordancing
(with Wordsmith)
Tagging
• Example – courtesy Catherine Ball at:
http://www.georgetown.edu/faculty/ballc/corpora/tutorial2.html#RTFToC16
• A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
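Tagged text in this word_TAG layout is easy to process by machine. The following is a minimal sketch, assuming each token is written as word, underscore, tag, and that anything without an underscore (the text reference and the ^ marker) can be skipped. The name `parse_tagged` is invented for the example.

```python
from collections import Counter

def parse_tagged(line):
    """Split a line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if "_" not in token:
            continue  # skip the reference ("A01", "4") and the "^" marker
        # Split on the last underscore, so the tag is whatever follows it.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs

# One line from the tagged example; a raw string keeps the backslash as-is.
line = r"A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN"

pairs = parse_tagged(line)
tag_counts = Counter(tag for _, tag in pairs)                # how often each tag occurs
nouns = [word for word, tag in pairs if tag.startswith("NN")]  # NN and NNS only

print(pairs)
print(tag_counts)
print(nouns)
```

Once the words and tags are separated, a search can be restricted by word class, for example to find "stop" only where it is tagged as a verb.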
Types of Corpora
• Monolingual corpora - in which the texts are all
in the same language
• Parallel and/or aligned corpora - in which
originals and translations are aligned so that
both texts appear on the screen together and
you can see how the translator has translated
the original.
• Comparable corpora - in which a selection of
original texts has been made in two or more
languages dealing with the same subject or
genre.
Types of Corpora
• Specialized corpora - texts on specialized
subjects for the extraction of terminology and
complementary explanatory material (definitions, explanations etc.)
• Concurrent corpora - used to describe texts
taken from newspapers on the same subject on
approximately the same dates.
• ‘Do-it-yourself’ or ‘disposable’ corpora - small
specialized corpora for the purpose of teaching
translation or language
Corpora and Lexicography
• COBUILD = Collins Publishers +
University of Birmingham – 1980s
– Corpora work that revolutionised lexicography
• TODAY - All serious lexicography uses
corpora - e.g.
– Oxford English Dictionary
http://www.oed.com/
– Academia das Ciências de Lisboa
Corpora & Grammar
• The Longman Grammars of English (Quirk,
Greenbaum, Svartvik, Leech and others)
– Based on corpora – the classical corpora now
available on CD-ROM through ICAME
– http://www.hd.uib.no/icame.html
• BIBER, D., S. JOHANSSON, G. LEECH, S.
CONRAD & E. FINEGAN. 1999. Longman
Grammar of Spoken and Written English.
Harlow: Pearson Education Ltd.
The corpora debate
• The bigger the corpus, the better
• The carefully chosen ‘representative’
corpora
• Chomsky's view: the average educated speaker
is a better source than any corpus
• Big corpora are not necessarily
representative – e.g. the Hansard corpus
• Any selection of texts – is a selection
Yet
• Very large corpora exist and are very useful
• Much research work nowadays is done with
small selected corpora for studying:
– different registers
– special subjects
Using official corpora - EN
• British National Corpus at:
http://sara.natcorp.ox.ac.uk/lookup.html
• 50 examples of any word or expression for
free on-line
• CD-ROM of 100 million words available
• The COBUILD project
http://titania.cobuild.collins.co.uk/form.html
• 40 examples on-line
Using official corpora - PT
• AC/DC, CetemPúblico – Portuguese
monolingual corpora
• COMPARA – aligned English/Portuguese
corpus
• All at http://www.linguateca.pt
Language Learning/Teaching
and corpora
• How can a language teacher use corpora?
• Why should a language learner need to
know about corpora?
• What can be learnt?
How can a language teacher use
corpora?
• The teacher can:
– find an enormous amount of material for use in
class, for exercises
– check on real usage and compare it to textbooks
used
• BUT:
• Must be aware that corpora sometimes
prove the textbook wrong!
What can be learnt?
• Corpora as reference material for:
– Lexical work
– Syntactic study
– Textual analysis
– Observing language ‘in action’
– Learning about a wide variety of areas
The student
• Can be trained to search autonomously for
information of all kinds
– Finding texts that supply real knowledge
– Finding texts that serve as models for style and
register
– Finding correct collocations of individual
words
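Finding collocations can be approximated by counting which words turn up near a chosen word. The sketch below is an assumption-laden illustration, not a description of any existing tool: `collocates`, the two-word `window` and the sample text are all invented here, and it reports raw counts only, not a statistical measure of association.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` words of each occurrence of node."""
    # Crude tokenisation: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word != node:
            continue
        start = max(0, i - window)
        # Neighbours to the left and right; sentence boundaries are ignored.
        neighbours = words[start:i] + words[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts

# Hypothetical sample, adapted from typical corpus sentences.
sample = (
    "deal with eggs and bacon and porridge. "
    "shy god who liked bacon and eggs. "
    "diced ham instead of bacon."
)

counts = collocates(sample, "bacon")
print(counts.most_common(3))
```

Even on three short sentences, "and" and "eggs" come out as the most frequent neighbours of "bacon". On a real corpus, function words such as "and" would normally be filtered out or down-weighted.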
Do-it-yourself corpora
• Suggestion:
• Train students to make and use their own
corpora by:
– Collecting texts off the Internet
– Using the ‘Find’ function in Word
– Broadening their vocabulary
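For students comfortable with a little scripting, the same collect-and-search routine can be sketched in a few lines. This assumes the downloaded texts have been saved as UTF-8 .txt files in a single folder; `load_corpus` and `find_lines` are illustrative names, and the ‘Find’ function in Word remains the simpler route.

```python
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in folder into a {filename: text} dictionary."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }

def find_lines(corpus, term):
    """Yield (filename, line number, line) for each line containing term."""
    term = term.lower()
    for name, text in corpus.items():
        for number, line in enumerate(text.splitlines(), start=1):
            # Case-insensitive substring match, like a basic 'Find' search.
            if term in line.lower():
                yield name, number, line.strip()
```

Calling `load_corpus` on the folder of saved texts and passing the result to `find_lines` with a search word gives a list of every line in which that word appears, together with the file it came from.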
Useful sites
• Catherine N. Ball, Tutorial: Concordances and Corpora, at:
http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
• Tim Johns’s Data-driven learning at:
http://web.bham.ac.uk/johnstf/
Useful sites
• Concordance the whole Web at:
http://www.webcorp.org.uk/
• And, of course, Google at:
http://www.google.com