What is a Corpus?

• What is not a corpus?
– the Web
– a collection of citations
– a text
• Definition of a corpus
“A corpus is a collection of pieces of language text in
electronic form, selected according to external
criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic
research.” (Sinclair, 2005)
Corpus design (some notions)
• Criteria (external): language or language variety,
mode, text type, domain, text location, etc.
– Criteria will form cells
• Sampling
• Balance
– Range of text categories in the corpus
• Representativeness
“Representativeness refers to the extent to which a
sample includes the full range of variability in a
population.” (Biber, 1993)
Corpus size
• There is no maximum size!
• The minimum size depends on:
– The kind of query (e.g. frequent words, technical terms)
– The methodology used for studying the data
• Zipf’s law: roughly half of the word types in a corpus occur
only once, about a quarter only twice, etc.
• Words occurring once only (hapax legomena) are unlikely
to be of interest for more general study of a language
• Words → sequences of words: frequencies drop sharply as the sequence gets longer
• Studies of collocations require bigger corpora
• Corpora for specialised studies can be much smaller
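The point about words occurring once only can be checked on any text by building a frequency spectrum (how many word types occur once, twice, and so on). The following is a minimal sketch, not a standard tool: the tokenizer is a deliberately crude regular expression, and the function name and sample sentence are made up for illustration. On a sample this small the proportions will not match the "half, quarter" pattern, which only emerges in larger corpora.

```python
import re
from collections import Counter

def frequency_spectrum(text):
    """Return (number of types, Counter mapping frequency -> number of types)."""
    # Crude tokenizer (assumption): lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    word_freq = Counter(tokens)              # word type -> number of occurrences
    spectrum = Counter(word_freq.values())   # frequency -> number of types
    return len(word_freq), spectrum

# Made-up sample text, far too small to be a corpus.
sample = "the cat sat on the mat and the dog sat on the log"
n_types, spectrum = frequency_spectrum(sample)

# Share of word types that are hapax legomena (occur exactly once).
hapax_share = spectrum[1] / n_types
print(n_types, dict(spectrum), round(hapax_share, 2))
```

Running the same function on a corpus of increasing size is one way to see why rarer items need bigger corpora: the rarer the item, the more text is needed before it occurs often enough to study.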
Why use a corpus?
• “As language teachers and professionals, we often have strong
intuitions about language use… Corpus-based research, however,
shows us that our intuitions are often completely wrong.” (Biber,
2005)
• Even if our intuition is correct, the language we produce may not
represent typical language use (McEnery et al., 2006)
• Corpus-based research: authentic data
• Using a computer to study language:
– quick processing of data
– accurate and consistent
– non-biased
– allows enriching data with additional information
What can a corpus be used for?
• Quantitative analysis – what can be counted?
– characters
– word-forms
– parts of speech
– sentences
– paragraphs
– sections, chapters…
– utterances
– turns
• Qualitative analysis
– meaning
– patterns
– semantic prosody
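Some of the countable units listed above can be approximated with a few lines of code. This is only a rough sketch under simplifying assumptions: sentences are assumed to end in ".", "!" or "?", paragraphs are assumed to be separated by blank lines, and the function name and sample text are hypothetical. Units such as parts of speech, utterances and turns need annotated or transcribed data and are not covered here.

```python
import re

def count_units(text):
    """Count a few simple units in a plain-text string."""
    # Paragraphs: blocks separated by one or more blank lines (assumption).
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: split after ., ! or ? followed by whitespace (naive).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Word-forms: runs of letters, digits or underscores.
    word_forms = re.findall(r"\w+", text)
    return {
        "characters": len(text),
        "word_forms": len(word_forms),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }

# Made-up sample text with two paragraphs.
sample = "Corpora are useful. They contain authentic data.\n\nAre they always enough? No."
counts = count_units(sample)
print(counts)
```

The naive sentence rule will miscount abbreviations such as "e.g.", which is one reason corpus tools use more careful segmentation.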
Types of corpora (1)
• Reference corpus – large; includes spoken and written texts
representing various social and situational strata
• Monitor corpus – growing regularly, reflects language changes
• Balanced corpus – balanced according to text type, genre, or domain
• Sampled corpus – finite collection of carefully selected texts
• Annotated corpus – enhanced with various types of linguistic
information
• Unannotated (raw) corpus – contains only plain texts with no
additional linguistic information
Types of corpora (2)
• General (represents a language or language variety) and
specialized corpora (domain or genre specific)
• Monolingual and multilingual corpora; parallel corpora
• Comparable corpora
• Spoken and written corpora
• Synchronic and diachronic corpora
• Native speaker and learner corpora
• The British National Corpus (BNC)
• The Michigan Corpus of Academic Spoken English (MICASE)
Basic notions in corpus linguistics
• type / token
Example:
A corpus is a collection of pieces of language text in electronic
form,
13 tokens
11 types (a, corpus, is, collection, of, pieces, language, text, in,
electronic, form)
• word-form / lemma
play, plays, playing, played (word-forms of lemma play)
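The type/token count in the example above can be reproduced with a short script. This is a minimal sketch: the tokenizer simply lower-cases the text and keeps runs of letters, and the lemma table is a hypothetical toy lookup covering only the word-forms of play, not a real lemmatizer.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and return runs of letters as tokens (crude)."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "A corpus is a collection of pieces of language text in electronic form,"
tokens = tokenize(sentence)   # every running word counts as a token
types = set(tokens)           # each distinct word-form counts once as a type
print(len(tokens), len(types))  # 13 11

# Hypothetical toy lemma lookup; unknown forms are treated as their own lemma.
LEMMAS = {"plays": "play", "playing": "play", "played": "play"}

def lemma_of(word_form):
    return LEMMAS.get(word_form, word_form)

# Group word-forms under their lemma.
forms_by_lemma = defaultdict(set)
for form in tokenize("play plays playing played"):
    forms_by_lemma[lemma_of(form)].add(form)
print(dict(forms_by_lemma))
```

Note that "A" and "a" are counted as one type only because the text is lower-cased first; a case-sensitive count would give 12 types.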
Types of output/analyses
• Word/phrase frequency: wordlists, N-grams
(clusters, lexical bundles)
• Concordance (node, KWIC, sorting, expanded
context)
• Collocation (span, T-score, Mutual information)
• Keywords
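Three of these outputs can be sketched in a few lines each. The code below is illustrative only and does not reproduce what AntConc, WordSmith Tools or Sketch Engine compute. It assumes a crude tokenizer, a KWIC context measured in tokens, and a collocation span of one word to the right of the node. The scores use one common textbook formulation, with O the observed pair frequency and E = f(w1) × f(w2) / N the expected frequency: MI = log2(O / E) and T-score = (O − E) / √O. All function names and the sample text are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and return runs of letters as tokens (crude)."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Count every sequence of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for each hit of the node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def collocation_scores(tokens, w1, w2):
    """Return (observed, expected, MI, T-score) for w2 directly after w1."""
    n = len(tokens)
    unigrams = Counter(tokens)
    observed = ngrams(tokens, 2)[(w1, w2)]
    expected = unigrams[w1] * unigrams[w2] / n
    if observed == 0:
        # Scores are undefined when the pair never occurs.
        return observed, expected, None, None
    mi = math.log2(observed / expected)
    t_score = (observed - expected) / math.sqrt(observed)
    return observed, expected, mi, t_score

# Made-up sample text.
text = ("strong tea is nice but strong coffee is better than weak tea "
        "and strong tea wins")
tokens = tokenize(text)

for left, node, right in kwic(tokens, "tea", width=2):
    print(f"{left:>15}  {node}  {right}")

print(ngrams(tokens, 2).most_common(1))
print(collocation_scores(tokens, "strong", "tea"))
```

MI tends to favour rare pairs and the T-score frequent ones, which is why the two are often reported side by side.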
Building your own corpus (1)
• Design:
– Research question
– Criteria → cells
• Identifying sources of data
– Existing corpora
– Data archives: the Oxford Text Archive
– Other: Nexis UK, Lexis Nexis, WebBootCaT
• Copyright
Building your own corpus (2)
• Data collection
– Downloading
– Recording
– Scanning
– Keyboarding
• Documentation
• Preparing data:
– data conversion to txt format (including transcription)
– character encoding
– clean-up
– mark-up
– alignment (parallel data)
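The conversion, encoding and clean-up steps might look something like this for a downloaded web page. It is a rough sketch under several assumptions: the source is UTF-8, anything in angle brackets is unwanted markup, and the function name and sample input are hypothetical. Real data usually needs more careful handling (for example, an HTML parser rather than a regular expression).

```python
import re
import unicodedata

def to_clean_text(raw_bytes, encoding="utf-8"):
    """Decode raw bytes and return tidied plain text."""
    # Character encoding: decode, replacing undecodable bytes visibly.
    text = raw_bytes.decode(encoding, errors="replace")
    # Use one consistent Unicode form for accented characters.
    text = unicodedata.normalize("NFC", text)
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Clean-up: crudely drop HTML-like tags (assumption: no "<" in real text).
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of spaces/tabs and trim spaces around line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Made-up sample input.
raw = "<p>Caf\u00e9  culture</p>\r\n\r\n\r\n<p>Second   paragraph.</p>".encode("utf-8")
clean = to_clean_text(raw)
print(clean)
```

The cleaned string can then be saved as a .txt file in a known encoding, for example with `pathlib.Path("file.txt").write_text(clean, encoding="utf-8")`, and the steps applied should be recorded in the corpus documentation.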
Building your own corpus (3)
• Selecting a corpus tool for analysis
• Adding value to data – annotation:
– part-of-speech (POS) tagging, tokenization
– lemmatization
– parsing
– other: semantic annotation, pragmatic annotation, error tagging, etc.
Corpus tools
• Freely available vs commercial tools
• Standalone vs online tools
• AntConc (free, standalone)
• WordSmith Tools (not free, standalone)
• Sketch Engine (not free, online)
What can corpora not tell us?
• No negative evidence
• Corpora rarely provide explanations
• Their usefulness depends on the research
question
• Findings cannot be generalised (unless the corpus
is representative)
Suggested reading
• Hunston, S. 2002. Corpora in Applied Linguistics.
Cambridge: Cambridge University Press.
• Kennedy, G. D. 1998. An Introduction to Corpus
Linguistics. Harlow: Longman.
• McEnery, T. and Wilson, A. 2001. Corpus Linguistics.
(2nd Ed.) Edinburgh: Edinburgh University Press.
• McEnery, T., Xiao, R. and Tono, Y. 2006. Corpus-Based
Language Studies: An Advanced Resource Book. London:
Routledge.
• Sinclair, J. 1991. Corpus, Concordance, Collocation.
Oxford: Oxford University Press.
Corpora
• ACORN (Aston students and staff only): http://acorn.aston.ac.uk/private/language.php
• Collins COBUILD Corpus [56 million words]: http://www.collins.co.uk/Corpus/CorpusSearch.aspx
• British National Corpus (BNC): http://sara.natcorp.ox.ac.uk/lookup.html
• CORPUS.BYU.EDU (Brigham Young University): http://view.byu.edu/
• MICASE (Michigan Corpus of Academic Spoken English): http://quod.lib.umich.edu/m/micase/
• International Corpus of English (ICE), an agreement must be sent (no fee) to get the password: http://www.ucl.ac.uk/english-usage/ice/index.htm
• VLC, Hong Kong [English, French, Bilingual, Chinese, Japanese]: http://www.edict.com.hk/concordance/
Corpus Tools, and more
• Sketch Engine: http://www.sketchengine.co.uk/
• AntConc: http://www.antlab.sci.waseda.ac.jp/software.html
• WordSmith Tools: http://www.lexically.net/wordsmith/
Websites with general information about corpora and corpus
linguistics:
• http://tiny.cc/corpora
• http://www.corpus-linguistics.de/