Transcript Introduction to corpus linguistics
BTANT 129 w5
Introduction to corpus linguistics
Corpus
• The old-school concept
  – A collection of texts, especially if complete and self-contained: "the corpus of Anglo-Saxon verse" (The Oxford Companion to the English Language)
• The modern view
  – A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP)
Corpus vs. archive
• Text archive
  – Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
  – Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)
Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
  – 1 m words
  – 15 genres
  – 500 samples of 2000 words each
  – Area: US
  – Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
  – GB replica of Brown
Cobuild
• Major corpus initiative by Collins and Birmingham University (John Sinclair)
• 1991: 20 m words
• -> Bank of English, currently 450 m words
• http://www.cobuild.collins.co.uk
British National Corpus
• 100 m words, careful selection
• 10% spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• texts of 40,000-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/
International Corpus of English
• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 spoken, 200 written) of 2000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP
Corpus processing: tokenization
• Preprocessing
  – Tokenization: segmenting the text into
    • sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
    • words (multi-word units are a problem)
  – Normalization
    • restoring clitics, abbreviations ("can't", "I've")
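The preprocessing steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the abbreviation list, the clitic table and the function names are all hypothetical, and multi-word units are not handled at all.

```python
import re

# Hypothetical abbreviation list: a full stop after these does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Hypothetical clitic table used for normalization.
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Close a sentence at a token ending in . ! or ?, unless it is a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into words and single punctuation marks, expanding listed clitics."""
    tokens = []
    # A word is a run of letters with an optional apostrophe part; anything
    # else that is not whitespace becomes a one-character token.
    for raw in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence):
        tokens.extend(CLITICS.get(raw.lower(), [raw]))
    return tokens


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith can't come. I've told him!"):
        print(tokenize(sentence))
```

The full stop in "Dr." shows why sentence delimiters are tricky: without the abbreviation list, the splitter would cut the first sentence after the title.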
Corpus processing: tagging
• Tagging
  – labelling every word with its part-of-speech (POS) category
  – Problem: ambiguity
    • out of context, words can belong to different parts of speech or have different analyses within the same POS
      – set N vs. set V
      – Hungarian bánt: past tense of 'bánik' (VBD) or present tense of 'bánt' (VBZ)
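The ambiguity problem can be made concrete with a toy lexicon lookup. The lexicon below is invented for illustration (tag names follow the slide's N and V; DET, PRON and UNK are assumptions), and the function names are hypothetical.

```python
# Hypothetical toy lexicon: each word form maps to every tag it can carry.
LEXICON = {
    "the": {"DET"},
    "they": {"PRON"},
    "set": {"N", "V"},
    "table": {"N", "V"},
}


def candidate_tags(word):
    """Return all tags the lexicon allows for a word, or UNK if it is not listed."""
    return sorted(LEXICON.get(word.lower(), {"UNK"}))


def ambiguous_words(words):
    """Return the words that have more than one possible tag out of context."""
    return [w for w in words if len(candidate_tags(w)) > 1]


if __name__ == "__main__":
    sentence = ["They", "set", "the", "table"]
    for word in sentence:
        print(word, candidate_tags(word))
    print("ambiguous:", ambiguous_words(sentence))
```

A lookup like this only lists the candidate analyses; choosing among them needs context.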
Corpus processing: disambiguation
• Disambiguation
  – defining the correct analysis in context
• Two approaches (both need a manually corrected training corpus):
  – statistical
    • Hidden Markov model
    • calculating probability within a span of usually one or two words
    • rate of success can be around 98%
  – rule-based
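As a rough sketch of the statistical approach, the following Python code runs Viterbi decoding over a bigram Hidden Markov model, where each tag depends only on the previous tag (a span of one word). Every probability here is invented for illustration; a real tagger would estimate them from the manually corrected training corpus. The tag set and all names are assumptions.

```python
import math

# All probabilities below are invented for illustration, not estimated from data.
TAGS = ["DET", "N", "V", "PRON"]
START = {"DET": 0.4, "N": 0.1, "V": 0.1, "PRON": 0.4}
TRANS = {  # TRANS[previous_tag][tag]
    "DET": {"DET": 0.01, "N": 0.9, "V": 0.08, "PRON": 0.01},
    "N": {"DET": 0.1, "N": 0.2, "V": 0.6, "PRON": 0.1},
    "V": {"DET": 0.5, "N": 0.2, "V": 0.1, "PRON": 0.2},
    "PRON": {"DET": 0.05, "N": 0.05, "V": 0.85, "PRON": 0.05},
}
EMIT = {  # EMIT[tag][word]
    "DET": {"the": 1.0},
    "N": {"set": 0.5, "table": 0.5},
    "V": {"set": 0.7, "table": 0.3},
    "PRON": {"they": 1.0},
}
FLOOR = 1e-12  # stand-in probability for unseen events, avoids log(0)


def logp(p):
    return math.log(max(p, FLOOR))


def viterbi(words):
    """Return the most probable tag sequence for a list of lower-case words."""
    if not words:
        return []
    # best[i][t]: log probability of the best path ending in tag t at word i
    best = [{t: logp(START[t]) + logp(EMIT[t].get(words[0], 0.0)) for t in TAGS}]
    back = []  # back[i][t]: best previous tag for tag t at word i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + logp(TRANS[p][t]))
            scores[t] = best[-1][prev] + logp(TRANS[prev][t]) + logp(EMIT[t].get(word, 0.0))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["they", "set", "the", "table"]))
    print(viterbi(["the", "set"]))
```

With these toy numbers, "set" comes out as a verb after a pronoun and as a noun after a determiner, which is the kind of contextual decision the slide describes.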
Syntactic annotation
• Difficult to do on such a scale
• Shallow parsing
• Treebank:
  – collection of syntactically analyzed sentences
  – Penn Treebank: http://www.cis.upenn.edu/~treebank/
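To give a feel for what a treebank entry looks like, here is a small Python sketch that reads one sentence in Penn-style bracketed notation and pulls out its noun phrases. The example sentence and its analysis are made up for illustration, and the function names are hypothetical; real treebank files carry extra markup that this reader ignores.

```python
def parse_bracketed(text):
    """Turn '(S (NP (PRP They)) ...)' into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(node, wanted):
    """Return the text of every constituent whose label equals `wanted`."""
    label, children = node
    found = [" ".join(leaves(node))] if label == wanted else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


if __name__ == "__main__":
    tree = parse_bracketed("(S (NP (PRP They)) (VP (VBD set) (NP (DT the) (NN table))))")
    print(leaves(tree))
    print(phrases(tree, "NP"))
```

Listing only the noun phrases, without the full tree, is roughly the kind of output shallow parsing aims for.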
Recent trends
• Word sense disambiguation (SENSEVAL)
  – http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
  – http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
  – making information on the web understandable for machines
  – a vision requiring a huge effort; not clear whether feasible at all
Representative sample?
• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
  – sampling speakers: demographic sampling
  – sampling their output: text type sample
The notion of representativeness
• Sample vs. population
• The sample should be proportional to the population for a given feature
  – example for demographic sampling: if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male
  – -> our sample is then representative of Budapest residents for gender
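The arithmetic behind proportional sampling can be sketched as follows. The 48% figure comes from the slide; the 52% complement, the sample size of 200 and the function name are assumptions made for the example.

```python
def proportional_quotas(percentages, sample_size):
    """Split sample_size across groups in proportion to whole-number percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must add up to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {g: pct * sample_size // 100 for g, pct in percentages.items()}
    remainders = {g: pct * sample_size % 100 for g, pct in percentages.items()}
    # Hand any informants lost to rounding down to the largest remainders.
    leftover = sample_size - sum(counts.values())
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[group] += 1
    return counts


if __name__ == "__main__":
    # 48% male is the slide's census figure; 52% female is the assumed complement.
    print(proportional_quotas({"male": 48, "female": 52}, 200))
```

For a hypothetical sample of 200 informants this gives 96 male and 104 female, so the sample matches the population for gender and for nothing else.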
Trouble with representativeness
• What should be the units of sampling?
• Registers, text types, genres etc.
• But there is no independent evidence about their ratio in the totality of language output -> representativeness is an ideal, but impossible to implement
Approaches to Representativeness
• Douglas Biber:
  – Rejects the notion of proportional sampling
  – Sample should be as varied as possible
  – Representativeness is measured in terms of the wide variety of text types included in the sample
The Web as a corpus?
• Pros:
  – immense database
  – dynamically growing
  – ideal 'quick and dirty' method
• Cons:
  – lots of rubbish, irrelevant data
  – difficult to extract hits
  – no language analysis
  – only string query, which is crude
One quick example
• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites