Introduction to corpus linguistics


BTANT 129 w5

Corpus

• The old-school concept
– A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse (The Oxford Companion to the English Language)
• The modern view
– A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair: Corpus, Concordance, Collocation, OUP)

Corpus vs. archive

• Text archive
– a collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
– texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)


Short history

A brief mention of just a select few:
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2,000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– British replica of Brown

Cobuild

• Major corpus initiative by Collins and the University of Birmingham (John Sinclair)
• 1991: 20 m words
• -> Bank of English, currently 450 m words
• http://www.cobuild.collins.co.uk


British National Corpus

• 100 m words, carefully selected
• 10% spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• texts of 40,000-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/


International Corpus of English

• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 written, 200 spoken) of 2,000 words each
• time span: 1990-1996
• ICE-GB available in a demo version
• syntactic annotation, graphical tool: ICECUP


Corpus processing: tokenization

• Preprocessing
– Tokenization: segmenting the text into
• sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
• words (problem: multi-word units)
– Normalization
• restoring clitics and abbreviations ("can't", "I've")
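The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the abbreviation list and the contraction table are tiny, hypothetical stand-ins for the much larger resources a real tokenizer would need, and the word pattern only covers ASCII letters.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration only).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}
CONTRACTIONS = {
    "can't": ["can", "not"],
    "i've": ["I", "have"],
    "don't": ["do", "not"],
}

def split_sentences(text):
    """Close a sentence after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Split into word and punctuation tokens, expanding known contractions."""
    tokens = []
    for raw in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", sentence):
        tokens.extend(CONTRACTIONS.get(raw.lower(), [raw]))
    return tokens

sentences = split_sentences("Dr. Smith can't come. I've told him!")
words = tokenize(sentences[1])
```

Here the full stop after "Dr." is not treated as a sentence boundary because the token is in the abbreviation list. Multi-word units are not handled at all in this sketch; they would need a further lookup step.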

Corpus processing: tagging

• Tagging
– labelling every word with its part-of-speech (POS) category
– Problem: ambiguity
• out of context, words can belong to different parts of speech or have different analyses within the same POS
– set N vs. set V
– Hungarian bánt: past tense of bánik 'treat' (VBD) or present tense of bánt 'hurt' (VBZ)
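The ambiguity problem can be made concrete with a lexicon lookup. The sketch below assumes a hypothetical mini-lexicon with coarse tags (N, V, DET, PRON) invented for illustration; it simply lists every tag a word form can bear out of context.

```python
# Hypothetical mini-lexicon: each word form maps to every tag it can bear.
LEXICON = {
    "they": {"PRON"},
    "the": {"DET"},
    "set": {"N", "V"},
    "table": {"N", "V"},
}

def candidate_tags(tokens, lexicon=LEXICON, unknown="UNK"):
    """Pair each token with the sorted list of tags the lexicon allows for it."""
    return [(t, sorted(lexicon.get(t.lower(), {unknown}))) for t in tokens]

def ambiguous(tokens, lexicon=LEXICON):
    """Return the tokens that have more than one candidate tag."""
    return [t for t, tags in candidate_tags(tokens, lexicon) if len(tags) > 1]

analysis = candidate_tags(["They", "set", "the", "table"])
```

A lookup like this only enumerates the possible analyses; choosing among them in context is a separate step.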

Corpus processing: disambiguation

• Disambiguation
– determining the correct analysis in context
• Two approaches (both need a manually corrected training corpus):
– statistical
• Hidden Markov model
• calculating probability within a span of usually one or two words
• success rate can be around 98%
– rule-based
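As a rough illustration of the statistical approach, here is a minimal Viterbi decoder for a bigram Hidden Markov model, that is, one that looks at a span of one preceding tag. All probabilities below are invented for the example; in practice they would be estimated from the manually corrected training corpus. The `floor` value for unseen events is likewise an assumption, not a recommended smoothing method.

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a bigram HMM."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{
        t: (start_p.get(t, floor) * emit_p[t].get(tokens[0], floor), None)
        for t in tags
    }]
    for word in tokens[1:]:
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximizes path probability * transition.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(t, floor), p) for p in tags
            )
            column[t] = (prob * emit_p[t].get(word, floor), prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy parameters (invented, not estimated from any corpus).
TAGS = ["PRON", "DET", "N", "V"]
START_P = {"PRON": 0.5, "DET": 0.4, "N": 0.05, "V": 0.05}
TRANS_P = {
    "PRON": {"V": 0.9, "N": 0.05, "DET": 0.05},
    "V": {"DET": 0.7, "N": 0.2, "PRON": 0.1},
    "DET": {"N": 0.95, "V": 0.05},
    "N": {"V": 0.6, "N": 0.3, "DET": 0.1},
}
EMIT_P = {
    "PRON": {"they": 1.0},
    "DET": {"the": 1.0},
    "N": {"set": 0.5, "table": 0.5},
    "V": {"set": 0.7, "table": 0.3},
}

tagged = viterbi(["they", "set", "the", "table"], TAGS, START_P, TRANS_P, EMIT_P)
```

With these toy numbers, "set" after a pronoun comes out as V and "table" after a determiner comes out as N, even though both words are N/V-ambiguous out of context.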

Syntactic annotation

• Difficult to do on such a scale
• Shallow parsing
• Treebank:
– a collection of syntactically analyzed sentences
– Penn Treebank: http://www.cis.upenn.edu/~treebank/
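Treebanks commonly store each analyzed sentence as a labelled bracketing. The sketch below reads one such bracketed string into a nested tuple and recovers the words; the example sentence and its analysis are made up for illustration and are not taken from any treebank.

```python
def parse_brackets(text):
    """Read a labelled bracketing such as '(S (NP ...) (VP ...))' into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

# Invented example sentence with a hand-written analysis.
tree = parse_brackets("(S (NP (PRP They)) (VP (VBD set) (NP (DT the) (NN table))))")
words = leaves(tree)
```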

Recent trends

• Word sense disambiguation (SENSEVAL)
– http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
– http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html

• Semantic Web
– making information on the web understandable for machines
– a vision requiring a huge effort; not clear whether it is feasible at all

Representative sample?

• A corpus of any size is inevitably a sample
• Of what?

• Two approaches
– sampling speakers: demographic sampling
– sampling their output: text-type sampling

The notion of representativeness

• Sample vs. population
• The sample should be proportional to the population for a given feature
• Example for demographic sampling: if we know from census figures that 48% of the people living in Budapest are male, we should compile our sample so that 48% of the informants are male
-> our sample is representative of Budapest residents for gender
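Proportional sampling for a single feature can be sketched as follows. The population records and the `gender` field are hypothetical, and the quotas are simply rounded, so for awkward proportions the total may differ slightly from the requested sample size.

```python
import random

def proportional_sample(population, feature, sample_size, seed=0):
    """Draw a sample whose distribution over `feature` mirrors the population's."""
    rng = random.Random(seed)
    # Group the population into strata by the value of the chosen feature.
    strata = {}
    for person in population:
        strata.setdefault(person[feature], []).append(person)
    sample = []
    for members in strata.values():
        # Each stratum contributes in proportion to its share of the population.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample

# Hypothetical population of 1,000 residents, 48% of them male.
population = (
    [{"id": i, "gender": "male"} for i in range(480)]
    + [{"id": i, "gender": "female"} for i in range(480, 1000)]
)
informants = proportional_sample(population, "gender", 100)
males = sum(1 for p in informants if p["gender"] == "male")
```

The resulting sample is proportional for gender only; it says nothing about whether it is proportional for any other feature.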

Trouble with representativeness

• What should be the units of sampling?

• Registers, text types, genres etc.

• But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement

Approaches to Representativeness

• Douglas Biber:
– rejects the notion of proportional sampling
– the sample should be as varied as possible
– representativeness is measured in terms of the variety of text types included in the sample

The Web as a corpus?

• Pros:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
• Cons:
– lots of rubbish and irrelevant data
– difficult to extract hits
– no language analysis
– only string queries, which are crude

One quick example

• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites
