Corpus 01 - Xiamen University

Corpus 01

Introduction and Historical Review

Corpus Linguistics

• Linguists need evidence for theories. Evidence can come from intuition or introspection, experimentation or elicitation, or observation of spoken or written texts • Focus on performance rather than on competence, and on moving from observation to theory rather than from theory to observation • Scope: text as the domain of study and as the source of evidence for linguistic description and argumentation • Methodologies: quantification of linguistic description

Difference between corpus linguistics and other linguistics

• Richness of the evidence • Confidence in generalizability • Validity and reliability

Corpus Linguistic Activities

1. Design and compilation of corpora: collection of texts, preparation and storage for later analysis
2. Development of tools for the analysis of corpora: computational linguistics
3. Use of computerized corpora to describe the lexicon and grammar of languages: the probabilistic aspect of corpus-based description, studying how often a particular form is used
4. Applications: language learning and teaching, natural language processing
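The probabilistic side of activity 3, counting how often a particular form is used, can be sketched in a few lines. This is only an illustration: the sample sentence and the function name `word_frequencies` are invented here, and the regular-expression tokenizer is a simplifying assumption rather than the method of any corpus mentioned in these slides.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word form occurs in `text`."""
    # Simplifying assumption: a word is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented toy text, standing in for a real corpus.
sample_text = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample_text)
total = sum(freqs.values())

# Print the three most frequent forms with their relative frequency.
for form, count in freqs.most_common(3):
    print(form, count, round(count / total, 3))
```

Relative frequencies (count divided by corpus size) are what make counts comparable across corpora of different sizes.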

Function of corpus linguistics

• Not that it is a faster way of describing language, but that it may reveal facts we might never have thought of seeking, e.g. Altenberg’s study of amplifier collocation in English (1991a): frequent maximizers such as quite tend to collocate with non-scalar words (quite obviously), while absolutely has a greater tendency than other maximizers to collocate with negatives (absolutely not) • Statistical distribution of linguistic items
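A collocation study of this kind starts from a simple count of which words occur next to a given node word. The sketch below is a minimal illustration under stated assumptions: the toy text is invented, `right_collocates` is a hypothetical helper name, and it does not reproduce Altenberg's actual procedure or data.

```python
import re
from collections import Counter


def right_collocates(text, node):
    """Count the words that immediately follow `node` in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pair each token with its successor. Sentence boundaries are ignored,
    # which is a simplification a real study would want to avoid.
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)


# Invented toy text; a real study would use a large corpus.
toy_text = (
    "Absolutely not. That is quite obviously wrong. "
    "It is absolutely not true, and quite clearly so. Absolutely right."
)

print(right_collocates(toy_text, "absolutely").most_common())
print(right_collocates(toy_text, "quite").most_common())
```

With a real corpus, the raw counts would normally be compared against the overall frequency of each collocate before any tendency is claimed.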

Topics of Corpus linguistics

• Annotating corpora • Tagging of parts of speech and the senses of polysemous word forms • Improved automatic parsing • Identification of collocations • Phraseological units and discourse structure • Text categorization • Research methodology • Application in lexicography, syntactic description, translation, speech and handwriting recognition, language teaching

Pre-electronic corpora

• Biblical and literary studies • Lexicography • Dialect • Language education • Grammatical

Biblical and literary studies

• Alexander Cruden (1736): Concordance of the Authorized Version of the Bible (Fig. 2.1)

• Similar works on Shakespeare

Lexicography

• Samuel Johnson (1755): A Dictionary of the English Language. Corpus of sentences from writers of the first reputation.

• James Murray: OED (1928), corpus of the canon of literary written English.

• Noah Webster (1828): An American Dictionary of the English Language

Dialect

• Wright (1898-1905): The English Dialect Dictionary • Ellis (1889): The Existing Phonology of English Dialects

Language Education

• Thorndike (1921): word frequency list based on a corpus of 4.5 million words from 41 different sources

Grammatical

• Jespersen (1909-49) • Kruisinga (1931-32) • Poutsma (1926-29) • Fries (1940): American English Grammar.

Corpus of letters to the US Government by persons of different educational and social backgrounds.

Describes social class differences in usage.

• Fries, The Structure of English (1952): 250,000-word corpus of recorded telephone conversations.

• Randolph Quirk: Survey of English Usage (1968).

• 200 samples × 5,000 words = 1,000,000-word corpus • Representative of spoken and written English, to describe the grammar and usage of educated adult native speakers of British English.

Types of electronic corpora

General corpora (core corpora)

• A text base for linguistic analysis to seek answers to particular questions about vocabulary, grammar, discourse structure.

• Balanced, containing texts from different genres and domains, in speaking and writing, private and public

Specialized corpora

• Designed with particular projects in mind, e.g. a corpus for the compilation of a modern dictionary • Carterette & Jones (1974): child language development • Zhu (1989): English used in petroleum geology exploration, drilling and refining.

• People disagreeing with each other in radio interviews • Teachers’ directives in high school classrooms

• Leech (1992): training corpora and test corpora for language models and language processing • Dialect corpora • Regional corpora • Non-standard corpora • Learners’ corpora

• Full text corpus: complete texts • Stylistic or discourse studies: 200 word samples may not be able to capture the internal structural characteristics of full texts.

• Raw corpus: tagging, parsing, concordance

Major electronic corpora: first generation

• Brown Corpus (1961): Brown University Standard Corpus of Present-Day American English • Significance: 1. first computer corpus 2. compiled in the face of massive indifference • Linguistic research is not to record but to describe, while corpora are statistically based, with a probabilistic model of competence derived from linguistic performance.

• Structure (Table 2.2 p24)

• Features: widely selected categories in written English: both formal and informal written English are taken into account • Selected by a method that makes it reasonably representative of current printed American English • Established coding conventions: formulae, quotations, punctuation.

• Number of characters per line: 70 • Abbreviations • Grammatically tagged: each word assigned to one of over 80 tags.

• Lancaster-Oslo/Bergen Corpus (1970): LOB Corpus • A British counterpart to the Brown Corpus • 500 texts of 2,000 words each, published in 1961 • Same categories as the Brown Corpus • Differences from the Brown Corpus • Coding: sentence initial markers

• Abbreviations • Partly analyzed version • Versions for more platforms.

• More grammatical tags • Key word in context (KWIC) concordance
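A key word in context (KWIC) concordance lists every occurrence of a search word with a window of context on each side. The following is a minimal sketch, assuming a simple word-character tokenizer and a context width counted in words. The function name `kwic` and the sample sentence are invented for illustration and are not taken from the LOB tools.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` words on each side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text.
sample_text = "The corpus was compiled in 1961. A corpus of this size is small."
concordance = kwic(sample_text, "corpus")

# Right-align the left context so the keywords line up in a column.
for left, kw, right in concordance:
    print(f"{left:>20}  {kw}  {right}")
```

Aligning the keyword in a fixed column is what lets recurring patterns to its left and right be read off at a glance.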

• Other first generation corpora • Indian English: the Kolhapur Corpus of Indian English (1988) • Collected materials of 1978 • New Zealand: the Wellington Corpus of Written New Zealand English and the Australian Corpus of English (1986) • Collected texts in 1961

Problems

• 1. A size of one million words is prohibitive and yet too small.

• 2. Difficult to find interesting differences between regional varieties: differences are sometimes not in structure but in the frequency with which a structure is used • 3. Additional words: each sample ends at the first sentence ending after 2,000 words. Thus the size of the Brown Corpus is actually 1,014,312 words and that of LOB 1,006,825. In word counting, the LOB concordance size is 1,123,380
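The sampling rule in problem 3 explains why the corpora overshoot their nominal one million words: each sample runs on to the end of the sentence in which the 2,000-word mark falls. A minimal sketch of that rule follows, assuming the text is already tokenized with sentence-final punctuation as separate tokens. The names `take_sample` and `count_words` are hypothetical, and the toy example uses a target of 5 words instead of 2,000.

```python
SENTENCE_ENDERS = (".", "!", "?")


def count_words(tokens):
    """Count tokens that are not sentence-final punctuation."""
    return sum(1 for tok in tokens if tok not in SENTENCE_ENDERS)


def take_sample(tokens, target=2000):
    """Cut `tokens` at the first sentence end once `target` words are reached."""
    words = 0
    for i, tok in enumerate(tokens):
        if tok in SENTENCE_ENDERS:
            if words >= target:
                return tokens[: i + 1]
        else:
            words += 1
    # The text ran out before a sentence end was found past the target.
    return tokens


# Invented toy text with a small target so the overshoot is easy to see.
sample_tokens = "One two three . four five six seven . eight nine .".split()
sample = take_sample(sample_tokens, target=5)
print(count_words(sample), "words for a target of 5")
```

Summed over 500 samples, these small overshoots account for the extra words reported for Brown and LOB.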

London-Lund Corpus (LLC)

• Spoken part of the SEU, which was half written and half spoken • SEU original: 87 texts making up 435,000 words, plus 13 more texts. The total is 100 texts of 5,000 words each, or 500,000 words.

• Features: less detailed prosodic analysis

• Tone units • Onset • Location of nuclei • Direction of nuclear tones

• Boosters • Degree of pause • Degree of stress • Speaker identity • Simultaneous talk • Contextual comment • Incomprehensible words

Corpora for special purposes

• Algeo (1988): a corpus of 5 million words from the 18th century to the present for studying Briticisms in the English language.

• American Heritage Intermediate Corpus (1969) • 5.09 million words from 10,043 samples, each 500 words long, from publications widely read among American schoolchildren aged 7 to 15.

• Categories: reading, English and grammar, composition, literature, mathematics, social studies, spelling, science, music, art, religion, home economics, library fiction, non-fiction, reference and magazines.

• Words are not lemmatized • One of the first computer-based databases for lexicographical purposes.

Other special corpora

The Nijmegen Corpus

• Early 70s • Goal: grammatical description of British English • Size: 132,000 words

• Composition: 20,000-word extracts from each of 6 authors = 120,000 words • 12,000 words of transcribed sports commentary • Categories: written, mainly literary English; sports commentary • Sample span: 1962-1968 • Analysis: a large set of labeled trees or phrase markers.

TOSCA (Tools for Syntactic Corpus Analysis) Corpus

• Later than the Nijmegen Corpus • Size: 20,000 × 75 = 1,500,000 words

• Categories: various fiction and nonfiction genres in written British English • Span: 1976-1986 • Composition: 45 samples from 21 nonfiction genres ((auto)biography, history, literary criticism, politics, women’s studies, chemistry, economics, physics) • 30 samples from 9 fiction genres: horror, humor, love and romance, general fiction

Hong Kong University of Science and Technology

• Size: 1,000,000 words of computer science English • 2,000-word samples from 166 English language textbooks used in computer science courses in the early 90s • Goal: assist the teaching of English to computer science students.

• Jiao Tong University Corpus for English in Science and Technology (JTEST) • 1980s • 1,000,000 words from written English texts in the physical sciences, engineering and technology • Goal: facilitate lexical analysis of particular registers, e.g. counts of high frequency words

• Guangzhou Petroleum English Corpus (GPEC) • 411,000 words from 700 petroleum industry texts in written American and British English of the mid-1980s.

• goal: the same as JTEST

Second generation mega-corpora

• COBUILD • Collins Birmingham University International Language Database • 25% from spoken texts • Reflects broadly general rather than technical language

• current usage from 1960 on • naturally occurring texts • Prose included but not poetry • Contributions: commercial research and development project for dictionaries, grammars and language teaching courses.

Longman Corpus Network

• Three major corpora • LLELC: the Longman/Lancaster English Language Corpus • LSC: Longman Spoken Corpus • LCLE: Longman Corpus of Learners’ English

British National Corpus (BNC)

• 100 million words of contemporary spoken and written British English.

• Structure: Table 2.3 p.51

• Automatic word-class tagging with CLAWS

Issues in corpus design and compilation

• Static or dynamic • Representativeness and balance • Size • Written and spoken

• Extralinguistic variables: text origin, participants, medium, genre, style, factuality, topic, date of publication, authorship (age, gender, nationality), audience • Storage • Text capture: keyboarding, CD-ROM or electronic version, scanning (software, quality of printing) • Spoken text: transcribing (conventions for transcribing prosodic phenomena: ICE project) • Markup: marks for tagging (Standard Generalized Markup Language, SGML), p. 84

Organizations and professional associations

• Descriptive linguistics: the International Computer Archive of Modern English (ICAME) • ICAME CD-ROM: the Brown, LOB, Kolhapur, London-Lund and Helsinki corpora • Software: WordCruncher, TACT and Free Text Browser

• Bibliographic overview: Humanities Computing Yearbook • Computational linguistics: the Association for Computational Linguistics (ACL) • Literary studies: the Association for Computers and the Humanities (ACH) • Association for Literary and Linguistic Computing (ALLC)