Using Corpora For Language Research

Introduction to Corpora and Corpus Linguistics

COGS 523-Lecture 1 General Introduction 23.04.2020

COGS 523 - Bilge Say 1

Related Readings

Course Pack:
• Meyer (2002). Corpus Analysis and Linguistic Theory. Ch 1
• Abney (1996). Statistical Methods and Linguistics

Extra Material (entirely optional; part of the presentation draws on this material):
• McEnery and Wilson (2001) Ch 1
• McEnery et al. (2006) A1 and B2
• Tognini-Bonelli (2001). Corpus Linguistics at Work. Ch 3
• Corpora Discussion List Archives: Corpora: Chomsky/Harris Discussion, April 2001
• Borsley & Ingham vs Stubbs Discussion. Lingua 112 (2002)
• Schönefeld (1999). Corpus Linguistics and Cognitivism. International Journal of Corpus Linguistics 4(1)

What is a Corpus?

Turkish: Derlem (alt. Bütünce)

Text/Speech/Video + Annotation

Digital media

Written/Spoken Language

Design Criteria

Questions of the Week

Is working with corpora a methodology within linguistics or a distinctive subfield (corpus linguistics)?

What potential is there for empirical analysis of corpora to contribute to linguistic theory?

What are the dangers involved in corpus-based linguistics? How can these dangers be reduced?

What is a Corpus, again?

A body of written text or transcribed speech which can serve as a basis for linguistic analysis or description, designed or required for a particular “representative” function.

An electronic collection of texts in a uniform representation

Corpus vs text archive vs database

Sinclair’s definition

A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Should a Corpus Necessarily Be

Large?

Authentic?

Compiled for linguistic analysis?

Saturated in terms of lexical growth?

Representative?

Machine readable?
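
The question about saturation in terms of lexical growth can be made concrete with a vocabulary growth curve: count how many distinct word types have been seen after each additional stretch of tokens, and check whether the curve flattens. The following is only a minimal sketch under stated assumptions: the function name `vocabulary_growth`, the whitespace tokenization, and the toy text are all hypothetical illustrations, not part of the lecture.

```python
from typing import List, Tuple


def vocabulary_growth(tokens: List[str], step: int = 5) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types) pairs, sampled every `step` tokens.

    The final token position is always included so the curve covers the
    whole text.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Hypothetical toy "corpus"; a real study would read tokenized corpus files.
toy_text = (
    "the cat sat on the mat the dog sat on the rug "
    "the cat saw the dog and the dog saw the cat"
)
tokens = toy_text.split()
curve = vocabulary_growth(tokens, step=6)

for n_tokens, n_types in curve:
    print(n_tokens, n_types)
```

A curve that stops rising as more text is added suggests that, for this (closed) variety of language, further sampling brings few new word types. On open-ended general language the curve typically keeps rising, which is one reason saturation is a debatable requirement.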

A History of Corpora

• Pre-computer era (before the 1960s)
• Transition era (1960s to the beginning of the 1990s)
• Maturation era (1990s onwards)

What did technology bring?

Increased accuracy, speed, accountability, replicability, and large volumes of better-annotated data.

Linguistics

Methods: Introspection, Experimental Methods, Formal Linguistic Analysis, Computational Modeling, Corpus-Based Methods?

Subfields: Phonology, Morphology, Lexicon, Syntax, Semantics, Discourse, Pragmatics, Computational Linguistics, Psycholinguistics, Sociolinguistics, Historical Linguistics, Applied Linguistics, Corpus Linguistics?

Corpus Linguistics

The term emerged in the 1980s, although the use of corpora has a long history.

Modern perspectives contain a number of opposing positions.

Linguistic Subdisciplines with a Tradition of Using Corpora

• Historical Linguistics
• Phonetics
• Language Acquisition
• Statistical Natural Language Processing/Language Engineering/Computational Linguistics

Corpus Linguistics: a Methodology, Theory, or Subfield of Linguistics?

• Rationalism vs Empiricism
• Formalists vs Functionalists
• Competence vs Performance
• Core vs Periphery
• Applied Linguistics vs Theoretical Linguistics
• Corpus-Based vs Corpus-Driven Approaches (Tognini-Bonelli)

False Assumptions

• All corpus linguists are descriptivists, interested only in counting and categorizing occurrences in a corpus, and all generative grammarians are theoreticians unconcerned with the data on which their theories are based.
• The complexity of linguistic structure is of no interest to corpus linguists.

(Meyer, 2002)

Evaluating Linguistic Theories

• Observational vs explanatory vs descriptive adequacy
• Falsifiability, Completeness, Simplicity, Objectivity, etc.

Chomskyan quotes:

• “The corpus could never be a useful tool for the linguist, as the linguist must seek to model language.”
• “Corpus Linguistics does not exist.”
• “Any natural corpus will be skewed and incomplete. Some sentences won’t occur, because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.”

Indeed, Chomsky contributed to the modern view of corpus linguistics, both by improving language technology and by helping to overcome, by way of formal language theory, the structuralist-behaviourist view of language as something that could be enumerated.

Why Do Statistics Help? (Abney)

• Language Acquisition
• Language Change
• Language Variation
• Grammaticality, Ambiguity, Computation
• Modularity is not in isolation

Grammaticality Judgements

*He shines Tony books.

He gives Tony books.

If intuitions suffice, why bother with corpus analysis?

• Artificial data is artificial and creates another kind of skewedness.
• “Yes, I could say that, but I never would”: gradedness in grammaticality judgements
• Intuitions are perceptions...

Alternative Views

Leech (1992): “Computer Corpus Linguistics” is a new research enterprise, a new philosophical approach that
• Concentrates on linguistic performance
• Leads to a more empirical view of scientific inquiry
• Exploits qualitative as well as quantitative methodology to produce a quantitatively oriented language model, such as Bayesian language models

Not everyone agrees!

Further Remarks

• Corpus Linguistics contributed to blurring the distinction between grammar and lexicon.
• Sinclair’s open choice vs idiom principle
• Cognitive linguists can accommodate data and facts revealed by corpus linguistic analysis

Corpus Linguistics vs Corpus-Based Linguistics

• There is no inherent incompatibility between theoretical generative linguistics and corpus linguistics. (Seegmiller)
• Generative and corpus linguistics are two approaches to the same problem, and must meet somewhere. Generative theories should match or be backed up by real data. (Schiffrin)
• What is possible and what is probable? Corpus linguistics offers a way of describing things that we *do* regularly and frequently with greater confidence and reliability than by using introspection alone. (Krishnamurthy)

Corpus-Based Linguistics vs Corpus-Driven Linguistics

Corpus-based approaches:
• Take existing theory as a starting point, then correct and revise the theory in light of corpus evidence.

Corpus-driven approaches:
• Favour very large, full-text corpora, with the idea of cumulative representativeness, and no annotation, so as to free oneself of preconceived theories.
• E.g. collocations rather than colligations
• Without a corpus, there is no meaningful work to be done (attributed to Sinclair and Stubbs, but see their own writings)
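
The preference for collocations over colligations can be illustrated with raw, unannotated text: a collocation is found simply by counting which words occur next to each other more often than chance would predict, with no grammatical tags involved. The sketch below is an assumption-laden illustration, not the method of any author cited here. It scores adjacent word pairs by pointwise mutual information (PMI), one common association measure among several; the function name `bigram_pmi`, the `min_count` threshold, and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), using simple relative
    frequencies. Pairs seen fewer than `min_count` times are skipped,
    because PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy text; real collocation work needs a very large corpus.
toy = (
    "strong tea and strong coffee but powerful computers "
    "and strong tea with powerful computers"
).split()
scores = bigram_pmi(toy, min_count=2)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On a corpus this small the scores mean little; the point is only that the procedure needs nothing but word forms and counts, which is why corpus-driven work can proceed without annotation.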

Reconciling Views

  Corpora are excellent resources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of linguistic hypotheses (Meyer, 2002).

They can provide additional linguistic perspectives which improve our knowledge of language and our ability to use it (a weaker position).

The Rise of Corpora

Years        No. of corpus-based studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1986-1991    320

(McEnery and Wilson, 2001)
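
How fast the number of studies grew can be read off by dividing each period's count by the count for the period before it. The short sketch below does only that arithmetic on the figures given above; the variable names are illustrative.

```python
# Counts copied from the table above (McEnery and Wilson, 2001).
periods = ["To 1965", "1966-1970", "1971-1975", "1976-1980", "1981-1985", "1986-1991"]
studies = [10, 20, 30, 80, 160, 320]

# Ratio of each period's count to the previous period's count.
growth = [later / earlier for earlier, later in zip(studies, studies[1:])]

for period, ratio in zip(periods[1:], growth):
    print(f"{period}: x{ratio:.2f} relative to the previous period")
```

The ratios should be read with care, since the periods are not equally long: the first is open-ended and the last covers six years rather than five.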

Range of Activities in Corpus-based Linguistics

1. Corpus Design, Compilation and Annotation
2. Developing Tools for (1) or Analysis of Corpora
3. Linguistic Studies or Applications using corpora developed in (1) using tools developed in (2)

Types of Corpora

• General (typically balanced and made available for general linguistic use) vs Specialized (dialect corpora, language acquisition corpora, learner corpora)
• Core Corpora
• Written vs Spoken Corpora
• Full-text vs Sample-text Corpora

More Typology

• Finite-size (Static) vs Dynamic/Monitor Corpora
• Monolingual vs Multilingual Corpora (Parallel Corpora, Comparable Corpora)
• Rather Graded Distinctions:
  • Raw vs Annotated
  • Balanced vs Pyramidal vs Opportunistic Corpora
  • Synchronic vs Diachronic

Some Examples of Corpora

• Pre-electronic corpora
  • Biblical and Literary Studies
  • Lexicographical
  • Dialect Studies
  • Language Education
  • Grammatical
    • Quirk’s Survey of English Usage Corpus (later computerized) had 200 samples of 5000 words each, half spoken, half written, tagged manually with 65 grammatical features.

More Examples

• Major Electronic Corpora
  • Brown Corpus (Francis and Kučera, 1965): Brown University Standard Corpus of Present-Day American English. 1 million words, 1961-64, 500 samples of 2000 words each
  • Lancaster-Oslo-Bergen Corpus (LOB Corpus): a comparable corpus of British English (fewer westerns exist, though!)
  • Frown (Freiburg-Brown) and FLOB (Freiburg-LOB): comparable corpora of the 1990s

Major Electronic Corpora

• Also modeled after Brown:
  • Kolhapur Corpus of Indian English
  • Wellington Corpus of New Zealand English...
• London-Lund Corpus (1975): 100 samples of 5000 words each of spoken data; the major spoken corpus until the mid 1990s; predominantly highly educated adult speakers
• Lancaster/IBM Spoken English Corpus (SEC): better balance, with 11 categories and detailed prosodic annotation
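
Several of the sample-text corpora described so far state their design as a number of samples of a fixed length, so their nominal size follows by multiplication. The sketch below just checks that arithmetic for the three designs mentioned in these slides; the resulting totals are nominal design sizes, and the dictionary name `designs` is illustrative.

```python
# (number of samples, words per sample) as stated on the slides.
designs = {
    "Survey of English Usage": (200, 5000),
    "Brown Corpus": (500, 2000),
    "London-Lund Corpus": (100, 5000),
}

# Nominal corpus size = samples x words per sample.
sizes = {name: n_samples * n_words for name, (n_samples, n_words) in designs.items()}

for name, size in sizes.items():
    print(f"{name}: {size:,} words")
```

The same total can come from quite different designs: many short samples spread coverage across more texts, while fewer long samples preserve more of each text's internal structure.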

Major Electronic Corpora

• Longman Dictionary of Contemporary English (LDOCE)
• COBUILD Project, Bank of English: 524 million words as of 2004
• International Corpus of English
• International Corpus of Learner English: 2 million words, 500-word essays, different English backgrounds
• Longman Learner’s Corpus, HKUST Learner’s Corpus
• CHILDES: Child Language Data Exchange System
• European Corpus Initiative (ECI): 93 million words
• Many corpora are available from LDC and ELDA/ELRA.

Major Natural Language Processing Corpora

• Penn Treebank (1993): 4.9 million words, tagged and parsed, not balanced (optional paper in course pack)
• TIPSTER corpus: AP Newswire and Wall Street Journal; mainly used for Information Retrieval
• More variety is provided by national corpora and dependency treebanks

National Corpora

• British National Corpus (BNC)
  • 100 million words, 90% written, 10% spoken
  • BNC Baby: 2 million word sampler
  • SARA and Xaira: its own corpus query tools
  • Wholly tagged by the CLAWS tagger
• American National Corpus (ANC)
  • In progress, preliminary releases available
• Czech National Corpus (optional paper in course pack)
  • 12 full-time persons working for 5 years in a specialized institute
  • 100 million words
  • Partially tagged and parsed in the Prague Dependency School tradition
• See METU Online links

Lecture 2

Corpus Design Issues

Readings:
• Tognini-Bonelli (2001). Corpus Issues. Ch 3
• McEnery et al. (2006). Units A7-A9, B1 (all appear to be one article in the course pack)
• Meyer (2002). Planning the Construction of a Corpus. Ch 2