Corpus Linguistics

PALA Summer School, Maribor, 2014

Introduction

We will look at:

• Basic concepts and terminology
• Sampling and representativeness
• Annotation and mark-up

Characteristics of corpus linguistics according to Biber, Conrad & Reppen (1998: 4)

• Uses a corpus
• Uses computers for analysis
• Empirical – analysing actual patterns of language use
• Depends on quantitative and qualitative analytical techniques

Methodology vs. theory

Two main views:

Methodologist

CL is a methodology for studying large amounts of language data using computer software

Neo-Firthian

CL is a sub-discipline of linguistics, concerned with explaining relationships between meaning and structure in language

Characteristics of a corpus according to McEnery and Wilson (2001)

• Machine-readable form
• Very large
• Representative sample
• (Standard reference)
• Often annotated

Machine readable form

• Nowadays, corpus = machine readable
• Corpora tend to sit on a computer
• Not always the case

Very large

• Corpora are usually very large: tens of thousands, hundreds of thousands, or millions of words.
• Usually a finite size
• Size is decided at the design stage; when that size is reached, data collection stops.
• Exception: a monitor corpus, which keeps growing
  – e.g. the COBUILD Corpus (Birmingham, UK)
  – used for dictionary compiling

A representative sample

• Corpora are so big that they can be a ‘representative sample’ of a language or a language variety
• Also depends a lot on the design of the corpus
• Consider sampling
  – which texts will be sampled
  – size of samples
  – number of samples
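The three sampling decisions above (which texts, size of samples, number of samples) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function name `sample_extracts`, the toy `texts` dictionary and the choice of simple random sampling are all hypothetical, not a method described in the slides.

```python
import random

def sample_extracts(texts, n_samples, sample_size, seed=0):
    """Pick n_samples texts at random and take one extract of
    sample_size words from each (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    chosen = rng.sample(sorted(texts), n_samples)  # which texts
    extracts = {}
    for name in chosen:
        words = texts[name].split()
        # Random start position, so extracts are not always text-initial.
        start = rng.randint(0, max(0, len(words) - sample_size))
        extracts[name] = words[start:start + sample_size]
    return extracts

# Toy stand-ins for real source texts (made up for this example).
texts = {
    "book": "b " * 50,
    "magazine": "m " * 50,
    "newspaper": "n " * 50,
    "website": "w " * 50,
}

extracts = sample_extracts(texts, n_samples=2, sample_size=10)
for name, words in extracts.items():
    print(name, len(words))
```

A real corpus design would usually stratify the sample (for example by genre or date) rather than draw texts purely at random.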

A representative sample

Written language

extracts from books, magazines, newspapers, websites…

Spoken language

transcripts of meetings, lectures, radio programs, everyday conversations…

Something more specific ...

A representative sample

Yorkshire English

• What time frame are you going to sample?
• Speech or writing or both?
• Source of language data?

(Standard reference)

• A corpus might be a standard reference or a ‘benchmark’ for a particular variety of language against which other texts or corpora can be compared

Annotation

• Just the words on their own = ‘raw text’
• Annotation = extra information about what is in the corpus
• Helps with the analysis of the data
• Annotation is also known as tagging or mark-up (generally)

Annotation

Information about the text:

• Where it came from
• Who produced it
• Genre
• Etc.

Example

The spoyle of Antwerpe <Author> George Gascoigne 1576 EEBO 2112
to the end ........

........

Annotation

Adding information to the body of the text:

• e.g. Gender of speaker
• e.g. Discourse presentation

Example

and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,

Annotation

• Annotation can be a manual process (takes ages)
• But some linguistic annotation can be done automatically
  – e.g. word meaning (semantic)
  – e.g. grammatical class of each word in the corpus (noun, verb, etc.)

Linguistic Annotation: examples

CLAWS

• Constituent Likelihood Automatic Word-tagging System
• Developed at Lancaster University
• 96-97% accurate
• Works out what part of speech each word is and assigns a tag from a list of tags (a tagset)

Linguistic Annotation: examples CLAWS

I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose, but even that was sweet in a way

Linguistic Annotation: examples CLAWS

I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 ,_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1
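Tagged output in this word_TAG layout is easy to process by program. The sketch below splits the start of the example sentence into (word, tag) pairs and counts the tags. The function name `parse_tagged` is hypothetical, and the code assumes only the underscore-separated format shown on the slide; it does not call CLAWS itself.

```python
from collections import Counter

# First part of the tagged sentence, copied from the slide.
tagged = ("I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ "
          "different_JJ from_II other_JJ boys_NN2")

def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs (hypothetical helper)."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore, so ',_,' gives (',', ',').
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts.most_common(3))
```

Once the tags are separated from the words, it becomes possible to search by grammatical class, for example to pull out every adjective (tag JJ) in a corpus.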

Characteristics of a corpus according to McEnery and Wilson (2001)

• Machine-readable
• Very large
• Representative sample
• (Standard reference)
• Annotation

A corpus

• …a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery & Wilson 2001: 32)

Why use a corpus?

• Allows linguists to access quantitative information about language, which can often be used to support qualitative analysis.
• Insights into language gained from corpus analysis are often generalisable in a way that insights gained from the qualitative analysis of small samples of data are not.
• Using corpus data forces us to acknowledge how language is really used (which is often different from how we think it is used).

Exploiting a corpus

Collocations

Collocation = relationship between words that tend to occur together
– Words that tend to occur near word X are the collocates of word X
– Based on frequencies
– Statistical measures
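The frequency step behind collocation can be sketched as counting the words that fall within a fixed window around word X. The function name `collocates`, the `span` parameter and the toy sentence are assumptions made for this illustration; the sketch gives raw counts only and leaves out the statistical measures a corpus tool would apply on top.

```python
from collections import Counter

def collocates(tokens, node, span=3):
    """Count words occurring within `span` tokens either side of `node`
    (hypothetical helper; raw frequencies only)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]    # words before the node
            right = tokens[i + 1:i + 1 + span]   # words after the node
            counts.update(left + right)
    return counts

# Made-up sentence, not corpus data.
tokens = ("the juvenile court heard the case of a "
          "juvenile offender in court").split()

result = collocates(tokens, "juvenile", span=3)
print(result.most_common(5))
```

With real data the counts would be compared against how often each word occurs in the corpus as a whole, so that very frequent words such as "the" do not dominate the list.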

Exploiting a corpus

Collocations

• Important in corpus linguistics
• The company a word keeps can give that word implicit associations or assumptions

Exploiting a corpus

Collocations

• Juvenile = young, youthful, a young person
  – Collocates: delinquency, delinquent, delinquents, offenders, diabetes, crime, court
• Juvenile has negative associations
• Semantic prosody

Exploiting a corpus

Collocations

– Near-synonyms often differ in terms of their collocations

Exploiting a corpus

Collocations

• Young
  – Collocates: mums-to-be, bloods, nubile, hopefuls, impressionable, up-and-coming
• Negative associations?

Exploiting a corpus

Keywords

A keyword is a word which occurs in a text or corpus more frequently than you would expect by chance alone
– … based on comparison with another (benchmark) corpus (e.g. the BNC)
– … and the difference has to be statistically significant

Keyness

• Text #1 wordlist is compared with Text #2 wordlist (comparison process)
• A statistical test is applied (e.g. Log Likelihood); this is calculated by the tool
• The difference must be statistically significant
• Result: key words list, i.e. the over-represented (and under-represented) words in text #1 when compared with text #2
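The comparison process can be sketched in a few lines. The code below uses the two-corpus log-likelihood formula that is common in corpus tools, with 3.84 as the cut-off (the chi-square critical value for p < 0.05 at one degree of freedom). The function names, the toy word lists and the choice of cut-off are assumptions for illustration; a real tool may differ in its details.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b = frequency of the word in text #1 and text #2;
    c, d = total number of words in text #1 and text #2."""
    e1 = c * (a + b) / (c + d)  # expected frequency in text #1
    e2 = d * (a + b) / (c + d)  # expected frequency in text #2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return (word, LL, 'over'/'under') rows, highest LL first."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    rows = []
    for word in set(study) | set(ref):
        a, b = study[word], ref[word]
        ll = log_likelihood(a, b, c, d)
        if ll >= threshold:  # keep statistically significant differences only
            direction = "over" if a / c > b / d else "under"
            rows.append((word, ll, direction))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Made-up word lists standing in for text #1 and text #2.
text1 = ["corpus"] * 20 + ["the"] * 80
text2 = ["the"] * 100

kw = keywords(text1, text2)
print(kw)
```

In this toy case "corpus" comes out as over-represented in text #1, while "the" is dropped because its difference between the two texts falls below the cut-off.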

Exploiting a corpus

Keywords

• A text’s keywords often point towards its content or its biases and/or can act as style markers (Enkvist 1973)
• Keywords are often a good guide to what would be interesting to look at in more detail

Exploiting a corpus

Keyness is “[...] a quality words may have in a given text or set of texts, suggesting that they are important [...]” (Scott and Tribble 2006: 55-6)

Summary

The basic idea:

By analysing VERY large amounts of textual data, we can ...

• establish norms about the variety of language being studied
• test theories about language
• spot common and rare language phenomena
• reduce bias

Summary

The computer can’t do it all for us – we still have to analyse the results and ask ...

‘What does it all mean?’