Corpus Linguistics
More about Corpus Linguistics
PALA Summer School, Maribor, 2014
Introduction
We will look at:
• Basic concepts and terminology
• Sampling and representativeness
• Annotation and mark-up
Characteristics of corpus linguistics according to Biber, Conrad & Reppen (1998: 4)
• Uses a corpus
• Uses computers for analysis
• Empirical – analysing actual patterns of language use
• Depends on quantitative and qualitative analytical techniques
Methodology vs. theory
Two main views:
Methodologist
CL is a methodology for studying large amounts of language data using computer software
Neo-Firthian
CL is a sub-discipline of linguistics, concerned with explaining relationships between meaning and structure in language
Characteristics of a corpus according to McEnery and Wilson (2001)
• Machine-readable form
• Very large
• Representative sample
• (Standard reference)
• Often annotated
Machine-readable form
• Nowadays, corpus = machine-readable
• Corpora tend to sit on a computer
• Not always the case
Very large
• Corpora are usually very large: tens of thousands, hundreds of thousands, or millions of words.
• Usually a finite size
• Size decided at design stage – when the size is reached, data collection stops.
• Exception – monitor corpus
  – E.g. COBUILD Corpus (Birmingham, UK)
  – Dictionary compiling
A representative sample
• Corpora are so big that they can be a ‘representative sample’ of a language or a language variety
• Also depends a lot on design of corpus
• Consider sampling
  – which texts will be sampled
  – size of samples
  – number of samples
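The three sampling decisions (which texts, size of samples, number of samples) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sample_extracts, its parameters, and the toy texts are all invented here, and real corpus design usually samples within categories (genre, date, speaker) rather than purely at random.

```python
import random

def sample_extracts(texts, n_samples, sample_size, seed=0):
    """Pick n_samples texts at random and cut one extract of up to
    sample_size words from each (hypothetical helper, for illustration)."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    # Decision 1: which texts will be sampled
    chosen = rng.sample(texts, k=min(n_samples, len(texts)))
    extracts = []
    for text in chosen:
        words = text.split()
        # Random start position, so extracts are not always text openings
        start = rng.randint(0, max(0, len(words) - sample_size))
        # Decision 2: size of each sample
        extracts.append(words[start:start + sample_size])
    return extracts

# Toy "population" of texts, invented for this example.
texts = [
    "the cat sat on the mat and looked at the dog",
    "it was a bright cold day in april",
    "she sells sea shells on the sea shore every day",
    "to be or not to be that is the question",
]

# Decision 3: number of samples
corpus = sample_extracts(texts, n_samples=3, sample_size=5)
print(len(corpus), "extracts of", len(corpus[0]), "words each")
```

Once the target number of samples is reached, collection stops, which matches the idea of a finite-sized corpus.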
A representative sample
Written language
extracts from books, magazines, newspapers, websites…
Spoken language
transcripts of meetings, lectures, radio programs, everyday conversations…
Something more specific ...
A representative sample
Yorkshire English
• What time frame are you going to sample?
• Speech or writing or both?
• Source of language data?
(Standard reference)
• A corpus might be a standard reference or a ‘benchmark’ for a particular variety of language against which other texts or corpora can be compared
Annotation
• Just the words on their own = ‘raw text’
• Annotation = extra information about what is in the corpus
• Helps with the analysis of the data
• Annotation also known as tagging or mark-up (generally)
Annotation
Information about the text:
• Where it came from
• Who produced it
• Genre
• Etc.
Example
........
Annotation
Adding information to the body of the text:
• e.g. Gender of speaker
• e.g. Discourse presentation
Example
and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,
Annotation
• Annotation can be a manual process (takes ages)
• But some linguistic annotation can be done automatically
  – e.g. word meaning (semantic)
  – e.g. grammatical class of each word in the corpus (noun, verb, etc.)
Linguistic Annotation: examples
CLAWS
• Constituent Likelihood Automatic Word-tagging System
• Developed at Lancaster University
• 96-97% accurate
• Works out what Part Of Speech the word is and assigns a tag from a list of tags (a tagset)
Linguistic Annotation: examples CLAWS
I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose, but even that was sweet in a way
Linguistic Annotation: examples CLAWS
I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 ,_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1
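Tagged output in this word_TAG form is easy to process by computer. The sketch below is not part of CLAWS; it only assumes that each token is a word and a tag joined by an underscore, as in the example above. The function name parse_tagged is invented for illustration, and only the first part of the example sentence is used.

```python
from collections import Counter

# First part of the tagged example sentence, copied from the slide.
tagged = ("I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ "
          "different_JJ from_II other_JJ boys_NN2")

def parse_tagged(text):
    """Split word_TAG tokens into (word, tag) pairs (hypothetical helper)."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore, so the tag is always the final part
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)

# How often each tag occurs in the extract
tag_counts = Counter(tag for _, tag in pairs)

# All words carrying the tag JJ (the tag given to "different" and "other")
adjectives = [word for word, tag in pairs if tag == "JJ"]

print(pairs[:3])
print(adjectives)
```

With the tags separated from the words, we can search for grammatical classes instead of individual word forms.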
Characteristics of a corpus according to McEnery and Wilson (2001)
• Machine-readable
• Very large
• Representative sample
• (Standard reference)
• Annotation
A corpus
• …a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration. (McEnery & Wilson 2001: 32)
Why use a corpus?
• Allows linguists to access quantitative information about language, which can often be used to support qualitative analysis.
• Insights into language gained from corpus analysis are often generalisable in a way that insights gained from the qualitative analysis of small samples of data are not.
• Using corpus data forces us to acknowledge how language is really used (which is often different from how we think it is used).
Exploiting a corpus
Collocations
• Collocation = relationship between words that tend to occur together
  – Words that tend to occur near word X are the collocates of word X
  – Based on frequencies
  – Statistical measures
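A minimal sketch of how collocates can be found: count the words that occur within a window around word X, then compare the observed count with what we would expect by chance. The statistic used here is a Mutual Information score, log2(observed / expected), which is one common choice and an assumption of this example; corpus tools offer several measures and differ in the details. The function names, the window size and the toy sentence are all invented for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

def mutual_information(tokens, node, collocate, window=4):
    """MI = log2(observed / expected) co-occurrence (one common formulation)."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = collocates(tokens, node, window)[collocate]
    if observed == 0:
        return float("-inf")  # never co-occur in this window
    # Expected co-occurrences if the words were scattered at random;
    # the span is `window` words to the left plus `window` to the right.
    expected = freq[node] * freq[collocate] * 2 * window / n
    return math.log2(observed / expected)

# Toy data, invented for this example (far too small for real analysis).
tokens = ("juvenile crime rose while juvenile court cases fell "
          "and juvenile crime reports grew").split()

counts = collocates(tokens, "juvenile", window=1)
mi = mutual_information(tokens, "juvenile", "crime", window=1)
print(counts.most_common(3))
print(round(mi, 2))
```

A positive score means the two words occur together more often than chance would predict. On a corpus this small the score means very little; the point is only to show where the frequencies and the statistical measure come in.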
Exploiting a corpus
Collocations
• Important in corpus linguistics
• The company a word keeps can give that word implicit associations or assumptions
Exploiting a corpus
Collocations
• Juvenile = young, youthful, a young person
  – Collocates: delinquency, delinquent, delinquents, offenders, diabetes, crime, court
• Juvenile has negative associations
• Semantic prosody
Exploiting a corpus
Collocations
– Near-synonyms often differ in terms of their collocations
Exploiting a corpus
Collocations
• Young
  – Collocates: mums-to-be, bloods, nubile, hopefuls, impressionable, up-and-coming
• Negative associations?
Exploiting a corpus
Keywords
• A keyword is a word which occurs in a text or corpus more frequently than you would expect by chance alone
  – … based on comparison with another (benchmark) corpus (e.g. the BNC)
  – … and the difference has to be statistically significant
Keyness
[Diagram: the Text #1 wordlist and the Text #2 wordlist go through a comparison process, in which the tool applies a statistical test (e.g. Log Likelihood), to produce a key words list: the over-represented (and under-represented) words in text #1 when compared with text #2. The difference must be statistically significant.]
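The comparison step can be sketched as follows, using the log-likelihood statistic in its usual two-corpus form. The function names, the toy word lists and the cut-off are assumptions of this example: 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom, and a real tool may apply a stricter threshold or further filters.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b = frequency of the word in text 1 and text 2;
    c, d = total number of words in text 1 and text 2."""
    # Expected frequencies if the word were equally common in both texts
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(text1_tokens, text2_tokens, threshold=3.84):
    """Words whose frequency differs significantly between the two texts."""
    f1, f2 = Counter(text1_tokens), Counter(text2_tokens)
    n1, n2 = len(text1_tokens), len(text2_tokens)
    results = []
    for word in set(f1) | set(f2):
        a, b = f1[word], f2[word]
        ll = log_likelihood(a, b, n1, n2)
        if ll >= threshold:
            # Compare relative frequencies to get the direction
            direction = "over" if a / n1 > b / n2 else "under"
            results.append((word, ll, direction))
    # Highest keyness first
    return sorted(results, key=lambda r: r[1], reverse=True)

# Toy word lists, invented for this example.
text1 = ["whale"] * 20 + ["the"] * 80
text2 = ["the"] * 100

kw = keywords(text1, text2)
for word, ll, direction in kw:
    print(word, round(ll, 2), direction + "-represented in text #1")
```

Here "whale" comes out as a keyword of text #1, while "the" does not, because its frequency difference between the two texts is too small to be significant.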
Exploiting a corpus
Keywords
• A text’s keywords often point towards its content or its biases and/or can act as style markers (Enkvist 1973)
• Keywords are often a good guide to what would be interesting to look at in more detail
Exploiting a corpus
Keyness is “[...] a quality words may have in a given text or set of texts, suggesting that they are important [...]” (Scott and Tribble 2006: 55-6)
Summary
The basic idea:
By analysing VERY large amounts of textual data, we can ...
• establish norms about the variety of language being studied
• test theories about language
• spot common and rare language phenomena
• reduce bias
Summary
The computer can’t do it all for us – we still have to analyse the results and ask ...