PALA Summer School, Maribor, 2014
Corpus Stylistics
Brian Walker and Dan McIntyre, University of Huddersfield
Summer School Schedule
Day 1
09:00 – Lectures:
Introduction to corpus linguistic terminology and methodology; corpus linguistics + stylistics
11:00 – Practical session:
Introduction to WMatrix
12:30 – LUNCH
14:00 – Practical sessions:
WMatrix
17:30 – FINISH
Summer School Schedule
Day 2
09:00 – Practical:
Introduction to AntConc
10:00 – Practical:
AntConc – advanced features
11:30 – Lecture:
Round up: Corpus stylistics – more than the sum of its parts?
12:30 – LUNCH
14:00 – Over to Willie
Introduction to Corpus Linguistics
What is a corpus?
• Latin corpus: ‘body’ (plural corpora)
• Put simply: a corpus is a ‘body’ of text
What is Corpus Linguistics?
• Corpus linguistics is the study of language using a corpus or corpora
Early Corpus Linguistics
Franz Boas, Leonard Bloomfield, Charles Hockett, Zellig Harris
• ‘Corpus Linguistics’ as an anachronism
• Field Linguistics
• Boas’s studies of native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results being derived from data
While until about 1880 investigators confined themselves to the collection of vocabularies and brief grammatical notes, it has become more and more evident that large masses of texts are needed in order to elucidate the structure of languages. (Boas 1917: 1)
Principles of Chomskyan linguistics
• Homogeneous underlying system of language
• Describe the language of the ideal speaker/hearer
• Focus on linguistic competence as opposed to linguistic performance
Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. (Chomsky, quoted in Andor 2004: 97)
Problems with intuition
• Issue of acceptability
    I was 19 when I started university
    I were 19 when I started university
• Impossibility of studying certain aspects of language without recourse to corpus data
    • Historical linguistics
    • Language change/variation
    • Language acquisition
…this [intuition] is a very strange notion of data. Normally one expects a scientist to develop theories to describe and explain some phenomena which already exist, independently of the scientist. One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory. (Stubbs 1996: 29)
The Survey of English Usage
• Instigated 1959 by Randolph Quirk at University College London
• One million words of written and spoken British English, made up of 200 text samples of 5000 words each
• Electronic version of the spoken data produced in collaboration with Lund University: the London-Lund Corpus
• Manually annotated for prosodic and paralinguistic features
• Grammatical structures for each text sample recorded on file cards
• Searching the corpus meant a trip to the Survey offices to search through filing cabinets of data!
Building the Brown corpus
• The Brown Corpus
• Built by Nelson Francis and Henry Kučera at Brown University, USA
• One million words of written American English (1961), made up of 500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
• The most frequent word in a corpus is approximately twice as frequent as the second most frequent, and three times as frequent as the third most frequent, etc.
• Frequency is inversely proportional to rank
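As a rough illustration of the Zipf’s law bullets above, the sketch below ranks words by frequency and sets each observed count beside the count predicted from the top-ranked word (top count divided by rank). The function name `zipf_table`, the simple tokenisation and the toy text are assumptions made for illustration only; they are not taken from the Brown Corpus or from any tool used in the practical sessions.

```python
import re
from collections import Counter


def zipf_table(text, top_n=5):
    """Return (rank, word, observed, predicted) rows for the top_n words.

    predicted = count of the most frequent word / rank, i.e. the
    idealised expectation that frequency is inversely proportional to rank.
    """
    # Simple tokenisation: lowercase alphabetic strings (an assumption;
    # real corpus tools tokenise more carefully).
    tokens = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(tokens).most_common(top_n)
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Hypothetical toy text, built so that the counts are exactly 6, 3 and 2
sample = "the " * 6 + "of " * 3 + "and " * 2
for row in zipf_table(sample, top_n=3):
    print(row)
```

On real corpus data the observed and predicted columns will only agree approximately; the toy text is constructed to fit the idealised pattern exactly.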
Extending the Brown family
• 1970-78: LOB
• Built by Geoffrey Leech and colleagues at Lancaster University
• One million words of written British English (1961), made up of 500 text samples of 2000 words each
• FROWN: Written American English from 1991
• FLOB: Written British English from 1991
• BE06: Written British English from early years of 21st century
• LOBalike: Written British English from 2011
Making sense of meaning
• COBUILD project initiated at Birmingham in 1980 – resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and Robert Daley analysed a small corpus of spoken and written English to investigate the relationship between words and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation
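To make the idea of ‘words in combination’ concrete, the following minimal sketch counts the words that occur within a fixed window either side of a node word. The function name `collocates`, the default window of four tokens and the example sentence are all assumptions for illustration; corpus tools normally go further and apply a statistical measure of association rather than relying on raw counts.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after
        counts.update(left + right)
    return counts


# Hypothetical example sentence, not drawn from any corpus
tokens = "they commit crime and we commit to reducing crime".split()
print(collocates(tokens, "commit", span=2).most_common(3))
```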
Bart Homer
You're up to something, aren't ya?
No! I'm just going out to commit certain deeds.
[Concordance of ‘commit’, with corpus text references from A26 to P17. Recoverable lines include:]
… would not commit the Methodist Church to the view that …
… Take care though: don't let your words commit an editor to using a specific picture …
… theory and deconstruction is such as to commit the reasoner to defending certain values.
… within the Service about offenders who continue to commit …
… to be silly and trivial, because I don't want to commit …
… the penalties on those who use such guns to commit …
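A concordance like the one for ‘commit’ above lists every occurrence of a node word with a few words of context on each side. The sketch below is one simple way to produce such key-word-in-context lines; the function name `kwic` and the context width of three tokens are assumptions for illustration, and this is not how AntConc or Wmatrix build their concordances.

```python
def kwic(tokens, node, width=3):
    """Return (left, node, right) tuples for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Example sentence taken from the dialogue on the slide
text = "I'm just going out to commit certain deeds"
for left, key, right in kwic(text.split(), "commit"):
    # Right-align the left context so the node words line up in a column
    print(f"{left:>30}  {key}  {right}")
```

Aligning the node word in a single column is what lets a reader scan down the page and see recurring words to its left and right.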
Advances in annotation: the British National Corpus (BNC)
• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British English
• Written data – 90 million words – Includes extracts from newspapers, academic books, popular fiction, letters and university essays
• Spoken data – 10 million words – Includes demographic data and context-governed data
• The demographic part – Transcripts of about 900 everyday unscripted spoken conversations
• The context-governed part – Spoken language collected in public contexts – e.g. radio phone-ins, government meetings, classroom interactions
Advances in annotation: Wmatrix
Looking ahead
• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Interdisciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems. This involves extracting place-names from a corpus, searching for their semantic collocates and creating maps that allow users to visualise how concepts such as war and money are distributed geographically (Gregory and Hardie 2011)
References
• Andor, J. (2004) ‘The master and his performance: an interview with Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.
• Boas, F. (1917) ‘Introduction’, International Journal of American Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language and Culture, pp. 199-210. New York: The Free Press.]
• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together corpus linguistics and Geographical Information Systems’, Literary and Linguistic Computing 26(3): 297-314.
• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of the 6th Language Resources and Evaluation Conference, Palais des Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30th May.
• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.