PALA Summer School, Maribor, 2014

Corpus Stylistics

Brian Walker and Dan McIntyre, University of Huddersfield

Summer School Schedule

Day 1 09:00 – Lectures:

Introduction to corpus linguistic terminology and methodology; corpus linguistics + stylistics

11:00 – Practical session:

Introduction to Wmatrix

12:30 – LUNCH

14:00 – Practical sessions:

Wmatrix

17:30 – FINISH

Summer School Schedule

Day 2 09:00 – Practical:

Introduction to AntConc

10:00 – Practical:

AntConc – advanced features

11:30 – Lecture:

Round up: Corpus stylistics – more than the sum of its parts?

12:30 – LUNCH

14:00 – Over to Willie

Introduction to Corpus Linguistics

What is a corpus?

Latin corpus: ‘body’ (plural corpora)

Put simply: a corpus is a ‘body’ of text

What is Corpus Linguistics?

Corpus linguistics is the study of language using a corpus or corpora

Early Corpus Linguistics

Franz Boas, Leonard Bloomfield, Charles Hockett, Zellig Harris

• ‘Corpus Linguistics’ as an anachronism
• Field Linguistics
• Boas’s studies of native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results being derived from data

While until about 1880 investigators confined themselves to the collection of vocabularies and brief grammatical notes, it has become more and more evident that large masses of texts are needed in order to elucidate the structure of languages. (Boas 1917: 1)

Principles of Chomskyan linguistics

• Homogeneous underlying system of language
• Describe the language of the ideal speaker/hearer
• Focus on linguistic competence as opposed to linguistic performance

Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. (Chomsky, quoted in Andor 2004: 97)

Problems with intuition

• Issue of acceptability
  – I was 19 when I started university
  – I were 19 when I started university
• Impossibility of studying certain aspects of language without recourse to corpus data
  – Historical linguistics
  – Language change/variation
  – Language acquisition

…this [intuition] is a very strange notion of data. Normally one expects a scientist to develop theories to describe and explain some phenomena which already exist, independently of the scientist. One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory. (Stubbs 1996: 29)

The Survey of English Usage

• Instigated 1959 by Randolph Quirk at University College London
• One million words of written and spoken British English, made up of 200 text samples of 5000 words each
• Electronic version of the spoken data produced in collaboration with Lund University: the London-Lund Corpus
• Manually annotated for prosodic and paralinguistic features
• Grammatical structures for each text sample recorded on file cards
• Searching the corpus meant a trip to the Survey offices to search through filing cabinets of data!

Building the Brown corpus

• The Brown Corpus
• Built by Nelson Francis and Henry Kučera at Brown University, USA
• One million words of written American English (1961), made up of 500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
  – The most frequent word in a corpus is approximately twice as frequent as the second most frequent, and three times as frequent as the third most frequent, etc.
  – Frequency is inversely proportional to rank
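The rank/frequency relationship described above can be checked on any word list. The following is a minimal sketch only: the function names, the simple regular-expression tokeniser and the toy text are assumptions made for illustration, not part of the Brown Corpus work, and a real corpus would only approximate the idealised figures.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Crude tokeniser (an assumption): lower-case runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def zipf_expected(top_frequency, rank):
    """Frequency predicted at a rank if frequency were exactly top/rank."""
    return top_frequency / rank


# Toy text constructed so that the counts are 6, 3, 2 and 1.
toy_text = "the the the the the the of of of and and to"
table = rank_frequency(toy_text)
for rank, word, freq in table:
    # Observed frequency next to the idealised Zipfian prediction.
    print(rank, word, freq, round(zipf_expected(table[0][2], rank), 2))
```

Comparing the observed column with the predicted one shows how closely a given text follows the inverse relationship between frequency and rank.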

Extending the Brown family

• 1970-78: LOB
• Built by Geoffrey Leech and colleagues at Lancaster University
• One million words of written British English (1961), made up of 500 text samples of 2000 words each
• FROWN: Written American English from 1991
• FLOB: Written British English from 1991
• BE06: Written British English from early years of 21st century
• LOBalike: Written British English from 2011

Making sense of meaning

• COBUILD project initiated at Birmingham in 1980 - resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and Robert Daley analysed a small corpus of spoken and written English to investigate the relationship between words and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation
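The idea that meaning is a property of words in combination can be explored by counting which words occur near a chosen node word. This is a minimal sketch, not the method used in English Lexical Studies or COBUILD: the function name, the `span` parameter, the tokeniser and the sample sentences are all assumptions for illustration, and the result is raw co-occurrence counts rather than any statistical measure of collocation.

```python
from collections import Counter
import re


def collocates(text, node, span=4):
    """Count the words found within `span` tokens either side of `node`."""
    # Crude tokeniser (an assumption): lower-case runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented sample text, for illustration only.
sample = "They commit crime. He did not commit suicide. Few commit crime twice."
neighbours = collocates(sample, "commit", span=1)
print(neighbours.most_common())
```

Widening `span` brings in more distant neighbours. Ranking collocates properly would also need a comparison against each word's overall frequency in the corpus.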

Bart Homer

You're up to something, aren't ya?

No! I'm just going out to commit certain deeds.

A39 57 A39 58 The write way to commit murder A39 59 "Advice and inform
of God is manifested. Kill, D03 78 rob and commit adultery are all deeds forbidden in the D03
0 of a religious sect who orders his followers to commit suicide. D11 131 "God, permitting the mir
bility of episcopal ordination" would not commit the D17 47 Methodist Church to the view that th
7 article. Take care though: don't let your words commit an editor to E10 98 using a specific picture, w
theory and deconstruction is such as to G67 189 commit the reasoner to defending certain values. G67
ithin the Service about offenders who continue to commit
to be silly and trivial, because I don't want to commit
rease the penalties on those who use such guns to commit
ereas H09 191 crime while on bail H09 192 Whil
4m ($45m). H27 148 However, it would only commit itself to a forecast of H27 149 maintained sales
democracy from collapse, but this was to J41 142 commit "a common fallacy in social thought which
45 163 the effort levels that they are willing to commit. Let contracts J45 164 with regard to effort be
2 Her cheeks flushed crimson and he strove to commit to memory P08 53 the lovely colour as the blood

222 never took the slot, although he did briefly commit to an ROTC A10 223 program before putting his na
1894. A26 13 Commissioners hesitated to commit themselves after one of the A26 14 monument's c
ote/> "Cold Feet A32 243 - Why Men Won't Commit" and "Letting Go and Moving A32
B13 92 addressed men who use drugs or those who commit adultery, and who B13 93 get AIDS and other ven
ceeds rational basis. Since urban blacks B17 61 commit more crime proportionately (although not numerica
us consequences. Mr. C12 185 Deng was hounded to commit suicide in 1966 and his criticism is now C12 186
nd Jodie squabble C13 199 because he's afraid to commit to marriage. C13 200 Social issues, too,
ue "is to do something about it, i.e., to commit oneself D03 187 to a way of life ..."
SANITATION E28 2 HOW TO COMMIT BIOCIDE E28 3 In the strictest sense,
s an offensive F04 52 position. That is, to commit to an aggressive daily-action plan F04 53 desig
form drives 15 F11 31 percent of its victims to commit suicide. (For a list of symptoms, F11 32 see 'A
, artificial persons make decisions that F37 23 commit other people. At the same time, the power to spea
act G22 13 open to us now would be unjust is to commit ourselves to avoiding G22 14 it. But what of pa
H08 57 exploiting the Gulf war as a pretext to commit terrorism. H08 58 While we can be proud
p/> H09 52 First, we must get the people who commit crimes out of the H09 53 community, and we must
And, it increases penalties for criminals who commit gun H09 69 offenses. H09 70 We have no
H09 120 crimes. H09 121 Mr. President, I
y requiring grantees H26 155 in most programs to commit their own funds for a portion of the H26 156 cos
to do with its value; to think so is to J30 27 commit a genetic fallacy. After I wrote this, I came acr
J43 34 disengaged delinquents are free to commit a variety of illegal J43 35 activities, such fr
on a particular illegal J43 38 possibility. Why commit anti-gay violence versus rape or armed J43 39 r
_>Hitler understandably regarded people who could commit such J56 150 acts against Britain as his natural
rt with this J58 131 their so natural Right, but commit onely<&|>sic! the Administration J58 132 of such lives
K23 172 were before us. Rarely did anyone commit suicide. Here, hundreds of K23 173 people sit, w
asked Michael. "Did you want P17 102 to commit suicide?" P17 103 "Oh, no
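Concordance lines like those for ‘commit’ above are produced by a key-word-in-context (KWIC) search. The sketch below shows the general idea only. It is not how Wmatrix, AntConc or any other concordancer is implemented, and the function name, the `width` parameter and the sample sentences are assumptions for illustration.

```python
import re


def kwic(text, node, width=30):
    """Return one line per occurrence of `node`, with context either side."""
    lines = []
    # Match the node as a whole word, ignoring case.
    pattern = r"\b" + re.escape(node) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every node word starts in the
        # same column, as in a concordancer display.
        lines.append(f"{left:>{width}} {match.group()} {right}")
    return lines


# Invented sample text, for illustration only.
sample = "He was afraid to commit to marriage. Few people commit crimes."
concordance = kwic(sample, "commit", width=10)
for line in concordance:
    print(line)
```

Because the left context is cut at a fixed number of characters, words at the edges are truncated, which is why concordance lines often begin and end mid-word.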

Advances in annotation: the British National Corpus (BNC)

• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British English
• Written data – 90 million words
  – Includes extracts from newspapers, academic books, popular fiction, letters and university essays
• Spoken data – 10 million words
  – Includes demographic data and context-governed data
• The demographic part
  – Transcripts of about 900 everyday unscripted spoken conversations
• The context-governed part
  – Spoken language collected in public contexts, e.g. radio phone-ins, government meetings, classroom interactions

Advances in annotation: Wmatrix

Looking ahead

• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Interdisciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems. This involves extracting place-names from a corpus, searching for their semantic collocates and creating maps to allow users to visualise how concepts such as war and money are distributed geographically (Gregory and Hardie 2011)

References

• Andor, J. (2004) ‘The master and his performance: an interview with Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.
• Boas, F. (1917) ‘Introduction’, International Journal of American Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language and Culture, pp. 199-210. New York: The Free Press.]
• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together corpus linguistics and Geographical Information Systems’, Literary and Linguistic Computing 26(3): 297-314.
• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of the 6th Language Resources and Evaluation Conference, Palais des Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30th May.
• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.