Corpus Linguistics and
Stylistics
PALA Summer School, Maribor, 2014
In this lecture...
• Stylistics and style
• Combining stylistics + corpus linguistics
• Examples of studies combining corpus linguistics
and stylistics
– Analysis of genres
– Analysis of the works by particular authors
– Analysis of individual texts
– Analysis of variation inside texts
• Corpus Tools
– WMatrix
Stylistics
Stylistics is the study of literature
using methods, theories and
concepts from linguistics
(Leech and Short 2007: 1)
It is “[...] the study of the
relationship between linguistic
form and literary function [...]”
(Leech and Short 2007: 3).
Linguistic style
‘Style is a way in which language
is used’
(Leech and Short 2007: 31)
‘[S]tyle consists in choices made
from the repertoire of the
language.’
(Leech and Short 2007: 31)
Linguistic style
‘Stylistic choice is limited
to those aspects of
linguistic choice which
concern alternative ways
of rendering the same
subject matter’
(Leech and Short 2007: 31)
e.g. horse vs. steed but not
horse vs. dog
Linguistic style
• Style and genre, e.g. science fiction, romance
novels, etc.
• Style and author
• Style and text
• Style and parts of texts (e.g. the narration or
speech of different characters)
Ways of analysing style
• Analyst’s intuitions
• ‘Manual’ comparative analysis
Ways of analysing style
Style and comparison
‘Even if style is defined as that variety of language
which correlates with context, the recognition and
analysis of styles are squarely based on comparison.
The essence of variation, and thus of style, is difference,
and differences cannot be analysed and described
without comparison.’
(Enkvist 1973: 21)
Ways of analysing style
• Comparative analysis – manually
– OK for shorter texts/extracts
• Comparative analysis – using computers:
– Corpus linguistic methods/tools
– Especially useful for longer texts – prose fiction
Combining corpus linguistics and
stylistics
• The ‘corpus turn’ (Leech and Short 2007: 284).
• On-going trend in stylistics to use methods
and tools from corpus linguistics for the
analysis of literary and other texts.
• Usually referred to as corpus stylistics
• Other terms:
digital stylistics (Louw 2008)
electronic text analysis (Adolphs 2006)
Examples of studies
• Combining corpus linguistics and stylistics
– Analysis of genres
– Analysis of the works by particular authors
– Analysis of individual texts
– Analysis of variation inside texts
Genre style
• Biber (1988) – multivariate statistical techniques
– factor analysis
– many different variables
– variables = linguistic features (e.g. passive constructions)
• e.g. narrative versus non-narrative texts
– important variables = past tense verbs, 3rd person
pronouns, perfect aspect, present participle
clauses
– High scores = narrative
– Low scores = non-narrative
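As a rough illustration of how a text might receive a score on such a dimension, the sketch below standardises each feature's frequency across a set of texts and sums the resulting z-scores. It is a simplification under stated assumptions, not Biber's full procedure: it skips the factor analysis that selects and weights the features, and the feature labels, text names and rates are invented purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical feature labels, based on the variables listed above.
FEATURES = [
    "past_tense",
    "third_person_pronouns",
    "perfect_aspect",
    "present_participle_clauses",
]


def dimension_scores(rates):
    """Sum per-feature z-scores for each text.

    rates maps a text name to {feature: frequency per 1,000 words}.
    """
    names = list(rates)
    scores = {name: 0.0 for name in names}
    for feature in FEATURES:
        values = [rates[name][feature] for name in names]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # a feature that never varies cannot separate texts
        for name in names:
            scores[name] += (rates[name][feature] - mu) / sigma
    return scores


# Invented rates per 1,000 words, for illustration only.
toy_rates = {
    "text_a": {"past_tense": 70.0, "third_person_pronouns": 50.0,
               "perfect_aspect": 12.0, "present_participle_clauses": 3.0},
    "text_b": {"past_tense": 40.0, "third_person_pronouns": 30.0,
               "perfect_aspect": 8.0, "present_participle_clauses": 2.0},
    "text_c": {"past_tense": 15.0, "third_person_pronouns": 10.0,
               "perfect_aspect": 4.0, "present_participle_clauses": 1.0},
}

scores = dimension_scores(toy_rates)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:+.2f}")
```

Because the scores are continuous, texts spread out along a scale rather than falling into two bins.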
A range not a dichotomy
• Between the top text-types and the bottom text-types
there exists a whole range of text-types in the
middle – it’s not just a two-way distinction
• Note also – spoken and written genres are
mixed together along the dimension
narrative / non-narrative
Genre style – direct speech
Corpus-based study of
speech, writing and thought
presentation
(Semino and Short 2004)
Genre style – direct speech
Corpus of 260,000 (approx) words of (late) 20th
century written British English
• 120 text samples
• 2,000 (approx) words each, amounting to a
total of 258,348 words
Genre style – direct speech
Corpus divided into three sections:
– prose fiction (87,709 words),
– newspaper news reports (83,603 words), and
– biography and autobiography (87,036 words)
Each genre section is further divided into a
‘serious’ and a ‘popular’ sub-section.
Genre style – direct speech
• Corpus tagged – manually
<sptag cat=NRS next=DS s=0.37 w=7>
The theme park’s manager, Mike Slattery said:
<sptag cat=DS next=NRS s=1.63 w=18>
‘By closing Crinkley Bottom, the council has shot
Morecambe in the foot. And I’m out of a job.’
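Annotation of this kind can be read back out of the text with a few lines of code. The sketch below is only an illustration: it assumes each sptag applies to the text up to the next sptag, and that the w attribute records the number of words in the tagged stretch (which matches the two examples above). The function and variable names are hypothetical.

```python
import re

SPTAG = re.compile(r"<sptag\s+([^>]*)>")   # captures the attribute string
ATTRIBUTE = re.compile(r"(\w+)=(\S+)")     # e.g. cat=DS, w=18


def parse_sptags(text):
    """Return a list of (attributes, tagged_text) pairs."""
    # With one capturing group, split() yields:
    # [text_before_first_tag, attrs_1, text_1, attrs_2, text_2, ...]
    parts = SPTAG.split(text)
    units = []
    for attrs, body in zip(parts[1::2], parts[2::2]):
        units.append((dict(ATTRIBUTE.findall(attrs)), " ".join(body.split())))
    return units


sample = """<sptag cat=NRS next=DS s=0.37 w=7>
The theme park’s manager, Mike Slattery said:
<sptag cat=DS next=NRS s=1.63 w=18>
‘By closing Crinkley Bottom, the council has shot
Morecambe in the foot. And I’m out of a job.’"""

units = parse_sptags(sample)
# Count the stretches tagged as direct speech (DS).
ds_count = sum(1 for attrs, _ in units if attrs["cat"] == "DS")
print(ds_count)
```

Counting the units whose cat value is DS is how totals per corpus section could be produced once every text is tagged.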
Genre style – direct speech
Section of the corpus    Number of instances of DS
Whole corpus             2,974
Fiction                  1,569
Press                    770
(Auto)biography          635

Fiction sub-section      Number of instances of DS
Serious                  629
Popular                  940
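Because the three sections differ slightly in size, raw counts are easier to compare once they are normalised. The sketch below converts the DS counts to a rate per 1,000 words, using the section sizes given earlier in the lecture. It assumes that ‘Press’ corresponds to the newspaper news reports section; the variable and function names are illustrative.

```python
# Section sizes in words and DS counts, as reported in the slides.
WORDS = {"Fiction": 87_709, "Press": 83_603, "(Auto)biography": 87_036}
DS_COUNTS = {"Fiction": 1_569, "Press": 770, "(Auto)biography": 635}


def per_thousand(count, total_words):
    """Normalise a raw count to a rate per 1,000 words."""
    return 1000 * count / total_words


rates = {section: per_thousand(DS_COUNTS[section], WORDS[section])
         for section in WORDS}

for section, rate in rates.items():
    print(f"{section}: {rate:.1f} instances of DS per 1,000 words")
```

On these figures, fiction's rate comes out at roughly twice that of the press section.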
Authorial style
• Studies attempting to ‘fingerprint’ authors: i.e. to
identify linguistic items that distinguish the works by
one author from those of others.
• Burrows (1987): study of Jane Austen’s novels
focusing on closed-class words, such as the, and, of,
a and to.
• Burrows found that these words can distinguish the
works of different authors, different novels, and
even the words spoken by different characters.
Authorial style
• Hoover (2002) studied a series of corpora containing
chunks from novels by different authors.
• For example, he looked at a corpus containing the
first 30,000 words of 29 novels by 17 different
authors.
• The distribution of the 300 most frequent words in
the corpus as a whole correctly clusters 15 out of 17
novels.
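The general idea of comparing texts by their most frequent words can be sketched as follows. This is not Hoover's actual procedure (he used cluster analysis on much larger samples); it only builds a relative-frequency profile for each text over the corpus's most frequent words and measures how far apart two profiles are. The toy texts, the tokeniser and the distance measure are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n):
    """The n most frequent words in the corpus as a whole."""
    total = Counter()
    for text in texts.values():
        total.update(tokenize(text))
    return [word for word, _ in total.most_common(n)]


def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in one (non-empty) text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Invented miniature 'texts', far too short for real analysis.
toy_texts = {
    "sample_1": "the cat sat on the mat and the dog sat by the door",
    "sample_2": "the rain fell on the roof and the wind blew on the hill",
    "sample_3": "a bird sang of a tree and a brook of a field",
}

vocabulary = top_words(toy_texts, 5)
profiles = {name: profile(text, vocabulary) for name, text in toy_texts.items()}
d12 = distance(profiles["sample_1"], profiles["sample_2"])
d13 = distance(profiles["sample_1"], profiles["sample_3"])
print(d12 < d13)
```

A clustering step would then group the texts whose profiles lie closest together.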
Authorial style
• An analysis of the most frequent word sequences
(n-grams) can also be useful, e.g.
– of the
– in the
– to the
– it was
– he was
– and the
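Two-word sequences like these can be extracted by sliding a window over the tokens and counting what appears. The following is a minimal sketch; the tokeniser is a deliberately crude assumption, and the sample sentence is used only as a test string.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_frequent_ngrams(text, n=2, top=5):
    """The most frequent n-grams in a text, with their counts."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(top)


sample = "It was the best of times, it was the worst of times"
for sequence, count in most_frequent_ngrams(sample, n=2, top=3):
    print(sequence, count)
```

Setting n to 3 or more gives longer sequences of the kind sometimes called clusters.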
Authorial style
• Mahlberg (2007, 2009, 2012)
• Corpus stylistics and Dickens’s fiction
• Also shows that analysis of frequent
word sequences (clusters) can be
useful.
• Clusters containing body parts
– “his hands in his pockets”
– “his head on one side”
– “his hands upon his”
Text style
• Stubbs’s (2005) study of
Joseph Conrad’s Heart of
Darkness, first published in
1899.
• Marlow, the protagonist and
first-person narrator, tells of
how he was contracted to
travel up a river in the Belgian
Congo, in order to find an
ivory trader called Kurtz, who
was the subject of stories of
madness and suspect
practices. However, Kurtz dies
while travelling back down the
river.
Text style
• Main themes
– ‘hypocrisy of the colonizers’
– ‘unreliability of progress and civilization’
– ‘breakdowns in communication’
– Light vs. dark
– Restraint vs. frenzy
– Appearance vs. reality
– Marlow’s ‘unreliable and distorted knowledge’
(Stubbs 2005: 8-9)
Text style
• Used WordSmith Tools (Scott 2007)
• Compared one novel with a corpus of fictional texts
of around 700,000 words
• Overused words in novel include: seemed, mystery,
darkness, absurd, horror, terror, desolation
• Several words concern uncertainty, perception and
knowledge.
• Coincide with some of the novel’s themes
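A comparison of this kind rests on a statistic that measures how unexpectedly frequent a word is in the text compared with the reference corpus. The slides do not say which statistic was used; the sketch below assumes log-likelihood, one measure commonly offered by keyword tools. The function names and the tiny token lists are illustrative only.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a text of c tokens
    and b times in a reference corpus of d tokens."""
    expected_text = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_text)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def overused_words(text_tokens, reference_tokens, top=10):
    """Words relatively more frequent in the text, ranked by log-likelihood."""
    text, reference = Counter(text_tokens), Counter(reference_tokens)
    c, d = len(text_tokens), len(reference_tokens)
    scored = []
    for word, a in text.items():
        b = reference[word]
        if a / c > b / d:  # keep only words overused in the text
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Invented token lists, far smaller than any real corpus.
text_tokens = ["darkness"] * 5 + ["the"] * 5
reference_tokens = ["the"] * 10
print(overused_words(text_tokens, reference_tokens))
```

The ranked list is only a starting point: the analyst still has to read the words in context to decide what they contribute to the text.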
Text style
• Stubbs shows how the application of corpus
methods can provide:
– further justification for well-established
interpretations,
– new insights into the language and meaning
potential of the text.
Text style: variation inside texts
• Culpeper (2002) used WordSmith Tools to do a keyword analysis of the speech of the main characters in
Romeo and Juliet
• A file with the words spoken by each character was
compared to a ‘reference corpus’ containing the
words of all the other characters.
• Findings are relevant to an understanding of how the
characters are linguistically constructed
(characterisation).
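The design described above, where each character is compared against everyone else, can be sketched as follows. The character names come from the play, but the token lists are invented placeholders, not counts from the text, and the function name is hypothetical.

```python
from collections import Counter


def character_vs_rest(speech_by_character, character):
    """Split the data into one character's words (target) and the
    pooled words of all other characters (reference)."""
    target = Counter(speech_by_character[character])
    reference = Counter()
    for name, tokens in speech_by_character.items():
        if name != character:
            reference.update(tokens)
    return target, reference


# Invented placeholder tokens, for illustration only.
speech = {
    "Juliet": ["if", "i", "would", "yet"],
    "Romeo": ["love", "i", "night"],
    "Nurse": ["lady", "i"],
}

target, reference = character_vs_rest(speech, "Juliet")
print(target.most_common(), reference.most_common())
```

One consequence of this design is that the reference corpus changes for every character, so a word is key only relative to the other speakers in the same play.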
Text style: variation inside texts
Juliet’s key-words (raw frequencies in brackets):
If (31), Or (25), Sweet (16), Be (59), News (9), My
(92), Night (27), I (138), Would (20), Yet (18), Thou
(71), Words (5), Name (11), Nurse (20), Tybalt’s (6),
Send (7), Husband (7), That (82), Swear (5)
Text style: variation inside texts
Key-words such as if, or, would, yet can be related
to Juliet’s tendency to express uncertainty and
anxiety throughout the play:
‘I fear it is: and yet, methinks, it
should not, For he hath still
been tried a holy man’ (IV.iii.)
[Context: Wondering whether
the Friar has supplied sleeping
potion or poison]
Corpus tools
Corpus tools make comparison relatively easy
• WordSmith Tools (Scott 2007)
• WMatrix (Rayson 2009)
• AntConc (Anthony 2011)
• MLCT (Piao)
Summary
• Style is the way in which language is used.
• The notion of ‘style’ is fundamentally based on
comparison
• Corpus linguistic methods are relevant to the
analysis of style in fiction/literature.
• They have been applied to the analysis of
genres, authors and texts.
• Manual analysis and interpretation of the
output from corpus tools is needed.
Summary
[...] ‘corpus stylistics’ is not
purely a quantitative study of
literature. Rather, it is still a
qualitative stylistic approach
to the study of the language
of literature, combined with
or supported by corpus-based
quantitative methods and
technology.
(Ho 2011:10)
References
Culpeper, J. (2009) “Keyness: words, parts-of-speech and semantic categories in the character-talk of
Shakespeare’s Romeo and Juliet”, International Journal of Corpus Linguistics 14(1): 29-59.
Ho, Y. (2011) Corpus Stylistics in Principles and Practice: A Stylistic Exploration of John Fowles’ The
Magus. London: Continuum.
Leech, G. (2008) Language in Literature: Style and Foregrounding. Harlow, UK: Pearson.
Louw, B. (2008) “Consolidating Empirical method in data-assisted stylistics: Towards a corpus-attested
glossary of literary terms”, in Zyngier, S., Bortolussi, M., Chesnokova, A. and Auracher, J. (eds.) Directions in
Empirical Literary Studies, pp. 243-264. Amsterdam: Benjamins.
Mahlberg, M. (2007) “Clusters, Key Clusters and local textual functions in Dickens”, Corpora 2(1): 1-31.
Mahlberg, M. (2009) “Corpus Stylistics and the Pickwickian watering-pot”, in Baker, P. (ed.)
Contemporary Corpus Linguistics, pp. 47-63. London: Continuum.
Mahlberg, M. (2012) Corpus Stylistics and Dickens’s Fiction. London: Routledge.
McIntyre, D. (2010) “Dialogue and Characterization in Quentin Tarantino’s Reservoir Dogs: A Corpus
Stylistic Analysis”, in McIntyre, D. and Busse, B. (eds.) Language and Style, pp. 162-182. Basingstoke:
Palgrave.
McIntyre, D. and Walker, B. (2010) “How can corpora be used to explore the language of poetry and
drama?”, in McCarthy, M. and O’Keeffe, A. (eds.) The Routledge Handbook of Corpus Linguistics.
London: Routledge.
Widdowson, H. G. (2008) “The Novel Features of Text. Corpus Analysis and Stylistics”, in Gerbig, A. and
Mason, O. (eds.) Language, People, Numbers: Corpus Stylistics and Society, pp. 293-304.
Amsterdam: Rodopi.
WMatrix
• Web-based corpus tool
• Developed by Paul Rayson at Lancaster
University
• Automated grammatical and semantic analysis
of texts/corpora
• A web-based front end for CLAWS and USAS
WMatrix
Using a web interface:
• Texts are uploaded onto the Wmatrix server
(at Lancaster)
• The upload procedure automatically adds
(i) Grammatical or Part of Speech (POS) tags;
(ii) Semantic tags
WMatrix
• CLAWS grammatical (POS) tagger.
CLAWS = Constituent Likelihood Automatic Wordtagging System
• USAS semantic tagger
USAS = UCREL Semantic Analysis System
• (UCREL = University Centre for Computer Corpus
Research on Language)
WMatrix
USAS
• Assigns tags to each word using a hierarchical
framework of categorization
• Based originally on McArthur’s (1981)
Longman Lexicon of Contemporary English
The 21 Top Level Semantic Categories of the
USAS Tag-set
A  GENERAL & ABSTRACT TERMS
B  THE BODY & THE INDIVIDUAL
C  ARTS & CRAFTS
E  EMOTION
F  FOOD & FARMING
G  GOVERNMENT & PUBLIC DOMAIN
H  ARCHITECTURE, HOUSING & THE HOME
I  MONEY & COMMERCE (IN INDUSTRY)
K  ENTERTAINMENT
L  LIFE & LIVING THINGS
M  MOVEMENT, LOCATION, TRAVEL, TRANSPORT
N  NUMBERS & MEASUREMENT
O  SUBSTANCES, MATERIALS, OBJECTS, EQUIPMENT
P  EDUCATION
Q  LANGUAGE & COMMUNICATION
S  SOCIAL ACTIONS, STATES & PROCESSES
T  TIME
W  WORLD & ENVIRONMENT
X  PSYCHOLOGICAL ACTIONS, STATES & PROCESSES
Y  SCIENCE & TECHNOLOGY
Z  NAMES & GRAMMAR
WMatrix
G - Government and the public domain
G1  Government, politics and elections
    G1.1  Government, etc.
    G1.2  Politics
G2  Crime, law and order
G3  War, defence and the army: weapons
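The hierarchy is encoded in the tag itself: the letter gives the top-level category and each following number narrows it down. The sketch below shows how a tag such as G1.2 could be unpacked into its levels. It assumes plain tags of the form letter, number, then optional dotted sub-numbers; real tagger output may carry additional markers, which this sketch ignores. The function names are hypothetical.

```python
# Top-level categories as listed in the USAS tag-set slide.
TOP_LEVEL = {
    "A": "GENERAL & ABSTRACT TERMS",
    "B": "THE BODY & THE INDIVIDUAL",
    "C": "ARTS & CRAFTS",
    "E": "EMOTION",
    "F": "FOOD & FARMING",
    "G": "GOVERNMENT & PUBLIC DOMAIN",
    "H": "ARCHITECTURE, HOUSING & THE HOME",
    "I": "MONEY & COMMERCE (IN INDUSTRY)",
    "K": "ENTERTAINMENT",
    "L": "LIFE & LIVING THINGS",
    "M": "MOVEMENT, LOCATION, TRAVEL, TRANSPORT",
    "N": "NUMBERS & MEASUREMENT",
    "O": "SUBSTANCES, MATERIALS, OBJECTS, EQUIPMENT",
    "P": "EDUCATION",
    "Q": "LANGUAGE & COMMUNICATION",
    "S": "SOCIAL ACTIONS, STATES & PROCESSES",
    "T": "TIME",
    "W": "WORLD & ENVIRONMENT",
    "X": "PSYCHOLOGICAL ACTIONS, STATES & PROCESSES",
    "Y": "SCIENCE & TECHNOLOGY",
    "Z": "NAMES & GRAMMAR",
}


def top_level_category(tag):
    """Name of the top-level category a tag belongs to."""
    return TOP_LEVEL[tag[0].upper()]


def ancestors(tag):
    """Every level of the hierarchy from the letter down to the tag."""
    letter, rest = tag[0], tag[1:]
    levels = [letter]
    if rest:
        parts = rest.split(".")
        for i in range(1, len(parts) + 1):
            levels.append(letter + ".".join(parts[:i]))
    return levels


print(top_level_category("G1.2"), ancestors("G1.2"))
```

Grouping words by any of these levels lets an analysis move between broad fields and finer distinctions.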
WMatrix
Allows analysis of texts at:
– the word level
– the grammatical level (POS)
– and the semantic level
WMatrix
Allows text comparison at:
– the word level
– the grammatical level (POS)
– and the semantic level
WMatrix
Keyness
• Word level – Key-words
• Grammatical level – Key-POS
• Semantic level – Key-concepts