Slide 1 - School of Liberal Arts


Researching ESP Corpora:
Issues in compilation and analysis
Lynne Flowerdew
Outline
Compilation
• Size
• Representativeness
• Balance
Analysis and interpretation
• Units for linguistic analysis
• Top-down vs. bottom-up analysis
• Role of context in interpretation of corpus data
2
Compilation
Size
Commonly held view − the larger the better
“…a corpus should be as large as possible and
keep on growing” (Sinclair 1991: 18)
“…it is important to have a substantial corpus if
you want to make claims based on statistical
frequency” (Bowker & Pearson 2002: 48)
3
Compilation
But size of corpus
• highly dependent on phenomenon one is investigating
(de Haan 1992)
• The lower the frequency of the feature under investigation,
the larger the corpus needed (McEnery & Wilson 2001: 154)
• Smaller corpora can be used for investigating more
common features of language (Biber 1990)
• Different picture for ESP corpora (see Flowerdew 2004;
Hunston 2009; Koester 2010 for pointers on building small,
specialised corpora)
4
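The link between feature frequency and corpus size can be made concrete with simple arithmetic. The sketch below is illustrative only: the function names and the example rates are assumptions, not figures from the cited studies.

```python
import math


def expected_hits(rate_per_million: float, corpus_tokens: int) -> float:
    """Expected number of occurrences of a feature in a corpus of a given size."""
    return rate_per_million * corpus_tokens / 1_000_000


def tokens_needed(rate_per_million: float, min_hits: int) -> int:
    """Smallest corpus size (in tokens) expected to yield at least min_hits occurrences."""
    return math.ceil(min_hits * 1_000_000 / rate_per_million)


# Hypothetical rates: a common feature (500 per million words)
# versus a rare one (5 per million words), each needing 50 examples.
common_feature_size = tokens_needed(500, 50)
rare_feature_size = tokens_needed(5, 50)
```

Under these assumed rates, the rare feature needs a corpus a hundred times larger than the common one to yield the same number of examples.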
Compilation
General vs. ESP corpora (Sinclair 2005: 16)

                                          LOB      HK of CS   %
Number of different word forms (types)    69990    27210      39%
Number that occur once only               36796    11430      31%
Number that occur twice only              9890     3837       39%
Twenty times or more                      4750     3811       80%
200 times or more                         471      687        (69%)
5
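The percentage column in the Sinclair comparison can be checked with a few lines of Python. This is a sketch under one assumption: that each percentage is the smaller count expressed as a proportion of the larger one, which would explain why the last figure (where the specialised corpus has the higher count) is bracketed.

```python
# Counts transcribed from the LOB / HK of CS comparison (Sinclair 2005: 16).
rows = {
    "types": (69990, 27210),
    "once only": (36796, 11430),
    "twice only": (9890, 3837),
    "20 times or more": (4750, 3811),
    "200 times or more": (471, 687),
}


def ratio_percent(lob: int, hk: int) -> int:
    """Smaller count as a rounded percentage of the larger count."""
    small, large = sorted((lob, hk))
    return round(100 * small / large)


percentages = {label: ratio_percent(lob, hk) for label, (lob, hk) in rows.items()}
```

The pattern in the figures is that the specialised corpus has far fewer types and once-only words, but nearly as many words occurring twenty times or more, and more words occurring 200 times or more.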
Compilation
Representativeness
• Specialised corpora do not exhibit as much internal
variation as general corpora
• The greater the variation in the corpus texts, the more samples
and the larger the corpus required to ensure representativeness
(Meyer 2002)
• “We should always bear in mind that the assumption of
representativeness must be regarded as an act of faith,
as at present we have no means of ensuring it, or even
evaluating it objectively” (Tognini-Bonelli 2001: 57)
6
Compilation
Corpus of EIA (Environmental Impact Assessment)
reports
• 60 reports, approx. 225,000 words
• Selected on the basis that they represent 23 different
environmental consulting companies
• Impossible to select an equal number of reports from each
of the companies; “convenience sampling” (Meyer 2002)
• The larger the company, the more reports catalogued in the
library; distribution seen as reflecting size and importance
of company
7
Compilation
Corpus of EIA reports
Balancedness
• Balanced corpus would consist of the same
amount of text from each of the 23 companies
• If EIA reports from different companies were of
different lengths then balancing the corpus in
terms of number of texts would lead to an
imbalance in terms of number of tokens
8
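The tension between balancing by texts and balancing by tokens can be shown with a toy calculation. The company names and report lengths below are hypothetical, chosen only to illustrate the point; they are not taken from the EIA corpus.

```python
# Hypothetical report lengths in tokens: two reports per company.
reports = {
    "company_a": [2000, 2500],
    "company_b": [6000, 7000],
    "company_c": [3000, 3500],
}


def share_by_texts(corpus):
    """Each company's share of the corpus measured in number of texts."""
    total = sum(len(texts) for texts in corpus.values())
    return {name: len(texts) / total for name, texts in corpus.items()}


def share_by_tokens(corpus):
    """Each company's share of the corpus measured in number of tokens."""
    total = sum(sum(texts) for texts in corpus.values())
    return {name: sum(texts) / total for name, texts in corpus.items()}


text_shares = share_by_texts(reports)
token_shares = share_by_tokens(reports)
```

Here every company contributes one third of the texts, yet company_b supplies more than half of the tokens because its reports are longer.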
Compilation
Pragmatic considerations
• Size balanced against level of delicacy of
investigation (Kennedy 1998)
• My investigation is primarily qualitative
(phraseologies of keywords for the Problem-Solution (P-S) pattern)
• Investigation is of key vocabulary items –
225,000 words deemed sufficient
9
Analysis
Units for linguistic analysis
• Frequency (Kennedy 1998)
• Keywords (Bondi 2001; Flowerdew 2008; Mudraya
2006; Nelson 2006)
• Lexical bundles (Biber et al. 2004; Hyland 2007, 2008)
• Corpus set up lexically rather than grammatically
(Halliday 2004)
10
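Lexical bundles, one of the units listed above, are recurrent word sequences that meet a frequency threshold and occur across several texts. The following is a minimal sketch of that idea, not the procedure of any cited study: the thresholds, the whitespace tokenisation, and the sample sentences are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-grams that meet a frequency threshold and a range (dispersion) threshold."""
    freq = Counter()
    text_range = defaultdict(set)
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            text_range[gram].add(text_id)
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and len(text_range[gram]) >= min_texts
    }


# Invented sample texts, for illustration only.
sample = [
    "the noise impact on the other hand is low",
    "on the other hand the dust impact is high",
    "on the other side of the road",
]
bundles = lexical_bundles(sample, n=4)
```

With real corpora the frequency cut-off would normally be normalised (for example per million words) rather than given as a raw count.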
Analysis
Comparison of ESP corpora with Coxhead’s
AWL
• Disciplinary differences
Hyland & Tse 2007; Chen & Ge 2007;
Martinez et al. 2009
• Common core of academic vocabulary (AWL)
Paquot 2010; Simpson-Vlach & Ellis 2010
11
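Comparisons of ESP corpora with a word list such as the AWL usually rest on a coverage figure: the proportion of running words in a corpus that belong to the list. A rough sketch follows. The word list here is a three-word stand-in, not the actual AWL, and matching is by exact word form rather than by word family, which is a simplification.

```python
def coverage(tokens, word_list):
    """Proportion of running words (tokens) that belong to the word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in word_list)
    return hits / len(tokens)


# Stand-in list for illustration; NOT the real Academic Word List.
toy_list = {"analysis", "data", "method"}

# Invented mini-corpora representing two hypothetical disciplines.
corpus_a = "the data support the analysis".split()
corpus_b = "the noise barrier reduces the impact".split()

coverage_a = coverage(corpus_a, toy_list)
coverage_b = coverage(corpus_b, toy_list)
```

Differences in such coverage figures across disciplines are the kind of evidence the studies listed above draw on when arguing for or against a common core.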
Analysis
Corpus analysis driven by type of software
• WMatrix (Rayson 2008)
– Classifies vocabulary into semantic fields (Ali Mohamed
2007)
• ConcGram (Greaves 2009)
– Finds sets of words that co-occur (e.g. AB; A*B), allowing
up to 12 slots for constituency variation
– Searches for positional variation (e.g. AB; BA)
– Only a few studies (Cheng 2009; Durrant 2009; Milizia &
Spinzi 2009; Warren 2011)
12
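The two kinds of variation mentioned for ConcGram, constituency (AB; A*B) and positional (AB; BA), can be illustrated with a simple window search. This is not ConcGram's own algorithm or interface; the function name, the `max_gap` parameter, and the sample sentence are assumptions made for the sketch.

```python
def cooccurrences(tokens, a, b, max_gap=3):
    """Find a and b with at most max_gap intervening words, in either order."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != a:
            continue
        lo = max(0, i - max_gap - 1)
        hi = min(len(tokens), i + max_gap + 2)
        for j in range(lo, hi):
            if j != i and tokens[j] == b:
                start, end = min(i, j), max(i, j)
                hits.append(" ".join(tokens[start:end + 1]))
    return hits


# Invented sentence containing both orders of the pair.
tokens = "a noise problem arises and the problem of noise remains".split()
pairs = cooccurrences(tokens, "problem", "noise", max_gap=1)
```

The search retrieves both "noise problem" (BA, adjacent) and "problem of noise" (AB with one intervening word), which a fixed-order, fixed-gap query would treat as unrelated.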
Analysis
My corpus of EIA reports
• WordSmith Tools for keyword extraction (Scott 1999)
• Then manually classified lexico-grammar of keywords
into causal / non-causal categories
1. The export scheme will create a noise problem
2. In order to alleviate the problem of noise…
3. Severe traffic noise problems already exist in…
• WMatrix for automatic identification of causal categories &
ConcGram for positional variation (e.g. problem of
noise)
13
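Keyword extraction compares a word's frequency in the study corpus with its frequency in a reference corpus. One statistic commonly used for this is log-likelihood; whether it was the setting used for the EIA study is not stated here, so the sketch below is an assumption about method, and the frequencies and the reference corpus size are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word: study corpus vs. reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: a word occurring 400 times in a 225,000-word study corpus
# and 100 times in a 1,000,000-word reference corpus.
keyness = log_likelihood(400, 225_000, 100, 1_000_000)
```

A word with the same relative frequency in both corpora scores zero; the larger the score, the stronger the evidence that the word is unusually frequent (or infrequent) in the study corpus.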
Analysis
• Top-down vs. bottom-up
In the ‘top-down’ approach, the functional components
of a genre are determined first and then all texts in a
corpus are analysed in terms of these components. In
contrast, textual components emerge from the corpus
analysis in the ‘bottom-up’ approach, and the discourse
organization of individual texts is then analysed in terms
of linguistically-defined textual categories.
(Biber, Connor & Upton 2007a: 11)
14
Analysis
Bottom-up starting point
• Phraseology of preposition “in” in cancer
research articles (Gledhill 2000)
• Politeness strategies in two moves in job
application letters (Upton & Connor 2001)
• Verb-noun collocations in 4 moves in law cases
(Bhatia et al. 2004)
• Phraseology of “research” in moves in PhD
literature reviews (J. Flowerdew & Forest 2010)
15
Analysis
Top-down starting point
• Kanoksilapatham’s (2007) corpus study of biochemistry
research articles; first developed analytical framework
through identifying moves
• In reality, many studies toggle between the two (Charles
2006)
• Different starting points yield different results (Biber,
Connor & Upton 2007b)
16
Analysis
Corpus of EIA reports
• Devised a coding system to account for 3 different
levels of text
– Macrostructure (Intro., Body, Concl.)
– Problem – Solution elements
– Discourse-based moves (e.g. <obj>; <need>)
• Different phraseologies for different sections
– …to assess in detail the environmental impacts of
…<obj>
– ..in order to reduce potential noise impacts. <prso>
17
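Once a corpus carries move tags such as <obj> and <prso>, phraseologies can be grouped by tag automatically. The sketch below assumes that each tag closes the stretch of text it labels, as the two examples above suggest; the actual coding scheme may differ, and the function name and sample string are illustrative.

```python
import re
from collections import defaultdict

# A segment is any run of text without angle brackets, followed by a <tag>.
TAGGED_SEGMENT = re.compile(r"([^<>]+)<(\w+)>")


def group_by_move(tagged_text):
    """Map each move tag to the text segments that immediately precede it."""
    groups = defaultdict(list)
    for segment, tag in TAGGED_SEGMENT.findall(tagged_text):
        groups[tag].append(segment.strip())
    return dict(groups)


# Sample adapted from the two tagged examples on the slide.
sample = (
    "to assess in detail the environmental impacts <obj> "
    "in order to reduce potential noise impacts. <prso>"
)
moves = group_by_move(sample)
```

Grouping segments this way makes it straightforward to run frequency or concordance queries separately for each move.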
Role of Context in Interpretation
Genre perspective
• Goal-driven communicative event associated with
particular discourse communities and disciplines
• Handford (2010a) asks “how can we relate the specific
instance (such as text, discourse move or lexico-grammatical
item) to the wider social context within which it occurs …”
• Is it possible to interpret the corpus data as a reflection
of the context, or conversely, is it possible to rely on
contextual features for interpretation of the corpus data?
(Flowerdew 2011)
18
Interpretation
• Stubbs (2001a, 2001b) argues that the conventional view
that the meanings of context-sensitive pragmatic markers are
usually inferred by speaker / hearer may be overstated;
large-scale corpus studies show pragmatic meanings
can be conventionally encoded in linguistic form
• Tognini-Bonelli (2004) considers it possible to “read off”
discursive practices of a discourse community from
recurring multiple concordance lines.
19
Interpretation
Corpus of EIA reports
1. The problems associated with continued
pollution…
2. Health hazards associated with proximity to
high tension power lines…
3. It is expected that there will be no significant
residual impacts…
4. Works at the tunnel portal will create a noise
problem…
20
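Recurring patterns such as "associated with" are normally inspected as concordance (KWIC) lines with the node aligned in a centre column. The following is a bare-bones sketch, assuming simple substring matching rather than the tokenised search a dedicated concordancer would use; the function name and `width` parameter are illustrative.

```python
def kwic(lines, node, width=30):
    """Return keyword-in-context lines with the node aligned in a centre column."""
    out = []
    for line in lines:
        lower = line.lower()
        start = lower.find(node.lower())
        while start != -1:
            end = start + len(node)
            left = line[max(0, start - width):start]
            right = line[end:end + width]
            # Right-align the left context so every node starts in the same column.
            out.append(f"{left:>{width}} {line[start:end]} {right}")
            start = lower.find(node.lower(), end)
    return out


# Examples 1 and 2 from the EIA corpus slide.
examples = [
    "The problems associated with continued pollution…",
    "Health hazards associated with proximity to high tension power lines…",
]
concordance = kwic(examples, "associated with")
```

Reading down the aligned column makes it easier to see that the items to the left of the node (problems, health hazards) share a negative evaluation.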
Interpretation
Discursive practices vs. strategies (Handford
2010b)
• Discursive practices: signify recurrent patterns of
linguistic behaviour and “tie the communication to the
wider social context”
• Strategies: “merely describe what the individual is trying
to achieve within the particular speech event”
• Widdowson (2004: 60) points out that it is difficult to assign
pragmatic significance to phraseologies in one particular
text.
21
Interpretation
• Interpret data related to strategies with reference not
only to other co-textual features but also to external
contextual information.
• Ethnographic perspective sometimes needed for
interpretation of context-dependent pragmatically
oriented features
• Widdowson (2000: 60) remarks that corpus-based
methods focus on the text as product and ‘cannot
account for complex interplay of linguistic and
contextual factors whereby discourse is enacted’.
22
Conclusion
• No “tailor-made” corpora for teaching (Leech
2008); no “perfect” corpora for research
• Corpus linguistic techniques are one of several
approaches (ethnographic dimension)
• Corpora are now being used in other applied
linguistics areas: textlinguistics, genre analysis,
CDA, sociolinguistics, SLA (Flowerdew in press,
2011a, b)
23
Thank You!
24