Corpus Linguistics and Language Development: Research
Tracking Learning:
Using Corpus Linguistics to Assess
Language Development
James Lantolf
Steve Thorne
CALPER
Center for Advanced Language Proficiency Education and
Research
The Pennsylvania State University
Tracking Learning: Approaches to Assessment
Traditional Classroom Assessment
Standardized Tests
AP, TOEFL, OPI, STAMP
Alternative Assessment
Achievement, Placement, Formative
Portfolio & LinguaFolio
Performance Assessment, Task-Based
CALPER Assessment
Dynamic Assessment
Corpus-Informed Assessment
Today’s Talk
What is a corpus?
Types of corpora
Corpus-informed assessment
o Developmental learner corpora
o Contrastive learner corpus analysis against baseline
Two examples of corpus-informed assessment
Advanced ESL academic discourse competence
German modal particles
What is a Corpus?
A corpus (plural corpora)
Large collection of texts
Gathered according to specific criteria
Stored in an electronic database with relevant meta-data
associated with each text entry
Student ID
Time/date
Activity type
Corpora can be constructed from written language use
(especially digital texts) or transcribed from spoken
interaction
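As a rough illustration of the definition above, a corpus entry can be modelled as a text plus its meta-data, and a sub-corpus as a filter over that meta-data. This is only a minimal sketch: the class name, field names, and sample texts are hypothetical and are not taken from any CALPER tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CorpusEntry:
    """One text in the corpus plus the meta-data named on the slide."""
    student_id: str
    timestamp: datetime
    activity_type: str
    text: str


def sub_corpus(entries, *, student_id=None, activity_type=None):
    """Return entries matching every meta-data value that was supplied."""
    selected = []
    for entry in entries:
        if student_id is not None and entry.student_id != student_id:
            continue
        if activity_type is not None and entry.activity_type != activity_type:
            continue
        selected.append(entry)
    return selected


# Invented sample data, for illustration only.
corpus = [
    CorpusEntry("s01", datetime(2005, 9, 12, 10, 30), "chat",
                "you might wanna think about it"),
    CorpusEntry("s01", datetime(2005, 9, 19, 14, 0), "email",
                "please read the chapter"),
    CorpusEntry("s02", datetime(2005, 9, 12, 10, 31), "chat",
                "you should read it"),
]

chat_only = sub_corpus(corpus, activity_type="chat")
```

Keeping the meta-data on every entry is what makes it possible to slice the same data by learner, by time, or by activity later on.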
Basic Tenets of Corpus Analysis
Data driven, highly empirical
Objective approach
A grammar of use based on attested utterance types
A grammar of probability based on frequency and
distribution
Language use and structure:
Collocational patterns
Lexicon heart of systematicity in language, i.e., grammar
Formulaic sequences comprise ~60% of language use (Wray, 2002;
Schmitt & Carter, 2004)
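One simple way to look for recurring word sequences is to count contiguous n-grams across a set of utterances. The sketch below assumes whitespace tokenization and uses invented sample utterances; real collocation work would normally add association measures on top of raw counts.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count every contiguous n-token sequence in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample utterances, lower-cased and split on whitespace.
utterances = [
    "you might wanna think about it",
    "you might wanna look at this",
    "you should think about it",
]

bigrams = Counter()
for utterance in utterances:
    bigrams.update(ngram_counts(utterance.split(), n=2))
```

Here `bigrams["you might"]` and `bigrams["think about"]` both count two occurrences, which is the kind of frequency evidence a grammar of probability starts from.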
Corpora & Language Assessment
For advanced proficiency -- develop and/or utilize genre,
modality, and context-specific corpora
Focus can be on grammatical, lexical, metaphoric, discourse,
pragmatic features
Typical problems and errors of usage can be found in learner
data
Teachers and learners themselves can observe and assess their
own and one another’s performance
Expert-speaker corpora can reveal what learners are not
using/doing, as well as how appropriately, successfully, and
differentially they are using the target language
Comparing Assessment Approaches
Testing
1. Elicited performance indicative of competence
2. “Authenticity” and/or ecological validity of test instrument
3. Sampling issues
4. Reliability
5. Critical question: Is the elicited performance representative of the individual’s state of language development?
Corpus-based
1. Naturally-occurring language performance indicates competence
2. Volume of language learners produce across tasks/genres and time
3. Sampling issues become irrelevant
4. Reconceptualize reliability
5. Critical question: Have enough data been collected to conclude that an individual’s performance is representative of her state of language development?
ITA Project
Describing, assessing, and
developing academic discourse with
international teaching assistants
Steve Thorne
Jonathan Reinhardt
Paula Golombek
ITAcorp Project
ITAs highly competent researchers
Expand repertoire of options for performing often complex
social roles (instructor, adjudicator, tutor, advisor, fellow
student, mediator)
Assessment --> Contrastive corpus analysis of ITACorp with
baseline corpus -- MICASE
Grammar as choice as it relates to meaning and social actions
Formulaic sequences, small words, modulation
Corpus-informed pedagogical intervention to prepare
students to participate successfully in spoken and written
genres of academic discourse
Methodology
Contrastive corpus analysis of MICASE and ITAcorp -->
what are the differences in language use between
expert/native and ITA/advanced ESL speakers?
Identified directive and obligative constructions
Quantified usage of directive language in both corpora
The case of wanna / want to
Corpus Assessment: Time 1
The case of “you want to” | “you could …”
Please + [imperative]
word/phrase             ITACorp/ESL   rate per 10K   MICASE   rate per 10K   ratio of over/underuse
you want to             0             0.00           228      12.71          0.0879
please (+ imperative)   82            9.16           0        0              ---
total words             89489                        179446
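The rates per 10K and the over/underuse ratio in this table are simple normalizations, sketched below. The function names are hypothetical. The optional `zero_floor` argument is an assumption: the slides do not state how zero counts were handled, but substituting a count of 0.01 for zero happens to reproduce the large ratios reported for zero-count rows in the longer directive-language table later in the talk.

```python
def per_10k(count, total_words):
    """Normalized frequency: occurrences per 10,000 words."""
    return count / total_words * 10_000


def use_ratio(count_a, total_a, count_b, total_b, zero_floor=None):
    """Ratio of corpus A's rate to corpus B's rate.

    With zero_floor=None, a zero count in B yields None (no ratio).
    Passing a small zero_floor substitutes it for zero counts; this is an
    assumption inferred from the reported figures, not a documented method.
    """
    if zero_floor is not None:
        count_a = count_a or zero_floor
        count_b = count_b or zero_floor
    if count_b == 0:
        return None
    return per_10k(count_a, total_a) / per_10k(count_b, total_b)


ITACORP_WORDS = 89_489   # total words, from the slide
MICASE_WORDS = 179_446   # total words, from the slide

micase_you_want_to = per_10k(228, MICASE_WORDS)
itacorp_please = per_10k(82, ITACORP_WORDS)
```

A ratio above 1 indicates relative overuse by the ITAs and a ratio below 1 indicates underuse; returning `None` for a zero baseline mirrors the "---" entry in the table.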
Corpus Assessment: Time 2
Post intervention usage of “you want to”
10 instances of usage across 25 advanced ESL students
Concordance lines of proceduralized usage in context
you   also    wanna     say,         what,
you   also    want to   i            mean
you   just    wanna     get          a
you   just    wanna     draw         out
you   just    wanna     think        about
you   may     wanna     think        about
you   maybe   wanna     highlight,   why
you   might   wanna     summarize    here
you   do      wanna     make,        but
you   might   wanna     think        about,
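Concordance lines like these can be pulled from transcripts with a pattern search. The sketch below is a minimal one, not the tool used in the project: the hedge list is copied from the concordance lines in this talk and is illustrative rather than exhaustive, and the function name and sample utterances are invented.

```python
import re

# Hedges taken from the concordance lines on the slides; the list is
# illustrative, not exhaustive.
HEDGES = ("also", "do", "just", "may", "maybe", "might",
          "probably", "really", "would")

HEDGED_WANT = re.compile(
    r"\byou,?\s+(?P<hedge>" + "|".join(HEDGES) + r")\s+(?P<verb>wanna|want to)\b",
    re.IGNORECASE,
)


def hedged_want_lines(utterances, context_words=2):
    """Return (hedge, verb, right context) for each hedged 'you ... want to'."""
    lines = []
    for utterance in utterances:
        for match in HEDGED_WANT.finditer(utterance):
            right = utterance[match.end():].split()[:context_words]
            lines.append((match.group("hedge").lower(),
                          match.group("verb").lower(),
                          " ".join(right)))
    return lines


# Invented sample utterances, for illustration only.
samples = [
    "so you might wanna think about this part",
    "you want to stop here",
    "You just want to draw out the idea",
    "you maybe wanna highlight, why",
]
found = hedged_want_lines(samples)
```

The unhedged "you want to stop here" is skipped on purpose, since the pattern only targets the hedged construction.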
Corpus Assessment
Corpus-informed Assessment and Materials
Development: German Modal Particles
Nina Vyatkina
Teaching the MPs: Challenges
Modal Particles: ja, doch, denn, mal
Rampant polysemy in MPs
Strongly context-bound meaning
Absence of a direct counterpart in English (translated by
tag questions, intonation, omitted)
Absence of an informal “particle-friendly climate” in
traditional language classrooms
Overly formal treatment in textbooks
Sentence-based rather than utterance-based [interactive]
Participants
7 American students and 16 German students discussing
intercultural topics in German and in English using email and
chat during 8 semester weeks (Fall 2005)
German Modal Particles
German modal particles: indeclinable “smallwords”
typical of conversations
‘The German listener expects a particle. If it is
absent, the sentence acquires a specific stylistic
value: without a particle it sounds choppy, harsh,
unfriendly, its utterance is apodictic, abrupt,
blatantly noncommittal.’ (Weydt, 1969)
Pedagogical intervention
QUAN & QUAL analysis -> Data-driven instruction -> CMC practice -> Data-driven instruction -> CMC practice -> QUAN & QUAL analysis
Classroom intracultural sessions:
explicit instruction based on the
data produced by the
participants in
Internet-mediated intercultural
sessions: practice in language
use in CMC with native speakers
(Belz, 2006)
Relative frequency: modality/intervention effect
[Figure: mean relative frequency of MPs (axis 0 to 15) for Amer and Ger participants across chat/pre*, email/pre*, chat/post, email/post]
* Statistically significant difference in mean relative frequencies (no. MPs/1000 German words), p<.05
MP Dispersion in the corpus
[Figure: the MPs ja, denn, doch, mal, numbered 1 to 4, shown separately for Learners and NSs]
MP use by NSs and learners (absolute
numbers)
Stages                  NSs   Learners: Accurate use   Learners: Inaccurate use
Pre-Interv. (4 weeks)   89    3                        0
Interv. W1                    7                        2
Interv. W2                    6                        0
Interv. W3                    27                       3
Post-Int. W4                  22                       1
Total                         65                       6
Total Post-Interv. (NSs): 80
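The learner counts in this table can be turned into an accuracy share per stage, which makes the stages easier to compare. This is only a sketch: the pairing of counts with stages follows the order in which they appear on the slide, the variable names are hypothetical, and the counts are absolute numbers that are not normalized by how much each learner wrote.

```python
# (accurate, inaccurate) learner uses of MPs per stage, from the slide.
LEARNER_MP_USE = {
    "Pre-Interv. (4 weeks)": (3, 0),
    "Interv. W1": (7, 2),
    "Interv. W2": (6, 0),
    "Interv. W3": (27, 3),
    "Post-Int. W4": (22, 1),
}


def accuracy(accurate, inaccurate):
    """Share of uses judged accurate; None when there were no uses."""
    total = accurate + inaccurate
    return accurate / total if total else None


by_stage = {stage: accuracy(*counts) for stage, counts in LEARNER_MP_USE.items()}
total_accurate = sum(a for a, _ in LEARNER_MP_USE.values())
total_inaccurate = sum(i for _, i in LEARNER_MP_USE.values())
```

The sums come to 65 accurate and 6 inaccurate uses, matching the totals shown on the slide.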
Corpus-informed Assessment:
Conclusions, Questions, & Resources
Representativeness and ecological validity?
Assemble corpus data to adequately and significantly represent
production
Use benchmark corpora for assessing learner language successes and
problems
Developmental corpus assessment of individuals and class-cohorts
CALPER materials:
Corpus tutorial -- see calper.la.psu.edu
INVESTIGATING REAL LANGUAGE -- June 25-27, 2007
DYNAMIC ASSESSMENT workshop June 25-27, 2007
CALPER Corpus Tool available Summer, 2007
Thanks -- please visit our website for more
information on CALPER materials, events, and
services:
http://calper.la.psu.edu
Challenges to Corpus Approaches
One data source among many: ethnographic details, visual field,
introspection, clinical and experimental elicitation
Descriptive not explanatory
Focus on externalized language use / performance –
psycholinguistics and language processing inferred
Corpora are “real” (representation of actual use), but are they
“authentic” (meaningful and applicable to learners, e.g., Widdowson,
2002)
Only as good as its representativeness
Harkening back to contrastive error analysis? No, contrastive
analysis of actual use that does not need to include incapacity
evaluations of learners
Types of Corpora & Analyses
Synchronic
Descriptive Benchmark (BNC, ANC)
Frequencies, ratios
Diachronic
Historical Corpora (Davies)
Frequencies, ratios
Genre/Register/Variation
(Biber, Swales, Sinclair)
Factor & cluster analyses
Youmans’ Vocabulary Management Profiling
Mutual information
Learner Corpora (Granger)
Contrastive IL Analyses
Frequencies, ratios
Developmental Learner Corpora
(Myles, Payne, Belz, Thorne et al.)
Frequencies, ratios, ???
Corpus Design and Construction
Synchronic
Descriptive Benchmark (BNC, ANC)
Frequencies, ratios
Genre/Register/Variation
(Biber, Swales, Sinclair)
Factor & cluster analyses
Youmans’ Vocabulary Management Profiling
Mutual information
Learner Corpora (Granger)
Contrastive IL Analyses
Frequencies, ratios
Aggregative
Genre, register
Meta-data:
Situational context
Activity
Level of proficiency
Corpus Design and Construction
Diachronic
Role of meta-data:
Individual
Task
Time
Corpus construction as
a form of experimental
research
Historical Corpora (Davies)
Frequencies, ratios
Developmental Learner Corpora
(Myles, Payne, Belz, Thorne et al.)
Frequencies, ratios, ???
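With individual, task, and time stored as meta-data, a developmental learner corpus can be read as a set of per-learner trajectories. The sketch below tracks the rate of one target form per learner and week; the record layout, function name, and sample data are all hypothetical, and matching on surface word forms is a simplification that ignores polysemy.

```python
from collections import defaultdict


def trajectory(records, target):
    """Rate of `target` per 1,000 tokens for each (learner, week).

    `records` is an iterable of (learner_id, week, text) tuples.
    """
    hits = defaultdict(int)
    tokens = defaultdict(int)
    for learner_id, week, text in records:
        words = text.lower().split()
        key = (learner_id, week)
        tokens[key] += len(words)
        hits[key] += sum(1 for word in words if word.strip(".,!?") == target)
    return {key: hits[key] / tokens[key] * 1000
            for key in sorted(tokens) if tokens[key]}


# Invented sample records, for illustration only.
records = [
    ("s01", 1, "you should read it"),
    ("s01", 2, "you might wanna read it."),
    ("s02", 1, "you wanna stop"),
]
rates = trajectory(records, "wanna")
```

Plotting such rates over time is one way to look for the curvilinear patterns that a single test administration cannot show.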
Corpus Annotation
Frequency and location of tags
Laughter for hyperbole
Language use as social action
Part-of-speech
Lemmatization
Syntactic tagging
Error tagging
Semantic tagging
Corpus Informing Language Theory
Not only what is possible (e.g., nativist and UG approaches), but
what is likely or frequent in usage
Illustrates the limits of introspection about language (enormous
differences between intuition and actual use)
Language structure, i.e., formulaic sequences comprise ~60% of
language use (Wray, 2002; Schmitt & Carter, 2004)
Emergent grammar
(Hopper, 2002; Bybee, 2001)
Grammar a consequence, not a precondition -- epiphenomenal
Grammar = observable repetition in discourse
Grammar contingent upon lexical environment
“Grammar contracts as text expands” --> fragments and repertoires
Revisioning Ellipsis
Speakers add features as necessary rather than taking away
from what would be required in written discourse (see also
Wittgenstein, 1953; Rommetveit, 1974)
Omission of auxiliaries is common (be, have, do) but not often from
speaker’s or 1st person perspective
Empty “it” and existential “there is” are often dropped in spoken
discourse
Pronouns before modal verbs are often dropped, e.g., “can happen,” “should be”
Overall, beginning bits are left out
Grammatical description SHOULD represent spoken language
use, should relate items and structures to interactional and
situational functions
Importance of Measuring &
Understanding Process
Alfred Binet (1909) advocated process assessment,
though never designed an instrument to measure
it.
Buckingham (1921): accounting for learning
processes is as important as accounting for products.
Challenges of Assessing Process
Feasibility
“the most direct procedure for determining an
individual’s proficiency…would simply be to follow
that individual surreptitiously over an extended period
of time…It is clearly impossible, or at least highly
impractical, to administer a ‘test’ of this type in the
language learning situation” (Clark 1978, as quoted in
Bachman, 1990).
Scalability - the bane of “alternative” assessment
Depicting Process in SLA
Accuracy of production of L2 forms and IL development
suggests a curvilinear rather than a linear relationship
(Norris & Ortega, 2003).
Threshold and stage effects (Meisel, Clahsen & Pienemann,
1991).
U-shaped behavior (Kellerman, 1985)
Omega-shaped behavior - temporary increase in frequency
followed by a normalization (Wolfe-Quintero, Inagaki, &
Kim, 1998).
Using Corpus to Assess IL
Development
Addressing feasibility and scalability
Proliferation of technology-mediated language learning
More powerful computers and more refined software.
Automated speech recognition - “dirty” ASR
“Complementary” Assessment
Use testing techniques (traditional or performance)
in conjunction with corpus-based assessment to
generate a more detailed and broad-based account
of IL development.
Academic Discourse Performance
ITAs’ success as instructors and future faculty
depends on successful participation in written and
spoken academic discourse
e.g. spoken genres:
•small lecture presentation
•large lecture presentation
•discussion leading
•lab section leading
•seminar leading
•advising
•colloquia participation
•interviewing
•meeting participation
•office hours conducting
•service encounters
•tutorial leading
•socializing
•conference presentation
The ITA “problem”
Jan 2005: North Dakota proposed legislation: a bill would
have forced universities to reimburse class fees to students
who complained about an instructor’s English ability. If ten
percent or more of the students had complained, the instructor
would have been relieved from teaching pending further
review. A watered-down version of the bill passed.
High number of international graduate students in the U.S.
-- 50% of US graduate students in engineering and
sciences are international
Directive Language
DL is language with directive illocutionary force (Searle, 1979)
used functionally for making suggestions or giving advice
In traditional frameworks, DL has primarily deontic qualities of
obligative modality
In textbooks, DL is taught as a series of modals & semi-modals
(must, mustn’t, have to, should, ought to, need to, needn’t)
In SYS-FUNC, DL would be considered part of the
MODULATION system, a continuum between obligation
(what I want you to do) and inclination (what you want to do)
Why Study Directive Language?
DL is an important part of several academic discourse
genres and professional competence
Inappropriate or unintended use of DL may result in
miscommunication or misunderstanding of speaker
intention
DL is highly interpersonal, involving speaker authority and
power hierarchies
Research
Contrastive genre-comparable spoken corpora
ITAcorp (ITA language use): office hours role plays (CMC,
presentation, post-evaluation)-- approx. 120,000 tokens
MICASE (base-line ‘expert’ corpus): Advising and Office
Hours sub-corpora--180,000 tokens
MICASE data as model
Analytical framework:
Corpus: usage-based, frequency & distribution
Qualitative: (professional) discourse analysis, SYS-FUNC &
APPRAISAL
Preliminary Contrastive Analysis of wanna / want to
construction
      word/phrase             ITAcorp   rate per 10000   MICASE   rate per 10000   rate of over/underuse
1.    please (+ imperative)   82        9.16             0        0.00             16442.8835
2.    how about               24        2.68             0        0.00             4812.5513
3.    allowed                 6         0.67             0        0.00             1203.1378**
4.    you had better          6         0.67             0        0.00             1203.1378**
5.    don't worry             24        2.68             4        0.22             12.0314
6.    I/my suggest/ion        30        3.35             5        0.28             12.0314
7.    you must                8         0.89             2        0.11             8.0209**
8.    let me                  63        7.04             32       1.78             3.9478
9.    I recommend             13        1.45             7        0.39             3.7240**
10.   required                7         0.78             5        0.28             2.8073**
11.   you should              94        10.50            83       4.63             2.2710
12.   don't (+ imperative)    42        4.69             38       2.12             2.2163
13.   you can*                459       51.29            502      27.97            1.8335
14.   I want you              8         0.89             12       0.67             1.3368**
15.   why don't you           5         0.56             12       0.67             0.8355**
16.   you have to             51        5.70             123      6.85             0.8314
17.   let's                   41        4.58             118      6.58             0.6967
18.   you need to             50        5.59             147      8.19             0.6821
19.   you could*              14        1.56             134      7.47             0.2095
20.   you want to             10        1.12             228      12.71            0.0879
21.   you('ve) got to         0         0.00             19       1.06             0.0011
22.   I would                 0         0.00             64       3.57             0.0003
      total words             89489                      179446
* epistemic and dynamic uses not disambiguated
** too few tokens for likely significance
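The slides do not say which test lay behind the "too few tokens for likely significance" note. One common choice for comparing a form's frequency across two corpora is the two-term log-likelihood statistic (G2), sketched here as an assumption rather than as the authors' method. The function name is hypothetical; the counts in the usage note come from the table.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one item's frequency in two corpora."""
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the item were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # 0 * ln(0) is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2


CRITICAL_P05 = 3.84  # chi-square critical value, 1 degree of freedom, p < .05

you_want_to = log_likelihood(10, 89_489, 228, 179_446)
i_want_you = log_likelihood(8, 89_489, 12, 179_446)
```

Under this test, "you want to" (10 vs. 228) clears the p < .05 threshold comfortably, while "I want you" (8 vs. 12) does not, which is in line with the double-asterisk flag on that row.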
You [+ hedge] want to / wanna [+ hedge]
ITACorp
you    also       want to   i               mean
you    do         wanna     make,           but
you    just       wanna     get             a
you    just       wanna     draw            out
you    just       wanna     think           about
you    may        wanna     think           about
you    maybe      wanna     highlight,      why
MICASE
you    might      wanna     take,           this
you    might      wanna     do              differently
you    might      wanna     put             it
you    might      wanna     sort            of
you    might      wanna     switch          this
you    might      wanna     um,             explain
you    probably   wanna     characterize,   Howell
you,   probably   wanna     um,             they
you    really     wanna     look            back
you    um         want to   see             all
you    would      want to   start           off
you    would      want to   look            at
so you     wanna   sort    of
now you    wanna   maybe   look
and you    wanna   just    isolate
• MICASE shows 12x the hedged use of want to / wanna
• ITAcorp uses of hedged wanna as DL followed the pedagogical intervention
Additional Preliminary Descriptive
Findings
In comparison to MICASE data, ITAs as represented in
ITAcorp:
Generally use very few hedges or intensifiers
Generally underuse periphrastic forms
Overuse obligative modals (must, should) and please + imperative
Use ‘can’ for obligative ‘should’
Use only basic conditional, underuse of ‘you could’ and no use of ‘I
would’
Navigate between ‘I’ and exclusive ‘we’ strategically, invoking
departmental or professorial authority when the going gets tough
Next Steps
Complementary ethnographic data (survey, interviews) for
ITAcorp participants
Use audio to produce narrow transcriptions of select data
Focus on differences across modality (CMC vs. F2F)
Focus on classroom presentation of a concept (contrasting with
MICASE)
Gather data from non-role play ITA professional activity (section
leader, lecturer, office hours)
Develop set of corpus-informed pedagogical interventions
focusing on professional discourse competencies
What is Data-Driven Learning?
Application of tools (concordancers) and
techniques from corpus linguistics in the service
of language learning.
Inquiry-based pedagogy
Learner as researcher
"Research is too important to leave to researchers" (Johns,
1991, p. 2)
Paradigms of L2 Instruction
Traditional approaches:
Present -> Practice -> Produce
Data-driven learning:
Observe -> Hypothesize -> Experiment
Impact of Corpus Techniques on L2
Pedagogy
Materials development
Instructional activities
How do native/expert speakers actually use the target
language?
What drives sequencing?
Data-driven learning tools
KWICionary
Research on Data-Driven Learning
Vocabulary Acquisition: improved through the use
of concordances (Stevens, 1991; Cobb, 1997)
Writing Instruction: students can correct their own
errors with concordances (Gaskell & Cobb, 2004; Ross &
Payne, 2005)
Pedagogical Issues for DDL
Learning a new way to learn language
Relationship between proficiency level and data-driven learning approach
Should frequency of use drive materials
development?
Next Generation Corpus Tools
Text files => relational databases
Application-based => web-based
Storing data as smallest atomic unit
Associate extensive meta-data with each data entry
Promote aggregation and sharing of data
Location-independent collaborative research
Integration with online learning environments
Online Corpus Analytic Tool (OCAT)
OCAT Design
Relational database backend
Extensive meta-data can be assigned to each data entry.
Multiple corpora can be linked and meta-data fields aligned
to create meta-corpora.
Dynamic sub-corpora
Users can create corpora and upload data via a web
interface.
Location-independent collaborative research
Concordance query, frequency lists, Mean Tokens per
Learner
Data visualization techniques
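To make the relational-database idea concrete, here is a minimal sketch using an in-memory SQLite database. It is not OCAT's actual schema: the table name, columns, helper functions, and sample rows are all hypothetical. It shows one entry per row with meta-data columns, a dynamic sub-corpus selected by a meta-data filter, and a Mean Tokens per Learner query.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entry (
    id         INTEGER PRIMARY KEY,
    learner_id TEXT    NOT NULL,
    activity   TEXT    NOT NULL,
    week       INTEGER NOT NULL,
    text       TEXT    NOT NULL,
    n_tokens   INTEGER NOT NULL
)
"""


def build_demo_db(rows):
    """Create an in-memory database from (learner_id, activity, week, text) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO entry (learner_id, activity, week, text, n_tokens) "
        "VALUES (?, ?, ?, ?, ?)",
        [(learner, activity, week, text, len(text.split()))
         for learner, activity, week, text in rows],
    )
    return conn


def mean_tokens_per_learner(conn, activity=None):
    """Average token total per learner, optionally within one activity type."""
    where, params = ("WHERE activity = ?", (activity,)) if activity else ("", ())
    sql = (
        "SELECT AVG(total) FROM ("
        "SELECT SUM(n_tokens) AS total FROM entry "
        f"{where} GROUP BY learner_id)"
    )
    return conn.execute(sql, params).fetchone()[0]


# Invented sample rows, for illustration only.
demo = build_demo_db([
    ("s1", "chat", 1, "you might wanna think about it"),
    ("s1", "email", 2, "please read chapter two"),
    ("s2", "chat", 1, "you should read it"),
])
overall = mean_tokens_per_learner(demo)
chat_only_mean = mean_tokens_per_learner(demo, activity="chat")
```

Because the sub-corpus is just a query, it can be redefined at any time without copying or re-uploading the underlying texts.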
Assessing Language Development