The stumbling blocks in corpus-based research of interlanguage phraseology
PLM 2001 Bukowy Dworek
27 April 2001
Przemysław Kaszubski
School of English
Adam Mickiewicz University
Poznań, Poland
Corpus linguistics: central problems
Representativeness (corpus design, compilation criteria, etc.)
Annotation (& disambiguation) of data
Some basic questions:
How much to annotate? (whole corpus? one part of speech? one lemma? etc.)
How deep an analysis?
How large is the corpus?
What and whom are the results for?
Corpus-based or corpus-driven procedures?
Methodological premises of my research (1)
EFL learners’ overuse of high-frequency words: what does it mean?
Intensive collocability of core lexical items
Multi-word extensions (compounds, coinages, idioms, expressions, phrasals)
Confrontation:
available corpus-driven extraction methods vs. pedagogical usefulness: the L1 perspective (the role of transfer)
Methodological premises (2)
multi-corpus scheme with Polish advanced EFL learner data as hub data
variables: a) genre / text-type; b) L1; c) proficiency level; d) age / maturity level
lemma-based approach (as opposed to wordform- or family-oriented approaches)
The corpus base: full specification

ENGLISH CORPORA (non-native vs. native English; ‘apprentice’ vs. ‘expert’ corpora)

Corpus        Level                   Description                                      Tokens
PLLC          1. Intermediate         Polish intermediate EFL                           92,712
SPAN          2. Upper-intermediate   Spanish (upper-)intermediate EFL                  94,965
FREN          3. Advanced             Belgian French advanced EFL                      101,442
IFA-P(ICLE)   3. Advanced             Polish advanced EFL                              107,990
LOCN(ARG)     4. College              British and American college learner English     106,255
MCONC         5. Professional         British academic writing                          97,914
LOB&BROWN     5. Professional         British and American quality press                94,421

POLISH CORPORA

Corpus     Type                  Level                   Description                                        Tokens
POL-STUD   ‘apprentice’ corpus   4. College level        Polish college compositions                       103,382
POL-EXP    ‘expert’ corpus       5. Professional level   Polish academic papers + quality-press articles   101,348
The ‘extended’ tripartite idiomaticity model: the criteria
lexical fixedness
syntactic fixedness and / or anomaly
semantic opacity
lexicalisation / institutionalisation / specialisation / conventionality = frequency + distribution
implementation of the fourth criterion via external sources: BBI2 & LDOCE3
The ‘extended’ tripartite idiomaticity model: the levels (1)
frozen expressions:
phrasals: ‘TAKE after sb’; ‘TAKE to (doing) sth’; ‘be taken aback’; ‘GIVE (sth) up’; ‘GIVE sb/o.s. away’
MWUs: ‘GIVE rise to’; ‘GIVE way to sb/sth’; ‘GIVE sb a hand’; ‘TAKE care’; ‘TAKE place’; ‘TAKE for granted’; ‘TAKE advantage’; ‘TAKE root’; ‘TAKE effect’
lexicalised compounds: ‘God-given’; ‘risk-taking’; ‘leave-taking’
restricted uses (1):
restricted collocations & delexical uses: ‘TAKE drugs/ steps/ the form of/ advice/ decision/ initiative/ a bath/ a breath/ sleep’; ‘GIVE an account/ a lesson/ explanation/ sb/sth a name/ a concert/ permission/ a speech/ sb a warm welcome’
The ‘extended’ tripartite idiomaticity model: the levels (2)
restricted uses (2):
special senses or uses: ‘GIVE results/ details/ data’; ‘TAKE <X minutes, year(s), months, hours, generations, life, etc.>’; ‘TAKE <sth> to mean <sth>’
discourse formulae: ‘let's take X/ an example of X/ X as an example etc.’
free combinations:
regular (incl. transparent phrasals): ‘TAKE <sb> away/ to <a place> etc.’; ‘GIVE <sth> back’; ‘GIVE money’
curious interlanguage usage: ‘?GIVE generalisation/ stabilisation to <sth>’; ‘?TAKE help/ behaviour’
The research hypotheses
negative correlation between proficiency level and frequencies of non-idiomatic uses
positive correlation between proficiency level and frequencies of idiomatic expressions, except for EFL learners’ ‘favourite expressions’
traceability of (at least) some ‘favourite expressions’ to L1
Automatic extraction precision & recall problems
POS (part-of-speech) taggers’ error margin
Word-sense disambiguation and / or syntactic parsing
Collocation statistics
Nature of learner language
Inter-corpus comparability
Problem 1: error margin of POS taggers
Standard error margin: 5%
Affected: extraction of lemmas meeting POS criteria
Precision (noise in data): non-verbs tagged as verbs
• Not-telling (7): VB(lex,montr,ingp), lemma ?not-tel?
• agressive (3): VB(lex,intr,infin), lemma ?agressive?
• well-behaved (2): VB(lex,montr,edp), lemma ?well-behave?
Recall (data ignored): verbs tagged as non-verbs / lexical verbs as auxiliaries:
• ... who in sharing their lives with a retarded sibiling [sic!] and
taking <ADJ(ge,pos,ingp)> {taking} part in every-day care
problems, may decide never to have ...
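The precision and recall notions above can be made concrete with a small check of tagger output against hand-corrected tags. A minimal sketch; the token tags below are invented illustrations (echoing cases like the misspelled ‘agressive’ tagged as a verb), not actual TOSCA-ICLE output.

```python
# Sketch: precision/recall of a tagger's verb decisions against a
# hand-corrected gold standard. All tag data are invented.

gold   = ["N", "V", "ADJ", "V", "ADJ"]    # hand-corrected tags
tagged = ["N", "V", "V", "ADJ", "ADJ"]    # hypothetical tagger output

tp = sum(1 for g, t in zip(gold, tagged) if g == "V" and t == "V")
fp = sum(1 for g, t in zip(gold, tagged) if g != "V" and t == "V")  # noise in data
fn = sum(1 for g, t in zip(gold, tagged) if g == "V" and t != "V")  # data ignored

precision = tp / (tp + fp)  # share of extracted 'verbs' that really are verbs
recall = tp / (tp + fn)     # share of real verbs the extraction catches
print(precision, recall)    # 0.5 0.5
```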
Tracking & rectifying the POS errors
tagger built-in tag editor (TOSCA-ICLE): on-line targeting of precision & recall errors (UNTAGged and doubtful cases)
Problem: insufficient query language: word OR lemma OR tag pattern
no tagger built-in editor: concordancer or editor needed to test for precision and recall
Problem: either comprehensive or intuitive check
remaining difficulty: tagsets vs. research assumptions (gerunds & participles tagged as non-verbs)
Problem 2: semantic disambiguation and associations
sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998: 4)
automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval project
University of Lancaster disambiguation tool
tools unavailable or not yet at an implementable stage
Problem 3: corpus-driven collocation extraction (1)
lemmas or wordforms?
collocation vs. co-occurrence (vs. adjacency)
word clusters
precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs. ‘TAKE good care of’; ‘the chance which were not eager to take’)
stop-listing not quite possible with high-frequency items (BUT: Ted Pedersen’s ‘Bigram Statistics Package’: http://www.d.umn.edu/~tpederse/code.html)
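The recall point above (collocations spilling outside contiguity) is what window-based rather than bigram-based extraction addresses. A minimal sketch, assuming tokenised text and a 4:4 window; the sentence is an invented example, not corpus data.

```python
# Sketch: window-based collocate counting, so that 'TAKE good care of'
# still yields the pair (take, care) despite the intervening word.
from collections import Counter

def window_pairs(tokens, node, span=4):
    """Count words co-occurring with `node` within +/- `span` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

tokens = "they take good care of the children".split()
print(window_pairs(tokens, "take")["care"])  # 1 -- found despite 'good' intervening
```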
Problem 3: corpus-driven collocation extraction (2)
co-occurrence statistics (WordSmith)
precision: not all co-occurrence patterns testify to meaningful collocations
recall: collocations may extend beyond typical 4:4 word spans
MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):
MI collocates of GIVE (172; 2458) included: birth; antibiotic; isolate; away; vote; vaccination; habit; opening; ingenuity; happiness, with MI scores ranging from 2.91 to 4.65
WordSmith: output limited to only 10 collocates
Oliver Mason’s QWICK: MI with weighting factors for frequent words and unlimited display of collocates
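The MI behaviour described above can be sketched with the standard pointwise mutual-information formula used in collocation studies, MI = log2(N · f(node, collocate) / (f(node) · f(collocate))); WordSmith's exact computation may differ, and all counts below are invented.

```python
# Sketch of pointwise MI, illustrating why it favours 'idiosyncratic'
# collocates: a rare word occurring almost exclusively near the node
# outscores a frequent, loosely attached one. Counts are invented.
from math import log2

def mi(pair_freq, node_freq, coll_freq, corpus_size):
    return log2((corpus_size * pair_freq) / (node_freq * coll_freq))

N = 100_000
# rare but exclusive collocate vs. frequent but loosely attached one
print(round(mi(pair_freq=5, node_freq=172, coll_freq=6, corpus_size=N), 2))
print(round(mi(pair_freq=40, node_freq=172, coll_freq=5000, corpus_size=N), 2))
```

The rare collocate scores several points higher even though the frequent one co-occurs eight times as often.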
Problem 3: corpus-driven collocation extraction (3)
co-occurrence statistics (z-score: TACT)
z-score & t-score: better suited for frequent collocates, but also mutual and imprecise on their own; z-score-ordered collocate list for BE:
• there; it; that; able; not; which; should; considered; by; likely; to; said; very; enough; why; important; concerned; what; always; worth; if; proved; afraid; used
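The z-score and t-score can be sketched in one common formulation (TACT's exact formula may differ): the observed co-occurrence count is compared with the count expected by chance within a ±4 window. All counts below are invented.

```python
# Sketch: z-score and t-score for a node/collocate pair. E is the
# chance co-occurrence count in a +/- span window; counts are invented.
from math import sqrt

def z_and_t(observed, node_freq, coll_freq, corpus_size, span=4):
    expected = node_freq * coll_freq * 2 * span / corpus_size
    z = (observed - expected) / sqrt(expected)  # inflates rare collocates
    t = (observed - expected) / sqrt(observed)  # favours frequent collocates
    return z, t

z, t = z_and_t(observed=30, node_freq=2458, coll_freq=50, corpus_size=100_000)
print(round(z, 2), round(t, 2))
```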
Mason’s QWICK: multi-test package, incl. also log-likelihood; modified log-likelihood; expected/observed ratio
Remaining problems:
stop-listing not quite possible with high-frequency item tests
collocations outside a heuristic window
lexical associations between collocates (synsets)
semi-manual grouping of data essential (limitations)
Problem 4: the nature of learner data
Difference in proficiency levels essential in cross-corpus comparisons
Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand
Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’
Unrecognised words vs. the tagger’s default option tag
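The hyphenation problem above is usually handled by a normalisation pass before tagging. A minimal sketch; the variant table is a toy assumption, since a real pass would be derived from the corpus wordlist.

```python
# Sketch: map hyphen/space/solid spelling variants to one canonical
# form before tagging. The variant table is an invented toy example.

CANONICAL = {
    "money making": "money-making",
    "moneymaking": "money-making",
}

def normalise(text):
    for variant, canon in CANONICAL.items():
        text = text.replace(variant, canon)
    return text

print(normalise("moneymaking schemes and money making plans"))
# money-making schemes and money-making plans
```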
Problem 5: cross-corpus comparability
genre homogeneity
topic-skewed distribution: heuristic method of isolation: sort by standard deviation
TAKE <sb/sth>   LOB&BR   MCONC   LOCN   IFA-P   FREN   SPAN   PLLC      SD
drugs*               0       0      2      43      6      0      0   15.90
steps                4       4      1      13      2      7      0    4.43
overdose             0       6      0       0      0      0      0    2.27
exercise             0       6      0       1      0      0      1    2.19
life**               0       0      6       1      1      0      0    2.19
* incl. marijuana, opium, chemical substances
** = kill
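The heuristic above can be sketched directly: compute the sample standard deviation of each item's frequencies across the seven corpora and sort descending, so topic-skewed items such as 'drugs' surface first. The counts below are taken from the table; the sample (n-1) formula reproduces the SD figures shown.

```python
# Sketch: sort TAKE-object frequencies across seven corpora by sample
# standard deviation, so topic-skewed items (one corpus inflated) lead.
from math import sqrt

counts = {
    "drugs":    [0, 0, 2, 43, 6, 0, 0],   # from the table above
    "steps":    [4, 4, 1, 13, 2, 7, 0],
    "overdose": [0, 6, 0, 0, 0, 0, 0],
}

def sd(xs):
    """Sample standard deviation (n - 1), matching the slide's figures."""
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

for word, freqs in sorted(counts.items(), key=lambda kv: sd(kv[1]), reverse=True):
    print(word, round(sd(freqs), 2))  # 'drugs' surfaces first
```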
Summary
Difficult to find/compile truly homogeneous AND comparable sets of corpora = small-corpus analysis often a necessity
With small corpora, purely automated methods of processing and analysis display insufficient precision and recall
Loss of data may prove too costly when pedagogical conclusions are sought
Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)
Dedicated new type of hybrid concordancer-editor needed
SOLUTION: dedicated concordancer-annotator
Feature 1: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors
Feature 2: allow adding custom information to concordance lines (specialised annotation / grouping of data)
Feature 3: allow saving concordances as text BACK into the corpus (pasting)
Feature 4: collocation annotation / statistics enhanced by links with a phraseological dictionary
Feature 5: ???
This slide show will shortly be available from:
http://main.amu.edu.pl/~przemka/rsearch.html