The stumbling blocks in corpus-based
research of interlanguage phraseology
PLM 2001 Bukowy Dworek
27 April 2001
Przemysław Kaszubski
School of English
Adam Mickiewicz University
Poznań, Poland
Corpus linguistics: central
problems
Representativeness (corpus design, compilation
criteria, etc.)
Annotation (& disambiguation) of data
Some basic questions:
How much to annotate? (whole corpus? 1 part of
speech, 1 lemma etc.)
How deep an analysis?
How large is the corpus?
What and whom are the results for?
Corpus-based or corpus-driven procedures?
Methodological premises of my
research (1)
EFL learners’ overuse of high-frequency words:
what does it mean?
Intensive collocability of core lexical items
Multi-word extensions (compounds, coinages,
idioms, expressions, phrasals)
Confrontation
Available corpus-driven extraction methods
vs.
pedagogical usefulness: L1-perspective (the role of
transfer)
Methodological premises (2)
multi-corpus scheme with Polish advanced EFL
learner data as hub data
variables: a) genre / text-type; b) L1; c)
proficiency level d) age / maturity level
Lemma-based approach (as opposed to
wordform- or family-oriented approaches)
The corpus base: full specification
ENGLISH CORPORA
Non-native English (‘apprentice’ corpora):
PLLC: Polish intermediate EFL; level 1 (Intermediate); 92,712 tokens
SPAN: Spanish (upper-)intermediate EFL; level 2 (Upper-intermediate); 94,965 tokens
FREN: Belgian French advanced EFL; level 3 (Advanced); 101,442 tokens
IFA-P(ICLE): Polish advanced EFL; level 3 (Advanced); 107,990 tokens
Native English:
LOCN(ARG): British and American college learner English; ‘apprentice’ corpus; level 4 (College); 106,255 tokens
MCONC: British academic writing; ‘expert’ corpus; level 5 (Professional); 97,914 tokens
LOB&BROWN: British and American quality press; ‘expert’ corpus; level 5 (Professional); 94,421 tokens
POLISH CORPORA
POL-STUD: Polish college compositions; ‘apprentice’ corpus; level 4 (College); 103,382 tokens
POL-EXP: Polish academic papers + quality-press articles; ‘expert’ corpus; level 5 (Professional); 101,348 tokens
The ‘extended’ tripartite
idiomaticity model: the criteria
lexical fixedness
syntactic fixedness and / or anomaly
semantic opacity
lexicalisation / institutionalisation / specialisation
/ conventionality = frequency + distribution
implementation of fourth criterion via external
sources BBI2 & LDOCE3
The ‘extended’ tripartite
idiomaticity model: the levels (1)
frozen expressions:
phrasals: ‘TAKE after sb’; ‘TAKE to (doing) sth’; ‘be taken
aback’; ‘GIVE (sth) up’; ‘GIVE sb/o.s. away’
MWUs: ‘GIVE rise to’; ‘GIVE way to sb/sth’; ‘GIVE sb a
hand’; ‘TAKE care’; ‘TAKE place’; ‘TAKE for granted’; ‘TAKE
advantage’; ‘TAKE root’; ‘TAKE effect’;
lexicalised compounds: ‘God-given’; ‘risk-taking’; ‘leave-taking’
restricted uses (1):
restricted collocations & delexical uses: ‘TAKE drugs/ steps/
the form of/ advice/ decision/ initiative/ a bath/ a breath/
sleep’; ‘GIVE an account/ a lesson/ explanation/ sb/sth a
name/ a concert/ permission/ a speech/ sb a warm welcome’
The ‘extended’ tripartite
idiomaticity model: the levels (2)
restricted uses (2):
special senses or uses: ‘GIVE results/ details/ data’; ‘TAKE
<X minutes, year(s), months, hours, generations, life, etc.>’,
‘TAKE <sth> to mean <sth>’
discourse formulae: ‘let's take X/an example of X/ X as an
example etc.’
free combinations:
regular (incl. transparent phrasals): ‘TAKE <sb> away/ to <a
place> etc.’; ‘GIVE <sth> back’; ‘GIVE money’
curious interlanguage usage: ‘?GIVE generalisation/
stabilisation to <sth>’; ‘?TAKE help/ behaviour’
The research hypotheses
negative correlation between proficiency level
and frequencies of non-idiomatic uses
positive correlation between proficiency level
and frequencies of idiomatic expressions except
EFL learners’ ‘favourite expressions’
traceability of (at least) some ‘favourite
expressions’ to L1
Automatic extraction precision &
recall problems
POS (part-of-speech) taggers’ error margin
Word-sense disambiguation and / or syntactic
parsing
Collocation statistics
Nature of learner language
Inter-corpus comparability
Problem 1: error margin of POS
taggers
Standard error margin: 5%
Affected: extraction of lemmas meeting POS criteria
Precision (noise in data): non-verbs tagged as verbs
• Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)
• agressive VB(lex,intr,infin) ?agressive? ...(3)
• well-behaved VB(lex,montr,edp) ?well-behave? ...(2)
Recall (data ignored): verbs tagged as non-verbs / lexical
verbs as auxiliaries:
• ... who in sharing their lives with a retarded sibiling [sic!] and
taking <ADJ(ge,pos,ingp)> {taking} part in every-day care
problems, may decide never to have ...
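The two error types above can be quantified once a hand-checked sample is available. The following is a minimal sketch, assuming hypothetical token ids for a small verified sample; the function name and the data are illustrative and not part of the original study.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of an extracted set against a hand-checked gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical token ids from a manually verified sample (illustrative only).
tagged_as_verb = {1, 2, 3, 4, 5}   # id 5 is noise, e.g. an adjective tagged as a verb
really_verb = {1, 2, 3, 4, 6}      # id 6 was missed, e.g. a participle tagged as an adjective

precision, recall = precision_recall(tagged_as_verb, really_verb)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision falls with every non-verb that enters the extracted set; recall falls with every verb the tagger labels as something else.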
Tracking & rectifying the POS
errors
tagger built-in tag editor (TOSCA-ICLE): on-line
targeting of precision & recall errors
(UNTAGged and doubtful cases)
Problem: insufficient query language: word OR
lemma OR tag pattern
no tagger built-in editor: concordancer or editor
needed to test for precision and recall
Problem: either comprehensive or intuitive check
remaining difficulty: tagsets vs. research
assumptions (gerunds & participles tagged as
non-verbs)
Problem 2: semantic
disambiguation and associations
sometimes only grouping data uncovers a
meaningful type of association (Stubbs 1998:4)
automatic word-sense disambiguation (WSD)
and machine-readable lexicons (e.g. WordNet
1.7, EuroWordNet): the Senseval Project
University of Lancaster disambiguation tool
Tools unavailable or not at implementable stage
Problem 3: corpus-driven
collocation extraction (1)
lemmas or wordforms?
collocation vs co-occurrence (vs adjacency)
word clusters
precision: many identified clusters have little linguistic
significance (‘is the’; ‘of the’; ‘it BE a’)
recall: Many genuine collocations and MWUs are not
contiguous (Kennedy 1998: 114) and may spill outside the
typical 4:4 window (e.g. ‘TAKE care of...’ vs ‘TAKE good care
of’; ‘the chance which were not eager to take’)
stop-listing not quite possible with high-frequency items
(BUT: Ted Pedersen’s ‘Bigram Statistics Package’:
http://www.d.umn.edu/~tpederse/code.html)
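The contrast between contiguous clusters and window-based co-occurrence can be illustrated with a short sketch. It assumes an invented, already lemmatised sentence (not corpus data) and hypothetical helper names; it is not the procedure of any of the tools named above.

```python
def contiguous_ngrams(tokens, n):
    """Return all contiguous n-word clusters as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def window_collocates(tokens, node, left=4, right=4):
    """Return the words found within `left`/`right` positions of each `node`."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == node:
            found.extend(tokens[max(0, i - left):i])
            found.extend(tokens[i + 1:i + 1 + right])
    return found

# Invented, lemmatised example sentence (assumption, for illustration only).
tokens = "she take good care of the child".split()

print(("take", "care") in contiguous_ngrams(tokens, 2))  # cluster search misses it
print("care" in window_collocates(tokens, "take"))       # 4:4 window finds it
```

The window recovers ‘care’ but also returns ‘she’, ‘good’ and ‘of’, which is the precision cost of widening the search; anything further than four words from the node is still lost.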
Problem 3: corpus-driven
collocation extraction (2)
co-occurrence statistics (WordSmith)
precision: not all co-occurrence patterns testify to
meaningful collocations
recall: collocations may extend beyond typical 4:4 word
spans
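MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90).
[Table: MI scores for collocates of GIVE (header figures: 172; 2458). Collocates: birth, antibiotic, isolate (3.43), away, vote, vaccination, habit, opening (4.24), ingenuity, happiness. Remaining scores as listed: 4.01, 3.91, 2.91, 4.65, 3.43, 4.24, 3.91, 3.24.]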
WordSmith: only 10 collocate output
Oliver Mason’s QWICK: MI with weighting factors for
frequent words and unlimited display of collocates
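Why MI favours rare, ‘idiosyncratic’ collocates can be seen from the usual textbook definition, MI = log2(observed / expected). The sketch below assumes that definition without any span correction; the formulas actually implemented in WordSmith or QWICK may differ, and all frequencies are invented.

```python
import math

def mutual_information(f_pair, f_node, f_collocate, corpus_size):
    """Textbook MI: log2 of observed over expected co-occurrence frequency."""
    expected = f_node * f_collocate / corpus_size
    return math.log2(f_pair / expected)

# Invented frequencies in a 100,000-token corpus (illustrative only).
rare = mutual_information(f_pair=3, f_node=172, f_collocate=5, corpus_size=100_000)
frequent = mutual_information(f_pair=20, f_node=172, f_collocate=3000, corpus_size=100_000)

print(round(rare, 2), round(frequent, 2))
```

A collocate seen only five times in the corpus scores far higher than a frequent one, even though the frequent pair co-occurs much more often in absolute terms.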
Problem 3: corpus-driven
collocation extraction (3)
co-occurrence statistics (z-score: TACT)
z-score & t-score - better suited for frequent collocates but
also mutual and imprecise on their own: z-score ordered
collocate list for BE:
• there; it; that; able; not; which; should; considered; by;
likely; to; said; very; enough; why; important; concerned;
what; always; worth; if; proved; afraid; used;
Mason’s QWICK: multi-test package: incl. also log-likelihood; modified log-likelihood; expected/observed ratio
Remaining problems
stop-listing not quite possible with high-frequency item tests
collocations outside a heuristic window
lexical associations between collocates (synsets)
semi-manual grouping of data essential (limitations)
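For comparison with MI, the t-score and z-score can be sketched in their simplified textbook forms, t = (O - E) / sqrt(O) and z = (O - E) / sqrt(E). This is an assumption-laden illustration: span size is ignored, the frequencies are invented, and the exact formulas used by TACT or QWICK may differ.

```python
import math

def expected(f_node, f_collocate, corpus_size):
    """Expected co-occurrence frequency under independence."""
    return f_node * f_collocate / corpus_size

def t_score(observed, exp):
    """Simplified t-score: difference scaled by sqrt of the observed frequency."""
    return (observed - exp) / math.sqrt(observed)

def z_score(observed, exp):
    """Simplified z-score: difference scaled by sqrt of the expected frequency."""
    return (observed - exp) / math.sqrt(exp)

# Invented frequencies (illustrative only).
exp = expected(f_node=400, f_collocate=100, corpus_size=10_000)
print(exp, t_score(16, exp), z_score(16, exp))
```

Because both measures scale the raw difference O - E rather than a ratio, frequent collocates such as function words rise to the top of the list, which matches the grammatical items heading the BE list above.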
Problem 4: the nature of learner
data
Difference in proficiency levels essential in
cross-corpus comparisons
Recall: misspelled words may get mistagged by
taggers and overlooked by concordancers, unless
edited beforehand
Wrong or inconsistent hyphenation may mislead
taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs.
‘money-making’
Unrecognised words vs. tagger default option tag
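Inconsistent hyphenation can be reduced before tagging by mapping the spaced, solid and hyphenated spellings of a known compound onto one form. A minimal sketch, assuming a hypothetical helper and a hand-supplied list of compounds; the output still needs a manual check, since a spaced sequence may be a genuine two-word phrase.

```python
import re

def normalise_compound(text, parts, canonical):
    """Replace spaced, hyphenated and solid spellings of a two-part compound."""
    pattern = re.compile(
        r"\b" + re.escape(parts[0]) + r"[\s-]?" + re.escape(parts[1]) + r"\b",
        re.IGNORECASE,
    )
    return pattern.sub(canonical, text)

sample = "money making, moneymaking and money-making"
print(normalise_compound(sample, ("money", "making"), "money-making"))
```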
Problem 5: cross-corpus
comparability
genre homogeneity
topic-skewed distribution: heuristic method of
isolation: sort by standard deviation
TAKE <sb/sth>   LOB&BR  MCONC  LOCN  IFA-P  FREN  SPAN  PLLC  SD
drugs*          0       0      2     43     6     0     0     15.90
steps           4       4      1     13     2     7     0     4.43
overdose        0       6      0     0      0     0     0     2.27
exercise        0       6      0     1      0     0     1     2.19
life**          0       0      6     1      1     0     0     2.19
* incl. marijuana, opium, chemical substances
** = kill
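The heuristic of sorting by standard deviation can be reproduced from the frequencies in the table. The sketch below uses the sample standard deviation from Python's standard library, which matches the SD column; the variable names are illustrative.

```python
from statistics import stdev

# Frequencies per corpus, in the order LOB&BR, MCONC, LOCN, IFA-P, FREN, SPAN, PLLC.
take_objects = {
    "drugs": [0, 0, 2, 43, 6, 0, 0],
    "steps": [4, 4, 1, 13, 2, 7, 0],
    "overdose": [0, 6, 0, 0, 0, 0, 0],
    "exercise": [0, 6, 0, 1, 0, 0, 1],
    "life": [0, 0, 6, 1, 1, 0, 0],
}

# Highest spread first: candidates for topic-skewed distribution.
ranked = sorted(take_objects.items(), key=lambda item: stdev(item[1]), reverse=True)
for collocate, counts in ranked:
    print(f"{collocate:10s} SD={stdev(counts):.2f}")
```

A high SD flags a collocate whose frequency is concentrated in one or two corpora (here ‘drugs’ in IFA-P), which points to an essay topic rather than a general usage pattern.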
Summary
Difficult to find/compile truly homogeneous AND comparable
sets of corpora = small corpus analysis often a necessity
With small corpora, mere automated methods of processing
and analysis display insufficient precision and recall
Loss of data may prove too costly when pedagogical
conclusions are sought
Instead of automatisation: increase the pace of assisted preprocessing and semi-manual analysis (disambiguation)
Dedicated new type of hybrid concordancer-editor needed
SOLUTION: dedicated
concordancer-annotator
Feature 1: allow editing of concordance lines - text
and/or tags and/or lemmas - like built-in tagger
editors
Feature 2: allow adding custom information to
concordance lines (specialised annotation / grouping
of data)
Feature 3: allow saving concordances as text BACK
into the corpus (pasting)
Feature 4: collocation annotation / statistics
enhanced by links with phraseological dictionary
Feature 5: ???
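Features 1 to 3 can be pictured with a small data-structure sketch: a concordance line that remembers its corpus position, carries a free annotation field, and can be written back into the token list. All names here are hypothetical and the example tokens are invented; this is an illustration of the idea, not a design for the proposed tool.

```python
from dataclasses import dataclass

@dataclass
class ConcordanceLine:
    position: int          # index of the node word in the corpus token list
    left: str
    node: str
    right: str
    annotation: str = ""   # Feature 2: custom grouping / annotation

def concordance(tokens, node, span=4):
    """Build KWIC lines for every occurrence of `node` (case-insensitive)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(ConcordanceLine(i, left, tok, right))
    return lines

def write_back(tokens, line, new_node):
    """Features 1 and 3: edit the node word and save the change into the corpus."""
    tokens[line.position] = new_node
    line.node = new_node

# Invented token list (illustrative only).
tokens = "learners take a decision and take drugs".split()
lines = concordance(tokens, "take", span=2)
lines[0].annotation = "restricted collocation"
write_back(tokens, lines[1], "TAKE")
print(lines[0], tokens)
```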
This presentation will shortly be available from:
http://main.amu.edu.pl/~przemka/rsearch.html