The Cambridge Learner Corpus, English Profile, the Sketch Engine
The Cambridge Learner Corpus,
English Profile, the Sketch Engine,
“freely available”, HOO, DANTE and
the Kelly Project
Adam Kilgarriff
Lexical Computing Ltd
http://www.sketchengine.co.uk
Cambridge Learner Corpus (CLC)
• Since 1993
– Nearly as old as CECL
• Leading resource (like ICLE)
• CUP and Cambridge ESOL
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities
English Profile
• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
– for each CEFR level, find characteristic lexis and
grammar
– Main resource: CLC
– Talk on Thursday
• Theodora Alexopoulou, Helen Yannakoudakis
Flyers
Sketch Engine
• Leading corpus tool
• Word sketches
– One-page summaries of a word’s grammatical and
collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 42 languages
– Over 150 corpora
– Since May including CHILDES: demo
– Since last year including CLC
Error-coded corpus
• Challenge
– Intuitive to search for x
• anywhere
• only where it is part of an error
• only where it is part of a correction
where x can be a word, phrase, grammar pattern …
Requirement for CLC in Sketch Engine
Sample text
• We will only use those informations to take
part of our guest survey
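The three search scopes can be sketched on the sample sentence. This is a minimal illustration, not the Sketch Engine or CLC implementation: the `Token` class, the `layer` labels, the `search` function and the corrections shown ("this information", "in") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    layer: str  # "plain", "error" (learner's original) or "correction"


# Hypothetical annotation of the sample sentence; the corrections are
# illustrative guesses, not taken from the CLC.
SAMPLE = [
    Token("We", "plain"), Token("will", "plain"), Token("only", "plain"),
    Token("use", "plain"),
    Token("those", "error"), Token("informations", "error"),
    Token("this", "correction"), Token("information", "correction"),
    Token("to", "plain"), Token("take", "plain"), Token("part", "plain"),
    Token("of", "error"), Token("in", "correction"),
    Token("our", "plain"), Token("guest", "plain"), Token("survey", "plain"),
]

SCOPES = {"anywhere", "error", "correction"}


def search(tokens: List[Token], word: str, scope: str = "anywhere") -> List[int]:
    """Return the positions of `word`, restricted to the requested scope."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    hits = []
    for i, tok in enumerate(tokens):
        if tok.text.lower() != word.lower():
            continue
        # "anywhere" matches every layer; otherwise the layer must match.
        if scope == "anywhere" or tok.layer == scope:
            hits.append(i)
    return hits


print(search(SAMPLE, "informations", "error"))
print(search(SAMPLE, "information", "correction"))
print(search(SAMPLE, "of", "correction"))
```

The same idea would extend to phrases or grammar patterns by matching over token sequences rather than single tokens.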
Error-coded corpora in SkE
• demo
freely available
Free (MED online)
Sense 1: not costing anything
Sense 4: not limited by rules
… anyone can get hold of it??
Available
To download onto your computer
To use
Case studies
               ICLE       CLC
Money          225 EUR    No
To everyone    Yes        Cambridge author/collab
To download    ?          No
To use         Yes        Yes
Non-geeks
• Access is important, not download
• Web is beautiful
HOO / HOO+
• Helping Our Own
• HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
• Organisers define task and prepare ‘gold standard’
• Teams participate by running their software over test
data
• Six teams (incl Tübingen), workshop end Sept
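A shared task of this kind is typically scored by comparing each team's output with the gold standard. The sketch below shows one generic way to do that with precision, recall and F-score. The slides do not state the measures HOO actually used, so the `score` function, the edit format `(start, end, correction)` and the toy data are assumptions.

```python
def score(gold_edits, system_edits):
    """Compare system edits with gold-standard edits.

    Each edit is a hypothetical (start, end, correction) tuple; an edit
    counts as correct only if it matches a gold edit exactly.
    """
    gold = set(gold_edits)
    system = set(system_edits)
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: the system finds one of the two gold edits and proposes one wrong edit.
gold = [(4, 6, "this information"), (11, 12, "in")]
system = [(4, 6, "this information"), (8, 9, "for")]
print(score(gold, system))
```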
HOO+ (2012)
• Probably
– English: learner data from CLC
– Other languages?
– Tasks
• Essay scoring
• Determiner, preposition errors
• ?
http://www.clt.mq.edu.au/research/projects/hoo/
DANTE
Highlights of English lexicography
http://webdante.com
Flyers
The KELLY Project
• EU Lifelong Learning Project
• Word cards
– 9 languages
• Arabic Chinese English Greek Italian Norwegian Polish
Russian Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
• Partners
• Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ,
ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S,
Lexical Computing Ltd
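The figure of 36 pairs follows from the nine languages: 9 × 8 / 2 = 36 unordered pairs, or 72 if each pair is counted once per translation direction. A short check, with variable names chosen only for this example:

```python
from itertools import combinations, permutations

LANGUAGES = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered pairs: {English, Swedish} is counted once.
unordered_pairs = list(combinations(LANGUAGES, 2))
# Directed pairs: English->Swedish and Swedish->English are counted separately.
directed_pairs = list(permutations(LANGUAGES, 2))

print(len(unordered_pairs))  # 36
print(len(directed_pairs))   # 72
```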
Interesting question
• How close to purely corpus-based can a
pedagogic list be?
Method
• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed-lg-pairs)
• Words not in source list which occur in translations:
– Review source list
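The counting step and the final check can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual pipeline: the function names, the toy corpus and the toy Swedish-to-English translations are invented for the example, and real lists would be built from lemmas rather than raw tokens.

```python
from collections import Counter


def frequency_list(tokens, size):
    """Count tokens in a corpus and return the `size` most frequent words."""
    counts = Counter(token.lower() for token in tokens)
    return [word for word, _ in counts.most_common(size)]


def missing_from_source_list(translations, target_list):
    """Find translations that are absent from the target language's own list.

    `translations` maps each source word to its translations into the
    target language; the words returned are candidates for review.
    """
    known = set(target_list)
    missing = set()
    for targets in translations.values():
        for target in targets:
            if target not in known:
                missing.add(target)
    return sorted(missing)


# Toy corpus and toy translations (hypothetical data).
corpus = "the sun the music the sun".split()
print(frequency_list(corpus, 2))

swedish_to_english = {"sol": ["sun"], "sjukhus": ["hospital"], "teori": ["theory"]}
english_list = ["sun", "theory"]
print(missing_from_source_list(swedish_to_english, english_list))
```

Here "hospital" appears as a translation but not in the English list, so it would be flagged when the English list is reviewed.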
• http://kelly.sketchengine.co.uk
• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members)
• hospital library music sun theory
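Symmetrical pairs and cliques can be sketched in a few lines. The data structure (a set of directed translation links between `(language, word)` nodes), the function names and the three-language toy data are assumptions for illustration; the project's own data covers nine languages.

```python
from itertools import combinations


def symmetric_pairs(directed):
    """Keep only pairs <x,y> for which the reverse link <y,x> also exists."""
    return {frozenset((x, y)) for (x, y) in directed if (y, x) in directed}


def is_clique(nodes, symmetric):
    """True if every pair of nodes in the group is a symmetrical pair."""
    return all(frozenset(pair) in symmetric for pair in combinations(nodes, 2))


# Toy directed translation links (hypothetical data).
en, sv, it = ("en", "music"), ("sv", "musik"), ("it", "musica")
links = {
    (en, sv), (sv, en),
    (en, it), (it, en),
    (sv, it), (it, sv),
    (("en", "sun"), ("sv", "sol")),  # one direction only, so not symmetrical
}

sym = symmetric_pairs(links)
print(is_clique([en, sv, it], sym))                       # True
print(is_clique([("en", "sun"), ("sv", "sol")], sym))     # False
```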
Homage