Transcript: The Montclair Electronic Language Learner Database (MELD)
www.chss.montclair.edu/linguistics/MELD/
Eileen Fitzpatrick & Steve Seegmiller, Montclair State University
Non-native speaker (NNS) corpora
• Begun in the early 1990s
• Data
– written performance only
– essays of students of English as a foreign language
• Corpus development (academic)
– in Europe: Louvain, Lodz, Uppsala
– in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology
• Annotation
– Lodz: part of speech
– HKUST, Lodz: error tags
Gaps in NNS Corpus Creation
• No NNS corpus exists in America, so there is no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types
MELD Goals
• Initial goals
– collect ESL student writing
– tag the writing for errors
– provide publicly available NNS data
• These goals support
– second-language pedagogy
– language acquisition research
– tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)
MELD Overview
• Data
– 44,477 words of annotated text
– 53,826 more words of raw data
– language and education data for each student author
– upper-level ESL students
• Tools written to
– link essays to student background data
– produce an error-free version from tagged text
– allow fast entry of background data
Annotation
• Annotators “reconstruct” a grammatical form: {error/reconstruction}
school systems {is/are} since children {0/are} usually inspired becoming {a/0} good citizens
• Agreement between annotators is an issue
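The {error/reconstruction} notation above can be processed mechanically. Below is a minimal sketch of how an error-free version might be produced from tagged text, as the MELD tools do; the function names and regex are illustrative assumptions, not MELD's actual implementation.

```python
import re

# MELD-style annotation: {error/reconstruction}; "0" marks an empty side
# (i.e. a deletion or an insertion).
TAG = re.compile(r"\{([^{}/]*)/([^{}]*)\}")

def reconstruct(text: str) -> str:
    """Return the error-free version: keep each reconstruction, drop the error."""
    def repl(m):
        fix = m.group(2)
        return "" if fix == "0" else fix
    # Collapse doubled spaces left behind by deletions such as {a/0}.
    return re.sub(r"\s{2,}", " ", TAG.sub(repl, text)).strip()

def original(text: str) -> str:
    """Return the learner's original text: keep each error, drop the reconstruction."""
    def repl(m):
        err = m.group(1)
        return "" if err == "0" else err
    return re.sub(r"\s{2,}", " ", TAG.sub(repl, text)).strip()

tagged = "children {0/are} usually inspired becoming {a/0} good citizens"
print(reconstruct(tagged))  # children are usually inspired becoming good citizens
print(original(tagged))     # children usually inspired becoming a good citizens
```

Because the tags are transparent, the same pass that strips them can feed the reconstructed text to a part-of-speech tagger or parser.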
Error Classification from a Predetermined List
• Benefit
– annotators agree on what an error is: only the items in the classification scheme
• Problems
– annotators must learn the classification scheme
– the existence of a classification scheme means that annotators can misclassify
– errors not in the scheme will be missed
Error Identification & Reconstruction
• Benefits
– faster annotation, since there is no classification scheme to learn
– no chance of misclassification
– less common errors are captured
– a reconstructed text is more easily parsed and tagged for part of speech
• Question
– How well can annotators agree on what an error is?
Agreement Measures
• Reliability: What percentage of the errors do both taggers tag?

  Reliability = |T1 ∩ T2| / ((|T1| + |T2|) / 2)

• Precision: What percentage of the non-expert's (T2) tags are accurate?

  Precision = |T1 ∩ T2| / |T2|

• Recall: What percentage of the true errors did the non-expert (T2) find?

  Recall = |T1 ∩ T2| / |T1|
Agreement Measures
[Diagram: the non-expert's tag set shown as a small region mostly inside the expert's larger tag set, illustrating high precision but low recall and low reliability]
Agreement Measures
Tagger pair   Essays   Recall   Precision   Reliability
J&L           1-10     .54      .58         .39
J&L           11-22    .57      .78         .49
J&N           1-10     .58      .37         .48
J&N           11-22    .54      .23         .27
L&N           1-10     .65      .60         .70
L&N           11-22    .78      .37         .36
Conclusions on Tagging Agreement
• The level of agreement on what counts as an error was unsatisfactory
• Disagreements are resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored
The Future
• Immediate
– Internet access to data and tools
– an error concordancer
– automatic part-of-speech and syntactic markup
– data from different ESL skill levels
• Long range
– a statistical tool to correlate error frequency with student background
– a student editing aid
– a grammar checker
– NNS speech data
Some Possible Applications
• Preparation of instructional materials
• Studies of progress over a semester
• Research on error types by L1
• Research on writing characteristics by L1
Writing Characteristics by L1
L1 Spanish: tense errors
1 {would/will}, 1 {went/go}, 1 {stay/stayed}, 1 {gave/give}, 1 {cannot/could}, 1 {can/could}
TOTAL: 6 (word count: 2,305)

L1 Gujarati: tense errors
5 {was/is}, 3 {were/are}, 2 {would/will}, 2 {is/was}, 2 {have/had}, 2 {had/have}, 1 {passes/passed}, 1 {love/loved}, 1 {left/leave}, 1 {kept/keeps}, 1 {involved/involves}, 1 {would start/started}, 1 {will/0}, 1 {will/were to}, 1 {was/were}, 1 {wanted/want}, 1 {spend/spent}, 1 {get/got}, 1 {do/did}, 1 {can/could}, 1 {are/were}
TOTAL: 31 (word count: 2,500)
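Because the two samples differ in size, the raw totals above are easier to compare when normalized. A small illustrative calculation using the totals and word counts from this slide (the per-1,000-word framing is an assumption, not an analysis the authors present):

```python
# Tense-error counts and word counts from the slide, per L1 group.
counts = {"Spanish": (6, 2305), "Gujarati": (31, 2500)}  # (tense errors, words)

for l1, (errors, words) in counts.items():
    rate = errors / words * 1000  # errors per 1,000 words
    print(f"{l1}: {rate:.1f} tense errors per 1,000 words")
# Spanish: 2.6 tense errors per 1,000 words
# Gujarati: 12.4 tense errors per 1,000 words
```

On this sample, the Gujarati-L1 writers make tense errors at roughly five times the rate of the Spanish-L1 writers, which is the kind of by-L1 contrast the proposed statistical tool would surface.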
Acknowledgments
Jacqueline Cassidy, Jennifer Higgins, Norma Pravec, Lenore Rosenbluth, Donna Samko, Jory Samkoff, Kae Shigeta