The Montclair Electronic Language Learner Database


The Montclair Electronic Language Learner Database (MELD)

www.chss.montclair.edu/linguistics/MELD/

Eileen Fitzpatrick & Steve Seegmiller, Montclair State University

Non-native speaker (NNS) corpora

• Begun in the early 1990s
• Data
  – written performance only
  – essays of students of English as a foreign language
• Corpus development (academic)
  – in Europe: Louvain, Lodz, Uppsala
  – in Asia: Tokyo Gakugei University, Hong Kong Univ of Science and Technology
• Annotation
  – Lodz: part of speech
  – HKUST, Lodz: error tags

Gaps in NNS Corpus Creation

• No NNS corpus in America, so no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types

MELD Goals

• Initial Goals
  – Collect ESL student writing
  – Tag writing for error
  – Provide publicly available NNS data
• Initial Goals support
  – 2nd language pedagogy
  – language acquisition research
  – tool building (grammar checkers, student editing aids, parallel texts from NS and NNS)

MELD Overview

• Data
  – 44,477 words of text annotated
  – 53,826 more words of raw data
  – language and education data for each student author
  – upper-level ESL students
• Tools written to
  – link essays to student background data
  – produce an error-free version from tagged text
  – allow fast entry of background data

Annotation

• Annotators “reconstruct” a grammatical form, marking each error as {error/reconstruction}:

school systems {is/are}
since children {0/are} usually inspired
becoming {a/0} good citizens
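
This notation is easy to process mechanically. As a minimal sketch (hypothetical code, not MELD's actual tool), the tags can be stripped to recover either the student's original wording or the annotator's reconstruction, which is essentially what the error-free-version tool mentioned above produces:

```python
import re

# One MELD-style tag: {error/reconstruction}; "0" marks an empty (missing or extra) form.
TAG = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")

def strip_tags(tagged_text, keep="reconstruction"):
    """Return plain text: keep="error" gives the student's original wording,
    keep="reconstruction" gives the annotator's corrected wording."""
    group = 1 if keep == "error" else 2
    def pick(m):
        form = m.group(group)
        return "" if form == "0" else form
    text = TAG.sub(pick, tagged_text)
    return re.sub(r"\s{2,}", " ", text).strip()   # tidy spaces left by removed "0" forms

tagged = "school systems {is/are} since children {0/are} usually inspired"
print(strip_tags(tagged, keep="error"))           # school systems is since children usually inspired
print(strip_tags(tagged, keep="reconstruction"))  # school systems are since children are usually inspired
```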

• Agreement between annotators is an issue

Error Classification from a Predetermined List

• Benefit
  – annotators agree on what an error is: only those items in the classification scheme
• Problems
  – annotators have to learn a classification scheme
  – the existence of a classification scheme means that the annotators can misclassify
  – errors not in the scheme will be missed

Error Identification & Reconstruction

• Benefits
  – speed in annotating, since there is no classification scheme to learn
  – no chance of misclassifying
  – less common errors will be captured
  – a reconstructed text can be more easily parsed and tagged for part of speech
• Question
  – How well can we agree on what is an error?


Agreement Measures

• Reliability: What percentage of the errors do both taggers tag?

  |T1 ∩ T2| / ((|T1| + |T2|) / 2)

• Precision: What percentage of the non-expert’s (T2) tags are accurate?

  |T1 ∩ T2| / |T2|

• Recall: What percentage of true errors did the non-expert (T2) find?

  |T1 ∩ T2| / |T1|

(T1 = the set of errors tagged by the expert; T2 = the set tagged by the non-expert.)
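
For concreteness, here is a minimal Python sketch of the three measures, assuming each tagger's output can be reduced to a set of tagged error locations; it illustrates the formulas above and is not MELD's own evaluation code.

```python
def agreement(expert_tags, nonexpert_tags):
    """expert_tags = T1, nonexpert_tags = T2; both are sets of tagged error locations."""
    both = expert_tags & nonexpert_tags
    reliability = len(both) / ((len(expert_tags) + len(nonexpert_tags)) / 2)
    precision = len(both) / len(nonexpert_tags)   # share of T2's tags that are true errors
    recall = len(both) / len(expert_tags)         # share of true errors that T2 found
    return reliability, precision, recall

# Toy data: the expert marks 5 errors, the non-expert marks 4, and 3 of them coincide.
t1 = {("essay1", 3), ("essay1", 9), ("essay1", 14), ("essay2", 2), ("essay2", 7)}
t2 = {("essay1", 3), ("essay1", 9), ("essay2", 2), ("essay2", 11)}
print(agreement(t1, t2))   # -> (0.666..., 0.75, 0.6)
```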

Agreement Measures

[Diagram: overlap of Non-expert and Expert tags, illustrating high precision, low recall, and low reliability]

Agreement Measures

J&L           Recall   Precision   Reliability
Essay 1-10     .54       .58         .39
Essay 11-22    .57       .78         .49

J&N           Recall   Precision   Reliability
Essay 1-10     .58       .37         .48
Essay 11-22    .54       .23         .27

L&N           Recall   Precision   Reliability
Essay 1-10     .65       .60         .70
Essay 11-22    .78       .37         .36

Conclusions on Tagging Agreement

• Unsatisfactory level of agreement as to what is an error
• Disagreements resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored

The Future

• Immediate
  – Internet access to data and tools
  – an error concordancer (sketched below)
  – automatic part-of-speech and syntactic markup
  – data from different ESL skill levels
• Long range
  – statistical tool to correlate error frequency with student background
  – student editing aid
  – grammar checker
  – NNS speech data
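
As a rough sketch of what the planned error concordancer could look like (hypothetical code, not a description of the eventual MELD tool), the {error/reconstruction} tags themselves can serve as search keys for a keyword-in-context display:

```python
import re

TAG = re.compile(r"\{([^/{}]*)/([^/{}]*)\}")   # MELD-style {error/reconstruction} tag

def error_concordance(tagged_essays, query, context=30):
    """Yield keyword-in-context lines for every tag whose error form matches `query`."""
    for essay_id, text in tagged_essays.items():
        for m in TAG.finditer(text):
            if re.search(query, m.group(1)):                     # match against the error form
                left = text[max(0, m.start() - context):m.start()]
                right = text[m.end():m.end() + context]
                yield f"{essay_id}: ...{left}[{m.group(0)}]{right}..."

essays = {"essay1": "school systems {is/are} since children {0/are} usually inspired"}
for line in error_concordance(essays, r"^is$"):
    print(line)
```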

Some Possible Applications

• Preparation of instructional materials
• Studies of progress over a semester
• Research on error types by L1
• Research on writing characteristics by L1

Writing Characteristics by L1

L1 Spanish: tense errors

1 {would/will}, 1 {went/go}, 1 {stay/stayed}, 1 {gave/give}, 1 {cannot/could}, 1 {can/could}
TOTAL: 6   Word count: 2,305

L1 Gujarati: tense errors

5 {was/is}, 1 {passes/passed}, 3 {were/are}, 1 {love/loved}, 2 {would/will}, 2 {is/was}, 2 {have/had}, 1 {left/leave}, 1 {kept/keeps}, 1 {involved/involves}, 2 {had/have}, 1 {would start/started}, 1 {will/0}, 1 {will/were to}, 1 {was/were}, 1 {wanted/want}, 1 {spend/spent}, 1 {get/got}, 1 {do/did}, 1 {can/could}, 1 {are/were}
TOTAL: 31   Word count: 2,500
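
Normalizing the raw counts above by essay length makes the contrast clearer: 6 / 2305 ≈ 2.6 tense errors per 1,000 words for the L1 Spanish sample versus 31 / 2500 ≈ 12.4 per 1,000 words for the L1 Gujarati sample. (These back-of-the-envelope rates are computed here from the totals shown, not taken from the original slides.)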

Acknowledgments

Jacqueline Cassidy, Jennifer Higgins, Norma Pravec, Lenore Rosenbluth, Donna Samko, Jory Samkoff, Kae Shigeta