Transcript: The Montclair Electronic Language Learner Database (MELD)
www.chss.montclair.edu/linguistics/MELD/
Eileen Fitzpatrick & Steve Seegmiller, Montclair State University
Non-native speaker (NNS) corpora
• Begun in the early 1990s
• Data
– written performance only
– essays of students of English as a foreign language
• Corpus development (academic)
– in Europe: Louvain, Lodz, Uppsala
– in Asia: Tokyo Gakugei University, Hong Kong University of Science and Technology
• Annotation
– Lodz: part of speech
– HKUST, Lodz: error tags
Gaps in NNS Corpus Creation
• No NNS corpus exists in America, so there is no corpus of English as a Second Language (ESL)
• No NNS corpus is publicly available
• No NNS corpus annotates errors without a predetermined list of error types
MELD Goals
• Initial goals
– collect ESL student writing
– tag the writing for errors
– provide publicly available NNS data
• These goals support
– second-language pedagogy
– language acquisition research
– tool building (grammar checkers, student editing aids, parallel texts from NS and NNS writers)
MELD Overview
• Data
– 44,477 words of annotated text
– 53,826 more words of raw data
– language and education data for each student author
– upper-level ESL students
• Tools written to
– link essays to student background data
– produce an error-free version from tagged text
– allow fast entry of background data
Annotation
• Annotators “reconstruct” a grammatical form: {error/reconstruction}
school systems {is/are} since children {0/are} usually inspired becoming {a/0} good citizens
• Agreement between annotators is an issue
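The {error/reconstruction} notation above can be processed mechanically. Below is a minimal sketch of how an error-free version might be produced from tagged text, as the MELD tools do; the function names and regex are illustrative assumptions, not MELD's actual implementation.

```python
import re

# MELD-style annotation: {error/reconstruction}; "0" marks an empty side
# (i.e. a deletion or an insertion).
TAG = re.compile(r"\{([^{}/]*)/([^{}]*)\}")

def reconstruct(text: str) -> str:
    """Return the error-free version: keep each reconstruction, drop the error."""
    def repl(m):
        fix = m.group(2)
        return "" if fix == "0" else fix
    # Collapse doubled spaces left behind by deletions such as {a/0}.
    return re.sub(r"\s{2,}", " ", TAG.sub(repl, text)).strip()

def original(text: str) -> str:
    """Return the learner's original text: keep each error, drop the reconstruction."""
    def repl(m):
        err = m.group(1)
        return "" if err == "0" else err
    return re.sub(r"\s{2,}", " ", TAG.sub(repl, text)).strip()

tagged = "children {0/are} usually inspired becoming {a/0} good citizens"
print(reconstruct(tagged))  # children are usually inspired becoming good citizens
print(original(tagged))     # children usually inspired becoming a good citizens
```

Because the tags are transparent, the same pass that strips them can feed the reconstructed text to a part-of-speech tagger or parser.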
Error Classification from a Predetermined List
• Benefit
– annotators agree on what an error is: only the items in the classification scheme
• Problems
– annotators must learn the classification scheme
– the existence of a classification scheme means that annotators can misclassify
– errors not in the scheme will be missed
Error Identification & Reconstruction
• Benefits
– faster annotation, since there is no classification scheme to learn
– no chance of misclassification
– less common errors are captured
– a reconstructed text is more easily parsed and tagged for part of speech
• Question
– How well can annotators agree on what an error is?
Agreement Measures
• Reliability: What percentage of the errors do both taggers tag?

  Reliability = |T1 ∩ T2| / ((|T1| + |T2|) / 2)

• Precision: What percentage of the non-expert's (T2) tags are accurate?

  Precision = |T1 ∩ T2| / |T2|

• Recall: What percentage of the true errors did the non-expert (T2) find?

  Recall = |T1 ∩ T2| / |T1|
Agreement Measures
[Diagram: the non-expert's tag set shown as a small region mostly inside the expert's larger tag set, illustrating high precision but low recall and low reliability]
Agreement Measures
Tagger pair   Essays   Recall   Precision   Reliability
J&L           1-10     .54      .58         .39
J&L           11-22    .57      .78         .49
J&N           1-10     .58      .37         .48
J&N           11-22    .54      .23         .27
L&N           1-10     .65      .60         .70
L&N           11-22    .78      .37         .36
Conclusions on Tagging Agreement
• The level of agreement on what counts as an error was unsatisfactory
• Disagreements are resolved through regular meetings
• There are now two types of tags: one for lexico-syntactic errors and one for stylistic errors
• The tags are transparent to the user and can be deleted or ignored
The Future
• Immediate
– Internet access to data and tools
– an error concordancer
– automatic part-of-speech and syntactic markup
– data from different ESL skill levels
• Long range
– a statistical tool to correlate error frequency with student background
– a student editing aid
– a grammar checker
– NNS speech data
Some Possible Applications
• Preparation of instructional materials
• Studies of progress over a semester
• Research on error types by L1
• Research on writing characteristics by L1
Writing Characteristics by L1
L1 Spanish: tense errors
1 {would/will}, 1 {went/go}, 1 {stay/stayed}, 1 {gave/give}, 1 {cannot/could}, 1 {can/could}
TOTAL: 6 (word count: 2,305)

L1 Gujarati: tense errors
5 {was/is}, 3 {were/are}, 2 {would/will}, 2 {is/was}, 2 {have/had}, 2 {had/have}, 1 {passes/passed}, 1 {love/loved}, 1 {left/leave}, 1 {kept/keeps}, 1 {involved/involves}, 1 {would start/started}, 1 {will/0}, 1 {will/were to}, 1 {was/were}, 1 {wanted/want}, 1 {spend/spent}, 1 {get/got}, 1 {do/did}, 1 {can/could}, 1 {are/were}
TOTAL: 31 (word count: 2,500)
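Because the two samples differ in size, the raw totals above are easier to compare when normalized. A small illustrative calculation using the totals and word counts from this slide (the per-1,000-word framing is an assumption, not an analysis the authors present):

```python
# Tense-error counts and word counts from the slide, per L1 group.
counts = {"Spanish": (6, 2305), "Gujarati": (31, 2500)}  # (tense errors, words)

for l1, (errors, words) in counts.items():
    rate = errors / words * 1000  # errors per 1,000 words
    print(f"{l1}: {rate:.1f} tense errors per 1,000 words")
# Spanish: 2.6 tense errors per 1,000 words
# Gujarati: 12.4 tense errors per 1,000 words
```

On this sample, the Gujarati-L1 writers make tense errors at roughly five times the rate of the Spanish-L1 writers, which is the kind of by-L1 contrast the proposed statistical tool would surface.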
Acknowledgments
Jacqueline Cassidy, Jennifer Higgins, Norma Pravec, Lenore Rosenbluth, Donna Samko, Jory Samkoff, Kae Shigeta