
The "Eifeler Regel": Formalizing a spelling rule of Lëtzebuergesch for the CORTINA Spell Checker

An "unsupervised learning" approach © 2000 CRP-GL

Overview

• What is the "Eifeler Regel"?
• What are the aims of the spell checker? How does the "Eifeler Regel" fit in?
• An empirical, rule-learning approach
• Data extraction and analysis
• Rule generation and testing
• Preliminary results
• Outlook

29/04/2020

The 'n'-deletion rule (Eifeler Regel) at work

D' seriös Serjanten op schmuddelegen Serpentinnen.

The 'n'-deletion rule affects at least 20% of the words in an average text!

A definition

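• "Um Enn vum Wuert fällt am Saz all -n ewech, wann dat Wuert drop mat engem Konsonant ufänkt, oons virun h- d- t- z[ts]- an n-. Viru Vokaler bleift den -n sin." (Arrêté ministériel du 10 octobre 1975, Mém. B-68/1976)
  English: "At the end of a word, every -n is dropped within the sentence when the following word begins with a consonant, except before h-, d-, t-, z[ts]- and n-. Before vowels the -n is retained."

Problems:
– Exceptions (words) are not defined
– Triggering contexts are not defined in detail
– The definition falls back on the intuitions of native speakers: "Mir schreiwen hei ëmmer, wéi mer schwätzen" ("Here we always write as we speak") (ibid.)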
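The rule as quoted above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the CORTINA implementation: it treats the first letter of the next word as a stand-in for its first sound, it knows no exception words, and the names `KEEP_BEFORE` and `apply_n_deletion` are made up for this example. Those gaps are the ones listed under "Problems".

```python
import re

# Assumption: letters approximate sounds. A final -n is kept before vowels
# and before h, d, t, z, n, as stated in the quoted definition.
VOWELS = "aeiouäëéèêöüáàâíîóôúû"
KEEP_BEFORE = set(VOWELS + "hdtzn")


def apply_n_deletion(sentence: str) -> str:
    """Drop word-final -n / -nn wherever the quoted rule says so (no exceptions)."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        first = nxt[:1].lower()
        # Only act inside the sentence: the last word, or a word followed by
        # a digit or punctuation mark, is left untouched.
        if word.lower().endswith("n") and first.isalpha() and first not in KEEP_BEFORE:
            word = re.sub(r"nn?$", "", word)
        out.append(word)
    return " ".join(out)


print(apply_n_deletion("den Mann"))  # next word starts with m: -n is dropped
print(apply_n_deletion("den Dag"))   # next word starts with d: -n is kept
```

Because the sketch has no word list, it would also strip the -n from words that never lose it, which is why the exceptions have to be learned or listed separately.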

Clearcut cases …

[Figure: example words arranged between two poles marked "n"]

Ären éischten ënnen öffentlechen An Aarbechten Aen Aktivitéiten ...

Ëmstänn Aktioun Associatioun Autobunn Bühn Camion Commission Décisioun ...

Steen/Stee Téin/Téi Eltren Examen Führerschäin Schäin Italien Spuenien Reen Wäin ausgesinn ? ...

Aims of the CORTINA project

• Development of a spellchecker prototype for Luxemburgish
• Integration in various existing text processing applications
• Currently under development: Winword, StarWriter

Sentences = Words in context

Integration of the n-deletion rule is a decisive factor for user acceptance

Aims of my study

• Contribute to a more precise description of the 'n'-deletion in Luxemburgish
• Allow for later extensions of the Cortina dictionary
  – by developers: the dictionary is planned to be expanded from the current 50,000 to 150,000 word forms (CORTINA-2)
  – by users
• Automatic discovery of rules, wherever possible

Approach: Empirical, rule-based

• A text collection (corpus) shows the 'n'-deletion rule as used by "real people"
• Classes of 'n'-word instances were collected
• These were used to generate rule hypotheses
• The "space of possible classes" was searched for the top-performing rule set

The corpus: A collection of texts

• Source: Institut Grand Ducal (IGD-LEO)
• 40 + 8 texts were available
• Words-in-context extracted using Perl scripts
• File name, date, genre and word position were retained
• Corpus toolkit
  – character conversions
  – selection of sub-corpora (samples) based on genre, author, year
  – written in Perl

[Chart: corpus size in words by genre (Spoken, Juridical, Literature) for the 40 and the 8 texts. Values shown: 126000, 588000, 762000, 184000, 102000.]
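The extraction step was done with Perl scripts that are not shown in the slides. As a rough idea of what "words in context" with retained metadata might look like, here is a minimal Python sketch. The record layout (`Occurrence`), the tokenisation pattern and the function name are assumptions for illustration only.

```python
import re
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    file_name: str
    date: str
    genre: str
    position: int   # index of the token within the text
    word: str
    next_word: str  # "" when a number, punctuation mark or the text end follows


# Tokens are runs of letters, runs of digits, or single punctuation marks.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")


def words_in_context(text: str, file_name: str, date: str, genre: str) -> List[Occurrence]:
    """Return one record per word, together with the word that follows it."""
    tokens = TOKEN.findall(text)
    records = []
    for i, tok in enumerate(tokens):
        if not tok.isalpha():
            continue  # numbers and punctuation are contexts, not words
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        records.append(
            Occurrence(file_name, date, genre, i, tok, nxt if nxt.isalpha() else "")
        )
    return records


for rec in words_in_context("De Mann an den Hond.", "sample.txt", "2000", "Spoken"):
    print(rec.position, rec.word, repr(rec.next_word))
```

Leaving `next_word` empty before punctuation keeps sentence-final words apart from words that have a real following context, which matters for the statistics on keeping and dropping contexts.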

Classification of 'n'-words

• Definition of an 'n'-word: { vowel | consonant } vowel [ 'n' | 'nn' ]
• Treated as "same word" (individual): ( kru | krun | krunn )
• Variables of the statistic: probability of "keeping the -n"…
  – in "keeping" context,
  – in "dropping" context,
  – in unknown or zero contexts
• Confidence intervals with different probability distributions
• "Hand-crafted" thresholds determine class membership
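The classification step might be sketched as follows. The slides do not say which intervals were used, so the Wilson score interval here is a stand-in chosen for this example. The vowel inventory, the context test by first letter, the default threshold of 0.96 (borrowed from the label on the "fuzzy cases" slide) and all function names are likewise assumptions.

```python
import math
import re
from collections import defaultdict

VOWELS = "aeiouäëéèêöüáàâíîóôúû"       # assumed vowel inventory
KEEP_BEFORE = set(VOWELS + "hdtzn")     # initials before which -n is kept
# 'n'-word: any letters, then a vowel, then final 'n' or 'nn'.
N_WORD = re.compile(rf"^([^\W\d_]*[{VOWELS}])(nn?)$", re.IGNORECASE)


def stem_of(word):
    """Map kru / krun / krunn to the same individual 'kru'."""
    m = N_WORD.match(word)
    return m.group(1).lower() if m else word.lower()


def context_of(next_word):
    if not next_word:
        return "zero"
    return "keeping" if next_word[0].lower() in KEEP_BEFORE else "dropping"


def keep_counts(occurrences):
    """occurrences: (word, next_word) pairs -> stem -> context -> [kept, total]."""
    rows, seen_with_n = [], set()
    for word, nxt in occurrences:
        kept = bool(N_WORD.match(word))
        if kept:
            seen_with_n.add(stem_of(word))
        rows.append((stem_of(word), kept, context_of(nxt)))
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for stem, kept, ctx in rows:
        if stem in seen_with_n:          # ignore words never seen with -n
            counts[stem][ctx][0] += kept
            counts[stem][ctx][1] += 1
    return counts


def wilson(kept, total, z=1.96):
    """Wilson score interval for the probability of keeping the -n."""
    if total == 0:
        return 0.0, 1.0
    p = kept / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def classify(kept, total, threshold=0.96):
    low, high = wilson(kept, total)
    if low > threshold:
        return "safe keeper"
    if high < 1 - threshold:
        return "safe dropper"
    return "fuzzy"
```

With these settings a word needs on the order of a hundred consistent observations before it leaves the "fuzzy" class, which illustrates why the thresholds had to be tuned by hand.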

Clearcut cases, fuzzy cases

[Figure: the same example words between two poles marked "n", with the labels "> 0.96" and "= 0.5"]

> 0.96

Ären éischten ënnen öffentlechen An Aarbechten Aen Aktivitéiten ...

Ëmstänn Aktioun Associatioun Autobunn Bühn Camion Commission Décisioun ...

Steen/Stee Téin/Téi Eltren Examen Führerschäin Schäin Italien Spuenien Reen Wäin ausgesinn ? ...

= 0.5

Rule abstraction

• Using the data directly, only 22.5% of the CORTINA word stock could have been annotated
• Needed: more abstract "rules" that say whether or not a word is an 'n'-dropper
• The "safe keeper" and "safe dropper" classes were used to generate rule hypotheses:
  – 4 rule types
  – Example: "ends-with iou ( n | nn)"
  – Criteria: maximum clarification, maximum generality
• Modelled after: (Mikheev 1994), (Brill 1995, 1997)
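One way to picture the generation of "ends-with" hypotheses is the sketch below. It covers only this one rule type, and it reads "maximum clarification" as "the suffix never occurs among safe droppers" and "maximum generality" as "prefer the shortest such suffix". Both readings, the parameters `max_len` and `min_support`, and the toy word stems (including the invented stem "vou") are assumptions made for illustration, not the procedure used in the study.

```python
from collections import Counter


def suffix_rules(keepers, droppers, max_len=4, min_support=2):
    """Propose stem suffixes that pick out safe keepers but no safe droppers."""
    keep_sfx, drop_sfx = Counter(), Counter()
    for stems, counter in ((keepers, keep_sfx), (droppers, drop_sfx)):
        for stem in stems:
            for k in range(1, min(max_len, len(stem)) + 1):
                counter[stem[-k:]] += 1
    # Clarification: keep suffixes that are unseen among droppers and are
    # supported by at least `min_support` keepers.
    pure = {s: c for s, c in keep_sfx.items()
            if drop_sfx[s] == 0 and c >= min_support}
    # Generality: discard a suffix when a shorter pure suffix already covers it.
    rules = [s for s in pure if not any(s != t and s.endswith(t) for t in pure)]
    return sorted(rules, key=lambda s: (-pure[s], s))


# Toy stems with the final -n / -nn already removed (class labels are invented).
keepers = ["aktiou", "associatiou", "décisiou"]
droppers = ["äre", "éischte", "ënne", "vou"]
for suffix in suffix_rules(keepers, droppers):
    print(f"ends-with {suffix} ( n | nn)")
```

On this toy input the only surviving hypothesis has the same shape as the example rule quoted on the slide, because the shorter suffixes "u" and "ou" also occur in the dropper class.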

Searching the rule space

• The classification thresholds, statistical assumptions and rule types were varied systematically (by a program)
• Each resulting rule-set was tried against the test corpus (8 texts, not used for training)
• "Blocking collocations" were not counted
• Type maxima – token maxima
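The systematic variation can be pictured as a plain grid search scored on held-out data. The sketch below is an assumption about the general shape of such a program: `build` stands for whatever procedure turns one parameter setting into a rule set (here a predicate on word stems), and the parameter names and toy data are hypothetical. Scoring by distinct words and by occurrences separately reflects the "type maxima – token maxima" distinction.

```python
from itertools import product


def accuracy(predict, test_tokens):
    """test_tokens: (stem, is_keeper) pairs, one per occurrence in the test corpus."""
    token_hits = sum(predict(stem) == gold for stem, gold in test_tokens)
    types = dict(test_tokens)  # one entry per distinct stem
    type_hits = sum(predict(stem) == gold for stem, gold in types.items())
    return type_hits / len(types), token_hits / len(test_tokens)


def search(grid, build, test_tokens):
    """Try every parameter combination; return the best setting by type and by token."""
    results = []
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        type_acc, token_acc = accuracy(build(params), test_tokens)
        results.append((params, type_acc, token_acc))
    best_by_type = max(results, key=lambda r: r[1])
    best_by_token = max(results, key=lambda r: r[2])
    return best_by_type, best_by_token


# Toy setup: one setting yields a useful rule, the other a trivial one.
grid = {"threshold": [0.5, 0.96], "interval": ["binomial", "normal"]}


def build(params):
    if params["threshold"] == 0.5:
        return lambda stem: stem.endswith("iou")
    return lambda stem: True


test_tokens = [("aktiou", True), ("aktiou", True), ("äre", False)]
best_by_type, best_by_token = search(grid, build, test_tokens)
print(best_by_type)
print(best_by_token)
```

Keeping the test texts out of training, as the slide states, is what makes these scores a fair estimate of how a rule set behaves on unseen words.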

Example ruleset

(0.5/0.5, binomial confidence)

ends-with Ma + 'nn'
ends-with E + 'nn'
ends-with ou + N
ends-with Hä + 'nn'
ends-with to + 'nn'
ends-with ssio + N
ends-with So + 'nn'
ends-with ai + N
ends-with tro + N
ends-with Zä + 'nn'
ends-with ro + 'nn'
ends-with rsta + N
ends-with Wo + N
ends-with bu + 'nn'
ends-with mio + N
ends-with la + N
ends-with fo + N
ends-with h + N
ends-with chi + N
ends-with zi + N
ends-with stio + N
ends-with Li + 'nn'
ends-with Ëmstä + 'nn'
ends-with täi + N
ends-with ço + N
ends-with r + N
ends-with nta + N
ends-with ko + 'nn'
ends-with llo + N
ends-with ma + 'nn'
ends-with Ho + 'nn'
ends-with pho + N
ends-with ndi + N
ends-with mä + N
ends-with co + N
...

Results

• Ruleset scores varied between 88% and 97%
• Evaluation: the results of tagging by rule and of interactive (manual) tagging were compared
• Current result: 77.6% agreement
• Currently investigating: what caused the misses?

More results

• Exception classes can be analyzed
• Sheds light on explanations supplied by earlier research
• New data on "blocking contexts"
  – e.g., (e)lo, (e)ran, (e)raus, (e)rëm, (e)rop, (e)sou…

Outlook

• Can a fully automatic procedure be achieved?
• Divergence between "officially correct" and "good practice"
• Official reading of the "Eifeler Regel" should be restated to capture language practice more accurately