Transcript Slide 1

Morphology 3: Unsupervised Morphology Induction

Sudeshna Sarkar, IIT Kharagpur

Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL

John Goldsmith, Department of Linguistics, The University of Chicago

Unsupervised learning

Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words -- but no tagging or other text preparation.

Output: a list of stems, suffixes, and prefixes, and a list of signatures.

A signature: the list of all suffixes (or prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature.

 … A

signature

has a unique set of stems associated with it

(example of signature in English)

 NULL.ed.ing.s ask call = ask call point point asked asking called calling calls pointed asks pointing points

…output

Roots ("stems of stems") and the inner structure of stems.

Regular allomorphy of stems: e.g., learn "delete stem-final -e in English before -ing and -ed".

Essence of Minimum Description Length (MDL)

Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989); work by Michael Brent and Carl de Marcken on word discovery using MDL.

We are given:
1. a corpus, and
2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.

The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.

Better said: -1 * log probability(corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data.

This is known as the optimal compressed length of the data, given the model. Using base-2 logs, this number is a measure in information-theoretic bits.

Essence of MDL…

The goodness of the morphology is also measured by how compact the morphology is. We can measure the compactness of a morphology in information-theoretic bits.

How can we measure the compactness of a morphology?

  Let’s consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.

Naive Minimum Description Length

Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs -- total: 61 letters.

Analysis:
Stems: jump, laugh, sing, sang, dog (20 letters)
Suffixes: s, ing, ed (6 letters)
Unanalyzed: the (3 letters)
Total: 29 letters.

Notice that the description length goes UP if we analyze sing into s+ing.
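A minimal sketch of this letter-count comparison (the helper name letters and the script itself are illustrative, not part of Linguistica):

# Naive description length: letters in the raw word list versus letters
# in the proposed stems + suffixes + unanalyzed words.
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    """Total number of letters in a list of strings."""
    return sum(len(w) for w in words)

print("unanalyzed corpus:", letters(corpus), "letters")        # 61
print("analyzed description:",
      letters(stems) + letters(suffixes) + letters(unanalyzed), "letters")  # 29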

Essence of MDL…

The best overall theory of a corpus is the one for which the sum

  -log prob(corpus) + length of the morphology

(that is, the description length) is the smallest.

Essence of MDL…

[Bar chart: total description length (length of morphology + compressed length of the corpus, on a scale up to about 700,000 bits) compared for the best analysis, an elegant theory that works badly, and a baroque theory modeled on the data; the best analysis minimizes the sum.]

Overall logic

Search through morphology space for the morphology which provides the smallest description length.

1. Application of MDL to iterative search of morphology space, with successively finer-grained descriptions.

Corpus: pick a large corpus from a language -- 5,000 to 1,000,000 words.

[Flow diagram, built up over several slides: corpus → bootstrap heuristic → morphology → incremental heuristics → modified morphology → MDL check → …]

Feed the corpus into the "bootstrapping" heuristic, out of which comes a preliminary morphology, which need not be superb.

Feed that morphology to the incremental heuristics; out comes a modified morphology.

Is the modification an improvement? Ask MDL! If it is an improvement, replace the morphology (the old one is discarded), and send the result back to the incremental heuristics again.

Continue until there are no improvements to try. (A sketch of this loop follows.)
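The loop in this diagram can be condensed into a short sketch; bootstrap, propose_modifications, and description_length are hypothetical stand-ins for the heuristics and the MDL computation described here, not Linguistica's actual interface:

def mdl_search(corpus, bootstrap, propose_modifications, description_length):
    """Greedy MDL search: accept a modified morphology only if it lowers
    the total description length (morphology + compressed corpus)."""
    morphology = bootstrap(corpus)                      # preliminary morphology
    best = description_length(morphology, corpus)
    improved = True
    while improved:                                     # until no improvements to try
        improved = False
        for candidate in propose_modifications(morphology, corpus):
            dl = description_length(candidate, corpus)
            if dl < best:                               # ask MDL
                morphology, best = candidate, dl
                improved = True
    return morphology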

1. Bootstrap heuristic

A function that takes words as input and gives an initial hypothesis regarding what the stems and affixes are.

In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least prod_{i=1}^{|V|} |w_i| members (the product over all words in the vocabulary V).

Better bootstrap heuristics

Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955). Current variant: cut words at certain peaks of successor frequency.

Problems: it can over-cut; it can under-cut; and it can put cuts too far to the right (the "aborti-" problem). [Not a problem!]

Successor frequency

g o v e r n -- Empirically, only one letter follows "gover": "n".

g o v e r n {e i m o s #} -- Empirically, 6 symbols follow "govern": e, i, m, o, s, and word-end (#).

g o v e r n m e -- Empirically, 1 letter follows "governm": "e".

So the successor frequencies run 1 (after "gover"), 6 (after "govern"), 1 (after "governm"): a peak of successor frequency at the govern | ment boundary.

Lots of errors…

Successor frequencies over the letters of "conservatives": 9 18 11 6 4 1 2 1 1 2 1 1. Of the cuts these peaks suggest, some are wrong and only one is right.

Even so…

We set conditions: accept cuts with stems at least 5 letters in length; demand that the successor frequency be a clear peak: 1 … N … 1 (e.g. govern-ment). Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems. (A sketch of this cut heuristic follows.)
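A rough sketch of the successor-frequency cut under the conditions above (stem of at least 5 letters, a clear 1 … N … 1 peak); the function names and the toy lexicon are mine, not the program's:

def successor_frequencies(word, lexicon):
    """For each prefix of `word`, count the distinct letters (or end-of-word '#')
    that follow it among the words in `lexicon`."""
    freqs = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] if len(w) > i else "#"
                      for w in lexicon if w.startswith(prefix)}
        freqs.append(len(successors))
    return freqs

def cut_word(word, lexicon, min_stem=5):
    """Cut at a clear successor-frequency peak 1 ... N ... 1 (e.g. govern-ment)."""
    sf = successor_frequencies(word, lexicon)
    for k in range(min_stem - 1, len(word) - 1):
        before = sf[k - 1] if k > 0 else 1
        after = sf[k + 1] if k + 1 < len(sf) else 1
        if before == 1 and sf[k] > 1 and after == 1:
            return word[:k + 1], word[k + 1:]           # (stem, suffix)
    return word, ""                                     # leave unanalyzed

lexicon = {"govern", "governs", "governed", "governing", "government", "governor"}
print(cut_word("government", lexicon))                  # ('govern', 'ment')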

2. Incremental heuristics

Coarse-grained to fine-grained:

1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix.

2. Loose fit: suffixes and signatures to split. Collect any string that precedes a known suffix, find all of its apparent suffixes, and use MDL to decide whether the analysis is worth doing. We'll return to this in a moment.

Incremental heuristic

3. Slide the stem-suffix boundary to the left: again, use MDL to decide.

How do we use MDL to decide?

Using MDL to judge a potential stem

act, acted, action, acts. We have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s.

Let's compute the cost versus the savings of the signature NULL.ed.ion.s.

Savings: stem savings: the three redundant copies of the stem act -- about 12 letters, which is almost 60 bits.

Cost of NULL.ed.ion.s

A pointer to each suffix:

  log([W]/[NULL]) + log([W]/[ed]) + log([W]/[ion]) + log([W]/[s])

To give a feel for this: log([W]/[ed]) ≈ 5. Total cost of the suffix-pointer list: about 30 bits.

Cost of the pointer to the signature: log([W]/[# stems that use this signature]) ≈ 13 bits -- all the stems using the signature chip in to pay for its cost, though.

Cost of the signature: about 45 bits. Savings: about 60 bits. So MDL says: do it! Analyze the words as stem + suffix.

Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already "existed".
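A back-of-the-envelope sketch of this cost-versus-savings arithmetic. The per-letter cost is taken as log2(26), and all token counts are invented purely for illustration:

import math

BITS_PER_LETTER = math.log2(26)                     # ~4.7 bits per letter

def pointer_cost(total, count):
    """Bits for a pointer to an item seen `count` times out of `total` tokens."""
    return math.log2(total / count)

W = 100_000                                         # invented: total word tokens
suffix_counts = {"NULL": 40_000, "ed": 3_000, "ion": 1_500, "s": 20_000}   # invented
stems_using_sig = 30                                # invented: stems using NULL.ed.ion.s

cost = (sum(pointer_cost(W, c) for c in suffix_counts.values())
        + pointer_cost(W, stems_using_sig))         # suffix pointers + signature pointer
savings = 12 * BITS_PER_LETTER                      # ~12 redundant stem letters saved

print(f"cost {cost:.1f} bits vs. savings {savings:.1f} bits")
# savings exceed cost, so MDL says: adopt the stem + suffix analysis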

Today’s presentation

1. The task: unsupervised learning
2. Overview of program and output
3. Overview of the Minimum Description Length framework
4. Application of MDL to iterative search of morphology space, with successively finer-grained descriptions
5. Mathematical model
6. Current capabilities
7. Current challenges

Model

 A model to give us a probability of each word in the corpus (hence, its optimal compressed length); and  A morphology whose length we can measure.

Frequency of analyzed word

W is analyzed as belonging to signature σ, with stem T and suffix F.

  freq(T + F) = freq(σ) * freq(T | σ) * freq(F | σ)
              = ([σ]/[W]) * ([T]/[σ]) * ([F in σ]/[σ])

where [x] means the count of x's in the corpus (token count), and [W] is the total number of words.

Actually, what we care about is the log of this:

  compressed length(word T + F) = -log freq(T + F)
      = log([W]/[σ]) + log([σ]/[T]) + log([σ]/[F in σ])
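A small sketch of this word-probability formula; the arguments are token counts, and the example numbers are invented:

import math

def compressed_length(total_words, sig_count, stem_count, suffix_in_sig_count):
    """-log2 probability of a word analyzed as stem T + suffix F in signature sigma:
    log([W]/[sigma]) + log([sigma]/[T]) + log([sigma]/[F in sigma])."""
    return (math.log2(total_words / sig_count)
            + math.log2(sig_count / stem_count)
            + math.log2(sig_count / suffix_in_sig_count))

# Invented counts: [W] = 50,000 word tokens; the signature accounts for 2,000 of them,
# the stem for 300, and the suffix (within this signature) for 500.
print(round(compressed_length(50_000, 2_000, 300, 500), 2), "bits")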

Next, let’s see how to measure the length of a morphology

A morphology is a set of 3 things:  A list of stems;  A list of suffixes;  A list of signatures with the associated stems.

We’ll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.

Length of a list

A header telling us how long the list is, of length (roughly) log2 N, where N is the length; then N entries. What's in an entry?

Raw lists: a list of strings of letters, where the length of each letter is log2(26) -- the information content of a letter (we can use a more accurate conditional probability).

Pointer lists: lists of pointers to entries in a raw list. For example:
  Raw suffix list: ed, s, ing, ion, able, …
  Signature 1: suffixes: pointer to "ing", pointer to "ed"
  Signature 2: suffixes: pointer to "ing", pointer to "ion"
The length of each pointer is log2(# of suffixed words / # of occurrences of this suffix) -- usually cheaper than the letters themselves.

The fact that a pointer to a symbol gets shorter as the symbol gets more frequent is the key: we want the shortest overall grammar, so that means maximizing the re-use of units (stems, affixes, signatures, etc.).

(ii) Suffix list:  sum over f in Suffixes of ( λ * |f| + log([W_A]/[f]) )

(iii) Stem list:   sum over t in Stems of ( λ * |t| + log([W]/[t]) )

Here λ is the cost per letter (about log2 26 bits), |x| is the length of x in letters, [x] is the token count of x, and [W_A] is the number of analyzed word tokens. The first term in each sum pays for the letters, the second for the structure. Plus the signatures, which we'll get to shortly.

Information contained in the Signature component:

  sum over σ in Signatures of log([W]/[σ])          (the list of pointers to signatures)

+ sum over σ in Signatures of [ log⟨stems(σ)⟩ + log⟨suffixes(σ)⟩
      + sum over t in Stems(σ) of log([W]/[t])
      + sum over f in Suffixes(σ) of log([σ]/[f in σ]) ]

where ⟨X⟩ indicates the number of distinct elements in X.
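A sketch of this bookkeeping, following the shape of the formulas above (a raw list costs a header plus letters plus one pointer-style term per entry; each signature adds a pointer to itself plus pointers to its stems and suffixes). The dictionary layout and all counts are my own, for illustration:

import math

LAMBDA = math.log2(26)                              # cost per letter, in bits

def raw_list_length(items, total_tokens, counts):
    """Header (log2 N) + per entry: letters plus log(total/count)."""
    header = math.log2(max(len(items), 1))
    return header + sum(LAMBDA * len(x) + math.log2(total_tokens / counts[x])
                        for x in items)

def signature_component_length(signatures, counts, total_tokens):
    """`signatures` maps a label to (stems, suffixes); `counts` holds token counts
    for signatures, stems, and (suffix, signature) pairs."""
    total = 0.0
    for sig, (stems, suffixes) in signatures.items():
        total += math.log2(total_tokens / counts[sig])                # pointer to signature
        total += math.log2(max(len(stems), 1)) + math.log2(max(len(suffixes), 1))
        total += sum(math.log2(total_tokens / counts[t]) for t in stems)
        total += sum(math.log2(counts[sig] / counts[(f, sig)]) for f in suffixes)
    return total

print(round(raw_list_length(["ed", "ing", "s"], 1_200, {"ed": 400, "ing": 500, "s": 300}), 1),
      "bits for the suffix list")
counts = {"NULL.ed.ing.s": 2_000, "call": 300, "point": 250,
          ("NULL", "NULL.ed.ing.s"): 800, ("ed", "NULL.ed.ing.s"): 400,
          ("ing", "NULL.ed.ing.s"): 500, ("s", "NULL.ed.ing.s"): 300}
sigs = {"NULL.ed.ing.s": (["call", "point"], ["NULL", "ed", "ing", "s"])}
print(round(signature_component_length(sigs, counts, 50_000), 1), "bits for the signature component")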

Repair heuristics: using MDL

We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths:

  original morphology + compressed data  <>  revised morphology + compressed data

But it's better to have a more thoughtful approach.

Let's define Δ(x) = log( x_state1 / x_state2 ).

Then the size of the "punctuation" for the 3 lists is:

  (i) log⟨suffixes⟩ + log⟨stems⟩ + log⟨signatures⟩

and the change in the size of the punctuation of the lists is:

  Δ⟨suffixes⟩ + Δ⟨stems⟩ + Δ⟨signatures⟩

Size of the suffix component, remember:

  (ii) Suffix list:  sum over f in Suffixes of ( λ * |f| + log([W_A]/[f]) )

Change in its size when we consider a modification to the morphology:
1. global effects of the change in the number of suffixes;
2. effects of the change in the counts of suffixes present in both states;
3. suffixes present only in state 1;
4. suffixes present only in state 2.

Suffix component change:

  Δ[W_A] * ⟨Suffixes(1,2)⟩                                       (global effect of the change on all suffixes)
  + sum over f in Suffixes(1,2) of Δ[f]                          (suffixes whose counts change)
  + sum over f in Suffixes(1,~2) of ( log([W_A]_1/[f]) + λ*|f| ) (contribution of suffixes that appear only in state 1)
  - sum over f in Suffixes(~1,2) of ( log([W_A]_2/[f]) + λ*|f| ) (contribution of suffixes that appear only in state 2)

Current research projects

1. Allomorphy: automatic discovery of relationships between stems (lov~love, win~winn)
2. Use of syntax (automatic learning of syntactic categories)
3. Rich morphology: other languages (e.g., Swahili) and sub-languages (e.g., the biochemistry sub-language) where the mean number of morphemes per word is much higher
4. Ordering of morphemes

Allomorphy: Automatic discovery of relationship between stems

 Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures.

 E.g., delete stem-final –e in English before suffixes – ing, -ed, -ion (etc.).

Automatic learning of syntactic categories

 Work in progress with Mikhail Belkin (U of Chicago)  Pursuing Shi and Malik’s 1997 application of spectral graph theory (vision)  Finding eigenvector decomposition of a graph that represents bigrams and trigrams

Rich morphologies

 A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo nucle-ic, etc.)  Swahili, Hungarian, Turkish, etc.

Unsupervised Knowledge-Free Morpheme Boundary Detection

Stefan Bordag, University of Leipzig

Outline: Example; Related work; Part One: Generating training data; Part Two: Training and applying a classifier; Preliminary results; Further research

Example: clearly early

The examples used throughout this presentation are clearly and early.
In one case the stem is clear, and in the other early.
Other word forms of the same lemmas:
  clear ly: clear est, clear, clear er, clear ing
  early: earl ier, earl iest
Semantically related words:
  clearly: logically, really, totally, weakly, …
  early: morning, noon, day, month, time, …
Correct morpheme boundary analysis:
  clearly → clear ly, but not *clearl y or *clea rly
  early → early or earl y, but not *ear ly

Three approaches to morpheme boundary detection

Three kinds of approaches:

1. Genetic algorithms and the Minimum Description Length model: (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05). This approach uses only a word list, not the context information for each word from the corpus. This possibly results in an upper limit on achievable performance (especially with regard to irregularities). One advantage is that smaller corpora are sufficient.

2. Semantics based: (Schone & Jurafsky 01), (Baroni 03). A general problem of this approach arises with examples like deeply and deepness, where semantic similarity is unlikely.

3. Letter Successor Variety (LSV) based: (Harris 55); (Hafer & Weiss 74) is the first application, but with low performance. Also applied only to a word list, and further hampered by noise in the data.

2. New solution in two parts

[Flow diagram: sentences ("The talk was very informative") → neighbor cooccurrences ("The talk", "talk was", …) → similar words ("Talk speech 20", "Was is 15", …) → compute LSV over the similar words (clear ly, late ly, early, …) → score s = LSV * freq * multiletter * bigram → train classifier on the resulting segmentations (root, cl¤, clear¤, late¤, ear¤, ly, …) → apply classifier.]

2.1. First part: Generating training data with LSV and distributed Semantics

Overview: use context information to gather the common direct neighbors of the input word → they are most probably marked by the same grammatical information.

The frequency of word A is n_A and of word B is n_B; the frequency of cooccurrence of A with B is n_AB; the corpus size is n. The significance of a cooccurrence, sig_poiss(A, B), is computed as the Poisson approximation of the log-likelihood measure, following (Dunning 93) and (Quasthoff & Wolff 02).
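A small sketch of this significance computation, implemented as the Poisson collocation measure from the cited Quasthoff & Wolff (02) paper (the negative log of the Poisson probability of observing n_AB cooccurrences, given the expectation n_A*n_B/n, normalized by ln n); the example counts are invented:

import math

def poisson_significance(n_a, n_b, n_ab, n):
    """sig_poiss(A, B) = (lam - n_ab * ln(lam) + ln(n_ab!)) / ln(n),
    with lam = n_a * n_b / n, the expected number of cooccurrences."""
    lam = n_a * n_b / n
    return (lam - n_ab * math.log(lam) + math.lgamma(n_ab + 1)) / math.log(n)

# e.g. significance of "clearly" cooccurring with "defined" in a 10M-token corpus
print(round(poisson_significance(n_a=5_000, n_b=2_000, n_ab=40, n=10_000_000), 1))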

Neighbors of “clearly“

Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite.
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood.
(Example: "It's clearly labeled.")

2.2. New solution as combination of two existing approaches

Overview:
- Use context information to gather the common direct neighbors of the input word → they are most probably marked by the same grammatical information.
- Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences mostly bear the same grammatical marker.

Similar words to “clearly“

Words with cooccurrence profiles similar to clearly: weakly, legally, closely, greatly, linearly, really, …
They share the most significant left neighbors (very, quite, so, It's, most, it's, shows, results, that's, stated, Quite) and right neighbors (defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood).

2.3. New solution as combination of two existing approaches

As before, plus:
- Sort those words by edit distance and keep the 150 most similar → further words would only add random noise.

Similar words to “clearly“ sorted by edit distance

Sorted list (by edit distance to clearly): clearly, closely, greatly, legally, linearly, really, weakly, … (each with its most significant left and right neighbors, as above).

2.4. New solution as combination of two existing approaches

As before, plus:
- Compute the letter successor variety for each transition between two characters of the input word.
- Report boundaries where the LSV is above a threshold.

2.5. Letter successor variety

Letter successor variety (Harris 55): word-splitting occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.

The input is the 150 most similar words. Observe how many different letters occur after each part of the string:
- after #c-: 5 different letters in the given list
- after #cl-: only 3 letters
- after #cle-: only 1 letter
- … and reversed: before -ly# there are 16 different letters (16 stems precede the suffix -ly#), before -y# there are 10.

For # c l e a r l y #, the successor frequencies from the left run 28 5 3 1 1 1 1 1 (thus 5 different letters after #c-), and from the right 1 1 2 1 3 16 10 14.
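A small sketch of the letter successor variety computation over a list of contextually similar words (the list here is tiny and invented; the method uses the 150 most similar words):

def lsv(word, similar_words, from_right=False):
    """For each boundary inside `word`, count the distinct letters that follow the
    prefix (or, with from_right=True, precede the suffix) among `similar_words`."""
    if from_right:
        word = word[::-1]
        similar_words = [w[::-1] for w in similar_words]
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        counts.append(len({w[i] for w in similar_words
                           if len(w) > i and w.startswith(prefix)}))
    return counts[::-1] if from_right else counts

similar = ["clearly", "closely", "greatly", "legally", "linearly", "really", "weakly"]
print(lsv("clearly", similar))                    # varieties read from the left
print(lsv("clearly", similar, from_right=True))   # varieties read from the right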

2.5.1. Balancing factors

The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that would otherwise add noise:
- freq: frequency differences between the beginning and the middle of the word
- multiletter: representation of single phonemes with several letters
- bigram: certain fixed combinations of letters
The final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram.

2.5.2. Balancing factors: Frequency

LSV is not normalized against frequency:
- 28 different first letters within 150 words
- 5 different second letters within the 11 words beginning with c-
- 3 different third letters within the 4 words beginning with cl-
Computing the frequency weight freq: 4 out of the 11 words beginning with #c- continue with #cl-, so the weight at that position is 4/11.
For # c l e a r l y #, the prefix frequencies from the left are 150 11 4 1 1 1 1 1, giving weights of roughly 0.1 0.4 0.3 1 1 1 1 1.

2.5.3. Balancing factors: Multiletter Phonemes

Problem: two or more letters which together represent one phoneme "carry away" the numerator of the overlap-factor quotient.
Example (letter split variety for # s c h l i m m e): at the position after sch-, the LSV of 7 is weighted by 1 (18/18), but since sch is a single phoneme, the weight should have been 18/150.
Solution: rank bi- and trigrams; the highest-ranked receives a weight of 1.0, and the overlap factor is recomputed as a weighted average. In this case that means 1.0 * 27/150, since sch is the highest-ranked trigram and has a weight of 1.0.

2.5.4. Balancing factors: Bigrams

It is obvious that -th- in English should almost never be divided. We compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest-ranked bigram and 1.0 to the lowest-ranked; the LSV score is then multiplied by the resulting weight.
Thus the German -ch-, the highest-ranked bigram, receives a penalty of 0.1, and it is nearly impossible for it to become a morpheme boundary. (A sketch of this weighting follows.)
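A sketch of the bigram penalty: rank all letter bigrams of the word list by frequency, give the most frequent a weight of 0.1 and the rarest 1.0. The linear interpolation between those two endpoints is my assumption; the slides only name the endpoints:

from collections import Counter

def bigram_weights(word_list, high_penalty=0.1, low_penalty=1.0):
    """Most frequent bigram -> high_penalty, rarest -> low_penalty; ranks in
    between are interpolated linearly (an assumption of this sketch)."""
    freq = Counter(w[i:i + 2] for w in word_list for i in range(len(w) - 1))
    ranked = [bg for bg, _ in freq.most_common()]
    span = max(len(ranked) - 1, 1)
    return {bg: high_penalty + (low_penalty - high_penalty) * rank / span
            for rank, bg in enumerate(ranked)}

weights = bigram_weights(["the", "this", "that", "then", "machen", "lachen", "brechen"])
print(weights["th"], weights["ch"])   # frequent bigrams get weights near 0.1 (strong penalty)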

2.5.5. Sample computation

Compute the letter successor varieties for # c l e a r l y # and # e a r l y # from the left and from the right; balance them with the frequency weights, the multiletter (bi-/trigram) weights, and the bigram weight; and sum the left and right scores at each possible boundary.

For example, the right score for the clear | ly boundary works out to about 12.4, and the summed left + right score at that boundary is about 13.4, while all other boundaries in clearly and all boundaries in early stay below the threshold of 5.

Result (threshold 5): clear ly, early.

Second Part: Training and applying the classifier

Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).

[Diagram: a trie over clearly, early, lately, clear, late; ¤ marks the end or beginning of a word.]

3.1. PCT as a classifier

[Diagram: the segmentations found in part one (clear ly, ear ly, late ly, clear, late) are stored in a reversed PCT with counts at the nodes (e.g., ly=2 at the node for -ly, ¤=1 at the nodes for clear and late); known information is added to the tree, and a new word is classified by applying the deepest matching node: amazing? + ly → amazing ly, dear? + ly → dear ly.]

A sketch of this idea follows.
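A toy sketch of this second part: the segmentations from part one are stored in a suffix trie (a plain dict-based trie here rather than a PATRICIA compact tree), and a new word is segmented by the deepest matching node. Class and method names are mine:

class SuffixTrie:
    """Stores reversed word endings with the suffix lengths observed in the
    training segmentations; classifies new words by the deepest matching node."""
    def __init__(self):
        self.root = {}

    def add(self, word, boundary):
        """Record a segmentation, e.g. add('clearly', 5) for clear-ly."""
        node = self.root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
            node.setdefault("#suffix_len", []).append(len(word) - boundary)

    def segment(self, word):
        node, suffix_len = self.root, 0
        for ch in reversed(word):
            if ch not in node:
                break
            node = node[ch]
            if "#suffix_len" in node:                 # deepest node seen so far wins
                lens = node["#suffix_len"]
                suffix_len = max(set(lens), key=lens.count)
        if 0 < suffix_len < len(word):
            return word[:-suffix_len], word[-suffix_len:]
        return (word,)

trie = SuffixTrie()
trie.add("clearly", 5)                 # clear-ly
trie.add("lately", 4)                  # late-ly
trie.add("early", 5)                   # early (no internal boundary)
print(trie.segment("amazingly"))       # ('amazing', 'ly')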

4. Evaluation

Boundary measuring: each detected boundary can be correct or wrong (precision), and boundaries can be missed (recall). The first evaluation is the global LSV with the proposed improvements.

[Plot: F-measure as a function of the threshold (0-33) for F(lsv), F(lsv*fw), and F(lsv*fw*ib).]
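A small sketch of this boundary-level scoring; predicted and gold segmentations are represented as sets of boundary positions per word (my own encoding, not the paper's format):

def boundary_prf(predicted, gold):
    """Boundary precision, recall and F-measure over a test vocabulary."""
    detected = correct = expected = 0
    for word, gold_cuts in gold.items():
        pred_cuts = predicted.get(word, set())
        detected += len(pred_cuts)
        expected += len(gold_cuts)
        correct += len(pred_cuts & gold_cuts)
    p = correct / detected if detected else 0.0
    r = correct / expected if expected else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# clear|ly is correct; the spurious cut in "early" hurts precision:
print(boundary_prf({"clearly": {5}, "early": {4}}, {"clearly": {5}, "early": set()}))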

Evaluating LSV: precision vs. recall

[Plot: precision and recall as functions of the threshold (0-24) for P(lsv), R(lsv), P(lsv*fw), R(lsv*fw), P(lsv*fw*ib), R(lsv*fw*ib).]

Evaluating LSV: F-measure

[Plot: F-measure curves for F(lsv), F(lsv*fw), F(lsv*fw*ib).]

Evaluating the combination: precision vs. recall

[Plot: precision and recall curves for P(lsv+trie), R(lsv+trie), P(lsv*fw+trie), R(lsv*fw+trie), P(lsv*fw*ib+trie), R(lsv*fw*ib+trie).]

Evaluating the combination: F-measure

[Plot: F-measure curves for F(lsv+trie), F(lsv*fw+trie), F(lsv*fw*ib+trie).]

Comparing the combination with global LSV

[Plots comparing the combined system's curves with the global LSV curves over the threshold range.]

4.1. Results

German newspaper corpus with 35 million sentences; English newspaper corpus with 13 million sentences. At threshold t = 5:

German:  LSV precision 80.20, recall 34.52, F-measure 48.27;  combined precision 68.77, recall 72.11, F-measure 70.40.
English: LSV precision 70.35, recall 10.86, F-measure 18.82;  combined precision 52.87, recall 52.56, F-measure 55.09.

4.2. Statistics

Columns: English (en), Turkish (tr), Finnish (fi), each for the LSV analysis alone (lsv) and the combined system (comb).

                            en lsv     en comb    tr lsv     tr comb    fi lsv      fi comb
Corpus size (sentences)     13 million            1 million             4 million
Number of word forms        167,377               582,923               1,636,336
Analysed words              49,159     94,237     26,307     460,791    68,840      1,380,841
Boundaries                  70,106     131,465    31,569     812,454    84,193      3,138,039
Morpheme length             2.60       2.56       2.29       3.03       2.32        3.73
Length of analysed words    8.97       8.91       9.75       10.62      11.94       13.34
Length of unanalysed words  7.56       6.77       10.12      8.15       12.91       10.47
Morphemes per word          2.43       2.40       2.20       2.76       2.22        3.27

Assessing true error rate

Typical sample of words counted as wrong because of CELEX (each pair shows the two differing analyses):
  Tau-sende / Tausend-e; senegales-isch-e / senegalesisch-e; sensibelst-en / sens-ibel-sten; separat-ist-isch-e / separ-at-istisch-e; tris-t / trist; triump-hal / triumph-al; trock-en / trocken; unueber-troff-en / un-uebertroffen; trop-f-en / tropf-en; trotz-t-en / trotz-ten; ver-traeum-t-e / vertraeumt-e
Reasons:
  - gender -e (in (Creutz & Lagus 05), for example, counted as correct)
  - compounds (sometimes separated, sometimes not)
  - -t-en errors
  - with proper names, -isch is often not analyzed
  - connecting elements

4.4. Real example

Orien-tal, Orien-tal-ische, Orien-tal-ist, Orien-tal-ist-en, Orien-tal-ist-ik, Orien-tal-ist-in, Orient-ier-ung, Orient-ier-ungen, Orient-ier-ungs-hilf-e, Orient-ier-ungs-hilf-en, Orient-ier-ungs-los-igkeit, Orient-ier-ungs-punkt, Orient-ier-ungs-punkt-e, Orient-ier-ungs-stuf-e, Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann, Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum, Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en, Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände, Ver-triebs-beleg-e

5. Further research

- Examine quality on various language types
- Improve the trie-based classifier
- Possibly combine with other existing algorithms
- Find out how to acquire the morphology of non-concatenative languages
- Deeper analysis: find deletions, alternations, insertions, morpheme classes, etc.

References

(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.

(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.

(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.

(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.

(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.

6. References II

(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.

(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.

(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.

(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.

(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.

(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001: Machine Learning: Beyond Supervision, Seattle, WA, August 2001.

E. Gender -e vs. frequency -e vs. other -e

[Table: boundary scores for the word-final -e of example words (Affe, Junge, Knabe, Bursche, Backstage, Schule, Devise, Sonne, Abendsonne, Abende, Liste, andere, keine, rote, stolze, rufe, drehte, winzige, lustige, Dumme), with values ranging from about 2.4 to 13.2.]