Transcript Slide 1
Morphology 3: Unsupervised Morphology Induction
Sudeshna Sarkar, IIT Kharagpur
Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL
John Goldsmith, Department of Linguistics, The University of Chicago
Unsupervised learning
Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words.
But no tagging or text preparation.
Output: a list of stems, suffixes, and prefixes; a list of signatures.
A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.
Hence, a stem in a corpus has a unique signature.
… A signature has a unique set of stems associated with it.
Example of a signature in English: NULL.ed.ing.s, with the stems ask, call, point, accounting for ask, asked, asking, asks; call, called, calling, calls; point, pointed, pointing, points.
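The signature bookkeeping above can be sketched in a few lines. This is a minimal illustration, not Linguistica's actual code; the helper name and the input mapping are assumptions for the example.

```python
from collections import defaultdict

def signatures(splits):
    """Group stems by their suffix set (hypothetical helper).
    `splits` maps each stem to the set of suffixes seen with it."""
    by_sig = defaultdict(set)
    for stem, suffixes in splits.items():
        # A signature is the suffix set in Goldsmith's dotted notation,
        # with NULL conventionally listed first.
        sig = ".".join(sorted(suffixes, key=lambda s: (s != "NULL", s)))
        by_sig[sig].add(stem)
    return dict(by_sig)

splits = {
    "ask":   {"NULL", "ed", "ing", "s"},
    "call":  {"NULL", "ed", "ing", "s"},
    "point": {"NULL", "ed", "ing", "s"},
}
print(signatures(splits))  # {'NULL.ed.ing.s': {'ask', 'call', 'point'}}
```

Since a stem has a unique suffix set in the corpus, each stem lands in exactly one signature, matching the slide's claim.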
…output: roots ("stems of stems") and the inner structure of stems; regular allomorphy of stems, e.g., learning rules such as "delete stem-final -e in English before -ing and -ed".
Essence of Minimum Description Length (MDL)
Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989); work by Michael Brent and Carl de Marcken on word discovery using MDL.
We are given (1) a corpus, and (2) a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.
Better said: -1 * log probability(corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data.
This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.
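The "-1 * log probability in bits" idea can be made concrete with a toy model. The unigram model and the four-word corpus below are assumptions for illustration only; Goldsmith's actual model factors probabilities through signatures, stems, and suffixes.

```python
import math

# Compressed length of a corpus = -log2 P(corpus).
# Toy unigram model over word tokens (an illustrative assumption).
corpus = ["the", "dog", "the", "cat"]
counts = {w: corpus.count(w) for w in set(corpus)}
n = len(corpus)

# Each token costs -log2 of its relative frequency, in bits.
bits = -sum(math.log2(counts[w] / n) for w in corpus)
print(bits)  # 6.0 bits: 1 bit for each "the", 2 bits each for "dog", "cat"
```

A model that assigns the corpus higher probability yields fewer bits, which is exactly the sense in which it "models the data better".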
Essence of MDL…
The goodness of the morphology is also measured by how compact the morphology is. We can measure the compactness of a morphology in information theoretic bits.
How can we measure the compactness of a morphology? Let's consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length
Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 62 letters.
Analysis:
Stems: jump, laugh, sing, sang, dog (20 letters)
Suffixes: s, ing, ed (6 letters)
Unanalyzed: the (3 letters)
Total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing
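The naïve letter-count computation is trivial to reproduce; the lists below are the slide's own analysis, and the counts (20 + 6 + 3 = 29) come out as stated.

```python
# Naive description length of the analysis: just count letters in
# stems, suffixes, and unanalyzed words (the slide's toy measure).
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

length = sum(len(x) for x in stems + suffixes + unanalyzed)
print(length)  # 29 = 20 stem letters + 6 suffix letters + 3 unanalyzed
```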
Essence of MDL…
The best overall theory of a corpus is the one for which the sum of -1 * log prob(corpus) plus the length of the morphology (that's the description length) is the smallest.
Essence of MDL…
[Chart: description length for three morphologies. The best analysis minimizes the sum of the length of the morphology and the -log prob of the corpus; an elegant theory that works badly has a short morphology but compresses the corpus poorly; a baroque theory modeled on the data has a very long morphology.]
Overall logic
Search through morphology space for the morphology which provides the smallest description length.
1. Application of MDL to iterative search of morphology-space, with successively finer-grained descriptions
Pick a large corpus from a language: 5,000 to 1,000,000 words.
Feed it into the "bootstrapping" heuristic, out of which comes a preliminary morphology, which need not be superb.
Feed that morphology to the incremental heuristics; out comes a modified morphology.
Is the modification an improvement? Ask MDL!
If it is an improvement, replace the morphology with the modified morphology (and discard the old one); then send it back to the incremental heuristics again.
Continue until there are no improvements to try.
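The loop just described can be sketched abstractly. All names here are hypothetical, and the toy instantiation (an integer standing in for a morphology) is only meant to show the control flow; the real bootstrap and incremental heuristics are far richer.

```python
def search(corpus, bootstrap, heuristics, description_length):
    """Sketch of the overall MDL loop: propose modifications and
    keep one only if it shortens the total description length."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            candidate = heuristic(morphology, corpus)
            # Ask MDL: is the modification an improvement?
            if description_length(candidate, corpus) < description_length(morphology, corpus):
                morphology, improved = candidate, True
    return morphology

# Toy instantiation: "morphology" is an integer whose description
# length is its distance from 3; heuristics nudge it up or down.
best = search(
    corpus=None,
    bootstrap=lambda c: 10,
    heuristics=[lambda m, c: m - 1, lambda m, c: m + 1],
    description_length=lambda m, c: abs(m - 3),
)
print(best)  # 3
```

The loop halts exactly when no proposed modification lowers the description length, mirroring "continue until there are no improvements to try".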
1. Bootstrap heuristic
A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.
In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so over a vocabulary V the search space has at least prod_{i=1..|V|} |w_i| members.
Better bootstrap heuristics
Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955). Current variant: cut words at certain peaks of successor frequency.
Problems: it can over-cut; it can under-cut; and it can put cuts too far to the right (the "aborti-" problem). [Not a problem!]
Successor frequency
g o v e r | n — empirically, only one letter follows "gover": "n".
g o v e r n | e i m o s # — empirically, 6 letters follow "govern".
g o v e r n m | e — empirically, 1 letter follows "governm": "e".
So the successor frequencies run gover: 1, govern: 6, governm: 1 — a peak of successor frequency at "govern".
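The successor-frequency counts above are easy to compute from a word list. The tiny word list below is an assumption for illustration; on it, the counts for "gover", "govern", and "governm" come out as on the slide.

```python
def successor_frequency(words, prefix):
    """Number of distinct letters that can follow `prefix` in the
    word list; '#' marks the end of a word (Harris-style count)."""
    following = {w[len(prefix)] if len(w) > len(prefix) else "#"
                 for w in words if w.startswith(prefix)}
    return len(following)

# Toy word list (an illustrative assumption):
words = ["govern", "governed", "governing", "governs",
         "government", "governor", "governess"]
print(successor_frequency(words, "gover"))    # 1: only 'n'
print(successor_frequency(words, "govern"))   # 6: e, i, s, m, o, #
print(successor_frequency(words, "governm"))  # 1: only 'e'
```

The 1 … 6 … 1 shape is exactly the "clear peak" the conditions on the next slide demand.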
Lots of errors…
Successor frequencies along c o n s e r v a t i v e s: 9 18 11 6 4 1 2 1 1 2 1 1. The peaks propose cuts that are wrong, right, and wrong.
Even so…
We set conditions: accept cuts with stems at least 5 letters in length; demand that the successor frequency form a clear peak: 1 … N … 1 (e.g., govern-ment). Then for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
2. Incremental heuristics
Coarse-grained to fine-grained:
1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix.
2. Loose fit — suffixes and signatures to split: collect any string that precedes a known suffix, find all of its apparent suffixes, and use MDL to decide whether it's worth doing the analysis. We'll return to this in a moment.
Incremental heuristic
3. Slide the stem-suffix boundary to the left: again, use MDL to decide.
How do we use MDL to decide?
Using MDL to judge a potential stem
act, acted, action, acts. We have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s.
Let's compute the cost versus the savings of the signature NULL.ed.ion.s.
Savings — stem savings: 3 redundant copies of the stem act: that's 3 x 4 = 12 letters = almost 60 bits.
Cost of NULL.ed.ion.s:
A pointer to each suffix: log([W]/[NULL]) + log([W]/[ed]) + log([W]/[ion]) + log([W]/[s]). To give a feel for this: log([W]/[ed]) is about 5. Total cost of the suffix list: about 30 bits.
Cost of the pointer to the signature: log([W]/[stems that use this sig]), about 13 bits — though all the stems using it chip in to pay for its cost.
Cost of the signature: about 45 bits. Savings: about 60 bits. So MDL says: Do it! Analyze the words as stem + suffix.
Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
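The back-of-envelope arithmetic can be reproduced directly. All the counts below are made up for illustration (the corpus size W and suffix counts are assumptions, chosen so that log2(W/[ed]) is about 5 as on the slide); only the formulas come from the slides.

```python
import math

LOG26 = math.log2(26)            # ~4.7 bits per letter

# Savings: 3 redundant copies of the 4-letter stem 'act'.
stem_savings = 3 * 4 * LOG26     # ~56 bits: "almost 60"

# Cost: one pointer per suffix, each log2(W / count-of-suffix) bits.
W = 100_000                      # assumed token count
suffix_counts = {"NULL": 40_000, "ed": 3_000, "ion": 1_500, "s": 20_000}
suffix_pointers = sum(math.log2(W / c) for c in suffix_counts.values())

print(round(stem_savings))       # ~56 bits saved
print(round(suffix_pointers))    # ~15 bits of pointer cost
```

With the signature's remaining overhead added, the cost stays below the roughly 60 bits of savings, so the analysis pays for itself, which is the slide's point.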
Today’s presentation
1. The task: unsupervised learning
2. Overview of program and output
3. Overview of the Minimum Description Length framework
4. Application of MDL to iterative search of morphology space, with successively finer-grained descriptions
5. Mathematical model
6. Current capabilities
7. Current challenges
Model
A model to give us a probability of each word in the corpus (hence, its optimal compressed length); and A morphology whose length we can measure.
Frequency of analyzed word
W is analyzed as belonging to signature σ, with stem T and suffix F. Then:
freq(W = T + F in σ) = freq(σ) * freq(T | σ) * freq(F | σ) = ([σ]/[W]) * ([T]/[σ]) * ([F in σ]/[σ])
where [x] means the count of x's in the corpus (token count), and [W] is the total number of words. Actually, what we care about is the log of this:
Compressed length(W = T + F in σ) = -log freq(W) = log([W]/[σ]) + log([σ]/[T]) + log([σ]/[F in σ])
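The per-word compressed length is a three-term sum of log ratios of counts. The numbers below are made-up token counts for illustration; only the shape of the formula comes from the slide.

```python
import math

def compressed_length(W, sig, stem_count, suffix_in_sig):
    """Bits to encode one analyzed word:
    log([W]/[sigma]) + log([sigma]/[T]) + log([sigma]/[F in sigma]).
    All arguments are token counts."""
    return (math.log2(W / sig)
            + math.log2(sig / stem_count)
            + math.log2(sig / suffix_in_sig))

# E.g. 100,000 tokens, 8,000 of them in this signature, 2,000 with
# this stem, 1,000 with this suffix (assumed counts):
bits = compressed_length(100_000, 8_000, 2_000, 1_000)
print(round(bits, 2))  # ~8.64 bits for this word token
```

Rarer signatures, stems, and suffixes each add bits, so frequent regular patterns compress the corpus best.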
Next, let’s see how to measure the length of a morphology
A morphology is a set of 3 things: A list of stems; A list of suffixes; A list of signatures with the associated stems.
We’ll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.
Length of a list
A header telling us how long the list is, of length (roughly) log2 N, where N is the length of the list; then N entries. What's in an entry?
Raw lists: a list of strings of letters, where the length of each letter is log2(26) — the information content of a letter (we can use a more accurate conditional probability).
Pointer lists: e.g., raw suffix list: ed, s, ing, ion, able, …; Signature 1: suffixes: pointer to "ing", pointer to "ed"; Signature 2: suffixes: pointer to "ing", pointer to "ion". The length of each pointer is log2([suffixed words]/[occurrences of this suffix]) — usually cheaper than the letters themselves.
The fact that a pointer to a symbol has a length that shrinks as its frequency grows is the key: we want the shortest overall grammar, so that means maximizing the re-use of units (stems, affixes, signatures, etc.).
The lengths of the suffix and stem lists are then:
(ii) Suffix list: sum over f in Suffixes of ( λ·|f| + log([W_A]/[f]) )
(iii) Stem list: sum over t in Stems of ( λ·|t| + log([W]/[t]) )
where λ ≈ log2(26) is the cost of a letter and [W_A] is the number of analyzed word tokens. These cover the letters and the list structure; plus the signatures, which we'll get to shortly.
Information contained in the Signature component
For the signature component:
a list of pointers to the signatures: sum over σ of log([W]/[σ]);
a header: log |Signatures|;
and, for each signature σ: log |stems(σ)| + log |suffixes(σ)| + sum over t in Stems(σ) of log([W]/[t]) + sum over f in Suffixes(σ) of log([σ]/[f in σ]).
Repair heuristics: using MDL
We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths:
Original morphology + compressed data <?> Revised morphology + compressed data
But it’s better to have a more thoughtful approach.
Let's define Δ(x) = log(x_state1 / x_state2). The size of the punctuation for the 3 lists is:
(i) log |Suffixes| + log |Stems| + log |Signatures|
so the change of the size of the punctuation in the lists is:
Δ(|Suffixes|) + Δ(|Stems|) + Δ(|Signatures|)
The size of the suffix component, remember, is:
(ii) Suffix list: sum over f in Suffixes of ( λ·|f| + log([W_A]/[f]) )
Change in its size when we consider a modification to the morphology:
1. Global effects of the change in the number of analyzed words on all suffixes;
2. Effects of changed counts for suffixes present in both states;
3. Suffixes present only in state 1;
4. Suffixes present only in state 2.
Suffix component change:
Δ(SuffixList) = |Suffixes(1,2)| · Δ([W_A])   (global effect of the change on all shared suffixes)
- sum over f in Suffixes(1,2) of Δ([f])   (suffixes whose counts change)
+ sum over f in Suffixes(1,~2) of ( λ·|f| + log([W_A]_1/[f]) )   (contribution of suffixes that appear only in state 1)
- sum over f in Suffixes(~1,2) of ( λ·|f| + log([W_A]_2/[f]) )   (contribution of suffixes that appear only in state 2)
Current research projects
1. Allomorphy: automatic discovery of relationships between stems (lov~love, win~winn)
2. Use of syntax (automatic learning of syntactic categories)
3. Rich morphology: other languages (e.g., Swahili), other sub-languages (e.g., the biochemistry sub-language) where the mean number of morphemes per word is much higher
4. Ordering of morphemes
Allomorphy: Automatic discovery of relationship between stems
Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures. E.g., delete stem-final -e in English before the suffixes -ing, -ed, -ion (etc.).
Automatic learning of syntactic categories
Work in progress with Mikhail Belkin (U of Chicago) Pursuing Shi and Malik’s 1997 application of spectral graph theory (vision) Finding eigenvector decomposition of a graph that represents bigrams and trigrams
Rich morphologies
A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo nucle-ic, etc.) Swahili, Hungarian, Turkish, etc.
Unsupervised Knowledge-Free Morpheme Boundary Detection
Stefan Bordag, University of Leipzig
Outline: Example; Related work; Part One: Generating training data; Part Two: Training and applying a classifier; Preliminary results; Further research
Example: clearly early
The examples used throughout this presentation are clearly and early. In one case the stem is clear; in the other, the word early itself.
Other word forms of the same lemmas: clearly: clear-est, clear, clear-er, clear-ing; early: earl-ier, earl-iest.
Semantically related words: clearly: logically, really, totally, weakly, …; early: morning, noon, day, month, time, …
Correct morpheme boundary analysis: clearly → clear|ly, but not *clearl|y or *clea|rly; early → early or earl|y, but not *ear|ly.
Three approaches to morpheme boundary detection
Three kinds of approaches:
1. Genetic algorithms and the Minimum Description Length model (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05). This approach utilizes only a word list, not the context information for each word from the corpus. This possibly results in an upper limit on achievable performance (especially with regard to irregularities). One advantage is that smaller corpora are sufficient.
2. Semantics based (Schone & Jurafsky 01), (Baroni 03). A general problem of this approach arises with examples like deeply and deepness, where semantic similarity is unlikely.
3. Letter Successor Variety (LSV) based (Harris 55), (Hafer & Weiss 74): the first application, but low performance; also applied only to a word list, and further hampered by noise in the data.
2. New solution in two parts
[Pipeline diagram: sentences (e.g., "The talk was very informative") → neighbor cooccurrences → similar words (clearly, lately, early, …) → LSV scoring with s = LSV * freq * multiletter * bigram → train and apply a classifier over a compact tree with nodes such as cl¤, ear, ly, late¤, root clear¤.]
2.1. First part: Generating training data with LSV and distributed Semantics
Overview: use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
The frequency of word A is n_A and of word B is n_B; the frequency of cooccurrence of A with B is n_AB; the corpus size is n. The significance computation is a Poisson approximation of the log likelihood (Dunning 93), (Quasthoff & Wolff 02):
sig_poiss(A, B) = (λ - n_AB · ln λ + ln(n_AB!)) / ln n,   where λ = n_A · n_B / n
(by Stirling's approximation, ln(n_AB!) ≈ n_AB · ln n_AB - n_AB)
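The Poisson significance is a one-liner given the four counts. The counts below are made up for illustration; the formula follows the Quasthoff & Wolff-style Poisson approximation given above, using lgamma for the factorial term.

```python
import math

def sig_poisson(n_a, n_b, n_ab, n):
    """Poisson approximation of log-likelihood significance for a
    cooccurrence pair (after Quasthoff & Wolff 02)."""
    lam = n_a * n_b / n  # expected cooccurrence count under independence
    # math.lgamma(k + 1) == ln(k!); a large score means the pair
    # cooccurs far more often than chance predicts.
    return (lam - n_ab * math.log(lam) + math.lgamma(n_ab + 1)) / math.log(n)

# Assumed counts: two words seen 1,000 and 2,000 times, cooccurring
# 50 times in a corpus of 1,000,000 tokens:
score = sig_poisson(1000, 2000, 50, 1_000_000)
print(round(score, 2))  # ~8.38: far above the ~2 expected by chance
```

Pairs like "clearly" with "defined" or "labeled" get high scores this way, which is what selects the significant neighbors on the next slide.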
Neighbors of “clearly“
Most significant left neighbors of clearly: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite. Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood. (E.g., "It's clearly labeled", "very clearly shows".)
2.2. New solution as combination of two existing approaches
Overview: use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information. Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences mostly bear the same grammatical marker.
Similar words to "clearly" (sharing the most significant left neighbors very, quite, so, It's, most, it's, shows, results, that's, stated and right neighbors defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood): weakly, legally, closely, clearly, greatly, linearly, really, …
2.3. New solution as combination of two existing approaches
Overview: use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information. Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences mostly bear the same grammatical marker. Sort those words by edit distance and keep the 150 most similar → since further words only add random noise.
Similar words to "clearly" sorted by edit distance: clearly, closely, greatly, legally, linearly, really, weakly, … (with the same most significant left and right neighbors as before).
2.4. New solution as combination of two existing approaches
Overview: use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information. Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences mostly bear the same grammatical marker. Sort those words by edit distance and keep the 150 most similar → since further words only add random noise. Compute the letter successor variety for each transition between two characters of the input word, and report boundaries where the LSV is above a threshold.
2.5. Letter successor variety
Letter successor variety: Harris (55), where word-splitting occurs if the number of distinct letters that follow a given sequence of characters surpasses a threshold.
The input is the 150 most similar words. Observe how many different letters occur after a part of the string: in the given list, 5 letters occur after #c-, only 3 after #cl-, only 1 after #cle-, …; and, reversed, 10 different letters occur before -y# but 16 before -ly# (16 stems precede the suffix -ly#).
Forward LSV over # c l e a r l y #: 28 5 3 1 1 1 1 1; the backward LSV peaks at 16 before -ly# and 10 before -y#.
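The forward LSV pass can be sketched directly. The short word list below is a stand-in for the 150 edit-distance-nearest words (an assumption for illustration); even on this toy list the variety peaks at the clear|ly boundary.

```python
def lsv(words, prefix):
    """Letter successor variety after `prefix`; '#' marks word end."""
    return len({w[len(prefix)] if len(w) > len(prefix) else "#"
                for w in words if w.startswith(prefix)})

# Toy stand-in for the 150 most similar words to "clearly":
similar = ["clearly", "closely", "greatly", "legally", "linearly",
           "really", "weakly", "clear", "cleared"]

# Forward LSV after each prefix of "clearly" (c, cl, ..., clearl):
scores = [lsv(similar, "clearly"[:i]) for i in range(1, len("clearly"))]
print(scores)  # [1, 2, 1, 1, 3, 1]: the peak sits right after 'clear'
```

The backward pass is the same computation run over the reversed strings; in the full system both scores are then weighted by the freq, multiletter, and bigram factors described next.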
2.5.1. Balancing factors
The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise: freq (frequency differences between the beginning and middle of a word), multiletter (representation of single phonemes with several letters), and bigram (certain fixed combinations of letters). The final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram.
2.5.2. Balancing factors: Frequency
LSV is not normalized against frequency: there are 28 different first letters within the 150 words, 5 different second letters within the 11 words beginning with c, and 3 different third letters within the 4 words beginning with cl. Computing the frequency weight freq: 4 of the 11 words beginning with #c- continue with #cl-, so the weight there is 4/11 ≈ 0.4.
2.5.3. Balancing factors: Multiletter Phonemes
Problem: two or more letters which together represent one phoneme "carry away" the nominator of the overlap-factor quotient. For example, in # s c h l i m m e, the LSV of 7 after #sch- is weighted 1 (18/18), but since sch is one phoneme it should have been 18/150.
Solution: rank bi- and trigrams, with the highest receiving a weight of 1.0, and recompute the overlap factor as a weighted average. In this case that means 1.0 * 27/150, since 'sch' is the highest trigram and has a weight of 1.0.
2.5.4. Balancing factors: Bigrams
It is obvious that -th- in English should almost never be divided. So we compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest ranked and 1.0 to the lowest ranked; the LSV score is then multiplied by the resulting weight. Thus the German -ch-, which is the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.
2.5.5. Sample computation
[Worked example for #clearly# and #early#: compute the left and right letter successor varieties at each position, apply the frequency, multiletter, and bigram weights, and sum the left and right scores. For instance, the right-to-left score at the clear-ly boundary comes out as 16 * (76/90 + …) * … = 12.4, giving a summed score of 13.4; with threshold 5 this yields the analyses clear|ly and early (no split).]
Second Part: Training and applying the classifier
Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68). [Diagram: a PCT over the example words clearly, early, lately, clear, late, with ¤ marking the end or beginning of a word.]
3.1. PCT as a classifier
Store the known analyses in the PCT (e.g., from clear, late, clearly, lately, early: counts such as ly=1 or ¤=1 at the corresponding nodes), then classify unseen words by the deepest found node: amazing? → amazing-ly, dear? → dear-ly.
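A plain reversed-suffix trie already captures the "deepest found node" idea. This is a simplified sketch under assumed interfaces, not Bordag's PCT: it stores only boundary votes per depth, and it omits the ¤ end-markers that let the real tree record "no split" evidence as well.

```python
class SuffixTrie:
    """Known words are stored reversed; each node counts how often a
    morpheme boundary fell exactly at its depth. An unseen word is
    split at the deepest matching node with boundary evidence."""
    def __init__(self):
        self.root = {"kids": {}, "votes": 0}

    def add(self, word, suffix_len):
        node = self.root
        for depth, ch in enumerate(reversed(word), start=1):
            node = node["kids"].setdefault(ch, {"kids": {}, "votes": 0})
            if depth == suffix_len:
                node["votes"] += 1   # a known boundary at this depth

    def classify(self, word):
        node, best = self.root, 0
        for depth, ch in enumerate(reversed(word), start=1):
            if ch not in node["kids"]:
                break
            node = node["kids"][ch]
            if node["votes"]:
                best = depth          # deepest node with boundary votes
        return (word[:-best], word[-best:]) if best else (word, "")

t = SuffixTrie()
t.add("clearly", 2)   # clear|ly, from the LSV stage
t.add("lately", 2)    # late|ly
t.add("early", 0)     # no split
print(t.classify("dearly"))  # ('dear', 'ly')
```

The training data for `add` comes from the LSV stage, so the classifier generalizes the boundaries that LSV found to words LSV never scored.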
4. Evaluation
Boundary measuring: each boundary detected can be correct or wrong (precision), or boundaries can go undetected (recall). The first evaluation is global LSV with the proposed improvements. [Chart: F-measure as a function of the threshold (0-33) for F(lsv), F(lsv*fw), F(lsv*fw*ib).]
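Boundary precision, recall, and F-measure are straightforward to compute from boundary sets. The representation (word, position pairs) and the tiny gold/predicted sets below are assumptions for illustration.

```python
def boundary_prf(gold, predicted):
    """Boundary-level precision, recall, F-measure; each argument is
    a set of (word, boundary_position) pairs."""
    tp = len(gold & predicted)                    # correctly detected
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("clearly", 5), ("lately", 4)}            # clear|ly, late|ly
pred = {("clearly", 5), ("early", 3)}             # one hit, one false cut
print(boundary_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Raising the LSV threshold trades recall for precision, which is exactly the curve family the evaluation charts plot.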
Evaluating LSV Precision vs. Recall
[Chart: precision and recall vs. threshold (0-24) for P(lsv), R(lsv), P(lsv*fw), R(lsv*fw), P(lsv*fw*ib), R(lsv*fw*ib).]
Evaluating LSV F-measure
[Chart: F-measure vs. threshold for F(lsv), F(lsv*fw), F(lsv*fw*ib).]
Evaluating combination Precision vs. Recall
[Chart: precision and recall vs. threshold for P(lsv+trie), R(lsv+trie), P(lsv*fw+trie), R(lsv*fw+trie), P(lsv*fw*ib+trie), R(lsv*fw*ib+trie).]
Evaluating combination F-measure
[Chart: F-measure vs. threshold for F(lsv+trie), F(lsv*fw+trie), F(lsv*fw*ib+trie).]
Comparing combination with global LSV
[Charts: precision/recall curves comparing the combined method with global LSV over thresholds 0-24 and 0-34.]
4.1. Results
German newspaper corpus with 35 million sentences; English newspaper corpus with 13 million sentences. At threshold t = 5:

           lsv P   lsv R   lsv F   combined P   combined R   combined F
German     80.20   34.52   48.27   68.77        72.11        70.40
English    70.35   10.86   18.82   52.87        52.56        55.09
4.2. Statistics
[Table: corpus statistics for English (13 million sentences; 167,377 word forms), Turkish (1 million; 582,923 word forms), and Finnish (4 million; 1,636,336 word forms), each under the lsv and combined settings: number of analysed words, boundaries found, morpheme length, length of analysed and unanalysed words, and morphemes per word (ranging roughly from 2.2 to 3.7).]
Assessing true error rate
Typical sample list of words considered wrong due to CELEX: Tausend-e, senegales-isch-e vs. senegalesisch-e, sensibelst-en vs. sens-ibel-sten, separat-ist-isch-e vs. separ-at-istisch-e, trist, triumph-al, trock-en, un-uebertroffen, tropf-en, trotz-t-en vs. trotz-ten, ver-traeum-t-e vs. vertraeumt-e. Reasons: gender -e (in (Creutz & Lagus 05), for example, counted as correct); compounds (sometimes separated, sometimes not); the -t-en error; with proper names, -isch is often not analyzed; connecting elements.
4.4. Real example
Orien-tal Orien-tal-ische Orien-tal-ist Orien-tal-ist-en Orien-tal-ist-ik Orien-tal-ist-in Orient-ier-ung Orient-ier-ungen Orient-ier-ungs-hilf-e Orient-ier-ungs-hilf-en Orient-ier-ungs-los-igkeit Orient-ier-ungs-punkt Orient-ier-ungs-punkt-e Orient-ier-ungs-stuf-e Ver-trau-enskrise Ver-trau-ensleute Ver-trau-ens-mann Ver-trau-ens-sache Ver-trau-ensvorschuß Ver-trau-ensvo-tum Ver-trau-ens-würd-igkeit Ver-traut-es Ver-trieb-en Ver-trieb-spartn-er Ver-triebene Ver-triebenenverbände Ver-triebs-beleg-e
5. Further research
Examine quality on various language types; improve the trie-based classifier; possibly combine with other existing algorithms; find out how to acquire the morphology of non-concatenative languages; deeper analysis: find deletions, alternations, insertions, morpheme classes, etc.
References
(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.
(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.
(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.
(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
6. References II
(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.
(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001, Seattle, WA., August 2001. Machine Learning: Beyond Supervision.
E. Gender-e vs. Frequency-e
[Table: scores for German words ending in -e, contrasting lemma-final (gender) -e (Affe, Junge, Knabe, Bursche, Backstage, Schule, Devise, Sonne, Abendsonne, Abende, Liste) with inflectional or other -e (andere, keine, rote, stolze, rufe, drehte, winzige, lustige, Dumme); the scores range from 2.4 to 13.2.]