RePortS: A Simpler, Intuitive Approach to Morpheme Induction
Emily Pitler
Samarth Keshava
Yale University
Goals
Segment English words into morphemes
Simple algorithm
Minimize assumptions and “magic numbers”
Approach
Identify common morphemes in the language
– “prefix” and “suffix” lists
Use these to segment the test words
Intuition and Motivation
The resulting word fragment, after removing a potential morpheme, is often still a word
Examples:
– training = train+ing
– chairman = chair+man
– insufferable = insuffer+able
Don’t use this test to segment words
Intuition and Motivation
Use fluctuations in transitional probabilities (Harris 1955, Hafer and Weiss 1974)
Examples:
– Expect Pr(t | repor) ≈ 1
– Expect Pr(s | report) < 1
Because there are other words such as reported, reporting, report, etc.
Four Steps
1. Preprocessing: build the lexicographic trees
2. Score word fragments to determine morphemes
3. Prune the morpheme lists
4. Segment words using the trees and morpheme lists
Step 1: Build the trees
We build a “forward tree” and a “backward tree”
We use these trees to calculate transitional probabilities in O(1) time
[Figure: hypothetical section of the forward tree]
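A minimal sketch of how such trees could be built, offered as an illustration rather than the authors' Perl code. The class name `CountTrie` and its methods are hypothetical, and counting each distinct word once (rather than weighting by corpus frequency) is an assumption. Each node stores how many words pass through it, so once a node is reached a transitional probability is a single division.

```python
class CountTrie:
    """Letter tree; each node counts the corpus words that pass through it."""

    def __init__(self):
        self.count = 0
        self.children = {}

    def insert(self, word):
        # Assumption: every word counts once (no frequency weighting).
        node = self
        node.count += 1
        for ch in word:
            node = node.children.setdefault(ch, CountTrie())
            node.count += 1

    def find(self, fragment):
        # Walk down the tree; return None if the fragment never occurs.
        node = self
        for ch in fragment:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def transition_prob(self, fragment, next_char):
        # Pr(next_char | fragment) = count(fragment + next_char) / count(fragment)
        parent = self.find(fragment)
        if parent is None or next_char not in parent.children:
            return 0.0
        return parent.children[next_char].count / parent.count


corpus = ["report", "reports", "reported", "reporting"]
forward, backward = CountTrie(), CountTrie()
for w in corpus:
    forward.insert(w)         # forward tree: letters left to right
    backward.insert(w[::-1])  # backward tree: letters right to left

print(forward.transition_prob("repor", "t"))   # 1.0
print(forward.transition_prob("report", "s"))  # 0.25
```

In this sketch `find` walks the fragment letter by letter; the O(1) cost applies to the division once the node is in hand, for example while traversing a word one letter at a time.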
Step 2: Scoring morphemes
Example: scoring “s” in “reports”
– Check if “report” is a word in the corpus
– Check if Pr(t | repor) ≈ 1
– Check if Pr(s | report) < 1
If “s” passes all three tests, we add 19 to its suffix score; otherwise we subtract 1
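The three tests can be sketched for suffixes as follows. This is an illustration under stated assumptions, not the original program: the function names are hypothetical, prefix counts in a `Counter` stand in for the forward tree, each distinct word counts once, and “≈ 1” is modelled by a hypothetical `near_one` threshold that defaults to exactly 1.

```python
from collections import Counter


def build_prefix_counts(words):
    """Count, for every prefix, how many words start with it (stand-in for the forward tree)."""
    counts = Counter()
    for w in words:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return counts


def transition_prob(counts, fragment, next_char):
    total = counts[fragment]
    return counts[fragment + next_char] / total if total else 0.0


def score_suffixes(words, reward=19, penalty=1, near_one=1.0):
    vocab = set(words)
    counts = build_prefix_counts(words)
    scores = Counter()
    for w in vocab:
        for i in range(1, len(w)):
            stem, suffix = w[:i], w[i:]
            passes = (
                stem in vocab                                                      # test 1
                and transition_prob(counts, stem[:-1], stem[-1]) >= near_one      # test 2
                and transition_prob(counts, stem, suffix[0]) < 1.0                # test 3
            )
            scores[suffix] += reward if passes else -penalty
    return scores


scores = score_suffixes(["report", "reports", "reported", "reporting"])
print(scores["s"], scores["t"])  # 19 -1
```

Prefixes would be scored the same way on reversed words, which corresponds to using the backward tree.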
Step 2: Scoring morphemes
We declare fragments to be morphemes if they have positive scores
+19/-1 scheme
– Chosen so that a fragment has a positive score iff it passes more than 5% of its tests
– More frequent morphemes have higher scores
– Any multiple of these numbers would produce the same results
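The 5% threshold follows from a short calculation. The symbols n (number of times a fragment is tested), k (number of tests passed) and c (a positive scaling factor) are introduced here for illustration only.

```latex
\begin{align*}
\text{score} &= 19k - (n - k) = 20k - n \\
\text{score} > 0 &\iff \frac{k}{n} > \frac{1}{20} = 5\% \\
\text{score}_c &= 19c\,k - c\,(n - k) = c\,(20k - n), \quad c > 0
\end{align*}
```

Scaling both numbers by the same positive factor changes neither the sign of any score nor the ranking of the fragments.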
Step 3: Pruning
Don’t want “er”, “s” and “ers” all in the morpheme list
Remove any morpheme composed of two other morphemes with higher scores
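A possible reading of the pruning rule, sketched below. The function name `prune` and the example scores are hypothetical, and the sketch assumes “higher scores” means that both parts outscore the composite.

```python
def prune(scores):
    """Drop any morpheme that splits into two listed morphemes that both outscore it."""
    kept = {}
    for morpheme, score in scores.items():
        composed = any(
            morpheme[:i] in scores
            and morpheme[i:] in scores
            and scores[morpheme[:i]] > score
            and scores[morpheme[i:]] > score
            for i in range(1, len(morpheme))
        )
        if not composed:
            kept[morpheme] = score
    return kept


# Made-up scores, for illustration only.
suffix_scores = {"s": 100, "er": 50, "ers": 20, "ing": 80}
pruned = prune(suffix_scores)
print(sorted(pruned))  # ['er', 'ing', 's']
```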
Top English Morphemes
Top 10 of the 808 morphemes in the “prefix” list:
1. un
2. re
3. dis
4. non
5. over
6. mis
7. in
8. sub
9. pre
10. inter
Top English Morphemes
Top 10 of the 987 morphemes in the “suffix” list:
1. s
2. ly
3. ness
4. ing
5. ed
6. al
7. ism
8. less
9. ist
10. able
Top English Morphemes
Prefixes and suffixes later in the list
Rank    Prefix    Suffix
101.    well      ier
102.    water     box
103.    servo     town
104.    make      line
105.    quick     more
Step 4: Segmenting Words
politeness = polite+ness or politenes+s ?
Use transitional probabilities again
– Expect Pr(n | polite) < Pr(s | politenes)
Peel off morpheme with smallest probability (unless all probabilities are 1)
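The peeling step for suffixes might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, prefix counts stand in for the forward tree, and repeating the peel until no candidate remains is an assumed reading of the slide. The toy corpus and suffix list are made up.

```python
from collections import Counter


def build_prefix_counts(words):
    """Count, for every prefix, how many words start with it (stand-in for the forward tree)."""
    counts = Counter()
    for w in words:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return counts


def segment_suffixes(word, suffixes, counts):
    parts = []
    while True:
        best = None  # (probability, split index)
        for i in range(1, len(word)):
            stem, suffix = word[:i], word[i:]
            if suffix not in suffixes or counts[stem] == 0:
                continue
            p = counts[stem + suffix[0]] / counts[stem]  # Pr(first suffix letter | stem)
            if p < 1.0 and (best is None or p < best[0]):
                best = (p, i)
        if best is None:  # no listed suffix, or all probabilities are 1
            break
        parts.insert(0, word[best[1]:])
        word = word[:best[1]]
    return [word] + parts


counts = build_prefix_counts(["polite", "politely", "politeness", "politer"])
print(segment_suffixes("politeness", {"s", "ness", "ly"}, counts))  # ['polite', 'ness']
```

Here Pr(n | polite) = 1/4 while Pr(s | politenes) = 1, so “ness” is peeled off and “s” is not.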
Results
English results
– On the provided 532-word Gold Standard
    F-score    Precision    Recall
    80.92%     82.84%       79.10%
– On the organizers’ test data
    F-score    Precision    Recall
    76.8%      76.2%        77.4%
Results
Breakdown
– Contribution of the different intuitions
                         F-score    Precision    Recall
    Criterion 1 only     57.33%     45.22%       78.29%
    Criteria 2 & 3 only  60.58%     50.21%       76.36%
    All                  80.92%     82.84%       79.10%
Results
               F-score    Precision    Recall
    Finnish    46.62%     83.76%       32.30%
    Turkish    54.04%     72.68%       43.01%
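The reported F-scores are consistent with the balanced harmonic mean of precision and recall. A quick check, assuming that definition (the helper name `f_score` is hypothetical):

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f_score(83.76, 32.30), 2))  # Finnish: 46.62
print(round(f_score(72.68, 43.01), 2))  # Turkish: 54.04
```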
Simple and Effective
Based on intuition, not a complex model
– How we personally would segment words
Program was relatively short: 252 lines of Perl
Other variations had slightly better F-scores
Best mixture of performance and elegance
Thank you for listening.
Emily Pitler
Samarth Keshava