A Simpler, Intuitive Approach to Morpheme Induction

RePortS: A Simpler, Intuitive
Approach to Morpheme Induction
Emily Pitler
Samarth Keshava
Yale University
Goals

Segment English words into morphemes

Simple algorithm

Minimize assumptions and “magic numbers”
Approach

Identify common morphemes in the language
  – “prefix” and “suffix” lists

Use these to segment the test words
Intuition and Motivation

The resulting word fragment, after removing a
potential morpheme, is often still a word

Examples:
  – training = train+ing
  – chairman = chair+man
  – insufferable = insuffer+able

Don’t use to segment words
Intuition and Motivation

Use fluctuations in transitional probabilities
(Harris 1955, Hafer and Weiss 1974)

Examples:
  – Expect Pr(t | repor) ≈ 1
  – Expect Pr(s | report) < 1
    because there are other words such as reported,
    reporting, report, etc.
Four Steps
1. Preprocessing: build the lexicographic trees
2. Score word fragments to determine morphemes
3. Prune the morpheme lists
4. Segment words using the trees and morpheme lists
Step 1: Build the trees

We build a “forward tree” and a “backward
tree”

We use these trees to calculate transitional
probabilities in O(1) time
[Figure: hypothetical section of the forward tree]
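
A minimal sketch of how such trees could be built, not the authors' Perl code. It assumes each word type is counted once and that every node stores how many words pass through it; the names `Node`, `build_tree`, `find` and `transition_prob` are hypothetical.

```python
class Node:
    """One letter position in the tree; count = number of words passing through it."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def build_tree(words, backward=False):
    """Build a letter tree; with backward=True the words are read right to left."""
    root = Node()
    for word in words:
        letters = reversed(word) if backward else word
        node = root
        node.count += 1
        for ch in letters:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def find(root, fragment):
    """Walk down the tree along fragment; return None if the path does not exist."""
    node = root
    for ch in fragment:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def transition_prob(root, fragment, next_letter):
    """Pr(next_letter | fragment) = count(fragment + next_letter) / count(fragment)."""
    node = find(root, fragment)
    if node is None or node.count == 0:
        return 0.0
    child = node.children.get(next_letter)
    # Once the node is in hand, this is a single dictionary lookup.
    return child.count / node.count if child else 0.0


corpus = ["report", "reports", "reported", "reporting"]
forward = build_tree(corpus)
backward = build_tree(corpus, backward=True)

print(transition_prob(forward, "repor", "t"))
print(transition_prob(forward, "report", "s"))
```

For the backward tree, fragments are passed in reversed letter order, since the tree is built from the ends of the words.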
Step 2: Scoring morphemes

Example: scoring “s” in “reports”
  – Check if “report” is a word in the corpus
  – Check if Pr(t | repor) ≈ 1
  – Check if Pr(s | report) < 1

If “s” passes all three tests, we add 19 to its
suffix score; otherwise we subtract 1
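
The three tests can be sketched roughly as follows for suffixes. This is an illustration under stated assumptions, not the original program: a dictionary of prefix counts stands in for the forward tree, every split point of every word is tested, and the cutoff `near_one` used for "≈ 1" is a hypothetical parameter.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words start with each prefix (stand-in for the forward tree)."""
    counts = Counter()
    for w in words:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return counts


def score_suffixes(words, reward=19, penalty=1, near_one=0.99):
    """Score every word-final fragment with the +19/-1 scheme from the slides."""
    vocab = set(words)
    counts = prefix_counts(vocab)
    scores = Counter()
    for word in vocab:
        for cut in range(1, len(word)):
            stem, suffix = word[:cut], word[cut:]
            # Test 2: Pr(last letter of stem | rest of stem), e.g. Pr(t | repor)
            p_last = counts[stem] / counts[stem[:-1]]
            # Test 3: Pr(first letter of suffix | stem), e.g. Pr(s | report)
            p_next = counts[stem + suffix[0]] / counts[stem]
            # Test 1: the stem is itself a word in the corpus
            passed = stem in vocab and p_last >= near_one and p_next < 1.0
            scores[suffix] += reward if passed else -penalty
    return scores


corpus = ["report", "reports", "reported", "reporting",
          "train", "trains", "trained", "training"]
scores = score_suffixes(corpus)
suffixes = {frag for frag, s in scores.items() if s > 0}
print(sorted(suffixes))
```

On this toy corpus only "s", "ed" and "ing" end up with positive scores; fragments such as "g" or "ts" fail the word test and go negative.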
Step 2: Scoring morphemes

We declare fragments to be morphemes if they have
positive scores

+19/-1 scheme
  – Chosen so that a fragment has a positive score iff it passes more than 5% of its tests
  – More frequent morphemes have higher scores
  – Any multiple of these numbers would produce the same results
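
A short worked derivation of the 5% threshold. The symbols n (the number of times a fragment is tested) and p (the fraction of those tests it passes) are introduced here for illustration only.

```latex
\begin{aligned}
\text{score} &= 19\,(pn) - 1\,(1-p)\,n = n\,(20p - 1),\\
\text{score} > 0 &\iff p > \tfrac{1}{20} = 5\%.
\end{aligned}
```

For a fixed pass rate above 5%, the score grows linearly with n, which is why more frequent morphemes score higher. Replacing +19/-1 with +19k/-k for any k > 0 multiplies every score by k, so the sign and the ranking are unchanged.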
Step 3: Pruning

Don’t want “er”, “s” and “ers” all in the
morpheme list

Remove any morpheme composed of two
other morphemes with higher scores
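
One possible reading of this rule, sketched below: a morpheme is dropped when it can be split into two listed morphemes that both score higher than it does. The function name `prune` and the scores in the example are hypothetical.

```python
def prune(scores):
    """Drop any morpheme that splits into two listed morphemes that both outscore it."""
    missing = float("-inf")
    kept = {}
    for morpheme, score in scores.items():
        composite = any(
            scores.get(morpheme[:i], missing) > score
            and scores.get(morpheme[i:], missing) > score
            for i in range(1, len(morpheme))
        )
        if not composite:
            kept[morpheme] = score
    return kept


# Made-up scores, for illustration only.
suffix_scores = {"s": 900, "er": 400, "ers": 150, "ness": 300}
pruned = prune(suffix_scores)
print(pruned)
```

Here "ers" is removed because "er" and "s" both outscore it, while "ness" stays because no split of it yields two listed morphemes.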
Top English Morphemes

Top 10 of the 808 morphemes in the
“prefix” list:
 1. un       6. mis
 2. re       7. in
 3. dis      8. sub
 4. non      9. pre
 5. over    10. inter
Top English Morphemes

Top 10 of the 987 morphemes in the
“suffix” list:
 1. s        6. al
 2. ly       7. ism
 3. ness     8. less
 4. ing      9. ist
 5. ed      10. able
Top English Morphemes

Prefixes and suffixes later in the list
        Prefixes    Suffixes
 101.   well        ier
 102.   water       box
 103.   servo       town
 104.   make        line
 105.   quick       more
Step 4: Segmenting Words

politeness = polite+ness or politenes+s ?

Use transitional probabilities again
  – Expect Pr(n | polite) < Pr(s | politenes)

Peel off morpheme with smallest probability
(unless all probabilities are 1)
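
A rough sketch of the peeling step for suffixes only, under the same assumptions as before: prefix counts stand in for the forward tree, and the names `prefix_counts` and `peel_suffixes` are hypothetical. Prefixes would presumably be handled symmetrically with the backward tree.

```python
from collections import Counter


def prefix_counts(words):
    """Count how many words start with each prefix (stand-in for the forward tree)."""
    counts = Counter()
    for w in words:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return counts


def peel_suffixes(word, suffixes, counts):
    """Repeatedly strip the candidate suffix whose boundary probability is smallest."""
    parts = []
    while True:
        candidates = []
        for suf in suffixes:
            stem = word[: len(word) - len(suf)]
            if stem and word.endswith(suf) and counts[stem]:
                # Pr(first letter of suffix | stem), e.g. Pr(n | polite)
                prob = counts[stem + suf[0]] / counts[stem]
                candidates.append((prob, suf))
        if not candidates:
            break
        prob, suf = min(candidates)
        if prob >= 1.0:  # every candidate boundary has probability 1: stop
            break
        parts.insert(0, suf)
        word = word[: len(word) - len(suf)]
    return [word] + parts


corpus = ["polite", "politely", "politeness"]
counts = prefix_counts(corpus)
segments = peel_suffixes("politeness", ["s", "ness", "ly"], counts)
print("+".join(segments))
```

In this toy corpus Pr(n | polite) = 1/3 while Pr(s | politenes) = 1, so "ness" is peeled off rather than "s".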
Results

English results
  – On the provided 532-word Gold Standard
        F-score    Precision    Recall
        80.92%     82.84%       79.10%
  – On the organizers’ test data
        F-score    Precision    Recall
        76.8%      76.2%        77.4%
Results

Breakdown
  – Contribution of the different intuitions
                               F-score    Precision    Recall
        Criterion 1 only       57.33%     45.22%       78.29%
        Criteria 2 & 3 only    60.58%     50.21%       76.36%
        All                    80.92%     82.84%       79.10%
Results


Finnish
        F-score    Precision    Recall
        46.62%     83.76%       32.30%

Turkish
        F-score    Precision    Recall
        54.04%     72.68%       43.01%
Simple and Effective

Based on intuition, not a complex model
  – How we personally would segment words

Program was relatively short: 252 lines of Perl

Other variations had slightly better F-scores

Best mixture of performance and elegance
Thank you for listening.
Emily Pitler
Samarth Keshava