Knowledge-Free Induction of Morphology Using Latent Semantic Analysis
(Patrick Schone and Daniel Jurafsky)
Danny Shacham
Yehoyariv Louck

Presentation Outline

The problem
Previous solutions
The proposed approach
– Advantages
– The Technique
– Evaluation Criteria
– The Results

The Problem

The main problem this research is trying to solve is how to automatically induce morphological relationships between words.
The problem matters for the field of morphological analyzers, where there is a growing need to build analyzers without human knowledge.

Previous Solutions

Existing induction approaches rely on statistics of hypothesized stems and affixes to choose which affixes are legitimate.
Relying on statistics rather than on semantic knowledge may lead to induction errors.
The three main algorithms today are:
– Déjean (1998)
– Goldsmith (1997)
– Gaussier (1999)

The proposed approach – Advantages

This paper introduces a semantics-based algorithm that proposes an affix only when the stem and the stem-plus-affix are sufficiently similar semantically.
Using semantic similarity may resolve some of the induction errors that purely statistical approaches make.
The proposed solution is knowledge-free.
The proposed solution could be applied to any inflectional language.

The proposed approach – The Technique

The algorithm consists of 4 stages:
– Identifying potential affixes
– Finding pairs of words that are possibly morphological variants
– Developing semantic vectors for each word
– Selecting variants that have similar semantic vectors (similar semantic meaning)

The Technique – Stage 1

Candidate affixes are selected using the p-similarity technique (as in Gaussier).
The method inserts words into a trie and extracts affixes by looking at the nodes in the trie where there are branches.
Only the k most frequent affixes are selected (k is usually 200).
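
The trie step can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: it handles suffixes only, treats the empty string as the NULL affix, and the names build_trie and candidate_suffixes are hypothetical.

```python
from collections import Counter

END = "$"  # end-of-word marker; assumes "$" never occurs inside a word


def build_trie(words):
    """Insert every word into a nested-dict trie."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def candidate_suffixes(words, k=200):
    """Count the strings that follow branching trie nodes; keep the top k."""
    root = build_trie(words)
    counts = Counter()
    for word in sorted(set(words)):
        node = root
        for i, ch in enumerate(word):
            node = node[ch]
            # A node branches when more than one continuation leaves it
            # (the end-of-word marker counts as a continuation).
            if len(node) > 1:
                counts[word[i + 1:]] += 1
    return [suffix for suffix, _ in counts.most_common(k)]


words = ["car", "cars", "care", "cared", "cares", "walk", "walks", "walked"]
top = candidate_suffixes(words, k=3)
```

On this toy word list the three most frequent candidates are the NULL suffix, "s" and "ed". Less frequent candidates such as "es" or "d" fall below the cutoff, which is the role the frequency threshold k plays on real data.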

The Technique – Stage 2

Identifying rules: a rule is a pair of candidate affixes that descend from a common ancestor node.
Defining a PPMV (pair of potential morphological variants): two words sharing the same root and the same affix rule.
Defining a ruleset: the ruleset of a given rule is the set of all PPMVs that have the rule in common.
Building a ruleset for every rule extracted from the data.
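
The definitions of rule, PPMV and ruleset can be sketched in Python as follows. This is an illustrative sketch that assumes suffix-only rules and groups words by a shared stem string instead of walking the trie; build_rulesets is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def build_rulesets(words, suffixes):
    """Map each rule (a pair of suffixes) to its set of PPMVs."""
    vocab = set(words)
    # Group the candidate suffixes by the stem they attach to.
    by_stem = defaultdict(set)
    for word in vocab:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem = word[:len(word) - len(suffix)]
                by_stem[stem].add(suffix)
    # Every pair of suffixes seen on one stem is a rule;
    # the two resulting words form a PPMV under that rule.
    rulesets = defaultdict(set)
    for stem, found in by_stem.items():
        for a, b in combinations(sorted(found), 2):
            rulesets[(a, b)].add((stem + a, stem + b))
    return rulesets


words = ["car", "cars", "care", "cared", "cares", "walk", "walks", "walked"]
rulesets = build_rulesets(words, ["", "s", "ed"])
```

Here the rule ("", "ed") contains both ("walk", "walked") and ("car", "cared"). The second pair is only a string-level coincidence, and it is exactly the kind of PPMV the semantic stages are meant to filter out.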

The Technique – Stage 3

Building a term-term matrix (of size N×2N) that captures local semantic information.
Applying SVD (singular value decomposition) to the term-term matrix.
Using the SVD results (U, D, V) to build a semantic vector for each word.
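
A small NumPy sketch may make this stage more concrete. Several details are assumptions and not taken from the slides: the left N columns count words that precede the target word and the right N columns count words that follow it, the window size and the number k of retained singular values are arbitrary, raw counts are used without any weighting, and each word's vector is taken as its row of U scaled by the singular values. The name semantic_vectors is hypothetical.

```python
import numpy as np


def semantic_vectors(tokens, vocab, window=2, k=2):
    """Build an N x 2N co-occurrence matrix and reduce it with SVD."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    matrix = np.zeros((n, 2 * n))
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        row = index[word]
        # Left half: neighbours that precede the word.
        for other in tokens[max(0, pos - window):pos]:
            if other in index:
                matrix[row, index[other]] += 1
        # Right half: neighbours that follow the word.
        for other in tokens[pos + 1:pos + 1 + window]:
            if other in index:
                matrix[row, n + index[other]] += 1
    u, d, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(d))
    # One k-dimensional vector per word: its row of U scaled by D.
    return {w: u[index[w], :k] * d[:k] for w in vocab}


tokens = "the car stops the cars stop the car stops".split()
vocab = sorted(set(tokens))
vectors = semantic_vectors(tokens, vocab, window=2, k=3)
```

Keeping only the largest singular values is what makes this latent semantic analysis: words that occur in similar contexts end up with nearby vectors even if they never co-occur directly.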

The Technique – Stage 4

For each pair of words we wish to check, we take the two words’ semantic vectors and compute their NCS (normalized cosine score).
By considering the NCS of all word pairs under a particular rule, we determine which PPMVs are legitimate.
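
The slides do not define the NCS, so the following Python sketch shows one plausible reading and should be treated as an assumption: the raw cosine of the pair is standardized against each word's cosines with a sample of other words, and the smaller of the two standardized values is kept. The sample size, the use of the minimum, and the names cosine and ncs are all hypothetical choices.

```python
import numpy as np


def cosine(a, b):
    """Plain cosine similarity; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def ncs(w1, w2, vectors, sample_size=200, seed=0):
    """Cosine of (w1, w2), standardized against cosines with other words."""
    rng = np.random.default_rng(seed)
    words = sorted(vectors)
    raw = cosine(vectors[w1], vectors[w2])
    scores = []
    for w in (w1, w2):
        others = [x for x in words if x != w]
        size = min(sample_size, len(others))
        picks = rng.choice(len(others), size=size, replace=False)
        sample = [cosine(vectors[w], vectors[others[i]]) for i in picks]
        mu, sigma = np.mean(sample), np.std(sample)
        scores.append((raw - mu) / sigma if sigma > 0 else 0.0)
    # Assumption: keep the more conservative of the two scores.
    return min(scores)


# Hand-made toy vectors, not derived from any corpus.
vectors = {
    "car": np.array([1.0, 0.1, 0.0]),
    "cars": np.array([0.9, 0.2, 0.0]),
    "cared": np.array([0.0, 1.0, 0.2]),
    "walk": np.array([0.0, 0.1, 1.0]),
    "walks": np.array([0.1, 0.0, 0.9]),
    "tree": np.array([0.5, 0.5, 0.5]),
}
good = ncs("car", "cars", vectors)
bad = ncs("car", "cared", vectors)
```

With these toy vectors the pair ("car", "cars") scores above zero and ("car", "cared") scores below zero, so a threshold on the score separates the true variant from the spurious one.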

The proposed approach – Evaluation Criteria

The algorithm is compared to Goldsmith’s Linguistica (2000), using CELEX as the reference and a scoring mechanism.
The scoring mechanism uses conflation sets: it sums the correct, inserted and deleted words in the algorithm’s conflation sets relative to the CELEX conflation sets.
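
A conflation-set score of this kind could be computed roughly as in the sketch below. The normalization by the size of the reference set and the derivation of precision, recall and F-score from the three sums are assumptions, since the slides only name the correct, inserted and deleted counts. The function name score and the toy data are hypothetical.

```python
def score(predicted, gold):
    """Compare predicted conflation sets with reference conflation sets."""
    correct = inserted = deleted = 0.0
    for word, gold_set in gold.items():
        # A word with no prediction is treated as conflating only with itself.
        pred_set = predicted.get(word, {word})
        common = pred_set & gold_set
        # Assumption: each word's counts are normalized by the gold set size.
        correct += len(common) / len(gold_set)
        deleted += len(gold_set - common) / len(gold_set)
        inserted += len(pred_set - common) / len(gold_set)
    precision = correct / (correct + inserted) if correct + inserted else 0.0
    recall = correct / (correct + deleted) if correct + deleted else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy reference and prediction, not CELEX data.
gold = {"walk": {"walk", "walks", "walked"}, "car": {"car", "cars"}}
predicted = {"walk": {"walk", "walks"}, "car": {"car", "cars", "cared"}}
precision, recall, f_score = score(predicted, gold)
```

In the toy data the missing "walked" counts as a deletion and the extra "cared" as an insertion, so the first lowers recall and the second lowers precision.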

The proposed approach – The Results

The results suggest that semantics and LSA can play a key part in knowledge-free morphology induction.
The results show that the semantics-only approach presented in this article rivals current state-of-the-art systems.