Japanese Dependency Analysis using Cascaded Chunking
Japanese Dependency Analysis using Cascaded Chunking
Taku Kudo 工藤 拓
Yuji Matsumoto 松本 裕治
Nara Institute of Science and Technology, JAPAN
Motivation
Kudo & Matsumoto 2000 (VLC)
Presented a state-of-the-art Japanese dependency
parser using SVMs (89.09% on the standard dataset)
Demonstrated the high generalization performance
and feature selection ability of SVMs
Problems
Not scalable
• 2 weeks of training on 7,958 sentences
• Hard to train with larger data
Slow parsing
• 2-3 sec./sentence
• Too slow for actual NL applications
Goal
Improve the scalability and the
parsing efficiency without losing
accuracy!
How?
Apply Cascaded Chunking model to
dependency parsing and the selection of
training examples
Reduce the number of times SVMs are
consulted in parsing
Reduce the number of negative examples
learned
Outline
Japanese dependency analysis
Two models
Probabilistic model (previous)
Cascaded Chunking model (new!)
Features used for training and
classification
Experiments and results
Conclusion and future work
Japanese Dependency Analysis (1/2)
Analysis of the relationships between
phrasal units called bunsetsu (segments),
comparable to base phrases in English
Two constraints
Each segment modifies one of the
segments to its right (Japanese is a
head-final language)
Dependencies do not cross each other
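The two constraints can be checked mechanically. A minimal sketch (an illustrative helper, not from the slides) that validates a candidate dependency structure against both constraints:

```python
def is_valid_dependency(heads):
    """Check a candidate dependency structure against the two constraints.

    heads[i] is the 0-based index of the segment that segment i modifies;
    the final segment is the sentence head and has head None.
    """
    n = len(heads)
    if n == 0 or heads[-1] is not None:
        return False
    # Constraint 1: every other segment modifies a segment to its right
    for i in range(n - 1):
        if heads[i] is None or heads[i] <= i:
            return False
    # Constraint 2: dependencies do not cross each other
    for i in range(n - 1):
        for j in range(i + 1, n - 1):
            if i < j < heads[i] < heads[j]:
                return False
    return True

# 私は / 彼女と / 京都に / 行きます: the first three segments all modify
# the final verb, which satisfies both constraints.
```

Because all dependencies point rightward, two arcs (i, heads[i]) and (j, heads[j]) with i < j can only be disjoint or nested; the pattern i < j < heads[i] < heads[j] is exactly a crossing.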
Japanese Dependency Analysis (2/2)
Raw text
私は彼女と京都に行きます
(I go to Kyoto with her.)
Morphological analysis and bunsetsu identification
私は / 彼女と / 京都に / 行きます
I-top / with her / to Kyoto-loc / go
Dependency analysis
私は / 彼女と / 京都に / 行きます
(each of the first three segments modifies 行きます "go")
Probabilistic Model
Input
私は 1 / 彼女と 2 / 京都に 3 / 行きます 4
I-top / with her / to Kyoto-loc / go
1. Build a dependency matrix with ME, DT or SVMs
(how probable it is that one segment modifies another)
2. Search for the optimal dependencies that maximize the
sentence probability using CYK or chart parsing
Output: the most probable dependency structure

Dependency matrix (rows: modifier, columns: modifiee):
         2     3     4
  1     0.1   0.2   0.7
  2           0.2   0.8
  3                 1.0
Problems of Probabilistic Model (1/2)
Selection of training examples:
All candidate pairs of two segments:
dependency relation → positive
no dependency relation → negative
This straightforward selection requires a total of
n(n-1)/2 training examples per sentence
(where n is the number of segments in the sentence)
Difficult to combine the probabilistic model
with SVMs, whose training requires polynomial
computational cost
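The quadratic growth of this exhaustive extraction is easy to see concretely (the helper name is mine, for illustration):

```python
def num_candidate_pairs(n):
    """Number of (modifier, modifiee) candidates in an n-segment sentence
    when every segment may modify any segment to its right: n(n-1)/2."""
    return n * (n - 1) // 2

# A 4-segment sentence yields 6 candidate pairs; doubling the sentence
# length roughly quadruples the number of training examples.
```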
Problems of Probabilistic Model (2/2)
O(n^3) parsing time is necessary with
CYK or chart parsing
Even if beam search is applied,
O(n^2) parsing time is always necessary
The classification cost of SVMs is
much higher than that of other ML
algorithms such as ME and DT
Cascaded Chunking Model
Based on cascaded chunking for English parsing [Abney 1991]
Parses a sentence deterministically,
deciding only whether the current
segment modifies the segment on
its immediate right-hand side
Training examples are extracted
using this algorithm itself
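The deterministic parsing loop can be sketched as follows. This is a minimal illustration, assuming a `modifies(left, right)` callback that stands in for the trained SVM (tag D if it returns True, O otherwise); the termination fallback is my assumption, added to guard against a classifier that tags no segment D in a pass.

```python
def cascaded_chunking_parse(segments, modifies):
    """Deterministic cascaded chunking parse (sketch).

    In each pass, every remaining segment is tagged D if it modifies its
    immediate right neighbour, else O; segments tagged D have their head
    fixed and are removed, and the passes repeat until one segment remains.
    """
    heads = {}                             # segment index -> head index
    remaining = list(range(len(segments)))
    while len(remaining) > 1:
        decided = []
        for pos in range(len(remaining) - 1):
            i, j = remaining[pos], remaining[pos + 1]
            if modifies(segments[i], segments[j]):   # tag D
                heads[i] = j
                decided.append(i)
        if not decided:
            # Fallback (assumption): force the last pair so an imperfect
            # classifier cannot make the loop run forever.
            i, j = remaining[-2], remaining[-1]
            heads[i] = j
            decided.append(i)
        remaining = [k for k in remaining if k not in decided]
    return heads
```

In the training phase the same loop is run with the annotated corpus acting as the oracle in place of `modifies`, and each (tag, context) pair is stored as an SVM training example.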
Example: Training Phase
Annotated sentence
彼は1 彼女の2 温かい3 真心に4 感動した。5
He / her / warm / heart / be moved
(He was moved by her warm heart.)
[Diagram: the parser scans the segments in passes, tagging each segment
D (modifies its immediate right neighbour) or O (does not); segments
tagged D are removed, and the passes repeat until one segment remains.
Each tag is decided by the annotated corpus.]
Pairs of tag (D or O) and context (features)
are stored as training data for the SVMs
Example: Test Phase
Test sentence
彼は1 彼女の2 温かい3 真心に4 感動した。5
He / her / warm / heart / be moved
(He was moved by her warm heart.)
[Diagram: the same chunking passes as in training, except that each
D/O tag is now decided by the SVMs built in the training phase.]
Advantages of Cascaded Chunking Model
Simple and efficient
Prob.: O(n^3) vs. cascaded chunking: O(n^2)
In practice lower than O(n^2), since most segments
modify the segment on their immediate right-hand side
The number of training examples is much smaller
Independent of the ML algorithm
Can be combined with any ML algorithm
that works as a binary classifier
Probabilities of dependency are not necessary
Features
Modify or not?
彼の1 友人は2 この本を3 持っている4 女性を5 探している6
His / friend-top / this book-acc / have / lady-acc / be looking for
(His friend is looking for a lady who has this book.)
[Diagram: deciding whether the modifier 友人は2 modifies the modifiee
探している6; A, B and C mark the three types of context segments used
as dynamic features.]
Static features
modifier/modifiee:
Head/functional word (surface, POS, POS-subcategory,
inflection-type, inflection-form), brackets, quotations,
punctuation, position
Between segments:
distance, case particles, brackets, quotations, punctuation
Dynamic features [Kudo, Matsumoto 2000]
A, B: static features of the functional word
C: static features of the head word
Experimental Setting
Kyoto University Corpus 2.0/3.0
Standard data set
• Training: 7,958 sentences / Test: 1,246 sentences
• Same data as [Uchimoto et al. 98; Kudo, Matsumoto 00]
Large data set
• 2-fold cross-validation using all 38,383 sentences
Kernel function: 3rd-degree polynomial
Evaluation methods
Dependency accuracy
Sentence accuracy
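The two evaluation measures can be sketched as follows, with each parse represented as a modifier-to-modifiee map (the helper is illustrative, not from the slides):

```python
def evaluate(gold_parses, system_parses):
    """Dependency accuracy: correctly identified modifiees over all
    segments that have one.  Sentence accuracy: fraction of sentences
    whose dependency structure is completely correct."""
    dep_correct = dep_total = sent_correct = 0
    for gold, system in zip(gold_parses, system_parses):
        dep_correct += sum(1 for m in gold if system.get(m) == gold[m])
        dep_total += len(gold)
        sent_correct += int(all(system.get(m) == gold[m] for m in gold))
    return dep_correct / dep_total, sent_correct / len(gold_parses)
```

Sentence accuracy is the stricter measure: one wrong dependency costs the whole sentence, which is why it sits far below dependency accuracy in the results.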
Results

Data Set | Model             | Dependency Acc. (%) | Sentence Acc. (%) | # of training sentences | # of training examples | Training time (hours) | Parsing time (sec./sent.)
Standard | Cascaded Chunking | 89.29               | 47.53             | 7,956                   | 110,355                | 8                     | 0.5
Standard | Probabilistic     | 89.09               | 46.17             | 7,956                   | 459,105                | 336                   | 2.1
Large    | Cascaded Chunking | 90.04               | 53.16             | 19,191                  | 251,254                | 48                    | 0.7
Large    | Probabilistic     | N/A                 | N/A               | 19,191                  | 1,074,316              | N/A                   | N/A
Effect of Dynamic Features (1/2)
Effect of Dynamic Features (2/2)
Modify or not?
彼の1 友人は2 この本を3 持っている4 女性を5 探している6
His / friend-top / this book-acc / have / lady-acc / be looking for
[Diagram: modifier 友人は2, modifiee 探している6; A, B and C mark the
deleted types of dynamic features.]

Difference from the model with all dynamic features:

Deleted type | Dependency Acc. | Sentence Acc.
A            | -0.28%          | -0.89%
B            | -0.10%          | -0.89%
C            | -0.28%          | -0.56%
AB           | -0.33%          | -1.21%
AC           | -0.55%          | -0.97%
BC           | -0.54%          | -1.61%
ABC          | -0.58%          | -2.34%
Probabilistic vs. Cascaded Chunking (1/2)
The probabilistic model uses all candidate dependency
relations as training data
彼は1 この本を2 持っている3 女性を4 探している5
He-top / this book-acc / have / lady-acc / be looking for
(He is looking for a lady who has this book.)
Positive: この本を2 → 持っている3
Negative: この本を2 → 探している5 (unnecessary)
The probabilistic model commits to a number of
unnecessary training examples
Probabilistic vs. Cascaded Chunking (2/2)

         | Probabilistic                 | Cascaded Chunking
Strategy | Maximize sentence probability | Shift-reduce, deterministic
Merit    | Can see all candidates of     | Simple, efficient and scalable;
         | dependency                    | as accurate as the prob. model
Demerit  | Not efficient; commits to     | Cannot see all (posterior)
         | unnecessary training examples | candidates of dependency
Conclusion
A new Japanese dependency parser
using a cascaded chunking model
It outperforms the previous
probabilistic model with respect to
accuracy, efficiency and scalability
Dynamic features contribute significantly
to improving the performance
Future Work
Coordinate structure analysis
Coordinate structures frequently
appear in long Japanese sentences
and make analysis hard
Use of posterior context
It is hard to parse the following phrase
using only the cascaded chunking model:
僕の / 母の / ダイヤの / 指輪
my / mother's / diamond / ring
(my mother's diamond ring)
Comparison with Related Work

                  | Model                    | Training Corpus (# of sentences) | Acc. (%)
Our Model         | Cascaded Chunking + SVMs | Kyoto Univ. (19,191)             | 90.46
Our Model         | Cascaded Chunking + SVMs | Kyoto Univ. (7,956)              | 89.29
Kudo et al. 00    | Prob. + SVMs             | Kyoto Univ. (7,956)              | 89.09
Uchimoto et al. 00| Prob. + ME               | Kyoto Univ. (7,956)              | 87.93
Kanayama et al. 00| Prob. + ME + HPSG        | EDR (192,778)                    | 88.55
Haruno et al. 98  | Prob. + DT + Boosting    | EDR (50,000)                     | 85.03
Fujio et al. 98   | Prob. + ML               | EDR (190,000)                    | 86.67
Support Vector Machines [Vapnik]
Find the separating hyperplane w·x + b = 0 that maximizes
the margin d between the two classes:
  w·x_i + b ≥ +1 for y_i = +1
  w·x_i + b ≤ -1 for y_i = -1
The margin between the two boundary hyperplanes is
  d = 2 / ||w||
so maximizing the margin is equivalent to:
  Min.:  L(w) = ||w||^2 / 2
  s.t.:  y_i [(w·x_i) + b] ≥ 1
Extensions: soft margin, kernel functions
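The experiments use a 3rd-degree polynomial kernel. A minimal sketch of such a kernel (the gamma and coef0 values are illustrative assumptions, not taken from the slides):

```python
def poly3_kernel(x, y, gamma=1.0, coef0=1.0):
    """3rd-degree polynomial kernel K(x, y) = (gamma * <x, y> + coef0)^3.

    On binary feature vectors this implicitly takes all combinations of
    up to three features into account, without enumerating them.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** 3
```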