Hindi Parts-of-Speech Tagging & Chunking
Baskaran S
MSRI
What's in?
- Why POS tagging & chunking?
- Approach
- Challenges
  - Unseen tag sequences
  - Unknown words
- Results
- Future work
- Conclusion
Intro & Motivation
POS
- Parts-of-speech
- Dionysius Thrax (ca. 100 BC)
  - 8 types: noun, verb, pronoun, preposition, adverb, conjunction, participle and article

I get my thing in action. (Verb, that's what's happenin')
To work, (Verb!)
To play, (Verb!)
To live, (Verb!)
To love... (Verb!...)
- Schoolhouse Rock
Tagging
Assigning the appropriate POS or lexical class marker to words in a given text.
- Symbols, punctuation markers, etc. are also assigned specific tag(s)
Why POS tagging?
- Gives significant information about a word and its neighbours
  - Adjectives occur near nouns
  - Adverbs occur near verbs
- Gives a clue to how a word is pronounced
  - OBject as a noun
  - obJECT as a verb
- Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation, etc.
Chunking
- Identifying simple phrases
  - Noun phrase, verb phrase, adjectival phrase…
- Useful as a first step towards parsing
- Useful for named entity recognition
POS tagging & Chunking
Stochastic approaches
- Enabled by the availability of tagged corpora in large quantities
- Most are based on HMMs
  - Weischedel '93
  - DeRose '88
  - Skut and Brants '98 – extending HMMs to chunking
  - Zhou and Su '00
  - and many more…
HMM
- Pick the tag sequence T that is most probable given the word sequence W:

  $\hat{T} = \arg\max_T P(T \mid W)$, where $P(T \mid W) = \dfrac{P(T)\,P(W \mid T)}{P(W)}$

  Both the tag-sequence probability $P(T)$ and the word-emission probability $P(W \mid T)$ are estimated from an annotated corpus.
- Assumptions
  - The probability of a word depends only on its tag (word-emission probability)
  - The tag history is approximated by the most recent two tags (tag-sequence probability)

  $P(T \mid W) \propto \left[ P(t_1)\,P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}\,t_{i-1}) \right] \left[ \prod_{i=1}^{n} P(w_i \mid t_i) \right]$
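As a concrete reading of the formula above, here is a minimal Python sketch of scoring one candidate tag sequence under the trigram decomposition; the probability tables (p_uni, p_bi, p_tri, p_emit) are hypothetical dictionaries assumed to have been estimated from the annotated corpus.

```python
import math

def hmm_log_score(words, tags, p_uni, p_bi, p_tri, p_emit):
    """Log of P(T) * P(W | T) under the slide's trigram decomposition:
    P(t1) P(t2|t1) prod_{i=3..n} P(ti | t_{i-2} t_{i-1}) * prod_i P(wi | ti).
    p_uni[t], p_bi[(t1, t2)], p_tri[(t1, t2, t3)] and p_emit[(w, t)] are
    assumed probability tables estimated from the annotated corpus."""
    logp = math.log(p_uni[tags[0]])
    if len(tags) > 1:
        logp += math.log(p_bi[(tags[0], tags[1])])
    for i in range(2, len(tags)):
        logp += math.log(p_tri[(tags[i - 2], tags[i - 1], tags[i])])
    for w, t in zip(words, tags):
        logp += math.log(p_emit[(w, t)])
    return logp
```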
Structural tags
- A triple: POS tag, structural relation & chunk tag
- Originally proposed by Skut & Brants '98
  - Seven relations
- Enables embedded and overlapping chunks
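A hedged sketch of the triple encoding, assuming structural tags can be treated as plain (POS, relation, chunk) triples; the StructuralTag type and the sample relation code below are illustrative, not the exact scheme of Skut & Brants '98.

```python
from typing import NamedTuple

# A sketch, assuming structural tags are plain triples; the relation codes
# (e.g. "00", "09", "90", "99") follow the figure on the next slide.
class StructuralTag(NamedTuple):
    pos: str       # part-of-speech tag, e.g. "NN"
    relation: str  # structural relation between neighbouring chunks
    chunk: str     # chunk label, e.g. "NP" or "VG"

# Treating the triple as one composite tag lets an ordinary HMM tagger
# predict chunk structure with no change to the decoding algorithm.
composite = "-".join(StructuralTag(pos="NN", relation="09", chunk="NP"))
```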
Structural relations
परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ ।
("He secured first division in the examination as well, and it was he who received a special award from the vice-chancellor at the school.")
[Figure: the sentence above annotated in SSF, segmented into NP and VG chunks; each word carries a structural-relation code (00, 09, 90, 99) and chunk-boundary marks (Beg, End), shown for परीक्षा, में, श्रेणी, प्राप्त and ।]
Decoding
- Viterbi is mostly used (also A* or stack decoding)
- Aims at finding the best path (tag sequence) given the observation sequence
- Possible tags are identified for each transition, with associated probabilities
- The best path is the one that maximizes the product of these transition probabilities (a sketch follows the lattice figure below)
[Figure, slides 14–16: Viterbi lattice for the sentence अब जीवन का एक अन्य रूप उनके सामने आया । ("Now another form of life appeared before them."), with candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM at each position; successive slides trace the best path through the lattice.]
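The lattice above is what the decoder walks. Below is a minimal bigram Viterbi sketch in Python (the slides use a trigram tag model; a bigram version keeps the example short); trans_prob, emit_prob and the start symbol "<s>" are assumptions for illustration.

```python
import math

def viterbi(words, tagset, trans_prob, emit_prob):
    """Best tag sequence under a bigram HMM. trans_prob[(t_prev, t)] and
    emit_prob[(word, tag)] are assumed probability tables."""
    # best[i][t]: best log-score of a tag path for words[:i+1] ending in tag t
    best = [{t: math.log(trans_prob[("<s>", t)]) +
                math.log(emit_prob[(words[0], t)]) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            # predecessor tag that maximizes the path score into t
            prev = max(tagset,
                       key=lambda p: best[i - 1][p] + math.log(trans_prob[(p, t)]))
            best[i][t] = (best[i - 1][prev]
                          + math.log(trans_prob[(prev, t)])
                          + math.log(emit_prob[(words[i], t)]))
            back[i][t] = prev
    # trace back-pointers from the best final tag
    tag = max(tagset, key=lambda t: best[-1][t])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return list(reversed(path))
```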
Issues
1. Unseen tag sequences
- Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
- The idea is to redistribute some fractional probability mass from seen occurrences to unseen ones
- Good-Turing
  - Re-estimates the probability mass of lower-count N-grams from that of higher counts:

  $c^* = (c + 1)\,\dfrac{N_{c+1}}{N_c}$

  where $N_c$ is the number of N-grams occurring $c$ times
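A minimal sketch of the Good-Turing re-estimation above; the counts table is hypothetical, and a production estimator would also need a cut-off or regression for sparse high counts.

```python
from collections import Counter

def good_turing_counts(counts):
    """Remap each raw count c to c* = (c + 1) * N_{c+1} / N_c, where N_c is
    the number of distinct N-grams seen exactly c times."""
    n_c = Counter(counts.values())      # N_c: how many items occur c times
    adjusted = {}
    for item, c in counts.items():
        if n_c[c + 1] > 0:
            adjusted[item] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            adjusted[item] = c          # no higher count observed: keep raw c
    return adjusted
```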
2. Unseen words
- The corpus is never sufficient (even beyond 10 million words)
- Not all unseen words are proper names
- Treat them as rare words that occur once in the corpus – Baayen and Sproat '96, Dermatas and Kokkinakis '95
- Known Hindi corpus of 25 K words and unseen corpus of 6 K words
- All words vs. hapax vs. unknown words
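A hedged sketch of the rare-word idea the slide cites: approximate the tag distribution of unknown words by that of hapax legomena (words seen exactly once) in the training corpus. The corpus layout is an assumption.

```python
from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs. Returns P(tag | unknown word)
    approximated by the tag distribution of words seen exactly once."""
    word_freq = Counter(word for word, _ in tagged_corpus)
    hapax_tags = Counter(tag for word, tag in tagged_corpus
                         if word_freq[word] == 1)
    total = sum(hapax_tags.values())
    return {tag: n / total for tag, n in hapax_tags.items()}
```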
Tag distribution analysis
[Figure: probability of each tag (y-axis, 0 to 0.4) across the tagset (x-axis) for four word classes: all words, hapax (count = 1), hapax (count < 3) and unknown words.]
3. Features
- Can we use other features?
  - Capitalization
  - Word endings and hyphenation
- Weischedel '93 reports about a 66% reduction in error rate with word endings and hyphenation
- Capitalization, though useful for proper nouns, is not very effective
Contd…
- String length
- Prefix & suffix – fixed character width
- Character encoding range
- A complete analysis remains to be done
- Expected to be very effective for morphologically rich languages
  - To be experimented with Tamil
(a feature-extraction sketch follows below)
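A hedged sketch of the surface features listed above; the exact feature set and widths used in the experiments are not specified, so the choices here (3-character affixes, a Devanagari block check) are illustrative.

```python
def word_features(word):
    """Surface features of the kind the slides list, for unknown-word handling."""
    return {
        "length": len(word),
        "prefix": word[:3],                  # fixed-width prefix
        "suffix": word[-3:],                 # fixed-width suffix (word ending)
        "hyphenated": "-" in word,
        # character-encoding range: does the word fall in the Devanagari block?
        "devanagari": all("\u0900" <= ch <= "\u097F" for ch in word),
    }
```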
4. Multi-part words
- Examples
  In/ terms/ of/
  United/ States/ of/ America/
- More problematic in Hindi
  United/NNPC States/NNPC of/NNPC America/NNP
  Central/NNC government/NN
  (NNP – proper noun, NNPC – compound proper noun, NN – noun, NNC – compound noun)
- How does the system identify the last word in a multi-part word? (see the grouping sketch below)
- 10% of errors in Hindi are due to this (6 K words tested)
Results
Evaluation metrics
- Tag precision
- Unseen word accuracy
  - % of unseen words that are correctly tagged
  - Estimates how well unseen words are handled
- % reduction in error
  - Reduction in error after the application of a particular feature
(a small computation sketch follows)
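A small sketch of the metrics; tag precision matches the tables that follow, while the error-reduction formula is a standard formulation assumed here rather than spelled out in the slides.

```python
def tag_precision(correct, total):
    """Percentage of tokens tagged correctly."""
    return 100.0 * correct / total

def error_reduction(baseline_precision, new_precision):
    """Percent reduction in error rate after applying a feature,
    computed from precision percentages (assumed formulation)."""
    baseline_error = 100.0 - baseline_precision
    new_error = 100.0 - new_precision
    return 100.0 * (baseline_error - new_error) / baseline_error

# e.g. the test-set figures from the next slide: 3961 of 5000 tokens correct
assert round(tag_precision(3961, 5000), 2) == 79.22
```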
Results – Tagger
- No structural tags → better smoothing
- Unseen data – significantly more unknowns

                        Dev     S-1     S-2     S-3     S-4     Test
# words                 8511    6388    6397    6548    5847    5000
Correctly tagged        6749    5538    5504    5558    5060    3961
Precision (%)           79.29   86.69   86.04   86.06   86.54   79.22
# Unseen                1543    660     648     589     603     1012
Correctly tagged        672     354     323     265     312     421
Unseen precision (%)    43.55   53.63   49.84   44.99   51.74   41.60
Results – Chunk tagger
- Training 22 K, development data 8 K
- 4-fold cross-validation
- Test data 5 K

             POS tagging   Chunk identification   Labelling
             precision     Pre      Rec           Pre      Rec
Dev data     76.16         69.54    69.05         66.73    66.27
Average      85.02         72.26    73.52         70.01    71.35
Test data    76.49         58.72    61.28         54.36    56.73
Results – Tagging error analysis
- Significant issues with nouns/multi-part words
  - NNP → NN
  - NNC → NN
- Also,
  - VAUX → VFM; VFM → VAUX and
  - NVB → NN; NN → NVB
HMM performance (English)
- Accuracies of > 96% reported
- About 85% for unknown words
- Advantage
  - Simple, and the most suitable approach when annotated data is available
Conclusion
Future work
- Handling unseen words
- Smoothing
- Can we exploit other features?
  - Especially morphological ones
- Multi-part words
Summary
- Statistical approaches now include linguistic features for higher accuracies
- Improvement required
  - Tagging
    - Precision – 79.22%
    - Unknown words – 41.6%
  - Chunking
    - Precision – 60%
    - Recall – 62%