Parts-of-Speech (POS) Tagging
Hindi Parts-of-Speech Tagging & Chunking
Baskaran S
MSRI
What's in this talk?
Why POS tagging & chunking?
Approach
Challenges
Unseen tag sequences
Unknown words
Results
Future work
Conclusion
Intro & Motivation
POS – Parts-of-Speech
Dionysius Thrax (ca 100 BC)
8 types – noun, verb, pronoun, preposition, adverb, conjunction, participle and article
I get my thing in action.
(Verb, that's what's happenin')
To work, (Verb!)
To play, (Verb!)
To live, (Verb!)
To love... (Verb!...)
- Schoolhouse Rock
Tagging
Assigning the appropriate POS or lexical class marker to words in a given text
Symbols, punctuation markers, etc. are also assigned specific tags
Why POS tagging?
Gives significant information about a word and its neighbours
Adjective near noun; adverb near verb
Gives a clue to how a word is pronounced
OBject as noun, obJECT as verb
Applications: speech synthesis, full parsing of sentences, IR, word sense disambiguation, etc.
Chunking
Identifying simple phrases
Noun phrase, verb phrase, adjectival phrase…
Useful as a first step to parsing
Named entity recognition
POS tagging & Chunking
Stochastic approaches
Availability of tagged corpora in large quantities
Most are based on HMMs
Weischedel ’93
DeRose ’88
Skut and Brants ’98 – extending HMM to chunking
Zhou and Su ‘00
and lots more…
HMM
$\hat{T} = \arg\max_T P(T \mid W)$

$P(T \mid W) = \dfrac{P(T)\, P(W \mid T)}{P(W)}$

The tag-sequence probability P(T) and the word-emit probability P(W | T) are estimated from an annotated corpus.

Assumptions
The probability of a word depends only on its tag
The tag history is approximated by the most recent two tags

$P(T \mid W) \propto P(t_1)\, P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}\, t_{i-1}) \prod_{i=1}^{n} P(w_i \mid t_i)$
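As an illustration of how these tag-sequence and word-emit probabilities can be estimated, here is a minimal counting sketch in Python over a toy annotated corpus (the corpus, words and padding symbol are illustrative assumptions, not the data used in this work).

```python
from collections import defaultdict

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
# The words are made up; the tag names (NN, VFM, SYM) follow the talk's tagset.
corpus = [
    [("dogs", "NN"), ("bark", "VFM"), (".", "SYM")],
    [("cats", "NN"), ("sleep", "VFM"), (".", "SYM")],
]

trigram = defaultdict(int)   # count(t_{i-2}, t_{i-1}, t_i)
bigram = defaultdict(int)    # count(t_{i-2}, t_{i-1})
emit = defaultdict(int)      # count(t_i, w_i)
tag = defaultdict(int)       # count(t_i)

for sent in corpus:
    tags = ["<s>", "<s>"] + [t for _, t in sent]   # pad so every tag has a two-tag history
    for w, t in sent:
        emit[(t, w)] += 1
        tag[t] += 1
    for i in range(2, len(tags)):
        trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
        bigram[(tags[i - 2], tags[i - 1])] += 1

def p_tag(t, t2, t1):
    """Maximum-likelihood estimate of P(t_i | t_{i-2}, t_{i-1})."""
    return trigram[(t2, t1, t)] / bigram[(t2, t1)] if bigram[(t2, t1)] else 0.0

def p_word(w, t):
    """Maximum-likelihood estimate of P(w_i | t_i)."""
    return emit[(t, w)] / tag[t] if tag[t] else 0.0

print(p_tag("VFM", "<s>", "NN"))   # 1.0 in this toy corpus
print(p_word("bark", "VFM"))       # 0.5
```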
Structural tags
A triple – POS tag, structural relation & chunk tag (sketched below)
Originally proposed by Skut & Brants ’98
Seven relations
Enables embedded and overlapping chunks
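As a rough illustration of the triple, a minimal representation might look like the sketch below. The field names are assumptions for illustration, and the example relation code is taken from the figure on the next slide, not a full specification of the seven relations.

```python
from typing import NamedTuple

class StructuralTag(NamedTuple):
    """One token's label in the structural-tag encoding:
    POS tag, structural relation and chunk tag."""
    pos: str        # e.g. "NN", "VFM"
    relation: str   # one of the seven structural-relation codes
    chunk: str      # e.g. "NP", "VG"

# Hypothetical token: a noun inside an NP chunk.
token = StructuralTag(pos="NN", relation="09", chunk="NP")
print(token)
```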
Structural relations
Example (Hindi): परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ ।
(He secured a first division in the examination too, and it was he alone who also received a special award from the Vice-Chancellor at the school.)
[Figure: SSF parse of the sentence, showing NP and VG chunks with structural-relation codes (00, 09, 90, 99) and Beg/End markers attached to the words]
Decoding
Viterbi is mostly used (also A* or stack decoding)
Aims at finding the best path (tag sequence) given the observation sequence
Possible tags are identified for each transition, with associated probabilities
The best path is the one that maximizes the product of these probabilities (a minimal sketch follows)
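A minimal sketch of Viterbi decoding over a per-word lattice of candidate tags, assuming toy bigram transition and emission tables. The values and the bigram history are illustrative assumptions for brevity, not the trained trigram model described above.

```python
import math

def viterbi(words, candidate_tags, p_trans, p_emit, start="<s>"):
    """Return the tag sequence that maximizes the product of transition and
    emission probabilities, working in log space to avoid underflow."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for tag in candidate_tags(w):
            scored = []
            for prev, (logp, path) in best.items():
                trans = p_trans.get((prev, tag), 1e-12)   # tiny floor for unseen transitions
                emit = p_emit.get((tag, w), 1e-12)        # tiny floor for unseen emissions
                scored.append((logp + math.log(trans) + math.log(emit), path + [tag]))
            new_best[tag] = max(scored)
        best = new_best
    return max(best.values())[1]

# Illustrative tables over romanized words from the lattice example below.
p_trans = {("<s>", "RB"): 0.4, ("<s>", "NN"): 0.6, ("RB", "NN"): 0.7,
           ("NN", "NN"): 0.1, ("NN", "VFM"): 0.8}
p_emit = {("RB", "ab"): 0.3, ("NN", "ab"): 0.05,
          ("NN", "jeevan"): 0.2, ("VFM", "aaya"): 0.4, ("NN", "aaya"): 0.01}

def candidate_tags(word):
    """Candidate tags come from the lexicon; open-class fallback for unknowns."""
    return [t for (t, w) in p_emit if w == word] or ["NN"]

print(viterbi(["ab", "jeevan", "aaya"], candidate_tags, p_trans, p_emit))
# -> ['RB', 'NN', 'VFM']
```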
[Figure, repeated over three slides: Viterbi lattice for the Hindi sentence अब जीवन का एक अन्य रूप उनके सामने आया । — candidate tags from {JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM, SYM} listed for each word, with the best path traced step by step]
Issues
1. Unseen tag sequences
Smoothing (add-one, Good-Turing) and/or backoff (deleted interpolation)
The idea is to redistribute some fraction of the probability mass of seen occurrences to unseen ones
Good-Turing
Re-estimates the probability mass of lower-count N-grams from that of higher counts:
$c^{*} = (c + 1)\, \dfrac{N_{c+1}}{N_c}$, where $N_c$ is the number of N-grams occurring $c$ times
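A small sketch of the Good-Turing re-estimate above, assuming a plain dictionary of N-gram counts. It is illustrative only, not the smoothing code used in the reported system.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing re-estimated counts: c* = (c + 1) * N_{c+1} / N_c."""
    # N_c = number of distinct N-grams seen exactly c times
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c_plus_1 = freq_of_freq[c], freq_of_freq[c + 1]
        # Fall back to the raw count when N_{c+1} is zero (typical for high counts).
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c
    return adjusted

# Toy tag-bigram counts; the singleton mass N_1 / N is what gets handed to unseen events.
counts = {("DT", "NN"): 3, ("NN", "VFM"): 1, ("JJ", "NN"): 1, ("NN", "SYM"): 2}
print(good_turing_counts(counts))
print("mass reserved for unseen:", Counter(counts.values())[1] / sum(counts.values()))
```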
2. Unseen words
Insufficient corpus (even with 10 million words)
Not all of them are proper names
Treat them as rare words that occur once in the corpus – Baayen and Sproat ’96, Dermatas and Kokkinakis ’95
Known Hindi corpus of 25 K words and unseen corpus of 6 K words
All words vs. Hapax vs. Unknown
Tag distribution analysis
[Chart: probability of each tag, compared across all words, hapax legomena (count = 1), rare words (count < 3) and unknown words; x-axis: tags, y-axis: probability]
3. Features
Can we use other features?
Capitalization
Word endings and hyphenation
Weischedel ’93 reports about a 66% reduction in error rate with word endings and hyphenation
Capitalization, though useful for proper nouns, is not very effective
Contd…
String length
Prefix & suffix – fixed character width (sketched below)
Character encoding range
A complete analysis remains to be done
Expected to be very effective for morphologically rich languages
To be experimented with for Tamil
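A minimal sketch of what fixed-width prefix/suffix, length and encoding-range features could look like for unknown words. The feature names and the width are assumptions for illustration; they are not the features actually evaluated in the talk.

```python
def affix_features(word, width=3):
    """Fixed-width prefix/suffix, length and script-range features for a word.
    Useful as back-off evidence when the word was never seen in training."""
    return {
        f"prefix_{width}": word[:width],
        f"suffix_{width}": word[-width:],
        "length": len(word),
        # Character-encoding range: True if every character falls in the
        # Devanagari Unicode block (U+0900 to U+097F).
        "devanagari": all("\u0900" <= ch <= "\u097f" for ch in word),
    }

print(affix_features("विद्यालय"))   # suffixes often signal POS in morphologically rich languages
print(affix_features("running"))
```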
4. Multi-part words
Examples
In/ terms/ of/
United/ States/ of/ America/
More problematic in Hindi
United/NNPC States/NNPC of/NNPC America/NNP
Central/NNC government/NN
NNPC – compound proper noun, NNP – proper noun, NNC – compound noun, NN – noun
How does the system identify the last word in a multi-part word? (see the grouping sketch below)
10% of the errors in Hindi are due to this (6 K words tested)
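As an illustration of the compound-tag convention in the examples above, a simple post-tagging grouping step might look like the sketch below. This is a hypothetical helper, not the system's actual handling of multi-part words.

```python
def group_compounds(tagged):
    """Merge NNPC.../NNP and NNC.../NN runs into single multi-part words.
    `tagged` is a list of (word, tag) pairs; the compound-continuation tags
    (NNPC, NNC) mark every word of the compound except the last (head) word,
    which is assumed to carry the closing tag (NNP or NN)."""
    groups, buffer = [], []
    for word, tag in tagged:
        if tag in ("NNPC", "NNC"):           # compound continues
            buffer.append(word)
        elif buffer:                          # current word closes the compound
            groups.append((" ".join(buffer + [word]), tag))
            buffer = []
        else:
            groups.append((word, tag))
    return groups

sentence = [("United", "NNPC"), ("States", "NNPC"), ("of", "NNPC"),
            ("America", "NNP"), ("Central", "NNC"), ("government", "NN")]
print(group_compounds(sentence))
# [('United States of America', 'NNP'), ('Central government', 'NN')]
```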
Results
Evaluation metrics
Tag precision
Unseen word accuracy
% of unseen words that are correctly tagged
Measures how well unseen words are handled
% reduction in error – reduction in error after the application of a particular feature
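A small sketch of the three metrics, assuming flat lists of gold and predicted tags plus a set of unseen words. Function names and numbers are illustrative.

```python
def tag_precision(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def unseen_accuracy(words, gold, predicted, unseen):
    """% of unseen words that are correctly tagged."""
    pairs = [(g, p) for w, g, p in zip(words, gold, predicted) if w in unseen]
    return sum(g == p for g, p in pairs) / len(pairs)

def error_reduction(baseline_precision, new_precision):
    """% reduction in error after the application of a particular feature."""
    return 100.0 * (new_precision - baseline_precision) / (1.0 - baseline_precision)

# Illustrative: moving from 79.22% to 86.5% precision removes about 35% of the errors.
print(round(error_reduction(0.7922, 0.865), 1))
```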
Results - Tagger
No structural tags; better smoothing
Unseen data – significantly more unknowns
                        Dev      S-1      S-2      S-3      S-4      Test
# words                 8511     6388     6397     6548     5847     5000
Correctly tagged        6749     5538     5504     5558     5060     3961
Precision (%)           79.29    86.69    86.04    86.06    86.54    79.22
# Unseen                1543     660      648      589      603      1012
Correctly tagged        672      354      323      265      312      421
Unseen precision (%)    43.55    53.63    49.84    44.99    51.74    41.60
Results – Chunk tagger
Training data 22 K, development data 8 K
4-fold cross validation
Test data 5 K
             POS tagging    Chunk identification    Labelling
             precision      Pre        Rec          Pre       Rec
Dev data     76.16          69.54      69.05        66.73     66.27
Average      85.02          72.26      73.52        70.01     71.35
Test data    76.49          58.72      61.28        54.36     56.73
Results – Tagging error analysis
Significant issues with nouns/multi-part words
NNP → NN
NNC → NN
Also, VAUX → VFM; VFM → VAUX and NVB → NN; NN → NVB
HMM performance (English)
Reported accuracies > 96%
About 85% for unknown words
Advantage: simple and most suitable when annotated data is available
Conclusion
Future work
Handling unseen words
Smoothing
Can we exploit other features?
Especially morphological ones
Multi-part words
Summary
Statistical approaches now include linguistic features for higher accuracies
Improvement required
Tagging: precision – 79.22%, unknown words – 41.6%
Chunking: precision – 60%, recall – 62%