Persian POS Tagging
Hadi Amiri
Database Research Group (DBRG)
ECE Department, University of Tehran
7 November 2006
University of Tehran
Outline
• What is POS tagging
• How is data tagged for POS?
• Tagged Corpora
• POS Tagging Approaches
• Corpus Training
• How to Evaluate a Tagger?
• Bijankhan Corpus
• Memory Based POS
• MLE Based POS
• Neural Network POS Tagger
What is POS tagging
Annotating each word in a given sentence with its part of speech (grammatical type).
e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN
Properties:
• It helps parsing
• It resolves pronunciation ambiguities:
As the water grew colder, their hands grew number. (number = ADJ, not N)
• It resolves semantic ambiguities:
Patients can bear pain.
POS Application
Part-of-speech (POS) tagging is important for many applications:
• Word sense disambiguation
• Parsing
• Language modeling
• Q&A and information extraction
• Text-to-speech
Tagging techniques can be used for a variety of tasks:
• Semantic tagging
• Dialogue tagging
• Information retrieval, …
POS Tags
Open class:
• N (noun): baby, toy
• V (verb): see, kiss
• ADJ (adjective): tall, grateful, alleged
• ADV (adverb): quickly, frankly, ...
Closed class:
• P (preposition): in, on, near
• DET (determiner): the, a, that
• WhPron (wh-pronoun): who, what, which, …
• COORD (coordinator): and, or
POS Tags
• There is no standard set of POS tags
 Some use coarse classes: e.g., N, V, A, Aux, …
 Others prefer finer distinctions (e.g., Penn Treebank):
• PRP: personal pronouns (you, me, she, he, them, him, …)
• PRP$: possessive pronouns (my, our, her, his, …)
• NN: singular common nouns (sky, door, theorem, …)
• NNS: plural common nouns (doors, theorems, women, …)
• NNP: singular proper names (Fifi, IBM, Canada, …)
• NNPS: plural proper names (Americas, Carolinas, …)
How is data tagged for POS?
• We are trying to model human performance.
• So we have humans tag a corpus and try to match their performance.
To create a model:
 A corpus is hand-tagged for POS by more than one annotator
 Then checked for reliability
History
• 1960s: Brown Corpus created (EN-US, 1 million words); Greene and Rubin, rule-based tagging, ~70%
• 1970s: Brown Corpus tagged; LOB Corpus created (EN-UK, 1 million words); HMM tagging (CLAWS), 93%-95%
• 1980s: LOB Corpus tagged; DeRose/Church, efficient HMM with sparse data, 95%+; POS tagging separated from other NLP
• 1990s: Penn Treebank Corpus (WSJ, 4.5M); transformation-based tagging (Eric Brill), rule-based, 95%+; tree-based statistics (Helmut Schmid), 96%+; trigram tagger (Kempe), 96%+; neural network taggers, 96%+
• 2000s: British National Corpus (tagged by CLAWS); combined methods, 98%+
Tagged Corpora

Corpus              # Tags   # Tokens
Brown               87       1 million
British Natl        61       100 million
Penn Treebank       45       4.8 million
Original Bijankhan  550      ?
Bijankhan           40       2.6 million
POS Tagging Approaches
POS tagging approaches divide into supervised and unsupervised, and each family can be rule-based, stochastic, or neural:
• Supervised: rule-based, stochastic, neural
• Unsupervised: rule-based, stochastic, neural
Rule-Based POS Tagger
Example: He was that drunk.
Lexicon with tags identified for each word; for "that":
 that: ADV, PRON DEM SG, DET CENTRAL DEM SG, CS
Constraints to eliminate tags:
If
 the next word is an adjective, adverb, or quantifier,
 and the following token is a sentence boundary,
 and the previous word is not a consider-type verb,
Then
 eliminate non-ADV tags
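The constraint above can be sketched as a small function. This is a minimal illustration, not the tagger's actual rule engine; the lexicon, tag names, and the consider-type verb list are hypothetical.

```python
# Sketch of the "that -> ADV" constraint rule described above.
# Lexicon, tag names, and the consider-verb list are hypothetical.

def apply_that_rule(tokens, i, lexicon):
    """Keep only the ADV reading of 'that' when the next word can be an
    adjective/adverb/quantifier and ends the sentence, and the previous
    word is not a consider-type verb."""
    candidates = set(lexicon[tokens[i]])            # all readings from the lexicon
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    sentence_final = (i + 2 >= len(tokens))         # next word is sentence-final
    prev_is_consider_verb = i > 0 and tokens[i - 1] in {"consider", "believe"}
    if (nxt is not None
            and any(t in {"ADJ", "ADV", "QUANT"} for t in lexicon[nxt])
            and sentence_final
            and not prev_is_consider_verb):
        candidates = {t for t in candidates if t == "ADV"}
    return candidates

# Toy lexicon for "He was that drunk."
lexicon = {"He": ["PRON"], "was": ["V"],
           "that": ["ADV", "PRON_DEM", "DET_DEM", "CS"],
           "drunk": ["ADJ", "N"]}
tokens = ["He", "was", "that", "drunk"]
print(apply_that_rule(tokens, 2, lexicon))   # -> {'ADV'}
```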
Probabilistic POS Tagging
• Provides the possibility of automatic training
rather than painstaking rule revision.
• Automatic training means that a tagger can be
easily adapted to new text domains.
E.g.
A moving/VBG house
A moving/JJ ceremony
Probabilistic POS Tagging
• Needs large tagged corpus for training
• Unigram statistics (the most common part-of-speech for each word) get us to about 90% accuracy
• For greater accuracy, we need some
information on adjacent words
Corpus Training
• The probabilities in a statistical model come
from the corpus it is trained on.
• If the corpus is too domain-specific, the model
may not be portable to other domains.
• If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities
Tagger Evaluation
• Once a tagging model has been built, how is it tested?
 Typically, a corpus is split into a training set (usually ~90%
of the data) and a test set (10%).
 The test set is held out from the training.
 The tagger learns the tag sequences that maximize the
probabilities for that model.
 The tagger is tested on the test set.
• Tagger is not trained on test data.
• But test data is highly similar to training data.
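The held-out split described above can be sketched in a few lines. The corpus here is a toy list of tagged sentences; the 90/10 ratio is the one from the slide.

```python
# Minimal sketch of the 90/10 train/test split described above (toy data).

def split_corpus(sentences, train_frac=0.9):
    """Hold out the final portion of the corpus as the test set."""
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]

# 100 one-word "sentences", each a list of (word, tag) pairs.
corpus = [[("word%d" % i, "N_SING")] for i in range(100)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # -> 90 10
```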
Current Performance
• How many tags are correct?
 About 98% currently
 But baseline is already 90%
 Baseline algorithm:
• Tag every word with its most frequent tag
• Tag unknown words as nouns
• How well do people do?
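The baseline algorithm above can be sketched as follows. The training data and the noun tag "N" for unknown words are toy illustrations of the rule stated on the slide.

```python
from collections import Counter, defaultdict

# Sketch of the baseline: tag each word with its most frequent training
# tag, and tag unknown words as nouns. Toy data.

def train_baseline(tagged_corpus):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, most_frequent, unknown_tag="N"):
    """Unknown words fall back to the noun tag."""
    return [(w, most_frequent.get(w, unknown_tag)) for w in words]

train_data = [("the", "DET"), ("can", "N"), ("can", "V"), ("can", "V")]
model = train_baseline(train_data)
print(baseline_tag(["the", "can", "frobnicate"], model))
# -> [('the', 'DET'), ('can', 'V'), ('frobnicate', 'N')]
```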
Memory Based Part Of Speech
Tagging Experiments With
Persian Text
Corpus Study
• At first the corpus had 550 tags.
• The content is gathered from daily news and common texts.
• Each document is assigned a subject such as political, cultural, and so on.
 In total, there are 4300 different subjects.
 This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.
• In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.
Selecting Suitable Tags
• First, the frequency of each tag was gathered.
• Then many of the tags were grouped together and a smaller tag set was produced.
• Each tag in the tag set is placed in a hierarchical structure.
 As an example, consider the tag “N_PL_LOC”:
  N stands for a noun
  PL describes the plurality of the tag
  LOC defines the tag as about locations
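The hierarchical structure above can be read mechanically by splitting a tag on "_": the first field is the coarse class and the remaining fields refine it. A minimal sketch, using the tag from the slide:

```python
# Sketch of reading a hierarchical Bijankhan-style tag such as "N_PL_LOC":
# the first field is the coarse class, the rest are refining features.

def tag_levels(tag):
    parts = tag.split("_")
    return {"coarse": parts[0], "features": parts[1:]}

print(tag_levels("N_PL_LOC"))
# -> {'coarse': 'N', 'features': ['PL', 'LOC']}
```

Reading tags this way is what makes the "more than 1 level POS tags" idea in the future-work slide straightforward: a tagger can back off from the full tag to its coarse class.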
The Tags Distribution
Max, Min, AVG, Total # of Tags in the Training Set
Number of Different Tags
For instance, the word “آسمان”, which means “the sky” in English, is always tagged "N_SING" in the whole corpus; but a word like “بالا”, which means “high” or “above”, has been tagged with several tags ("ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", and "PRO").
Classifying the Rare Words
Distribution of tags in the corpus:
N_SING 38%, P 12%, ETC 12%, DELM 10%, ADJ_SIM 9%, CON 8%, N_PL 6%, V_PA 3%, PRO 2%
The tags whose number of occurrences is below 5000 in the corpus are gathered into the “ETC” group.
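The grouping rule above can be sketched directly from tag counts. The counts below are invented for illustration (N_VOC and ADJ_CMPR stand in for arbitrary rare tags); only the 5000 threshold comes from the slide.

```python
from collections import Counter

# Sketch of the rare-tag grouping: tags occurring fewer than 5000 times
# are collapsed into the "ETC" class. Counts are toy values.

def collapse_rare(tag_counts, threshold=5000):
    out = Counter()
    for tag, n in tag_counts.items():
        out["ETC" if n < threshold else tag] += n
    return out

counts = Counter({"N_SING": 100000, "P": 30000, "N_VOC": 120, "ADJ_CMPR": 900})
print(collapse_rare(counts))
# -> Counter({'N_SING': 100000, 'P': 30000, 'ETC': 1020})
```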
Bijankhan Corpus
Implemented Methods
• Memory Based POS Tagger
• MLE Based POS Tagger
• Neural Network POS Tagger
Memory-Based POS Tagging
• Memory-based POS tagging is also called lazy learning, example-based learning, or case-based learning.
• MBT uses some specifications of each word, such as its possible tags and a fixed-width context, as features.
• We used MBT, a tool for memory-based tagger generation and tagging (available at: http://ilk.uvt.nl/mbt/).
Memory-Based POS Tagging
The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
 a lexicon, associating words to tags as evident in the training corpus
 a case base for known words (words occurring in the lexicon)
 a case base for unknown words.
Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.
Memory-Based POS Tagging
After different experiments, we chose “ddfa” as the feature set for known words: the appropriate tag for each known word is chosen based on the disambiguated tags of the two words before it and the possible tags of the word after it.
 d: disambiguated tag of a preceding word
 f: the focus (current) word
 a: the ambiguous word after the current word
Pattern: d d f a
Memory-Based POS Tagging
The feature set chosen for unknown words is “dFass”:
 d: the disambiguated tag of the word before the current word
 F: the position of the focus (current) word; it is not included in the actual feature set for tagging
 a: the ambiguous tags of the word after the current word
 ss: the last two suffix letters of the current word
Pattern: d F a ss
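The “ddfa” pattern for known words can be sketched as a feature extractor. This is an illustration of the pattern only, not MBT's actual feature encoding; the words, tags, and lexicon below are toy data.

```python
# Sketch of extracting "ddfa" features for a known word at position i:
# disambiguated tags of the two preceding words (d d), the focus word (f),
# and the possible (ambiguous) tags of the following word (a). Toy data;
# MBT's real encoding differs in detail.

def ddfa_features(words, assigned_tags, i, lexicon):
    d2 = assigned_tags[i - 2] if i >= 2 else "_"      # tag two words back
    d1 = assigned_tags[i - 1] if i >= 1 else "_"      # tag one word back
    f = words[i]                                      # focus word
    if i + 1 < len(words):                            # ambiguous tags of next word
        a = "|".join(sorted(lexicon.get(words[i + 1], ["?"])))
    else:
        a = "_"
    return (d2, d1, f, a)

words = ["he", "can", "fish"]
tags = ["PRON", None, None]          # tags assigned so far, left to right
lexicon = {"fish": ["N", "V"]}
print(ddfa_features(words, tags, 1, lexicon))
# -> ('_', 'PRON', 'can', 'N|V')
```

Note how the preceding context uses already-disambiguated tags while the following word contributes only its ambiguous tag set, exactly as the tagger proceeds left to right.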
MBT Results- Known Words, “ddfa”
MBT Results- Unknown Words, “dFass”
MBT Results- Overall
Implemented Methods
• Memory Based POS Tagger
• MLE Based POS Tagger
• Neural Network POS Tagger
Maximum Likelihood Estimation
As a benchmark of POS tagging accuracy, we chose the Maximum Likelihood Estimation (MLE) approach:
 Calculate the maximum likelihood probability for each tag assigned to any word in the training set.
 Choose the tag with the greatest maximum likelihood probability (the designated tag) for each word and make it the only tag assignable to that word.
• To evaluate this method, we analyze the words in the test set and assign the designated tags to them.
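The MLE computation above can be sketched as follows: estimate P(tag | word) from training counts and designate the argmax tag. The counts are toy values mirroring the kind of per-word table shown on the following slide.

```python
from collections import Counter, defaultdict

# Sketch of the MLE step: per-word tag probabilities from training counts,
# then the argmax (designated) tag. Toy counts.

def mle_table(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    table = {}
    for word, tags in counts.items():
        total = sum(tags.values())
        table[word] = {t: n / total for t, n in tags.items()}
    return table

# A word seen 5 times as ADJ_SIM and once as ADV_NI.
corpus = [("w", "ADJ_SIM")] * 5 + [("w", "ADV_NI")]
table = mle_table(corpus)
designated = max(table["w"], key=table["w"].get)
print(round(table["w"]["ADJ_SIM"], 4), designated)
# -> 0.8333 ADJ_SIM
```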
Maximum Likelihood Estimation

Occurrence  Word        Tag       MLE
1           پدرانه      ADV_NI    0.1667
5           پدرانه      ADJ_SIM   0.8333
4           پديدار      ADJ_SIM   0.1538
22          پديدار      N_SING    0.8462
1           پذيرفته     N_SING    0.0096
3           پذيرفته     ADJ_SIM   0.0288
6           پذيرفته     V_PA      0.0577
94          پذيرفته     ADJ_INO   0.9038
2           پراكنده اند  V_PRE     0.5000
2           پراكنده اند  V_PA      0.5000
MLE Results-Known Words
MLE Results- Unknown Words, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.
MLE Results- Overall, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.
MLE Results- Unknown Words, “N_SING”
For each unknown word we assign the “N_SING” tag.
MLE Results- Overall, “N_SING”
For each unknown word we assign the “N_SING” tag, the most frequently assigned tag.
Comparison With Other Languages
Implemented Methods
• Memory Based POS Tagger
• MLE Based POS Tagger
• Neural Network POS Tagger
Neural Network
[Network diagram: input blocks for the preceding words, the current word, and the following words; each unit corresponds to one of the tags in the tag set.]
Neural Network
• For each POS tag pos_j and each of the p+1+f words in the context, there is an input unit whose activation in_{i,j} represents the probability that word_i has POS pos_j.
• For the currently tagged word and the following words, the input activations come from the lexical probabilities P(pos_j | word_i); for the preceding words, the activation values come from the tagger's own earlier output decisions.
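The input representation above can be sketched as vector construction: one block of |tagset| units per context word, each unit holding P(tag | word). This simplified sketch uses lexicon probabilities for every position, whereas in the scheme described on the slide the preceding-word blocks would carry the tagger's earlier outputs; the tag set and probabilities are toy values.

```python
# Sketch of the NN input vector: for p preceding, 1 current, and f following
# words, one block of units per word, one unit per tag, holding P(tag | word).
# Toy tag set and lexicon; out-of-range positions get zero blocks.

TAGS = ["N", "V", "ADJ"]

def input_block(word, lexical_probs):
    """One |TAGS|-dimensional block of activations for a single word."""
    probs = lexical_probs.get(word, {})
    return [probs.get(t, 0.0) for t in TAGS]

def network_input(words, i, lexical_probs, p=1, f=1):
    """Concatenate the blocks for positions i-p .. i+f."""
    blocks = []
    for j in range(i - p, i + f + 1):
        w = words[j] if 0 <= j < len(words) else None   # pad at boundaries
        blocks.extend(input_block(w, lexical_probs))
    return blocks

lex = {"fish": {"N": 0.7, "V": 0.3}, "can": {"V": 0.9, "N": 0.1}}
print(network_input(["can", "fish"], 1, lex))
# -> [0.1, 0.9, 0.0, 0.7, 0.3, 0.0, 0.0, 0.0, 0.0]
```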
Neural Network Results on Bijankhan Corpus

Training Algorithm        No. of Hidden Layers  No. of Inputs (Train)  Training Duration (Hour)  No. of Inputs (Test)  Accuracy
MLP                       2                     1 mil                  120:00:8                  1000                  Too low
MLP                       3                     1 mil                  ?                         1000                  Too low
Generalized Feed Forward  1                     1 mil                  95:30:57                  1000                  Too low
Generalized Feed Forward  2                     1 mil                  ?                         1000                  Too low
Generalized Feed Forward  2                     20000                  1:53:35                   1000                  58%
Neural Network on Other Languages: English
Neural Network on Other Languages: Chinese
Future Work
• Using more than one level of POS tags.
• Unsupervised POS tagging using the Hamshahri Collection
• Investigating other methods for Persian POS tagging, such as Support Vector Machine (SVM) based tagging
• KASRE YE EZAFE in Persian!
Thank You
Questions?