Comparing TR-Classifier and kNN by using Reduced Sizes of Vocabularies Mourad Abbas

Citala 2009
Topic Identification: Definition

Topic identification: what does it mean?
It aims to assign a topic label to a flow of textual data.
T.I. applications

 Document categorization,

 Machine translation,

 Selecting documents for web search engines,

 Speech recognition systems, etc.
Speech Recognition

According to the Bayes formula, P(W|X) is defined as:

P(W|X) = P(X|W) · P(W) / P(X)
 P(X|W), the probability of observing the sequence of acoustic vectors X when the sequence of words W is uttered, is given by an acoustic model.

 P(W), the probability of the word sequence W in the language, is given by a language model:
P(W) = ∏_{i=1}^{n} P(w_i | w_{i−n+1}, …, w_{i−1})
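The n-gram product above can be sketched for the bigram case (n = 2). This is a generic maximum-likelihood estimate on a toy corpus, not the paper's actual language model; the `<s>` start symbol and the absence of smoothing are assumptions of the sketch.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from a toy corpus
    by maximum likelihood (no smoothing -- a sketch, not production code)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return lambda prev, cur: bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(lm, sentence):
    """P(W) as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= lm(prev, cur)
    return p

lm = train_bigram_lm(["the game starts", "the game ends"])
print(sentence_prob(lm, "the game starts"))  # 1.0 * 1.0 * 0.5 = 0.5
```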
Description of the recognition process
Speech → Parametrization → X = {x1, …, xT} → Searching: arg max_W P(X|W) · P(W) → sequence of recognized words.
The search combines the acoustic model, which supplies P(X|W), and the language model, which supplies P(W).
Speech Recognition
 Statistical language models are essential for large-vocabulary speech recognition: they estimate, from a training corpus, the a priori probability P(W) of emitting a sequence of words W.

 Nevertheless, in many cases the language model is not able to make the correct choice.

 That is why language model adaptation is needed.
Language model adaptation
 One language model adaptation method consists in dividing the training documents into classes.

 Each class represents a subset of the language that groups the documents sharing the same characteristics. In our case these subsets are known as topics (e.g. Culture, Religion, Politics).

 This makes it possible to construct, from these topics, language models that describe the characteristics of each topic.
The aim is then to:
- find out the topic of the recognized uttered sentences;
- use the model derived from the detected topic.
Building the vocabulary
 Starting from the training corpus, the vocabulary is built. The vocabulary should be representative of the corpus.

 Using the vocabulary, each document is represented; if a word of the vocabulary does not occur in the document, the attributed value is zero.

 To construct the vocabulary, several methods can be used:
- Term Frequency;
- Document Frequency;
- Mutual Information;
- Transition Point Technique.

 We have used Term Frequency because it is simple and leads to good results.

 Words whose frequency does not exceed 3 are discarded.

 Non-content words are also discarded: they bring no information about the meaning of the text. For example, the Arabic sentence "وأن هذه الاجتماعات لن تمنع من عقد المجلس الوطني" (9 words) reduces to "اجتماعات تمنع عقد مجلس وطني" (5 words).
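The vocabulary-building steps above can be sketched as follows. Only the frequency threshold of 3 comes from the slides; the stop-word list and whitespace tokenization are illustrative assumptions.

```python
from collections import Counter

# Illustrative non-content words; the paper's actual stop list is not given.
STOP_WORDS = {"the", "of", "and", "to", "in"}

def build_vocabulary(documents, min_freq=3):
    """Build a vocabulary from a training corpus: count term frequencies,
    then discard non-content (stop) words and words whose frequency does
    not exceed min_freq. A sketch of the Term Frequency method."""
    counts = Counter(w for doc in documents for w in doc.split())
    return sorted(w for w, c in counts.items()
                  if c > min_freq and w not in STOP_WORDS)
```

For example, `build_vocabulary(["art art art art the the the the", "art cinema"])` keeps only `"art"`: `"the"` is a stop word and `"cinema"` occurs too rarely.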
 One Arabic word can be equivalent to four English words, as in the following example:

Arabic | English
و | and
ب | by
علاقات | relations
ها | her
Fig 3. Illustrative example: the bag-of-words method.
Role of the vocabulary in representation
 Each document d = {w1, w2, …, wn} is represented by a vector V = (f1, f2, …, f|V|) with fk = TF(wk, d) · IDF(wk), where |V| is the size of the vocabulary.

 The components are real values; we put 0 in the component where the corresponding vocabulary word cannot be found in the document, e.g. V = (f1, 0, f3, f4, 0, …, f|V|).
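The representation above can be sketched as follows. The slides do not spell out the IDF formula, so the common IDF = log(N / df) definition is assumed here.

```python
import math

def tfidf_vector(doc_words, corpus_docs, vocabulary):
    """Represent a document over the vocabulary: f_k = TF(w_k, d) * IDF(w_k),
    and 0 when the word does not occur in the document.
    A sketch assuming IDF = log(N / df); the paper may use another variant."""
    n_docs = len(corpus_docs)
    vec = []
    for w in vocabulary:
        tf = doc_words.count(w)                      # term frequency in d
        df = sum(1 for d in corpus_docs if w in d)   # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

For instance, a word occurring in every training document gets IDF = 0 and therefore contributes nothing, while an absent word yields a 0 component, as in the vector shown above.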
kNN

To identify a topic-unknown document d, kNN ranks the
neighbors of d among the training document vectors, and uses
the topics of the k Nearest Neighbors to predict the topic of the
test document d.
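A minimal sketch of the kNN step described above. The slides do not name the similarity measure, so cosine similarity between document vectors is assumed, with a simple majority vote over the k nearest neighbours.

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_topic(test_vec, training, k=3):
    """Rank the training documents by similarity to the test document and
    predict the majority topic among the k nearest neighbours.
    `training` is a list of (vector, topic) pairs -- a sketch; cosine
    similarity is an assumption, not stated in the slides."""
    neighbours = sorted(training, key=lambda vt: cosine(test_vec, vt[0]),
                        reverse=True)[:k]
    votes = Counter(topic for _, topic in neighbours)
    return votes.most_common(1)[0][0]
```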
TR-Classifier
 The triggers of a word wk are the set of words that have a high degree of correlation with it.

 The main idea of the TR-Classifier is to compute the average mutual information (AMI) of each pair of words belonging to the vocabulary Vi.

 The pairs of words, or "triggers", considered important for the topic identification task are those with the highest AMI values.

 Each topic Ti is then endowed with a number M of selected triggers, computed from the training corpus of that topic.
TR-Classifier

The AMI of two words a and b is given by:

AMI(a, b) = P(a, b) · log [P(a, b) / (P(a) P(b))]
          + P(a, b̄) · log [P(a, b̄) / (P(a) P(b̄))]
          + P(ā, b) · log [P(ā, b) / (P(ā) P(b))]
          + P(ā, b̄) · log [P(ā, b̄) / (P(ā) P(b̄))]

AMI measures the association between the two words. The probabilities are estimated from the following document counts: the number of documents in which a and b are found together, the number of documents containing a without b, the number of documents containing b without a, and the number of documents in which neither a nor b is found.
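The AMI sum above can be computed directly from the four document counts it mentions. A minimal sketch, with the usual convention that 0 · log(0) = 0:

```python
import math

def ami(df_a, df_b, df_ab, n_docs):
    """Average mutual information of words a and b, estimated from document
    counts: df_a documents contain a, df_b contain b, df_ab contain both,
    out of n_docs documents. The four terms mirror the AMI sum above."""
    def term(p_xy, p_x, p_y):
        # 0 * log(0) is taken as 0 by convention.
        return p_xy * math.log(p_xy / (p_x * p_y)) if p_xy > 0 else 0.0

    p_a, p_b = df_a / n_docs, df_b / n_docs
    p_ab = df_ab / n_docs                                # a and b together
    p_a_nb = (df_a - df_ab) / n_docs                     # a without b
    p_na_b = (df_b - df_ab) / n_docs                     # b without a
    p_na_nb = (n_docs - df_a - df_b + df_ab) / n_docs    # neither a nor b
    return (term(p_ab, p_a, p_b)
            + term(p_a_nb, p_a, 1 - p_b)
            + term(p_na_b, 1 - p_a, p_b)
            + term(p_na_nb, 1 - p_a, 1 - p_b))
```

As a sanity check, two words that always co-occur give AMI = log 2 when each appears in half the documents, while statistically independent words give AMI = 0.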
TR-Classifier
Identifying topics with the TR-method consists in:
• giving the corresponding triggers for each word wk ∈ Vi, where Vi is the vocabulary of topic Ti;
• selecting the best M triggers that characterize the topic Ti;
• in the test step, extracting for each word wk of the test document its corresponding triggers;
• computing the Qi values using the TR-distance given by the equation:

Qi = [ Σ_k AMI(wk, wik) ] / [ Σ_{l=0}^{n−1} (n − l) ]
 Here i stands for the i-th topic, and the denominator normalizes the AMI computation.

 The wik are the triggers that are included in the test document d and characterize the topic Ti.

 The decision to label the test document with topic Ti is obtained by choosing arg max_i Qi.

 The TR-Classifier uses topic vocabularies composed of words ranked by frequency, from the maximum to the minimum.
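The Qi computation and the arg max decision can be sketched as follows. This simplifies the scheme: each topic is given a set of trigger words with precomputed AMI scores (both hypothetical inputs here), whereas the paper works with selected trigger pairs per topic.

```python
def tr_distance(test_words, topic_triggers, ami_scores):
    """Q_i for one topic: sum of the AMI of the topic's triggers found in
    the test document, normalised by sum_{l=0}^{n-1} (n - l) = n(n+1)/2,
    where n is the number of test words. `topic_triggers` and `ami_scores`
    are assumed precomputed from the topic's training corpus."""
    n = len(test_words)
    norm = sum(n - l for l in range(n))   # = n(n+1)/2
    total = sum(ami_scores[w] for w in test_words
                if w in topic_triggers and w in ami_scores)
    return total / norm if norm else 0.0

def identify_topic(test_words, topics):
    """Label the document with arg max_i Q_i; `topics` maps a topic name
    to its (triggers, ami_scores) pair."""
    return max(topics, key=lambda t: tr_distance(test_words, *topics[t]))
```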
The ten best triggers that characterize the topic Culture:

Arabic | English
ثقافة - ملتقى | Culture - Meeting
شاعر - قصيدة | Poet - Poem
قصة - رواية | Novel - Story
شخصية - مسلسل | Personage - Serial
جمهور - أفلام | Public - Movies
معرض - تشكيلي | Exposition - Plastic
فنان - مسلسل | Artist - Serial
تشكيلي - لوحة | Plastic - Painting
مسرح - فرقة | Theater - Group
سينما - أفلام | Cinema - Movies
Evaluation of the methods
 For a topic Tn, the method is evaluated using the following measures:

 Recall: R = number of documents correctly labelled Tn / total number of documents belonging to topic Tn.

 Precision: P = number of documents correctly labelled Tn / number of documents labelled Tn by the method.

 The combination of R and P gives F1, which measures how effectively documents are correctly labelled:

F1 = 2RP / (R + P)
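The three measures above can be computed per topic from parallel lists of gold and predicted labels, as in this short sketch:

```python
def precision_recall_f1(true_topics, predicted_topics, topic):
    """Per-topic precision, recall and F1 = 2RP / (R + P), as defined above.
    Inputs are parallel lists of gold and predicted topic labels."""
    correct = sum(1 for t, p in zip(true_topics, predicted_topics)
                  if t == topic and p == topic)
    n_true = sum(1 for t in true_topics if t == topic)           # belong to topic
    n_pred = sum(1 for p in predicted_topics if p == topic)      # labelled topic
    recall = correct / n_true if n_true else 0.0
    precision = correct / n_pred if n_pred else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1
```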
Experiments and results
corpus gathering
The software WinHTTrack allowed us to collect many web pages; we only have to fill in the address of the source.
Corpus source

 The source of the used corpus is the Arabic newspaper Alwatan (Sultanate of Oman).
Size of the corpus
Topic | Number of words
Culture | 1,359,210
Religion | 3,122,565
International | 855,945
Economy | 1,460,462
Local | 1,555,635
Sports | 1,423,549
Total | 9,813,366

Source: Alwatan newspaper.
TR-Classifier Performances
Topic | Recall (%) | Precision (%) | F1 (%)
Culture | 82.66 | 80.55 | 81.59
Religion | 96.33 | 83.56 | 89.49
Economy | 83.50 | 84.05 | 83.77
Local | 86.25 | 82.53 | 84.35
International | 93.33 | 90.66 | 91.97
Sports | 96.00 | 97.33 | 96.66
Total | 89.67 | 86.44 | 88.02
Recall versus number of triggers, for a vocabulary size of 300: the maximal value R = 89.67 % is reached with 250 triggers.
kNN Performances
Topic | Recall (%) | Precision (%)
Culture | 76.00 | 49.78
Religion | 75.33 | 94.95
Economy | 68.66 | 81.74
Local | 69.33 | 70.27
International | 80.00 | 85.11
Sports | 84.66 | 92.70
Average | 75.66 | 70.09
TR versus kNN
Conclusion
 The experiments were carried out on an Arabic corpus.

 The strong point of the TR-Classifier is its ability to achieve better performance than kNN while using reduced sizes of topic vocabularies.

 The reason is the significance of the information present in the longer-distance history that the TR-Classifier exploits.

 Despite the small corpus used (800 words), the performance of kNN is relatively acceptable (about 76 % in terms of recall).

 As future work, we aim to enhance the TR-Classifier performance by using larger vocabularies, even though it already outperforms kNN by 14 %.