An Automatic Construction of Arabic Similarity Thesaurus


Abdulaziz Al-Qabbany, AbdulMalik Al-Salman, Abdulrahman Almuhareb
CITALA 2009

Introduction

Thesauruses

Similarity Thesaurus

Proposed Improvement

The Experiment

Evaluation

Discussion

Conclusions and Future Work

Thesaurus importance

Effective Information Retrieval systems

Vocabulary mismatch problem

Query Expansion

Arabic thesauruses

Manual construction drawbacks:
 cost
 time
 subjectivity

Automatic construction approaches



Qiu and Frei (1993) presented a query expansion model based on a similarity thesaurus.
Zazo et al. (2005) used the same approach to construct a Spanish similarity thesaurus.
Queries are expanded based on similarity to the query concept as a whole rather than similarity to the individual query terms.


Using a similarity thesaurus is analogous to translating from one language to another.
Example: the word قرص (disc) in different contexts:
قرص الشمس (the sun's disc)
قرص الدواء (medicine tablet)
قرص ضوئي (optical disc)



The similarity thesaurus is a matrix that represents term-to-term similarities.
Each term is represented by a vector that captures its relation to each document.
The matrix is generated by calculating the similarities between term vectors.
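
As a rough illustration of this construction, the sketch below builds one vector per term over a toy document collection and fills a term-by-term matrix with cosine similarities. The raw-count weighting, the cosine measure, the toy documents, and the names `build_term_vectors`, `cosine`, and `similarity_matrix` are all assumptions made for illustration; the slides do not specify the weighting scheme that was used.

```python
import math
from collections import Counter

def build_term_vectors(docs):
    """Map each term to a vector of its raw counts in each document."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    return {term: [c[term] for c in counts] for term in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(term_vectors):
    """Return {term: {other_term: similarity}} for every pair of terms."""
    return {
        t1: {t2: cosine(v1, v2) for t2, v2 in term_vectors.items()}
        for t1, v1 in term_vectors.items()
    }

# Toy collection (hypothetical data, not from the experiment).
docs = [
    "president paris france",
    "president france election",
    "disc sun light",
]
thesaurus = similarity_matrix(build_term_vectors(docs))
```

Terms that occur in the same documents ("president" and "france") end up with a similarity near 1, while terms that never co-occur ("president" and "sun") get 0.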

The similarity between the query q and any term t is computed as the sum of the similarity values between each query term and t.
$$\mathrm{SIM\_QT}(q, t) = \sum_{t_i \in q} \mathrm{sim}(t_i, t)$$
In response to any query, the terms can be ranked in descending order of their SIM_QT values.
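
The ranking step can be sketched as follows. This is a minimal illustration of the "SUM" scoring as the slide states it, not the authors' implementation; the function names `sim_qt_sum` and `rank_terms`, the nested-dictionary layout, and every similarity value are hypothetical.

```python
def sim_qt_sum(query_terms, candidate, sim):
    """SUM method: add up sim(t_i, candidate) over all query terms t_i."""
    return sum(sim[t_i][candidate] for t_i in query_terms)

def rank_terms(query_terms, sim, score=sim_qt_sum):
    """Rank all non-query terms by descending score."""
    candidates = {t for row in sim.values() for t in row} - set(query_terms)
    return sorted(candidates, key=lambda t: score(query_terms, t, sim), reverse=True)

# Hypothetical similarity values for a two-term query.
sim = {
    "jacques": {"french": 0.40, "paris": 0.20, "straw": 0.45},
    "chirac":  {"french": 0.41, "paris": 0.25, "straw": 0.03},
}
ranking = rank_terms(["jacques", "chirac"], sim)
```

With these made-up values, "straw" outranks "paris" under SUM even though it is similar to only one of the two query terms.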



The “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent, that is, within the same range.
When the similarity values are inconsistent, the differences between them are not reflected in the total sum.
Similarity values are considered inconsistent when they contain outliers.

An outlier is a value that is considerably dissimilar to, or inconsistent with, the majority of the data.
[Figure: plot with X and Y axes showing one point labeled as an outlier]



A given term should have a high similarity value with each individual term in the query in order to be considered related.
The dispersion of the similarity values is one of the factors that need to be considered in query expansion.
The total similarity value should remain the main factor in query expansion.

Instead of using the sum of the similarity values, we use the mean of the values minus the standard error of the mean (SE).
$$\mathrm{SIM\_QT}(q, t) = \mathrm{MEAN}_{t_i \in q}\big(\mathrm{sim}(t_i, t)\big) - \mathrm{SE}$$

The standard error of the mean is a measure of data dispersion.
$$\mathrm{SE} = \frac{\sigma}{\sqrt{n}}$$
where σ is the standard deviation and n is the number of values.
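
To make the effect of the penalty concrete, here is a small sketch of the "MEAN" score. It assumes the sample standard deviation (`statistics.stdev`), since the slides do not say whether the sample or population form was used, and it assumes no penalty when only one value is available. The function name `sim_qt_mean` and both lists of similarity values are hypothetical.

```python
import math
import statistics

def sim_qt_mean(values):
    """MEAN method: mean of the similarity values minus the standard error."""
    n = len(values)
    mean = statistics.mean(values)
    if n < 2:
        return mean  # assumption: a single value carries no dispersion penalty
    se = statistics.stdev(values) / math.sqrt(n)  # assumption: sample std dev
    return mean - se

# Hypothetical similarities of two candidate terms with a two-term query.
consistent = [0.24, 0.24]  # similar to both query terms
dispersed = [0.45, 0.03]   # similar to only one query term

score_consistent = sim_qt_mean(consistent)
score_dispersed = sim_qt_mean(dispersed)
```

Both candidates have the same sum (0.48), so the "SUM" method cannot tell them apart, whereas the mean minus SE keeps the consistent candidate at 0.24 and pushes the dispersed one down to about 0.03.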



We used the Agence France-Presse (AFP) Arabic news from 2004, 2005, and 2006 as the document collection.
This collection is part of the LDC Arabic Gigaword corpus (Third Edition).
After examining the high-frequency terms in the collection, we chose 150 stop words.
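
One way to support this kind of manual selection is to list the most frequent terms as candidates and then filter the chosen stop words out of each document. The sketch below is only an illustration under that assumption: the helper names `stop_word_candidates` and `remove_stop_words`, the whitespace tokenisation, and the English toy documents are hypothetical stand-ins, not the procedure used in the experiment.

```python
from collections import Counter

def stop_word_candidates(docs, top_n):
    """Return the top_n most frequent terms as candidates for manual review."""
    freq = Counter(term for doc in docs for term in doc.split())
    return [term for term, _ in freq.most_common(top_n)]

def remove_stop_words(doc, stop_words):
    """Drop stop words from a whitespace-tokenised document."""
    return [term for term in doc.split() if term not in stop_words]

# Toy collection (hypothetical placeholder text, not the AFP corpus).
docs = ["the president of france", "the city of paris", "the sun"]
candidates = stop_word_candidates(docs, top_n=2)
filtered = remove_stop_words(docs[0], set(candidates))
```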

Characteristic                         | Number
Number of Documents                    | 208,596
Number of Terms                        | 435,846
Total Number of Term Occurrences       | 30,415,222
Average Number of Words per Document   | 69.78
Number of Processed Terms              | 248,311



The objective of the evaluation was to assess the relevance strength of the produced terms.
The evaluation was applied to both the “SUM” and “MEAN” methods.
We selected twenty common topics belonging to five different domains.



For each topic, the top ten related terms were presented to five expert evaluators.
Each evaluator was asked to study the twenty topics carefully and then specify whether the produced terms were relevant.
Levels of relevance:
 Relevant
 Somewhat Relevant
 Irrelevant

The relevance strength of the standard “SUM” method was 95.0%, while that of the “MEAN” method was 98.1%.
[Figure: bar chart of relevance strength (0% to 100%) for the SUM method and the MEAN method, by evaluator (Evaluator 1 to Evaluator 5)]



We believe the main reason the “MEAN” method performs better is its ability to detect and exclude outliers.
Adding a single term to the query may completely change the concept of the query.
A candidate related term should have consistent similarities with all of the query terms.

The response to a query about the former French president “جاك شيراك” (Jacques Chirac):

SUM Most Related Terms | Value | MEAN Most Related Terms | Value
الفرنسي (French) | 0.814 | الفرنسي (French) | 0.383
الاليزيه (Élysée) | 0.630 | الاليزيه (Élysée) | 0.283
فرنسا (France) | 0.556 | فرنسا (France) | 0.266
الرئيس (the president) | 0.503 | الرئيس (the president) | 0.241
سترو (Straw) | 0.482 | باريس (Paris) | 0.214
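
Because the query has two terms, each SUM value in the table is twice the mean similarity under the definitions given in the slides, so the standard error implied by the table can be recovered as SUM / 2 minus the MEAN value. The sketch below only performs this arithmetic on the listed figures. It assumes that SUM is the plain, unweighted sum of the two similarities, and the English dictionary keys are glosses introduced here for readability.

```python
# (SUM value, MEAN value) as listed in the table, keyed by English gloss.
table = {
    "French": (0.814, 0.383),
    "Elysee": (0.630, 0.283),
    "France": (0.556, 0.266),
    "president": (0.503, 0.241),
}
n = 2  # assumption: the query is treated as two separate terms

# SUM / n is the mean similarity; subtracting the MEAN-method value leaves SE.
implied_se = {term: round(s / n - m, 4) for term, (s, m) in table.items()}

# "Straw" has SUM 0.482 but is absent from the MEAN top five, so its
# MEAN-method score is no higher than the fifth-ranked 0.214 and its SE
# must be at least this bound.
straw_se_lower_bound = round(0.482 / n - 0.214, 4)
```

Under these assumptions the four shared terms carry small penalties (roughly 0.01 to 0.03), while the term that drops out of the MEAN list would need a penalty of at least about 0.027, which is consistent with its similarity values being more dispersed.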

[Figure: bar chart (y-axis 0.0 to 0.5) of each related term's similarity with جاك and with شيراك, for the terms سترو, باريس, الرئيس, فرنسا, الاليزيه, and الفرنسي]



The relevance strength of the standard “SUM” method was 95.0%, while that of the “MEAN” method was 98.1%.
The “MEAN” method shows a relative improvement of about 3.3% (3.1 percentage points) over the “SUM” method.
We conclude that the “MEAN” method is more accurate, mainly because it can detect and exclude outliers.

Applying word stemming.

Producing collocations.

Constructing a single word-category thesaurus.

Using similarity thesaurus in question answering.