
A Web Search Engine-Based Approach to Measure Semantic Similarity between Words
Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka
IEEE Trans. on Knowledge & Data Engineering, 23(7), 2011.
Presenter: Guan-Yu Chen
Outline
1. Introduction
2. Related Work
3. Method
4. Experiments
5. Conclusion
1. Introduction (1/5)
• Semantic Similarity
– Web mining: community extraction, relation detection, & entity disambiguation.
– Information retrieval: retrieving a set of documents that is semantically related to a given user query.
– Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.
1. Introduction (2/5)
• Web search engines
– Page count: the number of pages that contain the query words.
– Snippets: a brief window of text extracted by a search engine around the query term in a document.
1. Introduction (3/5)
• Page count
– In Google, the page count for “apple” AND “computer” is 288,000,000, while that for “banana” AND “computer” is 3,590,000.
– The much higher count suggests that apple is more semantically related to computer than banana is.
1. Introduction (4/5)
• Snippets
– For the query “Jaguar” AND “cat”, a snippet such as “Jaguar is the largest cat” yields the lexical pattern “X is the largest Y”.
1. Introduction (5/5)
• Web search engine (Google) + page counts + snippets → semantic similarity
2. Related Work (1/2)
• Normalized Google Distance (NGD) – Cilibrasi & Vitanyi, 2007.

NGD(P, Q) = \frac{\max\{\log H(P), \log H(Q)\} - \log H(P, Q)}{\log N - \min\{\log H(P), \log H(Q)\}}

P and Q: the two words;
NGD(P, Q): the distance between P and Q;
H(P), H(Q): the page counts for the words P and Q;
H(P, Q): the page count for the query “P AND Q”;
N: the number of documents indexed by the search engine.
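A minimal Python sketch of NGD computed from page counts follows; the count values in the demo call are hypothetical, and a real implementation would obtain H(P), H(Q), and H(P, Q) from a search engine.

```python
import math

def ngd(h_p, h_q, h_pq, n):
    """Normalized Google Distance (Cilibrasi & Vitanyi, 2007).

    h_p, h_q: page counts for the words P and Q.
    h_pq: page count for the query "P AND Q".
    n: number of documents indexed by the search engine.
    """
    numerator = max(math.log(h_p), math.log(h_q)) - math.log(h_pq)
    denominator = math.log(n) - min(math.log(h_p), math.log(h_q))
    return numerator / denominator

# Hypothetical page counts, for illustration only.
print(ngd(h_p=500_000_000, h_q=2_000_000_000, h_pq=288_000_000, n=10_000_000_000))
```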
2. Related Work (2/2)
• Co-occurrence Double-Checking (CODC) – Chen et al., 2006.

CODC(P, Q) = \begin{cases} 0 & \text{if } f(P@Q) = 0, \\ \exp\left( \log\left[ \frac{f(P@Q)}{H(P)} \times \frac{f(Q@P)}{H(Q)} \right]^{\alpha} \right) & \text{otherwise.} \end{cases}

f(P@Q): the number of occurrences of P in the top-ranking snippets for the query Q in Google;
H(P): the page count for query P;
α: a constant in this model, which is experimentally set to the value 0.15.
3. Method
1. Outline
2. Page Count-Based Co-Occurrence Measures
3. Lexical Pattern Extraction
4. Lexical Pattern Clustering
5. Measuring Semantic Similarity
6. Training
3.1 Outline
3.2 Page Count-Based Co-Occurrence Measures (1/2)

WebJaccard(P, Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c, \\ \frac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)} & \text{otherwise.} \end{cases}

WebOverlap(P, Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c, \\ \frac{H(P \cap Q)}{\min\{H(P), H(Q)\}} & \text{otherwise.} \end{cases}

• P∩Q denotes the conjunction query “P AND Q”.
• c: a page-count threshold that zeroes out scores for rare, likely accidental co-occurrences (set to 5 in the paper).
3.2 Page Count-Based Co-Occurrence Measures (2/2)

WebDice(P, Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c, \\ \frac{2 H(P \cap Q)}{H(P) + H(Q)} & \text{otherwise.} \end{cases}

WebPMI(P, Q) = \begin{cases} 0 & \text{if } H(P \cap Q) \le c, \\ \log_2 \left( \frac{H(P \cap Q)/N}{(H(P)/N)(H(Q)/N)} \right) & \text{otherwise.} \end{cases}

N: the number of documents indexed by the search engine.
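A minimal Python sketch of the four measures, assuming the page counts h_p, h_q, and h_pq have already been retrieved from a search engine; the index size N and the threshold C below are illustrative values.

```python
import math

N = 10_000_000_000  # assumed number of indexed documents
C = 5               # page-count threshold below which scores are zeroed

def web_jaccard(h_p, h_q, h_pq):
    return 0.0 if h_pq <= C else h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq):
    return 0.0 if h_pq <= C else h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq):
    return 0.0 if h_pq <= C else 2 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq):
    if h_pq <= C:
        return 0.0
    return math.log2((h_pq / N) / ((h_p / N) * (h_q / N)))
```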
3.3 Lexical Pattern Extraction (1/2)
Conditions (a minimal code sketch follows the list):
1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence is allowed to skip one or more words. However, we do not skip more than g words consecutively. Moreover, the total number of words skipped in a subsequence should not exceed G.
4. We expand all negation contractions in a context. For example, didn’t is expanded to did not. We do not skip the word not when generating subsequences. For example, this condition ensures that from the snippet X is not a Y, we do not produce the subsequence X is a Y.
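A minimal Python sketch of subsequence generation under these conditions. It assumes the snippet is already tokenized with the query words replaced by X and Y, and uses naive enumeration for clarity; a real implementation would enumerate more efficiently.

```python
from itertools import combinations

L_MAX, G_SKIP, G_TOTAL = 5, 2, 4   # L, g, G from the conditions above

def extract_patterns(tokens):
    """Naively enumerate skip-subsequences satisfying conditions 1-4."""
    patterns = set()
    for length in range(2, L_MAX + 1):
        for idx in combinations(range(len(tokens)), length):
            sub = [tokens[i] for i in idx]
            if sub.count("X") != 1 or sub.count("Y") != 1:
                continue  # condition 1: exactly one X and one Y
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if any(gap > G_SKIP for gap in gaps) or sum(gaps) > G_TOTAL:
                continue  # condition 3: bounded consecutive/total skips
            if any("not" in tokens[a + 1:b] for a, b in zip(idx, idx[1:])):
                continue  # condition 4: never skip the word "not"
            patterns.add(" ".join(sub))
    return patterns

# Toy snippet with the query words already replaced by X and Y.
print(extract_patterns("X , a large flightless Y".split()))
```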
3.3 Lexical Pattern Extraction (2/2)
Example patterns extracted from a snippet retrieved for the query “ostrich” AND “bird”:
• X, a large Y
• X a flightless Y
• X, large Y lives
3.4 Lexical Pattern Clustering (1/2)
Word-pair frequency: f(P_i, Q_i, a_j)
Total occurrence of a pattern:

\mu(a_j) = \sum_i f(P_i, Q_i, a_j)

a_j: a pattern in pattern vector a.
(P_i, Q_i): a word pair.
3.4 Lexical Pattern Clustering (2/2)
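The original slide presents the clustering procedure as a figure. Below is a minimal sketch of the greedy sequential clustering the paper describes: patterns are sorted by total occurrence μ(a), and each pattern either joins the most similar existing cluster (cosine similarity above a threshold θ) or starts a new one. Representing each pattern by its vector of word-pair frequencies is the assumption made here.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sequential_clustering(pattern_vectors, theta):
    """pattern_vectors: dict pattern -> np.ndarray of word-pair frequencies."""
    # Sort patterns by total occurrence mu(a), in descending order.
    order = sorted(pattern_vectors, key=lambda a: pattern_vectors[a].sum(),
                   reverse=True)
    clusters, centroids = [], []
    for a in order:
        v = pattern_vectors[a]
        sims = [cosine(v, c) for c in centroids]
        if sims and max(sims) > theta:
            j = int(np.argmax(sims))          # join the most similar cluster
            clusters[j].append(a)
            centroids[j] = centroids[j] + v   # update the cluster's vector sum
        else:
            clusters.append([a])              # start a new cluster
            centroids.append(v.copy())
    return clusters
```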
3.5 Measuring Semantic Similarity (1/5)
Weight of a pattern a_i in a cluster c_j:

w_{ij} = \frac{\mu(a_i)}{\sum_{t \in c_j} \mu(t)}

The jth feature for a word pair (P, Q):

f_j = \sum_{a_i \in c_j} w_{ij} f(P, Q, a_i)
3.5 Measuring Semantic Similarity (2/5)
Feature vector for a word pair (P, Q):

f_{PQ} = \left( f_1, \ldots, f_N, \mathrm{WebJaccard}_{PQ}, \mathrm{WebOverlap}_{PQ}, \mathrm{WebDice}_{PQ}, \mathrm{WebPMI}_{PQ} \right)^T
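A minimal sketch of assembling f_PQ, assuming clusters from the previous step, a table mu of total pattern occurrences, a hypothetical freq(p, q, a) lookup standing in for f(P, Q, a), and the four page count-based scores computed earlier.

```python
import numpy as np

def cluster_feature(cluster, mu, freq, p, q):
    """f_j = sum over a_i in c_j of w_ij * f(P, Q, a_i)."""
    total = sum(mu[a] for a in cluster)       # normalizer for w_ij
    return sum(mu[a] / total * freq(p, q, a) for a in cluster)

def feature_vector(clusters, mu, freq, page_count_scores, p, q):
    """Stack the N cluster features with the four page count-based scores."""
    feats = [cluster_feature(c, mu, freq, p, q) for c in clusters]
    return np.array(feats + page_count_scores)  # shape: (N + 4,)

# Toy demo with two clusters and a hypothetical frequency lookup.
mu = {"X is a Y": 10, "X is the Y": 6, "X and Y": 4}
clusters = [["X is a Y", "X is the Y"], ["X and Y"]]
freq = lambda p, q, a: {"X is a Y": 3, "X is the Y": 1, "X and Y": 2}.get(a, 0)
print(feature_vector(clusters, mu, freq, [0.2, 0.3, 0.25, 1.5], "apple", "fruit"))
```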
3.5 Measuring Semantic Similarity (3/5)
Train a two-class SVM (synonymous / nonsynonymous) on

S = \{(P_k, Q_k, y_k)\}, \quad y_k \in \{-1, 1\}

Semantic similarity for a new word pair (P^*, Q^*) with feature vector f^*:

sim(P^*, Q^*) = p(y = 1 \mid f^*)
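A minimal sketch with scikit-learn; the random data below merely stands in for the real feature vectors f_PQ and their ±1 labels. SVC(probability=True) fits a sigmoid over the SVM decision values, mirroring the probability model on the next two slides.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((40, 10))            # stand-in f_PQ feature vectors
y = np.where(X[:, 0] > 0.5, 1, -1)  # stand-in +/-1 labels

# probability=True fits a sigmoid over the decision values,
# mirroring p(y = 1 | d(f)) on the following slides.
svm = SVC(kernel="rbf", probability=True).fit(X, y)

def similarity(f_pq):
    """sim(P, Q) = p(y = 1 | f_PQ)."""
    proba = svm.predict_proba(f_pq.reshape(1, -1))[0]
    return proba[list(svm.classes_).index(1)]

print(similarity(rng.random(10)))
```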
3.5 Measuring Semantic Similarity (4/5)
Distance from the separating hyperplane:

d(f^*) = \sum_k y_k \alpha_k K(f_k, f^*) + b

b: the bias term of the hyperplane.
α_k: the Lagrange multiplier of support vector f_k.
f_k: a support vector.
K(f_k, f^*): the value of the kernel function.
f^*: the instance to classify.
3.5 Measuring Semantic Similarity (5/5)
The probability:

p(y = 1 \mid d(f)) = \frac{1}{1 + \exp(\lambda d(f) + \mu)}

Log likelihood (with p_k = p(y_k = 1 \mid f_k)):

L(\lambda, \mu) = \sum_{k=1}^{N} \log p(y_k \mid f_k, \lambda, \mu) = \sum_{k=1}^{N} \{ t_k \log p_k + (1 - t_k) \log(1 - p_k) \}

t_k = y_k (y_k + 1) / 2
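A minimal sketch of fitting λ and μ by gradient ascent on the log likelihood above, with synthetic decision values standing in for the real d(f_k); this sigmoid fit is essentially what Platt scaling (e.g., scikit-learn's probability=True) performs internally.

```python
import numpy as np

def fit_sigmoid(d, y, lr=0.05, steps=20000):
    """Gradient ascent on L(lambda, mu) for p = 1 / (1 + exp(lambda*d + mu))."""
    t = y * (y + 1) / 2                    # t_k = y_k (y_k + 1) / 2, in {0, 1}
    lam, mu = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(lam * d + mu))
        lam += lr * np.mean((p - t) * d)   # dL/d(lambda) ~ sum (p_k - t_k) d_k
        mu += lr * np.mean(p - t)          # dL/d(mu)     ~ sum (p_k - t_k)
    return lam, mu

# Synthetic decision values: positives on one side of the hyperplane.
d = np.array([1.2, 0.8, 2.0, -1.0, -0.5, -1.7])
y = np.array([1, 1, 1, -1, -1, -1])
print(fit_sigmoid(d, y))
```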
3.6 Training (1/5)
Table: Number of Patterns Extracted for Training Data.
• Synonymous pairs: (A, B), (C, D)
• Nonsynonymous pairs, formed by crossing words between synonymous pairs: (A, D), (C, B)
3.6 Training (2/5)
• L = 5, g = 2, G = 4, & T = 5 for the lexical pattern extraction conditions.
Figure: distribution of patterns extracted from synonymous word pairs.
3.6 Training (3/5)
Figure: average similarity versus clustering threshold θ.
3.6 Training (4/5)
The centroid vector of all feature vectors in the training set W:

f_W = \frac{1}{|W|} \sum_{(P, Q) \in W} f_{PQ}

The average Mahalanobis distance (a function of the clustering threshold θ, since the feature vectors depend on the clustering):

D(\theta) = \frac{1}{|W|} \sum_{(P, Q) \in W} \mathrm{Mahala}(f_W, f_{PQ})

\mathrm{Mahala}(f_W, f_{PQ}) = (f_W - f_{PQ})^T C^{-1} (f_W - f_{PQ})

|W|: the number of word pairs in W.
C^{-1}: the inverse of the intercluster correlation matrix.
3.6 Training (5/5)
Figure: distribution of patterns extracted from nonsynonymous word pairs.
The clustering threshold is chosen to minimize the average Mahalanobis distance:

\hat{\theta} = \arg\min_{\theta \in [0, 1]} D(\theta)
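A minimal sketch of this selection criterion: compute D for the feature matrix produced by a given θ and keep the θ that minimizes it. Using the feature-feature correlation matrix for C is an assumption here, and the random matrix in the demo merely stands in for real f_PQ vectors.

```python
import numpy as np

def avg_mahalanobis(F):
    """D = mean squared Mahalanobis distance of rows of F from centroid f_W."""
    f_w = F.mean(axis=0)
    c_inv = np.linalg.pinv(np.corrcoef(F, rowvar=False))  # assumed C^{-1}
    d = F - f_w
    return float(np.mean(np.einsum("ij,jk,ik->i", d, c_inv, d)))

# Demo with random vectors standing in for the f_PQ matrix; in the method,
# one matrix is built per candidate theta in [0, 1] and the theta with the
# smallest D is selected: theta_hat = argmin_theta D(theta).
rng = np.random.default_rng(0)
print(avg_mahalanobis(rng.random((50, 10))))
```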
4. Experiments
1. Benchmark Data Sets
2. Semantic Similarity
3. Community Mining
5. Conclusion (1/3)
1. A semantic similarity measure was proposed that uses both page counts and snippets retrieved from a web search engine for two words.
2. Four word co-occurrence measures were computed using page counts.
3. A lexical pattern extraction algorithm was presented to extract the numerous semantic relations that exist between two words.
5. Conclusion (2/3)
4. A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation.
5. Both page count-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.
6. A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.
5. Conclusion (3/3)
• Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings.
• The proposed method improved the F-score in a community mining task, underlining its usefulness in real-world tasks that involve named entities not adequately covered by manually created resources.
The End~
Thanks for your attention!!