Learning to Cluster Web Search Results.

Hua-Jun Zeng
Qi-Cai He
Zheng Chen
Wei-Ying Ma
Jinwen Ma
1
Contents
•Motivation
•Algorithm
-Salient phrase extraction
-Learning to rank salient phrases
•Experiments
•Conclusion
2
Motivation
Organizing Web search results into
clusters lets users browse through
search results quickly.
Traditional clustering techniques don’t
generate clusters with highly readable
names, so salient phrases are used as cluster names.
3
Motivation(cont’d)
4
Problem formalization and algorithm
The algorithm is composed of four steps:
1. Search result fetching
2. Document parsing and phrase property calculation
3. Salient phrase ranking
4. Post-processing
5
Salient Phrase Extraction
Phrase Frequency/Inverted Document Frequency (TFIDF)
Phrase Length
Intra-Cluster Similarity
Cluster Entropy
Phrase Independence
6
TFIDF
w=current phrase
D(w)=the set of documents that contains w
N=the number of total documents
f(w)=frequency calculation
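The formula on this slide did not survive extraction. The definitions above fit the usual phrase-level form TFIDF(w) = f(w) * log(N / |D(w)|), which is assumed here. A minimal Python sketch under that assumption, where each document is a token list and f(w) is taken to be the raw occurrence count (the helper names and sample documents are illustrative only):

```python
import math

def count_phrase(tokens, phrase):
    """Count occurrences of the token tuple `phrase` in a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == phrase)

def tfidf(phrase, docs):
    """Assumed form: TFIDF(w) = f(w) * log(N / |D(w)|)."""
    counts = [count_phrase(doc, phrase) for doc in docs]
    f_w = sum(counts)                      # f(w): assumed to be the raw count
    d_w = sum(1 for c in counts if c > 0)  # |D(w)|: documents containing w
    if d_w == 0:
        return 0.0                         # phrase never occurs
    return f_w * math.log(len(docs) / d_w)

# Made-up search result snippets, already tokenized.
docs = [
    "jaguar cars for sale".split(),
    "jaguar cars dealer".split(),
    "big cats in the wild".split(),
    "jaguar habitat".split(),
]
score = tfidf(("jaguar", "cars"), docs)
```

Here the phrase occurs twice, in 2 of 4 documents, so the score is 2 * log(2).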
7
Phrase Length
Generally, a longer name is preferred for users’ browsing
8
Intra-Cluster Similarity(ICS)
•First, convert documents into vectors.
•For each candidate cluster, we then calculate its centroid as the mean of its document vectors.
•ICS is calculated as the average cosine similarity between the
documents and the centroid.
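The three steps above can be sketched in plain Python. How documents are turned into vectors is left open here; the vectors below are made-up term weights, and the function names are illustrative assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of the document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ics(vectors):
    """Average cosine similarity between each document and the centroid."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)

# Made-up term-weight vectors for two candidate clusters.
identical = [[1.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
tight = ics(identical)   # documents coincide with the centroid
loose = ics(orthogonal)  # documents share no terms
```

A cluster of identical documents scores 1, while two documents with no shared terms score about 0.707, so a higher ICS indicates a more compact candidate cluster.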
9
Cluster Entropy
10
Phrase Independence
A phrase is independent when the entropy of its context is
high (i.e., the left and right contexts are random enough). We
use IND to measure the independence of phrases.
11
The IND value for the right context can be
calculated similarly.
The final IND value is the average of
the left and right values.
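The slide's formula is missing, so the sketch below assumes the standard entropy of the context-word distribution: for each side, sum -p(t) * log p(t) over the words t that occur immediately next to the phrase. Function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def context_entropy(context_words):
    """Entropy of the empirical distribution of adjacent words."""
    counts = Counter(context_words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def independence(docs, phrase):
    """IND = average of the left-context and right-context entropies."""
    left, right = [], []
    n = len(phrase)
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                if i > 0:
                    left.append(tokens[i - 1])   # word just before the phrase
                if i + n < len(tokens):
                    right.append(tokens[i + n])  # word just after the phrase
    return (context_entropy(left) + context_entropy(right)) / 2

# Made-up tokenized snippets.
docs = [
    "visit new york city".split(),
    "the new york times".split(),
    "in new york state".split(),
]
ind_full = independence(docs, ("new", "york"))
ind_partial = independence(docs, ("york",))
```

In this toy data "york" is always preceded by "new", so its left-context entropy is zero and its IND is half that of the complete phrase "new york".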
12
Learning to rank salient phrases.
Regression tries to determine the relationship between two
random variables x = (x1, x2, ..., xp) and y.
Here x = (TFIDF, LEN, ICS, CE, IND).
Linear Regression
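A linear regression model has the form y = b0 + b1*x1 + ... + bp*xp + e, with the coefficients chosen to minimize squared error. As a rough illustration only, the following pure-Python sketch fits such a model by solving the normal equations. The toy data uses two made-up features rather than the five phrase properties, and `fit_linear` is a hypothetical helper, not the tooling used in the experiments.

```python
def fit_linear(X, y):
    """Ordinary least squares; returns [b0, b1, ..., bp].

    Assumes the design matrix has full column rank.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Normal equations (A^T A) b = A^T y as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, -2, 0, 2]
coeffs = fit_linear(X, y)
```

Because the toy targets are noise-free, the fit recovers the generating coefficients 1, 2 and -3 up to rounding.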
13
Experiments
For each query, 200 documents returned by search
engines are used.
All n-grams with n<=3 are extracted from the
documents.
Use SVM_Light(2) to do support vector
regression.
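The n-gram extraction step above can be sketched as follows. This is a minimal version that assumes whitespace-tokenized input and ignores sentence boundaries, stemming and stop words; `extract_ngrams` is an illustrative name.

```python
def extract_ngrams(tokens, max_n=3):
    """Return every contiguous n-gram with 1 <= n <= max_n as a tuple."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

grams = extract_ngrams("learning to cluster results".split())
```

For a four-token input this yields 4 unigrams, 3 bigrams and 2 trigrams, each of which becomes a candidate phrase.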
14
Experiments(cont’d)
Evaluation Measure
-Precision (P) at the top N results is used to measure
performance.
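The slide's formula is not preserved. Precision at N is conventionally the fraction of the top N returned items that are correct, P@N = |C ∩ R| / |R|, where R is the top N ranked phrases and C is the set of phrases judged correct. A short sketch under that conventional definition, with made-up phrases and labels:

```python
def precision_at_n(ranked, correct, n):
    """Fraction of the top-n ranked phrases that appear in `correct`."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for phrase in top if phrase in correct) / len(top)

# Hypothetical system ranking and human labels for one query.
ranked = ["jaguar cars", "jaguar", "big cats", "mac os", "the"]
correct = {"jaguar cars", "big cats", "mac os"}
p3 = precision_at_n(ranked, correct, 3)
p5 = precision_at_n(ranked, correct, 5)
```

With these labels two of the top three phrases are correct, and three of the top five.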
15
Experiments(cont’d)
Training Data Collection
16
Experimental Results
Property Comparison.
17
Experimental Results (cont’d)
Learning Methods Comparison
18
Experimental Results(cont’d)
Learning Methods Comparison.
-The coefficients of one of the linear
regression models are as follows:
y = -0.427 + 0.146 × TFIDF
+ 0.241 × LEN
- 0.022 × ICS
+ 0.065 × CE
+ 0.266 × IND
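To show how such a model would be applied, the sketch below plugs the coefficients listed above into a scoring function and sorts candidate phrases by the result. The feature values are hypothetical and chosen only for illustration; any normalization of the real features is not specified here.

```python
# Coefficients from the linear regression model listed above.
INTERCEPT = -0.427
COEFFS = {"TFIDF": 0.146, "LEN": 0.241, "ICS": -0.022, "CE": 0.065, "IND": 0.266}

def salience(features):
    """Salience score y for one candidate phrase."""
    return INTERCEPT + sum(COEFFS[name] * features[name] for name in COEFFS)

# Hypothetical feature values for two candidate phrases.
candidates = {
    "jaguar cars": {"TFIDF": 1.0, "LEN": 2, "ICS": 0.5, "CE": 1.0, "IND": 1.0},
    "the": {"TFIDF": 0.0, "LEN": 1, "ICS": 0.1, "CE": 2.0, "IND": 0.5},
}
ranked = sorted(candidates, key=lambda p: salience(candidates[p]), reverse=True)
```

The top-ranked phrases would then serve as candidate cluster names.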
19
Experimental Results
Input Document Number
20
Experimental Results(cont’d)
Coverage and Overlap
21
Conclusion
The search result clustering problem is
reformalized as a supervised salient phrase
ranking problem.
Several properties and regression models
are proposed to calculate the salience scores
of candidate phrases.
22