
Searching Web Better
Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology
Outline
- Introduction
- Main Techniques (RSCF)
  - Clickthrough Data
  - Ranking Support Vector Machine Algorithm
  - Ranking SVM in Co-training Framework
- The RSCF-based Metasearch Engine
  - Search Engine Components
  - Feature Extraction
  - Experiments
- Current Development
Search Engine Adaptation
[Figure: general-purpose search engines (Google, MSNsearch, Wisenut, Overture, ...) serving queries from different domains: Computer Science (CS terms), Finance, Social Science, Product, News.]
Adapt the search engine by learning from implicit feedback: clickthrough data.
Clickthrough Data
- Clickthrough data: data that indicates which links in the returned ranking results have been clicked by users
- Formally, a triplet (q, r, c)
  - q: the input query
  - r: the ranking result presented to the user
  - c: the set of links the user clicked on
- Benefits:
  - Can be obtained in a timely manner
  - Requires no intervention in the user's search activity
An Example of Clickthrough Data
[Figure: a ranked result list l1, l2, ..., l10 returned for the user's input query; links l1, l7, and l10 are clicked by the user.]
Target Ranking (Preference Pairs Set)
- Arising from l1: empty set
- Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
- Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
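These pairs follow Joachims' heuristic: a clicked link is preferred over every unclicked link that was ranked above it. A minimal sketch of the extraction (function and variable names are illustrative, not from the talk):

```python
def preference_pairs(ranking, clicked):
    """Joachims' heuristic: a clicked link is preferred over
    every unclicked link that was ranked above it."""
    pairs = []
    for i, link in enumerate(ranking):
        if link not in clicked:
            continue
        # every unclicked link ranked above this clicked one
        for worse in ranking[:i]:
            if worse not in clicked:
                pairs.append((link, worse))  # link <r worse
    return pairs

# The slide's example: l1, l7, l10 clicked out of l1..l10
ranking = [f"l{i}" for i in range(1, 11)]
pairs = preference_pairs(ranking, {"l1", "l7", "l10"})
# l1 yields no pairs, l7 yields 5 pairs, l10 yields 7 pairs
```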
An Example of Clickthrough Data
[Figure: the same result list; the scanned links l1, ..., l10 form the labelled data set, and the links below them, l11, l12, ..., form the unlabelled data set.]
Target Ranking (Preference Pairs Set)
- Arising from l1: empty set
- Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
- Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
The Ranking SVM Algorithm
1. Three links l1, l2, l3, each described by a feature vector
   Target ranking: l1 <r' l2 <r' l3
2. [Figure: the weight vector (the ranker) projects the links onto l1', l2', l3'; the margin is the distance between the two closest projected links.]
Cons: it needs a large set of labelled data.
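Ranking SVM is commonly trained by reducing each preference li <r' lj to a classification constraint on the feature difference, w·(x_i − x_j) > 0. A minimal sketch of that reduction using scikit-learn's LinearSVC (a stand-in, not the talk's implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(pairs):
    """pairs: list of (x_preferred, x_other) feature-vector tuples.
    Each preference yields two symmetric difference examples."""
    X, y = [], []
    for xp, xo in pairs:
        X.append(xp - xo); y.append(1)   # preferred minus other
        X.append(xo - xp); y.append(-1)  # and the mirror image
    clf = LinearSVC()
    clf.fit(np.array(X), np.array(y))
    return clf.coef_[0]  # the weight vector w, i.e. the ranker

def rank(w, links):
    # a higher projection w.x means the link is ranked earlier
    return sorted(links, key=lambda x: -np.dot(w, x))
```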
The Ranking SVM in Co-training Framework
- Divide the feature vector into two subvectors
- Two rankers are built over these two feature subvectors
  - Training: ranker a_A and ranker a_B are trained on the labelled preference feedback pairs P_l
- Selecting confident pairs: each ranker chooses several unlabelled preference pairs from P_u and adds them to the labelled data set (the augmented pairs)
- Rebuild each ranker from the augmented labelled data set
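A compact sketch of that loop, reusing train_ranker from the sketch above; the feature split, the confidence measure, and the number of pairs added per round are illustrative assumptions:

```python
import numpy as np

def co_train(P_l, P_u, split, rounds=5, n_add=5):
    """P_l: labelled (preferred, other) feature-vector pairs.
    P_u: unlabelled candidate pairs with a tentative orientation.
    split(x) -> (x_A, x_B) divides a feature vector into two subvectors."""
    view = lambda pairs, i: [(split(p)[i], split(o)[i]) for p, o in pairs]
    for _ in range(rounds):
        w_A = train_ranker(view(P_l, 0))  # ranker a_A over subvector A
        w_B = train_ranker(view(P_l, 1))  # ranker a_B over subvector B
        # each ranker picks the unlabelled pairs it orders with the
        # largest margin; those confident pairs augment the labelled set
        for i, w in ((0, w_A), (1, w_B)):
            margin = lambda pr: np.dot(w, split(pr[0])[i] - split(pr[1])[i])
            P_u.sort(key=margin, reverse=True)
            P_l, P_u = P_l + P_u[:n_add], P_u[n_add:]
        if not P_u:
            break
    return train_ranker(P_l)  # final ranker on the augmented labelled set
```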
Some Issues
- Guideline for partitioning the feature vector
  - After the partition, each subvector must be sufficient for the subsequent ranking
- Number of rankers
  - Depends on the number of features
- When to terminate the procedure?
  - Prediction difference: indicates the ranking difference between the two rankers (see the sketch below)
  - After termination, train a final ranker on the augmented labelled data set
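One natural reading of "prediction difference" is the fraction of pairs that the two rankers order differently; a minimal sketch under that assumption:

```python
import numpy as np

def prediction_difference(w_A, w_B, pairs, split):
    """Fraction of pairs the two rankers order differently.
    Co-training can stop once this difference stops shrinking."""
    disagree = 0
    for p, o in pairs:
        order_A = np.dot(w_A, split(p)[0] - split(o)[0]) > 0
        order_B = np.dot(w_B, split(p)[1] - split(o)[1]) > 0
        disagree += order_A != order_B
    return disagree / len(pairs)
```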
Metasearch Engine
- Receives a query from the user
- Sends the query to multiple search engines
- Combines the retrieved results from the underlying search engines
- Presents a unified ranking result to the user
[Figure: the user's query flows to the Metasearch Engine, which forwards it to Search Engine 1 through Search Engine n, collects Retrieved Results 1 through n, and returns the Unified Ranking Result.]
Search Engine Components
- MSNsearch: powered by Inktomi; relatively mature, and one of the most powerful search engines nowadays
- Wisenut: a new but growing search engine
- Overture: ranks links based on the prices paid by sponsors for the links
Feature Extraction
- Ranking features (12 binary features)
  - Rank(E,T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10}
    (M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a)
    (URL, Title, Abstract Cover, Abstract Group)
  - Indicate the similarity between the query and the link
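The four similarity measures are only named on this slide; as an illustration of where they plug in, here is a crude term-overlap stand-in (the real Sim_U, Sim_T, Sim_C, and Sim_G are defined in the papers listed at the end; the link fields below are assumed structure):

```python
def term_overlap(query, text):
    """Fraction of query terms occurring in the text: a crude
    stand-in for the talk's similarity measures."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

def similarity_features(query, link):
    """link: dict with 'url', 'title', 'abstract' fields (assumed).
    In the real system the cover and group measures over the
    abstract would differ; here one stand-in serves for both."""
    return [
        term_overlap(query, link["url"]),       # stands in for Sim_U(q, l)
        term_overlap(query, link["title"]),     # stands in for Sim_T(q, t)
        term_overlap(query, link["abstract"]),  # stands in for Sim_C(q, a)
        term_overlap(query, link["abstract"]),  # stands in for Sim_G(q, a)
    ]
```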
Experiments
- Experiment data: within the same domain (computer science)
- Objectives:
  - Offline experiments: compared with RSVM
  - Online experiments: compared with Google
Prediction Error
- Prediction error: the difference between the ranker's ranking and the target ranking
- Example:
  - Target ranking: l1 <r' l2, l1 <r' l3, l2 <r' l3
  - Projected ranking: l2 <r' l1, l1 <r' l3, l2 <r' l3
  - Prediction error = 1/3 ≈ 33% (one of the three pairs is mispredicted)
[Figure: the projections l1', l2', l3' that yield the projected ranking.]
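A minimal sketch of the measure for a linear ranker (names are illustrative):

```python
import numpy as np

def prediction_error(w, target_pairs, features):
    """Fraction of target preference pairs (a <r' b) that the
    ranker's projection w.x puts in the wrong order."""
    wrong = sum(np.dot(w, features[a]) <= np.dot(w, features[b])
                for a, b in target_pairs)
    return wrong / len(target_pairs)

# The slide's example: pair (l1, l2) is reversed and the other two
# hold, so prediction_error(...) would return 1/3.
```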
Offline Experiment (Compared with RSVM)
[Figure: prediction error over co-training iterations for training sets of 10, 30, and 60 queries, with three curves R, A, B.]
- R: the ranker trained by the RSVM algorithm on the whole feature vector
- A: the ranker trained by the RSCF algorithm on one feature subvector
- B: the ranker trained by the RSCF algorithm on the other feature subvector
- Prediction error eventually rises again, so the appropriate number of iterations in the RSCF algorithm is about four to five.
Offline Experiment (Compared with RSVM)
Overall comparison
[Figure: overall prediction error of the two rankers.]
- R: the ranker trained by the RSVM algorithm
- C: the final ranker trained by the RSCF algorithm
Online Experiment (Compared with Google)
- Experiment data: CS terms
  - e.g. radix sort, TREC collection, ...
- Experiment setup
  - Combine the results returned by RSCF and those returned by Google into one shuffled list
  - Present it to the users in a unified way
  - Record the users' clicks

Cases     More clicks on RSCF   More clicks on Google   Tie   No clicks   Total
Queries   26                    17                      13    2           58
Experimental Analysis

Features      Weight     Features      Weight
Rank(M,1)     0.1914     Rank(W,1)     0.0184
Rank(M,3)     0.2498     Rank(W,3)     0.1014
Rank(M,5)     0.1152     Rank(W,5)    -0.3021
Rank(M,10)    0.2498     Rank(W,10)   -0.4367
Rank(O,1)    -0.1673     Sim_U(q,l)    0.5382
Rank(O,3)    -0.1229     Sim_T(q,t)    0.4928
Rank(O,5)    -0.4976     Sim_C(q,a)    0.4136
Rank(O,10)    0.4441     Sim_G(q,a)    0.5010
Conclusion on RSCF
- Search engine adaptation
  - The RSCF algorithm
    - Trains on clickthrough data
    - Applies RSVM in the co-training framework
  - The RSCF-based metasearch engine
    - Offline experiments: better than RSVM
    - Online experiments: better than Google
Current Development
- Feature extraction and division
- Application in different domains
- Search engine personalization
- SpyNoby Project: a personalized search engine with clickthrough analysis
Modified Target Ranking for Metasearch Engines
- If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
- Advantages:
  - Alleviates the penalty on high-ranked links
  - Gives more credit to the ranking ability of the underlying search engines
Modified Target Ranking
- Arising from l1: l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
- Arising from l7: l7 <r l2, l7 <r l3, l7 <r l4, l7 <r l5, l7 <r l6
- Arising from l10: l10 <r l2, l10 <r l3, l10 <r l4, l10 <r l5, l10 <r l6, l10 <r l8, l10 <r l9
- Labelled data set: l1, l2, ..., l10
- Unlabelled data set: l11, l12, ...
RSCF-based Metasearch Engine - MEA
[Figure: MEA forwards the user's query q to each of its underlying search engines, retrieves the top 30 results from each, and combines them into a unified ranking result.]
RSCF-based Metasearch Engine - MEB
[Figure: MEB forwards the user's query q to each of its underlying search engines, retrieves the top 30 results from each, and combines them into a unified ranking result.]
Generating Clickthrough Data
- Probability of a link being clicked: Pr(k) = 1 / (k^α · H_n^(α))
  - k: the ranking of the link in the metasearch engine
  - n: the number of all the links in the metasearch engine
  - α: the skewness parameter in Zipf's law
  - Harmonic number: H_n^(α) = Σ_{i=1}^{n} 1/i^α
- Judge each link's relevance manually:
  - If the link is irrelevant, it is not clicked
  - If the link is relevant, it is clicked with probability Pr(k)
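A minimal sketch of this click simulation; the relevance judgments are assumed given, and the skewness value and random source are illustrative:

```python
import random

def click_probability(k, n, alpha=1.0):
    """Zipf probability that the link at rank k (of n) is clicked:
    Pr(k) = 1 / (k^alpha * H_n^(alpha))."""
    H = sum(1 / i ** alpha for i in range(1, n + 1))  # harmonic number
    return 1 / (k ** alpha * H)

def simulate_clicks(relevant, alpha=1.0):
    """relevant: manual relevance judgments, index 0 = rank 1.
    Irrelevant links are never clicked; relevant ones are clicked
    with probability Pr(k)."""
    n = len(relevant)
    return [rel and random.random() < click_probability(k, n, alpha)
            for k, rel in enumerate(relevant, start=1)]
```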
Feature Extraction
- Ranking features (4 × 8 = 32 binary features)
  - Rank(E,T): whether the link is ranked within S_T in E, where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
    S1 = {1}, S3 = {2, 3}, S5 = {4, 5}, S10 = {6, 7, 8, 9, 10}, ...
    (G: Google, M: MSNsearch, W: Wisenut, O: Overture)
  - Indicate the ranking of the links in each underlying search engine
- Similarity features (4 features)
  - Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a)
  - Measure the similarity between the query and the link
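A minimal sketch of the bucketed Rank(E,T) features for a single engine, following the S_T sets above (function and constant names are illustrative):

```python
# bucket upper bounds: S1={1}, S3={2,3}, S5={4,5}, S10={6..10}, ...
THRESHOLDS = [1, 3, 5, 10, 15, 20, 25, 30]

def rank_features(position):
    """Binary Rank(E,T) features for one engine: exactly one feature
    fires, for the bucket S_T containing the link's position (all
    zeros if the engine did not return the link in its top 30)."""
    feats, lower = [], 0
    for t in THRESHOLDS:
        feats.append(1 if position is not None and lower < position <= t else 0)
        lower = t
    return feats

# e.g. rank_features(7) fires only the T=10 bucket, S10 = {6,...,10}
```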
Experiments
- Experiment data: three different domains
  - CS terms
  - News
  - E-shopping
- Objectives:
  - Prediction error: better than RSVM
  - Top-k precision: adaptation ability
Top-k Precision
- Advantages:
  - Precision is easier to obtain than recall
  - Users care mainly about the top-k links (k = 10)
- Evaluation data: 30 queries in each domain
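A minimal sketch of the measure (names are illustrative):

```python
def top_k_precision(ranking, relevant, k=10):
    """Fraction of the top-k returned links judged relevant;
    unlike recall, it needs no count of all relevant links."""
    top = ranking[:k]
    return sum(link in relevant for link in top) / len(top)
```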
Comparison of Top-k Precision
[Figure: top-k precision curves for the News, CS terms, and E-shopping domains.]
Statistical Analysis
Hypothesis testing (two-sample hypothesis testing about means): used to analyze whether there is a statistically significant difference between the means of two samples.

Comparison between Engines   CS terms   News   E-Shopping   Combined
MEA VS Google                ≈          ≈      >            >
MEA VS MSNsearch             ≈          ≈      >            >
MEA VS Overture              >          >      ≈            >
MEA VS Wisenut               >          ≈      >            >
MEB VS Google                ≈          ≈      >            >
MEB VS MSNsearch             >          ≈      ≈            >
MEB VS Overture              >          >      ≈            >
MEB VS Wisenut               >          ≈      >            >
MEA VS MEB                   ≈          ≈      ≈            ≈
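A sketch of the two-sample test behind the table, using SciPy's ttest_ind; the 0.05 significance level is an assumption, since the slide does not state one:

```python
from scipy import stats

def compare(engine_a, engine_b, alpha=0.05):
    """engine_a, engine_b: per-query top-10 precision samples.
    Returns '>' if engine_a's mean is significantly higher,
    '≈' if the difference is not statistically significant."""
    t, p = stats.ttest_ind(engine_a, engine_b)
    if p < alpha and t > 0:
        return ">"
    return "≈"
```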
Comparison Results
- MEA can produce better search quality than Google
- Google does not excel in every query category
- MEA and MEB are able to adapt to bring out the strengths of each underlying search engine
- MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category
- The RSCF-based metasearch engine
  - Comparison of prediction error: better than RSVM
  - Comparison of top-k precision: adaptation ability
Spy Naïve Bayes – Motivation
- The problem of Joachims' method
  - Strong assumptions
  - Excessively penalizes high-ranked links: in the preference pairs shown earlier, all of the form lp <r ln, the high-ranked l1, l2, l3 are apt to appear on the right, while the clicked l7 and l10 appear on the left
- New interpretation of clickthrough data
  - Clicked: positive (P)
  - Unclicked: unlabelled (U), containing both positive and negative samples
- Goal: identify Reliable Negatives (RN) from U
Spy Naïve Bayes: Ideas
- Standard naïve Bayes: classifies positive and negative samples
- One-step spy naïve Bayes: spying out RN from U
  - Put a small number of positive samples into U to act as "spies" (to scout the behavior of real positive samples in U)
  - Take U as negative samples to train a naïve Bayes classifier
  - Samples with lower probabilities of being positive than the spies are assigned to RN
- Voting procedure: makes spying more robust (see the sketch below)
  - Run one-step SpyNB n times and obtain n sets RNi
  - A sample that appears in at least m (m ≤ n) of the sets RNi appears in the final RN
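A compact sketch of one-step SpyNB and the voting procedure, using scikit-learn's MultinomialNB; the classifier choice, the spy ratio, and the spy-minimum threshold rule are my reading of the slide rather than the talk's exact implementation (P and U are non-negative feature matrices, one row per sample):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def one_step_spynb(P, U, spy_ratio=0.1, rng=np.random):
    """Returns indices into U judged to be reliable negatives."""
    spies = rng.choice(len(P), max(1, int(spy_ratio * len(P))), replace=False)
    keep = np.setdiff1d(np.arange(len(P)), spies)
    # train with the spies hidden inside U, and U treated as negative
    X = np.vstack([P[keep], P[spies], U])
    y = np.array([1] * len(keep) + [0] * (len(spies) + len(U)))
    clf = MultinomialNB().fit(X, y)
    probs = clf.predict_proba(X)[:, 1]  # P(positive | sample)
    threshold = probs[len(keep):len(keep) + len(spies)].min()
    u_probs = probs[len(keep) + len(spies):]
    # U samples that look less positive than any spy are reliable negatives
    return set(np.where(u_probs < threshold)[0])

def spynb(P, U, n=10, m=8):
    """Voting: a U sample is a final reliable negative if it shows up
    in at least m of the n runs' RN sets."""
    votes = np.zeros(len(U))
    for _ in range(n):
        for idx in one_step_spynb(P, U):
            votes[idx] += 1
    return np.where(votes >= m)[0]
```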
http://dleecpu1.cs.ust.hk:8080/SpyNoby/
My publications
1. Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. Information Processing & Management (an international journal), 43(1), pages 290-292, (2007).
2. Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
3. Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005).
4. Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language for Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pages 567-581, (2004).
5. Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
6. Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pages 519-532, (2004).
7. Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003).
8. Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A book chapter in "Semantic Issues in E-Commerce Systems", edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003).