Propagating Trust and Distrust to Demote Web Spam


LEHIGH
UNIVERSITY
22 May 2006
Wu, Goel and Davison
Models of Trust for the Web (MTW)
WWW2006 Workshop
Introduction: Web Search
 Web search – the access to the Web for hundreds of millions of people
 Hundreds of millions of queries per day
 Queries + people = TRAFFIC
 A HUGE incentive for web site owners to rank highly in search engine results
 Communicate some message (advertising, political statement)
 Install viruses, adware, etc.
Search engines: Google, Yahoo!, MSN Search, Ask, A9, Exalead, Gigablast + metasearch + many more!
Introduction: Web Spam
 a.k.a. search engine spam, spamdexing
 Any technique to manipulate search engine
results
 Target page gets an undeservedly higher ranking
 Many methods
 Link farms, keyword stuffing, cloaking, link
bombs, and more
 The target of much of our work!
Propagating Trust and Distrust
to Demote Web Spam
Baoning Wu,
Vinay Goel, and
Brian D. Davison
Computer Science & Engineering
Lehigh University
Bethlehem, PA USA
Outline
 Background and motivation
 Proposed methods
 Experimental results
Background: PageRank
 (Page and Brin, 1998)
 Uses number and status of “parents” to
determine status of child
 r(i+1) = (1-α) * T * r(i) + α * s
 r: PageRank score vector (with N nodes)
 T: transition matrix (NxN)
 (1-α): decay factor; α: jump probability
 s: uniform distribution of 1/N
 PageRank score generates a ranking of
importance of node
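The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `pagerank`, the toy graph, and the fixed iteration count are assumptions, and T is assumed to give each of a node's outgoing links an equal share (the standard PageRank choice). Every node is assumed to have at least one outgoing link.

```python
def pagerank(out_links, alpha=0.15, iterations=50):
    """Iterate r(i+1) = (1-alpha) * T * r(i) + alpha * s on a small graph.

    out_links maps each node to the list of nodes it links to.
    Assumption: every node has at least one outgoing link.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    s = {v: 1.0 / n for v in nodes}  # uniform jump distribution, 1/N each
    r = dict(s)                      # start from the uniform vector
    for _ in range(iterations):
        nxt = {v: alpha * s[v] for v in nodes}
        for parent, children in out_links.items():
            # T: each child receives an equal share of the parent's score
            share = (1 - alpha) * r[parent] / len(children)
            for child in children:
                nxt[child] += share
        r = nxt
    return r


# Hypothetical three-node graph: a -> b, a -> c, b -> c, c -> a
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(example_graph)
```

In this toy graph, node c has two parents and ends up with a higher score than b, which has one.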
Background: TrustRank
 (Gyongyi and Garcia-Molina, VLDB 2004)
 Uses number and trust of “parents” to
determine trust status of child
 t(i+1) = (1-α) * T * t(i) + α * s
 t: TrustRank score vector (with N nodes)
 T: transition matrix (NxN)
 (1-α): decay factor
 s: seed set trust score distribution
 Vector of size N, but only seed nodes are non-zero
 Demotes web spam by propagating trust
from a known good seed set.
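The only change from the PageRank iteration is the vector s. A minimal sketch, assuming trust is spread uniformly over the seed nodes (the slide only says that non-seed entries are zero); the function name `trustrank` and the toy graph are hypothetical.

```python
def trustrank(out_links, seeds, alpha=0.15, iterations=50):
    """Iterate t(i+1) = (1-alpha) * T * t(i) + alpha * s with s on seeds only."""
    nodes = sorted(out_links)
    # Assumption: each seed gets an equal share of s; all other entries are 0.
    s = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    t = dict(s)
    for _ in range(iterations):
        nxt = {v: alpha * s[v] for v in nodes}
        for parent, children in out_links.items():
            if not children:
                continue  # a node without outgoing links passes nothing on
            share = (1 - alpha) * t[parent] / len(children)
            for child in children:
                nxt[child] += share
        t = nxt
    return t


# Hypothetical graph: the good seed links only to "ok"; the spam pair
# links to "ok" and to each other, but nothing trusted links to them.
graph = {
    "good": ["ok"],
    "ok": ["good"],
    "spam": ["ok", "spam2"],
    "spam2": ["spam"],
}
trust = trustrank(graph, seeds={"good"})
```

Because no trusted node links to the spam pair, both spam nodes keep a trust score of zero and fall to the bottom of a trust-based ranking.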
Specific Motivation
 In TrustRank
 Parent divides its trust among its children.
 This may not be optimal – real-world trust
relationships are independent of the number of
trusted entities.
 Distrust can also be propagated.
[Figure: nodes A and B joined by a hyperlink, with arrows labeled "Trust Propagation" and "Distrust Propagation".]
Key steps in propagation
 Decay of trust (d)
 Trust is not perfectly transitive.
 Splitting of trust
 For each parent, how to divide its score
among its children.
 Accumulation of trust
 For each child, how to accumulate the
overall score given the portions from all
of its parents.
Choices for Trust Splitting
 Given a node i with trust score TR(i)
and O(i) outgoing links:
 Equal splitting
 Gives d*TR(i)/O(i) to each child (used by
TrustRank)
 Constant splitting
 Gives d*TR(i) to each child
 Logarithmic splitting
 Gives d*TR(i)/log(1+O(i)) to each child
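The three splitting rules can be written as one small function. This is a sketch under stated assumptions: the name `split_share` and the method strings are hypothetical, and the logarithm is taken as the natural log because the slide does not give a base.

```python
import math


def split_share(trust, out_degree, d, method):
    """Return the trust a node passes to EACH child.

    trust: TR(i), out_degree: O(i), d: decay factor.
    """
    if method == "equal":        # used by TrustRank
        return d * trust / out_degree
    if method == "constant":
        return d * trust
    if method == "logarithmic":  # assumption: natural logarithm
        return d * trust / math.log(1 + out_degree)
    raise ValueError(f"unknown splitting method: {method}")


# A node with trust 1.0, four outgoing links, and decay 0.5
equal_share = split_share(1.0, 4, 0.5, "equal")
constant_share = split_share(1.0, 4, 0.5, "constant")
log_share = split_share(1.0, 4, 0.5, "logarithmic")
```

For the same node, equal splitting passes the least to each child and constant splitting the most, with logarithmic splitting in between.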
Choices for Trust Accumulation
 Simple summation
 Sum the trust values from each parent
 Maximum share
 Use the maximum of the trust values
sent by the parents
 Maximum parent
 Sum the trust values but never exceed
the trust score of most-trusted parent
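A minimal sketch of the three accumulation rules, assuming each child receives a list of shares (one per parent) and, for the maximum-parent rule, also knows its parents' own trust scores. The function name `accumulate` and the method strings are hypothetical.

```python
def accumulate(shares, parent_scores, method):
    """Combine the shares a child receives from its parents.

    shares: the trust portions sent by each parent.
    parent_scores: the parents' own trust scores (used by maximum_parent).
    """
    if method == "simple_summation":
        return sum(shares)
    if method == "maximum_share":
        return max(shares)
    if method == "maximum_parent":
        # Sum the shares, capped at the most-trusted parent's score.
        return min(sum(shares), max(parent_scores))
    raise ValueError(f"unknown accumulation method: {method}")


shares = [0.2, 0.3, 0.4]
parent_scores = [0.5, 0.6, 0.7]
summed = accumulate(shares, parent_scores, "simple_summation")
max_share = accumulate(shares, parent_scores, "maximum_share")
max_parent = accumulate(shares, parent_scores, "maximum_parent")
```

Here the plain sum (about 0.9) exceeds the most-trusted parent's score, so the maximum-parent rule caps the child at 0.7.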
Propagating Distrust
 Distrust can be propagated from a
seed set of bad nodes.
 Similar to trust propagation, but in
reverse – follow incoming links, not
outgoing links
 Same key choices for decay, splitting
and accumulation
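Following incoming links amounts to propagating over the reversed graph. The sketch below picks one combination for illustration (constant splitting with maximum-share accumulation) and keeps seed nodes at distrust 1.0; those choices, the fixed step count, and all names are assumptions rather than the configuration used in the experiments.

```python
def reverse_links(out_links):
    """Map each node to the list of nodes that link to it."""
    in_links = {v: [] for v in out_links}
    for parent, children in out_links.items():
        for child in children:
            in_links.setdefault(child, []).append(parent)
    return in_links


def propagate_distrust(out_links, bad_seeds, d=0.5, steps=3):
    """Push distrust from bad seeds to the nodes that link to them.

    Assumptions: constant splitting (each linking node receives d * score)
    and maximum-share accumulation (a node keeps the largest share it sees).
    """
    in_links = reverse_links(out_links)
    distrust = {v: (1.0 if v in bad_seeds else 0.0) for v in in_links}
    for _ in range(steps):
        nxt = dict(distrust)
        for node, linking_nodes in in_links.items():
            for source in linking_nodes:
                nxt[source] = max(nxt[source], d * distrust[node])
        distrust = nxt
    return distrust


# Hypothetical chain: a -> b -> spam, plus an unrelated node c
web = {"a": ["b"], "b": ["spam"], "spam": [], "c": []}
result = propagate_distrust(web, bad_seeds={"spam"})
```

Node b links directly to the spam seed and receives 0.5; node a is one hop further away and receives 0.25; node c is unaffected.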
Combining Trust and Distrust
 For each node i, with trust score TR(i) and distrust score DIS_TR(i), the combined score Total(i) can be computed as
Total(i) = η * TR(i) - β * DIS_TR(i)
where 0 ≤ η ≤ 1, 0 ≤ β ≤ 1
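The combination is a weighted difference computed per node. A minimal sketch, assuming scores are held in dictionaries and that a node missing from either dictionary has score 0; the function name `combine` and the sample values are hypothetical.

```python
def combine(trust, distrust, eta=0.5, beta=0.5):
    """Total(i) = eta * TR(i) - beta * DIS_TR(i) for every node i."""
    nodes = set(trust) | set(distrust)
    return {
        v: eta * trust.get(v, 0.0) - beta * distrust.get(v, 0.0)
        for v in nodes
    }


trust_scores = {"x": 0.8, "y": 0.2}
distrust_scores = {"y": 0.6}
total = combine(trust_scores, distrust_scores, eta=0.5, beta=0.5)
```

With equal weights, node y's distrust outweighs its trust and its total becomes negative, which places it below node x in the final ranking.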
Data set
 20M pages from the Swiss search
engine [search.ch] in 2004
 350K sites with “.ch” domain
 We used only this site graph
 Seed sets
 3,589 sites labeled as using various web spam techniques (labels provided)
 20,005 sites with pages in dir.search.ch topics, used as the trusted set
Experimental Design
 Explore various combinations of trust
and distrust propagation
 Evaluation
 Performance of TrustRank is the number of spam sites found among the highest-ranked ~1% of sites.
 We use the same metric in this work.
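The metric can be sketched as a count over the top slice of a ranking. The function name `spam_in_top`, the rounding of the cutoff, and the sample data are assumptions for illustration; the slides do not specify how ties or the exact cutoff are handled.

```python
def spam_in_top(scores, spam_sites, fraction=0.01):
    """Count known spam sites among the highest-ranked fraction of sites."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # assumption: at least one site
    return sum(1 for site in ranked[:k] if site in spam_sites)


# Hypothetical ranking of ten sites; site0 scores highest
site_scores = {f"site{i}": 1.0 / (i + 1) for i in range(10)}
spam = {"site1", "site7"}
count = spam_in_top(site_scores, spam, fraction=0.2)
```

In this example the top 20% is site0 and site1, so one spam site is counted. A lower count means the ranking has demoted spam more effectively.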
Baseline result
Algorithm                                 Num. spam sites
PageRank                                  90
TrustRank                                 58
Topical TrustRank (Wu et al., WWW2006)    33-42
Simple TrustRank Improvement: Increase jump probability (α)
[Figure: number of spam sites in the top 10 buckets (y-axis, 0 to 65) versus jump probability α (x-axis, 0.9 down to 0.1); the default α=0.15 is marked.]
Other trust propagation methods
Algorithm            Constant Splitting        Logarithmic Splitting
Decay =              0.1   0.3   0.7   0.9     0.1   0.3   0.7   0.9
Simple Summation     364   364   364   364     364   364   364   364
Maximum Share         34    34    34    34      13    12    20    18
Maximum Parent        27    32    33    33     372    27    29    32
Results of propagating distrust
Combined equally with TrustRank, 200 seeds
Algorithm            Constant Splitting        Logarithmic Splitting
dDistrust =          0.1   0.3   0.7   0.9     0.1   0.3   0.7   0.9
Simple Summation      53    53    55    55      57    53    53    53
Maximum Share         53    53    53    53      59    53    52    52
Maximum Parent        53    53    53    53      57    53    53    53
Combining trust and distrust
Using best scoring trust and distrust formulations, beta=(1-eta)
[Figure: number of spam sites in top 1.1% (y-axis, 0 to 12, with an off-scale value marked >2200) versus value of eta (x-axis, 0 = Distrust Only to 1 = Trust Only), for Trial 1, Trial 2, and Trial 3.]
Coverage of trust propagation
Algorithm            Constant Splitting              Logarithmic Splitting
Decay                0.1    0.3    0.7    0.9        0.1    0.3    0.7    0.9
Maximum Share        77.71  77.73  77.74  77.74      77.19  77.72  77.73  77.73
Maximum Parent       77.52  77.71  77.73  77.74      76.93  77.60  77.71  77.72
Percentage of sites affected by approach. TrustRank reached 76.05%.
Conclusions
 Propagating trust based on outdegree does
not appear to be optimal.
 Alternative splitting and accumulation
methods can help to demote top ranked
spam sites.
 Propagating distrust can also help to
demote top ranked spam sites.
 Additional tests needed!
 E.g., to examine impact on retrieval
Thank You!
Questions?
Contact Info:
The WUME Lab
Dr. Brian D. Davison
davison(at)cse.lehigh.edu
WUME Laboratory
Computer Science and Engineering
Lehigh University
Bethlehem, PA 18015 USA
http://wume.cse.lehigh.edu/