Adversarial Information Retrieval

The Manipulation of Web Content
Introduction
• Examples
• TrustRank and Other Methods
What is Adversarial IR?
• Gathering, Indexing, Retrieving and Ranking
Information
• Subset of the information has been
manipulated maliciously
• Financial Gain
What is the Goal of AIR?
• Detect the bad sites or communities
• Improve precision on search engines by
eliminating the bad guys
Simplest form
• First generation engines relied heavily on tf/idf
– The top-ranked pages for the query maui resort were the
ones containing the most maui’s and resort’s
• SEOs responded with dense repetitions of chosen terms
– e.g., maui resort maui resort maui resort
– Often, the repetitions would be in the same color as the
background of the web page
• Repeated terms got indexed by crawlers
• But not visible to humans on browsers
Pure word density cannot
be trusted as an IR signal
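The weakness of pure word density can be sketched in a few lines. This is a simplified illustration that scores by raw term counts rather than full tf/idf; the function name and the two sample pages are made up for the example.

```python
from collections import Counter


def term_frequency_score(query, document):
    """Score a document by how often the query terms occur in it."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


# Hypothetical pages: one honest, one stuffed with repeated query terms.
honest_page = "our maui resort offers beach access and a pool"
stuffed_page = "maui resort " * 20 + "cheap pills"

query = "maui resort"
honest_score = term_frequency_score(query, honest_page)
stuffed_score = term_frequency_score(query, stuffed_page)

# The stuffed page wins even though it carries no useful content.
print(honest_score, stuffed_score)
```

A ranker built only on this kind of count rewards repetition, which is exactly what hidden same-color text exploits.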
Search Engine Spamming
• Link-spam
• Link-bombing
• Spam Blogs
• Comment Spam
• Keyword Spam
• Malicious Tagging
Spamming
• Online tutorials for “search engine persuasion
techniques”
– “How to boost your PageRank”
• Artificial links and Web communities
• Latest trend: “Google bombing”
– a community of people create (genuine) links with
a specific anchor text towards a specific page.
Usually to make a political point
Google Bombing
Our Focus
• Link Manipulation
Trust Rank
• Observation
– Good pages tend to link to good pages.
– Humans are the best spam detectors
• Algorithm
– Select a small subset of pages and let a human
classify them
– Propagate goodness of pages
Propagation
• Trust function T
– T(p) returns the probability that p is a good page
• Initial values
– T(p) = 1, if p was found to be a good page
– T(p) = 0, if p was found to be a spam page
• Iterations:
– propagate Trust following out-links
– run only a fixed number of iterations M.
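A minimal sketch of this undampened propagation, assuming the Web graph is given as a dictionary from each page to its out-links. The function name, the `spam_seeds` parameter and the toy graph are illustrative assumptions, not part of the original method description.

```python
def propagate_trust(out_links, good_seeds, spam_seeds=(), max_steps=2):
    """Give trust 1 to every page reachable from a good seed in max_steps hops."""
    trust = {page: 0.0 for page in out_links}
    for seed in good_seeds:
        trust[seed] = 1.0
    blocked = set(spam_seeds)  # pages the human labeled as spam stay at 0
    frontier = set(good_seeds)
    for _ in range(max_steps):
        reached = set()
        for page in frontier:
            for target in out_links.get(page, ()):
                if target not in blocked and trust.get(target, 0.0) == 0.0:
                    trust[target] = 1.0
                    reached.add(target)
        frontier = reached
    return trust


# Hypothetical graph: a -> b -> c -> d, and x -> d.
web = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "x": ["d"]}
trust = propagate_trust(web, good_seeds=["a"], max_steps=2)
print(trust)
```

With M = 2, page d is three hops from the seed and therefore receives no trust.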
Propagation (2)
• Problem with propagation
– Pages reachable from good seeds might not be good
– The further away we are from good seed pages, the less certain we are that a page is good
• Solution: reduce trust as we move further away from the good seed pages (trust attenuation)
Trust attenuation – dampening
– Propagate a dampened trust score ß < 1 at the first step
– At the n-th step, propagate a trust of ß^n
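Dampening can be sketched as a breadth-first walk from the good seeds in which a page n hops away receives ß^n. The sketch assumes that a page reachable by several paths keeps the trust of its shortest path; the function name, the step limit and the toy graph are illustrative.

```python
from collections import deque


def dampened_trust(out_links, good_seeds, beta=0.85, max_steps=3):
    """Assign trust beta**n to a page whose shortest distance from a seed is n."""
    trust = {seed: 1.0 for seed in good_seeds}
    queue = deque((seed, 0) for seed in good_seeds)
    while queue:
        page, distance = queue.popleft()
        if distance == max_steps:
            continue
        for target in out_links.get(page, ()):
            # Breadth-first order means the first visit is the shortest path.
            if target not in trust:
                trust[target] = beta ** (distance + 1)
                queue.append((target, distance + 1))
    return trust


# Hypothetical chain a -> b -> c, with beta = 0.5 for easy arithmetic.
chain = {"a": ["b"], "b": ["c"], "c": []}
chain_trust = dampened_trust(chain, good_seeds=["a"], beta=0.5)
print(chain_trust)
```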
Trust attenuation – splitting
– Parent trust value is split among child nodes
– Observation: the more links a page has, the less care went into choosing them
– Mix dampening and splitting? ß^n · (split trust)
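One propagation step with splitting, optionally mixed with dampening, might look like the sketch below. It assumes that each parent divides its trust equally among its out-links and that a child sums what it receives from all parents; the function name and the sample pages are hypothetical.

```python
def split_trust_step(out_links, trust, beta=1.0):
    """Return the trust each page receives in one step of (dampened) splitting."""
    received = {}
    for page, targets in out_links.items():
        if not targets:
            continue
        # Each child gets an equal share of the parent's dampened trust.
        share = beta * trust.get(page, 0.0) / len(targets)
        for target in targets:
            received[target] = received.get(target, 0.0) + share
    return received


# A careful page with 2 links and a link-heavy hub with 4 links.
links = {"seed": ["a", "b"], "hub": ["a", "b", "c", "d"]}
current = {"seed": 1.0, "hub": 1.0}

split_only = split_trust_step(links, current)
split_and_damp = split_trust_step(links, current, beta=0.8)
print(split_only, split_and_damp)
```

Pages c and d, linked only from the hub, receive a quarter of its trust each, while a and b also collect half of the seed's trust.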
Selection – Inverse PageRank
• The seed set S should:
– be as small as possible
– cover a large part of the Web
• Coverage is related to out-links in the very same way
PageRank is related to in-links
– Inverse PageRank!
• Perform PageRank on a graph with inverted links
– G' = (V, E') where (p, q) ∈ E' ⇔ (q, p) ∈ E.
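A sketch of inverse PageRank under simple assumptions: the graph is a dictionary in which every page appears as a key, PageRank is computed by plain power iteration, and pages without out-links spread their score uniformly. The function names and the toy graph are illustrative.

```python
def invert_graph(out_links):
    """Build G' by reversing every edge (p, q) of G into (q, p)."""
    inverted = {page: [] for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            inverted.setdefault(target, []).append(page)
    return inverted


def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; every page must be a key of out_links."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        updated = {page: (1.0 - damping) / n for page in pages}
        for page, targets in out_links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Dangling page: spread its score over all pages.
                for other in pages:
                    updated[other] += damping * rank[page] / n
        rank = updated
    return rank


def inverse_pagerank(out_links, damping=0.85, iterations=50):
    """Run PageRank on the graph with inverted links."""
    return pagerank(invert_graph(out_links), damping, iterations)


# Hypothetical graph: "hub" links out to three pages, so it covers the most.
graph = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": []}
seed_scores = inverse_pagerank(graph)
seed_order = sorted(seed_scores, key=seed_scores.get, reverse=True)
print(seed_order[0])
```

Pages with high inverse PageRank reach many other pages through their out-links, which makes them attractive seed candidates.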
Algorithm
1. Select seeds (s) and order by preference
2. Invoke the oracle (human) on the first L seeds
3. Initialize and normalize the oracle response d
4. Compute the TrustRank score (as in the PageRank formula):
t* = ß · T · t* + (1 − ß) · d
T is the transition matrix of the Web graph (the adjacency matrix with each page's out-links normalized to sum to 1).
ß is the dampening factor (usually 0.85).
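The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the graph is a dictionary in which every page is a key, that the seed ordering is supplied by the caller, and that the oracle is a function returning True for good pages. All names and the toy graph are hypothetical.

```python
def trustrank(out_links, seed_order, oracle, num_seeds, beta=0.85, iterations=20):
    """Sketch of TrustRank; every page must appear as a key of out_links."""
    pages = list(out_links)

    # Steps 1-2: take the first L seeds in preference order and ask the oracle.
    good_seeds = [page for page in seed_order[:num_seeds] if oracle(page)]
    if not good_seeds:
        raise ValueError("the oracle found no good seed pages")

    # Step 3: d is uniform over the good seeds and sums to 1.
    d = {page: 0.0 for page in pages}
    for page in good_seeds:
        d[page] = 1.0 / len(good_seeds)

    # Step 4: iterate t = beta * T * t + (1 - beta) * d, starting from t = d.
    trust = dict(d)
    for _ in range(iterations):
        updated = {page: (1.0 - beta) * d[page] for page in pages}
        for page, targets in out_links.items():
            if targets:
                share = beta * trust[page] / len(targets)
                for target in targets:
                    updated[target] += share
        trust = updated
    return trust


# Hypothetical graph: g1 and g2 are good; s1 and s2 are spam linking to g1.
web = {"g1": ["g2"], "g2": ["g1"], "s1": ["g1", "s2"], "s2": ["s1"]}
scores = trustrank(web, seed_order=["g1", "s1", "g2", "s2"],
                   oracle=lambda page: page.startswith("g"), num_seeds=2)
print(scores)
```

Because no good page links to s1 or s2, they receive no trust, even though s1 links to a good page.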
Algorithm - example
– s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
– Ordering = [2, 4, 5, 1, 3, 6, 7]
– L = 3 (pages {2, 4, 5}), d = [0, 0.5, 0, 0.5, 0, 0, 0]
– ß = 0.85, M = 20
– t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
– NB: max = 0.18
– Issues with pages 1 and 5
Issues with TrustRank
• Coverage of the seed set may not be broad enough
– Many different topics exist, each with good pages
• TrustRank has a bias towards communities that are
heavily represented in the seed set
– Inadvertently helps spammers that fool these communities
Bias towards larger partitions
$$ t = \frac{m_1}{\sum_{i=1}^{n} m_i}\, t_1 + \frac{m_2}{\sum_{i=1}^{n} m_i}\, t_2 + \dots + \frac{m_n}{\sum_{i=1}^{n} m_i}\, t_n $$
• Divide the seed set into n partitions, each with m_i nodes
• t_i: TrustRank score calculated by using partition i as the seed set
• t: TrustRank score calculated by using all the partitions as one combined seed set
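The decomposition holds because the TrustRank score is linear in the seed vector d. The sketch below checks it numerically on a small made-up graph, assuming disjoint partitions and a seed vector that is uniform over the seeds; the function names are illustrative.

```python
def biased_pagerank(out_links, d, beta=0.85, iterations=20):
    """Iterate t = beta * T * t + (1 - beta) * d, starting from t = d."""
    trust = dict(d)
    for _ in range(iterations):
        updated = {page: (1.0 - beta) * d[page] for page in out_links}
        for page, targets in out_links.items():
            if targets:
                share = beta * trust[page] / len(targets)
                for target in targets:
                    updated[target] += share
        trust = updated
    return trust


def uniform_seed_vector(pages, seeds):
    """Seed vector that is uniform over the seeds and zero elsewhere."""
    return {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in pages}


# Hypothetical graph with two seed partitions: {a, b} (m1 = 2) and {c} (m2 = 1).
web = {"a": ["x"], "b": ["x", "y"], "c": ["y"], "x": ["a"], "y": ["c", "x"]}
pages = list(web)

t1 = biased_pagerank(web, uniform_seed_vector(pages, {"a", "b"}))
t2 = biased_pagerank(web, uniform_seed_vector(pages, {"c"}))
t = biased_pagerank(web, uniform_seed_vector(pages, {"a", "b", "c"}))

# Weighted mix with weights m_i / (m_1 + m_2).
mixed = {page: (2 / 3) * t1[page] + (1 / 3) * t2[page] for page in pages}
largest_gap = max(abs(t[page] - mixed[page]) for page in pages)
print(largest_gap)
```

The larger partition contributes with weight 2/3, which is the bias towards heavily represented communities.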
Basic ideas
• Use pages labeled with topics as seed pages
– Pages listed in highly regarded topic directories
• Trust should be propagated by topics
– A link between two pages is usually created in a topic-specific context
Topical TrustRank
• Topical TrustRank
– Partition the seed set into topically coherent groups
– TrustRank is calculated for each topic
– Final ranking is generated by a combination of these topic-specific trust scores
• Note
– TrustRank is essentially biased PageRank
– Topical TrustRank is fundamentally the same as Topic-Sensitive PageRank, but for demoting spam
Combination of trust scores
• Simple summation
– Default mechanism just seen
– $t = t_1 + t_2 + \dots + t_n$
• Quality bias
– Each topic weighted by a bias factor
– Summation of these weighted topic scores
– $t = w_1 t_1 + w_2 t_2 + \dots + w_n t_n$
– One possible bias: average PageRank value of the seed pages of the topic
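Both combination rules reduce to a weighted sum of per-topic score tables, as in this sketch. The function name, the topic names and the weights are hypothetical; in practice the weights could come from the average PageRank of each topic's seed pages.

```python
def combine_topic_scores(topic_scores, weights=None):
    """Combine per-topic trust scores; with no weights this is simple summation."""
    if weights is None:
        weights = [1.0] * len(topic_scores)
    combined = {}
    for weight, scores in zip(weights, topic_scores):
        for page, score in scores.items():
            combined[page] = combined.get(page, 0.0) + weight * score
    return combined


# Hypothetical per-topic trust scores.
sports_trust = {"p": 0.2, "q": 0.1}
health_trust = {"p": 0.1, "r": 0.4}

summed = combine_topic_scores([sports_trust, health_trust])
quality_biased = combine_topic_scores([sports_trust, health_trust],
                                      weights=[2.0, 0.5])
print(summed, quality_biased)
```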
Further Improvements
• Seed Weighting
– Instead of assigning an equal weight to each seed page, assign a weight proportional to its quality / importance
• Seed Filtering
– Filtering out low quality pages that may exist in topic directories
• Finer topics
– Lower layers of the topic directory
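Seed weighting and seed filtering can be sketched together as the construction of the seed vector d. The quality values and the `min_quality` threshold are assumptions made for illustration; the slides do not prescribe how quality is measured.

```python
def weighted_seed_vector(pages, seed_quality, min_quality=0.0):
    """Build d with weights proportional to seed quality, dropping weak seeds."""
    # Seed filtering: keep only seeds above the (hypothetical) quality threshold.
    kept = {page: q for page, q in seed_quality.items() if q > min_quality}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no seed page passed the quality filter")
    # Seed weighting: normalize so that d sums to 1.
    return {page: kept.get(page, 0.0) / total for page in pages}


# Hypothetical quality scores, e.g. PageRank values of directory pages.
all_pages = ["a", "b", "c", "d"]
quality = {"a": 3.0, "b": 1.0, "c": 0.1}
d = weighted_seed_vector(all_pages, quality, min_quality=0.5)
print(d)
```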