Web Spam Detection with AntiTrust Rank
Vijay Krishnan
Rashmi Raj
Computer Science Department
Stanford University
The World Wide Web
•Huge
•Distributed content creation,
linking (no coordination)
•Structured databases,
unstructured text, semistructured data.
The Web
•Content includes truth, lies,
obsolete information,
contradictions, …
PageRank
• Intuition: “a page is important if important
pages link to it.”
• In high-falutin’ terms: importance = the
principal eigenvector of the stochastic
matrix of the Web.
(A few fixups needed.)
PageRank
• Web graph encoded by matrix M
– N×N matrix (N = number of web pages)
– Mij = 1/|O(j)| if there is a link from j to i
– Mij = 0 otherwise
– O(j) = set of pages that node j links to
• Define matrix A as follows
– Aij = βMij + (1-β)/N, where 0<β<1
– 1-β is the “tax” discussed in prior lecture
• Page rank r is the principal eigenvector of A
– Ar = r
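The definition above can be sketched as a power iteration. This is a minimal illustration rather than the lecture's implementation: the function name pagerank, the value β = 0.85, the iteration count, and the three-page toy graph are all assumptions, and the sketch assumes every page has at least one out-link.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Power iteration for r = A r with A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page j to the list O(j) of pages it links to.
    Assumes every page has at least one out-link (no dead ends).
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # The (1 - beta) / N term: every page receives an equal share of the tax.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            share = beta * rank[j] / len(out_links[j])  # beta * M_ij * r_j
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because the ranks sum to 1, adding (1-β)/N to every page is the same as multiplying r by the dense (1-β)/N part of A, so the matrix A never has to be built explicitly.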
Many Random Walkers Model
• Imagine a large number M of independent,
identical random walkers (M ≫ N)
• At any point in time, let M(p) be the
number of random walkers at page p
• The page rank of p is the fraction of random
walkers that are expected to be at page p
i.e., E[M(p)]/M.
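The walker model can be checked with a small simulation. The sketch below assumes each walker moves according to matrix A, i.e., with probability β it follows a random out-link and otherwise jumps to a uniformly random page; the slide does not spell out the step rule, so this is an assumption, as are the function name, the toy graph, and all parameter values.

```python
import random


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Return the fraction of walkers on each page after `steps` moves."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(out_links[positions[w]])
            else:
                # Pay the "tax": jump to a uniformly random page.
                positions[w] = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for p in positions:
        counts[p] += 1
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(toy_graph)
print(fractions)
```

With enough walkers, the observed fractions M(p)/M should approach the page rank values obtained from Ar = r on the same graph.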
Economic Considerations
• Search has become the default gateway to
the web
• Very high premium to appear on the first
page of search results
– e.g., e-commerce sites
– advertising-driven sites
What is Web Spam?
• Spamming = any deliberate action solely in order
to boost a web page’s position in search engine
results, incommensurate with page’s real value
• Spam = web pages that are the result of spamming
• This is a very broad definition
– SEO industry might disagree!
– SEO = search engine optimization
• Approximately 10-15% of web pages are spam
Types of Spamming Techniques
• Term spamming
– Manipulating the text of web pages in order to
appear relevant to queries
• Link spamming
– Creating link structures that boost page rank or
hubs and authorities scores
Link Spam
• Three kinds of web pages from a spammer’s
point of view
– Inaccessible pages
– Accessible pages
• e.g., web log comments pages
• spammer can post links to his pages
– Own pages
• Completely controlled by spammer
• May span multiple domain names
Link Spam Detection
• Open research area
• One approach: TrustRank
Trust Rank
• Basic principle: approximate isolation
– It is rare for a “good” page to point to a “bad” (spam)
page
• Sample a set of “seed pages” from the web.
• Set trust of each trusted page to 1
• Propagate trust through links
• Each page gets a trust value between 0 and 1
• Use a threshold value and mark all pages below
the trust threshold as spam
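The steps above can be sketched as follows. The slide does not give the exact propagation rule, so this sketch assumes one common formulation: a PageRank-style iteration whose (1-β) teleport mass goes only to the trusted seeds, which makes the scores sum to at most 1 instead of starting each seed at exactly 1. The function name, the toy graph, and the threshold 0.01 are hypothetical.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=50):
    """Propagate trust forward along links, starting from trusted seeds."""
    pages = sorted(out_links)
    # Assumption: teleport mass is split evenly over the trusted seeds.
    seed_share = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed_share)
    for _ in range(iterations):
        new_trust = {p: (1.0 - beta) * seed_share[p] for p in pages}
        for j in pages:
            targets = out_links[j]
            if not targets:
                continue  # trust reaching a dead end is simply dropped
            share = beta * trust[j] / len(targets)
            for i in targets:
                new_trust[i] += share
        trust = new_trust
    return trust


# Hypothetical graph: good pages never link to spam, but spam links to good.
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(toy_graph, trusted_seeds={"good1"})
threshold = 0.01  # hypothetical cut-off
flagged = sorted(p for p in trust if trust[p] < threshold)
print(flagged)
```

In this toy graph no trusted page links to a spam page, so no trust ever reaches spam1 or spam2 and both fall below the threshold.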
Anti-Trust Approach
• Broadly based on the same “approximate
isolation principle”
• This principle also implies that the pages pointing to
spam pages are very likely to be spam pages
themselves.
• Anti-Trust is propagated in the reverse direction along
incoming links, starting from a seed set of spam
pages.
• A page can be classified as a spam page if its AntiTrust Rank value exceeds a chosen threshold.
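A minimal sketch of the reverse propagation, under the assumption that Anti-Trust is computed like a seed-biased PageRank on the graph with every link reversed. The function name, the toy link farm, and the threshold 0.05 are hypothetical and not taken from the slides.

```python
def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate Anti-Trust backwards along incoming links from spam seeds."""
    pages = sorted(out_links)
    # Reverse the graph: in_links[i] lists the pages j that link to i.
    in_links = {p: [] for p in pages}
    for j in pages:
        for i in out_links[j]:
            in_links[i].append(j)
    seed_share = {
        p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages
    }
    score = dict(seed_share)
    for _ in range(iterations):
        new_score = {p: (1.0 - beta) * seed_share[p] for p in pages}
        for i in pages:
            sources = in_links[i]
            if not sources:
                continue  # nothing links here, so Anti-Trust stops flowing
            share = beta * score[i] / len(sources)
            for j in sources:
                new_score[j] += share  # pages pointing at i inherit Anti-Trust
        score = new_score
    return score


# Hypothetical link farm: farm1 and farm2 exist only to boost spam_target.
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["spam_target"],
    "farm2": ["spam_target"],
    "spam_target": ["good1"],
}
anti_trust = anti_trust_rank(toy_graph, spam_seeds={"spam_target"})
threshold = 0.05  # hypothetical cut-off
flagged = sorted(p for p in anti_trust if anti_trust[p] > threshold)
print(flagged)
```

The good pages receive no Anti-Trust here even though spam_target links to good1, because propagation only moves from a page to the pages that link to it.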
Seed Set selection
• Seed spam set chosen from pages with high
page rank.
• Nearly 100% of URLs containing certain
terms such as {viagra, gambling, hardporn} as
substrings are spam. Use these for
evaluation.
• Also some seed pages were chosen by an
Oracle (Human Expert).
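One way to combine these ideas is sketched below: label URLs with the substring heuristic, then keep the k labeled pages with the highest page rank as seeds. Using the heuristic as the spam label for seed selection is an assumption of this sketch (the slide uses it for evaluation and also relies on a human oracle); the function names, URLs, and page rank values are hypothetical.

```python
SPAM_TERMS = ("viagra", "gambling", "hardporn")  # terms listed on the slide


def looks_like_spam(url):
    """Heuristic label: the URL contains one of the spam terms."""
    lowered = url.lower()
    return any(term in lowered for term in SPAM_TERMS)


def pick_spam_seeds(page_rank, is_spam, k):
    """Return the k spam-labeled pages with the highest page rank."""
    spam_pages = [url for url in page_rank if is_spam(url)]
    spam_pages.sort(key=lambda url: page_rank[url], reverse=True)
    return spam_pages[:k]


# Hypothetical page rank values for illustration only.
page_rank = {
    "http://example.com/news": 0.30,
    "http://cheap-viagra.example/": 0.25,
    "http://example.org/gambling-tips": 0.10,
    "http://hardporn.example/x": 0.05,
    "http://example.net/": 0.30,
}
seeds = pick_spam_seeds(page_rank, looks_like_spam, k=2)
print(seeds)
```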
Results
• Overall percentage of “spam” pages = 0.28%.
• Average page rank of “spam” pages / average page rank
of all pages = 2.6.
• % of “spam” pages in:
• Top 1000 Anti-Trust Rank pages = 25.3%
• Bottom 1000 TrustRank pages = 0.68%
• Ratio of average page ranks of spam pages
returned by ATR vs. TR is roughly 6.
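The "% of spam in the top 1000" figures are a precision-at-k measurement. A minimal sketch of that calculation follows, with a hypothetical function name, scores, and labels (the numbers are for illustration and are not the lecture's data).

```python
def spam_fraction_in_top_k(scores, is_spam, k):
    """Fraction of spam among the k pages with the highest score."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for page in top if is_spam(page)) / len(top)


# Hypothetical Anti-Trust scores and spam labels.
scores = {"p1": 0.9, "p2": 0.7, "p3": 0.4, "p4": 0.1}
labeled_spam = {"p1", "p3"}
fraction = spam_fraction_in_top_k(scores, lambda p: p in labeled_spam, k=2)
print(fraction)
```

For the TrustRank figure the same calculation would be applied to the k lowest-scoring pages, for example by negating the scores before calling the function.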
Results
[Figure: Number of spam pages under different scenarios. Series: TrustRank, AntiTrust (Seed=40), ATR (Seed=80), ATR (Seed=120). Vertical axis logarithmic from 1 to 100000; horizontal axis labeled 1 to 5.]
References
• The PageRank citation ranking: Bringing order to the
web. L. Page, S. Brin, R. Motwani and T. Winograd.
Technical Report, Stanford University, 1998.
• Combating Web Spam with TrustRank. Zoltan
Gyongyi, Hector Garcia-Molina and Jan Pedersen. In
VLDB 2004.
• Topic-sensitive PageRank. Taher Haveliwala. In
WWW 2002.
• The WebGraph dataset. Online at:
http://webgraph-data.dsi.unimi.it/