Link Analysis and Anti-Spam Tie-Yan Liu Microsoft Research Asia

Link Analysis and
Anti-Spam
Tie-Yan Liu
Microsoft Research Asia
Outline
• First Session
̵ Overview of Link Analysis Technologies
̵ PageRank and HITS
• Second Session
̵ More about Link Analysis Algorithms
• Third Session
̵ Spam and Anti-Spam
• Homework
2005-11-4
"Web Search and Mining" Course
@ USTC, 2005
First Session
Typical Search Engine Architecture
Ranking for the Search Results
• Today’s search engines may return millions of pages
for a certain query
• It is definitely not possible for the user to preview all
these results
• An appropriate ranking will be very helpful.
̵ Ranking on relevance
̵ Ranking on importance
Traditional IR Ranking
• A ranking purely on relevance
̵ Term frequency (tf)
̵ Inverse Document Frequency (idf)
̵ Okapi …
̵ Many other aspects that Dr. Shuming Shi will mention
in the next course.
Limitations of Traditional IR
• Text-based ranking function
̵ www.harvard.edu can hardly be recognized as one of
the most authoritative pages for the query “harvard”,
since many other web pages contain “harvard” more
often.
̵ The number of pages with the same relevance is still
too large for the users to preview.
• Pages are not sufficiently self-descriptive
̵ Usually the term “search engine” doesn't appear on
the web pages of search engines.
What’s More for Web Search
• In order to solve these problems
̵ We must leverage other information on the Web
̵ We must distinguish those pages with the same amount of
relevance
• Link Analysis
̵ The web is not just a collection of pure-text documents
• the hyperlinks are also very important!
̵ A link from page A to page B may indicate:
• A is related to B, or
• A is recommending, citing, voting for or endorsing B
̵ Links affect the ranking of web pages and thus have commercial
value.
Famous Link Analysis Methods
• HITS
• PageRank
HITS - Kleinberg’s Algorithm
• HITS – Hypertext Induced Topic Selection
• For each vertex v in a subgraph of interest:
̵ a(v) - the authority of v
̵ h(v) - the hubness of v
• A site is very authoritative if it receives many citations.
Citations from important sites weigh more than
citations from less important sites
• Hubness shows the importance of a site. A good hub
is a site that links to many authoritative sites
Authority and Hubness
[Figure: node 1 with parents 2, 3, 4 and children 5, 6, 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
Convergence of Authority and
Hubness
• Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
(pa[v]: parents of v; ch[v]: children of v)
• Using linear algebra, we can prove:
a(v) and h(v) converge
HITS Example
Find a base subgraph:
• Start with a root set R {1, 2, 3, 4}
• {1, 2, 3, 4} - nodes relevant to
the topic
• Expand the root set R to include
all the children and a fixed
number of parents of nodes in R
→ A new set S (base subgraph)
HITS Example
Hubs and authorities: two n-dimensional vectors a and h
[Figure: example graph with nodes labeled 1-12]
HubsAuthorities(G)
  1 ← [1, …, 1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 0
  repeat
    t ← t + 1
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
  until ||a_t − a_{t-1}|| + ||h_t − h_{t-1}|| < ε
  return (a_t, h_t)
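The iteration above can be sketched in plain Python. This is a minimal illustration, not the lecture's own code: the function name `hits`, the dictionary graph format, and the small example graph are assumptions made here.

```python
import math

def hits(children, eps=1e-10, max_iter=1000):
    """Authority and hub scores for a graph given as {node: [children]}."""
    nodes = sorted(set(children) | {w for ws in children.values() for w in ws})
    parents = {v: [] for v in nodes}
    for v, ws in children.items():
        for w in ws:
            parents[w].append(v)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # a_t(v) sums h_{t-1} over parents; h_t(v) sums a_{t-1} over children
        new_a = {v: sum(h[w] for w in parents[v]) for v in nodes}
        new_h = {v: sum(a[w] for w in children.get(v, [])) for v in nodes}
        # Normalize both vectors to unit Euclidean length
        norm_a = math.sqrt(sum(x * x for x in new_a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in new_h.values())) or 1.0
        new_a = {v: x / norm_a for v, x in new_a.items()}
        new_h = {v: x / norm_h for v, x in new_h.items()}
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h

# Hypothetical example: pages 1 and 2 act as hubs, pages 3 and 4 as authorities.
authority, hubness = hits({1: [3, 4], 2: [3]})
```

In this example page 3 is cited by both hubs, so it ends up with a higher authority score than page 4, and page 1 ends up as the better hub.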
HITS Example Results
[Figure: authority and hubness weights for nodes 1-15]
Matrix Notation of HITS
• With A the adjacency matrix of the base subgraph (A_ij = 1 if
page i links to page j), the authority and hubness vectors
calculated by the aforementioned algorithm are the principal right
and left singular vectors of A, respectively.
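A short derivation may make this concrete. It assumes the convention $A_{ij} = 1$ when page $i$ links to page $j$ and ignores the normalization step, which only rescales the vectors:

```latex
\begin{aligned}
a &\leftarrow A^{T} h, \qquad h \leftarrow A a \\
\Rightarrow\quad a &\leftarrow A^{T} A \, a, \qquad h \leftarrow A A^{T} h
\end{aligned}
```

Repeated application is a power iteration, so $a$ tends to the principal eigenvector of $A^{T}A$ and $h$ to the principal eigenvector of $AA^{T}$, provided the largest eigenvalue is not repeated.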
PageRank
• Introduced by Page et al. (1998)
̵ The rank of a page is proportional to its parents’ ranks, but
inversely proportional to its parents’ outdegrees
Matrix Notation
Adjacency Matrix
A=
Matrix Notation
• Matrix Notation
r=Br
B=
• PageRank is embedded in the eigenvector of
B associated with the eigenvalue 1.
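As a rough sketch of what solving r = Br by iteration looks like, the code below applies the update repeatedly on a tiny graph. The function name, graph format and example graph are assumptions for illustration; the graph is chosen to be strongly connected and aperiodic so that this plain iteration converges.

```python
def simple_pagerank(children, tol=1e-12, max_iter=1000):
    """Iterate r <- B r, where B[v][w] = 1/outdegree(w) if w links to v."""
    nodes = sorted(children)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        new_r = {v: 0.0 for v in nodes}
        for w in nodes:
            for v in children[w]:
                # Each parent w shares its rank equally among its children
                new_r[v] += r[w] / len(children[w])
        if sum(abs(new_r[v] - r[v]) for v in nodes) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical 3-page web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
ranks = simple_pagerank({1: [2, 3], 2: [3], 3: [1]})
```

For this graph the fixed point can be checked by hand: r(1) = r(3) = 0.4 and r(2) = 0.2.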
Markov Chain Notation
• Random surfer model
̵ Description of a random walk through the Web graph
̵ The rank of a page is interpreted as the asymptotic probability that
a surfer is currently browsing that page
$r_t = M r_{t-1}$
M: transition matrix for a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$)
regardless of the initial ranks?
Problem
• “Rank Sink” Problem
̵ In general, many Web pages have no inlinks or no outlinks
̵ This results in dangling edges in the graph
E.g.
no parent → rank 0
M^T (the T-th power of M, for large T) converges to a matrix
whose last column is all zero
no children → no solution
M^T converges to the zero matrix
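A tiny experiment illustrates both cases. The helper name and the three-page chain are assumptions chosen for illustration: page 1 has no parent and page 3 has no children.

```python
def iterate_rank(children, steps):
    """Apply r <- B r `steps` times with no escape term, starting from uniform ranks."""
    nodes = sorted(children)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(steps):
        new_r = {v: 0.0 for v in nodes}
        for w in nodes:
            for v in children[w]:
                new_r[v] += r[w] / len(children[w])
        r = new_r
    return r

# Hypothetical chain 1 -> 2 -> 3
chain = {1: [2], 2: [3], 3: []}
after_one = iterate_rank(chain, 1)    # page 1 already has rank 0
after_three = iterate_rank(chain, 3)  # all rank has leaked out through page 3
```

After one step the parentless page has rank 0, and after three steps the total rank is 0 because the childless page passes nothing on.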
Modification
• Surfer will restart browsing by picking a new Web
page at random
M=(B+E)
E : escape matrix
M : stochastic matrix
• Still a problem?
̵ It is not guaranteed that M is primitive
̵ If M is stochastic and primitive, PageRank converges to the
corresponding stationary distribution of M
Distribution of the Mixture Model
• The probability distribution that results from combining the Markovian
random walk distribution & the static rank source distribution
r = εe + (1 − ε)x
r: PageRank
ε: probability of selecting non-linked page
• Now, transition matrix [εH + (1 − ε)M] is primitive and stochastic
• r_t converges to the dominant eigenvector
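The mixture model can be sketched as follows. Two choices here are assumptions rather than something the slides specify: the jump distribution e is taken to be uniform, and the rank held by pages without outlinks is redistributed uniformly so that the ranks keep summing to 1.

```python
def pagerank(children, eps=0.15, tol=1e-12, max_iter=1000):
    """PageRank with a uniform random jump taken with probability eps."""
    nodes = sorted(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: pages with no outlinks spread their rank over all pages
        dangling = sum(r[w] for w in nodes if not children[w])
        new_r = {v: eps / n + (1 - eps) * dangling / n for v in nodes}
        for w in nodes:
            for v in children[w]:
                new_r[v] += (1 - eps) * r[w] / len(children[w])
        done = sum(abs(new_r[v] - r[v]) for v in nodes) < tol
        r = new_r
        if done:
            break
    return r

# The same kind of chain that breaks the plain iteration: 1 -> 2 -> 3
ranks = pagerank({1: [2], 2: [3], 3: []})
```

With the jump term the chain example now yields a proper distribution in which pages further down the chain rank higher.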
PageRank v.s. HITS - Algorithm
PageRank v.s. HITS - Stability
• Are link analysis algorithms based on eigenvectors stable, in the
sense that their results don’t change significantly when the link graph changes slightly?
• General strategy for evaluating stability:
̵ 1. Start with the original adjacency matrix, A
̵ 2. Perturb the matrix to get A*: select k nodes in the graph to add
or delete
̵ 3. Compute the distance d(r(A), r(A*)), for some distance measure d
and objective function r that measures the quality of the results
̵ 4. Compute the amount of perturbation p(A, A*), for some distance
function p
̵ 5. Evaluate the conditions, if any, under which small values of p
generate large values of d
Stability of HITS
• Ng 2001
̵ A bound on the number of hyperlinks k that can be
added to or deleted from one page without affecting
the authority or hubness weights
̵ δ: eigengap λ1 – λ2
̵ d: maximum outdegree of G
• Observations
̵ Stability is determined by the eigengap
̵ Eigengap: difference between the 1st and 2nd eigenvalues
• $A^T A$ for authorities, $A A^T$ for hubs
̵ If the eigengap is big, HITS will be insensitive to small
perturbations; if it is small, HITS will be sensitive to them
Stability of PageRank
• Looser bound
̵ Ng et al (2001)
̵ Bianchini et al (2001)
• Observations
̵ The parameter ε of the mixture model has a
stabilization role
̵ If original k pages to be modified do not have high
overall PR scores then perturbed scores will not be
far from the original
Second Session
Pre-PageRank
• PageRank has achieved great success in industry, and many
people regard it as a breakthrough in the research
field as well.
• Actually, the basic idea of PageRank had already
appeared in many previous works
̵ Mark 1988
̵ Bray 1996
̵ Marchiori 1997
̵ ……
Mark 1988
• To calculate the score S of a document at vertex v
$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$
v: a vertex in the hypertext graph G = (V, E)
S(v): the global score
s(v): the score if the document is isolated
ch[v]: children of the document at vertex v
• Limitations:
- Requires G to be a directed acyclic graph (DAG)
- If v has a single link to w, S(v) > S(w)
- If v has a long path to w and s(v) < s(w),
then S(v) > S(w)
Mark, D. M., (1988), "Network models in geomorphology," Chapter 4 in Modeling in Geomorphologic Systems, Edited by M.
G. Anderson, John Wiley., p.73-97.
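The recursion in Mark's formula can be sketched with memoization, assuming the graph really is a DAG (otherwise the recursion would not terminate). The function name and the two-page example are illustrative assumptions.

```python
def mark_scores(children, s):
    """Global score S(v) = s(v) + average of S(w) over the children w of v."""
    memo = {}

    def score(v):
        if v not in memo:
            kids = children.get(v, [])
            avg = sum(score(w) for w in kids) / len(kids) if kids else 0.0
            memo[v] = s[v] + avg
        return memo[v]

    return {v: score(v) for v in s}

# Hypothetical example: v has a single link to w, and s(v) < s(w)
scores = mark_scores({"v": ["w"]}, {"v": 1.0, "w": 2.0})
```

Here S(w) = 2 and S(v) = 1 + 2 = 3, which shows the second limitation: a page with a single link to w always scores above w.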
Bray 1996
• The visibility of a site is measured by the number of
other sites pointing to it
̵ Authority?
• The luminosity of a site is measured by the number of
other sites to which it points
̵ Hub?
Marchiori (1997)
• Hyper information should complement textual
information to obtain the overall information
S(v) = s(v) + h(v)
- S(v): overall information
- s(v): textual information
- h(v): hyper information
• $h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$
- F: a fading constant, F ∈ (0, 1)
- r(v, w): the rank of w after sorting the children of v
by S(w)
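A small sketch of the hyper-information formula follows. Several details are assumptions made only for illustration: the graph is acyclic, children are sorted by S(w) in descending order, and ranks start at 1, so the most informative child is faded least.

```python
def marchiori_scores(children, s, fading=0.5):
    """Overall information S(v) = s(v) + h(v) on an acyclic graph."""
    memo = {}

    def overall(v):
        if v not in memo:
            # Assumption: children sorted by S(w), best first, ranks 1, 2, ...
            kids = sorted((overall(w) for w in children.get(v, [])), reverse=True)
            hyper = sum(fading ** rank * score
                        for rank, score in enumerate(kids, start=1))
            memo[v] = s[v] + hyper
        return memo[v]

    return {v: overall(v) for v in s}

# Hypothetical example with F = 0.5: v links to w1 and w2
scores = marchiori_scores({"v": ["w1", "w2"]}, {"v": 1.0, "w1": 4.0, "w2": 2.0})
```

With these numbers h(v) = 0.5 * 4 + 0.25 * 2 = 2.5, so S(v) = 3.5.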
Post PageRank
• Following the success of PageRank, many new
algorithms were proposed.
̵ Fast PageRank calculation (Haveliwala)
̵ Topic-sensitive PageRank
̵ Personalized PageRank
̵ LinkFusion
̵ ……
Fast PageRank calculation [Haveliwala – 1999]
• Partition the destination vector into d blocks that
each fit into main memory, and compute one block
at a time.
• This algorithm is quite similar in structure to the Block
Nested-Loop Join algorithm in database systems,
which also performs very well for data sets of
moderate size but eventually loses out to more
scalable approaches.
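The block idea can be sketched for a single multiplication step. This is a simplification under stated assumptions: pages are numbered 0..n-1, the whole edge list is rescanned for every block (a real system would also partition the link file), and the function name is hypothetical.

```python
def blocked_pagerank_step(edges, outdeg, rank, n, num_blocks):
    """One step r' = B r, filling the destination vector one block at a time."""
    block_size = (n + num_blocks - 1) // num_blocks
    new_rank = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        block = [0.0] * (end - start)  # only this block needs to be in memory
        for src, dst in edges:         # stands in for a scan of links on disk
            if start <= dst < end:
                block[dst - start] += rank[src] / outdeg[src]
        new_rank.extend(block)
    return new_rank

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
outdeg = [2, 1, 1]
start_rank = [1.0 / 3] * 3
one_block = blocked_pagerank_step(edges, outdeg, start_rank, 3, 1)
two_blocks = blocked_pagerank_step(edges, outdeg, start_rank, 3, 2)
```

The number of blocks only changes how much of the destination vector is held at once; the result of the step is the same.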
Fast PageRank calculation [Haveliwala – 2003]
• Basic observation:
̵ The convergence rates of the PageRank values of individual
pages during application of the Power Method are nonuniform.
That is, many pages converge quickly, with a few pages taking
much longer to converge. Furthermore, the pages that
converge slowly are generally those pages with high PageRank.
Topic-Specific PageRank [Haveliwala WWW02]
• Topic-specific PageRanks
̵ For each page, PageRank values are precomputed for each topic;
for each query, the values of the most relevant topics are used.
̵ 16 topics
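One way to picture this is a PageRank whose random jump is restricted to pages of a topic, computed offline per topic and then mixed at query time. The sketch below rests on assumptions not in the slides: every page has at least one outlink, the topic sets and weights are made up, and all function names are hypothetical.

```python
def biased_pagerank(children, teleport, eps=0.15, iterations=200):
    """PageRank whose random jump follows the given teleport distribution."""
    nodes = sorted(children)
    r = dict(teleport)
    for _ in range(iterations):
        new_r = {v: eps * teleport[v] for v in nodes}
        for w in nodes:
            for v in children[w]:
                new_r[v] += (1 - eps) * r[w] / len(children[w])
        r = new_r
    return r

def topic_teleport(nodes, topic_pages):
    """Uniform jump distribution over the pages that belong to one topic."""
    return {v: (1.0 / len(topic_pages) if v in topic_pages else 0.0) for v in nodes}

def query_scores(topic_ranks, topic_weights):
    """Mix precomputed per-topic ranks by how relevant each topic is to the query."""
    pages = next(iter(topic_ranks.values()))
    return {v: sum(topic_weights[t] * topic_ranks[t][v] for t in topic_ranks)
            for v in pages}

# Hypothetical ring of four pages and two made-up topics
ring = {1: [2], 2: [3], 3: [4], 4: [1]}
topics = {"sports": {1}, "news": {3}}
ranks = {t: biased_pagerank(ring, topic_teleport(ring, pages))
         for t, pages in topics.items()}
mixed = query_scores(ranks, {"sports": 0.5, "news": 0.5})
```

Only the cheap mixing step runs per query; the expensive per-topic iterations are done in advance.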
Link Fusion – [Zeng, WWW04]
• In a more generalized scenario, suppose there are N data types. The
importance attribute of one type of object can be reinforced by both
inter- and intra-type links as:
$$L_{urm} = \begin{bmatrix}
\alpha_1 L_1 & \beta_{12} L_{12} & \cdots & \beta_{1N} L_{1N} \\
\beta_{21} L_{21} & \alpha_2 L_2 & \cdots & \beta_{2N} L_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{N1} L_{N1} & \beta_{N2} L_{N2} & \cdots & \alpha_N L_N
\end{bmatrix}$$
• Suppose w is the attribute vector of all the objects in the URM. Link
Fusion can be represented as:
$w_{new} = L_{urm}^T w_{old}$
• Such iterative calculation can be continued:
$w_n = (L_{urm}^T)^n w_0$
• The result w is the principal eigenvector of $L_{urm}^T$, which can be explained
as the value of data objects regarding a specific attribute.
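A toy version of the iteration for two data types is sketched below. The block matrices, the weights of 0.5, the function names, and the rescaling of w to sum to 1 after each step are all assumptions added for illustration; the rescaling only keeps the numbers bounded and does not change the direction of w.

```python
def assemble_urm(blocks, weights):
    """Build L_urm from a grid of per-type link matrices and matching weights."""
    rows = []
    for block_row, weight_row in zip(blocks, weights):
        for i in range(len(block_row[0])):
            rows.append([wt * x
                         for blk, wt in zip(block_row, weight_row)
                         for x in blk[i]])
    return rows

def link_fusion(l_urm, iterations=500):
    """Iterate w_new = L_urm^T w_old, rescaling w to sum to 1 each step."""
    n = len(l_urm)
    w = [1.0 / n] * n
    for _ in range(iterations):
        new_w = [sum(l_urm[i][j] * w[i] for i in range(n)) for j in range(n)]
        total = sum(new_w)
        w = [x / total for x in new_w]
    return w

# Hypothetical example: two data types with two objects each
L1 = [[0, 1], [1, 0]]    # intra-type links, type 1
L12 = [[1, 0], [0, 1]]   # links from type 1 to type 2
L21 = [[1, 1], [0, 1]]   # links from type 2 to type 1
L2 = [[0, 1], [1, 0]]    # intra-type links, type 2
L_urm = assemble_urm([[L1, L12], [L21, L2]], [[0.5, 0.5], [0.5, 0.5]])
w = link_fusion(L_urm)
```

The resulting vector holds one importance value per object, covering both types at once.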
Limits of Link Analysis
• Pay-for-place
̵ Search engine bias: organizations pay search engines for a higher page
rank
̵ Advertisements: organizations pay high-ranking pages for
advertising space
• With a primary effect of increased visibility to end users and a
secondary effect of increased respectability due to relevance to a
high-ranking page
Limits of Link Analysis
• Stability
̵ Adding even a small number of nodes/edges to the graph has a
significant impact
• Topic drift
̵ A top authority may be a hub of pages on a different topic
resulting in increased rank of the authority page
• Content evolution
̵ Adding/removing links/content can affect the intuitive authority
rank of a page requiring recalculation of page ranks
Third Session
What is Link Spam
• Since link analysis plays an important role in
search engines, it has large commercial value
• Improving one’s PageRank can directly increase one’s
clicks and thus earn more money.
• Link spam is any attempt to unfairly gain a high
ranking on a search engine for a web page without
improving the user experience, by means of tricky
modification / manipulation of the link graph.
Link Spamming Technologies
• Adding outlinks
̵ Replicate hub pages
• Adding inlinks
̵ Create a honey pot
̵ Infiltrate a web directory
̵ Post links on blogs, wikis, etc.
̵ Participate in link exchange
̵ Buy expired domains
̵ Create own spam farm.
Case Study: Spam HITS
• The hub score of the target page can be increased by adding outlinks
on it to pages with high authority scores
• The authority score can be increased by creating
hyperlinks from high-hub-score pages to the target
page.
Case Study: Spam PageRank
• Factors that influence PageRank
̵ $PR(t) = PR_{static}(t) + PR_{in}(t) - PR_{out}(t) - PR_{sink}(t)$
• Strategies
̵ Own pages are part of the spam farm, maximizing $PR_{static}(t)$
̵ Accessible pages point to the spam farm, maximizing $PR_{in}(t)$
̵ Links pointing outside the spam farm are suppressed,
minimizing $PR_{out}(t)$
̵ All pages within the farm have some outlinks,
minimizing $PR_{sink}(t)$
Anti-Spam
• Early approaches
̵ BHITS, SALSA, DOM, revised HITS, BadRank …
• State-of-the-art
̵ TrustRank (2004)
̵ Revised PageRank (VLDB2004)
̵ BadRank + (WWW2005)
̵ SpamRank (WWW2005, workshop)
̵ ……
TrustRank
• Basic assumption
̵ Good pages seldom point to spam pages, but spam
pages may very likely point to good pages.
• Use TrustRank to denote the goodness of a webpage,
and use Trust Propagation to label all the web pages
starting from a small human-labeled seed set.
TrustRank
• Step 1: Initialization
̵ How to select seeds
• Inverse PageRank (Hub pages, since they have more
influence)
• High PageRank (Important pages are more important to
search applications)
• Step 2: Propagation
TrustRank
• Step 3:
̵ Trust Dampening
̵ Trust Splitting
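Trust propagation can be sketched as a PageRank-style iteration whose jump goes only to the human-labeled good seeds. The decay value, the fixed iteration count, the function name and the example graph are assumptions for illustration, not the published parameters.

```python
def trust_rank(children, good_seeds, decay=0.85, iterations=20):
    """Propagate trust from good seeds along outlinks."""
    nodes = sorted(children)
    d = {v: (1.0 / len(good_seeds) if v in good_seeds else 0.0) for v in nodes}
    t = dict(d)
    for _ in range(iterations):
        new_t = {v: (1 - decay) * d[v] for v in nodes}
        for w in nodes:
            for v in children[w]:
                # Splitting: w's trust is divided among its outlinks.
                # Dampening: each hop away from a seed scales trust by `decay`.
                new_t[v] += decay * t[w] / len(children[w])
        t = new_t
    return t

# Hypothetical web: good seed g -> a -> b; spam pages s1 and s2 link to each
# other, and s1 also links to the good page g.
web = {"g": ["a"], "a": ["b"], "b": [], "s1": ["s2", "g"], "s2": ["s1"]}
trust = trust_rank(web, {"g"})
```

Because no good page links to the spam pages, they receive no trust, even though one of them links to a good page; this mirrors the basic assumption of TrustRank.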
BadRank+
• Motivation
̵ Pages in the spam farm are densely connected, and
many common pages exist in both the inlinks and
outlinks of these pages.
• Propagate the badness of pages in the seed set to
detect the other spam pages on the Web.
BadRank+
• Step 1: Initialization
̵ At least 3 common nodes (approximately the same,
i.e. with the same domain name) in the inlink and
outlink sets
• Step 2: Expansion
̵ ParentPenalty: if a page links to many bad pages
(larger than a threshold), it will also be labeled as bad.
̵ Delete all the links between detected bad pages
before PageRank calculation.
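The two steps can be sketched as below. This simplifies the description above: nodes are matched exactly rather than by domain name, and the function names, the threshold value and the example graph are hypothetical.

```python
def detect_seeds(children, min_common=3):
    """Seed pages: at least `min_common` nodes appear in both inlinks and outlinks."""
    parents = {}
    for v, ws in children.items():
        parents.setdefault(v, set())
        for w in ws:
            parents.setdefault(w, set()).add(v)
    return {v for v in children
            if len(set(children[v]) & parents[v]) >= min_common}

def parent_penalty(children, bad, threshold=1):
    """Label a page bad if it links to more than `threshold` bad pages; repeat."""
    bad = set(bad)
    changed = True
    while changed:
        changed = False
        for v, ws in children.items():
            if v not in bad and sum(w in bad for w in ws) > threshold:
                bad.add(v)
                changed = True
    return bad

def drop_bad_links(children, bad):
    """Remove links between detected bad pages before computing PageRank."""
    return {v: [w for w in ws if not (v in bad and w in bad)]
            for v, ws in children.items()}

# Hypothetical farm: f1 and f2 exchange links with x, y and z; p is a normal page
graph = {"f1": ["x", "y", "z"], "f2": ["x", "y", "z"],
         "x": ["f1", "f2"], "y": ["f1", "f2"], "z": ["f1", "f2"],
         "p": ["f1", "ok"], "ok": []}
seeds = detect_seeds(graph)
bad = parent_penalty(graph, seeds)
cleaned = drop_bad_links(graph, bad)
```

In this example f1 and f2 are found as seeds, x, y and z are added by ParentPenalty, and the normal page p keeps its link to f1 because p itself is not labeled bad.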
Revised PageRank
• Assumption
̵ Pages in a spam farm are highly correlated with each other.
• Approach
̵ Increase the probability of jumping from nodes with
large correlation coefficients.
Revised PageRank
• Step 1: Collusion detection
̵ Calculate PageRank values for different ε
̵ Calculate the correlation coefficient between the
curve of node x’s PageRank and 1/ε, denoted by co-co(x).
• Step 2: ε Personalization
̵ Use F(ε_default, co-co(x)) to personalize the original
matrix U.
̵ Recalculate PageRank.
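Step 1 can be sketched as follows, assuming a uniform random jump, a graph where every page has outlinks, and the Pearson correlation as the correlation coefficient. The function names, the ε grid and the example graph are illustrative assumptions; Step 2 is not sketched because the function F is not specified here.

```python
import math

def pagerank(children, eps, iterations=300):
    """PageRank with uniform jump probability eps (every page needs outlinks)."""
    nodes = sorted(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_r = {v: eps / n for v in nodes}
        for w in nodes:
            for v in children[w]:
                new_r[v] += (1 - eps) * r[w] / len(children[w])
        r = new_r
    return r

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def coco(children, eps_values):
    """Correlation between each page's PageRank curve over eps and 1/eps."""
    curves = {v: [] for v in children}
    for eps in eps_values:
        r = pagerank(children, eps)
        for v in children:
            curves[v].append(r[v])
    inv_eps = [1.0 / eps for eps in eps_values]
    return {v: pearson(curve, inv_eps) for v, curve in curves.items()}

# Hypothetical graph: a and b link only to each other (colluding pair)
graph = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
scores = coco(graph, [0.1, 0.3, 0.5, 0.7, 0.9])
```

In this example the colluding page a gains rank as ε shrinks, so its coefficient is positive, while page c loses rank and gets a negative coefficient.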
SpamRank
• Key assumption
̵ Supporters of an honest page should not be overly
dependent on one another, i.e. they should be
spread across sources of different quality.
̵ Due to self-similarity, the honest supporter set
should have a power-law distribution of PageRank.
̵ Spammers have a limited budget, so they do not
replicate the unimportant structures.
Summary
• Current work on anti-spam is very limited.
• Promising research directions
̵ Use more statistics and the properties of the
transition probability matrix to detect spam
̵ Design a new spam-free ranking function
Homework
Technical Report Writing
1. HITS and PageRank are both based on simple linear algebra.
Can you design some other link analysis algorithm based on
advanced linear algebra or matrix factorization?
2. The performance / sensitivity of PageRank with respect to the
smoothing factor ε.
3. How to speed up the calculation of PageRank using matrix
factorization, or some specific characteristics of the Markov
chain?
4. PageRank is the eigenvector of a 2-D matrix; can
LinkFusion then be the eigenvector of a 3-D tensor?
5. Stability analysis for other link analysis algorithms.
6. A survey on the state-of-the-art spam technologies.
7. How to design a search engine that is robust to spam?
8. Other novel research topics related to link analysis.
Requirements
• Send the report to [email protected] before
Dec 4 (within 1 month).
• The length should not be less than 8 pages, with the
template at
http://www.acm.org/sigs/pubs/proceed/template.html
• There must be something new and interesting in your
report, and you’d better use some experiments to
support your idea.
• Never try to copy or steal already-published ideas as
your technical report. We are sure we have read much
more than you can find.
Other Information
• Slides can be found at
http://research.microsoft.com/users/tyliu/