Link Analysis
Mining Massive Datasets
Wu-Jun Li
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Lecture 7: Link Analysis
Link Analysis Algorithms
PageRank
Hubs and Authorities
Topic-Sensitive PageRank
Spam Detection Algorithms
Other interesting topics we won’t cover
Detecting duplicates and mirrors
Mining for communities (community detection)
(Refer to Chapter 10 of the textbook)
Outline
PageRank
Topic-Sensitive PageRank
Hubs and Authorities
Spam Detection
PageRank
Ranking web pages
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
Inlinks as votes
www.stanford.edu has 23,400 inlinks
www.joe-schmoe.com has 1 inlink
Are all inlinks equal?
Recursive question!
Simple recursive formulation
Each link’s vote is proportional to the importance of
its source page
If page P with importance x has n outlinks, each link
gets x/n votes
Page P’s own importance is the sum of the votes on
its inlinks
Simple “flow” model
The web in 1839
[Figure: three pages — Yahoo (y), Amazon (a), M'soft (m). Yahoo links to itself and to Amazon; Amazon links to Yahoo and to M'soft; M'soft links to Amazon. Each page's score flows out in equal shares along its outlinks: y/2, y/2, a/2, a/2, m.]
y = y/2 + a/2
a = y/2 + m
m = a/2
Solving the flow equations
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo scale factor
Additional constraint forces uniqueness
y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for small
examples, but we need a better method for large
graphs
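To spell out the small example: substituting m = a/2 into a = y/2 + m gives a = y/2 + a/2, so a = y; then the constraint y + a + m = 1 becomes a + a + a/2 = 1, i.e., y = a = 2/5 and m = 1/5.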
Matrix formulation
Matrix M has one row and one column for each web page
Suppose page j has n outlinks
If j → i, then Mij = 1/n; else Mij = 0
M is a column stochastic matrix
Columns sum to 1
Suppose r is a vector with one entry per web page
ri is the importance score of page i
Call it the rank vector
|r|1 = 1 (the entries of r sum to 1)
Example
Suppose page j links to 3 pages, including i
[Figure: column j of M has 1/3 in the rows of j's three destinations, including row i, and 0 elsewhere; in r = Mr, ri therefore picks up 1/3 of rj.]
Eigenvector formulation
The flow equations can be written
r = Mr
So the rank vector is an eigenvector of the stochastic
web matrix
In fact, it is the first or principal eigenvector, with corresponding
eigenvalue 1
Example
[Figure: the Yahoo/Amazon/M'soft graph from the flow model.]

r = Mr, with rows and columns ordered y, a, m:

[ y ]   [ 1/2  1/2   0 ] [ y ]
[ a ] = [ 1/2   0    1 ] [ a ]
[ m ]   [  0   1/2   0 ] [ m ]

which is exactly the flow equations
y = y/2 + a/2
a = y/2 + m
m = a/2
Power Iteration method
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r(0) = [1/N, ..., 1/N]^T
Iterate: r(k+1) = M r(k)
Stop when |r(k+1) - r(k)|1 < ε
|x|1 = Σ1≤i≤N |xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
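As an illustration, here is a minimal power-iteration sketch in Python (my own example code, not part of the lecture), run on the 1839 web matrix from the flow-model slide:

    import numpy as np

    def power_iterate(M, eps=1e-8, max_iters=1000):
        """Iterate r <- Mr from the uniform vector until the L1 change < eps."""
        N = M.shape[0]
        r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]^T
        for _ in range(max_iters):
            r_next = M @ r                       # r(k+1) = M r(k)
            if np.abs(r_next - r).sum() < eps:   # L1 norm of the change
                return r_next
            r = r_next
        return r

    # Columns y, a, m of the example web graph; each column sums to 1.
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(power_iterate(M))                      # approx [0.4, 0.4, 0.2]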
Power Iteration Example
[Figure: the Yahoo/Amazon/M'soft graph with the matrix M as above.]

Iterating from r(0) = [1/3, 1/3, 1/3]^T:

y:  1/3   1/3   5/12   3/8    ...   2/5
a:  1/3   1/2   1/3    11/24  ...   2/5
m:  1/3   1/6   1/4    1/6    ...   1/5
Random Walk Interpretation
Imagine a random web surfer
At any time t, surfer is on some page P
At time t+1, the surfer follows an outlink from P uniformly
at random
Ends up on some page Q linked from P
Process repeats indefinitely
Let p(t) be a vector whose ith component is the
probability that the surfer is at page i at time t
p(t) is a probability distribution on pages
The stationary distribution
Where is the surfer at time t+1?
Follows a link uniformly at random
p(t+1) = Mp(t)
Suppose the random walk reaches a state such that
p(t+1) = Mp(t) = p(t)
Then p(t) is called a stationary distribution for the random
walk
Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random surfer
Existence and Uniqueness
A central result from the theory of random walks (aka Markov
processes):
For graphs that satisfy certain conditions, the
stationary distribution is unique and eventually will
be reached no matter what the initial probability
distribution at time t = 0.
Spider traps
A group of pages is a spider trap if there are no links
from within the group to outside the group
Random surfer gets trapped
Spider traps violate the conditions needed for the
random walk theorem
Microsoft becomes a spider trap
[Figure: M'soft's link to Amazon is replaced by a self-loop, so column m of M becomes (0, 0, 1)^T.]

       y    a    m
y  [ 1/2  1/2   0 ]
a  [ 1/2   0    0 ]
m  [  0   1/2   1 ]

Iterating r = Mr from [1, 1, 1]^T:

y:  1    1     3/4   5/8   ...   0
a:  1   1/2    1/2   3/8   ...   0
m:  1   3/2    7/4    2    ...   3
Random teleports
The Google solution for spider traps
At each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1-β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
Random teleports (β = 0.8)

[Figure: every followed link now carries probability 0.8 × 1/2, and every page gets teleport edges of probability 0.2 × 1/3 to all three pages.]

      [ 1/2  1/2   0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15   1/15 ]
0.8 * [ 1/2   0    0 ]  + 0.2 *[ 1/3  1/3  1/3 ]  =  [ 7/15  1/15   1/15 ]
      [  0   1/2   1 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  13/15 ]

(rows and columns ordered y, a, m)
Random teleports (β = 0.8)

Iterating r = Ar with the teleport matrix A from the previous slide, starting from [1, 1, 1]^T:

y:  1   1.00   0.84   0.776  ...   7/11
a:  1   0.60   0.60   0.536  ...   5/11
m:  1   1.40   1.56   1.688  ...   21/11
Matrix formulation
Suppose there are N pages
Consider a page j, with set of outlinks O(j)
We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise
The random teleport is equivalent to:
adding a teleport link from j to every other page with probability (1-β)/N
reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
Equivalent: tax each page a fraction (1-β) of its score and redistribute it evenly
PageRank
Construct the N×N matrix A as follows
Aij = β Mij + (1-β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
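A small sketch of this construction (my example, reusing the spider-trap graph from the earlier slide) confirms that A stays column stochastic and reproduces the 7/15-style matrix computed above:

    import numpy as np

    beta = 0.8
    M = np.array([[0.5, 0.5, 0.0],     # columns: Yahoo, Amazon, M'soft (trap)
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    N = M.shape[0]
    A = beta * M + (1 - beta) / N      # add the uniform teleport term

    assert np.allclose(A.sum(axis=0), 1.0)   # A is column stochastic
    print(A * 15)                            # matches the 7/15 ... 13/15 entries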
Dead ends
Pages with no outlinks are “dead ends” for the
random surfer
Nowhere to go on next step
Microsoft becomes a dead end
[Figure: M'soft now has no outlinks at all, so column m of M is all zeros.]

      [ 1/2  1/2   0 ]         [ 1/3  1/3  1/3 ]     [ 7/15  7/15  1/15 ]
0.8 * [ 1/2   0    0 ]  + 0.2 *[ 1/3  1/3  1/3 ]  =  [ 7/15  1/15  1/15 ]
      [  0   1/2   0 ]         [ 1/3  1/3  1/3 ]     [ 1/15  7/15  1/15 ]

Nonstochastic! (the m column sums to 3/15, not 1)

Iterating from [1, 1, 1]^T:

y:  1    1     0.787   0.648  ...   0
a:  1   0.6    0.547   0.430  ...   0
m:  1   0.6    0.387   0.333  ...   0
Dealing with dead ends
Teleport
Follow random teleport links with probability 1.0 from dead ends
Adjust matrix accordingly (see the sketch after this list)
Prune and propagate
Preprocess the graph to eliminate dead ends
Might require multiple passes
Compute PageRank on reduced graph
Approximate values for dead ends by propagating values
from reduced graph
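A sketch of the first option, under the assumption that M is held as a dense NumPy array (names are mine): columns that sum to zero are dead ends, and we overwrite them with uniform teleport columns.

    import numpy as np

    def patch_dead_ends(M):
        """Replace all-zero (dead-end) columns of M with uniform 1/N columns."""
        M = M.copy()
        N = M.shape[0]
        dead = M.sum(axis=0) == 0      # columns with no outlinks
        M[:, dead] = 1.0 / N           # teleport with probability 1.0 from dead ends
        return M                       # result is column stochastic again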
Computing PageRank
Key step is matrix-vector multiplication
rnew = A rold
Easy if we have enough main memory to hold A, rold, rnew
Say N = 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for the two vectors, approx 8GB
Matrix A has N^2 entries
10^18 is a large number!
Rearranging the equation
r = Ar, where Aij = β Mij + (1-β)/N

ri = Σ1≤j≤N Aij rj
   = Σ1≤j≤N [β Mij + (1-β)/N] rj
   = β Σ1≤j≤N Mij rj + (1-β)/N Σ1≤j≤N rj
   = β Σ1≤j≤N Mij rj + (1-β)/N,   since |r|1 = 1

So r = βMr + [(1-β)/N]N
where [x]N is an N-vector with all entries x
Sparse matrix formulation
We can rearrange the PageRank equation:
r = βMr + [(1-β)/N]N
[(1-β)/N]N is an N-vector with all entries (1-β)/N
M is a sparse matrix!
10 links per node, approx 10N entries
So in each iteration, we need to (see the sketch below):
Compute rnew = βM rold
Add a constant value (1-β)/N to each entry in rnew
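A sketch of one such iteration (my own code; adjacency lists keyed by source, dead ends assumed already handled):

    def sparse_iteration(links, r_old, beta=0.8):
        """One PageRank step: r_new = beta*M*r_old + (1-beta)/N, with M sparse."""
        N = len(r_old)
        r_new = [(1 - beta) / N] * N             # constant teleport term
        for src, dests in links.items():
            share = beta * r_old[src] / len(dests)
            for d in dests:                      # scatter beta*r_old(src)/deg
                r_new[d] += share
        return r_new

    # Example: the 1839 web as adjacency lists (0=Yahoo, 1=Amazon, 2=M'soft)
    links = {0: [0, 1], 1: [0, 2], 2: [1]}
    r = [1/3, 1/3, 1/3]
    r = sparse_iteration(links, r)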
Sparse matrix encoding
Encode sparse matrix using only nonzero entries
Space proportional roughly to number of links
say 10N, or 4*10*1 billion = 40GB
still won’t fit in memory, but will fit on disk
source node | degree | destination nodes
     0      |   3    | 1, 5, 7
     1      |   5    | 17, 64, 113, 117, 245
     2      |   2    | 13, 23
Basic Algorithm
Assume we have enough RAM to fit rnew, plus some working memory
Store rold and matrix M on disk
Basic Algorithm:
Initialize: rold = [1/N]N
Iterate:
Update: perform a sequential scan of M and rold to update rnew
Write out rnew to disk as rold for the next iteration
Every few iterations, compute |rnew - rold|1 and stop if it is below threshold
This requires reading both vectors into memory
Update step
Initialize all entries of rnew to (1-β)/N
For each page p (out-degree n):
Read into memory: p, n, dest1, ..., destn, rold(p)
for j = 1..n:
rnew(destj) += β * rold(p) / n
[Figure: rnew and rold are arrays indexed 0..6; M is scanned on disk as rows (src, degree, destinations) — e.g., (0, 3, {1, 5, 6}), (1, 4, {17, 64, 113, 117}), (2, 2, {13, 23}) — and each row scatters β·rold(src)/degree into the rnew entries of its destinations.]
Analysis
In each iteration, we have to:
Read rold and M
Write rnew back to disk
IO Cost = 2|r| + |M|
What if we had enough memory to fit both rnew and
rold?
What if we could not even fit rnew in memory?
10 billion pages
Strip-based update
Problem: if rnew does not fit in memory, the scattered writes of the update step cause thrashing
Fix: break rnew into strips that fit in memory, and partition M by destination so each strip can be updated in one sequential pass
Block Update algorithm
[Figure: rnew split into two blocks ({0, 1} and {2, 3}); M is partitioned by destination block, so each row of M appears in every block that contains one of its destinations, with its (src, degree) fields repeated; rold (entries 0..3) is re-read for each block.]
Some additional overhead: the (src, degree) data of each row of M is replicated across blocks
But usually worth it
Cost per iteration: |M|(1+ε) + (k+1)|r|, for k blocks
Outline
PageRank
Topic-Sensitive PageRank
Hubs and Authorities
Spam Detection
Topic-Sensitive PageRank
Some problems with PageRank
Measures generic popularity of a page
Biased against topic-specific authorities
Ambiguous queries, e.g., jaguar
Uses a single measure of importance
Other models, e.g., hubs-and-authorities
Susceptible to link spam
Artificial link topologies created in order to boost page rank
Topic-Sensitive PageRank
Instead of generic popularity, can we measure popularity
within a topic?
E.g., computer science, health
Bias the random walk
When the random walker teleports, he picks a page from a set S of
web pages
S contains only pages that are relevant to the topic
E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
For each teleport set S, we get a different rank vector rS
Matrix formulation
Aij = β Mij + (1-β)/|S| if i is in S
Aij = β Mij otherwise
Show that A is stochastic
We have weighted all pages in the teleport set S equally
Could also assign different weights to them (see the sketch below)
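A sketch of the biased-teleport construction (my formulation; v is the teleport distribution concentrated on S):

    import numpy as np

    def tspr_matrix(M, S, beta=0.8):
        """A = beta*M + (1-beta)*v*1^T, teleporting only into pages of S."""
        N = M.shape[0]
        v = np.zeros(N)
        v[list(S)] = 1.0 / len(S)           # uniform over the teleport set
        return beta * M + (1 - beta) * np.outer(v, np.ones(N))

    # Each column still sums to beta + (1-beta) = 1, so A remains stochastic.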
Example
[Figure: a 4-node graph with edges 1→2, 1→3, 2→1, 3→4, 4→3. With S = {1}, every teleport lands on node 1, so node 1's two outlinks each carry probability 0.8 × 1/2 = 0.4 and the other links carry probability 0.8.]

Suppose S = {1}, β = 0.8

Node:         1      2      3      4
Iteration 0:  1.0    0      0      0
Iteration 1:  0.2    0.4    0.4    0
Iteration 2:  0.52   0.08   0.08   0.32
...
stable:       0.294  0.118  0.327  0.261

Note how we initialize the PageRank vector differently from the unbiased PageRank case.
How well does TSPR work?
Experimental results [Haveliwala 2000]
Picked 16 topics
Teleport sets determined using DMOZ
E.g., arts, business, sports,…
“Blind study” using volunteers
35 test queries
Results ranked using PageRank and TSPR of most closely
related topic
E.g., bicycling using Sports ranking
In most cases volunteers preferred TSPR ranking
Which topic ranking to use?
User can pick from a menu
Use Bayesian classification schemes to classify query
into a topic
Can use the context of the query
E.g., query is launched from a web page talking about a
known topic
History of queries e.g., “basketball” followed by “jordan”
User context e.g., user’s My Yahoo settings,
bookmarks, …
Outline
PageRank
Topic-Sensitive PageRank
Hubs and Authorities
Spam Detection
Hubs and Authorities
Suppose we are given a collection of documents on
some broad topic
e.g., stanford, evolution, iraq
perhaps obtained through a text search
Can we organize these documents in some manner?
PageRank offers one solution
HITS (Hypertext-Induced Topic Selection) is another
proposed at approx the same time (1998)
HITS Model
Interesting documents fall into two classes
Authorities are pages containing useful information
course home pages
home pages of auto manufacturers
Hubs are pages that link to authorities
course bulletin
list of US auto manufacturers
Idealized view
[Figure: hub pages on the left, each linking to several authority pages on the right.]
Mutually recursive definition
A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node
Hub score and Authority score
Represented as vectors h and a
Transition Matrix A
HITS uses a matrix A[i, j] = 1 if page i links to page j, 0
if not
AT, the transpose of A, is similar to the PageRank
matrix M, but AT has 1’s where M has fractions
Example
[Figure: Yahoo links to itself, Amazon, and M'soft; Amazon links to Yahoo and M'soft; M'soft links to Amazon.]

        y  a  m
  y  [  1  1  1 ]
A = a[  1  0  1 ]
  m  [  0  1  0 ]
Hub and Authority Equations
The hub score of page P is proportional to the sum of
the authority scores of the pages it links to
h = λAa
Constant λ is a scale factor
The authority score of page P is proportional to the
sum of the hub scores of the pages it is linked from
a = μAT h
Constant μ is a scale factor
Iterative algorithm
Initialize h, a to all 1’s
h = Aa
Scale h so that its max entry is 1.0
a = ATh
Scale a so that its max entry is 1.0
Continue until h, a converge
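A minimal sketch of this loop (my example code), run on the Yahoo/Amazon/M'soft adjacency matrix used on the next slide:

    import numpy as np

    def hits(A, iters=50):
        """Iterate hub/authority scores, scaling each so its max entry is 1.0."""
        n = A.shape[0]
        h, a = np.ones(n), np.ones(n)
        for _ in range(iters):
            h = A @ a                 # hub score: sum of linked authorities
            h /= h.max()
            a = A.T @ h               # authority score: sum of inlinking hubs
            a /= a.max()
        return h, a

    A = np.array([[1, 1, 1],          # rows: Yahoo, Amazon, M'soft
                  [1, 0, 1],
                  [0, 1, 0]])
    h, a = hits(A)                    # h ~ [1, 0.73, 0.27], a ~ [1, 0.73, 1]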
Example
     [ 1 1 1 ]          [ 1 1 0 ]
A =  [ 1 0 1 ]    A^T = [ 1 0 1 ]
     [ 0 1 0 ]          [ 1 1 0 ]

h(yahoo)  =  1    1     1     1    ...   1.000
h(amazon) =  1   2/3   0.71  0.73  ...   0.732
h(m'soft) =  1   1/3   0.29  0.27  ...   0.268

a(yahoo)  =  1    1     1     1    ...   1.000
a(amazon) =  1   4/5   0.75   ...        0.732
a(m'soft) =  1    1     1     1    ...   1.000
Existence and Uniqueness
h = λ A a
a = μ A^T h
h = λμ A A^T h
a = λμ A^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
• h* is the principal eigenvector of the matrix A A^T
• a* is the principal eigenvector of the matrix A^T A
Bipartite cores
[Figure: two bipartite hub-authority cores — the most densely-connected core (primary core) and a less densely-connected core (secondary core).]
Secondary cores
A single topic can have many bipartite cores
corresponding to different meanings, or points of view
abortion: pro-choice, pro-life
evolution: darwinian, intelligent design
jaguar: auto, Mac, NFL team, panthera onca
How to find such secondary cores?
Non-primary eigenvectors
A A^T and A^T A have the same set of eigenvalues
An eigenpair is a pair of eigenvectors, one of A A^T and one of A^T A, that share the same eigenvalue
The primary eigenpair (largest eigenvalue) is what we get from the iterative algorithm
Non-primary eigenpairs correspond to other bipartite cores
The eigenvalue is a measure of the density of links in the core
Finding secondary cores
Once we find the primary core, we can remove its
links from the graph
Repeat HITS algorithm on residual graph to find the
next bipartite core
Technically, not exactly equivalent to non-primary
eigenpair model
Creating the graph for HITS
We need a well-connected graph of pages for HITS
to work well
PageRank and HITS
PageRank and HITS are two solutions to the same
problem
What is the value of an inlink from S to D?
In the PageRank model, the value of the link depends on
the links into S
In the HITS model, it depends on the value of the other
links out of S
The destinies of PageRank and HITS post-1998 were
very different
Why?
Outline
PageRank
Topic-Sensitive PageRank
Hubs and Authorities
Spam Detection
Spam Detection
Web Spam
Search has become the default gateway to the web
Very high premium to appear on the first page of
search results
e.g., e-commerce sites
advertising-driven sites
What is web spam?
Spamming = any deliberate action solely in order to boost a web page's position in search engine results, incommensurate with the page's real value
Spam = web pages that are the result of spamming
This is a very broad definition
SEO industry might disagree!
SEO = search engine optimization
Approximately 10-15% of web pages are spam
Web Spam Taxonomy
We follow the treatment by Gyöngyi and Garcia-Molina [2004]
Boosting techniques
Techniques for achieving high relevance/importance for a
web page
Hiding techniques
Techniques to hide the use of boosting
From humans and web crawlers
Boosting techniques
Term spamming
Manipulating the text of web pages in order to appear
relevant to queries
Link spamming
Creating link structures that boost page rank or hubs and
authorities scores
Term Spamming
Repetition of one or a few specific terms, e.g., free, cheap, viagra
Goal is to subvert TF.IDF ranking schemes
Dumping of a large number of unrelated terms
e.g., copy entire dictionaries
Weaving: copy legitimate pages and insert spam terms at random positions
Phrase stitching: glue together sentences and phrases from different sources
Link spam
Three kinds of web pages from a spammer’s point of
view
Inaccessible pages
Accessible pages
e.g., web log comments pages
spammer can post links to his pages
Own pages
Completely controlled by spammer
May span multiple domain names
Link Farms
Spammer’s goal
Maximize the page rank of target page t
Technique
Get as many links from accessible pages as possible to
target page t
Construct “link farm” to get page rank multiplier effect
[Figure: the spammer's own pages consist of the target page t plus farm pages 1, 2, ..., M; t links to every farm page and each farm page links back to t. Accessible pages (outside the spammer's control) also link to t; inaccessible pages make up the rest of the web.]
One of the most common and effective organizations for a link farm
Analysis
[Figure: the same link-farm structure as above.]

Suppose rank contributed by accessible pages = x
Let page rank of target page = y
Rank of each "farm" page = βy/M + (1-β)/N
y = x + βM[βy/M + (1-β)/N] + (1-β)/N
  = x + β^2 y + β(1-β)M/N + (1-β)/N
The final (1-β)/N term is very small; ignore it
y = x/(1-β^2) + cM/N   where c = β/(1+β)
y = x/(1-β^2) + cM/N   where c = β/(1+β)
For β = 0.85, 1/(1-β^2) = 3.6
Multiplier effect for "acquired" page rank
By making M large, we can make y as large as we want
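A quick numeric check of the multiplier (my arithmetic, with the values above):

    beta = 0.85
    print(1 / (1 - beta**2))    # ~3.60: multiplier on rank x acquired from outside
    print(beta / (1 + beta))    # c ~ 0.46: extra rank cM/N bought per farm page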
Detecting Spam
Term spamming
Analyze text using statistical methods e.g., Naïve Bayes
classifiers
Similar to email spam filtering
Also useful: detecting approximate duplicate pages
Link spamming
Open research area
One approach: TrustRank
TrustRank idea
Basic principle: approximate isolation
It is rare for a “good” page to point to a “bad” (spam) page
Sample a set of “seed pages” from the web
Have an oracle (human) identify the good pages and
the spam pages in the seed set
Expensive task, so must make seed set as small as possible
Trust Propagation
Call the subset of seed pages that are identified as
“good” the “trusted pages”
Set trust of each trusted page to 1
Propagate trust through links
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below the trust
threshold as spam
Rules for trust propagation
Trust attenuation
The degree of trust conferred by a trusted page decreases
with distance
Trust splitting
The larger the number of outlinks from a page, the less
scrutiny the page author gives each outlink
Trust is “split” across outlinks
Simple model
Suppose trust of page p is t(p)
Set of outlinks O(p)
For each q in O(p), p confers the trust β·t(p)/|O(p)|, for some 0 < β < 1
Trust is additive
Trust of p is the sum of the trust conferred on p by all its inlinked pages
Note similarity to Topic-Specific PageRank
Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set (see the sketch below)
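Given that correspondence, a hedged sketch of trust propagation simply reuses power_iterate, tspr_matrix, and the link matrix M from the earlier sketches, with the trusted pages as teleport set (the seed set and threshold here are hypothetical):

    trusted = {0}                              # oracle-approved seed page(s)
    T = tspr_matrix(M, trusted, beta=0.8)      # teleport only into trusted pages
    trust = power_iterate(T)                   # per-page trust scores
    is_spam = trust < 0.05 * trust.max()       # mark pages below a trust threshold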
Picking the seed set
Two conflicting considerations
Human has to inspect each seed page, so the seed set must be as small as possible
Must ensure every "good page" gets adequate trust rank, so we need to make all good pages reachable from the seed set by short paths
Approaches to picking seed set
Suppose we want to pick a seed set of k pages
PageRank
Pick the top k pages by page rank
Assume high page rank pages are close to other highly
ranked pages
We care more about high page rank “good” pages
Inverse page rank
Pick the pages with the maximum number of outlinks
Can make it recursive
Pick pages that link to pages with many outlinks
Formalize as “inverse page rank”
Construct graph G’ by reversing each edge in web graph G
Page rank in G’ is inverse page rank in G
Pick top k pages by inverse page rank
Spam Mass
In the TrustRank model, we start with good pages
and propagate trust
Complementary view: what fraction of a page’s page
rank comes from “spam” pages?
In practice, we don’t know all the spam pages, so we
need to estimate
Spam mass estimation
r(p) = page rank of page p
r+(p) = page rank of p with teleport into “good” pages
only
r-(p) = r(p) – r+(p)
Spam mass of p = r-(p)/r(p)
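As a sketch (again reusing the earlier helpers and matrix M; the good-page set is an assumed input):

    import numpy as np

    all_pages = set(range(M.shape[0]))
    good = {0, 1}                                        # e.g., known trusted pages
    r      = power_iterate(tspr_matrix(M, all_pages))    # ordinary PageRank r(p)
    r_plus = power_iterate(tspr_matrix(M, good))         # r+(p): teleport into good only
    spam_mass = (r - r_plus) / r                         # r-(p)/r(p) for every page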
Good pages
For spam mass, we need a large set of “good” pages
Need not be as careful about quality of individual pages as
with TrustRank
One reasonable approach
.edu sites
.gov sites
.mil sites
Acknowledgement
Slides are from
Prof. Jeffrey D. Ullman
Dr. Anand Rajaraman