The PageRank Citation Ranking: Bringing Order to the Web
The PageRank Citation Ranking:
Bringing Order to the Web
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
January 29th, 1998
Stanford InfoLab
Adaptive methods for the computation
of PageRank
Sepandar Kamvar, Taher Haveliwala, Gene Golub
2004
Presented By:
Wang Hao
March 8th, 2011
Agenda
Technology Overview
Introduction & Motivation
Link Structure of the Web
Simplified PageRank
PageRank Definition
How we can get PageRank
Dangling Links
PageRank Implementation
Adaptive Methods for computation of PageRank
Searching with PageRank
Personalized PageRank
Applications
Conclusion
References
Technology Overview
Recognized the need for a new kind of server setup
Linked PCs to quickly find each query’s answers
This resulted in: Faster Response Time
Greater Scalability
Lower costs
Google uses more than 200 signals (including the PageRank
algorithm) to determine which pages are important
Google then performs hypertext-matching
- Google Corporate Information
Life of a Google Query
- Google Corporate Information
Introduction & Motivation
WWW is very large and heterogeneous
The web pages are extremely diverse in terms of content,
quality and structure
Challenging for information retrieval on WWW
Academic Citations link to
other well known papers
But they are peer reviewed
and have quality control
Web of academic documents
are homogeneous in their
quality, usage, citation &
length
Cont’d
Most web pages link to web
pages as well
Quality measure of a web
page is subjective to the user
though
Importance of a page is a
quantity that is hard to
capture intuitively
Problem:
How can the most relevant pages be ranked at the top?
Answer:
Take advantage of the link structure of the Web to produce ranking
of every web page known as PageRank
Link Structure of the Web
A and B are Backlinks of C
•Every page has some number of
forward links (outedges) and
backlinks (inedges)
•We can never know all the
backlinks of a page, but we know
all of its forward links
•Generally, highly linked pages are
more “important”
PageRank Definition
• PageRank - a method for computing a ranking for every web page based on
the graph of the web
• A page has high rank if the sum of the ranks of its backlinks is high
• Page has many backlinks
• Page has a few highly ranked backlinks
A page is important if important pages refer to it
• PageRank is a link analysis algorithm that assigns a numerical weight that
represents how important a page is on the web
• The web is democratic i.e., pages
vote for pages
Google interprets a link from page A
to page B as a vote, by page A, for page B.
It also analyses the page that cast the vote.
Cont’d
Simple Ranking Function:
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
u: web page
Bu: backlinks of u (the set of pages that link to u)
Nu = |Fu| number of links from u
c: factor used for normalization
Simplified PageRank Calculation
In principle, the PageRanks form a probability distribution over
web pages, so the sum of all web pages’ PageRanks will be one
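The simple ranking function can be tried out on a toy graph. The following is a minimal sketch, not the authors' implementation: the four-page graph, the function name `simplified_pagerank`, and the choice c = 1 are assumptions made for illustration, and the graph is chosen so that every page has at least one forward link.

```python
# Hypothetical 4-page web graph: page -> pages it links to (its forward links F_u).
forward_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def simplified_pagerank(forward_links, iterations=100):
    """Repeatedly apply R(u) = sum over backlinks v of R(v) / N_v (with c = 1)."""
    pages = list(forward_links)
    rank = {u: 1.0 / len(pages) for u in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward_links.items():
            share = rank[v] / len(targets)  # R(v) / N_v, split evenly over v's links
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


ranks = simplified_pagerank(forward_links)
```

Because every page in this toy graph passes on all of its rank, the ranks keep summing to one; page D has no backlinks, so its rank drops to zero.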
Computing PageRank given a Directed Graph
The transition matrix A =
We get the eigenvalue λ = 1
Calculating the eigenvector
Cont’d
On substituting
we get,
so the vector u is of the form
Choose v to be the unique eigenvector with the sum of all entries equal to 1
PageRank vector
How we can get PageRank
It is a Markov chain.
Set the probability distribution at time 0: X0
Set one-step transition probability matrix: A
What we would like to get is the unique stationary
distribution X of the Markov chain, which satisfies A X = X.
We obtain it by successively iterating X_{k+1} = A X_k
until convergence.
This is the principal eigenvector of the matrix A, which
is exactly the PageRank vector.
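The iteration toward the stationary distribution can be sketched in matrix form. This is an illustrative sketch under stated assumptions: the 3-page matrix is made up, column j of A is assumed to hold the probabilities of leaving page j (so the update is x <- A x), and the names `mat_vec` and `stationary_distribution` are hypothetical.

```python
def mat_vec(A, x):
    """Multiply matrix A (a list of rows) by the vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def stationary_distribution(A, x0, tol=1e-12, max_iter=10_000):
    """Iterate x <- A x until the L1 change between iterates drops below tol."""
    x = list(x0)
    for _ in range(max_iter):
        x_next = mat_vec(A, x)
        if sum(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x


# Hypothetical chain: page 0 links to pages 1 and 2, page 1 to page 2, page 2 to page 0.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
x = stationary_distribution(A, [1 / 3, 1 / 3, 1 / 3])
```

For this matrix the result satisfies A x = x up to the tolerance, which is the eigenvector condition for the eigenvalue 1.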
Problem 1: Dangling Links
Dangling links are links that point to a page with
no outgoing links or to a page not downloaded yet.
Problem: it is unclear how their weights should be distributed.
Solution 1: they are removed from the system until
all the PageRanks are calculated. Afterwards, they
are added in without affecting things significantly.
Problem 1: Dangling Links (cont’d)
Solution 2 (presented in the second paper):
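Let v be a vector representing a uniform distribution over all nodes
Let d be the indicator vector of dangling pages (d_i = 1 if page i has no outlinks, 0 otherwise), and let D = d v^T, so that the transition matrix P becomes P' = P + D
In terms of the random walk, the effect of D is to modify the transition probabilities so
that a surfer visiting a dangling page randomly jumps to another page in the next time
step, using the distribution given by v.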
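The effect of D can be illustrated on a small matrix. This sketch assumes a row convention (row i of P holds the probabilities of leaving page i) and takes D to be zero everywhere except in the rows of dangling pages, where it equals v; the function name `patch_dangling_rows` and the example matrix are hypothetical.

```python
def patch_dangling_rows(P, v=None):
    """Return P + D: every all-zero (dangling) row of P is replaced by v."""
    n = len(P)
    if v is None:
        v = [1.0 / n] * n  # uniform distribution over all nodes
    return [list(v) if not any(row) else list(row) for row in P]


# Hypothetical 3-page graph in which page 1 has no outgoing links.
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
P_patched = patch_dangling_rows(P)
```

After patching, every row sums to one, so the surfer always has somewhere to go in the next time step.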
Problem 2: Rank Sink
Problem:
Some pages form a loop with no
outgoing links, which accumulates
rank but never distributes it (rank sink).
Solution:
Random Surfer Model
Jump to a random page based
on some distribution E (rank
source)
Convergence and Random Walks : Why does it work?
Irreducible Aperiodic Markov Chains with a Primitive transition
probability matrix
What are the issues all about?
We need a transition matrix model that can guarantee convergence
and does indeed converge to a unique stationary distribution vector.
PageRank Expression:
Let E(u) be some vector over the Web pages that corresponds to a source of
rank. Then, the PageRank of a set of Web pages is an assignment, R’, to the
Web pages which satisfies
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c E(u)
such that c is maximized and ||R’||1 = 1 (||R’||1 denotes the L1 norm of
R’), where
R'(u): PageRank of document u
c: normalization factor
R'(v): PageRank of document v that links to u
N_v: number of outlinks from document v
E(u): vector of web pages that the surfer randomly jumps to
Computing PageRank
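R_0 \leftarrow S   (S: any vector over the web pages)
Loop:
  R_{i+1} \leftarrow A R_i   (calculate the R_{i+1} vector using R_i)
  d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1   (calculate the normalizing factor)
  R_{i+1} \leftarrow R_{i+1} + dE   (find the vector R_{i+1} using d)
  \delta \leftarrow \|R_{i+1} - R_i\|_1   (find the norm of the difference of the two vectors)
while \delta > \epsilon   (loop until convergence)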
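The loop above can be sketched in a few lines. This is a hedged illustration rather than the original implementation: the 3-page matrix is made up, column j of A is assumed to hold the weights of page j's outgoing links, page 2 is deliberately left without outlinks so that some rank is lost in each step, and the names `l1_norm` and `pagerank` are hypothetical.

```python
def l1_norm(vec):
    """Sum of absolute values of the entries."""
    return sum(abs(x) for x in vec)


def pagerank(A, E, S, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A R + d E, where d is the rank lost in the multiplication."""
    R = list(S)
    for _ in range(max_iter):
        R_next = [sum(a * r for a, r in zip(row, R)) for row in A]  # R_{i+1} <- A R_i
        d = l1_norm(R) - l1_norm(R_next)                 # rank that leaked away
        R_next = [r + d * e for r, e in zip(R_next, E)]  # re-inject it according to E
        delta = l1_norm([a - b for a, b in zip(R_next, R)])
        R = R_next
        if delta <= epsilon:                             # stop once the change is small
            break
    return R


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 0,
# and page 2 has no outlinks (its column is all zeros).
A = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
uniform = [1 / 3, 1 / 3, 1 / 3]
R = pagerank(A, E=uniform, S=uniform)
```

In this sketch the lost rank d is exactly the rank held by the page without outlinks, and adding d·E spreads it back over all pages so the ranks keep summing to one.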
PageRank Implementation
Convert each URL into a unique integer ID
Sort the link structure by ID
Remove the dangling links
Make an initial assignment of ranks
Iteratively compute PageRank until Convergence
Add the dangling links back
Recompute the rankings
NOTE: After adding the dangling links back, we need to iterate as
many times as was required to remove the dangling links
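The preprocessing steps can be sketched as follows. The URLs, the function names `assign_ids` and `remove_dangling`, and the rule that IDs follow sorted URL order are assumptions for illustration; the slides do not say how the IDs are assigned.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID (here: its position in sorted URL order)."""
    urls = sorted({url for link in links for url in link})
    return {url: page_id for page_id, url in enumerate(urls)}


def remove_dangling(edges):
    """Repeatedly drop links whose target has no outgoing links.

    Removing one layer of dangling links can expose new ones, so several
    passes may be needed. Returns the remaining edges and the pass count.
    """
    edges = set(edges)
    passes = 0
    while True:
        sources = {src for src, _ in edges}
        kept = {(src, dst) for src, dst in edges if dst in sources}
        if kept == edges:
            return sorted(kept), passes
        edges, passes = kept, passes + 1


# Hypothetical link structure as (source URL, target URL) pairs.
links = [
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://a.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://d.example/"),
]
ids = assign_ids(links)
edges = sorted((ids[src], ids[dst]) for src, dst in links)  # link structure sorted by ID
core, passes = remove_dangling(edges)
```

Here two passes are needed: removing the link to the last page turns the third page into a dangling page as well. The pass count is the number of iterations the note above refers to when the links are added back.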
The mechanism
•Web Crawler: Finds and retrieves pages on the web
•Repository: web pages are compressed and stored here
•Indexer: each index entry has a list of documents in which the term appears
and the location within the text where it occurs
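An index entry of the kind described for the indexer can be pictured as an inverted index with positions. This is a minimal sketch: the whitespace tokenization, the lowercasing, the function name `build_index`, and the sample documents are all assumptions, not details given in the slides.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {document ID: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


# Hypothetical stored documents keyed by document ID.
documents = {
    1: "Stanford University home page",
    2: "University of Michigan",
}
index = build_index(documents)
university_entry = dict(index["university"])
```

Looking up a term returns both the documents that contain it and where in each document it occurs.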
Convergence
PR (322 Million Links): 52 iterations
PR (161 Million Links): 45 iterations
Scaling factor is roughly linear in log n
Adaptive Methods for the computation of PageRank
This paper presents two contributions:
First, it shows that most pages in the web converge to
their true PageRank quickly, while relatively few pages
take much longer to converge.
And it further shows that those slow-converging pages
generally have high PageRank, and those pages that
converge quickly generally have low PageRank.
Experimental results support the findings:
Experimental results
Adaptive Algorithms
Second, the authors develop two algorithms, called Adaptive
PageRank and Modified Adaptive PageRank, that exploit this
observation to speed up the computation of PageRank by 18%
and 28%, respectively.
The main idea of all the proposed algorithms is the same,
which is to speed up the computation of PageRank by reducing
the cost (not computing the PageRank of converged pages at
each iteration).
Notations to be included:
A: one-step transition probability matrix.
x(k): probability distribution vector at time k.
N = not yet converged; C = converged.
Adaptive PageRank
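The core idea, updating only the pages in N and leaving the pages in C untouched, can be sketched as follows. This is a simplified illustration and not the algorithm as published: the relative-change test with threshold `page_tol`, the convergence check at every iteration, the column convention for A, the example matrix, and the name `adaptive_pagerank` are all assumptions.

```python
def adaptive_pagerank(A, x0, page_tol=1e-3, max_iter=1000):
    """Power iteration that stops recomputing pages once they have converged."""
    n = len(x0)
    x = list(x0)
    not_converged = set(range(n))  # the set N; every other page is in C
    for _ in range(max_iter):
        if not not_converged:
            break
        x_next = list(x)           # pages in C simply keep their current value
        for i in not_converged:
            x_next[i] = sum(A[i][j] * x[j] for j in range(n))
        # Move pages whose relative change fell below page_tol from N to C.
        newly_converged = {
            i for i in not_converged
            if abs(x_next[i] - x[i]) <= page_tol * abs(x[i])
        }
        not_converged -= newly_converged
        x = x_next
    return x


# Hypothetical column-stochastic transition matrix for three pages.
A = [
    [0.0, 1.0, 1 / 3],
    [0.5, 0.0, 1 / 3],
    [0.5, 0.0, 1 / 3],
]
x = adaptive_pagerank(A, [1 / 3, 1 / 3, 1 / 3])
```

The result is an approximation: each page is frozen as soon as it stops changing much, so the saved work is paid for with a small error that depends on `page_tol`.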
Filter-based Adaptive PageRank
Reordering the matrix A at each iteration is expensive.
Reducing the cost by introducing sparse (zero) entries.
Filter-based Modified Adaptive PageRank
Reducing redundant computation by not recomputing
the components of the PageRanks of those pages in N
due to links from those pages in C.
Split A even further.
Performance comparison
Searching with PageRank
• Two search engines:
– Title-based search engine
– Full text search engine
• Title-based search engine
– Searches only the “Titles”
– Finds all the web pages whose titles contain all the query words
– Sorts the results by PageRank
– Very simple and cheap to implement
– Title match ensures high precision, and PageRank ensures high
quality
• Full text search engine
– Called Google
– Examines all the words in every stored document and also
performs PageRank (Rank Merging)
– More precise but more complicated
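The title-based engine is simple enough to sketch directly. The page records, URLs, and PageRank values below are invented for illustration, and the function name `title_search` and the whitespace-based word matching are assumptions.

```python
def title_search(query, pages):
    """Return pages whose title contains every query word, best PageRank first."""
    words = query.lower().split()
    matches = [
        page for page in pages
        if all(word in page["title"].lower().split() for word in words)
    ]
    return sorted(matches, key=lambda page: page["pagerank"], reverse=True)


# Hypothetical pages with made-up PageRank values.
pages = [
    {"url": "http://a.example/", "title": "Example University", "pagerank": 0.02},
    {"url": "http://b.example/", "title": "University of Examples Library", "pagerank": 0.05},
    {"url": "http://c.example/", "title": "Cooking Recipes", "pagerank": 0.09},
]
results = title_search("university", pages)
result_urls = [page["url"] for page in results]
```

The title match acts as the filter and PageRank only decides the order, so the highly ranked but unrelated third page never appears in the results.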
Title-based search for University
Personalized PageRank
Important component of PageRank calculation is E
A vector over the web pages (used as source of rank)
Powerful parameter to adjust the page ranks
E vector corresponds to the distribution of web pages that a
random surfer periodically jumps to
Having an E vector that is uniform over all the web pages
results in some web pages with many related links receiving
an overly high rank e.g.: copyright page or forums
General Search over the internet
Instead, in Personalized PageRank, E consists of a single web
page
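The influence of E can be seen by comparing a uniform E with an E concentrated on one page. The sketch below uses the common damped form R = c A R + (1 - c) E, which is an assumption here, as are the value c = 0.85, the column convention for A, the 3-page matrix, and the name `personalized_pagerank`.

```python
def personalized_pagerank(A, E, c=0.85, iterations=100):
    """Iterate R <- c * A R + (1 - c) * E, starting from R = E."""
    n = len(E)
    R = list(E)
    for _ in range(iterations):
        R = [
            c * sum(A[i][j] * R[j] for j in range(n)) + (1 - c) * E[i]
            for i in range(n)
        ]
    return R


# Hypothetical chain: page 0 links to pages 1 and 2, page 1 to page 2, page 2 to page 0.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
uniform = personalized_pagerank(A, [1 / 3, 1 / 3, 1 / 3])
personal = personalized_pagerank(A, [1.0, 0.0, 0.0])  # surfer always jumps back to page 0
```

With E concentrated on page 0, that page ends up with a higher rank than under the uniform E, which shows how the choice of E shifts rank toward the pages the surfer jumps to.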
Applications
Estimating Web Traffic
On analyzing the statistics, it was found that there are some sites that
have a very high usage, but low PageRank.
e.g.: Links to pirated software
PageRank as Backlink Predictor
The goal is to try to crawl the pages in as close to the optimal order as
possible, i.e., in the order of their rank according to an evaluation function.
PageRank is a better predictor than citation counting
User Navigation: The PageRank Proxy
The user receives some information about the link before they click on it
This proxy can help users decide which links are more likely to be
interesting
Conclusion
PageRank is a global ranking of all web pages based on their
locations in the web graph structure
PageRank uses information which is external to the web pages
– backlinks
Backlinks from important pages are more significant than
backlinks from average pages
The structure of the web graph is very useful for information
retrieval tasks.
References
L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking:
Bringing Order to the Web, 1998
S. Kamvar, T. Haveliwala, G. Golub. Adaptive methods for
the computation of PageRank, Linear Algebra and its Applications 386, 2004,
Elsevier Inc., pp. 51–65
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search
engine, 1998
K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra Behind
Google
Google Corporate Information:
http://www.google.com/corporate/tech.html
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
http://www.googleguide.com/google_works.html
http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
http://pr.efactory.de/
Thank You!
Q&A