The PageRank Citation Ranking: Bringing Order to the Web

Presented by
Aishwarya Rengamannan
1000669605
Instructor: Dr. Gautam Das
Technology Overview
Motivation
• WWW is huge and heterogeneous
• Web pages proliferate free of quality control
• Commercial interest to manipulate ranking
• The ‘quality’ of a web page is subjective to the users.
Problem: Necessity to approximate the overall relative ‘importance’ of web pages.
Solution: Take advantage of the link structure of the web
Link structure of the Web
• Forward links (outedges): the outgoing links from a web page. C is A's and B's forward link.
• Backlinks (inedges): the incoming links to a web page. A and B are backlinks for C.
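As a small illustration of the two link directions, the sketch below builds both maps from a list of (source, target) pairs. The function name `link_maps` and the three-page edge list are made up for this example; only the A, B, C relationship comes from the slide.

```python
from collections import defaultdict

def link_maps(edges):
    """Return (forward, back) dictionaries built from (source, target) pairs."""
    forward = defaultdict(set)  # page -> pages it links to (outedges)
    back = defaultdict(set)     # page -> pages that link to it (inedges)
    for source, target in edges:
        forward[source].add(target)
        back[target].add(source)
    return forward, back

# Hypothetical three-page web: A and B both link to C.
edges = [("A", "C"), ("B", "C")]
forward, back = link_maps(edges)
print(sorted(forward["A"]))  # ['C']
print(sorted(back["C"]))     # ['A', 'B']
```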
Related Work
• Academic paper citations
• Link-based analysis
• Clustering methods that take link structure into account
• Modeling the web as Hubs and Authorities
Ranking Intuition
• The quantity of backlinks to a web page makes it important.
• The quality of the pages that link back increases the ranking.
“A page has high rank if the sum of the ranks of its backlinks is high.”
How about having a backlink from www.yahoo.com?
Naïve PageRank Calculation
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
• u, v --> web pages
• B_u --> set of backlinks of u (pages that point to u)
• N_v --> number of forward links from v
• R --> ranks of the web pages
• c < 1 --> factor used for normalization
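One pass of the naive calculation can be sketched as follows, assuming the web is given as a dictionary from each page to the list of pages it links to. The name `naive_update`, the default `c=1.0`, and the three-page graph are illustrative assumptions, not part of the slides.

```python
def naive_update(ranks, forward, c=1.0):
    """One pass of R(u) = c * sum over v in B_u of R(v) / N_v."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in forward.items():
        if not targets:
            continue  # a page with no forward links passes on nothing
        share = ranks[v] / len(targets)  # R(v) / N_v
        for u in targets:
            new_ranks[u] += c * share
    return new_ranks

# Hypothetical web: A -> C, B -> C, C -> A and C -> B.
forward = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
ranks = {"A": 0.25, "B": 0.25, "C": 0.5}
updated = naive_update(ranks, forward)
print(updated)  # {'A': 0.25, 'B': 0.25, 'C': 0.5}
```

The ranks chosen here are already a fixed point of the update, which is the kind of assignment the iteration is looking for.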
Matrix Representation
‘A’ is a square adjacency matrix with
• Rows and columns corresponding to web pages (u & v)
• A_{u,v} = 1/N_u if there is an edge from u to v
• A_{u,v} = 0 if there is no edge.
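A minimal sketch of building this matrix as a list of rows, assuming each page's forward links are listed without duplicates. The helper name `adjacency_matrix` and the example graph are hypothetical.

```python
def adjacency_matrix(pages, forward):
    """Build A with A[u][v] = 1/N_u for an edge u -> v, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0.0] * n for _ in range(n)]
    for u, targets in forward.items():
        for v in targets:
            # N_u is the number of forward links leaving u.
            matrix[index[u]][index[v]] = 1.0 / len(targets)
    return matrix

# Hypothetical web: A -> C, B -> C, C -> A and C -> B.
pages = ["A", "B", "C"]
forward = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
matrix = adjacency_matrix(pages, forward)
for row in matrix:
    print(row)
# [0.0, 0.0, 1.0]
# [0.0, 0.0, 1.0]
# [0.5, 0.5, 0.0]
```

Every row that belongs to a page with at least one forward link sums to 1.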
Matrices Revisited
Eigenvalues and eigenvectors:
• Matrix A (n×n)
• λ is an eigenvalue of A if there exists a non-zero vector v such that Av = λv
• The vector v is called an eigenvector of A corresponding to λ.
• We can rewrite Av = λv as (A − λI)v = 0, where I is the n×n identity matrix.
Matrices Revisited(Contd…)
How to solve for Eigen value and Eigen Vector?
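One standard approach is to solve the characteristic equation det(A − λI) = 0 for λ, then solve (A − λI)v = 0 for each λ found. The worked example below uses an arbitrary 2×2 matrix chosen for illustration; it is not taken from the slides.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2-\lambda)^2 - 1 = (\lambda - 1)(\lambda - 3) = 0
\;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 1 .
\]
\[
(A - 3I)v = 0 \;\Rightarrow\; v_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
(A - I)v = 0 \;\Rightarrow\; v_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix} .
\]
```

Here λ₁ = 3 is the dominant eigenvalue and v₁ is its eigenvector.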
Sample Calculation
[Figure: example graph with four pages, numbered 1 to 4]
Matrix Representation (contd…)
• A --> square matrix over web pages
• R --> vector over web pages
• To find: the eigenvector corresponding to the dominant (maximum) eigenvalue.
– It can be computed by iterating repeatedly until R converges to the dominant eigenvector.
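Matrix notation gives
R = cA^T R (written R = cAR in the paper; the transpose follows from A_{u,v} = 1/N_u for an edge from u to v)
R : eigenvector of A^T
1/c : the corresponding eigenvalue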
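The repeated iteration can be sketched as plain power iteration. This is a minimal version under stated assumptions: the matrix uses the convention A[u][v] = 1/N_u for an edge u -> v, so the rank page u receives is entry u of A^T R; every page has at least one forward link; and the tolerance, iteration cap, and 3-page graph are made up for illustration.

```python
def power_iteration(matrix, tol=1e-10, max_iter=1000):
    """Iterate R <- A^T R, renormalized to sum 1, until R stops changing."""
    n = len(matrix)
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        # Entry u of A^T R collects R[v] * A[v][u] over all pages v.
        new = [sum(matrix[v][u] * ranks[v] for v in range(n)) for u in range(n)]
        total = sum(new)
        new = [x / total for x in new]  # normalize so the ranks sum to 1
        if sum(abs(a - b) for a, b in zip(new, ranks)) < tol:
            return new
        ranks = new
    return ranks

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
ranks = power_iteration(A)
print([round(r, 3) for r in ranks])  # [0.4, 0.2, 0.4]
```

Page 1 ends up with the lowest rank because its only backlink is half of page 0's vote.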
Problem with Naïve PageRank
Rank sink:
• Consider two web pages that point to each other but to no other page, and a third page that points to one of them.
• During iteration, this loop accumulates rank but never distributes it (since it has no outedges).
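The effect can be reproduced on a toy graph. In this sketch, pages a and b form the loop, and pages c and d are a hypothetical rest of the web that keeps feeding it; the graph and the 50-iteration count are assumptions chosen for illustration.

```python
def naive_update(ranks, forward):
    """Naive rank update: each page splits its rank over its forward links."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in forward.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks

# a <-> b is the sink; c links into it (and to d), d links back to c.
forward = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
ranks = {page: 0.25 for page in forward}
loop_share = []
for _ in range(50):
    ranks = naive_update(ranks, forward)
    loop_share.append(ranks["a"] + ranks["b"])

print(loop_share[:3])  # [0.625, 0.75, 0.8125]
```

The share of rank held by the loop only grows, and after enough iterations c and d are left with almost nothing.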
Solution – Extended version of PageRank
Introducing a rank source:
E(u): a vector over the web pages that corresponds to a source of rank.
Random Surfer Model
• Random surfer: clicks on successive links at random.
• The factor ‘E’ can be viewed as modeling this behavior.
• The surfer periodically gets bored and jumps to a random page chosen according to E.
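The surfer can be simulated directly. The sketch below rests on several assumptions that are not in the slides: E is uniform over all pages, the surfer gets bored with probability `jump_prob` (0.15 here, an illustrative value) at each step, and a surfer on a page with no forward links always jumps.

```python
import random

def random_surfer(forward, steps, jump_prob=0.15, seed=0):
    """Estimate visit frequencies of a surfer who sometimes jumps at random."""
    rng = random.Random(seed)
    pages = sorted(forward)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        targets = forward[current]
        if not targets or rng.random() < jump_prob:
            current = rng.choice(pages)    # bored (or stuck): jump via uniform E
        else:
            current = rng.choice(targets)  # follow a random forward link
    return {page: count / steps for page, count in visits.items()}

# Hypothetical web containing a rank sink (a <-> b).
forward = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
freq = random_surfer(forward, steps=20000)
print(sorted(freq, key=freq.get, reverse=True))
```

Because of the random jumps, pages c and d keep a nonzero share of visits even though the loop a <-> b has no outedges.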
PageRank Computation
- Initialize a rank vector over web pages
Loop:
- New ranks: sum of normalized backlink ranks
- Compute normalizing factor
- Add escape term
- Compute control parameter (change since the previous iteration)
While not converged (stop when the control parameter is small enough)
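The loop above can be sketched in a few lines. This is one possible reading, not a definitive implementation: the damping factor `c=0.85`, the uniform rank source E, the threshold `eps`, and the iteration cap are all assumptions, and every linked page is assumed to appear as a key in `forward`.

```python
def pagerank(forward, source=None, c=0.85, eps=1e-8, max_iter=200):
    """Iterative PageRank with a rank source E (uniform unless given)."""
    pages = sorted(forward)
    n = len(pages)
    e = source or {p: 1.0 / n for p in pages}  # E, entries sum to 1
    ranks = {p: 1.0 / n for p in pages}        # initialize vector over web pages
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for v, targets in forward.items():
            for u in targets:
                new[u] += c * ranks[v] / len(targets)  # normalized backlink ranks
        d = sum(ranks.values()) - sum(new.values())    # normalizing factor: rank lost
        for p in pages:
            new[p] += d * e[p]                         # add escape term
        delta = sum(abs(new[p] - ranks[p]) for p in pages)  # control parameter
        ranks = new
        if delta <= eps:
            break                                      # stop when converged
    return ranks

# Hypothetical web containing a rank sink (a <-> b).
forward = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
ranks = pagerank(forward)
print({page: round(rank, 3) for page, rank in ranks.items()})
```

With the escape term in place, the sink a <-> b still ranks highest, but c and d keep a stable nonzero rank and the ranks continue to sum to 1.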
Another Problem?
Dangling links:
– Links that point to a page with no outgoing links
– It is not clear where their weight should be distributed
Solution: Remove them from the system until all other PageRanks have been calculated!
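A sketch of the removal step, assuming every linked page appears as a key in the graph. Removing one dangling page can leave another page without forward links, so this version repeats until none remain; that stopping rule and the name `remove_dangling` are choices made for this example.

```python
def remove_dangling(forward):
    """Repeatedly drop pages with no forward links, and the links to them."""
    graph = {page: set(targets) for page, targets in forward.items()}
    removed = []
    while True:
        dangling = [page for page, targets in graph.items() if not targets]
        if not dangling:
            return graph, removed
        for page in dangling:
            del graph[page]
            removed.append(page)
        for targets in graph.values():
            targets.difference_update(dangling)  # drop links to removed pages

# Hypothetical web: d has no forward links, and c links only to d.
forward = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
graph, removed = remove_dangling(forward)
print(graph)    # {'a': {'b'}, 'b': {'a'}}
print(removed)  # ['d', 'c']
```

The `removed` list records the order of removal, which is useful when the pages are added back after the main computation.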
Implementation
• The web crawler keeps a database of URLs so that it can discover all URLs on the web
• To implement PageRank, the web crawler builds an index of links as it crawls
Problems???
• Infinitely large sites
• Incorrect/broken HTML
• Sites are down
• Web is always changing
PageRank Implementation
• Convert each URL into a unique integer ID
• Sort the link structure by the IDs
• Remove dangling links
• Make an initial assignment of ranks and iterate until convergence
• Add the dangling links back
• Iterate the process again to assign weights to all dangling links
• The link database, A, is normally kept in RAM
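The first two steps can be sketched as follows. The function name `index_links`, the first-seen ID assignment, and the example.com-style URLs are assumptions for illustration; the slides do not say how IDs are assigned.

```python
def index_links(url_edges):
    """Assign each URL an integer ID and return the links sorted by ID."""
    ids = {}

    def url_id(url):
        # IDs are handed out in order of first appearance.
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    links = sorted((url_id(src), url_id(dst)) for src, dst in url_edges)
    return ids, links

# Hypothetical crawl output as (source URL, target URL) pairs.
url_edges = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
]
ids, links = index_links(url_edges)
print(ids["http://b.example/"])  # 0
print(links)                     # [(0, 1), (0, 2), (1, 2)]
```

Sorting by source ID groups each page's forward links together, so the link structure can be read in a single linear pass.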
Convergence Properties
• Interpret the web as an expander-like graph.
– A graph is an expander if every (not too large) subset of nodes S has a neighborhood that is larger than some factor α times |S|
• Verification: a graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
Applications of Page Rank
• Search, browsing and traffic estimation.
• Help the user decide if a site is trustworthy.
• Estimate web traffic.
• Spam detection and prevention.
• Predict citation counts