How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011

How PageRank Works
Ketan Mayer-Patel
University of North Carolina
January 31, 2011
Me vs. Jeff
Me:
• High school
– Public school in Texas
• College
– The University of California, Berkeley
• Faculty member at...
– UNC
Jeff:
• High school
– Hoity-toity, private all-boys school in Jersey
• College
– Stanford
• Faculty member at...
– Duke
The World Wide Web
• A Simple Request/Response System
Request for web page.
Web page returned.
Making The Request
• How do you make a web request?
– Use a browser.
• Specify what you want directly.
• Follow a link.
– Turns out we very rarely specify documents directly.
– Uniform Resource Locator (URL)
• http://server-name.com/path/to/a/page
– Two key characteristics of hyperlinks:
• Directional
• Unilateral
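A minimal sketch of what "directional" and "unilateral" mean in practice, assuming links are stored as a mapping from each page to the pages it links to. The page names are hypothetical.

```python
# Hypothetical pages; each key lists the pages it links TO.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
}

def links_to(source, target):
    """True if `source` contains a hyperlink pointing at `target`."""
    return target in links.get(source, [])

# Directional: a.html -> b.html does not imply b.html -> a.html.
print(links_to("a.html", "b.html"), links_to("b.html", "a.html"))

# Unilateral: a.html gains a link without touching the entry for c.html.
links["a.html"].append("d.html")
print(links["c.html"])
```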
Web Search In Three Easy Steps
• What’s step one?
– Cut a hole in the box.
Web Search In Three Easy Steps
• First, crawl.
– Try to find all of the web pages.
• Follow the links.
• Second, index.
– Organize what you find.
• Lots of secret sauce here.
• Third, query.
– Usually, text query words.
– Retrieves a list of related pages.
• Usually because they contain the query text.
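The three steps can be sketched on a toy scale. This is only an illustration, assuming a hypothetical in-memory "web" (the `WEB` dictionary and its page names are made up) instead of real HTTP requests, and a plain word-to-pages index with none of the "secret sauce".

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "home": ("welcome to the home page", ["news", "about"]),
    "news": ("page rank news today", ["home"]),
    "about": ("about this page", []),
    "orphan": ("nobody links to this page", ["home"]),
}

def crawl(start):
    """Step 1: follow links breadth-first from a starting page."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for target in WEB[url][1]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def build_index(urls):
    """Step 2: map each word to the set of pages containing it."""
    index = {}
    for url in urls:
        for word in WEB[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, word):
    """Step 3: return the pages that contain the query word."""
    return sorted(index.get(word, set()))

index = build_index(crawl("home"))
print(query(index, "page"))    # ['about', 'home', 'news']
print(query(index, "nobody"))  # [] because no crawled page links to "orphan"
```

Note that the crawl never reaches "orphan": a page with no incoming links cannot be found by following links alone.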
Which to list first?
• Possible clues:
– Number of times the query term appears
– Where it appears
• Title, body text, URL, metadata, etc.
– How it appears
• Style of text
• Role of text
– Position in the document graph
• This is what distinguished Google from other search
engines at the time.
PageRank
• Supposedly named after Larry
Page
• Part of his research in grad
school
– Patented while in grad school.
– Licensed to Google for about 1.8 million
shares of Google.
• Sold for about $336M
Document Graph
Probability Distribution of a Random Walk
• Start walking the graph.
• After some reasonably long amount of time,
stop.
• What’s the chance that you are on a particular
page?
– Larger chance => more important page
– Is this actually true?
• Maybe, maybe not
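The idea can be tried out directly by simulation. This is a rough sketch, assuming a small hypothetical three-page graph in which every page has at least one outgoing link. It runs many walks and records where each one stops.

```python
import random

# Hypothetical graph: page -> pages it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def random_walk(graph, steps, rng):
    """Start on a random page, follow `steps` random links, return the final page."""
    page = rng.choice(sorted(graph))
    for _ in range(steps):
        page = rng.choice(graph[page])
    return page

def estimate_distribution(graph, walks=2000, steps=50, seed=0):
    """Fraction of walks that end on each page."""
    rng = random.Random(seed)
    counts = {page: 0 for page in graph}
    for _ in range(walks):
        counts[random_walk(graph, steps, rng)] += 1
    return {page: n / walks for page, n in counts.items()}

freq = estimate_distribution(GRAPH)
for page in sorted(freq):
    print(page, freq[page])
```

In this graph B is reachable only through half of A's links, so walks should end there less often than on A or C.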
Random Walk Example
Trapdoors and Dead Ends
Hotel California:
Can’t ever leave.
Shangri-La:
Can’t ever get here.
Spider Traps
Fixing Our Random Walk
• What can we do to fix it?
– Add a bit more randomness.
• At each step, with probability α jump to any random
page.
• Otherwise, randomly follow a link.
– Provides a way in to / out of trapdoors / dead
ends and spider traps.
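A sketch of the modified walk, under two assumptions that are not in the slides: the graph below is hypothetical, and a page with no outgoing links always forces a random jump (the slides do not say what happens there).

```python
import random

# Hypothetical graph: "dead" has no outgoing links, "hidden" has no incoming links.
GRAPH = {"A": ["B", "dead"], "B": ["A"], "dead": [], "hidden": ["A"]}

def teleporting_walk(graph, steps, alpha, rng):
    """Count visits per page during one long walk with random jumps."""
    pages = sorted(graph)
    page = pages[0]
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        # Assumption: a page with no links always forces a jump.
        if rng.random() < alpha or not graph[page]:
            page = rng.choice(pages)        # jump to any random page
        else:
            page = rng.choice(graph[page])  # otherwise follow a random link
        visits[page] += 1
    return visits

visits = teleporting_walk(GRAPH, steps=10000, alpha=0.15, rng=random.Random(0))
for page in sorted(visits):
    print(page, visits[page])
```

The walk no longer gets stuck on "dead", and "hidden" is visited even though nothing links to it.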
Random Walk Scalability
• Problem: Would need to simulate the random
walk over and over again to even come close to
discovering the underlying probability
distribution.
– Easy to do for small graphs.
– Pain in the ass for large ones.
• Markov Chain
– Tool for analyzing stochastic processes.
– Power method
Power Method Equation
• N : Number of documents
• R_k : Page rank of document k
• L_k : Number of outgoing links in k
• δ(k,j) : Delta function for links between k and j
– δ(k,j) = 1 if and only if there exists a link from document k to document j

$$R_j = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k}{L_k}$$
Power Method Equation
• Our definition is circular.
– To calculate page rank of a page we need to already
know the page rank of other pages.
• Iterative solution.
– Start with an initial assignment.
• Basically set the page rank of every page to 1/N.
• Why 1/N?
– Calculate an updated value for every page using the
current values.
– Keep repeating until the values are stable.
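The iterative solution can be sketched as follows. This is a minimal version, assuming a hypothetical three-page graph with no dead ends and no duplicate links. The tolerance and iteration cap are arbitrary choices, not values from the talk.

```python
# Hypothetical graph: page -> pages it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def power_method(graph, tol=1e-10, max_iter=200):
    n = len(graph)
    # Initial assignment: every page starts with rank 1/N.
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(max_iter):
        new = {page: 0.0 for page in graph}
        # Each page k hands R_k / L_k to every page j it links to.
        for k, targets in graph.items():
            for j in targets:
                new[j] += ranks[k] / len(targets)
        # Stop once no value moved by more than the tolerance.
        if max(abs(new[p] - ranks[p]) for p in graph) < tol:
            return new
        ranks = new
    return ranks

ranks = power_method(GRAPH)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

For this particular graph the values settle near 0.4 for A and C and 0.2 for B.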
Power Method Equation
• Intuition:
– Page rank of a document is the sum of its fair
share of the page ranks of the pages that link to
the document.
$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
Example
i = 0
Page ranks shown in the figure: 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
Example
i = 1
Page ranks shown in the figure: 0.025, 0.075, 0.125, 0.05, 0.1, 0.1, 0.1, 0.2, 0, 0.125

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
Example
Something is wrong!
i = 10
Page ranks shown in the figure: 0.015, 0.051, 0.189, 0.036, 0.134, 0.072, 0.154, 0.071, 0, 0.015

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
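One quick way to see that something is off: page ranks are meant to be a probability distribution, so they should add up to 1 at every iteration. The sketch below simply sums the values from the three example slides. The variable names are mine.

```python
# Rank values copied from the three example slides above.
ranks_by_iteration = {
    0: [0.1] * 10,
    1: [0.025, 0.075, 0.125, 0.05, 0.1, 0.1, 0.1, 0.2, 0, 0.125],
    10: [0.015, 0.051, 0.189, 0.036, 0.134, 0.072, 0.154, 0.071, 0, 0.015],
}

# Total rank at each iteration, rounded to the precision of the slides.
totals = {i: round(sum(r), 3) for i, r in ranks_by_iteration.items()}
print(totals)  # {0: 1.0, 1: 0.9, 10: 0.737}
```

The total shrinks from 1.0 to 0.9 after one step and to 0.737 after ten, so rank is disappearing from the system.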
Power Method v2
• Dead ends leak.
• Spider traps slowly collect everything.
• Translating our random walk solution:
– Add a “virtual” link from every document to every other document.
– Define a weighting factor α between 0.0 and 1.0
• Distribute α proportion of your page rank over the virtual links
• Distribute (1- α) proportion of your page rank over the real links
$$R_j^{i+1} = \sum_{k=1}^{N} \frac{\alpha R_k^i}{N} + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$
Power Method v2
• Simplifying the first term:

$$R_j^{i+1} = \frac{\alpha}{N} + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$
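A sketch of the v2 update, using the form in which the virtual-link term is α/N. The graph is hypothetical and includes a spider trap (T links only to itself). The sketch assumes every page has at least one outgoing link, and it uses a fixed number of iterations for simplicity.

```python
# Hypothetical graph with a spider trap: T links only to itself.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A", "T"], "T": ["T"]}

def pagerank_v2(graph, alpha=0.15, iterations=50):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Virtual links: every page receives the same alpha / N share.
        new = {page: alpha / n for page in graph}
        # Real links: each page k splits (1 - alpha) of its rank over its L_k links.
        for k, targets in graph.items():
            for j in targets:
                new[j] += (1 - alpha) * ranks[k] / len(targets)
        ranks = new
    return ranks

ranks = pagerank_v2(GRAPH)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Because every page always receives at least α/N, the trap can no longer absorb all of the rank, and the total stays at 1.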
Convergence
• Typical value for α is 0.15.
• Convergence typically occurs in about 50
iterations even for large graphs.
Example
i = 10
Page ranks shown in the figure: 0.024, 0.074, 0.115, 0.061, 0.112, 0.073, 0.107, 0.011, 0.105, 0.034

$$R_j^{i+1} = \frac{\alpha}{N} + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$
Example
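i = 10
Page ranks without virtual links (v1): 0.015, 0.051, 0.189, 0.036, 0.134, 0.072, 0.154, 0.071, 0, 0.015
Page ranks with virtual links (v2): 0.024, 0.074, 0.115, 0.061, 0.112, 0.073, 0.107, 0.011, 0.105, 0.034

$$R_j^{i+1} = \frac{\alpha}{N} + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$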
Billions and billions
• How do you do this with
billions of documents?
– Can be implemented using
matrix math.
– Special techniques for sparse
matrices.
– The PageRank vector is roughly the
principal eigenvector of the link matrix.
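To illustrate the matrix view: one iteration of the basic power method is a matrix-vector product R' = M R, where M[j][k] = δ(k,j)/L_k. The sketch below, on a hypothetical three-page graph with pages numbered 0 to 2, compares a dense matrix with a sparse form that stores only the actual links. It is plain Python for illustration, not how a production system would do it.

```python
# Hypothetical graph: page index -> indices of pages it links to.
GRAPH = {0: [1, 2], 1: [2], 2: [0]}

def dense_matrix(graph):
    """Full N x N matrix with M[j][k] = 1 / L_k when k links to j, else 0."""
    n = len(graph)
    m = [[0.0] * n for _ in range(n)]
    for k, targets in graph.items():
        for j in targets:
            m[j][k] = 1.0 / len(targets)
    return m

def dense_step(m, ranks):
    """One iteration as a full matrix-vector product: N * N multiplications."""
    n = len(ranks)
    return [sum(m[j][k] * ranks[k] for k in range(n)) for j in range(n)]

def sparse_step(graph, ranks):
    """Same product, touching only the nonzero entries (one per link)."""
    new = [0.0] * len(ranks)
    for k, targets in graph.items():
        for j in targets:
            new[j] += ranks[k] / len(targets)
    return new

start = [1.0 / 3] * 3
print(dense_step(dense_matrix(GRAPH), start))
print(sparse_step(GRAPH, start))
```

Once the values stop changing, R = M R, which is exactly the statement that R is an eigenvector of M with eigenvalue 1.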
Gaming The System
• Google Bomb!
– Create a lot of links to the page that you want to
be highly ranked.
• Create your own spider trap.
– Relatively easy to combat by discounting links that come from
the same domain.
• Comment spam.
• Porn trap.
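The same-domain discount mentioned above could look roughly like this. It is a sketch only: the URLs are hypothetical, "discounting" is simplified to dropping the link entirely, and the hostname is used as a stand-in for the domain.

```python
from urllib.parse import urlsplit

def cross_domain_links(links):
    """Keep only (source, target) pairs whose hostnames differ."""
    return [
        (src, dst)
        for src, dst in links
        if urlsplit(src).hostname != urlsplit(dst).hostname
    ]

# Hypothetical link list: the first link stays inside one site.
links = [
    ("http://spam.example/a", "http://spam.example/b"),
    ("http://spam.example/a", "http://news.example/story"),
]
print(cross_domain_links(links))
```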
Last Notes
• Stanford Sucks!
• GO HEELS!
Bad Math
• When originally presented, the final version of
the power method equation was shown as:
$$R_j^{i+1} = \alpha + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$
• The simplification for the first term is wrong
and should have been:

$$R_j^{i+1} = \frac{\alpha}{N} + \sum_{k=1}^{N} (1-\alpha)\,\delta(k,j)\,\frac{R_k^i}{L_k}$$
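The correction can be checked by working the first term through, assuming (as the 1/N starting values imply) that the page ranks at iteration i sum to 1:

```latex
\sum_{k=1}^{N} \frac{\alpha R_k^i}{N}
  = \frac{\alpha}{N} \sum_{k=1}^{N} R_k^i
  = \frac{\alpha}{N} \cdot 1
  = \frac{\alpha}{N}
```

A bare α would follow only if the ranks summed to N rather than 1.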