The Multi-Disciplinary Nature of Technology


Google’s Billion Dollar
Eigenvector
Gerald Kruse, PhD.
Associate Professor of Mathematics and Computer Science
Juniata College
Huntingdon, PA
[email protected]
http://faculty.juniata.edu/kruse
Math, math, everywhere…
Who is getting close to the $1 M?
Here’s an interesting billboard…
What happened for those who
found the answer?
The answer is 7427466391
Those who typed in the URL,
http://7427466391.com , ended up getting
another puzzle. Solving that led them to a
page with a job application for…
Google!
First Question
(1) Just what does it take to solve that
problem?
Calculations (most probably on a computer),
knowledge of number theory, a general
aptitude and interest in problem solving.
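The slide shows only the answer, not the puzzle. The billboard is widely reported to have read "{first 10-digit prime found in consecutive digits of e}.com", so that wording is an assumption here. Under that assumption, a minimal Python sketch of the calculation might look like the following (the function names are illustrative, not from the talk):

```python
from decimal import Decimal, localcontext


def is_prime(n: int) -> bool:
    """Trial division; adequate for candidates below 10**10."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_prime_in_e(width: int = 10, digits: int = 150) -> int:
    """Return the first `width`-digit prime in consecutive digits of e."""
    with localcontext() as ctx:
        ctx.prec = digits                      # significant digits of e to compute
        e_digits = str(Decimal(1).exp()).replace(".", "")
    usable = e_digits[: digits - 5]            # drop the tail, which may be rounded
    for start in range(len(usable) - width + 1):
        window = usable[start:start + width]
        # A window with a leading zero is not a true `width`-digit number.
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    raise ValueError("no prime found; increase `digits`")


print(first_prime_in_e())
```

Trial division up to the square root is enough because every candidate is below 10^10, so at most about 50,000 odd divisors are tried per window.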
Second Question
(2) Why does Google want to hire people
who know how to find that? What does it have
to do with a search engine?
Hmmm… Google gives great search results.
Maybe their ranking algorithm is
mathematically based?
“Google-ing” Google
Results in an early paper by Page, Brin, et al.,
written while they were in graduate school
Search Engines
We’ve all used them, but what is
“under the hood?”
Crawl the web and locate all public pages
Index the “crawled” data so it can be searched
Rank the pages for more effective searching (the
focus of this talk)
Each word that is searched on is linked with a list of
pages (just URLs) that contain it. The pages with the
highest rank are returned first.
Note:
Google ONLY uses the link
structure of the World Wide
Web to determine a page’s
rank, NOT its content.
PageRank is NOT a simple citation
index
Which is the more popular page below, A or B?
What if the links to A were from unpopular pages, and the
one link to B was from www.yahoo.com ?
[Figure: page A with several incoming links, and page B with a single incoming link.]
NOTE:
(1) Rankings based on citation index would be very easy to manipulate
(2) While PageRank is an important part of Google’s search results, it
is not the sole means used to rank pages.
Intuitively PageRank is analogous
to popularity
The web as a graph: each page is a vertex, each
hyperlink a directed edge.
A page is popular if a few very popular pages point (via
hyperlinks) to it.
A page could be popular if many not-necessarily
popular pages point (via hyperlinks) to it.
Which of these three would have the highest PageRank?
[Figure: a three-page web. Page A has $N_A = 2$ outgoing links, Page B has $N_B = 1$, and Page C has $N_C = 1$.]
So what is the mathematical
definition of PageRank?
In particular, a page’s rank is equal to the sum
of the ranks of all the pages pointing to it.
$$\mathrm{Rank}(u) = \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v}$$
$B_u$ = set of pages with links to $u$
$N_v$ = number of links from $v$
Note the scaling of each page rank.
Writing out the equation for each
web-page in our example gives:
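$$
\begin{aligned}
\mathrm{Rank}(A) &= 0 + 0 + \frac{\mathrm{Rank}(C)}{1} \\
\mathrm{Rank}(B) &= \frac{\mathrm{Rank}(A)}{2} + 0 + 0 \\
\mathrm{Rank}(C) &= \frac{\mathrm{Rank}(A)}{2} + \frac{\mathrm{Rank}(B)}{1} + 0
\end{aligned}
$$
[Figure: the three-page example web, with $N_A = 2$, $N_B = 1$, $N_C = 1$.]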
Even though this is a circular definition we can
calculate the ranks.
Re-write the system of equations as a matrix-vector product.
$$
\begin{pmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{pmatrix}
=
\begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{pmatrix}
$$
The PageRank vector is simply an eigenvector of the
coefficient matrix, $\lambda x = Ax$ with $\lambda = 1$.
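As a rough illustration, the coefficient matrix can be assembled directly from the link structure. The sketch below assumes the links implied by the equations for this example (A links to B and C, B links to C, C links to A); `link_matrix` is a hypothetical helper name, and NumPy is used only for convenience.

```python
import numpy as np

# Outgoing links of the three-page example, as implied by the equations.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def link_matrix(links):
    """Return (pages, A) where A[i, j] = 1/N_j if page j links to page i."""
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    A = np.zeros((len(pages), len(pages)))
    for source, targets in links.items():
        for target in targets:
            # Page `source` splits its rank evenly over its outgoing links.
            A[index[target], index[source]] = 1.0 / len(targets)
    return pages, A


pages, A = link_matrix(links)
print(pages)
print(A)
```

Each column holds the outgoing links of one page, which is why every column sums to 1 when no page lacks outgoing links.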
Wait… what’s an eigenvector?
A Graphical Interpretation of a
2-Dimensional Eigenvector
http://cnx.org/content/m10736/latest/
If we have some 2-D vector x, and some
2 x 2 matrix A, generally their product,
A*x = b, will result in a new vector, b,
which is pointing in a different direction and
having a different length than x.
But, if the vector (v in the image at the left) is
an eigenvector of A, then A*v will give a
vector which is same direction as v, but just
scaled a different length, by λ.
Note that λ is called an eigenvalue of A.
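A small numerical check can make this concrete. The 2 x 2 matrix below is an arbitrary illustration, not taken from the slides; it is chosen so that (1, 1) is an eigenvector with eigenvalue 3.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

x = np.array([1.0, 0.0])   # a generic vector, not an eigenvector of A
v = np.array([1.0, 1.0])   # an eigenvector of A, with eigenvalue 3

print(A @ x)  # [2. 1.]: points in a different direction than x
print(A @ v)  # [3. 3.]: same direction as v, scaled by 3
```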
[Figure: the three-page example web annotated with the resulting ranks: PageRank(A) = 0.4, PageRank(B) = 0.2, PageRank(C) = 0.4.]
Note: we choose the eigenvector with $\|x\|_1 = 1$.
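These values can be reproduced with a general-purpose eigenvalue routine. This is a sketch for the small example only, using NumPy's `numpy.linalg.eig`; it is not how a web-scale computation would be done.

```python
import numpy as np

# Coefficient matrix of the three-page example (rows and columns: A, B, C).
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmin(np.abs(eigenvalues - 1.0))   # pick the eigenvalue closest to 1
x = np.real(eigenvectors[:, k])
x = x / x.sum()                            # rescale so the entries sum to 1

ranks = dict(zip("ABC", np.round(x, 3)))
print(ranks)
```

Dividing by the sum fixes both the scale and the sign, since an eigenvector is only determined up to a nonzero multiple.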
Implementation Details
Billions of web-pages would make a huge matrix
The matrix (in theory) is column-stochastic, which allows
for iterative calculation
Previous PageRank is used as an initial guess
Random-Surfer term handles computational difficulties
associated with a “disconnected graph”
Attempts to Manipulate Search Results
Via a “Google Bomb”
French Military Victories
Juniata’s own “Google Bomb”
At Juniata, CS 315 is my “Analysis and
Algorithms” course
Liberals vs. Conservatives!
As of November, 2007, Google no longer returns this!
“Ego Surfing”
Be very careful…
More than one Gerald Kruse…
Miscellaneous points
Try a search in Google on “PigeonRank.”
What types of sites would Google NOT give good
results on?
PageRank is not the only means Google uses to
order search results.
Bibliography
[1] S. Brin, L. Page, et al., The PageRank Citation Ranking: Bringing
Order to the Web, http://dbpubs.stanford.edu/pub/1999-66 , Stanford
Digital Libraries Project (January 29, 1998).
[2] K. Bryan and T. Leise, The $25,000,000,000 Eigenvector: The
Linear Algebra behind Google, SIAM Review, 48 (2006), pp. 569-581.
[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole,
Boston, MA, 2005.
[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole,
Boston, MA, 2005.
Any Questions?
Slides available at
http://faculty.juniata.edu/kruse
The following slides give
some of the more in-depth
mathematics behind Google
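Note that the coefficient matrix is
column-stochastic*
$$
A = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{pmatrix},
\qquad 0 \le a_{ij} \le 1
$$
$$
\sum_{i=1}^{n} a_{ij} = a_{1j} + a_{2j} + \cdots + a_{nj} = 1
$$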
Every column-stochastic matrix has 1 as an eigenvalue.
* As long as there are no “dangling nodes” and the graph is
connected.
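One standard way to see why 1 is always an eigenvalue, sketched here as a textbook argument rather than something taken from the slides: write $\mathbf{1}$ for the column vector of all ones. The column sums of $A$ are the entries of $A^{T}\mathbf{1}$, so $\mathbf{1}$ is an eigenvector of $A^{T}$ with eigenvalue 1, and $A$ and $A^{T}$ share the same characteristic polynomial.

```latex
\[
(A^{T}\mathbf{1})_j = \sum_{i=1}^{n} a_{ij} = 1
\quad\Longrightarrow\quad
A^{T}\mathbf{1} = \mathbf{1},
\qquad
\det(A - \lambda I) = \det\bigl((A - \lambda I)^{T}\bigr) = \det(A^{T} - \lambda I).
\]
```

Hence $\lambda = 1$ is a root of the characteristic polynomial of $A$ as well.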
Dangling Nodes have no outgoing links
[Figure: a three-page web in which Page C has no outgoing links.]
In this example, Page C is a
dangling node. Note that its
associated column in the
coefficient matrix is all 0.
Matrices like these are called
column-substochastic.
$$
\begin{pmatrix} 0 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 1/2 & 1/2 & 0 \end{pmatrix}
$$
Page, Brin, et al. [1] suggest that dangling nodes most
likely come from pages which haven’t been crawled yet,
and so they “simply remove them from the system until all the
PageRanks are calculated.”
It is interesting to note that a column-substochastic matrix does have a
positive eigenvalue $\lambda \le 1$ and a corresponding eigenvector with
non-negative entries, which is called the Perron eigenvector, as
detailed in Bryan and Leise [2].
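For the small dangling-node example, the Perron eigenvalue and eigenvector can be checked numerically. The sketch below assumes the 3 x 3 column-substochastic matrix from that example (third column all zero) and uses NumPy's `numpy.linalg.eig`; the variable names are illustrative.

```python
import numpy as np

# Column-substochastic matrix of the dangling-node example (Page C has no links).
A = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmax(eigenvalues.real)                  # largest real eigenvalue
perron_value = eigenvalues[k].real
perron_vector = eigenvectors[:, k].real
perron_vector = perron_vector / perron_vector.sum()   # entries sum to 1

print(perron_value)
print(perron_vector)
```

Here the largest eigenvalue is 0.5 rather than 1, which is consistent with the claim that a column-substochastic matrix has a positive eigenvalue no larger than 1.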
A disconnected graph could lead to
non-unique rankings
[Figure: a five-page web, Pages A through E, made up of two disconnected subwebs.]
Notice the block
diagonal structure
of the coefficient
matrix.
Note: Re-ordering
via permutation
doesn’t change the
ranking, as in [2].
$$
A = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1/2 \\
0 & 0 & 1 & 0 & 1/2 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$
In this example, the eigenspace associated with
eigenvalue $\lambda = 1$ is two-dimensional. Which
eigenvector should be used for ranking?
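The two-dimensional eigenspace can be confirmed numerically. This sketch assumes the block-diagonal coefficient matrix of the five-page example, with the first two pages linking to each other, the next two linking to each other, and the fifth linking to the third and fourth; that reading of the flattened matrix is an assumption.

```python
import numpy as np

A = np.array([[0, 1, 0, 0, 0.0],
              [1, 0, 0, 0, 0.0],
              [0, 0, 0, 1, 0.5],
              [0, 0, 1, 0, 0.5],
              [0, 0, 0, 0, 0.0]])

n = A.shape[0]
# Dimension of the eigenspace for eigenvalue 1 is n - rank(A - I).
dimension = n - np.linalg.matrix_rank(A - np.eye(n))
print(dimension)

# Two independent eigenvectors, each ranking only one of the subwebs.
x = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 0.5, 0.5, 0.0])
print(np.allclose(A @ x, x), np.allclose(A @ y, y))
```

Any weighted average of `x` and `y` is also a normalized eigenvector, so the relative ranking of pages in different subwebs is not determined.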
Add a “random-surfer” term to the
simple PageRank formula.
Let S be an n x n matrix with all entries 1/n. S is column-stochastic, and we consider the matrix M, which is a
weighted average of A and S.
$$M = (1 - m)A + mS$$
This models the behavior of a real web-surfer, who might jump to another
page by directly typing in a URL or by choosing a bookmark, rather than
clicking on a hyperlink. Originally, m = 0.15 in Google, according to [2].
$x = Mx$ can also be written as: $x = (1 - m)Ax + ms$
Important Note: We will use this formulation with A when computing $x$.
Here $s$ is a column vector with all entries 1/n, and $Sx = s$ if $\sum_i x_i = 1$.
M for our previous disconnected
graph, with m=0.15
[Figure: the five-page disconnected web from the previous example.]
$$
M = \begin{pmatrix}
0.03 & 0.88 & 0.03 & 0.03 & 0.03 \\
0.88 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.88 & 0.455 \\
0.03 & 0.03 & 0.88 & 0.03 & 0.455 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03
\end{pmatrix}
$$
The eigenspace associated with $\lambda = 1$ is one-dimensional, and the normalized eigenvector is
( 0.2 , 0.2 , 0.285, 0.285, 0.03 )
So the addition of the random surfer term permits
comparison between pages in different subwebs.
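The matrix M and its eigenvector for this example can be reproduced in a few lines. The sketch assumes the five-page coefficient matrix A of the disconnected example (as read from the flattened slide) and m = 0.15; it uses a dense eigenvalue routine, which is only practical for toy sizes.

```python
import numpy as np

m = 0.15
A = np.array([[0, 1, 0, 0, 0.0],
              [1, 0, 0, 0, 0.0],
              [0, 0, 0, 1, 0.5],
              [0, 0, 1, 0, 0.5],
              [0, 0, 0, 0, 0.0]])

n = A.shape[0]
S = np.full((n, n), 1.0 / n)       # every entry is 1/n
M = (1 - m) * A + m * S            # weighted average of A and S

eigenvalues, eigenvectors = np.linalg.eig(M)
k = np.argmin(np.abs(eigenvalues - 1.0))
x = eigenvectors[:, k].real
x = x / x.sum()                    # normalize so the entries sum to 1

# The eigenspace for eigenvalue 1 now has dimension n - rank(M - I).
dimension = n - np.linalg.matrix_rank(M - np.eye(n))

print(np.round(M, 3))
print(np.round(x, 3), dimension)
```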
Iterative Calculation
By many estimates, the web currently contains at least 8 billion
pages. How does Google compute an eigenvector for
something this large?
One possibility is the power method.
In [2], it is shown that every positive (all entries are > 0)
column-stochastic matrix M has a unique vector q with positive
components such that Mq = q, with $\|q\|_1 = 1$, and it can be
computed as
$$q = \lim_{k \to \infty} M^k x_0$$
for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.
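The limit can be watched numerically on the five-page example. This sketch assumes the matrix M built from that example with m = 0.15 and a uniform initial guess; it forms the powers with NumPy's `numpy.linalg.matrix_power`, which is fine for a 5 x 5 matrix but not for anything web-sized.

```python
import numpy as np

m = 0.15
A = np.array([[0, 1, 0, 0, 0.0],
              [1, 0, 0, 0, 0.0],
              [0, 0, 0, 1, 0.5],
              [0, 0, 1, 0, 0.5],
              [0, 0, 0, 0, 0.0]])
n = A.shape[0]
M = (1 - m) * A + m * np.full((n, n), 1.0 / n)

x0 = np.full(n, 1.0 / n)           # positive entries that sum to 1

for k in (1, 10, 100, 200):
    xk = np.linalg.matrix_power(M, k) @ x0
    print(k, np.round(xk, 4))

q = np.linalg.matrix_power(M, 200) @ x0
```

In this example the iterates settle toward ( 0.2 , 0.2 , 0.285, 0.285, 0.03 ), with the error shrinking by roughly a factor of 1 - m = 0.85 per power.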
Iterative Calculation continued
Rather than calculating the powers of M directly, we could use
the iteration $x_k = M x_{k-1}$.
Since M is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation.
As we mentioned previously, Google uses the equivalent
expression in the computation:
$$x_k = (1 - m) A x_{k-1} + ms$$
These products can be calculated without explicitly creating
the huge coefficient matrix, since A contains mostly 0’s.
The iteration is guaranteed to converge, and it will converge
quicker with a better first guess, so the previous PageRank
vector is used as the initial vector.
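A minimal sketch of this iteration, working directly from lists of outgoing links so that neither A nor M is ever stored as a full matrix. The function name `pagerank`, its parameters, and the link lists for the five-page example are illustrative assumptions, and the sketch assumes every page has at least one outgoing link (no dangling nodes).

```python
def pagerank(links, m=0.15, x0=None, tol=1e-12, max_iter=1000):
    """Iterate x_k = (1 - m) A x_{k-1} + m s using only the link lists."""
    pages = sorted(links)
    n = len(pages)
    # Start from a uniform guess unless a previous rank vector is supplied.
    x = dict.fromkeys(pages, 1.0 / n) if x0 is None else dict(x0)
    for iteration in range(1, max_iter + 1):
        # The term m*s adds m/n to every page.
        x_next = dict.fromkeys(pages, m / n)
        # The term (1 - m) A x: page v passes x[v] / N_v along each of its links.
        for v, targets in links.items():
            share = (1 - m) * x[v] / len(targets)
            for u in targets:
                x_next[u] += share
        change = sum(abs(x_next[p] - x[p]) for p in pages)
        x = x_next
        if change < tol:
            return x, iteration
    raise RuntimeError("iteration did not converge")


# Assumed link structure of the five-page disconnected example.
links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}

ranks, cold_iterations = pagerank(links)
_, warm_iterations = pagerank(links, x0=ranks)   # restart from the previous ranks
print(ranks)
print(cold_iterations, warm_iterations)
```

Each pass costs time proportional to the number of links rather than n^2, and restarting from the previously computed ranks takes far fewer iterations than starting from the uniform guess.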
This gives a regular matrix
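In matrix notation we have
$$R = AR + E$$
Since $\|R\|_1 = 1$, we can rewrite this as
$$R = (A + (E \times \mathbf{1}))R, \quad \text{note: } E = (E \times \mathbf{1})R$$
where $\mathbf{1}$ is the row vector of all ones, so $E \times \mathbf{1}$ is an n x n matrix.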
The new coefficient matrix is regular, so we can calculate
the eigenvector iteratively.
This iterative process is a series of matrix-vector products,
beginning with an initial vector (typically the previous
PageRank vector). These products can be calculated
without explicitly creating the huge coefficient matrix.