A Look at Markov Chains and Their Use in Google
Matrices, Digraphs, Markov Chains & Their Use by Google
Leslie Hogben
Iowa State University and
American Institute of Mathematics
Bay Area Mathematical Adventures
February 27, 2008
With material from Becky Atherton
Outline
• Matrices
• Markov Chains
• Digraphs
• Google’s PageRank
Introduction to Matrices
• A matrix is a rectangular array of numbers
• Matrices are used to solve systems of equations
• Matrices are easy for computers to work with
Matrix arithmetic

Matrix Addition
    [1 2]   [ 3 -1]   [4 1]
    [3 4] + [-2  0] = [1 4]

Matrix Multiplication
    [1 2] [3 1]   [(1)(3)+(2)(2)  (1)(1)+(2)(0)]   [ 7 1]
    [3 4] [2 0] = [(3)(3)+(4)(2)  (3)(1)+(4)(0)] = [17 3]
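The two computations above can be checked in a few lines of plain Python. A minimal sketch (the entries of the addition example are partly reconstructed from the slide, so treat them as illustrative):

```python
def mat_add(A, B):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    """Row-by-column product: entry (i, j) is row i of A dotted with column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]

print(mat_add(A, [[3, -1], [-2, 0]]))   # [[4, 1], [1, 4]]
print(mat_mul(A, [[3, 1], [2, 0]]))     # [[7, 1], [17, 3]]
```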
Introduction to Markov Chains
At each time period, every object in the system is in exactly one state, one of 1, …, n.
Objects move according to the transition probabilities: the probability of going from state j to state i is t_ij.
Transition probabilities do not change over time.
The transition matrix of a Markov chain T = [t_ij] is an n × n matrix.
Each entry t_ij is the probability of moving from state j to state i.
0 ≤ t_ij ≤ 1
The sum of the entries in each column must equal 1 (such a matrix is called stochastic).
Example: Customers can choose from three major
grocery stores: H-Mart, Freddy’s and Shopper’s
Market.
Each year H-Mart retains 80% of its
customers, while losing 15% to Freddy’s and
5% to Shopper’s Market.
Freddy’s retains 65% of its customers, loses
20% to H-Mart and 15% to Shopper’s Market.
Shopper’s Market keeps 70% of its
customers, loses 20% to H-Mart and 10% to
Freddy’s.
Example: The transition matrix.

        [.80 .20 .20]
    T = [.15 .65 .10]
        [.05 .15 .70]
Look at the calculation used to determine the probability of starting at H-Mart and shopping there two years later:
    P(H→H)P(H→H) + P(H→F)P(F→H) + P(H→S)P(S→H)
    = (.80)(.80) + (.15)(.20) + (.05)(.20) = .68
We can obtain the same result by multiplying row one by column one in the transition matrix.
• This matrix tells us the probabilities of going from one store to another after 2 years:

          [.68 .32 .32]
    T^2 = [.22 .47 .16]
          [.10 .21 .52]
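Squaring T in code reproduces both the .68 entry computed by hand and the rest of the two-year matrix; a small sketch in plain Python:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# grocery-store transition matrix: columns are H-Mart, Freddy's, Shopper's
T = [[.80, .20, .20],
     [.15, .65, .10],
     [.05, .15, .70]]

T2 = mat_mul(T, T)          # two-step transition probabilities

# entry (1,1): probability of H-Mart -> H-Mart over two years
print(round(T2[0][0], 2))   # 0.68
```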
Compute the probability of shopping at each store 2 years after shopping at Shopper's Market:

    [.68 .32 .32] [0]   [.32]
    [.22 .47 .16] [0] = [.16]
    [.10 .21 .52] [1]   [.52]
If the initial distribution was evenly distributed between H-Mart, Freddy's, and Shopper's Market, compute the distribution after two years:

    [.68 .32 .32] [1/3]   [.440]
    [.22 .47 .16] [1/3] ≈ [.283]
    [.10 .21 .52] [1/3]   [.277]
To utilize a Markov chain to compute probabilities, we need to know the initial probability vector q(0).
If there are n states, let the initial probability vector be
    q(0) = [q1, …, qn]^T
where
– qi is the probability of being in state i initially
– all entries satisfy 0 ≤ qi ≤ 1
– the column sum is 1
Example:
What happens after 10 years?

           [.50 .50 .50]
    T^10 ≈ [.28 .28 .28]
           [.22 .22 .22]

so for any initial probability vector,

         [q1]   [.50]
    T^10 [q2] ≈ [.28]
         [q3]   [.22]
Let q(k) be the probability distribution after k
steps.
We are iterating q(k+1) = T q(k)
Eventually, for a large enough k,
q(k+1) = q(k) = s,
resulting in s = T s.
s is called a steady state vector.
s = q(k) is an eigenvector of T for the eigenvalue 1.
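The iteration q(k+1) = T q(k) can be run directly in code; a sketch using the grocery-store matrix, where 50 iterations are more than enough for convergence:

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

T = [[.80, .20, .20],
     [.15, .65, .10],
     [.05, .15, .70]]

q = [1/3, 1/3, 1/3]        # any initial probability vector gives the same limit
for _ in range(50):
    q = mat_vec(T, q)      # q(k+1) = T q(k)

s = q
print([round(x, 2) for x in s])              # [0.5, 0.28, 0.22]

# s is (approximately) fixed by T, i.e. s = T s
print([round(x, 2) for x in mat_vec(T, s)])  # [0.5, 0.28, 0.22]
```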
In the grocery example, there was a unique steady state vector s, and T^k q(0) → s. This does not need to be the case:

    T = [0 1]      T^(2k) = I = [1 0]      T^(2k+1) = T
        [1 0]                   [0 1]

    T [a]   [b]
      [b] = [a]

so the iterates oscillate and never settle down.
How can we guarantee convergence to a unique steady state vector regardless of initial conditions?
One way is by having a regular transition matrix
A nonnegative matrix is regular if some power of
the matrix has only nonzero entries.
    B = [.15 1]       B^2 = [.8725 .15]
        [.85 0]             [.1275 .85]

B has zero entries, but every entry of B^2 is positive, so B is regular.
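The regularity test is easy to automate: keep multiplying until some power has only positive entries. A sketch with a fixed cutoff of powers (the cutoff is my choice, not from the slides):

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def is_regular(M, max_power=20):
    """True if some power M^k (k <= max_power) has only positive entries."""
    P = M
    for _ in range(max_power):
        if all(x > 0 for row in P for x in row):
            return True
        P = mat_mul(P, M)
    return False

B = [[.15, 1], [.85, 0]]
flip = [[0, 1], [1, 0]]

print(is_regular(B))      # True: B^2 already has all positive entries
print(is_regular(flip))   # False: powers alternate between flip and I
```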
Digraphs
A directed graph (digraph) is a set of vertices
(nodes) and a set of directed edges (arcs) between
vertices
The arcs indicate relationships between nodes
Digraphs can be used as models, e.g.
cities and airline routes between them
web pages and links
How Matrices, Markov Chains
and Digraphs are used by
Google
How does Google work?
Robot web crawlers find web pages
Pages are indexed & cataloged
Pages are assigned PageRank values
PageRank is a program that prioritizes pages
Developed by Larry Page & Sergey Brin in 1998
When pages are identified in response to a
query, they are ranked by PageRank value
Why is PageRank important?
Only a few years ago users waited much longer for search
engines to return results to their queries.
When a search engine finally responded, the returned list
had many links to information that was irrelevant, and
useless links invariably appeared at or near the top of the
list, while useful links were deeply buried.
The Web's information is not structured like information in organized databases and document collections; it is self-organized.
The enormous size of the Web, currently containing
~10^9 pages, completely overwhelmed traditional
information retrieval (IR) techniques.
By 1997 it was clear that IR technology of the
past wasn't well suited for Web search
Researchers set out to devise new approaches.
Two big ideas emerged, each capitalizing on the
link structure of the Web to differentiate between
relevant information and fluff.
One approach, HITS (Hypertext Induced Topic
Search), was introduced by Jon Kleinberg
The other, which changed everything, is Google's
PageRank that was developed by Sergey Brin and
Larry Page
How are PageRank values assigned?
The number of links to and from a page gives information about the importance of the page.
The more inlinks, the more important the page.
Inlinks from “good” pages carry more weight than inlinks from “weaker” pages.
If a page points to several pages, its weight is distributed proportionally.
Imagine the World Wide Web as a directed graph (digraph):
Each page is a vertex
Each link is an arc
[Figure: a sample 6-page web (a 6-vertex digraph), pages numbered 1–6]
PageRank defines the rank of page i recursively by

    r_i = Σ_{j ∈ I_i} r_j / |O_j|

where
r_j is the rank of page j
I_i is the set of pages that point into page i
O_j is the set of pages that page j has outlinks to
For example, the rank of page 2 in our sample web (pages 1, 3, and 5 link to page 2):

    r_2 = r_1/3 + r_3/4 + r_5/3
Since this is a recursive definition, PageRank assigns an initial ranking equally to all pages:

    r_i(0) = 1/n

• then iterates

    r_i(k+1) = Σ_{j ∈ I_i} r_j(k) / |O_j|
Process can be written using matrix notation.
Let q(k) be the PageRank vector at the kth
iteration
Let T be the transition matrix for the web
Then q(k+1)= T q(k)
T is the matrix such that tij is the probability
of moving from page j to page i in one time
step
Based on the assumption that all outlinks are
equally likely to be selected.
    t_ij = 1/|O_j|   if there is a link from j to i
    t_ij = 0         otherwise
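Given a link list, T can be built directly from this rule. A sketch in Python; the link structure below is my reconstruction of the 6-page sample web (inferred from its transition matrix), so treat the specific links as an assumption:

```python
# outlinks[j] lists the pages that page j points to (pages numbered 1..6)
outlinks = {
    1: [2, 3, 4],
    2: [1, 3],
    3: [1, 2, 4, 5],
    4: [1, 5, 6],
    5: [2, 4, 6],
    6: [],            # a dangling node: page 6 has no outlinks
}

n = len(outlinks)
T = [[0.0] * n for _ in range(n)]
for j, targets in outlinks.items():
    for i in targets:
        T[i - 1][j - 1] = 1 / len(targets)   # t_ij = 1/|O_j|

print(T[0][1])   # 0.5 : page 2 splits its weight over its 2 outlinks
```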
Using our 6-node sample web, the transition matrix is:

        [  0  1/2 1/4 1/3  0   0 ]
        [ 1/3  0  1/4  0  1/3  0 ]
    T = [ 1/3 1/2  0   0   0   0 ]
        [ 1/3  0  1/4  0  1/3  0 ]
        [  0   0  1/4 1/3  0   0 ]
        [  0   0   0  1/3 1/3  0 ]
To eliminate dangling nodes and obtain a stochastic matrix, replace each column of zeros with a column of 1/n's, where n is the number of web pages. Page 6 has no outlinks, so its column of zeros becomes a column of 1/6's:

        [  0  1/2 1/4 1/3  0  1/6 ]
        [ 1/3  0  1/4  0  1/3 1/6 ]
    T = [ 1/3 1/2  0   0   0  1/6 ]
        [ 1/3  0  1/4  0  1/3 1/6 ]
        [  0   0  1/4 1/3  0  1/6 ]
        [  0   0   0  1/3 1/3 1/6 ]
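The dangling-node fix is a one-pass scan over columns. A sketch using the sample web's matrix (my reconstruction, so treat the exact entries as an assumption):

```python
n = 6
# transition matrix of the 6-page sample web; column 6 is all zeros
T = [[0,   1/2, 1/4, 1/3, 0,   0],
     [1/3, 0,   1/4, 0,   1/3, 0],
     [1/3, 1/2, 0,   0,   0,   0],
     [1/3, 0,   1/4, 0,   1/3, 0],
     [0,   0,   1/4, 1/3, 0,   0],
     [0,   0,   0,   1/3, 1/3, 0]]

for j in range(n):
    if all(T[i][j] == 0 for i in range(n)):   # dangling column found
        for i in range(n):
            T[i][j] = 1 / n                   # replace with 1/n's

# the matrix is now column-stochastic: every column sums to 1
print([round(sum(T[i][j] for i in range(n)), 6) for j in range(n)])
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```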
The Web's nature is such that T would still not be regular.
Brin & Page force the transition matrix to be
regular by making sure every entry satisfies
0 < tij < 1
Create a perturbation matrix E having all entries equal to 1/n.
Form the “Google matrix”:

    G = α T + (1 − α) E,  for some 0 < α < 1

Using α = 0.85 for our 6-node sample web, G = 0.85 T + 0.15 E:

        [ 1/40   9/20  19/80  37/120 1/40   1/6 ]
        [ 37/120 1/40  19/80  1/40   37/120 1/6 ]
    G = [ 37/120 9/20  1/40   1/40   1/40   1/6 ]
        [ 37/120 1/40  19/80  1/40   37/120 1/6 ]
        [ 1/40   1/40  19/80  37/120 1/40   1/6 ]
        [ 1/40   1/40  1/40   37/120 37/120 1/6 ]
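Forming the Google matrix is a single entrywise formula. A sketch that assumes the dangling-fixed matrix reconstructed above, with spot-checks against two of the fractions:

```python
alpha, n = 0.85, 6
# dangling-fixed transition matrix of the 6-page sample web (reconstruction)
T = [[0,   1/2, 1/4, 1/3, 0,   1/6],
     [1/3, 0,   1/4, 0,   1/3, 1/6],
     [1/3, 1/2, 0,   0,   0,   1/6],
     [1/3, 0,   1/4, 0,   1/3, 1/6],
     [0,   0,   1/4, 1/3, 0,   1/6],
     [0,   0,   0,   1/3, 1/3, 1/6]]

# G = alpha*T + (1 - alpha)*E, where E has every entry equal to 1/n
G = [[alpha * T[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

print(round(G[0][0], 4))   # 0.025 = 1/40  (a zero entry of T)
print(round(G[0][1], 4))   # 0.45  = 9/20  (0.85*(1/2) + 0.15/6)
```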
By calculating powers of the Google matrix, we can determine the stationary vector: by the 25th power, all six columns are (approximately) equal to

    [.2066, .1770, .1773, .1770, .1314, .1309]^T
Stationary vector for our 6-node sample web:

    s = [.2066, .1770, .1773, .1770, .1314, .1309]^T
How does Google use this stationary vector?
A query requests term 1 or term 2.
Inverted file storage is accessed:
    Term 1 → doc 3, doc 2, doc 6
    Term 2 → doc 1, doc 3
The relevancy set is {1, 2, 3, 6}.
From the stationary vector, s1 = .2066, s2 = .1770, s3 = .1773, s6 = .1309.
Doc 1 is deemed most important and is listed first.
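The whole ranking step can be sketched end to end: power-iterate the Google matrix to get s, then sort the relevancy set by score. The matrix entries are my reconstruction of the sample web, so the exact numbers are an assumption:

```python
alpha, n = 0.85, 6
# dangling-fixed transition matrix of the 6-page sample web (reconstruction)
T = [[0,   1/2, 1/4, 1/3, 0,   1/6],
     [1/3, 0,   1/4, 0,   1/3, 1/6],
     [1/3, 1/2, 0,   0,   0,   1/6],
     [1/3, 0,   1/4, 0,   1/3, 1/6],
     [0,   0,   1/4, 1/3, 0,   1/6],
     [0,   0,   0,   1/3, 1/3, 1/6]]
G = [[alpha * T[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

s = [1 / n] * n                      # start from the uniform vector
for _ in range(100):                 # power iteration: s <- G s
    s = [sum(G[i][j] * s[j] for j in range(n)) for i in range(n)]

relevancy = [1, 2, 3, 6]             # pages matching the query terms
ranked = sorted(relevancy, key=lambda p: s[p - 1], reverse=True)
print(ranked)                        # [1, 3, 2, 6] -- doc 1 ranks first
```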
Adding a perturbation matrix seems reasonable, based on the “random jump” idea: the user types a URL directly instead of following a link.
This is only the basic idea behind Google, which has many refinements we have ignored.
PageRank as originally conceived and described here ignores the “Back” button.
PageRank is still undergoing development.
The details of PageRank's operation and the value of α are a trade secret.
Updates to the Google matrix are done periodically.
The Google matrix is HUGE, so sophisticated numerical methods are used.
Thank you!