Markov Chains - Tutorial #5


Markov Chains as a Learning Tool
Markov Process
Simple Example
Weather:
• raining today
40% rain tomorrow
60% no rain tomorrow
• not raining today
20% rain tomorrow
80% no rain tomorrow
Stochastic Finite State Machine:

[Diagram: two states, rain and no rain; rain stays rain with probability 0.4
and moves to no rain with 0.6; no rain stays no rain with 0.8 and moves to
rain with 0.2]
Markov Process
Simple Example
Weather:
• raining today
40% rain tomorrow
60% no rain tomorrow
• not raining today
20% rain tomorrow
80% no rain tomorrow
The transition matrix (rows: today, columns: tomorrow; order: Rain, No rain):

    P = [ 0.4  0.6 ]
        [ 0.2  0.8 ]

• Stochastic matrix: each row sums to 1
• Doubly stochastic matrix: rows and columns each sum to 1
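The two definitions above are easy to check in code. A minimal sketch (not from the slides), using the weather matrix:

```python
# The weather transition matrix from the slide.
weather_P = [
    [0.4, 0.6],  # raining today     -> [rain, no rain] tomorrow
    [0.2, 0.8],  # not raining today -> [rain, no rain] tomorrow
]

def is_stochastic(P):
    """Every row of a stochastic matrix sums to 1."""
    return all(abs(sum(row) - 1.0) < 1e-9 for row in P)

def is_doubly_stochastic(P):
    """Rows *and* columns each sum to 1."""
    cols = list(zip(*P))  # transpose to get the columns
    return is_stochastic(P) and is_stochastic(cols)

print(is_stochastic(weather_P))         # True
print(is_doubly_stochastic(weather_P))  # False: columns sum to 0.6 and 1.4
```

The weather matrix is stochastic but not doubly stochastic, since its columns sum to 0.6 and 1.4.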
Markov Process
Let X_i be the weather of day i, 1 <= i <= t. We may determine the
probability of X_{t+1} from X_i, 1 <= i <= t.
• Markov Property: X_{t+1}, the state of the system at time t+1, depends
  only on the state of the system at time t:

    Pr[X_{t+1} = x_{t+1} | X_1 = x_1, ..., X_t = x_t] = Pr[X_{t+1} = x_{t+1} | X_t = x_t]

[Diagram: chain X1 -> X2 -> X3 -> X4 -> X5]

• Stationary Assumption: transition probabilities are independent of
  time t:

    Pr[X_{t+1} = b | X_t = a] = p_ab
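Both properties show up directly when simulating the weather chain: tomorrow is sampled from today's state alone (Markov property), and the same transition table is used every day (stationary assumption). A minimal sketch, not from the slides:

```python
import random

# Transition probabilities of the weather chain from the earlier slide.
P = {"rain":    {"rain": 0.4, "no rain": 0.6},
     "no rain": {"rain": 0.2, "no rain": 0.8}}

def next_state(state, rng):
    """Sample tomorrow's weather given only today's weather."""
    r = rng.random()
    cumulative = 0.0
    for nxt, p in P[state].items():
        cumulative += p
        if r < cumulative:
            return nxt
    return nxt  # guard against floating-point rounding

def simulate(start, days, seed=0):
    """The same table P is used at every step (stationary assumption)."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(days):
        states.append(next_state(states[-1], rng))
    return states

print(simulate("rain", 7, seed=42))
```

Over a long run, the fraction of rainy days settles near 1/4, the chain's long-run distribution.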
Markov Process
Gambler’s Example
– Gambler starts with $10 (the initial state)
– At each play, one of the following happens:
  • Gambler wins $1 with probability p
  • Gambler loses $1 with probability 1-p
– Game ends when the gambler goes broke or reaches a fortune of $100
  (both 0 and 100 are absorbing states)

[Diagram: states 0, 1, 2, ..., 99, 100 in a line; each state i between 1
and 99 moves to i+1 with probability p and to i-1 with probability 1-p;
states 0 and 100 loop to themselves; start at state 10 ($10)]
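The absorbing states make this chain easy to simulate: each game runs until it hits 0 or 100. A minimal sketch (not from the slides); for the fair game p = 0.5, the well-known gambler's-ruin result says the chance of reaching $100 from $10 is 10/100:

```python
import random

def play(start=10, goal=100, p=0.5, seed=1):
    """Run one game; return the absorbing state reached (0 or goal)."""
    rng = random.Random(seed)
    money = start
    while 0 < money < goal:
        money += 1 if rng.random() < p else -1
    return money

# In the fair game, roughly 10% of games should reach $100.
wins = sum(play(seed=s) == 100 for s in range(2000))
print(wins / 2000)  # should be close to 0.1
```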
Markov Process
• Markov process - described by a stochastic FSM
• Markov chain - a random walk on this graph (a distribution over paths)
• Edge weights give us

    Pr[X_{t+1} = b | X_t = a] = p_ab

• We can ask more complex questions, like Pr[X_{t+2} = a | X_t = b] = ?

[Diagram: the gambler's-ruin chain again, states 0 to 100, start at $10]
Markov Process
Coke vs. Pepsi Example
• Given that a person’s last cola purchase was Coke,
there is a 90% chance that his next cola purchase will
also be Coke.
• If a person’s last cola purchase was Pepsi, there is
an 80% chance that his next cola purchase will also be
Pepsi.
The transition matrix (rows: last purchase, columns: next purchase;
order: Coke, Pepsi):

    P = [ 0.9  0.1 ]
        [ 0.2  0.8 ]

[Diagram: two states, coke and pepsi; coke stays coke with probability 0.9
and moves to pepsi with 0.1; pepsi stays pepsi with 0.8 and moves to coke
with 0.2]
Markov Process
Coke vs. Pepsi Example (cont)
Given that a person is currently a Pepsi purchaser,
what is the probability that he will purchase Coke two
purchases from now?
Pr[ Pepsi?Coke ] =
Pr[ PepsiCokeCoke ] + Pr[ Pepsi Pepsi Coke ] =
0.2 *
0.9
+
0.8 *
0.2
= 0.34
00.9.9 00.1.1 0.9 0.1 0.83 0.17
P 






00.2.2 00.8.8 0.2 0.8 0.34 0.66
2
Pepsi  ?
?  Coke
8
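The two-step probability is just matrix multiplication, which a short sketch (plain Python, not from the slides) reproduces:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

P = [[0.9, 0.1],   # row 0: Coke  -> [Coke, Pepsi]
     [0.2, 0.8]]   # row 1: Pepsi -> [Coke, Pepsi]

P2 = matmul(P, P)
# Entry [1][0] (row Pepsi, column Coke) is Pr[Pepsi -> ? -> Coke].
print(P2[1][0])  # 0.34, up to floating-point rounding
```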
Markov Process
Coke vs. Pepsi Example (cont)
Given that a person is currently a Coke purchaser,
what is the probability that he will buy Pepsi at the
third purchase from now?
0.9 0.1 0.83 0.17  0.781 0.219
P 





0.2 0.8 0.34 0.66 0.438 0.562
3
9
Markov Process
Coke vs. Pepsi Example (cont)
• Assume each person makes one cola purchase per week.
• Suppose 60% of all people now drink Coke, and 40% drink Pepsi.
• What fraction of people will be drinking Coke three weeks from now?

    P = [ 0.9  0.1 ]        P^3 = [ 0.781  0.219 ]
        [ 0.2  0.8 ]              [ 0.438  0.562 ]

Pr[X_3 = Coke] = 0.6 * 0.781 + 0.4 * 0.438 = 0.6438

Let Q_i be the distribution in week i:
Q_0 = (0.6, 0.4) - the initial distribution
Q_3 = Q_0 * P^3 = (0.6438, 0.3562)
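The same answer comes from pushing the initial distribution through the chain one week at a time, Q_{i+1} = Q_i * P. A minimal sketch, not from the slides:

```python
P = [[0.9, 0.1],
     [0.2, 0.8]]

def step(Q, P):
    """One week: multiply the row vector Q by the matrix P."""
    return [sum(Q[i] * P[i][j] for i in range(len(Q)))
            for j in range(len(P[0]))]

Q = [0.6, 0.4]  # 60% Coke, 40% Pepsi now
for _ in range(3):
    Q = step(Q, P)
print(Q)  # approximately [0.6438, 0.3562]
```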
Markov Process
Coke vs. Pepsi Example (cont)
Simulation:

[Plot: Pr[X_i = Coke] as a function of the week i; starting from 0.6, the
curve converges to 2/3]

    (2/3  1/3) [ 0.9  0.1 ] = (2/3  1/3)
               [ 0.2  0.8 ]

The limit (2/3, 1/3) is the stationary distribution: one more step of the
chain leaves it unchanged.

[Diagram: the coke/pepsi two-state machine, with transition probabilities
0.9, 0.1, 0.8, 0.2]
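The convergence shown in the simulation can be sketched by iterating Q <- Q * P until it stops changing; for this chain the limit is (2/3, 1/3) regardless of the starting distribution. A minimal sketch, not from the slides:

```python
P = [[0.9, 0.1],
     [0.2, 0.8]]

def step(Q, P):
    """One step of the chain: row vector Q times matrix P."""
    return [sum(Q[i] * P[i][j] for i in range(len(Q)))
            for j in range(len(P[0]))]

Q = [1.0, 0.0]  # start with everyone drinking Coke; the limit is the same
for _ in range(200):  # 200 steps is plenty for this 2-state chain
    Q = step(Q, P)
print(Q)  # approaches [2/3, 1/3]
```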
How to obtain Stochastic matrix?
• Solve linear equations; e.g., find a stochastic matrix with stationary
  distribution (2/3, 1/3):

    (2/3  1/3) [ x_1,1  x_1,2 ] = (2/3  1/3)
               [ x_2,1  x_2,2 ]

• Learn from examples; e.g., what letters follow what letters in English
  words: mast, tame, same, teams, team, meat, steam, stem.
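The learn-from-examples route can be sketched directly: count letter transitions in the example words and normalize each row. A minimal sketch (not from the slides), using '@' for word start and '\0' for word end as the next slide does:

```python
from collections import defaultdict

words = ["mast", "tame", "same", "teams", "team", "meat", "steam", "stem"]

# counts[cur][nxt] = number of times letter nxt follows letter cur;
# '@' marks the start of a word, '\0' the end.
counts = defaultdict(lambda: defaultdict(int))
for w in words:
    padded = "@" + w + "\0"
    for cur, nxt in zip(padded, padded[1:]):
        counts[cur][nxt] += 1

# Normalizing each row by its total turns counts into a stochastic matrix.
P = {cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
     for cur, row in counts.items()}

print(dict(counts["a"]))  # the a-row of the counts table
print(P["a"]["m"])        # 5/7, i.e. about 0.714
```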
How to obtain Stochastic matrix?
• Counts table C vs. stochastic matrix P. Rows are the current letter,
  columns the next letter; @ marks the start of a word and \0 the end:

    P |  a    s    t    m    e    \0
    --+-----------------------------
    a |  0   1/7  1/7  5/7   0    0
    e | 4/7   0    0   1/7   0   2/7
    m | 1/8  1/8   0    0   3/8  3/8
    s | 1/5   0   3/5   0    0   1/5
    t | 1/7   0    0    0   4/7  2/7
    @ |  0   3/8  3/8  2/8   0    0
Application of Stochastic matrix
• Using the stochastic matrix to generate a random word:
  – Generate a random first letter (row @)
  – For each current letter, generate a random next letter from that
    letter's row, until the end-of-word symbol \0 is drawn

Cumulative-counts table A, built from the counts table C:

    A |  a    s    t    m    e    \0
    --+-----------------------------
    a |  -    1    2    7    -    -
    e |  4    -    -    5    -    7
    m |  1    2    -    -    5    8
    s |  1    -    4    -    -    5
    t |  1    -    -    -    5    7
    @ |  -    3    6    8    -    -

If C[r,j] > 0, let A[r,j] = C[r,1] + C[r,2] + ... + C[r,j].
Application of Stochastic matrix
• Using the stochastic matrix to generate a random word:
  – First letter: generate a random number x between 1 and 8 (row @ of
    A). If 1 <= x <= 3, the letter is 's'; if 4 <= x <= 6, the letter is
    't'; otherwise it is 'm'.
  – Next letter: suppose the current letter is 's'; generate a random
    number x between 1 and 5. If x = 1, the next letter is 'a'; if
    2 <= x <= 4, the next letter is 't'; otherwise the current letter
    ends the word.

    A |  a    s    t    m    e    \0
    --+-----------------------------
    a |  -    1    2    7    -    -
    e |  4    -    -    5    -    7
    m |  1    2    -    -    5    8
    s |  1    -    4    -    -    5
    t |  1    -    -    -    5    7
    @ |  -    3    6    8    -    -

If C[r,j] > 0, let A[r,j] = C[r,1] + C[r,2] + ... + C[r,j].
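The word generator described above can be sketched as follows (not from the slides): each row of the cumulative table A turns one random integer into a sampled next letter.

```python
import random

# Cumulative-counts table A from the slide, stored per row as
# (threshold, next_letter) pairs; the last threshold is the row total.
A = {
    "@": [(3, "s"), (6, "t"), (8, "m")],
    "a": [(1, "s"), (2, "t"), (7, "m")],
    "e": [(4, "a"), (5, "m"), (7, "\0")],
    "m": [(1, "a"), (2, "s"), (5, "e"), (8, "\0")],
    "s": [(1, "a"), (4, "t"), (5, "\0")],
    "t": [(1, "a"), (5, "e"), (7, "\0")],
}

def next_letter(cur, rng):
    """Turn one random integer in 1..row-total into a sampled letter."""
    total = A[cur][-1][0]
    x = rng.randint(1, total)
    for threshold, letter in A[cur]:
        if x <= threshold:
            return letter

def random_word(seed=None):
    rng = random.Random(seed)
    word, cur = "", "@"
    while True:
        cur = next_letter(cur, rng)
        if cur == "\0":  # end-of-word symbol: stop
            return word
        word += cur

print(random_word(seed=3))
```

The generated words need not appear in the training list; the chain only preserves which letters may follow which.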
Supervised vs Unsupervised
• Decision tree learning is "supervised learning", as we know the correct
  output of each example.
• Learning based on Markov chains is "unsupervised learning", as we don't
  know the correct "next letter" output.
K-Nearest Neighbor
• Features:
  – All instances correspond to points in an n-dimensional Euclidean space
  – Classification is delayed until a new instance arrives
  – Classification is done by comparing feature vectors of the different
    points
  – The target function may be discrete or real-valued

[Figure: 1-Nearest Neighbor vs. 3-Nearest Neighbor decision]

Example: identify animal type
• 14 examples, 10 attributes, 5 types
• What is the type of this new animal?
K-Nearest Neighbor
• An arbitrary instance x is represented by (a_1(x), a_2(x), ..., a_n(x)),
  where a_i(x) denotes the i-th feature of x
• Euclidean distance between two instances:

    d(x_i, x_j) = sqrt( sum_{r=1..n} (a_r(x_i) - a_r(x_j))^2 )

• For a continuous-valued target function, predict the mean value of the
  k nearest training examples
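A minimal k-NN classifier using this Euclidean distance, sketched in Python. The two-feature toy data below is hypothetical; the 10-attribute animal table from the slide is not given.

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k):
    """train: list of (feature_vector, label) pairs.
    Classify by majority vote of the k training points nearest the query."""
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature toy data standing in for the animal examples.
train = [((1.0, 1.0), "cat"), ((1.2, 0.9), "cat"),
         ((5.0, 5.0), "dog"), ((5.1, 4.8), "dog"), ((4.9, 5.2), "dog")]

print(knn_classify(train, (1.1, 1.0), k=3))  # 'cat'
```

Note that all the work happens at query time, matching the "classification is delayed" property above.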
Distance-Weighted Nearest Neighbor Algorithm
• Assign weights to the neighbors based on their distance from the query
  point
• The weight may be the inverse square of the distance
• All training points may influence a particular instance
  (Shepard's method)
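For a real-valued target, the distance-weighted variant can be sketched as below (not from the slides): each of the k nearest neighbors contributes its value with weight 1/d^2. Setting k to the full training-set size gives the Shepard's-method behavior where every point has influence.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_knn_predict(train, query, k):
    """train: list of (feature_vector, value) pairs.
    Predict the inverse-square-distance weighted mean of the k nearest."""
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    num = den = 0.0
    for x, value in nearest:
        d = euclidean(x, query)
        if d == 0.0:       # query coincides with a training point
            return value
        w = 1.0 / d ** 2   # inverse-square weight
        num += w * value
        den += w
    return num / den

# Hypothetical 1-feature toy data: the target value equals the feature.
train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0)]
print(weighted_knn_predict(train, (0.5,), k=2))  # 0.5 (equal weights)
```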
Remarks
+ Highly effective inductive inference method for
noisy training data and complex target functions
+ Target function for a whole space may be
described as a combination of less complex local
approximations
+ Learning is very simple
- Classification is time consuming (except 1NN)