Value Iteration & Q-learning
CS 5368
Song Cui
Outline
• Recap
• Value Iteration
• Q-learning
Recap
• “Markov” meanings
• MDP: defined by (see the code sketch at the end of this slide)
  states $s_t \in S$
  actions $a_t \in A(s_t)$
  transition model $T(s, a, s')$
  reward function $R(s, a, s')$
  start state $s_0$
  (possibly a terminal state)
• Solving MDPs
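As a concrete reference for the rest of the recap, here is a minimal sketch of how these MDP components can be stored in Python. The two-state example, the dict layout, and all names (transitions, gamma, start_state) are illustrative assumptions, not the course's code.

```python
# Hypothetical encoding of the MDP components above (illustrative sketch).
# transitions[(s, a)] is a list of (T(s,a,s'), s', R(s,a,s')) triples.
transitions = {
    ("s1", "a1"): [(0.8, "s1", 1.0), (0.2, "s2", 1.0)],
    ("s1", "a2"): [(0.8, "s2", 1.0), (0.2, "s1", 1.0)],
    ("s2", "a1"): [(0.8, "s2", 0.0), (0.2, "s1", 0.0)],
    ("s2", "a2"): [(0.8, "s1", 0.0), (0.2, "s2", 0.0)],
}
states = {"s1", "s2"}
actions = {"a1", "a2"}
gamma = 0.9          # discount rate
start_state = "s1"   # s0
```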
Recap
• Utility of sequences
The discount rate determines the "vision" (effective horizon) of the agent.
• State-value function $V^\pi(s)$:
  $U^\pi(s) = V^\pi(s) = E\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid s_t = s\right]$
• Action-value function $Q^\pi(s, a)$:
  $Q^\pi(s, a) = E\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid s_t = s,\ a_t = a\right]$
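To make the expectation inside $V^\pi$ and $Q^\pi$ concrete, the short sketch below (illustrative, not from the slides) computes one sample of the discounted return $\sum_k \gamma^k R_{t+k+1}$; averaging such returns over many episodes that start from $s$ estimates $V^\pi(s)$.

```python
def discounted_return(rewards, gamma=0.9):
    """One sample of sum_k gamma**k * R_{t+k+1} for an observed reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4} from one episode:
print(discounted_return([1.0, 0.0, 1.0, 1.0]))  # 1.0 + 0.0 + 0.81 + 0.729 = 2.539
```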
Value Iteration
• How to find $V_k^*(s)$ as $k \to \infty$:
  Almost a solution: plain recursion
  Correct solution: dynamic programming
Value Iteration
• Bellman update:
  $V_{k+1}^*(s) = \max_a \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma V_k^*(s')\,]$
• Another way, splitting the backup over V-nodes and Q-nodes:
  V-node: $V_{k+1}^*(s) = \max_a [\, q_{k+1}^*(s, a) \,]$
  Q-node: $q_{k+1}^*(s, a) = \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma V_k^*(s')\,]$
Value Iteration
• Algorithm:
  $\forall s:\ V_0^*(s) = 0$
  for i = 1, 2, 3, ...
    for $s \in S,\ a \in A$:
      $q_i^*(s, a) = \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma V_{i-1}^*(s')\,]$
    for $s \in S$:
      $V_i^*(s) = \max_a [\, q_i^*(s, a) \,]$
      $\pi_i^*(s) = \operatorname{argmax}_a [\, q_i^*(s, a) \,]$
Value Iteration
• Theorem: value iteration converges to the optimal values.
  The policy may converge faster than the values do.
  Three components of the convergence argument:
  max norm: $\lVert U \rVert = \max_s |U(s)|$
  contraction: $\lVert U_{t+1} - V_{t+1} \rVert \le \gamma \lVert U_t - V_t \rVert$
  stopping bound: if $\lVert U_{t+1} - U_t \rVert < \epsilon$, then $\lVert U_{t+1} - U \rVert < 2\epsilon\gamma / (1 - \gamma)$
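In code, the fixed iteration count in the sketch above is typically replaced by a stopping test based on this max norm; the fragment below is an illustrative sketch (eps and the variable names are assumptions, not from the slides).

```python
def max_norm_diff(U_new, U_old):
    """||U_{t+1} - U_t|| = max_s |U_{t+1}(s) - U_t(s)|."""
    return max(abs(U_new[s] - U_old[s]) for s in U_new)

# Inside the value-iteration loop: stop once successive value functions are
# close; the theorem then bounds the remaining error by 2*eps*gamma/(1-gamma).
# if max_norm_diff(V_new, V_old) < eps:
#     break
```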
Value Iteration
• Advantages compared with Expectimax:
Given an MDP with state space {1, 2}, actions {1, 2}, transition probabilities 0.8 / 0.2, and rewards: state 1 → 1, state 2 → 0.
[Diagram: the expectimax tree for this MDP, expanding S1 and S2 through their 80% / 20% branches; the same S1 and S2 subtrees appear again and again at every depth ("Repeats!"), whereas value iteration computes each state's value only once per iteration.]
Q-learning
• Compared with Value Iteration:
  same: the environment is still an MDP, and we still seek a policy
  different: T(s, a, s') and R(s, a, s') are unknown
  different ways of solving the MDP (learn a model vs. work model-free)
• Reinforcement Learning
policy, experience, reward
model-based vs. model-free
passive learning vs. active learning
Q-learning
• Value iteration:
  $V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma V_k(s')\,]$
• Q-values (Q-value iteration):
  $Q_{k+1}(s, a) = \sum_{s'} T(s, a, s')\,[\,R(s, a, s') + \gamma \max_{a'} Q_k(s', a')\,]$
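A sketch of Q-value iteration under the same hypothetical MDP encoding used earlier; it mirrors value iteration but keeps a table of Q-values and moves the max over actions inside the backup. The function name and defaults are illustrative.

```python
def q_value_iteration(states, actions, transitions, gamma, num_iters=100):
    """Q_{k+1}(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q_k(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(num_iters):
        # The comprehension reads the old Q table and builds the new one.
        Q = {(s, a): sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                         for p, s2, r in transitions[(s, a)])
             for s in states for a in actions}
    return Q
```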
Q-learning
• Q-learning: Q-value iteration
• Process:
  sample a transition: s → a → s', receiving reward r
• Update the Q-value toward the sample:
  $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\,[\, r + \gamma \max_{a'} Q(s', a') \,]$
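An illustrative Python sketch of this sample-based update; Q is assumed to be a dict keyed by (state, action), and alpha (learning rate) and gamma are hypothetical parameter choices.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Blend the old Q(s,a) toward the sample r + gamma * max_a' Q(s',a')."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```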
Q-learning
• Q-learning converges to the optimal policy
• if we sample enough and the learning rate is small enough
Ways to explore:
  epsilon-greedy action selection: choose between acting randomly and acting
  according to the best current Q-value (see the sketch below)
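A minimal sketch of epsilon-greedy selection over the same kind of Q dict; epsilon, the function name, and the defaults are illustrative assumptions.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act on the best current Q-value."""
    if random.random() < epsilon:
        return random.choice(list(actions))        # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit
```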