Transcript 17.ppt

MDPs and Reinforcement Learning
Overview
• MDPs
• Reinforcement learning
Sequential decision problems
• Find a sequence of actions in an uncertain environment that
balances risks and rewards
• Markov Decision Process (MDP):
– In a fully observable environment we know the initial state S0
and the transition model T(Si, Ak, Sj) = probability of reaching Sj
from Si when doing action Ak
– Each state Si has an associated reward R(Si)
• We can define a policy π that selects an action to
perform given a state, i.e., π(Si)
• Applying a policy from the initial state produces a history of states and actions
• Goal: find the policy that maximizes the expected utility of the resulting history (see the sketch below)
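To make these pieces concrete, here is a minimal sketch of one way to represent an MDP and a policy and to estimate a policy's expected utility by sampling histories. The states, transition table, rewards, and discount below are made-up assumptions for illustration, not part of the slides.

```python
import random

# A minimal illustrative MDP (all names and numbers are made up for this sketch).
# T[s][a] is a list of (probability, next_state) pairs; R[s] is the reward
# received in state s; pi[s] is the action the policy chooses in s.
T = {
    "S0": {"A": [(0.8, "S1"), (0.2, "S0")],
           "B": [(1.0, "S2")]},
    "S1": {"A": [(1.0, "S2")]},
    "S2": {},                        # terminal: no actions available
}
R = {"S0": -0.04, "S1": -0.04, "S2": 1.0}
pi = {"S0": "A", "S1": "A"}
GAMMA = 0.9

def sample_next(s, a):
    """Sample a successor state from the distribution T(s, a, .)."""
    p, acc = random.random(), 0.0
    for prob, s2 in T[s][a]:
        acc += prob
        if p <= acc:
            return s2
    return T[s][a][-1][1]            # guard against floating-point round-off

def history_utility(s="S0", max_steps=100):
    """Run the policy once and return the discounted utility of the history."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        total += discount * R[s]
        if not T[s]:                 # reached a terminal state
            break
        s = sample_next(s, pi[s])
        discount *= GAMMA
    return total

# Estimate the policy's expected utility by averaging many sampled histories.
print(sum(history_utility() for _ in range(10000)) / 10000)
```

The later slides replace this sampling with exact computation via value functions and the Bellman equation.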
4x3 Grid World
• Assume R(s) = -0.04 except where marked
• Here’s an optimal policy
4x3 Grid World
• Different default rewards R(s) produce different optimal policies:
– Life = pain: get out quickly
– Life = a struggle: go for +1, accept risk
– Life = ok: go for +1, minimize risk
– Life = good: avoid the exits
Finite and infinite horizons
• Finite Horizon
– There’s a fixed time N when the game is over
– U([s0, …, sN]) = U([s0, …, sN, …, sN+k]): states after step N don't affect the utility
– Find a policy that takes that into account
• Infinite Horizon
– Game goes on forever
• With a finite horizon, the best policy can change over time (it depends
on how many steps remain), which makes things more complicated; see the sketch below
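The time-dependence can be seen with backward induction over a finite horizon. The chain MDP below is a made-up example (not from the slides): a +1 exit sits one step to the left of the start state and a +10 exit three steps to the right, so the best first action depends on how many steps remain.

```python
# Finite-horizon backward induction on a made-up chain MDP (not from the slides):
#   exit1(+1) -- S -- A -- B -- exit10(+10)
# Moves are deterministic and cost nothing; the only rewards are the exits.
STATES = ["exit1", "S", "A", "B", "exit10"]
EXIT_REWARD = {"exit1": 1.0, "exit10": 10.0}

def step(s, a):
    """Move left or right along the chain (exits are absorbing)."""
    i = STATES.index(s)
    j = i - 1 if a == "left" else i + 1
    return STATES[max(0, min(j, len(STATES) - 1))]

def backward_induction(horizon):
    """Compute V[k][s] and the best action pi[k][s] with k steps remaining."""
    V = {0: {s: EXIT_REWARD.get(s, 0.0) for s in STATES}}
    pi = {}
    for k in range(1, horizon + 1):
        V[k], pi[k] = {}, {}
        for s in STATES:
            if s in EXIT_REWARD:                 # exits keep their value
                V[k][s] = EXIT_REWARD[s]
                continue
            vals = {a: V[k - 1][step(s, a)] for a in ("left", "right")}
            pi[k][s] = max(vals, key=vals.get)
            V[k][s] = vals[pi[k][s]]
    return V, pi

V, pi = backward_induction(4)
print(pi[1]["S"])   # 'left'  : only one step left, grab the +1 exit
print(pi[4]["S"])   # 'right' : enough steps remain to reach the +10 exit
```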
Rewards
• The utility of a sequence is usually additive
– U([s0, s1, …, sn]) = R(s0) + R(s1) + … + R(sn)
• But future rewards might be discounted by a
factor γ
– U([s0, s1, …, sn]) = R(s0) + γR(s1) + γ^2 R(s2) + … + γ^n R(sn)
• Using discounted rewards
– Solves some technical difficulties with very long or infinite
sequences (the discounted sum converges when γ < 1)
– Is psychologically realistic
– γ close to 0: shortsighted (immediate rewards dominate); γ close to 1: farsighted
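A small sketch of the discounted sum above, with made-up rewards, shows how γ controls how shortsighted or farsighted the utility is:

```python
def discounted_utility(rewards, gamma):
    """U([s0, ..., sn]) = R(s0) + gamma*R(s1) + ... + gamma^n * R(sn)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]        # illustrative reward sequence
print(discounted_utility(rewards, 0.1))     # shortsighted: about -0.043
print(discounted_utility(rewards, 0.99))    # farsighted:   about  0.851
```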
Value Functions
• The value of a state is the expected return starting from that
state; it depends on the agent's policy:
– State-value function for policy π:
V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s ]
• The value of taking an action in a state under policy π is the
expected return starting from that state, taking that action, and
thereafter following π:
– Action-value function for policy π:
Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
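Read operationally, both definitions say: average the discounted return over many runs. The sketch below estimates V^π(s) and Q^π(s, a) by Monte Carlo sampling on a tiny made-up MDP; all names and numbers are assumptions for illustration.

```python
import random

# A tiny made-up MDP.  P[s][a] lists (prob, reward, next_state); "end" is
# terminal.  pi is the fixed policy being evaluated.
P = {"s": {"a": [(0.9, 0.0, "s"), (0.1, 1.0, "end")],
           "b": [(1.0, 0.2, "end")]}}
pi = {"s": "a"}
GAMMA = 0.9

def rollout_return(s, first_action=None):
    """Sample one episode and return its discounted return R_t."""
    ret, discount = 0.0, 1.0
    a = first_action or pi[s]
    while s != "end":
        u, acc = random.random(), 0.0
        for prob, r, s2 in P[s][a]:              # sample one transition
            acc += prob
            if u <= acc:
                break
        ret += discount * r
        discount *= GAMMA
        s = s2
        a = pi.get(s)                            # afterwards, follow the policy
    return ret

N = 20000
print(sum(rollout_return("s") for _ in range(N)) / N)        # estimate of V^pi(s)
print(sum(rollout_return("s", "b") for _ in range(N)) / N)   # estimate of Q^pi(s, b)
```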

Bellman Equation for a Policy π
The basic idea:
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + …
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + … )
    = r_{t+1} + γ R_{t+1}
So:
V^π(s) = E_π[ R_t | s_t = s ]
       = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
Or, without the expectation operator:
V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
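The last line is one linear equation per state. A standard way to solve the system is to apply the right-hand side repeatedly as an update until the values stop changing (iterative policy evaluation). The sketch below does this for a generic tabular MDP; the data-structure layout is an assumption of the sketch.

```python
def policy_evaluation(states, pi, P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- sum_a pi[s][a] * sum_s' P[s][a][s'] * (R[s][a][s'] + gamma*V(s'))
    until the largest change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in P[s][a])
                    for a in pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Illustrative two-state example (all numbers made up):
states = ["x", "y"]
pi = {"x": {"go": 1.0}, "y": {"stay": 1.0}}       # pi[s][a] = prob. of action a in s
P  = {"x": {"go": {"y": 1.0}}, "y": {"stay": {"y": 1.0}}}
R  = {"x": {"go": {"y": 1.0}}, "y": {"stay": {"y": 0.0}}}
print(policy_evaluation(states, pi, P, R))        # {'x': 1.0, 'y': 0.0}
```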
Values for states in 4x3 world
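The values shown on this slide can be reproduced by running value iteration on the 4x3 world. The sketch below assumes the usual setup for this example: 0.8 chance of moving in the intended direction and 0.1 of slipping to either side, bumps into walls or the blocked cell at (2,2) leave the agent in place, R(s) = -0.04 for non-terminal states, terminals +1 at (4,3) and -1 at (4,2), and no discounting. These details are conventional for this grid world but should be treated as assumptions of the sketch.

```python
# Value iteration for the 4x3 grid world (setup details are assumptions; see above).
BLOCKED = {(2, 2)}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STEP_REWARD = -0.04
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in BLOCKED]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SIDES = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def move(s, d):
    """Intended successor of s in direction d; walls and the blocked cell bounce back."""
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in STATES else s

def value_iteration(gamma=1.0, tol=1e-6):
    V = {s: 0.0 for s in STATES}
    V.update(TERMINALS)                          # terminal values stay fixed
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            best = max(0.8 * V[move(s, a)]               # 0.8 intended direction
                       + 0.1 * V[move(s, SIDES[a][0])]   # 0.1 slip to each side
                       + 0.1 * V[move(s, SIDES[a][1])]
                       for a in MOVES)
            v = STEP_REWARD + gamma * best
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = value_iteration()
for y in (3, 2, 1):                              # print the grid, top row first
    print([round(V[(x, y)], 3) if (x, y) in V else None for x in range(1, 5)])
```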