Transcript 17.ppt
MDPs and
Reinforcement
Learning
Overview
• MDPs
• Reinforcement learning
Sequential decision problems
• In an environment, find a sequence of actions in an
uncertain environment that balance risks and rewards
• Markov Decision Process (MDP):
– In a fully observable environment we know initial state (S0)
and state transitions T(Si, Ak, Sj) = probability of reaching Sj
from Si when doing Ak
– States have a reward associated with them R(Si)
• We can define a policy π that selects an action to
perform given a state, i.e., π(Si)
• Applying a policy leads to a history of actions
• Goal: find policy maximizing expected utility of history
4x3 Grid World
4x3 Grid World
• Assume R(s) = -0.04 except where marked
• Here’s an optimal policy
4x3 Grid World
Different default rewards produce different
optimal policies
life=pain, get
out quick
Life = ok, go for
+1, minimize risk
Life = struggle,
go for +1, accept
risk
Life = good,
avoid exits
Finite and infinite horizons
• Finite Horizon
– There’s a fixed time N when the game is over
– U([s1…sn]) = U([s1…sn…sk])
– Find a policy that takes that into account
• Infinite Horizon
– Game goes on forever
• The best policy for with a finite horizon can
change over time: more complicated
Rewards
• The utility of a sequence is usually additive
– U([s0…s1]) = R(s0) + R(s1) + … R(sn)
• But future rewards might be discounted by a
factor γ
– U([s0…s1]) = R(s0) + γ*R(s1) + γ2*R(s2)…+ γn*R(sn)
• Using discounted rewards
– Solves some technical difficulties with very long or
infinite sequences and
– Is psychologically realistic
shortsighted 0 1 farsighted
Value Functions
• The value of a state is the expected return starting from that
state; depends on the agent’s policy:
State - value function for policy :
k
V (s) E Rt st s E rt k 1 st s
k 0
• The value of taking an action in a state under policy is the
expected return starting from that state, taking that action, and
thereafter following :
Action - value function for policy :
k
Q (s, a) E Rt st s, at a E rt k 1 st s,at a
k 0
9
Bellman Equation for a Policy
The basic idea:
Rt rt 1 rt 2 2 rt 3 3 rt 4
rt 1 rt 2 rt 3 2 rt 4
rt 1 Rt 1
So:
V (s) E Rt st s
E rt 1 V st 1 st s
Or, without the expectation operator:
V (s) (s,a) PsasRsas V ( s)
a
s
10
Values for states in 4x3 world