Reinforcement Learning:
Dynamic Programming
Csaba Szepesvári
University of Alberta
Kioloa, MLSS’08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
1
Reinforcement Learning
RL =
“Sampling based methods to solve
optimal control problems”
(Rich Sutton)
Contents
Defining AI
Markovian Decision Problems
Dynamic Programming
Approximate Dynamic Programming
Generalizations
2
Literature
Books
Richard S. Sutton, Andrew G. Barto:
Reinforcement Learning: An Introduction,
MIT Press, 1998
Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
Journals
JMLR, MLJ, JAIR, AI
Conferences
NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI
3
Some More Books
Martin L. Puterman: Markov Decision Processes. Wiley, 1994.
Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007).
James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
4
Resources
RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
The RL Toolbox 2.0: http://www.igi.tugraz.at/riltoolbox/general/overview.html
OpenDP: http://opendp.sourceforge.net
RL-Competition (2008)! http://rl-competition.org/
June 1st, 2008: Test runs begin!
Related fields:
Operations research (MOR, OR)
Control theory (IEEE TAC, Automatica, IEEE CDC, ECC)
Simulation optimization (Winter Simulation Conference)
5
Abstract Control Model
Figure: the perception-action loop. The controller (= agent) sends actions to the environment and receives sensations (and reward) in return.
6
Zooming in..
Figure: the agent, zoomed in. External sensations and reward flow into the agent, which maintains a memory/state (internal sensations) and emits actions.
7
A Mathematical Model
Plant (controlled object):
xt+1 = f(xt, at, vt), where xt is the state and vt is noise
zt = g(xt, wt), where zt is the sensation/observation and wt is noise
State: sufficient statistic for the future
independently of what we measure ("objective state")
..or..
relative to the measurements ("subjective state")
Controller:
at = F(z1, z2, …, zt), where at is the action/control
=> PERCEPTION-ACTION LOOP
"CLOSED-LOOP CONTROL"
Design problem: F = ?
Goal: Σt=1..T r(zt, at) → max
8
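To make the closed-loop picture concrete, here is a minimal simulation sketch (not part of the original slides): the scalar plant f, the noisy sensor g and the simple reactive controller F are all illustrative assumptions, but the loop structure is exactly the one above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, a, v):            # plant dynamics x_{t+1} = f(x_t, a_t, v_t) (toy linear example)
    return 0.9 * x + a + v

def g(x, w):               # sensor z_t = g(x_t, w_t) (noisy observation)
    return x + w

def F(z_history):          # controller a_t = F(z_1, ..., z_t); here a purely reactive rule
    return -0.5 * z_history[-1]

x, total_reward = 1.0, 0.0
zs = []
for t in range(50):                       # the perception-action loop
    z = g(x, 0.01 * rng.normal()); zs.append(z)
    a = F(zs)
    total_reward += -z**2 - 0.1 * a**2    # r(z_t, a_t): penalize deviation and effort
    x = f(x, a, 0.01 * rng.normal())

print("return of this controller:", total_reward)
```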
A Classification of Controllers
Feedforward:
a1,a2,… is designed ahead in time
???
Feedback:
Purely reactive systems: at = F(zt)
Why is this bad?
Feedback with memory:
mt = M(mt-1,zt,at-1)
~interpreting sensations
at = F(mt)
decision making: deliberative vs. reactive
9
Feedback controllers
Plant:
xt+1 = f(xt,at,vt)
zt+1 = g(xt,wt)
Controller:
mt = M(mt-1,zt,at-1)
at = F(mt)
mt ≈ xt: state estimation, “filtering”
difficulties: noise, unmodelled parts
How do we compute at?
With a model (f’): model-based control
..assumes (some kind of) state estimation
Without a model: model-free control
10
Markovian Decision Problems
11
Markovian Decision Problems
(X,A,p,r)
X – set of states
A – set of actions (controls)
p – transition probabilities
p(y|x,a)
r – rewards
r(x,a,y), or r(x,a), or r(x)
γ – discount factor
0 ≤ γ < 1
12
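As an illustration of the tuple (X, A, p, r, γ), here is a minimal container for a finite MDP (a sketch, not from the slides); the array layout P[a, x, y] = p(y|x,a) and R[a, x, y] = r(x,a,y) is an assumed convention reused in the later sketches.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP (X, A, p, r, gamma) stored as arrays.

    P[a, x, y] = p(y | x, a)   -- transition probabilities
    R[a, x, y] = r(x, a, y)    -- rewards
    """
    P: np.ndarray
    R: np.ndarray
    gamma: float

    def __post_init__(self):
        assert 0.0 <= self.gamma < 1.0
        assert np.allclose(self.P.sum(axis=2), 1.0)   # each p(.|x,a) is a distribution

# A 2-state, 2-action toy instance.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0             # reward 1 for landing in state 1
mdp = FiniteMDP(P, R, gamma=0.9)
```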
The Process View
(Xt,At,Rt)
Xt – state at time t
At – action at time t
Rt – reward at time t
Laws:
Xt+1~p(.|Xt,At)
At ~ π(.|Ht)
π: policy
Ht = (Xt,At-1,Rt-1, .., A1,R1,X0) – history
Rt = r(Xt,At,Xt+1)
13
The Control Problem
Value functions:
Vπ(x) = Eπ[ Σt=0..∞ γ^t Rt | X0 = x ]
Optimal value function:
V*(x) = maxπ Vπ(x)
Optimal policy π*:
Vπ*(x) = V*(x)
14
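To connect the definition of Vπ with the process view, here is a sketch that estimates Vπ(x) by averaging truncated discounted returns of simulated trajectories; the toy MDP, the policy and the truncation horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-state, 2-action MDP; P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma, pi = 0.9, np.array([0, 1])        # a deterministic policy pi(x)

def mc_value(x0, episodes=2000, horizon=200):
    """Estimate V^pi(x0) = E[sum_t gamma^t R_t | X_0 = x0] by truncated rollouts."""
    total = 0.0
    for _ in range(episodes):
        x, ret, disc = x0, 0.0, 1.0
        for _ in range(horizon):          # gamma^200 is negligible here
            a = pi[x]
            y = rng.choice(2, p=P[a, x])
            ret += disc * R[a, x, y]
            disc *= gamma
            x = y
        total += ret
    return total / episodes

print([mc_value(x) for x in range(2)])
```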
Applications of MDPs
Operations research
Econometrics
Control, statistics
Games, AI
Optimal investments
Replacement problems
Option pricing
Logistics, inventory management
Active vision
Production scheduling
Dialogue control
Bioreactor control
Robotics (RoboCup Soccer)
Driving
Real-time load balancing
Design of experiments (medical tests)
15
Variants of MDPs
Discounted
Undiscounted: Stochastic Shortest Path
Average reward
Multiple criteria
Minimax
Games
16
MDP Problems
Planning
The MDP (X, A, p, r, γ) is known.
Find an optimal policy π*!
Learning
The MDP is unknown.
You are allowed to interact with it.
Find an optimal policy π*!
Optimal learning
While interacting with the MDP,
minimize the loss due to not using an
optimal policy from the beginning
17
Solving MDPs – Dimensions
Which problem? (Planning, learning, optimal learning)
Exact or approximate?
Uses samples?
Incremental?
Uses value functions?
Yes: Value-function based methods
Planning: DP, Random Discretization Method, FVI, …
Learning: Q-learning, Actor-critic, …
No: Policy search methods
Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), …
Representation
Structured state:
Factored states, logical representation, …
Structured policy space:
Hierarchical methods
18
Dynamic Programming
19
Richard Bellman (1920-1984)
Control theory
Systems Analysis
Dynamic Programming:
RAND Corporation, 1949-1955
Bellman equation
Bellman-Ford algorithm
Hamilton-Jacobi-Bellman equation
“Curse of dimensionality”
invariant imbeddings
Grönwall-Bellman inequality
20
Bellman Operators
Let π: X → A be a stationary policy
B(X) = { V | V: X → R, ||V||∞ < ∞ }
Tπ: B(X) → B(X)
(Tπ V)(x) = Σy p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
Theorem:
Tπ Vπ = Vπ
Note: This is a linear system of
equations: rπ + γ Pπ Vπ = Vπ, i.e.
Vπ = (I - γ Pπ)^(-1) rπ
21
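Since Tπ is affine, the fixed-point equation can be solved with one linear solve. Below is a minimal numpy sketch of this; the 2-state toy MDP and the array layout P[a,x,y], R[a,x,y] are assumptions, not from the slides.

```python
import numpy as np

# Toy MDP: P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y), deterministic policy pi.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma, pi = 0.9, np.array([0, 1])
n = P.shape[1]

# Build P_pi(x, y) = p(y|x, pi(x)) and r_pi(x) = sum_y p(y|x,pi(x)) r(x,pi(x),y).
P_pi = P[pi, np.arange(n)]                       # shape (n, n)
r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)  # shape (n,)

# T_pi is affine: T_pi V = r_pi + gamma P_pi V, so V_pi solves (I - gamma P_pi) V = r_pi.
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print("V_pi =", V_pi)
```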
Proof of Tπ Vπ = Vπ
What you need to know:
Linearity of expectation: E[A+B] = E[A] + E[B]
Law of total expectation:
E[ Z ] = Σx P(X=x) E[ Z | X=x ], and
E[ Z | U=u ] = Σx P(X=x|U=u) E[ Z | U=u, X=x ].
Markov property:
E[ f(X1,X2,..) | X1=y, X0=x ] = E[ f(X1,X2,..) | X1=y ]
Vπ(x) = Eπ[ Σt=0..∞ γ^t Rt | X0 = x ]
= Σy P(X1=y|X0=x) Eπ[ Σt=0..∞ γ^t Rt | X0 = x, X1 = y ]
(by the law of total expectation)
= Σy p(y|x,π(x)) Eπ[ Σt=0..∞ γ^t Rt | X0 = x, X1 = y ]
(since X1 ~ p(.|X0, π(X0)))
= Σy p(y|x,π(x)) { Eπ[ R0 | X0=x, X1=y ] + γ Eπ[ Σt=0..∞ γ^t Rt+1 | X0=x, X1=y ] }
(by the linearity of expectation)
= Σy p(y|x,π(x)) { r(x,π(x),y) + γ Vπ(y) }
(using the definition of r and Vπ, and the Markov property to drop the conditioning on X0 = x)
= (Tπ Vπ)(x).
(using the definition of Tπ)
22
The Banach Fixed-Point Theorem
B = (B,||.||) Banach space
T: B1 → B2 is L-Lipschitz (L > 0) if for
any U, V,
|| T U - T V || ≤ L ||U - V||.
T is a contraction if B1 = B2 and L < 1; L is a
contraction coefficient of T
Theorem [Banach]: Let T: B → B be
a γ-contraction. Then T has a unique
fixed point V and ∀ V0 ∈ B, with Vk+1 = T Vk,
Vk → V and ||Vk - V|| = O(γ^k)
23
An Algebra for Contractions
Prop: If T1: B1 → B2 is L1-Lipschitz and
T2: B2 → B3 is L2-Lipschitz, then T2 T1 is L1 L2-Lipschitz.
Def: If T is 1-Lipschitz, T is called a
non-expansion.
Prop: M: B(X × A) → B(X),
M(Q)(x) = maxa Q(x,a) is a non-expansion.
Prop: Mulc: B → B, Mulc V = c V is
|c|-Lipschitz.
Prop: Addr: B → B, Addr V = r + V is a
non-expansion.
Prop: K: B(X) → B(X),
(K V)(x) = Σy K(x,y) V(y) is a non-expansion
if K(x,y) ≥ 0 and Σy K(x,y) = 1.
24
Policy Evaluations are Contractions
Def: ||V||∞ = maxx |V(x)|, the
supremum norm; below written ||.||
Theorem: Let Tπ be the policy
evaluation operator of some policy π.
Then Tπ is a γ-contraction.
Corollary: Vπ is the unique fixed
point of Tπ. Vk+1 = Tπ Vk → Vπ,
∀ V0 ∈ B(X), and ||Vk - Vπ|| = O(γ^k).
25
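The contraction argument also suggests the simplest algorithm: iterate Tπ and watch the sup-norm error shrink geometrically. A minimal sketch, assuming the same toy MDP layout as before:

```python
import numpy as np

# Toy MDP (assumed arrays): P[a,x,y] = p(y|x,a), R[a,x,y] = r(x,a,y), deterministic policy pi.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma, pi, n = 0.9, np.array([0, 1]), 2

P_pi = P[pi, np.arange(n)]
r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)

def T_pi(V):
    """Policy evaluation operator (T_pi V)(x) = sum_y p(y|x,pi(x)) [r + gamma V(y)]."""
    return r_pi + gamma * P_pi @ V

V_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # exact fixed point V_pi
V = np.zeros(n)
for k in range(30):
    V = T_pi(V)                                             # V_{k+1} = T_pi V_k
    if k % 10 == 0:
        # sup-norm error shrinks like O(gamma^k), as the contraction argument predicts
        print(k, np.max(np.abs(V - V_exact)))
```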
The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by
(T V)(x) =
maxa Σy p(y|x,a) { r(x,a,y) + γ V(y) }
Def: π is greedy w.r.t. V if Tπ V = T V.
Prop: T is a γ-contraction.
Theorem (BOE): T V* = V*.
Proof: Let V be the fixed point of T.
For any policy π, Tπ V ≤ T V = V, hence Vπ ≤ V and so V* ≤ V.
Now let π be greedy w.r.t. V. Then Tπ V = T V = V,
hence Vπ = V, so V ≤ V*. Therefore V = V*.
26
Value Iteration
Theorem: For any V0 ∈ B(X), with Vk+1 = T Vk,
Vk → V* and in particular ||Vk - V*|| = O(γ^k).
What happens when we stop “early”?
Theorem: Let π be greedy w.r.t. V. Then
||Vπ - V*|| ≤ 2 ||T V - V|| / (1-γ).
Proof: ||Vπ - V*|| ≤ ||Vπ - V|| + ||V - V*|| …
Corollary: In a finite MDP, the number of
policies is finite. We can stop when
||Vk - T Vk|| ≤ Δ(1-γ)/2, where
Δ = min{ ||V* - Vπ|| : Vπ ≠ V* }
Pseudo-polynomial complexity
27
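A minimal value iteration sketch with a sup-norm stopping rule, again assuming the toy array layout P[a,x,y], R[a,x,y]; the threshold 1e-8 is an arbitrary choice rather than the Δ(1-γ)/2 rule, since Δ is unknown in practice.

```python
import numpy as np

# Toy MDP (assumed layout): P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma = 0.9

def bellman(V):
    """(L V)(x, a) = sum_y p(y|x,a)[r(x,a,y) + gamma V(y)]; T V = max_a (L V)."""
    Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # shape (|A|, |X|)
    return Q.max(axis=0), Q.argmax(axis=0)

V = np.zeros(P.shape[1])
for k in range(10_000):
    TV, greedy = bellman(V)
    if np.max(np.abs(TV - V)) < 1e-8:      # stop when ||T V - V|| is small
        break
    V = TV

# The greedy policy pi w.r.t. V satisfies ||V_pi - V*|| <= 2 ||T V - V|| / (1 - gamma).
print("V* ~", V, " greedy policy:", greedy)
```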
Policy Improvement [Howard ’60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x)
holds for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and
∃ x ∈ X s.t. V(x) > U(x).
Theorem (Policy Improvement):
Let π’ be greedy w.r.t. Vπ. Then
Vπ’ ≥ Vπ. If T Vπ > Vπ then Vπ’ > Vπ.
28
Policy Iteration
Policy Iteration(π)
V ← Vπ
Do {improvement}
V’ ← V
Let π: Tπ V = T V
V ← Vπ
While (V > V’)
Return π
29
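A minimal sketch of policy iteration on the toy MDP: exact evaluation by a linear solve, greedy improvement, stopping when the improvement step returns the same policy (the array layout is an assumption, as before).

```python
import numpy as np

# Toy MDP (assumed layout): P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma, n = 0.9, 2

def evaluate(pi):
    """Exact policy evaluation: V_pi = (I - gamma P_pi)^{-1} r_pi."""
    P_pi = P[pi, np.arange(n)]
    r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def greedy(V):
    """A policy pi with T_pi V = T V (ties broken by argmax)."""
    Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q.argmax(axis=0)

pi = np.zeros(n, dtype=int)
while True:
    V = evaluate(pi)            # evaluation: V <- V_pi
    new_pi = greedy(V)          # improvement step
    if np.array_equal(new_pi, pi):
        break                   # no further improvement: pi is optimal
    pi = new_pi

print("optimal policy:", pi, " V*:", evaluate(pi))
```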
Policy Iteration Theorem
Theorem: In a finite, discounted
MDP policy iteration stops after a
finite number of steps and returns an
optimal policy.
Proof: Follows from the Policy
Improvement Theorem.
30
Linear Programming
V ≥ T V ⇒ V ≥ V* = T V*.
Hence, V* is the “smallest” V that satisfies
V ≥ T V.
V ≥ T V
⇔
(*) V(x) ≥ Σy p(y|x,a) { r(x,a,y) + γ V(y) },
∀ x, a
LinProg(V):
Σx V(x) → min s.t. V satisfies (*).
Theorem: LinProg(V) returns the optimal
value function, V*.
Corollary: Pseudo-polynomial complexity
31
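A sketch of this LP using scipy.optimize.linprog on the toy MDP; each constraint row encodes (*) rearranged as γ p(.|x,a)·V - V(x) ≤ -Σy p(y|x,a) r(x,a,y). The toy arrays are assumptions, as before.

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP (assumed layout): P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma = 0.9
nA, nX = P.shape[0], P.shape[1]

# Constraint (*):  V(x) >= sum_y p(y|x,a) [ r(x,a,y) + gamma V(y) ]  for all x, a,
# written for linprog as  gamma * p(.|x,a) . V - V(x) <= -E[r | x, a].
A_ub, b_ub = [], []
for x in range(nX):
    for a in range(nA):
        row = gamma * P[a, x].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-np.dot(P[a, x], R[a, x]))

# Objective: minimize sum_x V(x); V may be negative in general, so lift the default bounds.
res = linprog(c=np.ones(nX), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nX)
print("V* from the LP:", res.x)
```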
Variations of a Theme
32
Approximate Value Iteration
AVI: Vk+1 = T Vk + εk
AVI Theorem:
Let ε = maxk ||εk||. Then
limsup_{k→∞} ||Vk - V*|| ≤ 2 γ ε / (1-γ).
Proof: Let ak = ||Vk - V*||.
Then ak+1 = ||Vk+1 - V*|| = ||T Vk - T V* + εk|| ≤ γ ||Vk - V*|| + ε = γ ak + ε.
Hence, ak is bounded. Take “limsup”
of both sides: a ≤ γ a + ε; reorder. //
(e.g., [BT96])
33
Fitted Value Iteration
– Non-expansion Operators
FVI: Let A be a non-expansion and
Vk+1 = A T Vk. Where does this
converge to?
Theorem: Let U, V be such that A T U = U
and T V = V. Then
||V - U|| ≤ ||A V - V|| / (1-γ).
Proof: Let U’ be the fixed point of T A.
Then ||U’ - V|| ≤ γ ||A V - V|| / (1-γ).
Since A U’ = A T (A U’), U = A U’. Hence,
||U - V|| = ||A U’ - V||
≤ ||A U’ - A V|| + ||A V - V|| …
[Gordon ’95]
34
Application to Aggregation
Let Π be a partition of X, and let S(x) be the
unique cell that x belongs to.
Let A: B(X) → B(X) be
(A V)(x) = Σz μ(z; S(x)) V(z), where μ(.; S(x)) is a
distribution over S(x).
p’(C|B,a) =
Σx∈B μ(x; B) Σy∈C p(y|x,a),
r’(B,a,C) =
Σx∈B μ(x; B) Σy∈C p(y|x,a) r(x,a,y).
Theorem: Take the aggregated MDP (Π, A, p’, r’), let V’ be its
optimal value function, and let V’E(x) = V’(S(x)).
Then ||V’E - V*|| ≤ ||A V* - V*|| / (1-γ).
35
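A sketch of this construction on a random 4-state MDP aggregated into two cells. The base MDP, the uniform μ, and the use of the expected immediate reward per (cell, action) in place of the cell-to-cell r’ are all illustrative assumptions.

```python
import numpy as np

# Base MDP on 4 states (assumed layout P[a,x,y], R[a,x,y]); cells {0,1} and {2,3}.
rng = np.random.default_rng(2)
nA, nX, gamma = 2, 4, 0.9
P = rng.random((nA, nX, nX)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nA, nX, nX))
cells = [np.array([0, 1]), np.array([2, 3])]          # the partition Pi; S(x) = cell of x
mu = [np.full(2, 0.5), np.full(2, 0.5)]               # mu(.; B): uniform over each cell

nB = len(cells)
P_agg = np.zeros((nA, nB, nB))                         # p'(C|B,a)
r_agg = np.zeros((nA, nB))                             # expected one-step reward for (B,a)
for a in range(nA):
    for b, B in enumerate(cells):
        for c, C in enumerate(cells):
            P_agg[a, b, c] = sum(mu[b][i] * P[a, x, C].sum() for i, x in enumerate(B))
        r_agg[a, b] = sum(mu[b][i] * np.dot(P[a, x], R[a, x]) for i, x in enumerate(B))

# Value iteration on the aggregated MDP, then lift: V'_E(x) = V'(S(x)).
V = np.zeros(nB)
for _ in range(2000):
    V = (r_agg + gamma * P_agg @ V).max(axis=0)
V_lifted = np.array([V[0], V[0], V[1], V[1]])
print("lifted aggregated value function:", V_lifted)
```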
Action-Value Functions
L: B(X) → B(X × A),
(L V)(x,a) = Σy p(y|x,a) { r(x,a,y) + γ V(y) }.
“One-step lookahead”.
Note: π is greedy w.r.t. V if
(L V)(x, π(x)) = maxa (L V)(x,a).
Def: Q* = L V*.
Def: Let Max: B(X × A) → B(X),
(Max Q)(x) = maxa Q(x,a).
Note: Max L = T.
Corollary: Q* = L Max Q*.
Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
L Max: B(X × A) → B(X × A) is a γ-contraction (its fixed point is Q*)
Value iteration, policy iteration, …
36
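Since L Max is a γ-contraction with fixed point Q*, iterating it gives Q-value iteration. A minimal sketch on the toy MDP; storing Q as Q[a, x] is an assumed layout.

```python
import numpy as np

# Toy MDP (assumed layout): P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma = 0.9

def L(V):
    """One-step lookahead: (L V)(x,a) = sum_y p(y|x,a)[r(x,a,y) + gamma V(y)]."""
    return (P * (R + gamma * V[None, None, :])).sum(axis=2)   # shape (|A|, |X|)

def Max(Q):
    """(Max Q)(x) = max_a Q(x, a)."""
    return Q.max(axis=0)

# Q-value iteration: iterate the gamma-contraction L Max, whose fixed point is Q*.
Q = np.zeros((P.shape[0], P.shape[1]))
for _ in range(500):
    Q = L(Max(Q))

greedy = Q.argmax(axis=0)          # a greedy policy w.r.t. Q*
print("Q* =\n", Q, "\ngreedy policy:", greedy)
```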
Changing Granularity
Asynchronous Value Iteration:
Every time-step update only a few states
AsyncVI Theorem: If all states are updated infinitely often,
the algorithm converges to V*.
How to use?
Prioritized Sweeping
IPS [McMahan & Gordon ’05]:
Instead of an update, put state on the priority queue
When picking a state from the queue, update it
Put predecessors on the queue
Theorem: Equivalent to Dijkstra on shortest path problems,
provided that rewards are non-positive
LRTA* [Korf ’90] ~ RTDP [Barto, Bradtke, Singh ’95]
Focussing on parts of the state that matter
Constraints:
Same problem solved from several initial positions
Decisions have to be fast
Idea: Update values along the paths
37
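A minimal sketch of the simplest asynchronous scheme: in-place (Gauss-Seidel style) value iteration that updates one state at a time in random order; prioritized sweeping would instead pick the next state from a priority queue. The toy MDP is again an assumption.

```python
import numpy as np

# Toy MDP (assumed layout): P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y).
rng = np.random.default_rng(3)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
gamma, nX = 0.9, 2

V = np.zeros(nX)
for sweep in range(200):
    # Asynchronous updates: one state at a time, in random order,
    # immediately reusing the freshest values of the other states.
    for x in rng.permutation(nX):
        V[x] = max((P[a, x] * (R[a, x] + gamma * V)).sum() for a in range(P.shape[0]))

print("V* ~", V)
```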
Changing Granularity
Generalized Policy Iteration:
Partial evaluation and partial improvement of policies
Multi-step lookahead improvement
AsyncPI Theorem: If both evaluation and
improvement happens at every state
infinitely often then the process converges to
an optimal policy.
[Williams & Baird ’93]
38
Variations of a theme
[SzeLi99]
Game against nature [Heger ’94]:
infw Σt γ^t Rt(w) with X0 = x
Risk-sensitive criterion:
log ( E[ exp( Σt γ^t Rt ) | X0 = x ] )
Stochastic Shortest Path
Average Reward
Markov games
Simultaneous action choices (Rock-paper-scissors)
Sequential action choices
Zero-sum (or not)
39
References
[Howard ’60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960.
[Gordon ’95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins ’90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD Thesis, 1990.
[BT96] D.P. Bertsekas and J.N. Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996.
[McMahan, Gordon ’05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS.
[Korf ’90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh ’95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming, Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird ’93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-93-14, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, Neural Computation, 11, 2017–2059, 1999.
[Heger ’94] M. Heger: Consideration of risk in reinforcement learning, ICML, 105–111, 1994.
40