Approximate Dynamic Programming and Policy Search: Does anything work?

Rutgers Applied Probability Workshop, June 6, 2014
Warren B. Powell and Daniel R. Jiang
with contributions from: Daniel Salas, Vincent Pham, Warren Scott
© 2014 Warren B. Powell, Princeton University
Storage problems
How much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?
Storage problems
How much money should we hold in cash given variable market returns and interest rates to meet the needs of a business?
[Figure: stock prices and bonds.]
Storage problems
Elements of a “storage problem”:
» Controllable scalar state giving the amount in storage:
  • Decisions may be to deposit money, charge a battery, chill the water, or release water from the reservoir.
  • There may also be exogenous changes (deposits/withdrawals).
» Multidimensional “state of the world” variable that evolves exogenously:
  • Prices
  • Interest rates
  • Weather
  • Demand/loads
» Other features:
  • The problem may be time-dependent (and finite horizon) or stationary.
  • We may have access to forecasts of the future.
Storage problems

Dynamics are captured by the transition function S_{t+1} = S^M(S_t, x_t, W_{t+1}):
» Controllable resource (scalar): money in the bank, water in the reservoir, energy in a chilled water tank:

    R_{t+1} = R_t + A x_t + R̂_{t+1}

» Exogenous state variables:

    p_{t+1} = p_t + p̂_{t+1}              (spot prices)
    E^wind_{t+1} = E^wind_t + Ê^wind_{t+1}  (energy from wind)
    D_{t+1} = D_t + D̂_{t+1}              (demand)
Stochastic optimization models

The objective function:

    min_π E { Σ_{t=0}^T γ^t C(S_t, X^π(S_t)) }

where the min over π means finding the best policy, the expectation is over all random outcomes, S_t is the state variable, C(S_t, x_t) is the cost function, and X^π(S_t) is the decision function (policy). The state evolves according to S_{t+1} = S^M(S_t, x_t, W_{t+1}).

» With deterministic problems, we want to find the best decision.
» With stochastic problems, we want to find the best function (policy) for making a decision.
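The distinction between finding a decision and finding a policy can be made concrete with a small simulation: a policy is a rule mapping states to decisions, and its objective value is estimated by averaging simulated costs. The price model, cost function, and buy/sell threshold policy below are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(R, price, theta):
    """Buy one unit when the price is below theta[0]; sell when above theta[1]."""
    if price < theta[0]:
        return 1.0
    if price > theta[1] and R >= 1.0:
        return -1.0
    return 0.0

def F(theta, T=100, n_reps=200):
    """Estimate E sum_t C(S_t, X(S_t)) for the threshold policy by simulation."""
    total = 0.0
    for _ in range(n_reps):
        R = 0.0
        for t in range(T):
            price = 30.0 + 10.0 * rng.standard_normal()  # assumed price model
            x = policy(R, price, theta)
            total += price * x          # cost: buying costs money, selling earns
            R = max(0.0, R + x)         # storage transition S^M
    return total / n_reps
```

Searching over the parameter vector `theta` rather than over individual decisions is exactly the shift from deterministic to stochastic optimization described above.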
Four classes of policies

1) Policy function approximations (PFAs)
   » Lookup tables, rules, parametric functions
2) Cost function approximations (CFAs)

   » X^CFA(S_t) = argmin_{x_t ∈ X_t} C(S_t, x_t)

3) Policies based on value function approximations (VFAs)

   » X^VFA_t(S_t) = argmin_{x_t} ( C(S_t, x_t) + V̄^x_t(S^x_t) )

4) Lookahead policies (a.k.a. model predictive control)
   » Deterministic lookahead (rolling horizon procedures):

     X^LA-D_t(S_t) = argmin_{x_tt, x_{t,t+1}, ..., x_{t,t+T}} [ C(S_tt, x_tt) + Σ_{t'=t+1}^{t+T} γ^{t'-t} C(S_{tt'}, x_{tt'}) ]

   » Stochastic lookahead (stochastic programming, MCTS):

     X^LA-S_t(S_t) = argmin_{x_tt, x_{t,t+1}(ω), ..., x_{t,t+T}(ω)} [ C(S_tt, x_tt) + Σ_{ω ∈ Ω_t} p(ω) Σ_{t'=t+1}^{t+T} γ^{t'-t} C(S_{tt'}(ω), x_{tt'}(ω)) ]
Value iteration

Classical backward dynamic programming:

    V_t(S_t) = max_{x_t} ( C(S_t, x_t) + γ E{ V_{t+1}(S_{t+1}) | S_t } )

» The three curses of dimensionality: the state space (S_t), the outcome space (the expectation), and the action space (the max over x_t).
A storage problem

Energy storage with stochastic prices, supplies and demands:

    E^wind_{t+1} = E^wind_t + Ê^wind_{t+1}       (energy from wind)
    P^grid_{t+1} = P^grid_t + P̂^grid_{t+1}       (spot prices)
    D^load_{t+1} = D^load_t + D̂^load_{t+1}       (demand)
    R^battery_{t+1} = R^battery_t + A x_t         (energy in storage)

State variable: S_t = (R^battery_t, E^wind_t, P^grid_t, D^load_t)
Controllable inputs: x_t (flows between wind, grid, load and battery)
Exogenous inputs: (Ê^wind_{t+1}, P̂^grid_{t+1}, D̂^load_{t+1})
A storage problem

Bellman’s optimality equation:

    V_t(S_t) = min_{x_t ∈ X_t} ( C(S_t, x_t) + γ E{ V_{t+1}(S_{t+1}) | S_t } )

where S_t = (R^battery_t, E^wind_t, P^grid_t, D^load_t) and the expectation is over the exogenous information W_{t+1} = (Ê^wind_{t+1}, P̂^grid_{t+1}, D̂^load_{t+1}).
Managing a water reservoir

Backward dynamic programming in one dimension:
Step 0: Initialize V_{T+1}(R_{T+1}) = 0 for R_{T+1} = 0, 1, ..., 100.
Step 1: Step backward t = T, T-1, T-2, ...
Step 2: Loop over R_t = 0, 1, ..., 100.
Step 3: Loop over all decisions 0 ≤ x_t ≤ R_t.
Step 4: Take the expectation over all rainfall levels (also discretized):

    Q_t(R_t, x_t) = C_t(R_t, x_t) + Σ_{w=0}^{100} V_{t+1}( min(R^max, R_t - x_t + w) ) P(W_{t+1} = w)

End Step 4; End Step 3.
Find V*_t(R_t) = max_{x_t} Q_t(R_t, x_t).
Store X*_t(R_t) = argmax_{x_t} Q_t(R_t, x_t). (This is our policy.)
End Step 2; End Step 1.
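The backward recursion above can be sketched directly. The contribution function and the uniform rainfall distribution below are illustrative assumptions; the discretization (storage and rainfall on 0..100) follows the slide.

```python
import numpy as np

R_MAX, T = 100, 20
rain = np.arange(R_MAX + 1)                      # discretized rainfall levels
p_rain = np.full(R_MAX + 1, 1.0 / (R_MAX + 1))   # assumed uniform rainfall

def contribution(R_t, x_t):
    return 10.0 * x_t        # assumed reward per unit of water released

V = np.zeros((T + 2, R_MAX + 1))                 # Step 0: V_{T+1}(.) = 0
X_star = np.zeros((T + 1, R_MAX + 1), dtype=int)

for t in range(T, -1, -1):                       # Step 1: step backward
    for R_t in range(R_MAX + 1):                 # Step 2: loop over states
        best_val, best_x = -np.inf, 0
        for x_t in range(R_t + 1):               # Step 3: loop over 0 <= x_t <= R_t
            # Step 4: expectation over discretized rainfall W
            R_next = np.minimum(R_MAX, R_t - x_t + rain)
            val = contribution(R_t, x_t) + p_rain @ V[t + 1, R_next]
            if val > best_val:
                best_val, best_x = val, x_t
        V[t, R_t] = best_val                     # V*_t(R_t)
        X_star[t, R_t] = best_x                  # the policy
```

Note how the single controllable dimension keeps the state loop at 101 entries; this is what breaks in the multidimensional version that follows.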
Managing cash in a mutual fund

Dynamic programming in multiple dimensions:
Step 0: Initialize V_{T+1}(S_{T+1}) = 0 for all states.
Step 1: Step backward t = T, T-1, T-2, ...
Step 2: Loop over all states S_t (four nested loops, one per dimension).
Step 3: Loop over all decisions x_t (now a vector).
Step 4: Take the expectation over each random dimension:

    Q_t(S_t, x_t) = C_t(S_t, x_t) + Σ_{w_1=0}^{100} Σ_{w_2=0}^{100} Σ_{w_3=0}^{100} V_{t+1}( S^M(S_t, x_t, (w_1, w_2, w_3)) ) P(Ŵ_{t+1} = (w_1, w_2, w_3))

End Step 4; End Step 3.
Find V*_t(S_t) = max_{x_t} Q_t(S_t, x_t).
Store X*_t(S_t) = argmax_{x_t} Q_t(S_t, x_t). (This is our policy.)
End Step 2; End Step 1.
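A quick count shows why these nested loops are hopeless at this discretization. The sizes below are illustrative assumptions (four state dimensions and three random dimensions, each on 101 points, with 101 candidate decisions):

```python
T = 100
n_states = 101 ** 4        # (R_t, p_t, ...) swept by Step 2
n_decisions = 101          # Step 3
n_outcomes = 101 ** 3      # Step 4: expectation over (w_1, w_2, w_3)
ops = T * n_states * n_decisions * n_outcomes
print(f"{ops:.1e} inner-loop evaluations")   # on the order of 1e18
```

This multiplicative growth in states, outcomes and actions is the three curses of dimensionality in one line.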
Approximate dynamic programming
Algorithmic strategies:
» Approximate value iteration
  • Mimics backward dynamic programming
» Approximate policy iteration
  • Mimics policy iteration
» Policy search
  • Based on the field of stochastic search
Approximate value iteration

Step 1: Start with a pre-decision state S^n_t.
Step 2: Solve the deterministic optimization using an approximate value function:

    v̂^n_t = min_{x_t} ( C(S^n_t, x_t) + V̄^{n-1}_t(S^{M,x}(S^n_t, x_t)) )

Step 3: Update the value function approximation using recursive statistics:

    V̄^n_{t-1}(S^n_{t-1}) = (1 - α_{n-1}) V̄^{n-1}_{t-1}(S^n_{t-1}) + α_{n-1} v̂^n_t

Step 4: Simulate W_{t+1}(ω^n) to compute the next pre-decision state:

    S^n_{t+1} = S^M(S^n_t, x^n_t, W_{t+1}(ω^n))

Step 5: Return to Step 1. (This is “on-policy” learning.)
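A minimal sketch of this loop for a toy storage problem, using a lookup-table approximation and a 1/n stepsize. For simplicity the update here is applied at the pre-decision state, and the model (sell up to R_t units at a fixed reward, random replenishment) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
R_MAX, T, N = 20, 10, 500
Vbar = np.zeros((T + 2, R_MAX + 1))      # lookup-table value approximations

for n in range(N):
    R = int(rng.integers(0, R_MAX + 1))  # Step 1: initial pre-decision state
    for t in range(T + 1):
        # Step 2: deterministic optimization against Vbar_{t+1}
        xs = np.arange(R + 1)
        vals = 10.0 * xs + Vbar[t + 1, R - xs]
        x = int(xs[np.argmax(vals)])
        vhat = float(vals.max())
        # Step 3: smooth vhat into the approximation (recursive statistics)
        alpha = 1.0 / (n + 1)
        Vbar[t, R] = (1 - alpha) * Vbar[t, R] + alpha * vhat
        # Step 4: simulate W to get the next pre-decision state ("on policy")
        W = int(rng.integers(0, 3))
        R = min(R_MAX, R - x + W)
```

The forward trajectory only updates states the current policy visits, which is exactly the exploration issue discussed later in the talk.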
Approximate value iteration

Step 1: Start with a pre-decision state S^n_t.
Step 2: Solve the deterministic optimization using a value function approximated with basis functions:

    v̂^n_t = min_{x_t} ( C(S^n_t, x_t) + Σ_f θ^{n-1}_f φ_f(S^{M,x}(S^n_t, x_t)) )

Step 3: Update the value function approximation using recursive statistics:

    V̄^n_{t-1}(S^n_{t-1}) = (1 - α_{n-1}) V̄^{n-1}_{t-1}(S^n_{t-1}) + α_{n-1} v̂^n_t

Step 4: Simulate W_{t+1}(ω^n) to compute the next pre-decision state:

    S^n_{t+1} = S^M(S^n_t, x^n_t, W_{t+1}(ω^n))

Step 5: Return to Step 1.
Approximate value iteration
[Figure: the true (discretized) value function.]
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Approximate dynamic programming

Classical approximate dynamic programming:
» We can estimate the value of being in a state using

    v̂^n_t = min_x ( C(S^n_t, x) + Σ_f θ^{n-1}_f φ_f(S^{M,x}(S^n_t, x)) )

» Use linear regression to estimate θ^n from the observations v̂^n_t.
» Our policy is then given by

    X^π(S_t | θ^n) = argmin_x ( C(S_t, x) + Σ_f θ^n_f φ_f(S^{M,x}(S_t, x)) )

» This is known as Bellman error minimization, where

    Bellman error = V̄^{n-1}(S^n_t) - v̂^n = Σ_f θ^{n-1}_f φ_f(S^n_t) - v̂^n
Approximate dynamic programming
Least squares policy iteration (Lagoudakis and Parr):
» Bellman’s equation: V(S_t) = C(S_t, x_t) + γ E{ V(S_{t+1}) | S_t }
» … is equivalent to (for a fixed policy): Σ_f θ_f φ_f(S_t) = C(S_t, x_t) + γ E{ Σ_f θ_f φ_f(S_{t+1}) | S_t }
» Rearranging gives: C(S_t, x_t) = Σ_f θ_f ( φ_f(S_t) - γ E{ φ_f(S_{t+1}) | S_t } )
» … where “X” = φ(S_t) - γ E{ φ(S_{t+1}) | S_t } is our explanatory variable. But we cannot compute this exactly (due to the expectation), so we sample it.
Approximate dynamic programming
… in matrix form: C = (Φ_t - γΦ_{t+1}) θ + ε.
Approximate dynamic programming
First issue:
» In the sampled regression C_t = ( φ(S_t) - γ φ(S_{t+1}) )ᵀ θ + ε_t, the independent variable is built by sampling a state and computing its basis functions, then simulating the next state and computing its basis functions.
» Because the regressor itself contains simulation noise, this is an “errors-in-variables” model, which produces biased estimates; the standard remedy is instrumental variables.
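The bias, and the instrumental-variables fix, can be seen on a small simulated example. The chain, costs, and basis below are illustrative assumptions: with cost c(s) = s, uniform transitions on five states and γ = 0.9, the fixed-policy value function is V(s) = 18 + s, so the regression coefficients should approach (18, 1).

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n = 0.9, 5000

states = rng.integers(0, 5, n)
next_states = rng.integers(0, 5, n)      # uniform transitions (assumed chain)
costs = states.astype(float)             # c(s) = s

def phi(s):
    """Basis functions (1, s)."""
    return np.column_stack([np.ones(len(s)), np.asarray(s, dtype=float)])

Phi, Phi_next = phi(states), phi(next_states)
X = Phi - gamma * Phi_next               # noisy explanatory variable

# OLS on the noisy regressor is biased (errors-in-variables):
theta_ols, *_ = np.linalg.lstsq(X, costs, rcond=None)
# Using Phi itself as the instrument gives a consistent estimate:
theta_iv = np.linalg.solve(Phi.T @ X, Phi.T @ costs)
```

Ordinary least squares attenuates the slope coefficient well below its true value of 1, while the instrumental-variables estimate lands near (18, 1).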
Approximate dynamic programming
Second issue:
» Bellman’s optimality equation written using basis functions:

    Σ_f θ_f φ_f(S_t) = max_x ( C(S_t, x) + γ E{ Σ_f θ_f φ_f(S_{t+1}) | S_t } )

» … does not possess a fixed point (result due to Van Roy and de Farias). This is the reason that classical Bellman error minimization using basis functions,

    min_θ E[ ( Σ_f θ_f φ_f(S_t) - max_x ( C(S_t, x) + γ E{ Σ_f θ_f φ_f(S_{t+1}) | S_t } ) )² ],

  does not work. Instead we have to use the projected Bellman error,

    min_θ E[ ( Σ_f θ_f φ_f(S_t) - Π max_x ( C(S_t, x) + γ E{ Σ_f θ_f φ_f(S_{t+1}) | S_t } ) )² ],

  where Π is the projection operator onto the space spanned by the basis functions.
Approximate dynamic programming
Surprising result:
» Theorem (W. Scott and W.B. Powell): Bellman error minimization using instrumental variables and projected Bellman error minimization are the same!
Optimizing storage
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Policy search

Finding the best policy (“policy search”):
» Assume our policy is given by

    X^π(S_t | θ) = argmin_{x_t} ( C(S_t, x_t) + Σ_f θ_f φ_f(S_t, x_t) )

  where the basis-function term acts as an error correction term.
» We then wish to find the parameter vector that minimizes the expected cost:

    min_θ F(θ) = E { Σ_{t=0}^T γ^t C(S_t, X^π(S_t | θ)) }
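A minimal sketch of direct policy search: simulate F(θ) and search over a grid of candidate parameters. The demand model, cost coefficients, and order-up-to policy class are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)

def F(theta, T=200):
    """Average per-period cost of the policy x_t = max(0, theta - R_t)."""
    R, total = 0.0, 0.0
    for _ in range(T):
        x = max(0.0, theta - R)              # the policy X(S_t | theta)
        D = rng.uniform(0.0, 10.0)           # random demand
        total += 2.0 * x + 5.0 * max(0.0, D - (R + x))  # order + shortage cost
        R = max(0.0, R + x - D)              # storage transition
    return total / T

# Plain stochastic search: average several simulations per candidate.
candidates = np.linspace(0.0, 10.0, 21)
scores = np.array([np.mean([F(th) for _ in range(20)]) for th in candidates])
best_theta = float(candidates[np.argmin(scores)])
```

The grid search stands in for any of the stochastic search methods listed on the next slide; only noisy evaluations of F(θ) are ever needed.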
Policy search
A number of fields work on this problem under different names:
» Stochastic search
» Simulation optimization
» Black box optimization
» Sequential kriging
» Global optimization
» Open loop control
» Optimization of expensive functions
» Bandit problems (for online learning)
» Optimal learning
Policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Approximate policy iteration

Step 1: Start with a pre-decision state S^n.
Step 2: Inner loop: do for m = 1, ..., M:
  Step 2a: Solve the deterministic optimization using an approximate value function:

      v̂^m = min_x ( C(S^m, x) + V̄^{n-1,m-1}(S^{M,x}(S^m, x)) )

  Step 2b: Update the value function approximation using recursive statistics:

      V̄^{n-1,m}(S^m) = (1 - α_{m-1}) V̄^{n-1,m-1}(S^m) + α_{m-1} v̂^m

  Step 2c: Simulate W(ω^m) to compute the next pre-decision state:

      S^{m+1} = S^M(S^m, x^m, W(ω^m))

Step 3: Set V̄^{n,0} = V̄^{n-1,M}, which defines the updated policy, and return to Step 1.
Approximate policy iteration
Machine learning methods (coded in R):
» SVR – support vector regression with Gaussian radial basis kernel
» LBF – weighted linear combination of polynomial basis functions
» GPR – Gaussian process regression with Gaussian RBF
» LPR – kernel smoothing with second-order local polynomial fit
» DC-R – Dirichlet clouds: local parametric regression
» TRE – regression trees with constant local fit
Approximate policy iteration
Test problem sets:
» Linear Gaussian control
  • L1 = linear quadratic regulation
  • The remaining problems are nonquadratic
» Finite horizon energy storage problems (Salas benchmark problems)
  • 100 time-period problems
  • Value functions are fitted for each time period

[Figure: percent of optimality on the linear Gaussian control problems for TRE (regression trees), LBF (linear basis functions), LPR (kernel smoothing), DC-R (local parametric regression), GPR (Gaussian process regression) and SVR (support vector regression).]

[Figure: percent of optimality on the energy storage applications for DC-R (local parametric regression), LPR (kernel smoothing), GPR (Gaussian process regression) and SVR (support vector regression).]
Approximate policy iteration
A tale of two distributions:
» The sampling distribution, which governs the likelihood that we sample a state.
» The learning distribution, which is the distribution of states we would visit given the current policy.
Approximate policy iteration

[Figure: the optimal value function, a quadratic fit to it, and the state distribution under the optimal policy.]

Now we are going to use the optimal policy to fit approximate value functions and watch the stability.
Approximate policy iteration

» Policy evaluation: 500 samples (the problem only has 31 states!).
» After 50 policy improvements with the optimal distribution: divergence in the sequence of VFAs, 40%-70% optimality.
» After 50 policy improvements with the uniform distribution: stable VFAs, 90% optimality.

[Figure: the state distribution under the optimal policy, and the VFAs estimated after 50 and 51 policy iterations.]
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Exploiting concavity

Bellman’s optimality equation:
» With the pre-decision state:

    V_t(S_t) = min_{x_t ∈ X_t} ( C(S_t, x_t) + γ E{ V_{t+1}(S_{t+1}) | S_t } )

» With the post-decision state:

    V_t(S_t) = min_{x_t ∈ X_t} ( C(S_t, x_t) + V̄^x_t(S^x_t) )

where S^x_t is the post-decision state (e.g., R_t, the inventory held over from the previous time period) and V̄^x_t(S^x_t) = E{ V_{t+1}(S_{t+1}) | S^x_t }.
Exploiting concavity
We update the piecewise linear value functions by computing estimates of slopes using a backward pass:
» The cost along the marginal path is the derivative of the simulation with respect to the flow perturbation.
Exploiting concavity
Derivatives are used to estimate a piecewise linear approximation V̄_t(R_t).
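The slope updates can be sketched as follows, in the spirit of the SPAR-style methods cited below: smooth a noisy sampled derivative into one segment, then project back onto decreasing slopes (concavity) by pooling adjacent violators. The true slope function and noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N_SEG = 10

def true_slope(r):
    return 10.0 - 2.0 * r              # concave: slopes decrease in storage

def project_concave(v):
    """Project onto nonincreasing slope vectors (pool adjacent violators)."""
    vals, cnts = [], []
    for x in v:
        vals.append(float(x))
        cnts.append(1)
        while len(vals) > 1 and vals[-2] < vals[-1]:   # violation: pool
            tot = vals[-1] * cnts[-1] + vals[-2] * cnts[-2]
            c = cnts[-1] + cnts[-2]
            vals[-2:], cnts[-2:] = [tot / c], [c]
    return np.array([val for val, c in zip(vals, cnts) for _ in range(c)])

v = np.zeros(N_SEG)                    # v[r] = slope of V on segment [r, r+1]
for n in range(2000):
    r = int(rng.integers(0, N_SEG))
    vhat = true_slope(r) + rng.normal(0.0, 1.0)        # sampled derivative
    alpha = 20.0 / (20.0 + n)
    v[r] = (1 - alpha) * v[r] + alpha * vhat           # smooth the observation
    v = project_concave(v)                             # restore concavity
```

Because the projection never lets the slopes cross, every intermediate approximation is concave, which is what makes the downstream optimization a simple (convex) problem.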
Exploiting concavity
Convergence results for piecewise linear, concave functions:
» Godfrey, G. and W.B. Powell, “An Adaptive, Distribution-Free Algorithm for the Newsvendor Problem with Censored Demands, with Application to Inventory and Distribution Problems,” Management Science, Vol. 47, No. 8, pp. 1101-1112 (2001).
» Topaloglu, H. and W.B. Powell, “An Algorithm for Approximating Piecewise Linear Concave Functions from Sample Gradients,” Operations Research Letters, Vol. 31, No. 1, pp. 66-76 (2003).
» Powell, W.B., A. Ruszczynski and H. Topaloglu, “Learning Algorithms for Separable Approximations of Stochastic Optimization Problems,” Mathematics of Operations Research, Vol. 29, No. 4, pp. 814-836 (2004).

Convergence results for storage problems:
» Nascimento, J. and W.B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for Concave, Scalar Storage Problems with Vector-Valued Controls,” IEEE Transactions on Automatic Control, Vol. 58, No. 12, pp. 2995-3010 (2013).
» Nascimento, J. and W.B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem,” Mathematics of Operations Research (2009).
Exploiting concavity
[Figure: percent of optimality achieved on each of 21 storage problems.]
Exploiting concavity
[Figure: the same results on a 95-101 scale, showing performance within a few percent of optimal on all 21 storage problems.]
Grid level storage
Grid level storage
ADP (blue) vs. LP optimal (black)
Exploiting concavity
The problem of dealing with the state of the world:
» Temperature, interest rates, …
Exploiting concavity
Active area of research. Key ideas center on different methods for clustering.
[Figure: clustering value functions around a query state.]
Lauren Hannah, W.B. Powell, D. Dunson, “Semi-Convex Regression for Metamodeling-Based Optimization,” SIAM J. on Optimization, Vol. 24, No. 2, pp. 573-597, (2014).
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Hour-ahead bidding

The bid is placed at 1pm, consisting of charge and discharge prices for the hour between 2pm and 3pm.

[Figure: a price path from 1pm to 3pm, with the discharge bid b^discharge_{1pm,2pm} drawn above the charge bid b^charge_{1pm,2pm}.]
A bidding problem

[Figure: the exact value function.]

A bidding problem

[Figure: the approximate value function without monotonicity.]

A bidding problem

[Figure: the approximate value function with monotonicity enforced.]
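Enforcing monotonicity can be sketched with the projection used in monotone ADP (in the spirit of the Jiang and Powell monotone-ADP algorithm): after smoothing a new observation into one state, clip the neighboring estimates so the approximation stays monotone in the storage dimension. The true value function and noise model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N_STATES = 30
V = np.zeros(N_STATES)                 # value estimates over storage levels

def true_value(s):
    return np.sqrt(s)                  # assumed monotone true value

for n in range(3000):
    s = int(rng.integers(0, N_STATES))
    vhat = true_value(s) + rng.normal(0.0, 0.5)    # noisy observation
    alpha = 10.0 / (10.0 + n)
    V[s] = (1 - alpha) * V[s] + alpha * vhat       # smooth it in
    # monotone projection: states above s can be no smaller, below no larger
    V[s + 1:] = np.maximum(V[s + 1:], V[s])
    V[:s] = np.minimum(V[:s], V[s])
```

A single noisy observation now informs every state the monotonicity constraint touches, which is why the structured lookup table learns so much faster than the unstructured one.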
Outline
» Least squares approximate policy iteration
» Direct policy search
» Approximate policy iteration using machine learning
» Exploiting concavity
» Exploiting monotonicity
» Closing thoughts
Observations
» Approximate value iteration using a linear model can produce very poor results under the best of circumstances, and potentially terrible results.
» Least squares approximate policy iteration, a highly regarded classic algorithm by Lagoudakis and Parr, works poorly.
» Approximate policy iteration is OK with support vector regression, but below expectation for such a simple problem.
» A basic lookup table by itself works poorly.
» A lookup table with structure works:
  • Convexity does very well, and does not require explicit exploration.
  • Monotonicity does require explicit exploration, and is limited to a very low-dimensional information state.
» So, we can conclude that nothing works reliably in a way that would scale to more complex problems!