
Approximate Dynamic Programming and Policy Search: Does anything work?

Rutgers Applied Probability Workshop June 6, 2014 Warren B. Powell Daniel R. Jiang with contributions from: Daniel Salas Vincent Pham Warren Scott

© 2014 Warren B. Powell, Princeton University

Storage problems

 How much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?

Storage problems

 How much money should we hold in cash given variable market returns and interest rates to meet the needs of a business?

[Figure: sample paths of stock prices and bonds.]

Storage problems

 Elements of a “storage problem” » Controllable scalar state giving the amount in storage: • Decision may be to deposit money, charge a battery, chill the water, release water from the reservoir.

• There may also be exogenous changes (deposits/withdrawals) » Multidimensional “state of the world” variable that evolves exogenously: • Prices • Interest rates • Weather • Demand/loads » Other features: • Problem may be time-dependent (and finite horizon) or stationary • We may have access to forecasts of the future

Storage problems

 Dynamics are captured by the transition function $S_{t+1} = S^M(S_t, x_t, W_{t+1})$:

» Controllable resource (scalar): $R_{t+1} = R_t + A_t x_t + \hat{R}_{t+1}$ (money in the bank, water in the reservoir, energy in a chilled water tank)

» Exogenous state variables:

• Spot prices: $p_{t+1} = p_t + \hat{p}_{t+1}$

• Energy from wind: $e^{wind}_{t+1} = e^{wind}_t + \hat{e}^{wind}_{t+1}$

• Demand/loads: $D_{t+1} = D_t + \hat{D}_{t+1}$

Stochastic optimization models

 The objective function

$\min_{\pi} \; \mathbb{E} \sum_{t=0}^{T} \gamma^t C\big(S_t, X^{\pi}(S_t)\big)$

where $S_{t+1} = S^M(S_t, x_t, W_{t+1})$.

(The expectation is over all random outcomes; we search over policies $\pi$; $S_t$ is the state variable, $C$ is the cost function, and $X^{\pi}$ is the decision function, or policy.)

» With deterministic problems, we want to find the best decision.

» With stochastic problems, we want to find the best function (policy) for making a decision.

Four classes of policies

1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric functions

2) Cost function approximations (CFAs)
» $X^{CFA}(S_t) = \arg\min_{x_t \in \mathcal{X}_t} \bar{C}(S_t, x_t)$

3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\min_{x_t} \Big( C(S_t, x_t) + \bar{V}^x_t\big(S^x_t(S_t, x_t)\big) \Big)$

4) Lookahead policies (a.k.a. model predictive control)
» Deterministic lookahead (rolling horizon procedures):
$X^{LA\text{-}D}_t(S_t) = \arg\min_{x_{tt}, x_{t,t+1}, \ldots, x_{t,t+T}} \Big( C(S_{tt}, x_{tt}) + \sum_{t'=t+1}^{t+T} \gamma^{t'-t}\, C(S_{tt'}, x_{tt'}) \Big)$
» Stochastic lookahead (stochastic programming, MCTS):
$X^{LA\text{-}S}_t(S_t) = \arg\min_{x_{tt}, x_{t,t+1}, \ldots, x_{t,t+T}} \Big( C(S_{tt}, x_{tt}) + \sum_{\omega \in \Omega_t} p(\omega) \sum_{t'=t+1}^{t+T} \gamma^{t'-t}\, C\big(S_{tt'}(\omega), x_{tt'}(\omega)\big) \Big)$

Value iteration

 Classical backward dynamic programming

$V_t(S_t) = \max_{x} \Big( C(S_t, x) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t \big] \Big)$

» The three curses of dimensionality: the state space, the action space, and the outcome space.

A storage problem

 Energy storage with stochastic prices, supplies and demands.

State variable: $S_t = \big(R^{battery}_t, E^{wind}_t, P^{grid}_t, D^{load}_t\big)$

Controllable inputs: $x_t$

Exogenous inputs: $\big(\hat{E}^{wind}_{t+1}, \hat{P}^{grid}_{t+1}, \hat{D}^{load}_{t+1}\big)$

Transition to $S_{t+1}$:

$R^{battery}_{t+1} = R^{battery}_t + A_t x_t$

$E^{wind}_{t+1} = E^{wind}_t + \hat{E}^{wind}_{t+1}$

$P^{grid}_{t+1} = P^{grid}_t + \hat{P}^{grid}_{t+1}$

$D^{load}_{t+1} = D^{load}_t + \hat{D}^{load}_{t+1}$

A storage problem

 Bellman’s optimality equation

$V_t(S_t) = \min_{x_t \in \mathcal{X}_t} \Big( C(S_t, x_t) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t \big] \Big)$

with state $S_t = \big(R^{battery}_t, E^{wind}_t, P^{grid}_t, D^{load}_t\big)$, decision $x_t$, and exogenous information $W_{t+1} = \big(\hat{E}^{wind}_{t+1}, \hat{P}^{grid}_{t+1}, \hat{D}^{load}_{t+1}\big)$, where $S_{t+1} = S^M(S_t, x_t, W_{t+1})$.

Managing a water reservoir

 Backward dynamic programming in one dimension

Step 0: Initialize $V_{T+1}(R_{T+1}) = 0$ for $R_{T+1} = 0, 1, \ldots, 100$

Step 1: Step backward $t = T, T-1, T-2, \ldots$

Step 2: Loop over $R_t = 0, 1, \ldots, 100$

Step 3: Loop over all decisions $0 \le x_t \le R_t$

Step 4: Take the expectation over all rainfall levels (also discretized):

$Q_t(R_t, x_t) = C(R_t, x_t) + \sum_{w=0}^{100} V_{t+1}\big(\min\{R^{max}, R_t - x_t + w\}\big)\, P^W(w)$

End Step 4; End Step 3;

Find $V^*_t(R_t) = \max_{x_t} Q_t(R_t, x_t)$

Store $X^*_t(R_t) = \arg\max_{x_t} Q_t(R_t, x_t)$. (This is our policy)

End Step 2; End Step 1;
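The backward recursion above maps directly to code. Below is a minimal sketch in Python, assuming a hypothetical one-period contribution C(R, x) and a uniform discretized rainfall distribution; the names and constants are illustrative, not from the slides.

```python
import numpy as np

R_MAX, T = 100, 24                              # storage capacity and horizon (illustrative)
rain_prob = np.ones(R_MAX + 1) / (R_MAX + 1)    # P^W(w): discretized rainfall distribution (assumed uniform)

def C(R, x):
    """Hypothetical one-period contribution of releasing x units from storage level R."""
    return 1.0 * x

V = np.zeros((T + 2, R_MAX + 1))                # Step 0: V[T+1, :] = 0
policy = np.zeros((T + 1, R_MAX + 1), dtype=int)

for t in range(T, -1, -1):                      # Step 1: step backward in time
    for R in range(R_MAX + 1):                  # Step 2: loop over storage levels
        best_val, best_x = -np.inf, 0
        for x in range(R + 1):                  # Step 3: loop over feasible releases 0 <= x <= R
            # Step 4: expectation over discretized rainfall levels w
            next_R = np.minimum(R_MAX, R - x + np.arange(R_MAX + 1))
            q = C(R, x) + rain_prob @ V[t + 1, next_R]
            if q > best_val:
                best_val, best_x = q, x
        V[t, R] = best_val                      # V_t*(R_t) = max_x Q_t(R_t, x_t)
        policy[t, R] = best_x                   # X_t*(R_t) = argmax_x Q_t(R_t, x_t)  (this is our policy)
```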

Managing cash in a mutual fund

 Dynamic programming in multiple dimensions

Step 0: Initialize $V_{T+1}(S_{T+1}) = 0$ for all states.

Step 1: Step backward $t = T, T-1, T-2, \ldots$

Step 2: Loop over all states $S_t$ (four nested loops, one per dimension)

Step 3: Loop over all decisions $x_t$ ($x_t$ is a vector)

Step 4: Take the expectation over each random dimension; compute

$Q_t(S_t, x_t) = C(S_t, x_t) + \sum_{w_1=0}^{100} \sum_{w_2=0}^{100} \sum_{w_3=0}^{100} V_{t+1}\big(S^M(S_t, x_t, (w_1, w_2, w_3))\big)\, P^W(w_1, w_2, w_3)$

End Step 4; End Step 3;

Find $V^*_t(S_t) = \max_{x_t} Q_t(S_t, x_t)$

Store $X^*_t(S_t) = \arg\max_{x_t} Q_t(S_t, x_t)$. (This is our policy)

End Step 2; End Step 1;
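A rough operation count makes the curses of dimensionality concrete. Assuming 101 discretization levels per dimension as above, each time period requires on the order of

$\underbrace{101^4}_{\text{states}} \times \underbrace{|\mathcal{X}_t|}_{\text{decisions}} \times \underbrace{101^3}_{\text{outcomes}} \;\approx\; 10^{14} \times |\mathcal{X}_t|$

evaluations, versus roughly $101 \times 101 \times 101 \approx 10^6$ for the scalar reservoir problem.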

Approximate dynamic programming

 Algorithmic strategies: » Approximate value iteration • Mimics backward dynamic programming » Approximate policy iteration • Mimics policy iteration » Policy search • Based on the field of stochastic search

Approximate value iteration

Step 1: Start with a pre-decision state $S^n_t$

Step 2: Solve the deterministic optimization using an approximate value function:

$\hat{v}^n_t = \min_{x_t} \Big( C(S^n_t, x_t) + \bar{V}^{n-1}_t\big(S^{M,x}(S^n_t, x_t)\big) \Big)$

to obtain $x^n_t$. (Deterministic optimization)

Step 3: Update the value function approximation:

$\bar{V}^n_{t-1}(S^n_{t-1}) = (1 - \alpha_{n-1})\, \bar{V}^{n-1}_{t-1}(S^n_{t-1}) + \alpha_{n-1}\, \hat{v}^n_t$

(Recursive statistics)

Step 4: Obtain a Monte Carlo sample of $W_{t+1}(\omega^n)$ and compute the next pre-decision state:

$S^n_{t+1} = S^M\big(S^n_t, x^n_t, W_{t+1}(\omega^n)\big)$

(Simulation)

Step 5: Return to step 1. (“On-policy learning”)

Approximate value iteration

Step 1: Start with a pre-decision state $S^n_t$

Step 2: Solve the deterministic optimization using an approximate value function expressed with basis functions:

$\hat{v}^n_t = \min_{x_t} \Big( C(S^n_t, x_t) + \sum_f \theta_f\, \phi_f\big(S^{M,x}(S^n_t, x_t)\big) \Big)$

to obtain $x^n_t$. (Deterministic optimization)

Step 3: Update the value function approximation:

$\bar{V}^n_{t-1}(S^n_{t-1}) = (1 - \alpha_{n-1})\, \bar{V}^{n-1}_{t-1}(S^n_{t-1}) + \alpha_{n-1}\, \hat{v}^n_t$

(Recursive statistics)

Step 4: Obtain a Monte Carlo sample of $W_{t+1}(\omega^n)$ and compute the next pre-decision state:

$S^n_{t+1} = S^M\big(S^n_t, x^n_t, W_{t+1}(\omega^n)\big)$

(Simulation)

Step 5: Return to step 1.

Approximate value iteration

 The true (discretized) value function

Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Approximate dynamic programming

 Classical approximate dynamic programming

» We can estimate the value of being in a state using

$\hat{v}^n_t = \min_{x} \Big( C(S^n_t, x) + \sum_f \theta_f\, \phi_f\big(S^x_t(S^n_t, x)\big) \Big)$

» Use linear regression to estimate $\theta^n$ from the observations $\hat{v}^n_t$ and the basis functions $\phi_f(S^n_t)$.

» Our policy is then given by

$X^{\pi}(S_t \mid \theta^n) = \arg\min_{x} \Big( C(S_t, x) + \sum_f \theta^n_f\, \phi_f\big(S^x_t(S_t, x)\big) \Big)$

» This is known as Bellman error minimization:

Bellman error $= \bar{V}^{n-1}(S^n_t) - \hat{v}^n = \sum_f \theta_f\, \phi_f(S^n_t) - \hat{v}^n$
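A minimal sketch of the regression step just described: collect sampled states with their observed values $\hat{v}^n$ and fit $\theta$ by least squares. The basis functions and data below are hypothetical, for illustration only.

```python
import numpy as np

def basis(S):
    """Hypothetical basis functions phi_f(S) for a scalar state."""
    return np.array([1.0, S, S**2])

# Suppose we have collected sampled states S^n and observed values v_hat^n
S_samples = np.linspace(0.0, 100.0, 50)
v_hat = 5.0 * S_samples - 0.02 * S_samples**2 + np.random.default_rng(0).normal(0, 1, 50)

Phi = np.vstack([basis(S) for S in S_samples])          # design matrix of basis functions
theta, *_ = np.linalg.lstsq(Phi, v_hat, rcond=None)     # theta = argmin || Phi theta - v_hat ||^2

def V_approx(S, theta=theta):
    """Value function approximation sum_f theta_f phi_f(S)."""
    return basis(S) @ theta
```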

Approximate dynamic programming

 Least squares policy iteration (Lagoudakis and Parr) » Bellman's equation: $V(S_t) = \min_x \big( C(S_t, x) + \gamma\, \mathbb{E}[ V(S_{t+1}) \mid S_t ] \big)$ » … is equivalent to (for a fixed policy $\pi$): $V^{\pi}(S_t) = C\big(S_t, X^{\pi}(S_t)\big) + \gamma\, \mathbb{E}[ V^{\pi}(S_{t+1}) \mid S_t ]$ » Rearranging gives: $C\big(S_t, X^{\pi}(S_t)\big) = V^{\pi}(S_t) - \gamma\, \mathbb{E}[ V^{\pi}(S_{t+1}) \mid S_t ]$ » … where the right-hand side plays the role of our explanatory variable “X”. But we cannot compute this exactly (due to the expectation), so we sample it.

Approximate dynamic programming

 … in matrix form: $C_t = (\Phi_t - \gamma\, \Phi_{t+1})\, \theta + \varepsilon$, where row $n$ of $\Phi_t$ holds the basis functions of the sampled state and row $n$ of $\Phi_{t+1}$ holds the basis functions of the simulated next state.

Approximate dynamic programming

 First issue:

$C_t = \big(\phi(S_t) - \gamma\, \phi(S_{t+1})\big)^{\mathsf T} \theta + \varepsilon_t$

» The independent variable requires that we sample a state and compute its basis functions, then simulate the next state and compute its basis functions.

» This is known as an “errors-in-variables” model, which produces biased estimates of $\theta$; the standard remedy is instrumental variables.
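A minimal sketch contrasting an ordinary least squares fit with an instrumental-variables fit on simulated data, using the basis functions of the current state as instruments; the dynamics, cost, and features are illustrative assumptions. (The theorem quoted later in the deck relates the instrumental-variables estimator to projected Bellman error minimization.)

```python
import numpy as np

gamma = 0.95
rng = np.random.default_rng(1)

def phi(S):
    """Hypothetical basis functions of the state."""
    return np.array([1.0, S, S**2])

# Simulated trajectory under a fixed policy (purely illustrative dynamics and cost)
S = 50.0
rows_X, rows_Z, costs = [], [], []
for _ in range(5000):
    S_next = 0.9 * S + rng.normal(0, 5)          # next state from the simulator
    c = 0.1 * S**2                               # observed one-period cost
    rows_X.append(phi(S) - gamma * phi(S_next))  # noisy explanatory variable (errors in variables)
    rows_Z.append(phi(S))                        # instrument: basis functions of the current state
    costs.append(c)
    S = S_next

X, Z, c = np.vstack(rows_X), np.vstack(rows_Z), np.array(costs)

theta_ols = np.linalg.lstsq(X, c, rcond=None)[0]    # ordinary least squares: biased under errors in variables
theta_iv = np.linalg.solve(Z.T @ X, Z.T @ c)        # instrumental-variables estimator
```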

Approximate dynamic programming

 Second issue:

» Bellman's optimality equation written using basis functions:

$\sum_f \theta_f\, \phi_f(S_t) = \max_{x} \Big( C(S_t, x) + \gamma\, \mathbb{E}\Big[ \sum_f \theta_f\, \phi_f(S_{t+1}) \Big] \Big)$

» … does not possess a fixed point (result due to Van Roy and de Farias). This is the reason that classical Bellman error minimization using basis functions:

$\min_{\theta}\, \Big\| \Phi\theta - \max_{x}\big( C + \gamma\, \mathbb{E}[\Phi\theta] \big) \Big\|^2$

… does not work. Instead we have to use the projected Bellman error:

$\min_{\theta}\, \Big\| \Phi\theta - \Pi\, \max_{x}\big( C + \gamma\, \mathbb{E}[\Phi\theta] \big) \Big\|^2$

where $\Pi$ is the projection operator onto the space spanned by the basis functions.

Approximate dynamic programming

 Surprising result: »

Theorem (W. Scott and W.B.P.)

Bellman error minimization using instrumental variables and projected Bellman error minimization are the same!

Optimizing storage

For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm


Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Policy search

 Finding the best policy (“policy search”)

» Assume our policy is given by

$X^{\pi}(S_t \mid \theta) = \arg\min_{x} \Big( C(S_t, x) + \sum_f \theta_f\, \phi_f(S_t, x) \Big)$

where the basis-function term acts as an error correction term.

» We wish to find the parameter vector $\theta$ that solves

$\min_{\theta}\; F(\theta) = \mathbb{E} \sum_{t=0}^{T} \gamma^t\, C\big(S_t, X^{\pi}(S_t \mid \theta)\big)$

Policy search

 A number of fields work on this problem under different names: » Stochastic search » Simulation optimization » Black box optimization » Sequential kriging » Global optimization » Open loop control » Optimization of expensive functions » Bandit problems (for on-line learning) » Optimal learning

Policy search

For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm


Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Approximate policy iteration

Step 1: Start with a pre-decision state $S^n_t$

Step 2: Inner loop: do for $m = 1, \ldots, M$:

Step 2a: Solve the deterministic optimization using an approximate value function:

$\hat{v}^m = \min_{x} \Big( C(S^m, x) + \bar{V}^{n-1}\big(S^{M,x}(S^m, x)\big) \Big)$

to obtain $x^m$.

Step 2b: Update the value function approximation:

$\bar{V}^{n-1,m}(S^m) = (1 - \alpha_{m-1})\, \bar{V}^{n-1,m-1}(S^m) + \alpha_{m-1}\, \hat{v}^m$

Step 2c: Obtain a Monte Carlo sample of $W(\omega^m)$ and compute the next pre-decision state:

$S^{m+1} = S^M\big(S^m, x^m, W(\omega^m)\big)$

Step 3: Update the policy using $\bar{V}^n = \bar{V}^{n-1,M}$ and return to Step 1.

Approximate policy iteration

 Machine learning methods (coded in R):

» SVR – Support vector regression with Gaussian radial basis kernel

» LBF – Weighted linear combination of polynomial basis functions

» GPR – Gaussian process regression with Gaussian RBF

» LPR – Kernel smoothing with second-order local polynomial fit

» DC-R – Dirichlet clouds: local parametric regression

» TRE – Regression trees with constant local fit

Approximate policy iteration

 Test problem sets » Linear Gaussian control • L1 = Linear quadratic regulation • Remaining problems are nonquadratic » Finite horizon energy storage problems (Salas benchmark problems) • 100 time-period problems • Value functions are fitted for each time period

[Chart: percent of optimal on the Linear Gaussian control problems for TRE (regression trees), LBF (linear basis functions), LPR (kernel smoothing), DC-R (local parametric regression), GPR (Gaussian process regression) and SVR (support vector regression).]

[Chart: percent of optimal on the energy storage applications for DC-R (local parametric regression), LPR (kernel smoothing), GPR (Gaussian process regression) and SVR (support vector regression).]

Approximate policy iteration

 A tale of two distributions

» The sampling distribution, which governs the likelihood that we sample a state.

» The learning distribution, which is the distribution of states we would visit given the current policy.

[Figure: value function approximations plotted against the state $S$.]

Approximate policy iteration

 Using the optimal value function

[Figure: the optimal value function, a quadratic fit to it, and the state distribution under the optimal policy.]

Now we are going to use the optimal policy to fit approximate value functions and watch the stability.

Approximate policy iteration

[Figure: state distribution under the optimal policy, with the VFAs estimated after 50 and 51 policy iterations.]

• Policy evaluation: 500 samples (the problem only has 31 states!)

• After 50 policy improvements with the optimal distribution: divergence in the sequence of VFAs, 40%-70% optimality.

• After 50 policy improvements with the uniform distribution: stable VFAs, 90% optimality.

Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Exploiting concavity

 Bellman's optimality equation

» With the pre-decision state:

$V_t(S_t) = \min_{x_t \in \mathcal{X}_t} \Big( C(S_t, x_t) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t \big] \Big)$

» With the post-decision state:

$V_t(S_t) = \min_{x_t \in \mathcal{X}_t} \Big( C(S_t, x_t) + V^x_t\big(S^x_t(S_t, x_t)\big) \Big)$, where $V^x_t(S^x_t) = \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S^x_t \big]$

(Here the post-decision resource state is the inventory held over from the previous time period.)

Exploiting concavity

 We update the piecewise-linear value functions by computing estimates of slopes with respect to the resource $R_t$ using a backward pass:

» The cost along the marginal path is the derivative of the simulation with respect to the flow perturbation.

Exploiting concavity

 Derivatives are used to estimate a piecewise-linear approximation $\bar{V}_t(R_t)$ of the value function in the resource $R_t$.
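A minimal sketch of one way to maintain a concave, piecewise-linear approximation: smooth a sampled derivative into the slope of the segment that was observed, then project back to nonincreasing slopes. This is an illustrative stand-in for the SPAR-style algorithms cited below, not the exact method.

```python
import numpy as np

R_MAX = 100
slopes = np.zeros(R_MAX)      # slope of V_bar on each segment [r, r+1); concave <=> slopes nonincreasing

def update(slopes, r, slope_hat, alpha):
    """Smooth a sampled derivative slope_hat observed at resource level r, then restore concavity."""
    v = (1 - alpha) * slopes[r] + alpha * slope_hat
    slopes[r] = v
    slopes[:r] = np.maximum(slopes[:r], v)        # segments to the left must have slopes >= v
    slopes[r + 1:] = np.minimum(slopes[r + 1:], v)  # segments to the right must have slopes <= v
    return slopes

def V_bar(R, slopes=slopes):
    """Evaluate the piecewise-linear approximation at integer resource level R."""
    return float(np.sum(slopes[:R]))
```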

Exploiting concavity

 Convergence results for piecewise-linear, concave functions:

» Godfrey, G. and W.B. Powell, “An Adaptive, Distribution-Free Algorithm for the Newsvendor Problem with Censored Demands, with Application to Inventory and Distribution Problems,” Management Science, Vol. 47, No. 8, pp. 1101-1112 (2001).

» Topaloglu, H. and W.B. Powell, “An Algorithm for Approximating Piecewise Linear Concave Functions from Sample Gradients,” Operations Research Letters, Vol. 31, No. 1, pp. 66-76 (2003).

» Powell, W.B., A. Ruszczynski and H. Topaloglu, “Learning Algorithms for Separable Approximations of Stochastic Optimization Problems,” Mathematics of Operations Research, Vol. 29, No. 4, pp. 814-836 (2004).

 Convergence results for storage problems:

» Nascimento, J. and W.B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for Concave, Scalar Storage Problems with Vector-Valued Controls,” IEEE Transactions on Automatic Control, Vol. 58, No. 12, pp. 2995-3010 (2013).

» Powell, W.B., A. Ruszczynski and H. Topaloglu, “Learning Algorithms for Separable Approximations of Stochastic Optimization Problems,” Mathematics of Operations Research, Vol. 29, No. 4, pp. 814-836 (2004).

» Nascimento, J. and W.B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem,” Mathematics of Operations Research (2009).

Exploiting concavity

[Charts: results across 21 benchmark storage problems, expressed as a percent of optimal; the second chart zooms in on the 95-101% range.]

Grid level storage


 ADP (blue) vs. LP optimal (black)

Exploiting concavity

 The problem of dealing with the state of the world (temperature, interest rates, …)

[Figure: piecewise-linear value functions in the resource $R_t$, one for each state of the world.]

Exploiting concavity

 Active area of research. Key ideas center on different methods for clustering.

[Figure: a query state and neighboring piecewise-linear approximations in $R_t$.]

Lauren Hannah, W.B. Powell, D. Dunson, “Semi-Convex Regression for Metamodeling-Based Optimization,” SIAM J. on Optimization, Vol. 24, No. 2, pp. 573-597 (2014).

Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Hour-ahead bidding

 Bid is placed at 1pm, consisting of charge and discharge prices between 2pm and 3pm.

[Figure: hourly electricity prices with the discharge bid $b^{discharge}_{1pm,2pm}$ and the charge bid $b^{charge}_{1pm,2pm}$, placed at 1pm for the 2pm-3pm interval.]

A bidding problem The exact value function

A bidding problem Approximate value function without monotonicity

A bidding problem
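The contrast above is the motivation for imposing monotonicity directly on a lookup-table approximation. A minimal sketch of a monotone update for a 2-D information state, assuming the value function is known to be nondecreasing in each coordinate; the shapes, stepsize, and projection rule are illustrative, not the algorithm behind these slides.

```python
import numpy as np

shape = (50, 50)                 # discretized 2-D state (e.g., resource level x price level)
V_bar = np.zeros(shape)

def monotone_update(V_bar, s, v_hat, alpha):
    """Smooth the observation v_hat into V_bar[s], then project so V_bar stays nondecreasing in each coordinate."""
    i, j = s
    v = (1 - alpha) * V_bar[i, j] + alpha * v_hat
    V_bar[i, j] = v
    # States that are componentwise >= s must have values >= v
    V_bar[i:, j:] = np.maximum(V_bar[i:, j:], v)
    # States that are componentwise <= s must have values <= v
    V_bar[:i + 1, :j + 1] = np.minimum(V_bar[:i + 1, :j + 1], v)
    return V_bar
```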

Outline

 Least squares approximate policy iteration  Direct policy search  Approximate policy iteration using machine learning  Exploiting concavity  Exploiting monotonicity  Closing thoughts

Observations

» Approximate value iteration using a linear model can produce very poor results under the best of circumstances, and potentially terrible results.

» Least squares approximate policy iteration, a highly regarded classic algorithm by Lagoudakis and Parr, works poorly.

» Approximate policy iteration is OK with support vector regression, but below expectation for such a simple problem.

» A basic lookup table by itself works poorly. » A lookup table with structure works: • Convexity – does not require explicit exploration. • Monotonicity – does very well; requires explicit exploration, but is limited to a very low-dimensional information state.

» So, we can conclude that nothing works reliably in a way that would scale to more complex problems!