
Applying Online Search Techniques to Reinforcement Learning

Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University

The Agony of Continuous State Spaces

• Learning useful value functions for continuous-state optimal control problems can be difficult
  – Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
  – Accurate value functions can be very expensive to compute, even in relatively low-dimensional spaces with perfectly accurate state transition models

Combining Value Functions With Online Search

• Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value function inaccuracies
• We examine two different types of search:
  – "Local" searches, in which the agent performs a finite-depth look-ahead search
  – "Global" searches, in which the agent searches for trajectories all the way to goal states

Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

    R_T + γ V(x_T)

where
– R_T is the reward accumulated along T
– γ is the discount factor
– x_T is the state at the end of T

This takes O(|A|) time, where A is the set of possible actions. Given a perfect V(x), this would lead to optimal behavior.
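A minimal sketch of this one-step greedy choice, assuming a deterministic model interface step(x, a) and reward(x, a) (these names are illustrative, not from the slides):

```python
def one_step_greedy_action(x, actions, step, reward, V, gamma):
    """Pick the action maximizing R(x, a) + gamma * V(step(x, a)).

    O(|A|): one model query and one value-function evaluation per action.
    With a perfect V this yields optimal behavior.
    """
    best_a, best_score = None, float("-inf")
    for a in actions:
        x_next = step(x, a)                      # model's predicted successor
        score = reward(x, a) + gamma * V(x_next)
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```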

Local Search

• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T).
  – Computational expense: O(|A|^d).
• To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times (see the sketch below).
  – Computational expense: O(d^s |A|^(s+1)) (considerably cheaper than a full d-step search if s << d)
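A sketch of the restricted depth-d search, under the same hypothetical step/reward/V interface as above; action sequences that would exceed s switches are pruned as they are generated:

```python
def local_search(x0, actions, step, reward, V, gamma, d, s):
    """Depth-d look-ahead over action sequences with at most s switches.

    Returns (best score, best action sequence), where the score is
    R_T + gamma^d * V(x_T) for the d-step trajectory T.
    """
    best = (float("-inf"), None)

    def expand(x, depth, last_a, switches, ret, discount, seq):
        nonlocal best
        if depth == d:
            score = ret + discount * V(x)        # discount == gamma^d here
            if score > best[0]:
                best = (score, seq)
            return
        for a in actions:
            n_sw = switches + (last_a is not None and a != last_a)
            if n_sw > s:
                continue                         # too many action switches: prune
            expand(step(x, a), depth + 1, a, n_sw,
                   ret + discount * reward(x, a), discount * gamma, seq + (a,))

    expand(x0, 0, None, 0, 0.0, 1.0, ())
    return best
```

Setting s = d (no effective limit) recovers the unrestricted O(|A|^d) search.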


Local Search: Example


• Two-dimensional state space (position + velocity)
• Car must back up to take a "running start" to make it up the hill
• Search over 20-step trajectories with at most one switch in actions

Using Local Search Online

Repeat:
• From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
• Perform the first action of the trajectory that maximizes R_T + γ^d V(x_T) (a sketch of this control loop follows below)
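A sketch of the resulting receding-horizon control loop, reusing the hypothetical local_search above; at_goal and max_steps are illustrative additions:

```python
def run_local_search_controller(x0, actions, step, reward, V, gamma,
                                d, s, at_goal, max_steps=1000):
    """Re-run the restricted d-step search from the current state at every
    time step and execute only the first action of the best trajectory."""
    x = x0
    for _ in range(max_steps):
        if at_goal(x):
            break
        _, best_traj = local_search(x, actions, step, reward, V, gamma, d, s)
        x = step(x, best_traj[0])      # in practice, act in the real system
    return x
```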

Let B denote the "parallel backup operator" such that

    BV(x) ≡ max_{a ∈ A} [ R(x, a) + γ V(δ(x, a)) ]

where δ(x, a) is the successor state predicted by the model. If s = d - 1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1)V. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
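The backup operator itself is just a one-step greedy value; a minimal sketch under the same hypothetical model interface:

```python
def parallel_backup(V, x, actions, step, reward, gamma):
    """One application of the parallel backup operator B at state x:
    (BV)(x) = max_a [ R(x, a) + gamma * V(step(x, a)) ]."""
    return max(reward(x, a) + gamma * V(step(x, a)) for a in actions)
```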

Uninformed Global Search

• Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
• Problem: combinatorial explosion.
• Possible solution (a grid sketch follows below):
  – Break the state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)
  – Use the previously discussed local search procedure to find trajectories between partitions
  – Prune all but the least-cost trajectory entering any given partition
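A minimal sketch of the sparse uniform-grid partitioning, assuming the state is a tuple of floats and illustrative per-dimension cell widths; only visited cells ever get dictionary entries:

```python
import math

def grid_cell(x, cell_widths):
    """Map a continuous state to the integer coordinates of its grid element.
    A dict keyed by these tuples gives a sparse representation: only cells
    actually reached by the search are ever stored."""
    return tuple(math.floor(xi / wi) for xi, wi in zip(x, cell_widths))

# Example of sparse bookkeeping: best (least-cost) entry found so far per cell.
best_entry = {}   # grid_cell(x, widths) -> (cost so far, state, trajectory)
```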


Uninformed Global Search

• Problems:
  – Still computationally expensive
  – Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail

Informed Global Search

• Use the approximate value function V to guide the selection of which points to search from next
• A reasonably accurate V will cause the search to stay along the optimal path to the goal: dramatic reduction in search time
• V can help choose effective points within each partition from which to search, thereby improving solution quality
• Uninformed Global Search is the same as "Informed" Global Search with V(x) = 0

Informed Global Search Algorithm

• Let x_0 be the current state, and g(x_0) be the grid element containing x_0
• Set g(x_0)'s "representative state" to x_0, and add g(x_0) to priority queue P with priority V(x_0)
• Until a goal state is found or P is empty:
  – Remove grid element g from the top of P. Let x denote g's "representative state."
  – SEARCH-FROM(g, x)
• If goal found, execute trajectory; otherwise signal failure

Informed Global Search Algorithm, cont’d

SEARCH-FROM(g, x):

• Starting from x, perform "local search" as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.
• Each time another grid element g′ is reached at state x′:
  – If g′ was previously SEARCHED-FROM, do nothing.
  – If g′ was never previously reached, add g′ to P with priority R_T(x_0…x′) + γ^|T| V(x′), where T is the trajectory from x_0 to x′. Set g′'s "representative state" to x′. Record the trajectory from x to x′.
  – If g′ was previously reached but its previous priority is lower than R_T(x_0…x′) + γ^|T| V(x′), update g′'s priority to R_T(x_0…x′) + γ^|T| V(x′) and set its "representative state" to x′. Record the trajectory from x to x′.
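A condensed sketch of the whole procedure, assuming the hypothetical helpers above (step, reward, V, grid_cell) and a goal test at_goal; higher priority means more promising (costs would enter as negative rewards):

```python
import heapq

def informed_global_search(x0, actions, step, reward, V, gamma,
                           cell_widths, at_goal, d, s):
    """Best-first (A*-like) search over grid elements, guided by V.
    Returns an action trajectory reaching the goal, or None on failure."""
    g0 = grid_cell(x0, cell_widths)
    # Per cell: (return so far, discount so far, representative state, actions so far).
    entries = {g0: (0.0, 1.0, x0, ())}
    searched = set()
    frontier = [(-V(x0), g0)]            # heapq is a min-heap: negate priorities

    while frontier:
        _, g = heapq.heappop(frontier)
        if g in searched:
            continue                     # stale entry: already SEARCHED-FROM
        searched.add(g)
        ret_g, disc_g, x_rep, traj_g = entries[g]
        if at_goal(x_rep):
            return traj_g                # goal found: execute this trajectory

        def expand(x, depth, last_a, switches, ret, disc, traj):
            # SEARCH-FROM(g, x_rep): local search pruned at cell boundaries.
            for a in actions:
                n_sw = switches + (last_a is not None and a != last_a)
                if n_sw > s:
                    continue
                x2 = step(x, a)
                ret2, disc2 = ret + disc * reward(x, a), disc * gamma
                traj2, g2 = traj + (a,), grid_cell(x2, cell_widths)
                if g2 != g:
                    if g2 in searched:
                        continue         # previously SEARCHED-FROM: do nothing
                    prio = ret2 + disc2 * V(x2)
                    old = entries.get(g2)
                    if old is None or old[0] + old[1] * V(old[2]) < prio:
                        entries[g2] = (ret2, disc2, x2, traj2)   # new representative
                        heapq.heappush(frontier, (-prio, g2))
                elif depth + 1 < d:
                    expand(x2, depth + 1, a, n_sw, ret2, disc2, traj2)

        expand(x_rep, 0, None, 0, ret_g, disc_g, traj_g)
    return None                          # priority queue exhausted: signal failure
```

This sketch simplifies the goal test (it only checks each cell's representative state) and stores whole action trajectories rather than per-cell back-pointers.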

Informed Global Search Examples


[Figures: hill-car search trees using a 7×7 simplex-interpolated V and a 13×13 simplex-interpolated V]

Informed Global Search as A*

• Informed Global Search is essentially an A* search using the value function V as a search heuristic
• Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.
• Uninformed Global Search effectively uses the trivially optimistic heuristic V(x) = 0. Might we then expect better solution quality with uninformed search than with a non-optimistic, crude approximate value function V?
• Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree


Hill-car

• Car on a steep hill
• State variables: position and velocity (2-d)
• Actions: accelerate forward or backward
• Goal: park near the top
• Random start states
• Cost: total time to goal

Acrobot

[Figure: two-link acrobot with joint angles θ1 and θ2; goal height marked]

• Two-link planar robot acting in a vertical plane under gravity
• Torque applied only at the elbow joint; shoulder unactuated (the system is underactuated)
• Two angular positions & their velocities (4-d)
• Goal: raise the tip at least one link's height above the shoulder
• Two actions: full torque clockwise / counterclockwise
• Random starting positions
• Cost: total time to goal

Move-Cart-Pole

[Figure: cart-pole with cart position x, pole angle θ, and the goal configuration]

• Upright pole attached to a cart by an unactuated joint
• State: horizontal position of the cart, angle of the pole, and their associated velocities (4-d)
• Actions: accelerate left or right
• Goal configuration: cart moved, pole balanced
• Start with random x; θ = 0
• Per-step cost quadratic in distance from the goal configuration
• Big penalty if the pole falls over


Planar Slider

• Puck sliding on a bumpy 2-d surface
• Two spatial variables & their velocities (4-d)
• Actions: accelerate NW, NE, SW, or SE
• Goal in the NW corner
• Random start states
• Cost: total time to goal

Local Search Experiments

[Plots: CPU time and solution cost vs. search depth for Move-Cart-Pole]

• CPU time and solution cost vs. search depth d
• No limits imposed on the number of action switches (s = d)
• Value function: 13^4 simplex-interpolation grid

Local Search Experiments

[Plots: CPU time and solution cost vs. search depth for Hill-car]

• CPU time and solution cost vs. search depth d
• Max. number of action switches fixed at 2 (s = 2)
• Value function: 7^2 simplex-interpolated value function

Comparative experiments: Hill-Car

Search Method    Solution Cost    CPU Time/Trial
None             187              0.02
Local            140              0.36
Uninf. Glob.     FAIL             N/A
Inf. Glob.       151              0.14

• Local search: d = 6, s = 2
• Global searches:
  – Local search between grid elements: d = 20, s = 1
  – 50^2 search grid resolution
• 7^2 simplex-interpolated value function

Hill-Car results cont’d

• Uninformed Global Search prunes the wrong trajectories
• Increase the search grid to 100^2 so this doesn't happen:
  – Uninformed is near-optimal
  – Informed isn't: the crude value function is not optimistic

[Figure: failed search trajectory]

Search Method     Solution Cost    CPU Time/Trial
Uninf. Glob. 1    FAIL             N/A
Inf. Glob. 1      151              0.14
Uninf. Glob. 2    109              0.82
Inf. Glob. 2      138              0.29

(1: 50^2 search grid; 2: 100^2 search grid)

Comparative Results: Four-d domains

• All value functions: 13^4 simplex interpolations
• All local searches between global search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
• Acrobot:
  – Local search: depth 4; no action switch restriction (d = 4, s = 4)
  – Global: 50^4 search grid
• Move-Cart-Pole: same as Acrobot
• Slider:
  – Local search: depth 10; max. 1 action switch (d = 10, s = 1)
  – Global: 20^4 search grid

Acrobot

                 No search      Local search    Uninf. Global Search     Inf. Global Search
                 cost    time   cost    time    cost    time    #LS      cost    time    #LS
Acrobot          454     0.1    305     1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole   49993   0.66   10339   1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider    212     1.9    197     52      104     94      23690    54      2       533

#LS: number of local searches performed to find paths between elements of the global search grid

• Local search significantly improves solution quality, but increases CPU time by an order of magnitude
• Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
• Informed global search finds much better solutions in relatively little time. The value function drastically reduces search, and better pruning leads to better solutions

Move-Cart-Pole

                 No search      Local search    Uninf. Global Search     Inf. Global Search
                 cost    time   cost    time    cost    time    #LS      cost    time    #LS
Acrobot          454     0.1    305     1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole   49993   0.66   10339   1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider    212     1.9    197     52      104     94      23690    54      2       533

• No search: the pole often falls, incurring large penalties; overall poor solution quality
• Local search improves things a bit
• Uninformed search finds better solutions than informed:
  – Few grid cells in which pruning is required
  – Value function not optimistic, so informed search solutions are suboptimal
• Informed search reduces costs by an order of magnitude with no increase in required CPU time

Planar Slider

                 No search      Local search    Uninf. Global Search     Inf. Global Search
                 cost    time   cost    time    cost    time    #LS      cost    time    #LS
Acrobot          454     0.1    305     1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole   49993   0.66   10339   1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider    212     1.9    197     52      104     94      23690    54      2       533

• Local search is almost useless, and incurs massive CPU expense
• Uninformed search decreases solution cost by 50%, but at even greater CPU expense
• Informed search decreases solution cost by a factor of 4, at no increase in CPU time

Using Search with Learned Models

• Toy example: Hill-Car
  – 7^2 simplex-interpolated value function
  – One nearest-neighbor function approximator per possible action used to learn dx/dt
  – States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration (see the sketch below)
• Average costs over first few hundred trials:
  – No search: 212
  – Local search: 127
  – Informed global search: 155
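A minimal sketch of such a learned model, assuming a Euclidean nearest-neighbor lookup per action and an illustrative distance threshold beyond which states are optimistically treated as absorbing (class and parameter names are hypothetical):

```python
import numpy as np

class NearestNeighborModel:
    """One nearest-neighbor approximator of dx/dt per action; queries far
    from any stored sample are optimistically treated as absorbing."""

    def __init__(self, actions, dt, max_dist):
        self.data = {a: [] for a in actions}   # per action: list of (x, dx/dt)
        self.dt, self.max_dist = dt, max_dist

    def observe(self, x, a, x_next):
        x, x_next = np.asarray(x, float), np.asarray(x_next, float)
        self.data[a].append((x, (x_next - x) / self.dt))

    def predict(self, x, a):
        """Return (predicted next state, is_absorbing) for action a in state x."""
        x = np.asarray(x, float)
        samples = self.data[a]
        if not samples:
            return x, True                      # unexplored: optimistic/absorbing
        xs = np.array([s for s, _ in samples])
        dists = np.linalg.norm(xs - x, axis=1)
        i = int(np.argmin(dists))
        if dists[i] > self.max_dist:
            return x, True                      # too far from data: absorbing
        return x + self.dt * samples[i][1], False
```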

Using Search with Learned Models

• Problems do arise when using learned models:
  – Inaccuracies in the model may cause global searches to fail. It is then not clear whether failure should be blamed on model inaccuracies or on insufficiently fine state-space partitioning
  – Trajectories found will be inaccurate
    • Need an adaptive closed-loop controller
    • Fortunately, we will get new data with which to increase the accuracy of our model
  – Model approximators must be fast and accurate

Avenues for Future Research

• Extensions to nondeterministic systems?

• Higher-dimensional problems
• Better function approximators for model learning
• Variable-resolution search grids
• Optimistic value function generation?