The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks
Gerhard Neumann
Master Thesis
2005
Institut für Grundlagen der Informationsverarbeitung (IGI)
www.igi.tu-graz.ac.at/ril-toolbox
Master Thesis: Reinforcement Learning Toolbox

- General software tool for Reinforcement Learning
- Benchmark tests of Reinforcement Learning algorithms on three optimal control problems:
  - Pendulum Swing-Up
  - Cart-Pole Swing-Up
  - Acrobot Swing-Up
RL Toolbox: Features

Software:
- C++ class system
- Open source / non-commercial
- Homepage: www.igi.tu-graz.ac.at/ril-toolbox
- Class reference, manual
- Runs under Linux and Windows
- > 40,000 lines of code, > 250 classes
RL Toolbox: Features

- Learning in discrete or continuous state space
- Learning in discrete or continuous action space
- Different kinds of learning algorithms:
  - TD(lambda) learning
  - Actor-critic learning
  - Dynamic programming, model-based learning, planning methods
  - Continuous-time RL
  - Policy search algorithms
  - Residual / residual gradient algorithms
- Use of different function approximators:
  - RBF networks
  - Linear interpolation
  - CMAC tile coding
  - Feed-forward neural networks
- Learning from other (self-coded) controllers
- Hierarchical Reinforcement Learning
Structure of the Learning System

The agent and the environment:
- The agent tells the environment which action to execute; the environment performs the internal state transitions.
- The environment defines the learning problem.
Structure of the Learning System

Linkage to the learning algorithms:
- All algorithms need the tuple <s_t, a_t, s_{t+1}> for learning.
- The algorithms are implemented as listeners: the agent informs all listeners about each step and about the start of a new episode (a rough sketch of this interface follows below).
- The algorithms adapt the agent's controller in order to learn the optimal policy.
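As a rough illustration of this design (a hedged sketch with hypothetical class and method names, not the Toolbox's actual interface), the listener pattern could look like this in C++:

    #include <vector>

    // Hypothetical stand-ins for the Toolbox's state and action classes.
    struct State  { /* continuous or discrete state variables */ };
    struct Action { /* discrete action index or continuous control vector */ };

    // A listener receives every <s_t, a_t, r_t, s_{t+1}> transition from the agent.
    class StepListener {
    public:
        virtual ~StepListener() = default;
        virtual void newEpisode() = 0;
        virtual void onStep(const State& st, const Action& at,
                            double reward, const State& stNext) = 0;
    };

    // The agent executes actions in the environment and forwards each
    // transition to all registered learning algorithms (the listeners).
    class Agent {
    public:
        void addListener(StepListener* l) { listeners.push_back(l); }

        void notifyNewEpisode() {
            for (StepListener* l : listeners) l->newEpisode();
        }
        void notifyStep(const State& st, const Action& at,
                        double reward, const State& stNext) {
            for (StepListener* l : listeners) l->onStep(st, at, reward, stNext);
        }
    private:
        std::vector<StepListener*> listeners;
    };

Under this reading, a learning algorithm only has to implement the listener interface and update its value function or policy inside the step callback.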
Reinforcement Learning

Agent:
- State space S
- Action space A
- Transition function
- The agent has to optimize the future discounted reward (see the return definition below).

Many possibilities to solve the optimization task:
- Value-based approaches
- Genetic search
- Other optimization algorithms
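A standard way to write this objective (the usual discounted return from the RL literature, e.g. [Sutt_1999]; not copied from the slide itself):

    $R_t = E\left[ \sum_{k=0}^{\infty} \gamma^k \, r_{t+k} \right], \qquad 0 \le \gamma < 1$

where $r_{t+k}$ is the reward received $k$ steps after time $t$ and $\gamma$ is the discount factor; the agent searches for a policy that maximizes this expectation.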
Short Overview of the Algorithms

- Value-based algorithms: calculate the goodness of each state.
- Policy-search algorithms: represent the policy directly and search in the policy parameter space.
- Hybrid methods: actor-critic learning.
Value-Based Algorithms

Calculate either:
- Action value function (Q-function): directly used for action selection.
- Value function (V-function): needs the transition function for action selection, e.g. do a state prediction or use the derivative of the transition function (see the definitions below).
- The representation of the V- or Q-function is in most cases independent of the learning algorithm:
  - We can use any function approximator for the value function.
  - Independent V-function and Q-function interfaces.
  - Different algorithms: TD-learning, Advantage Learning, continuous-time RL.
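For reference, the standard definitions behind these two cases (textbook notation, not taken verbatim from the slide):

    $V^\pi(s) = E\left[ \sum_{k \ge 0} \gamma^k r_{t+k} \,\middle|\, s_t = s \right], \qquad Q^\pi(s,a) = E\left[ r(s,a) + \gamma V^\pi(s') \right]$

    $a^*(s) = \arg\max_a \left[ r(s,a) + \gamma V(f(s,a)) \right] \quad \text{for a deterministic model } s' = f(s,a)$

Acting greedily on a Q-function only requires maximizing over actions, while acting greedily on a V-function requires the transition function $f$ (or its derivative) to predict successor states.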
Policy Search / Policy Gradient Algorithms

- Directly climb the value function with a parameterized policy.
- Calculate the values of N given initial states per simulation (PEGASUS, [NG_2000]).
- Use standard optimization techniques like gradient ascent, simulated annealing or genetic algorithms; gradient ascent is used in the Toolbox (sketched below).
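A hedged sketch of the resulting optimization problem, following the general PEGASUS idea of evaluating a fixed set of initial states with fixed simulation seeds (the exact estimator used in the thesis is not reproduced here):

    $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} V^{\pi_\theta}\!\left(s_0^{(i)}\right), \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$

Here $\pi_\theta$ is the parameterized policy, $s_0^{(1)}, \dots, s_0^{(N)}$ are the given initial states, and $\alpha$ is the step size of the gradient ascent; with fixed seeds, $J(\theta)$ becomes a deterministic function that any standard optimizer can climb.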
Actor-Critic Methods

Learn the value function and an extra policy representation (a common critic update is sketched below).

Discrete actor-critic:
- Stochastic policies: directly represent the action selection probabilities.
- Similar to TD-Q learning.

Continuous actor-critic:
- Directly outputs the continuous control vector.
- The policy can be represented by any function approximator.
- Stochastic Real-Valued (SRV) algorithm ([Gulla_1992]).
- Policy-Gradient Actor-Critic (PGAC) algorithm.
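As an illustration of the shared critic machinery (the standard TD(0) error from the actor-critic literature; the Toolbox's exact update rules may differ):

    $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

The critic moves $V(s_t)$ in the direction of $\delta_t$, while the actor strengthens (discrete case) or shifts (continuous case) the action taken in $s_t$ in proportion to $\delta_t$.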
Policy-Gradient Actor-Critic Algorithm

- Learn the V-function with a standard algorithm.
- The gradient of the value is then estimated within a certain time window (k steps in the past, l steps in the future); see the sketch after this list.
- Again, the exact model is needed.
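The slide's exact estimator is not reproduced in this transcript. As an assumption-laden illustration of a window-based value gradient through a known deterministic model $s_{t+1} = f(s_t, a_t)$ (my notation, not necessarily the thesis's formula), one can differentiate a truncated return with respect to the current action:

    $\hat{V}_t = \sum_{i=0}^{l-1} \gamma^i \, r(s_{t+i}, a_{t+i}) + \gamma^{l} V(s_{t+l})$

The gradient of $\hat{V}_t$ with respect to the actor's output (or its parameters) is then obtained by applying the chain rule through $f$ over the window, which is why the exact model is required.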

Second Part: Benchmark Tests

- Pendulum Swing-Up (easy task)
- Cart-Pole Swing-Up (medium task)
- Acrobot Swing-Up (hard task)
Benchmark Problems

- Common problems in non-linear control (a concrete example follows below).
- Try to reach an unstable fixed point.
- 2 or 4 continuous state variables.
- 1 continuous control variable.
- Reward: height of the end point at each time step.
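As a concrete example of such a problem (the pendulum swing-up in the common formulation of [Doya_1999]; the constants and the exact reward used in the thesis may differ), the dynamics and reward can be written as:

    $m l^2 \ddot{\theta} = -\mu \dot{\theta} + m g l \sin\theta + u, \qquad r(t) = \cos\theta(t)$

with state $(\theta, \dot{\theta})$, a single bounded torque $u$ as the control variable, and $\cos\theta$ measuring the height of the pendulum tip; the upright position is the unstable fixed point, and the torque limit forces the controller to swing back and forth to build up energy.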
Benchmark Tests

- Test the algorithms on the benchmark problems with different parameter settings.
  - Compare the sensitivity to the parameter setting.
- Use different function approximators (FAs); a sketch of a linear value representation follows after this list.
  - Linear FAs (e.g. RBF networks):
    - Typical local representation.
    - Curse of dimensionality.
  - Non-linear FAs (e.g. feed-forward neural networks):
    - No exponential dependency on the input state dimension.
    - Harder to learn (no local representation).
- Compare the algorithms with respect to their features and requirements:
  - Is the exact transition function needed?
  - Can the algorithm produce continuous actions?
  - How much computation time is needed?
- Use hierarchical RL, directed exploration strategies or planning methods to boost learning.
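For clarity, a standard textbook form of a linear value-function approximator with RBF features (not a Toolbox-specific definition):

    $V(s) = \sum_{i=1}^{n} w_i \, \phi_i(s), \qquad \phi_i(s) = \exp\!\left( -\tfrac{1}{2} (s - c_i)^{\top} \Sigma^{-1} (s - c_i) \right)$

The centers $c_i$ are typically placed on a grid over the state space, which is what makes the representation local and what makes the number of features grow exponentially with the state dimension; a feed-forward network avoids that growth at the price of a non-local, harder-to-train representation.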
Benchmark Tests

- Planning boosts performance significantly.
  - Cart-Pole task, RBF network.
  - Very time-intensive (search depth 5 – 120 times longer computation time).
- The PG-AC approach can compete with the standard V-learning approach.
  - But it cannot represent sharp decision boundaries.
Benchmark: PG-AC vs. V-Planning, Feed-Forward NN

- Learning with a FF-NN using the standard planning approach is almost impossible (very unstable performance).
- PG-AC with an RBF critic (time window = 7 time steps) manages to learn the task in almost 1/10 of the episodes of the standard planning approach.
V-Planning

- Cart-Pole task: higher search depths could improve performance significantly, but at an exponentially growing cost in computation time.
Hierarchical RL

- Cart-Pole task: the hierarchical sub-goal approach (alpha = 0.6) outperforms the flat approach (alpha = 1.0).
Other General Results

- The Acrobot task could not be learned with the flat architecture.
  - The hierarchical architecture manages to swing up, but could not stay on top.
- Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks).
- Non-linear function approximators are very hard to learn:
  - Feed-forward NNs have very poor performance (no locality), but can be used for larger state spaces.
  - Very restrictive parameter settings are required.
- Approaches which use the transition function typically outperform the model-free approaches.
- The policy gradient algorithm (PEGASUS) only worked with the linear FAs; with non-linear FAs it could not escape from local maxima.
Literature

[Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press.
[NG_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs.
[Doya_1999] K. Doya: Reinforcement Learning in Continuous Time and Space.
[Baxter_1999] J. Baxter: Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments.
[Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis.
[Gulla_1992] V. Gullapalli: Reinforcement Learning and Its Application to Control.
[Coulom_2000] R. Coulom: Reinforcement Learning Using Neural Networks. PhD thesis.