Transcript Slides

Reinforcement Learning in Robotics:
Applications and Real-World Challenges
Petar Kormushev, Sylvain Calinon and Darwin G. Caldwell
Iretiayo Akinola
Outline
Introduction/Background/Motivation
Problem Formulation:
Algorithms, Policy representations, Reward
Experimental Analysis:
Pancake Flipping Task
Archery-Based Aiming Task
Bipedal Walking Energy Minimization Task
Comments
Why Reinforcement Learning in Robotics?
Intelligence acquisition alternatives:
Direct Programming: rule-based, complete control, structured environment
Imitation Learning: demonstration based
Kinesthetic teaching: direct movement of robot’s body
Teleoperation: remote control; larger distances introduce time delays
Observational learning: demonstration captured using motion-capture systems (video, sensors);
the correspondence problem has to be solved
Reinforcement Learning: trial-and-error learning, a pseudo carrot-and-stick model (reward
system)
Why Reinforcement Learning in Robotics?
learn new or non-demonstrable tasks
absence of an analytical formulation or closed-form solution, e.g. the fastest gait
skill adaptation to new scenarios or a dynamically changing world
refine/improve knowledge gained from demonstration
adapt to changes in the robot itself, e.g. mechanical wear, heating up, growing parts
Reinforcement Learning
Reinforcement Learning components:
algorithm
reward function
policy representation
Reinforcement Learning
Specifications: state s ∈ S, action a ∈ A, policy π: S → A
The goal is to optimize an objective: the expected (discounted) sum of rewards
J = E[ Σ_{h=0}^{H} γ^h R(s_h, a_h) ]
γ: discount factor
h: time step
R(s, a): reward received at time step h
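Below is a minimal Python sketch of this objective, just to make the notation concrete; the reward list, horizon and discount factor are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of the RL objective from this slide: the discounted
# sum of rewards collected along one rollout.  The reward values below
# are made up; in a real task R(s, a) comes from the environment.

def discounted_return(rewards, gamma=0.99):
    """Sum_{h=0}^{H} gamma^h * R(s_h, a_h) for one rollout."""
    return sum(gamma**h * r for h, r in enumerate(rewards))

# Example: a 5-step rollout with placeholder rewards.
print(discounted_return([0.0, 0.0, 0.2, 0.5, 1.0], gamma=0.9))
```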
RL Formulation
Two main solution approaches:
Policy search (primal): optimize the policy parameters directly from rollouts
Value function (dual): estimate the value of states/actions, then act greedily;
a closed-form solution is hard to obtain,
and it is not as scalable to continuous, high-dimensional robot state/action spaces
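A toy sketch of the primal idea, assuming a made-up 1-D return function (rollout_return is hypothetical): search directly in parameter space and keep whatever improves the return. A value-function method would instead maintain estimates over a (discretized) state-action space, which is what scales poorly for robots.

```python
import numpy as np

# Toy sketch of "policy search (primal)": perturb the policy parameters
# directly and keep what improves the return.  The return function is a
# hypothetical 1-D stand-in, not a robot task.

def rollout_return(theta):
    return -(theta - 2.0) ** 2          # made-up return, best at theta = 2.0

rng = np.random.default_rng(0)
theta, best = 0.0, rollout_return(0.0)
for _ in range(200):
    candidate = theta + 0.1 * rng.standard_normal()   # exploratory perturbation
    value = rollout_return(candidate)
    if value > best:                                   # greedy hill climbing on the return
        theta, best = candidate, value
print("policy-search estimate:", theta)
```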
Curse of Dimensionality
e.g. a 7-DOF arm interacting with an object in 3-D: 2 × (7 + 3) = 20 state dimensions (positions and velocities)
and 7-dimensional continuous actions.
Recent RL Algorithms
MDP/POMDP (not so recent): finite spaces (discretization), Markov property
Function approximation techniques: e.g. locally linear
Policy Gradient methods: high sensitivity to learning rate and exploratory variance
Expectation-Maximization (EM): no need for a learning rate, easy to implement,
efficient (contrasted with a gradient step in the sketch after this list)
Policy-search RL method (PI2): significant performance improvements,
scalability to high-dimensional control problems
Stochastic optimization: see reference [22] in paper
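The sketch below contrasts a gradient-style update (which needs a learning rate alpha) with an EM-style reward-weighted average (which does not), on made-up rollout data; both updates are simplified illustrations, not the exact algorithms referenced above.

```python
import numpy as np

# Toy contrast: a policy-gradient step (needs a learning rate) versus an
# EM-style reward-weighted average (no learning rate), as claimed above.
# The return function, noise level and parameter values are illustrative.

rng = np.random.default_rng(0)
theta = np.zeros(3)                      # current policy parameters
sigma = 0.1                              # exploration noise

def ret(p):                              # hypothetical return, peaks at [1, -1, 0.5]
    return float(np.exp(-np.sum((p - np.array([1.0, -1.0, 0.5])) ** 2)))

eps = sigma * rng.standard_normal((10, 3))      # exploratory perturbations
R = np.array([ret(theta + e) for e in eps])     # returns of the 10 rollouts

# Policy gradient (REINFORCE-like): sensitive to the choice of alpha.
alpha = 0.5
theta_pg = theta + alpha * np.mean((R - R.mean())[:, None] * eps / sigma**2, axis=0)

# EM-style (PoWER-like): reward-weighted average of the perturbations.
theta_em = theta + np.sum(R[:, None] * eps, axis=0) / np.sum(R)

print("gradient update:", theta_pg, " EM update:", theta_em)
```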
Policy Representation
Performance of RL algorithm depends on policy representation
Multifaceted challenges due to the many requirements placed on the policy representation:
prior/bias, correlations, adaptability, multi-resolution, globality, multi-dimensionality, convergence, etc.
GMM/GMR: RL used to adjust the Gaussian parameters of learned model for
robust performance.
Dynamic Movement Primitives (DMP): switch between a set of attractors
(motor primitives encoded as dynamical systems), using a PD-like controller to reach a target (see the sketch below)
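A minimal 1-D sketch of the attractor idea behind DMPs, assuming illustrative gains, basis functions and randomly initialized forcing-term weights; it is not the implementation used in the paper.

```python
import numpy as np

# Minimal 1-D sketch of a DMP-style attractor: a PD-like second-order
# system pulled toward a goal g, modulated by a forcing term f(s) that
# vanishes as the phase s decays.  Gains, basis widths and weights are
# illustrative choices only.

rng = np.random.default_rng(0)
alpha, beta, tau = 25.0, 25.0 / 4.0, 1.0
g, dt = 1.0, 0.001
centers = np.linspace(0, 1, 10)             # basis-function centers on phase s
widths = np.full(10, 50.0)
w = 50.0 * rng.standard_normal(10)          # forcing-term weights (normally learned)

x, xd, s = 0.0, 0.0, 1.0
for _ in range(int(1.0 / dt)):
    psi = np.exp(-widths * (s - centers) ** 2)
    f = s * (psi @ w) / (psi.sum() + 1e-10)          # forcing term, fades out with s
    xdd = (alpha * (beta * (g - x) - xd) + f) / tau  # PD attractor toward the goal
    xd += xdd * dt
    x += xd * dt
    s += (-6.0 * s / tau) * dt                       # canonical phase decay
print("final position (should be near goal g=1.0):", round(x, 3))
```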
RL in Action
Archery-Based Aiming Task
Pancake Flipping Task
Bipedal Walking Energy Minimization Task
All three use EM-based RL algorithms but different policy representations
Example: Archery-Based Aiming Task
This example addresses multi-dimensionality and convergence speed challenges
Peculiarities of the archery task:
involves bi-manual coordination
performed with slow movements of the arms using small torques/forces
requires using tools (bow and arrow) to affect an external object (target)
appropriate for testing learning algorithms, since the reward is relatively obvious to define
Video here: https://goo.gl/9cYnNw
Example: Archery-Based Aiming Task
Goal: learning the bi-manual coordination necessary to control the shooting
direction and velocity in order to hit the target.
Approach: Learning algorithm to modulate and coordinate the motion of the two
hands, while an inverse kinematics controller is used for the motion of the arms.
Expectation-Maximization-based Reinforcement Learning (PoWER)
Chained vector regression (ARCHER)
Input: Instruction on holding the bow and releasing the arrow
Output: Hit the center of the target.
Example: Archery-Based Aiming Task
Expectation-Maximization-based RL (PoWER)
does not need a learning rate (unlike policy-gradient methods)
combines importance sampling with EM to make better use of the previous experience.
PoWER uses a parameterized policy and tries to find values for the parameters
that maximize the expected return under the corresponding policy.
For the archery task, the policy parameters are represented by the elements of a
3D vector corresponding to the relative position of the two hands performing the
task.
Example: Archery-Based Aiming Task
Expectation-Maximization-based RL (PoWER)
Reward function: decreases with the distance ‖r̂_T − r̂_A‖, where
r̂_T is the estimated 2D position of the center of the target on the target's plane, and
r̂_A is the estimated 2D position of the arrow's tip
EM update: θ_{n+1} = θ_n + Σ_k (θ_k − θ_n) r_k / Σ_k r_k, where
(θ_k − θ_n) is the relative exploration between the k-th and current policy parameters, and r_k its reward
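A hedged sketch of a PoWER-style update for the 3-D hand-position parameters: Gaussian exploration, importance sampling of the best previous rollouts, and a reward-weighted average with no learning rate. The shooting simulator (shoot_and_score) and the "keep the 5 best rollouts" rule are assumptions for illustration.

```python
import numpy as np

# Hedged sketch of a PoWER-style EM update for the 3-D archery policy
# parameters (relative position of the two hands).  Exploration noise,
# rollout budget and the reward model are illustrative only.

rng = np.random.default_rng(1)
theta = np.zeros(3)                        # current relative-hand-position parameters
history = []                               # (parameters, reward) of past rollouts

def shoot_and_score(params):
    # Hypothetical stand-in for executing a shot: reward decreases with
    # the arrow-to-target distance (an assumed, not the paper's, formula).
    target = np.array([0.05, -0.02, 0.1])
    return float(np.exp(-10.0 * np.linalg.norm(params - target)))

for rollout in range(60):
    explored = theta + 0.05 * rng.standard_normal(3)    # Gaussian exploration
    history.append((explored, shoot_and_score(explored)))
    best = sorted(history, key=lambda kr: kr[1], reverse=True)[:5]  # importance sampling
    num = sum(r * (p - theta) for p, r in best)          # reward-weighted relative exploration
    den = sum(r for _, r in best) + 1e-10
    theta = theta + num / den                            # EM update: no learning rate
print("learned parameters:", theta)
```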
Example: Archery-Based Aiming Task
Augmented Reward Chained Regression (ARCHER)
richer feedback information about the result of a trial.
Uses position vector, while PoWER uses only scalar (distance) feedback.
ARCHER solution: regress from the relative position of the hands (the policy parameters)
to the 2-dimensional reward, i.e. the relative position of the arrow's tip on the target plane,
then jump to the hand position predicted to hit the center (see the sketch below)
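A hedged sketch of the ARCHER idea under simplifying assumptions: the mapping from hand offset to 2-D miss is treated as affine and fitted by plain least squares over all rollouts (shoot is a hypothetical stand-in); the paper's weighting of the best rollouts differs in detail.

```python
import numpy as np

# Hedged sketch of ARCHER: use the 2-D miss vector (arrow tip relative
# to the target centre) instead of a scalar reward, fit an affine model
# from policy parameters to the miss, and jump to the parameters the
# model predicts will hit the centre.  The "physics" below is made up.

rng = np.random.default_rng(2)

def shoot(theta):
    # Hypothetical affine mapping from 3-D hand offset to 2-D miss, plus noise.
    A = np.array([[2.0, 0.5, 0.0],
                  [0.0, 1.5, 1.0]])
    return A @ theta + np.array([0.3, -0.2]) + 0.01 * rng.standard_normal(2)

thetas = [0.05 * rng.standard_normal(3) for _ in range(4)]   # initial exploratory shots
misses = [shoot(t) for t in thetas]
for trial in range(6):
    X = np.column_stack([np.array(thetas), np.ones(len(thetas))])   # affine design matrix
    W, *_ = np.linalg.lstsq(X, np.array(misses), rcond=None)        # fit miss ~ A*theta + b
    A_hat, b_hat = W[:3].T, W[3]
    theta = np.linalg.lstsq(A_hat, -b_hat, rcond=None)[0]           # predicted-to-hit parameters
    thetas.append(theta); misses.append(shoot(theta))
print("final miss on the target plane:", misses[-1])
```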
Example: Archery-Based Aiming Task
PoWER vs ARCHER (comparison figure)
Example: Archery-Based Aiming Task
PoWER vs ARCHER: 40 runs each with 60 rollouts (trials)
Example: Archery-Based Aiming Task
PoWER vs ARCHER: Verdict
RL in combination with regression yields an extremely fast-converging algorithm
Questions/Comments
Example: Pancake Flipping Task
Addresses correlations, compactness and smoothness challenges of policy rep.
Task Description:
Goal: toss the pancake in the air so that it rotates 180°, then catch it with the frying
pan
Approach: kinesthetic teaching used to initialize the RL policy
Video here: https://goo.gl/1qvJkC
Example: Pancake Flipping Task
Challenge 1: difficult to learn from multiple demonstrations:
high variability of the task execution.
the generalization process may smooth important acceleration peaks and sharp
turns in the motion.
Solution: Select a single successful demonstration (among a small series of
trials) to initialize the learning process.
Example: Pancake Flipping Task
Compact Encoding with Coupling
movement represented as a superposition of basis force fields
use a controller based on a mixture of K proportional-derivative systems
each PD system has its own attractor vector and a coordination (full stiffness) matrix
encoding couplings between the control variables (see the sketch below)
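A hedged 2-D sketch of such a controller: the commanded acceleration is a time-weighted superposition of K proportional-derivative systems, each with its own attractor and a full stiffness matrix that couples the variables. All gains, attractors and timing are illustrative values, not the pancake-task parameters.

```python
import numpy as np

# Hedged 2-D sketch of the policy representation above: acceleration is a
# time-varying superposition of K PD attractor systems, each with its own
# attractor vector mu_k and a full stiffness matrix whose off-diagonal
# terms encode coordination between the control variables.

K = 3
attractors = np.array([[0.0, 0.0], [0.2, 0.4], [0.4, 0.0]])      # mu_k (illustrative)
stiffness = [np.array([[400.0, 50.0], [50.0, 400.0]])] * K       # coupled stiffness matrices
damping = np.array([[40.0, 0.0], [0.0, 40.0]])                   # shared damping
centers = np.array([0.1, 0.3, 0.5]); width = 0.01                # Gaussian activations in time

x = np.zeros(2); xd = np.zeros(2); dt = 0.001
for step in range(600):
    t = step * dt
    h = np.exp(-((t - centers) ** 2) / width); h /= h.sum()      # normalized weights h_k(t)
    xdd = sum(h[k] * (stiffness[k] @ (attractors[k] - x)) for k in range(K)) - damping @ xd
    xd += xdd * dt; x += xd * dt
print("end-effector position at t = 0.6 s:", x)
```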
Example: Pancake Flipping Task
Experimental Set-up
Position and orientation of pancakes are tracked in real-time by a reflective
marker-based motion capture system.
Reward function defined in terms of:
t_f: the time at which the pancake, moving downward, passes the horizontal level at a fixed height Δh
above the frying pan's current vertical position
v_0: the initial orientation of the pancake
v_{t_f}: the orientation of the pancake at time t_f
x_p: the position of the pancake center at time t_f
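The slide lists the quantities entering the reward but the exact formula is not reproduced here, so the following sketch only illustrates one plausible way to combine a flip term (angle between v_0 and v_{t_f}) with a catch term (distance of x_p from the pan center); the weighting is an assumption, not the paper's reward.

```python
import numpy as np

# Illustrative sketch only: combines a "flipped by 180 degrees" term with
# a "caught near the pan centre" term.  The weights and the exponential
# catch term are assumptions, not the reward used in the paper.

def pancake_reward(v0, vtf, xp, pan_center, w_flip=0.7, w_catch=0.3):
    cosang = np.dot(v0, vtf) / (np.linalg.norm(v0) * np.linalg.norm(vtf))
    flip_term = 0.5 * (1.0 - cosang)                 # equals 1.0 for a 180-degree flip
    catch_term = np.exp(-10.0 * np.linalg.norm(xp - pan_center))
    return w_flip * flip_term + w_catch * catch_term

print(pancake_reward(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, -1.0]),
                     np.array([0.02, 0.01]), np.array([0.0, 0.0])))
```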
Example: Pancake Flipping Task
Results/Observation:
the up-down bouncing of the frying pan towards the end of the learned skill shows the
power of RL
learns couplings across multiple motor control variables (correlations)
using correlations in RL reduces the size of the representation (compactness)
fast, dynamic tasks can still be represented and executed in a safe-for-the-robot
manner (smoothness)
Questions/Comments
Example: Bipedal Walking Energy Minimization Task
Challenge: the walking energy minimization problem is nearly impossible to
solve analytically.
Goal: apply RL to learn to minimize the energy consumption required for walking
of the passively-compliant bipedal robot
Approach: Use RL method to learn an optimal vertical trajectory for the center of
mass (CoM) of the robot to be used during walking,
in order to minimize the energy consumption.
Video here: https://goo.gl/ANzg9v
Example: Bipedal Walking Energy Minimization Task
Fixed Policy Parameterization
Too simple a policy parameterization (a few parameters): convergence is
quick, but often a sub-optimal solution is reached.
Too complex a policy parameterization: convergence is slow, and a much worse
local optimum is likely.
Target: both fast convergence and a good enough solution.
Example: Bipedal Walking Energy Minimization Task
Evolving Policy Parameterization
change the complexity of the policy representation dynamically
“grow” to accommodate increasingly more complex policies and get closer to
the global optimum
informed initialization is key
provides backward compatibility: the grown representation can still reproduce the previous policy (see the sketch below)
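A minimal sketch of the growing idea, assuming a linearly interpolated knot representation and a knot-doubling schedule (both illustrative choices): resampling the current policy at a finer grid leaves the represented trajectory unchanged, so exploration can continue at higher resolution.

```python
import numpy as np

# Sketch of "growing" a policy parameterization while keeping backward
# compatibility: the policy is a trajectory stored at a few knots;
# doubling the knots by interpolating the current policy leaves the
# represented trajectory unchanged, then RL exploration continues at the
# finer resolution.  Interpolation scheme and schedule are assumptions.

def evaluate(knots, t):
    """Trajectory value at phase t in [0, 1], linear interpolation between knots."""
    grid = np.linspace(0.0, 1.0, len(knots))
    return np.interp(t, grid, knots)

def grow(knots):
    """Double the resolution without changing the represented trajectory."""
    fine_grid = np.linspace(0.0, 1.0, 2 * len(knots) - 1)
    return evaluate(knots, fine_grid)

policy = np.array([0.0, 0.1, 0.0])            # coarse CoM-height trajectory (3 parameters)
for level in range(3):
    policy = policy + 0.01 * np.random.randn(len(policy))   # exploration at this resolution
    policy = grow(policy)                                    # 3 -> 5 -> 9 -> 17 parameters
print("final number of policy parameters:", len(policy))
```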
Example: Bipedal Walking Energy Minimization Task
Experimental Set-up:
Reward: r = −k Σ_{j∈J} E_j, accumulated over c = 4 walking cycles, where
J is the set of interesting joints,
E_j is the accumulated consumed electric energy for the motor of the j-th individual joint, and
k is a scaling constant
(a minimal sketch of this reward follows)
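A minimal sketch of this reward, assuming a hypothetical dictionary of per-joint energy measurements accumulated over the 4 walking cycles; the scaling constant and joint names are made up.

```python
# Sketch of the energy-based reward described above: sum the electric
# energy consumed by each motor of the "interesting" joints over 4
# walking cycles and negate it, so lower consumption gives higher reward.
# The measurement interface, joint names and k are illustrative.

def walking_reward(energy_per_joint, k=1.0e-3):
    """energy_per_joint: {joint_name: accumulated energy in joules over 4 cycles}."""
    return -k * sum(energy_per_joint.values())

# Example with made-up measurements for a few joints.
print(walking_reward({"hip_pitch_l": 310.0, "hip_pitch_r": 305.0,
                      "knee_l": 420.0, "knee_r": 415.0}))
```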
Example: Bipedal Walking Energy Minimization Task
Results: the RL-optimal policy consumes 18% less energy than conventional fixed-height walking
The proposed evolving policy parameterization demonstrates major advantages:
faster convergence and higher rewards than the fixed policy parameterization,
varying resolution of the policy parameterization (adaptability and multi-resolution),
much lower variance of the generated policies (gradual exploration),
avoidance of local minima (globality)
Questions/Comments
Reference
Kormushev, P.; Calinon, S.; Caldwell, D.G., "Reinforcement Learning in Robotics: Applications and
Real-World Challenges," Robotics, vol. 2, no. 3, pp. 122-148, 2013.
Kober, J.; Bagnell, J.A.; Peters, J., "Reinforcement Learning in Robotics: A Survey," International
Journal of Robotics Research, July 2013.
Kormushev, P.; Calinon, S.; Saegusa, R.; Metta, G., "Learning the Skill of Archery by a Humanoid
Robot iCub," in Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference on,
pp. 417-423, Dec. 2010.