Transcript Slides
Reinforcement Learning in Robotics: Applications and Real-World Challenges
Petar Kormushev, Sylvain Calinon and Darwin G. Caldwell
Presenter: Iretiayo Akinola

Outline
- Introduction / Background / Motivation
- Problem Formulation: algorithms, policy representations, reward
- Experimental Analysis: Pancake Flipping Task; Archery-Based Aiming Task; Bipedal Walking Energy Minimization Task
- Comments

Why Reinforcement Learning in Robotics?
Intelligence acquisition alternatives:
- Direct programming: rule-based, complete control, structured environment
- Imitation learning: demonstration-based
  - Kinesthetic teaching: direct movement of the robot's body
  - Teleoperation: remote control over larger distances -> time delays
  - Observational learning: demonstrations captured with motion capture systems (video, sensors); the correspondence problem has to be solved
- Reinforcement learning: trial and error, a pseudo carrot-and-stick model (reward system)

Why Reinforcement Learning in Robotics? (continued)
- Learn new or non-demonstrable tasks
- Handle the absence of an analytical formulation or closed-form solution, e.g. the fastest gait
- Adapt a skill to a new scenario or a dynamically changing world
- Refine and improve knowledge gained from demonstration
- Adapt to changes in the robot itself, e.g. mechanical wear, heating up, a growing part

Reinforcement Learning
Components of a reinforcement learning system:
- algorithm
- reward function
- policy representation

Reinforcement Learning Specifications
- state s ∈ S, action a ∈ A, policy π
- The goal is to optimize an objective, the expected discounted sum of rewards:
  J(\pi) = E[ \sum_{h=0}^{H} \gamma^{h} R(s_h, a_h) ],
  where \gamma is the discount factor, h the time step, and R(s_h, a_h) the reward at time h.

RL Formulation
Two main solution approaches:
- Policy search (primal)
- Value function (dual): a closed-form solution is hard to obtain, and the approach is not as scalable

Curse of Dimensionality
- Example: a manipulation task with 2 × (7 + 3) = 20 state dimensions and 7-dimensional continuous actions.

Recent RL Algorithms
- MDP/POMDP (not so recent): finite spaces (discretization), Markov property
- Function approximation techniques, e.g. locally linear models
- Policy gradient methods: high sensitivity to the learning rate and the exploratory variance
- Expectation-Maximization (EM) based methods: no learning rate needed, easy to implement, efficient
- Policy-search RL (PI2): significant performance improvements, scalability to high-dimensional control problems
- Stochastic optimization: see reference [22] in the paper

Policy Representation
- The performance of an RL algorithm depends on the policy representation.
- Multifaceted challenges arise from the high requirements on the policy representation: prior/bias, correlations, adaptability, multi-resolution, globality, multi-dimensionality, convergence, etc.
- GMM/GMR: RL is used to adjust the Gaussian parameters of the learned model for robust performance.
- Dynamic Movement Primitives (DMP): switch between a set of attractors (motor primitives: dynamical systems), using a PD controller to reach the target.

RL in Action
- Archery-Based Aiming Task
- Pancake Flipping Task
- Bipedal Walking Energy Minimization Task
All three use EM-based RL algorithms but different policy representations.

Example: Archery-Based Aiming Task
- This example addresses the multi-dimensionality and convergence-speed challenges.
- Peculiarities of the archery task:
  - involves bi-manual coordination
  - performed with slow movements of the arms using small torques/forces
  - requires using tools (bow and arrow) to affect an external object (the target)
  - appropriate for testing learning algorithms, since the reward is relatively obvious
- Video: https://goo.gl/9cYnNw

Example: Archery-Based Aiming Task
- Goal: learn the bi-manual coordination necessary to control the shooting direction and velocity in order to hit the target.
- Approach: a learning algorithm modulates and coordinates the motion of the two hands, while an inverse kinematics controller handles the motion of the arms. Two algorithms are compared:
  - Expectation-Maximization-based Reinforcement Learning (PoWER)
  - Chained vector regression (ARCHER)
- Input: instruction on holding the bow and releasing the arrow
- Output: hit the center of the target

Example: Archery-Based Aiming Task
Expectation-Maximization-based RL (PoWER):
- does not need a learning rate (unlike policy-gradient methods)
- combines importance sampling with EM to make better use of previous experience
- uses a parameterized policy and tries to find parameter values that maximize the expected return under the corresponding policy
- for the archery task, the policy parameters are the elements of a 3D vector corresponding to the relative position of the two hands performing the task
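To make the PoWER update concrete, here is a minimal, self-contained sketch of an EM-style, reward-weighted update of the 3D hand-offset parameter vector. This is not the authors' implementation: the shoot() stand-in, the exploration variance and the toy reward landscape are illustrative assumptions replacing the real robot rollout and the vision-based reward.

```python
import numpy as np

def shoot(theta):
    """Stand-in for a real rollout on the robot: returns a scalar reward in (0, 1].
    Toy landscape: reward decays with the distance of the (simulated) arrow from the
    target center; on the real system it comes from the perceived arrow/target positions."""
    target_offset = np.array([0.05, -0.30, 0.02])   # hypothetical optimal relative hand position
    return float(np.exp(-10.0 * np.linalg.norm(theta - target_offset)))

def power_update(theta, rollouts, n_best=5):
    """EM-style PoWER update: reward-weighted average of the exploration offsets of the
    best rollouts so far (importance sampling); note that no learning rate is required."""
    best = sorted(rollouts, key=lambda r: r[1], reverse=True)[:n_best]
    weighted_offsets = sum(R * (theta_k - theta) for theta_k, R in best)
    total_reward = sum(R for _, R in best) + 1e-12
    return theta + weighted_offsets / total_reward

theta = np.zeros(3)          # policy parameters: relative 3D position of the two hands
rollouts = []                # memory of (parameters, reward) pairs for importance sampling
sigma = 0.05                 # exploration standard deviation (assumed)
for trial in range(60):      # 60 rollouts per run, as in the reported experiments
    theta_k = theta + sigma * np.random.randn(3)   # Gaussian exploration in parameter space
    rollouts.append((theta_k, shoot(theta_k)))
    theta = power_update(theta, rollouts)
print("final relative hand position:", theta)
```

ARCHER, described next, replaces the scalar reward in this kind of loop with the 2D position of the arrow relative to the target and a chained vector regression over the best rollouts.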
Example: Archery-Based Aiming Task
Expectation-Maximization-based RL (PoWER), reward function:
- r̂_T: estimated 2D position of the center of the target on the target plane
- r̂_A: estimated 2D position of the arrow's tip
- the reward is a function of the distance between r̂_A and r̂_T: the closer the arrow lands to the center, the higher the reward
- in the EM update, the relative exploration between the k-th and the current policy parameters is weighted by the corresponding reward

Example: Archery-Based Aiming Task
Augmented Reward Chained Regression (ARCHER):
- uses richer feedback information about the result of a trial: a position vector, whereas PoWER uses only scalar (distance) feedback
- the ARCHER solution chains a vector regression between the relative position of the hands (the policy parameters) and the 2-dimensional reward, i.e. the relative position of the arrow's tip with respect to the target

Example: Archery-Based Aiming Task
PoWER vs. ARCHER: 40 runs each, with 60 rollouts (trials) per run (convergence comparison figures shown on the slides).

Example: Archery-Based Aiming Task
PoWER vs. ARCHER verdict: RL in combination with regression yields an extremely fast-converging algorithm.

Questions/Comments

Example: Pancake Flipping Task
- Addresses the correlations, compactness and smoothness challenges of the policy representation.
- Goal: toss the pancake in the air so that it rotates 180°, then catch it with the frying pan.
- Approach: kinesthetic teaching is used to initialize the RL policy.
- Video: https://goo.gl/1qvJkC

Example: Pancake Flipping Task
Challenge 1: it is difficult to learn from multiple demonstrations:
- high variability of the task execution
- the generalization process may smooth out important acceleration peaks and sharp turns in the motion
Solution: select a single successful demonstration (among a small series of trials) to initialize the learning process.

Example: Pancake Flipping Task
Compact encoding with coupling:
- the movement is represented as a superposition of basis force fields
- the controller is based on a mixture of K proportional-derivative systems, parameterized by attractor vectors and coordination matrices
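The following is a minimal sketch of this kind of controller, not the authors' implementation: the Gaussian-in-time activations, the gains, the dimensions and the unit-mass simulation are assumptions made for illustration. The commanded force is a time-varying superposition of K proportional-derivative attractors, and the attractor vectors and coordination (stiffness) matrices are the quantities an RL algorithm would adjust.

```python
import numpy as np

K, D, T = 4, 3, 1.0                         # number of primitives, Cartesian dims, duration (assumed)
centers = np.linspace(0.0, T, K)            # temporal centers of the Gaussian activations
width = (T / K) ** 2

# Parameters the RL algorithm would adjust:
attractors = np.random.uniform(-0.1, 0.1, size=(K, D))        # attractor vectors
stiffness = np.stack([50.0 * np.eye(D) for _ in range(K)])     # coordination (stiffness) matrices
damping = 2.0 * np.sqrt(50.0) * np.eye(D)                      # shared derivative gain (assumed)

def activations(t):
    """Gaussian-in-time mixing weights, normalized to sum to one."""
    h = np.exp(-((t - centers) ** 2) / width)
    return h / h.sum()

def command(x, x_dot, t):
    """Superposition of K proportional-derivative force fields (basis force fields)."""
    h = activations(t)
    force = np.zeros(D)
    for k in range(K):
        force += h[k] * (stiffness[k] @ (attractors[k] - x) - damping @ x_dot)
    return force

# One simulated rollout: a unit-mass point driven by the superposed force fields.
x, x_dot, dt = np.zeros(D), np.zeros(D), 0.01
for step in range(int(T / dt)):
    acc = command(x, x_dot, step * dt)      # commanded acceleration for a unit mass
    x_dot += acc * dt
    x += x_dot * dt
```

If the coordination matrices are full rather than diagonal, the representation can capture couplings between control variables, which is the correlations/compactness argument made in the results below.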
Example: Pancake Flipping Task
Experimental set-up:
- The position and orientation of the pancake are tracked in real time by a reflective-marker-based motion capture system.
- Reward function, defined in terms of:
  - t_f: the time at which the pancake, moving downwards, passes the horizontal level at a fixed height Δh above the frying pan's current vertical position
  - v_0: the initial orientation of the pancake
  - v_{t_f}: the orientation of the pancake at time t_f
  - x_p: the position of the pancake center at time t_f

Example: Pancake Flipping Task
Results/Observations:
- the up-and-down bouncing of the frying pan towards the end of the learned skill shows the power of RL
- RL learns couplings across multiple motor control variables (correlations)
- using correlations in RL reduces the size of the representation (compactness)
- fast, dynamic tasks can still be represented and executed in a safe-for-the-robot manner (smoothness)

Questions/Comments

Example: Bipedal Walking Energy Minimization Task
- Challenge: the walking energy minimization problem is nearly impossible to solve analytically.
- Goal: apply RL to minimize the energy consumption required for walking of the passively-compliant bipedal robot.
- Approach: use an RL method to learn an optimal vertical trajectory for the robot's center of mass (CoM) to be used during walking, in order to minimize energy consumption.
- Video: https://goo.gl/ANzg9v

Example: Bipedal Walking Energy Minimization Task
Fixed policy parameterization:
- too simple a parameterization (a few parameters): convergence is quick, but often only a sub-optimal solution is reached
- too complex a parameterization: convergence is slow, and a much worse local optimum is likely
- target: both fast convergence and a good-enough solution

Example: Bipedal Walking Energy Minimization Task
Evolving policy parameterization:
- change the complexity of the policy representation dynamically
- "grow" the representation to accommodate increasingly more complex policies and get closer to the global optimum
- informed initialization is key: the grown representation provides backward compatibility with the policy learned so far (a minimal sketch of this growing step is given after the reference list)

Example: Bipedal Walking Energy Minimization Task
Experimental set-up, reward function defined in terms of:
- J: the set of joints of interest
- the accumulated electric energy consumed by the motor of the j-th individual joint
- k: a scaling constant
- c = 4 walking cycles over which the energy is accumulated

Example: Bipedal Walking Energy Minimization Task
Results:
- the RL-optimal policy consumes 18% less energy than conventional fixed-height walking
- the proposed evolving policy parameterization demonstrates three major advantages over the fixed parameterization:
  - faster convergence and higher rewards, thanks to the varying resolution of the policy parameterization (adaptability and multi-resolution)
  - much lower variance of the generated policies (gradual exploration)
  - avoidance of local minima (globality)

Questions/Comments

References
Kormushev, P.; Calinon, S.; Caldwell, D.G., "Reinforcement Learning in Robotics: Applications and Real-World Challenges," Robotics, vol. 2, no. 3, pp. 122-148, 2013.
Kober, J.; Bagnell, J.A.; Peters, J., "Reinforcement Learning in Robotics: A Survey," International Journal of Robotics Research, July 2013.
Kormushev, P.; Calinon, S.; Saegusa, R.; Metta, G., "Learning the skill of archery by a humanoid robot iCub," in Proc. 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 417-423, Dec. 2010.
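Below is the minimal sketch of the "growing" step referred to in the evolving-policy-parameterization slide above. It assumes a spline-knot encoding of the vertical CoM trajectory; the knot values, duration and the use of scipy are illustrative choices, not the authors' implementation. The point is backward compatibility: when the resolution is increased, the new parameter vector is initialized so that it reproduces the trajectory encoded by the old one, so learning continues from where it left off.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def grow_parameterization(knot_values, duration=1.0):
    """Double the resolution of the trajectory parameterization while preserving the
    trajectory it currently encodes: sample the existing spline on a refined knot grid."""
    n = len(knot_values)
    old_times = np.linspace(0.0, duration, n)
    new_times = np.linspace(0.0, duration, 2 * n - 1)
    spline = CubicSpline(old_times, knot_values)
    return spline(new_times)                 # backward-compatible initialization of the new policy

# Start with a deliberately coarse vertical CoM trajectory (4 knots; values in meters, assumed).
policy = np.array([0.50, 0.48, 0.52, 0.50])

for stage in range(3):
    # ... run a batch of EM-based RL rollouts with `policy` at the current resolution ...
    policy = grow_parameterization(policy)   # then refine the representation and keep learning
    print(f"stage {stage}: policy now has {len(policy)} parameters")
```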