Transcript Slides
Achieving Synergy in Cognitive Behavior of Humanoids via Deep Learning of Dynamic Visuo-Motor-Attentional Coordination (arXiv:1507.02347)
Presented by Jared Weiss

An overview of deep learning
● Began in the 70s/80s
● Has recently been getting a lot of attention as we have collected more data and optimized the way we train
● Used in computer vision, NLP, robotics, etc.

An overview of deep learning: CNNs
● Uses “convolutional” layers to compute activations from filters that are “slid” across an image.
● Each feature map (or filter) at a low level corresponds to low-level features (e.g. line detection)
● Hierarchy adds complexity
From Wikipedia

An overview of deep learning: RNNs
● Inputs and outputs are sequences of vectors
● Used primarily in NLP, where CNNs are too rigid
● The hidden states of the network are updated with every step of the network
https://karpathy.github.io/2015/05/21/rnn-effectiveness/

Deep learning networks: the VMDNN
● VMDNN = Visuo-Motor Deep Dynamic Neural Network
○ Made up of:
  MSTNN (Multiple Spatio-Temporal Neural Network) - used for understanding the image with temporal integration
  MTRNN (Multiple Timescale Recurrent Neural Network) - used for behavior generation (using sequential data)
○ The two networks are linked via the “PFC” (prefrontal cortex?)

The MTRNN (behavioral generation)
● Uses two “types” of neurons
○ Fast units (respond to changes quickly)
○ Slow units (more tolerant of noise)
● This is how we get the name “multiple timescale RNN”
Yuichi Yamashita, Jun Tani (2008)
● Inputs are joint states
○ (connected directly to ‘fast’ units, not slow units)
https://www.youtube.com/watch?v=n9NYcG8xlYs

The MSTNN (vision)
● Built on top of the MTRNN
○ Meant to combine the temporal capabilities of RNNs with the self-organizing spatial-hierarchy capabilities of CNNs.
○ The neurons are “leaky integrator” models from MTRNNs, meaning they maintain an internal state from previous time-steps.
○ Based on the decay constant 𝜏, which gets larger as we ascend the network’s layers, the neurons maintain their state for more time-steps (see the leaky-integrator sketch below).

The robot: iCub
Via iCub.org

Training: initial training of the MSTNN
● Pre-trained separately from the MTRNN using supervised learning
○ The MSTNN was trained using “target” labels to classify dynamic scenes
● This was done separately from the training of the MTRNN in order to develop good features in the visual layers.
○ Learning was disabled in these layers in the subsequent training of the whole VMDNN
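The MTRNN and MSTNN slides above both rest on “leaky integrator” neurons whose time constant 𝜏 determines how long internal state persists, with slow (large-𝜏) and fast (small-𝜏) populations. Below is a minimal NumPy sketch of that update rule; the unit counts, weights, time constants, and input signal are invented for illustration and are not taken from the paper.

```python
import numpy as np

def leaky_integrator_step(u, x, W, tau):
    """One update of leaky-integrator units (generic CTRNN-style sketch).

    u   : internal states of the units at the previous time-step
    x   : external input to each unit at this time-step
    W   : recurrent weight matrix (hypothetical, units x units)
    tau : per-unit time constants; larger tau -> slower decay, longer memory
    """
    y = np.tanh(u)                 # firing rates from internal states
    du = -u + W @ y + x            # drift toward the new recurrent + external input
    return u + du / tau            # units with large tau change slowly

# Toy example: 4 "fast" units (tau = 2) and 2 "slow" units (tau = 50),
# mirroring the fast/slow split described for the MTRNN.
rng = np.random.default_rng(0)
n_fast, n_slow = 4, 2
n = n_fast + n_slow
tau = np.concatenate([np.full(n_fast, 2.0), np.full(n_slow, 50.0)])
W = rng.normal(scale=0.3, size=(n, n))

u = np.zeros(n)
for t in range(100):
    x = np.zeros(n)
    x[:n_fast] = np.sin(0.2 * t)   # external (e.g. joint-state) input reaches
                                   # only the fast units, as on the MTRNN slide
    u = leaky_integrator_step(u, x, W, tau)

print("final internal states:", np.round(u, 3))
```

Running this, the fast units track the oscillating input almost immediately, while the slow units drift gradually, which is the multiple-timescales behavior the slides refer to.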
Training: the MTRNN and behavior generation
● The MTRNN was coupled to the MSTNN through the PFC
● Trained using Backpropagation Through Time (BPTT)
○ The network is “unfolded” for a sequence of a particular length
○ The weights are all updated at the same time for each sequence shown to the network (see the BPTT sketch at the end)

Experiments: Stage 1
Obstacle avoidance and object manipulation in simulation
- The robot focuses on the object by moving its head, then moves its arm while avoiding collision to attempt a grasp on the object
- Trials were run with unlearned obstacle position, object position, and orientation
- Success rate of 84.61%
- Used a time constant of 150 in the PFC subnetwork

Experiments: Stage 2 (Real-world)
- An extension of the first stage, adding human gesture recognition
- Did not attempt to test generalization by varying object location or orientation
- Neural activation patterns show distinct transitions between the observation phase and the action phase
- They also show differentiation in activations between choosing the left and right objects

Experiments: Video
https://www.youtube.com/watch?v=0hXAWQvnJJ4
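The training slide above says the recurrent part of the VMDNN is trained with BPTT: the network is unfolded over a fixed-length sequence and all weights receive one accumulated update per sequence. The sketch below illustrates that procedure on a generic vanilla RNN with a toy next-step prediction task; the architecture, data, and hyperparameters are placeholders, not the paper’s actual VMDNN or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8
seq_len = 10

Wxh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
Whh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrent)
Why = rng.normal(scale=0.1, size=(n_in, n_hid))   # hidden -> output
lr = 0.05

# Toy task: predict the next vector of a sinusoidal sequence.
xs = np.stack([np.sin(0.3 * np.arange(seq_len + 1) + p) for p in (0.0, 1.0, 2.0)], axis=1)
inputs, targets = xs[:-1], xs[1:]

for epoch in range(200):
    # Forward pass: unfold the recurrence over the whole sequence.
    hs = [np.zeros(n_hid)]
    ys = []
    for x in inputs:
        h = np.tanh(Wxh @ x + Whh @ hs[-1])
        hs.append(h)
        ys.append(Why @ h)

    # Backward pass: accumulate gradients across every unfolded time-step.
    dWxh = np.zeros_like(Wxh); dWhh = np.zeros_like(Whh); dWhy = np.zeros_like(Why)
    dh_next = np.zeros(n_hid)
    for t in reversed(range(seq_len)):
        dy = ys[t] - targets[t]                 # gradient of squared error at step t
        dWhy += np.outer(dy, hs[t + 1])
        dh = Why.T @ dy + dh_next               # error from output and from future steps
        dz = (1.0 - hs[t + 1] ** 2) * dh        # backprop through tanh
        dWxh += np.outer(dz, inputs[t])
        dWhh += np.outer(dz, hs[t])
        dh_next = Whh.T @ dz

    # One simultaneous update of all weights for this sequence, as on the slide.
    Wxh -= lr * dWxh; Whh -= lr * dWhh; Why -= lr * dWhy

mse = np.mean((np.array(ys) - targets) ** 2)
print(f"final training MSE: {mse:.4f}")
```

The point mirrored from the slide is that the gradient is summed over the entire unfolded sequence before a single weight update is applied, rather than updating after every time-step.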