Transcript Slides
Achieving Synergy in Cognitive Behavior of
Humanoids via Deep Learning of Dynamic
Visuo-Motor-Attentional Coordination
(arXiv:1507.02347)
Presented by Jared Weiss
An overview of deep learning
● Began in the 70s/80s
● Has recently been getting a lot of attention as we have collected more data
and optimized the way we train
● Used in computer vision, NLP, robotics, etc.
An overview of deep learning: CNNs
● Uses “convolutional” layers to compute activations on filters that are “slid”
across an image.
● Each feature map (or filter) at a low level corresponds to low-level
features (e.g. line detection)
● Hierarchy adds complexity
From Wikipedia
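A minimal sketch of the "slid filter" idea, assuming PyTorch (not part of the original slides): a small convolutional layer slides 3×3 filters across a single-channel image, producing one feature map of activations per filter.

```python
import torch
import torch.nn as nn

# One convolutional layer: 1 input channel, 8 filters (feature maps),
# each a 3x3 kernel that is slid across the image.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 1, 28, 28)       # (batch, channels, height, width)
feature_maps = torch.relu(conv(image))  # activations for each of the 8 filters
print(feature_maps.shape)               # torch.Size([1, 8, 28, 28])
```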
An overview of deep learning: RNNs
● Inputs and outputs are sequences of
vectors
● Used primarily in NLP where CNNs are
too rigid
● The hidden states of the network are
updated with every step of the network
https://karpathy.github.io/2015/05/21/rnn-effectiveness/
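A minimal sketch of the recurrent update in plain NumPy (variable names are illustrative, not from the paper): the hidden state is carried forward and updated at every step of the input sequence.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state depends on the current input and the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

input_dim, hidden_dim, seq_len = 4, 8, 10
W_xh = np.random.randn(input_dim, hidden_dim) * 0.1
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
sequence = np.random.randn(seq_len, input_dim)   # a sequence of input vectors
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)        # hidden state updated every step
```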
Deep learning networks: the VMDNN
● VMDNN = Visuo-Motor Deep Dynamic Neural Network
○ Made up of:
MSTNN (Multiple Spatio-Temporal Neural Network)
- Used for understanding the image with temporal
integration
MTRNN (Multiple Timescale Recurrent Neural Network)
- Used for behavior generation (using sequential data)
○ The two networks are linked via the “PFC” (prefrontal cortex) subnetwork
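A highly simplified sketch of that coupling idea, assuming PyTorch (module choices and sizes are invented for illustration and are not the authors' architecture): a vision pathway and a motor pathway exchange information only through a shared "PFC" layer.

```python
import torch
import torch.nn as nn

class ToyVMDNN(nn.Module):
    """Illustrative only: vision and motor pathways joined by a shared PFC layer."""
    def __init__(self, vision_dim=64, motor_dim=16, pfc_dim=32):
        super().__init__()
        self.vision = nn.Linear(vision_dim, pfc_dim)   # stands in for the MSTNN
        self.motor = nn.Linear(motor_dim, pfc_dim)     # stands in for the MTRNN
        self.pfc = nn.Linear(pfc_dim, pfc_dim)         # shared layer linking the two
        self.motor_out = nn.Linear(pfc_dim, motor_dim)

    def forward(self, image_features, joint_state):
        # Both pathways feed into the shared PFC representation...
        pfc_state = torch.tanh(self.pfc(self.vision(image_features) + self.motor(joint_state)))
        # ...which drives the next motor command.
        return self.motor_out(pfc_state)
```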
The MTRNN (behavioral generation)
● Uses two “types” of neurons
○ Fast units (respond to changes quickly)
○ Slow units (more tolerant of noise)
● This is how we get the name “multiple timescale RNN”
● Inputs are joint states
○ (connected directly to ‘fast’ units, not slow units)
Yuichi Yamashita, Jun Tani (2008)
https://www.youtube.com/watch?v=n9NYcG8xlYs
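A minimal NumPy sketch of the multiple-timescale idea (time constants and sizes are illustrative, not the paper's values): every unit is a leaky integrator, "fast" units use a small time constant and "slow" units a large one, and the external (joint-state) input reaches only the fast units.

```python
import numpy as np

def mtrnn_step(u, x, W, tau):
    """Leaky-integrator update: small tau -> fast units, large tau -> slow units."""
    y = np.tanh(u)                               # unit activations
    inp = W @ np.concatenate([y, x])             # recurrent + external input
    return (1.0 - 1.0 / tau) * u + (1.0 / tau) * inp

n_fast, n_slow, n_in = 20, 10, 4
n_units = n_fast + n_slow
tau = np.concatenate([np.full(n_fast, 2.0),      # fast units: respond quickly
                      np.full(n_slow, 50.0)])    # slow units: change gradually
W = np.random.randn(n_units, n_units + n_in) * 0.1
W[n_fast:, n_units:] = 0.0                       # joint-state inputs reach only the fast units

u = np.zeros(n_units)
for t in range(100):
    joints = np.random.randn(n_in)               # joint states as input at each step
    u = mtrnn_step(u, joints, W, tau)
```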
The MSTNN (vision)
● Built on top of MTRNN
○ Meant to combine temporal capabilities of RNNs with the self-organizational spatial hierarchy
capabilities of CNNs.
○ The neurons are “leaky integrator models” from MTRNNs, meaning they maintain an internal
state from previous time-steps.
○ The time constant, 𝜏, grows as we ascend the network’s layers, so neurons in higher
layers maintain their state over more time-steps.
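The leaky-integrator update the slide refers to can be written in a standard CTRNN/MTRNN form (shown here for reference, not quoted from the paper):

```latex
u_{i,t} = \Bigl(1 - \tfrac{1}{\tau_i}\Bigr)\, u_{i,t-1}
        + \tfrac{1}{\tau_i}\Bigl(\textstyle\sum_j w_{ij}\, a_{j,t-1} + b_i\Bigr),
\qquad a_{i,t} = \tanh(u_{i,t})
```

A larger 𝜏ᵢ (used in higher layers) means the previous state u_{i,t-1} dominates the update, so the unit integrates information over more time-steps.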
The MSTNN (vision)
The robot: iCub
Via iCub.org
Training: initial training of MSTNN
● Pre-trained separately from MTRNN using supervised learning
○ The MSTNN was trained with “target” labels to classify dynamic scenes into categories
● This was done separately from the training of the MTRNN in order to
develop good features in the visual layers; learning was then disabled in these
layers during the subsequent training of the whole VMDNN
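A minimal sketch of the "pre-train, then freeze" step, assuming PyTorch (the module names are placeholders, not the authors' code): after supervised pre-training, the visual layers are excluded from further weight updates.

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder modules standing in for the pre-trained vision pathway
# and the rest of the network.
vision_layers = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
rest_of_network = nn.Linear(8 * 28 * 28, 10)

# ... supervised pre-training of vision_layers on scene labels happens here ...

# Freeze the pre-trained visual layers so later training of the whole network
# leaves them untouched.
for param in vision_layers.parameters():
    param.requires_grad = False

# Only the remaining parameters are handed to the optimizer.
optimizer = optim.Adam(rest_of_network.parameters(), lr=1e-3)
```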
Training: The MTRNN and behavior generation
● MTRNN was coupled to the MSTNN through the PFC
● Trained using Backpropagation Through Time (BPTT)
○ The network is “unfolded” for a sequence of a particular length
○ The weights are all updated at the same time for each sequence shown to the network
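A minimal sketch of BPTT, assuming PyTorch (sizes are illustrative): the network is unfolded over a whole sequence, the loss is computed across all steps, and a single backward pass updates all the weights for that sequence at once.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 4)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

sequence = torch.randn(1, 50, 4)   # one training sequence of length 50
target = torch.randn(1, 50, 4)     # e.g. next-step joint states

# "Unfold" over the whole sequence: the forward pass runs all 50 steps,
# and the backward pass propagates error through every one of them (BPTT).
outputs, _ = rnn(sequence)
loss = loss_fn(readout(outputs), target)

optimizer.zero_grad()
loss.backward()        # gradients flow back through every time-step
optimizer.step()       # all weights updated at once for this sequence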
Experiments: Stage 1
Obstacle avoidance and object manipulation in simulation
- Robot focuses on the object by moving its head, then moves its arm while avoiding
collision to attempt a grasp on the object
- Trials were run with unlearned obstacle position, object position, and
orientation
- Success rate of 84.61%
- Used a time constant of 150 in the PFC subnetwork
Experiments: Stage 2 (Real-world)
- An extension of the first stage, adding human gesture recognition
- Did not attempt to test generalization by varying object location or orientation
- Neural activation patterns show distinct transitions between the observation
phase and the action phase
- Activation patterns also differ between choosing the left and the right object
Experiments: Video
https://www.youtube.com/watch?v=0hXAWQvnJJ4