Transcript Document

Yair Weiss (CS, HUJI); Daphna Weinshall (CS, HUJI); Amnon Shashua (CS, HUJI); Yonina Eldar (EE, Technion); Ron Meir (EE, Technion)

The human brain can rapidly recognize thousands of objects while using less power than modern computers use in “quiet mode”.

Although there are many neurons devoted to visual recognition, only a tiny fraction fire at any given moment. Can we use machine learning to learn such representations and build systems with such performance?

We believe a key component enabling this remarkable performance is the use of sparse representations, and seek to develop brain-inspired hierarchical representations for visual recognition.

1. Low-power visual recognition: sparsity before the A2D, with algorithms and theory for sparsification in the analog domain (Weiss & Eldar)
2. Extracting informative features from sensory input: an approach based on slowness (Meir & Eldar)
3. Sparsity at all levels of the hierarchy: algorithms for learning hierarchical sparse representations, from the input to the top levels (Weinshall & Shashua)

• Research direction: compressed sensing for low-power cameras
• We have shown that random projections are a poor fit for compressed sensing of natural images.
• We are working on better linear projections that take advantage of image statistics (a sketch follows below).
• We want to explore nonlinear compressed sensing.

Optimizing projections for recognition, not for visual reconstruction.

Weiss & Eldar
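A minimal sketch of the contrast between random and statistics-aware measurements. All function names are illustrative, and the PCA-based projection is only a stand-in for the learned projections under study, not the project's actual method; the random data are a placeholder for natural-image patches.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_measurements(patches, m):
        # Classic compressed sensing: m random Gaussian measurements per patch.
        d = patches.shape[1]
        A = rng.standard_normal((m, d)) / np.sqrt(m)
        return A, patches @ A.T

    def statistics_aware_measurements(patches, m):
        # Illustrative stand-in for "projections that take advantage of image
        # statistics": use the top-m principal directions of the patch ensemble
        # as measurement vectors.
        _, _, Vt = np.linalg.svd(patches - patches.mean(axis=0), full_matrices=False)
        return Vt[:m], patches @ Vt[:m].T

    def linear_recon_error(A, y, patches):
        # Mean squared error of the simplest linear decoder (pseudo-inverse).
        return np.mean((y @ np.linalg.pinv(A).T - patches) ** 2)

    patches = rng.standard_normal((500, 64))   # stand-in for 8x8 natural-image patches
    for name, measure in [("random", random_measurements),
                          ("statistics-aware", statistics_aware_measurements)]:
        A, y = measure(patches, 16)
        print(name, linear_recon_error(A, y, patches))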

• Sensory signals effectively represent environmental signals.
• Slow Feature Analysis (SFA) extracts features based on slowness (a sketch follows below).
• The approach has been applied successfully to generate biologically plausible features, to blind source separation, and to pattern recognition.
• SFA does not deal directly with representational accuracy.
• We formulate a generalized multi-objective criterion balancing representational and temporal reliability, and obtain a feasible objective for optimization.
• Preliminary results demonstrate advantages over SFA.
• Future work: robustness, online learning, distributed implementation through local learning rules.
Ron Meir
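A minimal sketch of textbook linear SFA (whiten the input, then keep the directions whose temporal derivative has the smallest variance). The generalized multi-objective criterion mentioned above is not implemented here, and the toy data are an assumption.

    import numpy as np

    def linear_sfa(X, n_features):
        """Linear SFA: X has shape (T, d), rows are consecutive time steps."""
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / len(Xc)
        evals, evecs = np.linalg.eigh(cov)
        whiten = evecs / np.sqrt(evals)              # whitening matrix (unit covariance)
        Z = Xc @ whiten
        dZ = np.diff(Z, axis=0)                      # temporal derivative (finite difference)
        devals, devecs = np.linalg.eigh(dZ.T @ dZ / len(dZ))
        return whiten @ devecs[:, :n_features]       # smallest eigenvalues = slowest features

    # Toy example: a slow sinusoid mixed with fast noise into 5 observed channels.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 2000)
    sources = np.c_[np.sin(0.5 * t), rng.standard_normal(len(t))]
    X = sources @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((len(t), 5))
    W = linear_sfa(X, n_features=1)
    slow = (X - X.mean(axis=0)) @ W                  # recovers the sinusoid up to sign/scale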

• Instance-based recognition typically relies on similarity metrics.
• Class-based recognition typically relies on statistical learning.
• Our goal: develop a unified statistical learning framework for both recognition tasks (object instances and object classes).
Cohen & Shashua

• Cognitive psychology: the Basic-Level Category (Rosch 1976), an intermediate level of the category hierarchy which is learnt faster and earlier than the other levels.
• Neurophysiology: agglomerative clustering of responses from a population of neurons in the IT of macaque monkeys resembles an intuitive hierarchy (Kiani et al. 2007).

• Goal: jointly learn classifiers for a few tasks.
• Implicit goal: information sharing
◦ Achieve a more economical overall representation
◦ A way to enhance impoverished training data
◦ Knowledge transfer (learning to learn)
• Our method: share information hierarchically in a cascade, whose levels are automatically discovered.
Publication: Regularization Cascade for Joint Learning, Alon Zweig and Daphna Weinshall, ICML, June 2013

• How do we compute the classifiers?

Build classifiers for all tasks, each a linear combination of classifiers computed in a cascade:
◦ Higher levels: high incentive for information sharing → more tasks participate, classifiers are less precise.
◦ Lower levels: low incentive for sharing → fewer tasks participate, classifiers become more precise.
• How do we control the incentive to share?

◦ Vary the regularization of the loss function.

• Regularization:
◦ Restrict the number of features the classifiers can use by imposing sparse regularization: ||·||_1.
◦ Add another sparse regularization term which does not penalize joint features: ||·||_{1,2}.
◦ Combined regularizer: λ||·||_{1,2} + (1 − λ)||·||_1.
◦ Incentive to share: λ = 1 gives the highest incentive to share, λ = 0 gives no incentive to share (a sketch follows below).
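A minimal sketch of the mixed regularizer for a weight matrix W with one row per feature and one column per task (this layout is an assumption for illustration): the ||·||_{1,2} term charges a feature roughly once even if many tasks use it, while the ||·||_1 term charges every nonzero entry.

    import numpy as np

    def mixed_norm(W, lam):
        """lam * ||W||_{1,2} + (1 - lam) * ||W||_1 for W of shape (features, tasks)."""
        l12 = np.sum(np.linalg.norm(W, axis=1))   # row-wise group norm: shared features are cheap
        l1 = np.sum(np.abs(W))                    # entry-wise norm: every nonzero weight is charged
        return lam * l12 + (1.0 - lam) * l1

    # One feature shared by all 3 tasks vs. the same weight mass on task-specific features:
    W_shared = np.zeros((4, 3)); W_shared[0, :] = 1.0
    W_private = np.eye(4)[:, :3]
    print(mixed_norm(W_shared, 1.0), mixed_norm(W_private, 1.0))  # ~1.73 vs 3.0: sharing is cheaper
    print(mixed_norm(W_shared, 0.0), mixed_norm(W_private, 0.0))  # 3.0 vs 3.0: no preference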

[Illustration: shared features (Head, Legs, Wings, Long Beak, Short Beak, Trunk, Short Ears, Long Ears) across the classes Eagle, Owl, Asian Elephant and African Elephant; each classifier is built as a sum of shared and class-specific components.]

Cascade levels:
Loss + ||·||_{1,2}
Loss + λ||·||_{1,2} + (1 − λ)||·||_1
Loss + ||·||_1

• We train a linear classifier in multi-task and multi-class settings, as defined by the respective loss function.
• Iterative algorithm over the basic step: Θ = {W, b}, where Θ′ stands for the parameters learnt up to the current step.
• λ governs the level of sharing, from maximal sharing (λ = 0) to none (λ = 1); at each step λ is increased.
• The aggregated parameters, together with the decreased level of sharing, guide the learning to focus on more task/class-specific information than in the previous step (a sketch follows below).
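A minimal sketch of the cascade loop under plain assumptions: `fit_level` stands for any solver of the regularized loss at a given λ, and the way the previously aggregated parameters Θ′ enter the loss is left to that solver. This illustrates the structure of the cascade, not the exact algorithm of the paper.

    import numpy as np

    def regularization_cascade(X, Y, lambdas, fit_level):
        """Structure of the cascade (illustrative).

        X: (n_samples, n_features); Y: (n_samples, n_tasks) with +/-1 labels.
        lambdas: one value per level, controlling the incentive to share.
        fit_level(X, Y, lam, theta_prev) -> (W_k, b_k): any solver that minimizes
            the multi-task loss around the aggregated parameters theta_prev,
            with the mixed-norm penalty weighted by lam.
        """
        n_features, n_tasks = X.shape[1], Y.shape[1]
        W_total = np.zeros((n_features, n_tasks))   # aggregated parameters (theta')
        b_total = np.zeros(n_tasks)
        for lam in lambdas:                          # lam changes from level to level
            W_k, b_k = fit_level(X, Y, lam, (W_total, b_total))
            W_total += W_k                           # each task's final classifier is a sum
            b_total += b_k                           # of the per-level classifiers
        return W_total, b_total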

• Synthetic and real data (many sets)
• Multi-task and multi-class loss functions
• Low-level features vs. high-level features
• We compare the cascade approach against the same algorithm with:
◦ No regularization
◦ L1 sparse regularization
◦ L1,2 multi-task regularization

[Figure: synthetic hierarchy over tasks T1–T4, compared under the NoReg and L1,2 baselines.]

Synthetic experiment: 100 tasks, with 20 positive and 20 negative samples per task.

[Figures: output of cascade steps 1–5.]

[Figures: performance as a function of sample size.]

Datasets: Caltech-101, Caltech-256, Imagenet, Cifar-100 (a subset of the Tiny Images dataset), and MIT-Indoor-Scene (annotated with LabelMe).

Representation for sparse hierarchical sharing: low-level vs. mid-level features.
◦ Low-level features: any image features computed from the image via some local or global operator, such as Gist or Sift.
◦ Mid-level features: features capturing some semantic notion, obtained as classifiers over low-level features (a sketch follows after the table below).

Low-level features:
◦ Cifar-100: Gist, with RBF kernel approximation by random projections (Rahimi et al., NIPS ’07)
◦ Imagenet: Sift, 1000-word codebook, tf-idf normalization
Mid-level features:
◦ Caltech-101: feature-specific classifiers (Gehler et al. 2009)
◦ Caltech-256: feature-specific classifiers or Classemes (Torresani et al. 2010)
◦ Indoor-Scene: Object Bank (Li et al. 2010)
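A minimal sketch of the low-level vs. mid-level distinction. The `extractor` callable, the auxiliary label set, and the logistic-regression bank are illustrative stand-ins for Gist/Sift and for feature-specific classifiers, Classemes, or Object Bank.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def low_level_features(images, extractor):
        # Low level: any local/global image operator (e.g. Gist or a Sift
        # bag-of-words); `extractor` is assumed to return one vector per image.
        return np.stack([extractor(img) for img in images])

    def train_mid_level_bank(X_low, y_aux, n_aux_classes):
        # Mid level: classifiers trained over the low-level features on an
        # auxiliary vocabulary of semantic classes.
        return [LogisticRegression(max_iter=1000).fit(X_low, (y_aux == c).astype(int))
                for c in range(n_aux_classes)]

    def mid_level_features(X_low, bank):
        # Each image is described by the confidence of every classifier in the bank.
        return np.column_stack([clf.predict_proba(X_low)[:, 1] for clf in bank])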

[Results: Cifar-100; MIT Indoor Scene with Object Bank (multi-class); Caltech-256 with Gehler features (multi-task).]

• Main objective: a faster learning algorithm for dealing with larger datasets (more classes, more samples).
• Iterate over the original algorithm for each new sample, where each level uses the current value of the previous level.
• Solve each step of the algorithm using the online version presented in “Online Learning for Group Lasso”, Yang et al. 2011 (we proved regret convergence); a generic sketch follows below.
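A minimal sketch of a generic online update with group sparsity, i.e. a gradient step followed by group soft-thresholding. This only illustrates the flavor of such updates; it is not the specific dual-averaging algorithm of Yang et al. 2011.

    import numpy as np

    def group_soft_threshold(W, tau):
        # Proximal operator of tau * ||W||_{1,2}: shrink each feature row toward zero.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        return W * np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))

    def online_group_sparse_step(W, x, y, lr, tau):
        """One online step for a multi-task squared loss with a group penalty.

        W: (n_features, n_tasks); x: (n_features,) one sample; y: (n_tasks,) its labels.
        """
        grad = np.outer(x, x @ W - y)              # gradient of 0.5 * ||x @ W - y||^2
        return group_soft_threshold(W - lr * grad, lr * tau)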

• Experiment on 1000 classes from Imagenet (ILSVRC2010), with 3000 samples per class and 21000 features per sample:

              Top1    Top2    Top3    Top4    Top5
  H           0.285   0.365   0.403   0.434   0.456
  Zhao et al. 0.221   0.302   0.366   0.411   0.435

[Plot: accuracy as a function of data repetitions.]

A different setting for sharing: share information between pre-trained models and a new learning task (typically a small-sample setting).

• An extension of both the batch and online algorithms, but the online extension is more natural.
• Gets as input the implicit hierarchy computed during training with the known classes.
• When given examples from a new task:
◦ The online learning algorithm continues from where it stopped.
◦ The matrix of weights is enlarged to include the new task, and the weights of the new task are initialized.
◦ Sub-gradients of known classes are not changed (a sketch follows below).
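A minimal sketch of the bookkeeping described above. The zero initialization of the new column and the frozen-column update rule are assumptions for illustration; the actual update is whatever online solver is used for the known classes.

    import numpy as np

    def add_new_task(W, b, subgrad_state):
        """Enlarge a trained multi-task model with a new task K+1.

        W: (n_features, K) weights of the known tasks; b: (K,) biases;
        subgrad_state: whatever running quantities the online solver keeps.
        """
        n_features, K = W.shape
        W_new = np.hstack([W, np.zeros((n_features, 1))])  # new column, initialized to zero
        b_new = np.append(b, 0.0)
        frozen = set(range(K))       # sub-gradients of the known classes are not changed
        return W_new, b_new, subgrad_state, frozen

    def continue_online_learning(W, frozen, grad, lr):
        # Online updates touch only the columns (tasks) that are not frozen.
        active = [k for k in range(W.shape[1]) if k not in frozen]
        W[:, active] -= lr * grad[:, active]
        return W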

[Diagram: tasks 1..K are learnt jointly (MTL); a new task K+1 is added either by the online KT method, which continues the online updates, or by the batch KT method, which forms a weighted combination (α·π terms) of the trained models.]

[Results: synthetic data and ILSVRC2010, performance as a function of sample size.]

• We assume a hierarchical structure of shared information which is unknown; the hierarchy is exploited implicitly.

• We describe a cascade based on varying sparse regularization, for multi-task/multi-class and knowledge-transfer algorithms.

• The cascade shows improved performance in all experiments.

• We investigate different visual representation schemes: multi-task learning gives more value with higher-level features.

• Different levels of sharing help and can be efficient.
