Deep Learning - University of Houston


Kernel Analysis of Deep Networks

By: Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller (Technical University of Berlin)

JMLR 2011

Presented by: Behrang Mehrparvar (University of Houston)

April 8th, 2014

Roadmap

- Deep learning
- Goodness of representations
- Measuring goodness
- Role of architecture

Deep Learning?

- Distributed representation
  - Fewer examples needed per region of the input space
  - Captures global structure
- Depth
  - Efficient representation
- Abstraction
  - Flexibility
  - Higher-level features
  - Incorporates prior knowledge

Distributed Representation [1]

Depth [2]

Abstraction [?]

Problem Specification

- Deep learning is still a black box!

- Theoretical aspect
  - e.g. studying depth in sum-product networks
- Analytical arguments
  - e.g. analysis of depth
- Experimental results
  - e.g. performance in application domains
- Visualization
  - e.g. measuring invariance

Kernel Methods

- Decouples learning algorithms from the data representation
- Kernel operator:
  - Measures similarity between points
  - Carries all the prior knowledge of the learning problem
- In this paper:
  - Not used as a learning machine
  - An abstraction tool to model the deep network

Kernel Methods (cont.)

- Kernel methods model the deep network
- Used to quantify:
  - the goodness of representations
  - the evolution of good representations
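
To make the kernel's role concrete, here is a minimal Python sketch (not the authors' code) of an RBF kernel matrix computed over one layer's representations; the kernel width sigma and the random example data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_matrix(Z, sigma=1.0):
    """Pairwise similarities k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 * sigma^2)).

    Z     : (n, d) array of one layer's representations (hypothetical input)
    sigma : kernel width; this is where prior knowledge about the relevant
            scale of similarity enters.
    """
    sq_norms = np.sum(Z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma ** 2))

# Illustrative use on random "representations" (5 points, 10 features).
K = rbf_kernel_matrix(np.random.randn(5, 10), sigma=2.0)
print(K.shape)  # (5, 5); entries near 1 mean the two points look similar
```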

Hypothesis

1) Representations become simpler and more accurate with depth.
2) The structure of the network (its restrictions) determines how quickly good representations are formed.
   - Evolution from the distribution of pixels to the distribution of classes

Problem Specification

- Problem: the role of depth in the goodness of representations
- Challenge: defining and measuring goodness
- Solution:
  - Simplicity
    - Dimensionality: number of kernel principal components
    - Number of local variations
  - Accuracy
    - Classification error
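
As a rough illustration of the dimensionality notion of simplicity, the sketch below counts how many leading kernel principal components are needed to retain most of the variance of a kernel matrix K; the 95% threshold is an assumption for illustration, not a value taken from the paper (which relates dimensionality to prediction error rather than to retained variance).

```python
import numpy as np

def kernel_pca_dimensionality(K, variance_kept=0.95):
    """Count the leading kernel PCs that retain `variance_kept` of the variance."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n           # center in feature space
    eigvals = np.clip(np.linalg.eigvalsh(Kc)[::-1], 0.0, None)   # descending, >= 0
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(cumulative, variance_kept) + 1)   # few PCs = simple
```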

Hypothesis (Cont.)

Method

1) Train the deep network.
2) Infer the representation at each layer.
3) Apply kernel PCA to each layer's representation.
4) Project the data points onto the first d eigenvectors.
5) Analyze the results.
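
A minimal sketch of steps 2-5, assuming the trained network exposes a hypothetical layer_outputs(X) method returning each layer's representation; the RBF kernel parameters, the choice of logistic regression as the simple classifier, and the train/test split are assumptions made for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def layer_error_vs_dimension(Z, y, dims=(1, 2, 4, 8, 16, 32)):
    """Project one layer's representation Z onto the first d kernel PCs and
    report a simple classifier's test error for each d."""
    errors = {}
    for d in dims:
        proj = KernelPCA(n_components=d, kernel="rbf", gamma=0.1).fit_transform(Z)
        Ztr, Zte, ytr, yte = train_test_split(proj, y, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(Ztr, ytr)
        errors[d] = 1.0 - clf.score(Zte, yte)
    return errors  # a "good" layer reaches low error already at small d

# for layer, Z in enumerate(network.layer_outputs(X)):   # hypothetical API
#     print(layer, layer_error_vs_dimension(Z, y))
```

A layer whose error drops quickly as d grows is both simple (few components suffice) and accurate (low error), which is what the hypotheses above predict for higher layers.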

Method (Analysis)

Why Kernels?

1) Incorporating prior knowledge
2) Measurable simplicity and accuracy
3) Theoretical framework and convergence bounds [3]
4) Flexibility

Dimensionality and Complexity

Dimensionality and Complexity (cont.)

Intuition

- Accuracy
  - Task-relevant information
- Simplicity
  - Number of allowed local variations in the input space
  - However, does not explain domain-specific regularities
  - Robust to the number of samples (vs. the number of support vectors)

Effects of Kernel mapping

Experiment setup

- Datasets
  - MNIST
  - CIFAR
- Tasks
  - Supervised learning
  - Transfer learning
- Architectures
  - Multilayer perceptron (MLP)
  - Pretrained multilayer perceptron (PMLP)
  - Convolutional neural network (CNN)

Effect of Settings

Effect of Depth (Hyp. 1)

Observation

- Higher layers
  - More accurate representations
  - Simpler representations

Architectures

- Multilayer perceptron (MLP)
  - No preconditioning on the learning problem
  - Prior: none
- Pretrained multilayer perceptron (PMLP)
  - Better captures the underlying representation
  - Contains a certain part of the solution
  - Prior: generative model of the input
- Convolutional neural network (CNN)
  - Prior: spatial invariance
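
For concreteness, here is a hedged PyTorch sketch contrasting the two extreme priors (none vs. spatial invariance); the layer sizes are illustrative, assume 28x28 single-channel inputs such as MNIST, and are not the paper's exact configurations.

```python
import torch.nn as nn

# MLP: fully connected layers only, no prior about the input structure.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 10),
)

# CNN: weight sharing and pooling encode a prior of spatial invariance.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
)

# A PMLP would reuse the MLP layers but initialize them from an unsupervised
# generative model of the input (layer-wise pretraining) before fine-tuning.
```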

Multilayer Perceptron [4]

Convolutional Neural Networks [4]

Effect of Architecture (Hyp. 2)

Observation

- MNIST
  - MLP: discrimination is solved greedily in the early layers
  - PMLP and CNN: discrimination is postponed to the last layers
- CIFAR
  - MLP: does not discriminate until the last layer
  - PMLP and CNN: discrimination is spread across more layers

WHY?!

- Good observation, but no explanation is given!
- Possible hints: dataset, priors, etc.?

Effect of Architecture (Cont.)

Observation

- Regularities in PMLP and CNN
  - Facilitate the construction of a structured solution
  - Control the rate of discrimination at every level

Label Contribution of PCs

Comments

- Strengths
  - Important and interesting problem
  - Simple and intuitive approach
  - Well-designed experiments
  - Good analysis of results

- Weaknesses
  - Too many observations (e.g. the role of sigma in scale invariance)
  - Observations are not explained

Future work?

- Experiments on unsupervised learning
- Explaining the results
- Analysis of biological neural systems?!

References

1) Bengio, Yoshua, and Olivier Delalleau. "On the expressive power of deep architectures." Algorithmic Learning Theory. Springer Berlin Heidelberg, 2011.

2) Poon, Hoifung, and Pedro Domingos. "Sum-product networks: A new deep architecture." 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011.

3) Braun, Mikio L., Joachim M. Buhmann, and Klaus-Robert Müller. "On relevant dimensions in kernel feature spaces." Journal of Machine Learning Research 9 (2008): 1875-1908.

4) http://deeplearning.net/

Thanks ...