Statistical Learning Theory

Statistical Learning Theory
A model of supervised learning consists of:
a) Environment. It supplies a vector $\mathbf{x}$ with a fixed but unknown probability distribution $F_X(\mathbf{x})$.
b) Teacher. It provides a desired response $d$ for every $\mathbf{x}$ according to a conditional distribution $F_X(\mathbf{x} \mid d)$. The two are related by $d = f(\mathbf{x}, v)$,
Statistical Learning Theory
where $v$ is a noise term.
c) Learning machine. It is capable of implementing a set of input-output (I/O) mapping functions
$$y = F(\mathbf{x}, \mathbf{w})$$
where $y$ is the actual response and $\mathbf{w}$ is a set of free parameters (weights) selected from the parameter (weight) space $\hat{W}$.
Statistical Learning Theory
The supervised learning problem is that of selecting the particular $F(\mathbf{x}, \mathbf{w})$ that approximates $d$ in an optimum fashion. The selection is based on a set of $N$ i.i.d. training samples
$$\hat{T} = \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$$
Each sample in $\hat{T}$ is drawn according to the joint distribution $F_{X,D}(\mathbf{x}, d)$.
Statistical Learning Theory
Supervised learning hinges on the following question:
"Do the training examples $\{(\mathbf{x}_i, d_i)\}$ contain enough information to construct a learning machine capable of good generalization?"
To answer it, we view the problem as an approximation problem: we wish to find the function $F(\mathbf{x}, \mathbf{w})$ that is the best possible approximation to $f(\mathbf{x})$.
Statistical Learning Theory




Let
$$L(d, F(\mathbf{x}, \mathbf{w})) = (d - F(\mathbf{x}, \mathbf{w}))^2$$
denote a measure of the discrepancy between the desired response $d$ corresponding to an input vector $\mathbf{x}$ and the actual response produced by $F(\mathbf{x}, \mathbf{w})$.
The expected value of the loss is defined by the risk functional
$$R(\mathbf{w}) = \int L(d, F(\mathbf{x}, \mathbf{w})) \, dF_{X,D}(\mathbf{x}, d)$$
Statistical Learning Theory
The risk functional may be understood intuitively from the finite approximation
$$R(\mathbf{w}) \approx \sum_i L(d_i, F(\mathbf{x}_i, \mathbf{w})) \, P(\mathbf{x}_i, d_i)$$
where $P(\mathbf{x}_i, d_i)$ denotes the probability of drawing the $i$-th sample.
Principle of Empirical Risk
Minimization

Instead of using $R(\mathbf{w})$ we use an empirical measure:
$$R_E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, F(\mathbf{x}_i, \mathbf{w}))$$
This measure differs from $R(\mathbf{w})$ in two desirable ways:
a) It does not depend explicitly on the unknown distribution $F_{X,D}(\mathbf{x}, d)$.
Principle of Empirical Risk
Minimization
b) In theory it can be minimized with respect to $\mathbf{w}$.
Let $\mathbf{w}_E$ and $F(\mathbf{x}, \mathbf{w}_E)$ denote the weight vector and the mapping that minimize $R_E(\mathbf{w})$. Also, let $\mathbf{w}_0$ and $F(\mathbf{x}, \mathbf{w}_0)$ denote their analogues for $R(\mathbf{w})$. Both $\mathbf{w}_E$ and $\mathbf{w}_0$ belong to the weight space $\hat{W}$.
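As a concrete illustration, here is a minimal Python sketch (added here, not part of the original lecture) of empirical risk minimization under the squared loss for a hypothetical linear learning machine $F(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\top}\mathbf{x}$; the data-generating teacher and all numerical values are illustrative assumptions.

```python
import numpy as np

# Hypothetical learning machine: a linear model F(x, w) = w . x
def F(X, w):
    """Actual response of the learning machine for inputs X and weights w."""
    return X @ w

def empirical_risk(w, X, d):
    """R_E(w) = (1/N) * sum_i (d_i - F(x_i, w))^2 over the training set."""
    return np.mean((d - F(X, w)) ** 2)

# Training set T = {(x_i, d_i)}, i = 1..N, drawn i.i.d.; the teacher supplies
# d = f(x) + v with additive noise v (illustrative choice).
rng = np.random.default_rng(0)
N, m = 100, 3
X = rng.normal(size=(N, m))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=N)

# w_E: the minimizer of R_E(w); for squared loss and a linear model this is
# the ordinary least-squares solution.
w_E, *_ = np.linalg.lstsq(X, d, rcond=None)
print("R_E(w_E) =", empirical_risk(w_E, X, d))
```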
Principle of Empirical Risk
Minimization
We must now consider under which conditions $F(\mathbf{x}, \mathbf{w}_E)$ is close to $F(\mathbf{x}, \mathbf{w}_0)$, as measured by the mismatch between $R_E(\mathbf{w})$ and $R(\mathbf{w})$.
Principle of Empirical Risk
Minimization

The principle of empirical risk minimization (PERM) proceeds as follows:
1. In place of $R(\mathbf{w})$, construct
$$R_E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, F(\mathbf{x}_i, \mathbf{w}))$$
on the basis of the training set of i.i.d. samples $(\mathbf{x}_i, d_i)$, $i = 1, \ldots, N$.
Principle of Empirical Risk
Minimization
2. $R(\mathbf{w}_E)$ converges in probability to the minimum possible value of $R(\mathbf{w})$ as $N \to \infty$, provided that $R_E(\mathbf{w})$ converges uniformly to $R(\mathbf{w})$.
3. Uniform convergence in the sense
$$P\!\left(\sup_{\mathbf{w} \in \hat{W}} |R(\mathbf{w}) - R_E(\mathbf{w})| > \varepsilon\right) \to 0 \quad \text{as } N \to \infty$$
is necessary and sufficient for consistency of the PERM.
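The following Monte Carlo sketch (an illustration added here, not from the lecture) makes item 3 concrete for a toy family of one-dimensional threshold classifiers whose true risk is known in closed form; the uniform input distribution and the teacher threshold 0.3 are assumptions of the example.

```python
import numpy as np

# Toy family: threshold classifiers F(x, w) = 1[x > w], with x ~ Uniform(0, 1)
# and desired response d = 1[x > 0.3]. Under the 0/1 loss the true risk is
# R(w) = |w - 0.3| (the probability that x falls between the two thresholds),
# so sup_w |R(w) - R_E(w)| can be evaluated over a fine grid of w.
rng = np.random.default_rng(1)
w_grid = np.linspace(0.0, 1.0, 1001)

def sup_deviation(N):
    x = rng.uniform(0.0, 1.0, size=N)
    d = x > 0.3
    R_E = np.array([np.mean((x > w) != d) for w in w_grid])  # empirical risks
    R = np.abs(w_grid - 0.3)                                  # true risks
    return np.max(np.abs(R - R_E))

for N in (100, 1_000, 10_000):
    print(N, round(sup_deviation(N), 4))
# The supremum of |R(w) - R_E(w)| over the whole family shrinks as N grows,
# which is the uniform convergence required for consistency of the PERM.
```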
The Vapnik Chervonenkis
Dimension

The theory of uniform convergence of $R_E(\mathbf{w})$ to $R(\mathbf{w})$ includes rates of convergence based on a parameter called the VC dimension.
It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.
The Vapnik Chervonenkis
Dimension
To describe the concept of VC dimension, let us consider a binary pattern classification problem for which the desired response is $d \in \{0, 1\}$.
A dichotomy is a binary-valued classification function. Let $\hat{F}$ denote the set of dichotomies implemented by a learning machine:
$$\hat{F} = \{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \hat{W},\; F : \mathbb{R}^m \times \hat{W} \to \{0, 1\}\}$$
The Vapnik Chervonenkis
Dimension
Let $\hat{L}$ denote a set of $N$ points in the $m$-dimensional space $\hat{X}$ of input vectors:
$$\hat{L} = \{\mathbf{x}_i \in \hat{X};\; i = 1, \ldots, N\}$$
A dichotomy partitions $\hat{L}$ into two disjoint subsets $\hat{L}_0$ and $\hat{L}_1$ such that
$$F(\mathbf{x}, \mathbf{w}) = \begin{cases} 0 & \text{for } \mathbf{x} \in \hat{L}_0 \\ 1 & \text{for } \mathbf{x} \in \hat{L}_1 \end{cases}$$
The Vapnik Chervonenkis
Dimension
Let $\Delta_{\hat{F}}(\hat{L})$ denote the number of distinct dichotomies of $\hat{L}$ implemented by the learning machine, and let $\Delta_{\hat{F}}(l)$ denote the maximum of $\Delta_{\hat{F}}(\hat{L})$ over all $\hat{L}$ with $|\hat{L}| = l$ (the growth function).
$\hat{L}$ is said to be shattered by $\hat{F}$ if $\Delta_{\hat{F}}(\hat{L}) = 2^{|\hat{L}|}$, that is, if all possible dichotomies of $\hat{L}$ can be induced by functions in $\hat{F}$.
The Vapnik Chervonenkis
Dimension
In the figure we illustrate a two-dimensional space containing four points $\mathbf{x}_1, \ldots, \mathbf{x}_4$. The decision boundaries $F_0$ and $F_1$ correspond to the classes 0 and 1 being true. $F_0$ induces the dichotomy
The Vapnik Chervonenkis
Dimension
$$\hat{D}_0 = \{\hat{L}_0 = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_4],\; \hat{L}_1 = [\mathbf{x}_3]\}$$
while $F_1$ induces
$$\hat{D}_1 = \{\hat{L}_0 = [\mathbf{x}_1, \mathbf{x}_2],\; \hat{L}_1 = [\mathbf{x}_3, \mathbf{x}_4]\}$$
With the set $\hat{L}$ consisting of four points, the cardinality is $|\hat{L}| = 4$; hence there are $2^4 = 16$ possible dichotomies of $\hat{L}$.
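To make the notion of shattering concrete, here is a small Python sketch (not from the lecture) that checks whether a point set is shattered by the family of two-dimensional affine classifiers; the perceptron-based separability test and the iteration cap are illustrative choices.

```python
import numpy as np
from itertools import product

def linearly_separable(X, labels, max_epochs=10_000):
    """True if some affine classifier 1[w.x + b > 0] realizes the labeling.
    A plain perceptron loop suffices for the tiny sets used here."""
    y = 2 * np.asarray(labels) - 1              # map {0,1} -> {-1,+1}
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False                                 # treated as not separable

def shattered(X):
    """True if every dichotomy of X is induced by some affine classifier."""
    return all(linearly_separable(X, labels)
               for labels in product([0, 1], repeat=len(X)))

# Three points in general position are shattered (all 2^3 = 8 dichotomies are
# realizable), while the four corners of a square are not (the XOR labelings
# fail), consistent with the VC dimension of affine classifiers in the plane.
print(shattered(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))              # True
print(shattered(np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])))  # False
```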
The Vapnik Chervonenkis
Dimension
We now formally define the VC dimension as follows:
"The VC dimension of an ensemble of dichotomies $\hat{F}$ is the cardinality of the largest set $\hat{L}$ that is shattered by $\hat{F}$."
The Vapnik Chervonenkis
Dimension
In more familiar terms, the VC dimension of the set of classification functions
$$\{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \hat{W}\}$$
is the maximum number of training examples that can be learned by the machine without error for all possible binary labelings of those examples.
Importance of the VC Dimension
Roughly speaking, the number of examples needed to learn a class of interest reliably is proportional to the VC dimension.
In some cases the VC dimension is determined by the number of free parameters of a neural network. In this regard, the following two results are of interest.
Importance of the VC Dimension
1. Let $\hat{N}$ denote an arbitrary feedforward network built up from neurons with a threshold activation function
$$\varphi(v) = \begin{cases} 1 & \text{for } v \ge 0 \\ 0 & \text{for } v < 0 \end{cases}$$
The VC dimension of $\hat{N}$ is $O(W \log W)$, where $W$ is the total number of free parameters in the network.
Importance of the VC Dimension
2. Let $\hat{N}$ denote a multilayer feedforward network whose neurons use a sigmoid activation function
$$\varphi(v) = \frac{1}{1 + e^{-v}}$$
The VC dimension of $\hat{N}$ is $O(W^2)$, where $W$ is the total number of free parameters in the network.
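For orientation, here is a short sketch (added here; the layer sizes are arbitrary examples) of counting the free parameters $W$ of a fully connected feedforward network, the quantity entering the $O(W \log W)$ and $O(W^2)$ results above.

```python
import math

def count_free_parameters(layer_sizes):
    """Weights plus biases of a fully connected feedforward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

W = count_free_parameters([3, 10, 10, 1])   # 3 inputs, two hidden layers, 1 output
print(W)                                     # 161 free parameters
print(W * math.log(W))                       # scale suggested by O(W log W)
print(W ** 2)                                # scale suggested by O(W^2)
```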
Importance of the VC Dimension
In the case of binary pattern classification the loss function has only two possible values:
$$L(d, F(\mathbf{x}, \mathbf{w})) = \begin{cases} 0 & \text{if } F(\mathbf{x}, \mathbf{w}) = d \\ 1 & \text{otherwise} \end{cases}$$
The risk functional $R(\mathbf{w})$ and the empirical risk functional $R_{\mathrm{emp}}(\mathbf{w})$ then assume the following interpretations:
Importance of the VC Dimension

$R(\mathbf{w})$ is the probability of classification error, denoted by $P(\mathbf{w})$.
$R_{\mathrm{emp}}(\mathbf{w})$ is the training error, denoted by $v(\mathbf{w})$.
Then (Haykin, p. 98):
$$P\!\left(\sup_{\mathbf{w} \in \hat{W}} |P(\mathbf{w}) - v(\mathbf{w})| > \varepsilon\right) \to 0 \quad \text{as } N \to \infty$$
Importance of the VC Dimension
The notion of VC dimension provides a bound on the rate of uniform convergence. For a set of classification functions with VC dimension $h$ the following inequality holds:
$$P\!\left(\sup_{\mathbf{w} \in \hat{W}} |P(\mathbf{w}) - v(\mathbf{w})| > \varepsilon\right) \le \left(\frac{2eN}{h}\right)^{h} \exp(-\varepsilon^2 N) \qquad \text{(vc.1)}$$
where $N$ is the size of the training sample. In other words, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
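As a numerical illustration (added here, with illustrative values of $h$ and $\varepsilon$), the right-hand side of (vc.1) can be evaluated in log space to see how the exponential factor eventually dominates the polynomial one:

```python
import math

def vc_bound(N, h, eps):
    """Right-hand side of (vc.1): (2eN/h)^h * exp(-eps^2 * N)."""
    log_bound = h * math.log(2 * math.e * N / h) - eps ** 2 * N
    return math.exp(log_bound)   # underflows to 0.0 once the bound is tiny

h, eps = 20, 0.1
for N in (10_000, 50_000, 100_000):
    print(N, vc_bound(N, h, eps))
# The bound is vacuous (> 1) for small N but decays rapidly once N is large
# relative to h / eps^2, reflecting that a finite VC dimension guarantees
# uniform convergence as N grows.
```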
Importance of the VC dimension
The factor $(2eN/h)^h$ in (vc.1) represents a bound on the growth function $\Delta_{\hat{F}}(l)$ of the family of functions $\hat{F} = \{F(\mathbf{x}, \mathbf{w});\; \mathbf{w} \in \hat{W}\}$ for $l \ge h \ge 1$. Provided that this factor does not grow too fast, the right-hand side of (vc.1) goes to zero as $N$ goes to infinity.
This requirement is satisfied if the VC dimension $h$ is finite.
Importance of the VC Dimension
Thus, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
Let $\alpha$ denote the probability of occurrence of the event
$$\sup_{\mathbf{w} \in \hat{W}} |P(\mathbf{w}) - v(\mathbf{w})| > \varepsilon$$
Using the bound (vc.1) we find
$$\alpha \le \left(\frac{2eN}{h}\right)^{h} \exp(-\varepsilon^2 N) \qquad \text{(vc.2)}$$
Importance of the VC Dimension
Let $\varepsilon_0(N, h, \alpha)$ denote the special value of $\varepsilon$ that satisfies (vc.2) with equality. Then we obtain (Haykin, p. 99):
$$\varepsilon_0(N, h, \alpha) = \sqrt{\frac{h}{N}\left[\log\!\left(\frac{2N}{h}\right) + 1\right] - \frac{1}{N}\log\alpha}$$
We refer to $\varepsilon_0$ as the confidence interval.
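A quick way to get a feel for the confidence interval is to evaluate the expression reconstructed above for a few sample sizes (a sketch with illustrative values of $h$ and $\alpha$):

```python
import math

def epsilon_0(N, h, alpha):
    """Confidence interval: sqrt((h/N) * (log(2N/h) + 1) - (1/N) * log(alpha))."""
    return math.sqrt((h / N) * (math.log(2 * N / h) + 1)
                     - (1 / N) * math.log(alpha))

h, alpha = 50, 0.05
for N in (500, 5_000, 50_000):
    print(N, round(epsilon_0(N, h, alpha), 3))
# The confidence interval shrinks as N grows and widens as the capacity h grows.
```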
Importance of the VC Dimension
We may also write
$$P(\mathbf{w}) \le v(\mathbf{w}) + \varepsilon_1(N, h, \alpha, v)$$
where
$$\varepsilon_1(N, h, \alpha, v) = 2\,\varepsilon_0^2(N, h, \alpha)\left(1 + \sqrt{1 + \frac{v(\mathbf{w})}{\varepsilon_0^2(N, h, \alpha)}}\right)$$
Importance of the VC Dimension
Conclusions:
1. $P(\mathbf{w}) \le v(\mathbf{w}) + \varepsilon_1(N, h, \alpha, v)$.
2. For a small training error (close to zero): $P(\mathbf{w}) - v(\mathbf{w}) \le 4\,\varepsilon_0^2(N, h, \alpha)$.
3. For a large training error (close to unity): the gap $P(\mathbf{w}) - v(\mathbf{w})$ is of the order of $\varepsilon_0(N, h, \alpha)$.
Structural Risk Minimization
The training error is the frequency of errors made during the training session by a machine with weight vector $\mathbf{w}$.
The generalization error is the frequency of errors made by the same machine when it is tested with examples not seen before.
Let these two errors be denoted by $v_{\mathrm{train}}(\mathbf{w})$ and $v_{\mathrm{gene}}(\mathbf{w})$, respectively.
Structural Risk Minimization
Let $h$ be the VC dimension of a family of classification functions $\{F(\mathbf{x}, \mathbf{w});\; \mathbf{w} \in \hat{W}\}$ with respect to the input space $\hat{X}$.
The generalization error $v_{\mathrm{gene}}(\mathbf{w})$ is bounded above by the guaranteed risk, defined as the sum of two competing terms
$$v_{\mathrm{guarant}}(\mathbf{w}) = v_{\mathrm{train}}(\mathbf{w}) + \varepsilon_1(N, h, \alpha, v_{\mathrm{train}})$$
where the confidence interval $\varepsilon_1(N, h, \alpha, v_{\mathrm{train}})$ is defined as before:
$$\varepsilon_1(N, h, \alpha, v_{\mathrm{train}}) = 2\,\varepsilon_0^2(N, h, \alpha)\left(1 + \sqrt{1 + \frac{v_{\mathrm{train}}(\mathbf{w})}{\varepsilon_0^2(N, h, \alpha)}}\right)$$
For a fixed number of training samples $N$, the training error decreases monotonically as the capacity (VC dimension) $h$ is increased, whereas the confidence interval increases monotonically.
Structural Risk Minimization
The challenge in solving a supervised learning
problem lies in realizing the best
generalization performance by matching the
machine capacity to the available amount of
training data for the problem at hand. The
method of structural risk minimization
provides an inductive procedure to achieve
this goal by making the VC dimension of
the learning machine a control variable.
Structural Risk Minimization
Consider an ensemble of pattern classifiers
$$\{F(\mathbf{x}, \mathbf{w}) : \mathbf{w} \in \hat{W}\}$$
and define a nested structure of $n$ such machines
$$\hat{F}_k = \{F(\mathbf{x}, \mathbf{w});\; \mathbf{w} \in \hat{W}_k\}, \qquad k = 1, \ldots, n$$
such that
$$\hat{F}_1 \subset \hat{F}_2 \subset \ldots \subset \hat{F}_n$$
Correspondingly, the VC dimensions of the individual pattern classifiers satisfy $h_1 \le h_2 \le \ldots \le h_n$, which implies that the VC dimension of each classifier is finite (see the figure below).
Figure: Illustration of the relationship between training error, confidence interval, and guaranteed risk.
Structural Risk Minimization
Structural risk minimization then proceeds as follows:
a) The empirical risk (training error) of each classifier is minimized.
b) The pattern classifier $\hat{F}^*$ with the smallest guaranteed risk is identified; this particular machine provides the best compromise between the training error (quality of the approximation) and the confidence interval (complexity of the approximating function).
Structural Risk Minimization
Our goal is to find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.
We can achieve this, for example, by varying $h$ through the number of hidden neurons: we evaluate an ensemble of fully connected multilayer feedforward networks in which the number of neurons in one of the hidden layers is increased in a monotonic fashion.
Structural Risk Minimization
The principle of SRM states that the best network in this ensemble is the one for which the guaranteed risk is minimum, as illustrated by the sketch below.
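To close, here is a minimal sketch of the SRM selection rule itself: for a nested family with VC dimensions $h_1 \le \ldots \le h_n$, pick the machine minimizing the guaranteed risk $v_{\mathrm{train}} + \varepsilon_1(N, h, \alpha, v_{\mathrm{train}})$. The $(h_k, v_{\mathrm{train},k})$ pairs below are illustrative placeholders rather than measured values, and the formulas are the ones reconstructed above.

```python
import math

def epsilon_0(N, h, alpha):
    return math.sqrt((h / N) * (math.log(2 * N / h) + 1)
                     - (1 / N) * math.log(alpha))

def epsilon_1(N, h, alpha, v_train):
    e0_sq = epsilon_0(N, h, alpha) ** 2
    return 2 * e0_sq * (1 + math.sqrt(1 + v_train / e0_sq))

def guaranteed_risk(N, h, alpha, v_train):
    """v_guarant = v_train + eps_1(N, h, alpha, v_train)."""
    return v_train + epsilon_1(N, h, alpha, v_train)

N, alpha = 20_000, 0.05
# Nested structure: capacity grows, training error falls (illustrative numbers).
family = [(10, 0.30), (25, 0.15), (50, 0.08), (100, 0.06), (200, 0.055)]

for h, v in family:
    print(h, v, round(guaranteed_risk(N, h, alpha, v), 3))

best_h, best_v = min(family, key=lambda hv: guaranteed_risk(N, hv[0], alpha, hv[1]))
print("selected machine: h =", best_h)
# With these numbers the intermediate machine (h = 50) attains the smallest
# guaranteed risk: it balances the training error against the confidence interval.
```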