Transcript Slide 1

580.691 Learning Theory
Reza Shadmehr
LMS with Newton-Raphson, weighted least squares, choice of loss function
Review of regression
$f : \mathbb{R}^d \to \mathbb{R}$
Multivariate regression:

$$f(\mathbf{x};\mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_d x_d, \qquad y = \mathbf{w}^T \mathbf{x}$$

$$\mathrm{loss}\left(y^{(n)}\right) = \left(y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)}\right)^2$$

$$E[\mathrm{loss}] = \frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T \mathbf{x}^{(n)}\right)^2 = \frac{1}{N}\left(\mathbf{y} - X\mathbf{w}\right)^T\left(\mathbf{y} - X\mathbf{w}\right)$$

$$\mathbf{w} = \left(X^T X\right)^{-1} X^T \mathbf{y}$$

LMS (online rule):

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\left(\mathbf{x}^{(n)T}\mathbf{x}^{(n)}\right)^{-1}\left(y^{(n)} - \mathbf{w}^{(n)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$

Batch algorithm (steepest descent):

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\,\frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^{(t)T} \mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$
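As a quick check of the closed-form solution and the batch steepest-descent update above, the following sketch (NumPy, with synthetic data that is not from the course) fits the same linear model both ways; both reach the same minimum of the quadratic cost.

```python
# A minimal sketch comparing the closed-form least-squares solution with
# the batch steepest-descent update, on made-up data.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # column of 1s for w0
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed form: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch steepest descent: w <- w + eta * (1/N) * sum (y - w^T x) x
w = np.zeros(d + 1)
eta = 0.1
for _ in range(5000):
    err = y - X @ w                 # residuals y^(n) - w^T x^(n)
    w = w + eta * (X.T @ err) / N   # gradient step

print(np.allclose(w, w_closed, atol=1e-3))   # both reach the same minimum
```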
Finding the minimum of a function in a single step
[Figure: $J(w)$ plotted against $w$, with the minimum at $w^*$.]

Taylor series expansion of $J$ about the current estimate $w$:

$$J(w^*) = J(w) + \frac{J'(w)}{1!}\left(w^* - w\right) + \frac{J''(w)}{2!}\left(w^* - w\right)^2 + \frac{J'''(w)}{3!}\left(w^* - w\right)^3 + \cdots$$

(If $J$ is quadratic the series stops after the second-order term; otherwise more terms follow.)

Setting the derivative at $w^*$ to zero, keeping only the first two terms:

$$J'(w^*) \approx J'(w) + J''(w)\left(w^* - w\right) = 0$$

$$w^* = w - \frac{J'(w)}{J''(w)}$$
Newton-Raphson method
$$\mathbf{w}^* = \mathbf{w} + \Delta\mathbf{w}$$

$$J(\mathbf{w}^*) = J(\mathbf{w}) + \left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}}^{T}\Delta\mathbf{w} + \frac{1}{2}\,\Delta\mathbf{w}^T \left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}}\Delta\mathbf{w}$$

$$\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^*} = \left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}} + \left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}}\Delta\mathbf{w} = 0$$

$$\Delta\mathbf{w} = -\left(\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}}$$

$$\mathbf{w}^* = \mathbf{w} - \left(\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}}$$

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \left(\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}^{(n)}}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^{(n)}}$$
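A minimal sketch of the scalar Newton-Raphson update $w \leftarrow w - J'(w)/J''(w)$. The quartic cost here is a made-up example, not from the slides; it shows that the iteration converges quickly, and in a single step when $J$ is quadratic.

```python
# Newton-Raphson on a scalar function: w <- w - J'(w) / J''(w)
def newton_raphson(J_prime, J_double_prime, w0, n_steps=20):
    w = w0
    for _ in range(n_steps):
        w = w - J_prime(w) / J_double_prime(w)
    return w

# Example: J(w) = (w - 2)^2 + 0.1*(w - 2)^4 has its minimum at w = 2.
Jp  = lambda w: 2 * (w - 2) + 0.4 * (w - 2) ** 3
Jpp = lambda w: 2 + 1.2 * (w - 2) ** 2
print(newton_raphson(Jp, Jpp, w0=5.0))   # ~2.0; one step would suffice if J were quadratic
```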
The gradient of the loss function

$$J(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2$$

Newton-Raphson:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \left(\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}^{(t)}}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^{(t)}}$$

$$\frac{dJ}{dw_i} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)x_i^{(n)}$$

$$\frac{dJ}{d\mathbf{w}} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$

$$\frac{d^2J}{d\mathbf{w}^2} = \frac{2}{N}\sum_{n=1}^{N}\frac{d}{d\mathbf{w}}\left(\mathbf{w}^T\mathbf{x}^{(n)}\,\mathbf{x}^{(n)}\right)$$
The gradient of the loss function
$$\mathbf{w}^T\mathbf{x}\;\mathbf{x} = \left(w_1x_1 + w_2x_2 + \cdots + w_mx_m\right)\begin{pmatrix}x_1\\x_2\\\vdots\\x_m\end{pmatrix} = \begin{pmatrix}x_1\left(w_1x_1 + \cdots + w_mx_m\right)\\x_2\left(w_1x_1 + \cdots + w_mx_m\right)\\\vdots\\x_m\left(w_1x_1 + \cdots + w_mx_m\right)\end{pmatrix}$$

$$\frac{d}{d\mathbf{w}}\left(\mathbf{w}^T\mathbf{x}\;\mathbf{x}\right) = \begin{pmatrix}x_1^2 & x_1x_2 & \cdots & x_1x_m\\x_2x_1 & x_2^2 & \cdots & x_2x_m\\\vdots & & \ddots & \vdots\\x_mx_1 & x_mx_2 & \cdots & x_m^2\end{pmatrix} = \mathbf{x}\mathbf{x}^T$$

$$\frac{d^2J}{d\mathbf{w}^2} = \frac{2}{N}\sum_{n=1}^{N}\frac{d}{d\mathbf{w}}\left(\mathbf{w}^T\mathbf{x}^{(n)}\,\mathbf{x}^{(n)}\right) = \frac{2}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$$
LMS algorithm with Newton-Raphson

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \left(\left.\frac{d^2J}{d\mathbf{w}^2}\right|_{\mathbf{w}^{(t)}}\right)^{-1}\left.\frac{dJ}{d\mathbf{w}}\right|_{\mathbf{w}^{(t)}}$$

$$\frac{dJ}{d\mathbf{w}} = -\frac{2}{N}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}, \qquad \frac{d^2J}{d\mathbf{w}^2} = \frac{2}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$$

Substituting the gradient and Hessian turns the steepest-descent algorithm into

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\left(\sum_{n=1}^{N}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}\right)^{-1}\sum_{n=1}^{N}\left(y^{(n)} - \mathbf{w}^{(t)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}, \qquad 0 < \eta \le 1$$

Note: a single $\mathbf{x}^{(n)}\mathbf{x}^{(n)T}$ is a singular matrix, so the sample-by-sample (LMS) version replaces the matrix inverse with the scalar $\left(\mathbf{x}^{(n)T}\mathbf{x}^{(n)}\right)^{-1}$:

$$\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} + \eta\left(\mathbf{x}^{(n)T}\mathbf{x}^{(n)}\right)^{-1}\left(y^{(n)} - \mathbf{w}^{(n)T}\mathbf{x}^{(n)}\right)\mathbf{x}^{(n)}$$
Weighted Least Squares
• Suppose some data points are more important than others. We want to weight the errors on those data points more heavily.

$$J(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}p^{(n)}\left(y^{(n)} - \mathbf{w}^T\mathbf{x}^{(n)}\right)^2 = \frac{1}{N}\left(\mathbf{y} - X\mathbf{w}\right)^T P\left(\mathbf{y} - X\mathbf{w}\right)$$

$$P = \mathrm{diag}\left(p^{(1)}, p^{(2)}, \ldots, p^{(N)}\right)$$

$$\frac{dJ}{d\mathbf{w}} = \frac{1}{N}\left(-X\right)^T P\left(\mathbf{y} - X\mathbf{w}\right) + \frac{1}{N}\left[\left(\mathbf{y} - X\mathbf{w}\right)^T P\left(-X\right)\right]^T = 0$$

$$\frac{dJ}{d\mathbf{w}} = -X^T P\mathbf{y} + X^T P X\mathbf{w} - X^T P^T\mathbf{y} + X^T P^T X\mathbf{w} = 0$$

$$\frac{dJ}{d\mathbf{w}} = 2\left(-X^T P\mathbf{y} + X^T P X\mathbf{w}\right) = 0 \qquad \left(\text{note: } P = P^T\right)$$
How to handle artifacts in FMRI data
Diedrichsen and Shadmehr, NeuroImage (2005)
In fMRI, we typically measure the signal intensity from N voxels at acquisition time t=1…T. Each of these T
measurements constitutes an image. We assume that the time series of voxel n is an arbitrary linear function of the
design matrix X plus a noise term:
$$\mathbf{y}_n = X\boldsymbol{\beta}_n + \boldsymbol{\varepsilon}_n$$

where $\mathbf{y}_n$ is a $T \times 1$ column vector, $X$ is a $T \times p$ design matrix, and $\boldsymbol{\beta}_n$ is a $p \times 1$ vector.
If one source of noise is due to random discrete events, for example, artifacts arising from the participant
moving their jaw, then only some images will be influenced, violating the assumption of a stationary noise
process. To relax this assumption, a simple approach is to allow the variance of noise in each image to be
scaled by a separate parameter. Under the temporal independence assumption, the variance-covariance matrix
of the noise process might be:

$$\mathrm{var}\left(\boldsymbol{\varepsilon}_n\right) = \begin{pmatrix} s_1 & 0 & \cdots & 0\\ 0 & s_2 & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & s_T \end{pmatrix}\sigma_n^2 = V\sigma_n^2$$

where $s_i$ is a variance scaling parameter for the $i$-th time that the voxel was imaged.
Discrete events (e.g., swallowing) will impact only those images that were acquired during the event. What should be done with these images, once they are identified? A typical approach would be to discard images based on some fixed threshold. If we knew $\mathrm{var}\left(\boldsymbol{\varepsilon}_n\right)$, the optimal approach would be to weight the images by the inverse of their variance:

$$\boldsymbol{\beta}_n^* = \left(X^T V^{-1} X\right)^{-1} X^T V^{-1}\mathbf{y}_n$$
But how do we get V? We can use the residuals from our model:

$$\mathbf{r}_n = \mathbf{y}_n - X\hat{\boldsymbol{\beta}}_n = \left(I - X\left(X^T X\right)^{-1}X^T\right)\mathbf{y}_n = R\mathbf{y}_n$$

$$\hat{\sigma}_n^2 = \mathbf{r}_n^T\mathbf{r}_n \,/\, \left(T - \mathrm{rank}\left(X\right)\right)$$

$$\hat{\mathbf{s}} = \frac{1}{N}\sum_{n=1}^{N}\mathrm{diag}\left(\mathbf{r}_n\mathbf{r}_n^T\right)/\hat{\sigma}_n^2$$
This is a good start, but has some issues regarding bias of our estimator of variance. To improve things,
see Diedrichsen and Shadmehr (2005).
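Below is a minimal sketch of the inverse-variance weighting idea on synthetic data: per-image scalings $\hat{s}_i$ are estimated from ordinary least-squares residuals pooled over voxels, roughly following the formulas above, and each voxel is then refit with weights $V^{-1}$. This is an illustration under simplified assumptions, not the bias-corrected estimator of Diedrichsen and Shadmehr (2005).

```python
# Sketch only: estimate per-image variance scalings from OLS residuals,
# then refit with inverse-variance weights (synthetic data, simplified).
import numpy as np

rng = np.random.default_rng(2)
T, p, Nvox = 120, 3, 200
X = np.hstack([np.ones((T, 1)), rng.normal(size=(T, p - 1))])   # T x p design matrix
beta = rng.normal(size=(p, Nvox))
s = np.ones(T); s[40:44] = 25.0        # a few "artifact" images with inflated variance
Y = X @ beta + rng.normal(size=(T, Nvox)) * np.sqrt(s)[:, None]

# OLS residuals: r_n = (I - X (X^T X)^{-1} X^T) y_n
R = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
res = R @ Y
sigma2 = np.sum(res ** 2, axis=0) / (T - np.linalg.matrix_rank(X))   # per-voxel variance
s_hat = np.mean(res ** 2 / sigma2[None, :], axis=1)                  # per-image scaling

# Weighted fit: beta* = (X^T V^{-1} X)^{-1} X^T V^{-1} y_n, with V = diag(s_hat)
W = 1.0 / s_hat
beta_w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W[:, None] * Y))
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.mean((beta_ols - beta) ** 2), np.mean((beta_w - beta) ** 2))  # weighting usually helps
```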
Weighted Least Squares
J (w) 
1
 y  Xw T P  y  Xw 
N


dJ
 2  X T Py  X T PXw  0
dw
“Normal equations”
for weighted least
squares
Weighted LMS
X T PXw  X T Py

T
w
( n)
w  X PX
w
( n 1)

1
 p
X T Py
( n)
x
( n)T ( n)
x
  y ( n )  w ( n )T x ( n )  x( n )
1
Regression with basis functions
• In general, predictions can be based on a linear combination of a set of
basis functions:
basis set: $\left\{g_1(\mathbf{x}),\, g_2(\mathbf{x}),\, \ldots,\, g_m(\mathbf{x})\right\}, \qquad g_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$

$$f(\mathbf{x};\mathbf{w}) = w_0 + w_1 g_1(\mathbf{x}) + \cdots + w_m g_m(\mathbf{x})$$

Examples:

Linear basis set: $g_i(\mathbf{x}) = x_i$

Gaussian basis set:

$$g_i(\mathbf{x}) = \exp\left(-\frac{1}{2\sigma^2}\left(\mathbf{x} - \mathbf{p}_i\right)^T\left(\mathbf{x} - \mathbf{p}_i\right)\right)$$

Each basis is a local expert. This measures how close the features of the input are to those preferred by expert $i$.

Radial basis set (RBF):

$$g_i(\mathbf{x}) = \exp\left(-\frac{1}{2\sigma^2}\left(\mathbf{x} - \mathbf{p}_i\right)^T M^T M\left(\mathbf{x} - \mathbf{p}_i\right)\right)$$

[Figure: a network diagram. The input space feeds a collection of experts (basis functions), whose weighted sum, together with the bias input 1 weighted by $w_0$, gives the output $f(\mathbf{x};\mathbf{w}) = w_0 + w_1 g_1(\mathbf{x}) + \cdots + w_n g_n(\mathbf{x})$.]
Regression with basis functions

$$\mathbf{y} = \begin{pmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{pmatrix}, \qquad X = \begin{pmatrix}1 & g_1\!\left(\mathbf{x}^{(1)}\right) & g_2\!\left(\mathbf{x}^{(1)}\right) & \cdots & g_m\!\left(\mathbf{x}^{(1)}\right)\\ 1 & g_1\!\left(\mathbf{x}^{(2)}\right) & g_2\!\left(\mathbf{x}^{(2)}\right) & \cdots & g_m\!\left(\mathbf{x}^{(2)}\right)\\ \vdots & & & & \vdots\\ 1 & g_1\!\left(\mathbf{x}^{(n)}\right) & g_2\!\left(\mathbf{x}^{(n)}\right) & \cdots & g_m\!\left(\mathbf{x}^{(n)}\right)\end{pmatrix}, \qquad \mathbf{w} = \begin{pmatrix}w_0\\w_1\\w_2\\\vdots\\w_m\end{pmatrix}$$

$$\hat{\mathbf{y}} = X\mathbf{w}$$

$$J_n(\mathbf{w}) = \frac{1}{n}\left(\mathbf{y} - \hat{\mathbf{y}}\right)^T\left(\mathbf{y} - \hat{\mathbf{y}}\right) = \frac{1}{n}\left(\mathbf{y} - X\mathbf{w}\right)^T\left(\mathbf{y} - X\mathbf{w}\right)$$

$$\mathbf{w} = \left(X^T X\right)^{-1}X^T\mathbf{y}$$
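A sketch of regression with a Gaussian basis set: the design matrix is built from a column of ones plus one column per basis function $g_i$, and the weights come from the same normal equations as before. The centers, width, and the sine-wave data are arbitrary choices for illustration.

```python
# Gaussian-basis regression: build X = [1, g_1(x), ..., g_m(x)], then solve
# the normal equations w = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

centers = np.linspace(-3, 3, 10)            # preferred inputs p_i of the "experts"
sigma = 0.8
G = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))   # g_i(x)
X = np.hstack([np.ones((x.size, 1)), G])    # prepend the bias column for w0

w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w
print(np.mean((y - y_hat) ** 2))            # small residual error on the training data
```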
Choice of loss function
Define the error $\tilde{y} = y - \hat{y}$. Some candidate loss functions:

$$\mathrm{Loss}(\tilde{y}) = \begin{cases}0 & \text{if } \tilde{y} = 0\\ 1 & \text{if } \tilde{y} \ne 0\end{cases} \qquad\text{(delta)}$$

$$\mathrm{Loss}(\tilde{y}) = \left|\tilde{y}\right| \qquad\text{(absolute error)}$$

$$\mathrm{Loss}(\tilde{y}) = \tilde{y}^2 \qquad\text{(squared error)}$$

$$\mathrm{Loss}(\tilde{y}) = k\left(1 - \exp\left(-\frac{\tilde{y}^2}{2\sigma^2}\right)\right) \qquad\text{(saturating)}$$

[Figure: plots of these loss functions against $\tilde{y}$.]

In learning, our aim is to find parameters $\mathbf{w}$ that minimize the expected loss:

$$\mathbf{w} = \arg\min_{\mathbf{w}} E\left[\mathrm{Loss}\left(\tilde{y}\right)\right]$$

$$E\left[\mathrm{Loss}\left(\tilde{y}\right)\right] = \int_{-\infty}^{\infty} p\left(\tilde{y}\,|\,\mathbf{w}\right)\,\mathrm{Loss}\left(\tilde{y}\right)\,d\tilde{y}$$

This is a weighted sum: each loss is weighted by the likelihood of observing that error, i.e., by the probability density of the error given our model parameters.
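A numerical sketch of the expected-loss integral for a few of the loss functions above. The error density here is an assumed unit-variance Gaussian whose mean shifts with $w$; for such a symmetric, unimodal density all of these losses pick the same $w$, which is why the choice of loss only matters when $p(\tilde{y}\,|\,w)$ is asymmetric, as in the experiment that follows.

```python
# Numerically evaluate E[Loss] = integral of p(y~|w) * Loss(y~) dy~ on a grid.
import numpy as np

y = np.linspace(-10, 10, 4001)                      # grid over the error y~
dy = y[1] - y[0]

def expected_loss(w, loss):
    # Assumed error density for illustration: unit-variance Gaussian, mean w.
    p = np.exp(-0.5 * (y - w) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(p * loss(y)) * dy

losses = {
    "absolute": np.abs,
    "squared": np.square,
    "saturating": lambda e: 1 - np.exp(-e ** 2 / 2),
}
for name, loss in losses.items():
    ws = np.linspace(-1, 1, 201)
    best = ws[np.argmin([expected_loss(w, loss) for w in ws])]
    print(name, round(float(best), 2))               # all pick w ~ 0 for a symmetric density
```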
Inferring the choice of loss function from behavior
Kording & Wolpert, PNAS (2004)
A trial lasted 6 seconds. Over this period, a series of ‘peas’ appeared near the target, drawn from
a distribution that depended on the finger position. The object was to “place the finger so that
on average, the peas land as close as possible to the target”.
[Figure: the delta loss function, and two candidate error densities $p\left(\tilde{y}\,|\,w = w_1\right)$ and $p\left(\tilde{y}\,|\,w = w_2\right)$.]

Imagine that the learner cannot arbitrarily change the density of the errors through learning. All the learner can do is shift the density left or right by setting the parameter $w$. If the learner uses this loss function:

$$\mathrm{Loss}(\tilde{y}) = \begin{cases}0 & \text{if } \tilde{y} = 0\\ 1 & \text{if } \tilde{y} \ne 0\end{cases}$$

$$w = \arg\max_{w} \Pr\left(\tilde{y} = 0\,|\,w\right)$$

then the smallest possible expected loss occurs when $p(\tilde{y})$ has its peak at $\tilde{y} = 0$. Therefore, in the above plot the choice $w_2$ is better than $w_1$. In effect, the $w$ that the learner chooses will depend on the exact shape of $p(\tilde{y})$.
Behavior with the delta loss function
Error density (a mixture of two normals, with mixing weight $\alpha$):

$$p\left(\tilde{y}\,|\,w\right) = \left(1 - \alpha\right)p_1\left(\tilde{y}\,|\,w\right) + \alpha\, p_2\left(\tilde{y}\,|\,w\right) = \left(1 - \alpha\right)N\!\left(w - 0.2,\;1^2\right) + \alpha\, N\!\left(w - 0.2 + 0.2/\alpha,\;2^2\right)$$

$$\mathrm{Loss}(\tilde{y}) = \begin{cases}0 & \text{if } \tilde{y} = 0\\ 1 & \text{if } \tilde{y} \ne 0\end{cases} \qquad\Longrightarrow\qquad w = \arg\max_{w}\, p\left(\tilde{y} = 0\,|\,w\right)$$

[Figure: the mixture density for $\alpha = 0.3$ and $\alpha = 0.15$, and the predicted optimal $w$ plotted as a function of $\alpha$.]

Suppose the "outside" system (e.g., the teacher) sets $\alpha$. Given the loss function, we can predict what the best $w$ will be for the learner.
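A sketch of the delta-loss prediction for the mixture density above: evaluate $p(\tilde{y} = 0\,|\,w)$ on a grid of $w$ and take the argmax. The particular $\alpha$ values below are arbitrary examples.

```python
# Delta-loss prediction: the best w maximizes the error density at y~ = 0.
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def p_error_at_zero(w, alpha):
    # p(y~|w) = (1-a) N(w-0.2, 1^2) + a N(w-0.2+0.2/a, 2^2), evaluated at y~ = 0
    return (1 - alpha) * normal_pdf(0.0, w - 0.2, 1.0) \
           + alpha * normal_pdf(0.0, w - 0.2 + 0.2 / alpha, 2.0)

ws = np.linspace(-0.5, 0.5, 2001)
for alpha in (0.3, 0.5, 0.7):
    w_best = ws[np.argmax(p_error_at_zero(ws, alpha))]
    print(alpha, round(float(w_best), 3))   # the predicted w shifts with alpha
```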
Behavior with the squared error loss function

First, a useful identity:

$$\mathrm{var}\left(x\right) = E\left[\left(x - E\left[x\right]\right)^2\right] = E\left[\left(x - \bar{x}\right)^2\right] = E\left[x^2 - 2x\bar{x} + \bar{x}^2\right] = E\left[x^2\right] - 2E\left[x\right]\bar{x} + \bar{x}^2 = E\left[x^2\right] - \bar{x}^2$$

$$E\left[x^2\right] = \mathrm{var}\left(x\right) + \bar{x}^2$$

With the squared-error loss,

$$\mathrm{Loss}\left(\tilde{y}\right) = \tilde{y}^2, \qquad w = \arg\min_{w} E\left[\tilde{y}^2\,|\,w\right]$$

$$E\left[\tilde{y}^2\,|\,w\right] = \int p\left(\tilde{y}\,|\,w\right)\tilde{y}^2\,d\tilde{y} = \mathrm{var}\left(\tilde{y}\,|\,w\right) + E\left[\tilde{y}\,|\,w\right]^2$$

$$p\left(\tilde{y}\,|\,w\right) = \left(1 - \alpha\right)N\!\left(w - 0.2,\;1^2\right) + \alpha\, N\!\left(w - 0.2 + 0.2/\alpha,\;2^2\right)$$

$$E\left[\tilde{y}\,|\,w\right] = \left(1 - \alpha\right)\left(w - 0.2\right) + \alpha\left(w - 0.2 + 0.2/\alpha\right) = w$$

$$\arg\min_{w} E\left[\tilde{y}^2\,|\,w\right] = \arg\min_{w} E\left[\tilde{y}\,|\,w\right]^2 \;\Longrightarrow\; w = 0$$

We have a $p(\tilde{y})$ whose variance is independent of $w$. So to minimize the expected loss, we should pick the $w$ that makes $E[\tilde{y}\,|\,w]^2$ smallest, which happens at the $w$ that sets the mean of $p(\tilde{y})$ equal to zero.
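A numerical check of the squared-error argument, assuming the same mixture density: integrating $p(\tilde{y}\,|\,w)\,\tilde{y}^2$ on a grid shows that the minimum sits at $w \approx 0$ for any $\alpha$, as the derivation predicts.

```python
# E[y~^2 | w] for the mixture density, minimized over w on a grid.
import numpy as np

def expected_sq_error(w, alpha, y=np.linspace(-15, 15, 6001)):
    dy = y[1] - y[0]
    p = (1 - alpha) * np.exp(-0.5 * (y - (w - 0.2)) ** 2) / np.sqrt(2 * np.pi) \
        + alpha * np.exp(-0.5 * ((y - (w - 0.2 + 0.2 / alpha)) / 2) ** 2) / (2 * np.sqrt(2 * np.pi))
    return np.sum(p * y ** 2) * dy

ws = np.linspace(-0.5, 0.5, 201)
for alpha in (0.3, 0.6):
    w_best = ws[np.argmin([expected_sq_error(w, alpha) for w in ws])]
    print(alpha, round(float(w_best), 3))   # ~0.0 for every alpha
```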
Kording & Wolpert, PNAS (2004)

[Figure: the predicted optimal $w$ as a function of $\alpha$ for several loss functions (delta, $|\tilde{y}|$, $\tilde{y}^2$), together with typical subjects' behavior; and the error density for $\alpha = 0.15$, with $\tilde{y}$ in cm.]

• Results: large errors are penalized by less than a squared term. The loss function was estimated at:

$$\mathrm{Loss}\left(\tilde{y}\right) = \left|\tilde{y}\right|^{1.75}$$

• However, note that the largest errors tended to occur very infrequently in this experiment.
Mean and variance of mixtures of normal distributions

$$p\left(x\right) = \alpha_1 N\!\left(\mu_1, \sigma_1^2\right) + \alpha_2 N\!\left(\mu_2, \sigma_2^2\right)$$

$$E\left[x\right] = \alpha_1\int x\,N\!\left(\mu_1, \sigma_1^2\right)dx + \alpha_2\int x\,N\!\left(\mu_2, \sigma_2^2\right)dx = \alpha_1\mu_1 + \alpha_2\mu_2$$

$$\mathrm{var}\left(x\right) = E\left[x^2\right] - E\left[x\right]^2$$

$$E\left[x^2\right] = \alpha_1\int x^2 N\!\left(\mu_1, \sigma_1^2\right)dx + \alpha_2\int x^2 N\!\left(\mu_2, \sigma_2^2\right)dx = \alpha_1\left(\sigma_1^2 + \mu_1^2\right) + \alpha_2\left(\sigma_2^2 + \mu_2^2\right)$$

$$\mathrm{var}\left(x\right) = \alpha_1\left(\sigma_1^2 + \mu_1^2\right) + \alpha_2\left(\sigma_2^2 + \mu_2^2\right) - \left(\alpha_1\mu_1 + \alpha_2\mu_2\right)^2$$