
Overview of Supervised Learning
Outline
• Linear Regression and Nearest-Neighbor Methods
• Statistical Decision Theory
• Local Methods in High Dimensions
• Statistical Models, Supervised Learning and Function Approximation
• Structured Regression Models
• Classes of Restricted Estimators
• Model Selection and the Bias-Variance Tradeoff
Notation
• X: inputs, feature vector, predictors, independent variables. Generally X will be a vector of p values. Qualitative features are coded in X.
  – Sample values of X are generally written in lower case; $x_i$ is the i-th of N sample values.
• Y: output, response, dependent variable.
  – Typically a scalar, but it can be a vector, of real values. Again, $y_i$ is a realized value.
• G: a qualitative response, taking values in a discrete set $\mathcal{G}$; e.g. $\mathcal{G} = \{\text{survived}, \text{died}\}$. We often code G via a binary indicator response vector Y.
Problem
• 200 points generated in $\mathbb{R}^2$ from an unknown distribution; 100 in each of two classes $\mathcal{G} = \{\text{GREEN}, \text{RED}\}$.
• Can we build a rule to predict the color of future points?
Linear regression
• Code Y=1 if G=RED, else Y=0.
• We model Y as a linear function of X (with the constant 1 included in X for the vector form):
$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T \hat\beta$$
• Obtain $\beta$ by least squares, by minimizing the quadratic criterion:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$$
• Given an $N \times p$ model matrix $\mathbf{X}$ and a response vector $\mathbf{y}$,
$$\hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
Linear regression
• Figure 2.1: A classification example in two dimensions. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by linear regression. The line is the decision boundary defined by $x^T \hat\beta = 0.5$. The red shaded region denotes that part of input space classified as RED, while the green region is classified as GREEN.
Possible scenarios
K-Nearest Neighbors
• Figure 2.2: The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by 15-nearest-neighbor classification.
• The predicted class is hence chosen by majority vote amongst the 15 nearest neighbors.
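• For concreteness, a minimal k-NN sketch under the same 0/1 coding (a helper of our own, not the slides' code): average the k nearest responses and take the majority vote by thresholding at 1/2.

import numpy as np

def knn_predict(X_train, y_train, x, k=15):
    """k-nearest-neighbor classification by majority vote.
    With y in {0, 1}, the vote is: mean of the k nearest y's > 1/2."""
    dist = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = np.argsort(dist)[:k]                  # indices of the k closest points
    return int(y_train[nearest].mean() > 0.5)       # majority vote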
K-Nearest Neighbors
• Figure 2.3: The same classification example as in Figure 2.1. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then predicted by 1-nearest-neighbor classification.
Linear regression vs. k-NN
• Figure 2.4: Misclassification curves for the simulation example above. A test sample of size 10,000 was used. The red curves are test error and the green curves are training error for k-NN classification. The results for linear regression are the larger green and red dots at three degrees of freedom. The purple line is the optimal Bayes error rate.
Other Methods
Statistical decision theory
The Regression Function
$$\begin{aligned}
\mathrm{EPE}(f) &= E[Y - f(X)]^2 \\
&= \int (y - f(x))^2 \Pr(dx, dy) \\
&= \int (y - f(x))^2 \Pr(dy \mid x)\, \Pr(dx) \\
&= E_X\, E_{Y|X}\!\left([Y - f(X)]^2 \mid X\right)
\end{aligned}$$
Minimizing EPE pointwise gives
$$f(x) = \operatorname*{arg\,min}_c\; E_{Y|X}\!\left([Y - c]^2 \mid X = x\right),$$
whose minimizer is the conditional expectation (the regression function):
$$f(x) = E(Y \mid X = x)$$
Bayes Classifier
Bayes Classifier
• Figure 2.5: The optimal Bayes decision boundary for the simulation example above.
• Since the generating density is known for each class, this boundary can be calculated exactly.
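• A sketch of the rule when the class densities are known (single isotropic Gaussians with equal priors stand in for the slides' generating densities; all parameters below are illustrative): classify to the class whose density is larger at x, with the boundary where the densities are equal.

import numpy as np

def gauss_density(x, mean, var):
    """Isotropic Gaussian density N(mean, var * I) evaluated at x."""
    d = len(mean)
    diff = x - mean
    return np.exp(-diff @ diff / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2)

def bayes_classify(x, mean_green, mean_red, var=1.0):
    """With equal priors, predict the class whose density is larger at x;
    the Bayes decision boundary is the set where the two densities are equal."""
    p_red = gauss_density(x, mean_red, var)
    p_green = gauss_density(x, mean_green, var)
    return "RED" if p_red > p_green else "GREEN"

print(bayes_classify(np.array([0.2, 0.1]),
                     mean_green=np.array([0.0, 0.0]),
                     mean_red=np.array([1.0, 1.0])))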
Curse of dimensionality

EPE( x0 )  ET [ f ( x0 )  y 0 ]2



 ET [ f ( x0 )  y 0 ]  2 ET ( y 0 )  2 ET ( y 0 ) 2
2
2




 f ( x0 )  2 f ( x) ET y 0  ET y 0  2 ET ( y 0 )  2 ET ( y 0 ) 2
2



2

2

 [ ET y 0  2 ET ( y 0 )  ET ( y 0 ) ]  [ ET ( y 0 )  2 f ( x) ET y 0  f ( x0 ) 2 ]
2


2
2
2

 ET [ y 0  ET ( y 0 )]  [ ET ( y 0 )  f ( x0 )]2
2015/7/8

2

2Overview of Supervised Learning
0
 VarT ( y 0 )  Bias ( y )
22
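• The two terms can be estimated by simulation: draw many training sets T, record the 1-nearest-neighbor prediction $\hat{y}_0$ at a fixed $x_0$ each time, and compare the variance with the squared bias. A minimal sketch (the target f, the sampling density, and the sizes are illustrative choices of ours):

import numpy as np

rng = np.random.default_rng(2)
p, n_train, n_reps = 10, 1000, 500
x0 = np.zeros(p)                                    # fixed test point at the origin

def f(X):
    """Illustrative smooth target function."""
    return np.exp(-8.0 * np.sum(X ** 2, axis=1))

f_x0 = 1.0                                          # f(x0) = exp(0) = 1

preds = np.empty(n_reps)
for r in range(n_reps):
    X = rng.uniform(-1.0, 1.0, (n_train, p))        # fresh training set T
    y = f(X)                                        # noise-free responses
    nearest = np.argmin(np.linalg.norm(X - x0, axis=1))
    preds[r] = y[nearest]                           # 1-NN prediction y_hat_0

bias_sq = (preds.mean() - f_x0) ** 2
var = preds.var()
print(f"Bias^2 = {bias_sq:.4f}  Var = {var:.4f}  EPE(x0) ~ {bias_sq + var:.4f}")

• As the dimension p grows, the nearest neighbor drifts away from $x_0$ and the bias term comes to dominate, which is the curse of dimensionality at work.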
Linear Model
• Linear model:
$$Y = X^T \beta + \varepsilon$$
• Linear regression estimate:
$$\hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
• Prediction at a test point $x_0$ (whose error is the test error):
$$\hat{y}_0 = x_0^T \hat\beta = x_0^T \beta + \sum_{i=1}^{N} \ell_i(x_0)\, \varepsilon_i$$
where $\ell_i(x_0)$ is the i-th component of $\mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} x_0$.
Curse of dimensionality
Statistical Models
Supervised Learning
Two Types of Supervised Learning
Learning Classification Models
Learning Regression Models
Function Approximation
• Figure 2.10: Least squares fitting of a function of two inputs. The parameters of $f_\theta(x)$ are chosen so as to minimize the sum of squared vertical errors.
Function Approximation
• More generally, Maximum Likelihood Estimation provides a natural basis for estimation.
• E.g. for a multinomial (qualitative) response:
$$\Pr(G = k \mid X = x) = p_{k,\theta}(x)$$
with log-likelihood
$$L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$$
Structured Regression Models
Classes of Restricted Estimators
Model Selection & the Bias-Variance Tradeoff
• Test and training error as a function of model complexity.
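• Such curves can be traced for k-NN, where complexity grows as k shrinks (effective degrees of freedom roughly N/k); a sketch with illustrative data of our own, showing training error falling while test error turns back up:

import numpy as np

def knn_error(X_tr, y_tr, X_ev, y_ev, k):
    """Misclassification rate of k-NN majority vote on an evaluation set."""
    errs = 0
    for x, y in zip(X_ev, y_ev):
        nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        errs += int(y_tr[nearest].mean() > 0.5) != y
    return errs / len(y_ev)

rng = np.random.default_rng(4)

def sample(n):
    """Two overlapping Gaussian classes: GREEN = 0, RED = 1."""
    X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(1.5, 1.0, (n, 2))])
    return X, np.r_[np.zeros(n), np.ones(n)].astype(int)

X_tr, y_tr = sample(100)    # training set (200 points, 100 per class)
X_te, y_te = sample(1000)   # large test set
for k in (1, 5, 15, 51, 101):                 # complexity decreases as k grows
    train_err = knn_error(X_tr, y_tr, X_tr, y_tr, k)
    test_err = knn_error(X_tr, y_tr, X_te, y_te, k)
    print(f"k={k:3d}  train={train_err:.3f}  test={test_err:.3f}")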
• Page 27: Ex 2.1, 2.2, 2.4, 2.6