Transcript Slide 1

Last lecture summary
Basic terminology
• tasks
– classification
– regression
• learner, algorithm
– each has one or several parameters influencing its
behavior
• model
– one concrete combination of learner and parameters
– tune the parameters using the training set
– the generalization is assessed using test set
(previously unseen data)
• learning (training)
– supervised
• a target vector t is known, parameters are tuned to
achieve the best match between prediction and the
target vector
– unsupervised
• training data consists of a set of input vectors x without
any corresponding target value
• clustering, visualization
• for most applications, the original input
variables must be preprocessed
– feature selection
– feature extraction
[Figure: feature selection keeps a subset of the original inputs x1 … x784 (e.g. x1, x4, x5, x103, x456), while feature extraction first transforms them into new derived features x*1 … x*784 and then works with some of those (e.g. x*18, x*152, x*309, x*666)]
• feature selection/extraction = dimensionality reduction (a small sketch of both routes follows below)
– generally a good thing
– curse of dimensionality
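A minimal sketch of the two routes, assuming scikit-learn is available; the synthetic data, the F-score criterion and the number of kept features are illustrative choices, not part of the lecture:

```python
# Hedged illustration: feature selection vs. feature extraction
# (assumes scikit-learn and NumPy; data set and parameters are made up).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, t = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Feature selection: keep a subset of the ORIGINAL inputs
# (here the 5 columns with the best ANOVA F-score).
selector = SelectKBest(score_func=f_classif, k=5)
X_sel = selector.fit_transform(X, t)
print("selected original columns:", np.flatnonzero(selector.get_support()))

# Feature extraction: build NEW derived features x* as combinations of all
# original inputs (here the first 5 principal components).
extractor = PCA(n_components=5)
X_ext = extractor.fit_transform(X)
print("extracted feature matrix shape:", X_ext.shape)
```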
• example:
– learner: regression (polynomial, y = w0 + w1x + w2x² + w3x³ + …)
– parameters: weights (coefficients) w, order of the polynomial
• weights
– adjusted so that the sum of squared errors SSE (error function) is as small as possible
$$\mathrm{SSE} = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2$$

where $y(x_n,\mathbf{w})$ is the predicted value and $t_n$ is the known target.
New stuff
Model selection
$$y(x,\mathbf{w}) = \sum_{j=0}^{M} w_j x^j$$

M = 0: $y(x,\mathbf{w}) = w_0$ (constant)
M = 1: $y(x,\mathbf{w}) = w_0 + w_1 x$ (linear)
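A minimal sketch of evaluating this model for different orders M, in plain NumPy (the weight values below are arbitrary examples, not fitted to anything):

```python
# Sketch: the polynomial model y(x, w) = sum_j w_j * x**j
# (NumPy only; the weights here are arbitrary illustrative values).
import numpy as np

def poly_predict(x, w):
    """Evaluate y(x, w) = w[0] + w[1]*x + ... + w[M]*x**M."""
    return sum(w_j * x**j for j, w_j in enumerate(w))

x = np.linspace(0.0, 1.0, 5)
print(poly_predict(x, [0.5]))        # M = 0: constant model y = w0
print(poly_predict(x, [0.5, -1.0]))  # M = 1: y = w0 + w1*x
```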
overfitting
comparing error for data sets of different size –
root mean squared error RMS
$$\mathrm{SSE} = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2$$

$$\mathrm{RMS} = \sqrt{\frac{2\,\mathrm{SSE}}{N}} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2}$$

RMS: root mean squared error

$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2$$

MSE: mean squared error
Summary of errors
$$\mathrm{SSE} = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2 \quad\text{(sum of squared errors)}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2 \quad\text{(mean squared error)}$$

$$\mathrm{RMS} = \sqrt{\frac{2\,\mathrm{SSE}}{N}} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2} = \sqrt{\mathrm{MSE}} \quad\text{(root mean squared error)}$$
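A small NumPy sketch of the three error measures; the targets and predictions below are toy numbers, not the lecture's data:

```python
# Sketch: SSE, MSE and RMS for a set of predictions (NumPy only; toy values).
import numpy as np

t = np.array([0.1, 0.9, 0.3, 0.7])        # known targets t_n
y_pred = np.array([0.2, 0.8, 0.4, 0.5])   # predictions y(x_n, w)

residuals = y_pred - t
SSE = 0.5 * np.sum(residuals**2)          # sum of squared errors (with the 1/2 factor)
MSE = np.mean(residuals**2)               # mean squared error
RMS = np.sqrt(2 * SSE / len(t))           # root mean squared error, equals sqrt(MSE)

print(SSE, MSE, RMS, np.isclose(RMS, np.sqrt(MSE)))
```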
[Figure: RMS error on the training set and on the test set as a function of the polynomial order M]
• the bad result for M = 9 may seem paradoxical
because
– polynomial of given order contains all lower order
polynomials as special cases (M=9 polynomial
should be at least as good as M=3 polynomial)
• OK, let’s examine the values of the coefficients
w* for polynomials of various orders
Coefficients w* of the fitted polynomials:

         M=0     M=1     M=3        M=9
w0*      0.19    0.82    0.31       0.35
w1*             -1.27    7.99       232.37
w2*                     -25.43      -5321.83
w3*                      17.37      48568.31
w4*                                 -231639.30
w5*                                 640042.26
w6*                                 -1061800.52
w7*                                 1042400.18
w8*                                 -557682.99
w9*                                 125201.43
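A sketch reproducing the flavour of this table: fitting polynomials of increasing order to a handful of noisy points makes the weight magnitudes explode. It uses NumPy only; the sine-plus-noise data mimic the lecture's running example, but the exact points (and therefore the printed numbers) are made up.

```python
# Sketch: coefficient magnitudes grow dramatically with the polynomial order
# (NumPy only; synthetic sin(2*pi*x) data with noise, illustrative values).
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

for M in (0, 1, 3, 9):
    # least-squares fit of an order-M polynomial; w[0] = w0, ..., w[M] = wM
    w = np.polynomial.polynomial.polyfit(x, t, deg=M)
    print(f"M={M}: largest |w_j| = {np.max(np.abs(w)):.2f}")
```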
[Figure: the M = 9 polynomial fitted to N = 15 and to N = 100 data points]
For a given model complexity, the overfitting problem becomes less severe as the size of the data set increases. In other words, the larger the data set, the more complex (flexible) a model can be fitted.
Overfitting in
classification
Bias-variance tradeoff
• low flexibility (low degree of polynomial) models have large bias and low variance
– large bias means a large quadratic error of the model
– low variance means that the predictions of the model depend only little on the particular sample that was used for building the model
• i.e. there is little change in the model if the training data set is changed
• thus there is little change between the predictions for a given x across different models
• high flexibility models have low bias and large
variance
– A large degree makes the polynomial very sensitive to the details of the sample.
– Thus the polynomial changes dramatically when the data set is changed.
– However, bias is low, as the quadratic error is low.
• A polynomial with too few parameters (too
low degree) will make large errors because of
a large bias.
• A polynomial with too many parameters (too
high degree) will make large errors because of
a large variance.
• The degree of the "best" polynomial must be somewhere "in between": the bias-variance tradeoff.
$$\mathrm{MSE} = \mathrm{variance} + \mathrm{bias}^2$$
• This phenomenon is not specific to polynomial
regression!
• In fact, it shows up in any kind of model.
• Generally, the bias-variance tradeoff principle can
be stated as:
– Models with too few parameters are inaccurate
because they are not flexible enough (large bias, large
error of the model).
– Models with too many parameters are inaccurate because they overfit the data (large variance, too much sensitivity to the data).
– Identifying the best model requires identifying the proper "model complexity" (number of parameters); a small simulation sketch of the tradeoff follows below.
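A hedged simulation sketch of the tradeoff: it assumes a "true" function sin(2πx) with Gaussian noise (an illustrative choice echoing the lecture's running example), refits a low-degree and a high-degree polynomial on many resampled training sets, and estimates the bias² and variance of the predictions at fixed test points.

```python
# Sketch: empirical bias^2 and variance of polynomial fits of two degrees,
# estimated by refitting on many independent training samples
# (NumPy only; the true function, noise level and degrees are assumptions).
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # assumed "true" function
x_test = np.linspace(0.0, 1.0, 50)

for M in (1, 9):                             # low vs. high flexibility
    preds = []
    for _ in range(200):                     # many independent training sets
        x_tr = rng.uniform(0.0, 1.0, 15)
        t_tr = f(x_tr) + 0.3 * rng.standard_normal(15)
        w = np.polynomial.polynomial.polyfit(x_tr, t_tr, deg=M)
        preds.append(np.polynomial.polynomial.polyval(x_test, w))
    preds = np.array(preds)                  # shape (200, 50)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test))**2)
    var = np.mean(preds.var(axis=0))
    print(f"M={M}: bias^2 ~ {bias2:.3f}, variance ~ {var:.3f}")

# Typically the M=1 fit shows larger bias^2 and smaller variance,
# while the M=9 fit shows the opposite.
```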
Test-data and Cross Validation
attributes, input/independent variables, features = the columns (Refund, Marital Status, Taxable Income)
object / instance / sample = one row of the table
class = the Cheat column

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute types
• discrete
– Has only a finite or countably infinite set of values.
– nominal (also categorical)
• the values are just different labels (e.g. ID number, eye color)
• central tendency given by mode (median, mean not defined)
– ordinal
• their values reflect the order (e.g. ranking, height in {tall,
medium, short})
• central tendency given by median, mode (mean not defined)
– binary attributes - special case of discrete attributes
• continuous (also quantitative)
– Has real numbers as attribute values.
– central tendency given by mean, + stdev, … (see the sketch below)
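A small sketch of these central-tendency rules on the tax table above, assuming pandas is available (building the data frame by hand here is only for illustration):

```python
# Sketch: central tendency by attribute type, using the tax-cheat table above
# (assumes pandas).
import pandas as pd

df = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# nominal attribute: only the mode is defined
print("mode of MaritalStatus:", df["MaritalStatus"].mode()[0])
# continuous attribute: mean and standard deviation make sense
print("mean income:", df["TaxableIncome"].mean(),
      "stdev:", round(df["TaxableIncome"].std(), 1))
```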
A regression problem
y = f(x) + noise
Can we learn from this data? Consider three methods.
[Figure: scatter plot of the data, y versus x]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Linear regression
What will the regression model look like? y = ax + b: univariate linear regression with a constant term.
[Figure: linear fit to the data]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Quadratic regression
What will the regression model look like? y = ax² + bx + c.
[Figure: quadratic fit to the data]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Join-the-dots
Also known as piecewise linear
nonparametric regression if that
makes you feel better.
[Figure: join-the-dots fit to the data]
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
Which is best?
Why not choose the method with the best fit to the data?
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
What do we really want?
Why not choose the method with the best fit to the data?
How well are you going to predict future data?
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
The test set method
1. Randomly choose 30% of the data to be the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.
[Figure: linear regression fit; test-set MSE = 2.4]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
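A minimal sketch of the test-set method, assuming scikit-learn; the 30% split mirrors the slide, but the data are synthetic, so the printed MSE will not match the tutorial's 2.4:

```python
# Sketch of the test set method with linear regression
# (assumes scikit-learn; synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 60).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(60)

# 1.-2. randomly put 30% of the data into the test set, the rest is training
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)
# 3. fit on the training set only
model = LinearRegression().fit(x_tr, y_tr)
# 4. estimate future performance on the test set
print("test MSE:", mean_squared_error(y_te, model.predict(x_te)))
```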
The test set method (same four steps, other models)
[Figure: quadratic regression fit; test-set MSE = 0.9]
[Figure: join-the-dots fit; test-set MSE = 2.2]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Test set method
• good news
– very simple
– Model selection: choose method with the best score.
• bad news
– wastes data (we got an estimate of the best method by
using 30% less data)
Train
Test
– if you don't have enough data, the test set may be just lucky/unlucky
→ the test-set estimator of performance has high variance
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
The above examples were for different algorithms; this one is about the model complexity (for a given algorithm).
[Figure: training error and testing error plotted against model complexity]
• stratified division
– the same proportion of classes in the training and test sets (see the sketch below)
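A sketch of a stratified division, assuming scikit-learn; the stratify argument keeps the class proportions equal in the two parts (the toy labels below are made up):

```python
# Sketch: stratified train/test split (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 14 + [1] * 6)        # imbalanced classes: 70% / 30%

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# both parts keep roughly the original 30% share of class 1
print("train class-1 ratio:", y_tr.mean(), "test class-1 ratio:", y_te.mean())
```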
LOOCV (Leave-one-out Cross Validation)
1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points.
4. Note your error.
Repeat these steps for all points. When you are done, report the mean squared error.
[Figure: LOOCV illustrated on the regression data]
taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
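A minimal LOOCV sketch following the four steps above, assuming scikit-learn (synthetic data, so the result will not match the tutorial's numbers):

```python
# Sketch of leave-one-out cross validation for linear regression
# (assumes scikit-learn; synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 25).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(25)

errors = []
for train_idx, test_idx in LeaveOneOut().split(x):
    model = LinearRegression().fit(x[train_idx], y[train_idx])  # fit without one point
    pred = model.predict(x[test_idx])[0]                        # predict the held-out point
    errors.append((pred - y[test_idx][0]) ** 2)                 # note the squared error

print("MSE_LOOCV:", np.mean(errors))   # mean squared error over all left-out points
```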
MSE_LOOCV for the three methods, in the same order as before (figures taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html):
linear regression: MSE_LOOCV = 2.12
quadratic regression: MSE_LOOCV = 0.962
join-the-dots: MSE_LOOCV = 3.33
Which kind of Cross Validation?
          Good                Bad
Test set  Cheap               Variance; wastes data
LOOCV     Doesn't waste data  Expensive

Can we get the best of both worlds?
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
k-fold Cross Validation
Randomly break data set into k partitions.
In our case k = 3.
Red partition: Train on all points not in the
red partition. Find the test set sum of errors
on the red points.
Blue partition: Train on all points not in the
blue partition. Find the test set sum of errors
on the blue points.
Green partition: Train on all points not in the green partition. Find the test set sum of errors on the green points.
Then report the mean error.
[Figure: 3-fold partition of the data; linear regression, MSE_3fold = 2.05]
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
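A 3-fold sketch, assuming scikit-learn; cross_val_score performs exactly the partition/train/score loop described above (synthetic data, so the value will not match 2.05):

```python
# Sketch of 3-fold cross validation for linear regression
# (assumes scikit-learn; synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 60).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.standard_normal(60)

cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("MSE_3fold:", -scores.mean())    # mean of the per-fold MSEs
```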
Results of 3-fold Cross Validation
              MSE_3fold
linear        2.05
quadratic     1.11
join-the-dots 2.93
taken from Cross Validation tutorial by Andrew Moore http://www.autonlab.org/tutorials/overfit.html
Which kind of Cross Validation?
Test set
  Good: Cheap.
  Bad:  Variance. Wastes data.
LOOCV
  Good: Doesn't waste data.
  Bad:  Expensive.
3-fold
  Good: Slightly better than the test set.
  Bad:  Wastier than LOOCV. More expensive than the test set.
10-fold
  Good: Only wastes 10%. Only 10 times more expensive instead of R times.
  Bad:  Wastes 10%. 10 times more expensive (though not R times, as LOOCV is).
R-fold is identical to LOOCV.
• We are trying to decide which model to use. For polynomial regression, decide on the degree of the polynomial.
• Train each machine and make a table:

degree  MSE_train  MSE_10-fold  Choice
1
2
3
4
5
6

• Whichever model gave the best CV score: train it with all the data. That's the predictive model you'll use.
taken from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Model selection via CV
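A sketch of filling in such a table, assuming scikit-learn: for each degree it records the training MSE and the 10-fold CV MSE, then refits the best degree on all the data (the synthetic data and the degree range 1 to 6 are illustrative assumptions):

```python
# Sketch: choose the polynomial degree by 10-fold CV, then refit on all data
# (assumes scikit-learn; data and degree range are illustrative).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.2 * rng.standard_normal(40)

cv_mse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = mean_squared_error(y, model.fit(x, y).predict(x))
    mse_cv = -cross_val_score(model, x, y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    cv_mse[degree] = mse_cv
    print(f"degree {degree}: MSE_train={mse_train:.3f}  MSE_10fold={mse_cv:.3f}")

best = min(cv_mse, key=cv_mse.get)                 # best CV score wins
final_model = make_pipeline(PolynomialFeatures(best), LinearRegression()).fit(x, y)
print("chosen degree:", best)
```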
Selection and testing
• Complete procedure for algorithm selection and estimation of its quality (a sketch of the whole pipeline follows below):
1. Divide the data into Train/Test.  [Train | Test]
2. By cross validation on the Train part, choose the algorithm.  [Train | Val]
3. Use this algorithm to construct a classifier using the whole Train part.  [Train]
4. Estimate its quality on the Test part.  [Test]
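A hedged sketch of this procedure, assuming scikit-learn: hold out a test set, choose between two candidate algorithms by cross validation on the training part only, refit the winner on the whole training part, and touch the test set exactly once at the end (the data set and the two candidates are illustrative choices):

```python
# Sketch of the full selection-and-testing procedure (assumes scikit-learn;
# the data and candidate algorithms are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. divide the data into Train/Test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. choose the algorithm by cross validation on Train only
candidates = {"logistic regression": LogisticRegression(max_iter=1000),
              "3-nearest neighbours": KNeighborsClassifier(n_neighbors=3)}
cv_acc = {name: cross_val_score(m, X_tr, y_tr, cv=5).mean()
          for name, m in candidates.items()}
best_name = max(cv_acc, key=cv_acc.get)

# 3. construct the classifier with the chosen algorithm on the whole Train part
best_model = candidates[best_name].fit(X_tr, y_tr)

# 4. estimate its quality on the Test set (used only once, at the very end)
print(best_name, "test accuracy:", best_model.score(X_te, y_te))
```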
• Training error cannot be used as an indicator of the model's performance due to overfitting.
• Training data set: train a range of models, or a given model with a range of values for its parameters.
• Compare them on independent data, the Validation set.
– If the model design is iterated many times, then some overfitting to the validation data can occur, and so it may be necessary to keep aside a third set, the Test set, on which the performance of the selected model is finally evaluated.
Finally comes our first machine learning algorithm
[Figure: scatter plot of Blue and Orange points with an unlabeled query point marked "?"]
• Which class (Blue or Orange) would you predict
for this point?
• And why?
• classification boundary
[Figure: the same data with a classification boundary drawn and a new query point "?"]
• And now?
• Classification boundary is quadratic
[Figure: the data with a quadratic classification boundary and another query point "?"]
• And now?
• And why?
Nearest Neighbors Classification
• But what does "similar" mean?
[Figure: example instances A, B, C, D]
source: Kardi Teknomo’s Tutorials, http://people.revoledu.com/kardi/tutorial/index.html
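A minimal nearest-neighbour classification sketch, assuming scikit-learn; here "similar" means closest in Euclidean distance, which is just one possible choice (the toy points and class names are made up):

```python
# Sketch: k-nearest-neighbour classification with Euclidean distance
# (assumes scikit-learn; the two-class toy data are made up).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# a few "Blue" (0) and "Orange" (1) training points in the plane
X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 6], [6, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

query = np.array([[2.0, 2.0]])               # the "?" point to classify
print("predicted class:", knn.predict(query)[0])           # 0 = Blue here
print("distances to the 3 nearest neighbours:", knn.kneighbors(query)[0][0])
```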