Lecture 6: Linear Regression II
Machine Learning
CUNY Graduate Center
Extension to polynomial regression
1
Extension to polynomial regression
• Polynomial regression is the same as
linear regression in D dimensions
2
Generate new features
Standard Polynomial with coefficients, w
Risk
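The two formulas on this slide are not reproduced in the transcript; a plausible reconstruction, assuming targets t_i and the usual squared-error risk:

    y(x_i, w) = w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_D x_i^D = \sum_{d=0}^{D} w_d x_i^d

    R(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - y(x_i, w) \right)^2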
3
Generate new features
Feature Trick: To fit a degree-D polynomial,
Create a D-element feature vector from xi: (xi, xi^2, ..., xi^D)
Then run standard linear regression in those D dimensions (a sketch follows below)
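A minimal sketch of the feature trick in Python, using only numpy; the data values and the degree are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D inputs x_i and targets t_i (values are illustrative)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
t = np.array([1.0, 1.6, 2.9, 5.1, 8.2])

D = 3  # degree of the polynomial to fit
# Feature trick: map each scalar x_i to the vector (1, x_i, x_i^2, ..., x_i^D);
# the leading 1 carries the bias term w_0
Phi = np.vander(x, N=D + 1, increasing=True)

# Standard linear regression (least squares) on the expanded features
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)  # coefficients w_0 ... w_D
```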
4
How is this still linear regression?
• The regression is linear in the parameters,
despite projecting xi from one dimension to D
dimensions.
• Now we fit a plane (or hyperplane) to a
representation of xi in a higher dimensional
feature space.
• This generalizes to any set of functions
5
Basis functions as feature extraction
• These functions are called basis functions.
– They define the bases of the feature space
• Allow a linear decomposition of many types of
functions over the data points
• Common Choices (sketched below):
– Polynomial
– Gaussian
– Sigmoids
– Wave functions (sine, etc.)
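With general basis functions the model keeps the same linear-in-w form. The exact parameterizations below are conventional choices, not taken from the slides:

    y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x)

    Polynomial: \phi_j(x) = x^j
    Gaussian:   \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
    Sigmoid:    \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), \quad \sigma(a) = \frac{1}{1 + e^{-a}}
    Wave:       \phi_j(x) = \sin(j x)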
6
Training data vs. Testing Data
• Evaluating the performance of a classifier on
training data is meaningless.
• With enough parameters, a model can simply
memorize (encode) every training point
• To evaluate performance, data is divided into
training and testing (or evaluation) data.
– Training data is used to learn model parameters
– Testing data is used to evaluate performance
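A minimal sketch of such a split, assuming scikit-learn is available; the data and the 80/20 ratio are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 points, 3 features (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# Learn parameters on the training split only
w, *_ = np.linalg.lstsq(X_train, t_train, rcond=None)

# Evaluate on the held-out test split
test_mse = np.mean((X_test @ w - t_test) ** 2)
print(test_mse)
```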
7
Overfitting
8
Overfitting
9
Overfitting performance
10
Definition of overfitting
• When the model describes the noise,
rather than the signal.
• How can you tell the difference between
overfitting and a bad model?
11
Possible detection of overfitting
• Stability
– An appropriately fit model is stable under
different samples of the training data
– An overfit model generates inconsistent
performance
• Performance
– A good model has low test error
– A bad model has high test error
12
What is the optimal model size?
• The best model size is the one that generalizes
best to unseen data.
• Approximate this by testing error.
• One way to optimize parameters is to
minimize testing error.
– This operation uses testing data as tuning or
development data
• Sacrifices training data in favor of parameter
optimization
• Can we do this without explicit evaluation
data?
13
Context for linear regression
• Simple approach
• Efficient learning
• Extensible
• Regularization provides robust models
14
Linear Regression
• Identify the best parameters, w, for a
regression function
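The regression function and objective are only images on the slide; the standard least-squares formulation, as a reconstruction (X stacks the input vectors row-wise, t stacks the targets):

    y(x_i, w) = w^T x_i

    w^* = \arg\min_w \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T x_i \right)^2 = (X^T X)^{-1} X^T t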
15
Overfitting
• Recall: overfitting happens when a model
is capturing idiosyncrasies of the data
rather than generalities.
– Often caused by too many parameters
relative to the amount of training data.
– E.g. an order-N polynomial can intersect any
N+1 data points
16
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian
17
Regularization
• In a linear regression model, overfitting is characterized by large weights (parameters).

        M = 0     M = 1     M = 3         M = 9
w0       0.19      0.82      0.31          0.35
w1                -1.27      7.99        232.37
w2                         -25.43      -5321.83
w3                          17.37      48568.31
w4                                   -231639.30
w5                                    640042.26
w6                                  -1061800.52
w7                                   1042400.18
w8                                   -557682.99
w9                                    125201.43
18
Penalize large weights
• Introduce a penalty term in the loss
function.
Regularized Regression
(L2-Regularization or Ridge Regression)
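The regularized objective is an image on the slide; the standard ridge form, assuming \lambda denotes the penalty weight:

    E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T x_i \right)^2 + \frac{\lambda}{2} \| w \|_2^2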
19
Regularization Derivation
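The derivation on these slides is not reproduced in the transcript; a plausible sketch: set the gradient of the penalized objective to zero and solve for w.

    \nabla_w E(w) = -\sum_{i=1}^{N} \left( t_i - w^T x_i \right) x_i + \lambda w = 0
    \quad \Rightarrow \quad w = (X^T X + \lambda I)^{-1} X^T t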
20
21
Regularization in Practice
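The worked example from this slide is not in the transcript; a minimal sketch using scikit-learn's Ridge, where the data, the polynomial degree, and the alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data: 10 noisy samples of a sine curve, degree-9 polynomial features
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)
Phi = np.vander(x, N=10, increasing=True)  # columns 1, x, ..., x^9

# alpha plays the role of the penalty weight (lambda) in the regularized loss
model = Ridge(alpha=1e-3, fit_intercept=False)
model.fit(Phi, t)
print(model.coef_)  # weights stay far smaller than the unregularized M = 9 fit
```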
22
Regularization Results
23
More regularization
• The penalty term defines the style of
regularization (standard forms below)
• L2-Regularization
• L1-Regularization
• L0-Regularization
– Minimizing the L0-norm selects the
optimal subset of features
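The three penalty terms are not written out in the transcript; their standard forms, as a reconstruction:

    L2: \frac{\lambda}{2} \sum_j w_j^2
    L1: \lambda \sum_j | w_j |
    L0: \lambda \sum_j \mathbf{1}[\, w_j \neq 0 \,]  (a count of the nonzero weights)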
24
Curse of dimensionality
• Increasing dimensionality of features increases the
data requirements exponentially.
• For example, if a single feature can be accurately
approximated with 100 data points, to optimize the
joint over two features requires 100*100 data points.
• Models should be small relative to the amount of
available data
• Dimensionality reduction techniques – feature
selection – can help.
– L0-regularization is explicit feature selection
– L1- and L2-regularizations approximate feature selection.
25
Bayesians v. Frequentists
• What is a probability?
• Frequentists
– A probability is the likelihood that an event will happen
– It is approximated by the ratio of the number of observed events to the
number of total events
– Assessment is vital to selecting a model
– Point estimates are absolutely fine
• Bayesians
– A probability is a degree of believability of a proposition.
– Bayesians require that probabilities be prior beliefs conditioned on data.
– The Bayesian approach “is optimal”, given a good model, a good prior
and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior
26
Bayesian Linear Regression
• The previous MLE derivation of linear regression uses
point estimates for the weight vector, w.
• Bayesians say, “hold it right there”.
– Use a prior distribution over w to estimate parameters
• Alpha is a hyperparameter over w, where alpha is the
precision or inverse variance of the distribution.
• Now optimize:
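The prior and the optimization target are images on the slide; the standard setup, assuming a zero-mean Gaussian prior with precision \alpha and a Gaussian likelihood with noise precision \beta:

    p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)

    w^* = \arg\max_w \; p(t \mid X, w, \beta) \, p(w \mid \alpha)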
27
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
28
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
29
Optimize the Bayesian posterior
Ignoring terms that do not depend on w
IDENTICAL formulation as L2-regularization
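The algebra is not shown in the transcript; a plausible sketch of the step, keeping only the w-dependent terms of the negative log posterior:

    -\ln p(w \mid t, X, \alpha, \beta) = \frac{\beta}{2} \sum_{i=1}^{N} \left( t_i - w^T x_i \right)^2 + \frac{\alpha}{2} w^T w + \text{const}

which matches L2-regularized least squares with \lambda = \alpha / \beta.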
30
Context
• Overfitting is bad.
• Bayesians vs. Frequentists
– Is one better?
– Machine Learning uses techniques from both
camps.
31
Next Time
• Logistic Regression
32