Transcript ppt - CUNY

Lecture 3: Linear Regression
Machine Learning
CUNY Graduate Center
Today
• Calculus
– Lagrange Multipliers
• Linear Regression
1
Optimization with constraints
• What if I want to constrain the parameters
of the model?
– The mean is less than 10
• Find the best likelihood, subject to a
constraint.
• Two functions:
– An objective function to maximize
– An inequality that must be satisfied
2
Lagrange Multipliers
• Find maxima of
f(x,y) subject to a
constraint.
3
General form
• Maximizing:
• Subject to:
• Introduce a new variable (the Lagrange multiplier), and find a
maximum.
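The equations on this slide did not survive extraction. One standard way to write the general form is sketched below; the symbols f (objective), g (constraint) and \lambda (the new variable) are assumptions, not recovered from the slide.

```latex
\begin{aligned}
\text{Maximize: } \quad & f(\mathbf{x}) \\
\text{Subject to: } \quad & g(\mathbf{x}) = 0 \\
\text{Lagrangian: } \quad & L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x}) \\
\text{Stationary point: } \quad & \nabla_{\mathbf{x}} L = 0, \qquad \frac{\partial L}{\partial \lambda} = g(\mathbf{x}) = 0
\end{aligned}
```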
4
Example
• Maximizing:
• Subject to:
• Introduce a new variable, and find a
maximum.
5
Example
Now have 3 equations with 3 unknowns.
6
Example
Eliminate Lambda
Substitute and Solve
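The objective and constraint used on these example slides are not recoverable from the transcript. As an illustration only, assume the objective f(x, y) = 1 - x^2 - y^2 and the constraint x + y - 1 = 0. The same three steps (three equations in three unknowns, eliminate lambda, substitute and solve) would then look roughly like this:

```latex
\begin{aligned}
L(x, y, \lambda) &= 1 - x^2 - y^2 + \lambda\,(x + y - 1) \\
\frac{\partial L}{\partial x} &= -2x + \lambda = 0 \\
\frac{\partial L}{\partial y} &= -2y + \lambda = 0 \\
\frac{\partial L}{\partial \lambda} &= x + y - 1 = 0 \\
\text{Eliminate } \lambda: \quad & x = y = \tfrac{\lambda}{2} \\
\text{Substitute and solve:} \quad & 2x = 1 \;\Rightarrow\; x = y = \tfrac{1}{2}, \qquad \lambda = 1
\end{aligned}
```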
7
Basics of Linear Regression
• Regression algorithm
• Supervised technique.
• In one dimension:
– Identify a function f: R → R
• In D-dimensions:
– Identify a function f: R^D → R
• Given: training data: x_1, ..., x_N
– And targets: t_1, ..., t_N
8
Graphical Example of Regression
?
9
Graphical Example of Regression
10
Graphical Example of Regression
11
Definition
• In linear regression, we assume that the
model that generates the data involves only
a linear combination of input variables.
Where w is a vector of weights that
defines the D parameters of the model
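A minimal sketch of such a model in code, assuming a separate bias term w0 alongside the D weights (the function name `predict` and the example numbers are illustrative, not from the slides):

```python
def predict(w0, w, x):
    """Return w0 + sum_j w[j] * x[j] for a single input vector x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    # The prediction is a linear combination of the input variables.
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


# Example with D = 2 input variables (made-up numbers).
y = predict(w0=1.0, w=[2.0, -0.5], x=[3.0, 4.0])
```

The output depends on the inputs only through a weighted sum, which is what makes the model linear.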
12
Evaluation
• How can we
evaluate the
performance of a
regression solution?
• Error Functions (or
Loss functions)
– Squared Error
– Linear Error
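The two error functions can be sketched as follows. "Linear error" is read here as the absolute difference between target and prediction, which is an assumption since the slide's formula is missing; the names `squared_error` and `linear_error` are illustrative.

```python
def squared_error(t, y):
    """Squared error between target t and prediction y."""
    return (t - y) ** 2


def linear_error(t, y):
    """Absolute ("linear") error; this reading of the term is an assumption."""
    return abs(t - y)


# A prediction that misses the target by 2 (made-up numbers).
sq = squared_error(3.0, 5.0)
lin = linear_error(3.0, 5.0)
```

Squared error grows quadratically with the size of the miss, so it penalizes large mistakes more heavily than linear error does.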
13
Regression Error
14
Empirical Risk
• Empirical risk is the loss measured on the observed
data.
• By minimizing risk on the training data, we
optimize the fit with respect to the loss function
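As a rough sketch, empirical risk can be computed by averaging a loss over the training pairs. Averaging rather than summing is an assumption here; the two differ only by a constant factor and share the same minimizer. The function name `empirical_risk` is illustrative.

```python
def empirical_risk(loss, targets, predictions):
    """Average loss of the predictions over the observed targets."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("need equally sized, non-empty sequences")
    total = sum(loss(t, y) for t, y in zip(targets, predictions))
    return total / len(targets)


def squared_loss(t, y):
    return (t - y) ** 2


# Made-up targets and predictions.
risk = empirical_risk(squared_loss, [1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
```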
15
Model Likelihood and Empirical
Risk
• Two related but distinct ways to look at a
model.
1. Model Likelihood.
1. “What is the likelihood that a model generated
the observed data?”
2. Empirical Risk
1. “How much error does the model have on the
training data?”
16
Model Likelihood
Assuming independent and identically
distributed (iid) data.
17
Understanding Model Likelihood
Substitution for the eqn of a gaussian
Apply a log function
Let the log dissolve products into sums
18
Understanding Model Likelihood
Optimize the weights. (Maximum Likelihood Estimation)
Log Likelihood
Empirical Risk w/ Squared Loss Function
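The equations for these steps were lost in extraction. A sketch of the usual derivation follows, under assumed notation: inputs x_i, targets t_i, weights w, and Gaussian noise with variance \sigma^2.

```latex
\begin{aligned}
p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2)
  &= \prod_{i=1}^{N} \mathcal{N}\!\left(t_i \mid \mathbf{w}^T \mathbf{x}_i, \sigma^2\right)
  && \text{(iid data)} \\
  &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(t_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\right)
  && \text{(equation of a Gaussian)} \\
\ln p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2)
  &= \sum_{i=1}^{N} \left[ -\tfrac{1}{2}\ln(2\pi\sigma^2)
     - \frac{(t_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right]
  && \text{(log turns products into sums)} \\
  &= -\frac{N}{2}\ln(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{N} (t_i - \mathbf{w}^T \mathbf{x}_i)^2 \\
\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2)
  &= \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{N} (t_i - \mathbf{w}^T \mathbf{x}_i)^2
  && \text{(empirical risk with squared loss)}
\end{aligned}
```

The first term of the log likelihood does not depend on w, which is why maximizing the likelihood and minimizing squared-loss risk pick the same weights.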
19
Maximizing Log Likelihood (1-D)
• Find the optimal settings of w.
20
Maximizing Log Likelihood
Partial derivative
Set to zero
Separate the sum to isolate w0
21
Maximizing Log Likelihood
Partial derivative
Set to zero
Separate the sum to isolate w0
22
Maximizing Log Likelihood
From previous partial
From prev. slide
Substitute
Isolate w1
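The result of isolating w0 and w1 can be sketched in code. Setting the partial derivative with respect to w0 to zero gives w0 = mean(t) - w1 * mean(x); substituting that into the w1 equation gives a ratio of centered sums. The exact expressions on the slides are not recoverable, so the centered form below is one algebraically equivalent way to write them, and the name `fit_line` is illustrative.

```python
def fit_line(xs, ts):
    """Least-squares estimates (w0, w1) for the 1-D model t = w0 + w1 * x."""
    n = len(xs)
    if n != len(ts) or n < 2:
        raise ValueError("need at least two (x, t) pairs of equal length")
    x_mean = sum(xs) / n
    t_mean = sum(ts) / n
    # Centered sums: spread of x, and co-variation of x with t.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; w1 is undefined")
    sxt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    # From the w0 partial derivative: w0 = mean(t) - w1 * mean(x).
    w0 = t_mean - w1 * x_mean
    return w0, w1


# Noise-free points on the line t = 1 + 2x (made-up data).
w0, w1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```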
23
Maximizing Log Likelihood
• Clean and easy.
• Or not…
• Apply some linear algebra.
24
Likelihood using linear algebra
• Representing the linear regression
function in terms of vectors.
25
Likelihood using linear algebra
• Stack the row vectors xi^T into a matrix of data points, X.
Representation as vectors
Stack the data into a matrix and use the Norm operation to handle the sum
26
Likelihood in multiple dimensions
• This representation of risk has no inherent
dimensionality.
27
Maximum Likelihood Estimation
redux
Decompose the norm
FOIL – linear algebra style
Differentiate
Combine terms
Isolate w
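The equations behind these five steps are missing from the transcript. A sketch of the standard derivation, assuming X is the matrix of stacked inputs, t the vector of targets, and E(w) the squared-error risk:

```latex
\begin{aligned}
E(\mathbf{w}) &= \tfrac{1}{2}\,\lVert \mathbf{t} - X\mathbf{w} \rVert^2
  = \tfrac{1}{2}\,(\mathbf{t} - X\mathbf{w})^T(\mathbf{t} - X\mathbf{w})
  && \text{(decompose the norm)} \\
&= \tfrac{1}{2}\left(\mathbf{t}^T\mathbf{t} - \mathbf{t}^T X\mathbf{w}
   - \mathbf{w}^T X^T \mathbf{t} + \mathbf{w}^T X^T X \mathbf{w}\right)
  && \text{(FOIL)} \\
\nabla_{\mathbf{w}} E &= \tfrac{1}{2}\left(-X^T\mathbf{t} - X^T\mathbf{t} + 2 X^T X \mathbf{w}\right)
  && \text{(differentiate)} \\
&= -X^T\mathbf{t} + X^T X \mathbf{w} = 0
  && \text{(combine terms, set to zero)} \\
\mathbf{w} &= (X^T X)^{-1} X^T \mathbf{t}
  && \text{(isolate } \mathbf{w}\text{)}
\end{aligned}
```

The last step assumes X^T X is invertible, which requires the columns of X to be linearly independent.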
28
Extension to polynomial regression
29
Extension to polynomial regression
• Polynomial regression is the same as
linear regression in D dimensions
30
Generate new features
Standard Polynomial with coefficients, w
Risk
31
Generate new features
Feature Trick: To fit a degree-D polynomial,
create a D-element vector from xi: (xi, xi^2, ..., xi^D).
Then apply standard linear regression in D dimensions.
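A minimal sketch of the feature trick, using only the standard library. It expands each scalar x into powers of x and then solves the normal equations X^T X w = X^T t. The leading constant feature 1 (which carries the bias w0), the small Gaussian-elimination helper, and all function names are assumptions made for illustration.

```python
def polynomial_features(x, degree):
    """Map a scalar x to [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * w[k] for k in range(row + 1, n))
        w[row] = (m[row][n] - tail) / m[row][row]
    return w


def fit_polynomial(xs, ts, degree):
    """Least-squares polynomial fit: linear regression on expanded features."""
    X = [polynomial_features(x, degree) for x in xs]
    cols = degree + 1
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(cols)] for i in range(cols)]
    xtt = [sum(r[i] * t for r, t in zip(X, ts)) for i in range(cols)]
    return solve_linear_system(xtx, xtt)


# Noise-free samples of t = 1 - 2x + 3x^2 (made-up data).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ts = [1.0 - 2.0 * x + 3.0 * x ** 2 for x in xs]
w = fit_polynomial(xs, ts, degree=2)
```

The fitting code never treats the powers of x specially; once the features are generated, it is ordinary linear regression.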
32
How is this still linear regression?
• The regression is linear in the parameters,
despite projecting xi from one dimension to D
dimensions.
• Now we fit a plane (or hyperplane) to a
representation of xi in a higher dimensional
feature space.
• This generalizes to any set of functions
33
Basis functions as feature
extraction
• These functions are called basis functions.
– They define the bases of the feature space
• Allow a model that is linear in its weights to fit
nonlinear functions of the input
• Common Choices:
– Polynomial
– Gaussian
– Sigmoids
– Wave functions (sine, etc.)
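The common choices above might be written as follows. The parameter names (`mu` for a center, `s` for a width, `freq` for a frequency) and the exact functional forms are assumptions, since the slide lists only the family names.

```python
import math


def polynomial_basis(x, j):
    """x raised to the j-th power."""
    return x ** j


def gaussian_basis(x, mu, s):
    """Bump centered at mu with width s."""
    return math.exp(-((x - mu) ** 2) / (2.0 * s ** 2))


def sigmoid_basis(x, mu, s):
    """Logistic sigmoid centered at mu with slope controlled by s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


def sine_basis(x, freq):
    """Sine wave with angular frequency freq."""
    return math.sin(freq * x)


def feature_vector(x, basis_functions):
    """Apply every basis function to x to build one row of features."""
    return [phi(x) for phi in basis_functions]


features = feature_vector(0.0, [
    lambda x: polynomial_basis(x, 2),
    lambda x: gaussian_basis(x, mu=0.0, s=1.0),
    lambda x: sigmoid_basis(x, mu=0.0, s=1.0),
    lambda x: sine_basis(x, freq=1.0),
])
```

Whatever the basis, the regression step is unchanged: the weights multiply the feature values linearly.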
34
Training data vs. Testing Data
• Evaluating the performance of a classifier on
training data is meaningless.
• With enough parameters, a model can simply
memorize (encode) every training point
• To evaluate performance, data is divided into
training and testing (or evaluation) data.
– Training data is used to learn model parameters
– Testing data is used to evaluate performance
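One simple way to divide the data is a random split, sketched below. The 25% test fraction, the fixed seed, and the name `train_test_split` are illustrative choices, not something the slides prescribe.

```python
import random


def train_test_split(xs, ts, test_fraction=0.25, seed=0):
    """Randomly split paired data into (train, test) lists of (x, t) tuples."""
    if len(xs) != len(ts):
        raise ValueError("xs and ts must have the same length")
    indices = list(range(len(xs)))
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test = [(xs[i], ts[i]) for i in indices[:n_test]]
    train = [(xs[i], ts[i]) for i in indices[n_test:]]
    return train, test


# Made-up data: eight points on a line.
xs = list(range(8))
ts = [2 * x + 1 for x in xs]
train, test = train_test_split(xs, ts)
```

Parameters are then learned from `train` only, and performance is reported on `test` only.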
35
Overfitting
36
Overfitting
37
Overfitting performance
38
Definition of overfitting
• When the model describes the noise,
rather than the signal.
• How can you tell the difference between
overfitting and a bad model?
39
Possible detection of overfitting
• Stability
– An appropriately fit model is stable under
different samples of the training data
– An overfit model generates inconsistent
performance
• Performance
– A good model has low test error
– A bad model has high test error
40
What is the optimal model size?
• The best model size is the one that generalizes
best to unseen data.
• Approximate this by testing error.
• One way to optimize parameters is to
minimize testing error.
– This operation uses testing data as tuning or
development data
• Sacrifices training data in favor of parameter
optimization
• Can we do this without explicit evaluation
data?
41
Context for linear regression
• Simple approach
• Efficient learning
• Extensible
• Regularization provides robust models
42
Break
Coffee. Stretch.
43
Linear Regression
• Identify the best parameters, w, for a
regression function
44
Overfitting
• Recall: overfitting happens when a model
is capturing idiosyncrasies of the data
rather than generalities.
– Often caused by too many parameters
relative to the amount of training data.
– E.g. an order-N polynomial can pass through any
N+1 data points with distinct x values
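The polynomial claim can be checked with a short sketch. It builds the polynomial of degree at most N through N+1 points using Lagrange interpolation (a different construction from least-squares fitting, chosen here because it needs no linear solver) and confirms that the training error is zero even for arbitrary targets. The data values and the function name are made up for illustration.

```python
def interpolating_polynomial(xs, ts):
    """Return p(x), the polynomial of degree <= N through N+1 points."""
    if len(xs) != len(ts) or len(set(xs)) != len(xs):
        raise ValueError("need equally many xs and ts, with distinct xs")

    def p(x):
        total = 0.0
        for i, (xi, ti) in enumerate(zip(xs, ts)):
            # Lagrange term: equals ti at xi and 0 at every other xj.
            term = ti
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return p


# Four arbitrary (made-up) targets, matched exactly by a cubic.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [0.3, -1.2, 4.0, 0.7]
p = interpolating_polynomial(xs, ts)
training_error = sum((t - p(x)) ** 2 for x, t in zip(xs, ts))
```

Zero training error here says nothing about how `p` behaves between or beyond the training points, which is the sense in which the model has memorized the data.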
45
Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian
46
Regularization
• In a linear regression model, overfitting is
characterized by large weights.

      M = 0    M = 1    M = 3     M = 9
w0    0.19     0.82     0.31      0.35
w1             -1.27    7.99      232.37
w2                      -25.43    -5321.83
w3                      17.37     48568.31
w4                                -231639.30
w5                                640042.26
w6                                -1061800.52
w7                                1042400.18
w8                                -557682.99
w9                                125201.43
47
Penalize large weights
• Introduce a penalty term in the loss
function.
Regularized Regression
(L2-Regularization or Ridge Regression)
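The regularized loss itself is missing from the transcript. A common way to write it, assuming squared loss, inputs x_i, targets t_i, and a penalty strength \lambda \ge 0 (all assumed symbols):

```latex
E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N} \left(t_i - \mathbf{w}^T \mathbf{x}_i\right)^2
  + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2
```

The first term rewards fitting the data; the second grows with the size of the weights, so large weights must earn their keep through a better fit.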
48
Regularization Derivation
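The derivation on this slide did not survive extraction. A sketch of the usual closed-form result, assuming a matrix X of stacked inputs, a target vector t, and penalty strength \lambda:

```latex
\begin{aligned}
E(\mathbf{w}) &= \tfrac{1}{2}\,\lVert \mathbf{t} - X\mathbf{w} \rVert^2
  + \tfrac{\lambda}{2}\,\mathbf{w}^T\mathbf{w} \\
\nabla_{\mathbf{w}} E &= -X^T(\mathbf{t} - X\mathbf{w}) + \lambda\,\mathbf{w} = 0 \\
(X^T X + \lambda I)\,\mathbf{w} &= X^T \mathbf{t} \\
\mathbf{w} &= (X^T X + \lambda I)^{-1} X^T \mathbf{t}
\end{aligned}
```

Compared with the unregularized solution, the only change is the added \lambda I, which also makes the matrix invertible for any \lambda > 0.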
49
50
Regularization in Practice
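The figure for this slide is not in the transcript. As a stand-in, here is a small sketch that fits a cubic to made-up data with and without an L2 penalty and compares the size of the weights. It solves (X^T X + lambda I) w = X^T t directly; penalizing every weight, including the bias, is a simplification, and all names and data values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * w[k] for k in range(row + 1, n))
        w[row] = (m[row][n] - tail) / m[row][row]
    return w


def fit_ridge(X, ts, lam):
    """Ridge regression weights from (X^T X + lam * I) w = X^T t."""
    cols = len(X[0])
    a = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(cols)] for i in range(cols)]
    b = [sum(r[i] * t for r, t in zip(X, ts)) for i in range(cols)]
    return solve_linear_system(a, b)


def squared_norm(w):
    return sum(wj ** 2 for wj in w)


# Made-up noisy targets; cubic features [1, x, x^2, x^3].
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ts = [0.1, 0.9, 0.8, -0.3, -1.1, 0.2]
X = [[1.0, x, x ** 2, x ** 3] for x in xs]
w_plain = fit_ridge(X, ts, lam=0.0)   # no penalty
w_ridge = fit_ridge(X, ts, lam=1.0)   # with penalty
```

Comparing `squared_norm(w_plain)` with `squared_norm(w_ridge)` shows the penalty shrinking the weights.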
51
Regularization Results
52
More regularization
• The penalty term
defines the styles of
regularization
• L2-Regularization
• L1-Regularization
• L0-Regularization
– The L0-norm counts the nonzero weights;
penalizing it selects an optimal subset of
features
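The three penalty styles can be sketched as plain functions of the weight vector. Using the squared norm for the L2 penalty and omitting the penalty strength are assumptions made to keep the example small.

```python
def l2_penalty(w):
    """Sum of squared weights (squared L2 norm)."""
    return sum(wj ** 2 for wj in w)


def l1_penalty(w):
    """Sum of absolute weights (L1 norm)."""
    return sum(abs(wj) for wj in w)


def l0_penalty(w):
    """Number of nonzero weights (L0 'norm')."""
    return sum(1 for wj in w if wj != 0)


# Made-up weights: one feature is switched off entirely.
w = [0.0, -3.0, 4.0]
penalties = (l2_penalty(w), l1_penalty(w), l0_penalty(w))
```

Only the L0 penalty ignores how large a nonzero weight is; it cares solely about how many features are in use.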
53
Curse of dimensionality
• Increasing dimensionality of features increases the
data requirements exponentially.
• For example, if a single feature can be accurately
approximated with 100 data points, to optimize the
joint over two features requires 100*100 data points.
• Models should be small relative to the amount of
available data
• Dimensionality reduction techniques – feature
selection – can help.
– L0-regularization is explicit feature selection
– L1- and L2-regularizations approximate feature selection.
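The arithmetic in the example above extends to more features as a simple power. This is a back-of-the-envelope sketch of the slide's reasoning, not a precise sample-complexity bound, and the function name is illustrative.

```python
def samples_needed(per_feature, num_features):
    """Data points needed if each feature alone needs per_feature points."""
    if per_feature < 1 or num_features < 1:
        raise ValueError("both arguments must be positive")
    # Each added feature multiplies the requirement by per_feature.
    return per_feature ** num_features


two_features = samples_needed(100, 2)
three_features = samples_needed(100, 3)
```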
54
Bayesians v. Frequentists
• What is a probability?
• Frequentists
– A probability is the likelihood that an event will happen
– It is approximated by the ratio of the number of observed events to the
number of total events
– Assessment is vital to selecting a model
– Point estimates are absolutely fine
• Bayesians
– A probability is a degree of believability of a proposition.
– Bayesians require that probabilities be prior beliefs conditioned on data.
– The Bayesian approach “is optimal”, given a good model, a good prior
and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior
55
Bayesian Linear Regression
• The previous MLE derivation of linear regression uses
point estimates for the weight vector, w.
• Bayesians say, “hold it right there”.
– Use a prior distribution over w to estimate parameters
• Alpha is a hyperparameter over w, where alpha is the
precision or inverse variance of the distribution.
• Now optimize:
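The prior and the quantity being optimized are missing from the transcript. A sketch under assumed notation, with a zero-mean Gaussian prior of precision \alpha on w and Gaussian noise of variance \sigma^2 on the targets:

```latex
\begin{aligned}
p(\mathbf{w} \mid \alpha) &= \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I\right) \\
p(\mathbf{w} \mid X, \mathbf{t}, \alpha, \sigma^2)
  &\propto p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2)\; p(\mathbf{w} \mid \alpha)
\end{aligned}
```

The posterior combines the likelihood from the MLE derivation with the prior, so weights far from zero are penalized before any data is seen.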
56
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
57
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
58
Optimize the Bayesian posterior
Ignoring terms that do not depend on w
IDENTICAL formulation as L2-regularization
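A sketch of why the two formulations coincide, assuming a Gaussian likelihood with noise variance \sigma^2 and a zero-mean Gaussian prior with precision \alpha (the symbol \lambda for the resulting penalty strength is also an assumption):

```latex
\begin{aligned}
\ln p(\mathbf{w} \mid X, \mathbf{t}, \alpha, \sigma^2)
  &= -\frac{1}{2\sigma^2}\sum_{i=1}^{N} \left(t_i - \mathbf{w}^T \mathbf{x}_i\right)^2
     - \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} + \text{const} \\
\arg\max_{\mathbf{w}} \ln p(\mathbf{w} \mid X, \mathbf{t}, \alpha, \sigma^2)
  &= \arg\min_{\mathbf{w}} \; \frac{1}{2}\sum_{i=1}^{N} \left(t_i - \mathbf{w}^T \mathbf{x}_i\right)^2
     + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2,
  \qquad \lambda = \alpha\,\sigma^2
\end{aligned}
```

The second line follows from multiplying the negative log posterior by \sigma^2 and dropping the constant, neither of which changes the optimal w.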
59
Context
• Overfitting is bad.
• Bayesians vs. Frequentists
– Is one better?
– Machine Learning uses techniques from both
camps.
60
Next Time
• Logistic Regression
• Read Chapter 4.1, 4.3
61