Lecture 4: Logistic Regression
Machine Learning
CUNY Graduate Center
Today
• Linear Regression
– Bayesians v. Frequentists
– Bayesian Linear Regression
• Logistic Regression
– Linear Model for Classification
1
Regularization:
Penalize large weights
• Introduce a penalty term in the loss
function.
Regularized Regression
(L2-Regularization or Ridge Regression)
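For reference, a standard way to write the regularized least-squares (ridge) objective this slide names, with λ as the regularization strength (notation assumed, not taken from the slide):

E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 + \frac{\lambda}{2} w^\top w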
2
More regularization
• The penalty term
defines the styles of
regularization
• L2-Regularization
• L1-Regularization
• L0-Regularization
– L0-norm is the
optimal subset of
features
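A sketch of the penalty terms that define these styles (standard definitions; the λ notation is an assumption):

L2: \frac{\lambda}{2} \|w\|_2^2 = \frac{\lambda}{2} \sum_j w_j^2
L1: \lambda \|w\|_1 = \lambda \sum_j |w_j|
L0: \lambda \|w\|_0 = \lambda \cdot |\{ j : w_j \neq 0 \}|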
3
Curse of dimensionality
• Increasing dimensionality of features increases the
data requirements exponentially.
• For example, if the distribution over a single feature can be
accurately approximated with 100 data points, approximating the
joint over two features requires 100*100 data points.
• Models should be small relative to the amount of
available data
• Dimensionality reduction techniques – feature
selection – can help.
– L0-regularization is explicit feature selection
– L1- and L2-regularizations approximate feature selection.
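Stated generally (an illustrative generalization of the example above): if each feature needs roughly m samples, the joint over D features needs on the order of

N \approx m^{D} \quad \text{(e.g. } 100^2 = 10{,}000 \text{ samples for two features)}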
4
Bayesians v. Frequentists
• What is a probability?
• Frequentists
– A probability is the likelihood that an event will happen
– It is approximated by the ratio of the number of times the event is
observed to the total number of trials
– Assessment is vital to selecting a model
– Point estimates are absolutely fine
• Bayesians
– A probability is a degree of believability of a proposition.
– Bayesians require that probabilities be prior beliefs conditioned on data.
– The Bayesian approach “is optimal”, given a good model, a good prior
and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior
5
Bayesian Linear Regression
• The previous MLE derivation of linear regression uses
point estimates for the weight vector, w.
• Bayesians say, “hold it right there”.
– Use a prior distribution over w to estimate parameters
• Alpha is a hyperparameter of the prior over w: the precision, or
inverse variance, of that distribution.
• Now optimize:
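A sketch of the prior and posterior this refers to, in standard notation (a zero-mean Gaussian prior with precision α and Gaussian noise with precision β are assumed):

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)
p(w \mid t, X, \alpha, \beta) \propto p(t \mid X, w, \beta) \, p(w \mid \alpha)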
6
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
7
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
8
Optimize the Bayesian posterior
Ignoring terms that do not depend on w
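Dropping those terms, the log posterior takes the form (a standard result; notation assumed):

\ln p(w \mid t, X) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 - \frac{\alpha}{2} w^\top w + \text{const}

Maximizing this corresponds to L2-regularization with effective strength λ = α/β.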
IDENTICAL formulation to L2-regularization
9
Context
• Overfitting is bad.
• Bayesians vs. Frequentists
– Is one better?
– Machine Learning uses techniques from both
camps.
10
Logistic Regression
• Linear model applied to classification
• Supervised: target information is available
– Each data point xi has a corresponding target
ti.
• Goal: Identify a function
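In symbols (a generic statement of the goal; notation assumed): find a function y(x) such that y(x_i) \approx t_i for the labeled pairs (x_i, t_i).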
11
Target Variables
• In binary classification, it is convenient to represent t_i
as a scalar in the range [0, 1]
– Interpretation of t_i as the likelihood that x_i is a member of
the positive class
– Used to represent the confidence of a prediction.
• For K > 2 classes, t_i is often represented as a K-element
vector.
– t_ij represents the degree of membership in class j.
– |t_i| = 1
– E.g. 5-way classification vector:
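For example, a 5-way target indicating membership in class 3 would be the one-hot vector (an illustrative example, not from the slide):

t_i = (0, 0, 1, 0, 0)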
12
Graphical Example of Classification
13
Decision Boundaries
14
Graphical Example of Classification
15
Classification approaches
• Generative
– Models the joint distribution
between c and x
– Highest data requirements
• Discriminative
– Fewer parameters to approximate
• Discriminant Function
– May still be trained probabilistically,
but not necessarily modeling a
likelihood.
16
Treating Classification as a
Linear model
17
Relationship between
Regression and Classification
• Since we're classifying two classes, why not
set one class to ‘0’ and the other to ‘1’, then
use linear regression?
– Regression: -infinity to infinity, while class labels
are 0, 1
• Can use a threshold, e.g.
– y >= 0.5 then class 1
– y < 0.5 then class 2
[Diagram: decision f(x) >= 0.5? If yes: Happy/Good/Class A; if no: Sad/Not Good/Class B]
18
Odds-ratio
• Rather than thresholding, we’ll relate the
regression to the class-conditional
probability.
• Ratio of the odds of predicting y = 1 versus y = 0
– If p(y=1|x) = 0.8 and p(y=0|x) = 0.2
– Odds ratio = 0.8/0.2 = 4
• Use a linear model to predict odds rather
than a class label.
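In symbols, the odds in favor of y = 1 are (standard definition):

\text{odds} = \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)}

which gives 0.8 / 0.2 = 4 in the example above.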
19
Logit – Log odds ratio function
• LHS (the odds): 0 to infinity
• RHS (the linear model): -infinity to infinity
• Use a log function.
– Has the added bonus of dissolving the
division, leading to easy manipulation
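The log-odds (logit) function this refers to, in standard form:

\text{logit}(p) = \ln \frac{p}{1 - p}

which maps (0, 1) onto (-\infty, \infty).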
20
Logistic Regression
• A linear model used to predict log-odds
ratio of two classes
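In symbols, the model equates the log-odds with a linear function of the features (standard form; a bias term can be folded into w):

\ln \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = w^\top x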
21
Logit to probability
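Solving the log-odds model for the probability gives the sigmoid form (a standard derivation, writing a = w^\top x):

p(y=1 \mid x) = \frac{e^{a}}{1 + e^{a}} = \frac{1}{1 + e^{-a}} = \sigma(a)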
22
Sigmoid function
• Squashing function that maps the reals to a
bounded interval.
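Two standard properties of the sigmoid \sigma(a) = 1 / (1 + e^{-a}) that are useful later:

\sigma(-a) = 1 - \sigma(a), \qquad \frac{d\sigma}{da} = \sigma(a)\,(1 - \sigma(a))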
23
Gaussian Class-conditional
• Assume the data is generated from a
Gaussian distribution for each class.
• Leads to a Bayesian formulation of logistic
regression.
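The standard result behind this slide (assuming Gaussian class-conditionals with a shared covariance Σ, as in Bishop Section 4.2): the class posterior is a sigmoid of a linear function of x,

p(C_1 \mid x) = \sigma(w^\top x + w_0), \quad
w = \Sigma^{-1}(\mu_1 - \mu_2), \quad
w_0 = -\tfrac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}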
24
Bayesian Logistic Regression
25
Maximum Likelihood Estimation
Logistic Regression
• Class-conditional Gaussian.
• Multinomial Class distribution.
• As ever, take the derivative of this
likelihood function w.r.t. the parameters.
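For the two-class case with prior π = p(C_1) and targets t_n ∈ {0, 1}, the likelihood being differentiated is typically written as (standard form; notation assumed):

p(t, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi \, \mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi) \, \mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n}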
26
Maximum Likelihood Estimation
of the prior
27
Maximum Likelihood Estimation
of the prior
28
Maximum Likelihood Estimation
of the prior
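The end result of this derivation, for reference (the standard MLE of the class prior, with N_1 points in class 1 out of N total):

\hat{\pi} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}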
29
Discriminative Training
• Take the derivatives w.r.t. the parameters.
– Be prepared to do this for homework.
• In the generative formulation, we need to
estimate the joint of t and x.
– But we get an intuitive regularization
technique.
• Discriminative Training
– Model p(t|x) directly.
30
What's the problem with
generative training?
• Formulated this way, in D dimensions, this
function has D parameters.
• In the generative case, 2D means, and
D(D+1)/2 covariance values
• Quadratic growth in the number of
parameters.
• We'd rather have linear growth.
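A quick count to make the comparison concrete (illustrative numbers): with D = 100 features, the discriminative model has about 100 weights (plus a bias), while the generative model needs 2 · 100 = 200 means and 100 · 101 / 2 = 5050 covariance entries:

D_{\text{generative}} = 2D + \frac{D(D+1)}{2} \qquad \text{vs.} \qquad D_{\text{discriminative}} = D\ (+1\ \text{bias})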
31
Discriminative Training
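A sketch of the discriminative objective built on in the following slides, with y_n = \sigma(w^\top x_n) (the standard cross-entropy form; notation assumed):

p(t \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big]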
32
Optimization
• Take the gradient in terms of w
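The gradient is usually assembled with the chain rule, writing a_n = w^\top x_n and y_n = \sigma(a_n) (standard intermediate steps; notation assumed):

\frac{\partial E}{\partial y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}, \qquad
\frac{\partial y_n}{\partial a_n} = y_n (1 - y_n), \qquad
\frac{\partial a_n}{\partial w} = x_n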
33
Optimization
34
Optimization
35
Optimization
36
Optimization: putting it together
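Multiplying the chain-rule factors above, the y_n(1 - y_n) terms cancel and the gradient collapses to the familiar form:

\nabla_w E(w) = \sum_{n=1}^{N} (y_n - t_n) \, x_n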
37
Optimization
• We know the gradient of the error function,
but how do we find its minimum?
• Setting the gradient to zero is nontrivial
• Numerical approximation
38
Gradient Descent
• Take a guess.
• Move in the direction of the negative
gradient
• Jump again.
• For a convex function, this converges to the global minimum
• Other methods include Newton-Raphson
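A minimal sketch of batch gradient descent for logistic regression, assuming NumPy; the data, step size, and iteration count are illustrative choices, not values from the lecture.

import numpy as np

def sigmoid(a):
    # Squash real-valued scores into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, iters=1000):
    # X: (N, D) design matrix, t: (N,) binary targets in {0, 1}.
    # lr and iters are illustrative defaults, not values from the lecture.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)            # predicted probabilities
        grad = X.T @ (y - t)          # gradient of the cross-entropy error
        w -= lr * grad / len(t)       # step against the gradient
    return w

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
w = fit_logistic(np.hstack([X, np.ones((200, 1))]), t)  # last column acts as a bias
print(w)

Newton-Raphson (iteratively reweighted least squares) replaces the fixed step with a Hessian-based update and typically converges in far fewer iterations.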
39
Multi-class discriminant
functions
• Can extend to multiple classes
• Other approaches include constructing K-1
binary classifiers.
• Each classifier compares one class c_k against
all the others ("not c_k")
• Computationally simpler, but not without
problems
40
Exponential Model
• Logistic Regression is a type of
exponential model.
– Linear combination of weights and features to
produce a probabilistic model.
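A standard way to write such a log-linear (exponential) model over classes c, with feature functions f_i(x, c) and weights w_i (notation assumed):

p(c \mid x) = \frac{\exp\big( \sum_i w_i f_i(x, c) \big)}{\sum_{c'} \exp\big( \sum_i w_i f_i(x, c') \big)}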
41
Problems with Binary
Discriminant functions
42
K-class discriminant
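A sketch of the usual K-class linear discriminant (standard form; the slide's own equations are not preserved in this transcript): one linear function per class, with the largest score deciding the label:

y_k(x) = w_k^\top x + w_{k0}, \qquad \text{assign } x \text{ to } \arg\max_k y_k(x)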
43
Entropy
• Measure of uncertainty, or Measure of
“Information”
• High uncertainty equals high entropy.
• Rare events are more “informative” than
common events.
44
Entropy
• How much information is received when
observing ‘x’?
• If independent, p(x,y) = p(x)p(y).
– H(x,y) = H(x) + H(y)
– The information contained in two unrelated
events is equal to their sum.
45
Entropy
• Binary coding of p(x): -log p(x)
– "How many bits does it take to encode an
event with probability p(x)?"
– How many "decimal" places? How many
binary decimal places?
• Expected value of observed information
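The expected information content is the entropy (standard definition; the base-2 logarithm measures it in bits):

H(X) = -\sum_x p(x) \log_2 p(x) = \mathbb{E}\big[ -\log_2 p(x) \big]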
46
Examples of Entropy
• Uniform distributions have the highest
entropy.
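A worked example of this claim (illustrative numbers): a fair coin has H = -2 \cdot 0.5 \log_2 0.5 = 1 bit, while a biased coin with p = 0.9 has H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.47 bits.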
47
Maximum Entropy
• Logistic Regression is also known as
Maximum Entropy.
• Entropy is concave, so maximizing it is a
convex optimization problem.
– Convergence is therefore expected.
• Constrain this optimization to enforce good
classification.
• Maximize the likelihood of the data while
keeping the predicted distribution as even as
possible.
– Include as many useful features as possible.
48
Maximum Entropy with
Constraints
• From the Klein and Manning tutorial
49
Optimization formulation
• If we let the weights represent the likelihood
of each feature's value:
For each feature i
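A standard statement of this per-feature constraint (in the spirit of the Klein and Manning tutorial; notation assumed): the model's expected count of feature i must match its empirical count,

\sum_{x, c} \tilde{p}(x) \, p(c \mid x) \, f_i(x, c) = \sum_{x, c} \tilde{p}(x, c) \, f_i(x, c)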
50
Solving MaxEnt formulation
• Convex optimization with a concave
objective function and linear constraints.
• Lagrange Multipliers
Dual representation of the
maximum likelihood estimation of
Logistic Regression
For each feature i
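Sketching the standard result: introducing one Lagrange multiplier λ_i per feature constraint and solving for the stationary point recovers the exponential form, which is why the dual of maximum entropy is maximum likelihood estimation of logistic regression,

p(c \mid x) = \frac{\exp\big( \sum_i \lambda_i f_i(x, c) \big)}{\sum_{c'} \exp\big( \sum_i \lambda_i f_i(x, c') \big)}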
51
Summary
• Bayesian Regularization
– Introduction of a prior over parameters serves to
constrain weights
• Logistic Regression
– Log odds to construct a linear model
– Formulation with Gaussian class conditionals
– Discriminative training
– Gradient descent
• Entropy
– Logistic Regression as Maximum Entropy.
52
Next Time
• Graphical Models
• Read Sections 8.1 and 8.2
53