Lecture 4: Logistic Regression
Machine Learning
CUNY Graduate Center
Today
• Bayesians v. Frequentists
• Logistic Regression
– Linear Model for Classification
1
Bayesians v. Frequentists
• What is a probability?
• Frequentists
– A probability is the long-run frequency with which an event happens
– It is approximated by the ratio of the number of observed occurrences to the
total number of trials
– Assessment is vital to selecting a model
– Point estimates are absolutely fine
• Bayesians
– A probability is a degree of belief in a proposition.
– Bayesians require that probabilities be prior beliefs updated by conditioning on data.
– The Bayesian approach “is optimal”, given a good model, a good prior,
and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior.
2
Logistic Regression
• Linear model applied to classification
• Supervised: target information is available
– Each data point xi has a corresponding target ti.
• Goal: identify a function that predicts the target ti from xi.
3
Target Variables
• In binary classification, it is convenient to represent ti as a scalar with a range of [0, 1]
– Interpretation of ti as the likelihood that xi is a member of the positive class
– Used to represent the confidence of a prediction.
• For K > 2 classes, ti is often represented as a K-element vector.
– tij represents the degree of membership in class j.
– |ti| = 1
– E.g., a 5-way classification vector: ti = [0, 0, 1, 0, 0], full membership in class 3.
4
Graphical Example of Classification
5
Decision Boundaries
6
Graphical Example of Classification
7
Classification approaches
• Generative
– Models the joint distribution of c and x
– Highest data requirements
• Discriminative
– Fewer parameters to approximate
• Discriminant Function
– May still be trained probabilistically,
but not necessarily modeling a
likelihood.
8
Treating Classification as a
Linear model
9
Relationship between
Regression and Classification
• Since we’re classifying two classes, why not set one class to ‘0’ and the
other to ‘1’, then use linear regression?
– Problem: regression outputs range over (-infinity, infinity), while class
labels are 0 or 1
• Can use a threshold, e.g.
– y >= 0.5 then class 1
– y < 0.5 then class 2
[Diagram: f(x) >= 0.5? → Happy/Good/ClassA (‘1’); otherwise → Sad/Not Good/ClassB]
10
Odds-ratio
• Rather than thresholding, we’ll relate the regression to the
class-conditional probability.
• Ratio of the odds of predicting y = 1 versus y = 0:
– If p(y=1|x) = 0.8 and p(y=0|x) = 0.2,
– odds ratio = 0.8/0.2 = 4
• Use a linear model to predict the odds rather than a class label.
11
Logit – Log odds ratio function
• LHS (the odds): 0 to infinity
• RHS (the linear model): -infinity to infinity
• Use a log function.
– It maps (0, infinity) onto (-infinity, infinity), and has the added bonus
of dissolving the division, leading to easy manipulation
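A worked form of the logit the slide describes (p denotes p(y=1|x)):

$$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \ln p - \ln(1-p),$$

which maps odds in (0, ∞) onto (-∞, ∞), matching the range of a linear model, and turns the division into a subtraction.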
12
Logistic Regression
• A linear model used to predict log-odds
ratio of two classes
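In equation form, the model the slide describes (notation w, w0 assumed for the weights and bias):

$$\ln\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \mathbf{w}^\top \mathbf{x} + w_0.$$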
13
Logit to probability
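The standard derivation behind this slide: solving the log-odds relation for the probability gives the logistic sigmoid,

$$a = \ln\frac{p}{1-p} \;\;\Rightarrow\;\; p = \frac{1}{1 + e^{-a}} = \sigma(a), \qquad a = \mathbf{w}^\top \mathbf{x} + w_0.$$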
14
Sigmoid function
• Squashing function to map the reals to a
finite domain.
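A minimal numerical illustration of the squashing behavior (hypothetical inputs, not from the slides):

import numpy as np

def sigmoid(a):
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(-10.0))  # ~0.000045: large negative inputs squash toward 0
print(sigmoid(0.0))    # 0.5: the decision boundary
print(sigmoid(10.0))   # ~0.999955: large positive inputs squash toward 1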
15
Gaussian Class-conditional
• Assume the data is generated from a Gaussian distribution for each class.
• Leads to a Bayesian formulation of logistic regression.
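A sketch of the standard result (assuming each class is Gaussian with a shared covariance matrix Σ): the posterior is a logistic sigmoid of a linear function of x,

$$p(C_1 \mid x) = \sigma\!\left(\mathbf{w}^\top \mathbf{x} + w_0\right), \qquad \mathbf{w} = \Sigma^{-1}(\mu_1 - \mu_2),$$

where w0 absorbs the means and the class priors.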
16
Bayesian Logistic Regression
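A sketch of one common Bayesian treatment (an assumption here, consistent with the summary slide’s “prior over parameters serves to constrain weights”): place a Gaussian prior p(w) = N(w | 0, α⁻¹I) on the weights, so the log posterior adds a quadratic penalty to the log likelihood,

$$\ln p(\mathbf{w} \mid \mathbf{t}) = \ln p(\mathbf{t} \mid \mathbf{w}) - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const},$$

which constrains (regularizes) the weights.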
17
Maximum Likelihood Estimation
Logistic Regression
• Class-conditional Gaussian.
• Multinomial Class distribution.
• As ever, take the derivative of this likelihood function w.r.t. the parameters.
18
Maximum Likelihood Estimation
of the prior
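The result these slides work toward (assuming a Bernoulli class prior p(C1) = π and N1 of the N training points in class 1):

$$\pi_{ML} = \frac{N_1}{N},$$

i.e., the maximum likelihood estimate of the prior is simply the empirical class fraction.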
19
Maximum Likelihood Estimation
of the prior
20
Maximum Likelihood Estimation
of the prior
21
Discriminative Training
• Take the derivatives w.r.t. the parameters.
– Be prepared for this for homework.
• In the generative formulation, we need to
estimate the joint of t and x.
– But we get an intuitive regularization
technique.
• Discriminative Training
– Model p(t|x) directly.
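A sketch of the discriminative objective (notation assumed, with yn = σ(w⊤xn)): the conditional likelihood and the corresponding cross-entropy error are

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n}, \qquad E(\mathbf{w}) = -\sum_{n=1}^{N}\left[t_n \ln y_n + (1-t_n)\ln(1-y_n)\right].$$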
22
What’s the problem with
generative training?
• Formulated this way, in D dimensions, this function has D parameters.
• In the generative case, there are 2D means and D(D+1)/2 covariance values.
• Quadratic growth in the number of parameters.
• We’d rather have linear growth.
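A quick worked count with illustrative numbers (not from the slides): for D = 100 features, the discriminative model needs roughly 100 weights (plus a bias), while the generative model needs 2·100 = 200 mean values plus 100·101/2 = 5050 covariance values, about 5250 parameters in total.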
23
Discriminative Training
24
Optimization
• Take the gradient in terms of w
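A sketch of the standard gradient for the cross-entropy error above (with yn = σ(w⊤xn)):

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\,\mathbf{x}_n.$$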
25
Optimization
26
Optimization
27
Optimization
28
Optimization: putting it together
29
Optimization
• We know the gradient of the error function, but how do we find its optimum?
• Setting the gradient to zero has no closed-form solution.
• Use numerical approximation.
30
Gradient Descent
• Take a guess.
• Move in the direction of the negative gradient.
• Jump again.
• For a convex function this will converge.
• Other methods include Newton-Raphson.
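A minimal sketch of batch gradient descent for logistic regression (synthetic data, hypothetical learning rate and iteration count):

import numpy as np

def sigmoid(a):
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, n_iters=1000):
    # Batch gradient descent on the cross-entropy error.
    # X: (N, D) design matrix (append a column of ones for a bias term).
    # t: (N,) binary targets in {0, 1}.
    N, D = X.shape
    w = np.zeros(D)               # initial guess
    for _ in range(n_iters):
        y = sigmoid(X @ w)        # current predicted probabilities
        grad = X.T @ (y - t) / N  # averaged gradient of the error
        w -= lr * grad            # step along the negative gradient
    return w

# Tiny usage example: 1-D data with a bias column; labels switch at x = 1.5.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, t)
print(sigmoid(X @ w))  # predicted probabilities increase with the first feature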
31
Multi-class discriminant
functions
• Can extend to multiple classes
• Other approaches include constructing K-1
binary classifiers.
• Each classifier compares class cn against not-cn (one-versus-rest).
• Computationally simpler, but not without
problems
32
Exponential Model
• Logistic Regression is a type of
exponential model.
– Linear combination of weights and features to
produce a probabilistic model.
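A sketch of the exponential (log-linear) form (feature functions fi and weights λi assumed):

$$p(c \mid x) = \frac{\exp\!\left(\sum_i \lambda_i f_i(x, c)\right)}{\sum_{c'} \exp\!\left(\sum_i \lambda_i f_i(x, c')\right)}.$$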
33
Problems with Binary
Discriminant functions
34
K-class discriminant
35
Entropy
• Measure of uncertainty, or measure of
“information”
• High uncertainty equals high entropy.
• Rare events are more “informative” than
common events.
36
Entropy
• How much information is received when
observing ‘x’?
• If independent, p(x,y) = p(x)p(y).
– H(x,y) = H(x) + H(y)
– The information contained in two unrelated
events is equal to their sum.
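The one-line reasoning behind that additivity (using h(x) = −log p(x), defined on the next slide):

$$h(x, y) = -\log p(x, y) = -\log\big(p(x)\,p(y)\big) = -\log p(x) - \log p(y) = h(x) + h(y).$$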
37
Entropy
• Binary coding of p(x): -log p(x)
– “How many bits does it take to represent a
value p(x)?”
– How many “decimal” places? How many
binary decimal places?
• Expected value of observed information
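In equation form (standard definitions, base-2 logs for bits):

$$h(x) = -\log_2 p(x), \qquad H[x] = \mathbb{E}\left[-\log_2 p(x)\right] = -\sum_x p(x)\log_2 p(x).$$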
38
Examples of Entropy
• Uniform distributions have higher entropy.
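A small numerical check of that claim (hypothetical distributions):

import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))  # peaked distribution: ~0.24 bits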
39
Maximum Entropy
• Logistic Regression is also known as Maximum Entropy.
• Entropy is concave, so maximizing it is a convex optimization problem.
– Convergence is expected.
• Constrain this optimization to enforce good classification.
• Maximize the likelihood of the data while keeping the predicted distribution
as even (high-entropy) as possible.
– Include as many useful features as possible.
40
Maximum Entropy with
Constraints
• From the Klein and Manning tutorial.
41
Optimization formulation
• If we let the weights represent likelihoods of value for each feature,
we obtain one constraint per feature i (sketched below).
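A sketch of the standard constrained formulation (notation assumed): maximize the entropy of the model distribution p subject to its expected feature values matching the empirical ones,

$$\max_{p}\; H(p) \quad \text{subject to} \quad \mathbb{E}_{p}[f_i] = \mathbb{E}_{\tilde{p}}[f_i] \;\text{ for each feature } i,$$

where p̃ is the empirical distribution of the training data.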
42
Solving MaxEnt formulation
• Convex optimization with a concave
objective function and linear constraints.
• Lagrange multipliers give the dual representation: the maximum likelihood
estimation of Logistic Regression.
– One multiplier (constraint) for each feature i.
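A sketch of the Lagrange-multiplier step (λi assumed as the multiplier for the constraint on feature i; the normalization constraint is omitted since it only fixes the constant of proportionality): forming the Lagrangian and setting its derivative with respect to p to zero recovers the exponential model of slide 33,

$$\Lambda(p, \lambda) = H(p) + \sum_i \lambda_i\big(\mathbb{E}_p[f_i] - \mathbb{E}_{\tilde p}[f_i]\big) \;\;\Rightarrow\;\; p(c \mid x) \propto \exp\!\Big(\sum_i \lambda_i f_i(x, c)\Big),$$

so the multipliers λi play the role of the logistic regression weights.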
43
Summary
• Bayesian Regularization
– Introduction of a prior over parameters serves to
constrain weights
• Logistic Regression
– Log odds to construct a linear model
– Formulation with Gaussian Class Conditionals
– Discriminative Training
– Gradient Descent
• Entropy
– Logistic Regression as Maximum Entropy.
44
Next Time
• Graphical Models
• Read Chapter 8.1, 8.2
45