#### Transcript Lecture 4: Logistic Regression - City University of New York

Lecture 4: Logistic Regression
Machine Learning
CUNY Graduate Center

##### Today
• Bayesians v. Frequentists
• Logistic Regression – Linear Model for Classification

##### Bayesians v. Frequentists
• What is a probability?
• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the number of total events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.
• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment.
  – If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

##### Logistic Regression
• Linear model applied to classification.
• Supervised: target information is available.
  – Each data point x_i has a corresponding target t_i.
• Goal: identify a function that maps each x_i to its target t_i.

##### Target Variables
• In binary classification, it is convenient to represent t_i as a scalar with a range of [0, 1].
  – Interpretation of t_i as the likelihood that x_i is a member of the positive class.
  – Used to represent the confidence of a prediction.
• For K > 2 classes, t_i is often represented as a K-element vector.
  – t_ij represents the degree of membership in class j.
  – |t_i| = 1.
  – E.g. a 5-way classification vector.

##### Graphical Example of Classification
(figure only)

##### Decision Boundaries
(figure only)

##### Graphical Example of Classification
(figure only)

##### Classification approaches
• Generative
  – Models the joint distribution between c and x.
  – Highest data requirements.
• Discriminative
  – Fewer parameters to approximate.
• Discriminant Function
  – May still be trained probabilistically, but not necessarily modeling a likelihood.

##### Treating Classification as a Linear model
(equations on slide; not captured in the transcript)

##### Relationship between Regression and Classification
• Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression?
  – Regression ranges from -infinity to infinity, while class labels are 0 and 1.
• Can use a threshold, e.g.
  – y >= 0.5: class 1
  – y < 0.5: class 2
(figure: f(x) >= 0.5? yes → Happy/Good/Class A; no → Sad/Not Good/Class B)

##### Odds-ratio
• Rather than thresholding, we’ll relate the regression to the class-conditional probability.
• Ratio of the odds of predicting y = 1 versus y = 0:
  – If p(y=1|x) = 0.8 and p(y=0|x) = 0.2,
  – Odds ratio = 0.8/0.2 = 4.
• Use a linear model to predict odds rather than a class label.

##### Logit – Log odds ratio function
• LHS (the odds): 0 to infinity.
• RHS (the linear model): -infinity to infinity.
• Use a log function.
  – Has the added bonus of dissolving the division, leading to easy manipulation.

##### Logistic Regression
• A linear model used to predict the log-odds ratio of two classes.

##### Logit to probability
(derivation on slide; not captured in the transcript)

##### Sigmoid function
• Squashing function to map the reals to a finite domain.

##### Gaussian Class-conditional
• Assume the data is generated from a Gaussian distribution for each class.
• Leads to a Bayesian formulation of logistic regression.

##### Bayesian Logistic Regression
(equations on slide; not captured in the transcript)

##### Maximum Likelihood Estimation: Logistic Regression
• Class-conditional Gaussian.
• Multinomial class distribution.
• As ever, take the derivative of this likelihood function w.r.t. the parameters.

##### Maximum Likelihood Estimation of the prior
(three slides of derivation; equations not captured in the transcript)
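The transcript loses the equations on these slides, so here is a minimal sketch, assuming the standard shared-covariance Gaussian class-conditional model, of how the maximum-likelihood estimates of the class prior, the per-class means, and the shared covariance produce a posterior of the sigmoid form p(t=1|x) = σ(wᵀx + w₀). The function names and the toy data are illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_gaussian_class_conditionals(X, t):
    """ML estimates for the generative model: class prior pi = N1/N,
    per-class means, and a shared covariance matrix (assumed form)."""
    N = len(t)
    X1, X0 = X[t == 1], X[t == 0]
    pi = len(X1) / N                                  # ML estimate of the prior
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)       # ML estimates of the means
    # Shared covariance: class-weighted average of the per-class ML covariances.
    S = (np.cov(X1.T, bias=True) * len(X1) +
         np.cov(X0.T, bias=True) * len(X0)) / N
    Sinv = np.linalg.inv(S)
    # With a shared covariance the posterior is exactly sigmoid(w.x + w0).
    w = Sinv @ (mu1 - mu0)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
          + np.log(pi / (1.0 - pi)))
    return w, w0

# Toy usage: two 2-D Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
t = np.array([0] * 50 + [1] * 50)
w, w0 = fit_gaussian_class_conditionals(X, t)
print(sigmoid(np.array([[0.0, 0.0], [2.0, 2.0]]) @ w + w0))  # ~ [low, high]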
##### Discriminative Training
• Take the derivatives w.r.t. w.
  – Be prepared for this for homework.
• In the generative formulation, we need to estimate the joint distribution of t and x.
  – But we get an intuitive regularization technique.
• Discriminative training: model p(t|x) directly.

##### What’s the problem with generative training?
• Formulated this way, in D dimensions, this function has D parameters.
• In the generative case: 2D means and D(D+1)/2 covariance values.
• Quadratic growth in the number of parameters.
• We’d rather have linear growth.

##### Discriminative Training
(equations on slide; not captured in the transcript)

##### Optimization
• Take the gradient in terms of w.
(three further derivation slides and “Optimization: putting it together”; equations not captured in the transcript)

##### Optimization
• We know the gradient of the error function, but how do we find the maximum value?
• Setting it to zero is nontrivial.
• Numerical approximation.

##### Gradient Descent
• Take a guess.
• Move in the direction of the negative gradient.
• Jump again.
• In a convex function this will converge.
• Other methods include Newton-Raphson.
(a worked sketch of this procedure appears at the end of this transcript)

##### Multi-class discriminant functions
• Can extend to multiple classes.
• Other approaches include constructing K-1 binary classifiers.
• Each classifier compares c_n to not-c_n.
• Computationally simpler, but not without problems.

##### Exponential Model
• Logistic Regression is a type of exponential model.
  – Linear combination of weights and features to produce a probabilistic model.

##### Problems with Binary Discriminant functions
(figure only)

##### K-class discriminant
(figure only)

##### Entropy
• Measure of uncertainty, or measure of “information”.
• High uncertainty equals high entropy.
• Rare events are more “informative” than common events.

##### Entropy
• How much information is received when observing ‘x’?
• If independent, p(x, y) = p(x)p(y).
  – H(x, y) = H(x) + H(y)
  – The information contained in two unrelated events is equal to their sum.

##### Entropy
• Binary coding of p(x): -log p(x).
  – “How many bits does it take to represent a value p(x)?”
  – How many “decimal” places? How many binary decimal places?
• Expected value of observed information.

##### Examples of Entropy
• Uniform distributions have higher entropy.

##### Maximum Entropy
• Logistic Regression is also known as Maximum Entropy.
• Entropy is convex.
  – Convergence expectation.
• Constrain this optimization to enforce good classification.
• Maximize the likelihood of the data while keeping the distribution of weights as even as possible.
  – Include as many useful features as possible.

##### Maximum Entropy with Constraints
• From the Klein and Manning tutorial.

##### Optimization formulation
• If we let the weights represent likelihoods of value for each feature, then for each feature i a constraint is imposed (equation not captured in the transcript).

##### Solving MaxEnt formulation
• Convex optimization with a concave objective function and linear constraints.
• Lagrange multipliers: the dual representation is the maximum-likelihood estimation of Logistic Regression, with a multiplier for each feature i (equations not captured in the transcript).

##### Summary
• Bayesian Regularization
  – Introduction of a prior over parameters serves to constrain weights.
• Logistic Regression
  – Log odds to construct a linear model.
  – Formulation with Gaussian class conditionals.
  – Discriminative training.
  – Gradient descent.
• Entropy
  – Logistic Regression as Maximum Entropy.

##### Next Time
• Graphical Models
• Read Chapter 8.1, 8.2
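As a closing illustration of the discriminative training and gradient-descent procedure outlined above (model p(t|x) = σ(wᵀx) directly and repeatedly step against the gradient of the cross-entropy error, which for logistic regression is Σₙ (yₙ − tₙ) xₙ), here is a minimal Python/NumPy sketch. The learning rate, step count, function names, and toy data are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_gd(X, t, lr=0.1, n_steps=1000):
    """Discriminative training by gradient descent on the cross-entropy
    error; grad E(w) = sum_n (y_n - t_n) x_n. Hyperparameters are assumed."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias feature
    w = np.zeros(X.shape[1])                   # "take a guess"
    for _ in range(n_steps):
        y = sigmoid(X @ w)                     # current predictions p(t=1|x)
        grad = X.T @ (y - t)                   # gradient of the error
        w -= lr * grad / len(t)                # move against the gradient
    return w

# Toy usage: two 1-D clusters labelled 0 and 1.
X = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
t = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic_gd(X, t)
print(w, sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w))
```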