Logistic Regression


Logistic Regression
Who intends to vote?




Scholars and politicians would both like to
understand who voted.
Imagine that you did a survey of voters after an
election and asked people whether they voted.
You want to be able to use their responses to
understand what factors influence turnout
decisions.
How would you analyze who voted? What factors do you
think influence whether or not people voted (or at least
said that they voted)?
Problem: Dichotomous Variable

No one factor influences turnout, so a multivariate analysis makes
the most sense, including independent variables like education,
interest in politics, feelings towards the candidates and perceived
closeness of the election.

The problem is that turnout is a dichotomous variable:
you either voted or you did not vote.

Linear (OLS) regression is best used for explaining
variation in continuous variables.

Multivariate analyses with dichotomous dependent variables
require logistic regression.
Logistic Regression


Like linear (OLS) regression, binomial
logistic regression models the relationship
between multiple independent variables
and a dependent variable.
For logistic regressions, though, the
dependent variable is dichotomous,
usually coded 0 and 1.
Dichotomous Variables

A variable is dichotomous when there are only
two possible options, like yes and no.


Sometimes dichotomous variables are called binary
variables since the values are often coded as one and
zero.
This is a common dependent variable in political
science because scholars are often interested in
Yes/No questions like:



Did you vote?
Do you approve of the President’s performance?
Does a country have an independent judiciary?
Dichotomous Variable Conventions


Dichotomous variables have only two values.
Typically, inaction, absence, or negative
outcomes are coded as 0.


Examples: Did not vote, does not have an
independent judiciary, did not riot, does not have
the death penalty.
An action, the occurrence of an event, the
presence of something, or a person's agreement
with a statement is coded as 1.
When to use Logistic Regression


Logistic regression is a multivariate
analysis designed to gauge how
independent variables influence the
likelihood of an outcome occurring.
This outcome could be:



an event, like a war, occurring,
a choice being made, like deciding to vote Democrat,
or an action being taken, like voting or joining a protest.
Summary: Interpretation of Logistic Results

Logistic regression coefficients cannot be interpreted like
linear regression coefficients.



You can assess whether the independent variable increases (or
decreases) the chances of the event occurring, choosing an
option, or partaking in the action like voting by looking at the sign
of the coefficient (negative or positive).
You can see if the effect of the independent variable on the
dependent variable is due to chance by looking at familiar
measures of statistical significance.
Measures similar to R-squared as well as a classification
table indicate model goodness of fit.
Did poor states vote for George W. Bush?

Let's say you want to test the hypothesis that George W.
Bush was more likely to win a plurality of votes in a state
in 2000 if people in that state tended to be poor.



Poor states include “red” states like Mississippi and Alabama.
The independent variable is the median household
income of the state (an interval variable, “medhhinc”),
measured in hundreds of dollars.
The dependent variable is whether or not a plurality of
voters in that state cast votes for Bush (“votebush”),
giving Bush those electoral college votes.

This is a dichotomous dependent variable.
Problems with Linear Regression

Linear (or Ordinary Least Squares [OLS]) regression is
inappropriate for explaining dependent variables that are
dichotomous because the regression model tries to fit a
straight line between the observations, and this line does
a very poor job of fitting the data.


Linear regression is fine for dichotomous independent “dummy”
variables.
To illustrate, I am going to use a bivariate regression that
depicts the relationship between the wealth of a state
and whether or not a plurality in that state voted for Bush
or Gore in 2000.
Example
In the example to the right, all of the
observations are at two values on
the Y axis, at one and at zero.
Observations are depicted at one if
Bush won a plurality of votes in that
state in 2000, zero if the state voted
for Gore.
The X-axis is the median income of
American states.
To simplify, I include only the ten
poorest states and the ten richest
states.
On the graph you can see in the top
left that almost all of the poorest
states (including Gore's home
state, Tennessee) voted for Bush.
In the bottom right most of the rich
states voted for Gore.
Example
What does the model predict is the value of Vote for Bush
(y-axis) if median household income ≈ $40,000 (x-axis)?
Look at where the line crosses $40,000 (solid red arrow) and
then read the value of the y-axis at that point (dashed red arrow).
The answer looks to be about 0.6-0.65… which is impossible
for a variable that is either zero or one.
Example
What does the model
predict is the value of Vote
for Bush if median
household income ≈
$50,000?
Look at where the line crosses $50,000 (solid red arrow) and
then read the value of the y-axis at that point (dashed red arrow).
The answer looks to be about 0.3… which is impossible for a
variable that is only either zero or one.
0.3 is not even very close to either zero or one!
There’s no such thing as a little pregnant!


In our example, when we fit a straight regression line to
the data, the line predicts that at most levels of median
household income, states vote a little for Bush or a little
for Gore…
This is fine if the dependent variable is the percentage of
the vote, but when the variable is dichotomous, voting a
little for Bush (or Gore) is like being a little pregnant… A
plurality in each state either votes for Gore or Bush!


Remember that all that matters in the Electoral College is if a
candidate wins a plurality – the margin is unimportant.
As a result, analysts might like to study whether or not Bush won,
not the margin of Bush's victory.
Problems with fitting the straight line




In the preceding graph, the linear regression line predicts
that for most levels of X (median household income), Y
(a plurality vote for Bush) should be between one and
zero.
This is problematic because a Y (plurality vote for Bush)
can only be either one (yes) or zero (no).
Even worse problems are not uncommon as linear
regression lines can predict values from infinity to
negative infinity (“unbounded”).
A better model would reflect the reality that only two
options – one (yes) or zero (no) – are possible.
Implications of linear regression



When linear regression is applied to a model with a
categorical dependent variable, the distance between
almost all observed points and the regression line is
quite large.
By predicting values that are both impossible and far
from the actual values, standard errors increase and we
explain little total variation.
Linear regression also assumes that the error terms are
normally distributed, an assumption that logit does not
make.
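To make the problem concrete, here is a small R sketch with simulated data (the variable names income and vote are hypothetical, not the state dataset): a straight line fit to a 0/1 outcome can predict values outside the zero-one range, while a logit model's predicted probabilities cannot.

# Simulated 0/1 outcome and one continuous predictor (hypothetical data)
set.seed(42)
income <- seq(250, 650, length.out = 50)               # e.g., median income in $100s
vote   <- rbinom(50, 1, plogis(6.7 - 0.015 * income))  # simulated 0/1 outcome

ols   <- lm(vote ~ income)                                      # straight line
logit <- glm(vote ~ income, family = binomial(link = "logit"))  # S-curve

# Predictions just outside the observed income range quickly leave the 0-1 interval
predict(ols,   newdata = data.frame(income = c(100, 800)))
# Logit predictions are probabilities and stay between 0 and 1
predict(logit, newdata = data.frame(income = c(100, 800)), type = "response")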
A solution
By fitting a “sigmoidal” or S-shaped curve to the
data (see chart on left), we can do a much better
job of minimizing the errors.
For much of the range, the black line in the middle
of the graph is very close to 1 or 0.
Note: This line has a negative slope, so it looks
more like the letter Z; a curve with a positive slope
looks more like the letter S, running from the
bottom left to the top right.
Curves are better than sticks



This S-curve does a much better job of minimizing the
errors than a straight line.
The range of values of X for which Y is predicted to be
between one and zero is minimized to a narrow range in
the middle of the distribution.
By default, most computer programs round predicted values
from the S-curve that are over 0.5 up to one (and the rest
down to zero) to gauge how many observations the model
correctly predicts.
Curve dynamics


We can interpret the data in terms of increasing the
odds, chances or probability that the choice is one (a
plurality for Bush).
Because the curve tends to flatten out as it approaches
the extreme ends of the range of X, the probabilities of
choosing 1 or 0 also tend to flatten as the values of X
increase.

In our example, poor states were more likely to vote for Bush, but
the likelihood of voting for Bush does not change much for the
three poorest states compared to the other poor states.
Logit and Probit

There are two types of similar S-curves used
to analyze these data, logit and probit.



The two tend to yield similar results.
Probit curves approach probabilities of zero or
one more quickly, so logit models tend to be
more sensitive when dealing with rare events.
Logit analyses appear more frequently in political
science largely because they can be more readily
interpreted in terms of odds and odds ratios.
Maximum Likelihood Analysis


Both logit and probit are examples of maximum
likelihood estimation techniques, which find the
parameters that maximize the likelihood of observing
the sample data if the model's assumptions are accurate.
The techniques work by fitting an equation to the
observed values and then repeatedly changing the
equation a little to find a better fit until the new equation
hardly improves on the previous model.
 These techniques can also be used to explain the
number of times an event takes place.

See King (1989), Long and Freese (2006).
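As a rough illustration of this iterative fitting, R's glm function can print the deviance after each iteration; the data below are simulated and the variable names are hypothetical.

# trace = TRUE prints the deviance after each iteration of the fitting
# algorithm; fitting stops once the improvement from one step is negligible.
set.seed(7)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))   # simulated 0/1 outcome

fit <- glm(y ~ x, family = binomial(link = "logit"),
           control = glm.control(trace = TRUE))
logLik(fit)   # the maximized log-likelihood of the final model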
Beyond Logit and Probit

In this lecture, we will only discuss modeling dichotomous
dependent variables (choices between two options), but there
are ways of analyzing more than two choices.



Ordered logit (or probit) fits multiple S-curves like steps on ordinal
dependent variables.
Multinomial logit (or probit) enables scholars to explain variation
in nominal dependent variables.
These methods bridge the gap between the logit and
probit models of dichotomous choices and OLS
regression models best used when the dependent
variable is at least ordinal with many value categories.
Running Logit


To analyze a logistic regression with a computer
program, one must specify a dependent variable and at
least one independent variable in much the same way
that they are specified in a linear regression.
Independent variables can be used just like they are in
linear regressions.


All independent variables must be ordinal or “dummies.”
Interaction terms can be used, using the same rules as linear
regression.
Dichotomous Dependent Variable

The dependent variable must be dichotomous


Ensure that options like “don’t know” or “maybe” are
declared missing or recoded.
Some computer programs require that the dependent
variable be coded 0 and 1.


Even if this is not the case, recoding the variable as 0 and 1 is
recommended.
Run a frequency table before running the regression to make
sure you did the recoding correctly!
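As a quick sketch of those checks in R (the variable voted and its codes here are hypothetical):

# Hypothetical survey variable: 1 = voted, 0 = did not vote, 8 = "don't know"
voted <- c(1, 0, 1, 8, 1, 0, 1, 1, 8, 0)

voted[voted == 8] <- NA        # declare "don't know" responses missing
table(voted, useNA = "ifany")  # frequency table to confirm the recoding worked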
Logit in STATA

In the menus, select “Statistics”, then Binary
Outcomes and finally, Logistic Regression.




A new window will open;
Select the dependent variable from the drop down
menu on the left.
Select independent variables from the drop down
menu on the left (you can just keep clicking to add
more variables to the analysis).
Click “OK”
Logit in STATA: command line
Or, at the command line, simply type:
logit depvar var1 var2



Replace depvar with the name/identifier of your
dependent variable, and var1, var2 with the
names/identifiers of your independent variables.
For example, when asking whether poorer states
(medhhinc) were more likely to be won by Bush
(votebush), the command would be:
logit votebush medhhinc
Logit in SPSS, using menus

To run logit in SPSS using the menu interface,
go to the Analyze menu, select Regression and then
Binary Logistic to open a dialogue window.


Select a dependent variable in the box at the top of the
window.
Select independent variables (labeled “covariates” in the
middle of the window).
SPSS Syntax

You can also manually enter the logit regression
syntax.
LOGISTIC REGRESSION VARIABLES depvar
/METHOD=ENTER var1 var2 var3
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).


Replace depvar with the name/locator of your
dependent variable, and var1, var2… with the
names/locators of your independent variable(s).
The confidence intervals can be omitted, and the criteria
can be adjusted as desired.
Example SPSS Syntax
For example, when asking whether poorer states
(medhhinc) were more likely to be won by Bush (votebush),
the command would be:
LOGISTIC REGRESSION VARIABLES votebush
/METHOD=ENTER medhhinc
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
R Syntax

R requires two steps to run logit:



a call to the glm function.
a “summary” command to print the output.
The glm function, with ordinal independent
variables:
glm(depvar~var1+var2,
family=binomial(link="logit"),
na.action=na.pass)

If “var2” is categorical, then type:
glm(depvar~var1+as.factor(var2),…
R Summary Command

One of the easiest ways of running glm is to give
the model a name, like “logit1”, at the start of the
glm line.
logit1<-glm(depvar~var1+var2,
family=binomial(link="logit"),
na.action=na.pass)
Then request a summary, using that name.
summary(logit1)


This is especially useful when you are trying several different versions of the
model.
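Putting the pieces together, here is a minimal, self-contained sketch with simulated data; the names votebush and medhhinc mirror the running example, and the dummy variable south is invented purely for illustration.

# Simulated data (not the real 2000 election data)
set.seed(1)
medhhinc <- rnorm(51, mean = 420, sd = 60)                  # median income in $100s
south    <- rbinom(51, 1, 0.3)                              # hypothetical dummy variable
votebush <- rbinom(51, 1, plogis(6.7 - 0.015 * medhhinc))   # simulated 0/1 outcome

logit1 <- glm(votebush ~ medhhinc,         family = binomial(link = "logit"))
logit2 <- glm(votebush ~ medhhinc + south, family = binomial(link = "logit"))

summary(logit1)   # coefficients, standard errors, z values, and significance
summary(logit2)   # naming each model makes it easy to compare versions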
Logit using Interactive Menus in R


Deducer includes logit commands.
After loading the Deducer package, click on the
“Analysis” menu and then “Logistic Model”



Put the dependent variable in the top box marked
“Outcome”
Put any ordinal independent variables in the next box,
labeled “As Numeric.”
Put any categorical independent variables in the box
labeled “As Factor.”
Logit output


The output generated by the computer will look
– at least at the bottom of the screen – a lot like
a linear regression analysis.
Look for a list of the independent variables
followed by columns of coefficients, standard
errors, Z-scores or Wald Chi-Square scores,
and a test of significance.
Logit output

After that, the output varies by program.





STATA includes 95% confidence intervals unless log odds are
specified.
SPSS includes the degrees of freedom and odds ratios - Exp(B)
SAS includes odds ratio point estimates and confidence intervals
for the Wald Chi-Square test.
R just includes stars to indicate significance levels.
What else is presented varies widely between
programs.
Did poor states vote for Bush? Logit

Let's return to the sample analysis of whether
poorer states were likely to vote for Bush.


Earlier, we saw slides that illustrated the S-curve
that fit the relationship between state median
household income and whether or not the state
voted for Bush.
In the next slides, I will present the logit
regression output made by statistical
programs of that relationship.
STATA logit output

[STATA output: the top lines trace the maximum likelihood process; the header names the dependent variable and reports model goodness of fit; the table below lists the independent variable and its coefficient, along with the significance of the coefficient.]
SPSS Logit Output - 1
SPSS presents a long set of outputs, not all of
which are relevant to most researchers, and the
output can be confusing.
After three tables presented under
the heading “Block 0: Beginning
Block” (which includes no
independent variables), look for the
label, Block 1: Method=Enter in big
black letters. Under this label are
your logit results.
The first three tables, “Omnibus
Tests of Model Coefficients,” “Model
Summary,” and “Classification
Table,” are all model goodness of fit
measures.
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                    Chi-square   df   Sig.
Step 1   Step       9.051        1    .003
         Block      9.051        1    .003
         Model      9.051        1    .003

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      60.053(a)           .163                   .219
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)
                                      Predicted
                                      votebush           Percentage
Observed                              .00      1.00      Correct
Step 1   votebush             .00     10       11        47.6
                              1.00    6        24        80.0
         Overall Percentage                              66.7
a. The cut value is .500
SPSS logit output - 2
Variables in the Equation
                        B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)   MEDHHINC   -.015    .005    7.358   1    .007   .985
            Constant    6.710   2.376   7.974   1    .005   820.640
a. Variable(s) entered on step 1: MEDHHINC.

After the three goodness of fit tables, the
“Variables in the Equation” table displays
the independent variable(s), coefficients
and significance.
SPSS logit output - 2

[The same “Variables in the Equation” table as above, annotated to highlight the independent variable's coefficient and its significance.]


In this example, there is only one independent variable,
State Median Household Income (medhhinr), measured
in $100’s.
The coefficient is boxed in red, the significance is circled
in green.
Logit Presentation

Scholars typically publish the results of a
logistic regression much like they present
the results of a linear regression analysis.

They emphasize coefficients and statistical
significance.
They normally also present measures indicating
goodness of fit (how well the model as a whole
explains variation in the dependent variable).
Presentation Example
This model explains how members of Congress voted on the
North American Free Trade Agreement (NAFTA).
From: Livingston, C. Don, & Wink, Kenneth A. (1997). “The passage of the
North American Free Trade agreement in the U.S. House of Representatives:
Presidential leadership or presidential luck?” Presidential Studies Quarterly,
27(1), pp. 52-70.

Presentation Example
Coefficients show the effect of each independent variable on the
likelihood of voting in favor of NAFTA. The published table also flags
the significance of each coefficient and reports goodness of fit measures.
From: Livingston and Wink (1997).
Interpretation


Not a linear model, so coefficients are not the
slope of a line.
As a result, logistic regression coefficients
cannot be interpreted in a simple,
straightforward fashion.

Coefficients must be transformed to get an easily
understood measure of the magnitude of the
effect of the independent variable on the
dependent variable.
At a glance conclusions

Although gauging the magnitude of the impact of
each logistic regression coefficient is difficult,
one can still readily ascertain:
1. Whether the independent variable has a negative or
positive effect on the dependent variable, by looking
at the sign of the coefficient.
2. If the coefficient is significant, we can be confident
that variation in the independent variable is associated
with variation in the dependent variable.
Negative or positive

Like linear regression, the sign on the coefficient
tells whether that variable has a positive or
negative effect on the dependent variable.


A positive coefficient means that as values on the
independent value go up, the outcome or choice
described by the dependent variable becomes more
likely.
A negative coefficient means that as values on the
independent value go up, the outcome or choice
described by the dependent variable becomes less
likely.
Statistical significance


Statistical significance means exactly the
same thing as in linear regression.
If the coefficient is significant, we can be
confident that the results are not due to
chance.

So, if the coefficient is significant, we can be
confident that variation in the independent
variable explains variation in the dependent
variable.
Interpreting coefficients (STATA):
Example from did poor states vote for Bush?


In the output presented earlier, the coefficient
(highlighted by a red square) is negative and significant
at P < 0.01 (green circle).
We can conclude that wealthier states are less likely to
vote for Bush.
Interpreting coefficients (SPSS):
Example from did poor states vote for Bush?
Variables in the Equation
                        B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)   MEDHHINC   -.015    .005    7.358   1    .007   .985
            Constant    6.710   2.376   7.974   1    .005   820.640
a. Variable(s) entered on step 1: MEDHHINC.


In the output presented earlier, the coefficient
(highlighted by a red square) is negative and significant
at P < 0.01 (green circle).
We can conclude that wealthier states are less likely to
vote for Bush.
NAFTA Vote Coefficient Example
Voting in favor of NAFTA was coded as 1.
Therefore, a positive coefficient, like the one for the
independent variable HISPANIC (the proportion of Latinos
in a Congressional district [green rectangle]), indicates
that Representatives with high proportions of Latino
constituents were more likely to vote for NAFTA when
controlling for all other variables.
Three stars indicate that this effect is statistically
significant at p < 0.01.
Source: Livingston, C. Don, & Wink, Kenneth A. (1997). “The passage of
the North American Free Trade agreement in the U.S. House of
Representatives: Presidential leadership or presidential luck?”
Presidential Studies Quarterly, 27(1), pp. 52-70.
Coefficient Summary

Coefficients in logistic regressions cannot be
interpreted readily without transformation
EXCEPT to see:

The sign


The sign tells us whether the effect of the independent
variable on the dependent variable is positive or negative
(making the outcome, event or decision more or less likely).
Statistical significance

Significance tells us whether or not we can be sure the effect
of the dependent variable on the independent variable is not
due to chance/different than zero.
Goodness of fit

Most statistical packages also provide an R-squared statistic for the model.

These are called “pseudo R2” or have a statistician's
name in front of them to differentiate them from the R2 for
linear regressions.


Examples include Hosmer-Lemeshow’s R2, Cox and Snell’s
R2, McFadden’s R2, or Nagelkerke’s R2.
All are interpreted like R2 on a scale between
zero and one.

The different R2’s do not always yield the same result
and there is no consensus over which one is best.
R-squared


Cox and Snell’s R2, and Nagelkerke’s R2
are reported by SPSS.
STATA reports “pseudo- R2,” which is
McFadden’s R2.

If spost is installed in STATA, running the command
fitstat after a logit analysis will display a half-dozen
different R-squared statistics.

SPOST is a routine written by J. Scott Long and Jeremy
Freese and can be installed for free by typing net search
spost.
Likelihood Ratio



Many statistical programs will also produce a likelihood
ratio test that compares the model to a model without
any independent variables.
This ratio can be used to calculate the statistical
significance of the model, using chi-squared, letting us
know whether we can be confident that the model’s
explanatory power is better than no model at all.
With large datasets common in politics, especially public
opinion surveys, it is unusual that this test is insignificant
if any independent variable is significant.
STATA Likelihood Ratio
Likelihood Ratio Chi-square test of model
significance.
Look at the lower line to make sure the
significance is less than 0.05, as in this model.
SPSS Likelihood
Ratio
Look under the label Block
1: Method=Enter, which
appears in big black letters.
The first of three tables,
“Omnibus Tests of Model
Coefficients,” presents the
likelihood ratio chi-square
test.
Look at the bottom right
number in the column
marked “Sig.” to make sure
the model significance is
less than 0.05, like this
model (green box)
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                    Chi-square   df   Sig.
Step 1   Step       9.051        1    .003
         Block      9.051        1    .003
         Model      9.051        1    .003

(The Model Summary and Classification Table shown earlier follow this table in the output.)
Alternative measure of goodness of fit



An alternative to the R2 calculations estimating
how well the model explains variation in the
dependent variable would be to look at how well
your model predicts the actual observations.
Most statistical programs present the
classification table, a cross-tabulation of the
predicted results against the actual observations.
Use this classification table to gauge how well
your model predicts the actual results.
Classification Table
Observed   Predicted: No   Predicted: Yes
No
Yes


In the classification table, the rows display
the actual observations.
The columns display the predicted values,
making a 2 x 2 table.

Remember: No = 0, Yes = 1.
Best Models
Observed   Predicted: No   Predicted: Yes
No         √√√
Yes                        √√√
The best models correctly predict
most observations.

Find most observations in the two boxes
marked √√√
Failures
Observed   Predicted: No   Predicted: Yes
No         √√√             X
Yes                        √√√

Observations in the other squares indicate
failures to explain those observations.


Example: in the square marked with an ‘X’, the model
predicts ‘yes’, but the observation is ‘no’.
In a model explaining turnout, the model predicts that
a person voted, but this person did not.
Example: Poor states voted for Bush?
Observed   Predicted: No   Predicted: Yes
No         10              11
Yes        6               24
In the sample analysis presented earlier, we looked to
see if median household income of a state had an effect
on the likelihood of the state voting for Bush.
This is the classification table for that model.
 The model correctly predicted 24 states out of the 30
states that voted for Bush.
Example: Poor states voted for Bush?
Observed   Predicted: No   Predicted: Yes
No         10              11
Yes        6               24
The model also correctly predicted 10 states that did not
vote for Bush.
How many states were predicted to vote for Bush but did
not?
How many states were predicted by the model to vote for
Gore but voted for Bush?
Classification Table (STATA)
After running logit (or probit), the command
estat classification
will provide a classification table.
Notice that the classification table includes the
predicted values in the rows, and the observed
“true” values in the columns.
The rest of the table calculates the percentages,
including the total correctly classified
(24 + 10 = 34; 34/51 ≈ 66.67%).
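R does not print a classification table by default, but a rough equivalent can be built by hand using the usual 0.5 cut value. This is a minimal sketch using simulated data (the names mirror the running example but are hypothetical).

# Build a classification table by hand from a fitted logit model
set.seed(3)
medhhinc <- rnorm(51, mean = 420, sd = 60)                  # hypothetical income, $100s
votebush <- rbinom(51, 1, plogis(6.7 - 0.015 * medhhinc))   # simulated 0/1 outcome
logit1   <- glm(votebush ~ medhhinc, family = binomial(link = "logit"))

predicted <- as.numeric(fitted(logit1) > 0.5)   # predicted 1 if probability > .5 (the usual cut)
table(observed = votebush, predicted)           # rows = observed, columns = predicted
mean(votebush == predicted)                     # proportion correctly classified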
SPSS Classification Table
SPSS presents two classification tables after
running a binary logistic regression.
The first, labeled “Block 0:
Beginning Block” is the naïve
model, as only the modal value
is predicted.
The second appears after a
label, “Block 1: Method=Enter”
below the “Model Summary”
containing the R-squared
measures, but above the
“Variables in the Equation”
table. This is the correct
classification table to analyze.
Classification Table (SPSS)

Classification Table(a)
                                      Predicted
                                      votebush           Percentage
Observed                              .00      1.00      Correct
Step 1   votebush             .00     10       11        47.6
                              1.00    6        24        80.0
         Overall Percentage                              66.7
a. The cut value is .500
SPSS presents the classification table between the Model Summary
(which includes two R-squared measures) and the “Variables in the
Equation” table.
The observed values are in the rows, the predicted values are in the
columns.
Compare the Overall Percentage to the Overall Percentage of the
naïve model (in the output, look under the heading, “Block 0:
Beginning Block”).
How good is the model?


The percentage of observations correctly
predicted gives one measure of how well
your model explains the data.
The best way to assess whether this is a high
percentage is to compare the percentage of
observations you correctly explain to the
naïve model.
Naïve Model

The naïve model is the percentage of observations you
would correctly explain simply by guessing the mode for
every case or observation.


Sometimes called the null model.
For example, let's turn to the turnout model.



If the percentage of people in the survey who said that they voted
was about 65%, then the mode is “voted”.
If you guessed “voted” for every single person in the survey, you
would be correct about 65% of the time.
Your model should do better than the simple, naïve model of
guessing “voted” that is accurate 65% of the time.
Comparing to the naïve model



Some software packages automatically display
the frequency table of the dependent variable.
For others, you will need to run that frequency table
separately.
Then compare the percentage explained in the
classification table to the frequency of the modal
response.
How much better than being naïve?

You can estimate how much better your model is over
the naïve model by dividing the difference between your
model and the naïve model by the percent left
unexplained by the naïve model.



To illustrate, consider the hypothetical turnout analysis where
65% of all respondents said they voted (therefore, 1 - 0.65 = 35%
did not vote).
Let's say your model explained 80% of the observations.
(80% − 65%) = 15%; 15% / 35% ≈ 42.9% improvement over the naïve
model.
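The same arithmetic as a quick sketch in R, using the hypothetical turnout percentages from this slide:

naive <- 0.65   # proportion correct when always guessing the mode ("voted")
model <- 0.80   # proportion correct for the fitted model

(model - naive) / (1 - naive)   # = 0.15 / 0.35 ≈ 0.429, a 42.9% improvement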
NAFTA Goodness
of Fit
NAFTA passed with 56% of the
vote in the House of
Representatives, so the naïve
model correctly predicts 56%
of the votes.
Rather than presenting a
classification table, the authors
simply indicate that Model 1
(left) explains 76.5% of the
votes (red oval), a 46.1%
improvement over the naïve
model (red underline).
The likelihood ratio chi-square is
significant (green square).
From:
C. Don Livingston, & Wink, Kenneth A. (1997). “The
passage of the North American Free Trade agreement
in the U.S. House of Representatives: Presidential
leadership or presidential luck?“ Presidential Studies
Quarterly, 27(1), pp. 52-70.
Examining failures

If there are many observations that the
model fails to predict, researchers might
revise their model while focusing on these
failures.

The authors of the logit analysis of the vote on
NAFTA in the U.S. House of Representatives,
Livingston and Wink, published the list of
Representatives who voted contrary to their
model’s predictions and discussed why these
legislators did not conform to expectations.
Example: Poor states voted for Bush?
Observed   Predicted: No   Predicted: Yes
No         10              11
Yes        6               24
Where is our current model weakest? Is there more
room for improvement predicting “No” responses or
“Yes” responses?
Notice that more than half of all of those states that did NOT
vote for Bush (top row) were wrongly predicted by the
model to vote for Bush.
Example: Poor states voted for Bush?
Observed   Predicted: No   Predicted: Yes
No         10              11
Yes        6               24
Remember that the current model only includes one independent
variable: state median household income.
What independent variables do you think might be added to the
model to better explain whether or not a state voted for Bush in
2000?

What variable might explain why some relatively poor states voted for
Gore instead of for Bush, as predicted by their state income?
Classification Table Summary

Classification tables compare the actual
observations with what the model predicts.


This is a useful tool for gauging how well the
model explains the variation in the dependent
variable.
Can be very helpful when building a model by
suggesting what the current model is not doing a
good job of explaining.
Some cautions regarding classification
tables


Keep in mind that some events are very rare or very
common and your model may be very good without
explaining more than the naïve model.
Remember that the computer automatically rounds
the values predicted by the S-curve to determine whether the
response or outcome is predicted to be zero or one.
The default cut value for rounding is usually 0.5, but it
can be adjusted. This can be useful if the researcher
thinks that the model is good but the model predicts
that almost all observations are zero OR one; however, it can
also be used to inflate the perceived strength of the
model.
Summary


Logistic regressions are used to analyze
dichotomous dependent variables.
Since logistic regression works by imposing an S-curve
rather than a straight line on the data, the
coefficients cannot be interpreted like
coefficients in linear regression.


But look to see if the coefficients are positive/negative
and significant.
Goodness of fit can be measured by
classification tables.
Odds

Odds are calculated by dividing the number of
occurrences of one outcome by the number of
occurrences of the other outcome or by
converting from the probabilities using the
formula: Odds=Probability/(1-Probability)

Example: In 2000, Bush won 30 states and Gore won 20
states (plus DC, but that makes the math harder).



The odds of Bush winning a state are 30/20, usually described as
odds of 3 to 2, or odds of 1.5.
The probability of winning is 0.60, so the odds are 1.5 = 0.6/(1 − 0.6).
The change in the odds is the odds ratio.
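The same calculation as a quick R sketch, using the counts from this slide:

bush_states <- 30
gore_states <- 20

bush_states / gore_states                          # odds: 30/20 = 1.5 (3 to 2)
prob <- bush_states / (bush_states + gore_states)  # probability of a Bush win: 0.60
prob / (1 - prob)                                  # the same odds recovered: 1.5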
Coefficients into Odds Ratios



Logit coefficients can be converted into odds ratios by
raising the natural log base e to the power of the
coefficient.
Odds ratios estimate the effect of changing the
independent variable by one unit (much like linear
regression coefficients) on the likelihood of the
choice/outcome/event occurring when controlling for all
other variables.
In our example, it would be the effect of another $100 of
median household income on the likelihood of a state
voting for Bush.
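In R this conversion is one line on a fitted model; logit1 here stands for any fitted logit object, like the one from the earlier sketch.

# logit1 is assumed to be a fitted logit model, e.g.
# logit1 <- glm(votebush ~ medhhinc, family = binomial(link = "logit"))
coef(logit1)        # raw logit coefficients (log odds)
exp(coef(logit1))   # the same effects expressed as odds ratios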
Odds Ratios

All variables that have a positive effect on the likelihood
of the dependent variable have odds ratios greater than 1.

Interpreted as “the odds are ___ times larger”.

Variables with a negative effect have odds ratios lower
than 1.

Interpreted as “the odds are ___ times smaller”.

Odds ratios are multiplicative, so an odds ratio of 2
means that the odds have doubled. An effect of that
magnitude for a variable with a negative relationship with
the dependent variable is an odds ratio of 0.5, since 0.5 = 1/2.
Odds Ratios in STATA

To ask STATA to display the odds ratios instead
of the coefficients, simply add the option or to
the logit command after a comma.
logit depvar var1 var2, or

Our example command becomes:
logit votebush medhhinc, or
 OR replace logit with the command
logistic
Odds Ratios in STATA - 2

When the odds ratios are displayed,




The model goodness of fit measures are unchanged
because the analysis is exactly the same.
As a result, the Z-scores and the significance are also
unchanged.
However, the 95% confidence intervals and
standard errors will be different, reflecting
confidence interval/SE’s of the odds ratios rather
than the coefficients.
There is no constant reported.
Odds Ratios in SPSS
Variables in the Equation
                        B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)   MEDHHINC   -.015    .005    7.358   1    .007   .985
            Constant    6.710   2.376   7.974   1    .005   820.640
a. Variable(s) entered on step 1: MEDHHINC.
In SPSS, the default presentation of logit results includes
the odds ratios, in the column on the far right of the
“Variables in the Equation” table, alongside the coefficient
(blue circle).
Interpreting Odds Ratios

In the example used earlier, the odds ratio for median household
income was 0.985. This can be interpreted as:

“For every $100 [one-unit] increase in median household income, the odds of
Bush winning the state decrease by a factor of 0.985, holding all other variables
constant.”

OR take the inverse of the effect (1/0.985 ≈ 1.015):

“For every $100 [one-unit] increase in median household income, the odds of
Bush losing the state are 1.015 times greater, holding all other variables
constant.”

OR convert to a percentage by subtracting 1 from the odds ratio and
then multiplying by 100:

“For every $100 [one-unit] increase in median household income, the odds of
Bush winning the state decrease by 1.5%, holding all other variables constant.”
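A small sketch of those three readings of the odds ratio in R:

or <- 0.985        # odds ratio for a $100 (one-unit) increase in median household income

or                 # the odds of a Bush win are multiplied by 0.985
1 / or             # inverse: the odds of a Bush loss are about 1.015 times greater
(or - 1) * 100     # as a percentage: the odds of a Bush win change by about -1.5%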
Problem Interpreting Odds Ratios

Odds ratios are sensitive to:



the scale of the variable, and
what other independent variables are included in the model.
Odds ratios assume that all other independent
variables are held constant, but do not specify those
values.

If the underlying odds (before the change indicated by the odds
ratio) are really small, then a large odds ratio may not represent a
substantively large change in the probability of the outcome or
event.


For example, if the odds are 1/100, doubling the odds (to 2/100) only
increases the probability to 0.02 from 0.01!
Conversely, if the odds are really large, a large odds ratio does
not actually change the probability of the outcome very much.
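To see why the same odds ratio can imply very different changes in probability, here is a quick sketch converting odds back to probabilities (probability = odds / (1 + odds)):

odds_to_prob <- function(odds) odds / (1 + odds)

odds_to_prob(1 / 100)   # about 0.0099
odds_to_prob(2 / 100)   # doubling tiny odds only raises the probability to about 0.0196
odds_to_prob(10)        # about 0.909
odds_to_prob(20)        # doubling large odds only raises the probability to about 0.952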
Summary


Logistic regressions are used to analyze dichotomous
dependent variables.
Since logistic regression works by imposing an S-curve
rather than a straight line on the data, the coefficients
cannot be interpreted like coefficients in linear
regression.

But look to see if the coefficients are positive/negative and
significant.
View the odds ratio for an easily understood measure of the
magnitude of the effect of a one-unit change in the independent
variable on the dependent variable.
Goodness of fit can be measured by classification tables.