SW388R7 Data Analysis & Computers II Logistic Regression – Basic Relationships Slide 1 Logistic Regression Describing Relationships Classification Accuracy Sample Problems.


SW388R7
Data Analysis &
Computers II
Logistic Regression – Basic Relationships
Slide 1
Logistic Regression
Describing Relationships
Classification Accuracy
Sample Problems
Logistic regression
Slide 2



Logistic regression is used to analyze relationships between a
dichotomous dependent variable and metric or dichotomous
independent variables. (SPSS now supports Multinomial Logistic
Regression that can be used with more than two groups, but our
focus here is on binary logistic regression for two groups.)
Logistic regression combines the independent variables to
estimate the probability that a particular event will occur, i.e.
a subject will be a member of one of the groups defined by the
dichotomous dependent variable. In SPSS, the model is always
constructed to predict the group with the higher numeric code. If
responses are coded 1 for Yes and 2 for No, SPSS will predict
membership in the No category. If responses are coded 1 for No
and 2 for Yes, SPSS will predict membership in the Yes category.
We will refer to the predicted event for a particular analysis as
the modeled event.
This will create some awkward wording in our problems. Our
only option for changing this is to recode the variable.
What logistic regression predicts
Slide 3



The variate or value produced by logistic regression is a
probability value between 0.0 and 1.0.
If the probability for group membership in the modeled
category is above some cut point (the default is 0.50), the
subject is predicted to be a member of the modeled group. If
the probability is below the cut point, the subject is predicted
to be a member of the other group.
For any given case, logistic regression computes the probability
that a case with a particular set of values for the independent
variables is a member of the modeled category.
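The cut point rule above can be sketched in Python. This is a minimal illustration, not SPSS's own code: the intercept, coefficients, and case values are hypothetical numbers chosen for the example, and the function names are invented here.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Probability that one case belongs to the modeled category."""
    # Linear combination of the independent variables (the logit).
    logit = intercept + sum(b * x for b, x in zip(coefficients, values))
    # The logistic function maps the logit to a value between 0.0 and 1.0.
    return 1.0 / (1.0 + math.exp(-logit))

def predicted_group(probability, cut=0.50):
    """Apply the cut point: above it, predict the modeled group."""
    return "modeled group" if probability > cut else "other group"

# Hypothetical model with two independent variables (assumed values).
p = predicted_probability(-1.0, [0.5, 0.8], [2.0, 1.0])
print(round(p, 3), predicted_group(p))
```

Changing the `cut` argument shows how a different cut point can move a case from one predicted group to the other.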
Level of measurement requirements
Slide 4




Logistic regression analysis requires that the dependent variable
be dichotomous.
Logistic regression analysis requires that the independent
variables be metric or dichotomous.
If an independent variable is nominal level and not
dichotomous, the logistic regression procedure in SPSS has an
option to dummy code the variable for you.
If an independent variable is ordinal, we will attach the usual
caution.
Assumptions
Slide 5


Logistic regression does not make any assumptions of normality,
linearity, or homogeneity of variance for the independent
variables.
Because it does not impose these requirements, it is preferred
to discriminant analysis when the data do not satisfy these
assumptions.
Sample size requirements
Slide 6


The minimum number of cases per independent variable is 10,
using a guideline provided by Hosmer and Lemeshow, authors of
Applied Logistic Regression, one of the main resources for
Logistic Regression.
For preferred case-to-variable ratios, we will use 20 to 1 for
simultaneous and hierarchical logistic regression and 50 to 1 for
stepwise logistic regression.
Methods for including variables
Slide 7


There are three methods available for including variables in the
regression equation:
- The simultaneous method, in which all independents are
included at the same time.
- The hierarchical method, in which control variables are
entered in the analysis before the predictors whose effects
we are primarily concerned with.
- The stepwise method (forward conditional in SPSS), in which
variables are selected in the order in which they maximize
the statistically significant contribution to the model.
For all methods, the contribution to the model is measured by the
model chi-square, a statistical measure of the fit between the
dependent and independent variables, like R².
Computational method
Slide 8




Multiple regression uses the least-squares method to find the
coefficients for the independent variables in the regression
equation, i.e. it computes coefficients that minimize the
sum of squared residuals across all cases.
Logistic regression uses maximum-likelihood estimation to
compute the coefficients for the logistic regression equation.
This method attempts to find the coefficients that make the
observed breakdown of cases on the dependent variable most likely.
The overall measure of how well the model fits is given by the
-2 log likelihood value, which is similar to the residual or error
sum of squares value for multiple regression. A model that fits the
data well will have a small -2 log likelihood value. A perfect model
would have a -2 log likelihood value of zero.
Maximum-likelihood estimation is an iterative procedure that
successively works to get closer and closer to the correct
answer. When SPSS reports the "iterations," it is telling us how
many cycles it took to get the answer.
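SPSS reports the fit measure as -2 log likelihood. The sketch below shows how that quantity could be computed from observed group membership and predicted probabilities. The four cases and their probabilities are made up for illustration, and the function name is an assumption, not an SPSS routine.

```python
import math

def neg2_log_likelihood(observed, probabilities):
    """-2 log likelihood for 0/1 outcomes and predicted probabilities."""
    total = 0.0
    for y, p in zip(observed, probabilities):
        # Each case contributes the log of the probability the model
        # assigned to the group the case actually belongs to.
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total

observed = [1, 1, 0, 0]  # hypothetical cases (1 = modeled event)
poor_fit = neg2_log_likelihood(observed, [0.5, 0.5, 0.5, 0.5])
good_fit = neg2_log_likelihood(observed, [0.9, 0.8, 0.2, 0.1])
print(round(poor_fit, 3), round(good_fit, 3))
```

The better the predicted probabilities match the observed groups, the closer the value gets to zero.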
Overall test of relationship
Slide 9



The overall test of relationship among the independent
variables and groups defined by the dependent is based on the
reduction in the -2 log likelihood value from a model which does
not contain any independent variables to the model that contains
the independent variables.
This difference in -2 log likelihood follows a chi-square
distribution, and is referred to as the model chi-square.
The significance test for the model chi-square is our statistical
evidence of the presence of a relationship between the
dependent variable and the combination of the independent
variables.
Beginning logistic regression model
Slide 10

The SPSS output for logistic
regression begins with output
for a model that contains no
independent variables. It labels
this output "Block 0: Beginning
Block" and (if we request the
optional iteration history)
reports the initial -2 Log
Likelihood, which we can think
of as a measure of the error
associated with trying to predict the
dependent variable without
using any information from the
independent variables.
The initial -2 log
likelihood is 213.891.
We will not routinely request
the iteration history because
it does not usually yield us
additional useful
information.
Ending logistic regression model
Slide 11



After the independent
variables are entered in
Block 1, the -2 log likelihood
is again measured (180.267
in this problem).
The difference between
ending and beginning -2 log
likelihood is the model
chi-square that is used in the
test of overall statistical
significance.
In this problem, the model
chi-square is 33.625 (213.891
– 180.267), which is
statistically significant at
p<0.001.
Model chi-square is
33.625, significant at
p < 0.001.
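The subtraction behind the model chi-square can be checked with a few lines of Python. The two -2 log likelihood values are the ones reported on the slides. Subtracting the rounded values gives 33.624, which differs from the reported 33.625 in the last digit, presumably because SPSS works from unrounded values.

```python
# -2 log likelihood values as reported in the SPSS output on the slides.
initial_neg2ll = 213.891  # Block 0: no independent variables
final_neg2ll = 180.267    # Block 1: independent variables entered

# The model chi-square is the reduction in -2 log likelihood.
model_chi_square = initial_neg2ll - final_neg2ll
print(round(model_chi_square, 3))
```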
Relationship of Individual Independent
Variables and Dependent Variable
Slide 12



There is a test of significance for the relationship between an
individual independent variable and the dependent variable: a
significance test of the Wald statistic.
The individual coefficients represent the change in the log odds of being
a member of the modeled category. Because individual coefficients are
expressed in log units, they are not directly interpretable. However, if
the b coefficient is used as the power to which the base of the natural
logarithm (2.71828) is raised, the result represents the multiplicative
change in the odds of the modeled event associated with a one-unit change
in the independent variable.
If a coefficient is positive, its transformed log value will be greater
than one, meaning that the modeled event is more likely to occur. If a
coefficient is negative, its transformed log value will be less than one,
and the odds of the event occurring decrease. A coefficient of zero (0)
has a transformed log value of 1.0, meaning that this coefficient does
not change the odds of the event one way or the other.
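The three cases above (positive, negative, and zero coefficients) can be illustrated with a short Python sketch. The coefficient values passed in are hypothetical and the function names are invented for this example.

```python
import math

def odds_multiplier(b):
    """Raise e to the power b: the factor by which the odds change."""
    return math.exp(b)

def describe(b):
    """Direction of the change in the odds of the modeled event."""
    multiplier = odds_multiplier(b)
    if multiplier > 1.0:
        return "odds increase"
    if multiplier < 1.0:
        return "odds decrease"
    return "odds unchanged"

# Hypothetical coefficients: positive, zero, and negative.
for b in (0.7, 0.0, -0.7):
    print(b, round(odds_multiplier(b), 3), describe(b))
```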
Numerical problems
Slide 13




The maximum likelihood method used to calculate logistic
regression is an iterative fitting process that attempts to cycle
through repetitions to find an answer.
Sometimes, the method will break down and not be able to
converge or find an answer.
Sometimes the method will produce wildly improbable results,
reporting that a one-unit change in an independent variable
increases the odds of the modeled event by hundreds of
thousands or millions. These implausible results can be
produced by multicollinearity, categories of predictors having
no cases or zero cells, and complete separation whereby the
two groups are perfectly separated by the scores on one or
more independent variables.
The clue that we have numerical problems and should not
interpret the results is a standard error larger than 2.0 for
one or more independent variables.
Strength of logistic regression relationship
Slide 14


While logistic regression does compute correlation measures to
estimate the strength of the relationship (pseudo R square
measures, such as Nagelkerke's R²), these correlation measures
do not really tell us much about the accuracy or errors
associated with the model.
A more useful measure to assess the utility of a logistic
regression model is classification accuracy, which compares
predicted group membership based on the logistic model to the
actual, known group membership, which is the value for the
dependent variable.
Evaluating usefulness for logistic models
Slide 15



The benchmark that we will use to characterize a logistic
regression model as useful is a 25% improvement over the rate
of accuracy achievable by chance alone.
Even if the independent variables had no relationship to the
groups defined by the dependent variable, we would still
expect to be correct in our predictions of group membership
some percentage of the time. This is referred to as by chance
accuracy.
The estimate of by chance accuracy that we will use is the
proportional by chance accuracy rate, computed by squaring the
proportion of cases in each group and summing the results.
Comparing accuracy rates
Slide 16

To characterize our model as useful, we compare the overall
percentage accuracy rate produced by SPSS at the last step in which
variables are entered to 25% more than the proportional by chance
accuracy. (Note: SPSS does not compute a cross-validated accuracy
rate for logistic regression.)
Classification Table (a)

                                        Predicted
                                        EXPECT U.S. IN WORLD
                                        WAR IN 10 YEARS       Percentage
        Observed                        YES       NO          Correct
Step 1  EXPECT U.S. IN WORLD    YES      20       34             37.0
        WAR IN 10 YEARS         NO       10       72             87.8
        Overall Percentage                                       67.6
a. The cut value is .500

SPSS reports the overall accuracy rate in
the Overall Percentage row of the table
"Classification Table." The overall accuracy
rate computed by SPSS was 67.6%.
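The percentages in the classification table can be reproduced from the four cell counts. The sketch below is one way to do that in Python. The nested dictionary layout and the function name are choices made for this example, not something SPSS exposes.

```python
def classification_accuracy(table):
    """table maps each observed group to its predicted-group counts."""
    per_group = {}
    correct = 0
    total = 0
    for observed, predicted_counts in table.items():
        group_total = sum(predicted_counts.values())
        hits = predicted_counts[observed]  # predicted group matches observed
        per_group[observed] = 100.0 * hits / group_total
        correct += hits
        total += group_total
    return per_group, 100.0 * correct / total

# Cell counts from the Step 1 classification table on the slide.
table = {
    "YES": {"YES": 20, "NO": 34},
    "NO": {"YES": 10, "NO": 72},
}
per_group, overall = classification_accuracy(table)
print(per_group, overall)
```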
Computing by chance accuracy
Slide 17
The number of cases in each group is found in the Classification Table at
Step 0 (before any independent variables are included). The proportion
of cases in the largest group is equal to the overall percentage (60.3%).
Classification Table (a,b)

                                        Predicted
                                        EXPECT U.S. IN WORLD
                                        WAR IN 10 YEARS       Percentage
        Observed                        YES       NO          Correct
Step 0  EXPECT U.S. IN WORLD    YES       0       54               .0
        WAR IN 10 YEARS         NO        0       82            100.0
        Overall Percentage                                       60.3
a. Constant is included in the model.
b. The cut value is .500

The proportional by chance accuracy rate was computed by
calculating the proportion of cases for each group based on the
number of cases in each group in the classification table at Step
0, and then squaring and summing the proportion of cases in
each group (0.397² + 0.603² = 0.521).
The proportional by chance accuracy criterion is 65.1% (1.25 x
52.1% = 65.1%).
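The by chance computation can be sketched in Python using the group counts from the Step 0 table (54 and 82) and the 67.6% accuracy rate from the Step 1 table. The function name is invented for this illustration.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of the squared proportion of cases in each group."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

# Group sizes from the Step 0 classification table on the slide.
by_chance = proportional_by_chance_accuracy([54, 82])

# Benchmark: 25% improvement over the by chance accuracy rate.
criterion = 1.25 * by_chance

# Overall accuracy reported by SPSS at Step 1, as a proportion.
model_accuracy = 0.676
is_useful = model_accuracy > criterion
print(round(by_chance, 3), round(criterion, 3), is_useful)
```

With these counts the criterion works out to roughly 65%, so the 67.6% accuracy rate satisfies it.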
Problem 1
Slide 18
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "sex" [sex], and "liberal or conservative political views" [polviews]
were useful predictors for distinguishing between groups based on responses to "seen x-rated
movie in last year" [xmovie]. These predictors differentiate survey respondents who have not
seen an x-rated movie from survey respondents who have seen an x-rated movie.
Survey respondents who were older were more likely to have not seen an x-rated movie. A one
unit increase in age increased the odds that survey respondents have not seen an x-rated movie
by 3.9%. Survey respondents who were female were approximately six and three quarters times
more likely to have not seen an x-rated movie. Survey respondents who were more conservative
were more likely to have not seen an x-rated movie. A one unit increase in liberal or
conservative political views increased the odds that survey respondents have not seen an
x-rated movie by approximately one and a quarter times.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
Slide 19
For these problems, we will assume that there is no problem
with missing data, outliers, or influential cases, and that the
validation analysis will confirm the generalizability of the
results.
In this problem, we are told to use 0.05 as alpha for the
logistic regression.
Dissecting problem 1 - 2
Slide 20
The variables listed first in the problem statement are the
independent variables (IVs): "age" [age], "sex" [sex], and "liberal
or conservative political views" [polviews].
The variable used to define groups is the dependent variable
(DV): "seen x-rated movie in last year" [xmovie].
When a problem states that a list of independent variables can
distinguish among groups and does not identify control variables or
an order of importance for the variables, we do a logistic
regression entering all of the variables simultaneously.
Dissecting problem 1 - 3
Slide 21
SPSS logistic regression models the relationship by computing
the changes in the likelihood of falling in the category of the
dependent variable which had the highest numerical code.
The responses to seeing an x-rated movie were coded:
1 = Yes and 2 = No.
The SPSS output will model the changes in the likelihood of
not seeing an x-rated movie because the code for No is 2.
The statements of the specific relationships between
independent variables and the dependent variable are all phrased
in terms of impact on not seeing an x-rated movie.
Dissecting problem 1 - 4
Slide 22
The specific relationships for the independent variables
listed in the problem indicate the direction of the relationship,
increasing or decreasing the likelihood of falling in the modeled
group, and the amount of change in the odds associated with a
one-unit change in the independent variable.
In order for the logistic regression question to be
true, the overall relationship must be statistically
significant, there must be no evidence of a flawed
numerical analysis, the classification accuracy
rate must be substantially better than could be
obtained by chance alone, and each significant
relationship must be interpreted correctly.
LEVEL OF MEASUREMENT - 1
Slide 23
Logistic regression requires that the dependent
variable be non-metric and the independent
variables be metric or dichotomous.
"Seen x-rated movie in last year" [xmovie] is a
dichotomous variable, which satisfies the level of
measurement requirement.
It contains two categories: survey respondents
who had seen an x-rated movie in the last year
and survey respondents who had not seen an
x-rated movie in the last year.
LEVEL OF MEASUREMENT - 2
Slide 24
"Age" [age] is an interval level variable, which satisfies
the level of measurement requirements for logistic
regression analysis.
"Sex" [sex] is a dichotomous or dummy-coded nominal
variable which may be included in logistic regression.
"Liberal or conservative political views" [polviews]
is an ordinal level variable. If we follow the
convention of treating ordinal level variables as
metric variables, the level of measurement
requirement for logistic regression
analysis is satisfied. Since some data
analysts do not agree with this
convention, a note of caution should be
included in our interpretation.
Request simultaneous logistic regression
Slide 25
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
Selecting the dependent variable
Slide 26
First, highlight the
dependent variable
xmovie in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent text box.
Selecting the independent variables
Slide 27
Move the independent
variables listed in the
problem to the
Covariates list box.
Specifying the method for including variables
Slide 28
SPSS provides us with two methods for including
variables: to enter all of the independent variables
at one time, and a stepwise method for selecting
variables using a statistical test to determine the
order in which variables are included.
SPSS also supports the specification of "Blocks" of
variables for testing hierarchical models.
Since the problem
states that there is a
relationship without
requesting the best
predictors, we specify
Enter as the method for
including variables.
Completing the logistic regression request
Slide 29
Click on the OK
button to request
the output for the
logistic regression.
The logistic procedure supports the selection of subsets of
cases, automatic recoding of nominal variables, saving
diagnostic statistics like standardized residuals and Cook's
distance, and options for additional statistics. However,
none of these are needed for this analysis.
Sample size – ratio of cases to variables
Slide 30
Case Processing Summary

Unweighted Cases (a)                           N     Percent
Selected Cases     Included in Analysis      177       65.6
                   Missing Cases              93       34.4
                   Total                     270      100.0
Unselected Cases                               0         .0
Total                                        270      100.0
a. If weight is in effect, see classification table for the total
number of cases.

The minimum ratio of valid cases to
independent variables for logistic regression is
10 to 1, with a preferred ratio of 20 to 1. In this
analysis, there are 177 valid cases and 3
independent variables. The ratio of cases to
independent variables is 59.0 to 1, which
satisfies the minimum requirement. In addition,
the ratio of 59.0 to 1 satisfies the preferred ratio
of 20 to 1.
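The sample size check is simple arithmetic, sketched here in Python with the 177 valid cases and 3 independent variables from this analysis. The 10 to 1 and 20 to 1 thresholds are the ones stated on the slides. The function name is an assumption made for the example.

```python
def cases_per_variable(valid_cases, independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / independent_variables

# Values from the Case Processing Summary and the problem statement.
ratio = cases_per_variable(177, 3)

MINIMUM_RATIO = 10.0    # minimum guideline stated on the slides
PREFERRED_RATIO = 20.0  # preferred ratio for simultaneous entry

meets_minimum = ratio >= MINIMUM_RATIO
meets_preferred = ratio >= PREFERRED_RATIO
print(ratio, meets_minimum, meets_preferred)
```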
Slide 31
OVERALL RELATIONSHIP BETWEEN
INDEPENDENT AND DEPENDENT VARIABLES
Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step       39.668       3    .000
         Block      39.668       3    .000
         Model      39.668       3    .000

The presence of a relationship between the dependent
variable and combination of independent variables is
based on the statistical significance of the model
chi-square at step 1 after the independent variables have
been added to the analysis.
In this analysis, the probability of the model chi-square
(39.668) was <0.001, less than or equal to the level of
significance of 0.05. The null hypothesis that there is
no difference between the model with only a constant
and the model with independent variables was rejected.
The existence of a relationship between the
independent variables and the dependent variable was
supported.
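The probability that SPSS prints as .000 can be approximated by hand. The sketch below uses a closed-form expression for the upper tail of a chi-square distribution that holds only for 3 degrees of freedom, which is an assumption specific to this table. For other degrees of freedom a general routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

def chi_square_p_value_df3(chi_square):
    """Upper-tail probability of a chi-square statistic with 3 df.

    Closed form valid only for df = 3:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    """
    x = chi_square
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Model chi-square from the Omnibus Tests table (df = 3).
p_value = chi_square_p_value_df3(39.668)

ALPHA = 0.05  # level of significance given in the problem
reject_null = p_value <= ALPHA
print(p_value, reject_null)
```

As a sanity check, the same function applied to 7.815, the familiar 0.05 critical value for 3 degrees of freedom, returns approximately 0.05.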
NUMERICAL PROBLEMS
Slide 32
Variables in the Equation

                         B      S.E.     Wald   df   Sig.   Exp(B)
Step 1 (a)  AGE         .038    .014    7.629    1   .006    1.039
            SEX        1.901    .410   21.452    1   .000    6.689
            POLVIEWS    .306    .135    5.110    1   .024    1.358
            Constant  -4.590   1.045   19.302    1   .000     .010
a. Variable(s) entered on step 1: AGE, SEX, POLVIEWS.

Multicollinearity in the logistic regression solution is detected
by examining the standard errors for the b coefficients. A
standard error larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent variables,
zero cells for a dummy-coded independent variable because
all of the subjects have the same value for the variable, and
'complete separation' whereby the two groups in the
dependent event variable can be perfectly separated by
scores on one of the independent variables. Analyses that
indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a
standard error larger than 2.0. (The check for standard
errors larger than 2.0 does not include the standard error for
the Constant.)
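The standard error screen described above can be written as a short Python check. The standard errors are the ones in the Variables in the Equation table. The 2.0 threshold and the exclusion of the Constant follow the rule stated on the slide. The function name is invented for this sketch.

```python
def flag_numerical_problems(standard_errors, threshold=2.0,
                            exclude=("Constant",)):
    """Return the variables whose standard error exceeds the threshold."""
    return [name for name, se in standard_errors.items()
            if name not in exclude and se > threshold]

# Standard errors from the Variables in the Equation table.
standard_errors = {
    "AGE": 0.014,
    "SEX": 0.410,
    "POLVIEWS": 0.135,
    "Constant": 1.045,
}
flagged = flag_numerical_problems(standard_errors)
print(flagged)  # an empty list means no evidence of numerical problems
```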
Slide 33
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 1
The probability of the Wald statistic for the variable age
was 0.006, less than or equal to the level of significance
of 0.05. The null hypothesis that the b coefficient for age
was equal to zero was rejected. This supports the
relationship that "survey respondents who were older
were more likely to have not seen an x-rated movie."
Variables in the Equation

                         B      S.E.     Wald   df   Sig.   Exp(B)
Step 1 (a)  AGE         .038    .014    7.629    1   .006    1.039
            SEX        1.901    .410   21.452    1   .000    6.689
            POLVIEWS    .306    .135    5.110    1   .024    1.358
            Constant  -4.590   1.045   19.302    1   .000     .010
a. Variable(s) entered on step 1: AGE, SEX, POLVIEWS.

The value of Exp(B) was 1.039 which implies that a
one unit increase in age increased the odds that
survey respondents have not seen an x-rated movie
by 3.9%. This confirms the statement of the amount
of change in the likelihood of belonging to the modeled
group of the dependent variable associated with a one
unit change in the independent variable, age.
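The Exp(B) column can be approximately recomputed from the B column, as in this Python sketch. The B values are the rounded figures from the table, so the results can differ from SPSS's printed Exp(B) in the last digits (most visibly for SEX), since SPSS presumably exponentiates the unrounded coefficients.

```python
import math

# B coefficients as printed (rounded) in the Variables in the Equation table.
b_coefficients = {"AGE": 0.038, "SEX": 1.901, "POLVIEWS": 0.306}

# Exp(B): e raised to the power of each coefficient.
exp_b = {name: math.exp(b) for name, b in b_coefficients.items()}

# Percent change in the odds for a one unit increase in age.
percent_change_age = (exp_b["AGE"] - 1.0) * 100.0

for name, value in exp_b.items():
    print(name, round(value, 3))
print(round(percent_change_age, 1))
```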
Slide 34
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 2
The probability of the Wald statistic for the variable sex
was <0.001, less than or equal to the level of
significance of 0.05. The null hypothesis that the b
coefficient for sex was equal to zero was rejected. This
supports the relationship that "survey respondents who
were female were approximately six and three quarters
times more likely to have not seen an x-rated movie."
Variables in the Equation

                         B      S.E.     Wald   df   Sig.   Exp(B)
Step 1 (a)  AGE         .038    .014    7.629    1   .006    1.039
            SEX        1.901    .410   21.452    1   .000    6.689
            POLVIEWS    .306    .135    5.110    1   .024    1.358
            Constant  -4.590   1.045   19.302    1   .000     .010
a. Variable(s) entered on step 1: AGE, SEX, POLVIEWS.

The value of Exp(B) was 6.689 which implies
that a one unit increase in sex increased the
odds that survey respondents have not seen
an x-rated movie by approximately six and
three quarters times.
Slide 35
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 3
The probability of the Wald statistic for the variable liberal or
conservative political views was 0.024, less than or equal to the
level of significance of 0.05. The null hypothesis that the b
coefficient for liberal or conservative political views was equal to
zero was rejected. This supports the relationship that "survey
respondents who were more conservative were more likely to have
not seen an x-rated movie." Liberal or conservative political views is
an ordinal variable that is coded so that higher numeric values are
associated with survey respondents who were more conservative.
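The significance value that SPSS reports for a Wald statistic can be reproduced by hand. The sketch below assumes the Wald statistic is referred to a chi-square distribution with 1 degree of freedom, as the df column of the table indicates; the function name `wald_p_value` is a hypothetical helper.

```python
import math

def wald_p_value(wald: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    # For df = 1, P(X > w) = erfc(sqrt(w / 2)).
    return math.erfc(math.sqrt(wald / 2.0))

alpha = 0.05            # level of significance given in the problem
wald_polviews = 5.110   # Wald statistic for POLVIEWS from the table

p = wald_p_value(wald_polviews)
print(f"p = {p:.3f}; reject the null hypothesis: {p <= alpha}")
```

The same function applied to the Wald statistics for age (7.629) and sex (21.452) should give values consistent with the Sig. column.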
Variables in the Equation
Step 1(a)     B        S.E.     Wald      df    Sig.    Exp(B)
AGE           .038     .014     7.629     1     .006    1.039
SEX           1.901    .410     21.452    1     .000    6.689
POLVIEWS      .306     .135     5.110     1     .024    1.358
Constant      -4.590   1.045    19.302    1     .000    .010
a. Variable(s) entered on step 1: AGE, SEX, POLVIEWS.
The value of Exp(B) was 1.358, which implies that a one-unit increase in liberal or
conservative political views multiplied the odds that survey respondents have not seen an
x-rated movie by 1.358 (a 35.8% increase), which the problem statement describes as
approximately one and a quarter times.
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
by chance accuracy rate
Slide 36
The independent variables could be characterized as useful
predictors distinguishing survey respondents who have not
seen an x-rated movie from survey respondents who have
seen an x-rated movie if the classification accuracy rate was
substantially higher than the accuracy attainable by chance
alone. Operationally, the classification accuracy rate should
be at least 25% higher than the proportional by chance
accuracy rate (that is, at least 1.25 times that rate).
Classification Table(a,b), Step 0
Observed SEEN X-RATED MOVIE IN LAST YEAR    Predicted YES    Predicted NO    Percentage Correct
YES                                         0                45              .0
NO                                          0                132             100.0
Overall Percentage                                                           74.6
a. Constant is included in the model.
b. The cut value is .500
The proportional by chance accuracy rate was computed by first
calculating the proportion of cases for each group based on the number
of cases in each group in the classification table at Step 0. The
proportion in the "YES" group is 45/177 = 0.254. The proportion in the
"NO" group is 132/177 = 0.746.
Then, we square and sum the proportion of cases in each group (0.254²
+ 0.746² = 0.621). 0.621 is the proportional by chance accuracy rate.
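The arithmetic above can be sketched in Python. The function name `proportional_by_chance_accuracy` is a hypothetical helper; the group counts are the Step 0 counts from the classification table, and the 1.25 multiplier is the operational criterion stated on this slide.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of squared group proportions, computed from the number of cases in each group."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

counts = [45, 132]  # YES and NO groups at Step 0
chance_rate = proportional_by_chance_accuracy(counts)
criterion = 1.25 * chance_rate  # accuracy must be at least 25% higher than chance

print(f"Proportional by chance accuracy rate = {chance_rate:.3f}")
print(f"Classification accuracy criterion    = {criterion:.3f}")
```

Working from the raw counts rather than the rounded proportions avoids small rounding differences in the criterion.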
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
criteria for classification accuracy
Slide 37
Classification Table(a), Step 1
Observed SEEN X-RATED MOVIE IN LAST YEAR    Predicted YES    Predicted NO    Percentage Correct
YES                                         19               26              42.2
NO                                          9                123             93.2
Overall Percentage                                                           80.2
a. The cut value is .500
The accuracy rate computed by SPSS was 80.2%,
which was greater than or equal to the
proportional by chance accuracy criterion of
77.6% (1.25 x 62.1% = 77.6%).
The criterion for classification accuracy is
satisfied.
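The accuracy figures in the Step 1 table follow directly from the four cell counts. The sketch below is illustrative only; the dictionary layout and variable names are assumptions, and the criterion value of 0.776 is taken from the preceding slide.

```python
# Cell counts keyed by (observed, predicted), copied from the Step 1 classification table.
table = {
    ("YES", "YES"): 19, ("YES", "NO"): 26,
    ("NO", "YES"): 9,   ("NO", "NO"): 123,
}

total = sum(table.values())
correct = table[("YES", "YES")] + table[("NO", "NO")]
accuracy = correct / total

# Percentage correct within each observed group.
yes_correct = table[("YES", "YES")] / (table[("YES", "YES")] + table[("YES", "NO")])
no_correct = table[("NO", "NO")] / (table[("NO", "YES")] + table[("NO", "NO")])

criterion = 0.776  # 1.25 x proportional by chance accuracy rate (0.621)

print(f"Overall accuracy = {accuracy:.1%}")
print(f"YES group = {yes_correct:.1%}, NO group = {no_correct:.1%}")
print(f"Criterion satisfied: {accuracy >= criterion}")
```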
Answering the question in problem 1 - 1
Slide 38
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "sex" [sex], and "liberal or conservative political views"
[polviews] were useful predictors for distinguishing between groups based on responses to
"seen x-rated movie in last year" [xmovie]. These predictors differentiate survey
respondents who have not seen an x-rated movie from survey respondents who have seen
an x-rated movie.
Survey respondents who were older were more likely to have not seen an x-rated movie. A one
unit increase in age increased the odds that survey respondents have not seen an x-rated movie
by 3.9%. Survey respondents who were female were approximately six and three quarters times
more likely to have not seen an x-rated movie. Survey respondents who were more conservative
were more likely to have not seen an x-rated movie. A one unit increase in liberal or
conservative political views increased the odds that survey respondents have not seen an
x-rated movie by approximately one and a quarter times.

We found a statistically significant overall relationship between the combination of
independent variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy surpassed the proportional by chance accuracy
criterion, supporting the utility of the model.
Answering the question in problem 1 - 2
Slide 39
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "sex" [sex], and "liberal or conservative political views"
[polviews] were useful predictors for distinguishing between groups based on responses to
"seen x-rated movie in last year" [xmovie]. These predictors differentiate survey
respondents who have not seen an x-rated movie from survey respondents who have seen
an x-rated movie.
Survey respondents who were older were more likely to have not seen an x-rated movie. A
one unit increase in age increased the odds that survey respondents have not seen an
x-rated movie by 3.9%. Survey respondents who were female were approximately six and
three quarters times more likely to have not seen an x-rated movie. Survey respondents
who were more conservative were more likely to have not seen an x-rated movie. A one
unit increase in liberal or conservative political views increased the odds that survey
respondents have not seen an x-rated movie by approximately one and a quarter times.

We verified that each statement about the relationship between an independent variable and
the dependent variable was correct in both direction of the relationship and the change in
likelihood associated with a one-unit change of the independent variable.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution.
A caution is added because of the inclusion of ordinal level variables.
Problem 2
Slide 40
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should be
made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 1
Slide 41
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases,
and that the validation analysis will confirm the generalizability of the results. Use a level
of significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should be
made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

For these problems, we will assume that there is no problem with missing data, outliers, or
influential cases, and that the validation analysis will confirm the generalizability of the
results.

In this problem, we are told to use 0.05 as alpha for the logistic regression.
Dissecting problem 2 - 2
Slide 42
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made
legal" [grass], the variable "general happiness" [happy] and "confidence in the executive
branch of the federal government" [confed] were useful predictors for distinguishing
between groups based on responses to "should marijuana be made legal" [grass]. These
predictors differentiate survey respondents who have been less supportive that the use of
marijuana should be made legal from survey respondents who have been more supportive that
the use of marijuana should be made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

The variables listed first in the problem statement are the independent variables (IVs):
"sex" [sex], "general happiness" [happy], and "confidence in the executive branch of the
federal government" [confed]. Sex is a control variable, and general happiness and
confidence in the executive branch are predictors.

The variable used to define groups is the dependent variable (DV): "should marijuana be
made legal" [grass].

When a problem identifies control variables, we do a hierarchical logistic regression,
entering the variables in SPSS in blocks.
Dissecting problem 2 - 3
Slide 43
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should
be made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

SPSS logistic regression models the relationship by computing the changes in the likelihood
of falling in the category of the dependent variable which had the highest numerical code.
The responses to "should marijuana be made legal" [grass] were coded:
1 = Legal and 2 = Not Legal.
The SPSS output will model the changes in the likelihood of being less supportive of
legalizing marijuana because 2 corresponds to not legalizing marijuana.

The statements of the specific relationships between independent variables and the
dependent variable are all phrased in terms of impact on being less supportive of
legalizing marijuana.
Dissecting problem 2 - 4
Slide 44
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should be
made legal.
Survey respondents who were less happy overall were less likely to have been less
supportive that the use of marijuana should be made legal. A one unit increase in general
happiness decreased the odds that survey respondents have been less supportive that the
use of marijuana should be made legal by 66.9%. Survey respondents who had less
confidence in the executive branch of the federal government were less likely to have been
less supportive that the use of marijuana should be made legal. A one unit increase in
confidence in the executive branch of the federal government decreased the odds that
survey respondents have been less supportive that the use of marijuana should be made
legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The specific relationships for the independent variables listed in the problem indicate the
direction of the relationship, increasing or decreasing the likelihood of falling in the
modeled group, and the amount of change in the odds associated with a one-unit change in the
independent variable.

In order for the logistic regression question to be true, the relationship between the
predictors and the dependent variable must be statistically significant after entering the
control variables in a previous stage, there must be no evidence of a flawed numerical
analysis, the classification accuracy rate must be substantially better than could be
obtained by chance alone, and each significant relationship must be interpreted correctly.
LEVEL OF MEASUREMENT - 1
Slide 45
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should
be made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Logistic regression analysis requires that the dependent variable be dichotomous and the
independent variables be metric or dichotomous. "Should marijuana be made legal" [grass] is
a dichotomous variable, which satisfies the level of measurement requirement for the
dependent variable.
It contains two categories:
• survey respondents who have been less supportive that the use of marijuana should be made legal
• survey respondents who have been more supportive that the use of marijuana should be made legal
LEVEL OF MEASUREMENT - 2
Slide 46
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made
legal" [grass], the variable "general happiness" [happy] and "confidence in the executive
branch of the federal government" [confed] were useful predictors for distinguishing
between groups based on responses to "should marijuana be made legal" [grass]. These
predictors differentiate survey respondents who have been less supportive that the use of
marijuana should be made legal from survey respondents who have been more supportive that
the use of marijuana should be made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in
logistic regression analysis.

"General happiness" [happy] and "confidence in the executive branch of the federal
government" [confed] are ordinal level variables. If we follow the convention of treating
ordinal level variables as metric variables, the level of measurement requirement for
logistic regression analysis is satisfied. Since some data analysts do not agree with this
convention, a note of caution should be included in our interpretation.
Request hierarchical logistic regression
Slide 47
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
Selecting the dependent variable
Slide 48
First, highlight the
dependent variable
grass in the list of
variables.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent text box.
Selecting the control independent variables
Slide 49
First, move the control
independent variable,
sex, listed in the
problem to the
Covariates list box.
Second, click on the
Next button to add the
new block that will
contain the predictors.
Adding the predictor independent variables
Slide 50
First, move the
predictors to the
Covariates list box.
Specifying the method for including variables
Slide 51
In our hierarchical
regression, we will specify
that all of the variables in
each block be entered
simultaneously when the
block is entered.
Completing the logistic regression request
Slide 52
Click on the OK
button to request
the output for the
logistic regression.
The logistic procedure supports the selection of subsets of
cases, automatic recoding of nominal variables, saving
diagnostic statistics like standardized residuals and Cook's
distance, and options for additional statistics. However,
none of these are needed for this analysis.
Sample size – ratio of cases to variables
Slide 53
Case Processing Summary
Unweighted Cases(a)                        N      Percent
Selected Cases    Included in Analysis     163    60.4
                  Missing Cases            107    39.6
                  Total                    270    100.0
Unselected Cases                           0      .0
Total                                      270    100.0
a. If weight is in effect, see classification table for the total
number of cases.
The minimum ratio of valid cases to
independent variables for logistic regression is
10 to 1, with a preferred ratio of 20 to 1. In this
analysis, there are 163 valid cases and 3
independent variables. The ratio of cases to
independent variables is 54.33 to 1, which
satisfies the minimum requirement. In addition,
the ratio of 54.33 to 1 satisfies the preferred
ratio of 20 to 1.
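The sample size check is simple arithmetic, sketched here in Python. The variable names are illustrative; the counts and the 10 to 1 and 20 to 1 thresholds are the ones stated on this slide.

```python
valid_cases = 163            # cases included in the analysis
independent_variables = 3    # sex, happy, confed

MINIMUM_RATIO = 10    # minimum cases per independent variable
PREFERRED_RATIO = 20  # preferred cases per independent variable

ratio = valid_cases / independent_variables

print(f"Ratio of cases to independent variables = {ratio:.2f} to 1")
print(f"Meets minimum requirement: {ratio >= MINIMUM_RATIO}")
print(f"Meets preferred ratio:     {ratio >= PREFERRED_RATIO}")
```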
Slide 54
OVERALL RELATIONSHIP BETWEEN
INDEPENDENT AND DEPENDENT VARIABLES
In a hierarchical logistic regression, the presence of a relationship
between the dependent variable and combination of independent
variables entered after the control variables have been included is
based on the statistical significance of the block chi-square for
the second block of variables in which the predictor independent
variables are included.
In this analysis, the probability of the block chi-square (17.467)
was <0.001, less than or equal to the level of significance of
0.05. The null hypothesis that there is no difference between the
model with only a constant and the control variables versus the
model with the predictor independent variables was rejected. The
contribution of the relationship between the predictor
independent variables and the dependent variable was supported.
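The reported probability for the block chi-square can be approximated by hand. The sketch below assumes the block has 2 degrees of freedom, one for each predictor entered in the second block (happy and confed); the slide does not state the degrees of freedom, so treat that as an assumption. For 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2), which keeps the example free of external libraries.

```python
import math

def chi_square_p_value_df2(chi_square: float) -> float:
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    # Closed form valid only for df = 2: P(X > x) = exp(-x / 2).
    return math.exp(-chi_square / 2.0)

alpha = 0.05
block_chi_square = 17.467  # block chi-square for the second block of variables

p = chi_square_p_value_df2(block_chi_square)
print(f"p = {p:.6f}; reject the null hypothesis: {p <= alpha}")
```

Under that assumption the probability is well below 0.001, which agrees with the value reported on the slide.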
NUMERICAL PROBLEMS
Slide 55
Variables in the Equation
Step 1(a)     B        S.E.     Wald      df    Sig.    Exp(B)
SEX           .154     .351     .194      1     .660    1.167
HAPPY         -1.104   .354     9.739     1     .002    .331
CONFED        -.559    .270     4.290     1     .038    .572
Constant      3.721    1.066    12.195    1     .000    41.308
a. Variable(s) entered on step 1: HAPPY, CONFED.
Multicollinearity in the logistic regression solution is detected
by examining the standard errors for the b coefficients. A
standard error larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent variables,
zero cells for a dummy-coded independent variable because
all of the subjects have the same value for the variable, and
'complete separation' whereby the two groups in the
dependent event variable can be perfectly separated by
scores on one of the independent variables. Analyses that
indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a
standard error larger than 2.0. (The check for standard
errors larger than 2.0 does not include the standard error for
the Constant.)
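The standard error screen described above amounts to a simple threshold check. This is a minimal sketch using the S.E. values from the table; the dictionary and variable names are illustrative, and the Constant is left out as the slide directs.

```python
SE_THRESHOLD = 2.0  # standard errors above this suggest numerical problems

# Standard errors of the b coefficients (Constant deliberately excluded).
standard_errors = {"SEX": 0.351, "HAPPY": 0.354, "CONFED": 0.270}

flagged = [name for name, se in standard_errors.items() if se > SE_THRESHOLD]

if flagged:
    print("Possible numerical problems for:", ", ".join(flagged))
else:
    print("No standard error exceeds 2.0; no evidence of numerical problems.")
```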
Slide 56
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 1
The probability of the Wald statistic for the variable general
happiness was 0.002, less than or equal to the level of
significance of 0.05. The null hypothesis that the b coefficient
for general happiness was equal to zero was rejected. This
supports the relationship that "survey respondents who were
less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal." General
happiness is an ordinal variable that is coded so that lower
numeric values are associated with survey respondents who
were happier overall.
Variables in the Equation
Step 1(a)     B        S.E.     Wald      df    Sig.    Exp(B)
SEX           .154     .351     .194      1     .660    1.167
HAPPY         -1.104   .354     9.739     1     .002    .331
CONFED        -.559    .270     4.290     1     .038    .572
Constant      3.721    1.066    12.195    1     .000    41.308
a. Variable(s) entered on step 1: HAPPY, CONFED.
The value of Exp(B) was 0.331, which implies that a
one-unit increase in the general happiness score (a move
toward being less happy, given the coding) decreased the
odds that survey respondents have been less
supportive that the use of marijuana should be made
legal by 66.9%.
Slide 57
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 2
The probability of the Wald statistic for the variable confidence in the
executive branch of the federal government was 0.038, less than or
equal to the level of significance of 0.05. The null hypothesis that the
b coefficient for confidence in the executive branch of the federal
government was equal to zero was rejected. This supports the
relationship that "survey respondents who had less confidence in the
executive branch of the federal government were less likely to have
been less supportive that the use of marijuana should be made legal."
Confidence in the executive branch of the federal government is an
ordinal variable that is coded so that lower numeric values are
associated with survey respondents who had more confidence in the
executive branch of the federal government.
Variables in the Equation
Step 1(a)     B        S.E.     Wald      df    Sig.    Exp(B)
SEX           .154     .351     .194      1     .660    1.167
HAPPY         -1.104   .354     9.739     1     .002    .331
CONFED        -.559    .270     4.290     1     .038    .572
Constant      3.721    1.066    12.195    1     .000    41.308
a. Variable(s) entered on step 1: HAPPY, CONFED.
The value of Exp(B) was 0.572, which implies
that a one-unit increase in the score for confidence in the
executive branch of the federal government (a move toward
less confidence, given the coding)
decreased the odds that survey respondents
have been less supportive that the use of
marijuana should be made legal by 42.8%.
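When Exp(B) is below 1, the percentage decrease in the odds is the distance of Exp(B) from 1. The sketch below applies that to the two predictors in this problem; the names are illustrative and the inputs are the Exp(B) values from the table.

```python
# Exp(B) values for the two predictors, copied from the Variables in the Equation table.
exp_b = {"HAPPY": 0.331, "CONFED": 0.572}

# A negative percentage means the odds decrease for each one-unit increase in the score.
percent_change = {name: (value - 1.0) * 100.0 for name, value in exp_b.items()}

for name, change in percent_change.items():
    direction = "decreased" if change < 0 else "increased"
    print(f"{name}: odds {direction} by {abs(change):.1f}% per one-unit increase")
```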
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
by chance accuracy rate
Slide 58
The independent variables could be characterized as useful
predictors distinguishing survey respondents who have been
less supportive that the use of marijuana should be made
legal from survey respondents who have been more
supportive that the use of marijuana should be made legal if
the classification accuracy rate was substantially higher than
the accuracy attainable by chance alone. Operationally, the
classification accuracy rate should be at least 25% higher
than the proportional by chance accuracy rate.
Classification Table(a,b), Step 0
Observed SHOULD MARIJUANA BE MADE LEGAL    Predicted LEGAL    Predicted NOT LEGAL    Percentage Correct
LEGAL                                      0                  57                     .0
NOT LEGAL                                  0                  106                    100.0
Overall Percentage                                                                   65.0
a. Constant is included in the model.
b. The cut value is .500
The proportional by chance accuracy rate was computed by
calculating the proportion of cases for each group based on
the number of cases in each group in the classification table
at Step 0 (57/163 = 0.350 and 106/163 = 0.650), and then
squaring and summing the proportion of
cases in each group (0.350² + 0.650² = 0.545).
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
criteria for classification accuracy
Slide 59
Classification Table(a), Step 1
Observed SHOULD MARIJUANA BE MADE LEGAL    Predicted LEGAL    Predicted NOT LEGAL    Percentage Correct
LEGAL                                      18                 39                     31.6
NOT LEGAL                                  13                 93                     87.7
Overall Percentage                                                                   68.1
a. The cut value is .500
The accuracy rate computed by SPSS was 68.1%,
which, at one decimal place, equals the
proportional by chance accuracy criterion of
68.1% (1.25 x 54.5% = 68.1%), and so counts as
greater than or equal to it.
The criterion for classification accuracy is
satisfied.
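Because the accuracy rate and the criterion agree to one decimal place here, it is worth seeing how close the comparison is. The sketch below recomputes both figures from the cell counts in the Step 0 and Step 1 tables; the variable names are illustrative. With unrounded values the accuracy rate falls marginally below the criterion, so the comparison holds only after rounding to one decimal place, and this is best read as a borderline case.

```python
# Group sizes from the Step 0 table: LEGAL and NOT LEGAL.
group_counts = [57, 106]
total = sum(group_counts)

chance_rate = sum((n / total) ** 2 for n in group_counts)
criterion = 1.25 * chance_rate

# Correctly classified cases from the Step 1 table: 18 LEGAL and 93 NOT LEGAL.
accuracy = (18 + 93) / total

print(f"Accuracy  = {accuracy:.4f} ({accuracy:.1%})")
print(f"Criterion = {criterion:.4f} ({criterion:.1%})")
print(f"Satisfied after rounding to one decimal: "
      f"{round(accuracy * 100, 1) >= round(criterion * 100, 1)}")
print(f"Satisfied with unrounded values: {accuracy >= criterion}")
```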
Answering the question in problem 2 - 1
Slide 60
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should be
made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

We found a statistically significant overall relationship between the predictor independent
variables and the dependent variable.

There was no evidence of numerical problems in the solution.

Moreover, the classification accuracy met the proportional by chance accuracy criterion,
supporting the utility of the model.
Answering the question in problem 2 - 2
Slide 61
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
We verified that each statement about the relationship between an independent variable and
the dependent variable was correct in both direction of the relationship and the change in
likelihood associated with a one-unit change of the independent variable.
After controlling for the effect of the variable "sex" [sex] on "should marijuana be made legal"
[grass], the variable "general happiness" [happy] and "confidence in the executive branch of the
federal government" [confed] were useful predictors for distinguishing between groups based on
responses to "should marijuana be made legal" [grass]. These predictors differentiate survey
respondents who have been less supportive that the use of marijuana should be made legal
from survey respondents who have been more supportive that the use of marijuana should be
made legal.
Survey respondents who were less happy overall were less likely to have been less supportive
that the use of marijuana should be made legal. A one unit increase in general happiness
decreased the odds that survey respondents have been less supportive that the use of
marijuana should be made legal by 66.9%. Survey respondents who had less confidence in the
executive branch of the federal government were less likely to have been less supportive that
the use of marijuana should be made legal. A one unit increase in confidence in the executive
branch of the federal government decreased the odds that survey respondents have been less
supportive that the use of marijuana should be made legal by 42.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The answer to the question is true
with caution.
A caution is added because of the
inclusion of ordinal level variables.
Problem 3
Slide 62
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting Problem 3 - 1
Slide 63
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases,
and that the validation analysis will confirm the generalizability of the results. Use a level
of significance of 0.05 for evaluating the statistical relationship.
For these problems, we will assume that there is no problem with missing data, outliers, or
influential cases, and that the validation analysis will confirm the generalizability of the
results.
In this problem, we are told to use 0.05 as alpha for the logistic regression.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting Problem 3 - 2
Slide 64
The variables listed first in the problem statement are the independent variables (IVs): "highest
academic degree" [degree], "total family income" [income98], and "satisfaction with financial
situation" [satfin].
The variable used to define groups is the dependent variable (DV): "expect u.s. in world war
in 10 years" [uswary].
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income"
[income98], and "satisfaction with financial situation" [satfin], the most useful predictor for
distinguishing between groups based on responses to "expect u.s. in world war in 10 years"
[uswary] was "total family income" [income98]. These predictors differentiate survey
respondents who have been less positive that the United States would fight in another world
war within the next ten years from survey respondents who have been more positive that the
United States would fight in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
Since the problem identifies the most useful or important predictor, we do a stepwise logistic
regression.
Dissecting Problem 3 - 3
Slide 65
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who
have been less positive that the United States would fight in another world war within the
next ten years from survey respondents who have been more positive that the United
States would fight in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
SPSS logistic regression models the relationship by computing the changes in the likelihood of
falling in the category of the dependent variable which had the highest numerical code.
The responses to "expect u.s. in world war in 10 years" were coded: 1 = Yes and 2 = No.
The SPSS output will model the changes in the likelihood of being
less positive that the United States would fight in another world
war within the next ten years.
Dissecting Problem 3 - 4
Slide 66
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
The statements of the specific relationships between independent variables and the dependent
variable are all phrased in terms of impact on being less positive that the United States would
fight in another world war within the next ten years.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less
positive that the United States would fight in another world war within the next ten years
was total family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years.
A one unit increase in total family income increased the odds that survey respondents have
been less positive that the United States would fight in another world war within the next
ten years by 10.0%.
Dissecting Problem 3 - 5
Slide 67
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
The specific relationships for the independent variables listed in the problem indicate the
direction of the relationship, increasing or decreasing the likelihood of falling in the modeled
group, and the amount of change associated with a one-unit change in the independent variable.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years.
A one unit increase in total family income increased the odds that survey respondents have
been less positive that the United States would fight in another world war within the next
ten years by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
In order for the logistic regression question to be true, the
relationship between the predictors selected for inclusion and the
dependent variable must be statistically significant, there must be
no evidence of a flawed numerical analysis, the classification
accuracy rate must be substantially better than could be obtained
by chance alone, and the order of entry and each significant
relationship must be interpreted correctly.
LEVEL OF MEASUREMENT - 1
Slide 68
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Logistic regression analysis requires that the dependent variable
be dichotomous and the independent variables be metric or
dichotomous. "Expect u.s. in world war in 10 years" [uswary] is a
dichotomous variable, which satisfies the level of measurement
requirement for the dependent variable.
It contains two categories:
  survey respondents who have been less positive that the United
  States would fight in another world war within the next ten years
  survey respondents who have been more positive that the United
  States would fight in another world war within the next ten years.
LEVEL OF MEASUREMENT - 2
Slide 69
"Highest academic degree" [degree], "total family
income" [income98], and "satisfaction with financial
situation" [satfin] are ordinal level variables. If we
follow the convention of treating ordinal level
variables as metric variables, the level of
measurement requirement for logistic regression
analysis is satisfied. Since some data analysts do not
agree with this convention, a note of caution should
be included in our interpretation.
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
Request stepwise logistic regression
Slide 70
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
Selecting the dependent variable
Slide 71
First, highlight the
dependent variable
uswary in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent text box.
Adding the independent variables
Slide 72
First, move the
predictors to the
Covariates list box.
Specifying the method for including variables
Slide 73
In our stepwise logistic
regression, we specify
the Forward
Conditional method for
adding variables.
Adding options to the output
Slide 74
To add a summary of steps
at the end of the analysis
and specifications for
stepwise method, click on
the Options… button.
Including a summary of steps
Slide 75
To obtain a summary of the steps
on which variables were added or
removed from the analysis, mark
the option button At last step in
the Display panel.
Specifications for stepwise method
Slide 76
Click on the
Continue button to
close the dialog box.
We can change the criteria for adding and
removing variables from the analysis by
changing the probability for entry and
removal. We will use the default level of
significance of 0.05 for entry and 0.10 for
removal.
Completing the logistic regression request
Slide 77
Click on the OK
button to request
the output for the
logistic regression.
Sample size – ratio of cases to variables
Slide 78
Case Processing Summary
Unweighted Cases(a)                            N    Percent
Selected Cases     Included in Analysis      136       50.4
                   Missing Cases             134       49.6
                   Total                     270      100.0
Unselected Cases                               0         .0
Total                                        270      100.0
a. If weight is in effect, see classification table for the total
number of cases.
The minimum ratio of valid cases to
independent variables for stepwise logistic
regression is 10 to 1, with a preferred ratio of
50 to 1. In this analysis, there are 136 valid
cases and 3 independent variables. The ratio of
cases to independent variables is 45.33 to 1,
which satisfies the minimum requirement.
However, the ratio of 45.33 to 1 does not satisfy
the preferred ratio of 50 to 1. A caution should
be added to the interpretation of the analysis
and a split sample validation should be
conducted.
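The sample size check on this slide is simple arithmetic, and a minimal Python sketch of it may help when repeating it for other problems. The function name and its default thresholds (10 to 1 minimum, 50 to 1 preferred for stepwise) are illustrative choices taken from the slide text, not part of SPSS.

```python
def case_ratio_check(valid_cases, n_predictors, minimum=10.0, preferred=50.0):
    """Return the cases-to-predictors ratio and whether each threshold is met.

    The defaults are the stepwise thresholds quoted on the slide
    (hypothetical helper, not an SPSS function).
    """
    ratio = valid_cases / n_predictors
    return ratio, ratio >= minimum, ratio >= preferred


# Values from the Case Processing Summary: 136 valid cases, 3 independent variables.
ratio, meets_minimum, meets_preferred = case_ratio_check(136, 3)
print(f"ratio = {ratio:.2f} to 1")
print("minimum satisfied:", meets_minimum)
print("preferred satisfied:", meets_preferred)
```

A ratio that meets the minimum but not the preferred threshold corresponds to the "add a caution" outcome described above.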
Slide 79
OVERALL RELATIONSHIP BETWEEN
INDEPENDENT AND DEPENDENT VARIABLES
The presence of a relationship between the dependent variable
and the combination of independent variables is based on the
statistical significance of the model chi-square.
In this analysis, the probability of the model chi-square (9.001)
was 0.003, less than or equal to the level of significance of 0.05.
The null hypothesis that there is no difference between the model
with only a constant and the model with independent variables
was rejected. The existence of a relationship between the
independent variables and the dependent variable was supported.
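As a rough check of the reported probability, the sketch below recomputes the p-value for the model chi-square. It assumes 1 degree of freedom (the value shown for the model in the Step Summary table) and uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal, so only the Python standard library is needed. The function name is hypothetical.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    This shortcut is only valid for df = 1.
    """
    return math.erfc(math.sqrt(statistic / 2.0))


alpha = 0.05
p_value = chi_square_p_value_df1(9.001)  # model chi-square from the slide
reject_null = p_value <= alpha
print(f"p = {p_value:.3f}, reject null hypothesis: {reject_null}")
```

For models with more than one degree of freedom, a general chi-square survival function from a statistics library would be needed instead.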
NUMERICAL PROBLEMS
Slide 80
Variables in the Equation
                           B      S.E.     Wald    df    Sig.    Exp(B)
Step 1(a)   INCOME98     .095     .033    8.436     1    .004     1.100
            Constant   -1.033     .527    3.847     1    .050      .356
a. Variable(s) entered on step 1: INCOME98.
Multicollinearity in the logistic regression solution is detected
by examining the standard errors for the b coefficients. A
standard error larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent variables,
zero cells for a dummy-coded independent variable because
all of the subjects have the same value for the variable, and
'complete separation' whereby the two groups in the
dependent event variable can be perfectly separated by
scores on one of the independent variables. Analyses that
indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a
standard error larger than 2.0. (The check for standard
errors larger than 2.0 does not include the standard error for
the Constant.)
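The standard error screen described above can be sketched as a small Python helper. The function name, the dictionary layout, and the choice to skip the row labelled "Constant" are illustrative assumptions that mirror the rule stated on the slide.

```python
def flag_numerical_problems(std_errors, threshold=2.0, skip=("Constant",)):
    """Return the names of predictors whose standard error exceeds the threshold.

    std_errors maps a term name to its S.E. from the Variables in the
    Equation table. The constant is skipped, as the slide specifies.
    """
    return [
        name
        for name, se in std_errors.items()
        if name not in skip and se > threshold
    ]


# Standard errors from the Variables in the Equation table.
flagged = flag_numerical_problems({"INCOME98": 0.033, "Constant": 0.527})
print("predictors with S.E. > 2.0:", flagged)
```

An empty list corresponds to the slide's conclusion that there is no evidence of numerical problems.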
Slide 81
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE
The probability of the Wald statistic for the variable total family
income was 0.004, less than or equal to the level of significance
of 0.05. The null hypothesis that the b coefficient for total family
income was equal to zero was rejected. This supports the
relationship that "survey respondents who had higher total
family incomes were more likely to have been less positive that
the United States would fight in another world war within the
next ten years." Total family income is an ordinal variable that is
coded so that higher numeric values are associated with survey
respondents who had higher total family incomes.
Variables in the Equation
                           B      S.E.     Wald    df    Sig.    Exp(B)
Step 1(a)   INCOME98     .095     .033    8.436     1    .004     1.100
            Constant   -1.033     .527    3.847     1    .050      .356
a. Variable(s) entered on step 1: INCOME98.
The value of Exp(B) was 1.100 which implies that a
one unit increase in total family income increased the
odds that survey respondents have been less positive
that the United States would fight in another world
war within the next ten years by 10.0%.
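The step from the B coefficient to the "10.0%" statement can be reproduced with a short Python sketch. The function name is hypothetical; the calculation is the standard one, where Exp(B) is the odds ratio and (Exp(B) - 1) x 100 is the percent change in the odds for a one-unit increase in the predictor.

```python
import math


def percent_change_in_odds(b):
    """Convert a logistic regression coefficient B to a percent change in odds.

    exp(B) is the odds ratio; values above 1 increase the odds of the
    modeled event, values below 1 decrease them.
    """
    odds_ratio = math.exp(b)
    return odds_ratio, (odds_ratio - 1.0) * 100.0


# B for INCOME98 from the Variables in the Equation table.
odds_ratio, pct_change = percent_change_in_odds(0.095)
print(f"Exp(B) = {odds_ratio:.3f}, change in odds = {pct_change:+.1f}%")
```

A negative B gives an odds ratio below 1 and therefore a negative percent change, which is how the "decreased the odds" statements in the earlier problems arise.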
IMPORTANCE OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE
Slide 82
The order of importance is based on the entry
order of the variables included in the stepwise
logistic regression. The entry order is
summarized in the Step Summary table, in
which we see which variable was added or
removed at each step.
Step Summary(a,b)
        Improvement                   Model                         Correct
Step    Chi-square    df    Sig.      Chi-square    df    Sig.      Class %    Variable
1       9.001          1    .003      9.001          1    .003      67.6%      IN: INCOME98
a. No more variables can be deleted from or added to the current model.
b. End block: 1
The most important predictor for identifying
survey respondents who have been less
positive that the United States would fight in
another world war within the next ten years
was total family income [INCOME98].
The importance of the predictors stated in
the problem is correct.
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
by chance accuracy rate
Slide 83
The independent variables could be characterized as useful
predictors distinguishing survey respondents who have been
less positive that the United States would fight in another
world war within the next ten years from survey respondents
who have been more positive that the United States would
fight in another world war within the next ten years if the
classification accuracy rate was substantially higher than the
accuracy attainable by chance alone. Operationally, the
classification accuracy rate should be 25% or more higher
than the proportional by chance accuracy rate.
Classification Table(a,b)
                                         Predicted
                                         EXPECT U.S. IN WORLD
                                         WAR IN 10 YEARS        Percentage
         Observed                        YES         NO         Correct
Step 0   EXPECT U.S. IN WORLD    YES       0         54              .0
         WAR IN 10 YEARS         NO        0         82           100.0
         Overall Percentage                                        60.3
a. Constant is included in the model.
b. The cut value is .500
The proportional by chance accuracy rate was computed by
calculating the proportion of cases for each group based on
the number of cases in each group in the classification table
at Step 0, and then squaring and summing the proportion of
cases in each group (0.397² + 0.603² = 0.521).
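The proportional by chance accuracy rate can be recomputed directly from the group counts at Step 0. The Python sketch below is one way to do it; the function name is an illustrative choice, and the counts (54 Yes, 82 No) are read from the classification table.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of squared group proportions, computed from raw group counts."""
    total = sum(group_counts)
    return sum((count / total) ** 2 for count in group_counts)


# Observed group sizes at Step 0: 54 "Yes" and 82 "No" respondents.
by_chance = proportional_by_chance_accuracy([54, 82])
print(f"proportional by chance accuracy = {by_chance:.3f}")
```

Because the sketch works from unrounded proportions, later digits can differ slightly from hand calculations that round each proportion first.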
CLASSIFICATION USING THE LOGISTIC REGRESSION MODEL:
criteria for classification accuracy
Slide 84
Classification Table(a)
                                         Predicted
                                         EXPECT U.S. IN WORLD
                                         WAR IN 10 YEARS        Percentage
         Observed                        YES         NO         Correct
Step 1   EXPECT U.S. IN WORLD    YES      20         34            37.0
         WAR IN 10 YEARS         NO       10         72            87.8
         Overall Percentage                                        67.6
a. The cut value is .500
The accuracy rate computed by SPSS was
67.6% which was greater than or equal to the
proportional by chance accuracy criterion of
65.1% (1.25 x 52.1% = 65.1%).
The criterion for classification accuracy is
satisfied.
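The comparison on this slide can be sketched in Python from the Step 1 classification table alone. The function names and the nested-list layout (rows are observed groups, columns are predicted groups) are assumptions made for illustration; the 1.25 multiplier is the criterion stated in the slides. The sketch works from raw counts without intermediate rounding, so the last digit of the criterion may differ slightly from the slide's rounded arithmetic.

```python
def classification_accuracy(table):
    """Overall proportion correct: diagonal cells divided by all cases."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def accuracy_criterion_check(table, margin=1.25):
    """Compare model accuracy with margin x proportional by chance accuracy."""
    row_totals = [sum(row) for row in table]  # observed group sizes
    total = sum(row_totals)
    by_chance = sum((n / total) ** 2 for n in row_totals)
    accuracy = classification_accuracy(table)
    criterion = margin * by_chance
    return accuracy, criterion, accuracy >= criterion


# Rows: observed YES, observed NO. Columns: predicted YES, predicted NO.
step1 = [[20, 34], [10, 72]]
accuracy, criterion, satisfied = accuracy_criterion_check(step1)
print(f"accuracy = {accuracy:.1%}, criterion = {criterion:.1%}, satisfied = {satisfied}")
```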
Answering the question in problem 3 - 1
Slide 85
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
We found a statistically significant overall
relationship between the predictor independent
variables and the dependent variable.
There was no evidence of numerical problems in
the solution.
Moreover, the classification accuracy surpassed
the proportional by chance accuracy criteria,
supporting the utility of the model.
Answering the question in problem 3 - 2
Slide 86
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and
that the validation analysis will confirm the generalizability of the results. Use a level of
significance of 0.05 for evaluating the statistical relationship.
From the list of variables "highest academic degree" [degree], "total family income" [income98],
and "satisfaction with financial situation" [satfin], the most useful predictor for distinguishing
between groups based on responses to "expect u.s. in world war in 10 years" [uswary] was
"total family income" [income98]. These predictors differentiate survey respondents who have
been less positive that the United States would fight in another world war within the next ten
years from survey respondents who have been more positive that the United States would fight
in another world war within the next ten years.
The most important predictor for identifying survey respondents who have been less positive
that the United States would fight in another world war within the next ten years was total
family income.
Survey respondents who had higher total family incomes were more likely to have been less
positive that the United States would fight in another world war within the next ten years. A
one unit increase in total family income increased the odds that survey respondents have been
less positive that the United States would fight in another world war within the next ten years
by 10.0%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
We verified that each statement about the
relationship between an independent variable and
the dependent variable was correct in both
direction of the relationship and the change in
likelihood associated with a one-unit change of the
independent variable.
We also verified the order of importance for the
independent variables included in the stepwise
analysis.
The answer to the question is true
with caution.
A caution is added to the findings
because of the inclusion of ordinal
level independent variables. A
caution is added to the findings
because the preferred sample
size is not met.
Slide 87
Steps in binary logistic regression:
level of measurement and initial sample size
The following is a guide to the decision process for answering
problems about the basic relationships in logistic regression:
Dependent dichotomous? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: continue
Ratio of cases to independent variables at least 10 to 1?
  No: Inappropriate application of a statistic
  Yes: Run logistic regression, using method for including
  variables identified in the research question.
Slide 88
Steps in logistic regression:
overall relationship and numerical problems
Hierarchical method of entry used to include independent variables?
  Yes: Presence of relationship confirmed by test of block chi-square?
    No: False
    Yes: continue
  No: Presence of relationship confirmed by test of model chi-square?
    No: False
    Yes: continue
Standard errors of coefficients indicate presence of numerical problems (s.e. > 2.0)?
  Yes: False
  No: continue
Slide 89
Steps in logistic regression:
relationships between IV's and DV
Stepwise method of entry used to include independent variables?
  Yes: Entry order of variables interpreted correctly?
    No: False
    Yes: continue
  No: continue
Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: continue
Slide 90
Steps in logistic regression:
classification accuracy and adding cautions
Overall accuracy rate is at least 25% higher than the proportional by chance accuracy rate?
  No: False
  Yes: continue
Satisfies preferred ratio of cases to IVs of 20 to 1 (50 to 1 for stepwise)?
  No: True with caution
  Yes: continue
One or more IVs are ordinal level variables?
  Yes: True with caution
  No: True
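The decision process on the last four slides can be sketched as one Python function. This is an illustrative reading of the flowcharts, not SPSS output: the class, field, and function names are hypothetical, and the single relationship_significant field stands in for whichever chi-square test applies (block chi-square for hierarchical entry, model chi-square otherwise).

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Facts read from the problem statement and SPSS output (hypothetical layout)."""
    dv_dichotomous: bool
    ivs_metric_or_dichotomous: bool  # ordinal IVs treated as metric by convention
    valid_cases: int
    n_ivs: int
    stepwise: bool
    relationship_significant: bool   # model (or block) chi-square test
    max_iv_std_error: float          # largest S.E. among predictors, not the constant
    entry_order_correct: bool        # only consulted for stepwise entry
    relationships_correct: bool
    accuracy: float                  # overall classification accuracy, 0 to 1
    by_chance_accuracy: float        # proportional by chance accuracy, 0 to 1
    any_ordinal_iv: bool


def answer(a):
    """Walk the flowchart and return one of the four answer choices."""
    if not (a.dv_dichotomous and a.ivs_metric_or_dichotomous):
        return "Inappropriate application of a statistic"
    ratio = a.valid_cases / a.n_ivs
    if ratio < 10:
        return "Inappropriate application of a statistic"
    if not a.relationship_significant:
        return "False"
    if a.max_iv_std_error > 2.0:
        return "False"
    if a.stepwise and not a.entry_order_correct:
        return "False"
    if not a.relationships_correct:
        return "False"
    if a.accuracy < 1.25 * a.by_chance_accuracy:
        return "False"
    preferred_ratio = 50 if a.stepwise else 20
    if ratio < preferred_ratio or a.any_ordinal_iv:
        return "True with caution"
    return "True"


# Problem 3 as worked in the slides.
problem3 = Analysis(
    dv_dichotomous=True, ivs_metric_or_dichotomous=True,
    valid_cases=136, n_ivs=3, stepwise=True,
    relationship_significant=True, max_iv_std_error=0.033,
    entry_order_correct=True, relationships_correct=True,
    accuracy=0.676, by_chance_accuracy=0.521, any_ordinal_iv=True,
)
result = answer(problem3)
print(result)
```

The two caution conditions are merged into one test because either one alone leads to the same answer in the flowchart.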