Standard Binary Logistic Regression
Slide 1
Logistic regression

Logistic regression is used to analyze relationships between a dichotomous dependent
variable and metric or non-metric independent variables. (SPSS now supports
Multinomial Logistic Regression that can be used with more than two groups, but our
focus now is on binary logistic regression for two groups.)

Logistic regression combines the independent variables to estimate the probability that a
particular event will occur, i.e. a subject will be a member of one of the groups defined by
the dichotomous dependent variable. In SPSS, the model is always constructed to predict
the group with the higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS
will predict membership in the No category. If responses are coded 1 for No and 2 for Yes,
SPSS will predict membership in the Yes category. We will refer to the predicted event for
a particular analysis as the modeled event.

Predicting the “No” event creates some awkward wording in our problems. Our only
option for changing this is to recode the variable.

If the probability for group membership in the modeled category is above some cut point
(the default is 0.50), the subject is predicted to be a member of the modeled group. If the
probability is below the cut point, the subject is predicted to be a member of the other
group.

For any given case, logistic regression computes the probability that a case with a
particular set of values for the independent variables is a member of the modeled category.
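To make the computation concrete, here is a minimal sketch in Python (not SPSS's internal code) of how a constant and coefficients produce a probability and a predicted group; the coefficient and predictor values are invented for illustration.

    import math

    def predict_probability(b0, coefficients, values):
        # probability of the modeled event: 1 / (1 + e^-(b0 + sum of b*x))
        logit = b0 + sum(b * x for b, x in zip(coefficients, values))
        return 1.0 / (1.0 + math.exp(-logit))

    # hypothetical constant and coefficients for two independent variables
    p = predict_probability(b0=-4.0, coefficients=[0.25, 0.05], values=[14, 52])
    group = "modeled category" if p > 0.50 else "other category"  # default 0.50 cut point
    print(round(p, 3), group)  # 0.891 modeled category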
Slide 2
Level of measurement requirements

Logistic regression analysis requires that the dependent variable be dichotomous.

Logistic regression analysis requires that the independent variables be metric or non-metric. The logistic regression procedure will dummy-code non-metric variables for us.
For logistic regression, we will use indicator dummy-coding, rather than deviation
dummy-coding since I think it makes more sense to compare the odds for two groups
rather than compare the odds for one group to the average odds for all groups.

If an independent variable is ordinal, we can either treat it as non-metric and dummy-code
it or we can treat it as interval, in which case we will attach the usual caution.

Dichotomous independent variables do not have to be dummy-coded, but in our problems
we will have SPSS dummy-code them, because then we do not need to worry about the
original codes for the variable: we can always interpret the coefficient as the difference
between the category coded 1 and the reference category coded 0.
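As an aside, the indicator scheme SPSS applies can be sketched in Python with pandas (hypothetical data; SPSS does this internally when we declare a variable categorical). With the last category as the reference, the last category's dummy column is dropped, so that category is coded 0:

    import pandas as pd

    df = pd.DataFrame({"sex": ["MALE", "FEMALE", "FEMALE", "MALE"]})
    order = ["MALE", "FEMALE"]  # original codes: 1 = MALE, 2 = FEMALE
    dummies = pd.get_dummies(pd.Categorical(df["sex"], categories=order), dtype=int)
    sex_1 = dummies.drop(columns="FEMALE")  # last category (FEMALE) is the reference
    print(sex_1)  # the kept column plays the role of SPSS's sex(1): 1 = male, 0 = female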
Slide 3
Dummy-coding in SPSS - 1
When we want SPSS to dummy-code a variable, we enter the
specifications in the Define Categorical Variables dialog box.
Here we are dummy-coding sex, using the defaults of indicator
coding with the last category as the reference category.
SPSS shows you its coding
scheme in the table of Categorical
Variables Codings in the output.
Since we chose the last category
as reference, FEMALE is coded 0.
In the table of coefficients, the
dummy-coded variable is referred to
by its original name plus the value
for the Parameter coding in the
Categorical Variables Codings table.
Slide 4
Dummy-coding in SPSS - 2
Here we are dummy-coding sex, using the defaults of indicator
coding with the First category as the reference category. Note
that you must click on the Change button after selecting the
First option button.
SPSS shows you its coding
scheme in the table of Categorical
Variables Codings in the output.
Since we chose the FIRST
category as reference, MALE is
coded 0.
In the table of coefficients, the dummy-coded variable is still referred to by its original
name plus the value for the Parameter coding in the Categorical Variables Codings table,
but in this case sex(1) stands for females.

Variables in the Equation

                          B     S.E.     Wald    df    Sig.   Exp(B)
Step 1(a)  sex(1)     -1.590    .361   19.427     1    .000     .204
           Constant    -.235    .229    1.047     1    .306     .791

a. Variable(s) entered on step 1: sex.
Slide 5
Assumptions

Logistic regression does not make any assumptions of normality, linearity, and
homogeneity of variance for the independent variables.

When the variables satisfy the assumptions of normality, linearity, and homogeneity of
variance, discriminant analysis has historically been cited as the more effective statistical
procedure for evaluating relationships with a non-metric dependent variable. However,
logistic regression is being used more and more frequently because it can be interpreted
similarly to other general linear model problems.

When the variables do not satisfy the assumptions of normality, linearity, and
homogeneity of variance, logistic regression is the statistic of choice since it does not
make these assumptions.

Multicollinearity is a problem for logistic regression with the same consequences as
multiple regression, i.e. we are likely to misinterpret the contribution of independent
variables when they are collinear. SPSS does not compute tolerance values for logistic
regression, so we will detect multicollinearity through the examination of standard errors. We will not
interpret models when evidence of multicollinearity is found.

Evidence of multicollinearity is detected as a numerical problem in the attempted solution.
Slide 6
Numerical problems

The maximum likelihood method used to calculate logistic regression is an iterative fitting
process that attempts to cycle through repetitions to find an answer.

Sometimes, the method will break down and not be able to converge or find an answer.

Sometimes the method will produce wildly improbable results, reporting that a one-unit
change in an independent variable increases the odds of the modeled event by hundreds of
thousands or millions. These implausible results can be produced by multicollinearity,
categories of predictors having no cases or zero cells, and complete separation whereby
the two groups are perfectly separated by the scores on one or more independent variables.

The clue that we have numerical problems and should not interpret the results is a
standard error for some independent variable that is larger than 2.0 (this does not apply to
the constant).
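Outside SPSS, the same screening rule can be sketched with statsmodels in Python (a stand-in for the SPSS output; the data and the column names educ and sei are synthetic):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["educ", "sei"])
    y = (X["educ"] + rng.normal(size=100) > 0).astype(int)   # synthetic 0/1 outcome
    result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

    se = result.bse.drop(labels="const")   # standard errors, excluding the constant
    suspects = list(se[se > 2.0].index)    # predictors signaling numerical problems
    print(suspects or "no evidence of numerical problems")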
Slide 7
Sample size requirements

The minimum number of cases per independent variable is 10, using a guideline provided
by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main
resources for Logistic Regression.

If we do not meet the sample size requirement, it is suggested that this be mentioned as a
limitation to our analysis. If the relationships between predictors and the dependent
variable are strong, we may still attain statistical significance with smaller samples.
Slide 8
Methods for including variables

SPSS supports three methods for including variables in the regression equation:
• the standard or simultaneous method, in which all independents are included at the
same time;
• the hierarchical method, in which control variables are entered in the analysis before
the predictors whose effects we are primarily concerned with;
• the stepwise method (forward conditional or forward LR in SPSS), in which
variables are selected in the order in which they maximize the statistically significant
contribution to the model.

For all methods, the contribution to the model is measured by the model chi-square, a
statistical measure of the fit between the dependent and independent variables analogous to R².
Slide 9
Computational method

Multiple regression uses the least-squares method to find the coefficients for the
independent variables in the regression equation, i.e. it computes coefficients that
minimize the residuals for all cases.

Logistic regression uses maximum-likelihood estimation to compute the coefficients for
the logistic regression equation. This method attempts to find coefficients that
match the breakdown of cases on the dependent variable.

The overall measure of how well the model fits is given by the likelihood value, which is
similar to the residual or error sum of squares value for multiple regression. A model that
fits the data well will have a small likelihood value. A perfect model would have a
likelihood value of zero.

Maximum-likelihood estimation is an iterative procedure that works successively to
get closer and closer to the correct answer. When SPSS reports the "iterations," it is
telling us how many cycles it took to get the answer.
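A sketch of the likelihood computation (plain Python, hypothetical outcomes and probabilities) shows why a perfect model's value is zero: each case contributes the log of the probability the model assigned to the case's observed outcome.

    import math

    def neg2_log_likelihood(outcomes, probabilities):
        # -2LL: -2 times the summed log probability of the observed outcomes
        ll = sum(math.log(p if y == 1 else 1.0 - p)
                 for y, p in zip(outcomes, probabilities))
        return -2.0 * ll

    outcomes = [1, 0, 1, 1, 0]                 # hypothetical group memberships
    probabilities = [0.8, 0.3, 0.6, 0.9, 0.2]  # hypothetical predicted probabilities
    print(round(neg2_log_likelihood(outcomes, probabilities), 3))  # 2.838
    print(abs(neg2_log_likelihood([1, 0], [1.0, 0.0])))            # a perfect model: 0.0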
Slide 10
Overall test of relationship

Errors in a logistic regression model are measured in terms of “-2 log likelihood” values,
which are analogous to “total sum of squares”. When an independent variable has a
relationship to the dependent variable, the measure of error decreases. Since the log
likelihood itself is a negative number, -2 log likelihood (abbreviated -2LL) is a positive
number, and an improvement in the relationship is indicated by a smaller value, e.g. if
-2LL were 200, a -2LL of 100 would represent an improvement.

The overall test of relationship among the independent variables and the groups defined by
the dependent variable is based on the reduction in the -2 log likelihood value between a
model which does not contain any independent variables and the model that contains the
independent variables.

This difference in likelihood follows a chi-square distribution, and is referred to as the
model chi-square.

The significance test for the model chi-square is our statistical evidence of the presence of
a relationship between the dependent variable and the combination of the independent
variables.

In a hierarchical logistic regression, the significance test for the addition of the predictor
variables is based on the block chi-square in the omnibus tests of model coefficients.
Slide 11
Overall test of relationship in SPSS output
Though the iteration history
is not usually an output of
interest, it does show us
how the model chi-square
value is derived.
The original -2 Log Likelihood value is 213.891.
At the end of this step, the -2 Log Likelihood value is 192.726.
213.891 – 192.726 = 21.165, the value for Model Chi-square
in the table of Omnibus Tests of Model Coefficients.
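The same arithmetic, with the significance test it feeds, sketched in Python (scipy stands in for SPSS here; the degrees of freedom equal the number of parameters entered, assumed to be 1 for this example):

    from scipy.stats import chi2

    constant_only_2ll = 213.891   # -2LL for the model with only the constant
    full_model_2ll = 192.726      # -2LL after the independent variable is entered
    model_chi_square = constant_only_2ll - full_model_2ll   # 21.165
    p_value = chi2.sf(model_chi_square, df=1)               # df assumed: 1 predictor
    print(round(model_chi_square, 3), p_value)              # 21.165 and p < .001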
Slide 12
Relationship of Individual Independent Variables and Dependent Variable

There is a test of significance for the relationship between an individual independent
variable and the dependent variable: a significance test of the Wald statistic.

The individual coefficients represent change in the odds of being a member of the
modeled category. Individual coefficients are expressed in log units and are not directly
interpretable. However, if the b coefficient is used as the power to which the base of the
natural logarithm (2.71828) is raised, the result represents the change in the odds of the
modeled event associated with a one-unit change in the independent variable.

If a coefficient is positive, its exponentiated value, Exp(B), will be greater than one, meaning that
the modeled event is more likely to occur. If a coefficient is negative, Exp(B)
will be less than one, and the odds of the event occurring decrease. A coefficient of
zero (0) has an Exp(B) value of 1.0, meaning that this coefficient does not change
the odds of the event one way or the other.

The interpretive statement for individual relationships, provided they are statistically
significant, incorporates the odds ratio or Exp(B) in SPSS output.
Slide 13
Interpreting individual relationships - 1
Exp(B) can be interpreted as a
percentage change by
subtracting 1.0 from the Exp(B)
value.
In this example,
Exp(B) – 1.0 =
.204 – 1.0 = -.796
We can state this finding as:
females (the sex(1) value in this
example) were 79.6% less
likely to …
Note: in this example, sex
was coded so that male was
the reference category.
Slide 14
Interpreting individual relationships - 2
Exp(B) can be interpreted as a
multiplier when percentage
change is confusing.
In this example,
Exp(B) – 1.0 =
4.902 – 1.0 = 3.902,
or 390.2% more likely.
We can state this finding as:
males (the sex(1) value in this
example) were 4.9, or
approximately 5, times more
likely to …
Note: in this example, sex
was coded so that female
was the reference category.
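Both readings of Exp(B) reduce to the same arithmetic, sketched here in Python with the values from these two slides:

    import math

    def describe_odds_change(exp_b):
        # percentage change in the odds for a one-unit change in the predictor
        change = exp_b - 1.0
        if change < 0:
            return f"{-change:.1%} less likely"
        return f"{change:.1%} more likely (about {exp_b:.1f} times the odds)"

    print(describe_odds_change(0.204))         # 79.6% less likely
    print(describe_odds_change(4.902))         # 390.2% more likely (about 4.9 times the odds)
    print(describe_odds_change(math.exp(0)))   # a b coefficient of 0 leaves the odds unchanged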
Slide 15
Strength of logistic regression relationship

While logistic regression does compute correlation measures to estimate the strength of
the relationship (pseudo R-square measures, such as Nagelkerke's R²), these correlation
measures do not really tell us much about the accuracy or errors associated with the
model.

A more useful measure to assess the utility of a logistic regression model is classification
accuracy, which compares predicted group membership based on the logistic model to the
actual, known group membership, which is the value for the dependent variable.
Slide 16
Evaluating usefulness for logistic models

The benchmark that we will use to characterize a logistic regression model as useful is a
25% improvement over the rate of accuracy achievable by chance alone.

Even if the independent variables had no relationship to the groups defined by the
dependent variable, we would still expect to be correct in our predictions of group
membership some percentage of the time. This is referred to as by chance accuracy.

The estimate of by chance accuracy that we will use is the proportional by chance
accuracy rate, computed by summing the squared proportion of cases in each group.
Slide 17
Comparing accuracy rates

To characterize our model as useful, we compare the overall percentage accuracy rate
produced by SPSS at the last step in which variables are entered to 25% more than the
proportional by chance accuracy. (Note: SPSS does not compute a cross-validated
accuracy rate for logistic regression.)
Classification Table(a) – Step 1

Observed                                   Predicted
                                      YES       NO    Percentage Correct
EXPECT U.S. IN WORLD       YES         20       34          37.0
WAR IN 10 YEARS            NO          10       72          87.8
Overall Percentage                                          67.6

a. The cut value is .500
SPSS reports the overall accuracy rate in
the Classification Table. The overall
accuracy rate computed by SPSS was
67.6% in this example.
Slide 18
Computing by chance accuracy
The number of cases in each group is found in the Classification Table at Step 0 (before any
independent variables are included). The proportion of cases in the largest group is equal to
the overall percentage (60.3%).
Classification Table(a,b) – Step 0

Observed                                   Predicted
                                      YES       NO    Percentage Correct
EXPECT U.S. IN WORLD       YES          0       54            .0
WAR IN 10 YEARS            NO           0       82         100.0
Overall Percentage                                          60.3

a. Constant is included in the model.
b. The cut value is .500
The proportional by chance accuracy rate was computed by
calculating the proportion of cases for each group based on the
number of cases in each group in the classification table at Step
0, and then squaring and summing the proportion of cases in
each group (0.397² + 0.603² = 0.521).

The proportional by chance accuracy criteria is 65.2% (1.25 x
52.1% = 65.2%).

Since the accuracy rate in this example, 67.6%, is greater than
the 65.2% by chance accuracy criteria, the model would be
characterized as useful.
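The computation on this slide, as a small Python check:

    def proportional_by_chance(group_counts, improvement=1.25):
        # sum of squared group proportions, and the 25%-improvement criteria
        total = sum(group_counts)
        rate = sum((n / total) ** 2 for n in group_counts)
        return rate, improvement * rate

    rate, criteria = proportional_by_chance([54, 82])   # group sizes at Step 0
    print(round(rate, 3))     # 0.521
    print(0.676 > criteria)   # True: the 67.6% model accuracy beats the ~65.2% criteria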
Slide 19
Outliers

Logistic regression models the relationship between a set of independent variables and
the probability that a case is a member of one of the categories of the dependent variable.
(In SPSS, the modeled category is the one with the higher numeric code.) If the
probability is greater than 0.50, the case is classified in the modeled category. If the
probability is less than 0.50, the case is classified in the other category.

The actual probability of the modeled event for any case is either 1.0 or 0.0, i.e. a case is
in the modeled category or it is not.

The residual is the difference between the actual probability and the predicted probability
for a case. If the predicted probability for a case that actually belonged to the modeled
category was 0.80, the residual would be 1.00 – 0.80 = 0.20.

The residual can be standardized by dividing it by an estimate of its standard deviation.
Since the dependent variable is dichotomous or binary, the standard deviation for
proportions is used.
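A sketch of these definitions in Python (hypothetical case: actual membership in the modeled category, predicted probability 0.80):

    import math

    def standardized_residual(actual, predicted):
        # residual divided by the standard deviation for a proportion
        return (actual - predicted) / math.sqrt(predicted * (1.0 - predicted))

    print(round(1.00 - 0.80, 2))                         # residual: 0.2
    print(round(standardized_residual(1.0, 0.80), 2))    # 0.2 / sqrt(.8 * .2) = 0.5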
Slide 20
Strategy for Outliers

Our strategy for evaluating the impact of outliers on our logistic regression model will
parallel what we have done for multiple regression and discriminant analysis:
• First, we run a baseline model including all cases.
• Second, we run a model excluding outliers whose standardized residual is
greater than 2.58 or less than -2.58 (the z-score for p = .01).
• If the model excluding outliers has a classification accuracy rate that is 2%
or more higher than the accuracy rate of the baseline model, we will
interpret the revised model. If the accuracy rate of the revised model
without outliers is less than 2% more accurate, we will interpret the baseline
model. The decision rule is sketched below.
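Written out in Python, the rule is a one-line check (the accuracy rates here come from the worked problem later in this document):

    def model_to_interpret(baseline_accuracy, revised_accuracy):
        # interpret the revised model only if it is 2% or more accurate
        if revised_accuracy >= baseline_accuracy + 0.02:
            return "revised model (outliers excluded)"
        return "baseline model (all cases)"

    print(model_to_interpret(0.751, 0.780))   # revised model (outliers excluded)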
Slide 21
The Problem in Blackboard
The problem statement tells us:
• the variables included in the analysis
• whether each variable should be treated as metric or non-metric
• the type of dummy coding and reference category for non-metric variables
• the alpha for both the statistical relationships and for diagnostic tests
Slide 22
The Statement about Level of Measurement
The first statement in the problem asks about level of
measurement. Standard binary logistic regression
requires that the dependent variable be dichotomous, the
metric independent variables be interval level, and the
non-metric independent variables be dummy-coded if
they are not dichotomous. SPSS Binary Logistic
Regression calls non-metric variables “categorical.”
SPSS Binary Logistic Regression will dummy-code
categorical variables for us, provided it is useful to use
either the first or last category as the reference category.
Slide 23
Marking the Statement about Level of Measurement
Mark the check box as a correct statement because:
• The dependent variable "computer use" [compuse] is
dichotomous level, satisfying the requirement for the
dependent variable.
• The independent variables "highest year of school completed"
[educ] and "socioeconomic index" [sei] are interval level,
satisfying the requirement for independent variables.
• The independent variable "sex" [sex] is dichotomous level,
satisfying the requirement for independent variables.
• The independent variable "condition of health" [health] is
ordinal level which the problem instructs us to dummy-code as
a non-metric variable.
Slide 24
The Statement about Outliers
While we do not need to be concerned about normality, linearity,
and homogeneity of variance, we need to determine whether or
not outliers were substantially reducing the classification
accuracy of the model.
To test for outliers, we run the binary logistic regression in SPSS
and check for outliers. Next, we exclude the outliers and run the
logistic regression a second time. We then compare the accuracy
rates of the models with and without the outliers. If the accuracy
of the model without outliers is 2% or more accurate than the
model with outliers, we interpret the model excluding outliers.
Slide 25
Running the standard binary logistic regression
Select the Regression |
Binary Logistic…
command from the
Analyze menu.
Slide 26
Selecting the dependent variable
First, highlight the
dependent variable
compuse in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Dependent text box.
Slide 27
Selecting the independent variables
Move the independent
variables stated in the
problem to the
Covariates list box.
Slide 28
Declare the categorical variables - 1
To tell SPSS that two of
the variables are non-metric and need to be
dummy-coded, click on
the Categorical button.
Slide 29
Declare the categorical variables - 2
Move the variables sex
and health to the
Categorical Covariates list
box.
SPSS assigns its default method for
dummy-coding, Indicator coding, to
each variable, placing the name of
the coding scheme in parentheses
after each variable name.
Slide 30
Declare the categorical variables - 3
We could change the dummy-coding to a different scheme
by choosing another method
from the drop-down menu, and
clicking on the Change button.
However, we will use indicator dummy-coding for
our logistic regression problems, so that we are
comparing the difference in odds between two
specific groups, rather than comparing one group to
the average odds for all other groups. I think
“average odds” complicates the interpretation.
Slide 31
Declare the categorical variables - 4
We will also accept the default
of using the last valid category
as the reference category for
each variable (we do not use
higher numbered missing values
as a reference category).
Click on the
Continue
button to close
the dialog box.
Note that sex is a dichotomous variable, and does
not require dummy-coding. I prefer to dummy-code
it anyhow so that my interpretation is consistently
based on the difference between categories coded 0
and 1. I do not need to alter my interpretation if two
different numbers were used for the original coding.
Slide 32
Specifying the method for including variables
SPSS provides us with two methods for including
variables: to enter all of the independent variables
at one time, and a stepwise method for selecting
variables using a statistical test to determine the
order in which variables are included.
SPSS also supports the specification of "Blocks" of
variables for testing hierarchical models.
Since the problem calls for
a standard binary logistic
regression, we accept the
default Enter method for
including variables.
Slide 33
Adding outliers to the data set - 1
SPSS will calculate the values for
standardized residuals and save them to
the data set so that we can check for
outliers and remove the outliers easily if
we need to run a model excluding outliers.
Click on the Save… button
to request the statistics
that we want to save.
Slide 34
Adding outliers to the data set - 2
First, mark the checkbox
for Standardized residuals
in the Residuals panel.
Second, click on
the Continue button
to complete the
specifications.
Slide 35
Requesting the output
Click on the OK
button to request
the output.
While optional
statistical output is
available, we do not
need to request any
optional statistics.
Slide 36
Detecting the presence of outliers - 1
SPSS created a new variable, ZRE_1, which
contains the standardized residual. If SPSS
finds that the data set already contains a
ZRE_1 variable, it will create ZRE_2.
I find it easier to delete the ZRE_1 variable
after each analysis rather than have
multiple ZRE_ variables in the data set,
requiring that I remember which one goes
with which analysis.
Slide 37
Detecting the presence of outliers - 2
To detect outliers, we will sort the ZRE_1 column twice:
• first, in ascending order to identify outliers with a
standardized residual of -2.58 or less.
• second, in descending order to identify outliers with
a standardized residual of +2.58 or greater.
Click the right mouse button
on the column header and
select Sort Ascending from
the pop-up menu.
Slide 38
Detecting the presence of outliers - 3
After scrolling down past the
cases with missing data (. in
the ZRE_1 column), we see
that we have five outliers
that have standardized
residuals of -2.58 or less.
Slide 39
Detecting the presence of outliers - 4
To check for outliers with large
positive standardized residuals,
click the right mouse button
on the column header and
select Sort Ascending from the
pop-up menu.
Slide 40
Detecting the presence of outliers - 5
After scrolling up to the top
of the data set, we see that
there are no outliers that
have standardized residuals
of +2.58 or more.
Since we found outliers, we
will run the model excluding
them and compare accuracy
rates to determine which one
we will interpret.
Had there been no outliers, we
would move on to the issue of
sample size.
Slide 41
Running the model excluding outliers - 1
We will use a Select
Cases command to
exclude the outliers
from the analysis.
Slide 42
Running the model excluding outliers - 2
First, in the Select
Cases dialog box, mark
the option button If
condition is satisfied.
Second, click on
the If button to
specify the
condition.
Slide 43
Running the model excluding outliers - 3
To eliminate the outliers, we
request that the cases that are not
outliers be selected into the
analysis.
The formula specifies that we
should include cases if the absolute
value of the standardized residual
(ZRE_1) is less than 2.58.
The abs() or absolute value
function tells SPSS to ignore the
sign of the value.
After typing in the formula,
click on the Continue button
to close the dialog box.
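For readers replicating this outside SPSS, the same filter can be sketched in pandas (hypothetical data frame; ZRE_1 is the saved standardized residual). As in SPSS, cases with a missing ZRE_1 fail the condition and are excluded:

    import pandas as pd

    df = pd.DataFrame({"ZRE_1": [0.4, -3.1, 1.9, None, 2.7]})
    included = df[df["ZRE_1"].abs() < 2.58]   # equivalent of abs(ZRE_1) < 2.58
    print(included.index.tolist())            # [0, 2]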
Slide 44
Running the model excluding outliers - 4
SPSS displays the
condition we entered
on the Select Cases
dialog box.
Click on the OK
button to close
the dialog box.
Slide 45
Running the model excluding outliers - 5
SPSS indicates which
cases are excluded by
drawing a slash across
the case number.
Scrolling down in the data,
we see that the outliers
and cases with missing
values are excluded.
Slide 46
Running the model excluding outliers - 6
To run the logistic
regression excluding
outliers, select Logistic
Regression from the
Dialog Recall menu.
Slide 47
Running the model excluding outliers - 7
The only change we will make
is to clear the check box for
saving standardized residuals.
Click on the Save
button to open
the dialog box.
Slide 48
Running the model excluding outliers - 8
First, clear the check
box for Standardized
residuals.
Second, click
on the
Continue
button to close
the dialog box.
Slide 49
Running the model excluding outliers - 9
Finally, click on
the OK button to
request the
output.
Slide 50
Accuracy rate of the baseline model including all cases
The accuracy rate
for the model with
all cases is 75.1%.
Navigate to the Classification
Table for the logistic regression
with all cases. To distinguish
the two models, I often refer to
the first one as the baseline
model.
Slide 51
Accuracy rate of the revised model excluding outliers
Navigate to the Classification
Table for the logistic regression
excluding outliers. To distinguish
the two models, I often refer to
the second one as the revised model.
The accuracy rate
for the model
excluding outliers is
78.0%.
Slide 52
Marking the statement for excluding outliers
In the initial logistic regression model, 5 cases had a
standardized residual of +2.58 or greater or -2.58 or lower:
- Case 20000032 had a standardized residual of -3.59
- Case 20000178 had a standardized residual of -5.83
- Case 20001092 had a standardized residual of -2.90
- Case 20001544 had a standardized residual of -4.16
- Case 20002344 had a standardized residual of -3.78
Since the classification accuracy of the model that excluded
outliers (78.0%) was greater by 2% or more than the
classification accuracy for the model that included all cases
(75.1%), we mark the check box for the statement.
All of the remaining statements will be evaluated based on
the output for the model that excludes outliers.
Slide 53
The statement about multicollinearity and other numerical problems
Multicollinearity in the logistic regression solution is detected
by examining the standard errors for the b coefficients. A
standard error larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent variables,
cells with a zero count for a dummy-coded independent
variable because all of the subjects have the same value for
the variable, and 'complete separation' whereby the two
groups in the dependent event variable can be perfectly
separated by scores on one of the independent variables.
Analyses that indicate numerical problems should not be
interpreted.
Slide 54
Checking for multicollinearity
The standard errors for the variables included in
the analysis were: the standard error for
"highest year of school completed" [educ] was
.11, the standard error for survey respondents
who said that their health was poor was 1.44,
the standard error for survey respondents who
said that their health was fair was .62, the
standard error for survey respondents who said
that their health was good was .53, the
standard error for "socioeconomic index" [sei]
was .02 and the standard error for survey
respondents who were male was .45.
Slide 55
Marking the statement about multicollinearity and other numerical problems
Since none of the independent
variables in this analysis had a
standard error larger than 2.0,
we mark the check box to
indicate there was no evidence
of multicollinearity.
Slide 56
The statement about sample size
Hosmer and Lemeshow, who
wrote the widely used text on
logistic regression, suggest
that the sample size should be
10 cases for every
independent variable.
Slide 57
The output for sample size
We find the number of cases
included in the analysis in
the Case Processing
Summary.
The 164 cases available for the
analysis satisfied the recommended
sample size of 60 (10 cases per
independent variable) for logistic
regression recommended by Hosmer
and Lemeshow.
Slide 58
Marking the statement for sample size
Since we satisfy the
sample size requirement,
we mark the check box.
Slide 59
The overall relationship between the dependent and independent variables
The existence of a relationship
between the dependent variable
and combination of independent
variables is based on the statistical
significance of the model chi-square for the model that includes
all of the independent variables.
Slide 60
The output for the overall relationship
In this analysis, the test of the full model
versus a model with intercept only was
statistically significant, χ²(6, N = 164) =
88.44, p < .001. The null hypothesis that
there is no difference between the model
with only a constant and the model with
independent variables was rejected.
The existence of a relationship between the
independent variables and the dependent
variable was supported.
Slide 61
Marking the statement for overall relationship
Since the overall relationship
was statistically significant,
we mark the check box.
Slide 62
The statement about the relationship between education and computer use
Having satisfied the criteria for an overall
relationship, we examine the findings for
individual relationships with the dependent
variable. If the overall relationship were
not significant, we would not interpret the
individual relationships.
The first statement concerns
the relationship between
education and computer use.
Slide 63
Output for the relationship between education and computer use
The probability of the Wald statistic for the independent
variable "highest year of school completed" [educ]
(χ²(1, N = 164) = 11.49, p < .001) was less than or
equal to the level of significance of .05. The null
hypothesis that the b coefficient for "highest year of
school completed" [educ] was equal to zero was
rejected. The value of Exp(B) for the variable "highest
year of school completed" [educ] was 1.432, which
implies an increase in the odds of 43.2% (1.432 - 1.0 =
.432). The statement that 'For each unit increase in
"highest year of school completed", survey respondents
were 43.2% more likely to use a computer' is correct.
Slide 64
Marking the statement for the relationship between education and computer use
Since the relationship was
statistically significant, and the odds
ratio was correctly interpreted as an
increase of 43.2%, we mark the
check box for the statement.
Slide 65
The statement for the relationship between poor health and computer use
The next statement concerns
the relationship between the
dummy-coded variable for
poor health and computer use.
Slide 66
Output for the relationship between poor health and computer
The probability of the Wald statistic for the independent
variable survey respondents who said that their health
was poor (χ²(1, N = 164) = 8.20, p = .004) was less
than or equal to the level of significance of .05. The null
hypothesis that the b coefficient for survey respondents
who said that their health was poor was equal to zero
was rejected. The value of Exp(B) for the variable
survey respondents who said that their health was poor
was .016 which implies a decrease in the odds of 98.4%
(.016 - 1.000 = -.984). The statement that 'Survey
respondents who said that their health was poor were
98.4% less likely to use a computer compared to those
who said that their health was excellent' is correct.
Slide 67
Marking the statement for the relationship between poor health and computer use
Since the relationship was
statistically significant, and the odds
ratio was correctly interpreted as a
decrease of 98.4% compared to the
reference group in excellent health,
we mark the check box for the
statement.
Slide 68
The statement for the relationship between fair health and computer use
The next statement concerns
the relationship between the
dummy-coded variable for fair
health and computer use.
Slide 69
Output for the relationship between fair health and computer use
The probability of the Wald statistic for the independent
variable survey respondents who said that their health
was fair (χ²(1, N = 164) = 6.60, p = .010) was less
than or equal to the level of significance of .05. The null
hypothesis that the b coefficient for survey respondents
who said that their health was fair was equal to zero
was rejected. The value of Exp(B) for the variable
survey respondents who said that their health was fair
was .204 which implies a decrease in the odds of 79.6%
(.204 - 1.000 = -.796). The statement that 'Survey
respondents who said that their health was fair were
79.6% less likely to use a computer compared to those
who said that their health was excellent' is correct.
Slide 70
Marking the statement for the relationship between fair health and computer use
Since the relationship was
statistically significant, and the odds
ratio was correctly interpreted as a
decrease of 79.6% compared to the
reference group in excellent health,
we mark the check box for the
statement.
Slide 71
The statement for the relationship between good health and computer use
The next statement concerns
the relationship between the
dummy-coded variable for
good health and computer
use.
Slide 72
Output for the relationship between good health and computer use
The probability of the Wald statistic for the independent
variable survey respondents who said that their health
was good (χ²(1, N = 164) = 1.53, p = .216) was
greater than the level of significance of .05. The null
hypothesis that the b coefficient for survey respondents
who said that their health was good was equal to zero
was not rejected. Saying that their health was good does
not have an impact on the odds that survey respondents
use a computer. The analysis does not support the
relationship that 'Survey respondents who said that their
health was good were 48.4% less likely to use a
computer compared to those who said that their health
was excellent'.
Slide 73
Marking the statement for the relationship between good health and computer use
Since the relationship was
not statistically significant,
the check box is not marked.
Slide 74
The statement for relationship between socioeconomic index and computer use
The next statements concern the
relationship between socioeconomic
index and computer use. We are
offered two alternative interpretations
of the direction of the relationship. If
the relationship is not statistically
significant, neither will be correct.
Slide 75
Output for the relationship between socioeconomic index and computer use
The probability of the Wald statistic for the independent
variable "socioeconomic index" [sei] (χ²(1, N = 164) =
16.93, p < .001) was less than or equal to the level of
significance of .05. The null hypothesis that the b
coefficient for "socioeconomic index" [sei] was equal to
zero was rejected. The value of Exp(B) for the variable
"socioeconomic index" [sei] was 1.070 which implies an
increase in the odds of 7.0% (1.070 - 1.000 = .070).
The statement that 'For each unit increase in
"socioeconomic index", survey respondents were 7.0%
more likely to use a computer' is correct.
Slide 76
Marking the relationship between socioeconomic index and computer use
Since the relationship was
statistically significant and the
odds ratio indicated an increase of
7.0%, the first statement is
marked and the second is left
blank.
Slide 77
The statement for the relationship between sex and computer use
The next statement concerns
the relationship between sex
and computer use.
Slide 78
Output for the relationship between sex and computer use
The probability of the Wald statistic for the independent
variable survey respondents who were male (χ²(1, N =
164) = 2.10, p = .148) was greater than the level of
significance of .05. The null hypothesis that the b
coefficient for survey respondents who were male was
equal to zero was not rejected. Being male does not have
an impact on the odds that survey respondents use a
computer. The analysis does not support the relationship
that 'Survey respondents who were male were 47.8%
less likely to use a computer compared to those who
were female'.
Slide 79
Marking the statement for the relationship between sex and computer use
Since the relationship was
not statistically significant,
the check box is not marked.
Slide 80
Statement about the usefulness of the model based on classification accuracy
The final statement concerns the usefulness of the
logistic regression model. The independent variables
could be characterized as useful predictors
distinguishing survey respondents who use a computer
from survey respondents who do not use a computer if
the classification accuracy rate was substantially
higher than the accuracy attainable by chance alone.
Operationally, the classification accuracy rate should
be 25% or more higher than the proportional by
chance accuracy rate.
Slide 81
Computing proportional by-chance accuracy rate
At Block 0 with no
independent variables
in the model, all of the
cases are predicted to
be members of the
modal group, 1=Yes in
this example.
The proportion in the largest group is
60.4% or .604. The proportion in the
other group is 1.0 – 0.604 = .396.
The proportional by chance accuracy rate was
computed by calculating the proportion of cases for
each group based on the number of cases in each
group in the classification table at Step 0, and then
squaring and summing the proportion of cases in
each group (.396² + .604² = .521).
Slide 82
Output for the usefulness of the model based on classification accuracy
To be characterized as a useful model, the accuracy rate
should be 25% higher than the by chance accuracy
rate.
The by chance accuracy criteria is computed by
multiplying the by chance accuracy rate of .521 by
1.25: 1.25 x .521 = .652 (65.2%).
The classification accuracy rate computed by SPSS
was 78.0% which was greater than or equal to
the proportional by chance accuracy criteria of
65.2% (1.25 x 52.1% = 65.2%).
The criteria for classification accuracy is satisfied.
Slide 83
Marking the statement for usefulness of the model
Since the criteria for classification
accuracy was satisfied, the check
box is marked.
Slide 84
Standard Binary Logistic Regression: Level of Measurement
Level of measurement ok?
• No → Do not mark check box for level of measurement. Mark: Inappropriate
application of the statistic. Stop.
• Yes → Mark check box for level of measurement, then ask:
Ordinal level variable treated as metric?
• Yes → Consider limitation in discussion of findings.
• No → Continue.
Slide 85
Standard Binary Logistic Regression: Exclude Outliers
Run baseline binary logistic regression, including all cases, requesting standardized
residuals. Then run revised binary logistic regression, excluding outliers (standardized
residuals of 2.58 or more in absolute value).
Accuracy rate for revised model ≥ accuracy rate for baseline model + 2%?
• Yes → Mark check box for excluding outliers. Interpret revised model.
• No → Do not mark check box for excluding outliers. Interpret baseline model.
Slide 86
Standard Binary Logistic Regression: Multicollinearity and Sample Size
Multicollinearity/numerical problems (S.E. > 2.0)?
• Yes → Do not mark check box for no multicollinearity. Stop.
• No → Mark check box for no multicollinearity, then ask:
Adequate sample size (number of IVs x 10)?
• No → Do not mark check box for sample size. Consider limitation in discussion of
findings.
• Yes → Mark check box for sample size.
Slide 87
Standard Binary Logistic Regression: Overall Relationship
Probability of model chi-square ≤ α?
• No → Do not mark check box for overall relationship. Stop.
• Yes → Mark check box for overall relationship.
Slide 88
Standard Binary Logistic Regression: Individual Relationships
Individual relationship (Wald Sig ≤ α)?
• No → Do not mark check box for individual relationship.
• Yes → Correct interpretation of direction and strength of relationship?
  • No → Do not mark check box for individual relationship.
  • Yes → Mark check box for individual relationship.
Additional individual relationships to interpret?
• Yes → Repeat for the next relationship.
• No → Continue.
Slide 89
Standard Binary Logistic Regression: Classification Accuracy
Classification accuracy > 1.25 x by chance accuracy rate?
• No → Do not mark check box for classification accuracy.
• Yes → Mark check box for classification accuracy.