Slide 1
Multiple Regression – Assumptions and
Outliers
Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems
Multiple Regression and Assumptions
Slide 2

Multiple regression is most effective at identifying
relationships between a dependent variable and a
combination of independent variables when its
underlying assumptions are satisfied: each of the
metric variables is normally distributed, the
relationships between metric variables are linear,
and the relationship between metric and
dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that
our answer is wrong. It means that our solution may
under-report the strength of the relationships.
Multiple Regression and Outliers
Slide 3


Outliers can distort the regression results. When an
outlier is included in the analysis, it pulls the
regression line towards itself. This can result in a
solution that is more accurate for the outlier, but
less accurate for all of the other cases in the data
set.
We will check for univariate outliers on the
dependent variable and multivariate outliers on the
independent variables.
Relationship between assumptions and outliers
Slide 4



The problems of satisfying assumptions and detecting
outliers are intertwined. For example, if a case has
a value on the dependent variable that is an outlier,
it will affect the skew, and hence, the normality of
the distribution.
Removing an outlier may improve the distribution of
a variable.
Transforming a variable may reduce the likelihood
that the value for a case will be characterized as an
outlier.
Order of analysis is important
Slide 5

The order in which we check assumptions and detect
outliers will affect our results because we may get a
different subset of cases in the final analysis.
In order to maximize the number of cases available
to the analysis, we will evaluate assumptions first.
We will substitute any transformations of variables
that enable us to satisfy the assumptions.
We will use any transformed variables that are
required in our analysis to detect outliers.
Strategy for solving problems
Slide 6

Our strategy for solving problems about violations of
assumptions and outliers will include the following steps:
1. Run the type of regression specified in the problem statement on the
   variables, using the full data set.
2. Test the dependent variable for normality. If it does not satisfy the
   criteria for normality unless transformed, substitute the transformed
   variable in the remaining tests that call for the use of the dependent
   variable.
3. Test for normality, linearity, and homoscedasticity using the scripts.
   Decide which transformations should be used.
4. Substitute transformations and run the regression entering all
   independent variables, saving studentized residuals and Mahalanobis
   distance scores. Compute probabilities for D².
5. Remove the outliers (studentized residual greater than 3 or
   Mahalanobis D² with p <= 0.001), and run the regression with the
   method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting
   outliers (step 5) to the R² obtained for the model using all data and
   original variables (step 1), as sketched below.
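For readers who want to mirror steps 1 and 6 outside SPSS, here is a
minimal Python sketch of the before/after R² comparison. It is an
illustration only: the data are synthetic stand-ins (not GSS2000.sav),
the column names merely imitate the GSS variables, and statsmodels has
no built-in stepwise selection, so all predictors are entered at once.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Synthetic stand-in data: a count of earners, an ordinal income
    # code, a dichotomous sex code, and a DV with a log-shaped link.
    rng = np.random.default_rng(0)
    n = 250
    df = pd.DataFrame({
        "earnrs": rng.poisson(1.4, n),
        "rincom98": rng.integers(1, 24, n),
        "sex": rng.integers(1, 3, n),
    })
    df["income98"] = (5 * np.log10(1 + df["earnrs"])
                      + 0.5 * df["rincom98"] + rng.normal(0, 2, n))

    def r_squared(data, dv, ivs):
        # Steps 1 and 6: fit OLS with all predictors entered, return R².
        subset = data[[dv] + ivs].dropna()
        X = sm.add_constant(subset[ivs])
        return sm.OLS(subset[dv], X).fit().rsquared

    baseline = r_squared(df, "income98", ["sex", "earnrs", "rincom98"])
    df["logearn"] = np.log10(1 + df["earnrs"])  # step 4: substitute
    final = r_squared(df, "income98", ["sex", "logearn", "rincom98"])
    print(f"R² before: {baseline:.3f}  after: {final:.3f}  "
          f"change: {100 * (final - baseline):.1f} points")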
Transforming dependent variables
Slide 7

We will use the following logic to transform variables:

If the dependent variable is not normally distributed:
- Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies the
  normality criteria.
- If no transformation satisfies the normality criteria,
  use the untransformed variable and add a caution for
  violation of assumption.
If a transformation satisfies normality, use the
transformed variable in the tests of the independent
variables.
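The same selection logic fits in a few lines of Python. This is a
hedged sketch, not the course script: it assumes a nonnegative numeric
array (so the log and square root forms used on these slides are
defined), and scipy's skewness and kurtosis formulas differ slightly
from the ones SPSS reports.

    import numpy as np
    from scipy.stats import kurtosis, skew

    def first_normalizing_transform(x):
        # Try the deck's order of preference: log, square root, inverse.
        candidates = {
            "log": np.log10(1 + x),         # LG10(1 + X)
            "square root": np.sqrt(1 + x),  # SQRT(1 + X)
            "inverse": -1.0 / (1 + x),      # -1/(1 + X)
        }
        for name, values in candidates.items():
            # Normality criterion used in the slides: skewness and
            # (excess) kurtosis both between -1.0 and +1.0.
            if abs(skew(values)) <= 1.0 and abs(kurtosis(values)) <= 1.0:
                return name, values
        return None, x  # none works: keep the variable, add a caution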
Transforming independent variables - 1
Slide 8

If the independent variable is normally distributed and
linearly related to the dependent variable, use it as is.
If the independent variable is normally distributed but
not linearly related to the dependent variable:
- Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the linearity criteria and does not
  violate the normality criteria.
- If no transformation satisfies the linearity criteria
  without violating the normality criteria, use the
  untransformed variable and add a caution for
  violation of assumption.
Transforming independent variables - 2
Slide 9

If the independent variable is linearly related to the
dependent variable but not normally distributed:
- Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies the
  normality criteria and still has a significant
  correlation with the dependent variable.
- If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of
  assumption.
Transforming independent variables - 3
Slide 10

If the independent variable is not linearly related to the
dependent variable and not normally distributed:
- Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the normality criteria and has a
  significant correlation.
- If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of
  assumption.
Impact of transformations
and omitting outliers
Slide 11

We evaluate the regression assumptions and detect
outliers with a view toward strengthening the
relationship.
This may not happen. The regression may be the
same, it may be weaker, or it may be stronger. We
cannot be certain of the impact until we run the
regression again.
In the end, we may opt not to exclude outliers and
not to employ transformations; the analysis informs
us of the consequences of doing either.
Notes
Slide 12



Whenever you start a new problem, make sure you
have removed variables created for previous analysis
and have included all cases back into the data set.
I have added the square transformation to the
checkboxes for transformations in the normality
script. Since this is an option for linearity, we need
to be able to evaluate its impact on normality.
If you change the options for output in pivot tables
from labels to names, you will get an error message
when you use the linearity script. To solve the
problem, change the option for output in pivot tables
back to labels.
Problem 1
Slide 13
In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
Slide 14

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.01 as alpha for the
regression analysis as well as for testing assumptions.
Dissecting problem 1 - 2
Slide 15

The method for selecting variables is derived from the research
question. In this problem we are asked to identify the best subset of
predictors, so we do a stepwise multiple regression.
Dissecting problem 1 - 3
Slide 16

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 10.8% higher than a regression analysis using the original
format for all variables and including all cases.
R² before transformations or removing outliers
Slide 17
To start out, we run a
stepwise multiple regression
analysis with income98 as
the dependent variable and
sex, earnrs, and rincom98
as the independent
variables.
We select stepwise as
the method to select the
best subset of predictors.
R² before transformations or removing outliers
Slide 18
Prior to any transformations of variables
to satisfy the assumptions of multiple
regression or removal of outliers, the
proportion of variance in the dependent
variable explained by the independent
variables (R²) was 51.1%. This is the
benchmark that we will use to evaluate
the utility of transformations and the
elimination of outliers.
R² before transformations or removing outliers
Slide 19
For this particular question, we are not interested in the
statistical significance of the overall relationship prior to
transformations and removing outliers. In fact, it is
possible that the relationship is not statistically significant
due to variables that are not normal, relationships that
are not linear, and the inclusion of outliers.
Slide 20
Normality of the dependent variable:
total family income

In evaluating assumptions, the first step is to examine the normality
of the dependent variable. If it is not normally distributed, or cannot
be normalized with a transformation, it can affect the relationships
with all other variables.
To test the normality of the dependent variable, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the list box of
variables to test.
Second, click on the OK button to produce the output.
Slide 21
Normality of the dependent variable:
total family income

Descriptives for TOTAL FAMILY INCOME:

                                 Statistic   Std. Error
Mean                               15.67       .349
95% CI for Mean, Lower Bound       14.98
95% CI for Mean, Upper Bound       16.36
5% Trimmed Mean                    15.95
Median                             17.00
Variance                           27.951
Std. Deviation                     5.287
Minimum                            1
Maximum                            23
Range                              22
Interquartile Range                8.00
Skewness                           -.628       .161
Kurtosis                           -.248       .320

The dependent variable "total family income" [income98] satisfies the
criteria for a normal distribution. The skewness (-0.628) and kurtosis
(-0.248) were both between -1.0 and +1.0. No transformation is
necessary.
Slide 22
Linearity and independent variable:
how many in family earned money

To evaluate the linearity of the relationship between number of earners
and total family income, run the script for the assumption of linearity:
LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the
dependent variable.
Second, move the independent variable, EARNRS, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Slide 23
Linearity and independent variable:
how many in family earned money

Correlations with TOTAL FAMILY INCOME (Pearson r, 2-tailed
significance, N = 228):

HOW MANY IN FAMILY EARNED MONEY           .505**   <.001
Logarithm of EARNRS [LG10(1+EARNRS)]      .536**   <.001
Square of EARNRS [(EARNRS)**2]            .376**   <.001
Square Root of EARNRS [SQRT(1+EARNRS)]    .527**   <.001
Inverse of EARNRS [-1/(1+EARNRS)]         .526**   <.001
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "how many in family earned money" [earnrs]
satisfies the criteria for the assumption of linearity with the
dependent variable "total family income" [income98], but does not
satisfy the assumption of normality. The evidence of linearity in the
relationship between the independent variable "how many in family
earned money" [earnrs] and the dependent variable "total family income"
[income98] was the statistical significance of the correlation
coefficient (r = 0.505). The probability for the correlation
coefficient was <0.001, less than or equal to the level of significance
of 0.01. We reject the null hypothesis that r = 0 and conclude that
there is a linear relationship between the variables.
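An equivalent check can be sketched with scipy. It is illustrative
only: y and x are assumed to be numeric arrays with missing values
already dropped, and x is assumed bounded below so the log and square
root forms are defined (for negatively skewed variables the slides
reflect first, e.g. LG10(24-RINCOM98)).

    import numpy as np
    from scipy.stats import pearsonr

    def linearity_check(y, x, alpha=0.01):
        # Correlate the DV with the raw IV and the candidate transforms;
        # a significant Pearson r is the slides' evidence of linearity.
        forms = {
            "raw": x,
            "log": np.log10(1 + x),
            "square": x ** 2,
            "square root": np.sqrt(1 + x),
            "inverse": -1.0 / (1 + x),
        }
        for name, values in forms.items():
            r, p = pearsonr(y, values)
            verdict = "significant" if p <= alpha else "not significant"
            print(f"{name:12s} r = {r:+.3f}  p = {p:.4f}  {verdict}")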
Slide 24
Normality of independent variable:
how many in family earned money

After evaluating the dependent variable, we examine the normality of
each metric variable and linearity of its relationship with the
dependent variable.
To test the normality of number of earners in family, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the independent variable EARNRS to the list box of
variables to test.
Second, click on the OK button to produce the output.
Slide 25
Normality of independent variable:
how many in family earned money

Descriptives for HOW MANY IN FAMILY EARNED MONEY:

                                 Statistic   Std. Error
Mean                               1.43        .061
95% CI for Mean, Lower Bound       1.31
95% CI for Mean, Upper Bound       1.56
5% Trimmed Mean                    1.37
Median                             1.00
Variance                           1.015
Std. Deviation                     1.008
Minimum                            0
Maximum                            5
Range                              5
Interquartile Range                1.00
Skewness                           .742        .149
Kurtosis                           1.324       .296

The independent variable "how many in family earned money" [earnrs]
satisfies the criteria for the assumption of linearity with the
dependent variable "total family income" [income98], but does not
satisfy the assumption of normality.
In evaluating normality, the skewness (0.742) was between -1.0 and
+1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.
Slide 26
Normality of independent variable:
how many in family earned money

The logarithmic transformation improves the normality of "how many in
family earned money" [earnrs] without a reduction in the strength of
the relationship to "total family income" [income98]. In evaluating
normality, the skewness (-0.483) and kurtosis (-0.309) were both within
the range of acceptable values from -1.0 to +1.0. The correlation
coefficient for the transformed variable is 0.536.
The square root transformation also has values of skewness and kurtosis
in the acceptable range. However, by our order of preference for which
transformation to use, the logarithm is preferred.
Transformation for how many in family
earned money
Slide 27



The independent variable, how many in family
earned money, had a linear relationship to the
dependent variable, total family income.
The logarithmic transformation improves the
normality of "how many in family earned money"
[earnrs] without a reduction in the strength of the
relationship to "total family income" [income98].
We will substitute the logarithmic transformation of
how many in family earned money in the regression
analysis.
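In Python the substitution amounts to one line. This assumes the data
have been loaded into a pandas DataFrame df (for example with
pyreadstat) and that the column is named earnrs; both names are
assumptions for illustration.

    import numpy as np

    # Logarithm of EARNRS [LG10(1+EARNRS)]: the transformed variable
    # the slides substitute for the raw count of earners.
    df["logearn"] = np.log10(1 + df["earnrs"])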
Slide 28
Normality of independent variable:
respondent’s income

After evaluating the dependent variable, we examine the normality of
each metric variable and linearity of its relationship with the
dependent variable.
To test the normality of respondent’s income, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the independent variable RINCOM98 to the list box of
variables to test.
Second, click on the OK button to produce the output.
Slide 29
Normality of independent variable:
respondent’s income

Descriptives for RESPONDENTS INCOME:

                                 Statistic   Std. Error
Mean                               13.35       .419
95% CI for Mean, Lower Bound       12.52
95% CI for Mean, Upper Bound       14.18
5% Trimmed Mean                    13.54
Median                             15.00
Variance                           29.535
Std. Deviation                     5.435
Minimum                            1
Maximum                            23
Range                              22
Interquartile Range                8.00
Skewness                           -.686       .187
Kurtosis                           -.253       .373

The independent variable "income" [rincom98] satisfies the criteria for
both the assumption of normality and the assumption of linearity with
the dependent variable "total family income" [income98].
In evaluating normality, the skewness (-0.686) and kurtosis (-0.253)
were both within the range of acceptable values from -1.0 to +1.0.
Slide 30
Linearity and independent variable:
respondent’s income

To evaluate the linearity of the relationship between respondent’s
income and total family income, run the script for the assumption of
linearity:
LinearityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the
dependent variable.
Second, move the independent variable, RINCOM98, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Slide 31
Linearity and independent variable:
respondent’s income

Correlations with TOTAL FAMILY INCOME (Pearson r, 2-tailed
significance, N = 163):

RESPONDENTS INCOME                             .577**   <.001
Logarithm of RINCOM98 [LG10(24-RINCOM98)]     -.595**   <.001
Square of RINCOM98 [(RINCOM98)**2]             .613**   <.001
Square Root of RINCOM98 [SQRT(24-RINCOM98)]   -.601**   <.001
Inverse of RINCOM98 [-1/(24-RINCOM98)]        -.434**   <.001
**. Correlation is significant at the 0.01 level (2-tailed).

The evidence of linearity in the relationship between the independent
variable "income" [rincom98] and the dependent variable "total family
income" [income98] was the statistical significance of the correlation
coefficient (r = 0.577). The probability for the correlation
coefficient was <0.001, less than or equal to the level of significance
of 0.01. We reject the null hypothesis that r = 0 and conclude that
there is a linear relationship between the variables.
Homoscedasticity: sex
Slide 32

To evaluate the homoscedasticity of the relationship between sex and
total family income, run the script for the assumption of homogeneity
of variance:
HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable INCOME98 to the text box for the
dependent variable.
Second, move the independent variable, SEX, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Slide 33

Based on the Levene Test, the variance in "total family income"
[income98] is homogeneous for the categories of "sex" [sex]. The
probability associated with the Levene Statistic (0.031) is greater
than the level of significance (0.01), so we fail to reject the null
hypothesis and conclude that the homoscedasticity assumption is
satisfied.
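The same test can be reproduced with scipy, assuming a pandas DataFrame
df with GSS-style coding of sex (1 = male, 2 = female); the names here
are assumptions for illustration. Note that scipy's levene defaults to
the median-centered (Brown-Forsythe) variant, so center="mean" is
passed to match the classic Levene statistic.

    from scipy.stats import levene

    # Split total family income by the two categories of sex.
    males = df.loc[df["sex"] == 1, "income98"].dropna()
    females = df.loc[df["sex"] == 2, "income98"].dropna()
    stat, p = levene(males, females, center="mean")
    print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")
    # p greater than 0.01: fail to reject equal variances, so the
    # homoscedasticity assumption is satisfied.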
Adding a transformed variable
Slide 34

Even though we do not need a transformation for any of the variables in
this analysis, we will demonstrate how to use a script, such as the
normality script, to add a transformed variable to the data set, e.g. a
logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of
variables to test.
Second, mark the checkbox for the transformation we want to add to the
data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the
data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
Slide 35
If we scroll to the extreme
right in the data editor, we
see that the transformed
variable has been added to
the data set.
Whenever we add
transformed variables to
the data set, we should be
sure to delete them before
starting another analysis.
The regression to identify outliers
Slide 36
We use the regression procedure
to identify both univariate and
multivariate outliers.
We start with the same dialog we
used for the last analysis, in which
income98 was the dependent
variable and sex, earnrs, and
rincom98 were the independent
variables.
First, we substitute the
logarithmic transformation of
earnrs, logearn, into the list
of independent variables.
Second, we change the
method of entry from
Stepwise to Enter so that all
variables will be included in
the detection of outliers.
Third, we want to save the
calculated values of the outlier
statistics to the data set.
Click on the Save… button to
specify what we want to save.
Saving the measures of outliers
Slide 37
First, mark the checkbox for
Studentized residuals in the
Residuals panel. Studentized
residuals are z-scores computed
for a case based on the data for
all other cases in the data set.
Second, mark the checkbox for
Mahalanobis in the Distances
panel. This will compute
Mahalanobis distances for the
set of independent variables.
Third, click on
the OK button to
complete the
specifications.
The variables for identifying outliers
Slide 38

The variable for identifying univariate outliers for the dependent
variable is in a column which SPSS has named sre_1.
The variable for identifying multivariate outliers for the independent
variables is in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
Slide 39
To compute the probability
of D², we will use an SPSS
function in a Compute
command.
First, select the
Compute… command
from the Transform
menu.
Formula for probability for Mahalanobis D²
Slide 40

First, in the target variable text box, type the name "p_mah_1" as an
acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type
the name of the variable containing the D² scores, mah_1, followed by a
comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the
cumulative probability from the left end of the distribution up through
a given value, we subtract it from 1 to obtain the probability in the
upper tail of the distribution.
Third, click on the OK button to signal completion of the compute
variable dialog.
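Outside SPSS, the same upper-tail probability is the chi-square
survival function. A minimal scipy sketch (the example D² of 16.97 is
the value reported later for Problem 2's multivariate outlier):

    from scipy.stats import chi2

    def mahalanobis_p(d_squared, n_ivs=3):
        # Equivalent of the SPSS expression 1 - CDF.CHISQ(mah_1, 3):
        # upper-tail probability of D² on df = number of IVs.
        return chi2.sf(d_squared, df=n_ivs)  # sf(x) = 1 - cdf(x)

    print(mahalanobis_p(16.97))  # ~0.0007, an outlier at p <= 0.001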
Multivariate outliers
Slide 41

Using the probabilities computed in p_mah_1 to identify outliers,
scroll down through the list of cases to see if we can find cases with
a probability less than 0.001.
There are no outliers for the set of independent variables.
Univariate outliers
Slide 42

Similarly, we can scroll down the values of sre_1, the studentized
residual, to find outliers with values larger than ±3.0.
Based on this criterion, there are 4 cases that have a score on the
dependent variable that is sufficiently unusual to be considered
outliers (case 20000357: studentized residual=3.08; case 20000416:
studentized residual=3.57; case 20001379: studentized residual=3.27;
case 20002702: studentized residual=-3.23).
Omitting the outliers
Slide 43

To omit the outliers from the analysis, we select the cases that are
not outliers.
First, select the Select Cases… command from the Transform menu.
Specifying the condition to omit outliers
Slide 44
First, mark the If
condition is satisfied
option button to
indicate that we will
enter a specific
condition for
including cases.
Second, click on the
If… button to specify
the criteria for inclusion
in the analysis.
The formula for omitting outliers
Slide 45

To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized
residual (regardless of sign) is less than 3 and the probability for
Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the
dialog box.
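The equivalent selection in Python, assuming a pandas DataFrame df that
contains the sre_1 and p_mah_1 columns created above, is a one-line
boolean filter:

    # Keep a case only if |studentized residual| < 3 AND the
    # Mahalanobis probability exceeds 0.001, mirroring the Select
    # Cases formula entered in SPSS.
    keep = (df["sre_1"].abs() < 3) & (df["p_mah_1"] > 0.001)
    df_no_outliers = df[keep]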
Completing the request for the selection
Slide 46
To complete the
request, we click on
the OK button.
The omitted outliers
Slide 47

SPSS identifies the excluded cases by drawing a slash mark through the
case number. Most of the slashes are for cases with missing data, but
we also see that the outlier cases identified above are included in
those that will be omitted.
Running the regression without outliers
Slide 48
We run the regression again,
excluding the outliers.
Select the Regression |
Linear command from the
Analyze menu.
Opening the save options dialog
Slide 49
We specify the dependent
and independent variables,
substituting any transformed
variables required by
assumptions.
When we used regression to
detect outliers, we entered
all variables. Now we are
testing the relationship
specified in the problem, so
we change the method to
Stepwise.
On our last run, we
instructed SPSS to save
studentized residuals and
Mahalanobis distance. To
prevent these values from
being calculated again, click
on the Save… button.
Clearing the request to save outlier data
Slide 50

First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Opening the statistics options dialog
Slide 51
Once we have removed outliers,
we need to check the sample
size requirement for regression.
Since we will need the
descriptive statistics for this,
click on the Statistics… button.
Requesting descriptive statistics
Slide 52
First, mark the checkbox
for Descriptives.
Second, click on
the Continue
button to
complete the
specifications.
Requesting the output
Slide 53
Having specified the
output needed for the
analysis, we click on
the OK button to obtain
the regression output.
Sample size requirement
Slide 54

The minimum ratio of valid cases to independent variables for stepwise
multiple regression is 5 to 1. After removing 4 outliers, there are 159
valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 53.0
to 1, which satisfies the minimum requirement. In addition, the ratio
of 53.0 to 1 satisfies the preferred ratio of 50 to 1.

Descriptive Statistics:

                                        Mean     Std. Deviation    N
TOTAL FAMILY INCOME                     17.09        4.073        159
RESPONDENTS SEX                          1.55         .499        159
RESPONDENTS INCOME                      13.76        5.133        159
Logarithm of EARNRS [LG10(1+EARNRS)]   .424896     .1156559       159
Significance of regression relationship
Slide 55

ANOVA(d)

Model              Sum of Squares    df    Mean Square      F       Sig.
1  Regression          1122.398       1      1122.398    117.541   .000a
   Residual            1499.187     157         9.549
   Total               2621.585     158
2  Regression          1572.722       2       786.361    116.957   .000b
   Residual            1048.863     156         6.723
   Total               2621.585     158
3  Regression          1623.976       3       541.325     84.107   .000c
   Residual             997.609     155         6.436
   Total               2621.585     158

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME

The probability of the F statistic (84.107) for the regression
relationship which includes these variables is <0.001, less than or
equal to the level of significance of 0.01. We reject the null
hypothesis that there is no relationship between the best subset of
independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically
significant relationship between the best subset of independent
variables and the dependent variable.
Increase in proportion of variance
Slide 56

Model Summary

Model     R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .654a      .428           .424                    3.090
2       .775b      .600           .595                    2.593
3       .787c      .619           .612                    2.537

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX

Prior to any transformations of variables to satisfy the assumptions of
multiple regression or removal of outliers, the proportion of variance
in the dependent variable explained by the independent variables (R²)
was 51.1%.
After transformed variables were substituted to satisfy assumptions and
outliers were removed from the sample, the proportion of variance
explained by the regression analysis was 61.9%, a difference of 10.8%.
The answer to the question is true with caution. A caution is added
because of the inclusion of ordinal level variables.
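The arithmetic behind the answer is simply the difference between the
two R² values reported above:

    # Benchmark R² from the first run versus the final model's R².
    r2_baseline, r2_final = 0.511, 0.619
    print(f"Increase: {100 * (r2_final - r2_baseline):.1f}%")  # 10.8%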
Problem 2
Slide 57
In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 1
Slide 58

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.05 as alpha for the
regression analysis and the more conservative 0.01 as the alpha in
testing assumptions.
Dissecting problem 2 - 2
Slide 59

The method for selecting variables is derived from the research
question. If we are asked to examine a relationship without any
statement about control variables or the best subset of variables, we
do a standard multiple regression.
Dissecting problem 2 - 3
Slide 60

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 3.6% higher than a regression analysis using the original
format for all variables and including all cases.
R² before transformations or removing outliers
Slide 61
To start out, we run a
standard multiple
regression analysis with
prestg80 as the dependent
variable and age, educ, and
sex as the independent
variables.
R² before transformations or removing outliers
Slide 62

Prior to any transformations of variables to satisfy the assumptions of
multiple regression or removal of outliers, the proportion of variance
in the dependent variable explained by the independent variables (R²)
was 27.1%. This is the benchmark that we will use to evaluate the
utility of transformations and the elimination of outliers.
For this particular question, we are not interested in the statistical
significance of the overall relationship prior to transformations and
removing outliers. In fact, it is possible that the relationship is not
statistically significant due to variables that are not normal,
relationships that are not linear, and the inclusion of outliers.
Normality of the dependent variable
Slide 63

In evaluating assumptions, the first step is to examine the normality
of the dependent variable. If it is not normally distributed, or cannot
be normalized with a transformation, it can affect the relationships
with all other variables.
To test the normality of the dependent variable, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the list box of
variables to test.
Second, click on the OK button to produce the output.
Normality of the dependent variable
Slide 64
The dependent variable "occupational prestige
score" [prestg80] satisfies the criteria for a
normal distribution. The skewness (0.401) and
kurtosis (-0.630) were both between -1.0 and
+1.0. No transformation is necessary.
Normality of independent variable: Age
Slide 65

After evaluating the dependent variable, we examine the normality of
each metric variable and linearity of its relationship with the
dependent variable.
To test the normality of age, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the independent variable AGE to the list box of variables
to test.
Second, click on the OK button to produce the output.
Normality of independent variable: Age
Slide 66

Descriptives for AGE OF RESPONDENT:

                                 Statistic   Std. Error
Mean                               45.99       1.023
95% CI for Mean, Lower Bound       43.98
95% CI for Mean, Upper Bound       48.00
5% Trimmed Mean                    45.31
Median                             43.50
Variance                           282.465
Std. Deviation                     16.807
Minimum                            19
Maximum                            89
Range                              70
Interquartile Range                24.00
Skewness                           .595        .148
Kurtosis                           -.351       .295

The independent variable "age" [age] satisfies the criteria for the
assumption of normality, but does not satisfy the assumption of
linearity with the dependent variable "occupational prestige score"
[prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351)
were both within the range of acceptable values from -1.0 to +1.0.
Linearity and independent variable: Age
Slide 67

To evaluate the linearity of the relationship between age and
occupational prestige, run the script for the assumption of linearity:
LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the
dependent variable.
Second, move the independent variable, AGE, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Linearity and independent variable: Age
Slide 68

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) (Pearson r,
2-tailed significance, N = 255):

AGE OF RESPONDENT                    .024     .706
Logarithm of AGE [LG10(AGE)]         .059     .348
Square of AGE [(AGE)**2]            -.004     .956
Square Root of AGE [SQRT(AGE)]       .041     .518
Inverse of AGE [-1/(AGE)]            .096     .128

The evidence of nonlinearity in the relationship between the
independent variable "age" [age] and the dependent variable
"occupational prestige score" [prestg80] was the lack of statistical
significance of the correlation coefficient (r = 0.024). The
probability for the correlation coefficient was 0.706, greater than the
level of significance of 0.01. We cannot reject the null hypothesis
that r = 0, and cannot conclude that there is a linear relationship
between the variables.
Since none of the transformations to improve linearity were successful,
it is an indication that the problem may be a weak relationship, rather
than a curvilinear relationship correctable by using a transformation.
A weak relationship is not a violation of the assumption of linearity,
and does not require a caution.
Transformation for Age
Slide 69



The independent variable age satisfied the criteria
for normality.
The independent variable age did not have a linear
relationship to the dependent variable occupational
prestige. However, none of the transformations
linearized the relationship.
No transformation will be used - it would not help
linearity and is not needed for normality.
Slide 70
Linearity and independent variable:
Highest year of school completed

To evaluate the linearity of the relationship between highest year of
school and occupational prestige, run the script for the assumption of
linearity:
LinearityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the
dependent variable.
Second, move the independent variable, EDUC, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Slide 71
Linearity and independent variable:
Highest year of school completed

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) (Pearson r,
2-tailed significance, N = 254):

HIGHEST YEAR OF SCHOOL COMPLETED         .495**   <.001
Logarithm of EDUC [LG10(21-EDUC)]       -.512**   <.001
Square of EDUC [(EDUC)**2]               .528**   <.001
Square Root of EDUC [SQRT(21-EDUC)]     -.518**   <.001
Inverse of EDUC [-1/(21-EDUC)]          -.423**   <.001
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "highest year of school completed" [educ]
satisfies the criteria for the assumption of linearity with the
dependent variable "occupational prestige score" [prestg80], but does
not satisfy the assumption of normality. The evidence of linearity in
the relationship between the independent variable "highest year of
school completed" [educ] and the dependent variable "occupational
prestige score" [prestg80] was the statistical significance of the
correlation coefficient (r = 0.495). The probability for the
correlation coefficient was <0.001, less than or equal to the level of
significance of 0.01. We reject the null hypothesis that r = 0 and
conclude that there is a linear relationship between the variables.
Slide 72
Normality of independent variable:
Highest year of school completed

To test the normality of EDUC, highest year of school completed, run
the script:
NormalityAssumptionAndTransformations.SBS
First, move the independent variable EDUC to the list box of variables
to test.
Second, click on the OK button to produce the output.
Slide 73
Normality of independent variable:
Highest year of school completed

Descriptives for HIGHEST YEAR OF SCHOOL COMPLETED:

                                 Statistic   Std. Error
Mean                               13.12       .179
95% CI for Mean, Lower Bound       12.77
95% CI for Mean, Upper Bound       13.47
5% Trimmed Mean                    13.14
Median                             13.00
Variance                           8.583
Std. Deviation                     2.930
Minimum                            2
Maximum                            20
Range                              18
Interquartile Range                3.00
Skewness                           -.137       .149
Kurtosis                           1.246       .296

In evaluating normality, the skewness (-0.137) was between -1.0 and
+1.0, but the kurtosis (1.246) was outside the range from -1.0 to +1.0.
None of the transformations for normalizing the distribution of
"highest year of school completed" [educ] were effective.
Transformation for highest year of school
Slide 74



The independent variable, highest year of school,
had a linear relationship to the dependent variable,
occupational prestige.
The independent variable, highest year of school, did
not satisfy the criteria for normality. None of the
transformations for normalizing the distribution of
"highest year of school completed" [educ] were
effective.
No transformation will be used - it would not help
normality and is not needed for linearity. A caution
should be added to any findings.
Homoscedasticity: sex
Slide 75

To evaluate the homoscedasticity of the relationship between sex and
occupational prestige, run the script for the assumption of homogeneity
of variance:
HomoscedasticityAssumptionAndTransformations.SBS
First, move the dependent variable PRESTG80 to the text box for the
dependent variable.
Second, move the independent variable, SEX, to the list box for
independent variables.
Third, click on the OK button to produce the output.
Homoscedasticity: sex
Slide 76
Based on the Levene Test, the
variance in "occupational
prestige score" [prestg80] is
homogeneous for the categories
of "sex" [sex]. The probability
associated with the Levene
Statistic (0.808) is greater than
the level of significance, so we
fail to reject the null hypothesis
and conclude that the
homoscedasticity assumption is
satisfied.
Even if we violate the
assumption, we would not do a
transformation since it could
impact the relationships of the
other independent variables
with the dependent variable.
Adding a transformed variable
Slide 77

Even though we do not need a transformation for any of the variables in
this analysis, we will demonstrate how to use a script, such as the
normality script, to add a transformed variable to the data set, e.g. a
logarithmic transformation for highest year of school.
First, move the variable that we want to transform to the list box of
variables to test.
Second, mark the checkbox for the transformation we want to add to the
data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the
data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
The transformed variable in the data editor
Slide 78
If we scroll to the extreme
right in the data editor, we
see that the transformed
variable has been added to
the data set.
Whenever we add
transformed variables to
the data set, we should be
sure to delete them before
starting another analysis.
The regression to identify outliers
Slide 79

We can use the regression procedure to identify both univariate and
multivariate outliers.
We start with the same dialog we used for the last analysis, in which
prestg80 was the dependent variable and age, educ, and sex were the
independent variables. If we needed to use any transformed variables,
we would substitute them now.
We will save the calculated values of the outlier statistics to the
data set. Click on the Save… button to specify what we want to save.
Saving the measures of outliers
Slide 80
First, mark the checkbox for
Studentized residuals in the
Residuals panel. Studentized
residuals are z-scores computed
for a case based on the data for
all other cases in the data set.
Second, mark the checkbox for
Mahalanobis in the Distances
panel. This will compute
Mahalanobis distances for the
set of independent variables.
Third, click on
the OK button to
complete the
specifications.
The variables for identifying outliers
Slide 81

The variable for identifying univariate outliers for the dependent
variable is in a column which SPSS has named sre_1.
The variable for identifying multivariate outliers for the independent
variables is in a column which SPSS has named mah_1.
Computing the probability for Mahalanobis D²
Slide 82
To compute the probability
of D², we will use an SPSS
function in a Compute
command.
First, select the
Compute… command
from the Transform
menu.
Formula for probability for Mahalanobis D²
Slide 83

First, in the target variable text box, type the name "p_mah_1" as an
acronym for the probability of mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type
the name of the variable containing the D² scores, mah_1, followed by a
comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the
cumulative probability from the left end of the distribution up through
a given value, we subtract it from 1 to obtain the probability in the
upper tail of the distribution.
Third, click on the OK button to signal completion of the compute
variable dialog.
The multivariate outlier
Slide 84

Using the probabilities computed in p_mah_1 to identify outliers,
scroll down through the list of cases to see the one case with a
probability less than 0.001.
There is 1 case that has a combination of scores on the independent
variables that is sufficiently unusual to be considered an outlier
(case 20001984: Mahalanobis D²=16.97, p=0.0007).
The univariate outlier
Slide 85
Similarly, we can scroll down the values of
sre_1, the studentized residual to see the
one outlier with a value larger than 3.0.
There is 1 case that has a score on the
dependent variable that is sufficiently
unusual to be considered an outlier (case
20000391: studentized residual=4.14).
Omitting the outliers
Slide 86

To omit the outliers from the analysis, we select the cases that are
not outliers.
First, select the Select Cases… command from the Transform menu.
Specifying the condition to omit outliers
Slide 87
First, mark the If
condition is satisfied
option button to
indicate that we will
enter a specific
condition for
including cases.
Second, click on the
If… button to specify
the criteria for inclusion
in the analysis.
The formula for omitting outliers
Slide 88

To eliminate the outliers, we request the cases that are not outliers.
The formula specifies that we should include cases if the studentized
residual (regardless of sign) is less than 3 and the probability for
Mahalanobis D² is higher than the level of significance, 0.001.
After typing in the formula, click on the Continue button to close the
dialog box.
Completing the request for the selection
Slide 89
To complete the
request, we click on
the OK button.
The omitted multivariate outlier
Slide 90
SPSS identifies the excluded cases by
drawing a slash mark through the case
number. Most of the slashes are for
cases with missing data, but we also see
that the case with the low probability for
Mahalanobis distance is included in
those that will be omitted.
Running the regression without outliers
Slide 91
We run the regression again,
excluding the outliers.
Select the Regression |
Linear command from the
Analyze menu.
Opening the save options dialog
Slide 92

We specify the dependent and independent variables. If we wanted to use
any transformed variables, we would substitute them now.
On our last run, we instructed SPSS to save studentized residuals and
Mahalanobis distance. To prevent these values from being calculated
again, click on the Save… button.
Clearing the request to save outlier data
Slide 93

First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, click on the OK button to complete the specifications.
Opening the statistics options dialog
Slide 94
Once we have removed outliers,
we need to check the sample
size requirement for regression.
Since we will need the
descriptive statistics for this,
click on the Statistics… button.
Requesting descriptive statistics
Slide 95
First, mark the checkbox
for Descriptives.
Second, click on
the Continue
button to
complete the
specifications.
Requesting the output
Slide 96
Having specified the
output needed for the
analysis, we click on
the OK button to obtain
the regression output.
Sample size requirement
Slide 97
The minimum ratio of valid cases to independent
variables for multiple regression is 5 to 1. After
removing 2 outliers, there are 252 valid cases and
3 independent variables.
The ratio of cases to independent variables for this
analysis is 84.0 to 1, which satisfies the minimum
requirement. In addition, the ratio of 84.0 to 1
satisfies the preferred ratio of 15 to 1.
Significance of regression relationship
Slide 98
The probability of the F statistic (36.639) for the
overall regression relationship is <0.001, less than
or equal to the level of significance of 0.05. We
reject the null hypothesis that there is no
relationship between the set of independent
variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a
statistically significant relationship between the
set of independent variables and the dependent
variable.
Increase in proportion of variance
Slide 99
Prior to any transformations of variables to satisfy
the assumptions of multiple regression or removal
of outliers, the proportion of variance in the
dependent variable explained by the independent
variables (R²) was 27.1%. No transformed
variables were substituted to satisfy assumptions,
but outliers were removed from the sample.
The proportion of variance explained by the
regression analysis after removing outliers was
30.7%, a difference of 3.6%.
The answer to the question
is true with caution.
A caution is added because
of a violation of regression
assumptions.
Impact of assumptions and outliers - 1
Slide 100

The following is a guide to the decision process for answering problems
about the impact of assumptions and outliers on analysis:

Is the dependent variable metric and are the independent variables
metric or dichotomous?
- No: inappropriate application of a statistic.
- Yes: is the ratio of cases to independent variables at least 5 to 1?
  - No: inappropriate application of a statistic.
  - Yes: run the baseline regression and record R² for future
    reference, using the method for including variables identified in
    the research question.
Impact of assumptions and outliers - 2
Slide 101

Is the dependent variable normally distributed?
- No: try (1) a logarithmic transformation, (2) a square root
  transformation, (3) an inverse transformation. If unsuccessful, add a
  caution.
Are the metric IV’s normally distributed and linearly related to the DV?
- No: try (1) a logarithmic transformation, (2) a square root
  transformation, (3) a square transformation where linearity is the
  problem, (4) an inverse transformation. If unsuccessful, add a
  caution.
Is the DV homoscedastic for the categories of the dichotomous IV’s?
- No: add a caution.
Impact of assumptions and outliers - 3
Slide 102

Substituting any transformed variables, run the regression using direct
entry to include all variables, to request the statistics for detecting
outliers.
Are there univariate outliers (DV) or multivariate outliers (IVs)?
- Yes: remove the outliers from the data.
Is the ratio of cases to independent variables still at least 5 to 1?
- No: inappropriate application of a statistic.
- Yes: run the regression again, using transformed variables and
  eliminating outliers.
Impact of assumptions and outliers - 4
Slide 103

Is the probability of the ANOVA test of the regression less than or
equal to the level of significance?
- No: false.
Is the increase in R² correct?
- No: false.
Does the analysis satisfy the ratio for the preferred sample size, 15
to 1 (stepwise: 50 to 1)?
- No: true with caution.
- Yes: continue to the final check.
Impact of assumptions and outliers - 5
Slide 104

Were other cautions added for ordinal variables or violations of
assumptions?
- Yes: true with caution.
- No: true.