
Lecture 6:
Multiple Regression
Laura McAvinue
School of Psychology
Trinity College Dublin
Previous Lectures
• Relationship between two variables
• Correlation
– Measure of strength of association between two variables
• Simple linear regression
– Measure of the ability of one variable (X) to predict the other
variable (Y)
– Computes a regression equation that describes the relationship
between the response variable (Y) and the predictor variable (X)
by expressing Y as a function of X
Multiple Regression
• Used when there is more than one predictor
variable
• Two purposes
– To predict Y, given a combination of predictor
variables
– To assess the relative importance of each predictor
variable in explaining the response variable Y
Regression Equations
Simple Linear Regression: Ŷ = bX + a
Multiple Regression: Ŷ = a + b1X1 + b2X2 + … + bkXk
b1 = Regression coefficient for first predictor variable, X1
b2 = Regression coefficient for second predictor variable, X2
a = Intercept, value of Y when all predictor variables are 0
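As an aside, the same kind of fit can be sketched outside SPSS. Below is a minimal Python illustration using the statsmodels library, with made-up data and hypothetical variable names rather than the lecture's dataset.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: two predictors standing in for X1 and X2
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 4 + 3 * X[:, 0] + 13 * X[:, 1] + rng.normal(size=100)

    X_design = sm.add_constant(X)        # adds the intercept term a
    model = sm.OLS(y, X_design).fit()    # ordinary least squares fit
    print(model.params)                  # [a, b1, b2]
    print(model.rsquared)                # proportion of variance explained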
Statistical Models
• Running a regression analysis is not a simple matter of
inputting data, clicking a button and obtaining a ‘fixed’
model of the data
• You create the model of your data
– Subjective process in many respects
– You shape the model you create
– Your job is to create the model that best describes the data
Multiple Regression
• Assessing the relative contribution of each predictor
variable to the response variable
– Which variable contributes most?
– Which is the second biggest predictor?
– Which variables don’t seem to contribute to prediction?
• Problem
– The order in which you enter the variables into the analysis
influences the model
– Variable entered first is attributed more variance
– By the time the last variable is entered, there might be very little
variance left to explain
Multiple Correlation
[Venn diagram: variance in Y related to X1, variance in Y related to X2, and variance in Y related to the shared variance between X1 & X2]
The predictor variables are correlated with each other and with the response variable
Which predictor variable gets credit for this shared variance, X1 or X2?
Different Methods of Multiple
Regression
• Hierarchical Regression
• Entry / Standard Regression
• Sequential Methods
– Forward Addition
– Backward Selection
– Stepwise
• Combinatorial Approach
Hierarchical Regression
• You decide the order in which the variables are
entered
• Based on theory / prior research
• Allows you to assess whether each predictor
adds anything to the model, given the predictors
that are already in the model
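A hierarchical comparison can be sketched in Python as two nested models fitted in a theory-driven order, with an F-test on the R-squared change. This is a minimal illustration with made-up data and hypothetical variable names x1, x2, y.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Hypothetical data for two predictors entered in a chosen order
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80)})
    df["y"] = 2 + 1.5 * df.x1 + 0.8 * df.x2 + rng.normal(size=80)

    step1 = smf.ols("y ~ x1", data=df).fit()       # block 1: first predictor
    step2 = smf.ols("y ~ x1 + x2", data=df).fit()  # block 2: add the second

    print(step2.rsquared - step1.rsquared)   # R-squared change due to x2
    print(anova_lm(step1, step2))            # F-test: does x2 add anything?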
Entry / Standard Regression
• Computer package enters all predictor variables into the
model simultaneously
– Creates a regression equation including all predictor variables
– Allows us to assess the unique contribution of each predictor
variable when all other variables are held constant
• Advantages & Disadvantages
– Easy to see which variables significantly predict the response
variable
– May not create the best model for predicting Y as it will include
variables that don’t significantly predict Y
Sequential Models
• Aim to create the ‘best model’
– The combination of variables that best predicts the response
variable
• Build several models in a series of steps, adding or
deleting variables at each step, depending on their
contribution to predicting the response variable
• Final model includes only variables which significantly
and uniquely predict the response variable
Sequential Methods
• Forward Addition
• Begins with only one variable in the model
– The variable that makes the biggest contribution to the response
variable (highest r)
• Adds the variable with the next highest contribution
• Continues to add variables until there are no more
variables that make a significant contribution to the
response variable over and above the variables that are
already in the equation
Sequential Methods
• Backward Selection
• Begins with all predictor variables in the model and
successively deletes variables until only significant ones
remain
• Stepwise Regression
• Similar to previous two but more versatile
• Generally moves forward, adding significant variables,
but can move backward to eliminate a variable if it no
longer significantly predicts when another variable is
added
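The forward-addition logic just described can be sketched by hand in Python. This is a simplified illustration that uses ordinary p-values as the entry criterion (SPSS's sequential methods use F-to-enter rules), with made-up data and hypothetical predictor names.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data: four candidates, two genuinely useful predictors
    rng = np.random.default_rng(2)
    X = pd.DataFrame(rng.normal(size=(120, 4)),
                     columns=["x1", "x2", "x3", "x4"])
    y = 1 + 2 * X.x1 + 3 * X.x3 + rng.normal(size=120)

    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
                        .pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= 0.05:    # no candidate contributes significantly
            break
        selected.append(best)
        remaining.remove(best)

    print(selected)                # variables retained by forward addition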
Sequential Methods
• Drawbacks
– Inclusion in the model depends on mathematical
criterion rather than psychological theory or research
– Variable selection could depend upon tiny differences
in correlation between each predictor variable and the
response variable
• Slight numerical differences could therefore lead to major
differences in theoretical interpretation
– Difficult to replicate results
Combinatorial Methods
• Best Subsets Method
• Computes models with all possible combinations
of the predictor variables and chooses the model
that explains most variance in the response
variable
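A best-subsets search is easy to sketch: fit every combination of predictors and keep the one that explains the most variance. Below is a minimal Python illustration with made-up data, using adjusted R-squared as the criterion (one common choice among several).

    from itertools import combinations

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data: three candidates, only one genuinely useful
    rng = np.random.default_rng(3)
    X = pd.DataFrame(rng.normal(size=(120, 3)), columns=["x1", "x2", "x3"])
    y = 1 + 2 * X.x1 + rng.normal(size=120)

    best = None
    for k in range(1, len(X.columns) + 1):
        for subset in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            if best is None or fit.rsquared_adj > best[0]:
                best = (fit.rsquared_adj, subset)

    print(best)   # (adjusted R-squared, winning combination of predictors)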
Critical Considerations for MR
• Sample size
• Distribution requirements
– Residuals must be normally distributed
• Outliers
• Multi-collinearity
Sample Size
• Ratio of cases to predictors should be substantial
• Stevens (1996) advised about 15 participants per
predictor variable
– Size matters: The more people in your sample the better the
chance of the results being replicated
• However, an even bigger ratio is needed when
– Response variable has skewed data distribution
– Poor reliability in measures - substantial measurement error
reduces size of true relationships of variables
– Stepwise methods (45-50 participants per predictor)
Residuals
• Recall the Method of Least Squares
– Fits the regression line by minimising the prediction error of the
line
– Minimises the sum of the squared residuals, Σ(Y - Ŷ)²
• Fits a model of the form
Y = a + b1X1 + b2X2 + … + bkXk + e
• Assumes: Y = fit + noise
Residuals
• Method of Least Squares models the noise (e) in the
data using the normal distribution
– Assumes the noise is normally distributed with a mean of 0 and
variance σ²
• If this assumption is violated, the results of your
regression analysis may not be valid
– You need to check this by plotting the residuals
– Standardised Residual Plots
• Histogram
• Normal Probability Plot
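Both plots can be produced in a few lines of Python. This is a minimal sketch with made-up data refitting the earlier hypothetical model; it is not the SPSS procedure, just the same idea.

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import statsmodels.api as sm

    # Refit a hypothetical model similar to the earlier sketch
    rng = np.random.default_rng(4)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([4.0, 3.0, 13.0]) + rng.normal(size=100)
    model = sm.OLS(y, X).fit()

    resid = model.resid / model.resid.std()       # standardised residuals

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(resid, bins=20)                      # should look roughly normal
    stats.probplot(resid, dist="norm", plot=ax2)  # normal probability plot
    plt.show()                                    # points should hug the diagonal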
Histogram
[Histogram: frequency (y-axis) against the variable (x-axis, 0 to 1000); Mean = 72.58, Std. Dev. = 123.52, N = 85]
Normal Probability Plot
Plots the residual value
that was obtained for each
data point (observed)
against the value you would
expect if the residuals were
normally distributed
(expected)
Should be a straight diagonal
line
Outliers
• Data points that lie far from the rest of the data and have
large residuals
• Big influence on regression analysis
• You can check for outliers
– Scatterplots examining relationship between response variable
and predictor variables separately
– Casewise diagnostics in SPSS
– Plots of the standardised residuals
Plot of Standardised Residuals
[Scatterplot: standardised predicted values plotted against studentised deleted residuals, both axes running from -3 to +3]
Studentised Deleted Residuals: residual scores divided by their standard deviation, which is calculated leaving out any suspiciously outlying data points
Based on the assumption of normality, about 99.7% of residuals should lie within
+3 & -3 standard deviations
Any point outside this range is a potential outlier
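Studentised deleted residuals are available directly from statsmodels' influence diagnostics. Below is a minimal sketch with made-up data and one deliberately planted outlier; the threshold of 3 follows the rule of thumb above.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import OLSInfluence

    # Hypothetical fit with one suspiciously large case planted in
    rng = np.random.default_rng(5)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([4.0, 3.0, 13.0]) + rng.normal(size=100)
    y[0] += 40

    model = sm.OLS(y, X).fit()
    sdresid = OLSInfluence(model).resid_studentized_external
    print(np.where(np.abs(sdresid) > 3)[0])   # cases beyond +/-3 SD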
Multi-Collinearity
• Occurs when predictor variables are highly
correlated with one another
– High bivariate correlations (.7 / .8 or above)
– High multivariate correlation
• Not a desired feature of the dataset
– Some predictor variables are redundant
– Statistically, leads to unstable results
Multi-Collinearity
• To assess whether multi-collinearity is present
– Examine the bivariate correlations between predictor variables
– Tolerance Statistic
• 1 - R², where R² is the squared multiple correlation between each
predictor variable and all the others
• If tolerance is low, the multiple correlation must be high and
multi-collinearity is a problem
• Solution
– Leave out one of the predictor variables
– Combine two highly correlated predictor variables
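Tolerance can be computed as the reciprocal of the variance inflation factor (tolerance = 1 / VIF). Below is a minimal Python sketch with made-up predictors, where a hypothetical x3 nearly duplicates x1 so its tolerance comes out low.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical predictors; x3 is almost a copy of x1
    rng = np.random.default_rng(6)
    X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x1", "x2"])
    X["x3"] = X.x1 + rng.normal(scale=0.1, size=100)

    exog = sm.add_constant(X)
    for i, name in enumerate(X.columns, start=1):   # index 0 is the constant
        tolerance = 1 / variance_inflation_factor(exog.values, i)
        print(name, "tolerance =", round(tolerance, 3))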
Let’s take an example
• Interested in a theory which suggests that a person’s
level of optimism (X1) and the social support (X2) that
he/she has in his/her life predicts how long he/she will
survive (Y) after being diagnosed with cancer.
• Three steps to Regression Analysis:
– A. Examine the relationship between the predictor and response
variables separately
– B. Perform and interpret the multiple regression
– C. Assess the appropriateness of the regression analysis
Let’s take an example
• Open the following dataset
• Software / Kevin Thomas / Multiple Regression Dataset
• Run Correlations between…
– Survival & Optimism
– Survival & Social Support
Correlations
• Survival in weeks & Optimism (LOT score): r = .599**, p < .001, N = 200
• Survival in weeks & Social Support (SS score): r = .326**, p < .001, N = 200
**. Correlation is significant at the 0.01 level (2-tailed).
Create Scatterplots & fit regression line
Graphs / Scatter / Simple Scatter / Y = Survival, X = Predictor Variable
Fit regression line: Double click on chart, then Elements / Fit line at total
Step 2: The Multiple Regression
• Analyse, Regression, Linear
– Dependent variable: Survival
– Independent variable: Social, optimism
• Method: Enter (gives a standard multiple regression)
• Statistics
– Regression Coefficients
• Estimates
• Model fit
• Descriptives
Answer the questions on your
worksheet
1.
Does this model (i.e. combination of social support and
optimism) significantly predict the response variable
(survival in months)?
ANOVA(b)

Model 1       Sum of Squares   df    Mean Square   F        Sig.
Regression    528045.0         2     264022.487    67.733   .000(a)
Residual      767907.0         197   3898.005
Total         1295952          199

a. Predictors: (Constant), SS score, LOT score
b. Dependent Variable: survival in weeks

Yes, F (2, 197) = 67.73, p < .001
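As a check, the F ratio comes straight from the two mean squares in the table:
F = MS regression / MS residual = 264022.487 / 3898.005 ≈ 67.73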
Answer the questions on your
worksheet
2.
What percentage of variance in the response variable,
survival in months, is explained by this model?
Model Summary

Model 1:  R = .638(a)   R Square = .407   Adjusted R Square = .401
Std. Error of the Estimate = 62.43401

a. Predictors: (Constant), SS score, LOT score

R Square adjusted = estimate of the population proportion of
variation in survival due to optimism & support
Penalises for the number of variables in the model
40.1%
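For reference, the adjustment uses the standard formula, which reproduces the table's value:
Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1) = 1 - (1 - .407)(199 / 197) ≈ .401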
Answer the questions on your
worksheet
3.
Write the regression equation
Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)    4.340              22.760                           .191     .849
LOT score     3.670              .367         .558                10.005   .000
SS score      12.987             3.226        .225                4.026    .000

a. Dependent Variable: survival in weeks
Survival in months = 3.67(optimism) + 12.99(social support) + 4.34
Answer the questions on your
worksheet
4.
What does this equation tell us about the relationship
between months of survival and social support?
[Coefficients table repeated from above]
As social support increases by one unit, survival increases by almost 13
months, holding optimism constant
Answer the questions on your
worksheet
5.
Do both variables significantly predict survival in months?
[Coefficients table repeated from above]
Yes; for optimism, t = 10.005, p < .001, and for social support, t = 4.026, p < .001
Answer the questions on your
worksheet
6.
Which of the predictor variables contributes most to the
response variable?
[Coefficients table repeated from above]
Beta = Standardised regression coefficient, i.e. B rescaled into standard-deviation units (note: B / Std. Error gives the t statistic, not Beta)
Can be used to compare strength of contribution of predictor variables
Optimism has a Beta value of .558 and so, contributes more than social
support, which has a Beta value of .225
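The rescaling uses the usual conversion, which is what makes coefficients for predictors measured on different scales comparable:
Beta = b × (standard deviation of X / standard deviation of Y)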
Answer the questions on your
worksheet
7.
Use the regression equation to make the following
prediction: If a person has an optimism score of 10 and a
social support score of 2, how long would you expect them
to survive?
Survival in months = 3.67(optimism) + 12.99(social support) + 4.34
Survival in months = 3.67(10) + 12.99(2) + 4.34
Survival in months = 36.7 + 25.98 + 4.34
Survival in months = 67.02
67 months!
Answer the questions on your
worksheet
8.
What is the standard error of this prediction?
Model Summary

Model 1:  R = .638(a)   R Square = .407   Adjusted R Square = .401
Std. Error of the Estimate = 62.43401

a. Predictors: (Constant), SS score, LOT score

62.43 months
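This figure follows from the ANOVA table: the standard error of the estimate is the square root of the residual mean square:
Std. Error of the Estimate = √(SS residual / (n - k - 1)) = √(767907 / 197) = √3898.005 ≈ 62.43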
Step 3: Assess the appropriateness of
the Analysis
• Distribution of Residuals
• Outliers
• Multi-collinearity
• Re-run regression but this time…
– Statistics
• Collinearity Diagnostics
• Residuals, casewise diagnostics
– Outliers outside 3 standard deviations
– Plots
• Histogram
• Normal Probability Plot
• Plot of Standardized Predicted Values (Y: ZPRED) by
Studentized Deleted Residuals (X: SDRESID)
Distribution of Residuals
Outliers
Residuals Statistics(a)

                       Minimum     Maximum    Mean        Std. Deviation   N
Predicted Value        59.9595     326.4863   193.2000    51.5121          200
Residual               -182.7696   162.2596   5.400E-15   62.1195          200
Std. Predicted Value   -2.587      2.587      .000        1.000            200
Std. Residual          -2.927      2.599      .000        .995             200

a. Dependent Variable: length of survival (months)

All residuals lie within -3 and +3 standard deviations
Note that you expect a small percentage of cases (about 0.3%) to lie outside
this range, so in a large sample one or two such points could be ok
Outliers
All residuals lie within -3 and 3 standard deviations
Multi-Collinearity
Correlations
• SS score & LOT score: r = .182**, p = .010, N = 200
**. Correlation is significant at the 0.01 level (2-tailed).

Coefficients(a)

Model 1       B        Std. Error   Beta   t        Sig.   Tolerance   VIF
(Constant)    4.340    22.760              .191     .849
LOT score     3.670    .367         .558   10.005   .000   .967        1.034
SS score      12.987   3.226        .225   4.026    .000   .967        1.034

a. Dependent Variable: survival in weeks
The bivariate correlation between the predictors is low (r = .182), even though it is significant (p = .01)
Tolerance is high, meaning that the multiple correlation is small and multi-collinearity is not a feature of this dataset
Summary
• Multiple Regression
– To predict Y given a combination of predictor variables
– To assess the relative importance of each predictor variable in
explaining the response variable
• Statistical modelling
– Different Methods
• Three steps
– Examine the relationship between the predictor and response
variables separately
– Perform and interpret the multiple regression
– Assess the appropriateness of the regression analysis
• There are a number of critical considerations