Transcript Slide 1
Linear Regression
5.2 Introduction
• Correlations tell us nothing about the predictive power of variables.
• In regression we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.
• Outcomei = (Modeli) + errori
• The word "model" in the equation gets replaced by something that defines the line we fit to the data.
• With any data set there can be many lines that could be used to
summarise the general trend.
• We need to decide which of many possible lines to choose.
• For drawing accurate conclusions we want to fit a model that
best describes the data.
• There are several ways to fit the line, e.g. by eye, or with a mathematical technique such as the 'method of least squares'.
5.2.1 Describing a Straight Line
Yi = b0 + b1Xi + εi
• b1
– Regression coefficient for the predictor
– Gradient (slope) of the regression line
– Direction/Strength of Relationship
• b0
– Intercept (value of Y when X = 0)
– Point at which the regression line crosses the Y-axis
(ordinate)
Same Intercept, Different Gradient
• Figure: Verbal Coherence (y-axis) plotted against Number of Pints (x-axis) for lines sharing an intercept but with different gradients.
Same Gradient, Different Intercept
• Figure: Verbal Coherence (y-axis) plotted against Number of Pints (x-axis) for lines sharing a gradient but with different intercepts.
5.2.2 The Method of Least Squares
• Insert Figure 5.2
5.2.3 Assessing the goodness of fit: sums of squares, R and R2, i.e. how good is the model?
• The regression line is only a model based
on the data.
• This model might not reflect reality.
– We need some way of testing how well the
model fits the observed data.
– How?
Once we have found the line of best fit, it is important to assess how well this line fits the actual data.
To assess the line of best fit we need to compare it against something. The thing we choose is the most basic model we can find.
We use the following equation to calculate the fit of the most basic model, and then the fit of the best model. If the best model is any good, it should fit the data significantly better than the basic model.
Deviation = Σ(observed – model)²
We choose the mean as the basic model, and then calculate the differences between the observed values and the values predicted by the mean. The sum of these squared differences is called the total sum of squares (SST). It represents how good the mean is as a model of the observed data.
Now if we fit our best model (least squares), we can again find the differences between the observed data and the new model. These differences are squared and added, giving the sum of squared residuals (errors), SSR. This represents the degree of inaccuracy that remains when the best model is fitted.
We can use these two values to calculate how much better the regression line is than just using the mean as a model:
SSM = SST – SSR
This difference shows the reduction in inaccuracy resulting from fitting the regression model to the data.
R2 = SSM / SST (variation explained by the model / total variation in the data)
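The same calculation can be sketched in a few lines of Python (not part of the original slides); the pints/coherence numbers below are made up purely to illustrate the formulas.

```python
# A minimal sketch of SST, SSR, SSM and R^2 using NumPy; the data are invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # predictor (e.g. pints)
y = np.array([60.0, 55.0, 47.0, 38.0, 30.0, 22.0])    # outcome (e.g. coherence)

# Basic model: the mean of the outcome
sst = np.sum((y - y.mean()) ** 2)                      # total sum of squares

# Best model: the least-squares line
b1, b0 = np.polyfit(x, y, 1)                           # slope, intercept
y_hat = b0 + b1 * x
ssr = np.sum((y - y_hat) ** 2)                         # sum of squared residuals

ssm = sst - ssr                                        # model sum of squares
r_squared = ssm / sst
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSM={ssm:.2f}  R^2={r_squared:.3f}")
```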
Sums of Squares
• Insert Figure 5.3
Summary
• SST
– Total variability (variability between scores and the
mean).
• SSR
– Residual/Error variability (variability between the
regression model and the actual data).
• SSM
– Model variability (difference in variability between the
model and the mean).
Testing the Model: ANOVA
• SST: total variance in the data
• SSM: improvement due to the model
• SSR: error in the model
• If the model results in better prediction than
using the mean, then we expect SSM to be much
greater than SSR
Testing the Model: ANOVA
• Mean Squared Error
– Sums of Squares are total values.
– They can be expressed as averages.
– These are called Mean Squares, MS
F = MSM / MSR (variation explained by the model / variation not explained by the model)
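As a rough illustration (again with made-up numbers rather than the record-sales data, and assuming scipy is available), the mean squares and the F-ratio can be computed like this:

```python
# A minimal sketch of converting sums of squares to mean squares and an F-ratio.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([60.0, 55.0, 47.0, 38.0, 30.0, 22.0])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)
ssm = sst - ssr

n, p = len(x), 1                               # sample size, number of predictors
ms_m = ssm / p                                 # mean square for the model
ms_r = ssr / (n - p - 1)                       # mean square for the residuals
f_ratio = ms_m / ms_r
p_value = stats.f.sf(f_ratio, p, n - p - 1)    # chance of an F at least this large
print(f"F({p}, {n - p - 1}) = {f_ratio:.2f}, p = {p_value:.4f}")
```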
Testing the Model: R2
• R2
– The proportion of variance accounted for by
the regression model.
– The Pearson Correlation Coefficient Squared
R2 = SSM / SST
5.2.4 Assessing individual predictors
Yi = b0 + b1Xi + εi
The value b1 represents the change in the outcome resulting from a unit change in the predictor. If the model is bad, the regression coefficient will be 0, meaning a unit change in the predictor variable results in no change in the predicted value. This hypothesis is tested using a t-test.
Null hypothesis H0: b1 = 0
Alternative hypothesis Ha: b1 ≠ 0
If the test is significant (p less than 0.05), we accept that the b value is significantly different from zero and that the predictor variable contributes significantly to predicting the outcome variable.
t = (b_observed – b_expected) / SE_b
t = b_observed / SE_b
t test
• Let us assume we take lots of samples of the data regarding adverts and sales.
• Calculate the b value for each sample.
• We could plot a frequency distribution of these b values to see whether the values from all samples are relatively similar or different.
• We can use the standard deviation of this distribution (called the standard error, SE) as a measure of the spread of the b values. If the SE is small, most samples have a b value similar to the one in our sample (because there is little variation across samples).
• The t-test tells us whether the b value is different from 0, relative to the variation in b values across similar samples.
t = (b_observed – b_expected) / SE_b
b_expected is the value of b we would expect to obtain if the null hypothesis were true, i.e. b_expected = 0, which gives the simplified equation:
t = b_observed / SE_b
t test
• The values of t follow a t-distribution whose shape differs according to the degrees of freedom.
• In regression the degrees of freedom are N – P – 1, where N = sample size and P = number of predictors.
• Using the degrees of freedom, establish which t-distribution is to be used.
• Compare the observed value of t with the values we would expect to find by chance.
• If t is very large compared to the tabled values, it is very unlikely to have occurred by chance: it is a genuine effect.
• SPSS provides the exact probability of obtaining the observed t if the b parameter were in fact zero.
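A minimal sketch, assuming statsmodels is available, of how t, its degrees of freedom and the exact p-value could be computed for a single predictor; the data are invented for illustration.

```python
# Sketch of the t-test for a regression coefficient (null hypothesis: b1 = 0).
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([60.0, 55.0, 47.0, 38.0, 30.0, 22.0])

X = sm.add_constant(x)              # adds the column of 1s for b0
model = sm.OLS(y, X).fit()

b1 = model.params[1]
se_b1 = model.bse[1]                # standard error of b1
t_value = b1 / se_b1                # b_expected = 0 under the null hypothesis
df = int(model.df_resid)            # N - P - 1
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t({df}) = {t_value:.2f}, p = {model.pvalues[1]:.4f}")
```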
5.3 Regression: An Example
Open: Record1.sav
Regression Using SPSS
5.4.1 Interpreting a simple regression
The model summary tells us whether the model is successful in predicting sales.
SPSS Output: ANOVA
The ANOVA tests whether the model is significantly better at predicting the outcome than using the mean as a 'best guess'.
5.4.2 SPSS Output: Model Parameters
How do I
interpret b
values?
5.4.3 Using The Model
Record Salesi = b0 + b1 Advertising Budgeti
             = 134.14 + (0.09612 × Advertising Budgeti)
For an advertising budget of 100:
Record Salesi = 134.14 + (0.09612 × 100)
             = 143.75
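As a quick check, the same arithmetic in Python; the coefficients 134.14 and 0.09612 are the values quoted above, everything else is illustrative.

```python
# Using the fitted simple regression model to make a prediction.
b0, b1 = 134.14, 0.09612
advertising_budget = 100                 # same units as in the example
record_sales = b0 + b1 * advertising_budget
print(round(record_sales, 2))            # 143.75
```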
5.5. Multiple Regression: Basics
• Outcomei = (Modeli) + errori
• Yi = (b0 + b1X1i + b2X2i + …) + εi
5.5.1 Example of Regression: Record Sales Company
Advertising accounts for 33% of the variation in sales; 67% of the variation remains unexplained. Therefore a new predictor is introduced to explain some of the unexplained variation in sales: the number of times the record is played on the radio.
Record Salesi = (b0 + b1 Advertising Budgeti + b2 Airplayi) + εi
There is a b value for each predictor.
See the 3-D graphical model in Figure 5.6, page 158.
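A hedged sketch of fitting such a two-predictor model with statsmodels; the adverts/airplay/sales arrays below are simulated stand-ins for the record-sales data, not the values from the book.

```python
# Sketch of a two-predictor regression: sales ~ adverts + airplay (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
adverts = rng.uniform(0, 1000, 200)
airplay = rng.uniform(0, 60, 200)
sales = 50 + 0.1 * adverts + 3.0 * airplay + rng.normal(0, 40, 200)

X = sm.add_constant(np.column_stack([adverts, airplay]))   # columns for b0, b1, b2
model = sm.OLS(sales, X).fit()

print(model.params)        # b0, b1 (adverts), b2 (airplay)
print(model.rsquared)      # R^2; multiple R is the correlation of observed and fitted values
```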
5.5.2 Sum of Squares, R and R2
SST: represents the differences between the observed values and the mean value of the outcome variable.
SSR: represents the differences between the values of Y predicted by the model and the observed values.
SSM: represents the differences between the values of Y predicted by the model and the mean value.
Multiple R is a measure of how well the model predicts the observed data; it is used when there are multiple predictors. Multiple R is the correlation between the observed values of Y and the values of Y predicted by the regression model.
A large R value means a large correlation between the predicted and observed values of the outcome. If Multiple R = 1, the model perfectly predicts the observed data.
R2: the amount of variation in the outcome variable that is accounted for by the model.
5.5.3 Methods of Regression
• Hierarchical (Blockwise Entry)
– Predictors are selected based on past work.
– The experimenter decides the order of entry, entering the important ones first.
• Forced Entry
– All the predictors are entered simultaneously (you should have good theoretical reasons to include the chosen predictors).
• Stepwise Methods
– The order in which the predictors are added is determined by a mathematical criterion (a rough sketch of the forward method follows this section).
Forward Method:
– The initial model contains only the constant b0.
– The computer searches for the predictor that best predicts the outcome variable (by selecting the variable having the highest correlation with the outcome).
– If this predictor significantly improves the ability of the model to predict the outcome, it is retained and the computer searches for another predictor.
– The criterion for the second predictor is that it is the variable with the largest semi-partial correlation with the outcome. The predictor that accounts for the most new variance is added to the model and, if it makes a significant contribution to the predictive power of the model, it is retained and another predictor is considered.
Stepwise Method:
– Similar to the forward method, except that each time a predictor is added to the equation, a removal test is made of the least useful predictor. The regression equation is constantly reassessed to check for redundant predictors that can be removed.
Backward Method:
– The computer places all the predictors in the model and then calculates the contribution of each by looking at the significance value of the t-test for each predictor.
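The forward method can be sketched roughly as below. This is not SPSS's exact algorithm (SPSS uses its own entry criteria); it is a simplified illustration in which the candidate giving the largest gain in explained variance is added for as long as its t-test remains significant. The function name forward_selection and the alpha threshold are illustrative choices.

```python
# Rough sketch of forward entry: repeatedly add the predictor that most improves the
# model, stopping when no remaining candidate makes a significant contribution.
import numpy as np
import statsmodels.api as sm

def forward_selection(y, candidates, alpha=0.05):
    """candidates: dict mapping predictor name -> 1-D array of values."""
    selected = []
    remaining = dict(candidates)
    while remaining:
        best_name, best_p, best_r2 = None, 1.0, -np.inf
        for name, values in remaining.items():
            cols = [candidates[s] for s in selected] + [values]
            X = sm.add_constant(np.column_stack(cols))
            fit = sm.OLS(y, X).fit()
            if fit.rsquared > best_r2:        # largest gain in explained variance
                best_name = name
                best_p = fit.pvalues[-1]      # t-test p-value of the newly added predictor
                best_r2 = fit.rsquared
        if best_p >= alpha:                   # no candidate contributes significantly
            break
        selected.append(best_name)
        del remaining[best_name]
    return selected
```

With arrays like those in the earlier sketch, this could be called as forward_selection(sales, {"adverts": adverts, "airplay": airplay}).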
Assessing: does the model fit the observed data well?
Generalisation: can my model generalise to other samples?
5.6.1 How accurate is my regression model?
5.6.1.1 Outliers and residuals
• Outlier: a case that differs substantially from the main trend.
• How do we find one? Use residuals (the differences between the values of the outcome predicted by the model and the values of the outcome observed in the data):
– Unstandardised residuals
– Standardised residuals (z-scores)
– Studentised residuals
• 95% of z-scores should lie within ±1.96
• 99% of z-scores should lie within ±2.58
• 99.9% of z-scores should lie within ±3.29
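A minimal sketch, assuming statsmodels, of checking standardised residuals against these z-score cut-offs; the data are simulated.

```python
# Check what proportion of standardised residuals fall outside the usual cut-offs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x + rng.normal(0, 3, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = fit.get_influence().resid_studentized_internal   # standardised residuals

print(np.mean(np.abs(std_resid) > 1.96))    # should be roughly 0.05 in a well-behaved model
print(np.mean(np.abs(std_resid) > 2.58))    # roughly 0.01
print(np.where(np.abs(std_resid) > 3.29))   # cases this extreme deserve a closer look
```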
5.6.1.2 Influential Cases
Do certain cases exert undue influence over the parameters of the model? That is, if we delete a case, do we obtain different regression coefficients? There are several residual statistics for assessing the influence of a particular case.
• Adjusted predicted value (the predicted value for a case when that case is excluded from the analysis)
• DFFit = adjusted predicted value – original predicted value
• Standardised DFFit
• Deleted residual = adjusted predicted value – original observed value
• Studentised deleted residual
How the case influences the model as a whole:
• Cook's distance: a statistic that considers the effect of a single case on the model as a whole. Values greater than 1 are a cause for concern.
• Leverage: values lie between 0 (the case has no influence) and 1 (the case has complete influence).
• Mahalanobis distances: values above 25 are a cause for concern.
• DFBeta: the difference between a parameter estimated using all cases and the same parameter estimated when one case is excluded.
• Standardised DFBeta: universal cut-off points can be applied; values greater than 1 indicate cases that substantially influence the model parameters.
• Covariance ratio: a measure of whether a case influences the variance of the regression parameters.
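statsmodels exposes most of these influence statistics through its OLSInfluence object; a hedged sketch on simulated data (covariance ratios and Mahalanobis distances are also obtainable but omitted here):

```python
# Case-influence statistics: Cook's distance, leverage, DFBetas and DFFits.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 3, 50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

cooks_d, _ = influence.cooks_distance    # Cook's distance per case (> 1 is a concern)
leverage = influence.hat_matrix_diag     # leverage, between 0 and 1
dfbetas = influence.dfbetas              # standardised DFBeta for each parameter
dffits, _ = influence.dffits             # standardised DFFit per case

print(np.where(cooks_d > 1))             # cases with undue influence on the model as a whole
```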
In social science we are interested in generalising our findings outside the sample.
5.6.2 Assessing the generalisation of the model
5.6.2.1 Several assumptions must be true
5.6.2.2 Cross-validation of the model
5.6.2.3 Sample size in regression
5.6.2.4 Multicollinearity
5.6.2.1 Several assumptions must be true
• Variable types:
– Predictor variables must be quantitative or categorical.
– The outcome variable must be quantitative, continuous and unbounded.
• Non-zero variance: the predictors should have some variance.
• No perfect multicollinearity: no perfect linear relationship between two or more predictors.
• Predictors are uncorrelated with external variables.
• Homoscedasticity: at each level of the predictor variable, the variance of the residual terms should be constant.
• Independent errors (Durbin–Watson test; see the sketch after this list).
• Normally distributed errors.
• Independence.
• Linearity.
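A minimal sketch, assuming statsmodels, of checking two of these assumptions on simulated data: the Durbin–Watson statistic for independent errors and variance inflation factors (VIFs) for multicollinearity.

```python
# Check independence of errors (Durbin-Watson) and multicollinearity (VIF).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(0, 10, 100)
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(0, 2, 100)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(durbin_watson(fit.resid))                     # values near 2 suggest independent errors
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)                                         # VIFs well above 10 suggest multicollinearity
```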
5.6.2.2 Cross-validation of the model
Assessing the accuracy of the model across different samples is known as cross-validation. There are two main methods:
1. Adjusted R2: this indicates the loss of predictive power, or shrinkage. R2 tells us how much of the variance in Y is accounted for by the regression model derived from our sample; the adjusted value tells us how much variance in Y would be accounted for if the model had been derived from the population from which the sample was taken.
2. Data splitting: randomly split the data in half, compute a regression equation on each half, and compare the resulting models.
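Both ideas can be sketched in a few lines (simulated data; statsmodels assumed):

```python
# Adjusted R^2 (shrinkage) and a simple random half-split comparison.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 4 + 1.5 * x + rng.normal(0, 3, 200)

full = sm.OLS(y, sm.add_constant(x)).fit()
print(full.rsquared, full.rsquared_adj)        # adjusted R^2 is slightly smaller (shrinkage)

order = rng.permutation(len(y))                # random split into two halves
a, b = order[:100], order[100:]
fit_a = sm.OLS(y[a], sm.add_constant(x[a])).fit()
fit_b = sm.OLS(y[b], sm.add_constant(x[b])).fit()
print(fit_a.params, fit_b.params)              # the two models should agree if the model generalises
```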
5.6.2.3 Sample size in regression
Rules of thumb
5.7 Multiple Regression Using SPSS
• 5.7.1 Main Options
• 5.7.2 Statistics
• Regression Plots
• Saving Regression Diagnostics
• Further Options
5.8 Interpreting Multiple Regression
• 5.8.1 Descriptives
• 5.8.2 Summary of Model
• 5.8.3 Model Parameters
• 5.8.4
• 5.8.5 Assessing the assumption of no multicollinearity
• 5.8.6 Casewise diagnostics
• Checking assumptions
5.10 Categorical predictors and
regression
• 5.10.1 Dummy Coding
Dummy coding is a way of representing groups using only 0s and 1s.
The number of dummy variables we need is one less than the number of groups we are coding.
Example: Glastonbury festival
• The biologist categorised people according to their musical affiliation:
– People liking alternative music – indie kid
– People liking heavy metal – metaller
– People liking hippy/folky music – crusty
– People not liking music – no affiliation
                 Dummy variable 1   Dummy variable 2   Dummy variable 3
No Affiliation          0                  0                  0
Indie Kid               0                  0                  1
Metaller                0                  1                  0
Crusty                  1                  0                  0
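A minimal sketch of the same coding using pandas; the handful of festival-goers below is invented, and 'No Affiliation' is treated as the baseline group (the row of all zeros).

```python
# Dummy-coding a categorical predictor: k groups need only k - 1 dummy variables.
import pandas as pd

affiliation = pd.Series(["Indie Kid", "Metaller", "Crusty", "No Affiliation", "Crusty"])
dummies = pd.get_dummies(affiliation)               # one 0/1 column per group
dummies = dummies.drop(columns="No Affiliation")    # drop the baseline group
print(dummies.astype(int))
```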