What are linear statistical models?


Assumptions in linear regression models
Yi = β0 + β1 x1i + … + βk xki + εi, i = 1, … , n
Assumptions
x1i , … , xki are deterministic (not random variables)
ε1 … εn are independent random variables with null
mean, i.e. E(εi) = 0 and common variance, i.e. V(εi) = σ2.
Consequences
E(Yi) = β0 + β1 x1i + … + βk xki and V(Yi) = σ2, i = 1, … , n
The OLS (ordinary least squares) estimators of β0, …, βk,
indicated with b0, …, bk, are BLUE (Best Linear Unbiased
Estimators) – Gauss-Markov theorem.
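As a worked illustration (not from the original slides; the data below are simulated), the OLS estimators can be computed directly from the normal equations b = (X'X)^(-1) X'y:

```python
# Minimal sketch (simulated data): the OLS estimators solve the
# normal equations b = (X'X)^(-1) X'y.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, n)   # true betas: 1, 2, -0.5

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimates b0, b1, b2
print(b)                                    # close to [1, 2, -0.5]
```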
Normality assumption
If, in addition, we assume that the errors are Normal r.v.:
ε1, …, εn are independent Normal r.v. with null mean and
common variance σ2, i.e. εi ~ N(0, σ2), i = 1, … , n
Consequences
Yi ~ N( β0 + β1 x1i + … + βk xki , σ2), i = 1, … , n
bi ~ N(βi, V(bi)), i = 0, … , k
The normality assumption is needed to make reliable
inference (confidence intervals and tests of hypotheses),
i.e. the probability statements are exact.
If the normality assumption does not hold, then under some
conditions a large number of observations n allows, via a
Central Limit Theorem, reliable asymptotic inference on the
estimated betas.
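A minimal simulation sketch (not from the original slides) illustrating this point: even with skewed, non-Normal errors, the sampling distribution of the OLS slope estimate is approximately Normal when n is large.

```python
# Minimal sketch: with skewed (exponential) errors, the OLS slope
# estimate b1 is still approximately Normal when n is large.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
x = rng.uniform(0, 10, n)                  # fixed (deterministic) regressor
X = np.column_stack([np.ones(n), x])

b1 = np.empty(reps)
for r in range(reps):
    eps = rng.exponential(1.0, n) - 1.0    # non-Normal errors with E(eps) = 0
    y = 2.0 + 0.5 * x + eps                # true beta1 = 0.5
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    b1[r] = beta_hat[1]

print("mean of b1:", b1.mean())            # close to 0.5 (unbiased)
skew = ((b1 - b1.mean()) ** 3).mean() / b1.std() ** 3
print("skewness of b1:", skew)             # close to 0, i.e. Normal-like shape
```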
Checking assumptions
The error term ε is unobservable. Instead, we can estimate
it by using the parameter estimates.
The regression residual is defined as

ei = yi – ŷi, i = 1, 2, … , n
Plots of the regression residuals are fundamental in
revealing model inadequacies such as
a) non-normality
b) unequal variances
c) presence of outliers
d) correlation (in time) of error terms
Detecting model lack of fit with residuals
• Plot the residuals ei on the vertical axis against each of
the independent variables x1, ..., xk on the horizontal
axis.
• Plot the residuals ei on the vertical axis against the
predicted value ŷ on the horizontal axis.
• In each plot look for trends, dramatic changes in
variability, and/or more than 5% of the residuals lying
outside 2s of 0. Any of these patterns indicates a problem
with model fit.
Use the Scatter/Dot graph command in SPSS to construct
any of the plots above.
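SPSS is the tool used throughout these notes; as an alternative, here is a minimal Python sketch of the same diagnostic plots (the data frame and column names are illustrative assumptions, not from the slides):

```python
# Minimal sketch (illustrative data and column names) of the residual
# plots described above, using Python/statsmodels instead of SPSS.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 60), "x2": rng.uniform(0, 5, 60)})
df["y"] = 3 + 2 * df["x1"] - df["x2"] + rng.normal(0, 1, 60)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
e, yhat = fit.resid, fit.fittedvalues
s = np.sqrt(fit.mse_resid)                 # estimate of sigma

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
panels = [(df["x1"], "x1"), (df["x2"], "x2"), (yhat, "predicted value")]
for ax, (xvals, name) in zip(axes, panels):
    ax.scatter(xvals, e)
    ax.axhline(0)
    ax.axhline(2 * s, linestyle="--")      # +/- 2s reference lines
    ax.axhline(-2 * s, linestyle="--")
    ax.set_xlabel(name)
    ax.set_ylabel("residuals")
plt.tight_layout()
plt.show()

print("share of residuals outside 2s:", np.mean(np.abs(e) > 2 * s))
```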
Examples: residuals vs. predicted
[Four residual-vs-predicted plots (residuals on the vertical axis, predicted value on the horizontal axis), illustrating: a fine fit, unequal variances, nonlinearity, and outliers.]
Examples: residuals vs. predicted
[Two residual-vs-predicted plots, illustrating: auto-correlation, and nonlinearity combined with auto-correlation.]
Partial residual plots
An alternative method to detect lack of fit in models with more than
one independent variable uses the partial residuals; for a selected
j-th independent variable xj,

e* = y – (b0 + b1x1 + … + bj-1xj-1 + bj+1xj+1 + … + bkxk) = e + bjxj

Partial residuals measure the influence of xj after the effects of all
the other independent variables have been removed.
A plot of the partial residuals for xj against xj often reveals more
information about the relationship between y and xj than the usual
residual plot.
If the model fits well, the plot should show a straight line with slope bj.
Partial residual plots can be calculated in SPSS by selecting “Produce all
partial plots” in the “Plots” options in the “Regression” dialog box.
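Alternatively, the partial residuals e + bj·xj can be computed directly; a minimal Python sketch with illustrative simulated data:

```python
# Minimal sketch (illustrative simulated data): partial residuals
# e* = e + bj * xj, plotted against xj.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 60), "x2": rng.uniform(0, 5, 60)})
df["y"] = 3 + 2 * df["x1"] - df["x2"] + rng.normal(0, 1, 60)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
bj = fit.params["x1"]
partial = fit.resid + bj * df["x1"]        # e* = e + b1 * x1

plt.scatter(df["x1"], partial)
plt.xlabel("x1")
plt.ylabel("partial residuals for x1")
plt.show()
```

statsmodels also offers sm.graphics.plot_ccpr(fit, "x1"), which draws the same component-plus-residual plot.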
Example
A supermarket chain wants to investigate the effect of
price p on the weekly demand of a house brand of
coffee.
Eleven prices were randomly assigned to the stores and
were advertised using the same procedure.
A few weeks later the chain conducted the same
experiment using no advertising.
– Y : weekly demand in pounds
– X1: price, dollars/pound
– X2: advertisement: 1 = Yes, 0 = No.
Model 1: E(Y) = β0 + β1x1 + β2x2
Data: Coffee2.sav
Computer Output
Model Summary

  Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1       .988   .975       .973                49.876

  Predictors: (Constant), Advertisment, Price per pound

ANOVA (Dependent Variable: Weekly demand)

  Model 1      Sum of Squares   df   Mean Square   F         Sig.
  Regression   1859299          2    929649.475    373.710   .000
  Residual     47264.868        19   2487.625
  Total        1906564          21

Coefficients (Dependent Variable: Weekly demand)

                    B          Std. Error   Beta (standardized)   t         Sig.
  (Constant)        2400.182   68.914                             34.829    .000
  Price per pound   -456.295   16.813       -.980                 -27.139   .000
  Advertisment      70.182     21.267       .119                  3.300     .004
Residual and partial residual (price) plots
The plot of residuals vs. price shows non-linearity. The partial
residual plot for price vs. price shows the nature of the
non-linearity. Try using 1/x instead of x:
E(Y) = β0 + β1(1/x1) + β2x2
Model Summary (Dependent Variable: Weekly demand)

  Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1       .999   .999       .999                11.097

  Predictors: (Constant), RecPrice, Advertisment

ANOVA (Dependent Variable: Weekly demand)

  Model 1      Sum of Squares   df   Mean Square   F          Sig.
  Regression   1904224          2    952111.958    7731.145   .000
  Residual     2339.903         19   123.153
  Total        1906564          21

Coefficients (Dependent Variable: Weekly demand), where RecPrice = 1/Price

                 B           Std. Error   Beta (standardized)   t         Sig.
  (Constant)     -1217.343   14.898                             -81.711   .000
  Advertisment   70.182      4.732        .119                  14.831    .000
  RecPrice       6986.507    56.589       .992                  123.460   .000
Residual and partial residual (1/price) plots
After fitting the independent variable x1 = 1/price, the
residual plot does not show any pattern and the partial
residual plot for 1/price does not show any non-linearity.
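As a hedged alternative to the SPSS workflow, a Python sketch of the two coffee models; reading Coffee2.sav via pandas/pyreadstat and the column names "demand", "price", "advert" are assumptions to adapt to the actual file:

```python
# Sketch of the two coffee models in Python rather than SPSS. Reading
# Coffee2.sav with pandas/pyreadstat and the column names "demand",
# "price", "advert" are assumptions; adapt them to the actual file.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Coffee2.sav")           # requires the pyreadstat package

m1 = smf.ols("demand ~ price + advert", data=df).fit()      # Model 1
df["rec_price"] = 1.0 / df["price"]                         # RecPrice = 1/Price
m2 = smf.ols("demand ~ rec_price + advert", data=df).fit()  # reciprocal model

# Per the output above, adjusted R Square should rise from about .973 to .999.
print(m1.rsquared_adj, m2.rsquared_adj)
```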
An example with simulated data
The true model, supposedly unknown, is
Y = 1 + x1 + 2∙x2 + 1.5∙x1∙x2 + ε, with ε~N(0,1)
Data: Interaz.sav
Fit a model based on the data. [Scatterplot matrix of Y, x1, and x2; Cor(X1, X2) = 0.131.]
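A minimal Python sketch (not from the original slides; the uniform distributions for x1 and x2 are illustrative assumptions) that generates data from the true model above and fits it with and without the interaction term, previewing the comparison worked out in the slides that follow:

```python
# Sketch reproducing the simulated-data setup: generate data from the
# true model and fit it with and without the interaction term. The
# uniform distributions for x1 and x2 are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({"x1": rng.uniform(0, 5, n), "x2": rng.uniform(0, 5, n)})
df["y"] = (1 + df["x1"] + 2 * df["x2"]
           + 1.5 * df["x1"] * df["x2"] + rng.normal(0, 1, n))

m1 = smf.ols("y ~ x1 + x2", data=df).fit()           # Model 1 (misspecified)
m2 = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()   # Model 2 (with interaction)
print(m1.rsquared_adj, m2.rsquared_adj)              # Model 2 fits much better
```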
Model 1: E(Y) = β0 + β1x1 + β2x2
ANOVA

  Source       SS        df   MS         F         Sig.
  Regression   8447.42   2    4223.711   768.494   .000
  Residual     533.12    97   5.496
  Total        8980.54   99

  Adj. R2 = 0.939

Coefficients

               B        Std. Error   t        Sig.   VIF
  (Constant)   -6.092   .630         -9.668   .000
  X1           3.625    .207         17.528   .000   1.018
  X2           6.145    .189         32.465   .000   1.018
Model 1: standardized residual plot
Nonlinearity is present. What is it due to? Since the
scatterplots do not show any non-linearity, it could be due
to an interaction.
Model 1: partial regression plots
[Partial regression plots for X1 and X2.] They show that the
linear effects are roughly fine, but some nonlinearity shows up.
Model 2: E(Y) = β0 + β1x1 + β2x2 + β3x1x2
ANOVA

  Source       SS         df   MS         F         Sig.
  Regression   8885.372   3    2961.791   2987.64   .000
  Residual     95.169     96   .991
  Total        8980.541   99

  Adj. R2 = 0.989

Coefficients

               B       Std. Error   t        Sig.   VIF
  (Constant)   .305    .405         .753     .453
  X1           1.288   .142         9.087    .000   2.648
  X2           2.098   .209         10.051   .000   6.857
  IntX1X2      1.411   .067         21.018   .000   9.280
Model 2: standardized residual plot
Looks fine
Model 2: partial regression plots
[Partial regression plots for X1, X2, and X1X2.] All plots show
linearity of the corresponding terms; maybe an outlier is present.
Model 3: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x2²
Suppose we want to try fitting a quadratic term.
ANOVA

  Source       SS         df   MS        F         Sig.
  Regression   8890.686   4    2222.67   2349.92   .000
  Residual     89.856     95   .946
  Total        8980.541   99

  Adj. R2 = 0.990

Coefficients

               B       Std. Error   t        Sig.   VIF
  (Constant)   .023    .413         .055     .956
  X1           1.258   .139         9.051    .000   2.670
  X2           2.615   .299         8.757    .000   14.713
  IntX1X2      1.436   .066         21.619   .000   9.528
  X2Square     -.137   .058         -2.370   .020   11.307

The x2² term seems fine, but note the higher multicollinearity (larger VIFs).
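The VIF columns in these tables can also be reproduced outside SPSS; a minimal Python sketch (illustrative simulated data) using statsmodels' variance_inflation_factor:

```python
# Minimal sketch (illustrative simulated data): compute a VIF column
# like the ones above with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(0, 5, 100), rng.uniform(0, 5, 100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x1x2": x1 * x2, "x2sq": x2 ** 2})
X = sm.add_constant(X)                     # intercept column, as in the model

for i, name in enumerate(X.columns):
    if name != "const":                    # no VIF for the intercept
        print(name, round(variance_inflation_factor(X.values, i), 3))
```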
Model 3: standardized residual plot
Looks fine
Model 3: partial regression plots
[Partial regression plots for X1, X2, X1X2, and X2².] The plot
for X2² doesn't show "linearity".
Checking the normality assumption
The inference procedures on the estimates (tests and
confidence intervals) are based on the Normality
assumption on the error term ε. If this assumption is
not satisfied the conclusions drawn may be wrong.
Again, the residuals ei are used for checking this assumption.
Two widely used graphical tools are
1) the P-P plot for Normality of the residuals
2) the histogram of the residuals compared with the
Normal density function.
The P-P plot for Normality and histogram of the residuals can be
calculated in SPSS by selecting the appropriate boxes in the “Plots”
options in the “Regression” dialog box.
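Both checks can also be done outside SPSS; a minimal Python sketch (illustrative simulated data) using statsmodels and scipy:

```python
# Minimal sketch (illustrative simulated data) of both normality checks:
# a P-P plot of the residuals and a histogram with a Normal density.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import ProbPlot
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 80)})
df["y"] = 1 + 0.5 * df["x"] + rng.normal(0, 1, 80)
e = smf.ols("y ~ x", data=df).fit().resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ProbPlot(e, fit=True).ppplot(line="45", ax=ax1)          # P-P plot vs. Normal
ax2.hist(e, bins=15, density=True)                       # histogram of residuals
grid = np.linspace(e.min(), e.max(), 200)
ax2.plot(grid, stats.norm.pdf(grid, e.mean(), e.std()))  # fitted Normal density
ax2.set_xlabel("residuals")
plt.tight_layout()
plt.show()
```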
Social Workers example: E(ln(Y)) = β0 + β1x
[Histogram of residuals and Normal P-P plot.] The histogram
should match the continuous (Normal density) line, and the
P-P plot points should be as close as possible to the straight
line. Neither graph shows strong departures from the
Normality assumption.