Transcript Topic_9

Topic 9: Remedies
Outline
• Review diagnostics for residuals
• Discuss remedies
– Nonlinear relationship
– Nonconstant variance
– Non-Normal distribution
– Outliers
Diagnostics for residuals
• Look at residuals to find serious
violations of the model assumptions
– nonlinear relationship
– nonconstant variance
– non-Normal errors
   • presence of outliers
   • a strongly skewed distribution
Recommendations for
checking assumptions
• Plot Y vs X (is it a linear relationship?)
• Look at distribution of residuals
• Plot residuals vs X, time, or any other
potential explanatory variable
• Use i=sm## in the symbol statement
to get smoothed curves (## is a smoothing
value from 0 to 99; larger is smoother)
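A minimal sketch of this recommendation, assuming the Toluca data set (toluca, with variables hours and lotsize) that appears elsewhere in these notes. The smoothing value 70 is an arbitrary choice; try several values. The data are sorted by X first because the smoother expects the horizontal variable in order.

```sas
* Sort by X so the smoothed curve is drawn left to right;
proc sort data=toluca;
  by lotsize;
run;

* Scatterplot of Y vs X with a smoothed curve overlaid;
* (sm70 is an assumed, arbitrary amount of smoothing);
symbol1 v=circle i=sm70;
proc gplot data=toluca;
  plot hours*lotsize;
run;
```

If the smooth looks close to a straight line, a linear relationship is plausible.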
Plots of Residuals
• Plot residuals vs
– Time (order)
– X or predicted value (b0+b1X)
• Look for
–nonrandom patterns
–outliers (unusual observations)
Residuals vs Order
• Pattern in plot suggests
dependent errors / lack of independence
• Pattern usually a linear or
quadratic trend and/or cyclical
• If you are interested read KNNL
pgs 108-110
Residuals vs X
• Can look for
–nonconstant variance
–nonlinear relationship
–outliers
–somewhat address Normality
of residuals
Tests for Normality
• H0: data are an i.i.d. sample from a
Normal population
• Ha: data are not an i.i.d. sample
from a Normal population
• KNNL (p 115) suggest a correlation
test that requires a table look-up
Tests for Normality
• We have several choices for a
significance testing procedure
• Proc univariate with the normal
option provides four
proc univariate normal;
• Shapiro-Wilk is a common choice
Tests for Normality
Test                 Statistic          p Value
Shapiro-Wilk         W     0.978904     Pr < W      0.8626
Kolmogorov-Smirnov   D     0.09572      Pr > D     >0.1500
Cramer-von Mises     W-Sq  0.033263     Pr > W-Sq  >0.2500
Anderson-Darling     A-Sq  0.207142     Pr > A-Sq  >0.2500
All P-values > 0.05…Do not reject H0
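One way to apply these tests to regression residuals, sketched under assumptions: the notes do not say which data produced the table above, so this uses the Toluca data set (toluca, hours, lotsize) from elsewhere in the notes, and the names resids and resid are made up for illustration.

```sas
* Save the residuals from the fitted line;
proc reg data=toluca;
  model hours=lotsize;
  output out=resids r=resid;
run;

* Four Normality tests plus a Normal quantile plot of the residuals;
proc univariate data=resids normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

The quantile plot is often more informative than the p-values, since it shows how the residuals depart from Normality (skewness, heavy tails, outliers).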
Other tests for model
assumptions
• Durbin-Watson test for serially
correlated errors (KNNL p 114)
• Modified Levene test for homogeneity
of variance (KNNL p 116-118)
• Breusch-Pagan test for homogeneity
of variance (KNNL p 118)
• For SAS commands see topic9.sas
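The file topic9.sas is not reproduced in these notes, so the following is only a sketch of one of these tests and not necessarily what that file contains. It requests the Durbin-Watson statistic from PROC REG with the dw option (dwprob adds p-values), assuming the Toluca data. The test is meaningful only if the rows are in time order.

```sas
* Durbin-Watson test for serially correlated errors;
* assumes the observations are stored in time order;
proc reg data=toluca;
  model hours=lotsize / dw dwprob;
run;
```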
Plots vs significance test
• Plots are more likely to suggest
a remedy
• Significance test results depend
heavily on the sample size; with
sufficiently large samples we can
reject most null hypotheses
Default graphics with
SAS 9.3
proc reg data=toluca;
model hours=lotsize;
id lotsize;
run;
[Figure: PROC REG diagnostics panel for the Toluca fit. Callouts on the figure: "Will discuss these diagnostics more in multiple regression"; "Provides rule of thumb limits"; "Questionable observation (30,273)"]
Additional summaries
• Rstudent: Studentized residual…almost
all should be between ± 2
• Leverage: “Distance” of X from
center…helps determine outlying X values
in multivariable setting…outlying X values
may be influential
• Cook's D: Influence of the ith case on all
predicted values
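These summaries can also be saved to a data set and screened directly. A sketch, assuming the Toluca data; the output names diag, rstud, lev, and cooksd are made up for illustration.

```sas
* Save studentized residuals, leverage, and Cook's D for each case;
proc reg data=toluca;
  model hours=lotsize;
  output out=diag rstudent=rstud h=lev cookd=cooksd;
run;

* List the cases whose studentized residual falls outside +/- 2;
proc print data=diag;
  where abs(rstud) > 2;
  var lotsize hours rstud lev cooksd;
run;
```

A few cases outside ± 2 are expected by chance, so treat the list as cases to inspect, not cases to delete.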
Lack of fit
• When we have repeat observations at
different values of X, we can do a
significance test for nonlinearity
• Browse through KNNL Section 3.7
• Details of approach discussed when
we get to KNNL 17.9, p 762
• Basic idea is to compare two models
• Gplot with a smooth is a better (i.e.,
simpler) approach
SAS code and output
proc reg data=toluca;
model hours=lotsize / lackfit;
run;
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1           252378        252378    105.88   <.0001
Error             23            54825    2383.71562
  Lack of Fit      9            17245    1916.06954      0.71   0.6893
  Pure Error      14            37581    2684.34524
Corrected Total   24           307203
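The lack-of-fit line can be checked by hand from the output. The error sum of squares splits into lack of fit and pure error, and the test statistic is the ratio of their mean squares. The symbols F*, MSLF, and MSPE are notation assumed here, not taken from the SAS output.

```latex
\begin{aligned}
SSE &= SSLF + SSPE: \quad 17245 + 37581 = 54826 \approx 54825 \ \text{(rounding)} \\
df_{E} &= df_{LF} + df_{PE}: \quad 9 + 14 = 23 \\
F^{*} &= \frac{MSLF}{MSPE} = \frac{1916.06954}{2684.34524} \approx 0.71
\end{aligned}
```

F* is compared with an F distribution on 9 and 14 degrees of freedom. With p = 0.6893 there is no evidence against the straight-line model.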
Nonlinear relationships
• We can model many nonlinear
relationships with linear models,
some have several explanatory
variables (i.e., multiple linear
regression)
–Y = β0 + β1X + β2X² + e (quadratic)
–Y = β0 + β1log(X) + e
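Both models are linear in the parameters, so PROC REG can fit them once the new explanatory variables are created. A sketch, assuming the Toluca data; the names toluca2, lotsize2, and loglot, and the labels quad and logx, are made up for illustration.

```sas
* Create the squared and logged versions of X;
data toluca2; set toluca;
  lotsize2 = lotsize**2;
  loglot = log(lotsize);
run;

* Fit the quadratic model and the log(X) model;
proc reg data=toluca2;
  quad: model hours = lotsize lotsize2;
  logx: model hours = loglot;
run;
```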
Nonlinear Relationships
• Sometimes can transform a
nonlinear equation into a linear
equation
• Consider Y = β0exp(β1X) + e
• Can form a linear model by taking logs
log(Y) = log(β0) + β1X + e′
• Note that we have changed our
assumption about the error: e′ is
additive on the log scale, so the error
is multiplicative on the original scale
Nonlinear Relationship
• We can perform a nonlinear
regression analysis
• KNNL Chapter 13
• SAS PROC NLIN
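A minimal PROC NLIN sketch for the exponential model Y = β0exp(β1X) + e. Everything specific here is an assumption: the data set mydata, the variables y and x, and the starting values, which in practice could come from the log-linear fit.

```sas
* Nonlinear least squares for Y = b0*exp(b1*X) + e;
* mydata, y, x and the starting values are hypothetical;
proc nlin data=mydata;
  parms b0=10 b1=-0.2;
  model y = b0*exp(b1*x);
run;
```

Unlike the log transformation, this keeps the additive error on the original scale of Y.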
Nonconstant variance
• Sometimes we model the way in
which the error variance changes
–may be linearly related to X
• We can then use a weighted analysis
• KNNL 11.1
• Use a weight statement in PROC REG
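One common way to build the weights, sketched under assumptions: model the error standard deviation as linear in X by regressing the absolute residuals on X, then weight by one over the squared fitted value. The data set mydata, the variables y and x, and all intermediate names are hypothetical.

```sas
* Step 1: ordinary fit, save residuals;
proc reg data=mydata;
  model y = x;
  output out=step1 r=resid;
run;

* Step 2: regress |residual| on X to estimate the SD function;
data step2; set step1;
  absr = abs(resid);
run;
proc reg data=step2;
  model absr = x;
  output out=step3 p=shat;
run;

* Step 3: weights are 1/(estimated SD)**2;
* cases with a nonpositive estimate get a missing weight;
data step4; set step3;
  if shat > 0 then wt = 1/(shat**2);
run;

* Step 4: weighted least squares;
proc reg data=step4;
  model y = x;
  weight wt;
run;
```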
Non-Normal errors
• Transformations often help
• Use a procedure that allows
different distributions for the
error term
–SAS PROC GENMOD
Generalized Linear Model
• Possible distributions of Y:
– Binomial (Y/N or percentage data)
– Poisson (Count data)
– Gamma (exponential)
– Inverse Gaussian
– Negative binomial
– Multinomial
• Specify a link function for E(Y)
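A minimal PROC GENMOD sketch for count data, choosing the Poisson distribution and a log link for E(Y). The data set counts and the variables y and x are hypothetical.

```sas
* Poisson regression: log(E(Y)) = b0 + b1*X;
* counts, y, and x are hypothetical names;
proc genmod data=counts;
  model y = x / dist=poisson link=log;
run;
```

Other distributions from the list are selected the same way through the dist= option, each with its own choice of link.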
Ladder of Reexpression
(transformations)
• Transformation is x^p
• p = 1.5, 1.0, 0.5, 0.0, -0.5, -1.0
(p = 0 corresponds to log x)
Circle of Transformations
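[Figure: circle divided into quadrants by the X and Y axes. Each quadrant gives the direction to move on the ladder: upper left "X down, Y up"; upper right "X up, Y up"; lower left "X down, Y down"; lower right "X up, Y down"]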
Box-Cox Transformations
• Also called power transformations
• These transformations adjust for
non-Normality and nonconstant
variance
• Y´ = Y^λ or Y´ = (Y^λ - 1)/λ
• In the second form, the limit as λ
approaches zero is the (natural) log
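The limit can be seen from the series expansion of the exponential, writing λ for the power and using Y^λ = exp(λ log Y):

```latex
\begin{aligned}
Y^{\lambda} &= e^{\lambda \log Y}
  = 1 + \lambda \log Y + \frac{(\lambda \log Y)^{2}}{2} + \cdots \\
\frac{Y^{\lambda} - 1}{\lambda} &= \log Y + \frac{\lambda (\log Y)^{2}}{2} + \cdots
  \;\longrightarrow\; \log Y \quad \text{as } \lambda \to 0
\end{aligned}
```

This is why λ = 0 is treated as the log transformation rather than as Y^0 = 1.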
Important Special Cases
• λ = 1, Y´ = Y^1, no transformation
• λ = .5, Y´ = Y^(1/2), square root
• λ = -.5, Y´ = Y^(-1/2), one over square root
• λ = -1, Y´ = Y^(-1) = 1/Y, inverse
• λ = 0, Y´ = (natural) log of Y
Box-Cox Details
• We can estimate λ by including it as
a parameter in a non-linear model
• Y^λ = β0 + β1X + e
and using the method of maximum
likelihood
• Details are in KNNL p 134-137
• SAS code is in boxcox.sas
Box-Cox Solution
• Standardized transformed Y is
–K1(Y^λ - 1) if λ ≠ 0
–K2log(Y) if λ = 0
where K2 = (∏Yi)^(1/n) (the
geometric mean)
and K1 = 1/(λK2^(λ-1))
• Run regressions with X as
explanatory variable
• Estimated λ minimizes SSE
Example
data a1; input age plasma @@;
cards;
0 13.44 0 12.84 0 11.91 0 20.09
0 15.60 1 10.11 1 11.38 1 10.28
1 8.96 1 8.59 2 9.83 2 9.00
2 8.65 2 7.85 2 8.88 3 7.94
3 6.01 3 5.14 3 6.90 3 6.77
4 4.86 4 5.10 4 5.67 4 5.75
4 6.23
;
Box Cox Procedure
*Procedure that will
automatically find the Box-Cox
transformation;
proc transreg data=a1;
model boxcox(plasma)=identity(age);
run;
Transformation Information for BoxCox(plasma)
Lambda    R-Square    Log Like
 -2.50        0.76    -17.0444
 -2.00        0.80    -12.3665
 -1.50        0.83     -8.1127
 -1.00        0.86     -4.8523 *
 -0.50        0.87     -3.5523 <
  0.00 +      0.85     -5.0754 *
  0.50        0.82     -9.2925
  1.00        0.75    -15.2625
  1.50        0.67    -22.1378
  2.00        0.59    -29.4720
  2.50        0.50    -37.0844
< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
*The first part of the program
gets the geometric mean;
data a2; set a1;
  lplasma=log(plasma);
run;
proc univariate data=a2 noprint;
  var lplasma;
  output out=a3 mean=meanl;
run;
*Standardized Box-Cox transform for
lambda (l) = -1.0, -0.9, ..., 1.0;
data a4; set a2;
  if _n_ eq 1 then set a3;
  keep age yl l;
  k2=exp(meanl);
  *integer index so that l is exactly 0
  at the log case;
  do i = -10 to 10;
    l = i/10;
    if l = 0 then yl=k2*log(plasma);
    else do;
      k1=1/(l*k2**(l-1));
      yl=k1*(plasma**l -1);
    end;
    output;
  end;
run;
proc sort data=a4 out=a4;
  by l;
run;
*Regress transformed Y on age for each
lambda and recover SSE from root MSE;
proc reg data=a4 noprint outest=a5;
  model yl=age;
  by l;
run;
data a5; set a5;
  n=25; p=2; sse=(n-p)*(_rmse_)**2;
run;
proc print data=a5;
  var l sse;
run;
Obs      l        sse
  1   -1.0    33.9089
  2   -0.9    32.7044
  3   -0.8    31.7645
  4   -0.7    31.0907
  5   -0.6    30.6868
  6   -0.5    30.5596
  7   -0.4    30.7186
  8   -0.3    31.1763
  9   -0.2    31.9487
 10   -0.1    33.0552
symbol1 v=none i=join;
proc gplot data=a5;
plot sse*l;
run;
data a1; set a1;
tplasma = plasma**(-.5);
tage = (age+.5)**(-.5);
symbol1 v=circle i=sm50;
proc gplot data=a1;
plot tplasma*age;
proc sort data=a1; by tage;
proc gplot data=a1;
plot tplasma*tage;
run;
Background Reading
• Sections 3.4 - 3.7 describe significance
tests for assumptions (read them if you are
interested).
• Box-Cox transformation is in
nknw132.sas
• Read sections 4.1, 4.2, 4.4, 4.5, and 4.6