Lack of Fit and Data Transformation


Transformations in Statistical Analysis
Outline
• Assumptions of linear statistical models.
• Types of Transformations
• Alternatives to Transformations
Model Assumptions
• Effect additivity
• Normality
• Homoscedasticity
• Independence
Order of Importance
• Homoscedasticity
• Normality
• Additivity
• Independence
Observational Analysis Models (Regression)
Experimental Analysis Models (ANOVA)
• Additivity
• Homoscedasticity
• Normality
• Independence
All four are so interrelated that which is “most” important may be immaterial!
Independence
When is this important?
Measurements over time on the same individual.
• Time series data (rainfall, temperature, etc).
• Repeated measures - split plots in time.
• Growth curves.
Measurements near each other in space.
• Split plot designs.
• Spatial data.
How do I know it’s a problem?
By design - how the data were collected.
Temporal/spatial autocorrelation analysis.
Rectifying a dependence problem.
Modify the type of model to be fitted to the data.
Homoscedasticity
How do I know I have a problem?
Plot predicted (fitted) values versus residuals: $\hat{y}_i$ vs. $\hat{\epsilon}_i$.
What is the pattern of the spread in the residuals as the predicted values increase?
• Spread constant.
• Spread increases.
• Spread decreases then increases.
[Figure: three residual plots. Constant spread is labelled Acceptable; the other two patterns are labelled Problems.]
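The residual check described above can be sketched in R. This is only an illustration under stated assumptions: the data are simulated (not from the slides) with an error standard deviation that grows with x, so the plot should show the "spread increases" pattern.

```r
# Simulated data (an assumption for illustration): error SD grows with x
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)

fit <- lm(y ~ x)

# Residuals versus fitted values; a funnel shape suggests heteroscedasticity
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A roughly constant band around zero would correspond to the acceptable case.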
Lack of Homogeneity in Regression
What to do?
• Attempt a transformation.
• Weighted regression.
• Incorporate additional covariates.
• Non-linear regression.
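As one possible illustration of the weighted regression option, the sketch below compares ordinary and weighted least squares in R. The data are simulated, and the choice of weights (1/x^2, i.e. assuming the error standard deviation is proportional to x) is an assumption made for this example only.

```r
# Simulated data (assumption): error SD proportional to x
set.seed(2)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)

ols <- lm(y ~ x)                      # ordinary least squares
wls <- lm(y ~ x, weights = 1 / x^2)   # weights proportional to 1 / variance

# Compare coefficient estimates and their standard errors
coef(summary(ols))
coef(summary(wls))
```

In practice the variance function is rarely known, so the weights would have to be estimated or justified from the residual plot.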
What to do if the spread of the residuals plotted versus X looks like this?
[Figure: residuals plotted versus X.]
or this?
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
Need another x variable.
[Figure: residuals plotted versus x.]
Transforming the Response to achieve Linearity
If a scatterplot of y versus x curves upward, proceed down on the scale to choose a transformation.
Handling Heterogeneity
Regression?
• yes: Fit linear model → Plot residuals → OK, or choose a type of transformation.
• no (ANOVA): Group means → Test for Homoscedasticity → accept: OK; reject: choose a type of transformation.
Type of Transformation
• Box/Cox Family
• Traditional
• Power Family
Transform Observations
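For reference, the Box/Cox family named above is usually written as follows. This is the standard textbook definition, not something taken from the slides; $\lambda$ is the transformation parameter and the response is assumed positive.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log y, & \lambda = 0,
\end{cases}
\qquad y > 0.
```

Up to shift and scale, $\lambda = 1$ leaves the data unchanged, $\lambda = 1/2$ is the square root, $\lambda = 0$ is the log, and $\lambda = -1$ is the reciprocal, so the family covers the traditional power transformations.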
Transformations to Achieve Normality
Regression?
• yes: Fit linear model.
• no (ANOVA): Estimate group means.
Check the residuals with a Q-Q plot and formal tests.
Residuals Normal?
• yes: OK
• no: Transform, or use a different model.
Transformations to Achieve Normality
How can we determine if observations are normally distributed?
• Graphical examination:
  • Normal quantile-quantile plot (QQ-plot).
  • Histogram or boxplot.
• Goodness of fit tests:
  • Kolmogorov-Smirnov test.
  • Shapiro-Wilk test.
  • D’Agostino’s test.
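A minimal R sketch of the graphical and formal checks listed above, using simulated right-skewed values in place of real residuals (an assumption for illustration). D’Agostino’s test is not in base R and is left out here.

```r
# Simulated, right-skewed "residuals" (illustration only)
set.seed(3)
res <- rexp(80) - 1

# Graphical examination
qqnorm(res); qqline(res)
hist(res)

# Goodness of fit tests
sw <- shapiro.test(res)                                       # Shapiro-Wilk
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))   # Kolmogorov-Smirnov
sw$p.value
ks$p.value
```

Because the mean and standard deviation are estimated from the same data, the Kolmogorov-Smirnov p-value here is only approximate.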
Non-normal! So what?
Only very skewed distributions will have a marked effect on the significance level of the F-test for overall model or model effects.
Often the same transformations which are used to achieve homoscedasticity will produce more normal-looking observations (residuals).
Transformations to Achieve Model Simplicity
GOAL: To provide as simple as possible a mathematical form for the relationship among response and explanatory variables.
May require transforming both response and explanatory variables.
Alternative Models
Listed from low to high complexity:
• Regular Least Squares
• Weighted Least Squares
• Non-Parametric Methods
• Generalized Linear Models
• Non-Linear Regression
Example: Predicting brain weight from body weight in mammals via SLR
Data are average brain (Y, g) and body (X, kg) weights for 62 species of mammals (2 omitted). Source: Allison & Cicchetti (1976), Science.

Species (common name)   body weight (kg)   brain weight (g)
Arctic fox                         3.385             44.500
Owl monkey                         0.480             15.499
Horse                            521.000            655.000
Kangaroo                          35.000             56.000
Human                             62.000           1320.000
African elephant                6654.000           5712.000
Asian elephant                  2547.000           4603.000
…
Chimpanzee                        52.160            440.000
Tree shrew                         0.104              2.500   Omit
Red fox                            4.235             50.400   Omit
Scatterplot of data is non-informative. Most species have small weights compared to the elephants.
Viewing only those mammals with body weight below 300 kg suggests transforming to a log scale to linearize the relationship.
Scatterplot looks linear. Fitted regression equation is:
$\log(y) = 2.111 + 0.755\,\log(x)$
Body weight is a very significant predictor of brain weight (p-value < 0.0001). Also, $R^2 = 0.922$.
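A fit of this kind can be sketched in R. Two assumptions are made here: that the `mammals` data set shipped with the MASS package holds the same 62-species data (body in kg, brain in g), and that its row names "Tree shrew" and "Red fox" identify the two omitted species. Estimates may differ slightly from the slide values if the data differ.

```r
# Assumption: MASS::mammals is the same 62-species data set
data(mammals, package = "MASS")

# Hold out the two omitted species (row names assumed to match)
held_out <- c("Tree shrew", "Red fox")
train <- mammals[!(rownames(mammals) %in% held_out), ]

# Simple linear regression on the log-log scale
fit <- lm(log(brain) ~ log(body), data = train)
summary(fit)

# Scatterplot on the log scale with the fitted line
plot(log(train$body), log(train$brain),
     xlab = "log(body weight)", ylab = "log(brain weight)")
abline(fit)
```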
[Figure: residual plot and QQ-plot; labelled points: human, opossum.]
Residual plot shows no obvious violations of the zero mean and constant variance assumption.
QQ-Plot demonstrates that the normality assumption for the residuals is plausible.
Checking for influential observations (R)
> fm <- lm(log(y)~log(x))
> influence.measures(fm)
Influence measures of
lm(formula = log(y) ~ log(x)) :
     dfb.1.  dfb.lg..    dffit cov.r   cook.d    hat inf
1   0.13501 -8.18e-03  0.14452 1.009 1.04e-02 0.0167
2   0.27274 -1.56e-01  0.27714 0.956 3.71e-02 0.0245     (Owl Monk.)
3  -0.04860  1.62e-02 -0.04876 1.051 1.21e-03 0.0187
…
14 -0.02853  3.42e-02 -0.03775 1.142 7.25e-04 0.0937   * (Shrew)
…
19  0.00538  1.69e-01  0.18810 1.121 1.79e-02 0.0881   * (Asian El.)
…
32  0.22151  3.51e-01  0.53207 0.788 1.24e-01 0.0295   * (Human)
33  0.00130 -5.11e-02 -0.05538 1.164 1.56e-03 0.1110   * (African El.)
34 -0.31147  1.54e-02 -0.33480 0.846 5.11e-02 0.0167   * (Opossum)
35  0.27033  5.36e-02  0.32472 0.861 4.85e-02 0.0171   * (Rhesus Monk.)
…
40 -0.00740  8.39e-03 -0.00945 1.124 4.55e-05 0.0786   * (Brown Bat)
…
60 -0.00799  2.27e-03 -0.00806 1.054 3.31e-05 0.0181
In MTB: Stat > Regression > Regression > Regression Storage
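To pull out only the rows that R stars in the `inf` column, something like the following may help. It assumes, as an illustration, that `MASS::mammals` matches the slide data and that the two omitted species carry the row names shown; `is.inf` and `infmat` are components of the object returned by `influence.measures()`.

```r
# Assumption: MASS::mammals is the same data; two species held out as on the slides
data(mammals, package = "MASS")
train <- mammals[!(rownames(mammals) %in% c("Tree shrew", "Red fox")), ]
fit <- lm(log(brain) ~ log(body), data = train)

im <- influence.measures(fit)

# Rows flagged as influential on at least one measure
flagged <- which(apply(im$is.inf, 1, any))
im$infmat[flagged, c("dffit", "cov.r", "cook.d", "hat"), drop = FALSE]
```

`summary(im)` prints a similar listing of the potentially influential observations.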
Decision: Leave out man (he doesn’t really fit in with the rest of the mammals) and re-run the analysis.

Feature                 Full Model   Omit Human
$\hat{\beta}_0$              2.111        2.090
$\hat{\beta}_1$              0.755        0.745
$SE(\hat{\beta}_1)$          0.029        0.027
$R^2$                        0.922        0.929
Slope p-value             < 0.0001     < 0.0001

Even though results don’t change much, we will go with this last model:
$\log(y) = 2.090 + 0.745\,\log(x) \;\Rightarrow\; y = e^{2.090}\,x^{0.745} = 8.08\,x^{0.745}$
Predicting the brain weights of the omitted mammals (R)
> xh <- x[-32]; yh <- y[-32]
> fmh <- lm(log(yh)~log(xh))
> new <- data.frame(xh=c(.104,4.235))
> predict(fmh, newdata=new, interval="prediction")
        fit        lwr      upr
1 0.4038624 -0.9269029 1.734628
2 3.1660753  1.8499283 4.482222
> exp(predict(fmh, newdata=new, interval="prediction"))
        fit       lwr       upr
1  1.497598 0.3957776  5.666817
2 23.714231 6.3593633 88.430985
Exponentiate final results!

Mammal       Predicted Brain Wt   Prediction Interval   Actual Brain Wt
Tree Shrew                1.498   (0.396, 5.667)                  2.500
Red Fox                  23.714   (6.359, 88.431)                50.400

This illustrates the idea of cross-validation in regression. It is often recommended that the data be split into two (equal?) portions; use one for model fitting; the other for model checking/verification.
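The split-sample idea can be sketched as follows. This is an illustration only: it assumes `MASS::mammals` holds the same data, uses an arbitrary random half split (the seed and split are not from the slides), and checks predictions on the log scale.

```r
# Assumption: MASS::mammals is the same 62-species data set
data(mammals, package = "MASS")

set.seed(4)
n <- nrow(mammals)
idx <- sample(n, size = floor(n / 2))   # random half used for model fitting
train <- mammals[idx, ]
test  <- mammals[-idx, ]                # other half used for checking

fit <- lm(log(brain) ~ log(body), data = train)
pi_log <- predict(fit, newdata = test, interval = "prediction")

# Share of held-out species whose log brain weight falls inside the 95% PI
covered <- log(test$brain) >= pi_log[, "lwr"] & log(test$brain) <= pi_log[, "upr"]
mean(covered)

# Root mean squared prediction error on the log scale
sqrt(mean((log(test$brain) - pi_log[, "fit"])^2))
```

Applying `exp()` to `pi_log` would give predictions and interval limits in grams, as in the session above.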
Predicting the brain weights of the omitted mammals (MTB)
[Screenshot of MTB dialog.] Influence measures can be selected here.
MTB output (with man)

The regression equation is
lbrain = 2.11 + 0.755 lbody

Predictor     Coef  SE Coef      T      P
Constant   2.11091  0.09860  21.41  0.000
lbody      0.75467  0.02889  26.12  0.000

S = 0.696924   R-Sq = 92.2%   R-Sq(adj) = 92.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  331.35  331.35  682.21  0.000
Residual Error  58   28.17    0.49
Total           59  359.52

Unusual Observations
Obs  lbody  lbrain     Fit  SE Fit  Residual  St Resid
 32   4.13  7.1854  5.2255  0.1197    1.9599    2.85R
 33   8.80  8.6503  8.7542  0.2322   -0.1039   -0.16 X
 34   1.25  1.3610  3.0563  0.0901   -1.6954   -2.45R
 35   1.92  5.1874  3.5575  0.0912    1.6298    2.36R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations
New Obs     Fit  SE Fit            95% CI              95% PI
      1  0.4028  0.1388  (0.1249, 0.6807)  (-1.0196, 1.8253)
      2  3.2002  0.0900  (3.0201, 3.3803)  ( 1.7936, 4.6068)

Only available influence measures are: standard/student residuals; hat matrix; Cook’s dist; and dffits.