Transcript: 003_arizona_LM3_assumptions

Assumptions

"Essentially, all models are wrong, but some are useful." (George E. P. Box)

Your model has to be wrong… but that's o.k. if it's illuminating!
Linear Model Assumptions
• Absence of collinearity
• No influential data points
• Normality of errors
• Homoskedasticity of errors
• Independence
Absence of Collinearity
(Baayen 2008: 182)

Where does collinearity come from? Most often, from correlated predictor variables.
Demo
What to do?
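A minimal R sketch of what such a demo might look like, and one common remedy. The predictors x1 and x2 and all data here are invented for illustration, not the slides' demo data:

set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 is almost a copy of x1 -> collinear predictors
y  <- 2 * x1 + rnorm(100)
xmdl <- lm(y ~ x1 + x2)
summary(xmdl)                      # inflated standard errors, unstable coefficients
cor(x1, x2)                        # correlation among predictors is the usual culprit
# car::vif(xmdl) is a common formal check (if the car package is installed).
# One common remedy: drop or combine redundant predictors,
# e.g. keep only x1, or replace them with a composite score.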
Leverage
(Baayen 2008: 189-190)
[Four scatterplots of y against x illustrating how single influential points change significance: p = 0.37, p = 0.000529, p ≈ 1.19e-26, p ≈ 0.07]
Leave-one-out influence diagnostics: DFbeta (…and much more)
Winter & Matlock (2013)
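A hedged sketch of how these leave-one-out diagnostics can be run in base R. The model xmdl and its data are invented here; only dfbeta() and influence.measures() are standard functions:

set.seed(1)
x <- rnorm(50); y <- x + rnorm(50)
x[50] <- 20; y[50] <- -20            # plant one wildly influential point
xmdl <- lm(y ~ x)
dfb <- dfbeta(xmdl)                  # change in each coefficient when a row is left out
head(dfb)
which.max(abs(dfb[, "x"]))           # the row whose removal shifts the slope the most
influence.measures(xmdl)             # DFBETAS, Cook's distance, leverage, and more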
Normality of Error
The error (not the data!) is assumed to be normally distributed.
So, the residuals should be normally distributed.
xmdl = lm(y ~ x)
hist(residuals(xmdl))
[Histogram of residuals(xmdl): roughly symmetric and bell-shaped ✔]

qqnorm(residuals(xmdl))
qqline(residuals(xmdl))
[Normal Q-Q Plot: sample quantiles against theoretical quantiles, points falling close to the line ✔]

qqnorm(residuals(xmdl))
qqline(residuals(xmdl))
[Normal Q-Q Plot: points deviating markedly from the line ✗]
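Since the slides' x and y are not shown, here is a self-contained version of the same checks on simulated data, with one well-behaved model and one skewed-error model (both invented):

set.seed(7)
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200)          # normal errors -> diagnostics should look fine
xmdl <- lm(y ~ x)
hist(residuals(xmdl))                                 # roughly bell-shaped
qqnorm(residuals(xmdl)); qqline(residuals(xmdl))      # points close to the line
y_bad <- 2 + 3 * x + rexp(200)       # skewed errors -> the Q-Q plot bends away
xmdl_bad <- lm(y_bad ~ x)
qqnorm(residuals(xmdl_bad)); qqline(residuals(xmdl_bad))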
Homoskedasticity of Error
The error (not the data!) is assumed to have equal variance across the predicted values.
So, the residuals should have equal variance across the predicted values.
[Scatterplot of reaction time against noise, followed by three plots of residuals(xmdl) against fitted(xmdl): one with roughly equal spread of residuals across the fitted values ✔, two with clearly unequal spread ✗]
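A small simulated illustration of this check; the data and model names are made up, only fitted() and residuals() are as on the slides:

set.seed(3)
x <- runif(200, 1, 10)
y_ok  <- 5 + 2 * x + rnorm(200)             # constant error variance
y_het <- 5 + 2 * x + rnorm(200, sd = x)     # error variance grows with x
mdl_ok  <- lm(y_ok ~ x)
mdl_het <- lm(y_het ~ x)
plot(fitted(mdl_ok),  residuals(mdl_ok));  abline(h = 0)   # even band around zero ✔
plot(fitted(mdl_het), residuals(mdl_het)); abline(h = 0)   # spread fans out ✗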
What to do if normality/homoskedasticity is violated?
• Either: nothing + report the violation
• Or: report the violation + transformations
Two types of transformations
• Linear transformations: leave the shape of the distribution intact (centering, scaling)
• Nonlinear transformations: change the shape of the distribution (see the sketch below)
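A minimal sketch of the contrast, assuming simulated right-skewed reaction times rather than the slides' xdata$Pic.RT:

set.seed(5)
rt <- rlnorm(1000, meanlog = 6.5, sdlog = 0.4)   # right-skewed, like typical RTs
hist(rt)                       # skewed
hist(scale(rt)[, 1])           # linear transformation (centering + scaling): same shape
hist(log(rt))                  # nonlinear transformation: shape changes, skew reduced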
[Histogram of xdata$Pic.RT (values roughly 0–4000) and histogram of log(xdata$Pic.RT) (values roughly 5.5–8.5)]
Before transformation: [plot of residuals(xmdl) against fitted(xmdl), residuals roughly -1000 to 2000]
After transformation: [plot of residuals(xmdl.log) against fitted(xmdl.log), residuals roughly -0.5 to 1.0]
Still bad… but better!!
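A hedged before/after sketch along the same lines; xdata, the single predictor cond, and the simulated reaction times are placeholders, not the slides' actual data or model:

set.seed(9)
xdata <- data.frame(cond = rnorm(500))
xdata$Pic.RT <- exp(6.5 + 0.3 * xdata$cond + rnorm(500, sd = 0.4))  # multiplicative noise
xmdl     <- lm(Pic.RT ~ cond, data = xdata)
xmdl.log <- lm(log(Pic.RT) ~ cond, data = xdata)
plot(fitted(xmdl),     residuals(xmdl))       # before: skewed, unequal spread
plot(fitted(xmdl.log), residuals(xmdl.log))   # after: still not perfect, but better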
Assumptions
• Absence of collinearity
• No influential data points
• Normality of errors
• Homoskedasticity of errors
• Independence
Assumptions and how to check them
• Normality of errors → histogram of residuals, Q-Q plot of residuals
• Homoskedasticity of errors → residual plot
What is independence?
Common experimental data
[Diagram: each subject responds to Item #1 several times (Rep 1, Rep 2, Rep 3) and to further items]

Pseudoreplication = disregarding these dependencies
Subject     Item
Subject1    Item1
Subject1    Item2
Subject1    Item3
…           …
Subject2    Item1
Subject2    Item2
Subject3    Item3
…           …

"pooling fallacy" (Machlis et al. 1985)
"pseudoreplication" (Hurlbert 1984)
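A small sketch of this layout as an R data frame (subject and item labels invented), to make the dependency explicit:

xdata <- expand.grid(Subject = paste0("Subject", 1:3),
                     Item    = paste0("Item", 1:3))
xdata$RT <- rnorm(nrow(xdata), mean = 600, sd = 50)   # one response per row
head(xdata)
# Rows sharing a Subject (or an Item) are not independent observations;
# feeding them all into a plain lm() as if they were is pseudoreplication.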
Hierarchical data is everywhere
• Typological data (e.g., Bell 1978, Dryer 1989, Perkins 1989; Jaeger et al. 2011)
• Organizational data
• Classroom data
[Diagram: Finnish, Norwegian, Swedish, English, French, Spanish, German, Hungarian, Romanian, Italian, Turkish]
Hierarchical data is everywhere
[Diagram: students nested within Class 1 and Class 2]
Intraclass Correlation (ICC)
[Simulation for 16 subjects: Type I error rate under pseudoreplication vs. an items analysis]
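A toy simulation in the same spirit, assuming 16 subjects, 20 trials each, and a between-subject condition with no true effect; the "correct" comparison here aggregates to one value per subject rather than running the slides' items analysis:

set.seed(123)
n_sims <- 1000; n_subj <- 16; n_trials <- 20
p_pseudo <- p_bysubj <- numeric(n_sims)
for (i in 1:n_sims) {
  cond <- rep(c("A", "B"), each = n_subj / 2)       # between-subject condition, no true effect
  subj_mean <- rnorm(n_subj, sd = 1)                # subject-level variation (drives the ICC)
  d <- data.frame(subject = rep(1:n_subj, each = n_trials),
                  cond    = rep(cond, each = n_trials),
                  y       = rep(subj_mean, each = n_trials) + rnorm(n_subj * n_trials, sd = 1))
  p_pseudo[i] <- summary(lm(y ~ cond, data = d))$coefficients["condB", 4]   # all trials as if independent
  agg <- aggregate(y ~ subject + cond, data = d, FUN = mean)                # one value per subject
  p_bysubj[i] <- summary(lm(y ~ cond, data = agg))$coefficients["condB", 4]
}
mean(p_pseudo < 0.05)   # well above the nominal 0.05
mean(p_bysubj < 0.05)   # close to 0.05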
Interpretational problem: what's the population for inference?

Violating the independence assumption makes the p-value… meaningless.
That’s it
(for now)