Lecture 13
Chapter 4: Regression Diagnostics
Detection of Model Violation

In Chapter 2 we studied SLR (simple linear regression) models; in Chapter 3 we studied MLR (multiple linear regression) models. We obtained many useful results about estimation and statistical inference.

However, all these results are VALID only when some required assumptions are satisfied.

Questions:
1. What are these required assumptions?
2. How do we detect whether these assumptions are violated?
3. What happens if these assumptions are violated?
Answer to Question 3: when some required assumptions are violated,
1) the theories are NO LONGER valid;
2) applications will lead to misleading results.
We will see some examples later. Thus we need to answer Questions 1 and 2.

Aims of Chapter 4:
1) State the standard regression assumptions.
2) Study the methods used to detect model violations.
Standard Regression Assumptions:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$$
1. The Linearity Assumption (about the form of the model)
2. The Measurement Error Assumption (about the measurement errors)
3. The Predictor Assumption (about the predictor variables)
4. The Observation Assumption (about the observations)
1. The Form Assumption: the $i$-th observation can be written as
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i, \qquad i = 1, \dots, n.$$
Detection Method: For SLR ($p = 1$), use the scatter plot of Y against X to detect nonlinearity; an approximately linear scatter plot supports the linearity assumption. For MLR ($p > 1$) this is a difficult task; we may be able to use the scatter plots of Y against each of X1, X2, …, Xp.
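For concreteness, here is a minimal Python sketch of these scatter-plot checks, assuming pandas and matplotlib are available; the file name data.csv and the column names Y, X1, ..., Xp are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data set: a response column Y and predictor columns X1, ..., Xp.
df = pd.read_csv("data.csv")  # "data.csv" is a placeholder file name

# SLR (p = 1): one scatter plot of Y against X.
df.plot.scatter(x="X1", y="Y", title="Y vs X1")
plt.show()

# MLR (p > 1): scatter plots of Y against each predictor, one panel each.
predictors = [c for c in df.columns if c != "Y"]
fig, axes = plt.subplots(1, len(predictors),
                         figsize=(4 * len(predictors), 4), squeeze=False)
for ax, name in zip(axes[0], predictors):
    ax.scatter(df[name], df["Y"])
    ax.set_xlabel(name)
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```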
2. The Measurement Error Assumption
$$\epsilon_1, \epsilon_2, \dots, \epsilon_n \ \text{iid from} \ N(0, \sigma^2), \quad \sigma^2 \ \text{unknown}$$
iid means Independently and Identically Distributed.
This assumption implies 4 sub-assumptions (a plotting sketch for checking them follows the list):
1. The Normality Assumption: $\epsilon_1, \dots, \epsilon_n$ are normally distributed.
Detection Method: normal probability plot of the residuals.
2. The Zero-Mean Assumption: $\epsilon_1, \dots, \epsilon_n$ have mean 0.
Detection Method: plot of the residuals against the fitted values; the residuals should scatter around 0.
3. The Constant-Variance Assumption: $\epsilon_1, \dots, \epsilon_n$ have the same but unknown variance $\sigma^2$. When this assumption is violated, the problem is called the heterogeneity or heteroscedasticity problem.
Detection Method: see Chapter 7.
4. The Independence Assumption: $\epsilon_1, \dots, \epsilon_n$ are independent of each other. When this assumption is violated, the problem is called the autocorrelation problem.
Detection Method: see Chapter 8.
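Below is a minimal Python sketch of these checks, assuming numpy, matplotlib, and statsmodels are available, on simulated data that satisfies all four sub-assumptions: a normal Q-Q plot (normality), residuals vs fitted values (zero mean, constant variance), a histogram, and residuals vs observation order (independence):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 + 0.5 * x + rng.normal(0, 1, 50)   # simulated data satisfying the assumptions

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
sm.qqplot(resid, line="q", ax=axes[0, 0])   # normality: points near the line
axes[0, 0].set_title("Normal Q-Q plot of residuals")
axes[0, 1].scatter(fitted, resid)           # zero mean / constant variance
axes[0, 1].axhline(0, color="gray")
axes[0, 1].set_title("Residuals vs fitted values")
axes[1, 0].hist(resid, bins=10)             # rough shape check for normality
axes[1, 0].set_title("Histogram of residuals")
axes[1, 1].plot(resid, marker="o")          # independence: no runs or trends
axes[1, 1].axhline(0, color="gray")
axes[1, 1].set_title("Residuals vs observation order")
plt.tight_layout()
plt.show()
```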
3. The Predictor Assumption contains 3 sub-assumptions:
a) The Non-Random Assumption: X1, X2, …, Xp are non-random, i.e., their values are assumed to be fixed or selected in advance.
Design data: the values of the predictors are set by the experimenter in advance.
Non-design data, or observational data: the values of the predictors are observed along with the response rather than controlled.
When this assumption is violated, all inferences remain valid, but only conditionally on the observed data. In this course, we assume this condition is always satisfied.
Detection Method: beyond the scope of this course.
b) The Without-Measurement-Error Assumption: X1, X2, …, Xp can be accurately observed, i.e., they can be measured without errors.
Detection Method: beyond the scope of this course; we assume this condition is always satisfied.
This assumption is rarely satisfied exactly. If it is violated, the measurement errors in the predictors will affect the residual variance, the coefficient estimates, and the fitted values.
c) The Linear Independence Assumption: X1, X2, …, Xp are assumed to be linearly independent of each other.
This assumption guarantees the uniqueness of the least squares estimates of the regression coefficients. When this assumption is violated, there are multiple solutions for the least squares estimates of the regression coefficients.
Detection Method: check whether the design matrix is of full rank, as in the sketch below.
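A quick rank check can be done with NumPy; the design matrix below is a hypothetical example in which X3 = X1 + X2, so the assumption is violated:

```python
import numpy as np

# Hypothetical design matrix: intercept column plus predictors X1, X2, X3,
# where X3 = X1 + X2, i.e., the predictors are linearly dependent.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
X3 = X1 + X2
X = np.column_stack([np.ones(5), X1, X2, X3])

rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])   # rank 3 < 4 columns: not full rank, so the
                          # least squares estimates are not unique
```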
4. The Observation Assumption: all observations are equally reliable and play an approximately equal role in determining the regression results and influencing the conclusions.
Consequences of Violations of the Assumptions:
In general, minor violations of the assumptions do not distort the inferences or conclusions too much. However, gross violations will invalidate or seriously distort the conclusions. Thus, we should study how to detect these violations.
Let us see some examples below: Anscombe's Quartet data.

   Y1    X1     Y2    X2     Y3    X3     Y4    X4
 8.04    10   9.14    10   7.46    10   6.58     8
 6.95     8   8.14     8   6.77     8   5.76     8
 7.58    13   8.74    13  12.74    13   7.71     8
 8.81     9   8.77     9   7.11     9   8.84     8
 8.33    11   9.26    11   7.81    11   8.47     8
 9.96    14   8.10    14   8.84    14   7.04     8
 7.24     6   6.13     6   6.08     6   5.25     8
 4.26     4   3.10     4   5.39     4  12.50    19
10.84    12   9.13    12   8.15    12   5.56     8
 4.82     7   7.26     7   6.42     7   7.91     8
 5.68     5   4.74     5   5.73     5   6.89     8
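To see the point numerically, here is a minimal numpy sketch that fits the four simple regressions from the table above using the least squares formulas of Chapter 2; all four fits produce essentially the same coefficients, S, and R-squared:

```python
import numpy as np

# Anscombe's Quartet (X1 = X2 = X3 for the first three sets).
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

def fit(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(e ** 2) / (len(x) - 2))                  # residual std. error
    r2 = 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)       # R-squared
    return b0, b1, s, r2

for name, (x, y) in {"1": (x1, y1), "2": (x1, y2), "3": (x1, y3), "4": (x4, y4)}.items():
    b0, b1, s, r2 = fit(x, y)
    print(f"set {name}: Y = {b0:.3f} + {b1:.3f} X,  S = {s:.3f},  R-sq = {r2:.3f}")
```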
[Figure: scatter plots of the four data sets: Y1 vs X1, Y2 vs X2, Y3 vs X3, and Y4 vs X4]
[Figure: regression plots of the four fits. The fitted lines and summary statistics are essentially identical:
Y1 = 3.00009 + 0.500091 X1, S = 1.23660
Y2 = 3.00091 + 0.5 X2, S = 1.23721
Y3 = 3.00245 + 0.499727 X3, S = 1.23631
Y4 = 3.00173 + 0.499909 X4, S = 1.23570
For all four fits, R-Sq ≈ 66.6-66.7% and R-Sq(adj) ≈ 62.9-63.0%.]
The Residual Plot of Y1 vs X1
[Figure: four diagnostic panels: normal plot of residuals, I chart of residuals (UCL = 4.270, Mean = -0.01305, LCL = -4.296), histogram of residuals, and residuals vs. fits]
The Residual Plot of Y2 vs X2
[Figure: four diagnostic panels: normal plot of residuals, I chart of residuals (UCL = 3.730, Mean = -0.05175, LCL = -3.834), histogram of residuals, and residuals vs. fits]
The Residual Plot of Y3 vs X3
[Figure: four diagnostic panels: normal plot of residuals, I chart of residuals (UCL = 2.993, Mean = 0.01080, LCL = -2.972), histogram of residuals, and residuals vs. fits]
The Residual Plot of Y4 vs X4
[Figure: four diagnostic panels: normal plot of residuals, I chart of residuals (UCL = 3.080, Mean = -1.8E-15, LCL = -3.080), histogram of residuals, and residuals vs. fits]
Methods of Detecting Violations
a) Using graphical methods (plots).
b) Using some statistical measures (we will learn some of them soon).
c) Combining a) and b).

Most of the above methods for detecting assumption violations are residual-based methods or use residual plots. The latter can reveal many features of the data that might be missed or overlooked using only summary statistics; for example, many widely used statistics, such as the correlation coefficients and the regression coefficients, are the same for all 4 data sets of the Anscombe Quartet.

Thus, we need to study the residuals. The various types of residuals are:
a) Ordinary residuals
b) Standardized residuals
c) Studentized residuals (internal and external)
1. Ordinary Residuals
For an SLR model, we have
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 x_i = \bar{y} + \hat{\beta}_1 (x_i - \bar{x})$$
$$= \frac{1}{n}\sum_{j=1}^{n} y_j + (x_i - \bar{x}) \sum_{j=1}^{n} (x_j - \bar{x})\, y_j \Big/ \sum_{k=1}^{n} (x_k - \bar{x})^2 = p_{i1} y_1 + p_{i2} y_2 + \dots + p_{in} y_n,$$
where
$$p_{ij} = \frac{1}{n} + (x_i - \bar{x})(x_j - \bar{x}) \Big/ \sum_{k=1}^{n} (x_k - \bar{x})^2, \qquad p_{ii} = \frac{1}{n} + (x_i - \bar{x})^2 \Big/ \sum_{k=1}^{n} (x_k - \bar{x})^2.$$
Let $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)^T$ and $P = (p_{ij})$. Then
$$\hat{y} = P y.$$
Thus $\hat{e} = y - \hat{y} = (I - P) y$. This is the vector of the ordinary residuals.
In general, for $p \ge 1$, we can write
$$\hat{y} = P y,$$
where $P = X(X^T X)^{-1} X^T$ is called the projection matrix. Thus $\hat{e} = y - \hat{y} = (I - P) y$. We can show that
$$\mathrm{Var}(\hat{e}) = (I - P)\,\mathrm{Var}(y)\,(I - P)^T = \sigma^2 (I - P), \qquad \mathrm{Var}(\hat{e}_i) = \sigma^2 (I - P)_{ii} = \sigma^2 (1 - p_{ii}).$$
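These formulas translate directly into a few lines of numpy; a minimal sketch on a simulated design matrix, computing P with the standard projection-matrix formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # simulated design matrix
y = X @ np.array([3.0, 0.5, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T        # projection (hat) matrix
y_hat = P @ y                               # fitted values:     y_hat = P y
e_hat = (np.eye(n) - P) @ y                 # ordinary residuals: e = (I - P) y

# Var(e_i) = sigma^2 (1 - p_ii): residuals at high-leverage points (large p_ii)
# have smaller variance, which motivates the standardization that follows.
p_ii = np.diag(P)
print(np.allclose(e_hat, y - y_hat))        # True
print(p_ii.min(), p_ii.max())
```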
2. The Standardized Residuals
$$z_i = \hat{e}_i \big/ \big(\sigma \sqrt{1 - p_{ii}}\,\big).$$
But $\sigma$ is unknown and must be replaced by an estimate.
a) Internally Studentized Residuals
Here $\sigma$ is replaced by the usual noise variance estimator $\hat{\sigma}$, with $\hat{\sigma}^2 = SSE/(n - p - 1) = MSE$, so that
$$r_i = \hat{e}_i \big/ \big(\hat{\sigma} \sqrt{1 - p_{ii}}\,\big).$$
Drawbacks: 1) $\hat{e}_i$ and $\hat{\sigma}$ are correlated; 2) the $r_i$ do not sum to 0; 3) the $r_i$ are not independent of each other.

b) Externally Studentized Residuals
Here $\sigma$ is replaced by the leave-one-out variance estimator $\hat{\sigma}_{(i)}$, with $\hat{\sigma}_{(i)}^2 = SSE_{(i)}/(n - p - 2) = MSE_{(i)}$, computed with the $i$-th observation deleted, so that
$$r_i^* = \hat{e}_i \big/ \big(\hat{\sigma}_{(i)} \sqrt{1 - p_{ii}}\,\big).$$
Advantage: $r_i^*$ has a $t$-distribution with $n - p - 2$ degrees of freedom.
Drawbacks: 1) the $r_i^*$ do not sum to 0; 2) the $r_i^*$ are not independent of each other.
Moreover,
$$r_i^* = r_i \sqrt{\frac{n - p - 2}{\,n - p - 1 - r_i^2\,}},$$
which is a monotone transformation.
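In practice these residuals need not be computed by hand; statsmodels (assumed available) provides both types through its influence diagnostics, and the monotone relation above can be verified numerically. A minimal sketch with simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 3 + 0.5 * x + rng.normal(0, 1, 30)

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = res.get_influence()

r_int = infl.resid_studentized_internal    # r_i  (uses sigma-hat)
r_ext = infl.resid_studentized_external    # r_i* (uses sigma-hat_(i))

# Verify the monotone relation r_i* = r_i * sqrt((n-p-2) / (n-p-1 - r_i^2)).
n, p = len(y), 1
print(np.allclose(r_ext, r_int * np.sqrt((n - p - 2) / (n - p - 1 - r_int**2))))
```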
Summary
I. Standard Regression Assumptions:
a) about the form of the model
b) about the measurement errors
c) about the predictor variables
d) about the observations
II. Examples with Anscombe's Quartet data show that:
a) gross violations of the assumptions will lead to serious problems;
b) summary statistics may miss or overlook features of the data.
III. Types of Residuals:
a) Ordinary
b) Standardized
c) Studentized