Transcript: Regression Diagnostics
Intermediate Social Statistics
Lecture 14
Regression Diagnostics
©Ming-chi Chen
Social Statistics
Page 1
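Where can regression analysis go wrong?
• Because statistical software makes it so easy to run, multiple regression is often misused and abused.
• The problem usually lies in not understanding the assumptions behind multiple regression and its potential pitfalls.
• Most of the problems arise when making causal inferences.
• The problems are less serious for multiple regression models used purely for prediction.
• The following is based on Paul Allison (1999).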
Where can regression analysis go wrong?
• Model specification errors
– Are important independent variables left out of the model?
– Are irrelevant independent variables included in the model?
• Does the dependent variable affect any of the independent variables?
• How well are the independent variables measured?
Where can regression analysis go wrong?
• Is the sample large enough to detect important effects?
• Is the sample so large that trivial effects are statistically significant?
• Do some variables mediate the effects of other variables?
• Are some independent variables too highly correlated?
• Is the sample biased?
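14.1.1
Model Specification Errors: Omission
The problem of misspecified independent variables
• Why include a particular independent variable (IV) in a multiple regression equation?
– To learn the effect of that IV on the dependent variable (DV)
– To control for that IV
• Researchers often forget to include important control variables
• What counts as an important control variable?
– Does it have a causal effect on the DV?
– Is it correlated with the key variable we care about?
• Why does correlation matter?
– If it is unrelated to the other IVs, there is no need to control for it.
– Controlling is meant to isolate the net relationship.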
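Misspecified independent variables: omission
• Consequences of omitting an important IV
• The regression coefficients will be biased, either too high or too low.
• Without the control, the relationship between the IV and the DV may be spurious.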
VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
Consequences of Variable Misspecification
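True model (simple): $Y = \beta_1 + \beta_2 X_2 + u$
True model (multiple): $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$
Fitted model (simple): $\hat{Y} = b_1 + b_2 X_2$
Fitted model (multiple): $\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$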
In this sequence and the next we will investigate the consequences of misspecifying the
regression model in terms of explanatory variables.
To keep the analysis simple, we will assume that there are only two possibilities. Either Y
depends only on X2, or it depends on both X2 and X3.
If Y depends only on X2, and we fit a simple regression model, we will not encounter any
problems, assuming of course that the regression model assumptions are valid.
Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the
multiple regression.
In this sequence we will examine the consequences of fitting a simple regression when the
true model is multiple.
In the next one we will do the opposite and examine the consequences of fitting a multiple
regression when the true model is simple.
Fitted model | True model: $Y = \beta_1 + \beta_2 X_2 + u$ | True model: $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$
$\hat{Y} = b_1 + b_2 X_2$ | Correct specification, no problems | Coefficients are biased (in general). Standard errors are invalid.
$\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$ | (left blank on this slide) | Correct specification, no problems
The omission of a relevant explanatory variable causes the regression coefficients to be
biased and the standard errors to be invalid.
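\[ Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u, \qquad \hat{Y} = b_1 + b_2 X_2 \]
\[ Y_i - \bar{Y} = (\beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i) - (\beta_1 + \beta_2 \bar{X}_2 + \beta_3 \bar{X}_3 + \bar{u}) \]
\[ \phantom{Y_i - \bar{Y}} = \beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u}) \]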
We will now derive the expression for the bias mathematically. It is convenient to start by
deriving an expression for the deviation of Yi about its sample mean. It can be expressed in
terms of the deviations of X2, X3, and u about their sample means.
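\[ b_2 = \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{2i} - \bar{X}_2)^2} \]
\[ \phantom{b_2} = \frac{\sum (X_{2i} - \bar{X}_2)\left[\beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u})\right]}{\sum (X_{2i} - \bar{X}_2)^2} \]
\[ \phantom{b_2} = \frac{\beta_2 \sum (X_{2i} - \bar{X}_2)^2 + \beta_3 \sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3) + \sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2} \]
\[ \phantom{b_2} = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + \frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2} \]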
Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only.
The slope coefficient is therefore as shown.
We substitute for the Y deviations and simplify.
Hence we have demonstrated that b2 has three components.
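\[ b_2 = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + \frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2} \]
\[ E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} + E\left[\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right] \]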
To investigate biasedness or unbiasedness, we take the expected value of b2. The first two
terms are unaffected because they contain no random components. Thus we focus on the
expectation of the error term.
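\[ E\left[\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}\right] = \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} E\left[\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})\right] \]
\[ \phantom{E} = \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} \sum E\left[(X_{2i} - \bar{X}_2)(u_i - \bar{u})\right] \]
\[ \phantom{E} = \frac{1}{\sum (X_{2i} - \bar{X}_2)^2} \sum (X_{2i} - \bar{X}_2)\, E(u_i - \bar{u}) \]
\[ \phantom{E} = 0 \]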
X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be
taken outside the expression for the expectation.
In the numerator the expectation of a sum is equal to the sum of the expectations (first
expected value rule).
In each product, the factor involving X2 may be taken out of the expectation because X2 is
nonstochastic.
By Assumption A.3, the expected value of u is 0. It follows that the expected value of the
sample mean of u is also 0. Hence the expected value of the error term is 0.
\[ E(b_2) = \beta_2 + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} \]
Thus we have shown that the expected value of b2 is equal to the true value plus a bias
term. Note: the definition of a bias is the difference between the expected value of an
estimator and the true value of the parameter being estimated.
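The bias expression has an exact in-sample counterpart that can be checked numerically. The following is a minimal sketch in Python with NumPy on simulated data; the data-generating values, the seed, and the helper name `ovb_decomposition` are illustrative assumptions, not part of the lecture. It fits the simple and the multiple regression and confirms that the simple-regression slope equals the multiple-regression b2 plus b3 times the ratio that appears in the bias term.

```python
import numpy as np

def ovb_decomposition(y, x2, x3):
    """Return (short_slope, long_b2, long_b3, d).

    d is sum((x2 - mean)(x3 - mean)) / sum((x2 - mean)^2),
    the ratio that multiplies beta_3 in the bias term.
    """
    n = len(y)
    X_long = np.column_stack([np.ones(n), x2, x3])   # Y on X2 and X3
    X_short = np.column_stack([np.ones(n), x2])      # Y on X2 only
    b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]
    b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
    dx2 = x2 - x2.mean()
    dx3 = x3 - x3.mean()
    d = (dx2 @ dx3) / (dx2 @ dx2)
    return b_short[1], b_long[1], b_long[2], d

# Simulated data (assumed values): X3 is positively related to X2
# and has a positive coefficient, so the bias should be upward.
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

short, b2, b3, d = ovb_decomposition(y, x2, x3)
print("simple-regression slope:", short)
print("b2 + b3 * d            :", b2 + b3 * d)
```

The two printed numbers agree up to floating-point error, because the decomposition is an algebraic property of least squares; taking expectations of it gives the bias formula on the slide.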
As a consequence of the misspecification, the standard errors, t tests and F test are invalid.
\[ S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + u \]

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------
We will illustrate the bias using an educational attainment model. To keep the analysis
simple, we will assume that in the true model S (years of schooling) depends only on ASVABC
(test score) and SM (mother's years of schooling). The output above shows the corresponding
regression using EAEF Data Set 21.
\[ E(b_2) = \beta_2 + \beta_3 \frac{\sum (ASVABC_i - \overline{ASVABC})(SM_i - \overline{SM})}{\sum (ASVABC_i - \overline{ASVABC})^2} \]
We will run the regression a second time, omitting SM. Before we do this, we will try to
predict the direction of the bias in the coefficient of ASVABC.
It is reasonable to suppose, as a matter of common sense, that $\beta_3$ is positive. This
assumption is strongly supported by the fact that its estimate in the multiple regression is
positive and highly significant.
. cor SM ASVABC
(obs=540)

        |       SM   ASVABC
--------+------------------
     SM |   1.0000
 ASVABC |   0.4202   1.0000
The correlation between ASVABC and SM is positive, so the numerator of the bias term
must be positive. The denominator is automatically positive since it is a sum of squares
and there is some variation in ASVABC. Hence the bias should be positive.
. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------
Here is the regression omitting SM.
As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the
difference may be due to pure chance, but part is attributable to the bias.
. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3130793   .0348012     9.00   0.000     .2447165    .3814422
       _cons |   10.04688   .4147121    24.23   0.000     9.232226    10.86153
------------------------------------------------------------------------------

\[ E(b_3) = \beta_3 + \beta_2 \frac{\sum (ASVABC_i - \overline{ASVABC})(SM_i - \overline{SM})}{\sum (SM_i - \overline{SM})^2} \]
Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards
biased. We anticipate that $\beta_2$ is positive and we know that both the numerator and the
denominator of the other factor in the bias expression are positive.
In this case the bias is quite dramatic. The coefficient of SM has more than doubled. The
reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC,
while $\beta_2$ and $\beta_3$ are similar in size, judging by their estimates.
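The size of the two shifts can be checked against the reported output. In any sample, a simple-regression coefficient equals the corresponding multiple-regression coefficient plus the other multiple-regression coefficient times the slope from regressing the omitted variable on the included one; this is a standard algebraic property of least squares rather than something stated on the slide. The sketch below uses only the coefficients and the correlation printed above; the variable names are illustrative. It backs out the two implied auxiliary slopes and checks that their product matches the squared correlation, as it should.

```python
# Coefficients copied from the Stata outputs above.
b2_long, b3_long = 0.1328069, 0.1235071   # reg S ASVABC SM
b2_short = 0.148084                       # reg S ASVABC (SM omitted)
b3_short = 0.3130793                      # reg S SM (ASVABC omitted)
r = 0.4202                                # cor SM ASVABC

# short = long + (other long coefficient) * (auxiliary slope),
# so each auxiliary slope can be solved for directly.
slope_sm_on_asvabc = (b2_short - b2_long) / b3_long
slope_asvabc_on_sm = (b3_short - b3_long) / b2_long

# The product of the two auxiliary slopes equals r squared.
product = slope_sm_on_asvabc * slope_asvabc_on_sm

print("implied slope of SM on ASVABC:", round(slope_sm_on_asvabc, 4))
print("implied slope of ASVABC on SM:", round(slope_asvabc_on_sm, 4))
print("product vs r^2:", round(product, 4), round(r ** 2, 4))
```

The implied slope of ASVABC on SM is more than ten times the implied slope of SM on ASVABC, which is the numerical face of the remark that SM varies much less than ASVABC.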
Finally, we will investigate how $R^2$ behaves when a variable is omitted. In the simple
regression of S on ASVABC, $R^2$ is 0.34, and in the simple regression of S on SM it is 0.13.
Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because
the multiple regression reveals that their joint explanatory power is 0.35, not 0.47.
In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its
apparent explanatory power. Similarly, in the third regression, SM is partly acting as a
proxy for ASVABC, again inflating its apparent explanatory power.
\[ LGEARN = \beta_1 + \beta_2 S + \beta_3 EXP + u \]

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------
However, it is also possible for omitted variable bias to lead to a reduction in the apparent
explanatory power of a variable. This will be demonstrated using a simple earnings
function model, supposing the logarithm of hourly earnings to depend on S and EXP.
. cor S EXP
(obs=540)

        |        S      EXP
--------+------------------
      S |   1.0000
    EXP |  -0.2179   1.0000

\[ E(b_2) = \beta_2 + \beta_3 \frac{\sum (S_i - \bar{S})(EXP_i - \overline{EXP})}{\sum (S_i - \bar{S})^2} \]
If we omit EXP from the regression, the coefficient of S should be subject to a downward
bias. $\beta_3$ is likely to be positive. The numerator of the other factor in the bias term is
negative since S and EXP are negatively correlated. The denominator is positive.
\[ E(b_3) = \beta_3 + \beta_2 \frac{\sum (EXP_i - \overline{EXP})(S_i - \bar{S})}{\sum (EXP_i - \overline{EXP})^2} \]
For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP
should be downwards biased.
. reg LGEARN S
------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1096934   .0092691    11.83   0.000     .0914853    .1279014
       _cons |   1.292241   .1287252    10.04   0.000     1.039376    1.545107
------------------------------------------------------------------------------

. reg LGEARN EXP
------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .0202708   .0056564     3.58   0.000     .0091595     .031382
       _cons |    2.44941   .0988233    24.79   0.000     2.255284    2.643537
------------------------------------------------------------------------------
As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.
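As a worked check of the downward shifts, the reported coefficients can be plugged into the in-sample identity "simple coefficient = multiple coefficient + other multiple coefficient × auxiliary slope" (a standard least-squares result, used here as an assumption rather than quoted from the slides). The symbols $\hat{d}$ for the implied auxiliary slopes are introduced only for this illustration.

```latex
\begin{align*}
b_S^{\text{simple}} - b_S^{\text{multiple}} &= 0.1096934 - 0.1235911 = -0.0138977 \\
\hat{d}_{EXP \text{ on } S} &= \frac{-0.0138977}{0.0350826} \approx -0.396 \\[4pt]
b_{EXP}^{\text{simple}} - b_{EXP}^{\text{multiple}} &= 0.0202708 - 0.0350826 = -0.0148118 \\
\hat{d}_{S \text{ on } EXP} &= \frac{-0.0148118}{0.1235911} \approx -0.120 \\[4pt]
\hat{d}_{EXP \text{ on } S} \cdot \hat{d}_{S \text{ on } EXP} &\approx 0.0475 \approx r_{S,EXP}^2 = (-0.2179)^2
\end{align*}
```

Both implied slopes are negative, matching the negative correlation between S and EXP, and their product reproduces the squared correlation to rounding accuracy.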
. reg LGEARN S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  140.05
       Model |  38.5643833     1  38.5643833           Prob > F      =  0.0000
    Residual |   148.14326   538  .275359219           R-squared     =  0.2065
-------------+------------------------------           Adj R-squared =  0.2051
       Total |  186.707643   539   .34639637           Root MSE      =  .52475

. reg LGEARN EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   12.84
       Model |  4.35309315     1  4.35309315           Prob > F      =  0.0004
    Residual |   182.35455   538  .338948978           R-squared     =  0.0233
-------------+------------------------------           Adj R-squared =  0.0215
       Total |  186.707643   539   .34639637           Root MSE      =  .58219
A comparison of $R^2$ for the three regressions shows that the sum of $R^2$ in the simple
regressions is actually less than $R^2$ in the multiple regression.
This is because the apparent explanatory power of S in the second regression has been
undermined by the downwards bias in its coefficient. The same is true for the apparent
explanatory power of EXP in the third equation.
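For reference, the arithmetic behind the two $R^2$ comparisons, using only the values reported in the Stata outputs, is:

```latex
\begin{align*}
\text{Schooling model: } & R^2_{S \mid ASVABC} + R^2_{S \mid SM} = 0.3376 + 0.1308 = 0.4684 > 0.3543 = R^2_{S \mid ASVABC,\,SM} \\
\text{Earnings model: }  & R^2_{LGEARN \mid S} + R^2_{LGEARN \mid EXP} = 0.2065 + 0.0233 = 0.2298 < 0.2731 = R^2_{LGEARN \mid S,\,EXP}
\end{align*}
```

With positively correlated regressors the simple-regression $R^2$ values overstate the separate contributions; with negatively correlated regressors they understate them.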
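Misspecified independent variables: omission
• The correct regression model is:
• $Y = \alpha + \beta X + \gamma Z + \varepsilon$
• An important explanatory variable Z is left out:
• $Y = \alpha' + \beta' X + \varepsilon'$
Let $\hat{\beta}'$ be the regression coefficient from the second equation. $\hat{\beta}'$ is no longer an unbiased estimator of $\beta$:
\[ E(\hat{\beta}') = \beta + \gamma \frac{\sum xz}{\sum x^2} \neq \beta \]
(lowercase x and z denote deviations of X and Z from their means)
If $\gamma$ (the relationship between Y and Z) and $\sum xz$ (the relationship between X and Z) have the same sign, $\hat{\beta}'$ is biased upward;
if they have opposite signs, it is biased downward.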
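Misspecified independent variables: omission
\[ \hat{\beta}' = \hat{\beta} + \hat{\gamma} \frac{\sum xz}{\sum x^2} = \hat{\beta} + \hat{\gamma}\, b_{ZX} \]
$\hat{\beta}'$ equals $\hat{\beta}$, the regression coefficient of Y on X after controlling for Z, plus
the effect of Z on Y ($\hat{\gamma}$) multiplied by the effect of X on Z, $b_{ZX}$.
$\hat{\beta}'$ is the total effect.
$\hat{\beta}$ is the direct effect of X on Y.
$\hat{\gamma}\, b_{ZX}$ is the indirect effect: X affects Z, which in turn affects Y.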
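Misspecified independent variables: omission
[Path diagram: X → Y is the direct effect, $\hat{\beta}$; X → Z has coefficient $b_{ZX}$; Z → Y has coefficient $\hat{\gamma}$; the indirect effect is $b_{ZX} \cdot \hat{\gamma}$.]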
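Misspecified independent variables: omission
If Z is unrelated to Y (that is, $\gamma = 0$)
or Z is unrelated to X ($\sum xz = 0$),
then $\hat{\beta}' = \hat{\beta}$ and $E(\hat{\beta}') = \beta$.
If Z affects both Y and X (that is, $\gamma \neq 0$ and $\sum xz \neq 0$) and Z is left out of the regression
equation, the estimated coefficient of X on Y will also contain the influence that runs from X to Z
and from Z to Y, so the true direct effect of X on Y cannot be estimated.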
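Misspecified independent variables: omission
The residual variance estimated from the regression model that omits Z,
\[ S^2_{Y|X} = \frac{\sum (Y - \hat{\alpha}' - \hat{\beta}' X)^2}{n - 2}, \]
is no longer an unbiased estimator of the population variance $\sigma^2$. We can show that
\[ E(S^2_{Y|X}) = \sigma^2 + \frac{\gamma^2 \sum z^2}{n - 2}. \]
$S^2_{Y|X}$ is an upward-biased estimator of $\sigma^2$; unless $\gamma = 0$, statistical inference will be wrong.
When $\sum xz = 0$, $\hat{\beta}'$ is an unbiased estimator, but the expression above shows that $S^2_{Y|X}$ is still
biased (unless $\gamma = 0$), so statistical inference is still in error.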
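The upward bias of the residual variance can be illustrated with a small Monte Carlo experiment. The following is a sketch under stated assumptions, not part of the lecture: the parameter values, the seed, and the variable names are all made up for illustration, the regressors are held fixed across replications, and Z is constructed so that $\sum xz = 0$ exactly. In that special case the omitted-variable slope is unbiased, yet the average of $S^2_{Y|X}$ should sit near $\sigma^2 + \gamma^2 \sum z^2 / (n - 2)$ rather than $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, gamma, beta = 50, 1.0, 0.8, 1.5   # assumed illustrative values

# Fixed regressors. z is residualised on a constant and x,
# so sum(z) = 0 and sum(x * z) = 0 hold exactly.
x = rng.normal(size=n)
x = x - x.mean()
A = np.column_stack([np.ones(n), x])        # design matrix that omits Z
z_raw = rng.normal(size=n)
z = z_raw - A @ np.linalg.lstsq(A, z_raw, rcond=None)[0]

slopes, s2_values = [], []
for _ in range(4000):
    y = 2.0 + beta * x + gamma * z + rng.normal(scale=sigma, size=n)
    coef = np.linalg.lstsq(A, y, rcond=None)[0]   # regress Y on X only
    resid = y - A @ coef
    slopes.append(coef[1])
    s2_values.append(resid @ resid / (n - 2))     # S^2_{Y|X}

mean_slope = float(np.mean(slopes))
mean_s2 = float(np.mean(s2_values))
expected_s2 = float(sigma ** 2 + gamma ** 2 * (z @ z) / (n - 2))

print("mean slope (true beta = 1.5):", mean_slope)
print("mean S^2_{Y|X}              :", mean_s2)
print("sigma^2 + gamma^2*sum(z^2)/(n-2):", expected_s2)
```

Because the standard error of the slope is built from $S^2_{Y|X}$, an inflated variance estimate distorts the t statistic even in this case where the coefficient itself is unbiased.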
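Misspecified independent variables: omission
\[ V(\hat{\beta}') = \frac{\sigma^2}{\sum x^2} \;\le\; V(\hat{\beta}) = \frac{\sigma^2 \sum z^2}{\sum x^2 \sum z^2 - (\sum xz)^2} = \frac{\sigma^2}{\sum x^2 (1 - r^2_{XZ})} \]
$\sigma^2$ is usually unknown, so $S^2_{\hat{\beta}'}$ and $S^2_{\hat{\beta}}$ must be used for statistical inference, and it cannot be determined which of the two is larger:
\[ S^2_{\hat{\beta}'} = \frac{S^2_{Y|X}}{\sum x^2}, \qquad S^2_{\hat{\beta}} = \frac{S^2_{Y|XZ}}{\sum x^2 (1 - r^2_{XZ})} \]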
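Misspecified independent variables: omission
Although
\[ \frac{1}{\sum x^2} \le \frac{1}{\sum x^2 (1 - r^2_{XZ})}, \]
$S^2_{Y|XZ}$ may be smaller than $S^2_{Y|X}$ (since $E(S^2_{Y|XZ}) \le E(S^2_{Y|X})$),
so the relative size of $S^2_{\hat{\beta}'}$ and $S^2_{\hat{\beta}}$ cannot be known for certain. In addition, $\hat{\beta}'$ and $\hat{\beta}$ differ in value, so the two t values
$\hat{\beta}' / S_{\hat{\beta}'}$ and $\hat{\beta} / S_{\hat{\beta}}$ are not equal. $\hat{\beta}' / S_{\hat{\beta}'}$ is the wrong t value and leads to incorrect statistical inference.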
14.1.2
Model Specification Errors: Including Irrelevant IVs
VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE
In this sequence we will investigate the consequences of including an irrelevant variable in
a regression model.
Consequences of Variable Misspecification
Fitted model | True model: $Y = \beta_1 + \beta_2 X_2 + u$ | True model: $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$
$\hat{Y} = b_1 + b_2 X_2$ | Correct specification, no problems | Coefficients are biased (in general). Standard errors are invalid.
$\hat{Y} = b_1 + b_2 X_2 + b_3 X_3$ | Coefficients are unbiased (in general), but inefficient. Standard errors are valid (in general). | Correct specification, no problems
The effects are different from those of omitted variable misspecification. In this case the
coefficients in general remain unbiased, but they are inefficient. The standard errors remain
valid, but are needlessly large.
\[ Y = \beta_1 + \beta_2 X_2 + u \]
\[ \hat{Y} = b_1 + b_2 X_2 + b_3 X_3 \]
These results can be demonstrated quickly.
\[ Y = \beta_1 + \beta_2 X_2 + 0 \cdot X_3 + u \]
Rewrite the true model adding X3 as an explanatory variable, with a coefficient of 0. Now the
true model and the fitted model coincide. Hence b2 will be an unbiased estimator of $\beta_2$ and
b3 will be an unbiased estimator of 0.
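A small Monte Carlo experiment can make the "unbiased but inefficient" claim concrete. The sketch below is illustrative only: the parameter values, the seed, the 0.8 loading that makes X3 correlated with X2, and all variable names are assumptions, and the regressors are held fixed across replications. The true coefficient of X3 is 0, and the experiment compares the sampling distribution of b2 with and without X3 in the fitted model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta1, beta2 = 100, 1.0, 2.0   # assumed illustrative values

# Fixed regressors: x3 is correlated with x2 but irrelevant for y.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)
X_true = np.column_stack([np.ones(n), x2])        # correctly specified
X_over = np.column_stack([np.ones(n), x2, x3])    # includes irrelevant x3

b2_true, b2_over, b3_over = [], [], []
for _ in range(3000):
    y = beta1 + beta2 * x2 + rng.normal(size=n)   # x3 has coefficient 0
    b2_true.append(np.linalg.lstsq(X_true, y, rcond=None)[0][1])
    coef = np.linalg.lstsq(X_over, y, rcond=None)[0]
    b2_over.append(coef[1])
    b3_over.append(coef[2])

mean_b2_true, sd_b2_true = float(np.mean(b2_true)), float(np.std(b2_true))
mean_b2_over, sd_b2_over = float(np.mean(b2_over)), float(np.std(b2_over))
mean_b3_over = float(np.mean(b3_over))

print("mean of b2 (without, with x3):", mean_b2_true, mean_b2_over)
print("sd of b2   (without, with x3):", sd_b2_true, sd_b2_over)
print("mean of b3:", mean_b3_over)
```

Both versions of b2 average close to the true value of 2 and b3 averages close to 0, but the spread of b2 is visibly larger when the correlated, irrelevant X3 is included.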
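Including an irrelevant independent variable
• The correct regression model is:
• $Y = \alpha + \beta X + \varepsilon$
• An irrelevant explanatory variable Z is included:
• $Y = \alpha^* + \beta^* X + \gamma^* Z + \varepsilon^*$
If $\hat{\beta}'$ is estimated from the wrong model, $\hat{\beta}'$ is still an unbiased estimator of $\beta$.
Proof that $\hat{\beta}'$ is an unbiased estimator of $\beta$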
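\[ \hat{\beta}' = \frac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 - (\sum xz)^2} \]
\[ y = Y - \bar{Y} = (\alpha + \beta X + \varepsilon) - (\alpha + \beta \bar{X} + \bar{\varepsilon}) = \beta (X - \bar{X}) + (\varepsilon - \bar{\varepsilon}) = \beta x + \varepsilon' \]
The numerator of the first expression is
\[ \sum z^2 \sum xy - \sum xz \sum zy = \sum z^2 \sum x(\beta x + \varepsilon') - \sum xz \sum z(\beta x + \varepsilon') \]
\[ = \sum z^2 \left(\beta \sum x^2 + \sum x\varepsilon'\right) - \sum xz \left(\beta \sum xz + \sum z\varepsilon'\right) \]
\[ = \beta \left[\sum x^2 \sum z^2 - (\sum xz)^2\right] + \left[\sum z^2 \sum x\varepsilon' - \sum xz \sum z\varepsilon'\right] \]
Bringing back the denominator:
\[ \hat{\beta}' = \beta + \frac{\sum z^2 \sum x\varepsilon' - \sum xz \sum z\varepsilon'}{\sum x^2 \sum z^2 - (\sum xz)^2} \]
\[ \sum x\varepsilon' = \sum x(\varepsilon - \bar{\varepsilon}) = \sum x\varepsilon - \bar{\varepsilon} \sum x = \sum x\varepsilon - \bar{\varepsilon} \sum (X - \bar{X}) = \sum x\varepsilon - 0 \]
The same argument gives $\sum z\varepsilon' = \sum z\varepsilon$. With X and Z nonstochastic and $E(\varepsilon) = 0$, both sums have expectation 0, so
$E(\hat{\beta}') = \beta$; hence $\hat{\beta}'$ is unbiased.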
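Misspecified independent variables: including an irrelevant variable
• The correct regression model is:
• $Y = \alpha + \beta X + \varepsilon$
• An irrelevant explanatory variable Z is included:
• $Y = \alpha^* + \beta^* X + \gamma^* Z + \varepsilon^*$
$\hat{\beta}^*$ is an unbiased estimator of $\beta$.
Including the irrelevant variable Z makes
\[ S^2_{\hat{\beta}^*} = \frac{S^2_{Y|XZ} \sum z^2}{\sum x^2 \sum z^2 (1 - r^2_{XZ})} = \frac{S^2_{Y|XZ}}{\sum x^2 (1 - r^2_{XZ})} \;\ge\; S^2_{\hat{\beta}} = \frac{S^2_{Y|X}}{\sum x^2} \qquad \left(\text{since } \frac{1}{\sum x^2 (1 - r^2_{XZ})} \ge \frac{1}{\sum x^2}\right) \]
The t value of $\hat{\beta}^*$ will be understated, so one may wrongly conclude that X has no effect on Y. The statistical test is then in error, which affects the conclusions.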
. reg LGFDHO LGEXP LGSIZE

  Source |       SS       df       MS                  Number of obs =     868
---------+------------------------------               F(  2,   865) =  460.92
   Model |  138.776549     2  69.3882747               Prob > F      =  0.0000
Residual |  130.219231   865  .150542464               R-squared     =  0.5159
---------+------------------------------               Adj R-squared =  0.5148
   Total |  268.995781   867  .310260416               Root MSE      =    .388

------------------------------------------------------------------------------
  LGFDHO |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   LGEXP |   .2866813   .0226824     12.639   0.000       .2421622    .3312003
  LGSIZE |   .4854698   .0255476     19.003   0.000       .4353272    .5356124
   _cons |   4.720269   .2209996     21.359   0.000       4.286511    5.154027
------------------------------------------------------------------------------
The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual
household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household
expenditure, and LGSIZE, the logarithm of the number of persons in the household.
10
The source of the data was the 1995 US Consumer Expenditure Survey. The sample size was 868.
. reg LGFDHO LGEXP LGSIZE LGHOUS

  Source |       SS       df       MS              Number of obs =     868
---------+------------------------------           F(  3,   864) =  307.22
   Model |  138.841976     3  46.2806586           Prob > F      =  0.0000
Residual |  130.153805   864  .150640978           R-squared     =  0.5161
---------+------------------------------           Adj R-squared =  0.5145
   Total |  268.995781   867  .310260416           Root MSE      =  .38812

------------------------------------------------------------------------------
  LGFDHO |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   LGEXP |   .2673552   .0370782      7.211   0.000       .1945813     .340129
  LGSIZE |   .4868228   .0256383     18.988   0.000       .4365021    .5371434
  LGHOUS |   .0229611   .0348408      0.659   0.510      -.0454214    .0913436
   _cons |   4.708772   .2217592     21.234   0.000       4.273522    5.144022
------------------------------------------------------------------------------

Now add LGHOUS, the logarithm of annual expenditure on housing services. It is safe to
assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not
significantly different from zero.
. cor LGHOUS LGEXP LGSIZE
(obs=869)

        |   LGHOUS    LGEXP   LGSIZE
--------+---------------------------
  LGHOUS|   1.0000
   LGEXP|   0.8137   1.0000
  LGSIZE|   0.3256   0.4491   1.0000

It is however highly correlated with LGEXP (correlation coefficient 0.81), and also, to a
lesser extent, with LGSIZE (correlation coefficient 0.33).
Its inclusion does not cause the coefficients of those variables to be biased.
But it does increase their standard errors, particularly that of LGEXP, as you would expect,
reflecting the loss of efficiency.
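The size of the increase in the LGEXP standard error can be roughly reproduced from the correlation matrix alone. The sketch below is an illustration, not part of the original slides: it uses the textbook relation that a coefficient's standard error is proportional to the root MSE times $\sqrt{1/(1 - R_j^2)}$, where $R_j^2$ is the R-squared from regressing that regressor on the other regressors, and the standard formula for that R-squared with two other regressors. All numbers are copied from the Stata output above.

```python
import math

# Correlations and standard errors copied from the Stata output in the slides.
r_exp_size = 0.4491   # corr(LGEXP, LGSIZE)
r_exp_hous = 0.8137   # corr(LGEXP, LGHOUS)
r_size_hous = 0.3256  # corr(LGSIZE, LGHOUS)
se_before, se_after = 0.0226824, 0.0370782  # Std. Err. of LGEXP
rmse_before, rmse_after = 0.388, 0.38812    # Root MSE of each regression

# R^2 of LGEXP regressed on LGSIZE alone (model without LGHOUS).
r2_before = r_exp_size ** 2

# R^2 of LGEXP regressed on LGSIZE and LGHOUS (model with LGHOUS).
r2_after = (r_exp_size ** 2 + r_exp_hous ** 2
            - 2 * r_exp_size * r_exp_hous * r_size_hous) / (1 - r_size_hous ** 2)

vif_before = 1 / (1 - r2_before)
vif_after = 1 / (1 - r2_after)

# Standard error is proportional to root MSE * sqrt(VIF) for a fixed sample.
predicted_ratio = (rmse_after / rmse_before) * math.sqrt(vif_after / vif_before)
observed_ratio = se_after / se_before

print("predicted SE ratio:", round(predicted_ratio, 3))
print("observed SE ratio: ", round(observed_ratio, 3))
```

Both ratios come out at about 1.63. The small remaining gap is consistent with rounding in the printed correlations and with the correlation table being based on 869 observations rather than 868.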
14.2
Multicollinearity and Other Problems
Does the dependent variable affect
any of the independent variables?
• Causal direction
• Reverse causation
• Consequences
– Every coefficient in the regression model may be biased.
– It’s hard to design a study that will adequately solve this
problem.
• Temporal order helps us clarify the causal direction.
• Common sense
• A good deal of uncertainty still remains
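Multicollinearity
• Two or more independent variables are highly linearly correlated with one another.
• For example, when studying how hours of housework relate to personal income and education,
personal income and educational attainment may themselves be linearly related.
• It becomes hard to separate the effect of each IV on the DV, because the IVs vary together:
when one IV changes, the other changes with it.
• Perfect multicollinearity
• Near multicollinearity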
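Perfect multicollinearity
• Any one IV can be written as a linear combination of the other IVs.
• For example, with two independent variables X1 and X2: $X_1 = a + b X_2$
• The OLS estimator (OLSE) cannot be computed. For the model $Y = \alpha + \beta X + \gamma Z + \varepsilon$, least squares gives
$$\hat{\beta} = \frac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 - (\sum xz)^2} = \frac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 (1 - r^2_{XZ})}$$
• If X and Z are perfectly collinear, then $r^2_{XZ} = 1$, the denominator above is 0, and there is no solution.
• Remedy: write Z as a linear function of X and substitute it into the original equation.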
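Near multicollinearity
• Least squares estimates can be inflated in a given sample and land far from the true values
(large sampling error; the estimator itself remains unbiased, as the next point notes).
• The variance of the estimator becomes large and its reliability low. The estimated regression
coefficients are still unbiased, but their efficiency is low.
• Statistical inference becomes unreliable: it is very hard to reject the null hypothesis
(so the result can be regarded as a conservative estimate).
• The regression coefficients are very sensitive to the sample: a small change in the sample
can produce a large change in the coefficients.
• OLS cannot estimate the individual coefficients precisely. For $Y = \alpha + \beta X + \gamma Z + \varepsilon$, least squares gives
$$\hat{\beta} = \frac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 - (\sum xz)^2} = \frac{\sum z^2 \sum xy - \sum xz \sum zy}{\sum x^2 \sum z^2 (1 - r^2_{XZ})}$$
When X and Z are nearly collinear:
1. $r^2_{XZ}$ is large and close to 1, so the denominator is close to 0 and $\hat{\beta}$ tends to be inflated.
2. $V(\hat{\beta}) = \dfrac{\sigma^2 \sum z^2}{\sum x^2 \sum z^2 (1 - r^2_{XZ})}$ also becomes large, which means the value of $\hat{\beta}$ is very unstable.
3. In statistical inference the t values come out non-significant, so the null hypothesis cannot be rejected and hypotheses are hard to verify.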
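Diagnosing near multicollinearity
• The F value is significant but the individual t values are not. That is, the regression model as a whole has
joint explanatory power, but it cannot show the effect of each individual IV on the DV.
• Compute the correlation coefficients among the IVs.
If $r^2_{x_i x_j} < R^2$ (where $R^2$ is the coefficient of determination of the original regression), we can generally
conclude that there is no multicollinearity.
• Use the variance inflation factor (VIF):
$$VIF_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of determination from regressing the i-th explanatory variable on the other IVs, also
called the auxiliary $R^2$.
If $VIF_i > 10$ (Paul Allison suggests $> 2.5$), we conclude that there is a multicollinearity problem.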
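With only two independent variables, the auxiliary $R^2$ is simply the squared correlation between them, so the VIF can be computed by hand. The following is a minimal sketch under that two-regressor assumption. The function names and the income and education values are hypothetical and are not the survey data used in the lecture.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x, z):
    """VIF = 1 / (1 - R_i^2); with two IVs the auxiliary R^2 equals r_xz squared."""
    r2 = pearson_r(x, z) ** 2
    return 1.0 / (1.0 - r2)


# Hypothetical values for two IVs (not the lecture's data).
income = [1, 2, 3, 4, 5]
education = [2, 1, 4, 3, 5]

vif = vif_two_predictors(income, education)
print("VIF:", round(vif, 4))
print("exceeds Allison's 2.5 cutoff:", vif > 2.5)
print("exceeds the conventional 10 cutoff:", vif > 10)
```

Here the correlation is 0.8, so the VIF is 1 / (1 - 0.64), about 2.78: above Allison's stricter cutoff but well below 10. With more than two IVs, each auxiliary $R^2$ has to come from a full regression of one IV on all the others.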
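Stata & Multicollinearity
• Stata can compute the VIF, but the regression has to be run first.
• The DV is again hours of housework from the Social Change Survey of ROC year 85 (1996); the IVs are
income and years of education.
Stata & Multicollinearity
(Stata VIF output shown on the slide.)
So multicollinearity is not a serious problem here.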
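Data and multicollinearity
• The following types of data are more prone to multicollinearity:
• Time-series data
• Panel studies
• Aggregate-level data
– Individual-level differences cancel one another out at the aggregate level
How to deal with multicollinearity?
• There is no simple solution.
• Use prior information about the relationships among the IVs or among the regression coefficients and
substitute it into the equation.
• Drop the less important IVs, but watch out for specification
error.
• Increase the sample size.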
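The homoscedasticity assumption
• We usually assume that the variance of the error term in the regression equation
is a fixed constant, that is, $V(\varepsilon_i) = \sigma_i^2 = \sigma^2$
• In many cases this condition does not hold.
• For example, consumption expenditure varies more among high-income earners than among low-income
earners.
• This is heteroscedasticity.
Heteroscedasticity
• The variance of the error term in the regression equation is not a fixed
constant.
• $V(\varepsilon_i) = \sigma_i^2 \neq \sigma^2$
• It often occurs in cross-sectional data.
• The OLS estimator (OLSE) is still unbiased,
• but it is no longer the best linear unbiased estimator (BLUE).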
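Checking for heteroscedasticity: the White test
1. Use OLS to obtain the estimated regression equation
$\hat{Y} = \hat{\alpha} + \hat{\beta} X + \hat{\gamma} Z$
2. Compute the estimated residuals
$e_i = Y_i - \hat{Y}_i$
3. Estimate the following regression equation and its coefficient of determination $R^2$:
$e^2 = a_0 + a_1 X + a_2 X^2 + a_3 Z + a_4 Z^2 + a_5 XZ$
4. Compute $nR^2$ (n is the sample size). This statistic follows a chi-square distribution with $P - 1$ degrees of freedom,
where P counts the parameters of the $e^2$ regression above including the intercept, so the degrees of freedom equal the
number of regressors in that regression (5 here). The test is right-tailed.
If $nR^2 > \chi^2_{P-1}$, reject the null hypothesis $H_0$: the variance is homoscedastic.
If $nR^2 \le \chi^2_{P-1}$, do not reject the null hypothesis $H_0$: the variance is homoscedastic.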
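The steps of the White test can be sketched in plain Python. To keep the example short it covers the one-regressor case, where the auxiliary regression has only X and X squared (2 degrees of freedom), rather than the two-regressor case on the slide. The helper names `solve`, `ols` and `white_statistic` and the simulated data are assumptions for illustration only; 5.991 is the 5% critical value of the chi-square distribution with 2 degrees of freedom.

```python
import random


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def ols(columns, y):
    """OLS of y on an intercept plus the given columns; returns (coef, residuals, R^2)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    sse = sum(e * e for e in resid)
    return coef, resid, 1.0 - sse / sst


def white_statistic(x, y):
    """n * R^2 from regressing squared OLS residuals on x and x^2."""
    _, resid, _ = ols([x], y)                        # steps 1 and 2
    e2 = [e * e for e in resid]
    _, _, r2_aux = ols([x, [v * v for v in x]], e2)  # step 3
    return len(y) * r2_aux                           # step 4


# Simulated data (an assumption): the error spread grows with x.
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(300)]
y = [1.0 + 2.0 * v + v * rng.gauss(0, 1) for v in x]

stat = white_statistic(x, y)
print("n * R^2 =", round(stat, 2), "reject H0 at 5%:", stat > 5.991)
```

Because the simulated error standard deviation is proportional to x, the statistic should land well above the critical value. With two regressors X and Z, the auxiliary regression would also need the Z, Z squared and XZ columns, and the degrees of freedom would be 5.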
Income and consumption, p.389
• Enter the data by hand in the Data Editor.
Income and consumption, p.389
• The Stata version used here has no built-in White test; what it offers is another
test, Cook-Weisberg.
• As before, the regression has to be run first.
Stata and Heteroscedasticity
(Stata output of the Cook-Weisberg test shown on the slide.)
So we reject the null hypothesis, which amounts to saying the variance is heteroscedastic.
White Test
• Help > Search > Search All > type "white test" in the box,
then click through to the result.
• help whitetst
(Stata output of whitetst shown on the slide.)
So we reject the null hypothesis, which amounts to saying the variance is heteroscedastic.
Judging from plots
(Residual plots shown on the slides.)
The larger X is, the larger the residuals.
Consequences of heteroscedasticity
• Inefficiency: least squares estimates no longer have minimum standard errors, which means you
can do better using alternative methods. OLS is not optimal under heteroscedasticity because it
gives equal weight to all observations when, in fact, observations with larger disturbance variance
contain less information than observations with smaller disturbance variance. The alternative is
weighted least squares, which gives greater weight to the observations with smaller variance.
• Biased standard errors: the standard errors reported by regression
programs are only estimates of the true standard errors, which
cannot be observed directly. If there is heteroscedasticity, these
standard error estimates can be seriously biased. That in turn leads
to bias in test statistics and confidence intervals.
• This is the more serious problem, because it leads to incorrect conclusions.
• It is easier to use robust standard errors here.
• Robust standard errors do not change the coefficient estimates and therefore
do not solve the inefficiency problem.
• But at least the test statistics will give us reasonably accurate p
values.
• They also require fewer assumptions.
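Dealing with heteroscedasticity
• Besides weighted least squares and
robust regression, another common way to
correct heteroscedasticity is to transform the DV,
for example by taking its logarithm or square root. These are called variance
stabilizing transformations. However, such a transformation also fundamentally
changes the relationship between the IVs and the DV, which makes the regression coefficients hard to
interpret.
• A better approach is to use generalized
linear models (GLM). In the income and consumption example above,
for instance, we would not assume normally distributed residuals but
model them with a gamma distribution, which gives a gamma GLM.
How to deal with heteroscedasticity?
• It has to be pretty severe before it leads to
serious bias in the standard errors.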
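The influence of outliers
• OLS is easily affected by outliers, especially
when the sample is not large.
• Many statistical techniques are available for checking the influence of each observation on the
regression model.
• They mostly ask what would happen to the model parameters if we deleted a given
case.
• The influence of a case depends on two things:
– how far the observed Y value of the case is from its predicted value
– how extreme the case is on the IVs (how far from the mean)
Studentized residual
• The first approach is to compute the residuals.
• The larger the residual, the farther the observation lies from the overall trend.
How far? We can standardize the residual to find out.
• This is called the studentized residual.
• Be careful when its absolute value is > 2.5.
Stata & Studentized residual
• Run the regression first, using the income and consumption example from before.
• predict newvar, rstudent
(newvar is a variable name of your own choosing)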
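Hat value or leverage
• How far a case's value on an IV is from the mean of that IV.
• The larger the hat value, the greater the weight of the case in computing the predicted value Y-hat,
and the greater its leverage.
• The mean of the hat values is p/n, where p is the number of parameters in the
model.
• Hat values get smaller as the sample gets larger.
• > 3p/n indicates high leverage.
hat value (leverage) & Stata
• predict newvar, hat
• In the income and consumption example the cutoff is > (3*2)/20 = 0.3 (p = 2, n = 20).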
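For a regression with one IV, the hat value has a closed form, $h_i = 1/n + (x_i - \bar{x})^2 / \sum (x - \bar{x})^2$, which makes the 3p/n rule easy to try out. The sketch below uses 20 made-up x values (an assumption, not the lecture's income data) so that the cutoff matches the (3*2)/20 = 0.3 on the slide.

```python
def hat_values(x):
    """Leverage of each case in a simple regression with an intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [1.0 / n + (v - xbar) ** 2 / sxx for v in x]


# Hypothetical IV values: 19 ordinary cases plus one extreme case.
x = list(range(1, 20)) + [60]

p = 2                      # parameters in the model: intercept and slope
n = len(x)
cutoff = 3 * p / n         # (3*2)/20 = 0.3

hats = hat_values(x)
flagged = [i for i, h in enumerate(hats) if h > cutoff]

print("sum of hat values:", round(sum(hats), 6))   # equals p
print("cutoff:", cutoff)
print("high-leverage cases (0-based index):", flagged)
```

Only the last case, whose x value is far from the mean, exceeds the cutoff. The hat values always add up to p, which is why their mean is p/n.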
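DFFITS
• Two diagnostic statistics are in common use: DFFITS and
DFBETAS.
• DFFITS: the effect on model fit of removing a case, that
is, how much the predicted value for each
observation would change.
DFFITS & Stata
• predict newvar, dfits
DFBETAS
• The change in a regression coefficient after the case is removed, divided by the standard error
of the estimate from the adjusted (case-deleted) data set.
• > 1 indicates that the case has a major influence.
DFBETAS & Stata
• predict newvar, dfbeta(selected IV)
(select each independent variable in turn; look for values > 1)
DFBETAS & Stata
• With the dfbeta command there is no need to specify the independent variable.
• There is only one IV here, so there is only one DF value. Note how Stata names the new variable.
Removing outliers and rerunning the regression
• reg consum income if abs(DFincome) < 1
• New regression equation:
consum = 23527.91 + 0.62 income
• Old regression equation (outliers not removed):
consum = 30948.47 + 0.54 income
• R-squared also increases (0.9589 -> 0.9785).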
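What DFBETAS measures can be made concrete by refitting the regression with each case left out in turn. The following is a minimal sketch for the slope of a one-IV regression. It uses the usual definition, in which the change in the slope is divided by the case-deleted residual standard deviation times $\sqrt{1/\sum (x - \bar{x})^2}$. The data are invented for illustration, with one deliberately planted outlier.

```python
import math


def fit(x, y):
    """Simple OLS; returns (intercept, slope, residual standard deviation)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, math.sqrt(sse / (n - 2))


def dfbetas_slope(x, y):
    """DFBETAS of the slope for every case, by explicit leave-one-out refits."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((a - xbar) ** 2 for a in x)   # from the full sample
    _, slope_all, _ = fit(x, y)
    result = []
    for i in range(n):
        x_del = x[:i] + x[i + 1:]
        y_del = y[:i] + y[i + 1:]
        _, slope_del, s_del = fit(x_del, y_del)
        result.append((slope_all - slope_del) / (s_del / math.sqrt(sxx)))
    return result


# Hypothetical data: y is roughly 2x, except for the last case.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
y = [2 * v + (0.5 if v % 2 else -0.5) for v in x[:-1]] + [10.0]

d = dfbetas_slope(x, y)
influential = [i for i, v in enumerate(d) if abs(v) > 1]
print("DFBETAS:", [round(v, 2) for v in d])
print("cases with |DFBETAS| > 1 (0-based index):", influential)
```

The planted case combines high leverage with a large residual, so its DFBETAS is far beyond the cutoff of 1. Refitting without such a case corresponds to the `if abs(DFincome) < 1` condition on the slide.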
Plot
(Plot of residuals against income shown on the slide.)
There appears to be a curvilinear relationship between
income and the residuals.
14.3
Nonlinear Relationships: Variable Transformations
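Nonlinear relationships
• Strictly speaking, in the multiple regression equation
y = A + B1x1 + B2x2 + … + Bkxk + U, "linear" refers to the regression
coefficients B.
• The coefficients in a multiple regression equation are each multiplied by some number and
then summed. The independent variables x, however, do not have to be linear.
Following ordinary mathematical rules, we can transform the independent variables
(logarithm, square root, polynomial terms).
Apart from possible difficulties of interpretation, this causes no other serious
problems.
Reasons for modelling a nonlinear relationship
• Our theory posits a nonlinear relationship between the independent and dependent
variables, for example between level of economic development and political unrest.
– Hibbs (1973) argues that political unrest increases as economic development moves
from low to intermediate levels, but decreases as development moves from
intermediate to high levels.
• A scatterplot of the independent and dependent variables shows
that their relationship is not linear but curvilinear.
Nonlinear relationships: the quadratic regression model
• First add a quadratic term to the regression model.
• The linear term (main effect) must be included; the quadratic term must not be entered
on its own.
• In a polynomial regression equation, all lower-order terms have to be
kept in the model.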
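Nonlinear relationships: the quadratic regression model
Both the linear term and the quadratic term are significant, so there is clearly a quadratic relationship. The regression equation is:
Consum = 14364.27 + 0.8377*income - 0.00000108*income^2
Interpreting the quadratic regression model
• This is the data set with no observations deleted.
• At first, consumption increases as income increases,
• but at a slower and slower rate (negative quadratic term). Beyond the turning point of the
quadratic curve (its maximum or minimum), additional income actually reduces
consumption.
• The slope of the tangent changes with its position. The formula is:
slope = β1 + 2β2X
• The maximum is reached at income = -β1/(2β2) = -0.8377/(2*(-0.00000108)) = 387827
(about 387,824 if computed from the rounded coefficients shown here); beyond that point, consumption falls as income rises.
• R2 = 0.9839 > the R2 of the linear regression equation, 0.9589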
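The slope and turning-point formulas can be checked numerically with the coefficients reported on the slide. The sketch below only redoes the arithmetic. Because it uses the rounded coefficients, the turning point comes out a few units below the 387827 printed on the slide, which was presumably computed from unrounded estimates. The two income levels used to probe the slope are arbitrary illustrative values.

```python
# Coefficients as printed on the slide (rounded).
b0 = 14364.27
b1 = 0.8377
b2 = -0.00000108


def predicted_consumption(income):
    """Fitted value of the quadratic regression."""
    return b0 + b1 * income + b2 * income ** 2


def slope_at(income):
    """Slope of the tangent: beta1 + 2 * beta2 * X."""
    return b1 + 2 * b2 * income


turning_point = -b1 / (2 * b2)

print("turning point:", round(turning_point, 2))
print("slope at income 100000:", round(slope_at(100_000), 4))   # still positive
print("slope at income 500000:", round(slope_at(500_000), 4))   # negative
```

The slope is positive below the turning point and negative above it, which is the "income eventually suppresses consumption" reading given above.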
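Nonlinear regression models: other transformations of the independent variable
• Sometimes we can apply other mathematical transformations to the independent variable.
• Examples are the relationship between income and consumption, or between years of education and income.
• As years of education increase, income also increases, but by smaller and smaller
amounts. There is no turning point, however, so the relationship
never reverses.
• In this situation we can take the logarithm of years of education, although such a
transformation is not easy to interpret.
• Even in this situation, approximating the relationship with a quadratic regression equation
still does better than a linear regression equation.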
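Nonlinear regression models: other transformations of the independent variable
• We can generate a suitable series of numbers in Excel to simulate the situation just
described.
• First use Excel to generate 500 values from 0.01 to 5.0 in steps of
0.01.
• Enter 0.01 in the first cell.
• In cell B1 enter =ln(a1), press Enter, and then copy the formula down to
cell B500.
• Save the data as a tab-delimited text file.
• Import the data into Stata.
Logarithmic versus linear relationship
• Analyse these data with a linear model.
• R2 = 0.77, the model as a whole is significant, and the regression coefficient is 0.59,
which is also significant.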
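The same simulation can be reproduced without Excel or Stata. The sketch below regenerates the 500 values, takes their natural logarithm, and fits the straight line by least squares in plain Python. The names `x` and `y` stand for the two spreadsheet columns and are chosen here for illustration.

```python
import math

# Column A: 0.01, 0.02, ..., 5.00; column B: natural log of column A.
x = [k / 100 for k in range(1, 501)]
y = [math.log(v) for v in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))

slope = sxy / sxx
intercept = ybar - slope * xbar
r2 = sxy ** 2 / (sxx * syy)   # R^2 of a simple regression is the squared correlation

print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("R^2:", round(r2, 2))
```

The slope and R-squared should agree with the 0.59 and 0.77 reported on the slide. A straight line captures much of a logarithmic curve over this range, but leaves a systematic pattern in the residuals.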
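Logarithmic versus linear relationship
(Plot shown on the slide.)
Logarithmic versus quadratic relationship
• Analyse the data with a quadratic regression model.
• R2 = 0.91, the model as a whole is significant, and all regression coefficients are significant.
• v2 = -1.68 + 1.56*v1 - 0.19*v1^2
Logarithmic versus quadratic relationship
(Plot shown on the slide.)
Quadratic relationships
• Polynomial transformations can be applied to interval-scale variables.
• Most other functional transformations require ratio-scale variables.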
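Nonlinear relationships: transforming the dependent variable
• Sometimes we use other variable transformations to analyse the relationship between the
independent and dependent variables.
• Exponential regression, for example,
is equivalent to applying a log
transformation. The natural logarithm is the one generally used,
that is, the logarithm with base e (≈ 2.71828).
• $\mu = E(Y) = \alpha \beta^{X}$
• That is, $\ln \mu = \ln \alpha + (\ln \beta) X = \alpha' + \beta' X$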
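The log-linear form suggests a simple way to recover $\alpha$ and $\beta$: regress ln Y on X and exponentiate the intercept and slope. The following is a minimal sketch with noiseless, made-up data (assumed $\alpha = 3$, $\beta = 1.5$). Note that with noisy data, OLS on ln Y models the mean of ln Y rather than the log of the mean of Y, which is one reason a GLM is preferred.

```python
import math


def fit_exponential(x, y):
    """Fit mu = alpha * beta**x via OLS on ln(y) = ln(alpha) + ln(beta) * x."""
    ln_y = [math.log(v) for v in y]
    n = len(x)
    xbar = sum(x) / n
    lbar = sum(ln_y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sum((a - xbar) * (b - lbar) for a, b in zip(x, ln_y)) / sxx
    intercept = lbar - slope * xbar
    # Back-transform: alpha = exp(alpha'), beta = exp(beta').
    return math.exp(intercept), math.exp(slope)


# Hypothetical noiseless data generated from alpha = 3, beta = 1.5.
x = [0, 1, 2, 3, 4, 5]
y = [3.0 * 1.5 ** v for v in x]

alpha, beta = fit_exponential(x, y)
print("alpha:", round(alpha, 6), "beta:", round(beta, 6))
```

A value of beta above 1 means the mean of Y is multiplied by that factor for each one-unit increase in X, so the curve rises. A value below 1 gives a declining curve.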
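Exponential regression
(Figure: exponential regression curves, one panel for β<1 and one for β>1.)
Nonlinear relationships: transforming the dependent variable
• Transforming the dependent variable changes the relationship between the independent and dependent variables.
After the transformation, that relationship is no longer suitable for analysis as a linear
relationship.
• It has to be analysed with a generalized linear
model (GLM), and OLS can be regarded as one member of the
GLM family.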