Regression Assumptions - Division of Social Sciences


If the following assumptions are met:

The model is
- complete
- linear
- additive

The variables are
- measured at an interval or ratio scale
- measured without error

The regression error term
- is unrelated to the predictors
- is normally distributed
- has an expected value of 0
- is independent across observations
- has equal variance across observations (homoscedasticity)
- in a system of interrelated equations, is unrelated to the errors of the other equations
then OLS has the following characteristics (if the sample is a probability sample):

- Unbiased: E(b) = β, where b is the sample estimate and β is the true population coefficient. On average we are on target.
- Efficient: the standard error will be the minimum possible.
- Consistent: as N increases, the standard error decreases and the estimate closes in on the population value.
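These three properties can be illustrated with a small simulation. The following is a minimal Stata sketch, not part of the original deck; the data-generating model y = 2 + 3x + e, N = 100, and 500 replications are all assumptions chosen for illustration:

capture program drop olssim
program define olssim, rclass
    clear
    set obs 100                      // sample size for each replication
    gen x = rnormal()
    gen y = 2 + 3*x + rnormal()      // true slope is 3
    regress y x
    return scalar b = _b[x]          // keep the estimated slope
end
simulate b = r(b), reps(500) nodots: olssim
summarize b                          // mean of b is close to 3 (unbiased);
                                     // rerun with a larger "set obs" and the
                                     // std. dev. of b shrinks (consistent)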
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS           df       MS      Number of obs =   10082
-------------+----------------------------------   F(6, 10075)   = 2947.08
       Model |  65503313.6         6  10917218.9   Prob > F      =  0.0000
    Residual |  37321960.3     10075  3704.41293   R-squared     =  0.6370
-------------+----------------------------------   Adj R-squared =  0.6368
       Total |   102825274     10081  10199.9081   Root MSE      =  60.864

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .1843877   .0394747     4.67   0.000                 .0508435
      AVG_ED |   92.81476   1.575453    58.91   0.000                 .6976283
        P_EL |   .6984374   .0469403    14.88   0.000                 .1225343
      P_GATE |   .8179836   .0666113    12.28   0.000                 .0769699
        EMER |  -1.095043   .1424199    -7.69   0.000                 -.046344
        DMOB |   4.715438   .0817277    57.70   0.000                 .3746754
       _cons |   52.79082   8.491632     6.22   0.000                        .
------------------------------------------------------------------------------
Note the coefficients on MEALS and on AVG_ED (parents' education) in this model and the next.
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS           df       MS      Number of obs =   10082
-------------+----------------------------------   F(13, 10068)  = 1488.01
       Model |    67627352        13     5202104   Prob > F      =  0.0000
    Residual |  35197921.9     10068  3496.01926   R-squared     =  0.6577
-------------+----------------------------------   Adj R-squared =  0.6572
       Total |   102825274     10081  10199.9081   Root MSE      =  59.127

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |    .370891   .0395857     9.37   0.000                 .1022703
      AVG_ED |   89.51041   1.851184    48.35   0.000                 .6727917
        P_EL |   .2773577   .0526058     5.27   0.000                 .0486598
      P_GATE |   .7084009   .0664352    10.66   0.000                 .0666584
        EMER |  -.7563048   .1396315    -5.42   0.000                  -.032008
        DMOB |   4.398746   .0817144    53.83   0.000                  .349512
      PCT_AA |  -1.096513   .0651923   -16.82   0.000                -.1112841
      PCT_AI |  -1.731408   .1560803   -11.09   0.000                -.0718944
      PCT_AS |   .5951273   .0585275    10.17   0.000                 .0715228
      PCT_FI |   .2598189   .1650952     1.57   0.116                 .0099543
      PCT_HI |   .0231088   .0445723     0.52   0.604                 .0066676
      PCT_PI |  -2.745531   .6295791    -4.36   0.000                -.0274142
      PCT_MR |  -.8061266   .1838885    -4.38   0.000                -.0295927
       _cons |   96.52733   9.305661    10.37   0.000                        .
------------------------------------------------------------------------------
Once the new variables are added, the coefficients on MEALS and on AVG_ED (parents' education) change: the original model was not complete.

Diagnosis: theoretical
Remedy: including new variables
Violation of linearity:

- An almost perfect relationship will appear as a weak one.
- Almost all linear relations stop being linear at a certain point.

[Figure: three scatter plots of Y against X with fitted lines, Rsq = 0.1174, Rsq = 0.6211, and Rsq = 0.9313, illustrating how non-linearity weakens or distorts a linear fit.]
Diagnosis:

- Visual scatter plots
- Comparing regressions with the continuous and with the dummied independent variable

Remedy:

- Use dummies, so that Y = a + b*X + e becomes
  Y = a + b1*D1 + ... + b(k-1)*D(k-1) + e
  where X is broken up into k dummies (Di), k-1 of which are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes (bi) will indicate the shape of the non-linearity.



- Transform the variables through a non-linear transformation (both remedies are sketched below), so that Y = a + b*X + e becomes:

  Quadratic:              Y = a + b1*X + b2*X^2 + e
  Cubic:                  Y = a + b1*X + b2*X^2 + b3*X^3 + e
  Kth-degree polynomial:  Y = a + b1*X + ... + bk*X^k + e
  Logarithmic:            Y = a + b*log(X) + e
  Exponential:            log(Y) = a + b*X + e, i.e. Y = e^(a + b*X + e)
  Inverse:                Y = a + b/X + e, etc.
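A Stata sketch of both remedies; the API13/MEALS names come from the deck's data, while the 10-group split and the +1 inside the log are assumptions for illustration:

. regress API13 MEALS                 // the linear specification
. xtile meals_cat = MEALS, nq(10)     // break MEALS into 10 groups
. regress API13 i.meals_cat           // k-1 dummies; a clearly higher
                                      // R-squared signals non-linearity
. gen MEALS2 = MEALS^2                // quadratic term (used below)
. gen logMEALS = log(MEALS + 1)       // logarithmic; +1 guards against log(0)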
. regress API13 MEALS MEALS2, beta

      Source |       SS           df       MS      Number of obs =   10242
-------------+----------------------------------   F(2, 10239)   = 1614.05
       Model |  25677500.1         2    12838750   Prob > F      =  0.0000
    Residual |  81444841.5     10239   7954.3746   R-squared     =  0.2397
-------------+----------------------------------   Adj R-squared =  0.2396
       Total |   107122342     10241  10460.1447   Root MSE      =  89.187

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |  -3.666183   .1316338   -27.85   0.000                -.9997207
      MEALS2 |   .0181756   .0011998    15.15   0.000                 .5437501
       _cons |   922.5229   3.183358   289.80   0.000                        .
------------------------------------------------------------------------------

(The standardized Beta coefficients are meaningless here: MEALS and its square are almost perfectly collinear.)

Inflection point: -b1/(2*b2) = -(-3.666183)/(2*.0181756) = 100.85425

As MEALS approaches 100%, the negative effect disappears.
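After the quadratic fit, the inflection point and a delta-method standard error can be computed directly; a small sketch using Stata's nlcom, assuming the model above is still in memory:

. nlcom -_b[MEALS]/(2*_b[MEALS2])     // -b1/(2*b2), where the slope turns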


The assumption is that X1 and X2 each add to Y separately, regardless of the value of the other:

  Y = a + b1*X1 + b2*X2 + e
  E.g., Inc = a + b1*Education + b2*Citizenship + e

Imagine that the effect of X1 depends on X2:

  If citizen:     Inc = a + b'1*Education + e'
  If not citizen: Inc = a + b''1*Education + e''
  where b'1 > b''1

You cannot simply add the two. If Citizenship takes only two values, their effect is multiplicative:

  Inc = a + b1*Education*b2*Citizenship + e

There are many examples of violations of additivity:
- The effect of previous knowledge (X1) and effort (X2) on grades (Y)
- The effect of race and skills on income (discrimination)
- The effect of paternal and maternal education on academic achievement
Diagnosis:

- Try other functional forms and compare R-squares

Remedy (see the sketch below):

- Introduce the multiplicative term as a new variable, so that
  Yi = a + b1*X1 + b2*X2 + e becomes
  Yi = a + b1*X1 + b2*X2 + b3*Z + e, where Z = X1*X2
- Or transform the equation into additive form:
  if Y = a*(X1^b1)*(X2^b2)*e, then
  log(Y) = log(a) + b1*log(X1) + b2*log(X2) + log(e)
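A minimal Stata sketch of the two remedies, with generic names (Y, X1, X2) standing in for whatever is in the model; the log form assumes Y, X1, and X2 are strictly positive:

. gen Z = X1*X2                       // multiplicative (interaction) term
. regress Y X1 X2 Z
. gen logY = log(Y)                   // or move to the additive (log) form
. gen logX1 = log(X1)
. gen logX2 = log(X2)
. regress logY logX1 logX2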


Does parents' education matter more in elementary school or later?

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .720(a)   .519       .519                70.918
a Predictors: (Constant), ESCHOOL, AVG_ED

Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B          Std. Error         Beta                        t         Sig.
1  (Constant)     510.030    2.738                                         186.250   .000
   AVG_ED         87.476     .930               .649                       94.085    .000
   ESCHOOL        54.352     1.424              .264                       38.179    .000
a Dependent Variable: API13
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .730(a)   .533       .533                69.867
a Predictors: (Constant), INTESXED, AVG_ED, ESCHOOL

Coefficients(a)
                                 Unstandardized Coefficients   Standardized Coefficients
Model                            B          Std. Error         Beta            t          Sig.
1  (Constant)                    454.542    4.151                              109.497    .000
   AVG_ED                        107.938    1.481              .801            72.896     .000
   ESCHOOL                       145.801    5.386              .707            27.073     .000
   AVG_ED*ESCHOOL (interaction)  -33.145    1.885              -.495           -17.587    .000
a Dependent Variable: API13

Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*ESCHOOLi + (-33.145)*AVG_EDi*ESCHOOLi

IF ESCHOOL = 1, i.e., the school is an elementary school:
Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*1 + (-33.145)*AVG_EDi*1
             = (454.542 + 145.801) + (107.938 - 33.145)*AVG_EDi
             = 600.343 + 74.793*AVG_EDi

IF ESCHOOL = 0, i.e., the school is not an elementary but a middle or high school:
Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*0 + (-33.145)*AVG_EDi*0
             = 454.542 + 107.938*AVG_EDi

The effect of parental education is larger after elementary school (107.938 vs. 74.793)!
Is this difference statistically significant? Yes: in the coefficients table above, the AVG_ED*ESCHOOL interaction has B = -33.145 with t = -17.587 and Sig. = .000.
. regress API13 P_EL AVG_ED, beta

      Source |       SS           df       MS      Number of obs =   10173
-------------+----------------------------------   F(2, 10170)   = 4726.92
       Model |  51193410.2         2  25596705.1   Prob > F      =  0.0000
    Residual |    55071454     10170  5415.08889   R-squared     =  0.4818
-------------+----------------------------------   Adj R-squared =  0.4817
       Total |   106264864     10172  10446.8014   Root MSE      =  73.587

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        P_EL |   1.364798   .0544075    25.08   0.000                 .2362464
      AVG_ED |   111.0936   1.268689    87.57   0.000                 .8246885
       _cons |   446.5234   4.420695   101.01   0.000                        .
------------------------------------------------------------------------------

. regress API13 P_EL AVG_ED INTELXED, beta

      Source |       SS           df       MS      Number of obs =   10173
-------------+----------------------------------   F(3, 10169)   = 3158.47
       Model |  51256452.7         3  17085484.2   Prob > F      =  0.0000
    Residual |  55008411.5     10169  5409.42192   R-squared     =  0.4823
-------------+----------------------------------   Adj R-squared =  0.4822
       Total |   106264864     10172  10446.8014   Root MSE      =  73.549

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        P_EL |   1.886735   .1622719    11.63   0.000                 .3265936
      AVG_ED |   113.9676   1.522041    74.88   0.000                 .8460229
    INTELXED |  -.2345271   .0686992    -3.41   0.001                -.0818319
       _cons |   439.2113   4.910184    89.45   0.000                        .
------------------------------------------------------------------------------
When a variable is not measured at an interval or ratio scale, the appropriate technique depends on the levels of measurement of the dependent and the independent variable:

Independent: dichotomous
- Dependent nominal, dichotomous: 2x2 table; dummy variable with logit/probit
- Dependent nominal, polytomous: 2xK table; dummy variable with multinomial logit/probit
- Dependent ordinal: 2xN table; dummy variable with ordered logit/probit
- Dependent interval/ratio: difference of means test; regression with a dummy variable

Independent: polytomous
- Dependent nominal, dichotomous: Kx2 table; dummy variables with logit/probit
- Dependent nominal, polytomous: KxK table; dummy variables with multinomial logit/probit
- Dependent ordinal: KxN table; dummy variables with ordered logit/probit
- Dependent interval/ratio: ANOVA; regression with dummy variables

Independent: ordinal
- Dependent nominal, dichotomous: Nx2 table; dummy variables with logit/probit, or just logit/probit
- Dependent nominal, polytomous: NxK table; dummy variables with multinomial logit/probit, or just multinomial logit/probit
- Dependent ordinal: NxN table; dummy variables with ordered logit/probit, or just ordered logit/probit
- Dependent interval/ratio: regression with dummy variables, or just regression

Independent: interval/ratio
- Dependent nominal, dichotomous: logit/probit
- Dependent nominal, polytomous: multinomial logit/probit
- Dependent ordinal: ordered logit/probit
- Dependent interval/ratio: regression

(Two of these cells are sketched below.)
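A Stata sketch of two cells from the table, with hypothetical variable names (happy ordinal, marital polytomous, income interval/ratio, region polytomous):

. ologit happy i.marital              // ordinal dependent, polytomous
                                      // independent entered as dummies
. regress income i.region             // interval/ratio dependent, regression
                                      // with dummy variables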
Take Y = a + b*X + e, where X is the true value.

Random measurement error: suppose we observe X* = X + u, where u is a random measurement error. Then

  Y = a + b'*X* + e'  →  Y = a + b'*(X + u) + e' = a + b'*X + b'*u + e'  →
  Y = a + b'*X + E, where E = b'*u + e' and b' = b

The slope (b) will not change, but the error will increase as a result:

- Our R-square will be smaller
- Our standard errors will be larger → t-values smaller → significance smaller

Systematic measurement error: suppose we observe X# = X + c*W + u, where W is a source of systematic measurement error and c is a weight. Then

  Y = a + b'*X# + e'  →  Y = a + b'*(X + c*W + u) + e' = a + b'*X + b'*c*W + E

Here b' = b if and only if r(W,X) = 0 or r(W,Y) = 0; otherwise b' ≠ b, which means the slope changes together with the increase in the error. Apart from the problems stated above, that means that

- Our slope will be wrong

Diagnosis:

- Look at the correlation of the measure with other measures of the same variable

Remedy:

- Use multiple indicators and structural equation models (AMOS)
- Confirmatory factor analysis
- Better measures
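A minimal sketch of the multiple-indicator remedy using Stata's sem command (rather than AMOS); x1-x3 are hypothetical observed indicators of a latent variable X that is measured with error:

. sem (X -> x1 x2 x3) (y <- X)        // measurement model plus structural
                                      // equation; capitalized X is latent
. estat gof                           // goodness of fit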
Our calculations of statistical significance depend on this assumption.

- Statistical inference can be robust even when the error is non-normal.

Diagnosis (sketched below):

- You can look at the distribution of the error. Because of the homoscedasticity assumption (see later), the error, when summed up for each prediction, should also be normal. (In principle, we have multiple observations for each prediction.)
- Remember! Our measured variables (Y and X) do not have to have a normal distribution! Only the error for each prediction does.

Remedy:

- Any non-linear transformation will change the shape of the distribution of the error
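A sketch of the diagnosis in Stata, reusing the deck's earlier model; sktest here stands in for any normality test:

. quietly regress API13 MEALS AVG_ED
. predict r, residuals                // the estimated errors
. histogram r, normal                 // histogram with a normal curve overlaid
. qnorm r                             // quantile-normal plot; bends away from
                                      // the diagonal indicate non-normality
. sktest r                            // skewness/kurtosis test of normality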
DEPENDENT VARIABLE

childs (NUMBER OF CHILDREN):   N = 1751   Minimum = 0   Maximum = 8   Mean = 1.89   Std. Deviation = 1.665

Underdispersion: Mean/Std.Dev. > 1
Overdispersion:  Mean/Std.Dev. < 1

As the mean (1.89) is greater than the standard deviation (1.665), we have a case of (small) underdispersion.
. regress childs educ income06 sibs age, beta

      Source |       SS           df       MS      Number of obs =    1751
-------------+----------------------------------   F(4, 1746)    =  114.83
       Model |  1010.57637         4  252.644093   Prob > F      =  0.0000
    Residual |  3841.48416      1746  2.20016275   R-squared     =  0.2083
-------------+----------------------------------   Adj R-squared =  0.2065
       Total |  4852.06054      1750  2.77260602   Root MSE      =  1.4833

------------------------------------------------------------------------------
      childs |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        educ |  -.1196158   .0133317    -8.97   0.000                -.2169182
    income06 |   .0161114   .0067835     2.38   0.018                 .0567489
        sibs |   .0860119   .0124929     6.88   0.000                 .1557163
         age |   .0321178   .0020686    15.53   0.000                 .3330204
       _cons |   1.405787   .2164236     6.50   0.000                        .
------------------------------------------------------------------------------
Poisson assumes Mean = Std.Dev. (no over- or underdispersion); the negative binomial does not make this assumption. The log of the expected count is now the unit of the dependent variable.

. nbreg childs educ income06 sibs age, dispersion(mean)

Fitting Poisson model:

Iteration 0:   log likelihood =  -2975.315
Iteration 1:   log likelihood =  -2975.314
Iteration 2:   log likelihood =  -2975.314

Fitting constant-only model:

Iteration 0:   log likelihood = -3262.2505
Iteration 1:   log likelihood = -3153.7715
Iteration 2:   log likelihood = -3153.2079
Iteration 3:   log likelihood = -3153.2072
Iteration 4:   log likelihood = -3153.2072

Fitting full model:

Iteration 0:   log likelihood =  -2990.357
Iteration 1:   log likelihood = -2964.9271
Iteration 2:   log likelihood = -2962.3446
Iteration 3:   log likelihood = -2962.2649
Iteration 4:   log likelihood = -2962.2647

Negative binomial regression                    Number of obs =    1751
                                                LR chi2(4)    =  381.88
Dispersion     = mean                           Prob > chi2   =  0.0000
Log likelihood = -2962.2647                     Pseudo R2     =  0.0606

------------------------------------------------------------------------------
      childs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0593967   .0069457    -8.55   0.000    -.0730099   -.0457835
    income06 |    .010126   .0037223     2.72   0.007     .0028304    .0174217
        sibs |   .0409034   .0060993     6.71   0.000     .0289491    .0528577
         age |   .0168531   .0011061    15.24   0.000     .0146852    .0190209
       _cons |   .2482068   .1169235     2.12   0.034      .019041    .4773726
-------------+----------------------------------------------------------------
    /lnalpha |  -2.301191   .2291361                      -2.75029   -1.852093
-------------+----------------------------------------------------------------
       alpha |   .1001395   .0229456                       .0639093   .1569084
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 26.10   Prob>=chibar2 = 0.000
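Because the coefficients are in log expected counts, exponentiating makes them readable: exp(-.0593967) ≈ 0.942, so each additional year of education multiplies the expected number of children by about 0.94. Stata reports these incidence-rate ratios directly (a sketch):

. nbreg childs educ income06 sibs age, dispersion(mean) irr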

Violations of the zero expected value of the error:

- The solid line gives a negative mean error, the dotted line a positive one
- This can happen when we have some selection problem

[Figure: scatter plot of Z against X (Rsq = 0.6211) with two candidate regression lines.]

Diagnosis:

- A visual scatter plot will not help unless we somehow know the true regression line in advance

Remedy:

- If it is a selection problem, try to address it
Independence of errors can fail in many ways:

Example 1: Suppose you take a survey of 10 people but you interview everyone 10 times. Now your N = 100, but your errors are not independent: for the same person you will have similar errors.

Example 2: Suppose you take 10 countries and you observe them in 10 different time periods. Now your N = 100, but your errors are not independent: for the same country you will have similar errors.

Example 3: Suppose you take 100 countries and you observe them only once. Now your N = 100. But countries that are next to each other are often similar (same geography and climate, similar history, cooperation, etc.). If your model underpredicts Denmark, it is likely to underpredict Sweden as well.

Example 4: Suppose you take 100 people, but they are all couples, so what you really have is 50 couples. Husband and wife tend to be similar. If your model underestimates one, chances are it does the same for the other. Spouses have similar errors.

Statistical inference assumes that each case is independent of the others, and in the examples above that is not the case. In effect, your N is smaller than it appears.

- This biases your standard error, because the formula is "tricked into believing" that you have a larger sample than you actually have, and larger samples give smaller standard errors and better statistical significance.
- This may also bias your estimates of the intercept and the slope. Non-linearity is a special case of correlated errors.

It is called autocorrelation because the correlation is between cases and not variables, although autocorrelations can often be traced to certain variables, such as common geographic location or the same country, person, or family.

Diagnosis:

- Visual, scatterplot
- Checking groups of cases that are theoretically suspect
- Certain forms of serial or spatial autocorrelation can be diagnosed by calculating certain statistics (e.g., the Durbin-Watson test)

Remedy (see the sketch below):

- You can include new variables in the equation
- E.g., for serial (temporal) correlation you can include the value of Y at t-1 as an independent variable
- For spatial correlation we can often model the relationships by introducing a weight matrix
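Stata sketches of the diagnosis and two remedies; year, coupleid, and the y/x names are hypothetical:

. tsset year                          // declare the time dimension
. regress y x
. estat dwatson                       // Durbin-Watson test for serial correlation
. regress y L.y x                     // remedy: include Y at t-1 as a predictor
. regress y x, vce(cluster coupleid)  // clustered cases (couples, countries):
                                      // allow correlated errors within clusters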

Homoscedasticity means equal variance; heteroscedasticity means unequal variance.

- We assume that each prediction is not just on target on average, but also that we make the same amount of error from prediction to prediction
- Heteroscedasticity results in biased standard errors and statistical significance

Diagnosis:

- Visual, scatter plot

Remedy (see the sketch below):

- Introducing a weight matrix (e.g., weighting by 1/X)
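A sketch of the diagnosis and two fixes in Stata; the robust-SE line is an addition beyond the deck's weight-matrix remedy, and the +1 in the weight guards against division by zero:

. quietly regress API13 MEALS AVG_ED
. rvfplot                             // residuals vs. fitted values; a fan
                                      // shape indicates heteroscedasticity
. estat hettest                       // Breusch-Pagan test
. regress API13 MEALS AVG_ED, vce(robust)             // robust standard errors
. regress API13 MEALS AVG_ED [aweight = 1/(MEALS+1)]  // 1/X-type weighting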



The error represents all factors influencing Y that are not included in the regression equation.

- If an omitted variable is related to X, the assumption is violated. This is the same as the completeness (omitted variable) problem.

Diagnosis:

- Theoretical only: the estimated error (residual) will ALWAYS be uncorrelated with X, and there is no way to establish the TRUE error

Remedy:

- Adding new variables to the model

We sometimes estimate more than one regression.

Suppose

  Y(t) = a + b1*X(t-1) + b2*Z(t-1) + e     but
  X(t) = a' + b'1*Y(t-1) + b'2*Z(t-1) + e'

e and e' will be correlated: whatever is omitted from both equations will show up in both e and e', making them correlated.

This is also the case in sample selection models:

  S = a + b1*X + b2*Z + e                        where S is whether one is selected into the sample
  Y = a' + b'1*X + b'2*Z + b'3*W + b'4*V + e'    where Y is the outcome of interest

Again, e and e' will be correlated, because whatever is omitted from both equations shows up in both.
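A sketch of estimating such a selection model with Stata's heckman command, matching the two equations above (the lower-case names mirror the slide's S, X, Z, W, V, Y):

. heckman y x z w v, select(s = x z)  // outcome equation plus selection
                                      // equation; the correlation of the two
                                      // errors (rho) is estimated, not ignored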

