Transcript Document
SKEMA Ph.D programme
2010-2011
Class 5
Multiple Regression
Lionel Nesta
Observatoire Français des Conjonctures Economiques
[email protected]
Introduction to Regression
Typically, the social scientist is dealing with multiple
and complex webs of interactions between variables.
An immediate and appealing extension of simple
linear regression is to enlarge the set of explanatory
variables.
Multiple regressions include several explanatory
variables in the empirical model
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + u_i$$
Introduction to Regression
Equivalently, in summation notation:
$$y_i = \sum_{k=1}^{K} \beta_k x_{ik} + u_i$$
To minimize the sum of squared errors
$$\min_{\hat\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \;=\; \min_{\hat\beta} \sum_{i=1}^{n} \left( y_i - \sum_{k=1}^{K} \hat\beta_k x_{ik} \right)^2$$
The first-order conditions set the derivative with respect to each parameter to zero:
$$\frac{\partial}{\partial \hat\beta_j} \sum_{i=1}^{n} \left( y_i - \sum_{k=1}^{K} \hat\beta_k x_{ik} \right)^2 = 0, \qquad j = 1, \dots, K$$
Multivariate Least Square Estimator
Usually, the multivariate model is written in matrix notation:
$$y_i = x_i'\beta + u_i \qquad \text{or, stacking all observations,} \qquad y = X\beta + u$$
With the following least squares solution:
$$\hat\beta = (X'X)^{-1} X'y, \qquad \widehat{\mathrm{cov}}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}$$
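A minimal numerical sketch of these formulas in Stata's Mata, using the built-in auto dataset (an assumption; any dataset with a few numeric variables would do). The coefficients and standard errors should match those reported by regress:
sysuse auto, clear
quietly regress price mpg weight                         // benchmark: Stata's built-in OLS
mata:
y = st_data(., "price")                                  // n x 1 dependent variable
X = (st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1))  // regressors plus a constant
b = invsym(X'*X) * X'*y                                  // beta_hat = (X'X)^-1 X'y
e = y - X*b                                              // residuals
V = (e'*e / (rows(X) - cols(X))) * invsym(X'*X)          // sigma2_hat * (X'X)^-1
b, sqrt(diagonal(V))                                     // coefficients and standard errors
end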
Assumption OLS 1
Linearity
The model is linear in its parameters
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$
It is possible to apply non-linear transformations to the
variables (e.g. the log of x), but not to the parameters, as in the
following:
$$y = \beta_0 + \beta_1 x_1^{\beta_2} + u$$
OLS cannot estimate this model.
Assumption OLS 2
Random Sampling
The n observations are a random sample of
the whole population
There is no selection bias in the sample. The results
pertain to the whole population
All observations are independent of one another (no
serial or cross-sectional correlation).
Assumption OLS 3
No perfect Collinearity
There is no collinearity between independent
variables
No independent variable is constant: each variable must have some
variance, which is combined with the variance of the
dependent variable to compute the parameters.
No exact linear relationships amongst the independent variables.
Assumption OLS 4
Zero Conditional Mean
The error term u has an expected value of zero
$$E(u \mid x_1, x_2, \dots, x_k) = 0$$
Given any values of the independent variables (IV), the error
term must have an expected value of zero.
In this case, all independent variables are exogenous.
Otherwise, at least one IV suffers from an endogeneity problem.
Sources of endogeneity
Wrong specification of the model
Omitted variable correlated with one RHS variable.
Measurement error in RHS variables
Mutual causation between LHS and RHS
Simultaneity
Assumption OLS 5
Homoskedasticity
The variance of the error term, u, conditional
on RHS, is the same for all values of RHS.
$$\mathrm{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma_u^2$$
Otherwise we speak of heteroskedasticity.
Assumption OLS 6
Normality of error term
The error term is independent of all RHS and
follows a normal distribution with zero mean
and variance σ²
$$u \sim \mathrm{Normal}(0, \sigma^2)$$
Assumptions OLS
OLS1 Linearity
OLS2 Random Sampling
OLS3 No perfect Collinearity
OLS4 Zero Conditional Mean
OLS5 Homoskedasticity
OLS6 Normality of error term
Theorem 1
OLS1 - OLS4 : Unbiasedness of OLS. The expected value of each
estimated parameter $\hat\beta_j$ equals the true unknown
value $\beta_j$:
$$E(\hat\beta_j) = \beta_j, \qquad j = 0, 1, 2, \dots, k$$
Theorem 2
OLS1 – OLS5 : Variance of OLS estimate. The
variance of the OLS estimator is
$$\mathrm{Var}(\hat\beta_j) = \frac{\sigma_u^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}$$
… where $R_j^2$ is the R-squared from regressing $x_j$ on all the other
independent variables. But how can we measure $\sigma_u^2$?
Theorem 3
OLS1 – OLS5 : The standard error of the regression
is defined as
$$\hat\sigma_u^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - k - 1} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}, \qquad E(\hat\sigma_u^2) = \sigma_u^2$$
Its square root, $\hat\sigma_u$, is also called the standard error of the estimate or
the root mean squared error (RMSE).
Standard Error of Each Parameter
Combining theorems 2 and 3 yields:
$$se(\hat\beta_j) = \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}}$$
Theorem 4
Under assumptions OLS1 – OLS5, the estimators
$\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ are the Best Linear Unbiased Estimators
(BLUE) of $\beta_0, \beta_1, \dots, \beta_k$.
Assumptions OLS1 – OLS5 are known as the Gauss-Markov
assumptions; the Gauss-Markov theorem stipulates that under OLS1-5,
OLS is the best linear unbiased estimation method.
The estimates are unbiased (OLS1-4)
The estimates have the smallest variance (OLS5)
Theorem 5
Under assumptions OLS1 – OLS6, the OLS estimates
follow a t distribution:
$$\frac{\hat\beta_j - \beta_j}{se(\hat\beta_j)} \sim t_{n-k-1}$$
Extension of theorem 5: Inference
We can define the confidence interval of β at 95%:
$$\beta_j \in \left[ \hat\beta_j \pm t_{.025} \cdot \frac{\hat\sigma_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}} \right]$$
If the 95% CI does not include 0, then β is
significantly different from 0.
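A worked illustration, using the lrd coefficient from the Application 1 output further below: $\hat\beta = .6461714$, $se(\hat\beta) = .0868021$, and $t_{.025}$ with 428 degrees of freedom is about 1.966, so the 95% CI is $.6461714 \pm 1.966 \times .0868021 \approx [.476,\ .817]$, matching the interval [.47556, .8167828] reported by Stata.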
Student t Test for H0: βj=0
We are also in a position to make inferences on βj
H0: βj = 0
H1: βj ≠ 0
$$t = \frac{\hat\beta_j}{se(\hat\beta_j)}$$
Rule of decision
Accept H0 if | t | < tα/2
Reject H0 if | t | ≥ tα/2
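For example, with the lrd coefficient from Application 1 below, $t = .6461714 / .0868021 \approx 7.44$, far above the 5% critical value of about 1.97 (428 degrees of freedom), so H0: βj = 0 is rejected.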
Summary
OLS1 Linearity
OLS2 Random Sampling
OLS3 No perfect Collinearity
OLS4 Zero Conditional Mean
OLS5 Homoskedasticity
OLS6 Normality of error term

OLS1–OLS4  →  T1: Unbiasedness
OLS1–OLS5  →  T2–T4: BLUE
OLS1–OLS6  →  T5: $\hat\beta \sim t$
Application 1: seminal model
The knowledge production function
PAT = f(RD, SIZE)
$$PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot e^{u}$$
$$pat = \alpha + \beta_1\, rd + \beta_2\, size + u$$
(lower-case letters denote logarithms)
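The log variables used in the Stata command below could be prepared along these lines (a sketch; the raw variable names pat, rd and assets are assumptions):
generate lpat    = ln(pat)        // log patent count
generate lrd     = ln(rd)         // log R&D
generate lassets = ln(assets)     // log size
regress lpat lrd lassets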
Application 1: baseline model
. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
Application 2: Changing specification
The knowledge production function
PAT = f(RD, SIZE)
$$PAT = A \left(\frac{RD}{SIZE}\right)^{\beta_1} SIZE^{\beta_2}\, e^{u}$$
$$pat = \alpha + \beta_1 \log\!\left(\frac{RD}{SIZE}\right) + \beta_2\, size + u$$
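The log R&D intensity regressor lrdi used in the output below could be generated as follows (a sketch; the raw variable names rd and assets are assumptions):
generate lrdi = ln(rd/assets)     // equivalently: lrdi = lrd - lassets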
Application 2: Changing specification
. reg lpat lrdi lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
Application 3: Adding variables
The knowledge production function
PAT = f(RD, SIZE, SPE)
$$PAT = A \left(\frac{RD}{SIZE}\right)^{\beta_1} SIZE^{\beta_2}\, e^{\beta_3 SPE + u}$$
$$pat = \alpha + \beta_1 \log\!\left(\frac{RD}{SIZE}\right) + \beta_2\, size + \beta_3\, SPE + u$$
Application 3: Adding variables
. reg lpat lrdi lassets spe

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  3,   427) =   28.38
       Model |  105.748034     3  35.2493446           Prob > F      =  0.0000
    Residual |  530.263469   427  1.24183482           R-squared     =  0.1663
-------------+------------------------------           Adj R-squared =  0.1604
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1144

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------
Qualitative variables used as
independent variables
Qualitative variables as indep. variables
Qualitative variables
Dummy variables
Generating dummy variables using STATA
Interpretation of coefficients in OLS
Interaction effects between continuous and dummy var.
Qualitative variables
Qualitative variables provide information on discrete characteristics.
The number of categories taken by a qualitative variable is generally
small.
These can be numerical values, but each number denotes an attribute,
i.e. a characteristic.
A qualitative variable may have several categories
Two categories: male – female
Three categories: nationality (French, German, Turkish)
More than three categories: sectors (car, chemical, steel, electronic equip., etc.)
Qualitative variables
There are several ways to code a qualitative variable with n
categories
Using one categorical variable
Producing n - 1 dummy variables
A dummy variable is a variable which takes the value 0 or 1.
We also call them binary variables
or dichotomous variables.
Qualitative variables
Coding using one categorical variable
Two categories: we generate a categorical variable called “gender”
set to 1 if the observation is a female, 2 if the observation is a male.
Three categories: we generate a categorical variable called
“country” set to 1 if the observation is French, 2 if the observation is
German, 3 if the observation is Turkish.
More than three categories: we generate a categorical variable
called “sector” set to 1 if the observation is in the car industry, 2 for
the chemical industry, 3 for the steel industry, 4 for the
electronic equip. industry, etc.
This requires the use of labels in order to know to which category a given
number pertains.
Labelling variables
Labelling is tedious, boring and uninteresting.
But there are clear consequences when one must interpret the results
label variable. Describes a variable, qualitative or quantitative
label variable asset "real capital"
label define. Defines a label (the meaning of the numbers)
label define firm_type 1 "biotech" 0 "Pharma"
label values. Applies a label to a given variable
label values type firm_type
Example of labelling
*************************************************************************************
*******              CREATION OF INDUSTRY LABELS                            *********
*************************************************************************************
egen industrie = group(isic_oecd)
#delimit ;
label define induscode
1 "Text. Habill. & Cuir"
2 "Bois"
3 "Pap. Cart. & Imprim."
4 "Coke Raffin. Nucl."
5 "Chimie"
6 "Caoutc. Plast."
7 "Aut. Prod. min."
8 "Métaux de base"
9 "Travail des métaux"
10 "Mach. & Equip."
11 "Bureau & Inform."
12 "Mach. & Mat. Elec."
13 "Radio TV Telecom."
14 "Instrum. optique"
15 "Automobile"
16 "Aut. transp."
17 "Autres";
#delimit cr
label values industrie induscode
Exercise
1. Open SKEMA_BIO.dta
2. Create variable firm_type from type
3. Label variable firm_type
4. Define a label for firm_type and apply it
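One possible sketch for this exercise, assuming SKEMA_BIO.dta contains a numeric variable type coded 0/1 as in the label define example above (0 = Pharma, 1 = biotech):
use SKEMA_BIO.dta, clear
generate firm_type = type
label variable firm_type "Firm type (pharma vs. biotech)"
label define ftype 0 "Pharma" 1 "biotech"
label values firm_type ftype
tabulate firm_type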
Dummy variables
Coding categorical variables using dummy variables only
Two categories.
We generate one dummy variable “female” set to 1 if the obs. is a
female, 0 otherwise.
We generate one dummy variable “male” set to 1 if the obs. is a
male, 0 otherwise.
But one of the dummy variables is simply redundant. When female
= 0, then necessarily male = 1 (and vice versa).
Hence with two categories, we only need one dummy variable.
Dummy variables
Coding categorical variables using dummy variables only
Three categories.
We generate one dummy variable “France” set to 1 if the obs. is
French, 0 otherwise.
We generate one dummy variable “Germany” set to 1 if the obs. is
German, 0 otherwise.
We generate one dummy variable “Turkey” set to 1 if the obs. is
Turkish, 0 otherwise.
But one of the dummy variables is simply redundant. When
France = 0 and Germany = 0, then Turkey = 1.
For a variable with n categories, we must create n - 1 dummy variables,
each representing one particular category.
Generation of dummies with STATA
Using the if condition.
generate DEU = 0
replace DEU = 1 if country=="GERMANY"
generate LDF = 1 if size > 100
replace LDF = 0 if size < 101     // note: in Stata, missing values of size count as greater than 100
Avoiding the use of the if condition.
generate FRA = country=="FRANCE"
generate LDF = size > 100
Generation of dummies with STATA
With n categories and n being large, generating dummy
variables can become really tedious.
The tabulate command has a very convenient extension,
since it will generate n dummy variables at once.
tabulate varcat, gen(v_)
tabulate country, gen(c_)
This will create n dummy variables, with n being the number of countries
in the dataset, c_1 being the first country, c_2 the second,
c_3 the third, etc.
Reading coefficients of dummy variables
Remember! A coefficient tells us the increase in y
associated with a one-unit increase in x, other things held
constant (ceteris paribus).
If the knowledge production function is
$$y = \alpha + \beta\, biotech + u$$
with "y" being the number of patents and "biotech" being
a dummy variable set to 1 for biotech firms, 0 otherwise.
Reading coefficients of dummy variables
If the firm is a biotech company, then the dummy
variable "biotech" is equal to unity. Hence:
$$\hat{y} = \hat\alpha + \hat\beta \times 1 = \hat\alpha + \hat\beta$$
If the firm is a pharma company, then the dummy
variable "biotech" is equal to zero. Hence:
$$\hat{y} = \hat\alpha + \hat\beta \times 0 = \hat\alpha$$
Reading coefficients of dummy variables
The coefficient reads as the variation in the dependent
variable when the dummy variable is set to 1 relative to
the situation where the dummy variable is set to 0.
With two categories, I must introduce one dummy variable.
With three categories, I must introduce two dummy variables.
With n categories, I must introduce (n-1) dummy variables.
Exercise
1. Regress the following model:
$$PAT = \alpha + \beta\, biotech + u$$
2. Predict the number of patents for both biotech and
pharma companies
3. Produce descriptive statistics of PAT for each type of
company using the command table
4. What do you observe?
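A possible sketch for this exercise, assuming the raw patent count is stored in a variable named pat (an assumption) and that the dummy biotech already exists:
regress pat biotech
predict pat_hat, xb                              // fitted values: alpha_hat + beta_hat * biotech
table biotech, contents(mean pat mean pat_hat)
With a single dummy regressor, the fitted value for each group equals the group mean of PAT, which is what the comparison with the table output should show.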
Reading coefficients of dummy variables
For semi logarithmic forms (log Y), coefficient β must be
read as an approximation of the percent change in Y
associated with a variation of 1 unit of the explanatory
variable.
This approximation is acceptable for small β (β < 0.1).
When β is large (β ≥ 0.1), the exact percent change in Y
is:
100 × (eβ – 1)
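For instance, with the biotech coefficient estimated in Application 4 below ($\hat\beta_4 \approx 1.657$), the naive reading would be a 166% increase in PAT, whereas the exact effect is $100 \times (e^{1.657} - 1) \approx 424\%$.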
Application 4: dummy variable
The knowledge production function
PAT = f(RD, SIZE, SPE, BIO)
$$PAT = A \left(\frac{RD}{SIZE}\right)^{\beta_1} SIZE^{\beta_2}\, e^{\beta_3 SPE + \beta_4 BIO + u}$$
$$pat = \alpha + \beta_1 \log\!\left(\frac{RD}{SIZE}\right) + \beta_2\, size + \beta_3\, SPE + \beta_4\, BIO + u$$
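If the dataset only contains the categorical variable type (0 = Pharma, 1 = biotech, as in the labelling example above), the dummy used in the regression below could be built as follows (a sketch under that assumption):
generate biotech = (type == 1)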
Application 4: dummy variable
. reg lpat lrdi lassets spe biotech

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  4,   426) =   50.24
       Model |  203.874406     4  50.9686015           Prob > F      =  0.0000
    Residual |  432.137097   426  1.01440633           R-squared     =  0.3206
-------------+------------------------------           Adj R-squared =  0.3142
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0072

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------
Application 4: dummy variable
[Figure: predicted ln(PAT) plotted against size. Two parallel lines with the same slope $\hat\beta_2$:
Biotech: $\ln(PAT) = \hat\alpha + \hat\beta_4 + \hat\beta_2 \cdot size$ (intercept $\hat\alpha + \hat\beta_4$)
Pharma: $\ln(PAT) = \hat\alpha + \hat\beta_2 \cdot size$ (intercept $\hat\alpha$)
The vertical distance between the two lines is $\hat\beta_4$.]
Application 5: Interacting variables
The knowledge production function
PAT = f(RD, SIZE, BIO)
$$PAT = A \left(\frac{RD}{SIZE}\right)^{\beta_1} SIZE^{\beta_2}\, e^{\beta_3 BIO + \beta_4 BIO \cdot size + u}$$
$$pat = \alpha + \beta_1 \log\!\left(\frac{RD}{SIZE}\right) + \beta_2\, size + \beta_3\, BIO + \beta_4\, BIO \cdot size + u$$
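The interaction regressor bio_assets used in the output below is presumably biotech × lassets (an assumption based on its name); it could be generated as:
generate bio_assets = biotech * lassets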
Application 5: Interacting variables
. reg lpat lrdi lassets spe biotech bio_assets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  5,   425) =   41.02
       Model |  207.026736     5  41.4053471           Prob > F      =  0.0000
    Residual |  428.984767   425  1.00937592           R-squared     =  0.3255
-------------+------------------------------           Adj R-squared =  0.3176
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0047

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------
Application 5: Interacting variables
[Figure: predicted ln(PAT) plotted against size. With the interaction term, the two lines differ in both intercept and slope:
Biotech: $\ln(PAT) = \hat\alpha + \hat\beta_3 + (\hat\beta_2 + \hat\beta_4) \cdot size$ (slope $\hat\beta_2 + \hat\beta_4$)
Pharma: $\ln(PAT) = \hat\alpha + \hat\beta_2 \cdot size$ (slope $\hat\beta_2$)]
Specification Tests
Specification Tests for Multiple OLS
The knowledge production function
PAT = f(RD, SIZE, SPE)
$$PAT = A \left(\frac{RD}{SIZE}\right)^{\beta_1} SIZE^{\beta_2}\, e^{\beta_3 SPE + u}$$
$$pat = \alpha + \beta_1 \log\!\left(\frac{RD}{SIZE}\right) + \beta_2\, size + \beta_3\, SPE + u$$
Specification Tests for Multiple OLS
Critical probability (significance level) α such that: Pr(Ha | H0) = α
Student t test: concerning the significance of one parameter
Fisher F test: concerning the significance of several
parameters simultaneously (Wald test)
Non-linear restriction test: testing for a non-linear relationship
between parameters
Specification Tests for Multiple OLS
Testing linear combinations of parameters
Concerning one parameter only
H0 : β_lassets = 0.30
test lassets = 0.30
Tests on several parameters
H0 : β_lassets = 0.30 and β_lrdi = 0.70
test (lassets = 0.3) (lrdi = 0.7)
H0 : β_lrdi = 2 × β_lassets
test lrdi = 2 * lassets
H0 : β_lrdi + β_lassets = 1
test lrdi + lassets = 1
lincom _b[lrdi] + _b[lassets] - 1
Specification Tests for Multiple OLS
Testing non-linear combinations of parameters
Test on several parameters
H0 : β_lrdi × β_lassets = 0.2
testnl _b[lrdi] * _b[lassets] = 0.2
nlcom _b[lrdi] * _b[lassets] - 0.2
Review of Assumptions

OLS assumption               | Consistency when violated | Efficiency when violated                | Test
OLS1 Linearity               | -                         | -                                       | -
OLS2 Random Sampling         | Biased β                  | None                                    | None. Redo sampling & estimation
OLS3 No perfect Collinearity | -                         | -                                       | -
OLS4 Zero Conditional Mean   | Biased β                  | Poorly estimated variance of β          | Link test; Omitted Variable test
OLS5 Homoskedasticity        | None                      | Underestimated variance of β            | Breusch-Pagan test
OLS6 Normality of error term | None                      | Lack of reliability of the t test for β | Shapiro-Wilk test
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals
Rule of thumb using graphs
Stata Instruction rvfplot
White Test
Stata Instruction estat imtest
Breusch-Pagan Test
Stata Instruction estat hettest
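The three checks above can be run in sequence right after the regression; a minimal sketch, reusing the Application 3 model:
quietly regress lpat lrdi lassets spe
rvfplot              // residuals vs. fitted values: look for a pattern in the spread
estat imtest         // information matrix test (heteroskedasticity, skewness, kurtosis)
estat hettest        // Breusch-Pagan / Cook-Weisberg test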
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: rvfplot
[Figure: residuals-versus-fitted plot (rvfplot); x-axis: Fitted values, y-axis: Residuals.]
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: estat imtest
. imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df         p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: estat hettest
. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lpat
chi2(1)      =     2.83
Prob > chi2  =   0.0927
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals
Rule of thumb using graphs
Stata Instruction
predict res, residual
kdensity res, normal
Formally, using the Shapiro-Wilk Test
Stata Instructions: predict res, residual
swilk res
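Putting the two checks together, a minimal sketch using the Application 3 regression again:
quietly regress lpat lrdi lassets spe
predict res, residuals
kdensity res, normal      // kernel density of the residuals overlaid with a normal density
swilk res                 // Shapiro-Wilk test of the null of normality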
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals: kdensity
[Figure: kernel density estimate of the residuals (x-axis: Residuals, y-axis: Density) overlaid with the normal density; kernel = epanechnikov, bandwidth = 0.2971.]
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals
. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z    Prob>z
-------------+-------------------------------------------------
         res |    431    0.98688      3.862     3.226  0.00063
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity)
Link test : Stata Instruction linktest
Regress the DV on its prediction and the squared prediction
Variable _hat must be significant, but not _hatsq
Ramsey RESET Test : Stata Instruction ovtest
Regress the DV on the RHS variables plus powers (up to the 4th) of the fitted values
(with the rhs option, powers of the RHS variables are used instead)
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity): linktest
. quietly: regress lpat lrdi lassets spe
. linktest
      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   45.93
       Model |  112.393289     2  56.1966447           Prob > F      =  0.0000
    Residual |  523.618213   428  1.22340704           R-squared     =  0.1767
-------------+------------------------------           Adj R-squared =  0.1729
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1061

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity): ovtest
The RESET F statistic compares the augmented regression (R²₁) with the original one (R²₀):
$$F = \frac{(R_1^2 - R_0^2)/m}{(1 - R_1^2)/(n - k - m - 1)} \sim F_{m,\; n-k-m-1}$$
where m is the number of added powers of the fitted values.
. quietly: regress lpat lrdi lassets spe
. ovtest
Ramsey RESET test using powers of the fitted values of lpat
Ho: model has no omitted variables
F(3, 424)  =    2.34
Prob > F   =  0.0732
Exercise
1. Regress the following model
$$pat = \alpha + \beta_1 \log\!\left(\frac{rd}{size}\right) + \beta_2\, size + \beta_3\, SPE + \beta_4\, BIO + u$$
2. Assuming OLS1-3 to be correct, test OLS4-6 and
conclude
1. OLS4 on specification using linktest and ovtest
2. OLS5 on homoskedasticity using imtest and hettest
3. OLS6 on normality of errors using kdensity and swilk test
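A possible sketch for this exercise, using the variable names introduced earlier in the class (linktest is run last because it replaces the stored regression results):
use SKEMA_BIO.dta, clear
regress lpat lrdi lassets spe biotech
* OLS5: homoskedasticity
estat imtest
estat hettest
* OLS6: normality of the error term
predict res, residuals
kdensity res, normal
swilk res
* OLS4: specification / omitted variables
ovtest
linktest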