Transcript Document

SKEMA Ph.D programme
2010-2011
Class 5
Multiple Regression
Lionel Nesta
Observatoire Français des Conjonctures Economiques
[email protected]
Introduction to Regression
• Typically, the social scientist deals with multiple and complex webs of interactions between variables. An immediate and appealing extension of simple linear regression is to enlarge the set of explanatory variables.
• Multiple regressions include several explanatory variables in the empirical model:

y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i
Introduction to Regression
• In summation notation, the same model writes:

y_i = \alpha + \sum_{k=1}^{K} \beta_k x_{ik} + u_i
To minimize the sum of squared errors:

\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \right)^2

The first-order conditions are:

\frac{\partial}{\partial \hat{\alpha}} \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \right)^2 = 0

\frac{\partial}{\partial \hat{\beta}_j} \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \right)^2 = 0 \quad \forall j \in \{1, 2, \dots, K\}
Multivariate Least Square Estimator

Usually, the multivariate model is described by matrix notation:

y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i, \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}

With the following least square solution:

\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}, \qquad \widehat{\mathrm{cov}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}
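The matrix solution above can be checked numerically. Below is a minimal sketch with numpy on simulated data (the data, sample size and true coefficients are illustrative assumptions, not from the slides):

```python
# Sketch of beta_hat = (X'X)^{-1} X'y and cov(beta_hat) = sigma2_hat (X'X)^{-1}
# on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant + 2 regressors
beta_true = np.array([1.0, 0.5, -0.3])
u = rng.normal(scale=0.1, size=n)
y = X @ beta_true + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # (X'X)^{-1} X'y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])     # unbiased error variance
cov_beta = sigma2_hat * XtX_inv                   # estimated covariance of beta_hat

print(beta_hat)
```

With little noise and a large sample, `beta_hat` lands close to the true coefficients, which is exactly what Theorem 1 (unbiasedness) leads us to expect.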
Assumption OLS 1
Linearity
The model is linear in its parameters:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u

• It is possible to apply non-linear transformations to the variables (e.g. the log of x) but not to the parameters, as in the following:

y = \beta_0 + \beta_1 x_1^{\beta_2} + u

OLS cannot estimate this.
Assumption OLS 2
Random Sampling
The n observations are a random sample of
the whole population
 There is no selection bias in the sample. The results
pertain to the whole population
 All observations are independent from one another (no
serial nor cross-sectional correlation)
Assumption OLS 3
No perfect Collinearity
There is no collinearity between independent
variables
 No independent variable is constant. Each variable has
variance which can be used with the variance of the
dependent variable to compute the parameters.
 No exact linear relationships amongst independent variables
Assumption OLS 4
Zero Conditional Mean
The error term u has an expected value of zero:

E(u \mid x_1, x_2, \dots, x_k) = 0

• Given any values of the independent variables (IV), the error term must have an expected value of zero.
• In this case, all independent variables are exogenous. Otherwise, at least one IV suffers from an endogeneity problem.
Sources of endogeneity
• Wrong specification of the model
• Omitted variable correlated with a RHS variable
• Measurement errors in RHS variables
• Mutual causation between LHS and RHS
• Simultaneity
Assumption OLS 5
Homoskedasticity
The variance of the error term, u, conditional on the RHS variables, is the same for all values of the RHS:

\mathrm{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma_u^2

Otherwise we speak of heteroskedasticity.
Assumption OLS 6
Normality of error term
The error term is independent of all RHS variables and follows a normal distribution with zero mean and variance σ²:

u \sim \mathrm{Normal}(0, \sigma^2)
Assumptions OLS
OLS1 Linearity
OLS2 Random Sampling
OLS3 No perfect Collinearity
OLS4 Zero Conditional Mean
OLS5 Homoskedasticity
OLS6 Normality of error term
Theorem 1
OLS1 – OLS4 : Unbiasedness of OLS. Each estimated parameter \hat{\beta}_j is, in expectation, equal to the true unknown value \beta_j:

E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, 2, \dots, k
Theorem 2
OLS1 – OLS5 : Variance of the OLS estimate. The variance of the OLS estimator is

\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma_u^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}

… where R²_j is the R-squared from regressing x_j on all other independent variables. But how can we measure \sigma_u^2?
Theorem 3
OLS1 – OLS5 : The standard error of the regression is defined as

\hat{\sigma}_u^2 = \frac{\sum_i \hat{u}_i^2}{n - (k + 1)} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1}, \qquad E(\hat{\sigma}_u^2) = \sigma_u^2

This is also called the standard error of the estimate or the root mean squared error (RMSE).
Standard Error of Each Parameter
• Combining theorems 2 and 3 yields:

se(\hat{\beta}_j) = \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}}
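This formula coincides with the corresponding diagonal element of \hat{\sigma}^2 (X'X)^{-1}. A minimal numpy sketch on simulated data (sample size, coefficients and the correlation between regressors are illustrative assumptions) verifies the equivalence:

```python
# Check that se(b_j) = sigma_hat / sqrt(SST_j * (1 - R2_j)) equals the j-th
# diagonal route through sigma2_hat * (X'X)^{-1}. Simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)           # correlated regressors
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)

k = 2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k - 1)
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])

# Formula route for j = 1: regress x1 on the other regressors to get R2_j
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.solve(Z.T @ Z, Z.T @ x1)
e = x1 - Z @ g
SST_j = np.sum((x1 - x1.mean()) ** 2)
R2_j = 1 - (e @ e) / SST_j
se_formula = np.sqrt(sigma2_hat / (SST_j * (1 - R2_j)))

print(se_matrix, se_formula)
```

The two routes agree to machine precision; the formula also makes visible how collinearity (a high R²_j) inflates the standard error.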
Theorem 4
Under assumptions OLS1 – OLS5, the estimators \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k are the Best Linear Unbiased Estimators (BLUE) of \beta_0, \beta_1, \dots, \beta_k.

This result is known as the Gauss-Markov Theorem, which stipulates that under OLS1 – OLS5, OLS is the best estimation method:
• The estimates are unbiased (OLS1-4)
• The estimates have the smallest variance (OLS5)
Theorem 5
Under assumptions OLS1 – OLS6, the OLS estimates follow a t distribution:

\frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1}
Extension of theorem 5: Inference
• We can define the confidence interval of β_j, at 95%:

\beta_j \in \left[ \hat{\beta}_j \pm t_{.025} \cdot \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R_j^2)}} \right]

If the 95% CI does not include 0, then β_j is significantly different from 0.
Student t Test for H0: βj=0
• We are also in a position to draw inference on β_j:

H0: β_j = 0
H1: β_j ≠ 0

t = \frac{\hat{\beta}_j - 0}{se(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}

Rule of decision:
• Accept H0 if | t | < t_{α/2}
• Reject H0 if | t | ≥ t_{α/2}
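This decision rule can be replayed on the slides' own numbers. The sketch below uses the lrd coefficient and standard error reported in Application 1; the critical value 1.97 is an approximation of t_{.025} with 428 degrees of freedom:

```python
# Decision rule for H0: beta_j = 0, using the Application 1 estimates for lrd.
beta_hat = 0.6461714
se = 0.0868021
t_stat = beta_hat / se         # t = beta_hat / se(beta_hat)
t_crit = 1.97                  # approx. two-sided 5% critical value, 428 df

reject_h0 = abs(t_stat) >= t_crit
print(round(t_stat, 2), reject_h0)   # → 7.44 True
```

The computed t of 7.44 matches the t column of the Stata output, and H0 is rejected at the 5% level.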
Summary
• OLS1 Linearity, OLS2 Random Sampling, OLS3 No perfect Collinearity, OLS4 Zero Conditional Mean → T1: Unbiasedness
• OLS1 – OLS5 (adding Homoskedasticity) → T2 – T4: BLUE
• OLS1 – OLS6 (adding Normality of error term) → T5: β̂ follows a t distribution
Application 1: seminal model
The knowledge production function:

PAT = f(RD, SIZE)

PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)

pat = \alpha + \beta_1 \cdot rd + \beta_2 \cdot size + u

(lowercase variables denote natural logs)
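Taking logs of the multiplicative (Cobb-Douglas) form yields the linear equation estimated below. A minimal numerical check, with arbitrary illustrative values for A, the elasticities and the inputs:

```python
# Check that log(A * RD^b1 * SIZE^b2 * exp(u)) = log(A) + b1*rd + b2*size + u.
import math

A, b1, b2 = 2.0, 0.6, -0.4
RD, SIZE, u = 30.0, 500.0, 0.1

PAT = A * RD ** b1 * SIZE ** b2 * math.exp(u)
lhs = math.log(PAT)
rhs = math.log(A) + b1 * math.log(RD) + b2 * math.log(SIZE) + u
print(abs(lhs - rhs) < 1e-12)   # → True
```

This is why the estimated coefficients on rd and size read directly as elasticities.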
Application 1: baseline model
. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
Application 2: Changing specification
The knowledge production function:

PAT = f(RD, SIZE)

PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(u)

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + u
Application 2: Changing specification

. reg lpat lrdi lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------
Application 3: Adding variables
The knowledge production function:

PAT = f(RD, SIZE, SPE)

PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \cdot SPE + u)

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + \beta_3 \cdot SPE + u
Application 3: Adding variables

. reg lpat lrdi lassets spe

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  3,   427) =   28.38
       Model |  105.748034     3  35.2493446           Prob > F      =  0.0000
    Residual |  530.263469   427  1.24183482           R-squared     =  0.1663
-------------+------------------------------           Adj R-squared =  0.1604
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1144

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------
Qualitative variables used as independent variables
• Qualitative variables
• Dummy variables
• Generating dummy variables using STATA
• Interpretation of coefficients in OLS
• Interaction effects between continuous and dummy var.
Qualitative variables
• Qualitative variables provide information on discrete characteristics.
• The number of categories taken by qualitative variables is generally small.
• These can be numerical values, but each number denotes an attribute – a characteristic.
• A qualitative variable may have several categories:
  • Two categories: male – female
  • Three categories: nationality (French, German, Turkish)
  • More than three categories: sectors (car, chemical, steel, electronic equip., etc.)
Qualitative variables
• There are several ways to code qualitative variables with n categories:
  • Using one categorical variable
  • Producing n - 1 dummy variables
• A dummy variable is a variable which takes the values 0 or 1.
  • We also call them binary variables
  • We also call them dichotomous variables
Qualitative variables
• Coding using one categorical variable:
  • Two categories: we generate a categorical variable called "gender" set to 1 if the observation is a female, 2 if the observation is a male.
  • Three categories: we generate a categorical variable called "country" set to 1 if the observation is French, 2 if the observation is German, 3 if the observation is Turkish.
  • More than three categories: we generate a categorical variable called "sector" set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equip. industry, etc.
• This requires the use of labels in order to know to which category a given number pertains.
Labelling variables
• Labelling is tedious, boring and uninteresting.
• But there are clear consequences when one must interpret the results.

label variable: describes a variable, qualitative or quantitative
    label variable asset "real capital"
label define: defines a label (the meaning of the numbers)
    label define firm_type 1 "biotech" 0 "Pharma"
label values: applies the label to a given variable
    label values type firm_type
Labelling example
*************************************************************************************
*******              CREATING INDUSTRY LABELS                               *********
*************************************************************************************
egen industrie = group(isic_oecd)
#delimit ;
label define induscode
1 "Text. Habill. & Cuir"
2 "Bois"
3 "Pap. Cart. & Imprim."
4 "Coke Raffin. Nucl."
5 "Chimie"
6 "Caoutc. Plast."
7 "Aut. Prod. min."
8 "Métaux de base"
9 "Travail des métaux"
10 "Mach. & Equip."
11 "Bureau & Inform."
12 "Mach. & Mat. Elec."
13 "Radio TV Telecom."
14 "Instrum. optique"
15 "Automobile"
16 "Aut. transp."
17 "Autres";
#delimit cr
label values industrie induscode
Exercise
1. Open SKEMA_BIO.dta
2. Create variable firm_type from type
3. Label variable firm_type
4. Define a label for firm_type and apply it
Dummy variables
• Coding categorical variables using dummy variables only
• Two categories:
  • We generate one dummy variable "female" set to 1 if the obs. is a female, 0 otherwise.
  • We generate one dummy variable "male" set to 1 if the obs. is a male, 0 otherwise.
  • But one of the dummy variables is simply redundant. When female = 0, then necessarily male = 1 (and vice versa).
• Hence with two categories, we only need one dummy variable.
Dummy variables
• Coding categorical variables using dummy variables only
• Three categories:
  • We generate one dummy variable "France" set to 1 if the obs. is French, 0 otherwise.
  • We generate one dummy variable "Germany" set to 1 if the obs. is German, 0 otherwise.
  • We generate one dummy variable "Turkish" set to 1 if the obs. is Turkish, 0 otherwise.
  • But one of the dummy variables is simply redundant. When France = 0 and Germany = 0, then Turkish = 1.
For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.
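The "n - 1 dummies" rule can be sketched in plain Python. The category names and the five observations below are illustrative assumptions reusing the nationality example from the slides:

```python
# For a categorical variable with n = 3 categories, create n - 1 = 2 dummy
# columns; the omitted category ("Turkish") is the reference.
countries = ["French", "German", "Turkish", "French", "German"]
kept = ["French", "German"]                # Turkish omitted

dummies = {c: [1 if obs == c else 0 for obs in countries] for c in kept}
# The omitted category is fully recovered from the others:
turkish = [1 - f - g for f, g in zip(dummies["French"], dummies["German"])]
print(dummies, turkish)
```

Because the omitted column is an exact linear function of the others (plus the constant), including all n dummies would violate OLS3 (no perfect collinearity).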
Generation of dummies with STATA
• Using the if condition:
    generate DEU = 0
    replace DEU = 1 if country=="GERMANY"
    generate LDF = 1 if size > 100
    replace LDF = 0 if size < 101
• Avoiding the use of the if condition:
    generate FRA = country=="FRANCE"
    generate LDF = size > 100
Generation of dummies with STATA
• With n categories and n being large, generating dummy variables can become really tedious.
• The tabulate command has a very convenient extension, since it will generate n dummy variables at once:
    tabulate varcat, gen(v_)
    tabulate country, gen(c_)
• This will create n dummy variables, with n being the number of countries in the dataset, and c_1 being the first country, c_2 the second, c_3 the third, etc.
Reading coefficients of dummy variables
• Remember! A coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus).
• Suppose the knowledge production function reads

y = \alpha + \beta \cdot biotech + u

with y being the number of patents and "biotech" being a dummy variable set to 1 for biotech firms, 0 otherwise.
Reading coefficients of dummy variables
• If the firm is a biotech company, then the dummy variable "biotech" is equal to unity. Hence:

\hat{y} = \hat{\alpha} + \hat{\beta} \cdot 1 = \hat{\alpha} + \hat{\beta}

• If the firm is a pharma company, then the dummy variable "biotech" is equal to zero. Hence:

\hat{y} = \hat{\alpha} + \hat{\beta} \cdot 0 = \hat{\alpha}
Reading coefficients of dummy variables
• The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1 relative to the situation where the dummy variable is set to 0.
• With two categories, I must introduce one dummy variable.
• With three categories, I must introduce two dummy variables.
• With n categories, I must introduce (n - 1) dummy variables.
Exercise
1. Regress the following model:

PAT = \alpha + \beta \cdot biotech + u

2. Predict the number of patents for both biotech and pharma companies
3. Produce descriptive statistics of PAT for each type of company using the command table
4. What do you observe?
Reading coefficients of dummy variables
• For semi-logarithmic forms (log Y), coefficient β must be read as an approximation of the percent change in Y associated with a variation of 1 unit of the explanatory variable.
• This approximation is acceptable for small β (β < 0.1). When β is large (β ≥ 0.1), the exact percent change in Y is:

100 × (e^β – 1)
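A quick numerical comparison of the approximation against the exact formula, for one small and one large β (the two β values are illustrative):

```python
# 100 * beta (approximation) versus 100 * (exp(beta) - 1) (exact) in a
# log-dependent-variable model.
import math

for beta in (0.05, 0.5):
    approx = 100 * beta
    exact = 100 * (math.exp(beta) - 1)
    print(beta, round(approx, 2), round(exact, 2))
```

For β = 0.05 the exact change is 5.13% against an approximate 5%, a negligible gap; for β = 0.5 it is 64.87% against 50%, which is why the exact formula matters for large coefficients.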
Application 4: dummy variable
The knowledge production function:

PAT = f(RD, SIZE, SPE, BIO)

PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \cdot SPE + \beta_4 \cdot BIO + u)

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + \beta_3 \cdot SPE + \beta_4 \cdot BIO + u
Application 4: dummy variable

. reg lpat lrdi lassets spe biotech

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  4,   426) =   50.24
       Model |  203.874406     4  50.9686015           Prob > F      =  0.0000
    Residual |  432.137097   426  1.01440633           R-squared     =  0.3206
-------------+------------------------------           Adj R-squared =  0.3142
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0072

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------
Application 4: dummy variable

[Figure: ln(PAT) plotted against size. Two parallel lines, both with slope β̂₂.
 Biotech: ln(PAT) = α̂ + β̂₄ + β̂₂ · size (intercept α̂ + β̂₄).
 Pharma: ln(PAT) = α̂ + β̂₂ · size (intercept α̂).
 The vertical distance between the two lines is β̂₄.]
Application 5: Interacting variables
The knowledge production function:

PAT = f(RD, SIZE, SPE, BIO)

PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \cdot SPE + \beta_4 \cdot BIO + \beta_5 \cdot (BIO \times size) + u)

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + \beta_3 \cdot SPE + \beta_4 \cdot BIO + \beta_5 \cdot (BIO \times size) + u
Application 5: Interacting variables

. reg lpat lrdi lassets spe biotech bio_assets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  5,   425) =   41.02
       Model |  207.026736     5  41.4053471           Prob > F      =  0.0000
    Residual |  428.984767   425  1.00937592           R-squared     =  0.3255
-------------+------------------------------           Adj R-squared =  0.3176
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0047

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------
Application 5: Interacting variables

[Figure: ln(PAT) plotted against size. Two lines with different slopes.
 Biotech: ln(PAT) = α̂ + β̂₄ + (β̂₂ + β̂₅) · size, slope β̂₂ + β̂₅.
 Pharma: ln(PAT) = α̂ + β̂₂ · size, slope β̂₂.
 The intercepts differ by β̂₄ and the slopes by β̂₅.]
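The group-specific slopes can be read directly off the Application 5 output. A minimal sketch, using the reported lassets and bio_assets coefficients:

```python
# With an interaction term, the size slope differs by group.
# Coefficients below are those reported in Application 5.
b_size = 0.619805        # lassets (beta_2)
b_interact = -0.1435349  # bio_assets = biotech x size (beta_5)

slope_pharma = b_size                  # BIO = 0
slope_biotech = b_size + b_interact    # BIO = 1
print(round(slope_pharma, 3), round(slope_biotech, 3))   # → 0.62 0.476
```

The size elasticity is thus flatter for biotech firms (0.476) than for pharma firms (0.620), although the interaction is only marginally significant (p = 0.078).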
Specification Tests
Specification Tests for Multiple OLS
The knowledge production function:

PAT = f(RD, SIZE, SPE)

PAT = A \cdot \left( \frac{RD}{SIZE} \right)^{\beta_1} \cdot SIZE^{\beta_2} \cdot \exp(\beta_3 \cdot SPE + u)

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + \beta_3 \cdot SPE + u
Specification Tests for Multiple OLS
Critical probability α such that: Pr(rejecting H0 | H0 is true) = α
• Student t test: concerning the significance of one parameter
• Fisher F test: concerning the significance of several parameters simultaneously (Wald test)
• Non-linear restriction test: testing for a non-linear relationship between parameters
Specification Tests for Multiple OLS
Testing linear combinations of parameters
• Concerning one parameter only
H0 : lassets = 0.30
    test lassets = 0.30
• Tests on several parameters
H0 : lassets = 0.30 and lrdi = 0.70
    test (lassets = 0.3) (lrdi = 0.7)
H0 : lrdi = 2 * lassets
    test lrdi = 2 * lassets
H0 : lrdi + lassets = 1
    test lrdi + lassets = 1
    lincom _b[lrdi] + _b[lassets] - 1
Specification Tests for Multiple OLS
Testing non-linear combinations of parameters
• Test on several parameters
H0 : lrdi * lassets = 0.2
    testnl _b[lrdi] * _b[lassets] = 0.2
    nlcom _b[lrdi] * _b[lassets] - 0.2
Review of Assumptions

OLS assumption                  Consistency when violated   Efficiency when violated                  Test
OLS1 Linearity                  -                           -                                         -
OLS2 Random Sampling            Biased β                    None                                      None. Redo sampling & estimation
OLS3 No perfect Collinearity    -                           -                                         -
OLS4 Zero Conditional Mean      Biased β                    Poorly estimated variance of β            Link test, Omitted Variable test
OLS5 Homoskedasticity           None                        Underestimated variance of β              Breusch-Pagan test
OLS6 Normality of error term    None                        Lack of reliability of the t test for β   Shapiro-Wilk test
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals
• Rule of thumb using graphs. Stata instruction: rvfplot
• White test. Stata instruction: estat imtest
• Breusch-Pagan test. Stata instruction: estat hettest
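The idea behind the Breusch-Pagan test (as estat hettest applies it with fitted values) can be sketched by hand: regress the squared residuals on the fitted values and compute LM = n · R², which is chi-squared with 1 df under homoskedasticity. The simulated, deliberately heteroskedastic data below are an illustrative assumption, not the slides' dataset:

```python
# Manual Breusch-Pagan-style test on simulated heteroskedastic data.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n) * np.exp(0.5 * x)  # variance rises with x

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta
u2 = (y - fitted) ** 2

# Auxiliary regression: squared residuals on the fitted values
Z = np.column_stack([np.ones(n), fitted])
g = np.linalg.solve(Z.T @ Z, Z.T @ u2)
e = u2 - Z @ g
R2 = 1 - (e @ e) / np.sum((u2 - u2.mean()) ** 2)
LM = n * R2                      # compare to the chi2(1) 5% critical value, 3.84
print(round(LM, 2), LM > 3.84)
```

Here LM comfortably exceeds 3.84, so homoskedasticity is rejected, mirroring what estat hettest reports on real data.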
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: rvfplot

[Figure: scatter plot of the residuals (y-axis) against the fitted values (x-axis)]
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: estat imtest

. imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS5 : Homoskedasticity of residuals: estat hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of lpat

         chi2(1)      =     2.83
         Prob > chi2  =   0.0927
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals
• Rule of thumb using graphs. Stata instructions:
    predict res, residual
    kdensity res, normal
• Formally, using the Shapiro-Wilk test. Stata instructions:
    predict res, residual
    swilk res
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals: kdensity

[Figure: kernel density estimate of the residuals plotted against the normal density; kernel = epanechnikov, bandwidth = 0.2971]
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
Hypothesis OLS6 : Normality of residuals

. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z    Prob>z
-------------+-------------------------------------------------
         res |    431    0.98688      3.862     3.226   0.00063
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity)
• Link test. Stata instruction: linktest
  Regresses the DV on the prediction and its squared value.
  Variable _hat must be significant, but not _hatsq.
• Ramsey RESET test. Stata instruction: ovtest
  Regresses the DV on powers (up to the 4th) of the fitted values; with the rhs option, on powers of the RHS variables.
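What linktest does can be replicated by hand: refit the model on ŷ and ŷ², and check whether the squared term is significant. The sketch below uses simulated data with a deliberately omitted quadratic term (sample size, coefficients and the 1.97 critical value are illustrative assumptions):

```python
# Manual linktest: a significant coefficient on yhat^2 signals misspecification.
import numpy as np

rng = np.random.default_rng(3)
n = 600
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + 0.5 * x ** 2 + rng.normal(scale=0.5, size=n)

# First stage: (misspecified) regression of y on x only
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b

# linktest stage: regress y on yhat and yhat^2
H = np.column_stack([np.ones(n), yhat, yhat ** 2])
g = np.linalg.solve(H.T @ H, H.T @ y)
resid = y - H @ g
sigma2 = resid @ resid / (n - 3)
se_hatsq = np.sqrt(sigma2 * np.linalg.inv(H.T @ H)[2, 2])
t_hatsq = g[2] / se_hatsq
print(abs(t_hatsq) > 1.97)
```

Because the true model contains x² while the first stage omits it, the _hatsq term comes out strongly significant, exactly the pattern linktest flags.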
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity): linktest

. quietly: regress lpat lrdi lassets spe
. linktest

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   45.93
       Model |  112.393289     2  56.1966447           Prob > F      =  0.0000
    Residual |  523.618213   428  1.22340704           R-squared     =  0.1767
-------------+------------------------------           Adj R-squared =  0.1729
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1061

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------
Specification Tests for Multiple OLS
Specification tests on the validity of assumptions
There are no omitted variables (OLS4 on endogeneity): ovtest

F = \frac{(R_1^2 - R_0^2)/m}{(1 - R_1^2)/(n - k - 1 - m)} \sim F_{m,\; n-k-1-m}

… where R_0^2 comes from the original regression, R_1^2 from the regression augmented with the m power terms, and k is the number of original regressors.

. quietly: regress lpat lrdi lassets spe
. ovtest

Ramsey RESET test using powers of the fitted values of lpat
       Ho:  model has no omitted variables
                 F(3, 424)  =     2.34
                 Prob > F   =   0.0732
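The RESET F statistic can be computed directly from the two R-squared values. In the sketch below, R2_0 is the R-squared reported in Application 3, while R2_1 is a hypothetical placeholder for the augmented regression (Stata does not print it):

```python
# RESET F statistic from the original and augmented R-squared values.
n, k, m = 431, 3, 3        # observations, original regressors, added power terms
R2_0 = 0.1663              # original regression (Application 3)
R2_1 = 0.1800              # augmented regression (hypothetical value)

F = ((R2_1 - R2_0) / m) / ((1 - R2_1) / (n - k - 1 - m))
df1, df2 = m, n - k - 1 - m
print(round(F, 2), df1, df2)   # → 2.36 3 424
```

Note that the degrees of freedom (3, 424) match the F(3, 424) reported by ovtest above, since 431 − 3 − 1 − 3 = 424.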
Exercise
1. Regress the following model:

pat = \alpha + \beta_1 \cdot \log\!\left( \frac{RD}{SIZE} \right) + \beta_2 \cdot size + \beta_3 \cdot SPE + \beta_4 \cdot BIO + u

2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude:
   1. OLS4 on specification, using linktest and ovtest
   2. OLS5 on homoskedasticity, using imtest and hettest
   3. OLS6 on normality of errors, using kdensity and the swilk test