SKEMA Ph.D. programme 2010-2011
Class 5: Multiple Regression
Lionel Nesta
Observatoire Français des Conjonctures Economiques
[email protected]

Introduction to Regression

Typically, the social scientist deals with multiple and complex webs of interactions between variables. An immediate and appealing extension of simple linear regression is to enlarge the set of explanatory variables. Multiple regression includes several explanatory variables in the empirical model:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i$$

or, in summation notation,

$$y_i = \sum_{k=1}^{K} \beta_k x_{ik} + u_i$$

OLS minimizes the sum of squared errors:

$$\min \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \min \sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \Big)^2$$

The first-order conditions set the partial derivative with respect to each parameter to zero:

$$\frac{\partial}{\partial \hat{\beta}_j} \sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \hat{\beta}_k x_{ik} \Big)^2 = 0, \qquad j = 1, 2, \dots, K$$

Multivariate Least Square Estimator

Usually, the multivariate model is written in matrix notation:

$$y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i \qquad \Longleftrightarrow \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$$

with the following least squares solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \qquad \widehat{\operatorname{cov}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2_u \, (\mathbf{X}'\mathbf{X})^{-1}$$

(A Mata verification sketch of these formulas appears after Theorem 2 below.)

Assumption OLS1: Linearity

The model is linear in its parameters:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

It is possible to apply non-linear transformations to the variables (e.g. the log of x) but not to the parameters. For example, OLS cannot estimate:

$$y = \beta_0 + \beta_1 x_1^{\beta_2} + u$$

Assumption OLS2: Random Sampling

The n observations are a random sample of the whole population. There is no selection bias in the sample: the results pertain to the whole population. All observations are independent from one another (no serial nor cross-sectional correlation).

Assumption OLS3: No Perfect Collinearity

There is no perfect collinearity between independent variables. No independent variable is constant: each variable has a variance which can be used, together with the variance of the dependent variable, to compute the parameters. There are no exact linear relationships amongst the independent variables.

Assumption OLS4: Zero Conditional Mean

The error term u has an expected value of zero:

$$E(u \mid x_1, x_2, \dots, x_k) = 0$$

Given any values of the independent variables (IV), the error term must have an expected value of zero. In this case, all independent variables are exogenous. Otherwise, at least one IV suffers from an endogeneity problem. Sources of endogeneity:
- Wrong specification of the model
- Omitted variable correlated with one RHS variable
- Measurement errors in the RHS
- Mutual causation between LHS and RHS
- Simultaneity

Assumption OLS5: Homoskedasticity

The variance of the error term u, conditional on the RHS, is the same for all values of the RHS:

$$\operatorname{Var}(u \mid x_1, x_2, \dots, x_k) = \sigma^2_u$$

Otherwise we speak of heteroskedasticity.

Assumption OLS6: Normality of the Error Term

The error term is independent of all RHS variables and follows a normal distribution with zero mean and variance σ²:

$$u \sim \text{Normal}(0, \sigma^2)$$

Assumptions OLS

OLS1 Linearity; OLS2 Random Sampling; OLS3 No perfect Collinearity; OLS4 Zero Conditional Mean; OLS5 Homoskedasticity; OLS6 Normality of error term.

Theorem 1

Under OLS1-OLS4: Unbiasedness of OLS. Each estimated parameter $\hat{\beta}_j$ equals, in expectation, the true unknown value $\beta_j$:

$$E(\hat{\beta}_j) = \beta_j, \qquad j = 0, 1, 2, \dots, k$$

Theorem 2

Under OLS1-OLS5: Variance of the OLS estimate. The variance of the OLS estimator is

$$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2_u}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R^2_j)}$$

where $R^2_j$ is the R-squared from regressing $x_j$ on all other independent variables. But how can we measure $\sigma^2_u$?
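Before answering that question, here is a minimal sketch verifying the matrix formulas $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ against regress in Mata, Stata's matrix language. The auto dataset and the price/mpg/weight variables are my own choice for illustration, not part of the course material; the residual-based variance estimate previews Theorem 3:

sysuse auto, clear
regress price mpg weight
mata:
    y = st_data(., "price")
    X = (st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1))  // regressors plus a constant
    b = invsym(X' * X) * X' * y            // beta_hat = (X'X)^-1 X'y
    e = y - X * b                          // residuals
    s2 = e' * e / (rows(X) - cols(X))      // sigma^2_u hat, n - k - 1 degrees of freedom
    b, sqrt(diagonal(s2 * invsym(X' * X))) // coefficients beside their standard errors
end

The two columns printed by the last Mata line should reproduce the Coef. and Std. Err. columns of the regress output above.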
Theorem 3

Under OLS1-OLS5, the standard error of the regression is an unbiased estimate of $\sigma^2_u$:

$$E(\hat{\sigma}^2_u) = \sigma^2_u, \qquad \hat{\sigma}^2_u = \frac{\sum_i \hat{u}_i^2}{n-k-1} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n-k-1}$$

This is also called the standard error of the estimate, or the root mean squared error (RMSE).

Standard Error of Each Parameter

Combining Theorems 2 and 3 yields:

$$se(\hat{\beta}_j) = \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, (1 - R^2_j)}}$$

Theorem 4

Under assumptions OLS1-OLS5, the estimators $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are the Best Linear Unbiased Estimators (BLUE) of $\beta_0, \beta_1, \dots, \beta_k$. This result is the Gauss-Markov Theorem, which stipulates that under OLS1-OLS5, OLS is the best estimation method: the estimates are unbiased (OLS1-OLS4) and have the smallest variance (OLS5).

Theorem 5

Under assumptions OLS1-OLS6, the OLS estimates follow a t distribution:

$$\frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1}$$

Extension of Theorem 5: Inference

We can define the confidence interval of β at 95%:

$$\beta_j \in \Big[ \hat{\beta}_j \pm t_{.025} \, \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 (1 - R^2_j)}} \Big]$$

If the 95% CI does not include 0, then β is significantly different from 0.

Student t Test for H0: βj = 0

We are also in a position to draw inferences on βj:

H0: βj = 0
H1: βj ≠ 0

$$t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$$

Decision rule:
- Accept H0 if |t| < tα/2
- Reject H0 if |t| ≥ tα/2

Summary

- OLS1-OLS4 → Theorem 1: Unbiasedness
- OLS1-OLS5 → Theorems 2-4: BLUE
- OLS1-OLS6 → Theorem 5: β̂ follows a t distribution

Application 1: seminal model

The knowledge production function:

$$PAT = f(RD, SIZE)$$
$$PAT = A \cdot RD^{\beta_1} \cdot SIZE^{\beta_2} \cdot e^{u}$$
$$pat = \alpha + \beta_1 \, rd + \beta_2 \, size + u$$

(lower case denotes logarithms; α = log A)

Application 1: baseline model

. reg lpat lrd lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lrd |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |  -.3712237   .0722135    -5.14   0.000     -.513161   -.2292864
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------

Application 2: Changing specification

The knowledge production function:

$$PAT = f(RD, SIZE)$$
$$PAT = A \cdot \Big(\frac{RD}{SIZE}\Big)^{\beta_1} SIZE^{\beta_2} \cdot e^{u}$$
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + u$$

. reg lpat lrdi lassets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   38.54
       Model |  97.0696447     2  48.5348224           Prob > F      =  0.0000
    Residual |  538.941858   428  1.25920995           R-squared     =  0.1526
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1221

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .6461714   .0868021     7.44   0.000       .47556    .8167828
     lassets |   .2749477   .0337246     8.15   0.000     .2086614    .3412341
       _cons |  -.5909529   .3903255    -1.51   0.131    -1.358146    .1762404
------------------------------------------------------------------------------

Application 3: Adding variables

The knowledge production function:

$$PAT = f(RD, SIZE, SPE)$$
$$PAT = A \cdot \Big(\frac{RD}{SIZE}\Big)^{\beta_1} SIZE^{\beta_2} \cdot e^{\beta_3 SPE + u}$$
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + \beta_3 \, SPE + u$$

. reg lpat lrdi lassets spe

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  3,   427) =   28.38
       Model |  105.748034     3  35.2493446           Prob > F      =  0.0000
    Residual |  530.263469   427  1.24183482           R-squared     =  0.1663
-------------+------------------------------           Adj R-squared =  0.1604
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1144

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |    .670643   .0866968     7.74   0.000     .5002375    .8410485
     lassets |   .2736255   .0334948     8.17   0.000     .2077903    .3394608
         spe |    .423136   .1600635     2.64   0.009     .1085255    .7377464
       _cons |  -.4877403   .3895845    -1.25   0.211    -1.253482    .2780017
------------------------------------------------------------------------------
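A quick check tying Theorem 3 to this output — a hedged sketch assuming the course dataset (with lpat, lrdi, lassets, spe) is in memory; e(rss) and e(df_r) are the residual sum of squares and residual degrees of freedom stored by regress:

. quietly regress lpat lrdi lassets spe
. display "RMSE = " sqrt(e(rss)/e(df_r))   // sqrt(530.263469/427) = 1.1144, the Root MSE above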
Qualitative variables used as independent variables

Outline:
- Qualitative variables
- Dummy variables
- Generating dummy variables using Stata
- Interpretation of coefficients in OLS
- Interaction effects between continuous and dummy variables

Qualitative variables

Qualitative variables provide information on discrete characteristics. The number of categories taken by a qualitative variable is generally small. These can be numerical values, but each number denotes an attribute, a characteristic. A qualitative variable may have several categories:
- Two categories: male, female
- Three categories: nationality (French, German, Turkish)
- More than three categories: sectors (car, chemical, steel, electronic equipment, etc.)

Qualitative variables

There are several ways to code a qualitative variable with n categories:
- using one categorical variable;
- producing n - 1 dummy variables.
A dummy variable is a variable which takes the value 0 or 1. We also call dummies binary or dichotomous variables.

Qualitative variables

Coding using one categorical variable:
- Two categories: we generate a categorical variable called "gender" set to 1 if the observation is a female, 2 if the observation is a male.
- Three categories: we generate a categorical variable called "country" set to 1 if the observation is French, 2 if German, 3 if Turkish.
- More than three categories: we generate a categorical variable called "sector" set to 1 if the observation is in the car industry, 2 for the chemical industry, 3 for the steel industry, 4 for the electronic equipment industry, etc.
This requires the use of labels in order to know which category a given number denotes.

Labelling variables

Labelling is tedious, boring and uninteresting, but there are clear consequences when one must interpret the results.
- label variable: describes a variable, qualitative or quantitative
    label variable asset "real capital"
- label define: defines a label (the meaning of the numbers)
    label define firm_type 1 "biotech" 0 "Pharma"
- label values: applies the label to a given variable
    label values type firm_type

Labelling example

*************************************************
*****     CREATION OF INDUSTRY LABELS       *****
*************************************************
egen industrie = group(isic_oecd)
#delimit ;
label define induscode
     1 "Text. Habill. & Cuir"   2 "Bois"               3 "Pap. Cart. & Imprim."
     4 "Coke Raffin. Nucl."     5 "Chimie"             6 "Caoutc. Plast."
     7 "Aut. Prod. min."        8 "Métaux de base"     9 "Travail des métaux"
    10 "Mach. & Equip."        11 "Bureau & Inform."  12 "Mach. & Mat. Elec."
    13 "Radio TV Telecom."     14 "Instrum. optique"  15 "Automobile"
    16 "Aut. transp."          17 "Autres";
#delimit cr
label values industrie induscode

Exercise

1. Open SKEMA_BIO.dta
2. Create variable firm_type from type
3. Label variable firm_type
4. Define a label for firm_type and apply it
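A possible solution sketch for this exercise, using only the labelling commands introduced above. The 1/0 coding of type is assumed from the firm_type label example; the label name ftype is my own choice:

use SKEMA_BIO.dta, clear
generate firm_type = type                   // copy the original variable
label variable firm_type "type of firm"     // describe it
label define ftype 1 "biotech" 0 "Pharma"   // define the label...
label values firm_type ftype                // ...and apply it
tabulate firm_type                          // check the labelled categories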
Dummy variables

Coding categorical variables using dummy variables only. Two categories: we generate one dummy variable "female" set to 1 if the observation is a female, 0 otherwise, and one dummy variable "male" set to 1 if the observation is a male, 0 otherwise. But one of the dummy variables is simply redundant: when female = 0, then necessarily male = 1 (and vice versa). Hence with two categories, we only need one dummy variable.

Dummy variables

Three categories: we generate one dummy variable "France" set to 1 if the observation is French, 0 otherwise; one dummy variable "Germany" set to 1 if the observation is German, 0 otherwise; and one dummy variable "Turkey" set to 1 if the observation is Turkish, 0 otherwise. But one of the dummy variables is again redundant: when France = 0 and Germany = 0, then Turkey = 1. For a variable with n categories, we must create n - 1 dummy variables, each representing one particular category.

Generation of dummies with STATA

Using the if condition:

generate DEU = 0
replace DEU = 1 if country=="GERMANY"
generate LDF = 1 if size > 100
replace LDF = 0 if size < 101

Avoiding the use of the if condition:

generate FRA = country=="FRANCE"
generate LDF = size > 100

Generation of dummies with STATA

With n categories and n being large, generating dummy variables can become really tedious. The tabulate command has a very convenient extension, since it will generate the n dummy variables at once:

tabulate varcat, gen(v_)
tabulate country, gen(c_)

This creates n dummy variables, n being the number of countries in the dataset, with c_1 being the first country, c_2 the second, c_3 the third, etc.

Reading coefficients of dummy variables

Remember: a coefficient tells us the increase in y associated with a one-unit increase in x, other things held constant (ceteris paribus). Suppose the knowledge production function is

$$y = \alpha + \beta \, biotech + u$$

with y the number of patents and biotech a dummy variable set to 1 for biotech firms, 0 otherwise.

Reading coefficients of dummy variables

If the firm is a biotech company, then the dummy variable biotech is equal to unity. Hence:

$$\hat{y} = \hat{\alpha} + \hat{\beta} \times 1 = \hat{\alpha} + \hat{\beta}$$

If the firm is a pharma company, then the dummy variable biotech is equal to zero. Hence:

$$\hat{y} = \hat{\alpha} + \hat{\beta} \times 0 = \hat{\alpha}$$

Reading coefficients of dummy variables

The coefficient reads as the variation in the dependent variable when the dummy variable is set to 1, relative to the situation where the dummy variable is set to 0. With two categories, I must introduce one dummy variable; with three categories, two dummy variables; with n categories, (n - 1) dummy variables.

Exercise

1. Regress the following model: $PAT = \alpha + \beta \, biotech + u$
2. Predict the number of patents for both biotech and pharma companies
3. Produce descriptive statistics of PAT for each type of company using the command table
4. What do you observe?

Reading coefficients of dummy variables

For semi-logarithmic forms (log Y), the coefficient β must be read as an approximation of the percent change in Y associated with a one-unit variation of the explanatory variable. This approximation is acceptable for small β (β < 0.1). When β is large (β ≥ 0.1), the exact percent change in Y is 100 × (e^β - 1), as the sketch below illustrates.
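A quick numerical illustration of this rule (the coefficient values here are made up for the example):

. display 100*(exp(0.05) - 1)    // 5.13: close to the naive 5% reading
. display 100*(exp(0.70) - 1)    // 101.38: very far from the naive 70% reading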
Application 4: dummy variable

The knowledge production function:

$$PAT = f(RD, SIZE, SPE, BIO)$$
$$PAT = A \cdot \Big(\frac{RD}{SIZE}\Big)^{\beta_1} SIZE^{\beta_2} \cdot e^{\beta_3 SPE + \beta_4 BIO + u}$$
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$

Application 4: dummy variable

. reg lpat lrdi lassets spe biotech

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  4,   426) =   50.24
       Model |  203.874406     4  50.9686015           Prob > F      =  0.0000
    Residual |  432.137097   426  1.01440633           R-squared     =  0.3206
-------------+------------------------------           Adj R-squared =  0.3142
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0072

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4924169   .0804249     6.12   0.000     .3343379     .650496
     lassets |   .5558656   .0417126    13.33   0.000     .4738775    .6378537
         spe |   .4212942   .1446661     2.91   0.004      .136946    .7056423
     biotech |   1.657062   .1684813     9.84   0.000     1.325904     1.98822
       _cons |  -5.464644   .6164752    -8.86   0.000    -6.676356   -4.252932
------------------------------------------------------------------------------

Application 4: dummy variable

[Figure: ln(PAT) plotted against size. Two parallel lines with common slope β̂2: for biotech firms, ln(PAT) = (α̂ + β̂4) + β̂2·size; for pharma firms, ln(PAT) = α̂ + β̂2·size. The dummy shifts the intercept by β̂4.]

Application 5: Interacting variables

The knowledge production function:

$$PAT = f(RD, SIZE, BIO)$$
$$PAT = A \cdot \Big(\frac{RD}{SIZE}\Big)^{\beta_1} SIZE^{\beta_2} \cdot e^{\beta_3 BIO + \beta_4 BIO \times size + u}$$
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + \beta_3 \, BIO + \beta_4 \, BIO \times size + u$$

Application 5: Interacting variables

. reg lpat lrdi lassets spe biotech bio_assets

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  5,   425) =   41.02
       Model |  207.026736     5  41.4053471           Prob > F      =  0.0000
    Residual |  428.984767   425  1.00937592           R-squared     =  0.3255
-------------+------------------------------           Adj R-squared =  0.3176
       Total |  636.011503   430  1.47909652           Root MSE      =  1.0047

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lrdi |   .4742035   .0808846     5.86   0.000     .3152199    .6331871
     lassets |    .619805   .0551395    11.24   0.000     .5114249    .7281852
         spe |   .4131693   .1443802     2.86   0.004     .1293812    .6969573
     biotech |   3.592252   1.107872     3.24   0.001      1.41466    5.769843
  bio_assets |  -.1435349    .081221    -1.77   0.078    -.3031798    .0161099
       _cons |  -6.482948   .8427254    -7.69   0.000    -8.139376   -4.826519
------------------------------------------------------------------------------

Application 5: Interacting variables

[Figure: ln(PAT) plotted against size. For pharma firms, ln(PAT) = α̂ + β̂2·size, with slope β̂2; for biotech firms, ln(PAT) = (α̂ + β̂3) + (β̂2 + β̂4)·size. The interaction changes both the intercept (by β̂3) and the slope (by β̂4).]

Specification Tests

Specification Tests for Multiple OLS

The knowledge production function:

$$PAT = f(RD, SIZE, SPE)$$
$$PAT = A \cdot \Big(\frac{RD}{SIZE}\Big)^{\beta_1} SIZE^{\beta_2} \cdot e^{\beta_3 SPE + u}$$
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + \beta_3 \, SPE + u$$

Specification Tests for Multiple OLS

The critical probability α is such that Pr(Ha | H0) = α.
- Student t test: on the significance of one parameter
- Fisher F test: on the significance of several parameters simultaneously (Wald test)
- Non-linear restriction test: testing for a non-linear relationship between parameters

Specification Tests for Multiple OLS

Testing linear combinations of parameters (all commands are gathered in the sketch below).

Concerning one parameter only:
- H0: β_lassets = 0.30        →  test lassets = 0.30

Tests on several parameters:
- H0: β_lassets = 0.30 and β_lrdi = 0.70    →  test (lassets = 0.3) (lrdi = 0.7)
- H0: β_lrdi = 2 β_lassets                  →  test lrdi = 2*lassets
- H0: β_lrdi + β_lassets = 1                →  test lrdi + lassets = 1
                                               lincom _b[lrdi] + _b[lassets] - 1

Specification Tests for Multiple OLS

Testing non-linear combinations of parameters:
- H0: β_lrdi × β_lassets = 0.2   →  testnl _b[lrdi]*_b[lassets] = 0.2
                                    nlcom _b[lrdi]*_b[lassets] - 0.2
(nlcom takes an expression rather than an equation, so the hypothesised value is subtracted.)
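A hedged sketch putting these restriction tests together after the Application 3 regression; all commands appear on the slides above, and the coefficient names follow the lrdi and lassets regressors:

. quietly regress lpat lrdi lassets spe
. test lassets = 0.30                    // one parameter
. test (lassets = 0.30) (lrdi = 0.70)    // joint test on two parameters
. test lrdi = 2*lassets                  // proportionality restriction
. lincom _b[lrdi] + _b[lassets] - 1      // H0: beta_lrdi + beta_lassets = 1
. testnl _b[lrdi]*_b[lassets] = 0.2      // non-linear restriction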
Review of Assumptions

OLS assumption                 Consistency when violated   Efficiency when violated                  Test
OLS1 Linearity                 -                           -                                         -
OLS2 Random Sampling           Biased β                    None                                      None: redo sampling & estimation
OLS3 No perfect Collinearity   -                           -                                         -
OLS4 Zero Conditional Mean     Biased β                    Poorly estimated variance of β            Link test, omitted variable test
OLS5 Homoskedasticity          None                        Underestimated variance of β              Breusch-Pagan test
OLS6 Normality of error term   None                        Lack of reliability of the t test for β   Shapiro-Wilk test

Specification Tests for Multiple OLS

Hypothesis OLS5: homoskedasticity of residuals
- Rule of thumb using graphs: Stata instruction rvfplot
- White test: Stata instruction estat imtest
- Breusch-Pagan test: Stata instruction estat hettest

Hypothesis OLS5: homoskedasticity of residuals: rvfplot

[Figure: rvfplot — residuals plotted against fitted values]

Hypothesis OLS5: homoskedasticity of residuals: estat imtest

. imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.74      9    0.0097
            Skewness |       3.05      3    0.3840
            Kurtosis |      15.55      1    0.0001
---------------------+-----------------------------
               Total |      40.34     13    0.0001

Hypothesis OLS5: homoskedasticity of residuals: estat hettest

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of lpat

         chi2(1)      =     2.83
         Prob > chi2  =   0.0927

Hypothesis OLS6: normality of residuals
- Rule of thumb using graphs:
    predict res, residual
    kdensity res, normal
- Formally, using the Shapiro-Wilk test:
    predict res, residual
    swilk res

Hypothesis OLS6: normality of residuals: kdensity

[Figure: kernel density estimate of the residuals overlaid with a normal density; kernel = epanechnikov, bandwidth = 0.2971]

Hypothesis OLS6: normality of residuals: swilk

. swilk res

Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z    Prob>z
-------------+------------------------------------------------
         res |    431    0.98688      3.862     3.226   0.00063

Specification Tests for Multiple OLS

There are no omitted variables (OLS4 on endogeneity):
- Link test: Stata instruction linktest. Regresses the DV on the prediction and its squared value. Variable _hat must be significant, but not _hatsq.
- Ramsey RESET test: Stata instruction ovtest. Regresses the DV on powers (up to the fourth) of the fitted values or, with the rhs option, on powers of the RHS variables.

There are no omitted variables (OLS4 on endogeneity): linktest

. quietly: regress lpat lrdi lassets spe
. linktest

      Source |       SS       df       MS              Number of obs =     431
-------------+------------------------------           F(  2,   428) =   45.93
       Model |  112.393289     2  56.1966447           Prob > F      =  0.0000
    Residual |  523.618213   428  1.22340704           R-squared     =  0.1767
-------------+------------------------------           Adj R-squared =  0.1729
       Total |  636.011503   430  1.47909652           Root MSE      =  1.1061

------------------------------------------------------------------------------
        lpat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .2055605   .3574387     0.58   0.566    -.4969932    .9081141
      _hatsq |   .2707472   .1161699     2.33   0.020     .0424126    .4990817
       _cons |   .4943074   .2887035     1.71   0.088    -.0731457    1.061761
------------------------------------------------------------------------------

There are no omitted variables (OLS4 on endogeneity): ovtest

The test compares the restricted model (R²₀) with the model augmented by m powers of the fitted values (R²₁), where k is the number of regressors:

$$F_{m,\; n-m-(k+1)} = \frac{(R_1^2 - R_0^2)/m}{(1 - R_1^2)/\big(n - m - (k+1)\big)}$$

. quietly: regress lpat lrdi lassets spe
. ovtest

Ramsey RESET test using powers of the fitted values of lpat
       Ho: model has no omitted variables
                 F(3, 424) =      2.34
                  Prob > F =      0.0732

Exercise

1. Regress the following model:
$$pat = \alpha + \beta_1 \log\Big(\frac{RD}{SIZE}\Big) + \beta_2 \, size + \beta_3 \, SPE + \beta_4 \, BIO + u$$
2. Assuming OLS1-3 to be correct, test OLS4-6 and conclude:
   1. OLS4 on specification, using linktest and ovtest
   2. OLS5 on homoskedasticity, using imtest and hettest
   3. OLS6 on normality of errors, using kdensity and the swilk test
(A solution sketch follows.)
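A possible solution sketch for this exercise — the variable names (lpat, lrdi, lassets, spe, biotech) follow the earlier applications, and only commands introduced above are used:

regress lpat lrdi lassets spe biotech
linktest                                        // OLS4: _hat should be significant, _hatsq should not
quietly regress lpat lrdi lassets spe biotech   // re-estimate: linktest replaced the stored results
estat ovtest                                    // OLS4: Ramsey RESET, Ho = no omitted variables
estat imtest                                    // OLS5: White / Cameron-Trivedi decomposition
estat hettest                                   // OLS5: Breusch-Pagan / Cook-Weisberg test
predict res, residuals                          // OLS6: recover the residuals...
kdensity res, normal                            // ...compare them to the normal density
swilk res                                       // OLS6: Shapiro-Wilk test

Note the re-estimation before estat ovtest: linktest itself fits an auxiliary regression, so the original model must be put back in memory before the estat commands are issued.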