Review of Probability and Statistics


Quantitative Methods
Mats Wilhelmsson
Lecture 1 and 2
2009-11-04
2009-11-18
Contents

Lectures
  Regression Analysis I
  Regression Analysis II
  Time Series Analysis I
  Time Series Analysis II

Seminars
  Panel data
  Two-stage least squares
  Simultaneous equations models
  Logit/probit



Requirements
Lectures
 Seminars

Write a WP
 Present it at a conference
 May be written in groups of two

Course book

Wooldridge, J.M., Introductory Econometrics: A Modern Approach

Any edition is ok!
Aim



After taking this course you should be able
to use, understand and interpret
quantitative methods, such as regression
analysis, to
examine e.g. the real estate market,
analyse trends,
conduct forecasting.
[Scatter plot: Apartment prices, Stockholm 2008. X-axis: living area (Bostadsyta, kvm), 0–500; Y-axis: price (Pris, kronor), 0–20,000,000.]
Simple regression: price as a function of size

      Source |       SS           df       MS             Number of obs =    8698
-------------+------------------------------              F(  1,  8696) = 6481.23
       Model |  6.9364e+09         1  6.9364e+09          Prob > F      =  0.0000
    Residual |  9.3067e+09      8696   1070225.7          R-squared     =  0.4270
-------------+------------------------------              Adj R-squared =  0.4270
       Total |  1.6243e+10      8697  1867662.26          Root MSE      =  1034.5

------------------------------------------------------------------------------
     pris000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        area |   34.01006   .4224533    80.51   0.000     33.18195    34.83817
       _cons |   197.9821   28.35566     6.98   0.000     142.3983    253.5659
------------------------------------------------------------------------------
Lathund (cheat sheet) – what to identify in the output above:
1. Number of observations
2. Coefficients
3. SST, SSR, SSE
4. R-squared
5. MST, MSR, MSE
6. Adjusted R-squared
7. Root MSE
8. Standard errors
9. t-values
10. Confidence intervals
11. F-value
12. Probability value (p-value)
13. Degrees of freedom
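As an illustration, the same kind of output can be produced in Python with statsmodels; a minimal sketch using simulated data (not the Stockholm apartment sample shown above):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    area = rng.uniform(20, 250, 1000)                       # living area in square meters (made up)
    pris000 = 198 + 34 * area + rng.normal(0, 1035, 1000)   # price in thousands of SEK (made up)

    res = sm.OLS(pris000, sm.add_constant(area)).fit()
    print(res.summary())   # coefficients, standard errors, t-values, R-squared, F statistic, etc.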
Why study Econometrics?

An empirical analysis uses data to test a theory or to estimate a relationship.
It is rare in economics to have experimental data, so we need to use nonexperimental, or observational, data to make inferences.
A formal economic model can be tested.
Theory may be ambiguous as to the effect of some policy change – we can use econometrics to evaluate the program.
Research method

[Flowchart linking: problem, economic theory, hypothesis, empirical analysis, data (a sample from the population), reject/verify.]
Causality





The (usual) aim of testing hypotheses in economics is to establish whether one variable has a causal effect on some other variable.
To only establish that two variables vary together is
often not enough
The term ceteris paribus (everything else equal) is
important when discussing causal effects.
Econometric models may be used to estimate ceteris
paribus effects.
Can be difficult to establish causality
Statistical theory

Population: all properties in the stock
  The population model y = b0 + b1x + u is described by parameters.
Sample: the sold properties
  From the sample we obtain estimates of those parameters.
Statistical inference: using the sample to say something about the population, by
  estimating the parameters
  testing hypotheses
Different kinds of data

Cross-sectional data:

  Obs    Price (SEK)    Living area
  1      600 000        80
  2      750 000        95
  3      675 000        75
  4      825 000        84
  ...    ...            ...
  200    925 000        96

Time series data:

  Obs    Year    Index    GDP
  1      1981    101      900
  2      1982    105      1050
  3      1983    110      1200
  ...    ...     ...      ...
  20     1999    250      8500


Pooled cross sections – many different cross-sections over time
Panel data – the same cross-section over time
Types of Data – Cross-Sectional

Cross-sectional data is a random sample

Each observation is a new individual, firm,
etc. with information at a point in time

If the data is not a random sample, we have a
sample-selection problem
Types of Data – Time Series

Time series data has a separate observation
for each time period – e.g. stock prices

Since not a random sample, different
problems to consider

Trends and seasonality will be important
Descriptive statistics
Descriptive statistics, central tendency

Mean (arithmetic): $\bar{x} = \frac{\sum_i x_i}{n}$
  $\bar{x}$ denotes the sample mean (the sample counterpart of E[X]); for the population mean we use $\mu_x$.

Median:
  Sort x in increasing order; the median leaves an equal number of observations above and below it (if n is an even number, use the mean of the two middle values).
Descriptive statistics, spread

Variance: $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$  (the sample counterpart of $E[(X-\mu)^2]$)
  $s^2$ is the sample variance; $\sigma_x^2$ denotes the population variance (and then we divide by n, not n−1).

Standard deviation: $s = \sqrt{s^2}$
  Correspondingly, $\sigma_x$ denotes the standard deviation in the population. Measured in the same unit as X.

Range: $\max(x) - \min(x)$
Covariance and correlation

These measure how two variables covary.

Covariance: $\mathrm{cov}(X,Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$

Correlation coefficient: $\mathrm{corr}(X,Y) = \rho = \frac{\mathrm{cov}(X,Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)}$, with $0 \le |\rho| \le 1$
Skewness and kurtosis

Skewness: A measure of how far a distribution
is from being symmetric. Compared to normal
distribution, which is zero (0).


Kurtosis: A measure of the thickness of the tails
of a distribution. Compared to normal
distribution, which is three (3).

Skewness > 0 means the distribution is skewed to the right.
Kurtosis > 3 means a more peaked distribution.
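As an illustration, a minimal NumPy/SciPy sketch of the statistics above, computed on the five price/living-area pairs listed in the cross-sectional data example earlier:

    import numpy as np
    from scipy import stats

    price = np.array([600_000, 750_000, 675_000, 825_000, 925_000], dtype=float)
    area = np.array([80, 95, 75, 84, 96], dtype=float)

    print("mean       ", price.mean())
    print("median     ", np.median(price))
    print("variance   ", price.var(ddof=1))          # sample variance: divide by n-1
    print("std. dev.  ", price.std(ddof=1))
    print("range      ", price.max() - price.min())
    print("covariance ", np.cov(price, area, ddof=1)[0, 1])
    print("correlation", np.corrcoef(price, area)[0, 1])
    print("skewness   ", stats.skew(price))
    print("kurtosis   ", stats.kurtosis(price, fisher=False))   # reported so that a normal distribution gives 3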
Desirable properties of an estimator

Unbiased – on average correct
Efficient – low variance
Consistent – a better estimator as the sample size increases
Unbiasedness

"If an estimator is unbiased, then its probability distribution has an expected value equal to the parameter it is supposed to be estimating."
Hence, we are not correct every time, but on average we are correct.
Efficiency

If we have two unbiased estimators of a parameter such as $\mu$ – for example the sample average and the median – the one with the lowest variance is the efficient estimator.
In our case, the sample average is the estimator with the lowest variance.
Consistency

$\Pr(|\bar{x} - \mu| > \varepsilon) \to 0$ as $n \to \infty$

The law of large numbers: if we are interested in estimating the population average, we can get arbitrarily close to the true value by choosing a sufficiently large sample.
Central limit theorem

Draw a random sample from a normally distributed population.
The sample mean is then a stochastic variable which is itself normally distributed:
  $\bar{X} \sim \mathrm{Normal}(\mu, \sigma^2/n)$
Central Limit Theorem: this holds regardless of the distribution of the population, as long as n is large enough!
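As an illustration, a minimal simulation sketch of the central limit theorem, drawing from a skewed (exponential) population rather than a normal one; the numbers are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 10_000          # sample size and number of repeated samples
    sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    # The Exponential(1) population has mean 1 and variance 1, so the sample
    # mean should be approximately Normal(1, 1/n) even though the population
    # is far from normal.
    print("mean of sample means:    ", sample_means.mean())         # close to 1
    print("variance of sample means:", sample_means.var(ddof=1))    # close to 1/50 = 0.02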
Standard normal distribution

Expected value = 0, variance = 1.
The distribution may be looked up in a table (e.g. 95% of all outcomes lie between –1.96 and 1.96).
Assume $X \sim \mathrm{Normal}(\mu_X, \sigma_X^2)$.
Create a new variable $Z = (X - \mu_X)/\sigma_X$.
Then $Z \sim \mathrm{Normal}(0, 1)$, that is, Z has a standard normal distribution.
[Figure: density of the standard normal distribution over Z from –3 to 3. 95% of the probability mass lies between –1.96 and 1.96, with 2.5% in each tail.]
The Simple
Regression Model
Regression analysis

A statistical (econometric) method to infer how different variables covary.
We'll use a (random) sample to say something about the population.
For example: how does the value of a property depend on its living area?
Some Terminology

In the simple linear regression model,
where y = b0 + b1x + u, we typically refer
to y as the
Dependent Variable, or
 Left-Hand Side Variable, or
 Explained Variable, or
 Regressand

Some Terminology, cont.

In the simple linear regression of y on x,
we typically refer to x as the
Independent Variable, or
 Right-Hand Side Variable, or
 Explanatory Variable, or
 Regressor, or
 Covariate, or
 Control Variables

A First Simple Assumption

The average value of u, the error term, in the
population is 0. That is,

E(u) = 0

This is not a restrictive assumption, since we
can always use b0 to normalize E(u) to 0
Crucial assumption:
Zero Conditional Mean
A crucial assumption about how u and x
are related – causality
 We want it to be the case that knowing
something about x does not give us any
information about u, so that they are
completely unrelated. That is, that


E(u|x) = E(u) = 0
• X is strictly exogenous
• Ceteris paribus effects
For example…

Pricei=a+b*Living areai+ui





Where u is quality.
Then E(quality | 100 m²) = E(quality | 50 m²) if the zero conditional mean assumption is true.
In fact, the average quality must be the
same for all sizes.
If not, b is not a ceteris paribus effect.
This issue must be addressed before
relying on simple regression analysis.
Ordinary Least Squares (OLS)

The basic idea of regression is to estimate the population parameters from a sample.
Let {(xi, yi): i = 1, …, n} denote a random sample of size n from the population.
For each observation in this sample, it will be the case that
  yi = b0 + b1xi + ui
Population regression line, sample data points and the error terms

[Figure: the population regression line E(y|x) = b0 + b1x with sample points (x1, y1), …, (x4, y4); the error terms u1, …, u4 are the vertical deviations of each point from the line.]
And now… some math

A simple regression model: y = b0 + b1x + u
Let the regression line be: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$
The error in the estimate, the residual, is:
  $\hat{u}_i = y_i - \hat{y}_i = y_i - (\hat{b}_0 + \hat{b}_1 x_i)$
Choose $\hat{b}_0$ and $\hat{b}_1$ to minimize the sum of the squared residuals:
  $\min_{\hat{b}_0,\hat{b}_1} \sum_i \hat{u}_i^2 = \min_{\hat{b}_0,\hat{b}_1} \sum_i \big(y_i - (\hat{b}_0 + \hat{b}_1 x_i)\big)^2$
More math…

Minimize as usual:
FOC 1:
  $\frac{\partial \sum_i \hat{u}_i^2}{\partial \hat{b}_0} = -2\sum_i \big(y_i - (\hat{b}_0 + \hat{b}_1 x_i)\big) = 0 \;\Rightarrow\; \sum_i \hat{u}_i = 0$
FOC 2:
  $\frac{\partial \sum_i \hat{u}_i^2}{\partial \hat{b}_1} = -2\sum_i x_i\big(y_i - (\hat{b}_0 + \hat{b}_1 x_i)\big) = 0$
Even more…

From FOC 1 we get (note that $\sum_i x_i = n\bar{x}$):
  $\sum_i y_i = n\hat{b}_0 + \hat{b}_1 \sum_i x_i \;\Rightarrow\; n\bar{y} - n\hat{b}_0 - \hat{b}_1 n\bar{x} = 0$
From this, we get:
  $\hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}$
Insert this in FOC 2 to reach:
  $\sum_i x_i\big(y_i - (\bar{y} - \hat{b}_1\bar{x} + \hat{b}_1 x_i)\big) = \sum_i x_i(y_i - \bar{y}) - \hat{b}_1 \sum_i x_i(x_i - \bar{x}) = 0$
Thus, we have:
  $\hat{b}_1 = \frac{\sum_i x_i(y_i - \bar{y})}{\sum_i x_i(x_i - \bar{x})} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
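As an illustration, a minimal sketch that computes the OLS slope and intercept directly from the formulas above and checks them against statsmodels; the data are made up:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    area = rng.uniform(30, 200, 500)                      # living area (made up)
    price = 200 + 34 * area + rng.normal(0, 1000, 500)    # price in thousands (made up)

    b1_hat = np.sum((area - area.mean()) * (price - price.mean())) / np.sum((area - area.mean()) ** 2)
    b0_hat = price.mean() - b1_hat * area.mean()
    print("by hand:    ", b0_hat, b1_hat)

    res = sm.OLS(price, sm.add_constant(area)).fit()
    print("statsmodels:", res.params)                     # same intercept and slope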
Economic interpretation

b0: gives the predicted value of Y (in units of
Y) if X is equal to 0. This may or may not
have plausible meaning, depending on the
context.

b1: one-unit increase in X will cause a b1-unit
increase in Y (measured in units of Y).
Unbiasedness
The best-known desirable property of an
estimator.
 An unbiased estimator is one that has a
sampling distribution with a mean equal
to the parameter to be estimated.
 A perfect estimator gives a perfect guess
every time; an unbiased estimator gives
perfect result only on average.

Unbiasedness of OLS

1. Assume the population model is linear in parameters: y = b0 + b1x + u
2. Assume we can use a random sample of size n, {(xi, yi): i = 1, 2, …, n}, from the population model. Thus we can write the sample model yi = b0 + b1xi + ui
3. Assume there is variation in x
4. Assume E(u|x) = 0
Unbiasedness



Proof of unbiasedness depends on our 4
assumptions – if any assumption fails, then OLS
is not necessarily unbiased.
The key assumption for regression analysis to be
useful is that the expected value of u given any
value of x is zero.
Remember unbiasedness is a description of the
estimator – in a given sample we may be “near”
or “far” from the true parameter.
Variance of the OLS Estimators

Now we know that the sampling distribution of our estimate is centered around the true parameter.
We also want to think about how spread out this distribution is: its variance, or its square root, the standard deviation.
It is much easier to think about this variance under an additional assumption, so
  assume Var(u|x) = $\sigma^2$ (homoskedasticity).
Homoskedastic Case

[Figure: conditional densities f(y|x) at x1 and x2, centered on the line E(y|x) = b0 + b1x and all with the same spread.]
Heteroskedastic Case

[Figure: conditional densities f(y|x) at x1, x2 and x3, centered on the line E(y|x) = b0 + b1x but with spread that changes with x.]
Variance of OLS (cont)

Var(u|x) = $\sigma^2$
Population: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} u_i^2$
$\sigma$, the square root of the error variance, is called the standard deviation of the error.
Estimating the Error Variance

We don't know what the error variance, $\sigma^2$, is, because we don't observe the errors, $u_i$.
What we observe are the residuals, $\hat{u}_i$.
We can use the residuals to form an estimate of the error variance.
Error Variance Estimate (cont)

Standard error of the estimate of y:
  $\hat{\sigma}^2 = \frac{1}{n-2}\sum_i \hat{u}_i^2 = \frac{1}{n-2}\sum_i (y_i - \hat{y}_i)^2$
  $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$

Standard errors of the estimates b0 and b1:
  $\hat{\sigma}^2_{\hat{b}_1} = \mathrm{var}(\hat{b}_1) = \hat{\sigma}^2 \cdot \frac{1}{n\,\mathrm{Var}(x)}$
  $\hat{\sigma}^2_{\hat{b}_0} = \mathrm{var}(\hat{b}_0) = \frac{\hat{\sigma}^2}{n}\left(1 + \frac{\bar{x}^2}{\mathrm{Var}(x)}\right)$
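As an illustration, a minimal sketch checking the standard-error formulas above against statsmodels on made-up data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(30, 200, 400)
    y = 200 + 34 * x + rng.normal(0, 1000, 400)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    n = len(y)
    sigma2_hat = np.sum(res.resid ** 2) / (n - 2)          # estimated error variance
    var_x = x.var()                                        # 1/n variance, as in the formula
    se_b1 = np.sqrt(sigma2_hat / (n * var_x))
    se_b0 = np.sqrt(sigma2_hat / n * (1 + x.mean() ** 2 / var_x))
    print("by formula: ", se_b0, se_b1)
    print("statsmodels:", res.bse)                         # [se(b0), se(b1)]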
Variance of OLS Summary

The larger the error variance, $\sigma^2$, the larger the variance of the slope estimate.
The larger the variability in the $x_i$, the smaller the variance of the slope estimate.
As a result, a larger sample size should decrease the variance of the slope estimate.
Multiple Regression
Analysis
Similarities with simple
regression analysis





b0 is still the intercept
b1 through bk are slope parameters
u is still the error term
We still need an assumption about zero
conditional mean:
E(u|x1,x2, …,xk) = 0
Still, minimize the sum of the squared error
terms. This will yield k+1 first order
conditions
Interpreting Multiple Regression

$\hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2 + \ldots + \hat{b}_k x_k$

Holding x2, …, xk fixed implies that
  $\Delta\hat{y} = \hat{b}_1 \Delta x_1$
That is, each b has a ceteris paribus interpretation.
Simple vs Multiple Regression Estimates

Compare the simple regression $\tilde{y} = \tilde{b}_0 + \tilde{b}_1 x_1$
with the multiple regression $\hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2$.
Generally $\tilde{b}_1 \neq \hat{b}_1$, unless:
  $\hat{b}_2 = 0$ (i.e. no partial effect of $x_2$), OR
  $x_1$ and $x_2$ are uncorrelated in the sample.
How good is the
model?
Goodness-of-Fit

We can think of each observation as being made up of an explained part and an unexplained part, $y_i = \hat{y}_i + \hat{u}_i$. We then define the following:
  $\sum_i (y_i - \bar{y})^2$ is the total sum of squares (SST)
  $\sum_i (\hat{y}_i - \bar{y})^2$ is the explained sum of squares (SSE)
  $\sum_i \hat{u}_i^2$ is the residual sum of squares (SSR)
Then SST = SSE + SSR.
Goodness-of-Fit (continued)
• How do we think about how well our
sample regression line fits our sample data?
• Can compute the fraction of the total sum
of squares (SST) that is explained by the
model, call this the R-squared of regression
• R2 = SSE/SST = 1 – SSR/SST
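As an illustration, a minimal sketch of the SST = SSE + SSR decomposition and R-squared on made-up data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(30, 200, 300)
    y = 200 + 34 * x + rng.normal(0, 1000, 300)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((res.fittedvalues - y.mean()) ** 2)
    ssr = np.sum(res.resid ** 2)

    print("SST and SSE + SSR:", sst, sse + ssr)     # equal up to rounding
    print("R2 = SSE/SST     :", sse / sst)
    print("R2 (statsmodels) :", res.rsquared)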
More about R-squared

R2 can never decrease when another
independent variable is added to a
regression, and usually will increase

Because R2 will usually increase with
the number of independent variables,
it is not a good way to compare
models
Goodness of fit: Adjusted R²

Remember, R² always increases if the number of variables increases – not a good way to compare different models.
"Adjusted R²" takes the number of variables into consideration:

  $\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - \frac{\hat{\sigma}^2}{SST/(n-1)}$
Adjusted R² (cont)

Adjusted R² is simply
  $\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$
It may be used to compare models with different numbers of variables (but the same y).
It does NOT work if y is different, e.g., one model with y against one with ln(y).
Goodness of Fit
Even though adj-R2 is a good measure of
Goodness of Fit – don’t overdo it
 If economic theory states that a variable
should be in the model, let it be in the
model
 Avoid including variables that complicate
the interpretation of the model

Assumptions and
Precision
Unbiasedness for multiple
regressions

Four assumptions for unbiasedness
1. Population model is linear in parameters:
y = b0 + b1x1 + b2x2 +…+ bkxk + u
2. A random sample
3. Zero conditional mean, E(u|x1, x2,… xk) = 0
4. None of the x’s is constant, and there are no exact
linear relationships among them
Homoskedasticity and Gauss-Markov

Homoskedasticity: let x stand for (x1, x2, …, xk) and assume that Var(u|x) = $\sigma^2$.
The 4 assumptions for unbiasedness, plus this homoskedasticity assumption, are known as the Gauss-Markov assumptions.
Estimating the Error Variance

We don't know what the error variance, $\sigma^2$, is, because we don't observe the errors, $u_i$.
What we observe are the residuals, $\hat{u}_i$.
We can use the residuals to form an estimate of the error variance.
Variance of OLS (cont)

  $\hat{\sigma}^2 = \frac{\sum_i \hat{u}_i^2}{n-k-1} = \frac{SSR}{df}$

df = n – (k + 1), or df = n – k – 1
df (i.e. degrees of freedom) is the (number of observations) – (number of estimated parameters)

  $\mathrm{Var}(\hat{b}_j) = \frac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2\,(1 - R_j^2)}$,

where $R_j^2$ is the $R^2$ from regressing $x_j$ on all the other x's.
Components of OLS Variances

The error variance: a larger $\sigma^2$ implies a larger variance for the OLS estimators.
The total sample variation: a larger var(xj) implies a smaller variance for the estimators.
Linear relationships among the independent variables: a larger $R_j^2$ implies a larger variance for the estimators.
Rj²

$R_j^2$ is the proportion of the total variation in $x_j$ that can be explained by all the other independent variables.
If $R_j^2$ is equal to 0: $\mathrm{var}(\hat{b}_j)$ is going to be the smallest (given var(xj) and $\sigma^2$).
If $R_j^2$ goes to 1: $\mathrm{var}(\hat{b}_j)$ grows without bound – multicollinearity.
One additional
assumption
The Gauss-Markov Theorem

Given our 5 Gauss-Markov Assumptions it
can be shown that OLS is “BLUE”





Best
Linear
Unbiased
Estimator
Thus, if the assumptions hold, use OLS
Assumptions of the Classical Linear Model

So far, we know that given the Gauss-Markov assumptions, OLS is BLUE.
In order to do classical hypothesis testing, we need to add another assumption:
Assume that u is independent of x1, x2, …, xk and that u is normally distributed with zero mean and variance $\sigma^2$: u ~ Normal(0, $\sigma^2$).
If all this is fulfilled we have a "CLM".
[Figure: normal conditional densities f(y|x) at x1 and x2, centered on the line E(y|x) = b0 + b1x – the zero conditional mean assumption.]
Too Many or Too Few Variables

What happens if we include variables in our specification that don't belong?
  There is no effect on our parameter estimates, and OLS remains unbiased.
What if we exclude a variable from our specification that does belong?
  OLS will usually be biased, as the zero conditional mean assumption is violated – "omitted variable bias".
Omitted Variable Bias

Two cases where the bias is equal to zero:
  b2 = 0, that is, x2 doesn't really belong in the model
  x1 and x2 are uncorrelated in the sample
If the correlation between x2 and x1 and the correlation between x2 and y go in the same direction, the bias will be positive.
E.g., y = price, x1 = living area, x2 = lot size:
  without x2, the positive effect from a large lot size will appear in b1 instead.
Summary of Direction of Bias of b1

              Corr(x1, x2) > 0    Corr(x1, x2) < 0
  b2 > 0      positive bias       negative bias
  b2 < 0      negative bias       positive bias
Assumptions of the Classical Linear Model

So far, we know that given the Gauss-Markov assumptions, OLS is BLUE.
In order to do classical hypothesis testing, we need to add another assumption:
Assume that u is independent of x1, x2, …, xk and that u is normally distributed with zero mean and variance $\sigma^2$:
  u ~ Normal(0, $\sigma^2$)
The homoskedastic normal distribution with a single explanatory variable

[Figure: normal conditional densities f(y|x) at x1 and x2, centered on the line E(y|x) = b0 + b1x.]
Normal Sampling Distributions

Under the CLM assumptions,
  $\hat{b}_j \sim \mathrm{Normal}\big(b_j,\, \mathrm{Var}(\hat{b}_j)\big)$, so that
  $\frac{\hat{b}_j - b_j}{\mathrm{sd}(\hat{b}_j)} \sim \mathrm{Normal}(0, 1)$
The t Test

Under the CLM assumptions,
  $\frac{\hat{b}_j - b_j}{\mathrm{se}(\hat{b}_j)} \sim t_{n-k-1}$
Note that this is a t distribution (vs normal) because we have to estimate $\sigma^2$ by $\hat{\sigma}^2$.
Note the degrees of freedom: n – k – 1.
The t Test (cont)
Knowing the sampling distribution for the
standardized estimator allows us to carry
out hypothesis tests
 Start with a null hypothesis
 For example, H0: bj=0
 If accept null, then accept that xj has no
effect on y, controlling for other x’s

The t Test (cont)

To perform our test we first need to form "the" t statistic for $\hat{b}_j$:
  $t_{\hat{b}_j} = \frac{\hat{b}_j}{\mathrm{se}(\hat{b}_j)}$
We will then use our t statistic along with a rejection rule to determine whether to accept the null hypothesis, H0.
t Test: One-Sided Alternatives

Besides our null, H0, we need an alternative hypothesis, H1, and a significance level.
H1 may be one-sided or two-sided:
  H1: bj > 0 and H1: bj < 0 are one-sided
  H1: bj ≠ 0 is a two-sided alternative
If we want to have only a 5% probability of rejecting H0 if it is really true, then we say our significance level is 5%.
One-Sided Alternatives (cont)

Having picked a significance level, α, we look up the (1 – α)th percentile in a t distribution with n – k – 1 df and call this c, the critical value.
We can reject the null hypothesis if the t statistic is greater than the critical value.
If the t statistic is less than the critical value, we fail to reject the null.
One-Sided Alternatives (cont)

yi = b0 + b1xi1 + … + bkxik + ui,   H0: bj = 0,   H1: bj > 0

[Figure: t distribution with critical value c. The area 1 – α to the left of c is the fail-to-reject region; the area α to the right of c is the rejection region.]
One-sided vs Two-sided

Because the t distribution is symmetric, testing H1: bj < 0 is straightforward. The critical value is just the negative of before.
We can reject the null if the t statistic < –c, and if the t statistic > –c then we fail to reject the null.
For a two-sided test, we set the critical value based on α/2 and reject H0 against H1: bj ≠ 0 if the absolute value of the t statistic > c.
Two-Sided Alternatives

yi = b0 + b1xi1 + … + bkxik + ui,   H0: bj = 0,   H1: bj ≠ 0

[Figure: t distribution with critical values –c and c. The area 1 – α between them is the fail-to-reject region; the area α/2 in each tail is the rejection region.]
Summary for H0: bj = 0

Unless otherwise stated, the alternative is assumed to be two-sided.
If we reject the null, we typically say "the coefficient is statistically different from zero at the α% level".
If we fail to reject the null, we typically say "the coefficient is statistically insignificant at the α% level".
Testing other hypotheses

A more general form of the t statistic recognizes that we may want to test something like H0: bj = aj.
In this case, the appropriate t statistic is
  $t = \frac{\hat{b}_j - a_j}{\mathrm{se}(\hat{b}_j)}$,
where aj = 0 for the standard test.
Confidence Intervals

Another way to use classical statistical testing is to construct a confidence interval, using the same critical value as was used for a two-sided test.
A (1 – α)·100% confidence interval is defined as
  $\hat{b}_j \pm c \cdot \mathrm{se}(\hat{b}_j)$, where c is the $1 - \frac{\alpha}{2}$ percentile of a $t_{n-k-1}$ distribution.
Computing p-values for t tests



An alternative to the classical approach is to ask,
“what is the smallest significance level at which
the null would be rejected?”
So, compute the t statistic, and then look up
what percentile it is in the appropriate t
distribution – this is the p-value
The p-value is the probability of observing a t statistic as extreme as the one we did, if the null were true.
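As an illustration, a minimal sketch of the t statistic, the two-sided p-value and a 95% confidence interval for a slope, by hand and via statsmodels, on made-up data:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.uniform(30, 200, 200)
    y = 200 + 34 * x + rng.normal(0, 1000, 200)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    b1, se1, df = res.params[1], res.bse[1], int(res.df_resid)

    t_stat = b1 / se1                                 # H0: b1 = 0
    p_val = 2 * stats.t.sf(abs(t_stat), df)           # two-sided p-value
    c = stats.t.ppf(0.975, df)                        # critical value for a 95% CI
    print("t:", t_stat, " p:", p_val)
    print("95% CI:", b1 - c * se1, b1 + c * se1)
    print("statsmodels:", res.tvalues[1], res.pvalues[1], res.conf_int()[1])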
Testing a Linear Combination

Suppose that instead of testing whether b1 is equal to a constant, you want to test whether it is equal to another parameter, that is H0: b1 = b2.
Use the same basic procedure for forming a t statistic:
  $t = \frac{\hat{b}_1 - \hat{b}_2}{\mathrm{se}(\hat{b}_1 - \hat{b}_2)}$
Testing a Linear Combination (cont)

Many packages will have an option to perform the test for you.
In Stata, after reg y x1 x2 … xk you would type test x1 = x2 to get a p-value for the test.
The null hypothesis is that they are equal. If the p-value is low (< 0.05), reject the null hypothesis.
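For comparison, a minimal sketch of the same kind of test in Python's statsmodels (the variable names and data are made up):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
    df["y"] = 1 + 2 * df["x1"] + 2 * df["x2"] + rng.normal(size=300)

    res = smf.ols("y ~ x1 + x2", data=df).fit()
    print(res.f_test("x1 = x2"))    # statistic and p-value for H0: b1 = b2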
Multiple Linear Restrictions
Everything we’ve done so far has
involved testing a single linear restriction,
(e.g. b1 = 0 or b1 = b2 )
 However, we may want to jointly test
multiple hypotheses about our parameters
 A typical example is testing “exclusion
restrictions” – we want to know if a group
of parameters are all equal to zero

Testing Exclusion Restrictions
Now the null hypothesis might be
something like H0: bk-q+1 = 0, ... , bk = 0
 The alternative is just H1: H0 is not true
 Can’t just check each t statistic separately,
because we want to know if the q
parameters are jointly significant at a
given level – it is possible for none to be
individually significant at that level

Exclusion Restrictions (cont)

To do the test we need to estimate the "restricted model" without x_{k-q+1}, …, x_k included, as well as the "unrestricted model" with all the x's included.
Intuitively, we want to know if the change in SSR is big enough to warrant the inclusion of x_{k-q+1}, …, x_k:

  $F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}$,

where r is restricted, ur is unrestricted, and $SSR = \sum_i (y_i - \hat{y}_i)^2$ is the "sum of squared residuals".
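As an illustration, a minimal sketch of the F test for exclusion restrictions, computed from the SSRs of a restricted and an unrestricted model on made-up data in which x2 and x3 truly have no effect:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 300
    x1, x2, x3 = rng.normal(size=(3, n))
    y = 1 + 2 * x1 + rng.normal(size=n)

    unrestricted = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
    restricted = sm.OLS(y, sm.add_constant(x1)).fit()

    q, k = 2, 3                      # 2 restrictions; 3 regressors in the unrestricted model
    F = ((restricted.ssr - unrestricted.ssr) / q) / (unrestricted.ssr / (n - k - 1))
    p = stats.f.sf(F, q, n - k - 1)
    print("F:", F, " p-value:", p)   # should not reject the exclusion of x2 and x3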
The F statistic
The F statistic is always positive, since
the SSR from the restricted model can’t
be less than the SSR from the unrestricted
 Essentially the F statistic is measuring the
relative increase in SSR when moving
from the unrestricted to restricted model
 q = number of restrictions, or dfr – dfur
 n – k – 1 = dfur

The F statistic (cont)
To decide if the increase in SSR when we
move to a restricted model is “big
enough” to reject the exclusions, we need
to know about the sampling distribution
of our F stat
 Not surprisingly, F ~ Fq,n-k-1, where q is
referred to as the numerator degrees of
freedom and n – k – 1 as the denominator
degrees of freedom

The F statistic (cont)

Reject H0 at the α significance level if F > c.

[Figure: density f(F) of the F distribution with critical value c. The area 1 – α to the left of c is the fail-to-reject region; the area α to the right is the rejection region.]
The R² form of the F statistic

Because the SSRs may be large and unwieldy, an alternative form of the formula is useful.
We use the fact that SSR = SST(1 – R²) for any regression, so we can substitute in for SSR_r and SSR_ur:

  $F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n-k-1)}$,

where again r is restricted and ur is unrestricted.
Overall Significance

A special case of exclusion restrictions is to test H0: b1 = b2 = … = bk = 0.
Since the R² from a model with only an intercept will be zero, the F statistic is simply

  $F = \frac{R^2/k}{(1 - R^2)/(n-k-1)}$
General Linear Restrictions

The basic form of the F statistic will work for any set of linear restrictions.
First estimate the unrestricted model and then estimate the restricted model.
In each case, make a note of the R-squared, and compute

  $F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n-k-1)}$,

where again r is restricted and ur is unrestricted.
Functional form

How restrictive is "linear"?
The model is always linear in the parameters, which does not feel that good…
E.g., from theory/common sense, the marginal utility of living area should be diminishing.
That is, the price of single-family houses should not be linear in living area.
So, is the model not applicable?
No – we can work with how we specify the variables.
Some examples (1)

Model 1: price = b0 + b1·(living area) + b2·quality

  $\frac{\partial Y}{\partial X_1} = b_1$

b1 = the effect of one extra square meter of living area (keeping quality constant)
b2 = the effect of higher quality (keeping living area constant)
Some examples (2)

Model 2 – semi-log (log-lin model):
  ln(price) = b0 + b1·(living area) + b2·quality

b1 is the percentage increase in price from one additional square meter of living area (quality held constant)
b2 is the percentage increase in price from one additional point in the quality index (living area held constant)

  $\frac{\partial Y}{\partial X_1} = b_1 Y$

That is, one additional square meter increases the price more if your house has more quality.
Some examples (3)

Model 3 – semi-log (lin-log model):
  price = b0 + b1·ln(living area) + b2·quality

b1 is the absolute change in price from a percentage change in living area (again, quality held constant)

  $\frac{\partial Y}{\partial X_1} = b_1 \frac{1}{X_1}$
Some examples (4)

Model 4 – log-linear (log-log) model:
  ln(price) = b0 + b1·ln(living area) + b2·ln(quality)

b1 is the percentage increase in price from increasing the living area by 1% (quality held constant)
b2 is the percentage increase in price from increasing the quality by 1% (living area held constant)
Log models summary
Model: ln(y) = b0 + b1ln(x) + u
 b1 is the elasticity of y with respect to x
Model: ln(y) = b0 + b1x + u
 b1 is approximately the percentage change in y
given a 1 unit change in x
Model: y = b0 + b1ln(x) + u
 b1 is approximately the change in y for a 100
percent change in x
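As an illustration, a minimal sketch estimating a log-log model on made-up data, where the slope is read directly as an elasticity:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    area = rng.uniform(30, 200, 400)
    price = 5000 * area ** 0.8 * np.exp(rng.normal(0, 0.2, 400))   # true elasticity 0.8 (made up)

    res = sm.OLS(np.log(price), sm.add_constant(np.log(area))).fit()
    print("estimated elasticity of price w.r.t. area:", res.params[1])   # close to 0.8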
Why use log models?

Log models are invariant to the scale of the variables, since they measure percent changes.
They give a direct estimate of the elasticity.
For models with y > 0, the conditional distribution is often heteroskedastic or skewed, while that of ln(y) is much less so.
The distribution of ln(y) is narrower, limiting the effect of outliers.
NOTE: there is a problem with using adjusted R-squared to choose between models if the dependent variable has been transformed.
Some more examples (5)

Model 5 – quadratic models:
  price = b0 + b1·time + b2·time² + b3·quality

May, like model 4, handle diminishing marginal returns (positive b1 and negative but small b2).
For large x we may have that y decreases as x increases.
Some more examples (6)

Model 6 – "inverted" (reciprocal) models:
  price = b0 + b1·(1/LA)

When X is high the ratio is close to zero. That is, the model captures an asymptotic relationship (if study time → ∞ then exam score → b0).
Some more examples (7)

Model 7 – models with interacting variables:
  price = b0 + b1·(living area) + b2·quality + b3·quality·(living area)

Differentiate with respect to living area:
  $\frac{\partial\, \text{price}}{\partial\, LA} = b_1 + b_3 \cdot \text{quality}$

If b3 is positive, then when the living area is increased by one square meter, the price increases more for properties with high quality.
If b3 is negative, we may still have that high-quality properties on average get higher prices; in that case this must be captured by b2.
Dummy variables
Examples (8) – dummy variable

Model 8: price = b0 + b1·(living area) + b2·pool
  pool = 1 for properties with a swimming pool, pool = 0 otherwise

[Figure: price against living area with two parallel lines, pool = 1 above pool = 0; the vertical distance between them is the impact of "pool" on the price.]
More categories and dummy variables

Former slide, just two categories (pool or not pool)


We may think of more categories



Each observation belongs to one of the categories
E.g., low, medium or high standard of a house
We will need 2 dummy variables;
Medium = 1 if medium standard, 0 otherwise and
High = 1 if high standard, 0 otherwise

The comparison will be towards “Low standard”,
which is the base-group (default)
Interaction between dummy variables
Example, one “pool”-dummy and one “sea
view”-dummy
 Also include pool*sea view



Thus, 3 dummy-variables
The base group becomes no pool and no sea
view properties
Interaction between a dummy and
a continuous variable





Let d be a dummy and x continuous
 For instance, d=1 if colonial, and x=price
y = b0 + d0d + b1x + d1d*x + u
If d = 0, we get y = b0 + b1x + u
If d = 1, we get y = (b0 + d0) + (b1+ d1) x + u
Captures difference in slopes between different
groups
Illustration

[Figure: two regression lines – y = b0 + b1x for d = 0 and y = (b0 + d0) + (b1 + d1)x for d = 1 – drawn with d0 > 0 and d1 < 0, so the d = 1 line starts higher but has a flatter slope.]
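As an illustration, a minimal sketch of a regression with a dummy, a continuous variable and their interaction, using statsmodels' formula interface; the variable names and numbers are made up:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    n = 400
    df = pd.DataFrame({"area": rng.uniform(30, 200, n), "pool": rng.integers(0, 2, n)})
    df["price"] = 200 + 30 * df["area"] + 500 * df["pool"] + 5 * df["pool"] * df["area"] + rng.normal(0, 300, n)

    res = smf.ols("price ~ area + pool + pool:area", data=df).fit()
    print(res.params)   # intercept, slope, dummy shift (d0) and slope shift (d1)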
Heteroskedasticity
Homoskedasticity assumption
The assumption of homoskedasticity
states that, given the explanatory
variables, the variance of u is constant.
 If this is not true – i.e., if the variance of u
is different for different values of x – we
have heteroskedasticity

Heteroskedasticity

[Figure: conditional densities f(y|x) at x1, x2 and x3, centered on the line E(y|x) = b0 + b1x but with spread that changes with x.]
So, what’s the problem?
OLS is still unbiased and consistent, even if
we do not assume homoskedasticity
 The standard errors of the estimates are
biased if we have heteroskedasticity
 If the standard errors are biased, we can not
use the usual t statistics or F statistics for
drawing inferences

Testing for Heteroskedasticity

Essentially we want to test
  H0: Var(u|x1, x2, …, xk) = $\sigma^2$,
which is equivalent to
  H0: E(u²|x1, x2, …, xk) = E(u²) = $\sigma^2$
If we assume that the relationship between u² and the xj is linear, we can test this as a linear restriction.
So, for u² = d0 + d1x1 + … + dkxk + v, this means testing H0: d1 = d2 = … = dk = 0.
Breusch-Pagan test




Don’t observe the error, but we can estimate it with
the residuals from the OLS regression
Regress the residuals squared on all of the x’s
Use the R2 to form an F test
 Basically, you do not want to find any relationship
between the u2 and the x’s
The F statistic is just the reported F statistic for overall
significance of the regression
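As an illustration, a minimal sketch of the Breusch-Pagan idea – regress the squared OLS residuals on the x's and test their joint significance – on made-up data where the error spread grows with x, so the test should reject:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(9)
    x = rng.uniform(1, 10, 500)
    y = 1 + 2 * x + rng.normal(0, 0.5 * x)      # error standard deviation grows with x

    X = sm.add_constant(x)
    res = sm.OLS(y, X).fit()

    # By hand: the F test for overall significance in the regression of resid^2 on x
    aux = sm.OLS(res.resid ** 2, X).fit()
    print("auxiliary regression F:", aux.fvalue, " p:", aux.f_pvalue)

    # The packaged version reports the same F statistic
    lm_stat, lm_p, f_stat, f_p = het_breuschpagan(res.resid, X)
    print("Breusch-Pagan F:", f_stat, " p:", f_p)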
OK… so what should we do about this?

Re-specify the model
  A log-linear model may work better.
Weighted Least Squares
  If we know how the variance depends on x, we may specify a model that does not suffer from heteroskedasticity.
Use a "heteroskedasticity-robust" procedure
  This yields "robust standard errors".
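As an illustration, a minimal sketch of heteroskedasticity-robust (White) standard errors in statsmodels, compared with the usual OLS ones, on made-up data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    x = rng.uniform(1, 10, 500)
    y = 1 + 2 * x + rng.normal(0, 0.5 * x)

    X = sm.add_constant(x)
    usual = sm.OLS(y, X).fit()
    robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors

    print("usual SEs :", usual.bse)
    print("robust SEs:", robust.bse)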