Lesson 11:
Regressions Part II
Ka-fu Wong © 2007
ECON1003: Analysis of Economic Data
Lesson11-1
Does watching television rot your mind?
 Zavodny, Madeline (2006): “Does watching television rot your
mind? Estimates of the effect on test scores,” Economics of
Education Review, 25 (5): 565–573
 Television is one of the most omnipresent features of Americans’
lives. The average American adult watches about 15 h of
television per week, accounting for almost one-half of free time.
 The substantial amount of time that most individuals spend
watching television makes it important to examine its effects on
society, including human capital accumulation and academic
achievement.
Data & Regression model
 This analysis uses three data sets to examine the relationship
between television viewing and test scores: the National
Longitudinal Survey of Youth 1979 (NLSY), the HSB survey and the
NELS. Each survey includes test scores and a question about the
number of hours of television watched by young adults.
Test score of
individual i at time t
Summary of samples from data sets
Regression results
**p<0.01; *p<0.05; †p<0.1
Multiple Linear Regression Model
 Relationship between variables is a linear function
Y = b0 + b1X1 + b2X2 + b3X3 + … + bkXk + e
Y: dependent (response) variable
X1, …, Xk: independent (explanatory) variables
b0: Y intercept
b1, …, bk: slopes
e: random error
Finance Application: multifactor pricing model
 It is assumed that rate of return on a stock (R) is linearly related to the rate of return on some factor and the rate of return on the overall market (Rm).
Rit = b0 + boi Rot + b1 Rmt + e
Rit: rate of return on a particular oil company stock i at time t
Rot: rate of return on the crude oil price at time t
Rmt: rate of return on some major stock index at time t
Estimation by Method of moments
Number of moment conditions needed
Y = b0 + b1X1 + b2X2 + b3X3 + … + bkXk + e
k+1 parameters to estimate. Need k+1 moment conditions.
 Assumption #1
 E(e) = 0 implies E(y) – b0 – b1 E(x1) – b2 E(x2) – … – bk E(xk) = 0
 Assumption #2
 E(ex1) = 0 implies E[(y – b0 – b1x1 – … – bkxk)x1] = 0
 Since Cov(e, x1) = E(ex1) – E(e)E(x1) = E(ex1), the assumption really implies that e and x1 are uncorrelated.
 Assumption #3: E(ex2) = 0
 Assumption #4: E(ex3) = 0
 …
 Assumption #k+1: E(exk) = 0
Estimation of b0, b1, b2,…, bk
Method of moments
 Two approaches:
1. Solve for b0, b1, b2,…, bk from the k+1 moment conditions, in terms of covariances, variances and means. Plug in the sample analogs of these covariances, variances and means to produce the sample estimates of b0, b1, b2,…, bk.
2. Treat b0, b1, b2,…, bk as unknowns and solve for them directly from the sample analogs of the k+1 moment conditions.
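The second approach can be sketched in a few lines. This is only an illustration, not the lecture's own code: it assumes NumPy, borrows the ad-response data from the parameter estimation example in this lesson, and the function name `method_of_moments` is hypothetical.

```python
import numpy as np

def method_of_moments(y, X):
    """Solve the k+1 sample moment conditions for b0, b1, ..., bk.

    Sample analogs of E(e) = 0 and E(e*xj) = 0 can be stacked as
        Z'(y - Z b) = 0,  with Z = [1, x1, ..., xk],
    which rearranges to the linear system (Z'Z) b = Z'y.
    """
    Z = np.column_stack([np.ones(len(y)), X])  # column of ones gives E(e) = 0
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

# Ad responses (00), ad size (sq. in.) and circulation (000) from this lesson
resp = np.array([1, 4, 1, 3, 2, 4], dtype=float)
size = np.array([1, 8, 3, 5, 6, 10], dtype=float)
circ = np.array([2, 8, 1, 7, 4, 6], dtype=float)

b = method_of_moments(resp, np.column_stack([size, circ]))
resid = resp - (b[0] + b[1] * size + b[2] * circ)

print("b0, b1, b2:", np.round(b, 4))
print("sample moments:", resid.mean(), (resid * size).mean(), (resid * circ).mean())
```

The three printed sample moments should be zero up to rounding error, which is exactly what the moment conditions impose.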
Estimation of b0, b1, b2,…, bk
Maximum Likelihood
 Assume the ei are independent and identically distributed, following a normal distribution with zero mean and variance s2. Denote the normal density of e by
 f(e) = f(y-b0-b1x1-b2x2-…-bkxk)
 Choose b0, b1, b2, …, bk to maximize the joint likelihood:
 L(b0, b1, b2, …, bk) = f(e1)*f(e2)*…*f(en)
To estimate b0, b1, …, bk using ML (Computer)
 We do not know b0, b1, b2, …, bk. Nor do we know ei. In fact, our objective is to estimate b0, b1, b2, …, bk.
 The procedure of ML:
1. Assume a combination of values for b0, b1, b2, …, bk. Compute the implied ei = yi-b0-b1x1i-b2x2i-…-bkxki and f(ei)=f(yi-b0-b1x1i-b2x2i-…-bkxki)
2. Compute the joint likelihood conditional on the assumed values of b0, b1, b2, …, bk:
 L(b0, b1, b2, …, bk) = f(e1)*f(e2)*…*f(en)
 Assume many more combinations of b0, b1, b2, …, bk, and repeat the above two steps, using a computer program (such as Excel).
 Choose the b0, b1, b2, …, bk that yield the largest joint likelihood.
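The try-many-combinations procedure can be sketched as a coarse grid search. This is a minimal illustration under stated assumptions: NumPy in place of Excel, a single regressor (ad size) from this lesson's ad-response data, an error standard deviation fixed at 1 (the maximizing b0 and b1 do not depend on it), and the log of the likelihood in place of the product f(e1)*…*f(en), which has the same maximizer.

```python
import numpy as np

# Ad responses (y) and ad size (x) from this lesson's example
x = np.array([1, 8, 3, 5, 6, 10], dtype=float)
y = np.array([1, 4, 1, 3, 2, 4], dtype=float)
sigma = 1.0  # assumed; any fixed value gives the same maximizing b0, b1

# Many assumed combinations of (b0, b1) on a grid with step 0.01
grid_b0 = np.arange(-1.0, 2.0, 0.01)
grid_b1 = np.arange(-1.0, 2.0, 0.01)
B0, B1 = np.meshgrid(grid_b0, grid_b1, indexing="ij")

# Steps 1 and 2 for every combination at once:
# implied errors, then the joint normal log-likelihood
E = y - B0[..., None] - B1[..., None] * x
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - E**2 / (2 * sigma**2), axis=-1)

# Choose the combination with the largest joint likelihood
i, j = np.unravel_index(np.argmax(log_lik), log_lik.shape)
best_b0, best_b1 = grid_b0[i], grid_b1[j]

# Least-squares fit for comparison (np.polyfit returns slope first)
ols_b1, ols_b0 = np.polyfit(x, y, 1)
print("grid ML :", round(best_b0, 2), round(best_b1, 2))
print("OLS     :", round(ols_b0, 4), round(ols_b1, 4))
```

The grid answer is only as fine as the grid step, so it should land near, not exactly on, the least-squares values.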
To estimate b0, b1, …, bk using ML (Calculus)
 The procedure of ML:
1. Assume a combination of values for b0, b1, b2, …, bk. Compute the implied ei = yi-b0-b1x1i-b2x2i-…-bkxki and f(ei)=f(yi-b0-b1x1i-b2x2i-…-bkxki)
2. Compute the joint likelihood conditional on the assumed values of b0, b1, b2, …, bk:
 L(b0, b1, b2, …, bk) = f(e1)*f(e2)*…*f(en)
 Choose b0, b1, b2, …, bk to maximize the likelihood function L(b0, b1, b2, …, bk) using calculus.
 Take the first derivative of L(b0, b1, b2, …, bk) with respect to b0, set it to zero.
 Take the first derivative of L(b0, b1, b2, …, bk) with respect to each bj (j = 1, …, k), set it to zero.
 Solve for b0, b1, b2, …, bk using the k+1 equations.
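Under the normality assumption, the first-order conditions can be written out explicitly. The following is a sketch of the standard derivation. It uses σ² for the error variance (written s2 in the slides) and works with log L, which has the same maximizer as L.

```latex
\begin{aligned}
\log L(b_0,\dots,b_k) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} e_i^2,
\qquad e_i = y_i - b_0 - b_1 x_{1i} - \dots - b_k x_{ki} \\
\frac{\partial \log L}{\partial b_0} &= \frac{1}{\sigma^2}\sum_{i=1}^{n} e_i = 0 \\
\frac{\partial \log L}{\partial b_j} &= \frac{1}{\sigma^2}\sum_{i=1}^{n} e_i x_{ji} = 0, \qquad j = 1,\dots,k
\end{aligned}
```

These k+1 equations are the sample analogs of E(e) = 0 and E(exj) = 0, so with normal errors ML gives the same estimates of b0, b1, …, bk as the method of moments.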
Estimation
Ordinary least squares
 For each value of X, there is a group of Y values, and these Y values are normally distributed.
Yi ~ N(E(Y|X1, X2,…,Xk), si2), i=1,2,…,n
 The means of these normal distributions of Y values all lie on the straight line of regression.
E(Y|X1, X2,…,Xk) = b0 + b1X1 + b2X2 + … + bkXk
 The standard deviations of these normal distributions are equal, i.e., homoskedasticity.
si2 = s2, i=1,2,…,n
Choosing the line that fits best
Ordinary Least Squares (OLS) Principle
 Straight lines can be described generally by
yi = b0 + b1x1i + b2x2i + … + bkxki, i=1,…,n
 Finding the best line, the one with the smallest sum of squared differences, is the same as
$$\min_{b_0, b_1, \dots, b_k} S(b_0, b_1, \dots, b_k) = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}) \right]^2$$
 It can be shown that the minimization yields the same sample moment conditions as discussed earlier in the method of moments.
It can be shown that the estimators are BLUE
Best: smallest variance among linear unbiased estimators
Linear: linear combination of yi
Unbiased: the expected value of each estimate equals the true parameter, e.g., E(estimate of b0) = b0 and E(estimate of b1) = b1
Estimator
Interpretation of Coefficients
yi = b0 + b1x1i + b2x2i + … + bkxki + ei
Prediction: y* = b0 + b1x1 + b2x2 + … + bkxk
1. Slope (bj)
 Estimated Y changes by bj for each 1 unit increase in Xj, holding other variables constant
y* + Δy = b0 + b1x1 + … + bj(xj+1) + … + bkxk
Δy = bj
More generally,
y* + Δy = b0 + b1x1 + … + bj(xj+Δxj) + … + bkxk
Δy = bjΔxj
Δy/Δxj = bj
2. Y-Intercept (b0)
 Estimated value of Y when X1 = X2 = … = Xk = 0
Parameter Estimation Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00).
You’ve collected the following data:
Resp (y)   Size (x1)   Circ (x2)
1          1           2
4          8           8
1          3           1
3          5           7
2          6           4
4          10          6
Parameter Estimation Computer Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    0.0640               0.2599           0.246               0.8214
ADSIZE     1    0.2049               0.0588           3.656               0.0399
CIRC       1    0.2805               0.0686           4.089               0.0264
Slope (b1): the number of responses to the ad is expected to increase by .2049 hundred (20.49 responses) for each 1 sq. in. increase in ad size, holding circulation constant
Slope (b2): the number of responses to the ad is expected to increase by .2805 hundred (28.05 responses) for each 1 unit (1,000) increase in circulation, holding ad size constant
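The parameter estimates and standard errors in the output above can be reproduced from the six observations. The sketch below assumes NumPy and uses the textbook least-squares formulas; the variable names are illustrative and it is not the software that produced the original output.

```python
import numpy as np

# Data from the parameter estimation example
resp = np.array([1, 4, 1, 3, 2, 4], dtype=float)   # ad responses (00)
size = np.array([1, 8, 3, 5, 6, 10], dtype=float)  # ad size (sq. in.)
circ = np.array([2, 8, 1, 7, 4, 6], dtype=float)   # circulation (000)

Z = np.column_stack([np.ones(len(resp)), size, circ])
n, p = Z.shape                        # p = k + 1 estimated parameters

b = np.linalg.solve(Z.T @ Z, Z.T @ resp)   # least-squares estimates
resid = resp - Z @ b
s2 = resid @ resid / (n - p)               # SSE / (n - k - 1)
se_b = np.sqrt(np.diag(s2 * np.linalg.inv(Z.T @ Z)))  # standard errors
t_stat = b / se_b                          # t statistics for H0: parameter = 0

for name, est, se, t in zip(["INTERCEP", "ADSIZE", "CIRC"], b, se_b, t_stat):
    print(f"{name:9s} estimate={est:.4f}  std.error={se:.4f}  t={t:.3f}")
```

Each t statistic is simply the estimate divided by its standard error, with n - k - 1 = 3 degrees of freedom here.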
Interpreting the Standard Error of the Estimate
Assumptions:
 Observed Y values are normally distributed around each estimated value of Y*
 Constant variance
se measures the dispersion of the points around the regression line
 If se = 0, equation is a “perfect” estimator
se may be used to compute confidence intervals of the estimated value
Test of Slope Coefficient (bj)
1. Tests if there is a linear relationship between Xj & Y after other variables are controlled for.
2. Involves population slope bj
3. Hypotheses
 H0: bj = 0 (Xj should not appear in the linear relationship)
 H1: bj ≠ 0
4. Theoretical basis is sampling distribution of slopes
Basis for Inference About the Population Regression Slope
Let βj be a population regression slope and bj its least squares estimate based on n data points. Then, if the standard regression assumptions hold and it can also be assumed that the errors ei are normally distributed, the random variable
t = (bj – βj) / Sbj
is distributed as Student’s t with (n – k – 1) degrees of freedom. In addition, the central limit theorem enables us to conclude that this result is approximately valid for a wide range of non-normal distributions and large sample sizes, n.
Confidence Intervals for the Population Regression Slope βj
If the regression errors ei are normally distributed and the standard regression assumptions hold, a 100(1 – α)% confidence interval for the population regression slope βj is given by
bj – t(n-k-1),α/2 Sbj < βj < bj + t(n-k-1),α/2 Sbj
where bj is the least squares estimate of βj and Sbj is its standard error.
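As a worked instance of the interval, the sketch below applies it to the ADSIZE slope from the computer output earlier in this lesson (estimate 0.2049, standard error 0.0588, n = 6, k = 2). The critical value 3.182 is an assumption taken from a standard t table for 3 degrees of freedom and is not given in the slides; the function name is hypothetical.

```python
def slope_confidence_interval(b_j, s_bj, t_crit):
    """Return (lower, upper) for b_j -/+ t_crit * s_bj."""
    half_width = t_crit * s_bj
    return b_j - half_width, b_j + half_width

# n - k - 1 = 6 - 2 - 1 = 3 degrees of freedom, alpha = 0.05
T_CRIT_3DF_95 = 3.182  # t table value for 3 df and alpha/2 = 0.025 (assumed)

lower, upper = slope_confidence_interval(0.2049, 0.0588, T_CRIT_3DF_95)
print(f"{lower:.4f} < slope < {upper:.4f}")  # prints 0.0178 < slope < 0.3920
```

The interval excludes zero, which agrees with the reported p-value of 0.0399 being below 0.05.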
Some cautions about the interpretation of significance tests
 Rejecting H0: bj = 0 and concluding that the relationship between xj and y is significant does not enable us to conclude that a cause-and-effect relationship is present between xj and y.
 Causation requires:
  Association
  Accurate time sequence
  No other explanation for the correlation
Correlation ≠ Causation
Some cautions about the interpretation of significance tests
 Just because we are able to reject H0: bj = 0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is linear.
 Linear relationships are a very small subset of the possible relationships among variables.
 A test of linear versus nonlinear relationship requires further analysis.
Evaluating the Model
yi = b0 + b1x1i + b2x2i + … + bkxki + ei

Are the assumptions valid?
 Assumption #1: Linearity
 Assumption #2: A set of variables should be included.
 Assumption #3: The explanatory variables are uncorrelated
with error term.
 Assumption #4: The error term has a constant variance.
 Assumption #5: The errors are independent of each other.
Measures of Variation in Regression
Total Sum of Squares (SST)
 Measures variation of observed Yi around the mean, Ȳ
Explained Variation (SSR)
 Variation due to relationship between X & Y
Unexplained Variation (SSE)
 Variation due to other factors
SST = SSR + SSE
Variation in y (SST) = SSR + SSE
$$
\begin{aligned}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
 &= \sum_{i=1}^{n} (y_i - y_i^* + y_i^* - \bar{y})^2 \\
 &= \sum_{i=1}^{n} (y_i - y_i^*)^2 + \sum_{i=1}^{n} (y_i^* - \bar{y})^2 + 2\sum_{i=1}^{n} (y_i - y_i^*)(y_i^* - \bar{y}) \\
 &= \underbrace{\sum_{i=1}^{n} (y_i - y_i^*)^2}_{SSE} + \underbrace{\sum_{i=1}^{n} (y_i^* - \bar{y})^2}_{SSR}
\end{aligned}
$$
The cross-product term is 0, as imposed in the estimation by E(ex) = 0.
Variation Measures
[Figure: scatter of Y against X with the fitted line yi* = b0 + b1xi. At a point (Xi, Yi) it marks the total sum of squares (Yi - Ȳ)2 (SST), the explained sum of squares (Yi* - Ȳ)2 (SSR) and the unexplained sum of squares (Yi - Yi*)2 (SSE).]
Variation in y (SST) = SSR + SSE
 R2 (=r2, the coefficient of determination) measures the proportion of the variation in y that is explained by the variation in x.
$$R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - SSE}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{SSR}{SST}$$
 R2 takes on any value between zero and one.
 R2 = 1: Perfect match between the line and the data points.
 R2 = 0: There is no linear relationship between x and y.
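The decomposition SST = SSR + SSE and the resulting R2 can be checked numerically. The sketch below is an illustration only: it assumes NumPy and reuses the ad-response data from the parameter estimation example in this lesson.

```python
import numpy as np

resp = np.array([1, 4, 1, 3, 2, 4], dtype=float)
size = np.array([1, 8, 3, 5, 6, 10], dtype=float)
circ = np.array([2, 8, 1, 7, 4, 6], dtype=float)

Z = np.column_stack([np.ones(len(resp)), size, circ])
b, *_ = np.linalg.lstsq(Z, resp, rcond=None)   # least-squares fit
fitted = Z @ b
y_bar = resp.mean()

sst = np.sum((resp - y_bar) ** 2)              # total variation
ssr = np.sum((fitted - y_bar) ** 2)            # explained variation
sse = np.sum((resp - fitted) ** 2)             # unexplained variation
cross = np.sum((resp - fitted) * (fitted - y_bar))  # cross-product term

r2 = 1 - sse / sst
print(f"SST={sst:.4f}  SSR={ssr:.4f}  SSE={sse:.4f}  cross={cross:.2e}")
print(f"R2 = {r2:.4f} (also SSR/SST = {ssr / sst:.4f})")
```

The cross-product term comes out as zero up to rounding, which is why the two ways of computing R2 agree.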
Adjusted R-square
 (unadjusted) R-square increases with the number of variables included.
 Thus, using R-square as a measure, we will always conclude that a model with more variables is better.
 However, adding new variables is costly. Additional variables may add to the uncertainty of estimating y.
 Thus, we would like to have a measure that will penalize the addition of variables.
$$\bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$
For a fixed R2, adjusted R2 decreases with k.
For a fixed k, adjusted R2 increases with R2.
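The adjustment is a one-line computation. The sketch below uses a hypothetical helper name and checks it against two pairs of values reported elsewhere in this lesson: R Square 0.52148 with n = 15, k = 2, and R-Sq 80.4% with n = 12, k = 3.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values taken from regression outputs shown in this lesson
print(adjusted_r2(0.52148, n=15, k=2))  # output there reports 0.44172
print(adjusted_r2(0.804, n=12, k=3))    # output there reports 73.1%

# For a fixed R2, the adjusted value falls as k grows
print(adjusted_r2(0.80, n=30, k=2), adjusted_r2(0.80, n=30, k=5))
```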
International price discrimination
 Cabolis, Christos, Sofronis Clerides, Ioannis Ioannou and
Daniel Senft (2007): “A textbook example of international
price discrimination,” Economics Letters, 95(1): 91-95.
Motivation
 International price comparisons have a long history in economics.
Macroeconomists have used them extensively to test for
purchasing power parity and the law of one price. International
trade economists have been interested in international price
differences as evidence of trade barriers while industrial
organization economists have studied issues of market structure.
The popular and business press have also shown a keen interest
and frequently report intercity price comparisons for standardized
products such as the Big Mac or a Starbucks cappuccino.
 The paper documents the existence of very large differences in the
prices of textbooks across countries.
Data
 Our data were collected from the Internet sites of Amazon.com in
two distinct phases. In May 2002 we collected information on
prices and characteristics of 268 books that were on sale on both
the US and UK websites of Amazon, Inc. This data set includes
both textbooks and general audience books and we refer to it as
our “broad sample”. In December 2002, we collected additional
data on economics textbooks; this is our “econ sample”. In this
phase, we broadened our sample by including Canada in the
search and collected more detailed information about each book.
 We tested for price differences by running a simple hedonic
regression of price on book characteristics and on dummy variables
that aim to capture differences across countries and book types.
Estimates from the broad sample
dependent variable: ln(p)
Variable          Coefficient Estimate   Standard errors
Intercept         1.045                  0.272
Textbook          0.268                  0.052
US general book   0.126                  0.044
US Textbook       0.306                  0.031
Ln(pages)         0.345                  0.048
Hardcover         0.343                  0.044
N                 536
R2                0.454
F-stat            56.52
Notes: Coefficients that are statistically different from zero at 5% and 1% are marked with “*” and “**” respectively.
Estimates from the Economics sample
dependent variable: ln(p)
Columns: Commercial Hard. | Univ. Press Hard. | Commercial paper | Univ. press paper
US: 0.478** (0.043) | 0.143** (0.045) | 0.008 (0.072) | −0.048 (0.026)
CA: 0.248** (0.049) | 0.132** (0.03) | −0.032 (0.066) | 0.011 (0.036)
US-INTRO: 0.027 (0.045) | 0.310* (0.124)
CA-INTRO: 0.074 (0.062) | 0.231 (0.149)
DELTIME: 0.024** (0.006) | −0.004 (0.011) | 0.007 (0.006) | 0.021* (0.008)
N: 304 | 170 | 109 | 99
R2: 0.303 | 0.152 | 0.223 | 0.413
F-stat: 40.23 | 6.3 | 3.92 | 15.64
Notes: Coefficients that are statistically different from zero at 5% and 1% are marked with “*” and “**” respectively.
Testing for Linearity
Key Argument:
 If the value of y does not change linearly with the value of x, then using the mean value of y is the best predictor for the actual value of y. This implies y = ȳ is preferable.
 If the value of y does change linearly with the value of x, then using the regression model gives a better prediction for the value of y than using the mean of y. This implies y = y* is preferable.
Testing for Linearity
 The Global F-test
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
Under the null SSR is either zero or very small!!
Test Statistic:
$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)} = \frac{\sum_{i=1}^{n} (y_i^* - \bar{y})^2 / k}{\sum_{i=1}^{n} (y_i - y_i^*)^2 / (n-k-1)}$$
F is distributed with k numerator degrees of freedom and n-k-1 denominator degrees of freedom. Reject H0 if F > Fk,n-k-1,α.
[Variation in y] = SSR + SSE. A large F results from a large SSR. Then, much of the variation in y is explained by the regression model. The null hypothesis should be rejected; thus, the model is useful for explaining y.
F-Test for Overall Significance
Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15
ANOVA
             df   SS          MS          F         Significance F
Regression   2    29460.027   14730.013   6.53861   0.01201
Residual     12   27033.306   2252.776
Total        14   56493.333
              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888
F = MSR/MSE = 14730.0/2252.8 = 6.5386, with 2 and 12 degrees of freedom
Significance F (0.01201) is the P-value for the F-Test
F-Test for Overall Significance (continued)
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05
df1 = 2, df2 = 12
Critical Value: F.05 = 3.885
Test Statistic: F = MSR/MSE = 6.5386
[Figure: F distribution with the rejection region of area α = .05 to the right of F.05 = 3.885; do not reject H0 to the left of the critical value, reject H0 to the right.]
Decision: Since the F test statistic is in the rejection region (p-value < .05), reject H0
Conclusion: There is evidence that at least one independent variable affects Y
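The arithmetic of the global F-test can be sketched as follows. The sums of squares, n and k come from the ANOVA table in this example, and the critical value 3.885 is the one quoted there; the function name is hypothetical.

```python
def global_f_statistic(ssr, sse, n, k):
    """F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse

# ANOVA values from the example: 15 observations, 2 explanatory variables
f_stat = global_f_statistic(ssr=29460.027, sse=27033.306, n=15, k=2)
F_CRITICAL = 3.885  # F.05 with 2 and 12 degrees of freedom, as quoted above

reject_h0 = f_stat > F_CRITICAL
print(f"F = {f_stat:.4f}, reject H0: {reject_h0}")
```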
Tests on a Subset of Regression Coefficients
 Consider a multiple regression model involving variables xj and zj, and the null hypothesis that the z variable coefficients are all zero:
yi = b0 + b1 x1i + … + bk xki + α1 z1i + … + αr zri + ei
H0: α1 = α2 = … = αr = 0
H1: at least one of αj ≠ 0 (j=1,…,r)
Under the null SSR due to Z is either zero or very small!!
Tests on a Subset of Regression Coefficients
 Goal: compare the error sum of squares for the complete model with the error sum of squares for the restricted model
 First run a regression for the complete model and obtain SSE
 Next run a restricted regression that excludes the z variables (the number of variables excluded is r) and obtain the restricted error sum of squares SSE(r).
 Compute the F statistic and apply the decision rule for a significance level α
$$\text{Reject } H_0 \text{ if } F = \frac{(SSE(r) - SSE)/r}{SSE/(n-k-r-1)} > F_{r,\,n-k-r-1,\,\alpha}$$
Note: SSE/(n-k-r-1) = Se2, the estimated error variance from the complete model, which has k + r explanatory variables.
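A small numerical sketch of the subset test, using the two regressions of Example 1 in this lesson: the complete model (Food on Income, Size, Student; SSE = 2623764, n = 12) and the restricted model that keeps only Size (SSE = 3110690), so r = 2 variables are excluded. The function and parameter names are hypothetical, and the critical value 4.46 is an assumption taken from a standard F table for 2 and 8 degrees of freedom at the 5% level.

```python
def subset_f_statistic(sse_restricted, sse_complete, n, n_vars_complete, r):
    """F = [(SSE(r) - SSE) / r] / [SSE / (n - n_vars_complete - 1)].

    n_vars_complete counts all explanatory variables in the complete
    model (the x variables plus the r excluded z variables).
    """
    numerator = (sse_restricted - sse_complete) / r
    denominator = sse_complete / (n - n_vars_complete - 1)
    return numerator / denominator

f_stat = subset_f_statistic(
    sse_restricted=3110690, sse_complete=2623764, n=12, n_vars_complete=3, r=2
)
F_CRITICAL_2_8 = 4.46  # assumed table value, F.05 with 2 and 8 df

print(f"F = {f_stat:.4f}, reject H0: {f_stat > F_CRITICAL_2_8}")
```

With these numbers the statistic is well below the critical value, so the data give no evidence that Income and Student add explanatory power once Size is included.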
EXAMPLE 1
 A market researcher for Super Dollar Super Markets is studying
the yearly amount families of four or more spend on food.
Three independent variables are thought to be related to yearly
food expenditures (Food). Those variables are: total family
income (Income) in $00, size of family (Size), and whether the
family has children in college (College).
Example 1
continued
Note the following regarding the regression equation.
 The variable college is called a dummy or indicator variable. It can
take only one of two possible outcomes. That is a child is a college
student or not.
 Other examples of dummy variables include
 gender,
 the part is acceptable or unacceptable,
 the voter will or will not vote for the incumbent governor.
 We usually code one value of the dummy variable as “1” and the
other “0.”
EXAMPLE 1 continued
Family   Food   Income   Size   Student
1        3900   376      4      0
2        5300   515      5      1
3        4300   516      4      0
4        4900   468      5      0
5        6400   538      6      1
6        7300   626      7      1
7        4900   543      5      0
8        5300   437      4      0
9        6100   608      5      1
10       6400   513      6      1
11       7400   493      6      1
12       5800   563      5      0
EXAMPLE 1
continued
 Use a computer software package, such as Excel, to develop a
correlation matrix.
 From the analysis provided by Excel, write out the regression
equation:
Y*= 954 +1.09X1 + 748X2 + 565X3
 What food expenditure would you estimate for a family of 4,
with no college students, and an income of $50,000 (which is
input as 500)?
EXAMPLE 1 continued
The regression equation is
Food = 954 + 1.09 Income + 748 Size + 565 Student
Predictor   Coef    SE Coef   T      P
Constant    954     1581      0.60   0.563
Income      1.092   3.153     0.35   0.738
Size        748.4   303.0     2.47   0.039
Student     564.5   495.1     1.14   0.287
S = 572.7   R-Sq = 80.4%   R-Sq(adj) = 73.1%
Analysis of Variance
Source           DF   SS         MS        F       P
Regression       3    10762903   3587634   10.94   0.003
Residual Error   8    2623764    327970
Total            11   13386667
EXAMPLE 1 continued
From the regression output we note:
 The coefficient of determination is 80.4 percent. This means that more than 80 percent of the variation in the amount spent on food is accounted for by the variables income, family size, and student.
 Each additional $100 of income per year is estimated to increase the amount spent on food by $1.09 per year, holding the other variables constant (income is recorded in hundreds of dollars).
 An additional family member is estimated to increase the amount spent per year on food by $748, holding the other variables constant.
 A family with a college student is estimated to spend $565 more per year on food than one without a college student, holding the other variables constant.
EXAMPLE 1 continued
The estimated food expenditure for a family of 4 with an income of 500 (that is, $50,000) and no college student is $4,491.
Y* = 954 + 1.09(500) + 748(4) + 565(0) = 4491
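The regression and the prediction can be reproduced from the twelve families in the data table. The sketch below assumes NumPy rather than the package that produced the original output, and the variable names are illustrative.

```python
import numpy as np

# Example 1 data: yearly food spending ($), income ($00), family size,
# and a dummy equal to 1 if the family has a child in college
food = np.array([3900, 5300, 4300, 4900, 6400, 7300,
                 4900, 5300, 6100, 6400, 7400, 5800], dtype=float)
income = np.array([376, 515, 516, 468, 538, 626,
                   543, 437, 608, 513, 493, 563], dtype=float)
size = np.array([4, 5, 4, 5, 6, 7, 5, 4, 5, 6, 6, 5], dtype=float)
student = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0], dtype=float)

X = np.column_stack([np.ones(len(food)), income, size, student])
coef, *_ = np.linalg.lstsq(X, food, rcond=None)   # b0, b1, b2, b3

fitted = X @ coef
r2 = 1 - np.sum((food - fitted) ** 2) / np.sum((food - food.mean()) ** 2)

# Family of 4, income of 500 (that is, $50,000), no college student
new_family = np.array([1.0, 500.0, 4.0, 0.0])
prediction = new_family @ coef

print("coefficients:", np.round(coef, 3))
print("R-square:", round(r2, 3))
print("predicted food expenditure:", round(prediction, 1))
```

Because the slide plugs in rounded coefficients, the prediction from the unrounded fit may differ from 4491 by a few dollars.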
EXAMPLE 1 continued
 Conduct a global test of hypothesis to determine if any of the regression coefficients are not zero.
H0: b1 = b2 = b3 = 0 versus H1: Not all bs equal 0
 H0 is rejected if F > 4.07 (the 5% critical value with 3 and 8 degrees of freedom).
 From the computer output, the computed value of F is 10.94.
 Decision: H0 is rejected. Not all the regression coefficients are zero.
EXAMPLE 1 continued
 Conduct an individual test to determine which coefficients are not zero. These are the hypotheses for the independent variable family size.
H0: b2 = 0 versus H1: b2 ≠ 0
 Using the 5% level of significance, reject H0 if the p-value < .05.
 From the computer output, the only significant variable is SIZE (family size), judging by the p-values. The other variables are candidates for omission from the model.
Correlation Matrix
 A correlation matrix is used to show all possible simple correlation coefficients among the variables.
 See which xj are most correlated with y, and which xj are strongly correlated with each other.
      y       x1      x2      …   xk
y     1.00
x1    rx1y    1.00
x2    rx2y    rx1x2   1.00
…
xk    rxky    rx1xk   rx2xk   …   1.00
Multicollinearity
1. High correlation between X variables
2. Multicollinearity makes it difficult to separate the effect of x1 on y from the effect of x2 on y. Leads to unstable coefficients, depending on which X variables are in the model
3. Always exists – a matter of degree
4. Example: using both age & height as explanatory variables in the same model
Detecting Multicollinearity
1. Examine correlation matrix
 Warning sign: correlations between pairs of X variables are stronger than their correlations with the Y variable
2. Few remedies
 Obtain new sample data
 Eliminate one correlated X variable
EXAMPLE 1 continued
 The correlation matrix is as follows:
          Food    Income   Size
Income    0.587
Size      0.876   0.609
Student   0.773   0.491    0.743
 The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food.
 Most of the correlations among the independent variables should not cause problems: they are between –.70 and .70, except for Size and Student (0.743), which is only slightly outside that range.
EXAMPLE 1 continued
 We rerun the analysis using only the significant independent variable, family size.
 The new regression equation is:
Y* = 340 + 1031X2
 The coefficient of determination is 76.8 percent. We dropped two independent variables, and the R-square term was reduced by only 3.6 percentage points.
Example 1 continued
Regression Analysis: Food versus Size
The regression equation is
Food = 340 + 1031 Size
Predictor   Coef     SE Coef   T      P
Constant    339.7    940.7     0.36   0.726
Size        1031.0   179.4     5.75   0.000
S = 557.7   R-Sq = 76.8%   R-Sq(adj) = 74.4%
Analysis of Variance
Source           DF   SS         MS         F       P
Regression       1    10275977   10275977   33.03   0.000
Residual Error   10   3110690    311069
Total            11   13386667
Residual Analysis
Purposes
 Evaluate violations of assumptions, including the assumption of linearity.
Graphical Analysis of Residuals
 Plot residuals versus Xi values
  Residual: difference between actual Yi & predicted Yi*
  Studentized residuals: allow consideration of the magnitude of the residuals
Residual Analysis for Homoscedasticity
 When the requirement of a constant variance (homoscedasticity) is violated, we have heteroscedasticity.
Using Standardized Residuals (e/se)
[Figure: two plots of standardized residuals (SR) against X. Left panel: "OK, Homoscedasticity". Right panel: "Heteroscedasticity".]
For example, for xi > xj, Var(ei|xi) > Var(ej|xj)
Residual Analysis for Independence
Using Standardized Residuals (e/se)
[Figure: two plots of standardized residuals (SR) against X. Left panel: "Not Independent". Right panel: "OK, Independent".]
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure: two plots of residuals against time. Left panel: note the runs of positive residuals, replaced by runs of negative residuals. Right panel: note the oscillating behavior of the residuals around zero.]
The Durbin-Watson Statistic
Used when data is collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
Measures violation of independence assumption
$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
Should be close to 2. If not, examine the model for autocorrelation.
Intuition: If x and y are independent, Var(x-y) = Var(x) + Var(y), so for independent residuals with equal variance the numerator is about twice the denominator.
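The statistic is straightforward to compute from a residual series. The sketch below assumes NumPy and uses two short made-up residual series that mimic the two autocorrelation patterns described earlier (runs of same-signed residuals versus oscillation around zero); the function name is hypothetical.

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series, in time order
runs = [1, 1, 1, -1, -1, -1]         # positive run followed by negative run
oscillating = [1, -1, 1, -1, 1, -1]  # sign flips every period

print(durbin_watson(runs))         # well below 2: positive autocorrelation
print(durbin_watson(oscillating))  # well above 2: negative autocorrelation
```

Values far below 2 go with runs of same-signed residuals, and values far above 2 go with residuals that keep switching sign.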
Outliers
 An outlier is an observation that is unusually small or large.
 Several possibilities need to be investigated when an outlier
is observed:
 There was an error in recording the value.
 The point does not belong in the sample.
 The observation is valid.
 Identify outliers from the scatter diagram.
 It is customary to suspect an observation is an outlier if its
|standard residual| > 2
[Figure: two scatter plots. Left panel, "An outlier": the outlier causes a shift in the regression line. Right panel, "An influential observation": … but, some outliers may be very influential.]
Remedying violations of the required
conditions
 Nonnormality or heteroscedasticity can be remedied using
transformations on the y variable.
 The transformations can improve the linear relationship
between the dependent variable and the independent
variables.
 Many computer software systems allow us to make the
transformations easily.
Nonlinear Regression Models
 The relationship between the dependent variable and an independent variable may not be linear
 Can review the scatter diagram to check for non-linear relationships
 Example: Quadratic model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$$
 The second independent variable is the square of the first variable
Quadratic Regression Model
Model form:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$$
 where:
β0 = Y intercept
β1 = regression coefficient for linear effect of X on Y
β2 = regression coefficient for quadratic effect on Y
εi = random error in Y for observation i
Linear vs. Nonlinear Fit
[Figure: two panels, each showing Y against X with a fitted curve and, below it, the residuals against X. Left panel: linear fit does not give random residuals. Right panel: nonlinear fit gives random residuals.]
Quadratic Regression Model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$$
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
[Figure: four panels of Y against X1, one for each sign combination: (β1 < 0, β2 > 0), (β1 > 0, β2 > 0), (β1 < 0, β2 < 0) and (β1 > 0, β2 < 0).]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Testing for Significance: Quadratic Effect
 Testing the Quadratic Effect
 Compare the linear regression estimate
$$\hat{y} = b_0 + b_1 x_1$$
 with quadratic regression estimate
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_1^2$$
 Hypotheses
H0: β2 = 0 (The quadratic term does not improve the model)
H1: β2 ≠ 0 (The quadratic term improves the model)
Testing for Significance: Quadratic Effect
 Testing the Quadratic Effect
Hypotheses
 H0: β2 = 0 (The quadratic term does not improve the model)
 H1: β2 ≠ 0 (The quadratic term improves the model)
 The test statistic is
$$t = \frac{b_2 - \beta_2}{s_{b_2}}, \qquad \text{d.f.} = n - 3$$
where:
b2 = squared term slope coefficient
β2 = hypothesized slope (zero)
Sb2 = standard error of the slope
Testing for Significance: Quadratic Effect
 Testing the Quadratic Effect
Compare Adjusted R2 from simple regression to
Adjusted R2 from the quadratic model
 If Adjusted R2 from the quadratic model is larger than
Adjusted R2 from the simple model, then the quadratic
model is a better model
Example: Quadratic Model
Purity   Filter Time
3        1
7        2
8        3
15       5
22       7
33       8
40       10
54       12
67       13
70       14
78       15
85       15
87       16
99       17
 Purity increases as filter time increases:
[Figure: scatter plot "Purity vs. Time", with Purity (0 to 100) on the vertical axis and Time (0 to 20) on the horizontal axis.]
Example: Quadratic Model
 Simple regression results: y* = -11.283 + 5.985 Time
            Coefficients   Standard Error   t Stat     P-value
Intercept   -11.28267      3.46805          -3.25332   0.00691
Time        5.98520        0.30966          19.32819   2.078E-10
Regression Statistics
R Square            0.96888
Adjusted R Square   0.96628
Standard Error      6.15997
F                   373.57904
Significance F      2.0778E-10
t statistic, F statistic, and R2 are all high.
But …. the residuals are not random:
[Figure: "Time Residual Plot", residuals (-10 to 10) against Time (0 to 20).]
Example: Quadratic Model
 Quadratic regression results: $\hat{y} = 1.539 + 1.565\,\text{Time} + 0.245\,(\text{Time})^2$
Regression Statistics
R Square             0.99494
Adjusted R Square    0.99402
Standard Error       2.59513
F                    1080.7330
Significance F       2.368E-13
                Coefficients   Standard Error   t Stat     P-value
Intercept       1.53870        2.24465          0.68550    0.50722
Time            1.56496        0.60179          2.60052    0.02467
Time-squared    0.24516        0.03258          7.52406    1.165E-05
The quadratic term is significant and improves the model: R² is higher, the standard error (se) is lower, and the residuals are now random
[Figure: Time Residual Plot (x-axis: Time, y-axis: Residuals)]
[Figure: Time-squared Residual Plot (x-axis: Time-squared, y-axis: Residuals)]
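The two fits can be reproduced from the purity data with ordinary least squares. The sketch below uses plain Python and solves the normal equations directly; the helper names (`solve`, `ols`, `adjusted_r2`) are hypothetical, and a statistics package would normally be used instead.

```python
# Purity / filter-time data from the example.
time = [1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17]
purity = [3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients (intercept first) from X'X b = X'y."""
    rows = [[1.0] + [float(c[i]) for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def adjusted_r2(columns, y, coefs):
    """Adjusted R^2 of the fitted model (k = number of slope coefficients)."""
    n, k = len(y), len(columns)
    fitted = [coefs[0] + sum(b * c[i] for b, c in zip(coefs[1:], columns))
              for i in range(n)]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

time_sq = [t * t for t in time]
simple = ols([time], purity)              # [b0, b1]
quadratic = ols([time, time_sq], purity)  # [b0, b1, b2]
adj_simple = adjusted_r2([time], purity, simple)
adj_quadratic = adjusted_r2([time, time_sq], purity, quadratic)
print(simple, quadratic, adj_simple, adj_quadratic)
```

The printed coefficients should agree with the regression output above up to rounding.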
Some highly nonlinear models may be
transformed into a linear model
The Log Transformation
The Multiplicative Model:
 Original multiplicative model
$$Y = \beta_0 X_1^{\beta_1} X_2^{\beta_2} \varepsilon$$
 Transformed multiplicative model
$$\log(Y) = \log(\beta_0) + \beta_1 \log(X_1) + \beta_2 \log(X_2) + \log(\varepsilon)$$
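To see that the transformed model is linear in the logs, the sketch below generates data from a multiplicative model with the error term fixed at 1 and recovers the parameters by regressing log(Y) on log(X1) and log(X2). The parameter values, the data, and the helper names are all hypothetical, chosen only for illustration.

```python
import math

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients (intercept first) from X'X b = X'y."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical parameters of Y = beta0 * X1^beta1 * X2^beta2 (error term = 1).
beta0, beta1, beta2 = 2.0, 0.5, 1.5
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [beta0 * a ** beta1 * b ** beta2 for a, b in zip(x1, x2)]

# Linear regression on the logged variables.
log_coefs = ols([[math.log(v) for v in x1], [math.log(v) for v in x2]],
                [math.log(v) for v in y])
est_beta0 = math.exp(log_coefs[0])  # the intercept estimates log(beta0)
print(est_beta0, log_coefs[1], log_coefs[2])
```

With real data the error term is not 1, so the estimates would only approximate the underlying parameters.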
Interpretation of coefficients
For the multiplicative model:
$$\log Y_i = \log \beta_0 + \beta_1 \log X_{1i} + \log \varepsilon_i$$
 When both dependent and independent variables are
logged:
 The coefficient of the independent variable X1 can be
interpreted as
A 1 percent change in X1 leads to an estimated b1
percentage change in the average value of Y
 b1 is the elasticity of Y with respect to a change in X1
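Note: $\log Y = b_0 + b_1 \log X$
$b_1 = \Delta \log Y / \Delta \log X \approx \%\Delta Y / \%\Delta X$
$\Delta \log Y = \log Y_2 - \log Y_1 = \log(Y_2/Y_1) = \log(1 + (Y_2 - Y_1)/Y_1) \approx (Y_2 - Y_1)/Y_1$ for small changes (log is the natural logarithm)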
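The elasticity reading can be checked numerically. In this sketch the values of beta0 and beta1 and the X data are hypothetical; the log-log slope is computed with the simple-regression formula and then compared with the percentage change in Y caused by a 1 percent increase in X.

```python
import math

# Hypothetical multiplicative model Y = beta0 * X^beta1 with no error.
beta0, beta1 = 3.0, 0.8
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [beta0 * xi ** beta1 for xi in x]

# Simple regression of log(Y) on log(X): slope = Sxy / Sxx.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
mean_lx = sum(log_x) / len(log_x)
mean_ly = sum(log_y) / len(log_y)
sxy = sum((a - mean_lx) * (b - mean_ly) for a, b in zip(log_x, log_y))
sxx = sum((a - mean_lx) ** 2 for a in log_x)
elasticity = sxy / sxx

# Exact percentage change in Y when X rises by 1 percent.
pct_change_y = (1.01 ** elasticity - 1) * 100
print(elasticity, pct_change_y)
```

The exact percentage change is slightly below the slope, which illustrates that the elasticity interpretation is an approximation for small changes.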
Dummy Variables
 A dummy variable is a categorical independent variable
with two levels:
 yes or no, on or off, male or female
 recorded as 0 or 1
 Regression intercepts are different if the variable is
significant
 Assumes equal slopes for other variables
 If more than two levels, the number of dummy variables
needed is (number of levels - 1)
Dummy variable example
 Interested in: Does average income differ between males and females?
 Compute the average income for females.
 Compute the average income for males.
 Conduct a two-sample test of equal means.
 Alternative approach: regression.
 Y=income
 X1 = 1 if male; 0 if female.
Y= b0 + b1X1 + e
X1 = 0 implies Y = b0 + e
X1 = 1 implies Y = b0 + b1 + e
Test H0: b1=0.
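A small sketch of why the regression approach matches the comparison of group averages: with a single 0/1 regressor, the least-squares intercept equals the mean of the X1 = 0 group and the slope equals the difference in group means. The income figures and the function name are hypothetical.

```python
# Hypothetical incomes (in thousands), for illustration only.
female_income = [30.0, 35.0, 40.0]
male_income = [45.0, 50.0, 55.0, 50.0]

y = female_income + male_income
x1 = [0] * len(female_income) + [1] * len(male_income)  # 1 if male, 0 if female

def simple_ols(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = simple_ols(x1, y)
mean_female = sum(female_income) / len(female_income)
mean_male = sum(male_income) / len(male_income)
print(b0, b1, mean_female, mean_male - mean_female)
```

Testing H0: b1 = 0 in this regression therefore addresses the same question as the two-sample test of equal means.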
Dummy Variable Example
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
Let:
y = Pie Sales
x1 = Price
x2 = Holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)
Dummy Variable Example
Holiday: $\hat{y} = b_0 + b_1 x_1 + b_2(1) = (b_0 + b_2) + b_1 x_1$
No Holiday: $\hat{y} = b_0 + b_1 x_1 + b_2(0) = b_0 + b_1 x_1$
Different intercept, same slope
[Figure: y (sales) against x1 (Price); two parallel lines with intercepts b0 + b2 (Holiday) and b0 (No Holiday)]
If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales
Interpreting the
Dummy Variable Coefficient
Example:
$$\text{Sales} = 300 - 30(\text{Price}) + 15(\text{Holiday})$$
Sales: number of pies sold per week
Price: pie price in $
Holiday:
1 If a holiday occurred during the week
0 If no holiday occurred
b2 = 15: on average, sales were 15 pies greater in weeks with a holiday
than in weeks without a holiday, given the same price
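The interpretation can be confirmed by evaluating the fitted equation at one price with and without a holiday. The function name and the price of $5 are hypothetical choices for illustration.

```python
def predicted_sales(price, holiday):
    """Fitted pie sales per week; holiday is 1 or 0."""
    return 300 - 30 * price + 15 * holiday

price = 5.0  # any fixed price gives the same gap
with_holiday = predicted_sales(price, 1)
without_holiday = predicted_sales(price, 0)
holiday_effect = with_holiday - without_holiday
print(with_holiday, without_holiday, holiday_effect)
```

The gap equals b2 whatever the price, because the dummy shifts only the intercept.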
Interaction Between Explanatory
Variables
 Hypothesizes interaction between pairs of x variables
 Response to one x variable may vary at different levels of
another x variable
 Contains two-way cross product terms
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 = b_0 + b_1 x_1 + b_2 x_2 + b_3 (x_1 x_2)$$
Effect of Interaction
 Given:
$$Y = \beta_0 + \beta_2 X_2 + (\beta_1 + \beta_3 X_2) X_1 = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$
 Without interaction term, effect of X1 on Y is measured by β1
 With interaction term, effect of X1 on Y is measured by β1 + β3 X2, which changes as X2 changes
Interaction Example
Suppose x2 is a dummy variable and the estimated regression equation is $\hat{y} = 1 + 2x_1 + 3x_2 + 4x_1 x_2$
x2 = 1: $\hat{y} = 1 + 2x_1 + 3(1) + 4x_1(1) = 4 + 6x_1$
x2 = 0: $\hat{y} = 1 + 2x_1 + 3(0) + 4x_1(0) = 1 + 2x_1$
[Figure: y against x1 for x1 from 0 to 1.5; the x2 = 1 line is steeper than the x2 = 0 line]
Slopes are different if the effect of x1 on y depends on x2 value
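The two lines in this example can be evaluated directly. In the sketch below, the function names are hypothetical; the slope with respect to x1 is 2 + 4·x2, following the general form β1 + β3X2.

```python
def predict(x1, x2):
    """Fitted value from y-hat = 1 + 2*x1 + 3*x2 + 4*x1*x2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

def slope_wrt_x1(x2):
    """Change in y-hat per unit change in x1, at a given x2."""
    return 2 + 4 * x2

for x2 in (0, 1):
    # Rise in y-hat when x1 moves from 0.5 to 1.5 (one unit).
    rise = predict(1.5, x2) - predict(0.5, x2)
    print(f"x2 = {x2}: intercept {predict(0, x2)}, slope {slope_wrt_x1(x2)}, rise {rise}")
```

The one-unit rise matches the slope in each group, which shows both the intercept and the slope shifting when x2 switches from 0 to 1.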
Significance of Interaction Term
 The coefficient b3 is an estimate of the difference in the coefficient
of x1 when x2 = 1 compared to when x2 = 0
 The t statistic for b3 can be used to test the hypothesis
$H_0: \beta_3 = 0$
$H_1: \beta_3 \neq 0$
 If we reject the null hypothesis we conclude that there is a
difference in the slope coefficient for the two subgroups
Lesson 11:
Regressions Part II
- END -