Bivariate linear regression

ASW, Chapter 12
Economics 224 – Notes for November 12, 2008
Regression line
y = β0 + β1 x + ε
• For a bivariate or simple regression with an
independent variable x and a dependent variable
y, the regression equation is y = β0 + β1 x + ε.
• The values of the error term, ε, average to 0 so
E(ε) = 0 and E(y) = β0 + β1 x.
• Using observed or sample data for values of x
and y, estimates of the parameters β0 and β1 are
obtained and the estimated regression line is
ŷ = b0 + b1 x
where ŷ is the value of y that is predicted from
the estimated regression line.
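Bivariate regression line
[Figure: the line E(y) = β0 + β1x, with x on the horizontal axis and y on the vertical axis. For an observation (xi, yi) from the model y = β0 + β1x + ε, the error term ε is the vertical distance between yi and E(yi).]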
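Observed scatter diagram and
estimated least squares line
[Figure: scatter of observed points with the estimated line ŷ = b0 + b1x. For each point, the deviation is the vertical distance between y (actual) and ŷ (estimated).]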
Example from SLID 2005
• According to human capital theory, increased
education is associated with greater earnings.
• Random sample of 22 Saskatchewan males
aged 35-39 with positive wages and salaries in
2004, from the Survey of Labour and Income
Dynamics, 2005.
• Let x be total number of years of school
completed (YRSCHL18) and y be wages and
salaries in dollars (WGSAL42).
Source: Statistics Canada, Survey of Labour and Income Dynamics,
2005 [Canada]: External Cross-sectional Economic Person File
[machine readable data file]. From IDLS through UR Data Library.
ID#   YRSCHL18   WGSAL42
1     17         62500
2     12         15500
3     12         67500
4     11         9500
5     15         38000
6     15         36000
7     19         70000
8     15         47000
9     20         80000
10    16         28000
11    18         65000
12    11         48000
13    14         72500
14    12         33000
15    14.5       6000
16    13.5       62500
17    15         77500
18    13         42000
19    10         36000
20    12.5       21000
21    15         41000
22    12.3       52500
YRSCHL18 is the variable “number of years of schooling”.
WGSAL42 is the variable “wages and salaries in dollars, 2004”.
Plot of WGSAL42 with YRSCHL18
[Figure: scatter diagram of wages and salaries, y (vertical axis, 0 to 100,000 dollars), against total number of years of schooling completed, x (horizontal axis, 8 to 22 years), for the n = 22 cases.]
Mean of y is $45,954 and sd is $21,960.
Mean of x is 14.2 years and sd is 2.64 years.
Plot of WGSAL42 with YRSCHL18
[Figure: scatter diagram of wages and salaries (y) against total number of years of schooling completed (x), with the estimated least squares line ŷ = -13,493 + 4,181x.]
Analysis and results
H0: β1 = 0. Schooling has no effect on earnings.
H1: β1 > 0. Schooling has a positive effect on earnings.
From the least squares estimates, using the data for the 22
cases, the regression equation and associated statistics are:
ŷ = -13,493 + 4,181 x.
R2 = 0.253, r = 0.503.
Standard error of the slope b1 is 1,606.
t = 2.603 (20 df), significance = 0.017.
At α = 0.05, reject H0, accept H1 and conclude that
schooling has a positive effect on earnings.
Each extra year of schooling adds an estimated $4,181 to annual wages
and salaries for those in this sample.
Expected wages and salaries for those with 20 years of
schooling is -13,493 + (4,181 × 20) = $70,127.
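The figures reported above can be reproduced from the 22 observations. The following is a minimal sketch, assuming the YRSCHL18 and WGSAL42 values pair up in the order they are listed in the data table and using the standard least squares formulas for the slope, intercept, R2 and the standard error of the slope. The variable names are illustrative and not part of the original notes.

```python
import math

# Data transcribed from the SLID 2005 table (22 Saskatchewan males aged 35-39).
# Assumption: the two columns pair up in the order listed.
yrschl = [17, 12, 12, 11, 15, 15, 19, 15, 20, 16, 18,
          11, 14, 12, 14.5, 13.5, 15, 13, 10, 12.5, 15, 12.3]
wgsal = [62500, 15500, 67500, 9500, 38000, 36000, 70000, 47000, 80000,
         28000, 65000, 48000, 72500, 33000, 6000, 62500, 77500, 42000,
         36000, 21000, 41000, 52500]

n = len(yrschl)
x_bar = sum(yrschl) / n
y_bar = sum(wgsal) / n

# Sums of squares and cross-products of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(yrschl, wgsal))
s_xx = sum((x - x_bar) ** 2 for x in yrschl)
s_yy = sum((y - y_bar) ** 2 for y in wgsal)

b1 = s_xy / s_xx                 # slope
b0 = y_bar - b1 * x_bar          # intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)

sse = s_yy - b1 * s_xy           # unexplained variation, SST - SSR
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)
t_stat = b1 / se_b1              # test statistic for H0: beta1 = 0, n - 2 df

pred_20 = b0 + b1 * 20           # expected earnings at 20 years of schooling

print(f"b0 = {b0:,.0f}, b1 = {b1:,.0f}, R2 = {r_squared:.3f}")
print(f"se(b1) = {se_b1:,.0f}, t = {t_stat:.3f}, prediction at x = 20: {pred_20:,.0f}")
```

The prediction differs from the $70,127 above by a dollar or two because the slide uses the rounded coefficients.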
Equation of a line
• y = β0 + β1 x. x is the independent
variable (on horizontal) and y is the
dependent variable (on vertical).
• β0 and β1 are the two parameters that
determine the equation of the line.
• β0 is the y intercept – determines the
height of the line.
• β1 is the slope of the line.
– Positive, negative, or zero.
– Size of β1 gives the change in y associated with a
one-unit change in x.
Positive Slope: β1 > 0
[Figure: upward-sloping line with y intercept β0; β1 = slope = Δy/Δx.]
Example – schooling (x) and
earnings (y).
Negative Slope: β1 < 0
[Figure: downward-sloping line with y intercept β0; β1 = slope = Δy/Δx.]
Example – higher income (x)
associated with fewer trips by
bus (y).
Zero Slope: β1 = 0
[Figure: horizontal line at height β0; β1 = slope = Δy/Δx = 0.]
Example – amount of rainfall (x)
and student grades (y)
Infinite Slope: β1 = ∞
[Figure: vertical line; β1 = slope = Δy/Δx = ∞.]
Infinite number of possible lines can be drawn. Find the
straight line that best fits the points in the scatter diagram.
[Figure: Plot of WGSAL42 with YRSCHL18. Scatter diagram of wages and salaries (y) against total number of years of schooling completed (x), with no line drawn.]
Least squares method (ASW, 469)
• Find estimates of β0 and β1 that produce a line that fits
the points the best.
• The most commonly used criterion is least squares.
• The least squares line is the unique line for which the
sum of the squares of the deviations of the y values from
the line is as small as possible.
• Minimize the sum of the squares of the estimated errors (the residuals).
• Or, equivalent to this, minimize the sum of the squares of
the differences of the y values from the predicted values ŷ.
That is, find b0 and b1 that minimize:
$$\sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$$
Least squares line
• Let the n observed values of x and y be
termed xi and yi, where i = 1, 2, 3, ... , n.
• ∑ε² is minimized when b0 and b1 take on
the following values:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
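For readers who want the intermediate step, the formulas for b1 and b0 follow from setting the partial derivatives of the sum of squared deviations to zero. The derivation below is a standard sketch rather than something taken from the slides; S is a symbol introduced here for the sum being minimized.

```latex
\begin{align*}
S(b_0, b_1) &= \sum (y_i - b_0 - b_1 x_i)^2 \\[4pt]
\frac{\partial S}{\partial b_0} &= -2 \sum (y_i - b_0 - b_1 x_i) = 0
  \quad\Longrightarrow\quad b_0 = \bar{y} - b_1 \bar{x} \\[4pt]
\frac{\partial S}{\partial b_1} &= -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0 \\[4pt]
\text{Substituting } b_0:\quad
  & \sum x_i (y_i - \bar{y}) - b_1 \sum x_i (x_i - \bar{x}) = 0 \\[4pt]
b_1 &= \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}
     = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{align*}
```

The last equality uses the fact that deviations from a mean sum to zero, so subtracting \bar{x} inside each sum changes neither the numerator nor the denominator.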
Province               Income   Alcohol
Newfoundland           26.8     8.7
Prince Edward Island   27.1     8.4
Nova Scotia            29.5     8.8
New Brunswick          28.4     7.6
Quebec                 30.8     8.9
Ontario                36.4     10
Manitoba               30.4     9.7
Saskatchewan           29.8     8.9
Alberta                35.1     11.1
British Columbia       32.5     10.9
Is alcohol a superior good?
Income is family income in thousands of
dollars per capita, 1986. (independent
variable)
Alcohol is litres of alcohol consumed per
person 15 years of age or over, 1985-86.
(dependent variable)
Sources: Saskatchewan Alcohol and Drug Abuse Commission,
Fast Factsheet, Regina, 1988.
Statistics Canada, Economic Families – 1986 [machine-readable data
file], 1988.
Hypotheses
H0: β1 = 0. Income has no effect on alcohol
consumption.
H1: β1 > 0. Income has a positive effect on
alcohol consumption.
[Figure: Scatter diagram of alcohol consumption (y) with income (x), Provinces of Canada, 1985-86. Vertical axis: litres per person alcohol consumption, 1985-86 (8 to 12). Horizontal axis: family income ($ thousands) per capita, 1986 (25 to 39).]
Province           x       y      x − x̄      y − ȳ      (x − x̄)(y − ȳ)   (x − x̄)²
Newfoundland       26.8    8.7    -3.88      -0.6       2.328            15.0544
PEI                27.1    8.4    -3.58      -0.9       3.222            12.8164
Nova Scotia        29.5    8.8    -1.18      -0.5       0.59             1.3924
New Brunswick      28.4    7.6    -2.28      -1.7       3.876            5.1984
Quebec             30.8    8.9    0.12       -0.4       -0.048           0.0144
Ontario            36.4    10     5.72       0.7        4.004            32.7184
Manitoba           30.4    9.7    -0.28      0.4        -0.112           0.0784
Saskatchewan       29.8    8.9    -0.88      -0.4       0.352            0.7744
Alberta            35.1    11.1   4.42       1.8        7.956            19.5364
British Columbia   32.5    10.9   1.82       1.6        2.912            3.3124
sum                306.8   93     -6.8E-14   -7.1E-15   25.08            90.896
mean               30.68   9.3
(The sums of the deviations are zero apart from spreadsheet rounding error.)
b1 = 25.08 / 90.896 = 0.275919732
b0 = 9.3 - (0.275919732 × 30.68) = 0.834782609
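The column-by-column calculation in the table can be sketched in a few lines of Python. This is an illustration only: the function name least_squares and the list names are assumptions introduced here, and the data are the ten provincial values from the table.

```python
# Provincial data: income ($ thousands per capita, 1986) and
# alcohol consumption (litres per person 15 and over, 1985-86).
income = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations and sum of squared x deviations,
    # i.e. the totals of the last two columns of the table.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(income, alcohol)
print(f"b1 = {b1:.6f}, b0 = {b0:.6f}")
```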
[Figure: Scatter diagram of alcohol consumption (y) with income (x), Provinces of Canada, 1985-86, with the estimated line ŷ = 0.835 + 0.276x. Vertical axis: litres per person alcohol consumption, 1985-86. Horizontal axis: family income ($ thousands) per capita, 1986.]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.790288
R Square            0.624555
Adjusted R Square   0.577624
Standard Error      0.721104
Observations        10

ANOVA
             df   SS         MS         F          Significance F
Regression   1    6.920067   6.920067   13.30803   0.006513
Residual     8    4.159933   0.519992
Total        9    11.08

               Coefficients   Standard Error   t Stat     P-value
Intercept      0.834783       2.331675         0.358018   0.729592
X Variable 1   0.27592        0.075636         3.648018   0.006513

Analysis. b1 = 0.276 and its standard
error is 0.076, for a t value of 3.648. At α
= 0.01, the null hypothesis can be
rejected (i.e. with H0, the probability of a t
this large or larger is 0.0065) and the
alternative hypothesis accepted. At
0.01 significance, there is evidence that
alcohol is a superior good, i.e. that
income has a positive effect on alcohol
consumption.
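The entries in the summary output are linked by a few standard formulas. The sketch below starts from the sums of squares reported on the slides and recovers the standard error of the estimate, the standard errors of the coefficients, and the t and F statistics. The formulas are the usual textbook ones and the variable names are introduced here for illustration.

```python
import math

# Values taken from the slides.
n = 10                # number of provinces
b1 = 0.27592          # estimated slope
x_bar = 30.68         # mean income
s_xx = 90.896         # sum of squared deviations of x from its mean
ssr = 6.920067        # regression ("explained") sum of squares
sse = 4.159933        # residual ("unexplained") sum of squares

mse = sse / (n - 2)                                # residual mean square
s = math.sqrt(mse)                                 # standard error of the estimate
se_b1 = s / math.sqrt(s_xx)                        # standard error of the slope
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)   # standard error of the intercept
t_b1 = b1 / se_b1                                  # t statistic for H0: beta1 = 0
f_stat = (ssr / 1) / mse                           # F statistic, 1 and n - 2 df

print(f"s = {s:.6f}, se(b1) = {se_b1:.6f}, se(b0) = {se_b0:.6f}")
print(f"t = {t_b1:.3f}, F = {f_stat:.3f}, t squared = {t_b1 ** 2:.3f}")
```

With a single independent variable, F equals the square of the t statistic for the slope, which is why the two significance values in the output coincide.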
Uses of regression line
• Draw line – select two x values (e.g. 26 and 36)
and compute the predicted y values (8.0 and
10.8, respectively). Plot these points and draw
line.
ŷ = 0.835 + 0.276x = 0.835 + (0.276 × 26) = 8.011
ŷ = 0.835 + 0.276x = 0.835 + (0.276 × 36) = 10.771
• Interpolation. If a city had a mean income of
$32,000, the expected level of alcohol
consumption would be 9.7 litres per capita.
ŷ = 0.835 + 0.276x = 0.835 + (0.276 × 32) = 9.667
Extrapolation
• Suppose a city had a mean income of $50,000 in 1986.
From the equation, expected alcohol consumption would be
14.6 litres per capita.
ŷ = 0.835 + 0.276x = 0.835 + (0.276 × 50) = 14.635
• Cautions:
– Model was tested over the range of income values from
26 to 36 thousand dollars. While it appears to be close to
a straight line over this range, there is no assurance that a
linear relation exists outside this range.
– Model does not fit all points – only 62% of the variation in
alcohol consumption is explained by this linear model.
– Confidence intervals for prediction become larger the
further the independent variable x is from its mean.
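The difference between interpolation and extrapolation, and the last caution in particular, can be illustrated numerically. The sketch below uses the rounded coefficients from the slides and the usual textbook formula for the standard error of an individual prediction, s × sqrt(1 + 1/n + (x0 − x̄)² / ∑(xi − x̄)²). That formula is an assumption here since it does not appear on these slides, and the function names are hypothetical.

```python
import math

# Rounded estimates and summary values reported on the slides.
b0, b1 = 0.835, 0.276
s = 0.721104               # standard error of the estimate
n = 10
x_bar = 30.68              # mean income ($ thousands)
s_xx = 90.896              # sum of squared deviations of income
x_min, x_max = 26.8, 36.4  # smallest and largest observed incomes


def predict(x0):
    """Predicted alcohol consumption at income x0 ($ thousands)."""
    return b0 + b1 * x0


def within_range(x0):
    """True when x0 lies inside the observed range of incomes."""
    return x_min <= x0 <= x_max


def prediction_se(x0):
    """Standard error of an individual prediction (assumed textbook formula)."""
    return s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)


for x0 in (32, 50):
    kind = "interpolation" if within_range(x0) else "extrapolation"
    print(f"x = {x0}: y-hat = {predict(x0):.3f}, "
          f"prediction se = {prediction_se(x0):.3f} ({kind})")
```

The standard error at x = 50 comes out at roughly twice the value at x = 32, which is the widening described in the last caution.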
Change in y resulting from
change in x
ŷ = b0 + b1 x
b1 = (change in y) / (change in x)
dŷ/dx = b1
Estimate of change in y resulting from a change in x is b1.
For the alcohol consumption example, b1 = 0.276.
A 10.0 thousand dollar increase in income is associated
with a 2.76 litre increase in annual alcohol consumption
per capita, at least over the range estimated.
This can be used to calculate the income elasticity for
alcohol consumption.
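As a worked illustration, the income elasticity can be evaluated at the sample means (mean income 30.68, mean consumption 9.3). Evaluating at the means is one common convention and is an assumption here, since the slides do not say at which point the elasticity should be calculated. The symbol η is introduced for the elasticity.

```latex
\eta = \frac{d\hat{y}}{dx} \cdot \frac{x}{\hat{y}} = b_1 \, \frac{x}{\hat{y}},
\qquad
\eta \,\Big|_{x = \bar{x}} = b_1 \, \frac{\bar{x}}{\bar{y}}
  \approx 0.276 \times \frac{30.68}{9.3} \approx 0.91
```

The fitted line passes through the point of means, so the predicted value at x̄ is ȳ. With a linear equation the elasticity changes along the line, so a different value results at other income levels.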
Goodness of fit (ASW, 12.3)
• y is the dependent variable, or the variable to be
explained.
• How much of y is explained statistically from the
regression model, in this case the line?
• Total variation in y is termed the total sum of squares, or
SST.
$$SST = \sum (y_i - \bar{y})^2$$
• The common measure of goodness of fit of the line is the
coefficient of determination, the proportion of the
variation or SST that is “explained” by the line.
SST or total variation of y
$$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
(yi − ȳ): difference from mean
(yi − ŷi): “error” of prediction
(ŷi − ȳ): value of y “explained” by the line
Difference of any observed value of y from the mean
is the difference between the observed and predicted
value plus the difference of the predicted value from
the mean of y. From this, it can be proved that:
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$
SST = SSE + SSR, where
SST = total variation of y
SSE = “unexplained” or “error” variation of y
SSR = “explained” variation of y
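The proof is short once the least squares formulas for b0 and b1 are used. The following is a sketch of the standard argument and is not taken from the slides. Squaring the decomposition of each deviation and summing leaves a cross-product term, and that term is zero for the least squares line.

```latex
\begin{align*}
\sum (y_i - \bar{y})^2
  &= \sum \big[ (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \big]^2 \\
  &= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2
     + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\[6pt]
\text{Since } b_0 = \bar{y} - b_1 \bar{x}: \quad
  \hat{y}_i - \bar{y} &= b_1 (x_i - \bar{x}), \qquad
  y_i - \hat{y}_i = (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \\[6pt]
\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
  &= b_1 \Big[ \sum (x_i - \bar{x})(y_i - \bar{y})
     - b_1 \sum (x_i - \bar{x})^2 \Big] = 0
\end{align*}
```

The bracket is zero because b1 is defined as the ratio of those two sums, so SST = SSE + SSR.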
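Variation in y
[Figure: an observed point (xi, yi) and the line ŷ = b0 + b1x. The total deviation yi − ȳ is the vertical distance from yi to the mean ȳ.]
Variation in y “explained” by the line
[Figure: the same point and line. The “explained” portion ŷi − ȳ is the vertical distance from the predicted value ŷi to the mean ȳ.]
Variation in y that is “unexplained” or error
[Figure: the same point and line. The “unexplained” portion or error yi − ŷi is the vertical distance from yi to the predicted value ŷi.]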
Coefficient of determination
The coefficient of determination, r2 or R2 (the notation
used in many texts), is defined as the ratio of the
“explained” or regression sum of squares, SSR, to the
total variation or sum of squares, SST.
$$r^2 = R^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
The coefficient of determination is the square of the
correlation coefficient r. As noted by ASW (483), the
correlation coefficient, r, is the square root of the
coefficient of determination, but with the same sign
(positive or negative) as b1.
Calculations for ŷ = 0.835 + 0.276x

Province   x      y      Predicted Y   Residuals   SSE        SSR        SST
Nfld       26.8   8.7    8.229431      0.470569    0.221435   1.146117   0.36
PEI        27.1   8.4    8.312207      0.087793    0.007708   0.975734   0.81
NS         29.5   8.8    8.974415      -0.17441    0.03042    0.106006   0.25
NB         28.4   7.6    8.670903      -1.0709     1.146833   0.395763   2.89
Que        30.8   8.9    9.33311       -0.43311    0.187585   0.001096   0.16
Ont        36.4   10     10.87826      -0.87826    0.771342   2.490907   0.49
Man        30.4   9.7    9.222742      0.477258    0.227775   0.005969   0.16
SK         29.8   8.9    9.057191      -0.15719    0.024709   0.058956   0.16
Alb        35.1   11.1   10.51957      0.580435    0.336905   1.487339   3.24
BC         32.5   10.9   9.802174      1.097826    1.205222   0.252179   2.56
Total                                              4.159933   6.920067   11.08
R squared = 6.920067 / 11.08 = 0.624555
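The sums of squares in the calculation table can be checked with a short script. This is a sketch using the ten provincial observations; the variable names are illustrative, and the sign of r is taken from b1 as described in the note on the correlation coefficient.

```python
import math

income = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(alcohol) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, alcohol))
      / sum((x - x_bar) ** 2 for x in income))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in income]

sse = sum((y - p) ** 2 for y, p in zip(alcohol, predicted))  # unexplained
ssr = sum((p - y_bar) ** 2 for p in predicted)               # explained
sst = sum((y - y_bar) ** 2 for y in alcohol)                 # total

r_squared = ssr / sst
r = math.copysign(math.sqrt(r_squared), b1)  # r carries the sign of b1

print(f"SSE = {sse:.6f}, SSR = {ssr:.6f}, SST = {sst:.2f}")
print(f"R squared = {r_squared:.6f}, r = {r:.6f}")
```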
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.790288
R Square            0.624555
Adjusted R Square   0.577624
Standard Error      0.721104
Observations        10

ANOVA
             df   SS         MS         F          Significance F
Regression   1    6.920067   6.920067   13.30803   0.006513
Residual     8    4.159933   0.519992
Total        9    11.08
Interpretation of R2
• Proportion, or percentage if multiplied by 100, of the
variation in the dependent variable that is statistically
explained by the regression line.
• 0 ≤ R2 ≤ 1.
• Large R2 means the line fits the observed points well and
the line explains a lot of the variation in the dependent
variable, at least in statistical terms.
• Small R2 means the line does not fit the observed points
very well and the line does not explain much of the
variation in the dependent variable.
– Random or error component dominates.
– Missing variables.
– Relationship between x and y may not be linear.
How large is a large R2?
• Extent of relationship – weak relationship associated
with low value and strong relationship associated with
large value.
• Type of data
– Micro/survey data associated with small values of R2.
For schooling/earnings example, R2 = 0.253. Much
individual variation.
– Grouped data associated with larger values of R2. In
income/alcohol example, R2 = 0.625. Grouping
averages out individual variation.
– Time series data often results in very high R2. In
consumption function example (next slide), R2 =
0.988. Trends often move together.
[Figure: Consumption (y) and GDP (x), Canada, 1995 to 2004, quarterly data. Vertical axis: consumption (450,000 to 700,000). Horizontal axis: GDP (800,000 to 1,150,000).]
Beware of R2
• Difficult to compare across equations, especially with
different types of data and forms of relationships.
• More variables added to model can increase R2.
Adjusted R2 can correct for this. ASW, Chapter 13.
• Grouped or averaged observations can result in larger
values of R2.
• Need to test for statistical significance.
• We want good estimates of β0 and β1, rather than high
R2.
• At the same time, for similar types of data and issues, a
model with a larger value of R2 may be preferable to one
with a smaller value.
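The adjustment mentioned in the second point can be sketched as follows, using the usual formula adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where k is the number of independent variables. The formula is the standard textbook one and the function name is introduced here for illustration; neither appears on these slides.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Income/alcohol example: R2 = 0.624555 with n = 10 provinces and k = 1.
alcohol_adj = adjusted_r_squared(0.624555, n=10, k=1)

# Schooling/earnings example: R2 = 0.253 with n = 22 cases and k = 1.
schooling_adj = adjusted_r_squared(0.253, n=22, k=1)

print(f"alcohol: {alcohol_adj:.6f}, schooling: {schooling_adj:.3f}")
```

For the income/alcohol example this reproduces the Adjusted R Square of 0.577624 shown in the summary output.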
Next day
• Assumptions of regression model.
• Testing for statistical significance.
