Dr. Ka-fu Wong
ECON1003
Analysis of Economic Data
Ka-fu Wong © 2003
Chapter Thirteen
Linear Regression and Correlation
GOALS
1. Draw a scatter diagram.
2. Understand and interpret the terms dependent variable and independent variable.
3. Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
4. Conduct a test of hypothesis to determine if the population coefficient of correlation is different from zero.
5. Calculate the least squares regression line and interpret the slope and intercept values.
6. Construct and interpret a confidence interval and prediction interval for the dependent variable.
7. Set up and interpret an ANOVA table.
Correlation Analysis
- Correlation Analysis is a group of statistical techniques used to measure the strength of the association between two variables.
- A Scatter Diagram is a chart that portrays the relationship between the two variables.
- The Dependent Variable is the variable being predicted or estimated.
- The Independent Variable provides the basis for estimation. It is the predictor variable.
Types of Relationships
- Direct vs. Inverse
  - Direct: X and Y increase together.
  - Inverse: X and Y move in opposite directions.
- Linear vs. Curvilinear
  - Linear: a straight line best describes the relationship between X and Y.
  - Curvilinear: a curved line best describes the relationship between X and Y.
Direct vs. Inverse Relationship
[Two scatter plots: a direct relationship, with positive slope, of Sales against Advertising; and an inverse relationship, with negative slope, of Pollution Emissions against Anti-Pollution Expenditures.]
Example
Suppose a university administrator wishes to determine whether any relationship exists between a student's score on an entrance examination and that student's cumulative GPA. A sample of eight students is taken. The results are shown below.

Student   Exam Score   GPA
A         74           2.6
B         69           2.2
C         85           3.4
D         63           2.3
E         82           3.1
F         60           2.1
G         79           3.2
H         91           3.8
Scatter Diagram: GPA vs. Exam Score
[Scatter plot of Cumulative GPA (2.00 to 4.00) against Exam Score (50 to 95) for the eight students.]
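A scatter diagram like this is easy to reproduce in software. Below is a minimal sketch (my addition, not part of the original slides), assuming Python with matplotlib is available; it plots the eight (exam score, GPA) pairs from the table above.

```python
# Sketch: reproduce the GPA vs. exam score scatter diagram (assumes matplotlib).
import matplotlib.pyplot as plt

exam = [74, 69, 85, 63, 82, 60, 79, 91]          # entrance exam scores, students A-H
gpa = [2.6, 2.2, 3.4, 2.3, 3.1, 2.1, 3.2, 3.8]   # cumulative GPAs

plt.scatter(exam, gpa)
plt.xlabel("Exam Score")
plt.ylabel("Cumulative GPA")
plt.title("Scatter Diagram: GPA vs. Exam Score")
plt.show()
```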
Possible relationships between X and Y in
Scatter Diagrams
[Six scatter plot panels of Y against X: (a) direct linear, (b) inverse linear, (c) direct curvilinear, (d) inverse curvilinear, (e) inverse linear with more scattering, (f) no relationship.]
The Coefficient of Correlation, r
- The Coefficient of Correlation (r) is a measure of the strength of the linear relationship between two variables.
  - It requires interval or ratio-scaled data.
  - It can range from -1.00 to 1.00.
  - Values of -1.00 or 1.00 indicate perfect (and thus the strongest possible) correlation.
  - Values close to 0.0 indicate weak correlation.
  - Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Formula for r
We calculate the coefficient of correlation from
the following formulas.
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}
  = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}

where s_x and s_y are the sample standard deviations of x and y.
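As a cross-check, the computational (right-hand) form of the formula can be coded directly. The sketch below is my illustration, not part of the slides; it applies the formula to the exam score/GPA data from the earlier example.

```python
# Sketch: coefficient of correlation r via the computational (shortcut) formula.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    syy = sum(yi ** 2 for yi in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

exam = [74, 69, 85, 63, 82, 60, 79, 91]
gpa = [2.6, 2.2, 3.4, 2.3, 3.1, 2.1, 3.2, 3.8]
print(round(pearson_r(exam, gpa), 3))   # about 0.96: a strong direct relationship
```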
Perfect Negative Correlation (r = -1)
[Scatter plot of Y (0 to 10) against X (0 to 10) illustrating perfect negative correlation, r = -1.]
Perfect Positive Correlation (r = +1)
[Scatter plot of Y (0 to 10) against X (0 to 10) illustrating perfect positive correlation, r = +1.]
Zero Correlation (r = 0)
[Scatter plot of Y (0 to 10) against X (0 to 10) illustrating zero correlation, r = 0.]
Strong Positive Correlation (0<r<1)
[Scatter plot of Y (0 to 10) against X (0 to 10) illustrating strong positive correlation, 0 < r < 1.]
Coefficient of Determination
- The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
- It is the square of the coefficient of correlation.
- It ranges from 0 to 1.
- It does not give any information on the direction of the relationship between the variables.
- Special cases:
  - No correlation: r = 0, r² = 0.
  - Perfect negative correlation: r = -1, r² = 1.
  - Perfect positive correlation: r = +1, r² = 1.
EXAMPLE 1
Dan Ireland, the student body president at Toledo State University, is concerned about the cost to students of textbooks. He believes there is a relationship between the number of pages in the text and the selling price of the book. To provide insight into the problem he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram. Compute the correlation coefficient.

Book                 Pages   Price ($)
Intro to History     500     84
Basic Algebra        700     75
Intro to Psyc        800     99
Intro to Sociology   600     72
Bus. Mgt.            400     69
Intro to Biology     500     81
Fund. of Jazz        600     63
Princ. of Nursing    800     93
Example 1 continued
Scatter Diagram of Number of Pages and Selling Price of Text
[Scatter plot of Price ($), from 60 to 100, against Pages, from 400 to 800, for the eight textbooks.]
Example 1 continued

Book                 X (Pages)   Y (Price)   XY        X²          Y²
Intro to History     500         84          42,000    250,000     7,056
Basic Algebra        700         75          52,500    490,000     5,625
Intro to Psyc        800         99          79,200    640,000     9,801
Intro to Sociology   600         72          43,200    360,000     5,184
Bus. Mgt.            400         69          27,600    160,000     4,761
Intro to Biology     500         81          40,500    250,000     6,561
Fund. of Jazz        600         63          37,800    360,000     3,969
Princ. of Nursing    800         93          74,400    640,000     8,649
Total                4,900       636         397,200   3,150,000   51,606
Example 1 continued

r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}}
  = \frac{8(397{,}200) - (4{,}900)(636)}{\sqrt{[8(3{,}150{,}000) - (4{,}900)^2][8(51{,}606) - (636)^2]}} = 0.614
- The correlation between the number of pages and the selling price of the book is 0.614. This indicates a moderate association between the variables.
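The same arithmetic can be checked in a couple of lines; this sketch (my addition) plugs the column totals from the table on the previous slide into the shortcut formula.

```python
# Sketch: r for the textbook example from the totals n, Σx, Σy, Σxy, Σx², Σy².
from math import sqrt

n, sx, sy = 8, 4_900, 636
sxy, sxx, syy = 397_200, 3_150_000, 51_606

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))   # 0.614
```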
EXAMPLE 1
continued
Is there a linear relation between number of pages and price of
books?
- Test the hypothesis that there is no correlation in the population. Use a .02 significance level.
- Under the null hypothesis that there is no correlation in the population, the statistic

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}

follows a Student t-distribution with (n - 2) degrees of freedom.
EXAMPLE 1
continued
- Step 1:
  H0: The correlation in the population is zero.
  H1: The correlation in the population is not zero.
- Step 2: H0 is rejected if t > 3.143 or if t < -3.143. There are 6 degrees of freedom, found by n - 2 = 8 - 2 = 6.
- Step 3: To find the value of the test statistic we use:

t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{.614\sqrt{8 - 2}}{\sqrt{1 - (.614)^2}} = 1.905

- Step 4: H0 is not rejected. We cannot reject the hypothesis that there is no correlation in the population. The amount of association could be due to chance.
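The four steps can be mirrored in a short script. This is my sketch, not the lecture's; the critical value comes from scipy.stats, which is assumed to be installed (the slides read it from a table).

```python
# Sketch: test H0 "no correlation in the population" at the 0.02 significance level.
from math import sqrt
from scipy.stats import t as t_dist

r, n, alpha = 0.614, 8, 0.02
t_stat = r / sqrt((1 - r ** 2) / (n - 2))      # about 1.9
t_crit = t_dist.ppf(1 - alpha / 2, n - 2)      # about 3.143 with 6 degrees of freedom

print(round(t_stat, 3), round(t_crit, 3))
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```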
Regression Analysis
- In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
- The relationship between the variables is linear.
- Both variables must be at least interval scale.
Simple Linear Regression Model
- The relationship between the variables is a linear function:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

where \beta_0 is the Y intercept, \beta_1 is the slope (rise/run), and \varepsilon_i is the random error. \beta_0 and \beta_1 are unknown and are therefore estimated from the data.

[Figure: the regression line in the (x, y) plane, showing the intercept \beta_0, the slope \beta_1 = rise/run, the dependent (response) variable on the vertical axis, and the independent (explanatory) variable on the horizontal axis.]
Finance Application: Market Model
- One of the most important applications of linear regression is the market model.
- It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (R_m):

R = \beta_0 + \beta_1 R_m + \varepsilon

where R is the rate of return on a particular stock and R_m is the rate of return on some major stock index. The beta coefficient (\beta_1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
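As an illustration only, the market model can be estimated from a sample of paired returns. The return figures below are made up for the sketch (they are not data from the lecture), and numpy's polyfit is just one convenient way to obtain the least squares line.

```python
# Sketch: estimating the market model R = b0 + b1*Rm from (hypothetical) return data.
import numpy as np

rm = np.array([0.020, -0.010, 0.030, 0.015, -0.020, 0.010])   # market index returns
r = np.array([0.030, -0.020, 0.040, 0.010, -0.030, 0.020])    # returns on one stock

beta, intercept = np.polyfit(rm, r, 1)   # highest-degree coefficient (the slope) first
print(round(beta, 3), round(intercept, 4))   # beta measures sensitivity to the market
```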
Assumptions Underlying Linear
Regression
- For each value of X, there is a group of Y values, and these Y values are normally distributed.
- The means of these normal distributions of Y values all lie on the straight line of regression.
- The standard deviations of these normal distributions are equal.
- The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
Choosing the line that fits best
- The estimates are determined by
  - drawing a sample from the population of interest,
  - calculating sample statistics,
  - producing a straight line that cuts into the data.
- The question is: which straight line fits best?

[Figure: a scatter of sample points in the (x, y) plane.]
Choosing the line that fits best
The best line is the one that minimizes the sum of squared vertical differences between the points and the line. Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5), and (4, 3.2); the second line is horizontal, at Y = 2.5.

For the first line:
Sum of squared differences = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89

For the horizontal line:
Sum of squared differences = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data. That is, the line with the least sum of squares (of differences) fits best.

[Figure: the four points plotted with the two candidate lines.]
Choosing the line that fits best
Ordinary Least Squares (OLS) Principle
- Straight lines can be described generally by Y = b_0 + b_1 X.
- Finding the best line, with the smallest sum of squared differences, is the same as solving

\min_{b_0, b_1} S(b_0, b_1) \equiv \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2

- Let b_0^* and b_1^* be the solution of the above problem. Then

Y^* = b_0^* + b_1^* X

is known as the "average predicted value" (or simply "predicted value") of Y for any X.
Coefficient estimates from the
ordinary least squares (OLS) principle
- Solving the minimization problem

S(b_0, b_1) \equiv \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2

implies the first order conditions:

\frac{\partial S(b_0, b_1)}{\partial b_0} = \sum_{i=1}^{n} 2[y_i - (b_0 + b_1 x_i)](-1) = 0

\frac{\partial S(b_0, b_1)}{\partial b_1} = \sum_{i=1}^{n} 2[y_i - (b_0 + b_1 x_i)](-x_i) = 0
Coefficient estimates from the
ordinary least squares (OLS) principle
- Solving the first order conditions implies

b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}

b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1 \frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}
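These two formulas translate directly into code. The sketch below is my own; applied to the textbook data of Example 1, it reproduces the estimates derived in Example 2 on the next slide.

```python
# Sketch: least squares slope b1 and intercept b0 from the OLS formulas above.
def ols_line(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = sy / n - b1 * sx / n          # equivalently, b0 = ybar - b1 * xbar
    return b0, b1

pages = [500, 700, 800, 600, 400, 500, 600, 800]
price = [84, 75, 99, 72, 69, 81, 63, 93]
print(ols_line(pages, price))          # roughly (48.0, 0.0514)
```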
EXAMPLE 2
continued from Example 1
- Develop a regression equation for the information given in EXAMPLE 1. The information there can be used to estimate the selling price based on the number of pages.

b_1 = \frac{8(397{,}200) - (4{,}900)(636)}{8(3{,}150{,}000) - (4{,}900)^2} = 0.05143

b_0 = \frac{636}{8} - 0.05143\left(\frac{4{,}900}{8}\right) = 48.0
Example 2
continued from Example 1
The regression equation is: Y* = 48.0 + 0.05143X
- The equation crosses the Y-axis at $48. A book with no pages would cost $48.
- The slope of the line is 0.05143. Each additional page costs about $0.05, or five cents.
- The sign of the slope b1 and the sign of r will always be the same.
Example 2
continued from Example 1
We can use the regression equation to estimate values of Y.
- The estimated selling price of an 800-page book is $89.14, found by

Y* = 48.0 + 0.05143X = 48.0 + 0.05143(800) = 89.14
Standard Error of Estimate
(denoted se or Sy.x)
- Measures the reliability of the estimating equation
- A measure of dispersion
- Measures the variability, or scatter, of the observed values around the regression line

s_e = s_{y \cdot x} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - y_i^*)^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - b_0\sum_{i=1}^{n} y_i - b_1\sum_{i=1}^{n} x_i y_i}{n - 2}}
Scatter Around the Regression Line
[Two scatter plots around fitted lines: one labelled "More Accurate Estimator of X, Y Relationship", the other "Less Accurate Estimator of X, Y Relationship".]
Interpreting the Standard Error of the
Estimate
- s_e measures the dispersion of the points around the regression line.
  - If s_e = 0, the equation is a "perfect" estimator.
- s_e is used to compute confidence intervals of the estimated value.
- Assumptions:
  - Observed Y values are normally distributed around each estimated value of Y*.
  - Constant variance.
Variation of Errors Around the
Regression Line
- y values are normally distributed around the regression line.
- For each x value, the "spread" or variance around the regression line is the same.

[Figure: the regression line with normal error densities f(e) of equal spread centred on the line at two values of X, X1 and X2.]
Scatter around the Regression Line
[Figure: the regression line Y = b0 + b1X, with the dependent variable (Y) plotted against the independent variable (X) and parallel bands at Y = b0 + b1X ± 1se and Y = b0 + b1X ± 2se. About 68% of observations lie within ±1se of the line and about 95.5% lie within ±2se.]
Example 3
continued from Example 1 and 2.
Find the standard error of estimate for the
problem involving the number of pages in a book
and the selling price.
s_e = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - b_0\sum_{i=1}^{n} y_i - b_1\sum_{i=1}^{n} x_i y_i}{n - 2}} = \sqrt{\frac{51{,}606 - 48(636) - 0.05143(397{,}200)}{8 - 2}} = 10.408
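The same value can be obtained from the residuals directly, which is a useful check on the shortcut formula. The sketch is mine, not part of the slides.

```python
# Sketch: standard error of estimate from the residuals, se = sqrt(SSE / (n - 2)).
from math import sqrt

pages = [500, 700, 800, 600, 400, 500, 600, 800]
price = [84, 75, 99, 72, 69, 81, 63, 93]
b0, b1 = 48.0, 0.05143

residuals = [y - (b0 + b1 * x) for x, y in zip(pages, price)]
sse = sum(e ** 2 for e in residuals)
se = sqrt(sse / (len(pages) - 2))
print(round(se, 2))   # about 10.41
```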
Equations for the Interval Estimates
Confidence interval for the mean of y:

y^* \pm t\, s_e \sqrt{h}

Prediction interval for an individual y:

y^* \pm t\, s_e \sqrt{1 + h}

where

h = \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
Confidence Interval Estimate for Mean
Response
The following factors influence the width of the interval: the standard error, the sample size, and the X value.

[Figure: the fitted line y* = b0 + b1xi with confidence bands, plotted against X.]
Confidence Interval
continued from Example 1, 2 and 3.
- For books 800 pages long, what is the 95% confidence interval for the mean price?
- This calls for a confidence interval on the average price of books that are 800 pages long.
y^* \pm t\, s_e\sqrt{h} = y^* \pm t\, s_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
= 89.14 \pm 2.447(10.408)\sqrt{\frac{1}{8} + \frac{(800 - 612.5)^2}{3{,}150{,}000 - \frac{(4{,}900)^2}{8}}}
= 89.14 \pm 15.31
Prediction Interval continued from Example 1, 2 and 3.
- For a book 800 pages long, what is the 95% prediction interval for its price?
- This calls for a prediction interval on the price of an individual book that is 800 pages long.
y^* \pm t\, s_e\sqrt{1 + h} = y^* \pm t\, s_e\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
= 89.14 \pm 2.447(10.408)\sqrt{1 + \frac{1}{8} + \frac{(800 - 612.5)^2}{3{,}150{,}000 - \frac{(4{,}900)^2}{8}}}
= 89.14 \pm 29.72
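Both intervals can be reproduced with a short script. This is my sketch; the t value comes from scipy.stats, which the slides read from a table instead.

```python
# Sketch: 95% confidence interval (mean price) and prediction interval (one book)
# for x = 800 pages.
from math import sqrt
from scipy.stats import t as t_dist

pages = [500, 700, 800, 600, 400, 500, 600, 800]
n, se, b0, b1, x_new = len(pages), 10.408, 48.0, 0.05143, 800

xbar = sum(pages) / n
sxx = sum((x - xbar) ** 2 for x in pages)
h = 1 / n + (x_new - xbar) ** 2 / sxx
y_star = b0 + b1 * x_new                  # about 89.14
t_crit = t_dist.ppf(0.975, n - 2)         # about 2.447

ci_half = t_crit * se * sqrt(h)           # about 15.3, for the mean response
pi_half = t_crit * se * sqrt(1 + h)       # about 29.7, for an individual book
print(round(y_star, 2), round(ci_half, 2), round(pi_half, 2))
```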
Interpretation of Coefficients (b0 and b1)
1. Slope (b1)
   - Estimated Y changes by b1 for each one-unit increase in X.
2. Y-Intercept (b0)
   - Estimated value of Y when X = 0.
Test of Slope Coefficient (b1)
1. Tests if there is a linear relationship between X and Y.
2. Involves the population slope β1.
3. Hypotheses:
   - H0: β1 = 0 (no linear relationship)
   - H1: β1 ≠ 0 (linear relationship)
4. The theoretical basis is the sampling distribution of slopes.
Sampling Distribution of the Least
Squares Coefficient Estimator
- If the standard least squares assumptions hold, then b1 is an unbiased estimator of β1 and has population variance

\sigma_{b_1}^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1)s_x^2}

and an unbiased sample variance estimator

s_{b_1}^2 = \frac{s_e^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_e^2}{(n-1)s_x^2}

Bias = E(b1) - β1; "unbiased" means E(b1) - β1 = 0.
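For the textbook example, the estimated standard error of the slope and the resulting t ratio can be computed as below (my sketch); the values should match the computer output shown later.

```python
# Sketch: s_b1 = se / sqrt(sum((x - xbar)^2)) and the t ratio for testing beta1 = 0.
from math import sqrt

pages = [500, 700, 800, 600, 400, 500, 600, 800]
se, b1 = 10.408, 0.05143

xbar = sum(pages) / len(pages)
sxx = sum((x - xbar) ** 2 for x in pages)
s_b1 = se / sqrt(sxx)
print(round(s_b1, 4), round(b1 / s_b1, 2))   # about 0.027 and 1.9
```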
Basis for Inference About the Population
Regression Slope
- Let β1 be a population regression slope and b1 its least squares estimate based on n pairs of sample observations. Then, if the standard regression assumptions hold and it can also be assumed that the errors ε_i are normally distributed, the random variable

t = \frac{b_1 - \beta_1}{s_{b_1}}

is distributed as Student's t with (n - 2) degrees of freedom. In addition, the central limit theorem enables us to conclude that this result is approximately valid for a wide range of non-normal distributions and large sample sizes, n.
Tests of the Population Regression Slope
- If the regression errors ε_i are normally distributed and the standard least squares assumptions hold (or if the distribution of b1 is approximately normal), the following tests have significance level α:

1. To test either null hypothesis
   H0: β1 = β1*  or  H0: β1 ≤ β1*
   against the alternative
   H1: β1 > β1*
   the decision rule is to reject H0 if

   t = \frac{b_1 - \beta_1^*}{s_{b_1}} \ge t_{(n-2),\,\alpha}
Tests of the Population Regression Slope
2. To test either null hypothesis
   H0: β1 = β1*  or  H0: β1 ≥ β1*
   against the alternative
   H1: β1 < β1*
   the decision rule is to reject H0 if

   t = \frac{b_1 - \beta_1^*}{s_{b_1}} \le -t_{(n-2),\,\alpha}
Tests of the Population Regression Slope
3. To test the null hypothesis
   H0: β1 = β1*
   against the two-sided alternative
   H1: β1 ≠ β1*
   the decision rule is to reject H0 if

   t = \frac{b_1 - \beta_1^*}{s_{b_1}} \ge t_{(n-2),\,\alpha/2} \quad\text{or}\quad t = \frac{b_1 - \beta_1^*}{s_{b_1}} \le -t_{(n-2),\,\alpha/2}

   Equivalently, reject H0 if

   \left|\frac{b_1 - \beta_1^*}{s_{b_1}}\right| \ge t_{(n-2),\,\alpha/2}
Confidence Intervals for the Population
Regression Slope 1
- If the regression errors ε_i are normally distributed and the standard regression assumptions hold, a 100(1 - α)% confidence interval for the population regression slope β1 is given by

b_1 - t_{(n-2),\,\alpha/2}\, s_{b_1} \le \beta_1 \le b_1 + t_{(n-2),\,\alpha/2}\, s_{b_1}
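Continuing the textbook example, the 95% confidence interval for β1 follows directly from this formula (again my sketch, with the t value taken from scipy.stats).

```python
# Sketch: 95% confidence interval for the population slope beta1.
from scipy.stats import t as t_dist

b1, s_b1, n = 0.05143, 0.027, 8
t_crit = t_dist.ppf(0.975, n - 2)              # about 2.447
lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(lower, 4), round(upper, 4))        # the interval includes zero
```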
Some cautions about the interpretation of
significance tests
- Rejecting H0: β1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.
- Causation requires:
  - association,
  - an accurate time sequence, and
  - the elimination of other explanations for the correlation.
- Correlation ≠ Causation
Some cautions about the interpretation of
significance tests
- Just because we are able to reject H0: β1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.
  - A linear relationship is a very small subset of the possible relationships among variables.
  - A test of linear versus nonlinear relationship requires another batch of analysis.
Evaluating the Model
- Variation measures
  - Coefficient of determination
  - Standard error of estimate
- Test coefficients for significance

The fitted model is yi* = b0 + b1xi.
Variation Measures
[Figure: for an observation Yi, the deviation from the mean \bar{Y} splits into an explained part (Yi* - \bar{Y}) and an unexplained part (Yi - Yi*), where yi* = b0 + b1xi is the fitted line. The corresponding sums of squares are the total sum of squares SST = Σ(Yi - \bar{Y})², the explained sum of squares SSR = Σ(Yi* - \bar{Y})², and the unexplained sum of squares SSE = Σ(Yi - Yi*)².]
Measures of Variation in Regression
- Total Sum of Squares (SST)
  - Measures variation of the observed Yi around the mean, \bar{Y}
- Explained Variation (SSR)
  - Variation due to the relationship between X and Y
- Unexplained Variation (SSE)
  - Variation due to other factors
- SST = SSR + SSE
Variation in y (SST) = SSR + SSE
- R² (= r², the coefficient of determination) measures the proportion of the variation in y that is explained by the variation in x:

R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - SSE}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{SSR}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

- R² takes on any value between zero and one.
  - R² = 1: perfect match between the line and the data points.
  - R² = 0: there is no linear relationship between x and y.
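A short sketch (my own illustration) decomposes the variation for the textbook data and recovers the R² that appears in the computer output on a later slide.

```python
# Sketch: SST = SSR + SSE and R^2 for the textbook example.
pages = [500, 700, 800, 600, 400, 500, 600, 800]
price = [84, 75, 99, 72, 69, 81, 63, 93]
b0, b1 = 48.0, 0.05143

ybar = sum(price) / len(price)
fitted = [b0 + b1 * x for x in pages]
sst = sum((y - ybar) ** 2 for y in price)
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))
ssr = sum((f - ybar) ** 2 for f in fitted)

print(round(sst, 1), round(ssr, 1), round(sse, 1))   # about 1044, 393, 651
print(round(1 - sse / sst, 3))                       # R^2 about 0.377
```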
Summarizing the Example’s results
(Example 1, 2 and 3)
- The estimated selling price for a book with 800 pages is $89.14.
- The standard error of estimate is $10.41.
- The 95 percent confidence interval for all books with 800 pages is $89.14 ± $15.31. This means the limits are between $73.83 and $104.45.
- The 95 percent prediction interval for a particular book with 800 pages is $89.14 ± $29.72. This means the limits are between $59.42 and $118.86.
- These results appear in the following output.
Example 3
continued
Regression Analysis: Price versus Pages
The regression equation is
Price = 48.0 + 0.0514 Pages
Predictor    Coef      SE Coef   T      P
Constant     48.00     16.94     2.83   0.030
Pages        0.05143   0.02700   1.90   0.105

S = 10.41    R-Sq = 37.7%    R-Sq(adj) = 27.3%

Analysis of Variance
Source            DF   SS       MS      F      P
Regression         1   393.4    393.4   3.63   0.105
Residual Error     6   650.6    108.4
Total              7   1044.0
Testing for Linearity
Key Argument:
- If the value of y does not change linearly with the value of x, then the mean of y is the best predictor of the actual value of y. In this case using ŷ = \bar{y} is preferable.
- If the value of y does change linearly with the value of x, then the regression model gives a better prediction of the value of y than the mean of y. In this case using ŷ = y* is preferable.
Three Tests for Linearity
- Testing the Coefficient of Correlation
  H0: ρ = 0 (there is no linear relationship between x and y)
  H1: ρ ≠ 0 (there is a linear relationship between x and y)
  Test statistic:

  t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}

- Testing the Slope of the Regression Line
  H0: β1 = 0 (there is no linear relationship between x and y)
  H1: β1 ≠ 0 (there is a linear relationship between x and y)
  Test statistic:

  t = \frac{b_1}{s_e \big/ \sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}
Three Tests for Linearity
- The Global F-test
  H0: There is no linear relationship between x and y.
  H1: There is a linear relationship between x and y.
  Test statistic:

  F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = \frac{\sum_{i=1}^{n}(y_i^* - \bar{y})^2 \,/\, 1}{\sum_{i=1}^{n}(y_i - y_i^*)^2 \,/\,(n - 2)}

[Variation in y] = SSR + SSE. A large F results from a large SSR; then much of the variation in y is explained by the regression model, the null hypothesis should be rejected, and thus the model is valid.
Note: At the level of simple linear regression, the global F-test is
equivalent to the t-test on b1. When we conduct regression analysis
of multiple variables, the global F-test will take on a unique function.
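The F statistic and its p-value can be checked against the ANOVA table in the earlier output; the sketch is mine and uses scipy.stats for the F distribution.

```python
# Sketch: global F-test, F = MSR / MSE = (SSR / 1) / (SSE / (n - 2)).
from scipy.stats import f as f_dist

n, ssr, sse = 8, 393.4, 650.6
F = (ssr / 1) / (sse / (n - 2))        # about 3.63
p_value = f_dist.sf(F, 1, n - 2)       # upper-tail probability, about 0.105
print(round(F, 2), round(p_value, 3))
```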
Residual Analysis
- Purposes
  - Examine linearity
  - Evaluate violations of assumptions
- Graphical analysis of residuals
  - Plot the residuals (the difference between actual Yi and predicted Yi*) versus the Xi values
  - Studentized residuals: allow consideration of the magnitude of the residuals
Residual Analysis for Linearity
[Two residual plots (e against X): one labelled "Linear - OK", the other "Not Linear".]
Residual Analysis for Homoscedasticity
- When the requirement of a constant variance (homoscedasticity) is violated, we have heteroscedasticity.

[Two plots of standardized residuals (SR) against X: one labelled "Homoscedasticity - OK", the other "Heteroscedasticity".]
Residual Analysis for Independence
[Two plots of standardized residuals (SR) against X: one labelled "Independent - OK", the other "Not Independent".]
Non-independence of error variables
- Data collected over time constitute a time series.
- Examining the residuals over time, no pattern should be observed if the errors are independent.
- When a pattern is detected, the errors are said to be autocorrelated.
- Autocorrelation can be detected by graphing the residuals against time.
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Two plots of residuals against time. Left: note the runs of positive residuals, replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]
The Durbin-Watson Statistic
- Used when data are collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period).
- Measures violation of the independence assumption.

D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}

D should be close to 2. If not, examine the model for autocorrelation.
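The statistic itself is a one-liner on the residual series. The residual values below are made up purely to show the computation (my sketch, not lecture data).

```python
# Sketch: Durbin-Watson statistic D = sum((e_i - e_{i-1})^2) / sum(e_i^2).
def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    return num / sum(ei ** 2 for ei in e)

residuals = [0.5, 0.3, -0.4, -0.2, 0.6, -0.5, 0.1, -0.3]   # illustrative values only
print(round(durbin_watson(residuals), 2))   # values near 2 suggest no autocorrelation
```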
Outliers
- An outlier is an observation that is unusually small or large.
- Several possibilities need to be investigated when an outlier is observed:
  - There was an error in recording the value.
  - The point does not belong in the sample.
  - The observation is valid.
- Identify outliers from the scatter diagram.
- It is customary to suspect an observation is an outlier if its |standard residual| > 2.
An outlier vs. an influential observation

[Two scatter plots illustrating an outlier and an influential observation: some outliers may be very influential, and the outlier causes a shift in the regression line.]
Remedying violations of the required
conditions
- Nonnormality or heteroscedasticity can be remedied using transformations on the y variable.
- The transformations can improve the linear relationship between the dependent variable and the independent variables.
- Many computer software systems allow us to make the transformations easily.
A brief list of transformations
- y' = log y (for y > 0)
  - Use when s_e increases with y, or
  - Use when the error distribution is positively skewed.
- y' = y²
  - Use when s_e² is proportional to E(y), or
  - Use when the error distribution is negatively skewed.
- y' = y^{1/2} (for y > 0)
  - Use when s_e² is proportional to E(y).
- y' = 1/y
  - Use when s_e² increases significantly when y increases beyond some value.
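In practice the transformation is applied to y and the line is refit. Below is a brief numpy sketch of the first and third transformations, with placeholder data of my own.

```python
# Sketch: refitting the regression after transforming y (log and square root).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 9.5, 20.2, 41.0])       # illustrative positive values

for label, y_t in [("log y", np.log(y)), ("sqrt y", np.sqrt(y))]:
    slope, intercept = np.polyfit(x, y_t, 1)    # least squares fit on transformed y
    print(label, round(slope, 3), round(intercept, 3))
```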
Chapter Thirteen
Linear Regression and Correlation
- END -