Chapter 11 Introduction to Linear Regression and Correlation Analysis Chapter 11 - Chapter Outcomes After studying the material in this chapter, you should be able.


Chapter 11
Introduction to Linear
Regression and
Correlation Analysis
Chapter 11 - Chapter Outcomes
After studying the material in this chapter, you
should be able to:
Calculate and interpret the simple correlation
between two variables.
Determine whether the correlation is significant.
Calculate and interpret the simple linear
regression coefficients for a set of data.
Understand the basic assumptions behind
regression analysis.
Determine whether a regression model is
significant.
Calculate and interpret confidence intervals
for the regression coefficients.
Recognize regression analysis applications
for purposes of prediction and description.
Recognize some potential problems if
regression analysis is used incorrectly.
Recognize several nonlinear relationships
between two variables.
Scatter Diagrams
A scatter plot, also referred to as a
scatter diagram, is a graph that may be
used to represent the relationship
between two variables.
Dependent and Independent
Variables
A dependent variable is the variable to be
predicted or explained in a regression
model. This variable is assumed to be
functionally related to the independent
variable.
Dependent and Independent
Variables
An independent variable is the variable
related to the dependent variable in a
regression equation. The independent
variable is used in a regression model to
estimate the value of the dependent
variable.
Two Variable Relationships
(Figure 11-1)
Five panels, each plotting Y against X: (a) Linear, (b) Linear, (c) Curvilinear, (d) Curvilinear, (e) No Relationship
Correlation
The correlation coefficient is a quantitative
measure of the strength of the linear
relationship between two variables. The
correlation ranges from +1.0 to -1.0. A
correlation of ±1.0 indicates a perfect linear
relationship, whereas a correlation of 0
indicates no linear relationship.
Correlation
SAMPLE CORRELATION COEFFICIENT
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Correlation
SAMPLE CORRELATION COEFFICIENT
or the algebraic equivalent:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}} $$
Correlation
(Example 11-1)
(Table 11-1)
Sales (y)   Years (x)   yx        y^2          x^2
487         3           1,461     237,169      9
445         5           2,225     198,025      25
272         2           544       73,984       4
641         8           5,128     410,881      64
187         2           374       34,969       4
440         6           2,640     193,600      36
346         7           2,422     119,716      49
238         1           238       56,644       1
312         4           1,248     97,344       16
269         2           538       72,361       4
655         9           5,895     429,025      81
563         6           3,378     316,969      36
Sums:
4,855       55          26,091    2,240,687    329
Correlation
(Example 11-1)
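$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n(\sum x^2) - (\sum x)^2\right]\left[n(\sum y^2) - (\sum y)^2\right]}} $$
$$ r = \frac{12(26,091) - 55(4,855)}{\sqrt{\left[12(329) - (55)^2\right]\left[12(2,240,687) - (4,855)^2\right]}} = 0.8325 $$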
Correlation
(Example 11-1)
                      Sales          Years with Midwest
Sales                 1
Years with Midwest    0.832534056    1
The off-diagonal cell is the correlation between Years and Sales.
Excel Correlation Output
(Figure 11-5)
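The correlation reported for the Midwest data can be reproduced by hand. The following is a minimal Python sketch, not part of the original slides, that evaluates both forms of the sample correlation coefficient on the Table 11-1 values. The function names `correlation_deviation` and `correlation_sums` are illustrative choices.

```python
import math

# Table 11-1: sales (y) and years with the company (x)
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]


def correlation_deviation(x, y):
    """Sample correlation using deviations from the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)


def correlation_sums(x, y):
    """Sample correlation using the algebraically equivalent raw-sum form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_dev = correlation_deviation(years, sales)
r_sum = correlation_sums(years, sales)
print(f"r (deviation form) = {r_dev:.4f}")
print(f"r (raw-sum form)   = {r_sum:.4f}")
```

Both forms should agree to within floating-point rounding, which is a quick check that the column sums in the table were entered correctly.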
Correlation
TEST STATISTIC FOR CORRELATION
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad df = n - 2 $$
where:
t = Number of standard deviations r is from 0
r = Simple correlation coefficient
n = Sample size
Correlation Significance Test
(Example 11-1)
$H_0: \rho = 0.0$ (no correlation)
$H_A: \rho \neq 0.0$
$\alpha = 0.05$
Rejection regions: $\alpha/2 = 0.025$ in each tail, with critical values $-t_{0.025} = -2.228$ and $t_{0.025} = 2.228$
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.8325}{\sqrt{\dfrac{1 - 0.6931}{10}}} = 4.752 $$
Since t = 4.752 > 2.228, reject H0; there is a significant
linear relationship
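The arithmetic of this significance test can be sketched in a few lines of Python. This is an illustration rather than part of the slides: the critical value 2.228 is copied from the example instead of being looked up programmatically, and the variable names are assumptions.

```python
import math

r = 0.8325          # sample correlation from Example 11-1
n = 12              # sample size
t_critical = 2.228  # two-tailed critical value quoted in the example (alpha = 0.05, df = 10)

df = n - 2
# Test statistic: how many standard errors r lies from 0
t_stat = r / math.sqrt((1 - r ** 2) / df)
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```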
Correlation
Spurious correlation occurs when
there is a correlation between two
otherwise unrelated variables.
Simple Linear Regression
Analysis
Simple linear regression analysis
analyzes the linear relationship that
exists between a dependent variable
and a single independent variable.
Simple Linear Regression
Analysis
SIMPLE LINEAR REGRESSION MODEL
(POPULATION MODEL)
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where:
y = Value of the dependent variable
x = Value of the independent variable
$\beta_0$ = Population's y-intercept
$\beta_1$ = Slope of the population regression line
$\varepsilon$ = Error term, or residual
Simple Linear Regression
Analysis
The simple linear regression model has four
assumptions:
1. Individual values of the error terms, $\varepsilon_i$, are
statistically independent of one another.
2. The distribution of all possible values of $\varepsilon$ is normal.
3. The distributions of possible $\varepsilon_i$ values have equal
variances for all values of x.
4. The means of the dependent variable y, for all specified
values of the independent variable, can be
connected by a straight line called the population
regression model.
Simple Linear Regression
Analysis
REGRESSION COEFFICIENTS
In the simple regression model, there
are two coefficients: the intercept and
the slope.
Simple Linear Regression
Analysis
The interpretation of the regression slope
coefficient is that it gives the average change
in the dependent variable for a unit increase
in the independent variable. The slope
coefficient may be positive or negative,
depending on the relationship between the
two variables.
Simple Linear Regression
Analysis
The least squares criterion is used
for determining a regression line
that minimizes the sum of squared
residuals.
Simple Linear Regression
Analysis
A residual is the difference between
the actual value of the dependent
variable and the value predicted by
the regression model.
$$ \text{residual} = y - \hat{y} $$
Simple Linear Regression
Analysis
(Figure: the line $\hat{y} = 150 + 60x$, with Sales in Thousands on the Y axis and Years with Company on the X axis. At x = 4 the line predicts 390 and the actual value is 312.)
Residual = 312 - 390 = -78
Simple Linear Regression
Analysis
ESTIMATED REGRESSION MODEL
(SAMPLE MODEL)
$$ \hat{y}_i = b_0 + b_1 x $$
where:
$\hat{y}$ = Estimated, or predicted, y value
$b_0$ = Unbiased estimate of the regression intercept
$b_1$ = Unbiased estimate of the regression slope
x = Value of the independent variable
Simple Linear Regression
Analysis
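LEAST SQUARES EQUATIONS
$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
algebraic equivalent:
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} $$
and
$$ b_0 = \bar{y} - b_1 \bar{x} $$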
Simple Linear Regression
Analysis
SUM OF SQUARED ERRORS
$$ SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy $$
Simple Linear Regression Analysis
(Midwest Example)
(Table 11-3)
Sales (y)   Years (x)   xy        y^2          x^2
487         3           1,461     237,169      9
445         5           2,225     198,025      25
272         2           544       73,984       4
641         8           5,128     410,881      64
187         2           374       34,969       4
440         6           2,640     193,600      36
346         7           2,422     119,716      49
238         1           238       56,644       1
312         4           1,248     97,344       16
269         2           538       72,361       4
655         9           5,895     429,025      81
563         6           3,378     316,969      36
Sums:
4,855       55          26,091    2,240,687    329
Simple Linear Regression
Analysis
(Table 11-3)
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{26,091 - \dfrac{55(4,855)}{12}}{329 - \dfrac{(55)^2}{12}} = 49.9101 $$
$$ b_0 = \bar{y} - b_1 \bar{x} = 404.5833 - 49.9101(4.5833) = 175.8288 $$
The least squares regression line is:
$$ \hat{y} = 175.8288 + 49.9101(x) $$
Simple Linear Regression
Analysis
(Figure 11-11)
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.832534056
R Square             0.693112955
Adjusted R Square    0.662424251
Standard Error       92.10553441
Observations         12

ANOVA
             df    SS             MS             F              Significance F
Regression   1     191600.622     191600.622     22.58527906    0.000777416
Residual     10    84834.29469    8483.429469
Total        11    276434.9167

                     Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%     Lower 95.0%   Upper 95.0%
Intercept            175.8288191    54.98988674      3.197475563   0.00953244    53.30369475   298.3539434   53.30369475   298.3539434
Years with Midwest   49.91007584    10.50208428      4.752397191   0.000777416   26.50996978   73.3101819    26.50996978   73.3101819

Excel Midwest Distribution Results
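The intercept and slope in the Excel output can be reproduced from the least squares equations. The sketch below is an illustration under the assumption that the Table 11-3 values are the full data set; the function name `least_squares` is hypothetical.

```python
# Table 11-3: sales (y) and years with the company (x)
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: algebraic (raw-sum) form of the least squares equation
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(years, sales)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")
```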
Least Squares Regression
Properties
The sum of the residuals from the least
squares regression line is 0.
The sum of the squared residuals is a
minimum.
The simple regression line always passes
through the mean of the y variable and
the mean of the x variable.
The least squares coefficients are unbiased
estimates of $\beta_0$ and $\beta_1$.
Simple Linear Regression
Analysis
SUM OF RESIDUALS
$$ \sum (y - \hat{y}) = 0 $$
SUM OF SQUARED RESIDUALS
$$ \sum (y - \hat{y})^2 $$
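These properties can be checked numerically on the Midwest data. The following sketch is illustrative only: it fits the line, sums the residuals, confirms the line passes through the two means, and compares the sum of squared residuals against a few arbitrarily nudged lines. The nudge sizes are assumptions chosen for the demonstration, so this is a spot check and not a proof of minimality.

```python
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares coefficients (deviation form)
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sxx
b0 = y_bar - b1 * x_bar


def sum_squared_residuals(intercept, slope):
    """Sum of squared residuals for an arbitrary candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, sales))


residuals = [y - (b0 + b1 * x) for x, y in zip(years, sales)]
sum_of_residuals = sum(residuals)
sse_fit = sum_squared_residuals(b0, b1)

# Nudge each coefficient up and down; every nudged line should fit worse
nudges = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]
sse_nudged = [sum_squared_residuals(b0 + d0, b1 + d1) for d0, d1 in nudges]

passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

print(f"sum of residuals     = {sum_of_residuals:.2e}")
print(f"SSE at least squares = {sse_fit:.2f}")
print(f"passes through means = {passes_through_means}")
```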
Simple Linear Regression
Analysis
TOTAL SUM OF SQUARES
$$ TSS = \sum (y - \bar{y})^2 $$
where:
TSS = Total sum of squares
n = Sample size
y = Values of the dependent variable
$\bar{y}$ = Average value of the dependent variable
Simple Linear Regression
Analysis
SUM OF SQUARES ERROR (RESIDUALS)
$$ SSE = \sum (y - \hat{y})^2 $$
where:
SSE = Sum of squares error
n = Sample size
y = Values of the dependent variable
$\hat{y}$ = Estimated value for the average of y for the
given x value
Simple Linear Regression
Analysis
SUM OF SQUARES REGRESSION
$$ SSR = \sum (\hat{y} - \bar{y})^2 $$
where:
SSR = Sum of squares regression
$\bar{y}$ = Average value of the dependent variable
y = Values of the dependent variable
$\hat{y}$ = Estimated value for the average of y for the
given x value
Simple Linear Regression
Analysis
SUMS OF SQUARES
$$ TSS = SSE + SSR $$
Simple Linear Regression
Analysis
The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by its
relationship with the independent variable.
The coefficient of determination is also
called R-squared and is denoted as $R^2$.
Simple Linear Regression
Analysis
COEFFICIENT OF DETERMINATION ($R^2$)
$$ R^2 = \frac{SSR}{TSS} $$
Simple Linear Regression
Analysis
(Midwest Example)
COEFFICIENT OF DETERMINATION ($R^2$)
$$ R^2 = \frac{SSR}{TSS} = \frac{191,600.62}{276,434.90} = 0.6931 $$
69.31% of the variation in the sales data for this
sample can be explained by the linear relationship
between sales and years of experience.
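The sums of squares behind this figure can be recomputed directly from the data. The sketch below is an illustration, with variable names chosen here, that builds TSS, SSE and SSR from their definitions and then forms the ratio SSR/TSS.

```python
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares fit
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in years]

# Sums of squares from their definitions
tss = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)

r_squared = ssr / tss

print(f"TSS = {tss:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"R^2 = {r_squared:.4f}")
```

Because TSS = SSE + SSR, the same ratio could equally be written as 1 - SSE/TSS.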
Simple Linear Regression
Analysis
COEFFICIENT OF DETERMINATION
SINGLE INDEPENDENT VARIABLE CASE
$$ R^2 = r^2 $$
where:
$R^2$ = Coefficient of determination
r = Simple correlation coefficient
Simple Linear Regression
Analysis
STANDARD DEVIATION OF THE
REGRESSION SLOPE COEFFICIENT
(POPULATION)
$$ \sigma_{b_1} = \frac{\sigma_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} $$
where:
$\sigma_{b_1}$ = Standard deviation of the regression slope
(Called the standard error of the slope)
$\sigma_\varepsilon$ = Population standard error of the estimate
Simple Linear Regression
Analysis
ESTIMATOR FOR THE STANDARD ERROR
OF THE ESTIMATE
$$ s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} $$
where:
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the model
Simple Linear Regression
Analysis
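ESTIMATOR FOR THE STANDARD
DEVIATION OF THE REGRESSION SLOPE
$$ s_{b_1} = \frac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s_\varepsilon}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}} $$
where:
$s_{b_1}$ = Estimate of the standard error of the least squares
slope
$s_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}}$ = Sample standard error of the estimate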
Simple Linear Regression
Analysis
TEST STATISTIC FOR TEST OF
SIGNIFICANCE OF THE REGRESSION
SLOPE
$$ t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad df = n - 2 $$
where:
$b_1$ = Sample regression slope coefficient
$\beta_1$ = Hypothesized slope
$s_{b_1}$ = Estimator of the standard error of the slope
Significance Test of
Regression Slope
(Example 11-5)
$H_0: \beta_1 = 0.0$
$H_A: \beta_1 \neq 0.0$
$\alpha = 0.05$
Rejection regions: $\alpha/2 = 0.025$ in each tail, with critical values $-t_{0.025} = -2.228$ and $t_{0.025} = 2.228$
$$ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{49.91 - 0}{10.50} = 4.753 $$
Since t = 4.753 > 2.228, reject H0: conclude that the
true slope is not zero
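The standard error of the slope and the resulting t statistic can be recomputed from the raw data. The following sketch is illustrative, with names chosen here; the critical value 2.228 is taken from the example and is not derived in code.

```python
import math

sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares fit
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, then standard error of the slope
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))
s_e = math.sqrt(sse / (n - 2))
s_b1 = s_e / math.sqrt(sxx)

hypothesized_slope = 0.0  # H0: beta_1 = 0
t_stat = (b1 - hypothesized_slope) / s_b1
t_critical = 2.228        # quoted in the example for alpha = 0.05, df = 10
reject_h0 = abs(t_stat) > t_critical

print(f"s_e = {s_e:.4f}, s_b1 = {s_b1:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```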
Simple Linear Regression
Analysis
MEAN SQUARE REGRESSION
$$ MSR = \frac{SSR}{k} $$
where:
SSR = Sum of squares regression
k = Number of independent variables in the model
Simple Linear Regression
Analysis
MEAN SQUARE ERROR
$$ MSE = \frac{SSE}{n - k - 1} $$
where:
SSE = Sum of squares error
n = Sample size
k = Number of independent variables in the model
Significance Test
(Example 11-6)
$H_0: \beta_1 = 0.0$
$H_A: \beta_1 \neq 0.0$
$\alpha = 0.05$
$$ F\ \text{ratio} = \frac{MSR}{MSE} = \frac{191,600.6}{8,483.43} = 22.59 $$
Rejection region: $\alpha = 0.05$ in the upper tail, with critical value $F_{0.05} = 4.96$
Since F = 22.59 > 4.96, reject H0: conclude that the
regression model explains a significant amount of the
variation in the dependent variable
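The F ratio follows from the two mean squares. As a hedged illustration, the sketch below starts from the sums of squares shown in the Excel ANOVA table instead of refitting the model; the critical value 4.96 is copied from the example.

```python
# Values from the Excel ANOVA table for the Midwest example
ssr = 191600.622      # sum of squares regression
sse = 84834.29469     # sum of squares error
n = 12                # sample size
k = 1                 # number of independent variables

msr = ssr / k
mse = sse / (n - k - 1)
f_ratio = msr / mse

f_critical = 4.96     # quoted in the example for alpha = 0.05
reject_h0 = f_ratio > f_critical

print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_ratio:.2f}, reject H0: {reject_h0}")
```

With a single independent variable, the F ratio equals the square of the slope t statistic (4.752 squared is about 22.59), so the two tests lead to the same conclusion.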
Simple Regression Steps
Develop a scatter plot of y and x. You are
looking for a linear relationship between
the two variables.
Calculate the least squares regression line
for the sample data.
Calculate the correlation coefficient and the
simple coefficient of determination, $R^2$.
Conduct one of the significance tests.
Simple Linear Regression
Analysis
CONFIDENCE INTERVAL ESTIMATE FOR
THE REGRESSION SLOPE
$$ b_1 \pm t_{\alpha/2}\, s_{b_1} $$
or equivalently:
$$ b_1 \pm t_{\alpha/2} \frac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}}, \qquad df = n - 2 $$
where:
$s_{b_1}$ = Standard error of the regression slope
coefficient
$s_\varepsilon$ = Standard error of the estimate
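Applied to the Midwest example, this interval can be sketched as follows. The slope and its standard error are taken from the Excel output, and the critical value 2.228 is the rounded table value used in the earlier examples, so the limits may differ slightly from Excel's in the third decimal place.

```python
b1 = 49.91007584     # slope from the Excel output
s_b1 = 10.50208428   # standard error of the slope from the Excel output
t_critical = 2.228   # t for 95% confidence with df = 10 (rounded table value)

margin = t_critical * s_b1
lower = b1 - margin
upper = b1 + margin

print(f"95% confidence interval for the slope: ({lower:.2f}, {upper:.2f})")
```

Because the interval does not contain zero, it is consistent with rejecting the hypothesis of a zero slope at the 0.05 level.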
Simple Linear Regression
Analysis
CONFIDENCE INTERVAL FOR $\mu_{y|x_p}$
$$ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} $$
where:
$\hat{y}$ = Point estimate of the dependent variable
t = Critical value with n - 2 d.f.
$s_\varepsilon$ = Standard error of the estimate
n = Sample size
$x_p$ = Specific value of the independent variable
$\bar{x}$ = Mean of independent variable observations
Simple Linear Regression
Analysis
PREDICTION INTERVAL FOR $y | x_p$
$$ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x - \bar{x})^2}} $$
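To see how the two intervals differ, the sketch below evaluates both for the Midwest data at x_p = 4 years. The choice of x_p is an assumption made purely for illustration and is not taken from the slides; the critical value 2.228 is the rounded table value used in the earlier examples.

```python
import math

sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares fit and standard error of the estimate
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, sales))
s_e = math.sqrt(sse / (n - 2))

x_p = 4              # hypothetical value of x chosen for this illustration
t_critical = 2.228   # 95% confidence, df = 10 (rounded table value)

y_hat = b0 + b1 * x_p
distance_term = (x_p - x_bar) ** 2 / sxx

# Half-width for the mean of y at x_p, and for an individual y at x_p
ci_half = t_critical * s_e * math.sqrt(1 / n + distance_term)
pi_half = t_critical * s_e * math.sqrt(1 + 1 / n + distance_term)

print(f"point estimate      : {y_hat:.2f}")
print(f"confidence interval : ({y_hat - ci_half:.2f}, {y_hat + ci_half:.2f})")
print(f"prediction interval : ({y_hat - pi_half:.2f}, {y_hat + pi_half:.2f})")
```

The prediction interval is always the wider of the two, because it also accounts for the scatter of an individual observation around the regression line.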
Residual Analysis
Before using a regression model for
description or prediction, you should do a
check to see if the assumptions concerning
the normal distribution and constant
variance of the error terms have been
satisfied. One way to do this is through
the use of residual plots.
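A graphical residual plot needs a plotting library, so as a minimal text-only stand-in the sketch below lists each residual against its x value and fitted value for the Midwest data. The bar of asterisks is an illustrative device chosen here, scaled at one asterisk per 10 units of residual.

```python
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(sales) / n

# Least squares fit
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in years]
residuals = [y - f for y, f in zip(sales, fitted)]

# List residuals in order of x; look for fanning out (unequal variance)
# or a systematic curve (a nonlinear relationship)
for x, f, e in sorted(zip(years, fitted, residuals)):
    sign = "+" if e >= 0 else "-"
    bar = "*" * round(abs(e) / 10)
    print(f"x = {x:2d}  fitted = {f:7.1f}  residual = {e:8.1f}  {sign}{bar}")
```

If the spread of the residuals looks roughly constant across x and shows no obvious pattern, the constant-variance assumption is at least not contradicted by this sample.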
Key Terms
Coefficient of Determination
Correlation Coefficient
Dependent Variable
Independent Variable
Least Squares Criterion
Regression Coefficients
Regression Slope Coefficient
Residual
Scatter Plot
Simple Linear Regression Analysis
Spurious Correlation