Transcript Chapter 9
Chapter 9
Correlation and Regression
© 2012 Pearson Education, Inc. All rights reserved.
Chapter Outline
• 9.1 Correlation
• 9.2 Linear Regression
• 9.3 Measures of Regression and Prediction Intervals
• 9.4 Multiple Regression
Section 9.1
Correlation
Section 9.1 Objectives
• Introduce linear correlation, independent and dependent variables, and the types of correlation
• Find a correlation coefficient
• Test a population correlation coefficient ρ using a table
• Perform a hypothesis test for a population correlation coefficient ρ
• Distinguish between correlation and causation
Correlation
• Correlation: A relationship between two variables.
• The data can be represented by ordered pairs (x, y), where x is the independent (or explanatory) variable and y is the dependent (or response) variable.
KNOW WHICH IS X AND WHICH IS Y.
Correlation
A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.
Example:
x     1    2    3    4    5
y    –4   –2   –1    0    2
Types of Correlation
• Negative linear correlation: As x increases, y tends to decrease.
• Positive linear correlation: As x increases, y tends to increase.
• No correlation
• Nonlinear correlation
Example: Constructing a Scatter Plot
An economist wants to determine whether there is a linear relationship between a country's gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation.
(Source: World Bank and U.S. Energy Information Administration)
GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6                        428.2
3.6                        828.8
4.9                        1214.2
1.1                        444.6
0.9                        264.0
2.9                        415.3
2.7                        571.8
2.3                        454.9
1.6                        358.7
1.5                        573.5
Solution: Constructing a Scatter Plot
There appears to be a positive linear correlation. As the gross domestic product increases, the carbon dioxide emissions tend to increase.
Example: Constructing a Scatter Plot Using Technology
Old Faithful, located in Yellowstone National Park, is the world's most famous geyser. The duration (in minutes) of several of Old Faithful's eruptions and the times (in minutes) until the next eruption are shown in the table. Using a TI-83/84, display the data in a scatter plot. Determine the type of correlation.
Duration, x    Time, y    Duration, x    Time, y
1.80           56         3.78           79
1.82           58         3.83           85
1.90           62         3.88           80
1.93           56         4.10           89
1.98           57         4.27           90
2.05           57         4.30           89
2.13           60         4.43           89
2.30           57         4.47           86
2.37           61         4.53           89
2.82           73         4.55           86
3.13           76         4.60           92
3.27           77         4.63           91
3.65           77
Solution: Constructing a Scatter Plot Using Technology
• Enter the x-values into list L1 and the y-values into list L2 (STAT > Edit…).
• Use Stat Plot (STATPLOT) to construct the scatter plot.
• Use Zoom 9 (ZoomStat) to see the picture properly.
From the scatter plot, it appears that the variables have a positive linear correlation.
Correlation Coefficient
• Correlation coefficient: A measure of the strength and the direction of a linear relationship between two variables.
• The symbol r represents the sample correlation coefficient.
• A formula for r is
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
where n is the number of data pairs.
• The population correlation coefficient is represented by ρ (rho).
Correlation Coefficient
• The range of the correlation coefficient is –1 to 1.
• If r = –1, there is a perfect negative correlation.
• If r is close to 0, there is no linear correlation. (There may still be a correlation, however, such as a nonlinear one.)
• If r = 1, there is a perfect positive correlation.
Linear Correlation
• r = –0.91: Strong negative correlation
• r = 0.88: Strong positive correlation
• r = 0.42: Weak positive correlation
• r = 0.07: Nonlinear correlation
Calculating a Correlation Coefficient
1. In words: Find the sum of the x-values.
   In symbols: Σx
2. In words: Find the sum of the y-values.
   In symbols: Σy
3. In words: Multiply each x-value by its corresponding y-value and find the sum.
   In symbols: Σxy
Calculating a Correlation Coefficient
4. In words: Square each x-value and find the sum.
   In symbols: Σx²
5. In words: Square each y-value and find the sum.
   In symbols: Σy²
6. In words: Use these five sums to calculate the correlation coefficient.
   In symbols:
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
Example: Finding the Correlation Coefficient
Calculate the correlation coefficient for the gross domestic products and carbon dioxide emissions data. What can you conclude?
GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6                        428.2
3.6                        828.8
4.9                        1214.2
1.1                        444.6
0.9                        264.0
2.9                        415.3
2.7                        571.8
2.3                        454.9
1.6                        358.7
1.5                        573.5
Solution: Finding the Correlation Coefficient
x      y         xy         x²       y²
1.6    428.2     685.12     2.56     183,355.24
3.6    828.8     2983.68    12.96    686,909.44
4.9    1214.2    5949.58    24.01    1,474,281.64
1.1    444.6     489.06     1.21     197,669.16
0.9    264.0     237.6      0.81     69,696
2.9    415.3     1204.37    8.41     172,474.09
2.7    571.8     1543.86    7.29     326,955.24
2.3    454.9     1046.27    5.29     206,934.01
1.6    358.7     573.92     2.56     128,665.69
1.5    573.5     860.25     2.25     328,902.25
Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76
Solution: Finding the Correlation Coefficient
Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{10(15{,}573.71) - (23.1)(5554)}{\sqrt{10(67.35) - 23.1^2}\,\sqrt{10(3{,}775{,}842.76) - 5554^2}} \]
\[ = \frac{27{,}439.7}{\sqrt{139.89}\,\sqrt{6{,}911{,}511.6}} \approx 0.882 \]
r ≈ 0.882 suggests a strong positive linear correlation. As the gross domestic product increases, the carbon dioxide emissions also tend to increase.
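The hand calculation above can be checked with a short script. This is only a sketch in plain Python (the slides themselves use a TI-83/84); the function name `correlation_coefficient` and the variable names are chosen here for illustration.

```python
import math

# GDP (trillions of $) and CO2 emissions (millions of metric tons) from the example.
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]


def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r, built from the five sums in the slides."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


r_gdp = correlation_coefficient(gdp, co2)
print(round(r_gdp, 3))  # prints 0.882
```

Working from the sums mirrors the "In Words / In Symbols" steps, so each intermediate value can be compared with the table.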
On a Calculator
• Enter the data in List 1 and List 2.
• To graph it, press 2nd, then "Y=" (STAT PLOT), and turn the plot on.
• Check the window to ensure that all values will fit into the window:
  – adjust x or y as required, or
  – use Zoom 9.
• To calculate r, go to STAT > CALC > 4: LinReg(ax+b).
• Here you will find both r and r².
• Notice also that this gives the equation of the line. Normally we use "y = mx + b"; the calculator calls it "y = ax + b".
Larson/Farber 5th ed.
Example: Using Technology to Find a Correlation Coefficient
Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you conclude?
Go ahead and enter these in List 1 and List 2. We will use them to calculate r.
Duration, x    Time, y    Duration, x    Time, y
1.80           56         3.78           79
1.82           58         3.83           85
1.90           62         3.88           80
1.93           56         4.10           89
1.98           57         4.27           90
2.05           57         4.30           89
2.13           60         4.43           89
2.30           57         4.47           86
2.37           61         4.53           89
2.82           73         4.55           86
3.13           76         4.60           92
3.27           77         4.63           91
3.65           77
Solution: Using Technology to Find a Correlation Coefficient
STAT > Calc
To calculate r, you must first enter the DiagnosticOn command found in the Catalog menu.
r ≈ 0.979 suggests a strong positive correlation.
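For readers without a TI-83/84, the same value can be reproduced in a few lines of Python. This is a sketch, not the calculator's method; `sample_r` is a name made up here, and it uses the equivalent deviations-from-the-mean form of the formula for r.

```python
import math

# Old Faithful data: eruption duration (min) and time until next eruption (min).
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
wait = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
        76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
        86, 89, 86, 92, 91]


def sample_r(xs, ys):
    """Sample correlation coefficient using deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_old_faithful = sample_r(duration, wait)
print(round(r_old_faithful, 3))
```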
Using a Table to Test a Population Correlation Coefficient ρ
• Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.
• Use Table 11 in Appendix B.
• If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is significant.
Using a Table to Test a Population Correlation Coefficient ρ
• Determine whether ρ is significant for five pairs of data (n = 5) at a level of significance of α = 0.01.
• In the table, find the entry for the number of pairs of data in the sample (n = 5) and the level of significance (α = 0.01). The critical value is 0.959.
• If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to conclude that the correlation is significant.
On a Calculator
• Normally H0 is "=". We are testing "can we conclude there is a relationship?"
• Instead of using the table, we use LinRegTTest.
• Go to STAT > TESTS > LinRegTTest (note: you must have entered the data in List1 and List2).
• Notice you need to know what the alternative hypothesis is to test it.
• For the line that says "RegEQ", go to VARS > Y-VARS > Function > Y1 and press ENTER. This will enter Y1 on the RegEQ line.
• This stores the equation of the line in Y1, so we can graph it if we want.
On a Calculator
• Notice that you see β and ρ, followed by "not equal to", "less than", or "greater than" 0. This is the alternative hypothesis. (Remember β is the slope of the population regression line; it is 0 exactly when the population correlation coefficient ρ is 0.)
• Normally we want "not equal to".
• After you calculate the values, you will find r.
• You will also see the t-score and, more importantly, the P-value.
• This time we WANT the P-value to be less than α. This means the test statistic is in the tail (the rejection region), so we reject H0 and conclude that the correlation is significant. The MORE in the tail, the stronger the evidence of a correlation.
Graphing
• Again, choose a stat plot to view the points entered in List1 and List2.
• Select Zoom 9 to see the stat plot best.
• This time you will also see a line through the stat plot.
• This is the line you created and stored in Y1.
• You can now see the line that "best fits" the plotted points.
• Suppose you had 30 data points and wanted to predict the 35th data point.
• Go to VARS > Y-VARS > Function > Y1 and press ENTER.
• Y1 will show on the screen. Enter Y1(35) and press ENTER.
• This will show the predicted 35th entry based on your line.
• 35 would be the x-value (explanatory or independent) and your answer would be the y-value (dependent or response).
Using a Table to Test a Population Correlation Coefficient ρ
1. In words: Determine the number of pairs of data in the sample.
   In symbols: Determine n.
2. In words: Specify the level of significance.
   In symbols: Identify α.
3. In words: Find the critical value.
   In symbols: Use Table 11 in Appendix B.
Using a Table to Test a Population Correlation Coefficient ρ
4. In words: Decide if the correlation is significant.
   In symbols: If |r| > critical value, the correlation is significant. Otherwise, there is not enough evidence to support that the correlation is significant.
5. In words: Interpret the decision in the context of the original claim.
Example: Using a Table to Test a Population Correlation Coefficient ρ
Using the Old Faithful data, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.
Duration, x    Time, y    Duration, x    Time, y
1.80           56         3.78           79
1.82           58         3.83           85
1.90           62         3.88           80
1.93           56         4.10           89
1.98           57         4.27           90
2.05           57         4.30           89
2.13           60         4.43           89
2.30           57         4.47           86
2.37           61         4.53           89
2.82           73         4.55           86
3.13           76         4.60           92
3.27           77         4.63           91
3.65           77
Solution: Using a Table to Test a Population Correlation Coefficient ρ
• n = 25, α = 0.05
• |r| ≈ 0.979 > 0.396
• There is enough evidence at the 5% level of significance to conclude that there is a significant linear correlation between the duration of Old Faithful's eruptions and the time between eruptions.
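The table-based decision rule is simple enough to sketch in code. The dictionary below holds only the two critical values quoted in these slides (n = 5 at α = 0.01 and n = 25 at α = 0.05); it is not a copy of Table 11, and the names `CRITICAL_VALUES` and `is_significant` are hypothetical.

```python
# Critical values quoted in the slides; the full Table 11 has many more entries.
CRITICAL_VALUES = {
    (5, 0.01): 0.959,
    (25, 0.05): 0.396,
}


def is_significant(r, n, alpha):
    """Return True when |r| exceeds the critical value for (n, alpha)."""
    critical_value = CRITICAL_VALUES[(n, alpha)]  # KeyError if not listed above
    return abs(r) > critical_value


# Old Faithful example: 25 pairs, r of about 0.979, alpha = 0.05.
print(is_significant(0.979, 25, 0.05))  # prints True
```

Note that the comparison uses |r|, so a strong negative correlation is treated the same way as a strong positive one.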
Hypothesis Testing for a Population Correlation Coefficient ρ
• A hypothesis test can also be used to determine whether the sample correlation coefficient r provides enough evidence to conclude that the population correlation coefficient ρ is significant at a specified level of significance.
• A hypothesis test can be one-tailed or two-tailed.
Hypothesis Testing for a Population Correlation Coefficient ρ
• Left-tailed test
  H0: ρ ≥ 0 (no significant negative correlation)
  Ha: ρ < 0 (significant negative correlation)
• Right-tailed test
  H0: ρ ≤ 0 (no significant positive correlation)
  Ha: ρ > 0 (significant positive correlation)
• Two-tailed test
  H0: ρ = 0 (no significant correlation)
  Ha: ρ ≠ 0 (significant correlation)
The t-Test for the Correlation Coefficient
• The t-test can be used to test whether the correlation between two variables is significant.
• The test statistic is r.
• The standardized test statistic
\[ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \]
follows a t-distribution with d.f. = n – 2.
• In this text, only two-tailed hypothesis tests for ρ are considered.
Using the t-Test for ρ
1. In words: State the null and alternative hypothesis.
   In symbols: State H0 and Ha.
2. In words: Specify the level of significance.
   In symbols: Identify α.
3. In words: Identify the degrees of freedom.
   In symbols: d.f. = n – 2
4. In words: Determine the critical value(s) and rejection region(s).
   In symbols: Use Table 5 in Appendix B.
Using the t-Test for ρ
5. In words: Find the standardized test statistic.
   In symbols:
\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \]
6. In words: Make a decision to reject or fail to reject the null hypothesis.
   In symbols: If t is in the rejection region, reject H0. Otherwise fail to reject H0.
7. In words: Interpret the decision in the context of the original claim.
Example: t-Test for a Correlation Coefficient
Previously you calculated r ≈ 0.882. Test the significance of this correlation coefficient. Use α = 0.05.
GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6                        428.2
3.6                        828.8
4.9                        1214.2
1.1                        444.6
0.9                        264.0
2.9                        415.3
2.7                        571.8
2.3                        454.9
1.6                        358.7
1.5                        573.5
Solution: t-Test for a Correlation Coefficient
• H0: ρ = 0, Ha: ρ ≠ 0 (If you reject the null hypothesis, then there is evidence to conclude a linear correlation.)
• α = 0.05
• d.f. = 10 – 2 = 8
• Rejection region: t < –2.306 or t > 2.306
• Test statistic:
\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.882}{\sqrt{\dfrac{1 - (0.882)^2}{10 - 2}}} \approx 5.294 \]
• Decision: Reject H0.
At the 5% level of significance, there is enough evidence to conclude that there is a significant linear correlation between gross domestic products and carbon dioxide emissions.
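As a quick check of the arithmetic, the standardized test statistic can be computed directly. This is a sketch: the critical value 2.306 (d.f. = 8, two-tailed, α = 0.05) is typed in by hand from the t-table rather than computed, and the function name `t_statistic` is made up for this example.

```python
import math


def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = 0.882            # rounded sample correlation coefficient from Section 9.1
n = 10               # number of data pairs
critical_t = 2.306   # from the t-table: d.f. = 8, two-tailed, alpha = 0.05

t = t_statistic(r, n)
reject_h0 = abs(t) > critical_t  # two-tailed rejection region

print(round(t, 3), reject_h0)
```

Because the rounded value r = 0.882 is used, the result matches the slide's 5.294; an unrounded r gives a slightly different t.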
On the Calculator
• We still use LinRegTTest.
• This will give us a t-score should we need it.
• It also tells you the degrees of freedom, and everything else you need.
• Remember: We WANT p to be less than α. That tells us the test statistic is in the tails, so a sample correlation this strong would be unlikely if there were no correlation in the population. The more in the tails, the less likely it's chance alone.
Correlation and Causation
• The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.
• If there is a significant correlation between two variables, you should consider the following possibilities.
1. Is there a direct cause-and-effect relationship between the variables?
   • Does x cause y?
Correlation and Causation
2. Is there a reverse cause-and-effect relationship between the variables?
   • Does y cause x?
3. Is it possible that the relationship between the variables can be caused by a third variable or by a combination of several other variables?
4. Is it possible that the relationship between two variables may be a coincidence?
Section 9.1 Summary
• Introduced linear correlation, independent and dependent variables, and the types of correlation
• Found a correlation coefficient
• Tested a population correlation coefficient ρ using a table
• Performed a hypothesis test for a population correlation coefficient ρ
• Distinguished between correlation and causation
Assignment
• Page 495: 9–28
Section 9.2
Linear Regression
Section 9.2 Objectives
• Find the equation of a regression line
• Predict y-values using a regression equation
Regression lines
• After verifying that the linear correlation between two variables is significant, next we determine the equation of the line that best models the data (regression line).
• Can be used to predict the value of y for a given value of x.
Residuals
• Residual (know this): The difference between the observed y-value and the predicted y-value for a given x-value on the line.
For a given x-value, d_i = (observed y-value) – (predicted y-value)
[Figure: scatter plot with a regression line; the residuals d1 through d6 are the vertical distances between each observed y-value and the predicted y-value on the line.]
Regression Line
• Regression line (line of best fit): The line for which the sum of the squares of the residuals is a minimum.
• The equation of a regression line for an independent variable x and a dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept.
The Equation of a Regression Line
• ŷ = mx + b, where
\[ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} \]
• ȳ is the mean of the y-values in the data.
• x̄ is the mean of the x-values in the data.
• The regression line always passes through the point (x̄, ȳ).
Example: Finding the Equation of a Regression Line
Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data. Enter these values into two lists.
GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6                        428.2
3.6                        828.8
4.9                        1214.2
1.1                        444.6
0.9                        264.0
2.9                        415.3
2.7                        571.8
2.3                        454.9
1.6                        358.7
1.5                        573.5
Solution: Finding the Equation of a Regression Line
Recall from Section 9.1:
x      y         xy         x²       y²
1.6    428.2     685.12     2.56     183,355.24
3.6    828.8     2983.68    12.96    686,909.44
4.9    1214.2    5949.58    24.01    1,474,281.64
1.1    444.6     489.06     1.21     197,669.16
0.9    264.0     237.6      0.81     69,696
2.9    415.3     1204.37    8.41     172,474.09
2.7    571.8     1543.86    7.29     326,955.24
2.3    454.9     1046.27    5.29     206,934.01
1.6    358.7     573.92     2.56     128,665.69
1.5    573.5     860.25     2.25     328,902.25
Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76
Solution: Finding the Equation of a Regression Line
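Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76
\[ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(15{,}573.71) - (23.1)(5554)}{10(67.35) - 23.1^2} = \frac{27{,}439.7}{139.89} \approx 196.151977 \]
\[ b = \bar{y} - m\bar{x} \approx \frac{5554}{10} - (196.151977)\,\frac{23.1}{10} \approx 102.289 \]
Equation of the regression line: ŷ = 196.152x + 102.289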
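The slope and intercept above can be reproduced from the raw data. The following is a sketch in plain Python using the same sum-based formulas as the slides; the function name `regression_line` is an assumption for this example.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons).
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]


def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b


m, b = regression_line(gdp, co2)
print(round(m, 3), round(b, 3))  # prints 196.152 102.289
```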
Solution: Finding the Equation of a Regression Line
• To sketch the regression line, use any two x-values within the range of the data and calculate the corresponding y-values from the regression line.
On the Calculator
• Again, we can use STAT > TESTS > LinRegTTest.
• Set RegEQ to Y1 by going to VARS > Y-VARS > Function > Y1 and pressing ENTER.
• This will store the equation of the line in Y1.
• Then set the stat plot to "On".
• Choose Zoom 9 to plot the equation.
• Choose "Y=" to see the equation of the line.
• Notice the equation of the line is not in the normal "order" we are used to.
Example: Using Technology to Find a Regression Equation
Use a technology tool to find the equation of the regression line for the Old Faithful data.
Duration, x    Time, y    Duration, x    Time, y
1.80           56         3.78           79
1.82           58         3.83           85
1.90           62         3.88           80
1.93           56         4.10           89
1.98           57         4.27           90
2.05           57         4.30           89
2.13           60         4.43           89
2.30           57         4.47           86
2.37           61         4.53           89
2.82           73         4.55           86
3.13           76         4.60           92
3.27           77         4.63           91
3.65           77
Solution: Using Technology to Find a Regression Equation
The equation of the regression line is ŷ = 12.481x + 33.683.
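The technology step can also be done outside the calculator. The sketch below fits the least-squares line to the Old Faithful data in plain Python, using the deviations-from-the-mean form of the slope; the helper name `fit_line` is hypothetical.

```python
# Old Faithful data: eruption duration (min) and time until next eruption (min).
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
wait = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
        76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
        86, 89, 86, 92, 91]


def fit_line(xs, ys):
    """Least-squares slope and intercept via deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # the line passes through (x-bar, y-bar)
    return slope, intercept


slope, intercept = fit_line(duration, wait)
print(round(slope, 3), round(intercept, 3))
```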
Example: Predicting y-Values Using Regression Equations
The regression equation for the gross domestic products (in trillions of dollars) and carbon dioxide emissions (in millions of metric tons) data is ŷ = 196.152x + 102.289. Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from Section 9.1 that x and y have a significant linear correlation.)
1. 1.2 trillion dollars
2. 2.0 trillion dollars
3. 2.5 trillion dollars
Solution: Predicting y-Values Using Regression Equations
ŷ = 196.152x + 102.289
1. 1.2 trillion dollars
ŷ = 196.152(1.2) + 102.289 ≈ 337.671
When the gross domestic product is $1.2 trillion, the predicted CO2 emissions are about 337.671 million metric tons.
2. 2.0 trillion dollars
ŷ = 196.152(2.0) + 102.289 = 494.593
When the gross domestic product is $2.0 trillion, the predicted CO2 emissions are 494.593 million metric tons.
Solution: Predicting y-Values Using Regression Equations
3. 2.5 trillion dollars
ŷ = 196.152(2.5) + 102.289 = 592.669
When the gross domestic product is $2.5 trillion, the predicted CO2 emissions are 592.669 million metric tons.
Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict carbon dioxide emissions for gross domestic products such as $0.2 trillion or $14.5 trillion.
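The three predictions, and the warning about staying within the range of the data, can be sketched as a small function. The strict range check is an assumption made for this example: the slides allow x-values "in (or close to)" the range, and `predict_co2` is a hypothetical name.

```python
SLOPE = 196.152       # from the regression equation in the slides
INTERCEPT = 102.289
X_MIN, X_MAX = 0.9, 4.9  # range of GDP values in the original data


def predict_co2(gdp_trillions):
    """Predicted CO2 emissions (millions of metric tons) for a given GDP.

    Raises ValueError outside the observed x-range (a stricter rule than
    the slides' "in or close to the range", chosen here for simplicity).
    """
    if not X_MIN <= gdp_trillions <= X_MAX:
        raise ValueError("GDP is outside the range of the data (0.9 to 4.9)")
    return SLOPE * gdp_trillions + INTERCEPT


for gdp_value in (1.2, 2.0, 2.5):
    print(gdp_value, round(predict_co2(gdp_value), 3))
```

Calling `predict_co2(14.5)` raises an error instead of returning an extrapolated value.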
Section 9.2 Summary
• Found the equation of a regression line
• Predicted y-values using a regression equation
Assignment
• Page 505: 13–24
Section 9.3
Measures of Regression and Prediction Intervals
Section 9.3 Objectives
• Interpret the three types of variation about a regression line
• Find and interpret the coefficient of determination
• Find and interpret the standard error of the estimate for a regression line
• Construct and interpret a prediction interval for y
Variation About a Regression Line
• Three types of variation about a regression line:
  – Total variation
  – Explained variation
  – Unexplained variation
• To find the total variation, you must first calculate:
  – The total deviation
  – The explained deviation
  – The unexplained deviation
Variation About a Regression Line
\[ \text{Total deviation} = y_i - \bar{y} \]
\[ \text{Explained deviation} = \hat{y}_i - \bar{y} \]
\[ \text{Unexplained deviation} = y_i - \hat{y}_i \]
[Figure: for a data point (x_i, y_i) and the point (x_i, ŷ_i) on the regression line, the total deviation y_i – ȳ is split into the unexplained deviation y_i – ŷ_i and the explained deviation ŷ_i – ȳ.]
Variation About a Regression Line
• Total variation: The sum of the squares of the differences between the y-value of each ordered pair and the mean of y.
\[ \text{Total variation} = \sum \left(y_i - \bar{y}\right)^2 \]
• Explained variation: The sum of the squares of the differences between each predicted y-value and the mean of y.
\[ \text{Explained variation} = \sum \left(\hat{y}_i - \bar{y}\right)^2 \]
Variation About a Regression Line
• Unexplained variation: The sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.
\[ \text{Unexplained variation} = \sum \left(y_i - \hat{y}_i\right)^2 \]
The sum of the explained and unexplained variation is equal to the total variation.
Total variation = Explained variation + Unexplained variation
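The identity Total = Explained + Unexplained can be checked numerically on the GDP and CO2 data. This sketch fits the least-squares line with unrounded coefficients, because the identity holds exactly only for the true least-squares line; all variable names are chosen here for illustration.

```python
# GDP (x) and CO2 emissions (y) from the running example.
xs = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
ys = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]
n = len(xs)

# Unrounded least-squares slope and intercept.
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n

y_bar = sum_y / n
y_hat = [m * x + b for x in xs]  # predicted y-values

total_variation = sum((y - y_bar) ** 2 for y in ys)
explained_variation = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained_variation = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

# Share of the total variation that the line accounts for.
explained_share = explained_variation / total_variation

print(round(total_variation, 2))
print(round(explained_variation + unexplained_variation, 2))
print(round(explained_share, 3))
```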
Variation
• The explained variation can be explained by the relationship between x and y.
• The unexplained variation cannot be explained by the relationship between x and y, and is due to chance or other variables.
• This is all I want you to know for this concept.
Coefficient of Determination
• Coefficient of determination: The ratio of the explained variation to the total variation.
• Denoted by r².
\[ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} \]
The Difference between r and r²
• Remember r is the sample correlation coefficient.
• It is "a measure of the strength and the direction of a linear relationship between two variables."
• The closer |r| is to 1, the stronger the linear relationship in the sample.
• r² is the coefficient of determination.
• It is "the ratio of the explained variation to the total variation."
• The higher the r² value, the larger the share of the variation in y that is explained by the regression line.
Example: Coefficient of Determination
The correlation coefficient for the gross domestic products and carbon dioxide emissions data as calculated in Section 9.1 is r ≈ 0.882. Find the coefficient of determination. What does this tell you about the explained variation of the data about the regression line? About the unexplained variation?
Solution:
r² ≈ (0.882)² ≈ 0.778
About 77.8% of the variation in the carbon emissions can be explained by the variation in the gross domestic products. About 22.2% of the variation is unexplained.
The Standard Error of Estimate
• Standard error of estimate: The standard deviation of the observed y_i-values about the predicted ŷ-value for a given x_i-value.
• Denoted by s_e.
\[ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} \]
where n is the number of ordered pairs in the data set.
• The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate will be.
The Standard Error of Estimate
1. In words: Make a table that includes the column headings shown.
   In symbols: x_i, y_i, ŷ_i, (y_i – ŷ_i), (y_i – ŷ_i)²
2. In words: Use the regression equation to calculate the predicted y-values.
   In symbols: ŷ_i = mx_i + b
3. In words: Calculate the sum of the squares of the differences between each observed y-value and the corresponding predicted y-value.
   In symbols: Σ(y_i – ŷ_i)²
4. In words: Find the standard error of estimate.
   In symbols:
\[ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} \]
Example: Standard Error of Estimate
The regression equation for the gross domestic products and carbon dioxide emissions data as calculated in Section 9.2 is
ŷ = 196.152x + 102.289
Find the standard error of estimate.
Solution:
Use a table to calculate the sum of the squared differences of each observed y-value and the corresponding predicted y-value.
Solution: Standard Error of Estimate
x_i    y_i       ŷ_i          y_i – ŷ_i    (y_i – ŷ_i)²
1.6    428.2     416.1322     12.0678      145.63179684
3.6    828.8     808.4362     20.3638      414.68435044
4.9    1214.2    1063.4338    150.7662     22,730.44706244
1.1    444.6     318.0562     126.5438     16,013.33331844
0.9    264.0     278.8258     –14.8258     219.80434564
2.9    415.3     671.1298     –255.8298    65,448.88656804
2.7    571.8     631.8994     –60.0994     3611.93788036
2.3    454.9     553.4386     –98.5386     9709.85568996
1.6    358.7     416.1322     –57.4322     3298.45759684
1.5    573.5     396.517      176.983      31,322.982289
Σ(y_i – ŷ_i)² = 152,916.020898 (unexplained variation)
Solution: Standard Error of Estimate
• n = 10, Σ(y_i – ŷ_i)² = 152,916.020898
\[ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} = \sqrt{\frac{152{,}916.020898}{10 - 2}} \approx 138.255 \]
The standard error of estimate of the carbon dioxide emissions for a specific gross domestic product is about 138.255 million metric tons.
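The table and the final square root can be reproduced in a few lines. This sketch uses the rounded regression coefficients from the slides, so it matches the table values; the function name `standard_error_of_estimate` is an assumption for this example.

```python
import math

# GDP (x) and CO2 emissions (y) from the running example.
xs = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
ys = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]


def standard_error_of_estimate(xs, ys, slope, intercept):
    """s_e = sqrt( sum((y_i - y_hat_i)^2) / (n - 2) )."""
    n = len(xs)
    unexplained = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(unexplained / (n - 2))


# Rounded coefficients from the regression equation in Section 9.2.
s_e = standard_error_of_estimate(xs, ys, slope=196.152, intercept=102.289)
print(round(s_e, 3))  # prints 138.255
```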
Prediction Intervals
• Two variables have a bivariate normal distribution if for any fixed value of x, the corresponding values of y are normally distributed and for any fixed value of y, the corresponding x-values are normally distributed.
Prediction Intervals
• A prediction interval can be constructed for the true value of y.
• Given a linear regression equation ŷ = mx + b and x0, a specific value of x, a c-prediction interval for y is
ŷ – E < y < ŷ + E
where
\[ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} \]
• The point estimate is ŷ and the margin of error is E. The probability that the prediction interval contains y is c.
Constructing a Prediction Interval for y for a Specific Value of x
1. In words: Identify the number of ordered pairs in the data set n and the degrees of freedom.
   In symbols: d.f. = n – 2
2. In words: Use the regression equation and the given x-value to find the point estimate ŷ.
   In symbols: ŷ_i = mx_i + b
3. In words: Find the critical value t_c that corresponds to the given level of confidence c.
   In symbols: Use Table 5 in Appendix B.
Constructing a Prediction Interval for y for a Specific Value of x
4. In words: Find the standard error of estimate s_e.
   In symbols:
\[ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} \]
5. In words: Find the margin of error E.
   In symbols:
\[ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} \]
6. In words: Find the left and right endpoints and form the prediction interval.
   In symbols:
   Left endpoint: ŷ – E
   Right endpoint: ŷ + E
   Interval: ŷ – E < y < ŷ + E
Example: Constructing a Prediction Interval
Construct a 95% prediction interval for the carbon dioxide emission when the gross domestic product is $3.5 trillion. What can you conclude?
Recall: n = 10, ŷ = 196.152x + 102.289, s_e = 138.255, Σx = 23.1, Σx² = 67.35, x̄ = 2.31
Solution:
Point estimate: ŷ = 196.152(3.5) + 102.289 ≈ 788.821
Critical value: d.f. = n – 2 = 10 – 2 = 8, t_c = 2.306
Solution: Constructing a Prediction Interval
\[ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} = (2.306)(138.255)\sqrt{1 + \frac{1}{10} + \frac{10(3.5 - 2.31)^2}{10(67.35) - (23.1)^2}} \approx 349.424 \]
Left endpoint: ŷ – E = 788.821 – 349.424 = 439.397
Right endpoint: ŷ + E = 788.821 + 349.424 = 1138.245
439.397 < y < 1138.245
You can be 95% confident that when the gross domestic product is $3.5 trillion, the carbon dioxide emissions will be between 439.397 and 1138.245 million metric tons.
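The margin of error and the interval endpoints can be verified with a short script. This is a sketch: the critical value t_c = 2.306 is entered by hand from the t-table, the other inputs are the rounded values recalled in the example, and the function name `prediction_interval` is made up here.

```python
import math


def prediction_interval(x0, n, sum_x, sum_x2, slope, intercept, s_e, t_c):
    """Return (left, right) endpoints of the prediction interval for y at x0."""
    x_bar = sum_x / n
    y_hat = slope * x0 + intercept  # point estimate
    margin = t_c * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - margin, y_hat + margin


left, right = prediction_interval(
    x0=3.5, n=10, sum_x=23.1, sum_x2=67.35,
    slope=196.152, intercept=102.289,
    s_e=138.255, t_c=2.306,  # t_c from the t-table, d.f. = 8, c = 0.95
)
print(round(left, 3), round(right, 3))
```

Moving x0 farther from the mean 2.31 makes the term under the square root larger, so the interval widens.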
Section 9.3 Summary
• Interpreted the three types of variation about a regression line
• Found and interpreted the coefficient of determination
• Found and interpreted the standard error of the estimate for a regression line
• Constructed and interpreted a prediction interval for y
Section 9.4
Multiple Regression
Section 9.4 Objectives
• Use technology to find a multiple regression equation, the standard error of estimate, and the coefficient of determination
• Use a multiple regression equation to predict y-values
Multiple Regression Equation
• In many instances, a better prediction can be found for a dependent (response) variable by using more than one independent (explanatory) variable.
• For example, a more accurate prediction for the carbon dioxide emissions discussed in previous sections might be made by considering the number of cars as well as the gross domestic product.
Multiple Regression Equation
• Multiple regression equation:
\[ \hat{y} = b + m_1x_1 + m_2x_2 + m_3x_3 + \dots + m_kx_k \]
• x1, x2, x3, …, xk are independent variables
• b is the y-intercept
• y is the dependent variable
* Because the mathematics associated with this concept is complicated, technology is generally used to calculate the multiple regression equation.
Example: Finding a Multiple Regression Equation
A researcher wants to determine how employee salaries at a certain company are related to the length of employment, previous experience, and education. The researcher selects eight employees from the company and obtains the data shown on the next slide. Use MINITAB to find a multiple regression equation that models the data.
Example: Finding a Multiple Regression Equation
Employee    Salary, y    Employment (yrs), x1    Experience (yrs), x2    Education (yrs), x3
A           57,310       10                      2                       16
B           57,380       5                       6                       16
C           54,135       3                       1                       12
D           56,985       6                       5                       14
E           58,715       8                       8                       16
F           60,620       20                      0                       12
G           59,200       8                       4                       18
H           60,320       14                      6                       17
Solution: Finding a Multiple Regression Equation
• Enter the y-values in C1 and the x1-, x2-, and x3-values in C2, C3, and C4, respectively.
• Select "Regression > Regression…" from the Stat menu.
• Use the salaries as the response variable and the remaining data as the predictors.
Solution: Finding a Multiple Regression Equation
The regression equation is ŷ = 49,764 + 364x1 + 228x2 + 267x3
Predicting y-Values
• After finding the equation of the multiple regression line, you can use the equation to predict y-values over the range of the data.
• To predict y-values, substitute the given value for each independent variable into the equation, then calculate ŷ.
Example: Predicting y-Values
Use the regression equation ŷ = 49,764 + 364x1 + 228x2 + 267x3 to predict an employee's salary given 12 years of current employment, 5 years of experience, and 16 years of education.
Solution:
ŷ = 49,764 + 364(12) + 228(5) + 267(16) = 59,544
The employee's predicted salary is $59,544.
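Substituting into a multiple regression equation is a weighted sum, which the sketch below makes explicit. The coefficients are the rounded values from the MINITAB equation in the slides; the function name `predict_salary` and the constant names are hypothetical.

```python
# Rounded coefficients from the slides: y-hat = 49,764 + 364*x1 + 228*x2 + 267*x3
INTERCEPT = 49764
COEFFICIENTS = (364, 228, 267)  # employment, experience, education (per year)


def predict_salary(employment_yrs, experience_yrs, education_yrs):
    """Predicted salary in dollars from the multiple regression equation."""
    values = (employment_yrs, experience_yrs, education_yrs)
    return INTERCEPT + sum(m * x for m, x in zip(COEFFICIENTS, values))


print(predict_salary(12, 5, 16))  # prints 59544
```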
Section 9.4 Summary
• Used technology to find a multiple regression equation, the standard error of estimate, and the coefficient of determination
• Used a multiple regression equation to predict y-values