Transcript Chapter 9

Chapter 9

Correlation and Regression

© 2012 Pearson Education, Inc. All rights reserved.

Chapter Outline

• 9.1 Correlation
• 9.2 Linear Regression
• 9.3 Measures of Regression and Prediction Intervals
• 9.4 Multiple Regression

Section 9.1

Correlation

Section 9.1 Objectives

• Introduce linear correlation, independent and dependent variables, and the types of correlation
• Find a correlation coefficient
• Test a population correlation coefficient ρ using a table
• Perform a hypothesis test for a population correlation coefficient ρ
• Distinguish between correlation and causation

Correlation

• Correlation: a relationship between two variables.
• The data can be represented by ordered pairs (x, y).
  – x is the independent (or explanatory) variable
  – y is the dependent (or response) variable

Know which is x and which is y.

Correlation

A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.

Example:

x | 1 | 2 | 3 | 4 | 5
y | –4 | –2 | –1 | 0 | 2

[Figure: scatter plot of the five points]

Types of Correlation

• Negative Linear Correlation: as x increases, y tends to decrease.
• Positive Linear Correlation: as x increases, y tends to increase.
• No Correlation
• Nonlinear Correlation

[Figure: four scatter plots, one for each type of correlation]

Example: Constructing a Scatter Plot

An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation.

(Source: World Bank and U.S. Energy Information Administration)

GDP (trillions of $), x | CO2 emissions (millions of metric tons), y
1.6 | 428.2
3.6 | 828.8
4.9 | 1214.2
1.1 | 444.6
0.9 | 264.0
2.9 | 415.3
2.7 | 571.8
2.3 | 454.9
1.6 | 358.7
1.5 | 573.5

Solution: Constructing a Scatter Plot

There appears to be a positive linear correlation. As the gross domestic products increase, the carbon dioxide emissions tend to increase.

Example: Constructing a Scatter Plot Using Technology

Old Faithful, located in Yellowstone National Park, is the world’s most famous geyser. The duration (in minutes) of several of Old Faithful’s eruptions and the times (in minutes) until the next eruption are shown in the table. Using a TI-83/84, display the data in a scatter plot. Determine the type of correlation.

Duration, x (min) | Time, y (min)
1.80 | 56
1.82 | 58
1.90 | 62
1.93 | 56
1.98 | 57
2.05 | 57
2.13 | 60
2.30 | 57
2.37 | 61
2.82 | 73
3.13 | 76
3.27 | 77
3.65 | 77
3.78 | 79
3.83 | 85
3.88 | 80
4.10 | 89
4.27 | 90
4.30 | 89
4.43 | 89
4.47 | 86
4.53 | 89
4.55 | 86
4.60 | 92
4.63 | 91

Solution: Constructing a Scatter Plot Using Technology

• Enter the x-values into list L1 and the y-values into list L2 (STAT > Edit…).
• Use Stat Plot (STATPLOT) to construct the scatter plot.
• Use “Zoom 9” to see the picture properly.

[Figure: calculator scatter plot; x-axis from 1 to 5, y-axis from 50 to 100]

From the scatter plot, it appears that the variables have a positive linear correlation.

Correlation Coefficient

• Correlation coefficient: a measure of the strength and the direction of a linear relationship between two variables.
• The symbol r represents the sample correlation coefficient.
• A formula for r is

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

  where n is the number of data pairs.
• The population correlation coefficient is represented by ρ (rho).

Correlation Coefficient

• The range of the correlation coefficient is –1 to 1.
  – If r = –1, there is a perfect negative correlation.
  – If r is close to 0, there is no linear correlation. (There may still be a nonlinear correlation, however.)
  – If r = 1, there is a perfect positive correlation.

Linear Correlation

[Figure: four scatter plots with their correlation coefficients]

• r = –0.91: strong negative correlation
• r = 0.88: strong positive correlation
• r = 0.42: weak positive correlation
• r = 0.07: nonlinear correlation

Calculating a Correlation Coefficient

1. Find the sum of the x-values.
   In symbols: $\sum x$
2. Find the sum of the y-values.
   In symbols: $\sum y$
3. Multiply each x-value by its corresponding y-value and find the sum.
   In symbols: $\sum xy$
4. Square each x-value and find the sum.
   In symbols: $\sum x^2$
5. Square each y-value and find the sum.
   In symbols: $\sum y^2$
6. Use these five sums to calculate the correlation coefficient.
   In symbols:

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$

Example: Finding the Correlation Coefficient

Calculate the correlation coefficient for the gross domestic products and carbon dioxide emissions data. What can you conclude?

(Use the 10 pairs of GDP, x, and CO2 emissions, y, from the scatter plot example.)

Solution: Finding the Correlation Coefficient

x | y | xy | x² | y²
1.6 | 428.2 | 685.12 | 2.56 | 183,355.24
3.6 | 828.8 | 2983.68 | 12.96 | 686,909.44
4.9 | 1214.2 | 5949.58 | 24.01 | 1,474,281.64
1.1 | 444.6 | 489.06 | 1.21 | 197,669.16
0.9 | 264.0 | 237.6 | 0.81 | 69,696
2.9 | 415.3 | 1204.37 | 8.41 | 172,474.09
2.7 | 571.8 | 1543.86 | 7.29 | 326,955.24
2.3 | 454.9 | 1046.27 | 5.29 | 206,934.01
1.6 | 358.7 | 573.92 | 2.56 | 128,665.69
1.5 | 573.5 | 860.25 | 2.25 | 328,902.25
Σx = 23.1 | Σy = 5554 | Σxy = 15,573.71 | Σx² = 67.35 | Σy² = 3,775,842.76

Solution: Finding the Correlation Coefficient

Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{10(15{,}573.71) - (23.1)(5554)}{\sqrt{10(67.35) - 23.1^2}\,\sqrt{10(3{,}775{,}842.76) - 5554^2}} = \frac{27{,}439.7}{\sqrt{139.89}\,\sqrt{6{,}911{,}511.6}} \approx 0.882 $$

r ≈ 0.882 suggests a strong positive linear correlation. As the gross domestic product increases, the carbon dioxide emissions tend to increase.
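The five-sum formula for r can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the function name `correlation_from_sums` is hypothetical, and the inputs are the sums from the GDP and CO2 table.

```python
import math

def correlation_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Sample correlation coefficient r from the five sums (hypothetical helper)."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Sums for the GDP / CO2 data (n = 10 pairs)
r = correlation_from_sums(10, 23.1, 5554, 15573.71, 67.35, 3775842.76)
print(round(r, 3))  # 0.882
```

Working from the sums mirrors the hand calculation, so each intermediate value (27,439.7, 139.89, 6,911,511.6) can be compared against the slide.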

On a Calculator

• Enter the data in List 1 and List 2.
• To graph it: 2nd > “Y=” (STAT PLOT) > turn the plot on.
• Check the window to ensure that all values will fit into the window.
  – Adjust x or y as required.
  – Use “Zoom 9”.
• To calculate r: STAT > CALC > 4: LinReg(ax+b).
• Here you will find both “r” and “r²”.
• Notice also that this gives the equation of the line.
• Normally we write “y = mx + b”; the calculator calls it “y = ax + b”.

Larson/Farber 5th ed.

Example: Using Technology to Find a Correlation Coefficient

Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you conclude?

Go ahead and enter these in List 1 and List 2. We will use them to calculate r.

(Use the 25 pairs of Old Faithful data, duration x and time y, from the scatter plot example.)

Solution: Using Technology to Find a Correlation Coefficient

STAT > Calc

To calculate r, you must first enter the DiagnosticOn command found in the Catalog menu.

r ≈ 0.979 suggests a strong positive correlation.
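Any technology tool will do here. As a hedged illustration outside the TI-83/84 workflow, the sketch below computes r for the Old Faithful data in plain Python. The helper name `pearson_r` is hypothetical, and it uses the deviation form of the formula, which is algebraically equivalent to the five-sum form.

```python
import math

# Old Faithful data: eruption duration (min) and time until the next eruption (min)
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
time_to_next = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
                76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
                86, 89, 86, 92, 91]

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_old_faithful = pearson_r(duration, time_to_next)
print(round(r_old_faithful, 3))  # 0.979
```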

Using a Table to Test a Population Correlation Coefficient ρ

• Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.
• Use Table 11 in Appendix B.
• If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is significant.

Using a Table to Test a Population Correlation Coefficient ρ

• Determine whether ρ is significant for five pairs of data (n = 5) at a level of significance of α = 0.01.

[Table 11 excerpt: rows give the number of pairs of data in the sample, columns give the level of significance; the critical value for n = 5 and α = 0.01 is 0.959]

• If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to conclude that the correlation is significant.

On a Calculator

Normally H0 is “=”. We are testing “can we conclude there is a relationship?”

• Instead of using the table, we use LinRegTTest.
• Go to STAT > TESTS > LinRegTTest (note: you must have entered data in List 1 and List 2).
• Notice you need to know what the alternative hypothesis is to test it.
• For the line that says “RegEQ”, do this: go to VARS > Y-VARS > Function > Y1 and press ENTER.
• This will enter Y1 on the RegEQ line.
• This will then store the equation of the line in Y1.
  – This is so we can graph it if we want.

On a Calculator

• Notice that you see β and ρ. They will be “not equal to”, “greater than”, or “less than” 0. This is the alternative hypothesis. (Remember: β is the slope of the population regression line. It is 0 exactly when the population correlation coefficient ρ is 0, so one test covers both.)
• Normally we choose “not equal to”.
• After you calculate the values, you will find r.
• You will also see the t-score and, more importantly, the P-value.
• This time we WANT the P-value to be less than α. This means the test statistic is in the tail (the rejection region), so the correlation is significant. The MORE in the tail, the stronger the evidence of a correlation.

Graphing

• Again, choose a stat plot to view the points entered in List 1 and List 2.
• Select “Zoom 9” to see the stat plot best.
• This time you will also see a line through the stat plot.
• This is the line you created and stored in Y1.
• You can now see the line that “best fits” the plotted points.
• Suppose you had 30 data points and wanted to predict the 35th data point.
• Go to VARS > Y-VARS > Function > Y1 and press ENTER.
• Y1 will show on the screen. Complete it so that it reads Y1(35) and press ENTER.
• This will show the predicted 35th entry based on your line.
• 35 would be the “x-value” (explanatory or independent) and your answer would be the “y-value” (dependent or response).

Using a Table to Test a Population Correlation Coefficient ρ

1. Determine the number of pairs of data in the sample.
   In symbols: determine n.
2. Specify the level of significance.
   In symbols: identify α.
3. Find the critical value.
   In symbols: use Table 11 in Appendix B.
4. Decide if the correlation is significant.
   In symbols: if |r| > critical value, the correlation is significant. Otherwise, there is not enough evidence to support that the correlation is significant.
5. Interpret the decision in the context of the original claim.

Example: Using a Table to Test a Population Correlation Coefficient ρ

Using the Old Faithful data, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.

(Use the 25 pairs of Old Faithful data, duration x and time y, from the scatter plot example.)

Solution: Using a Table to Test a Population Correlation Coefficient ρ

• n = 25, α = 0.05
• The critical value from Table 11 is 0.396.
• |r| ≈ 0.979 > 0.396
• There is enough evidence at the 5% level of significance to conclude that there is a significant linear correlation between the duration of Old Faithful’s eruptions and the time between eruptions.
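The table test is a single comparison, which the sketch below spells out. It is only an illustration: the dictionary `CRITICAL_VALUES` and the function `is_significant` are hypothetical names, and the dictionary holds only the two Table 11 entries quoted in these slides, not the full table.

```python
# Critical values quoted on the slides from Table 11, keyed by (n, alpha).
# Only these two entries appear in the transcript; the full table has many more.
CRITICAL_VALUES = {
    (5, 0.01): 0.959,
    (25, 0.05): 0.396,
}

def is_significant(r, n, alpha):
    """Return True when |r| exceeds the critical value for (n, alpha)."""
    critical_value = CRITICAL_VALUES[(n, alpha)]
    return abs(r) > critical_value

# Old Faithful example: n = 25, alpha = 0.05, r is about 0.979
print(is_significant(0.979, 25, 0.05))  # True
```

Taking the absolute value means a strong negative correlation is treated the same way as a strong positive one.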

Hypothesis Testing for a Population Correlation Coefficient ρ

• A hypothesis test can also be used to determine whether the sample correlation coefficient r provides enough evidence to conclude that the population correlation coefficient ρ is significant at a specified level of significance.
• A hypothesis test can be one-tailed or two-tailed.

Hypothesis Testing for a Population Correlation Coefficient ρ

• Left-tailed test
  H0: ρ ≥ 0 (no significant negative correlation)
  Ha: ρ < 0 (significant negative correlation)
• Right-tailed test
  H0: ρ ≤ 0 (no significant positive correlation)
  Ha: ρ > 0 (significant positive correlation)
• Two-tailed test
  H0: ρ = 0 (no significant correlation)
  Ha: ρ ≠ 0 (significant correlation)

The t-Test for the Correlation Coefficient

• Can be used to test whether the correlation between two variables is significant.
• The test statistic is r.
• The standardized test statistic

$$ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

  follows a t-distribution with d.f. = n – 2.
• In this text, only two-tailed hypothesis tests for ρ are considered.

Using the t-Test for ρ

1. State the null and alternative hypotheses.
   In symbols: state H0 and Ha.
2. Specify the level of significance.
   In symbols: identify α.
3. Identify the degrees of freedom.
   In symbols: d.f. = n – 2
4. Determine the critical value(s) and rejection region(s).
   In symbols: use Table 5 in Appendix B.
5. Find the standardized test statistic.
   In symbols:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

6. Make a decision to reject or fail to reject the null hypothesis.
   In symbols: if t is in the rejection region, reject H0. Otherwise, fail to reject H0.
7. Interpret the decision in the context of the original claim.

Example: t-Test for a Correlation Coefficient

Previously you calculated r ≈ 0.882. Test the significance of this correlation coefficient. Use

α

= 0.05.

(Use the 10 pairs of GDP, x, and CO2 emissions, y, from the scatter plot example.)

Solution: t-Test for a Correlation Coefficient

• H0: ρ = 0
• Ha: ρ ≠ 0
• α = 0.05
• d.f. = 10 – 2 = 8
• Rejection region: t < –2.306 or t > 2.306
• Test statistic:

$$ t = \frac{0.882}{\sqrt{\dfrac{1 - (0.882)^2}{10 - 2}}} \approx 5.294 $$

• Decision: Reject H0

If you reject the null hypothesis, then there is evidence to conclude a linear correlation.

At the 5% level of significance, there is enough evidence to conclude that there is a significant linear correlation between gross domestic products and carbon dioxide emissions.
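The standardized test statistic is easy to reproduce. The following is a minimal sketch, not the calculator's LinRegTTest: the function name `t_statistic` is hypothetical, and the critical value 2.306 is typed in by hand from Table 5 (d.f. = 8, two-tailed, α = 0.05) rather than computed.

```python
import math

def t_statistic(r, n):
    """Standardized test statistic for the correlation coefficient, d.f. = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_statistic(0.882, 10)

# Critical value taken from Table 5 for d.f. = 8, two-tailed, alpha = 0.05
t_critical = 2.306
reject_h0 = abs(t) > t_critical

print(round(t, 3), reject_h0)  # 5.294 True
```

Because the test is two-tailed, the decision compares |t| with the critical value, which covers both rejection regions at once.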

On the Calculator

• We still use LinRegTTest.
• This will give us a t-score should we need it.
• It also tells you the degrees of freedom and everything else you need.
• Remember: we WANT p to be less than α. That tells us the test statistic is in the tails, and thus the result is more likely to be explained by a correlation than by random chance. The more in the tails, the less likely it’s chance.

Correlation and Causation

• The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.
• If there is a significant correlation between two variables, you should consider the following possibilities.

1. Is there a direct cause-and-effect relationship between the variables?
   • Does x cause y?
2. Is there a reverse cause-and-effect relationship between the variables?
   • Does y cause x?
3. Is it possible that the relationship between the variables can be caused by a third variable or by a combination of several other variables?
4. Is it possible that the relationship between two variables may be a coincidence?

Section 9.1 Summary

• Introduced linear correlation, independent and dependent variables, and the types of correlation
• Found a correlation coefficient
• Tested a population correlation coefficient ρ using a table
• Performed a hypothesis test for a population correlation coefficient ρ
• Distinguished between correlation and causation

Assignment

• Page 495, 9-28

Section 9.2

Linear Regression

Section 9.2 Objectives

• Find the equation of a regression line
• Predict y-values using a regression equation

Regression lines

• After verifying that the linear correlation between two variables is significant, next we determine the equation of the line that best models the data (regression line).
• Can be used to predict the value of y for a given value of x.

[Figure: scatter plot with a regression line]

Residuals

Residual (know this)

The difference between the observed y-value and the predicted y-value for a given x-value on the line.

For a given x-value,

d_i = (observed y-value) – (predicted y-value)

[Figure: scatter plot with regression line; the residuals d1 through d6 are the vertical distances between each observed y-value and the predicted y-value on the line]

Regression Line

Regression line (line of best fit)

• The line for which the sum of the squares of the residuals is a minimum.
• The equation of a regression line for an independent variable x and a dependent variable y is

  ŷ = mx + b

  where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept.

The Equation of a Regression Line

ŷ = mx + b, where

$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$

• $\bar{y}$ is the mean of the y-values in the data
• $\bar{x}$ is the mean of the x-values in the data
• The regression line always passes through the point $(\bar{x}, \bar{y})$.

Example: Finding the Equation of a Regression Line

Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data. Enter these values into 2 lists.

(Use the 10 pairs of GDP, x, and CO2 emissions, y, from the scatter plot example.)

Solution: Finding the Equation of a Regression Line

Recall from section 9.1 (n = 10 pairs; the full table of x, y, xy, x², and y² appears in the correlation coefficient solution):

Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76

Solution: Finding the Equation of a Regression Line

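Σx = 23.1, Σy = 5554, Σxy = 15,573.71, Σx² = 67.35, Σy² = 3,775,842.76

$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(15{,}573.71) - (23.1)(5554)}{10(67.35) - 23.1^2} = \frac{27{,}439.7}{139.89} \approx 196.151977 $$

$$ b = \bar{y} - m\bar{x} \approx \frac{5554}{10} - (196.151977)\,\frac{23.1}{10} \approx 102.2889 $$

Equation of the regression line:

ŷ = 196.152x + 102.289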
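The slope and intercept follow directly from the sums, as this short sketch shows. It is an illustration only, and the function name `regression_line_from_sums` is hypothetical.

```python
def regression_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Slope m and y-intercept b of the least-squares line from the sums."""
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b

# Sums for the GDP / CO2 data (n = 10 pairs)
m, b = regression_line_from_sums(10, 23.1, 5554, 15573.71, 67.35)
print(round(m, 3), round(b, 3))  # 196.152 102.289
```

Computing b from the unrounded slope, as the slide does, avoids carrying a rounding error into the intercept.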

Solution: Finding the Equation of a Regression Line

• To sketch the regression line, use any two x-values within the range of the data and calculate the corresponding y-values from the regression line.

[Figure: scatter plot of the GDP and CO2 data with the regression line]

On the Calculator

• Again, we can use STAT > TESTS > LinRegTTest.
• Set RegEQ: Y1 by going to VARS > Y-VARS > ENTER > ENTER.
• This will store the equation of the line in Y1.
• Then set the stat plot to “On”.
• Choose “Zoom 9” to plot the equation.
• Choose “Y=” to see the equation of the line.
• Notice the equation of the line is not in the normal “order” we are used to.

Example: Using Technology to Find a Regression Equation

Use a technology tool to find the equation of the regression line for the Old Faithful data.

(Use the 25 pairs of Old Faithful data, duration x and time y, from the scatter plot example.)

Solution: Using Technology to Find a Regression Equation

ŷ = 12.481x + 33.683

[Figure: calculator scatter plot of the Old Faithful data with the regression line; x-axis from 1 to 5, y-axis from 50 to 100]
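As a cross-check that does not depend on a calculator, the sketch below fits the least-squares line to the Old Faithful data in plain Python. The helper name `fit_line` is hypothetical. The fitted intercept can be compared with the 33.683 shown on the slide.

```python
# Old Faithful data: eruption duration (min) and time until the next eruption (min)
duration = [1.80, 1.82, 1.90, 1.93, 1.98, 2.05, 2.13, 2.30, 2.37, 2.82,
            3.13, 3.27, 3.65, 3.78, 3.83, 3.88, 4.10, 4.27, 4.30, 4.43,
            4.47, 4.53, 4.55, 4.60, 4.63]
time_to_next = [56, 58, 62, 56, 57, 57, 60, 57, 61, 73,
                76, 77, 77, 79, 85, 80, 89, 90, 89, 89,
                86, 89, 86, 92, 91]

def fit_line(xs, ys):
    """Least-squares slope and intercept from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (x-bar, y-bar)
    return slope, intercept

slope, intercept = fit_line(duration, time_to_next)
print(round(slope, 3), round(intercept, 3))
```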

Example: Predicting y-Values Using Regression Equations

The regression equation for the gross domestic products (in trillions of dollars) and carbon dioxide emissions (in millions of metric tons) data is ŷ = 196.152x + 102.289. Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from section 9.1 that x and y have a significant linear correlation.)

1. 1.2 trillion dollars
2. 2.0 trillion dollars
3. 2.5 trillion dollars

Solution: Predicting y-Values Using Regression Equations

ŷ = 196.152x + 102.289

1. 1.2 trillion dollars

   ŷ = 196.152(1.2) + 102.289 ≈ 337.671

   When the gross domestic product is $1.2 trillion, the CO2 emissions are about 337.671 million metric tons.

2. 2.0 trillion dollars

   ŷ = 196.152(2.0) + 102.289 = 494.593

   When the gross domestic product is $2.0 trillion, the CO2 emissions are about 494.593 million metric tons.

3. 2.5 trillion dollars

   ŷ = 196.152(2.5) + 102.289 = 592.669

   When the gross domestic product is $2.5 trillion, the CO2 emissions are about 592.669 million metric tons.

Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict carbon dioxide emissions for gross domestic products such as $0.2 trillion or $14.5 trillion.
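The three predictions, and the warning about staying within the range of the data, can be sketched as follows. The function name `predict_co2` is hypothetical. The strict range check is an assumption of this sketch, since the slide allows x-values "in (or close to)" the range.

```python
def predict_co2(gdp, m=196.152, b=102.289, x_min=0.9, x_max=4.9):
    """Predicted CO2 emissions (millions of metric tons) for a GDP in trillions of $.

    Raises ValueError outside the observed GDP range (0.9 to 4.9), a stricter
    rule than the slide's "in (or close to) the range of the data".
    """
    if not (x_min <= gdp <= x_max):
        raise ValueError(f"GDP {gdp} is outside the data range [{x_min}, {x_max}]")
    return m * gdp + b

for gdp in (1.2, 2.0, 2.5):
    print(gdp, round(predict_co2(gdp), 3))
# 1.2 337.671
# 2.0 494.593
# 2.5 592.669
```

Calling `predict_co2(14.5)` raises an error instead of returning an extrapolated value.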

Section 9.2 Summary

• Found the equation of a regression line
• Predicted y-values using a regression equation

Assignment

• Page 505, 13-24

Section 9.3

Measures of Regression and Prediction Intervals

Section 9.3 Objectives

• Interpret the three types of variation about a regression line
• Find and interpret the coefficient of determination
• Find and interpret the standard error of the estimate for a regression line
• Construct and interpret a prediction interval for y

Variation About a Regression Line

• Three types of variation about a regression line
  – Total variation
  – Explained variation
  – Unexplained variation
• To find the total variation, you must first calculate
  – The total deviation
  – The explained deviation
  – The unexplained deviation

Variation About a Regression Line

Total deviation = $y_i - \bar{y}$

Explained deviation = $\hat{y}_i - \bar{y}$

Unexplained deviation = $y_i - \hat{y}_i$

[Figure: regression line through $(\bar{x}, \bar{y})$ with one data point $(x_i, y_i)$ and the point $(x_i, \hat{y}_i)$ on the line; the total deviation $y_i - \bar{y}$ is split into the explained deviation $\hat{y}_i - \bar{y}$ and the unexplained deviation $y_i - \hat{y}_i$]

Variation About a Regression Line

• Total variation: the sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

  Total variation = $\sum (y_i - \bar{y})^2$

• Explained variation: the sum of the squares of the differences between each predicted y-value and the mean of y.

  Explained variation = $\sum (\hat{y}_i - \bar{y})^2$

• Unexplained variation: the sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.

  Unexplained variation = $\sum (y_i - \hat{y}_i)^2$

The sum of the explained and unexplained variation is equal to the total variation.

Total variation = Explained variation + Unexplained variation

Variation

• The explained variation can be explained by the relationship between x and y.
• The unexplained variation cannot be explained by the relationship between x and y, and is due to chance or other variables.

This is all I want you to know for this concept.

Coefficient of Determination

• Coefficient of determination: the ratio of the explained variation to the total variation.
• Denoted by r².

$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

The Difference between r and r²

• Remember, r is the sample correlation coefficient.
• It is “A measure of the strength and the direction of a linear relationship between two variables.”
• The closer |r| is to 1, the stronger the linear relationship between the variables in the sample.
• r² is the coefficient of determination.
• It is “The ratio of the explained variation to the total variation.”
• The higher the r² value, the larger the share of the variation in y that is explained by the regression line.

Example: Coefficient of Determination

The correlation coefficient for the gross domestic products and carbon dioxide emissions data as calculated in Section 9.1 is r ≈ 0.882. Find the coefficient of determination. What does this tell you about the explained variation of the data about the regression line? About the unexplained variation?

Solution:

r² ≈ (0.882)² ≈ 0.778

About 77.8% of the variation in the carbon dioxide emissions can be explained by the variation in the gross domestic products. About 22.2% of the variation is unexplained.
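To see where r² comes from, the sketch below computes the three variations directly from the GDP and CO2 data and takes their ratio. It is an illustration with hypothetical variable names. Because it uses the unrounded least-squares line, it gives r² ≈ 0.779, while the slide's 0.778 comes from squaring the rounded r = 0.882.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons)
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

n = len(gdp)
mean_x = sum(gdp) / n
mean_y = sum(co2) / n

# Least-squares line fitted without rounding
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gdp, co2))
sxx = sum((x - mean_x) ** 2 for x in gdp)
m = sxy / sxx
b = mean_y - m * mean_x
predicted = [m * x + b for x in gdp]

total = sum((y - mean_y) ** 2 for y in co2)
explained = sum((p - mean_y) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(co2, predicted))

r_squared = explained / total
print(round(r_squared, 3))  # 0.779
```

With the exact least-squares line, `explained + unexplained` equals `total` up to floating-point error, which is the identity stated on the variation slides.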

The Standard Error of Estimate

• Standard error of estimate: the standard deviation of the observed y_i-values about the predicted ŷ-value for a given x_i-value.
• Denoted by s_e.

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

  where n is the number of ordered pairs in the data set.
• The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate will be.

The Standard Error of Estimate

1. Make a table that includes the column headings shown.
   In symbols: $x_i,\ y_i,\ \hat{y}_i,\ (y_i - \hat{y}_i),\ (y_i - \hat{y}_i)^2$
2. Use the regression equation to calculate the predicted y-values.
   In symbols: $\hat{y}_i = m x_i + b$
3. Calculate the sum of the squares of the differences between each observed y-value and the corresponding predicted y-value.
   In symbols: $\sum (y_i - \hat{y}_i)^2$
4. Find the standard error of estimate.
   In symbols:

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

Example: Standard Error of Estimate

The regression equation for the gross domestic products and carbon dioxide emissions data as calculated in section 9.2 is

ŷ = 196.152x + 102.289

Find the standard error of estimate.

Solution:

Use a table to calculate the sum of the squared differences of each observed y-value and the corresponding predicted y-value.

Solution: Standard Error of Estimate

x | y | ŷ_i | y_i – ŷ_i | (y_i – ŷ_i)²
1.6 | 428.2 | 416.1322 | 12.0678 | 145.63179684
3.6 | 828.8 | 808.4362 | 20.3638 | 414.68435044
4.9 | 1214.2 | 1063.4338 | 150.7662 | 22,730.44706244
1.1 | 444.6 | 318.0562 | 126.5438 | 16,013.33331844
0.9 | 264.0 | 278.8258 | –14.8258 | 219.80434564
2.9 | 415.3 | 671.1298 | –255.8298 | 65,448.88656804
2.7 | 571.8 | 631.8994 | –60.0994 | 3611.93788036
2.3 | 454.9 | 553.4386 | –98.5386 | 9709.85568996
1.6 | 358.7 | 416.1322 | –57.4322 | 3298.45759684
1.5 | 573.5 | 396.517 | 176.983 | 31,322.982289

Σ(y_i – ŷ_i)² = 152,916.020898 (unexplained variation)

Solution: Standard Error of Estimate

n = 10, Σ(y_i – ŷ_i)² = 152,916.020898

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{152{,}916.020898}{10 - 2}} \approx 138.255 $$

The standard error of estimate of the carbon dioxide emissions for a specific gross domestic product is about 138.255 million metric tons.
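The residual table and the standard error can be reproduced with the short sketch below. It is an illustration with hypothetical variable names, and it deliberately uses the rounded coefficients 196.152 and 102.289 so that the numbers match the slide.

```python
import math

# GDP (trillions of $) and CO2 emissions (millions of metric tons)
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

# Rounded regression coefficients, as used on the slide
m, b = 196.152, 102.289

# Residual = observed y-value minus predicted y-value
residuals = [y - (m * x + b) for x, y in zip(gdp, co2)]
unexplained = sum(d ** 2 for d in residuals)

# Standard error of estimate with n - 2 degrees of freedom
s_e = math.sqrt(unexplained / (len(gdp) - 2))
print(round(s_e, 3))  # 138.255
```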

Prediction Intervals

• Two variables have a bivariate normal distribution if for any fixed value of x, the corresponding values of y are normally distributed, and for any fixed value of y, the corresponding x-values are normally distributed.

Prediction Intervals

• A prediction interval can be constructed for the true value of y.
• Given a linear regression equation ŷ = mx + b and x_0, a specific value of x, a c-prediction interval for y is

  ŷ – E < y < ŷ + E

  where

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\,(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}} $$

• The point estimate is ŷ and the margin of error is E. The probability that the prediction interval contains y is c.

Constructing a Prediction Interval for y for a Specific Value of x

1. Identify the number of ordered pairs in the data set n and the degrees of freedom.
   In symbols: d.f. = n – 2
2. Use the regression equation and the given x-value to find the point estimate ŷ.
   In symbols: $\hat{y}_i = m x_i + b$
3. Find the critical value t_c that corresponds to the given level of confidence c.
   In symbols: use Table 5 in Appendix B.
4. Find the standard error of estimate s_e.
   In symbols:

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

5. Find the margin of error E.
   In symbols:

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\,(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}} $$

6. Find the left and right endpoints and form the prediction interval.
   In symbols:
   Left endpoint: ŷ – E
   Right endpoint: ŷ + E
   Interval: ŷ – E < y < ŷ + E

Example: Constructing a Prediction Interval

Construct a 95% prediction interval for the carbon dioxide emission when the gross domestic product is $3.5 trillion. What can you conclude?

Recall,

n

= 10,

ŷ x

23.1, = 196.152

x

+ 102.289,

s e x

2 67.35,

x

 2.31

= 138.255

Solution:

Point estimate: ŷ = 196.152(3.5) + 102.289 ≈ 788.821

Critical value: d.f. = n – 2 = 10 – 2 = 8, so t_c = 2.306


Solution: Constructing a Prediction Interval

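$$E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - \left(\sum x\right)^2}} = (2.306)(138.255)\sqrt{1 + \frac{1}{10} + \frac{10(3.5 - 2.31)^2}{10(67.35) - (23.1)^2}} \approx 349.424$$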

Left endpoint: ŷ – E = 788.821 – 349.424 = 439.397

Right endpoint: ŷ + E = 788.821 + 349.424 = 1138.245

439.397 < y < 1138.245

You can be 95% confident that when the gross domestic product is $3.5 trillion, the carbon dioxide emissions will be between 439.397 and 1138.245 million metric tons.
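As a check on the arithmetic, the example can be reproduced from the summary statistics alone. This is a sketch under stated assumptions: the helper name `margin_of_error` is hypothetical, and the critical value 2.306 is taken from the slide rather than computed.

```python
from math import sqrt


def margin_of_error(n, sum_x, sum_x2, x0, s_e, t_c):
    """Margin of error E of a prediction interval, from summary statistics."""
    x_bar = sum_x / n
    spread = n * sum_x2 - sum_x ** 2
    return t_c * s_e * sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2 / spread)


# Values recalled in the example (GDP in trillions of dollars).
y_hat = 196.152 * 3.5 + 102.289
E = margin_of_error(n=10, sum_x=23.1, sum_x2=67.35, x0=3.5,
                    s_e=138.255, t_c=2.306)
left, right = y_hat - E, y_hat + E

# The endpoints should agree with the slide's values up to rounding.
print(f"{left:.3f} < y < {right:.3f}")
```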


Section 9.3 Summary

• Interpreted the three types of variation about a regression line

• Found and interpreted the coefficient of determination

• Found and interpreted the standard error of the estimate for a regression line

• Constructed and interpreted a prediction interval for y

Section 9.4

Multiple Regression


Section 9.4 Objectives

• Use technology to find a multiple regression equation, the standard error of estimate, and the coefficient of determination

• Use a multiple regression equation to predict y-values

Multiple Regression Equation

• In many instances, a better prediction can be found for a dependent (response) variable by using more than one independent (explanatory) variable.

• For example, a more accurate prediction for the carbon dioxide emissions discussed in previous sections might be made by considering the number of cars as well as the gross domestic product.

Multiple Regression Equation

• Multiple regression equation: ŷ = b + m_1x_1 + m_2x_2 + m_3x_3 + … + m_kx_k

• x_1, x_2, x_3, …, x_k are independent variables

• b is the y-intercept

• y is the dependent variable

* Because the mathematics associated with this concept is complicated, technology is generally used to calculate the multiple regression equation.

Example: Finding a Multiple Regression Equation

A researcher wants to determine how employee salaries at a certain company are related to the length of employment, previous experience, and education. The researcher selects eight employees from the company and obtains the data shown on the next slide. Use MINITAB to find a multiple regression equation that models the data.


Example: Finding a Multiple Regression Equation

Employee   Salary, y   Employment (yrs), x_1   Experience (yrs), x_2   Education (yrs), x_3
A          57,310      10                      2                       16
B          57,380      5                       6                       16
C          54,135      3                       1                       12
D          56,985      6                       5                       14
E          58,715      8                       8                       16
F          60,620      20                      0                       12
G          59,200      8                       4                       18
H          60,320      14                      6                       17

Solution: Finding a Multiple Regression Equation

• Enter the y-values in C1 and the x_1-, x_2-, and x_3-values in C2, C3, and C4, respectively.

• Select “Regression > Regression…” from the Stat menu.

• Use the salaries as the response variable and the remaining data as the predictors.

Solution: Finding a Multiple Regression Equation

The regression equation is

ŷ = 49,764 + 364x_1 + 228x_2 + 267x_3
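Outside MINITAB, the same kind of fit can be sketched with ordinary least squares. The code below is an illustration only, not the procedure MINITAB uses: it solves the normal equations with a small hand-written Gaussian elimination, the function names are hypothetical, and the data are the eight employees as read from the slide's table. The fitted coefficients should come out close to the equation above.

```python
def solve_linear_system(a, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build an augmented matrix so the inputs are left unmodified.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_multiple_regression(rows, ys):
    """Return [b, m_1, ..., m_k] minimizing the sum of squared residuals."""
    # A leading 1 in each row carries the y-intercept b.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# (employment, experience, education) in years for employees A to H.
predictors = [(10, 2, 16), (5, 6, 16), (3, 1, 12), (6, 5, 14),
              (8, 8, 16), (20, 0, 12), (8, 4, 18), (14, 6, 17)]
salaries = [57310, 57380, 54135, 56985, 58715, 60620, 59200, 60320]

b, m1, m2, m3 = fit_multiple_regression(predictors, salaries)
print(f"y-hat = {b:.0f} + {m1:.0f} x1 + {m2:.0f} x2 + {m3:.0f} x3")
```

Solving the normal equations directly is fine for a small example like this; statistical software typically uses more numerically stable methods.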

Predicting y-Values

• After finding the equation of the multiple regression line, you can use the equation to predict y-values over the range of the data.

• To predict y-values, substitute the given value for each independent variable into the equation, then calculate ŷ.

Example: Predicting y-Values

Use the regression equation ŷ = 49,764 + 364x_1 + 228x_2 + 267x_3 to predict an employee’s salary given 12 years of current employment, 5 years of experience, and 16 years of education.

Solution:

ŷ = 49,764 + 364(12) + 228(5) + 267(16) = 59,544

The employee’s predicted salary is $59,544.
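The substitution can also be written as a short function. This is a minimal sketch; the name `predict` and its argument layout are hypothetical, and the coefficients are the rounded values from the regression equation in the example.

```python
def predict(intercept, slopes, values):
    """Evaluate y-hat = b + m_1 x_1 + ... + m_k x_k for one set of x-values."""
    if len(slopes) != len(values):
        raise ValueError("need one value per independent variable")
    return intercept + sum(m * x for m, x in zip(slopes, values))


# 12 years of employment, 5 years of experience, 16 years of education.
salary = predict(49764, [364, 228, 267], [12, 5, 16])
print(salary)  # 59544
```

As the earlier slide notes, such predictions are meaningful only for x-values within the range of the data.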


Section 9.4 Summary

• Used technology to find a multiple regression equation, the standard error of estimate, and the coefficient of determination

• Used a multiple regression equation to predict y-values