Correlation and Linear Regression


McGraw-Hill/Irwin

Correlation and Linear Regression

Chapter 13

Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved.

Topics

1. Correlation
   - Scatter diagram
   - Correlation coefficient
   - Test on the correlation coefficient

2. Simple Linear Regression Analysis
   - Estimation
   - Validity of the model: test on the slope coefficient
   - Fitness of the model: coefficient of determination
   - Prediction
   - The error term and residuals

13-2

Regression Analysis Introduction

Recall that in Chapter 4 we used Applewood Auto Group data to show the relationship between two variables with a scatter diagram. The profit on each vehicle sold and the age of the buyer were plotted on an XY graph. The graph showed that, in general, as the age of the buyer increased, the profit on each vehicle also increased.

This chapter develops numerical measures to express the strength of the relationship between two variables. In addition, an equation is used to express the relationship between the variables, allowing us to estimate one variable on the basis of another.

EXAMPLES

Does the amount Healthtex spends per month on training its sales force affect its monthly sales?

Is the number of square feet in a home related to the cost to heat the home in January?

In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?

Does the number of hours that students studied for an exam influence the exam score?

13-3

Dependent Versus Independent Variable

The Dependent Variable is the variable being predicted or estimated.

The Independent Variable provides the basis for estimation. It is the predictor variable.

In the questions below, which are the dependent and which are the independent variables?

1. Does the amount Healthtex spends per month on training its sales force affect its monthly sales?

2. Is the number of square feet in a home related to the cost to heat the home in January?

3. In a study of fuel efficiency, is there a relationship between miles per gallon and the weight of a car?

4. Does the number of hours that students studied for an exam influence the exam score?

13-4

Scatter Diagram Example

The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.

13-5

Minitab Scatter Plots

[Scatter plots illustrating: no correlation; weak negative correlation; strong positive correlation]

13-6

The Coefficient of Correlation, r

The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.

- It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables.
- It can range from -1.00 to +1.00.
- Values of -1.00 or +1.00 indicate perfect correlation.
- Values close to 0.0 indicate weak correlation.
- Negative values indicate a negative (indirect) relationship; positive values indicate a positive (direct) relationship.

13-7

Correlation Coefficient - Interpretation

13-8

Correlation Coefficient - Example

For the copier sales data: X̄ = 22, Ȳ = 45, s_X = 9.189, s_Y = 14.337.

r = Σ(X - X̄)(Y - Ȳ) / ((n - 1)·s_X·s_Y) = 900 / (9 × 9.189 × 14.337) = 0.759

In Excel: =CORREL(VAR1, VAR2)

What does a correlation of 0.759 mean? First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables, sales calls and copiers sold, are related.

13-9
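As a sketch (not part of the original slides), the correlation for the copier data can be verified with a few lines of Python, using only the standard library; the calls/sales figures are taken from the deck's data table, and the variable names are mine.

```python
import math

# Copier Sales of America sample (calls made, copiers sold) from the slides
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n            # 22
mean_y = sum(sales) / n            # 45

# Sums of squares and cross-products
sxx = sum((x - mean_x) ** 2 for x in calls)                           # 760
syy = sum((y - mean_y) ** 2 for y in sales)                           # 1850
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))  # 900

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.759
```

The same value comes from Excel's =CORREL over the two columns.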

Testing the Significance of the Correlation Coefficient

H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)

Test statistic, with n - 2 degrees of freedom:

t = r·√(n - 2) / √(1 - r²)

Rejection region: reject H0 if t > t(α/2, n-2) or t < -t(α/2, n-2).

13-10

Testing the Significance of the Correlation Coefficient – Copier Sales Example

1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ ≠ 0 (the correlation in the population is not 0)

2. α = .05

3. Computing t, we get

   t = r·√(n - 2) / √(1 - r²) = 0.759·√(10 - 2) / √(1 - 0.759²) = 3.297

4. Reject H0 if t > t(.025, 8) = 2.306 or t < -2.306.

5. The computed t (3.297) is within the rejection region; therefore, we reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that, in the population of salespeople, the number of sales calls made and the number of copiers sold are correlated.

13-11
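The t test above can be sketched in Python; the helper function is mine, and the critical value 2.306 is the one quoted on the slide for α = .05 with 8 degrees of freedom.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Copier sales example: r = 0.759, n = 10
t = t_for_correlation(0.759, 10)
print(round(t, 3))            # 3.297

# Two-tail test at alpha = .05 with 8 df: critical value 2.306 (from the slide)
reject_h0 = abs(t) > 2.306
print(reject_h0)              # True
```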

One-tail Test of the Correlation Coefficient – Applewood Example

Sometimes, instead of a two-tail test on the correlation coefficient, we may want to test whether the coefficient takes a specific sign.

Example: in Chapter 4 we constructed a scatter plot of the profit earned on a vehicle sale against the age of the purchaser for the Applewood Auto Group. The correlation coefficient is found to be .262. Test at the 5% significance level to determine whether the relationship is positive (n = 180).

[Plot of Profit and Age: profit from $0 to $3,500 on the vertical axis, age from 0 to 80 on the horizontal axis]

13-12

One-tail Test of the Correlation Coefficient – Applewood Example

1. H0: ρ = 0 (the correlation in the population is 0)
   H1: ρ > 0 (the correlation in the population is positive/direct)

2. α = .05

3. Computing t, we get

   t = r·√(n - 2) / √(1 - r²) = 0.262·√(180 - 2) / √(1 - 0.262²) = 3.622

4. Reject H0 if t > t(.05, 178) = 1.653.

5. The computed t (3.622) is within the rejection region; therefore, we reject H0. This means the correlation in the population is positive. From a practical standpoint, it indicates that there is a positive correlation between profit and age in the population.

13-13

Regression Analysis

In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).

- The relationship between the variables is linear.
- Both variables must be at least interval scale.
- The least squares criterion is used to determine the equation.

REGRESSION EQUATION: An equation that expresses the LINEAR relationship between two variables.

LEAST SQUARES PRINCIPLE: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.

13-14

Linear Regression Model

Simple Linear Regression Model: Y = α + βX + ε

- Y is the dependent variable and X is the independent variable.
- α is the Y-intercept and β is the slope; both are population coefficients and need to be estimated using sample data.
- ε is the error term.

The model represents the linear relationship between the two variables in the population.

Estimated Linear Regression Equation: Ŷ = a + bX

The estimated equation represents the linear relationship between the two variables estimated from the sample.

13-15

Regression Analysis – Least Squares Principle

The least squares principle is used to obtain a and b.

13-16

Computing the Slope of the Line and the Y-intercept

The slope is b = r·(s_Y / s_X), and the Y-intercept is a = Ȳ - b·X̄.

13-17
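The slope and intercept named in this slide's title can be sketched in Python from the correlation form of the least squares formulas (b = r·s_Y/s_X, a = Ȳ - b·X̄); the sample statistics plugged in below are the ones reported with the copier data elsewhere in the deck, and small rounding differences from the slides' 18.9476 are expected.

```python
# Slope and intercept from the correlation form of the least squares formulas.
# Sample statistics from the slides (rounded values, so results are approximate).
r, s_x, s_y = 0.759, 9.189, 14.337
mean_x, mean_y = 22, 45

b = r * (s_y / s_x)        # slope, ~1.1842
a = mean_y - b * mean_x    # Y-intercept, ~18.95
print(round(b, 4), round(a, 2))
```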

Regression Equation Example

Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls ?

13-18

Regression Equation Example

Excel: Data -> Data Analysis -> Descriptive Statistics. See Excel instruction in Lind et al., p 329, #5 for more details.

Representative     Calls   Sales
Tom Keller           20      30
Jeff Hall            40      60
Brian Virost         20      40
Greg Fish            30      60
Susan Welch          10      30
Carlos Ramirez       10      40
Rich Niles           20      40
Mike Keil            20      50
Mark Reynolds        20      30
Soni Jones           30      70

Step 1 – Find the slope (b) of the line, using the descriptive statistics:

                      Calls      Sales
Mean                 22.000     45.000
Standard Error        2.906      4.534
Median               20.000     40.000
Mode                 20.000     30.000
Standard Deviation    9.189     14.337
Sample Variance      84.444    205.556
Kurtosis              0.396     -1.001
Skewness              0.601      0.566
Range                30.000     40.000
Minimum              10.000     30.000
Maximum              40.000     70.000
Sum                 220.000    450.000
Count                10.000     10.000

Step 2 – Find the y-intercept (a).

Finding and Fitting the Regression Equation Example

The regression equation is:

Ŷ = a + bX
Ŷ = 18.9476 + 1.1842X

For a representative who made 20 calls, the expected number of copiers sold is

Ŷ = 18.9476 + 1.1842(20) = 42.6316

13-20

Validity of the Model – Copier Sales Example

1. H 0 : β = 0 (the slope 0; there is no linear relationship; the model is invalid) H 1 : β ≠ 0 (the slope is not 0; there is linear relationship; the model is valid) 2. α=.05

3. Test statistic:

t

b

 0

s b

 1 .

1842  0 0 .

3591  3 .

297 4. Rejection region:

t

Reject H 0 if:

t

>

t

 /2,n-2

t

 /2,n-2 =

t

or

t

0.025,8 < -

t

 /2,n-2 =2.306

> 2.306 or

t

< -2.306

Rejection region Rejection region 5. Conclusion: The slope of the equation is significantly different from zero; there is linear relationship between the two variables and thus the model is valid.

13-21
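The standard error of the slope used in step 3 can be reproduced from the residuals. This is a sketch with my own variable names, assuming the usual formula s_b = √(MSE / Σ(X - X̄)²), which matches the 0.3591 on the slide.

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual sum of squares and mean squared error (n - 2 degrees of freedom)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))
mse = sse / (n - 2)

s_b = math.sqrt(mse / sxx)   # standard error of the slope, ~0.3591
t = (b - 0) / s_b            # ~3.297, matching the slide
print(round(s_b, 4), round(t, 3))
```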

One-tail Test on Slope Coefficient – Copier Sales Example

1. H0: β = 0
   H1: β > 0 (positive slope)

2. α = .05

3. Test statistic:

   t = (b - 0) / s_b = (1.1842 - 0) / 0.3591 = 3.297

4. Rejection region: reject H0 if t > t(.05, 8) = 1.860.

5. Conclusion: the test statistic is 3.297, which is higher than the critical value of 1.860. Thus we reject the null hypothesis and conclude that the slope is positive.

13-22

Fitness of the Model – Coefficient of Determination

The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained, or accounted for, by the variation in the independent variable (X).

- It is the square of the coefficient of correlation.
- It ranges from 0 to 1.
- The higher r² is, the better the model fits the data.

13-23

Coefficient of Determination (r²) – Copier Sales Example

Variation in Y = variation explained by variation in X + unexplained variation

r² = (variation explained by variation in X) / (total variation in Y) = SSR / SS Total

- The coefficient of determination, r², found by (0.759)² [recall r = 0.759], is 0.576.
- Interpretation: 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.

13-24
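The two routes to r² described above (square of the correlation, and SSR divided by SS Total) agree; a brief sketch with my own variable names:

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
syy = sum((y - mean_y) ** 2 for y in sales)     # SS Total = 1850
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))

b = sxy / sxx
ssr = b * sxy                                   # regression sum of squares, ~1065.79
r2_from_ss = ssr / syy                          # SSR / SS Total
r2_from_r = (sxy / (sxx * syy) ** 0.5) ** 2     # square of the correlation

print(round(r2_from_ss, 3), round(r2_from_r, 3))  # 0.576 0.576
```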

Confidence Interval and Prediction Interval Estimates of Y

• A confidence interval reports the mean value of Y for a given X.

• A prediction interval reports the range of values of Y for a particular value of X.

13-25

Confidence Interval Estimate - Example

We return to the Copier Sales of America illustration. Determine a 95 percent confidence interval for all sales representatives who make 25 calls.

Ŷ = 18.9476 + 1.1842X = 18.9476 + 1.1842(25) = 48.5526

t(α/2, n-2) = t(.025, 8) = 2.306

Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.

13-26

Prediction Interval Estimate - Example

We return to the Copier Sales of America illustration. Determine a 95 percent prediction interval for Sheila Baker, a West Coast sales representative who made 25 calls.

If Sheila Baker makes 25 sales calls, the number of copiers she will sell will be between about 24 and 73 copiers.

13-27
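Both intervals can be sketched with the standard textbook formulas (assumed here, since these slides show only the results): Ŷ ± t·s_e·√(1/n + (X - X̄)²/Σ(X - X̄)²) for the confidence interval, with an extra 1 under the square root for the prediction interval. Variable names are mine.

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(calls, sales))
s_e = math.sqrt(sse / (n - 2))       # standard error of estimate, ~9.9008

x0 = 25
y_hat = a + b * x0                   # ~48.55
t_crit = 2.306                       # t(.025, 8) from the slides

ci_half = t_crit * s_e * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)
pi_half = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)

print(round(y_hat - ci_half, 3), round(y_hat + ci_half, 3))  # 40.917 56.188
print(round(y_hat - pi_half, 2), round(y_hat + pi_half, 2))  # roughly 24 to 73
```

The prediction interval is much wider than the confidence interval because it covers a single representative's result, not the mean of all representatives.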

Confidence and Prediction Intervals – Minitab Illustration

13-28

Regression Analysis Using Excel

See Excel instruction in Lind et al., p 510, #2.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.759014
R Square             0.576102
Adjusted R Square    0.523115
Standard Error       9.900824
Observations         10

Note: Multiple R equals |r|; the sign of r depends on the sign of b.

Coefficient of determination: r² = SSR / SS Total = 1065.789 / 1850 = .576

ANOVA
             df    SS          MS          F          Significance F
Regression    1    1065.789    1065.789    10.87248   0.010902
Residual      8     784.2105     98.02632
Total         9    1850

Estimated coefficients (a and b), with the t statistic and p-value for the test on the slope:

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   18.94737       8.498819         2.229412   0.056349   -0.65094    38.54568
Calls        1.184211      0.359141         3.297345   0.010902    0.356031    2.01239

13-29
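The ANOVA numbers in the printout tie together: F = MSR/MSE, and in simple regression F equals the square of the slope's t statistic. A sketch with my own variable names:

```python
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx

ssr = b * sxy            # regression SS, ~1065.789
sse = syy - ssr          # residual SS, ~784.2105
msr = ssr / 1            # regression df = 1
mse = sse / (n - 2)      # residual df = 8, ~98.0263

f_stat = msr / mse                      # ~10.872, as in the printout
t_slope = b / math.sqrt(mse / sxx)      # ~3.2973, as in the printout
print(round(f_stat, 2), round(t_slope ** 2, 2))
```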

Error Term and Residuals

Simple Linear Regression Model:

Y = α + β X + ε

Error term:

ε

The error term accounts for all other variables, measurable and immeasurable that are not part of the model but can affect the magnitude of Y. In the copier sales example, the error term may represent the salesperson's skills, effort , etc. The error term varies from one salesperson to another even if they make the same number of calls, that is, for the same value of X. The error terms makes the model probabilistic and thus differentiates it from deterministic models like Profit=Revenue – Cost.

13-30

Error Term and Residuals

Assumption requirements of the error term:

- The probability distribution of ε is normal.
- The mean of the distribution is 0.
- The variance of ε is constant regardless of the value of X.
- The value of ε associated with any particular value of Y is independent of the ε associated with any other value of Y. In other words, we require the error terms to be independent of each other.

13-31

Error Term and Residuals

Residuals: the estimated error term, denoted ε̂.

- The residuals can be used to investigate whether the assumption requirements of the error term are satisfied for a regression analysis. We will explore this later.
- The residuals measure the vertical distance between each observation and the regression line.
- The residual for a specific observation is calculated as:

  ε̂ = Y - Ŷ = Y - (a + bX)

13-32

Error Term and Residuals – Example

In the copier sales example, the estimated regression equation is

Ŷ = 18.9476 + 1.1842X

Let's calculate the residual that corresponds to the observation of Soni Jones, who made 30 calls (X) and sold 70 copiers (Y):

ε̂ = Y - Ŷ = 70 - (18.9476 + 1.1842 × 30) = 70 - 54.4736 = 15.5264

13-33

Error Term and Residuals – Example

LEAST SQUARES PRINCIPLE: Determining a regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.

If we obtain the residuals from the regression fitted by the least squares principle, square them, and sum the squares, the total we get will be the smallest among all possible regression lines that we can fit to the original data.

Representative     Calls (X)   Sales (Y)    Ŷ        ε̂        ε̂²
Tom Keller           20          30        42.63    -12.63   159.56
Jeff Hall            40          60        66.32     -6.32    39.89
Brian Virost         20          40        42.63     -2.63     6.93
Greg Fish            30          60        54.47      5.53    30.54
Susan Welch          10          30        30.79     -0.79     0.62
Carlos Ramirez       10          40        30.79      9.21    84.83
Rich Niles           20          40        42.63     -2.63     6.93
Mike Keil            20          50        42.63      7.37    54.29
Mark Reynolds        20          30        42.63    -12.63   159.56
Soni Jones           30          70        54.47     15.53   241.07
SUM                                                          784.21 (smallest)

13-34
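The residual table and the "smallest" claim can be checked in Python; the comparison line (intercept 20, slope 1) is an arbitrary alternative I chose for illustration, since any other straight line must give a larger sum of squared residuals.

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
mean_x = sum(calls) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in calls)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(calls, sales))
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals from the least squares line, and their sum of squares
residuals = [y - (a + b * x) for x, y in zip(calls, sales)]
sse = sum(e ** 2 for e in residuals)
print(round(sse, 2))   # 784.21, as in the table

# An arbitrary alternative line Y = 20 + 1*X gives a larger total
sse_alt = sum((y - (20 + 1 * x)) ** 2 for x, y in zip(calls, sales))
print(sse < sse_alt)   # True
```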