
Dr. Ka-fu Wong

ECON1003 Analysis of Economic Data

Ka-fu Wong © 2003

Chap 14- 1


Chapter Fourteen

Multiple Regression and Correlation Analysis

GOALS

1. Describe the relationship between two or more independent variables and the dependent variable using a multiple regression equation.

2. Compute and interpret the multiple standard error of estimate and the coefficient of determination.

3. Interpret a correlation matrix.

4. Set up and interpret an ANOVA table.

5. Conduct a test of hypothesis to determine if any of the set of regression coefficients differ from zero.

6. Conduct a test of hypothesis on each of the regression coefficients.


Multiple Regression Analysis

For two independent variables, the general form of the multiple regression equation is:

Y_i = β0 + β1 X1i + β2 X2i + ε_i

- X1i and X2i are the i-th observations of the independent variables.
- β0 is the Y-intercept.
- β1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.


Visualize the multiple linear regression in a plot

The simple linear regression model allows for one independent variable, "x":

y = β0 + β1 x + ε

[Figure: a straight line in the (x, y) plane.]

Visualize the multiple linear regression in a plot

The simple linear regression model allows for one independent variable, "x":

y = β0 + β1 x + ε

The multiple linear regression model allows for more than one independent variable:

Y = β0 + β1 x1 + β2 x2 + ε

Note how the straight line becomes a plane. [Figure: a plane over the (X1, X2) axes.]

Visualize the multiple non-linear regression in a plot

y = β0 + β1 x²

[Figure: a parabola in the (x, y) plane with intercept β0.]

Visualize the multiple non-linear regression in a plot

y = β0 + β1 x²

y = β0 + β1 x1² + β2 x2

... a parabola becomes a parabolic surface. [Figure: a parabolic surface over the (X1, X2) axes with intercept β0.]

Multiple Regression Analysis

The general multiple regression with k independent variables is given by:

Y_i = β0 + β1 X1i + β2 X2i + ... + βk Xki + ε_i

When k > 2, it is impossible to visualize the regression equation in a plot.

The least squares criterion is used to develop an estimate of this equation.

Because determining β1, β2, etc. by hand is very tedious, a software package such as Excel or other statistical software is recommended to estimate them.


Choosing the line that fits best Ordinary Least Squares (OLS) Principle

Straight lines can be described generally by:

Y_i = b0 + b1 X1i + b2 X2i + ... + bk Xki

Finding the best line, the one with the smallest sum of squared differences, is the same as solving:

min over (b0, b1, b2, ..., bk) of S(b0, b1, b2, ..., bk) ≡ Σ_{i=1..n} [y_i − (b0 + b1 x1i + b2 x2i + ... + bk xki)]²

Let b0*, b1*, b2*, ..., bk* be the solution of the above problem. Y* = b0* + b1* X1 + b2* X2 + ... + bk* Xk is known as the "average predicted value" (or simply "predicted value") of y for any vector of (X1, X2, ..., Xk).
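The minimization above can be sketched in Python (not part of the original slides). numpy's least-squares solver returns the coefficient vector that minimizes S; the small data set here is hypothetical, chosen only to illustrate the criterion.

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 independent variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

# Design matrix: a column of ones for b0, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

def S(b):
    """Sum of squared differences S(b0, b1, b2) from the slide."""
    return float(np.sum((y - X @ b) ** 2))

# np.linalg.lstsq solves the least squares problem, i.e. minimizes S(b).
b_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# No perturbed coefficient vector achieves a smaller S.
assert S(b_star) <= S(b_star + 0.1)
assert S(b_star) <= S(b_star - 0.1)
```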


Coefficient estimates from the ordinary least squares (OLS) principle

Solving the minimization problem implies the first order conditions:

S(b0, b1, b2, ..., bk) ≡ Σ_{i=1..n} [y_i − (b0 + b1 x1i + b2 x2i + ... + bk xki)]²

∂S(b0, b1, b2, ..., bk)/∂b0 = Σ_{i=1..n} 2[y_i − (b0 + b1 x1i + b2 x2i + ... + bk xki)](−1) = 0

and, for all other coefficients bj (j = 1, 2, ..., k):

∂S(b0, b1, b2, ..., bk)/∂bj = Σ_{i=1..n} 2[y_i − (b0 + b1 x1i + b2 x2i + ... + bk xki)](−xji) = 0
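These first order conditions say that at the OLS solution the residuals sum to zero and are orthogonal to each regressor. A quick numerical check, with small hypothetical data (a sketch, not part of the original slides):

```python
import numpy as np

# Hypothetical data to check the first order conditions numerically.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
X = np.column_stack([np.ones_like(x1), x1])   # intercept plus one regressor

b, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ b

# dS/db0 = 0  =>  the residuals sum to zero.
assert abs(residuals.sum()) < 1e-8
# dS/dbj = 0  =>  the residuals are orthogonal to each x_j.
assert abs((residuals * x1).sum()) < 1e-8
```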


Coefficient estimates from the ordinary least squares (OLS) principle

Solving the first order conditions yields the solution b0*, b1*, b2*, ..., bk*.

Y* = b0* + b1* X1 + b2* X2 + ... + bk* Xk is known as the "average predicted value" (or simply "predicted value") of y for any vector of (X1, X2, ..., Xk).


Multiple Linear Regression Equations

Estimating the coefficients by hand is too complicated, so we use computational software packages, such as Excel.


Interpretation of Estimated Coefficients

1. Slope (bk): the estimated Y changes by bk for each 1-unit increase in Xk, holding all other variables constant. Example: if b1 = 2, then sales (Y) are expected to increase by 2 for each 1-unit increase in advertising (X1), holding the number of sales reps (X2) constant.

2. Y-intercept (b0): the average value of Y when all Xk = 0.


Parameter Estimation Example

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00).

You've collected the following data:

Resp  Size  Circ
  1     1     2
  4     8     8
  1     3     1
  3     5     7
  2     6     4
  4    10     6


Parameter Estimation Computer Output

Parameter Estimates

Variable   DF  Estimate  Std Error  T for H0: Param=0  Prob>|T|
INTERCEP    1   0.0640    0.2599         0.246          0.8214    (b0)
ADSIZE      1   0.2049    0.0588         3.656          0.0399    (b1)
CIRC        1   0.2805    0.0686         4.089          0.0264    (b2)
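These estimates can be reproduced from the six observations collected earlier. A sketch using numpy (not part of the original slides):

```python
import numpy as np

# Ad-response data from the example: Resp (00), Size (sq. in.), Circ (000).
resp = np.array([1.0, 4.0, 1.0, 3.0, 2.0, 4.0])
size = np.array([1.0, 8.0, 3.0, 5.0, 6.0, 10.0])
circ = np.array([2.0, 8.0, 1.0, 7.0, 4.0, 6.0])

# Design matrix: intercept, ADSIZE, CIRC.
X = np.column_stack([np.ones_like(size), size, circ])
b, *_ = np.linalg.lstsq(X, resp, rcond=None)

print(np.round(b, 4))  # approximately [0.064, 0.2049, 0.2805]
```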


Interpretation of Coefficients Solution

1. Slope (b1): the number of responses to the ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in ad size, holding circulation constant.

2. Slope (b2): the number of responses to the ad is expected to increase by .2805 (28.05) for each 1 unit (1,000) increase in circulation, holding ad size constant.


Multiple Standard Error of Estimate

The multiple standard error of estimate is a measure of the effectiveness of the regression equation.

- It is measured in the same units as the dependent variable.
- It is difficult to determine what is a large value and what is a small value of the standard error.


Multiple Standard Error of Estimate

The formula is:

s_e = s_{y·12...k} = sqrt[ Σ(Y − Y*)² / (n − (k + 1)) ]

Interpretation is similar to that in simple linear regression.
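Applying the formula to the ad-response example from earlier in the chapter (a sketch, not part of the original slides):

```python
import numpy as np

# Ad-response example: n = 6 observations, k = 2 independent variables.
size = np.array([1.0, 8.0, 3.0, 5.0, 6.0, 10.0])   # ad size
circ = np.array([2.0, 8.0, 1.0, 7.0, 4.0, 6.0])    # circulation
resp = np.array([1.0, 4.0, 1.0, 3.0, 2.0, 4.0])    # responses

X = np.column_stack([np.ones_like(size), size, circ])
b, *_ = np.linalg.lstsq(X, resp, rcond=None)
resp_pred = X @ b                                  # Y*

n, k = len(resp), 2
# Multiple standard error of estimate: sqrt(SSE / (n - (k + 1))).
s_e = np.sqrt(np.sum((resp - resp_pred) ** 2) / (n - (k + 1)))
print(round(float(s_e), 3))  # roughly 0.289
```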


Multiple Regression and Correlation Assumptions

- The independent variables and the dependent variable have a linear relationship.
- The dependent variable must be continuous and at least interval-scale.
- The variation in the residual (Y − Y*) must be the same for all values of Y. When this is the case, we say the residuals exhibit homoscedasticity.
- The residuals should follow the normal distribution with mean 0.
- Successive values of the dependent variable must be uncorrelated.


The ANOVA Table

- The ANOVA table reports the variation in the dependent variable. The variation is divided into two components.
- The Explained Variation is that accounted for by the set of independent variables.
- The Unexplained or Random Variation is not accounted for by the independent variables.


Correlation Matrix

A correlation matrix is used to show all possible simple correlation coefficients among the variables.

See which x_j are most correlated with y, and which x_j are strongly correlated with each other.

        y     x1    x2   ...  xk
 y     1.00
 x1    r     1.00
 x2    r     r     1.00
 ...
 xk    r     r     r    ...  1.00


Multicollinearity

1. High correlation between X variables.

2. Multicollinearity makes it difficult to separate the effect of x1 on y from the effect of x2 on y.

3. Leads to unstable coefficients, depending on which variables are in the model.

4. Always exists; it is a matter of degree. Example: using both age and height as explanatory variables in the same model.


Detecting Multicollinearity

1. Examine the correlation matrix: look for correlations between pairs of X variables that are larger than their correlations with the Y variable.

2. Few remedies:
- Obtain new sample data.
- Eliminate one of the correlated X variables.


Correlation Matrix Computer Output

Correlation Analysis: Pearson Corr Coeff / Prob>|R| under H0: Rho=0 / N=6

            RESPONSE   ADSIZE    CIRC
RESPONSE    1.00000    0.90932   0.93117
            0.0        0.0120    0.0069
ADSIZE      0.90932    1.00000   0.74118
            0.0120     0.0       0.0918
CIRC        0.93117    0.74118   1.00000
            0.0069     0.0918    0.0

r_Y1: correlation between RESPONSE and ADSIZE
r_Y2: correlation between RESPONSE and CIRC
r_12: correlation between ADSIZE and CIRC
The diagonal entries are all 1's.
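This matrix can be reproduced with numpy's pairwise Pearson correlation routine (a sketch, not part of the original slides):

```python
import numpy as np

# Ad-response data from the example (RESPONSE, ADSIZE, CIRC).
resp = np.array([1.0, 4.0, 1.0, 3.0, 2.0, 4.0])
size = np.array([1.0, 8.0, 3.0, 5.0, 6.0, 10.0])
circ = np.array([2.0, 8.0, 1.0, 7.0, 4.0, 6.0])

# np.corrcoef returns the matrix of pairwise Pearson correlations;
# rows and columns are ordered RESPONSE, ADSIZE, CIRC.
R = np.corrcoef([resp, size, circ])
print(np.round(R, 5))
```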

Global Test

The global test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are:

H0: β1 = β2 = ... = βk = 0
H1: Not all βs equal 0

The test statistic follows an F distribution with k (number of independent variables) and n − (k + 1) degrees of freedom, where n is the sample size.


Test for Individual Variables

- This test is used to determine which independent variables have nonzero regression coefficients.
- The variables that have zero regression coefficients are usually dropped from the analysis.
- The test statistic follows the t distribution with n − (k + 1) degrees of freedom.
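As a sketch (not from the original slides), the t statistic is simply the estimated coefficient divided by its standard error. The numbers below match the Size row of Example 1 later in this chapter (n = 12, k = 3, so df = 8); the critical value 2.306 is the two-sided 5% value from a standard t table, an assumption not stated on this slide.

```python
# Individual test of H0: beta_j = 0 against H1: beta_j != 0.
# Numbers match the Size coefficient of Example 1 (n = 12, k = 3).
coef, se = 748.4, 303.0
n, k = 12, 3

t_stat = coef / se          # t = estimated coefficient / standard error
df = n - (k + 1)            # degrees of freedom: 12 - (3 + 1) = 8

T_CRIT = 2.306              # two-sided 5% critical value, t table, df = 8
reject_h0 = abs(t_stat) > T_CRIT
print(round(t_stat, 2), reject_h0)  # 2.47 True
```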


EXAMPLE 1

A market researcher for Super Dollar Super Markets is studying the yearly amount families of four or more spend on food. Three independent variables are thought to be related to yearly food expenditures ( Food ). Those variables are: total family income ( Income ) in $00, size of family ( Size ), and whether the family has children in college ( College ).


Example 1

continued

Note the following regarding the regression equation.

- The variable College is called a dummy or indicator variable. It can take only one of two possible outcomes: a child is a college student or not.
- Other examples of dummy variables include:
  - gender,
  - whether a part is acceptable or unacceptable,
  - whether a voter will or will not vote for the incumbent governor.
- We usually code one value of the dummy variable as "1" and the other as "0."


EXAMPLE 1

continued

Family  Food   Income  Size  Student
  1     3900    376     4      0
  2     5300    515     5      1
  3     4300    516     4      0
  4     4900    468     5      0
  5     6400    538     6      1
  6     7300    626     7      1
  7     4900    543     5      0
  8     5300    437     4      0
  9     6100    608     5      1
 10     6400    513     6      1
 11     7400    493     6      1
 12     5800    563     5      0

EXAMPLE 1

continued

- Use a computer software package, such as Excel, to develop a correlation matrix.
- From the analysis provided by Excel, write out the regression equation: Y* = 954 + 1.09 X1 + 748 X2 + 565 X3
- What food expenditure would you estimate for a family of 4, with no college students, and an income of $50,000 (which is input as 500)?


EXAMPLE 1

continued

The regression equation is: Food = 954 + 1.09 Income + 748 Size + 565 Student

Predictor  Coef    SE Coef  T     P
Constant   954     1581     0.60  0.563
Income     1.092   3.153    0.35  0.738
Size       748.4   303.0    2.47  0.039
Student    564.5   495.1    1.14  0.287

S = 572.7   R-Sq = 80.4%   R-Sq(adj) = 73.1%

Analysis of Variance
Source          DF   SS        MS       F      P
Regression       3   10762903  3587634  10.94  0.003
Residual Error   8   2623764   327970
Total           11   13386667
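The global F statistic in this output can be reproduced from the ANOVA sums of squares (a sketch, not part of the original slides):

```python
# Global F test from the ANOVA table: F = MSR / MSE.
ss_regression, df_regression = 10762903, 3
ss_residual, df_residual = 2623764, 8

msr = ss_regression / df_regression   # mean square for regression
mse = ss_residual / df_residual       # mean square error
f_stat = msr / mse
print(round(f_stat, 2))  # 10.94, as reported in the output
```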


EXAMPLE 1

continued

From the regression output we note:

   

The coefficient of determination is 80.4 percent. This means that more than 80 percent of the variation in the amount spent on food is accounted for by the variables income, family size, and student.

Each additional $100 dollars of income per year will increase the amount spent on food by $109 per year.

An additional family member will increase the amount spent per year on food by $748. A family with a college student will spend $565 more per year on food than those without a college student.


EXAMPLE 1

continued

The correlation matrix is as follows: Food Income Size Income 0.587

Size 0.876 0.609

Student 0.773 0.491 0.743

 

The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food.

None of the correlations among the independent variables should cause problems. All are between – .70 and .70.


EXAMPLE 1

continued

The estimated food expenditure for a family of 4 with a $500 (that is, $50,000) income and no college student is $4,491.

Y* = 954 + 1.09(500) + 748(4) + 565(0) = 4491
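The arithmetic can be checked directly (a sketch, not part of the original slides):

```python
# Estimated food expenditure from the fitted equation in Example 1.
income, size, student = 500, 4, 0   # $50,000 income, family of 4, no student

y_star = 954 + 1.09 * income + 748 * size + 565 * student
print(round(y_star))  # 4491
```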

EXAMPLE 1

continued

Conduct a global test of hypothesis to determine if any of the regression coefficients are not zero.

H0: β1 = β2 = β3 = 0  versus  H1: Not all βs equal 0

- H0 is rejected if F > 4.07.
- From the computer output, the computed value of F is 10.94.
- Decision: H0 is rejected. Not all the regression coefficients are zero.


EXAMPLE 1

continued

Conduct an individual test to determine which coefficients are not zero. These are the hypotheses for the independent variable family size:

H0: β2 = 0  versus  H1: β2 ≠ 0

- Using the 5% level of significance, reject H0 if the p-value < .05.
- From the computer output, the only significant variable is Size (family size) using the p-values. The other variables can be omitted from the model.


EXAMPLE 1

continued

- We rerun the analysis using only the significant independent variable, family size.
- The new regression equation is: Y* = 340 + 1031 X2
- The coefficient of determination is 76.8 percent. We dropped two independent variables, and the R-square term was reduced by only 3.6 percent.


Example 1

continued

Regression Analysis: Food versus Size The regression equation is Food = 340 + 1031 Size Predictor Coef SE Coef T P Constant 339.7 940.7 0.36 0.726

Size 1031.0 179.4 5.75 0.000

S = 557.7 R-Sq = 76.8% R-Sq(adj) = 74.4% Analysis of Variance Source DF SS MS F P Regression 1 10275977 10275977 33.03 0.000

Residual Error 10 3110690 311069 Total 11 13386667


Evaluating the Model

y_i* = b0 + b1 x1i + ... + bk xki

Most of the procedures used to evaluate the multiple regression model are the same as those discussed in the chapter on simple regression models:

- Residual analysis
- Test for linearity
- Global F-test
- Test for the coefficient of correlation (irrelevant)
- Test for individual slope of regression line (not enough)
- Non-independence of the error variables: Durbin-Watson statistic
- Outliers


Chapter Fourteen

Multiple Regression and Correlation Analysis

- END -
