Chapter 3 Multiple Linear Regression

3.3 Hypothesis Testing in Multiple Linear Regression

• Questions: – What is the overall adequacy of the model?

– Which specific regressors seem important?

• Assume the errors are independent and follow a normal distribution with mean 0 and variance σ².

3.3.1 Test for Significance of Regression

• Determine if there is a linear relationship between y and the regressors x_j, j = 1, 2, …, k.

• The hypotheses are H_0: β_1 = β_2 = … = β_k = 0 vs. H_1: β_j ≠ 0 for at least one j.

• ANOVA identity: SS_T = SS_R + SS_Res

• SS_R/σ² ~ χ²_k, SS_Res/σ² ~ χ²_{n−k−1}, and SS_R and SS_Res are independent.

• The test statistic:

$$F_0 = \frac{SS_R/k}{SS_{Res}/(n-k-1)} = \frac{MS_R}{MS_{Res}} \sim F_{k,\,n-k-1}$$

• Expected mean squares:

$$E(MS_{Res}) = \sigma^2, \qquad E(MS_R) = \sigma^2 + \frac{\beta^{*\prime} X_c' X_c\, \beta^*}{k\sigma^2}$$

where β* = (β_1, …, β_k)' and X_c is the centered regressor matrix

$$X_c = \begin{bmatrix} x_{11}-\bar{x}_1 & \cdots & x_{1k}-\bar{x}_k \\ \vdots & & \vdots \\ x_{n1}-\bar{x}_1 & \cdots & x_{nk}-\bar{x}_k \end{bmatrix}$$

• Under H_1, F_0 follows a noncentral F distribution with k and n − k − 1 degrees of freedom and noncentrality parameter

$$\lambda = \frac{\beta^{*\prime} X_c' X_c\, \beta^*}{\sigma^2}$$

• ANOVA table
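• As a concrete illustration of the significance-of-regression test, here is a minimal numpy/scipy sketch on synthetic data (the data, sample size, and variable names are made up for illustration; this is not the textbook's delivery time data):

```python
import numpy as np
from scipy import stats

# Synthetic data: n = 25 observations, k = 2 regressors (illustrative only)
rng = np.random.default_rng(0)
n, k = 25, 2
X0 = rng.uniform(0, 10, size=(n, k))                 # regressors
y = 5 + 2.0 * X0[:, 0] + 0.5 * X0[:, 1] + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), X0])                # add the intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # least-squares estimates
y_hat = X @ beta_hat

SS_T = np.sum((y - y.mean()) ** 2)                   # corrected total sum of squares
SS_Res = np.sum((y - y_hat) ** 2)
SS_R = SS_T - SS_Res                                 # ANOVA identity: SS_T = SS_R + SS_Res

MS_R = SS_R / k
MS_Res = SS_Res / (n - k - 1)
F0 = MS_R / MS_Res                                   # F0 = MS_R / MS_Res ~ F(k, n-k-1)
p_value = stats.f.sf(F0, k, n - k - 1)
print(f"F0 = {F0:.2f}, p-value = {p_value:.4g}")
```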

Example 3.3 The Delivery Time Data


• R² and Adjusted R²: – R² always increases when a regressor is added to the model, regardless of the value of the contribution of that variable.

– An adjusted R²:

$$R^2_{adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)}$$

– The adjusted R² will only increase when a variable is added to the model if the addition of that variable reduces the residual mean square.
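• A small sketch of computing R² and the adjusted R² directly from the sums of squares (the numeric inputs below are hypothetical, chosen only to exercise the function):

```python
def r2_and_adjusted_r2(ss_res: float, ss_t: float, n: int, p: int) -> tuple[float, float]:
    """Return (R^2, adjusted R^2); p counts all parameters, intercept included."""
    r2 = 1.0 - ss_res / ss_t
    r2_adj = 1.0 - (ss_res / (n - p)) / (ss_t / (n - 1))
    return r2, r2_adj

# Hypothetical values: SS_Res = 200.0, SS_T = 6000.0, n = 25 observations, p = 3 parameters
print(r2_and_adjusted_r2(200.0, 6000.0, 25, 3))   # -> roughly (0.967, 0.964)
```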

3.3.2 Tests on Individual Regression Coefficients

• For an individual regression coefficient:

– H_0: β_j = 0 vs. H_1: β_j ≠ 0

– Let C_jj be the j-th diagonal element of (X'X)^{-1}. The test statistic:

$$t_0 = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 C_{jj}}} = \frac{\hat\beta_j}{se(\hat\beta_j)} \sim t_{n-k-1}$$

– This is a partial or marginal test because the estimate of the regression coefficient β̂_j depends on all of the other regressors in the model.

– This is a test of the contribution of x_j given the other regressors in the model.
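• A minimal sketch of this marginal t test with numpy/scipy on synthetic data (the design, the true coefficients, and the tested index j are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k regressors
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(0, 1.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k - 1)                      # MS_Res

j = 2                                                          # test H0: beta_2 = 0
se_j = np.sqrt(sigma2_hat * XtX_inv[j, j])                     # sqrt(sigma^2_hat * C_jj)
t0 = beta_hat[j] / se_j
p_value = 2 * stats.t.sf(abs(t0), n - k - 1)
print(f"t0 = {t0:.3f}, p-value = {p_value:.4g}")
```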

• Example 3.4 The Delivery Time Data

• Tests on a subset of regressors: partition β = (β_1', β_2')' and X = [X_1, X_2] accordingly, where β_1 is (p − r) × 1 and β_2 is r × 1, and test H_0: β_2 = 0 vs. H_1: β_2 ≠ 0.

• For the full model, the regression sum of squares (p degrees of freedom) is

$$SS_R(\beta) = \hat\beta' X' y$$

• Under the null hypothesis, the regression sum of squares for the reduced model is

$$SS_R(\beta_1) = \hat\beta_1' X_1' y$$

• The degrees of freedom are p − r for the reduced model.

• The regression sum of square due to β 2 given β 1

SS

(  |  ) 

SS

(  ) 

SS

(  )

R

2 1

R R

1 • This is called the extra sum of squares due to β 2 and the degree of freedom is p - (p - r) = r • The test statistic

F

0 

SS R

(  2 |  1 ) /

r MS

Re

s

~

F r

,

n

p

11
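• A sketch of this extra-sum-of-squares (partial F) test, fitting the full and reduced models with numpy (synthetic data; which regressors go into the tested block β_2 is an illustrative choice):

```python
import numpy as np
from scipy import stats

def ss_r(X, y):
    """SS_R = beta_hat' X' y, the regression sum of squares as defined above."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta_hat @ (X.T @ y)

rng = np.random.default_rng(2)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 3 + 1.5 * x1 + 0.8 * x2 + 0.0 * x3 + rng.normal(0, 1, n)

X_full = np.column_stack([np.ones(n), x1, x2, x3])   # p = 4 parameters
X_red = np.column_stack([np.ones(n), x1])             # beta_1 block: intercept and x1
p = X_full.shape[1]
r = p - X_red.shape[1]                                 # r = 2 coefficients tested (x2, x3)

beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
SS_Res = np.sum((y - X_full @ beta_full) ** 2)
MS_Res = SS_Res / (n - p)

SS_extra = ss_r(X_full, y) - ss_r(X_red, y)            # SS_R(beta_2 | beta_1)
F0 = (SS_extra / r) / MS_Res
print(f"F0 = {F0:.3f}, p-value = {stats.f.sf(F0, r, n - p):.4g}")
```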

• If β_2 ≠ 0, F_0 follows a noncentral F distribution with noncentrality parameter

$$\lambda = \frac{1}{\sigma^2}\, \beta_2' X_2' \left[ I - X_1 (X_1' X_1)^{-1} X_1' \right] X_2\, \beta_2$$

• Multicollinearity: if the columns of X_2 are nearly linearly dependent on those of X_1, this test actually has no power!

• This test has maximal power when X_1 and X_2 are orthogonal to one another!

• Partial F test: given the regressors in X_1, measure the contribution of the regressors in X_2.

• Consider y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε. Then SS_R(β_1 | β_0, β_2, β_3), SS_R(β_2 | β_0, β_1, β_3), and SS_R(β_3 | β_0, β_1, β_2) are single-degree-of-freedom sums of squares.

• SS_R(β_j | β_0, …, β_{j−1}, β_{j+1}, …, β_k): the contribution of x_j as if it were the last variable added to the model.

• This single-degree-of-freedom F test is equivalent to the t test (F_0 = t_0²).

• SS_T = SS_R(β_1, β_2, β_3 | β_0) + SS_Res

• SS_R(β_1, β_2, β_3 | β_0) = SS_R(β_1 | β_0) + SS_R(β_2 | β_1, β_0) + SS_R(β_3 | β_1, β_2, β_0)

• Example 3.5 Delivery Time Data

3.3.3 Special Case of Orthogonal Columns in X

• Model: y = Xβ + ε = X_1 β_1 + X_2 β_2 + ε

• Orthogonal: X_1' X_2 = 0

• The normal equations (X'X)β̂ = X'y then separate into two blocks:

$$\begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}$$

$$\hat\beta_1 = (X_1'X_1)^{-1}X_1'y \qquad \text{and} \qquad \hat\beta_2 = (X_2'X_2)^{-1}X_2'y$$
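• A small numpy check that when X_1'X_2 = 0 the joint least-squares fit coincides with fitting each block separately (synthetic data; X_2 is deliberately constructed to be orthogonal to X_1):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
# Build X1, then make X2 orthogonal to X1 by projecting X1 out of random columns
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))
X2 = Z - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ Z)   # residuals of Z on X1, so X1'X2 = 0

y = rng.normal(size=n)
X = np.hstack([X1, X2])

beta_joint = np.linalg.solve(X.T @ X, X.T @ y)       # joint fit
beta1 = np.linalg.solve(X1.T @ X1, X1.T @ y)         # separate block fits
beta2 = np.linalg.solve(X2.T @ X2, X2.T @ y)

print(np.allclose(beta_joint, np.concatenate([beta1, beta2])))   # True
```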


3.3.4 Testing the General Linear Hypothesis

• Let T be an m × p matrix of constants, and rank(T) = r.

• Full model: y = Xβ + ε, with

$$SS_{Res}(FM) = y'y - \hat\beta' X' y \qquad (n - p \text{ degrees of freedom})$$

• Reduced model: y = Zγ + ε, where Z is an n × (p − r) matrix and γ is a (p − r) × 1 vector. Then

$$\hat\gamma = (Z'Z)^{-1} Z' y, \qquad SS_{Res}(RM) = y'y - \hat\gamma' Z' y \qquad (n - p + r \text{ degrees of freedom})$$

• The difference SS_H = SS_Res(RM) − SS_Res(FM) has r degrees of freedom. SS_H is called the sum of squares due to the hypothesis H_0: Tβ = 0.

• The test statistic:

$$F_0 = \frac{SS_H / r}{SS_{Res}(FM)/(n-p)} \sim F_{r,\,n-p}$$

• Another form:

$$F_0 = \frac{\hat\beta' T' \left[ T (X'X)^{-1} T' \right]^{-1} T \hat\beta \,/\, r}{SS_{Res}(FM)/(n-p)}$$

• For H_0: Tβ = c vs. H_1: Tβ ≠ c, the test statistic is

$$F_0 = \frac{(T\hat\beta - c)' \left[ T (X'X)^{-1} T' \right]^{-1} (T\hat\beta - c)\,/\, r}{SS_{Res}(FM)/(n-p)} \sim F_{r,\,n-p}$$
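• A sketch of the general linear hypothesis test in its quadratic-form version, for a hypothetical hypothesis Tβ = c (here β_1 = β_2 and β_3 = 0); all data and names below are synthetic and illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # p = 4 parameters
y = X @ np.array([2.0, 1.0, 1.0, 0.0]) + rng.normal(0, 1, n)
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
SS_Res_FM = np.sum((y - X @ beta_hat) ** 2)

# Hypothetical hypothesis: beta_1 - beta_2 = 0 and beta_3 = 0  (T beta = c, r = 2)
T = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0,  0.0, 1.0]])
c = np.zeros(2)
r = np.linalg.matrix_rank(T)

d = T @ beta_hat - c
SS_H = d @ np.linalg.solve(T @ XtX_inv @ T.T, d)   # (Tb - c)'[T(X'X)^-1 T']^-1 (Tb - c)
F0 = (SS_H / r) / (SS_Res_FM / (n - p))
print(f"F0 = {F0:.3f}, p-value = {stats.f.sf(F0, r, n - p):.4g}")
```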

3.4 Confidence Intervals in Multiple Regression

3.4.1 Confidence Intervals on the Regression Coefficients

• Under the normality assumption,

$$\hat\beta \sim MN\!\left(\beta,\; \sigma^2 (X'X)^{-1}\right)$$

so a 100(1 − α)% confidence interval on an individual coefficient β_j is

$$\hat\beta_j \pm t_{\alpha/2,\,n-p}\sqrt{\hat\sigma^2 C_{jj}}$$


3.4.2 Confidence Interval Estimation of the Mean Response

• A confidence interval on the mean response at a particular point.

• x_0 = (1, x_{01}, …, x_{0k})'

• The unbiased estimator of E(y | x_0) is ŷ_0 = x_0'β̂, with

$$E(\hat y_0) = E(y \mid x_0) = x_0'\beta, \qquad Var(\hat y_0) = \sigma^2\, x_0'(X'X)^{-1}x_0$$

• The 100(1 − α)% confidence interval on the mean response at x_0:

$$\hat y_0 \pm t_{\alpha/2,\,n-p}\left(\hat\sigma^2\, x_0'(X'X)^{-1}x_0\right)^{1/2}$$

• Example 3.9 The Delivery Time Data

3.4.3 Simultaneous Confidence Intervals on Regression Coefficients

• The joint 100(1 − α)% confidence region for β is an elliptically shaped region:

$$\frac{(\hat\beta - \beta)'\, X'X\, (\hat\beta - \beta)}{p\, MS_{Res}} \le F_{\alpha,\,p,\,n-p}$$

• Example 3.10 The Rocket Propellant Data


• Another approach: construct intervals of the form

$$\hat\beta_j \pm \Delta\, se(\hat\beta_j), \qquad j = 0, 1, \ldots, k$$

• Δ is chosen so that a specified probability that all intervals are simultaneously correct is obtained.

• Bonferroni method: Δ = t_{α/2p, n−p}

• Scheffé S-method: Δ = (2F_{α,p,n−p})^{1/2}

• Maximum modulus t procedure: Δ = u_{α,p,n−2}, the upper α tail point of the distribution of the maximum absolute value of two independent Student t random variables, each based on n − 2 degrees of freedom.

Example 3.11 The Rocket Propellant Data

• Find 90% joint C.I. for β 0 and β 1 by constructing a 95% C.I. for each parameter.


• The confidence ellipse is always a more efficient procedure than the Bonferroni method because the volume of the ellipse is always less than the volume of the space covered by the Bonferroni intervals.

• Bonferroni intervals are easier to construct.

• The length of the C.I.s: maximum modulus t < Bonferroni method < Scheffé S-method.

3.5 Prediction of New Observations

• A 100(1 − α)% prediction interval on a future observation y_0 at the point x_0:

$$\hat y_0 \pm t_{\alpha/2,\,n-p}\left(\hat\sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right)\right)^{1/2}$$

3.6 Hidden Extrapolation in Multiple Regression

• Be careful about extrapolating beyond the region containing the original observations!

• The rectangular region formed by the ranges of the individual regressors is NOT the same as the joint data region.

• Regressor variable hull (RVH): the convex hull of the original n data points.

– Interpolation: x_0 ∈ RVH

– Extrapolation: x_0 ∉ RVH

• The diagonal elements h_ii of the hat matrix H = X(X'X)^{-1}X' are useful in detecting hidden extrapolation.

• h_max: the maximum of the h_ii. The point x_i that has the largest value of h_ii will lie on the boundary of the RVH.

• The set {x | x'(X'X)^{-1}x ≤ h_max} is an ellipsoid enclosing all points inside the RVH.

• Let h_00 = x_0'(X'X)^{-1}x_0.

– h_00 ≤ h_max: x_0 is inside or on the boundary of the RVH.

– h_00 > h_max: x_0 is outside the RVH.
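• A sketch of the h_00 versus h_max check for hidden extrapolation (numpy; the design matrix and the candidate prediction points are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, k))])

XtX_inv = np.linalg.inv(X.T @ X)
h_diag = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of H = X (X'X)^-1 X'
h_max = h_diag.max()

def is_hidden_extrapolation(x_new: np.ndarray) -> bool:
    """True if x_new = (1, x1, ..., xk)' lies outside {x : x'(X'X)^-1 x <= h_max}."""
    h00 = x_new @ XtX_inv @ x_new
    return h00 > h_max

print(is_hidden_extrapolation(np.array([1.0, 5.0, 5.0])))     # an interior-looking point
print(is_hidden_extrapolation(np.array([1.0, 30.0, -5.0])))   # well outside the observed ranges
```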

• MCE : minimum covering ellipsoid (Weisberg, 1985).


3.7 Standardized Regression Coefficients

• It is difficult to compare regression coefficients directly because their magnitudes depend on the units of the regressors and of the response.

• Unit normal scaling (standardize each variable like a normal r.v.):

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad y_i^* = \frac{y_i - \bar{y}}{s_y}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, k$$

where s_j and s_y are the sample standard deviations of regressor x_j and of y.

• New model:

$$y_i^* = b_1 z_{i1} + \cdots + b_k z_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n$$

– There is no intercept.

– The least-squares estimator of b is

$$\hat{b} = (Z'Z)^{-1} Z' y^*$$

• Unit length scaling:

$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}, \qquad y_i^0 = \frac{y_i - \bar{y}}{\sqrt{SS_T}}, \qquad S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$$

so each scaled regressor w_j has mean 0 and unit length.

• New model:

$$y_i^0 = b_1 w_{i1} + \cdots + b_k w_{ik} + \varepsilon_i, \qquad i = 1, \ldots, n$$

• The least-squares estimator:

$$\hat{b} = (W'W)^{-1} W' y^0$$

• It does not matter which scaling we use: both produce the same set of dimensionless regression coefficients.
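• A sketch of unit length scaling and the resulting dimensionless coefficients (numpy; the data and units are synthetic and the names illustrative). Unit normal scaling would give the same b̂:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X0 = np.column_stack([rng.uniform(0, 100, n), rng.uniform(0, 5, n)])   # very different units
y = 4 + 0.03 * X0[:, 0] + 2.0 * X0[:, 1] + rng.normal(0, 1, n)

# Unit length scaling: w_ij = (x_ij - xbar_j)/sqrt(S_jj), y0_i = (y_i - ybar)/sqrt(SS_T)
Xc = X0 - X0.mean(axis=0)
W = Xc / np.sqrt((Xc ** 2).sum(axis=0))
y0 = (y - y.mean()) / np.sqrt(((y - y.mean()) ** 2).sum())

b_hat = np.linalg.solve(W.T @ W, W.T @ y0)    # dimensionless standardized coefficients
print(b_hat)
```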


3.8 Multicollinearity

• A serious problem: Multicollinearity or near-linear dependence among the regression variables.

• The regressors are the columns of X, so an exact linear dependence would result in a singular X'X.

• Unit length scaling, orthogonal regressors (Figure 3.12):

$$W'W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad (W'W)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \frac{Var(\hat{b}_1)}{\sigma^2} = \frac{Var(\hat{b}_2)}{\sigma^2} = 1$$

• Soft drink data:

$$W'W = \begin{bmatrix} 1 & 0.824 \\ 0.824 & 1 \end{bmatrix}, \qquad (W'W)^{-1} = \begin{bmatrix} 3.12 & -2.57 \\ -2.57 & 3.12 \end{bmatrix}, \qquad \frac{Var(\hat{b}_1)}{\sigma^2} = \frac{Var(\hat{b}_2)}{\sigma^2} = 3.12$$

• The off-diagonal elements of W'W are usually called the simple correlations between the regressors.

• Variance inflation factors (VIFs):

– The main diagonal elements of the inverse of X'X in correlation form ((W'W)^{-1} above).

– From the two cases above: soft drink data, VIF_1 = VIF_2 = 3.12; Figure 3.12, VIF_1 = VIF_2 = 1.

– VIF_j = 1/(1 − R_j²), where R_j² is the coefficient of multiple determination obtained from regressing x_j on the other regressor variables.

– If x_j is nearly linearly dependent on some of the other regressors, then R_j² ≈ 1 and VIF_j will be large.

– VIFs larger than 10 indicate serious problems with multicollinearity.
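• A sketch computing VIFs two ways, as the diagonal of (W'W)^{-1} and as 1/(1 − R_j²), on synthetic near-collinear data (the names and the degree of collinearity are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # deliberately near-collinear with x1
x3 = rng.normal(size=n)
X0 = np.column_stack([x1, x2, x3])

# Unit length scaling -> W'W is the correlation matrix of the regressors
Xc = X0 - X0.mean(axis=0)
W = Xc / np.sqrt((Xc ** 2).sum(axis=0))
vif_from_inverse = np.diag(np.linalg.inv(W.T @ W))

# Equivalent definition: VIF_j = 1/(1 - R_j^2), R_j^2 from regressing x_j on the others
vif_from_r2 = []
for j in range(X0.shape[1]):
    others = np.column_stack([np.ones(n), np.delete(X0, j, axis=1)])
    fitted = others @ np.linalg.lstsq(others, X0[:, j], rcond=None)[0]
    r2_j = 1 - np.sum((X0[:, j] - fitted) ** 2) / np.sum((X0[:, j] - X0[:, j].mean()) ** 2)
    vif_from_r2.append(1 / (1 - r2_j))

print(np.round(vif_from_inverse, 2), np.round(vif_from_r2, 2))   # the two agree
```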

• Figure 3.13 (a): the fitted plane is unstable and very sensitive to relatively small changes in the data points.

• Figure 3.13 (b): orthogonal regressors give a much more stable fit.

3.9 Why Do Regression Coefficients Have the Wrong Sign?

• Possible reasons for a wrong sign: 1. The range of some of the regressors is too small.

2. Important regressors have not been included in the model.

3. Multicollinearity is present.

4. Computational errors have been made.


• For reason 1: in simple linear regression,

$$Var(\hat\beta_1) = \frac{\sigma^2}{S_{xx}} = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

so a narrow range of x inflates the variance of β̂_1, making a wrong sign more likely.

• Although it is possible to decrease the variance of the regression coefficients by increasing the range of the x's, it may not be desirable to spread the levels of the regressors out too far:

– The true response function may be nonlinear over a wider range.

– A wider spread may be impractical or impossible.

• For reason 2:

• ŷ = 1.835 + 0.463 x_1: here β̂_1 is a "total" regression coefficient.

• ŷ = 1.036 − 1.222 x_1 + 3.649 x_2: here β̂_1 is the effect of x_1 given that x_2 is also in the model.

• For reason 3: multicollinearity inflates the variances of the coefficients, and this increases the probability that one or more regression coefficients will have the wrong sign.

• For reason 4: different computer programs handle round-off or truncation problems in different ways, and some programs are more effective than others in this regard.