Analysis of Variance in Matrix Form


Multiple Regression

• A multiple regression model is a model with more than one explanatory variable.

• Some of the reasons to use multiple regression models are:
 Often multiple X's arise naturally from a study.
 We want to control for some X's.
 We want to fit a polynomial.
 We want to compare regression lines for two or more groups.

Multiple Linear Regression Model

• In a multiple linear regression model there are p predictor variables.

• The model is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, n$$

• This model is linear in the β's. The variables themselves may be non-linear, e.g., $\log(X_1)$, $X_1 X_2$, etc.

• We need to estimate the p + 1 β's and σ².

• There are p + 2 parameters in this model, so we need at least that many observations to be able to estimate them, i.e., we need n ≥ p + 2.
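To make the model concrete, here is a minimal Python sketch that simulates data from it; the sample size, the number of predictors, and the β values are all made-up illustrative choices, not part of the slides.

    import numpy as np

    # Simulate from Y_i = beta_0 + beta_1*X_i1 + ... + beta_p*X_ip + eps_i
    rng = np.random.default_rng(0)
    n, p = 50, 3                             # illustrative sample size and predictor count
    beta = np.array([2.0, 1.5, -0.5, 0.8])   # made-up (beta_0, beta_1, ..., beta_p)

    X = rng.normal(size=(n, p))              # n observations of p predictor variables
    eps = rng.normal(size=n)                 # errors with mean 0 and variance sigma^2 = 1
    Y = beta[0] + X @ beta[1:] + eps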


Multiple Regression Model in Matrix Form

• In matrix notation the multiple regression model is:

$$Y = X\beta + \varepsilon$$

where

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}$$

• Note, $Y$ and $\varepsilon$ are $n \times 1$ vectors, $\beta$ is a $(p+1) \times 1$ vector, and $X$ is an $n \times (p+1)$ matrix. The matrix $X$ is called the 'design matrix'.

• The Gauss-Markov assumptions are: $E(\varepsilon \mid X) = 0$ and $\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I$.

• These imply $E(Y \mid X) = X\beta$ and $\mathrm{Var}(Y \mid X) = \sigma^2 I$.

• The least-squares estimate of $\beta$ is

$$b = (X'X)^{-1} X'Y.$$
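A minimal numerical sketch of this formula, reusing the simulated data from the earlier snippet; solving the normal equations with a linear solve rather than forming the inverse explicitly is just a standard numerical choice, not something from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X0 = rng.normal(size=(n, p))                                  # raw predictors
    Y = 2.0 + X0 @ np.array([1.5, -0.5, 0.8]) + rng.normal(size=n)

    # Design matrix: a column of ones followed by the p predictors
    X = np.column_stack([np.ones(n), X0])

    # b = (X'X)^{-1} X'Y, computed by solving the normal equations
    b = np.linalg.solve(X.T @ X, X.T @ Y)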


Estimate of σ²

• The estimate of σ² is:

$$s^2 = MSE = \frac{\sum_{i=1}^{n} e_i^2}{df_{\text{error}}} = \frac{e'e}{n - p - 1}$$

• It has n − p − 1 degrees of freedom because the p + 1 β's must be estimated before the residuals can be computed, which imposes p + 1 linear constraints on the residuals.

• Claim: s² is an unbiased estimator of σ².

Proof:
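The slide leaves the proof to be filled in; a standard sketch, using the hat matrix $H = X(X'X)^{-1}X'$ (not defined on the slide itself), goes as follows. Since $(I - H)X = 0$, the residuals satisfy $e = (I - H)Y = (I - H)\varepsilon$, and $I - H$ is symmetric and idempotent with $\mathrm{tr}(I - H) = n - (p + 1)$. Hence, using the Gauss-Markov assumptions,

$$E(e'e) = E\big(\varepsilon'(I - H)\varepsilon\big) = \sigma^2\,\mathrm{tr}(I - H) = \sigma^2(n - p - 1),$$

so $E(s^2) = E(e'e)/(n - p - 1) = \sigma^2$.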


General Comments about Multiple Regression

• The regression equation gives the mean response for each combination of explanatory variables.

• The regression equation will not be useful if it is very complicated or a function of a large number of explanatory variables.

• We generally want a "parsimonious" model, that is, a model that is as simple as possible while still adequately describing the response variable.

• It is unwise to think that there is some exact, discoverable equation.

• Many possible models are available.

• One or two models may adequately approximate the mean of the response variable.


Example – House Prices in Chicago

• Data on 26 house sales in Chicago were collected (clearly collected some time ago). The variables in the data set are:

 price - selling price in $1000's
 bdr - number of bedrooms
 flr - floor space in square feet
 fp - number of fireplaces
 rms - number of rooms
 st - storm windows (1 if present, 0 if absent)
 lot - lot size (frontage) in feet
 bth - number of bathrooms
 gar - garage size (0 = no garage, 1 = one-car garage, etc.)
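A sketch of how this model could be fit in Python with the statsmodels library; the file name chicago_houses.csv is a hypothetical stand-in for wherever the data actually live, and only the column names come from the slide.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file; columns named as on the slide
    houses = pd.read_csv("chicago_houses.csv")

    y = houses["price"]
    X = houses[["bdr", "flr", "fp", "rms", "st", "lot", "bth", "gar"]]
    X = sm.add_constant(X)        # adds the intercept column of ones

    fit = sm.OLS(y, X).fit()
    print(fit.summary())          # coefficient table, t-tests, R^2, F-test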

Interpreting Regression Coefficients

• In general, in multiple regression we interpret the coefficient of the j-th predictor variable (β_j or b_j) as the change in Y associated with a change of one unit in X_j, with all the other variables held constant.

• Note that it may be impossible to hold all other variables constant.

• For example, in the house price data above, for 100 extra square feet of floor space (everything else held constant), the price goes up by $1,760 on average. For one more room (everything else held constant), the price goes up by $3,900 on average. For one more bedroom (everything else held constant), the price goes down by $7,700 on average.
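Since price is recorded in $1000's, these statements back out the fitted coefficients: $b_{flr} \approx 0.0176$ (thousand dollars per square foot), $b_{rms} \approx 3.9$, and $b_{bdr} \approx -7.7$. For instance, for the floor-space effect,

$$100 \times b_{flr} \approx 100 \times 0.0176 = 1.76 \text{ thousand dollars} = \$1760.$$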


Inference for Regression Coefficients

• As in simple linear regression, we are interested in testing H₀: β_j = 0 versus H_a: β_j ≠ 0.

• The test statistic is

$$t_{stat} = \frac{b_j}{s.e.(b_j)}$$

It has a t-distribution with n − p − 1 degrees of freedom.

• We can calculate the P-value from the t-table with n − p − 1 df.

• This test gives an indication of whether or not the j-th predictor variable contributes significantly to the prediction of the response variable over and above all the other predictor variables.

• A 100(1 − α)% confidence interval for β_j is $b_j \pm t_{(1-\alpha/2;\; n-p-1)} \, s.e.(b_j)$.
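A minimal sketch of these t-tests by hand in Python, again on simulated data; it uses the standard fact (not shown on the slide) that $\mathrm{Var}(b) = \sigma^2 (X'X)^{-1}$, so the standard errors are the square roots of the diagonal of $s^2 (X'X)^{-1}$.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    Y = X @ np.array([2.0, 1.5, -0.5, 0.8]) + rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    s2 = e @ e / (n - p - 1)                      # estimate of sigma^2

    # Standard errors from the diagonal of s^2 (X'X)^{-1}
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    t_stat = b / se                               # one t statistic per coefficient
    p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)   # two-sided P-values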


ANOVA Table

• The ANOVA table in the multiple regression model is given by…
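The slide leaves the table to be filled in; for reference, the standard ANOVA table for the multiple regression model (consistent with the F-test defined below) is:

 Source       df         SS      MS                   F
 Regression   p          SSReg   MSReg = SSReg/p      MSReg/MSE
 Error        n - p - 1  SSE     MSE = SSE/(n-p-1)
 Total        n - 1      SST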

Coefficient of Multiple Determination – R²

• As in the simple linear regression model, R² = SSReg/SST.

• In multiple regression this is called the "coefficient of multiple determination"; it is not the square of a correlation coefficient.

• In multiple regression, we need to be cautious about judging a model by R² because it always goes up when more predictor variables are added to the model, regardless of whether the predictor variables are useful for predicting Y.


Adjusted R²

• An attempt to make R² more useful is to calculate Adjusted R² ("Adj R-Sq" in SAS).

• Adjusted R² is adjusted for the number of predictor variables in the model.

• It can actually go down when more predictors are added.

• It can be used for choosing the best model.

• It is defined as

$$\text{Adj } R^2 = 1 - \frac{(n-1)\, MSE}{SST} = 1 - \frac{n-1}{n-p-1} \cdot \frac{SSE}{SST}$$

• Note that Adjusted R² will increase only if MSE decreases.
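A short sketch of both quantities in Python, reusing the simulated-data setup from the earlier snippets (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    Y = X @ np.array([2.0, 1.5, -0.5, 0.8]) + rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b

    sse = e @ e                            # sum of squared errors
    sst = np.sum((Y - Y.mean()) ** 2)      # total sum of squares

    r2 = 1 - sse / sst                     # R^2 = SSReg/SST = 1 - SSE/SST
    adj_r2 = 1 - (n - 1) / (n - p - 1) * (sse / sst)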


ANOVA F Test in Multiple Regression

• In multiple regression, the ANOVA F test is designed to test the following hypotheses:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$
$$H_a: \text{at least one of } \beta_1, \beta_2, \ldots, \beta_p \text{ is not } 0$$

• This test aims to assess whether or not the model has any predictive ability.

• The test statistic is

$$F_{stat} = \frac{MSReg}{MSE}$$

• If $H_0$ is true, the above test statistic has an F distribution with p and n − p − 1 degrees of freedom.
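An illustrative computation of the F statistic and its P-value, once more on the simulated data used in the earlier snippets:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    Y = X @ np.array([2.0, 1.5, -0.5, 0.8]) + rng.normal(size=n)

    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b

    sse = e @ e
    ssreg = np.sum((X @ b - Y.mean()) ** 2)           # regression sum of squares

    f_stat = (ssreg / p) / (sse / (n - p - 1))        # MSReg / MSE
    p_value = stats.f.sf(f_stat, p, n - p - 1)        # upper-tail F probability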


F-Test versus t-Tests in Multiple Regression

• In multiple regression, the F test is designed to test the overall model while the t tests are designed to test individual coefficients.

• If the F-test is significant and all or some of the t-tests are significant, then there are some useful explanatory variables for predicting Y.

• If the F-test is not significant (large P-value) and all the t-tests are not significant, it means that no explanatory variable contributes to the prediction of Y.

• If the F-test is significant and all the t-tests are not significant, it is an indication of "multicollinearity", i.e., correlated X's. It means that individual X's don't contribute to the prediction of Y over and above the other X's.

• If the F-test is not significant and some of the t-tests are significant, it is an indication of one of two things:
 The model has no predictive ability, but if there are many predictors, a few may have small P-values (Type I errors in the t-tests).
 Predictors were chosen poorly. If one useful predictor is added to many that are unrelated to the outcome, its contribution may not be enough for the model to be significant (F-test).
