ANOVA F Test in Multiple Regression

• In multiple regression, the ANOVA F test is designed to test the following hypotheses:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_a$: at least one of $\beta_1, \beta_2, \ldots, \beta_k$ is not 0
• This test assesses whether or not the model has any predictive ability.
• The test statistic is
$F_{stat} = \frac{MSR}{MSE}$
• If $H_0$ is true, this test statistic has an F distribution with k and n-k-1 degrees of freedom (a minimal computational sketch follows below).
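As an illustration, here is a minimal Python sketch of this computation using numpy and scipy; the simulated data and variable names are purely hypothetical and not part of the original example.

```python
import numpy as np
from scipy import stats

# Hypothetical data: n observations, k predictors (illustrative values only)
rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept
y = X @ np.array([2.0, 1.0, 0.5, 0.0]) + rng.normal(size=n)

# Least-squares fit
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

sse = np.sum((y - y_hat) ** 2)           # error (residual) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares

msr = ssr / k                  # MSR = SSR / k
mse = sse / (n - k - 1)        # MSE = SSE / (n - k - 1)
f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)   # upper tail of F(k, n-k-1)
print(f_stat, p_value)
```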
F-Test versus t-Tests in Multiple Regression
• In multiple regression, the F test is designed to test the overall model
while the t tests are designed to test individual coefficients.
• If the F-test is significant and all or some of the t-tests are significant,
then there are some useful explanatory variables for predicting Y.
• If the F-test is not significant (large P-value) and all the t-tests are not significant, it means that no explanatory variable contributes to the prediction of Y.
• If the F-test is significant and all the t-tests are not significant, then it is an indication of “multicollinearity”, i.e., correlated X’s. It means that the individual X’s don’t contribute to the prediction of Y over and above the other X’s.
• If the F-test is not significant and some of the t-tests are significant,
it is an indication of one of two things:
– The model has no predictive ability, but if there are many predictors we can expect to get some Type I errors in the t-tests.
– The predictors were chosen poorly. If one useful predictor is added to many that are unrelated to the outcome, its contribution may not be enough for the model to have statistically significant predictive ability.
CIs and PIs in Multiple Regression
• The standard error of the estimate of the mean value of Y at new values of the explanatory variables ($X_h$) is:
$s \sqrt{X_h' (X'X)^{-1} X_h}$
• The 100(1-α)% CI for the mean value of Y at $X_h$ is:
$\hat{Y} \pm t_{n-k-1;\,\alpha/2}\, s \sqrt{X_h' (X'X)^{-1} X_h}$
• The standard error of the predicted value of Y at new values of the explanatory variables ($X_h$) is:
$s \sqrt{1 + X_h' (X'X)^{-1} X_h}$
• The 100(1-α)% PI for the predicted value of Y at $X_h$ is:
$\hat{Y} \pm t_{n-k-1;\,\alpha/2}\, s \sqrt{1 + X_h' (X'X)^{-1} X_h}$
(Both intervals are computed in the sketch below.)
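A minimal Python sketch (numpy/scipy) of both intervals, assuming X is the design matrix with a leading column of ones and x_h includes a leading 1 for the intercept; the function and variable names are my own, for illustration only.

```python
import numpy as np
from scipy import stats

def mean_ci_and_pi(X, y, x_h, alpha=0.05):
    """CI for the mean response and PI for a new response at x_h."""
    n, p = X.shape                                   # p = k + 1 parameters
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s = np.sqrt(resid @ resid / (n - p))             # residual standard error, sqrt(MSE)
    t_crit = stats.t.ppf(1 - alpha / 2, n - p)       # t_{n-k-1; alpha/2}

    y_hat = x_h @ beta_hat
    se_mean = s * np.sqrt(x_h @ XtX_inv @ x_h)       # SE of the estimated mean at x_h
    se_pred = s * np.sqrt(1 + x_h @ XtX_inv @ x_h)   # SE of a new prediction at x_h
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return ci, pi
```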
Example
• Consider the house prices example. Suppose we are interested in
predicting the price of a house with 2 bdr, 750 sqft, 1 fp, 5 rms,
storm windows (st=1), 25 foot lot, 1.5 baths and a 1 car garage.
• Then Xh is ….
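Purely for illustration, and assuming a hypothetical ordering of the design-matrix columns, the new-value vector could be set up as below; the actual ordering must match how the model was fit.

```python
import numpy as np

# Hypothetical column order: intercept, bdr, sqft, fp, rms, st, lot, bath, garage
x_h = np.array([1, 2, 750, 1, 5, 1, 25, 1.5, 1])

# With the interval helper sketched earlier (X, y taken from the house price data):
# ci, pi = mean_ci_and_pi(X, y, x_h)
```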
Multicollinearity
• Multicollinearity occurs when explanatory variables are highly
correlated, in which case, it is difficult or impossible to measure their
individual influence on the response.
• The fitted regression equation is unstable.
• The estimated regression coefficients vary widely from data set to data
set (even if data sets are very similar) and depending on which predictor
variables are in the model.
• The estimated regression coefficients may even have the opposite sign from what is expected (e.g., bedrooms in the house price example).
• The regression coefficients may not be statistically significantly different from 0 even when the corresponding explanatory variable is known to have a relationship with the response.
• When some X’s are perfectly correlated, we can’t estimate β because X’X is singular.
• Even if X’X is only close to singular, its determinant will be close to 0 and the standard errors of the estimated coefficients will be large (illustrated in the sketch below).
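A small numpy sketch with fabricated predictors illustrating the effect: when one predictor is nearly a copy of another, X'X is nearly singular and the diagonal of (X'X)^{-1}, which drives the coefficient variances, blows up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)     # nearly a copy of x1
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.det(XtX))           # tiny relative to the size of the entries
print(np.linalg.cond(XtX))          # huge condition number: X'X is close to singular
print(np.diag(np.linalg.inv(XtX)))  # large entries -> large coefficient standard errors
```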
Assessing Multicollinearity
• To assess multicollinearity we calculate the Variance Inflation Factor for each of the predictor variables in the model.
• The variance inflation factor for the ith predictor variable is defined as
$VIF_i = \frac{1}{1 - R_i^2}$
where $R_i^2$ is the coefficient of multiple determination obtained when the ith predictor variable is regressed against the other predictor variables.
• A large value of $VIF_i$ is a sign of multicollinearity (a computational sketch follows below).
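A minimal Python sketch of computing VIFs directly from this definition; X is assumed to be the n×k matrix of predictor values without the intercept column (statsmodels also offers a ready-made variance_inflation_factor function).

```python
import numpy as np

def vif(X):
    """VIF for each column of X, by regressing it on the remaining columns."""
    n, k = X.shape
    vifs = []
    for i in range(k):
        y_i = X[:, i]
        # intercept plus all predictors except the ith
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta = np.linalg.lstsq(others, y_i, rcond=None)[0]
        resid = y_i - others @ beta
        r2_i = 1 - resid @ resid / np.sum((y_i - y_i.mean()) ** 2)  # R_i^2
        vifs.append(1.0 / (1.0 - r2_i))                             # VIF_i = 1 / (1 - R_i^2)
    return np.array(vifs)
```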
Indicator Variables
• Often, a data set will contain categorical variables which are
potential predictor variables.
• To include these categorical variables in the model we define
dummy variables.
• A dummy variable takes only two values, 0 and 1.
• For a categorical variable with j categories we need j-1 indicator variables (see the sketch below).
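For example, a short pandas sketch (with a made-up categorical column) showing that a variable with 3 categories is coded by 2 indicator variables:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})
# drop_first=True keeps j - 1 = 2 indicator columns for the 3 categories
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
print(dummies)
```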
Example
• Meadowfoam is a small plant found in the US Pacific Northwest.
Its seed oil is unique among vegetable oils for its long carbon
strings, and it is nongreasy and highly stable. A study was conducted
to find out how to elevate meadowfoam production to a profitable
crop. In a growth chamber, plants were grown under 6 light
intensities (in micromol/m^2/sec) and two timings of the onset of
the light treatment, either late (coded 0) or early (coded 1). The
response variable is the average number of flowers per plant for 10
seedlings grown under each of the 12 treatment conditions.
• This is an example of an experiment in which we can make causal
conclusions.
• There are two explanatory variables, light intensity and timing.
• There are 24 data points, 2 at each treatment combination.
Questions of Interest
• What is the effect of timing on the seedling growth?
• What are the effects of the different light intensities?
• Does the effect of intensity depend on timing?
Indicator Variables in Meadowfoam Example
• To include the variable time in the model we define a dummy
variable that takes the value 1 if early timing and the value 0 if late
timing.
• The variable intensity has 6 levels (150, 300, 450, 600, 750, 900).
We will treat these levels as 6 categories.
• It is useful to do so if we expect a complex relationship between
response variable and intensity and if the goal is to determine which
intensity level is “best”.
• The cost of using dummy variables is degrees of freedom, since we need multiple dummy variables (one fewer than the number of categories) for each categorical variable.
• We define the dummy variables as follows….
Partial F-test
• The partial F-test is designed to test whether a subset of the β’s are all 0 simultaneously.
• The approach has two steps. First we fit a model with all predictor
variables. We call this model the “full model”.
• Then we fit a model without the predictor variables whose
coefficients we are interested in testing. We call this model the
“reduced model”.
• We then compare the SSR and SSE in these two models….
Test Statistic for Partial F-test
• To test whether some of the coefficients of the explanatory variables are all 0 we use the following test statistic:
$F_{stat} = \frac{\text{Extra SS} / \text{Extra df}}{MSE_{full}}$
where Extra SS = $SSE_{red} - SSE_{full}$ and Extra df = the number of parameters being tested.
• To get the Extra SS in SAS we can simply fit two regressions (reduced and full), or we can look at the Type I SS, which are also called Sequential Sums of Squares (a sketch of the two-regression approach follows below).
• The Sequential SS gives the additional contribution to SSR that each variable makes over and above the variables previously listed.
• The Sequential SS depends on the order in which the variables are stated in the model statement; the variables whose coefficients we want to test must be listed last.
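As a sketch of the two-regression approach in Python (statsmodels), using simulated data and hypothetical variable names; compare_f_test reproduces the same partial F statistic.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: y depends on x1 only; x2 and x3 are unrelated noise
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 + 1.5 * df["x1"] + rng.normal(size=n)

full = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()
red = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()   # reduced: test beta_2 = beta_3 = 0

extra_ss = red.ssr - full.ssr          # Extra SS = SSE_red - SSE_full
extra_df = 2                           # number of coefficients being tested
f_stat = (extra_ss / extra_df) / full.mse_resid
print(f_stat)
print(full.compare_f_test(red))        # same F statistic, its p-value, and df
```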
Partial Correlation
• Recall that for simple regression, the correlation between X and Y is $r = \pm\sqrt{R^2}$, where $R^2 = 1 - SSE/SSTO$.
• When comparing the reduced and full models in the case where the full model has only one additional predictor variable, the coefficient of partial correlation (computed in the sketch below) is
$r_{Y X_1 | X_2, \ldots, X_k} = \pm\sqrt{\frac{\text{Extra SS}}{SSE_{red}}}$
• It is negative if the coefficient of the additional predictor variable is
negative and positive otherwise.
• It is a measure of the contribution of the additional predictor
variable, given that the others are in the model.
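Continuing in the same hypothetical setting as the partial F-test sketch, here is a sketch of the partial correlation of Y with one added predictor (say x3), given the others; the sign is taken from the estimated coefficient of x3 in the full model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Same simulated data as in the partial F-test sketch
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 + 1.5 * df["x1"] + rng.normal(size=n)

full = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()
red = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()   # drop x3 only

extra_ss = red.ssr - full.ssr
partial_r = np.sign(full.params["x3"]) * np.sqrt(extra_ss / red.ssr)
print(partial_r)   # partial correlation of y and x3, given x1 and x2
```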