Example - Nargund

Transcript Example - Nargund

Regression Analysis: Part 2
Inference
Dummies / Interactions
Multicollinearity / Heteroscedasticity
Residual Analysis / Outliers
Shoe Size Example



Consider the prediction of Shoe Size discussed
before
As before, the response variable is Shoe Size,
and the explanatory variable is age.
What inferences can we make about the
regression coefficients?
Shoe Size and Age

            Coefficients    Standard Error   t Stat         P-value    Lower 95%      Upper 95%
Intercept   -1.175925926    2.063543871      -0.56985749    0.580226   -5.717757659   3.365906
Age          0.612037037    0.139124099      4.399216525    0.001065   0.305826804    0.918247
Regression Coefficients




Regression coefficients estimate the true, but unobservable,
population coefficients.
The standard error of b_i indicates the accuracy of these point
estimates.
For example, the average effect on shoe size of a one-unit
increase in age is .612.
We are 95% confident that the coefficient is between .306
and .918.
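The confidence interval in the output can be reproduced from the coefficient and its standard error. A minimal sketch, assuming the degrees of freedom are 11 (inferred from the interval's half-width, which implies a critical t of about 2.201; the slide does not state the sample size):

```python
from scipy import stats

# Coefficient estimate and standard error for Age, taken from the
# regression output above.
b_age = 0.612037037
se_age = 0.139124099

# Degrees of freedom: an assumption inferred from the interval width
# in the table (t* ~ 2.201 corresponds to df = 11).
df = 11
t_crit = stats.t.ppf(0.975, df)

lower = b_age - t_crit * se_age
upper = b_age + t_crit * se_age
print(f"95% CI for Age: ({lower:.4f}, {upper:.4f})")  # (0.3058, 0.9182)
```

The interval matches the Lower 95% and Upper 95% columns of the output.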
Dummy Variables




You wish to check if the class (freshman, sophomore, junior,
senior) and SAT score can be used to predict the number of
hours per week a college student watches TV.
Data is collected through sampling, and a regression is to be
performed.
How would you code the ‘Class’ variable?
How would you interpret the resulting coefficients?
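One standard answer is dummy (indicator) coding with one level held out as the baseline. A minimal sketch, assuming 'freshman' as the baseline category (the function name and labels are illustrative):

```python
# Dummy coding for the four-level Class variable, with 'freshman' as
# the baseline (reference) category: k levels need only k - 1 dummies.
def code_class(student_class):
    """Return (sophomore, junior, senior) 0/1 indicator variables."""
    levels = ("sophomore", "junior", "senior")
    return tuple(1 if student_class == lvl else 0 for lvl in levels)

# A freshman is the all-zeros row, so each dummy's coefficient is that
# class's difference in mean TV hours from freshmen, holding SAT fixed.
print(code_class("freshman"))   # (0, 0, 0)
print(code_class("junior"))     # (0, 1, 0)
```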
Interactions

Example: How do gender and lack of sleep affect performance
on a standard test?
If Male = 1 and Female = 0, what is the difference between a
regression model without an interaction term and one with it?
Y-hat = b0 + b1X1 + b2X2
Y-hat = b0 + b1X1 + b2X2 + b3X1X2

How is the coefficient b3 interpreted?
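In the interaction model, b3 is the difference in the slope of X2 between the two groups: the effect of one more unit of sleep lost changes by b3 when Male = 1. A small numeric sketch with made-up coefficients (all values here are illustrative assumptions, not estimates from the example):

```python
# Illustrative (made-up) coefficients for the interaction model
# Y-hat = b0 + b1*X1 + b2*X2 + b3*X1*X2, where X1 = 1 for male,
# 0 for female, and X2 = hours of sleep lost.
b0, b1, b2, b3 = 80.0, 2.0, -3.0, -1.5

def predict(male, sleep_lost):
    return b0 + b1 * male + b2 * sleep_lost + b3 * male * sleep_lost

# Slope in sleep lost, by gender: change in prediction per extra hour.
slope_female = predict(0, 1) - predict(0, 0)   # b2      = -3.0
slope_male = predict(1, 1) - predict(1, 0)     # b2 + b3 = -4.5
print(slope_male - slope_female)               # b3      = -1.5
```

Without the interaction term, both groups are forced to share the single slope b2; with it, the male and female slopes differ by exactly b3.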



Multicollinearity



We want to explain a person’s height by means of foot length.
The response variable is Height, and the explanatory
variables are Right and Left, the length of the right foot and
the left foot, respectively.
What can occur when we regress Height on both Right and
Left?
Multicollinearity



The relationship between the explanatory variable X and the
response variable Y is not always accurately reflected in the
coefficient of X; it depends on which other X’s are included or
not included in the equation.
This is especially true when there is a linear relationship
between two or more explanatory variables, in which case we
have multicollinearity.
By definition, multicollinearity is the presence of a fairly strong
linear relationship between two or more explanatory variables,
and it can make estimation difficult.
Solution to Multicollinearity



Admittedly, there is no need to include both Right and Left in an
equation for Height - either one would do - but we include both
to make a point.
It is likely that there is a large correlation between height and
foot size, so we would expect this regression equation to do a
good job.
The R2 value will probably be large. But what about the
coefficients of Right and Left? Here is a problem.
Solution -- continued

The coefficient of Right indicates the right foot's effect on
Height in addition to the effect of the left foot. This additional
effect is probably minimal. That is, after the effect of Left on
Height has already been taken into account, the extra
information provided by Right is probably minimal. But it goes
the other way also: the extra effect of Left, in addition to that
provided by Right, is probably minimal.
Height Data - Correlations


To show what can happen numerically, we generated a
hypothetical data set of heights and left and right foot lengths
in this file.
We did this so that, except for random error, height is
approximately 32 plus 3.2 times foot length (all expressed in
inches).
As shown in the table to the right, the
correlations between Height and either
Right or Left in our data set are quite
large, and the correlation between
Right and Left is very close to 1.
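The slide's data file is not reproduced here, but data with the same structure can be simulated. A sketch under stated assumptions (seed, sample size, and noise scales are all choices of this illustration, not taken from the file):

```python
import numpy as np

# Generate heights that are roughly 32 + 3.2 * foot length (inches),
# with Right and Left both noisy measurements of the same true length.
rng = np.random.default_rng(0)
n = 100
foot = rng.normal(11.0, 1.0, n)           # true foot length
right = foot + rng.normal(0, 0.1, n)      # measured right foot
left = foot + rng.normal(0, 0.1, n)       # measured left foot
height = 32 + 3.2 * foot + rng.normal(0, 2.0, n)

# Rows/columns: Height, Right, Left.
corr = np.corrcoef([height, right, left])
print(corr.round(3))
```

As on the slide, Height correlates strongly with either foot, and the Right-Left correlation comes out very close to 1.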
Solution -- continued

The regression output when both Right and Left are entered in
the equation for Height appears in this table.
Solution -- continued



This output tells a somewhat confusing story.
The multiple R and the corresponding R2 are about what we
would expect, given the correlations between Height and either
Right or Left.
In particular, the multiple R is close to the correlation between
Height and either Right or Left. Also, the se value is quite good.
It implies that predictions of height from this regression
equation will typically be off by only about 2 inches.
Solution -- continued



However, the coefficients of Right and Left are not at all what we
might expect, given that we generated heights as approximately
32 plus 3.2 times foot length.
In fact, the coefficient of Left has the wrong sign - it is
negative!
Besides this wrong sign, the tip-off that there is a problem is
that the t-value of Left is quite small and the corresponding
p-value is quite large.
Solution -- continued



Judging by this, we might conclude that Height and Left are
either not related or are related negatively. But we know from
the table of correlations that both of these are false.
In contrast, the coefficient of Right has the “correct” sign, and
its t-value and associated p-value do imply statistical
significance, at least at the 5% level.
However, this happened mostly by chance; slight changes in the
data could change the results completely.
Solution -- continued



Although both Right and Left are clearly related to Height, it is
impossible for the least squares method to distinguish their
separate effects.
Note that the regression equation does estimate the combined
effect fairly well: the sum of the coefficients is 3.178, which is
close to the coefficient of 3.2 we used to generate the data.
Therefore, the estimated equation will work well for predicting
heights. It just does not have reliable estimates of the individual
coefficients of Right and Left.
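This behavior is easy to see numerically. A sketch using simulated data of the same form as above (seed, sample size, and noise levels are assumptions of the illustration):

```python
import numpy as np

# Simulate Height ~ 32 + 3.2 * foot length, with Right and Left as
# near-duplicate measurements of the same foot length.
rng = np.random.default_rng(1)
n = 100
foot = rng.normal(11.0, 1.0, n)
right = foot + rng.normal(0, 0.05, n)
left = foot + rng.normal(0, 0.05, n)
height = 32 + 3.2 * foot + rng.normal(0, 2.0, n)

def fit(*cols):
    """Least-squares fit of height on an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    beta, *_ = np.linalg.lstsq(X, height, rcond=None)
    return beta

b_both = fit(right, left)     # individual coefficients are unreliable...
b_right = fit(right)
b_left = fit(left)
print(b_both[1] + b_both[2])  # ...but their sum stays near 3.2
print(b_right[1], b_left[1])  # each simple regression is near 3.2
```

The two coefficients in the joint fit can land far from 3.2 individually (one may even be negative), yet their sum, and each simple-regression slope, stays close to the value used to generate the data.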
Solution -- continued



To see what happens when either Right or Left is excluded
from the regression equation, we show the results of simple
regression.
When Right is the only variable in the equation, it becomes
Predicted Height = 31.546 + 3.195Right
The R2 and se values are 81.6% and 2.005, and the t-value and
p-value for the coefficient of Right are now 21.34 and 0.000 -
very significant.
Solution -- continued



Similarly, when Left is the only variable in the equation, it
becomes
Predicted Height = 31.526 + 3.197Left
The R2 and se values are 81.1% and 2.033, and the t-value and
p-value for the coefficient of Left are 20.99 and 0.0000 - again
very significant.
Clearly, both of these equations tell almost identical stories, and
they are much easier to interpret than the equation with both
Right and Left included.
Heteroscedasticity

Unequal error variances - the residuals fan out as the fitted
values grow (a fan shape in the residual plot).
Common remedies:
Use log transforms
Weighted least squares
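A minimal weighted-least-squares sketch: errors whose spread grows with x produce the fan shape, and weighting each observation by the inverse of its error variance restores efficient estimates. The data, the proportional-to-x error model, and the 1/x^2 weights are all assumptions of this illustration:

```python
import numpy as np

# Simulate heteroscedastic data: error standard deviation grows with x.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 5 + 2 * x + rng.normal(0, 0.5 * x)     # true line: 5 + 2x

X = np.column_stack([np.ones_like(x), x])
W = np.diag(1 / x**2)                      # weight = 1 / error variance

# WLS normal equations: beta = (X' W X)^-1 X' W y
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta)  # close to the true intercept 5 and slope 2
```

Equivalently, dividing through by x (a form of transformation, in the same spirit as the log transform above) makes the error variance constant, after which ordinary least squares applies.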