Regression Analysis
W&W Chapter 12 & 15 (1-3)
Correlation and Regression I
Bivariate regression shows us how
variables are linearly related. Correlational
analysis tells us the degree to which
variables are related.
Recall how to calculate correlation, or r:
r = (X-Mx)(Y-My) = covariance (X,Y)
 (X-Mx)2  (Y-My)2
(sx)(sy)
Correlation and Regression II
Compare this calculation to the one for a bivariate
slope coefficient, b:
b = (X – Mx)(Y – My) = covariance (X,Y)
(X – Mx)2
variance (X)
With correlation, we do not make a distinction between a dependent variable, Y, and an independent variable, X. Thus correlation takes into account the individual variances of both X and Y in its denominator.
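To make the comparison concrete, here is a minimal NumPy sketch of both formulas side by side (the data values are made up purely for illustration):

```python
import numpy as np

# Made-up illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.0, 4.5, 5.0, 7.5])

mx, my = x.mean(), y.mean()

# Both statistics share the same numerator: the sum of cross-products
cross = np.sum((x - mx) * (y - my))

# Correlation divides by the spread of X *and* Y ...
r = cross / np.sqrt(np.sum((x - mx) ** 2) * np.sum((y - my) ** 2))

# ... while the slope divides by the spread of X alone
b = cross / np.sum((x - mx) ** 2)

print(f"r = {r:.3f}, b = {b:.3f}")
```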
Correlation and Regression III
What is correlation intuitively?
We subtract from each X its mean and from each Y its mean, which tells us how far a pair (X, Y) lies from the mean point (Mx, My).
If r is positive, then X and Y tend to be above or below their means together.
If r is negative, then when one variable is above its mean, the other tends to be below its mean.
Correlation and Regression IV
We have already seen that b and r are similar.
In fact, we can write b in terms of r as:

$$b = r\,\frac{s_y}{s_x}$$
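This identity drops out in one line from the two formulas above, since var(X) = sx²:

$$b = \frac{\operatorname{cov}(X, Y)}{s_x^2} = \frac{\operatorname{cov}(X, Y)}{s_x s_y} \cdot \frac{s_y}{s_x} = r\,\frac{s_y}{s_x}$$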
Example
Suppose we collect the following data on a sample
(N=8) of students’ math (X) and verbal scores
(Y) on a college entrance exam.
X     Y          X     Y
80    65         72    48
50    60         60    44
36    35         56    48
58    39         68    61
Example
Let’s plot this data in a scatterplot. We could
calculate the regression line using the
formulas for slope and intercept we
learned.
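As a sketch of that calculation (assuming the (X, Y) pairs read across the two column groups of the table above):

```python
import numpy as np

# The eight (math, verbal) pairs from the table
x = np.array([80, 50, 36, 58, 72, 60, 56, 68], dtype=float)  # math scores
y = np.array([65, 60, 35, 39, 48, 44, 48, 61], dtype=float)  # verbal scores

mx, my = x.mean(), y.mean()
b = np.sum((x - mx) * (y - my)) / np.sum((x - mx) ** 2)  # slope
a = my - b * mx                                          # intercept

print(f"Yp = {a:.2f} + {b:.2f} X")
```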
Suppose we want to predict a student’s verbal
score if their math score is unknown. What
would be our best guess?
Total Deviation
Our best guess would be the most typical
value, or the mean (My). The prediction
error for a given student would be the
distance between their verbal score and the
mean, or (Y-My).
But we can do better with information on X
by using the regression line (Yp) for our
prediction.
Total Deviation
$$\underbrace{(Y - M_y)}_{\text{Total deviation}} = \underbrace{(Y_p - M_y)}_{\text{Explained deviation}} + \underbrace{(Y - Y_p)}_{\text{Unexplained deviation}}$$
The same is true for their sums of squares:

$$\underbrace{\sum (Y - M_y)^2}_{\text{SS(total)}} = \underbrace{\sum (Y_p - M_y)^2}_{\text{SSR}} + \underbrace{\sum (Y - Y_p)^2}_{\text{SSE}}$$
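Continuing the exam-score sketch above, the decomposition can be checked numerically:

```python
yp = a + b * x                      # fitted verbal scores from the regression line

ss_total = np.sum((y - my) ** 2)    # total sum of squares
ssr = np.sum((yp - my) ** 2)        # explained (regression) sum of squares
sse = np.sum((y - yp) ** 2)         # unexplained (error) sum of squares

assert np.isclose(ss_total, ssr + sse)
```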
Analysis of Variance
ANOVA table for linear regression:

Source of variation   df          Sum of squares    Mean squares
Regression            k           Σ(Yp − My)²       MSR = Σ(Yp − My)² / k
Error                 n − k − 1   Σ(Y − Yp)²        MSE = Σ(Y − Yp)² / (n − k − 1)
Total                 n − 1       Σ(Y − My)²

k = the number of independent variables
Analysis of Variance
We can calculate the F-statistic like we did
for ANOVA previously.
$$F = \frac{MSR}{MSE} = \frac{\text{variance explained by the regression}}{\text{unexplained variance}}$$

with k numerator and n − k − 1 denominator degrees of freedom.
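Continuing the sketch, the F-statistic for the exam-score example, with SciPy assumed available for the tail probability:

```python
from scipy import stats

k = 1                                # one independent variable (math score)
n = len(x)

msr = ssr / k                        # mean square for regression
mse = sse / (n - k - 1)              # mean square error
f_stat = msr / mse

p_value = stats.f.sf(f_stat, k, n - k - 1)   # upper-tail p-value
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```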
F-test
For a bivariate regression model, the null
hypothesis for the F-test is equivalent to
the null hypothesis for a t-test of the slope
coefficient.
$$Y = \alpha + \beta X + \varepsilon$$

$$H_0: \beta = 0 \qquad H_A: \beta \neq 0$$
F-test
For a multivariate regression model, the F-test determines whether any of the slope coefficients differ from zero.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

$$H_0: \beta_1 = \beta_2 = \beta_3 = 0 \qquad H_A: \text{at least one } \beta_i \neq 0$$
F-test
We will see later on that we can also use F-tests to compare various models against each other.
For example, in the previous model we might want to compare the model with all three variables to one with just X1 and X2 (see the sketch below).
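A sketch of such a comparison using the usual nested-model (partial) F-test; the helper names here are mine, not from the text:

```python
import numpy as np
from scipy import stats

def sse_of_fit(X, y):
    """Residual sum of squares from a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def nested_f_test(X_full, X_reduced, y):
    """Compare a full model against a nested model that drops q regressors."""
    n, k_full = len(y), X_full.shape[1]
    q = k_full - X_reduced.shape[1]              # restrictions being tested
    sse_full, sse_red = sse_of_fit(X_full, y), sse_of_fit(X_reduced, y)
    f_stat = ((sse_red - sse_full) / q) / (sse_full / (n - k_full - 1))
    return f_stat, stats.f.sf(f_stat, q, n - k_full - 1)
```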
Calculating R²
Another way to evaluate the overall fit of our regression model (in addition to the F-test) is to calculate R², which is the proportion of total variance (in Y) that our regression model can explain:

$$R^2 = \frac{\text{explained (regression) sum of squares}}{\text{total sum of squares}}$$
Calculating R²
R2 = (Yp – My)2 = SSR
(Y – My)2 SS
R2 measures the proportion of the total
variation in Y explained by the regression
model.
0  R2  1
Better models have higher R2.
Interpreting R²
R² = 1 means Yp = Yi for all i (a perfect fit).
R² = 0 means no linear relationship, so Yp = My for every observation.
In the bivariate model, R² is the square of the correlation coefficient, r.
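Continuing the exam-score sketch, this bivariate identity is easy to verify:

```python
r_squared = ssr / ss_total
r = np.sum((x - mx) * (y - my)) / np.sqrt(
    np.sum((x - mx) ** 2) * np.sum((y - my) ** 2)
)
assert np.isclose(r_squared, r ** 2)
```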
Some Assumptions
In order to estimate a regression model, we must make several simplifying assumptions. Let's go back to the problem of determining whether fertilizer affects crop yield.
Suppose (in Figure 12-1a) that we set fertilizer at level X1 for many, many plots. The resulting yields will not all be the same; the weather might be better for some plots, the soil might be better for others, etc. Thus we would get a distribution of Y1 given X1, or p(Y1 | X1). There will similarly be a distribution of Y2 at X2, and so forth. We can visualize a whole set of Y populations such as those shown in Figure 12-1a.
Some Assumptions
To analyze such populations, we make three
assumptions about the regularity of these Y
distributions.
1) Homogeneous variance: All the Y distributions have the same spread. Formally, this means that the probability distribution p(Y1 | X1) has the same variance σ² as p(Y2 | X2), and so on. This assumption is often called homoskedasticity.
Some Assumptions
2) Linearity: For each Y distribution, the mean, E(Y1 | X1), or just μ1, lies on a straight line known as the true (population) regression line:

$$E(Y_i) = \mu_i = \alpha + \beta X_i$$

The population parameters α and β are estimated with the sample data as a and b.
Some Assumptions
3) Independence: The random variables Y1, Y2, … are statistically independent. For example, if Y1 happens to be large, there is no reason to expect Y2 to be large. Their mean and variance are given by:

$$E(Y_i) = \alpha + \beta X_i, \qquad \operatorname{var}(Y_i) = \sigma^2$$

Equivalently,

$$Y_i = \alpha + \beta X_i + e_i$$

where e1, e2, …, en are independent errors with mean 0 and variance σ².
For example, in Figure 12-1b, the first observed value Y1 is shown along with the corresponding error term e1.
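A minimal simulation of this setup (the parameter values are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_p, beta_p, sigma_p = 2.0, 0.5, 1.0          # invented population parameters

x_levels = np.linspace(0, 10, 50)                 # fixed X levels
errors = rng.normal(0.0, sigma_p, x_levels.size)  # independent, mean-0 errors
y_sim = alpha_p + beta_p * x_levels + errors      # Y_i = alpha + beta*X_i + e_i
```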
Some Assumptions
It is often additionally assumed that the errors are normally distributed:

$$e_i \sim N(0, \sigma^2)$$

A violation of constant variance is called heteroskedasticity (the errors do not have the same variance across the range of X values).
We also assume no autocorrelation: $\operatorname{Cov}[e_i, e_j] = 0$ for $i \neq j$. A typical violation occurs with time-series data, where the errors are related over time.
The Nature of the Error Term
The error may be regarded as the sum of two components:
• Measurement error: Sometimes we measure things incorrectly; this can create a larger error.
• Inherent variability: Sometimes we draw a sample that is not typical, or particular values may fall far from their expectation (like getting 10 heads in 10 flips: possible, but unlikely).
The Nature of the Error Term
To summarize, we can see in Figure 12-2 that we have the true population regression line (thick black line), and since we do not have the population data, we must estimate α and β with our sample data as a and b. Because of error, sometimes the values of Y observed in the sample will be a little too low (Y1) and sometimes they will be a little too high (Y2 and Y3). The question then is how close we came to the true population line. The best we can hope for is something reasonably close, and we go back to the notion of a sampling distribution to figure out how close we are.
Sampling distribution of b
How is the slope estimate b distributed around its target β? Statisticians have derived the theoretical sampling distribution: it is normal, with expected value E(b) = β and standard error

$$SE(b) = \frac{\sigma}{\sqrt{\sum (X - M_x)^2}}$$

Here σ represents the standard deviation of the Y observations about the population line.
Sampling distribution of b
An easier way to express the standard error of b is as follows:

$$SE(b) = \frac{\sigma}{\sqrt{n}\, s_x}$$

There are three ways that the standard error can be reduced to produce a more accurate estimate of b (see the simulation sketch below):
• By reducing σ, the inherent variability of the Y observations
• By increasing n, the sample size
• By increasing sx, the spread of the X values
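A Monte Carlo sketch of this sampling distribution, comparing the empirical spread of b with the theoretical standard error (all parameter values invented):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_p, beta_p, sigma_p = 2.0, 0.5, 1.0
x_fixed = np.linspace(0, 10, 30)               # X levels fixed in repeated samples
sxx = np.sum((x_fixed - x_fixed.mean()) ** 2)

b_draws = []
for _ in range(10_000):
    y_draw = alpha_p + beta_p * x_fixed + rng.normal(0, sigma_p, x_fixed.size)
    b_hat = np.sum((x_fixed - x_fixed.mean()) * (y_draw - y_draw.mean())) / sxx
    b_draws.append(b_hat)

print(np.mean(b_draws))                        # close to beta_p
print(np.std(b_draws), sigma_p / np.sqrt(sxx)) # empirical vs theoretical SE
```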
Sampling distribution of b
The third point is particularly interesting because
increasing the variance of our independent
variable increases our leverage to explain the
variance of Y. We can see this clearly in Figure
12-4 where a larger range of values allows us to
fit a more accurate regression line.
Because we do not generally know σ, we estimate the standard error as:

$$SE(b) = \frac{s}{\sqrt{\sum (X - M_x)^2}}, \qquad \text{where } s^2 = \frac{1}{n - k - 1} \sum (Y - Y_p)^2$$
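Continuing the exam-score sketch one last time; the final line applies the slope t-test mentioned earlier:

```python
s_squared = sse / (n - k - 1)        # estimate of the error variance, s^2
se_b = np.sqrt(s_squared / np.sum((x - mx) ** 2))

t_stat = b / se_b                    # t-test of H0: beta = 0
print(f"SE(b) = {se_b:.3f}, t = {t_stat:.2f}")
```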