Transcript Document

Regression
• For the purposes of this class:
– Does Y depend on X?
– Does a change in X cause a change in Y?
– Can Y be predicted from X?
• Y = mX + b
[Figure: scatter plot of Dependent Value (100–180) vs Independent Value (30–80), showing actual values, predicted values, and the overall mean]
When analyzing a regression-type data set, the first step
is to plot the data:
X    Y
35   114
45   120
55   150
65   140
75   166
(averages: X_avg = 55, Y_avg = 138)

[Figure: scatter plot of the data, Dependent Value (Y) vs Independent Value (X)]
The next step is to determine the line that ‘best fits’ these points. It appears this line would be sloped upward and linear (straight).
The line of best fit is the sample regression of Y on X, and its position is fixed by two results:
1) The regression line passes through the point (X_avg, Y_avg).
2) Its slope is at the rate of “m” units of Y per unit of X, where m = regression coefficient (slope; Y = mX + b).
For these data: Y = 1.24(X) + 69.8

[Figure: regression line plotted through the data, with the slope (rise/run), the Y-intercept, and the pivot point (55, 138) labeled]
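The two results above pin the line down completely, so the fit can be sketched in a few lines of Python (data taken from the example table; the least-squares slope formula is standard, not stated on the slide):

```python
# Least-squares fit for the example data: slope and intercept.
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]

x_avg = sum(xs) / len(xs)   # 55
y_avg = sum(ys) / len(ys)   # 138

# Slope m = sum((x - x_avg)(y - y_avg)) / sum((x - x_avg)^2)
num = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
den = sum((x - x_avg) ** 2 for x in xs)
m = num / den               # slope
b = y_avg - m * x_avg       # intercept: line passes through (x_avg, y_avg)

print(f"Y = {round(m, 2)}(X) + {round(b, 2)}")  # Y = 1.24(X) + 69.8
```

Note how result (1) is used directly: once the slope is known, the intercept follows from forcing the line through (X_avg, Y_avg).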
Testing the Regression Line for Significance
• An F-test is used based on Model, Error, and Total
SOS.
– Very similar to ANOVA
• Basically, we are testing if the regression line has
a significantly different slope than a line formed
by using just Y_avg.
– If there is no difference, then that means that Y
does not change as X changes (stays around the
average value)
• To begin, we must first find the regression line
that has the smallest Error SOS.
Error SOS
The regression line should pass through the overall average with the slope that has the smallest Error SOS (Error SOS = the sum of squared vertical distances between each point and the predicted line; it gives an index of the variability of the data points around the predicted line).
[Figure: regression line pivoting through the overall average point (55, 138); the overall average is the pivot point]
For each X, we can predict Y: Y = 1.24(X) + 69.8
Error SOS is calculated as the sum of (Y_Actual – Y_Pred)². This gives us an index of how scattered the actual observations are around the predicted line. The more scattered the points, the larger the Error SOS will be. This is like analysis of variance, except we are using the predicted line instead of the mean value.
X    Y_Actual   Y_Pred   SOS_Error
35   114        113.2    0.64
45   120        125.6    31.36
55   150        138      144
65   140        150.4    108.16
75   166        162.8    10.24
                Sum:     294.4
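The Error SOS column above can be reproduced directly (data and fitted equation taken from the slides):

```python
# Error SOS: sum of squared differences between actual and predicted Y.
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]

def predict(x):
    # Fitted line from the slides: Y = 1.24(X) + 69.8
    return 1.24 * x + 69.8

sos_error = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
print(round(sos_error, 2))  # 294.4
```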
Total SOS
• Calculated as the sum of (Y_Actual – Y_Avg)²
• Gives us an index of how scattered our data set is around the overall Y average.

[Figure: data points scattered around a horizontal line at the overall Y average; regression line not shown]
Total SOS gives us an index of how scattered the data points are around the overall average. This is calculated the same way as for a single treatment in ANOVA.
X    Y_Actual   Y_Avg   SOS_Total
35   114        138     576
45   120        138     324
55   150        138     144
65   140        138     4
75   166        138     784
                Sum:    1832
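Because Total SOS ignores the regression line entirely, the computation needs only the Y values (taken from the example table):

```python
# Total SOS: sum of squared differences between actual Y and the overall mean.
ys = [114, 120, 150, 140, 166]
y_avg = sum(ys) / len(ys)   # 138
sos_total = sum((y - y_avg) ** 2 for y in ys)
print(sos_total)  # 1832.0
```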
What happens to Total SOS when all of the points are
close to the overall average? What happens when the
points form a non-horizontal linear trend?
Model SOS
• Calculated as the sum of (Y_Pred – Y_Avg)²
• Gives us an index of how far away the predicted values are from the overall average value

[Figure: vertical distances between each predicted Y on the regression line and the overall mean]
Model SOS
• Gives us an index of how far away the predicted
values are from the overall average value
X    Y_Pred   Y_Avg   SOS_Model
35   113.2    138     615.04
45   125.6    138     153.76
55   138      138     0
65   150.4    138     153.76
75   162.8    138     615.04
              Sum:    1537.6
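The Model SOS column can be reproduced the same way (data and fitted equation taken from the slides):

```python
# Model SOS: sum of squared differences between predicted Y and the overall mean.
xs = [35, 45, 55, 65, 75]
y_avg = 138                              # overall Y average from the slides
preds = [1.24 * x + 69.8 for x in xs]    # Y = 1.24(X) + 69.8
sos_model = sum((p - y_avg) ** 2 for p in preds)
print(round(sos_model, 2))  # 1537.6
```

A useful check on all three quantities: Model SOS + Error SOS = Total SOS (1537.6 + 294.4 = 1832).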
• What happens to Model SOS when all of the
predicted values are close to the average value?
All Together Now!!
X    Y_Actual   Y_Pred   SOS_Error   Y_Avg   SOS_Total   SOS_Model
35   114        113.2    0.64        138     576         615.04
45   120        125.6    31.36       138     324         153.76
55   150        138      144         138     144         0
65   140        150.4    108.16      138     4           153.76
75   166        162.8    10.24       138     784         615.04
                Sum:     294.4               1832        1537.6

SOS_Error = Σ(Y_Actual – Y_Pred)²
SOS_Total = Σ(Y_Actual – Y_Avg)²
SOS_Model = Σ(Y_Pred – Y_Avg)²
Using SOS to Assess Regression Line
• Model SOS gives us an index of how ‘different’ the predicted values are from the average values.
– Bigger Model SOS = more different
– Tells us how different a sloped line is from a line made
up only of Y_avg.
– Remember, the regression line will pass through the
overall average point.
• Error SOS gives us an index of how different the
predicted values are from the actual values
– More variability = larger Error SOS = large distance
between predicted and actual values
Magic of the F-test
• The ratio of Mean Model SOS to Mean Error SOS (each SOS divided by its degrees of freedom) gives us an overall index (the F statistic) used to indicate the relative ‘difference’ between the regression line and a line with slope of zero (all values = Y_avg).
– A large Model SOS and small Error SOS = a large F statistic. Why does this
indicate a significant difference?
– A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference?
• Based on sample size and alpha level, each F statistic has an associated P-value.
– P < 0.05 (large F statistic): there is a significant difference between the regression line and the Y_avg line.
– P ≥ 0.05 (small F statistic): there is NO significant difference between the regression line and the Y_avg line.
F = Mean Model SOS / Mean Error SOS
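For simple linear regression, Model SOS has 1 degree of freedom and Error SOS has n − 2 (these df values are the standard ones for a one-predictor model; the slides do not state them). With the totals from the tables, the F statistic works out as:

```python
# F statistic from the example SOS totals.
sos_model = 1537.6   # from the Model SOS table
sos_error = 294.4    # from the Error SOS table
n = 5                # number of (X, Y) pairs

df_model = 1         # one predictor (the slope)
df_error = n - 2     # n minus the two estimated parameters (slope, intercept)

mean_model = sos_model / df_model   # mean square for the model
mean_error = sos_error / df_error   # mean square for error
f_stat = mean_model / mean_error
print(round(f_stat, 2))  # 15.67
```

A large Model mean square over a small Error mean square yields a large F, matching the intuition in the bullets above.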
Basically, this is an index that tells us how different the regression line is from Y_avg, and the scatter of the data around the predicted values.

[Figure: two plots comparing the regression line to the horizontal Y_avg line]
Y = 1.24(X) + 69.8

[Figure: regression line with the slope (rise/run) and the Y-intercept labeled]

Use the regression line to predict a specific number or a specific change.
Correlation (r):
Another measure of the mutual linear
relationship between two variables.
• ‘r’ is a pure number without units or dimensions
• ‘r’ is always between –1 and 1
• Positive values indicate that y increases when x
does and negative values indicate that y
decreases when x increases.
– What does r = 0 mean?
• ‘r’ is a measure of intensity of association
observed between x and y.
– ‘r’ does not predict – only describes associations
between variables
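Pearson’s r can be computed for the example data with the standard formula (the data are assumed from the earlier slides; the formula itself is not shown on this slide):

```python
import math

# Pearson's correlation coefficient r for the example data.
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]

x_avg = sum(xs) / len(xs)
y_avg = sum(ys) / len(ys)

# r = sum of cross-deviations / sqrt(product of the two sums of squares)
num = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - x_avg) ** 2 for x in xs) *
                sum((y - y_avg) ** 2 for y in ys))
r = num / den
print(round(r, 4))  # 0.9161
```

The result is positive and close to 1, consistent with the upward linear trend in the data.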
[Figure: three scatter plots of Dependent Variable vs Independent Variable showing r > 0 (upward trend), r < 0 (downward trend), and r = 0 (no trend)]

r is also called Pearson’s correlation coefficient.
R-square
• If we square r, we remove the negative sign (if any) and get an index of how close the data points are to the regression line.
• Allows us to decide how much confidence we
have in making a prediction based on our
model.
• Is calculated as Model SOS / Total SOS
r2 = Model SOS / Total SOS
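Using the SOS totals from the earlier tables, r² follows in one line:

```python
# r-square as Model SOS / Total SOS, using the totals from the tables.
sos_model = 1537.6
sos_total = 1832
r_square = sos_model / sos_total
print(round(r_square, 4))  # 0.8393
```

This matches the R² = 0.8393 shown for the example data, and it equals the square of Pearson’s r for the same points.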
[Figure: two plots illustrating r² = Model SOS / Total SOS. A tight fit gives R² = 0.8393 (Model SOS nearly as large as Total SOS); a scattered fit gives R² = 0.0144 (small numerator, big denominator)]
R-square and Prediction Confidence
[Figure: four scatter plots with R² = 0.0144, 0.5537, 0.7605, and 0.9683, showing progressively tighter clustering around the regression line as R² increases]
Finally…
• If we have a significant relationship (based on
the p-value), we can use the r-square value to
judge how sure we are in making a prediction.