Chapter 1: Controlled Experiments

Chapters 10 and 11: Using Regression to Predict
Math 1680
Overview
Predicting Values
The Regression Line
The RMS Error
The Regression Effect
A Second Regression Line
Summary
Predicting Values
We have previously seen that a pair of data sets, X
and Y, can be characterized by their five-statistic
summary





µX, the average value in X
SDX, the standard deviation of X
µY, the average value in Y
SDY, the standard deviation of Y
r, the correlation coefficient
Often, we want to predict a y-value given a particular
x-value

We want to use only the five-statistic summary to make
the prediction
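As a rough illustration of what goes into the five-statistic summary, the sketch below computes it for a small paired data set. The function name and the data are made up for this example, and it assumes the population SD (dividing by n); other conventions divide by n - 1.

```python
import math

def five_statistic_summary(xs, ys):
    """Return (mean_x, sd_x, mean_y, sd_y, r) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: population SD (divide by n, not n - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # r is the average product of the two variables in standard units.
    r = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
            for x, y in zip(xs, ys)) / n
    return mean_x, sd_x, mean_y, sd_y, r

# Made-up illustrative data (heights in inches, weights in lbs).
heights = [66, 68, 70, 72, 74]
weights = [140, 165, 150, 180, 175]
summary = five_statistic_summary(heights, weights)
print(summary)
```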
Predicting Values
Suppose we have the following five-statistic
summary stats for the height (X) and weight
(Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
If you had to guess what the weight of any
man would be, what is your best bet?
Predicting Values
Suppose we have the following five-statistic
summary stats for the height (X) and weight
(Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
Suppose you know the man's height is 1 SD above
average

Would your best guess for his weight be 1 SD
above average?
The SD line is
the dashed line
running through
the scatter plot

If we guessed 1
SD above
average weight,
where would we
be on the plot?
What would a
better guess be?
The Regression Line
Suppose we have the following five-statistic
summary stats for the height (X) and weight
(Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
It turns out that the correlation coefficient
determines the best guess

For every SD we move in X, we should move r
SD’s in Y
The Regression Line
The regression line from X to Y


Runs through the point of averages
Has a slope of r times the slope of the SD
line
The regression line predicts the average
value for y within the narrowed-down
range specified by a given x
The Regression Line
The formula for the regression line from X to
Y is
$z_Y = r \, z_X$
Or, alternately,
$y = r \left( \frac{SD_Y}{SD_X} \right) (x - \mu_X) + \mu_Y$
When is the regression line the same as the
SD line?
When r = 1 or -1
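The slope-intercept form of the regression line can be sketched in a few lines. The function name is hypothetical, and the numbers are the height and weight summary from the earlier slides.

```python
def regression_slope_intercept(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the regression line from X to Y."""
    slope = r * sd_y / sd_x               # r times the SD-line slope
    intercept = mean_y - slope * mean_x   # line passes through the point of averages
    return slope, intercept

# Height (X) and weight (Y) summary used in the slides.
slope, intercept = regression_slope_intercept(70, 3, 162, 30, 0.47)
print(slope, intercept)   # roughly 4.7 lbs per inch and -167 lbs

# With r = 1 the regression slope equals the SD-line slope SD_Y / SD_X.
slope_r1, _ = regression_slope_intercept(70, 3, 162, 30, 1.0)
print(slope_r1 == 30 / 3)
```

The negative intercept has no physical meaning here; the line is only trustworthy inside the range of observed heights.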
The regression
line is the solid
line running
through the
scatter plot

If we looked at
heights 1 SD
above the
average, the
regression line
runs through the
point 0.47 SD’s
above average
in weight
The Regression Line
Suppose we have the following five-statistic
summary stats for the height (X) and weight
(Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
What is the average weight of all the men
who are 73 inches tall?
For a man 73 inches tall, what weight should
we predict?
176.1 lbs
The Regression Line
Suppose we have the following five-statistic
summary stats for the height (X) and weight
(Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
What is the average weight of all the men
who are 64 inches tall?
For a man 64 inches tall, what weight should
we predict?
133.8 lbs
The Regression Line
To use the regression line from X to Y…


Standardize the given x-value to get zx
Use the regression equation to go from X
to Y
 zY = rzX

Unstandardize zY to get y
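The three steps above can be sketched as a short function. The name predict_y is made up for illustration.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression prediction of y from x using the five-statistic summary."""
    z_x = (x - mean_x) / sd_x     # step 1: standardize x
    z_y = r * z_x                 # step 2: regression equation z_Y = r * z_X
    return mean_y + z_y * sd_y    # step 3: unstandardize z_Y

# Height and weight summary from the slides.
for height in (73, 64):
    print(height, round(predict_y(height, 70, 3, 162, 30, 0.47), 1))
```

For heights of 73 and 64 inches this reproduces the 176.1 lbs and 133.8 lbs given in the earlier examples.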
The Regression Line
Suppose we have the following five-statistic summary stats for the height
(X) and weight (Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
Predict the weight of a man who is 6’4”
190.2 lbs
The Regression Line
Suppose we have the following five-statistic summary stats for the height
(X) and weight (Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
Predict the weight of a man who is 5’6”
143.2 lbs
The Regression Line
Important notes about the regression line
from X to Y

It predicts the average value for y given an x
value
 If the scatter plot is football shaped, this prediction will
be above about half of the sample and below the other
half


This is because the variables are approximately normal
The slope of the regression line will always be $r \left( \frac{SD_Y}{SD_X} \right)$
The RMS Error
Recall that an average alone did not
uniquely describe a data set


A spread measure was needed
Since the regression method only gives us
an average value as its prediction, we can’t
really tell by this alone how good a guess it
is
The prediction
given by the
regression line
for a height of
73 inches is at
(73 in, 176 lbs)


How much does
the heaviest 73”
tall man weigh?
How much does
the lightest 73”
tall man weigh?
The RMS Error
If we are given a specific man to predict, we are
likely to be a little off with the regression prediction


You can think of the prediction error as being the vertical
distance from the point to the regression line
That is, error = actual – predicted
If we want to get a good sense of what the typical
error for a given x-value is, we can find the RMS of
all the errors for all the points

This value is called the RMS error for the regression line
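A minimal sketch of this definition, computing each error as actual minus predicted and then taking the root-mean-square. The data are made up, the function name is hypothetical, and population SDs (dividing by n) are assumed.

```python
import math

def rms_error_from_data(xs, ys):
    """RMS of the vertical distances from the points to the regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Assumption: population SDs (divide by n).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)
    # error = actual - predicted, one error per point
    errors = [y - (mean_y + r * sd_y * (x - mean_x) / sd_x)
              for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in errors) / n)

# Made-up illustrative data (heights in inches, weights in lbs).
heights = [66, 68, 70, 72, 74]
weights = [140, 165, 150, 180, 175]
rms = rms_error_from_data(heights, weights)
print(round(rms, 2))
```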
The RMS Error
The RMS error is to the regression line what
the SD is to the average


The RMS error measures the spread around a
prediction from the regression line
Recall we are generally assuming the data sets are
approximately normal
 About 68% of the points on a scatter plot will fall within
the strip that runs from one RMS error below to one RMS
error above the regression line
The RMS Error
1 RMS error,
68%
2 RMS errors,
95%
The RMS Error
The RMS error for regression from X to Y
(denoted R) can be calculated from the five-statistic summary by
$R = SD_Y \sqrt{1 - r^2}$
What units would R have?


What happens when r gets close to 0?
What happens when r gets close to 1 or -1?
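One way to explore these questions is to evaluate the formula for several values of r. This is a sketch, with the function name invented and SDY = 30 lbs taken from the height and weight example.

```python
import math

def rms_error_from_summary(sd_y, r):
    """RMS error of the regression line from X to Y; same units as Y."""
    return sd_y * math.sqrt(1 - r ** 2)

# SD_Y = 30 lbs, as in the height/weight example.
for r in (0.0, 0.47, 0.9, 1.0):
    print(r, round(rms_error_from_summary(30, r), 1))
```

When r is 0 the RMS error equals SDY, so knowing x does not help. When r is 1 or -1 the RMS error is 0, so every point lies on the line.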
The RMS Error
The RMS error allows us to give a range
around our prediction
If the scatter plot is football-shaped,
the RMS error is roughly constant
across the entire range of the data set

The vertical spread around one part is
about the same as the vertical spread
around other parts
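A sketch of turning a prediction into a range, assuming a football-shaped scatter plot so that the same RMS error applies at every x. The function name and the parameter k (number of RMS errors) are made up for illustration.

```python
import math

def predict_with_range(x, mean_x, sd_x, mean_y, sd_y, r, k=1):
    """Return (low, predicted, high) using k RMS errors on either side."""
    predicted = mean_y + r * sd_y * (x - mean_x) / sd_x
    rms = sd_y * math.sqrt(1 - r ** 2)
    return predicted - k * rms, predicted, predicted + k * rms

# Men 73 inches tall, using the height/weight summary from the slides.
low, mid, high = predict_with_range(73, 70, 3, 162, 30, 0.47)
print(round(low, 1), round(mid, 1), round(high, 1))
```

Under the normal approximation, roughly 68% of men who are 73 inches tall would have weights between low and high; with k=2 the range would cover roughly 95%.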
The RMS Error
Suppose we have the following five-statistic summary stats for the height
(X) and weight (Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
Predict and give the RMS error for the
weight of a man who is 6’2”
180.8 ± 26.5 lbs
The RMS Error
Suppose we have the following five-statistic summary stats for the height
(X) and weight (Y) of men in the US



µX= 70 inches, SDX= 3 inches
µY= 162 lbs, SDY= 30 lbs
r = 0.47
Predict and give the RMS error for the
weight of a man who is 5’4”
133.8 ± 26.5 lbs
The Regression Effect
A preschool program attempts to boost
students’ IQ scores


The children are tested when they enter the
program (pretest)
The children are retested when they leave the
program (post-test)
The Regression Effect
On both occasions, the average IQ
score was 100, with an SD of 15


Also, students with below-average IQs on
the pretest had scores that went up on the
average by 5 points
Students with above average scores on the
pretest had their scores drop by an
average of 5 points
The Regression Effect
Does the program equalize intelligence?
No. If the program really equalized
intelligence, then the SD for the post-test
results should be smaller than that of the
pre-test results. This is an example of the
regression effect.
The Regression Effect
The regression effect is a byproduct of the
fact that predictions from a regression line
are average values

Some of the people who did very well on the pretest may simply have had a good test day
 Their scores shouldn’t necessarily be as high on the post-
test as they were on the pretest

Similarly, some of the people who did poorly on
the pre-test may simply have had a bad test day
 Their scores shouldn’t necessarily be as low on the post-
test as they were on the pretest
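The good-day/bad-day explanation can be checked with a small simulation in which nothing changes between the two tests. This is only a sketch: the model (a stable ability component plus independent test-day luck), the correlation of 0.5, the sample size, and the seed are all assumptions chosen for illustration.

```python
import math
import random

def simulate_test_retest(n=20000, r=0.5, sd=15.0, seed=0):
    """Average score change for below- and above-average pretest groups."""
    rng = random.Random(seed)
    low_changes, high_changes = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # stable component, identical on both tests
        # Each test adds its own independent luck; pre/post correlation is r.
        pre_z = math.sqrt(r) * ability + math.sqrt(1 - r) * rng.gauss(0, 1)
        post_z = math.sqrt(r) * ability + math.sqrt(1 - r) * rng.gauss(0, 1)
        change = sd * (post_z - pre_z)
        if pre_z < 0:
            low_changes.append(change)
        else:
            high_changes.append(change)
    return (sum(low_changes) / len(low_changes),
            sum(high_changes) / len(high_changes))

low_avg_change, high_avg_change = simulate_test_retest()
print(round(low_avg_change, 1), round(high_avg_change, 1))
```

Even though the simulated program has no effect, the below-average group gains points on average and the above-average group loses points, while the overall average and SD stay the same.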
The Regression Effect
Sometimes researchers mistake the
regression effect for some important
underlying cause in the study
(regression fallacy)


Tall fathers tend to have tall sons who are
slightly shorter than the father
There is no biological cause for this
reduction
 It is strictly statistical
The Regression Effect
As part of their training, air force pilots
make practice landings with instructors,
and are rated on performance

The instructors discuss the ratings with the
pilots after each landing
 Statistical analysis shows that pilots who make
poor landings the first time tend to do better
the second time
 Conversely, pilots who make good landings the
first time tend to do worse the second time
The Regression Effect
The conclusion is that criticism helps the
pilots while praise makes them do worse

As a result, instructors were ordered to criticize all
landings, good or bad
Was this warranted by the facts?
No. This is an example of the regression fallacy.
The Regression Effect
An instructor gives a midterm


She asks the students who score 20 points below average to
see her regularly during her office hours for special tutoring
They all score at class average or above on the final
Can this improvement be attributed to the regression
effect? Why/why not?
No. If it were only the regression
effect, most of the students still would
have scored below average. The fact
that everyone in the tutoring group
scored at or above average indicates that
the tutoring had an effect.
A Second Regression Line
The focus so far has been on the
regression line from X to Y

Note, however, that there is also a
regression line from Y to X
What would the difference between the
two lines be?
The regression line from X to Y is given by zY = rzX, while
the regression line from Y to X is given by zX = rzY
A Second Regression Line
A study of 1,000 families gives the following



The husbands’ average height was 68 inches with
an SD of 2.7 inches
The wives’ average height was 63 inches with an
SD of 2.5 inches
The correlation between them was 0.25
Predict and give the RMS error for the
husband’s height when his wife’s height is 68
inches
69.35 inches, give or take 2.61 inches
A Second Regression Line
A study of 1,000 families gives the following



The husbands’ average height was 68 inches with
an SD of 2.7 inches
The wives’ average height was 63 inches with an
SD of 2.5 inches
The correlation between them was 0.25
Predict and give the RMS error for the wife’s
height when her husband’s height is 69.35
inches
63.31 inches, give or take 2.42 inches
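The two examples above can be reproduced with one helper applied in each direction. This is a sketch; the function and argument names are made up for illustration.

```python
import math

def predict(value, mean_from, sd_from, mean_to, sd_to, r):
    """Regression prediction of the 'to' variable from the 'from' variable."""
    return mean_to + r * sd_to * (value - mean_from) / sd_from

def rms_error(sd_to, r):
    """RMS error uses the SD of the variable being predicted."""
    return sd_to * math.sqrt(1 - r ** 2)

# Wife's height (68 in) -> husband's height
husband = predict(68, 63, 2.5, 68, 2.7, 0.25)
# Husband's height (the value just predicted) -> wife's height
wife_back = predict(husband, 68, 2.7, 63, 2.5, 0.25)

print(round(husband, 2), round(rms_error(2.7, 0.25), 2))
print(round(wife_back, 2), round(rms_error(2.5, 0.25), 2))
```

Going from wife to husband and back does not return to 68 inches, because each regression line pulls its prediction toward the average of the variable being predicted.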
A Second Regression Line
[Figure: the regression line from Y to X, the SD line, and the regression line from X to Y]
Summary
When trying to make predictions from a football-shaped plot, a good predictor is the average value
for one variable within a restricted range in the other

The regression line runs through all of these averages
 For every SD moved in the independent variable, the
regression line predicts a move of r SD’s in the dependent
variable

The prediction from the regression line is likely to be off by
about one RMS error
 The RMS error can be calculated as
$R = SD_Y \sqrt{1 - r^2}$
Summary
The regression effect is purely statistical

It does not reflect a significant underlying
trend in the data
There are two regression lines for a
scatter plot

Which one to use depends on which
variable you are predicting