Section 4.2 Linear Regression and the Coefficient of Determination
The Least Squares Line
• When there appears to be a linear relationship
between x and y we attempt to “fit” a line to
the scatter diagram.
Least Squares Criterion
The sum of the squares of the vertical distances from the
points to the line is made as small as possible.
Least Squares Criterion
d represents the difference between the y coordinate of the
data point and the corresponding y coordinate on the line.
Thus if the data point lies above the line, d is positive, but if
the data point lies below the line, d is negative.
As a result, the sum of the d values can be small even if the
points are widely spread in the scatter diagram.
However, the squares d² cannot be negative.
By minimizing the sum of the squares, we are, in effect, not
allowing positive and negative d values to “cancel out” one
another in the sum.
It is in this way that we can meet the least-squares criterion of
minimizing the sum of the squares of the vertical distances
between the points and the line over all points in the
scatter diagram.
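The cancellation effect can be checked numerically. The sketch below uses three made-up points and a hypothetical horizontal line ŷ = 2 (neither comes from the example data in this section) to show that the signed differences d can sum to zero while the squared differences cannot.

```python
# Hypothetical points and line, chosen only to illustrate cancellation.
points = [(1, 1), (2, 3), (3, 2)]

def line(x):
    """Hypothetical candidate line y-hat = 2 (intercept 2, slope 0)."""
    return 2 + 0 * x

# d = (y coordinate of the data point) - (y coordinate on the line)
d_values = [y - line(x) for x, y in points]

sum_d = sum(d_values)                          # positives and negatives cancel
sum_d_squared = sum(d ** 2 for d in d_values)  # squares cannot cancel

print(d_values)       # [-1, 1, 0]
print(sum_d)          # 0
print(sum_d_squared)  # 2
```

The sum of d is zero even though two of the three points miss the line, while the sum of squares still registers the misses.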
Equation of the Least Squares Line
ŷ = a + bx
a = the y-intercept
b = the slope
Finding the Equation of the Least Squares
Line
• Obtain a random sample of n data pairs (x, y).
1. Using the data pairs, compute Σx, Σy, Σx², Σy², and Σxy.
Compute the sample means x̄ and ȳ.
Finding the Slope
2. Use the following formula:
$$\text{slope} = b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$$
Finding the y-intercept
$$y\text{-intercept} = a = \bar{y} - b\bar{x}$$
where ȳ = mean of y values, x̄ = mean of x values, and b = the slope.
Example
Find the Least Squares Line
x (Miles Traveled)   y (Minutes)   x²          xy
2                    6             4           12
5                    9             25          45
12                   23            144         276
7                    18            49          126
7                    15            49          105
15                   28            225         420
10                   19            100         190
Σx = 58              Σy = 118      Σx² = 596   Σxy = 1174
Example cont.
Finding the Slope
$$\text{slope} = b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{7(1174) - (58)(118)}{7(596) - 58^2} = \frac{1374}{808} \approx 1.70$$
Example cont.
Finding the y-intercept
$$\bar{y} = \text{mean of y values} = \frac{118}{7} \approx 16.857143$$
$$\bar{x} = \text{mean of x values} = \frac{58}{7} \approx 8.2857143$$
$$y\text{-intercept} = a = \bar{y} - b\bar{x} = 16.857143 - 1.700495(8.2857143) \approx 2.7673273 \approx 2.77$$
The equation of the least squares line is:
ŷ = a + bx
ŷ = 2.77 + 1.70x
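The hand computation can be cross-checked with a short script. This is a minimal sketch, assuming the seven (miles, minutes) pairs from the example table; the function name `least_squares_line` is made up for illustration.

```python
# Data pairs from the example: x = miles traveled, y = minutes.
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using the sums formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y-bar - b * x-bar
    a = sum_y / n - b * (sum_x / n)
    return a, b

a, b = least_squares_line(miles, minutes)
print(f"slope b = {b:.6f}")      # slope b = 1.700495
print(f"intercept a = {a:.6f}")  # intercept a = 2.767327
```

Rounded to two decimal places, these values give ŷ = 2.77 + 1.70x.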
Graph the Least-Squares Line
We can use the slope-intercept method of algebra, but it may not
always be convenient if the intercept is not within the range
of the sample data values.
It is better to select two x values in the range of the x data values
and then use the least-squares line to compute the two
corresponding ŷ values.
The point (x̄, ȳ) is always on the least-squares line.
To find another point, give x a value and find the corresponding ŷ.
In our example: (x̄, ȳ) = (8.3, 16.9)
Try x = 5. Compute ŷ: ŷ = 2.8 + 1.7(5) = 11.3
Graphing the least squares line
• Using two values in the range of x, compute
two corresponding y values.
• Plot these points.
• Join the points with a straight line.
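The two plotting points can also be computed in code. The sketch below assumes the miles/minutes data from the example and uses unrounded coefficients, which makes it easy to confirm that (x̄, ȳ) lies on the line; the variable names are illustrative only.

```python
# Data pairs from the example: x = miles traveled, y = minutes.
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]
n = len(miles)

# Slope and intercept from the sums formulas.
sum_xy = sum(x * y for x, y in zip(miles, minutes))
sum_x2 = sum(x * x for x in miles)
b = (n * sum_xy - sum(miles) * sum(minutes)) / (n * sum_x2 - sum(miles) ** 2)
x_bar, y_bar = sum(miles) / n, sum(minutes) / n
a = y_bar - b * x_bar

def y_hat(x):
    """Height of the least squares line at x."""
    return a + b * x

first_point = (x_bar, y_hat(x_bar))  # should coincide with (x_bar, y_bar)
second_point = (5, y_hat(5))         # any x inside the range 2..15 works

print(f"({first_point[0]:.1f}, {first_point[1]:.1f})")  # (8.3, 16.9)
print(f"({second_point[0]}, {second_point[1]:.1f})")    # (5, 11.3)
```

Joining these two points with a straight line gives the graph of the least squares line.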
Sketching the Line
[Figure: "Relationship between miles traveled and minutes"; x-axis: miles (0 to 20), y-axis: minutes (0 to 30)]
Meaning of Slope
In the equation ŷ = a + bx, the slope b tells us how many
units ŷ changes for each unit change in x.
In our example regarding the miles traveled and the time in
minutes
ŷ = 2.77 + 1.70x
The slope 1.70 tells us that each additional mile traveled adds,
on average, 1.70 minutes to the trip.
The slope of the least-squares line tells how many units the
response variable is expected to change for each unit change
in the explanatory variable. The number of units of change in the
response variable for each unit change in the explanatory
variable is called the marginal change of the response variable.
Using the Equation of the Least Squares
Line to Make Predictions
• Choose a value for x (within the range of x
values).
• Substitute the selected x in the least squares
equation.
• Determine corresponding value of ŷ.
Predict the time
to make a trip of 14 miles
• Equation of least squares line:
ŷ = 2.8 + 1.7x
• Substitute x = 14:
ŷ = 2.8 + 1.7 (14)
ŷ = 26.6
• According to the least squares equation, a trip
of 14 miles would take 26.6 minutes.
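The three prediction steps can be sketched as a small function. This assumes the rounded line ŷ = 2.8 + 1.7x used on this slide; the function name `predict_minutes` and the choice to raise an error outside the observed range are illustrative assumptions, not part of the original procedure.

```python
# Rounded least squares line from the example: y-hat = 2.8 + 1.7x
A, B = 2.8, 1.7
OBSERVED_MILES = [2, 5, 12, 7, 7, 15, 10]

def predict_minutes(x):
    """Predict minutes for x miles; refuse x outside the observed range."""
    if not min(OBSERVED_MILES) <= x <= max(OBSERVED_MILES):
        raise ValueError(f"x = {x} is outside the observed range of x values")
    return A + B * x

print(round(predict_minutes(14), 1))  # 26.6
```

Rejecting values outside 2 to 15 miles is one way to enforce the "within the range of x values" condition stated in the first step.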
Interpolation
• Using the least squares line to predict ŷ
values for x values that are between
observed x values in the data set.
Extrapolation
• Using the least squares line to predict ŷ
values for x values that are beyond
observed x values in the data set.
Extreme Data Points
• The least squares line can be greatly
affected by extreme or influential data
points.
The least squares line
• Is developed from sample data pairs (x, y).
• May not reflect the relationship between x and y for
values of x outside the data range.
• For example, there is a fairly high correlation between
height and age for boys ages 1 year to 10 years. In general,
the older the boy, the taller the boy. A least-squares line
based on such data gives good predictions of height for
ages 1 to 10.
• However, it would be fairly meaningless to use the same
linear regression line to predict the height of men 20 to 50
years old.
The least squares line
• Each different data sample will produce a
slightly different equation for the least-squares line.
• The least-squares line developed with x as
the explanatory variable and y as the
response variable can be used only to
predict y values from specified x values.
A statistic related to r:
• If the sample correlation coefficient is r
• The coefficient of determination = r²
How good is the least squares line as an
instrument of regression?
The answer is the coefficient of determination r².
Coefficient of Determination r²
Is a measure of the proportion of the
variation in y that is explained by the
regression line using x as the
predicting variable.
Interpretation of r²
• If r = 0.9753643, then what percent of the variation in
minutes (y) is explained by the linear relationship with x,
miles traveled?
• What percent is unexplained?
• If r = 0.9753643, then r² = 0.9513355.
• Approximately 95 percent of the variation in minutes (y)
is explained by the linear relationship with x, miles
traveled.
• 1 − r² ≈ 5% is unexplained (due to random chance or
the possibility of lurking variables that influence y).
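The quoted value of r can be reproduced from the miles/minutes data. The sketch below assumes the standard computational formula for the sample correlation coefficient, which this section quotes results from but does not display.

```python
import math

# Data pairs from the example: x = miles traveled, y = minutes.
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]
n = len(miles)

sum_x, sum_y = sum(miles), sum(minutes)
sum_x2 = sum(x * x for x in miles)
sum_y2 = sum(y * y for y in minutes)
sum_xy = sum(x * y for x, y in zip(miles, minutes))

# Computational formula for r (an assumption here; the slides only quote r).
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

print(f"r = {r:.7f}")  # r = 0.9753643
print(f"explained: {r_squared:.1%}, unexplained: {1 - r_squared:.1%}")
```

The explained share comes out at about 95 percent, leaving about 5 percent unexplained.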
Assignments 7, 8 and 9
Correlation Coefficient r
Coefficient of Determination, r² (calc)
• The correlation coefficient, r, and the coefficient of
determination, r², will appear on the screen that shows the
regression equation information (be sure the Diagnostics are
turned on: 2nd CATALOG (above 0), arrow down to DiagnosticOn,
press ENTER twice).
• In addition to appearing with the regression information, the
values r and r² can be found under
VARS, #5 Statistics → EQ, #7 r and #8 r².
Linear Regression (calc)
• A linear regression is also known as the "line of best fit".
• Side note: Although commonly used when dealing with "sets" of
data, the linear regression can also be used to simply find the
equation of the line between two points.
Find the equation of the line passing through (-1, 1) and (-4, 7).
Entering the two points as described in the example below gives
the following results:
The equation is y = -2x - 1.
The correlation coefficient is -1 since both points are "on" the line
and the line slopes negatively.
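The two-point case can be verified with the same sums formulas used for the least squares line. This is a small sketch that stands in for the calculator's LinReg routine, not a reproduction of it.

```python
import math

# The two points from the side note.
points = [(-1, 1), (-4, 7)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in points)

slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * (sum_x / n)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(slope, intercept, r)  # -2.0 -1.0 -1.0
```

With only two points the line passes through both exactly, so r is ±1, and the sign follows the slope.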
Linear Regression Model Example (calc)
Let's examine an example of the linear regression as it pertains to a
"set" of data.
Data: Is there a relationship between Math SAT scores and the
number of hours spent studying for the test? A study was
conducted involving 20 students as they prepared for and took the
Math section of the SAT Examination.
Let x be the Hours Spent Studying and y be the Math SAT Score.
x     4    9   10   14    4    7   12
y   390  580  650  730  410  530  600
x    22    1    3    8   11    5    6
y   790  350  400  590  640  450  520
x    10   11   16   13   13   10
y   690  690  770  700  730  640
Linear Regression Model Example cont.
• Task:
a) Determine a linear regression model equation to represent this data.
b) Graph the new equation.
c) Decide whether the new equation is a "good fit" to represent this
data.
d) Interpolate data: If a student studied for 15 hours, based upon this
study, what would be the expected Math SAT score?
e) Interpolate data: If a student obtained a Math SAT score of 720,
based upon this study, how many hours did the student most likely
spend studying?
f) Extrapolate data: If a student spent 100 hours studying, what would
be the expected Math SAT score? Discuss this answer.
Any answers in relation to this problem are to be rounded to the
nearest tenth. If rounding is not indicated in a problem, leave the
full calculator entries as answers.
Linear Regression Model Example cont.
• Step 1. Enter the data into the lists.
• Step 2. Create a scatter plot of the data.
Go to STATPLOT (2nd Y=) and choose the first plot. Turn the plot
ON, set the icon to Scatter Plot (the first one), set Xlist to L1 and
Ylist to L2 (assuming that is where you stored the data), and select
a Mark of your choice.
• Step 3. Choose Linear Regression Model.
Press STAT, arrow right to CALC, and arrow down to 4: LinReg
(ax+b). Hit ENTER. When LinReg appears on the home screen,
type the parameters L1, L2, Y1. The Y1 will put the equation into
Y= for you.
(Y1 comes from VARS → YVARS, #Function, Y1)
Linear Regression Model Example cont.
• Step 4. Graph the Linear Regression Equation from Y1.
ZOOM #9 ZoomStat to see the graph.
(answer to part b)
• Step 5. Is this model a "good fit"?
The correlation coefficient, r, is .9336055153 which places the
correlation into the "strong" category. (0.8 or greater is a "strong"
correlation)
The coefficient of determination, r², is .8716192582 which
means that 87% of the total variation in y can be explained by the
relationship between x and y. The other 13% remains
unexplained.
Yes, it is a "good fit".
(answer to part c)
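For readers without the calculator, Steps 3 and 5 can be approximated in code. The sketch below assumes the 20 (hours, score) pairs from the data slide; `linreg` is a hypothetical helper built on the sums formulas and is not the calculator's LinReg(ax+b) routine.

```python
import math

# Hours spent studying (x) and Math SAT score (y) for the 20 students.
hours = [4, 9, 10, 14, 4, 7, 12, 22, 1, 3, 8, 11, 5, 6, 10, 11, 16, 13, 13, 10]
scores = [390, 580, 650, 730, 410, 530, 600, 790, 350, 400, 590, 640, 450, 520,
          690, 690, 770, 700, 730, 640]

def linreg(xs, ys):
    """Return (slope, intercept, r) for y = slope*x + intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * (sx / n)
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return slope, intercept, r

slope, intercept, r = linreg(hours, scores)
print(f"y = {slope:.3f}x + {intercept:.3f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")  # r = 0.9336, r^2 = 0.8716
```

The r and r² values agree with the calculator output quoted in Step 5.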
Linear Regression Model Example cont.
Step 6. Interpolate: (within the data set)
If a student studied for 15 hours, based upon this study, what
would be the expected Math SAT score?
From the graph screen, hit TRACE, arrow up to obtain the linear
equation at the top of the screen, type 15, hit ENTER, and the
answer will appear at the bottom of the screen.
(answer to part d -- Math SAT score of 733.1)
Linear Regression Model Example cont.
• Step 7. Interpolate: (within the data set)
If a student obtained a Math SAT score of 720, based upon this
study, how many hours did the student most likely spend
studying?
Go to TBLSET (above WINDOW) and set the TblStart to 13 (since
13 hours gives a score of 700). Set the delta Tbl to a decimal
setting of your choice. Go to TABLE and arrow up or down to find
your desired score of 720 in the Y1 column.
(answer to part e -- approx. 14.5 hours)
Linear Regression Model Example cont.
• Step 8. Extrapolate data: (beyond the data set)
If a student spent 100 hours studying, what would be the
expected Math SAT score?
Discuss this answer.
• With your linear equation in Y1, go to the home screen and type
Y1(100). Press ENTER. (Y1 comes from VARS → YVARS, #Function, Y1(100))
• Our equation shows that if a student studies 100 hours, he/she
should score 2885.8 on the Math section of the SAT
examination. The only problem with this answer is that the highest
score that can be obtained is 800. So why is this score so
outrageous? ANSWER: When you extrapolate data, the further
you move away from the data set, the less accurate your
information becomes. In this problem, the largest number of
hours in the data set was 22 hours, but the extrapolation tried to
jump to 100 hours.
(answer to part f)
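The answers to parts d, e and f can be reproduced from the fitted line. This is a sketch under the same assumptions as the data slide; note that part e is solved here by inverting the equation algebraically rather than by scanning the calculator TABLE, and the function names are illustrative.

```python
# Hours spent studying (x) and Math SAT score (y) for the 20 students.
hours = [4, 9, 10, 14, 4, 7, 12, 22, 1, 3, 8, 11, 5, 6, 10, 11, 16, 13, 13, 10]
scores = [390, 580, 650, 730, 410, 530, 600, 790, 350, 400, 590, 640, 450, 520,
          690, 690, 770, 700, 730, 640]
n = len(hours)

# Slope and intercept from the sums formulas.
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
slope = (n * sum_xy - sum(hours) * sum(scores)) / (n * sum_x2 - sum(hours) ** 2)
intercept = sum(scores) / n - slope * (sum(hours) / n)

def predict_score(h):
    """Expected score after h hours of study."""
    return slope * h + intercept

def hours_for_score(s):
    """Solve s = slope*h + intercept for h (replaces the TABLE lookup)."""
    return (s - intercept) / slope

print(round(predict_score(15), 1))     # 733.1  (part d)
print(round(hours_for_score(720), 1))  # 14.5   (part e)
print(round(predict_score(100), 1))    # 2885.8 (part f)
print(max(hours))                      # 22, the largest x in the data set
```

The last line shows why part f is unreliable: 100 hours is far beyond the largest observed value of 22 hours.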
Example
Linear Regression with Biological Data
(or the realities of working with real-life data)
Pierce (1949) measured the frequency (the
number of wing vibrations per second) of chirps made
by a ground cricket, at various ground
temperatures. Since crickets are ectotherms
(cold-blooded), the rate of their physiological processes and
their overall metabolism are influenced by
temperature. Consequently, there is reason to believe
that temperature would have a profound effect on
aspects of their behavior, such as chirp frequency.
Example cont.
Chirps/Second   Temperature (º F)
20.0            88.6
16.0            71.6
19.8            93.3
18.4            84.3
17.1            80.6
15.5            75.2
14.7            69.7
17.1            82.0
15.4            69.4
16.2            83.3
15.0            78.6
17.2            82.6
16.0            80.6
17.0            83.5
14.1            76.3
Example cont.
Task:
a) Determine a linear regression model equation to represent this data
b) Graph the new equation.
c) Decide whether the new equation is a "good fit" to represent this
data.
d) Extrapolate data: If the ground temperature reached 95º, then at
what approximate rate would you expect the crickets to be
chirping?
e) Interpolate data: With a listening device, you discovered that on a
particular morning the crickets were chirping at a rate of 18 chirps
per second. What was the approximate ground temperature that
morning?
f) If the ground temperature should drop to freezing (32º F), what
happens to the cricket's chirping?
Answers in this problem are to be rounded to the nearest thousandth.
Example cont.
Step 1. Enter the data into the lists.
Step 2. Create a scatter plot of the data.
Go to STATPLOT (2nd Y=) and choose the first plot. Turn the plot
ON, set the icon to Scatter Plot (the first one), set Xlist to L1 and
Ylist to L2 (assuming that is where you stored the data), and select
a Mark of your choice.
Obviously, there is some scatter to this data. This variability is the
norm, rather than the exception, when working with biological
data sets. Real life data seldom creates a nice straight line.
Step 3. Choose the Linear Regression Model.
Press STAT, arrow right to CALC, and arrow down to 4: LinReg
(ax+b). Hit ENTER. When LinReg appears on the home screen,
type the parameters L1, L2, Y1. The Y1 will put the equation into
Y= for you.
(Y1 comes from VARS → YVARS, #Function, Y1)
Example cont.
Step 4. Graph the Linear Regression Equation from Y1.
ZOOM #9 ZoomStat to see the graph. (answer to part b)
Step 5. Is this model a "good fit"?
The correlation coefficient, r, is .8364792791 which just barely
places the correlation into the "strong" category. (0.8 or greater is
a "strong" correlation)
The coefficient of determination, r², is .6996975844 which
means that 70% of the total variation in y can be explained by the
relationship between x and y. The other 30% remains
unexplained.
Yes, it is somewhat of a "good fit".
(answer to part c)
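The cricket fit can be checked the same way. The sketch below assumes that L1 holds chirps per second (x) and L2 holds temperature (y), which is the assignment consistent with the answers quoted in the later steps; `linreg` is again a hypothetical helper, not the calculator routine.

```python
import math

# Assumed list assignment: L1 = chirps per second (x), L2 = temperature in F (y).
chirps = [20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
          15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.1]
temps = [88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
         69.4, 83.3, 78.6, 82.6, 80.6, 83.5, 76.3]

def linreg(xs, ys):
    """Return (slope, intercept, r) for y = slope*x + intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * (sx / n)
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return slope, intercept, r

slope, intercept, r = linreg(chirps, temps)
print(f"temperature = {slope:.3f} * chirps + {intercept:.3f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")  # r = 0.8365, r^2 = 0.6997
```

The weaker r² compared with the SAT example reflects the visible scatter in the biological data.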
Example cont.
• Step 6. Extrapolate: (beyond the data set)
If the ground temperature reached 95º, then at what
approximate rate would you expect the crickets to be
chirping?
Go to TBLSET (above WINDOW) and set the TblStart to 20
(since the highest temperature in the data set had 19.8
chirps/second). Set the delta Tbl to a decimal setting of your
choice. Go to TABLE (above GRAPH) and arrow up or down to
find your desired temperature, 95º, in the Y1 column.
(answer to part d -- approx. 21.265 chirps per second)
Example cont.
• Step 7. Interpolate: (within the data set)
With a listening device, you discovered that on a particular
morning the crickets were chirping at a rate of 18 chirps per
second.
• What was the approximate ground temperature that morning?
From the graph screen, hit TRACE, arrow up to obtain the linear
equation, type 18, hit ENTER, and the answer will appear at the
bottom of the screen.
(answer to part e -- the ground temperature will be approx. 84.407º F)
Example cont.
• Step 8. If the ground temperature should drop to freezing (32º F),
what happens to the cricket's chirping?
• The TABLE tells us that at 32º F there are 1.85 chirps per
second. So, what does this really mean? Are the crickets cold?
• These findings are a bit deceiving. At 32º F, the crickets are
dead. The lifespan of a cricket in a cold climate is very short. The
crickets spend the winter as eggs laid in the soil. These eggs
hatch in late spring or early summer, and tiny immature crickets
called nymphs emerge. Nymphs develop into adults within
approximately 90 days. The adults mate and lay eggs in late
summer before succumbing to old age or freezing temperatures
in the fall.
• Also, remember that the further you extrapolate away from the
data set, the less reliable the information will be.
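The answers to parts d, e and f can be reproduced from the fitted line. This sketch assumes x = chirps per second and y = temperature, and it inverts the equation algebraically where the calculator steps scan the TABLE; the function names are illustrative.

```python
# Assumed list assignment: x = chirps per second, y = temperature in F.
chirps = [20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
          15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.1]
temps = [88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
         69.4, 83.3, 78.6, 82.6, 80.6, 83.5, 76.3]
n = len(chirps)

# Slope and intercept from the sums formulas.
sum_xy = sum(x * y for x, y in zip(chirps, temps))
sum_x2 = sum(x * x for x in chirps)
slope = (n * sum_xy - sum(chirps) * sum(temps)) / (n * sum_x2 - sum(chirps) ** 2)
intercept = sum(temps) / n - slope * (sum(chirps) / n)

def temp_for_chirps(c):
    """Temperature predicted by the line for c chirps per second."""
    return slope * c + intercept

def chirps_for_temp(t):
    """Solve t = slope*c + intercept for c (replaces the TABLE lookup)."""
    return (t - intercept) / slope

print(round(chirps_for_temp(95), 3))  # 21.265 (part d)
print(round(temp_for_chirps(18), 3))  # 84.407 (part e)
print(round(chirps_for_temp(32), 3))  # 1.846  (part f, about 1.85)
print(min(temps), max(temps))         # 69.4 93.3, the observed range
```

The last line shows that 32º F lies far below the observed temperatures, so the part f value is an extrapolation with little practical meaning.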