Section 7.3 ~ Best-Fit Lines and Prediction


Section 7.3 ~
Best-Fit Lines and Prediction
Introduction to Probability and Statistics
Ms. Young
Sec. 7.3
Objective

After this section you will become familiar with the
concept of a best-fit line for a correlation, recognize
when such lines have predictive value and when they
may not, understand how the square of the
correlation coefficient is related to the quality of the
fit, and qualitatively understand the use of multiple
regression.
Line of Best-Fit

The best-fit line (or regression line) on a scatterplot is a line that lies
closer to the data points than any other possible line
This can be useful to make predictions based on existing data
The line of best-fit should have approximately the same number of points
above it as below it, and it does not have to pass through the origin
The precise line of best-fit can be calculated by hand, but doing so is very tedious, so
it is often estimated “by eye” or computed with a calculator
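
As a rough illustration of the calculation a calculator performs, here is a minimal Python sketch. It assumes that "closest" is measured in the usual least-squares sense (smallest sum of squared vertical distances), which is the standard meaning of a regression line. The function name best_fit_line and the data values are made up for this example and do not come from the text.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x about its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = best_fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints: y = 0.60x + 2.20
```

Note that the intercept here is 2.20 rather than 0, which matches the point that the line does not have to pass through the origin.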

Cautions in Making Predictions from Best-Fit Lines
1. Don’t expect a best-fit line to give a good prediction unless the
correlation is strong and there are many data points

If the sample points lie very close to the best-fit line, the correlation is
very strong and the prediction is more likely to be accurate

If the sample points lie away from the best-fit line by substantial amounts,
the correlation is weak and predictions tend to be much less accurate
2. Don’t use a best-fit line to make predictions beyond the bounds of the
data points to which the line was fit

Ex. ~ The diagram below represents the relationship between candle length
and burning time. All of the candles in the collected data were between
2 in. and 4 in. long. Using the line of best-fit to make a prediction
for lengths far outside that range would most likely be inappropriate.

According to the line of best-fit, a candle with a length of 0 in. burns for 2
minutes, which is impossible
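
One way to make this caution concrete is to have a prediction routine refuse to extrapolate. The Python sketch below is only an illustration: the function name predict_within_range is hypothetical, the 2 in. to 4 in. range and the 2-minute intercept come from the candle example, and the slope of 10 minutes per inch is a made-up value because the text does not give one.

```python
def predict_within_range(slope, intercept, x, x_min, x_max):
    """Predict y from the line, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the fitted range [{x_min}, {x_max}]; "
            "the line was not fit to data there"
        )
    return slope * x + intercept


# The intercept (2 minutes) and the 2-4 in. range come from the candle example.
# The slope of 10 minutes per inch is a made-up value for illustration only.
SLOPE, INTERCEPT = 10.0, 2.0

# A 3 in. candle lies inside the range of the data, so a prediction is returned.
print(predict_within_range(SLOPE, INTERCEPT, 3.0, 2.0, 4.0))

# A 0 in. candle lies outside the range, so the function refuses to predict.
try:
    predict_within_range(SLOPE, INTERCEPT, 0.0, 2.0, 4.0)
except ValueError as err:
    print("refused:", err)
```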
3. A best-fit line based on past data is not necessarily valid now
and might not result in valid predictions of the future

Ex. ~ Economists studying historical data found a strong
correlation between unemployment and the rate of inflation.
According to this correlation, inflation should have risen
dramatically in the recent years when the unemployment rate fell
below 6%. But inflation remained low, showing that the correlation
from old data did not continue to hold.

4. Don’t make predictions about a population that is different from
the population from which the sample data were drawn

Ex. ~ You cannot expect that the correlation between aspirin
consumption and heart attacks in an experiment involving only men
will also apply to women

5. Remember that a best-fit line is meaningless when there is no
significant correlation or when the relationship is nonlinear

Ex. ~ There is no correlation between shoe size and IQ, so even
though you can draw a line of best-fit, it is useless in making any
conclusions
Example 1
State whether the prediction (or implied prediction) should be trusted in
each of the following cases, and explain why or why not.

You’ve found a best-fit line for a correlation between the number of
hours per day that people exercise and the number of calories they
consume each day. You’ve used this correlation to predict that a person
who exercises 18 hours per day would consume 15,000 calories per day.

This prediction would be beyond the bounds of the data collected and should
therefore not be trusted

There is a well-known but weak correlation between SAT scores and
college grades. You use this correlation to predict the college grades of
your best friend from her SAT scores.

Since the correlation is weak, there is much scatter in the
data and you should not expect great accuracy in the prediction

Historical data have shown a strong negative correlation between national
birth rates and affluence. That is, countries with greater affluence
tend to have lower birth rates. These data predict a high birth rate in
Russia.

We cannot automatically assume that the historical data still apply today. In
fact, Russia currently has a very low birth rate, despite also having a low
level of affluence.
Example 1 Cont’d…

A study in China has discovered correlations that are useful in
designing museum exhibits that Chinese children enjoy. A curator
suggests using this information to design a new museum exhibit for
Atlanta-area school children.


The suggestion to use information from the Chinese study for an Atlanta
exhibit assumes that predictions made from correlations in China also apply
to Atlanta. However, given the cultural differences between China and
Atlanta, the curator’s suggestion should not be adopted without more
information to back it up.
Scientific studies have shown a very strong correlation between
children’s ingestion of lead and mental retardation. Based on this
correlation, paints containing lead were banned.

Given the strength of the correlation and the severity of the consequences,
this prediction and the ban that followed seem quite reasonable. In fact,
later studies established lead as an actual cause of mental retardation,
making the rationale behind the ban even stronger.
The Correlation Coefficient and Best-Fit Lines


Recall that the correlation coefficient (r)
refers to the strength of a correlation
The correlation coefficient can also be used to
say something about the validity of predictions
with best-fit lines

The coefficient of determination, r², is the
proportion of the variation in a variable that is
accounted for by the best-fit line

Ex. ~ The correlation coefficient for the diamond weight
and price from the scatterplot on p. 307 is r = 0.777, so
r² ≈ 0.604. This means that about 60% of the variation in
the diamond prices is accounted for by the best-fit line
relating weight and price, and the remaining 40% of the variation in price
must be due to other factors.
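
The arithmetic in this example, and the meaning of "variation accounted for", can be checked with a short Python sketch. It assumes the best-fit line is the least-squares line. The helper names correlation and fraction_explained are hypothetical, and the small data set is made up for illustration. Only r = 0.777 comes from the text.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_explained(xs, ys):
    """Share of the variation in y accounted for by the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after using the line, versus total variation in y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Value quoted in the text for the diamond data.
r_diamond = 0.777
print(f"diamonds: r^2 = {r_diamond ** 2:.3f}")  # prints: diamonds: r^2 = 0.604

# Made-up data (not from the text) showing the two calculations agree.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r^2 from r:                 {r ** 2:.3f}")
print(f"fraction explained by line: {fraction_explained(xs, ys):.3f}")
```

For the made-up data both of the last two lines print 0.600, which illustrates why squaring r gives the proportion of variation accounted for by the line.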
Example 2

You are the manager of a large department store. Over the
years, you’ve found a reasonably strong positive correlation
between your September sales and the number of employees
you’ll need to hire for peak efficiency during the holiday season.
The correlation coefficient is 0.950. This year your September
sales are fairly strong. Should you start advertising for help
based on the best-fit line?
r² ≈ 0.903, which means that about 90% of the variation in the number of
peak employees can be accounted for by a linear relationship with
September sales, leaving only about 10% unaccounted for
Because 90% is so high, it is a good idea to predict the number of
employees you’ll need using the best-fit line

Multiple Regression

Multiple regression is a technique that allows
us to find a best-fit equation relating one
variable to more than one other variable


Ex. ~ Price of diamonds in relation to carat, cut,
clarity, and color
The coefficient of determination (R²) is the
most common measure of fit in a multiple regression

This tells us how much of the scatter in the data is
accounted for by the best-fit equation


If R² is close to 1, the best-fit equation should be very
useful for making predictions within the range of the data
If R² is close to 0, the predictions are essentially useless
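
For readers curious about what a multiple regression computes, the following Python sketch fits a best-fit equation with two predictors and reports R². It is a minimal illustration, not the method of any particular calculator or software package: it assumes an ordinary least-squares fit solved through the normal equations, the function names are hypothetical, and the data are made up (they are not diamond data from the text).

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Return ([intercept, b1, b2, ...], R^2) for a least-squares fit."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    coeffs = solve_linear_system(xtx, xty)
    preds = [sum(c * v for c, v in zip(coeffs, d)) for d in design]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return coeffs, 1 - ss_res / ss_tot


# Made-up data: each row holds two predictor values for one observation.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9.2, 7.9, 19.1, 18.2, 25.8]
coeffs, r_squared = fit_multiple_regression(rows, ys)
print("intercept and coefficients:", [round(c, 2) for c in coeffs])
print("R^2:", round(r_squared, 3))
```

An R² near 1 for these made-up values would suggest the equation is useful within the range of the data, subject to the same cautions listed earlier for single best-fit lines.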