Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the best-fitting line drawn through the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
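The slope interpretation above can be checked numerically. The following is a minimal sketch in Python, assuming the regression equation p=0.1264t+0.2229 from the text; the names SLOPE, INTERCEPT, and predicted_price are illustrative and not part of the original material.

```python
# Regression line from the text: p = 0.1264t + 0.2229 (p in millions of dollars).
# SLOPE, INTERCEPT and predicted_price are illustrative names (assumptions).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Increasing t by one year changes the prediction by the slope,
# no matter which starting year we pick.
for t in (0, 3, 7.5):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t={t}: one more year adds about ${change * 1_000_000:,.0f}")
```

Each line of output should report an increase of about $126,400, which is the slope converted from millions of dollars to dollars.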
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
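The same substitution can be written as a small function that takes a calendar year. This is only a sketch under the assumptions stated in the text (t=0 is 1994, p is in millions of dollars); the function name predict_price_for_year and its parameters are hypothetical.

```python
def predict_price_for_year(year, base_year=1994, slope=0.1264, intercept=0.2229):
    """Predicted average price (millions of $) from p = slope * t + intercept.

    t is the number of years since base_year; the default values are the
    ones quoted in the text, and the function name is illustrative.
    """
    t = year - base_year
    return slope * t + intercept


price_2008 = predict_price_for_year(2008)  # t = 14
print(f"Predicted 2008 price: {price_2008:.4f} million dollars")
```

Because 2008 lies outside the 1994 to 2004 data range, this value is an extrapolation and is only meaningful if the trend continued.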
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300, or roughly 3%.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line, price p (in millions of $) against time t (in years since 1994)]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value given by the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
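To make the idea concrete, the residual at each of the six data points can be computed directly. This is a minimal sketch using the data points and regression equation given in the text; the sign convention (actual value minus predicted value) is an assumption, chosen because it is the common one, and the variable names are illustrative.

```python
# Regression line and data points as quoted in the text.
SLOPE, INTERCEPT = 0.1264, 0.2229
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Assumed convention: residual = actual p minus the p predicted by the line.
# A positive residual means the point lies above the line, negative means below.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in points]

for (t, p), e in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={e:+.4f}")
```

The absolute value of each residual is the vertical distance between the point and the line.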
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
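For readers who are curious, here is a short sketch of the standard least-squares formulas for a line, applied to the six data points from the text. It is not the course's Excel procedure, and the function name least_squares_line is illustrative.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals.

    Uses the standard closed-form formulas:
        slope     = sum((t - mean_t) * (p - mean_p)) / sum((t - mean_t) ** 2)
        intercept = mean_p - slope * mean_t
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


# Data points (t, p) from the text: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, this reproduces the regression equation p = 0.1264t + 0.2229 shown on the graph.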
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
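As an illustration, the correlation coefficient for the apartment data can be computed with the usual Pearson formula. This is a sketch, not the method used to produce the graph; the function name correlation_coefficient is an assumption.

```python
import math


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data points (t, p) from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation_coefficient(points)
print(f"r = {r:.3f}")

# Points lying exactly on a downward-sloping line give r = -1.
print(correlation_coefficient([(0, 3), (1, 2), (2, 1)]))
```

For the apartment data this gives r of about 0.973, which agrees with the value r = 0.9734 on the graph up to rounding in the last digit.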
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
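One common way to define the r-squared value of a fitted line is 1 minus the ratio of the sum of squared residuals to the total sum of squares. The sketch below applies that definition to the apartment data and the regression equation from the text; treating this as the definition in use is an assumption, and the function name r_squared is illustrative.

```python
def r_squared(points, slope, intercept):
    """R^2 = 1 - SS_res / SS_tot for the line y = slope * x + intercept."""
    mean_y = sum(y for _, y in points) / len(points)
    # Sum of squared residuals: how far the points are from the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Total sum of squares: how far the points are from their own mean.
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# Data points (t, p) and regression equation from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
value = r_squared(points, slope=0.1264, intercept=0.2229)
print(f"R^2 = {value:.3f}")
```

The result is about 0.948, which is the square of r (about 0.973), so it is close to 1 as expected for data that follow a fairly linear pattern.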
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 2
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p = 0.1264(14) + 0.2229 = 1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 (t=6) the
equation predicts a price of p = 0.1264(6) + 0.2229 = 0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by about $31,300, which is pretty close.
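The two substitutions above can be reproduced with a short Python sketch. The function name `predict_price` and the variable names are illustrative assumptions; the coefficients 0.1264 and 0.2229 and the table price of $950,000 for 2000 are taken from the slides.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(year):
    """Return the model's predicted price (millions of $) for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)  # a year beyond the table (t = 14)
price_2000 = predict_price(2000)  # a year that appears in the table (t = 6)
actual_2000 = 0.95                # table value for 2000, in millions of $

print(f"2008 prediction: ${price_2008 * 1e6:,.0f}")
print(f"2000 prediction: ${price_2000 * 1e6:,.0f}")
print(f"2000 prediction minus actual: ${(price_2000 - actual_2000) * 1e6:,.0f}")
```

Converting the calendar year to t inside the function keeps the "t=0 is 1994" convention in one place, so the same function works for years inside and outside the table.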
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
gives at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
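As a rough illustration, the residual at each of the six table points can be computed in Python. This is only a sketch: the function name `residual` is made up for the example, and the sign convention (actual price minus the price on the line, so a positive residual means the point lies above the line) is an assumption, since the slides only describe the vertical distance.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264      # from the regression equation p = 0.1264t + 0.2229
INTERCEPT = 0.2229


def residual(t, p):
    """Actual price minus the price on the regression line at the same t."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]

for (t, p), e in zip(points, residuals):
    side = "above" if e > 0 else "below"
    print(f"t={t:2d}: actual={p:.2f}, residual={e:+.4f} ({side} the line)")
```

Some residuals come out positive and some negative, which matches the picture: some points sit above the line and some sit below it.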
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
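For readers curious about what a least-squares fit computes, here is a minimal Python sketch of the standard closed-form formulas for a line, applied to the six data points from the table. The function name `least_squares_line` is an illustrative assumption, and the formulas in the course PDF may be written in a different but equivalent form.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the regression equation p = 0.1264t + 0.2229 shown on the graph.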
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
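To see how r and r squared relate, the sketch below computes both for the six apartment-price points using the usual formula for the correlation coefficient. The function name `correlation` is an assumption made for this example; for a straight-line fit like this one, squaring r gives the r-squared value.

```python
import math


def correlation(points):
    """Return the correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)
r_squared = r ** 2
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
```

The computed r is close to the value r = 0.9734 shown on the graph. Because r is positive and near 1, both r and r squared point to a good fit with an upward-sloping line.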
We will now look at examples of what data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a straight line.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline shown on the plot is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline shown on the plot is y = -0.183x + 8.3267, with R² = 0.0084.
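The same contrast can be sketched in Python with made-up numbers. The two data sets below are hypothetical and are not the points plotted on the slides, which are not listed; the function name `r_squared` is likewise an assumption for this example.

```python
def r_squared(points):
    """Return r squared (the square of the correlation coefficient) for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data (NOT the points from the slides' plots).
xs = list(range(1, 11))
# Nearly linear: y = 0.5x + 2 with a small alternating wobble of +/- 0.1.
nearly_linear = [(x, 0.5 * x + 2 + (0.1 if x % 2 else -0.1)) for x in xs]
# Scattered: y-values jump around with no upward or downward trend.
scattered = list(zip(xs, [5, 1, 9, 3, 7, 7, 3, 9, 1, 5]))

r2_linear = r_squared(nearly_linear)
r2_scattered = r_squared(scattered)
print(f"nearly linear: r squared = {r2_linear:.4f}")
print(f"scattered:     r squared = {r2_scattered:.4f}")
```

The nearly linear set gives an r-squared value close to 1, while the scattered set gives a value close to 0, mirroring the two plots.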
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 4
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
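The six points can also be written out as a small table. The following is a minimal Python sketch, not part of the original presentation; the names `points` and `is_increasing` are chosen here for illustration. It recovers the calendar year from t and checks the observation that p increases as t increases.

```python
# The six (t, p) points from the slide:
# t = years since 1994, p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Print a small table, recovering the calendar year from t.
print(f"{'year':>6} {'t':>3} {'p ($M)':>8}")
for t, p in points:
    print(f"{1994 + t:>6} {t:>3} {p:>8.2f}")

# Check that each price is higher than the one before it.
prices = [p for _, p in points]
is_increasing = all(a < b for a, b in zip(prices, prices[1:]))
print("p increases with t:", is_increasing)
```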
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the line of best fit drawn through the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
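The values shown on the graph can also be reproduced outside Excel. The following is a minimal Python sketch, not the course's method: it assumes the standard least-squares formulas for the slope and intercept and the usual Pearson formula for r, and the variable names (`s_tt`, `s_tp`, `s_pp`) are chosen here for illustration.

```python
from math import sqrt

# Data from the slide: t = years since 1994, p = price in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squared deviations and of cross-products about the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                    # least-squares slope
intercept = p_mean - slope * t_mean    # least-squares intercept
r = s_tp / sqrt(s_tt * s_pp)           # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept should round to the 0.1264 and 0.2229 shown on the graph. The computed r is about 0.973, so its last printed digit may differ slightly from the slide's 0.9734 because of rounding.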
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is (change in y)/(change in x).
In this case we are using p and t, so it’s (change in p)/(change in t).
So for our problem, we have (change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
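The "per year" reading of the slope can be checked directly. The following is a small Python sketch; the function name `predict` is chosen here for illustration and simply wraps the regression equation from the slide. It shows that each one-year step in t changes the predicted price by the same amount.

```python
def predict(t):
    """Regression equation from the slide; returns price in millions of dollars."""
    return 0.1264 * t + 0.2229

# For a line, the change over any one-year step equals the slope.
for t in range(0, 4):
    change = predict(t + 1) - predict(t)
    print(f"t = {t} -> {t + 1}: predicted change = ${change * 1_000_000:,.0f}")
```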
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
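The two calculations above can be repeated for every year in the table. The following is a minimal Python sketch; `predict` and `actual` are names chosen here for illustration, and the data are the six points given earlier.

```python
def predict(t):
    """Regression equation from the slide; returns price in millions of dollars."""
    return 0.1264 * t + 0.2229

# Actual prices from the table, keyed by t (years since 1994).
actual = {0: 0.38, 2: 0.40, 4: 0.60, 6: 0.95, 8: 1.20, 10: 1.60}

# Prediction outside the data range: 2008 corresponds to t = 14.
p_2008 = predict(2008 - 1994)
print(f"2008 (t = 14): predicted ${p_2008 * 1_000_000:,.0f}")

# Compare the model with the actual price for each year in the table.
for t, p_actual in actual.items():
    p_model = predict(t)
    print(f"{1994 + t}: actual {p_actual:.4f}, model {p_model:.4f}, "
          f"difference {p_actual - p_model:+.4f}")
```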
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
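The residuals for the apartment data can be listed point by point. The following is a minimal Python sketch; it assumes the common sign convention of actual value minus predicted value (the slide only describes the vertical distance), and the names `predict` and `residuals` are chosen here for illustration.

```python
# Data from the slide: (t, p) with p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predict(t):
    """Regression equation from the slide."""
    return 0.1264 * t + 0.2229

# Residual = actual p minus the p predicted by the line (assumed sign convention).
residuals = [p - predict(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:>2}: residual = {res:+.4f} (point is {side} the line)")
```

Some residuals are positive and some are negative, which is why the points fall on both sides of the line.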
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
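The idea of "best fit" can be illustrated numerically. Least squares works with the squares of the residuals, so positive and negative residuals cannot cancel each other. The following is a minimal Python sketch that compares the slide's line with a few other candidate lines; the alternative lines are hypothetical and chosen here only for comparison.

```python
# Data from the slide: (t, p) with p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)

# The regression line from the slide.
best = sum_squared_residuals(0.1264, 0.2229)
print(f"regression line: {best:.4f}")

# Hypothetical alternative lines (slope, intercept), for comparison only.
alternatives = {
    "steeper": (0.14, 0.2229),
    "flatter": (0.11, 0.2229),
    "shifted up": (0.1264, 0.30),
    "through first and last points": (0.122, 0.38),
}
for name, (m, b) in alternatives.items():
    print(f"{name}: {sum_squared_residuals(m, b):.4f}")
```

Each alternative line gives a larger sum of squared residuals than the regression line, which is what makes the regression line the best fit in the least-squares sense.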
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor fit
for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
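The same contrast can be reproduced numerically. The following is a minimal Python sketch, not Excel's implementation: it assumes the usual definition of r-squared for a least-squares line, 1 minus (sum of squared residuals) divided by (total sum of squares about the mean). The function name `r_squared` is chosen here for illustration, and the scattered data set is invented for this example, since the slide's points appear only in a plot.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    # Squared residuals about the fitted line, and squared deviations about the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Apartment data from the slide: close to linear.
apartment_t = [0, 2, 4, 6, 8, 10]
apartment_p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

# Hypothetical scattered data with no linear trend (invented for illustration).
scattered_x = [1, 2, 3, 4, 5]
scattered_y = [2, 5, 1, 5, 2]

print(f"apartment data: R^2 = {r_squared(apartment_t, apartment_p):.4f}")
print(f"scattered data: R^2 = {r_squared(scattered_x, scattered_y):.4f}")
```

For the apartment data the result is about 0.948, which agrees with squaring the correlation coefficient r = 0.9734 from the graph. For the invented scattered data the result is 0.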
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 5
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the six data points with the regression line. Vertical axis: price p in millions of $ (0 to 1.8); horizontal axis: time t in years since 1994 (0 to 12).]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
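As a rough illustration, the residual at each data point can be computed by subtracting the price the line predicts from the actual price. The sign convention (actual minus predicted) is an assumption of this sketch, and the points are the six (t, p) pairs from the table.

```python
SLOPE, INTERCEPT = 0.1264, 0.2229  # from p = 0.1264t + 0.2229

# (t, price in millions of $) for 1994, 1996, ..., 2004
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residual(t, p):
    """Actual price minus the price the line predicts (assumed sign convention)."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]
for (t, p), e in zip(points, residuals):
    print(f"t = {t:2d}: actual {p:.2f}, residual {e:+.4f}")

# Positive and negative residuals largely cancel, so their plain sum says
# little about the fit; the sum of their squares does not cancel.
print(f"sum of residuals: {sum(residuals):+.4f}")
print(f"sum of squared residuals: {sum(e * e for e in residuals):.4f}")
```

Points above the line have positive residuals and points below it have negative ones, which is why the residuals are squared before being added up when the fit is judged.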
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
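For readers curious about what happens behind the scenes, here is a minimal sketch of the standard least-squares formulas for a line, applied to the six (t, p) points from the table. It is not required for the course and is not taken from the course PDF; the variable names are chosen for this example only.

```python
# (t, price in millions of $) for 1994, 1996, ..., 2004
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# How t and p vary together, and how t varies on its own.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)

slope = s_tp / s_tt                  # minimizes the sum of squared residuals
intercept = mean_p - slope * mean_t  # forces the line through (mean_t, mean_p)

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept match the equation shown on the chart, p = 0.1264t + 0.2229.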
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
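As an illustration, the correlation coefficient for the apartment data can be computed with the usual formula for r. This is a sketch with variable names chosen for this example; the points are the six (t, p) pairs from the table.

```python
import math

# (t, price in millions of $) for 1994, 1996, ..., 2004
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# r is positive when p tends to rise with t and negative when it tends to fall.
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.4f}")
```

The result is about 0.973, which agrees with the r = 0.9734 shown on the chart up to rounding in the last digit. Because r is close to 1, the data follow an upward-sloping line closely.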
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
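One common way to compute r squared is to compare the squared residuals around the regression line with the squared deviations around the average price. The sketch below does this for the six (t, p) points from the table; the names `ss_res` and `ss_tot` are conventional labels used here for illustration, not terms from the course materials.

```python
# (t, price in millions of $) for 1994, 1996, ..., 2004
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares line through the points.
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t

# Squared residuals around the line versus squared deviations around the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
print(f"r squared = {r_squared:.4f}")
```

For a least-squares line like this one, the result equals the square of the correlation coefficient r. Here it comes out to roughly 0.95, so the line accounts for most of the variation in price.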
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points. Horizontal axis from 0 to 12; vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points. Horizontal axis from 0 to 12; vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 6
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Vertical axis: price p in millions of $ (0 to 1.8); horizontal axis: time t in years since 1994 (0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the line of best fit drawn through the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $0.1264 = \frac{0.1264}{1}$.
Recall that the definition of slope is $m = \frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $m = \frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
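This reading of the slope can be checked numerically: under the model, the predicted price should rise by the same amount every year. The sketch below uses a made-up helper name, `predict_price`, and the coefficients from p = 0.1264t + 0.2229.

```python
SLOPE, INTERCEPT = 0.1264, 0.2229  # from p = 0.1264t + 0.2229


def predict_price(t):
    """Predicted price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Year-over-year change in the predicted price for 1994 through 2004.
increases = [predict_price(t + 1) - predict_price(t) for t in range(10)]
for t, change in enumerate(increases):
    print(f"{1994 + t} to {1995 + t}: +{change:.4f} million")
```

Every yearly increase equals the slope, 0.1264 million dollars, which is what it means for the model to be linear.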
Slide 7
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
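The two substitutions above can be checked with a few lines of Python. This is only a sketch: the function name predict_price and the variable names are made up for illustration, and the slope and intercept are the rounded values from the regression equation.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars at t = 0 (the year 1994)


def predict_price(year):
    """Return the predicted price in millions of dollars for a calendar year."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


p_2008 = predict_price(2008)  # t = 14, outside the range of the table
p_2000 = predict_price(2000)  # t = 6, a year that is in the table
actual_2000 = 0.95            # table value for 2000, in millions of dollars

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${(p_2000 - actual_2000) * 1_000_000:,.0f}")
```

The gap between the 2000 prediction and the table value is what lets us say the model is "pretty close" for that year.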
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
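To make the idea concrete, here is a small Python sketch that computes the residual for each of our six data points. It assumes the rounded regression equation p = 0.1264t + 0.2229 and takes each residual as the actual price minus the predicted price (a common sign convention, chosen here for illustration). The variable names are made up.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Rounded coefficients from the regression equation.
SLOPE = 0.1264
INTERCEPT = 0.2229

residuals = []
for t, actual in points:
    predicted = SLOPE * t + INTERCEPT
    residual = actual - predicted  # positive means the point lies above the line
    residuals.append(residual)
    print(f"t = {t:2d}  actual = {actual:.4f}  "
          f"predicted = {predicted:.4f}  residual = {residual:+.4f}")
```

Some of the residuals come out positive and some negative, because some points lie above the line and some lie below it.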
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
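For the curious, the least-squares calculation for our apartment data can be sketched in Python. The formulas used below are the standard textbook ones for a line of best fit, which may not be written the same way in the course PDF. The names s_tp, s_tt, and sum_squared_residuals are made up for this illustration.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Standard least-squares formulas for a line p = slope * t + intercept.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)

slope = s_tp / s_tt
intercept = mean_p - slope * mean_t


def sum_squared_residuals(m, b):
    """Sum of the squared vertical distances between the points and p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in points)


print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"best-fit line:  {sum_squared_residuals(slope, intercept):.4f}")
print(f"a steeper line: {sum_squared_residuals(slope + 0.02, intercept):.4f}")
```

The slope and intercept should agree with the regression equation p = 0.1264t + 0.2229 after rounding, and any other line, such as the steeper one tried here, gives a larger sum of squared residuals.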
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
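As a rough illustration, the correlation coefficient for our apartment data can be computed in Python with the standard formula for r. You are not expected to do this by hand in the course, and the variable names below are made up.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Standard formula for the correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r:.4f}")
```

The result should come out at about 0.973, in line with the value of r shown on the graph. It is close to 1, which matches the upward-sloping pattern we saw in the scatter plot.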
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
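Here is a short Python sketch of the r-squared value for our apartment data, computed two ways: by squaring r, and by comparing the squared residuals with the total variation in the prices. Both use standard formulas. How Excel arrives at its value internally is not something this presentation covers, so treat this only as an illustration with made-up variable names.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Least-squares line for these points.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Way 1: square the correlation coefficient.
r = s_tp / (s_tt * s_pp) ** 0.5
r_squared_from_r = r ** 2

# Way 2: one minus (sum of squared residuals) / (total variation in p).
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
r_squared = 1 - ss_res / s_pp

print(f"r squared (from r)         = {r_squared_from_r:.4f}")
print(f"r squared (from residuals) = {r_squared:.4f}")
```

Both calculations give the same number for a least-squares line, roughly 0.95 for this data, so we would say the linear fit is very good.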
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a line, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 8
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: data points and regression line. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular data
point and the regression line: the actual value minus the value
that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
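As a rough illustration, the residual at every data point can be computed directly. The sketch below assumes the six (t, p) points and the rounded regression equation p = 0.1264t + 0.2229 from this presentation, and it assumes the usual sign convention of actual value minus predicted value. The function name residual is made up for this example.

```python
# Data points (t, p) from the presentation: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Rounded coefficients of the regression line quoted in the presentation.
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Actual price minus the price the regression line gives at time t."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]

for (t, p), e in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.4f}  residual={e:+.4f}")
```

A positive residual means the data point sits above the line, and a negative residual means it sits below the line.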
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
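For the curious, here is a minimal sketch of the least-squares calculation in Python. The PDF is not reproduced here, so this uses the standard textbook closed-form formulas, which is an assumption about what the PDF contains. The function names least_squares_line and sum_squared_residuals are invented for this example, and the data are the six (t, p) points from this presentation.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(data):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(data, slope, intercept):
    """Add up the squared vertical gaps between the points and a given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


slope, intercept = least_squares_line(points)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")

# Any other line, for example one with a slightly steeper slope, does worse.
best = sum_squared_residuals(points, slope, intercept)
nudged = sum_squared_residuals(points, slope + 0.01, intercept)
print(best < nudged)
```

The last two lines compare the fitted line with a slightly different one, to show that changing the line makes the sum of squared residuals larger.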
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit, so
there is little or no linear relationship between the
variables in this case.
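A small sketch of how r can be computed, assuming the standard (Pearson) formula for the correlation coefficient and the six (t, p) points from this presentation. The function name correlation is made up for this example.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Correlation coefficient r of a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    s_yy = sum((y - mean_y) ** 2 for _, y in data)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(points)
print(f"r = {r:.3f}")

# Flipping the sign of every price turns a rising pattern into a falling one,
# so r changes sign but keeps the same size.
r_falling = correlation([(t, -p) for t, p in points])
print(f"r for the mirrored data = {r_falling:.3f}")
```

For the apartment data this gives a value of about 0.973, in line with the r shown on the graph, and the mirrored data give the same value with a negative sign.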
The trendline in Excel does not show the value of r; instead, it
shows the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
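One common way to compute r squared is as 1 minus the ratio of the squared residuals to the total squared variation in the data. The sketch below assumes that definition and uses the six (t, p) points from this presentation. The function name r_squared is invented here, and this is not necessarily how Excel computes the value internally.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def r_squared(data):
    """R squared of the least-squares line through a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in data)  # left unexplained
    ss_tot = sum((y - mean_y) ** 2 for _, y in data)                   # total variation
    return 1 - ss_res / ss_tot


r2_rising = r_squared(points)
r2_falling = r_squared([(t, -p) for t, p in points])  # same data, mirrored downward

print(f"r squared (rising data)  = {r2_rising:.4f}")
print(f"r squared (falling data) = {r2_falling:.4f}")
```

Both data sets give the same r-squared value, which illustrates that r squared measures how good the linear fit is but not the direction of the slope.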
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 9
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: scatter plot with the line of best fit. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot with regression line p = 0.1264t + 0.2229 and r = 0.9734. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
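The "per year" reading of the slope can be checked numerically. The sketch below assumes the rounded regression equation p = 0.1264t + 0.2229 from this presentation; the function name model_price is made up for the example.

```python
# Rounded regression equation from the presentation, p in millions of $.
SLOPE = 0.1264
INTERCEPT = 0.2229


def model_price(t):
    """Modelled price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Whatever year we start from, moving forward one year changes the
# modelled price by the same amount: the slope.
yearly_changes = [model_price(t + 1) - model_price(t) for t in (0, 5, 9)]

for t, change in zip((0, 5, 9), yearly_changes):
    print(f"from t={t} to t={t + 1}: change of ${change * 1_000_000:,.0f}")
```

The change is the same for every starting year because the model is a line, so its rate of change is constant.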
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Slide 10
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot with the fitted line: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
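If you would like to check the equation outside Excel, here is a minimal Python sketch that fits a line to our six data points using the standard least-squares formulas. The function name fit_line and the variable names are our own choices for illustration; they do not come from the course materials.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation shown on the graph, p = 0.1264t + 0.2229.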
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994
and 2008 - 1994 = 14).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only about $31,300.
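The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function name predict_price is our own, and the coefficients are the rounded values from the regression equation.

```python
def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Forecast for 2008 (t = 14), which lies outside the range of the data.
price_2008 = predict_price(14)
print(f"2008: ${price_2008 * 1_000_000:,.0f}")

# Check against a year in the table: 2000 (t = 6), actual price $950,000.
price_2000 = predict_price(6)
actual_2000 = 0.95
difference = price_2000 - actual_2000
print(f"2000: predicted ${price_2000 * 1_000_000:,.0f}, "
      f"difference ${difference * 1_000_000:,.0f}")
```

Changing the argument of predict_price lets you try other years, but remember that the further t is from the original data, the less reliable the forecast is likely to be.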
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the regression line: price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Scatter plot with the regression line: price p in millions of $ against time t in years since 1994]
A residual is the difference between the actual value at a
data point and the value the regression line predicts there.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
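To make this concrete, here is a small Python sketch that computes the residual at each of our six data points. It assumes the usual sign convention (actual value minus predicted value), so a positive residual means the point lies above the line. The name predict_price is ours, for illustration only.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predict_price(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value predicted by the line.
residuals = [p - predict_price(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    print(f"t = {t:2d}: actual {p:.4f}, "
          f"predicted {predict_price(t):.4f}, residual {res:+.4f}")
```

The residual at t=0 is about 0.1571, which is the $380,000 actual price minus the $222,900 predicted price, in millions of dollars.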
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
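For the curious, the standard least-squares formulas are sketched below. The notation is our own and may differ from the PDF: m is the slope, b is the p-intercept, the n data points are written (t_i, p_i), and a bar over a letter means the average of those values. The method chooses m and b to make the sum S of the squared residuals as small as possible.

```latex
\[
S(m, b) = \sum_{i=1}^{n} \bigl(p_i - (m t_i + b)\bigr)^2
\]
\[
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}
         {\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
\]
```

For our six data points, the average of the t-values is 5 and the average of the p-values is 0.855. The numerator of m works out to 8.85 and the denominator to 70, so m = 8.85/70, or about 0.1264, and b = 0.855 - 5m, or about 0.2229.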
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
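As an illustration, the following Python sketch computes r for our apartment data using the standard formula for the (Pearson) correlation coefficient. The function name correlation is our own choice, not something from the course materials.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Return the correlation coefficient r of the paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the averages.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(t_values, p_values)
print(f"r = {r:.3f}")
```

The result should be about 0.973, in line with the value shown on the graph (the last digit may differ slightly because of rounding). Since this is very close to 1, the line is a very good fit with a positive slope.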
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but
because it is a square, it will only be between 0 and 1, and
it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
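As a quick check, we can square the value of r from our apartment example to see what r-squared value to expect. This is only a sketch using the rounded value r = 0.9734 from the graph; the variable names are ours.

```python
# Correlation coefficient from the apartment example (rounded value from the graph).
r = 0.9734
r_squared = r ** 2
print(f"r squared = {r_squared:.4f}")

# A negatively sloped fit of the same strength has the same r squared,
# so r squared alone does not show the direction of the trend.
r_negative = -0.9734
same_r_squared = (r_negative ** 2 == r_squared)
print(same_r_squared)
```

Here r squared comes out to roughly 0.9475, which is still close to 1, so both numbers point to a very good linear fit.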
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot with trendline, labeled y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with trendline, labeled y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 11
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot with the regression line; axes: Time t in years since 1994, Price p in millions of $]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
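To make the question "Why not some other line?" concrete, here is a minimal Python sketch that compares two lines by the sum of their squared residuals, which is the quantity the least-squares method makes as small as possible. The data points and the regression line are the ones from this presentation; the second line (drawn through the first and last data points) and the function name are made up for comparison only.

```python
# Data points (t, p) from the apartment-price example:
# t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept, data):
    """Add up the squared vertical distances between each point and the line."""
    total = 0.0
    for t, p in data:
        predicted = slope * t + intercept
        residual = p - predicted  # positive if the point lies above the line
        total += residual ** 2
    return total


# The regression line from the presentation: p = 0.1264t + 0.2229.
ssr_regression = sum_squared_residuals(0.1264, 0.2229, points)

# A hypothetical alternative: the line through the first and last data points.
other_slope = (1.60 - 0.38) / (10 - 0)
ssr_other = sum_squared_residuals(other_slope, 0.38, points)

print(f"regression line: {ssr_regression:.4f}")
print(f"other line:      {ssr_other:.4f}")
```

The total for the regression line comes out smaller than the total for the other line, which is what "best fit" means here.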
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing (although
not whether the slope is positive or negative), but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
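As a rough illustration of how r and r squared relate, the following Python sketch computes both for the six apartment-price data points. It uses the standard textbook formula for the correlation coefficient, which this presentation does not show, so treat it as background rather than as the method used in the course (the course uses Excel). The function name is made up for this example.

```python
import math

# Data points (t, p) from the apartment-price example.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Correlation coefficient r for a list of (t, p) pairs."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    s_pp = sum((p - mean_p) ** 2 for _, p in data)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(points)
r_squared = r ** 2  # always between 0 and 1, even when r is negative

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```

The value of r should land very close to the r = 0.9734 shown on the graph; the last digit may differ slightly because of rounding.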
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 12
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; axes: Time t in years since 1994, Price p in millions of $]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: scatter plot of the data points with the fitted line; axes: Time t in years since 1994, Price p in millions of $]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; axes: Time t in years since 1994, Price p in millions of $]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
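For readers curious where the numbers on the graph come from, here is a small Python sketch that fits a line to the six data points. It relies on the standard least-squares formulas for slope and intercept, which are not shown in this presentation, and the function name is made up for this example; in the course itself, Excel does this work.

```python
# Data points (t, p) from the table: t in years since 1994,
# p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through data."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Slope: how t and p vary together, divided by how much t varies.
    numerator = sum((t - mean_t) * (p - mean_p) for t, p in data)
    denominator = sum((t - mean_t) ** 2 for t, _ in data)
    slope = numerator / denominator
    # The fitted line always passes through the point (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the equation p = 0.1264t + 0.2229 shown on the graph.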
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the same scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close (off by about $31,300).
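The two calculations above can be repeated for any year with a short Python sketch. The regression equation and the $950,000 figure come from this presentation; the function and variable names are made up for illustration.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predicted_price(year):
    """Predicted average price in dollars for a given calendar year."""
    t = year - BASE_YEAR               # convert the year to t
    p_millions = 0.1264 * t + 0.2229   # regression equation, p in millions
    return p_millions * 1_000_000


forecast_2008 = predicted_price(2008)  # t = 14, outside the data range
model_2000 = predicted_price(2000)     # t = 6, inside the data range
actual_2000 = 950_000                  # actual price from the table

print(f"2008 forecast: ${forecast_2008:,.0f}")
print(f"2000 model:    ${model_2000:,.0f}")
print(f"2000 actual:   ${actual_2000:,}")
print(f"difference:    ${model_2000 - actual_2000:,.0f}")
```

Converting the year to t inside the function avoids the easy mistake of plugging 2008 in for t directly.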
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; axes: Time t in years since 1994, Price p in millions of $]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot with the regression line; axes: Time t in years since 1994, Price p in millions of $]
A residual is the difference between the actual value at a
particular data point and the value predicted by the
regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot with the regression line; axes: Time t in years since 1994, Price p in millions of $]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
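To see the residuals as numbers rather than as distances on the graph, this Python sketch lists the residual at each of the six data points, using the regression equation p = 0.1264t + 0.2229 from this presentation. The function name and the sign convention (actual value minus predicted value) are choices made for this example.

```python
# Data points (t, p) from the apartment-price example.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residual(t, p, slope=0.1264, intercept=0.2229):
    """Actual value of p minus the value predicted by the regression line."""
    return p - (slope * t + intercept)


residuals = [residual(t, p) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, residual {res:+.4f} ({side} the line)")

print(f"sum of residuals: {sum(residuals):+.4f}")
```

Some points lie above the line and some below, so the positive and negative residuals nearly cancel when they are simply added together. That is why the least-squares method works with the squares of the residuals instead.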
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
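For reference only, since the course does not require them, the least-squares formulas for a line p = mt + b through n data points are usually written as shown below. The symbols m, b, and the bars for averages are notation chosen for this note and do not come from the presentation.

```latex
\[
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}
         {\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
\]
```

Here \( \bar{t} \) and \( \bar{p} \) are the averages of the t-values and p-values. This choice of m and b makes the sum of the squared residuals, \( \sum_{i=1}^{n} \bigl(p_i - (m t_i + b)\bigr)^2 \), as small as possible.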
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing (although
not whether the slope is positive or negative), but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 13
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit drawn over the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
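The equation shown on the graph can also be reproduced outside Excel. The following is a minimal sketch in Python, which is not part of the original presentation; the names `t_values`, `p_values`, and `fit_line` are illustrative assumptions, and the formulas used are the standard least-squares ones for a line.

```python
# Data from the example: t = years since 1994, p = price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph.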
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s $\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have $\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only $31,300, or about 3%.
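The two calculations above can be sketched in a few lines of Python. This is only an illustration and not part of the original presentation; the function name `predict_price` is an assumption, and the coefficients are the rounded ones from the regression equation.

```python
def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Forecast beyond the table: 2008 corresponds to t = 14.
predicted_2008 = predict_price(2008 - 1994)
print(f"2008 prediction: ${predicted_2008 * 1_000_000:,.0f}")

# Check against a year in the table: 2000 corresponds to t = 6.
predicted_2000 = predict_price(2000 - 1994)
actual_2000 = 0.95  # from the table, in millions of dollars
difference = predicted_2000 - actual_2000
print(f"2000 prediction: ${predicted_2000 * 1_000_000:,.0f}")
print(f"Model minus actual for 2000: ${difference * 1_000_000:,.0f}")
```

Writing the year as `2008 - 1994` rather than `14` keeps the connection to t=0 representing 1994 visible in the code.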
It is important to remember that the regression equation
is just a model, and it generally won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
A residual is the difference between the actual value of a
particular data point and the value that the regression line
predicts for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
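To make the idea concrete, the residual of every point in the apartment example can be listed. The sketch below is an illustration in Python and not part of the original presentation; it assumes the usual sign convention (actual value minus predicted value), and the name `predict_price` is hypothetical.

```python
# Data from the example: t = years since 1994, p = price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predict_price(t):
    """Value on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus the price on the line at the same t.
residuals = [p - predict_price(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    print(f"t={t:2d}  actual={p:.2f}  line={predict_price(t):.4f}  residual={res:+.4f}")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.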
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the least-squares formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
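Although the formulas are not required, a short comparison shows what "best fit" means in the least-squares sense. The Python sketch below is an illustration and not part of the original presentation; it compares the regression line with one hypothetical alternative, the line through the first and last data points, and the function name `sum_squared_residuals` is an assumption.

```python
# Data from the example: t = years since 1994, p = price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical gaps between the data and a given line."""
    return sum((p - (slope * t + intercept)) ** 2
               for t, p in zip(t_values, p_values))


# The regression line from the graph.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# A hypothetical alternative: the line through (0, 0.38) and (10, 1.60).
alt_slope = (1.60 - 0.38) / (10 - 0)
alt_intercept = 0.38
sse_alternative = sum_squared_residuals(alt_slope, alt_intercept)

print(f"Regression line:  {sse_regression:.4f}")
print(f"Alternative line: {sse_alternative:.4f}")
```

The regression line gives the smaller total, even though the alternative line passes exactly through two of the data points.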
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
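For the apartment data, both numbers can be computed directly. The sketch below is an illustration in Python and not part of the original presentation; it uses the standard formula for the correlation coefficient, and the function name `correlation` is an assumption.

```python
import math

# Data from the example: t = years since 1994, p = price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(t_values, p_values)
r_squared = r ** 2
print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```

The value of r comes out at about 0.973, in line with the r = 0.9734 shown on the graph (the last digit can differ depending on rounding), and r squared is about 0.948, which is close to 1 as expected for data with a strong linear pattern.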
Consider the following set of data points.
[Figure: scatter plot with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation of
a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 14
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between the actual value at a
particular data point and the value the regression line
gives at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the nearby part of the regression line.]
Zooming in, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
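To make the idea of a residual concrete, here is a minimal Python sketch that computes the residual for each of the six apartment-price data points, using the regression line p = 0.1264t + 0.2229 from the graph. The function names and the sign convention (actual minus predicted) are assumptions made for illustration; they are not part of the presentation.

```python
# Data from the presentation: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression line shown on the graph: p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Price (in millions of $) that the regression line gives for time t."""
    return SLOPE * t + INTERCEPT


def residual(t, p):
    """Actual price minus predicted price (sign convention is an assumption)."""
    return p - predicted_price(t)


residuals = [residual(t, p) for t, p in points]

# Print one row per data point so the vertical gaps can be compared.
for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted_price(t):.4f}  residual={res:+.4f}")
```

A positive residual means the data point sits above the line, and a negative one means it sits below. Some of these residuals are positive and some are negative, which is why the line passes between the points rather than through all of them.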
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
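For readers who are curious what the least-squares method does, the sketch below applies the standard textbook formulas to the six apartment-price data points. The PDF's own formulas are not shown in this presentation, so treating them as these textbook formulas is an assumption, and the function names are invented for illustration.

```python
# Data from the presentation: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical gaps between the points and a given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(points)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")

# Any other line, such as one with a slightly steeper slope, gives a larger total.
print(sum_squared_residuals(points, slope, intercept))
print(sum_squared_residuals(points, slope + 0.01, intercept))
```

The slope and intercept should agree with the equation p = 0.1264t + 0.2229 after rounding to four decimal places, and the last two lines show that nudging the line away from the least-squares answer makes the sum of squared residuals larger.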
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
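As an illustration, the correlation coefficient for the six apartment-price data points can be computed by hand with the usual Pearson formula. This is only a sketch: the function name is made up for this example, and the presentation itself obtains r from software rather than from this formula.

```python
import math

# Data from the presentation: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(points)
print(f"r = {r:.4f}")

# Points lying exactly on a rising or a falling line give r = 1 or r = -1.
print(correlation_coefficient([(0, 1), (1, 3), (2, 5)]))
print(correlation_coefficient([(0, 5), (1, 3), (2, 1)]))
```

The value for the apartment data should land very close to the r = 0.9734 shown on the graph; a difference in the last digit comes down to rounding.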
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us the same thing about how good
the fit is (though not whether the slope is positive or
negative), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
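One common way to compute r squared for a fitted line is to compare the leftover scatter around the line with the total scatter around the mean. The sketch below does this for the six apartment-price data points. Whether Excel uses exactly this calculation internally is an assumption, and the function name is illustrative.

```python
# Data from the presentation: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def r_squared(points):
    """r squared of the least-squares line through a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Fit the least-squares line first.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    # Scatter left over around the line versus total scatter around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


r2 = r_squared(points)
print(f"r squared = {r2:.4f}")
```

For a straight-line fit like this one, the result equals the square of the correlation coefficient, so for the apartment data it should be close to 0.9734 squared, or roughly 0.95.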
Consider the following set of data points.
[Figure: scatter plot with x from 0 to 12 and y from 0 to 8, shown first without and then with its trendline, y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with x from 0 to 12 and y from 0 to 20, shown first without and then with its trendline, y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s $\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have $\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
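The two calculations above can be reproduced with a few lines of Python. This is a small sketch under the presentation's own assumptions (t=0 is 1994, prices in millions of dollars); the helper names are invented for illustration.

```python
def year_to_t(year):
    """Convert a calendar year to t, where t=0 represents 1994."""
    return year - 1994


def predicted_price(t):
    """Regression equation from the presentation: p = 0.1264t + 0.2229 (millions of $)."""
    return 0.1264 * t + 0.2229


# Forecast beyond the data range: 2008 corresponds to t = 14.
forecast_2008 = predicted_price(year_to_t(2008))
print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")

# Check against a year in the table: 2000 corresponds to t = 6.
check_2000 = predicted_price(year_to_t(2000))
actual_2000 = 0.95  # actual price for 2000 from the table, in millions of $
gap_2000 = check_2000 - actual_2000
print(f"2000 model: ${check_2000 * 1_000_000:,.0f}, "
      f"actual: ${actual_2000 * 1_000_000:,.0f}, "
      f"gap: ${gap_2000 * 1_000_000:,.0f}")
```

For 2000 the model overshoots the actual price by about $31,300, which is small compared with a price near $1 million. The 2008 figure is a forecast outside the range of the data, so it holds only if the trend from 1994 to 2004 continued.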
Slide 16
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit drawn over the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
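For readers curious where the numbers on the graph come from, here is a minimal Python sketch that recomputes the slope, intercept, and correlation coefficient from the six data points. The function name fit_line and the variable names are our own choices for illustration, not part of the course materials, and the course itself only asks you to do this in Excel.

```python
from math import sqrt

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
T = [0, 2, 4, 6, 8, 10]
P = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(T, P)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept round to the 0.1264 and 0.2229 shown on the graph, and r comes out at about 0.973.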
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
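As a quick check of this interpretation, the short Python sketch below evaluates the regression equation at consecutive years and converts the slope into dollars. The names predict, SLOPE, and INTERCEPT are our own, introduced only for illustration.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars, the predicted price at t = 0


def predict(t):
    """Predicted average price (in millions of dollars) t years after 1994."""
    return SLOPE * t + INTERCEPT


# For a line, the change over any one-year step equals the slope.
yearly_changes = [predict(t + 1) - predict(t) for t in range(0, 10)]

# Convert the slope from millions of dollars to whole dollars.
slope_in_dollars = round(SLOPE * 1_000_000)
print(slope_in_dollars)
```

Every entry of yearly_changes equals the slope (up to floating-point rounding), which is exactly what "about $126,400 per year" means.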
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
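The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name predict and the variable names are our own, and the year-to-t conversion assumes t=0 is 1994, as stated earlier.

```python
def predict(t, slope=0.1264, intercept=0.2229):
    """Predicted average price (in millions of dollars) t years after 1994."""
    return slope * t + intercept


# Forecast outside the data range: 2008 corresponds to t = 14.
price_2008 = predict(2008 - 1994)

# Check against a year in the table: 2000 corresponds to t = 6.
price_2000 = predict(2000 - 1994)
actual_2000 = 0.95  # actual price from the table, in millions of dollars
error_2000 = price_2000 - actual_2000

print(f"2008 forecast: {price_2008:.4f} million")
print(f"2000 model:    {price_2000:.4f} million (actual {actual_2000:.2f})")
```

For 2000 the model is about $31,300 above the actual price, which is small compared with a price of nearly $1 million.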
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of p at a
particular data point and the value the regression line gives
for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: the scatter plot and regression line, with a box drawn around one data point]
Zooming into this box:
[Figure: close-up of the boxed data point and the nearby part of the line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
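To make the idea of a residual concrete, the following Python sketch computes one residual per data point using the regression equation found earlier. We assume the common sign convention residual = actual price minus predicted price, which the slides do not state; the names predict and residuals are our own.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
T = [0, 2, 4, 6, 8, 10]
P = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predict(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Assumed convention: positive when the point lies above the line,
# negative when it lies below.
residuals = [p - predict(t) for t, p in zip(T, P)]

for t, p, res in zip(T, P, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predict(t):.4f}  residual={res:+.4f}")
```

The 1994 point lies above the line (positive residual), while the 2000 point lies slightly below it (negative residual).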
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
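The least-squares method picks the line whose squared residuals add up to the smallest possible total. The Python sketch below illustrates this by comparing the regression line with a few other lines. The alternative lines are hypothetical, chosen by us only for comparison, and the function names are our own.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
T = [0, 2, 4, 6, 8, 10]
P = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Residuals (actual minus predicted) for the line p = slope*t + intercept."""
    return [p - (slope * t + intercept) for t, p in zip(T, P)]


def sum_of_squares(slope, intercept):
    """Sum of the squared residuals, the quantity least squares minimizes."""
    return sum(res ** 2 for res in residuals(slope, intercept))


regression_sse = sum_of_squares(0.1264, 0.2229)

# Hypothetical alternative lines, chosen only for comparison.
alternatives = {
    "through first and last points": (0.122, 0.38),
    "slightly steeper": (0.13, 0.20),
    "horizontal at the average price": (0.0, 0.855),
}
alternative_sse = {name: sum_of_squares(m, b) for name, (m, b) in alternatives.items()}

# The plain (unsquared) residuals of the horizontal line cancel out to about
# zero, even though that line clearly fits the data badly.
flat_plain_sum = sum(residuals(0.0, 0.855))

print(f"regression line: {regression_sse:.4f}")
for name, sse in alternative_sse.items():
    print(f"{name}: {sse:.4f}")
```

Each alternative gives a larger total than the regression line. The last alternative also shows why the residuals are squared first: positive and negative residuals can cancel, so a plain sum near zero does not mean a good fit.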
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
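The three cases can be illustrated with a short Python sketch. It uses the standard (Pearson) formula for r, which we assume is the one behind the value on the graph. The reversed prices and the third data set are hypothetical, made up only to show a negative r and an r of zero.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


T = [0, 2, 4, 6, 8, 10]
P = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_up = correlation(T, P)                     # prices rising over time
r_down = correlation(T, P[::-1])             # hypothetical: same prices, falling
r_none = correlation(T, [1, 2, 3, 3, 2, 1])  # hypothetical: no overall linear trend

print(f"rising prices:   r = {r_up:.3f}")
print(f"falling prices:  r = {r_down:.3f}")
print(f"no linear trend: r = {r_none:.3f}")
```

The rising data give r close to 1, the same prices in reverse order give r close to -1, and the third data set gives r equal to 0 because it rises and then falls by the same amount.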
When Excel graphs a regression line, it does not show the value of r;
instead, it shows the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
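For the apartment data, r squared can be computed in two ways that agree: by squaring r, or as 1 minus the ratio of the squared residuals to the total squared variation of the prices about their average. The Python sketch below shows both. The second description is a standard fact about a least-squares line rather than something stated in the slides, and all variable names are our own.

```python
from math import sqrt

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
T = [0, 2, 4, 6, 8, 10]
P = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(T)
mean_t = sum(T) / n
mean_p = sum(P) / n
stt = sum((t - mean_t) ** 2 for t in T)
spp = sum((p - mean_p) ** 2 for p in P)
stp = sum((t - mean_t) * (p - mean_p) for t, p in zip(T, P))

# Least-squares line and correlation coefficient.
slope = stp / stt
intercept = mean_p - slope * mean_t
r = stp / sqrt(stt * spp)

# r squared as 1 minus (sum of squared residuals) divided by
# (total squared variation of p about its average).
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(T, P))
r_squared = 1 - sse / spp

print(f"r squared from residuals: {r_squared:.4f}")
print(f"r times r:                {r * r:.4f}")
```

Both lines print the same value, about 0.95, which is close to 1 and matches the good linear fit seen on the graph.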
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 18
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994)    0     2     4     6     8     10
p (millions of $)       0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the same scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Figure: the scatter plot and regression line, labeled with p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
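If you would like to check the equation without Excel, the short Python sketch below recomputes it from the six data points. It uses the standard least-squares formulas for a line; the function name fit_line and the variable names are our own choices for illustration and are not part of the course materials.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through data."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, this should agree with the equation shown on the graph.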
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the same scatter plot with the regression line, its equation, and r.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
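To see this interpretation of the slope in action, here is a small Python sketch. It assumes the regression equation from the graph; the function name predicted_price and the sample years are made up for illustration. Whatever year we start from, moving one year forward changes the predicted price by the same amount.

```python
def predicted_price(t):
    """Price in millions of $ predicted by the regression equation from the slides."""
    return 0.1264 * t + 0.2229


# The change in predicted price over any one-year step equals the slope.
for year in (1994, 2000, 2003):
    t = year - 1994  # t = 0 represents 1994
    change = predicted_price(t + 1) - predicted_price(t)
    print(year, "->", year + 1, f"${change * 1_000_000:,.0f}")
```

Each line of output should show an increase of about $126,400, which is the slope expressed in dollars per year.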
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
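The same arithmetic can be sketched in Python. The dictionary below pairs each calendar year with the price from the table (using t=0 for 1994), and predict is a hypothetical helper name of our own; it is only a way to repeat the calculations above for every year at once.

```python
# Actual prices from the table, keyed by calendar year (millions of $).
actual = {1994: 0.38, 1996: 0.40, 1998: 0.60, 2000: 0.95, 2002: 1.20, 2004: 1.60}


def predict(year):
    """Predicted price in millions of $, using the regression equation from the slides."""
    t = year - 1994  # t = 0 represents 1994
    return 0.1264 * t + 0.2229


# Forecast outside the data range, then compare model and table year by year.
print(f"2008 forecast: ${predict(2008) * 1_000_000:,.0f}")
for year, price in actual.items():
    print(year, f"actual {price:.2f}", f"model {predict(year):.4f}")
```

The 2008 line reproduces the forecast above, and the remaining lines let you see how far the model is from the table in each year.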
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular data
point and the regression line: the actual value minus the
value that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box around one data point.]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the size of the residual.
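As a rough illustration, the residual for each of our six data points can be computed in Python. This sketch assumes the rounded regression equation p=0.1264t+0.2229 from the graph and takes the residual to be the actual price minus the predicted price; the function name residual is our own.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residual(t, p, slope=0.1264, intercept=0.2229):
    """Actual price minus the price the regression line predicts at time t."""
    return p - (slope * t + intercept)


residuals = [residual(t, p) for t, p in points]
for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: residual {res:+.4f} (point is {side} the line)")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below.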
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
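To get a feel for why the regression line is preferred over some other line, the Python sketch below adds up the squared residuals for two candidates. The first is the regression line from the graph. The second, a "rival" line through the first and last data points, is our own choice purely for comparison and does not appear in the course materials.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept, data=points):
    """Add up the squared vertical gaps between the data and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


# The regression line from the slides.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# A rival line (chosen here only for comparison): the line through the
# first and last data points, (0, 0.38) and (10, 1.60).
rival_slope = (1.60 - 0.38) / 10
sse_rival = sum_squared_residuals(rival_slope, 0.38)

print(f"regression line: {sse_regression:.4f}")
print(f"rival line:      {sse_rival:.4f}")
```

The regression line gives the smaller total, even though the rival line passes exactly through two of the points.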
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
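For the curious, here is a minimal Python sketch of how r can be computed. It assumes r is the usual (Pearson) correlation coefficient; the function name correlation is our own. The second data set, which lists the same prices in reverse order, is made up only to show what a negative r looks like.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_rising = correlation(t, p)         # the apartment data
r_falling = correlation(t, p[::-1])  # same prices in reverse order (made up)
print(f"{r_rising:.3f} {r_falling:.3f}")
```

The first value is close to 1 because prices rise with time; the second is close to -1 because the reversed prices fall with time.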
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
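One common way to compute r squared for a fitted line is to compare the variation left over after fitting (the squared residuals) with the total variation of the data around its mean. The Python sketch below does this for the apartment data, assuming the rounded regression equation from the graph; the variable names are our own, and we are not claiming this is exactly how Excel carries out the calculation.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = 0.1264, 0.2229  # regression equation from the slides

mean_p = sum(p for _, p in points) / len(points)
# Variation left over after fitting the line (sum of squared residuals).
ss_residual = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
# Total variation of the prices around their mean.
ss_total = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_residual / ss_total
print(f"R^2 = {r_squared:.4f}")
print(f"r^2 from r = 0.9734: {0.9734 ** 2:.4f}")
```

Both printed values come out at about 0.95, so squaring the r from our graph gives essentially the same number as this calculation.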
Consider the following set of data points.
[Figure: scatter plot with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the graph shows the line y = 0.5091x + 1.94 with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the graph shows the line y = -0.183x + 8.3267 with R² = 0.0084.
So, to summarize, a linear regression equation describes the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 19
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Graph: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Graph: the same plot, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the data points and regression line, with p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = (change in y)/(change in x).
In this case we are using p and t, so it’s
slope = (change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging 14 in for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the prediction is about $31,300 too high, which is pretty close.
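The two calculations above can be checked with a few lines of Python. This is only a sketch: the names SLOPE, INTERCEPT, and predict_price are made up for illustration, and the coefficients are the rounded values shown on the graph.

```python
# Coefficients read off the graph: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)  # outside the data range (extrapolation)
price_2000 = predict_price(2000)  # inside the data range
actual_2000 = 0.95                # value from the table, in millions of $

print(f"2008 prediction: ${price_2008 * 1e6:,.0f}")
print(f"2000 prediction: ${price_2000 * 1e6:,.0f}")
print(f"2000 actual:     ${actual_2000 * 1e6:,.0f}")
```

Writing the model as a function of the calendar year keeps the "t=0 represents 1994" convention in one place, so the same function works for any year.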
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points and the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: the data points and the regression line, with a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
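To make the idea concrete, the residuals for the six apartment data points can be listed with a short Python sketch. It assumes the rounded coefficients from the graph and takes each residual as the actual price minus the predicted price; the slides only describe the vertical distance, so the sign convention here is an assumption.

```python
# Rounded regression coefficients from the graph (p in millions of $).
SLOPE, INTERCEPT = 0.1264, 0.2229

# The six (t, p) data points from the table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Residual = actual price minus the price predicted by the line.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}")

# Positive and negative residuals nearly cancel when added up,
# so their squares are a better measure of the overall size.
print(f"sum of residuals:         {sum(residuals):+.4f}")
print(f"sum of squared residuals: {sum(r ** 2 for r in residuals):.4f}")
```

Points above the line get a positive residual and points below it get a negative one, which is why the plain sum comes out close to zero even though the line misses every point.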
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
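For readers who are curious, here is a minimal Python sketch of the least-squares calculation for the apartment data. It uses the standard textbook formulas for a line fitted to one input variable, which may be written differently in the course PDF; the variable names are illustrative only.

```python
# Apartment data: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(ts)
mean_t = sum(ts) / n
mean_p = sum(ps) / n

# Sum of products of deviations, and sum of squared deviations of t.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(ts, ps))
s_tt = sum((t - mean_t) ** 2 for t in ts)

# Slope and intercept of the line that minimizes the sum of squared residuals.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the equation shown on the graph.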
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
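As a rough check on the value of r shown on the graph, the correlation coefficient for the apartment data can be computed in Python. The sketch below uses the usual formula for the (Pearson) correlation coefficient, which is assumed to be the one behind the slide; the variable names are illustrative.

```python
import math

# Apartment data: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(ts)
mean_t = sum(ts) / n
mean_p = sum(ps) / n

# Sums of products and squares of the deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(ts, ps))
s_tt = sum((t - mean_t) ** 2 for t in ts)
s_pp = sum((p - mean_p) ** 2 for p in ps)

# Correlation coefficient: always between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r:.3f}")
```

The result is positive and close to 1, which matches the upward-sloping, nearly linear pattern in the scatter plot.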
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
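One common way to compute an r-squared value is to compare the squared residuals with the total variation in the data. The Python sketch below does this for the earlier apartment data, using the rounded coefficients from that graph. Treat it as an illustration under those assumptions; it is not a description of how Excel does the calculation internally.

```python
# Rounded regression coefficients from the apartment graph.
SLOPE, INTERCEPT = 0.1264, 0.2229

# The six (t, p) data points from the table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

mean_p = sum(p for _, p in data) / len(data)

# Variation left over after fitting the line (sum of squared residuals).
ss_res = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in data)
# Total variation of the prices around their mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot

print(f"r-squared from the residuals:       {r_squared:.4f}")
print(f"square of r = 0.9734 from the graph: {0.9734 ** 2:.4f}")
```

The two numbers agree closely, and both are near 1, as expected for data that follow a clear linear pattern.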
Consider the following set of data points.
[Scatter plot: data points lying close to a straight line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Graph: the same points with the trendline y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot: widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Graph: the same points with the trendline y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 21
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Figure: the same scatter plot with the best-fitting line drawn through the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \),
the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
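The two calculations above can be reproduced with a short script. This is only a sketch: the slope and intercept are the rounded values from the slides, and the names `predict_price`, `SLOPE`, `INTERCEPT`, and `BASE_YEAR` are made up for illustration.

```python
# Regression line from the slides: p = 0.1264t + 0.2229,
# where p is in millions of dollars and t = 0 represents 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994

def predict_price(year):
    """Return the model's predicted price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR          # convert the calendar year to t
    return SLOPE * t + INTERCEPT

price_2008 = predict_price(2008)  # t = 14, outside the range of the table
price_2000 = predict_price(2000)  # t = 6, a year that appears in the table
actual_2000 = 0.95                # table value for the year 2000

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 actual:     ${actual_2000 * 1_000_000:,.0f}")
```

The 2000 prediction can be compared with a table value, while the 2008 figure is an extrapolation that only holds if the trend continues.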
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p against time t with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of a single data point and the nearby regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
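To make the idea concrete, the residual of every point in the apartment data can be computed directly. This is a sketch under two assumptions: the line uses the rounded coefficients from the slides, and each residual is taken as the actual value minus the predicted value (a common sign convention, so a positive residual means the point lies above the line). The variable names are illustrative.

```python
# Data from the slides: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

# Rounded regression line from the slides: p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229

residuals = []
for t, p in zip(t_values, p_values):
    predicted = SLOPE * t + INTERCEPT   # height of the line at this t
    residuals.append(p - predicted)     # signed vertical gap (actual - predicted)

for t, r in zip(t_values, residuals):
    print(f"t = {t:2d}: residual = {r:+.4f}")
```

The vertical distance described above is the absolute value of each of these signed numbers.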
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
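For readers who are curious, here is a minimal sketch of the least-squares calculation using the standard closed-form formulas for a single input variable. It is not taken from the course PDF, and the function names are made up for illustration. It also compares the sum of squared residuals for the fitted line with that of another candidate line, the one through the first and last data points.

```python
# Data from the slides: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

slope, intercept = least_squares_line(t_values, p_values)
best = sum_squared_residuals(t_values, p_values, slope, intercept)

# For comparison: the line through the first and last data points.
other_slope = (p_values[-1] - p_values[0]) / (t_values[-1] - t_values[0])
other = sum_squared_residuals(t_values, p_values, other_slope, p_values[0])

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"least-squares line: {best:.4f}, two-point line: {other:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph, and the fitted line has the smaller sum of squared residuals.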
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
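As an illustration, r can be computed from the data with the usual formula for the correlation coefficient. This sketch is not part of the course materials, and the function name `correlation_coefficient` is made up for the example. The second data set is invented to show what a perfectly decreasing linear pattern gives.

```python
import math

# Data from the slides: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation_coefficient(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation_coefficient(t_values, p_values)

# Invented example: points lying exactly on a decreasing line.
r_decreasing = correlation_coefficient([0, 1, 2], [3, 2, 1])

print(f"apartment data: r = {r:.4f}")
print(f"decreasing line: r = {r_decreasing:.1f}")
```

For the apartment data the result is within rounding of the r = 0.9734 shown on the graph, and the points on a decreasing line give r = -1.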
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is in the same way, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
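One way to see where an r-squared value comes from is to compare the squared residuals of the fitted line with the squared distances of the points from their mean. The sketch below uses that standard definition. It is an illustration only, not a description of how Excel computes the number internally, and the function name `r_squared` is made up.

```python
# Data from the slides: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def r_squared(xs, ys):
    """Return R^2 = 1 - SS_res / SS_tot for the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared residuals around the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared deviations around the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

r2 = r_squared(t_values, p_values)

# Reversing the prices turns the rising trend into a falling one.
r2_reversed = r_squared(t_values, p_values[::-1])

print(f"rising prices:  R^2 = {r2:.4f}")
print(f"falling prices: R^2 = {r2_reversed:.4f}")
```

Both data sets give the same r-squared value, which shows that r squared measures how close the points are to a line but not the direction of the slope.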
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points lying close to a straight line, on axes running from 0 to 12 horizontally and 0 to 8 vertically; the trendline is labeled y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, on axes running from 0 to 12 horizontally and 0 to 20 vertically; the trendline is labeled y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 22
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging 14 in for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the average price of a two-bedroom apartment to have been
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
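The two calculations above can be sketched in a few lines of Python. This is only an illustration, assuming the rounded coefficients shown on the graph. The names predict_price and is_extrapolation are made up for this example and are not part of the course material.

```python
# Assumed for illustration: the rounded coefficients of the regression
# equation p = 0.1264t + 0.2229, where t = 0 stands for 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994
LAST_DATA_YEAR = 2004  # the table stops here


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


def is_extrapolation(year):
    """True when the year lies outside the range covered by the table."""
    return year < BASE_YEAR or year > LAST_DATA_YEAR


price_2008 = predict_price(2008)  # t = 14, beyond the data in the table
price_2000 = predict_price(2000)  # t = 6, inside the data in the table
actual_2000 = 0.95                # table value for 2000, in millions of dollars

print(f"2008: {price_2008:.4f} (extrapolation: {is_extrapolation(2008)})")
print(f"2000: {price_2000:.4f} predicted vs {actual_2000:.2f} actual")
print(f"2000 difference: {price_2000 - actual_2000:.4f}")
```

The 2008 value is flagged as an extrapolation because it depends on the trend continuing past the last year in the table, while the 2000 value can be checked directly against the data.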
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
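To make the idea concrete, here is a small Python sketch that computes the residual at each of the six data points from the table. It assumes the rounded regression equation p=0.1264t+0.2229 and the common convention that a residual is the actual value minus the predicted value. The function name residual is just a label chosen for this example.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Rounded coefficients of the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Vertical gap between a data point and the line (actual minus predicted)."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, residual {res:+.4f} ({side} the line)")
```

A positive residual means the data point sits above the line, and a negative residual means it sits below the line.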
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
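For the curious, the sketch below shows one standard way the least-squares line can be computed by hand, written in Python rather than Excel. The least-squares method picks the slope and intercept that make the sum of the squared residuals as small as possible. The function names are made up for this example, and the comparison line through the first and last data points is an arbitrary choice used only for contrast.

```python
# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(slope, intercept, xs, ys):
    """Add up the squared vertical gaps between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(t_values, p_values)
best = sum_squared_residuals(slope, intercept, t_values, p_values)

# A different line for comparison: the one through the first and last points.
other_slope = (1.60 - 0.38) / (10 - 0)
other = sum_squared_residuals(other_slope, 0.38, t_values, p_values)

print(f"least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals, least-squares line: {best:.4f}")
print(f"sum of squared residuals, comparison line:    {other:.4f}")
```

The computed slope and intercept agree with the equation on the graph after rounding, and the comparison line gives a larger sum of squared residuals, which is why it is not the line of best fit.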
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
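As an illustration, the following Python sketch computes the correlation coefficient for the apartment data using the usual formula for r. The function name correlation is chosen for this example. Reversing the list of prices is a made-up variation, used only to show what a downward trend does to the sign of r.

```python
import math

# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(t_values, p_values)

# Hypothetical variation: the same prices in reverse order, so they fall over time.
r_reversed = correlation(t_values, p_values[::-1])

print(f"r for the table data:      {r:.4f}")
print(f"r with the prices reversed: {r_reversed:.4f}")
```

The first value is close to 1 because the prices rise along a nearly straight line. The second is close to -1 because the same prices, reversed, fall along a nearly straight line.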
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
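Here is a rough Python sketch of one common way to compute r squared for a least-squares line: 1 minus the sum of squared residuals divided by the total squared variation of the data around its mean. For a straight-line fit this equals the square of r. The function name r_squared is made up for this example, and the second data set is invented purely for contrast; it does not come from the slides.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line: 1 - SS_residual / SS_total."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Apartment data from the table: t in years since 1994, p in millions of $.
apartment_r2 = r_squared([0, 2, 4, 6, 8, 10], [0.38, 0.40, 0.60, 0.95, 1.20, 1.60])

# Made-up scattered data, for contrast only (not taken from the slides).
scattered_r2 = r_squared([1, 2, 3, 4, 5, 6], [5, 1, 4, 2, 5, 2])

print(f"apartment data: r squared = {apartment_r2:.4f}")
print(f"scattered data: r squared = {scattered_r2:.4f}")
```

The apartment data give a value close to 1, while the scattered data give a value close to 0.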
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points with a trendline. Horizontal axis from 0 to 12, vertical axis from 0 to 8. Trendline: y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot of the data points with a trendline. Horizontal axis from 0 to 12, vertical axis from 0 to 20. Trendline: y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression equation in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 24
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year   t    Price p (millions of $)
1994   0    0.38
1996   2    0.40
1998   4    0.60
2000   6    0.95
2002   8    1.20
2004   10   1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: Price p in millions of $ versus Time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot with the regression line drawn through the data: Price p in millions of $ versus Time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
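For readers who want to see where these numbers come from without Excel, here is a minimal sketch in Python that applies the standard least-squares formulas to the six data points above. The variable names are our own choices, not part of the course materials. The slope and intercept should agree with the values on the graph, and r should agree up to rounding in the last digit.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squared deviations and cross-deviations from the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                    # least-squares slope
intercept = p_mean - slope * t_mean    # least-squares intercept
r = s_tp / (s_tt * s_pp) ** 0.5        # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```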
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
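The comparison above can be checked with a few lines of Python. This is only an illustrative sketch that uses the rounded coefficients from the graph; the variable names are our own.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229 (p in millions of $).
slope, intercept = 0.1264, 0.2229
actual_1994 = 0.38  # table value at t = 0, in millions of $

predicted_1994 = slope * 0 + intercept      # at t = 0 only the intercept remains
difference = actual_1994 - predicted_1994   # how far the model is from the table

print(f"predicted: ${predicted_1994 * 1_000_000:,.0f}")
print(f"actual:    ${actual_1994 * 1_000_000:,.0f}")
print(f"gap:       ${difference * 1_000_000:,.0f}")
```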
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
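As a quick check of this interpretation, the sketch below evaluates the regression equation at two consecutive years and converts the difference to dollars. The function name price_millions is a hypothetical helper introduced here for illustration.

```python
def price_millions(t):
    """Predicted price in millions of $ at t years since 1994."""
    return 0.1264 * t + 0.2229

# Increasing t by 1 year (here from t = 4 to t = 5) changes p by the slope.
increase_millions = price_millions(5) - price_millions(4)
increase_dollars = increase_millions * 1_000_000

print(f"increase per year: {increase_millions:.4f} million = ${increase_dollars:,.0f}")
```

The same difference results for any pair of consecutive years, because the model is a straight line.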
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994,
and 2008-1994=14).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the average price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
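The two calculations above can be wrapped in a small function that converts a calendar year to t before applying the regression equation. This is a sketch under our own naming (predict_price and BASE_YEAR are not from the course materials).

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predict_price(year):
    """Predicted average price in millions of $ for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229

p_2008 = predict_price(2008)   # outside the data range (a forecast)
p_2000 = predict_price(2000)   # inside the data range (a check against the table)
error_2000 = p_2000 - 0.95     # table value for 2000 is 0.95 million

print(f"2008: {p_2008:.4f} million")
print(f"2000: {p_2000:.4f} million, off from the table by {error_2000:.4f} million")
```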
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a
particular data point and the value that the regression line
predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
[Scatter plot with regression line, with a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
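To make the idea concrete, here is a short sketch that computes the residual for each of the six apartment data points, using the rounded regression equation from the graph. We assume the common sign convention "actual minus predicted", so a positive residual means the point lies above the line; the slides only describe the vertical distance.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
slope, intercept = 0.1264, 0.2229

# Residual = actual price minus the price predicted by the line (assumed convention).
residuals = [pi - (slope * ti + intercept) for ti, pi in zip(t, p)]

for ti, ri in zip(t, residuals):
    print(f"t = {ti:2d}   residual = {ri:+.4f}")
```

Some residuals come out positive and some negative, which matches the picture: the line passes below some points and above others.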
There is a method that allows us to minimize the sum of
the squares of all of the residuals. The residuals are
squared so that positive and negative residuals do not
cancel each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
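Although the formulas are not required, a small experiment can show what "least squares" means. The sketch below computes the sum of squared residuals for the least-squares line and for a few slightly different lines. The nudge sizes (0.01 and 0.05) are arbitrary choices of ours, made only for illustration.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(m, b):
    """Sum of squared vertical distances between the data and the line p = m*t + b."""
    return sum((pi - (m * ti + b)) ** 2 for ti, pi in zip(t, p))

# Least-squares slope and intercept from the standard formulas.
n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n
m = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p)) / sum(
    (ti - t_mean) ** 2 for ti in t
)
b = p_mean - m * t_mean

best = sum_squared_residuals(m, b)
# Any other line gives a larger total; here are four nearby lines.
nudged = [
    sum_squared_residuals(m + 0.01, b),
    sum_squared_residuals(m - 0.01, b),
    sum_squared_residuals(m, b + 0.05),
    sum_squared_residuals(m, b - 0.05),
]

print(f"least-squares line: {best:.6f}")
print("nearby lines:      ", [round(v, 6) for v in nudged])
```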
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
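The sign of r can be seen in a small sketch. Here we compute r for the apartment data, and then for a hypothetical data set made by listing the same prices in reverse order so that they fall over time. The reversed data are our own invention for illustration and do not appear in the course materials.

```python
def correlation(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_rising = correlation(t, p)          # prices rise with t, so r is close to 1
r_falling = correlation(t, p[::-1])   # hypothetical falling prices, so r is close to -1

print(f"rising prices:  r = {r_rising:.3f}")
print(f"falling prices: r = {r_falling:.3f}")
```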
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
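For the apartment data, the r-squared value can be computed in two ways that should agree for a least-squares line: by squaring r, or as one minus the ratio of the sum of squared residuals to the total squared variation in p. The sketch below checks this; the variable names are ours.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)   # total variation in p
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

# Sum of squared residuals for the least-squares line.
sse = sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))

r = s_tp / (s_tt * s_pp) ** 0.5
r_squared = 1 - sse / s_pp

print(f"r squared from residuals: {r_squared:.4f}")
print(f"r squared from r * r:     {r * r:.4f}")
```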
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8, with points lying close to a rising line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20, with widely scattered points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
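The data behind the two plots above are not listed in the slides, so the sketch below uses two small made-up data sets to show the same contrast: one nearly linear, one scattered. All of the numbers here are hypothetical and chosen only for illustration.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

x = [1, 2, 3, 4, 5]
y_linear = [2.4, 3.1, 3.4, 4.1, 4.5]   # hypothetical: close to a rising line
y_scattered = [5, 1, 9, 1, 5]          # hypothetical: no linear trend

r2_linear = r_squared(x, y_linear)
r2_scattered = r_squared(x, y_scattered)

print(f"nearly linear data: R squared = {r2_linear:.4f}")
print(f"scattered data:     R squared = {r2_scattered:.4f}")
```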
So, to summarize, a linear regression equation is the equation
of a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 25
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = Δy/Δx (the change in y divided by the change in x).
In this case we are using p and t, so it’s
slope = Δp/Δt.
So for our problem, we have
Δp/Δt = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only $31,300, or about 3%.
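The two calculations above can be sketched in a few lines of Python. This is only an illustration of plugging values into the regression equation from the slides. The helper name predict_price and the constant names are made up for this example.

```python
# Sketch: evaluate the regression equation p = 0.1264t + 0.2229 from the slides.
# predict_price, SLOPE and INTERCEPT are hypothetical names for this illustration.

SLOPE = 0.1264       # millions of $ per year
INTERCEPT = 0.2229   # millions of $ at t = 0 (the year 1994)

def predict_price(year):
    """Predicted average price, in millions of $, for a calendar year."""
    t = year - 1994              # t counts years since 1994
    return SLOPE * t + INTERCEPT

p_2008 = predict_price(2008)     # t = 14, beyond the last year in the table
p_2000 = predict_price(2000)     # t = 6, a year that appears in the table
actual_2000 = 0.95               # table value for 2000, in millions of $

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${(p_2000 - actual_2000) * 1_000_000:,.0f}")
```

Subtracting 1994 inside the function keeps the "t=0 represents 1994" convention in one place, so the caller can work with calendar years.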
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
predicts for that point.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the nearby part of the regression line.]
Zooming in, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
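To make the idea concrete, here is a small Python sketch that lists the residual for each of the six data points from the apartment-price table, using the regression equation p = 0.1264t + 0.2229. It assumes the usual sign convention, in which a residual is the actual value minus the predicted value. The function name predicted is chosen only for this illustration.

```python
# Sketch: residual = actual price - price predicted by the regression line.
# The data points come from the table in the presentation; predicted is a
# hypothetical helper name.

ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def predicted(t):
    """Price (millions of $) given by the regression equation from the slides."""
    return 0.1264 * t + 0.2229

residuals = [p - predicted(t) for t, p in zip(ts, ps)]

for t, p, res in zip(ts, ps, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted(t):.4f}  residual={res:+.4f}")
```

Points above the line have positive residuals and points below the line have negative residuals.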
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
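The following Python sketch is one way to see what "least squares" means for the apartment-price data. It adds up the squared residuals for the regression line from the slides and for one other candidate line. The alternative line, drawn through the first and last data points, is a hypothetical comparison picked for this example and is not part of the course materials.

```python
# Sketch: compare the sum of squared residuals for two candidate lines.
# sum_squared_residuals is a hypothetical helper name for this illustration.

ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def sum_squared_residuals(slope, intercept):
    """Add up (actual - predicted)^2 over all data points."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))

# The regression line from the slides.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# A hypothetical alternative: the line through the first and last data points.
alt_slope = (1.60 - 0.38) / (10 - 0)
alt_intercept = 0.38
sse_alternative = sum_squared_residuals(alt_slope, alt_intercept)

print(f"regression line:  {sse_regression:.4f}")
print(f"alternative line: {sse_alternative:.4f}")
```

The regression line gives the smaller total, which is the sense in which it is the "best" fit. Squaring matters because positive and negative residuals would otherwise cancel each other when added.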
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
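For readers curious where the number r comes from, the sketch below computes it in Python for the six apartment-price data points, using the standard (Pearson) correlation formula. The presentation does not give this formula, so treat it as background rather than as required material. The function name correlation is chosen for this example.

```python
# Sketch: Pearson correlation coefficient r for the apartment-price data.
# correlation is a hypothetical helper name used only for this illustration.
import math

ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def correlation(xs, ys):
    """Return r, a value between -1 and 1."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(ts, ps)
r_flipped = correlation(ts, [-p for p in ps])   # same points mirrored downward

print(f"r = {r:.3f}")
print(f"r for the mirrored data = {r_flipped:.3f}")
```

The first value is close to the r = 0.9734 shown on the graph. Mirroring the prices turns the upward trend into a downward one, and r changes sign while keeping the same size.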
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
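As a hedged illustration of how r squared relates to the residuals, the Python sketch below computes it for the apartment-price data as 1 minus the ratio of the squared residuals to the total squared variation in p. This is a standard way to define r squared for a fitted line; the presentation itself does not spell out a formula, and the variable names are chosen for this example.

```python
# Sketch: compute R^2 for the apartment-price data and recover the size of r.
# All variable names here are hypothetical labels for this illustration.

ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

n = len(ts)
t_mean = sum(ts) / n
p_mean = sum(ps) / n

# Least-squares slope and intercept (standard formulas).
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(ts, ps))
s_tt = sum((t - t_mean) ** 2 for t in ts)
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

# Squared residuals around the line, and squared deviations around the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))
ss_tot = sum((p - p_mean) ** 2 for p in ps)

r_squared = 1 - ss_res / ss_tot
r_magnitude = r_squared ** 0.5   # the sign of r is lost when squaring

print(f"R^2 = {r_squared:.4f}, |r| = {r_magnitude:.3f}")
```

Taking the square root gives a value close to the r = 0.9734 shown on the graph, but only its size: the sign has to be read from the slope of the line.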
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points lying close to a straight line, with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: Price p in millions of $ (vertical axis, 0 to 1.8) against Time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot of Price p in millions of $ against Time t in years since 1994, with the best-fitting line drawn over the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the same scatter plot and line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
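The equation and the r value shown on the graph can also be reproduced outside Excel. The following is a minimal Python sketch, assuming the six (t, p) points listed earlier and the standard least-squares formulas for a line; the function name fit_line is hypothetical and is not part of the course materials.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept, r) for the least-squares line through data."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Sums of squared deviations and of cross products about the means.
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    s_pp = sum((p - mean_p) ** 2 for _, p in data)
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    r = s_tp / math.sqrt(s_tt * s_pp)
    return slope, intercept, r


slope, intercept, r = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}, r = {r:.4f}")
```

Rounded to four decimal places, the slope and intercept should match the equation on the graph, and r should agree with the value on the graph to about three decimal places.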
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is the change in y divided by the change in x, \( \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
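The two calculations above can be sketched in a few lines of Python. This is only an illustration, assuming the rounded coefficients 0.1264 and 0.2229 from the regression equation; the function name predict_price is hypothetical.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Convert calendar years to t by subtracting 1994, since t=0 represents 1994.
price_2008 = predict_price(2008 - 1994)  # t = 14, outside the data range
price_2000 = predict_price(2000 - 1994)  # t = 6, inside the data range

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f} (table value: $950,000)")
```

The 2000 value can be compared against the table, while the 2008 value is a forecast that holds only if the trend continued.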
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: Price p in millions of $ against Time t in years since 1994, showing the data points and the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: a zoomed-in view of one data point and the regression line near it]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
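To make the idea of a residual concrete, here is a short Python sketch that computes one residual per data point. It assumes the six (t, p) points and the rounded regression equation from earlier, and it uses the common sign convention "actual value minus predicted value", which is an assumption here since the slides only describe the vertical distance.

```python
# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# One residual per data point: actual price minus the price on the line.
residuals = [(t, p - predicted(t)) for t, p in points]

for t, res in residuals:
    print(f"t = {t:2d}: residual = {res:+.4f}")

# Residuals above the line are positive and those below are negative,
# so they nearly cancel when added; squaring them removes the sign.
sum_of_residuals = sum(res for _, res in residuals)
sum_of_squares = sum(res ** 2 for _, res in residuals)
print(f"sum = {sum_of_residuals:+.4f}, sum of squares = {sum_of_squares:.4f}")
```

Because positive and negative residuals cancel, the plain sum is not a useful measure of fit; the sum of the squared residuals is the quantity usually kept small.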
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
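For reference only, the least-squares idea can be written out as follows. This is the standard textbook formulation, not something taken from the course PDF, and the symbols m, b, n, and the bars for averages are notation chosen here.

```latex
% Quantity to minimize: the sum of squared residuals for the line p = m t + b
S(m, b) = \sum_{i=1}^{n} \bigl( p_i - (m t_i + b) \bigr)^2

% The slope and intercept that make S as small as possible
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}

% For the apartment data: \bar{t} = 5, \bar{p} = 0.855
m = \frac{8.85}{70} \approx 0.1264,
\qquad
b = 0.855 - 5 \cdot \frac{8.85}{70} \approx 0.2229
```

These values agree with the equation p = 0.1264t + 0.2229 shown on the graph.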
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r on the graph; instead, it gives the value
of r squared.
The r-squared value tells us how well the line fits in much the same way, but it
will only be between 0 and 1, so it does not show whether the slope is
positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
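The link between r and r squared can be checked numerically. The sketch below is a minimal Python illustration, assuming the six apartment data points from earlier; it computes r directly, then computes r squared a second way as one minus the ratio of the squared residuals to the total squared variation in p, which is the usual definition for a straight-line fit.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)

# Correlation coefficient r, between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)

# Least-squares line, then r squared from the squared residuals.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
r_squared = 1 - ss_res / s_pp

print(f"r = {r:.4f}, r*r = {r * r:.4f}, r squared from residuals = {r_squared:.4f}")
```

Both routes give the same r-squared value for a straight-line fit, and squaring removes the sign of r, which is why r squared alone cannot tell a rising line from a falling one.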
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline y = 0.5091x + 1.94 has R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline y = -0.183x + 8.3267 has R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given in the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression equation in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 28
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
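As a quick check of the arithmetic above, the comparison at t=0 can be sketched in Python. The function name predict_price is only an illustrative choice and is not part of the course materials; the coefficients and the 1994 price are the values quoted on the slides.

```python
# Regression line from the slides: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE = 0.1264
INTERCEPT = 0.2229

def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT

predicted_1994 = predict_price(0)    # at t = 0 the prediction is the p-intercept
actual_1994 = 0.38                   # table value for t = 0, in millions of $
gap = actual_1994 - predicted_1994   # how far the model is from the data point

print(f"predicted: ${predicted_1994 * 1_000_000:,.0f}")
print(f"actual:    ${actual_1994 * 1_000_000:,.0f}")
print(f"gap:       ${gap * 1_000_000:,.0f}")
```

The gap is not a mistake in the model; it simply shows that the line is a summary of all six points rather than a curve through each one.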
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
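To see the "per year" reading of the slope concretely, here is a minimal sketch, assuming the regression equation from the slides. It compares the predicted price in each year with the predicted price one year later. The names predict_price and yearly_changes are illustrative only.

```python
SLOPE = 0.1264      # millions of dollars per year (from the slides)
INTERCEPT = 0.2229  # millions of dollars (from the slides)

def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT

# Change in predicted price from year t to year t + 1, for t = 0, 1, ..., 9.
yearly_changes = [predict_price(t + 1) - predict_price(t) for t in range(10)]

for t, change in enumerate(yearly_changes):
    print(f"t = {t} to t = {t + 1}: change of ${change * 1_000_000:,.0f}")
```

Every step gives the same change, which is what it means for the model to be linear: the slope does not depend on the starting year.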
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that, if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
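The two calculations above can be reproduced with a short Python sketch. The helper predict_price_for_year is a hypothetical name introduced here for illustration; it converts a calendar year to t using t=0 for 1994, as in the slides.

```python
SLOPE = 0.1264      # from the regression equation on the slides
INTERCEPT = 0.2229
BASE_YEAR = 1994    # t = 0 corresponds to 1994

def predict_price_for_year(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT

forecast_2008 = predict_price_for_year(2008)   # outside the data range (t = 14)
fitted_2000 = predict_price_for_year(2000)     # inside the data range (t = 6)
actual_2000 = 0.95                             # table value for 2000, in millions of $
error_2000 = fitted_2000 - actual_2000         # positive means the model overshoots

print(f"2008 forecast:  {forecast_2008:.4f} million")
print(f"2000 predicted: {fitted_2000:.4f} million, actual {actual_2000:.2f} million")
print(f"2000 error:     {error_2000:.4f} million")
```

The 2000 value can be checked against the table, while the 2008 value cannot, so the forecast depends entirely on the assumption that the trend continues.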
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994) with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: close-up of one data point and the nearby part of the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
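To make the idea of a residual concrete, the sketch below computes one for each of the six data points listed earlier in the presentation. It assumes the common sign convention of actual value minus predicted value, so a positive residual means the point lies above the line. The function name residual is illustrative.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE, INTERCEPT = 0.1264, 0.2229   # regression equation from the slides

def residual(t, p):
    """Signed vertical gap: actual price minus the price on the regression line."""
    return p - (SLOPE * t + INTERCEPT)

residuals = [residual(t, p) for t, p in data]

for (t, p), res in zip(data, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, residual {res:+.4f} ({side} the line)")

print(f"sum of signed residuals: {sum(residuals):+.4f}")
```

Some points sit above the line and some below, so the signed residuals nearly cancel when added. Their plain sum therefore says little about how well the line fits.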
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
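For readers curious about what happens behind the scenes, here is a minimal sketch of the standard textbook formulas for a least-squares line, applied to the six data points from the table. This is not necessarily how the course PDF presents the method, and the variable names are illustrative. It is optional reading and is not required for the course.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Closed-form least-squares estimates for the line p = slope * t + intercept.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

def sum_of_squared_residuals(m, b):
    """Total of (actual - predicted)^2 over all points for the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in data)

best = sum_of_squared_residuals(slope, intercept)
tilted = sum_of_squared_residuals(slope + 0.02, intercept)    # a steeper line
shifted = sum_of_squared_residuals(slope, intercept + 0.05)   # a higher line

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"best = {best:.4f}, tilted = {tilted:.4f}, shifted = {shifted:.4f}")
```

The fitted slope and intercept should agree with the equation on the slides after rounding, and any other line, such as the tilted or shifted one, gives a larger sum of squared residuals.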
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
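As a rough illustration, the correlation coefficient for the apartment data can be computed with the usual textbook formula (the Pearson correlation coefficient). The sketch below assumes the six data points from the table, and the variable names are illustrative.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
s_pp = sum((p - mean_p) ** 2 for _, p in data)

# Correlation coefficient: always between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r}")
```

The result should agree with the r = 0.9734 shown on the chart to about three decimal places. It is close to 1 and positive, matching the upward-sloping line.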
Excel’s trendline will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
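One common way to compute r squared is as 1 minus the ratio of the squared residuals to the total squared variation of the p-values about their mean. The sketch below assumes that standard definition and the six apartment data points from the table. For a least-squares line, this equals the square of the correlation coefficient.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Least-squares line p = slope * t + intercept.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Variation left over after fitting the line, and total variation in p.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - mean_p) ** 2 for _, p in data)
r_squared = 1 - ss_res / ss_tot

# Squaring removes the sign: r = 0.9734 and r = -0.9734 give the same r squared.
same_when_negative = (-0.9734) ** 2 == 0.9734 ** 2

print(f"r squared = {r_squared:.4f}")
print(f"sign is lost when squaring: {same_when_negative}")
```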
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
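The contrast between the two charts can be imitated with a small Python sketch. The two data sets below are hypothetical, invented only for illustration, because the points plotted on the slides are not listed in the text. The function r_squared is likewise an illustrative helper that squares the usual correlation coefficient.

```python
def r_squared(points):
    """Square of the correlation coefficient for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy ** 2 / (s_xx * s_yy)

# Hypothetical data (not from the slides): points lying close to a rising line.
nearly_linear = [(1, 2.1), (2, 2.4), (3, 3.1), (4, 3.4), (5, 4.1)]

# Hypothetical data (not from the slides): points with no upward or downward trend.
scattered = [(1, 5), (2, 1), (3, 9), (4, 1), (5, 5)]

print(f"nearly linear: r squared = {r_squared(nearly_linear):.4f}")
print(f"scattered:     r squared = {r_squared(scattered):.4f}")
```

The first value comes out close to 1 and the second close to 0, mirroring the two charts.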
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
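As a rough illustration of where r comes from, the sketch below applies the standard Pearson correlation formula to the six apartment-price points from this presentation. The function name correlation and the variable names are only illustrative assumptions; in the course, this value comes from Excel.

```python
import math

# Apartment data from the presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(t_values, p_values)
print(f"r = {r:.3f}")  # a value close to 1: strong fit with a positive slope
```

The result agrees with the r = 0.9734 shown on the graph to about three decimal places.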
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with
R² = 0.0084.
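Since Excel reports r squared rather than r, it can help to see how the two relate. The sketch below assumes a straight-line fit with one input variable, where r has the same sign as the slope and its size is the square root of r squared. The function name r_from_r_squared is hypothetical, and the numbers are the trendline values from the two example charts.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r for a one-variable linear fit.

    r has the same sign as the slope, and its size is sqrt(R^2).
    """
    magnitude = math.sqrt(r_squared)
    return magnitude if slope >= 0 else -magnitude

# Trendline values read from the two example charts.
r_linear = r_from_r_squared(0.9943, slope=0.5091)
r_scattered = r_from_r_squared(0.0084, slope=-0.183)

print(f"clear linear pattern: r = {r_linear:.3f}")
print(f"scattered points:     r = {r_scattered:.3f}")
```

The first value comes out close to 1 and the second close to 0 (and slightly negative), which matches what the two scatter plots suggest.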
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 30
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ against time t in
years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Scatter plot with the fitted line: price p in millions of $
against time t in years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the fitted line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the fitted line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
$\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
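The two calculations above can be sketched in a few lines of code. This is only an illustration of plugging values into p=0.1264t+0.2229; the names BASE_YEAR and predict_price are made up for the example and are not part of the course materials.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predict_price(year):
    """Predicted average price in millions of dollars, from p = 0.1264t + 0.2229."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229

price_2008 = predict_price(2008)  # outside the data range (t = 14)
price_2000 = predict_price(2000)  # a year that appears in the table (t = 6)
actual_2000 = 0.95                # table value for 2000, in millions of dollars

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${(price_2000 - actual_2000) * 1_000_000:,.0f}")
```

Converting the year to t inside the function keeps the "t=0 is 1994" convention in one place, so the same function works for any year.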
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the fitted line: price p in millions of $
against time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression
line predicts at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Scatter plot with a box drawn around one data point.]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
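To make the idea of a residual concrete, the sketch below computes one for each of the six apartment-price points. It assumes the common convention that a residual is the actual value minus the value on the line, so a positive residual means the point lies above the line; the names points, predicted, and residuals are illustrative.

```python
# Data from the presentation: (t, p), with t in years since 1994
# and p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predicted(t):
    """Value on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Residual = actual value minus the value on the line.
residuals = [p - predicted(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t = {t:2d}: actual {p:.2f}, predicted {predicted(t):.4f}, residual {res:+.4f}")
```

Some residuals are positive and some are negative, which is why simply adding them up is not a useful measure of fit: points above and below the line cancel each other out.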
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
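For the curious, here is a minimal sketch of the least-squares calculation, using the standard closed-form formulas for the slope and intercept of the line that minimizes the sum of the squared residuals. The PDF may present the formulas differently, and the function name least_squares_line is an assumption made for this example. You are not required to do this by hand in the course.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Apartment data from the presentation.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

slope, intercept = least_squares_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, this gives the same equation as the one shown on the graph, p = 0.1264t + 0.2229.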
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
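One way to see what the r-squared value measures is to compute it from the residuals. The sketch below uses one common definition, 1 minus the ratio of the sum of squared residuals to the total squared variation of the data about its mean, and applies it to the apartment-price data with the regression equation p = 0.1264t + 0.2229. The function name r_squared is only illustrative, and Excel's internal calculation is not shown in this presentation.

```python
# Apartment data from the presentation.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

r2 = r_squared(t_values, p_values, slope=0.1264, intercept=0.2229)
print(f"R^2 = {r2:.3f}")
```

The result is close to 1, and it agrees with squaring the r = 0.9734 given on the graph, up to rounding.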
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 31
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ against time t in
years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Scatter plot with the fitted line: price p in millions of $
against time t in years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the fitted line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the fitted line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\Delta y}{\Delta x} \), the change in y divided by the change in x.
In this case we are using p and t, so it’s
\( \frac{\Delta p}{\Delta t} \).
So for our problem, we have
\( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it overestimates
the price by $31,300, or about 3%.
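The arithmetic above can be checked with a short script. This is only a sketch in Python, not part of the course materials: the function name `predict_price` is made up for illustration, and the coefficients are the rounded values from the regression equation.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (1994)


def predict_price(year):
    """Return the predicted average price in dollars for a calendar year."""
    t = year - 1994                      # t = 0 represents 1994
    p_millions = SLOPE * t + INTERCEPT   # the regression equation
    return round(p_millions * 1_000_000)


# Actual prices from the table for two of the years, in dollars.
actual = {1994: 380_000, 2000: 950_000}

for year in (1994, 2000, 2008):
    line = f"{year}: predicted ${predict_price(year):,}"
    if year in actual:
        line += f", actual ${actual[year]:,}"
    print(line)
```

The year 2008 has no actual price to compare against because it lies outside the table; the prediction there depends entirely on the trend continuing.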
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line, with price p (in millions of $) plotted against time t (in years since 1994).]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression
line predicts for that point.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
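To make the idea concrete, the residual of each data point can be computed directly. The sketch below is an illustration in Python, not part of the course materials. It uses the six (t, p) points from the table and the rounded regression equation, and it assumes the common sign convention that a residual is the actual value minus the predicted value, so a point above the line has a positive residual.

```python
# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Price predicted by the regression equation p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (signed vertical distance).
residuals = [p - predicted(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, predicted {predicted(t):.4f}, "
          f"residual {res:+.4f} (point is {side} the line)")
```

Some residuals are positive and some are negative, which is why the points sit on both sides of the line.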
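There is a method that allows us to minimize the sum of
the squares of all of the residuals. The residuals are
squared so that positive and negative residuals do not
cancel each other out.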
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
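For readers curious about the least-squares method, here is a minimal sketch in Python of the standard closed-form formulas for the slope and intercept of a least-squares line. It is an illustration only and is not required for the course; the function name `least_squares_line` is hypothetical. Applied to the six data points from the table, it should reproduce the slope and intercept of the regression equation to four decimal places.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line that minimizes the
    sum of squared residuals for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The six (t, p) data points from the table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```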
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
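As an illustration, the correlation coefficient for the apartment data can be computed with the usual Pearson formula. This Python sketch is not part of the course materials, and the function name `correlation` is chosen here for illustration. For the six data points from the table, the result should agree with the value shown on the graph to three decimal places. The second calculation pairs the same times with the prices in reverse order, an artificial data set made up here only to show what a downward trend does to the sign of r.

```python
import math


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# The six (t, p) data points from the table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(data)
print(f"r = {r:.3f}")

# Artificial example: the same prices in reverse order fall over time,
# so the correlation coefficient becomes negative.
times = [t for t, _ in data]
prices = [p for _, p in data]
flipped = list(zip(times, reversed(prices)))
print(f"r for the reversed prices = {correlation(flipped):.3f}")
```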
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
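One common way to define the r-squared value is as 1 minus the ratio of the sum of squared residuals to the total variation of the data about its mean; for a least-squares line this equals the square of r. The Python sketch below, which is an illustration and not part of the course materials, computes it that way for the six data points from the apartment table. All variable names are chosen here for illustration.

```python
# The six (t, p) data points from the table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Least-squares slope and intercept for these points.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Total variation of the prices about their mean, and the part of it
# left unexplained by the line (the sum of squared residuals).
ss_total = sum((p - mean_p) ** 2 for _, p in data)
ss_residual = sum((p - (slope * t + intercept)) ** 2 for t, p in data)

r_squared = 1 - ss_residual / ss_total
print(f"r squared = {r_squared:.3f}")
```

The closer the sum of squared residuals is to zero, the closer r squared is to 1.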
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a line, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Slide 33
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot with the best-fit line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
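The course uses Excel for this step. As a cross-check only, here is a minimal sketch in Python that fits a line to the six data points listed earlier, assuming the standard least-squares formulas for slope and intercept. The function name fit_line is an illustrative choice and does not come from the course materials.

```python
# Data points (t, p) from the text: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation shown on the graph.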
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is the change in y
divided by the change in x, \( \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
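The "per year" reading of the slope can be checked numerically. The sketch below, in Python, uses the rounded coefficients from the text and a hypothetical helper named predict. It shows that the predicted price rises by the same amount between any two consecutive years.

```python
def predict(t):
    """Predicted average price in millions of $, t years after 1994."""
    return 0.1264 * t + 0.2229

# The year-over-year change is the slope, whichever year we start from.
for year in (1996, 2000, 2003):
    t = year - 1994
    change = predict(t + 1) - predict(t)
    print(f"{year} to {year + 1}: ${change * 1_000_000:,.0f}")
```

Each line of output should show the same increase of about $126,400, which is the slope converted from millions of dollars.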
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
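The two calculations above can be reproduced with a short Python sketch. It assumes the rounded coefficients from the text, and the names predict_price and error_2000 are illustrative only.

```python
SLOPE = 0.1264      # millions of $ per year (rounded, from the text)
INTERCEPT = 0.2229  # millions of $ at t = 0, that is, in 1994

def predict_price(year):
    """Predicted average price in millions of $ for a calendar year."""
    t = year - 1994          # convert the year to years since 1994
    return SLOPE * t + INTERCEPT

forecast_2008 = predict_price(2008)   # outside the data range (extrapolation)
model_2000 = predict_price(2000)      # inside the data range
error_2000 = model_2000 - 0.95        # table value for 2000 is $0.95 million

print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 model value: ${model_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${error_2000 * 1_000_000:,.0f}")
```

The 2008 figure is an extrapolation beyond the last data point (2004), so it depends on the trend continuing. The 2000 figure can be compared directly with the table.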
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
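To make the residual idea concrete, here is a small sketch in Python. It is not required for the course. It assumes the usual form of the least-squares criterion, which minimizes the sum of the squared residuals, and it uses the six data points from the apartment example. The function names are illustrative.

```python
# Data points (t, p) from the text: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def residuals(slope, intercept):
    """Actual price minus predicted price at each data point."""
    return [y - (slope * x + intercept) for x, y in zip(t, p)]

def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the line p = slope*t + intercept."""
    return sum(r ** 2 for r in residuals(slope, intercept))

# Least-squares slope and intercept from the standard formulas.
n = len(t)
mean_t, mean_p = sum(t) / n, sum(p) / n
best_slope = (sum((x - mean_t) * (y - mean_p) for x, y in zip(t, p))
              / sum((x - mean_t) ** 2 for x in t))
best_intercept = mean_p - best_slope * mean_t
best = sum_of_squares(best_slope, best_intercept)

# Nudging the line up, down, steeper, or flatter should make the total larger.
nudged = [sum_of_squares(best_slope + ds, best_intercept + di)
          for ds, di in [(0.01, 0), (-0.01, 0), (0, 0.05), (0, -0.05)]]

print("residuals:", [round(r, 4) for r in residuals(best_slope, best_intercept)])
print("sum of squared residuals:", round(best, 4))
print("all nudged lines are worse:", all(s > best for s in nudged))
```

The plain residuals of the best-fit line add up to essentially zero because the positive and negative ones cancel. That is why the method works with their squares instead.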
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
On the chart, Excel will not give the value of r; instead, it
gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
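The relationship between r and r squared can also be checked on the apartment data. The following Python sketch assumes the standard Pearson formula for the correlation coefficient, and the function name correlation is illustrative.

```python
from math import sqrt

# Data points (t, p) from the text: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(t, p)
r_squared = r ** 2

# Flipping the sign of every price flips the sign of r (a downward trend)
# but leaves r squared unchanged.
r_flipped = correlation(t, [-y for y in p])

print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
print(f"flipped data: r = {r_flipped:.3f}, r squared = {r_flipped ** 2:.3f}")
```

The value of r should come out close to the r = 0.9734 shown on the graph. The flipped data illustrates that r squared measures only how good the linear fit is, not the direction of the slope.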
Consider the following set of data points.
[Scatter plot of the data points]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Same scatter plot with the regression line: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot of the data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Same scatter plot with the regression line: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 34
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \(\frac{0.1264}{1}\).
Recall that the definition of slope is
\(\frac{\Delta y}{\Delta x}\), the change in y divided by the change in x.
In this case we are using p and t, so it’s
\(\frac{\Delta p}{\Delta t}\).
So for our problem, we have
\(\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}\).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
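As a quick check of this interpretation, here is a minimal Python sketch. Python is not part of the original slides, and the names SLOPE, INTERCEPT, and predicted_price are illustrative assumptions. It evaluates the regression equation at times one year apart and shows that every one-year step changes the predicted price by the slope.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# with p in millions of dollars and t in years since 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price over each one-year step from t=0 to t=10.
yearly_changes = [predicted_price(t + 1) - predicted_price(t) for t in range(10)]

# Convert the slope from millions of dollars per year to dollars per year.
dollars_per_year = round(SLOPE * 1_000_000)

print([round(change, 4) for change in yearly_changes])
print(dollars_per_year)
```

Every entry in the list matches the slope, which is what "increases by about $126,400 per year" means for a linear model.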
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could substitute
14 for t in the regression equation (since t=0 is 1994 and
2008-1994=14).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
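The two calculations above can be reproduced with a short Python sketch. The function name predicted_price_for_year and the constant BASE_YEAR are illustrative assumptions, not something from the original presentation. The sketch converts a calendar year to t and then evaluates the regression equation.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predicted_price_for_year(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


# Forecast outside the data range (the table stops at 2004).
price_2008 = predicted_price_for_year(2008)

# Check against a year that is in the table: the actual 2000 price was 0.95.
price_2000 = predicted_price_for_year(2000)
actual_2000 = 0.95
gap_2000 = price_2000 - actual_2000

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${gap_2000 * 1_000_000:,.0f}")
```

The 2008 value is an extrapolation, so it is only as reliable as the assumption that the 1994 to 2004 trend continued.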
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line drawn through the data points.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
predicts at that point.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
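To make the idea concrete, the following Python sketch computes the residual at each of the six data points from the table. It assumes the usual sign convention (actual price minus predicted price), which the slides do not state, and the variable names are illustrative only.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Regression equation from the slides: p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value predicted by the line (assumed convention).
# A positive residual means the point lies above the line.
residuals = [p - predicted_price(t) for t, p in data]

for (t, p), residual in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted_price(t):.4f}  residual={residual:+.4f}")
```

Some residuals are positive and some are negative, because the line passes above some points and below others.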
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly
complicated, you will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
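For readers who are curious, here is a minimal Python sketch of the least-squares calculation, using the standard closed-form formulas for a line with one input variable. This is not the Excel procedure used in the course, and the function names fit_line and sum_squared_residuals are illustrative assumptions. It also compares the result with the line that simply connects the first and last data points.

```python
# Data from the table: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = fit_line(ts, ps)
best_sse = sum_squared_residuals(slope, intercept, ts, ps)

# A competing line: the one through the first and last data points.
two_point_slope = (ps[-1] - ps[0]) / (ts[-1] - ts[0])
two_point_sse = sum_squared_residuals(two_point_slope, ps[0], ts, ps)

print(round(slope, 4), round(intercept, 4))
print(round(best_sse, 4), round(two_point_sse, 4))
```

The fitted slope and intercept round to the coefficients shown on the graph, and the two-point line has a larger sum of squared residuals, which is why it is not the line of best fit.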
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
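The slides do not show how r is calculated. As a hedged illustration, the Python sketch below uses the standard formula for the Pearson correlation coefficient on the apartment data. The function name correlation_coefficient is an illustrative assumption.

```python
from math import sqrt

# Data from the table: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = correlation_coefficient(ts, ps)

# Reversing the prices makes them fall as t rises, so r changes sign.
r_reversed = correlation_coefficient(ts, list(reversed(ps)))

print(round(r, 3), round(r_reversed, 3))
```

The first value comes out at about 0.973, in line with the r shown on the graph. The second is negative, since the reversed prices decrease over time.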
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
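As a sketch of how the two numbers relate, the Python snippet below squares the r from the graph and also computes r squared directly from the residuals of the regression line, as 1 minus the ratio of the squared residuals to the total squared variation in price. That second formula is a standard definition assumed here and is not stated in the slides. The variable names are illustrative.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Regression equation from the slides: p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


prices = [p for _, p in data]
mean_price = sum(prices) / len(prices)

# Squared residuals around the regression line.
ss_res = sum((p - predicted_price(t)) ** 2 for t, p in data)
# Squared deviations around the mean price (total variation).
ss_tot = sum((p - mean_price) ** 2 for p in prices)

r_squared = 1 - ss_res / ss_tot

# The correlation coefficient reported on the graph, squared.
r_from_graph = 0.9734
r_from_graph_squared = r_from_graph ** 2

print(round(r_squared, 3), round(r_from_graph_squared, 3))
```

Both routes give nearly the same value, close to 1, which matches the good visual fit of the line to the apartment data.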
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the graph shows y = 0.5091x + 1.94 with
R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the graph shows y = -0.183x + 8.3267 with
R² = 0.0084.
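The data behind these two graphs are not given in the slides, so the Python sketch below uses two small made-up data sets, invented purely for illustration, to show the same contrast: one set that stays close to a line and one that is scattered. The function name r_squared and both data sets are assumptions.

```python
def r_squared(xs, ys):
    """r squared for a least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


xs = list(range(1, 11))

# Hypothetical data that stay close to the line y = 0.5x + 2.
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1, 5.4, 6.1, 6.4, 7.1]

# Hypothetical data that jump around with no clear trend.
scattered = [9, 3, 15, 6, 12, 2, 14, 5, 11, 8]

good_fit = r_squared(xs, nearly_linear)
poor_fit = r_squared(xs, scattered)

print(round(good_fit, 4), round(poor_fit, 4))
```

The first value is close to 1 and the second is close to 0, mirroring the two graphs above.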
So, to summarize, a linear regression equation describes the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given in the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 36
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Listed like this, the points don’t give
much indication of a pattern, although we can see that the
p-values increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points; price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, now also labeled r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
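The numbers on the graph can also be checked outside of Excel. The following is a minimal Python sketch, not the method used in this course: it assumes the six (t, p) points listed earlier and applies the standard least-squares formulas for the slope, the intercept, and the correlation coefficient. The function names fit_line and correlation are made up for this illustration.

```python
import math

# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correlation(xs, ys):
    """Return the correlation coefficient r of the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


slope, intercept = fit_line(t_values, p_values)
r = correlation(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept round to the 0.1264 and 0.2229 shown on the graph, and r comes out at about 0.973.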
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = (change in y)/(change in x).
In this case we are using p and t, so it’s
slope = (change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994,
and 2008 - 1994 = 14).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
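The same plug-in calculations can be written as a short Python sketch. It assumes the rounded regression equation p = 0.1264t + 0.2229 from the graph, and the helper name predict_price is made up for this illustration. It converts a calendar year to t before applying the equation.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


# Actual prices (millions of dollars) taken from the table for two of the years.
actual = {1994: 0.38, 2000: 0.95}

for year in (1994, 2000, 2008):
    predicted = predict_price(year)
    line = f"{year}: predicted ${predicted * 1_000_000:,.0f}"
    if year in actual:
        line += f", actual ${actual[year] * 1_000_000:,.0f}"
    print(line)
```

The year 2008 has no actual value to compare against, because it lies outside the range of the table.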
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
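To make the idea concrete, here is a small Python sketch that computes the residual for each of the six data points, assuming the rounded regression equation p = 0.1264t + 0.2229 from the graph. It uses the common sign convention of actual value minus predicted value, which is an assumption here; the slides only describe the residual as a vertical distance. The function name residual is made up for this illustration.

```python
# Data points (t, p) from the table and the regression equation from the graph.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Actual price minus the price on the regression line at the same t."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]
for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: residual = {res:+.4f} (point is {side} the line)")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.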
There is a method that allows us to minimize the sum of
the squares of all of the residuals. Squaring keeps positive
and negative residuals from canceling each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
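One way to see why the regression line is the “best” line is to compare it with a few other lines. The Python sketch below assumes the six data points from the table and the rounded regression equation from the graph. The three other lines are invented only for comparison, and the function name sum_squared_residuals is made up for this illustration.

```python
# Data points (t, p) from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept):
    """Sum of the squared vertical distances from the points to the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# The regression line from the graph, plus three other lines chosen
# only for comparison (they are illustrative, not from the slides).
candidates = {
    "regression line": (0.1264, 0.2229),
    "steeper slope": (0.1400, 0.2229),
    "flatter slope": (0.1100, 0.2229),
    "higher intercept": (0.1264, 0.3000),
}

totals = {name: sum_squared_residuals(m, b) for name, (m, b) in candidates.items()}
for name, total in totals.items():
    print(f"{name}: {total:.4f}")
```

Of the four lines tried here, the regression line gives the smallest total. The least-squares formulas find the line for which this total is as small as it can be.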
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit
with the data, so there is little or no linear relationship
between the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, but it will only be between 0 and 1, so it
does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look
like with an r-squared value close to 1 and with an
r-squared value close to 0.
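The link between r and r squared can be checked with a short Python sketch. It assumes the six data points from the table for the rising case. The falling case is hypothetical: it simply lists the same prices in reverse order so that the trend slopes downward. The function name correlation is made up for this illustration.

```python
import math

t_values = [0, 2, 4, 6, 8, 10]
rising = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]  # prices from the table
falling = rising[::-1]  # hypothetical: the same prices in reverse order


def correlation(xs, ys):
    """Correlation coefficient r, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_rising = correlation(t_values, rising)
r_falling = correlation(t_values, falling)

for label, r in (("rising", r_rising), ("falling", r_falling)):
    print(f"{label}: r = {r:+.4f}, r squared = {r * r:.4f}")
```

The two values of r have opposite signs, but the two r-squared values are the same. The r-squared value measures how good the linear fit is without recording the direction of the slope.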
Consider the following set of data points.
[Scatter plot: data points on axes running from 0 to 12 horizontally and from 0 to 8 vertically]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same plot with the regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot: data points on axes running from 0 to 12 horizontally and from 0 to 20 vertically]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same plot with the regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 37
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( m = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s
\( m = \frac{\Delta p}{\Delta t} \).
So for our problem, we have
\( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to be around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
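The two substitutions above can be checked with a few lines of Python. This is only a sketch: the function name `predict_price` is made up for illustration, and the coefficients are the rounded values from the regression equation, so results are good to about four decimal places.

```python
# Coefficients of the regression equation p = 0.1264t + 0.2229 (rounded, from the slides).
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (1994)

def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - 1994          # t counts years since 1994
    return SLOPE * t + INTERCEPT

p_2008 = predict_price(2008)     # t = 14, outside the data range
p_2000 = predict_price(2000)     # t = 6, a year that appears in the table
actual_2000 = 0.95               # table value for 2000, in millions of dollars

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${(p_2000 - actual_2000) * 1_000_000:,.0f}")
```

For 2000 the model overshoots the table value by about $31,300, which is what "pretty close" means here.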
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot with the regression line.]
A residual is the difference between the actual value of a
data point and the value the regression line predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with one data point boxed for a closer look.]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
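To make the idea concrete, the residual of each of the six data points can be computed directly. The sketch below assumes the usual sign convention (actual value minus the value on the line), which the slides do not state, and uses the rounded coefficients of the regression equation.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def line(t):
    """Regression line from the slides, with rounded coefficients."""
    return 0.1264 * t + 0.2229

# Residual = actual value minus the value on the line at the same t (assumed convention).
residuals = [p - line(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, line {line(t):.4f}, residual {res:+.4f} ({side} the line)")

# Positive and negative residuals nearly cancel, so their plain sum says little.
print(f"sum of residuals: {sum(residuals):+.4f}")
```

Two points sit above the line and four sit below it, and the signed residuals add up to almost zero. That cancellation is why a fitting method works with squared residuals rather than the raw values.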
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
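For readers who are curious, here is a minimal sketch of the least-squares calculation on the six data points. The formulas are the standard textbook ones for a line that minimizes the sum of squared residuals; they are not taken from the course PDF, and the function names are made up for illustration.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def least_squares_line(data):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Slope = sum of (t - mean_t)(p - mean_p) divided by sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t   # the line passes through (mean_t, mean_p)
    return slope, intercept

def sum_squared_residuals(data, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)

slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# A different line, here one with a slightly steeper slope, gives a larger sum.
best = sum_squared_residuals(points, slope, intercept)
other = sum_squared_residuals(points, slope + 0.01, intercept)
print(best < other)
```

Rounded to four decimal places, the result matches the regression equation p = 0.1264t + 0.2229 used throughout.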
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
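As a sketch of where r comes from, the standard Pearson correlation formula can be applied to the six apartment-price data points. The function name `correlation` and the sign-flip demonstration are illustrative additions, not part of the course material.

```python
import math

# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def correlation(data):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    s_yy = sum((y - mean_y) ** 2 for _, y in data)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(points)
print(f"r = {r:.4f}")

# Flipping the sign of every p value reverses the trend, so r changes sign.
r_flipped = correlation([(t, -p) for t, p in points])
print(f"r for the downward-sloping version = {r_flipped:.4f}")
```

The value comes out at about 0.973, in line with the r shown on the graph. The flipped data give the same magnitude with a negative sign, which is the "negatively sloping" case described above.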
On the chart, Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
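One common way to read r squared is as the share of the variation in the data that the line accounts for. The sketch below applies that reading to the apartment-price data, using the rounded regression coefficients, so the result is approximate. The variable names are illustrative.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = 0.1264, 0.2229   # rounded regression coefficients from the slides

mean_p = sum(p for _, p in points) / len(points)

# Variation left over after fitting the line (sum of squared residuals).
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
# Total variation of the prices around their mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
print(f"r-squared = {r_squared:.4f}")
print(f"r = {r_squared ** 0.5:.4f}")   # positive root, since the slope is positive
```

Here r squared is roughly 0.95, so the line accounts for about 95% of the variation in price, and its square root agrees with the r of about 0.973 on the graph.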
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a rising straight line, shown with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, shown with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 38
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn through the data.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( m = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s
\( m = \frac{\Delta p}{\Delta t} \).
So for our problem, we have
\( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to be around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot with the regression line.]
A residual is the difference between the actual value of a
data point and the value the regression line predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with one data point boxed for a closer look.]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
On the chart, Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a rising straight line, shown with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range,
assuming the current trend continues.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to 0
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 39
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) versus time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Scatter plot with the regression line drawn over the data points: price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ versus time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
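For readers who would like to check the equation outside Excel, here is a minimal Python sketch. It assumes the regression line is the ordinary least-squares line and uses the standard textbook formulas for its slope and intercept. The function name `fit_line` and the variable names are choices made for this sketch, not part of the course materials.

```python
# Data from the presentation: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph.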
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ versus time t in years since 1994]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
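The two calculations above can be repeated for any year with a few lines of Python. This is only a sketch: the function name `predict_price` and the constant names are assumptions made here, and the slope and intercept are the rounded values from the regression equation.

```python
SLOPE = 0.1264       # millions of $ per year, from the regression equation
INTERCEPT = 0.2229   # millions of $ at t = 0 (the year 1994)

def predict_price(year):
    """Predicted average price in millions of $ for a given calendar year."""
    t = year - 1994          # t counts years since 1994
    return SLOPE * t + INTERCEPT

for year in (2000, 2008):
    price_millions = predict_price(year)
    print(year, f"${price_millions * 1_000_000:,.0f}")

# Compare the 2000 prediction with the actual table value of $0.95 million.
difference = predict_price(2000) - 0.95
print(f"2000 prediction is off by about ${difference * 1_000_000:,.0f}")
```

The further the year is from the 1994 to 2004 data, the more the result depends on the assumption that the trend continues.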
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
usually give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the regression line: price p in millions of $ versus time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Scatter plot with the regression line: price p in millions of $ versus time t in years since 1994]
A residual is the difference between the actual value at a
particular data point and the value the regression line
predicts for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line and a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
[Figure: zoomed-in view of the boxed data point and the regression line]
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. (The residuals are
squared first because positive and negative residuals would
otherwise cancel each other out.)
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
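Although the formulas are not required, the idea can be illustrated numerically. The Python sketch below computes the residuals of the regression line p = 0.1264t + 0.2229 and compares its sum of squared residuals with that of two other lines. The two alternative lines are hypothetical and chosen only for comparison, and the function names are assumptions of this sketch.

```python
# Data from the presentation: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def residuals(slope, intercept):
    """Actual price minus the price predicted by the line, for each point."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]

def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the given line."""
    return sum(e ** 2 for e in residuals(slope, intercept))

best = sum_of_squares(0.1264, 0.2229)      # regression line from the slides
steeper = sum_of_squares(0.1500, 0.2229)   # hypothetical line, larger slope
higher = sum_of_squares(0.1264, 0.3000)    # hypothetical line, shifted upward

print([round(e, 4) for e in residuals(0.1264, 0.2229)])
print(best, steeper, higher)
```

The residuals of the regression line are a mix of positive and negative values that nearly cancel, which is why they are squared before being added. Both alternative lines give a larger total than the regression line.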
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit, so
there is little or no linear relationship between the
variables in this case.
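As a check on the value shown in the graph, the correlation coefficient for the apartment data can be computed directly. The sketch below uses the standard formula for r (an assumption here, since the presentation does not state a formula), and the variable names are choices made for this sketch.

```python
import math

# Data from the presentation: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
mean_t = sum(t_values) / n
mean_p = sum(p_values) / n

# Sums of products and squares of the deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(t_values, p_values))
s_tt = sum((t - mean_t) ** 2 for t in t_values)
s_pp = sum((p - mean_p) ** 2 for p in p_values)

r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

The result should land very near the r = 0.9734 shown on the graph; a difference in the last digit can come from rounding. A value this close to 1 is consistent with the strong upward linear trend in the scatter plot.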
On a chart, Excel does not display the value of r; instead,
it displays the value of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it no longer shows whether
the slope is positive or negative.
If the r-squared value is close to 1, the line is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor
fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 8, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
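The same contrast can be reproduced with a short Python sketch. The two data sets below are made up for illustration (the actual points behind the two charts are not given in this presentation). The function name `r_squared` is an assumption, as is the choice to compute r squared as 1 minus the ratio of the squared residuals to the total squared deviation from the mean.

```python
def r_squared(xs, ys):
    """R squared of the least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared residuals around the fitted line vs. around the mean of y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5, 6]
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1]   # hypothetical, close to a line
scattered = [5, 1, 9, 2, 8, 4]                   # hypothetical, no clear trend

print("nearly linear:", r_squared(xs, nearly_linear))
print("scattered:    ", r_squared(xs, scattered))
```

With these hypothetical values, the first result comes out close to 1 and the second close to 0, matching the pattern in the two charts.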
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range,
assuming the current trend continues.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to 0
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 40
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
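One way to see both observations at once is to compute the rate of change between each pair of neighboring points. The Python sketch below does this for the six points listed above; the variable names are choices made for this sketch.

```python
# The six (t, p) points: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

rates = []
for (t0, p0), (t1, p1) in zip(points, points[1:]):
    # Slope of the line segment joining two neighboring points.
    rate = (p1 - p0) / (t1 - t0)
    rates.append(rate)
    print(f"{1994 + t0} to {1994 + t1}: {rate:+.3f} million $ per year")

print("always increasing:", all(rate > 0 for rate in rates))
```

Every rate is positive, so the prices do increase, but the rates differ from one interval to the next. The points therefore do not lie on a single line, which is why a best-fitting line is needed.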
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) versus time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Scatter plot with the regression line drawn over the data points: price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ versus time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ versus time t in years since 1994]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So, more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
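The two calculations above can be sketched in Python as follows. The helper predict_price_for_year is a hypothetical name of our own; it converts a calendar year to t (years since 1994) before applying the regression equation.

```python
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars
BASE_YEAR = 1994    # the year that corresponds to t = 0


def predict_price_for_year(year):
    """Predicted average price (millions of dollars) for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast beyond the data: 2008 corresponds to t = 14.
forecast_2008 = predict_price_for_year(2008)

# Check against a year in the table: 2000 corresponds to t = 6.
predicted_2000 = predict_price_for_year(2000)
actual_2000 = 0.95
difference_2000 = predicted_2000 - actual_2000

print(f"2008 forecast:   {forecast_2008:.4f} million")
print(f"2000 prediction: {predicted_2000:.4f} million")
print(f"2000 difference: {difference_2000:.4f} million")
```

For 2000 the model is about $31,300 above the table value, which is small compared with a price near $950,000.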
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the same scatter plot and regression line, with one data point boxed for a closer look.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
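To make the idea concrete, here is a Python sketch that computes the residual for each of the six data points listed in this presentation. We assume the common sign convention "actual value minus predicted value", so a positive residual means the point lies above the line; the vertical distance described above is the absolute value of this number.

```python
SLOPE = 0.1264
INTERCEPT = 0.2229

# The six (t, p) data points from the presentation:
# t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict_price(t):
    """Value of p on the regression line at time t."""
    return SLOPE * t + INTERCEPT


# Residual = actual p minus the p predicted by the line (assumed sign convention).
residuals = [p - predict_price(t) for t, p in points]

for (t, p), residual in zip(points, residuals):
    position = "above" if residual > 0 else "below"
    print(f"t={t:>2}: actual={p:.2f}, predicted={predict_price(t):.4f}, "
          f"residual={residual:+.4f} ({position} the line)")
```

Two of the points lie above the line and four lie below it, and none of the residuals is larger than about 0.16 in size.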
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
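For the curious, here is a minimal Python sketch of a least-squares fit applied to the six data points from this presentation. It uses the standard textbook formulas for the slope and intercept of a least-squares line; these are not taken from the course PDF, and the function names are our own. The sketch also compares the sum of squared residuals for the fitted line against two slightly different lines.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of the squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line should give a larger sum of squared residuals.
best = sum_squared_residuals(points, slope, intercept)
steeper = sum_squared_residuals(points, slope + 0.01, intercept)
higher = sum_squared_residuals(points, slope, intercept + 0.05)
print(f"fitted line: {best:.4f}, steeper line: {steeper:.4f}, higher line: {higher:.4f}")
```

Rounded to four decimal places, the fitted slope and intercept agree with the equation p = 0.1264t + 0.2229 used throughout. This is also the answer to "why not some other line?": nudging the line in any direction increases the sum of squared residuals.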
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
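As an illustration, the following Python sketch computes the correlation coefficient for the six apartment-price data points from this presentation. It uses the standard formula for the (Pearson) correlation coefficient, which we assume is the r reported on the graph; the function name is our own.

```python
import math

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(points)
print(f"r = {r:.4f}")

# Reversing the direction of the trend flips the sign of r but not its size.
falling_points = [(t, -p) for t, p in points]
r_falling = correlation_coefficient(falling_points)
print(f"r for the falling version = {r_falling:.4f}")
```

The result is close to 1, in line with the value of about 0.97 shown on the graph, and the mirrored data set gives the same value with a negative sign.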
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, but not whether the line slopes up or down,
since it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
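The sketch below, using the six apartment-price data points from this presentation, computes r squared in two ways: by squaring r, and by comparing the squared residuals of the least-squares line with the total variation in p. The second formula is the standard definition of r squared for a fitted line and is an addition of ours, not something stated in the presentation.

```python
import math

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Way 1: square the correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)
r_squared_from_r = r ** 2

# Way 2: 1 - (sum of squared residuals) / (total variation in p),
# using the least-squares line fitted to the same points.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t
ss_residual = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
r_squared_from_fit = 1 - ss_residual / s_pp

print(f"r squared from r:        {r_squared_from_r:.4f}")
print(f"r squared from the fit:  {r_squared_from_fit:.4f}")
```

Both routes give the same number, a little under 0.95, which is close to 1 and so indicates a very good linear fit.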
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 41
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
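The six points can be laid out again as a small table with a short Python sketch. The calendar years are worked out from t using the fact that t=0 represents 1994; the variable names are our own.

```python
BASE_YEAR = 1994  # the year that corresponds to t = 0

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

years = [BASE_YEAR + t for t, _ in points]
prices = [p for _, p in points]

# Print one row per data point: year, t, p, and p written out in dollars.
print("year   t    p     price")
for year, (t, p) in zip(years, points):
    print(f"{year}  {t:>2}  {p:.2f}  ${p * 1_000_000:,.0f}")

# Check the observation that p increases every time t increases.
is_increasing = all(earlier < later for earlier, later in zip(prices, prices[1:]))
print("p increases as t increases:", is_increasing)
```

Listing the points this way confirms that the price rises at every step, but it still does not show how close to a straight line the points are. That is what the scatter plot is for.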
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the same scatter plot with the best-fitting line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Scatter plot with trendline: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
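To see the same contrast in numbers, here is a minimal Python sketch that computes the r-squared value for one nearly linear data set and one scattered data set. Both data sets are made up for this illustration (the points plotted on the slides are not given), and the function name r_squared is hypothetical.

```python
# Illustrative sketch: r-squared for a nearly linear data set and a scattered one.
# Both data sets below are made up for this example; they are NOT the points
# plotted on the slides.

def r_squared(xs, ys):
    """Fit the least-squares line, then return 1 - SS_res / SS_tot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Squared vertical distances from the fitted line, and from the mean of y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

x = list(range(1, 11))
y_linear = [2.4, 3.1, 3.4, 4.1, 4.5, 4.9, 5.6, 6.0, 6.4, 7.1]
y_scattered = [12, 3, 15, 6, 9, 2, 14, 5, 11, 8]

r2_linear = r_squared(x, y_linear)
r2_scattered = r_squared(x, y_scattered)
print(f"nearly linear: r-squared = {r2_linear:.3f}")
print(f"scattered:     r-squared = {r2_scattered:.3f}")
```

The first value comes out close to 1 and the second close to 0, which is the same pattern as in the two charts.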
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 42
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)           0     2     4     6     8    10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot of the data with the regression line drawn through the points; price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
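Although this course uses Excel for the calculation, the same line can be recovered from the six data points with the standard least-squares formulas. The Python sketch below is only an illustration of that idea; the function name fit_line and the variable names are hypothetical and not part of the course materials.

```python
# Illustrative sketch: least-squares line through the six (t, p) data points.
# fit_line is a hypothetical helper name, not something from the course.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # prints: p = 0.1264t + 0.2229
```

Rounded to four decimal places, the result agrees with the equation shown on the chart.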
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$m = \frac{\text{change in } y}{\text{change in } x} = \frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only $31,300, or about 3%.
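The two calculations above can be sketched in a few lines of Python. This is only an illustration that uses the regression equation from the slides; the function name predict_price and the constant names are hypothetical.

```python
# Illustrative sketch: using the regression equation from the slides to predict prices.
# predict_price is a hypothetical helper name.

SLOPE = 0.1264       # millions of $ per year
INTERCEPT = 0.2229   # millions of $ at t = 0 (1994)

def predict_price(year):
    """Predicted average price in millions of $ for a calendar year."""
    t = year - 1994          # t = 0 represents 1994
    return SLOPE * t + INTERCEPT

p_2008 = predict_price(2008)   # beyond the data range (t = 14)
p_2000 = predict_price(2000)   # a year that is in the table (t = 6)
actual_2000 = 0.95             # actual price from the table, in millions of $

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")                    # $1,992,500
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")                    # $981,300
print(f"2000 difference: ${(p_2000 - actual_2000) * 1_000_000:,.0f}")    # $31,300
```

Converting the year to t inside the function keeps the "t=0 is 1994" convention in one place.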
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot of the six data points with the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of a single data point and the nearby regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
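As a rough illustration, the residuals for the six apartment data points can be listed with a short Python sketch. It uses the regression equation from the slides, takes each residual as the actual price minus the predicted price, and the helper name predicted is hypothetical.

```python
# Illustrative sketch: residuals (actual minus predicted) for the six data points,
# using the regression equation given on the slides.

t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def predicted(t_value):
    return 0.1264 * t_value + 0.2229

residuals = [actual - predicted(ti) for ti, actual in zip(t, p)]

for ti, res in zip(t, residuals):
    print(f"t = {ti:2d}: residual = {res:+.4f}")

# For the best-fit line the positive and negative residuals nearly cancel.
print(f"sum of residuals = {sum(residuals):+.4f}")
```

Some points sit above the line (positive residual) and some below it (negative residual), which is why the line does not pass through every point.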
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
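For reference only, the standard least-squares formulas for a line p = mt + b can be written as follows. This is the usual textbook form, which may be laid out differently in the course PDF, and the bar notation for averages is introduced here just for this illustration.

```latex
% Least-squares slope m and intercept b for the line p = mt + b,
% where \bar{t} and \bar{p} are the averages of the t- and p-values.
\[
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
\]
% For the six apartment data points, \bar{t} = 5 and \bar{p} = 0.855, so
\[
m = \frac{8.85}{70} \approx 0.1264,
\qquad
b = 0.855 - \frac{8.85}{70} \cdot 5 \approx 0.2229
\]
```

These values match the regression equation p = 0.1264t + 0.2229 given for the apartment data.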
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
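As a hedged illustration, the correlation coefficient for the apartment data can be computed with the standard formula for r. In the Python sketch below the function name correlation is hypothetical. The unrounded result is about 0.97345, which agrees with the r = 0.9734 on the chart up to rounding in the last digit.

```python
# Illustrative sketch: correlation coefficient r for the six data points.
# correlation is a hypothetical helper name.
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

r = correlation(t, p)
print(f"r = {r:.3f}")   # prints: r = 0.973

# Listing the same prices in reverse order gives a falling trend:
# r changes sign but keeps the same size.
r_down = correlation(t, list(reversed(p)))
print(f"r for the falling trend = {r_down:.3f}")   # prints: -0.973
```

The second calculation shows the sign of r following the direction of the slope, as described above.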
On a chart trendline, Excel does not display the value of r;
instead, it displays the value of r squared.
The r-squared value tells us much the same thing about how
closely the line fits, but it will only be between 0 and 1, so
it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
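For the apartment data, the r-squared value can also be computed directly from the residuals, as one minus the sum of squared residuals divided by the total squared variation of the prices around their average. The Python sketch below assumes that standard definition and uses the regression equation from the slides; the names are hypothetical.

```python
# Illustrative sketch: r-squared for the apartment data, computed as
# 1 - (sum of squared residuals) / (total sum of squares).

t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def predicted(t_value):
    return 0.1264 * t_value + 0.2229   # regression equation from the slides

mean_p = sum(p) / len(p)
ss_res = sum((actual - predicted(ti)) ** 2 for ti, actual in zip(t, p))
ss_tot = sum((actual - mean_p) ** 2 for actual in p)

r_squared = 1 - ss_res / ss_tot
print(f"r-squared      = {r_squared:.2f}")       # prints: 0.95
print(f"0.9734 squared = {0.9734 ** 2:.2f}")     # prints: 0.95
```

The two printed values agree, so squaring the correlation coefficient from the chart gives the r-squared value for this line.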
Consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Scatter plot with trendline: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Scatter plot with trendline: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 43
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)           0     2     4     6     8    10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot of the data with the regression line drawn through the points; price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$m = \frac{\text{change in } y}{\text{change in } x} = \frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict the
price of an apartment in 2008, we could plug in 14 for t
in the regression equation (since t=0 is 1994, and
2008 - 1994 = 14).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close (it is off by
$31,300, or about 3%).
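As a quick check of the arithmetic above, here is a minimal Python sketch that evaluates the regression equation for a given calendar year. The function name predict_price and the constant names are illustrative choices, not part of the course materials; the coefficients and the table value for 2000 are the ones quoted in the slides.

```python
# Coefficients of the regression line p = 0.1264t + 0.2229 from the slides.
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars at t = 0 (1994)
BASE_YEAR = 1994    # the year that t = 0 represents


def predict_price(year):
    """Return the model's predicted price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR  # convert the calendar year to years since 1994
    return SLOPE * t + INTERCEPT


forecast_2008 = predict_price(2008)  # a year beyond the data in the table
fitted_2000 = predict_price(2000)    # a year that appears in the table
actual_2000 = 0.95                   # table value for 2000, in millions of dollars

print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 fitted:   ${fitted_2000 * 1_000_000:,.0f}")
print(f"2000 error:    ${(fitted_2000 - actual_2000) * 1_000_000:,.0f}")
```

Converting the calendar year to t inside the function keeps the "t=0 is 1994" convention in one place, so the caller cannot plug in the wrong value of t.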
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line drawn through the data points.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value on the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box drawn around one data point.]
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
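To make the idea concrete, here is a small Python sketch that computes the residual for each of the six data points in the apartment example. The slides describe the residual as a vertical distance; the sign convention used here (actual value minus the value on the line) is a common one but is an assumption, as are the names DATA and residual.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression line p = 0.1264t + 0.2229 from the slides.
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Signed vertical gap: actual price minus the price on the regression line."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in DATA]

for (t, p), res in zip(DATA, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}  (point is {position} the line)")
```

Some points sit above the line and some below it, so the signed residuals largely cancel each other out when added together.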
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
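For readers who are curious, the following Python sketch shows one way the least-squares line can be computed for the apartment data. It uses the standard textbook closed-form formulas for the slope and intercept, which may not match the notation in the course PDF; the function names are illustrative. It also compares the sum of squared residuals for the fitted line against one arbitrarily chosen alternative line (slope 0.13, intercept 0.20).

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sums of products of deviations from the means.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical gaps between each point and the given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(DATA)
print(f"p = {slope:.4f}t + {intercept:.4f}")

best = sum_squared_residuals(DATA, slope, intercept)
other = sum_squared_residuals(DATA, 0.13, 0.20)  # an arbitrary nearby line
print(f"fitted line: {best:.5f}   other line: {other:.5f}")
```

Squaring the residuals keeps positive and negative gaps from cancelling, which is why the method minimizes the sum of squares rather than the plain sum.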
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
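As an illustration, the Python sketch below computes the correlation coefficient for the six apartment data points using the standard Pearson formula. The function name is an illustrative choice, and the slides do not say how the value on the graph was calculated, so the result should only be expected to come out close to the r = 0.9734 shown there.

```python
import math

# Data points (t, p) from the slides: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(DATA)
print(f"r = {r:.4f}")
```

The sign of r matches the sign of the slope, so a positive value here agrees with prices that rise over time.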
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how
good the fit is (though not whether the slope is positive or
negative), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
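The sketch below shows one common way to compute an r-squared value, applied to the apartment data and the regression line from the slides: 1 minus the ratio of the sum of squared residuals to the total squared deviation of the prices from their mean. How Excel computes the value internally is not described in the slides, so this definition is an assumption; for a least-squares line it agrees with the square of r.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression line p = 0.1264t + 0.2229 from the slides (rounded coefficients).
SLOPE = 0.1264
INTERCEPT = 0.2229


def r_squared(points, slope, intercept):
    """Share of the variation in y that the line accounts for: 1 - SS_res / SS_tot."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


r2 = r_squared(DATA, SLOPE, INTERCEPT)
print(f"r squared = {r2:.4f}")
```

Because squaring removes the sign, this value cannot tell us on its own whether the line slopes up or down.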
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a straight line with positive slope.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of data points spread widely across the graph.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 44
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t     Price p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Figure: the scatter plot and regression line, labeled with p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
Slide 45
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):       0     2     4     6     8     10
p (price in millions of $): 0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
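If you want to experiment with these numbers outside of Excel, here is a minimal Python sketch of the same six data points. The variable names (t_values, p_values, and so on) are our own choices for illustration and do not come from the course materials.

```python
# Data from the text: t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

# Each data point has the form (t, p).
points = list(zip(t_values, p_values))

# Convert each t back to a calendar year (t = 0 represents 1994).
years = [1994 + t for t in t_values]

# Check the observation that the p-values increase as t increases.
is_increasing = all(a < b for a, b in zip(p_values, p_values[1:]))

for year, (t, p) in zip(years, points):
    print(f"{year}: t = {t:2d}, p = {p:.2f} million dollars")
print("p increases with t:", is_increasing)
```

The last line only confirms that each price is larger than the one before it. It says nothing yet about whether the increase follows a straight line.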
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit labeled with its equation and correlation coefficient]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
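The two calculations above can be checked with a short Python sketch. It uses the rounded regression equation p=0.1264t+0.2229 from the graph. The function name predict_price and the constant names are hypothetical, chosen only for this illustration.

```python
SLOPE = 0.1264      # millions of dollars per year (from the regression equation)
INTERCEPT = 0.2229  # millions of dollars (from the regression equation)
BASE_YEAR = 1994    # t = 0 represents 1994


def predict_price(year):
    """Return the predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)  # t = 14, outside the data range
price_2000 = predict_price(2000)  # t = 6, inside the data range
actual_2000 = 0.95                # table value for 2000, in millions of dollars

print(f"Predicted for 2008: {price_2008:.4f} million dollars")
print(f"Predicted for 2000: {price_2000:.4f} million dollars")
print(f"Predicted minus actual for 2000: {price_2000 - actual_2000:.4f} million dollars")
```

Keep in mind that 2008 lies outside the years in the table, so that prediction assumes the trend kept going. The 2000 value can be compared directly with the table.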
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it can
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line and a box around one data point]
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
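To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six apartment-price data points, using the rounded regression equation p=0.1264t+0.2229. We take the residual to be the actual value minus the predicted value. That sign convention is an assumption on our part, since the slides only describe the residual as a vertical distance.

```python
# Data from the text: t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predicted(t):
    """Price predicted by the rounded regression equation, in millions of dollars."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus predicted value (assumed sign convention).
residuals = [p - predicted(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, predicted {predicted(t):.4f}, "
          f"residual {res:+.4f} (point is {side} the line)")

print(f"Sum of the residuals: {sum(residuals):+.4f}")
```

Some points sit above the line and some sit below it, so the positive and negative residuals nearly cancel when they are added together. This is one reason a fitting method works with the sizes of the residuals rather than their plain sum.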
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
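You will not need the formulas in this course. For the curious, here is a minimal Python sketch of the standard closed-form least-squares formulas for a line, applied to the apartment-price data. The function names are our own, and the course PDF may present the formulas differently. For comparison, the sketch also checks a second line, the one through the first and last data points, to show that it has a larger sum of squared residuals.

```python
# Data from the text: t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(t_values, p_values)
best_total = sum_squared_residuals(t_values, p_values, slope, intercept)

# A competing line for comparison: the line through the first and last points.
two_point_slope = (p_values[-1] - p_values[0]) / (t_values[-1] - t_values[0])
two_point_intercept = p_values[0]
two_point_total = sum_squared_residuals(
    t_values, p_values, two_point_slope, two_point_intercept
)

print(f"Least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"Sum of squared residuals, least-squares line: {best_total:.4f}")
print(f"Sum of squared residuals, two-point line:     {two_point_total:.4f}")
```

The slope and intercept should agree with the equation shown on the graph after rounding to four decimal places.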
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
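As an illustration, the correlation coefficient for the apartment-price data can be computed with a few lines of Python. This sketch uses the standard formula for the Pearson correlation coefficient. We are assuming that this is the r reported on the graph, and the function name correlation is our own.

```python
import math

# Data from the text: t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Return the Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(t_values, p_values)
print(f"r = {r:.4f}")

# Reversing the order of the prices turns the upward trend into a downward one,
# so the sign of r flips while its size stays the same.
r_reversed = correlation(t_values, list(reversed(p_values)))
print(f"r for the reversed prices = {r_reversed:.4f}")
```

The first value should come out very close to the r = 0.9734 shown on the graph. A small difference in the last digit can come from rounding.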
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
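Here is a hedged Python sketch of how r squared relates to r for the apartment-price data. It computes r squared two ways: by squaring r, and as 1 minus the ratio of the squared residuals to the total squared variation in p. The second formula is a standard definition for a least-squares line. That Excel arrives at its value this way is an assumption, not something stated in the course materials.

```python
# Data from the text: t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
mean_t = sum(t_values) / n
mean_p = sum(p_values) / n

s_tt = sum((t - mean_t) ** 2 for t in t_values)
s_pp = sum((p - mean_p) ** 2 for p in p_values)  # total squared variation in p
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(t_values, p_values))

# Least-squares slope and intercept for these data.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Route 1: square the correlation coefficient.
r = s_tp / (s_tt * s_pp) ** 0.5
r_squared_from_r = r ** 2

# Route 2: 1 minus (sum of squared residuals / total squared variation).
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(t_values, p_values))
r_squared_from_fit = 1 - sse / s_pp

print(f"r squared (from r):         {r_squared_from_r:.4f}")
print(f"r squared (from residuals): {r_squared_from_fit:.4f}")
```

Both routes should give the same number, a little under 0.95, which is close to 1 and matches the strong linear pattern in the scatter plot.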
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression line is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression line is y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the equation of a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 46
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
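The same substitution can be sketched in Python. The names SLOPE, INTERCEPT, BASE_YEAR, and predict_price are assumptions made for this illustration; the numbers come from the regression equation p = 0.1264t + 0.2229.

```python
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars when t = 0
BASE_YEAR = 1994    # the year that t = 0 represents


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR  # convert the year into years since 1994
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)
print(f"t = {2008 - BASE_YEAR}, p = {price_2008:.4f}")  # t = 14, p = 1.9925
```

Keep in mind that 2008 lies outside the 1994 to 2004 range of the table, so this value is only meaningful if the trend really did continue.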
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by
about $31,300.
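The same check can be run for every year in the table. The sketch below is one possible way to do it in Python; the (t, p) pairs are the six data points from the presentation, while the names DATA, predict, and comparison are assumptions made for this example.

```python
# (t, p) data points from the presentation: t is years since 1994,
# p is the average price in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict(t):
    """Price in millions of dollars predicted by p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# For each year, keep the actual price, the predicted price, and their difference.
comparison = {
    1994 + t: (actual, predict(t), actual - predict(t)) for t, actual in DATA
}

for year, (actual, predicted, difference) in comparison.items():
    print(f"{year}: actual {actual:.4f}, predicted {predicted:.4f}, "
          f"difference {difference:+.4f}")
```

For 2000 the difference is about -0.0313, which means the model is about $31,300 higher than the actual price.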
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression
line gives for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the same scatter plot and regression line, with a box drawn around one data point.]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
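One common way to measure how small the residuals are is to square each one and add the squares, since residuals of opposite sign would otherwise cancel each other. The Python sketch below, offered only as an illustration, compares the regression line with one other candidate line. The alternative line through the first and last data points is a hypothetical choice made for this example, and the function name is an assumption.

```python
# (t, p) data points from the presentation.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_of_squared_residuals(slope, intercept, points):
    """Add up (actual p - predicted p)^2 over all of the data points."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# The regression line from the presentation: p = 0.1264t + 0.2229.
regression_total = sum_of_squared_residuals(0.1264, 0.2229, DATA)

# A hypothetical alternative: the line through the first and last points,
# (0, 0.38) and (10, 1.60), which has slope 0.122 and intercept 0.38.
two_point_total = sum_of_squared_residuals(0.122, 0.38, DATA)

print(f"regression line: {regression_total:.4f}")
print(f"two-point line:  {two_point_total:.4f}")
```

The regression line gives a total of about 0.062, compared with about 0.173 for the two-point line, even though the two-point line passes exactly through two of the data points.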
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
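For readers who are curious, here is a minimal Python sketch of a least-squares fit. It uses the standard textbook formulas for a line fitted to (t, p) data: the slope is the sum of (t - mean t)(p - mean p) divided by the sum of (t - mean t)^2, and the intercept is mean p minus the slope times mean t. These formulas are not quoted from the course PDF, and the function name least_squares_line is an assumption made for this example.

```python
# (t, p) data points from the presentation.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # How t and p vary together, and how t varies on its own.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = least_squares_line(DATA)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # p = 0.1264t + 0.2229
```

The result agrees with the regression equation p = 0.1264t + 0.2229 used throughout this presentation.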
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
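The correlation coefficient can also be computed directly from the data. The sketch below uses the standard Pearson formula, which is an assumption about how r is defined since the presentation does not give a formula; the function name correlation_coefficient is likewise only illustrative.

```python
import math

# (t, p) data points from the presentation.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (t, p) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation_coefficient(DATA)
print(f"r = {r:.3f}")  # r = 0.973
```

The value is positive and close to 1, in line with the r = 0.9734 shown on the graph and with the upward slope of the regression line.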
On a chart trendline, Excel does not display the value of r;
instead, it displays the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
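To see how r and r squared are related, here is a small Python sketch that squares the r = 0.9734 from the apartment example. The variable names are assumptions made for this illustration.

```python
import math

r = 0.9734            # correlation coefficient from the apartment example
r_squared = r ** 2    # the value reported as r squared

print(f"r squared = {r_squared:.4f}")  # r squared = 0.9475

# Squaring removes the sign, so r and -r lead to the same r-squared value.
same_after_squaring = math.isclose((-r) ** 2, r_squared)
print(same_after_squaring)
```

This is why an r-squared value is always between 0 and 1, and why the sign of the slope has to be read from the regression equation instead.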
Consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression equation is y = 0.5091x + 1.94,
with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression equation is y = -0.183x + 8.3267,
with R² = 0.0084.
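The two situations above can be imitated with made-up numbers. The data points behind the two graphs are not given in the presentation, so both lists in this Python sketch are hypothetical and will not reproduce the R² values shown on the graphs. The function r_squared is also an assumption for this example; it squares the standard Pearson correlation coefficient.

```python
def r_squared(points):
    """Square of the correlation coefficient for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data, NOT the points from the slides: close to y = 0.5x + 2.
nearly_linear = [(1, 2.6), (2, 2.9), (3, 3.6), (4, 3.9), (5, 4.6),
                 (6, 4.9), (7, 5.6), (8, 5.9), (9, 6.6), (10, 6.9)]

# Hypothetical data, NOT the points from the slides: no clear pattern.
scattered = [(1, 3), (2, 15), (3, 7), (4, 1), (5, 12),
             (6, 18), (7, 4), (8, 9), (9, 16), (10, 2)]

print(f"nearly linear: {r_squared(nearly_linear):.4f}")
print(f"scattered:     {r_squared(scattered):.4f}")
```

The first list gives an r-squared value above 0.99, and the second gives a value below 0.01, which matches the pattern seen in the two graphs.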
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 47
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t     p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
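The step from calendar years to (t, p) points can be sketched in Python as follows. The prices are the six values listed above; the years are inferred from t=0 representing 1994, and the names PRICES_BY_YEAR, BASE_YEAR, points, and is_increasing are assumptions made for this example.

```python
# Average prices in millions of dollars, keyed by year (years inferred from
# t = 0 representing 1994 and the t values 0, 2, 4, 6, 8, 10).
PRICES_BY_YEAR = {1994: 0.38, 1996: 0.40, 1998: 0.60,
                  2000: 0.95, 2002: 1.20, 2004: 1.60}
BASE_YEAR = 1994

# Convert each year into t, the number of years since 1994.
points = [(year - BASE_YEAR, price)
          for year, price in sorted(PRICES_BY_YEAR.items())]

# Check that each price is larger than the one before it.
prices = [p for _, p in points]
is_increasing = all(earlier < later
                    for earlier, later in zip(prices, prices[1:]))

print(points)
print(is_increasing)  # True
```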
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit drawn through the points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
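The gap between the predicted and actual 1994 prices can be worked out with a few lines of Python. This is only an illustrative sketch, and the variable names are assumptions; the two prices are the ones quoted above.

```python
INTERCEPT = 0.2229    # predicted price at t = 0, in millions of dollars
ACTUAL_1994 = 0.38    # actual 1994 price from the table, in millions of dollars

# Convert both prices to whole dollars before comparing them.
predicted_dollars = round(INTERCEPT * 1_000_000)
actual_dollars = round(ACTUAL_1994 * 1_000_000)
gap_dollars = actual_dollars - predicted_dollars

print(predicted_dollars)  # 222900
print(actual_dollars)     # 380000
print(gap_dollars)        # 157100
```

So for 1994 the model is about $157,100 below the actual price, which is a larger gap than the one for 2000.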
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 48
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
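The substitutions above can also be reproduced with a short script. This is only an illustrative sketch in Python (the course itself uses Excel); the function name predict_price is made up for this example, and the coefficients are the rounded values from the regression equation.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (1994)


def predict_price(year):
    """Return the model's predicted price in millions of dollars."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


# Forecast beyond the data range (t = 14).
print(f"2008: {predict_price(2008):.4f}")

# Compare with a year that is in the table (t = 6, actual price 0.95).
predicted_2000 = predict_price(2000)
print(f"2000: {predicted_2000:.4f}, off by {predicted_2000 - 0.95:.4f}")
```

Working in calendar years and converting to t inside the function avoids having to remember that t=0 means 1994.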
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
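To make the idea concrete, the residual of every data point can be computed directly. The sketch below is in Python rather than Excel and uses the six points from the table with the rounded regression equation. The sign convention (actual value minus value on the line) is an assumption, since the slides only describe the residual as a vertical distance.

```python
# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Price on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus the price on the line (assumed convention).
residuals = [p - predicted(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, line {predicted(t):.4f}, "
          f"residual {res:+.4f} ({side} the line)")
```

A positive residual means the data point lies above the line, and a negative one means it lies below. With these rounded coefficients the residuals very nearly cancel when added together, so their plain sum says little about how well the line fits.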
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
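For readers curious about what happens behind the scenes, here is a minimal sketch of the standard closed-form least-squares formulas, written in Python. It is not taken from the course PDF, and the function name fit_line is invented for illustration. The formulas choose the slope and intercept that make the sum of the squared residuals as small as possible.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Apartment data from the table: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Run on the six data points, this reproduces the regression equation p = 0.1264t + 0.2229 shown on the graph.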
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
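As an illustration, the correlation coefficient for the apartment data can be checked from its usual definition (the Pearson correlation coefficient). The Python sketch below is an assumption about how one might verify the value by hand; it is not the procedure used in the course, which relies on Excel.

```python
import math

# Apartment data from the table: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
mean_t = sum(t) / n
mean_p = sum(p) / n

# Sums of squared deviations and cross-deviations from the means.
stt = sum((ti - mean_t) ** 2 for ti in t)
spp = sum((pi - mean_p) ** 2 for pi in p)
stp = sum((ti - mean_t) * (pi - mean_p) for ti, pi in zip(t, p))

r = stp / math.sqrt(stt * spp)   # correlation coefficient, between -1 and 1
r_squared = r ** 2               # the squared value, between 0 and 1

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```

For this data set r comes out at roughly 0.973, in line with the value shown on the graph, and r squared is roughly 0.948. Both are close to 1, which matches the clear upward linear trend in the scatter plot.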
Consider the following set of data points.
[Figure: scatter plot of points lying close to a line, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 49
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
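The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name predict_price and the constant names are ours, while the slope, intercept, and the 2000 table value come from the slides.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(year):
    """Return the model's predicted price (millions of $) for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


p_2008 = predict_price(2008)  # outside the range of the table
p_2000 = predict_price(2000)  # a year that appears in the table
actual_2000 = 0.95            # table value for 2000, in millions of $

print(f"2008 prediction: ${p_2008 * 1e6:,.0f}")
print(f"2000 prediction: ${p_2000 * 1e6:,.0f}")
print(f"2000 prediction minus actual: ${(p_2000 - actual_2000) * 1e6:,.0f}")
```

For 2000 the model is about $31,300 above the table value, which is what "pretty close" means here.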
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
A residual is the difference between the actual value at a
particular data point and the value that the regression
line gives at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line with a box drawn around one data point. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
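To make the idea concrete, here is a small sketch that computes the residual at each of the six data points from the table. The sign convention (actual value minus the value on the line) is an assumption on our part, since the slides only describe the vertical distance, and the helper name value_on_line is ours.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def value_on_line(t, slope=0.1264, intercept=0.2229):
    """Height of the regression line p = 0.1264t + 0.2229 at time t."""
    return slope * t + intercept


# Residual = actual value minus the value on the line (assumed sign convention).
# A positive residual means the point lies above the line.
residuals = [p - value_on_line(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.4f}  on line={value_on_line(t):.4f}  residual={res:+.4f}")

print(f"sum of residuals: {sum(residuals):+.4f}")
```

Some residuals are positive and some are negative, and for the best-fit line they very nearly cancel when added up. That is one reason the size of the residuals is usually measured by squaring them first.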
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
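For the curious, the following sketch applies the standard textbook least-squares formulas for a line to the six data points. It is not taken from the course PDF, and the function names are ours. The slope is the sum of the products of the deviations of t and p from their means, divided by the sum of the squared deviations of t; the intercept then makes the line pass through the point of means.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Compare with an arbitrary nearby line (chosen only for illustration).
best = sum_squared_residuals(points, slope, intercept)
other = sum_squared_residuals(points, 0.12, 0.25)
print(f"best-fit line: {best:.5f}   other line: {other:.5f}")
```

The fitted coefficients round to the equation shown on the slides, and any other line we try gives a larger sum of squared residuals.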
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
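As an illustration, the sketch below computes r for the apartment data using the standard (Pearson) formula. The slides do not state the formula, so treat this as an assumption about how the number on the graph was obtained; the function name is ours.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(points)
print(f"r = {r:.4f}")

# Flipping the sign of every price turns the upward trend into a downward one,
# so r changes sign but keeps the same size.
r_flipped = correlation([(t, -p) for t, p in points])
print(f"r for the flipped data = {r_flipped:.4f}")
```

The result agrees with the r = 0.9734 shown on the graph to about three decimal places, and the flipped data show how a good fit with a negative slope gives a value near -1.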
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us the same thing about how good
the fit is (though not whether the slope is positive or
negative), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
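For the apartment data we can check the link between r and r squared directly. The sketch below squares the r value from the graph and compares it with r squared computed from the residuals as 1 minus (sum of squared residuals) divided by (sum of squared deviations of p from its mean). That second formula is the usual definition of r squared and is an assumption here, since the slides do not give one.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE, INTERCEPT = 0.1264, 0.2229  # regression equation from the slides
R_FROM_GRAPH = 0.9734              # correlation coefficient shown on the graph

mean_p = sum(p for _, p in points) / len(points)

# Sum of squared residuals around the regression line.
ss_res = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in points)
# Sum of squared deviations of the prices from their mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared_from_residuals = 1 - ss_res / ss_tot
r_squared_from_r = R_FROM_GRAPH ** 2

print(f"r squared from residuals: {r_squared_from_residuals:.4f}")
print(f"r squared from r:         {r_squared_from_r:.4f}")
```

The two values agree up to rounding in the numbers taken from the slides, and both are close to 1, consistent with the good linear fit seen on the graph.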
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points that follow a clear linear pattern, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 50
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn through the data. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
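A quick numerical check of this interpretation of the slope: whichever year we start from, moving t forward by 1 raises the predicted p by the same amount, 0.1264. The function name price is ours; the coefficients are those of the regression equation p = 0.1264t + 0.2229.

```python
SLOPE = 0.1264      # slope of the regression equation
INTERCEPT = 0.2229  # p-intercept of the regression equation


def price(t):
    """Predicted price in millions of $, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price when t increases by 1, for several starting values of t.
increases = [price(t + 1) - price(t) for t in range(0, 11)]

for t, increase in zip(range(0, 11), increases):
    print(f"from t={t:2d} to t={t + 1:2d}: p increases by {increase:.4f}")
```

Every line of the output shows the same increase, which is exactly what it means for the slope to be the change in p per unit change in t.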
Slide 51
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994 and the price p is in millions of dollars.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the same scatter plot with the best-fitting line drawn through the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: scatter plot with the regression line, labeled p = 0.1264t + 0.2229. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, now also labeled r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot with regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is the change in y divided by the change in x, $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
prices in years beyond the table. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only about $31,300.
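The two calculations above can also be checked with a few lines of code. What follows is only a small Python sketch, not part of the course material: the function name predict_price and the constant names are made up for illustration, and the coefficients are the rounded ones from the regression equation p = 0.1264t + 0.2229.

```python
# Rounded coefficients of the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (the year 1994)


def predict_price(year):
    """Return the predicted price in millions of dollars for a calendar year."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


p_2008 = predict_price(2008)  # t = 14, outside the range of the table
p_2000 = predict_price(2000)  # t = 6, a year that appears in the table

print(f"2008 prediction: {p_2008:.4f} million dollars")
print(f"2000 prediction: {p_2000:.4f} million dollars (table value: 0.95)")
```

The 2008 value lies outside the years covered by the table, so it is only meaningful if the trend really did continue.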
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with one data point singled out]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
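To make the idea concrete, here is a short Python sketch that computes the residual at each of the six data points from the apartment example. It assumes the usual sign convention (actual value minus the value predicted by the line), which the slides do not state explicitly, and the variable names are illustrative only.

```python
# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Rounded coefficients of the regression equation p = 0.1264t + 0.2229.
SLOPE, INTERCEPT = 0.1264, 0.2229

# Residual = actual price minus the price predicted by the line.
# A positive residual means the point lies above the line.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}")
```

Some residuals are positive and some are negative, which is why the line passes between the points rather than through all of them.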
There is a method that finds the line that minimizes the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
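For anyone curious about what goes on behind the scenes, the following Python sketch applies the standard least-squares formulas for a line to the six apartment-price points. It is not required for the course, it is not a description of how Excel is implemented, and the function name fit_line is just an illustrative choice.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of dollars

slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation p = 0.1264t + 0.2229 shown on the graph.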
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit to the
data, so there is little or no linear relationship between the
variables in this case.
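As an illustration, the correlation coefficient for the apartment data can be computed directly. The Python sketch below uses the standard formula for the (Pearson) correlation coefficient, which the slides do not write out, and the function name correlation is only illustrative.

```python
import math


def correlation(xs, ys):
    """Return the correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of dollars

r = correlation(t, p)
print(f"r = {r:.4f}")
```

The result should come out very close to the value of about 0.973 shown on the graph. It is positive and near 1, as expected for points that rise along a nearly straight line.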
Excel will not display the value of r on the chart; instead, it displays the value
of r squared.
The r-squared value tells us essentially the same thing, but it
is always between 0 and 1, so it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
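For a straight-line fit like the ones here, the r-squared value is simply the square of r, so r = 0.9734 for the apartment data corresponds to an r-squared of about 0.9475. The Python sketch below checks this from a different direction, using the common definition of r-squared as 1 minus the ratio of the sum of squared residuals to the total variation of the prices about their mean. That definition and the variable names are assumptions made for illustration; they do not come from the slides.

```python
t = [0, 2, 4, 6, 8, 10]                    # years since 1994
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of dollars

n = len(t)
mean_t = sum(t) / n
mean_p = sum(p) / n

# Least-squares line through the data.
sxy = sum((x - mean_t) * (y - mean_p) for x, y in zip(t, p))
sxx = sum((x - mean_t) ** 2 for x in t)
slope = sxy / sxx
intercept = mean_p - slope * mean_t

# r-squared = 1 - (sum of squared residuals) / (total sum of squares).
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(t, p))
ss_tot = sum((y - mean_p) ** 2 for y in p)
r_squared = 1 - ss_res / ss_tot

print(f"r-squared = {r_squared:.4f}")
```

A value this close to 1 says that almost all of the variation in price is accounted for by the straight-line trend.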
Consider the following set of data points.
[Figure: scatter plot of data points lying close to a straight line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same points with trendline y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same points with trendline y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 52
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300.
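The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function name predict_price and the constant names are our own, and the slope and intercept are the rounded values from the regression equation p = 0.1264t + 0.2229.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # the year that corresponds to t = 0


def predict_price(year):
    """Return the predicted average price (in millions of $) for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


price_2008 = predict_price(2008)  # a year beyond the table (t = 14)
price_2000 = predict_price(2000)  # a year that appears in the table (t = 6)

print(f"2008: {price_2008:.4f} million")
print(f"2000: {price_2000:.4f} million (table value: 0.95)")
```

Converting the year to t inside the function keeps the "t=0 is 1994" bookkeeping in one place.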
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line.
Horizontal axis: Time t in years since 1994. Vertical axis: Price p
in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
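To make the idea of a residual concrete, here is a small Python sketch that computes one for every data point in the table. The sign convention (actual value minus predicted value) is the usual one but is an assumption here, since the slides only describe the vertical distance. The names points, line, and residuals are our own.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def line(t):
    """Value on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value on the line (a signed vertical gap).
# Positive means the point sits above the line, negative means below it.
residuals = [p - line(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  on line={line(t):.4f}  residual={res:+.4f}")
```

Some residuals come out positive and some negative, because the line passes between the points rather than through them.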
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
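For the curious, the least-squares calculation can be sketched in Python using the standard textbook formulas for the slope and intercept. This is not the course's required method (Excel is), and the function names below are our own. The data are the six points from the table.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(pts):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_p = sum(p for _, p in pts) / n
    # Slope = (how t and p vary together) / (how much t varies on its own).
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in pts)
    s_tt = sum((t - mean_t) ** 2 for t, _ in pts)
    slope = s_tp / s_tt
    # The least-squares line always passes through the point of averages.
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(pts, m, b):
    """Sum of squared vertical gaps between the points and the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in pts)


slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line, such as one with a slightly steeper slope, does worse.
best = sum_squared_residuals(points, slope, intercept)
nudged = sum_squared_residuals(points, slope + 0.01, intercept)
print(best < nudged)
```

Rounded to four decimal places, the slope and intercept agree with the regression equation p = 0.1264t + 0.2229 used throughout.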
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
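As a rough illustration, the correlation coefficient can be computed with the standard (Pearson) formula. The sketch below applies it to the apartment data and then to a mirrored copy of the data with falling prices, which we made up only to show the sign change. The function name correlation is our own.

```python
import math


def correlation(pts):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pts)
    s_yy = sum((y - mean_y) ** 2 for _, y in pts)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data points (t, p) from the table: t in years since 1994, p in millions of $.
prices = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(prices)

# Hypothetical mirrored data (prices falling instead of rising): same fit, opposite sign.
falling = [(t, -p) for t, p in prices]
r_falling = correlation(falling)

print(f"rising prices:  r = {r:.3f}")
print(f"falling prices: r = {r_falling:.3f}")
```

For the apartment data, r comes out at about 0.973, in line with the value shown on the graph, and the mirrored data give the same number with a negative sign.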
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
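One common way to define r-squared for a least-squares line is 1 minus the ratio of the leftover (residual) variation to the total variation in the data. The sketch below uses that definition on the apartment data. How Excel computes the number internally is not covered here, so treat this as an assumption-based illustration, with variable names of our own choosing.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares slope and intercept (standard textbook formulas).
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t

# Variation left over after fitting the line, and total variation in p.
ss_residual = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
ss_total = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_residual / ss_total
print(f"r-squared = {r_squared:.4f}")
```

For these data the result is roughly 0.95, which is the square of the correlation coefficient r of about 0.973.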
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points, with the horizontal
axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline on the graph is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points, with the
horizontal axis running from 0 to 12 and the vertical axis from
0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline on the graph is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 53
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis:
Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn through
the data points. Horizontal axis: Time t in years since 1994.
Vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: the scatter plot and regression line, labeled with the
equation and the correlation coefficient. Horizontal axis: Time t
in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
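A quick numerical check of this interpretation, written as a short Python sketch: stepping t forward by 1 changes the predicted price by exactly the slope. The function name predicted_price and the choice of t = 4 and t = 5 are ours, picked only for illustration.

```python
def predicted_price(t):
    """Regression equation from the slides: p = 0.1264t + 0.2229 (millions of $)."""
    return 0.1264 * t + 0.2229


# Change in p when t goes from 4 to 5, that is, when t increases by 1.
change_in_p = predicted_price(5) - predicted_price(4)

# p is in millions of dollars, so convert the change to whole dollars.
change_in_dollars = round(change_in_p * 1_000_000)

print(f"change in p per unit of t: {change_in_p:.4f} million")
print(f"in dollars: {change_in_dollars}")
```

The same difference appears for any pair of t-values that are one unit apart, which is what makes the graph a straight line.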
Slide 54
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot with the regression line drawn over the data points: price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
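For readers who want to check these numbers without Excel, here is a minimal Python sketch. It assumes the six (t, p) data points quoted earlier and uses the standard closed-form formulas for a best-fit line; the function name fit_line is our own and does not come from the course materials.

```python
import math

# Data points (t, p) quoted in the slides:
# t is years since 1994, p is the average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the best-fit line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept should agree with the equation on the graph to four decimal places, and r should come out within rounding of the value shown there.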
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by $31,300, or about 3%.
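The two calculations above can be repeated for any year with a short Python sketch. The coefficients are the ones from the regression equation in the slides; the function name predict_price is our own choice for illustration.

```python
# Coefficients taken from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars at t = 0 (the year 1994)


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - 1994  # convert the calendar year to years since 1994
    return SLOPE * t + INTERCEPT


predicted_2008 = predict_price(2008)  # t = 14, outside the data range
predicted_2000 = predict_price(2000)  # t = 6, inside the data range
error_2000 = predicted_2000 - 0.95    # the table gives 0.95 for the year 2000

print(f"2008 prediction: {predicted_2008:.4f} million")
print(f"2000 prediction: {predicted_2000:.4f} million")
print(f"2000 prediction minus actual: {error_2000:.4f} million")
```

Keep in mind that the 2008 figure is an extrapolation, so it is only meaningful if the trend in the data continued.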
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot of the data points with the regression line: price p in millions of $ versus time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a particular data
point and the value that the regression line predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
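To make the idea concrete, here is a small Python sketch that computes the residual for each of the six data points. It assumes the usual sign convention (actual value minus predicted value), which the slides do not state explicitly, and the function name residual is ours.

```python
# Data points quoted in the slides and the regression equation p = 0.1264t + 0.2229.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
SLOPE, INTERCEPT = 0.1264, 0.2229


def residual(t, p):
    """Signed vertical distance: actual price minus the price on the line."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in zip(t_values, p_values)]

for t, p, e in zip(t_values, p_values, residuals):
    # A positive residual means the point lies above the line.
    print(f"t={t:2d}  actual={p:.2f}  predicted={p - e:.4f}  residual={e:+.4f}")
```

Points above the line have positive residuals and points below it have negative residuals, so some residuals cancel others when they are simply added together.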
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
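Although the formulas are not required, a short Python sketch can show what the least-squares method is comparing. It scores a line by the sum of its squared residuals; the alternative lines below are hypothetical ones we made up for comparison, and the function name sum_of_squares is our own.

```python
# Data points quoted in the slides.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_of_squares(slope, intercept):
    """Sum of squared residuals for the line p = slope * t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2
               for t, p in zip(t_values, p_values))


best = sum_of_squares(0.1264, 0.2229)     # the regression line from the slides
steeper = sum_of_squares(0.1364, 0.2229)  # hypothetical line with a larger slope
higher = sum_of_squares(0.1264, 0.2729)   # hypothetical line shifted upward

# A flat line through the mean price (0.855) has residuals that add up to
# zero, yet it fits badly. This is why the residuals are squared before adding.
flat = sum_of_squares(0.0, 0.855)
flat_plain_sum = sum(p - 0.855 for p in p_values)

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
print(f"flat line:       {flat:.4f} (plain sum of residuals: {flat_plain_sum:.4f})")
```

Each alternative line gives a larger sum of squares than the regression line, which is the sense in which the regression line is the best fit.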
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is, just as r does, but it
will only be between 0 and 1, so it does not show the direction of the slope.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
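Before looking at those examples, here is a minimal Python sketch of the r-squared value for the apartment data. It uses the standard definition of r squared as 1 minus the ratio of the squared residuals to the total squared variation about the mean; that definition is an assumption on our part, since the slides do not give a formula.

```python
# Data points quoted in the slides and the regression equation p = 0.1264t + 0.2229.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
SLOPE, INTERCEPT = 0.1264, 0.2229

mean_p = sum(p_values) / len(p_values)

# Variation left over after fitting the line (sum of squared residuals).
ss_residual = sum((p - (SLOPE * t + INTERCEPT)) ** 2
                  for t, p in zip(t_values, p_values))
# Total variation of the prices about their mean.
ss_total = sum((p - mean_p) ** 2 for p in p_values)

r_squared = 1 - ss_residual / ss_total
r_squared_from_r = 0.9734 ** 2  # squaring the r value shown on the graph

print(f"r squared from the residuals: {r_squared:.4f}")
print(f"r squared from r = 0.9734:    {r_squared_from_r:.4f}")
```

Both calculations give a value of about 0.95, which is close to 1 and consistent with the good linear fit seen in the graph.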
Consider the following set of data points.
[Scatter plot: data points lying close to a straight line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Scatter plot with trendline: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot: widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Scatter plot with trendline: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 55
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line. Vertical axis: Price p in millions of $ (0 to 1.8). Horizontal axis: Time t in years since 1994 (0 to 12).]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the same graph with a box drawn around one data point, followed by a zoomed-in view of that box.]
Zooming into the box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
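To make the idea concrete, here is a small Python sketch that computes the residual at each of the six data points from the apartment example. It assumes the usual sign convention (actual value minus the value on the line), so a positive residual means the point lies above the line; the names data, predicted, and residuals are illustrative only.

```python
# The six (t, p) data points from the apartment example:
# t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus value on the line (signed vertical distance).
residuals = [(t, p - predicted(t)) for t, p in data]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: residual = {res:+.4f} (point is {side} the line)")
```

Some residuals are positive and some are negative, which is why simply adding them up would not be a useful measure of fit.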
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the least-squares formulas can get fairly complicated,
you will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
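For readers who are curious, the following Python sketch shows one common form of the least-squares formulas for a straight line, applied to the six apartment data points. It is not taken from the course PDF, and the function name least_squares_line is an illustrative choice.

```python
# The six (t, p) data points from the apartment example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line that minimizes the
    sum of the squared residuals for the given (t, p) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Slope: sum of (t - mean_t)(p - mean_p) over sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the regression equation p = 0.1264t + 0.2229 used throughout this presentation.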
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
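As an illustration, the correlation coefficient for the six apartment data points can be computed with the standard (Pearson) formula. The sketch below is not part of the course materials, and the function name correlation is an illustrative choice.

```python
import math

# The six (t, p) data points from the apartment example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Return the correlation coefficient r of the (t, p) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(data)
print(f"r = {r:.3f}")
```

The value is positive and close to 1, in line with the r = 0.9734 shown on the graph (up to rounding), so the data follow an increasing linear pattern closely.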
On the chart, Excel does not give the value of r; instead,
it gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
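One way to see what r squared measures is to compare the squared residuals with the overall spread of the data. The Python sketch below does this for the apartment example, using the rounded regression equation from the graph; the variable names ss_res, ss_tot, and r_squared are illustrative, and this is a sketch rather than a description of how Excel computes the value internally.

```python
# The six (t, p) data points from the apartment example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


mean_p = sum(p for _, p in data) / len(data)

# Sum of squared residuals: how far the points are from the line.
ss_res = sum((p - predicted(t)) ** 2 for t, p in data)
# Total sum of squares: how far the points are from their average.
ss_tot = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot
print(f"r squared = {r_squared:.3f}")
```

The smaller the residuals are compared with the overall spread, the closer r squared is to 1. Here it comes out near 0.95, which is close to the square of r = 0.9734.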
We will now look at examples of what the data look like
when the r-squared value is close to 1 and when it is
close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points lying close to a straight line. Vertical axis from 0 to 8; horizontal axis from 0 to 12.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the graph shows y = 0.5091x + 1.94 with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points. Vertical axis from 0 to 20; horizontal axis from 0 to 12.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the graph shows y = -0.183x + 8.3267 with R² = 0.0084.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 56
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Vertical axis: Price p in millions of $ (0 to 1.8). Horizontal axis: Time t in years since 1994 (0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s
$\frac{\Delta p}{\Delta t}$.
So for our problem, we have
$\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
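As a quick check of this interpretation, we can compare the predicted price at any time t with the predicted price one year later. The notation p(t) for the predicted price at time t is introduced here only for this check.

```latex
\begin{align*}
p(t+1) - p(t) &= \bigl(0.1264(t+1) + 0.2229\bigr) - \bigl(0.1264t + 0.2229\bigr) \\
              &= 0.1264
\end{align*}
```

The intercept cancels and the difference does not depend on t, so the predicted increase is the same $0.1264 million for every one-year step.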
Slide 57
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at the numbers like this doesn’t
give much indication of a pattern, although we can see that
the p-values increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the same scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
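For readers who want to check the numbers on the graph without Excel, here is a minimal Python sketch. It assumes the six data points listed earlier and uses the standard least-squares formulas (the method named later in this presentation). The function name fit_line and the variable names are illustrative only and are not part of the course materials.

```python
# Data points (t, p) from the slides:
# t = years since 1994, p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Return (slope, intercept, r) for the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sums of squared deviations and of cross-deviations from the means.
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    r = s_tp / (s_tt * s_pp) ** 0.5
    return slope, intercept, r


slope, intercept, r = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}, r = {r:.4f}")
```

The slope and intercept should agree with the equation shown on the graph to four decimal places, and r should agree to about three decimal places (small differences in the last digit come from rounding).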
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is the change in y
divided by the change in x, \( \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s
\( \frac{\Delta p}{\Delta t} \).
So for our problem, we have
\( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
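The same substitution can be repeated for every year in the table. The short Python sketch below assumes the rounded regression equation p=0.1264t+0.2229 and the six table values quoted earlier; the names predict_price and actual are made up for this illustration.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Actual table values: t (years since 1994) -> price in millions of dollars.
actual = {0: 0.38, 2: 0.40, 4: 0.60, 6: 0.95, 8: 1.20, 10: 1.60}

# Compare the model with each year that appears in the table.
for t, price in actual.items():
    predicted = predict_price(t)
    print(f"{1994 + t}: actual {price:.2f}, predicted {predicted:.4f}, "
          f"difference {price - predicted:+.4f}")

# Forecast outside the data range: 2008 corresponds to t = 14.
forecast_2008 = predict_price(14)
print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
```

The differences printed for the table years are small compared with the prices themselves, which is what "pretty close" means here. The 2008 figure is an extrapolation and only holds if the trend continues.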
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p in millions of $ against time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: the scatter plot and regression line, with a box drawn around one data point]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
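To make the idea concrete, the following Python sketch computes the residual at each of the six data points. It assumes the rounded regression equation p=0.1264t+0.2229 and uses the common sign convention "actual value minus predicted value", so a point above the line has a positive residual. The names predict and residuals are illustrative.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict(t):
    """Height of the regression line at time t (rounded coefficients)."""
    return 0.1264 * t + 0.2229


# Residual = actual p minus the p predicted by the line (signed vertical gap).
residuals = [p - predict(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t = {t:2d}: actual {p:.2f}, on the line {predict(t):.4f}, "
          f"residual {res:+.4f}")

# Positive and negative residuals nearly cancel, so their plain sum is a
# poor measure of fit; the sum of their squares does not cancel.
sum_of_residuals = sum(residuals)
sum_of_squares = sum(res ** 2 for res in residuals)
print(f"sum of residuals: {sum_of_residuals:+.4f}")
print(f"sum of squared residuals: {sum_of_squares:.4f}")
```

Some points lie above the line and some below, so the signed residuals add up to almost zero even though no point sits exactly on the line. This is why the fitting method works with squared residuals.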
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
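For the curious, and not required for the course, the standard least-squares formulas for a line p = mt + b are sketched below. The symbols m, b, \(\bar{t}\), \(\bar{p}\) and n are notation chosen for this illustration and may differ from the notation in the PDF. The method picks the m and b that make the sum of squared residuals S as small as possible.

```latex
\[
S(m, b) = \sum_{i=1}^{n} \bigl(p_i - (m\,t_i + b)\bigr)^2
\]
\[
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}
         {\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
\]
% Worked values for the six apartment data points:
\[
\bar{t} = 5, \quad \bar{p} = 0.855, \quad
m = \frac{8.85}{70} \approx 0.1264, \quad
b = 0.855 - \frac{8.85}{70} \cdot 5 \approx 0.2229
\]
```

These values match the equation p = 0.1264t + 0.2229 shown on the graph.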
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
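The relationship between r and r squared can be seen in a small Python sketch. It assumes the apartment data from earlier and, purely for illustration, a second made-up data set with the same prices in reverse order so that the trend falls instead of rises. The function name correlation is hypothetical.

```python
def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_up = correlation(t, p)          # rising trend: r is positive
r_down = correlation(t, p[::-1])  # same prices reversed: r is negative

print(f"rising data:  r = {r_up:+.4f}, r squared = {r_up ** 2:.4f}")
print(f"falling data: r = {r_down:+.4f}, r squared = {r_down ** 2:.4f}")
```

Both data sets give the same r-squared value, because squaring removes the sign. The r-squared value measures how close the points are to a line, and the sign of the slope in the regression equation tells us whether that line rises or falls.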
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: vertical axis from 0 to 8, horizontal axis from 0 to 12, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: vertical axis from 0 to 20, horizontal axis from 0 to 12, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of a line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 59
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year    t     p (millions of $)
1994    0     0.38
1996    2     0.40
1998    4     0.60
2000    6     0.95
2002    8     1.20
2004    10    1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Figure: the scatter plot and line of best fit, labeled with p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
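As a quick check of this interpretation, here is a minimal Python sketch. Python is not part of the course materials, and the function name predicted_price is made up for illustration. It evaluates the regression equation at consecutive years and confirms that each one-year step changes the predicted price by the slope.

```python
# Regression model from the slides: p = 0.1264 t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Moving forward one year changes the prediction by the slope,
# no matter which year we start from.
yearly_changes = [predicted_price(t + 1) - predicted_price(t) for t in range(10)]

# Convert the slope from millions of dollars per year to dollars per year.
yearly_change_dollars = round(SLOPE * 1_000_000)

print(yearly_changes)
print(yearly_change_dollars)
```

Every entry in yearly_changes agrees with the slope up to floating-point rounding, which is what "p increases by 0.1264 when t increases by 1" means.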
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
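The two calculations above can be repeated for any year. The following is a small Python sketch under the same model. The helper name predict_price_millions and the constant BASE_YEAR are illustrative choices, not something from the course, and the actual 2000 price of 0.95 is the value quoted from the table.

```python
# Regression model from the slides (p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price_millions(year):
    """Convert a calendar year to t and evaluate p = 0.1264 t + 0.2229."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast outside the data range (t = 14).
forecast_2008 = predict_price_millions(2008)

# Check against a year that is in the table (t = 6).
fitted_2000 = predict_price_millions(2000)
actual_2000 = 0.95  # actual price from the table, in millions of dollars
gap_2000 = fitted_2000 - actual_2000

print(f"2008 forecast: ${forecast_2008 * 1e6:,.0f}")
print(f"2000 model value: ${fitted_2000 * 1e6:,.0f}")
print(f"2000 model minus actual: ${gap_2000 * 1e6:,.0f}")
```

The gap for 2000 is about 0.03 million dollars, which is the sense in which the model is "pretty close" for that year.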
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot with the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the nearby part of the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
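To make the idea concrete, here is a short Python sketch that computes the residual for each of the six data points, using the regression equation from earlier. It assumes the common sign convention of actual value minus value on the line, so a positive residual means the point lies above the line. The slides only describe the vertical distance, which is the size of this number without the sign.

```python
# The six (t, p) data points from the table and the fitted line.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Actual price minus the price on the regression line (signed)."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in data]

for (t, p), e in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={e:+.4f}")

# Positive and negative residuals nearly cancel for the fitted line,
# which is one reason the plain sum is not a useful measure of fit.
print(f"sum of residuals: {sum(residuals):+.4f}")
```

Some points sit above the line and some below, so the signed residuals largely cancel each other out when added.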
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
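For readers who are curious, the following Python sketch shows one way the least-squares line can be computed directly. It uses the standard closed-form formulas for a straight-line fit, which may be written differently in the course PDF, and the function names are illustrative. Applied to the six apartment data points, it reproduces the slope and intercept shown on the graph.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """The quantity that the least-squares method makes as small as possible."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line gives a larger sum of squared residuals.
best = sum_squared_residuals(data, slope, intercept)
nudged_slope = sum_squared_residuals(data, slope + 0.01, intercept)
nudged_intercept = sum_squared_residuals(data, slope, intercept + 0.05)
print(best, nudged_slope, nudged_intercept)
```

Changing the slope or the intercept even slightly increases the sum of squared residuals, which is what it means for this line to be the best fit.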
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
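The value of r can be computed directly from the data. The sketch below uses the standard formula for the correlation coefficient in plain Python. The function name correlation_coefficient is an illustrative choice, and the two three-point data sets at the end are made up only to show the extreme values 1 and -1.

```python
import math


def correlation_coefficient(points):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Apartment data from the table: a strong, positively sloped pattern.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation_coefficient(data)
print(f"r = {r:.4f}")

# Made-up points lying exactly on a rising line and on a falling line.
r_rising = correlation_coefficient([(0, 1), (1, 3), (2, 5)])
r_falling = correlation_coefficient([(0, 5), (1, 3), (2, 1)])
print(r_rising, r_falling)
```

For the apartment data the result agrees with the value of about 0.97 shown on the graph, up to rounding in the last digit.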
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
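As an illustration of why r squared stays between 0 and 1, here is a Python sketch that computes it for the apartment data. It fits the least-squares line and then compares the squared residuals about the line with the squared deviations about the mean. For a straight-line least-squares fit this equals the square of r. The function name r_squared and the mirrored data set are illustrative assumptions.

```python
def r_squared(points):
    """R squared of the least-squares line through a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Squared residuals about the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Squared deviations about the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r2 = r_squared(data)
print(f"r squared = {r2:.4f}")

# Flipping the data upside down reverses the sign of the slope (and of r),
# but r squared is unchanged.
mirrored = [(t, -p) for t, p in data]
r2_mirrored = r_squared(mirrored)
print(f"r squared for mirrored data = {r2_mirrored:.4f}")
```

Because squaring removes the sign, r squared shows how good the linear fit is but not whether the line slopes up or down.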
Consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is the
equation of a line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 60
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):  0     2     4     6     8     10
p (millions of $):     0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Listed like this, they don’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Graph: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
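For readers who want to check the numbers outside Excel, here is a minimal sketch in Python. It assumes the standard least-squares formulas for the slope and intercept, which this presentation does not spell out, and it uses the six data points listed earlier. The variable names are our own.

```python
from math import sqrt

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squared deviations and cross-deviations from the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                    # least-squares slope
intercept = p_mean - slope * t_mean    # least-squares vertical intercept
r = s_tp / sqrt(s_tt * s_pp)           # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept should match the equation on the graph, and r should agree with the value shown there up to rounding in the last digit.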
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
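The comparison above can be sketched in a few lines of Python. The function name predicted_price is our own, chosen for illustration; the coefficients are the ones from the regression equation.

```python
def predicted_price(t):
    """Regression model: price in millions of dollars, t in years since 1994."""
    return 0.1264 * t + 0.2229

predicted_1994 = predicted_price(0)  # the p-intercept: model value at t = 0
actual_1994 = 0.38                   # table value for 1994

print(f"predicted: ${predicted_1994 * 1_000_000:,.0f}")
print(f"actual:    ${actual_1994 * 1_000_000:,.0f}")
print(f"gap:       ${(actual_1994 - predicted_1994) * 1_000_000:,.0f}")
```

The gap between the two values is what we should expect from a line that balances its fit across all six points rather than passing through any one of them.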
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
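One way to see this interpretation is to compare the model's prediction in consecutive years. The short Python sketch below does that for a few starting values of t; the function name predicted_price is our own.

```python
def predicted_price(t):
    """Regression model: price in millions of dollars, t in years since 1994."""
    return 0.1264 * t + 0.2229

# Change in the predicted price when t goes up by exactly one year.
for t in (0, 5, 10):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t = {t:>2}: change over one year = ${change * 1_000_000:,.0f}")
```

Because the model is a line, the one-year change is the same no matter which year we start from, and it equals the slope.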
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
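The two calculations above can be repeated for any year with a small helper. This is only a sketch: the names BASE_YEAR and predicted_price are ours, and the conversion from calendar year to t follows the convention that t=0 is 1994.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predicted_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229

price_2008 = predicted_price(2008)  # t = 14, outside the range of the data
price_2000 = predicted_price(2000)  # t = 6, inside the range of the data
actual_2000 = 0.95                  # table value for 2000

print(f"2008 forecast: {price_2008:.4f} million")
print(f"2000 model:    {price_2000:.4f} million (actual {actual_2000:.2f} million)")
print(f"2000 difference: {price_2000 - actual_2000:.4f} million")
```

For 2000 the model is off by about $31,300, which is small compared with a price near $1 million.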
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot with the regression line drawn over the data points]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
gives at that point.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: the scatter plot and regression line, with a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
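To make the idea concrete, the sketch below computes the residual at each of the six data points. We take the residual to be the actual value minus the predicted value, so a positive residual means the point lies above the line. That sign convention is an assumption on our part, since the slides only describe the vertical distance.

```python
# Data points from the table: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted_price(ti):
    """Regression model: price in millions of dollars."""
    return 0.1264 * ti + 0.2229

# Residual = actual value minus the value on the regression line.
residuals = [pi - predicted_price(ti) for ti, pi in zip(t, p)]

for ti, pi, res in zip(t, p, residuals):
    print(f"t = {ti:>2}: actual {pi:.2f}, "
          f"predicted {predicted_price(ti):.4f}, residual {res:+.4f}")
```

Some residuals are positive and some are negative, which reflects the line passing between the points rather than above or below all of them.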
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
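Although the formulas are not required, a short experiment shows what "least squares" means. The residuals are squared before they are added because positive and negative residuals would otherwise cancel. The Python sketch below compares the regression line with two other lines; those two alternatives are made up purely for illustration.

```python
# Data points from the table: t in years since 1994, p in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))

best = sum_squared_residuals(0.1264, 0.2229)     # the regression line
steeper = sum_squared_residuals(0.1500, 0.2229)  # hypothetical steeper line
higher = sum_squared_residuals(0.1264, 0.3000)   # hypothetical higher line

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

Any other line we try gives a larger total than the regression line, which is exactly what makes it the line of best fit in the least-squares sense.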
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
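The sign of r can be seen with a small experiment. The sketch below computes the usual (Pearson) correlation coefficient for the apartment data and then for the same prices listed in reverse order, a made-up falling trend used only for illustration. The function name correlation is our own.

```python
from math import sqrt

def correlation(x, y):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
p_falling = list(reversed(p))  # hypothetical: same prices, but decreasing over time

r_rising = correlation(t, p)
r_falling = correlation(t, p_falling)

print(f"rising prices:  r = {r_rising:.4f}")
print(f"falling prices: r = {r_falling:.4f}")
```

The two values have the same size but opposite signs: the fit is equally good in both cases, and only the direction of the slope differs.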
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
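As a quick check, here is the r-squared value that corresponds to the correlation coefficient from our apartment example, written as a couple of lines of Python.

```python
r = 0.9734            # correlation coefficient from the apartment example
r_squared = r ** 2    # the value Excel would report instead of r

print(f"r-squared = {r_squared:.4f}")  # prints r-squared = 0.9475

# Squaring removes the sign, so r = -0.9734 gives the same r-squared.
same_for_negative = (-r) ** 2 == r_squared
print(same_for_negative)  # prints True
```

Since 0.9475 is close to 1, the r-squared value leads to the same conclusion as r did: the line is a very good fit.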
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: data points with fitted line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: data points with fitted line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
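The data behind the two graphs above are not listed, so the sketch below uses two small made-up data sets to show the same contrast: one that is nearly linear and one with no linear trend. All of the numbers and names here are our own, chosen only for illustration.

```python
def r_squared(x, y):
    """Square of the correlation coefficient of two equal-length lists."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy ** 2 / (s_xx * s_yy)

x = [1, 2, 3, 4, 5]
y_linear = [2.4, 3.1, 3.4, 4.1, 4.4]  # made up: close to the line y = 0.5x + 2
y_scattered = [5, 1, 9, 1, 5]         # made up: no upward or downward trend

print(f"nearly linear data: r-squared = {r_squared(x, y_linear):.4f}")
print(f"scattered data:     r-squared = {r_squared(x, y_scattered):.4f}")
```

The first value comes out close to 1 and the second close to 0, matching what the two graphs show.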
So, to summarize, a linear regression equation is the equation
of a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 61
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the same scatter plot with the regression line, with a box marking one data point.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
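As a rough illustration (not part of the original slides), the residuals can be computed directly from the six data points in the table and the regression equation p = 0.1264t + 0.2229. The sketch takes a residual to be the actual price minus the predicted price, which is a common sign convention but an assumption here. The names points, predict, and residuals are made up for this example.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict(t):
    """Price predicted by the regression equation p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual for each point: actual price minus predicted price
# (assumed sign convention; the slides only describe a vertical distance).
residuals = [p - predict(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t = {t:2d}  actual = {p:.2f}  predicted = {predict(t):.4f}  residual = {res:+.4f}")
```

With this convention, a positive residual means the data point lies above the line, and a negative residual means it lies below the line.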
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
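For readers who are curious, here is a minimal sketch of the least-squares idea using the six data points from the table. It uses the standard textbook formulas for the slope and intercept, which are not given in these slides and may be presented differently in the course PDF. The variable and function names are assumptions made for this example. The sketch also compares the sum of squared residuals for the least-squares line with that of another line, chosen here as the line through the first and last data points.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Standard least-squares formulas for the slope and intercept.
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t


def sum_squared_residuals(m, b):
    """Sum of squared vertical distances between the points and the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in points)


# For comparison: the line through the first and last data points.
(t_first, p_first), (t_last, p_last) = points[0], points[-1]
other_slope = (p_last - p_first) / (t_last - t_first)
other_intercept = p_first - other_slope * t_first

print(f"least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals (least squares): {sum_squared_residuals(slope, intercept):.4f}")
print(f"sum of squared residuals (other line):    {sum_squared_residuals(other_slope, other_intercept):.4f}")
```

The least-squares line gives the smaller sum of squared residuals, which is the sense in which it is the line of best fit.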
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
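The following sketch shows one way the correlation coefficient could be computed by hand for the six data points in the table, using the standard formula for r. The formula itself is not shown in the slides, and the variable names are assumptions made for this example. The result should come out close to the value r = 0.9734 shown on the graph.

```python
import math

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Correlation coefficient: always between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r:.4f}")
```

Because r is positive and close to 1 here, the data closely follow a line with positive slope.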
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
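As a sketch of where the r-squared value comes from, the code below computes it for the six apartment-price data points as 1 minus the ratio of the sum of squared residuals to the total squared variation in p. This is a standard formula that is not given in the slides, and the variable names are assumptions made for this example. For a least-squares line, the result equals the square of the correlation coefficient r.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares slope and intercept (standard formulas).
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t

# Sum of squared residuals around the regression line.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
# Total squared variation of p around its mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot

print(f"r squared = {r_squared:.4f}")
```

The value is close to 1, which agrees with the graph: the apartment prices follow a nearly linear pattern.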
We will now look at examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure to also watch the video about how to find a
linear regression equation in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734.]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
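The slope interpretation can be checked with a short sketch. The function name predict_price and the choice of years 6 and 7 are assumptions made for this example; any two values of t that are one year apart give the same change, because the model is linear.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Change in the predicted price when t increases by 1 year.
change_per_year = predict_price(7) - predict_price(6)

# Convert from millions of dollars to dollars, rounding away floating-point noise.
change_in_dollars = round(change_per_year * 1_000_000)

print(f"predicted increase per year: ${change_in_dollars:,}")
```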
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close (it is off by about $31,300).
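The two calculations above can be reproduced with a small sketch. The helper names predict_price and year_to_t are assumptions made for this example; the equation and the actual 2000 price come from the slides.

```python
def year_to_t(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994


def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Forecast outside the data range: 2008 corresponds to t = 14.
forecast_2008 = predict_price(year_to_t(2008))

# Check against a year in the table: 2000 corresponds to t = 6.
predicted_2000 = predict_price(year_to_t(2000))
actual_2000 = 0.95  # from the table, in millions of dollars
error_2000 = predicted_2000 - actual_2000

print(f"2008 forecast:   ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${predicted_2000 * 1_000_000:,.0f}")
print(f"2000 error:      ${error_2000 * 1_000_000:,.0f}")
```

The 2008 value is an extrapolation, so it is only meaningful if the trend seen from 1994 to 2004 continued.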
Slide 63
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line of best fit labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s $\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have $\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could substitute
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about $31,300.
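The two calculations above can be sketched in a few lines of Python. This is only an illustration outside the Excel workflow used in this course; the names SLOPE, INTERCEPT and predict_price are made up for the example, and the coefficients are the rounded values shown on the graph.

```python
# Coefficients as rounded on the graph; t is years since 1994,
# and p is the average price in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229

def predict_price(t):
    """Return the model's predicted price (millions of $) t years after 1994."""
    return SLOPE * t + INTERCEPT

p_2008 = predict_price(2008 - 1994)  # t = 14, outside the table
p_2000 = predict_price(2000 - 1994)  # t = 6, a year that is in the table

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million (table value: 0.95 million)")
```

Writing the year as 2008 - 1994 instead of typing 14 directly makes it harder to forget that t counts years since 1994.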
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a particular data
point and the value that the regression line predicts for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
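To make the idea of a residual concrete, here is a short Python sketch that computes the residual for each of the six data points. Python is our choice for illustration and is not part of the course materials. It uses the rounded equation p = 0.1264t + 0.2229 and the usual convention that a residual is the actual value minus the predicted value. The variable names are illustrative assumptions.

```python
# The six (t, p) data points from the table.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted(t):
    """Price (millions of $) predicted by the rounded regression equation."""
    return 0.1264 * t + 0.2229

# Residual = actual price - predicted price (the vertical gap to the line).
residuals = [p - predicted(t) for t, p in zip(t_values, p_values)]

for t, res in zip(t_values, residuals):
    print(f"t = {t:2d}: residual = {res:+.4f}")
```

Some residuals come out positive (the point lies above the line) and some negative (the point lies below it), so simply adding them up gives a total very close to zero even though the individual gaps are not zero.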
There is a method that allows us to minimize the sum of
the squares of all of the residuals. The residuals are squared
so that positive and negative residuals do not cancel each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
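For the curious, the least-squares calculation for a single line can be sketched in Python. This is not the course's method (Excel does this for you) and it is not taken from the PDF; it uses the standard textbook formulas for the slope and intercept that minimize the sum of the squared residuals. The function name least_squares_line is made up for this example.

```python
# The six (t, p) data points from the table.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = least_squares_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # prints: p = 0.1264t + 0.2229
```

The result agrees with the equation shown on the graph.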
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
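As a rough illustration, the correlation coefficient for the apartment data can be computed in Python with the standard (Pearson) formula. The function name correlation and the reversed price list are assumptions made for this sketch; the reversed list is not real data and is there only to show what a negative r looks like.

```python
import math

t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation(xs, ys):
    """Return the correlation coefficient r between xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(t_values, p_values)
print(f"r = {r:.3f}")  # close to the r = 0.9734 shown on the graph

# Hypothetical: the same prices in reverse order, so price falls over time.
r_falling = correlation(t_values, list(reversed(p_values)))
print(f"r for falling prices = {r_falling:.3f}")
```

The first value is close to 1 because prices rise along a nearly straight line. The second has the same size but a negative sign, because the made-up reversed prices fall along a nearly straight line.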
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how good the fit is, but it
will only be between 0 and 1, so it does not show whether the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
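Here is a hedged Python sketch of one common way to compute an r-squared value: compare the squared residuals left over after fitting the line with the total squared variation of the prices around their mean. For a straight-line fit with one input variable, this equals the square of the correlation coefficient r. The variable names are illustrative, and the code makes no claim about how Excel performs the calculation internally.

```python
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
t_mean = sum(t_values) / n
p_mean = sum(p_values) / n

# Least-squares slope and intercept (unrounded).
sxy = sum((t - t_mean) * (p - p_mean) for t, p in zip(t_values, p_values))
sxx = sum((t - t_mean) ** 2 for t in t_values)
slope = sxy / sxx
intercept = p_mean - slope * t_mean

# Squared residuals around the line, and squared deviations around the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(t_values, p_values))
ss_tot = sum((p - p_mean) ** 2 for p in p_values)

r_squared = 1 - ss_res / ss_tot
print(f"r-squared = {r_squared:.4f}")
```

For the apartment data this comes out to roughly 0.948, which is about 0.9734 squared.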
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
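The same contrast can be reproduced with two tiny made-up data sets in Python. These are not the points behind the two graphs above (those values are not given here); they are hypothetical numbers chosen only to show one r-squared value near 1 and one near 0. The function name r_squared is also just for this sketch.

```python
def r_squared(xs, ys):
    """Return r squared for a straight-line fit of ys against xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

xs = [1, 2, 3, 4, 5]

# Hypothetical points that rise almost in a straight line.
nearly_linear = [2.1, 3.9, 6.2, 7.8, 10.0]

# Hypothetical points that jump up and down with no overall trend.
scattered = [4, 9, 1, 9, 4]

print(f"nearly linear: r-squared = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered:     r-squared = {r_squared(xs, scattered):.4f}")
```

The first data set gives a value very close to 1, and the second gives a value of essentially 0, even though a regression line can still be computed for it.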
So, to summarize, a linear regression equation is the equation of a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 64
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between the actual value of a
data point and the value that the regression line gives for
the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with a box drawn around one data point, followed by a zoomed-in view of that box.]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
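To make the idea of a residual concrete, here is a minimal Python sketch that computes the residual for each point in the apartment-price example. It assumes the six data points and the regression equation p = 0.1264t + 0.2229 from this presentation; the names `points`, `predicted_price`, and `residuals` are chosen only for illustration.

```python
# Data points (t, p) from the apartment-price table; p is in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Price given by the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value on the line (vertical distance, with a sign).
residuals = [p - predicted_price(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  on line={predicted_price(t):.4f}  residual={res:+.4f}")
```

A positive residual means the data point sits above the line, and a negative residual means it sits below the line.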
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
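For the curious, here is a short Python sketch of the least-squares calculation for a single input variable. It uses the standard textbook formulas (slope = sum of products of deviations divided by sum of squared deviations in t), which may be written differently in the PDF, and the function names are assumptions made for this example. The data are the six apartment-price points from this presentation.

```python
# Data from the apartment-price example: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(slope, intercept, xs, ys):
    """Add up the squares of the residuals for the line y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(ts, ps)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # prints: p = 0.1264t + 0.2229

# Any other line, such as one with a slightly steeper slope, gives a larger total.
best = sum_squared_residuals(slope, intercept, ts, ps)
other = sum_squared_residuals(slope + 0.01, intercept, ts, ps)
print(best < other)  # prints: True
```

The result agrees with the regression equation shown on the graph, which suggests that the trendline comes from this same kind of calculation.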
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
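As a rough illustration, the following Python sketch computes a correlation coefficient for the apartment-price data. It assumes that r is the usual Pearson correlation coefficient; the function name `correlation_coefficient` is made up for this example. Flipping the sign of every price turns the upward trend into a downward one, and r changes sign with it.

```python
import math

# Data from the apartment-price example: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r, always between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(ts, ps)
print(f"r = {r:.3f}")  # prints: r = 0.973

# The same points mirrored downward follow a negatively sloped line.
r_flipped = correlation_coefficient(ts, [-p for p in ps])
print(f"r = {r_flipped:.3f}")  # prints: r = -0.973
```

The first value agrees with the r = 0.9734 shown on the graph to about three decimal places.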
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
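One common way to think about r squared is as 1 minus the ratio of two sums: the sum of squared residuals about the regression line, divided by the sum of squared distances of the p-values from their average. The Python sketch below uses that definition on the apartment-price data. The function name `r_squared` is an assumption for this example, and the least-squares line is recomputed inside it so the block runs on its own.

```python
# Data from the apartment-price example: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def r_squared(xs, ys):
    """r squared computed as 1 - (squared residuals) / (squared distances from the mean)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


value = r_squared(ts, ps)
print(f"r squared = {value:.4f}")  # prints: r squared = 0.9476
```

Squaring the r = 0.9734 shown on the graph gives about 0.9475, which agrees with this value up to rounding. Note that squaring removes the sign, so r squared alone does not say whether the line slopes up or down.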
Consider the following set of data points.
[Figure: scatter plot of data points that rise steadily from left to right, with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
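The two situations can be imitated with a small Python sketch. The data below are made up for illustration and are not the points from the graphs, which are not listed in this presentation. The function computes r squared as the square of the Pearson correlation coefficient, and its name is an assumption for this example.

```python
def r_squared(xs, ys):
    """r squared, computed as the square of the Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Hypothetical points that stay close to a straight line.
ys_linear = [2.4, 3.0, 3.4, 4.1, 4.5, 4.9, 5.6, 6.0, 6.5, 7.1]

# Hypothetical points that jump around with no clear trend.
ys_scattered = [12, 3, 15, 6, 9, 4, 14, 2, 11, 8]

r2_linear = r_squared(xs, ys_linear)
r2_scattered = r_squared(xs, ys_scattered)
print(f"nearly linear data: r squared = {r2_linear:.4f}")
print(f"scattered data:     r squared = {r2_scattered:.4f}")
```

The first value comes out close to 1 and the second close to 0, matching what we saw on the two graphs.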
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 65
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
[Table: time t in years since 1994 and average price p in millions of $.]
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
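The slope interpretation can be checked with a few lines of Python. This is only a sketch using the regression equation p = 0.1264t + 0.2229 from above; the function and variable names are chosen for illustration. Whatever year we start from, moving one year forward raises the predicted price by the same amount.

```python
def predicted_price(t):
    """Predicted price in millions of dollars, t in years since 1994."""
    return 0.1264 * t + 0.2229


# Increase in the predicted price over one year, starting from several different years.
yearly_increases = [predicted_price(t + 1) - predicted_price(t) for t in (0, 4, 9)]

# Convert from millions of dollars to dollars.
increase_in_dollars = round(yearly_increases[0] * 1_000_000)

print([round(inc, 4) for inc in yearly_increases])  # prints: [0.1264, 0.1264, 0.1264]
print(increase_in_dollars)  # prints: 126400
```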
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
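The two calculations above can be repeated with a short Python sketch. It assumes the regression equation p = 0.1264t + 0.2229 with t=0 in 1994; the function name `predicted_price_for_year` is made up for this example.

```python
def predicted_price_for_year(year):
    """Predicted price in millions of dollars for a calendar year (t=0 is 1994)."""
    t = year - 1994
    return 0.1264 * t + 0.2229


# Forecast outside the data range: 2008 corresponds to t=14.
price_2008 = predicted_price_for_year(2008)
print(f"2008: ${price_2008 * 1_000_000:,.0f}")  # prints: 2008: $1,992,500

# Check against a year in the table: 2000 corresponds to t=6, actual price 0.95.
price_2000 = predicted_price_for_year(2000)
difference = price_2000 - 0.95
print(f"2000: ${price_2000 * 1_000_000:,.0f}")  # prints: 2000: $981,300
print(f"difference from actual: ${difference * 1_000_000:,.0f}")  # prints: difference from actual: $31,300
```

For 2000 the model is off by about $31,300, which is small compared with a price near $950,000.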
Slide 66
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
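The course itself uses Excel, but the fit can also be checked with a few lines of code. The following is only a minimal sketch, assuming Python with the NumPy library is available; the variable names are our own. It fits a straight line to the six data points from the table by least squares.

```python
import numpy as np

# The six data points: t = years since 1994, p = price in millions of $.
t = np.array([0, 2, 4, 6, 8, 10], dtype=float)
p = np.array([0.38, 0.40, 0.60, 0.95, 1.20, 1.60])

# A least-squares fit of a degree-1 polynomial returns (slope, intercept).
slope, intercept = np.polyfit(t, p, 1)

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation shown on the graph.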
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \), the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
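The two calculations above can be repeated for any year with a small helper. This is a minimal sketch in Python; the function name predict_price is our own, and it uses the rounded coefficients shown on the graph.

```python
def predict_price(t):
    """Predicted average price in millions of $, t years after 1994."""
    # Rounded slope and p-intercept from the regression equation.
    return 0.1264 * t + 0.2229

for year in (2000, 2008):
    t = year - 1994  # t=0 represents 1994
    print(year, round(predict_price(t), 4))
```

The year 2000 lies inside the range of the table, so its prediction can be compared with the actual price. The year 2008 lies outside it, so that prediction only holds if the trend continues.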
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual p-value of a
particular data point and the p-value that the regression line
predicts for the same value of t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
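To make the idea concrete, the residuals for all six data points can be listed. This is a minimal sketch in Python, assuming the rounded regression equation p = 0.1264t + 0.2229 from the graph; the names points, predicted, and residuals are our own.

```python
# Data points from the table: (t, p) with p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predicted(t):
    # Rounded regression equation from the graph.
    return 0.1264 * t + 0.2229

# Residual = actual price minus the price the line predicts at the same t.
residuals = [p - predicted(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted(t):.4f}  residual={res:+.4f}")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.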
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
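For reference only, here is the standard textbook form of the least-squares formulas for a line p = mt + b. The notation is our own and may differ from the course PDF. The method picks m and b to make the sum of the squared residuals as small as possible, and the worked numbers use the six data points from the table.

```latex
% Quantity being minimized: the sum of the squared residuals
S(m, b) = \sum_{i} \bigl(p_i - (m t_i + b)\bigr)^2

% Slope and intercept that minimize S
m = \frac{\sum_{i} (t_i - \bar{t})(p_i - \bar{p})}{\sum_{i} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}

% Worked values for our data: \bar{t} = 5, \bar{p} = 0.855
m = \frac{8.85}{70} \approx 0.1264,
\qquad
b = 0.855 - \frac{8.85}{70}\cdot 5 \approx 0.2229
```

These values match the regression equation p = 0.1264t + 0.2229 shown on the graph.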
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
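The correlation coefficient for the apartment data can be checked with a short calculation. This is a minimal sketch, assuming Python with NumPy is available; the variable names are our own.

```python
import numpy as np

# The six data points: t = years since 1994, p = price in millions of $.
t = np.array([0, 2, 4, 6, 8, 10], dtype=float)
p = np.array([0.38, 0.40, 0.60, 0.95, 1.20, 1.60])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(t, p)[0, 1]

print(f"r   = {r:.4f}")
print(f"r^2 = {r**2:.4f}")
```

The value of r should come out close to the 0.9734 shown on the graph, although the last digit may differ by rounding. Squaring r gives the r-squared value.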
Excel’s chart trendline will not display the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is, but not whether
the slope is positive or negative, and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 67
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline shown on the plot is
y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline shown on the plot is
y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 68
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)    0     2     4     6     8     10
p (millions of $)       0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
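The arithmetic above can be checked with a short script. This is only a sketch in Python (the course itself uses Excel); the names SLOPE, INTERCEPT, actual, and predicted_price are made up for this illustration, and the numbers are the ones given on the slides.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# where t is years since 1994 and p is the price in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229

# Actual prices from the table, keyed by t, in millions of dollars.
actual = {0: 0.38, 2: 0.40, 4: 0.60, 6: 0.95, 8: 1.20, 10: 1.60}


def predicted_price(t):
    """Price (in millions of dollars) that the regression line predicts for year 1994 + t."""
    return SLOPE * t + INTERCEPT


for t in (0, 6, 14):
    line = f"{1994 + t}: predicted ${predicted_price(t) * 1_000_000:,.0f}"
    if t in actual:
        line += f", actual ${actual[t] * 1_000_000:,.0f}"
    else:
        # t = 14 is beyond the last data point (t = 10), so this is a forecast.
        line += " (outside the data range)"
    print(line)
```

The 2008 value comes from extending the line past the last data point, so it is only as reliable as the assumption that the trend continued.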
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of the data points with the regression line. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular
data point and the regression line, that is, the actual value
minus the value that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box around one data point.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
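To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six apartment data points, using the regression line shown on the graph. The sign convention (actual value minus predicted value) is the usual one and is assumed here; the variable names are made up for this illustration.

```python
# Regression line from the graph: p = 0.1264t + 0.2229 (p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229

# Data points (t, p) from the apartment-price example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

residuals = []
for t, p in data:
    predicted = SLOPE * t + INTERCEPT
    residual = p - predicted  # assumed convention: actual minus predicted
    residuals.append(residual)
    position = "above" if residual > 0 else "below"
    print(f"t={t:2d}: actual {p:.4f}, predicted {predicted:.4f}, "
          f"residual {residual:+.4f} ({position} the line)")

print(f"Sum of residuals:         {sum(residuals):+.4f}")
print(f"Sum of squared residuals: {sum(e * e for e in residuals):.4f}")
```

Some residuals are positive and some are negative, so when they are simply added together they nearly cancel out. Squaring each residual first avoids this cancellation.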
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
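For the curious, the sketch below shows one standard way to carry out the least-squares calculation in Python. It uses the usual textbook formulas for the slope and intercept, which may be written differently in the course PDF, and it is not required for the course. The function and variable names are made up for this illustration.

```python
# Data points (t, p) from the apartment-price example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept):
    """Sum of the squared vertical gaps between the data and the line p = slope*t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Standard least-squares formulas for a line through (t, p) data.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t
print(f"Least-squares line: p = {slope:.4f}t + {intercept:.4f}")

# Any other line gives a larger sum of squared residuals.
best = sum_squared_residuals(slope, intercept)
print(f"Sum of squared residuals for this line: {best:.4f}")
for other_slope, other_intercept in [(0.12, 0.25), (0.13, 0.20), (0.1264, 0.30)]:
    other = sum_squared_residuals(other_slope, other_intercept)
    print(f"  p = {other_slope}t + {other_intercept}: {other:.4f}")
```

The slope and intercept computed this way round to the same values shown on the graph, 0.1264 and 0.2229.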
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
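As an illustration, the correlation coefficient for the apartment data can be computed with the standard (Pearson) formula. The slides do not show this formula, so treat the Python sketch below as one common way of getting r rather than the method used in the course; the variable names are made up for this illustration.

```python
import math

# Data points (t, p) from the apartment-price example.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# How t and p vary together, and how each varies on its own.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
s_pp = sum((p - mean_p) ** 2 for _, p in data)

r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.3f}")
```

The result is about 0.973, in line with the r = 0.9734 shown on the graph (the last digit can differ depending on rounding). It is close to 1, and the line slopes upward, as expected.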
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
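The sketch below illustrates why r squared always lands between 0 and 1. It computes r for the apartment data and for a hypothetical mirror image of the data in which the same prices fall over time instead of rising. The mirrored data set and the function name correlation are made up for this illustration.

```python
import math


def correlation(points):
    """Correlation coefficient r for a list of (x, y) points (standard Pearson formula)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Apartment-price data from the example: prices rise over time.
rising = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Hypothetical mirror image: the same prices, but falling over time.
falling = [(10 - t, p) for t, p in rising]

for name, points in (("rising", rising), ("falling", falling)):
    r = correlation(points)
    print(f"{name:8s} r = {r:+.3f}   r squared = {r * r:.3f}")
```

The two data sets have the same r-squared value even though one r is positive and the other is negative, so the direction of the slope has to be read from the regression equation or the graph.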
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline shown on the plot is
y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline shown on the plot is
y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 69
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)    0     2     4     6     8     10
p (millions of $)       0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: the scatter plot with the regression line, labeled with its equation and correlation coefficient; price p in millions of $ against time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
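For readers who want to check these numbers outside Excel, here is a minimal Python sketch that recomputes the slope, the p-intercept, and the correlation coefficient from the six data points listed earlier. The function name fit_line and the variable names are illustrative choices, not part of the course material, and the formulas are the standard least-squares ones rather than anything specific to Excel.

```python
from math import sqrt

# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph, and r comes out at about 0.973.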
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = (change in y)/(change in x).
In this case we are using p and t, so it’s
slope = (change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
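The "per year" reading of the slope can be checked with a short Python sketch. It assumes only the regression equation given above; the names SLOPE, INTERCEPT, and predicted_price are made up for this illustration.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# The predicted change over any one-year step is the same: the slope.
for t in [0, 3, 7]:
    increase = predicted_price(t + 1) - predicted_price(t)
    print(f"from t={t} to t={t + 1}: about ${increase * 1_000_000:,.0f}")
```

Whichever starting year is chosen, the predicted one-year increase is the slope, about $126,400.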
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
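The same substitutions can be written as a small Python sketch. It assumes the regression equation and the six table values given above (t=0, 2, ..., 10 correspond to 1994, 1996, ..., 2004); the names predict, actual, and BASE_YEAR are hypothetical, chosen only for this example.

```python
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 represents 1994

# Actual prices from the table, in millions of dollars, keyed by calendar year.
actual = {1994: 0.38, 1996: 0.40, 1998: 0.60, 2000: 0.95, 2002: 1.20, 2004: 1.60}


def predict(year):
    """Predicted price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast beyond the data, assuming the trend continues.
print(f"2008: predicted ${predict(2008) * 1_000_000:,.0f}")

# Compare the model with the years that are in the table.
for year, price in actual.items():
    difference = predict(year) - price
    print(f"{year}: predicted {predict(year):.4f}, actual {price:.2f}, "
          f"difference {difference:+.4f}")
```

For 2000 the difference is 0.9813 - 0.95 = 0.0313, that is, the model overestimates by $31,300, or a little over 3% of the actual price.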
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it can
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line; price p in millions of $ against time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line: the actual value minus the
value that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
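To make the idea of residuals concrete, the following Python sketch computes the residual at each data point and the sum of their squares, which is the quantity the least-squares method makes as small as possible. It then does the same for two other lines. The two comparison lines (slope 0.14, and intercept 0.30) are hypothetical alternatives invented for this illustration, and the function names are likewise assumptions.

```python
# The six (t, p) data points from the table.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Actual value minus predicted value for each data point."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]


def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the line p = slope*t + intercept."""
    return sum(e ** 2 for e in residuals(slope, intercept))


# The regression line from the graph.
best = sum_of_squares(0.1264, 0.2229)
# Two hypothetical alternatives: a steeper line and a higher line.
steeper = sum_of_squares(0.14, 0.2229)
higher = sum_of_squares(0.1264, 0.30)

print([round(e, 4) for e in residuals(0.1264, 0.2229)])
print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

The residuals of the regression line are a mix of positive and negative values that nearly cancel when added, which is why the method works with their squares. Both alternative lines give a larger sum of squared residuals than the regression line does.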
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
When Excel adds a regression line to a chart, it does not give
the value of r; instead, it gives the value of r squared.
The r-squared value tells us much the same thing about the fit,
but it will only be between 0 and 1, so it does not show
whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
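Before turning to those examples, here is a short Python sketch that connects r squared to the apartment data used earlier. It uses a standard definition that this presentation does not state: r squared equals 1 minus the sum of squared residuals divided by the total squared variation of the p-values around their mean. For a least-squares line this equals the square of r. The variable names are illustrative.

```python
# The six (t, p) data points and the regression equation from earlier.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
SLOPE, INTERCEPT = 0.1264, 0.2229

mean_p = sum(p_values) / len(p_values)

# Variation left over after fitting the line (sum of squared residuals).
ss_residual = sum(
    (p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in zip(t_values, p_values)
)
# Total variation of the prices around their mean.
ss_total = sum((p - mean_p) ** 2 for p in p_values)

r_squared = 1 - ss_residual / ss_total
print(f"r squared from residuals: {r_squared:.4f}")
print(f"0.9734 squared:           {0.9734 ** 2:.4f}")
```

The two printed values agree to about three decimal places (the small gap comes from rounding), and both are close to 1, as expected for data that follow a nearly linear pattern.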
Consider the following set of data points.
[Figure: scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 70
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: points lying close to a straight line, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation describes
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 71
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Graph: scatter plot with the regression line drawn over the data points; price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; price p in millions of $ versus time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
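The presentation relies on Excel for this step. As a rough sketch of what such a tool computes, the short Python program below applies the standard least-squares formulas to the six data points listed earlier. The function name fit_line and the use of Python are assumptions made for illustration; they are not part of the course materials.

```python
# Data points (t, p) from the presentation:
# t = years since 1994, p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Slope: sum of (t - mean_t)(p - mean_p) divided by sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation p = 0.1264t + 0.2229 shown on the graph.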
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: data points with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
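The two calculations above can be repeated for any year with a small helper. The sketch below is only an illustration: the function name predicted_price and the constant BASE_YEAR are assumptions, and the coefficients are the rounded values from the regression equation.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predicted_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


# Forecast beyond the data range (2008 is t = 14).
print(f"2008: about ${predicted_price(2008) * 1_000_000:,.0f}")

# Check against a year in the table (2000 is t = 6, actual price $950,000).
actual_2000 = 0.95
difference = predicted_price(2000) - actual_2000
print(f"2000: about ${predicted_price(2000) * 1_000_000:,.0f}, "
      f"which is ${difference * 1_000_000:,.0f} above the actual price")
```

The further a year lies outside 1994 to 2004, the more the prediction depends on the trend continuing unchanged.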
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: data points with the regression line; price p in millions of $ versus time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: data points and the regression line; price p in millions of $ versus time t in years since 1994]
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
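To make the idea concrete, the sketch below lists the residual for each of the six data points, using the regression equation p = 0.1264t + 0.2229 from earlier. The sign convention (actual price minus predicted price) and the function name residual are assumptions; the presentation only describes the residual as a vertical distance.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE, INTERCEPT = 0.1264, 0.2229  # regression line from the presentation


def residual(t, p):
    """Actual price minus the price the line predicts at time t."""
    return p - (SLOPE * t + INTERCEPT)


for t, p in points:
    predicted = SLOPE * t + INTERCEPT
    print(f"t={t:2d}  actual={p:.2f}  predicted={predicted:.4f}  "
          f"residual={residual(t, p):+.4f}")
```

Points above the line have positive residuals and points below it have negative ones.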
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
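To see what the least-squares method is aiming for, the sketch below compares the sum of squared residuals for the regression line with that of one other candidate, the line through the first and last data points. Squaring keeps positive and negative residuals from canceling each other. The helper name sum_squared_residuals and the choice of comparison line are assumptions made for illustration.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept, points):
    """Sum of squared vertical distances between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# Regression line from the presentation (rounded coefficients).
best = sum_squared_residuals(0.1264, 0.2229, points)

# Another candidate: the line through the first and last data points.
(t0, p0), (t1, p1) = points[0], points[-1]
alt_slope = (p1 - p0) / (t1 - t0)
alt_intercept = p0 - alt_slope * t0
alt = sum_squared_residuals(alt_slope, alt_intercept, points)

print(f"regression line: {best:.4f}")
print(f"two-point line:  {alt:.4f}")
```

The regression line gives the smaller total, even though the two-point line passes exactly through two of the data points.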
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
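The presentation does not give a formula for r. As a sketch, the program below uses the standard Pearson correlation formula on the apartment price data and then squares the result. The function name correlation is an assumption made for illustration.

```python
from math import sqrt

# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


r = correlation(points)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

The value of r should land very close to the r = 0.9734 shown on the graph (the last digit may differ because of rounding), and r squared comes out near 0.95, which indicates a good linear fit.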
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: points lying close to a straight line, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation describes
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 72
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Graph: scatter plot with the regression line drawn over the data points; price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; price p in millions of $ versus time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
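Although this course uses Excel, the same line can be reproduced in a few lines of Python. The sketch below is only an illustration and not part of the course material. It assumes the six (t, p) points from the table, (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), and (10, 1.60), and uses the standard least-squares formulas for the slope and intercept. The variable names are our own.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sum of squared deviations of t, and sum of cross-products of t and p.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

# Least-squares slope and intercept of the line p = slope * t + intercept.
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation shown on the graph.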
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \(\frac{0.1264}{1}\).
Recall that the definition of slope is
\(\frac{\text{change in } y}{\text{change in } x}\).
In this case we are using p and t, so it’s
\(\frac{\text{change in } p}{\text{change in } t}\).
So for our problem, we have
\(\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}\).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
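To see that the slope really is the predicted change per year, we can evaluate the regression equation at consecutive values of t and subtract. This is a small sketch; the function name predicted_price is our own and is not part of the course material.

```python
SLOPE = 0.1264      # millions of $ per year, from the regression equation
INTERCEPT = 0.2229  # millions of $, the predicted price at t = 0

def predicted_price(t):
    """Predicted average price in millions of $, t years after 1994."""
    return SLOPE * t + INTERCEPT

# The predicted increase over any one-year step equals the slope.
for year in (0, 4, 9):
    increase = predicted_price(year + 1) - predicted_price(year)
    print(f"t={year} to t={year + 1}: about ${increase * 1_000_000:,.0f}")
```

Whichever year we start from, the one-year increase is the same, because the model is a straight line.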
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
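The same plug-in arithmetic can be written as a short Python sketch. It assumes only the regression equation and the two actual prices quoted above (1994 and 2000). The function name and the dictionary of actual prices are our own, chosen for illustration.

```python
def predicted_price(t):
    """Predicted average price in millions of $, t years after 1994."""
    return 0.1264 * t + 0.2229

# Actual prices quoted from the table, in millions of $ (1994 and 2000).
actual = {0: 0.38, 6: 0.95}

for t in (0, 6, 14):
    line = f"{1994 + t}: predicted ${predicted_price(t) * 1_000_000:,.0f}"
    if t in actual:
        line += f", actual ${actual[t] * 1_000_000:,.0f}"
    print(line)
```

There is no actual price to compare against for 2008, because that year lies outside the data in the table.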
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the regression line: Price p in millions of $ versus Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the p-value of a
particular data point and the p-value that the regression
line predicts for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
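As a sketch of what the residuals look like for our example, the Python below subtracts the predicted price from the actual price at each t. It assumes the six (t, p) points from the table and the rounded regression equation p=0.1264t+0.2229. The names are our own.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted_price(ti):
    """Price in millions of $ predicted by the regression equation."""
    return 0.1264 * ti + 0.2229

# Residual = actual price minus predicted price (positive when the point is above the line).
residuals = [pi - predicted_price(ti) for ti, pi in zip(t, p)]

for ti, res in zip(t, residuals):
    print(f"t={ti:2d}  residual = {res:+.4f}")
```

Some residuals are positive (the point lies above the line) and some are negative (the point lies below it).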
There is a method that allows us to minimize the sum of
the squares of all of the residuals, so that positive and
negative residuals cannot cancel each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
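To get a feel for what "least squares" means, we can compare the sum of squared residuals for the regression line with that of some other reasonable-looking line. The sketch below assumes the six points from the table. The comparison line, drawn through the first and last data points, is our own choice for illustration.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances from the data points to the line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))

# The regression line from the graph.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# A comparison line through the first and last data points (illustration only).
endpoint_slope = (p[-1] - p[0]) / (t[-1] - t[0])
sse_endpoints = sum_squared_residuals(endpoint_slope, p[0])

print(f"regression line:     {sse_regression:.4f}")
print(f"line through ends:   {sse_endpoints:.4f}")
```

The regression line gives the smaller total, which is what the least-squares method is designed to achieve.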
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
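The sign of r can be illustrated with a short Python sketch. It assumes the six points from the table and the usual formula for the (Pearson) correlation coefficient. The "falling prices" data set is not real: it is just the same prices listed in reverse order, made up here to show a negative slope.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_rising = correlation(t, p)         # prices as given in the table
r_falling = correlation(t, p[::-1])  # same prices reversed (made up for illustration)

print(f"rising prices:  r = {r_rising:.3f}")
print(f"falling prices: r = {r_falling:.3f}")
```

The two values have the same size but opposite signs: the fit is equally close in both cases, and only the direction of the slope changes.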
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us basically the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
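For the apartment data, the r-squared value can be sketched in Python as follows. We assume the six points from the table and a common definition of r-squared for a least-squares line: 1 minus the sum of squared residuals divided by the total sum of squares about the mean. We are assuming, not asserting, that this matches what Excel reports.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Least-squares line through the data.
slope = (sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))
         / sum((ti - t_mean) ** 2 for ti in t))
intercept = p_mean - slope * t_mean

# Squared residuals about the line, and squared deviations about the mean.
ss_res = sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))
ss_tot = sum((pi - p_mean) ** 2 for pi in p)

r_squared = 1 - ss_res / ss_tot
print(f"r-squared = {r_squared:.4f}")
print(f"square root of r-squared = {r_squared ** 0.5:.4f}")
```

The square root should come out close to the r = 0.9734 shown on the graph, possibly differing in the last digit because of rounding.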
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of points lying close to a line, with fitted line y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot of widely scattered points, with fitted line y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 74
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table, which shows the average price
p (in millions of dollars) of a two-bedroom apartment in
downtown New York City from 1994 to 2004, where t=0
represents 1994.
t (years since 1994)    0      2      4      6      8      10
p (millions of $)       0.38   0.40   0.60   0.95   1.20   1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis)
against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Graph: the scatter plot with the best-fit line drawn over
the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and regression line, labeled with
p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
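For readers who want to reproduce these numbers outside Excel, here is a
minimal Python sketch. It assumes the six (t, p) points from the table and
uses the standard least-squares formulas for the slope and intercept. The
names points and fit_line are made up for this illustration and do not come
from the course materials.

```python
# Data points (t, p) from the table: t = years since 1994,
# p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through data."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # prints: p = 0.1264t + 0.2229
```

Rounded to four decimal places, this matches the equation shown on the graph.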
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the scatter plot and regression line, labeled with
p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
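The two calculations above can be sketched in a few lines of Python. This is
only an illustration: the function name predict_price and the constant names
are made up here, and the slope and intercept are the rounded values from the
regression equation p=0.1264t+0.2229.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars, the predicted price at t = 0 (1994)


def predict_price(year):
    """Return the predicted average price in dollars for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    p_millions = SLOPE * t + INTERCEPT
    return round(p_millions * 1_000_000)


price_2008 = predict_price(2008)  # t = 14, outside the data range
price_2000 = predict_price(2000)  # t = 6, a year that is in the table
actual_2000 = 950_000             # from the table: the point (6, 0.95)

print(price_2008)                # 1992500
print(price_2000)                # 981300
print(price_2000 - actual_2000)  # 31300
```

For 2000, the model is off by $31,300, which is what "pretty close" means
here.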
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it can
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot of price p in millions of $ against
time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Graph: the data points and the regression line]
A residual is the difference between the actual value at a
particular data point and the value that the regression
line gives at that point.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: the data points and the regression line]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
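To make the idea of a residual concrete, the sketch below computes one for
each of the six data points, using the rounded regression equation
p=0.1264t+0.2229. It assumes the common sign convention of actual value minus
predicted value, so a positive residual means the point lies above the line.
The names points, predicted and residuals are made up for this illustration.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Price in millions of dollars given by the regression line at time t."""
    return 0.1264 * t + 0.2229


# Assumed convention: residual = actual value - predicted value.
residuals = [(t, p - predicted(t)) for t, p in points]

for t, res in residuals:
    print(f"t = {t:2d}: residual = {res:+.4f}")
```

Some residuals come out positive and some negative, which is why the points
sit on both sides of the line in the graph.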
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
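One way to see why the regression line is preferred over some other line is
to compare sums of squared residuals. The sketch below is only an
illustration and is not required for the course. The two alternative lines
are arbitrary choices made up for the comparison, and the function name
sum_squared_residuals is hypothetical.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept, data):
    """Add up the squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


# The regression line from the graph (coefficients rounded to 4 places).
regression = sum_squared_residuals(0.1264, 0.2229, points)

# Alternative 1: the line through the first and last points,
# (0, 0.38) and (10, 1.60), which has slope (1.60 - 0.38) / 10 = 0.122.
through_endpoints = sum_squared_residuals(0.122, 0.38, points)

# Alternative 2: the same slope as the regression line, shifted upward.
shifted_up = sum_squared_residuals(0.1264, 0.30, points)

print(f"regression line:   {regression:.4f}")
print(f"through endpoints: {through_endpoints:.4f}")
print(f"shifted upward:    {shifted_up:.4f}")
```

The regression line gives the smallest total of the three. The least-squares
method guarantees that no other straight line gives a smaller one.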
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
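As a rough check on the value of r shown in the graph, the sketch below
computes the correlation coefficient for the six apartment-price points. It
assumes that r is the usual (Pearson) correlation coefficient. The function
name correlation is made up for this illustration.

```python
import math

# Data points (t, p): t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Return the correlation coefficient r for a list of (t, p) pairs."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    s_pp = sum((p - mean_p) ** 2 for _, p in data)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(points)
print(f"r = {r:.3f}")
print(f"r squared = {r * r:.3f}")
```

The result is about 0.973, which agrees with the value on the graph up to
rounding. It is close to 1 and positive, matching the upward-sloping line.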
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot with trendline: y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with trendline: y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
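The same contrast can be reproduced with a short Python sketch. The two data
sets below are hypothetical: they are invented for this illustration and are
not the points plotted in the charts above, which are not listed in the
slides. The sketch assumes that for a straight-line fit the r-squared value
is the square of the correlation coefficient r. The function name r_squared
is made up here.

```python
def r_squared(xs, ys):
    """Return r squared (the square of the correlation coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical points that stay close to a straight line.
line_x = [1, 2, 3, 4, 5, 6]
line_y = [2.6, 2.9, 3.6, 3.9, 4.6, 4.9]

# Hypothetical points with no clear linear pattern.
scatter_x = [1, 2, 3, 4, 5, 6, 7, 8]
scatter_y = [12, 3, 15, 6, 2, 14, 5, 11]

print(f"nearly linear: r squared = {r_squared(line_x, line_y):.4f}")
print(f"scattered:     r squared = {r_squared(scatter_x, scatter_y):.4f}")
```

The first value comes out close to 1 and the second close to 0, the same
pattern as in the two charts.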
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 75
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
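The slope interpretation can be checked numerically. The following is only a minimal sketch in Python; the names SLOPE, INTERCEPT, and predict_price are our own and do not come from the course materials. It shows that the predicted price rises by the same amount over any one-year step, no matter where the step starts.

```python
# Coefficients of the regression equation p = 0.1264t + 0.2229 from the slides.
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars


def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# The increase over any one-year step equals the slope (up to rounding).
for t in (0, 3, 9):
    increase = predict_price(t + 1) - predict_price(t)
    print(f"t = {t} to t = {t + 1}: increase of about ${increase * 1_000_000:,.0f}")
```

Each line of output should report an increase of about $126,400, which is the slope converted from millions of dollars to dollars.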
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
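The two calculations above can be reproduced with a short Python sketch. The function name predict_price_for_year and the BASE_YEAR constant are hypothetical helpers introduced for illustration; only the equation and the 2000 table value come from the slides.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229, with t = 0 in 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994


def predict_price_for_year(year):
    """Predicted average price (millions of dollars) for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast beyond the data range (2008 corresponds to t = 14).
forecast_2008 = predict_price_for_year(2008)

# Compare the model with a year that is in the table (2000 corresponds to t = 6).
predicted_2000 = predict_price_for_year(2000)
actual_2000 = 0.95  # table value, in millions of dollars
error_2000 = predicted_2000 - actual_2000

print(f"2008 forecast:   {forecast_2008:.4f} million")
print(f"2000 prediction: {predicted_2000:.4f} million")
print(f"2000 actual:     {actual_2000:.4f} million")
print(f"difference:      {error_2000:.4f} million")
```

For 2000 the model overshoots the table value by roughly $31,300, which is what "pretty close" means here.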
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
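To make the idea of a residual concrete, here is a minimal Python sketch that computes the residual at each of the six data points from the table. It assumes the common sign convention "actual value minus predicted value"; the slides only describe the residual as a vertical distance, so the sign choice and the variable names are our own.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression equation from the slides: p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Price the regression line predicts at time t (millions of dollars)."""
    return SLOPE * t + INTERCEPT


# Residual = actual price minus predicted price (assumed sign convention).
# A positive residual means the data point lies above the line.
residuals = [p - predict_price(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t = {t:2d}: actual {p:.4f}, predicted {predict_price(t):.4f}, residual {res:+.4f}")
```

Some residuals come out positive and some negative, because the line passes above some points and below others.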
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
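For readers who are curious, the least-squares calculation can be sketched in a few lines of Python. This uses the standard textbook formulas for a line fitted to one input variable; it is not necessarily how the course PDF presents the method, and the function name least_squares_line is our own.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    sxy = sum((t - mean_t) * (p - mean_p) for t, p in points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sxy / sxx
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation p = 0.1264t + 0.2229 shown on the graph.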
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
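As an illustration, the correlation coefficient for the apartment data can be computed directly. The sketch below uses the standard Pearson formula, which we assume is the r reported on the graph; the function name correlation_coefficient is our own.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(data)
print(f"r = {r:.3f}")
```

The value comes out at about 0.973, in line with the r = 0.9734 shown on the graph, and its positive sign matches the upward slope of the line.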
Excel will not give the value of r on the chart; instead, it
gives the value of r squared.
The r-squared value tells us how good the fit is, just as r
does, but it will only be between 0 and 1, so it does not
show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
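One way to see where an r-squared value comes from is to compare the squared residuals of the fitted line with the total spread of the prices around their average. The Python sketch below does this for the apartment data; it is an illustration under the usual textbook definition, not a description of how Excel computes the number internally, and all variable names are our own.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Least-squares slope and intercept (unrounded).
sxx = sum((t - mean_t) ** 2 for t, _ in data)
sxy = sum((t - mean_t) * (p - mean_p) for t, p in data)
slope = sxy / sxx
intercept = mean_p - slope * mean_t

# Sum of squared residuals around the line, and total sum of squares around the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")
```

For this data the result is roughly 0.948, and its square root is close to the r = 0.9734 shown on the graph.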
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with a fitted trendline labeled y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with a fitted trendline labeled y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
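The contrast between the two cases can be reproduced with small made-up data sets. The points below are hypothetical, chosen only for illustration; they are not the points plotted in the two graphs, whose coordinates are not given. The function r_squared is our own helper and squares the standard correlation coefficient.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient for paired lists xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


xs = list(range(1, 11))

# Hypothetical points that stay within 0.1 of the line y = 0.5x + 2.
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1, 5.4, 6.1, 6.4, 7.1]

# Hypothetical points with no upward or downward trend at all.
scattered = [5, 15, 3, 12, 8, 8, 12, 3, 15, 5]

print(f"nearly linear: R^2 = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered:     R^2 = {r_squared(xs, scattered):.4f}")
```

The first value should be very close to 1 and the second very close to 0, mirroring the two graphs above.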
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 77
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):        0     2     4     6     8     10
p (price in millions of $):  0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot with the fitted line drawn over the data points: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
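The two calculations above can also be checked with a few lines of Python. This is only an illustrative sketch: the function name predict_price is made up here, and the coefficients are the rounded values from the regression equation, so the results are approximate.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229,
# where t is years since 1994 and p is in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(year):
    """Return the predicted price in millions of dollars for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


# 2000 is inside the data range; 2008 is a forecast beyond it.
for year in (2000, 2008):
    print(year, round(predict_price(year), 4))
```

Converting the year to t inside the function keeps the "t=0 is 1994" convention in one place.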
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the regression line: price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Scatter plot with the regression line: price p in millions of $ against time t in years since 1994]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
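To make the idea of a residual concrete, here is a small Python sketch that computes the residual at each of the six data points from the apartment example. It assumes the rounded regression equation p = 0.1264t + 0.2229 and the common sign convention "actual value minus predicted value"; the names points, line, and residuals are chosen only for illustration.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def line(t):
    """Regression line from the slides, using the rounded coefficients."""
    return 0.1264 * t + 0.2229


# Residual = actual p minus the p predicted by the line (a signed vertical gap).
residuals = [p - line(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={line(t):.4f}  residual={res:+.4f}")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.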
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
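For the curious, here is a rough Python sketch of what the least-squares method computes. The least-squares line is the one that makes the sum of the squared residuals as small as possible, and for a single input variable its slope and intercept have standard closed-form formulas. The function name least_squares_line is an assumption made for this example; it is not taken from the course PDF or from Excel.

```python
# Data points from the apartment example.
ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(ts, ps)
print(round(slope, 4), round(intercept, 4))
```

Rounded to four decimal places, the result should agree with the equation p = 0.1264t + 0.2229 shown on the graph.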
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that our graph showed a number called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
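As an illustration of how r and r squared are related, the following Python sketch computes both for the apartment data using the standard formula for the correlation coefficient. The function name correlation is chosen for this example only. For a regression line with a single input variable, the r-squared value is simply the square of r.

```python
import math

# Data points from the apartment example.
ts = [0, 2, 4, 6, 8, 10]                    # years since 1994
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $


def correlation(xs, ys):
    """Return the correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(ts, ps)
r_squared = r ** 2  # always between 0 and 1, so the sign of r is lost
print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```

The value of r should come out close to the 0.9734 shown on the graph, and r squared is a little smaller because squaring a number between 0 and 1 makes it smaller.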
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot with trendline labeled y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with trendline labeled y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 78
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = (change in y)/(change in x) = Δy/Δx.
In this case we are using p and t, so it’s Δp/Δt.
So for our problem, we have Δp/Δt = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
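As a quick check of this interpretation, here is a short Python sketch. It assumes the regression equation p=0.1264t+0.2229 from the graph; the names SLOPE, INTERCEPT, and predicted_price are chosen only for this illustration. Whatever year we start from, moving forward one year changes the predicted price by the slope.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars, the predicted price at t = 0


def predicted_price(t):
    """Predicted average price (in millions of $) t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price over one year, for a few starting years.
for t in (0, 3, 7):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t={t} to t={t + 1}: change of {change:.4f} million $")
```

Each line of output should show the same change, 0.1264 million dollars, because a line has the same slope everywhere.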
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years that were given in the table. For example, for 2000
(t=6) the equation predicts a price of
p=0.1264(6)+0.2229=0.9813, or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
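The two calculations above can be sketched in Python as follows. This assumes the regression equation p=0.1264t+0.2229 and the prices from the table. The helper name predict_price and the dictionary actual_prices are made up for this example.

```python
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars
BASE_YEAR = 1994    # the year that t = 0 represents

# Actual average prices from the table, in millions of dollars.
actual_prices = {1994: 0.38, 1996: 0.40, 1998: 0.60,
                 2000: 0.95, 2002: 1.20, 2004: 1.60}


def predict_price(year):
    """Predicted average price (in millions of $) for a calendar year."""
    t = year - BASE_YEAR  # convert the calendar year to years since 1994
    return SLOPE * t + INTERCEPT


# Forecast beyond the data, assuming the trend continues.
print(f"2008 forecast: {predict_price(2008):.4f} million $")

# Compare the model with a year that is in the table.
year = 2000
print(f"{year}: predicted {predict_price(year):.4f}, actual {actual_prices[year]:.2f}")
```

Converting the year to t inside the function keeps the "t=0 is 1994" convention in one place, so the rest of the code can work with ordinary calendar years.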
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot and regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the same scatter plot and regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
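To make the idea of a residual concrete, here is a small Python sketch. It assumes the six data points from the table and the regression equation p=0.1264t+0.2229. It also assumes the common sign convention that a residual is the actual value minus the predicted value, so it is positive when the point lies above the line; the slides only describe it as a vertical distance. The names used are for illustration only.

```python
SLOPE = 0.1264      # from the regression equation
INTERCEPT = 0.2229

# (t, p) data points from the table: years since 1994, price in millions of $.
data_points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residual(t, p):
    """Actual price minus the price predicted by the regression line."""
    predicted = SLOPE * t + INTERCEPT
    return p - predicted


residuals = [residual(t, p) for t, p in data_points]

for (t, p), res in zip(data_points, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t}: actual {p:.2f}, residual {res:+.4f} (point is {position} the line)")
```

For example, at t=0 the residual is 0.38 - 0.2229 = 0.1571, which is the gap between the actual 1994 price and the predicted one.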
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
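You will not need to do this by hand, but the following Python sketch shows what "least squares" means in practice. It assumes the six data points from the table and compares the regression line p=0.1264t+0.2229 with another reasonable-looking line, the one that connects the first and last data points. The function name sum_squared_residuals is made up for this illustration.

```python
# (t, p) data points from the table: years since 1994, price in millions of $.
data_points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical gaps between the data points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data_points)


# The regression line from the graph.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# The line through the first and last points, (0, 0.38) and (10, 1.60).
endpoint_slope = (1.60 - 0.38) / (10 - 0)
sse_endpoints = sum_squared_residuals(endpoint_slope, 0.38)

print(f"regression line:       {sse_regression:.4f}")
print(f"line through endpoints: {sse_endpoints:.4f}")
```

The regression line gives the smaller total. The line through the endpoints matches two points exactly, but it misses the other four by more, so its sum of squared residuals is larger.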
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
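The two extreme cases can be seen in a short Python sketch. The data below are made up for illustration (they are not from the slides): one set lies exactly on a rising line and the other exactly on a falling line. The function correlation uses the standard formula for r, and its name is chosen only for this example.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r for paired lists of x and y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for this example.
xs = [0, 1, 2, 3, 4]
rising = [1 + 2 * x for x in xs]   # points exactly on y = 2x + 1
falling = [9 - 3 * x for x in xs]  # points exactly on y = -3x + 9

print("rising line:  r =", correlation(xs, rising))
print("falling line: r =", correlation(xs, falling))
```

Points that lie exactly on a rising line give r = 1, and points that lie exactly on a falling line give r = -1. Real data, like the apartment prices, usually fall somewhere in between.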
Excel’s trendline will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how closely the line fits, but it
will only be between 0 and 1, so it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
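For the apartment data, the link between r and r squared can be checked with a short Python sketch. It assumes the six data points from the table and computes r squared in two ways: by squaring r, and by comparing the squared residuals with the total variation in the prices. For a straight-line fit these two calculations agree. The variable names are chosen only for this illustration.

```python
import math

# (t, p) data from the table: years since 1994, price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
mean_t = sum(t_values) / n
mean_p = sum(p_values) / n
sxx = sum((t - mean_t) ** 2 for t in t_values)
syy = sum((p - mean_p) ** 2 for p in p_values)  # total variation in price
sxy = sum((t - mean_t) * (p - mean_p) for t, p in zip(t_values, p_values))

# Least-squares line and correlation coefficient.
slope = sxy / sxx
intercept = mean_p - slope * mean_t
r = sxy / math.sqrt(sxx * syy)

# Way 1: square the correlation coefficient.
r_squared_from_r = r ** 2

# Way 2: one minus (sum of squared residuals) / (total variation).
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(t_values, p_values))
r_squared_from_residuals = 1 - sse / syy

print(f"r = {r:.4f}")
print(f"r squared (from r)         = {r_squared_from_r:.4f}")
print(f"r squared (from residuals) = {r_squared_from_residuals:.4f}")
```

Both values should be about 0.948, which is close to 1, as we would expect from r = 0.9734.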
We will now look at examples of data with an r-squared
value close to 1 and with an r-squared value close to 0.
Consider the following set of data points.
[Figure: scatter plot with trendline y = 0.5091x + 1.94 and R² = 0.9943. Horizontal axis from 0 to 12; vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with trendline y = -0.183x + 8.3267 and R² = 0.0084. Horizontal axis from 0 to 12; vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
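The same contrast can be reproduced with a short Python sketch. The transcript does not list the data behind the two plots, so the two data sets below are made up purely for illustration: one follows a nearly straight line, and the other has its high and low values spread evenly across the range. The function name r_squared is also chosen just for this example.

```python
def r_squared(xs, ys):
    """r-squared value of the least-squares line for paired x and y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Hypothetical data, made up for this example (not the data in the plots).
xs = list(range(1, 11))
nearly_linear = [2.1, 2.4, 3.0, 3.4, 4.1, 4.4, 5.0, 5.6, 5.9, 6.6]
scattered = [12, 3, 15, 6, 9, 9, 6, 15, 3, 12]

print(f"nearly linear data: r squared = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered data:     r squared = {r_squared(xs, scattered):.4f}")
```

The first value should be very close to 1 and the second very close to 0, matching what we saw in the two graphs.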
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, although not whether the slope is positive
or negative, and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot with fitted line: y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with fitted line: y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 80
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of
a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year                        1994  1996  1998  2000  2002  2004
t (years since 1994)           0     2     4     6     8    10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Scatter plot with the fitted line drawn through the data: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\text{slope} = \frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\text{slope} = \frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
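As a quick check of this interpretation, the short Python sketch below evaluates the regression equation at consecutive years and prints the change from one year to the next. The slope and intercept are the values given on the slides; the helper name predict_price is only an illustrative choice.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229
# (t in years since 1994, p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Predicted average price in millions of dollars (illustrative helper)."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price when t goes up by exactly one year.
yearly_changes = [predict_price(t + 1) - predict_price(t) for t in range(5)]

for t, change in enumerate(yearly_changes):
    dollars = round(change * 1_000_000)
    print(f"t={t} -> t={t + 1}: predicted price rises by about ${dollars:,}")
```

Every one-year step gives the same increase because the model is a straight line, which is exactly what the slope describes.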
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
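The two calculations above can be repeated for any year with a few lines of Python. This is only a sketch: the function names year_to_t and predict_price are made up for illustration, while the equation and the 2000 table value of $0.95 million come from the slides.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229
# (t in years since 1994, p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229


def year_to_t(year):
    """Convert a calendar year to t, where t=0 represents 1994."""
    return year - 1994


def predict_price(year):
    """Predicted average price (millions of $) for a calendar year."""
    return SLOPE * year_to_t(year) + INTERCEPT


p_2008 = predict_price(2008)   # outside the data range (a forecast)
p_2000 = predict_price(2000)   # inside the data range
actual_2000 = 0.95             # table value for 2000, in millions of $

print(f"2008 prediction: {p_2008:.4f} million dollars")
print(f"2000 prediction: {p_2000:.4f} million dollars")
print(f"2000 prediction minus table value: {p_2000 - actual_2000:.4f} million dollars")
```

The 2008 value is a forecast beyond the last data point (2004), so it is only as reliable as the assumption that the trend continued.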
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with regression line: price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
[Zoomed view: one data point and the nearby regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
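To make the idea of a residual concrete, here is a small Python sketch that computes the residual at each of the six data points, using the regression equation from the slides. It assumes the common sign convention "actual value minus predicted value", so a positive residual means the point lies above the line; the slides only describe the vertical distance, so the sign is an added assumption.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Value on the regression line p = 0.1264t + 0.2229 from the slides."""
    return 0.1264 * t + 0.2229


# Residual = actual p minus the p predicted by the line (assumed convention).
residuals = [(t, p - predicted(t)) for t, p in points]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: residual = {res:+.4f} (point is {side} the line)")
```

Some residuals are positive and some are negative, which is why simply adding them up is not a useful measure of fit.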
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
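For the curious, the following Python sketch shows one standard way to write the least-squares formulas for a line. It is not taken from the course PDF, so treat the function names and layout as assumptions. It also compares the sum of squared residuals for the least-squares line with that of another reasonable-looking line (the one through the first and last data points).

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(data):
    """Return (slope, intercept) of the line minimizing squared residuals."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(data, slope, intercept):
    """Sum of (actual - predicted)^2 over all data points."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


slope, intercept = least_squares_line(points)
best_sse = sum_squared_residuals(points, slope, intercept)

# A different candidate: the line through the first and last data points.
other_slope = (1.60 - 0.38) / (10 - 0)
other_intercept = 0.38
other_sse = sum_squared_residuals(points, other_slope, other_intercept)

print(f"least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals, least-squares line: {best_sse:.4f}")
print(f"sum of squared residuals, endpoint line:      {other_sse:.4f}")
```

The slope and intercept agree with the equation on the slides to four decimal places, and the other line has a larger sum of squared residuals.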
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
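The slides do not give a formula for r. The sketch below uses the usual (Pearson) formula for the correlation coefficient, which is an assumption about what the chart software computes, and applies it to the apartment-price data. The function name correlation is illustrative.

```python
import math

# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    s_yy = sum((y - mean_y) ** 2 for _, y in data)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(points)
print(f"r = {r:.3f}")

# Reversing the trend (prices falling over time) flips the sign of r.
falling = [(t, 2.0 - p) for t, p in points]
r_falling = correlation(falling)
print(f"r for the reversed data = {r_falling:.3f}")
```

The result agrees with the value shown on the graph to about three decimal places, and the reversed data give the same size of r with a negative sign, matching the description of negatively sloped fits.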
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, although not whether the slope is positive
or negative, and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
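One common way to compute r squared for a fitted line is to compare the squared residuals with the total squared variation of the data around its mean. The Python sketch below does this for the apartment-price data, using the regression equation from the slides. The formula is the standard one rather than something stated in the slides; for a least-squares line it matches the square of r up to rounding.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Value on the regression line p = 0.1264t + 0.2229 from the slides."""
    return 0.1264 * t + 0.2229


mean_p = sum(p for _, p in points) / len(points)

# Variation left over after fitting the line (sum of squared residuals).
ss_res = sum((p - predicted(t)) ** 2 for t, p in points)
# Total variation of the prices around their mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
r_from_slides = 0.9734

print(f"r squared from the residuals:      {r_squared:.3f}")
print(f"square of the r shown on the graph: {r_from_slides ** 2:.3f}")
```

Both numbers come out close to 0.95, so for this data set the line accounts for most of the variation in price.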
Consider the following set of data points.
[Scatter plot with fitted line: y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with fitted line: y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 81
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of
a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year                        1994  1996  1998  2000  2002  2004
t (years since 1994)           0     2     4     6     8    10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Scatter plot with the fitted line drawn through the data: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
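The comparison above can be reproduced in a few lines of Python. This is only a sketch: the function name predict_price is our own choice, and the coefficients are the rounded values shown on the graph.

```python
# Rounded coefficients of the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars (the p-intercept)


def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


predicted_1994 = predict_price(0)  # at t = 0 the prediction is the p-intercept
actual_1994 = 0.38                 # value from the table for t = 0

print(f"Predicted for 1994: ${predicted_1994 * 1_000_000:,.0f}")
print(f"Actual for 1994:    ${actual_1994 * 1_000_000:,.0f}")
```

Setting t to 0 removes the slope term, which is why the prediction for 1994 is exactly the p-intercept.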
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \),
the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
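To see the slope as a per-year change, the short Python sketch here compares the model's predictions for consecutive years. The helper name predict_price is hypothetical, and the coefficients are the rounded ones from the regression equation.

```python
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars


def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price over each one-year step, converted to dollars.
yearly_changes = [
    (predict_price(t + 1) - predict_price(t)) * 1_000_000
    for t in range(10)
]

for t, change in enumerate(yearly_changes):
    print(f"From t = {t} to t = {t + 1}: about ${change:,.0f}")
```

Every one-year step gives the same change because the model is a straight line; that constant change is the slope.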
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300.
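The two calculations above can be sketched in Python by converting a calendar year to t first. The names BASE_YEAR and predict_price_for_year are our own; the coefficients are the rounded ones from the regression equation.

```python
BASE_YEAR = 1994     # t = 0 corresponds to 1994
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars


def predict_price_for_year(year):
    """Predicted average price (millions of dollars) for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


forecast_2008 = predict_price_for_year(2008)  # t = 14, outside the data range
check_2000 = predict_price_for_year(2000)     # t = 6, inside the data range
actual_2000 = 0.95                            # value from the table
error_2000 = check_2000 - actual_2000

print(f"2008 forecast:   ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${check_2000 * 1_000_000:,.0f}")
print(f"2000 difference: ${error_2000 * 1_000_000:,.0f}")
```

The 2008 value is an extrapolation, so it only holds if the trend in the 1994 to 2004 data continued.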
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
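As a rough illustration, the residual of each of the six data points from the table can be computed in Python. This sketch assumes the common sign convention of actual value minus predicted value, so a positive residual means the point lies above the line; the slides only describe the vertical distance.

```python
# The six (t, p) data points from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264       # rounded coefficients of the regression equation
INTERCEPT = 0.2229

residuals = []
for t, actual in data:
    predicted = SLOPE * t + INTERCEPT
    residual = actual - predicted  # assumed convention: actual minus predicted
    residuals.append(residual)
    print(f"t = {t:2d}: actual {actual:.4f}, predicted {predicted:.4f}, "
          f"residual {residual:+.4f}")
```

Some residuals are positive and some are negative, since the line passes above some points and below others.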
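There is a method that finds the line that minimizes the sum
of the squares of all of the residuals (squaring keeps positive
and negative residuals from canceling each other out).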
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
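For readers who are curious, here is a minimal Python sketch of the least-squares calculation using the standard closed-form formulas for a line. It is not required for the course and is not necessarily how the PDF presents the method; the function name least_squares_line is our own.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


# The six (t, p) data points from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result matches the regression equation p = 0.1264t + 0.2229 shown on the graph.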
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
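The correlation coefficient for the apartment data can be checked with a short Python sketch. It uses the standard formula for the (Pearson) correlation coefficient, which we assume is the r reported on the graph; the function name correlation_coefficient is our own.

```python
from math import sqrt


def correlation_coefficient(points):
    """Correlation coefficient r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


# The six (t, p) data points from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

r = correlation_coefficient(data)

# Flipping the sign of every price gives a falling trend with the same fit.
r_flipped = correlation_coefficient([(t, -p) for t, p in data])

print(f"r for the data:         {r:.3f}")
print(f"r with prices negated:  {r_flipped:.3f}")
```

The value is about 0.973, in line with the r = 0.9734 on the graph (any difference in the last digit comes from rounding). Negating the prices only changes the sign of r, which shows that the sign reflects the direction of the slope while the size reflects the quality of the fit.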
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it no longer shows whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
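One common way to compute r-squared, assumed here since the slides do not give a formula, is 1 minus the sum of squared residuals divided by the total squared variation of the prices about their mean. The Python sketch applies this to the six data points from the table, using the rounded regression coefficients.

```python
from math import sqrt

# The six (t, p) data points from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264       # rounded coefficients of the regression equation
INTERCEPT = 0.2229

mean_p = sum(p for _, p in data) / len(data)

# Squared vertical distances from the points to the regression line.
ss_residual = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in data)
# Squared vertical distances from the points to the mean price.
ss_total = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_residual / ss_total

print(f"r squared: {r_squared:.3f}")
print(f"square root of r squared: {sqrt(r_squared):.3f}")
```

The square root is about 0.973, which agrees with the r = 0.9734 on the graph; the sign of r has to be read from the slope, which is positive here.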
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the regression line y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the regression line y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
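The same contrast can be sketched in Python with two small made-up data sets. These points are hypothetical and are not the ones plotted in the slides, whose coordinates are not given; the function name r_squared is our own, and it computes r-squared as the square of the correlation coefficient.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
nearly_linear = [2.5, 3.0, 3.4, 4.1, 4.5]  # rises steadily with x
scattered = [5, 15, 3, 15, 5]              # no upward or downward trend

print(f"Nearly linear points: r squared = {r_squared(xs, nearly_linear):.4f}")
print(f"Scattered points:     r squared = {r_squared(xs, scattered):.4f}")
```

The first set gives a value close to 1 and the second a value close to 0, mirroring the two examples above.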
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 82
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
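For anyone who wants to work with these points outside Excel, here is a small Python sketch that stores them as (t, p) pairs and confirms the observation that p increases with t. The variable names are our own.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994

# (t, p) pairs: t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

for t, p in data:
    print(f"{BASE_YEAR + t}: t = {t:2d}, p = {p:.2f} (${p * 1_000_000:,.0f})")

# Check that each price is larger than the one before it.
prices = [p for _, p in data]
always_increasing = all(a < b for a, b in zip(prices, prices[1:]))
print("Prices increase at every step:", always_increasing)
```

Knowing that the prices always increase does not tell us whether the pattern is linear, which is why the next step is to plot the points.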
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Slide 83
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994)    p (price in millions of $)
0                       0.38
2                       0.40
4                       0.60
6                       0.95
8                       1.20
10                      1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Graph: the scatter plot of price p in millions of $ against time t in years since 1994, with the line of best fit drawn over it]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: price p in millions of $ against time t in years since 1994, with the regression line labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
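As a rough illustration of where the numbers on the graph come from, the following Python sketch fits a line to the six (t, p) points listed earlier using the standard least-squares formulas for slope and intercept. The function name `fit_line` and the variable names are made up for this example, and this is not necessarily how Excel computes its trendline internally.

```python
# Six (t, p) data points: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points.

    fit_line is a hypothetical helper written for this example.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation shown on the graph, p = 0.1264t + 0.2229.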
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: price p in millions of $ against time t in years since 1994, with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \),
the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
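The per-year interpretation of the slope can be checked with a few lines of Python. This is only a sketch: `predicted_price` is a name invented here for the regression equation p = 0.1264t + 0.2229, and the starting years are arbitrary choices.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# The change over one year is the same no matter which year we start from.
for t in (0, 5, 9):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t = {t} to t = {t + 1}: change of about ${change * 1_000_000:,.0f}")
```

Each line of output should report a change of about $126,400, which is the slope converted from millions of dollars to dollars.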
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the average price of a two-bedroom apartment to have been
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
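The two calculations above can be reproduced with a short Python sketch. The helper names `predicted_price` and `years_since_1994` are assumptions made for this example; the numbers come from the regression equation and the table.

```python
def predicted_price(t):
    """Regression equation p = 0.1264t + 0.2229 (p in millions of dollars)."""
    return 0.1264 * t + 0.2229


def years_since_1994(year):
    # t = 0 represents 1994, so t is the number of years after 1994.
    return year - 1994


p_2008 = predicted_price(years_since_1994(2008))  # outside the data range
p_2000 = predicted_price(years_since_1994(2000))  # inside the data range
actual_2000 = 0.95  # from the table, in millions of dollars

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million")
print(f"2000 prediction minus actual: {p_2000 - actual_2000:.4f} million")
```

For 2000, the model overshoots the table value by about 0.0313 million, or roughly $31,300.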
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: price p in millions of $ against time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: price p in millions of $ against time t in years since 1994, with the regression line]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
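To make the idea of residuals more concrete, here is a small Python sketch, offered as an illustration and not as part of the course material. It computes each residual as the actual price minus the price the line predicts, then adds up the squared residuals. Squares are used so that positive and negative residuals do not cancel each other. The two comparison lines are hypothetical alternatives chosen only for contrast.

```python
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Vertical differences: actual price minus the price the line predicts."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]


def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the line p = slope * t + intercept."""
    return sum(r ** 2 for r in residuals(slope, intercept))


best = sum_of_squares(0.1264, 0.2229)     # regression line from the graph
steeper = sum_of_squares(0.1400, 0.2229)  # hypothetical line with a larger slope
higher = sum_of_squares(0.1264, 0.3000)   # hypothetical line shifted upward

for t, r in zip(t_values, residuals(0.1264, 0.2229)):
    print(f"t = {t:2d}: residual = {r:+.4f}")
print(f"regression line: {best:.4f}, steeper: {steeper:.4f}, higher: {higher:.4f}")
```

The regression line should give the smallest total of the three, which is what "best fit" means in the least-squares sense.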
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
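The relationship between r and r squared can be sketched in Python using the apartment data from earlier. The function `correlation` is written for this example using the usual formula for the correlation coefficient; it is an illustration, not a description of Excel's internals. The mirrored data set is made up purely to show what happens when the trend slopes downward.

```python
import math

t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(t_values, p_values)
# Hypothetical mirrored data: the same points flipped so the trend slopes downward.
r_flipped = correlation(t_values, [-p for p in p_values])

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
print(f"flipped: r = {r_flipped:.3f}, r squared = {r_flipped ** 2:.3f}")
```

The two data sets give r values with opposite signs but the same r-squared value, which is why r squared always falls between 0 and 1.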
We will now look at some examples of data with an
r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[The same scatter plot with the regression line: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot of the data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[The same scatter plot with the regression line: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is the equation
of a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 84
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\text{slope} = \frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
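One way to check this reading of the slope is to compare predictions that are one year apart. The short Python sketch below does that. The name predict_price is our own label for the regression equation p = 0.1264t + 0.2229, not something from the course materials.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229
# (t in years since 1994, p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# For a line, the change over any one-year step equals the slope.
for t in (0, 3, 7):
    change = predict_price(t + 1) - predict_price(t)
    print(f"t={t} -> t={t + 1}: change = {change:.4f} million (${change * 1e6:,.0f})")
```

No matter which starting year we pick, the one-year change comes out to the slope, which is what "about $126,400 per year" means.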
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the prediction is off by only $31,300, or about 3%.
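The two calculations above can be sketched in Python as follows. The function name predict_price and the constant BASE_YEAR are our own choices for illustration; the slope, intercept, and the actual 2000 price come from the slides.

```python
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars at t = 0 (1994)
BASE_YEAR = 1994


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast beyond the table: 2008 corresponds to t = 14.
p_2008 = predict_price(2008)

# Check against a year that is in the table: 2000 corresponds to t = 6,
# where the actual average price was 0.95 million.
p_2000 = predict_price(2000)
actual_2000 = 0.95
error_2000 = p_2000 - actual_2000

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million, actual {actual_2000:.2f}, "
      f"difference {error_2000:+.4f} million")
```

Keep in mind that the 2008 figure is an extrapolation. It is only as reliable as the assumption that the trend continued.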
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
gives for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box marking one data point.]
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: close-up of the boxed data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
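To make this concrete, here is a small Python sketch that computes the residual at each of the six apartment data points. It assumes the usual sign convention (actual value minus the value on the line), so a positive residual means the point lies above the line. The variable names are our own.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264
INTERCEPT = 0.2229

residuals = []
for t, actual in data:
    predicted = SLOPE * t + INTERCEPT
    # Residual = actual value minus the value on the line at the same t.
    residual = actual - predicted
    residuals.append(residual)
    print(f"t={t:2d}  actual={actual:.2f}  predicted={predicted:.4f}  "
          f"residual={residual:+.4f}")
```

Some residuals are positive and some are negative, which is why simply adding them up is not a useful measure of how well the line fits.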
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
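For the curious, the least-squares idea can be sketched in a few lines of Python. This is only an illustration, not something the course asks you to do. It uses the standard closed-form least-squares formulas for a line, applied to the six apartment data points; the variable names and the helper sum_squared_residuals are our own.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Least-squares slope: sum of (t - t_mean)(p - p_mean) over sum of (t - t_mean)^2.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
slope = s_tp / s_tt
# The least-squares line always passes through the point (t_mean, p_mean).
intercept = p_mean - slope * t_mean


def sum_squared_residuals(m, b):
    """Sum of squared residuals for the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in data)


best = sum_squared_residuals(slope, intercept)
# Nudging the slope or the intercept away from the least-squares values
# makes the sum of squared residuals larger.
worse_slope = sum_squared_residuals(slope + 0.01, intercept)
worse_intercept = sum_squared_residuals(slope, intercept + 0.05)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"sum of squared residuals: best = {best:.5f}, "
      f"changed slope = {worse_slope:.5f}, changed intercept = {worse_intercept:.5f}")
```

The slope and intercept agree with the equation shown on the graph, and any other line we try gives a larger sum of squared residuals.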
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
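As a rough illustration, the correlation coefficient for the apartment data can be computed with the standard (Pearson) formula. The Python sketch below uses our own variable names and is not part of the course materials.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Sums of products of deviations from the means.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
s_pp = sum((p - p_mean) ** 2 for _, p in data)

# Pearson correlation coefficient: always between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.4f}")
```

The result is roughly 0.973, in line with the value shown on the graph. It is close to 1, so the points lie near a line with positive slope.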
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is, but not
whether the slope is positive or negative, since it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
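The sketch below shows one common way to compute r squared for a least-squares line: 1 minus the sum of squared residuals divided by the total squared variation of p about its mean. For a line fitted by least squares, this equals the square of r. The code uses the apartment data points and our own variable names.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Least-squares line through the data.
slope = (sum((t - t_mean) * (p - p_mean) for t, p in data)
         / sum((t - t_mean) ** 2 for t, _ in data))
intercept = p_mean - slope * t_mean

# Variation left over after fitting the line, and total variation in p.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - p_mean) ** 2 for _, p in data)
r_squared = 1 - ss_res / ss_tot
print(f"r-squared = {r_squared:.4f}")

# Squaring removes the sign, so r = 0.9734 and r = -0.9734 give the same value.
for r in (0.9734, -0.9734):
    print(f"r = {r:+.4f}  ->  r squared = {r * r:.4f}")
```

This is why r squared can tell us how tight the fit is but not whether the line slopes up or down.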
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a straight line, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 85
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):    0     2     4     6     8     10
p (price in millions $): 0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
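The conversion from calendar years to t can be sketched in Python as below. The dictionary prices_by_year is our own way of writing down the table; the years are worked out from the rule that t=0 represents 1994.

```python
BASE_YEAR = 1994

# Average prices in millions of dollars, keyed by calendar year.
prices_by_year = {
    1994: 0.38,
    1996: 0.40,
    1998: 0.60,
    2000: 0.95,
    2002: 1.20,
    2004: 1.60,
}

# Each data point is (t, p), where t is the number of years since 1994.
points = [(year - BASE_YEAR, price) for year, price in sorted(prices_by_year.items())]

for t, p in points:
    print(f"t = {t:2d}   p = {p:.2f}")
```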
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the line of best fit drawn over the data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor
fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot, horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the line shown on the plot is y = 0.5091x + 1.94,
with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot, horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the line shown on the plot is y = -0.183x + 8.3267,
with R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 86
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
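If you are curious where the numbers on the graph come from, here is a minimal Python sketch that computes them from the six data points. It assumes the standard textbook least-squares formulas for the slope, intercept, and correlation coefficient; the slides leave those formulas to the PDF and to Excel, and the function name fit_line is ours, not part of the course materials.

```python
import math

# The six (t, p) data points from the table: t is years since 1994,
# p is the average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def fit_line(data):
    """Return (slope, intercept, r) of the least-squares line through data.

    Assumption: these are the usual textbook least-squares formulas,
    not a formula taken from the slides.
    """
    n = len(data)
    t_mean = sum(t for t, _ in data) / n
    p_mean = sum(p for _, p in data) / n
    # Sums of squared deviations and of cross-products about the means.
    s_tt = sum((t - t_mean) ** 2 for t, _ in data)
    s_pp = sum((p - p_mean) ** 2 for _, p in data)
    s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
    slope = s_tp / s_tt
    intercept = p_mean - slope * t_mean
    r = s_tp / math.sqrt(s_tt * s_pp)
    return slope, intercept, r

slope, intercept, r = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}, r = {r:.4f}")
```

Run as written, this should give the same slope and intercept as the graph to four decimal places, and a value of r that agrees with the graph to about three decimal places.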
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
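The two calculations above follow the same recipe: turn the year into t, then plug t into the regression equation. A small Python sketch of that recipe is below. The names SLOPE, INTERCEPT, BASE_YEAR, and predicted_price are our own, chosen for illustration; the coefficients are the rounded values from the regression equation.

```python
# Slope and intercept are the rounded values quoted in the regression
# equation p = 0.1264t + 0.2229 (p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predicted_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT

# The two years worked out in the text: 2000 (t = 6) and 2008 (t = 14).
for year in (2000, 2008):
    print(year, round(predicted_price(year), 4))
```

Keep in mind that 2008 lies outside the years in the table, so that prediction only holds if the trend continued.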
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
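To make the idea concrete, here is a short Python sketch that lists the residual at each of the six data points, using the rounded regression equation p = 0.1264t + 0.2229. The sign convention (actual value minus predicted value) is an assumption on our part; the slides only describe the residual as a vertical distance.

```python
# Data points (t, p) from the table and the rounded regression equation
# p = 0.1264t + 0.2229 quoted on the graph.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def line(t):
    """Price predicted by the regression line, in millions of dollars."""
    return 0.1264 * t + 0.2229

# Residual = actual p minus the p predicted by the line.
# (Assumed sign convention: positive means the point lies above the line.)
residuals = [(t, p - line(t)) for t, p in points]

for t, res in residuals:
    print(f"t = {t:2d}: residual = {res:+.4f}")
```

Some residuals come out positive and some negative, depending on whether the data point sits above or below the line.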
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
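You do not need the formulas to see what “least squares” means. The Python sketch below squares each residual, adds the squares up, and compares the total for the regression line with the totals for two other lines. The two comparison lines are arbitrary choices of ours, made up purely for illustration.

```python
# Data points (t, p) from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def sum_squared_residuals(slope, intercept, data):
    """Add up the squared vertical gaps between the data and a given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)

# The regression line quoted on the graph (rounded coefficients).
best = sum_squared_residuals(0.1264, 0.2229, points)

# Two hypothetical alternatives: a steeper line and a higher line.
steeper = sum_squared_residuals(0.15, 0.2229, points)
higher = sum_squared_residuals(0.1264, 0.30, points)

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

The regression line gives the smallest total of the three, which is the sense in which it is the “best” fit.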
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor
fit for the data points.
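As a quick check of how r and r squared relate, the Python sketch below squares the correlation coefficient from the apartment-price graph (r = 0.9734). The negative value is a made-up illustration, included only to show that squaring gives the same result whatever the sign of r.

```python
def r_squared(r):
    """Square a correlation coefficient to get the r-squared value."""
    return r * r

apartment_r = 0.9734               # value quoted on the apartment-price graph
hypothetical_negative_r = -0.9734  # assumed value, not from the slides

print(round(r_squared(apartment_r), 4))
print(round(r_squared(hypothetical_negative_r), 4))
```

Both lines print the same number, so an r-squared value close to 1 can come from a line that slopes upward or downward.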
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot, horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the line shown on the plot is y = 0.5091x + 1.94,
with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot, horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the line shown on the plot is y = -0.183x + 8.3267,
with R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the line drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the average price was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the average price of a two-bedroom apartment to have been
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by
$31,300, or about 3%.
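The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming the rounded coefficients 0.1264 and 0.2229 from the regression equation; the name predict_price is made up for this example and does not come from the course materials.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (the year 1994)


def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Predictions for 1994 (t=0), 2000 (t=6), and 2008 (t=14), shown in dollars.
for year in (1994, 2000, 2008):
    t = year - 1994
    print(year, f"${predict_price(t) * 1_000_000:,.0f}")

# The slope is the predicted change in price for a one-year increase in t.
one_year_change = predict_price(7) - predict_price(6)
print(f"Predicted change per year: ${one_year_change * 1_000_000:,.0f}")
```

Whatever two consecutive years are chosen, the difference between the predictions is the slope, which is why the slope can be read as a yearly increase.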
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
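To make the idea concrete, here is a short Python sketch that computes the residual at each of the six data points from the table. It assumes the rounded regression equation p = 0.1264t + 0.2229, and the variable and function names are chosen only for this illustration.

```python
# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict_price(t):
    """Price predicted by the rounded regression equation."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (a signed vertical distance).
residuals = [p - predict_price(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predict_price(t):.4f}  residual={res:+.4f}")

# Positive and negative residuals largely cancel, so their plain sum is close
# to zero. Squaring each residual first keeps them from cancelling.
sum_of_residuals = sum(residuals)
sum_of_squares = sum(res * res for res in residuals)
print("sum of residuals:", round(sum_of_residuals, 4))
print("sum of squared residuals:", round(sum_of_squares, 4))
```

Points above the line have positive residuals and points below it have negative ones, so the size of each residual matters more than its sign when judging the fit.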
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
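For readers curious about what happens behind the scenes, the following Python sketch applies the standard least-squares formulas for a line to the six data points from the table. The slope is the sum of (t - mean t)(p - mean p) divided by the sum of (t - mean t) squared, and the intercept is mean p minus the slope times mean t. This is an illustration only, not the method required in the course, and the name fit_line is made up here.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the regression equation p = 0.1264t + 0.2229 used throughout this presentation.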
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
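As a rough illustration of where r comes from, the Python sketch below computes the usual (Pearson) correlation coefficient for the six data points from the table. The function name correlation is an assumption made for this example, and the calculation is not something the course requires you to do by hand.

```python
import math


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)
print(f"r = {r:.4f}")

# Points that lie exactly on a line give r = 1 (rising) or r = -1 (falling).
r_rising = correlation([(0, 0), (1, 2), (2, 4)])
r_falling = correlation([(0, 4), (1, 2), (2, 0)])
print(r_rising, r_falling)
```

For the apartment data the result should come out very close to the r = 0.9734 shown on the graph; small differences in the last digit can come from rounding.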
On a chart, Excel will not display the value of r; instead, it
gives the value of r squared.
The r-squared value tells us much the same thing, but it
is always between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of a set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
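One common way to define the r-squared value is 1 minus the ratio of the sum of squared residuals to the sum of squared deviations of the data from their mean. The Python sketch below applies that definition to the apartment data, assuming the rounded regression equation p = 0.1264t + 0.2229. The function name r_squared is made up for this illustration, and this is not presented as the exact procedure Excel follows.

```python
def r_squared(points, slope, intercept):
    """R-squared of the line y = slope*x + intercept for the given points."""
    mean_y = sum(y for _, y in points) / len(points)
    # Squared vertical distances from the points to the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Squared vertical distances from the points to their mean.
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
value = r_squared(points, 0.1264, 0.2229)
print(f"r-squared = {value:.4f}")
```

For a least-squares line this quantity equals the square of the correlation coefficient, so here it should land near 0.9734 squared, or roughly 0.95, which indicates a good linear fit.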
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression equation in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 2
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[The same scatter plot with its regression line: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot of the data points; x-axis from 0 to 12, y-axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[The same scatter plot with its regression line: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
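The two cases above can be imitated numerically. The following is a minimal sketch, assuming two small made-up data sets (the actual points behind the slides are not listed, so `nearly_linear` and `scattered` are hypothetical) and a helper named `r_squared` introduced here for illustration. For a straight-line fit, the r-squared value is the square of the correlation coefficient r, which is what the helper computes.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return s_xy ** 2 / (s_xx * s_yy)

# Hypothetical data (not the points from the slides).
xs = [1, 2, 3, 4, 5, 6]
nearly_linear = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]  # rises steadily with x
scattered = [5, 1, 6, 2, 6, 3]                  # no clear trend

print(f"nearly linear: {r_squared(xs, nearly_linear):.4f}")
print(f"scattered:     {r_squared(xs, scattered):.4f}")
```

With these made-up points, the first value comes out close to 1 and the second close to 0, mirroring the two charts.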
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 3
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot of price p in millions of $ versus time t in years since 1994, with the line of best fit drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Graph: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
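The equation and the correlation coefficient shown on the graph can be reproduced without Excel. The sketch below assumes the six (t, p) points listed earlier and applies the standard least-squares formulas for a line; the variable names are chosen here for illustration, and this is not necessarily the procedure Excel uses internally.

```python
from math import sqrt

# Apartment data from the text: t in years since 1994,
# p in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squared deviations and cross-deviations about the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                   # least-squares slope
intercept = p_mean - slope * t_mean   # least-squares intercept
r = s_tp / sqrt(s_tt * s_pp)          # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

Rounded to four decimal places, the slope and intercept match p = 0.1264t + 0.2229, and r agrees with the value on the graph up to rounding in the last digit.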
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
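The per-year interpretation can be checked directly from the regression equation. Writing p(t) for the predicted price at time t (a notation introduced here for convenience), increasing t by 1 changes the prediction by exactly the slope, whatever the starting year:

```latex
\[
p(t+1) - p(t)
  = \bigl[\,0.1264(t+1) + 0.2229\,\bigr] - \bigl[\,0.1264t + 0.2229\,\bigr]
  = 0.1264
\]
```

Since p is measured in millions of dollars, this difference corresponds to $126,400 per year.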
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
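The two calculations above can be sketched in a few lines of code. This is only an illustration of substituting into the regression equation; the function name `predicted_price` and the constants are names introduced here, and the $950,000 figure is the table value for 2000 quoted above.

```python
# Regression equation from the text: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994

def predicted_price(year):
    """Return the model's predicted price in dollars for a calendar year."""
    t = year - BASE_YEAR
    return (SLOPE * t + INTERCEPT) * 1_000_000

for year in (2000, 2008):
    print(year, f"${predicted_price(year):,.0f}")

# Compare the 2000 prediction with the table value of $950,000.
error_2000 = predicted_price(2000) - 950_000
print(f"2000 prediction is off by ${error_2000:,.0f}")
```

The prediction for 2000 lies inside the range of the data, while the prediction for 2008 extends the trend four years past the last data point, so it depends more heavily on the trend continuing.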
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points and the regression line, price p in millions of $ versus time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
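To make the idea of small residuals concrete, the sketch below compares the regression line with one other candidate line on the apartment data. It assumes the six (t, p) points given earlier; the alternative line (through the first and last points) and the helper name `sum_squared_residuals` are hypothetical choices made for this illustration. The residuals are squared before adding so that positive and negative residuals cannot cancel each other.

```python
# Apartment data from the text: t in years since 1994,
# p in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical gaps between the data and a given line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))

# Regression line from the text.
ssr_regression = sum_squared_residuals(0.1264, 0.2229)

# Hypothetical alternative: the line through the first and last points,
# (0, 0.38) and (10, 1.60), which has slope 0.122 and intercept 0.38.
ssr_alternative = sum_squared_residuals(0.122, 0.38)

print(f"regression line:  {ssr_regression:.4f}")
print(f"alternative line: {ssr_alternative:.4f}")
```

The regression line gives the smaller total, even though the alternative line passes exactly through two of the data points.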
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor linear fit,
so there is little or no linear relationship between the
variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points; x-axis from 0 to 12, y-axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[The same scatter plot with its regression line: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot of the data points; x-axis from 0 to 12, y-axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[The same scatter plot with its regression line: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 4
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot of price p in millions of $ versus time t in years since 1994, with the line of best fit drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Graph: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
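The slope interpretation can be checked with a few lines of Python. This is only a sketch: the function name predicted_price is made up for this illustration and is not part of the presentation. It evaluates the regression equation in two consecutive years and shows that the difference is the slope.

```python
# Regression equation from the presentation: p = 0.1264t + 0.2229
# t is years since 1994, p is the price in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Return the model's predicted price (millions of $) for year index t."""
    return SLOPE * t + INTERCEPT


# Increasing t by 1 year changes the prediction by the slope.
change_per_year = predicted_price(5) - predicted_price(4)
print(f"Predicted change per year: ${change_per_year * 1_000_000:,.0f}")
```

Any two consecutive values of t give the same difference, because the model is a straight line.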
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994,
and 2008 - 1994 = 14).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by
$31,300, or about 3%.
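The two calculations above can be reproduced in Python. This is a minimal sketch; the helper names predicted_price and year_to_t are chosen for this example only.

```python
def predicted_price(t):
    """Regression equation from the presentation; p is in millions of $."""
    return 0.1264 * t + 0.2229


def year_to_t(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994


# Forecast beyond the table: 2008 corresponds to t = 14.
p_2008 = predicted_price(year_to_t(2008))

# Check against a year in the table: 2000 corresponds to t = 6,
# and the table lists an actual price of 0.95 million.
p_2000 = predicted_price(year_to_t(2000))
actual_2000 = 0.95
error_2000 = p_2000 - actual_2000

print(f"2008 prediction: ${p_2008 * 1e6:,.0f}")
print(f"2000 prediction: ${p_2000 * 1e6:,.0f} (actual ${actual_2000 * 1e6:,.0f})")
print(f"Difference for 2000: ${error_2000 * 1e6:,.0f}")
```

Note that 2008 lies outside the years in the table, so that value is a forecast that assumes the trend continues, while the 2000 value can be compared with real data.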
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot and regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
A residual is the difference between the actual value at a
particular data point and the value that the regression line
gives for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the nearby part of the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
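To make the idea concrete, the residuals for this example can be listed with a short Python sketch. It uses the six data points from the table and the regression equation from the chart. The convention "actual minus predicted" is an assumption of this sketch, since the slides only describe the residual as a vertical distance.

```python
# Data points (t, p) from the presentation's table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Regression equation from the presentation; p is in millions of $."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value predicted by the line.
residuals = [p - predicted_price(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.4f}  predicted={predicted_price(t):.4f}  residual={res:+.4f}")
```

Some residuals are positive (the point is above the line) and some are negative (the point is below the line), and they nearly cancel when added together. That is why the fitting method works with the squares of the residuals.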
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
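For readers who are curious, here is a sketch of the least-squares calculation in Python. The presentation leaves the formulas to the PDF, so the formulas below are the standard textbook ones (slope = sum of products of deviations divided by sum of squared deviations in t, intercept = mean of p minus slope times mean of t), not a quotation from that PDF. The function names are made up for this example.

```python
# Data points (t, p) from the presentation's table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical gaps between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# A line with a slightly different slope has a larger sum of squared residuals.
best = sum_squared_residuals(data, slope, intercept)
other = sum_squared_residuals(data, slope + 0.01, intercept)
print(best < other)
```

Rounded to four decimal places, the result agrees with the equation shown on the chart, p = 0.1264t + 0.2229.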
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit to the
data, so there is little or no linear relationship between
the variables in this case.
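The value of r for the apartment data can be reproduced with the standard formula for the correlation coefficient. This is a sketch only: the presentation does not give the formula, and the function name correlation_coefficient is made up for this example.

```python
import math

# Data points (t, p) from the presentation's table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Return r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(data)
print(f"r = {r:.3f}")

# Flipping the sign of every price turns the upward trend into a downward one,
# so r keeps its size but becomes negative.
flipped = [(t, -p) for t, p in data]
r_flipped = correlation_coefficient(flipped)
print(f"r for flipped data = {r_flipped:.3f}")
```

The result is close to 1, which matches the chart: the points lie near a line with positive slope.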
Excel will not give the value of r on the chart; instead, it
gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
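One common way to compute r squared is to compare the squared residuals with the total squared variation of the prices around their average. The sketch below applies that standard formula (not one given in the presentation) to the apartment data, using the rounded regression equation from the chart. The variable names are chosen for this example.

```python
# Data points (t, p) from the presentation's table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression equation from the chart: p = 0.1264t + 0.2229
slope, intercept = 0.1264, 0.2229

mean_p = sum(p for _, p in data) / len(data)

# Variation the line fails to explain: sum of squared residuals.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)

# Total variation of the prices around their average.
ss_tot = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot
print(f"r squared = {r_squared:.3f}")
```

For a least-squares line with one input variable, this value equals the square of the correlation coefficient r, which is why it can never be negative.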
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
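The same contrast can be seen with a small Python sketch. The two data sets below are made up for illustration only; they are NOT the data behind the two charts above, whose values are not listed in the presentation. The function name r_squared is also chosen just for this example.

```python
def r_squared(xs, ys):
    """Return the square of the correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


# Made-up illustration data (hypothetical, not from the presentation).
xs = [1, 2, 3, 4, 5]
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4]  # rises steadily with x
scattered = [5, 1, 9, 1, 5]                # no upward or downward trend

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)
print(f"nearly linear: r squared = {r2_linear:.4f}")
print(f"scattered:     r squared = {r2_scattered:.4f}")
```

The first set gives a value close to 1, and the second gives a value close to 0, in line with the two charts.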
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 5
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year   t    p (millions of $)
1994   0    0.38
1996   2    0.40
1998   4    0.60
2000   6    0.95
2002   8    1.20
2004   10   1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Figure: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 6
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):      0     2     4     6     8     10
p (price, millions of $):  0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the same scatter plot with the line of best fit drawn through the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994, with the regression line labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229, r = 0.9734]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price in 1994 was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
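The substitutions above can also be checked with a few lines of Python. This is only an illustrative sketch: the function name `predict_price` and the dictionary `actual` are made up for this example, while the slope, intercept, and data points are the ones given in the presentation.

```python
# Regression line reported in the presentation: p = 0.1264t + 0.2229
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # predicted price (millions of dollars) at t = 0, i.e. 1994


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


# Actual prices from the data points listed earlier, keyed by calendar year
actual = {1994: 0.38, 1996: 0.40, 1998: 0.60, 2000: 0.95, 2002: 1.20, 2004: 1.60}

print(f"2008 forecast: {predict_price(2008):.4f} million")
for year, price in actual.items():
    print(year, f"predicted {predict_price(year):.4f}", f"actual {price:.2f}")
```

The 2008 value lies outside the 1994 to 2004 data range, so it is a forecast that holds only if the trend continues; the values for years in the table show how far the model is from the data it was fitted to.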
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: the scatter plot with a box drawn around one data point and the nearby part of the regression line]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
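To make the idea concrete, here is a small Python sketch that computes the residual at each of the six data points, using the regression line p = 0.1264t + 0.2229 from the presentation. The sign convention (actual value minus predicted value) and the names `line` and `residuals` are assumptions made for this illustration; the slides only describe the residual as a vertical distance.

```python
# Data points from the presentation: t = years since 1994, p = price in millions of $
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def line(t):
    """Price predicted by the regression line from the presentation."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (positive means the point is above the line)
residuals = [p - line(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={line(t):.4f}  residual={res:+.4f}")
```

Some residuals are positive and some are negative, which is what we expect when a line passes through the middle of the data rather than through every point.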
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
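For readers who are curious, the standard least-squares formulas for a line can be sketched in a few lines of Python. This is not the course's required method (Excel is); it is only an illustration. The function names are made up for this example. The code uses the textbook formulas slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x), which minimize the sum of the squared residuals.

```python
# Data points from the presentation: t = years since 1994, p = price in millions of $
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of the squared vertical distances between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # p = 0.1264t + 0.2229

# Any other line gives a larger sum of squared residuals than the fitted one
best = sum_squared_residuals(slope, intercept, t_values, p_values)
steeper = sum_squared_residuals(slope + 0.02, intercept, t_values, p_values)
higher = sum_squared_residuals(slope, intercept + 0.05, t_values, p_values)
print(best < steeper, best < higher)  # True True
```

The last two comparisons answer the earlier question of why this line and not some other: tilting or shifting the line makes the total squared residual larger.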
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped line.
A value very close to -1 indicates a very good fit with a
negatively sloped line.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
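As an illustration, the correlation coefficient can be computed directly from the data with the standard (Pearson) formula r = Sxy / sqrt(Sxx × Syy). The sketch below assumes that this is the coefficient the chart reports; the function name `correlation` and the two three-point test sets are made up for this example.

```python
import math

# Data points from the presentation: t = years since 1994, p = price in millions of $
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(t_values, p_values)
print(f"r is about {r:.3f}")

# Points lying exactly on a rising line give r = 1; on a falling line, r = -1
print(correlation([0, 1, 2], [1, 3, 5]))
print(correlation([0, 1, 2], [5, 3, 1]))
```

For the apartment data the result is roughly 0.973, which agrees with the r = 0.9734 shown on the graph up to rounding in the last digit.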
Excel’s trendline will not give the value of r; instead, it
gives the value of r squared.
The r-squared value tells us much the same thing about how
closely the line fits, but it is always between 0 and 1, so
it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
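One common way to compute r squared for a fitted line is 1 minus the ratio of the sum of squared residuals to the total squared variation of the y-values about their mean. The Python sketch below uses that definition. The function name `r_squared` is made up for this example, and the `scattered` data set is invented purely to show a low value; it is not the data behind any chart in the presentation.

```python
def r_squared(xs, ys):
    """r squared of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Apartment data from the presentation: a clear linear pattern
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
r2_prices = r_squared(t_values, p_values)
print(f"apartment data: r squared is about {r2_prices:.3f}")

# Hypothetical scattered data (invented for illustration only)
scattered_x = [1, 2, 3, 4, 5, 6]
scattered_y = [5, 1, 6, 2, 5, 2]
r2_scattered = r_squared(scattered_x, scattered_y)
print(f"scattered data: r squared is about {r2_scattered:.3f}")
```

The apartment data give a value of about 0.948, close to 1, while the invented scattered data give a value close to 0.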
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot with trendline, labeled y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot with trendline, labeled y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 8
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the scatter plot with the best-fit line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, now also labeled r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
$$\text{slope} = \frac{0.1264}{1}.$$
Recall that the definition of slope is
$$\text{slope} = \frac{\text{change in } y}{\text{change in } x}.$$
In this case we are using p and t, so it’s
$$\text{slope} = \frac{\text{change in } p}{\text{change in } t}.$$
So for our problem, we have
$$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}.$$
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
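The two calculations above can be reproduced with a short Python sketch. The names `year_to_t` and `predict_price` are illustrative choices, not part of the course materials, and the coefficients are the rounded values from the regression equation.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars, the predicted price at t = 0
BASE_YEAR = 1994     # t = 0 corresponds to 1994


def year_to_t(year):
    """Convert a calendar year to t, the number of years since 1994."""
    return year - BASE_YEAR


def predict_price(year):
    """Return the model's predicted price in millions of dollars."""
    return SLOPE * year_to_t(year) + INTERCEPT


if __name__ == "__main__":
    for year in (2000, 2008):
        p = predict_price(year)
        print(f"{year}: t = {year_to_t(year)}, p = {p:.4f} million dollars")
```

Keeping the year-to-t conversion in its own function makes it harder to plug in the calendar year by mistake, since the equation only makes sense for t measured from 1994.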
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of the data points with the regression line]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
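To make the idea concrete, here is a small Python sketch that computes the residual for each of the six data points. It assumes the common sign convention of actual value minus the value on the line, so a positive residual means the point lies above the line; the function names are illustrative only.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t, slope=0.1264, intercept=0.2229):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return slope * t + intercept


def residuals(points):
    """Return one residual per point: actual p minus the p on the line."""
    return [p - predicted(t) for t, p in points]


if __name__ == "__main__":
    for (t, p), r in zip(DATA, residuals(DATA)):
        side = "above" if r > 0 else "below"
        print(f"t = {t:2d}: actual {p:.2f}, line {predicted(t):.4f}, "
              f"residual {r:+.4f} ({side} the line)")
```

Some residuals come out positive and some negative, which is why simply adding them up is not a useful measure of fit: the positive and negative values largely cancel.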
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
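For the curious, the following Python sketch shows roughly what happens behind the scenes. It uses the standard textbook formulas for a least-squares line, which may be written differently in the course PDF, and the function names are assumptions made for this illustration. Applied to the six apartment data points, it should recover the slope and intercept quoted earlier, up to rounding.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squares of the vertical gaps between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


if __name__ == "__main__":
    slope, intercept = least_squares_line(DATA)
    print(f"p = {slope:.4f}t + {intercept:.4f}")
    best = sum_squared_residuals(DATA, slope, intercept)
    # A line with a slightly steeper slope gives a larger sum of squares.
    other = sum_squared_residuals(DATA, slope + 0.01, intercept)
    print(f"best line: {best:.4f}, steeper line: {other:.4f}")
```

Squaring each residual before adding keeps positive and negative residuals from cancelling, so a smaller total really does mean the line sits closer to the points.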
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether the
line slopes upward or downward.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
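To see how r and r squared relate, the following Python sketch computes both for the six apartment data points, using the standard formula for the correlation coefficient. The function name `correlation` is an illustrative choice. The result for r should be close to the 0.9734 shown on the graph, with small differences possible from rounding.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Return the correlation coefficient r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    r = correlation(DATA)
    print(f"r = {r:.4f}, r squared = {r * r:.4f}")
    # Flipping the sign of every y flips the sign of r but leaves r squared unchanged.
    flipped = [(x, -y) for x, y in DATA]
    r_neg = correlation(flipped)
    print(f"flipped data: r = {r_neg:.4f}, r squared = {r_neg * r_neg:.4f}")
```

The flipped data set is made up purely to show that a downward-sloping pattern gives a negative r while r squared stays the same.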
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points lying close to a straight line, with the trendline labeled y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points, with the trendline labeled y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation of the
line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 9
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot of the data points with the line of best fit drawn over them]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
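For readers who want to check these numbers outside of Excel, here is a minimal sketch in Python. It is only an illustration and not part of the course materials: it assumes the NumPy library is installed, it uses the six (t, p) points listed earlier, and the variable names (`t`, `p`, `slope`, `intercept`, `r`) are chosen here for convenience.

```python
import numpy as np

# The six data points: t = years since 1994, p = price in millions of $
t = np.array([0, 2, 4, 6, 8, 10], dtype=float)
p = np.array([0.38, 0.40, 0.60, 0.95, 1.20, 1.60])

# Fit a straight line (a degree-1 polynomial) to the points by least squares
slope, intercept = np.polyfit(t, p, 1)

# Correlation coefficient r between t and p
r = np.corrcoef(t, p)[0, 1]

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The printed slope, intercept, and r should agree with the values shown on the graph to about three decimal places; small differences in the last digit come from rounding.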
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
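The two calculations above can also be sketched in Python. This is only an illustration: the function name `predict_price` is made up here, and the slope and intercept are the rounded values from the regression equation p=0.1264t+0.2229.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229

price_2008 = predict_price(2008 - 1994)  # t = 14, outside the data range
price_2000 = predict_price(2000 - 1994)  # t = 6, a year that is in the table

actual_2000 = 0.95  # actual price for 2000 from the table, in millions of $

print(f"Predicted for 2008: ${price_2008 * 1_000_000:,.0f}")
print(f"Predicted for 2000: ${price_2000 * 1_000_000:,.0f}")
print(f"Prediction minus actual for 2000: ${(price_2000 - actual_2000) * 1_000_000:,.0f}")
```

Writing t as a year minus 1994 makes it harder to plug in the wrong value of t by accident.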
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot of the data points with the regression line: price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: zoomed-in view of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
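To make the idea concrete, the sketch below computes the residual for each of the six data points. It is an illustration only: it uses the rounded regression equation p=0.1264t+0.2229, the helper name `predict_price` is made up here, and it assumes the common sign convention of actual value minus predicted value, so a positive residual means the data point lies above the line.

```python
t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def predict_price(t):
    """Price predicted by the regression equation, in millions of dollars."""
    return 0.1264 * t + 0.2229

# Residual = actual price minus predicted price (assumed sign convention)
residuals = [p - predict_price(t) for t, p in zip(t_values, p_values)]

for t, res in zip(t_values, residuals):
    print(f"t = {t:2d}: residual = {res:+.4f}")
```

Some residuals are positive and some are negative, which is why the line passes between the points rather than through all of them.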
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
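Although these formulas are not required for the course, a short Python sketch can show what "least squares" means in practice. This is an illustration under stated assumptions: it uses the standard textbook formulas for the slope and intercept of a least-squares line, and the function name `sum_squared_residuals` is made up here. It then compares the least-squares line with two slightly different lines.

```python
t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2
               for t, p in zip(t_values, p_values))

n = len(t_values)
t_mean = sum(t_values) / n
p_mean = sum(p_values) / n

# Standard least-squares formulas for a straight line
slope = (sum((t - t_mean) * (p - p_mean) for t, p in zip(t_values, p_values))
         / sum((t - t_mean) ** 2 for t in t_values))
intercept = p_mean - slope * t_mean

best = sum_squared_residuals(slope, intercept)
steeper = sum_squared_residuals(slope + 0.01, intercept)   # a slightly steeper line
higher = sum_squared_residuals(slope, intercept + 0.05)    # a slightly higher line

print(f"least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals: best = {best:.4f}, "
      f"steeper = {steeper:.4f}, higher = {higher:.4f}")
```

Any other line gives a larger sum of squared residuals than the least-squares line, which is the sense in which that line is the "best fit".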
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
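The sign of r can be illustrated with a short Python sketch. It uses a standard formula for the correlation coefficient; the function name `correlation` is made up here. The second data set is hypothetical and is not from the table: it simply lists the same prices in reverse order so that the trend slopes downward.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r between two lists of numbers."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # price in millions of $

r_rising = correlation(t_values, p_values)

# Hypothetical data for illustration only: the same prices in reverse order
r_falling = correlation(t_values, p_values[::-1])

print(f"rising prices:  r = {r_rising:.4f}")
print(f"falling prices: r = {r_falling:.4f}")
```

The two values have the same size but opposite signs: the fit is equally good in both cases, and only the direction of the slope changes.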
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us basically the same thing about
how good the fit is (though not whether the line slopes up or
down), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
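As a rough check of how r and r squared are related, the Python sketch below computes r squared for the apartment data in two ways. It is an illustration only and assumes the NumPy library is installed; the names `ss_res` and `ss_tot` are chosen here. For a straight-line least-squares fit, one minus the ratio of the squared residuals to the total squared variation in p should equal the square of r.

```python
import numpy as np

t = np.array([0, 2, 4, 6, 8, 10], dtype=float)        # years since 1994
p = np.array([0.38, 0.40, 0.60, 0.95, 1.20, 1.60])    # price in millions of $

# Least-squares line and its predictions
slope, intercept = np.polyfit(t, p, 1)
predicted = slope * t + intercept

ss_res = np.sum((p - predicted) ** 2)   # squared residuals around the line
ss_tot = np.sum((p - p.mean()) ** 2)    # squared variation around the average price
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(t, p)[0, 1]

print(f"r squared from residuals: {r_squared:.4f}")
print(f"r times r:                {r * r:.4f}")
```

Both numbers should come out to roughly 0.95, which is close to 1 and matches the good linear fit seen in the graph.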
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of points lying close to a straight line, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 10
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
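The two calculations above can be reproduced with a few lines of Python. This is a sketch under simple assumptions: the function name `predict_price` and the constant `BASE_YEAR` are hypothetical, and the actual price for 2000 is the $950,000 quoted from the table.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price(year):
    """Predicted price in millions of dollars from p = 0.1264 t + 0.2229."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


price_2008 = predict_price(2008)  # t = 14, outside the data range
price_2000 = predict_price(2000)  # t = 6, inside the data range
actual_2000 = 0.95                # from the table, in millions of dollars
gap_2000 = price_2000 - actual_2000

print(f"Predicted 2008 price: {price_2008:.4f} million")
print(f"Predicted 2000 price: {price_2000:.4f} million")
print(f"Model minus actual for 2000: {gap_2000:.4f} million")
```

For 2000 the model is about $31,300 above the price in the table, which is small compared with a price of roughly $950,000.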
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; horizontal axis: time t in years since 1994; vertical axis: price p in millions of $]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
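To make the idea concrete, here is a small Python sketch that computes the residual at each of the six data points from the table. It assumes the usual sign convention (actual value minus the value on the line), so a positive residual means the point lies above the line. The variable names are illustrative only.

```python
# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def line(t):
    """Value of the regression line p = 0.1264 t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus the price on the regression line.
residuals = [p - line(t) for t, p in points]

for (t, p), r in zip(points, residuals):
    position = "above" if r > 0 else "below"
    print(f"t={t:2d}: actual={p:.2f}, line={line(t):.4f}, "
          f"residual={r:+.4f} ({position} the line)")
```

Two of the points sit above the line and four sit below it, and none of the residuals is larger than about 0.16 million dollars in size.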
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
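For the curious, the least-squares calculation can be sketched in a few lines of Python. The course PDF is not reproduced here, so this uses the standard textbook formulas for a line fitted to one input variable. The function names are made up for this example. The sketch squares each residual before adding, so that positive and negative residuals cannot cancel each other out.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sums of products of deviations from the means.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical distances from the points to the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(points)
best = sum_squared_residuals(points, slope, intercept)

# Any other line should do worse. Try one slightly steeper and one slightly higher.
steeper = sum_squared_residuals(points, slope + 0.01, intercept)
higher = sum_squared_residuals(points, slope, intercept + 0.01)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"sum of squared residuals: best={best:.5f}, "
      f"steeper={steeper:.5f}, higher={higher:.5f}")
```

The slope and intercept round to 0.1264 and 0.2229, matching the equation shown on the graph, and nudging the line in either direction makes the sum of squared residuals larger.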
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
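As an illustration, the correlation coefficient for the apartment data can be computed directly. The sketch below uses the standard formula for r and is not taken from the course PDF. The function name `correlation` and the two small test data sets are made up for this example.

```python
import math


def correlation(points):
    """Correlation coefficient r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Apartment data from the table (t in years since 1994, p in millions of $).
apartment_points = [(0, 0.38), (2, 0.40), (4, 0.60),
                    (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(apartment_points)

# Points lying exactly on a rising line and on a falling line.
r_rising = correlation([(0, 1), (1, 3), (2, 5)])
r_falling = correlation([(0, 5), (1, 3), (2, 1)])

print(f"r for the apartment data: {r:.4f}")
print(f"r for a perfect rising line: {r_rising:.1f}")
print(f"r for a perfect falling line: {r_falling:.1f}")
```

The apartment data give a value of r very close to the one shown on the graph, and the two perfect lines give the extreme values 1 and -1.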
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us the same thing about how good
the fit is, but not whether the slope is positive or negative,
since it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
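One common way to compute r squared compares the squared residuals with how much the prices vary overall. The Python sketch below applies that idea to the apartment data, using the regression equation from the slides. The variable names are hypothetical, and the formula is the standard one rather than something quoted from the course PDF.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression equation from the slides.
SLOPE = 0.1264
INTERCEPT = 0.2229

mean_p = sum(p for _, p in points) / len(points)

# Squared distances from the points to the regression line.
ss_residual = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in points)

# Squared distances from the points to the average price.
ss_total = sum((p - mean_p) ** 2 for _, p in points)

# r squared: the share of the variation in price that the line accounts for.
r_squared = 1 - ss_residual / ss_total

print(f"r squared = {r_squared:.4f}")
```

The result is about 0.95, which is close to 1, so the line is a very good fit for these data. For a least-squares line with one input variable, this value is also the square of the correlation coefficient r.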
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a rising line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94 with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267 with R² = 0.0084.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 11
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):  0     2     4     6     8     10
p (millions of $):     0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
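If the data start out as calendar years rather than values of t, the conversion is a simple subtraction. The Python sketch below is only an illustration: the dictionary `prices_by_year` is a made-up name, and its years are worked out from the listed points using t=0 for 1994.

```python
BASE_YEAR = 1994  # the year that t = 0 represents

# Average price in millions of dollars, keyed by calendar year.
# The years are derived from the (t, p) points using year = 1994 + t.
prices_by_year = {
    1994: 0.38, 1996: 0.40, 1998: 0.60,
    2000: 0.95, 2002: 1.20, 2004: 1.60,
}

# Build the (t, p) points in order of increasing year.
points = [(year - BASE_YEAR, price)
          for year, price in sorted(prices_by_year.items())]

# Check that each price is higher than the one before it.
increasing = all(p1 < p2 for (_, p1), (_, p2) in zip(points, points[1:]))

print(points)
print("Prices increase as t increases:", increasing)
```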
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: time t in years since 1994; vertical axis: price p in millions of $]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
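The comparison at t=0 can be written out as a short calculation. This Python sketch is illustrative only, with made-up variable names. It shows that setting t to 0 leaves just the intercept, and it measures the gap between the model and the table in dollars.

```python
SLOPE = 0.1264
INTERCEPT = 0.2229

# At t = 0 the slope term vanishes, so the prediction is the p-intercept.
predicted_1994 = SLOPE * 0 + INTERCEPT
actual_1994 = 0.38  # from the table, in millions of dollars

# Gap between the table and the model, converted to dollars.
gap_dollars = round((actual_1994 - predicted_1994) * 1_000_000)

print(f"Predicted 1994 price: ${round(predicted_1994 * 1_000_000):,}")
print(f"Actual 1994 price:    ${round(actual_1994 * 1_000_000):,}")
print(f"Actual minus predicted: ${gap_dollars:,}")
```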
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
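To make the contrast concrete, here is a minimal Python sketch that computes the r-squared value for two small data sets. Both data sets are made up for illustration (they are not the ones plotted in the slides), and the helper name `r_squared` is an assumption of this sketch.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
near_line = [2.4, 3.1, 3.4, 4.1, 4.4]  # hypothetical data close to a line
scattered = [5, 1, 9, 1, 5]            # hypothetical data with no linear trend

print(r_squared(xs, near_line))  # close to 1
print(r_squared(xs, scattered))  # close to 0
```

The first value comes out near 1 and the second near 0, matching the two situations described above.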
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 12
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
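Consider the following table, which shows the average price p of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994): 0, 2, 4, 6, 8, 10
p (millions of $): 0.38, 0.40, 0.60, 0.95, 1.20, 1.60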
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
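As a quick check of that observation, the six points can be stored as pairs and the change in p between consecutive points computed. This is only an illustrative sketch; the names `points`, `changes`, and `always_increasing` are made up for this example.

```python
# The six (t, p) data points from the text: t in years since 1994,
# p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Change in price between consecutive data points (each step is 2 years),
# rounded to 2 decimals to hide floating-point noise.
changes = [round(p2 - p1, 2) for (_, p1), (_, p2) in zip(points, points[1:])]

# True when every price is larger than the one before it.
always_increasing = all(change > 0 for change in changes)

print(changes)
print(always_increasing)
```

Every change is positive, so p does increase with t, but the changes are not all equal, so the points cannot all lie on one straight line.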
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
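The comparison above can be written out as a short Python sketch. It uses the regression equation and the table value quoted in the text; the names `predict_price` and `gap_dollars` are hypothetical and chosen only for this illustration.

```python
# Regression equation from the text: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


predicted_1994 = predict_price(0)  # at t = 0 the prediction is the p-intercept
actual_1994 = 0.38                 # table value for t = 0

# Gap between the table value and the model's prediction, in dollars.
gap_dollars = round((actual_1994 - predicted_1994) * 1_000_000)

print(predicted_1994)
print(gap_dollars)
```

The gap is the amount by which the model misses the 1994 data point; the model is not expected to reproduce any single point exactly.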
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
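As an illustration, the residual at each of the six data points can be computed from the regression equation given earlier. This is a sketch under one stated assumption: the residual is taken as the actual value minus the predicted value, so a positive residual means the point lies above the line. The variable names are hypothetical.

```python
# Data points (t, p) and regression equation p = 0.1264t + 0.2229 from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE = 0.1264
INTERCEPT = 0.2229

# Residual = actual p minus the p predicted by the line (assumed sign convention).
residuals = {t: round(p - (SLOPE * t + INTERCEPT), 4) for t, p in points}

for t, residual in residuals.items():
    print(t, residual)
```

Some residuals are positive and some are negative, which reflects the line passing between the points rather than through all of them.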
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
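For readers who are curious, here is a minimal Python sketch of the least-squares calculation for a straight line, applied to the apartment data. It uses the standard closed-form formulas for the slope and intercept; the PDF may present the method differently, and the function name `least_squares_line` is an assumption of this sketch.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


# Data points (t, p) from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(points)

print(round(slope, 4), round(intercept, 4))
```

Rounded to four decimal places, the result agrees with the equation p = 0.1264t + 0.2229 shown on the graph.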
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
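The sketch below shows one way to compute the correlation coefficient for the apartment data in Python, using the usual Pearson formula. The function name `correlation` and the `flipped` data set are hypothetical and exist only for this illustration.

```python
import math


def correlation(points):
    """Pearson correlation coefficient r for a list of (t, p) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)


# Data points (t, p) from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)

# Negating every price turns the upward trend into a downward one,
# so r keeps its size but changes sign.
flipped = [(t, -p) for t, p in points]
r_flipped = correlation(flipped)

print(r, r_flipped)
```

For the apartment data, r comes out at about 0.973, in line with the value shown on the graph, and the flipped data give the same size with a negative sign.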
On a chart trendline, Excel does not display the value of r; instead, it displays the value
of r squared.
The r-squared value basically tells us the same thing, but
because a square cannot be negative, it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
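Before looking at those examples, here is a short Python sketch of how an r-squared value can be computed directly from the residuals of the apartment data. It assumes the common definition of r-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares; for a straight-line fit this equals the square of r. The variable names are hypothetical.

```python
# Data points (t, p) from the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares slope and intercept (unrounded).
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t

# Squared residuals around the line, and squared deviations around the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

The result is about 0.948, which is what squaring r = 0.9734 gives, so the apartment data count as a case where r-squared is close to 1.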
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 13
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
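Consider the following table, which shows the average price p of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994): 0, 2, 4, 6, 8, 10
p (millions of $): 0.38, 0.40, 0.60, 0.95, 1.20, 1.60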
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it overestimates the
actual price by about $31,300, or roughly 3%.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing about
how good the fit is (though not whether the line slopes up or
down), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8. Trendline shown in the final view: y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20. Trendline shown in the final view: y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 14
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
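One way to see this interpretation of the slope is to evaluate the regression equation at t and at t+1 and subtract. The short sketch below does that for a few starting values; the function name predicted_price is our own, chosen for illustration.

```python
def predicted_price(t):
    """Price in millions of dollars predicted by p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Increasing t by 1 changes the prediction by the slope,
# no matter which value of t we start from.
starts = (0, 3, 7.5)
changes = [predicted_price(start + 1) - predicted_price(start) for start in starts]

for start, change in zip(starts, changes):
    print(f"t from {start} to {start + 1}: p changes by {change:.4f}")
```

Each printed change should be 0.1264 (up to floating-point rounding), which is the slope.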
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it overestimates the
actual price by about $31,300, or roughly 3%.
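The two calculations above can be repeated for any year by wrapping the regression equation in a function. This is a small sketch, not part of the course materials; the helper names predicted_price and years_since_1994 are hypothetical.

```python
def predicted_price(t):
    """Regression model from the slides: price in millions of dollars."""
    return 0.1264 * t + 0.2229

def years_since_1994(year):
    """Convert a calendar year to the model's t (t=0 is 1994)."""
    return year - 1994

price_2008 = predicted_price(years_since_1994(2008))
price_2000 = predicted_price(years_since_1994(2000))
actual_2000 = 0.95  # from the table, in millions of dollars

print(f"2008 forecast: {price_2008:.4f} million")
print(f"2000 model:    {price_2000:.4f} million (actual {actual_2000:.2f})")
print(f"2000 error:    {price_2000 - actual_2000:+.4f} million")
```

The year 2008 lies outside the 1994 to 2004 data, so that value is a forecast that assumes the trend continued; the year 2000 value can be compared directly with the table.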
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of the six data points with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
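To make the idea of a residual concrete, the sketch below computes one for each of the six data points from the table. It assumes the usual sign convention (actual value minus predicted value), which the slides do not state explicitly; the names are illustrative.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted_price(t):
    """Value on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Residual = actual price minus the price on the line at the same t.
residuals = [p - predicted_price(t) for t, p in zip(t_values, prices)]

for t, p, res in zip(t_values, prices, residuals):
    print(f"t={t:2d}  actual={p:.2f}  on line={predicted_price(t):.4f}  residual={res:+.4f}")
```

Points above the line have positive residuals and points below it have negative ones, so the residuals largely cancel when added together. This is one reason the size of each residual matters more than their plain sum.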
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
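For readers who are curious, here is a rough sketch of what least squares does with our six data points. It uses the standard textbook formulas for a line (slope = sum of (t - mean t)(p - mean p) divided by sum of (t - mean t) squared, and intercept = mean p - slope times mean t); these are an assumption on our part and are not necessarily written the same way in the course PDF.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
t_mean = sum(t_values) / n
p_mean = sum(prices) / n

# Textbook least-squares formulas for a line.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(t_values, prices))
s_tt = sum((t - t_mean) ** 2 for t in t_values)
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

def sum_squared_residuals(m, b):
    """Sum of squared vertical distances from the data to the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in zip(t_values, prices))

best = sum_squared_residuals(slope, intercept)

# Nudging the slope or intercept should only make the sum larger.
nudged = [
    sum_squared_residuals(slope + dm, intercept + db)
    for dm in (-0.01, 0.0, 0.01)
    for db in (-0.05, 0.0, 0.05)
    if (dm, db) != (0.0, 0.0)
]

print(f"p = {slope:.4f}t + {intercept:.4f}")
print("every nudged line is worse:", all(best < other for other in nudged))
```

The fitted slope and intercept should round to 0.1264 and 0.2229, matching the equation Excel reports.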
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
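The three cases can be illustrated with a small sketch that computes r from the standard Pearson formula. The data sets named rising, falling, and scattered are made up for illustration and do not come from the slides; only the apartment prices are from the table.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [0, 2, 4, 6, 8, 10]
rising = [1 + 0.5 * x for x in xs]     # made-up points exactly on a rising line
falling = [9 - 0.5 * x for x in xs]    # made-up points exactly on a falling line
scattered = [5, 1, 6, 6, 1, 5]         # made-up points with no linear trend
apartment = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]  # prices from the table

r_rising = correlation_coefficient(xs, rising)
r_falling = correlation_coefficient(xs, falling)
r_scattered = correlation_coefficient(xs, scattered)
r_apartment = correlation_coefficient(xs, apartment)

for name, r in [("rising", r_rising), ("falling", r_falling),
                ("scattered", r_scattered), ("apartment", r_apartment)]:
    print(f"{name:10s} r = {r:+.4f}")
```

The made-up rising and falling sets give r at the two extremes, the scattered set gives r at 0, and the apartment data give a value close to 1.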
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing about
how good the fit is (though not whether the line slopes up or
down), and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
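For the apartment data, the r-squared value can be estimated directly from the residuals. The sketch below assumes the common definition for a least-squares line, r squared = 1 - (sum of squared residuals) / (sum of squared deviations of p from its mean), and uses the rounded coefficients from the slides, so the result is approximate.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted_price(t):
    """Regression line from the slides (rounded coefficients)."""
    return 0.1264 * t + 0.2229

p_mean = sum(prices) / len(prices)

# Variation left over after fitting the line.
ss_residual = sum((p - predicted_price(t)) ** 2 for t, p in zip(t_values, prices))
# Total variation of the prices around their mean.
ss_total = sum((p - p_mean) ** 2 for p in prices)

r_squared = 1 - ss_residual / ss_total
print(f"r squared is about {r_squared:.3f}")
```

The result should be close to the square of r = 0.9734, that is, about 0.95, which is near 1 and consistent with a good linear fit.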
Consider the following set of data points.
[Figure: scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8. Trendline shown in the final view: y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20. Trendline shown in the final view: y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 15
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: Price p in millions of $ (vertical axis, 0 to 1.8) versus Time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Scatter plot with fitted line: Price p in millions of $ versus Time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994; line labeled p = 0.1264t + 0.2229, r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
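For readers who want to check the numbers outside of Excel, here is a minimal Python sketch. It is not the course's method; it assumes the standard least-squares formulas for the slope, intercept, and correlation coefficient, and the function names fit_line and correlation are made up for this illustration. The six data points are the ones listed above.

```python
import math

# The six (t, p) points from the table:
# t = years since 1994, p = average price in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correlation(xs, ys):
    """Return the correlation coefficient r for the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


slope, intercept = fit_line(ts, ps)
r = correlation(ts, ps)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept round to the values shown on the graph, and r agrees with the value on the graph to about three decimal places.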
What does the regression equation tell us about the
relationship between time and sale price?
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994; line labeled p = 0.1264t + 0.2229, r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the average price of a two-bedroom apartment to have been
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
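The two calculations above can also be checked with a short Python sketch. It assumes the rounded equation p = 0.1264t + 0.2229 from the graph; the function name predict_price is made up for this illustration.

```python
def predict_price(t):
    """Price in millions of dollars from the rounded equation p = 0.1264t + 0.2229.

    t is the number of years since 1994.
    """
    return 0.1264 * t + 0.2229


# Forecast for 2008 (t = 14), which is outside the range of the table.
p_2008 = predict_price(2008 - 1994)

# Check against a year that is in the table: 2000 (t = 6), actual price 0.95.
p_2000 = predict_price(2000 - 1994)
actual_2000 = 0.95
gap_2000 = p_2000 - actual_2000

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${gap_2000 * 1_000_000:,.0f}")
```

The last line shows how far the model is from the table value for 2000, which is what "pretty close" means here.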
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Scatter plot with regression line: Price p in millions of $ versus Time t in years since 1994]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
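To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six data points. It assumes the rounded equation p = 0.1264t + 0.2229 and the usual sign convention (actual value minus predicted value); the variable names are made up for this illustration.

```python
# The six (t, p) points from the table:
# t = years since 1994, p = average price in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

# Rounded regression equation taken from the graph.
SLOPE = 0.1264
INTERCEPT = 0.2229

# Residual = actual price minus the price the line predicts at the same t.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in zip(ts, ps)]

for t, p, res in zip(ts, ps, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, residual {res:+.4f} (point is {position} the line)")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.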
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
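You will not need the least-squares formulas, but a quick numerical check can show what "best fit" means. The Python sketch below compares the sum of squared residuals for the regression line with the same sum for two other lines. The two comparison lines are chosen here only for illustration (one passes through the first and last data points, the other is the regression line shifted upward), and the function name is made up. Squaring keeps positive and negative residuals from cancelling each other out.

```python
# The six (t, p) points from the table:
# t = years since 1994, p = average price in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Add up the squared residuals of the line p = slope * t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))


# Regression line from the graph (rounded coefficients).
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# Illustrative alternative 1: the line through the first and last points.
endpoint_slope = (1.60 - 0.38) / (10 - 0)
sse_endpoints = sum_squared_residuals(endpoint_slope, 0.38)

# Illustrative alternative 2: same slope as the regression line, shifted up.
sse_shifted = sum_squared_residuals(0.1264, 0.30)

print(f"regression line:     {sse_regression:.4f}")
print(f"line through ends:   {sse_endpoints:.4f}")
print(f"shifted-up line:     {sse_shifted:.4f}")
```

The regression line gives the smallest of the three totals. These two comparisons do not prove it is the best of all possible lines; that is what the least-squares formulas guarantee.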
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us the same thing about how good
the fit is, but it will only be between 0 and 1, so it does
not show whether the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
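As a rough check on how r and r squared are related, the following Python sketch computes both for the apartment data. It assumes the standard least-squares formulas, where r squared can also be written as 1 minus the sum of squared residuals divided by the total squared variation in p. The variable names are made up for this illustration.

```python
# The six (t, p) points from the table:
# t = years since 1994, p = average price in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(ts)
mean_t = sum(ts) / n
mean_p = sum(ps) / n

sxx = sum((t - mean_t) ** 2 for t in ts)           # variation in t
syy = sum((p - mean_p) ** 2 for p in ps)           # total variation in p
sxy = sum((t - mean_t) * (p - mean_p) for t, p in zip(ts, ps))

# Least-squares line.
slope = sxy / sxx
intercept = mean_p - slope * mean_t

# Sum of squared residuals around that line.
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))

r = sxy / (sxx * syy) ** 0.5
r_squared = 1 - sse / syy

print(f"r         = {r:.4f}")
print(f"r squared = {r_squared:.4f}")
print(f"r * r     = {r * r:.4f}")
```

The last two printed values agree, and both are close to 1, which matches the good linear fit seen on the graph.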
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8; trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20; trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 16
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300.
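The two calculations above can be reproduced with a short Python sketch. It assumes the regression equation p = 0.1264t + 0.2229 and the table value of $0.95 million for 2000; the helper name predict_for_year is hypothetical and used only for illustration.

```python
# Regression equation from the presentation, with t = 0 meaning 1994
# and p measured in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994


def predict_for_year(year):
    """Convert a calendar year to t and evaluate the regression equation."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


p_2008 = predict_for_year(2008)   # t = 14, outside the data range
p_2000 = predict_for_year(2000)   # t = 6, inside the data range
actual_2000 = 0.95                # price from the table, in millions of $

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million (actual: {actual_2000:.2f} million)")
print(f"2000 prediction minus actual: {p_2000 - actual_2000:.4f} million")
```

The 2008 value is an extrapolation beyond the years in the table, so it depends on the trend continuing, while the 2000 value can be checked directly against the data.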
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
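To make the idea of a residual concrete, here is a minimal Python sketch that computes the residual for each of the six data points in this example. It assumes the data points (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60) and the regression equation p = 0.1264t + 0.2229 from this presentation. The sign convention (actual price minus predicted price) is an assumption on our part; the presentation only describes the residual as a vertical distance.

```python
# Data points (t, p) from the presentation: t in years since 1994,
# p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression equation from the presentation: p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Actual price minus the price predicted by the regression line."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in DATA]

for (t, p), res in zip(DATA, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual={p:.2f}, residual={res:+.4f} ({position} the line)")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.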
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
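For the curious, here is a minimal Python sketch of the least-squares calculation for the apartment data. It uses the standard closed-form formulas for the slope and intercept of a least-squares line; the PDF may present them differently, and the function name least_squares_line is our own. It is not required for the course.

```python
# Data points (t, p) from the presentation: t in years since 1994,
# p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through the point of means.
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = least_squares_line(DATA)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the equation p = 0.1264t + 0.2229 shown on the graph.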
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, meaning there is little or no linear relationship
between the variables in this case.
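As an illustration, the correlation coefficient for the apartment data can be computed with a few lines of Python. This sketch uses the standard formula for the (Pearson) correlation coefficient, which we assume is the r reported on the graph; the function name correlation is our own.

```python
import math

# Data points (t, p) from the presentation: t in years since 1994,
# p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Correlation coefficient r of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(DATA)
print(f"r = {r:.3f}")
```

The result is positive and close to 1, in line with the value of about 0.973 shown on the graph, which matches what we see: the prices rise along a nearly straight line.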
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not tell us whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
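One common way to compute the r-squared value, sketched below in Python for the apartment data, is to compare the squared residuals around the regression line with the squared deviations around the average price. We assume this is the quantity Excel reports as R²; the function name r_squared is our own. For a straight-line fit like this one, the result is the same as squaring the correlation coefficient r.

```python
# Data points (t, p) from the presentation: t in years since 1994,
# p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def r_squared(points):
    """R-squared of the least-squares line through a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Least-squares slope and intercept.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Squared residuals around the line vs. squared deviations around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


value = r_squared(DATA)
print(f"R^2 = {value:.3f}")
```

The value comes out close to 1, which agrees with the good linear fit seen on the graph.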
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 17
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
Slide 18
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
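The six points above can also be held in a small Python list, which makes it easy to recover the calendar year from t and to check the upward trend. This is only a minimal sketch; the names `points` and `BASE_YEAR` are illustrative and are not part of the course material.

```python
# Data points (t, p): t = years since 1994, p = average price in millions of $.
BASE_YEAR = 1994
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Print a small table: calendar year, t, and price in dollars.
print(f"{'year':>4} {'t':>2} {'price ($)':>10}")
for t, p in points:
    print(f"{BASE_YEAR + t:>4} {t:>2} {p * 1_000_000:>10,.0f}")

# The prices rise with t, as noted in the text.
prices = [p for _, p in points]
is_increasing = all(a < b for a, b in zip(prices, prices[1:]))
print("prices increase with t:", is_increasing)
```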
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Graph: the scatter plot of price p against time t, with the fitted line drawn through the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
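For readers who want to see where these two numbers come from without Excel, the following is a minimal Python sketch that applies the standard least-squares formulas to the six data points. The function name `fit_line` is an illustrative choice and is not something from the course material.

```python
from math import sqrt

def fit_line(ts, ps):
    """Return (slope, intercept, r) for the least-squares line p = slope*t + intercept."""
    n = len(ts)
    t_mean = sum(ts) / n
    p_mean = sum(ps) / n
    # Sums of squared deviations and cross-deviations from the means.
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    s_pp = sum((p - p_mean) ** 2 for p in ps)
    s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(ts, ps))
    slope = s_tp / s_tt
    intercept = p_mean - slope * t_mean
    r = s_tp / sqrt(s_tt * s_pp)
    return slope, intercept, r

ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
slope, intercept, r = fit_line(ts, ps)
print(f"p = {slope:.4f}t + {intercept:.4f}, r = {r:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph, and r comes out at about 0.973.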
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the scatter plot and fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
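The "per year" reading of the slope can be checked numerically. The sketch below is only an illustration: the helper `predict` and the constants `SLOPE` and `INTERCEPT` are names chosen here, with values copied from the regression equation in the text.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars

def predict(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT

# Moving forward by one year changes the prediction by the slope,
# no matter which year we start from.
for t in (0, 3, 7):
    change = predict(t + 1) - predict(t)
    print(f"t={t} -> t={t + 1}: change of about ${change * 1_000_000:,.0f}")
```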
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
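The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch: the function `predict` and the dictionary `actual` are illustrative names, and the coefficients and data values are the ones given in the text.

```python
def predict(t):
    """Predicted price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229

# Actual prices from the data points, keyed by t.
actual = {0: 0.38, 2: 0.40, 4: 0.60, 6: 0.95, 8: 1.20, 10: 1.60}

p_2008 = predict(2008 - 1994)  # t = 14, outside the range of the data
p_2000 = predict(2000 - 1994)  # t = 6, inside the range of the data
gap_2000 = p_2000 - actual[6]  # how far the model is from the table value

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${gap_2000 * 1_000_000:,.0f}")
```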
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot of price p in millions of $ against time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Graph: the scatter plot of price p in millions of $ against time t in years since 1994, with the regression line]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
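To make the idea concrete, the residual for each of the six apartment data points can be computed directly. The sketch below assumes the common sign convention "actual value minus predicted value", which the slides do not spell out; the names `points`, `predict`, and `residuals` are illustrative.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predict(t):
    """Value on the regression line p = 0.1264t + 0.2229 (millions of dollars)."""
    return 0.1264 * t + 0.2229

# Residual = actual p minus the p on the line at the same t (assumed convention).
# Positive means the point lies above the line, negative means below.
residuals = [(t, p - predict(t)) for t, p in points]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t={t:>2}: residual {res:+.4f} (point is {side} the line)")
```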
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
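For the curious, the effect of the least-squares method can be seen by comparing the regression line with some other line. In the sketch below, the alternative line through the first and last data points is a hypothetical choice made only for comparison, and the function name `sum_squared_residuals` is illustrative.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical gaps between the data and the line p = slope*t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)

# Regression line from the text.
best = sum_squared_residuals(0.1264, 0.2229)

# Hypothetical alternative: the line through the first and last data points.
alt_slope = (1.60 - 0.38) / (10 - 0)
alt = sum_squared_residuals(alt_slope, 0.38)

# Raw (unsquared) residuals of the regression line: positives and negatives nearly cancel.
raw_sum = sum(p - (0.1264 * t + 0.2229) for t, p in points)

print(f"regression line, sum of squared residuals:  {best:.4f}")
print(f"alternative line, sum of squared residuals: {alt:.4f}")
print(f"regression line, sum of raw residuals:      {raw_sum:.4f}")
```

The raw residuals nearly cancel because some are positive and some are negative, which is why the method works with their squares instead.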
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel’s trendline does not display the value of r; instead, it
displays the value of r squared.
The r-squared value tells us how well the line fits, just as r
does, but it is always between 0 and 1, so it does not show
whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit to the data points.
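As a rough illustration of how r squared relates to the residuals, the sketch below computes it for the apartment data as one minus the ratio of the leftover variation to the total variation. This is a standard identity for a least-squares line with an intercept, not a description of how Excel computes the value internally; all variable names are illustrative.

```python
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(ts)
t_mean = sum(ts) / n
p_mean = sum(ps) / n

# Least-squares slope and intercept.
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in zip(ts, ps))
s_tt = sum((t - t_mean) ** 2 for t in ts)
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

# Variation left over after fitting the line, and total variation about the mean.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))
ss_tot = sum((p - p_mean) ** 2 for p in ps)

r_squared = 1 - ss_res / ss_tot
# r carries the sign of the slope; r squared does not.
r = r_squared ** 0.5 if slope >= 0 else -(r_squared ** 0.5)
print(f"r squared = {r_squared:.4f}, r = {r:.4f}")
```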
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot with fitted line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot with fitted line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
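The same contrast can be reproduced with a small Python sketch. The two data sets below are made up purely for illustration and are not the points behind the two charts above; the function name `r_squared` is also an illustrative choice.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

xs = [0, 1, 2, 3, 4]
nearly_linear = [1.1, 1.9, 3.2, 3.9, 5.0]  # hypothetical: rises steadily with x
scattered = [5, 1, 6, 1, 5]                # hypothetical: no upward or downward trend

high = r_squared(xs, nearly_linear)
low = r_squared(xs, scattered)
print(f"nearly linear data: r squared = {high:.4f}")
print(f"scattered data:     r squared = {low:.4f}")
```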
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 19
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical gap between the data point and the line
is the residual: the actual value minus the value that the line predicts.
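To make the idea concrete, the sketch below computes the residual at each of the six data points from the table, taking a residual to be the actual value minus the value on the line (the usual sign convention, assumed here). The list name `points` and the helper `line` are our own.

```python
# The six (t, p) data points from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def line(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value on the line (assumed sign convention).
residuals = [p - line(t) for t, p in points]

for (t, p), r in zip(points, residuals):
    side = "above" if r > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, line {line(t):.4f}, residual {r:+.4f} ({side} the line)")
```

Some residuals come out positive and some negative, which is why simply adding them up is not a useful measure of how far the points are from the line.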
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. (The residuals are squared so that
positive and negative ones cannot cancel each other out.)
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
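For the curious, here is a minimal sketch of a least-squares fit for a single variable, using the standard closed-form formulas for the slope and intercept. It is not taken from the course PDF, which is not reproduced here, and the function names are our own. Applied to the six data points from the table, it should reproduce the coefficients shown on the graph.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Slope = sum of (t - mean_t)(p - mean_p) divided by sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through the point of means.
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of squared vertical gaps between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line, such as one with a slightly steeper slope, gives a larger sum.
best = sum_squared_residuals(points, slope, intercept)
other = sum_squared_residuals(points, slope + 0.01, intercept)
print(f"best line: {best:.4f}, steeper line: {other:.4f}")
```

The last two lines illustrate what "best fit" means here: nudging the slope away from the least-squares value makes the sum of squared residuals larger.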
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, meaning there is little or no linear relationship between
the variables in this case.
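As a rough illustration, the correlation coefficient for the apartment data can be computed directly with the standard Pearson formula. The helper name `correlation` is ours, and the data are the six points from the table. The result should land close to the r shown on the graph; the last digit can differ slightly because of rounding.

```python
import math


def correlation(points):
    """Pearson correlation coefficient r of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)
print(f"r = {r:.4f}")

# Flipping the sign of every price reverses the trend, so r changes sign too.
falling = [(t, -p) for t, p in points]
print(f"r for the reversed trend = {correlation(falling):.4f}")
```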
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about the quality of the fit, but it
will only be between 0 and 1, so it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
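One way to see why r squared measures fit is to compute it as the share of the variation in p that the line accounts for: 1 minus the sum of squared residuals divided by the total squared variation about the mean. The sketch below does this for the apartment data using the rounded coefficients from the graph. The name `r_squared` is our own, and this is the standard textbook formula rather than anything specific to Excel.

```python
def r_squared(points, slope, intercept):
    """R^2 = 1 - (sum of squared residuals) / (total squared variation about the mean)."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Rounded coefficients of the regression line from the graph.
fit = r_squared(points, 0.1264, 0.2229)
print(f"R^2 for the regression line: {fit:.4f}")

# A flat line at the mean price explains none of the variation, so R^2 is 0.
mean_p = sum(p for _, p in points) / len(points)
print(f"R^2 for a flat line at the mean: {r_squared(points, 0.0, mean_p):.4f}")
```

For a straight-line fit like this one, the value agrees with the square of the correlation coefficient r, up to rounding.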
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points. Vertical axis from 0 to 8; horizontal axis from 0 to 12.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points. Vertical axis from 0 to 20; horizontal axis from 0 to 12.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 20
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year   t    Price p (millions of $)
1994   0    0.38
1996   2    0.40
1998   4    0.60
2000   6    0.95
2002   8    1.20
2004   10   1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
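The shift from calendar years to t can be sketched in Python. The dictionary below re-enters the six table values implied by the points listed above; its name, `price_by_year`, is illustrative.

```python
# Average prices by year, in millions of $ (the values behind the six points).
price_by_year = {1994: 0.38, 1996: 0.40, 1998: 0.60, 2000: 0.95, 2002: 1.20, 2004: 1.60}

# Measure time in years since 1994, so that t = 0 represents 1994.
points = [(year - 1994, p) for year, p in price_by_year.items()]
print(points)
```

Working with t instead of the calendar year keeps the numbers small and gives the vertical intercept a direct meaning: it is the model's value for 1994.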
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the same graph, now also labeled with r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \), the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
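A quick numerical check of this interpretation: for a linear model, the predicted price rises by the same amount between any two consecutive years, and that amount is the slope. In the sketch below, the function name `price` is our own shorthand for the regression equation.

```python
def price(t):
    """Regression equation p = 0.1264t + 0.2229 (p in millions of $, t in years since 1994)."""
    return 0.1264 * t + 0.2229


# The year-over-year increase is the same wherever we start.
for t in (0, 5, 10):
    increase = price(t + 1) - price(t)
    print(f"from t = {t} to t = {t + 1}: increase of ${increase * 1_000_000:,.0f}")
```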
Slide 21
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price p
(in millions of dollars) of a two-bedroom apartment in
downtown New York City from 1994 to 2004, where t=0
represents 1994.
Year   t    p
1994   0    0.38
1996   2    0.40
1998   4    0.60
2000   6    0.95
2002   8    1.20
2004   10   1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229.
We can also get what’s called the correlation coefficient:
r = 0.9734.
[Figure: the scatter plot and line of best fit, labeled with the equation and the value of r.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
slope = (change in y)/(change in x) = Δy/Δx.
In this case we are using p and t, so it’s Δp/Δt.
So for our problem, we have Δp/Δt = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
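The slope interpretation can be checked numerically. The following Python sketch is only an illustration: the function name predicted_price and the starting values of t are chosen arbitrarily, and the coefficients are the rounded values from the regression equation in the text. It shows that moving one year forward raises the predicted price by the slope, 0.1264, no matter which year we start from.

```python
def predicted_price(t):
    """Regression equation from the text: price in millions of $, t years after 1994."""
    return 0.1264 * t + 0.2229


# For a line, the one-year increase is the same no matter which year we start from.
yearly_increases = []
for t in [0, 5, 9]:  # arbitrary starting years, chosen only for illustration
    increase = predicted_price(t + 1) - predicted_price(t)
    yearly_increases.append(increase)
    print(f"from t = {t} to t = {t + 1}: increase of {increase:.4f} million dollars")
```

Each printed increase is 0.1264 million dollars, which is the $126,400 per year described above.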
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
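The two calculations above can be reproduced with a short script. This is only a sketch: the names year_to_t and predict_price are made up for illustration, and the coefficients are the rounded values from the regression equation in the text.

```python
# Coefficients of the regression equation from the text: p = 0.1264t + 0.2229
SLOPE = 0.1264      # millions of $ per year
INTERCEPT = 0.2229  # millions of $ at t = 0 (the year 1994)
BASE_YEAR = 1994    # t = 0 represents 1994


def year_to_t(year):
    """Convert a calendar year to t, the number of years since 1994."""
    return year - BASE_YEAR


def predict_price(year):
    """Predicted average price, in millions of $, for the given year."""
    return SLOPE * year_to_t(year) + INTERCEPT


price_2008 = predict_price(2008)  # t = 14, outside the range of the data
price_2000 = predict_price(2000)  # t = 6, inside the range of the data
actual_2000 = 0.95                # price from the table for 2000, in millions of $

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${(price_2000 - actual_2000) * 1_000_000:,.0f}")
```

The first two printed values match the $1,992,500 and $981,300 worked out above, and the last line shows that the model overshoots the actual 2000 price by about $31,300.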
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
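To make the idea of a residual concrete, the following Python sketch computes the residual for each of the six data points, using the rounded regression equation from the text. The sign convention used here (actual value minus the value on the line) is the usual one but is an assumption, since the slides only describe the vertical distance. The name line_value is made up for illustration.

```python
# Data points (t, p) from the text: t = years since 1994, p = price in millions of $.
data_points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def line_value(t):
    """Height of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value on the line (a signed vertical distance).
residuals = [p - line_value(t) for t, p in data_points]

for (t, p), residual in zip(data_points, residuals):
    position = "above" if residual > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, line {line_value(t):.4f}, "
          f"residual {residual:+.4f} ({position} the line)")
```

Two of the points lie above the line and four lie below it, so some residuals are positive and some are negative.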
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly complicated,
you will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
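For the curious, here is a rough sketch of what the least-squares method does for our data. The slides leave the formulas to the PDF, so the formulas below are the standard textbook ones for the slope and intercept of a least-squares line (an assumption, since the PDF is not reproduced here). The names fit_line and sum_squared_residuals, and the choice of comparison line, are made up for illustration.

```python
# Data from the text: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = fit_line(t_values, p_values)
best = sum_squared_residuals(t_values, p_values, slope, intercept)

# For comparison: the line through the first and last data points.
other_slope = (p_values[-1] - p_values[0]) / (t_values[-1] - t_values[0])
other_intercept = p_values[0] - other_slope * t_values[0]
other = sum_squared_residuals(t_values, p_values, other_slope, other_intercept)

print(f"least-squares line: p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals, least-squares line: {best:.4f}")
print(f"sum of squared residuals, endpoint line:      {other:.4f}")
```

The fitted coefficients round to the 0.1264 and 0.2229 given earlier, and the line through the two endpoints has a larger sum of squared residuals than the regression line.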
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
On a chart trendline, Excel does not give the value of r;
instead, it gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
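As a rough check on how r and r squared are related, the following Python sketch computes both for the apartment price data. It uses the standard formula for the correlation coefficient, which the slides do not spell out, so treat it as an assumption. The name correlation_coefficient is made up for illustration.

```python
import math

# Data from the text: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation_coefficient(xs, ys):
    """Correlation coefficient r of the paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(t_values, p_values)
r_squared = r ** 2  # always between 0 and 1, so the sign of r is lost

print(f"r         = {r:.3f}")
print(f"r squared = {r_squared:.3f}")
```

The computed r agrees with the r = 0.9734 shown on the graph to three decimal places. Both values are close to 1, which fits the strong upward linear pattern in the data.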
Consider the following set of data points.
[Figure: scatter plot of data points lying close to a line; x-axis from 0 to 12, y-axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points; x-axis from 0 to 12, y-axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline is y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 22
Introduction to Linear Regression
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 23
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Graph: scatter plot with the line of best fit; price p in millions of $ versus time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734; price p in millions of $ versus time t in years since 1994]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
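For readers who want to check the numbers on the graph without Excel, here is a minimal Python sketch that fits a line to the six data points listed earlier. The function name fit_line and the variable names are illustrative assumptions, not part of the course materials. The formulas are the standard least-squares ones.

```python
import math

# Data points (t, p) from the slides:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared and cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.3f}")
```

Running this should give a slope near 0.1264, an intercept near 0.2229, and an r of roughly 0.973, in line with the values shown on the graph.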
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \),
the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
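The same arithmetic can be sketched in a few lines of Python. The function name predict_price is a hypothetical helper chosen for illustration. The slope and intercept are the rounded values from the regression equation above, and 0.95 is the table value for 2000.

```python
def predict_price(t, slope=0.1264, intercept=0.2229):
    """Predicted average price in millions of dollars, t years after 1994."""
    return slope * t + intercept


# Forecast outside the data range: 2008 is t = 14.
price_2008 = predict_price(2008 - 1994)

# Check against a year that is in the table: 2000 is t = 6.
price_2000 = predict_price(2000 - 1994)
actual_2000 = 0.95
error_2000 = price_2000 - actual_2000

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus table value: ${error_2000 * 1_000_000:,.0f}")
```

For 2000 the model overshoots the table value by about $31,300, which is what "pretty close" means here.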
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points and the regression line; price p in millions of $ versus time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value given by the regression
line at that point.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: the data points and the regression line; price p in millions of $ versus time t in years since 1994]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
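To make the idea concrete, the following Python sketch lists the residual at each of the six data points, using the rounded regression equation p = 0.1264t + 0.2229. Taking the residual as the actual value minus the predicted value is a common sign convention and is an assumption here, since the slides only describe it as a vertical distance. The helper name regression_line is hypothetical.

```python
# Data points (t, p) from the slides, with p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def regression_line(t):
    """Price predicted by the rounded regression equation."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (assumed sign convention).
residuals = [p - regression_line(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, predicted {regression_line(t):.4f}, "
          f"residual {res:+.4f} (point is {position} the line)")
```

Some points sit above the line and some below it, so the positive and negative residuals nearly cancel when added together.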
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
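One way to see why the regression line is preferred over "some other line" is to compare the sum of the squared residuals for two candidate lines. The Python sketch below does this for the regression line and for a hypothetical alternative, the line through the first and last data points. The alternative line and the function name sum_squared_residuals are assumptions made for illustration only.

```python
# Data points (t, p) from the slides, with p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances between the data and the line."""
    return sum((p - (slope * t + intercept)) ** 2
               for t, p in zip(t_values, p_values))


# The regression line from the slides (rounded coefficients).
regression_sse = sum_squared_residuals(0.1264, 0.2229)

# Hypothetical alternative: the line through (0, 0.38) and (10, 1.60).
endpoint_slope = (1.60 - 0.38) / (10 - 0)
endpoint_sse = sum_squared_residuals(endpoint_slope, 0.38)

print(f"regression line:     {regression_sse:.4f}")
print(f"line through ends:   {endpoint_sse:.4f}")
```

The regression line gives the smaller total. Squaring the residuals keeps positive and negative ones from cancelling each other out.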
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
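The relationship between r and r squared can be checked with a short Python sketch. It computes r for the apartment data and for a made-up falling series (the same prices in reverse order, which is hypothetical data used only for contrast). The function name correlation is an illustrative assumption. The formula is the standard one for the correlation coefficient.

```python
import math

# Data points (t, p) from the slides, with p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


r_up = correlation(t_values, p_values)

# Hypothetical falling series: the same prices in reverse order.
falling_prices = list(reversed(p_values))
r_down = correlation(t_values, falling_prices)

print(f"rising prices:  r = {r_up:.3f}, r squared = {r_up ** 2:.3f}")
print(f"falling prices: r = {r_down:.3f}, r squared = {r_down ** 2:.3f}")
```

The two series have r values of opposite sign but the same r squared, which is why r squared alone cannot say whether the line slopes up or down.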
Consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Scatter plot with trendline: y = 0.5091x + 1.94, R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Scatter plot with trendline: y = -0.183x + 8.3267, R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 24
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table, which shows the average price
p of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year    t    p (millions of $)
1994    0    0.38
1996    2    0.40
1998    4    0.60
2000    6    0.95
2002    8    1.20
2004   10    1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to
1.8) against time t in years since 1994 (horizontal axis, 0
to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Figure: scatter plot of price p against time t with the
fitted line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: scatter plot of price p in millions of $ against
time t in years since 1994, with the regression line labeled
by its equation and correlation coefficient]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
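As a rough cross-check outside Excel, the sketch below recomputes the line of best fit from the six data points using the usual least-squares formulas for slope and intercept. The function name fit_line and the variable names are illustrative assumptions, not something taken from the course materials.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept

slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation shown on the graph, p = 0.1264t + 0.2229.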
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: scatter plot with the regression line
p = 0.1264t + 0.2229, r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
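The comparison at t=0 can be sketched in a few lines. The helper name predict is an illustrative assumption; the coefficients are the rounded ones from the regression equation.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229 (p in millions of $).
def predict(t):
    return 0.1264 * t + 0.2229

predicted_1994 = predict(0)   # the p-intercept
actual_1994 = 0.38            # value from the table
gap = actual_1994 - predicted_1994

print(f"predicted: ${predicted_1994 * 1_000_000:,.0f}")
print(f"actual:    ${actual_1994 * 1_000_000:,.0f}")
print(f"gap:       ${gap * 1_000_000:,.0f}")
```

The gap of about $157,100 is how far the model misses the 1994 value. A gap like this does not mean the fit is wrong, since the line balances its misses across all six points.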
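What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\[ \frac{0.1264}{1}. \]
Recall that the definition of slope is
\[ \text{slope} = \frac{\text{change in } y}{\text{change in } x}. \]
In this case we are using p and t, so it’s
\[ \text{slope} = \frac{\text{change in } p}{\text{change in } t}. \]
So for our problem, we have
\[ \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}. \]
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.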
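To see this numerically, the short sketch below steps t up by 1 repeatedly and records how much the predicted p changes each time. The helper name predict is an illustrative assumption.

```python
# Regression equation: p = 0.1264t + 0.2229 (p in millions of $).
def predict(t):
    return 0.1264 * t + 0.2229

# Change in predicted p for each one-unit increase in t, from t=0 up to t=10.
increases = [predict(t + 1) - predict(t) for t in range(10)]

for t, inc in enumerate(increases):
    print(f"t: {t} -> {t + 1}, change in p: {inc:.4f}")
```

Every step gives the same change, 0.1264 to four decimal places, which is what it means for a line to have a constant slope.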
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by $31,300.
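Both calculations can be written as one small function that converts a calendar year to t and then applies the regression equation. The function name predict_price is an illustrative assumption.

```python
def predict_price(year):
    """Predicted average price in millions of $, using p = 0.1264t + 0.2229."""
    t = year - 1994           # t = 0 represents 1994
    return 0.1264 * t + 0.2229

p_2008 = predict_price(2008)  # outside the table: a forecast
p_2000 = predict_price(2000)  # inside the table: can be checked
error_2000 = p_2000 - 0.95    # table value for 2000 is 0.95

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million (off by {error_2000:.4f})")
```

The 2000 prediction can be checked against the table, but the 2008 prediction cannot. It rests entirely on the assumption that the 1994 to 2004 trend continued.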
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p in millions of $ against
time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
data point and the value the regression line predicts there.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the nearby
part of the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
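To make the idea concrete, the sketch below computes the residual at each of the six data points, using the rounded regression equation p = 0.1264t + 0.2229. The sign convention (actual minus predicted) and the helper name predict are assumptions made for illustration.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predict(t):
    return 0.1264 * t + 0.2229

# Residual = actual value minus the value on the regression line.
residuals = [p - predict(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predict(t):.4f}  residual={res:+.4f}")
```

With this convention, a positive residual means the data point lies above the line and a negative residual means it lies below.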
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
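A small experiment can illustrate what "least squares" means. The sketch below compares the sum of squared residuals for the regression line with that of a hypothetical alternative, the line through the first and last data points. The alternative line and the function name are assumptions chosen only for comparison.

```python
# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical distances from each point to the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)

# Regression line from the graph (rounded coefficients).
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# Hypothetical alternative: the line through (0, 0.38) and (10, 1.60).
alt_slope = (1.60 - 0.38) / (10 - 0)
sse_alternative = sum_squared_residuals(alt_slope, 0.38)

# Plain (unsquared) residuals for the regression line, for comparison.
plain_sum = sum(p - (0.1264 * t + 0.2229) for t, p in points)

print(f"regression line:  {sse_regression:.4f}")
print(f"alternative line: {sse_alternative:.4f}")
print(f"plain residual sum for regression line: {plain_sum:.4f}")
```

The regression line gives the smaller total. The plain residual sum is close to zero because positive and negative residuals cancel, which is one reason the method squares them before adding.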
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
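For reference, r can be computed directly from the data. The sketch below uses the standard Pearson correlation formula. The slides do not say which formula produced the value on the graph, so treating it as Pearson's r is an assumption, and the function name is illustrative.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def correlation(points):
    """Pearson correlation coefficient of the (t, p) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)

r = correlation(points)
print(f"r = {r:.3f}")
```

The result should land very close to the r = 0.9734 shown on the graph. It is positive and near 1, matching a good fit with a positively sloped line.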
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us how good the fit is, just as r
does, but it will only be between 0 and 1, so it does not
show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look
like with an r-squared value close to 1 and with an
r-squared value close to 0.
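The link between r and r squared can be checked on the apartment data. The sketch below computes r squared in two ways: by squaring r, and as 1 minus the ratio of the squared residuals to the total squared variation in p. All variable names are illustrative assumptions, and the second formula is a standard identity for a least-squares line rather than something stated in the slides.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Way 1: square the correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)
r_squared = r ** 2

# Way 2: 1 - (sum of squared residuals) / (total squared variation in p).
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
explained = 1 - sse / s_pp

print(f"r^2 from r:         {r_squared:.4f}")
print(f"r^2 from residuals: {explained:.4f}")
print(f"sign is lost: {(-r) ** 2 == r ** 2}")
```

Both approaches should give about 0.948 for this data set. Squaring removes the sign, so r = 0.9734 and r = -0.9734 would produce the same r-squared value.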
Consider the following set of data points.
[Scatter plot: points rising steadily from left to right;
vertical axis 0 to 8, horizontal axis 0 to 12]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the fitted line is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: widely scattered points; vertical axis 0 to
20, horizontal axis 0 to 12]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the fitted line is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is the
equation of a line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 25
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the data points and the regression line, with price p in millions of $ plotted against time t in years since 1994.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
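As a rough illustration of the residuals described above, the sketch below computes one residual for each of the six data points in the table, using the regression equation p = 0.1264t + 0.2229. The names `data`, `predict`, and `residuals` are made up for this example, and the sign convention (actual price minus predicted price) is an assumption; the slides only describe the residual as a vertical distance.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict(t):
    """Price predicted by the regression equation p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (sign convention assumed here).
residuals = [(t, p - predict(t)) for t, p in data]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: residual = {res:+.4f} (data point is {side} the line)")
```

A positive residual means the data point sits above the line, and a negative one means it sits below. The size of each value is the vertical distance shown in the zoomed-in picture.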
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
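For readers who are curious, the least-squares calculation can be sketched in a few lines. This is only an illustration, assuming the standard closed-form formulas for the slope and intercept that minimize the sum of squared residuals. The function names `least_squares_line` and `sum_squared_residuals` are made up for this example, and nothing here is required for the course.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")

best = sum_squared_residuals(data, slope, intercept)
# Nudging the slope or the intercept away from the fitted values makes the total larger.
print(best < sum_squared_residuals(data, slope + 0.01, intercept))
print(best < sum_squared_residuals(data, slope, intercept + 0.05))
```

Rounded to four decimal places, the fitted slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph.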
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
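The following sketch shows one way the correlation coefficient could be computed for the apartment data. It assumes the usual Pearson formula for r; the function name `correlation` is made up for this example and is not taken from the presentation.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(data)
print(f"rising prices:  r = {r:.4f}")

# Flipping the prices turns the upward trend into a downward one:
# r changes sign but keeps the same size.
falling = [(t, -p) for t, p in data]
print(f"falling prices: r = {correlation(falling):.4f}")
```

The first value should come out very close to the r = 0.9734 shown on the graph (the last digit can differ slightly depending on rounding), and the second should be the same number with a negative sign.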
Excel will not give the value of r on the chart; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
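To connect r squared back to the residuals, here is a small sketch under one common definition: r squared as 1 minus (sum of squared residuals) divided by (total squared variation of p about its mean). For a least-squares line with a single input variable, this works out to the same number as squaring r. All variable names are made up for this example.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
s_pp = sum((p - mean_p) ** 2 for _, p in data)  # total variation of p about its mean

# Least-squares line through the data.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Sum of squared residuals about the fitted line.
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
r_squared = 1 - sse / s_pp

# Correlation coefficient, squared directly for comparison.
r = s_tp / (s_tt * s_pp) ** 0.5
print(f"r-squared from residuals: {r_squared:.4f}")
print(f"r times r:                {r * r:.4f}")
```

Both lines should print the same value, a little under 0.95, which is what Excel would report as R² for this data.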
We will now look at examples of data with an r-squared
value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points. Horizontal axis from 0 to 12; vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points. Horizontal axis from 0 to 12; vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
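The contrast between the two cases can be reproduced with a short sketch. The two point sets below are hypothetical and invented for this illustration; they are not the points plotted on the slides, which cannot be recovered from the transcript. The function name `r_squared` is also made up.

```python
def r_squared(points):
    """Square of the correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical points lying close to a line (NOT the points from the slides).
nearly_linear = [(0, 2.0), (2, 3.0), (4, 3.9), (6, 5.1), (8, 6.0), (10, 7.0)]

# Hypothetical points with no upward or downward trend (NOT the points from the slides).
scattered = [(0, 12), (2, 3), (4, 15), (6, 15), (8, 3), (10, 12)]

print(f"nearly linear: r-squared = {r_squared(nearly_linear):.4f}")
print(f"scattered:     r-squared = {r_squared(scattered):.4f}")
```

The first set should give a value close to 1 and the second a value close to 0, mirroring the two charts.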
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 26
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), with p in millions of dollars, so we
have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
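One quick way to see that these six points do not all lie on a single line is to compute the slope of the line connecting each consecutive pair, using the two-point slope formula from earlier in the course. The sketch below is only an illustration; the variable names are made up.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Slope of the line through each pair of consecutive points: (p2 - p1) / (t2 - t1).
slopes = [(p2 - p1) / (t2 - t1) for (t1, p1), (t2, p2) in zip(data, data[1:])]

for (t1, _), (t2, _), m in zip(data, data[1:], slopes):
    print(f"from t={t1} to t={t2}: slope = {m:.3f} million $ per year")
```

Every slope is positive, which matches the observation that p increases as t increases, but the slopes differ from pair to pair. If all six points were on one line, the five slopes would be equal.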
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Vertical axis: Price p in millions of $; horizontal axis: Time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot and line of best fit, labeled with the equation p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled with p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and line of best fit, labeled with p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
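The slope interpretation can be checked numerically. In this sketch (the function name `predict` is made up for illustration), the predicted price is compared one year apart at several starting years; because the model is linear, the increase should be the same each time.

```python
def predict(t):
    """Predicted price in millions of $, t years after 1994."""
    return 0.1264 * t + 0.2229


# Compare predictions one year apart, starting from several different years.
for t in (0, 5, 13):
    increase = predict(t + 1) - predict(t)
    print(f"from t={t} to t={t + 1}: predicted increase of ${increase * 1_000_000:,.0f}")
```

Whatever the starting year, the predicted one-year increase equals the slope, 0.1264 million dollars.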
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300, or roughly 3%.
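The two calculations above can be wrapped in a small helper that converts a calendar year to t before applying the regression equation. This is only a sketch: `BASE_YEAR` and `predict_price` are names invented for the example. Keep in mind that 2008 lies outside the 1994 to 2004 data, so that prediction assumes the trend continued.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price(year):
    """Predicted price in millions of $ for a calendar year, using p = 0.1264t + 0.2229."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


# Forecast outside the data range (assumes the trend continued).
print(f"2008: predicted ${predict_price(2008) * 1_000_000:,.0f}")

# Check against a year that is in the table: the actual 2000 price was $0.95 million.
predicted_2000 = predict_price(2000)
actual_2000 = 0.95
error = predicted_2000 - actual_2000
print(f"2000: predicted ${predicted_2000 * 1_000_000:,.0f}, "
      f"actual ${actual_2000 * 1_000_000:,.0f}, "
      f"off by {error / actual_2000:.1%}")
```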
Slide 27
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot of price p against time t with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Graph: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Graph: the same plot, now also labeled r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
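As a quick check of the intercept calculation, here is a minimal Python sketch. The function name predict_price and the constant names are made up for illustration; the coefficients are the rounded values from the regression equation.

```python
# Coefficients of the regression equation p = 0.1264t + 0.2229 (rounded values).
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars

def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT

predicted_1994 = predict_price(0)    # at t = 0 the prediction is the p-intercept
actual_1994 = 0.38                   # table value for t = 0
gap = actual_1994 - predicted_1994   # how far the data point sits above the line

print(f"predicted: {predicted_1994:.4f} million")
print(f"actual:    {actual_1994:.4f} million")
print(f"gap:       {gap:.4f} million")
```

The gap is not an error in the calculation. It only shows that the line is a best overall fit and does not have to pass through the first data point.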
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
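The "increase per year" reading of the slope can be checked numerically: under the model, the predicted price one year later minus the predicted price now is the same no matter which year we start from. A small Python sketch, with names chosen only for illustration:

```python
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars

def predict_price(t):
    """Predicted price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT

# The one-year change in the prediction equals the slope for every starting year.
one_year_changes = [predict_price(t + 1) - predict_price(t) for t in range(10)]

# Convert from millions of dollars to dollars for the plain-language statement.
dollars_per_year = SLOPE * 1_000_000

for t, change in enumerate(one_year_changes):
    print(f"t = {t} to t = {t + 1}: change of about {change:.4f} million")
print(f"That is about ${dollars_per_year:,.0f} per year.")
```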
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
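The two calculations above can be reproduced with a short Python sketch. The helper predict_price_for_year is a hypothetical name; it simply converts a calendar year to t (years since 1994) and applies the regression equation with its rounded coefficients.

```python
BASE_YEAR = 1994     # t = 0 corresponds to this year
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars

def predict_price_for_year(year):
    """Predicted price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT

forecast_2008 = predict_price_for_year(2008)   # t = 14, beyond the last data point
fitted_2000 = predict_price_for_year(2000)     # t = 6, one of the years in the table
actual_2000 = 0.95                             # table value for t = 6
error_2000 = fitted_2000 - actual_2000         # how much the model overshoots

print(f"2008 forecast: {forecast_2008:.4f} million")
print(f"2000 fitted:   {fitted_2000:.4f} million")
print(f"2000 error:    {error_2000:.4f} million")
```

Note that 2008 lies outside the years covered by the data, so that number depends entirely on the assumption that the trend continued.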
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points with the regression line drawn through them.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
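To make the idea of a residual concrete, here is a short Python sketch that computes the residual at each of the six data points. It assumes the usual sign convention (observed value minus predicted value), which the slides do not spell out, and it uses the rounded coefficients of the regression equation.

```python
t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # millions of dollars
SLOPE, INTERCEPT = 0.1264, 0.2229                 # rounded regression coefficients

# Residual = observed p minus the p predicted by the line (assumed convention).
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in zip(t_values, p_values)]

for t, res in zip(t_values, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: residual {res:+.4f} (point is {side} the line)")
```

Some points sit above the line and some below it, so the residuals have mixed signs.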
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
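For the curious, here is a minimal Python sketch of the least-squares calculation applied to our six data points. It uses the standard textbook formulas for a straight-line fit, which may be presented differently in the course PDF, and the function names are ours.

```python
t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # millions of dollars

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of the squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

slope, intercept = least_squares_line(t_values, p_values)
best = sum_squared_residuals(slope, intercept, t_values, p_values)

# Any other line, for example one that is slightly steeper, gives a larger sum.
other = sum_squared_residuals(slope + 0.01, intercept, t_values, p_values)

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"sum of squared residuals: best line {best:.4f}, steeper line {other:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the regression equation shown on the graph.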
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
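The slides do not give a formula for r. As an illustration only, the following Python sketch uses the standard (Pearson) correlation formula on our six data points, and then on the same points with the prices flipped in sign, to show how the sign of r follows the direction of the trend. The function name is ours.

```python
import math

t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # millions of dollars

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(t_values, p_values)

# Flipping the sign of every price turns the upward trend into a downward one.
r_flipped = correlation(t_values, [-p for p in p_values])

print(f"r for the price data:    {r:.3f}")
print(f"r for the flipped data: {r_flipped:.3f}")
```

The first value is close to 1 and agrees with the r shown on the graph to about three decimal places; the second has the same size but is negative.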
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
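One common way to think of the r-squared value is as the share of the variation in p that the line accounts for: 1 minus (sum of squared residuals) divided by (sum of squared distances from the average). This definition is standard but is not stated in the slides, so treat the Python sketch below as an illustration under that assumption. It applies the definition to our apartment data and compares the result with the square of r = 0.9734.

```python
t_values = [0, 2, 4, 6, 8, 10]                    # years since 1994
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]   # millions of dollars
SLOPE, INTERCEPT = 0.1264, 0.2229                 # rounded regression coefficients
R_FROM_GRAPH = 0.9734                             # correlation coefficient on the graph

mean_p = sum(p_values) / len(p_values)

# Variation left over after using the line to predict p.
ss_residual = sum(
    (p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in zip(t_values, p_values)
)
# Total variation of p around its average.
ss_total = sum((p - mean_p) ** 2 for p in p_values)

r_squared = 1 - ss_residual / ss_total

print(f"r-squared from the definition: {r_squared:.3f}")
print(f"square of r from the graph:    {R_FROM_GRAPH ** 2:.3f}")
```

The two numbers agree to about three decimal places; the small difference comes from the rounded coefficients.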
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: a set of data points, with x from 0 to 12 and y from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Graph: the same points with their regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Scatter plot: a second set of data points, with x from 0 to 12 and y from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Graph: the same points with their regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
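Since Excel reports r squared rather than r, it can be useful to go back the other way. For a straight-line fit, r has the size of the square root of r squared and the same sign as the slope. The Python sketch below applies this to the two examples; the helper name r_from_r_squared is hypothetical.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r for a straight-line fit: size sqrt(R^2), sign taken from the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# First example: y = 0.5091x + 1.94 with R^2 = 0.9943.
r_first = r_from_r_squared(0.9943, 0.5091)

# Second example: y = -0.183x + 8.3267 with R^2 = 0.0084.
r_second = r_from_r_squared(0.0084, -0.183)

print(f"first example:  r is about {r_first:.3f}")
print(f"second example: r is about {r_second:.3f}")
```

The first r is very close to 1, while the second is slightly negative and close to 0, matching what the two scatter plots suggest.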
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 29
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994)    0     2     4     6     8     10
p (millions of $)       0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the same plot, now also labeled r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
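If you would like to see where the numbers on the graph come from, here is a minimal sketch in Python. It assumes the standard least-squares formulas for slope and intercept; the function name fit_line is made up for this illustration, and Excel is only expected (not guaranteed by this sketch) to arrive at the same coefficients.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points.

    Hypothetical helper for illustration; uses the standard textbook formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # p = 0.1264t + 0.2229
```

The printed equation agrees with the one shown on the graph, rounded to four decimal places.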
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by
$31,300, or about 3%.
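The same plug-in arithmetic can be sketched in a few lines of Python. The function name predict_price is made up for this illustration; the coefficients are the rounded ones from the regression equation above.

```python
def predict_price(t):
    """Predicted average price in millions of $, t years after 1994.

    Hypothetical helper using the rounded coefficients p = 0.1264t + 0.2229.
    """
    return 0.1264 * t + 0.2229


forecast_2008 = predict_price(2008 - 1994)  # t = 14, outside the data range
model_2000 = predict_price(2000 - 1994)     # t = 6, a year that is in the table
actual_2000 = 0.95                          # table value for 2000, in millions of $

print(f"2008 forecast: {forecast_2008:.4f} million")
print(f"2000 model:    {model_2000:.4f} million (actual {actual_2000:.2f})")
print(f"2000 error:    {model_2000 - actual_2000:.4f} million")
```

Converting the year to t by subtracting 1994 keeps the example consistent with the convention that t=0 represents 1994.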
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line. Horizontal axis: time t in years since 1994; vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular data
point and the regression line: the actual value minus the
value that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
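To make the idea of a residual concrete, here is a small sketch that computes the residual for each of the six data points. It assumes the usual sign convention (actual value minus predicted value) and uses the rounded regression equation p = 0.1264t + 0.2229; the function name predicted is made up for this illustration.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predicted(t_value):
    """Value on the regression line at t_value (rounded coefficients)."""
    return 0.1264 * t_value + 0.2229


# Residual = actual price minus the price on the line (an assumed convention).
residuals = [actual - predicted(t_value) for t_value, actual in zip(t, p)]

for t_value, residual in zip(t, residuals):
    side = "above" if residual > 0 else "below"
    print(f"t = {t_value:2d}: residual = {residual:+.4f} (point is {side} the line)")
```

A positive residual means the data point sits above the line, and a negative residual means it sits below the line.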
There is a method that allows us to minimize the sum of
the squares of all of the residuals (squaring keeps positive
and negative residuals from canceling each other out).
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
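One way to see why the regression line is the “best” line is to compare its sum of squared residuals with that of a few other lines. The sketch below does this for three alternative lines chosen only for illustration (a steeper line, a line shifted up, and the line through the first and last data points); the function name sum_squared_residuals is also made up here.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances from the data points to the line."""
    return sum((actual - (slope * t_value + intercept)) ** 2
               for t_value, actual in zip(t, p))


best = sum_squared_residuals(0.1264, 0.2229)       # the regression line
steeper = sum_squared_residuals(0.14, 0.2229)      # illustrative alternative
shifted_up = sum_squared_residuals(0.1264, 0.30)   # illustrative alternative
endpoints = sum_squared_residuals(0.122, 0.38)     # line through (0, 0.38) and (10, 1.60)

for name, value in [("regression line", best), ("steeper line", steeper),
                    ("shifted-up line", shifted_up), ("endpoint line", endpoints)]:
    print(f"{name:>16}: {value:.4f}")
```

Each of the alternative lines gives a larger total than the regression line, which is what the least-squares method is designed to guarantee.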
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
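For the curious, the correlation coefficient can be computed directly from the data. The sketch below assumes the standard (Pearson) formula for r; the function name correlation is made up for this illustration.

```python
from math import sqrt

# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r, assuming the standard Pearson formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(t, p)
print(f"apartment data: r = {r:.3f}")

# Points that lie exactly on a falling line give r = -1.
r_falling = correlation([0, 1, 2], [5, 3, 1])
print(f"falling line:   r = {r_falling:.3f}")
```

The value for the apartment data comes out at about 0.973, in line with the r shown on the graph, and the second example shows the sign of r following the sign of the slope.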
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
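As a rough check on how r and r squared are related, the sketch below computes r squared for the apartment data as 1 minus the ratio of the sum of squared residuals to the total squared variation in p. This is one common definition, assumed here for illustration; the variable names are made up, and for a straight-line fit like this one the result should match the square of r.

```python
# Data from the table: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
mean_t = sum(t) / n
mean_p = sum(p) / n

# Least-squares slope and intercept (standard formulas).
sxy = sum((x - mean_t) * (y - mean_p) for x, y in zip(t, p))
sxx = sum((x - mean_t) ** 2 for x in t)
slope = sxy / sxx
intercept = mean_p - slope * mean_t

# Variation left over after fitting the line vs. total variation in p.
ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(t, p))
ss_total = sum((y - mean_p) ** 2 for y in p)

r_squared = 1 - ss_residual / ss_total
print(f"r squared = {r_squared:.3f}")
print(f"square root of r squared = {r_squared ** 0.5:.3f}")
```

The square root comes out at about 0.973, matching the r value on the graph, so the r-squared value for this data is about 0.948.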
Consider the following set of data points.
[Figure: scatter plot of a set of data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its trendline, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of a set of data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its trendline, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 30
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \(\frac{0.1264}{1}\).
Recall that the definition of slope is
\(\text{slope} = \frac{\Delta y}{\Delta x}\), the change in y divided by the change in x.
In this case we are using p and t, so it’s
\(\text{slope} = \frac{\Delta p}{\Delta t}\).
So for our problem, we have
\(\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}\).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
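The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name `predict_price` and the constant names are illustrative assumptions, while the coefficients 0.1264 and 0.2229 and the base year 1994 come from the regression equation above.

```python
# Coefficients of the regression equation p = 0.1264t + 0.2229 from the slides.
SLOPE = 0.1264        # millions of dollars per year
INTERCEPT = 0.2229    # millions of dollars at t = 0 (the year 1994)
BASE_YEAR = 1994      # t = 0 corresponds to 1994


def predict_price(year):
    """Return the model's predicted price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


if __name__ == "__main__":
    for year in (2000, 2008):
        p = predict_price(year)
        print(f"{year}: p = {p:.4f} million, or ${p * 1_000_000:,.0f}")
```

Running it should give values matching the $981,300 and $1,992,500 figures worked out by hand above.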
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of the data points with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: zoomed-in view showing one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
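To make the idea of a residual concrete, the sketch below computes the residual for each of the six apartment data points, using the regression equation p = 0.1264t + 0.2229 from the graph. The names `DATA` and `residuals` are illustrative assumptions, and the sign convention (actual price minus predicted price) is the common one rather than something stated in these slides.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264      # slope of the regression line from the slides
INTERCEPT = 0.2229  # p-intercept of the regression line from the slides


def residuals(points, slope, intercept):
    """Return the vertical differences (actual p minus predicted p) for each point."""
    return [p - (slope * t + intercept) for t, p in points]


if __name__ == "__main__":
    for (t, p), res in zip(DATA, residuals(DATA, SLOPE, INTERCEPT)):
        predicted = SLOPE * t + INTERCEPT
        print(f"t = {t:2d}: actual = {p:.4f}, predicted = {predicted:.4f}, residual = {res:+.4f}")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below. For t = 0 the residual is the gap between the actual price of $380,000 and the predicted $222,900.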
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
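For readers curious about what happens behind the scenes, here is a minimal sketch of the least-squares calculation for the apartment data. The least-squares method chooses the line that makes the sum of the squared residuals as small as possible. The formulas used here are the standard textbook ones for a single input variable and are not spelled out in these slides; the function names are illustrative assumptions.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Return the sum of squared vertical distances between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


if __name__ == "__main__":
    slope, intercept = least_squares_line(DATA)
    print(f"p = {slope:.4f}t + {intercept:.4f}")
    best = sum_squared_residuals(DATA, slope, intercept)
    # Nudging the slope away from the fitted value makes the sum of squares larger.
    nudged = sum_squared_residuals(DATA, slope + 0.01, intercept)
    print(f"fitted line: {best:.5f}, nudged slope: {nudged:.5f}")
```

The fitted slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph once they are rounded to four decimal places.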
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit to the
data, so there is little or no linear relationship between the
variables in this case.
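As a rough check on the value r = 0.9734 shown on the graph, the sketch below computes the correlation coefficient for the six apartment data points. It assumes the usual Pearson formula, which these slides do not state, and the function name `correlation_coefficient` is illustrative.

```python
import math

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Return the Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    r = correlation_coefficient(DATA)
    print(f"r = {r:.4f}")
    # Flipping the prices turns a rising trend into a falling one,
    # which changes the sign of r but not its size.
    falling = [(x, -y) for x, y in DATA]
    print(f"r for the falling version = {correlation_coefficient(falling):.4f}")
```

The computed value comes out at about 0.973, in line with the r = 0.9734 on the graph; the last decimal place can differ depending on rounding.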
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
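Before turning to those examples, the r-squared value for the apartment data can be checked in two ways: by squaring r = 0.9734 from the graph, or by comparing the squared residuals with the total variation in the prices. The second formula, 1 - SSE/SST, is the standard definition and is an assumption here, since these slides only describe r-squared as r squared; all names in the sketch are illustrative.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
DATA = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE, INTERCEPT = 0.1264, 0.2229   # regression equation from the slides
R = 0.9734                          # correlation coefficient from the graph


def r_squared(points, slope, intercept):
    """Return 1 - SSE/SST: the share of the variation in y explained by the line."""
    mean_y = sum(y for _, y in points) / len(points)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)  # unexplained variation
    sst = sum((y - mean_y) ** 2 for _, y in points)                   # total variation
    return 1 - sse / sst


if __name__ == "__main__":
    print(f"r squared from r:         {R ** 2:.4f}")
    print(f"r squared from residuals: {r_squared(DATA, SLOPE, INTERCEPT):.4f}")
```

The two results should agree to about three decimal places, and both are close to 1, which matches the good linear fit seen on the graph.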
Consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation describes the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 32
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Time t in years since 1994:   0     2     4     6     8     10
Price p in millions of $:     0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Figure: the same scatter plot of Price p in millions of $ against Time t in years since 1994, with the regression line drawn through the data]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994, with the regression line labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
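For readers curious where the numbers on the graph come from, here is a minimal sketch in Python rather than Excel. It assumes the six (t, p) points listed above and uses the standard closed-form least-squares formulas for slope and intercept, which the slides do not spell out. The function name `fit_line` and the variable names are ours, chosen for illustration.

```python
# Apartment data from the table: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation shown on the graph.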
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \), the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
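The two calculations above can be sketched as a small Python function. This is only an illustration: the name `predict_price` and its parameters are hypothetical, and the coefficients are the rounded values from the regression equation p = 0.1264t + 0.2229.

```python
def predict_price(year, slope=0.1264, intercept=0.2229, base_year=1994):
    """Predicted average price in millions of $ for a given calendar year."""
    t = year - base_year          # t counts years since 1994
    return slope * t + intercept

price_2008 = predict_price(2008)  # a year outside the data range (t = 14)
price_2000 = predict_price(2000)  # a year inside the data range (t = 6)
actual_2000 = 0.95                # actual price for 2000 from the table

print(f"2008 predicted: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 predicted: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 actual:    ${actual_2000 * 1_000_000:,.0f}")
print(f"2000 difference (actual - predicted): {actual_2000 - price_2000:+.4f} million")
```

Passing the calendar year rather than t keeps the "t=0 is 1994" convention in one place, which makes it harder to plug in the wrong value.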
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of Price p in millions of $ against Time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a
data point and the value the regression line predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
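As a rough illustration, the residuals for the apartment data can be computed directly. The sketch below assumes the six (t, p) points from the table and the regression line p = 0.1264t + 0.2229 shown on the graph. The names `points`, `predicted_price`, and `residuals` are ours. It takes each residual as the actual price minus the predicted price, so a positive value means the point lies above the line.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predicted_price(t):
    """Price predicted by the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Residual = actual price minus the price predicted by the line.
residuals = [(t, p - predicted_price(t)) for t, p in points]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: residual {res:+.4f} (point is {side} the line)")
```

Some residuals come out positive and some negative, which is why simply adding them up says little about how well the line fits.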
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
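To make the least-squares idea concrete, the following sketch compares the sum of squared residuals for the regression line with that of one other candidate line. It assumes the six data points from the table. The helper `sum_squared_residuals` and the comparison line (drawn through the first and last data points) are hypothetical choices made only for illustration.

```python
# Data points (t, p) from the table: t = years since 1994, p = price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def sum_squared_residuals(slope, intercept):
    """Add up (actual - predicted)^2 over all points for the line p = slope*t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)

# Regression line reported on the graph.
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# Alternative line for comparison: through the first and last data points.
alt_slope = (1.60 - 0.38) / (10 - 0)
alt_intercept = 0.38
sse_alternative = sum_squared_residuals(alt_slope, alt_intercept)

print(f"regression line:  {sse_regression:.4f}")
print(f"alternative line: {sse_alternative:.4f}")
```

The alternative line hits two of the points exactly, yet its total is larger, because it misses the four middle points by more than the regression line does.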
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing about
how good the fit is, but it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
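The relationship between r and r squared can be sketched in a few lines of Python. The code assumes the apartment data from the table and uses the standard Pearson formula for the correlation coefficient, which the slides do not state. The function name `correlation` is ours.

```python
import math

# Apartment data from the table: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(t_values, p_values)
r_squared = r ** 2

# Flipping the sign of every price reverses the slope, so r changes sign
# while r squared stays the same.
r_flipped = correlation(t_values, [-p for p in p_values])

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
print(f"flipped data: r = {r_flipped:.4f}, r squared = {r_flipped ** 2:.4f}")
```

Because squaring removes the sign, r squared reports how strong the linear pattern is but not whether the line slopes up or down.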
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points following a clear linear pattern, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 33
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: the scatter plot with the regression line, labeled with its equation and correlation coefficient. Vertical axis: price p in millions of $. Horizontal axis: time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
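For readers who want to check the equation outside Excel, here is a minimal sketch in Python. It assumes the standard closed-form least-squares formulas (slope = S_tp / S_tt, intercept = mean of p minus slope times mean of t), which these slides do not state. The helper name `fit_line` is hypothetical and chosen only for illustration.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points.

    Hypothetical helper for illustration. It uses the textbook formulas
    slope = S_tp / S_tt and intercept = mean(p) - slope * mean(t).
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of squared deviations of t, and of products of deviations of t and p.
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should match the equation shown on the chart, p = 0.1264t + 0.2229.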
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
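The comparison for 1994 can be written out as a short Python sketch. The function name `predict` is an assumption introduced for illustration. The coefficients and the table value are the ones quoted above.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict(t):
    """Predicted price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


predicted_1994 = predict(0)  # at t = 0 the prediction is just the p-intercept
actual_1994 = 0.38           # table value for t = 0
gap = actual_1994 - predicted_1994

print(f"predicted: ${predicted_1994 * 1e6:,.0f}")
print(f"actual:    ${actual_1994 * 1e6:,.0f}")
print(f"gap:       ${gap * 1e6:,.0f}")
```

The gap, $380,000 minus $222,900, is $157,100. That is how far the model misses this one point, even though the line is the best overall fit to all six points.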
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\[ \frac{0.1264}{1}. \]
Recall that the definition of slope is
\[ \text{slope} = \frac{\text{change in } y}{\text{change in } x}. \]
In this case we are using p and t, so it’s
\[ \text{slope} = \frac{\text{change in } p}{\text{change in } t}. \]
So for our problem, we have
\[ \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}. \]
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the years
given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only $31,300, or about 3%.
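The same check can be run for every year in the table, along with the 2008 forecast. This is a sketch only. The names `table` and `predict_for_year` are assumptions made for illustration, and the calendar years are obtained from the data points by adding t to 1994.

```python
# Regression equation from the slides: p = 0.1264t + 0.2229 (p in millions of $).
SLOPE, INTERCEPT = 0.1264, 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994

# Actual average prices from the table, in millions of dollars.
table = {1994: 0.38, 1996: 0.40, 1998: 0.60, 2000: 0.95, 2002: 1.20, 2004: 1.60}


def predict_for_year(year):
    """Predicted price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Compare the model with each actual value in the table.
for year, actual in table.items():
    predicted = predict_for_year(year)
    print(f"{year}: predicted {predicted:.4f}, actual {actual:.2f}, "
          f"difference {actual - predicted:+.4f}")

# Forecast beyond the data range (assumes the trend continues).
forecast_2008 = predict_for_year(2008)
print(f"2008 forecast: {forecast_2008:.4f} million")
```

The 2008 line is an extrapolation beyond the last data point (2004), so it is only as reliable as the assumption that the trend continued.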
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line. Vertical axis: price p in millions of $. Horizontal axis: time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
data point and the value the regression line predicts there.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: the scatter plot with one data point boxed, followed by a zoomed-in view of that point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
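To see what the least-squares method is minimizing, the sketch below computes the residuals for the regression line and compares its sum of squared residuals with two nudged lines. The nudge sizes (0.01 on the slope, 0.05 on the intercept) and the function names are arbitrary choices made for illustration. They do not come from the slides.

```python
# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residuals(slope, intercept):
    """Actual p minus the p predicted by the line, for each data point."""
    return [p - (slope * t + intercept) for t, p in points]


def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the given line."""
    return sum(r ** 2 for r in residuals(slope, intercept))


# Residuals for the regression line from the slides.
fitted = residuals(0.1264, 0.2229)
print("residuals:", [round(r, 4) for r in fitted])
print("plain sum of residuals:", round(sum(fitted), 4))

# Compare the regression line with two slightly different lines.
best = sum_of_squares(0.1264, 0.2229)
steeper = sum_of_squares(0.1364, 0.2229)  # slope nudged up by 0.01
higher = sum_of_squares(0.1264, 0.2729)   # intercept nudged up by 0.05
print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

The plain sum of the residuals comes out close to zero because the positive and negative residuals cancel. That is why the method works with squared residuals instead. Both nudged lines give a larger sum of squares than the regression line.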
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
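As a rough check on the value of r for the apartment data, here is a sketch that assumes the usual Pearson formula, r = S_tp / sqrt(S_tt · S_pp). The slides do not give this formula, and the function name `correlation` is hypothetical.

```python
import math

# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Pearson correlation coefficient r for a list of (t, p) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(points)
print(f"r = {r:.4f}")
```

The result should agree with the r = 0.9734 shown on the chart to about three decimal places. It is positive and close to 1, which is consistent with a good fit to an upward-sloping line.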
Excel’s chart trendline does not display the value of r; instead, it displays the value
of r squared.
The r-squared value tells us much the same thing, but because squaring removes the sign of r, it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
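One way to see the link between r and r squared is to compute r squared directly from the residuals. The sketch below assumes the common definition R² = 1 − SS_res / SS_tot, which the slides do not state, and applies it to the apartment data and regression equation from earlier.

```python
# Data points (t, p): t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression equation from the slides: p = 0.1264t + 0.2229.
SLOPE, INTERCEPT = 0.1264, 0.2229

mean_p = sum(p for _, p in points) / len(points)

# Variation left over after fitting the line (sum of squared residuals).
ss_res = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in points)
# Total variation of p around its mean.
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 from residuals:  {r_squared:.4f}")
print(f"square of r = 0.9734: {0.9734 ** 2:.4f}")
```

The two printed values should agree to about three decimal places. Since r squared drops the sign of r, the sign of the slope tells you whether the relationship is increasing or decreasing.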
We will now look at examples of data with an r-squared
value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: a set of data points lying close to a straight line. Vertical axis from 0 to 8, horizontal axis from 0 to 12.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same points with the trendline y = 0.5091x + 1.94, R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Scatter plot: a set of widely scattered data points. Vertical axis from 0 to 20, horizontal axis from 0 to 12.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same points with the trendline y = -0.183x + 8.3267, R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to 0
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 34
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, except
for the direction of the slope, and it will only be between
0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with trendline y = 0.5091x + 1.94, R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with trendline y = -0.183x + 8.3267, R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 35
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
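The two calculations above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `predict_price` and the constant names are made up here, and the coefficients are the rounded ones from the regression equation p = 0.1264t + 0.2229.

```python
# Rounded coefficients from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (the year 1994)


def predict_price(t):
    """Return the predicted price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# t = 14 corresponds to 2008 and t = 6 corresponds to 2000.
price_2008 = predict_price(14)
price_2000 = predict_price(6)
actual_2000 = 0.95  # from the table, in millions of dollars

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: {price_2000 - actual_2000:+.4f} million")
```

The last line shows how far the model is from the table value for 2000, which is one way to check how well the line matches a known data point.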
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: scatter plot of the data with the regression line; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
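To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six data points in the example. The sign convention used (actual value minus the value on the line) is a common one but is an assumption here, and the names `points`, `predicted`, and `residuals` are chosen only for illustration.

```python
# Data points (t, p) from the example: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Height of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value on the line.
# A positive residual means the data point lies above the line.
residuals = [p - predicted(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  on line={predicted(t):.4f}  residual={res:+.4f}")
```

Some residuals are positive and some are negative, because the line passes above some points and below others.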
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
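For readers who are curious, the sketch below applies the standard textbook least-squares formulas to the six data points from the example. It is not the Excel procedure used in the course, and the variable and function names are chosen only for illustration. The slope and intercept it computes should agree with p = 0.1264t + 0.2229 after rounding.

```python
# Data points (t, p) from the example: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Standard closed-form least-squares formulas for a line p = slope * t + intercept.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t


def sum_squared_residuals(m, b):
    """Sum of squared vertical distances from the points to the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in points)


print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")

# Any other line should give a larger sum of squared residuals.
best = sum_squared_residuals(slope, intercept)
print(best < sum_squared_residuals(slope + 0.01, intercept))
print(best < sum_squared_residuals(slope, intercept - 0.05))
```

The last two lines compare the fitted line with two slightly different lines, which illustrates why this particular line is called the "best" fit.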
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
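As an illustration, the correlation coefficient for the apartment data can be computed directly with the standard formula for r. This is a sketch only, with variable names chosen for illustration; the result should agree with the value r = 0.9734 shown on the graph to about three decimal places, with any difference coming from rounding.

```python
import math

# Data points (t, p) from the example: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Standard formula for the correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r = {r:.4f}")
```

Because the prices rise as t increases, r is positive, and because the points lie close to a line, r is close to 1.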
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, except
for the direction of the slope, and it will only be between
0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
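One common way to compute the r-squared value is to compare the squared residuals with the total variation in the data. The sketch below does this for the apartment data; the function names `fit_line` and `r_squared` are hypothetical, and this is not necessarily how Excel computes the value internally. For a straight-line fit like this one, the result equals the square of the correlation coefficient r.

```python
# Data points (t, p) from the example: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def r_squared(data):
    """Return 1 - (sum of squared residuals) / (total sum of squares)."""
    slope, intercept = fit_line(data)
    mean_y = sum(y for _, y in data) / len(data)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in data)
    ss_tot = sum((y - mean_y) ** 2 for _, y in data)
    return 1 - ss_res / ss_tot


r2 = r_squared(points)
print(f"r-squared = {r2:.4f}")
```

Small residuals make the fraction small, which pushes the r-squared value toward 1.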
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with trendline y = 0.5091x + 1.94, R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with trendline y = -0.183x + 8.3267, R² = 0.0084.]
And it is.
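The same contrast can be seen numerically. The two data sets in the sketch below are made up for illustration and are not the points plotted in the two graphs above. The helper `r_squared` is a hypothetical name, and it uses the square of the standard correlation formula.

```python
def r_squared(xs, ys):
    """Return the square of the correlation coefficient of the paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


xs = [0, 2, 4, 6, 8, 10]

# Hypothetical values that rise steadily, roughly like y = 0.5x + 2.
nearly_linear = [2.0, 2.9, 4.1, 4.9, 6.1, 7.0]

# Hypothetical values that jump up and down with no clear trend.
scattered = [8, 3, 12, 5, 11, 6]

print(f"nearly linear data: r-squared = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered data:     r-squared = {r_squared(xs, scattered):.4f}")
```

The first value comes out close to 1 and the second close to 0, matching what we would expect from looking at the two kinds of scatter plots.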
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 36
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fit line drawn over the data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; price p in millions of $ against time t in years since 1994]
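As a rough illustration of where the equation and the correlation coefficient come from, the sketch below recomputes them in Python from the six (t, p) points in the table. The function name fit_line and the variable names are assumptions made for this example, and the course itself uses Excel rather than code. The results should agree with the values on the graph up to rounding in the last decimal place.

```python
# Data assumed from the table earlier in the presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```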
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
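To see the "per year" interpretation concretely, here is a small Python sketch that compares the predicted price in consecutive years. The function name predicted_price is an assumption for this example. Every one-year step changes the prediction by the same amount, namely the slope.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars, the predicted price at t = 0


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price from each year to the next, for t = 0 through 9.
yearly_changes = [predicted_price(t + 1) - predicted_price(t) for t in range(10)]
for t, change in enumerate(yearly_changes):
    print(f"t = {t} to t = {t + 1}: change of ${change * 1_000_000:,.0f}")
```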
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
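The two calculations above can be repeated for any year with a short Python sketch. The names predict_price and BASE_YEAR are assumptions for this example. The function converts a calendar year to t and then applies the regression equation.

```python
SLOPE = 0.1264      # from the regression equation p = 0.1264t + 0.2229
INTERCEPT = 0.2229
BASE_YEAR = 1994    # the year that corresponds to t = 0


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


for year in (2000, 2008):
    dollars = predict_price(year) * 1_000_000
    print(f"{year}: about ${dollars:,.0f}")
```

Keep in mind that 2008 lies outside the range of the data (1994 to 2004), so that prediction only holds if the trend continued.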
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p in millions of $ against time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: close-up of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
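The following Python sketch shows why the residuals are squared before they are added. It uses the six data points from the table and compares our regression line with a second line, p = 0.15t + 0.10, which is a made-up alternative chosen only for this illustration. The function names are also assumptions. The plain sum of the residuals is nearly zero for the regression line because positive and negative residuals cancel, so it cannot measure fit; the sum of the squared residuals can, and it is smaller for the regression line.

```python
# Data assumed from the table earlier in the presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(xs, ys, slope, intercept):
    """Actual value minus the value the line predicts, for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    """Sum of the squared values."""
    return sum(v ** 2 for v in values)


best_fit = residuals(t_values, p_values, 0.1264, 0.2229)
other_line = residuals(t_values, p_values, 0.15, 0.10)  # hypothetical competitor

for t, res in zip(t_values, best_fit):
    print(f"t = {t:2d}: residual = {res:+.4f}")

print("sum of residuals, regression line:", round(sum(best_fit), 4))
print("sum of squared residuals, regression line:", round(sum_of_squares(best_fit), 4))
print("sum of squared residuals, other line:", round(sum_of_squares(other_line), 4))
```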
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
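To make the three cases concrete, the Python sketch below computes r for three small data sets. All three data sets are invented for this illustration, and the function name correlation is an assumption. The first set lies exactly on a rising line, the second exactly on a falling line, and the third has no upward or downward trend.

```python
def correlation(xs, ys):
    """Correlation coefficient r for paired x and y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


xs = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # hypothetical: exactly y = 2x + 1
falling = [9, 7, 5, 3, 1]    # hypothetical: exactly y = -2x + 11
scattered = [2, 1, 3, 1, 2]  # hypothetical: no linear trend

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
r_scattered = correlation(xs, scattered)

print("rising line:   r =", round(r_rising, 4))
print("falling line:  r =", round(r_falling, 4))
print("no trend:      r =", round(r_scattered, 4))
```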
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
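For the apartment data, the r-squared value can be found in two ways, as the Python sketch below suggests. One way is simply to square r = 0.9734. The other way, which is one common definition of r squared, compares the squared residuals around the regression line with the squared deviations around the average price. The function name r_squared is an assumption for this example, and the two results agree only up to rounding because r and the equation were rounded to four decimal places.

```python
# Data assumed from the table earlier in the presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def r_squared(xs, ys, slope, intercept):
    """1 minus (squared residuals around the line) / (squared deviations around the mean)."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


from_residuals = r_squared(t_values, p_values, 0.1264, 0.2229)
from_r = 0.9734 ** 2

print("r squared from the residuals:", round(from_residuals, 3))
print("r squared from squaring r:   ", round(from_r, 3))
```

Both values are close to 1, which matches the good linear fit seen on the graph.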
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points lying close to a rising line, with trendline y = 0.5091x + 1.94 and R² = 0.9943; vertical axis 0 to 8, horizontal axis 0 to 12]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084; vertical axis 0 to 20, horizontal axis 0 to 12]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of a line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 38
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994 and the price p
is in millions of dollars.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: Price p in millions of $ versus Time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it’s
\( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have
\( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
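The two calculations above can be repeated for any year with a small helper. This is a minimal sketch that uses the rounded coefficients from the slides; the function name `predict_price` is hypothetical and chosen only for illustration.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994.

    Uses the rounded regression equation from the slides:
    p = 0.1264t + 0.2229.
    """
    return 0.1264 * t + 0.2229


# Forecast for 2008 (t = 14), which lies outside the 1994-2004 data range.
price_2008 = predict_price(2008 - 1994)

# Check against a year in the table: 2000 (t = 6), actual price 0.95.
price_2000 = predict_price(2000 - 1994)
difference_2000 = 0.95 - price_2000  # actual minus predicted

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 actual minus predicted: {difference_2000:+.4f} million")
```

A negative difference means the model predicts a higher price than the one in the table for that year.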
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points and the regression line, Price p in millions of $ versus Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression line
gives at that point.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Graph: the scatter plot and regression line with a box drawn around one data point]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
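To make the least-squares idea concrete, the sketch below computes the residuals for the regression line and compares its sum of squared residuals with that of one other line. The comparison line (through the first and last data points) is a hypothetical choice made only for illustration, and the function names are not from the course materials.

```python
# Data from the table: t in years since 1994, p in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Actual price minus the price predicted by p = slope * t + intercept."""
    return [pi - (slope * ti + intercept) for ti, pi in zip(t, p)]


def sum_of_squares(values):
    """Sum of the squares of a list of numbers."""
    return sum(v * v for v in values)


# The regression line from the slides (rounded coefficients).
fit_residuals = residuals(0.1264, 0.2229)
fit_sse = sum_of_squares(fit_residuals)

# A hypothetical alternative, chosen only for comparison: the line
# through the first and last data points, (0, 0.38) and (10, 1.60).
other_slope = (1.60 - 0.38) / (10 - 0)
other_residuals = residuals(other_slope, 0.38)
other_sse = sum_of_squares(other_residuals)

print("regression line residuals:", [round(e, 4) for e in fit_residuals])
print("sum of squared residuals, regression line:", round(fit_sse, 4))
print("sum of squared residuals, other line:", round(other_sse, 4))
```

The residuals are squared before they are added because positive and negative residuals would otherwise cancel each other out; for the regression line, the plain sum of the residuals is approximately zero.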
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
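For reference, one standard formula for the correlation coefficient is worked below with the six apartment data points. The formula is not given in the slides, and the bar notation for the means is introduced here: the mean of t is 5 and the mean of p is 0.855.

```latex
r = \frac{\sum_i (t_i - \bar{t})(p_i - \bar{p})}
         {\sqrt{\sum_i (t_i - \bar{t})^2 \, \sum_i (p_i - \bar{p})^2}}
  = \frac{8.85}{\sqrt{70 \times 1.18075}}
  \approx 0.973
```

To three decimal places, this agrees with the r = 0.9734 given for the apartment data.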
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
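The relationship between r and r squared can be seen in a short sketch. It computes r for the apartment prices and for a hypothetical data set made by listing the same prices in reverse order, so that they fall over time; the function name `pearson_r` and the reversed data set are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


t = [0, 2, 4, 6, 8, 10]
rising = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]  # prices from the table
falling = list(reversed(rising))  # hypothetical: same prices, falling over time

r_rising = pearson_r(t, rising)
r_falling = pearson_r(t, falling)

print(f"rising:  r = {r_rising:+.3f}, r squared = {r_rising ** 2:.3f}")
print(f"falling: r = {r_falling:+.3f}, r squared = {r_falling ** 2:.3f}")
```

The two values of r differ only in sign, so squaring them gives the same r-squared value: r squared measures how well a line fits, but not whether the line slopes up or down.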
Consider the following set of data points.
[Scatter plot: data points with x from 0 to 12 and y from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression line is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: data points with x from 0 to 12 and y from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression line is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 39
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
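The two calculations above can also be checked with a few lines of code. The following is only a sketch: the function name predict_price is our own choice for illustration, and the coefficients are the rounded values from the regression equation.

```python
# Coefficients taken from the regression equation p = 0.1264t + 0.2229.
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars when t = 0 (the year 1994)

def predict_price(year):
    """Return the model's predicted price, in millions of dollars."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT

price_2008 = predict_price(2008)  # t = 14
price_2000 = predict_price(2000)  # t = 6

print(f"2008: {price_2008:.4f} million dollars")
print(f"2000: {price_2000:.4f} million dollars")

# Compare the 2000 prediction with the actual price from the table.
actual_2000 = 0.95
print(f"difference in 2000: {price_2000 - actual_2000:.4f} million dollars")
```

The last line shows how far the model is from the table value for 2000, which is about $31,300.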
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data with the regression line. Horizontal axis: Time t in years since 1994. Vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value given by the regression
line at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of a single data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
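To make this concrete, here is a short sketch that computes the residual at each of the six data points from the table. We assume the usual sign convention (actual price minus predicted price), so a positive residual means the point lies above the line. The names points, predicted, and residuals are our own.

```python
# The six (t, p) data points from the table: t in years since 1994,
# p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predicted(t):
    """Price given by the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Residual = actual value minus the value on the line (assumed convention).
residuals = [(t, p - predicted(t)) for t, p in points]

for t, res in residuals:
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: residual = {res:+.4f} (point is {side} the line)")
```

Some residuals are positive and some are negative, which is what we expect when the line passes through the middle of the data.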
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
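For the curious, the least-squares calculation can be sketched in a few lines. This uses the standard textbook formulas for the slope and intercept of a least-squares line, which may be written differently in the PDF. The function name least_squares_line is our own, and the data points are the six from the table.

```python
# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t  # the line passes through (mean_t, mean_p)
    return slope, intercept

slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, this reproduces the regression equation p = 0.1264t + 0.2229 shown on the graph.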
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
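As an illustration, the correlation coefficient for the apartment data can be computed directly. This sketch uses the standard formula for the (Pearson) correlation coefficient, which we assume is the r shown on the graph. The function name correlation is our own.

```python
import math

# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def correlation(points):
    """Return the correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(points)
print(f"r is approximately {r:.3f}")
```

The result is about 0.973, which agrees with the value on the graph and is very close to 1, as expected for data that follow a positively sloped line.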
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing about
how good the fit is, but it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
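One common way to compute an r-squared value is to compare the squared residuals with how much the data vary around their average. The sketch below does this for the apartment data and the regression equation p = 0.1264t + 0.2229. We are assuming this standard definition matches the value Excel reports. The function name r_squared is our own.

```python
# The six (t, p) data points from the table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def r_squared(points, slope, intercept):
    """Return 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_p = sum(p for _, p in points) / len(points)
    ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
    ss_tot = sum((p - mean_p) ** 2 for _, p in points)
    return 1 - ss_res / ss_tot

r2 = r_squared(points, 0.1264, 0.2229)
print(f"r-squared is approximately {r2:.3f}")

# For a least-squares line, this should agree with squaring r = 0.9734.
print(f"0.9734 squared is approximately {0.9734 ** 2:.3f}")
```

Both values come out near 0.95, so the r-squared value for the apartment data is close to 1.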
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points with the trendline y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of data points with the trendline y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 40
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression line is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression line is y = -0.183x + 8.3267, with
R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 41
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Graph: the scatter plot with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\text{slope} = \frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
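As a quick check of this interpretation, the short Python sketch below evaluates the regression equation at two values of t that are one year apart and divides the change in p by the change in t. The function name `predicted_price` and the choice of t = 3 and t = 4 are made up for this illustration; only the equation p = 0.1264t + 0.2229 comes from the slides.

```python
def predicted_price(t):
    """Price in millions of dollars predicted by p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Change in p when t increases by 1 (here from t = 3 to t = 4).
delta_t = 4 - 3
delta_p = predicted_price(4) - predicted_price(3)
slope = delta_p / delta_t

print(round(slope, 4))  # 0.1264
```

Any other pair of t values one year apart should give the same result, because the slope of a line is the same everywhere.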
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
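The two calculations above can be sketched in a few lines of Python. This is only an illustration: the helper names `predicted_price` and `years_since_1994` are invented here, and the actual 2000 price of 0.95 (in millions of dollars) is the value from the table.

```python
def predicted_price(t):
    """Price in millions of dollars predicted by p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

def years_since_1994(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994

# Forecast beyond the data range: 2008 corresponds to t = 14.
p_2008 = predicted_price(years_since_1994(2008))

# Check against a year in the table: 2000 corresponds to t = 6,
# and the actual price that year was 0.95 (millions of dollars).
p_2000 = predicted_price(years_since_1994(2000))
error_2000 = p_2000 - 0.95

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${error_2000 * 1_000_000:,.0f}")
```

Keeping the year-to-t conversion in its own function makes it harder to plug a calendar year into the equation by mistake.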
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot with the regression line, as before.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
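To make the idea of a residual concrete, here is a minimal Python sketch that computes one for each of the six data points. It assumes the common sign convention of actual value minus the value on the line (the slides only describe the vertical distance), and the names `points`, `predicted_price`, and `residuals` are made up for this illustration.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predicted_price(t):
    """Price predicted by the regression equation p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# Residual = actual value minus the value on the line at the same t
# (sign convention assumed for this sketch).
residuals = [p - predicted_price(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {t:2d}: actual {p:.2f}, line {predicted_price(t):.4f}, "
          f"residual {res:+.4f} ({side} the line)")
```

Some residuals come out positive and some negative, which matches the graph: the line passes below some points and above others.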
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
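For readers who are curious, the least-squares calculation can be sketched in plain Python. This uses the standard textbook formulas for a line through the data (slope equals the sum of products of deviations divided by the sum of squared deviations in t), which may be presented differently in the course PDF. The function names are invented for this illustration, and none of this is required for the course.

```python
# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_of_squared_residuals(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

slope, intercept = least_squares_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line should do no better, for example one that is slightly steeper.
best = sum_of_squared_residuals(t_values, p_values, slope, intercept)
other = sum_of_squared_residuals(t_values, p_values, slope + 0.01, intercept)
print(best < other)
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph, p = 0.1264t + 0.2229.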
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
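The slides do not give a formula for r. As a hedged illustration, the Python sketch below uses the standard Pearson correlation formula on the apartment data; the function name `correlation_coefficient` is made up here. The second call reverses the order of the prices, an artificial change made only to show what a negative r looks like.

```python
import math

# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation_coefficient(t_values, p_values)
print(f"r = {r:.3f}")

# Reversing the prices makes them fall as t rises, so r changes sign.
r_falling = correlation_coefficient(t_values, p_values[::-1])
print(f"r = {r_falling:.3f}")
```

The first value is close to the r = 0.9734 shown on the graph, and the second is the same size but negative, which corresponds to a good fit with a downward-sloping line.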
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
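One way to see how r squared relates to the residuals is the sketch below. It squares the r value from the graph and compares it with 1 minus the ratio of the squared residuals to the squared deviations of p from its mean, a standard identity for a line fitted by least squares. The variable names are invented for this illustration, and the line's coefficients are the rounded ones from the slides, so the two results agree only approximately.

```python
# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r = 0.9734                      # correlation coefficient shown on the graph
r_squared_from_r = r ** 2

# The same quantity computed from the residuals of the regression line
# p = 0.1264t + 0.2229 (coefficients as rounded on the slides).
mean_p = sum(p_values) / len(p_values)
ss_res = sum((p - (0.1264 * t + 0.2229)) ** 2 for t, p in zip(t_values, p_values))
ss_tot = sum((p - mean_p) ** 2 for p in p_values)
r_squared_from_residuals = 1 - ss_res / ss_tot

print(f"{r_squared_from_r:.3f}")
print(f"{r_squared_from_residuals:.3f}")
```

Small residuals relative to the overall spread of the prices give an r-squared value near 1, which is why a value near 1 signals a good linear fit.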
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression line is y = 0.5091x + 1.94, with
R² = 0.9943.
Now consider the following set of data points.
[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression line is y = -0.183x + 8.3267, with
R² = 0.0084.
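The same contrast can be reproduced with a small Python sketch. The data points behind the two graphs are not listed in the slides, so the two data sets below are made up purely for illustration, as is the function name `r_squared`.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
# Made-up values that stay within 0.1 of the line y = 0.5x + 2.
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4]
# Made-up values with no upward or downward trend.
scattered = [4, 1, 8, 1, 4]

print(f"{r_squared(xs, nearly_linear):.2f}")
print(f"{r_squared(xs, scattered):.2f}")
```

The first data set gives a value close to 1 and the second a value close to 0, mirroring the two graphs above.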
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 42
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Graph: the scatter plot with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Vertical axis: price p in millions of $; horizontal axis: time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
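To see this interpretation of the slope in action, the short sketch below evaluates the regression equation at consecutive years and prints the difference. The function name predicted_price and the sample values of t are our own choices for illustration.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars


def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Whatever year we start from, moving one year forward changes the
# prediction by the slope.
for t in (0, 3, 9):
    increase = predicted_price(t + 1) - predicted_price(t)
    print(f"t={t} to t={t + 1}: predicted increase of ${increase * 1_000_000:,.0f}")
```

Each line of output reports the same increase, because a linear model changes by the same amount for every one-unit step in t.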
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the
years given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
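The predictions above can be reproduced with a few lines of Python. This is only a sketch: the function name predicted_price and the dictionary actual are our own, and the two actual prices are the ones quoted from the table in this presentation.

```python
def predicted_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Actual prices quoted from the table, in millions of dollars.
actual = {1994: 0.38, 2000: 0.95}

for year in (1994, 2000, 2008):
    t = year - 1994  # t=0 represents 1994
    line = f"{year}: predicted ${predicted_price(t) * 1_000_000:,.0f}"
    if year in actual:
        line += f", actual ${actual[year] * 1_000_000:,.0f}"
    print(line)
```

Note that 2008 lies outside the years in the table, so that prediction relies on the trend continuing.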
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with a box drawn around one data point, followed by a zoomed-in view of that box]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
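To make the idea concrete, the sketch below computes the residual at each of the six data points from the table. It assumes the common convention that a residual is the actual value minus the predicted value, so a point above the line has a positive residual; the names used are our own.

```python
# Data points (t, p) from the table: t = years since 1994,
# p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predicted_price(t):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (assumed sign convention).
residuals = [p - predicted_price(t) for t, p in zip(t_values, p_values)]

for t, p, res in zip(t_values, p_values, residuals):
    print(f"t={t:2d}: actual {p:.4f}, predicted {predicted_price(t):.4f}, residual {res:+.4f}")
```

Some residuals come out positive and some negative, which matches the picture: the line passes below some data points and above others.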
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
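The least-squares method chooses the line for which the sum of the squared residuals is as small as possible. As an informal check, the sketch below compares that sum for our regression line against a few other lines. The alternative slopes and intercepts are arbitrary values picked for illustration, and the function name is our own.

```python
# Data points (t, p) from the table: t = years since 1994,
# p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Add up the squared vertical distances between the points and the line."""
    return sum((p - (slope * t + intercept)) ** 2
               for t, p in zip(t_values, p_values))


# The regression line from the graph.
best = sum_squared_residuals(0.1264, 0.2229)

# A few other lines, chosen arbitrarily for comparison.
candidates = {
    "steeper slope (0.14)": (0.14, 0.2229),
    "flatter slope (0.11)": (0.11, 0.2229),
    "higher intercept (0.30)": (0.1264, 0.30),
    "lower intercept (0.15)": (0.1264, 0.15),
}

print(f"regression line: {best:.4f}")
for label, (m, b) in candidates.items():
    print(f"{label}: {sum_squared_residuals(m, b):.4f}")
```

Every alternative line gives a larger sum than the regression line. Squaring matters here: positive and negative residuals would largely cancel in a plain sum, so the plain sum alone cannot tell a good line from a poor one.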
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r on the graph; instead, it
gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1 (so it no longer shows whether
the slope is positive or negative).
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor
fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
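For the apartment data, we can check how r and r squared are related with a short sketch. It computes r from the standard formula and then squares it. As a cross-check, it also computes 1 minus the ratio of the squared residuals to the total squared variation in p, which for a least-squares line is the same number. All names here are our own.

```python
# Data points (t, p) from the table: t = years since 1994,
# p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t_values)
mean_t = sum(t_values) / n
mean_p = sum(p_values) / n

# Sums of squared deviations and of cross products about the means.
sxx = sum((t - mean_t) ** 2 for t in t_values)
syy = sum((p - mean_p) ** 2 for p in p_values)
sxy = sum((t - mean_t) * (p - mean_p) for t, p in zip(t_values, p_values))

# Correlation coefficient and its square.
r = sxy / (sxx * syy) ** 0.5
r_squared = r ** 2

# Cross-check: 1 - (sum of squared residuals) / (total squared variation in p).
slope = sxy / sxx
intercept = mean_p - slope * mean_t
sse = sum((p - (slope * t + intercept)) ** 2 for t, p in zip(t_values, p_values))
fraction_explained = 1 - sse / syy

print(f"r = {r:.4f}")
print(f"r squared = {r_squared:.4f}")
print(f"1 - SSE/SST = {fraction_explained:.4f}")
```

The r-squared value for this data is close to 1, which agrees with the good fit we see on the graph.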
Consider the following set of data points.
[Figure: scatter plot of points that rise steadily from left to right, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 43
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the graph shows the regression line
y = 0.5091x + 1.94 with R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the graph shows the regression line
y = -0.183x + 8.3267 with R² = 0.0084.
So, to summarize, a linear regression equation describes the
line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 44
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the data points and the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
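As a rough illustration of where these two numbers come from, the following Python sketch applies the standard least-squares formulas to the six data points listed earlier. The function names fit_line and correlation are our own, chosen only for this example; the course itself only asks you to use Excel.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def correlation(xs, ys):
    """Return the correlation coefficient r of the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


slope, intercept = fit_line(t, p)
r = correlation(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r is about {r:.3f}")
```

The slope and intercept agree with the equation on the graph to four decimal places, and r agrees with the value on the graph up to rounding in the last digit.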
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the data points and the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s
$\frac{\Delta p}{\Delta t}$.
So for our problem, we have
$\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
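To see this interpretation of the slope numerically, here is a small Python sketch. It uses the slope and intercept from the slides; the function name predicted_price is our own choice for illustration.

```python
SLOPE = 0.1264      # millions of dollars per year (from the regression equation)
INTERCEPT = 0.2229  # millions of dollars (from the regression equation)


def predicted_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Moving forward one year at a time always changes the prediction by the slope.
for t in range(0, 4):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t = {t} to t = {t + 1}: predicted change of ${change * 1_000_000:,.0f}")
```

Whatever year we start from, the predicted price rises by the same amount, which is exactly what it means for the model to be linear.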
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
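The two calculations above can be sketched in Python as follows. The function name predict_price is hypothetical, and the only inputs are the regression equation and the table value for 2000 quoted in the slides.

```python
SLOPE = 0.1264      # millions of dollars per year
INTERCEPT = 0.2229  # millions of dollars


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


# Forecast outside the data range (2008 corresponds to t = 14).
forecast_2008 = predict_price(2008)
print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")

# Check the model against a year that is in the table (2000 corresponds to t = 6).
actual_2000 = 0.95  # millions of dollars, from the table
model_2000 = predict_price(2000)
difference = model_2000 - actual_2000
print(f"2000 model: ${model_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${difference * 1_000_000:,.0f}")
```

For 2000 the model is off by roughly $31,300, which is small compared with the price itself.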
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the data points and the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: zoomed-in view of one data point and the nearby part of the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
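As a sketch of what the residuals look like for our apartment data, the following Python lines compute one residual per data point, taking each residual to be the actual price minus the price predicted by the regression equation. The variable names are our own.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

SLOPE = 0.1264
INTERCEPT = 0.2229

# Residual = actual value minus the value on the regression line.
residuals = [actual - (SLOPE * ti + INTERCEPT) for ti, actual in zip(t, p)]

for ti, res in zip(t, residuals):
    position = "above" if res > 0 else "below"
    print(f"t = {ti:2d}: residual = {res:+.4f} (point is {position} the line)")
```

Some points lie above the line and some below, so the residuals have mixed signs.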
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
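To get a feel for why the least-squares method works with squared residuals, the Python sketch below compares the regression line from the slides with three other lines. The other lines are hypothetical alternatives that we made up for comparison; they do not appear in the course material.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Actual price minus predicted price at each data point."""
    return [actual - (slope * ti + intercept) for ti, actual in zip(t, p)]


def sum_of_squares(slope, intercept):
    """Sum of the squared residuals for the line p = slope * t + intercept."""
    return sum(res ** 2 for res in residuals(slope, intercept))


regression_line = (0.1264, 0.2229)  # from the slides
# Hypothetical alternative lines, chosen only for comparison.
other_lines = [(0.0, 0.855), (0.15, 0.2229), (0.10, 0.35)]

best = sum_of_squares(*regression_line)
print(f"regression line: sum of squared residuals = {best:.4f}")

for slope, intercept in other_lines:
    plain_sum = sum(residuals(slope, intercept))
    squares = sum_of_squares(slope, intercept)
    print(f"p = {slope}t + {intercept}: plain sum = {plain_sum:+.4f}, "
          f"sum of squares = {squares:.4f}")
```

The flat line p = 0.855 has residuals that add up to zero because the positive and negative ones cancel, yet it clearly fits the data badly. Squaring the residuals removes that cancellation, and the regression line has the smallest sum of squares of the lines tried here.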
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
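Before those examples, here is a short Python sketch of how r and r squared are related, using the apartment data from earlier. The "falling" data set is a hypothetical mirror image of the prices that we invented to show a negative slope; it is not part of the course data.

```python
def correlation(xs, ys):
    """Return the correlation coefficient r of the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

# Hypothetical data: the same prices flipped so that they fall over time.
falling = [2.0 - price for price in p]

r_rising = correlation(t, p)
r_falling = correlation(t, falling)

print(f"rising data:  r = {r_rising:+.3f}, r squared = {r_rising ** 2:.3f}")
print(f"falling data: r = {r_falling:+.3f}, r squared = {r_falling ** 2:.3f}")
```

The sign of r follows the direction of the slope, while r squared is the same for both data sets. That is why r squared can describe how good the fit is but cannot tell us whether the line rises or falls.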
Consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the graph shows the regression line
y = 0.5091x + 1.94 with R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the graph shows the regression line
y = -0.183x + 8.3267 with R² = 0.0084.
So, to summarize, a linear regression equation describes the
line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 45
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis) against time t in years since 1994 (horizontal axis)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the data points and the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the data points and the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000, so the model underestimates the 1994 price by
$157,100. These values don’t have to be the same, however,
since the regression equation can’t match every point
exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
\[ \frac{0.1264}{1}. \]
Recall that the definition of slope is
\[ \text{slope} = \frac{\text{change in } y}{\text{change in } x} = \frac{\Delta y}{\Delta x}. \]
In this case we are using p and t, so it’s
\[ \frac{\Delta p}{\Delta t}. \]
So for our problem, we have
\[ \frac{\Delta p}{\Delta t} = \frac{0.1264}{1}. \]
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So, more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
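As a quick check of this interpretation, the short Python sketch below evaluates the regression equation at consecutive years and prints the change from each year to the next. The function name price_millions is only an illustrative choice and does not come from the presentation.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229 (p in millions of $).
def price_millions(t):
    """Model's predicted average price, in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Change in the predicted price from each year to the next, for t = 0 through 9.
yearly_changes = [price_millions(t + 1) - price_millions(t) for t in range(10)]

for t, change in enumerate(yearly_changes):
    # Multiply by 1,000,000 to convert millions of dollars to dollars.
    print(f"t={t} to t={t + 1}: +${change * 1_000_000:,.0f}")
```

Every step gives the same increase because the model is a straight line: its slope is the same everywhere.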
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only about $31,300, or
roughly 3%.
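The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the helper name predict_price and the idea of passing a calendar year (rather than t) are assumptions made for illustration.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # millions of dollars, the predicted price at t = 0 (1994)


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


forecast_2008 = predict_price(2008)  # outside the range of the table
estimate_2000 = predict_price(2000)  # inside the range of the table
actual_2000 = 0.95                   # table value for 2000, in millions of dollars

print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 estimate: ${estimate_2000 * 1_000_000:,.0f}")
print(f"2000 estimate minus actual: ${(estimate_2000 - actual_2000) * 1_000_000:,.0f}")
```

The 2008 value is an extrapolation, so it is only meaningful if the trend seen from 1994 to 2004 continued.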
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then at what the correlation coefficient
(or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line drawn through the data points.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
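To make the idea concrete, the sketch below computes the residual for each of the six data points listed earlier in the presentation. It assumes the common sign convention "actual value minus value on the line", so a positive residual means the point lies above the line; the presentation itself only describes the residual as a vertical distance.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Height of the regression line p = 0.1264t + 0.2229 at time t."""
    return 0.1264 * t + 0.2229


# Assumed sign convention: residual = actual value - value on the line.
residuals = [p - predicted(t) for t, p in points]

for (t, p), res in zip(points, residuals):
    side = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual {p:.2f}, line {predicted(t):.4f}, "
          f"residual {res:+.4f} ({side} the line)")
```

Some residuals are positive and some are negative, which is why simply adding them up is not a useful measure of fit.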
There is a method that allows us to minimize the sum of
the squares of all of the residuals. (Squaring keeps positive
and negative residuals from canceling each other out.)
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
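For readers who are curious, here is a minimal Python sketch of the least-squares calculation for a straight line, applied to the six data points from the table. It uses the standard textbook formulas for the slope and intercept that make the sum of the squared residuals as small as possible; the function names are illustrative, and this is not necessarily the exact procedure Excel follows internally.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of the squared vertical distances between the points and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")

best = sum_squared_residuals(points, slope, intercept)
# A line with a slightly steeper slope gives a larger sum of squared residuals.
other = sum_squared_residuals(points, slope + 0.01, intercept)
print(f"least-squares line: {best:.4f}, steeper line: {other:.4f}")
```

Rounded to four decimal places, the computed slope and intercept match the equation shown on the graph.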
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped
line.
A value very close to -1 indicates a very good fit with a
negatively sloped line.
A value very close to 0 indicates a very poor linear fit,
meaning there is little or no linear relationship between
the variables in this case.
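The following sketch computes r for the apartment data using the usual formula for the (Pearson) correlation coefficient. The function name is an illustrative choice, and the mirrored data set at the end is constructed only to show how the sign of r follows the direction of the trend.

```python
import math

# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(points)
print(f"r = {r:.3f}")

# Flipping the prices turns the rising trend into a falling one:
# the size of r stays the same, but its sign becomes negative.
falling = [(t, -p) for t, p in points]
r_falling = correlation_coefficient(falling)
print(f"r for the mirrored data = {r_falling:.3f}")
```

The result agrees with the value of r shown on the graph to about three decimal places.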
On a chart trendline, Excel does not display the value of r;
instead, it displays the value of r squared.
The r-squared value tells us how close the fit is, but not
the direction of the slope, because it is always between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
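Before turning to those examples, here is a sketch of one common way to compute r-squared for the apartment data: one minus the ratio of the sum of squared residuals to the total squared variation of the prices about their mean. For a straight-line fit with a single input variable, this equals the square of r. The variable names are illustrative.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Least-squares slope and intercept of the regression line.
slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
         / sum((t - mean_t) ** 2 for t, _ in points))
intercept = mean_p - slope * mean_t

# Squared residuals about the line, and squared deviations about the mean price.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
ss_tot = sum((p - mean_p) ** 2 for _, p in points)

r_squared = 1 - ss_res / ss_tot
print(f"r-squared = {r_squared:.4f}")
print(f"square of the r from the graph = {0.9734 ** 2:.4f}")
```

The two printed numbers agree closely; any small difference comes from the rounding of r on the graph.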
Consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
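If you ever need r itself rather than r squared, it can be recovered for a straight-line fit: its size is the square root of r squared, and its sign matches the sign of the slope. The sketch below applies this to the two trendlines from the examples; the helper name r_from_r_squared is hypothetical.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover r from r-squared, taking its sign from the trendline's slope."""
    magnitude = math.sqrt(r_squared)
    return magnitude if slope >= 0 else -magnitude


# First example: y = 0.5091x + 1.94 with R² = 0.9943 (clear linear pattern).
r_first = r_from_r_squared(0.9943, 0.5091)
# Second example: y = -0.183x + 8.3267 with R² = 0.0084 (scattered points).
r_second = r_from_r_squared(0.0084, -0.183)

print(f"first example:  r = {r_first:.3f}")
print(f"second example: r = {r_second:.3f}")
```

The first value is close to 1, while the second is slightly negative and close to 0, consistent with what the two scatter plots show.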
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
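The point that a regression equation exists even for scattered data can be illustrated with a small sketch. The data below are hypothetical and deliberately scattered; they are not taken from the presentation.

```python
# Hypothetical, deliberately scattered data (not from the presentation).
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 1, 6, 2, 5, 2]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

# The least-squares formulas still produce a line...
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
# ...but the r-squared value shows that the line describes the data poorly.
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"y = {slope:.4f}x + {intercept:.4f}")
print(f"r-squared = {r_squared:.4f}")
```

A line is always produced, so it is the r-squared value, not the existence of an equation, that tells us whether the line is worth using.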
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 46
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
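One way to see both observations numerically is to compute the slope between each pair of neighboring points, as in the hedged Python sketch below. The variable names are illustrative.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Slope of the segment joining each pair of neighboring points.
segment_slopes = [
    (p2 - p1) / (t2 - t1)
    for (t1, p1), (t2, p2) in zip(points, points[1:])
]

for (t1, _), (t2, _), s in zip(points, points[1:], segment_slopes):
    print(f"from t={t1} to t={t2}: slope {s:.3f} million $ per year")

# Every slope is positive, so the prices are increasing...
always_increasing = all(s > 0 for s in segment_slopes)
# ...but the slopes differ, so the points do not all lie on a single line.
on_one_line = max(segment_slopes) - min(segment_slopes) < 1e-9

print(always_increasing, on_one_line)
```

Because the segment slopes are all positive but unequal, no single line passes through every point, which is why a best-fit line is needed.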
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing about the
quality of the fit (though not the direction of the slope), and it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline shown on the plot is
y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline shown on the plot is
y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 47
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
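The six points listed above can be laid out as a small table of calendar year, t, and price. The sketch below is only an illustration: the names `points`, `BASE_YEAR`, and `to_table` are made up for this example, and the values are the ones given in the text.

```python
# Data points (t, p) as listed in the text:
# t = years since 1994, p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

BASE_YEAR = 1994  # t = 0 represents 1994


def to_table(points, base_year=BASE_YEAR):
    """Return rows of (calendar year, t, price in millions of $)."""
    return [(base_year + t, t, p) for t, p in points]


table = to_table(points)
for year, t, p in table:
    print(f"{year}  t={t:<2}  p={p:.2f}")
```

Listing the rows this way makes it easy to confirm that the prices rise as t rises, which is what the scatter plot shows next.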
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) versus time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Graph: the scatter plot of price p versus time t, with the best-fit line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
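The two calculations above can be repeated in a few lines of code. This is a minimal sketch using the rounded coefficients from the slides; the function name `predict_price` and the constant names are illustrative, not part of the course materials.

```python
SLOPE = 0.1264      # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229  # predicted price (millions of $) at t = 0, i.e. 1994
BASE_YEAR = 1994


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


p_2008 = predict_price(2008)  # extrapolation beyond the 1994-2004 data
p_2000 = predict_price(2000)  # inside the data range; the table lists 0.95
print(round(p_2008, 4), round(p_2000, 4))
```

The 2000 value can be checked against the table, while the 2008 value is an extrapolation that holds only if the trend continued.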
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: scatter plot of price p in millions of $ versus time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Graph: the regression line with a box drawn around one data point, followed by a zoomed-in view of that box]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
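To make the idea concrete, the residual at each data point can be computed directly. The sketch below assumes the common sign convention of observed value minus predicted value (the slides describe only the vertical distance), and the name `residual` is illustrative.

```python
# Data from the text: t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
SLOPE, INTERCEPT = 0.1264, 0.2229  # rounded coefficients from the slides


def residual(t, p):
    """Observed price minus the price the line predicts at the same t."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]
for (t, p), e in zip(points, residuals):
    print(f"t={t:<2} observed={p:.2f} residual={e:+.4f}")
```

Points above the line have positive residuals and points below it have negative ones, which is why the residuals are squared before being added up in the least-squares method.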
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
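For readers curious about what Excel is doing, here is a minimal sketch of the standard textbook least-squares formulas for a line with one input variable. It is not necessarily how the course PDF presents them, and the function name `least_squares_line` is illustrative.

```python
# Data from the text: t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Slope and intercept minimizing the sum of squared vertical residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = least_squares_line(points)
print(round(slope, 4), round(intercept, 4))
```

Rounded to four decimal places, the result agrees with the equation p = 0.1264t + 0.2229 shown on the graph.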
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
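The correlation coefficient for the apartment data can be computed with the usual Pearson formula. The sketch below is an illustration only; the function name `correlation` is an assumption of this example, and the data are the six points given earlier.

```python
import math

# Data from the text: t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Pearson correlation coefficient r between the two coordinates."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_pp = sum((p - mean_p) ** 2 for _, p in points)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(points)
print(f"r = {r:.4f}")
```

The result is close to 1 and positive, in line with the upward-sloping line on the graph. Reversing the sign of every price would give the same magnitude with a negative sign.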
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing about the
quality of the fit (though not the direction of the slope), and it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
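For a least-squares line with one input variable, r squared can also be computed from the residuals: it is 1 minus the sum of squared residuals divided by the total squared variation of the prices around their mean. The sketch below applies that standard identity to the apartment data; the function names `fit_line` and `r_squared` are illustrative.

```python
# Data from the text: t = years since 1994, p = price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Least-squares slope and intercept for a list of (t, p) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    slope = (sum((t - mean_t) * (p - mean_p) for t, p in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    return slope, mean_p - slope * mean_t


def r_squared(points):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    slope, intercept = fit_line(points)
    mean_p = sum(p for _, p in points) / len(points)
    ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)
    ss_tot = sum((p - mean_p) ** 2 for _, p in points)
    return 1 - ss_res / ss_tot


r2 = r_squared(points)
print(f"R^2 = {r2:.4f}")
```

The value agrees, up to rounding, with the square of r = 0.9734 from the graph, and small residuals push it toward 1.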
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the trendline shown on the plot is
y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the trendline shown on the plot is
y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 48
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) versus time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
points and graph it over them.
[Graph: the scatter plot of price p versus time t, with the best-fit line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the scatter plot with the best-fit line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
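As a quick check of this interpretation, the short Python sketch below evaluates the regression equation at two consecutive years and takes the difference. The names SLOPE, INTERCEPT, and predict_price are only illustrative; they are not part of the presentation.

```python
# Regression equation from the graph: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Return the predicted average price (millions of $) for a given t."""
    return SLOPE * t + INTERCEPT


# The predicted change over any one-year step equals the slope.
change = predict_price(5) - predict_price(4)
print(f"Predicted change per year: ${change * 1_000_000:,.0f}")
```

Any other pair of consecutive t values gives the same difference, which is what it means for the model to be linear.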
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300.
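The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name predict_price and the constant BASE_YEAR are made up for illustration, and the value 0.95 is the table entry for 2000.

```python
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


p_2008 = predict_price(2008)  # a year beyond the table (t = 14)
p_2000 = predict_price(2000)  # a year that appears in the table (t = 6)
actual_2000 = 0.95            # table value for 2000, in millions of $

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million")
print(f"2000 prediction minus actual: {p_2000 - actual_2000:.4f} million")
```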
It is important to remember that the regression equation
is just a model, and it generally won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line drawn through the data points.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
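To make the idea concrete, the sketch below computes the residual at each of the six (t, p) points from the table, using the regression equation p = 0.1264t + 0.2229. It assumes the common sign convention of actual value minus predicted value, so a positive residual means the data point lies above the line; the presentation itself only describes the residual as a vertical distance.

```python
# (t, p) points from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264
INTERCEPT = 0.2229

# Residual = actual value minus the value the line predicts at the same t.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}")
```

Some residuals are positive and some are negative, because the line passes below some points and above others.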
There is a method that finds the line that minimizes the
sum of the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly
complicated, you will not be required to use them in the
course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
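For the curious, here is a minimal Python sketch of the least-squares calculation for a single line, applied to the six (t, p) points from the table. It uses the standard closed-form formulas for the slope and intercept; whether the PDF presents them in exactly this form is an assumption. The variable names are illustrative only.

```python
# (t, p) points from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Closed-form least-squares formulas for the line p = slope * t + intercept.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation p = 0.1264t + 0.2229 shown on the graph.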
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
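As an illustration, the sketch below computes the correlation coefficient for the six (t, p) points from the apartment price table. It uses the usual formula for the Pearson correlation coefficient, which is an assumption about how the value on the graph was obtained; the variable names are illustrative.

```python
import math

# (t, p) points from the table: t in years since 1994, p in millions of $.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
s_pp = sum((p - mean_p) ** 2 for _, p in data)

# Pearson correlation coefficient; always between -1 and 1.
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.4f}")
```

The result should come out very close to the r = 0.9734 shown on the graph, and it is positive because the prices rise as t increases.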
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us how well the line fits in much
the same way, but it will only be between 0 and 1, so it does
not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
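The relationship between r and r squared can be checked directly. The sketch below squares the r = 0.9734 from the apartment price graph, and also shows that a negative r of the same size gives the same r-squared value. The variable names are illustrative.

```python
r = 0.9734              # correlation coefficient from the apartment price graph
r_squared = r ** 2      # the kind of value Excel reports

print(f"r squared = {r_squared:.4f}")

# A negative r of the same size has the same square, so r squared
# alone does not tell us the direction of the slope.
r_negative = -0.9734
same_square = abs(r_negative ** 2 - r_squared) < 1e-12
print(f"Same r squared for r = {r_negative}: {same_square}")
```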
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the regression line y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how
closely the line fits, but it does not show whether the slope is
positive or negative, and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 50
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
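The comparison between the predicted and actual 1994 prices can be sketched in a few lines of Python. This is only an illustration: the names `SLOPE`, `INTERCEPT`, and `predicted_price` are made up for this example, and the numbers come from the regression equation and the data point (0, 0.38).

```python
# Coefficients of the regression equation p = 0.1264t + 0.2229
# (p is in millions of dollars, t is years since 1994).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Return the model's predicted price, in millions of dollars."""
    return SLOPE * t + INTERCEPT


predicted_1994 = predicted_price(0)   # at t = 0 this is just the p-intercept
actual_1994 = 0.38                    # from the data point (0, 0.38)
gap_1994 = actual_1994 - predicted_1994

print(f"Predicted 1994 price: ${predicted_1994 * 1_000_000:,.0f}")
print(f"Actual 1994 price:    ${actual_1994 * 1_000_000:,.0f}")
print(f"Difference:           ${gap_1994 * 1_000_000:,.0f}")
```

The difference is not an error in the calculation; it simply shows how far the model is from this one data point.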
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
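The two calculations above (the 2008 forecast and the check against the year 2000) can be reproduced with a short Python sketch. The function name `predict_price_millions` and the constant `BASE_YEAR` are hypothetical names chosen for this example; only the coefficients and the table value 0.95 come from the slides.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price_millions(year, slope=0.1264, intercept=0.2229):
    """Predicted average price (millions of $) for a calendar year."""
    t = year - BASE_YEAR
    return slope * t + intercept


price_2008 = predict_price_millions(2008)   # t = 14, outside the data range
price_2000 = predict_price_millions(2000)   # t = 6, inside the data range
error_2000 = price_2000 - 0.95              # the table gives 0.95 for 2000

# The difference between two consecutive years is always the slope.
yearly_increase = predict_price_millions(2001) - predict_price_millions(2000)

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${error_2000 * 1_000_000:,.0f}")
print(f"Predicted increase per year:  ${yearly_increase * 1_000_000:,.0f}")
```

Converting the calendar year to t inside the function avoids the easy mistake of plugging 2008 directly into the equation.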
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
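To make the idea of a residual concrete, here is a minimal Python sketch that computes the vertical distance between each data point and the regression line. One assumption is labeled in the code: the residual is given a sign (actual price minus predicted price), which is a common convention but is not stated in the slides. The name `regression_line` is illustrative.

```python
# Data points (t, p) from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def regression_line(t):
    """Regression equation from the slides: p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Assumed sign convention: residual = actual price - predicted price.
residuals = [p - regression_line(t) for t, p in data]

for (t, p), residual in zip(data, residuals):
    position = "above" if residual > 0 else "below"
    print(f"t = {t:2d}: point is {abs(residual):.4f} {position} the line")
```

With this data, the points at t=0 and t=10 lie above the line and the other four lie below it.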
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
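Although the formulas are not required for the course, curious readers may find the following Python sketch useful. It assumes the standard textbook closed-form formulas for a least-squares line (slope equals the sum of the products of the deviations divided by the sum of the squared t-deviations); this may not match the presentation in the course PDF, and the function name `least_squares_line` is made up for this example.

```python
# Data points (t, p) from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the regression equation p = 0.1264t + 0.2229 shown on the graph.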
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
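For readers who want to see where r comes from, the following Python sketch computes it for the apartment data. It assumes that r is the usual Pearson correlation coefficient; the function name `correlation_coefficient` and the small `decreasing` data set are illustrative additions, not part of the slides.

```python
import math

# Data points (t, p) from the table; p is in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(data)

# Made-up points that lie exactly on a line with negative slope.
decreasing = [(0, 3), (1, 2), (2, 1)]
r_decreasing = correlation_coefficient(decreasing)

print(f"r for the apartment data: {r:.3f}")
print(f"r for a perfectly decreasing line: {r_decreasing:.3f}")
```

The apartment data give a value of about 0.973, in line with the value shown on the graph, while points that fall exactly on a downward-sloping line give -1.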
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how
closely the line fits, but it does not show whether the slope is
positive or negative, and it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
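A tiny Python sketch shows why the r-squared value always lands between 0 and 1. The value 0.9734 is the correlation coefficient from the apartment example; the negative value is a hypothetical one included only for comparison.

```python
r_positive = 0.9734    # correlation coefficient from the apartment example
r_negative = -0.9734   # hypothetical: an equally strong decreasing trend

# Squaring removes the sign, so both give the same r-squared value.
r_squared_positive = r_positive ** 2
r_squared_negative = r_negative ** 2

print(f"r = {r_positive}: r squared = {r_squared_positive:.4f}")
print(f"r = {r_negative}: r squared = {r_squared_negative:.4f}")
```

Because squaring removes the sign, the r-squared value measures how close the fit is but not whether the line slopes up or down; the slope of the regression equation supplies that information.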
Consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points, with the horizontal axis running from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 51
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
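As a quick check of this interpretation, the following Python sketch evaluates the regression equation at two consecutive values of t and confirms that the predicted price rises by the slope. The function name predict_price and the constant names are our own choices for illustration and are not part of the presentation.

```python
# Regression equation from the presentation: p = 0.1264t + 0.2229
# (t in years since 1994, p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Predicted average price, in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# Increasing t by 1 year changes the prediction by the slope,
# no matter which year we start from.
increase = predict_price(5) - predict_price(4)
print(f"Predicted increase per year: ${increase * 1_000_000:,.0f}")
```

Any other pair of consecutive t-values would give the same increase, which is what it means for the model to be linear.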
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
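The two calculations above can be sketched in Python as follows. This is only an illustration: the names predict_price and BASE_YEAR are hypothetical, and the actual price for 2000 is the table value quoted above.

```python
def predict_price(t, slope=0.1264, intercept=0.2229):
    """Regression equation p = 0.1264t + 0.2229 (p in millions of dollars)."""
    return slope * t + intercept


BASE_YEAR = 1994  # t = 0 corresponds to 1994

# Forecast beyond the data: 2008 is t = 14.
t_2008 = 2008 - BASE_YEAR
forecast_2008 = predict_price(t_2008)

# Check against a year in the table: 2000 is t = 6, actual price 0.95.
predicted_2000 = predict_price(2000 - BASE_YEAR)
actual_2000 = 0.95
gap_2000 = predicted_2000 - actual_2000

print(f"2008 forecast: {forecast_2008:.4f} million")
print(f"2000 predicted: {predicted_2000:.4f} million, actual: {actual_2000:.2f} million")
print(f"Difference for 2000: {gap_2000:+.4f} million")
```

The difference for 2000 is about 0.03 million dollars, which is the sense in which the model is "pretty close" for that year.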
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
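To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six data points given in the presentation, using the regression equation p = 0.1264t + 0.2229. The sign convention (actual value minus predicted value) is a common one but is our assumption here, as is the function name residual.

```python
# Data points (t, p) from the presentation's table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE, INTERCEPT = 0.1264, 0.2229


def residual(t, p):
    """Actual price minus the price predicted by the regression line."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [residual(t, p) for t, p in points]

for (t, p), res in zip(points, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}: actual={p:.2f}, residual={res:+.4f} ({position} the line)")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below the line.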
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
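For readers who are curious, the least-squares method can be sketched in a few lines of Python. The sketch below uses the standard textbook formulas for the slope and intercept of the line that minimizes the sum of the squared residuals. It is not taken from the course PDF, and the function name least_squares_line is hypothetical.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Slope = sum of (t - mean_t)(p - mean_p) divided by sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


# Data points (t, p) from the presentation's table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
slope, intercept = least_squares_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the regression equation p = 0.1264t + 0.2229 used throughout this presentation.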
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
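The presentation does not give a formula for r. As a sketch, the following Python code computes the usual (Pearson) correlation coefficient, which we assume is the value reported on the graph. The function name correlation and the second, made-up data set are for illustration only.

```python
import math


def correlation(points):
    """Pearson correlation coefficient r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Apartment data (t, p) from the presentation's table.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)
print(f"r = {r:.3f}")

# Made-up points lying exactly on a downward-sloping line give r = -1.
r_down = correlation([(0, 3), (1, 2), (2, 1), (3, 0)])
print(f"r for a perfect negative fit = {r_down:.1f}")
```

For the apartment data, r comes out close to 1, which matches the strong upward linear pattern in the scatter plot.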
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us how good the fit is in much the
same way, but it will only be between 0 and 1, so it does not
show whether the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
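One common way to compute r squared is to compare the squared residuals around the regression line with the squared deviations around the average y-value. The Python sketch below does this under that assumption. The function name r_squared is hypothetical, and the second data set is made up purely to show a poor fit.

```python
def r_squared(points):
    """R-squared of the least-squares line fitted to (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Sum of squared deviations around the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# Apartment data (t, p) from the presentation's table.
prices = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
print(f"R-squared for the apartment data: {r_squared(prices):.4f}")

# Made-up points with no clear linear trend (for illustration only).
scattered = [(0, 5), (1, 1), (2, 4), (3, 2), (4, 3)]
print(f"R-squared for the scattered points: {r_squared(scattered):.4f}")
```

The apartment data give a value close to 1, while the scattered points give a value close to 0.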
Consider the following set of data points.
[Figure: scatter plot of data points with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of data points with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of a line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994)      0     2     4     6     8     10
p (millions of $)         0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
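Since t counts years from 1994, it can help to see how each t-value maps back to a calendar year. The following Python sketch shows that conversion and checks the observation that the p-values increase with t. The helper names year_to_t and t_to_year are our own and are only for illustration.

```python
BASE_YEAR = 1994  # the presentation sets t = 0 at 1994


def year_to_t(year):
    """Convert a calendar year to t, the number of years since 1994."""
    return year - BASE_YEAR


def t_to_year(t):
    """Convert t back to a calendar year."""
    return BASE_YEAR + t


# (t, p) pairs from the presentation; p is in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

for t, p in points:
    print(f"{t_to_year(t)}: t = {t:2d}, p = {p:.2f}")

# The p-values increase as t increases.
prices = [p for _, p in points]
always_increasing = all(a < b for a, b in zip(prices, prices[1:]))
print("Prices always increasing:", always_increasing)
```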
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled with the equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points, x from 0 to 12 and y from 0 to 8, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points, x from 0 to 12 and y from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
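The contrast between the two plots can be sketched numerically. The points behind the slides' plots are not given, so both data sets below are hypothetical, and `r_squared` is an illustrative helper rather than anything Excel exposes under that name.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

xs = [1, 3, 5, 7, 9, 11]

# Hypothetical points that stay close to a straight line.
nearly_linear = [2.4, 3.5, 4.6, 5.4, 6.6, 7.5]

# Hypothetical points with no visible linear pattern.
scattered = [4, 17, 2, 12, 6, 9]

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)
print(f"nearly linear: r squared = {r2_linear:.4f}")
print(f"scattered:     r squared = {r2_scattered:.4f}")
```

The first value comes out close to 1 and the second close to 0, matching what the plots suggest by eye.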
So, to summarize, a linear regression equation describes
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 53
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average
price of a two-bedroom apartment in downtown New York
City from 1994 to 2004, where t=0 represents 1994.

Year                        1994  1996  1998  2000  2002  2004
t (years since 1994)        0     2     4     6     8     10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the scatter plot with the fitted line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
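For readers who want to see where the numbers on the graph come from, here is a minimal Python sketch that fits a line to the six data points. It uses the standard least-squares formulas for slope and intercept, which the course does not require you to use; `fit_line` is an illustrative name, not an Excel function.

```python
def fit_line(ts, ps):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_p = sum(ps) / n
    sxy = sum((t - mean_t) * (p - mean_p) for t, p in zip(ts, ps))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    slope = sxy / sxx
    intercept = mean_p - slope * mean_t
    return slope, intercept

# The six (t, p) data points: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

slope, intercept = fit_line(ts, ps)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result agrees with the equation shown on the graph, p = 0.1264t + 0.2229.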
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} = \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown
New York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
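The two calculations above can be sketched in Python. The coefficients are the rounded values from the regression equation; the names `predicted_price`, `SLOPE`, `INTERCEPT`, and `BASE_YEAR` are illustrative and not part of the course materials.

```python
SLOPE = 0.1264       # millions of $ per year, from the regression equation
INTERCEPT = 0.2229   # millions of $, predicted price at t = 0
BASE_YEAR = 1994     # the year that t = 0 represents

def predicted_price(year):
    """Predicted average price (in millions of $) for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT

for year in (2000, 2008):
    price = predicted_price(year)
    print(f"{year}: p = {price:.4f} million, or about ${price * 1_000_000:,.0f}")
```

For 2000 this gives about 0.9813 and for 2008 about 1.9925, the same values worked out by hand above.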
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
particular data point and the value that the regression
line predicts for it.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Figure: close-up of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
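To make the idea of a residual concrete, here is a minimal Python sketch that computes one residual per data point, using the rounded regression equation p = 0.1264t + 0.2229. The sign convention (actual minus predicted) is the usual one and is an assumption here, since the slides only describe the vertical distance; `predict` is an illustrative name.

```python
# The six (t, p) data points: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predict(t):
    """Price predicted by the regression equation, in millions of $."""
    return 0.1264 * t + 0.2229

# Residual = actual price minus predicted price (assumed sign convention).
residuals = [p - predict(t) for t, p in zip(ts, ps)]

for t, p, res in zip(ts, ps, residuals):
    print(f"t = {t:2d}: actual {p:.4f}, predicted {predict(t):.4f}, residual {res:+.4f}")
```

Some residuals are positive (the point lies above the line) and some are negative (the point lies below it), which is why the line passes through the middle of the data rather than through every point.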
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
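As a small check on what "least squares" means, the sketch below compares the regression line with one other candidate: the line through the first and last data points, (0, 0.38) and (10, 1.60), which has slope 0.122 and intercept 0.38. That comparison line is chosen here only for illustration, and `sum_squared_residuals` is an illustrative name.

```python
# The six (t, p) data points: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances from the data points to a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in zip(ts, ps))

# The regression line from the slides (rounded coefficients).
sse_regression = sum_squared_residuals(0.1264, 0.2229)

# A different line: the one connecting the first and last data points.
sse_two_point = sum_squared_residuals(0.122, 0.38)

print(f"regression line: {sse_regression:.4f}")
print(f"two-point line:  {sse_two_point:.4f}")
```

The regression line gives the smaller total, even though the two-point line passes exactly through two of the data points.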
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit, so
there is little or no linear relationship between the
variables in this case.
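The value of r for the apartment data can be reproduced with a short Python sketch. It uses the standard formula for the correlation coefficient, which the slides do not state, and `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired data (standard formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The six (t, p) data points: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r = pearson_r(ts, ps)
print(f"r = {r:.3f}")
```

The result is positive and close to 1, in line with the r = 0.9734 shown on the graph: prices rise over time and the points lie near the line.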
Excel does not display the value of r; instead, it displays
the value of r squared.
The r-squared value tells us much the same thing, but it
is always between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points, x from 0 to 12 and y from 0 to 8, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points, x from 0 to 12 and y from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation describes
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
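One way to connect r squared back to the residuals is sketched below for the apartment data. It assumes the common definition of r squared as 1 minus the ratio of the squared residuals to the squared deviations from the mean price; the slides do not give this formula, and all variable names are illustrative.

```python
# The six (t, p) data points: t in years since 1994, p in millions of $.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predict(t):
    """Price predicted by the regression equation, in millions of $."""
    return 0.1264 * t + 0.2229

mean_p = sum(ps) / len(ps)

# Scatter left over after fitting the line (squared residuals).
ss_residual = sum((p - predict(t)) ** 2 for t, p in zip(ts, ps))

# Total scatter of the prices around their mean.
ss_total = sum((p - mean_p) ** 2 for p in ps)

r_squared = 1 - ss_residual / ss_total
print(f"r squared = {r_squared:.3f}")
```

The result is close to 1, and it agrees with squaring the correlation coefficient r = 0.9734 reported for these data.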
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 54
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average
price of a two-bedroom apartment in downtown New York
City from 1994 to 2004, where t=0 represents 1994.

Year                        1994  1996  1998  2000  2002  2004
t (years since 1994)        0     2     4     6     8     10
p (price in millions of $)  0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the scatter plot with the fitted line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \frac{\text{change in } y}{\text{change in } x} = \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
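To see the slope as a yearly rate, here is a minimal Python sketch. It assumes only the regression equation p=0.1264t+0.2229 from the slides; the function name predict_price is our own choice for illustration.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229

# The change in the prediction over any one-year step equals the slope.
one_year_change = predict_price(5) - predict_price(4)

# Convert from millions of dollars to dollars and round to the nearest dollar.
dollars_per_year = round(one_year_change * 1_000_000)
print(dollars_per_year)
```

Any two consecutive values of t would give the same change, because the model is a straight line.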
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by only
$31,300, or about 3%.
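The two calculations above can be sketched in Python as follows. This is only an illustration: the helper name predict_price and the constant BASE_YEAR are our own, and the only inputs are the regression equation and the actual 2000 price from the table.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994

def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229

# Forecast outside the data range (t = 14).
forecast_2008 = predict_price(2008)

# Check the model against a year that is in the table (t = 6).
fitted_2000 = predict_price(2000)
actual_2000 = 0.95
error_2000 = fitted_2000 - actual_2000

print(f"2008 forecast: ${forecast_2008 * 1e6:,.0f}")
print(f"2000 model: ${fitted_2000 * 1e6:,.0f}, actual: ${actual_2000 * 1e6:,.0f}")
print(f"2000 error: ${error_2000 * 1e6:,.0f}")
```

Writing the function in terms of the calendar year keeps the conversion to t in one place, so we cannot plug in 2008 where 14 was meant.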
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line: the actual p-value minus
the p-value on the line at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: close-up of one data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
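As a small illustration, we can compute the residual for each of the six data points from the table. This sketch assumes the usual sign convention (actual value minus predicted value), so a point above the line has a positive residual; the names data and predict_price are our own.

```python
# (t, p) pairs from the table: years since 1994, price in millions of dollars
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

def predict_price(t):
    """Value on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229

# residual = actual - predicted (positive above the line, negative below it)
residuals = [p - predict_price(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t={t:2d}  actual={p:.2f}  predicted={predict_price(t):.4f}  residual={res:+.4f}")
```

Some residuals come out positive and some negative, which is why simply adding them up is not a good measure of how far the points are from the line.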
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. (Squaring keeps positive
and negative residuals from canceling each other out.)
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
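For the curious, here is a minimal sketch of a least-squares fit in Python. It uses the standard closed-form formulas for a line, which may be presented differently in the course PDF, and it is not required for the course. The variable names are our own; the data are the six points from the table.

```python
# (t, p) pairs from the table: years since 1994, price in millions of dollars
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

# Sums of products of deviations from the means
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)

# The slope and intercept that minimize the sum of squared residuals
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation p=0.1264t+0.2229 shown on the graph.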
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
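Here is a hedged sketch of how r can be computed for our six data points. It assumes that r is the usual (Pearson) correlation coefficient; the slides do not give the formula, so treat this as an illustration and not as the course's method.

```python
import math

# (t, p) pairs from the table: years since 1994, price in millions of dollars
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n

s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)
s_pp = sum((p - p_mean) ** 2 for _, p in data)

# Pearson correlation coefficient: always between -1 and 1
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.4f}")
```

The result is positive and close to 1, which matches the upward-sloping line and the value of about 0.9734 shown on the graph.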
Excel will not display the value of r on the chart; instead,
it gives the value of r squared.
The r-squared value basically tells us the same thing about
how good the fit is, but it will only be between 0 and 1, so
it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
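To connect r squared with residuals, the sketch below computes it in the common way, as 1 minus (sum of squared residuals) divided by (sum of squared deviations of p from its mean), and checks that it equals the square of r. This assumes a least-squares line with an intercept; the variable names are our own.

```python
# (t, p) pairs from the table: years since 1994, price in millions of dollars
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
t_mean = sum(t for t, _ in data) / n
p_mean = sum(p for _, p in data) / n
s_tp = sum((t - t_mean) * (p - p_mean) for t, p in data)
s_tt = sum((t - t_mean) ** 2 for t, _ in data)

# Least-squares line through the data
slope = s_tp / s_tt
intercept = p_mean - slope * t_mean

# Squared residuals around the line vs. squared deviations around the mean
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - p_mean) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot
r = s_tp / (s_tt * ss_tot) ** 0.5

print(f"r squared = {r_squared:.4f}, r = {r:.4f}")
```

Small residuals make ss_res small compared with ss_tot, which pushes r squared toward 1.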
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points lying close to a straight line, with the trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with the trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
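The two extremes can be sketched with a small Python function. The points below are hypothetical and are not the ones plotted on the slides (those values are not listed); they are chosen so that one set lies exactly on a line and the other has no linear trend at all. The function name r_squared is our own.

```python
def r_squared(points):
    """Square of the correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x, _ in points)
    s_yy = sum((y - y_mean) ** 2 for _, y in points)
    return s_xy ** 2 / (s_xx * s_yy)

# Hypothetical points, made up for illustration only
on_a_line = [(0, 1), (1, 3), (2, 5), (3, 7)]   # exactly y = 2x + 1
no_trend = [(0, 1), (1, -1), (2, -1), (3, 1)]  # no upward or downward trend

print(r_squared(on_a_line))
print(r_squared(no_trend))
```

Real data usually fall somewhere between these two extremes, as in the two charts above.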
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 55
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn through the data.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this in Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 56
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (axis 0 to 1.8) against time t in years since 1994 (axis 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Scatter plot with fitted line: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with fitted line p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Substituting 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we would expect
the average price of a two-bedroom apartment to have been around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is off by only $31,300, or about 3%.
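The two calculations above can be sketched in a few lines of Python. This is only an illustration, since the course uses Excel. The slope and intercept come from the regression equation in this presentation; the function name predict_price and the constant names are hypothetical.

```python
SLOPE = 0.1264       # millions of dollars per year, from the regression equation
INTERCEPT = 0.2229   # millions of dollars, the predicted price at t = 0
BASE_YEAR = 1994     # the year that t = 0 represents


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Forecast outside the data range (2008 corresponds to t = 14).
print(round(predict_price(2008), 4))

# Check the model against a year from the table (2000 corresponds to t = 6).
actual_2000 = 0.95
print(round(predict_price(2000), 4))
print(round(predict_price(2000) - actual_2000, 4))
```

Converting the calendar year to t inside the function keeps the "t=0 is 1994" convention in one place, so the caller cannot plug in the wrong value of t.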
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with regression line: price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
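To make the idea of a residual concrete, here is a small Python sketch. It is an illustration only, since the course uses Excel. It uses the six apartment data points and the regression equation from this presentation, and it takes each residual to be the actual price minus the predicted price, which is a common sign convention but an assumption here. The function name residuals is hypothetical.

```python
# Apartment data from this presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(ts, ps, slope=0.1264, intercept=0.2229):
    """Return actual minus predicted price for each data point."""
    return [p - (slope * t + intercept) for t, p in zip(ts, ps)]


for t, res in zip(t_values, residuals(t_values, prices)):
    # A positive residual means the point lies above the line.
    print(f"t={t:2d}  residual={res:+.4f}")
```

Some residuals are positive and some are negative, because some points lie above the regression line and some lie below it.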
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
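For readers who are curious, the least-squares calculation can be sketched in Python. This is not required for the course and is not how Excel is used in the video. The sketch assumes the standard closed-form formulas for a line: the slope is the sum of (x minus mean x) times (y minus mean y), divided by the sum of (x minus mean x) squared, and the intercept is mean y minus the slope times mean x. The function name least_squares_line is hypothetical.

```python
# Apartment data from this presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(t_values, prices)
print(round(slope, 4), round(intercept, 4))
```

Rounded to four decimal places, the result agrees with the regression equation p = 0.1264t + 0.2229 shown on the graph.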
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
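The correlation coefficient can also be computed directly from the data. The following Python sketch is an illustration only, since the course uses Excel. It assumes the standard formula for the correlation coefficient (often called the Pearson correlation coefficient) and the apartment data from this presentation. The function name correlation is hypothetical.

```python
import math

# Apartment data from this presentation:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
prices = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Return the correlation coefficient r, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(t_values, prices)
print(round(r, 3))

# Flipping the sign of every price flips the sign of r:
# the fit is just as close, but the line now slopes downward.
print(round(correlation(t_values, [-p for p in prices]), 3))
```

The first value is close to 1, in line with the r shown on the graph, and the second is the same size but negative.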
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, the line is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor
fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 8, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 57
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (axis 0 to 1.8) against time t in years since 1994 (axis 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Scatter plot with fitted line: price p in millions of $ against time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with fitted line p = 0.1264t + 0.2229 and r = 0.9734: price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
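The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function names `predict_price` and `year_to_t` are made up for illustration, and the slope and intercept are the rounded values from the regression equation p=0.1264t+0.2229.

```python
# Regression equation from the slides: p = 0.1264 t + 0.2229
# (t in years since 1994, p in millions of dollars).
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Return the predicted price (millions of $) for t years after 1994."""
    return SLOPE * t + INTERCEPT


def year_to_t(year):
    """Convert a calendar year to t, using t = 0 for 1994."""
    return year - 1994


price_2008 = predict_price(year_to_t(2008))
price_2000 = predict_price(year_to_t(2000))
print(f"2008: {price_2008:.4f} million")
print(f"2000: {price_2000:.4f} million (table value: 0.95 million)")
```

Running this should give 1.9925 for 2008 and 0.9813 for 2000, matching the values worked out by hand.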
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
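To make the idea concrete, here is a small Python sketch that computes the residual for each of the six (t, p) data points from the apartment-price table. It assumes the common sign convention "actual value minus predicted value", so a positive residual means the point lies above the line; the name `predict_price` is made up for illustration.

```python
# The six (t, p) data points from the apartment-price table
# (t in years since 1994, p in millions of dollars).
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predict_price(t):
    """Regression equation from the slides: p = 0.1264 t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual value minus the value predicted by the line
# (an assumed sign convention; the slides describe it as a vertical distance).
residuals = [(t, p - predict_price(t)) for t, p in data]

for t, r in residuals:
    side = "above" if r > 0 else "below"
    print(f"t = {t:2d}: residual = {r:+.4f} (point is {side} the line)")
```

Some residuals come out positive and some negative, and they nearly cancel when added together. That is one reason the size of the residuals is usually measured by their squares rather than by a plain sum.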
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
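For the curious, the sketch below shows one way the least-squares line can be computed in Python, using the standard closed-form formulas for a single input variable (these are not necessarily presented the same way in the course PDF). The function names are made up for illustration, and the data are the six (t, p) points from the apartment-price table.

```python
# The six (t, p) data points from the apartment-price table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(slope, intercept, points):
    """Add up the squared vertical gaps between the points and a given line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


slope, intercept = least_squares_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")

# Compare with another candidate: the line through the first and last points.
best = sum_squared_residuals(slope, intercept, data)
other = sum_squared_residuals(0.122, 0.38, data)
print(f"fitted line: {best:.4f}, line through first and last points: {other:.4f}")
```

The fitted coefficients should round to the 0.1264 and 0.2229 shown on the graph, and the fitted line should have a smaller sum of squared residuals than the other candidate line.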
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
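As an illustration, the correlation coefficient for the six apartment-price data points can be computed with the standard (Pearson) formula. The Python sketch below is one way to do it; the function name `correlation_coefficient` is made up for this example.

```python
import math

# The six (t, p) data points from the apartment-price table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Return the Pearson correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(data)
print(f"r = {r:.3f}")
```

The result should be about 0.973, in line with the value shown on the graph, which indicates a very good fit with a positively sloping line.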
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
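One common way to compute r squared is to compare the squared residuals of the fitted line with the total spread of the data around its average. The Python sketch below does this for the six apartment-price data points; the function name `r_squared` is made up for illustration, and the line is fitted with the standard least-squares formulas.

```python
# The six (t, p) data points from the apartment-price table.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def r_squared(points):
    """Return r squared for the least-squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Fit the least-squares line first.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Squared residuals of the line versus total squared spread of the data.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


print(f"r squared = {r_squared(data):.3f}")
```

For these data the result should be close to 0.948, which is the square of the correlation coefficient r = 0.9734 shown on the graph.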
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation of a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 58
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 59
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994 and the
price p is in millions of dollars.

t    0     2     4     6     8     10
p    0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (0 to 1.8) against
time t in years since 1994 (0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn
through the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
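For readers who want to check the numbers outside Excel, here is a minimal sketch in Python. It assumes the standard least-squares formulas for slope, intercept, and r; the function name fit_line is only illustrative and does not come from the course materials. The results should agree with the graph to about three decimal places.

```python
# Data points (t, p) from the table:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The last printed digit of r may differ slightly from the graph because of rounding.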
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the data points and the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\Delta y}{\Delta x}$, the change in y divided by the
change in x.
In this case we are using p and t, so it’s
$\frac{\Delta p}{\Delta t}$.
So for our problem, we have
$\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
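The two calculations above can be sketched in Python as follows. The coefficients are the rounded values from the regression equation; the function name predict_price and the constant names are illustrative assumptions, not part of the course materials.

```python
BASE_YEAR = 1994      # t = 0 corresponds to 1994
SLOPE = 0.1264        # millions of dollars per year (rounded, from the graph)
INTERCEPT = 0.2229    # millions of dollars at t = 0 (rounded, from the graph)


def predict_price(year):
    """Predicted average price in millions of dollars for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# 2000 lies inside the data range; 2008 is an extrapolation beyond it.
for year in (2000, 2008):
    price = predict_price(year)
    print(f"{year}: p = {price:.4f} million, or ${price * 1_000_000:,.0f}")
```

Keep in mind that the 2008 value is only a forecast that assumes the trend from 1994 to 2004 continues.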
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: price p in millions of $ against time t in years
since 1994, with the regression line through the data.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
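To make the idea concrete, here is a small Python sketch that computes the residual at each of the six data points. It assumes the usual sign convention (actual value minus predicted value) and uses the rounded coefficients from the regression equation; the function name predicted is illustrative.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t):
    """Price in millions of dollars given by the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus predicted price (assumed sign convention).
# A positive residual means the data point lies above the line.
residuals = [(t, p - predicted(t)) for t, p in data]

for t, res in residuals:
    print(f"t = {t:2d}: residual = {res:+.4f}")
```

Some residuals are positive and some are negative, which reflects the fact that the line passes between the points rather than through them.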
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas involved can get fairly complicated, you
will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
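Although the formulas are not required, a short Python sketch can show what "least squares" means in practice. It compares the sum of squared residuals for the regression line with the same sum for two other candidate lines. The two alternative lines are arbitrary choices made for illustration only.

```python
# Data points (t, p): t = years since 1994, p = price in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_squared_residuals(slope, intercept):
    """Sum of the squared vertical distances between the data and a line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


# The regression line from the graph (rounded coefficients).
best = sum_squared_residuals(0.1264, 0.2229)
# Two hypothetical alternatives: a steeper line and a line shifted upward.
steeper = sum_squared_residuals(0.15, 0.2229)
higher = sum_squared_residuals(0.1264, 0.30)

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

Squaring matters because positive and negative residuals would otherwise cancel each other out when added.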
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
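The relationship between r and r squared can be sketched in Python using the apartment data. The reversed price list below is a made-up variation, included only to show that a downward trend gives a negative r but the same r-squared value; the function name correlation is illustrative.

```python
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


r = correlation(t_values, p_values)
r_squared = r ** 2

# Hypothetical variation: the same prices in reverse order (a falling trend).
r_falling = correlation(t_values, list(reversed(p_values)))

print(f"rising prices:  r = {r:+.4f}, r squared = {r_squared:.4f}")
print(f"falling prices: r = {r_falling:+.4f}, r squared = {r_falling ** 2:.4f}")
```

Because squaring removes the sign, r squared measures how strong the linear pattern is but not whether the line slopes up or down.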
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 8, with the
regression line y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: x from 0 to 12, y from 0 to 20, with the
regression line y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
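The same contrast can be tried in Python. The coordinates of the points in the two graphs are not given, so the two data sets below are invented for illustration: one is nearly linear and the other is scattered. They will not reproduce the R² values shown on the graphs.

```python
def r_squared(xs, ys):
    """r squared for the least-squares line through paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


xs = [0, 2, 4, 6, 8, 10]
# Hypothetical data, not taken from the graphs.
nearly_linear = [2.0, 3.0, 3.9, 5.1, 6.0, 7.0]
scattered = [5, 15, 3, 12, 4, 10]

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)

print(f"nearly linear data: r squared = {r2_linear:.4f}")
print(f"scattered data:     r squared = {r2_scattered:.4f}")
```

A regression line can be computed for both data sets, but only the first r-squared value suggests that the line is a useful model.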
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 60
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994 and the
price p is in millions of dollars.

t    0     2     4     6     8     10
p    0.38  0.40  0.60  0.95  1.20  1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (0 to 1.8) against
time t in years since 1994 (0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Graph: the scatter plot with the line of best fit drawn
through the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Graph: the data points and the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\Delta y}{\Delta x}$, the change in y divided by the
change in x.
In this case we are using p and t, so it’s
$\frac{\Delta p}{\Delta t}$.
So for our problem, we have
$\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 61
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p in millions of $ against
time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the same scatter plot with the fitted line drawn
through the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
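For readers curious where the numbers on the graph come from, here is a minimal sketch in Python (an assumption; the course itself uses Excel). It applies the standard least-squares formulas for slope and intercept, which are not stated on these slides, to the six data points listed earlier. The function name fit_line is made up for illustration.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through data."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Slope = sum of (t - mean_t)(p - mean_p) over sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    # The least-squares line always passes through (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph.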
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line
p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is
\( \text{slope} = \frac{\Delta y}{\Delta x} \),
the change in y divided by the change in x.
In this case we are using p and t, so it’s
\( \frac{\Delta p}{\Delta t} \).
So for our problem, we have
\( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
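The two calculations above can be repeated with a short Python sketch (Python and the name predict_price are assumptions for illustration; the slope, intercept, and the 2000 table value come from the slides).

```python
SLOPE = 0.1264      # millions of $ per year, from the regression equation
INTERCEPT = 0.2229  # millions of $, predicted price at t = 0 (1994)


def predict_price(year):
    """Predicted average price in millions of $ for a calendar year."""
    t = year - 1994  # t = 0 represents 1994
    return SLOPE * t + INTERCEPT


forecast_2008 = predict_price(2008)  # t = 14, outside the data range
model_2000 = predict_price(2000)     # t = 6, inside the data range
actual_2000 = 0.95                   # table value for 2000, in millions of $

print(f"2008 forecast: {forecast_2008:.4f} million")
print(f"2000 model:    {model_2000:.4f} million")
print(f"2000 gap:      {model_2000 - actual_2000:.4f} million")
```

The gap for 2000 is the model value minus the table value, so a positive gap means the model overestimates the price for that year.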
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p in millions of $ against
time t in years since 1994, with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: zoomed-in view of one data point and the
regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. Squaring keeps positive
and negative residuals from cancelling each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
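To make the idea of residuals concrete, here is a minimal Python sketch (Python is an assumption, and the function names are made up for illustration). It computes the residual at each of the six data points and compares the sum of squared residuals for the regression line against two hypothetical alternative lines that are not from the slides.

```python
# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residuals(slope, intercept, data):
    """Residual for each point: actual p minus the p predicted by the line."""
    return [p - (slope * t + intercept) for t, p in data]


def sum_of_squares(slope, intercept, data):
    """Sum of squared residuals, the quantity least squares minimizes."""
    return sum(r ** 2 for r in residuals(slope, intercept, data))


# Regression line from the slides.
best = sum_of_squares(0.1264, 0.2229, points)
# Two hypothetical alternative lines, chosen only for comparison.
steeper = sum_of_squares(0.14, 0.2229, points)
higher = sum_of_squares(0.1264, 0.30, points)

print(residuals(0.1264, 0.2229, points))
print(best < steeper, best < higher)
```

Both comparisons come out True: changing either the slope or the intercept away from the regression values makes the sum of squared residuals larger.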
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
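As a check on the apartment-price example, the following Python sketch (Python and the function name are assumptions) computes r with the standard correlation formula, which is not given on these slides, and then squares it to get the kind of value Excel reports.

```python
import math

# Data points (t, p) from the slides: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Correlation coefficient r between the t-values and the p-values."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    s_pp = sum((p - mean_p) ** 2 for _, p in data)
    return s_tp / math.sqrt(s_tt * s_pp)


r = correlation(points)
r_squared = r * r
print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
```

The computed r agrees with the r = 0.9734 shown on the graph to about three decimal places, and r squared is also close to 1, as expected for data that follow a nearly linear pattern.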
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line,
labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line,
labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 62
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the fitted line drawn over the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
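For readers who want to see where these numbers come from, here is a minimal sketch in Python rather than Excel. It assumes the standard least-squares formulas for the slope and intercept; the function name fit_line is ours and does not come from the course materials.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(t, p)
print(f"p = {slope:.4f}t + {intercept:.4f}")  # p = 0.1264t + 0.2229
```

Rounded to four decimal places, this reproduces the equation shown on the graph.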
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the years
given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
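The same arithmetic can be written as a small Python function. This is only an illustrative sketch: the name predict_price and the constants are ours, with the slope and intercept taken from the regression equation p=0.1264t+0.2229.

```python
SLOPE = 0.1264      # millions of $ per year, from the regression equation
INTERCEPT = 0.2229  # millions of $ predicted at t = 0 (the year 1994)
BASE_YEAR = 1994

def predict_price(year):
    """Predicted average price in dollars for a given calendar year."""
    t = year - BASE_YEAR  # convert the calendar year to years since 1994
    return (SLOPE * t + INTERCEPT) * 1_000_000

print(round(predict_price(2008)))  # 1992500
print(round(predict_price(2000)))  # 981300
```

The value for 2008 lies outside the data range (1994 to 2004), so it is only meaningful if the trend continued.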
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line]
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line and a box drawn around one data point]
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: close-up of one data point and the regression line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
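To make the idea concrete, here is a short Python sketch that computes the residual at each data point for the apartment example. It assumes the common sign convention "actual value minus predicted value", which the slides do not state explicitly, and it uses the rounded coefficients from the regression equation.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
SLOPE, INTERCEPT = 0.1264, 0.2229  # rounded regression coefficients

# Residual = actual p minus the p predicted by the line (assumed convention).
residuals = [y - (SLOPE * x + INTERCEPT) for x, y in zip(t, p)]

for x, r in zip(t, residuals):
    print(f"t={x:2d}  residual={r:+.4f}")
```

A positive residual means the data point lies above the line, and a negative residual means it lies below.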
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
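As a rough check on what "least squares" means, the following Python sketch compares the sum of squared residuals for the regression line with the same sum for two other lines. The alternative slope and intercept are arbitrary values chosen only for illustration.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Sum of the squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(t, p))

best = sum_squared_residuals(0.1264, 0.2229)    # the regression line
steeper = sum_squared_residuals(0.15, 0.2229)   # hypothetical steeper line
higher = sum_squared_residuals(0.1264, 0.30)    # hypothetical higher line

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"higher line:     {higher:.4f}")
```

Both alternative lines give a larger total, which is what we expect if the regression line is the one with the smallest sum of squared residuals.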
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
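For the curious, r can be computed directly from the data. The sketch below assumes the standard (Pearson) formula for the correlation coefficient, which the slides do not spell out; the function name correlation is ours.

```python
import math

# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(t, p)
print(f"r = {r:.3f}")
```

The result agrees with the r = 0.9734 shown on the graph to within rounding in the last digit, and it is close to 1, as we would expect for data with a positively sloped linear pattern.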
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
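As a sketch of how the two numbers are related, the Python code below computes r-squared for the apartment data in two ways: as 1 minus the ratio of the sum of squared residuals to the total variation in p, and as the square of r. The formulas are the standard ones and are an assumption here, since the slides do not give them.

```python
import math

# Data from the slides: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
mean_t = sum(t) / n
mean_p = sum(p) / n
sxx = sum((x - mean_t) ** 2 for x in t)
syy = sum((y - mean_p) ** 2 for y in p)  # total variation in p
sxy = sum((x - mean_t) * (y - mean_p) for x, y in zip(t, p))

# Least-squares line, then the sum of squared residuals around it.
slope = sxy / sxx
intercept = mean_p - slope * mean_t
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(t, p))

r_squared = 1 - ss_res / syy
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 = {r_squared:.4f}")   # R^2 = 0.9476
print(f"r*r = {r * r:.4f}")       # r*r = 0.9476
```

Both routes give the same value. Because squaring removes the sign, r-squared alone does not show whether the line slopes up or down; the sign of the slope in the regression equation does.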
Consider the following set of data points.
[Figure: scatter plot, x from 0 to 12 and y from 0 to 8, with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot, x from 0 to 12 and y from 0 to 20, with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation describes the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 63
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), where p is the price in millions
of dollars, so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot of price p (in millions of $) against time t (in years since 1994), with the fitted line drawn over the points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the fitted line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in downtown New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the years
given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
all of the residuals.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 64
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734; axes: time t in years since 1994, price p in millions of $]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
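The course finds the regression line in Excel, but the same numbers can be reproduced from the six data points. The following is a minimal Python sketch, assuming the standard least-squares formulas for the slope, intercept, and correlation coefficient (the slides do not show these formulas, and the function name fit_line is only illustrative).

```python
from math import sqrt

# The six (t, p) data points from the table:
# t in years since 1994, p in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept, r) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.3f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph, and r comes out at about 0.973.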
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot and regression line again, labeled p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
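The same substitution can be written as a short Python sketch. The names BASE_YEAR and predict_price are illustrative and not part of the course material. The coefficients are the ones from the regression equation above.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


price_2008 = predict_price(2008)  # t = 14
print(f"Predicted 2008 price: ${price_2008 * 1_000_000:,.0f}")
```

Converting the year to t inside the function avoids the easy mistake of plugging 2008 itself into the equation.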
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
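The same check can be repeated for every year in the table. Here is a small Python sketch, using the six data points listed earlier and the regression equation from the graph. The helper name model is only illustrative.

```python
# (t, actual price) pairs from the table; prices in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def model(t):
    """Regression equation from the graph: p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# One row per year: (year, actual, predicted, actual - predicted).
rows = []
for t, actual in data:
    predicted = model(t)
    rows.append((1994 + t, actual, predicted, actual - predicted))

print("year  actual  predicted  difference")
for year, actual, predicted, diff in rows:
    print(f"{year}  {actual:6.2f}  {predicted:9.4f}  {diff:+10.4f}")
```

For 2000 the difference is about -0.0313 million, so the model overshoots the actual price by roughly $31,300.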
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression line]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
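The least-squares method picks the line for which the sum of the squared residuals is as small as possible. As a rough illustration (not taken from the slides), the Python sketch below adds up the squared residuals for the regression line and for two other lines. The alternative slope and intercept are arbitrary values chosen only for comparison.

```python
# (t, p) data points from the table; p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_of_squared_residuals(slope, intercept):
    """Add up (actual - predicted)^2 over all points for p = slope*t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in data)


best = sum_of_squared_residuals(0.1264, 0.2229)     # regression line from the graph
steeper = sum_of_squared_residuals(0.1400, 0.2229)  # hypothetical steeper line
shifted = sum_of_squared_residuals(0.1264, 0.3000)  # hypothetical higher line

print(f"regression line: {best:.4f}")
print(f"steeper line:    {steeper:.4f}")
print(f"shifted line:    {shifted:.4f}")
```

The residuals are squared before they are added because positive and negative residuals would otherwise cancel each other out.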
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
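To see how r and r squared relate, here is a short Python sketch. It assumes the standard formula for the correlation coefficient, which the slides do not show. The "falling" data set is hypothetical: it is just the apartment prices in reverse order.

```python
from math import sqrt

# (t, p) points from the table: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


r_rising = correlation(ts, ps)
# Hypothetical data: the same prices in reverse order, so they fall over time.
r_falling = correlation(ts, list(reversed(ps)))

print(f"rising prices:  r = {r_rising:+.3f}, r squared = {r_rising ** 2:.3f}")
print(f"falling prices: r = {r_falling:+.3f}, r squared = {r_falling ** 2:.3f}")
```

The sign of r follows the direction of the trend, while r squared is the same in both cases, which is why r squared alone cannot tell us whether the line slopes up or down.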
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
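The data points behind the two graphs above are not listed in the slides, so the Python sketch below uses two made-up data sets instead: one that stays close to a line and one with no visible trend. It assumes the standard formula for r squared, and the function name r_squared is only illustrative.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


xs = list(range(1, 11))

# Made-up data that stays within 0.1 of the line y = 0.5x + 2.
nearly_linear = [0.5 * x + 2 + (0.1 if x % 2 else -0.1) for x in xs]

# Made-up data with no visible trend.
scattered = [5, 14, 3, 11, 8, 2, 13, 6, 12, 4]

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)
print(f"nearly linear: r squared = {r2_linear:.4f}")
print(f"scattered:     r squared = {r2_scattered:.4f}")
```

As in the two graphs, the nearly linear data give an r-squared value close to 1 and the scattered data give a value close to 0.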
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 65
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: Price p in millions of $ (vertical axis, 0 to 1.8) against Time t in years since 1994 (horizontal axis, 0 to 12)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Scatter plot with the regression line drawn over the data: Price p in millions of $ against Time t in years since 1994]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Scatter plot with the regression line: Price p in millions of $ against Time t in years since 1994, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
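For readers who want to check the equation outside Excel, here is a minimal Python sketch that recomputes the slope and p-intercept from the six data points. It uses the standard least-squares formulas, which this presentation does not derive, and the function name fit_line is only an illustrative choice.

```python
# Data points (t, p) from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared horizontal deviations and sum of cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph.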
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
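The same substitutions can be sketched in a few lines of Python. This is only an illustration of plugging values of t into p = 0.1264t + 0.2229; the function name predict_price and the constant names are assumptions made for this example.

```python
# Regression equation from the presentation: p = 0.1264t + 0.2229,
# with t in years since 1994 and p in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994


def predict_price(year):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - BASE_YEAR
    return SLOPE * t + INTERCEPT


# Actual prices from the table, in millions of dollars, for two of the given years.
actual = {1994: 0.38, 2000: 0.95}

for year in (1994, 2000, 2008):
    line = f"{year}: predicted ${predict_price(year) * 1_000_000:,.0f}"
    if year in actual:
        line += f", actual ${actual[year] * 1_000_000:,.0f}"
    print(line)
```

The year 2008 has no actual value to compare against because it lies outside the table; the model can only say what the price would be if the trend continued.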
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Scatter plot with the regression line: Price p in millions of $ against Time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
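To make the idea concrete, the residual of every data point can be computed directly. The sketch below assumes the common sign convention of actual price minus predicted price, so a point above the line has a positive residual; the presentation itself only describes the residual as a vertical distance.

```python
# Data points (t, p) listed earlier: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Price on the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Residual = actual price minus the price on the line (a signed vertical distance).
residuals = [(t, p - predicted_price(t)) for t, p in points]

for t, residual in residuals:
    position = "above" if residual > 0 else "below"
    print(f"t = {t:2d}: residual = {residual:+.4f} (point is {position} the line)")
```

Some residuals are positive and some are negative, which is one reason a fitting method works with their squares rather than adding them up directly.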
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
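Although the formulas are not required, a short Python sketch can show what "least squares" means in practice. As the name suggests, the quantity being made as small as possible is the sum of the squared residuals. The comparison lines below are hypothetical alternatives chosen only for illustration; they do not come from the presentation.

```python
# Data points (t, p) listed earlier: t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def sum_of_squares(slope, intercept):
    """Sum of squared residuals for the line p = slope * t + intercept."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# The regression line from the presentation.
best = sum_of_squares(0.1264, 0.2229)
print(f"regression line: {best:.4f}")

# Hypothetical alternative lines, used only for comparison.
alternatives = {
    "steeper": (0.14, 0.2229),
    "flatter": (0.11, 0.2229),
    "shifted up": (0.1264, 0.30),
    "through first and last points": (0.122, 0.38),
}
for name, (m, b) in alternatives.items():
    print(f"{name}: {sum_of_squares(m, b):.4f}")
```

Each alternative gives a larger total than the regression line, which is the sense in which the regression line is the "best" fit.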
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r on the graph; instead, it gives
the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
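The relationship between r and r squared can be checked on the apartment data with a small Python sketch. It uses the standard Pearson correlation formula, which is an assumption about how r is computed since the presentation does not state it. The reversed price list is a hypothetical data set, included only to show a negative r.

```python
from math import sqrt

# Data points from the table: t in years since 1994, p in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


r = correlation(t_values, p_values)
r_squared = r ** 2
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")

# Hypothetical falling prices: the same values in reverse order.
r_reversed = correlation(t_values, p_values[::-1])
print(f"reversed: r = {r_reversed:.3f}, r squared = {r_reversed ** 2:.3f}")
```

The reversed data give a negative r of the same size, while r squared is unchanged, so r squared reports how good the linear fit is but not whether the line slopes up or down.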
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 8, with trendline y = 0.5091x + 1.94 and R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: x-axis from 0 to 12, y-axis from 0 to 20, with trendline y = -0.183x + 8.3267 and R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
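The contrast between the two cases can be reproduced with a short Python sketch. The data behind the two graphs are not given in this presentation, so both lists below are made-up values chosen only to imitate a nearly linear set and a scattered set; the function name r_squared is likewise an illustrative choice.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


xs = list(range(1, 11))

# Hypothetical data, NOT the points plotted in the presentation.
nearly_linear = [2.5, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 6.9]
scattered = [8, 15, 3, 12, 6, 7, 12, 3, 14, 8]

print(f"nearly linear: r squared = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered:     r squared = {r_squared(xs, scattered):.4f}")
```

The nearly linear list gives an r-squared value above 0.99, and the scattered list gives one below 0.01, matching the pattern seen in the two graphs.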
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 66
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, the model predicts
that the average price of a two-bedroom apartment was
around $1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the prediction is within about $31,300 (roughly 3%) of the
actual price.
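The two calculations above can be reproduced with a few lines of Python. This is only a sketch that assumes the rounded coefficients from the presentation; the names used here are illustrative.

```python
SLOPE = 0.1264      # millions of dollars per year (rounded, from the text)
INTERCEPT = 0.2229  # millions of dollars at t = 0, that is, in 1994
BASE_YEAR = 1994


def predict_price_dollars(year):
    """Predicted average price in dollars for a given calendar year."""
    t = year - BASE_YEAR
    return round((SLOPE * t + INTERCEPT) * 1_000_000)


forecast_2008 = predict_price_dollars(2008)   # t = 14, outside the data range
predicted_2000 = predict_price_dollars(2000)  # t = 6, inside the data range
actual_2000 = 950_000                         # value quoted from the table

error_2000 = predicted_2000 - actual_2000
percent_error_2000 = 100 * error_2000 / actual_2000

print(forecast_2008, predicted_2000, error_2000, round(percent_error_2000, 1))
# 1992500 981300 31300 3.3
```

The 2008 figure is an extrapolation beyond the last data point (2004), so it is only as reliable as the assumption that the trend continued.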
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line. Horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a
data point and the value that the regression line predicts
for the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the residual.
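As a rough illustration, the residual at each of the six data points from the table can be computed as follows. The sketch assumes the rounded regression equation p = 0.1264t + 0.2229 and the usual sign convention (actual value minus predicted value); the variable names are illustrative.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

SLOPE, INTERCEPT = 0.1264, 0.2229  # rounded coefficients from the text


def residual(t, p):
    """Actual price minus predicted price, in millions of dollars."""
    return p - (SLOPE * t + INTERCEPT)


residuals = [round(residual(t, p), 4) for t, p in zip(ts, ps)]
print(residuals)
# A positive residual means the data point lies above the line,
# a negative residual means it lies below the line.
```

For example, the residual at t=6 is 0.95 - 0.9813 = -0.0313, so the 2000 data point sits slightly below the line.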
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
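For readers who are curious, here is a minimal sketch of the least-squares idea in Python. It uses the standard closed-form formulas for the slope and intercept that minimize the sum of the squared residuals; these formulas are not given in the presentation itself, and the function names are illustrative. It is not how Excel is used in the course, only a way to see where the numbers come from.

```python
# The six (t, p) data points: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (actual - predicted)^2 over all data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(ts, ps)
print(round(slope, 4), round(intercept, 4))  # 0.1264 0.2229

# Any other line does worse: nudging the slope increases the sum of squares.
best = sum_squared_residuals(ts, ps, slope, intercept)
nudged = sum_squared_residuals(ts, ps, slope + 0.01, intercept)
print(best < nudged)  # True
```

The computed slope and intercept round to the same values shown in the regression equation p = 0.1264t + 0.2229.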
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit, so
there is little or no linear relationship between the
variables in this case.
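As a sketch of where r comes from, the standard formula for the correlation coefficient can be applied to the six apartment data points. The formula is not given in the presentation, so treat this as an illustration under that assumption; the function name is made up.

```python
import math

# The six (t, p) data points: t in years since 1994, p in millions of dollars.
ts = [0, 2, 4, 6, 8, 10]
ps = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r, always between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(ts, ps)
print(round(r, 3))  # 0.973
```

The result agrees to three decimal places with the r = 0.9734 shown on the graph. It is close to 1, which matches the positive slope and the good fit we saw.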
On the chart, Excel does not give the value of r; instead it
gives the value of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
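A small sketch shows why r squared always lands between 0 and 1 and why it cannot tell us the direction of the slope. The "falling" data set below is made up for illustration by mirroring the apartment prices; it is not from the presentation.

```python
import math


def r_value(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [0, 2, 4, 6, 8, 10]
rising = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]  # apartment data from the text
falling = [-p for p in rising]                 # hypothetical mirrored data

r_up = r_value(xs, rising)
r_down = r_value(xs, falling)

print(round(r_up, 3), round(r_down, 3))            # 0.973 -0.973
print(round(r_up ** 2, 4), round(r_down ** 2, 4))  # 0.9476 0.9476
```

Both data sets have the same r-squared value, even though one trend rises and the other falls. To see the direction, look at the sign of the slope in the regression equation.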
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points that rise steadily from left to right. Horizontal axis from 0 to 12; vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is. The trendline shown on the graph is
y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points. Horizontal axis from 0 to 12; vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is. The trendline shown on the graph is
y = -0.183x + 8.3267, with R² = 0.0084.
So, to summarize, a linear regression equation describes the
line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 67
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
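The original table is not reproduced in this transcript, but its rows can be rebuilt from the six (t, p) points listed above. The sketch below does this, assuming only that t=0 represents 1994 and that p is in millions of dollars; the names are illustrative.

```python
BASE_YEAR = 1994  # t = 0 represents 1994

# (t, p) points from the text: t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def to_table_rows(points):
    """Convert (t, p) pairs into (year, t, price in dollars) rows."""
    return [(BASE_YEAR + t, t, round(p * 1_000_000)) for t, p in points]


rows = to_table_rows(points)
for year, t, price in rows:
    print(f"{year}  t={t:<2}  ${price:,}")
```

Seen this way, each data point is one year's average price, recorded every two years from 1994 through 2004.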
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Horizontal axis: Time t in years since 1994; vertical axis: Price p in millions of $.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is the change in y
divided by the change in x, \( \frac{\Delta y}{\Delta x} \).
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
Slide 68
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Graph: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
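As a rough check on the numbers shown on the graph, here is a minimal Python sketch. It is not part of the course, which only asks you to use Excel. It assumes the six data points listed earlier and uses the standard least-squares formulas for the slope, the intercept, and the correlation coefficient, which the presentation itself does not show. The variable names are only illustrative.

```python
# Data points (t, p) quoted in the presentation:
# t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squared deviations and cross-deviations about the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                  # least-squares slope
intercept = p_mean - slope * t_mean  # least-squares intercept
r = s_tp / (s_tt * s_pp) ** 0.5      # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r = {r:.4f}")
```

The slope and intercept round to the 0.1264 and 0.2229 shown on the graph, and r agrees with the graph's value to about three decimal places.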
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
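The two calculations above can also be written as a short Python sketch. This is only an illustration under the assumption that we use the rounded regression equation p = 0.1264t + 0.2229; the function name predict_price is hypothetical and not something from the course materials.

```python
def predict_price(t):
    """Predicted average price in millions of $, t years after 1994."""
    return 0.1264 * t + 0.2229


year_2008 = predict_price(2008 - 1994)  # t = 14
year_2000 = predict_price(2000 - 1994)  # t = 6
actual_2000 = 0.95                      # actual price from the table, in millions of $

print(f"2008 prediction: ${year_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${year_2000 * 1_000_000:,.0f}")
print(f"2000 model minus actual: ${(year_2000 - actual_2000) * 1_000_000:,.0f}")
```

The first two lines reproduce the $1,992,500 and $981,300 figures from the text. The last line shows that the model overshoots the actual 2000 price by $31,300.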
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot with the regression line drawn over the data points]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
[Graph: the scatter plot and regression line, with a box drawn around one data point]
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
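To make the idea of residuals concrete, here is a minimal Python sketch. It is not required for the course. It assumes the six data points listed earlier and the rounded regression equation p = 0.1264t + 0.2229. The two alternative lines are hypothetical and chosen only for comparison. Least squares works with the squares of the residuals, because positive and negative residuals would otherwise cancel each other out.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances between the data and the line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))


# Residuals (actual minus predicted) for the regression line from the text.
residuals = [pi - (0.1264 * ti + 0.2229) for ti, pi in zip(t, p)]

best = sum_squared_residuals(0.1264, 0.2229)
steeper = sum_squared_residuals(0.1400, 0.2229)  # hypothetical alternative line
higher = sum_squared_residuals(0.1264, 0.3000)   # hypothetical alternative line

print([round(res, 4) for res in residuals])
print(f"sum of residuals: {sum(residuals):.4f}")
print(best < steeper, best < higher)
```

The raw residuals add up to almost exactly zero even though the line misses most of the points, which is why the squared residuals are the quantity being minimized. Both comparisons on the last line come out True: the regression line has a smaller sum of squared residuals than either alternative.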
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing about how closely
the line fits, but it will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
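To see how r and r squared are related, here is a small Python sketch using the apartment data from earlier. It assumes r = 0.9734 and the rounded regression equation p = 0.1264t + 0.2229 from the graph. The second calculation, 1 minus the ratio of the squared residuals to the total squared variation in p, is the usual definition of r squared for a least-squares line; it is not shown in the presentation, so treat it as background rather than course material.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r = 0.9734                   # correlation coefficient shown on the graph
r_squared_from_r = r ** 2    # simply square it

# The same quantity from residuals: 1 - (unexplained variation / total variation).
p_mean = sum(p) / len(p)
ss_res = sum((pi - (0.1264 * ti + 0.2229)) ** 2 for ti, pi in zip(t, p))
ss_tot = sum((pi - p_mean) ** 2 for pi in p)
r_squared_from_residuals = 1 - ss_res / ss_tot

print(f"{r_squared_from_r:.4f} {r_squared_from_residuals:.4f}")
```

Both values come out near 0.95 and differ only because the inputs are rounded. Squaring removes the sign of r, so r squared alone does not say whether the line slopes up or down.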
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: data points with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: data points with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 69
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ versus time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data points and graph it over them.
[Graph: the scatter plot with the regression line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot and regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression
line. Horizontal axis: Time t in years since 1994. Vertical
axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Zooming in on one data point, we see the data point and
the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r on the graph; instead,
it gives the value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, but it will only be between 0 and 1, so it
does not show whether the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a straight
line, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with
trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 70
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table that shows the average price
of a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal
axis: Time t in years since 1994. Vertical axis: Price p in
millions of $.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data
and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn
over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot and line of best fit, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
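For readers who want to check the numbers on the graph without Excel, here is a minimal sketch in Python. It is not part of the course materials, and the function name fit_line is only illustrative. It applies the standard least-squares formulas for slope and intercept to the six data points (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), and (10, 1.60).

```python
# Data points (t, p) quoted in the text: t = years since 1994,
# p = average price in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Slope = sum of (t - mean_t)(p - mean_p) divided by sum of (t - mean_t)^2.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    # The least-squares line always passes through (mean_t, mean_p).
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation p = 0.1264t + 0.2229 shown on the graph.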
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given in the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close (off by $31,300).
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
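The two calculations above can be repeated for any year with a short Python sketch. This is only an illustration: the helper names predict_price and years_since_1994 are made up here, and the coefficients are the rounded ones from the regression equation p=0.1264t+0.2229.

```python
def predict_price(t, slope=0.1264, intercept=0.2229):
    """Price in millions of dollars predicted by p = 0.1264t + 0.2229."""
    return slope * t + intercept


def years_since_1994(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994


p_2008 = predict_price(years_since_1994(2008))  # beyond the data range
p_2000 = predict_price(years_since_1994(2000))  # a year that is in the table
actual_2000 = 0.95                              # actual 2000 price, in millions

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 actual minus predicted: ${(actual_2000 - p_2000) * 1_000_000:,.0f}")
```

The last line comes out negative because the model predicts a slightly higher price for 2000 than the table gives.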
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of the data points with the regression
line. Horizontal axis: Time t in years since 1994. Vertical
axis: Price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Zooming in on one data point, we see the data point and
the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
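The least-squares method is named for the quantity it makes as small as possible: the sum of the squared residuals. The Python sketch below is an illustration only, not something required for the course. It computes the residual (actual price minus predicted price) at each of the six data points, then compares the sum of squared residuals for the regression line with the sums for two other lines. The two comparison lines are hypothetical and were chosen only for this example.

```python
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residuals(points, slope, intercept):
    """Actual p minus predicted p for each data point."""
    return [p - (slope * t + intercept) for t, p in points]


def sum_squared_residuals(points, slope, intercept):
    """The quantity that the least-squares method makes as small as possible."""
    return sum(r ** 2 for r in residuals(points, slope, intercept))


# Regression line from the text, then two hypothetical alternatives.
best = sum_squared_residuals(points, 0.1264, 0.2229)
steeper = sum_squared_residuals(points, 0.1400, 0.2229)
higher = sum_squared_residuals(points, 0.1264, 0.3000)

for (t, p), r in zip(points, residuals(points, 0.1264, 0.2229)):
    print(f"t={t:2d}  actual={p:.2f}  residual={r:+.4f}")
print(f"sum of squared residuals: regression line={best:.4f}, "
      f"steeper line={steeper:.4f}, higher line={higher:.4f}")
```

Some residuals are positive and some are negative, so they nearly cancel if they are simply added up. Squaring them first is what makes the total a useful measure of how far the line is from the data.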
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r on the graph; instead,
it gives the value of r squared.
The r-squared value tells us much the same thing about how
good the fit is, but it will only be between 0 and 1, so it
does not show whether the line slopes up or down.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
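To see how r and r squared are related, here is a small Python sketch that computes both for the apartment data using the standard formula for the correlation coefficient. It is an illustration only; the function name correlation and the reversed "falling" data set are assumptions made for this example, not part of the course materials.

```python
import math

points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(points):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(points)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")

# Hypothetical data: the same prices in reverse order, so the trend is downward.
falling = [(t, p) for (t, _), (_, p) in zip(points, reversed(points))]
r_falling = correlation(falling)
print(f"falling data: r = {r_falling:.3f}, r squared = {r_falling ** 2:.3f}")
```

For the apartment data, r comes out at about 0.973, in line with the value shown on the graph. For the reversed data, r has the same size but a negative sign, while r squared is unchanged. This is why r squared reports how good the linear fit is but not the direction of the slope.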
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a straight
line, with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with
trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the
equation of the line that most closely fits a given set of
data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression in Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 71
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line in Excel.
Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Graph: the scatter plot with the best-fit line drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Graph: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
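For readers who want to check the equation outside Excel, here is a minimal sketch in Python. It assumes the standard closed-form least-squares formulas for slope and intercept, which the slides do not show; the variable names are our own.

```python
# Data from the text: t = years since 1994, p = average price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Spread of t about its mean, and how t and p vary together.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                  # change in p per one-year change in t
intercept = p_mean - slope * t_mean  # predicted p when t = 0

print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept should agree with the equation shown on the graph.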
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept,
here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as 0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
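The two calculations above can be sketched in Python as follows. The function name predict_price and its parameters are our own, chosen only for illustration; the coefficients are the rounded values from the regression equation.

```python
def predict_price(year, slope=0.1264, intercept=0.2229, base_year=1994):
    """Predicted average price, in millions of dollars, for a calendar year."""
    t = year - base_year          # t = 0 corresponds to 1994
    return slope * t + intercept

p_2008 = predict_price(2008)      # t = 14, beyond the range of the table
p_2000 = predict_price(2000)      # t = 6, a year that appears in the table
actual_2000 = 0.95                # table value for 2000, in millions of $

print(f"2008 prediction: ${p_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${p_2000 * 1_000_000:,.0f}")
print(f"2000 prediction minus actual: ${(p_2000 - actual_2000) * 1_000_000:,.0f}")
```

The last line shows how far the model is from the table value for 2000, which is one simple way to judge how close "pretty close" is.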
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Graph: the scatter plot with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: close-up of one data point and the nearby part of the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
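As a rough illustration, the residual for each of the six data points can be computed in Python. This sketch assumes the common sign convention of actual value minus predicted value, so a positive residual means the point lies above the line; the slides only describe the vertical distance, which is the size of this number without its sign.

```python
# Data from the text and the rounded regression coefficients.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]
slope, intercept = 0.1264, 0.2229

# residual = actual price - price predicted by the line at the same t
# (sign convention assumed here; the vertical distance is abs(residual))
residuals = [pi - (slope * ti + intercept) for ti, pi in zip(t, p)]

for ti, res in zip(t, residuals):
    side = "above" if res > 0 else "below"
    print(f"t = {ti:2d}: residual = {res:+.4f} (point is {side} the line)")
```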
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since the formulas for this method can get fairly complicated,
you will not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
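To get a feel for what "least squares" means without the formulas, here is a small Python sketch that compares the regression line with a few other candidate lines. The comparison lines are hypothetical choices of ours, picked only to show that each one gives a larger sum of squared residuals.

```python
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def sum_squared_residuals(slope, intercept):
    """Sum of the squared vertical gaps between the data and the line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))

# The regression line from the text (rounded coefficients).
best = sum_squared_residuals(0.1264, 0.2229)

# Other candidate lines, chosen only for comparison (hypothetical).
others = {
    "steeper": sum_squared_residuals(0.14, 0.2229),
    "flatter": sum_squared_residuals(0.11, 0.2229),
    "shifted up": sum_squared_residuals(0.1264, 0.30),
    "through first and last points": sum_squared_residuals(0.122, 0.38),
}

print(f"regression line: {best:.4f}")
for name, value in others.items():
    print(f"{name}: {value:.4f}")
```

Squaring matters because positive and negative residuals would otherwise cancel each other out when added.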
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
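The relationship between r and r squared can be sketched in Python with the apartment data. The helper name correlation is our own, and it uses the standard formula for the correlation coefficient, which the slides do not show. The reversed price list is a hypothetical data set, included only to show that r changes sign for a downward trend while r squared stays the same.

```python
def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return s_xy / (s_xx * s_yy) ** 0.5

t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

r_rising = correlation(t, p)         # data from the text
r_falling = correlation(t, p[::-1])  # hypothetical: same prices in reverse order

print(f"rising prices:  r = {r_rising:+.3f}, r squared = {r_rising ** 2:.3f}")
print(f"falling prices: r = {r_falling:+.3f}, r squared = {r_falling ** 2:.3f}")
```

For the apartment data, r comes out close to the value shown on the graph, and r squared is a little smaller than r because squaring a number between 0 and 1 makes it smaller.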
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot: data points with trendline y = 0.5091x + 1.94, R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Scatter plot: data points with trendline y = -0.183x + 8.3267, R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
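The data behind these two graphs is not listed in the slides, so the sketch below uses two made-up data sets of our own to show the same contrast in Python: one nearly linear and one scattered. The helper name r_squared and both lists of y-values are hypothetical.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient between xs and ys."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return s_xy ** 2 / (s_xx * s_yy)

x = list(range(1, 11))

# Hypothetical data close to the line y = 0.5x + 2.
y_linear = [2.6, 2.9, 3.6, 3.9, 4.6, 4.9, 5.6, 5.9, 6.6, 6.9]

# Hypothetical data with no clear upward or downward trend.
y_scattered = [3, 15, 7, 12, 1, 2, 12, 8, 15, 4]

r2_linear = r_squared(x, y_linear)
r2_scattered = r_squared(x, y_scattered)

print(f"nearly linear data: r squared = {r2_linear:.4f}")
print(f"scattered data:     r squared = {r2_scattered:.4f}")
```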
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 72
Introduction to Linear Regression
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
There is a method that allows us to minimize the sum of
all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there will be no linear relationship between
variables in this case.
Excel will not give the value of r, instead it gives the value
of r squared.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
Excel will not give the value of r, instead it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor fit
between the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
Consider the following set of data points.
8
7
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
Consider the following set of data points.
8
7
y = 0.5091x + 1.94
R² = 0.9943
6
5
4
3
2
1
0
0
2
4
6
8
10
12
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
Now consider the following set of data points.
20
18
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
Now consider the following set of data points.
20
18
y = -0.183x + 8.3267
R² = 0.0084
16
14
12
10
8
6
4
2
0
0
2
4
6
8
10
12
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 73
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
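Consider the following table, which shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Year   t    Price p (millions of $)
1994   0    0.38
1996   2    0.40
1998   4    0.60
2000   6    0.95
2002   8    1.20
2004   10   1.60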
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Listed like this, they don't give
much indication of a pattern, although we can see that the
values of p increase as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Horizontal axis: time t in years since 1994 (0 to 12). Vertical axis: price p in millions of $ (0 to 1.8).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the best-fitting line drawn over the data points. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229.]
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don't have to be the same,
however, since the regression equation can't match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\text{change in } y}{\text{change in } x} \).
In this case we are using p and t, so it's \( \frac{\text{change in } p}{\text{change in } t} \).
So for our problem, we have \( \frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
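The "per year" reading of the slope can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name predicted_price is a made-up helper, and the coefficients are the rounded values from the regression equation above.

```python
# Regression equation from the slides: p = 0.1264 t + 0.2229,
# with p in millions of dollars and t in years since 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price (millions of $) t years after 1994."""
    return SLOPE * t + INTERCEPT


# Change in the predicted price from each year to the next, in dollars.
yearly_changes = [
    round((predicted_price(t + 1) - predicted_price(t)) * 1_000_000)
    for t in range(10)
]
print(yearly_changes)  # every entry is 126400
```

Because the model is a straight line, the predicted increase is the same for every one-year step, which is exactly what the slope expresses.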
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
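The two calculations above can be reproduced with a few lines of Python. This is only an illustrative sketch; predicted_price is a hypothetical helper name, and the coefficients are the rounded ones from the regression equation.

```python
SLOPE = 0.1264       # millions of $ per year
INTERCEPT = 0.2229   # millions of $ at t = 0 (1994)


def predicted_price(t):
    """Predicted average price (millions of $) t years after 1994."""
    return SLOPE * t + INTERCEPT


# Forecast outside the data range: 2008 corresponds to t = 14.
forecast_2008 = predicted_price(14)

# Check against a year in the table: 2000 corresponds to t = 6.
model_2000 = predicted_price(6)
actual_2000 = 0.95
error_2000 = model_2000 - actual_2000

print(f"2008 forecast: {forecast_2008:.4f} million")    # 1.9925
print(f"2000 model value: {model_2000:.4f} million")    # 0.9813
print(f"2000 model minus actual: {error_2000:.4f} million")  # 0.0313
```

For 2000 the model is about $31,300 above the table value of $950,000, which is the sense in which the equation is "pretty close".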
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let's take a quick look at how a regression equation
is derived, and then at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line. Horizontal axis: time t in years since 1994. Vertical axis: price p in millions of $.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value at a particular data
point and the value the regression line gives at the same t.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
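To make the idea concrete, the residual at every data point can be computed directly. The sketch below is an illustration rather than part of the slides; it uses the six (t, p) points given earlier, the rounded regression coefficients, and the common sign convention residual = actual value minus predicted value (an assumption, since the slides only describe the vertical distance).

```python
# The six (t, p) data points from the table, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

SLOPE = 0.1264
INTERCEPT = 0.2229

# Residual = actual p minus the p predicted by the line at the same t.
residuals = [p - (SLOPE * t + INTERCEPT) for t, p in points]

for (t, p), res in zip(points, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}  (point is {position} the line)")
```

A positive residual means the data point lies above the line and a negative one means it lies below; for 2000 (t = 6) the residual is about -0.0313, matching the comparison made earlier.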
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
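The PDF is not reproduced here, but the standard textbook form of the method can be written down. In the notation of this example (an assumption: the PDF may use different symbols), the data points are (t_i, p_i), the line is p = mt + b, and a bar denotes an average. The method chooses m and b to make the sum of the squared residuals as small as possible:

```latex
% Quantity that the least-squares method minimizes
S(m, b) = \sum_{i=1}^{n} \bigl( p_i - (m t_i + b) \bigr)^2

% Values of m and b that minimize S
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
```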
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
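For readers who want to see the computation outside of Excel, the fit can also be sketched in plain Python. This is not the course's required method; it applies the standard least-squares formulas for a straight line to the six data points from the table, and the variable names are illustrative.

```python
# The six (t, p) data points from the table, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

# Sums of products of deviations from the means.
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)

slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

print(round(slope, 4), round(intercept, 4))  # 0.1264 0.2229
```

The rounded results agree with the regression equation p = 0.1264t + 0.2229 shown on the graph.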
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
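The correlation coefficient for the apartment data can be computed from its usual definition. The sketch below is illustrative and uses only the Python standard library; the variable names are assumptions, and the data are the six points from the table.

```python
import math

# The six (t, p) data points from the table, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Pearson correlation coefficient.
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r is approximately {r:.3f}")
```

The result is about 0.973, in line with the r = 0.9734 shown on the graph. It is positive and close to 1, as expected for points that rise along a nearly straight line.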
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing about how well the line fits, but it
will only be between 0 and 1, so it does not show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
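One common way to compute r-squared is as one minus the ratio of the squared residuals to the total squared variation in p. The Python sketch below is an illustration under that assumption (it is not a description of Excel's internals); it fits the line to the six apartment data points and also confirms that, for a straight-line least-squares fit, the result equals the square of r.

```python
import math

# The six (t, p) data points from the table, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(points)
mean_t = sum(t for t, _ in points) / n
mean_p = sum(p for _, p in points) / n

s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
s_tt = sum((t - mean_t) ** 2 for t, _ in points)
s_pp = sum((p - mean_p) ** 2 for _, p in points)

# Least-squares line through the data.
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t

# Sum of squared residuals around the line.
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in points)

r_squared = 1 - ss_res / s_pp
r = s_tp / math.sqrt(s_tt * s_pp)

print(f"r-squared is approximately {r_squared:.3f}")
print(f"r * r is approximately {r * r:.3f}")
```

Both values come out to about 0.948, close to 1, which matches the good linear fit seen in the scatter plot of the apartment prices.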
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points. Horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its trendline, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points. Horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the second scatter plot with its trendline, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 74
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the regression line drawn over the data points. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
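The slope interpretation can be checked numerically. The short Python sketch below is only an illustration (the course itself uses Excel, and the function name predicted_price is made up for this example). It evaluates the regression line one year apart at a few starting times and shows that the predicted change is always the slope.

```python
# Regression line from the slides: p = 0.1264t + 0.2229 (p in millions of dollars)
SLOPE = 0.1264
INTERCEPT = 0.2229


def predicted_price(t):
    """Predicted average price (millions of $) t years after 1994."""
    return SLOPE * t + INTERCEPT


# Increasing t by 1 year changes the prediction by the slope,
# no matter which starting time we pick.
for t in (0, 3, 7.5):
    change = predicted_price(t + 1) - predicted_price(t)
    print(f"t = {t}: change in p = {change:.4f} million, or ${change * 1_000_000:,.0f}")
```

Each line of output reports the same change, 0.1264 million, which is the $126,400 per year described above.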
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
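The same two calculations can be written as a small Python sketch. This is only an illustration under the assumption that we convert a calendar year to t by subtracting 1994; the helper names years_since_1994 and predicted_price are invented for this example.

```python
def years_since_1994(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994


def predicted_price(t):
    """Regression line from the slides, with p in millions of dollars."""
    return 0.1264 * t + 0.2229


# Forecast for 2008, which is outside the range of the data (t = 14)
forecast_2008 = predicted_price(years_since_1994(2008))

# Check against a year that is in the table: 2000 (t = 6), actual price 0.95
model_2000 = predicted_price(years_since_1994(2000))
actual_2000 = 0.95

print(f"2008 forecast: {forecast_2008:.4f} million")
print(f"2000 model value: {model_2000:.4f} million, actual: {actual_2000:.2f} million")
print(f"2000 difference: ${(model_2000 - actual_2000) * 1_000_000:,.0f}")
```

For 2000 the model is about $31,300 above the price in the table, which is what "pretty close" means here.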
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line. Vertical axis: Price p in millions of $. Horizontal axis: Time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
[Figure: zoomed-in view of the boxed data point and the regression line.]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
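To make the idea of a residual concrete, here is a short Python sketch that computes the residual for each of the six data points. It is an illustration only. It assumes the common sign convention residual = actual value minus predicted value, so a positive residual means the point lies above the line.

```python
# Data points (t, p) from the slides: t = years since 1994, p = price in millions of $
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted_price(t):
    """Price predicted by the regression line p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


# Assumed sign convention: residual = actual value - predicted value
residuals = [p - predicted_price(t) for t, p in data]

for (t, p), res in zip(data, residuals):
    print(f"t = {t:2d}: actual = {p:.4f}, predicted = {predicted_price(t):.4f}, "
          f"residual = {res:+.4f}")
```

Some residuals are positive and some are negative, so the line passes below some points and above others.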
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
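For the curious, the least-squares calculation for a single line can be sketched in a few lines of Python. This is not required for the course, and it is not necessarily the form used in the PDF. It assumes the standard textbook formulas: the slope is the sum of (t - mean t)(p - mean p) divided by the sum of (t - mean t) squared, and the intercept is mean p minus slope times mean t.

```python
# Data points (t, p) from the slides
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Standard closed-form least-squares formulas for one input variable
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
slope = s_tp / s_tt
intercept = mean_p - slope * mean_t


def sum_squared_residuals(m, b):
    """Total squared vertical distance between the data and the line p = m*t + b."""
    return sum((p - (m * t + b)) ** 2 for t, p in data)


print(f"p = {slope:.4f}t + {intercept:.4f}")

# Any other line, for example one with a slightly steeper slope, does worse.
best = sum_squared_residuals(slope, intercept)
other = sum_squared_residuals(slope + 0.01, intercept)
print(f"least-squares line: {best:.5f}, steeper line: {other:.5f}")
```

The fitted slope and intercept round to the 0.1264 and 0.2229 shown on the graph, and the steeper line has a larger sum of squared residuals.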
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
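As an illustration, the correlation coefficient for the apartment data can be computed with the following Python sketch. It assumes the usual Pearson formula, in which r is the sum of (t - mean t)(p - mean p) divided by the square root of the product of the sum of (t - mean t) squared and the sum of (p - mean p) squared.

```python
import math

# Data points (t, p) from the slides
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
s_tt = sum((t - mean_t) ** 2 for t, _ in data)
s_pp = sum((p - mean_p) ** 2 for _, p in data)

# Pearson correlation coefficient; always between -1 and 1
r = s_tp / math.sqrt(s_tt * s_pp)
print(f"r = {r:.4f}")
```

The result should agree with the r = 0.9734 shown on the graph, apart from possible rounding in the last digit. It is close to 1, which matches the positively sloped line that fits the points well.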
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
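One way to see what r squared measures is to compare the squared residuals from the regression line with the squared distances of the prices from their average. The Python sketch below is an illustration that assumes the common definition r squared = 1 - (sum of squared residuals) / (total sum of squares). For a least-squares line with one input variable, this equals the square of r.

```python
# Data points (t, p) from the slides
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_p = sum(p for _, p in data) / n

# Least-squares line (standard closed-form formulas)
slope = (sum((t - mean_t) * (p - mean_p) for t, p in data)
         / sum((t - mean_t) ** 2 for t, _ in data))
intercept = mean_p - slope * mean_t

# Squared vertical distances from the line, and from the average price
ss_res = sum((p - (slope * t + intercept)) ** 2 for t, p in data)
ss_tot = sum((p - mean_p) ** 2 for _, p in data)

r_squared = 1 - ss_res / ss_tot
print(f"r squared = {r_squared:.4f}")
```

For the apartment data the value comes out near 0.95, which is close to 1 and consistent with a good linear fit.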
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = 0.5091x + 1.94 and R² = 0.9943.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with trendline y = -0.183x + 8.3267 and R² = 0.0084.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 75
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
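To make the idea concrete, here is a minimal Python sketch that computes the residual for each data point from the slides, using the regression line p = 0.1264t + 0.2229. The names points, SLOPE, INTERCEPT and residual are made up for this illustration; the course itself uses Excel, not Python.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression line reported on the slides: p = 0.1264t + 0.2229.
SLOPE = 0.1264
INTERCEPT = 0.2229


def residual(t, p):
    """Vertical gap between the actual price and the price on the line."""
    predicted = SLOPE * t + INTERCEPT
    return p - predicted


residuals = [residual(t, p) for t, p in points]

for (t, p), res in zip(points, residuals):
    print(f"t={t:2d}  actual={p:.2f}  residual={res:+.4f}")
```

Some residuals are positive (the point is above the line) and some are negative (the point is below the line), so they nearly cancel when they are added together.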
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
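The PDF is not reproduced here, but for the curious, the usual textbook form of the least-squares formulas for a line p = mt + b is sketched below. The symbols m, b, n, and the bars for averages are notation chosen for this sketch, not taken from the slides. The first expression is the quantity being minimized, and the other two give the slope and intercept that minimize it.

```latex
S(m, b) = \sum_{i=1}^{n} \bigl(p_i - (m t_i + b)\bigr)^2,
\qquad
m = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(p_i - \bar{p})}{\sum_{i=1}^{n} (t_i - \bar{t})^2},
\qquad
b = \bar{p} - m\,\bar{t}
```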
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
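As an illustration, the following Python sketch computes the correlation coefficient for the apartment price data using the standard formula for r. The function name correlation and the variable names are assumptions made for this example. It also shows that turning the upward trend into a downward one flips the sign of r without changing its size.

```python
from math import sqrt

# Data from the slides: t = years since 1994, p = average price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(pairs):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


r = correlation(points)
print(f"r = {r:.4f}")

# Negating every price turns the rising pattern into a falling one.
r_flipped = correlation([(t, -p) for t, p in points])
print(f"r for the falling pattern = {r_flipped:.4f}")
```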
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, the line is a very poor fit
for the data points.
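One common way to think about r squared is as the share of the variation in the data that the line accounts for. The sketch below computes it that way for the apartment price data, using the regression line p = 0.1264t + 0.2229 from the slides. The variable names are invented for this illustration, and Excel may arrive at its value by a different route.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

# Regression line reported on the slides.
SLOPE = 0.1264
INTERCEPT = 0.2229

mean_p = sum(p for _, p in points) / len(points)

# Total variation of the prices around their average.
ss_total = sum((p - mean_p) ** 2 for _, p in points)

# Variation left over after using the line: the sum of squared residuals.
ss_residual = sum((p - (SLOPE * t + INTERCEPT)) ** 2 for t, p in points)

r_squared = 1 - ss_residual / ss_total
print(f"r squared = {r_squared:.4f}")
```

The result agrees, up to rounding, with squaring the value r = 0.9734 shown on the graph.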
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points with trendline y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of data points with trendline y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 76
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points; horizontal axis: time t in years since 1994, vertical axis: price p in millions of $]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
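For readers who want to see where the numbers in the equation come from, here is a short Python sketch that fits a line to the six data points using the usual closed-form least-squares formulas for slope and intercept. The function name fit_line and the variable names are assumptions for this example; in the course you will get the same equation from Excel.

```python
# Data from the slides: t = years since 1994, p = average price in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(pairs):
    """Return (slope, intercept) of the least-squares line through the pairs."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    numerator = sum((t - mean_t) * (p - mean_p) for t, p in pairs)
    denominator = sum((t - mean_t) ** 2 for t, _ in pairs)
    slope = numerator / denominator
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept match the equation p = 0.1264t + 0.2229 shown on the graph.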
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as \( \frac{0.1264}{1} \).
Recall that the definition of slope is \( \frac{\Delta y}{\Delta x} \), the change in y divided by the change in x.
In this case we are using p and t, so it’s \( \frac{\Delta p}{\Delta t} \).
So for our problem, we have \( \frac{\Delta p}{\Delta t} = \frac{0.1264}{1} \).
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
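The two calculations above can be sketched in a few lines of Python. The function predict_price and the constant BASE_YEAR are names invented for this illustration; the coefficients are the ones from the regression equation p = 0.1264t + 0.2229.

```python
BASE_YEAR = 1994  # the year that t = 0 represents


def predict_price(year):
    """Predicted average price in millions of $ for a calendar year."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


forecast_2008 = predict_price(2008)  # a year beyond the table
model_2000 = predict_price(2000)     # a year that appears in the table
actual_2000 = 0.95                   # table value for 2000, in millions of $

print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
print(f"2000 model:    ${model_2000 * 1_000_000:,.0f}")
print(f"2000 gap:      ${(model_2000 - actual_2000) * 1_000_000:,.0f}")
```

Writing the model as a function of the calendar year keeps the conversion t = year - 1994 in one place, which makes it harder to plug in the wrong value of t.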
Slide 77
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994)]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the same scatter plot with the best-fit line drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”:
p = 0.1264t + 0.2229
We can also get what’s called the correlation coefficient:
r = 0.9734
[Figure: the scatter plot and best-fit line, labeled with the equation and the value of r]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
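If you are curious where numbers like 0.1264 and 0.2229 come from, the following is a minimal sketch in Python, not part of the course material. It applies the standard least-squares formulas to the six data points listed earlier. The names `t_values`, `p_values`, and `fit_line` are our own choices for illustration.

```python
# Data from the apartment example: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the chart.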
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
$\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
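The two calculations above can be repeated for any year with a short sketch like the one below. It assumes the regression equation p = 0.1264t + 0.2229 from the chart. The function name `predict_price` and the constants `SLOPE` and `INTERCEPT` are our own labels, not something Excel provides.

```python
SLOPE = 0.1264      # millions of $ per year, from the regression equation
INTERCEPT = 0.2229  # millions of $, the predicted price at t = 0 (1994)


def predict_price(year):
    """Predicted average price, in millions of $, for a calendar year."""
    t = year - 1994  # t counts years since 1994
    return SLOPE * t + INTERCEPT


for year in (2000, 2008):
    print(year, f"${predict_price(year) * 1_000_000:,.0f}")

# Gap between the model and the table value for 2000 (0.95 million).
gap_2000 = predict_price(2000) - 0.95
print(f"Model minus actual for 2000: {gap_2000:.4f} million")
```

The year 2008 lies outside the data (1994 to 2004), so that prediction only holds if the trend continues. The year 2000 is inside the data, so its prediction can be checked against the table.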
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line drawn through them]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot and regression line, with a box around one data point]
Let’s zoom in on this particular data point.
[Figure: zoomed view of the box, showing the data point and the line]
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. (The residuals are squared
because positive and negative residuals would otherwise cancel out.)
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
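To make the idea of residuals concrete, here is a small sketch in Python. It is an illustration only and is not required for the course. It computes the residuals for the regression line p = 0.1264t + 0.2229 and for a second line, p = 0.12t + 0.30. That second line is a made-up alternative, chosen only for comparison. The helper names `residuals` and `sum_of_squares` are our own.

```python
# Data from the apartment example: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Actual price minus predicted price at each t (the vertical gaps)."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]


def sum_of_squares(values):
    """Add up the squares of the given numbers."""
    return sum(v ** 2 for v in values)


fitted = residuals(0.1264, 0.2229)  # the regression line from the chart
other = residuals(0.12, 0.30)       # hypothetical alternative line, for comparison

print("plain sum of residuals (fitted):", round(sum(fitted), 4))
print("sum of squared residuals (fitted):", round(sum_of_squares(fitted), 4))
print("sum of squared residuals (other): ", round(sum_of_squares(other), 4))
```

The plain sum of the residuals for the regression line is almost zero, because the points above and below the line cancel each other. The sum of the squared residuals does not cancel, and it is smaller for the regression line than for the alternative line.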
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor linear fit with the
data, so there is little or no linear relationship between the
variables in this case.
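As an illustration, here is a sketch in Python of the usual formula for the correlation coefficient (the Pearson correlation), applied to the apartment data. The function name `correlation` is our own, and this is not necessarily how Excel computes the value internally.

```python
import math

# Data from the apartment example: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(t_values, p_values)
print(f"r = {r:.3f}")

# Flipping the sign of every price turns the upward trend into a downward one,
# so r changes sign but keeps the same size.
r_flipped = correlation(t_values, [-p for p in p_values])
print(f"r for the flipped data = {r_flipped:.3f}")
```

The result is about 0.973, in line with the r = 0.9734 shown on the chart.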
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what the data look like
with an r-squared value close to 1 and with an r-squared
value close to 0.
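For the apartment data, the r-squared value can be worked out with the sketch below. It uses a standard definition for a least-squares line: 1 minus the sum of squared residuals divided by the total squared variation of p around its mean. With a single straight-line fit this equals r multiplied by itself. The function name `r_squared` is our own.

```python
# Data from the apartment example: t = years since 1994, p = price in millions of $.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def r_squared(xs, ys):
    """R^2 of the least-squares line: 1 - (unexplained variation / total variation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Squared vertical gaps between each point and the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared vertical gaps between each point and the mean of the y-values.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


r2 = r_squared(t_values, p_values)
print(f"R^2 = {r2:.4f}")
print(f"0.9734 squared = {0.9734 ** 2:.4f}")
```

Both printed values are close to 0.948, so the r-squared value here is simply the square of the r shown earlier.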
Consider the following set of data points.
[Figure: scatter plot of data points that lie close to a straight line]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is: the regression line is y = 0.5091x + 1.94, with R² = 0.9943.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is: the regression line is y = -0.183x + 8.3267, with R² = 0.0084.
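The contrast between the two charts can be reproduced with a sketch like the following. The two data sets here are invented for illustration, since the values behind the charts are not listed on the slides. The function name `r_squared` is also our own.

```python
def r_squared(xs, ys):
    """Square of the correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


xs = list(range(1, 11))

# Hypothetical data, made up for this example only.
nearly_linear = [2.4, 3.1, 3.4, 4.0, 4.4, 5.1, 5.5, 5.9, 6.6, 7.0]
scattered = [12, 3, 15, 6, 9, 14, 2, 11, 5, 10]

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)

print(f"nearly linear data: R^2 = {r2_linear:.4f}")
print(f"scattered data:     R^2 = {r2_scattered:.4f}")
```

The first value comes out close to 1 and the second close to 0, matching the pattern in the two charts.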
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 78
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line and a box drawn around one data point.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. The residuals are squared
so that positive and negative residuals do not cancel each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
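To make the idea of residuals concrete, here is a minimal Python sketch (Python is our choice for illustration; the course itself uses Excel). It takes the six data points and the regression equation p = 0.1264t + 0.2229 from the apartment example, computes each residual as the actual price minus the predicted price, and adds up the squares of the residuals. For comparison, it does the same for a hypothetical alternative line, the one through the first and last data points. That alternative line and the function names are assumptions made only for this illustration.

```python
# Data points (t, p) from the apartment example:
# t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residuals(slope, intercept, points):
    """Return actual-minus-predicted vertical differences, one per point."""
    return [p - (slope * t + intercept) for t, p in points]


def sum_squared_residuals(slope, intercept, points):
    """Return the sum of the squared residuals for the given line."""
    return sum(r ** 2 for r in residuals(slope, intercept, points))


# The regression line quoted on the slides.
sse_regression = sum_squared_residuals(0.1264, 0.2229, data)

# Hypothetical alternative: the line through the first and last data points.
alt_slope = (1.60 - 0.38) / (10 - 0)
alt_intercept = 0.38
sse_alternative = sum_squared_residuals(alt_slope, alt_intercept, data)

print(f"regression line:  {sse_regression:.4f}")
print(f"alternative line: {sse_alternative:.4f}")
```

The regression line should give the smaller total, which is one way to see why it is preferred over some other line.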
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
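As a rough illustration of where r comes from, the sketch below computes it for the six apartment data points using the standard Pearson correlation formula. That formula is not stated on the slides, and the function name is hypothetical, so treat this as a sketch rather than the course’s method.

```python
import math

# Data points (t, p) from the apartment example:
# t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation_coefficient(points):
    """Return the Pearson correlation coefficient r for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_coefficient(data)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

The result should be close to the value r = 0.9734 shown on the graph, and it is positive because the line slopes upward.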
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us the same thing about how good the
fit is, but since it will only be between 0 and 1, it does not
show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
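The two cases can be imitated with a small Python sketch. The data sets below are made up for illustration (they are not the points plotted on the slides), and the function name is hypothetical. The sketch fits a least-squares line to each set and computes r squared as 1 minus the ratio of the sum of squared residuals to the total variation in y.

```python
def r_squared(points):
    """Fit a least-squares line to (x, y) pairs and return its r-squared."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return 1 - ss_res / s_yy


# Made-up points that follow a clear linear pattern.
nearly_linear = [(1, 2.4), (2, 3.0), (3, 3.5), (4, 3.9), (5, 4.6), (6, 5.0)]

# Made-up points that are scattered with no linear pattern.
scattered = [(1, 9), (2, 2), (3, 15), (4, 4), (5, 12), (6, 6)]

print(f"nearly linear: {r_squared(nearly_linear):.4f}")
print(f"scattered:     {r_squared(scattered):.4f}")
```

The first value should come out close to 1 and the second close to 0, matching what we saw on the two graphs.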
So, to summarize, a linear regression equation is the equation of
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 79
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the line of best fit, labeled with its equation p = 0.1264t + 0.2229 and the correlation coefficient r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
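For readers curious about what Excel does behind the scenes, the following Python sketch computes the slope and intercept for the six data points with the standard least-squares formulas: the slope is the sum of (t - mean t)(p - mean p) divided by the sum of (t - mean t)^2, and the intercept is mean p minus slope times mean t. These formulas are not given on the slides, and the function name fit_line is hypothetical, so this is only an illustration and not something required in the course.

```python
# Data points (t, p) from the apartment example:
# t in years since 1994, p in millions of dollars.
data = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(data)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the result should agree with the equation shown on the graph.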
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the line of best fit, labeled p = 0.1264t + 0.2229 and r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$,
the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
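The same check can be repeated for every year in the table. The Python sketch below is only an illustration (the function and variable names are our own): it plugs each t into the regression equation, prints the predicted price next to the actual price, and then computes the 2008 forecast.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Actual average prices from the table, keyed by t (years since 1994).
actual = {0: 0.38, 2: 0.40, 4: 0.60, 6: 0.95, 8: 1.20, 10: 1.60}

for t, observed in actual.items():
    predicted = predict_price(t)
    print(f"{1994 + t}: predicted {predicted:.4f}, actual {observed:.2f}, "
          f"difference {observed - predicted:+.4f}")

# Forecast outside the data range: 2008 corresponds to t = 14.
forecast_2008 = predict_price(2008 - 1994)
print(f"2008 forecast: ${forecast_2008 * 1_000_000:,.0f}")
```

The differences printed for each year are the residuals of the model, and the 2008 line reproduces the forecast worked out above.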
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994), with the regression line drawn through the data points.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
[Figure: the same scatter plot of the data points with the regression line.]
A residual is the vertical difference between a particular data
point and the regression line: the actual value minus the value
predicted by the line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line and a box drawn around one data point.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals. The residuals are squared
so that positive and negative residuals do not cancel each other out.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us the same thing about how good the
fit is, but since it will only be between 0 and 1, it does not
show whether the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis from 0 to 12 and the vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of data points, with the horizontal axis from 0 to 12 and the vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is the equation of
the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 80
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p (in millions of $) against time t (in years since 1994).]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229 and correlation coefficient r = 0.9734; price p in millions of $ against time t in years since 1994]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229 and r = 0.9734]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$, the change in y divided by the change in x.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
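The two calculations above can also be checked with a few lines of code. This is only a sketch in Python (the course itself uses Excel); the function name predict_price is made up for illustration, and the coefficients are the rounded values quoted in the text.

```python
# Coefficients of the regression equation quoted in the text.
# t is years since 1994, p is the price in millions of dollars.
SLOPE = 0.1264
INTERCEPT = 0.2229


def predict_price(t):
    """Return the model's predicted price (millions of $) for year offset t."""
    return SLOPE * t + INTERCEPT


# Forecast for 2008 (t = 14) and check against a year in the table (2000, t = 6).
forecast_2008 = predict_price(2008 - 1994)
check_2000 = predict_price(2000 - 1994)

print(f"2008 forecast:   {forecast_2008:.4f} million dollars")
print(f"2000 prediction: {check_2000:.4f} million dollars (table value: 0.95)")
```

Changing the year in the call is all it takes to try other forecasts, keeping in mind that the further t is from the original data, the less the model can be trusted.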
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points with the regression line; price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a
particular data point and the value given by the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
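To make the idea of a residual concrete, here is a small Python sketch that computes the residual for each of the six data points listed earlier. It assumes the usual sign convention (actual value minus predicted value), which the slides do not spell out, and the helper name predicted is only for illustration.

```python
# Data points (t, p) listed in the text:
# t in years since 1994, p in millions of dollars.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def predicted(t, slope=0.1264, intercept=0.2229):
    """Value of the regression line p = 0.1264t + 0.2229 at time t."""
    return slope * t + intercept


# Assumed convention: residual = actual p minus predicted p,
# so a positive residual means the point lies above the line.
residuals = [(t, p - predicted(t)) for t, p in points]

for t, res in residuals:
    print(f"t = {t:2d}   residual = {res:+.4f}")
```

Some residuals come out positive and some negative, which matches the picture: the line passes below some points and above others.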
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
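For the curious, the least-squares calculation can be sketched in a few lines of Python. The least-squares method picks the slope and intercept that make the sum of the squared residuals as small as possible, and the code below uses the standard closed-form formulas for a straight-line fit. The function names are made up for illustration, and none of this is required for the course.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in points)
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Total of the squared vertical distances from the points to the line."""
    return sum((p - (slope * t + intercept)) ** 2 for t, p in points)


# Data points (t, p) listed in the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]

slope, intercept = least_squares_line(points)
best = sum_squared_residuals(points, slope, intercept)
# Nudging the slope away from the fitted value makes the total larger.
nudged = sum_squared_residuals(points, slope + 0.01, intercept)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"sum of squared residuals: fitted = {best:.5f}, nudged = {nudged:.5f}")
```

The slope and intercept should agree with the equation shown on the graph once they are rounded to four decimal places.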
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between the
variables in this case.
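As a rough illustration, the correlation coefficient for the apartment data can be computed directly. The sketch below uses the standard formula for r (the Pearson correlation coefficient) in Python; the function name correlation is just a label chosen here.

```python
import math


def correlation(points):
    """Return the correlation coefficient r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data points (t, p) listed in the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r = correlation(points)
print(f"r = {r:.4f}")

# Points lying exactly on a line give r = 1 (rising) or r = -1 (falling).
rising = correlation([(0, 1), (1, 3), (2, 5)])
falling = correlation([(0, 5), (1, 3), (2, 1)])
print(f"rising line: {rising:.1f}, falling line: {falling:.1f}")
```

The value for the apartment data should agree, up to rounding, with the r shown on the graph.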
Excel’s trendline does not show the value of r; instead, it shows the value
of r squared.
The r-squared value tells us the same thing about how good the fit is, but it
will only be between 0 and 1.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
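For a straight-line least-squares fit, the r-squared value is literally r multiplied by itself, and it can also be read as the fraction of the variation in the data that the line accounts for. The Python sketch below computes it that second way for the apartment data and compares the result with the square of r = 0.9734 from the earlier graph. The function name r_squared is made up for this example.

```python
def r_squared(points):
    """R^2 of the least-squares line: 1 - (squared residuals) / (total variation)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Total variation of the y-values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


# Data points (t, p) listed in the text.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]
r2 = r_squared(points)

print(f"r squared from the data:   {r2:.4f}")
print(f"0.9734 squared (from r):   {0.9734 ** 2:.4f}")
```

The two printed numbers should match closely, with any small difference coming from the rounding of r.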
Consider the following set of data points.
[Figure: scatter plot of data points that follow a clear linear pattern]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with the trendline y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered data points]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with the trendline y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
So, to summarize, a linear regression equation is the equation of the line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 81
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 82
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
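To get a feel for where the numbers on the graph come from, here is a minimal Python sketch that applies the standard least-squares formulas for slope and intercept to the six data points listed earlier. The function name fit_line is an assumption made for this illustration. This is not the Excel procedure used in the course, and the formulas in the PDF may be written differently.

```python
# Six data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def fit_line(data):
    """Return (slope, intercept) of the least-squares line through data.

    fit_line is a hypothetical helper name used only for this sketch.
    """
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_p = sum(p for _, p in data) / n
    # Sum of products of deviations, and sum of squared deviations in t.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in data)
    s_tt = sum((t - mean_t) ** 2 for t, _ in data)
    slope = s_tp / s_tt
    intercept = mean_p - slope * mean_t
    return slope, intercept


slope, intercept = fit_line(points)
print(f"p = {slope:.4f}t + {intercept:.4f}")
```

Rounded to four decimal places, the slope and intercept agree with the equation shown on the graph.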
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is $\frac{\Delta y}{\Delta x}$.
In this case we are using p and t, so it’s $\frac{\Delta p}{\Delta t}$.
So for our problem, we have $\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
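The two calculations above can be repeated for any year with a few lines of Python. This is only a sketch: the function names predicted_price and year_to_t are assumptions made for the example, and the coefficients are the rounded ones from the regression equation.

```python
def year_to_t(year):
    """Convert a calendar year to t, where t = 0 represents 1994."""
    return year - 1994


def predicted_price(t):
    """Price in millions of $ predicted by p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


p_2008 = predicted_price(year_to_t(2008))  # t = 14, outside the data range
p_2000 = predicted_price(year_to_t(2000))  # t = 6, inside the data range
actual_2000 = 0.95                         # price for 2000 from the table

print(f"2008 prediction: {p_2008:.4f} million")
print(f"2000 prediction: {p_2000:.4f} million")
print(f"2000 prediction minus actual: {p_2000 - actual_2000:.4f} million")
```

The last line shows that for 2000 the model is high by about $0.0313 million, or $31,300.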
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the data points and the regression line, plotted as price p in millions of $ against time t in years since 1994]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between the actual value of a
particular data point and the value given by the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
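To make the idea of a residual concrete, the following Python sketch computes the residual of each of the six apartment-price data points relative to the regression line p = 0.1264t + 0.2229, adds up the squares of the residuals, and does the same for a different line. The comparison line p = 0.12t + 0.30 and the helper names are assumptions chosen purely for illustration.

```python
# Six data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def residuals(slope, intercept, data):
    """Actual p minus the p given by the line, for every data point."""
    return [p - (slope * t + intercept) for t, p in data]


def sum_of_squares(values):
    """Add up the squares of the given numbers."""
    return sum(v ** 2 for v in values)


best = residuals(0.1264, 0.2229, points)   # the regression line
other = residuals(0.12, 0.30, points)      # a hypothetical comparison line

for (t, p), r in zip(points, best):
    print(f"t={t:2d}  actual={p:.2f}  residual={r:+.4f}")

print(f"plain sum of residuals (regression line): {sum(best):+.4f}")
print(f"sum of squared residuals (regression line): {sum_of_squares(best):.4f}")
print(f"sum of squared residuals (other line):      {sum_of_squares(other):.4f}")
```

The plain sum of the residuals for the regression line comes out very close to zero, because the positive and negative residuals cancel each other. That is why the method works with the squares of the residuals, and the regression line gives a smaller sum of squares than the comparison line.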
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us essentially the same thing, but it
will only be between 0 and 1, so it does not show whether the
slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
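As a rough check on how r and r squared are related, this Python sketch computes the correlation coefficient for the six apartment-price data points using the standard formula, and then again for the same prices listed in reverse order so that the trend slopes downward. The function name correlation and the reversed data set are assumptions made for this example.

```python
from math import sqrt

# Six data points (t, p): t in years since 1994, p in millions of $.
points = [(0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), (10, 1.60)]


def correlation(data):
    """Correlation coefficient r of a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    s_yy = sum((y - mean_y) ** 2 for _, y in data)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: the same prices in reverse order, so p falls as t rises.
times = [t for t, _ in points]
prices_reversed = [p for _, p in reversed(points)]
points_reversed = list(zip(times, prices_reversed))

r = correlation(points)
r_reversed = correlation(points_reversed)

print(f"original data: r = {r:.3f}, r squared = {r * r:.3f}")
print(f"reversed data: r = {r_reversed:.3f}, r squared = {r_reversed ** 2:.3f}")
```

For the original data, r comes out at about 0.973, in line with the value shown on the graph. The reversed data give the same r squared but a negative r, which shows that r squared measures how good the linear fit is without telling us the direction of the slope.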
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = 0.5091x + 1.94, R² = 0.9943]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
And it is.
Now consider the following set of data points.
[Figure: scatter plot of the data points with the fitted line y = -0.183x + 8.3267, R² = 0.0084]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
And it is.
So, to summarize, a linear regression equation is the equation
of the line that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 83
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table, which gives the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
t (years since 1994):   0     2     4     6     8     10
p (millions of $):      0.38  0.40  0.60  0.95  1.20  1.60
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of price p in millions of $ against time t in years since 1994]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the
data and graph it over the data points.
[Figure: the scatter plot with the line of best fit drawn over the data points]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled p = 0.1264t + 0.2229 and r = 0.9734]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
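If you would like to check the numbers on the graph without Excel, here is a minimal sketch in Python. It assumes the standard least-squares formulas for the slope, the intercept, and the correlation coefficient (these formulas are not shown in this presentation), and the variable names are our own. With the six data points above, it should give a slope and intercept that round to 0.1264 and 0.2229, and a value of r of about 0.973.

```python
from math import sqrt

# Data points (t, p) from the slides:
# t = years since 1994, p = average price in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

n = len(t)
t_mean = sum(t) / n
p_mean = sum(p) / n

# Sums of squares and cross-products about the means.
s_tt = sum((ti - t_mean) ** 2 for ti in t)
s_pp = sum((pi - p_mean) ** 2 for pi in p)
s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))

slope = s_tp / s_tt                   # least-squares slope
intercept = p_mean - slope * t_mean   # least-squares p-intercept
r = s_tp / sqrt(s_tt * s_pp)          # correlation coefficient

print(f"p = {slope:.4f}t + {intercept:.4f}")
print(f"r is about {r:.3f}")
```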
What does the regression equation tell us about the
relationship between time and sale price?
[Figure: the scatter plot with the regression line p = 0.1264t + 0.2229, r = 0.9734.]
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
$\frac{0.1264}{1}$.
Recall that the definition of slope is
$\text{slope} = \frac{\text{change in } y}{\text{change in } x}$.
In this case we are using p and t, so it’s
$\text{slope} = \frac{\text{change in } p}{\text{change in } t}$.
So for our problem, we have
$\frac{\text{change in } p}{\text{change in } t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
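The calculations above, along with the t=0 prediction from earlier, can be sketched in Python. This is only an illustration: the function name predict_price and the constant names are our own, and the coefficients are the rounded values from the regression equation p=0.1264t+0.2229.

```python
# Regression equation from the slides (rounded coefficients).
# p is measured in millions of dollars, t in years since 1994.
SLOPE = 0.1264
INTERCEPT = 0.2229
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price(year):
    """Return the model's predicted price in dollars for a calendar year."""
    t = year - BASE_YEAR
    p_millions = SLOPE * t + INTERCEPT
    return p_millions * 1_000_000


for year in (1994, 2000, 2008):
    print(year, f"${predict_price(year):,.0f}")
```

For 1994, 2000, and 2008 this reproduces the predictions of $222,900, $981,300, and $1,992,500. Keep in mind that 2008 lies outside the range of the data (1994 to 2004), so that value is a forecast that holds only if the trend continues.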
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
usually give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of the data points with the regression line.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot of the data points with the regression line.]
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
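To make this concrete, here is a small Python sketch that computes the residual for each of the six data points, using the rounded regression equation p=0.1264t+0.2229. We assume the usual sign convention, actual value minus predicted value, so a positive residual means the point lies above the line; the presentation itself only describes the vertical distance. The function name predicted is our own.

```python
# Data from the slides: t = years since 1994, p = price in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def predicted(ti):
    """Price the rounded regression line predicts at time ti."""
    return 0.1264 * ti + 0.2229


# Residual = actual p minus predicted p (a signed vertical distance).
residuals = [pi - predicted(ti) for ti, pi in zip(t, p)]

for ti, pi, res in zip(t, p, residuals):
    position = "above" if res > 0 else "below"
    print(f"t={ti:2d}  actual={pi:.2f}  predicted={predicted(ti):.4f}  "
          f"residual={res:+.4f}  (point is {position} the line)")
```

For example, at t=6 the actual price is 0.95 and the line predicts 0.9813, so the residual is about -0.0313 and the point sits slightly below the line.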
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that allows us to minimize the sum of
the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
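The least-squares method works with the squares of the residuals, because positive and negative residuals would otherwise cancel each other out. As a rough illustration (not something you need for the course), the Python sketch below compares the least-squares line with one competing line, the line through the first and last data points. It assumes the standard closed-form least-squares formulas, and the function and variable names are our own.

```python
# Data from the slides: t = years since 1994, p = price in millions of dollars.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((pi - (slope * ti + intercept)) ** 2 for ti, pi in zip(t, p))


# Least-squares line, from the standard closed-form formulas.
n = len(t)
t_mean, p_mean = sum(t) / n, sum(p) / n
ls_slope = (sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))
            / sum((ti - t_mean) ** 2 for ti in t))
ls_intercept = p_mean - ls_slope * t_mean

# A competing line: the one through the first and last data points.
two_point_slope = (p[-1] - p[0]) / (t[-1] - t[0])
two_point_intercept = p[0] - two_point_slope * t[0]

sse_ls = sum_squared_residuals(ls_slope, ls_intercept)
sse_two_point = sum_squared_residuals(two_point_slope, two_point_intercept)

# The plain (unsquared) residuals of the least-squares line cancel out.
plain_sum = sum(pi - (ls_slope * ti + ls_intercept) for ti, pi in zip(t, p))

print(f"least-squares line:      sum of squares = {sse_ls:.4f}")
print(f"line through two points: sum of squares = {sse_two_point:.4f}")
print(f"unsquared residuals of the least-squares line add up to {plain_sum:.1e}")
```

The least-squares line comes out with the smaller sum of squared residuals (about 0.06 versus about 0.17), and its unsquared residuals add up to essentially zero, which is why the residuals are squared before being added.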
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloped linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloped linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the value
of r squared.
The r-squared value basically tells us the same thing, but it
will only be between 0 and 1.
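As a quick illustration of how the two values are related, the Python sketch below squares the correlation coefficient from our apartment example. The variable names are our own. Notice that squaring throws away the sign, so r-squared alone does not tell us whether the line slopes up or down.

```python
r = 0.9734           # correlation coefficient shown on the graph in the slides
r_squared = r ** 2   # the corresponding r-squared value

print(f"r = {r}, r squared = {r_squared:.4f}")

# A correlation of -0.9734 would give the same r-squared value.
negative_r_squared = (-r) ** 2
print(f"r = {-r}, r squared = {negative_r_squared:.4f}")
```

For the apartment data this gives an r-squared value of about 0.9475.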
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of a set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of a second set of data points; horizontal axis from 0 to 12, vertical axis from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084.]
And it is.
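If you want the correlation coefficient r for these two examples, it can be recovered from the values on the graphs. The Python sketch below assumes the standard fact that, for a straight-line fit with one input variable, r is the square root of r-squared with the same sign as the slope. The function name r_from_r_squared is our own.

```python
from math import copysign, sqrt


def r_from_r_squared(r_squared, slope):
    """Recover r for a straight-line fit: size sqrt(R^2), sign of the slope."""
    return copysign(sqrt(r_squared), slope)


# Trendline values shown on the two example graphs in the slides:
# (description, slope, R squared).
examples = [
    ("clear linear pattern", 0.5091, 0.9943),
    ("scattered points", -0.183, 0.0084),
]

for label, slope, r_squared in examples:
    r = r_from_r_squared(r_squared, slope)
    print(f"{label}: R^2 = {r_squared}, r is about {r:+.3f}")
```

The first example gives an r of about 0.997, very close to 1, and the second gives an r of about -0.092, very close to 0.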
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 84
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
Consider the following table that the average price of a
two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.
We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
p-values are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
When we plot the points all together on a set of axes, we
get the following scatter plot:
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It seems that the data do follow a somewhat linear
pattern.
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can find the line the line that most closely fits the
equation and graph it over the data points.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
1.8
1.6
Price p in millions of $
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
What does the regression equation tell us about the
relationship between time and sale price?
1.8
Price p in millions of $
1.6
p = 0.1264t + 0.2229
1.4
r = 0.9734
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
The slope and the vertical intercept (usually the yintercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
.
.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
.
Recall that the definition of slope is
.
In this case we are using p and t, so it’s
So for our problem, we have
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
.
.
For this problem, t is measure in years and p is measured
in millions of dollars.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
For this problem, t is measure in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million dollars, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value on Excel) tell us about
the regression equation.
Let’s take another look at the data points and the
regression line.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Let’s take another look at the data points and the
regression line.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
It has to do with what is called a residual.
1.8
Price p in millions of $
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
1.8
1.6
Price p in millions of $
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
2
4
6
8
Time t in years since 1994
10
12
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
Zooming into this box:
We see the data point and the line.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
There is a method that finds the line that minimizes the
sum of the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive-sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative-sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us the same thing about how good
the fit is, but because it is a square it will only be
between 0 and 1, so it does not show whether the slope is
positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a rising
straight line, with x from 0 to 12 and y from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its trendline,
y = 0.5091x + 1.94, R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with x
from 0 to 12 and y from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its trendline,
y = -0.183x + 8.3267, R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 85
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of
a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t (years since 1994)    p (price in millions of $)
1994    0                       0.38
1996    2                       0.40
1998    4                       0.60
2000    6                       0.95
2002    8                       1.20
2004    10                      1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points, Price p in
millions of $ versus Time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the scatter plot with the fitted line drawn over
the data points, Price p in millions of $ versus Time t in
years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the
y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million or $222,900.
According to the table, the actual price was $0.38 million
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as $\frac{0.1264}{1}$.
Recall that the definition of slope is
$\frac{\Delta y}{\Delta x}$, the change in y divided by the
change in x.
In this case we are using p and t, so it’s
$\frac{\Delta p}{\Delta t}$.
So for our problem, we have
$\frac{\Delta p}{\Delta t} = \frac{0.1264}{1}$.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t into the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some years
that were given on the table. For example, for 2000 the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close.
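The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function name predict_price is our own, and it uses the rounded coefficients from the regression equation p=0.1264t+0.2229.

```python
def predict_price(t):
    """Predicted price in millions of $ for t years after 1994,
    using the rounded coefficients from the regression equation."""
    return 0.1264 * t + 0.2229

# 2008 is 14 years after 1994; this is outside the given data range.
price_2008 = predict_price(2008 - 1994)

# 2000 is inside the data range, so we can compare with the table.
price_2000 = predict_price(2000 - 1994)
actual_2000 = 0.95  # table value for 2000, in millions of $
error_2000 = price_2000 - actual_2000

print(f"2008 prediction: ${price_2008 * 1_000_000:,.0f}")
print(f"2000 prediction: ${price_2000 * 1_000_000:,.0f}")
print(f"2000 table value: ${actual_2000 * 1_000_000:,.0f}")
print(f"2000 difference: ${error_2000 * 1_000_000:,.0f}")
```

Subtracting 1994 from the year, rather than typing 14 or 6 directly, keeps the convention that t=0 represents 1994 visible in the code.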
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data however, it will
give a very good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot with the regression line, Price p
in millions of $ versus Time t in years since 1994.]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the vertical difference between a particular
data point and the regression line: the actual value minus
the value that the line predicts.
If we zoom in on a particular data point, we can see what
a residual is.
[Figure: the scatter plot with the regression line, Price p
in millions of $ versus Time t in years since 1994, with a
box drawn around one data point.]
Zooming into this box, we see the data point and the line.
The vertical distance between the line and the data point
is the size of the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
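To make the idea of a residual concrete, here is a short Python sketch that computes the residual for each of the six apartment data points. It is an illustration only: the helper name predicted is our own, and it uses the rounded regression equation p=0.1264t+0.2229 from the graph.

```python
# Data points: t = years since 1994, p = price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

def predicted(ti):
    # Regression equation from the graph (rounded coefficients).
    return 0.1264 * ti + 0.2229

# Residual = actual value minus predicted value
# (positive when the data point lies above the line).
residuals = [pi - predicted(ti) for ti, pi in zip(t, p)]

for ti, ri in zip(t, residuals):
    print(f"t = {ti:2d}: residual = {ri:+.4f}")

# Positive and negative residuals nearly cancel, so the plain sum
# says little about the fit; the squared residuals do not cancel.
sum_of_residuals = sum(residuals)
sum_of_squares = sum(ri ** 2 for ri in residuals)
print(f"sum of residuals         = {sum_of_residuals:+.4f}")
print(f"sum of squared residuals = {sum_of_squares:.4f}")
```

Some points lie above the line and some below, so the residuals have mixed signs and their plain sum is close to zero even though no point lies exactly on the line.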
There is a method that finds the line that minimizes the
sum of the squares of all of the residuals.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
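For the curious, the least-squares method can be sketched in Python using the standard textbook formulas for the slope and intercept. This is not taken from the course PDF and is not required for the course; the function names are our own. The sketch also compares the least-squares line with one other candidate line (the line through the first and last data points) to show that the least-squares line has the smaller sum of squared residuals.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes
    the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (actual - predicted)^2 over all data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Data points: t = years since 1994, p = price in millions of $.
t = [0, 2, 4, 6, 8, 10]
p = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]

best_slope, best_intercept = least_squares_line(t, p)
best_sse = sum_squared_residuals(t, p, best_slope, best_intercept)

# A different candidate: the line through the first and last points.
other_slope = (p[-1] - p[0]) / (t[-1] - t[0])
other_intercept = p[0] - other_slope * t[0]
other_sse = sum_squared_residuals(t, p, other_slope, other_intercept)

print(f"least-squares line: p = {best_slope:.4f}t + {best_intercept:.4f}")
print(f"  sum of squared residuals = {best_sse:.4f}")
print(f"line through end points: p = {other_slope:.4f}t + {other_intercept:.4f}")
print(f"  sum of squared residuals = {other_sse:.4f}")
```

Any other line you try should give a sum of squared residuals at least as large as the least-squares line does, which is the sense in which that line is the "best fit".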
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positive-sloping linear
function.
A value very close to -1 indicates a very good fit with a
negative-sloping linear function.
A value very close to 0 indicates a very poor fit with the
data, so there is little or no linear relationship between
the variables in this case.
Excel will not give the value of r; instead, it gives the
value of r squared.
The r-squared value tells us the same thing about how good
the fit is, but because it is a square it will only be
between 0 and 1, so it does not show whether the slope is
positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Figure: scatter plot of points that lie close to a rising
straight line, with x from 0 to 12 and y from 0 to 8.]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same scatter plot with its trendline,
y = 0.5091x + 1.94, R² = 0.9943.]
And it is.
Now consider the following set of data points.
[Figure: scatter plot of widely scattered points, with x
from 0 to 12 and y from 0 to 20.]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same scatter plot with its trendline,
y = -0.183x + 8.3267, R² = 0.0084.]
And it is.
So, to summarize, a linear regression equation is a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.
Slide 86
Introduction to Linear Regression
You have seen how to find the equation of a line that
connects two points.
Often, we have more than two data points, and usually the
data points do not all lie on a single line.
It is possible to find the equation of a line that most
closely fits a set of data points. Such a line is called a
regression line or a linear regression equation.
Our goal here is to learn what a regression line is. You
can then watch the presentation on how to find the
equation of a regression line on Excel.
Consider the following table that shows the average price of
a two-bedroom apartment in downtown New York City
from 1994 to 2004, where t=0 represents 1994.

Year    t (years since 1994)    p (price in millions of $)
1994    0                       0.38
1996    2                       0.40
1998    4                       0.60
2000    6                       0.95
2002    8                       1.20
2004    10                      1.60

We can plot each of these data points on a graph. Each
point is of the form (t, p), so we have 6 points to plot.
They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20),
and (10, 1.60). Just looking at them like this doesn’t give
much indication of a pattern, although we can see that the
values of p are increasing as t increases.
When we plot the points all together on a set of axes, we
get the following scatter plot:
[Figure: scatter plot of the six data points, Price p in
millions of $ versus Time t in years since 1994.]
It seems that the data do follow a somewhat linear
pattern.
We can find the line that most closely fits the data and
graph it over the data points.
[Figure: the scatter plot with the fitted line drawn over
the data points, Price p in millions of $ versus Time t in
years since 1994.]
Notice that the line does not go through all of the data
points.
We can also find the equation of this “line of best fit”.
We can also get what’s called the correlation coefficient.
[Figure: the scatter plot with the regression line, labeled
p = 0.1264t + 0.2229 and r = 0.9734.]
You will be able to do all of this on Excel once you watch
the instructional video and read the PDFs for this
material. For now, we just want to get an idea of what
the regression line is and what the correlation coefficient
tells us about the regression equation.
What does the regression equation tell us about the
relationship between time and sale price?
The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.
In this case, the p-intercept tells us what the sale price is
predicted to be when t=0 (that is, in the year 1994).
The regression equation is p=0.1264t+0.2229. Recall that
price is in millions of dollars.
Thus, if t=0, the regression equation predicts a price of
$0.2229 million, or $222,900.
According to the table, the actual price was $0.38 million,
or $380,000. These values don’t have to be the same,
however, since the regression equation can’t match every
point exactly. It is only a model that most closely fits the
data points.
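As a quick check of the intercept reading, here is a minimal Python sketch that evaluates the regression equation at t=0 and converts the result to dollars. The coefficients are the ones shown on the graph; the function name predict_price_millions is only an illustrative choice and does not come from the course materials.

```python
# Coefficients of the regression line p = 0.1264t + 0.2229 from the graph.
SLOPE = 0.1264       # millions of dollars per year
INTERCEPT = 0.2229   # millions of dollars at t = 0 (the year 1994)


def predict_price_millions(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return SLOPE * t + INTERCEPT


# At t = 0 the slope term vanishes, so the prediction is just the intercept.
p_1994 = predict_price_millions(0)
print(f"Predicted price at t=0: ${p_1994:.4f} million (${p_1994 * 1_000_000:,.0f})")
```

Setting t=0 leaves only the intercept term, which is why the p-intercept can be read directly as the model's 1994 prediction.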
What does the slope of the regression equation tell us?
The slope of our regression equation is 0.1264.
We can always write a number x as x divided by 1, so we
can write this slope as
0.1264/1.
Recall that the definition of slope is
(change in y)/(change in x).
In this case we are using p and t, so it’s
(change in p)/(change in t).
So for our problem, we have
(change in p)/(change in t) = 0.1264/1.
We can interpret this to mean that when t increases by 1,
we can expect that p will increase by 0.1264.
For this problem, t is measured in years and p is measured
in millions of dollars.
So more specifically, the slope can be interpreted to mean
that if t increases by 1 year, the model predicts that the
average price p of a two-bedroom apartment will increase
by about $0.1264 million, or $126,400.
Even more plainly, we can say that the model predicts that
the average price of a two-bedroom apartment in New
York City will increase by about $126,400 per year.
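The "per year" reading of the slope can be checked numerically. The sketch below, which assumes nothing beyond the regression equation from the graph, compares the predicted price one year apart at a few arbitrarily chosen starting values of t. The helper name is illustrative.

```python
def predict_price_millions(t):
    """Regression line from the graph: p = 0.1264t + 0.2229 (p in millions of $)."""
    return 0.1264 * t + 0.2229


# For a straight line, the change in p over any one-year step equals the slope,
# no matter which year we start from.
for t in (0, 3, 9):
    change = predict_price_millions(t + 1) - predict_price_millions(t)
    print(f"t={t} -> t={t + 1}: predicted increase of ${change * 1_000_000:,.0f}")
```

Because the model is linear, the one-year increase is the same at every starting point; that constant is exactly what the slope measures.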
We can now use the linear regression model to predict
future prices. For example, if we wanted to predict what
the price of an apartment was in 2008, we could plug in
14 for t in the regression equation (since t=0 is 1994).
Plugging in 14 for t in the regression equation gives
p=0.1264(14)+0.2229=1.9925.
This means that if the trend continued, we can expect
that the price of a two-bedroom apartment was around
$1,992,500 in 2008.
You can also use the regression equation to check how
closely the model matches the actual price in some of the years
that were given in the table. For example, for 2000 (t=6) the
equation predicts a price of p=0.1264(6)+0.2229=0.9813,
or $981,300.
According to the table, the actual price was $950,000, so
the regression equation is pretty close: it is off by about
$31,300, or roughly 3%.
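The two calculations above can be reproduced with a short Python sketch. It converts a calendar year to t by subtracting 1994, as the text does by hand. The names BASE_YEAR and predict_price_millions are illustrative, and the only table value used is the 2000 price quoted in the text.

```python
BASE_YEAR = 1994  # t = 0 corresponds to 1994


def predict_price_millions(year):
    """Predicted price in millions of $ for a calendar year, using p = 0.1264t + 0.2229."""
    t = year - BASE_YEAR
    return 0.1264 * t + 0.2229


forecast_2008 = predict_price_millions(2008)   # t = 14, beyond the 1994-2004 data
estimate_2000 = predict_price_millions(2000)   # t = 6, inside the data range
actual_2000 = 0.95                             # table value quoted in the text

print(f"2008 forecast: ${forecast_2008:.4f} million")
print(f"2000 estimate: ${estimate_2000:.4f} million "
      f"(actual ${actual_2000:.2f} million, "
      f"difference ${abs(estimate_2000 - actual_2000) * 1_000_000:,.0f})")
```

The 2008 figure lies outside the years covered by the data, so it depends on the assumption that the trend continued; the 2000 figure can be compared with a known value.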
It is important to remember that the regression equation
is just a model, and it won’t give the exact values.
If the equation is a good fit to the data, however, it will
give a good approximation, so it can be used to
forecast what may happen in the future if the current
trend continues.
Next, let’s take a quick look at how a regression equation
is derived, and then take a look at what the correlation
coefficient (or the r-squared value in Excel) tells us about
the regression equation.
Let’s take another look at the data points and the
regression line.
[Figure: the scatter plot of price p against time t, with the regression line drawn through the data points]
Why does this particular line give the best “fit” for the
data? Why not some other line?
It has to do with what is called a residual.
A residual is the difference between a particular data
point and the regression line.
If we zoom in on a particular data point, we can see what
a residual is.
Let’s zoom in on this particular data point.
Zooming into this box:
We see the data point and the line.
The vertical distance between the line and the data point
is the residual.
The idea behind linear regression is to keep the residuals
as small as possible.
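To make the idea of a residual concrete, the following Python sketch computes it for the two table values quoted earlier (1994 and 2000). It assumes the common convention that a residual is the actual value minus the value on the line, so the sign shows whether the point lies above or below the line. The function names are illustrative.

```python
def predict_price_millions(t):
    """Regression line from the graph: p = 0.1264t + 0.2229."""
    return 0.1264 * t + 0.2229


def residual(t, actual_price):
    """Signed vertical gap: actual value minus the value on the line (assumed convention)."""
    return actual_price - predict_price_millions(t)


# Only the two table values quoted in the text are used here.
quoted_points = [(0, 0.38), (6, 0.95)]  # (t, actual price in millions of $)

for t, actual in quoted_points:
    gap = residual(t, actual)
    side = "above" if gap > 0 else "below"
    print(f"t={t}: residual {gap:+.4f} million (point lies {side} the line)")
```

A positive residual means the model underestimated the actual price, and a negative one means it overestimated.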
There is a method that finds the line for which the sum of
the squares of all of the residuals is as small as possible.
This is called the least-squares method. You can read about
it in the PDF for linear regression.
Since these formulas can get fairly complicated, you will
not be required to use them in the course.
You will only need to know how to find a regression line
using Excel. You can watch the video on how to do this,
or read through the PDF, or both.
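For readers who are curious about what the least-squares method does, here is a minimal Python sketch using the standard textbook formulas for a straight-line fit. It is not required for the course and is not necessarily how Excel computes the line internally. The data set is hypothetical and is not the apartment table.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard closed-form least-squares formulas for a straight line.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Total of (actual - predicted)^2 over all points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only; this is NOT the apartment table.
xs = [0, 2, 4, 6, 8, 10]
ys = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
# A line with a slightly different slope should not do better than the fitted one.
nudged = sum_squared_residuals(xs, ys, slope + 0.05, intercept)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"sum of squared residuals: best={best:.4f}, nudged={nudged:.4f}")
```

The residuals are squared before they are added so that positive and negative residuals cannot cancel each other out.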
Next, we look at what the correlation coefficient tells us
about the regression equation.
Recall that in our graph, a number was given, called the
correlation coefficient, denoted by the letter r.
The correlation coefficient tells us how closely the
regression line “fits” the data points.
It has a value between -1 and 1. A value very close to 1
indicates a very good fit with a positively sloping linear
function.
A value very close to -1 indicates a very good fit with a
negatively sloping linear function.
A value very close to 0 indicates a very poor linear fit to the
data, so there is little or no linear relationship between
the variables in this case.
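The three cases can be illustrated with a short Python sketch that computes r from its usual definition (the Pearson correlation coefficient). The three small data sets are hypothetical and were chosen only to produce values of r at 1, -1 and 0.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data sets, chosen only to show the three cases in the text.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # exactly linear, positive slope
falling = [10, 8, 6, 4, 2]     # exactly linear, negative slope
no_trend = [3, 1, 4, 1, 3]     # no linear trend

for name, ys in [("rising", rising), ("falling", falling), ("no trend", no_trend)]:
    print(f"{name}: r = {correlation_coefficient(xs, ys):.4f}")
```

Real data rarely hit these extremes exactly; the apartment data, with r = 0.9734, fall close to the first case.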
Excel’s chart trendline will not give the value of r; instead, it gives the value
of r squared.
The r-squared value tells us much the same thing, but it
will only be between 0 and 1, so it does not show whether
the slope is positive or negative.
If the r-squared value is close to 1, there is a very good
linear fit for the data points.
If the r-squared value is close to 0, there is a very poor
linear fit for the data points.
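To connect this back to the apartment example, the r-squared value there can be found simply by squaring the correlation coefficient shown on the graph. A minimal Python sketch:

```python
r = 0.9734          # correlation coefficient shown on the apartment-price graph
r_squared = r ** 2  # the quantity reported as R² on an Excel chart

print(f"r = {r}, r squared = {r_squared:.4f}")

# Squaring removes the sign, so a negative r of the same size gives the same R².
print(f"(-r) squared = {(-r) ** 2:.4f}")
```

The result, about 0.9475, is close to 1, which agrees with the good linear fit seen on the graph.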
We will now look at some examples of what it looks like
with an r-squared value close to 1 and with an r-squared
value close to 0.
Consider the following set of data points.
[Scatter plot of the data points: horizontal axis from 0 to 12, vertical axis from 0 to 8]
They follow a clear linear pattern, so we should expect
the r-squared value to be close to 1.
[Figure: the same plot with its regression line, labeled y = 0.5091x + 1.94 and R² = 0.9943]
And it is.
Now consider the following set of data points.
[Scatter plot of the data points: horizontal axis from 0 to 12, vertical axis from 0 to 20]
These points seem to be scattered everywhere and don’t
follow any linear pattern.
We expect the r-squared value to be close to 0.
[Figure: the same plot with its regression line, labeled y = -0.183x + 8.3267 and R² = 0.0084]
And it is.
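The contrast between the two plots can be imitated with a small Python sketch. It fits a least-squares line and then computes r-squared as 1 minus the ratio of residual variation to total variation, which for a straight-line fit equals the square of r. Both data sets are hypothetical; they are not the points from the two plots, which are not listed in this presentation.

```python
def r_squared(xs, ys):
    """R² of the least-squares line: 1 - (residual variation / total variation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data, NOT the points from the two plots above.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
nearly_linear = [2.4, 3.1, 3.4, 4.0, 4.4, 5.1, 5.4, 6.1, 6.5, 7.0]
scattered = [12, 3, 17, 5, 9, 15, 2, 11, 6, 14]

print(f"nearly linear: R² = {r_squared(xs, nearly_linear):.4f}")
print(f"scattered:     R² = {r_squared(xs, scattered):.4f}")
```

A regression line can be computed in both cases; only the r-squared value reveals that the second line describes its data poorly.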
So, to summarize, a linear regression equation is the equation of a line
that most closely fits a given set of data points.
The regression equation can be used to predict future
values, or values that are outside of the given data range.
We can find a regression equation for any set of data
points, no matter how scattered the data look, but we can
tell how closely the data follow a linear pattern by
looking at the r-squared value.
An r-squared value close to 1 indicates a very good fit to
the given data, and an r-squared value close to zero
indicates a very poor fit to the data.
The topic of linear regression is very deep, and we have
only given a very brief introduction to it here.
You can read more about it in the PDF given on the
Assigned Reading for section 1.4.
Be sure you also watch the video about how to find a
linear regression on Excel! You can find the video link in
the Assigned Reading for section 1.4.