Simple Linear Regression

Simple Linear Regression
1
I want to start this section with a story. Imagine we take everyone in the class
and line them up from shortest to tallest. As you look to the front of the class
from your seat the shortest will be on the left and the tallest will be on the right.
In fact, in a face-to-face class we will line you up. Compare yourself to other
people: if you are taller than someone else, move to the right; if shorter, move
to the left.
Now, imagine we have everyone lined up in order from shortest to tallest. If you
are back in your seat and you look down at the line-up (you have to use your
imagination because you cannot be both in the line-up and in your seat), I bet
the line-up looks like the following (when thinking about the height of the
people):
[Figure: sketch of the line-up, heights rising from left to right, with most
people between 5’6” and 6’1”; vertical axis labeled height.]
2
On the previous screen you see most people are between 5’6” and 6’1”. There
are some that are shorter and some that are taller. This is not rocket science,
right? From the line-up we could calculate the average height for the group.
Now, instead of looking at the height of people, let’s look at the size of their
feet. In the same order as height I would venture to say that the size of the feet
gets larger as we go from left to right in the room. Imagine you are walking
across the room looking down at people’s feet. I say the feet probably look like
the following (I only show three, but I wanted you to fill in the rest):
3
Overview
Imagine you are the first person to get into the room each day.
Say you have a class roster so you know the names of all the
other people in the class. Also say on each successive day you
will try to guess the height of the person who comes into the
room first after you.
At this point in the story you have to guess without any clue
about who will come into the room. I tell you that the best guess
you could make each day is to just guess the average height.
While you would likely be wrong each day at least you would
even out days of being below average and days of being above
average. Other methods of trying to guess the height might
always have you guess too high a value or too low a value.
4
Overview
Now, let’s change the story somewhat. Say before the person
enters the room and before you have to guess the height you can
see the person’s feet. Would knowing the size of their feet help
you guess the height of the person?
Since there is a pattern that people with larger feet tend to be
taller, you could say the height is above average if the foot size is
above average and the height is below average if the foot size is
below average. While you would probably still not guess the height
exactly, you would improve on just guessing the average height.
So, since foot size and height are related, knowing foot size can
help us predict height.
5
Overview
Note in this example that I am not saying that foot size is the cause
of height, just that foot size and height are related.
Regression analysis is a method to assist us in seeing if variables
are related. In this context when we say related we often use the
phrase that variables are correlated.
Also note that correlation is not causation. Foot size does not
cause height. In fact, foot size and height are really caused by
other variables such as nutrition and family genes.
In business we often seek out relationships between variables to
assist us in making sense of the world. The aim is to come up with
stories similar to the foot size/height story.
6
Overview
Consider an example about a group of college graduates. Each
graduate does not have the same dollar amount of starting
salary.
Since each graduate does not have the same starting salary
amount, an investigation might occur as to why not. In the
investigation one might think about other variables that might
influence starting salary. Starting salaries could be influenced by,
among other things, the GPA of the student, the number of
student groups the student was in, or even the work experience
of the graduate. The GPA variable might be important because
the higher the GPA, the higher the starting salary tends to be.
7
Overview
In the example so far, starting salary is called the response
variable because the values for starting salary are thought to
respond to the values of the other variables. The response
variable is often called the y variable and in a graph is put on the
vertical axis.
GPA, student groups, and work experience are all examples of
explanatory variables. When we use just one explanatory
variable with the response variable we have a situation where
we can conduct SIMPLE LINEAR REGRESSION. The explanatory
variable would be called the x variable and put on the horizontal
axis. When two or more explanatory variables are used we could
do MULTIPLE REGRESSION. For now we stick with simple linear
regression.
8
Using a Sample to Estimate the Model
On the next slide I show some data and a scatterplot for the
example we have been developing. Note that a sample of 7
graduates has been taken, and each row of the table gives the
GPA and starting salary of one graduate. Each point in the
scatterplot is a (GPA, starting salary) pair for a graduate.
With a sample of data we can estimate the regression line as
ŷ = b0 + b1x, where
b0 is the y intercept of the line and is the value of ŷ when x is 0.
The slope b1 is a number that represents the expected change in ŷ
when x increases by 1 unit.
By the way, ŷ is called "y hat," and we use this notation to signal
that we have an estimated regression line.
9
Graduate   GPA    Start Salary ($000s)
1          3.26   33.8
2          2.60   29.8
3          3.35   33.5
4          2.86   30.4
5          3.82   36.4
6          2.21   27.6
7          3.47   35.3

[Figure: scatterplot of Start Salary against GPA for the 7 graduates, with
the fitted trendline y = 5.7066x + 14.816; Start Salary ($000s) on the
vertical axis (25 to 39), GPA on the horizontal axis (0 to 5).]
10
Least Squares Method
[Figure: a generic scatterplot with Y on the vertical axis and X on the
horizontal axis, and three candidate straight lines drawn through it,
labeled Line 1, Line 2, and Line 3.]
11
Least Squares
On the previous slide I show a more generic scatter plot
and I put three lines in the graph.
All three lines are decent in the sense that with their upward
slope they all show the same basic idea as the dots in the
graph: as x rises, y rises (meaning x and y are positively
related). In theory we could find the equation for each line
by algebra, or something like that. Then for each line we
would have a b0 and b1 value.
Now line 1 is bad because it is too high. What I mean here
is that if we used the line to predict y we would always
predict too high a number. Similarly with line 3 we would
be too low all the time.
12
Least Squares
Line 2 is “among” the data points and when you make
predictions with the line sometimes you will be too high and
sometimes too low. But, no straight line can be exactly
perfect (unless all the points are truly on a straight line,
which will likely not happen in business and social
research).
Line 2 is my interpretation of the line that would be picked
by what is called the least squares method. When you look
at a y value on the line, called ŷ, the least squares line is
placed in such a way that the sum of the squared
differences of each dot to the line is minimized. Since each
dot has a y, the least squares method picks a b0 and b1
such that the resulting differences y minus ŷ, when squared
and then summed across all values, are minimized.
13
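To make the method concrete, here is a minimal Python sketch of the least
squares calculation (the data come from the earlier table; NumPy and the
variable names are my own choices, not part of the lecture):

    import numpy as np

    # Sample of 7 graduates from the example: GPA and starting salary (in $000s)
    gpa = np.array([3.26, 2.60, 3.35, 2.86, 3.82, 2.21, 3.47])
    salary = np.array([33.8, 29.8, 33.5, 30.4, 36.4, 27.6, 35.3])

    # Closed-form least squares: the b0 and b1 that minimize sum((y - yhat)^2)
    x_dev = gpa - gpa.mean()
    y_dev = salary - salary.mean()
    b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
    b0 = salary.mean() - b1 * gpa.mean()
    print(b0, b1)   # about 14.8156 and 5.7066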
[Excel regression output for the example, with the estimates
b0 (Intercept) = 14.816 and b1 (GPA) = 5.7066.]
14
Least Squares
For now we will assume Microsoft Excel or some other
program can show us the estimated regression line using
least squares. We just want to use what we get. On the
previous slide I show the Excel output. Note in cell B25 you see the
word Coefficients. In cells A26:A27 you see the words
Intercept and GPA, and the numbers 14.8156153 and
5.706568981 are in cells B26:B27. This means
ŷ = b0 + b1x has been estimated to be
Starting salary = 14.816 + 5.7066(GPA).
Note the data had starting salary measured in thousands.
This means, for example, the data had 29.8 but it means
the real value is 29,800.
15
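If Excel is not handy, the same estimates can be reproduced with a one-line
fit (a sketch continuing from the Python code above; np.polyfit is a
standard NumPy routine):

    # Degree-1 polynomial fit returns (slope, intercept), matching the Excel output
    b1_check, b0_check = np.polyfit(gpa, salary, 1)
    print(b0_check, b1_check)   # about 14.8156 and 5.7066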
Prediction with least squares
Remember our estimated line is
Starting salary = 14.816 + 5.7066(GPA).
Say we want to predict salary if the GPA is 2.7:
Starting salary = 14.816 + 5.7066(2.7)
= 30.22382.
Since salary is measured in thousands, this starting salary is $30,223.82.
16
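In code the prediction is one line (continuing the earlier sketch;
predict_salary is a helper name I made up):

    def predict_salary(g):
        # Predicted starting salary in $000s for a given GPA
        return b0 + b1 * g

    print(predict_salary(2.7))   # about 30.22; the slide's 30.22382 uses the rounded coefficients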
Interpolation and Extrapolation
You will notice in our example data set that the smallest value for
x was 2.21 and the largest value was 3.82.
When we want to predict a value of y, ŷ, for a given x, if the x is
within the range of the data values for x (2.21 to 3.82 in our
example) then we are interpolating. But if an x is outside our
range for x we are extrapolating.
Extrapolating should be used with a great deal of caution. Maybe
the relationship between x and y is different outside the range of
our data. If so, and we use the estimated line we may be way off
in our predictions.
Note the intercept has to be interpreted with similar caution:
unless the range of our x data includes zero, the relationship
between x and y could be very different in the x = 0
neighborhood than the one suggested by least squares.
17
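One way to keep this caution in front of you is to flag extrapolation when
predicting. A small sketch under the same assumptions as the earlier code
(the helper name is mine):

    # Warn when a requested x falls outside the observed range of the data
    def predict_salary_checked(g):
        if g < gpa.min() or g > gpa.max():
            print(f"warning: GPA {g} is outside the data range "
                  f"[{gpa.min()}, {gpa.max()}]; this is extrapolation")
        return b0 + b1 * g

    predict_salary_checked(2.7)   # inside 2.21 to 3.82: interpolation
    predict_salary_checked(4.0)   # outside the range: prints the warning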
Variation
Remember that to calculate the standard deviation of a variable we
take each value, subtract off the mean, and then square the
result. (We also divided by something, but that is not
important in this discussion.)
In a regression setting on the response variable Y we define the
total sum of squares SST as
SST = Σ(Yi – Ȳ)².
SST can be rewritten as
SST = Σ(Yi – Ŷi + Ŷi – Ȳ)² = Σ(Ŷi – Ȳ)² + Σ(Yi – Ŷi)² = SSR + SSE.
Note: you may recall from algebra that (a + b)² = a² + 2ab + b². In
our story here the 2ab term, summed across all observations, is 0.
While this is not true in general in algebra, it is in this context of
regression. If this note makes no sense to you, do not worry; just
use SST = SSR + SSE.
18
Variation
So we have SST = Σ(Yi – Ȳ)², SSR = Σ(Ŷi – Ȳ)², and
SSE = Σ(Yi – Ŷi)².
On the next slide I have a graph of the data with the regression
line put in and a line showing the mean of Y. For each point we
could look at how far the point is from the mean line. This is
what SST is looking at. But SSR indicates that, of all the
difference between the point and the mean, the regression line is
able to account for some of that variation. The rest of the
difference is SSE.
19
Variation
[Figure: scatterplot with the least squares regression line (Ŷi) and a
horizontal line at Ȳ (the mean of Y). Vertical gaps from points to the
regression line are two examples of what goes into SSE; vertical gaps
from the regression line to the Ȳ line are two examples of what goes
into SSR. Y is on the vertical axis, X on the horizontal axis.]
20
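To make the decomposition concrete, this short sketch (continuing the
earlier Python code) computes SST, SSR, and SSE for the salary example and
checks that SST = SSR + SSE:

    # Sums of squares for the example (continuing from the earlier sketch)
    y_hat = b0 + b1 * gpa                         # fitted values Yhat_i
    sst = ((salary - salary.mean()) ** 2).sum()   # total variation in Y
    ssr = ((y_hat - salary.mean()) ** 2).sum()    # variation explained by the line
    sse = ((salary - y_hat) ** 2).sum()           # leftover (residual) variation
    print(sst, ssr + sse)   # both about 61.38, so SST = SSR + SSE checks out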
The Coefficient of Determination
The coefficient of determination, often denoted r², measures the
proportion of the variation in Y that is explained by the
explanatory variable X in the regression model.
r² = SSR/SST. In our example from above we have r² = SSR/SST =
0.98 rounded to 2 decimals. This means that 98% of the
variation in starting salary is explained by the variability in the
GPA of students.
In other words, only 2% of the variability in starting salary is due
to other factors.
21
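Continuing the sketch above, r² is a single extra line:

    r_squared = ssr / sst
    print(round(r_squared, 2))   # 0.98, matching the slide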
Coefficient of Determination
Say we didn’t have an X variable to help us predict the Y variable.
Then a reasonable way to predict Y would be to just use its
average or mean value. But, with a regression, by using an X
variable it is thought we can do better than just using the mean
of Y as a predictor.
In a simple linear regression r² is an indicator of the strength of
the relationship between two variables because using the
regression model reduces the variability in predicting Y, compared
with just using the mean of Y, by the percentage obtained.
In different areas of study (like marketing, management, and so
on) the idea of what a good r² is varies. But you can be sure that if r²
is 0.8 or above you have a strong relationship.
22
Correlation
Remember the correlation coefficient r was used to
understand the direction and strength of the relationship
between two variables. The coefficient r, when squared, is
the r² in regression. Regression and correlation are related
in this way in simple linear regression.
23
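A quick check of this relationship, continuing the earlier sketch
(np.corrcoef returns the correlation matrix, so the off-diagonal entry is r):

    r = np.corrcoef(gpa, salary)[0, 1]   # correlation between GPA and salary
    print(r ** 2)                        # about 0.98, the same as SSR/SST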
Residuals
A residual = observed value minus the predicted value
= y – ŷ.
Back on slide 14 I had the data set and we see, for
example, the individual with GPA = 2.6 and starting salary =
29.8. So y = 29.8.
In the equation ŷ = 14.816 + 5.7066(GPA), with GPA = 2.6 we
get ŷ = 14.816 + 5.7066(2.6) = 29.65, and the
residual would be 29.8 – 29.65 = 0.15.
Individual points with large residuals may indicate unusual or
influential data points.
24
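As a last sketch (continuing the Python code above), the residuals for all
seven graduates can be computed at once:

    residuals = salary - y_hat    # observed minus predicted, y - yhat
    print(residuals.round(2))     # graduate 2 (GPA 2.6) has residual about 0.15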