linear regression intro


We’ll consider here the problem of paired data.
There are two common notations.
(x1, y1), (x2, y2), … , (xn, yn)
shows the data as n points in two-space
X     Y
x1    y1
x2    y2
x3    y3
…     …
xn    yn
This is the spreadsheet form.
PowerPoint show prepared by Gary Simon, 11 MARCH 2008.
The separate points are assumed independent.
We wish to find a relationship between variable X and
variable Y.
We have here a data set on eye response to different types of
drops, but for now we’ll look at just a few simple items of
information.
DP0OD   Pupil diameter, start of experiment, right eye
DP0OS   Pupil diameter, start of experiment, left eye
AGE     Subject age
There are altogether 100 subjects.
Let’s consider the relationship between the pupil diameters of
the two eyes.
An obvious first step is making a scatterplot showing all 100
people.
Let’s put the right eye on the horizontal axis and the left eye
on the vertical axis. This is not a critical decision.
[Figure: Scatterplot of DP0OD vs DP0OS. Axes: DP0OS and DP0OD, each running from 3 to 8.]
This graph shows that the points cluster near a diagonal line.
This is not a surprise.
Here’s the same picture with the Y = X line superimposed:
[Figure: Scatterplot of DP0OD vs DP0OS with the Y = X line superimposed. Axes: DP0OS and DP0OD, each running from 3 to 8.]
The points cling close to the line.
There are a few simple ways to summarize this situation.
Perhaps the best is the correlation. Here r = 0.96.
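The correlation r can be computed directly from the paired data. Here is a minimal Python sketch, assuming the usual sample (Pearson) correlation is what is meant. The function name pearson_r and the five data pairs are made up for illustration; they are not the study data and will not reproduce r = 0.96.

```python
import math

def pearson_r(x, y):
    """Sample (Pearson) correlation of paired data (x_i, y_i)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pupil diameters (mm) for five imaginary subjects.
right_eye = [4.1, 5.0, 5.8, 6.4, 7.2]
left_eye = [4.3, 4.9, 6.0, 6.3, 7.1]
print(round(pearson_r(right_eye, left_eye), 3))
```

A value of r near +1 corresponds to points hugging an upward-sloping line, which is the pattern seen in the scatterplot.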
Now let’s complicate this a bit. Suppose that we want to
check on the relationship between DP0OS (pupil diameter,
left eye) and AGE.
These two variables are not symmetric.
We’ll think of the variable AGE as “logically earlier.”
This means that we obtain it easily, reliably, and
(probably) earlier than the pupil diameter. Also,
it’s logical to think of using AGE to predict pupil
diameter.
We will designate AGE as the independent variable,
we will identify it with the symbol X, and we will
place it on the horizontal axis of the coming
scatterplot.
We’ll think of the variable DP0OS as “logically later.”
This information is obtained with some difficulty, with
possible measurement error, and (probably) later than
the age.
We will designate DP0OS as the dependent variable, we
will identify it with the symbol Y, and we will place it on
the vertical axis of the coming scatterplot.
The scatterplot is next. Before it’s shown, we should ask
ourselves whether
* pupil diameter generally rises with age
* pupil diameter is unrelated to age
* pupil diameter generally decreases with age
What do you think?
Here is the scatterplot:
[Figure: Scatterplot of DP0OS vs AGE. Horizontal axis: AGE, 20 to 70; vertical axis: DP0OS, 3 to 8.]
Suppose that you would like to summarize the relationship
between the two variables. You would like to write
Pupil Diameter = Y = dependent variable
= f(AGE) = f(X) = f(independent variable)
for some function f .
The problem is that you’ll never find a believable
function to go through all the dots on the scatterplot.
There is too much statistical noise.
The expression of the model will be revised to
Y = f(X) + ε
The symbol ε represents statistical noise. It may involve
random errors in measuring Y, or it may just represent
variability that we don’t know how to account for.
One could also have made “multiplicative noise” in the
form Y = f(X) × ε. In some cases, this is useful. For now,
we’ll stick with the “additive noise” with the + sign.
We will have a lot to say about the ε term. For now, we’ll
just assume that it is independent over the data points.
What form should we use for the function f ?
How about f(X) = log X ?
How about f(X) = a X² + b X + c ?
How about f(X) = tan( a X² + h) ?
How about f(X) = [a more elaborate expression involving tanh and log |X|; not recoverable from this transcript] ?
We will start with the simplest function, the straight line.
This is f(X) = β0 + β1 X .
The symbols β0 and β1 are parameters.
β0 is the intercept, also called Y-intercept.
β1 is the slope.
In nearly all cases, β0 and β1 are not known,
and we have to estimate them from data.
The notation is not universal. You will also see
f(X) = α + β X    (This is OK.)
f(X) = a + b X    (Use of Roman letters is not recommended.)
To keep clear which symbols are fixed and which are random,
we will prefer f(x) = β0 + β1 x .
That is, we will prefer lower-case x.
It is however impossible to enforce distinctions
between x and X and also between y and Y. We can’t
be too dogmatic about the notation.
The relationship between Y and X will be described through
the simple linear regression model
Y = β0 + β1 x + ε
This is made more direct by adding a subscript i to label
individual data points. Our preferred form for the simple
linear regression model is
Yi = β0 + β1 xi + εi
with i = 1, 2, …, n.
The simple linear regression model also includes these
assumptions about the noise terms ε1 , ε2 , ε3 , … , εn :
The ε’s are independent of each other and also
independent of the x’s.
The ε’s are sampled from a hypothetical
population in which the mean is zero and the
standard deviation is σ.
In some cases, we may add in the further
assumption that the ε’s are sampled from a normal
population.
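One way to get a feel for these assumptions is to generate artificial data that obey them. The sketch below is only an illustration: the function name simulate_slr and the parameter values (7.0, -0.04, 0.8) are made up, and it adopts the optional normal-population assumption for the ε’s.

```python
import random

def simulate_slr(x_values, beta0, beta1, sigma, seed=None):
    """Generate Y_i = beta0 + beta1 * x_i + eps_i for fixed x values.

    The eps_i are independent draws from a normal population with
    mean 0 and standard deviation sigma; they do not depend on x_i.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical parameter values, chosen only for illustration.
ages = [20, 30, 40, 50, 60, 70]
diameters = simulate_slr(ages, beta0=7.0, beta1=-0.04, sigma=0.8, seed=1)
for age, dp in zip(ages, diameters):
    print(age, round(dp, 2))
```

In a simulation the parameters are known because we chose them. With real data they are not, which is why they must be estimated.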
The simple linear regression model Yi = β0 + β1 xi + εi has
three unknown parameters: β0 , β1 , and σ .
Estimating these parameters is an important part of the
regression task.
Estimating β0 and β1 is equivalent to drawing a line on the
scatterplot. The estimate of σ tells us how well the line
describes the set of points on the scatterplot.
The estimate of β0 is written b0 .
The estimate of β1 is written b1 .
The estimate of σ is written s .
You’ll also see sε or sY | x .
Note this consistent pattern of usage:
Model parameters are Greek letters.
Data-based estimates are corresponding Latin letters.
Be aware that other schemes exist.
Someone who writes the model as Yi = α + βxi + εi
will use a for the estimate of α and will use
b for the estimate of β.
Someone who writes the model as Yi = a + b xi + εi
will use â for the estimate of a and will use
b̂ for the estimate of b.
For our problem, the model is
DPi = β0 + β1 AGEi + εi
The pupil diameter DP is in units of mm (millimeters). The
variable AGE is in units of years.
Therefore, β0 and its estimate b0 are in units of mm.
Also, the ε’s and their standard deviation σ are in units of
mm. The estimate of σ is also in units of mm.
The slope β1 and its estimate b1 are in units of mm/year.
How should we estimate β0 and β1 ?
We could guess.
We could draw a nice-looking line on the scatterplot
and then use that line to get the estimates.
These are not necessarily bad methods, but they are not
reproducible. This means that different people get
different answers. Worse yet, the same person on two
occasions will produce different answers.
We will instead propose that the estimates be done by
minimizing a mathematical function.
Many proposals have been made, but the nearly universal
choice is least squares. Choose b0 and b1 to minimize the
function
$$ Q = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right)^2 $$
How should this minimization be done?
The solution is by (mindless and routine) differentiation.
That is, set both partial derivatives to zero and solve the system
$$ \frac{\partial Q}{\partial b_0} = 0, \qquad \frac{\partial Q}{\partial b_1} = 0 $$
This results in two linear equations in the two unknowns
b0 and b1 .
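For reference, here is a sketch of where the two linear equations come from, assuming the least squares criterion Q = Σ (Yi - b0 - b1 xi)² with the sum running over i = 1, …, n. Differentiating and setting each derivative to zero gives

```latex
\frac{\partial Q}{\partial b_0}
  = -2 \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 x_i \right) = 0
\qquad
\frac{\partial Q}{\partial b_1}
  = -2 \sum_{i=1}^{n} x_i \left( Y_i - b_0 - b_1 x_i \right) = 0
```

Dividing by -2 and rearranging yields the pair of equations (often called the normal equations)

```latex
n\, b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} Y_i
\qquad
b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i
```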
The solution method selected by the previous slide works,
but it’s clumsy. Here is a cleaner way to do this.
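(1) Find the five sums
$$ \sum_{i=1}^{n} x_i, \quad \sum_{i=1}^{n} y_i, \quad \sum_{i=1}^{n} x_i^2, \quad \sum_{i=1}^{n} y_i^2, \quad \sum_{i=1}^{n} x_i y_i $$
(2) Next find these quantities:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $$
$$ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} $$
$$ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n} $$
$$ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n} $$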
(3) Find b1 (the estimate of the slope β1) as
$$ b_1 = \frac{S_{xy}}{S_{xx}} $$
(4) Find b0 (the estimate of the intercept β0) as
$$ b_0 = \bar{y} - b_1 \bar{x} $$
Note that b1 is found before b0 .
(5) Finally, calculate
$$ S_{yy|x} = S_{yy} - \frac{\left( S_{xy} \right)^2}{S_{xx}} $$
We’ll use this later in the estimation of σ, the standard
deviation of the noise.
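Steps (1) through (5) translate almost line for line into code. The following Python sketch is one possible rendering, not the routine Minitab uses. The function name least_squares_fit and the small data set are made up for illustration.

```python
def least_squares_fit(x, y):
    """Return (b0, b1, syy_given_x) using the five-sums recipe."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    # Step (1): the five sums.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    # Step (2): means and corrected sums of squares and products.
    x_bar = sum_x / n
    y_bar = sum_y / n
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    # Steps (3) and (4): slope first, then intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Step (5): the quantity used later in estimating sigma.
    syy_given_x = syy - sxy ** 2 / sxx
    return b0, b1, syy_given_x

# Hypothetical data lying exactly on the line y = 2 + 3x.
b0, b1, syy_given_x = least_squares_fit([1, 2, 3, 4], [5, 8, 11, 14])
print(b0, b1, syy_given_x)
```

Because the made-up points lie exactly on a line, the fit recovers that line and Syy|x comes out as zero (up to rounding). Real data with noise give a positive value.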
While it’s possible to do this for our problem of pupil
diameter versus age with just the use of a calculator…
there are too many steps and we are likely to make errors.
We’ll give this to the Minitab function
Stat > Regression > Regression.
The Minitab output is extensive, but from it we find
Regression Analysis: DP0OD versus AGE
The regression equation is
DP0OD = 7.27 - 0.0430 AGE
This is called the fitted regression equation. This
identifies for us b0 = 7.27 and b1 = -0.0430.
Here is a reprise of the scatterplot, now shown with the
fitted regression line.
[Figure: Fitted Line Plot, DP0OD = 7.269 - 0.04295 AGE. Horizontal axis: AGE, 20 to 70; vertical axis: DP0OD, 3 to 8. Legend: S = 0.832776, R-Sq = 35.6%, R-Sq(adj) = 34.9%.]
This was made in Minitab with Stat > Regression >
Fitted Line Plot.
The plot also reports sε = 0.832776, the estimate of σ.
It’s important to distinguish population quantities from
sample quantities.
The process of regression is not simply
“numbers in” → “numbers out.”
The simple linear regression model is
Yi = β0 + β1 xi + εi
If you are asked to graph the line Y = β0 + β1 x
...
Please refuse! You cannot graph this line because
β0 and β1 are unknown population parameters.
With data, you will get the estimates b0 and b1.
The fitted regression line is Ŷ = b0 + b1 x .
The “hat” on Y is helpful, but it’s a
typesetting nuisance. The fitted line is
often given without the “hat.”
For the pupil diameter problem, the fitted line is
fitted DP = 7.27 - 0.0430 AGE
The interpretation of -0.0430 is . . .
that each year of age is associated with a reduction
of 0.0430 mm in pupil diameter.
The interpretation of 7.27 is . . .
to be avoided. It’s tempting to say that it’s an
assessment of pupil diameter at birth. The data set
did not have anyone younger than 18, so we won’t
force an interpretation.
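The two interpretations can be checked with a little arithmetic. This Python sketch simply evaluates the fitted line using the rounded estimates b0 = 7.27 and b1 = -0.0430 reported above. The function name fitted_diameter and the ages chosen are illustrative only.

```python
B0 = 7.27      # estimated intercept, in mm
B1 = -0.0430   # estimated slope, in mm per year

def fitted_diameter(age):
    """Fitted pupil diameter (mm) at a given age (years)."""
    return B0 + B1 * age

# Fitted value for a 40-year-old: 7.27 - 0.0430 * 40.
print(round(fitted_diameter(40), 2))   # 5.55

# Ten extra years of age change the fitted value by 10 * b1.
print(round(fitted_diameter(50) - fitted_diameter(40), 2))   # -0.43
```

Evaluating the function at age 0 just returns b0 = 7.27, but that is an extrapolation far below the youngest subject in the data, which is why the intercept is not given a physical interpretation.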
The estimate of the noise standard deviation was
calculated as sε = 0.832776. This is about 0.83 mm,
which is rather large for this context.
What are we to make of this large value?
This is saying that AGE is far from a perfect
predictor of pupil diameter.
We still have to decide
* Is there an objective way to decide if this whole
activity was worth doing?
* Is there an objective way to decide if the model
Yi = β0 + β1 xi + εi was a good choice?