Transcript Slide 1

Simple Linear Regression
AMS 572
11/29/2010
Outline
1. Brief History and Motivation – Zhen Gong
2. Simple Linear Regression Model – Wenxiang Liu
3. Ordinary Least Squares Method – Ziyan Lou
4. Goodness of Fit of LS Line – Yixing Feng
5. OLS Example – Lingbin Jin
6. Statistical Inference on Parameters – Letan Lin
7. Statistical Inference Example – Emily Vo
8. Regression Diagnostics – Yang Liu
9. Correlation Analysis – Andrew Candela
10. Implementation in SAS – Joseph Chisari
Brief History and Introduction
• Legendre published the earliest form of the method of least squares around 1805; in 1809, Gauss published the same method.
• The term "regression" was used by Francis Galton in the 19th century to describe a biological phenomenon.
• Karl Pearson and Udny Yule extended it to a more general statistical context around the 20th century.
Motivation for Regression Analysis
• Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables.
• When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
[Diagram: a newly observed predictor value x is used to predict the response Y.]
Motivation for Regression Analysis
• 2010 Camry: horsepower at 6000 rpm: 169; highway gasoline consumption: 0.03125 gallon per mile
• 2010 Milan: horsepower at 6000 rpm: 175; highway gasoline consumption: 0.0326 gallon per mile
• 2010 Fusion: horsepower at 6000 rpm: 263; highway gasoline consumption: ?
Response variable (Y): Highway gasoline consumption
Predictor variable (X): Horsepower at 6000 rpm
Simple Linear Regression Model
• A summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X.
• Y is assumed to be a random variable; even if X is a random variable, we condition on it (treat it as fixed). Essentially, we are interested in the behavior of Y given that we know X = x.
Good Model
• Regression models attempt to minimize the distance, measured vertically, between each observation point and the model line (or curve).
• The length of this line segment is called the residual, modeling error, or simply error.
• The negative and positive errors should cancel out ⇒ zero overall error. However, many lines satisfy this criterion, so more is needed to single out the best line.
Probabilistic Model
• In simple linear regression, the population regression line is given by
$$E(Y) = \beta_0 + \beta_1 x$$
• The actual values of Y are assumed to be the sum of the mean value, E(Y), and a random error term ε:
$$Y = E(Y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$$
• At any given value of x, the dependent variable $Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$.
Least Squares (LS) Fit
Boiling Point of Water in the Alps

Pressure   Boiling Pt   Pressure   Boiling Pt
20.79      194.5        24.01      201.3
20.79      194.3        25.14      203.6
22.40      197.9        26.57      204.6
22.67      198.4        28.49      209.5
23.15      199.4        27.76      208.6
23.35      199.9        29.04      210.7
23.89      200.9        29.88      211.9
23.99      201.1        30.06      212.2
24.02      201.4
Least Squares (LS) Fit
Find a line that represents the "best" linear relationship:
Least Squares (LS) Fit
• Problem: the data do not all fall on a single line:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$$
• Find the line that minimizes the sum of squares:
$$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$$
• That is, we are looking for the line that minimizes the sum of the squared errors $e_i = y_i - (\beta_0 + \beta_1 x_i)$.
Least Squares (LS) Fit
• To get the parameters that minimize the sum of squared differences, take the partial derivative of Q with respect to each parameter and set it equal to zero:
$$\frac{\partial Q}{\partial \beta_0} = \sum 2\left[ y_i - (\beta_0 + \beta_1 x_i) \right](-1) = 0$$
$$\frac{\partial Q}{\partial \beta_1} = \sum 2\left[ y_i - (\beta_0 + \beta_1 x_i) \right](-x_i) = 0$$
• These simplify to the normal equations:
$$\beta_0 n + \beta_1 \sum x_i = \sum y_i$$
$$\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$$
Least Squares (LS) Fit
• Solve the equations and we get
$$\hat{\beta}_0 = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
Least Squares (LS) Fit
• To simplify, we introduce
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$$
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
• In this notation the solutions are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
• The resulting equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is known as the least squares line, which is an estimate of the true regression line.
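A minimal sketch of these formulas in SAS, the language used in the implementation section of this deck; the data set MYDATA and variables X and Y are hypothetical placeholders:

* compute the LS estimates from the summary-statistic formulas above;
Proc Sql;
Create Table LS_Estimates As
Select (Sum(x*y) - Sum(x)*Sum(y)/Count(*)) /
       (Sum(x*x) - Sum(x)*Sum(x)/Count(*)) As Beta1_Hat,    /* Sxy / Sxx */
       Mean(y) - Calculated Beta1_Hat * Mean(x) As Beta0_Hat /* ybar - Beta1_Hat * xbar */
From MyData;
Quit;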
Goodness of Fit of the LS Line
The fitted values are
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
and the residuals
$$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
are used to evaluate the goodness of fit of the LS line.
Goodness of Fit of the LS Line
The error sum of squares:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The total sum of squares:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The regression sum of squares:
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
These are linked by the decomposition
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSE} + \underbrace{2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})}_{=0}$$
so SST = SSR + SSE.
Goodness of Fit of the LS Line
• The coefficient of determination
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
is always between 0 and 1.
• The sample correlation coefficient between X and Y is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
For simple linear regression, the square of the sample correlation coefficient equals the coefficient of determination: $r^2 = SSR/SST$.
Estimation of the Variance
The variance $\sigma^2$ measures the scatter of the $Y_i$ around their means $E(Y_i) = \beta_0 + \beta_1 x_i$. An unbiased estimate of $\sigma^2$ is given by
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2}$$
This estimate of $\sigma^2$ has n - 2 degrees of freedom.
Implementing the OLS Method on Problem 10.4
OLS method:
$$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$$
The time between eruptions of Old Faithful geyser in Yellowstone National Park is random but is related to the duration of the last eruption. The table below shows these times for 21 consecutive eruptions.
Obs No.   Last   Next      Obs No.   Last   Next      Obs No.   Last   Next
1         2.0    50        8         2.8    57        15        4.0    77
2         1.8    57        9         3.3    72        16        4.0    70
3         3.7    55        10        3.5    62        17        1.7    43
4         2.2    47        11        3.7    63        18        1.8    48
5         2.1    53        12        3.8    70        19        4.9    70
6         2.4    50        13        4.5    85        20        4.2    79
7         2.6    62        14        4.7    75        21        4.3    72
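For the SAS sketches that follow, these data can be read in with a data step like the one below (a sketch; the data set name OLD_FAITHFUL and the variable names LAST and NEXT are illustrative choices):

Data Old_Faithful;
Input Last Next @@;  * @@ holds the input line so several (Last, Next) pairs fit per line;
Datalines;
2.0 50  1.8 57  3.7 55  2.2 47  2.1 53  2.4 50  2.6 62
2.8 57  3.3 72  3.5 62  3.7 63  3.8 70  4.5 85  4.7 75
4.0 77  4.0 70  1.7 43  1.8 48  4.9 70  4.2 79  4.3 72
;
Run;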
Implementing the OLS Method on Problem 10.4
[Figure: scatter plot of NEXT vs. LAST]
Implementing the OLS Method on Problem 10.4
$$\bar{x} = 3.238, \qquad \bar{y} = 62.714$$
$$S_{xx} = \sum_{i=1}^{21} (x_i - \bar{x})^2 = 22.230, \qquad S_{yy} = \sum_{i=1}^{21} (y_i - \bar{y})^2 = 2844.286$$
$$S_{xy} = \sum_{i=1}^{21} (x_i - \bar{x})(y_i - \bar{y}) = 217.629$$
$$SSE = \sum_{i=1}^{21} (y_i - \hat{y}_i)^2 = 713.687, \qquad SSR = \sum_{i=1}^{21} (\hat{y}_i - \bar{y})^2 = 2130.599, \qquad SST = S_{yy} = 2844.286$$
$$\hat{\beta}_1 = S_{xy}/S_{xx} = 9.790, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 31.013$$
Implementing the OLS Method on Problem 10.4
The least squares line is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 31.013 + 9.790x$$
When x = 3, $\hat{y} \approx 60$.
$$r = \sqrt{SSR/SST} = 0.865$$
We could say that LAST is a good predictor of NEXT.
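These hand computations can be verified with Proc Reg (a sketch using the OLD_FAITHFUL data step shown earlier; the full SAS treatment appears in the implementation section at the end of this deck):

Proc Reg Data=Old_Faithful;
Model Next = Last;  * printed output includes the parameter estimates, SSE, SSR, and R-square;
Run;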
Statistical Inference on β₀ and β₁
• Final result: $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed, with
$$E(\hat{\beta}_0) = \beta_0, \qquad SD(\hat{\beta}_0) = \sigma\sqrt{\frac{\sum x_i^2}{n S_{xx}}}, \qquad \frac{\hat{\beta}_0 - \beta_0}{SD(\hat{\beta}_0)} \sim N(0,1)$$
$$E(\hat{\beta}_1) = \beta_1, \qquad SD(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}}, \qquad \frac{\hat{\beta}_1 - \beta_1}{SD(\hat{\beta}_1)} \sim N(0,1)$$
Statistical Inference on β₀ and β₁
• Derivation: treat the $x_i$'s as fixed and use $\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = 0$:
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(Y_i - \bar{Y})}{S_{xx}} = \frac{\sum (x_i - \bar{x}) Y_i - \bar{Y} \sum (x_i - \bar{x})}{S_{xx}} = \sum_{i=1}^{n} \frac{(x_i - \bar{x}) Y_i}{S_{xx}}$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
Statistical Inference on β₀ and β₁
• Derivation:
$$E(\hat{\beta}_1) = \sum_{i=1}^{n} \frac{(x_i - \bar{x}) E(Y_i)}{S_{xx}} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(\beta_0 + \beta_1 x_i)}{S_{xx}} = \beta_0 \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} + \beta_1 \sum_{i=1}^{n} \frac{(x_i - \bar{x}) x_i}{S_{xx}}$$
The first sum is zero, and
$$\frac{1}{S_{xx}}\left[\sum_{i=1}^{n} (x_i - \bar{x}) x_i - \sum_{i=1}^{n} (x_i - \bar{x}) \bar{x}\right] = \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 1,$$
so $E(\hat{\beta}_1) = \beta_1$.
$$Var(\hat{\beta}_1) = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_{xx}}\right)^2 Var(Y_i) = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_{xx}}\right)^2 \sigma^2 = \frac{\sigma^2}{S_{xx}^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sigma^2 S_{xx}}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}$$
Statistical Inference on β₀ and β₁
• Derivation:
$$E(\hat{\beta}_0) = E(\bar{Y} - \hat{\beta}_1 \bar{x}) = E(\bar{Y}) - E(\hat{\beta}_1)\bar{x} = \frac{\sum E(\beta_0 + \beta_1 x_i)}{n} - \beta_1 \bar{x} = \frac{n\beta_0 + \beta_1 \sum x_i}{n} - \beta_1 \bar{x} = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0$$
$$Var(\hat{\beta}_0) = Var(\bar{Y} - \hat{\beta}_1 \bar{x}) = Var(\bar{Y}) + \bar{x}^2\, Var(\hat{\beta}_1) = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}} = \sigma^2 \left( \frac{S_{xx} + n\bar{x}^2}{n S_{xx}} \right) = \frac{\sigma^2 \sum x_i^2}{n S_{xx}}$$
Statistical Inference on β₀ and β₁
Since $(n-2)S^2/\sigma^2 = SSE/\sigma^2 \sim \chi^2_{n-2}$, the estimated standard errors are
$$SE(\hat{\beta}_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}, \qquad SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}$$
• Pivotal quantities (P.Q.'s):
$$\frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}, \qquad \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$
• Confidence intervals (CI's):
$$\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm t_{n-2,\alpha/2}\, SE(\hat{\beta}_1)$$
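In SAS these confidence intervals can be requested directly with the CLB option of Proc Reg (a sketch on the OLD_FAITHFUL data set introduced earlier):

Proc Reg Data=Old_Faithful;
Model Next = Last / CLB Alpha=0.05;  * CLB prints 95% confidence limits for Beta0 and Beta1;
Run;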
Statistical Inference on β₀ and β₁
• Hypothesis tests:
$$H_0: \beta_1 = \beta_1^0 \quad \text{vs.} \quad H_1: \beta_1 \neq \beta_1^0$$
Reject $H_0$ at level α if
$$|t_0| = \left| \frac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \right| \geq t_{n-2,\alpha/2}$$
• A useful application is to test whether there is a linear relationship between x and y:
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$
Reject $H_0$ at level α if
$$|t_0| = \left| \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \right| \geq t_{n-2,\alpha/2}$$
• One-sided alternative hypotheses can be tested using a one-sided t-test.
Analysis of Variance (ANOVA)
• Mean square: a sum of squares divided by its degrees of freedom:
$$MSR = \frac{SSR}{1} \qquad \text{and} \qquad MSE = \frac{SSE}{n-2}$$
• The F ratio is the square of the t statistic:
$$F_0 = \frac{MSR}{MSE} = \frac{SSR}{s^2} = \frac{\hat{\beta}_1^2 S_{xx}}{s^2} = \left( \frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}} \right)^2 = \left( \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \right)^2 = t_0^2$$
• Under $H_0: \beta_1 = 0$, reject at level α when $F_0 \geq f_{1,n-2,\alpha} = t_{n-2,\alpha/2}^2$.
Analysis of Variance (ANOVA)
• ANOVA table:

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)    F
Regression            SSR                   1                           MSR = SSR/1         F = MSR/MSE
Error                 SSE                   n-2                         MSE = SSE/(n-2)
Total                 SST                   n-1
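A quick worked check with the Old Faithful quantities computed earlier (SSR = 2130.599, SSE = 713.687, n = 21):
$$MSR = \frac{2130.599}{1} = 2130.599, \qquad MSE = \frac{713.687}{19} \approx 37.56, \qquad F_0 = \frac{MSR}{MSE} \approx 56.7 \approx (7.531)^2 = t_0^2,$$
which agrees, up to rounding, with the square of the t statistic computed on the next example slide.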
Statistical Inference Example: Testing for a Linear Relationship
• Problem 10.4: At α = 0.05, is there a linear trend between the time to the NEXT eruption and the duration of the LAST eruption?
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$
Reject $H_0$ if $|t| \geq t_{n-2,\alpha/2}$, where
$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$
Statistical Inference: Hypothesis Testing
Solution:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{217.629}{22.230} = 9.790$$
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 713.687, \qquad s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{713.687}{19}} = 6.129$$
$$SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}} = \frac{6.129}{\sqrt{22.230}} = 1.2999$$
$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{9.790}{1.2999} = 7.531$$
$$t_{n-2,\alpha/2} = t_{19,0.025} = 2.093$$
Since 7.531 ≥ 2.093, we reject $H_0$ and conclude that there is a linear relationship between NEXT and LAST.
Statistical Inference Example: Confidence and Prediction Intervals
• Problem 10.11 from Tamhane & Dunlop, Statistics and Data Analysis:
10.11(a): Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.
Problem 10.11: Prediction Interval
Solution: The formula for a 100(1-α)% PI for a future observation $Y^*$ is
$$\hat{Y}^* \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
Problem 10.11: Prediction Interval
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = 9.790, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 31.013$$
$$\hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^* = 31.013 + 9.790(3) = 60.385$$
$$s = \sqrt{\frac{SSE}{n-2}} = 6.129, \qquad t_{n-2,\alpha/2} = t_{19,0.025} = 2.093$$
$$\hat{Y}^* \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} = 60.385 \pm (2.093)(6.129)\sqrt{1 + \frac{1}{21} + \frac{(3 - 3.238)^2}{22.230}}$$
The 95% PI is [47.238, 73.529].
Problem 10.11: Confidence Interval
10.11(b): Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare this confidence interval with the PI obtained in (a).
Problem 10.11 - Confidence Interval
Solution:
The formula for a 100(1-α)% CI for  * is given by
 *
1 ( x*  x)2
   tn2, /2 s

n
S xx

*
where   B 0  B1 x
The 95% CI is [57.510, 63.257]
The CI is shorter than the PI
*



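Both intervals can be obtained in SAS with the CLI (prediction interval) and CLM (mean confidence interval) options of Proc Reg (a sketch; the appended observation with a missing NEXT value is the point x* = 3 to be scored):

Data Score;
Input Last Next;  * Next is missing, so this row is scored but not used in fitting;
Datalines;
3 .
;
Run;
Data Combined;
Set Old_Faithful Score;
Run;
Proc Reg Data=Combined;
Model Next = Last / CLI CLM Alpha=0.05;  * prints a 95% PI and CI for every observation;
Run;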
Regression Diagnostics
• Checking the model assumptions:
1. $E(Y_i)$ is a linear function of $x_i$
2. $Var(Y_i) = \sigma^2$ is the same for all $x_i$
3. The errors $\epsilon_i$ are normally distributed
4. The errors $\epsilon_i$ are independent (checked for time series data)
• Checking for outliers and influential observations
Checking the Model Assumptions
• Residuals: $e_i = y_i - \hat{y}_i$
• The $e_i$ can be viewed as "estimates" of the random errors $\epsilon_i$:
$$e_i \sim N\left(0,\ \sigma^2\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right]\right)$$
Checking for Linearity
• If the regression of y on x is linear, then the plot of $e_i$ vs. $x_i$ should exhibit random scatter around zero.
Checking for Linearity
Tire Wear Data

i   x_i   y_i      ŷ_i      e_i
1   0     394.33   360.64   33.69
2   4     329.50   331.51   -2.01
3   8     291.00   302.39   -11.39
4   12    255.17   273.27   -18.10
5   16    229.33   244.15   -14.82
6   20    204.83   215.02   -10.19
7   24    179.00   185.90   -6.90
8   28    163.83   156.78   7.05
9   32    150.33   127.66   22.67

[Figure: scatter plot of y vs. x with the fitted line]
Checking for Linearity
[Figure: residual plot of e_i vs. x_i for the Tire Wear Data; the residuals trace a curved pattern (positive, then negative, then positive) rather than random scatter around zero, so a straight line is not adequate.]
Checking for Linearity
• Data transformation: when the scatter plot is not linear, transform x and/or y and refit. Common candidates are $x^2$, $x^3$, $\sqrt{x}$, $\log x$, $1/x$ for the predictor and $y^2$, $y^3$, $\sqrt{y}$, $\log y$, $1/y$ for the response.
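One way to try such a transformation in SAS (a sketch reusing the OLD_FAITHFUL data set; the log transformation is just one of the candidates listed above):

Data Transformed;
Set Old_Faithful;
Log_Next = Log(Next);  * natural log of the response;
Run;
Proc Reg Data=Transformed;
Model Log_Next = Last;  * refit and recheck the residual plot for random scatter;
Run;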
Checking for Constant Variance
• If the constant variance assumption is correct, the dispersion of the $e_i$'s is approximately constant with respect to the $\hat{y}_i$'s.
Checking for Constant Variance
Example from textbook Problem 10.21:
[Figure: plot of residuals e vs. fitted values ŷ]
Checking for Normality
• We can use the residuals to make a normal plot.
Example from textbook Problem 10.21:
[Figure: normal probability plot of the residuals]
Checking for Outliers
Definition: an outlier is an observation that does not follow the general pattern of the relationship between y and x. A large standardized residual indicates an outlier:
$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}}$$
Flag observation i as an outlier if $|e_i^*| > 2$.
Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both. A large $h_{ii}$ indicates an influential observation:
$$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j, \qquad h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$
Flag observation i if $h_{ii} > 2(k+1)/n$, where k is the number of predictors.
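Both diagnostics are available through the Output statement of Proc Reg (a sketch on the OLD_FAITHFUL data; Student= requests the standardized residuals e* and H= the leverages h_ii):

Proc Reg Data=Old_Faithful;
Model Next = Last;
Output Out=Diagnostics Student=Std_Res H=Leverage;
Run;
Data Flagged;
Set Diagnostics;
If Abs(Std_Res) > 2 Or Leverage > 4/21;  * 2(k+1)/n with k = 1 and n = 21;
Run;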
Checking for Influential Observations
[Figure: scatter plot illustrating an influential observation]
Why use Correlation analysis?
• If the nature of the relationship between X and Y
is not known, we can investigate the correlation
between them without making any assumptions
of causality.
• In order to do this, assume (X,Y) follows the
bivariate normal distribution.
The Bivariate Normal Distribution
• (X,Y) has the following joint density:
$$f(x,y) = \frac{1}{2\pi\sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho \left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right\}$$
Why can we do this?
• This assumption reduces to the probabilistic model for linear regression, since the conditional distribution of Y given X = x is normal with parameters
$$E(Y \mid X = x) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \qquad Var(Y \mid X = x) = \sigma_Y^2 (1 - \rho^2)$$
• So when X = x, the mean of Y is a linear function of x and the variance is constant w.r.t. x.
So what?
• Under these assumptions we can use the data available to make inferences about ρ.
• First we have to estimate ρ from the data. Define the sample correlation coefficient R:
$$R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$$
How can we use this?
• The exact distribution of R is very complicated, but we do have some options.
• Under the null hypothesis H₀: ρ = 0, the distribution of R simplifies, and an exact test exists in this case.
• For arbitrary values of ρ₀ we can approximate a function of R with a normal distribution, thanks to R.A. Fisher.
Testing H₀: ρ = 0
• Under H₀ the statistic
$$t = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}}$$
is distributed as t(n-2). This is kind of surprising, but think about it: the test statistic we used to test β₁ = 0 is distributed as t(n-2), and ρ = 0 if and only if β₁ = 0. That the two test statistics are equivalent is shown on pages 382-383 of the text.
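A quick check with the Old Faithful example (r = 0.865, n = 21):
$$t = \frac{0.865\sqrt{19}}{\sqrt{1 - 0.865^2}} \approx \frac{3.770}{0.502} \approx 7.51,$$
which agrees, up to the rounding of r, with the t statistic t = 7.531 computed earlier for testing β₁ = 0.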
Approximation of R
• Fisher showed that, for n even as small as 10,
$$\tfrac{1}{2}\ln\left(\frac{1+R}{1-R}\right) \approx N\left(\tfrac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right),\ \frac{1}{n-3}\right)$$
• Now we can test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ for arbitrary ρ₀. We just compute
$$z = \sqrt{n-3}\left[\tfrac{1}{2}\ln\left(\frac{1+r}{1-r}\right) - \tfrac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)\right]$$
and compare it with the standard normal critical values.
Almost Finished!
• We now have the tools necessary for inference on ρ. For a confidence interval for ρ, compute
$$\tfrac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}$$
and solve for ρ at each endpoint ζ of this interval:
$$\rho = \frac{e^{2\zeta} - 1}{e^{2\zeta} + 1}$$
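A worked sketch with the Old Faithful values (r = 0.865, n = 21, α = 0.05):
$$\tfrac{1}{2}\ln\left(\frac{1.865}{0.135}\right) \approx 1.313, \qquad 1.313 \pm \frac{1.96}{\sqrt{18}} \approx [0.851,\ 1.775],$$
and transforming the endpoints back gives approximately $0.69 \leq \rho \leq 0.94$.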
Correlation - Conclusion
• When we are not sure of the relationship between X and Y, assume $(X_i, Y_i)$ is an observation from a bivariate normal distribution. To test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ at significance level α, compare
$$|z| = \sqrt{n-3}\left|\tfrac{1}{2}\ln\left(\frac{1+r}{1-r}\right) - \tfrac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)\right| \quad \text{to} \quad z_{\alpha/2}$$
• But if ρ₀ = 0, compare
$$|t| = \frac{|r|\sqrt{n-2}}{\sqrt{1-r^2}} \quad \text{to} \quad t_{n-2,\alpha/2}$$
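The Fisher option of Proc Corr carries out the z-transformation test and confidence interval directly (a sketch on the OLD_FAITHFUL data):

Proc Corr Data=Old_Faithful Fisher;
Var Last Next;
Run;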
SAS - Reg Procedure
Proc Reg Data=Regression_Example;
Title "Regression Example";
Model Next = Last;                  * fit Next = b0 + b1*Last;
Plot Next*Last;                     * scatter plot with the fitted line;
Plot Residual.*Predicted.;          * residuals vs. predicted values;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;
Proc Reg Output
[Output: plot produced by Plot Next*Last]
SAS - Plotting Regression Line
Symbol1 Value=Dot C=blue I=R;        * blue dots with the fitted regression line;
Symbol2 Value=None C=red I=RLCLM95;  * red regression line with 95% confidence limits for the mean;
Proc Gplot Data=Regression_Example;
Title "Regression Line and CIs";
Plot Next*Last=1 Next*Last=2/Overlay;  * overlay the two symbol definitions on one plot;
Run;
Plotting Regression Line
[Output: scatter plot with the fitted regression line and 95% confidence limits]
SAS - Checking Homoscedasticity
The Plot Residual.*Predicted. statement in the Proc Reg step shown earlier produces the residual-versus-predicted plot used to check for constant variance.
[Output: plot of Residual.*Predicted.]
SAS - Checking Normality of Residuals
Proc Reg Data=Regression_Example;
Model Next = Last;  * fit the model so residuals can be output;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;
Proc Univariate Data=Data_From_Regression Normal;  * Normal requests tests of normality;
Var R;
qqplot R / Normal(Mu=est Sigma=est);  * normal Q-Q plot of the residuals;
Run;
Checking for Normality
[Output: normal Q-Q plot of the residuals]
Questions?