10.2 Fitting the Simple Linear Regression Model


Simple Linear Regression
With Thanks to My Students in AMS572 – Data Analysis
1. Introduction
Example:
● David Beckham: 1.83 m; Victoria Beckham: 1.68 m
● Brad Pitt: 1.83 m; Angelina Jolie: 1.70 m
● George Bush: 1.81 m; Laura Bush: ?

Goal: to predict the height of the wife in a couple, based on the husband's height.
● Response (outcome or dependent) variable (Y): height of the wife
● Predictor (explanatory or independent) variable (X): height of the husband
Regression analysis:
● Regression analysis is a statistical methodology to estimate the relationship of a response variable to a set of predictor variables.
● When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
● When it is not clear which variable represents a response and which a predictor, correlation analysis is used to study the strength of the relationship.

History:
● The earliest form of linear regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809.
● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.
● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around the turn of the 20th century.
A probabilistic model

We denote the n observed values of the predictor variable x as $x_1, x_2, \ldots, x_n$, and the corresponding observed values of the response variable Y as $y_1, y_2, \ldots, y_n$.
Notation of the Simple Linear Regression Model

$y_i$ is the observed value of the random variable $Y_i$, which depends on $x_i$ according to

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \qquad (i = 1, 2, \ldots, n),$$

where $\epsilon_i$ is a random error with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$. Therefore

$$E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i$$

is the true regression line: $\mu_i$ is the unknown mean of $Y_i$, $\beta_1$ the unknown slope, and $\beta_0$ the unknown intercept.
4 BASIC ASSUMPTIONS – for statistical inference

● $Y_i$ is a linear function of the predictor variable $x_i$.
● The errors $\epsilon_i$ have a common variance $\sigma^2$, the same for all values of x.
● The errors $\epsilon_i$ are normally distributed.
● The errors $\epsilon_i$ are independent.
Comments:

1. "Linear" means linear not in x but in the parameters $\beta_0$ and $\beta_1$.
   Example: $E(Y) = \beta_0 + \beta_1 \log x$ is linear, with $\log x = x^*$.
2. The predictor variable need not be set at predetermined fixed values; it can be random along with Y. The model can then be considered a conditional model.
   Example: height and weight of children, with height (X) given and weight (Y) to be predicted. Then

   $$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

   is the conditional expectation of Y given X = x.
2. Fitting the Simple Linear Regression Model

2.1 Least Squares (LS) Fit
Example 10.1 (Tire Tread Wear vs. Mileage: Scatter Plot. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.)

[Scatter plot of groove depth vs. mileage]
The line $y = \beta_0 + \beta_1 x$ has vertical deviations $y_i - (\beta_0 + \beta_1 x_i)$ $(i = 1, 2, \ldots, n)$ from the data points. The least squares criterion is to minimize

$$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$$
The "best" fitting straight line in the sense of minimizing Q gives the LS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. One way to find them is to differentiate Q:

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right], \qquad \frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \left[ y_i - (\beta_0 + \beta_1 x_i) \right]$$

Setting these partial derivatives equal to zero and simplifying, we get the normal equations:

$$\hat{\beta}_0\, n + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
Solving these equations, we get

$$\hat{\beta}_0 = \frac{\left(\sum x_i^2\right)\left(\sum y_i\right) - \left(\sum x_i\right)\left(\sum x_i y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \hat{\beta}_1 = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
To simplify, we introduce

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$$

Then

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The resulting equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is known as the least squares line, which is an estimate of the true regression line.
Example 10.2 (Tire Tread Wear vs. Mileage: LS Line Fit)

Find the equation of the LS line for the tire tread wear data from Table 10.1. We have

$$\sum x_i = 144, \quad \sum y_i = 2197.32, \quad \sum x_i^2 = 3264, \quad \sum y_i^2 = 589{,}887.08, \quad \sum x_i y_i = 28{,}167.72$$

and n = 9. From these we calculate $\bar{x} = 16$, $\bar{y} = 244.15$, and

$$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 28{,}167.72 - \frac{1}{9}(144)(2197.32) = -6989.40$$

$$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 3264 - \frac{1}{9}(144)^2 = 960$$
The slope and intercept estimates are

$$\hat{\beta}_1 = \frac{-6989.40}{960} = -7.281 \quad \text{and} \quad \hat{\beta}_0 = 244.15 - (-7.281)(16) = 360.64$$

Therefore, the equation of the LS line is

$$\hat{y} = 360.64 - 7.281x.$$

Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving. Given a particular value, say x = 25, we can find $\hat{y} = 360.64 - 7.281(25) = 178.62$ mils, which means the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.
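These hand calculations can be checked in SAS, the package used for the diagnostics later in this chapter. A minimal sketch, assuming the nine (x, y) pairs from Table 10.1 (the same pairs appear in the residual table of Section 4.1):

data tire;                 /* tread wear data: x = mileage (1000s of miles), */
   input x y;              /* y = groove depth (mils)                        */
   datalines;
0 394.33
4 329.50
8 291.00
12 255.17
16 229.33
20 204.83
24 179.00
28 163.83
32 150.33
;
run;

proc reg data=tire;        /* least squares fit of y = b0 + b1*x;            */
   model y = x;            /* should reproduce b0 = 360.64, b1 = -7.281      */
run;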
2.2 Goodness of Fit of the LS Line

Coefficient of Determination and Correlation

The fitted values are $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ $(i = 1, 2, \ldots, n)$, and the residuals

$$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \qquad (i = 1, 2, \ldots, n)$$

are used to evaluate the goodness of fit of the LS line.
$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{SSR}} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{SSE}} + \underbrace{2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})}_{=\,0}$$

We define SST = SSR + SSE, and the ratio

$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$

Note: SST is the total sum of squares, SSR the regression sum of squares, and SSE the error sum of squares. $R^2$ is called the coefficient of determination.
Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation)

For the tire tread wear data, calculate $R^2$ using the results from Example 10.2. We have

$$\text{SST} = S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 589{,}887.08 - \frac{1}{9}(2197.32)^2 = 53{,}418.73$$

Next calculate SSR = SST − SSE = 53,418.73 − 2531.53 = 50,887.20. Therefore

$$R^2 = \frac{50{,}887.20}{53{,}418.73} = 0.953$$

The Pearson correlation is $r = -\sqrt{0.953} = -0.976$, where the sign of r follows from the sign of $\hat{\beta}_1 = -7.281$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
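The sums of squares can be accumulated directly from the fit. A minimal SAS sketch, assuming the tire dataset and the LS line from Example 10.2 (the last line anticipates the estimate of σ² in Section 2.3):

data gof;
   set tire end=last;
   yhat = 360.64 - 7.281*x;       /* fitted value from the LS line       */
   sst + (y - 244.15)**2;         /* accumulate SST about ybar = 244.15  */
   sse + (y - yhat)**2;           /* accumulate SSE                      */
   if last then do;
      ssr = sst - sse;            /* SSR = SST - SSE = 50,887.20         */
      r2  = 1 - sse/sst;          /* R-square = 0.953                    */
      s2  = sse/(_n_ - 2);        /* unbiased estimate of sigma-squared  */
      output;
   end;
run;

proc print data=gof; var sst sse ssr r2 s2; run;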
The Maximum Likelihood Estimators (MLE)

Consider the linear model $y_i = a + bx_i + \epsilon_i$, where $\epsilon_i$ is drawn from a normal population with mean 0 and standard deviation $\sigma$. The likelihood function for Y is

$$L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\sum_{i=1}^{n} \frac{(y_i - a - bx_i)^2}{2\sigma^2} \right)$$

Thus, the log-likelihood for the data is

$$\log L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \sum_{i=1}^{n} \frac{(y_i - a - bx_i)^2}{2\sigma^2}.$$
The MLE Estimators

Solving

$$\frac{\partial \log L}{\partial a} = 0, \qquad \frac{\partial \log L}{\partial b} = 0, \qquad \frac{\partial \log L}{\partial \sigma^2} = 0,$$

we obtain the MLEs of the three unknown model parameters a, b, and $\sigma^2$.

● The MLEs of the model parameters a and b are the same as the LSEs; both are unbiased.
● The MLE of the error variance, however, is biased:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n} = \frac{\text{SSE}}{n}$$
2.3 An Unbiased Estimator of σ²

An unbiased estimate of $\sigma^2$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{\text{SSE}}{n-2}$$

Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of σ²)

Find the estimate of $\sigma^2$ for the tread wear data using the results from Example 10.3. We have SSE = 2531.53 and n − 2 = 7, therefore

$$s^2 = \frac{2531.53}{7} = 361.65,$$

which has 7 d.f. The estimate of $\sigma$ is $s = \sqrt{361.65} = 19.02$ mils.
3. Statistical Inference on β0 and β1

Under the normal error assumption:

● Point estimators: $\hat{\beta}_0$, $\hat{\beta}_1$
● Sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\hat{\beta}_0 \sim N\left(\beta_0,\ \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad \hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right)$$

with estimated standard errors

$$SE(\hat{\beta}_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}, \qquad SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}$$

For the mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.
Statistical Inference on β0 and β1, Cont'd

● Pivotal quantities (P.Q.'s):

$$\frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}, \qquad \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

● Confidence intervals (CI's):

$$\hat{\beta}_0 \pm t_{n-2,\,\alpha/2}\, SE(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, SE(\hat{\beta}_1)$$
Statistical Inference on β0 and β1, Cont'd

● Hypothesis tests:

$$H_0: \beta_1 = \beta_1^0 \ \text{vs.}\ H_a: \beta_1 \ne \beta_1^0 \qquad \text{and} \qquad H_0: \beta_1 = 0 \ \text{vs.}\ H_a: \beta_1 \ne 0$$

● Test statistics:

$$t_0 = \frac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \qquad \text{and} \qquad t_0 = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

● At the significance level α, we reject $H_0$ in favor of $H_a$ if and only if (iff) $|t_0| > t_{n-2,\,\alpha/2}$.
● The test of $H_0: \beta_1 = 0$ is used to show whether there is a linear relationship between x and y; a SAS sketch of these computations follows.
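A minimal sketch of the CI and test formulas in a SAS data step, assuming the summary numbers from Examples 10.2–10.4 (s = 19.02, Sxx = 960, Σxi² = 3264):

data infer;
   n = 9; s = 19.02; Sxx = 960; sumx2 = 3264;
   b0 = 360.64; b1 = -7.281;
   se_b0 = s*sqrt(sumx2/(n*Sxx));     /* SE(beta0-hat)                 */
   se_b1 = s/sqrt(Sxx);               /* SE(beta1-hat)                 */
   tcrit = tinv(0.975, n-2);          /* t_{7, .025} for 95% CIs       */
   lo1 = b1 - tcrit*se_b1;            /* 95% CI for beta1              */
   up1 = b1 + tcrit*se_b1;
   t0 = b1/se_b1;                     /* tests H0: beta1 = 0           */
   pval = 2*(1 - probt(abs(t0), n-2));
run;

proc print data=infer; run;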
Analysis of Variance (ANOVA), Cont'd

Mean square: a sum of squares divided by its d.f.

$$MSR = \frac{SSR}{1}, \qquad MSE = \frac{SSE}{n-2}$$

$$F = \frac{MSR}{MSE} = \frac{SSR}{s^2} = \frac{\hat{\beta}_1^2 S_{xx}}{s^2} = \left(\frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}}\right)^2 = \left(\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\right)^2 = t_0^2 \ \overset{H_0}{\sim}\ F_{1,\,n-2}$$
Analysis of Variance (ANOVA)

ANOVA Table

Source of Variation   SS    d.f.    MS                  F
Regression            SSR   1       MSR = SSR/1         F = MSR/MSE
Error                 SSE   n - 2   MSE = SSE/(n - 2)
Total                 SST   n - 1

Example:

Source       SS          d.f.   MS          F
Regression   50,887.20   1      50,887.20   140.71
Error        2,531.53    7      361.65
Total        53,418.73   8
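A quick numeric check of the example table, and of the identity F = t0², in a SAS data step; a sketch using the sums of squares from Example 10.3:

data anova_chk;
   n = 9;
   ssr = 50887.20; sse = 2531.53;
   msr = ssr/1;                      /* = 50,887.20             */
   mse = sse/(n-2);                  /* = 361.65 = s-squared    */
   F = msr/mse;                      /* = 140.71 = t0-squared   */
   pval = 1 - probf(F, 1, n-2);      /* upper-tail F p-value    */
run;

proc print data=anova_chk; run;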
4. Regression Diagnostics

4.1 Checking for Model Assumptions

● Checking for linearity
● Checking for constant variance
● Checking for normality
● Checking for independence
Checking for Linearity

The true line is $Y = \beta_0 + \beta_1 x$ and the fitted line is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$; $\hat{Y}_i$ is the fitted value and $e_i = Y_i - \hat{Y}_i$ is the residual. Here $X_i$ = mileage and $Y_i$ = groove depth.

 i   Xi   Yi       Ŷi       ei
 1    0   394.33   360.64    33.69
 2    4   329.50   331.51    -2.01
 3    8   291.00   302.39   -11.39
 4   12   255.17   273.27   -18.10
 5   16   229.33   244.15   -14.82
 6   20   204.83   215.02   -10.19
 7   24   179.00   185.90    -6.90
 8   28   163.83   156.78     7.05
 9   32   150.33   127.66    22.67

[Scatterplot of ei vs. Xi]
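A residual-vs-predictor plot like the one above can be produced in SAS; a minimal sketch, assuming the tire dataset created earlier:

proc reg data=tire;
   model y = x;
   output out=fit p=yhat r=e;     /* fitted values and residuals */
run;

proc sgplot data=fit;             /* plot of e_i vs. x_i         */
   scatter x=x y=e;
   refline 0 / axis=y;            /* zero line for reference     */
run;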
Checking for Normality

[Normal probability plot of the residuals: Mean = 3.95E-16, StDev = 17.79, N = 9, AD = 0.514, P-Value = 0.138]

Since the Anderson-Darling p-value of 0.138 is not small, normality of the residuals is not rejected.
Checking for Constant Variance

[Two plots of residuals vs. fitted value: one where Var(Y) is not constant, and a sample residual plot where Var(Y) is constant]
Checking for Independence

● Does not apply for the simple linear regression model
● Only applies for time series data
4.2 Checking for Outliers & Influential Observations

● What is an OUTLIER?
● Why checking for outliers is important
● Mathematical definition
● How to deal with them

4.2-A. Intro

Recall the Box and Whiskers plot (Chapter 4 of T&D):

● An outlier is an observation "far away" from the rest of the data.
● A (mild) OUTLIER is defined as any observation that lies outside of Q1 − (1.5 × IQR) and Q3 + (1.5 × IQR) (interquartile range, IQR = Q3 − Q1).
● An (extreme) OUTLIER is one that lies outside of Q1 − (3 × IQR) and Q3 + (3 × IQR).
4.2-B. Why are outliers a problem?

● They may indicate a sample peculiarity, a data entry error, or another problem;
● Regression coefficients estimated by minimizing the Sum of Squares for Error (SSE) are very sensitive to outliers >> bias or distortion of estimates;
● Any statistical test based on sample means and variances can be distorted in the presence of outliers >> distortion of p-values;
● Faulty conclusions.

(Estimators not sensitive to outliers are said to be robust.)

Example:

                  Sorted Data    Median   Mean   Variance   95% CI for mean
Real Data         1 3 5 9 12     5        6.0    20.0       [0.45, 11.55]
Data with Error   1 3 5 9 120    5        27.6   2676.8     [-36.63, 91.83]
4.2-C. Mathematical Definition

● Outlier

The standardized residual is given by

$$e_i^* = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$

If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.

Example (Tire Tread Wear vs. Mileage):

i     1      2      3      4      5      6      7      8     9
ei*   2.25  -0.12  -0.66  -1.02  -0.83  -0.57  -0.40   0.43  1.51

● STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current observation deleted from the analysis.
● The LS fit can be excessively influenced by an observation that is not necessarily an outlier as defined above.
4.2-C. Mathematical Definition

● Influential Observation

An observation with an extreme x-value, y-value, or both. For simple linear regression, the leverage of observation i is

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$

● On average $h_{ii}$ is (k+1)/n, where k is the number of predictors; regard any $h_{ii} > 2(k+1)/n$ as high leverage;
● If $x_i$ deviates greatly from the mean $\bar{x}$, then $h_{ii}$ is large;
● The standardized residual will be large for a high-leverage observation;
● Influence can be thought of as the product of leverage and outlierness.

Example: (an observation can be influential/high leverage, but not an outlier)

[eg. 1: scatter plots with and without the observation; eg. 2: scatter plot and residual plot]
4.2-C. SAS code of the tire example

SAS code:

data tire;
   input x y;
   datalines;
0 394.33
4 329.50
8 291.00
12 255.17
16 229.33
20 204.83
24 179.00
28 163.83
32 150.33
;
run;

proc reg data=tire;
   model y = x;
   output out=resid rstudent=r h=lev cookd=cd dffits=dffit;
run;

proc print data=resid;
   where abs(r)>=2 or lev>(4/9) or cd>(4/9) or abs(dffit)>(2*sqrt(1/9));
run;
4.2-C. SAS output of the tire example

[SAS output: the PROC PRINT listing of observations flagged by the WHERE conditions]
4.2-D. How to deal with Outliers & Influential Observations

● Investigate (Data errors? Rare events? Can they be corrected?)
● Ways to accommodate outliers:
  ● Nonparametric methods (robust to outliers)
  ● Data transformations
  ● Deletion (or report model results both with and without the outliers or influential observations, to see how much they change)
4.3 Data Transformations

Reasons:
● To achieve linearity
● To achieve homogeneity of variance
● To achieve normality or symmetry about the regression equation
Types of Transformation

● Linearizing transformation: a transformation of the response variable, the predictor variable, or both, which produces an approximately linear relationship between the variables.
● Variance-stabilizing transformation: a transformation applied when the constant variance assumption is violated.
Linearizing Transformation

● Uses a mathematical operation, e.g. square root, power, log, exponential, etc.
● Only one variable needs to be transformed in simple linear regression. Which one, the predictor or the response? Why?
e.g. We take a log transformation of the exponential model:

$$Y = \beta_0 \exp(-\beta_1 x) \iff \log Y = \log\beta_0 - \beta_1 x$$

Xi   Yi       log Ŷi   Ŷi = exp(log Ŷi)   ei
 0   394.33   5.926    374.64              19.69
 4   329.50   5.807    332.58              -3.08
 8   291.00   5.688    295.24              -4.24
12   255.17   5.569    262.09              -6.92
16   229.33   5.450    232.67              -3.34
20   204.83   5.331    206.54              -1.71
24   179.00   5.211    183.36              -4.36
28   163.83   5.092    162.77               1.06
32   150.33   4.973    144.50               5.83

[Plot of residuals vs. xi from the exponential fit, comparing ei (original) with ei after the transformation]

[Normal probability plots of ei and ei with transformation. Original ei: Mean = 3.95E-16, StDev = 17.79, N = 9, AD = 0.514, P = 0.138. With transformation: Mean = 0.3256, StDev = 8.142, N = 9, AD = 0.912, P = 0.011]
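This exponential fit can be reproduced by regressing log y on x and back-transforming; a minimal SAS sketch, assuming the tire dataset from earlier:

data logt;
   set tire;
   logy = log(y);              /* transform the response            */
run;

proc reg data=logt;
   model logy = x;             /* fits log y = log(b0) - b1*x       */
   output out=expfit p=logyhat;
run;

data expfit;
   set expfit;
   yhat = exp(logyhat);        /* back-transform to the mils scale  */
   e = y - yhat;               /* residuals on the original scale   */
run;

proc print data=expfit; var x y logyhat yhat e; run;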
Variance Stabilizing Transformation

Delta method (two-term Taylor-series approximation):

$$\mathrm{Var}(h(Y)) \approx [h'(\mu)]^2\, g^2(\mu), \quad \text{where } \mathrm{Var}(Y) = g^2(\mu) \text{ and } E(Y) = \mu.$$

1. Set $[h'(\mu)]^2\, g^2(\mu) \equiv 1$.
2. Then $h'(\mu) = \dfrac{1}{g(\mu)}$.
3. Hence $h(\mu) = \displaystyle\int \frac{d\mu}{g(\mu)}$, i.e. $h(y) = \displaystyle\int \frac{dy}{g(y)}$.

e.g. $\mathrm{Var}(Y) = c^2\mu^2$, where $c > 0$: $g(\mu) = c\mu \leftrightarrow g(y) = cy$, so

$$h(y) = \int \frac{dy}{cy} = \frac{1}{c}\int \frac{dy}{y} = \frac{1}{c}\log(y).$$

Therefore it is the logarithmic transformation.
5. Correlation Analysis

● Pearson product moment correlation: a measurement of how closely two variables share a linear relationship.

$$\rho = \mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

● Useful when it is not possible to determine which variable is the predictor and which is the response.
  ● Health vs. wealth: which is the predictor? Which is the response?
Statistical Inference on the Correlation Coefficient ρ

We can derive a test on the correlation coefficient in the same way that we have been doing in class.

● Assumptions: X, Y are from the bivariate normal distribution.
● Start with the point estimator: r, the sample correlation coefficient, is the estimator of the population correlation coefficient ρ:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

● Get the pivotal quantity. The distribution of r is quite complicated; the test statistic for ρ = 0 is

$$T_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

● Do we know everything about the P.Q.? Yes: $T_0 \sim t_{n-2}$ under $H_0: \rho = 0$.
Bivariate Normal Distribution

● pdf:

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}$$

● Properties:
  ● μ1, μ2: means for X, Y
  ● σ1², σ2²: variances for X, Y
  ● ρ: the correlation coefficient between X, Y
Derivation of T0

Are these equivalent?

$$T_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \overset{?}{=} \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

Substitute

$$r = \hat{\beta}_1 \frac{s_x}{s_y} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{\text{SST}}}, \qquad 1 - r^2 = \frac{\text{SSE}}{\text{SST}} = \frac{(n-2)s^2}{\text{SST}}$$

then:

$$T_0 = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{\text{SST}}} \cdot \sqrt{\frac{(n-2)\,\text{SST}}{(n-2)s^2}} = \frac{\hat{\beta}_1}{s/\sqrt{S_{xx}}} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

Yes, they are equivalent. Therefore, we can use $t_0$ as a statistic for testing against the null hypothesis $H_0: \beta_1 = 0$; equivalently, we can test against $H_0: \rho = 0$.
Exact Statistical Inference on ρ

● Test: $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$
● Test statistic:

$$T_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

● Reject $H_0$ iff $|t_0| > t_{n-2,\,\alpha/2}$.

Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

$H_0: \rho = 0$, $H_a: \rho \ne 0$

$$t_0 = \frac{0.7\sqrt{15-2}}{\sqrt{1-0.7^2}} = 3.534$$

For α = .01, 3.534 = t0 > t13, .005 = 3.012. ▲ Reject H0.
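The arithmetic of this example is easy to verify in a SAS data step; a minimal sketch:

data exact;
   r = 0.7; n = 15; alpha = 0.01;
   t0 = r*sqrt(n-2)/sqrt(1 - r**2);        /* = 3.534                 */
   tcrit = tinv(1 - alpha/2, n-2);         /* t_{13, .005} = 3.012    */
   reject = (abs(t0) > tcrit);             /* 1 => reject H0: rho = 0 */
run;

proc print data=exact; run;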
Approximate Statistical Inference on ρ

● There is no exact method of testing ρ vs. an arbitrary ρ0:
  ● The distribution of R is very complicated.
  ● $T_0 \sim t_{n-2}$ only when ρ = 0.
● To test ρ vs. an arbitrary ρ0, one can use Fisher's transformation:

$$\tanh^{-1} R = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right) \approx N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right),\ \frac{1}{n-3}\right)$$

● Therefore, let

$$\hat{\psi} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right); \quad \text{under } H_0: \rho = \rho_0, \quad \hat{\psi} \approx N\left(\psi_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right),\ \frac{1}{n-3}\right)$$
Approximate Statistical Inference on ρ

● Test: $H_0: \rho = \rho_0$ vs. $H_1: \rho \ne \rho_0$, i.e.

$$H_0: \psi = \psi_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) \ \text{vs.}\ H_1: \psi \ne \psi_0$$

● Sample estimate: $\hat{\psi} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$
● Z test statistic:

$$z_0 = \sqrt{n-3}\,(\hat{\psi} - \psi_0)$$

We reject $H_0$ if $|z_0| > z_{\alpha/2}$.

● CI for ρ: first a CI for ψ,

$$\hat{\psi} - \frac{z_{\alpha/2}}{\sqrt{n-3}} \le \psi \le \hat{\psi} + \frac{z_{\alpha/2}}{\sqrt{n-3}},$$

then, denoting the two limits by l and u, transform back:

$$\frac{e^{2l}-1}{e^{2l}+1} \le \rho \le \frac{e^{2u}-1}{e^{2u}+1}$$
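A minimal data-step sketch of the Fisher-transformation test and CI. The values r = 0.7 and n = 15 are carried over from the previous example, and ρ0 = 0.5 is a hypothetical null value chosen for illustration:

data fisher;
   r = 0.7; n = 15; rho0 = 0.5;               /* rho0 is hypothetical        */
   psihat = 0.5*log((1 + r)/(1 - r));         /* Fisher's z of r             */
   psi0   = 0.5*log((1 + rho0)/(1 - rho0));   /* transformed null value      */
   z0 = sqrt(n - 3)*(psihat - psi0);          /* approximate z statistic     */
   zc = probit(0.975);                        /* z_{.025} = 1.96             */
   l = psihat - zc/sqrt(n - 3);               /* CI for psi                  */
   u = psihat + zc/sqrt(n - 3);
   rho_l = (exp(2*l) - 1)/(exp(2*l) + 1);     /* back-transformed CI for rho */
   rho_u = (exp(2*u) - 1)/(exp(2*u) + 1);
run;

proc print data=fisher; run;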
Approximate Statistical Inference on ρ using SAS
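A sketch of what such code could look like, using the FISHER option of PROC CORR; the dataset scores and the variables test1 and test2 are hypothetical stand-ins for the two test instruments of the earlier example:

proc corr data=scores fisher(rho0=0.5 biasadj=no);  /* scores, test1, test2 are hypothetical */
   var test1 test2;   /* prints r, Fisher's z, a CI for rho, and the test of rho = rho0 */
run;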
Pitfalls of Regression and Correlation Analysis

● Correlation and causation
● Coincidental data
  ● Sun spots and Republicans
● Lurking variables
  ● Ticks cause good health
  ● Church, suicide, population
● Restricted range
  ● Local vs. global linearity
Summary

● Probabilistic model for linear regression; model assumptions
● Linear regression analysis: the least squares (LS) estimates $\hat{\beta}_0$ and $\hat{\beta}_1$
● Confidence intervals: $\hat{\beta}_0$ or $\hat{\beta}_1$ $\pm\ t_{n-2,\,\alpha/2}\, SE(\hat{\beta}_0$ or $\hat{\beta}_1)$
● Confidence interval & prediction interval
● Outliers? Influential observations? Data transformations?
● Correlation analysis: the correlation coefficient r
● Least squares (LS) fit: $Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$
● Sampling distributions:

$$\hat{\beta}_0 \sim N\left(\beta_0,\ \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad \hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right)$$

● Statistical inference on β0 & β1
● Prediction interval:

$$Y^* = \hat{Y}^* \pm t_{n-2,\,\alpha/2}\, s\,\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

● Sample correlation coefficient r; correlation analysis
● Model assumptions: linearity, normality, constant variance, independence
Questions?