Transcript Slide 1
SIMPLE LINEAR REGRESSION
AND CORRELATION
Prepared by:
Jackie Zerrle
David Fried
Chun-Hui Chung
Weilai Zhou
Shiyhan Zhang
Alex Fields
Yu-Hsun Cheng
Roosevelt Moreno
AMS 572.1 DATA ANALYSIS, FALL 2007.
What is Regression Analysis?
A statistical methodology to estimate the
relationship of a response variable to a set of
predictor variables.
It is a tool for the investigation of relationships
between variables.
Often used in economics, e.g., supply and demand:
how does one aspect of the economy affect
other parts?
It was first proposed by the German
mathematician Gauss.
Linear Regression
The simplest relationship between x (the predictor
variable) and Y (the response variable) is linear:
Yi = β0 + β1 xi + εi, (i = 1, 2, ..., n).
εi is a random error with E(εi) = 0 and Var(εi) = σ².
E(Yi) = μi = β0 + β1 xi represents the true but
unknown mean of Yi. This relationship is the true
regression line.
Simple Linear Regression Model
4 Basic Assumptions:
1. The mean of Yi is a linear function of xi.
2. The Yi have a common variance σ², which is
the same for all values of x.
3. The errors εi are normally distributed.
4. The errors εi are independent.
Example -- Sales vs. Advertising
Information was given such as the cost of
advertising and the sales that occurred as a
result.
First, make a scatter plot.
To get a good fit, however, we will use the Least
Squares (LS) method.
Example -- Sales vs. Advertising -- Data

Advertising ($000s) (xi)   Sales ($000,000s) (yi)
71                         28
31                         14
50                         19
60                         21
35                         16
Example -- Sales vs. Advertising
Try to fit a straight line: y = β0 + β1 x,
where β0 = 2.5 and β1 = (28 − 14)/(71 − 31) = 0.35.
Look at the deviations between the observed
values and the points from the line:
yi − (β0 + β1 xi), (i = 1, 2, ..., n)
Example -- Sales vs. Advertising—
Scatter Plot with a Trial Straight Line fit
http://learning.mazoo.net/archives/000899.html
Least Squares (Cont…)
Deviations should be as small as possible.
Sum of the squared deviations:
Q = Σ_{i=1}^{n} [yi − (β0 + β1 xi)]²
In our example, Q = 7.87.
The Least Squares estimates of β0 and β1
minimize Q and are denoted by β̂0 and β̂1.
Least Squares Estimates
To find β̂0 and β̂1, take the first partial
derivatives of Q:
∂Q/∂β0 = −2 Σ_{i=1}^{n} [yi − (β0 + β1 xi)]
∂Q/∂β1 = −2 Σ_{i=1}^{n} xi [yi − (β0 + β1 xi)]
Normal Equations
We then set these partial derivatives equal to
zero and simplify.
These are our normal equations:
β0 n + β1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi
β0 Σ_{i=1}^{n} xi + β1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi
Normal Equations
Solve for β̂0 and β̂1 (all sums over i = 1, ..., n):
β̂0 = [ (Σ xi²)(Σ yi) − (Σ xi)(Σ xi yi) ] / [ n Σ xi² − (Σ xi)² ]
β̂1 = [ n Σ xi yi − (Σ xi)(Σ yi) ] / [ n Σ xi² − (Σ xi)² ]
These formulas can be simplified using:
Sxy = Σ (xi − x̄)(yi − ȳ) = Σ xi yi − (1/n)(Σ xi)(Σ yi)
Sxx = Σ (xi − x̄)² = Σ xi² − (1/n)(Σ xi)²
Syy = Σ (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)²
Sxy gives the sum of cross-products of the x's
and y's around their respective means.
Sxx and Syy give the sums of squares of the
differences between the xi and x̄, and between
the yi and ȳ, respectively.
With these, the estimates simplify to:
β̂1 = Sxy / Sxx and β̂0 = ȳ − β̂1 x̄
The least squares (LS) line, which is an
estimate of the true regression line, is:
ŷ = β̂0 + β̂1 x
Find the equation of the line for the number of
sales due to increased advertising.
Σ xi = 247, Σ yi = 98, Σ xi² = 13327,
Σ yi² = 2038, Σ xi yi = 5192,
and n = 5, which allows us to get
x̄ = 49.4 ; ȳ = 19.6
Sxy = Σ xi yi − (1/n)(Σ xi)(Σ yi) = 5192 − (1/5)(247)(98) = 350.80
Sxx = Σ xi² − (1/n)(Σ xi)² = 13327 − (1/5)(247)² = 1125.20
Example -- Sales vs. Advertising
The slope and intercept estimates are:
β̂1 = 350.80/1125.20 = 0.31 and β̂0 = 19.6 − 0.31 × 49.4 = 4.29
The equation of the LS line is:
y = 4.29 + 0.31x
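The computation above can be checked numerically. The sketch below (Python here for brevity; the slides themselves use SAS, and the variable names are illustrative) reproduces Sxy, Sxx, and the slope; note that the intercept comes out ≈ 4.20 with the unrounded slope, while the slides round β̂1 to 0.31 first and obtain 4.29:

```python
# Least squares fit for the Sales vs. Advertising data (sketch).
# xs: advertising ($000s), ys: sales ($000,000s), taken from the slides.
xs = [71, 31, 50, 60, 35]
ys = [28, 14, 19, 21, 16]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
# shortcut formulas for the sums of squares and cross-products
Sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
Sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
b1 = Sxy / Sxx            # slope estimate beta1-hat ~ 0.312
b0 = ybar - b1 * xbar     # intercept estimate beta0-hat ~ 4.20
print(round(Sxy, 2), round(Sxx, 2), round(b1, 3), round(b0, 2))
```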
Coefficient of Determination and
Coefficient of Correlation
Fitted values: ŷi = β̂0 + β̂1 xi, i = 1, 2, ..., n
Residuals are used to evaluate the
goodness of fit of the LS line:
ei = yi − (β̂0 + β̂1 xi), i = 1, 2, ..., n
Error sum of squares (SSE): Qmin = Σ ei²
Also define:
Σ_{i=1}^{n} (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)² = Syy
This is the total sum of squares (SST).
Total Sum of Squares:
SST = Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)² + 2 Σ (yi − ŷi)(ŷi − ȳ)
The first term is the regression sum of squares (SSR),
the second is SSE, and the cross-product term equals zero, so:
SST = SSR + SSE
r² = SSR/SST = 1 − SSE/SST, where 0 ≤ r² ≤ 1 is the ratio
of the variation accounted for by regression to the total variation.
Sales vs. Advertising
Coefficient of Determination and Correlation
Calculate r² and r using our data.
SST = Syy = Σ yi² − (1/n)(Σ yi)² = 2038 − (1/5)(98)² = 117.2
Next calculate SSR:
SSR = SST − SSE = 117.2 − 7.87 = 109.33
Then,
r² = 109.33/117.2 = 0.933 and r = √0.933 = 0.966
Since 93.3% of the variation in sales is accounted for by
linear regression on advertising, the relationship
between the two is strongly linear with a positive slope.
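A minimal numeric sketch of this calculation (Python rather than the slides' SAS; SSE = 7.87 is taken from the slides):

```python
# Coefficient of determination for the Sales vs. Advertising example (sketch).
SST = 2038 - 98**2 / 5   # = Syy = 117.2
SSE = 7.87               # minimized sum of squared residuals, from the slides
SSR = SST - SSE          # = 109.33
r2 = SSR / SST           # coefficient of determination
r = r2 ** 0.5            # correlation coefficient (positive, since slope > 0)
print(round(SST, 1), round(r2, 3), round(r, 3))  # 117.2 0.933 0.966
```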
Estimation of σ²
Variance σ² measures the scatter of the Yi
around their means μi.
The unbiased estimate of the variance is
given by:
s² = SSE/(n − 2) = Σ_{i=1}^{n} ei² / (n − 2)
Sales vs. Advertising
Estimation of σ²
Find the estimate of σ² using our past
results.
SSE = 7.87 and n − 2 = 3; so,
s² = 7.87/3 = 2.62
The estimate of σ is:
s = √2.62 = 1.62, i.e., about $1.62 million (sales are in $000,000s).
Statistical Inference on β0 and β1
ⅰ. Point Estimator
ⅱ. Confidence Interval
ⅲ. Test
Distributions of β̂0 and β̂1:
β̂0 ~ N( β0, σ² Σ xi² / (n Sxx) )
β̂1 ~ N( β1, σ² / Sxx )
Estimated standard errors:
SE(β̂0) = s √( Σ xi² / (n Sxx) )
SE(β̂1) = s / √Sxx
100(1 − α)% CI:
β̂0 ± t_{n−2, α/2} SE(β̂0) and β̂1 ± t_{n−2, α/2} SE(β̂1)
Hypothesis test
P.Q.:
(β̂0 − β0) / SE(β̂0) ~ t_{n−2} and (β̂1 − β1) / SE(β̂1) ~ t_{n−2}
Hypothesis:
H0: β1 = β1⁰ vs. Ha: β1 ≠ β1⁰
Test statistic:
t0 = (β̂1 − β1⁰) / SE(β̂1); in particular, for H0: β1 = 0, t0 = β̂1 / SE(β̂1)
Rejection region: reject H0 at level α if:
|t0| > t_{n−2, α/2}
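A quick sketch of the slope test on the advertising data (Python for brevity; Sxy, Sxx, SSE taken from the slides, t table value t_{3, 0.025} = 3.182):

```python
import math

# t test of H0: beta1 = 0 for the advertising example (sketch).
Sxx = 1125.2
s = math.sqrt(7.87 / 3)        # s = sqrt(SSE/(n-2))
b1 = 350.8 / Sxx               # slope estimate beta1-hat
se_b1 = s / math.sqrt(Sxx)     # standard error of the slope
t0 = b1 / se_b1
# |t0| ~ 6.46 > t_{3, 0.025} = 3.182, so reject H0: beta1 = 0
print(round(t0, 2))
```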
Analysis of Variance
for Simple Linear Regression
The analysis of variance (ANOVA) is a
statistical technique to decompose the
total variability in the yi's into separate
variance components associated with
specific sources.
A mean square is a sum of squares
divided by its d.f.
Mean Square Regression: MSR = SSR/1
Mean Square Error: MSE = SSE/(n − 2)
The ratio of MSR to MSE provides an
equivalent test of the significance of the
linear relationship between x and y:
F = MSR/MSE = SSR/s² = β̂1² Sxx / s² = [ β̂1 / (s/√Sxx) ]² = [ β̂1 / SE(β̂1) ]² = t² ~ F(1, n − 2) under H0: β1 = 0
ANOVA table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n − 2                MSE = SSE/(n − 2)
Total                 SST              n − 1
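Filling in the table for the advertising example gives the F statistic below (a sketch in Python; SSR and SSE from the slides). Since the slide's rounded inputs are used, F agrees with t0² only up to rounding:

```python
# ANOVA F statistic for the advertising example (sketch).
SSR, SSE, n = 109.33, 7.87, 5
MSR = SSR / 1            # regression mean square, d.f. = 1
MSE = SSE / (n - 2)      # error mean square, d.f. = n - 2
F = MSR / MSE            # ~ t0^2 for the slope t test, up to rounding
print(round(F, 1))
```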
Prediction of Future Observations
Suppose we fix x at a specified value x*.
How do we predict the value of the r.v. Y*?
Point Estimator:
Ŷ* = β̂0 + β̂1 x*
Prediction Intervals (PI)
The confidence intervals for Y* and μ* = E(Y*) are
called prediction intervals.
Formulas for a 100(1 − α)% PI, with s = √MSE:
For E(Y*):  μ̂* ± t_{n−2, α/2} · s · √( 1/n + (x* − x̄)²/Sxx )
For Y*:  Ŷ* ± t_{n−2, α/2} · s · √( 1 + 1/n + (x* − x̄)²/Sxx )
Cautions about making predictions
Note that the PI will be shortest when x*
is equal to the sample mean.
The farther away x* is from the sample
mean the longer the PI will be.
Extrapolation beyond the range of the
data is highly imprecise and should be
avoided.
Example 10.8
Calculate a 95% PI for the mean groove depth of the
population of all tires and for the groove depth of a
single tire with a mileage of 25,000 (based on the data
from earlier sections).
In previous examples, we already measured the
following quantities:
x* = 25 ; Ŷ* = β̂0 + β̂1 x* = 178.62 ; s = 19.02 ;
x̄ = 16 ; n = 9 ; Sxx = 960 ; t_{7, 0.025} = 2.365
Example 10.8 (continued)
Now we simply plug these numbers into our
formulas.
95% PI for E(Y*):
178.62 ± 2.365 × 19.02 × √( 1/9 + (25 − 16)²/960 ) = [158.73, 198.51]
95% PI for Y*:
178.62 ± 2.365 × 19.02 × √( 1 + 1/9 + (25 − 16)²/960 ) = [129.44, 227.80]
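The same plug-in computation can be sketched in a few lines (Python, with all numbers taken from the slides):

```python
import math

# 95% intervals for Example 10.8 (sketch; all inputs from the slides).
yhat, s, n = 178.62, 19.02, 9
xstar, xbar, Sxx = 25, 16, 960
t = 2.365                                                      # t_{7, 0.025}
hw_mean = t * s * math.sqrt(1/n + (xstar - xbar)**2 / Sxx)     # for E(Y*)
hw_pred = t * s * math.sqrt(1 + 1/n + (xstar - xbar)**2 / Sxx) # for Y*
print(round(yhat - hw_mean, 2), round(yhat + hw_mean, 2))  # ~ [158.73, 198.51]
print(round(yhat - hw_pred, 2), round(yhat + hw_pred, 2))  # ~ [129.44, 227.80]
```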
Calibration (Inverse Regression)
Suppose we are given μ* = E(Y*), and we want
an estimate of x*.
We simply solve the linear regression formula
for x* to obtain our point estimator:
x̂* = (μ* − β̂0) / β̂1
Calculating the CI is more complicated and is
not covered in this course.
Example 10.9
Estimate the mean life of a tire at wearout
(62.5 mils remaining).
We want to estimate x* when μ* = 62.5.
From previous examples, we have calculated:
β̂0 = 360.64 and β̂1 = −7.281
Plugging this data into our equation we get:
x̂* = (62.5 − 360.64) / (−7.281) = 40.948, i.e., 40,947.67 miles
(x is in 1000s of miles).
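As a one-line check of the inverse-regression estimate (Python sketch; fitted values from the slides):

```python
# Inverse-regression point estimate for Example 10.9 (sketch).
b0, b1 = 360.64, -7.281    # fitted intercept and slope (x in 1000s of miles)
mu_star = 62.5             # groove depth at wearout (mils)
x_star = (mu_star - b0) / b1
miles = x_star * 1000
print(round(miles, 2))     # ~ 40947.67
```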
REGRESSION DIAGNOSTICS
The four basic assumptions of linear regression
need to be verified from data to assure that the
analysis is valid.
1. The mean of Yi is a linear function of xi.
2. The Yi have a common variance σ², which is
the same for all values of x.
3. The errors εi are normally distributed.
4. The errors εi are independent.
Checking The Model Assumptions
Checking for Linearity
Checking for Constant Variance
Checking for Normality
Checking for Independence
How to do this?
If the model is correct, then the residuals
ei = yi − ŷi
can be viewed as the “estimates” of the random errors εi.
Residual plots are the primary tool.
Checking for Linearity
If the regression of y on x is linear, then the plot of
ei vs. xi should exhibit random scatter around zero.
Example 10.10

i   xi   yi       ŷi       ei
1    0   394.33   360.64    33.69
2    4   329.50   331.51    -2.01
3    8   291.00   302.39   -11.39
4   12   255.17   273.27   -18.10
5   16   229.33   244.15   -14.82
6   20   204.83   215.02   -10.19
7   24   179.00   185.90    -6.90
8   28   163.83   156.78     7.05
9   32   150.33   127.66    22.67

The plot is clearly parabolic. The linear regression does not fit
the data adequately. Maybe we can try a second degree model:
y = β0 + β1 x + β2 x²
Checking for Constant Variance
Plot ei vs. ŷi. Since the ŷi are linear functions of the
xi, we can also plot ei vs. xi.
If the constant variance assumption is correct,
Var(ei) ≈ σ² for all i, and the plot of ei vs. ŷi
would exhibit random scatter with roughly constant spread.
Checking for Normality
Making a normal plot:
1. The normal plot requires that the observations form a random sample
with a common mean and variance.
2. The yi do not form such a random sample, since the E(Yi) = μi
depend on the xi and hence are not equal.
3. The residuals are used to make the normal plot (they have a zero mean
and an approximately constant variance).
Checking for Normality
Example 10.10: [normal plot of the residuals]
Checking for Independence
A well-known statistical test is the Durbin-Watson test:
d = Σ_{u=2}^{n} (e_u − e_{u−1})² / Σ_{u=1}^{n} e_u²
1. The nearer d is to 2, the more independent the residuals are.
2. The nearer d is to 0, the more positively correlated the
residuals are.
3. The nearer d is to 4, the more negatively correlated the
residuals are.
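The statistic is easy to compute directly. The sketch below (Python; the residuals are those of Example 10.10 from the slides) gives d well below 2, consistent with the positive correlation left by the parabolic lack-of-fit pattern:

```python
def durbin_watson(e):
    """Durbin-Watson statistic: d = sum_{u=2}^n (e_u - e_{u-1})^2 / sum_{u=1}^n e_u^2."""
    num = sum((e[u] - e[u - 1]) ** 2 for u in range(1, len(e)))
    den = sum(r * r for r in e)
    return num / den

# residuals from Example 10.10 (from the slides)
resid = [33.69, -2.01, -11.39, -18.10, -14.82, -10.19, -6.90, 7.05, 22.67]
d = durbin_watson(resid)
print(round(d, 2))   # ~ 0.75, near 0: positively correlated residuals
```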
CHECKING FOR OUTLIERS AND
INFLUENTIAL OBSERVATIONS
Checking for Outliers
Standardized residuals:
ei* = ei / SE(ei) = ei / [ s √( 1 − 1/n − (xi − x̄)²/Sxx ) ], i = 1, 2, ..., n.
If |ei*| > 2, then the corresponding observation may be regarded as an outlier.
Checking for Influential Observations
An influential observation is not necessarily an outlier.
An observation can be influential because it has an extreme
x-value, an extreme y-value, or both.
How can we identify influential observations?
Leverage
ŷi can be expressed as a linear combination of all the yj as follows:
ŷi = Σ_{j=1}^{n} hij yj,   with Σ_{i=1}^{n} hii = k + 1,
where the hij are functions of the x's. We call hii the leverage.
A rule of thumb is to regard any hii > 2(k + 1)/n as high leverage.
An observation with high leverage is an influential observation.
In this chapter, k = 1, and so hii > 4/n is regarded as high leverage.
The formula for hii for k = 1 is given by:
hii = 1/n + (xi − x̄)²/Sxx
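A short sketch of the leverage formula (Python), applied to the x values of Example 10.12 below; it reproduces the table's hii column, and the single observation at x = 19 exceeds the 4/n cutoff:

```python
def leverages(xs):
    """h_ii = 1/n + (x_i - xbar)^2 / Sxx for simple linear regression (k = 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    Sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / Sxx for x in xs]

# x values of Example 10.12: ten observations at x = 8 and one at x = 19
xs = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
h = leverages(xs)
cutoff = 4 / len(xs)              # high-leverage rule of thumb for k = 1
print([round(v, 2) for v in h], round(cutoff, 2))
```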
How to Deal with Outliers and Influential Observations?
Detect the outliers and influential observations, then
determine whether they are erroneous observations or not.
If they are erroneous, discard these observations;
otherwise, include them in the analysis. Then do the analysis.
Two separate analyses may be done, one with and one
without the outliers and influential observations.
Example 10.12

No.   X    Y       ei*      hii
1     8    6.28   -0.341    0.1
2     8    5.76   -1.067    0.1
3     8    7.71    0.582    0.1
4     8    8.84    1.735    0.1
5     8    8.47    1.300    0.1
6     8    7.04    0.031    0.1
7     8    5.25   -1.624    0.1
8     19   12.50   0        1
9     8    5.56   -1.271    0.1
10    8    7.91    0.757    0.1
11    8    6.89   -0.089    0.1
DATA TRANSFORMATIONS
Linearizing Transformations
Simple functional relationship, e.g., the power form:
y = α x^β
ln y = ln α + β ln x
Then y′ = ln y and x′ = ln x produce a linear relationship with
β0 = ln α and β1 = β
DATA TRANSFORMATIONS
Linearizing Transformations
Simple functional relationship, e.g., the exponential form:
y = α e^(βx)
ln y = ln α + βx
Then y′ = ln y and x′ = x produce a linear relationship with
β0 = ln α and β1 = β
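The exponential case can be sketched as an ordinary straight-line fit of ln y on x (Python; the function name and the sanity-check data are illustrative, not from the slides):

```python
import math

def fit_exponential(xs, ys):
    """Fit y = alpha * exp(beta * x) by least squares on (x, ln y) -- a sketch."""
    n = len(xs)
    ly = [math.log(y) for y in ys]
    xbar = sum(xs) / n
    lbar = sum(ly) / n
    Sxl = sum((x - xbar) * (l - lbar) for x, l in zip(xs, ly))
    Sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = Sxl / Sxx                 # slope of the line, estimates beta
    b0 = lbar - b1 * xbar          # intercept, estimates ln(alpha)
    return math.exp(b0), b1

# sanity check on exact exponential data y = 2 e^{0.5 x}
alpha, beta = fit_exponential([0, 1, 2, 3],
                              [2 * math.exp(0.5 * x) for x in [0, 1, 2, 3]])
print(round(alpha, 6), round(beta, 6))   # recovers 2.0 and 0.5
```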
DATA TRANSFORMATIONS
Linearizing Transformations
[Figure: a grid of scatter-plot shapes with the corresponding linearizing
transformations, e.g., plotting y vs. log x, x², x³, or 1/x, and log y, y²,
y³, or 1/y vs. x, depending on the curvature.]
DATA TRANSFORMATIONS
Linearizing Transformations
Ex. 10.13 (Tire tread wear vs. Mileage: Exponential Model)
[Figures: scatter plots of y vs. x before and after the transformation.]
DATA TRANSFORMATIONS
Variance Stabilizing Transformations
Based on a two-term Taylor-series approximation:
h(Y) ≈ h(μ) + h′(μ)(Y − μ)
Given a relationship between the mean and the variance,
σ² = g²(μ),
the following transformation makes the variances
approximately equal, even if the means differ.
DATA TRANSFORMATIONS
Variance Stabilizing Transformations
Delta Method:
Var[h(Y)] ≈ [h′(μ)]² g²(μ)
Let [h′(μ)]² g²(μ) = 1, so that h′(μ) = 1/g(μ); consequently
h(y) = ∫ dy / g(y)
where Var(Y) = g²(μ) and E(Y) = μ.
DATA TRANSFORMATIONS
Variance Stabilizing Transformations
Example 1: Var(Y) = c²μ² (c > 0); here g(μ) = cμ, then
h(y) = ∫ dy/(cy) = (1/c) ln y  (the log transformation)
Example 2: Var(Y) = c²μ (c > 0); here g(μ) = c√μ, then
h(y) = ∫ dy/(c√y) = (2/c) √y  (the square-root transformation)
CORRELATION ANALYSIS
Background on correlation
A number of different
correlation coefficients are
used for different situations.
The best known is the
Pearson product-moment
correlation coefficient, which
is obtained by dividing the
covariance of the two
variables by the product of
their standard deviations.
Despite its name, it was first
introduced by Francis Galton.
CORRELATION ANALYSIS
•When it is not clear which is the
predictor variable and which is the
response variable?
•When both variables are random?
Bivariate Normal Distribution
Correlation: a measure of how closely two variables
share a linear relationship:
ρ = corr(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )
If ρ = 0, X and Y are uncorrelated; independence implies
ρ = 0, but ρ = 0 does not guarantee independence.
If ρ = −1 or +1, it represents perfect linear association.
Useful when it is not possible to determine which variable is
the predictor and which is the response.
Health vs. Wealth. Which is predictor? Which is response?
Bivariate Normal Distribution
p.d.f. of (X, Y):
f(x, y) = 1 / ( 2π σX σY √(1 − ρ²) ) ×
exp{ −1/(2(1 − ρ²)) [ ((x − μX)/σX)² − 2ρ ((x − μX)/σX)((y − μY)/σY) + ((y − μY)/σY)² ] }
Properties:
The p.d.f. is defined for −1 < ρ < 1.
It is undefined if ρ = ±1 and is called degenerate.
The marginal p.d.f. of X is N(μX, σX²).
The marginal p.d.f. of Y is N(μY, σY²).
Bivariate Normal Distribution
How to calculate f(X, Y)?
Let (X, Y) have covariance matrix
A = [ σX²      ρσXσY ]
    [ ρσXσY    σY²   ]
det A = σX²σY² − ρ²σX²σY² = σX²σY²(1 − ρ²)
where the marginals are f(x) = N(μX, σX²) and f(y) = N(μY, σY²).
Calculation
f(X, Y) = 1 / ( (2π)^(N/2) √det A ) ×
exp{ −(1/2) (x − μX, y − μY) A⁻¹ (x − μX, y − μY)ᵀ }
where N = 2 since it is bivariate; thus:
f(X, Y) = 1 / ( 2π √( σX²σY²(1 − ρ²) ) ) ×
exp{ −(1/2) (x − μX, y − μY) A⁻¹ (x − μX, y − μY)ᵀ }
where
A⁻¹ = 1 / ( σX²σY²(1 − ρ²) ) [ σY²      −ρσXσY ]
                              [ −ρσXσY   σX²   ]
Calculation (cont…)
Expanding the quadratic form gives:
f(x, y) = 1 / ( 2π σX σY √(1 − ρ²) ) ×
exp{ −1/(2(1 − ρ²)) [ ((x − μX)/σX)² − 2ρ ((x − μX)/σX)((y − μY)/σY) + ((y − μY)/σY)² ] }
which is the bivariate normal p.d.f. stated earlier.
Statistical Inference on the Correlation Coefficient ρ
We can derive a test on the correlation coefficient in the same way that we
have been doing in class.
Assumptions:
X, Y are from the bivariate normal distribution.
R: sample estimate of the population correlation coefficient ρ.
Start with the point estimator:
R = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ (Xi − X̄)² Σ (Yi − Ȳ)² )
Get the pivotal quantity:
The distribution of R is quite complicated.
T: transform the point estimator into a p.q.:
T = R √(n − 2) / √(1 − R²)
Do we know everything about the p.q.?
Yes: T ~ t_{n−2} under H0: ρ = 0.
Derivation of T
Are these equivalent?
t = r √(n − 2) / √(1 − r²)  =?  β̂1 / SE(β̂1)
Substitute:
r = β̂1 (sx/sy) = β̂1 √(Sxx/Syy) = β̂1 √(Sxx/SST)
1 − r² = SSE/SST = (n − 2) s² / SST
Then:
t = β̂1 √(Sxx/SST) · √(n − 2) / √( (n − 2) s² / SST ) = β̂1 √Sxx / s = β̂1 / SE(β̂1)
Yes, they are equivalent.
Therefore, we can use t as a statistic for testing against the
null hypothesis H0: β1 = 0; equivalently, we can test against H0: ρ = 0.
Exact Statistical Inference on ρ
Test:
H0: ρ = 0 vs. Ha: ρ ≠ 0, equivalently H0: β1 = 0 vs. Ha: β1 ≠ 0
(for the bivariate normal, E(Y | x) = μY + ρ (σY/σX)(x − μX),
so ρ = 0 if and only if β1 = 0)
Test statistic:
t = r √(n − 2) / √(1 − r²) ~ t_{n−2}
Using r = β̂1 (sx/sy) = β̂1 √(Sxx/Syy) = β̂1 √(Sxx/SST)
and 1 − r² = (n − 2) s² / SST:
t = r √(n − 2) / √(1 − r²) = β̂1 √(Sxx/SST) √(n − 2) / √( (n − 2) s² / SST ) = β̂1 √Sxx / s ~ t_{n−2}
Exact Statistical Inference on ρ (Cont.)
Rejection region:
Reject H0 if |t0| > t_{n−2, α/2}
Example:
The times for 25 soft drink deliveries (y) monitored as a function of
delivery volume (x) are shown in the table on the next page.
Test the null hypothesis that the correlation coefficient is equal to 0.
Exact Statistical Inference on ρ
Data

Y    X        Y    X        Y    X        Y    X        Y    X
7    16.68    7    18.11    16   40.33    10   29.00    10   17.90
3    11.50    2    8.00     10   21.00    6    15.35    26   52.32
3    12.03    7    17.83    4    13.50    7    19.00    9    18.75
4    14.88    30   79.24    6    19.75    3    9.50     8    19.83
6    13.75    5    21.50    9    24.00    17   35.10    4    10.75
Exact Statistical Inference on ρ
Solution
The sample correlation coefficient is
r = Σ (Xi − X̄)(Yi − Ȳ) / √( Σ (Xi − X̄)² Σ (Yi − Ȳ)² )
  = 2473.34 / √(1136.57 × 5784.54) = 0.96
The test statistic is
t0 = r √(n − 2) / √(1 − r²) = 0.96 × √(25 − 2) / √(1 − 0.96²) = 17.56
For α = .01, t0 = 17.56 > t_{23, 0.005} = 2.807, so reject H0.
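The same arithmetic can be verified numerically (a Python sketch; the sums of squares are taken from the slides, which round the final statistic to 17.56 while the unrounded inputs give ≈ 17.5):

```python
import math

# r and t0 for the delivery-time example (sketch; sums from the slides).
Sxy, Sxx, Syy = 2473.34, 1136.57, 5784.54
n = 25
r = Sxy / math.sqrt(Sxx * Syy)
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# t_{23, 0.005} = 2.807, so H0: rho = 0 is rejected at alpha = .01
print(round(r, 2), round(t0, 1))
```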
Approximate Statistical Inference on ρ
There is no exact method of testing
ρ vs. an arbitrary ρ0:
The distribution of R is very complicated.
T ~ t_{n−2} only when ρ = 0.
To test ρ vs. an arbitrary ρ0, use
Fisher's normal approximation:
tanh⁻¹(R) = (1/2) ln( (1 + R)/(1 − R) ) ≈ N( (1/2) ln( (1 + ρ)/(1 − ρ) ), 1/(n − 3) )
Transform the sample estimate:
ψ̂ = (1/2) ln( (1 + r)/(1 − r) ); under H0: ρ = ρ0,
ψ̂ ≈ N( ψ0, 1/(n − 3) ) with ψ0 = (1/2) ln( (1 + ρ0)/(1 − ρ0) )
Approximate Statistical Inference on ρ
Test:
H0: ρ = ρ0 vs. H1: ρ ≠ ρ0, equivalently
H0: ψ = ψ0 vs. H1: ψ ≠ ψ0, where ψ0 = (1/2) ln( (1 + ρ0)/(1 − ρ0) )
Sample estimate: ψ̂ = (1/2) ln( (1 + r)/(1 − r) )
z statistic:
z0 = √(n − 3) (ψ̂ − ψ0); reject H0 if |z0| > z_{α/2}
CI for ρ:
[l, u] = ψ̂ ∓ z_{α/2} / √(n − 3), giving
(e^{2l} − 1)/(e^{2l} + 1) ≤ ρ ≤ (e^{2u} − 1)/(e^{2u} + 1)
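The back-transformed CI can be sketched as follows (Python rather than the slides' SAS; applied to the delivery-time example with r = 0.96, n = 25, z_{0.025} = 1.96):

```python
import math

def fisher_ci(r, n, z_half=1.96):
    """Approximate 100(1-alpha)% CI for rho via Fisher's z transform (sketch)."""
    psi_hat = 0.5 * math.log((1 + r) / (1 - r))      # tanh^{-1}(r)
    half = z_half / math.sqrt(n - 3)
    l, u = psi_hat - half, psi_hat + half
    # back-transform the endpoints to the rho scale
    def to_rho(v):
        return (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)
    return to_rho(l), to_rho(u)

# delivery-time example: r = 0.96, n = 25; CI ~ (0.910, 0.982)
lo, hi = fisher_ci(0.96, 25)
print(round(lo, 3), round(hi, 3))
```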
Approximate Statistical Inference on ρ
[Slides show SAS code and its output for this test.]
Approximate Statistical Inference on ρ
Retaking the previous example:
The times for 25 soft drink deliveries (y) monitored as a function of
delivery volume (x), as shown in the earlier table.
Test the null hypothesis that the correlation coefficient is equal to 0.
SAS coding for last example
data corr_bev;
input y x;
datalines;
7 16.68
3 11.5
3 12.03
4 14.88
6 13.75
7 18.11
2 8.00
7 17.83
30 79.24
5 21.5
16 40.33
10 21.00
4 13.5
6 19.75
9 24.00
10 29.00
6 15.35
7 19.00
3 9.50
17 35.1
10 17.90
26 52.32
9 18.75
8 19.83
4 10.75
;
run;
proc gplot data=corr_bev;
plot y*x;
run;
proc corr data=corr_bev outp=corr;
var x y;
run;
SAS analysis for last example
[Slides show the PROC GPLOT and PROC CORR output.]
Pitfalls of Regression and Correlation Analysis
Correlation and causation
Coincidental data: e.g., baldness and lawyers.
Lurking variables (a third, unobserved variable):
does good mood cause good health?
The relationship between eating and weight involves the
unobserved variable of heredity (metabolism and illness).
Restricted range:
IQ and school performance (elementary school to
college): in college, lower IQs are less common, so there
would clearly be a decrease in the range.
Pitfalls of Regression and Correlation
Analysis
Correlation and linearity
The correlation value may not
be enough to evaluate a
relationship, especially in the
case where the assumption of
normality is incorrect.
This image, created by Francis
Anscombe, shows four data sets with a
common mean (7.5), standard
deviation (4.12), correlation
(.81), and regression line
y = 3 + .5x.