Simple Linear Regression - University of South Carolina

Simple Linear Regression
Lecture for Statistics 509
November-December 2000
Correlation and Regression
• Study of association and/or relationship between variables.
• Useful for determining the effect of changes in one variable (called the
independent or control variable) on another variable (called the
dependent or response variable).
• Regression models could be utilized to determine optimal operating
conditions [these conditions specified by the control variables] in order
to achieve a certain specified value or yield on the response variable.
• Regression models could also be utilized to predict the value of the
response given a value of the independent variable, or could be used
for “calibrating” the value of the independent variable to achieve a
certain response.
Week of 11/27/2000
Stat 509 - Regression Lecture
2
Some Examples
• Control variable is X = Average Speed of a Car and response variable
is Y=Fuel Efficiency of the Car. Goal is to determine speed to
optimize the efficiency of the car.
• Control variable is X = Temperature, while the response variable is Y =
Yield in a chemical reaction.
• Control variable is X = amount of fertilizer applied on a plant, while
the response variable is Y = yield of this plant.
• Control variable is X = thickness of a stack of bond paper, while the
response variable is Y = number of sheets in this stack.
• Control variable is X = average time of studying, while the response
variable is Y = GPA.
Population Model
• Each member of the population will have a value for the independent
variable X and the response variable Y, usually represented by the
vector (X,Y).
• For a given value X = x, the variable Y has a certain distribution whose
conditional mean is $\mu(x)$ and whose conditional variance is $\sigma^2(x)$.
• This could be visualized as follows: When you consider the
subpopulation consisting of units whose values of X equal x, then their
Y-values have a certain distribution whose mean is $\mu(x)$ and whose
variance is $\sigma^2(x)$. When you pick a unit from this subpopulation, the
Y-value that you will observe is governed by this particular
distribution. In particular, this observation could be expressed via
$Y = \mu(x) + \varepsilon$, where $\varepsilon$ is some “error term.”
Assumptions for Simple Linear Regression
• $\mu(x) = E(Y \mid X = x) = \alpha + \beta x$. This means that the mean of Y, given X =
x, is a linear function of x.
• $\beta$ is called the regression coefficient or the slope of the regression line;
$\alpha$ is the y-intercept.
• $\sigma^2(x) = \sigma^2$ does not depend on x. This is the assumption of “equal
variances” or homoscedasticity.
• Furthermore, for the sample data $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$:
$Y_1, Y_2, \ldots, Y_n$ are independent observations, and their conditional
distributions are all normal.
• In shorthand notation:
$Y_i = \mu(x_i) + \varepsilon_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, where
$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent and identically distributed (IID) $N(0, \sigma^2)$.
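To make the shorthand model concrete, the following minimal sketch draws responses from a line plus IID normal errors. The design points and the values of the intercept, slope, and error standard deviation are hypothetical, chosen only for illustration; they are not from the lecture.

```python
import random


def simulate_slr(xs, alpha, beta, sigma, seed=0):
    """Draw Y_i = alpha + beta * x_i + e_i, with e_i IID N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical design points and parameter values (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_slr(xs, alpha=2.0, beta=0.5, sigma=1.0)

for x, y in zip(xs, ys):
    print(f"x = {x:.1f}, simulated Y = {y:.3f}")
```

Setting sigma to 0 removes the error term, so every simulated response falls exactly on the line.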
Regression Problem
• Given the sample (bivariate) data $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$,
satisfying the linear regression model
$Y_i = \alpha + \beta x_i + \varepsilon_i$ with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ IID $N(0, \sigma^2)$,
we would like to address the following questions:
• How should the data be summarized graphically?
• What are the estimators of the parameters $\alpha$, $\beta$, and $\sigma^2$?
• What will be an estimate of the prediction line?
• What are the properties of the estimators of the model parameters?
• How do we test whether the fitted regression model is a significant
model?
• How do we construct CIs or test hypotheses concerning parameters?
• How do we perform prediction using the prediction model?
Illustrative Example: On Plasma Etching
• Plasma etching is essential to the fine-line pattern transfer
in current semiconductor processes. The paper “Ion Beam-Assisted
Etching of Aluminum with Chlorine” in J. Electrochem. Soc. (1985)
gives the data below on chlorine flow (x, in SCCM) through a nozzle
used in the etching mechanism, and etch rate (y, in 100A/min).

x    1.5   1.5   2.0   2.5   2.5   3.0   3.5   3.5   4.0
y   23.0  24.5  25.0  30.0  33.5  40.0  40.5  47.0  49.0
The Scatterplot
[Figure: Scatterplot of Chlorine Flow and Etch Rate. Horizontal axis: ChlorineFlow; vertical axis: EtchRate.]
Least-Squares Prediction Line
The least-squares (LS) principle for fitting the regression line to the
scatterplot states that the best-fitting line
$$\hat{Y} = a + bx$$
is the one whose coefficients a and b give the smallest possible
value of the sum of squared deviations between the observed Y-values and
their associated predicted values. The predicted values are
$$\hat{Y}_i = a + b x_i, \quad i = 1, 2, \ldots, n,$$
so the quantity that needs to be minimized is given by:
$$Q(a, b) = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 = \sum_{i=1}^{n} \left[ Y_i - (a + b x_i) \right]^2 .$$
Using minimization techniques from calculus, the coefficients that
provide the minimum value of Q(a,b) are given in the next slide.
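As a rough numerical check of the least-squares principle, the sketch below evaluates Q(a, b) on the plasma etching data at the closed-form coefficients (slope SXY/SXX, intercept from the sample means) and at nearby points. The step size and the names `b_hat`, `a_hat`, and `neighbours` are arbitrary illustrative choices.

```python
# Plasma etching data from the lecture: chlorine flow (x) and etch rate (y).
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]


def Q(a, b):
    """Sum of squared deviations between observed and predicted Y-values."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Closed-form least-squares coefficients: b = SXY / SXX, a = Ybar - b * Xbar.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# Nudge the coefficients in every direction; Q should increase each time.
step = 0.01  # arbitrary perturbation size
neighbours = [
    Q(a_hat + da, b_hat + db)
    for da in (-step, 0.0, step)
    for db in (-step, 0.0, step)
    if (da, db) != (0.0, 0.0)
]

print(f"Q at the LS coefficients: {Q(a_hat, b_hat):.4f}")
print(f"smallest Q among nearby points: {min(neighbours):.4f}")
```

This only probes a few neighbouring points, so it illustrates the minimum rather than proving it; the calculus argument is what establishes it.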
Formulas for Simple Linear Regression
$$SXX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$$
$$SYY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2$$
$$SXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\,\bar{Y}$$
Sample correlation coefficient:
$$r = \frac{SXY}{\sqrt{(SXX)(SYY)}}$$
Estimator of $\beta$:
$$b = \frac{SXY}{SXX}$$
Estimator of $\alpha$:
$$a = \bar{Y} - b\bar{X}$$
Prediction line:
$$\hat{Y} = a + bX$$
Residuals:
$$R_i = Y_i - \hat{Y}_i = Y_i - (a + bX_i), \quad i = 1, 2, \ldots, n$$
Sums of squares (SSY denotes the total sum of squares, SYY):
$$SSE = \sum_{i=1}^{n} R_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
$$SSY = SSR + SSE$$
Mean squared error and F statistic:
$$S^2 = MSE = \frac{SSE}{n-2}$$
$$F_c = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$$
MSE is an unbiased estimator of $\sigma^2$.
Coefficient of determination:
$$R^2 = \frac{SSR}{SYY}$$
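The formulas for simple linear regression can be applied directly to the plasma etching data. The following is a minimal sketch in plain Python; the variable names are illustrative and simply mirror the symbols used in the lecture.

```python
import math

# Plasma etching data from the lecture: chlorine flow (x) and etch rate (y).
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products (computational forms).
sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
syy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

r = sxy / math.sqrt(sxx * syy)  # sample correlation coefficient
b = sxy / sxx                   # estimator of the slope
a = y_bar - b * x_bar           # estimator of the intercept

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sse = sum(ri ** 2 for ri in residuals)       # error sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)  # regression sum of squares
mse = sse / (n - 2)                          # estimate of the error variance
f_c = (ssr / 1) / mse                        # F statistic
r_sq = ssr / syy                             # coefficient of determination

print(f"SXX = {sxx:.4f}, SYY = {syy:.4f}, SXY = {sxy:.4f}")
print(f"b = {b:.5f}, a = {a:.5f}, MSE = {mse:.5f}")
print(f"r = {r:.4f}, R^2 = {r_sq:.4f}, F = {f_c:.2f}")
```

The values of SXX, SYY, SXY, b, a, and MSE can be compared with the Excel worksheet later in the lecture.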
Analysis of Variance Table
Source of Variation   Degrees-of-Freedom   Sum of Squares   Mean Squares   F-Value
Regression            1                    SSR              MSR            MSR/MSE
Error                 n-2                  SSE              MSE
Total                 n-1                  SYY

To test the null hypothesis $H_0: \beta = 0$, compare the F-value
(MSR/MSE) to the tabular value obtained
from the F-distribution with degrees-of-freedom
(1, n-2). If the F-value is larger, then the null
hypothesis is rejected, and it is concluded that the
regression model is significant (at the prespecified
level of significance).
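The F-test can be sketched for the plasma etching data as follows. The helper `anova_slr` is a hypothetical name introduced here, and the 5% critical value 5.59 for (1, 7) degrees of freedom is hard-coded from a standard F table as an assumption rather than computed.

```python
def anova_slr(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssr = sxy ** 2 / sxx  # algebraically equal to sum((Yhat_i - Ybar)^2)
    sse = syy - ssr       # from SYY = SSR + SSE
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SYY": syy,
            "MSR": msr, "MSE": mse, "F": msr / mse, "df": (1, n - 2)}


# Plasma etching data from the lecture.
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]

table = anova_slr(x, y)

# Assumption: tabular 5% critical value of F with (1, 7) degrees of freedom,
# read from a standard F table.
F_CRITICAL_05 = 5.59

reject_h0 = table["F"] > F_CRITICAL_05
print(f"F-value = {table['F']:.2f} with df = {table['df']}")
print("Reject H0: beta = 0" if reject_h0 else "Do not reject H0: beta = 0")
```

With a different sample size or significance level, the critical value would need to be looked up again for the new degrees of freedom.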
Standard Errors and Confidence Intervals
$$\hat{\sigma}^2(b) = \frac{MSE}{SXX}$$
$$\hat{\sigma}^2(a) = (MSE)\left[ \frac{1}{n} + \frac{\bar{X}^2}{SXX} \right]$$
$$\hat{\sigma}^2[\hat{Y}(x_0)] = (MSE)\left[ \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX} \right]$$
Confidence interval for $\mu(x_0)$:
$$(a + b x_0) \pm (t_{n-2;\,\alpha/2}) \sqrt{ MSE \left[ \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX} \right] }$$
Prediction interval at $X = x_0$:
$$(a + b x_0) \pm (t_{n-2;\,\alpha/2}) \sqrt{ MSE \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SXX} \right] }$$
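The interval formulas can be illustrated on the plasma etching data. In this sketch the point x0 = 3.0 is an arbitrary illustrative choice, and the t critical value 2.365 (7 degrees of freedom, 95% confidence) is hard-coded from a standard t table as an assumption.

```python
import math

# Plasma etching data from the lecture.
x = [1.5, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 3.5, 4.0]
y = [23.0, 24.5, 25.0, 30.0, 33.5, 40.0, 40.5, 47.0, 49.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Assumption: t critical value for n - 2 = 7 degrees of freedom at the 95%
# level (upper 2.5% point), read from a standard t table.
T_CRITICAL = 2.365

x0 = 3.0  # illustrative chlorine flow at which to estimate and predict
fit = a + b * x0

se_b = math.sqrt(mse / sxx)                                   # standard error of b
se_a = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))            # standard error of a
se_mean = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))  # for the mean at x0
se_pred = math.sqrt(mse * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))  # new observation

ci = (fit - T_CRITICAL * se_mean, fit + T_CRITICAL * se_mean)
pi = (fit - T_CRITICAL * se_pred, fit + T_CRITICAL * se_pred)

print(f"se(b) = {se_b:.4f}, se(a) = {se_a:.4f}")
print(f"fitted value at x0 = {x0}: {fit:.4f}")
print(f"95% CI for the mean response: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI for a new observation: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is always wider than the confidence interval at the same x0, because it adds the variance of a single new observation to the uncertainty in the estimated mean.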
Excel Worksheet for Regression Computations

X=ChlorineFlow   Y=EtchRate   X^2      Y^2        XY
1.5              23           2.25     529        34.5
1.5              24.5         2.25     600.25     36.75
2                25           4        625        50
2.5              30           6.25     900        75
2.5              33.5         6.25     1122.25    83.75
3                40           9        1600       120
3.5              40.5         12.25    1640.25    141.75
3.5              47           12.25    2209       164.5
4                49           16       2401       196
24               312.5        70.5     11626.75   902.25
SumX             SumY         SumX2    SumY2      SumXY

SXX   6.5
SYY   776.055556
SXY   68.9166667

b     10.60256
a     6.448718
MSE   6.480311
Regression Analysis from Minitab
Fitted Line in Scatterplot with Bands
[Figure: Regression Analysis of the Plasma Etching Data. Scatterplot of EtchRate against ChlorineFlow showing the fitted regression line with 95% CI and 95% PI bands.]
Y = 6.44872 + 10.6026X
R-Sq = 94.2 %