
Chapter 11
Simple Linear Regression and Correlation
Learning Objectives
• Use simple linear regression for building empirical models
• Estimate the parameters in a linear regression model
• Determine whether the regression model is an adequate fit to the data
• Test statistical hypotheses and construct confidence intervals
• Predict a future observation
• Use simple transformations to achieve a linear regression model
• Understand correlation
Regression analysis
• Explores relationships between two or more variables
• Useful for these types of problems, e.g., predicting a new observation
• Sometimes a regression model will arise from a theoretical relationship
• At other times there is no theoretical knowledge of the relationship
• Choice of the model is then based on inspection of a scatter diagram
• Such a model is called an empirical model
Regression Model
• Mean of the random variable Y is related to x by
E[Y | x] = μ_{Y|x} = β₀ + β₁x
• β₀ and β₁ are called the regression coefficients
• Appropriate way to generalize this to a probabilistic model:
• Assume that the expected value of Y is a linear function of x
• Actual value of Y is determined by the mean value function plus a random error term
Y = β₀ + β₁x + ε
• where ε is called the random error term
• Suppose that the mean and variance of ε are 0 and σ²
• Slope β₁ can be interpreted as the change in the mean of Y for a unit change in x
• Height of the line at any value of x is just the expected value of Y for that x
• Variability of Y at a particular value of x is determined by the error variance σ²
• Implies that there is a distribution of Y values at each x
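For intuition, here is a minimal Python sketch of this probabilistic model; the parameter values β₀ = 0.33, β₁ = 0.004, and σ = 0.05 are hypothetical, chosen only for illustration.

import numpy as np

# Hypothetical parameter values, chosen only for illustration (not from the slides).
beta0, beta1, sigma = 0.33, 0.004, 0.05

rng = np.random.default_rng(seed=1)
x = np.linspace(60, 110, 20)               # fixed values of the regressor x
eps = rng.normal(0.0, sigma, size=x.size)  # random error: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps                # Y = beta0 + beta1*x + epsilon

print(y[:5])                               # a few simulated responses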
Graph of the Variability
• Distribution of Y for any given value of x
• Values of x are fixed, and Y is a random variable with the following mean and variance:
Mean: μ_{Y|x} = β₀ + β₁x
Variance: σ²
• (Figure: distributions of Y at different values of x, centered on the true regression line)
Simple Linear Regression
• Values of the intercept, slope and the
error variance will not be known
• Must be estimated from sample data
• Fitted model is used in prediction of future
observations of Y at a particular level of x
Method of Least Squares
• True relationship between Y and x is a straight line
• Assume n pairs of observations (x₁, y₁), …, (xₙ, yₙ)
• Estimates of β₀ and β₁ result in a line that is a "best fit" to the data
• Called the method of least squares
Least Squares Method
• Assuming the n observations in the sample
• Sum of the squares of the deviations of the observations from the true regression line (sums over i = 1, …, n):
L = Σ εᵢ² = Σ (yᵢ − β₀ − β₁xᵢ)²
• Taking the partial derivatives and setting them to zero:
∂L/∂β₀ evaluated at (β̂₀, β̂₁): −2 Σ (yᵢ − β̂₀ − β̂₁xᵢ) = 0
∂L/∂β₁ evaluated at (β̂₀, β̂₁): −2 Σ (yᵢ − β̂₀ − β̂₁xᵢ)xᵢ = 0
Least Squares Method-Cont.
• Simplifying gives the normal equations:
n β̂₀ + β̂₁ Σxᵢ = Σyᵢ
β̂₀ Σxᵢ + β̂₁ Σxᵢ² = Σxᵢyᵢ
• Results are
β̂₀ = ȳ − β̂₁x̄
β̂₁ = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]
• Fitted or estimated regression line is
ŷ = β̂₀ + β̂₁x
Using Special Symbols
• Convenient to use special symbols:
β̂₁ = Sxy / Sxx
• Numerator:
Sxy = Σ yᵢ(xᵢ − x̄) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
• Denominator:
Sxx = Σ (xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n
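As a concrete illustration of these formulas, here is a minimal Python sketch (the function name and array inputs are assumptions, not from the slides) that computes Sxx, Sxy, and the least squares estimates:

import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    Sxx = np.sum(x**2) - np.sum(x)**2 / n            # Sxx = sum(xi^2) - (sum xi)^2 / n
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # Sxy = sum(xi*yi) - (sum xi)(sum yi)/n
    beta1_hat = Sxy / Sxx                            # slope
    beta0_hat = y.mean() - beta1_hat * x.mean()      # intercept
    return beta0_hat, beta1_hat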
Residual Error
• Describes the error in the fit of the model to the ith observation yᵢ
• Each pair of observations satisfies
yᵢ = β̂₀ + β̂₁xᵢ + eᵢ
• Residual is denoted by eᵢ:
eᵢ = yᵢ − ŷᵢ
Estimating σ²
• Another unknown parameter is σ², the variance of the error term ε
• Residuals eᵢ are used to obtain an estimate of σ²
• Sum of squares of the residuals, often called the error sum of squares:
SSE = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²
σ̂² = SSE / (n − 2)
• A more convenient computing formula:
SSE = SST − β̂₁Sxy
• SST is the total sum of squares:
SST = Σ (yᵢ − ȳ)²
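A short companion sketch, under the same assumptions as the earlier function, for estimating σ² from the residuals:

import numpy as np

def estimate_sigma2(x, y, beta0_hat, beta1_hat):
    """Estimate the error variance sigma^2 from the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = beta0_hat + beta1_hat * x   # fitted values
    SSE = np.sum((y - y_hat)**2)        # error sum of squares
    return SSE / (y.size - 2)           # sigma_hat^2 = SSE / (n - 2)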
Example
• Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y).
• Summary quantities were as follows:
n = 20, Σyᵢ = 12.75, Σyᵢ² = 8.86, Σxᵢ = 1478,
Σxᵢ² = 143,215.8, and Σxᵢyᵢ = 1083.67
Questions
(a) Calculate the least squares estimates of the slope and intercept. Graph the regression line.
(b) Use the equation of the fitted line to predict what pavement deflection would be observed when the surface temperature is 85 °F.
(c) What is the mean pavement deflection when the surface temperature is 90 °F?
(d) What change in mean pavement deflection would be expected for a 1 °F change in surface temperature?
Solution
• Need to have
Sxx = Σ (xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n
Sxx = 143,215.8 − 1478²/20 = 33,991.6
Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 1083.67 − (1478)(12.75)/20 = 141.445
• Hence, the slope and intercept are
β̂₁ = Sxy / Sxx = 141.445 / 33,991.6 = 0.0041612
β̂₀ = 12.75/20 − (0.0041612)(1478/20) = 0.3299892
• Regression line:
ŷ = 0.3299892 + 0.0041612x
Solution-Cont.
• Graph of the regression line
(Figure: fitted line ŷ = 0.3299892 + 0.0041612x plotted over x from −50 to 150, with y from 0 to 0.8)
Solution-Cont.
• (b) Pavement deflection at 85 °F:
ŷ = 0.3299892 + 0.0041612(85) = 0.6836
• (c) Mean pavement deflection at 90 °F:
ŷ = 0.3299892 + 0.0041612(90) = 0.7045
• (d) Change in mean pavement deflection per 1 °F:
β̂₁ = 0.00416
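The arithmetic in parts (a)-(d) can be reproduced directly from the summary quantities; a minimal Python check:

# Summary quantities from the example
n = 20
sum_y = 12.75
sum_x, sum_x2 = 1478.0, 143215.8
sum_xy = 1083.67

Sxx = sum_x2 - sum_x**2 / n                     # 33991.6
Sxy = sum_xy - sum_x * sum_y / n                # 141.445
beta1_hat = Sxy / Sxx                           # ~0.0041612
beta0_hat = sum_y / n - beta1_hat * sum_x / n   # ~0.3300

print(beta0_hat + beta1_hat * 85)               # deflection at 85 F, ~0.684
print(beta0_hat + beta1_hat * 90)               # deflection at 90 F, ~0.704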
Properties of the Least Squares Estimators
• Assumed that the error term ε in the model is a random variable
• Estimators will be viewed as random variables
• Properties of the slope:
E(β̂₁) = β₁
V(β̂₁) = σ² / Sxx
• Properties of the intercept:
E(β̂₀) = β₀
V(β̂₀) = σ² [1/n + x̄²/Sxx]
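These variances lead directly to the estimated standard errors; a small sketch assuming σ̂², Sxx, n, and x̄ have already been computed (argument names are assumptions). Plugging in the roadway-example values reproduces the standard errors used later in this chapter:

import numpy as np

def standard_errors(sigma2_hat, Sxx, n, x_bar):
    """Estimated standard errors of the intercept and slope."""
    se_beta1 = np.sqrt(sigma2_hat / Sxx)                         # se(beta1_hat)
    se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar**2 / Sxx))  # se(beta0_hat)
    return se_beta0, se_beta1

# Roadway example: sigma2_hat = 0.00796, Sxx = 33991.6, n = 20, x_bar = 73.9
print(standard_errors(0.00796, 33991.6, 20, 73.9))  # ~ (0.0409, 0.000484)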
Analysis of Variance Approach
• Used to test for significance of regression
• Partitions the total variability in the response variable into two components:
Σ (yᵢ − ȳ)² = Σ (ŷᵢ − ȳ)² + Σ (yᵢ − ŷᵢ)²
• First term on the right is called the regression sum of squares
• Second term is called the error sum of squares
• Symbolically,
SST = SSR + SSE
• SST is the total corrected sum of squares
Analysis of Variance
• SST, SSR, and SSE have n−1, 1, and n−2 d.o.f., respectively
• SSR = β̂₁Sxy and SSE = SST − β̂₁Sxy
• Divide each sum of squares by its d.o.f.:
MSR = SSR/1 and MSE = SSE/(n−2)
• Then F = MSR/MSE follows the F(1, n−2) distribution under the hypothesis that β₁ = 0
Hypothesis Tests for Slope
• Used to assess the adequacy of a linear regression model
• Appropriate hypotheses for the slope are
H₀: β₁ = β₁,₀
H₁: β₁ ≠ β₁,₀
• Test statistic for significance of regression (β₁,₀ = 0):
F₀ = (SSR/1) / [SSE/(n−2)] = MSR/MSE
• Follows the F(1, n−2) distribution
• Reject H₀ if f₀ > f(α, 1, n−2)
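A hedged sketch of this F test using scipy.stats (function and argument names are assumptions, not from the slides):

from scipy import stats

def f_test_regression(SSR, SSE, n, alpha=0.05):
    """F test for significance of regression (H0: beta1 = 0)."""
    MSR = SSR / 1.0
    MSE = SSE / (n - 2)
    f0 = MSR / MSE
    f_crit = stats.f.ppf(1.0 - alpha, 1, n - 2)  # f_{alpha, 1, n-2}
    return f0, f_crit, f0 > f_crit               # reject H0 when f0 > f_crit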
Analysis of Variance for Testing
Significance of Regression
Example
• Consider the data from the previous example on
x=roadway surface temperature and y=pavement
deflection.
• (a) Test for significance of regression using α =
0.05. What conclusions can you draw?
• (b) Estimate the standard errors of the slope and
intercept.
Solution
• Use the steps in hypothesis testing:
1) Parameter of interest is the slope of the regression line, β₁
2) H₀: β₁ = 0
3) H₁: β₁ ≠ 0
4) α = 0.05
5) The test statistic is
f₀ = MSR/MSE = (SSR/1) / [SSE/(n−2)]
6) Reject H₀ if f₀ > f(α, 1, 18), where f(0.05, 1, 18) = 4.41
Solution
7) Using the results from the previous example,
SSR = β̂₁Sxy = (0.0041612)(141.445) = 0.5886
SSE = Syy − SSR = (8.86 − 12.75²/20) − 0.5886 = 0.143275
• Hence, the test statistic is
f₀ = 0.5886 / (0.143275/18) = 73.95
8) Since 73.95 > 4.41, reject H₀ and conclude the model specifies a useful relationship at α = 0.05
• (b) Estimate of σ² needed for the standard errors:
σ̂² = MSE = SSE/(n−2) = 0.143275/18 = 0.00796
• Hence se(β̂₁) = √(σ̂²/Sxx) = 0.000484 and se(β̂₀) = √(σ̂²[1/n + x̄²/Sxx]) = 0.0409
Confidence Intervals on the Slope and Intercept
• Interested to obtain C.I. estimates of the parameters
• Width of these C.I.s is a measure of the overall quality of the regression line
• 100(1−α)% C.I. on the slope β₁:
β̂₁ − t(α/2, n−2)·√(σ̂²/Sxx) ≤ β₁ ≤ β̂₁ + t(α/2, n−2)·√(σ̂²/Sxx)
• 100(1−α)% C.I. on the intercept β₀:
β̂₀ − t(α/2, n−2)·√(σ̂²[1/n + x̄²/Sxx]) ≤ β₀ ≤ β̂₀ + t(α/2, n−2)·√(σ̂²[1/n + x̄²/Sxx])
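A minimal sketch of both intervals using scipy.stats.t (function and argument names are assumptions):

import numpy as np
from scipy import stats

def slope_intercept_cis(beta0_hat, beta1_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
    """100(1 - alpha)% confidence intervals on beta1 and beta0."""
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)                      # t_{alpha/2, n-2}
    half_b1 = t_crit * np.sqrt(sigma2_hat / Sxx)
    half_b0 = t_crit * np.sqrt(sigma2_hat * (1.0 / n + x_bar**2 / Sxx))
    return ((beta1_hat - half_b1, beta1_hat + half_b1),
            (beta0_hat - half_b0, beta0_hat + half_b0))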
Confidence Interval on the Mean Response
• Constructed on the mean response at a specified value of x, say x₀
• Called a C.I. about the regression line
• C.I. about the mean response at the value x = x₀, where μ̂_{Y|x₀} = β̂₀ + β̂₁x₀:
μ̂_{Y|x₀} − t(α/2, n−2)·√(σ̂²[1/n + (x₀ − x̄)²/Sxx]) ≤ μ_{Y|x₀} ≤ μ̂_{Y|x₀} + t(α/2, n−2)·√(σ̂²[1/n + (x₀ − x̄)²/Sxx])
• Applies only to the interval of x values covered by the original data
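A corresponding sketch for the mean-response interval (names are assumptions):

import numpy as np
from scipy import stats

def mean_response_ci(x0, beta0_hat, beta1_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
    """100(1 - alpha)% confidence interval on the mean response at x = x0."""
    mu_hat = beta0_hat + beta1_hat * x0
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)
    half = t_crit * np.sqrt(sigma2_hat * (1.0 / n + (x0 - x_bar)**2 / Sxx))
    return mu_hat - half, mu_hat + half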
Prediction of New Observations
• An important application of a regression model
• New observation is independent of the observations used to develop the regression model
• C.I. for μ_{Y|x} is inappropriate for this purpose
• Prediction interval on a future observation at the value x₀:
ŷ₀ − t(α/2, n−2)·√(σ̂²[1 + 1/n + (x₀ − x̄)²/Sxx]) ≤ Y₀ ≤ ŷ₀ + t(α/2, n−2)·√(σ̂²[1 + 1/n + (x₀ − x̄)²/Sxx])
• Always wider than the C.I. at x₀
• Depends on both the error from the fitted model and the error associated with future observations
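A matching sketch for the prediction interval (names are assumptions); note the extra "1 +" inside the square root relative to the mean-response interval:

import numpy as np
from scipy import stats

def prediction_interval(x0, beta0_hat, beta1_hat, sigma2_hat, Sxx, n, x_bar, alpha=0.05):
    """100(1 - alpha)% prediction interval on a future observation at x = x0."""
    y0_hat = beta0_hat + beta1_hat * x0
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)
    half = t_crit * np.sqrt(sigma2_hat * (1.0 + 1.0 / n + (x0 - x_bar)**2 / Sxx))
    return y0_hat - half, y0_hat + half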
Example
• The first example presented data on roadway surface temperature x and pavement deflection y
• Find a 99% confidence interval on each of the following:
• (a) Slope
• (b) Intercept
• (c) Mean deflection when temperature x = 85 °F
• (d) Find a 99% prediction interval on pavement deflection when the temperature is 90 °F
Solution
a) Confidence interval on the slope:
β̂₁ − t(α/2, n−2)·√(σ̂²/Sxx) ≤ β₁ ≤ β̂₁ + t(α/2, n−2)·√(σ̂²/Sxx)
• Critical value
t(α/2, n−2) = t(0.005, 18) = 2.878
• Hence
0.0041612 ± (2.878)(0.000484)
0.0027682 ≤ β₁ ≤ 0.0055542
b) Confidence interval on the intercept:
β̂₀ − t(α/2, n−2)·√(σ̂²[1/n + x̄²/Sxx]) ≤ β₀ ≤ β̂₀ + t(α/2, n−2)·√(σ̂²[1/n + x̄²/Sxx])
0.3299892 ± (2.878)(0.04095)
0.2121351 ≤ β₀ ≤ 0.4478433
Solution
c) 99% confidence interval on μ_{Y|x₀} when x₀ = 85 °F:
μ̂_{Y|x₀} = 0.683689
μ̂_{Y|x₀} ± t(0.005, 18)·√(σ̂²[1/n + (x₀ − x̄)²/Sxx])
0.683689 ± (2.878)·√(0.00796[1/20 + (85 − 73.9)²/33,991.6])
0.683689 ± 0.0594607
0.6242283 ≤ μ_{Y|x₀} ≤ 0.7431497
d) 99% prediction interval when x₀ = 90 °F:
ŷ₀ = 0.7044949
ŷ₀ ± t(0.005, 18)·√(σ̂²[1 + 1/n + (x₀ − x̄)²/Sxx])
0.7044949 ± (2.878)·√(0.00796[1 + 1/20 + (90 − 73.9)²/33,991.6])
0.7044949 ± 0.2640665
0.4404284 ≤ Y₀ ≤ 0.9685614
Residual Analysis
• Helpful in checking the assumption that the errors are approximately normally distributed with constant variance
• Patterns in residual plots are useful in determining whether additional terms in the model are required
• Construct a normal probability plot of the residuals
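One way to produce these diagnostic plots is sketched below with matplotlib and scipy (the function name and plot layout are assumptions, not from the slides):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_plots(x, y, beta0_hat, beta1_hat):
    """Normal probability plot of the residuals and a residuals-vs-fitted plot."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = beta0_hat + beta1_hat * x
    e = y - y_hat                                # residuals e_i = y_i - y_hat_i

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    stats.probplot(e, dist="norm", plot=ax1)     # normal probability plot
    ax1.set_title("Normal probability plot")
    ax2.scatter(y_hat, e)                        # look for patterns / non-constant variance
    ax2.axhline(0.0, linestyle="--")
    ax2.set_xlabel("Fitted value")
    ax2.set_ylabel("Residual")
    plt.show()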
Coefficient of Determination (R²)
• Used to judge the adequacy of a regression model
• Coefficient of determination:
R² = SSR/SST = (SST − SSE)/SST
• Referred to as the amount of variability in the data explained by the regression model
• SSR is the portion of SST that is explained by the use of the regression model
• SSE is the portion of SST that is not explained by the use of the regression model
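A one-line check of R² for the roadway example, using SSR = 0.5886 and SSE = 0.143275 from the earlier solution:

def coefficient_of_determination(SST, SSE):
    """R^2 = SSR/SST = 1 - SSE/SST."""
    return 1.0 - SSE / SST

# Roadway example: SST = SSR + SSE = 0.5886 + 0.143275
print(coefficient_of_determination(0.5886 + 0.143275, 0.143275))  # ~0.80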
Transformation of Data Points
• A scatter diagram may indicate that a straight-line regression model is inappropriate
• Consider the exponential function
Y = β₀e^(β₁x)ε, which can be transformed to a straight line
• By a logarithmic transformation:
ln Y = ln β₀ + β₁x + ln ε
• Another intrinsically linear function is
Y = β₀ + β₁(1/x) + ε
• By using the reciprocal transformation z = 1/x:
Y = β₀ + β₁z + ε
• Transformed error terms are assumed to be normally distributed
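A minimal sketch of fitting the exponential model via the logarithmic transformation (the function name is an assumption; it simply reuses the least squares formulas on ln y):

import numpy as np

def fit_exponential_model(x, y):
    """Fit Y = beta0 * exp(beta1 * x) by least squares on the log scale (requires y > 0)."""
    x = np.asarray(x, dtype=float)
    ln_y = np.log(np.asarray(y, dtype=float))    # ln Y = ln(beta0) + beta1*x + ln(eps)
    n = x.size
    Sxx = np.sum(x**2) - np.sum(x)**2 / n
    Sxy = np.sum(x * ln_y) - np.sum(x) * np.sum(ln_y) / n
    beta1_hat = Sxy / Sxx
    beta0_hat = np.exp(ln_y.mean() - beta1_hat * x.mean())  # back-transform the intercept
    return beta0_hat, beta1_hat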
Correlation
• So far we assumed that x is a mathematical variable and that Y is a random variable
• Many applications involve situations in which both X and Y are random variables
• Suppose the observations are jointly distributed random variables
• Correlation measures the strength of linear association between two variables and is denoted by ρ
• Shows how closely the points in a scatter diagram are spread around the regression line
Hypothesis Tests
• Useful to test the hypotheses
H₀: ρ = 0
H₁: ρ ≠ 0
• Appropriate test statistic:
T₀ = R√(n − 2) / √(1 − R²)
• Follows the t distribution with n − 2 degrees of freedom
• Reject the null hypothesis if
|t₀| > t(α/2, n−2)
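A small sketch of this test (function and argument names are assumptions), computing the sample correlation R and the statistic T₀:

import numpy as np
from scipy import stats

def correlation_test(x, y, alpha=0.05):
    """Test H0: rho = 0 against H1: rho != 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]                     # sample correlation coefficient R
    t0 = r * np.sqrt(n - 2) / np.sqrt(1.0 - r**2)   # T0 = R*sqrt(n-2)/sqrt(1-R^2)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)  # t_{alpha/2, n-2}
    return r, t0, abs(t0) > t_crit                  # reject H0 when |t0| > t_crit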
Next Agenda
• Chapter 13 deals with designing and conducting engineering experiments
• ANOVA in designing single-factor experiments will be emphasized