Statistics for Managers Using Microsoft Excel, 4/e


The Basic Practice of Statistics
6th Edition
Chapter 5, 24
Simple Linear Regression
Chapter Goals
After completing this segment, you should be able to:
• Explain the simple linear regression model
• Obtain and interpret the simple linear regression equation for a set of data
• Explain measures of variation and determine whether the independent variable is significant
• Recognize some potential problems if regression analysis is used incorrectly
Correlation vs. Regression
• A scatter plot (or scatter diagram) can be used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  - Correlation is only concerned with the strength of the relationship
  - No causal effect is implied by correlation
Introduction to Regression Analysis
• Regression analysis is used to:
  - Predict the value of a dependent variable based on the value of at least one independent variable
  - Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
Types of Relationships
[Figures: scatter plots of Y against X illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]
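Simple Linear Regression Model
The population regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where
$Y_i$ = dependent variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$X_i$ = independent variable
$\varepsilon_i$ = random error term
Linear component: $\beta_0 + \beta_1 X_i$
Random error component: $\varepsilon_i$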
Simple Linear Regression Model (continued)
[Figure: the population line with intercept β0 and slope β1; at a given Xi, the observed value of Y differs from the predicted value of Y on the line by the random error εi for that Xi value]
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line:
$$\hat{Y}_i = b_0 + b_1 X_i$$
where
$\hat{Y}_i$ = estimated (or predicted) Y value for observation i
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
$X_i$ = value of X for observation i
The individual random error terms ei have a mean of zero
Least Squares Method
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2$$
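The minimization above has a standard closed-form solution: b1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and b0 = Ȳ - b1·X̄. The sketch below is one way to compute it in plain Python; the function name `least_squares` and the toy data are illustrative assumptions, not part of the slides.

```python
def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of X about its mean
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Toy data lying exactly on Y = 1 + 2X (made up for illustration)
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

Because the toy points lie exactly on a line, the fit recovers that line's intercept and slope.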
Finding the Least Squares Equation
• The coefficients b0 and b1, and other regression results in this chapter, will be found using MINITAB
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X
Simple Linear Regression Example
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
  - Dependent variable (Y) = house price in $1000s
  - Independent variable (X) = square feet
Sample Data for House Price Model
House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
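As a cross-check on the software output, the least squares coefficients can be computed directly from the sample data above. This is a minimal sketch in plain Python using the standard closed-form formulas; the variable names are illustrative assumptions.

```python
# Sample data from the slides: price in $1000s (Y), size in square feet (X)
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(price)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Closed-form least squares estimates
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
ssx = sum((x - x_bar) ** 2 for x in sqft)
b1 = sxy / ssx
b0 = y_bar - b1 * x_bar

print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
```

The result can be compared with the fitted equation reported in the Minitab output.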
Graphical Presentation
• House price model: scatter plot
[Figure: scatter plot of House Price ($1000s) against Square Feet]
Regression Using MINITAB
Stat/Regression/Fitted Line Plot
Response [=dependent variable]
Predictor [=explanatory variable]
OK
Minitab Output
Regression Analysis: House Price versus Square Feet
The regression equation is
House Price = 98.25 + 0.1098 Square Feet
S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%
Analysis of Variance
Source        DF   SS        MS        F       P
Regression     1   18934.9   18934.9   11.08   0.010
Error          8   13665.6    1708.2
Total          9   32600.5
Graphical Presentation
[Figure: Minitab fitted line plot of House Price against Square Feet; House Price = 98.25 + 0.1098 Square Feet; S = 41.3303, R-Sq = 58.1%, R-Sq(adj) = 52.8%]
Graphical Presentation
• House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; intercept = 98.248, slope = 0.10977]
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept, b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
  - Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
  - Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional square foot of size
Predictions using Regression Analysis
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
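The prediction above is a direct substitution into the fitted equation. A small sketch, assuming the rounded coefficients 98.25 and 0.1098 from the slide; the function name `predict_price` is hypothetical.

```python
def predict_price(square_feet, b0=98.25, b1=0.1098):
    """Predicted house price in $1000s from the rounded fitted equation."""
    return b0 + b1 * square_feet

pred = predict_price(2000)
print(f"{pred:.2f} ($1000s) = ${pred * 1000:,.0f}")
```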
Interpolation vs. Extrapolation
• When using a regression model for prediction, only predict within the relevant range of data
[Figure: scatter plot of House Price ($1000s) against Square Feet with the relevant range for interpolation marked]
Do not try to extrapolate beyond the range of observed X's
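One way to enforce this rule in practice is to refuse predictions outside the observed range of X. The sketch below assumes the sample's square-feet values and the rounded fitted equation; the function name and the choice to raise an error are illustrative design decisions, not something the slides prescribe.

```python
# Observed X values (square feet) from the sample data
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
X_MIN, X_MAX = min(sqft), max(sqft)

def predict_within_range(square_feet, b0=98.25, b1=0.1098):
    """Predict price in $1000s, but only for X inside the observed range."""
    if not X_MIN <= square_feet <= X_MAX:
        raise ValueError(
            f"{square_feet} sq. ft. is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; extrapolation is not supported"
        )
    return b0 + b1 * square_feet

print(predict_within_range(2000))   # inside the range: prediction is returned
try:
    predict_within_range(3000)      # outside the range: rejected
except ValueError as err:
    print(err)
```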
Measures of Variation
• Total variation is made up of two parts:
$$SST = SSR + SSE$$
Total Sum of Squares:
$$SST = \sum (Y_i - \bar{Y})^2$$
Regression Sum of Squares:
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$
Error Sum of Squares:
$$SSE = \sum (Y_i - \hat{Y}_i)^2$$
where:
$\bar{Y}$ = average value of the dependent variable
$Y_i$ = observed values of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given $X_i$ value
Measures of Variation (continued)
• SST = total sum of squares
  - Measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares
  - Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
  - Variation attributable to factors other than the relationship between X and Y
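The three sums of squares can be computed for the house price data to confirm that SST = SSR + SSE. A minimal sketch in plain Python; variable names are illustrative assumptions.

```python
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(price)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Least squares fit
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in sqft]

sst = sum((y - y_bar) ** 2 for y in price)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained variation
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))   # unexplained variation

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The values can be compared with the SS column of the Analysis of Variance table in the Minitab output.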
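Measures of Variation (continued)
[Figure: at a given Xi, the distance from Yi to the mean Ȳ (SST = Σ(Yi - Ȳ)²) is split into the distance from Yi to the fitted value Ŷi (SSE = Σ(Yi - Ŷi)²) and the distance from Ŷi to Ȳ (SSR = Σ(Ŷi - Ȳ)²)]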
Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called r-squared and is denoted as r²
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
note:
$$0 \le r^2 \le 1$$
Examples of Approximate r² Values
• r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X
• 0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X
• r² = 0: no linear relationship between X and Y; the value of Y does not depend linearly on X (none of the variation in Y is explained by variation in X)
[Figures: scatter plots of Y against X for each of the three cases]
Minitab Output
Analysis of Variance
Source        DF   SS        MS        F       P
Regression     1   18934.9   18934.9   11.08   0.010
Error          8   13665.6    1708.2
Total          9   32600.5
$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$
58.08% of the variation in house prices is explained by variation in square feet
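The r² calculation is a single division of the two sums of squares. The sketch below reproduces it and, as an addition not shown on the slides, applies the standard adjusted r² formula 1 - (1 - r²)(n - 1)/(n - k - 1) to show where a figure like Minitab's R-Sq(adj) comes from.

```python
ssr = 18934.9348   # regression sum of squares (from the slide)
sst = 32600.5000   # total sum of squares (from the slide)
n = 10             # sample size
k = 1              # number of independent variables

r_squared = ssr / sst
# Standard adjusted r-squared formula (assumption: not stated in the slides)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"r-squared = {r_squared:.5f}")
print(f"adjusted r-squared = {adj_r_squared:.4f}")
```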
Standard Error of Estimate
• The standard deviation of the variation of observations around the regression line is estimated by
$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$
where
SSE = error sum of squares
n = sample size
Minitab Output
SYX = 41.33032
S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%
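The reported standard error of the estimate follows from SSE and n. A short sketch, taking SSE as SST - SSR = 32600.5000 - 18934.9348 from the Analysis of Variance figures; variable names are illustrative.

```python
import math

sse = 32600.5000 - 18934.9348   # error sum of squares = SST - SSR
n = 10                          # sample size

s_yx = math.sqrt(sse / (n - 2))
print(f"S_YX = {s_yx:.4f}")
```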
Comparing Standard Errors
SYX is a measure of the variation of observed Y values from the regression line
[Figure: two scatter plots with fitted lines, one labeled small SYX and one labeled large SYX]
The magnitude of SYX should always be judged relative to the size of the Y values in the sample data
e.g., SYX = $41.33K is moderately small relative to house prices in the $200K - $300K range
Inferences About the Slope Using Minitab
Stat/Regression/Regression/Fit Regression Model
Enter Response and Continuous Variables
Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
41.3303   58.08%   52.84%      24.35%
Coefficients
Term          Coef     SE Coef   T-Value   P-Value   VIF
Constant      98.2     58.0      1.69      0.129
Square Feet   0.1098   0.0330    3.33      0.010     1.00
Regression Equation
House Price = 98.2 + 0.1098 Square Feet
Inferences About the Slope
• The standard error of the regression slope coefficient (b1) is estimated by
$$S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
where:
$S_{b_1}$ = estimate of the standard error of the least squares slope
$S_{YX} = \sqrt{\dfrac{SSE}{n-2}}$ = standard error of the estimate
Minitab Output
Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
41.3303   58.08%   52.84%      24.35%
Coefficients
Term          Coef     SE Coef   T-Value   P-Value   VIF
Constant      98.2     58.0      1.69      0.129
Square Feet   0.1098   0.0330    3.33      0.010     1.00
Regression Equation
House Price = 98.2 + 0.1098 Square Feet
Sb1 = 0.03297
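The standard error of the slope can be reproduced from the standard error of the estimate and the spread of the X values. A minimal sketch, assuming the sample's square-feet data and SYX = 41.33032 from the output; variable names are illustrative.

```python
import math

sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
s_yx = 41.33032   # standard error of the estimate (from the output)

x_bar = sum(sqft) / len(sqft)
ssx = sum((x - x_bar) ** 2 for x in sqft)   # sum of squares of X about its mean

s_b1 = s_yx / math.sqrt(ssx)
print(f"SSX = {ssx:.0f}, S_b1 = {s_b1:.5f}")
```

A larger spread in X (larger SSX) gives a smaller standard error of the slope for the same SYX.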
Comparing Standard Errors of the Slope
Sb1 is a measure of the variation in the slope of regression lines from different possible samples
[Figure: two panels of regression lines from different samples, one labeled small Sb1 and one labeled large Sb1]
Inference about the Slope: t Test
• t test for a population slope
  - Is there a linear relationship between X and Y?
• Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic
$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
Inference about the Slope: t Test (continued)
House Price in $1000s (y)    Square Feet (x)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq. ft.)
The slope of this model is 0.1098
Does square footage of the house affect its sales price?
Inferences about the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From Minitab output:
Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
41.3303   58.08%   52.84%      24.35%
Coefficients
Term          Coef     SE Coef   T-Value   P-Value   VIF
Constant      98.2     58.0      1.69      0.129
Square Feet   0.1098   0.0330    3.33      0.010     1.00
(b1 is the Coef entry and Sb1 is the SE Coef entry for Square Feet)
Regression Equation
House Price = 98.2 + 0.1098 Square Feet
$$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
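Inferences about the Slope: t Test Example (continued)
H0: β1 = 0
H1: β1 ≠ 0
Test Statistic: t = 3.329 (from Minitab output)
d.f. = 10 - 2 = 8
α/2 = .025 in each tail; critical values -tα/2 = -2.3060 and tα/2 = 2.3060
[Figure: t distribution with "Reject H0" regions below -2.3060 and above 2.3060 and a "Do not reject H0" region between them; the test statistic 3.329 falls in the upper rejection region]
Decision: Reject H0
Conclusion: There is sufficient evidence that square footage affects house price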
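The critical-value decision rule can be written out in a few lines. This sketch takes b1, Sb1, and the critical value 2.3060 (α/2 = .025, 8 d.f.) as given on the slides; variable names are illustrative.

```python
b1 = 0.10977       # estimated slope
s_b1 = 0.03297     # standard error of the slope
beta1_h0 = 0.0     # hypothesized slope under H0
t_crit = 2.3060    # two-tail critical value for alpha = .05, 8 d.f. (from the slide)

t_stat = (b1 - beta1_h0) / s_b1
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```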
Inferences about the Slope: t Test Example (continued)
H0: β1 = 0
H1: β1 ≠ 0
From Minitab output:
Term          Coef     SE Coef   T-Value   P-Value   VIF
Constant      98.2     58.0      1.69      0.129
Square Feet   0.1098   0.0330    3.33      0.010     1.00
P-value = 0.010
This is a two-tail test, so the p-value is P(t > 3.329) + P(t < -3.329) = 0.01 (for 8 d.f.)
Decision: P-value < α, so reject H0
Conclusion: There is sufficient evidence that square footage affects house price
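The two-tail p-value can be approximated without a statistics package by integrating the Student's t density numerically. The sketch below uses Simpson's rule with only the Python standard library; the function names and the step count are assumptions made for illustration, and a statistics library would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tail_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| > |t_stat|) via Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # By symmetry, each tail has area 0.5 minus the area between 0 and |t|
    return 2 * (0.5 - area_0_to_t)

p_value = two_tail_p_value(3.329, df=8)
alpha = 0.05
print(f"p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```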
F Test for Significance
F Test statistic:
$$F_{STAT} = \frac{MSR}{MSE}$$
where
$$MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$
where FSTAT follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom
(k = the number of independent variables in the regression model)
F-Test for Significance
Minitab Output
Analysis of Variance
Source            DF   SS      MS      F       P
Regression         1   18935   18935   11.08   0.010
Residual Error     8   13666    1708
Total              9   32600
$$F_{STAT} = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.0848$$
with 1 and 8 degrees of freedom; P = 0.010 is the p-value for the F test
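The F statistic can be rebuilt from the sums of squares. The sketch below also checks a standard identity that the slides do not state explicitly: in simple linear regression (k = 1), FSTAT equals the square of the slope's t statistic. Variable names are illustrative.

```python
ssr = 18934.9348                # regression sum of squares
sse = 32600.5000 - 18934.9348   # error sum of squares = SST - SSR
n = 10                          # sample size
k = 1                           # number of independent variables

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

t_stat = 3.32938                # slope t statistic from the t test example
print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_stat:.4f}")
print(f"t squared = {t_stat ** 2:.4f}")
```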
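F Test for Significance (continued)
H0: β1 = 0 (slope = 0)
H1: β1 ≠ 0
α = .05
Test Statistic: 11.08
P-value = 0.01
[Figure: F distribution with a "Do not reject H0" region to the left of the critical value F.05 = ??? and a "Reject H0" region of area α = .05 to its right; the test statistic 11.08 lies in the rejection region, p = .01]
Decision: Reject H0 at α = 0.05
Conclusion: There is sufficient evidence that house size affects selling price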
t Test for a Correlation Coefficient
• Hypotheses
H0: ρ = 0 (no correlation between X and Y)
HA: ρ ≠ 0 (correlation exists)
• Test statistic
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
(with n - 2 degrees of freedom)
where
$$r = +\sqrt{r^2} \ \text{if } b_1 > 0, \qquad r = -\sqrt{r^2} \ \text{if } b_1 < 0$$
Example: House Prices
Is there evidence of a linear relationship between square feet and house price at the .05 level of significance?
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, df = 10 - 2 = 8
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} = 3.33$$
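Example: Test Solution
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} = 3.33$$
d.f. = 10 - 2 = 8
α/2 = .025 in each tail; critical values -tα/2 = -2.3060 and tα/2 = 2.3060
[Figure: t distribution with "Reject H0" regions below -2.3060 and above 2.3060 and a "Do not reject H0" region between them; the test statistic 3.33 falls in the upper rejection region]
Decision: Reject H0
Conclusion: There is evidence of a linear association at the 5% level of significance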
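The correlation test can be reproduced starting from r² and the sign of the slope. A minimal sketch, assuming r² = 0.58082, b1 = 0.10977, and the critical value 2.3060 from the earlier slides; variable names are illustrative.

```python
import math

r_squared = 0.58082   # coefficient of determination
b1 = 0.10977          # estimated slope; r takes its sign
n = 10                # sample size
rho_h0 = 0.0          # hypothesized population correlation
t_crit = 2.3060       # two-tail critical value for alpha = .05, 8 d.f. (from the slide)

# r is the square root of r-squared, carrying the sign of the slope
r = math.copysign(math.sqrt(r_squared), b1)
t_stat = (r - rho_h0) / math.sqrt((1 - r ** 2) / (n - 2))
reject_h0 = abs(t_stat) > t_crit

print(f"r = {r:.3f}, t = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The t statistic agrees with the one from the slope test, as expected in simple linear regression.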
Chapter Summary
• Introduced types of regression models
• Discussed determining the simple linear regression equation
• Described measures of variation
• Described inference about the slope
• Discussed correlation: measuring the strength of the association