The Simple Linear Regression Model and Correlation

Forecasting Using the Simple Linear Regression Model and Correlation
What is a forecast?
• Using a statistical method on past data to predict the future.
• Using experience, judgment, and surveys to predict the future.
Why forecast?
• To enhance planning.
• To force thinking about the future.
• To fit corporate strategy to future conditions.
• To coordinate departments around the same view of the future.
• To reduce corporate costs.

Kinds of Forecasts
• Causal forecasts predict changes in a variable (Y) from changes in other variables (X's) that cause them.
• Time series forecasts predict a variable (Y) from prior values of itself.
• Regression can provide both kinds of forecasts.

Types of Relationships
• Positive linear relationship
• Negative linear relationship
• Relationship not linear
• No relationship
Relationships
If the relationship is not linear, the forecaster often has to use mathematical transformations to make the relationship linear.
Correlation Analysis
• Correlation measures the strength of the linear relationship between variables.
• It can be used to find the best predictor variables.
• It does not assure that there is a causal relationship between the variables.
The Correlation Coefficient
• Ranges between -1 and +1.
• The closer to -1, the stronger the negative linear relationship.
• The closer to +1, the stronger the positive linear relationship.
• The closer to 0, the weaker any linear relationship.
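As a minimal sketch, the sample correlation coefficient r can be computed with NumPy; the data below are the square-footage and annual-sales figures from the produce-store example used later in these notes.

```python
import numpy as np

square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])  # in $000

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(square_feet, annual_sales)[0, 1]
print(f"r = {r:.4f}")  # close to +1, a strong positive linear relationship
```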
Graphs of Various Correlation (r) Values
[Scatter plots of Y versus X illustrating r = -1, r = -.6, r = 0, r = .6, and r = 1]
The Scatter Diagram
Plot of all (Xi, Yi) pairs.
[Example scatter plot of Y versus X]
The Scatter Diagram
The scatter diagram is used to visualize the relationship, to assess its linearity, and to identify outliers.
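A quick sketch of drawing such a scatter diagram, assuming matplotlib and NumPy are available, using the produce-store data from the example below:

```python
import numpy as np
import matplotlib.pyplot as plt

square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])  # in $000

# Plot all (Xi, Yi) pairs to judge linearity and spot outliers
plt.scatter(square_feet, annual_sales)
plt.xlabel("Square Feet")
plt.ylabel("Annual Sales ($000)")
plt.title("Scatter Diagram")
plt.show()
```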
Regression Analysis
Regression analysis can be used to model causality and make predictions.
Terminology: the variable to be predicted is called the dependent or response variable. The variables used in the prediction model are called independent, explanatory, or predictor variables.
Simple Linear Regression Model
• The relationship between the variables is described by a linear function.
• A change in one variable causes the other variable to change.
Population Linear Regression
The population regression line is a straight line that describes the dependence of one variable on the other:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where $Y_i$ is the dependent (response) variable, $X_i$ is the independent (explanatory) variable, $\beta_0$ is the population Y intercept, $\beta_1$ is the population slope coefficient, $\varepsilon_i$ is the random error, and $\beta_0 + \beta_1 X_i$ is the population regression line.
How is the best line found?
[Plot of an observed value $(X_i, Y_i)$ and the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$, with the random error $\varepsilon_i$ shown as the vertical distance between them]
The model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. The best-fitting line is found by the least squares method, which minimizes the sum of the squared errors.
Sample Linear Regression
The sample regression line provides an estimate of the population regression line:

$Y_i = b_0 + b_1 X_i + e_i$

where $b_0$ is the sample Y intercept and provides an estimate of $\beta_0$, $b_1$ is the sample slope coefficient and provides an estimate of $\beta_1$, $e_i$ is the residual, and $b_0 + b_1 X_i$ is the sample regression line.
Simple Linear Regression: An Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that best fits the data.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
The Scatter Diagram: Excel Output
[Scatter plot of Annual Sales ($000) versus Square Feet]
The Equation for the Regression Line
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$

From the Excel printout:

               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
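The slide uses Excel's regression output; as a hedged sketch, the same coefficients can be reproduced with scipy.stats.linregress:

```python
import numpy as np
from scipy import stats

square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

fit = stats.linregress(square_feet, annual_sales)
print(f"intercept b0 = {fit.intercept:.3f}")  # ~1636.415
print(f"slope     b1 = {fit.slope:.3f}")      # ~1.487
```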
Graph of the Regression Line
[Scatter plot of Annual Sales ($000) versus Square Feet with the fitted regression line]
Interpreting the Results
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one unit in X, the average of Y is predicted to increase by an estimated 1.487 units.
The model estimates that for each increase of 1 square foot in the size of the store, the expected annual sales are predicted to increase by $1,487.
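A tiny check of this interpretation, using the rounded fitted coefficients as given on the slide: predictions for stores differing by one square foot differ by the slope.

```python
b0, b1 = 1636.415, 1.487          # fitted intercept and slope (rounded)
sales_2000 = b0 + b1 * 2000       # predicted annual sales ($000) at 2,000 sq ft
sales_2001 = b0 + b1 * 2001       # predicted annual sales ($000) at 2,001 sq ft
print(round(sales_2001 - sales_2000, 3))  # 1.487 ($000), i.e., about $1,487
```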
The Coefficient of Determination
$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$
The coefficient of determination ($r^2$) measures the proportion of the variation in Y explained by the independent variable X.
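A minimal sketch of computing $r^2$ directly from the SSR/SST definition above, using the produce-store data:

```python
import numpy as np

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

# Least squares estimates of the slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
print(f"r^2 = {ssr / sst:.4f}")        # ~0.942, matching the Excel output below
```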
Coefficients of Determination (r²) and Correlation (r)
[Scatter plots of Y versus X with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$, illustrating:
• r² = 1, r = +1
• r² = .81, r = +0.9
• r² = 0, r = 0
• r² = 1, r = -1]
Correlation: The Symbols
• The population correlation coefficient ρ ('rho') measures the strength of the linear relationship between two variables.
• The sample correlation coefficient r estimates ρ based on a set of sample observations.
Example: Produce Stores
From the Excel printout:

Regression Statistics
Multiple R          0.9705572
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517
Observations        7
Inferences About the Slope
• t test for a population slope: is there a linear relationship between X and Y?
• Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship)
• Test statistic:
$t = \frac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$ and df = n - 2.
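As a sketch of this t test on the produce-store data, scipy.stats.linregress returns both the standard error of the slope and the two-sided p-value for H0: β1 = 0:

```python
import numpy as np
from scipy import stats

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr       # b1 / S_b1, since the hypothesized slope is 0
print(f"t = {t_stat:.4f}")            # ~9.01, matching the Excel printout below
print(f"p-value = {fit.pvalue:.6f}")  # ~0.00028, so reject H0 at alpha = .05
```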
Example: Produce Stores
Data for the 7 stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Estimated regression equation:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487. Is the square footage of the store affecting its annual sales?
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
α = .05
df = 7 - 2 = 5
Critical values: ±2.5706 (reject H0 in either tail, .025 in each)

From the Excel printout:

               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812

Decision: reject H0.
Conclusion: there is evidence of a linear relationship.
Inferences About the Slope Using a Confidence Interval
Confidence interval estimate of the slope:
$b_1 \pm t_{n-2} \, S_{b_1}$

Excel printout for the produce stores:

               Lower 95%    Upper 95%
Intercept      475.810926   2797.01853
X Variable 1   1.06249037   1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: there is a significant linear relationship between annual sales and the size of the store.
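A hedged sketch of computing this interval, $b_1 \pm t_{n-2} S_{b_1}$, with scipy rather than Excel:

```python
import numpy as np
from scipy import stats

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

fit = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)   # 2.5706 for df = 5
lower = fit.slope - t_crit * fit.stderr
upper = fit.slope + t_crit * fit.stderr
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")  # ~(1.062, 1.911)
```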
Residual Analysis
Residual analysis is used to evaluate the validity of the regression assumptions. It uses numerical measures and plots to check those assumptions.
Linear Regression Assumptions
1. X is linearly related to Y.
2. The variance of the errors is constant for each value of X (homoscedasticity).
3. The residual errors are normally distributed.
4. If the data are collected over time, the errors must be independent.
A residual-plot sketch follows this list.
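A minimal sketch of checking the linearity and equal-variance assumptions by plotting the residuals against X, assuming matplotlib and scipy are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# A patternless band around zero supports linearity and constant variance
plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("Square Feet")
plt.ylabel("Residual e_i")
plt.title("Residuals versus X")
plt.show()
```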
Residual Analysis for Linearity
[Residual plots of e versus X: a curved pattern indicates the relationship is not linear; a patternless band indicates it is linear]
Residual Analysis for Homoscedasticity
[Residual plots of e versus X: a fanning pattern indicates heteroscedasticity; a constant spread indicates homoscedasticity]
Residual Analysis for Independence: The Durbin-Watson Statistic
It is used when the data are collected over time. It detects autocorrelation, that is, whether the residuals in one time period are related to the residuals in another time period, and so measures violation of the independence assumption.

$D = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$

Calculate D and compare it to the values in Table E.8.
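A minimal sketch of the Durbin-Watson statistic computed from a sequence of residuals ordered in time; the residual values below are hypothetical, for illustration only:

```python
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

e = np.array([1.2, -0.5, 0.3, -1.1, 0.8, -0.2, 0.6])  # hypothetical residuals
print(f"D = {durbin_watson(e):.3f}")  # compare to the bounds in Table E.8
```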
Preparing Confidence Intervals for Forecasts
Interval Estimates for Different Values of X
[Plot showing, at a given X, the confidence interval for the mean of Y and the wider confidence interval for an individual $Y_i$, relative to the mean $\bar{X}$]
Estimation of Predicted Values
Confidence interval estimate for $\mu_{Y|X}$, the mean of Y at a particular $X_i$:

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

where $t_{n-2}$ is the t value from the table with df = n - 2 and $S_{YX}$ is the standard error of the estimate. The size of the interval varies with the distance of $X_i$ from the mean $\bar{X}$.
Estimation of Predicted Values
Confidence interval estimate for an individual response $Y_i$ at a particular $X_i$:

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

The addition of 1 under the square root increases the width of the interval compared with that for the mean of Y.
Example: Produce Stores
Data for the 7 stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Predict the annual sales for a store with 2,000 square feet.
Regression model obtained:
$\hat{Y}_i = 1636.415 + 1.487 X_i$
Estimation of Predicted Values: Example
Confidence interval estimate for $\mu_{Y|X}$: find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

Predicted sales: $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X}$ = 2350.29, $S_{YX}$ = 611.75, $t_{n-2} = t_5 = 2.5706$

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = 4610.45 \pm 612.66$

Confidence interval for the mean of Y.
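A sketch of this calculation in Python, built directly from the formula above; the results match the slide up to rounding of the fitted coefficients:

```python
import numpy as np
from scipy import stats

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])
n = len(x)

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of the estimate
t_crit = stats.t.ppf(0.975, df=n - 2)         # 2.5706 for df = 5

x_new = 2000
y_hat = fit.intercept + fit.slope * x_new
half_width = t_crit * s_yx * np.sqrt(1 / n + (x_new - x.mean()) ** 2
                                     / np.sum((x - x.mean()) ** 2))
print(f"{y_hat:.2f} +/- {half_width:.2f}")    # about 4610 +/- 613 ($000)
```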
Estimation of Predicted Values: Example
Confidence interval estimate for an individual Y: find the 95% confidence interval for the annual sales of one particular store of 2,000 square feet.

Predicted sales: $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X}$ = 2350.29, $S_{YX}$ = 611.75, $t_{n-2} = t_5 = 2.5706$

$\hat{Y}_i \pm t_{n-2} \, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = 4610.45 \pm 1687.68$

Confidence interval for an individual Y.
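A sketch of the same interval for one individual store; the only change from the mean-of-Y case is the extra "1 +" under the square root, which widens the interval:

```python
import numpy as np
from scipy import stats

x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])
n = len(x)

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x_new = 2000
y_hat = fit.intercept + fit.slope * x_new
half_width = t_crit * s_yx * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2
                                     / np.sum((x - x.mean()) ** 2))
print(f"{y_hat:.2f} +/- {half_width:.2f}")    # about 4610 +/- 1688 ($000)
```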