Simple Regression - Villanova University

Simple Linear Regression and Correlation
1
Learning Objectives

Describe the Linear Regression Model

State the Regression Modeling Steps

Explain Ordinary Least Squares

Compute Regression Coefficients

Predict Response Variable

Describe Residual & Influence Analysis

Interpret Computer Output
2
Models

Representation of Some Phenomenon

Mathematical Model Is a Mathematical
Expression of Some Phenomenon

Often Describe Relationships between
Variables

Types
» Deterministic Models
» Probabilistic Models
3
Deterministic Models
Hypothesize Exact Relationships
Suitable When Prediction Error Is Negligible
Example: Force Is Exactly Mass Times Acceleration
» F = m·a
© 1984-1994 T/Maker Co.
4
Probabilistic Models
Hypothesize 2 Components
» Deterministic
» Random Error
Example: Sales Volume Is 10 Times Advertising Spending Plus Random Error
» Y = 10X + e
» Random Error May Be Due to Factors Other Than Advertising
5
Types of Probabilistic Models
» Regression Models
» Correlation Models
» Other Models
6
Regression Models
Answer ‘What Is the Relationship Between the Variables?’
Equation Used
» 1 Numerical Dependent (Response) Variable
– What Is to Be Predicted
» 1 or More Numerical or Categorical Independent (Explanatory) Variables
Used Mainly for Prediction
7
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
8
Problem Definition
Most Critical Step
» Don’t Want Right Answer to Wrong Question
What Are the Model Objectives?
Who Will Use the Model?
What Will Be the Benefits?
Are Resources Available (Data, etc.)?
How Will the Results Be Implemented?
9
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
10
Specifying the Model

Define Variables
» Conceptual (e.g., Advertising, Price)
» Empirical (e.g., List Price, Regular Price)
» Measurement (e.g., $, Units)

Hypothesize Nature of Relationship
» Expected Effects (i.e., Coefficients’ Signs)
» Functional Form (Linear or Non-Linear)
» Interactions
11
Model Specification Is Based on Theory
1. Economic & Business Theory
2. Mathematical Theory
3. Previous Research
4. ‘Common Sense’
12
Which Functional Form?
[Figure: four plots of Sales vs. Advertising, each showing a different functional form]
13
Types of Regression Models
» Simple (1 Explanatory Variable)
– Linear
– Non-Linear
» Multiple (2+ Explanatory Variables)
– Linear
– Non-Linear
14
Linear Equations
$Y = mX + b$
» m = Slope = Change in Y / Change in X
» b = Y-intercept
[Figure: straight line on X and Y axes, with clip art captioned "High School Teacher"]
15
Linear Regression Model
Relationship Between Variables Is a Linear Function
$Y_i = \alpha + \beta_1 X_i + \varepsilon_i$
» $Y_i$: Dependent (Response) Variable
» $\alpha$: Population Y-Intercept
» $\beta_1$: Population Slope
» $X_i$: Independent (Explanatory) Variable
» $\varepsilon_i$: Random Error
16
Population & Sample Regression Models
» Population (Unknown Relationship): $Y_i = \alpha + \beta_1 X_i + \varepsilon_i$
» Random Sample: $Y_i = a + b_1 X_i + e_i$
17
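Population Linear Regression Model
» Observed Value: $Y_i = \alpha + \beta_1 X_i + \varepsilon_i$
» Population Regression Line: $\mu_{YX} = \alpha + \beta_1 X_i$
» $\varepsilon_i$ = Random Error
[Figure: observed values plotted around the population regression line; axes X and Y]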
18
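Sample Linear Regression Model
» Observed Value: $Y_i = a + b_1 X_i + e_i$
» Sample Regression Line: $\hat{Y}_i = a + b_1 X_i$
» $e_i$ = Random Error (Residual, $Y - \hat{Y}$)
[Figure: observed values and one unsampled observation plotted around the sample regression line; axes X and Y]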
19
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
20
Scatter Diagram
1. Plot of All (Xi, Yi) Pairs
2. Suggests How Well Model Will Fit
[Figure: scatter plot of Y vs. X, both axes from 0 to 60]
21
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
22
Thinking Challenge
How Would You Draw a Line Through the Points?
How Do You Determine Which Line ‘Fits Best’?
[Figure: scatter plot of Y vs. X, both axes from 0 to 60]
23
Ordinary Least Squares
‘Best Fit’ Means Differences Between Actual Values ($Y_i$) & Predicted Values ($\hat{Y}_i$) Are a Minimum
» But Positive Differences Offset Negative
$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
OLS Minimizes the Sum of the Squared Differences (or Errors)
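A minimal sketch of why the differences are squared, using the advertising and sales pairs from the Hasbro example later in this deck. The alternative line `Y = 0.50 + 0.50X` is a hypothetical comparison chosen only for illustration; the fitted coefficients are the ones the deck derives later.

```python
# Advertising/sales pairs from the Hasbro example used later in this deck.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def residuals(a, b1):
    """Differences between each actual Y and the line a + b1 * X."""
    return [yi - (a + b1 * xi) for xi, yi in zip(x, y)]


def sse(a, b1):
    """Sum of squared differences, the quantity OLS minimizes."""
    return sum(e ** 2 for e in residuals(a, b1))


ols_line = (-0.10, 0.70)    # coefficients derived later in the deck
other_line = (0.50, 0.50)   # hypothetical alternative, for comparison only

raw_sum = sum(residuals(*ols_line))  # positive and negative differences cancel
sse_ols = sse(*ols_line)
sse_other = sse(*other_line)

print(f"Sum of raw differences (fitted line): {raw_sum:.4f}")
print(f"Sum of squared differences (fitted line): {sse_ols:.4f}")
print(f"Sum of squared differences (other line):  {sse_other:.4f}")
```

The raw differences sum to roughly zero for the fitted line, so they cannot rank candidate lines; the squared differences can.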
24
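Ordinary Least Squares Graphically
OLS Minimizes $\sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + e_3^2 + e_4^2$
» Observed Values: $Y_i = a + b_1 X_i + e_i$
» Fitted Line: $\hat{Y}_i = a + b_1 X_i$
[Figure: four points with residuals e1 to e4 measured vertically from the fitted line; axes X and Y]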
25
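Coefficient Equations
Sample Regression Equation: $\hat{Y}_i = a + b_1 X_i$
Sample Slope: $b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2}$
» n = # of (Xi, Yi) Pairs
» $(\bar{X})^2$: Average Xi’s, Then Square
Sample Y-Intercept: $a = \bar{Y} - b_1 \bar{X}$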
26
Computation Table
Xi     Yi     Xi²     Yi²     XiYi
X1     Y1     X1²     Y1²     X1Y1
X2     Y2     X2²     Y2²     X2Y2
:      :      :       :       :
Xn     Yn     Xn²     Yn²     XnYn
ΣXi    ΣYi    ΣXi²    ΣYi²    ΣXiYi
27
Parameter Estimation
Example
You’re a marketing analyst for Hasbro
Toys. You gather the following data:
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
What is the relationship
between sales & advertising?
28
Scatter Diagram
Sales vs. Advertising
[Figure: scatter plot of Sales (0 to 4) vs. Advertising (0 to 5)]
29
Parameter Estimation
Solution Table
Xi    Yi    Xi²   Yi²   XiYi
1     1     1     1     1
2     1     4     1     2
3     2     9     4     6
4     2     16    4     8
5     4     25    16    20
15    10    55    26    37    (Totals)
30
Parameter Estimation Solution
$b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2} = \dfrac{37 - 5(3)(2)}{55 - 5(9)} = 0.70$
$a = \bar{Y} - b_1 \bar{X} = 2 - 0.70(3) = -0.10$
$\hat{Y} = -0.10 + 0.70X$
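The slope and intercept formulas can be checked with a few lines of Python. This is a sketch under the deck's own formulas; the function name `fit_simple_regression` is an illustrative choice, not something from the slides.

```python
def fit_simple_regression(x, y):
    """Return (a, b1) using the computational formulas from the slides."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # b1 = (sum XiYi - n * Xbar * Ybar) / (sum Xi^2 - n * Xbar^2)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # a = Ybar - b1 * Xbar
    a = y_bar - b1 * x_bar
    return a, b1


# Hasbro example: advertising ($) and sales (units).
a, b1 = fit_simple_regression([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"a = {a:.2f}, b1 = {b1:.2f}")
```

The same function can be applied to the fertilizer data in the thinking challenge that follows.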
31
Coefficient Interpretation
Solution

Slope (b1)
» Sales Volume (Y) Is Expected to Increase
by .7 Units for Each $1 Increase in
Advertising (X)

Y-Intercept (a)
» Average Value of Sales Volume (Y) Is
-.10 Units When Advertising (X) Is 0
– Difficult to Explain to Marketing Manager
– Expect Some Sales Without Advertising
32
Interpretation of Coefficients

Slope (b1)
» Estimated Y Changes by b1 for Each 1
Unit Increase in X
– If b1 = 2, then Sales (Y) Is Expected to
Increase by 2 for Each 1 Unit Increase in
Advertising (X)

Y-Intercept (a)
» Average Value of Y When X = 0
– If a = 4, then Average Sales (Y) Is Expected
to Be 4 When Advertising (X) Is 0
33
[Figure: scatter plot of SALES (0 to 4.5) vs. ADVERT (1 to 6)]
Parameter Estimation
SPSS Output
[Table: SPSS coefficients output for the regression of SALES on ADVERT]
Parameter Estimation
Thinking Challenge
You’re an economist for the county
cooperative. You gather the following
data:
Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0
What is the relationship
between fertilizer & crop yield?
38
Scatter Diagram
Crop Yield vs. Fertilizer
[Figure: scatter plot of Yield (lb.), 0 to 10, vs. Fertilizer (lb.), 0 to 15]
39
Scatter Diagram
Crop Yield vs. Fertilizer*
[Figure: scatter plot of Yield (lb.), 0 to 10, vs. Fertilizer (lb.), 0 to 15]
41
Parameter Estimation Solution
Table*
Xi    Yi     Xi²    Yi²      XiYi
4     3.0    16     9.00     12
6     5.5    36     30.25    33
10    6.5    100    42.25    65
12    9.0    144    81.00    108
32    24.0   296    162.50   218    (Totals)
42
Parameter Estimation Solution*
$b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2} = \dfrac{218 - 4(8)(6)}{296 - 4(64)} = 0.65$
$a = \bar{Y} - b_1 \bar{X} = 6 - 0.65(8) = 0.80$
$\hat{Y}_i = 0.80 + 0.65 X_i$
43
Coefficient Interpretation
Solution*

Slope (b1)
» Crop Yield (Y) Is Expected to Increase by .65
lb. for Each 1 lb. Increase in Fertilizer (X)

Y-Intercept (a)
» Average Crop Yield (Y) Is Expected to Be 0.8
lb. When No Fertilizer (X) Is Used
44
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
45
Evaluating the Model
1. How Well Does the Model Describe the
Relationship Between the Variables?
2. Closeness of ‘Best Fit’
Closer the Points to the Line the Better
3. Assumptions Met
4. Significance of Parameter Estimates
5. Outliers (Unusual Observations)
46
Evaluating Model Steps
1. Examine Variation Measures
2. Test Coefficients for Significance
3. Do Residual Analysis
4. Do Influence Analysis
Model: $\hat{Y}_i = a + b_1 X_i$
47
Random Error Variation
Variation of Actual Y from Predicted Y
Measured by Standard Error of Estimate
» Sample Standard Deviation of e
» Denoted $S_{YX}$
Affects Several Factors
» Parameter Significance
» Prediction Accuracy
48
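Standard Error of Estimate
$S_{YX} = \sqrt{\dfrac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n - k - 1}} = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k - 1}}$
$S_{YX} = \sqrt{\dfrac{\sum_{i=1}^{n} Y_i^2 - a \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i}{n - k - 1}}$
where k = Number of Independent Variables (k = 1 in Simple Regression)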
49
Standard Error of the
Estimate Example
You’re a marketing analyst for Hasbro
Toys. You find a = -.1 & b1 = .7.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
What is the standard error of the
estimate?
50
Solution Table
Xi    Yi    Xi²   Yi²   XiYi
1     1     1     1     1
2     1     4     1     2
3     2     9     4     6
4     2     16    4     8
5     4     25    16    20
15    10    55    26    37    (Totals)
51
Standard Error of Estimate Solution
$S_{YX} = \sqrt{\dfrac{\sum_{i=1}^{n} Y_i^2 - a \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i}{n - k - 1}} = \sqrt{\dfrac{26 - (-.1)(10) - (.7)(37)}{5 - 1 - 1}} = .6055$
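As a quick check, the standard error of estimate can be computed two ways: directly from the residuals and with the shortcut formula. This is a sketch using the Hasbro data and the coefficients found earlier; variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b1 = -0.10, 0.70   # coefficients from the parameter estimation solution
n, k = len(x), 1      # k = number of independent variables

# Direct form: sum of squared residuals (Yi - Yhat_i)^2.
sse_direct = sum((yi - (a + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Shortcut form: sum Yi^2 - a * sum Yi - b1 * sum XiYi.
sse_shortcut = (
    sum(yi ** 2 for yi in y)
    - a * sum(y)
    - b1 * sum(xi * yi for xi, yi in zip(x, y))
)

s_yx = math.sqrt(sse_direct / (n - k - 1))
print(f"SSE (direct) = {sse_direct:.4f}, SSE (shortcut) = {sse_shortcut:.4f}")
print(f"S_YX = {s_yx:.5f}")
```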
52
Rule of Thumb for Interpreting the Standard Error of Estimate
Regression line ± 1(std. error): about 68% of the data points are expected to fall in this interval
Regression line ± 2(std. error): about 95% of the data points are expected to fall in this interval
Regression line ± 3(std. error): about 99.7% of the data points are expected to fall in this interval

53
Graphic Representation of
Standard Error of Estimate
[Figure: band of one standard error on each side of the regression line, shown at Xgiven, with the mean of X marked; axes X and Y]
54
55
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
56
Prediction With Regression Models
Types of Predictions
» Point Estimates
» Interval Estimates
What Is Predicted
» Population Mean Response ($\mu_{YX}$) for Given X
– Point on Population Regression Line
» Individual Response ($Y_i$) for Given X
57
What Is Predicted
[Figure: at XGiven, the prediction $\hat{Y}$, the mean Y ($\mu_{YX}$) on the population line $\mu_{YX} = \alpha + \beta_1 X$, and an individual Y; axes X and Y]
58
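Confidence Interval Estimate of Mean Y ($\mu_{YX}$)
$\hat{Y} - t_{n-k-1,\,\alpha/2} \cdot S_{\hat{Y}} \le \mu_{YX} \le \hat{Y} + t_{n-k-1,\,\alpha/2} \cdot S_{\hat{Y}}$
where
$S_{\hat{Y}} = S_{YX} \sqrt{\dfrac{1}{n} + \dfrac{(X_{given} - \bar{X})^2}{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2}}$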
59
Factors Affecting Interval Width
1. Level of Confidence (1 - α)
Width Increases as Confidence Increases
2. Data Dispersion ($S_{YX}$)
Width Increases as Variation Increases
3. Sample Size
Width Decreases as Sample Size Increases
4. Distance of Xgiven from Mean $\bar{X}$
Width Increases as Distance Increases
60
Why Distance from Mean?
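[Figure: Y vs. X with X1, $\bar{X}$, and X2 marked on the X axis and $\bar{Y}$ on the Y axis; the spread at X2 is labeled "Greater Dispersion Than X1"]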
61
Confidence Interval Estimate
Example
You’re a marketing analyst for Hasbro Toys. You find a = -.1, b1 = .7 & SYX = .60553.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Estimate the mean sales when advertising is $4 at the .05 level.
62
Solution Table
Xi    Yi    Xi²   Yi²   XiYi
1     1     1     1     1
2     1     4     1     2
3     2     9     4     6
4     2     16    4     8
5     4     25    16    20
15    10    55    26    37    (Totals)
63
Confidence Interval Estimate Solution
$\hat{Y}_i = -0.10 + 0.70 X_i$
$\hat{Y} - t_{n-k-1,\,\alpha/2} \cdot S_{\hat{Y}} \le \mu_{YX} \le \hat{Y} + t_{n-k-1,\,\alpha/2} \cdot S_{\hat{Y}}$
X to Be Predicted: 4
$\hat{Y} = -0.1 + 0.7(4) = 2.7$
$S_{\hat{Y}} = .60553 \sqrt{\dfrac{1}{5} + \dfrac{(4 - 3)^2}{55 - 5(3)^2}} = 0.3316$
$2.7 - 3.1824(0.3316) \le \mu_{YX} \le 2.7 + 3.1824(0.3316)$
$1.6445 \le \mu_{YX} \le 3.7555$
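Prediction Interval of Individual Response
$\hat{Y} - t_{n-k-1,\,\alpha/2} \cdot S_{ind} \le Y_P \le \hat{Y} + t_{n-k-1,\,\alpha/2} \cdot S_{ind}$
where
$S_{ind} = S_{YX} \sqrt{1 + \dfrac{1}{n} + \dfrac{(X_{given} - \bar{X})^2}{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2}}$
Note the Extra 1 Under the Square Root!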
65
Why the Extra SYX?
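[Figure: at Xgiven, the Y we're trying to predict differs by the error e from the expected (mean) Y on the line $\mu_{YX} = \alpha + \beta_1 X$; the prediction $\hat{Y}$ is also marked; axes X and Y]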
66
Prediction Interval of Individual Response Solution
$\hat{Y}_i = -0.10 + 0.70 X_i$
$\hat{Y} - t_{n-k-1,\,\alpha/2} \cdot S_{ind} \le Y_P \le \hat{Y} + t_{n-k-1,\,\alpha/2} \cdot S_{ind}$
$\hat{Y} = -0.1 + 0.7(4) = 2.7$
$S_{ind} = .60553 \sqrt{1 + \dfrac{1}{5} + \dfrac{(4 - 3)^2}{55 - 5(3)^2}} = .60553 \sqrt{1.3} = 0.6904$
$2.7 - 3.1824(0.6904) \le Y_P \le 2.7 + 3.1824(0.6904)$
$0.5028 \le Y_P \le 4.8972$
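A short numerical check of the individual-response interval at X = 4. It is a sketch that takes S_YX = .60553 and the critical value t = 3.1824 (df = 3) from the slides rather than computing them; the result agrees with the lici_1 and uici_1 values for ad = 4.00 in the SPSS interval output.

```python
import math

s_yx = 0.60553     # standard error of estimate, from the slides
t_crit = 3.1824    # t critical value for df = 3, alpha/2 = .025, from the slides
n = 5
x_bar = 3.0
sum_x2 = 55        # sum of Xi^2 from the solution table
x_given = 4

y_hat = -0.10 + 0.70 * x_given

# The leading 1 under the square root is what separates the individual
# prediction interval from the confidence interval for the mean.
s_ind = s_yx * math.sqrt(
    1 + 1 / n + (x_given - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2)
)

lower = y_hat - t_crit * s_ind
upper = y_hat + t_crit * s_ind
print(f"S_ind = {s_ind:.4f}")
print(f"{lower:.4f} <= Y_P <= {upper:.4f}")
```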
Hyperbolic Interval Bands
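[Figure: hyperbolic interval bands around the regression line, with the mean of X and Xgiven marked; axes X and Y]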
68
Interval Estimate
SPSS Output
ad      sales   lmci_1     umci_1     lici_1      uici_1
1.00    1.00    -.89270    2.09270    -1.83757    3.03757
2.00    1.00    .24450     2.35550    -.89719     3.49719
3.00    2.00    1.13819    2.86181    -.11100     4.11100
4.00    2.00    1.64450    3.75550    .50281      4.89719
5.00    4.00    1.90730    4.89270    .96243      5.83757
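The lmci_1 and umci_1 columns (lower and upper limits for the mean response) can be reproduced from the confidence interval formula. The sketch below assumes the critical value t = 3.1824 quoted in the slides and recomputes everything else from the five data pairs; it also shows the interval widening as X moves away from its mean.

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b1 = -0.10, 0.70
t_crit = 3.1824    # df = 3, alpha/2 = .025, value taken from the slides

n = len(x)
x_bar = sum(x) / n
ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
sse = sum((yi - (a + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))

mean_intervals = []
for x_given in x:
    y_hat = a + b1 * x_given
    # Standard error of the estimated mean response at x_given.
    s_y_hat = s_yx * math.sqrt(1 / n + (x_given - x_bar) ** 2 / ss_x)
    half_width = t_crit * s_y_hat
    mean_intervals.append((x_given, y_hat - half_width, y_hat + half_width))

for x_given, lower, upper in mean_intervals:
    print(f"ad = {x_given}: {lower:8.5f} to {upper:8.5f} (width {upper - lower:.5f})")
```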
70
Regression Modeling Steps
1. Define Problem or Question
2. Specify Model
3. Collect Data
4. Do Descriptive Data Analysis
5. Estimate Unknown Parameters
6. Evaluate Model
7. Use Model for Prediction
71
Evaluating the Model
1. How Well Does the Model Describe the
Relationship Between the Variables?
2. Closeness of ‘Best Fit’
Closer the Points to the Line the Better
3. Assumptions Met
4. Significance of Parameter Estimates
5. Outliers (Unusual Observations)
72
Evaluating Model Steps
1. Examine Variation Measures
2. Test Coefficients for Significance
3. Do Residual Analysis
4. Do Influence Analysis
Model: $\hat{Y}_i = a + b_1 X_i$
73
Measures of Variation in Regression
Total Sum of Squares (SST)
» Measures Variation of Observed Yi Around the Mean $\bar{Y}$
Explained Variation (SSR)
» Variation Due to Relationship Between X & Y
Unexplained Variation (SSE)
» Variation Due to Other Factors
74
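Variation Measures
» SST: Total Sum of Squares, $\sum (Y_i - \bar{Y})^2$
» SSR: Explained Sum of Squares, $\sum (\hat{Y}_i - \bar{Y})^2$
» SSE: Unexplained Sum of Squares, $\sum (Y_i - \hat{Y}_i)^2$
[Figure: a point (Xi, Yi), the fitted line $\hat{Y}_i = a + b_1 X_i$, and the horizontal line at $\bar{Y}$, with the SST, SSR, and SSE distances marked; axes X and Y]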
75
Relationship
$SST = SSR + SSE$
$\dfrac{SST}{SST} = \dfrac{SSR}{SST} + \dfrac{SSE}{SST}$
$1 = \dfrac{SSR}{SST} + \dfrac{SSE}{SST}$
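The decomposition can be verified numerically. This sketch uses the Hasbro data and the fitted line from earlier slides; variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b1 = -0.10, 0.70

y_bar = sum(y) / len(y)
y_hat = [a + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR/SST + SSE/SST = {ssr / sst + sse / sst:.4f}")
```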
76
Coefficient of Determination
Proportion of Variation ‘Explained’ by Relationship Between X & Y
$0 \le r^2 \le 1$
$r^2 = \dfrac{\text{Explained Variation}}{\text{Total Variation}} = \dfrac{SSR}{SST} = \dfrac{a \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - n (\bar{Y})^2}{\sum_{i=1}^{n} Y_i^2 - n (\bar{Y})^2}$
77
Coefficient of
Determination Examples
[Figure: four scatter plots of Y vs. X, each with fitted line $\hat{Y}_i = b_0 + b_1 X_i$: r² = 1 (two panels), r² = .8, and r² = 0]
78
Adjusted Coefficient of Determination
Proportion of Variation ‘Explained’ by Relationship Between X & Y
Reflects
» Sample Size
» Number of Independent Variables
Equation
$r_{adj}^2 = 1 - (1 - r^2) \dfrac{n - 1}{n - 2}$
79
Coefficient of
Determination Example
You’re a marketing analyst for Hasbro
Toys. You find a = -.1 & b1 = .7.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
What is the coefficient of
determination?
80
Solution Table
Xi    Yi    Xi²   Yi²   XiYi
1     1     1     1     1
2     1     4     1     2
3     2     9     4     6
4     2     16    4     8
5     4     25    16    20
15    10    55    26    37    (Totals)
81
Coefficient of Determination Solution
$\hat{Y}_i = -0.10 + 0.70 X_i$
$r^2 = \dfrac{a \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - n (\bar{Y})^2}{\sum_{i=1}^{n} Y_i^2 - n (\bar{Y})^2} = \dfrac{-0.10(10) + 0.70(37) - 5(2)^2}{26 - 5(2)^2} = .8167$
81.67% of Variation in Sales Is Due to Advertising
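A small sketch that evaluates the shortcut formula for r² with the table totals and then applies the adjusted formula for simple regression, (n - 1)/(n - 2). The totals are the ones in the solution table; the variable names are illustrative.

```python
# Totals from the solution table.
n = 5
sum_y = 10
sum_y2 = 26
sum_xy = 37
a, b1 = -0.10, 0.70

y_bar = sum_y / n

# r^2 = (a * sum Yi + b1 * sum XiYi - n * Ybar^2) / (sum Yi^2 - n * Ybar^2)
r2 = (a * sum_y + b1 * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)

# Adjusted r^2 for one independent variable.
r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)

print(f"r^2 = {r2:.4f}")
print(f"adjusted r^2 = {r2_adj:.4f}")
```

Rounded to three places, the two values line up with the R Square and Adjusted R Square entries in the SPSS Model Summary.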
82
Coefficient of Determination SPSS Output
Model Summary
Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .904 a   .817        .756                 .6055
a. Predictors: (Constant), ADVERT
Types of Probabilistic Models
» Regression Models
» Correlation Models
» Other Models
84
Correlation Models
Answer ‘How Strong Is the Linear Relationship Between 2 Variables?’
Coefficient of Correlation Used
» Population Correlation Coefficient Denoted ρ (Rho)
» Values Range from -1 to +1
» Measures Degree of Association
Used Mainly for Understanding
85
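Sample Coefficient of Correlation
Pearson Product-Moment Coefficient of Correlation
$r = \pm\sqrt{\text{Coefficient of Determination}}$ (Sign Matches the Slope $b_1$)
$r = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$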
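A brief sketch computing the Pearson coefficient from the deviation formula for the Hasbro data and confirming that its square equals the coefficient of determination. Variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Numerator: sum of cross-products of deviations from the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator pieces: sums of squared deviations.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.6f}")
print(f"r squared = {r ** 2:.4f}")
```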
86
Coefficient of Correlation Values
» -1.0: Perfect Negative Correlation
» 0: No Correlation
» +1.0: Perfect Positive Correlation
[Figure: scale from -1.0 to +1.0 with ticks at -.5, 0, and +.5; degree of negative correlation increases toward -1.0 and degree of positive correlation increases toward +1.0]
87
Coefficient of Correlation
& Regression Model
[Figure: four scatter plots of Y vs. X, each with fitted line $\hat{Y}_i = a + b_1 X_i$: r = 1, r = -1, r = .89, and r = 0]
88
Test of Coefficient of Correlation
Tests If There Is a Linear Relationship Between 2 Numerical Variables
Same Conclusion as Testing Population Slope $\beta_1$
Hypotheses
» H0: ρ = 0 (No Correlation)
» H1: ρ ≠ 0 (Correlation)
89
Evaluating Model Steps
Examine Variation Measures
Test Coefficients for Significance
Do Residual Analysis
Do Influence Analysis
Model: $\hat{Y}_i = a + b_1 X_i$
90
Test of Slope Coefficient
Tests If There Is a Linear Relationship Between X & Y
Involves Population Slope $\beta_1$
Hypotheses
» H0: $\beta_1 = 0$ (No Linear Relationship)
» H1: $\beta_1 \ne 0$ (Linear Relationship)
Theoretical Basis Is Sampling Distribution of Slopes
91
Sampling Distribution
of Sample Slopes
[Figure: Sample 1 line, Sample 2 line, and population line; axes X and Y]
All Possible Sample Slopes
» Sample 1: 2.5
» Sample 2: 1.6
» Sample 3: 1.8
» Sample 4: 2.1
» Very Large Number of Sample Slopes
[Figure: sampling distribution of the sample slopes $b_1$, with $\beta_1$ and $\sigma_{b_1}$ marked]
92
Test of Slope Coefficient Test Statistic
$t_{n-k-1} = \dfrac{b_1 - \beta_1}{S_{b_1}}$
where
$S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2}}$
93
Test of Slope Coefficient
Example
You’re a marketing analyst for Hasbro Toys. You find a = -.1, b1 = .7 & SYX = .60553.
Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4
Is the relationship significant at the .05 level?
94
Solution Table
Xi    Yi    Xi²   Yi²   XiYi
1     1     1     1     1
2     1     4     1     2
3     2     9     4     6
4     2     16    4     8
5     4     25    16    20
15    10    55    26    37    (Totals)
95
Test Statistic Solution
$S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum_{i=1}^{n} X_i^2 - n (\bar{X})^2}} = \dfrac{0.60553}{\sqrt{55 - 5(3)^2}} = 0.1915$
$t_{n-k-1} = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.70 - 0}{0.1915} = 3.656$
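The test statistic can be recomputed in a few lines. This sketch takes S_YX = .60553 and the two-tailed critical value 3.1824 (df = 3, α = .05) from the slides; variable names are illustrative.

```python
import math

s_yx = 0.60553       # standard error of estimate, from the slides
b1 = 0.70            # sample slope
beta1_null = 0.0     # hypothesized population slope under H0
n = 5
x_bar = 3.0
sum_x2 = 55          # sum of Xi^2 from the solution table
t_crit = 3.1824      # critical value for df = 3, taken from the slides

# Standard error of the slope.
s_b1 = s_yx / math.sqrt(sum_x2 - n * x_bar ** 2)

# t statistic with n - k - 1 = 3 degrees of freedom.
t_stat = (b1 - beta1_null) / s_b1

reject_h0 = abs(t_stat) > t_crit
print(f"S_b1 = {s_b1:.6f}")
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```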
96
Test of Slope Parameter Solution
H0: $\beta_1 = 0$
H1: $\beta_1 \ne 0$
α = .05
df = 5 - 1 - 1 = 3
Critical Value(s): ±3.1824 (.025 in Each Tail; Reject Beyond Either Value)
Test Statistic: $t = \dfrac{b_1 - \beta_1}{S_{b_1}} = \dfrac{0.70 - 0}{0.1915} = +3.656$
Decision: Reject at α = .05
Conclusion: There Is Evidence of a Relationship
97
Test of Slope Parameter
Computer Output
Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    T for H0: Param=0    Prob>|T|
INTERCEP    1     -0.1000               0.6350            -0.157               0.8849
ADVERT      1     0.7000                0.1914            3.656                0.0354
Column Key: Parameter Estimate = bP; Standard Error = SbP; T = bP / SbP; Prob>|T| = P-Value
98
Test of Slope Parameter
SPSS Output
------------------ Variables in the Equation ------------------
Variable      B           SE B       Beta       T        Sig T
AD            .700000     .191485    .903696    3.656    .0354
(Constant)    -.100000    .635085               -.157    .8849
Evaluating Model Steps
Examine Variation Measures
Test Coefficients for Significance
Do Residual Analysis
Do Influence Analysis
Model: $\hat{Y}_i = a + b_1 X_i$
100
Residual Analysis
Purposes
» Examine Functional Form (Linear vs. Non-Linear Model)
» Evaluate Violations of Assumptions
Graphical Analysis of Residuals
» Plot Residuals vs. X Values
» Residuals Mean Errors
– Difference Between Actual Y & Predicted Y
[Figure: Y values at X1, with their mean plotted at (X1, μY1); axes X and Y]
For one value X1, a population contains many Y values. Their mean is μY1.
A Population Regression Line
μY = α + βX
A Sample Regression Line
y = a + bx
The sample line approximates the population regression line.
Population and Sample Regression Lines
[Figure: the population regression line and one of many sample regression lines; axes: population X and sample x, population Y and sample y]
Histogram of Y Values at X = X1
[Figure: histogram of Y values at X1, with μY1 = α + βX1 marked on the line μY = α + βX]
Normal Distribution of Y Values when X = X1
The standard deviation of the normal distribution is the standard error of estimate.
[Figure: normal curve f(e) at X1, with μY1 = α + βX1 marked on the line μY = α + βX]
Normality & Constant Variance Assumptions
[Figure: normal curves f(e) at X1 and X2; axes X and Y]
A Normal Regression Surface
Every cross-sectional slice of the surface is a normal curve.
[Figure: regression surface f(e) over X and Y, with slices at X1 and X2]
Linear Regression Assumptions
Normality
» Y Values Are Normally Distributed For Each X
» e is a normally distributed random variable with a mean of zero [ E(e) = 0 ]
Homoscedasticity (Constant Variance)
» Standard deviation of the e values is the same regardless of the given value of X
» Variance of e is the same for all values of X
Independence of Errors
» The residuals (e) are independent of each other
» The size of the error for a particular value of x is not related to the size of the error for any other value of x
Each of these distributions:
1. Is normal
2. Has the standard deviation, estimated by syx
[Figure: normal curves at X1, X2, and X3, each marked with one standard deviation on either side of the line of regression; all three means lie on line of regression]
Residual Plots for Normality
» Construct histogram of residuals
» Plot residuals vs. X values
Residual Plot 1 for Normality
Construct histogram of residuals
» Nearly symmetric
» Centered near or at zero
» Shape is approximately normal
[Figure: histogram of RESIDUAL from -3.0 to 3.0; Std. Dev = 1.61, Mean = 0.0, N = 31.00]
Residual Plot 2 for Normality
Plot residuals vs. X values
» Points should be distributed about the horizontal line at 0
» Otherwise, normality is violated
[Figure: residuals vs. X, scattered about the horizontal line at 0]
Residual Plots for Normality
[Figure: the histogram of residuals and the plot of residuals vs. X values shown side by side]
Using SPSS to Test for Normality of Residuals
Statistics/Regression/Linear
» Dependent - Earnings
» Independent - Rdexpend
» Plot/Standardized Residual Plot: Histogram
» Save
– Predicted Value (Unstandardized or Standardized)
– Residual (Unstandardized or Standardized)
Graph/Scatter/Simple
» Y-Axis: res_1 (zre_1)
» X-Axis: rdexpend
List of Data, Predicted Values and Residuals
NUMBER    SALES    PRE_1       RES_1
2         24.00    24.05556    -.05556
5         28.00    26.88889    1.11111
1         22.00    23.11111    -1.11111
3         26.00    25.00000    1.00000
4         25.00    25.94444    -.94444
1         24.00    23.11111    .88889
5         26.00    26.88889    -.88889
DUNTON'S WORLD OF SOUND
[Figure: DUNTON'S WORLD OF SOUND, histogram of Regression Standardized Residual (-1.00 to 1.00); Dependent Variable: SALES; Std. Dev = .91, Mean = 0.00, N = 7.00]
[Figure: DUNTON'S WORLD OF SOUND, Plot of Residuals vs Number; Residual from -1.5 to 1.5, NUMBER from 0 to 6]
The Electronic Firms
An accounting standards board investigating the treatment of research and development expenses by the nation’s major electronic firms was interested in the relationship between a firm’s research and development expenditures and its earnings.
Earnings = 6.840 + 10.671(rdexpend)
List of Data, Predicted Values and Residuals
RDEXPEND    EARNINGS    PRE_1        RES_1        ZPR_1       ZRE_1
15.00       221.00      166.90075    54.09925     1.84527     2.39432
8.50        83.00       97.54224     -14.54224    .48229      -.64361
12.00       147.00      134.88913    12.11087     1.21620     .53600
6.50        69.00       76.20116     -7.20116     .06291      -.31871
4.50        41.00       54.86008     -13.86008    -.35647     -.61342
2.00        26.00       28.18373     -2.18373     -.88070     -.09665
.50         35.00       12.17792     22.82208     -1.19523    1.01006
1.50        40.00       22.84846     17.15154     -.98554     .75909
14.00       125.00      156.23021    -31.23021    1.63558     -1.38218
9.00        97.00       102.87751    -5.87751     .58713      -.26013
7.50        53.00       86.87170     -33.87170    .27260      -1.49909
.50         12.00       12.17792     -.17792      -1.19523    -.00787
2.50        34.00       33.51900     .48100       -.77585     .02129
3.00        48.00       38.85427     9.14573      -.67101     .40477
6.00        64.00       70.86589     -6.86589     -.04194     -.30387
Column Key: PRE_1 = Predicted Value; RES_1 = Residual; ZPR_1 = Standardized Predicted Value; ZRE_1 = Standardized Residual
[Figure: ELECTRONIC FIRMS, histogram of Regression Standardized Residual (-1.50 to 2.50); Dependent Variable: EARNINGS; Std. Dev = .96, Mean = 0.00, N = 15.00]
[Figure: ELECTRONIC FIRMS, Plot of Residuals vs R&D Expenditures (Residuals vs X Values); Residual from -40 to 60, RDEXPEND from 0 to 16]
[Figure: ELECTRONIC FIRMS, Plot of Standardized Residuals vs RDexpend (Standardized Residuals vs X Value); Standardized Residual from -2 to 3, RDEXPEND from 0 to 16]
Homoscedasticity (Constant Variance)
[Figure: two plots of standardized residuals (SR) vs. X, labeled "Correct Specification" and "Heteroscedasticity" (Fan-Shaped)]
Standardized Residuals Used.
127
Using SPSS to Test for Homoscedasticity of Residuals
Graph/Scatter/Simple
» Y-Axis: res_1 (zre_1)
» X-Axis: rdexpend
Test for Homoscedasticity
[Figure: DUNTON’S WORLD OF SOUND, Plot of Residuals vs Number; Residual from -1.5 to 1.5, NUMBER from 0 to 6]
Test for Homoscedasticity
[Figure: ELECTRONIC FIRMS, Plot of Residuals vs R&D Expenditures (Residuals vs X Values); Residual from -40 to 60, RDEXPEND from 0 to 16]
Residual Plot for Independence
[Figure: two plots of standardized residuals (SR) vs. X, labeled "Correct Specification" and "Not Independent"]
Plots Reflect Sequence Data Were Collected.
131
Two Types of Autocorrelation
Positive Autocorrelation: successive terms in time series are directly related
Negative Autocorrelation: successive terms are inversely related
132
Positive Autocorrelation: Residuals tend to be followed by residuals with the same sign
[Figure: Residual (y - ŷ), from -20 to 20, vs. Time Period t, from 0 to 20]
Negative Autocorrelation: Residuals tend to change signs from one period to the next
[Figure: Residual (y - ŷ), from -20 to 20, vs. Time Period t, from 0 to 20]
Problems with autocorrelated time-series data
sy.x and sb are biased downwards
Invalid probability statements about regression equation and slopes
F and t tests won’t be valid
May imply that cycles exist
May induce a falsely high or low agreement between 2 variables
135
Using SPSS to Test for Independence of Errors
Graph/Sequence
» Variables: res_1
Durbin-Watson Statistic
136
Time Sequence of Residuals
[Figure: DUNTON’S WORLD OF SOUND, Residual from -1.5 to 1.5 vs. Sequence number 1 to 7]
Time Sequence Plot of Residuals
[Figure: ELECTRONIC FIRMS, Residual from -40 to 60 vs. Sequence number 1 to 15]
140
Durbin-Watson Procedure
Used to Detect Autocorrelation
» Residuals in One Time Period Are Related to Residuals in Another Period
» Violation of Independence Assumption
Durbin-Watson Test Statistic
$D = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
H0: No positive autocorrelation exists (residuals are random)
H1: Positive autocorrelation exists
Accept H0 if d > dU
Reject H0 if d < dL
Inconclusive if dL < d < dU
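A minimal sketch of the Durbin-Watson statistic, applied to the residuals of the Hasbro advertising model. It assumes the five observations are in time order as listed, which the slides do not state; the bounds dL and dU come from published tables and are not computed here.

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b1 = -0.10, 0.70

# Residuals in the order the observations are listed (assumed time order).
e = [yi - (a + b1 * xi) for xi, yi in zip(x, y)]

# Numerator: squared differences between successive residuals (i = 2..n).
numerator = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
# Denominator: sum of squared residuals (i = 1..n).
denominator = sum(ei ** 2 for ei in e)

d = numerator / denominator
print(f"Durbin-Watson D = {d:.3f}")
```

Under that ordering assumption the value comes out near 2.5, in line with the Durbin-Watson entry reported in the SPSS Model Summary for this model.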
Testing for Positive Autocorrelation
» 0 to dL: There is positive autocorrelation
» dL to dU: The test is inconclusive
» dU to 4: There is no evidence of autocorrelation
[Figure: number line for d from 0 to 4, with dL, dU, and 2 marked]
143
Using SPSS with Autocorrelation
Statistics/Regression/Linear
» Dependent; Independent
» Statistics/Durbin-Watson (use only time series data)
If DW indicates autocorrelation, then …
Statistics/Time Series/Autoregression
» Cochrane-Orcutt
» OK
144
Model Summary
Model    R        R Square    Adjusted R Square    Std. Error of the Estimate    Durbin-Watson
1        .904 a   .817        .756                 .6055                         2.509
a. Predictors: (Constant), ADVERT
Solutions for autocorrelation
Changes in the dependent and independent variables - first differences
Transform the variables
Include an independent variable that measures the time of the observation
Use lagged variables (once lagged value of dependent variable is introduced as independent variable, Durbin-Watson test is not valid)
146
Residual Plot for Linearity (Functional Form)
[Figure: two plots of e vs. X, labeled "Correct Specification" and "Add X² Term"]
[Figure: ELECTRONIC FIRMS, Plot of Residuals vs R&D Expenditures (Residuals vs X Values); Residual from -40 to 60, RDEXPEND from 0 to 16]
Evaluating Model Steps
Examine Variation Measures
Test Coefficients for Significance
Do Residual Analysis
Do Influence Analysis
Model: $\hat{Y}_i = a + b_1 X_i$
149
Influence Analysis (Outliers)
Examines Observations that Strongly Affect Coefficient Values
Example - During Data Collection Union Strike Occurred
Should Try to Understand Why Influential Observations Occurred
Cautiously Consider Deleting Observation
151
Effect of
Influential Observation
[Figure: one influential observation set apart from the other points, with the line fitted with the influential observation and the line fitted without it; axes X and Y]
152
Regression Cautions
Violated Assumptions
Relevancy of Historical Data
Level of Significance
Extrapolation
Cause & Effect
155
Extrapolation
[Figure: fitted line over the Relevant Range of X (Interpolation), with Extrapolation regions on either side; axes X and Y]
156
Cause & Effect
[Figure: Liquor Consumption plotted against # Teachers]
157
Conclusion

Described the Linear Regression Model

Stated the Regression Modeling Steps

Explained Ordinary Least Squares

Computed Regression Coefficients

Described Residual & Influence Analysis

Predicted Response Variable

Interpreted Computer Output
158