Transcript Stats 845

Fitting Equations to Data
A common situation: suppose that we have
• a single dependent variable Y (continuous numerical)
and
• one or several independent variables X1, X2, X3, ... (also continuous numerical, although there are techniques that allow you to handle categorical independent variables).
• The objective will be to "fit" an equation to the data collected on these measurements that explains the dependence of Y on X1, X2, X3, ...
What is the value of these equations?
Equations give very precise and concise descriptions (models) of data, explaining how dependent variables are related to independent variables.
Examples
• Linear model: Y = aX + b + e
  (e.g., Y = blood pressure, X = age)
• Exponential growth or decay model: Y = a e^(kX) + b + e
  (e.g., Y = average of the 5 best times for the 100m during an Olympic year, X = the Olympic year)
• Another growth model (the Gompertz model): Y = a e^(−b e^(−kX)) + e
  or dY/dX = "rate of increase of Y" = kY ln(a/Y)
  (e.g., Y = size of a cancerous tumor, X = time after implantation)
Note the presence of the random error term, e (random noise).
This is an important term in any statistical model.
Without this term the model is deterministic and doesn't require statistical analysis.
What is the value of these equations?
1. Equations give very precise and concise descriptions (models) of data and how dependent variables are related to independent variables.
2. The parameters of the equations usually have very useful interpretations relative to the phenomenon that is being studied.
3. The equations can be used to calculate and estimate very useful quantities related to the phenomenon, such as relative extrema and future or out-of-range values of the phenomenon.
4. Equations can provide the framework for comparison.
The Multiple Linear Regression Model
An important statistical model
Again we assume that we have a single dependent variable Y and p (say) independent variables X1, X2, X3, ..., Xp.
The equation (model) that generally describes the relationship between Y and the independent variables is of the form:
Y = f(X1, X2, ..., Xp | θ1, θ2, ..., θq) + e
where θ1, θ2, ..., θq are unknown parameters of the function f and e is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation s).
In Multiple Linear Regression we assume the
following model
Y = b0 + b1 X1 + b2 X2 + ... + bp Xp + e
This model is called the Multiple Linear Regression
Model.
where b0, b1, b2, ..., bp are unknown parameters of the model and e is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation s.
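As a concrete illustration (not part of the original notes), here is a minimal Python/numpy sketch of computing the least squares fit of this model; the function name fit_mlr and its arguments are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Least squares fit of Y = b0 + b1*X1 + ... + bp*Xp + e.

    X is an (n x p) array of independent variables; a column of
    ones is prepended so that b0 (the intercept) is estimated too."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # design matrix
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # minimizes the residual SS
    return b                                     # [b0, b1, ..., bp]
```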
The importance of the Linear model
1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is sometimes the first model to be fitted and only abandoned if it turns out to be inadequate.
2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.
3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables. This important fact (i.e. that many non-linear models are linearizable) ensures the wide utility of the linear model.
An Example
The following data come from an experiment that investigated the source from which corn plants in various soils obtain their phosphorous. The concentration of inorganic phosphorous (X1) and the concentration of organic phosphorous (X2) were measured in the soil of n = 18 test plots. In addition, the phosphorous content (Y) of corn grown in the soil was also measured. The data are displayed below:
Inorganic       Organic         Plant Available
Phosphorous     Phosphorous     Phosphorous
X1              X2              Y
0.4             53              64
0.4             23              60
3.1             19              71
0.6             34              61
4.7             24              54
1.7             65              77
9.4             44              81
10.1            31              93
11.6            29              93
12.6            58              51
10.9            37              76
23.1            46              96
23.1            50              77
21.6            44              93
23.1            56              95
1.9             36              54
26.8            58              168
29.9            51              99
Coefficients
Intercept   56.2510241   (b0)
X1          1.78977412   (b1)
X2          0.08664925   (b2)

Equation:
Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
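For readers who want to check the fit, here is a hedged Python sketch (numpy assumed) that refits the model to the 18 plots above; it should reproduce the coefficients reported above, up to rounding.

```python
import numpy as np

# Soil phosphorous data: X1 = inorganic, X2 = organic, Y = plant available
X1 = [0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
      12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9]
X2 = [53, 23, 19, 34, 24, 65, 44, 31, 29,
      58, 37, 46, 50, 44, 56, 36, 58, 51]
Y  = [64, 60, 71, 61, 54, 77, 81, 93, 93,
      51, 76, 96, 77, 93, 95, 54, 168, 99]

Xd = np.column_stack([np.ones(len(Y)), X1, X2])   # intercept + X1 + X2
b = np.linalg.lstsq(Xd, Y, rcond=None)[0]
print(b)   # approximately [56.251, 1.790, 0.087]
```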
Summary of the Statistics used in Multiple Regression

The Least Squares Estimates: b̂0, b̂1, b̂2, ..., b̂p
- the values that minimize
RSS = Σ(i=1 to n) (yi − ŷi)² = Σ(i=1 to n) (yi − b0 − b1 xi1 − b2 xi2 − ... − bp xip)².
Note: ŷi = b̂0 + b̂1 xi1 + ... + b̂p xip = predicted value of yi.
The Analysis of Variance Table Entries
a) Adjusted Total Sum of Squares (SSTotal):
SSTotal = Σ(i=1 to n) (yi − ȳ)², d.f. = n − 1
b) Residual Sum of Squares (SSError):
RSS = SSError = Σ(i=1 to n) (yi − ŷi)², d.f. = n − p − 1
c) Regression Sum of Squares (SSReg):
SSReg = SS(b1, b2, ..., bp) = Σ(i=1 to n) (ŷi − ȳ)², d.f. = p
Note:
Σ(i=1 to n) (yi − ȳ)² = Σ(i=1 to n) (ŷi − ȳ)² + Σ(i=1 to n) (yi − ŷi)²,
i.e. SSTotal = SSReg + SSError.
The Analysis of Variance Table

Source       Sum of Squares   d.f.    Mean Square                      F
Regression   SSReg            p       SSReg/p = MSReg                  MSReg/s²
Error        SSError          n-p-1   SSError/(n-p-1) = MSError = s²
Total        SSTotal          n-1
Uses:
1. To estimate s² (the error variance).
- Use s² = MSError to estimate s².
2. To test the hypothesis
H0: b1 = b2 = ... = bp = 0.
Use the test statistic
F = MSReg/s² = [(1/p)SSReg]/[(1/(n-p-1))SSError].
- Reject H0 if F > Fa(p, n-p-1).
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).
a) R² = the coefficient of determination
= SSReg/SSTotal = [Σ(i=1 to n) (ŷi − ȳ)²] / [Σ(i=1 to n) (yi − ȳ)²]
= the proportion of variance in Y explained by X1, X2, ..., Xp
1 − R² = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp
= SSError/SSTotal.
b) Ra² = "R² adjusted" for degrees of freedom
= 1 − [the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.]
= 1 − [(1/(n-p-1))SSError]/[(1/(n-1))SSTotal]
= 1 − [(n-1)SSError]/[(n-p-1)SSTotal]
= 1 − [(n-1)/(n-p-1)][1 − R²].
c) R = √R² = √(SSReg/SSTotal) = the multiple correlation coefficient of Y with X1, X2, ..., Xp
= the maximum correlation between Y and a linear combination of X1, X2, ..., Xp
Comment: The statistics F, R², Ra² and R are equivalent statistics.
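A minimal sketch (numpy assumed; regression_summary is a hypothetical helper name) of how these quantities are computed from data using the definitions above:

```python
import numpy as np

def regression_summary(X, y):
    """ANOVA quantities for the regression of y on the columns of X
    (X does not include the intercept column)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]
    yhat = Xd @ b
    ss_total = np.sum((y - y.mean()) ** 2)         # d.f. n - 1
    ss_error = np.sum((y - yhat) ** 2)             # d.f. n - p - 1
    ss_reg = ss_total - ss_error                   # d.f. p
    r2 = ss_reg / ss_total                         # coefficient of determination
    r2_adj = 1 - (n - 1) / (n - p - 1) * (1 - r2)  # Ra^2
    f = (ss_reg / p) / (ss_error / (n - p - 1))    # MSReg / s^2
    return {"R2": r2, "Ra2": r2_adj, "R": np.sqrt(r2), "F": f}
```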
Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS.
After starting the SPSS program the following dialogue box appears:
If you select Opening an existing file and press OK, the following dialogue box appears.
If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range.
Once you "click OK", two windows will appear: one that will contain the output, the other containing the data.
To perform any statistical analysis select the Analyze menu, then select Regression and Linear.
The following Regression dialogue box appears.
Select the Dependent variable Y.
Select the Independent variables X1, X2, etc.
If you select the Method "Enter", all variables will be put into the equation.
There are also several other methods that can be used:
1. Forward selection
2. Backward Elimination
3. Stepwise Regression
Forward selection
1. This method starts with no variables in the
equation
2. Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
3. Adds the most significant.
4. Continues until all variables not in the
equation have no significant effect on the
dependent variable.
Backward Elimination
1. This method starts with all variables in the
equation
2. Carries out statistical tests on variables in the
equation to see which have no significant
effect on the dependent variable.
3. Deletes the least significant.
4. Continues until all variables in the equation
have a significant effect on the dependent
variable.
Stepwise Regression (uses both forward and
backward techniques)
1. This method starts with no variables in the
equation
2. Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any
variables added earlier can now be deleted.
5. Continues until all variables not in the
equation have no significant effect on the
dependent variable.
All of these methods are procedures for attempting to find the best equation. The best equation is the equation that is the simplest (not containing variables that are not important) yet adequate (containing variables that are important).
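As an illustration of the idea (the exact tests SPSS applies may differ in detail), here is a hedged sketch of forward selection using partial F tests; forward_selection and the alpha cut-off are hypothetical names/choices, and scipy is assumed for the p-values.

```python
import numpy as np
from scipy import stats

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the predictor whose
    partial F test (equivalent to a t test) is most significant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def rss(cols):
        Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        b = np.linalg.lstsq(Xd, y, rcond=None)[0]
        return np.sum((y - Xd @ b) ** 2)

    while remaining:
        rss_cur = rss(selected)
        best = None
        for c in remaining:
            rss_new = rss(selected + [c])
            df_err = n - len(selected) - 2          # n - (k+1) - 1
            f = (rss_cur - rss_new) / (rss_new / df_err)
            pval = stats.f.sf(f, 1, df_err)
            if best is None or pval < best[1]:
                best = (c, pval)
        if best[1] >= alpha:
            break                                   # nothing significant left
        selected.append(best[0])
        remaining.remove(best[0])
    return selected
```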
Once the dependent variable, the independent variables and the Method have been selected, press OK and the analysis will be performed.
The output will contain the following table:

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .822   .676       .673                4.46
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
R² and adjusted R² measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3%).
R is the multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc.).
The next table is the Analysis of Variance Table:

ANOVA

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   16098.158        3     5366.053      269.664   .000
Residual     7720.836         388   19.899
Total        23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
b. Dependent Variable: MPG
The F test is testing whether the regression coefficients of the predictor variables are all zero, namely that none of the independent variables X1, X2, X3, etc. has any effect on Y.
The final table in the output:

Coefficients

Model 1      Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)   44.015             1.272                            34.597   .000
ENGINE       -5.53E-03          .007         -.074               -.786    .432
HORSE        -5.56E-02          .013         -.273               -4.153   .000
WEIGHT       -4.62E-03          .001         -.504               -6.186   .000
a. Dependent Variable: MPG
This gives the estimates of the regression coefficients, their standard errors and the t tests for testing whether they are zero.
Note: Engine size has no significant effect on Mileage.
The estimated equation from the coefficients table above is:

Mileage = 44.0 − (5.53/1000) Engine − (5.56/100) Horse − (4.62/1000) Weight + Error
Mileage decreases:
1. with increases in Engine Size (not significant, p = 0.432),
2. with increases in Horsepower (significant, p = 0.000),
3. with increases in Weight (significant, p = 0.000).
Properties of the Least Squares Estimators b̂0, b̂1, b̂2, ..., b̂p:
1. Normally distributed (if the error terms are normally distributed).
2. Unbiased estimators of the linear parameters b0, b1, b2, ..., bp.
3. Minimum variance (minimum standard error) of all unbiased estimators of the linear parameters b0, b1, b2, ..., bp.
Comments:
The standard error of b̂i, S.E.(b̂i) = s_b̂i, depends on:
1. The error variance s² (and s).
2. sXi, the standard deviation of Xi (the ith independent variable).
3. The sample size n.
4. The correlations between all pairs of variables.
The standard error of b̂i, S.E.(b̂i) = s_b̂i:
• decreases as s decreases,
• decreases as sXi increases,
• decreases as n increases,
• increases as the correlation between pairs of independent variables increases.
– In fact the standard error of the least squares estimates can be extremely high if there is a high correlation between one of the independent variables and a linear combination of the remaining independent variables (the problem of multicollinearity).
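A small simulation sketch (hypothetical data, numpy assumed) illustrating this last point: as the correlation between two independent variables grows, the standard errors of their estimated coefficients grow.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
for rho in (0.0, 0.9, 0.99):
    # x2 correlated with x1 at (approximately) level rho
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ b) ** 2) / (n - 3)      # MSError
    cov = s2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    print(rho, np.sqrt(np.diag(cov))[1:])        # S.E. of b1hat and b2hat
```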
The Covariance Matrix, the Correlation Matrix and the XTX Inverse Matrix

The Covariance Matrix

[ S.E.(b̂0)²   Cov(b̂0, b̂1)   ...   Cov(b̂0, b̂p) ]
[             S.E.(b̂1)²     ...   Cov(b̂1, b̂p) ]
[                            ...                ]
[                                  S.E.(b̂p)²   ]

where Cov(b̂i, b̂j) = r_ij s_b̂i s_b̂j = r_ij S.E.(b̂i) S.E.(b̂j)
and where r_ij = the correlation between b̂i and b̂j.
The Correlation Matrix

[ 1   r01   ...   r0p ]
[     1     ...   r1p ]
[                 ... ]
[                 1   ]
The XTX Inverse Matrix

[ a00   a01   ...   a0p ]
[       a11   ...   a1p ]
[                   ... ]
[                   app ]
If we multiply each entry in the XTX inverse matrix by s² = MSError, this matrix turns into the covariance matrix of b̂0, b̂1, b̂2, ..., b̂p. Thus

S.E.(b̂i)² = s² a_ii and Cov(b̂i, b̂j) = s² a_ij.
These matrices can be used to compute standard errors for linear combinations of the regression coefficients. Namely, for

L̂ = c0 b̂0 + c1 b̂1 + ... + cp b̂p,

S.E.(L̂) = s_L̂ = √( Σ(i=0 to p) ci² [S.E.(b̂i)]² + 2 Σ(i<j) ci cj Cov(b̂i, b̂j) )
        = √( Σ(i=0 to p) ci² s²_b̂i + 2 Σ(i<j) ci cj r_ij s_b̂i s_b̂j )
        = s √( Σ(i=0 to p) ci² a_ii + 2 Σ(i<j) ci cj a_ij )

For example if L̂ = b̂i − b̂j, then

S.E.(b̂i − b̂j) = s_(b̂i − b̂j)
= √( (1)² S.E.(b̂i)² + (−1)² S.E.(b̂j)² + 2(1)(−1) Cov(b̂i, b̂j) )
= √( S.E.(b̂i)² + S.E.(b̂j)² − 2 Cov(b̂i, b̂j) )
= √( s²_b̂i + s²_b̂j − 2 r_ij s_b̂i s_b̂j )
= s √( a_ii + a_jj − 2 a_ij )
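In matrix form this is S.E.(L̂) = √(cᵀ Cov c); a short sketch (hypothetical helper name) using a covariance matrix such as the one computed in the sketch above:

```python
import numpy as np

def se_linear_combination(c, cov):
    """S.E. of Lhat = c0*b0hat + ... + cp*bphat, given the covariance
    matrix of the estimates: S.E.(Lhat) = sqrt(c' Cov c)."""
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(c @ cov @ c))

# Example: S.E.(b1hat - b2hat) uses c = (0, 1, -1, 0, ..., 0)
```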
An Example
Suppose one is interested in how the cost per month (Y) of heating a plant is determined by the average atmospheric temperature in the month (X1) and the number of operating days in the month (X2). The data on these variables were collected for n = 25 months selected at random and are given on the following page.
Y = cost per month of heating the plant
X1 = average atmospheric temperature in the month
X2 = the number of operating days for the plant in the month
Month   Y      X1     X2
1       1098   35.3   20
2       1113   29.7   20
3       1251   30.8   23
4       840    58.8   20
5       927    61.4   21
6       873    71.3   22
7       636    74.4   11
8       850    76.7   23
9       782    70.7   21
10      914    57.5   20
11      824    46.4   20
12      1219   28.9   21
13      1188   28.1   21
14      957    39.1   19
15      1094   46.8   23
16      958    48.5   20
17      1009   59.3   22
18      811    70.0   22
19      683    70.0   22
20      888    74.5   23
21      768    72.1   20
22      847    58.1   21
23      886    44.6   20
24      1036   33.4   20
25      1108   28.6   –

The Least Squares Estimates:

           Estimate   Standard Error
Constant   912.6      110.28
X1         -7.24      0.80
X2         20.29      4.577
The Covariance Matrix

           Constant   X1        X2
Constant   12162
X1         -49.203    .63390
X2         -464.36    .76796    20.947

The Correlation Matrix

           Constant   X1       X2
Constant   1.000      -.1764   -.0920
X1                    1.000    .0210
X2                             1.000

The XTX Inverse Matrix

           Constant    X1              X2
Constant   2.778747
X1         -0.011242   0.14207×10⁻³
X2         -0.106098   0.175467×10⁻³   0.478599×10⁻²
The Analysis of Variance Table

Source       df   SS       MS       F
Regression   2    541871   270936   61.899
Error        22   96287    4377
Total        24   638158
Summary Statistics (R², Radjusted² = Ra² and R)

R² = 541871/638158 = .8491 (explained variance in Y: 84.91%)
Ra² = 1 − [1 − R²][(n−1)/(n−p−1)] = 1 − [1 − .8491][24/22] = .8354 (83.54%)
R = √.8491 = .9215 = the multiple correlation coefficient
[Scatter-plot of COST versus TEMP]
[Scatter-plot of COST versus DAYS]
[Three-dimensional scatter-plot of Cost, Temp and Days]
Example
Motor Vehicle example
Variables
1. (Y) mpg – Mileage
2. (X1) engine – Engine size.
3. (X2) horse – Horsepower.
4. (X3) weight – Weight.
Select Analyze->Regression->Linear.
To print the correlation matrix or the
covariance matrix of the estimates select
Statistics
Check the box for the covariance matrix of the
estimates.
Here is the table giving the estimates and their
standard errors.
Coefficients

Model 1      B           Std. Error   Beta    t        Sig.
(Constant)   44.015      1.272                34.597   .000
ENGINE       -5.53E-03   .007         -.074   -.786    .432
HORSE        -5.56E-02   .013         -.273   -4.153   .000
WEIGHT       -4.62E-03   .001         -.504   -6.186   .000
a. Dependent Variable: MPG
Here is the table giving the correlation matrix
and covariance matrix of the regression
estimates:
Coefficient Correlations (Dependent Variable: MPG)

Correlations   WEIGHT      HORSE       ENGINE
WEIGHT         1.000       -.129       -.725
HORSE          -.129       1.000       -.518
ENGINE         -.725       -.518       1.000

Covariances    WEIGHT      HORSE       ENGINE
WEIGHT         5.571E-07   -1.29E-06   -3.81E-06
HORSE          -1.29E-06   1.794E-04   -4.88E-05
ENGINE         -3.81E-06   -4.88E-05   4.941E-05
What is missing in SPSS are the covariances and correlations with the intercept estimate (constant). These can be found by using the following trick:
1. Introduce a new variable (called constnt).
2. The new "variable" takes on the value 1 for all cases.
Select Transform->Compute
The following dialogue box appears
Type in the name of the target variable - constnt
Type in ‘1’ for the Numeric Expression
This variable is now added to the data file
Add this new variable (constnt) to the list of
independent variables
Under Options make sure the box – Include
constant in equation – is unchecked
The coefficient of the new variable will be the
constant.
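The same trick can be written directly in matrix terms; a sketch (numpy assumed, hypothetical function name) showing that appending a column of ones and fitting with no intercept reproduces the usual fit, while s²(X'X)⁻¹ now includes rows and columns for the constant:

```python
import numpy as np

def fit_no_intercept_with_ones(X, y):
    """Fit through the origin with an explicit ones column ("constnt")."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([X, np.ones(n)])        # constnt = 1 for all cases
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]    # last entry is the constant
    s2 = np.sum((y - Xd @ b) ** 2) / (n - Xd.shape[1])
    cov = s2 * np.linalg.inv(Xd.T @ Xd)          # includes the constant
    return b, cov
```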
Here are the estimates of the parameters with
their standard errors
Coefficients (Dependent Variable: MPG; Linear Regression through the Origin)

Model 1   B           Std. Error   Beta    t        Sig.
ENGINE    -5.53E-03   .007         -.049   -.786    .432
HORSE     -5.56E-02   .013         -.250   -4.153   .000
WEIGHT    -4.62E-03   .001         -.577   -6.186   .000
CONSTNT   44.015      1.272        1.781   34.597   .000
Note the agreement with parameter estimates
and their standard errors as previously
calculated.
Here is the correlation matrix and the
covariance matrix of the estimates.
Coefficient Correlations (Dependent Variable: MPG; Linear Regression through the Origin)

Correlations   CONSTNT     ENGINE      HORSE       WEIGHT
CONSTNT        1.000       .761        -.318       -.824
ENGINE         .761        1.000       -.518       -.725
HORSE          -.318       -.518       1.000       -.129
WEIGHT         -.824       -.725       -.129       1.000

Covariances    CONSTNT     ENGINE      HORSE       WEIGHT
CONSTNT        1.619       6.808E-03   -5.43E-03   -7.82E-04
ENGINE         6.808E-03   4.941E-05   -4.88E-05   -3.81E-06
HORSE          -5.43E-03   -4.88E-05   1.794E-04   -1.29E-06
WEIGHT         -7.82E-04   -3.81E-06   -1.29E-06   5.571E-07
Testing for Hypotheses related to Multiple Regression
The General Linear Hypothesis

H0: h11 b1 + h12 b2 + h13 b3 + ... + h1p bp = h1
    h21 b1 + h22 b2 + h23 b3 + ... + h2p bp = h2
    ...
    hq1 b1 + hq2 b2 + hq3 b3 + ... + hqp bp = hq

where h11, h12, h13, ..., hqp and h1, h2, h3, ..., hq are known coefficients.
Examples
1. H0: b1 = 0
2. H0: b1 = 0, b2 = 0, b3 = 0
3. H0: b1 = b2
4. H0: b1 = b2, b3 = b4
5. H0: b1 = 1/2(b2 + b3)
6. H0: b1 = 1/2(b2 + b3), b3 = 1/3(b4 + b5 + b6)
When testing hypotheses there are two
models of interest.
1. The Complete Model
Y = b0 + b1X1 + b2X2 + b3X3 +... + bpXp+ e
2. The Reduced Model
The model implied by H0.
You are interested in knowing whether the complete
model can be simplified to the reduced model.
Some Comments
1. The complete model contains more parameters and will always provide a better fit to the data than the reduced model.
2. The Residual Sum of Squares for the complete model will always be smaller than the R.S.S. for the reduced model.
3. If the reduction in the R.S.S. is small as we change from the reduced model to the complete model, the reduced model should be accepted as providing an adequate fit.
4. If the reduction in the R.S.S. is large as we change from the reduced model to the complete model, the reduced model should be rejected as providing an adequate fit and the complete model should be kept.
These principles form the basis for the following test.
Testing the General Linear Hypothesis
The F-test for H0 is performed by carrying out two
runs of a multiple regression package.
Run 1: Fit the complete model, resulting in the following ANOVA table:

Source             df      Sum of Squares
Regression         p       SSReg
Residual (Error)   n-p-1   SSError
Total              n-1     SSTotal
Run 2: Fit the reduced model (q parameters eliminated), resulting in the following ANOVA table:

Source             df        Sum of Squares
Regression         p-q       SS1Reg
Residual (Error)   n-p+q-1   SS1Error
Total              n-1       SSTotal
The Test:
The test is carried out using the test statistic

F = [(1/q)(Reduction in the Residual Sum of Squares)] / (Residual Mean Square for the Complete model)
  = [(1/q) SSH0] / s²

where SSH0 = SS1Error − SSError = SSReg − SS1Reg
and s² = SSError/(n-p-1).

The test statistic F has an F-distribution with ν1 = q d.f. in the numerator and ν2 = n − p − 1 d.f. in the denominator if H0 is true.
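A sketch (numpy and scipy assumed; glh_f_test is a hypothetical name) of the two-run procedure: fit both models, take the reduction in the residual sum of squares, and form F.

```python
import numpy as np
from scipy import stats

def glh_f_test(X_full, X_reduced, y, q):
    """F test of a general linear hypothesis: compare the complete model
    (design matrix X_full, intercept column included) with the reduced
    model implied by H0 (X_reduced); q = number of constraints in H0."""
    y = np.asarray(y, dtype=float)
    n = len(y)

    def rss(Xd):
        b = np.linalg.lstsq(Xd, y, rcond=None)[0]
        return np.sum((y - Xd @ b) ** 2)

    ss_error = rss(X_full)                 # SSError for the complete model
    ss_h0 = rss(X_reduced) - ss_error      # reduction in the residual SS
    df2 = n - X_full.shape[1]              # n - p - 1
    f = (ss_h0 / q) / (ss_error / df2)
    return f, stats.f.sf(f, q, df2)        # F and its p-value
```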
[Figure: density of the F-distribution when H0 is true, with the critical region to the right of Fa(q, n − p − 1) shaded]

The Critical Region: Reject H0 if F > Fa(q, n − p − 1).
The ANOVA Table for the Test:

Source                             df      Sum of Squares   Mean Square       F
Regression (for reduced model)     p-q     SS1Reg           [1/(p-q)]SS1Reg   MS1Reg/s²
Departure from H0                  q       SSH0             (1/q)SSH0         MSH0/s²
Residual (Error)                   n-p-1   SSError          s²
Total                              n-1     SSTotal
Some Examples:
Four independent variables X1, X2, X3, X4.
The Complete Model:
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e

1) a) H0: b3 = 0, b4 = 0 (q = 2)
   b) The Reduced Model:
      Y = b0 + b1X1 + b2X2 + e
      Dependent Variable: Y
      Independent Variables: X1, X2

2) a) H0: b3 = 4.5, b4 = 8.0 (q = 2)
   b) The Reduced Model:
      Y − 4.5X3 − 8.0X4 = b0 + b1X1 + b2X2 + e
      Dependent Variable: Y − 4.5X3 − 8.0X4
      Independent Variables: X1, X2
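A sketch (hypothetical helper, numpy assumed) of the transformation used in example 2: the constrained terms are moved to the left-hand side, so the reduced model is an ordinary regression of the transformed response on X1 and X2.

```python
import numpy as np

def transformed_response(y, x3, x4):
    """Response for the reduced model under H0: b3 = 4.5, b4 = 8.0."""
    y, x3, x4 = (np.asarray(a, dtype=float) for a in (y, x3, x4))
    return y - 4.5 * x3 - 8.0 * x4   # then regress on X1 and X2 only
```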
Example
Motor Vehicle example
Variables
1. (Y) mpg – Mileage
2. (X1) engine – Engine size.
3. (X2) horse – Horsepower.
4. (X3) weight – Weight.
Suppose we want to test:
H0: b1 = 0 against HA: b1 ≠ 0
i.e. engine size (engine) has no effect on mileage (mpg).
The Full model:
Y = b0 + b1 X1 + b2 X2 + b3 X3 + e
    (mpg)  (engine)  (horse)  (weight)
The reduced model:
Y = b0 + b2 X2 + b3 X3 + e
The ANOVA Table for the Full model:

ANOVA (Dependent Variable: MPG)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   16098.158        3     5366.053      269.664   .000
Residual     7720.836         388   19.899
Total        23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
The ANOVA Table for the Reduced model:

ANOVA (Dependent Variable: MPG)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   16085.855        2     8042.928      404.583   .000
Residual     7733.138         389   19.880
Total        23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE
The reduction in the residual sum of squares
= 7733.138452 - 7720.835649 = 12.30280251
The ANOVA Table for testing H0: b1 = 0 against HA: b1 ≠ 0:

Source       Sum of Squares   df    Mean Square   F           Sig.
Regression   16085.85502      2     8042.927509   404.18628   0.0000
b1 = 0       12.30280251      1     12.30280251   0.6182605   0.4322
Residual     7720.835649      388   19.89906095
Total        23818.99347      391

Note that F = 12.30280251/19.89906095 ≈ 0.618 = (−.786)², the square of the t statistic for ENGINE in the coefficients table, and the p-values agree (0.432): for a single coefficient, the F test and the t test are equivalent.
Now suppose we want to test:
H0: b1 = 0, b2 = 0 against HA: b1 ≠ 0 or b2 ≠ 0
i.e. engine size (engine) and horsepower (horse) have no effect on mileage (mpg).
The Full model:
Y = b0 + b1 X1 + b2 X2 + b3 X3 + e
    (mpg)  (engine)  (horse)  (weight)
The reduced model:
Y = b0 + b3 X3 + e
The ANOVA Table for the Full model:

ANOVA (Dependent Variable: MPG)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   16098.158        3     5366.053      269.664   .000
Residual     7720.836         388   19.899
Total        23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
The ANOVA Table for the Reduced model:

ANOVA (Dependent Variable: MPG)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   15519.970        1     15519.970     729.337   .000
Residual     8299.023         390   21.280
Total        23818.993        391
a. Predictors: (Constant), WEIGHT
The reduction in the residual sum of squares
= 8299.023 - 7720.835649 = 578.1875392
The ANOVA Table for testing H0: b1 = 0, b2 = 0 against HA: b1 ≠ 0 or b2 ≠ 0:

Source           Sum of Squares   df    Mean Square   F           Sig.
Regression       15519.97028      1     15519.97028   779.93481   0.0000
b1 = 0, b2 = 0   578.1875392      2     289.0937696   14.528011   0.0000
Residual         7720.835649      388   19.89906095
Total            23818.99347      391
Testing the General Linear Hypothesis
Another Example
In the following example weight gain was measured along with the amount of protein in the diet from the following sources:
– Beef,
– Pork, and
– two types of cereals.
Dependent Variable
Y = Weight Gain
Independent Variables
X1 = the amount of protein in the diet due to the Beef
source,
X2 = the amount of protein in the diet due to the Pork
source,
X3 = the amount of protein in the diet due to the
Cereal 1 source
X4 = the amount of protein in the diet due to the
Cereal 2 source.
The Multiple Linear model
Y = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4 + e
or
Weight Gain = b0 + b1 (Beef) + b2 (Pork) + b3 (Cereal 1) + b4 (Cereal 2) + e
The weight gains are given in the table below:

case   Beef   Pork   Cereal 1   Cereal 2   Weight Gain
1      3.48   8.95   9.26       4.72       43.05
2      1.77   4.93   2.77       0.45       34.29
3      6.39   3.01   4.92       1.79       31.79
4      9.97   0.67   8.56       8.42       41.94
5      7.41   4.19   8.41       4.43       45.29
6      3.58   4.1    2.05       1.1        32.02
7      1.2    2.64   6.03       5.55       26.93
8      6.8    0.97   4.8        5.98       36.45
9      2.3    9.95   0.89       6.74       31.52
10     6.47   0.6    9.17       7.27       39.67
11     5.08   4.98   8.65       3.24       37.72
12     0.62   2.24   7.79       0.08       29.01
13     6.47   2.19   2.5        3.08       31.15
14     7.35   0.18   0.67       7.87       31.89
The summary statistics of the regression computation are given below:

Regression Statistics
Multiple R          0.89243188
R Square            0.79643465
Adjusted R Square   0.70596117
Standard Error      3.03382552
Observations        14
The estimates of the regression coefficients and their standard errors are given below:

            Coefficients   Standard Error   t Stat       P-value      Lower 95%    Upper 95%
Intercept   19.4614989     2.9165956        6.67267651   9.1339E-05   12.8636963   26.0593016
X1          1.47769633     0.41288474       3.57895604   0.00594044   0.54368545   2.41170721
X2          0.97584224     0.32968801       2.95989606   0.01596221   0.23003558   1.7216489
X3          0.94351642     0.26479013       3.56326123   0.00608811   0.34451907   1.54251378
X4          -0.0344526     0.36188355       -0.0952035   0.92623923   -0.8530907   0.78418551
ANOVA

Source       df   SS           MS           F           Significance F
Regression   4    324.093267   81.0233168   8.8029618   0.00355147
Residual     9    82.8368756   9.20409729
Total        13   406.930143
Note that bi is the rate of increase in weight gain per unit increase in protein from the given source of protein.
One would of course be interested in whether weight gain increased with protein for any of the sources of protein, that is, in testing the null hypothesis
H0: b1 = 0, b2 = 0, b3 = 0 and b4 = 0
against the alternative hypothesis
HA: at least one bi ≠ 0.
This can be achieved by using the ANOVA table below:
Source       df   SS           MS           F           Significance F
Regression   4    324.093267   81.0233168   8.8029618   0.00355147
Residual     9    82.8368756   9.20409729
Total        13   406.930143
Test statistic: the F ratio. Significance: the p-value.
• The F distribution describes the behaviour of the F statistic when H0 is true.
• If the associated p-value is small, H0 should be rejected in favour of HA.
• The usual cut-off values are a = .05 or a = .01.
However one would also be interested in making more specific comparisons, namely comparing the effects on weight gain of
– the two meat sources, and
– the two cereal sources.
In this case we would be interested in testing the null hypothesis
H0: b1 = b2, b3 = b4 against
the alternative hypothesis
HA: b1 ≠ b2 or b3 ≠ b4.
Then assuming
H0: b1 = b2 , b3 = b4
the reduced model becomes
Y = b0 + b1 (X1 + X2) + b3 (X3 + X4) + e
Dependent Variable: Y
Independent Variables: (X1 + X2) and
(X3 + X4)
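A sketch (assuming hypothetical numpy arrays beef, pork, cereal1, cereal2 and gain holding the data above) of the two design matrices; combined with the glh_f_test sketch given earlier, it reproduces the F = 2.605 (p = 0.128) shown in the test table further below.

```python
import numpy as np

def design_matrices(beef, pork, cereal1, cereal2):
    """Design matrices for the complete model and for the reduced model
    implied by H0: b1 = b2, b3 = b4 (the predictors collapse to sums)."""
    beef, pork = np.asarray(beef, float), np.asarray(pork, float)
    cereal1, cereal2 = np.asarray(cereal1, float), np.asarray(cereal2, float)
    ones = np.ones(len(beef))
    X_full = np.column_stack([ones, beef, pork, cereal1, cereal2])
    X_reduced = np.column_stack([ones, beef + pork, cereal1 + cereal2])
    return X_full, X_reduced

# Usage with the earlier sketch: f, p = glh_f_test(X_full, X_reduced, gain, q=2)
```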
The ANOVA table for the reduced model:

Source       df   SS           MS           F            Significance F
Regression   2    276.132469   138.066235   11.6112813   0.0019451
Residual     11   130.797674   11.8906976
Total        13   406.930143
The ANOVA table for the complete model:

Source       df   SS           MS           F           Significance F
Regression   4    324.093267   81.0233168   8.8029618   0.00355147
Residual     9    82.8368756   9.20409729
Total        13   406.930143
The ANOVA table for carrying out the test:

Source                     df   SS           MS           F            Significance F
b1 + b2 = 0, b3 + b4 = 0   2    276.132469   138.066235   15.0005188   0.00136222
b1 = b2, b3 = b4           2    47.9607982   23.9803991   2.60540478   0.12802848
Residual                   9    82.8368756   9.20409729
Total                      13   406.930143

Since the F statistic for departure from H0 is 2.605 with p = 0.128 > .05, H0: b1 = b2, b3 = b4 is not rejected: the two meat sources do not differ significantly in their effect on weight gain, nor do the two cereal sources.
DUMMY VARIABLES