A Broad Overview of Key Statistical Concepts


Hypothesis tests for slopes in
multiple linear regression model
Using the general linear test and
sequential sums of squares
An example
Study on heart attacks in rabbits
• An experiment in 32 anesthetized rabbits
subjected to an infarction (“heart attack”)
• Three experimental groups:
– Hearts cooled to 6º C within 5 minutes of
occluded artery (“early cooling”)
– Hearts cooled to 6º C within 25 minutes of
occluded artery (“late cooling”)
– Hearts not cooled at all (“no cooling”)
Study on heart attacks in rabbits
• Measurements made at end of experiment:
– Size of the infarct area (in grams)
– Size of region at risk for infarction (in grams)
• Primary research question:
– Does the mean size of the infarcted area differ
among the three treatment groups – no cooling,
early cooling, late cooling – when controlling
for the size of the region at risk for infarction?
A potential regression model
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i
where …
• y_i is the size of the infarcted area (in grams) of rabbit i
• x_i1 is the size of the region at risk (in grams) of rabbit i
• x_i2 = 1 if early cooling of rabbit i, 0 if not
• x_i3 = 1 if late cooling of rabbit i, 0 if not
and … the independent error terms ε_i follow a normal distribution with mean 0 and equal variance σ².
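As an aside not in the original lecture, here is a minimal sketch of how a model with indicator (dummy) variables like this one might be fit in Python with statsmodels. The data values and column names below are hypothetical, not the study's actual data:

```python
# A minimal sketch (not the study's actual analysis) of fitting the rabbit
# model with indicator variables, assuming a hypothetical data frame with
# columns InfSize, AreaSize, and Group ("early", "late", "control").
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "InfSize":  [0.12, 0.45, 0.30, 0.52, 0.18, 0.70],   # hypothetical values
    "AreaSize": [0.34, 0.90, 0.55, 1.10, 0.60, 1.30],
    "Group":    ["early", "late", "control", "early", "late", "control"],
})

# Build the indicator variables x2 and x3 from the treatment group.
df["X2"] = (df["Group"] == "early").astype(int)   # 1 if early cooling
df["X3"] = (df["Group"] == "late").astype(int)    # 1 if late cooling

X = sm.add_constant(df[["AreaSize", "X2", "X3"]])  # adds the intercept column
model = sm.OLS(df["InfSize"], X).fit()
print(model.params)                                # estimates of b0, b1, b2, b3
```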
The estimated regression function
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
[Scatter plot of Size of Infarcted Area (grams) versus Size of Area at Risk (grams), with separate fitted lines for the Early cooling, Late cooling, and Control groups.]
Possible hypothesis tests for slopes
#1. Is the regression model containing all
three predictors useful in predicting the size
of the infarct?
H_0: β_1 = β_2 = β_3 = 0
H_A: at least one β_i ≠ 0
#2. Is the size of the infarct significantly
(linearly) related to the area of the region at
risk?
H_0: β_1 = 0
H_A: β_1 ≠ 0
Possible hypothesis tests for slopes
#3. (Primary research question) Is the size of
the infarct area significantly (linearly) related
to the type of treatment after controlling for the
size of the region at risk for infarction?
H_0: β_2 = β_3 = 0
H_A: at least one β_i ≠ 0
Linear regression’s
general linear test
An aside
Three basic steps
• Define a (larger) full model.
• Define a (smaller) reduced model.
• Use an F statistic to decide whether or not
to reject the smaller reduced model in favor
of the larger full model.
The full model
The full model (or unrestricted model) is the
model thought to be most appropriate for the data.
For simple linear regression, the full model is:
y_i = β_0 + β_1 x_i + ε_i
The full model
[Scatter plot of college entrance test score versus high school GPA, showing the fitted full-model line E(Y) = β_0 + β_1 x with individual responses Y_i = β_0 + β_1 x_i + ε_i.]
The full model
[Scatter plot of grade point average versus height (inches), again showing the fitted full-model line E(Y) = β_0 + β_1 x with individual responses Y_i = β_0 + β_1 x_i + ε_i.]
The reduced model
The reduced model (or restricted model) is the
model described by the null hypothesis H0.
For simple linear regression, the null hypothesis
is H0: β1 = 0. Therefore, the reduced model is:
Y_i = β_0 + ε_i
The reduced model
[Scatter plot of college entrance test score versus high school GPA, with the horizontal reduced-model line E(Y) = β_0 and individual responses Y_i = β_0 + ε_i.]
The reduced model
[Scatter plot of grade point average versus height (inches), with the horizontal reduced-model line E(Y) = β_0 and individual responses Y_i = β_0 + ε_i.]
The general linear test approach
• “Fit the full model” to the data.
– Obtain least squares estimates of β0 and β1.
– Determine error sum of squares – “SSE(F).”
• “Fit the reduced model” to the data.
– Obtain least squares estimate of β0.
– Determine error sum of squares – “SSE(R).”
The general linear test approach
SSE(F) = Σ(y_i − ŷ_i)² = 7.5028
SSE(R) = Σ(y_i − ȳ)² = 7.5035
[Scatter plot of grade point average versus height (inches), with the fitted full-model line ŷ_F = 2.95 + 0.001x and the reduced-model line ŷ_R = ȳ = 3.015.]
The general linear test approach
SSE(F) = Σ(y_i − ŷ_i)² = 17173
SSE(R) = Σ(y_i − ȳ)² = 53637
[Scatter plot of mortality versus latitude (at center of state), with the fitted full-model line ŷ_F = 389 − 5.98x and the reduced-model line ŷ_R = ȳ = 152.88.]
The general linear test approach
• Compare SSE(R) and SSE(F).
• SSE(R) is always larger than (or same as) SSE(F).
– If SSE(F) is close to SSE(R), then variation around
fitted full model regression function is almost as large
as variation around fitted reduced model regression
function.
– If SSE(F) and SSE(R) differ greatly, then the additional
parameter(s) in the full model substantially reduce the
variation around the fitted regression function.
How close is close?
The test statistic is a function of SSE(R) − SSE(F):

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ]

The degrees of freedom (df_R and df_F) are those associated with the reduced and full model error sums of squares, respectively.
Reject H_0 if F* is large (or if the P-value is small).
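As an illustration (not part of the original lecture), the general linear test is only a few lines of Python; scipy's F distribution supplies the P-value. The numbers plugged in below come from the alcoholism example that appears later in this transcript, where n = 50, so df_R = 49 and df_F = 48:

```python
# A minimal sketch of the general linear test, assuming SSE(R), SSE(F),
# df_R, and df_F have already been obtained from the two fitted models.
from scipy import stats

def general_linear_test(sse_r, sse_f, df_r, df_f):
    """Return the F* statistic and P-value for comparing reduced vs. full model."""
    f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    p_value = stats.f.sf(f_star, df_r - df_f, df_f)  # upper-tail probability
    return f_star, p_value

# Alcoholism example (later in this transcript): F* is about 33.59, P near 0.
print(general_linear_test(sse_r=1224.32, sse_f=720.27, df_r=49, df_f=48))
```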
But for simple linear regression,
it's just the same F test as before

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ]

with df_R = n − 1, df_F = n − 2, SSE(R) = SSTO, and SSE(F) = SSE, so that

F* = [ (SSTO − SSE) / ((n − 1) − (n − 2)) ] / [ SSE / (n − 2) ] = MSR / MSE
The formal F-test
for slope parameter β_1
Null hypothesis        H_0: β_1 = 0
Alternative hypothesis H_A: β_1 ≠ 0
Test statistic         F* = MSR / MSE
P-value = What is the probability that we'd get an F* statistic as large as we did, if the null hypothesis is true?
The P-value is determined by comparing F* to an F distribution with 1 numerator degree of freedom and n − 2 denominator degrees of freedom.
Example:
Alcoholism and muscle strength?
• Report on strength tests for a sample of 50
alcoholic men
– x = total lifetime dose of alcohol (kg per kg of
body weight)
– y = strength of deltoid muscle in man’s nondominant arm
Fit the reduced model
[Scatter plot of strength versus alcohol (total lifetime dose), with the horizontal reduced-model line ŷ_R = ȳ = 20.164.]
SSE(R) = Σ_{i=1}^{n} (Y_i − Ȳ)² = 1224.32
Fit the full model
[Scatter plot of strength versus alcohol, with the fitted full-model line ŷ_F = 26.37 − 0.3x.]
SSE(F) = Σ_{i=1}^{n} (Y_i − Ŷ_i)² = 720.27
The ANOVA table
Analysis of Variance

Source          DF        SS        MS        F       P
Regression       1    504.04   504.040   33.5899   0.000
Error           48    720.27    15.006
Total           49   1224.32

Here SSE(R) = SSTO (the Total SS) and SSE(F) = SSE (the Error SS).
There is a statistically significant linear association between
alcoholism and arm strength.
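As a quick illustrative check (not part of the original lecture), the pieces of this ANOVA table fit together as follows, with scipy supplying the P-value:

```python
# SSR = SSTO - SSE, F* = MSR/MSE, and the P-value comes from an F
# distribution with 1 and n - 2 = 48 degrees of freedom.
from scipy import stats

n = 50
sse_r = 1224.32   # SSTO (reduced model)
sse_f = 720.27    # SSE  (full model)

ssr = sse_r - sse_f              # 504.05: regression sum of squares
msr = ssr / 1
mse = sse_f / (n - 2)            # about 15.006
f_star = msr / mse               # about 33.59
p_value = stats.f.sf(f_star, 1, n - 2)
print(f_star, p_value)           # P-value rounds to 0.000
```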
Sequential (or extra)
sums of squares
Another aside
What is a
sequential sum of squares?
• It can be viewed in either of two ways:
– It is the reduction in the error sum of squares
(SSE) when one or more predictor variables are
added to the model.
– Or, it is the increase in the regression sum of
squares (SSR) when one or more predictor
variables are added to the model.
Notation
• The error sum of squares (SSE) and
regression sum of squares (SSR) depend on
what predictors are in the model.
• So, note what variables are in the model.
– SSE(X1) denotes the error sum of squares when
X1 is the only predictor in the model
– SSR(X1, X2) denotes the regression sum of
squares when X1 and X2 are both in the model
Notation
• The sequential sum of squares of adding:
– X2 to the model in which X1 is the only
predictor is denoted SSR(X2 | X1)
– X1 to the model in which X2 is the only
predictor is denoted SSR(X1 | X2)
– X1 to the model in which X2 and X3 are
predictors is denoted SSR(X1 | X2, X3)
– X1 and X2 to the model in which X3 is the only
predictor is denoted SSR(X1, X2 | X3)
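A minimal Python sketch of this idea, using made-up data rather than any dataset from this lecture, computes a sequential sum of squares such as SSR(X2 | X1) by fitting the two nested models and differencing their error sums of squares:

```python
# Illustrative only: compute SSR(X2 | X1) = SSE(X1) - SSE(X1, X2)
# from two nested least squares fits on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)

def sse(y, X):
    """Error sum of squares from an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
sse_x1 = sse(y, np.column_stack([ones, x1]))          # SSE(X1)
sse_x1_x2 = sse(y, np.column_stack([ones, x1, x2]))   # SSE(X1, X2)

ssr_x2_given_x1 = sse_x1 - sse_x1_x2                  # SSR(X2 | X1)
print(ssr_x2_given_x1)
```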
Allen Cognitive Level (ACL) Study
• David and Riley (1990) investigated the relationship of the ACL test to the level of psychopathology in a set of 69 patients in a hospital psychiatry unit:
– Response y = ACL score
– x1 = vocabulary (Vocab) score on Shipley Institute of
Living Scale
– x2 = abstraction (Abstract) score on Shipley Institute of
Living Scale
– x3 = score on Symbol-Digit Modalities Test (SDMT)
Regress y = ACL on x1 = Vocab
The regression equation is
ACL = 4.23 + 0.0298 Vocab
...
Analysis of Variance

Source          DF        SS       MS      F      P
Regression       1    2.6906   2.6906   4.47  0.038
Residual Error  67   40.3590   0.6024
Total           68   43.0496

SSR(X1) = 2.6906, SSE(X1) = 40.3590, SSTO(X1) = 43.0496
Regress y = ACL on
x1 = Vocab and x3 = SDMT
The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance

Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source   DF   Seq SS
Vocab     1   2.6906
SDMT      1   9.0872

SSR(X1, X3) = 11.7778, SSE(X1, X3) = 31.2717, SSTO(X1, X3) = 43.0496
The sequential sum of squares
SSR(X3 | X1)
SSR(X3 | X1) is the reduction in the error
sum of squares when X3 is added to the
model in which X1 is the only predictor:
SSR X 3 | X1   SSE ( X1 )  SSE ( X1 , X 3 )
SSR X 3 | X1   40.3590  31.2717  9.0873
The sequential sum of squares
SSR(X3 | X1)
SSR(X3 | X1) is the increase in the regression
sum of squares when X3 is added to the model
in which X1 is the only predictor:
SSR X 3 | X1   SSR( X1 , X 3 )  SSR( X1 )
SSR X 3 | X1   11.7778  2.6906  9.0872
The sequential sum of squares
SSR(X3 | X1)
The regression equation is
ACL = 3.85 - 0.0068 Vocab + 0.0298 SDMT
...
Analysis of Variance

Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source   DF   Seq SS
Vocab     1   2.6906
SDMT      1   9.0872

SSR(X1) = 2.6906 and SSR(X3 | X1) = 9.0872
(The order in which the predictors are added determines the “Seq SS” you get.)
Regress y = ACL on x3 = SDMT
The regression equation is
ACL = 3.75 + 0.0281 SDMT
...
Analysis of Variance

Source          DF       SS       MS      F      P
Regression       1   11.680   11.680  24.95  0.000
Residual Error  67   31.370    0.468
Total           68   43.050

SSR(X3) = 11.680, SSE(X3) = 31.370, SSTO(X3) = 43.050
(The order in which the predictors are added determines the “Seq SS” you get.)
Regress y = ACL on
x3 = SDMT and x1 = Vocab
The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance

Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source   DF    Seq SS
SDMT      1   11.6799
Vocab     1    0.0979

SSR(X1, X3) = 11.7778, SSE(X1, X3) = 31.2717, SSTO(X1, X3) = 43.0496
The sequential sum of squares
SSR(X1 | X3)
SSR(X1 | X3) is the reduction in the error
sum of squares when X1 is added to the
model in which X3 is the only predictor:
SSR X1 | X 3   SSE ( X 3 )  SSE ( X1 , X 3 )
SSR X1 | X 3   31.370  31.2717  0.0983
The sequential sum of squares
SSR(X1 | X3)
SSR(X1 | X3) is the increase in the regression
sum of squares when X1 is added to the model
in which X3 is the only predictor:
SSR X1 | X 3   SSR( X1 , X 3 )  SSR( X 3 )
SSR X1 | X 3   11.7778  11.680  0.0978
(Order in which predictors are added determine the “Seq SS” you get.)
Regress y = ACL on
x3 = SDMT and x1 = Vocab
The regression equation is
ACL = 3.85 + 0.0298 SDMT - 0.0068 Vocab
...
Analysis of Variance

Source          DF        SS       MS       F      P
Regression       2   11.7778   5.8889   12.43  0.000
Residual Error  66   31.2717   0.4738
Total           68   43.0496

Source   DF    Seq SS
SDMT      1   11.6799
Vocab     1    0.0979

SSR(X3) = 11.6799 and SSR(X1 | X3) = 0.0979
More sequential sums of squares
(Regress y on x3, x1, x2)
The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   12.3009   4.1003   8.67  0.000
Residual Error  65   30.7487   0.4731
Total           68   43.0496

Source     DF    Seq SS
SDMT        1   11.6799
Vocab       1    0.0979
Abstract    1    0.5230

SSR(X3) = 11.6799, SSR(X1 | X3) = 0.0979, SSR(X2 | X1, X3) = 0.5230
Two- (or three- or more-) degree of
freedom sequential sums of squares
The regression equation is
ACL = 3.95 + 0.0274 SDMT - 0.0174 Vocab + 0.0122 Abstract
...
Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   12.3009   4.1003   8.67  0.000
Residual Error  65   30.7487   0.4731
Total           68   43.0496

Source     DF    Seq SS
SDMT        1   11.6799
Vocab       1    0.0979
Abstract    1    0.5230

SSR(X1 | X3) = 0.0979 and SSR(X2 | X1, X3) = 0.5230, so
SSR(X1, X2 | X3) = 0.0979 + 0.5230 = 0.6209

Equivalently, SSR(X1, X2 | X3) = SSE(X3) − SSE(X1, X2, X3)
= 31.370 − 30.7487 = 0.6213 (the small difference is due to rounding)
The hypothesis tests
for the slopes
Possible hypothesis tests for slopes
#1. Is the regression model containing all
three predictors useful in predicting the size
of the infarct?
H_0: β_1 = β_2 = β_3 = 0
H_A: at least one β_i ≠ 0
#2. Is the size of the infarct significantly
(linearly) related to the area of the region at
risk?
H_0: β_1 = 0
H_A: β_1 ≠ 0
Possible hypothesis tests for slopes
#3. (Primary research question) Is the size of
the infarct area significantly (linearly) related
to the type of treatment upon controlling for
the size of the region at risk for infarction?
H_0: β_2 = β_3 = 0
H_A: at least one β_i ≠ 0
Testing all slope parameters are 0
Full model:
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i
SSE(F) = SSE, with df_F = n − 4
Reduced model:
Y_i = β_0 + ε_i
SSE(R) = SSTO, with df_R = n − 1
Testing all slope parameters are 0
The general linear test statistic:

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ]

becomes the usual overall F-test:

F* = [ SSR / 3 ] / [ SSE / (n − 4) ] = MSR / MSE
Testing all slope parameters are 0
H_0: β_1 = β_2 = β_3 = 0
H_A: at least one β_i ≠ 0
Use the overall F-test and P-value reported in the ANOVA table.
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3
...
Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   0.95927  0.31976  16.43  0.000
Residual Error  28   0.54491  0.01946
Total           31   1.50418
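As an illustrative check of this table (not part of the lecture), F* = MSR/MSE and its P-value from an F(3, 28) distribution can be reproduced with scipy:

```python
# Overall F-test for the rabbit data: 3 and 28 degrees of freedom (n = 32).
from scipy import stats

msr, mse = 0.31976, 0.01946
f_star = msr / mse                        # about 16.43
print(f_star, stats.f.sf(f_star, 3, 28))  # P-value rounds to 0.000
```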
Testing one slope is 0,
say β1 = 0
Full model:
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i
SSE(F) = SSE(X1, X2, X3), with df_F = n − 4
Reduced model:
y_i = β_0 + β_2 x_i2 + β_3 x_i3 + ε_i
SSE(R) = SSE(X2, X3), with df_R = n − 3
Testing one slope is 0,
say β1 = 0
The general linear test statistic:

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ]

becomes a partial F-test:

F* = [ SSR(X1 | X2, X3) / 1 ] / [ SSE(X1, X2, X3) / (n − 4) ]
   = MSR(X1 | X2, X3) / MSE(X1, X2, X3)
Equivalence of t-test
to partial F-test for one slope
Since there is only one numerator degree of
freedom in the partial F-test for one slope, it
is equivalent to the t-test.
t²_(n−p) = F_(1, n−p)
The t-test is a test for the marginal significance
of the x1 predictor after x2 and x3 have been
taken into account.
The regression equation is
InfSize = - 0.135 - 0.2430 X2 - 0.0657 X3 + 0.613 AreaSize

Predictor     Coef      SE Coef      T      P
Constant    -0.1345     0.1040    -1.29  0.206
X2          -0.24348    0.06229   -3.91  0.001
X3          -0.06566    0.06507   -1.01  0.322
AreaSize     0.6127     0.1070     5.72  0.000

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   0.95927  0.31976  16.43  0.000
Residual Error  28   0.54491  0.01946
Total           31   1.50418

Source      DF    Seq SS
X2           1   0.29994
X3           1   0.02191
AreaSize     1   0.63742
Equivalence of the t-test to the
partial F-test
The t-test: t* = 5.72 with P = 0.000… < 0.001
The partial F-test:

F* = [ SSR(X1 | X2, X3) / 1 ] / [ SSE(X1, X2, X3) / (n − 4) ]
   = 0.63742 / 0.01946 = 32.7554

F distribution with 1 DF in numerator and 28 DF in denominator:
x = 32.7554, P( X <= x ) = 1.0000

and t*² = 5.72² = 32.7184 ≈ F*
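An illustrative scipy check of this equivalence, using the printed values (the small discrepancy between t*² and F* is due to rounding of the reported numbers):

```python
# t-test / partial F-test equivalence for AreaSize: the squared t-statistic
# should match the partial F-statistic, and both give the same P-value.
from scipy import stats

t_star = 5.72
f_star = 0.63742 / 0.01946          # SSR(X1|X2,X3)/1 divided by MSE

print(t_star**2, f_star)            # 32.7184 vs 32.7554
print(2 * stats.t.sf(t_star, 28))   # two-sided t P-value
print(stats.f.sf(f_star, 1, 28))    # partial F P-value (essentially equal)
```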
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3

Predictor     Coef      SE Coef      T      P
Constant    -0.1345     0.1040    -1.29  0.206
AreaSize     0.6127     0.1070     5.72  0.000
X2          -0.24348    0.06229   -3.91  0.001
X3          -0.06566    0.06507   -1.01  0.322

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   0.95927  0.31976  16.43  0.000
Residual Error  28   0.54491  0.01946
Total           31   1.50418

Source      DF    Seq SS
AreaSize     1   0.62492
X2           1   0.31453
X3           1   0.01981
Testing whether two slopes are 0,
say β2 = β3 = 0
Full model:
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ε_i
SSE(F) = SSE(X1, X2, X3), with df_F = n − 4
Reduced model:
y_i = β_0 + β_1 x_i1 + ε_i
SSE(R) = SSE(X1), with df_R = n − 2
Testing whether two slopes are 0,
say β2 = β3 = 0
The general linear test statistic:

F* = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / [ SSE(F) / df_F ]

becomes a partial F-test:

F* = [ SSR(X2, X3 | X1) / 2 ] / [ SSE(X1, X2, X3) / (n − 4) ]
   = MSR(X2, X3 | X1) / MSE(X1, X2, X3)
The regression equation is
InfSize = - 0.135 + 0.613 AreaSize - 0.243 X2 - 0.0657 X3

Predictor     Coef      SE Coef      T      P
Constant    -0.1345     0.1040    -1.29  0.206
AreaSize     0.6127     0.1070     5.72  0.000
X2          -0.24348    0.06229   -3.91  0.001
X3          -0.06566    0.06507   -1.01  0.322

S = 0.1395   R-Sq = 63.8%   R-Sq(adj) = 59.9%

Analysis of Variance

Source          DF        SS       MS      F      P
Regression       3   0.95927  0.31976  16.43  0.000
Residual Error  28   0.54491  0.01946
Total           31   1.50418

Source      DF    Seq SS
AreaSize     1   0.62492
X2           1   0.31453
X3           1   0.01981
Testing whether β2 = β3 = 0

F* = [ SSR(X2, X3 | X1) / 2 ] / [ SSE(X1, X2, X3) / (n − 4) ]
   = [ (0.31453 + 0.01981) / 2 ] / 0.01946 = 8.59

F distribution with 2 DF in numerator and 28 DF in denominator:
x = 8.5900, P( X <= x ) = 0.9988

P = 1 − 0.9988 = 0.0012 < α = 0.05