Stat 512 Class 1 - Purdue University

Download Report

Transcript Stat 512 Class 1 - Purdue University

Topic 27: Strategies of
Analysis
Outline
• Strategy for analysis of two-way
studies
– Interaction is not significant
– Interaction is significant
• What if levels of factor or factors is
quantitative?
• What if cell size is n=1?
An analytical strategy
• Run the model with main effects and
the two-way interaction
• Plot the data, the means and look at
the normal quantile plot
• Check the significance test for the
interaction
AB interaction not sig
• Possibly rerun the analysis without the
interaction. Only consider if dfe is small
• For a main effect that is not significant
– No evidence to conclude that the
levels of this factor are associated
with different means of the response
variable
– Possibly rerun without this factor
giving a one-way anova. Watch out for
Type II errors that inflate MSE
AB interaction not sig
• If both main effects are not significant
– Model could be run as Y=;
– Basically assuming a one population
model
• For a main effect with more than two
levels that is significant, use the means
or lsmeans statement with the tukey
multiple comparison procedure
• Contrasts and linear combinations can
also be examined using the contrast and
estimate statements
AB interaction sig but not
important
• Plots and a careful examination of the
cell means may indicate that the
interaction is not very important even
though it is statistically significant
• Use the marginal means for each
significant main effect to describe the
important results
• You may need to qualify these results
using the interaction
AB interaction sig but not
important
• Use the methods described for the
situation where the interaction is not
significant but keep the interaction in the
model
• Carefully interpret the marginal means
as averages over the levels of the other
factor
AB interaction is sig and
important
• Options include
– Treat as a one-way with ab levels; use
tukey to compare means; contrasts and
estimate can also be useful
– Report that the interaction is significant;
plot the means and describe the pattern
– Analyze the levels of A for each level of
B (use a BY statement) or vice versa
Specification of contrast
and estimate statements
• When using the contrast statement
always double check your results with
the estimate statement
• The order of factors is determined by the
order in the class statement, not the
order in the model statement
• Contrast should come a priori from
research questions
KNNL Example
• KNNL p 833
• Y is the number of cases of bread sold
• A is the height of the shelf display, a=3
levels: bottom, middle, top
• B is the width of the shelf display, b=2:
regular, wide
• n=2 stores for each of the 3x2
treatment combinations
Contrast/Estimate Example
• Suppose we want to compare the
average of the two height = middle
cells with the average of the other
four cells
• With this approach, the contrast
should correspond to a research
question formulated before
examining the data
Contrast/Estimate
• First, formulate question as a contrast
using the cell means model
– (μ21 + μ22)/2 = (μ11 + μ12 + μ31 + μ32)/4
– -.25μ11-.25μ12+.5μ21+.5μ22-.25μ31-.25μ32 =0
• Then translate the contrast into the
factor effects model using
μij = μ + αi + βj + (αβ)ij
Contrast/Estimate
•
•
•
•
•
•
•
-.25μ11 = -.25(μ + α1 + β1 + (αβ)11)
-.25μ12 = -.25(μ + α1 + β2 + (αβ)12)
+.5μ21 = +.5 (μ + α2 + β1 + (αβ)21)
+.5μ22 = +.5 (μ + α2 + β2 + (αβ)22)
-.25μ31 = -.25(μ + α3 + β1 + (αβ)31)
-.25μ32 = -.25(μ + α3 + β2 + (αβ)32)
C = (-.5α1 + α2 - .5α3) +
(-.25(αβ)11-.25(αβ)12+.5(αβ)21+.5(αβ)22.25(αβ)31-.25(αβ)32)
Proc glm
proc glm data=a1;
class height width;
model sales=height width
height*width;
Contrast and estimate
contrast 'middle two vs
height -.5 1 -.5
height*width -.25 -.25
estimate 'middle two vs
height -.5 1 -.5
height*width -.25 -.25
means height*width;
run;
all others'
.5 .5 -.25 -.25;
all others'
.5 .5 -.25 -.25;
Output
Cont middle two vs all others
DF SS
MS
F
Pr > F
1 1536 1536 148.65
<.0001
Par middle two vs all others
Estimate StErr
t Pr > |t|
24 1.96 12.19
<.0001
Check with means
1 1 45
1 2 43
2 1 65
2 2 69
3 1 40
3 2 44
(65+69)/2 – (45+43+40+44)/4
= 67 – 43 = 24
One Quantitative factor
and one categorical
• Plot the means vs the quantitative
factor for each level of the categorical
factor
• Consider linear and quadratic terms for
the quantitative factor (regression)
• Consider different relationships for the
different levels of the categorical factor
• Lack of fit analysis can be useful
Two Quantitative factors
• Plot the means
– vs A for each level B
– vs B for each level A
• Consider linear and quadratic terms
• Consider products to allow for
interaction
• Lack of fit analysis can be useful
One observation per cell
• For Yijk we use
– i to denote the level of the factor A
– j to denote the level of the factor B
– k to denote the kth observation in cell (i,j)
• i = 1, . . . , a levels of factor A
• j = 1, . . . , b levels of factor B
• n = 1 observation in cell (i,j)
Factor effects model
•
•
•
•
•
μij = μ + αi + βj
μ is the overall mean
αi is the main effect of A
βj is the main effect of B
Because we have only one observation
per cell, we do not have enough
information to estimate the interaction
in the usual way…it is confounded with
error
Constraints
• Σi αi= 0
• Σjβj = 0
• SAS GLM Constraints
– αa= 0 (1 constraint)
– βb = 0 (1 constraint)
ANOVA Table
Source df
SS
A
a-1
SSA
B
b-1
SSB
Error (a-1)(b-1) SSE
Total ab-1
SST
MS
F
MSA MSA/MSE
MSB MSB/MSE
MSE
_
MST
Would normally be df for
interaction
Expected Mean Squares
• Based on model which has no
interaction
• E(MSE) = σ2
• E(MSA) = σ2 + b(Σiαi2)/(a-1)
• E(MSB) = σ2 + a(Σjβj2)/(b-1)
• Here, αi and βj, are defined with the
usual factor effects constraints
KNNL Example
• KNNL p 882
• Y is the premium for auto insurance
• A is the size of the city, a=3 levels:
small, medium and large
• B is the region, b=2: East, West
• n=1 the response is the premium
charged by a particular company
The data
data a2;
infile 'c:\...\CH20TA02.txt';
input premium size region;
if size=1 then sizea='1_small ';
if size=2 then sizea='2_medium';
if size=3 then sizea='3_large ';
proc print data=a1; run;
Output
Obs premium size region sizea
1
140
1
1 1_small
2
100
1
2 1_small
3
210
2
1 2_medium
4
180
2
2 2_medium
5
220
3
1 3_large
6
200
3
2 3_large
Proc glm
proc glm data=a2;
class sizea region;
model premium=
sizea region/solution;
means sizea region / tukey;
output out=a3 p=muhat;
run;
Output
Class Level Information
Class Levels
Values
sizea 3 1_small 2_medium 3_large
region 2
1 2
Number of observations
6
Output
Source DF
MS
F
Model
3 3550 71.00
Error
2
50
Total
5
P
0.0139
Output
Source DF
MS
F
P
sizea
2 4650 93.00 0.0106
region 1 1350 27.00 0.0351
Output solution
Parameter
Estimate
Int
195 B
sizea 1_small
-90 B
sizea 2_medium -15 B
sizea 3_large
0 B
region
1
30 B
region
2
0 B
Check vs predicted
values (muhat)
reg
1
2
1
2
1
2
sizea muhat
1_s
135=195-90+30
1_s
105=195-90
2_m
210=195-15+30
2_m
180=195-15
3_l
225=195
+30
3_l
195=195
Multiple comparisons
Size
Mean
N
sizea
A
A
A
210
2
3_large
195
2
2_medium
B
120
2
1_small
Multiple comparisons
Region
Mean
N region
A
190
3
1
B
160
3
2
The anova results told us that
these were different. No need for
multiple comparison adjustment
Plot the data
symbol1 v='E' i=join c=blue;
symbol2 v='W' i=join c=green;
title1 'Plot of the data';
proc gplot data=a3;
plot premium*sizea=
region;
run;
The plot
Possible
interaction?
Plot the estimated model
symbol1 v='E' i=join c=blue;
symbol2 v='W' i=join c=green;
title1 'Plot of the model
estimates';
proc gplot data=a2;
plot muhat*sizea=region;
run;
The plot
Tukey test for additivity
• Can’t look at usual interaction but
can add one additional term to the
model (θ) to look at specific type of
interaction
• μij = μ + αi + βj + θαiβj
• We use one degree of freedom to
estimate θ
• There are other variations, eg θiβj
Step 1: Get muhat
(grand mean)
proc glm data=a2;
model premium=;
output out=aall p=muhat;
proc print data=aall;
run;
Output
Obs premium size region muhat
1
140
1
1
175
2
100
1
2
175
3
210
2
1
175
4
180
2
2
175
5
220
3
1
175
6
200
3
2
175
Step 2: Get muhatA
(factor A means)
proc glm data=a2;
class size;
model premium=size;
output out=aA p=muhatA;
proc print data=aA;
run;
Output
Obs premium size region muhatA
1
140
1
1
120
2
100
1
2
120
3
210
2
1
195
4
180
2
2
195
5
220
3
1
210
6
200
3
2
210
Step 3: Find muhatB
(factor B means)
proc glm data=a2;
class region;
model premium=region;
output out=aB p=muhatB;
proc print data=aB;
run;
Output
Obs
1
2
3
4
5
6
remium size region muhatB
140
1
1
190
100
1
2
160
210
2
1
190
180
2
2
160
220
3
1
190
200
3
2
160
Step 4: Combine and
compute
data a4; merge aall aA aB;
alpha=muhatA-muhat;
beta=muhatB-muhat;
atimesb=alpha*beta;
proc print data=a4;
var size region atimesb;
run;
Output
Obs size region atimesb
1
1
1
-825
2
1
2
825
3
2
1
300
4
2
2
-300
5
3
1
525
6
3
2
-525
Step 5: Run glm
proc glm data=a4;
class size region;
model premium=size region
atimesb/solution;
output out=a5 p=tukmuhat;
run;
Output
Source DF
size
2
region
1
atimesb 1
F Value
360.37
104.62
6.75
Pr > F
0.0372
0.0620
0.2339
Not significant…not enough
evidence of this type of
interaction but really small study
(1 df error)
Output
Par
Est
Int
195
size1
-90
size2
-15
size3
0
region1
30
region2
0
atimesb -0.006
SE
t
B 2.9 66.49
B 3.5 -25.05
B 3.5 -4.18
B
.
.
B 2.9 10.23
B
.
.
0.002 -2.60
P
0.0096
0.0254
0.1496
.
0.0620
.
0.2339
Does this look
more like the
means plot?
Last slide
• Read KNNL Chapters 19-20
• We used program topic27.sas to
generate the output for today