Statistical Data Analysis: Primer

Stat 470-5
• Today: General Linear Model
• Assignment 1:
General Linear Model
• ANOVA model can be viewed as a special case of the general linear
model or regression model
• Suppose we have a response, y, which is thought to be related to p
predictors (sometimes called explanatory variables or regressors)
• Predictors: x1, x2,…,xp
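• Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$, where $\varepsilon$ is a random error term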
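A linear model with an intercept and p predictors is usually fitted by ordinary least squares. The snippet below is a minimal sketch of that fit using NumPy. The synthetic data, the coefficient values, and the names `X`, `true_beta`, and `beta_hat` are assumptions made for illustration and do not come from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (assumed for illustration): n observations, p = 2 predictors
n, p = 50, 2
x = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -0.5])   # intercept, beta_1, beta_2
X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual mean square: estimate of the error variance on n - p - 1 df
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)
print(beta_hat, sigma2_hat)
```

With real data, the columns of `x` would hold the measured predictors. A one-way ANOVA model fits the same framework when the predictors are indicator variables for the treatment groups.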
Example: Rainfall (Exercise 2.16)
• In winter, a plastic rain gauge cannot be used to collect precipitation because it
will freeze and crack. Instead, metal cans are used to collect snowfall and the
snow is allowed to melt indoors. The water is then poured into a plastic rain
gauge and a measurement recorded. An estimate of snowfall is obtained by
multiplying this measurement by 0.44.
• One observer questions this and decides to collect data to test the validity of this
approach
• For each rainfall during one summer, she takes two measurements: (i) rainfall
using a plastic rain gauge, (ii) rainfall using a metal can
• What is the current model being used?
Example: Rainfall (Exercise 2.16)
[Figure: Scatter Plot of Rainfall Data. x-axis: Rain Collected in Metal Can (x)]
Example: Rainfall (Exercise 2.16)
• Seems to be a linear relationship
• Will use regression to establish linear relationship between x and y
• What should the slope be?
Example: Rainfall (Exercise 2.16)
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .995 (a)   .991       .990                .06668
a. Predictors: (Constant), X
b. Dependent Variable: Y
ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F          Sig.
Regression   25.860           1    25.860        5816.213   .000 (a)
Residual     .245             55   .004
Total        26.105           56
a. Predictors: (Constant), X
b. Dependent Variable: Y
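The entries of the regression output are linked by simple arithmetic. The sketch below recomputes R Square, the standard error of the estimate, and F from the printed sums of squares. Because the printed values are rounded (the residual sum of squares is shown to three decimals only), the results agree with the SPSS output only approximately. The variable names are illustrative.

```python
import math

# Values copied from the regression ANOVA table (rounded as printed)
ss_regression, df_regression = 25.860, 1
ss_residual, df_residual = 0.245, 55
ss_total = ss_regression + ss_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual   # residual mean square
f_stat = ms_regression / ms_residual      # F = MS(regression) / MS(residual)
r_squared = ss_regression / ss_total      # proportion of variation explained
std_error = math.sqrt(ms_residual)        # "Std. Error of the Estimate"

print(round(r_squared, 3), round(std_error, 4), round(f_stat, 1))
```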
Coefficients (a)
             Unstandardized Coefficients   Standardized Coefficients
Model 1      B           Std. Error        Beta                        t        Sig.
(Constant)   3.579E-02   .012                                          2.931    .005
X            .444        .006              .995                        76.264   .000
a. Dependent Variable: Y
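The t value printed for X tests whether the slope is zero, which is not the question of interest here. A more relevant check is whether the fitted slope is consistent with the conversion factor 0.44. The sketch below computes that t statistic from the printed coefficient and standard error. It is an illustration based on rounded table values, not part of the original analysis.

```python
# Values copied from the Coefficients table (rounded as printed)
slope, se_slope = 0.444, 0.006
intercept, se_intercept = 3.579e-02, 0.012
df_residual = 55  # residual degrees of freedom from the ANOVA table

claimed_slope = 0.44  # conversion factor used in the current method

# t statistic for H0: slope = 0.44 (not the default H0: slope = 0)
t_slope = (slope - claimed_slope) / se_slope

# t statistic for H0: intercept = 0, as reported in the table
t_intercept = intercept / se_intercept

print(round(t_slope, 2), round(t_intercept, 2))
```

The fitted slope lies less than one standard error from 0.44, so these data give no evidence against the conversion factor itself. The intercept, however, has a reported Sig. of .005, which suggests a small nonzero offset that a model of the form y = 0.44x does not include.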
Example: Rainfall (Exercise 2.16)
[Figure: Normal Q-Q Plot of Residuals. x-axis: Observed Value]
Example: Rainfall (Exercise 2.16)
[Figure: Plot of Residuals vs X. y-axis: Residuals; x-axis: X]
Example: Rainfall (Exercise 2.16)
[Figure: Residuals vs Predicted. y-axis: Residuals; x-axis: Predicted Value]
Comments
• General linear model may have many predictors
• Is suitable for many situations
• Easily done in all stats packages
Designs So Far…
• Have considered 1-factor designs:
– Paired comparisons (paired t-test)
– Completely randomized design (ANOVA)
• Frequently have more than one factor
• We will learn to design and analyze such experiments
Example: Penicillin Experiment
• Objective: Compare four processes for making penicillin
• The raw material used in the process is thought to vary substantially
from batch to batch
• Experiment Design:
– Use five separately produced batches of raw material
– Divide each batch into four sub-batches
– Randomly assign each process to one sub-batch.
– Randomize the production order within each batch
– Measure the yield (%)
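The randomization steps in this design can be sketched in a few lines of Python. The function name `randomize_rcb` and the seed are hypothetical. As a simplification, a single shuffle per batch serves both as the assignment of processes to sub-batches and as the production order.

```python
import random

processes = ["A", "B", "C", "D"]
batches = ["B1", "B2", "B3", "B4", "B5"]

def randomize_rcb(blocks, treatments, seed=None):
    """Return {block: treatments in random order} for a randomized block design."""
    rng = random.Random(seed)
    plan = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # independent shuffle within each block
        plan[block] = order
    return plan

plan = randomize_rcb(batches, processes, seed=470)
for batch, order in plan.items():
    # Position i in the list: sub-batch i of this batch, run in this order
    print(batch, order)
```

Every process appears exactly once in every batch, and only the order within a batch is random.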
Blocking
• Paired comparisons (Section 2.1) is a special case of a Randomized Complete
Block (RCB) design
• More generally:
– Have k treatments
– Have b blocks
– Each of the k treatments is applied (in random order) to each block
Blocking
• Units within a block are more homogeneous than units between blocks
• Can remove variability due to blocks (e.g., boy to boy variability) from the
comparison of treatments
Model
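$y_{ij} = \mu + \alpha_i + \tau_j + \varepsilon_{ij}$
• $\alpha_i$ is the effect of block $i$, $i = 1, 2, \ldots, b$
• $\tau_j$ is the effect of treatment $j$, $j = 1, 2, \ldots, k$
• $\varepsilon_{ij} \sim N(0, \sigma^2)$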
ANOVA Table
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Squares   F
Block                 b-1
Treatment             k-1
Residual              (b-1)(k-1)
Total                 bk-1
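For reference, the standard sum-of-squares decomposition behind this table can be written as follows. Here $y_{ij}$ is the observation for block $i$ and treatment $j$, $\bar{y}_{i\cdot}$ is the mean of block $i$, $\bar{y}_{\cdot j}$ is the mean of treatment $j$, and $\bar{y}_{\cdot\cdot}$ is the grand mean. The notation is an assumption, since the formulas on the slide did not survive.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{b}\sum_{j=1}^{k} (y_{ij}-\bar{y}_{\cdot\cdot})^2}_{SS_{\text{Total}}}
&= \underbrace{k\sum_{i=1}^{b} (\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2}_{SS_{\text{Block}}}
 + \underbrace{b\sum_{j=1}^{k} (\bar{y}_{\cdot j}-\bar{y}_{\cdot\cdot})^2}_{SS_{\text{Treatment}}}
 + \underbrace{\sum_{i=1}^{b}\sum_{j=1}^{k}
   (y_{ij}-\bar{y}_{i\cdot}-\bar{y}_{\cdot j}+\bar{y}_{\cdot\cdot})^2}_{SS_{\text{Residual}}} \\[4pt]
F_{\text{Treatment}}
&= \frac{SS_{\text{Treatment}}/(k-1)}{SS_{\text{Residual}}/\{(b-1)(k-1)\}}
\end{aligned}
```

Each mean square is the sum of squares divided by its degrees of freedom, and the F ratio for blocks is formed in the same way with $SS_{\text{Block}}/(b-1)$ in the numerator.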
Hypothesis Tests
Multiple Comparisons
Example: Penicillin Experiment
• Objective: Compare four processes for making penicillin
• The raw material used in the process is thought to vary substantially from
batch to batch
• Experiment Design:
– Use five separately produced batches of raw material
– Divide each batch into four sub-batches
– Randomly assign each process to one sub-batch.
– Randomize the production order within each batch
– Measure the yield (%)
• This is an RCB design with b = 5 blocks (batches) and k = 4 treatments (processes)
Data: Penicillin Example
Penicillin Experiment: yield (%) by batch (block) and process (treatment)
             Process
Batch        A    B    C    D    Batch ave.
B1           89   88   97   94   92
B2           84   77   92   79   83
B3           81   87   87   85   85
B4           87   92   89   84   88
B5           79   81   80   88   82
Proc. ave.   84   85   89   86   86
Data: Penicillin Experiment
[Figure: Yield versus Process (grouped by blocks). y-axis: Yield (%); x-axis: Process (A, B, C, D); one series per batch, B1 to B5]
Observations:
• Some consistent differences among batches: generally, B1 high, B5
low
• No apparent consistent differences among processes
ANOVA – Randomized Block Design
ANOVA - Penicillin Experiment
Source of Variation   SS      df   MS      F      P-value   F crit
Processes             70.0    3    23.33   1.24   0.34      3.490
Batches               264.0   4    66.00   3.50   0.04      3.259
Error                 226.0   12   18.83
Total                 560.0   19
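The sums of squares, mean squares, and F ratios in this table can be reproduced directly from the penicillin yields. The sketch below does so in plain Python. The variable names are illustrative, and the P-values are not recomputed because that would need an F distribution routine.

```python
# Yields (%) from the penicillin data: rows = batches B1..B5, columns = processes A..D
yields = [
    [89, 88, 97, 94],
    [84, 77, 92, 79],
    [81, 87, 87, 85],
    [87, 92, 89, 84],
    [79, 81, 80, 88],
]
b, k = len(yields), len(yields[0])  # b = 5 blocks, k = 4 treatments

grand_mean = sum(map(sum, yields)) / (b * k)
block_means = [sum(row) / k for row in yields]
treat_means = [sum(row[j] for row in yields) / b for j in range(k)]

ss_total = sum((y - grand_mean) ** 2 for row in yields for y in row)
ss_blocks = k * sum((m - grand_mean) ** 2 for m in block_means)
ss_treat = b * sum((m - grand_mean) ** 2 for m in treat_means)
ss_error = ss_total - ss_blocks - ss_treat

df_blocks, df_treat = b - 1, k - 1
df_error = df_blocks * df_treat

ms_error = ss_error / df_error
f_treat = (ss_treat / df_treat) / ms_error
f_blocks = (ss_blocks / df_blocks) / ms_error

print(ss_treat, ss_blocks, ss_error, ss_total)  # 70.0 264.0 226.0 560.0
print(round(f_treat, 2), round(f_blocks, 2))    # 1.24 3.5
```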
Conclusions
• F-value for Processes is not significant at $\alpha = 0.05$
• F-value for Batches (P = .04) is significant at $\alpha = 0.05$ … indicates
some differences among batches of raw material
• We suspected batch differences; that’s why the design was done this
way. This result is neither surprising nor of particular interest in this case.
• Which would you use?
Diagnostic Checking
• Residual plots -- penicillin experiment
– To check Normality assumption:
• plot all residuals: dot chart, histogram, Normal prob. plot
– To check assumption of equal variances:
• dot plot of residuals by Treatment
• dot plot of residuals by Block
– Other possible checks:
• plot residuals vs. testing order
• plot residuals vs. other potential sources of variability
– e.g., vs. technician, or machine, etc.
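The residuals used in these checks come from the additive block plus treatment fit. The sketch below computes them for the penicillin yields and groups them by treatment and by block, ready for dot plots. The plotting itself is omitted, and the variable names are illustrative.

```python
# Yields (%) from the penicillin data: rows = batches B1..B5, columns = processes A..D
processes = ["A", "B", "C", "D"]
batches = ["B1", "B2", "B3", "B4", "B5"]
yields = [
    [89, 88, 97, 94],
    [84, 77, 92, 79],
    [81, 87, 87, 85],
    [87, 92, 89, 84],
    [79, 81, 80, 88],
]
b, k = len(batches), len(processes)

grand_mean = sum(map(sum, yields)) / (b * k)
block_means = [sum(row) / k for row in yields]
treat_means = [sum(row[j] for row in yields) / b for j in range(k)]

# Residual = observation - block mean - treatment mean + grand mean
residuals = [
    [yields[i][j] - block_means[i] - treat_means[j] + grand_mean for j in range(k)]
    for i in range(b)
]

# Group the residuals the way the dot plots would show them
by_treatment = {processes[j]: [residuals[i][j] for i in range(b)] for j in range(k)}
by_block = {batches[i]: residuals[i] for i in range(b)}

for name, res in by_treatment.items():
    print(name, res)
```

The residuals sum to zero within every block and every treatment, so the checks look at their spread and shape rather than their level.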
Randomized Block Design -- Summary
• Objective:
– Compare several treatments for a factor
– Eliminate a source of variability from the comparison of treatments
– Broaden conclusions
• Experimental Method:
– Create b blocks, each with k experimental units
– In each block, randomly assign each treatment to one experimental unit
• Analysis:
– ANOVA: Blocks, Treatments, Error are sources of variation
Why Bother?
• Can remove variability due to blocks (e.g., boy to boy variability) from
the comparison of treatments
• Removing a source of variability often increases power to detect
treatment differences
• Make comparisons on more homogeneous units
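The effect of blocking on the treatment comparison can be illustrated with the sums of squares from the penicillin ANOVA table. The sketch below compares the F ratio for processes with batches as blocks against the F ratio obtained if the batch variation were pooled into the error term, as a one-way analysis of the same numbers would do. This is an illustration only: the experiment was run as a blocked design, so the unblocked calculation is hypothetical.

```python
# Sums of squares and degrees of freedom from the penicillin ANOVA table
ss_process, df_process = 70.0, 3
ss_batch, df_batch = 264.0, 4
ss_error, df_error = 226.0, 12

ms_process = ss_process / df_process

# With blocking: batch variation is removed from the error term
f_blocked = ms_process / (ss_error / df_error)

# Without blocking (hypothetical): batch variation is pooled into the error term
f_unblocked = ms_process / ((ss_error + ss_batch) / (df_error + df_batch))

print(round(f_blocked, 2), round(f_unblocked, 2))  # 1.24 0.76
```

Pooling the batch variation raises the error mean square from about 18.8 to about 30.6, which shrinks the F ratio for processes.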
Examples of Blocking Variables
• Blocks are units that can be sub-divided into sub-units
– Time:
– Space:
– People:
– Batches:
