Basics of Statistics - University of Delaware

Repeated Measures


The term repeated measures refers to data sets with multiple
measurements of a response variable on the same experimental
unit or subject.
In repeated measures designs, we are often concerned with
two types of variability:



Between-subjects - Variability associated with different groups of subjects
who are treated differently (equivalent to the between-groups effect in one-way
ANOVA).
Within-subjects - Variability associated with measurements made on an
individual subject.
Examples of repeated measures designs:
A. Two groups of subjects treated with different drugs, for whom responses are measured at
six-hour increments for 24 hours. Here, DRUG treatment is the between-subjects factor and
TIME is the within-subjects factor.
B. Students in three different statistics classes (taught by different instructors) are given a test
with five problems, and scores on each problem are recorded separately. Here CLASS is a
between-subjects factor and PROBLEM is a within-subjects factor.
C. Volunteer subjects from a linguistics class each listen to 12 sentences produced by 3 text-to-speech
synthesis systems and rate the naturalness of each sentence. This is a completely
within-subjects design with factors SENTENCE (1-12) and SYNTHESIZER.
Repeated Measures
 When measures are made over time, as in example
A, we want to assess:
 how the dependent measure changes over time,
independent of treatment (i.e., the main effect of time)
 how treatments differ independent of time (i.e., the main
effect of treatment)
 how treatment effects differ at different times (i.e., the
treatment by time interaction).
 Repeated measures require special treatment
because:
 Observations made on the same subject are not
independent of each other.
 Adjacent observations in time are likely to be more
correlated than non-adjacent observations (see the sketch below).
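
This decaying-correlation pattern can be illustrated with a quick simulation. A minimal R sketch (simulated data, not from the slides) in which each time point depends on the previous one:

set.seed(42)
n  <- 100
t1 <- rnorm(n)
t2 <- 0.8 * t1 + rnorm(n, sd = 0.6)   # one step away from t1
t3 <- 0.8 * t2 + rnorm(n, sd = 0.6)   # two steps away from t1
cor(t1, t2)   # adjacent: around 0.8
cor(t1, t3)   # non-adjacent: noticeably smaller (around 0.64)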
Repeated Measures
 Methods of repeated measures ANOVA
 Univariate - Uses a single outcome measure.
 Multivariate - Uses multiple outcome
measures.
 Mixed Model Analysis - One or more factors
(other than subject) are random effects.
 We will discuss only the univariate approach.
Repeated Measures
A layout of a simple repeated measures design with three treatments on
n subjects:

Subjects   Trt 1   Trt 2   Trt 3
1          y11     y12     y13
2          y21     y22     y23
3          y31     y32     y33
.          .       .       .
.          .       .       .
.          .       .       .
n          yn1     yn2     yn3
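
In R, data laid out in this wide form must be stacked to one row per subject-treatment combination before fitting the ANOVA. A minimal sketch with made-up values (the data frame and column names here are hypothetical):

wide <- data.frame(subject = 1:3,
                   trt1 = c(5.1, 4.8, 5.6),
                   trt2 = c(6.0, 5.5, 6.2),
                   trt3 = c(6.4, 5.9, 6.8))
# Stack the three treatment columns into a single response column y
long <- reshape(wide, direction = "long",
                varying = c("trt1", "trt2", "trt3"),
                v.names = "y", timevar = "trt", idvar = "subject")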
Repeated Measures
 Assumptions:
 Subjects are independent.
 The repeated observations for each subject
follow a multivariate normal distribution.
 The correlations between all pairs of within-subjects
levels are equal. This assumption is
known as sphericity.
Repeated Measures
 Test for sphericity:
 Mauchly's test (see the R sketch after this list)
 Violation of the sphericity assumption leads to inflated F
statistics and hence inflated type I error.
 Three common corrections for violation of sphericity:
 Greenhouse-Geisser correction
 Huynh-Feldt correction
 Lower-bound correction
 All three methods adjust the degrees of freedom
using a correction factor called epsilon.
 Epsilon lies between 1/(k-1) and 1, where k is the number of
levels in the within-subjects factor.
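
Base R can run Mauchly's test on a multivariate linear model fitted to the wide-format response matrix. A minimal sketch with simulated data (the matrix Y and its three levels are hypothetical, not the PLUC data):

set.seed(1)
n <- 20
Y <- matrix(rnorm(n * 3), n, 3)   # 3 within-subjects levels per subject
colnames(Y) <- c("t1", "t2", "t3")
fit <- lm(Y ~ 1)                  # multivariate intercept-only model
mauchly.test(fit, X = ~ 1)        # tests sphericity across the 3 levels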
Repeated Measures - SPSS Demo
 Analyze > General Linear Model > Repeated
Measures
 Within-Subject Factor Name: e.g. time;
Number of Levels: the number of measures of the
factor. E.g., we have two measurements,
PLUC.pre and PLUC.post, so the number of levels is
2 for our example. > Click on Add
 Click on Define and select the Within-Subjects
Variables (time): e.g. PLUC.pre(1) and
PLUC.post(2)
 Select Between-Subjects Factor(s): e.g. grp
Repeated Measures ANOVA SPSS output:

Within-Subjects Factors
Measure: MEASURE_1
time   Dependent Variable
1      PLUC.pre
2      PLUC.post

Between-Subjects Factors
         N
grp  1   20
     2   20
     3   20

Mauchly's Test of Sphericity(b)
Measure: MEASURE_1
                                                                Epsilon(a)
Within Subjects   Mauchly's   Approx.                 Greenhouse-   Huynh-   Lower-
Effect            W           Chi-Square   df   Sig.  Geisser       Feldt    bound
time              1.000       .000         0    .     1.000         1.000    1.000

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is
proportional to an identity matrix.
a May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in
the Tests of Within-Subjects Effects table.
b Design: Intercept+grp
Within Subjects Design: time
Repeated Measures ANOVA SPSS output

Tests of Within-Subjects Effects
Measure: MEASURE_1
                                  Type III Sum
Source                            of Squares     df       Mean Square   F        Sig.
time          Sphericity Assumed  172.952        1        172.952       33.868   .000
              Greenhouse-Geisser  172.952        1.000    172.952       33.868   .000
              Huynh-Feldt         172.952        1.000    172.952       33.868   .000
              Lower-bound         172.952        1.000    172.952       33.868   .000
time * grp    Sphericity Assumed  46.818         2        23.409        4.584    .014
              Greenhouse-Geisser  46.818         2.000    23.409        4.584    .014
              Huynh-Feldt         46.818         2.000    23.409        4.584    .014
              Lower-bound         46.818         2.000    23.409        4.584    .014
Error(time)   Sphericity Assumed  291.083        57       5.107
              Greenhouse-Geisser  291.083        57.000   5.107
              Huynh-Feldt         291.083        57.000   5.107
              Lower-bound         291.083        57.000   5.107

Tests of Within-Subjects Contrasts
Measure: MEASURE_1
                       Type III Sum
Source        time     of Squares     df   Mean Square   F        Sig.
time          Linear   172.952        1    172.952       33.868   .000
time * grp    Linear   46.818         2    23.409        4.584    .014
Error(time)   Linear   291.083        57   5.107

Tests of Between-Subjects Effects
Measure: MEASURE_1
Transformed Variable: Average
            Type III Sum
Source      of Squares     df   Mean Square   F          Sig.
Intercept   14371.484      1    14371.484     3858.329   .000
grp         121.640        2    60.820        16.328     .000
Error       212.313        57   3.725
Repeated Measures - R Demo
 Arrangement of the data can be done in MS-Excel.
However, the following R commands will arrange the
data in default.csv for the repeated measures ANOVA:

attach(x)                          # x: the data read from default.csv
sid1 <- c(sid, sid)                # subject id, once per time point
grp1 <- c(grp, grp)                # group, once per time point
plc  <- c(PLUC.pre, PLUC.post)     # stack the pre and post measurements
time <- c(rep(1, 60), rep(2, 60))  # time indicator: 1 = pre, 2 = post

 The following code runs the repeated measures
ANOVA:

summary(aov(plc ~ factor(grp1) * factor(time) + Error(factor(sid1))))
Repeated Measures R output:

Error: factor(sid1)
              Df  Sum Sq  Mean Sq  F value     Pr(>F)
factor(grp1)   2 121.640   60.820   16.328  2.476e-06 ***
Residuals     57 212.313    3.725
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Error: Within
                           Df  Sum Sq  Mean Sq  F value     Pr(>F)
factor(time)                1 172.952  172.952  33.8676  2.835e-07 ***
factor(grp1):factor(time)   2  46.818   23.409   4.5839    0.01426 *
Residuals                  57 291.083    5.107
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Odds Ratio
 Statistical analysis of binary responses leads to a
conclusion about two population proportions or
probabilities.
 The odds in favor of an event is the quantity p/(1-p),
where p is the probability (proportion) of the event.
 The odds ratio of an event is the ratio of the odds of the
event occurring in one group to the odds of it occurring
in another group.
 Let p1 be the probability of an event in one group and
p2 be the probability of the same event in the other
group. Then the odds ratio is

OR = [p1/(1-p1)] / [p2/(1-p2)]
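
As a quick R illustration of this definition (the function name and the example proportions are made up for this sketch):

odds_ratio <- function(p1, p2) {
  (p1 / (1 - p1)) / (p2 / (1 - p2))   # ratio of the two groups' odds
}
odds_ratio(0.6, 0.3)   # odds of 1.5 vs 0.429, so OR = 3.5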
Odds Ratio
Estimating the odds ratio from a 2x2 table:

              Cancer   No Cancer
Smokers       a        b           a+b
Non-smokers   c        d           c+d
              a+c      b+d         N=a+b+c+d

The odds of cancer among smokers is a/b, the odds of cancer among
non-smokers is c/d, and the ratio of the two odds is

OR = (a/b) / (c/d) = ad/bc
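
Plugging in the counts from the SPSS crosstabulation shown later in these slides (pc = 0: 23 and 7; pc = 1: 7 and 23) reproduces the SPSS odds-ratio estimate:

a <- 23; b <- 7; c <- 7; d <- 23
(a * d) / (b * c)   # 10.79592, matching "Odds Ratio for pc (0 / 1)" below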
Odds Ratio
 The odds ratio must be greater than or equal to zero. As
the odds of the first group approaches zero, the odds ratio
approaches zero. As the odds of the second group
approaches zero, the odds ratio approaches positive
infinity.
 An odds ratio of 1 indicates that the condition or event
under study is equally likely in both groups.
 An odds ratio greater than 1 indicates that the condition
or event is more likely in the first group.
 An odds ratio less than 1 indicates that the condition or
event is less likely in the first group.
 Log odds …
Odds Ratio - Demo
 MS-Excel: No default function.
 SPSS: Analyze > Descriptive Statistics >
Crosstabs > Select Variables: Row(s):
the two groups we want to compare, e.g.
Shades; Column(s): the response variable,
e.g. pc > Click on Statistics > Select
Risk.
 R: Odds ratios can be calculated from a
logistic regression model (see the sketch
below and the glm output later).
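
Base R also offers a quick odds-ratio estimate via fisher.test() on the 2x2 table; note that it reports the conditional maximum-likelihood estimate, so it can differ slightly from ad/bc:

tab <- matrix(c(23, 7, 7, 23), nrow = 2, byrow = TRUE)   # the pc x Shades counts
fisher.test(tab)$estimate   # conditional MLE of the odds ratio (a CI is also returned)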
Odds ratio SPSS output:

pc * Shades Crosstabulation
Count
             Shades
             1     2     Total
pc     0     23    7     30
       1     7     23    30
Total        30    30    60

Risk Estimate
                                      95% Confidence Interval
                            Value     Lower     Upper
Odds Ratio for pc (0 / 1)   10.796    3.263     35.718
For cohort Shades = 1       3.286     1.668     6.473
For cohort Shades = 2       .304      .154      .600
N of Valid Cases            60
Logistic Regression
 Consider a binary response variable Y with two
outcomes (1 and 0). Let p be the proportion of
the event with outcome 1.
 logit(p) is defined as the log of the odds of
the event with outcome 1. Then the logistic
regression of the response variable Y on a set
of explanatory variables X1, X2, …, Xr can be
written as

logit(p) = log(p/(1-p)) = b0 + b1X1 + … + brXr
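
In R, the logit and its inverse are the base functions qlogis() and plogis(), which gives a one-line check of the definition:

p <- 0.75
qlogis(p)           # log(p/(1-p)) = log(3), about 1.0986
plogis(qlogis(p))   # the inverse logit recovers p = 0.75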
Logistic Regression
 From the logistic regression model, the odds in favor of
Y=1, p/(1-p), can be expressed in terms of the r
independent variables:

p/(1-p) = exp(b0 + b1X1 + … + brXr)

 If there is no influence of the explanatory variables (i.e.
X1=0, …, Xr=0), the odds in favor of Y=1, p/(1-p), is
exp(b0).
 If an independent variable (e.g. X1 in the model) is
categorical with two categories, then exp(b1) is the
ratio of odds (in favor of Y=1) between the two categories of the
variable X1.
 If an independent variable (e.g. X2 in the model) is
numeric, then exp(b2) is the multiplicative change in the odds p/(1-p)
for a 1 unit change in X2 (checked numerically below).
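
Using the fitted values from the SPSS and R output later in these slides (b0 = 1.190 and b1 = -2.379 for the Shades indicator), these interpretations can be checked numerically:

b0 <- 1.190; b1 <- -2.379
exp(b0)        # about 3.286: odds of pc = 1 in the reference Shades category
exp(b1)        # about 0.093: odds ratio between the two Shades categories
exp(b0 + b1)   # about 0.304: odds of pc = 1 in the other category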
Logistic Regression - Demo
 MS-Excel: No default functions.
 SPSS: Analyze > Regression > Binary
Logistic > Select Dependent variable >
Select independent variable (covariate).
 R: uses the ‘glm’ function for logistic
regression analysis. The code is in the R
output slide.
Logistic Regression SPSS output

Dependent Variable Encoding
Original Value   Internal Value
0                0
1                1

Categorical Variables Codings
             Frequency   Parameter coding (1)
Shades   1   30          1.000
         2   30          .000

Classification Table(a,b)
                                  Predicted
                                  pc              Percentage
Step 0   Observed                 0      1        Correct
         pc            0          0      30       .0
                       1          0      30       100.0
         Overall Percentage                       50.0
a Constant is included in the model.
b The cut value is .500

Variables in the Equation
                    B      S.E.   Wald   df   Sig.    Exp(B)
Step 0   Constant   .000   .258   .000   1    1.000   1.000

Variables not in the Equation
                                 Score    df   Sig.
Step 0   Variables   Shades(1)   17.067   1    .000
         Overall Statistics      17.067   1    .000
Logistic Regression SPSS output

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     17.985       1    .000
         Block    17.985       1    .000
         Model    17.985       1    .000

Model Summary
         -2 Log       Cox & Snell   Nagelkerke
Step     likelihood   R Square      R Square
1        65.193(a)    .259          .345
a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)
                                  Predicted
                                  pc              Percentage
Step 1   Observed                 0      1        Correct
         pc            0          23     7        76.7
                       1          7      23       76.7
         Overall Percentage                       76.7
a The cut value is .500

Variables in the Equation
                      B        S.E.   Wald     df   Sig.   Exp(B)
Step     Shades(1)    -2.379   .610   15.189   1    .000   .093
1(a)     Constant     1.190    .432   7.594    1    .006   3.286
a Variable(s) entered on step 1: Shades.
Logistic Regression R output

> Shades[Shades==2] <- 0
> ftc <- glm(pc ~ Shades, family = binomial(link = "logit"))
> summary(ftc)

Call:
glm(formula = pc ~ Shades, family = binomial(link = "logit"))

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.706e+00  -7.290e-01  -1.110e-16   7.290e-01   1.706e+00

Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   1.1896      0.4317    2.756   0.00585 **
Shades       -2.3792      0.6105   -3.897  9.73e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.178  on 59  degrees of freedom
Residual deviance: 65.193  on 58  degrees of freedom
AIC: 69.193

Number of Fisher Scoring iterations: 4

> confint(ftc)
Waiting for profiling to be done...
                 2.5 %     97.5 %
(Intercept)  0.3944505   2.114422
Shades      -3.6497942  -1.237172

> exp(ftc$coefficients)
(Intercept)      Shades
  3.2857143   0.0926276

> exp(confint(ftc))
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)  1.48356873  8.2847933
Shades       0.02599648  0.2902037