
Statistics for Medical Researchers
Hongshik Ahn
Professor
Department of Applied Math and Statistics
Stony Brook University
Biostatistician, Stony Brook GCRC
Contents
1. Experimental Design
2. Descriptive Statistics and Distributions
3. Comparison of Means
4. Comparison of Proportions
5. Power Analysis/Sample Size Calculation
6. Correlation and Regression
2
1. Experimental Design
 Experiment
 Treatment: something that researchers
administer to experimental units
 Factor: controlled independent variable
whose levels are set by the experimenter
 Experimental design
 Control
 Treatment
 Placebo effect
 Blind
 single blind, double blind, triple blind
3
1. Experimental Design
 Randomization
 Completely randomized design
 Randomized block design: if there are
specific differences among groups of subjects
 Permuted block randomization: used for small studies
to maintain reasonably good balance among groups
 Stratified block randomization: matching
4
1. Experimental Design
 Completely randomized design
The computer generated sequence:
4,8,3,2,7,2,6,6,3,4,2,1,6,2,0,…….
 Two Groups (criterion: even-odd):
AABABAAABAABAAA……
 Three Groups:
(criterion:{1,2,3}~A, {4,5,6}~B, {7,8,9}~C; ignore 0’s)
BCAACABBABAABA……
 Two Groups with a different randomization ratio (e.g., 2:3):
(criterion:{0,1,2,3}~A, {4,5,6,7,8,9}~B)
BBAABABBABAABAA……..
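The three allocation rules on this slide can be reproduced with a short Python sketch. The function names are illustrative and not part of the original slides, and the digit sequence is copied from the slide rather than generated.

```python
# Digits copied from the slide; in practice they would come from a random number generator.
DIGITS = [4, 8, 3, 2, 7, 2, 6, 6, 3, 4, 2, 1, 6, 2, 0]

def allocate(digits, rule):
    """Apply a digit-to-group rule; digits that the rule maps to None are skipped."""
    return "".join(g for g in map(rule, digits) if g is not None)

def two_groups(d):
    """Even digits go to A, odd digits go to B."""
    return "A" if d % 2 == 0 else "B"

def three_groups(d):
    """{1,2,3} -> A, {4,5,6} -> B, {7,8,9} -> C; zeros are ignored."""
    if d == 0:
        return None
    return "ABC"[(d - 1) // 3]

def two_groups_2_to_3(d):
    """{0,1,2,3} -> A and {4,...,9} -> B, giving a 2:3 allocation ratio."""
    return "A" if d <= 3 else "B"

print(allocate(DIGITS, two_groups))         # AABABAAABAABAAA
print(allocate(DIGITS, three_groups))       # BCAACABBABAABA
print(allocate(DIGITS, two_groups_2_to_3))  # BBAABABBABAABAA
```

The three printed strings match the allocation sequences shown on the slide.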

5
1. Experimental Design
 Permuted block randomization
With a block size of 4 for two groups(A,B), there are 6
possible permutations and they can be coded as:
1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
 Each number in the random number sequence in turn
selects the next block, determining the next four participant
allocations (ignoring numbers 0,7,8 and 9).
 e.g., The sequence 67126814…. will produce BBAA AABB
ABAB BBAA AABB BAAB.
 In practice, a block size of four is too small since
researchers may crack the code and risk selection bias.
Mixing block sizes of 4 and 6 is better, with the size kept
unknown to the investigator.
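A minimal Python sketch of the block-selection rule above. The names `BLOCKS` and `permuted_blocks` are illustrative; the coding 1 to 6 is rebuilt by sorting the distinct arrangements of AABB, which happens to reproduce the order given on the slide.

```python
from itertools import permutations

# The 6 distinct arrangements of two A's and two B's, in sorted order:
# 1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
BLOCKS = {
    code: "".join(block)
    for code, block in enumerate(sorted(set(permutations("AABB"))), start=1)
}

def permuted_blocks(digits):
    """Each digit 1-6 selects the next block; 0, 7, 8 and 9 are ignored."""
    return [BLOCKS[d] for d in digits if d in BLOCKS]

sequence = [int(ch) for ch in "67126814"]
print(" ".join(permuted_blocks(sequence)))
# BBAA AABB ABAB BBAA AABB BAAB
```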

6
1. Experimental Design
 Methods of Sampling
 Random sampling
 Systematic sampling
 Convenience sampling
 Stratified sampling
7
1. Experimental Design
 Random Sampling
 Selection so that each individual member has an
equal chance of being selected
 Systematic Sampling
 Select some starting point and then select every
kth element in the population
8
1. Experimental Design
 Convenience Sampling
 Use results that are easy to get
9
1. Experimental Design
 Stratified Sampling
 Draw a sample from each stratum
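Random, systematic, and stratified sampling can be sketched in a few lines of Python. This is a sketch under stated assumptions: the population is a plain list, the strata are supplied as a dictionary, and every function and variable name here is hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Pick a starting point, then take every k-th element."""
    return population[start::k]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population of 20 subject IDs, split into two hypothetical strata.
ids = list(range(20))
print(systematic_sample(ids, k=5, start=2))  # [2, 7, 12, 17]
print(simple_random_sample(ids, n=4, seed=1))
print(stratified_sample({"young": ids[:10], "old": ids[10:]}, n_per_stratum=2, seed=1))
```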
10
2. Descriptive Statistics & Distributions
 Parameter: population quantity
 Statistic: summary of the sample
 Inference for parameters: use sample
 Central Tendency
 Mean (average)
 Median (middle value)
 Variability
 Variance: measure of variation
 Standard deviation (sd): square root of variance
 Standard error (se): sd of the estimate
 Median, quartiles, min., max, range, boxplot
 Proportion
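The summaries listed above can be computed with Python's standard `statistics` module. The data values below are hypothetical and chosen only so that the arithmetic is easy to follow; the standard error shown is that of the sample mean.

```python
import statistics
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, not from the slides

mean = statistics.mean(values)          # central tendency: average
median = statistics.median(values)      # central tendency: middle value
variance = statistics.variance(values)  # sample variance (divides by n - 1)
sd = statistics.stdev(values)           # square root of the sample variance
se = sd / sqrt(len(values))             # standard error of the sample mean
value_range = max(values) - min(values)

print(mean, median, round(variance, 3), round(sd, 3), round(se, 3), value_range)
```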
11
2.
Descriptive Statistics & Distributions
 Normal distribution
12
2.
Descriptive Statistics & Distributions
 Standard normal distribution:
 Mean 0, variance 1
13
2.
Descriptive Statistics & Distributions
 Z-test for means
 T-test for means if sd is unknown
14
3.
Inference for Means
 Two-sample t-test
 Two independent groups: Control and treatment
 Continuous variables
 Assumption: populations are normally distributed
 Checking normality
 Histogram
 Normal probability plot (Q-Q plot): do the points fall on a straight line?
 Shapiro-Wilk test, Kolmogorov-Smirnov test,
Anderson-Darling test
 If the normality assumption is violated
 T-test is not appropriate.
 Possible transformation
 Use non-parametric alternative: Mann-Whitney U-test (Wilcoxon rank-sum test)
15
3.
Inference for Means
 A clinical trial on effectiveness of drug A in preventing premature birth
 30 pregnant women are randomly assigned to
control and treatment groups of size 15 each
 Primary endpoint: weight of the babies at birth

Group      n   mean  sd
Treatment  15  7.08  0.90
Control    15  6.26  0.96
16
3.
Inference for Means
 Hypothesis: The group means are different
 Null hypothesis (H0): $\mu_1 = \mu_2$
 Alternative hypothesis (H1): $\mu_1 \neq \mu_2$
 Significance level: $\alpha = 0.05$
 Assumption: Equal variance
 Degrees of freedom (df): $n_1 + n_2 - 2$
 Calculate the T-value (test statistic)
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}}$$
 P-value: probability, when H0 is true, of a test statistic at least as extreme as the one observed. The significance level $\alpha$ is the Type I error rate (false positive rate).
 Reject H0 if p-value < $\alpha$
 Do not reject H0 if p-value $\geq \alpha$
17
3.
Inference for Means
Previous example: Test at $\alpha = 0.05$
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{14(0.90)^2 + 14(0.96)^2}{15 + 15 - 2} = 0.866$$
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} = \frac{7.08 - 6.26}{\sqrt{0.866\,[(1/15) + (1/15)]}} = 2.413$$
P-value: 0.026 < 0.05
Reject the null hypothesis that there is no drug effect.
18
3.
Inference for Means
 Confidence interval (CI):
 An interval of values used to estimate the true value of a population parameter.
 The confidence level $1 - \alpha$ is the proportion of
times that the CI actually contains the population
parameter, assuming that the estimation process
is repeated a large number of times.
 Common choices: 90% CI ($\alpha$ = 10%), 95% CI ($\alpha$ = 5%), 99% CI ($\alpha$ = 1%)
19
3. Inference for Means
CI for a comparison of two means:
$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$
where
$$E = t_{\alpha/2,\,n_1 + n_2 - 2}\; s_p \sqrt{(1/n_1) + (1/n_2)}$$
A 95% CI for the previous example:
$$E = t_{0.025,\,28}\; s_p \sqrt{(1/15) + (1/15)} = (2.048)\sqrt{0.866\,[(1/15) + (1/15)]} = 0.70$$
$$(7.08 - 6.26) \pm 0.70 = (0.12,\ 1.52)$$
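The pooled variance, t statistic, and 95% CI for the drug A example can be checked from the summary statistics alone. This sketch uses only the Python standard library; the function name is illustrative, and the critical value 2.048 is taken from the slide rather than computed.

```python
from math import sqrt

def pooled_two_sample(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Equal-variance two-sample t statistic and CI from summary statistics."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))   # standard error of the mean difference
    diff = mean1 - mean2
    t = diff / se                        # test statistic under H0: mu1 = mu2
    margin = t_crit * se                 # E in the CI formula
    return sp2, t, (diff - margin, diff + margin)

# Treatment vs. control from the slide; t_crit = t_{0.025, 28} = 2.048
sp2, t, ci = pooled_two_sample(7.08, 0.90, 15, 6.26, 0.96, 15, t_crit=2.048)
print(round(sp2, 3), round(t, 3), tuple(round(x, 2) for x in ci))
# 0.866 2.413 (0.12, 1.52)
```

The p-value is not computed here because the standard library has no t distribution; a statistics package would be needed for that step.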
3.
Inference for Means
 SAS programming for Two-Sample T-test
 Data steps :
Click ‘File’
Click ‘Import Data’
Select a data source
Click ‘Browse’ and find the path of the data file
Click ‘Next’
Fill the blank of ‘Member’ with the name of the SAS data set
Click ‘Finish’
 Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘Hypothesis Tests’
Click ‘Two-Sample T-test for Means’
Select the independent variable as ‘Group’ and the dependent variable as
‘Dependent’
Choose the hypothesis of interest and click ‘OK’
21
3.
Inference for Means
Click ‘File’ to import data and
create the SAS data set.
Click ‘Solution’ to create a
project to run statistical test
Click ‘File’ to open the SAS data
set.
Click ‘Statistics’ to select the
statistical procedure.
22
3.
Inference for Means
 Mann-Whitney U-Test (Wilcoxon Rank-Sum Test)
 Nonparametric alternative to two-sample t-test
 The populations don’t need to be normal
 H0: The two samples come from populations
with equal medians
 H1: The two samples come from populations
with different medians
23
3.
Inference for Means
 Mann-Whitney U-Test Procedure
 Temporarily combine the two samples into
one big sample, then replace each sample
value with its rank
 Find the sum of the ranks for either one of
the two samples
 Calculate the value of the z test statistic
24
3.
Inference for Means
 Mann-Whitney U-Test, Example
 Numbers in parentheses are their ranks, beginning
with a rank of 1 assigned to the lowest value of 17.7.
 R1 and R2: sum of ranks
25
3.
Inference for Means
 Hypothesis: The group medians are different
 H0: Men and women have the same median BMI
 H1: Men and women have different median BMIs
$$\mu_R = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{13(13 + 12 + 1)}{2} = 169$$
$$\sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(13)(12)(13 + 12 + 1)}{12}} = 18.385$$
$$z = \frac{R - \mu_R}{\sigma_R} = \frac{187 - 169}{18.385} = 0.98$$
p-value = 0.33, thus we do not reject H0 at $\alpha$ = 0.05.
There is no significant difference in BMI between
men and women.
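The z statistic for the BMI example can be reproduced from the rank sum and the two sample sizes. This is a minimal sketch of the normal approximation, with no correction for ties or continuity; the function name is illustrative, and the two-sided p-value comes from the standard normal distribution via `math.erfc`.

```python
from math import erfc, sqrt

def rank_sum_z(rank_sum, n1, n2):
    """Normal approximation for the Wilcoxon rank-sum (Mann-Whitney) test."""
    mu = n1 * (n1 + n2 + 1) / 2                # expected rank sum under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # sd of the rank sum under H0
    z = (rank_sum - mu) / sigma
    p_two_sided = erfc(abs(z) / sqrt(2))       # equals 2 * (1 - Phi(|z|))
    return mu, sigma, z, p_two_sided

# R = 187 is the rank sum of the sample with n1 = 13; the other sample has n2 = 12.
mu_r, sigma_r, z, p = rank_sum_z(187, n1=13, n2=12)
print(mu_r, round(sigma_r, 3), round(z, 2), round(p, 2))
# 169.0 18.385 0.98 0.33
```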
26
3.
Inference for Means
 SAS Programming for Mann-Whitney U-Test
Procedure
 Data steps :
The same as slide 21.
 Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ANOVA’
Click ‘Nonparametric One-Way ANOVA’
Select the ‘Dependent’ and ‘Independent’ variables respectively
and choose the test of interest
Click ‘OK’
27
3.
Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Statistics’ to select the
statistical procedure.
Select the dependent and independent variables:
28
3.
Inference for Means
 Paired t-test
 Mean difference of matched pairs
 Test for changes (e.g., before & after)
 The measures in each pair are correlated.
 Assumption: population is normally distributed
 Take the difference in each pair and perform a one-sample t-test.
 Check normality
 If the normality assumption is violated
 T-test is not appropriate.
 Use non-parametric alternative: Wilcoxon signed
rank test
29
3.
Inference for Means
 Notation for paired t-test
 $d$ = individual difference between the two
values of a single matched pair
 $\mu_d$ = mean value of the differences d for the
population of paired data
 $\bar{d}$ = mean value of the differences d for the
paired sample data
 $s_d$ = standard deviation of the differences d
for the paired sample data
 $n$ = number of pairs
30
3.
Inference for Means
 Example: Systolic Blood Pressure

ID  Without OC’s  With OC’s  Difference
1   115           128        13
2   112           115        3
3   107           106        -1
4   119           128        9
5   115           122        7
6   138           145        7
7   126           132        6
8   105           109        4
9   104           102        -2
10  115           117        2

OC: Oral contraceptive
31
3.
Inference for Means
 Hypothesis: The group means are different
 H0: $\mu_d = 0$ vs. H1: $\mu_d \neq 0$
 Significance level: $\alpha = 0.05$
 Degrees of freedom (df): $n - 1 = 9$
 Test statistic
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{4.8}{4.57 / \sqrt{10}} = 3.32$$
 P-value: 0.009, thus reject H0 at $\alpha$ = 0.05
 The data support the claim that oral
contraceptives affect the systolic bp.
32
3.
Inference for Means
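 Confidence interval for matched pairs
 $100(1-\alpha)$% CI:
$$\left( \bar{d} - t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}},\ \ \bar{d} + t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}} \right)$$
 95% CI for the mean difference of the systolic bp:
$$\bar{d} \pm t_{0.025,\,9} \frac{s_d}{\sqrt{n}} = 4.8 \pm 2.26 \frac{4.57}{\sqrt{10}} = 4.8 \pm 3.27$$
 (1.53, 8.07)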
33
3.
Inference for Means
 SAS Programming for Paired T-test
 Data steps :
The same as slide 21.
 Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ Hypothesis tests’
Click ‘Two-Sample Paired T-test for means’
Select the ‘Group1’ and ‘Group2’ variables respectively
Click ‘OK’
(Note: You can also calculate the difference, and use it as the
dependent variable to run the one-sample t-test)
34
3.
Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Statistics’ to select the
statistical procedure.
Put the two group variables into ‘Group 1’ and ‘Group 2’
35
3.
Inference for Means
 Comparison of more than two means:
 ANOVA (Analysis of Variance)
 One-way ANOVA: One factor, e.g., control, drug 1, drug 2
 Two-way ANOVA: Two factors, e.g., drugs, age groups
 Repeated measures: If there are repeated measures
within subject, such as time points
36
3.
Inference for means
 Example: Pulmonary disease
 Endpoint: Mid-expiratory flow (FEF) in L/s
 6 groups: nonsmokers (NS), passive smokers (PS),
noninhaling smokers (NI), light smokers (LS),
moderate smokers (MS) and heavy smokers (HS)

Group name  Mean FEF  SD FEF  n
NS          3.78      0.79    200
PS          3.30      0.77    200
NI          3.32      0.86    50
LS          3.23      0.78    200
MS          2.73      0.81    200
HS          2.59      0.82    200
37
3.
Inference for means
 Example: Pulmonary disease
 Ho: group means are the same
 H1: not all the groups means are the same

Source   SS      df    MS      F statistic  P-value
Between  184.38  5     36.875  58.0         <0.001
Within   663.87  1044  0.636
Total    848.25

 P-value<0.001
 There is a significant difference in the mean FEF
among the groups.
 Comparison of specific groups: linear contrast
 Multiple comparison: Bonferroni adjustment ($\alpha/n$, where n is the number of comparisons)
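The ANOVA table for the pulmonary disease example can be rebuilt from the group means, standard deviations, and sample sizes alone. This sketch uses only the standard library; the function name is illustrative, and the p-value is omitted because it would need an F distribution from a statistics package.

```python
# (mean FEF, sd FEF, n) per group, copied from the example slide
GROUPS = {
    "NS": (3.78, 0.79, 200),
    "PS": (3.30, 0.77, 200),
    "NI": (3.32, 0.86, 50),
    "LS": (3.23, 0.78, 200),
    "MS": (2.73, 0.81, 200),
    "HS": (2.59, 0.82, 200),
}

def one_way_anova(groups):
    """One-way ANOVA sums of squares and F statistic from summary statistics."""
    n_total = sum(n for _, _, n in groups.values())
    grand_mean = sum(m * n for m, _, n in groups.values()) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups.values())
    ss_within = sum((n - 1) * s ** 2 for _, s, n in groups.values())
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_stat

ss_b, df_b, ss_w, df_w, f_stat = one_way_anova(GROUPS)
print(round(ss_b, 2), df_b, round(ss_w, 2), df_w, round(f_stat, 1))
# 184.38 5 663.87 1044 58.0
```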
38
3.
Inference for Means
 SAS Programming for One-Way ANOVA
 Data steps :
The same as slide 21.
 Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ ANOVA’
Click ‘One-Way ANOVA’
Select the ‘Independent’ and ‘Dependent’ variables respectively
Click ‘OK’
39
3. Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Solutions’ to select the
statistical procedure.
Select the dependent and Independent variables:
40
4.
Inference for Proportions
 Chi-square test
 Testing difference of two proportions
 n: sample size, p: success rate
 Requirement: $np \geq 5$ & $n(1 - p) \geq 5$
 H0: $p_1 = p_2$
 H1: $p_1 \neq p_2$ (for two-sided test)
 If the requirement is not satisfied, use Fisher’s
exact test.
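A minimal sketch of the chi-square test for two proportions, using only the standard library. The counts are hypothetical (the slides give no data for this test), the function name is illustrative, and no continuity correction is applied. With 1 degree of freedom the p-value can be obtained from the normal distribution via `math.erfc`.

```python
from math import erfc, sqrt

def chi_square_2x2(success1, n1, success2, n2):
    """Pearson chi-square test (1 df) comparing two proportions."""
    observed = [[success1, n1 - success1], [success2, n2 - success2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [success1 + success2, total - success1 - success2]
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    min_expected = min(min(row) for row in expected)  # should be at least 5
    return chi2, p_value, min_expected

# Hypothetical counts: 30/100 successes in group 1, 45/100 in group 2
chi2, p_value, min_expected = chi_square_2x2(30, 100, 45, 100)
print(round(chi2, 2), round(p_value, 3), min_expected)
```

If `min_expected` falls below 5, the requirement on the slide is not met and Fisher's exact test is the safer choice.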
41
5. Power/Sample Size Calculation
 Decide significance level (e.g., 0.05)
 Decide desired power (e.g., 80%)
 One-sided or two-sided test
 Comparison of means: two-sample t-test
 Need to know sample means in each group
 Need to know sample sd’s in each group
 Calculation: use software (Nquery, power, etc)
 Comparison of proportions: Chi-square test
 Need to know sample proportions in each group
 Continuity correction
 Small sample size: Fisher’s exact test
 Calculation: use software
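As a rough illustration of what such software computes, the sketch below uses the common normal-approximation formula for a two-sided comparison of two means, n = 2(z_{α/2} + z_β)²σ²/Δ² per group. This formula is not given in the slides, so treat it as an assumption; dedicated software based on the t distribution usually returns a slightly larger n. The inputs 0.82 and 0.93 are borrowed from the drug A example purely for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

print(n_per_group(delta=0.82, sd=0.93))  # 21
print(n_per_group(delta=0.5, sd=1.0))    # 63
```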
42
6.
Correlation and Regression
 Correlation
 Pearson correlation for continuous variables
 Spearman correlation for ranked variables
 Chi-square test for categorical variables
 Pearson correlation
 Correlation coefficient (r): $-1 \leq r \leq 1$
 Test for coefficient: t-test
 Larger sample → more significant for the same
value of the correlation coefficient
 Thus it is not meaningful to judge significance by the
magnitude of the correlation coefficient alone.
 Judge the significance of the correlation by p-value
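The effect of sample size on the test can be seen from the usual t statistic for a Pearson correlation, t = r√(n−2)/√(1−r²) with n−2 degrees of freedom. The formula is standard but is not written out in the slides, and the function name below is illustrative.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# The same r = 0.5 gives a much larger test statistic as n grows.
for n in (10, 30, 100):
    print(n, round(correlation_t(0.5, n), 2))
```

For r = 0.5 the statistic grows from about 1.63 at n = 10 to about 5.72 at n = 100, which is why the same coefficient can be non-significant in a small sample and highly significant in a large one.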
43
6.
Correlation and Regression
 Regression
 Objective
   Find out whether a significant linear relationship exists
  between the response and independent variables
   Use it to predict a future value
 Notation
   X: independent (predictor) variable
   Y: dependent (response) variable
 Multiple linear regression model
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$$
where $\varepsilon$ is the random error
 Checking the model (assumption)
   Normality: q-q plot, histogram, Shapiro-Wilk test
   Equal variance: predicted y vs. error is a band shape
   Linear relationship: predicted y vs. each x
44
6.
Correlation and Regression

Weight (x1) in LB  Age (x2)  Blood pressure (y)
152                50        120
183                20        141
171                20        124
165                30        126
158                30        117
161                50        129
149                60        123
158                50        125
170                40        132
153                55        123
164                40        132
190                40        155
185                20        147
45
6.
Correlation and Regression
The regression equation is
$$\hat{y} = -65.1 + 1.08 x_1 + 0.425 x_2$$
The mean blood pressure increases by 1.08 if weight (x1)
increases by one pound and age (x2) remains fixed.
Similarly, a 1-year increase in age with the weight held
fixed will increase the mean blood pressure by 0.425.

Predictor  Coefficient  se     T-ratio  P-value
Constant   -65.10       14.94  -4.36    0.001
x1         1.077        0.077  13.98    0.000
x2         0.425        0.073  5.82     0.000

s = 2.509, $R^2$ = 95.8%
 Error sd $\sigma$ is estimated as 2.509 with df = 13 − 3 = 10
 95.8% of the variation in y can be explained by the
regression.
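The fitted equation, the error sd, and R² can be reproduced from the 13 data rows with a small least-squares sketch in plain Python. It solves the normal equations for exactly two predictors in closed form; the function name is illustrative, and a statistics package would normally be used instead (and would also supply the standard errors and p-values).

```python
from math import sqrt

# (weight in lb, age, blood pressure), copied from the data slide
ROWS = [
    (152, 50, 120), (183, 20, 141), (171, 20, 124), (165, 30, 126),
    (158, 30, 117), (161, 50, 129), (149, 60, 123), (158, 50, 125),
    (170, 40, 132), (153, 55, 123), (164, 40, 132), (190, 40, 155),
    (185, 20, 147),
]

def fit_two_predictors(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2, plus error sd and R^2."""
    n = len(rows)
    m1 = sum(x1 for x1, _, _ in rows) / n
    m2 = sum(x2 for _, x2, _ in rows) / n
    my = sum(y for _, _, y in rows) / n
    # Centered sums of squares and cross-products
    s11 = sum((x1 - m1) ** 2 for x1, _, _ in rows)
    s22 = sum((x2 - m2) ** 2 for _, x2, _ in rows)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2, _ in rows)
    s1y = sum((x1 - m1) * (y - my) for x1, _, y in rows)
    s2y = sum((x2 - m2) * (y - my) for _, x2, y in rows)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((y - (b0 + b1 * x1 + b2 * x2)) ** 2 for x1, x2, y in rows)
    sst = sum((y - my) ** 2 for _, _, y in rows)
    s = sqrt(sse / (n - 3))  # error sd with df = n - 3
    r2 = 1 - sse / sst       # proportion of variation explained
    return b0, b1, b2, s, r2

b0, b1, b2, s, r2 = fit_two_predictors(ROWS)
print(round(b0, 1), round(b1, 3), round(b2, 3), round(s, 2), round(100 * r2, 1))
```

The printed values should agree with the coefficient table on the slide up to rounding.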
46
6. Correlation and Regression
 SAS Programming for Linear Regression
 Data steps :
The same as slide 21.
 Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ Regression’
Click ‘Linear’
Select the ‘Dependent’ (Response) variable and the ‘Explanatory’
(Predictor) variable respectively
Click ‘OK’
47
6.
Correlation and Regression
Click ‘File’ to open the SAS
data set.
Click ‘Solutions’ to select the
statistical procedure.
Select the dependent and explanatory variables:
48
6.
Correlation and Regression
 Other regression models
 Polynomial regression
 Transformation
 Logistic regression
49