Introduction to choosing the correct statistical test + Tests for Continuous Outcomes I.


Which test should I use?

Outcome variable, tests for independent and correlated observations, and key assumptions:

Continuous (e.g. pain scale, cognitive function)
• Independent: Ttest; ANOVA; Linear correlation; Linear regression
• Correlated: Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling
• Assumptions: Outcome is normally distributed (important for small samples). Outcome and predictor have a linear relationship.

Binary or categorical (e.g. fracture yes/no)
• Independent: Relative risks; Chi-square test; Logistic regression
• Correlated: McNemar’s test; Conditional logistic regression; GEE modeling
• Assumptions: Sufficient numbers in each cell (>=5)

Time-to-event (e.g. time to fracture)
• Independent: Kaplan-Meier statistics; Cox regression
• Correlated: n/a
• Assumptions: Cox regression assumes proportional hazards between groups

Which test should I use?

1. What is the dependent variable?

2. Are the observations independent or correlated?

3. Are key model assumptions met?

Are the observations correlated?

What is the unit of observation?
• person* (most common)
• limb
• half a face
• physician
• clinical center

Are the observations independent or correlated?
• Independent: observations are unrelated (usually different, unrelated people)
• Correlated: some observations are related to one another, for example: the same person over time (repeated measures), legs within a person, half a face

Example: correlated data

Split-face trial:
• Researchers assigned 56 subjects to apply SPF 85 sunscreen to one side of their faces and SPF 50 to the other prior to engaging in 5 hours of outdoor sports during midday.
• The outcome is sunburn (yes/no).
• Unit of observation = side of a face
• Are the observations correlated? Yes.

Russak JE et al. JAAD 2010; 62: 348-349.

Results ignoring correlation:

Table I -- Dermatologist grading of sunburn after an average of 5 hours of skiing/snowboarding (P = .03; Fisher’s exact test)

                 Sun protection factor
                 85      50
Sunburned        1       8
Not sunburned    55      48

Fisher’s exact test compares the following proportions: 1/56 versus 8/56. Note that individuals are being counted twice!
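As a rough check on the reported P = .03, the two-sided Fisher’s exact p-value can be sketched with the standard library alone. The function name `fisher_exact_two_sided` is a hypothetical helper, not something from the paper, and the two-sided rule used here (sum every table no more likely than the observed one) is one common convention, assumed rather than confirmed by the source.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper for illustration; margins are treated as fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k events in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / total

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= observed * (1 + 1e-9))


# Rows: SPF 85 side, SPF 50 side; columns: sunburned, not sunburned.
p_fisher = fisher_exact_two_sided(1, 55, 8, 48)
print(round(p_fisher, 3))
```

This treats the 112 face sides as 112 independent observations, which is exactly the double counting the slide warns about.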

Correct analysis of data:

Table 1. Correct presentation of the data from: Russak JE et al. JAAD 2010; 62: 348-349. (P = .016; McNemar’s exact test).

                                 SPF-85 side
                                 Sunburned    Not sunburned
SPF-50 side    Sunburned         1            7
               Not sunburned     0            48

McNemar’s exact test evaluates the probability of the following: In all 7 out of 7 cases where the sides of the face were discordant (i.e., one side burnt and the other side did not), the SPF 50 side sustained the burn.
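The exact McNemar calculation uses only the discordant pairs. A minimal sketch, assuming the usual two-sided exact form (a binomial test with probability 0.5 on the discordant pairs); `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 7 pairs where only the SPF 50 side burned, 0 where only the SPF 85 side burned.
p_mcnemar = mcnemar_exact(7, 0)
print(round(p_mcnemar, 3))  # 0.016
```

The 1 pair burned on both sides and the 48 pairs burned on neither side carry no information about which sunscreen is better, so they drop out of the test.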

Correlations

Ignoring correlations will:
• overestimate p-values for within-person or within-cluster comparisons
• underestimate p-values for between-person or between-cluster comparisons


Key assumptions of linear models

Assumptions for linear models (ttest, ANOVA, linear correlation, linear regression, paired ttest, repeated-measures ANOVA, mixed models):

1. Normally distributed outcome variable
• Most important for small samples; large samples are quite robust against this assumption.

2. Predictors have a linear relationship with the outcome
• Graphical displays can help evaluate this.


Key assumptions for categorical tests

Assumptions for categorical tests (relative risks, chi-square, logistic regression, McNemar’s test):

1. Sufficient numbers in each cell (np>=5)

In the sunscreen trial, “exact” tests (Fisher’s exact, McNemar’s exact) were used because of the sparse data.

With sparse data
• Need to use “exact” tests
• Need to be cautious with regression modeling, as there is a risk of over-fitting

Sparse Data, Example

Retrospective study comparing prophylaxis during rehabilitation and VTEs.

Risk Factor                                        n      No. VTEs
All                                                140    14
Pharmacologic prophylaxis during rehabilitation
  Tinzaparin 3500 units                            14     5
  Tinzaparin 4500 units                            58     5
  Enoxaparin                                       68     4
Pharmacologic prophylaxis before admission
  None                                             33     7
  Enoxaparin 40 or 30 mg 2 × daily                 78     4
  Heparin 5000 units 3 × daily                     24     3
  Treatment doses LMWH †                           2      0
IVC filter
  Absent                                           113    11
  Present                                          27     3
AIS level
  Nontraumatic                                     54     3
  A, B, C, or D                                    78     11

A much higher proportion of tinzaparin 3500 patients had VTEs. Could it be due to confounding?

Note the sparse data due to low numbers of VTEs and low numbers of tinzaparin 3500-treated patients.

Enoxaparin Versus Tinzaparin for Venous Thromboembolic Prophylaxis During Rehabilitation. PM&R 2012; 4:11-17.

Characteristic Tinzaparin 3500 units n = 14

AIS level, n (%) Nontraumatic A, B, or C D Not available Walk or use wheelchair, n (%) Use wheelchair Walk Not available Pharmacologic prophylaxis before admission, n (%) None Enoxaparin 40 or 30 mg 2 × daily Heparin 5000 units 3 × daily Treatment doses LMWH Not available 9 4 1 0 0 0 0 1 13 12 1 1

Dividing tinzaparin 3500 participants by their other characteristics identifies some of them uniquely. The authors ran regressions to adjust for confounding, but it may be impossible to adjust for some confounders, and small numbers risk over-fitting.

Initial regression model

VTE = intercept (1 parameter)
    + prophylaxis during rehabilitation (2 parameters)
    + AIS level (1 parameter)
    + age (1 parameter)
    + prophylaxis before rehabilitation (3 parameters)

14 events/8 parameters…high risk of over-fitting
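The parameter count above can be tallied in a few lines. This is only an arithmetic sketch; the dictionary keys simply restate the terms of the model. For context, a commonly cited rule of thumb (not stated in the slides) asks for roughly 10 events per parameter in logistic regression.

```python
# Parameters per model term, as listed on the slide.
parameters = {
    "intercept": 1,
    "prophylaxis during rehabilitation": 2,
    "AIS level": 1,
    "age": 1,
    "prophylaxis before rehabilitation": 3,
}
n_events = 14  # total VTEs in the study

n_parameters = sum(parameters.values())
events_per_parameter = n_events / n_parameters
print(n_parameters, events_per_parameter)  # 8 1.75
```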

Continuous outcome (means)

Outcome Variable: Continuous (e.g. pain scale, cognitive function)

Are the observations independent or correlated?

Independent:
• Ttest: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
• Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
• Wilcoxon rank-sum test (=Mann-Whitney U test): non-parametric alternative to the ttest
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Binary or categorical outcomes (proportions)

Outcome Variable: Binary or categorical (e.g. fracture, yes/no)

Are the observations correlated?

Independent:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
• McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
• Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5).
• McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).

Time-to-event outcome (survival data)

Outcome Variable: Time-to-event (e.g., time to fracture)

Are the observation groups independent or correlated?

Independent:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)

Tests for continuous outcomes I…

To be continued next week…


Example: two-sample t-test

In 1980, some researchers reported that “men have more mathematical ability than women” as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436 ± 77 and 30 random female adolescents scored lower: 416 ± 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors’ conclusions?

Two sample ttest

Statistical question: Is there a difference in SAT math scores between men and women?

• What is the outcome variable? Math SAT scores
• What type of variable is it? Continuous
• Is it normally distributed? Yes
• Are the observations correlated? No
• Are groups being compared, and if so, how many? Yes, two
→ two-sample ttest

Two-sample ttest mechanics…

Data Summary

                  n     Sample Mean    Sample Standard Deviation
Group 1: women    30    416            81
Group 2: men      30    436            77

Two-sample t-test

1. Define your hypotheses (null, alternative)
H0: ♂-♀ math SAT = 0
Ha: ♂-♀ math SAT ≠ 0 [two-sided]

Two-sample t-test

2. Specify your null distribution: F and M have approximately equal standard deviations/variances, so make a “pooled” estimate of standard deviation/variance:

$$ s_p^2 = \frac{81^2 + 77^2}{2} \approx 79^2, \qquad s_p \approx 79 $$

The standard error of a difference of two means is:

$$ \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}} = \sqrt{\frac{79^2}{30} + \frac{79^2}{30}} = 20.4 $$

Differences in means follow a T-distribution for small samples; Z-distribution for large samples…

T distribution

• A t-distribution is like a Z distribution, except that it has slightly fatter tails to reflect the uncertainty added by estimating the standard deviation.
• The bigger the sample size (i.e., the bigger the sample size used to estimate σ), the closer t becomes to Z.
• If n>100, t approaches Z.

Student’s t Distribution

[Figure: the standard normal curve (t with df = ∞) plotted with t-distributions for df = 13 and df = 5.]

t-distributions are bell shaped and symmetric, but have ‘fatter’ tails than the normal. Note: t → Z as n increases.

From “Statistics for Managers” Using Microsoft ® Excel 4 th Edition, Prentice-Hall 2004

Student’s t Table

Upper Tail Area
df     .25      .10      .05
1      1.000    3.078    6.314
2      0.817    1.886    2.920
3      0.765    1.638    2.353

Let: n = 3, df = n - 1 = 2, α = .10, α/2 = .05. The corresponding table value is t = 2.920.

The body of the table contains t values, not probabilities.

From “Statistics for Managers” Using Microsoft ® Excel 4 th Edition, Prentice-Hall 2004

t distribution values

With comparison to the Z value

Confidence Level    t (10 d.f.)    t (20 d.f.)    t (30 d.f.)    Z
.80                 1.372          1.325          1.310          1.28
.90                 1.812          1.725          1.697          1.64
.95                 2.228          2.086          2.042          1.96
.99                 3.169          2.845          2.750          2.58

Note: t → Z as n increases

From “Statistics for Managers” Using Microsoft ® Excel 4 th Edition, Prentice-Hall 2004

Two-sample t-test

2. Specify your null distribution (continued): with a pooled standard deviation of 79 and a standard error of 20.4, differences in means follow a T-distribution; here we have a T-distribution with 58 degrees of freedom (60 observations – 2 means)…

Two-sample t-test

3. Observed difference in our experiment = 20 points

4. Calculate the p-value of what you observed

$$ T_{58} = \frac{20 - 0}{20.4} = 0.98, \qquad p = .33 $$

Critical value for two-tailed p-value of .05 for T_58 = 2.000

0.98 < 2.000, so p > .05

5. Do not reject null! No evidence that men are better in math ;)

Corresponding confidence interval…

$$ 20 \pm 2.00 \times 20.4 = (-20.8,\ 60.8) $$

Note that the 95% confidence interval crosses 0 (the null value).
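The five steps of the SAT example can be reproduced from the summary statistics alone. A minimal sketch using only the standard library; the critical value 2.000 is copied from the slides rather than computed, and the variable names are illustrative.

```python
from math import sqrt

# Summary statistics from the data summary slide.
n_women, mean_women, sd_women = 30, 416, 81
n_men, mean_men, sd_men = 30, 436, 77

df = n_women + n_men - 2
# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_women - 1) * sd_women**2 + (n_men - 1) * sd_men**2) / df
pooled_sd = sqrt(pooled_var)

# Standard error of the difference in means.
se_diff = sqrt(pooled_var / n_women + pooled_var / n_men)

diff = mean_men - mean_women
t_stat = diff / se_diff

t_crit = 2.000  # two-tailed .05 critical value for 58 df, taken from the slides
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(df, round(pooled_sd, 1), round(se_diff, 1), round(t_stat, 2))
print(tuple(round(x, 1) for x in ci))
```

Because the two groups are the same size, the weighted pooled variance equals the simple average of the two variances used on the slide.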

Review Question 1

A t-distribution:
a. Is approximately a normal distribution if n>100.
b. Can be used interchangeably with a normal distribution as long as the sample size is large enough.
c. Reflects the uncertainty introduced when using the sample, rather than population, standard deviation.
d. All of the above.

Review Question 2

In a medical student class, the 6 people born on odd days had heights of 64.6 ± 4 inches; the 10 people born on even days had heights of 71.1 ± 5 inches. Height is roughly normally distributed. Which of the following best represents the correct statistical test for these data?

a. $\frac{71.1 - 64.6}{4.5} = \frac{6.5}{4.5} = 1.44$; …
b. $\frac{71.1 - 64.6}{\ldots} = \frac{6.5}{1.4} = 4.6$; p<.0001
c. $\frac{71.1 - 64.6}{\sqrt{\frac{4.7^2}{10} + \frac{4.7^2}{6}}} = \frac{6.5}{2.4} = 2.7$; …
d. $\frac{71.1 - 64.6}{4.5} = \frac{6.5}{4.5} = 1.44$; …




Example: paired ttest

TABLE 1. Difference between Means of "Before" and "After" Botulinum Toxin A Treatment

                        Before BTxnA    After BTxnA    Difference    Significance
Social skills           5.90            5.84           NS            .293
Academic performance    5.86            5.78           .08           .068
Date success            5.17            5.30           .13           .014*
Occupational success    6.08            5.97           .11           .013*
Attractiveness          4.94            5.07           .13           .030*
Financial success       5.67            5.61           NS            .230
Relationship success    5.68            5.68           NS            .967
Athletic success        5.15            5.38           .23           .000**

* Significant at 5% level. ** Significant at 1% level.

Paired ttest

Statistical question: Is there a difference in date success after BoTox?

• What is the outcome variable? Date success
• What type of variable is it? Continuous
• Is it normally distributed? Yes
• Are the observations correlated? Yes, it’s the same patients before and after
• How many time points are being compared? Two
→ paired ttest

Paired ttest mechanics

1. Calculate the change in date success score for each person.
2. Calculate the average change in date success for the sample. (=.13)
3. Calculate the standard error of the change in date success. (=.05)
4. Calculate a T-statistic by dividing the mean change by the standard error (T=.13/.05=2.6).
5. Look up the corresponding p-values. (T=2.6 corresponds to p=.014).
6. Significant p-values indicate that the average change is significantly different than 0.

Paired ttest example 2…

Example problem: paired ttest

Patient    Diastolic BP Before    Diastolic BP After    Change
1          100                    92                    -8
2          89                     84                    -5
3          83                     80                    -3
4          98                     93                    -5
5          108                    98                    -10
6          95                     90                    -5

Null Hypothesis: Average Change = 0

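Example problem: paired ttest

Change: -8 -5 -3 -5 -10 -5

Null Hypothesis: Average Change = 0

$$ \bar{X} = \frac{-8 - 5 - 3 - 5 - 10 - 5}{6} = \frac{-36}{6} = -6 $$

$$ s_x = \sqrt{\frac{(-8 + 6)^2 + (-5 + 6)^2 + (-3 + 6)^2 + \ldots}{5}} = \sqrt{\frac{4 + 1 + 9 + 1 + 16 + 1}{5}} = \sqrt{\frac{32}{5}} = 2.5 $$

$$ s_{\bar{x}} = \frac{2.5}{\sqrt{6}} = 1.0 $$

$$ T_5 = \frac{-6}{1.0} = -6 $$

With 5 df, |T|>2.571 corresponds to p<.05 (two-sided test)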

Example problem: paired ttest

$$ 95\%\ \text{CI}: -6 \pm 2.571 \times (1.0) = (-8.571,\ -3.43) $$

Note: does not include 0.
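The blood pressure example can be checked directly from the six before/after pairs. A minimal sketch with the standard library; the critical value 2.571 (5 df, two-sided .05) is taken from the slides. Carrying full precision gives a standard error of about 1.03 rather than the rounded 1.0, so the results differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import mean, stdev

before = [100, 89, 83, 98, 108, 95]
after = [92, 84, 80, 93, 98, 90]

# A paired ttest works on the within-person changes, not the raw values.
changes = [a - b for a, b in zip(after, before)]

mean_change = mean(changes)
sd_change = stdev(changes)                    # sample SD (divides by n - 1)
se_change = sd_change / sqrt(len(changes))
t_stat = mean_change / se_change

t_crit = 2.571  # two-sided .05 critical value for 5 df, from the slides
ci = (mean_change - t_crit * se_change, mean_change + t_crit * se_change)

print(changes, mean_change, round(se_change, 2), round(t_stat, 2))
print(tuple(round(x, 2) for x in ci))
```

Both ends of the interval are negative, so it excludes 0, which agrees with |T| exceeding 2.571.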


Using our class data…

  Hypothesis: Students who consider themselves street smart drink more alcohol than students who consider themselves book smart.

Null hypothesis: no difference in alcohol drinking between street smart and book smart students.

“Non-normal” class data…alcohol…

Wilcoxon rank-sum test

Statistical question: Is there a difference in alcohol drinking between street smart and book smart students?

• What is the outcome variable? Weekly alcohol intake (drinks/week)
• What type of variable is it? Continuous
• Is it normally distributed? No (and small n)
• Are the observations correlated? No
• Are groups being compared, and if so, how many? Two
→ Wilcoxon rank-sum test

Results:

Book smart: Mean=1.6 drinks/week; median = 1.5

Street smart: Mean=2.7 drinks/week; median = 3.0

Wilcoxon rank-sum test mechanics…

• Book smart values (n=13): 0 0 0 0 1 1 2 2 2 3 3 4 5
• Street smart values (n=7): 0 0 2 3 3 5 6
• Combined groups (n=20): 0 0 0 0 0 0 1 1 2 2 2 2 3 3 3 3 4 5 5 6
• Corresponding ranks: 3.5* 3.5 3.5 3.5 3.5 3.5 7.5 7.5 10.5 10.5 10.5 10.5 14.5 14.5 14.5 14.5 17 18.5 18.5 20

*Ties are assigned average ranks; e.g., there are 6 zeros, so zeros get the average of the ranks 1 through 6.

Wilcoxon rank-sum test…

• Ranks, book smart: 3.5 3.5 3.5 3.5 7.5 7.5 10.5 10.5 10.5 14.5 14.5 17 18.5
• Ranks, street smart: 3.5 3.5 10.5 14.5 14.5 18.5 20
• Sum of ranks book smart: 3.5+3.5+3.5+3.5+7.5+7.5+10.5+10.5+10.5+14.5+14.5+17+18.5 = 125
• Sum of ranks street smart: 3.5+3.5+10.5+14.5+14.5+18.5+20 = 85
• The Wilcoxon rank-sum test compares these numbers, accounting for the differences in sample size in the two groups.
• Resulting p-value (from computer) = 0.24
• Not significantly different!
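The ranking step, including average ranks for ties, can be sketched as follows. `average_ranks` is a hypothetical helper; the sketch reproduces only the rank sums, not the p-value, which the slides obtain from software.

```python
def average_ranks(values):
    """Map each distinct value to its average rank (1 = smallest)."""
    positions = {}
    for rank, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(rank)
    # Tied values share the mean of the ranks they occupy.
    return {value: sum(ranks) / len(ranks) for value, ranks in positions.items()}


book_smart = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
street_smart = [0, 0, 2, 3, 3, 5, 6]

# Rank the combined sample, then split the ranks back out by group.
rank_of = average_ranks(book_smart + street_smart)
rank_sum_book = sum(rank_of[v] for v in book_smart)
rank_sum_street = sum(rank_of[v] for v in street_smart)

print(rank_sum_book, rank_sum_street)  # 125.0 85.0
```

The two sums always add up to n(n+1)/2 for the combined sample, here 20 × 21 / 2 = 210, which is a quick check on the ranking.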

Example 2, Wilcoxon rank-sum test…

10 dieters following Atkin’s diet vs. 10 dieters following Jenny Craig

Hypothetical RESULTS:
• Atkin’s group loses an average of 34.5 lbs.
• J. Craig group loses an average of 18.5 lbs.

Conclusion: Atkin’s is better?

Example: non-parametric tests

BUT, take a closer look at the individual data…

Atkin’s, change in weight (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300

J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30

[Figures: histograms of Weight Change (x-axis) against Percent (y-axis) for the Jenny Craig group (axis from -30 to 20) and the Atkin’s group (axis from -300 to 20).]

Wilcoxon Rank-Sum test

RANK the values, 1 being the least weight loss and 20 being the most weight loss.

Atkin’s
• Changes: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
• Ranks: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20

J. Craig
• Changes: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
• Ranks: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon Rank-Sum test

• Sum of Atkin’s ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
• Sum of Jenny Craig’s ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
• Jenny Craig clearly ranked higher!
• P-value *(from computer) = .018
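A short sketch contrasting the means with the medians and rank sums for the hypothetical diet data shows why the single −300 lb value misleads the comparison of means. It relies on the fact that these 20 values contain no ties; the p-value is not recomputed here.

```python
from statistics import mean, median

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Means are dominated by the one extreme Atkin's value.
print(mean(atkins), mean(jenny_craig))      # -34.5 -18.5
# Medians tell the opposite story.
print(median(atkins), median(jenny_craig))  # -4.5 -19.0

# Rank 1 = least weight loss (largest change), rank 20 = most weight loss.
# There are no ties in these data, so plain positions serve as ranks.
ordered = sorted(atkins + jenny_craig, reverse=True)
rank_of = {value: position for position, value in enumerate(ordered, start=1)}

rank_sum_atkins = sum(rank_of[v] for v in atkins)
rank_sum_jenny = sum(rank_of[v] for v in jenny_craig)
print(rank_sum_atkins, rank_sum_jenny)      # 73 137
```

The −300 value contributes only rank 20 to the Atkin’s sum, no more than a −31 would have, which is the sense in which rank-based tests are robust to outliers.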

Review Question 3

When you want to compare mean blood pressure between two groups, you should:
a. Use a ttest
b. Use a nonparametric test
c. Use a ttest if blood pressure is normally distributed.
d. Use a two-sample proportions test.
e. Use a two-sample proportions test only if blood pressure is normally distributed.


DHA and eczema…

P-values from Wilcoxon sign rank tests

Figure 3 from: Koch C, Dölle S, Metzger M, Rasche C, Jungclas H, Rühl R, Renz H, Worm M. Docosahexaenoic acid (DHA) supplementation in atopic eczema: a randomized, double-blind, controlled trial. Br J Dermatol. 2008 Apr;158(4):786-92. Epub 2008 Jan 30.

Wilcoxon sign-rank test

Statistical question: Did patients improve in SCORAD score from baseline to 8 weeks?

• What is the outcome variable? SCORAD
• What type of variable is it? Continuous
• Is it normally distributed? No (and small numbers)
• Are the observations correlated? Yes, it’s the same people before and after
• How many time points are being compared? Two
→ Wilcoxon sign-rank test

Wilcoxon sign-rank test mechanics…

1. Calculate the change in SCORAD score for each participant.
2. Rank the absolute values of the changes in SCORAD score from smallest to largest.
3. Add up the ranks from the people who improved and, separately, the ranks from the people who got worse.
4. The Wilcoxon sign-rank test compares these values to determine whether improvements significantly exceed declines (or vice versa).
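Steps 1 to 3 above can be sketched as below. The change scores are hypothetical (the trial’s individual data are not given in the slides), `signed_rank_sums` is an illustrative helper, and the sketch assumes that a negative change (lower SCORAD) counts as improvement. Dropping zero changes before ranking is a common convention, also an assumption here.

```python
def signed_rank_sums(changes):
    """Return (rank sum of negative changes, rank sum of positive changes)."""
    nonzero = [c for c in changes if c != 0]  # zero changes are dropped
    positions = {}
    # Rank the absolute changes from smallest to largest, averaging ties.
    for rank, size in enumerate(sorted(abs(c) for c in nonzero), start=1):
        positions.setdefault(size, []).append(rank)
    rank_of = {size: sum(r) / len(r) for size, r in positions.items()}
    negative = sum(rank_of[abs(c)] for c in nonzero if c < 0)
    positive = sum(rank_of[abs(c)] for c in nonzero if c > 0)
    return negative, positive


# Hypothetical SCORAD changes (week 8 minus baseline) for 8 participants.
hypothetical_changes = [-12, -8, -5, 3, -15, 2, -7, -1]
improved, worsened = signed_rank_sums(hypothetical_changes)
print(improved, worsened)  # 31.0 5.0
```

Step 4, turning the two sums into a p-value, is left to statistical software, as in the rank-sum examples.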


ANOVA example

Mean micronutrient intake from the school lunch by school

                        S1 a, n=28    S2 b, n=25    S3 c, n=21    P-value d
Calcium (mg)    Mean    117.8         158.7         206.5         0.000
                SD      62.4          70.5          86.2
Iron (mg)       Mean    2.0           2.0                         0.854
                SD      0.6           0.6           0.6
Folate (μg)     Mean    26.6          38.7          42.6          0.000
                SD      13.1          14.5          15.1
Zinc (mg)       Mean    1.9           1.5           1.3           0.055
                SD      1.0           1.2           0.4

a School 1 (most deprived; 40% subsidized lunches).

b School 2 (medium deprived; <10% subsidized).

c School 3 (least deprived; no subsidization, private school).

d ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England: are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

ANOVA

Statistical question: Does calcium content of school lunches differ by school type (privileged, average, deprived)?

• What is the outcome variable? Calcium
• What type of variable is it? Continuous
• Is it normally distributed? Yes
• Are the observations correlated? No
• Are groups being compared and, if so, how many? Yes, three
→ ANOVA

ANOVA (ANalysis Of VAriance)

• Idea: For two or more groups, test difference between means, for normally distributed variables.
• Just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t-test).

One-Way Analysis of Variance

Assumptions, same as ttest:
• Normally distributed outcome
• Equal variances between the groups
• Groups are independent

Hypotheses of One-Way ANOVA

$$ H_0: \mu_1 = \mu_2 = \mu_3 $$

$$ H_1: \text{Not all of the population means are the same} $$

ANOVA

It’s like this: If I have three groups to compare:
• I could do three pair-wise ttests, but this would increase my type I error.
• So, instead I want to look at the pairwise differences “all at once.”
• To do this, I can recognize that variance is a statistic that lets me look at more than one difference at a time…

The “F-test”

Is the difference in the means of the groups more than background noise (=variability within groups)?

$$ F = \frac{\text{Variability between groups}}{\text{Variability within groups}} $$

The numerator summarizes the mean differences between all groups at once. The denominator is analogous to pooled variance from a ttest.

The F-distribution

• A ratio of variances follows an F-distribution:

$$ \frac{\sigma^2_{between}}{\sigma^2_{within}} \sim F_{n,m} $$

• The F-test tests the hypothesis that two variances are equal.
• F will be close to 1 if sample variances are equal.

$$ H_0: \sigma^2_{between} = \sigma^2_{within} \qquad H_a: \sigma^2_{between} \neq \sigma^2_{within} $$

ANOVA example 2

• Randomize 33 subjects to three groups: 800 mg calcium supplement vs. 1500 mg calcium supplement vs. placebo.
• Compare the spine bone density of all 3 groups after 1 year.

Spine bone density vs. treatment

[Figure: spine bone density (y-axis, 0.7 to 1.2) plotted for the PLACEBO, 800 mg CALCIUM, and 1500 mg CALCIUM groups, annotated with the between-group variation and the within-group variability of each group.]

Group means and standard deviations

• Placebo group (n=11): Mean spine BMD = .92 g/cm²; standard deviation = .10 g/cm²
• 800 mg calcium supplement group (n=11): Mean spine BMD = .94 g/cm²; standard deviation = .08 g/cm²
• 1500 mg calcium supplement group (n=11): Mean spine BMD = 1.06 g/cm²; standard deviation = .11 g/cm²

The F-Test

Between-group variation: the size of the groups (n = 11) times the variance of the group means, which is built from the difference of each group’s mean from the overall mean.

$$ s^2_{between} = n s^2_{\bar{x}} = 11 \times \frac{(.92 - .97)^2 + (.94 - .97)^2 + (1.06 - .97)^2}{3 - 1} = .063 $$

Within-group variation: the average amount of variation within groups.

$$ s^2_{within} = \text{avg } s^2 = \frac{1}{3}(.10^2 + .08^2 + .11^2) = .0095 $$

$$ F_{2,30} = \frac{s^2_{between}}{s^2_{within}} = \frac{.063}{.0095} = 6.6 $$

Large F value indicates that the between group variation exceeds the within group variation (=the background noise).
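The F statistic for the calcium example can be rebuilt from the three group means and standard deviations. A minimal sketch for equal-sized groups; it uses the unrounded grand mean (about .973) instead of the rounded .97, so the result is about 6.64 rather than exactly 6.6.

```python
n_per_group = 11
means = [0.92, 0.94, 1.06]   # placebo, 800 mg, 1500 mg
sds = [0.10, 0.08, 0.11]

k = len(means)
grand_mean = sum(means) / k  # valid as the overall mean because groups are equal-sized

# Between-group variance: group size times the variance of the group means.
var_between = n_per_group * sum((m - grand_mean) ** 2 for m in means) / (k - 1)

# Within-group variance: average of the group variances (equal group sizes).
var_within = sum(s ** 2 for s in sds) / k

f_stat = var_between / var_within
df_between, df_within = k - 1, k * (n_per_group - 1)

print(round(var_between, 3), round(var_within, 4), round(f_stat, 2))
print(df_between, df_within)  # 2 30
```

With unequal group sizes the two variances would need to be weighted by group size, which this sketch does not attempt.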

Review Question 4

Which of the following is an assumption of ANOVA?
a. The outcome variable is normally distributed.
b. The variance of the outcome variable is the same in all groups.
c. The groups are independent.
d. All of the above.
e. None of the above.

ANOVA summary

 A statistically significant ANOVA (F-test) only tells you that

at least

two of the groups differ, but not which ones differ.

 Determining

which

groups differ (when it’s unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…

Question: Why not just do 3 pairwise ttests?

Answer: because, at an error rate of 5% for each test, you have an overall chance of up to 1 − (.95)³ = 14% of making a type-I error (if all 3 comparisons were independent). If you wanted to compare 6 groups, you’d have to do 15 pairwise ttests, which would give you a high chance of finding something significant just by chance.
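As a rough check on these numbers, the sketch below computes the number of pairwise comparisons and the chance of at least one type-I error, assuming (as the answer does) that the tests are independent. The function name is illustrative only.

```python
from math import comb

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one type-I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n_groups in (3, 6):
    n_tests = comb(n_groups, 2)  # number of pairwise comparisons among the groups
    rate = familywise_error_rate(n_tests)
    print(f"{n_groups} groups: {n_tests} pairwise tests, "
          f"overall type-I error rate up to {rate:.0%}")
```

For 6 groups (15 tests) the rate under independence is above 50%, which is what "a high chance of finding something significant just by chance" amounts to.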

Multiple comparisons

Correction for multiple comparisons

How to correct for multiple comparisons post-hoc:

Bonferroni correction (adjusts p by the most conservative amount; assuming all tests are independent, divide the alpha cut-off by the number of tests)

Tukey (adjusts p)

Scheffé (adjusts p)

1. Bonferroni

For example, to make a Bonferroni correction, divide your desired alpha cut-off level (usually .05) by the number of comparisons you are making. Assumes complete independence between comparisons, which is way too conservative.

Obtained P-value   Original Alpha   # tests   New Alpha   Significant?
.001               .05              5         .010        Yes
.011               .05              4         .013        Yes
.019               .05              3         .017        No
.032               .05              2         .025        No
.048               .05              1         .050        Yes
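The table's arithmetic can be reproduced row by row with the sketch below. The p-values and the number of tests in each row are taken from the table; the function name and the "p at or below the new alpha" decision rule are assumptions made for illustration.

```python
def adjusted_alpha(alpha, n_tests):
    """Bonferroni-style cut-off: the original alpha divided by the number of tests."""
    return alpha / n_tests

ORIGINAL_ALPHA = 0.05
# (obtained p-value, number of tests used for that row), as listed in the table
rows = [(0.001, 5), (0.011, 4), (0.019, 3), (0.032, 2), (0.048, 1)]

decisions = []
for p_value, n_tests in rows:
    new_alpha = adjusted_alpha(ORIGINAL_ALPHA, n_tests)
    significant = p_value <= new_alpha
    decisions.append(significant)
    print(f"p = {p_value:.3f}  tests = {n_tests}  "
          f"new alpha = {new_alpha:.3f}  significant: {'Yes' if significant else 'No'}")
```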

2/3. Tukey and Scheffé

Both methods increase your p-values to account for the fact that you’ve done multiple comparisons, but are less conservative than Bonferroni (let the computer calculate for you!).

Review Question 5

I am doing an RCT of 4 treatment regimens for blood pressure. At the end of the day, I compare blood pressures in the 4 groups using ANOVA. My p-value is .03. I conclude:

All of the treatment regimens differ.

I need to use a Bonferroni correction.

One treatment is better than all the rest.

At least one treatment is different from the others.

In pairwise comparisons, no treatment will be different.

Continuous outcome (means)

Outcome variable: continuous (e.g. pain scale, cognitive function)

Are the observations independent or correlated?

Independent:

Ttest: compares means between two independent groups

ANOVA: compares means between more than two independent groups

Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables

Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:

Paired ttest: compares means between two related groups (e.g., the same subjects before and after)

Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)

Mixed models/GEE modeling

Non-parametric alternatives:

Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the ttest

Kruskal-Wallis test: non-parametric alternative to ANOVA

Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Non-parametric ANOVA (Kruskal-Wallis test)

Statistical question: Do nevi counts differ by training velocity (slow, medium, fast) group in marathon runners?

What is the outcome variable? Nevi count

What type of variable is it? Continuous

Is it normally distributed? No (and small sample size)

Are the observations correlated? No

Are groups being compared and, if so, how many? Yes, three

Therefore: non-parametric ANOVA

Example: Nevi counts and marathon runners

Richtig et al. Melanoma Markers in Marathon Runners: Increase with Sun Exposure and Physical Strain. Dermatology 2008;217:38-44.

Non-parametric ANOVA

Kruskal-Wallis one-way ANOVA (an extension of the two-group Wilcoxon rank-sum test to more than two groups; based on ranks)
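To illustrate what "based on ranks" means, here is a minimal sketch of the Kruskal-Wallis H statistic: pool all observations, rank them (tied values share the average rank), and compare the rank sums of the groups. The nevi counts below are made-up values for illustration, not data from the study, and the sketch omits the tie correction and the p-value that statistical software would normally apply.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)

# Hypothetical nevi counts for slow, medium, and fast runners (illustration only).
slow = [12, 15, 9, 20]
medium = [18, 25, 22, 30]
fast = [35, 28, 40, 33]
print(f"H = {kruskal_wallis_h([slow, medium, fast]):.2f}")
```

A large H means the groups' average ranks are far apart; H is compared against a chi-square distribution with (number of groups − 1) degrees of freedom.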

Example: Nevi counts and marathon runners

By non-parametric ANOVA, the groups differ significantly in nevi count (p<.05) overall.

By Wilcoxon rank-sum test (adjusted for multiple comparisons), the lowest velocity group differs significantly from the highest velocity group (p<.05).

Richtig et al. Melanoma Markers in Marathon Runners: Increase with Sun Exposure and Physical Strain. Dermatology 2008;217:38-44.

Review Question 6

I want to compare depression scores between three groups, but I’m not sure if depression is normally distributed. What should I do?

Don’t worry about it—run an ANOVA anyway.

Test depression for normality.

Use a Kruskal-Wallis (non-parametric) ANOVA.

Nothing, I can’t do anything with these data.

Run 3 nonparametric ttests.

Review Question 7

If depression score turns out to be very non-normal, then what should I do?

Don’t worry about it—run an ANOVA anyway.

Test depression for normality.

Use a Kruskal-Wallis (non-parametric) ANOVA.

Nothing, I can’t do anything with these data.

Run 3 nonparametric ttests.

Review Question 8

I measure blood pressure in a cohort of elderly men yearly for 3 years. To test whether or not their blood pressure changed over time, I compare the mean blood pressures in each time period using a one-way ANOVA. This strategy is:

Correct. I have three means, so I have to use ANOVA.

Wrong. Blood pressure is unlikely to be normally distributed.

Wrong. The variance in BP is likely to greatly differ at the three time points.

Correct. It would also be OK to use three ttests.

Wrong. The samples are not independent.