
Multiple testing in high throughput biology

Petter Mostad

Overview

• Review of hypothesis testing
• Multiple testing: What is the issue?
• Types of errors to control
• Adjusting p-values
• A Bayesian alternative

Statistical hypothesis testing

• Fix a null hypothesis H₀, and a model for what the data would look like if H₀ is true.
• Find a test statistic (something computed from the data) which will be more extreme when H₀ is not true.
• Compute the test statistic from the data.
• Compute its p-value: the probability of observing this or a more extreme test statistic if H₀ is true.
• Reject H₀ if p is very small (e.g., less than 0.05).

Example: testing the mean of a sample

• Assume you have observed values $x_1, x_2, \ldots, x_n$.
• Assume they come from a normal distribution with expectation μ and variance σ².
• Null hypothesis: $H_0: \mu = \mu_0$.
• Test statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean and $s$ the sample standard deviation.
• Under the null hypothesis, t has a $t_{n-1}$-distribution, i.e., a t-distribution with n − 1 degrees of freedom.
• The p-value is the probability of values more extreme than t in this distribution.
• The p-value is below 0.05 if and only if $\mu_0$ lies outside the 95% confidence interval for μ.
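As an illustration, here is a minimal R sketch of this test with made-up data; the built-in t.test performs the same computation:

    ## One-sample t-test of H0: mu = 0, computed from the formula above
    x <- c(0.3, -1.2, 2.1, 0.8, 1.5, -0.4, 1.1)   ## made-up observations
    n <- length(x)
    t_stat <- (mean(x) - 0) / (sd(x) / sqrt(n))
    p_value <- 2 * pt(-abs(t_stat), df = n - 1)   ## two-sided p-value from t(n-1)
    t.test(x, mu = 0)                             ## built-in equivalent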

Example: Non-parametric tests

• Tests where the null hypothesis does not specify a particular probability model for the data.
• Example: are the values in two groups different? To find out, rank all the values jointly.
• Wilcoxon rank sum test statistic: add the ranks of all items in one group, and compare to the sum of the ranks in the other group.
• Compare with the rank sums one would get if the group labels were assigned at random. This gives a p-value.
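In R this is the built-in wilcox.test; a small sketch with two made-up groups:

    ## Wilcoxon rank sum test: are the two groups different?
    group1 <- c(5.1, 4.8, 6.0, 5.5)
    group2 <- c(6.2, 6.8, 5.9, 7.1)
    wilcox.test(group1, group2)   ## reports the rank sum statistic and a p-value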

Example: Finding differentially expressed genes

• Assume we want to find differentially expressed (diff.exp.) genes using 4 cDNA arrays with 6000 genes, comparing ”treated” samples with ”untreated”.

• Assume there are NO diff.exp. genes, only noise, and that all 6000 × 4 values are normally distributed.
• Of the 6000 t-test statistics computed, roughly 300 will have values ”more extreme” than the 0.05 cutoff value, just by chance. These are false positives. (See the simulation sketch below.)
• This problem is a consequence of the hypothesis testing approach and multiple testing, and not of the particular method chosen.
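This is easy to check by simulation; a minimal R sketch with pure-noise data for 6000 genes on 4 arrays:

    ## 6000 genes x 4 arrays of pure noise; one t-test per gene
    set.seed(1)   ## arbitrary seed, for reproducibility
    m <- matrix(rnorm(6000 * 4), nrow = 6000)
    p <- apply(m, 1, function(g) t.test(g)$p.value)
    sum(p < 0.05)   ## roughly 300 false positives, by chance alone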

Multiple testing in general

• Multiple testing is in fact always a problem: when testing at a significance level of 5%, 1 in 20 true null hypotheses will be falsely rejected, whether the tests occur in the same or in different research projects.
• The problem seems more acute when a large number of formalized statistical tests are performed, as in modern high-throughput biology.
• What to do?

Ways to deal with the problem

• Adjusting p-values:
– Adjusting the levels of p-values to account for multiple tests.
– Changing the interpretation of p-values.
• Reformulating statistics away from hypothesis testing towards (Bayesian) model choices.

Types of errors

                  # true null hyp   # false null hyp   SUM
 # not rejected          U                 T             W
 # rejected              V                 S             R
 SUM                     M0                M1            M

Type 1 errors (false positives): V, the true null hypotheses that are rejected.
Type 2 errors (false negatives): T, the false null hypotheses that are not rejected.
Traditionally: we want a procedure controlling the size of V, while at the same time keeping T as small as possible.

Some error rates:

• Per-comparison error rate:

$$\text{PCER} = \frac{E(V)}{M}$$

• Family-wise error rate:

$$\text{FWER} = \Pr(V > 0)$$

• False discovery rate:

$$\text{FDR} = E\!\left(\frac{V}{R} \,\middle|\, R > 0\right) \Pr(R > 0)$$

• Positive false discovery rate:

$$\text{pFDR} = E\!\left(\frac{V}{R} \,\middle|\, R > 0\right)$$

Types of control

• Exact control: controlling error rates conditionally on the set of true null hypotheses.
• Weak control: controlling error rates under the condition that all null hypotheses are true.
• Strong control: controlling error rates no matter what hypotheses are true or false.

Controlling FWER with Bonferroni

• The Bonferroni method adjusts p-values $p_1, p_2, \ldots, p_M$ using

$$\tilde{p}_i = \min(M p_i,\ 1)$$

• Provides strong control of FWER.
• Very conservative: it is hard to get any hypotheses rejected.
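In R, Bonferroni adjustment is available through p.adjust; a quick sketch with made-up p-values:

    p_raw <- c(0.0001, 0.004, 0.019, 0.095, 0.43)   ## made-up raw p-values
    p.adjust(p_raw, method = "bonferroni")          ## same as pmin(length(p_raw) * p_raw, 1)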

Alternative: Sidak adjustment

• Given p-values $p_1, p_2, \ldots, p_M$, the adjusted p-values are now

$$\tilde{p}_i = 1 - (1 - p_i)^M$$

• Provides strong control of FWER if the test statistics are independent (rarely the case).
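R's p.adjust has no Sidak option, but the formula is a one-liner; a sketch:

    p_raw <- c(0.0001, 0.004, 0.019, 0.095, 0.43)   ## made-up raw p-values
    sidak <- 1 - (1 - p_raw)^length(p_raw)          ## Sidak-adjusted p-values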

Holm step-down adjusted p-values

• Order the tests so that the raw p-values are ordered:

$$p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(M)}$$

• Compute adjusted p-values:

$$\tilde{p}_{(i)} = \max_{k=1,\ldots,i} \min\!\left((M - k + 1)\, p_{(k)},\ 1\right)$$

• Provides strong control of FWER.
• Somewhat less conservative than Bonferroni.
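Holm adjustment is the default method of R's p.adjust; a sketch comparing it with Bonferroni:

    p_raw <- c(0.0001, 0.004, 0.019, 0.095, 0.43)   ## made-up raw p-values
    p.adjust(p_raw, method = "holm")                ## step-down adjustment
    p.adjust(p_raw, method = "bonferroni")          ## always at least as large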

Permutation testing

• Given items divided into two or more groups.
• Compute a test statistic capturing the way in which your data seem to be extreme.
• Compare with the test statistics computed when the items are permuted between the groups. This gives a p-value.
• The Wilcoxon rank sum test can be seen as a permutation test.
• Can be a good way to find statistically significant differences without using parametric assumptions.
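A minimal R sketch of a two-group permutation test, using the difference in group means as the test statistic and made-up data:

    x <- c(5.1, 4.8, 6.0, 5.5)
    y <- c(6.2, 6.8, 5.9, 7.1)
    obs <- mean(y) - mean(x)                    ## observed test statistic
    pooled <- c(x, y)
    B <- 10000
    null_stats <- replicate(B, {
      idx <- sample(length(pooled), length(x))  ## random relabelling of the items
      mean(pooled[-idx]) - mean(pooled[idx])
    })
    p_value <- mean(abs(null_stats) >= abs(obs))  ## two-sided permutation p-value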

Combining permutation testing with adjusting p-values

• Given a data matrix, with some columns corresponding to ”treated” samples and some to ”untreated”.
• Permute the columns to get permutation-derived p-values.
• Adjust these, using for example the Holm step-down method (a sketch follows below).
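A sketch of this recipe in R, assuming a hypothetical matrix expr with 6000 gene rows and 8 array columns (4 untreated, then 4 treated):

    labels <- c(0, 0, 0, 0, 1, 1, 1, 1)        ## hypothetical column labels
    stat <- function(mat, lab) rowMeans(mat[, lab == 1]) - rowMeans(mat[, lab == 0])
    obs <- stat(expr, labels)
    B <- 1000
    exceed <- rep(0, nrow(expr))
    for (b in 1:B) {
      perm <- sample(labels)                   ## permute the column labels
      exceed <- exceed + (abs(stat(expr, perm)) >= abs(obs))
    }
    p_perm <- exceed / B                       ## permutation-derived p-values
    p_holm <- p.adjust(p_perm, method = "holm")  ## then adjust, e.g. Holm step-down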

Correcting for multiple testing with permutations (Dudoit / Speed)

• Assumes treatment and control samples have been compared to a common reference sample.
• Use the test statistic

$$t = \frac{\bar{m}_2 - \bar{m}_1}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{m}_1, \bar{m}_2$ are the group means, $s_1^2, s_2^2$ the group variances, and $n_1, n_2$ the group sizes.
• By permuting data from treated and control samples we can approximate the null distribution of the test statistic.
• Use permutations to handle the multiple testing problem.
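For illustration, the statistic above written out as a plain R function (a sketch only):

    ## Welch-type t statistic for two groups of values
    welch_t <- function(x1, x2) {
      (mean(x2) - mean(x1)) / sqrt(var(x1) / length(x1) + var(x2) / length(x2))
    }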

Adjusting p-values

• For each permutation b = 1, ..., B, recompute the test statistics $t_j^{(b)}$ for all genes j.
• The raw permutation p-value for gene j counts how often the permuted statistics are at least as extreme as the observed one:

$$p_j^* = \frac{\#\{b \in \{1, \ldots, B\} : |t_j^{(b)}| \geq |t_j|\}}{B}$$

• For multiplicity-adjusted p-values, each observed $|t_j|$ is instead compared with an envelope $u_j^{(b)}$ of successive maxima of the permuted statistics (step-down maxT adjustment).


QQ-plots

• We may also compute t-values by computing differences within groups instead of between groups; these provide null t-values.
• Find extreme quantiles of the null distribution, and compare with the observed t-values.
• Compare the null t-values, or the normal distribution, with the observed t-values in a QQ-plot.
• Single-gene p-values may be computed from quantiles of the null distribution.
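A sketch of such comparisons in R, assuming hypothetical vectors t_obs (observed t-values) and t_null (null t-values derived from within-group differences or permutations):

    qqplot(t_null, t_obs, xlab = "null t-values", ylab = "observed t-values")
    abline(0, 1)    ## points far from this line suggest genuine signal
    qqnorm(t_obs)   ## or compare with the normal distribution
    qqline(t_obs)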

Controlling FDR

• Start by ordering the tests so that the raw p-values are ordered:

$$p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(M)}$$

• Adjusted p-values can then be computed:

$$\tilde{p}_{(i)} = \min_{k=i,\ldots,M} \min\!\left(\frac{M}{k}\, p_{(k)},\ 1\right)$$

• This can be interpreted as a step-up procedure, starting by investigating the largest p-value.
• Provides strong control of FDR under independence of the test statistics, and under certain more general conditions.
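This is the Benjamini-Hochberg procedure; in R's p.adjust it is method "BH", with "fdr" as an alias. A sketch:

    p_raw <- c(0.0001, 0.004, 0.019, 0.095, 0.43)   ## made-up raw p-values
    p.adjust(p_raw, method = "BH")                  ## FDR-adjusted p-values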

Interpreting FDR-adjusted p-values

• Remember that FDR-adjusted p-values must be interpreted differently:
• If you select all genes with FDR-adjusted p-values less than 0.05, it means that you expect a proportion of 0.05 of these to be false discoveries.
• In contrast, if you select all genes with Holm-adjusted p-values less than 0.05, it means that the chance of seeing any false positive gene at all is about 0.05.

Multiple testing using BioConductor software

• In the limma package, the topTable function can be used to adjust p-values.
• It offers a number of different adjustment procedures, for example bonferroni, holm, and fdr.
• The package multtest can be used to perform permutation-based computations.
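A sketch of typical usage, assuming a normalized expression object eset with two untreated and two treated arrays (this setup is hypothetical):

    library(limma)
    trt <- factor(c("untr", "untr", "tr", "tr"), levels = c("untr", "tr"))
    design <- model.matrix(~ trt)
    fit <- eBayes(lmFit(eset, design))   ## one linear model per gene
    topTable(fit, coef = 2, number = 20, adjust.method = "fdr")   ## adjusted p-values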

Criticism of the FDR approach

• It is ”possible to cheat”:
– You have a number of hypotheses you want to reject, but the p-values are not quite good enough.
– Add to your hypotheses a number of untrue hypotheses with low p-values.
– The number of rejections will rise, but not the number of false rejections, so your FDR improves, and you ”prove” the hypotheses you care about.

Criticism of hypothesis testing in general

• Assume a pharmaceutical company wants to prove a new drug is effective.
• If they compute new p-values continuously, as new data come in, they can just stop whenever the p-value dips below 0.05.
• This gives an ”unfair advantage” compared to testing only after the end of the entire study, as they perform multiple tests.
• ”Solution”: decide before the study at what times to compute p-values, and adjust them accordingly.
• Result: two companies can have exactly identical data, but one gets its drug approved and the other does not.

Example: A Bayesian approach (Scott and Berger, 2003)

• Objective: given a large number of values of normally distributed noise with expectation zero, and some values which deviate substantially more from zero, how do we find these deviating values, limiting errors of types 1 and 2?
• Crucial ingredient: use the data to estimate the probability that values are deviating.

Hierarchical model used:

$$X_i \mid \mu_i, \sigma^2, \gamma_i \sim N(\gamma_i \mu_i,\ \sigma^2)$$
$$\mu_i \mid \tau^2 \sim N(0,\ \tau^2)$$
$$\gamma_i \mid p \sim \text{Bernoulli}(1 - p)$$
$$\pi(\sigma^2, \tau^2) \propto (\sigma^2 + \tau^2)^{-2}$$
$$p \sim \text{Unif}(0, 1)$$

• The model above specifies ”noise values” as those with $\gamma_i = 0$, and ”signal values” as those with $\gamma_i = 1$.
• We would like to fit the model to data, and thus find the $\gamma_i$.

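A minimal R sketch of the central computation, with σ², τ² and p fixed at assumed values (in the full model they are given priors and integrated out):

    ## P(signal | x): gamma_i = 1 with prior probability 1 - p.
    ## Marginally, noise is N(0, sigma2) and signal is N(0, sigma2 + tau2).
    post_signal <- function(x, sigma2 = 1, tau2 = 25, p = 0.5) {   ## values assumed
      f1 <- dnorm(x, 0, sqrt(sigma2 + tau2))
      f0 <- dnorm(x, 0, sqrt(sigma2))
      (1 - p) * f1 / ((1 - p) * f1 + p * f0)
    }
    signals <- c(-8.48, -5.43, -4.81, -2.64, 2.40, 3.32, 4.07, 4.81, 5.81, 6.24)
    round(post_signal(signals), 2)   ## posterior probability of being a signal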
Example computations:

• We assume we have 10 ”signal observations”: -8.48, -5.43, -4.81, -2.64, 2.40, 3.32, 4.07, 4.81, 5.81, 6.24

• We have n = 10, 50, 500, or 5000 ”noise observations”, normally distributed N(0,1).
• We find posterior probabilities for each observation to be noise or signal: can we fish out the signals from the noise?

Results

Posterior probabilities of being a signal, for the central seven ”signal” observations:

 Observation   n=10   n=50   n=500   n=5000
 -5.4          1      1      1       1.0
 -4.8          1      1      1       .98
 -2.6          .94    .71    .26     .03
  2.4          .89    .59    .17     .02
  3.3          .99    .94    .67     .16
  4.1          1      1      .96     .67
  4.8          1      1      1       .98

False positives (# noise observations with p_i > .6): 1, 0, 1, 2

Note: The 3 most extreme signal observations always have probability 1 of being signals.
Note: The number of ”false positives” does not change much with the amount of noise: adjustment for multiple testing is automatic!
Note: The most central observations are ”drowned out” when the noise increases.

Multiple testing for microarray data in practice:

• Often, too few replicates (2 or 3) of each sample are used to reliably find differentially expressed genes.
• Instead of a definite list of diff.exp. genes, the output may be a ranking list, ordering genes according to how likely they are to be diff.exp. The top of the list is chosen for further validation.
• Combination with other sources of information may be used to find interesting genes.