A comparison of multiple treatments


Data Analysis

Nick Holmes Pathology Flow Cytometry Facility

http://www.bio.cam.ac.uk/~nh106/flowsite/flowindex.html

Or via Quick links on http://www.path.cam.ac.uk/

Flow cytometry data is technically a discontinuous (discrete) variable, as the values recorded in your FCS data files are the output of analogue-to-digital converters (ADCs). The number of bins (or channels, as flow jocks call them) varies greatly between today's cytometers.

E.g.

Cytometer   Channels
Facscan     1024
Cyan        65536
Canto       262144
CytekDxP    262144
Accuri      16777216

Different analysis programmes will display these data on a variety of scales. Often the data are rebinned, i.e. neighbouring channels are combined into a smaller number of values. This helps make distributions look smooth. However, more complex data processing algorithms can also be used, e.g. in FlowJo; these will make your data look different (sometimes nicer).
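As a rough illustration of simple rebinning, the sketch below merges equal-width groups of neighbouring raw channels into display channels. The function names and the 0-based channel numbering are assumptions made for this example; real analysis packages may rebin differently.

```python
def rebin(raw_channel: int, raw_channels: int = 65536, display_channels: int = 256) -> int:
    """Map a 0-based raw ADC channel onto a 0-based, coarser display channel."""
    if raw_channels % display_channels != 0:
        raise ValueError("raw_channels must be a multiple of display_channels")
    if not 0 <= raw_channel < raw_channels:
        raise ValueError("raw_channel out of range")
    width = raw_channels // display_channels  # raw channels merged into each display bin
    return raw_channel // width


def rebin_histogram(counts: list[int], display_channels: int = 256) -> list[int]:
    """Sum the event counts of neighbouring raw channels into display bins."""
    if len(counts) % display_channels != 0:
        raise ValueError("len(counts) must be a multiple of display_channels")
    width = len(counts) // display_channels
    return [sum(counts[i * width:(i + 1) * width]) for i in range(display_channels)]


# On a 65536-channel instrument shown on a 256-channel display,
# raw channels 32512 to 32767 all land in the 128th display bin (index 127).
print(rebin(32512), rebin(32767))
```

Summing counts keeps the total number of events unchanged; only the resolution of the x axis is reduced.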

Sources of true background noise

• Thermionic emission
• Stray light
• Electrical circuits

Sources of variation in signal

• Mains (transformer) fluctuation
• Laser power fluctuation
• Cell (particle) position within the beam (fluidic fluctuation)
• Sample preparation and staining
• Biological variance

It is becoming increasingly common for operators to QC cytometers by regularly running beads; some machines have automated software routines for doing this. If I do it manually, what kind of variability am I prepared to tolerate before I decide the machine needs attention? Typically I would have a batch of control beads and standard settings which place my beads in linear channel 128 on a 256-channel display (note that these are really channels 32512-32767 if I am using our Cyan). When we repeat this, e.g. every week, as well as reasonable precision (CV < 3-8% depending on fluorescence parameter), I also expect fluctuations in mean and median from week to week. I am not going to worry if the median lies in the range 115-140 or so.
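The manual check described above could be sketched as follows. The function name, the example event values and the default CV limit of 8% are assumptions for illustration only; the acceptable CV depends on the fluorescence parameter, and the 115-140 window is the rule of thumb quoted above rather than a standard.

```python
from statistics import mean, median, stdev


def bead_qc(display_channels, median_range=(115, 140), max_cv_percent=8.0):
    """Return (median, CV in %, ok) for one bead run on a 256-channel display scale."""
    med = median(display_channels)
    cv = 100 * stdev(display_channels) / mean(display_channels)
    ok = median_range[0] <= med <= median_range[1] and cv < max_cv_percent
    return med, cv, ok


# Made-up bead events, not real data.
good_week = [124, 126, 128, 128, 130, 132]
drifted_week = [100, 104, 105, 106, 110]
print(bead_qc(good_week))
print(bead_qc(drifted_week))
```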

Which measure of central tendency?

Often we will want to use a single statistic value to compare populations but which one gives the best description of non-Gaussian populations?

median 553
geometric mean 621
mode 200
mean 1755

I think you can see that both mode and mean do a pretty poor job of describing the complex population. Geometric mean and median are quite similar. I prefer median as a general rule.
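To see how the four statistics pull apart on a skewed population, here is a small sketch using Python's standard `statistics` module. The values are invented for illustration and are not the data behind the figures quoted above.

```python
from statistics import geometric_mean, mean, median, mode

# Invented, right-skewed fluorescence values with one bright outlier.
values = [100, 200, 200, 200, 400, 800, 1600, 10000]

summary = {
    "median": median(values),
    "geometric mean": geometric_mean(values),
    "mode": mode(values),
    "mean": mean(values),
}

for name, value in summary.items():
    print(f"{name:>15}: {value:.0f}")
```

In this toy set the single bright event drags the mean well above the median, while the geometric mean stays much closer to the median.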

When are cells positive?

Here our control (red) population has a lower peak (mode) than our test sample.

But does this mean all or most of the cells are positive for our test antibody (antigen by implication)?

I would favour a conservative interpretation, namely that the main peak in the test sample is negative; its slightly higher fluorescence is probably due to differences between the control and test antibody preparations (coupling profile, aggregation profile etc.). Even then, bear in mind that wherever you set your lower boundary for positivity, you count some negatives as positives or some positives as negatives, and usually both. So this boundary is a sensitive and subjective measure. Be careful when drawing conclusions that the exact position of the boundary doesn't alter your result!

Basic KS comparisons are too sensitive to distinguish real from instrument variation

[Figure: two panels of overlaid histograms, labelled Dmax = 1, p < 0.001* and Dmax = 0.1857, p < 0.001*]

The right-hand panel above shows 2 samples from the same tube, so they can't be 'truly' different except by sampling.

* Kolmogorov-Smirnov comparison in FlowJo

If we use KS to compare histograms, any d > 0.042 would be significant at p < 10^-6 for n = 4096*, and you might think that this level of cut-off would stop mere statistical noise making things appear different when they aren't. However, even this level isn't high enough, or anywhere near it.

Unfortunately FlowJo doesn't return D values for KS. Nor is it easy to calculate (with precision) Dc for p < 10^-7 or less.

In practice, you have to use common-sense judgement. If things don't look different enough to be believably BIOLOGICALLY different, then don't let stats trick you. Conversely, anything which looks biologically meaningful WILL be statistically different if you just compare the raw data of histograms.

BUT is this what we need to establish?

* In fact a general approximation can be made that p ≈ 10^-6 for any d√n ≈ 2.69
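The footnote's figure can be checked with the leading term of the asymptotic Kolmogorov distribution, p ≈ 2·exp(-2·n·d²). The sketch below is only that approximation, not FlowJo's calculation; the function name is an assumption, and n is treated as the effective sample size (for two samples of sizes n1 and n2 that would be n1·n2/(n1+n2)).

```python
import math


def ks_critical_d(p: float, n_effective: float) -> float:
    """Approximate critical D for significance level p, from p = 2*exp(-2*n*d**2)."""
    d_root_n = math.sqrt(-0.5 * math.log(p / 2))  # critical value of d * sqrt(n)
    return d_root_n / math.sqrt(n_effective)


print(round(ks_critical_d(1e-6, 1), 2))      # critical d * sqrt(n)
print(round(ks_critical_d(1e-6, 4096), 3))   # critical d for n = 4096
```

With n = 4096 this gives a cut-off close to 0.042, so a D of 0.1857 at that sample size lies far beyond it even though the two samples came from the same tube.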

The Mann-Whitney U test: a simple test for reproducibility

• The Mann-Whitney U test is generally applicable wherever you want to compare univariate distributions for a test sample and a control sample
• If the ranges do not overlap, you only need 3 samples of each to get p ≤ 0.05 (one-tailed)

Overlapping ranges require more samples, but for example:

Median FI
Controls   Test
5.61       5.99
5.83       6.58
5.87       7.02
6.01       7.31
7.15       7.39

Gives U = 4; p < 0.05 (one-tailed)
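The U value above can be reproduced by direct counting, and an exact p obtained by enumerating every way of splitting the ten values into two groups of five. This is a minimal sketch that assumes no tied values and a one-tailed question (controls lower than test); the function names are illustrative.

```python
from itertools import combinations

control = [5.61, 5.83, 5.87, 6.01, 7.15]
test = [5.99, 6.58, 7.02, 7.31, 7.39]


def u_statistic(low, high):
    """Count the (low, high) pairs in which the 'low' value exceeds the 'high' one."""
    return sum(1 for a in low for b in high if a > b)


def exact_one_tailed_p(low, high):
    """Return (U, p), where p is the fraction of all regroupings with U <= observed."""
    pooled = low + high
    observed = u_statistic(low, high)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(low)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if u_statistic(group_a, group_b) <= observed:
            hits += 1
    return observed, hits / total


u, p = exact_one_tailed_p(control, test)
print(u, round(p, 4))
```

Running the same function on two non-overlapping groups of three, e.g. `exact_one_tailed_p([1, 2, 3], [4, 5, 6])`, gives U = 0 with p = 1/20.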

Where inter-experiment variability is too high, Wilcoxon's signed rank test will still deliver significance at 5% for 5 pairs of samples, provided that the control value is always lower than the test sample within each experimental pair.

Example: a monoclonal antibody against a novel antigen is used to stain the cell line BRAVO. The fluorescence obtained by indirect staining with the Mab + anti-mouse Ig is compared to that obtained with the same secondary + an isotype control. The experiment is repeated on 5 separate occasions. The following median FI values are obtained:

Control   Test
3         5
6         7
14        16
11        13
8         10

By Mann-Whitney U = 10, P > 0.1
By Wilcoxon W = 0, P < 0.05
1% significance requires seven such samples.

NB: These two sets are very close. I would still exercise caution in interpreting the data. Clearly they give a small reproducible difference. This could mean that BRAVO expresses low levels of the novel antigen. Alternatively, there may be unknown differences in the non-specific binding of the control and novel antibody, e.g. a higher level of dimers, trimers etc.; almost all antibody preps contain some higher-order species.
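The Wilcoxon result for the BRAVO example can be sketched with an exact sign-flip enumeration: under the null hypothesis each paired difference is equally likely to be positive or negative. The helper names are assumptions, tied differences get average ranks, and the p value is one-tailed.

```python
from itertools import product

control = [3, 6, 14, 11, 8]
test = [5, 7, 16, 13, 10]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_exact(low, high):
    """Return (W, p): W is the rank sum of negative differences, p is one-tailed."""
    diffs = [h - l for l, h in zip(low, high) if h != l]
    ranks = average_ranks([abs(d) for d in diffs])
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    hits = sum(
        1
        for signs in product((False, True), repeat=len(ranks))
        if sum(r for r, negative in zip(ranks, signs) if negative) <= w_minus
    )
    return w_minus, hits / 2 ** len(ranks)


w, p = wilcoxon_exact(control, test)
print(w, p)
```

With every difference in the same direction only one of the 2^n sign patterns is as extreme, so p = 1/32 for five pairs and 1/128 for seven, which is why seven pairs are needed for the 1% level.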

A comparison of multiple treatments

If we want to compare multiple pairs of samples within an experiment then we need a different method. Friedman's test provides a method to assess whether a dataset differs by either 'block' (by which we mean experiment/run) or by 'treatment' (by which we could mean different antibodies, different concentrations of antibody or drug, etc.). However, in order to compare individual pairs of treatments we need to apply Dunn's post test to the data. Friedman only tests whether the null hypothesis, that all samples might plausibly be drawn from the same population, is demonstrably false at some defined level of certainty.

An example of Friedman's test

We have a T lymphocyte cell line which expresses GFP under the control of a minimal promoter with 3 NFAT binding sites upstream of the transcription start site. Thus we can measure the degree of NFAT activity after anti-CD3 activation. We have 4 different drugs which we believe may inhibit the dephosphorylation of NFAT (required for nuclear entry, hence transcriptional activity). We treat cells with anti-CD3 and, independently, each of the 4 drugs, with vehicle as a control, and we measure the fluorescence of cells excited by 488 nm light and detected between 515 and 545 nm. We did this experiment 3 times using, so far as possible, the same cells, drug doses, cytometer settings etc.

Median GFP fluorescence

Treatment   Expt 1   Expt 2   Expt 3   Mean   Mean as % of control
Vehicle     2657     2347     2784     2596
A           2612     2333     2636     2527   97
B           2271     2201     2311     2261   87
C           1907     1899     2089     1965   76
D           1439     1333     1566     1446   56

Convert these to ranks within each experiment

Treatment   Expt 1   Expt 2   Expt 3   R_i (sum of ranks)   R_i²
Vehicle     5        5        5        15                   225
A           4        4        4        12                   144
B           3        3        3        9                    81
C           2        2        2        6                    36
D           1        1        1        3                    9
Total                                  45                   495

Friedman's test statistic S is given by

$$S = R_1^2 + R_2^2 + R_3^2 + R_4^2 + R_5^2 - \frac{R^2}{n} = \sum_{i=1}^{n} R_i^2 - \frac{R^2}{n}$$

where n is the number of treatments, R_i is the sum of ranks of treatment i and R is the sum of all R_i.

For our example S = 495 - 405, thus S = 90.

We can use tables of significance levels to find that for 3 replicates of 5 treatments, S = 86 has a probability p = 0.009 that all values are drawn from the same underlying population. Our S = 90 exceeds this.
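The rank sums and S for this example can be recomputed with a few lines of Python. This is a minimal sketch that assumes no tied values within an experiment; the function names are illustrative, and the significance level still has to come from tables as described above.

```python
data = {
    "Vehicle": [2657, 2347, 2784],
    "A": [2612, 2333, 2636],
    "B": [2271, 2201, 2311],
    "C": [1907, 1899, 2089],
    "D": [1439, 1333, 1566],
}


def rank_sums(table):
    """Rank treatments within each experiment (1 = lowest) and sum the ranks per treatment."""
    names = list(table)
    n_experiments = len(table[names[0]])
    sums = dict.fromkeys(names, 0)
    for e in range(n_experiments):
        ordered = sorted(names, key=lambda name: table[name][e])
        for rank, name in enumerate(ordered, start=1):
            sums[name] += rank
    return sums


def friedman_s(table):
    """S = sum of R_i squared minus R squared over n, with n the number of treatments."""
    sums = rank_sums(table)
    n = len(sums)
    total = sum(sums.values())  # R, the sum of all R_i
    return sum(r * r for r in sums.values()) - total ** 2 / n


print(rank_sums(data))
print(friedman_s(data))
```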

This only tells us that the drugs made a difference!

Dunn’s Post test for pairwise comparison

If we want to ask whether particular drugs were effective, and whether some were better than others, we need to do pairwise comparisons of treatments. We could choose two levels of query.

1. For each drug, does it inhibit the activation of GFP expression?

2. For all drugs, is A>B>C etc ?

Whenever you perform multiple comparisons within a dataset, you need to correct for the fact that the more comparisons you do, the more probable it is that you will see an effect by chance.

Query 1 makes a total of 4 comparisons and Query 2 makes 10, so we divide the level of significance we are prepared to accept by these values*, i.e. for Q1 we need p ≤ 0.0125 and for Q2 p ≤ 0.005, if we want to use the conventional significance threshold (p = 0.05). We need to find the value of z from the normal distribution that corresponds to that two-tailed probability; this can be done using online calculators, e.g.

http://graphpad.com/quickcalcs/Statratio1.cfm

* Technically the correction is to $1 - (0.95)^{1/N}$, where N is the number of comparisons, but 0.05/N is a close approximation for small N
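As an alternative to an online calculator, the corrected thresholds and the matching two-tailed z values can be computed with Python's standard library. The function names here are illustrative assumptions; `NormalDist.inv_cdf` is the standard-library inverse normal CDF.

```python
from statistics import NormalDist


def simple_correction(alpha: float, n_comparisons: int) -> float:
    """The 0.05/N approximation used in the text."""
    return alpha / n_comparisons


def exact_correction(alpha: float, n_comparisons: int) -> float:
    """The footnote's form: 1 - (1 - alpha)^(1/N)."""
    return 1 - (1 - alpha) ** (1 / n_comparisons)


def two_tailed_z(p: float) -> float:
    """z such that the two-tailed normal probability beyond +/- z equals p."""
    return NormalDist().inv_cdf(1 - p / 2)


for n in (4, 10):
    p = simple_correction(0.05, n)
    print(n, p, round(exact_correction(0.05, n), 4), round(two_tailed_z(p), 3))
```

For 4 and 10 comparisons the two corrections differ only in the fourth decimal place, which is why the simple division is usually good enough.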

To compare groups i and j, we find the absolute value of the difference between the mean rank in group i and the mean rank in group j, then divide this difference in mean ranks by its standard deviation:

$$\sqrt{\frac{N(N+1)}{12}\left(\frac{1}{N_i} + \frac{1}{N_j}\right)}$$

Here N is the total number of data points in all groups, and N_i and N_j are the numbers of data points in the two groups being compared. Furthermore, the ranks are calculated using all samples rather than the within-'block' ranks used for the Friedman test.

If the ratio calculated above is larger than the critical value of z, then we conclude that the difference is statistically significant.

For Q1 we require z ≥ 2.498 for p≤0.0125

For Q2 we need z ≥ 2.807 for p≤0.005

The upshot is that if we asked Q1, then only drug D gave significantly different activation. For Q2, only the same vehicle vs drug D comparison can be said to be significantly different.
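The pairwise ratios behind this conclusion can be reproduced from the median GFP values. The sketch below follows the procedure as described in the text (ranks over all 15 values, difference in mean ranks divided by its standard deviation) and assumes no tied values; the function name is illustrative.

```python
from itertools import combinations
from math import sqrt

data = {
    "Vehicle": [2657, 2347, 2784],
    "A": [2612, 2333, 2636],
    "B": [2271, 2201, 2311],
    "C": [1907, 1899, 2089],
    "D": [1439, 1333, 1566],
}


def dunn_z(table):
    """Return {(group_a, group_b): z} for every pair of groups."""
    pooled = sorted(v for values in table.values() for v in values)
    rank_of = {v: r for r, v in enumerate(pooled, start=1)}  # assumes no tied values
    n_total = len(pooled)
    mean_rank = {
        name: sum(rank_of[v] for v in values) / len(values)
        for name, values in table.items()
    }
    z = {}
    for a, b in combinations(table, 2):
        sd = sqrt(n_total * (n_total + 1) / 12 * (1 / len(table[a]) + 1 / len(table[b])))
        z[(a, b)] = abs(mean_rank[a] - mean_rank[b]) / sd
    return z


z = dunn_z(data)
q1 = [b for (a, b), value in z.items() if a == "Vehicle" and value >= 2.498]
q2 = [pair for pair, value in z.items() if value >= 2.807]
print(q1, q2)
```

The drug A vs drug D ratio comes out between the two critical values, so it falls short of the Q2 threshold that applies when all ten pairs are compared.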

This does not mean that drug C doesn’t inhibit NFAT activation or that there is no difference between drug D and drug A!

It means we need to do more replicate experiments to show further significance.

Set X

Treatment   Expt 1   Expt 2   Expt 3   Mean   % of vehicle control   Dunn's test vs vehicle control
Vehicle     2657     2347     2784     2596
A           2612     2333     2636     2527   97                     NS
B           2271     2201     2311     2261   87                     NS
C           1907     1899     2089     1965   76                     NS
D           1439     1333     1566     1446   56                     p<0.05

Set Y

Treatment   Expt 1   Expt 2   Expt 3   Mean     % of vehicle control   ANOVAR (set Y) vs control
Vehicle     797.92   758.02   853.77   803.24
A           109.45   103.98   117.11   110.18   14                     P<<0.001
B           28.76    27.32    30.77    28.95    4                      P<<0.001
C           16.73    15.89    17.9     16.84    2                      P<<0.001
D           10.09    9.59     10.8     10.16    1                      P<<0.001

These two datasets have exactly the same ranks within each experiment. For Set Y, however, the Dunn results clearly miss something important. Parametric ANOVAR may be permissible (after all, we don't know for certain that the underlying distribution of median values ISN'T Gaussian).
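A short check of this point, assuming no tied values and using the Q1 critical value z = 2.498: both sets give identical within-experiment ranks, and the rank-based comparison against vehicle flags only drug D in each, even though drug A leaves 97% of the signal in Set X and 14% in Set Y. Function names are illustrative.

```python
from math import sqrt

set_x = {"Vehicle": [2657, 2347, 2784], "A": [2612, 2333, 2636],
         "B": [2271, 2201, 2311], "C": [1907, 1899, 2089], "D": [1439, 1333, 1566]}
set_y = {"Vehicle": [797.92, 758.02, 853.77], "A": [109.45, 103.98, 117.11],
         "B": [28.76, 27.32, 30.77], "C": [16.73, 15.89, 17.9], "D": [10.09, 9.59, 10.8]}


def within_experiment_ranks(table):
    """Rank the treatments (1 = lowest) separately within each experiment."""
    names = list(table)
    ranks = {name: [] for name in names}
    for e in range(len(table[names[0]])):
        ordered = sorted(names, key=lambda name: table[name][e])
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    return ranks


def significant_vs_vehicle(table, z_crit=2.498):
    """Treatments whose pooled-rank ratio against vehicle reaches z_crit."""
    pooled = sorted(v for values in table.values() for v in values)
    rank_of = {v: r for r, v in enumerate(pooled, start=1)}  # assumes no tied values
    n_total = len(pooled)
    mean_rank = {k: sum(rank_of[v] for v in vals) / len(vals) for k, vals in table.items()}
    hits = []
    for name, vals in table.items():
        if name == "Vehicle":
            continue
        sd = sqrt(n_total * (n_total + 1) / 12 * (1 / len(vals) + 1 / len(table["Vehicle"])))
        if abs(mean_rank["Vehicle"] - mean_rank[name]) / sd >= z_crit:
            hits.append(name)
    return hits


print(within_experiment_ranks(set_x) == within_experiment_ranks(set_y))
print(significant_vs_vehicle(set_x), significant_vs_vehicle(set_y))
```

Rank-based tests discard the size of the differences, which is exactly the information that makes Set Y so convincing by eye.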

• Use common sense

• Biological significance not statistical significance

• Compare like with like
• Reduce heterogeneity as far as possible
• Use non-parametric tests
  – Mann-Whitney
  – Wilcoxon's
  – Friedman + Dunn's

• Clear, reproducible cytometry data does not need stats; but if pushed you can risk using parametric stats with care