Transcript fgtut 4047

Statistical Analysis of cDNA
microarrays II
Terry Speed
Department of Statistics, University of California, Berkeley, and
Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research
Outline
Different types of questions asked in
microarray experiments
Cluster analysis
Single gene method
A synthesis
Gene Expression Data
Gene expression data on p genes for n samples

                 mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  …
1          0.46     0.30     0.80     1.51     0.90
2         -0.10     0.49     0.24     0.06     0.46
3          0.15     0.74     0.04     0.10     0.20
4         -0.45    -1.03    -0.79    -0.56    -0.32
5         -0.06     1.06     1.35     1.09    -1.09
           ...      ...      ...      ...      ...
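Gene expression level of gene i in mRNA sample j =
  Log(Red intensity / Green intensity) for two-colour cDNA arrays, or
  Log(Avg. PM - Avg. MM) for oligonucleotide arrays with perfect-match (PM) and mismatch (MM) probes.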
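As a small illustration of the cDNA log-ratio above, here is a minimal sketch. It assumes base-2 logarithms (the base used later for M) and uses made-up intensities; the function name `log_ratio` is hypothetical.

```python
from math import log2

def log_ratio(red, green):
    """Log-ratio M = log2(R/G) for one spot (base 2 is an assumption here)."""
    return log2(red / green)

# Made-up intensities, for illustration only.
m_up = log_ratio(2000.0, 500.0)    # red four times green
m_down = log_ratio(500.0, 2000.0)  # green four times red
m_same = log_ratio(800.0, 800.0)   # equal intensities
```

On this scale a four-fold increase and a four-fold decrease are symmetric about zero (+2 and -2), which is one reason log-ratios are used rather than raw ratios.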
Experiments, horses for courses
mRNA levels compared in many different contexts
— Tumour cell lines
— Different tissues, same organism
— Same tissue, different organisms (wt, ko, tg)
— Same tissue, same organism (trt vs ctl)
— Time course experiments
No single method of analysis can be appropriate
for all. Rather, each type of experiment requires
its own analysis.
Cluster Analysis
Can cluster genes, cell samples, or both.
Strengthens signal when averages are taken
within clusters of genes (Eisen).
Useful (essential?) when seeking new
subclasses of cells, tumours, etc.
Leads to readily interpreted figures.
Clusters
Taken from Nature, February 2000.
Paper by Alizadeh A. et al.,
"Distinct types of diffuse large B-cell lymphoma identified by
gene expression profiling".
Discovering sub-groups
Which genes have changed?
This is a common enough question. We will illustrate one
approach when replicates are available.
GOAL: Identify genes with altered expression in the livers of
one line of mice with very low HDL cholesterol levels
compared to inbred control mice.
Experiment: Apo AI knock-out mouse model
8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6).
16 hybridisations: mRNA from each of the 16 mice is labelled
with Cy5, pooled mRNA from control mice is labelled with
Cy3.
Probes: ~6,000 cDNAs, including 200 related to lipid
metabolism.
Which genes have changed?
1. For each gene and each hybridisation (8 ko + 8 ctl),
use M=log2(R/G).
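2. For each gene form the t statistic:
$$ t = \frac{\bar{M}_{\mathrm{ko}} - \bar{M}_{\mathrm{ctl}}}{\sqrt{\tfrac{1}{8}\left(s_{\mathrm{ko}}^{2} + s_{\mathrm{ctl}}^{2}\right)}} $$
where \bar{M}_{ko} and \bar{M}_{ctl} are the averages of the 8 ko Ms and the 8 ctl Ms,
and s_{ko} and s_{ctl} are the SDs of the 8 ko Ms and the 8 ctl Ms.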
3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values “off the line”.
5. Adjust for multiple testing.
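Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the code used for the experiment: the M values are made up, and the names `t_statistic` and `genes` are hypothetical. The denominator uses the two sample variances separately, as in the formula above.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(ko, ctl):
    """Two-sample t-statistic with unpooled variances.

    ko, ctl: lists of M = log2(R/G) values for one gene.
    """
    num = mean(ko) - mean(ctl)
    # variance() is the sample variance (divisor n - 1), i.e. SD squared.
    den = sqrt(variance(ko) / len(ko) + variance(ctl) / len(ctl))
    return num / den

# Made-up M values for two genes, 8 ko and 8 ctl mice each.
ctl_ms = [0.1, -0.1, 0.0, 0.2, -0.2, 0.1, 0.0, -0.1]
genes = {
    "geneA": ([-3.1, -3.3, -2.9, -3.2, -3.4, -3.0, -3.1, -3.2], ctl_ms),
    "geneB": ([0.2, -0.3, 0.1, 0.0, -0.1, 0.3, -0.2, 0.1], ctl_ms),
}
t_values = {name: t_statistic(ko, ctl) for name, (ko, ctl) in genes.items()}
```

With equal group sizes of 8, the denominator reduces to the square root of one eighth of the sum of the two squared SDs. Here geneA gets a large negative t, while geneB stays close to zero.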
Histogram
ApoA1
Plot of t-statistics
Assigning p-values to
measures of change
• Estimate p-values for each comparison (gene) by
using the permutation distribution of the t-statistics.
• For each of the $\binom{16}{8} = 12{,}870$ possible
permutations of the trt / ctl labels, compute the
two-sample t-statistic t* for each gene.
• The unadjusted p-value for a particular gene is
estimated by the proportion of t*’s greater than
the observed t in absolute value.
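The permutation scheme above can be sketched for a single gene as follows. This is an illustration under stated assumptions: the data are made up, the helper names are hypothetical, and the comparison uses "greater than or equal to" so that the observed labelling counts as one of the 12,870 relabellings.

```python
from itertools import combinations
from math import comb, sqrt

def t_stat(x, y):
    """Two-sample t-statistic with unpooled variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

def permutation_p_value(trt, ctl):
    """Proportion of relabellings whose |t*| is at least the observed |t|."""
    values = trt + ctl
    n, k = len(values), len(trt)
    t_obs = abs(t_stat(trt, ctl))
    count = total = 0
    # Each choice of k indices is one way to assign the "trt" label.
    for idx in combinations(range(n), k):
        chosen = set(idx)
        x = [values[i] for i in range(n) if i in chosen]
        y = [values[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(t_stat(x, y)) >= t_obs - 1e-12:  # small tolerance for rounding
            count += 1
    return count / total, total

# Made-up M values for one gene: 8 treated and 8 control mice.
trt = [-3.1, -3.3, -2.9, -3.2, -3.4, -3.0, -3.1, -3.2]
ctl = [0.1, -0.1, 0.0, 0.2, -0.2, 0.1, 0.0, -0.1]
p_value, n_perm = permutation_p_value(trt, ctl)
```

Because the two groups are completely separated in this toy example, only the observed labelling and its mirror image reach the observed |t|, so the smallest attainable two-sided p-value is 2/12,870.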
Multiple Testing
Problem: We have just performed ~6000 tests!
=> need to control the family-wise false positive rate (Type I
error).
=> use adjusted p-values.
Bonferroni adjustment. Multiply p-values by number of tests.
Too conservative, doesn’t take into account the dependence
structure between the genes.
Westfall & Young. Estimate adjusted p-values using the
permutation distribution of statistics which take into
account the dependence structure between the genes.
Less conservative.
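The two adjustments can be compared on a toy data set. The sketch below assumes the single-step "maxT" variant of the Westfall & Young approach, which is only one member of that family; the slides do not say which variant was used. Data and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def t_stat(x, y):
    """Two-sample t-statistic with unpooled variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

def abs_t_all_genes(data, trt_idx):
    """|t| for every gene, given which sample indices carry the trt label."""
    n = len(data[0])
    trt = set(trt_idx)
    result = []
    for row in data:
        x = [row[j] for j in range(n) if j in trt]
        y = [row[j] for j in range(n) if j not in trt]
        result.append(abs(t_stat(x, y)))
    return result

def adjusted_p_values(data, trt_idx):
    """Return (unadjusted, Bonferroni, single-step maxT) p-values per gene."""
    n, k, m = len(data[0]), len(trt_idx), len(data)
    observed = abs_t_all_genes(data, trt_idx)
    raw_count, maxt_count, total = [0] * m, [0] * m, 0
    for idx in combinations(range(n), k):
        perm = abs_t_all_genes(data, idx)
        max_t = max(perm)  # uses all genes jointly, so dependence is kept
        total += 1
        for g in range(m):
            raw_count[g] += perm[g] >= observed[g] - 1e-12
            maxt_count[g] += max_t >= observed[g] - 1e-12
    raw = [c / total for c in raw_count]
    bonf = [min(1.0, m * p) for p in raw]
    maxt = [c / total for c in maxt_count]
    return raw, bonf, maxt

# Made-up data: 3 genes, samples 0-3 treated, samples 4-7 control.
data = [
    [2.1, 1.9, 2.3, 2.0, 0.1, -0.2, 0.0, 0.2],
    [0.3, -0.1, 0.2, 0.0, 0.1, -0.3, 0.4, -0.2],
    [-0.5, 0.6, 0.1, -0.2, 0.3, -0.4, 0.2, 0.0],
]
raw, bonf, maxt = adjusted_p_values(data, [0, 1, 2, 3])
```

The maxT adjustment compares each observed |t| with the permutation distribution of the largest |t*| over all genes. Since the same relabelling is applied to every gene at once, the correlation between genes is carried into the null distribution.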
Apo AI: Adjusted and unadjusted p-values for the
50 genes with the largest absolute t-statistics.
Apo AI. Genes with adjusted p-value < 0.01
Gene                           Adjusted p        t     Num    Den
ApoAI                                0.00   -22.85   -3.19   0.14
Sterol C5-desaturase                 0.00   -13.14   -1.06   0.08
Catechol O-methyltransferase         0.00   -12.21   -1.90   0.16
Apo CIII                             0.00   -11.88   -1.02   0.09
ApoAI                                0.00   -11.44   -3.09   0.27
EST                                  0.00    -9.11   -1.02   0.11
Apo CIII                             0.00    -8.36   -1.04   0.12
Sterol desaturase                    0.01    -7.72   -1.04   0.13
Limitations
Cluster analyses:
1) are usually outside the normal framework of statistical inference;
2) are less appropriate when only a few genes are likely to change;
3) need lots of experiments.
Single gene tests:
1) may be too noisy in general to show much;
2) may not reveal coordinated effects of positively correlated genes;
3) are hard to relate to pathways.
A synthesis
We and others (Stanford) are working on methods
which try to combine the best of both of the
preceding approaches.
Try to find clusters of genes and average their
responses to reduce noise and enhance
interpretability.
Use testing to assign significance with averages of
clusters of genes as we did with single genes.
Clustering genes
Let p = number of genes.
1. Calculate within class correlation.
2. Perform hierarchical clustering, which will produce
(2p-1) clusters of genes.
3. Average within clusters of genes.
4. Perform testing on averages of clusters of genes
as if they were single genes.
E.g. p=5 (dendrogram with leaves 1 2 3 4 5):
Cluster 6=(1,2)
Cluster 7=(1,2,3)
Cluster 8=(4,5)
Cluster 9=(1,2,3,4,5)
Counting the 5 single genes as clusters 1 to 5, this gives 2p-1 = 9 clusters.
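The clustering and averaging steps can be sketched as below. Several choices here are assumptions, not taken from the slides: "within class correlation" is read as the Pearson correlation after subtracting each class mean, the merging rule is average linkage, and the data and function names are made up. The resulting cluster averages could then be tested exactly like single genes.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def centre_within_class(row, labels):
    """Subtract each class mean so class differences do not drive the correlation."""
    means = {}
    for c in set(labels):
        vals = [v for v, lab in zip(row, labels) if lab == c]
        means[c] = sum(vals) / len(vals)
    return [v - means[lab] for v, lab in zip(row, labels)]

def hierarchical_clusters(data, labels):
    """Return all 2p-1 clusters: ids 1..p are single genes, p+1..2p-1 are merges."""
    p = len(data)
    resid = [centre_within_class(row, labels) for row in data]
    corr = [[pearson(resid[i], resid[j]) for j in range(p)] for i in range(p)]
    clusters = {i + 1: (i + 1,) for i in range(p)}
    active, next_id = list(clusters), p + 1
    while len(active) > 1:
        best = None
        for pos, a in enumerate(active):
            for b in active[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                sim = sum(corr[i - 1][j - 1] for i, j in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best  # merge the two most correlated clusters
        clusters[next_id] = tuple(sorted(clusters[a] + clusters[b]))
        active = [c for c in active if c not in (a, b)] + [next_id]
        next_id += 1
    return clusters

def cluster_averages(data, clusters):
    """Average the member genes of each cluster, sample by sample."""
    n = len(data[0])
    return {cid: [sum(data[g - 1][j] for g in members) / len(members)
                  for j in range(n)]
            for cid, members in clusters.items()}

# Made-up data: p = 5 genes, 4 ctl then 4 trt samples.
labels = ["ctl"] * 4 + ["trt"] * 4
data = [
    [1.0, -1.0, 2.0, -2.0, 4.0, 2.0, 5.0, 1.0],
    [1.1, -0.9, 2.0, -2.1, 4.0, 2.1, 5.1, 0.9],
    [1.5, -0.2, 1.0, -2.0, 4.4, 2.9, 3.8, 1.2],
    [1.0, 0.9, -1.0, -1.1, 1.1, 1.0, -0.9, -1.0],
    [0.8, 1.2, -0.9, -1.0, 0.9, 1.1, -1.2, -0.8],
]
clusters = hierarchical_clusters(data, labels)
averages = cluster_averages(data, clusters)
```

Cluster ids follow the order of merging, so the numbering of the merged clusters depends on the data; only the total of 2p-1 clusters is fixed.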
Data - Ro1
Transgenic mice with a modified Gi coupled receptor (Ro1).
Experiment: induced expression of Ro1 in mice.
8 control (ctl) mice
9 treatment mice, eight weeks after Ro1 was induced.
Long-term question: Which groups of genes work together?
Based on paper: Conditional expression of a Gi-coupled
receptor causes ventricular conduction delay and a lethal
cardiomyopathy, see Redfern C. et al. PNAS, April 25, 2000.
http://www.pnas.org also
http://www.GenMAPP.org/ (Conklin lab, UCSF)
Histogram
Cluster of genes
(1703, 3754)
Top 15 averages of gene clusters
T        Group ID   Genes in cluster
-13.4     7869      = (1703, 3754); might be influenced by 3754
-12.1     3754
 11.8     6175
 11.7     4689
 11.3     6089
 11.2     1683
-10.7     2272
 10.7     9955      = (6194, 1703, 3754)
 10.7     5179
 10.6     3916
-10.4     8255      = (4572, 4772, 5809)
-10.4     4772
-10.4    10548      = (2534, 1343, 1954)
 10.3     9476      = (6089, 5455, 3236, 4014)

Correlation matrices shown on the slide:

1    0.7  0.7
0.7  1    0.8
0.7  0.8  1

1    0.5  0.5
0.5  1    0.8
0.5  0.8  1
Limitation
Hard to extend this method to negatively
correlated clusters of genes. Need to consider
together with other methods.
Need to identify high averages of clusters of
genes that are due to high averages from
sub-clusters of those genes.
Acknowledgments
Yee Hwa Yang
Sandrine Dudoit
Natalie Roberts
Ben Bolstad
Ingrid Lonnstedt
Karen Vranizan
Matt Callow (LBL)
Bruce Conklin (UCSF)
WEHI Bioinformatics group