Transcript Slide 1

Genomics on Obesity
June 7th-8th 2007, Toulouse
Laboratoire de Pharmacologie et Toxicologie INRA, Toulouse, France
Statistics and bioinformatics applied to
–omics technologies: Part I
Pascal GP MARTIN
PM, june 07
[email protected]
Contents
Abstract
I.
Data pre-processing and normalization
Slide
3
4-7
8-11
III. Class comparison
12-17
IV. Class discovery
18-23
V. Conclusion
24-25
PM, june 07
II. Different answers for different questions
VI. References
26-29
Abstract
High throughput technologies are becoming increasingly used in biology. They are
applied, for example, to the genomic, transcriptomic, proteomic and metabolomic
fields. They generate large datasets from which relevant biological knowledge
needs to be extracted from relatively noisy data.
The characteristics of these datasets such as the measure of thousands of
variables on a limited set of samples, and the presence of missing values and their
noisy nature, require the use of statistical methods which are able to handle such
features. Data filtering, transformation and normalization are among the first
non-trivial steps in the process of analyzing these datasets and require both
statistical skills and biological considerations in order to yield informative data.
Equally the purpose of the experiment and its statistical analysis must be clear,
before the experiment is done: it generally falls into one of three categories : 1)
class comparison which aims at identifying variables exhibiting a class effect
(genes that are differentially expressed among the groups of samples in the case
of transcriptomics), 2) class discovery which aims at identifying subgroups of
samples or variables sharing similar profiles and 3) class prediction which aims at
building a classifier able to correctly predict the class of new, unknown samples
or variables.
The most common statistical tools used to answer the first two categories of
questions are presented along with the major drawbacks in their practical use
I. The data
 p x n matrix with p=genes and n=samples (generally p>>n)
methods handling p>>n
 Relatively noisy data (background, marks, image analysis…)
Replicates, filters, methods handling missing values
 Generally contains a lot of non-informative data (not all genes
are expressed in a given sample)
Filters
 Most often requires data transformation and normalization
PM, june 07
correcting for asymmetric distributions, biases
I. Sources of bias and noise
Sources of variations
Systematic
Normalization
(generates data that are comparable)
Random
Replicates
PM, june 07
Technical
Biological
I. Normalization
A set of genes is used as a reference and is considered not to change on average
across samples and conditions.
!! WARNING !! : the error of the mean for the reference set adds up to the error of each gene
 All genes (global mean/median, genomic DNA)
+ small error of the mean / - not always applicable
 Internal controls (housekeeping genes)
+ same systematic bias / - hard choice, small number
 External controls (spiked-in RNAs)
+ small variations / - highly sensitive to errors in RNA sample assay
 Stable genes (ranks,…)
PM, june 07
+ small error of the mean / - not always applicable
 Lowess (normalization by intensity ranges)
+ small error of the mean / - may underestimate ratios if many genes are regulated
I. Data transformation
 log transformation
400
200
100
Frequency
300
200
150
100
0
Frequency
log
50

Histogram of y
0
log( a.b)log( a)log( b)
log a log( a)log( b)
b
loga22.log( a)
Histogram of x
Multiplicative effects become additive effects
0
2
4
6
8
0.0
0.5
x
1.0
1.5
2.0
2.5
3.0
3.5
y
 z-scores
{X1, X2, …, Xn} are the data. X is the mean and Sx the standard deviation
PM, june 07
xi  x
zi 
sx
Centered data : Z = 0
Zi has no physical units
Scaled data : Sz = 1
 ranks (robustness but loss of quantitative information)
II. Different answers for different questions
Class comparison
Question : « Which genes are differentially expressed between the two (or
more) experimental conditions »
Examples : treated vs untreated, transgenic vs wild-type, obese vs lean,…
PM, june 07
Tools : threshold definition, statistical tests, analysis of variance, other
modeling tools (regression, mixed effects models, mixture models…)
II. Different answers for different questions
Class discovery
Question : « Can I identify homogeneous subgroups of genes (or sample)
which are characterized by similar expression profiles»
Examples : search for co-regulated genes, for tumor subtypes, for outliers,
data exploration …
PM, june 07
Tools : pseudo-profiles, unsupervised clustering, factorial methods (Principal
Component Analysis, Multidimensional Scaling, …)
II. Different answers for different questions
Class prediction
Question : « Can I find a rule to classify my samples (or genes) with a
minimal error rate and apply this rule to predict the class of a new sample (or
gene)»
Examples : search for genes function, predicting tumor type, prognosis
(metastasis, transplant rejection, survival rate…), biomarker of exposure
PM, june 07
Tools : classification (supervised) : discriminant analysis, classification tree
and random forests, support vector machines, generalized linear model
II. Different answers for different questions
Other bioinformatic tools / interpreting the results
Promoter analysis to understand the basis of gene co-regulation (Genomatix
tools, TransFac, …)
Pathway analysis to understand the consequences of gene regulations at the
pathway/network level (KEGG, cMAP, Ingenuity, …)
Integrating the results with biological knowledge (bibliographic: PubMed,
Ingenuity, bibliosphere,…, Gene-centered: Gene Ontology, Panther,…)
Using the results to model biological systems (collaborations with physics,
chemistry, informatics, mathematics, …)
PM, june 07
Data reconciliation to understand the links between different levels of
observations (transcriptome/proteome, transcriptome/metabolome,…)
…
III. Class comparison: the t-test
If M1, M2, …, Mn are n log(ratios) from our microarray experiment
(independent and following N(,2) where  and 2 are unknown).
Null Hypothesis H0:  = 0  log(ratio) = 0  ratio = 1
Alternative hypothesis H1:  ≠ 0 (two-sided test)
If H0 is true, then:
Tobserved 
1 n
where M   M i
n i 1
M 0

S/ n
nM
 Student (n-1 )
S
is the estimation of the mean
1 n
2
M i  M  is the estimation of the variance
and S 

n  1 i 1
PM, june 07
2
Two main problems : instability of the estimations (small sample size
+ outliers) and dealing with the multiplicity of the test statistics.
III. Class comparison: the t-test
T
nM
S
Instability of the estimations
Stabilizing by using robust statistics (ranks) or estimators (median)
Stabilizing S2 by making groups of genes or by modifying the test
statistic (e.g. Efron B et al., 2000 or Lönnstedt I & Speed TP, 2001)
•Efron B et al. Microarrays and their use in a comparative experiment, Technical Report, Dept of Statistics, Stanford Univ., Stanford, CA, 2000
•Lönnstedt I & Speed TP. Replicated microarray data, Statistica Sinica, 12(1):31-46, 2001.
The multiplicity problem
Analogy (a bit sordid…) : Consider a russian roulette «game» where 5 bullets are
randomly placed in a gun which can contain 100 bullets. A «player» has a 5% chance to
loose. If 100 players play at the same time, then 5 players will loose on average.
For a given gene and a given type I error rate (a=5%), we know that this gene has a 5%
probability to be a false positive. Thus, when doing one test at a=5% for each gene, we know that
the number of false positives will be 5% times the number of tests (5 FP for 100 tests, 50 FP for
1000 tests,…, 2200 FP for 44000 tests)
PM, june 07
Control the false discovery rate (proportion of FP among the rejected H0)
instead of the probability to get one FP (Family-wise error rate)
• Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A practical approach to multiple testing. J. Roy. Stat. Soc. 1995, 57(1):289-300
• Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18(1):71-103
• Tusher VG, Tibshirani R, Chu G. SAM applied to the ionizing radiation response. PNAS. 2001, 98:5116-21.
III. Class comparison: analysis of variance
M1, M2, …, Mn are n log(ratios) from our microarray experiment. Their
variance is estimated by :
-
+
M
n
ˆ 
i 1
i
M)
n 1
SS
PM, june 07
2
M
 (M
2
M
III. Class comparison: analysis of variance
Example : 1 factor (Treatment) with 3 levels (A, B, C). Expression of gene X.
Treatment
Treatment
Treatment
A
B
C
-
-
-
yA
yB
yC
Expression
of gene X
3
y
Expr of X
~
Treatment
+
e
Total SS
=
Treatment SS
+
Residual SS
5
 ( y
i 1 k 1
PM, june 07
ik
 y)
3
2
=
5
 ( y
i 1 k 1
i
 y)
3
2
+
5
 ( y
i 1 k 1
ik
 yi )
2
III. Class comparison: analysis of variance
PM, june 07
N = 15 observations, I=3 levels for the «treatment» factor
Sources
of
variation
Degrees
of
freedom
Sum of
Squares
(SS)
Mean Square
(=Variance)
Treatment
I-1 = 2
SStreatment
MStreat = SStreat / (I-1)
Résiduals
N-I = 12
SSresid
MSresid = SSresid / (N-I)
Total
variation
N-1 = 14
SStotal
F Statistic (Fisher) p-value
F = MStreat /
MSresid
0.001
Here, the treatment factor has a significant effect on the expression of gene X which means
that expression of gene X is different between at least one of the 3 treatments compared to the
other ones (we do not not which treatment is different from the other at this stage).
III. Class comparison: analysis of variance
When can I use the analysis of variance ?

Homoscedasticity (Homogeneity of the variances)
1 eˆ ijk(résiduals)
Can be assessed graphically
2
eˆ ijk(résiduals)
1 Not OK. Try to use a transformation
such as log, cube root, …
2 OK, good for anova
Fitted value
0
0
 Independance (the effect of a factor on an individual does not influence the
other individuals)
Generally postulated
residuals
PM, june 07
 Normality of the residuals
Quantiles of N(0,1)
Can be tested (Shapiro-Wilk, KS) but generally
assessed graphically through QQplot of residuals
IV. Class discovery
SEVERAL AIMS (object = sample or variable) : reducing the dimension,
grouping the objects together, understanding why the objects are grouped
together, examining the objects and their distances, taking into account the
links between the objects
Gene 2
PM, june 07
ns
ns
Gene 1
IV. Class discovery
1
j
p
L. Lebart, A. Morineau, M. Piron
1
X (n,p) =
Expression level of gene j
for sample i
xij
i
n
Row vector
1
j
Column vector
p
j
1
j’
i
i’
i
Factorial methods
(PCA, MDS…)
n
n
points in ℝp
ℝp
PM, june 07
p
Clustering methods (hierarchical,
k-means, SOM…)
points in ℝn
ℝn
Configuration of a cloud of points in a multidimensional space
IV. Ascendant hierarchical clustering
AIM : group together (or cluster) the objects in «homogeneous» groups
Procedure : A distance measure between objects (single or groups) is defined
 Beginning: each object is a class
 Algorithm: at each step; the 2 closest objects are grouped together
 End : one class groups all the objects
RESULT : a tree (dendrogram) is built
PM, june 07
Initially, 2 choices must be made : the distance metric and the agglomerative
criterion
Then, the number of clusters must be chosen
PPAR.12
PPAR.3.2
PPAR.9
PPAR.6.1
PPAR.3.3
PPAR.3
PPAR.3.1
PPAR.0.1
PPAR.0.3
PPAR.0.2
PPAR.0
PPAR.9.2
PPAR.12.3
PPAR.12.1
PPAR.9.3
PPAR.12.2
PPAR.9.1
PPAR.6.2
PPAR.6.3
PPAR.6
PPAR.48.2
PPAR.48.3
PPAR.48.1
PPAR.36
PPAR.36.3
PPAR.36.2
PPAR.36.1
PPAR.48
PPAR.24.2
PPAR.24
PPAR.24.3
PPAR.24.1
PPAR.18.3
PPAR.18
PPAR.18.2
PPAR.18.1
PPAR.72
PPAR.72.3
PPAR.60.2
PPAR.60
PPAR.60.3
PPAR.60.1
PPAR.72.1
PPAR.72.2
wt.0.1
wt.0
wt.3.2
wt.12.1
wt.12
wt.9.2
wt.9
wt.9.1
wt.12.2
wt.3
wt.3.1
wt.3.3
wt.9.3
wt.6.1
wt.6.3
wt.6.2
wt.6
wt.18.3
wt.18
wt.18.2
wt.18.1
wt.60.1
wt.60.3
wt.72
wt.72.2
wt.60
wt.72.1
wt.60.2
wt.72.3
wt.48.3
wt.48.1
wt.48.2
wt.48
wt.36
wt.36.1
wt.36.3
wt.36.2
wt.24.1
wt.24
wt.24.2
PGP Martin et al, unpublished data
IV. Hierarchical clustering and Heatmaps
PPARa -/Wild-type
LH
ACBP
LDLr
AM2R
OCTN2
Elo5
Elo1
apoC3
S14
FAS
CYP7a
cHMGCoAS
FPPS
ALAT
FDFT
HMGCoAred
GSTpi2
CYP2c29
SPI1
LCE
apoB
apoE
b.actin
GSTmu
catalase
LEF1
GSTa
BSEP
ALDH1
G6Pase
Elo3
LFABP
apoA.IV
PEPCK
Lpin1
GK
CEBPg
UCP2
CYP27a1
CYP2b13
LPK
CAR1
delta6
cMOAT
Tpbeta
MDR2
LCPT1
PDK4
ACOTH
ADISP
X36b4
Lpin2
FIAF
CYP8b1
CPT2
HPNCL
cjun
PECI
GA3PDH
Eci
MCAD
CYP2b10
ASAT
CYP3A11
apoA.V
CytB
apoA.I
C16SR
SCD1
PMDCI
BIEN
mHMGCoAS
AOX
CYP4A14
CYP4A10
Lipogenesis,
cholesterogenesis,
fatty acid transport
Downregulated in
PPARa -/- (catalase, L-
FABP)
PPARa target genes
(HPNCL, CPT2, MCAD)
PPARa target genes
(AOX, Cyp4a, mHMGCoAs)
IV. Principal Component Analysis
2d principal component
1st principal component
PM, june 07
Centre of
gravity of
the fish
PCA allows us to define spaces of lower dimension (compared to the
initial space) on which the projection of the initial cloud of point
presents a minimum of distortion. It means that these new spaces
capture a maximum of information (or variability).
IV. Principal Component Analysis
 Principal Component Analysis on hepatic fatty acid composition (21 fatty acids x 60 samples):
Variables
REF
EF
R
EF R EF
R
R EF
SAT
SAT
SAT
SAT
SAT SAT
SAT
SATSAT
SATSAT
SAT
-4
0.5
D HA
D HA
DDHA
HA
w3
w3
-6
C16.1w9
w3 w3
w3
-2
0
PC1 (32%)
C18.1w9
C18.1w7
C14.0
C16.1w7
PC2
R EF
R EFR R
EFR EF
R EF EF
C20.3w6
C18.2w6
0.0
w6
DHA
HA
D
DDHA
HA
D HA
HA
DDHA
C22.4w6
C22.5w6
C20.2w6
C20.4w6
D HA
C20.1w9
C18.3w6
C18.0
C20.3w9
C16.0
-0.5
Wild-type
Fatty acids
C18.3w3
C20.3w3
w3
C22.5w3
C20.5w3
w3
w3
w3 w3
w3
w3
2
4
-1.0
-0.5
0.0
Axe 1
PC1
PM, june 07
C22.6w3
-1.0
w6
w6
w6
w6
w6
w6 w6w6
w6
w6
Axe 2
PC2 (26%)
w6
Mice
PPARa -/-
-2
0
2
4
1.0
1st Plane (PC1+PC2): 58% of the experimental variance is captured
3rd PC: 16% = genotype effects (not shown)
 Dietary FA modulate hepatic FA composition
0.5
1.0
Conclusion
The Statistician and the Biologist
(reproduced from a presentation of R. Tibshirani)
They are both being executed, and are each granted one last request
The Statistician asks that he be allowed to give one last lecture on his
Grand Theory of Statistics.
The Biologist asks that he be executed first.
… and the last word is for R.A. Fisher…
PM, june 07
« To call in a statistician after the experiment is done may be no more
than asking him to perform a postmortem examination: he may be
able to say what the experiment died of »
Thank you for your attention !!
PM, june 07
Laboratoire de Statistique et Probabilités
Some references: bioinformatics/databases
Gene Ontology (GO) : http://www.geneontology.org/
Outils de recherche et d’analyse GO : http://fatigo.bioinfo.cnio.es/
http://www.godatabase.org/cgi-bin/amigo/go.cgi
http://david.niaid.nih.gov/david/
http://vortex.cs.wayne.edu:8080/index.jsp
Réseaux : http://www.genmapp.org/introduction.asp
http://www.genome.jp/kegg/
Analyses de promoteurs : http://www.genomatix.de/
http://www.cbrc.jp/research/db/TFSEARCH.html
PM, june 07
http://www.gene-regulation.com/
Some references
Books on microarray data analysis :
Statistical Analysis of Gene Expression Microarray Data. Terry Speed (Chapman & HALL)
Design and Analysis of DNA Microarray Investigations. Richard M. Simon et al. (Springer)
The Analysis of Gene Expression Data. Giovanni Parmigiani et al. (Springer)
Nature Genetics 2002 : The Chipping Forecast II. Vol 32 supplement 2. pp 461-552
General Statistics books (in french) :
Statistique Théorique et Appliquée –Tomes 1 et 2. Pierre Dagnelie (De Boeck Univ.)
Analyse statistique à plusieurs variables. Pierre Dagnelie (Tec&Doc)
Initiation aux traitements statistiques. B. Escofier et J. Pagès (Presses Univ. de Rennes)
Statistique exploratoire multidimensionnelle. L. Lebart, A. Morineau, M. Piron (Dunod)
PM, june 07
Analyses factorielles simples et multiples. B. Escofier et J. Pagès (Dunod)
Some references
Articles on data normalization :
Normalization for cDNA microarray data: a robust composite method addressing single and
multiple slide systematic variation. Yang YH et al. Nucleic Acids Res. 2002.
A comparison of normalization methods for high density oligonucleotide array data based on
variance and bias. Bolstad BM et al. Bioinformatics. 2003.
Empirical evaluation of data transformations and ranking statistics for microarray analysis. Qin LX
and Kerr KF. Nucleic Acids Res. 2004.
Articles on experimental design :
Statistical design and the analysis of gene expression microarray data. Kerr MK and Churchill GA.
Genet Res. 2001.
Experimental design for gene expression microarrays. Kerr MK and Churchill GA. Biostatistics.
2001.
Questions and answers on design of dual-label microarrays for identifying differentially expressed
genes. Dobbin K, Shih JH and Simon R. J Natl Cancer Inst. 2003.
Experimental design and low-level analysis of microarray data. Bolstad BM et al. Int Rev
Neurobiol. 2004.
PM, june 07
Some references
Articles on linear models for microarray data analysis :
Assessing gene significance from cDNA microarray expression data via mixed models. Wolfinger
RD et al. J Comput Biol. 2001.
Linear models for microarray data analysis: hidden similarities and differences. Kerr MK. J
Comput Biol. 2003.
Analysis of oligonucleotide array experiments with repeated measures using mixed models. Li H
et al. BMC Bioinformatics. 2004.
Articles on multidimentional exploratory analysis :
Cluster analysis and display of genome-wide expression patterns. Eisen MB et al. Proc Natl Acad
Sci USA. 1998.
Large-scale clustering of cDNA-fingerprinting data. Herwig R et al. Genome Res. 1999.
PM, june 07
Using biplots to interpret gene expression patterns in plants. Chapman S et al. Bioinformatics.
2002.
NOTE : These references are only provided to illustrate the intensity and diversity of the research efforts in the field of
microarray data analysis and many more publications could have been worth mentioned.