Transcript Slide 1
Genomics on Obesity June 7th-8th 2007, Toulouse Laboratoire de Pharmacologie et Toxicologie INRA, Toulouse, France Statistics and bioinformatics applied to –omics technologies: Part I Pascal GP MARTIN PM, june 07 [email protected] Contents Abstract I. Data pre-processing and normalization Slide 3 4-7 8-11 III. Class comparison 12-17 IV. Class discovery 18-23 V. Conclusion 24-25 PM, june 07 II. Different answers for different questions VI. References 26-29 Abstract High throughput technologies are becoming increasingly used in biology. They are applied, for example, to the genomic, transcriptomic, proteomic and metabolomic fields. They generate large datasets from which relevant biological knowledge needs to be extracted from relatively noisy data. The characteristics of these datasets such as the measure of thousands of variables on a limited set of samples, and the presence of missing values and their noisy nature, require the use of statistical methods which are able to handle such features. Data filtering, transformation and normalization are among the first non-trivial steps in the process of analyzing these datasets and require both statistical skills and biological considerations in order to yield informative data. Equally the purpose of the experiment and its statistical analysis must be clear, before the experiment is done: it generally falls into one of three categories : 1) class comparison which aims at identifying variables exhibiting a class effect (genes that are differentially expressed among the groups of samples in the case of transcriptomics), 2) class discovery which aims at identifying subgroups of samples or variables sharing similar profiles and 3) class prediction which aims at building a classifier able to correctly predict the class of new, unknown samples or variables. The most common statistical tools used to answer the first two categories of questions are presented along with the major drawbacks in their practical use I. The data p x n matrix with p=genes and n=samples (generally p>>n) methods handling p>>n Relatively noisy data (background, marks, image analysis…) Replicates, filters, methods handling missing values Generally contains a lot of non-informative data (not all genes are expressed in a given sample) Filters Most often requires data transformation and normalization PM, june 07 correcting for asymmetric distributions, biases I. Sources of bias and noise Sources of variations Systematic Normalization (generates data that are comparable) Random Replicates PM, june 07 Technical Biological I. Normalization A set of genes is used as a reference and is considered not to change on average across samples and conditions. !! WARNING !! : the error of the mean for the reference set adds up to the error of each gene All genes (global mean/median, genomic DNA) + small error of the mean / - not always applicable Internal controls (housekeeping genes) + same systematic bias / - hard choice, small number External controls (spiked-in RNAs) + small variations / - highly sensitive to errors in RNA sample assay Stable genes (ranks,…) PM, june 07 + small error of the mean / - not always applicable Lowess (normalization by intensity ranges) + small error of the mean / - may underestimate ratios if many genes are regulated I. Data transformation log transformation 400 200 100 Frequency 300 200 150 100 0 Frequency log 50 Histogram of y 0 log( a.b)log( a)log( b) log a log( a)log( b) b loga22.log( a) Histogram of x Multiplicative effects become additive effects 0 2 4 6 8 0.0 0.5 x 1.0 1.5 2.0 2.5 3.0 3.5 y z-scores {X1, X2, …, Xn} are the data. X is the mean and Sx the standard deviation PM, june 07 xi x zi sx Centered data : Z = 0 Zi has no physical units Scaled data : Sz = 1 ranks (robustness but loss of quantitative information) II. Different answers for different questions Class comparison Question : « Which genes are differentially expressed between the two (or more) experimental conditions » Examples : treated vs untreated, transgenic vs wild-type, obese vs lean,… PM, june 07 Tools : threshold definition, statistical tests, analysis of variance, other modeling tools (regression, mixed effects models, mixture models…) II. Different answers for different questions Class discovery Question : « Can I identify homogeneous subgroups of genes (or sample) which are characterized by similar expression profiles» Examples : search for co-regulated genes, for tumor subtypes, for outliers, data exploration … PM, june 07 Tools : pseudo-profiles, unsupervised clustering, factorial methods (Principal Component Analysis, Multidimensional Scaling, …) II. Different answers for different questions Class prediction Question : « Can I find a rule to classify my samples (or genes) with a minimal error rate and apply this rule to predict the class of a new sample (or gene)» Examples : search for genes function, predicting tumor type, prognosis (metastasis, transplant rejection, survival rate…), biomarker of exposure PM, june 07 Tools : classification (supervised) : discriminant analysis, classification tree and random forests, support vector machines, generalized linear model II. Different answers for different questions Other bioinformatic tools / interpreting the results Promoter analysis to understand the basis of gene co-regulation (Genomatix tools, TransFac, …) Pathway analysis to understand the consequences of gene regulations at the pathway/network level (KEGG, cMAP, Ingenuity, …) Integrating the results with biological knowledge (bibliographic: PubMed, Ingenuity, bibliosphere,…, Gene-centered: Gene Ontology, Panther,…) Using the results to model biological systems (collaborations with physics, chemistry, informatics, mathematics, …) PM, june 07 Data reconciliation to understand the links between different levels of observations (transcriptome/proteome, transcriptome/metabolome,…) … III. Class comparison: the t-test If M1, M2, …, Mn are n log(ratios) from our microarray experiment (independent and following N(,2) where and 2 are unknown). Null Hypothesis H0: = 0 log(ratio) = 0 ratio = 1 Alternative hypothesis H1: ≠ 0 (two-sided test) If H0 is true, then: Tobserved 1 n where M M i n i 1 M 0 S/ n nM Student (n-1 ) S is the estimation of the mean 1 n 2 M i M is the estimation of the variance and S n 1 i 1 PM, june 07 2 Two main problems : instability of the estimations (small sample size + outliers) and dealing with the multiplicity of the test statistics. III. Class comparison: the t-test T nM S Instability of the estimations Stabilizing by using robust statistics (ranks) or estimators (median) Stabilizing S2 by making groups of genes or by modifying the test statistic (e.g. Efron B et al., 2000 or Lönnstedt I & Speed TP, 2001) •Efron B et al. Microarrays and their use in a comparative experiment, Technical Report, Dept of Statistics, Stanford Univ., Stanford, CA, 2000 •Lönnstedt I & Speed TP. Replicated microarray data, Statistica Sinica, 12(1):31-46, 2001. The multiplicity problem Analogy (a bit sordid…) : Consider a russian roulette «game» where 5 bullets are randomly placed in a gun which can contain 100 bullets. A «player» has a 5% chance to loose. If 100 players play at the same time, then 5 players will loose on average. For a given gene and a given type I error rate (a=5%), we know that this gene has a 5% probability to be a false positive. Thus, when doing one test at a=5% for each gene, we know that the number of false positives will be 5% times the number of tests (5 FP for 100 tests, 50 FP for 1000 tests,…, 2200 FP for 44000 tests) PM, june 07 Control the false discovery rate (proportion of FP among the rejected H0) instead of the probability to get one FP (Family-wise error rate) • Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A practical approach to multiple testing. J. Roy. Stat. Soc. 1995, 57(1):289-300 • Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18(1):71-103 • Tusher VG, Tibshirani R, Chu G. SAM applied to the ionizing radiation response. PNAS. 2001, 98:5116-21. III. Class comparison: analysis of variance M1, M2, …, Mn are n log(ratios) from our microarray experiment. Their variance is estimated by : - + M n ˆ i 1 i M) n 1 SS PM, june 07 2 M (M 2 M III. Class comparison: analysis of variance Example : 1 factor (Treatment) with 3 levels (A, B, C). Expression of gene X. Treatment Treatment Treatment A B C - - - yA yB yC Expression of gene X 3 y Expr of X ~ Treatment + e Total SS = Treatment SS + Residual SS 5 ( y i 1 k 1 PM, june 07 ik y) 3 2 = 5 ( y i 1 k 1 i y) 3 2 + 5 ( y i 1 k 1 ik yi ) 2 III. Class comparison: analysis of variance PM, june 07 N = 15 observations, I=3 levels for the «treatment» factor Sources of variation Degrees of freedom Sum of Squares (SS) Mean Square (=Variance) Treatment I-1 = 2 SStreatment MStreat = SStreat / (I-1) Résiduals N-I = 12 SSresid MSresid = SSresid / (N-I) Total variation N-1 = 14 SStotal F Statistic (Fisher) p-value F = MStreat / MSresid 0.001 Here, the treatment factor has a significant effect on the expression of gene X which means that expression of gene X is different between at least one of the 3 treatments compared to the other ones (we do not not which treatment is different from the other at this stage). III. Class comparison: analysis of variance When can I use the analysis of variance ? Homoscedasticity (Homogeneity of the variances) 1 eˆ ijk(résiduals) Can be assessed graphically 2 eˆ ijk(résiduals) 1 Not OK. Try to use a transformation such as log, cube root, … 2 OK, good for anova Fitted value 0 0 Independance (the effect of a factor on an individual does not influence the other individuals) Generally postulated residuals PM, june 07 Normality of the residuals Quantiles of N(0,1) Can be tested (Shapiro-Wilk, KS) but generally assessed graphically through QQplot of residuals IV. Class discovery SEVERAL AIMS (object = sample or variable) : reducing the dimension, grouping the objects together, understanding why the objects are grouped together, examining the objects and their distances, taking into account the links between the objects Gene 2 PM, june 07 ns ns Gene 1 IV. Class discovery 1 j p L. Lebart, A. Morineau, M. Piron 1 X (n,p) = Expression level of gene j for sample i xij i n Row vector 1 j Column vector p j 1 j’ i i’ i Factorial methods (PCA, MDS…) n n points in ℝp ℝp PM, june 07 p Clustering methods (hierarchical, k-means, SOM…) points in ℝn ℝn Configuration of a cloud of points in a multidimensional space IV. Ascendant hierarchical clustering AIM : group together (or cluster) the objects in «homogeneous» groups Procedure : A distance measure between objects (single or groups) is defined Beginning: each object is a class Algorithm: at each step; the 2 closest objects are grouped together End : one class groups all the objects RESULT : a tree (dendrogram) is built PM, june 07 Initially, 2 choices must be made : the distance metric and the agglomerative criterion Then, the number of clusters must be chosen PPAR.12 PPAR.3.2 PPAR.9 PPAR.6.1 PPAR.3.3 PPAR.3 PPAR.3.1 PPAR.0.1 PPAR.0.3 PPAR.0.2 PPAR.0 PPAR.9.2 PPAR.12.3 PPAR.12.1 PPAR.9.3 PPAR.12.2 PPAR.9.1 PPAR.6.2 PPAR.6.3 PPAR.6 PPAR.48.2 PPAR.48.3 PPAR.48.1 PPAR.36 PPAR.36.3 PPAR.36.2 PPAR.36.1 PPAR.48 PPAR.24.2 PPAR.24 PPAR.24.3 PPAR.24.1 PPAR.18.3 PPAR.18 PPAR.18.2 PPAR.18.1 PPAR.72 PPAR.72.3 PPAR.60.2 PPAR.60 PPAR.60.3 PPAR.60.1 PPAR.72.1 PPAR.72.2 wt.0.1 wt.0 wt.3.2 wt.12.1 wt.12 wt.9.2 wt.9 wt.9.1 wt.12.2 wt.3 wt.3.1 wt.3.3 wt.9.3 wt.6.1 wt.6.3 wt.6.2 wt.6 wt.18.3 wt.18 wt.18.2 wt.18.1 wt.60.1 wt.60.3 wt.72 wt.72.2 wt.60 wt.72.1 wt.60.2 wt.72.3 wt.48.3 wt.48.1 wt.48.2 wt.48 wt.36 wt.36.1 wt.36.3 wt.36.2 wt.24.1 wt.24 wt.24.2 PGP Martin et al, unpublished data IV. Hierarchical clustering and Heatmaps PPARa -/Wild-type LH ACBP LDLr AM2R OCTN2 Elo5 Elo1 apoC3 S14 FAS CYP7a cHMGCoAS FPPS ALAT FDFT HMGCoAred GSTpi2 CYP2c29 SPI1 LCE apoB apoE b.actin GSTmu catalase LEF1 GSTa BSEP ALDH1 G6Pase Elo3 LFABP apoA.IV PEPCK Lpin1 GK CEBPg UCP2 CYP27a1 CYP2b13 LPK CAR1 delta6 cMOAT Tpbeta MDR2 LCPT1 PDK4 ACOTH ADISP X36b4 Lpin2 FIAF CYP8b1 CPT2 HPNCL cjun PECI GA3PDH Eci MCAD CYP2b10 ASAT CYP3A11 apoA.V CytB apoA.I C16SR SCD1 PMDCI BIEN mHMGCoAS AOX CYP4A14 CYP4A10 Lipogenesis, cholesterogenesis, fatty acid transport Downregulated in PPARa -/- (catalase, L- FABP) PPARa target genes (HPNCL, CPT2, MCAD) PPARa target genes (AOX, Cyp4a, mHMGCoAs) IV. Principal Component Analysis 2d principal component 1st principal component PM, june 07 Centre of gravity of the fish PCA allows us to define spaces of lower dimension (compared to the initial space) on which the projection of the initial cloud of point presents a minimum of distortion. It means that these new spaces capture a maximum of information (or variability). IV. Principal Component Analysis Principal Component Analysis on hepatic fatty acid composition (21 fatty acids x 60 samples): Variables REF EF R EF R EF R R EF SAT SAT SAT SAT SAT SAT SAT SATSAT SATSAT SAT -4 0.5 D HA D HA DDHA HA w3 w3 -6 C16.1w9 w3 w3 w3 -2 0 PC1 (32%) C18.1w9 C18.1w7 C14.0 C16.1w7 PC2 R EF R EFR R EFR EF R EF EF C20.3w6 C18.2w6 0.0 w6 DHA HA D DDHA HA D HA HA DDHA C22.4w6 C22.5w6 C20.2w6 C20.4w6 D HA C20.1w9 C18.3w6 C18.0 C20.3w9 C16.0 -0.5 Wild-type Fatty acids C18.3w3 C20.3w3 w3 C22.5w3 C20.5w3 w3 w3 w3 w3 w3 w3 2 4 -1.0 -0.5 0.0 Axe 1 PC1 PM, june 07 C22.6w3 -1.0 w6 w6 w6 w6 w6 w6 w6w6 w6 w6 Axe 2 PC2 (26%) w6 Mice PPARa -/- -2 0 2 4 1.0 1st Plane (PC1+PC2): 58% of the experimental variance is captured 3rd PC: 16% = genotype effects (not shown) Dietary FA modulate hepatic FA composition 0.5 1.0 Conclusion The Statistician and the Biologist (reproduced from a presentation of R. Tibshirani) They are both being executed, and are each granted one last request The Statistician asks that he be allowed to give one last lecture on his Grand Theory of Statistics. The Biologist asks that he be executed first. … and the last word is for R.A. Fisher… PM, june 07 « To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of » Thank you for your attention !! PM, june 07 Laboratoire de Statistique et Probabilités Some references: bioinformatics/databases Gene Ontology (GO) : http://www.geneontology.org/ Outils de recherche et d’analyse GO : http://fatigo.bioinfo.cnio.es/ http://www.godatabase.org/cgi-bin/amigo/go.cgi http://david.niaid.nih.gov/david/ http://vortex.cs.wayne.edu:8080/index.jsp Réseaux : http://www.genmapp.org/introduction.asp http://www.genome.jp/kegg/ Analyses de promoteurs : http://www.genomatix.de/ http://www.cbrc.jp/research/db/TFSEARCH.html PM, june 07 http://www.gene-regulation.com/ Some references Books on microarray data analysis : Statistical Analysis of Gene Expression Microarray Data. Terry Speed (Chapman & HALL) Design and Analysis of DNA Microarray Investigations. Richard M. Simon et al. (Springer) The Analysis of Gene Expression Data. Giovanni Parmigiani et al. (Springer) Nature Genetics 2002 : The Chipping Forecast II. Vol 32 supplement 2. pp 461-552 General Statistics books (in french) : Statistique Théorique et Appliquée –Tomes 1 et 2. Pierre Dagnelie (De Boeck Univ.) Analyse statistique à plusieurs variables. Pierre Dagnelie (Tec&Doc) Initiation aux traitements statistiques. B. Escofier et J. Pagès (Presses Univ. de Rennes) Statistique exploratoire multidimensionnelle. L. Lebart, A. Morineau, M. Piron (Dunod) PM, june 07 Analyses factorielles simples et multiples. B. Escofier et J. Pagès (Dunod) Some references Articles on data normalization : Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Yang YH et al. Nucleic Acids Res. 2002. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bolstad BM et al. Bioinformatics. 2003. Empirical evaluation of data transformations and ranking statistics for microarray analysis. Qin LX and Kerr KF. Nucleic Acids Res. 2004. Articles on experimental design : Statistical design and the analysis of gene expression microarray data. Kerr MK and Churchill GA. Genet Res. 2001. Experimental design for gene expression microarrays. Kerr MK and Churchill GA. Biostatistics. 2001. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. Dobbin K, Shih JH and Simon R. J Natl Cancer Inst. 2003. Experimental design and low-level analysis of microarray data. Bolstad BM et al. Int Rev Neurobiol. 2004. PM, june 07 Some references Articles on linear models for microarray data analysis : Assessing gene significance from cDNA microarray expression data via mixed models. Wolfinger RD et al. J Comput Biol. 2001. Linear models for microarray data analysis: hidden similarities and differences. Kerr MK. J Comput Biol. 2003. Analysis of oligonucleotide array experiments with repeated measures using mixed models. Li H et al. BMC Bioinformatics. 2004. Articles on multidimentional exploratory analysis : Cluster analysis and display of genome-wide expression patterns. Eisen MB et al. Proc Natl Acad Sci USA. 1998. Large-scale clustering of cDNA-fingerprinting data. Herwig R et al. Genome Res. 1999. PM, june 07 Using biplots to interpret gene expression patterns in plants. Chapman S et al. Bioinformatics. 2002. NOTE : These references are only provided to illustrate the intensity and diversity of the research efforts in the field of microarray data analysis and many more publications could have been worth mentioned.