Transcript Document
Contacts • Data sets are available at: http://www.bioinformatica.unito.it/bioinformatics/DAGEL.II/ • Post-workshop support: – support on data analysis course will be supplied by Prof. Calogero, sending an email to: • [email protected] containing in the Subject: Post DW support – If personalized training is needed on Bioconductor programming please refer to the AIRBB personalized courses web page: http://www.bioinformatica.unito.it/bioinformatics/AIRBB_c ourses/AIRBB_courses.html Workshop topics • We will analyse the following topics: – Experimental design. – Quality controls. – Data filtering. – Statistical inference of differential expression. – Annotation. – Assessing the biological meaning of differential expression. Essential issues • Any analysis of microarray data is useless if : – there is not a clear biological question to be investigated; – biological experiments are not carefully designed to minimize error sources: • • • • human intervention, reagent lots, equipments. etc. Experimental design • If the biological material is not a limiting factor “think wide”: – Experiment should be designed with many replicas (>3) – Time course experiments should be designed with many points (>4). – Investigate part of the experiment by microarrays and use the rest for further validations. – Discuss the experiment with the statistician/bioinformatician involved in data analysis Pools versus single samples • The basic assumption underlying sample pooling is biological averaging: – the expression from a pooled sample averages out the expression from the individual contributing samples. Bioinformatics 2004, 20:3318 Pools versus single samples • Shih shows that the assumption -“expression from a pooled sample averages out the expression from the individual contributing samples”- may not hold especially when the signals are strong. • The pooling bias appears to be particularly severe for Affymetrix arrays. • Furthermore, it is impossibile to associate the gene expression from the pooled sample with the individual phenotypic information: – Making unfeasible certain statistical inference or predictions for individuals. • Conclusions: – Researcher has to be cautious about designing a pooled experiment. – Pooling of samples is recommended when there is not enough RNA from each individual sample to run an array. Experiment setup • Experiments involving various samples and conditions need to be carefully designed to avoid unwanted effects. T1 C1 C1 C1 C2 C2 C2 Day 1 T1 T1 T2 T2 T2 Day 2 C1 C1 C1 T1 C2 C2 C2 T2 T1 T2 T2 T1 Day 1 Day 2 Analysis pipe-line Sample Preparation Hybridization Scanning + Image Analysis Array Fabrication Quality control Filtering Normalization statistical analysis Annotation Platform specific devices Bioconductor Biological Knowledge extraction Open source software • Bioconductor – is an open source and open development software project to provide tools for the analysis and comprehension of genomic data. • SAM – is a statistical technique for finding significant genes in a set of microarray experiments. • TMEV 4.0 – is an application that allows the viewing of processed microarray slide representations and the identification of genes and expression patterns of interest. A variety of normalization algorithms and clustering analyses allow the user flexibility in creating meaningful views of the expression data. Analysis of microarray • Bioconductor (www.bioconductor.org) is an open source and open development software project to provide tools for the analysis and comprehension of genomic data. • Advantages: – highly dinamic, – free of charge, – introductory and advance courses are available in Europe/USA every year. • Disadvantages: – limited graphical interface, – limited documentation. oneChannelGUI • This is a graphical interface to Bioconductor libraries devoted to the analysis of data derived from single channel platforms. • affylmGUI is a graphical interfase to limma library, which allows differential expression detection by mean of linear model analysis. • oneChannelGUI is an extension of affylmGUI capabilities. oneChannelGUI • 3’ IVT/Gene arrays: – Primary (probe level QC, probe set summary and normalization), secondary analysis (replicates QC, filtering, statistical analysis, classification) and data mining (GO enrichment). • Exon arrays: – Secondary analysis (replicates QC, filtering, statistical analysis, classification, basic Splice Index inspection) using expression console as source of primary data. • Large data set (i.e. probe set expression in tab delimited format): – Secondary analysis (replicates QC, filtering, statistical analysis, classification) using expression console/GEO/ArrayExpress data as source of primary data. Starting R and oneChannelGUI Setting the virtual RAM at 2GB: C:\..\R\R-2.3.0\bin\Rgui.exe --max-mem-size=2048M A Double click on “R” to start B A B Click on “Package” to load Bioconductor packages A B Click on “Load package” to select the oneChannelGUI package Click on “OK” to load the oneChannelGUI package C A Click on “Yes” to start the affylmGUI interface. B Wait few seconds! Yes C Yes Click on “Yes” to start the oneChannelGUI interface. Standard affylmGUI menu Overlaying oneChannelGUI to affylmGUI will change the default affylmGUI menu to the oneChannelGUI menu for 3’IVT Affymetrix arrays oneChannelGUI menu for 3’IVT arrays oneChannelGUI • Analysis steps for 3’IVT arrays: – Loading .CEL files to be analyzed – Arrays quality control: • Raw data plots • Robust probe-level model library (affyPLM) – NUSE plot – RLE plot A Summary of loaded data: none is available since no CEL files have been loaded A Click on “File” to start a new project B C Click on “New” to start a new project Selected 3’IVT arrays D Selected as working dir the folder containing the .CEL files Selected the “targets” file. Then press OK to continue Targets file is a tab delimited text file containing the description of the experiment. It is made of three columns: Name: the name you want to assign to each array. FileName: the names of the corresponding .CEL file Target: the experimental condition associated to the array (e.g. mock, treated, etc). At least two conditions should be present. Define the name of you analysis. Press OK to continue. …. Now the array will be loaded in a specific R object called environment. Raw data are now loaded and are ready for normalization. Widget to create a target for Affy arrays Widget to create a target for Affy arrays Widget to create a target for Affy arrays Widget to create a target for Affy arrays Exercise 1 (15 minutes) • Data set for this exercise: – spugnini.et.al: 7 CEL files were generated for two prototypic situations (biological replicas): – C, untreated MSTO-211H cells: • baldi.c1_b.CEL, baldi.c2.CEL, baldi.c3.CEL – T, MSTO-211H cells treated with 780 μM piroxicam for 48 hours: • baldi.f1_b.CEL, baldi.f2.CEL, baldi.f3.CEL, baldi.f4.CEL See next page Exercise 1 • Go in the folder spugnini.et.al. • Create, with excel, a tab delimited file named targets.txt: – Targets file is made of three columns with the following header: • Name • FileName • Target – In column Name place a brief name (e.g. c1, c2, etc) – In column FileName place the name of the corresponding .CEL file – In column Target place the experimental conditions (e.g. control, treatment, etc) See next page Exercise 1 • Open R • Load the oneChannelGUI • Start a new project: – Change the working dir in spugnini.et.al – Load the targets.txt file (previously created) – Set as project name: ex1 Analysis pipe-line Quality control Normalization Filtering Statistical analysis Biological Knowledge extraction Annotation A The next steps are few simple basic quality controls. B Click on Quality Control menu You can now evaluate: Intensity histogram for one array at a time. A E C D You can now evaluate: Intensity density plot for one array at a time. A E C D A You can now evaluate: all arrays intensities as box plots. C A B B A It is possible that cRNA concentration in sample sE2 was over estimated and a low cRNA amount was loaded on the array. As result a lot of signals are below the value 100 [log2(100) = 6.44] A Some other basic controls can be done after the calculation of the probe set intensity summary using a special Bioconductor library affyPLM B Fit the model (BE PATIENT!!!) The end of the fitting procedure is given by a message. Then the NUSE/RLE function is automatically called C affyPLM QC library • affyPLM provides a number of useful tools based on probe-level modelling procedures. • affyPLM package allows arrays quality controls. What is a Probe Level Model? • A Probe Level Model (PLM) is a model that is fit to probe-intensity data. • affyPLM fits a model with probe level and chip level parameters on a probe set by probe set basis. • In quality control chip level parameters are a factor variable with a level for each array. What is a PLMset? • The main function for fitting PLM is the function fitPLM. • This function will fit a linear model with an effect estimated for each chip and an effect for each probe. • fitPLM implements iteratively re-weighted least squares M-estimation regression. • The fitted model is stored in a PLMset object containing chip level parameter estimates and the corresponding standard errors. Default fitted model log2 PMkij kj ki kij • where kj is the log2 probe set expression value on array j for probeset k and ki are probe effects. • To make the model identifiable the constrain I i 1 ki 0 is used. • For this default model, the parameter estimates given are probe set expression values. Relative Log Expression (RLE) • RLE values are computed for each probe set by comparing the expression value on each array against the median expression value for that probeset across all arrays. • Assuming that most genes are not changing in expression across arrays means ideally most of these RLE values will be near 0. • Boxplots of these values, for each array, provides a quality assessment tool. • RLE plots: – Estimation of expression gi for each gene g on each array i. – Compute the median value across arrays for each gene Relative Log Expression (RLE) Normalized Unscaled Standard Errors (NUSE) • Standard error measures the amount of errors done fitting y for every x value. se= • Normalized Unscaled Standard Errors (NUSE) can also be used for assessing quality. • The standard error estimates obtained for each gene on each array from fitPLM are taken and standardized across arrays so that the median standard error for that genes is 1 across all arrays. • This process accounts for differences in variability between genes. • An array were there are elevated SE relative to the other arrays is typically of lower quality. • Boxplots of these values, separated by array can be used to compare arrays. medSEˆ NUSE ˆgi SE ˆgi gi A C B A B A Since the fitPLM object can be very big. It is a good idea, to delete it after quality control. Before Delete PLM After Delete PLM Exercise 2 (20 minutes) • Starting from the data set you have loaded – check the Raw data with the available plots – analyze the data using NUSE and RLE plots. • Answer the following questions: – Is there any array characterized by a very narrow probe intensity distribution? • YES NO – Is there any array which is significantly different with respect to the others by mean of NUSE or RLE plots? • YES NO See next page Exercise 2 • If any array has a NUSE very different from the others create a new target file without it and load again the remaining .CEL files. Analysis pipe-line Quality control Normalization Filtering Statistical analysis Biological Knowledge extraction Annotation affylmGUI • Analysis steps: – Calculating probe set summaries: • RMA • GCRMA • PLIER – Normalization: • Quantile method Brief summary about probe set intensity calculation • RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take in account unspecific probe hybridization in probe set background calculation. • GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004). • The PLIER (Probe Logarithmic Error Intensity Estimate) method produces an improved signal by accounting for experimentally observed patterns in probe behavior and handling error at the appropriately at low and high signal values. • Methods such as PLIER+16 and GCRMA, which use model-based background correction, maintain relatively good accuracy without losing much precision. Yeast DNA on human chips •G,C stretch length better perform with respect to G + C content. •The distribution of 4 G-C strata in the non-specific sample well approximate to a log-normal distribution. Observed log (base 2) expression versus nominal log concentration (in picoMolar). Why Normalization ? • • • • • To remove systematic biases, which include, Sample preparation Variability in hybridization Spatial effects Scanner settings Experimenter bias Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray What Normalization Is & What It Isn’t • • • • Methods and Algorithms Applied after some Image Analysis Applied before subsequent Data Analysis Allows comparison of experiments • Not a cure for poor data. Where Normalization Fits In Sample Preparation Hybridization Scanning + Image Analysis Normalization Data Analysis Array Fabrication Normalization Spot location, assignment of intensities, background correction etc. Subsequent analysis, e.g clustering, uncovering genetic networks Quantile normalization Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005) Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005) Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005) Quantile normalized Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005) M-A Plots log R M-A plot is 45° rotation of standard scatter plot M 0 45° A log G M = log R – log G A = ½[ log R + log G ] M = Minus A = Add Extracted form Irizarry presentation at Bioconductor Course (Brixen IT, 2005) The next step is normalization and calculation of probe set summary. A B Click on probe set menu and select the probe set summary and normalization option. Normalization and intensity calculation come together. Three Normalization/intensity calculation option are available: RMA + quantile normalization GCRMA + quantile normalization PLM + quantile normalization At any time it is possible to check the structure of the normalized data set Many option but slow computation. Be patient!!!! Info about the process are shown in the main R window Exercise 3a (20 minutes) • Data set for following exercises: • estrogen.IGF1 : 18 CEL files were generated for 4 prototypic situations (biological replicas): – MCF7 ctrl, untreated MCF7: • M1.CEL, M4.CEL, M7.CEL – MCF7 E2, MCF7 treated for 3 hours with Estrogen: • M3.CEL, M6.CEL, M9.CEL – MCF7 IGF, MCF7 treated for 3 hours with IGF1: • M2.CEL, M5.CEL, M8.CEL – SKER3 ctrl, untreated SKER3: • s1.CEL, s4.CEL, s7.CEL – SKER3 E2, SKER3 treated for 3 hours with Estrogen: • s3.CEL, s6.CEL, s9.CEL – SKER3 IGF, SKER3 treated for 3 hours with IGF1: • s2.CEL, s5.CEL, s8.CEL See next page Exercise 3a • Go in the folder estrogen.IGF1. • Create, with excel, a tab delimited file named targets.txt: – Targets file is made of three columns with the following header: • Name • FileName • Target – In column Name place a brief name (e.g. c1, c2, etc) – In column FileName place the name of the corresponding .CEL file – In column Target place the experimental conditions (e.g. control, treatment, etc) • Create a target only for MCF7 with/without IGF1 treatment See next page Exercise 3a • Calculate Probe set summaries with GCRMA for the estrogen.IGF1 set. • Save the limma project as ex3. See next page Expression Console® (Affymetrix person) • Presentation of Expression Console®: – General structure – Loading exon libraries – QC for raw data – Exporting gene level and exon level analysis Replicates quality control • To evaluate sample replicates quality we will use a partition technique called Principal component analysis (PCA) . Principal component analysis • Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. • The first principal component accounts for as much of the variability in the data as possible • Each succeeding component accounts for as much of the remaining variability as possible. • The components can be thought of as axes in ndimensional space, where n is the number of components. Each axis represents a different trend in the data. PCA PCA2 PCA1 1 2 2° PC will be orthogonal to the 1st In general the first three components account for nearly all the variability. Therefore, PCA can be reasonably represented in a 3D space. A B t4 is clearly an outlier! To perform sample replicates QC we use principal component analysis (PCA) This check is performed on probe set summaries! Exercise 3b • Use the estrogen.IGF1 data set to evaluate quality replicas by mean of PCA analysis. • Questions: – CTRL and TRT are well separated? • YES NO – Replicates are homogeneous? • YES NO Analysis pipe-line Quality control Normalization Filtering Statistical analysis Biological Knowledge extraction Annotation Filtering • Filtering affects the false discovery rate. • Researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset. • If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered due to the filtering. Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004 Filtering can be performed at various levels: • Annotation features: – Specific gene features (i.e. GO term, presence of transcriptional regulative elements in promoters, etc.) • Signal features: – % intensities greater of a user defined value – Interquantile range (IQR) greater of a defined value Specific gene feature • In transcriptional studies focusing on genes characterized by specific feature (i.e. transcription factor elements in promoters) the best filtering approach is selecting only those genes linked to the “peculiar feature”. • For example: – Identification of genes modulated by estradiol:ER or IGF1 by direct binding to Estrogen-Responsive Elements (ERE): • HGU133plus2: – 54675 probe sets 19951 Entrez Genes • HGU133plus2 with ERE in putative promoter regions: – 6764 probe sets 3058 Entrez Genes Specific gene feature • Data derived from specifically devoted annotation data set can be used for functional filtering. • The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks created from millions of individually modeled relationships between: – proteins, genes, complexes, cells, tissues, drugs, diseases. • The Ingenuity Pathways Analysis software (IPA) identifies relations between genes. • The relations that can be grasped are: – Regulates – Regulated by – Binds Start an Ingenuity session at: https://analysis.ingenuity.com/pa/login/login.jsp Specific classes of proteins can be searched and exported A key word can also be used to perform a wide search After selection of the Functions & diseases of interest genes should be visualized as gene details before exportation in a file to be used for filtering expression data Exporting results in a table as previously The Entrez Gene IDs present in this file can be used to extract e specific subset of genes. To use filtering using a list of EG you need to extract from the IPA table only the Entrez genes of interest and save them on a text file without header. Exercise 4 (30 minutes) • Export from Ingenuity a table related to transcription factors • Create a file containing only the Entrez genes without header and use it to filter the data set of exercise 3. • Save the affylm project as ex4.TF.lma • Inspect the sample PCA: – CTRL and TRT are well separated? • YES NO Non–specific filtering • This technique has as its premise the removal of genes that are deemed to be not expressed or unchanged according to some specific criterion that is under the control of the user. • The aim of non–specific filtering is to remove genes that, e. g. due to their low overall intensity or variability, are unlikely to carry information about the phenotypes under investigation. Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004 Intensity distributions Bg level probe sets RMA GCRMA How to define the efficacy of a filtering procedure? enrichment100 N spikeinAfterFiltering N probesets N probesetsA fterFilteringN spikein • This enrichment is very similar to that used to evaluate the purification folds of a protein after a chromatographic step. E. AfterChrom gP.BeforeChrom enrichm ent 100 gP. AfterChrom E.BeforeChrom Filtering by genefilter pOverA (keep if ≥ 25% probe sets have intensities ≥ log2(100)) 22300 42/42 SpikeIn Enrichment: 100% 5553 42/42 SpikeIn Enrichment: 401% Filtering by InterQuantile Range 25% IQR 75% The distribution of all intensity values of a differential expression experiment are the summary of the distribution of each gene expression over the experimental conditions How filtering by genefilter IQR works? How filtering by IQR works? The filter removes genes that show little changes within the experimental points Filtering by genefilter IQR (removing if intensities IQR0.25, 0.5) 22300 42/42 SpikeIn Enrichment: 100% 244 42/42 SpikeIn Enrichment: 9139% 68 42/42 SpikeIn Enrichment: 32794% A C B D A B C In this example will be selected only those genes characterized by having in at least 50% of the arrays an intensity ≥ 100. Exercise 5 (20 minutes) • Use the estrogen.IGF1 data normalized with gcrma (ex3). • Perform and IQR filter at 0.25 followed by an intensity filter at 50% of the arrays with and intensity over 100. • Export the data as tab delimited file. • Question: – How many probe sets are left after the first and the second filter? • ………………………………………………………… – Which are the changes observable at the level of probe sets distribution after the first and the second filter? • …………………………………………………………………………… • Save the affylm project as ex5.lma. • Make the same filtering on ex4.TF.lma • Question: – How many probe sets are left after the first and the second filter? QC and filtering for exon data • At the time oneChannelGUI was setup Bioconductor tools for handling raw data from Affymetrix exon arrays were not available. • For this reason the oneChannelGUI uses the libraries and primary analysis outputs from Affymetrix Expression Console®. • Exon raw data quality control is perfomed using the Expression Console®. • Sample QC and filtering are performed on oneChannleGUI. Exon arrays on oneChannelGUI • On oneChannelGUI gene level and exon level data from Expression Console® are loaded. • User needs to specify where Expression Console® library files are located, at any time a new exon data set is loaded. Loading an exon array data set it is necessary to indicate the organism and which kind of exon data are going to be loaded (core, extended, full) Loading an exon array data set it is necessary to specify the location of Expression Console libraries. Subsequently three files have to be loaded: The target file, which has the same structure previously described. The tab delimited files containing GENE and EXON level data exported form the Expression Console. A new Menu is then available for exon data Exon arrays QC on oneChannelGUI The brain (b) replicates are very poor. The quality is particularly bad for exon data. However, we have to consider that these data are derived from tissues coming from different post-mortem donors. Exon arrays filtering Since the knowledge on exon data is still relatively limited we have little empirical information about background threshold. Exon/intron housekeeping gene information available in exon data might be a possible approach to define it. Different color lines indicate the possible thresholds to be selected. In black are shown the intensity density plots for introns as in red those for exons. IQR filter works as described for 3’IVT arrays. However, any filter done at gene level will also affect the corresponding exon data. Starting condition After filtering Intensity filter is instead based on the threshold previously selected on the basis of exon/intron HK expression signals. In this example we are keeping only the genes where all samples have a signal greater than the pre-defined BG. Exercise 6 (30 minutes) • Use breast and cerebellum Affymetrix exon data. • Calculate, with Expression Console, gene/exon level expression for the CORE sub set. • Evaluate the sample quality using gene/exon PCA. • Question: – Are replicates homogeneous? – YES NO • Define a BG threshold using introns information. • Perform an IQR filter at 0.25 followed by intensity filter at a threshold of your choise. • Save the affylm project as ex6.lma. Splice Index • The Splicing Index captures the basic metric for the analysis of alternative splicing. • It is a measure of how much exon specific expression (with gene induction factored out) differs between two samples. Defining function-oriented data set for splice index calculation A Use a set of function-oriented EGs to select probe set IDs B C Use the selected probe set IDs for “Filtering using a list of probe sets”. A B C ATTENTION: this is only a very rough descriptive instrument! Much work needs to be done on exon analysis! Splice Index inspection is performed modelling the splice index exon profiles for two experimental conditions. Results are saved on a pdf file in your working dir. The sub set of splice indexes to be inspected is defined using two filters: A D Example of one gene output B C This plot gives some advise about the scattering levels of the Splice Indexes over the gene under analysis A Model of splice indexes over the two experimental conditions. Red dashed lines indicate the confidence interval of the model. Plots of significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced : 14. Significance p-value of the alternative splicing versus the average Splice Index values. IN this example only one exon seems to be differentially spliced. Filtering conditions are shown over the plot of intensity values versus exon number. Exercise 7 (30 minutes) • Use ex6.lma. • Perform a filtering on the basis of a probe sets related to TF. – Create in NetAffx a Transcript cluster view containing only Transcript Cluster ID – Gene level probe set list is created interrogating NetaAffx with the EG used in exercise 4 • Calculate on the data sub set the splice Index. • Save the affylm project as ex7.lma. • Filter for p-value ≤0.01 and mean average Splice Index differences ≥2. • Inspect the output plots. Analysis pipe-line Quality control Normalization Filtering Statistical analysis Biological Knowledge extraction Annotation This step is the same for 3’IVT arrays and exon arrays gene level analyses Log2(T/C) is frequently used to evaluate fold change variation 200, 400, 800, 1600, 32000 100, 100, 100, 100, 100 100 100 100 100 100 200 400 800 1600 32000 8.00 6.00 log2(t/c) down-regulation compression t/c 4.00 2.00 0.00 -2.00 F65439 PT765 AC87654 log2(t/c) MN0987 S098345 AA238 L8754 G7599 U6539 H07498 -4.00 Fold change filtering • The intensity change between experimental groups (i.e. control versus treated) are known as: – Fold change. • Frequently an arbitrary threshold Trtd log 2 1 Ctrl is used to define a significant differential expression. Fold change filtering • There are no rules to define the correct fold change (fc) threshold for differential expression. • |fc|>1 is an arbitrary threshold. • Fc threshold estimation is dependent on the % of fc fluctuations due to experimental reasons. • Fc threshold estimation can be better appreciated in time/concentration course experiments. • “Biologically speaking” many small variations all together can be functionally important (i.e. |fc|=0.5 for all chr 21 genes induces the Down syndrome) Statistical analysis • Intensity changes between experimental groups (i.e. control versus treated) are known as: – Fold change. – Ranking genes based on fold change alone implicitly assigns equal variance to every gene. • Fold change alone is not sufficient to indicate the significance of the expression changes. • Fold change has to be supported by statistical information. Multiple testing errors • Performing multiple statistical tests two types of errors can occur: – Type I error (False positive) – Type II error (False negative) • Reduction of type I errors increases the number of type II errors. • It is important to identify an approach that reduces false positives with the minimum loss of information (false negative) The multiple tests problem • If the number of samples increases the tails of a distribution are getting more populated. Type I error correction • Null hypothesis (H0): the mean of treated and the mean of control for a gene i belong to the same distribution. • Type I error: H0 is false. • Sidak significance point: = acceptance level (es 0.05) g= n. of independent tests K ( g , ) 1 1 • If the p-values are lower of K (g,) all the remaining H0 are considered true. g Type I error correction (FWER) K ( g , ) 1 1 g P of diff. exprs. genes <10-6 < 10-6 2* 10-5 0.047 … ’ 1– 1– 1– 1– (1 (1 (1 (1 – – – – 0.05)1/5= 0.05)1/4= 0.05)1/3= 0.05)1/2= 0.102 0.0127 0.0170 0.0253 Statistical analysis • The sensitivity of statistical tests is affected by the number of available replicates. • Replicates can be: – Technical – Biological • Biological replicates better summarize the variability of samples belonging to a common group. • The minimum number of replicates is an important issue! How much replicates are important? Yang YH e Speed T, 2002 Sample size • Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates. • The issue of how many replicates are required in a typical experimental system needs to be addressed. • Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human). Assessing sample sizes in microarray experiments • Assessment of sample sizes for microarray data is a tricky exercise. • The reason why we are performing such analysis is to have a general feeling on the ability of our experimental data to robustly detect differential expression. • The method implemented in oneChannelGUI is that proposed by Warnes & Liu and implemented in the Bioconductor library ssize. Assessing sample sizes in microarray experiments • The key component of Warnes’ method is the generation of cumulative plot of the proportion of genes achieving a desired power as a function of sample size, based on simple gene-by-gene calculations. • Its real utility is as a visual tool for helping users to understand the trade off between sample size and statistical power. Assumptions • A microarray experiment is set up to compare gene expressions between one treatment group and one control group. • Microarray data has been normalized and transformed so that the data for each gene is sufficiently close to a normal distribution that a standard 2-sample pooled-variance t-test will reliably detect differentially expressed genes. • The tested hypothesis for each gene is: versus where μT and μC are means of gene expressions for treatment and control group respectively. • The analysis is done using the common variance described in: – Wei et al. BMC Genomics. 2004, 5:87 Sample size estimation • The required sample size of an experiment depends on: – variance component (σ), – the desired detectable fold change (δ), – the power to detect this change (1-β, the likelihood of detecting the change or the true positive rate), – a chosen type I error rate (α= false positive). IMPORTANT: This implementation of ssize functions uses BH type I error correction instead of Bonferroni, which is the default in ssize functions. = type II error rate, i.e. false negative. This is not |log2(FC)| To detect 95% of the differentially expressed genes, characterized by a power of 0.8, a sample size, FOR EACH GROUP, greater than 20 is needed. To detect 97% of the differentially expressed genes, characterized by a power of 0.8, a fold change greater than 10 (log2(10)=3.32) is needed. Assessing sample sizes in microarray experiments • Assessment of sample sizes and experimental power of a microarray experiment using the method implemented in the Bioconductor library sizepower. • The R package, sizepower, is used to calculate sample size and power in the planning stage of a microarray study. • It helps the user to determine how many samples are needed to achieve a specified power for a test of whether a gene is differentially expressed or, in reverse, to determine the power of a given sample size. Exercise 8 (20 minutes) • Use the breast/cerebellum filtered exon data (ex6.lma). • Using the ssize implementation – Check the fold change needed to achieve a power of 80% on the basis of the available sample size. • Using the sizepower implementation – Check the sample size and the experimental power selecting various threshold of false positives and log2(FC) Comments about experimental design • If the biological material is not a limiting factor “THINK WIDE”: – Experiment should be designed with many replicas (>3) – Time course experiments should be designed with many time points (>4). – Investigate part of the experiment by microarrays and use the rest for further validations. Statistical validation • Statistical validation can be performed using parametric and non-parametric tests. • Parametric tests: – The populations under analysis are normally distributed. • Non parametric tests: – There is no assumption on samples distribution. • Non parametric are less sensitive than parametric. Selecting differentially expressed genes Statistical validation method I Statistical validation method II Differential expression linked to a specific biological event. Statistical validation method III Selecting differentially expressed genes • Each method grasps some true signals but not all. • Each method catches some false signals. • The trick is to find the best condition to maximize true signals while minimizing fakes. Mean C Mean T Population Ctrl Population Trtd t mT mC nT nC 2 T Sample mean “s” Less than a 5% chance that the sample with mean s came from population C, i.e., s is significantly different from “mean C” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population T. 2 C • T-statistics is widespread in assessing differential expression. • Unstable variance estimates that arise when sample size is small can be corrected using: – Error fudge factors (SAM) – Bayesian methods (Limma) SAM Significance Analysis of Microarray SAM (Significance analysis of microarrays) (Tusher et al. 2001) fudge factor regularizes the t -statistic by inflating the denominator s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays. Some SAM designs Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test). Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between an sample in group A and one in group B (analogous to paired t-test). Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA) Others: ……………………………………………………………… • SAM uses data permutations to define a set of significant differential expression. N N N T T T { N T N T N N T N T N T T T N T N T T N T N T N N } FDR is given by p0 * False / Called p0 is the prior probability pi0 that a gene is not differentially expressed How SAM calculates the False Discovery Rate for a specific delta? Permutations 4 720 3 2 1 Mean false A SAM analysis can be performed in Bioconductor using the siggenes library. Two class or multi class analysis is selected automatically due to the structure of Target information B C The delta table prompts to the user the information related to the amount of differentially expressed genes given a certain FDR. The user selects a delta value and check the behaviour of the differentially expressed genes. The user selects a delta value and check the behaviour of the differentially expressed genes. Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes. Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes. The table can be saved in a tab delimited file relative difference in gene expression Standard deviation Fold change Raw p-value Significance measurement derived from raw p-value Exercise 9 (20 minutes) • Use the estrogen.IGF1 data set. • Perform a two class analysis on ctrl versus IGF1 treatment, after IQR and/or intensity filter. • Select a FDR ≤ 0.1 and a |log2(FC)| ≥ 1 • Save results in ex9.xls