From Genome Sequence towards Analysis of the Transcriptome

Genomic Sequence Information
Nucleotide sequence and physical position on the chromosomes
AATTCGCGCGAAT…….
TTAAGCGCGCTTA……

Annotation
“Genes”: Promoter, TSS, 5’UTR, Exons, Introns, 3’UTR, TTS
“non coding” regions

Transcriptome Analysis
- based on annotation: biased
- based on genomic sequence information (nucleotide sequence and physical position on the chromosomes): non biased

Part A: Gene centered microarrays for transcriptome analysis

Microarrays are designed on the basis of gene annotation of the genome.

[Figure: gDNA with 5’UTR, Exon 1, Intron, Exon 2 and 3’UTR; the transcript with Cap and polyA; gene specific probe(s) placed on the microarray]

Design of gene specific probe(s):
- oligonucleotides (25-70mers)
- PCR amplification of gene specific sequences (GSTs)

GST: gene specific tag
Remark: the first generations of microarrays were based on ESTs (expressed sequence tags).

Today’s microarrays have specific probes for all annotated genes of an organism’s genome. These arrays are called “whole genome microarrays”. They allow the transcript abundance of all annotated genes in the genome to be measured within one experiment.
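Two-colour microarray measurements are usually summarised per spot as a log ratio M and an average log intensity A. The following is a minimal sketch, assuming R and G are background-corrected red and green channel intensities; the gene names and values are made up for illustration, and the factor 1/2 in A follows the usual convention for MA plots.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red / green)            # log ratio: 0 means equal abundance
    a = 0.5 * math.log2(red * green)      # average log2 intensity of the spot
    return m, a

# Hypothetical background-corrected intensities (red, green) for three spots
spots = {
    "gene_a": (1200.0, 1200.0),   # unchanged, as expected in a self vs self test
    "gene_b": (4000.0, 1000.0),   # fourfold higher in the red channel
    "gene_c": (250.0, 1000.0),    # fourfold lower in the red channel
}

ma = {name: ma_values(r, g) for name, (r, g) in spots.items()}
for name, (m, a) in ma.items():
    print(f"{name}: M = {m:+.2f}, A = {a:.2f}")
```

In a self vs self hybridization all M values should scatter around 0; in a comparison of two tissues, genes with large positive or negative M are the candidates for differential expression.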
Typical examples of microarray data:
- Here an RNA sample is analyzed against itself (self vs self).
- Here a liver sample is analyzed against a heart sample.
Plot axes: M = log2(R/G), A = ½ log2(R·G)

Microarray technology, which made it possible to measure the expression of all genes within a genome at once, was a revolution in modern molecular biology.

Since then, microarrays have been used for everything from “simple” gene-fishing to systemic approaches (understanding regulatory networks), and also as a tool for diagnostics in medicine.

Let’s look at and discuss some specific examples of how this technology was used to analyze the transcriptome and what we can learn from it…

Example 1: using microarray transcript profiling in order to understand complex biological processes
Example 2: using microarrays as a diagnostic tool

Example 1: the transcriptional program of cell entry into quiescence

Cells have to decide whether to divide or to enter G0, quiescence. Actually, most eukaryotic cells spend most of their lifetime in G0.

G0 is not an “endpoint”: cells can return to the mitotic cell cycle. A good example is the wound response, where quiescent cells (fibroblasts and epidermal stem cells) rapidly start to proliferate. Once tissue repair is accomplished, the cells reenter G0.

The transition from G0 into cell division has been studied in detail (one driving force for this is certainly its link to cancer biology).

Much less is known about how cells exit the cell cycle in order to enter G0. Is it a passive process where loss of growth factors causes the down regulation of cell cycle genes? Or is it an active process with its own unique transcriptional program?

Cell cycle <-> Quiescence

“Cells constantly sense their environment to decide whether to divide. Many genes that control the entry into cell division are known, and their excessive activation may cause cancer. In contrast, the way that cells cease to divide was thought to be a passive process, where signals for cell division gradually decay.”
Liu H, Adler AS, Segal E, Chang HY (2007) A transcriptional program mediating entry into cellular quiescence. PLoS Genet 3(6): e91. doi:10.1371/journal.pgen.0030091

The experimental model: human fibroblast cell culture
dividing -> quiescent: serum deprivation (SD)
quiescent -> dividing: serum stimulation (SS)

Serum deprivation: cells grown in 10% FBS, switch to a medium with 0.1% FBS
- dividing -> quiescent: transcriptional program for entry into quiescence
- time points for sample taking/analysis
- transcript analysis with microarrays

Serum stimulation: cells grown in 0.1% FBS, switch to a medium with 10% FBS
- quiescent -> dividing: transcriptional program for entry into cell cycle
- time points for sample taking/analysis
- transcript analysis with microarrays

FBS: fetal bovine serum; contains all growth factors to induce cell division in fibroblast cell cultures

Each fibroblast RNA sample is analyzed against human reference RNA:
time (h)    microarray
0           microarray 1
0.25        microarray 2
0.5         microarray 3
1           microarray 4
1.5         microarray 5
etc         microarray 6..etc

Genes that show induction or repression in the late phase show a symmetric regulation (induced upon SS but repressed under SD and vice versa).
Genes that change early on, however, are not symmetrically regulated at all.
-> Entry and exit from G0 have their own unique transcriptional program to initiate the transition

Early Response Genes
SDERGs (serum deprivation early response genes) show an asymmetric behavior between SD and SS: immediate induction in SD, but little or no regulation during SS.
-> entry into G0 has its unique gene expression program

Several SDERGs are known to be induced by interferon (for example STAT1).
Two SDERGs, SALL2 and MXI1, are putative tumor suppressors.
The authors suppose that some of the SDERGs might be “master regulators” of the entry into quiescence.
SALL2: zinc finger transcription factor
MXI1: MAX interactor 1, antagonist of MYC oncoproteins
IRF1: interferon regulatory factor 1, transcription factor

Which experiments could be performed to investigate a role of these three genes as “master regulators” of cell entry into G0?

Gene knock-down with siRNAs
Cells were transfected with siRNA constructs and the knock-down efficiency was determined by real-time quantitative PCR.

The effect of the gene knock-downs on cell cycle exit and the transcriptional program after SD:
- FACS (fluorescence-activated cell sorting) analysis of cell cycle states
- Transcript profiling of cells with siRNA constructs

Do SDERGs play a role in human cancer?
-> interrogation of public databases of microarray data from human cancers

Prostate cancer
-> Coordinate repression of SDERGs identifies over 90% of prostate tumors relative to normal prostate

Breast cancer
-> Diminished expression of SDERGs in grade 3 tumors (grade 3: high cell proliferation, less differentiation)
-> Patients with diminished expression of SDERGs had significantly worse survival

Summary:
Microarray analysis permits profiling of the transcriptome at full genome level.
The discussed example of cell entry into quiescence shows that full genome transcriptome analysis can yield insights into the transcriptional program of cells and identify genes as “master regulators” of this transcriptional program.
The function of the genes which control entry into G0 was confirmed by an siRNA approach.
Many microarray experiment results are stored in public databases. This wealth of data can be exploited to gain insight into the involvement of genes of interest in diverse biological processes.
In the discussed example it was concluded that genes which control entry into G0 are implicated in human cancer. This class of genes (“SDERGs”) likely antagonizes cell proliferation in many cell types.
Example 2: gene expression signatures as a diagnostic tool for tumor classification regarding clinical outcome

primary tumor -> (time) -> metastasis free, high survival
primary tumor -> (time) -> metastasis formation, low survival

Prognostic factors that determine whether a primary tumor will or will not develop metastases over time are of high clinical importance for the treatment decision.
Example breast cancer: prognosis based on histological and clinical characteristics (St. Gallen criteria, NIH criteria)

Can gene expression profiles of primary tumors serve as prognostic markers for the clinical outcome of cancer?

Nature (2002), 415, 530-536
This is one of the first articles on the identification of a gene expression signature which can be used as a prognostic factor for cancer clinical outcome. (This article has been cited since then 2491 times!)

Experimental design: samples from 98 primary tumors from patients with known clinical outcome after five years (disease free or development of distant metastases) were analyzed on microarrays containing 25’000 genes.
RNA tumor 1, RNA tumor 2, RNA tumor 3, RNA tumor 4, RNA tumor 5, ... RNA tumor X: each analyzed against reference RNA (pool of all tumor RNAs)

Unsupervised clustering analysis:

Supervised classification was used to identify a gene signature with the highest correlation to disease outcome. With this approach an optimal number of 70 marker genes was identified, which showed high correlation to the disease outcome when analyzing the primary tumors.

Validation of the 70 gene signature on 19 additional breast cancers (which had not been included in the supervised classification before)
-> only 2 misclassifications out of 19 samples were obtained when using the 70 gene signature as a prognostic tool for disease outcome.
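The idea behind such a supervised signature can be sketched in a few lines. This is only an illustration under stated assumptions, not the published procedure: genes are ranked by the absolute correlation of their expression with the outcome, the top genes form the signature, and a new tumor is classified by correlating its signature values with the average profile of the disease free tumors. The expression values, gene names, the number of markers and the cutoff of 0.0 are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_markers(expr, outcome, n_markers):
    """Keep the genes whose expression correlates most strongly with outcome."""
    ranked = sorted(expr, key=lambda gene: abs(pearson(expr[gene], outcome)),
                    reverse=True)
    return ranked[:n_markers]

def good_prognosis_profile(expr, outcome, markers):
    """Average expression of each marker gene over the disease free tumors."""
    good = [i for i, o in enumerate(outcome) if o == 0]
    return [sum(expr[gene][i] for i in good) / len(good) for gene in markers]

def classify(sample, markers, profile, cutoff=0.0):
    """Call 'good' if the sample resembles the good prognosis profile."""
    values = [sample[gene] for gene in markers]
    return "good" if pearson(values, profile) > cutoff else "poor"

# Hypothetical log ratios (tumor vs reference pool) for six training tumors
expr = {
    "gene_1": [-1.0, -0.8, -1.2, 1.0, 0.9, 1.1],
    "gene_2": [0.9, 1.1, 1.0, -1.0, -0.9, -1.2],
    "gene_3": [-0.5, -0.4, -0.6, 0.7, 0.5, 0.6],
    "gene_4": [0.1, -0.1, 0.0, 0.1, -0.1, 0.0],   # unrelated to outcome
}
outcome = [0, 0, 0, 1, 1, 1]   # 0 = disease free, 1 = distant metastases

markers = select_markers(expr, outcome, n_markers=3)
profile = good_prognosis_profile(expr, outcome, markers)

new_tumor_a = {"gene_1": -0.8, "gene_2": 0.9, "gene_3": -0.3, "gene_4": 0.2}
new_tumor_b = {"gene_1": 0.9, "gene_2": -1.1, "gene_3": 0.6, "gene_4": -0.1}
print(classify(new_tumor_a, markers, profile))
print(classify(new_tumor_b, markers, profile))

# Validation result quoted above: 2 misclassifications out of 19 samples
validation_accuracy = (19 - 2) / 19
print(f"validation accuracy: {validation_accuracy:.1%}")
```

Testing the signature on tumors that were not used to select the marker genes, as done with the 19 additional samples, is what guards against a signature that only fits the training set.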
Comparison of the 70 gene signature prognosis to conventional prognosis criteria:
The gene signature has almost the same efficiency as conventional consensus criteria in selecting high risk patients; the number of “poor prognosis” classifications in disease free cases (misclassification rate) is, however, much lower.

Towards a higher throughput diagnostic tool:
Using full genome microarrays as a high throughput diagnostic tool is not feasible; a step towards routine clinical analysis of the cancer gene signature was the development of a small custom array, which performed as well as the full genome arrays.
Glas et al. BMC Genomics 2006, 7:278

Part B: Quantitative real-time PCR: a sensitive tool for medium to high throughput transcript analysis

Microarray results are often validated by another method: real-time quantitative PCR (qPCR).
qPCR is a very sensitive method for transcript analysis. In contrast to microarray analysis the number of genes to be queried is low (usually <100), but the sample throughput is high.
qPCR is a highly flexible method for gene expression analysis. Its specificity is determined by the primer (and probe) sequences used for the amplification.
In the following we’ll have a look at the basic essentials of this technology.

Why can’t we use “classical” PCR to quantitate RNA?
The amount of PCR product at the plateau varies from reaction to reaction, even when the same amounts of starting material are used.

How can we measure the formation of PCR products in “real-time” (-> at the time of enzymatic synthesis)?
-> The chemistries of PCR product detection via fluorescence
The formation of PCR product during the cycling of a PCR reaction can be measured by fluorescence.
There are two methods commonly used: SYBR® Green and fluorescent probes.

A. SYBR® Green
SYBR Green 1 fluorescence increases enormously when it binds to dsDNA.
-> The fluorescence increases with the formation of dsDNA product

B. Fluorescent probes
FRET: Fluorescence resonance energy transfer (or Förster resonance energy transfer)
Note: the use of a fluorescent probe adds one more level of specificity compared to a SYBR Green assay.

The basis for quantification: determination of the Ct value
• Baseline = basal level of fluorescence defined during the initial cycles of PCR (background fluorescence).
• Threshold = fixed fluorescence level set above the baseline (statistical cutoff based upon background fluorescence).
• Rn = normalized reporter signal, the level of fluorescence detected during PCR. Calculated by dividing the probe reporter dye signal by the passive reference signal (ROX).
• Ct = threshold cycle, the PCR cycle at which an increase in reporter fluorescence above the baseline signal is first detected (the cycle at which the fluorescence crosses the threshold).

Exponential growth phase = linear part of the curve in a logarithmic plot
A plot of the log of initial target copy number for a set of standards versus Ct is a straight line.

The relation of Ct value and target quantity
If the amount of target nucleic acid is doubled, the Ct value decreases by 1.
Relative quantity = 2^(Ct(sample A) - Ct(sample B)), i.e. the quantity in sample B relative to sample A
Note: the higher the initial copy number of target DNA, the lower the Ct value!
For a tenfold dilution series: 10 = 2^3.32, so each tenfold dilution shifts the Ct value by about 3.32 cycles.

Relative Quantification
The most common application of relative quantification is the analysis of gene expression (transcript abundance).
In a first enzymatic reaction, mRNA is transcribed into first strand cDNA via reverse transcription.
The cDNA is used as template for the real-time PCR.
Example: we want to compare the changes in gene expression from one sample to another, e.g.
- untreated cell culture vs treated one
- tissue against another one
- normal vs diseased
- etc

For relative quantification studies an endogenous control is required for “normalization” between different samples.
An endogenous control is an mRNA which is present at constant levels in the different samples to be analyzed.
The endogenous control normalizes for
- RNA input variation
- variation in cDNA synthesis efficiency (reverse transcription)

The comparative Ct method (∆∆Ct method)
∆Ct = Ct(target) - Ct(endogenous control); ∆∆Ct = ∆Ct(sample) - ∆Ct(calibrator sample); relative expression = 2^(-∆∆Ct), assuming that the amount of product doubles in each cycle.

Real time PCR: Applications

A. Gene expression analysis (analysis of relative transcript levels)
Isolation of total RNA -> reverse transcription -> cDNA -> quantitative real-time PCR (∆∆Ct method….)

A1. Analysis of differential splice variants
The expression of splice variants from a gene can be analyzed by real-time PCR. Essential for this approach is the location of the primers (and of the probe for TaqMan assays).
[Figure: gene with exons 1 2 3 4 separated by introns; transcript specific PCR primers for each form]
Full length:   1 2 3 4
Splice form 1: 1 2 4
Splice form 2: 1 3 4

B. Absolute quantification
B1. Detection and quantitation of pathogens in clinical samples
- e.g. determination of virus titer in blood samples
- use of a standard curve to determine the exact number of copies of a target (viral nucleic acid) in a defined volume of clinical sample
B2. Detection and quantification of GMOs
B3. Determination of the absolute amount of transcripts in a sample (allows the comparison of the abundance of different transcripts within a sample)

C. SNP analysis
The detection of a single nucleotide polymorphism is possible with probes which can discriminate a difference of only one nucleotide in the complementary sequence.
TaqMan probes with MGB are short enough to discriminate between SNP alleles (MGB: minor groove binder; these probes carry a non-fluorescent quencher).

Part C: Non gene-centered microarrays for transcriptome analysis: tiling arrays
[Figure: gDNA with 5’UTR, Exon 1, Intron, Exon 2 and 3’UTR, covered by specific tiling probes on the microarray]

Tiling array: a microarray design in which the probes are selected to interrogate a genome with a consistent, pre-determined spacing between each probe.
Tiling arrays allow transcripts from the genome to be measured in a non-biased way: they are not centered on annotated regions.

Transcriptome analysis with tiling arrays has given completely new insights into how genomes are transcribed.

What was known from genome sequencing and annotation:
Only 1-2% of the human genome consists of protein coding regions (exons), and one could assume that these protein coding sequences constitute the main part of the transcriptome.
Experiments with tiling arrays have, however, revealed that a much larger portion of the human genome is transcribed, with many transcripts arising from unannotated regions.

Some examples:
Human chromosomes 21 and 22, analysis of cytosolic poly(A)+ RNA (Kampa et al, 2004, Genome Res.)
-> 49% of the observed transcription lies outside annotated regions
Transcriptional maps of 10 human chromosomes, poly(A)+ and poly(A)-, cytosolic and nuclear (Cheng et al, 2005, Science)

Fig. 1. The correlation of detected transcription in one of eight cell lines to annotations along each of the 10 chromosomes is shown for each chromosome individually and as a collective of all chromosomes (cytosolic, poly A+).
J. Cheng et al., Science 308, 1149-1154 (2005). Published by AAAS

Fig. 3. Distribution of poly A+ and poly A- transcription in the nucleus and cytosol with respect to genome annotations.
J. Cheng et al., Science 308, 1149-1154 (2005). Published by AAAS

Important findings:
A much larger part of the genome is transcribed than was previously assumed from gene annotation.
Transcripts are found in the cytosol and the nucleus, which sounds trivial, but a large proportion of transcripts is unique to the nucleus and is never found in the cytosol.
The nuclear transcriptome is fivefold bigger than the cytosolic one.
A big portion of transcripts has no poly A tail: there are about 2.2 times as many uniquely poly A- (43.7%) transcribed sequences as uniquely poly A+ (19.4%).
There is a high degree of unannotated transcription: 56% of transcribed base pairs in the cytosolic poly A+ fraction, and even 80% in the nuclear transcriptome.
There is a high degree of antisense transcription: the analysis of human transcripts with tiling arrays revealed 61% antisense transcription*.
Long-range interconnected transcription: many genes use alternative 5’ ends that lie tens or hundreds of kilobases away from the annotated 5’ ends.
*: maybe an artifact resulting from reverse transcription, see: Wu et al. (2008) Genome Biology, 9, R3

These results show that big parts of the genome are transcribed and that annotated regions constitute only a small part of the transcriptome.
[-> Why are we discovering this only now?]

Non gene centered microarrays (and other methods) shed light on the complexity of genomic transcription, but we have very little knowledge about the function of unannotated transcripts (also often referred to as “non coding” transcripts; I prefer the expression “non protein coding transcripts” or transcripts with yet unknown function).
Some of the non protein coding RNAs code for micro RNAs and other small RNAs. The biology of non protein coding RNAs will be treated in my course in semester 8 in the Master GBE (Génomique Protéomique et Génétique quantitative).

Furthermore, non gene centered microarrays (and other techniques -> refer to the ENCODE project) also help to improve and revise gene annotation: it was discovered that some genes have distant exons which may be located kilobases away from the annotated regions.

One of the first publications describing this is a study done in Drosophila:
In this study the transcriptome of Drosophila embryogenesis was analyzed with tiling arrays.
(Analyzing the transcriptome over a time course during embryo development is clever, because most of the genes will be transcribed at one stage of development or another.)

In this study it was found that approximately 30% of all transcribed nucleotides map to unannotated regions of the fly genome.
The authors estimate, based on their data, that:
- 29% of all unannotated transcribed regions function as missed or alternative exons of known protein coding genes
- 15.6% of intergenic transcribed regions function as missed or alternative TSS (transcriptional start sites)

Let’s look at one specific example.
[Figure: graphs of the signals obtained for the probes on the array in the genomic region around the gene RhoGAP88C, for 12 sequential 2 h time points during Drosophila embryogenesis, shown together with the annotation; scale: 50 kb]

Are the expressed regions 5’ of RhoGAP88C coregulated individual transcripts, or do they give rise to one big transcript?
By RT (reverse transcription) PCR it was shown that there is actually one big transcript produced, joining RhoGAP88C with 3 upstream “genes” and a very distal 5’ UTR.

But how can one show that this transcript has any function?
-> combination of transcript data with mutant phenotypes
For Drosophila there is a huge database of P-element mutants available. The position of the P-elements is mapped to the genome.
P-elements are transposable elements that are widely used to mutate the genome of Drosophila melanogaster.
The white arrows indicate the location of lethal P-element insertions. Only one (the left) is directly in the gene for RhoGAP88C.
Complementation studies give genetic evidence for one big functional transcript: one P-element mutant cannot complement another one, indicating that both insertions are located in the same transcript.

This study shows that the combination of genetic data (P-element mutations, complementation studies) with molecular data is a powerful approach towards the identification of a biological function of unannotated transcription.

Part D: Ultra High Throughput Sequencing (UHTS)

UHTS is a new technology with the potential to replace hybridization based technologies like tiling arrays.

Drawbacks of tiling arrays:
- signal background
- moderate sensitivity
- sensitive to polymorphisms in the sequence (especially true for short oligo arrays)
- as discussed in the Drosophila article, putative new transcripts have to be verified by other methods (RT-PCR followed by sequencing)

If only one could sequence the transcriptome in a quantitative way! This would be the solution!
And the good news: today, one can do so!

Classical sequencing and UHTS (also referred to as Next Generation Sequencing, NGS): what is the difference?

Classical sequencing (Sanger method)
- 1 sequencing reaction reads 1 template DNA (either cloned or PCR product)
- High throughput machines can process 96 samples in one run
- Sequence reads of 500 bases and more

UHTS, Solexa machine
- 1 sequencing run reads millions (!) of template DNAs at the same time
- but the reads are short: 36-76 bases

How can we analyze the transcriptome with UHTS?
Isolate mRNA
-> fragment mRNA (~200 base fragments)
-> transcribe into cDNA
-> sequence cDNAs: millions of 36 base reads
-> map reads to genome and assemble reads into transcripts
This method is called RNA-Seq.

Example of RNA-seq data:
From Nagalakshmi et al (2008) Science 320, 1344

RNA-Seq: what information do the data give?
- RNA-seq is quantitative: the read count per transcript is a measure of its abundance; the sensitivity and the dynamic range are higher than with microarrays
- it is a non biased method: transcripts from any region of the genome can be detected (independent of annotation)
- discovery of new transcripts
- identification of splice sites and splice variants, 5’ and 3’ ends of transcripts

See also:
Nature Rev Genet 9 (2008)
Zhong Wang, Mark Gerstein & Michael Snyder. Innovation: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57-63 (January 2009) | doi:10.1038/nrg2484
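The statement that the read count per transcript is a measure of its abundance can be made concrete with a small sketch. It assumes a single chromosome, reads represented only by their mapped start position, and made-up transcript coordinates. The normalization to reads per kilobase per million mapped reads (RPKM) is not mentioned in the slides; it is one common way to make counts comparable between transcripts of different length and between runs of different depth.

```python
def count_reads(read_positions, transcripts):
    """Count reads whose start lies inside each transcript interval [start, end)."""
    counts = {name: 0 for name in transcripts}
    for pos in read_positions:
        for name, (start, end) in transcripts.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped_reads)

# Hypothetical annotation: transcript name -> (start, end) on one chromosome
transcripts = {"tx_a": (100, 1100), "tx_b": (2000, 2500)}

# Hypothetical mapped read start positions; 5000 hits no annotated transcript
reads = [150, 300, 900, 1099, 2100, 2400, 5000]

counts = count_reads(reads, transcripts)
total = len(reads)
abundance = {
    name: rpkm(counts[name], end - start, total)
    for name, (start, end) in transcripts.items()
}
unannotated = total - sum(counts.values())

print(counts)
print(abundance)
print("reads outside annotation:", unannotated)
```

In this toy data set tx_a collects twice as many reads as tx_b but is also twice as long, so both end up with the same normalized abundance. The read that maps outside both intervals is the kind of signal that points to unannotated transcription.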