From Genome Sequence towards Analysis of the Transcriptome
Genomic Sequence Information
Nucleotide sequence and physical position on the chromosomes
AATTCGCGCGAAT…….
TTAAGCGCGCTTA……
Annotation
“Genes”: Promoter, TSS, 5’UTR, Exons, Introns, 3’UTR, TTS
“non coding” regions
Transcriptome Analysis
- based on annotation: biased
- based on genomic sequence information (nucleotide sequence and physical position on the chromosomes): non biased
Part A: Gene centered microarrays for transcriptome analysis
Microarrays are designed on the basis of gene annotation of the genome
[Figure: gDNA with 5’UTR, Exon 1, Intron, Exon 2 and 3’UTR; the transcript carries a Cap and a polyA tail]
Design of gene-specific probe(s):
oligonucleotides (25-70mers)
PCR amplification of gene specific sequences (GSTs)
Microarray
Remark: first generations of microarrays were based on ESTs (expressed sequence tags)
GST: gene specific tag
Today’s microarrays have specific probes for all annotated genes of an organism’s
genome. These arrays are called “whole genome microarrays”.
They make it possible to measure the transcript abundance of all annotated genes in the genome
within one experiment.
Typical examples of microarray data:
Here an RNA sample is analyzed against itself (self vs self)
$M = \log_2(R/G)$
$A = \log_2\sqrt{R \cdot G} = \tfrac{1}{2}\log_2(R \cdot G)$
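The M and A values of such a plot can be computed per spot from the two channel intensities. The following is a minimal sketch, assuming that R and G are the red and green channel intensities and that A is the log2 of their geometric mean (the usual convention for MA plots); the function name `ma_values` and the intensity values are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red (R) and green (G) intensities."""
    m = math.log2(red / green)        # log ratio: 0 means equal signal in both channels
    a = 0.5 * math.log2(red * green)  # mean log intensity of the spot
    return m, a

# Hypothetical spot intensities (R, G) for three genes
spots = {
    "gene_a": (1000.0, 1000.0),  # unchanged
    "gene_b": (4000.0, 1000.0),  # fourfold higher in the red channel
    "gene_c": (250.0, 1000.0),   # fourfold lower in the red channel
}

for name, (r, g) in spots.items():
    m, a = ma_values(r, g)
    print(f"{name}: M={m:+.2f}  A={a:.2f}")
```

In a self vs self experiment the M values should scatter around 0 over the whole range of A; in a liver vs heart comparison, differentially expressed genes move away from M = 0.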
Here a liver sample is analyzed against a heart sample
Microarray technology, which made it possible to measure the expression of
all genes within a genome at once, was a revolution in modern molecular
biology
Microarrays have since been used for everything from “simple” gene fishing to
systemic approaches (understanding regulatory networks), and also as a
tool for diagnostics in medicine
Let’s look at and discuss some specific examples of how this technology
was used to analyze the transcriptome and what we can learn from it…
Example 1: using microarray transcript profiling in order to understand complex
biological processes
Example 2: using microarrays as a diagnostic tool
Example 1: the transcriptional program of cell entry into quiescence
Cells have to decide whether to divide or to enter G0 (quiescence).
In fact, most eukaryotic cells spend most of their lifetime in G0. G0 is not an
“endpoint”: cells can return to the mitotic cell cycle. A good example is the
wound response, where quiescent cells (fibroblasts and epidermal stem cells) rapidly
start to proliferate. Once tissue repair is accomplished, the cells re-enter G0.
The transition from G0 into cell division has been studied in detail (one driving force
for this is certainly its link to cancer biology).
Much less is known about how cells exit the cell cycle in order to enter G0. Is it a
passive process where loss of growth factors causes the downregulation of cell cycle
genes? Or is it an active process with its own unique transcriptional program?
Cell cycle
Quiescence
“Cells constantly sense their environment to decide whether to
divide. Many genes that control the entry into cell division are
known, and their excessive activation may cause cancer. In contrast,
the way that cells cease to divide was thought to be a passive
process, where signals for cell division gradually decay.”
Liu H, Adler AS, Segal E, Chang HY (2007) A transcriptional program mediating entry into cellular
quiescence. PLoS Genet 3(6): e91. doi:10.1371/journal.pgen.0030091
The experimental model: human fibroblast cell culture
Serum deprivation (SD): dividing -> quiescent
- Cells grown in 10% FBS are switched to a medium with 0.1% FBS
- Samples are taken at successive time points; transcript analysis with microarrays
- Reveals the transcriptional program for entry into quiescence
Serum stimulation (SS): quiescent -> dividing
- Cells grown in 0.1% FBS are switched to a medium with 10% FBS
- Samples are taken at successive time points; transcript analysis with microarrays
- Reveals the transcriptional program for entry into the cell cycle
FBS: fetal bovine serum; contains all growth factors needed to induce cell division in fibroblast cell cultures
Time course: fibroblast RNA from each time point (0, 0.25, 0.5, 1, 1.5 h, etc.) is analyzed on its own microarray (microarray 1, 2, 3, 4, 5, 6, etc.) together with human reference RNA
Genes that show induction or repression in the
late phase show a symmetric regulation
(induced upon SS but repressed under SD and
vice versa)
Genes that change early on, however, are not
symmetrically regulated at all ->
Entry into and exit from G0 each have their own unique
transcriptional program to initiate the transition
Early Response Genes
SDERGs (serum deprivation
early response genes) show
an asymmetric behavior between
SD and SS: immediate induction in
SD but little or no regulation during
SS.
-> entry into G0 has its unique gene
expression program
Several SDERGs are known to be induced by interferon (for example STAT1)
Two SDERGs, SALL2 and MXI1, are putative tumor suppressors.
The authors propose that some of the SDERGs might be “master regulators”
of the entry into quiescence.
SALL2: zinc finger transcription factor
MXI1:MAX interactor 1, antagonist of MYC oncoproteins
IRF1: interferon regulatory factor, transcription factor
Which experiments could be performed to investigate a role of these three
genes as “master regulators” of cell entry into G0?
Gene knock-down with siRNAs
Cells were transfected with siRNA constructs and the efficiency was determined
by real-time quantitative PCR
The effect of the gene knock-downs on cell cycle exit and the transcriptional
program after SD
FACS (fluorescence-activated cell sorting) analysis of cell cycle states:
Transcript profiling of cells with siRNA constructs
Do SDERGs play a role in human cancer?
-> interrogation of public databases of
microarray data from human cancers
Prostate cancer
Coordinate repression of SDERGs identifies over 90%
of prostate tumors relative to normal prostate
Breast Cancer
-> Diminished expression of SDERGs in grade 3 tumors
(grade 3: high cell proliferation, less differentiation)
-> Patients with diminished expression of SDERGs had
significantly worse survival
Summary:
Microarray analysis permits profiling of the transcriptome at full genome level.
The discussed example of cell entry into quiescence shows that full genome
transcriptome analysis can yield insights into the transcriptional program of cells
and identify genes as “master regulators” of this transcriptional program.
The function of the genes which control entry into G0 was confirmed by an siRNA
approach.
The results of many microarray experiments are stored in public databases. This wealth
of data can be exploited to gain insights into the involvement of genes of
interest in diverse biological processes. In the discussed example it was concluded
that genes which control entry into G0 are implicated in human cancer. This class
of genes (“SDERGs”) likely antagonizes cell proliferation in many cell types.
Example 2: gene expression signatures as a diagnostic tool for tumor
classification regarding clinical outcome
primary tumor -> (time) -> metastasis free, high survival
primary tumor -> (time) -> metastasis formation, low survival
Prognostic factors that determine whether a primary tumor will or will not develop
metastases over time are of high clinical importance for the treatment decision.
Example breast cancer: prognosis based on histological and clinical characteristics
(St. Gallen criteria, NIH criteria)
Can gene expression profiles of primary tumors serve as prognostic markers for
the clinical outcome of cancer?
Nature (2002), 415, 530-536
This is one of the first articles on the identification of a gene expression signature
which can be used as a prognostic factor for cancer clinical outcome.
(This article has since been cited 2491 times!)
Experimental design: samples from 98 primary tumors from patients with known
clinical outcome after five years (disease free or development of distant metastases)
were analyzed on microarrays containing 25,000 genes
Each tumor RNA (tumor 1, 2, 3, 4, 5, ... X) is analyzed against a reference RNA (pool of all tumor RNAs)
Unsupervised clustering analysis:
Supervised classification was used to identify a gene signature with the highest
correlation to disease outcome. With this approach an optimal number of 70
marker genes was identified, which showed high correlation to the disease outcome
when analyzing the primary tumors.
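The gene selection step described above can be sketched as follows. This is only a minimal illustration, assuming that genes are ranked by the absolute Pearson correlation between their expression values and a 0/1 outcome label; the function names, the three genes and all numbers are hypothetical and are not taken from the article.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def top_marker_genes(expression, outcome, k):
    """Return the k genes whose expression correlates most strongly with outcome."""
    scores = {gene: abs(pearson(values, outcome)) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: 6 tumors, outcome 0 = disease free, 1 = distant metastases
outcome = [0, 0, 0, 1, 1, 1]
expression = {  # log ratios versus the reference RNA, one value per tumor
    "gene_a": [-1.0, -0.8, -1.2, 1.1, 0.9, 1.0],  # higher in tumors that metastasize
    "gene_b": [0.1, -0.2, 0.3, 0.0, -0.1, 0.2],   # unrelated to outcome
    "gene_c": [1.0, 1.2, 0.9, -1.0, -0.7, -1.1],  # lower in tumors that metastasize
}

print(top_marker_genes(expression, outcome, k=2))
```

Both positively and negatively correlated genes are kept, because either direction is informative for the prognosis.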
Validation of the 70 gene signature on 19 additional breast cancers (which had not
been included in the supervised classification)
-> only 2 misclassifications out of 19 samples were obtained when using the
70 gene signature as a prognostic tool for disease outcome.
Comparison of 70 gene signature prognosis to conventional prognosis criteria:
The gene signature is almost as efficient as conventional consensus
criteria at selecting high-risk patients; the number of “poor prognosis” classifications
in disease-free cases (misclassification rate) is, however, much lower.
Towards a higher throughput diagnostic tool:
Using full genome microarrays as a high-throughput diagnostic tool is not feasible;
a step towards routine clinical analysis of the cancer gene signature was the
development of a small custom array, which performed as well as the full genome
arrays
Glas et al.
BMC Genomics 2006, 7:278
Part B: Quantitative real-time PCR: a sensitive tool for medium to
high throughput transcript analysis
Microarray results are often validated by another method: real-time quantitative
PCR (qPCR). qPCR is a very sensitive method for transcript analysis. In contrast
to microarray analysis the number of genes to be queried is low (usually <100), but
the sample throughput is high.
qPCR is a highly flexible method for gene expression analysis. Its specificity is
determined by the primer (and probe) sequences used for the amplification.
In the following we’ll have a look at the basic essentials of this technology.
Why can’t we use “classical” PCR to quantitate RNA?
The amount of PCR product at the plateau varies from reaction to
reaction even when the same amount of starting material is used:
How can we measure the formation of PCR products in “real-time”
(-> at the time of enzymatic synthesis)?
-> The chemistries of PCR product detection via fluorescence
The formation of PCR product during the cycling of a PCR reaction can be
measured by fluorescence
There are two methods commonly used:
SYBR® Green and fluorescent probes
A. SYBR® Green
SYBR Green I fluorescence increases
enormously when binding to dsDNA
-> The fluorescence increases with
the formation of dsDNA product
B: Fluorescent probes:
FRET:
Fluorescence resonance energy transfer
(or Förster resonance energy transfer)
Note: the use of a fluorescent probe adds one more level of specificity compared
to a SYBR Green assay
The basis for quantification: determination of the Ct value
[Figure: amplification plot with Baseline, Threshold, Rn and Ct marked]
Baseline = Basal level of fluorescence defined during the initial cycles of PCR (background
fluorescence).
Threshold = Fixed fluorescence level set above the baseline (statistical cutoff based upon
background fluorescence).
Rn = normalized Reporter signal, level of fluorescence detected during PCR. Calculated by dividing
probe reporter dye signal by passive reference signal (ROX).
Ct = threshold Cycle, PCR cycle at which an increase in reporter fluorescence above a baseline
signal is first detected (cycle when fluorescence crosses the threshold).
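How a Ct value follows from these definitions can be sketched in a few lines. This is a minimal illustration under stated assumptions: the threshold is set to the baseline mean plus 10 standard deviations over cycles 3 to 15 (an assumed choice, instruments and users may set it differently), the fractional Ct is found by linear interpolation between two cycles, and the fluorescence curve is simulated rather than measured. The function names are hypothetical.

```python
import statistics

def threshold_from_baseline(rn, first=3, last=15, n_sd=10.0):
    """Threshold = mean + n_sd * SD of Rn over the baseline cycles (1-based, inclusive)."""
    baseline = rn[first - 1:last]
    return statistics.mean(baseline) + n_sd * statistics.stdev(baseline)

def ct_value(rn, threshold):
    """Fractional cycle at which Rn first crosses the threshold, or None if it never does."""
    for i in range(1, len(rn)):
        # rn[i - 1] is the reading of cycle i, rn[i] the reading of cycle i + 1
        if rn[i - 1] < threshold <= rn[i]:
            frac = (threshold - rn[i - 1]) / (rn[i] - rn[i - 1])
            return i + frac
    return None

# Simulated Rn readings for cycles 1..40: constant background plus a signal
# that doubles every cycle and is capped to mimic the plateau
rn = [0.1 + min(1e-7 * 2 ** cycle, 2.0) for cycle in range(1, 41)]

threshold = threshold_from_baseline(rn)
print(f"threshold = {threshold:.4f}")
print(f"Ct = {ct_value(rn, threshold):.2f}")
```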
Exponential growth phase = linear part of the curve in a logarithmic plot
A plot of the log of initial target copy number for a set of standards versus Ct
is a straight line
The relation of Ct value and target quantity
If the amount of target nucleic acid is doubled, the Ct value decreases by 1
$\text{Relative Quantity} = 2^{\mathrm{Ct}(\text{sample A}) - \mathrm{Ct}(\text{sample B})}$
Note: the higher the initial copy number of target DNA, the lower the Ct value!
For a tenfold dilution series:
$10 = 2^{3.32}$, so each tenfold dilution shifts the Ct value by 3.32 cycles
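The relation between Ct difference and relative quantity can be checked with a short calculation. This is a minimal sketch, assuming perfect doubling of the product in every cycle; the function names and the Ct values are hypothetical.

```python
import math

def relative_quantity(ct_a, ct_b):
    """Quantity of sample B relative to sample A, assuming doubling per cycle."""
    return 2.0 ** (ct_a - ct_b)

def ct_shift_for_dilution(fold):
    """Number of cycles by which Ct increases when the template is diluted fold-fold."""
    return math.log2(fold)

# Sample B crosses the threshold one cycle earlier than sample A,
# so it contained twice as much target
print(relative_quantity(25.0, 24.0))        # 2.0

# A tenfold dilution delays the Ct by log2(10) cycles
print(round(ct_shift_for_dilution(10), 2))  # 3.32
```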
Relative Quantification
The most common application of relative quantification is the analysis of gene expression (transcript abundance)
In a first enzymatic reaction, mRNA is reverse-transcribed into first-strand cDNA
The cDNA is used as template for the real-time PCR
Example:
we want to compare the changes in gene expression from one
sample to another
e.g.
untreated cell culture vs treated
one tissue against another one
normal vs diseased
etc
For relative quantification studies an endogenous control is required for
“normalization” between different samples
An endogenous control is an mRNA, which is present at constant levels
in the different samples to be analyzed
The endogenous control normalizes for
- RNA input variation
- variation in cDNA synthesis efficiency (reverse transcription)
The comparative Ct method (∆∆Ct method)
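The slide names the method without spelling out the calculation. A minimal sketch, assuming the commonly used formulation: ∆Ct = Ct(target) - Ct(endogenous control) within each sample, ∆∆Ct = ∆Ct(sample) - ∆Ct(calibrator), and fold change = 2^(-∆∆Ct), with both assays amplifying at close to 100% efficiency. The function name and the Ct values are hypothetical.

```python
def fold_change_ddct(ct_target_sample, ct_control_sample,
                     ct_target_calibrator, ct_control_calibrator):
    """Expression of the target in the sample relative to the calibrator."""
    # Normalize the target to the endogenous control within each sample
    delta_ct_sample = ct_target_sample - ct_control_sample
    delta_ct_calibrator = ct_target_calibrator - ct_control_calibrator
    # Compare the normalized values between sample and calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)

# Hypothetical Ct values: treated cells (sample) vs untreated cells (calibrator)
fold = fold_change_ddct(
    ct_target_sample=22.0, ct_control_sample=18.0,          # delta Ct = 4
    ct_target_calibrator=25.0, ct_control_calibrator=18.0,  # delta Ct = 7
)
print(fold)  # 8.0: the target transcript is eightfold more abundant after treatment
```

Because the endogenous control is subtracted within each sample, differences in RNA input or in reverse transcription efficiency cancel out before the two samples are compared.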
Real time PCR: Applications
A. Gene expression analysis (analysis of relative transcript levels)
Isolation of total RNA
reverse transcription -> cDNA
quantitative real-time PCR (∆∆Ct method….)
A1. Analysis of differential splice variants
The expression of splice variants from a gene can be analyzed by real-time
PCR. Essential for this approach is the location of the primers (and probe for
TaqMan assays).
[Figure: gene with exons 1, 2, 3 and 4 separated by introns, and transcript specific PCR primers for each transcript]
Full length: exons 1-2-3-4
Splice form 1: exons 1-2-4
Splice form 2: exons 1-3-4
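Why primer location matters can be illustrated with a toy model. The sketch below assumes a gene with four exons and three transcripts (full length 1-2-3-4, splice form 1 with exons 1-2-4, splice form 2 with exons 1-3-4) and places a primer across an exon-exon junction; the exon sequences are invented, and a primer is simply treated as a substring of the transcript (strand orientation and mismatches are ignored).

```python
# Hypothetical exon sequences (10 bases each)
EXONS = {1: "ATGGCCAAGT", 2: "CCTGAGGTCA", 3: "GTTCACTGGA", 4: "TTCGGATAAC"}

# Exon composition of each transcript
ISOFORMS = {
    "full length": (1, 2, 3, 4),
    "splice form 1": (1, 2, 4),
    "splice form 2": (1, 3, 4),
}

def transcript(exon_order):
    """Join the exons in the given order into one transcript sequence."""
    return "".join(EXONS[e] for e in exon_order)

def junction_primer(upstream, downstream, flank=6):
    """Primer spanning the junction: last bases of one exon plus first bases of the next."""
    return EXONS[upstream][-flank:] + EXONS[downstream][:flank]

def detected_isoforms(primer):
    """Names of all transcripts that contain the primer sequence."""
    return [name for name, order in ISOFORMS.items() if primer in transcript(order)]

for junction in [(2, 3), (2, 4), (1, 3)]:
    primer = junction_primer(*junction)
    print(junction, primer, detected_isoforms(primer))
```

A primer inside a single exon (for example exon 4) would match all three transcripts, whereas each junction-spanning primer matches only the transcript in which the two exons are actually joined.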
B. Absolute quantification
B1. Detection and quantitation of pathogens in clinical samples
- e.g. determination of virus titer in blood samples
use of a standard curve to determine the exact quantity of copies
of a target (viral nucleic acid) in a defined volume of clinical sample
B2. Detection and quantification of GMOs
B3. Determination of the absolute amount of transcripts in a sample
(allows the comparison of the abundance of different transcripts within
a sample)
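The standard curve approach mentioned under B1 can be sketched as follows: Ct values of standards with known copy numbers are fitted with a straight line against log10(copies), and the line is then used to read off the copy number of an unknown sample. The function names, the standards and the Ct values are hypothetical; the fit is an ordinary least-squares line.

```python
import math

def fit_standard_curve(copies, cts):
    """Least-squares line Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def copies_from_ct(ct, slope, intercept):
    """Copy number of an unknown sample read off the standard curve."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical tenfold dilution series of a standard with known copy numbers
standard_copies = [1e2, 1e3, 1e4, 1e5, 1e6]
standard_cts = [33.36, 30.04, 26.72, 23.40, 20.08]

slope, intercept = fit_standard_curve(standard_copies, standard_cts)
print(f"slope = {slope:.2f} cycles per tenfold change")

unknown_ct = 28.38
print(f"unknown sample: about {copies_from_ct(unknown_ct, slope, intercept):.0f} copies")
```

The unknown must fall within the range covered by the standards; extrapolating beyond the dilution series is not reliable.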
C. SNP analysis
The detection of a single nucleotide polymorphism is possible with probes,
which allow the discrimination of only one nucleotide difference in the
complementary sequence
TaqMan MGB probes are short enough to discriminate between SNP alleles
(MGB: minor groove binder; these probes also carry a non-fluorescent quencher)
Part C: Non gene-centered microarrays for transcriptome
analysis: tiling arrays
[Figure: gDNA with 5’UTR, Exon 1, Intron, Exon 2 and 3’UTR, with specific tiling probes on the microarray spaced along it]
Tiling array:
A microarray design in which the probes are selected to interrogate a
genome with a consistent, pre-determined spacing between each probe.
Tiling arrays make it possible to measure transcripts from the genome in a non-biased way:
they are not centered on annotated regions.
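The tiling principle can be sketched in a few lines: probe positions are derived from the coordinates alone, without consulting any annotation. The probe length of 25 bases and the start-to-start spacing of 35 bases are assumed example values, and the function name is hypothetical.

```python
def tiling_probe_starts(region_start, region_end, probe_len=25, spacing=35):
    """Start coordinates of probes placed at a fixed start-to-start spacing.

    Coordinates are 1-based and inclusive; every probe lies fully inside the region.
    """
    last_start = region_end - probe_len + 1
    return list(range(region_start, last_start + 1, spacing))

starts = tiling_probe_starts(1, 200)
print(starts)  # [1, 36, 71, 106, 141, 176]

# Each probe covers probe_len bases; the gap between neighbours is spacing - probe_len
for start in starts:
    print(f"probe {start}-{start + 24}")
```

With a spacing smaller than the probe length the probes would overlap, which increases resolution at the cost of more features on the array.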
Transcriptome analysis with tiling arrays has given completely new
insights into how genomes are transcribed
What was known from genome sequencing and annotation:
The human genome contains only 1-2% of protein coding regions (exons) and
one could assume that these protein coding sequences constitute the main part of
the transcriptome.
Experiments with tiling arrays have, however, revealed that a much larger
portion of the human genome is transcribed, with many transcripts
arising from unannotated regions.
Some examples:
Human chromosomes 21 and 22, analysis of cytosolic poly(A)+ RNA
(Kampa et al, 2004, Genome Res.)
49% of the observed transcription lies outside annotated regions
Transcriptional maps of 10 human chromosomes, poly(A)+ and poly(A)-,
cytosolic and nuclear
(Cheng et al, 2005, Science)
Fig. 1. The correlation of detected transcription in one of eight cell lines to annotations along each of
the 10 chromosomes is shown for each chromosome individually and as a collective of all
chromosomes
cytosolic, poly A+
J. Cheng et al., Science 308, 1149 -1154 (2005)
Published by AAAS
Fig. 3. Distribution of poly A+ and poly A- transcription in the nucleus and cytosol with respect to
genome annotations
J. Cheng et al., Science 308, 1149 -1154 (2005)
Important findings:
A much larger part of the genome is transcribed than was previously
assumed from gene annotation.
Transcripts are found in the cytosol and the nucleus, which sounds trivial, but
a large proportion of transcripts is unique to the nucleus and is never
found in the cytosol. The nuclear transcriptome is fivefold bigger than the cytosolic one.
A large portion of transcripts is not polyadenylated:
there are about 2.2 times as many uniquely poly A– (43.7%) transcribed sequences
as uniquely poly A+ (19.4%).
There is a high degree of unannotated transcription: 56% of transcribed base pairs
in the cytosolic poly A+ fraction, and as much as 80% in the nuclear transcriptome.
There is a high degree of antisense transcription: the analysis of human transcripts
with tiling arrays revealed 61% antisense transcription*
Long-range interconnected transcription: many genes use alternative 5’ ends that lie
tens to hundreds of kilobases away from annotated 5’ ends.
*: may be an artifact resulting from reverse transcription
see: Wu et al. (2008) Genome Biology, 9, R3
These results show that large parts of the genome are transcribed and that annotated
regions constitute only a small part of the transcriptome.
[-> Why is it only now that we discover this?]
Non gene centered microarrays (and other methods) shed light on the complexity
of genomic transcription, but we have very little knowledge about the function
of unannotated transcripts (often also referred to as “non coding” transcripts; I prefer
the expression “non protein coding transcripts” or transcripts with yet unknown
function)
Parts of the non protein coding RNAs code for micro RNAs and other small RNAs.
The biology of non protein coding RNAs will be treated in my course in semester 8
in the Master GBE (Génomique Protéomique et Génétique quantitative)
Furthermore, non gene centered microarrays (and other techniques -> refer to
the ENCODE project) also help to improve and revise gene annotation:
it was discovered that some genes have distant exons which may be located
kilobases away from the annotated regions
One of the first publications describing this is a study done in Drosophila:
In this study the transcriptome of Drosophila embryogenesis was analyzed with
tiling arrays.
(Analyzing the transcriptome over a time course during embryo
development is clever, because most of the genes will be transcribed at one or
another stage of development)
In this study it was found that approximately 30% of all transcribed nucleotides
map to unannotated regions of the fly genome.
The authors estimate based on their data that:
29% of all unannotated transcribed regions function as missed or alternative
exons of known protein coding genes
15.6% of intergenic transcribed regions function as missed or alternative TSS
(transcriptional start sites)
Let’s look at one specific example
Graphs of the signals obtained for the probes on the array in the genomic region
around the gene RhoGAP88C
[Figure: probe signals at 12 sequential 2 h time points during Drosophila embryogenesis, shown above the annotation; scale bar: 50 kb]
Are the expressed regions 5’ of RhoGAP coregulated individual transcripts or do
they give rise to one big transcript?
RT-PCR (reverse transcription PCR) showed that there is actually one big
transcript produced, joining RhoGAP with 3 upstream “genes” and a very distal
5’ UTR.
But how can one show that this transcript has any function?
-> combination of transcript data with mutant phenotypes
For Drosophila there is a huge database of P-element mutants available.
The position of the P-elements is mapped to the genome.
P-elements are transposable elements that are widely used to mutate the
genome of Drosophila melanogaster
The white arrows indicate the location of lethal P-element insertions.
Only one (the left) is directly in the gene for RhoGAP88C.
Complementation studies give genetic evidence for one big functional
transcript: one P-element mutant cannot complement another one, indicating
that they are located in the same transcript
This study shows that the combination of genetic data (P-element mutations,
complementation studies) with molecular data is a powerful approach
towards the identification of a biological function of unannotated transcription
Part D: Ultra High Throughput Sequencing (UHTS)
UHTS is a new technology with the potential to replace hybridization based
technologies like tiling arrays
Drawbacks of tiling arrays:
- signal background
- moderate sensitivity
- sensitive to polymorphisms in the sequence (especially true for short oligo arrays)
- as discussed in the Drosophila article, putative new transcripts have to be
verified by other methods (RT-PCR followed by sequencing)
If only one could sequence the transcriptome in a quantitative way! This would
be the solution!
And the good news: today, one can do so!
Classical sequencing and UHTS (also referred to as next-generation sequencing, NGS):
what is the difference?
Classical Seq. (Sanger method)
1 seq. reaction will read
1 template DNA (either cloned or
PCR product).
High throughput machines can process
96 samples in one run.
Sequence reads of 500 bases and more
UHTS, Solexa machine
1 seq. run will read
millions! of template DNAs
at the same time,
but the reads are short:
36-76 bases
How can we analyze the transcriptome with UHTS?
Isolate mRNA
fragment mRNA (~200-base fragments)
transcribe into cDNA
sequence cDNAs
millions of 36 base reads
map reads to genome and assemble reads
into transcripts
This method is called
RNA-Seq
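The last step of the workflow above (map reads to the genome, then count) can be illustrated with a toy example. This is only a sketch under strong simplifications: the "genome" is a 40-base invented sequence, reads are 8 bases instead of 36, mapping is an exact forward-strand substring search, and a read is assigned to a feature by its start position alone. Real read mappers also handle both strands, mismatches and reads spanning splice junctions. All names and sequences are hypothetical.

```python
def map_read(genome, read):
    """Return the 0-based start of the first exact match of read, or None."""
    pos = genome.find(read)
    return pos if pos >= 0 else None

def count_reads(genome, reads, features):
    """Count mapped reads per annotated feature; other mapped reads are 'unannotated'."""
    counts = {name: 0 for name in features}
    counts["unannotated"] = 0
    for read in reads:
        pos = map_read(genome, read)
        if pos is None:
            continue  # read does not map to this genome
        for name, (start, end) in features.items():
            if start <= pos < end:  # half-open interval [start, end)
                counts[name] += 1
                break
        else:
            counts["unannotated"] += 1
    return counts

genome = "ACGTTGCAATCGGATTACAGGCTTAACCGGTATCGATCCA"
features = {"gene_x": (0, 16), "gene_y": (24, 40)}  # annotated regions
reads = ["GTTGCAAT", "GCAATCGG", "CAGGCTTA", "CCGGTATC", "TTTTTTTT"]

print(count_reads(genome, reads, features))
# {'gene_x': 2, 'gene_y': 1, 'unannotated': 1}
```

The read that maps between the two annotated genes is not discarded: it is exactly this kind of signal that reveals transcription outside annotated regions.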
Example of RNA-seq data:
From Nagalakshmi et al (2008) Science 320, 1344
RNA-Seq: what information do the data give?
- RNA-seq is quantitative: the read count per transcript is a measure of its
abundance; the sensitivity and the dynamic range are higher than with microarrays
- it is a non-biased method: transcripts from any region of the genome can be
detected (independent of annotation)
- discovery of new transcripts
- identification of splice sites and splice variants, 5’ and 3’ ends of transcripts
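Raw read counts depend on transcript length and on the total number of reads in a run, so they are usually normalized before abundances are compared. One widely used measure, which the slides do not describe and which is given here only as an illustrative assumption, is RPKM (reads per kilobase of transcript per million mapped reads). The function name and all numbers are hypothetical.

```python
def rpkm(read_count, transcript_length, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length * total_mapped_reads)

total_reads = 10_000_000  # mapped reads in the whole run

# Two transcripts with the same raw read count but different lengths
print(rpkm(500, 1000, total_reads))  # 50.0
print(rpkm(500, 4000, total_reads))  # 12.5
```

The longer transcript yields more fragments per molecule, so the same read count corresponds to a lower abundance.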
See also:
Nature Rev Genet 9 (2008)
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57-63. doi:10.1038/nrg2484