cDNA Sequencing, SAGE and Microarray Analysis
Outline
• Overview of transcription
• Construction of cDNA libraries
• cDNA sequencing
• Expression analysis via SAGE
• Microarray construction and use in expression analysis
We can isolate mRNA and convert it to a stable form (cDNA)
DNA → mRNA → protein
Isolate, reverse transcribe, label → cDNAs
Genome in numbers
Nucleic acid content of an average human cell
Abundance distribution of mRNA species in a typical mammalian cell
Isolation of mRNA
cDNA library construction The big picture
cDNA synthesis
cDNA synthesis occurs in the 5’ to 3’ direction and requires:
– a template
– nucleotides (dNTPs)
– reverse transcriptase (retroviral polymerase)
– a primer to initiate synthesis
The resulting duplex cDNA is a DNA copy of the original mRNA.
It is now ready for manipulation (cloning, library construction).
Priming alternatives for cDNA construction
Oligo dT: priming at the 3’ terminus
Random hexamers: priming throughout the sequence
Cloning: Blunt end vs. sticky
cDNA library construction
Ligation of cDNA into vector
Directional cloning
cDNA library
Definition of a good cDNA library
Ideally contains at least one copy of every expressed gene
The probability of this is a function of:
– fragment size: the longer the fragment, the more likely the gene is represented
– genome size: smaller genome = increased chance of finding the gene represented
– expression: high expression = high likelihood of finding the gene represented
For 99% probability, a mammalian cDNA library needs to contain ~800,000 clones
Uses of a cDNA library
• Representation of the population of genes defining a cell’s phenotype
• Long-term stable storage of information
• Retrieve full-length genes rather than fragments (screening)
• Find gene “family” members (screening)
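The slides quote the 99% / ~800,000 clone figure without showing where it comes from. A minimal sketch, assuming the standard sampling argument (each clone is an independent draw, and a transcript making up a fraction f of the mRNA pool is missed by N clones with probability (1 - f)^N). The function names and the example fraction are illustrative assumptions, not taken from the slides.

```python
import math


def clones_needed(p_success: float, rare_fraction: float) -> int:
    """Clones required so that a transcript making up `rare_fraction` of the
    mRNA pool appears at least once with probability `p_success`.

    Solves 1 - (1 - f)^N >= P for N (assumed sampling model)."""
    return math.ceil(math.log1p(-p_success) / math.log1p(-rare_fraction))


def implied_fraction(p_success: float, n_clones: int) -> float:
    """Rarest transcript fraction still covered by `n_clones` at `p_success`."""
    return -math.expm1(math.log1p(-p_success) / n_clones)


if __name__ == "__main__":
    # What rarity does the quoted library size correspond to under this model?
    f = implied_fraction(0.99, 800_000)
    print(f"800,000 clones at 99% covers transcripts as rare as 1 in {1 / f:,.0f}")
    # Hypothetical example: a transcript that is 1 in 100,000 mRNAs.
    print("clones needed for f = 1e-5:", clones_needed(0.99, 1e-5))
```

Under this model the required library size scales roughly as 4.6 / f for 99% confidence, which is why rare transcripts dominate the size requirement.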
cDNA sequencing
• The advent of cDNA cloning, combined with the creation of automated sequencers, led to efforts to sequence the entire human transcriptome and to create arrays (on filters) of cDNAs (see reading materials).
• cDNA sequencing was viewed as the fastest way to get at the coding portion of the genome.
• Numerous companies sprang up to sequence and patent cDNAs.
• cDNA sequencing was also used to measure gene expression levels.
cDNA sequencing --> expression analysis
• Expression level estimates:
– Count the number of occurrences of a given cDNA sequence in a given library; highly expressed genes will have been sequenced more often.
– Use the above (in combination with the total number of sequences in the library) to estimate expression level.
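The counting estimate described above can be sketched in a few lines. This is a minimal illustration, assuming each sequenced clone has already been assigned to a gene; the gene names and the tiny library are made up.

```python
from collections import Counter


def expression_estimates(clone_gene_ids):
    """Estimate relative expression as (clones for gene) / (total clones sequenced)."""
    counts = Counter(clone_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical library: one gene ID per sequenced clone.
    library = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"]
    for gene, fraction in sorted(expression_estimates(library).items()):
        print(gene, f"{fraction:.0%}")
```

With only a handful of clones per gene the estimate is noisy, which is the low-abundance problem raised later in the deck.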
Ex. PEDB (www.PEDB.org)
Web based expression analysis counting cDNA frequency www.pedb.org
Serial Analysis of Gene Expression (SAGE)
• Concept
– cDNA sequencing is expensive
– Most mRNA species can be uniquely identified by a short sequence at a defined location in the gene (9 bp tags are unique 95% of the time)
– If we could produce a library of these short sequences and ligate them together, then we could sequence the ligated DNA to measure the abundance of each gene’s mRNA more efficiently
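To make the tag idea concrete, here is a minimal sketch of the counting step. It assumes the tags have already been extracted and joined end to end as fixed-length 9 bp units; real SAGE reads also contain linker and restriction-site sequence that is ignored here, and the example read is invented.

```python
from collections import Counter

TAG_LENGTH = 9  # tag length quoted in the text


def possible_tags(length: int = TAG_LENGTH) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length


def count_tags(concatenated: str, length: int = TAG_LENGTH) -> Counter:
    """Split a read of back-to-back tags into fixed-length tags and count them.

    Any incomplete tag at the end of the read is dropped."""
    usable = len(concatenated) - len(concatenated) % length
    return Counter(concatenated[i:i + length] for i in range(0, usable, length))


if __name__ == "__main__":
    print("distinct 9 bp tags:", possible_tags())
    read = "ACGTACGTA" + "TTTGGGCCC" + "ACGTACGTA"  # hypothetical read
    print(count_tags(read).most_common())
```

One sequencing read therefore reports on several transcripts at once, which is where the efficiency gain over one-clone-per-read cDNA sequencing comes from.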
SAGE diagram
Linker: Primer A/B – Type II site – Type I site
Tags flanked by Primer A and Primer B → sequence these
Issues with SAGE (and cDNA sequencing for expression analysis)
• Low abundance clones
– SAGE
  • In 1995, the estimate was that quantifying genes represented by <100 mRNAs/cell would take a single investigator a few months of work (maybe 10 times quicker today)
  • Cost - if we assume even a low estimate of $6/sequencing reaction, 96 lanes * 4 runs/day * 30 days * $6 = ~$69,000 to measure ~460,000 tags (assume 40 tags per sequencing reaction).
– cDNA sequencing
  • Same problem; costs/time maybe 20-40 times higher
• Hence expression information about low abundance clones is not accurate in cDNA or SAGE data in most cases.
• Leading to the advent of arrays…..
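The cost estimate in the SAGE bullet can be reproduced directly. This sketch uses the figures quoted on the slide and treats the 40 tags as a yield per sequencing reaction (lane), which is what the quoted ~460,000 total implies; the function name is an illustrative choice.

```python
def sequencing_budget(lanes_per_run=96, runs_per_day=4, days=30,
                      cost_per_reaction=6.0, tags_per_reaction=40):
    """Total reactions, cost and tag count for a month of SAGE sequencing."""
    reactions = lanes_per_run * runs_per_day * days
    return {
        "reactions": reactions,
        "cost": reactions * cost_per_reaction,
        "tags": reactions * tags_per_reaction,
    }


if __name__ == "__main__":
    budget = sequencing_budget()
    print(f"{budget['reactions']:,} reactions, "
          f"${budget['cost']:,.0f}, {budget['tags']:,} tags")
```

The exact totals are 11,520 reactions, $69,120 and 460,800 tags, which the slide rounds to $69,000 and 460,000.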
DNA Hybridization
Taking advantage of DNA hybridization
On the surface: spots A and B
In solution: 4 copies of gene A, 1 copy of gene B
After hybridization: the copies from solution are bound at spots A and B
DNA Arrays
• Spots of DNA arranged in a particular spatial arrangement on a solid support
• Supports - filters (nylon, nitrocellulose), glass, silicon
• Types
– Spotted or placed - pre-synthesized DNA put onto a surface
– Synthesized - DNA synthesized directly on the surface
The Original DNA Array
Petri dish with bacterial colonies
Apply membrane and lift to make a filter containing DNA from each clone.
Probe and image to identify clones homologous to the probe.
Vicki - A manual Gridding tool
Gridding tool modifications by : Michèl Schummer
Vicki and the gridding frame
Frame Design by: Michèl Schummer
Robotic Spotters for Filters
Types of filter based arrays
• PCR products - ORFs or cDNAs
• Oligos - sometimes, but generally not used: short products such as oligos do not immobilize well on membranes
• Living clones
– Place the membrane on Whatman paper soaked in media; colonies can be grown directly on the arrays
– Lysis of the colonies followed by cross-linking produces DNA arrays
– Good for screening large libraries
Uses for Filter Based Arrays
• In general, filter based arrays were in vogue about 8-13 years ago in the pre-genomic days.
• Typically cDNA libraries were spotted as clones and the arrays were used to perform comparative expression analysis.
• Detection was typically performed with radioactive labeling/film or phosphorimaging.
• “ Interesting clones ” were identified (via differential expression) and then sequenced.
• For genomes that have not yet been sequenced, this can still be a cost effective approach, but rapid sequencing is changing that.
Selected cDNA arrays
• With unselected cDNA libraries, clones for highly expressed genes are over represented on the arrays.
• As time progressed, a large number of cDNAs were sequenced, and hence it became possible to select unique cDNAs and to make arrays on which each spot represented a single gene.
• Around the same time, coatings for glass were developed that retained spotted DNA well.
• This allowed arrays to be produced on glass microscope slides, which in turn allowed for fluorescence based detection technology.
Typical Path for cDNA clone acquisition
IMAGE Consortium + others → sequence cDNAs
→ clones (Livermore, ATCC) and sequences (GenBank)
→ reduce redundancy (UniGene)
→ commercial distributors (Res. Genetics, Incyte, others): UniGene sets
→ sequence checking → sequence verified sets
→ us
Spotted Arrays
Spotting “pen”
Drop containing DNA in solution
C A G T T T G A C A G T T T G A
Reactive surface or coated surface
MD GenIII Arrayer
Plate hotel holds twelve 384-well plates
Gridding head, 12 pins
Slide holder, 36 slides
Features:
• 36 slides in 8 hours
• 7680 genes spotted in duplicate
• Built-in humidity control
Glass slides enabled fluorescent detection in 2 (or more) colors
Cell population #1: extract mRNA → make cDNA, label with green fluor
Cell population #2: extract mRNA → make cDNA, label with red fluor
Co-hybridize to a slide with DNA from different genes
Scan
Spotted arrays
• Initially, most spotted arrays were produced by spotting PCR products produced from selected cDNA clones.
• Issues
– Must have the libraries in hand
– Must not mix clones up
– Must perform high throughput PCR to produce DNA to spot (again without mixing things up).
– LOTS of freezer space to store everything
– cDNAs are long and cross hybridization is a problem (although it is possible to spot oligos)
– Quality manufacturing is difficult to maintain.
Oligo Arrays
• Synthesized or spotted arrays of short oligos of chosen sequence (typically 20-60 bases).
• Synthesis methods - ink jet, light directed.
• Spotting using reactive coupling.
• Used for re-sequencing, genotyping, diagnostics and expression arrays.
• MUCH better than cDNA arrays at distinguishing related sequences
• Only have to store the DNAs OR (better yet) if you synthesize DNA directly on the surface, you only need to store the sequence information (and a few reagents)
Basic Oligo Synthesis
Glass support with attached base + incoming base carrying a protecting group (P)
Coupling: the protected base is joined to the base on the support
Remove protecting group
Add next nucleotide (again carrying a protecting group) and repeat
Ink-jets Can be Used to Direct Small Volumes of Liquids to Specific Sites
Agilent InkJet Array Technology
Resistor off: fill reservoir
Resistor on: liquid vaporizes, gas expands
Resistor off: drop breaks off, reservoir refills
< 1 msec
If, instead of using ink, one fills the reservoirs with different nucleotides, inkjets can be used to make DNA on a surface
~44,000 Features on 1” x 3” Slide
Glass Can be Treated to Produce Hydrophilic “Wells”
Agilent Printing Facility
Light-directed oligo synthesis
Number of different DNA sequences as a function of photolithographic resolution
Resolution    Synthesis site density
500 um        400/cm^2
200 um        2,500/cm^2
100 um        10,000/cm^2
50 um         40,000/cm^2
20 um         250,000/cm^2
10 um         1,000,000/cm^2
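The density column follows from simple geometry. A minimal sketch, assuming each synthesis site is a square whose side equals the photolithographic resolution, so the density is (1 cm / resolution) squared; the function name is an illustrative choice.

```python
UM_PER_CM = 10_000  # micrometres per centimetre


def sites_per_cm2(resolution_um: float) -> int:
    """Synthesis sites per square centimetre for square sites of the given side."""
    per_side = UM_PER_CM / resolution_um
    return round(per_side ** 2)


if __name__ == "__main__":
    for res in (500, 200, 100, 50, 20, 10):
        print(f"{res:>4} um -> {sites_per_cm2(res):>9,} sites/cm^2")
```

Halving the feature size quadruples the number of distinct sequences that fit in the same area.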
All possible oligos can be made in 4*N steps
Probe length    Chemical steps    Number of possible probes
4               16                256
8               32                65,536
10              40                1,048,576
15              60                1,073,741,824
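The two columns grow very differently: each added position costs four more chemical steps (one masked coupling per base, A, C, G and T) but multiplies the number of possible probes by four. A short sketch of that relationship, with illustrative function names:

```python
BASES = "ACGT"


def chemical_steps(probe_length: int) -> int:
    """One coupling step per base per position: 4 * N."""
    return len(BASES) * probe_length


def possible_probes(probe_length: int) -> int:
    """Every position can hold any of the four bases: 4 ** N."""
    return len(BASES) ** probe_length


if __name__ == "__main__":
    for n in (4, 8, 10, 15):
        print(f"N={n:>2}: {chemical_steps(n):>2} steps, {possible_probes(n):>13,} probes")
```

Step count is linear in probe length while sequence diversity is exponential, which is what makes parallel synthesis on the surface attractive.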
Affymetrix Platform
• Each gene is represented by 11 probe pairs of 25-mer oligos
• Each probe pair contains a perfect match and a mismatch to the gene sequence
• Target sample is labeled with a biotinylated nucleotide and detected via a streptavidin-phycoerythrin conjugate
• One sample per array, one-color data
Affymetrix Expression Data
Data from the 11 probe pairs are used to calculate an aggregate signal for each gene
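The slides do not say how the 11 probe pairs are combined. One simple possibility, assumed here purely for illustration, is to average the perfect match (PM) minus mismatch (MM) differences across the probe set; production analysis software uses more robust summaries. The intensities below are invented.

```python
from statistics import mean


def average_difference(pm, mm):
    """Aggregate signal as the mean of PM - MM over all probe pairs (assumed summary)."""
    if len(pm) != len(mm):
        raise ValueError("each perfect-match probe needs a mismatch partner")
    return mean(p - m for p, m in zip(pm, mm))


if __name__ == "__main__":
    # Hypothetical intensities for one gene's 11 probe pairs.
    pm = [820, 790, 1010, 640, 905, 870, 760, 990, 830, 700, 885]
    mm = [210, 250, 300, 190, 260, 240, 220, 310, 230, 200, 245]
    print("aggregate signal:", round(average_difference(pm, mm), 1))
```

Subtracting the mismatch intensity is meant to remove the non-specific part of the signal; a plain mean is sensitive to a single bad probe, which is one reason more robust estimators are preferred in practice.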
Strategies For Array Design
Known exons, unknown transcript
Surrogate strategy: most expression arrays to date
Annotation strategy: exon arrays, splice variants
Tiling strategy: unbiased look at the genome
Affymetrix Platform
• Expression arrays – Human, Mouse, Rat, Yeast, E. coli, Drosophila, C. elegans, Dog, Soybean, Plasmodium, Anopheles, Pseudomonas, Arabidopsis, Zebrafish, Xenopus, etc.
• Exon arrays – Alternative splicing patterns
• Mapping arrays – SNP analysis, loss of heterozygosity
• Tiling array sets – Transcript mapping
• Custom arrays
Issues with synthesized oligos
• Repetitive yield - e.g. for each reaction cycle, what percentage of the oligos react as intended; estimated at 95% for the light directed method, 98-99% for the ink jet method
• (0.95)^20 = 35.8%, (0.98)^20 = 67% - net result: Affy arrays are usually 25-mers, ink jet arrays are usually 60-mers.
• For a single oligo, it can be shown that sensitivity plateaus at 50-70 bp.
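The repetitive-yield arithmetic above generalizes to any oligo length: if every cycle succeeds with probability y, the fraction of full-length product after n cycles is y^n. A small sketch using the per-cycle yields quoted on the slide (function name illustrative):

```python
def full_length_fraction(stepwise_yield: float, cycles: int) -> float:
    """Fraction of oligos that coupled correctly in every one of `cycles` steps."""
    return stepwise_yield ** cycles


if __name__ == "__main__":
    for label, y, n in (
        ("light directed, 20 cycles", 0.95, 20),
        ("ink jet, 20 cycles", 0.98, 20),
        ("light directed, 25-mer", 0.95, 25),
        ("ink jet, 60-mer", 0.98, 60),
    ):
        print(f"{label}: {full_length_fraction(y, n):.1%}")
```

Because the loss compounds with every cycle, a few percent difference in stepwise yield decides how long an oligo can usefully be made.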
Relative merits of different methods of making oligo arrays
• Affy:
– available first, large catalogue, small feature size possible
• Inkjet:
– much more flexible to design
• Spotted:
– less practical for large numbers (> a few hundred) of oligos, but can be made with standard spotting equipment. Libraries of oligos exist for more common organisms, so oligo deposition is feasible for some organisms.
Illumina’s Bead Arrays
Step 1 - Synthesize beads in batches, each batch with a sequence on it (e.g. ACGTGTCTACAGT, TGCATCAGTGCA, CGTGTATGCATGT). Generally, color code the beads to keep track of which one has which molecule on it.
Step 2 - Etch the ends of optical fibers in a bundle or circular spots on a glass slide to create bead sized depressions.
Illumina’s Bead Arrays (cont)
Step 3 - Allow beads to self-assemble into an array on the end of the fibers or on the surface
• These self-assembled arrays can be used for the same applications as other DNA arrays.
• Since the assembly is random, one must over-represent each desired oligo tens of times to ensure that each oligo is represented at least n times on the array.
• Decoding can also be accomplished by hybridizing short labeled oligos to the oligos on each bead. In practice, this is how it is usually done.
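Why over-representation of tens of copies is needed can be seen with a rough model. This sketch assumes the number of beads of a given type that land on the array is Poisson distributed around the average copy number; that model, the function name and the example numbers are assumptions for illustration, not Illumina's specification.

```python
import math


def prob_at_least(n_required: int, mean_copies: float) -> float:
    """P(a bead type appears at least `n_required` times) under a Poisson model."""
    p_fewer = sum(
        math.exp(-mean_copies) * mean_copies ** k / math.factorial(k)
        for k in range(n_required)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    n = 5  # hypothetical minimum number of replicate beads wanted per oligo
    for mean_copies in (5, 10, 20, 30):
        print(f"average {mean_copies:>2} copies -> "
              f"P(at least {n}) = {prob_at_least(n, mean_copies):.6f}")
```

With thousands of bead types on one array, the per-type failure probability has to be very small for every oligo to be covered, which pushes the average copy number well above n.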
See www.illumina.com
Detection technologies
• Radiolabeled probes – Film or phosphorimagers
• Biotin labeled – Post-hybridization staining with streptavidin (SA) labeled with a fluor or an enzyme
• Fluorescent probes – Confocal scanning
Scanning with a confocal microscope
Expression Array Analysis
2-color Microarray Overview
Control and Test samples → prepare fluorescently labeled probes → hybridize, wash → measure fluorescence in 2 channels (red / green) → analyze the data to identify patterns of gene expression
Slide from John Quackenbush, Dana Farber
1-color Microarray Overview
Control and Test samples (figure labels: Weed, Bush) → prepare fluorescently labeled probes → hybridize, wash → measure fluorescence in 1 channel → analyze the data to identify patterns of gene expression
Slide adapted from John Quackenbush, Dana Farber
2-color vs. single color
• 2-color was originally designed due to problems in making reproducible arrays - e.g. the ratio on a spot is more reproducible than the absolute intensity if the spot size/concentration changes from array-to-array.
• With 2 colors, you don’t necessarily get twice as much data, since it is typical to run an extra array in the inverted color scheme.
• Experimental design and cross experiment comparisons are much more complicated with 2 color arrays.
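The ratio measurement and the inverted color scheme (often called a dye swap) can be sketched as follows. This assumes the test sample is red on the first array and green on the second, and that the two log ratios are simply averaged; normalization and background correction are left out, and the intensities are invented.

```python
import math


def log2_ratio(test: float, control: float) -> float:
    """log2(test / control) for one spot; positive means higher in the test sample."""
    return math.log2(test / control)


def dye_swap_average(red_fwd, green_fwd, red_swap, green_swap):
    """Average the test/control log ratio over a forward array (test = red)
    and an inverted array (test = green), so dye-specific bias tends to cancel."""
    forward = log2_ratio(red_fwd, green_fwd)
    inverted = log2_ratio(green_swap, red_swap)
    return (forward + inverted) / 2


if __name__ == "__main__":
    # Hypothetical spot: red reads about 20% high regardless of sample.
    print(round(dye_swap_average(480, 100, 120, 400), 3))
```

In this invented example the red channel inflates the forward ratio and deflates the inverted one by the same factor, so the averaged value recovers the underlying four-fold change.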
Expression Arrays are a Natural Extension of Genomic Analysis
• Genome studies provide the source material for the arrays - e.g. clones or manufactured DNAs.
• For completely sequenced genomes, arrays allow a comprehensive survey of gene expression.
• This level of analysis is a revolution in biology.
Expression Arrays Have a Broad Range of Applicability
• Cancer studies - tumor vs. normal.
• Infectious disease studies - host response to infection, infectious agent gene expression, viral diversity.
• Pharmaceutical studies - drug treated vs. non treated.
• Environmental - microbial diversity, effects of toxins, effect of growth conditions.
Expression Arrays Have a Broad Range of Applicability (cont)
• Gene specific studies - deletion (“knockout”) vs. normal, over expression vs. normal.
• Agricultural studies - effects of pesticides, growth conditions, hormones.
• Developmental biology - cells from different areas/stages of developing organisms
• Many others - any two samples of interest can be compared.
Challenges for Planning Good Array Experiments
• Experimental Design
– Replicates are necessary and expensive
– A simple experiment may not give a simple answer
– What comparisons should be made?
• Data Analysis – How will differentially expressed genes be identified?
– How will errors be estimated?
– What software does this best?
– How will the data be mined?
Where are arrays going?
• As sequencing gets cheaper and cheaper, most assays that are currently done by arrays can be done more effectively by sequencing. Hence, the analytical use of arrays will be replaced by sequencing.
• However, arrays can also be used to enrich for specific genomic regions upstream of sequencing or can be used to create many sequences for the artificial production of genomes or genomic regions.