

cDNA Sequencing, SAGE and Microarray Analysis

Outline

• Overview of transcription
• Construction of cDNA libraries
• cDNA sequencing
• Expression analysis via SAGE
• Microarray construction and use in expression analysis

We can isolate mRNA and convert it to a stable form (cDNA)

DNA → mRNA → protein

Isolate mRNA, reverse transcribe, label → cDNAs

Genome in numbers

Nucleic acid content of an average human cell

Abundance distribution of mRNA species in a typical mammalian cell

Isolation of mRNA

cDNA library construction: the big picture

cDNA synthesis

cDNA synthesis occurs in the 5’ to 3’ direction and requires:
– a template (the mRNA)
– nucleotides (dNTPs)
– reverse transcriptase (a retroviral polymerase)
– a primer to initiate synthesis

The resulting duplex cDNA is a DNA copy of the original mRNA sequence. It is now ready for manipulation (cloning, library construction).

Priming alternatives for cDNA construction

• Oligo dT: priming at the 3’ terminus
• Random hexamers: priming throughout the sequence

Cloning: Blunt end vs. sticky

cDNA library construction

Ligation of cDNA into vector

Directional cloning

cDNA library

Definition of a good cDNA library
• Ideally contains at least one copy of every expressed gene
• The probability of this is a function of:
– fragment size: the longer the fragments, the more likely a gene is represented
– genome size: smaller genome = increased chance of finding a gene represented
– expression: high expression = high likelihood of finding a gene represented
• For 99% probability, a mammalian cDNA library must contain ~800,000 clones

Uses of a cDNA library
• Representation of the population of genes defining a cell's phenotype
• Long-term stable storage of information
• Retrieve full-length genes rather than fragments (screening)
• Find gene “family” members (screening)
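The link between library size and the chance of capturing a given transcript can be illustrated with the standard sampling formula N = ln(1 − P) / ln(1 − f), where f is the fraction of clones expected to carry that transcript. The formula and the function names below are assumptions brought in for illustration; the slides give only the ~800,000 figure. A minimal Python sketch:

```python
import math

def clones_needed(p, f):
    """Clones required to see a transcript of frequency f with probability p."""
    return math.ceil(math.log(1.0 - p) / math.log1p(-f))

def implied_frequency(p, n):
    """Transcript frequency for which n clones give detection probability p."""
    return -math.expm1(math.log(1.0 - p) / n)

if __name__ == "__main__":
    # Back out the transcript frequency implied by the slide's numbers.
    f = implied_frequency(0.99, 800_000)
    print(f"99% with 800,000 clones implies f of about {f:.2e}")
    # Hypothetical example frequency, not taken from the slides.
    print("clones for f = 1e-5:", clones_needed(0.99, 1e-5))
```

Read this way, the ~800,000 clone figure corresponds to a transcript that makes up roughly 6 in a million of the mRNA molecules, under the assumption that every mRNA is cloned with equal efficiency.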

cDNA sequencing

• The advent of cDNA cloning, combined with the creation of automated sequencers, led to efforts to sequence the entire human transcriptome and to create arrays (on filters) of cDNAs (see reading materials).

• cDNA sequencing was viewed as the fastest way to get at the coding portion of the genome.

• Numerous companies sprang up to sequence and patent cDNAs.

• cDNA sequencing was also used to measure gene expression levels.

cDNA sequencing --> expression analysis

• Expression level estimates: – Count the number of occurrences of a given cDNA sequence in a given library - highly expressed genes will have been sequenced more often.

– Use the above (in combination with the total number of sequences in the library) to estimate expression level.
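The counting approach can be sketched in a few lines of Python. The gene labels and clone counts are hypothetical, and the function name is an assumption chosen for illustration; the only idea taken from the slides is that expression is estimated as occurrences divided by the total number of sequences in the library.

```python
from collections import Counter

def expression_estimates(clone_genes):
    """Map each gene to its fraction of all sequenced clones in the library."""
    counts = Counter(clone_genes)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical library: one gene label per sequenced cDNA clone.
library = ["ACTB"] * 6 + ["GAPDH"] * 3 + ["TP53"]
estimates = expression_estimates(library)
for gene, fraction in sorted(estimates.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {fraction:.0%} of clones")
```

A gene seen only once or twice has a very uncertain estimate, which is the low-abundance problem raised later in the deck.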

Ex. PEDB (www.PEDB.org)

Web based expression analysis counting cDNA frequency www.pedb.org

Serial Analysis of Gene Expression (SAGE)

• Concept
– cDNA sequencing is expensive
– Most mRNA species can be uniquely identified by a short sequence at a defined location in the gene (9 bp tags are unique 95% of the time)
– If we could produce a library of these short sequences and ligate them together, then we could sequence the ligated DNA to measure the abundance of each transcript more efficiently
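The tag idea can be sketched as a toy Python example. The anchoring site CATG is an assumption (the slides do not name an enzyme), the sequences are made up, and real SAGE data are ditags that need more careful parsing; this only shows taking a 9-base tag at a defined position and counting tags read from a ligated string.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring-enzyme site, not named in the slides
TAG_LEN = 9      # tag length quoted in the slides

def extract_tag(transcript):
    """Return the TAG_LEN bases after the 3'-most anchor site, or None."""
    pos = transcript.rfind(ANCHOR)
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = transcript[start:start + TAG_LEN]
    return tag if len(tag) == TAG_LEN else None

def count_tags(concatemer):
    """Split a toy concatemer on the anchor site and count full-length tags."""
    return Counter(p for p in concatemer.split(ANCHOR) if len(p) == TAG_LEN)

# Hypothetical transcript and concatemer.
tag = extract_tag("GGGCATGAAACCCGGGTTTT")
counts = count_tags("AAACCCGGGCATGAAACCCGGGCATGTTTGGGAAA")
print(tag, dict(counts), "possible 9-mers:", 4 ** TAG_LEN)
```

One sequencing read of the concatemer yields several tag counts at once, which is where the efficiency gain over sequencing one cDNA per read comes from.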

SAGE diagram

[Figure: SAGE diagram. Linker: Primer A/B, TypeII site, Type I site. Tags carrying Primer A and Primer B linkers are joined; the joined tags are what gets sequenced.]

Issues with SAGE (and cDNA sequencing for expression analysis)

• Low abundance clones
– SAGE
  • In 1995, the estimate was that quantifying genes represented by <100 mRNAs/cell would take a single investigator a few months of work (maybe 10 times quicker today)
  • Cost: even at a low estimate of $6/sequencing reaction, 96 lanes * 4 runs/day * 30 days * $6 = $69,120 (about $69,000) to measure about 460,000 tags (assuming 40 tags per sequencing reaction)
– cDNA sequencing
  • Same problem; costs/time maybe 20-40 times higher
• Hence, in most cases, expression information about low abundance clones is not accurate in cDNA or SAGE data.

• Leading to the advent of arrays…..
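The cost estimate in the SAGE bullet above can be checked with a few lines of arithmetic. All numbers come from the slide; the variable names are illustrative, and "40 tags/run" is read here as 40 tags per sequencing reaction, which is an interpretation.

```python
lanes_per_run = 96
runs_per_day = 4
days = 30
cost_per_reaction = 6    # dollars, the slide's "low estimate"
tags_per_reaction = 40   # interpretation of the slide's "40 tags/run"

reactions = lanes_per_run * runs_per_day * days
total_cost = reactions * cost_per_reaction
total_tags = reactions * tags_per_reaction
cost_per_tag = total_cost / total_tags

print(f"{reactions} reactions, ${total_cost}, {total_tags} tags, ${cost_per_tag:.2f}/tag")
```

At about 15 cents per tag, a transcript present at a few copies per cell needs a very large number of tags before its count is reliable, which is the motivation the slides give for moving to arrays.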

DNA Hybridization

Taking advantage of DNA hybridization
• On the surface: spots A and B
• In solution: 4 copies of gene A, 1 copy of gene B
• After hybridization: the copies of A are bound at spot A and the copy of B at spot B

DNA Arrays

• Spots of DNA arranged in a particular spatial arrangement on a solid support
• Supports: filters (nylon, nitrocellulose), glass, silicon
• Types
– Spotted or placed: pre-synthesized DNA put onto a surface
– Synthesized: DNA synthesized directly on the surface

The Original DNA Array

Petri dish with bacterial colonies. Apply membrane and lift to make a filter containing DNA from each clone.

Probe and image to identify clones homologous to the probe.

Vicki - A manual Gridding tool

Gridding tool modifications by : Michèl Schummer

Vicki and the gridding frame

Frame Design by: Michèl Schummer

Robotic Spotters for Filters

Types of filter based arrays

• PCR products: ORFs or cDNAs
• Oligos: sometimes, but generally not used, because short products such as oligos do not immobilize well on membranes
• Living clones
– Place the membrane on Whatman paper soaked in media; colonies can be grown directly on the arrays
– Lysis of the colonies followed by cross-linking produces DNA arrays
– Good for screening large libraries

Uses for Filter Based Arrays

• In general, filter based arrays were in vogue about 8-13 years ago in the pre-genomic days.

• Typically cDNA libraries were spotted as clones and the arrays were used to perform comparative expression analysis.

• Detection was typically performed with radioactive labeling/film or phosphorimaging.

• “ Interesting clones ” were identified (via differential expression) and then sequenced.

• For genomes that have not yet been sequenced, this can still be a cost effective approach, but rapid sequencing is changing that.

Selected cDNA arrays

• With unselected cDNA libraries, clones for highly expressed genes are over represented on the arrays.

• As time progressed, a large number of cDNAs were sequenced, and hence it became possible to select unique cDNAs and to make arrays on which each spot represented a single gene.
• Around the same time, coatings for glass were developed that retained spotted DNA well.
• This allowed arrays to be produced on glass microscope slides, which in turn allowed for fluorescence-based detection technology.

Typical Path for cDNA clone acquisition

• IMAGE Consortium + others: sequence cDNAs
• Clones → Livermore, ATCC; sequences → GenBank
• Reduce redundancy → UniGene
• Commercial distributors (Res. Genetics, Incyte, others) → UniGene sets
• Sequence checking → sequence-verified sets
• Us

Spotted Arrays

Spotting “pen”

Drop containing DNA in solution (CAGTTTGA)

Reactive surface or coated surface

MD GenIII Arrayer

• Plate hotel holds twelve 384-well plates
• Gridding head, 12 pins
• Slide holder, 36 slides

Features:
• 36 slides in 8 hours
• 7680 genes spotted in duplicate
• Built-in humidity control

Glass slides enabled fluorescent detection in 2 (or more) colors

• Cell population #1: extract mRNA → make cDNA, label w/ green fluor
• Cell population #2: extract mRNA → make cDNA, label w/ red fluor
• Co-hybridize to a slide with DNA from different genes
• Scan

Spotted arrays

• Initially, most spotted arrays were produced by spotting PCR products produced from selected cDNA clones.
• Issues
– Must have the libraries in hand
– Must not mix clones up
– Must perform high throughput PCR to produce DNA to spot (again without mixing things up)
– LOTS of freezer space to store everything
– cDNAs are long and cross hybridization is a problem (although it is possible to spot oligos)
– Quality manufacturing is difficult to maintain

Oligo Arrays

• Synthesized or spotted arrays of short oligos of chosen sequence (typically 20-60 bases).
• Synthesis methods: ink jet, light directed.
• Spotting using reactive coupling.
• Used for re-sequencing, genotyping, diagnostics and expression arrays.
• MUCH better than cDNA arrays at distinguishing related sequences.
• Only have to store the DNAs OR (better yet), if you synthesize DNA directly on the surface, you only need to store the sequence information (and a few reagents).

Basic Oligo Synthesis

[Figure: synthesis cycle on a glass support; P = protecting group.]
• Coupling: a protected base (P-Base) is coupled to the base on the glass support
• Remove protecting group
• Add next nucleotide (P-Base) and repeat

Ink-jets Can be Used to Direct Small Volumes of Liquids to Specific Sites

Agilent InkJet Array Technology

Firing cycle (< 1 msec): fill reservoir → resistor on → liquid vaporizes → gas expands → drop breaks off → resistor off → reservoir refills

If, instead of using ink, one fills the reservoirs with different nucleotides, inkjets can be used to make DNA on a surface

~44,000 features on a 1" x 3" slide

Glass Can be Treated to Produce Hydrophilic Wells

Agilent Printing Facility

Light-directed oligo synthesis

Number of different DNA sequences as a function of photolithographic resolution

Resolution    Synthesis site density
500 µm        400/cm²
200 µm        2,500/cm²
100 µm        10,000/cm²
50 µm         40,000/cm²
20 µm         250,000/cm²
10 µm         1,000,000/cm²
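The density column follows from treating each synthesis site as a square whose side equals the photolithographic resolution, so sites per cm² = (10,000 µm / resolution)². That geometric reading is an assumption, but it reproduces every row of the table. A short Python check (the function name is illustrative):

```python
def sites_per_cm2(resolution_um):
    """Square sites of side resolution_um packed into 1 cm x 1 cm."""
    per_side = 10_000 / resolution_um  # 1 cm = 10,000 µm
    return round(per_side ** 2)

for res in (500, 200, 100, 50, 20, 10):
    print(f"{res:>4} µm -> {sites_per_cm2(res):,} sites/cm²")
```

Halving the feature size quadruples the number of distinct sequences that fit on the same area.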

All possible oligos can be made in 4*N steps

Probe length    Chemical steps    Number of possible probes
4               16                256
8               32                65,536
10              40                1,048,576
15              60                1,073,741,824
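The two columns of this table grow very differently: each added position costs four more chemical steps (one per base), while the number of possible probes is multiplied by four. A small Python check of the table values, with illustrative function names:

```python
def chemical_steps(probe_length):
    """One masked coupling step per base (A, C, G, T) at each position."""
    return 4 * probe_length

def possible_probes(probe_length):
    """Number of distinct sequences of the given length."""
    return 4 ** probe_length

for n in (4, 8, 10, 15):
    print(f"N={n:>2}: {chemical_steps(n):>2} steps, {possible_probes(n):,} probes")
```

This linear-versus-exponential gap is what makes light-directed synthesis attractive for building very large sets of probes in parallel.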

Affymetrix Platform

• Each gene is represented by 11 probe pairs of 25-mer oligos
• Each probe pair contains a perfect match and a mismatch to the gene sequence
• Target sample is labeled with a biotinylated nucleotide and detected via a streptavidin-phycoerythrin conjugate
• One sample per array, one-color data

Affymetrix Expression Data

Data from the 11 probe pairs are used to calculate an aggregate signal for each gene
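One simple way to aggregate a probe set is the mean of the perfect-match minus mismatch differences. This is only an illustrative summary under that assumption: the slides do not say which algorithm is used, and production software applies more robust statistics. The intensities below are hypothetical.

```python
def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) across the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for the 11 probe pairs of one gene.
pm = [820, 910, 760, 1005, 880, 790, 940, 860, 900, 830, 870]
mm = [210, 260, 190, 300, 240, 200, 280, 230, 250, 220, 235]
signal = average_difference(pm, mm)
print(f"aggregate signal: {signal:.1f}")
```

Subtracting the mismatch intensity is meant to remove the part of the signal that comes from non-specific hybridization.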

Strategies For Array Design

Known Exons Unknown transcript

• Surrogate strategy: most expression arrays to date
• Annotation strategy: exon arrays, splice variants
• Tiling strategy: unbiased look at the genome

Affymetrix Platform

• Expression arrays
– Human, Mouse, Rat, Yeast, E. coli, Drosophila, C. elegans, Dog, Soybean, Plasmodium, Anopheles, Pseudomonas, Arabidopsis, Zebrafish, Xenopus, etc.
• Exon arrays
– Alternative splicing patterns
• Mapping arrays
– SNP analysis, loss of heterozygosity
• Tiling array sets
– Transcript mapping
• Custom arrays

Issues with synthesized oligos

• Repetitive yield, i.e. for each reaction cycle, what percentage of the oligos react as intended: estimated at 95% for the light-directed method, 98-99% for the ink jet method
• (0.95)^20 = 35.8%, (0.98)^20 = 67%. Net result: Affy arrays are usually 25-mers, ink jet arrays are usually 60-mers.
• For a single oligo, it can be shown that sensitivity plateaus at 50-70 bp.
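The yield figures follow from compounding the per-cycle yield over every cycle: the full-length fraction is (step yield)^(number of cycles). A short Python sketch using the slide's numbers; the function name is illustrative, and using one cycle per base is a simplification.

```python
def full_length_yield(step_yield, cycles):
    """Fraction of oligos that reacted as intended in every cycle."""
    return step_yield ** cycles

for label, step_yield in (("light-directed", 0.95), ("ink jet", 0.98), ("ink jet", 0.99)):
    for cycles in (20, 25, 60):
        fraction = full_length_yield(step_yield, cycles)
        print(f"{label:>14} {step_yield:.2f} x {cycles:>2} cycles: {fraction:.1%}")
```

Because the loss compounds, a few percentage points of per-cycle yield decide how long an oligo can usefully be made, which is consistent with the shorter probes on light-directed arrays.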

Relative merits of different methods of making oligo arrays

• Affy:
– available first, large catalogue, small feature size possible
• Inkjet:
– much more flexible to design
• Spotted:
– less practical for large numbers (> a few hundred) of oligos, but can be made with standard spotting equipment. Libraries of oligos exist for the more common organisms, so oligo deposition is feasible for some organisms.

Illumina's Bead Arrays

Step 1 - Synthesize beads in batches, each batch with one sequence on it. Generally, color code the beads to keep track of which one has what molecule on it.

Step 2 - Etch the ends of optical fibers in a bundle or circular spots on a glass slide to create bead sized depressions.

Illumina's Bead Arrays (cont)

Step 3 - Allow beads to self-assemble into an array on the end of the fibers or on the surface.

• These self-assembled arrays can be used for the same applications as other DNA arrays.

• Since the assembly is random, one must over-represent each desired oligo tens of times to ensure that each oligo is represented at least n times on the array.

•Decoding can also be accomplished by hybridizing short labeled oligos to the oligos on each bead. In practice, this is how it is usually done.

See www.illumina.com
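The need to over-represent each oligo can be illustrated with a simple model. If beads land in wells at random, the number of copies of any one oligo is approximately Poisson distributed around the average over-representation. The Poisson model and the function name are assumptions for illustration, not something stated in the slides.

```python
import math

def prob_at_least(n, mean_copies):
    """P(X >= n) for X ~ Poisson(mean_copies): chance an oligo appears n+ times."""
    below = sum(
        math.exp(-mean_copies) * mean_copies ** k / math.factorial(k)
        for k in range(n)
    )
    return 1.0 - below

# Hypothetical requirement: at least 5 copies of every oligo.
for mean_copies in (5, 10, 30):
    print(f"mean {mean_copies:>2} copies: P(at least 5) = {prob_at_least(5, mean_copies):.6f}")
```

With an average of only 5 copies, a substantial fraction of oligos falls below the threshold; at tens of copies the shortfall becomes negligible.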

Detection technologies

• Radiolabeled probes
– Film or phosphorimagers
• Biotin labeled
– Post-hybridization detection with SA (streptavidin) labeled with a fluor or an enzyme
• Fluorescent probes
– Confocal scanning

Scanning with a confocal microscope

Expression Array Analysis

2-color Microarray Overview

• Control and test samples
• Prepare fluorescently labeled probes
• Hybridize, wash
• Measure fluorescence in 2 channels (red / green)
• Analyze the data to identify patterns of gene expression

Slide from John Quackenbush, Dana Farber

1-color Microarray Overview

• Control and test samples
• Prepare fluorescently labeled probes
• Hybridize, wash
• Measure fluorescence in 1 channel
• Analyze the data to identify patterns of gene expression

Slide adapted from John Quackenbush, Dana Farber

2-color vs. single color

• 2-color was originally designed due to problems in making reproducible arrays, i.e. the ratio on a spot is more reproducible than the absolute intensity if the spot size/concentration changes from array to array.

• With 2 colors, you don't necessarily get twice as much data, since it is typical to run an extra array in the inverted color scheme.

• Experimental design and cross experiment comparisons are much more complicated with 2 color arrays.
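The first two points can be made concrete with log ratios. If a spot carries more or less DNA, both channels scale together and the ratio is unchanged; running a second array with the colors inverted and averaging is one common way to cancel dye-specific bias. The intensities and function names below are hypothetical.

```python
import math

def log2_ratio(test, control):
    """Log2 fold change of test over control on one spot."""
    return math.log2(test / control)

def dye_swap_average(forward, swapped):
    """Average the log ratio over an array pair with inverted colors.

    forward: (red, green) intensities with the test sample in red.
    swapped: (red, green) intensities with the test sample in green.
    """
    m_forward = log2_ratio(forward[0], forward[1])
    m_swapped = log2_ratio(swapped[1], swapped[0])
    return (m_forward + m_swapped) / 2

# The same 4-fold change on a small spot and on a spot with twice the DNA.
small_spot = log2_ratio(2000, 500)
large_spot = log2_ratio(4000, 1000)
print(small_spot, large_spot)
print(dye_swap_average((4000, 1000), (1000, 4000)))
```

Absolute intensities differ between the two spots, but the log ratio is the same, which is the reproducibility argument made in the first bullet.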

Expression Arrays are a Natural Extension of Genomic Analysis

• Genome studies provide the source material for the arrays, e.g. clones or manufactured DNAs.

• For completely sequenced genomes, arrays allow a comprehensive survey of gene expression.

• This level of analysis is a revolution in biology.

Expression Arrays Have a Broad Range of Applicability

• Cancer studies - tumor vs. normal.

• Infectious disease studies - host response to infection, infectious agent gene expression, viral diversity.

• Pharmaceutical studies - drug treated vs. non treated.

• Environmental - microbial diversity, effects of toxins, effect of growth conditions.

Expression Arrays Have a Broad Range of Applicability (cont)

• Gene specific studies - deletion (“knockout”) vs. normal, over expression vs. normal.

• Agricultural studies - effects of pesticides, growth conditions, hormones.

• Developmental biology - cells from different areas/stages of developing organisms.

• Many others - any two samples of interest can be compared.

Challenges for Planning Good Array Experiments

• Experimental Design
– Replicates are necessary and expensive
– A simple experiment may not give a simple answer
– What comparisons should be made?
• Data Analysis
– How will differentially expressed genes be identified?
– How will errors be estimated?
– What software does this best?
– How will the data be mined?

Where are arrays going?

• As sequencing gets cheaper and cheaper, most assays that are currently done by arrays can be done more effectively by sequencing. Hence, the analytical use of arrays will be replaced by sequencing.

• However, arrays can also be used to enrich for specific genomic regions upstream of sequencing or can be used to create many sequences for the artificial production of genomes or genomic regions.