Statistical analysis of DNA microarray data


Gene Expression
BMI 731 Winter 2005
Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University
Thesis: the analysis of gene expression
data is going to be big in 21st century statistics
Many different technologies, including
Spotted DNA arrays (Brown/Botstein)
Short oligonucleotide arrays (Affymetrix)
Serial analysis of gene expression (SAGE)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
Figure: Total microarray articles indexed in Medline, by year, 1995 to 2001 (2001 projected). Vertical axis: number of papers (0 to 600).
Common themes
• Parallel approach to collection of very large
amounts of data (by biological standards)
• Sophisticated instrumentation, requires some
understanding
• Systematic features of the data are at least as
important as the random ones
• Often more like industrial process than single
investigator lab research
• Integration of many data types: clinical, genetic,
molecular ... databases
Central dogma
The expression of the genetic information stored in
the DNA molecule occurs in two stages:
• (i) transcription, during which DNA is transcribed
into mRNA;
• (ii) translation, during which mRNA is translated to
produce a protein.
DNA → mRNA → protein
Other important aspects of gene regulation:
methylation, alternative splicing.
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein might be better, but is currently
harder.
• DNA microarrays represent an important new
method for determining the complete expression
profile of a cell.
• Monitoring gene expression lies at the heart of a
wide variety of medical and biological research
projects, including classifying diseases,
understanding basic biological processes, and
identifying new drug targets.
Affymetrix Instrument System
Platform for GeneChip® Probe Arrays
• Integrated
• Exportable
• Easy to use
• Versatile
Photolithography
Synthesis of Ordered Oligonucleotide Arrays
Figure: light passes through a mask and deprotects selected positions on the substrate; a nucleotide (T–, then C–) couples at the deprotected positions; the cycle is repeated to build up the oligonucleotides.
Affymetrix GeneChip® Probe Arrays
Figure labels:
• GeneChip Probe Array: 1.28 cm, >200,000 different complementary probes
• Hybridized Probe Cell: 24 µm, millions of copies of a specific oligonucleotide probe
• Single stranded, labeled RNA target hybridized to the oligonucleotide probe
Image of Hybridized Probe Array
Analysis of expression level from probe sets
Each pixel is quantitated and integrated for each oligo feature (range 0-25,000)
Perfect Match (PM)
Mismatch (MM) Control
log(PM / MM) = difference score
All significant difference scores are averaged to create the “average difference” = expression level of the gene.
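The calculation on this slide can be sketched in a few lines of Python. This is only an illustration of the definitions as stated above, not Affymetrix's actual software: the slide does not say how a score is judged "significant", so the `min_score` cutoff below is a hypothetical stand-in, and the intensities are made-up values.

```python
import math

def difference_scores(pm, mm):
    """Per-probe-pair difference score, log(PM / MM), as defined on the slide."""
    return [math.log(p / m) for p, m in zip(pm, mm)]

def average_difference(pm, mm, min_score=0.0):
    """Average the difference scores that pass a cutoff.

    min_score is a hypothetical significance rule (assumption): only probe
    pairs whose score exceeds it are averaged.
    """
    kept = [s for s in difference_scores(pm, mm) if s > min_score]
    return sum(kept) / len(kept) if kept else 0.0

# Made-up intensities for one probe set of four PM/MM pairs.
pm = [1200.0, 900.0, 300.0, 2500.0]
mm = [300.0, 450.0, 310.0, 500.0]

print(average_difference(pm, mm))
```

The third pair has PM below MM, so its score is negative and it is left out of the average under the assumed cutoff.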
Analysis of expression level from probe sets
• each oligo sequence (20-25 mer) is synthesized
as a 20 µm square (feature)
• each feature contains > 1 million copies of the oligo
• scanner resolution is about 2 µm (pixel)
• each gene is quantitated by 16-20 oligos and
compared to an equal number of mismatched controls
• 22,000 genes are evaluated with 20 matching oligos
and 10 mismatched oligos = 480,000 features/chip
• 480,000 features are photolithographically
synthesized onto a 2 x 2 cm glass substrate
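As a rough check on the geometry quoted above, the following sketch works out how many scanner pixels cover one feature and how many features could fit on the substrate. It assumes square features tiled edge to edge with no gaps or border, which the slide does not state.

```python
FEATURE_UM = 20        # feature edge length from the slide, in micrometres
PIXEL_UM = 2           # scanner resolution from the slide, in micrometres
SUBSTRATE_UM = 20_000  # 2 cm substrate edge, in micrometres

# Pixels covering one feature (assumes pixels align with the feature grid).
pixels_per_feature = (FEATURE_UM // PIXEL_UM) ** 2

# Upper bound on features if they tile the whole substrate (assumption).
features_per_side = SUBSTRATE_UM // FEATURE_UM
max_features = features_per_side ** 2

print(pixels_per_feature, features_per_side, max_features)
```

Under these assumptions each feature is covered by about 100 pixels, and the 480,000 features quoted on the slide fit within the area of a 2 x 2 cm substrate.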
Affymetrix arrays
• Global views of gene expression are often essential for obtaining
comprehensive pictures of cell function.
• For example, it is estimated that between 0.2% and 10% of the 10,000
to 20,000 mRNA species in a typical mammalian cell are
differentially expressed between cancer and normal tissues.
• Whole-genome analyses also benefit studies where the end goal is
to focus on small numbers of genes, by providing an efficient tool to
sort through the activities of thousands of genes, and to recognize
the key players.
• In addition, monitoring multiple genes in parallel allows the
identification of robust classifiers, called "signatures", of disease.
• Global analyses frequently provide insights into multiple facets of a
project. A study designed to identify new disease classes, for
example, may also reveal clues about the basic biology of disorders,
and may suggest novel drug targets.
Spotted DNA microarrays
• In “spotted” microarrays, slides carrying spots of target DNA are
hybridized to fluorescently labeled cDNA from experimental and
control cells and the arrays are imaged at two or more wavelengths.
• Expression profiling involves the hybridization of fluorescently
labeled cDNA, prepared from cellular mRNA, to microarrays carrying
thousands of unique sequences.
• Typically, a set of target DNA samples representing different genes
is prepared by PCR and transferred to a coated slide to form a 2-D
array of spots with a center-to-center distance (pitch) of about 200
μm, providing a pan-genomic profile in an area of 3 cm² or less.
• cDNA samples from experimental and control cells are labeled with
different color fluors (the cyanine dyes Cy5 and Cy3) and hybridized
simultaneously to microarrays, and the relative levels of mRNA for
each gene are then determined by comparing red and green signal
intensities.
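The red/green comparison is commonly expressed as a log ratio per spot. The sketch below assumes the experimental sample is labeled with Cy5 (red) and the control with Cy3 (green); the slide does not say which sample gets which dye, and the gene names and intensities are made up.

```python
import math

def log_ratio(red, green):
    """log2(red / green): positive means more signal in the red channel."""
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots.
spots = {
    "geneA": (4000.0, 1000.0),  # higher in the red (assumed experimental) channel
    "geneB": (500.0, 500.0),    # equal in both channels
    "geneC": (250.0, 2000.0),   # higher in the green (assumed control) channel
}

log_ratios = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
for gene, m in log_ratios.items():
    print(gene, m)
```

Using base-2 logs makes the values symmetric around zero: a four-fold increase gives +2 and a four-fold decrease gives -2.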
Spotted DNA microarrays
Scanning Technology
• Microarray slides are imaged with a modified fluorescence
microscope designed for scanning large areas at high resolution
(arrayWoRx, Applied Precision, Issaquah, WA, Affymetrix).
• Fluorescence illumination is obtained from a metal halide arc lamp
focused onto a fiber optic bundle, the output of which is directed at
the microarray slide, and emission is recorded through a microscope
objective (Nikon) onto a cooled CCD (charge-coupled device)
camera.
• Interference filters are used to select the excitation and emission
wavelengths corresponding to the Cy3 and Cy5 fluorescent probes
(Amersham Pharmacia).
• Each image covers a 2.4 x 2.4 mm area of the slide at 5-μm
resolution. To scan the entire microarray, a series of images
(“panels”) is acquired by moving the slide under the microscope
objective in 2.4-mm increments.
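To get a feel for the panel-by-panel scan, this sketch computes the pixel size of one panel and the number of panels needed to cover an array. The 17 x 17 mm array size is a hypothetical example (just under the 3 cm² mentioned earlier), and the panels are assumed not to overlap.

```python
PANEL_UM = 2400  # panel edge from the slide: 2.4 mm
PIXEL_UM = 5     # resolution from the slide: 5 micrometres

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def panels_needed(width_um, height_um):
    """Rows, columns and total panels to tile an area without overlap (assumption)."""
    cols = ceil_div(width_um, PANEL_UM)
    rows = ceil_div(height_um, PANEL_UM)
    return rows, cols, rows * cols

pixels_per_side = PANEL_UM // PIXEL_UM
rows, cols, total = panels_needed(17_000, 17_000)  # hypothetical array size

print(pixels_per_side, rows, cols, total)
```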
Animation: http://www.bio.davidson.edu/courses/genomics/chip/chip.swf
The red/green ratios can be spatially biased
• Top 2.5% of ratios red, bottom 2.5% of ratios green
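One simple way to look for this kind of spatial bias is to flag the extreme ratios and see where they fall on the slide. The sketch below is an illustration under assumptions, not the method used for the figure: the rank-based rule and the 10 x 20 grid are hypothetical, and the ratios are synthetic values that increase from the top row to the bottom row.

```python
def flag_extreme_ratios(ratios, fraction=0.025):
    """Label the top `fraction` of ratios 'red' and the bottom `fraction` 'green'."""
    n = len(ratios)
    k = max(1, round(n * fraction))  # number of spots flagged at each end
    order = sorted(range(n), key=lambda i: ratios[i])
    colors = ["none"] * n
    for i in order[:k]:
        colors[i] = "green"
    for i in order[-k:]:
        colors[i] = "red"
    return colors

# Synthetic slide: ratios drift upward from row 0 to row 9 (a spatial trend).
ROWS, COLS = 10, 20
ratios = [row + col / 100 for row in range(ROWS) for col in range(COLS)]

colors = flag_extreme_ratios(ratios)
red_rows = sorted({i // COLS for i, c in enumerate(colors) if c == "red"})
green_rows = sorted({i // COLS for i, c in enumerate(colors) if c == "green"})

print("rows with red spots:", red_rows)
print("rows with green spots:", green_rows)
```

If the ratios carried no spatial bias, the flagged spots would be scattered across the slide rather than concentrated in particular rows.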
Spotted vs. Affymetrix Arrays
Affymetrix strengths:
- highly reliable: synthesized in situ
- highly reproducible from run to run
- no clone maintenance or ‘drift’
- sealed fluidics and controlled temperature
- standardized chips increase database power
- excellent scanner
- complex, but very reliable labeling
- excellent cost/benefit ratio
- amenable to mutation and SNP detection
Affymetrix weaknesses/limitations
- not easily customized: $300K/chip
- high labeling cost $170/chip
- high per chip cost $350 to $1850
- limited choice of species
- requires knowledge of sequence
- not designed for competitive protocols
Limitations to all microarrays
- dynamic range of gene expression:
very difficult to simultaneously detect low and high
abundance genes accurately
- a gene may have multiple splice variants
2 splice variants may have opposite effects (e.g. trk)
arrays can be designed for splicing, but complexity increases 5X
- translational efficiency is a regulated process:
mRNA level does not necessarily correlate with protein level
- proteins are modified post-translationally
glycosylation, phosphorylation, etc.
- pathogens might have little ‘genomic’ effect
Biological question (differentially expressed genes, sample class prediction, etc.)
→ Experimental design
→ Microarray experiment
→ 16-bit TIFF files
→ Image analysis: (Rfg, Rbg), (Gfg, Gbg)
→ Normalization: R, G
→ Estimation, Testing, Clustering, Discrimination
→ Biological verification and interpretation
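The step from the image-analysis output (Rfg, Rbg), (Gfg, Gbg) to normalized R, G values can be sketched as follows. The slides do not specify the normalization method, so the choices here are assumptions made for illustration: background is subtracted from foreground, intensities are floored at 1 to keep the log defined, and the log ratios are centered on their median (a simple global normalization). The spot values are made up.

```python
import math
from statistics import median

def background_correct(fg, bg, floor=1.0):
    """Foreground minus background, floored so the log ratio stays defined (assumption)."""
    return max(fg - bg, floor)

def normalized_log_ratios(spots):
    """spots: list of (Rfg, Rbg, Gfg, Gbg) tuples from image analysis.

    Returns log2(R / G) per spot, shifted so the median is zero
    (global median normalization, assumed here for illustration).
    """
    m = []
    for rfg, rbg, gfg, gbg in spots:
        r = background_correct(rfg, rbg)
        g = background_correct(gfg, gbg)
        m.append(math.log2(r / g))
    center = median(m)
    return [x - center for x in m]

# Made-up spots in which the red channel is uniformly twice as bright.
spots = [
    (2100.0, 100.0, 600.0, 100.0),
    (1100.0, 100.0, 600.0, 100.0),
    (600.0, 100.0, 600.0, 100.0),
]

print(normalized_log_ratios(spots))
```

Centering on the median removes the overall two-fold offset between channels, so what remains reflects spot-to-spot differences rather than a dye or scanner imbalance.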