Agilent bake-off: oligonucleotide layers per gene


Microarrays
Wednesday, March 1, 2006
Dr. Tim Hughes
CCBR – 160 College St. – Room 1302
Outline:
• Microarray experiments
• Normalization
• Different types of microarrays
• Other applications besides expression profiling
• Clustering and interpretation
Suggested reading
Eisen et al., 1998
Hartigan, J.A., Clustering Algorithms, Wiley, New York and London (1975).
My understanding is that it is no longer in print, but is available on CD.
Jain et al., ACM Computing Surveys, 31(3) 1999 “Data Clustering: a review”.
(http://www.amk.alt-neustadt.at/diplom/papers/Clustering/p264-jain.pdf)
Hegde et al., A concise guide to cDNA microarray analysis.
Biotechniques. 2000 Sep;29(3):548-50, 552-4, 556
Sherlock G. Analysis of large-scale gene expression data.
Curr Opin Immunol. 2000 Apr;12(2):201-5.
Nucleic Acid Hybridization
www.accessexcellence.org/AB/GG/nucleic.html
Microarray expression profiling
by 2-color assay (“cDNA arrays”)
Array:
PCR products
6250 yeast ORFs
hybridized cDNAs:
green = control
red = experiment
*Schena et al., 1995
“cDNA microarrays” are essentially dot-blots
on glass slides
0.45 mm
http://arrayit.com/Products/Printing/Stealth/stealth.html
• This slide was made with 16 pins
• 4.5 mm pin spacing matches 384-well
plates (16 x 24)
• Done with robotics
• Slides usually coated with poly-lysine
• Spots are usually 100-150 microns
• Spot spacing is usually 200-300
microns.
• Slides are 25 x 75 mm
• Easy to deposit 20K spots/slide
Common ways to “label” nucleic acids
• Random priming of double-stranded DNA: reaction contains labelled nucleotides
• Direct labelling (fluors only)
• Poly-T primed cDNA synthesis: reaction contains labelled nucleotides
• Amplification: cDNA synthesis primed with poly-T carrying a T7 promoter, “second strand” synthesis, then a T7 reaction that contains labelled nucleotides
Typical use of cDNA microarrays:
“Internal” normalization using two colors
[Figure: cDNA pools from treatment (drug, mutation) and control are compared on one array; each transcript (x, y, z) reads as up, down, unchanged, or not present.]
Excitation: a 532 nm laser (green) excites Cy3; a 635 nm laser (red) excites Cy5.
Emission: Cy3 is detected with an emission filter that passes 557-592 nm; Cy5 is detected with an emission filter that passes 650-690 nm.
Both are detected by a photomultiplier tube.
[Figure: structures of Cy3 NHS ester and Cy5 NHS ester]
http://www.jacksonimmuno.com/2001site/home/catalog/f-cy3-5.htm
http://www.ope-tech.com/doc/Cy5_structure.htm
The primary data: two grayscale TIFF files
Cy3 channel
(“green”)
Cy5 channel
(“red”)
http://www.axon.com/GN_GenePix4000.html
Image processing and normalization: what is microarray data?
Microarray data is summary information from image files that come out of the scanner.
Image processing: line up grids, flag bad spots, quantitate.
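As a rough illustration of what "quantitate" hands to the next step, the sketch below turns per-spot Cy5 and Cy3 intensities into a log10(ratio) and a log10(average intensity). The function name, the use of a geometric mean as the "average", and the example intensities are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def log_ratio_and_intensity(cy5, cy3):
    """Return (log10 ratio, log10 average intensity) for each spot.

    Assumes background-subtracted, strictly positive intensities.
    The geometric mean is used as the "average" (an assumption).
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    log_ratio = np.log10(cy5 / cy3)  # red (experiment) over green (control)
    log_intensity = 0.5 * (np.log10(cy5) + np.log10(cy3))
    return log_ratio, log_intensity

# Hypothetical spots: 10-fold up, unchanged, 10-fold down
cy5 = [1000.0, 200.0, 50.0]
cy3 = [100.0, 200.0, 500.0]
log_ratio, log_intensity = log_ratio_and_intensity(cy5, cy3)
```

Working in log space makes induction and repression symmetric around zero, which is why the plots use log10(ratio) rather than the raw ratio.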
Looking at data from a single experiment
[Figure: log10(expression ratio) plotted against log10(intensity) for each spot, with spots at P-value < 0.01 highlighted. Panel 1: 3-AT vs. no drug (Slides: 11120c01 - 11121c01). Panel 2: wild-type vs. wild-type (Slides: 11857c01 - 11858c01).]
Lowess smoothing: The names "lowess" and "loess" are derived from the term
"locally weighted scatter plot smooth," as both methods use locally weighted
linear regression to smooth data.
http://www.mathworks.com/access/helpdesk/help/toolbox/curvefit/ch_data7.html
Find spots
Manual edit
Quantitate
Normalize (“Lowess smoothing”)
(Locally weighted scatterplot
smoothing)
Confirm spots outside envelope
Save data, images, spot map
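The "Normalize" step above can be sketched as follows: fit a locally weighted linear regression of log10(ratio) on log10(intensity), then subtract the fitted trend from each spot. This is a minimal NumPy sketch, not the software used in the lecture. The function names, the `frac` value, and the synthetic data are assumptions, and the robustness iterations found in full lowess implementations are left out.

```python
import numpy as np

def lowess_fit(x, y, frac=0.3):
    """Locally weighted linear regression, evaluated at every x.

    For each point, the nearest `frac` of the data is fitted with a
    straight line using tricube weights.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))  # neighbours used per local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]  # indices of the k nearest neighbours
        h = dist[idx].max()
        if h == 0:
            fitted[i] = y[idx].mean()
            continue
        w = (1 - (dist[idx] / h) ** 3) ** 3  # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for y = a + b * x
        design = np.column_stack([np.ones(k), x[idx]]) * sw[:, None]
        coef, *_ = np.linalg.lstsq(design, y[idx] * sw, rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted

def lowess_normalize(log_intensity, log_ratio, frac=0.3):
    """Subtract the intensity-dependent trend from the log ratios."""
    return np.asarray(log_ratio, dtype=float) - lowess_fit(log_intensity, log_ratio, frac)

# Synthetic example: an intensity-dependent dye bias plus noise
rng = np.random.default_rng(0)
log_intensity = rng.uniform(-2, 2, 300)
log_ratio = 0.2 * log_intensity + rng.normal(0, 0.05, 300)
normalized = lowess_normalize(log_intensity, log_ratio)
```

After normalization the cloud of unchanged spots should be centred on zero at every intensity, so spots that still fall outside the envelope are the ones worth confirming.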
Selected tricks for processing and normalization
(1) High-pass spatial detrending. See: O. Shai, Q. Morris,
and B.J. Frey, (2003) Spatial Bias Removal in Microarray
Images, University of Toronto Technical Report PSI-200321, http://www.psi.utoronto.ca/~ofer/detrendingReport.pdf
(2) VSN – “Variance Stabilizing Normalization”. See:
Huber, W., Von Heydebreck, A., Sultmann, H., Poustka, A.,
& Vingron, M.
Variance stabilization applied to microarray data calibration
and to the quantification of differential expression.
Bioinformatics 18, Suppl 1, S96-S104 (2002).
Q. Morris, B. Frey, O. Shai
Other types of arrays
Photolithographic arrays (Affymetrix)
Building up oligonucleotides on a surface:
http://www.affymetrix.com/technology/manufacturing/index.affx
Photolithographic arrays (Affymetrix)
Arrays are typically 25-mers, with “mismatch” control for specificity
aka “GeneChip”
Photolithographic arrays (Affymetrix)
Advantages:
Density is limited essentially by the 5 micron resolution of scanners (solution:
larger arrays).
Well-developed protocols.
“Industry standard” (largely self-driven).
Disadvantages:
Not all probes work well. Affymetrix has evolved a complicated system to
compensate for this, but even “believers” use at least four probes per gene, and
usually more.
Single color.
Sample preparation typically requires amplification.
Single supplier; historically intellectual property issues. (i.e. comparisons)
Ink-jet arrays (Agilent)
• 25,000 oligos / 1 x 3 inches
• Sequence completely flexible
• 60-mers
Hughes TR et al. Expression profiling using microarrays fabricated by an ink-jet
oligonucleotide synthesizer. Nat Biotechnol. 2001 Apr;19(4):342-7.
Ink-jet arrays generally agree with spotted
cDNA arrays
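[Figure: scatter plots of ink-jet array vs. cDNA array measurements (Spo vs. SC), for single-oligo and multiple-oligo probes; the panels report r = 0.97 and r = 0.96. HXT1, HXT3 and HXT4 are labelled.]
Yeast IJS array: ~8 oligos per gene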
Ink-jet arrays (Agilent)
Advantages:
User-specified sequences; “no questions asked”
Sensitivity and specificity are defined and exceed requirement for most
expression profiling applications; no amplification required
Virtually every 60-mer is functional
Data correlates well with spotted cDNA arrays
Disadvantages:
Density currently limited to ~45,000 spots per array.
Single supplier (although a protocol is in press for making your own
synthesizer!)
“Maskless” arrays (Nimblegen)
http://www.nimblegen.com/technology/manufacture.html
“Maskless” arrays (Nimblegen)
Advantages:
User-specified sequences.
Density is limited essentially by the 5 micron resolution of scanners.
Disadvantages:
New to arena. Performance in initial publication (Nuwaysir et al., Genome
Research, 2002) suggests that sensitivity and specificity may be lower than that
of Agilent arrays.
Single supplier – although all the parts are there for academics to build one.
Possible IP issues. Hybs are done in Iceland to bypass Affy IP. Nimblegen web
site boasts of new partnership with Affymetrix.
Applications beyond expression profiling
• DNA copy number
• Genotyping
• Protein-DNA associations
• Molecular “Barcoding”
• Protein arrays
• Transformation arrays
Identifying DNA binding sites
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA.
Genome-wide location and function of DNA binding proteins.
Science 2000 Dec 22;290(5500):2306-9.
Analysis of multiple experiments
• Comparisons
• Clustering
• Predicting gene functions
• Finding promoter elements
Comparing data from two experiments
[Figure: scatter plots of ratios (intensity not displayed). Panel 1: log10(ratio), cup5 / WT vs. log10(ratio), vma8 / WT; r = 0.88; CUP5 and VMA8 labelled. Panel 2: log10(ratio), cup5 / WT vs. log10(ratio), mrt4 / WT; r = 0.09; CUP5 and MRT4 labelled.]
The behavior of two genes over many
experiments can be compared
in the same fashion
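The r values quoted for pairs of experiments are correlation coefficients of log ratios, and the same calculation applies to pairs of genes. A minimal sketch, assuming Pearson correlation and a hypothetical matrix with genes in rows and experiments in columns:

```python
import numpy as np

# Hypothetical log10(ratio) matrix: rows = genes, columns = experiments
log_ratios = np.array([
    [0.9, 0.8, -0.1],
    [-0.7, -0.6, 0.2],
    [0.1, 0.0, 0.9],
    [0.5, 0.6, -0.3],
])

def pearson_r(a, b):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

# Two experiments compared across all genes (columns 0 and 1)
r_experiments = pearson_r(log_ratios[:, 0], log_ratios[:, 1])

# Two genes compared across all experiments (rows 0 and 3)
r_genes = pearson_r(log_ratios[0], log_ratios[3])
```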
2-D clustering
Step 1: cut experiments and transcripts
falling below P-value and ratio thresholds
[Figure: data matrix of 44 experiments x 407 genes; axes: experiment index and transcript response index; color scale from 10-fold repression to 10-fold induction.]
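Step 1 can be sketched as a mask over the data matrix. The exact rule used for the slide is not given, so the thresholds below (P-value < 0.01, at least 2-fold change, at least one significant entry per row or column) and the function name are assumptions chosen only to show the mechanics.

```python
import numpy as np

def filter_matrix(log_ratio, p_value, p_max=0.01, min_fold=2.0, min_hits=1):
    """Drop transcripts (rows) and experiments (columns) with too few
    significant changes. All thresholds are illustrative assumptions."""
    log_ratio = np.asarray(log_ratio, dtype=float)
    p_value = np.asarray(p_value, dtype=float)
    significant = (p_value < p_max) & (np.abs(log_ratio) >= np.log10(min_fold))
    keep_genes = significant.sum(axis=1) >= min_hits
    keep_expts = significant.sum(axis=0) >= min_hits
    return log_ratio[np.ix_(keep_genes, keep_expts)], keep_genes, keep_expts

# Hypothetical 3 transcripts x 3 experiments
log_ratio = [[0.5, 0.0, 0.1],
             [0.05, 0.02, 0.0],
             [-0.4, 0.0, 0.6]]
p_value = [[0.001, 0.5, 0.3],
           [0.4, 0.6, 0.9],
           [0.005, 0.7, 0.002]]
filtered, keep_genes, keep_expts = filter_matrix(log_ratio, p_value)
```

In this toy example the second transcript and the second experiment show no significant change and are removed, leaving a 2 x 2 matrix for clustering.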
2-D clustering
Step 2: cluster experiments and transcripts
[Figure: clustered data matrix; axes: transcript response index and experiment index; labelled groups of experiments: ste mutants, RHO O/X, PKC O/X, treatment with alpha-factor; color scale from 10-fold repression to 10-fold induction.]
Data from Roberts et al.,
Science (2000)
There are many types of clustering.
One example: K-means (must choose K)
See: Sherlock G. Analysis of
large-scale gene expression data.
Curr Opin Immunol. 2000
Apr;12(2):201-5.
[Figure: K-means result with K = 10; clusters #1, #2 and #3 shown.]
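K-means alternates between assigning each expression profile to its nearest centroid and moving each centroid to the mean of its members. The sketch below is a plain NumPy version using Euclidean distance and random initial centroids. The function name, the distance measure, and the toy profiles are assumptions, and K still has to be chosen by the user.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Basic K-means on the rows of `data` (one expression profile per row)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct profiles picked at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every profile to every centroid
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical profiles: three induced genes and three repressed genes
profiles = [[1.0, 0.9, 1.1], [0.9, 1.0, 1.0], [1.1, 1.0, 0.9],
            [-1.0, -0.9, -1.1], [-0.9, -1.1, -1.0], [-1.1, -1.0, -0.9]]
labels, centroids = kmeans(profiles, k=2)
```

The result depends on the starting centroids, so in practice the procedure is often repeated from several random starts.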
Basics of clustering freeware: Eisen’s “Cluster” and “Treeview”
Mike Eisen's web site: rana.lbl.gov/EisenSoftware.htm
“Cluster” loads an Excel file (save as tab-delimited text) in the following format:
Cluster
Treeview
(also: “TreeArrange” - http://monod.uwaterloo.ca/downloads/treearrange/)
There are also many commercial programs available.
protein
mRNA
nucleus
cell
Microarray expression data
Co-regulated groups of genes
Functional categories
Predict functions of
new genes
cis, trans regulators
Non-overlapping yeast gene expression
clusters
1,226 genes
249 genes
424 experiments
Cluster label | Sample genes
amino acid metabolism | TRP4, HIS3
arginine biosynthesis | ARG1, ARG3
arginine catabolism | CAR1, CAR2
aromatic AA metabolism | ARO9, ARO10
asparagine biosynthesis | ASN1, ASN2
branched chain AA synth | ILV1,2,3,6
lysine biosynthesis | LYS2, LYS9
methionine biosynthesis | MET3,16,28
sulfur AA tnsprt, metab | MUP1, MHT1
adenine biosynthesis | ADE1,4,8
aldehyde metabolism | AAD4,14,16
biotin biosynthesis | BIO3,4
citrate metabolism | CIT1,2
ergosterol biosynthesis | ERG1,5,11
fatty acid biosynthesis | FAS1,FAS2
gluconeogenesis | PGK1, TDH1,2,3
NAD biosynthesis | BNA4,6
one-carbon metabolism | GCV1,2,3
pyridoxine metabolism | SNO1, SNZ1
thiamin biosynthesis 1 | THI5,12
thiamin biosynthesis 2 | THI2,20
hexose transport | HXT4,GSY1
sodium ion transport | ENA1,2,5
polyamine transport | TPO2,3
nucleocytoplasmic transport | KAP123,NUP100
ribosome/RNA biogenesis | MAK16,CBF5
ribosomal proteins | RPS1A,RPL28
translational elongation | TEF1,2
protein folding | SSA1,HSP60
secretion | VTH1,KRE11
protein glycosylation | ALG6,CAX4
vesicle-mediated transport | VPS5,IMH1
proteasome | RPN6,RPT5
vacuole fusion | VTC1,3,4,PHO84
mitoribosome/respiration | MRPL1,MRPS5
Mitochond. electron trans. | ATP1,COX4
iron transport/TCA cycle | FRE1,FET3
Chromatin/transcription | SNF2,CHD1,DOT6
histones | HTA1,HHF1
MCM2/3/6/CDC47 | MCM2,3,6
DNA replication | RFA1,POL12
mitotic cell cycle | SPC110,CIN8
CLB1/CLB6/BBP1 | CLB1,6
cytokinesis | CTS1,EGT2
development | PAM1,GIC2
pheromone response | FUS3,FAR1
conjugation | CIK1,KAR3
sporulation/meiosis | SPO11,SPO19
response to oxidative stress | GDH3,HYR1
stress/heat shock | HSP104,SSA4
Candidate regulator
GCN4
ARG80/81
ARG80/81/UME6/RPD3
ARO80
GCN4/HAP1/HAP2
LEU3, GCN4
LYS14
CBF1, MET28, MET32
MET31,MET32
BAS1, BAS2, GCN4
RTG3
ECM22/UPC2
INO4
GCR1
THI2/THI3
THI2/THI3
GCR1
NRG1,MIG1
HAA1
RRPE-binding factor
PAC/RRPE-binding factors
HAC1,ROX1
RLM1
XBP1
RPN4
PHO4
HAP2/3/4/5
MAC1/RCS1/AFT1/PDR1/3
HIR1,HIR2
ECB
MCB
HCM1
FKH1
ACE2,SWI4
MATALPHA2,STE12
KAR4
NDT80
ROX1,MSN2,MSN4
MSN2,MSN4
Chua et al., 2004
Analyzing clusters:
amino acid biosynthesis (p < 10^-14)
amino acid metabolism (p < 10^-14)
methionine metabolism (p = 1.07 × 10^-7)
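P-values of this kind are commonly obtained from a hypergeometric test: given how many genes in the genome carry an annotation, how surprising is the number seen in the cluster? The slides do not say which test produced the values above, so the sketch below is one standard choice rather than the method actually used. The function name and all gene counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_genome, n_category, n_cluster, n_overlap):
    """P(overlap >= n_overlap) when n_cluster genes are drawn without
    replacement from n_genome genes, n_category of which are annotated."""
    total = comb(n_genome, n_cluster)
    upper = min(n_category, n_cluster)
    tail = sum(
        comb(n_category, i) * comb(n_genome - n_category, n_cluster - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Small case that can be checked by hand: (36 + 4) / 120 = 1/3
p_small = hypergeom_enrichment_p(10, 4, 3, 2)

# Hypothetical cluster: 20 of 50 genes annotated, 100 of 6000 genome-wide
p_cluster = hypergeom_enrichment_p(6000, 100, 50, 20)
```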
Some web resources for promoter analysis:
YRSA (http://forkhead.cgb.ki.se/YRSA/define1.htm)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/fullanalysis.pl)
**http://area51.med.utoronto.ca/FUNSPEC.html
GO-Biological Process categories
Breadth levels: Very Broad, Broad, Mid-level, Narrow
Category | # annotated genes (mouse)
metabolism | 1548
development | 2341
vision | 163
CNS development | 137
eye morphogenesis | 21
ATP biosynthesis | 36
pigment metabolism | 25
striated muscle contraction | 33
eye pigment metabolism | 3
insulin secretion | 4
GO-Biological Process hierarchy
metabolism
development
CNS development
pigment metabolism
eye morphogenesis
eye pigment metabolism
Other types of categorical annotations:
KEGG, EC numbers (describe biochemical “pathways”)
MIPS, YPD (yeast databases – older than GO)
Results of individual studies (localization, 2-hybrid screens,
protein complexes, etc.)
Sequence motifs, structural domains (pfam, SMART)
Other people’s microarray clusters
etc.
**When testing clusters against many different types of
categorical annotations, one should consider correcting for
multiple testing, and also keep in mind that categories are often
not independent.
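Two common corrections are sketched below: Bonferroni, which multiplies each P-value by the number of tests and holds under any dependence between tests, and Benjamini-Hochberg, which controls the false discovery rate and is usually justified under independence or positive dependence. The slides do not name a specific method, so both the choice of methods and the example P-values are assumptions.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values from testing one cluster against four categories
p_values = [0.01, 0.04, 0.03, 0.20]
p_bonf = bonferroni(p_values)
p_bh = benjamini_hochberg(p_values)
```

Neither correction accounts for the hierarchy of categories (a parent GO term contains its children), which is one reason overlapping categories need separate thought.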
protein
mRNA
nucleus
cell
Big questions:
To what degree are functional pathways
coordinately regulated?
What controls the observed regulations?
Exploring mouse gene expression
using Ink-jet Oligonucleotide Arrays
• 22,000 oligos / 1 x 3 inches
• Sequence completely flexible
• Mouse “42K” array: NCBI
GenomeScan predictions
(“XM”) on mouse draft
sequence
• Includes:
25K with cDNA
(75% of 18K RefSeq genes)
30K with cDNA or EST
12K potential new genes
**Wen Zhang
Exploring mouse gene expression
using Ink-jet Oligonucleotide Arrays
Collect 55 different
mouse tissues from
experts:
Janet Rossant
Jane Aubin
Derek van der Kooy
Michael Fehlings
Benoit Bruneau**
Analyze
mRNA levels
on arrays
(1 mg
poly-A)
**Wen Zhang
[Figure: expression across tissues. Labelled samples: Testis, Olfactory bulb, Brain, Eye, ES, Skel. Muscle, Liver, Femur, Teeth, Placenta, Prostate, Lymph node, Spleen, Digit, Tongue, Trachea, Large intestine, Colon. Gene classes: Unchar., cDNA, EST, Gene trap, Transcription factor, RNA binding/RS domain.]
Analysis of 55 mouse tissues: QC
Description | Accession
Hypothetical protein FLJ20519 | XM_131066.1
Testis nuclear RNA binding ptn (Tenr) | XM_124039.1
DEAD box polypeptide 4 (Ddx4) | XM_127536.1
Deleted in azoospermia-like (Dazl) | XM_123141.1
RIKEN cDNA 1700001N01 | XM_125027.1
LOC235045 | XM_134745.1
Sim. to serine protease inhibitor | XM_144364.1
RIKEN cDNA 1700067I02 | XM_132042.1
LOC245536 (LOC245536), mRNA | XM_159329.1
Hematopoietic cell transcript 1 | XM_125337.1
Chr 7 expressed (D7Wsu180e) | XM_124875.1
Sim. to orphan receptor (LOC215448) | XM_122095.1
Poly(rC) binding ptn. 3 (Pcbp3) | XM_122063.1
Voltage-dep. R-type Ca++ channel a-1E | XM_123530.1
Ataxin 2 binding protein 1 | XM_147994.1
Sim. to HuC | XM_134734.1
Ventral neuron-specific ptn 1 NOVA1 | XM_138026.1
Poly(rC) binding ptn 4 (Pcbp4) | XM_125213.1
LOC217874 | XM_127170.1
LOC239368 | XM_139399.1
Zinc finger protein 97 | XM_134010.1
RIKEN cDNA 2400008B06 | XM_134886.1
Metal-response element tx factor 2 (Mtf2) | XM_132195.1
LOC231661 | XM_132381.1
LOC231903 | XM_149717.1
Related to CG7582 (LOC232810) | XM_133152.1
RIKEN cDNA 1300006E06 | XM_128315.1
Sim. to protease (LOC211700) | XM_136425.1
Hypothetical protein FLJ22774 | XM_139234.1
RIKEN cDNA 5430427O21 | XM_132158.1
Nuclear RNA export factor 6 (Nxf6) | XM_142153.1
Sim. to serine protease inhibitor 14 | XM_147352.1
Sim. to serine protease inhibitor 13 | XM_122538.1
Hypothetical ZNF protein KIAA0961 | XM_145503.1
KIAA0215 gene product | XM_135809.1
LOC227582 | XM_149095.1
Sim. to HMG-BOX tx factor BBX | XM_147194.1
LOC214566 | XM_150017.1
FN5 protein (Fn5) | XM_147333.1
LOC229850 | XM_149402.1
LOC229555 | XM_130999.1
Ribonuclease L (Rnasel) | XM_136286.1
(2-5)oligo(A) synthetase 1A | XM_132373.1
GAPDH
**Malina Bakowski, Blencowe lab
Are functional pathways
coordinately regulated?
Compiled annotations from 992 GO “Biological
process” categories for 7,779 genes on the array
(from EBI and MGI/JAX)
(considered only categories with >3 and <500
genes)
**GO evidence codes (and manual inspection)
indicate that very few annotations are based
purely on expression
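The size filter mentioned above (only categories with >3 and <500 genes) amounts to a simple selection over a category-to-genes mapping. A minimal sketch, assuming the annotations are held in a dictionary; the function name, category names, and gene identifiers are hypothetical.

```python
def filter_categories(category_to_genes, min_genes=3, max_genes=500):
    """Keep categories with more than min_genes and fewer than max_genes
    distinct genes on the array."""
    return {
        name: genes
        for name, genes in category_to_genes.items()
        if min_genes < len(set(genes)) < max_genes
    }

# Hypothetical annotations
annotations = {
    "tiny category": ["g1", "g2", "g3"],                # 3 genes: dropped
    "usable category": ["g1", "g2", "g3", "g4"],        # 4 genes: kept
    "huge category": [f"g{i}" for i in range(500)],     # 500 genes: dropped
}
kept = filter_categories(annotations)
```

Very small categories give unstable statistics and very large ones say little about specific function, which is the usual motivation for bounds like these.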
Polyamine biosynthesis
Oxidative phosphorylation
Muscle contraction
Gene
expression
reflects gene
function
Epidermal differentiation
Cell:cell adhesion
Regulation of neurotransmitter level
Synaptic transmission
Axonogenesis
RNA splicing
Cytokinesis
Microtubule-based movement
M phase
Ratio over median
<1
3
7
>20
55 mouse tissues/samples
Serine biosynthesis
Pregnancy
Fertilization
Bone remodeling
Skeletal development
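The color scale on this slide is labelled "ratio over median". One plausible reading, taken here as an assumption, is that each gene's level in a tissue is divided by that gene's median level across all tissues, so that tissue-specific expression stands out. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def ratio_over_median(expression):
    """Divide each gene's expression (row) by its median across tissues.

    Assumes rows are genes, columns are tissues, and medians are non-zero.
    """
    expression = np.asarray(expression, dtype=float)
    medians = np.median(expression, axis=1, keepdims=True)
    return expression / medians

# Hypothetical levels for 2 genes in 5 tissues
expression = [[10.0, 10.0, 10.0, 200.0, 10.0],
              [5.0, 50.0, 5.0, 5.0, 5.0]]
ratios = ratio_over_median(expression)
```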