Transcript EE150a – Genomic Signal and Information Processing
On DNA Microarrays Technology
October 12, 2004
Recall the information flow in cells
• Replication of DNA – {A,C,G,T} to {A,C,G,T}
• Transcription of DNA to mRNA – {A,C,G,T} to {A,C,G,U}
• Translation of mRNA to proteins – {A,C,G,U} to {20 amino acids}
• Interrupt the information flow and measure gene expression levels!
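As a small illustration of the alphabet changes listed above, the following Python sketch transcribes a DNA coding strand into mRNA and translates it with a deliberately partial codon table. The function names, the example sequence, and the four-codon table are illustrative assumptions, not part of the lecture.

```python
# Partial codon table: only the codons used in the example are listed.
CODONS = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}


def transcribe(coding_strand: str) -> str:
    """mRNA with the same sequence as the DNA coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read codons from the first position until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODONS[mrna[i:i + 3]]
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


mrna = transcribe("ATGTTTGGCTAA")  # made-up 12-base sequence
print(mrna)
print(translate(mrna))
```

A microarray experiment interrupts this chain at the mRNA stage: the abundance of each transcript is what gets measured.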
http://www-stat.stanford.edu/~susan/courses/s166/central.gif
Gene Microarrays
• A medium for matching known and unknown sequences of nucleotides based on hybridization (base-pairing: A-T, C-G)
• Applications
  – identification of a sequence (gene or gene mutation)
  – determination of expression level (abundance) of genes
  – verification of computationally determined genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
  – probes, orderly arranged on an array
  – targets, the unknown samples to be detected
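The base-pairing rule can be made concrete with a minimal sketch: a target hybridizes specifically to the probe that is its reverse complement. The probe names and sequences below are hypothetical, and real hybridization also depends on temperature and chemistry that this toy lookup ignores.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that base-pairs with `seq`, read 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(seq))


def matching_probes(target: str, probes: dict[str, str]) -> list[str]:
    """Names of probes whose sequence is the perfect complement of `target`."""
    wanted = reverse_complement(target)
    return [name for name, seq in probes.items() if seq == wanted]


# Hypothetical toy array: probe names and sequences are made up.
probes = {"geneA": "ACGTTGCA", "geneB": "GGATCCAA", "geneC": "TTGGATCC"}
print(matching_probes("TTGGATCC", probes))
```

Because the probe positions on the array are known, identifying which spot lights up identifies the target sequence.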
Microarray Technologies
• Oligonucleotide arrays (Affymetrix GeneChips)
  – probes are photo-etched on a chip (20-80 nucleotides)
  – dye-labeled mRNA is hybridized to the chip
  – laser scanning is used to detect gene expression levels (i.e., amount of mRNA)
• cDNA arrays
  – complementary DNA (cDNA) sequences “spotted” on arrays (500-1000 nucleotides)
  – dye-labeled mRNA is hybridized to the chip (2 types!)
  – laser scanning is used to detect gene expression levels
• There are various hybrids of the two technologies above
Source: Affymetrix website
Oligonucleotide arrays
Source: Affymetrix website
GeneChip Architecture
Source: Affymetrix website
Hybridization
Source: Affymetrix website
Laser Scanning
Source: The Paterson Institute for Cancer Research
Sample Image
Competing Microarray Technologies
• So far considered oligonucleotide arrays:
  – automated, on-chip design
  – light dispersion may cause problems
  – short probes, 20-80 nucleotides
• cDNA microarrays are another technology:
  – longer probes obtained via PCR (polymerase chain reaction)
  – [sidenote: what is optimal length?]
  – probes grown in a lab, robot printing
  – two types of targets – control and test
cDNA Microarrays
http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg
Sample cDNA Microarray Image
Some Design Issues
• Photo-etching based design: unwanted light exposure
  – border minimization
  – the probes are 20-80 nucleotides long
• Hybridization: binding of a target to its perfect complement
• However, when a probe differs from the target's perfect complement by a small number of bases, it may still bind
• This non-specific binding (cross-hybridization) is a source of measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a lot of control over the landscape of the probes on the array
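To illustrate the distinction between specific binding and cross-hybridization, here is a rough sketch that counts how many bases of a target fail to pair with a probe. The cutoff `max_mismatches` and the sequences are illustrative assumptions; actual binding is governed by thermodynamics, not by a fixed mismatch count.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatches(probe: str, target: str) -> int:
    """Number of positions where `target` fails to base-pair with `probe`.

    Both strings are assumed to have equal length and to be written 5' to 3',
    so the target is compared against the probe's reverse complement.
    """
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    ideal = "".join(PAIR[b] for b in reversed(probe))
    return sum(1 for a, b in zip(ideal, target) if a != b)


def binding_type(probe: str, target: str, max_mismatches: int = 2) -> str:
    """Classify a probe/target pair; `max_mismatches` is an illustrative cutoff."""
    d = mismatches(probe, target)
    if d == 0:
        return "specific"
    return "cross-hybridization" if d <= max_mismatches else "none"


probe = "GGATCCAA"
print(binding_type(probe, "TTGGATCC"))  # perfect complement
print(binding_type(probe, "TTGGTTCC"))  # one mismatched base
print(binding_type(probe, "AACCTAGG"))  # every position mismatched
```

A designer who controls the probe set can use a check of this kind to avoid placing probes that are near-complements of the same target.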
Dealing with Measurement Noise
• Recent models of microarray noise
  – measurements reveal signal-dependent noise (i.e., shot-noise) as the major component
  – additional Gaussian-like noise due to sample preparation, image scanning, etc.
• Image processing assumes image background noise
  – attempts to subtract it
  – sets up thresholds
• Lack of models of processes on microarrays
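The conventional image-processing approach mentioned above can be sketched in a few lines: estimate a background level, subtract it from each spot, and discard what falls below a threshold. Using the median of background pixels and the particular threshold value are assumptions made for illustration; the intensities are made up.

```python
from statistics import median


def background_corrected(spots, background, threshold=0.0):
    """Subtract a background estimate from each spot and clip small values.

    `spots` and `background` are lists of raw intensities; the background
    level is taken as the median of the background pixels (an assumption).
    Corrected values at or below `threshold` are treated as undetected (0.0).
    """
    level = median(background)
    corrected = []
    for value in spots:
        signal = value - level
        corrected.append(signal if signal > threshold else 0.0)
    return corrected


# Made-up intensities in arbitrary scanner units.
spots = [520.0, 130.0, 95.0, 1210.0]
background = [90.0, 110.0, 100.0, 105.0, 95.0]
print(background_corrected(spots, background, threshold=20.0))
```

This treats noise as a constant offset, which is exactly the limitation the slide points out: it does not model the signal-dependent processes on the array.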
Probabilistic DNA Microarray Model
• Consider an $m \times m$ DNA microarray, with $m^2$ unique types of nucleotide probes
• A total of $N$ molecules of $n$ different types of cDNA targets, with concentrations $c_1, \ldots, c_n$, is applied to the microarray
• Measurement is taken after the system has reached chemical equilibrium
• Our goal: from the scanned image, estimate the concentrations
DNA Microarray Model Cont’d
• Each target may hybridize to only one type of probe
• There are k non-specific bindings
• Model diffusion of unbound molecules by random walk; distribution of unbound molecules uniform on the array
  – justified by reported experimental results
• Assume known probabilities of hybridization and cross-hybridization
  – Theoretically: from melting temperature
  – Experimentally: measurements (e.g., from control target samples)
Markov Chain Model
Modeling transition between possible states of a target:
• one specific binding state
• k=2 non-specific bindings
• $p_n = 1 - k p_c - p_h$ is the probability that an unbound molecule remains free, where $p_h$ and $p_c$ are the hybridization and cross-hybridization probabilities
Measurement is taken after the system has reached the state of chemical equilibrium – need to find the steady state
Markov Chain Model Cont’d
Let $\pi_i = [\pi_{i,1}\ \pi_{i,2}\ \ldots\ \pi_{i,k+2}]^T$ be a vector whose components are the numbers of type-$i$ targets in each of the $k+2$ states of the Markov chain
• $\pi_{i,1}$ is the number of hybridized molecules
• $\pi_{i,j}$, $2 < j \le k+2$, is the number of cross-hybridized molecules
Note that $\sum_{j=1}^{k+2} \pi_{i,j} = c_i$ for every $i$.
Stationary State of the Markov Chain
• In equilibrium, we want to find $\pi_i$ such that [equation missing], where the transition matrix $P_i$ is given by [equation missing]
• Clearly, in the stationary state we have [equation missing]
• Finally, the ratio $\pi_i / c_i$ gives the stationary state probabilities
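The slide's transition matrix is not reproduced in this transcript, so the following Python sketch builds one under stated assumptions and finds its stationary distribution by power iteration. The state ordering (hybridized, free, then k cross-hybridized states), the single release probability `p_r` shared by all bound states, and the numerical values are all hypothetical choices for illustration.

```python
def transition_matrix(p_h, p_c, p_r, k):
    """Row-stochastic matrix over k+2 states: [hybridized, free, cross_1..cross_k].

    Assumed dynamics (not taken from the slides): a bound molecule is released
    with probability p_r per step; a free molecule hybridizes with p_h,
    cross-hybridizes to each of k sites with p_c, or stays free otherwise.
    """
    p_n = 1.0 - k * p_c - p_h
    if p_n < 0:
        raise ValueError("need p_h + k * p_c <= 1")
    size = k + 2
    P = [[0.0] * size for _ in range(size)]
    P[0][0], P[0][1] = 1.0 - p_r, p_r            # hybridized: stay or release
    P[1][0], P[1][1] = p_h, p_n                  # free: hybridize or stay free
    for j in range(2, size):
        P[1][j] = p_c                            # free: cross-hybridize
        P[j][j], P[j][1] = 1.0 - p_r, p_r        # cross-bound: stay or release
    return P


def stationary(P, tol=1e-12, max_iter=100_000):
    """Power iteration: repeat pi <- pi P until the change falls below tol."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Illustrative values only: p_h = 0.5, p_c = 0.1, p_r = 0.02, k = 2.
pi = stationary(transition_matrix(0.5, 0.1, 0.02, 2))
c_i = 100_000                                    # hypothetical number of type-i molecules
print([round(p, 4) for p in pi])                 # stationary probabilities
print([round(c_i * p) for p in pi])              # expected molecule counts per state
```

Under these assumptions the balance between binding and release fixes the answer in closed form: each bound state holds (binding probability / release probability) times the free-state mass, which gives 25/36 for the hybridized state here.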
Linear Microarray Model
• Let matrix $Q$ collect the previously obtained probabilities
• The microarray measurement model can be written as $s = Qc + w + v$
• Vector $w$ describes inherent fluctuations in the measured signal due to hybridization (shot-noise)
• Binding of the $j$-type target to the $i$-type probe is a Bernoulli random variable with variance $q_{i,j}(1-q_{i,j})$
  – hence the variance of $w_i$ is given by $\sigma_{w_i}^2 = \sum_{j=1}^{n} c_j \, q_{i,j}(1-q_{i,j})$
• Vector $v$ consists of iid Gaussian entries
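A short sketch of the deterministic part of this model: for a given probability matrix and concentration vector, it computes the expected signal at each probe and the shot-noise variance that follows from treating each molecule as an independent Bernoulli trial. The 3-by-2 matrix and the concentrations are made-up values, and the function names are hypothetical.

```python
def expected_signal(Q, c):
    """Mean number of molecules bound at each probe: (Q c)_i = sum_j q_ij c_j."""
    return [sum(q * cj for q, cj in zip(row, c)) for row in Q]


def shot_noise_variance(Q, c):
    """Variance of w_i, treating each of the c_j molecules of type j as an
    independent Bernoulli trial with success probability q_ij."""
    return [sum(cj * q * (1.0 - q) for q, cj in zip(row, c)) for row in Q]


# Hypothetical 3-probe, 2-target example; Q[i][j] is the probability that a
# type-j target ends up bound to probe i.
Q = [[0.70, 0.05],
     [0.10, 0.60],
     [0.05, 0.10]]
c = [1000, 2000]

print(expected_signal(Q, c))
print(shot_noise_variance(Q, c))
```

Note that the variance grows with the concentrations themselves, which is what makes this noise signal-dependent rather than a fixed background.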
Detection of Gene Expression Levels
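• A simple estimate is obtained via pseudo-inverse, $\hat{c} = Q^{\dagger} s$
• Maximize a posteriori probability $p(s|c)$, which is equivalent to [equation missing], where the matrix [symbol missing] is given by [equation missing]
• Optimization above readily simplifies to [equation missing]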
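The pseudo-inverse estimate can be tried on synthetic data with NumPy. This is only a sketch of the simple estimator, not of the probabilistic optimization, whose equations are not preserved here. The matrix `Q`, the concentrations, the Gaussian approximation of the shot noise, and the additive noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-probe, 3-target probability matrix: a dominant specific
# binding probability per target plus small cross-hybridization terms.
Q = np.array([[0.70, 0.05, 0.00],
              [0.05, 0.70, 0.05],
              [0.00, 0.05, 0.70],
              [0.05, 0.00, 0.05]])
c_true = np.array([1.0e5, 2.0e5, 1.0e5])

# Synthetic readout: mean Q c, shot noise with variance sum_j c_j q_ij (1 - q_ij)
# (approximated as Gaussian), and additive Gaussian noise of assumed std 50.
mean = Q @ c_true
shot_std = np.sqrt((Q * (1.0 - Q)) @ c_true)
s = mean + shot_std * rng.standard_normal(4) + 50.0 * rng.standard_normal(4)

# Pseudo-inverse (least-squares) estimate of the concentrations.
c_hat = np.linalg.pinv(Q) @ s

print(np.round(c_hat))
print("relative error:", np.abs(c_hat - c_true) / c_true)
```

The pseudo-inverse weights all probes equally; an estimator that accounts for the signal-dependent noise variance would weight reliable probes more heavily.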
Simulation Results
• Consider an 8 × 8 array (m=8)
• Apply n=6 types of targets
• Concentrations: [1e5 2e5 2e5 2e5 1e5 2e5] (N=1e6)
• Assume the following probabilities:
  – hybridization – 0.8
  – cross-hybridization – 0.1
  – release – 0.02
• Let k=3 (number of non-specific bindings)
• Free molecules perform random walk on the array
Simulation Results: Readout Data
Simulation Results: Estimate
Some Comments
• Adopt mean-square error as a measure of performance
• As expected, we observe significant improvement over raw measurements (improvement in terms of MSE)
• Things to do:
  – investigate how to incorporate control sample measurements
  – modification of the technique for very large microarrays is needed (matrix inversion may be unstable)
• Experimental verification!
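For reference, the mean-square error used as the performance measure can be computed as in the sketch below. The numbers are invented purely to show the calculation and are not the lecture's simulation results.

```python
def mse(estimate, truth):
    """Mean-square error between two equal-length sequences."""
    if len(estimate) != len(truth):
        raise ValueError("sequences must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


# Made-up numbers for illustration; they are not the lecture's results.
truth = [100.0, 200.0, 200.0]
raw = [90.0, 170.0, 185.0]
estimated = [98.0, 203.0, 199.0]

print(mse(raw, truth))        # error of the raw readout
print(mse(estimated, truth))  # error after estimation
```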
Why is this Estimation Problem Important?
• Microarrays measure expression levels of thousands of genes simultaneously
• Assume that we are taking samples at different times during a biological process
• Cluster data in the expression level space
  – relatedness in biological function often implies similarity in expression behavior (and vice versa)
  – similar expression behavior indicates co-expression
• Clustering of expression level data heavily depends on the measurements
  – better estimation may lead to different functionality conclusions
Summary
• Microarray technologies are becoming increasingly important for medicine and biology
  – understanding how the cell functions, effects on organism
  – towards diagnostics, personalized medicine
• Plenty of interesting problems
  – combinatorial design techniques
  – statistical analysis of the data
  – signal processing / estimation