EE150a – Genomic Signal and Information Processing

On DNA Microarrays Technology

October 12, 2004

Recall the information flow in cells

• Replication of DNA – {A,C,G,T} to {A,C,G,T}
• Transcription of DNA to mRNA – {A,C,G,T} to {A,C,G,U}
• Translation of mRNA to proteins – {A,C,G,U} to {20 amino acids}
• Interrupt the information flow and measure gene expression levels!
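The alphabet changes listed above can be illustrated with a toy sketch. The function names and the four-entry codon table are illustrative assumptions made here (the full genetic code has 64 codons); this is not a bioinformatics library.

```python
# Toy illustration of the alphabets in the information flow.
# All names are hypothetical; the codon table is deliberately partial.
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Four standard codon assignments, enough for the example below.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}


def complementary_strand(strand):
    """Replication: the complementary DNA strand (reverse complement)."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand))


def transcribe(coding_strand):
    """Transcription: the mRNA carries the coding sequence with T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Translation: read codons three bases at a time until a stop codon."""
    peptide = []
    for start in range(0, len(mrna) - 2, 3):
        # A codon missing from the partial table raises KeyError.
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


gene = "ATGTTTGGCTAA"
mrna = transcribe(gene)
print(complementary_strand("ATGC"), mrna, translate(mrna))
```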

http://www-stat.stanford.edu/~susan/courses/s166/central.gif

Gene Microarrays

• A medium for matching known and unknown sequences of nucleotides based on hybridization (base-pairing: A-T, C-G)
• Applications
  – identification of a sequence (gene or gene mutation)
  – determination of expression level (abundance) of genes
  – verification of computationally determined genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
  – probes, arranged in order on an array
  – targets, the unknown samples to be detected

Microarray Technologies

• Oligonucleotide arrays (Affymetrix GeneChips)
  – probes are photo-etched on a chip (20-80 nucleotides)
  – dye-labeled mRNA is hybridized to the chip
  – laser scanning is used to detect gene expression levels (i.e., amount of mRNA)
• cDNA arrays
  – complementary DNA (cDNA) sequences “spotted” on arrays (500-1000 nucleotides)
  – dye-labeled mRNA is hybridized to the chip (2 types!)
  – laser scanning is used to detect gene expression levels
• There are various hybrids of the two technologies above

Source: Affymetrix website

Oligonucleotide arrays

Source: Affymetrix website

GeneChip Architecture

Source: Affymetrix website

Hybridization

Source: Affymetrix website

Laser Scanning

Source: The Paterson Institute for Cancer Research

Sample Image

Competing Microarray Technologies

• So far we have considered oligonucleotide arrays:
  – automated, on-chip design
  – light dispersion may cause problems
  – short probes, 20-80 nucleotides
• cDNA microarrays are another technology:
  – longer probes obtained via PCR (polymerase chain reaction)
  – [sidenote: what is the optimal length?]
  – probes grown in a lab, robot printing
  – two types of targets: control and test

cDNA Microarrays

http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg

Sample cDNA Microarray Image

Some Design Issues

• Photo-etching based design: unwanted light exposure
  – border minimization
  – the probes are 20-80 nucleotides long
• Hybridization: binding of a target to its perfect complement
• However, when a probe differs from a target's complement by a small number of bases, the target may still bind
• This non-specific binding (cross-hybridization) is a source of measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a lot of control over the landscape of the probes on the array
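The distinction between perfect hybridization and cross-hybridization can be sketched by counting mismatches against a probe's perfect complement. The `max_mismatches` threshold and all function names are assumptions introduced for illustration; real binding depends on thermodynamics, not on a fixed mismatch count.

```python
# Hypothetical sketch: classify a target by its mismatch count against the
# probe's perfect (reverse) complement. The threshold is an assumption.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Sequence that base-pairs perfectly with seq (A-T, C-G)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mismatches(probe, target):
    """Number of positions where target differs from the probe's perfect complement."""
    if len(probe) != len(target):
        raise ValueError("this sketch compares equal-length sequences only")
    perfect = reverse_complement(probe)
    return sum(a != b for a, b in zip(perfect, target))


def binding_kind(probe, target, max_mismatches=2):
    """Label the probe-target pair under the assumed mismatch threshold."""
    count = mismatches(probe, target)
    if count == 0:
        return "specific"
    if count <= max_mismatches:
        return "cross-hybridization"
    return "none"


probe = "AACGTG"
for target in ("CACGTT", "CACGAT", "GTGCAA"):
    print(target, mismatches(probe, target), binding_kind(probe, target))
```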

Dealing with Measurement Noise

• Recent models of microarray noise
  – measurements reveal signal-dependent noise (i.e., shot noise) as the major component
  – additional Gaussian-like noise due to sample preparation, image scanning, etc.
• Image processing assumes image background noise
  – attempts to subtract it
  – sets up thresholds
• Lack of models of processes on microarrays

Probabilistic DNA Microarray Model

• Consider an $m \times m$ DNA microarray, with $m^2$ unique types of nucleotide probes
• A total of $N$ molecules of $n$ different types of cDNA targets, with concentrations $c_1, \dots, c_n$, is applied to the microarray
• Measurement is taken after the system has reached chemical equilibrium
• Our goal: from the scanned image, estimate the concentrations

DNA Microarray Model Cont’d

• Each target may hybridize to only one type of probe
• There are k non-specific bindings
• Model diffusion of unbound molecules by random walk; distribution of unbound molecules is uniform on the array
  – justified by reported experimental results
• Assume known probabilities of hybridization and cross-hybridization
  – Theoretically: from melting temperature
  – Experimentally: measurements (e.g., from control target samples)
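The claim that randomly walking unbound molecules end up uniformly distributed can be checked with a small simulation. This is only a sketch: the wrap-around edges, the four-neighbour moves, and the function name are assumptions made here, not details given on the slide.

```python
import random


def occupancy_fractions(m, steps, seed=0):
    """Fraction of time one random walker spends in each cell of an m x m array."""
    rng = random.Random(seed)
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    row, col = 0, 0
    visits = [[0] * m for _ in range(m)]
    for _ in range(steps):
        d_row, d_col = rng.choice(moves)
        # Wrap-around edges are an assumption that keeps the walk symmetric.
        row = (row + d_row) % m
        col = (col + d_col) % m
        visits[row][col] += 1
    return [[count / steps for count in line] for line in visits]


fractions = occupancy_fractions(m=4, steps=200_000)
uniform = 1 / 16
worst = max(abs(f - uniform) for line in fractions for f in line)
print(f"uniform share {uniform:.4f}, largest deviation {worst:.4f}")
```

With enough steps, every cell's share of visits should approach 1/m², which is the uniform distribution the model assumes for unbound molecules.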

Markov Chain Model

Modeling transitions between possible states of a target:
• one specific binding state
• k=2 non-specific bindings
• $p_n = 1 - k p_c - p_h$ is the probability that an unbound molecule remains free, where $p_h$ and $p_c$ are the hybridization and cross-hybridization probabilities
Measurement is taken after the system has reached chemical equilibrium – need to find the steady state

Markov Chain Model Cont’d

Let $\pi_i = [\pi_{i,1}\ \pi_{i,2}\ \dots\ \pi_{i,k+2}]^T$ be a vector whose components are the numbers of type-$i$ targets in each of the $k+2$ states of the Markov chain
• $\pi_{i,1}$ is the number of hybridized molecules
• $\pi_{i,j}$, $2 < j \le k+2$, is the number of cross-hybridized molecules

Note that $\sum_{j=1}^{k+2} \pi_{i,j} = c_i$ for every $i$.

Stationary State of the Markov Chain

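• In equilibrium, we want to find $\pi_i$ such that $P_i \pi_i = \pi_i$, where $P_i$ is the transition matrix of the chain [matrix entries not preserved in this transcript]
• Clearly, in the stationary state we have [equation not preserved in this transcript]
• Finally, the ratio $\pi_i / c_i$ gives the stationary state probabilities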
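A minimal numerical sketch of finding the stationary state follows. The transition matrix entries are not available in this transcript, so the structure below is an assumption: a free molecule binds specifically with probability `p_h`, binds to each of `k` non-specific sites with probability `p_c`, and stays free with probability `p_n = 1 - k*p_c - p_h`; every bound state releases with a single hypothetical probability `p_r`. The state ordering (0 = hybridized, 1 = free, 2 onward = cross-hybridized) and all numeric values are also illustrative choices.

```python
import numpy as np


def transition_matrix(p_h, p_c, k, p_r):
    """Column-stochastic matrix P[to, from] under the assumptions stated above."""
    p_n = 1.0 - k * p_c - p_h
    if p_n < 0:
        raise ValueError("p_h + k * p_c must not exceed 1")
    size = k + 2
    P = np.zeros((size, size))
    # Transitions out of the free state (index 1).
    P[0, 1] = p_h
    P[1, 1] = p_n
    P[2:, 1] = p_c
    # Every bound state either stays bound or releases back to the free state.
    for state in [0] + list(range(2, size)):
        P[state, state] = 1.0 - p_r
        P[1, state] = p_r
    return P


def stationary_distribution(P):
    """Solve P @ pi = pi together with sum(pi) = 1 in the least-squares sense."""
    size = P.shape[0]
    A = np.vstack([P - np.eye(size), np.ones((1, size))])
    b = np.zeros(size + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


P = transition_matrix(p_h=0.6, p_c=0.1, k=2, p_r=0.2)
pi = stationary_distribution(P)
c_i = 1e5  # hypothetical number of type-i target molecules
print("stationary probabilities:", np.round(pi, 4))
print("expected molecules per state:", np.round(c_i * pi))
```

For these values, detailed balance gives the same answer by hand: the hybridized share is `p_h / p_r` times the free share and each cross-hybridized share is `p_c / p_r` times the free share.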

Linear Microarray Model

• Let matrix $Q$ collect the previously obtained probabilities
• The microarray measurement model can be written as $s = Qc + w + v$, where $s$ is the vector of measured signals and $c$ the vector of concentrations
• Vector $w$ describes inherent fluctuations in the measured signal due to hybridization (shot noise)
• Binding of the $j$-type target to the $i$-type probe is a Bernoulli random variable with variance $q_{i,j}(1-q_{i,j})$ – hence the variance of $w_i$ is given by $\sum_{j=1}^{n} c_j\, q_{i,j}(1-q_{i,j})$
• Vector $v$ is comprised of iid Gaussian entries
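The measurement model can be simulated with a short sketch. It assumes the additive form s = Qc + w + v (s the readout, c the per-type molecule counts), takes the shot-noise variance of each w_i to be the sum of the independent Bernoulli variances, and approximates w by a Gaussian with that variance. The matrix `Q`, the counts `c`, and `sigma_v` below are made-up values.

```python
import numpy as np


def shot_noise_variance(Q, c):
    """Variance of each w_i: sum over j of c_j * q_ij * (1 - q_ij)."""
    return (Q * (1.0 - Q)) @ c


def simulate_readout(Q, c, sigma_v, rng):
    """Draw one readout s = Q c + w + v under the stated assumptions."""
    mean = Q @ c
    # Gaussian approximation of the binomial binding fluctuations (assumption).
    w = rng.normal(0.0, np.sqrt(shot_noise_variance(Q, c)))
    v = rng.normal(0.0, sigma_v, size=mean.shape)
    return mean + w + v


# Hypothetical 3-probe, 2-target example.
Q = np.array([[0.6, 0.1],
              [0.1, 0.6],
              [0.1, 0.1]])
c = np.array([1e5, 2e5])
sigma_v = 50.0
rng = np.random.default_rng(0)
readout = simulate_readout(Q, c, sigma_v, rng)
print("noise-free signal:", Q @ c)
print("shot-noise variance:", shot_noise_variance(Q, c))
print("simulated readout:", np.round(readout, 1))
```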

Detection of Gene Expression Levels

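• A simple estimate is obtained via the pseudo-inverse, $\hat{c} = Q^{\dagger} s$
• Maximize the a posteriori probability $p(s|c)$, which is equivalent to [optimization problem not preserved in this transcript], where the matrix appearing in it is given by [definition not preserved in this transcript]
• Optimization above readily simplifies to [expression not preserved in this transcript]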
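The pseudo-inverse estimate can be sketched in a few lines, reading it as ĉ = Q⁺ s for a readout s modelled as Q c plus noise. The matrix and counts are hypothetical; the probabilistic estimator from the slide is not reproduced here because its equations are unavailable.

```python
import numpy as np


def pinv_estimate(Q, s):
    """Least-squares estimate of c from the readout s via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(Q) @ s


# Hypothetical 3-probe, 2-target example with full column rank.
Q = np.array([[0.6, 0.1],
              [0.1, 0.6],
              [0.1, 0.1]])
c_true = np.array([1e5, 2e5])

noise_free = Q @ c_true
c_hat = pinv_estimate(Q, noise_free)
print("estimate from a noise-free readout:", np.round(c_hat, 3))
```

Without noise the estimate recovers the counts exactly when `Q` has full column rank. With noise, the pseudo-inverse weights all probes equally, which is one reason a noise-aware estimator may do better.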

Simulation Results

• Consider an $8 \times 8$ array (m=8)
• Apply n=6 types of targets
• Concentrations: [1e5 2e5 2e5 2e5 1e5 2e5] (N=1e6)
• Assume the following probabilities:
  – hybridization – 0.8
  – cross-hybridization – 0.1
  – release – 0.02
• Let k=3 (number of non-specific bindings)
• Free molecules perform random walk on the array

Simulation Results: Readout Data

Simulation Results: Estimate

Some Comments

• Adopt mean-square error (MSE) as the measure of performance
• As expected, we observe significant improvement in MSE over raw measurements
• Things to do:
  – investigate how to incorporate control sample measurements
  – modify the technique for very large microarrays (matrix inversion may be unstable)
• Experimental verification!
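The performance measure can be written as a small helper. The numbers below are placeholders chosen only to show the call; they are not results from the lecture's simulation.

```python
def mean_square_error(estimate, truth):
    """Average squared difference between estimated and true concentrations."""
    if len(estimate) != len(truth):
        raise ValueError("estimate and truth must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


# Placeholder values, purely illustrative.
truth = [1.0e5, 2.0e5]
raw_scaled = [0.8e5, 1.7e5]   # hypothetical raw readout rescaled to counts
estimated = [0.98e5, 2.01e5]  # hypothetical output of an estimator

print("MSE raw:", mean_square_error(raw_scaled, truth))
print("MSE estimated:", mean_square_error(estimated, truth))
```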

Why is this Estimation Problem Important?

• Microarrays measure expression levels of thousands of genes simultaneously
• Assume that we are taking samples at different times during a biological process
• Cluster data in the expression level space
  – relatedness in biological function often implies similarity in expression behavior (and vice versa)
  – similar expression behavior indicates co-expression
• Clustering of expression level data heavily depends on the measurements
  – better estimation may lead to different functionality conclusions
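How expression time series might be grouped by similar behavior is sketched below. This is a simple correlation-threshold grouping used as a stand-in, not a full clustering algorithm and not a method taken from the lecture; the gene names, profiles, and threshold are invented.

```python
import numpy as np


def group_by_correlation(profiles, threshold=0.9):
    """Greedily group genes whose profile correlates with a group's first member."""
    names = list(profiles)
    corr = np.corrcoef(np.array([profiles[name] for name in names]))
    groups = []  # each group is a list of indices into names
    for i in range(len(names)):
        for group in groups:
            if corr[i, group[0]] >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[names[i] for i in group] for group in groups]


# Invented expression levels sampled at four time points.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],  # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],  # falls while g1 rises
}
groups = group_by_correlation(profiles)
print(groups)
```

Because the grouping depends on the measured profiles, noisy or biased expression estimates can move a gene from one group to another, which is the slide's point about estimation quality.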

Summary

• Microarray technologies are becoming increasingly important for medicine and biology
  – understanding how the cell functions, effects on organism
  – towards diagnostics, personalized medicine
• Plenty of interesting problems
  – combinatorial design techniques
  – statistical analysis of the data
  – signal processing / estimation