A Presentation of ‘Bayesian
Models for Gene Expression
With DNA Microarray Data’ by
Ibrahim, Chen, and Gray
Presentation By Lara DePadilla
1
Goal
• To “develop a novel class of parametric
statistical models for analyzing DNA
microarray data”.
• Parametric statistical models require
making assumptions about the data, such as
assuming it follows a particular probability
law, so we start out knowing something about
its form.
2
The Goal Applied
• The researchers are trying to discover which
genes play a major role in endometrial
cancer. This knowledge can help to determine
whether the disease is inherited and to
target applicable therapies.
3
Motivation
• Determine which genes best discriminate
between different types of tissue
Why? Because of the sheer number of genes in the human
genome, we must identify which ones are relevant to our
purpose.
• Characterize gene expression patterns in
tumor tissues
Why? We must develop models to explain the patterns in
order to recognize them.
4
About Bayesian Models (Liu, p. 306)
The full process has three main steps:
1. Setting up a probability model to describe the
data. This is a joint distribution that makes use of
our prior knowledge of the subject:
Joint = Prior * Likelihood
f(y,Ө) = f(Ө) * f(y|Ө)
It must capture the elements of the scientific problem.
5
About Bayesian Models (Cont.)
2. The next step invokes Bayes' rule to obtain the posterior:
f(Ө|y) = f(Ө) * f(y|Ө) / f(y)
Now we know what we are looking for.
3. The final step is to evaluate and improve upon what we
have done.
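A minimal sketch of step 2 on a small grid of parameter values may make the rule concrete. The grid, the flat prior, and the binomial likelihood (3 successes in 4 trials) are illustrative assumptions, not part of the paper.

```python
def posterior_on_grid(prior, likelihood):
    """Apply Bayes' rule on a discrete grid: posterior = prior * likelihood / f(y)."""
    joint = [p * l for p, l in zip(prior, likelihood)]  # f(theta) * f(y | theta)
    evidence = sum(joint)                               # f(y), the normalizing constant
    return [j / evidence for j in joint]

# Hypothetical example: three candidate parameter values with a flat prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
# Likelihood of observing 3 successes in 4 Bernoulli trials at each theta.
likelihood = [4 * t**3 * (1 - t) for t in thetas]
posterior = posterior_on_grid(prior, likelihood)
```

The posterior sums to 1 and puts the most weight on the parameter value under which the observed data are most likely.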
6
Back to Our Goal Applied
Data Structure of Observations
• The array contains more than 7,000 probe sets, which are thought to
represent 5,600 genes.
• Each probe set consists of 16 – 20 perfect match and mismatch pairs.
• A perfect match is a strand of DNA that complements a specific DNA
sequence.
• A mismatch has a single mismatched base position (one base out of
approx. 25 doesn’t match).
• Using several pairs from the same gene, taken from different probes, is
more specific than is possible with a single probe.
7
Back to Our Goal Applied
More Data Preparation
• The probes are compared and normalized, resulting in a
dataset of expression levels that have atypical results
filtered out.
• After the filtering process, the data set was 14 x 3214, with
14 samples (10 cancerous, 4 normal) and 3214 genes.
8
The Model Setup: Data
• j = 1, 2 (for each tissue type)
• i = 1, …, nj (for each individual)
• nj individuals are available for tissue type j
• G genes are measured for each individual
• x represents the measured expression of a gene in the dataset
⇒ c0 is the threshold value at which a gene is considered not
expressed (and therefore not what we are seeking), so
if x = c0, it is not expressed
so,
x = c0 with probability p
x = c0 + y with probability 1 – p
where y is the level of expression
• xjig denotes the random variable
• yjig denotes the expression level
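The two-part structure of x can be sketched with a short simulation. The slides do not state the distribution of y, so the lognormal used here is an assumption, and the values of c0, p, mu and sigma are hypothetical.

```python
import random

def simulate_expression(c0, p, mu, sigma, rng):
    """Draw one x: c0 with probability p, otherwise c0 + y.

    y is drawn from a lognormal distribution (an assumption made for
    this sketch; the slides only say y is the level of expression).
    """
    if rng.random() < p:
        return c0                      # not expressed
    y = rng.lognormvariate(mu, sigma)  # strictly positive expression level
    return c0 + y

rng = random.Random(0)
c0 = 20.0  # hypothetical threshold
draws = [simulate_expression(c0, 0.3, 4.0, 0.5, rng) for _ in range(10_000)]
frac_not_expressed = sum(x == c0 for x in draws) / len(draws)
```

Roughly a fraction p of the simulated values sit exactly at c0, and every other value lies above it.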
9
The Model Setup: Likelihood
• Let δ = 1 if x = c0 (not expressed) and δ = 0 otherwise
(expressed)
• Remember the expressed/not expressed probability from
before, so there is one probability for each gene within
each tissue type:

pjg = P(xjig = c0) = P(δ = 1)
1 – pjg = P(xjig = c0 + yjig) = P(δ = 0)
• Recording whether each gene reached an expressed level
gives δ = (δ111, …, δ2,n2,G): one indicator for
each gene, for each individual, for each tissue type.
10
The Model Setup: Likelihood
• The mean expression level of each gene for both
tissue types:
μ = (μ11,…, μ2,G)
• The variance of the expression level of each gene for
both tissue types: σ² = (σ²11,…, σ²2,G)
• The probability of not being expressed for
each gene for both tissue types: p = (p11,…, p2,G)
11
The Model Setup: Likelihood
• Ө = (μ, σ², p) collects the parameters; the likelihood function of Ө is based on the data:
D = (x111,…, x2,n2,G)
• L(Ө|D) =
$$\prod_{j=1}^{2} \prod_{i=1}^{n_j} \prod_{g=1}^{G} p_{jg}^{\delta_{jig}} \, (1 - p_{jg})^{1 - \delta_{jig}} \, p(y_{jig} \mid \mu_{jg}, \sigma^2_{jg})^{1 - \delta_{jig}}$$
Interpreted: The overall likelihood is the product, over every data
point, of that point's probability under the model: pjg if the gene is
not expressed, and (1 – pjg) times the density of its expression level
yjig if it is.
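The likelihood above can be evaluated on the log scale with a few nested loops. This is a sketch only: the lognormal form of p(y | μ, σ²) is an assumption, the nested-list layout x[j][i][g] is a choice made here, and the toy numbers are hypothetical.

```python
import math

def lognormal_logpdf(y, mu, sigma2):
    """Log density of y when log(y) ~ Normal(mu, sigma2) (assumed form of p(y | mu, sigma2))."""
    return (-math.log(y) - 0.5 * math.log(2 * math.pi * sigma2)
            - (math.log(y) - mu) ** 2 / (2 * sigma2))

def log_likelihood(x, c0, p, mu, sigma2):
    """log L(theta | D) with x[j][i][g] and parameters indexed as p[j][g], mu[j][g], sigma2[j][g]."""
    total = 0.0
    for j, tissue in enumerate(x):
        for individual in tissue:
            for g, x_jig in enumerate(individual):
                if x_jig == c0:                     # delta_jig = 1: not expressed
                    total += math.log(p[j][g])
                else:                               # delta_jig = 0: expressed, y = x - c0
                    y = x_jig - c0
                    total += math.log(1 - p[j][g])
                    total += lognormal_logpdf(y, mu[j][g], sigma2[j][g])
    return total

# Hypothetical toy data: 2 tissue types, 2 genes, 2 + 1 individuals.
c0 = 20.0
x = [[[20.0, 25.0], [20.0, 30.0]], [[28.0, 20.0]]]
p = [[0.9, 0.1], [0.2, 0.8]]
mu = [[1.5, 2.0], [2.0, 1.5]]
sigma2 = [[1.0, 1.0], [1.0, 1.0]]
ll = log_likelihood(x, c0, p, mu, sigma2)
```

Working with the log of the product avoids numerical underflow when thousands of genes are multiplied together.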
12
The Posterior: Which Genes
Discriminate?
• The posterior quantity of interest is a ratio between
the average expression level for a particular gene across
subjects in cancerous tissue and the same gene
across subjects in non-cancerous tissue.
• The value of each element comprising the mean is
based on whether or not the gene for that
individual and that tissue type meets the necessary
expression level to count.
13
The Posterior: The Function
• Ψjg is the expected value, under the joint
distribution of (δ, y), of the expression
measurement for gene g in tissue type j; the
individual subjects in the data supply the
elements that go into it. The distribution
answers two questions: is the expression
level high enough to count, and if so, what
is the level?
14
The Posterior: The Function
• εg = Ψ2g / Ψ1g
• This is a ratio of the expression means between
normal and cancer tissues for all of the genes, so
there will be one distribution for each of the G
genes.
• A key summary to compute is P(εg > 1 | D),
the probability, given the data (the
individuals in the study), that the ratio
exceeds 1.
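Given posterior draws of the parameters, P(εg > 1 | D) can be estimated as the fraction of draws in which the ratio exceeds 1. In this sketch the closed form used for Ψ follows from the two-part model (x = c0 with probability p, c0 + y otherwise) together with an assumed lognormal y, so that E[y] = exp(μ + σ²/2). The function names and the four hand-written draws are hypothetical.

```python
import math

def psi(c0, p, mu, sigma2):
    """Expected value of x: p*c0 + (1 - p)*(c0 + E[y]), with E[y] = exp(mu + sigma2/2) (lognormal assumption)."""
    return c0 + (1.0 - p) * math.exp(mu + 0.5 * sigma2)

def prob_ratio_exceeds_one(draws_tissue1, draws_tissue2, c0):
    """Monte Carlo estimate of P(eps_g > 1 | D) for one gene.

    Each argument is a list of (p, mu, sigma2) posterior draws, paired by iteration.
    """
    hits = 0
    for (p1, m1, s1), (p2, m2, s2) in zip(draws_tissue1, draws_tissue2):
        eps = psi(c0, p2, m2, s2) / psi(c0, p1, m1, s1)  # eps_g = Psi_2g / Psi_1g
        hits += eps > 1.0
    return hits / len(draws_tissue1)

# Hypothetical posterior draws for a single gene.
draws1 = [(0.2, 3.0, 0.5), (0.3, 3.1, 0.4), (0.25, 2.9, 0.6), (0.2, 3.2, 0.5)]
draws2 = [(0.1, 3.5, 0.5), (0.2, 3.0, 0.4), (0.3, 2.8, 0.6), (0.1, 3.4, 0.5)]
prob = prob_ratio_exceeds_one(draws1, draws2, c0=20.0)
```

In practice the draws would come from the sampler, and the estimate would be computed once per gene.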
15
Priors
• The purpose of the prior in this situation is
to create a correlation between the genes for
a given individual
• The priors are hierarchical: there are different
priors for different parameters, and some
parameters of interest are incorporated into
other priors. In some cases, the values are
based on information from the data.
16
Gene Selection: Applying the
Posterior
• Compute the posterior probability P(εg > 1 | D) for g = 1, …, G
• Compare each of these probabilities to a threshold γ. This
threshold might be .9, .8, .7, etc.
• Once the threshold has established which genes are
different enough between tissues, develop a
sub-model of the genes that describes which are
different and which are not.
• Different levels of γ will create different sub-models.
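The selection step can be sketched as a simple filter on the per-gene posterior probabilities. This is a one-sided reading of the rule as described on the slide; the function name and the probabilities are hypothetical.

```python
def select_genes(post_probs, gamma):
    """Return the indices of genes whose P(eps_g > 1 | D) is at least gamma."""
    return [g for g, prob in enumerate(post_probs) if prob >= gamma]

# Hypothetical posterior probabilities for five genes.
post_probs = [0.97, 0.85, 0.72, 0.40, 0.91]

# Each threshold yields its own sub-model (set of genes declared different).
submodels = {gamma: select_genes(post_probs, gamma) for gamma in (0.9, 0.8, 0.7)}
```

Lowering γ can only add genes to the selected set, which is why the sub-models are nested.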
17
Back to Bayes: Step 3
• Step 3 was to evaluate our process. In this
case, we use the L measure to evaluate the
sub-models.
• The model with the smallest L measure is
the best-fitting model
• It assesses goodness of fit based on:
⇒ how well the model predictions compare to the
observed data
⇒ the variability of the predictions
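The two ingredients listed above can be combined in a short sketch. The slides do not give the formula, so the form used here, the sum of predictive variances plus a weight ν times the sum of squared differences between predictive means and observations, is an assumption, as are the function name and the default ν.

```python
from statistics import fmean, pvariance

def l_measure(observed, predictive_draws, nu=0.5):
    """Assumed form: sum_i Var(z_i | D) + nu * sum_i (E[z_i | D] - x_i)^2.

    predictive_draws[i] holds posterior predictive draws z_i for observation x_i.
    """
    variance_term = sum(pvariance(z) for z in predictive_draws)
    fit_term = sum((fmean(z) - x) ** 2 for z, x in zip(predictive_draws, observed))
    return variance_term + nu * fit_term

# Hypothetical example with two observations and two predictive draws each.
observed = [1.0, 2.0]
predictive_draws = [[0.0, 2.0], [3.0, 3.0]]
value = l_measure(observed, predictive_draws)
```

A smaller value means the predictions are both closer to the data and less variable.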
18
Sampling From the Posterior:
Gibbs Sampler
• Generating the mean expression levels for each
type of tissue for each gene requires the
parameters μ and σ²
• The Gibbs sampler makes use of conditional
distributions; in our case these stem from the
priors.
• The algorithm will ultimately yield μ, σ², b0, μ0, e,
and u0 for each tissue type. All but μ and σ² are integrated
out, and the resulting μ and σ² can be passed into the
posterior equation.
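The idea of alternating between conditional distributions can be shown on a much simpler model than the paper's. The sketch below is not the authors' sampler: it draws μ and σ² for a single normal sample, with an assumed normal prior on μ and an assumed inverse-gamma prior on σ². All hyperparameter values and names are hypothetical.

```python
import random
from statistics import fmean

def gibbs_normal(y, n_iter, rng, m0=0.0, s0_sq=100.0, a0=2.0, b0=1.0):
    """Gibbs sampler for y_i ~ Normal(mu, sigma2), mu ~ Normal(m0, s0_sq), sigma2 ~ InvGamma(a0, b0)."""
    n = len(y)
    ybar = fmean(y)
    mu, sigma2 = ybar, 1.0  # starting values
    draws = []
    for _ in range(n_iter):
        # Draw mu from its conditional given sigma2 (normal).
        precision = 1.0 / s0_sq + n / sigma2
        mean = (m0 / s0_sq + n * ybar / sigma2) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        # Draw sigma2 from its conditional given mu (inverse gamma).
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * sum((v - mu) ** 2 for v in y)
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale
        draws.append((mu, sigma2))
    return draws

rng = random.Random(1)
y = [rng.gauss(5.0, 2.0) for _ in range(200)]  # simulated data, true mu = 5, sigma2 = 4
draws = gibbs_normal(y, 2000, rng)[500:]       # discard the first 500 draws as burn-in
post_mean_mu = fmean(m for m, _ in draws)
post_mean_sigma2 = fmean(s for _, s in draws)
```

Each pass updates one parameter while holding the other fixed; the retained draws approximate the joint posterior of (μ, σ²).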
19
The Results: Table 2
Number of genes to be declared different
based on Several Choices of
Hyperparameters and Various Choices of γ0
         η0, d0, k0, h0
γ0       .01     .02     .05     .10
.95      178     167     154     115
.90      316     290     271     283
.80      695     668     629     674
.70      1350    1266    1191    1209
20
The Results: Table 3
• Number of genes to be declared different
based on Several Choices of Hyperparameters
η0, d0, k0, h0    Mean (Normal Tissue)    Mean (Cancer Tissue)    L measure
.01               6.08                    6.013                   180,837
.02               6.08                    6.015                   177,047
.05               6.08                    6.016                   167,057
.10               6.08                    6.016                   155,624
21
The Results: Tables 4 & 5
• This compares nonparametric results (based on no prior knowledge of the
parameters in the distribution, i.e. the μ and σ² that we got from our priors)
with the results of our algorithm
• Table 4 compares genes identified using informative priors, and Table
5 compares genes identified using moderate (less informative) priors
• The percentages are the posterior probabilities—this would correspond
to the thresholds.
• The sum is the number of genes that overlapped—we can see that the
lower the threshold, the more genes overlap.
• Comparing Table 4 to Table 5, we can see that a less informative prior
will result in more genes overlapping (which supports the result of
analyzing the L statistic in Table 3).
22
The Results: Table 6
• That is not to say that more genes passing the test (of being able to help
distinguish cancerous tissue from non-cancerous tissue) is better; the
threshold applies more discretion in declaring a gene different, and the L
statistic tells us the goodness of fit. We need both.
Criterion          Full Model   γ = 70%   γ = 80%   γ = 90%   PERMAX
L measure          98,305       97,932    98,905    102,017   110,809
# of diff genes    3,214        2,055     1,505     1,004     47
23
24
The Results: Table 7
• Using the Full Model (i.e., no threshold), change how informative the
prior is and compare the resulting L measure
(η0, d0, k0, h0)     L measure
(1,1,1,1)            116,246
(10,10,10,10)        101,326
(20,20,20,20)        99,699
(100,100,20,20)      99,690
(50,50,50,50)        98,307
(20,20,50,50)        98,307
(100,100,50,50)      98,307
(10,10,50,50)        98,305
25
Conclusion
• Apply a Gibbs sampler to sample from a
hierarchical class of prior distributions
• Use the results to sample from the posterior
distribution and produce a summary of the results
that describes how likely the gene is to be different
based on tissue type.
• Use thresholds to decide which genes are different
enough to make a model of genes that can be
applied to this problem.
• Assess the model with the L measure to check the
goodness of fit.
26
Bibliography
• Ibrahim, Chen, and Gray, ‘Bayesian Models for Gene Expression
With DNA Microarray Data’, Journal of the American
Statistical Association, Mar 2002; 97, 457
• Liu, Monte Carlo Strategies in Scientific
Computing, Springer-Verlag New York,
Inc., 2001
27