Markov Chain Monte Carlo and Gibbs Sampling Vasileios Hatzivassiloglou

Download Report

Transcript Markov Chain Monte Carlo and Gibbs Sampling Vasileios Hatzivassiloglou

Markov Chain Monte Carlo
and Gibbs Sampling
Vasileios Hatzivassiloglou
University of Texas at Dallas
Markov chains
• A subtype of random walks (not
necessarily uniform) where the entire
memory of the system is contained in the
current state
• Described by a transition matrix P, where
pij is the probability of going from state i to
state j
• Very useful for describing stochastic
discrete systems
2
Markov chain example
pCC
C
pAC
pTC
pCA pCG
pAT
pCT
A
pAA
T
pTA
pGA
pGC pGT
pAG
pTG
pTT
G
pGG
3
Example application of Markov
chains
• Markov chains model dependencies
across time or position
• The model assigns a probability to each
sequence of observed data and can be
used to measure how likely an observed
sequence is to follow it
• The GeneMark algorithm uses 5th-order
Markov chains (why?) to find genes
(distinguishing them from other regions in
the DNA)
4
Markov Chains – Stationary
Distribution
• Under general assumptions (irreducibility
and aperiodicity), Markov chains have a
stationary distribution π, the limit of Pk as k
goes to infinity
• An irreducible MC has no identical states
• An aperiodic MC can reach any state from
any other state
• Such a Markov chain is called ergodic
5
Markov Chain Monte Carlo
• Used as a means to guide the selection of
samples
• We want the stationary distribution of the
Markov chain to be the distribution we are
sampling from
• In many cases, this is relatively easy to do
6
Gibbs sampling
• A special case of MCMC where the
conditional probabilities on specific
variables can be calculated easily, but the
joint probability must be sampled from
7
Gibbs sampling in our problem
• Start with a single candidate S={S1,...,Sk}
where each Si chosen randomly and
uniformly from input sequence i
• Calculate A and D(A||B) for S
• Choose one member of S randomly to
remove
• Choose an alternative (from the
corresponding sequence) with probability
proportional to the corresponding D(A||B)
• Repeat until D(A||B) converges
8
Exploring alternative strings
• When we replace a string from sequence i
– We examine in turn each of the m-n+1 strings
that that sequence could offer
– For each such string, we add it temporarily to
S and calculate the new A and D(A||B)
– Then we assign to each string Sij (j varies
across these strings) probability D ( Aj || B )
 D( A
r
|| B )
r
• May pick a “worse” string, or the same
string we just removed
9
Gibbs sampler convergence
• Return best S seen across all iterations
(may not be the last one)
• Stop after a fixed number of iterations, or
when D(A||B) does not change very much
• Solution is sensitive to the starting S, so
we typically run the algorithm several
(thousand) times from different starting
points
10
Complexity of Gibbs sampler
• Construct initial S and calculate A and
D(A||B) in O(kn) time
• Each iterative step takes O(n) time to
remove a string and recalculate D(A||B),
O(mn) time to calculate the probabilities of
the m-n+1 alternatives
• Total time is O(mnd) where d is the
number of iteration steps (dm>>k),
multiplied by the number of random
11
restarts
Why Gibbs sampling works
• Retains elements of the greedy approach
– weighing by relative entropy makes likely to
move towards locally better solutions
• Allows for locally bad moves with a small
probability, to escape local maxima
12
Variations in Gibbs sampling
• Discard substrings non-uniformly (weighed
by relative entropy, analogous to
subsequent selection of new string)
• Use simulated annealing to reduce chance
of making a bad move (and gradually
ensure convergence)
13
Annealing
• Annealing is a process in metallurgy for
improving metals by increasing crystal size
and reducing defects
• The process works by heating the metal
and controlled cooling which lets the
atoms go through a series of states with
gradually lower internal energy
• The goal is to have the metal settle in a
configuration with lower than the original
internal energy
14
Simulated Annealing
• Simulated annealing (SA) adopts an
energy function equal to the function we
want to minimize
• Transitions between neighboring states
are adopted with probability specified by a
function f of the energy gain ΔE=Enew-Eold
and the temperature T
• f(ΔE,T)>0 even if ΔE>0, but as T→0,
f(ΔE,T)→1 if ΔE<0 (or to cΔE for some
15
constant c) and f(ΔE,T)→0 for ΔE>0
Simulated Annealing
• Original f (from the Metropolis-Hasting
algorithm)
if E  0
1,

E
f ( E , T )    T
e , if E  0
• T controls acceptance of locally bad
solutions
• The annealing schedule is a process for
gradually reducing T so that eventually
only good moves are accepted
16
Special cases
• If T is always zero,
– simulated annealing reduces to greedy local
optimization
• If T is constant but non-zero,
– simulated annealing reduces to the process
we described for Gibbs sampling (choose
solutions randomly with probability
proportional to their improvement)
17