Markov chains/HMMs/DBNs - Indiana University Computer Science
CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Temporal sequences: Hidden Markov Models and
Dynamic Bayesian Networks
MOTIVATION
Observing a stream of data
Monitoring (of people,
computer systems, etc)
Surveillance, tracking
Finance & economics
Science
Questions:
Modeling & forecasting
Unobserved variables
TIME SERIES MODELING
Time occurs in steps t=0,1,2,…
Time step can be seconds, days, years, etc
State variable Xt, t=0,1,2,…
For partially observed problems, we see
observations Ot, t=1,2,… and do not see the X’s
X’s are hidden variables (aka latent variables)
MODELING TIME
Arrow of time
Causes
Effects
Causality => Bayesian networks are natural
models of time series
PROBABILISTIC MODELING
For now, assume fully observable case
[Diagram: candidate Bayesian networks over X0, X1, X2, X3. What parents should each Xt have?]
MARKOV ASSUMPTION
Assume Xt+k is independent of all Xi for i<t, given the k most recent states:
P(Xt+k | X0,…,Xt+k-1) = P(Xt+k | Xt,…,Xt+k-1)
k-th order Markov chain
[Diagrams: chains over X0,…,X3 of order 0 (no edges), order 1 (edges Xt-1 to Xt), order 2, and order 3]
1ST ORDER MARKOV CHAIN
MCs of order k>1 can be converted into a 1st-order MC on the variable Yt = {Xt,…,Xt+k-1}
So w.l.o.g., “MC” refers to a 1st-order MC
[Diagram: a 2nd-order chain over X0,…,X4 converted to a 1st-order chain over Y0,…,Y3]
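The conversion for k=2 can be sketched in a few lines of Python. Everything below (the two states "r"/"s" and the probabilities) is invented for illustration; the point is only the mechanics of stacking states into pairs Yt = {Xt, Xt+1}.

```python
states = ["r", "s"]
# p2[(a, b)][c] = P(X_t+1 = c | X_t-1 = a, X_t = b)  (made-up numbers)
p2 = {
    ("r", "r"): {"r": 0.9, "s": 0.1},
    ("r", "s"): {"r": 0.4, "s": 0.6},
    ("s", "r"): {"r": 0.7, "s": 0.3},
    ("s", "s"): {"r": 0.2, "s": 0.8},
}

pairs = [(a, b) for a in states for b in states]
# pY[y_prev][y_next] = P(Y_t = y_next | Y_t-1 = y_prev).  The pair chain
# is 1st-order: (a, b) can only move to (b, c), with probability P(c | a, b).
pY = {}
for (a, b) in pairs:
    pY[(a, b)] = {}
    for (b2, c) in pairs:
        pY[(a, b)][(b2, c)] = p2[(a, b)][c] if b2 == b else 0.0

# Each row of the new chain is still a valid distribution.
for y in pairs:
    assert abs(sum(pY[y].values()) - 1.0) < 1e-12
print(pY[("r", "s")][("s", "r")])  # = P(X_t+1 = r | X_t-1 = r, X_t = s) → 0.4
```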
INFERENCE IN MC
What independence relationships can we read
from the BN?
[Diagram: chain X0 → X1 → X2 → X3]
Observe X1
X0 independent of X2, X3, …
P(Xt|Xt-1) known as transition model
INFERENCE IN MC
Prediction: the probability of future state?
P(Xt) = Σx0,…,xt-1 P(X0,…,Xt)
= Σx0,…,xt-1 P(X0) Πi=1,…,t P(Xi|Xi-1)
= Σxt-1 P(Xt|Xt-1) P(Xt-1)
[Recursive approach]
Approach: maintain a belief state bt(X)=P(Xt), use
above equation to advance to bt+1(X)
Equivalent to VE algorithm in sequential order
BELIEF STATE EVOLUTION
P(Xt) = Σxt-1 P(Xt|Xt-1) P(Xt-1)
“Blurs” over time, and (typically) approaches a
stationary distribution as t grows
Limited prediction power
Rate of blurring known as mixing time
STATIONARY DISTRIBUTIONS
For discrete variables Val(X)={1,…,n}:
Transition matrix Tij = P(Xt=i|Xt-1=j)
Belief bt(X) is just a vector bt,i=P(Xt=i)
Belief update equation: bt+1 = T*bt
A stationary distribution b is one in which b = Tb
=> b is an eigenvector of T with eigenvalue 1
=> b is in the null space of (T-I)
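Both characterizations can be checked numerically. Below is a minimal pure-Python sketch with a made-up 2-state transition matrix (columns are the conditional distributions, Tij = P(Xt=i|Xt-1=j)): repeated belief updates bt+1 = T bt converge to the stationary distribution, and that fixed point satisfies b = Tb.

```python
T = [[0.9, 0.5],
     [0.1, 0.5]]

def step(b):
    # Belief update: b_{t+1} = T b_t
    return [sum(T[i][j] * b[j] for j in range(2)) for i in range(2)]

b = [1.0, 0.0]          # start certain that X0 = state 0
for _ in range(50):     # repeated updates blur toward the fixed point
    b = step(b)

# For this T the stationary distribution is (5/6, 1/6).
print([round(x, 4) for x in b])  # → [0.8333, 0.1667]
```

The geometric convergence rate is governed by the second eigenvalue of T (here 0.4), which is one way to make the "mixing time" of the previous slide concrete.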
HISTORY DEPENDENCE
In Markov models, the state must be chosen so
that the future is independent of history given
the current state
Often this requires adding variables that cannot
be directly observed
Are these people walking toward
you or away from you?
What comes next?
PARTIAL OBSERVABILITY
Hidden Markov Model (HMM)
[Diagram: hidden state variables X0 → X1 → X2 → X3; observed variables O1, O2, O3 below X1, X2, X3]
P(Ot|Xt) called the observation
model (or sensor model)
INFERENCE IN HMMS
Filtering
Prediction
Smoothing, aka hindsight
Most likely explanation
[Diagram: HMM X0,…,X3 with observations O1, O2, O3]
INFERENCE IN HMMS
Filtering
Prediction
Smoothing, aka hindsight
Most likely explanation
[Diagram: filtering, where the query variable is the most recent state, given O1, O2]
FILTERING
Name comes from signal processing
P(Xt|o1:t) = Σxt-1 P(xt-1|o1:t-1) P(Xt|xt-1,ot)
P(Xt|Xt-1,ot) = P(ot|Xt-1,Xt) P(Xt|Xt-1) / P(ot|Xt-1)
= α P(ot|Xt) P(Xt|Xt-1)
FILTERING
P(Xt|o1:t) = α Σxt-1 P(xt-1|o1:t-1) P(ot|Xt) P(Xt|xt-1)
Forward recursion
If we keep track of belief state bt(X) = P(Xt|o1:t)
=> O(|Val(X)|^2) updates for each t!
Query variable
X0
X1
X2
O1
O2
PREDICT-UPDATE INTERPRETATION
Given old belief state bt-1(X)
Predict: first compute the MC update
bt’(Xt) = P(Xt|o1:t-1) = Σx bt-1(x) P(Xt|Xt-1=x)
Update: re-weight to account for the observation
probability, then normalize:
bt(x) = α bt’(x) P(ot|Xt=x)
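The predict-update cycle above fits in a few lines of Python. This sketch uses the standard textbook "umbrella world" HMM (rain as the hidden state, umbrella sightings as observations), which is not from these slides but keeps the numbers checkable.

```python
T = {True: {True: 0.7, False: 0.3},    # P(Rain_t | Rain_t-1)
     False: {True: 0.3, False: 0.7}}
O = {True: 0.9, False: 0.2}            # P(Umbrella | Rain_t)

def filter_step(b, umbrella):
    # Predict: b'(x) = sum_x' P(X_t = x | X_t-1 = x') b(x')
    bp = {x: sum(T[xp][x] * b[xp] for xp in b) for x in b}
    # Update: re-weight by the observation likelihood, then normalize
    w = {x: bp[x] * (O[x] if umbrella else 1 - O[x]) for x in bp}
    z = sum(w.values())
    return {x: w[x] / z for x in w}

b = {True: 0.5, False: 0.5}            # uniform prior over Rain_0
for obs in [True, True]:               # umbrella seen on days 1 and 2
    b = filter_step(b, obs)
print(round(b[True], 3))               # → 0.883, the textbook value
```

Each call touches every (x', x) pair, which is exactly the O(|Val(X)|^2) per-step cost noted above.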
INFERENCE IN HMMS
Filtering
Prediction
Smoothing, aka hindsight
Most likely explanation
PREDICTION
P(Xt+k|o1:t)
2 steps: P(Xt|o1:t), then P(Xt+k|Xt)
Filter to time t, then predict as with standard
MC
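A sketch of this two-step recipe, reusing the umbrella-world transition model from the filtering example (an assumption, not part of these slides): after filtering to time t, each further step is a plain Markov-chain update with no new observations, so the prediction drifts toward the stationary distribution, illustrating the limited prediction power mentioned earlier.

```python
T = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}

def mc_step(b):
    # One Markov-chain update (no observation): P(X_t+1) from P(X_t)
    return {x: sum(T[xp][x] * b[xp] for xp in b) for x in b}

b = {True: 0.883, False: 0.117}   # assumed filtered belief P(Rain_t | o_1:t)
for _ in range(20):               # predict 20 steps ahead
    b = mc_step(b)
print(round(b[True], 3))          # → 0.5, the stationary value for this T
```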
INFERENCE IN HMMS
Filtering
Prediction
Smoothing, aka hindsight
Most likely explanation
SMOOTHING
P(Xk|o1:t) for k < t
P(Xk|o1:k,ok+1:t)
= P(ok+1:t|Xk,o1:k)P(Xk|o1:k)/P(ok+1:t|o1:k)
= a P(ok+1:t|Xk)P(Xk|o1:k)
Standard filtering to time k
SMOOTHING
Computing P(ok+1:t|Xk)
P(ok+1:t|Xk) = Σxk+1 P(ok+1:t|Xk,xk+1) P(xk+1|Xk)
= Σxk+1 P(ok+1:t|xk+1) P(xk+1|Xk)
= Σxk+1 P(ok+2:t|xk+1) P(ok+1|xk+1) P(xk+1|Xk)
Backward recursion
[Diagram: unrolled HMM X0,…,X3 with O1,…,O3. Given prior states, what’s the probability of this observation sequence?]
INTERPRETATION
Filtering/prediction: equivalent to forward variable elimination / belief propagation
Smoothing: equivalent to forward VE/BP up to the query variable, then backward VE/BP from the last observation back to the query variable
Running BP to completion gives the smoothed estimates for all variables (forward-backward algorithm)
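Combining the forward (filtered) message with the backward message gives the smoothed estimate. The sketch below reuses the umbrella-world numbers from the filtering example (again an assumption, not from these slides) to smooth P(Rain_1 | u_1, u_2).

```python
T = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
O = {True: 0.9, False: 0.2}

def likelihood(x, u):
    return O[x] if u else 1 - O[x]

def forward(b, u):
    # Standard predict-update filtering step
    bp = {x: sum(T[xp][x] * b[xp] for xp in b) for x in b}
    w = {x: bp[x] * likelihood(x, u) for x in bp}
    z = sum(w.values())
    return {x: w[x] / z for x in w}

def backward(msg, u):
    # Backward recursion: P(o_k+1:t | x_k) =
    #   sum_x' P(o_k+2:t | x') P(o_k+1 | x') P(x' | x_k)
    return {x: sum(msg[xn] * likelihood(xn, u) * T[x][xn] for xn in msg)
            for x in msg}

obs = [True, True]                             # umbrella on days 1 and 2
f = forward({True: 0.5, False: 0.5}, obs[0])   # filtered P(Rain_1 | u_1)
bmsg = backward({True: 1.0, False: 1.0}, obs[1])
s = {x: f[x] * bmsg[x] for x in f}
z = sum(s.values())
print(round(s[True] / z, 3))                   # → 0.883: smoothed P(Rain_1 | u_1:2)
```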
INFERENCE IN HMMS
Filtering
Prediction
Smoothing, aka hindsight
Most likely explanation
Subject of next lecture
Query returns a path
through state space
x0,…,x3
APPLICATIONS OF HMMS IN NLP
Speech recognition
Hidden phones
(e.g., ah eh ee th r)
Observed, noisy acoustic
features (produced by signal
processing)
PHONE OBSERVATION MODELS
[Diagram: Phonet → Featurest; signal processing turns the raw signal into feature vectors, e.g. (24,13,3,59)]
Model defined to be robust over variations in accent, speed, pitch, noise
PHONE TRANSITION MODELS
[Diagram: Phonet → Phonet+1, with Featurest observed]
Good models will capture (among other things):
Pronunciation of words
Subphone structure
Coarticulation effects
Triphone models = order 3 Markov chain
WORD SEGMENTATION
Words run together when
pronounced
Unigrams P(wi)
Bigrams P(wi|wi-1)
Trigrams P(wi|wi-1,wi-2)
Random 20 word samples from R&N using N-gram models
Unigram sample: “logical are as confusion a may right tries agent goal the was diesel more object then informationgathering search is”
Bigram sample: “planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate”
Trigram sample: “planning and scheduling are integrated the success of naïve bayes model is just a possible prior source by that time”
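A bigram model scores a word sequence by chaining conditional probabilities, P(w1) Πi P(wi|wi-1). A toy sketch with invented words and probabilities (nothing here comes from the R&N corpus):

```python
# Tiny made-up bigram language model
unigram = {"planning": 0.4, "and": 0.3, "scheduling": 0.3}
bigram = {("planning", "and"): 0.5,
          ("and", "scheduling"): 0.6}

def sequence_prob(words):
    # P(w_1) * prod_i P(w_i | w_i-1), with a tiny floor for unseen pairs
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), 1e-6)
    return p

print(round(sequence_prob(["planning", "and", "scheduling"]), 2))  # → 0.12
```

Note a bigram model is just a 1st-order Markov chain over words, which is why its samples read locally (but not globally) coherent.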
WHAT ABOUT MODELS WITH MANY
VARIABLES?
Say X has n binary variables, O has m binary variables
Naively, a distribution over Xt may be intractable to represent (2^n entries)
Transition models P(Xt|Xt-1) require 2^(2n) entries
Observation models P(Ot|Xt) require 2^(n+m) entries
Is there a better way?
EXAMPLE: FAILURE DETECTION
Consider a battery meter sensor
Battery = true level of battery
BMeter = sensor reading
Transient failures: send garbage at time t
Persistent failures: send garbage forever
EXAMPLE: FAILURE DETECTION
Consider a battery meter sensor
Battery = true level of battery
BMeter = sensor reading
Transient failures: send garbage at time t
5555500555…
Persistent failures: sensor is broken
5555500000…
DYNAMIC BAYESIAN NETWORK
Template model relates variables on prior time
step to the next time step (2-TBN)
“Unrolling” the template for all t gives the
ground Bayesian network
[Diagram: 2-TBN with Batteryt-1 → Batteryt → BMetert]
BMetert ~ N(Batteryt, σ)
DYNAMIC BAYESIAN NETWORK
BMetert ~ N(Batteryt, σ)
Transient
failure model
P(BMetert=0 | Batteryt=5) = 0.03
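The transient failure model can be sketched as a mixture observation likelihood: with small probability the meter reads 0 (garbage), otherwise it is Gaussian around the true level. The σ value is assumed (the slides give only P(BMetert=0 | Batteryt=5) = 0.03), and mixing a point mass at 0 with a density is for illustration only.

```python
import math

SIGMA = 0.1          # assumed sensor noise
P_TRANSIENT = 0.03   # from the slide

def gaussian(x, mu, sigma):
    # N(mu, sigma) density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def meter_likelihood(reading, battery):
    # Mixture: transient zero-reading spike + Gaussian sensor noise
    spike = P_TRANSIENT if reading == 0 else 0.0
    return spike + (1 - P_TRANSIENT) * gaussian(reading, battery, SIGMA)

print(round(meter_likelihood(0, 5), 3))   # → 0.03: the spike dominates at 0
print(round(meter_likelihood(5, 5), 2))   # Gaussian peak at the true level
```

Because a zero reading is vastly more likely under the spike than under the Gaussian tail, a filter using this model barely moves its battery estimate on an isolated 0, which is the behavior shown on the next slide.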
RESULTS ON TRANSIENT FAILURE
Meter reads 55555005555…
[Plot: E(Batteryt) over time, with and without the transient failure model; transient failure marked]
RESULTS ON PERSISTENT FAILURE
Meter reads 5555500000…
[Plot: E(Batteryt) over time with the transient failure model only; persistent failure marked]
PERSISTENT FAILURE MODEL
[Diagram: 2-TBN adding Brokent-1 → Brokent; BMetert depends on Batteryt and Brokent]
BMetert ~ N(Batteryt, σ)
P(BMetert=0 | Batteryt=5) = 0.03
P(BMetert=0 | Brokent) = 1
RESULTS ON PERSISTENT FAILURE
Meter reads 5555500000…
[Plot: E(Batteryt) over time, persistent failure model vs. transient-only model; persistent failure marked]
HOW TO PERFORM INFERENCE ON
DBN?
Exact inference on “unrolled” BN
E.g. Variable Elimination
Typical order: eliminate variables in sequential time-step order, so the full network is never actually constructed
Unrolling is done only implicitly
[Diagram: unrolled network over Br0,…,Br4, Ba0,…,Ba4, BM1,…,BM4]
ENTANGLEMENT PROBLEM
After n time steps, all variables in the belief state become dependent!
Unless the 2-TBN can be partitioned into disjoint subsets (rare)
Sparsity structure is lost
APPROXIMATE INFERENCE IN DBNS
Limited history updates
Assumed factorization of belief state
Particle filtering
INDEPENDENT FACTORIZATION
Idea: assume belief state P(Xt) factors across
individual attributes P(Xt) = P(X1,t)*…*P(Xn,t)
Filtering: only maintain factored distributions
P(X1,t|O1:t),…,P(Xn,t|O1:t)
Filtering update: P(Xk,t|O1:t) = Σxt-1 P(Xk,t|Ot,xt-1) P(xt-1|O1:t-1)
= a marginal probability query over the 2-TBN
[Diagram: 2-TBN over factored state X1,t,…,Xn,t and observations O1,t,…,Om,t]
NEXT TIME
Viterbi algorithm
Read K&F 13.2 for some context
Kalman and particle filtering
Read K&F 15.3-4