Active Learning
as
Active Inference
Brigham S. Anderson
www.cs.cmu.edu/~brigham
[email protected]
School of Computer Science
Carnegie Mellon University
Copyright © 2006, Brigham S. Anderson
OUTLINE
• New Active Inference Algorithm
• Active Learning
• Background
• Application of new algorithm
• Example application to Hidden Markov Models
• Active sequence selection for Hidden Markov Model
learning
2
Rain Tomorrow
Who will win American Idol?
NP = P
Left iron on
3
I will answer
one question.
Choose a
node.
Wow! uh…
?
Oracle
Rain tomorrow?
NP = P?
Is the iron on?
Do I have cancer?
NIPS acceptance?
Today’s Lotto numbers?
etc…
4
Active Inference
Given:
1. Set of target nodes: X
2. Set of query nodes: Y
3. Probabilistic model: P(X,Y)
4. Uncertainty function: uncertainty(X)

Problem:
Choose a node in Y to observe in order to minimize uncertainty(P(X))

Why is this difficult?
…for every Y, we must evaluate uncertainty({Xi} | Y)
5
Why is this useful?
Diagnosis,
Active Learning,
Optimization,
…
How do we quantify
“uncertainty” of a node?
6
Example
You have the following model of your Cancer state (Cancer → TestA, Cancer → TestB):

P(Cancer):
  P(no)  = 0.95
  P(yes) = 0.05

P(TestA|Cancer):
  P(pos|no) = 0.50    P(pos|yes) = 0.99
  P(neg|no) = 0.50    P(neg|yes) = 0.01

P(TestB|Cancer):
  P(pos|no) = 0.01    P(pos|yes) = 0.50
  P(neg|no) = 0.99    P(neg|yes) = 0.50
7
Example
• Your uncertainty about P(Cancer) is "bad"
• How can we quantify the badness?
(Same Cancer / TestA / TestB model as on the previous slide.)
8
The Uncertainty Function
Obvious candidates for Uncertainty:
• Entropy
• Variance
• Misclassification risk
[Figure: two distributions P(L) over L: a flat one with high entropy, high variance, and high misclassification risk, and a peaked one with low entropy, low variance, and low misclassification risk.]
9
Notation
• Given that you have not had any tests yet, what is your P(Cancer)?

P(Cancer) → φ_Cancer = [0.95, 0.05]^T

In general,
P(X) → φ_X = [p_1, p_2, …, p_k]^T

(Same Cancer / TestA / TestB model as before.)
10
Uncertainty
• Entropy: −Σ_i p_i log p_i  ("How surprised will I be?")
• Expected Misclassification: 1 − max(p_0, p_1, …, p_k)  ("How often will I be wrong if I guess the most likely?")
• "Gini": 1 − Σ_i p_i² = 1 − φ^T φ  ("How often will I be wrong if I guess probabilistically?")  ← proposed uncertainty measure
11
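For concreteness, here is a small Python sketch (mine, not from the talk) that evaluates the three candidate uncertainty measures on the P(Cancer) = [0.95, 0.05] distribution from the running example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, -sum p_i log p_i (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def expected_misclassification(p):
    """1 - max p_i: error rate if we always guess the most likely value."""
    return 1.0 - np.max(p)

def gini(p):
    """1 - sum p_i^2 = 1 - phi^T phi: error rate if we guess probabilistically."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p @ p

phi_cancer = np.array([0.95, 0.05])   # P(Cancer = no), P(Cancer = yes)
print("entropy          :", entropy(phi_cancer))
print("misclassification:", expected_misclassification(phi_cancer))
print("gini             :", gini(phi_cancer))
```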
Uncertainty Functions for P(Cancer)
[Figure: Entropy, Gini, and Misclassification plotted as functions of P(Cancer=yes).]
12
The ALARM network
13
Active Inference Performance on the ALARM Network
[Figure: 0/1 misclassification error and negative log likelihood vs. number of queries (1-37), comparing Random, Info Gain (entropy), Expected Misclassification, and Gini query selection.]
14
Active Inference Performance on Randomly Generated Networks
[Figure: performance vs. number of queries, comparing Random, Info Gain, and Gini query selection.]
15
Some Nice Gini Properties
• For multinomials, minimizing the Gini (1 − Σ_i p_i²) minimizes the sum of the eigenvalues of the covariance matrix.
• Can incorporate misclassification costs naturally: use φ^T W φ, where W is a matrix of misclassification costs.
16
GINI Active Inference Problem
Given:
1. Set of target nodes: X
2. Set of query nodes: Y
3. Probabilistic model: P(X,Y)
4. Uncertainty function: gini(X)

Problem:
Find the one node in Y expected to minimize gini(X), where

gini(X) ≡ gini(X_1) + gini(X_2) + ⋯ + gini(X_m)
        = m − (φ_X1^T φ_X1 + φ_X2^T φ_X2 + ⋯ + φ_Xm^T φ_Xm)

since gini(X_i) = 1 − φ_Xi^T φ_Xi.

Can do it in O(N) for polytrees (Anderson & Moore, 2005)
17
Polytrees
18
Example Problem
Given:
1. Target node: Cancer
2. Observable nodes: {TestA, TestB}
3. Probabilistic model: P(Cancer, TestA, TestB)
4. Uncertainty function: gini(Cancer)

(Network: Cancer → TestA, Cancer → TestB)

Problem:
Choose the test expected to minimize gini(Cancer) if we perform it
19
• In order to know how a test will affect our
P(Cancer), we need to know the conditional
probabilities between the test results and Cancer.
20
CPT Matrices
Definition.
If A and B are discrete random variables, then C_{A|B} is a CPT matrix where the ijth element is P(A=i | B=j).

Theorem.
If A and B are discrete random variables, and C_{A|B} is a CPT matrix, then
φ_A = C_{A|B} φ_B
I.e., inferring one variable's distribution from another is a linear operation given the CPT matrix.
21
φ_Cancer = C_{Cancer|A} φ_A

  = [ P(Cancer=0|A=0)  P(Cancer=0|A=1) ] [ P(A=0) ]
    [ P(Cancer=1|A=0)  P(Cancer=1|A=1) ] [ P(A=1) ]

  = [ P(Cancer=0, A=0) + P(Cancer=0, A=1) ]
    [ P(Cancer=1, A=0) + P(Cancer=1, A=1) ]

  = [ P(Cancer=0) ]
    [ P(Cancer=1) ]
22
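A quick numeric check of the theorem on the running example: the construction of C_{Cancer|TestA} via Bayes' rule below is my own illustration, using the P(Cancer) and P(TestA|Cancer) numbers given earlier.

```python
import numpy as np

# Model from the running example (order: [no, yes] for Cancer, [pos, neg] for TestA).
phi_cancer = np.array([0.95, 0.05])             # P(Cancer)
C_testA_given_cancer = np.array([[0.50, 0.99],   # P(TestA=pos | Cancer=no/yes)
                                 [0.50, 0.01]])  # P(TestA=neg | Cancer=no/yes)

# Joint P(TestA, Cancer), then flip the conditioning with Bayes' rule.
joint = C_testA_given_cancer * phi_cancer        # joint[a, c] = P(TestA=a, Cancer=c)
phi_testA = joint.sum(axis=1)                    # P(TestA)
C_cancer_given_testA = joint.T / phi_testA       # column j is P(Cancer | TestA=j)

# Theorem: phi_Cancer = C_{Cancer|A} phi_A
print(C_cancer_given_testA @ phi_testA)          # -> [0.95, 0.05]
print(phi_cancer)
```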
Imagine that, instead of one cancer node, we have X1, X2, …, Xm nodes that we want to determine the gini of:

gini(X) ≡ gini(X1) + gini(X2) + ⋯ + gini(Xm)

Since gini(Xi) = 1 − φ_Xi^T φ_Xi and φ_Xi = C_{Xi|A} φ_A, minimizing gini(X) amounts to maximizing

φ_X1^T φ_X1 + ⋯ + φ_Xm^T φ_Xm
  = (C_{X1|A} φ_A)^T (C_{X1|A} φ_A) + ⋯ + (C_{Xm|A} φ_A)^T (C_{Xm|A} φ_A)
  = φ_A^T G_A^{X1,X2,…,Xm} φ_A
23
• So, we want G_A^{targets} for each node A in the query nodes.
• How do we compute all of these G_A^{targets} matrices efficiently?
• Can do it with dynamic programming because…

Theorem.
For any nodes X, Y, and set of nodes Z, if X and Z are conditionally independent given Y, then
G_X^Z = C_{Y|X}^T G_Y^Z C_{Y|X}
24

Polytrees Use Similar Principle
For example, along the chain A - B - C:
G_A^A = I
G_B^{AB} = I + C_{A|B}^T G_A^A C_{A|B}
G_C^{ABC} = I + C_{B|C}^T G_B^{AB} C_{B|C}
25
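A sketch of that dynamic programming recursion for a three-node chain, with randomly generated CPTs standing in for a real model; it checks the recursion against the direct definition G_C^{ABC} = Σ_X C_{X|C}^T C_{X|C}.

```python
import numpy as np
rng = np.random.default_rng(0)

def random_cpt(n_child, n_parent):
    """Random CPT matrix: column j is P(child | parent = j)."""
    M = rng.random((n_child, n_parent))
    return M / M.sum(axis=0)

# Chain A - B - C; the targets are {A, B, C}.
C_A_given_B = random_cpt(3, 3)
C_B_given_C = random_cpt(3, 3)

# DP recursion from the slides:
G_A = np.eye(3)                                        # G_A^{A} = I
G_B = np.eye(3) + C_A_given_B.T @ G_A @ C_A_given_B    # G_B^{AB}
G_C = np.eye(3) + C_B_given_C.T @ G_B @ C_B_given_C    # G_C^{ABC}

# Direct definition: G_C^{ABC} = sum over targets X of C_{X|C}^T C_{X|C}
C_A_given_C = C_A_given_B @ C_B_given_C
G_C_direct = (np.eye(3)
              + C_B_given_C.T @ C_B_given_C
              + C_A_given_C.T @ C_A_given_C)
print(np.allclose(G_C, G_C_direct))   # True
```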
Fast Active Inference
• Information gain is quadratic in the number of nodes to compute (there is no way to do message passing).
• Gini is linear in the number of nodes.
26
Time to Compute Gain: Random Polytrees
[Figure: time in seconds to compute the gain on random polytrees, for Entropy vs. Gini.]
27
Applications
• Active learning
• Diagnosis
• Optimization of noisy functions
28
OUTLINE
• New Active Inference Algorithm
• Active Learning
• Background
• Application of new algorithm
• Example application to Hidden Markov Models
• Active sequence selection for Hidden Markov Model
learning
29
Active LEARNING
[Table: a pool of web pages (Site_id 0-8), each with binary features F1-F5 and an OFFENSIVE label; a few labels are known (true / false), the rest are "?" and could be queried.]
30
Active Learning Flavors
Dimensions: select queries from a pool vs. construct queries; myopic vs. sequential vs. batch.

Specifically, we're not doing decision processes, POMDPs, or any kind of policy learning. We're asking: what is the one label you most want to see?
31
Active Learning
Ө : model parameter(s)
f_i : feature(s) of example i
L_i : label of example i
[Figure: graphical model with Ө as parent of label L1, which also depends on features f1.]
32
Active Learning
[Figure: Ө with labels L1…L5 (features f1…f5); L1 = TRUE, L2 = FALSE, and L5 = FALSE have been observed, and inference flows from the observed labels back to Ө.]

How do we select a node to minimize the uncertainty of the target node, Θ?
At each iteration, we select the one best node to observe that will minimize our expected uncertainty about the Ө node.
33
Active Learning
• Coincidentally, the Cancer network is analogous to our active learning problem.
[Figure: Cancer → {TestA, TestB} shown alongside Ө → {L1…L5} (with features f1…f5).]
Select test to minimize uncertainty of Cancer ↔ Select L to minimize uncertainty of Ө
34
Active Learning
• Which page do I show the human expert in order to learn my is-offensive model Ө?
• Which email do I show the user in order to learn my is-spam model Ө?

Active Inference
• Which question do I ask the user in order to infer his preference nodes?
• What question do I ask the user in order to infer his printer-state node(s)?
35
Active Learning Basics
• Uncertainty Sampling: uncertainty(L)
• Query by Committee: disagreement(L)
• Information Gain: H(Θ) − H(Θ|L)
36
Active Learning Basics
• Uncertainty Sampling: uncertainty(L)
• Query by Committee: disagreement(L)
• Information Gain: H(Θ) − H(Θ|L)
• Gini Gain: Gini(Θ) − Gini(Θ|L)
37
Active Learning Basics
• Uncertainty Sampling: uncertainty(L)
• Query by Committee: disagreement(L)
• Information Gain: H(Θ) − H(Θ|L)
• Gini Gain: Gini(Θ) − Gini(Θ|L)  ← New
38
Uncertainty Sampling
(Lewis and Gale, 1994)
BASIC IDEA: choose uncertain labels.
Talk Assumption: uncertainty is entropy
39
Uncertainty Sampling Example

id | F1 F2 F3 F4 F5 | OFFEN. | P(OFFEN) | H(OFFEN)
 0 |  0  0  0  1  0 |   ?    |   0.02   |  0.043
 1 |  0  1  0  1  0 |   ?    |   0.01   |  0.024
 2 |  0  0  0  0  0 |   ?    |   0.05   |  0.086
 3 |  0  0  1  1  1 |   ?    |   0.33   |  0.910  ← most uncertain; its queried label comes back FALSE
 4 |  1  0  0  1  0 |   ?    |   0.01   |  0.024
 5 |  1  1  0  0  1 |   ?    |   0.96   |  0.073
40
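A minimal uncertainty-sampling selection step over the P(OFFEN) column above (my illustration, not from the slides; entropies here use natural logs, so the values differ from the H column, but the selected example is the same).

```python
import numpy as np

# P(OFFENSIVE = yes) for each unlabeled page, from the table above.
p_offensive = np.array([0.02, 0.01, 0.05, 0.33, 0.01, 0.96])

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

h = binary_entropy(p_offensive)
query = int(np.argmax(h))        # uncertainty sampling: ask for the most uncertain label
print(query, h[query])           # -> example 3
```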
Uncertainty Sampling
BASIC IDEA: choose the sample you are most uncertain about
GOOD: easy
GOOD: sometimes works
BAD: H(L) measures information gained about the sample, not the model
  → attracted to noisy samples
41
Uncertainty Sampling
…but at least H(L) upper bounds the information gain of L w.r.t. the model (or anything else).
BAD: H(L) measures information gained about the sample, not the model
  → attracted to noisy samples
42
We can do better than
uncertainty sampling
43
Query By Committee (QBC)
(Seung, Opper, and Sompolinsky, 1992)
IDEA: choose labels your models disagree on.
ASSUMPTION: no noise
ASSUMPTION: perfectly learnable model
E.g., if half your version space says X is true, and the other
half says it is false, you’re guaranteed to reduce your
version space by half if you find out X.
44
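A toy sketch of the QBC loop. The "models" here are random linear classifiers standing in for draws from a version space or posterior; they are assumptions for illustration, not part of the original method description.

```python
import numpy as np
rng = np.random.default_rng(1)

# Hypothetical pool of unlabeled examples (binary feature vectors).
pool = np.array([[0, 1, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [1, 1, 0]])

def sample_model():
    """Stand-in for 'randomly draw a model from the version space / posterior':
    here, simply a random linear classifier."""
    w = rng.normal(size=pool.shape[1])
    b = rng.normal()
    return lambda x: (x @ w + b) > 0

# Query-by-committee with a committee of two.
theta1, theta2 = sample_model(), sample_model()
for i, x in enumerate(pool):
    if theta1(x) != theta2(x):        # the two models disagree
        print("query example", i)     # -> ask the oracle for this label
        break
```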
QBC
• Randomly draw 2 models from model space
• Classify the example
• If they disagree, select the example

t | Sex | Age   | TestA | TestB | TestC | Li
1 |  M  | 20-30 |   0   |   1   |   1   |  ?
2 |  F  | 20-30 |   0   |   1   |   0   |  ?
3 |  F  | 30-40 |   1   |   0   |   0   |  ?
4 |  F  | 60+   |   1   |   1   |   0   |  ?
5 |  M  | 10-20 |   0   |   1   |   0   |  ?
6 |  M  | 20-30 |   1   |   1   |   1   |  ?

θ1 → FALSE, θ2 → FALSE: the two models agree on this example.
45
QBC
• Randomly draw 2 models from model space
• Classify the example
• If they disagree, select the example

(Same example pool as on the previous slide; a different example is classified.)
θ1 → TRUE, θ2 → TRUE: the two models agree on this example.
46
QBC
• Randomly draw 2 models from model space
• Classify the example
• If they disagree, select the example

(Same example pool.)
θ1 → TRUE, θ2 → FALSE: the two models disagree, so this example is selected.
47
Query By Committee (QBC)
IDEA: choose labels your models disagree on.
In the noise-free case, H(L) is entirely due to
uncertainty about the model, so it reduces to
uncertainty sampling!
If we allow noisy samples and use a model
posterior instead of a version space, QBC starts
to look exactly like…
48
Active Learning Basics
• Uncertainty Sampling: uncertainty(L)
• Query by Committee: disagreement(L)
• Information Gain: H(Θ) − H(Θ|L)
• Gini Gain: Gini(Θ) − Gini(Θ|L)
49
Information Gain
• Choose the unlabeled example whose label has the greatest information gain w.r.t. the model.
[Figure: Ө with labels L1…L5 and features f1…f5.]
50
Information Gain
• Choose the unlabeled example whose label has the greatest information gain w.r.t. the model.

IG(L; Θ) = H(Θ) − H(Θ|L)
         = H(L) − H(L|Θ)

Interesting: Uncertainty sampling ≈ Information Gain when H(L|Θ) is small relative to H(L).
51
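A small sketch of computing IG(L; Θ) both ways for a two-model space, assuming (my assumption, not stated on the slide) an equal prior over θ1 and θ2; the two forms agree, which is exactly the identity above.

```python
import numpy as np

def H(p):
    """Entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

prior = np.array([0.5, 0.5])               # P(theta1), P(theta2)  (assumed)
p_L_given_theta = np.array([0.12, 0.01])   # P(L=true | theta) for one candidate example

# Joint P(theta, L): column 0 = L true, column 1 = L false.
joint = np.stack([prior * p_L_given_theta, prior * (1 - p_L_given_theta)], axis=1)
p_L = joint.sum(axis=0)                    # P(L)
p_theta = joint.sum(axis=1)                # P(theta) (= prior)

# IG(L; Theta) computed both ways -- they agree.
ig1 = H(p_theta) - sum(p_L[l] * H(joint[:, l] / p_L[l]) for l in range(2))
ig2 = H(p_L)     - sum(p_theta[t] * H(joint[t, :] / p_theta[t]) for t in range(2))
print(ig1, ig2)
```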
Information Gain Example
Assume that our model space consists of two models, θ1 and θ2 …

id | F1 F2 F3 F4 F5 | OFFEN. | P(OFFEN|θ1) | P(OFFEN|θ2) | IG(OFFEN; Θ)
 0 |  0  0  0  1  0 |   ?    |    0.02     |    0.02     |    0.000
 1 |  0  1  0  1  0 |   ?    |    0.12     |    0.01     |    0.230  ← highest gain; its queried label comes back FALSE
 2 |  0  0  0  0  0 |   ?    |    0.07     |    0.05     |    0.025
 3 |  0  0  1  1  1 |   ?    |    0.33     |    0.33     |    0.000
 4 |  1  0  0  1  0 |   ?    |    0.02     |    0.01     |    0.007
 5 |  1  1  0  0  1 |   ?    |    0.99     |    0.96     |    0.022
52
Active Learning Basics
• Uncertainty Sampling: uncertainty(L)
• Query by Committee: disagreement(L)
• Information Gain: H(Θ) − H(Θ|L)
• Gini Gain: Gini(Θ) − Gini(Θ|L)
53
Gini Gain
• Use the active inference algorithm from the first part of this talk…
Target node: Ө
Query nodes: {Li}
[Figure: Ө with labels L1…L5 and features f1…f5.]
54
Gini Gain
Definition.
The Gini gain between two random variables X and Y, denoted GG(Y; X), is defined as

GG(Y; X) = gini(Y) − gini(Y | X)
         = (1 − φ_Y^T φ_Y) − Σ_{x ∈ dom(X)} P(x) gini(Y | X=x)
55
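A sketch of the definition applied to the Cancer example from earlier in the talk: GG(Cancer; TestA) vs. GG(Cancer; TestB), computed from their joint distributions (the joint construction below is my own).

```python
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - p @ p

def gini_gain(joint):
    """GG(Y; X) from a joint P(Y, X): gini(Y) - sum_x P(x) gini(Y | X=x)."""
    p_y = joint.sum(axis=1)
    p_x = joint.sum(axis=0)
    return gini(p_y) - sum(p_x[j] * gini(joint[:, j] / p_x[j]) for j in range(len(p_x)))

# Cancer / TestA / TestB numbers from the running example.
phi_cancer = np.array([0.95, 0.05])              # [no, yes]
C_A = np.array([[0.50, 0.99], [0.50, 0.01]])     # P(TestA | Cancer), rows [pos, neg]
C_B = np.array([[0.01, 0.50], [0.99, 0.50]])     # P(TestB | Cancer)

joint_cancer_A = phi_cancer[:, None] * C_A.T     # P(Cancer, TestA)
joint_cancer_B = phi_cancer[:, None] * C_B.T     # P(Cancer, TestB)
print("GG(Cancer; TestA) =", gini_gain(joint_cancer_A))
print("GG(Cancer; TestB) =", gini_gain(joint_cancer_B))  # pick the test with the larger gain
```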
Active Learning Basics
• Uncertainty Sampling: uncertainty(L). PRO: simple. CON: misled by noise.
• Query by Committee: disagreement(L). PRO: simple. CON: no good theory for noise.
• Information Gain: H(Θ) − H(Θ|L). PRO: information theory-based. CON: does not scale well.
• Gini Gain: Gini(Θ) − Gini(Θ|L). PRO: scales extremely well; can use confusion costs.
56
Interesting Question
• Can we “fix” uncertainty sampling by
approximating H(L|Ө)?
If we can do this, it will
approximate information gain
57
We're Still Not Happy
• All of the active learning methods used this model:
[Figure: Ө with labels L1…L5 and features f1…f5.]
…But something seems wrong…
58
We're Still Not Happy
We usually don't want information about the model…
We want information about the test set labels!
[Figure: Ө connected to test-set labels Z1…Z3 (features f'1…f'3) and training-set labels L1…L4 (features f1…f4).]
59
Information Gain Approach
[Figure: Ө connected to test-set labels Z1…Z3 and training-set labels L1…L4.]

Information Gain:
Y* = argmax_Y [ IG(Y; Z1) + IG(Y; Z2) + ⋯ + IG(Y; Zm) ]

This blows up quadratically, since we're evaluating each L's effect on each Z in the test set.
60
Gini Gain Approach
• Gini Gain:
Target nodes: {Zi}
Query nodes: {Yi}
[Figure: Ө connected to test-set labels Z1…Z3 and training-set labels L1…L4.]

Note that the structure of this problem is a polytree, so the algorithm is O(N).
Work in progress
61
OUTLINE
• New Active Inference Algorithm
• Active Learning
• Background
• Application of new algorithm
• Example application to Hidden Markov Models
• Active sequence selection for Hidden Markov Model
learning
62
The SwitchMaster™
(powered by Hidden Markov Models!)

INPUT: binary stream of motion / no-motion
OUTPUT: probability distribution over
• Phone,
• Meeting,
• Computer, and
• Out
E.g., "There is an 86% chance that the user is in a meeting right now."
63
Hidden Markov Model
Model parameters Ө = {π0, A, B}

π0 = [ P(S0=1), P(S0=2), …, P(S0=n) ]

A = [ P(St+1=1|St=1) … P(St+1=n|St=1) ]
    [ P(St+1=1|St=2) … P(St+1=n|St=2) ]
    [ …                               ]
    [ P(St+1=1|St=n) … P(St+1=n|St=n) ]

B = [ P(O=1|S=1) … P(O=m|S=1) ]
    [ P(O=1|S=2) … P(O=m|S=2) ]
    [ …                       ]
    [ P(O=1|S=n) … P(O=m|S=n) ]

[Figure: HMM graphical model, S0 → S1 → S2 → S3 with observations O0…O3.]
64
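A sketch of what Ө = {π0, A, B} looks like in code for a SwitchMaster-style model, with made-up numbers, plus a simple forward-filtering pass; this illustrates the parameterization and is not the talk's actual model.

```python
import numpy as np

# Hypothetical SwitchMaster-style HMM: states Phone/Meeting/Computer/Out,
# observations 0 = no motion, 1 = motion.  All numbers are invented for illustration.
states = ["Phone", "Meeting", "Computer", "Out"]
pi0 = np.array([0.25, 0.25, 0.25, 0.25])        # P(S0)
A = np.array([[0.90, 0.05, 0.04, 0.01],          # A[i, j] = P(S_{t+1}=j | S_t=i)
              [0.05, 0.90, 0.04, 0.01],
              [0.05, 0.05, 0.85, 0.05],
              [0.02, 0.03, 0.05, 0.90]])
B = np.array([[0.7, 0.3],                         # B[i, o] = P(O_t=o | S_t=i)
              [0.6, 0.4],
              [0.2, 0.8],
              [0.95, 0.05]])

def filter_states(obs):
    """Forward filtering: P(S_t | O_1..t) for each t."""
    belief = pi0.copy()
    out = []
    for o in obs:
        belief = belief @ A           # predict the next state distribution
        belief = belief * B[:, o]     # weight by the observation likelihood
        belief /= belief.sum()
        out.append(belief.copy())
    return np.array(out)

print(filter_states([0, 0, 1, 1, 1, 0]).round(2))
```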
SwitchMaster HMM
[Figure: HMM with states S1…S4 and observations O1…O4.]

A = [ P(St+1=Phone|St=Phone)   … ]
    [ P(St+1=Phone|St=Meeting) … ]
    [ …        P(St+1=Out|St=Out) ]

B = [ P(Ot=1 | St=Phone)    ]
    [ P(Ot=1 | St=Meeting)  ]
    [ P(Ot=1 | St=Computer) ]
    [ P(Ot=1 | St=Out)      ]
65
HMM Inference

t | Ot | P(St=Phone) | P(St=Meeting) | P(St=Computer) | P(St=Out)
1 |  0 |    1.00     |     0.00      |      0.00      |   0.00
2 |  0 |    1.00     |     0.00      |      0.00      |   0.00
3 |  1 |    0.00     |     0.10      |      0.80      |   0.10
4 |  1 |    0.00     |     0.11      |      0.80      |   0.09
5 |  1 |    0.00     |     0.12      |      0.80      |   0.08
6 |  0 |    0.00     |     0.10      |      0.78      |   0.12
… |  … |      …      |       …       |       …        |    …
66
Active Learning!
"Good Morning Sir! Here's the video footage of yesterday. Could you just go through it and label each frame?"
…or:
"Can you tell me what you are doing in this frame of video?"
67
HMM User Model
…Now suppose that our human labels this time step
States (hidden): Phone, Meeting, Computer, Out
Observations: motion sensors, microphones, keyboard activity, etc.
[Figure: HMM with hidden states S1…S4, observations O1…O4, and bar charts of the state probabilities for Phone/Meeting/Computer/Out at each time step.]
68
HMMs and Active Learning
…Now suppose that our human labels this time step
[Figure: HMM with states S1…S4, observations O1…O4, and state probability bars for Phone/Meeting/Computer/Out; one time step is labeled.]
69
HMMs and Active Learning
[Figure: HMM with states S1…S4 and observations O1…O4.]
…No problem, if we know the true state…
70
HMMs and Active Learning using Evidence
[Figure: HMM with states S1…S4, observations O1…O4, and queryable observations L1…L4 attached to the states.]
"Queryable" observations (costly observations, labels, uncertain labels, tests, etc.)
71
HMMs and Active Learning using Evidence
[Figure: HMM with states S1…S4, observations O1…O4, queryable observations L1…L4, and state probability bars for Phone/Meeting/Computer/Out.]
…Now we choose a measurement…
72
HMMs and Active Learning
[Figure: HMM with states S1…S4, observations O1…O4, and queryable observations L1…L4.]
Active Learning: What is the optimal observation, L1, L2, L3, or L4?
Choose L* to minimize uncertainty of the model or the hidden states?
73
HMMs and Active Learning
[Figure: a longer HMM with states S1…S7, observations O1…O7, and queryable observations L1…L7. Which L should we query?]
hmm…
74

hmmmmmmmmmmmm…
75
HMMs and Active Learning
The SwitchMaster™ is trying to minimize
the uncertainty of some target node(s)
…What are its target nodes?
76
HMM Inference Tasks
• Path: Viterbi algorithm
• Individual states: Forward-Backward algorithm
• Parameters: Baum-Welch algorithm
77
Different entropy-based and gini-based active learners

          | Path               | States              | Model
Entropy   | H(S1, S2, …, ST)   | Σ_t H(St)           | H(Θ)
Gini      | φ_joint^T φ_joint  | Σ_t φ_St^T φ_St     | φ_Θ^T φ_Θ

Efficient myopic algorithms for each of these objective functions in Anderson and Moore, 2005
78
Active State Learning with Information Gain

Y* = argmax_Y Σ_{t=1}^T IG(Y; St)

Cost: O(T² M N²)
79
[Figure: HMM with states S1…S4 and query nodes L1…L4.]

          | Path               | States              | Model
Entropy   | H(S1, S2, …, ST)   | Σ_t H(St)           | H(Θ)
Gini      | φ_joint^T φ_joint  | Σ_t φ_St^T φ_St     | φ_Θ^T φ_Θ
80
Active State Learning with Gini

Y* = argmax_Y Σ_{t=1}^T GG(Y; St)

Cost: O(T M N²)  (vs. O(T² M N²) for information gain)
[Figure: HMM with states S1…S4 and query nodes L1…L4.]
81
Experiment: User Model
States: Emacs-Latex, Emacs-Code, Shell, Email, Other
Observations: key duration (msec), key transition time (msec), key category (alpha, space, enter, punc, edit)
1 keystroke = 1 timestep
20,000 timesteps
82
Results
[Figure: learning curves comparing Random, Uncertainty sampling, and Gini query selection.]
83
OUTLINE
• New Active Inference Algorithm
• Active Learning
• Background
• Application of new algorithm
• Example application to Hidden Markov Models
• Active sequence selection for Hidden Markov Model learning (Anderson, Siddiqi, and Moore, 2006)
84
Actively Selecting Excerpts
Good Morning Sir!
Could you please label the following
scene from yesterday…
85
[Figure: a binary stream of motion / no-motion observations.]
OK, which subsequence would be most informative about my model?
There are O(T²) of them!
86
[Figure: the binary stream with one excerpt selected; the expert annotates the state at each time step of the excerpt (P, P, M, M, C, P, M).]
Note: the expert annotates each of the states
hmmmmmm…
87
[Figure: the labeled excerpt within the binary stream.]
Possible applications of "excerpt selection"
• Selecting utterances from audio
• Selecting excerpts from text
• Selecting sequences from DNA
88
Excerpt Selection
PROBLEM: Find the sequence S = {St, St+1, …, St+k} that maximizes IG(S; Θ)
NOTE: We're not using Gini, we're using information gain!
Trick question: Which subsequence maximizes IG(S; Θ)?
89
Sequence Selection
We have to include the cost incurred when we force an expert to sit down and label 1000 examples…

score(S; Θ) = IG(S; Θ) − α|S|

So there is a constant cost, α, associated with providing each label.
IG(S; Θ) is computed from the entropy of the sequence, H(S). How do we compute H(S)?
90
What is the Entropy of a Sequence?
• H(S1:4) = H(S1, S2, S3, S4) = ?

The Chain Rule of Entropy:
H(S1,S2,S3,S4) = H(S1) + H(S2|S1) + H(S3|S1,S2) + H(S4|S1,S2,S3)

…but we have some structural information (the chain S1 → S2 → S3 → S4):
H(S1,S2,S3,S4) = H(S1) + H(S2|S1) + H(S3|S2) + H(S4|S3)
91
Entropy of a Sequence

H(St, St+1, …, St+k) = H(St) + Σ_{i=t}^{t+k−1} H(Si+1 | Si)

We still get the H(St) and H(St+1|St) values from P(St | O1:T) and P(St+1 | St, O1:T).
92
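A sketch of the excerpt-entropy computation, assuming we are handed the smoothed marginals P(Si | O1:T) and pairwise marginals P(Si, Si+1 | O1:T); the tiny numeric example is made up but internally consistent.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def excerpt_entropy(marginals, pairwise):
    """H(S_t, ..., S_{t+k}) for a Markov-chain excerpt.

    marginals[i] : P(S_i | O_1:T), shape (n,)
    pairwise[i]  : P(S_i, S_{i+1} | O_1:T), shape (n, n)
    """
    h = entropy(marginals[0])                        # H(S_t)
    for m, joint in zip(marginals[:-1], pairwise):   # + sum_i H(S_{i+1} | S_i)
        h += entropy(joint.ravel()) - entropy(m)     # H(S_{i+1}|S_i) = H(S_i,S_{i+1}) - H(S_i)
    return h

# Tiny made-up example: 2 states, a 3-step excerpt.
marginals = [np.array([0.9, 0.1]), np.array([0.7, 0.3]), np.array([0.5, 0.5])]
pairwise = [np.array([[0.65, 0.25], [0.05, 0.05]]),
            np.array([[0.45, 0.25], [0.05, 0.25]])]
print(excerpt_entropy(marginals, pairwise))
```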
Score of a Sequence

score(S; Θ) = H(S) − H(S|Θ) − α|S|
            = [ H(St) + Σ_i H(Si+1 | Si) ] − [ H(St | Θ) + Σ_i H(Si+1 | Si, Θ) ] − α|S|

(the sums run over the time steps i in the excerpt)
93
How can I find the best excerpt of length k?
[Figure: the binary observation stream.]
94
Find Best Sequence of Length k
1. Score each length-k subsequence according to score(S;Ө) = H(S) − H(S|Ө)
2. Select the best one
[Figure: the binary stream with a length-5 window (k=5) marked ***.]
Some simple caching gives O(T)
95
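One way to realize the "simple caching" in O(T), assuming the excerpt score has been broken into per-time-step contributions (for example, the H(Si+1|Si) − H(Si+1|Si, Θ) terms from the chain-rule decomposition); this sliding-window sketch is my own.

```python
import numpy as np

def best_window_of_length_k(per_step_score, k):
    """Best start index of a length-k window when the excerpt score is
    (approximately) a sum of per-time-step contributions.  Runs in O(T)
    by sliding a running window sum instead of rescoring every window."""
    g = np.asarray(per_step_score, dtype=float)
    window = g[:k].sum()
    best_start, best_score = 0, window
    for start in range(1, len(g) - k + 1):
        window += g[start + k - 1] - g[start - 1]   # slide the window by one step
        if window > best_score:
            best_start, best_score = start, window
    return best_start, best_score

# per_step_score[t] would be e.g. H(S_{t+1}|S_t) - H(S_{t+1}|S_t, Theta); random here.
rng = np.random.default_rng(0)
print(best_window_of_length_k(rng.normal(size=50), k=5))
```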
Yeah, but what if I don't know k?
I want to find the best excerpt of any length.
[Figure: the binary observation stream.]
96
Find Best Sequence of Any Length
Hmm…
1. Score all possible intervals
2. Pick the best one
That's O(T²). We could cleverly cache some of the computation as we go… but we're still going to be O(T²).
[Figure: the binary observation stream.]
97
Similar Problem
[Figure: a function f(t) that goes positive and negative over time.]
Find the sequence with the largest integral.
Note: a Google interview question
Can be done using Dynamic Programming in O(T)
98
DP Intuition
state(t) = the best interval so far, and the best interval ending at t
state(t+1) = if f(t) + score of best-ending-at-t < 0
               then start a new best-ending-at-t
               else "keep going"
99
Find Best Sequence of Any Length
Use DP to find the subsequence that maximizes score(S;Ө) = H(S) − H(S|Ө) − α|S|
[Figure: the binary stream with the best-scoring excerpt marked ***.]
100
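A Kadane-style sketch of the O(T) DP, in the spirit of the rules on the DP slides: add per-step score contributions minus the labeling cost α, restart whenever the running sum goes negative, and keep the best interval seen so far.

```python
import numpy as np

def best_excerpt_any_length(per_step_score, alpha):
    """Kadane-style DP for the best-scoring excerpt of any length, where the
    score is a sum of per-step contributions minus alpha per labeled step.
    Returns (start, end_exclusive, score) in O(T)."""
    best = (0, 0, float("-inf"))
    run_start, run_score = 0, 0.0
    for t, g in enumerate(per_step_score):
        if run_score < 0:                 # a negative prefix can't help: restart here
            run_start, run_score = t, 0.0
        run_score += g - alpha            # pay the labeling cost alpha for this step
        if run_score > best[2]:
            best = (run_start, t + 1, run_score)
    return best

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.1, scale=1.0, size=100)   # e.g. H(S) - H(S|Theta) per step
print(best_excerpt_any_length(scores, alpha=0.05))
```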
Not Just HMMs
This active learning algorithm can be applied to any
sequential process with the Markov property
E.g., Kalman filters
101
SUMMARY
• Linear time active inference using Gini
• Applications to Hidden Markov Models
• Applications to general Active Learning
• Active sequence selection
[Figure: Ө connected to test-set labels Z1…Z3 (features f'1…f'3) and training-set labels L1…L4 (features f1…f4).]
102
Future Work
• On-line active learning
• Batch active learning
• Optimization of noisy functions
103
104
Selective Sampling → Bias?
105
Related Work
• Label selection for tracking in text HMMs (Scheffer, et al. 2001)
• Nonmyopic label selection for tracking with chain models (Krause & Guestrin, 2005)
• Label selection for model learning in general graphical models (Tong & Koller, 2001)
106
Imagine that, instead of one cancer node we're interested in, we have X1, X2, …, Xm that we want to determine the gini of:

gini(X) ≡ gini(X1) + gini(X2) + ⋯ + gini(Xm)

Since gini(Xi) = 1 − φ_i^T φ_i, minimizing gini(X) amounts to maximizing

φ_1^T φ_1 + φ_2^T φ_2 + ⋯ + φ_m^T φ_m
  = (C_{X1|A} φ_A)^T (C_{X1|A} φ_A) + ⋯ + (C_{Xm|A} φ_A)^T (C_{Xm|A} φ_A)
  = φ_A^T C_{X1|A}^T C_{X1|A} φ_A + ⋯ + φ_A^T C_{Xm|A}^T C_{Xm|A} φ_A
  = φ_A^T ( C_{X1|A}^T C_{X1|A} + ⋯ + C_{Xm|A}^T C_{Xm|A} ) φ_A
  = φ_A^T G_A^{X1,X2,…,Xm} φ_A
107
state(t) =
  [a*, b*]  : best interval so far
  a_temp    : start of the best interval ending at t
  sum(a*, b*)
  sum(a_temp, t)

Rules:
  if ( sum(a_temp, t−1) + y(t) < 0 ) then a_temp = t
  if ( sum(a_temp, t) > sum(a*, b*) ) then [a*, b*] = [a_temp, t]
108