
Predictive State Representations
Duke University Machine Learning Group
Discussion Leader: Kai Ni
September 09, 2005
Outline
• Predictive State Representations (PSR) Model
• Constructing a PSR from a POMDP
• Learning parameters for PSR
• Conclusions
Two Popular Methods
• There are two dominant approaches in the control/AI area.
• The generative-model approach
– Typified by POMDPs; more general, with unlimited memory;
– Strongly dependent on a good model of the system.
• The history-based approach
– Typified by k-order Markov methods; simple and effective;
– Limited by how far back the history window extends.
The Position of PSR
Figure 1: Data flow in a) POMDP and other recursive updating of state representation,
and b) history-based state representation.
The predictive state representation (PSR) approach
• Like the generative-model approach in that it updates the
state representation recursively;
• Like the history-based approach in that its representations
are grounded in data.
What is a PSR
• A PSR looks to the future and represents what will happen.
• A PSR is a vector of predictions for a specially selected set
of action-observation sequences, called tests:
– One test for a1o1a2o2 after time k means
Pr{Ok  o1 , Ok 1  o2 | Ak  a1 , Ak 1  a2 }
• A PSR is a set of tests whose predictions constitute
sufficient information to determine the predictions for all
possible tests (a sufficient statistic).
The System-Dynamics Vector (1)
• Given an ordering t1, t2, … over all possible tests, the
system’s probability distribution over all tests defines an
infinite system-dynamics vector d.
• The ith element of d is the prediction of the ith test:
d_i = p(t_i) = Pr(O_1 = o^1, …, O_k = o^k | A_1 = a^1, …, A_k = a^k)
where t_i = a^1 o^1 … a^k o^k.
• The predictions in d satisfy some structural properties.
The System-Dynamics Vector (2)
Figure 2: a) Each of d’s entries corresponds to the prediction of the test. b)Properties of
the predictions imply structure in d.
 i 0  di  1
 Let T ( a ) be the set of tests whose action sequences equal
a . Then k a  Ak ,
 ta  A, p (t ) 


oO
tT ( a )
p (t )  1
p (tao)
System-Dynamics Matrix (1)
• To make the structure explicit, we consider a matrix, D,
whose columns correspond to tests and whose rows
correspond to histories.
• Each element is a history-conditional prediction
D_ij = p(t_j | h_i) = p(h_i t_j) / p(h_i)
• The first history is the zero-length history; thus the system-dynamics vector d is the first row of the matrix D.
System-Dynamics Matrix (2)
Figure 3: The rows in the system-dynamics matrix correspond to all possible histories
(pasts), while the columns correspond to all possible tests (futures). The entries in the
matrix are the probabilities of futures given pasts.
• All the entries of matrix D are uniquely determined by the
vector d because both the numerator and the denominator are
elements of d.
POMDP and D
• The system-dynamics matrix D is not a model of the
system but should be viewed as the system itself.
• D can be generated from a POMDP model by generating
each test’s prediction as follows:
p(a1 o1 … an on | h) = b(h)^T T^{a1} O^{a1,o1} … T^{an} O^{an,on} 1
• Theorem: A POMDP with k nominal states cannot model a
dynamical system with dimension greater than k.
– The dimension of a dynamical system equals the rank of D.
The Idea of Linear PSR
• For any D with rank k, there must exist k linearly
independent columns (and rows). Consider such a set of
columns and let the corresponding tests be
Q = {q1, q2, …, qk}, called the core tests.
• For any h, the prediction vector p(Q|h) = [p(q1|h) … p(qk|h)]
is a predictive state representation. It forms a sufficient
statistic for the system: all other test predictions can be calculated
from the linear dependence
p(t|h) = p(Q|h)^T m_t, where m_t is the weight vector for test t.
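As a concrete illustration, the linear dependence above is just a dot product; the numbers below are made up for illustration (they are not from any example in the talk):

```python
# Minimal sketch: in a linear PSR, the prediction of ANY test t is a fixed
# linear function of the k core-test predictions, p(t|h) = p(Q|h)^T m_t.
# All numbers below are hypothetical.

def predict(pQ, m_t):
    """Linear-PSR prediction of test t from the core prediction vector p(Q|h)."""
    return sum(p * w for p, w in zip(pQ, m_t))

pQ = [0.8, 0.5, 0.25]    # hypothetical p(q1|h), p(q2|h), p(q3|h)
m_t = [0.2, 0.4, 0.1]    # hypothetical weight vector m_t for a non-core test t
p_t = predict(pQ, m_t)   # 0.8*0.2 + 0.5*0.4 + 0.25*0.1 = 0.385
```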
Updating the Core Tests
• The prediction vector can be updated recursively after each
new action-observation pair:
p(q_i | hao) = p(aoq_i | h) / p(ao | h) = p(Q|h)^T m_{aoq_i} / p(Q|h)^T m_{ao}
Figure 4: An example of system-dynamics matrix. The set Q = {t1, t3, t4} forms a set
of core tests. The equations in the ti column show how any entry on a row can be
computed from the prediction vector of that row.
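The recursive state update can be sketched as follows (a minimal Python sketch; the two-dimensional numbers are hypothetical, chosen only so that p(aoq_i|h) ≤ p(ao|h) as valid probabilities require):

```python
# Sketch of the recursive state update after executing action a and seeing o:
#   p(q_i | hao) = p(Q|h)^T m_{aoq_i} / p(Q|h)^T m_{ao}
# All weights below are made-up illustrations.

def update_state(pQ, M_ao, m_ao):
    """Return p(Q | hao) given p(Q | h); column i of M_ao is m_{aoq_i}."""
    denom = sum(p * w for p, w in zip(pQ, m_ao))                # p(ao | h)
    k = len(pQ)
    return [sum(pQ[j] * M_ao[j][i] for j in range(k)) / denom   # p(aoq_i|h) / p(ao|h)
            for i in range(k)]

pQ = [0.6, 0.3]                       # current prediction vector p(Q|h)
m_ao = [0.5, 0.5]                     # hypothetical weight vector for the test ao
M_ao = [[0.2, 0.1],                   # columns are the weight vectors m_{aoq_i}
        [0.4, 0.2]]
new_pQ = update_state(pQ, M_ao, m_ao)
```

The denominator p(ao|h) normalizes once per step, so the whole state update costs only a matrix-vector product, mirroring the POMDP belief update but in the prediction space.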
Constructing a PSR from a POMDP
• POMDP updates its belief state by computing
b(hao) = b(h)^T T^a O^{a,o} / ( b(h)^T T^a O^{a,o} 1 )
• Define a function u mapping tests to (1 x k) vectors by
u(ε) = 1 and u(aot) = (T^a O^{a,o} u(t)^T)^T. We call u(t) the
outcome vector for test t.
• A test t is linearly independent of a set of tests S if u(t) is
linearly independent of the set of u(S).
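A minimal sketch of computing outcome vectors, written in column-vector form (u(aot) = T^a O^{a,o} u(t)), using the same kind of made-up 2-state POMDP as before; note that b(h)^T u(t) then recovers p(t|h):

```python
# Sketch: outcome vectors from a toy POMDP (hypothetical numbers):
#   u(eps) = 1 (all ones),  u(aot) = T^a O^{a,o} u(t),  p(t|h) = b(h)^T u(t).

def outcome_vector(T, O, test, n_states):
    """Build u(t) recursively from the suffix of the test outward."""
    u = [1.0] * n_states                         # u(eps) = all-ones vector
    for a, o in reversed(test):
        # (T^a O^{a,o})[i][k] = T[a][i][k] * O_diag[k], since O is diagonal.
        M = [[T[a][i][k] * O[(a, o)][k] for k in range(n_states)]
             for i in range(n_states)]
        u = [sum(M[i][k] * u[k] for k in range(n_states))
             for i in range(n_states)]
    return u

T = {'a': [[0.9, 0.1], [0.2, 0.8]]}
O = {('a', 0): [0.7, 0.4], ('a', 1): [0.3, 0.6]}   # diagonal entries P(o | s')
b = [0.5, 0.5]

u = outcome_vector(T, O, [('a', 0)], 2)
p = sum(bi * ui for bi, ui in zip(b, u))           # p(a0 | empty history)
```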
Searching Algorithm
Figure 5: Searching algorithm for finding a linear PSR from a POMDP.
• The cardinality of Q is bounded by k and no test in Q is
longer than k action-observation pairs.
• All other tests can be computed by
p(a1 o1 … an on | h) = p(Q|h)^T M_{a1 o1} … M_{a(n-1) o(n-1)} m_{an on}
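This chained prediction can be sketched as below (the M matrices and the final m vector are hypothetical two-dimensional numbers, chosen only for illustration):

```python
# Sketch of the chained prediction of a longer test:
#   p(a1 o1 ... an on | h) = p(Q|h)^T M_{a1 o1} ... M_{a(n-1) o(n-1)} m_{an on}
# All weights below are made up.

def long_test_prediction(pQ, Ms, m_last):
    """Propagate p(Q|h)^T through the M matrices, then apply the final m vector."""
    row = list(pQ)
    for M in Ms:                       # one matrix per a-o pair except the last
        row = [sum(row[j] * M[j][i] for j in range(len(row)))
               for i in range(len(M[0]))]
    return sum(r * w for r, w in zip(row, m_last))

pQ = [0.6, 0.3]
Ms = [[[0.2, 0.1], [0.4, 0.2]]]        # hypothetical M_{a1 o1}
m_last = [0.5, 0.5]                    # hypothetical m_{a2 o2}
p = long_test_prediction(pQ, Ms, m_last)   # prediction of a two-step test
```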
An Example of PSR
Figure 6: The float-reset problem
• Any linear PSR of this system has 5 core tests. One such
PSR has the core tests and initial predictions:
Q = {r1, f0r1, f0f0r1, f0f0f0r1, f0f0f0f0r1},
p(Q|h) = [1, 0.5, 0.5, 0.375, 0.375].
• After a float action, the last prediction is updated by
p(f0f0f0f0r1 | hf0) = p(Q|h) [.0625, -.0625, -.75, -.75, 1]^T
Learning PSR model
• The parameters we need to learn are the weight vectors
{m_ao} and the weight matrices {M_ao}, where the ith column
of M_ao equals m_{aoq_i}.
• Using an oracle
– Parameters can be computed by m_t = p(Q|H)^{-1} p(t|H), where
p(Q|H) is the matrix of core-test predictions for a set H of histories
– Build a PSR by querying the oracle for p(Q|H), p(ao|H) and
p(aoq_i|H)
• Without an oracle
– Estimate an entry p(t|h) of D by performing Bernoulli trials:
p̂(t|h) = Succeeded(t, h) / Executed(t, h)
– Use suffix-histories to get around the lack of a reset action
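A sketch of the suffix-history counting estimator, under the simplifying assumption of a single-action system with i.i.d. observations (the trajectory, probabilities, and the `estimate` helper are all hypothetical illustrations, not the authors' code):

```python
# Sketch of the reset-free estimate  p̂(t|h) = Succeeded(t,h) / Executed(t,h):
# every time step of one long trajectory is treated as a potential occurrence
# of the history h, and the steps that follow as an execution of the test t.
import random

def estimate(trajectory, h, t):
    """Estimate p(t|h) from one action-observation trajectory (list of (a,o))."""
    executed = succeeded = 0
    for i in range(len(trajectory) - len(h) - len(t) + 1):
        if trajectory[i:i + len(h)] != h:
            continue                          # this suffix-history does not match h
        window = trajectory[i + len(h):i + len(h) + len(t)]
        if [a for a, _ in window] != [a for a, _ in t]:
            continue                          # t's action sequence was not executed
        executed += 1
        if window == t:                       # observations matched as well
            succeeded += 1
    return succeeded / executed if executed else None

random.seed(0)
# Hypothetical blind system: one action 'a', observation 1 with probability 0.7.
traj = [('a', 1 if random.random() < 0.7 else 0) for _ in range(20000)]
p_hat = estimate(traj, [], [('a', 1)])        # should be near 0.7
```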
TD (temporal difference) Learning
• Update the long-term guess based on the next time step
instead of waiting until the end of the test.
• Let t = a1o1a2o2a3o3 and let p̂(t|h) be the estimate of p(t|h).
After taking action a1 and observing o^{k+1}, the TD estimate is
p̃(t|h) = p̂(a2o2a3o3 | h a1 o^{k+1})   if o^{k+1} = o1,
p̃(t|h) = 0                             if o^{k+1} ≠ o1,
and the model parameters can be updated based on the error.
• Expand Q to include all suffixes of the core tests; call the
resulting set Y.
p(Y | hao) = g( (p(Q|h) W^{ao} + b^{ao}) / d^{ao}(h) )
Result (1)
Table 1. Domain and Core Search Statistics. The Asymp column denotes the
approximate asymptote for the percent of required core tests found during the
trials for suffix-history (with parameter 0.1). The Training column denotes the
approximate smallest training size at which the algorithm achieved the
asymptote value.
Result (2)
• Average error between prediction and truth:
Error = (1/|O|) Σ_{i=1}^{|O|} ( p̂(a o_i | h) - p(a o_i | h) )^2
Figure 7: Comparison of error vs. training length for the tiger problem
Conclusion
• Predictive state representations (PSRs) are a new way to
model dynamical systems. They are more general than both
POMDPs and nth-order Markov models, and they are grounded
in data and easy to learn.
• The system-dynamics matrix provides an interesting way
of looking at discrete dynamical systems.
• The authors propose the suffix-history and TD algorithms for
learning PSRs without reset; both achieve small prediction
error.
Reference
• M. L. Littman, R. S. Sutton and S. Singh, “Predictive
Representations of State”, NIPS 2002
• S. Singh, M. R. James and M. R. Rudary, “Predictive State
Representations: A New Theory for Modeling Dynamical
Systems”, UAI 2004
• B. Wolfe, M. R. James and S. Singh, “Learning Predictive
State Representations in Dynamical Systems Without
Reset”, ICML 2005