An Object-oriented Representation for
Efficient Reinforcement Learning
Carlos Diuk, Andre Cohen and Michael L. Littman
Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)
Department of Computer Science
Rutgers University (New Jersey, USA)
ICML 2008 – Helsinki, Finland
Motivation
How would YOU play this game?
What’s in a state?

[Figure: a trace through a generic MDP, s1 -> a0 -> s5 -> a2 -> s24 -> a1 -> s1.
Each state label is just a simple hash code that tells you if you’ve been “there” before.]

What we (the agent) can actually “see”: objects, interactions, spatial
relationships. If we know that our agents are interacting in a spatial
relation with objects, let’s just tell them so.
What we did
• Grab ideas from Relational RL and come up with a representation that:
  – is suitable for a wide-enough range of domains
  – is tractable
  – provides opportunities for generalization
  – enables smart exploration
• Strike a balance between generality and tractability.
OO representation
• Problem defined by a set of objects and their attributes.
• Example: objects in Pitfall defined by a bounding box on a set of pixels, based on color:
  Man.<x,y>, Hole.<x,y>, Ladder.<x,y>, Log.<x,y>, Wall.<x,y>
• State is the union of all objects’ attribute values (see the sketch below).
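To make this concrete, here is a minimal Python sketch, not the authors' code: the Obj class and state() helper are illustrative assumptions, with attribute values mirroring the Pitfall example above.

class Obj:
    def __init__(self, cls, **attrs):
        self.cls = cls      # e.g. "Man", "Ladder"
        self.attrs = attrs  # e.g. {"x": 12, "y": 40}

def state(objects):
    """Union of all objects' attribute values, as one hashable tuple."""
    return tuple(sorted((o.cls, a, v)
                        for o in objects
                        for a, v in o.attrs.items()))

man = Obj("Man", x=12, y=40)
ladder = Obj("Ladder", x=12, y=32)
print(state([man, ladder]))
# (('Ladder', 'x', 12), ('Ladder', 'y', 32), ('Man', 'x', 12), ('Man', 'y', 40))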
OO representation
• For any given state s, there is a function c(s) that tells us which relations hold under s.
• Dynamics are defined by preconditions and effects.
• Preconditions are conjunctions of terms:
  – Relations between objects:
    • touchN/S/E/W(object_i, object_j)
    • on(object_i, object_j)
  – Any (boolean) function on the attributes.
  – Any other function encoding prior knowledge.
• Actions have effects that determine how objects’ attributes get modified.
• Example: on(Man, Ladder) + Action Up ⇒ Man.y = Man.y + 8 (sketched below).
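A hedged sketch of that rule, reusing Obj from the previous sketch: the on() test is an assumed geometric check, not the paper's definition, while the +8 effect is the one shown on the slide.

def on(a, b):
    """Assumed relation: same column, vertically adjacent bounding boxes."""
    return a.attrs["x"] == b.attrs["x"] and abs(a.attrs["y"] - b.attrs["y"]) <= 8

def action_up(man, ladder):
    if on(man, ladder):       # precondition: a conjunction of terms
        man.attrs["y"] += 8   # effect: arithmetic change to one attribute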
DOORMax
• An algorithm for efficient learning of deterministic OO-MDPs.
• When objects interact and an effect is observed, DOORMax learns the conjunction of terms that enabled the effect (see the sketch below).
• Belongs to the R-Max family of algorithms:
  – Guides exploration to make objects interact.
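The core learning step can be illustrated as follows. This is a minimal sketch, not the authors' implementation: a conjunctive condition is learned by intersecting the literals that held in every state where the effect was observed; term names are assumptions.

def learn_condition(positive_observations):
    """Each observation maps term name -> bool, recorded when the effect occurred."""
    hypothesis = None  # unknown until the first observation
    for terms in positive_observations:
        literals = {t if v else "~" + t for t, v in terms.items()}
        if hypothesis is None:
            hypothesis = literals
        else:
            hypothesis &= literals  # drop literals contradicted by this observation
    return hypothesis

# The Up action moved the man only in states where on(Man, Ladder) held:
obs = [
    {"on(Man,Ladder)": True, "touchN(Man,Wall)": False},
    {"on(Man,Ladder)": True, "touchN(Man,Wall)": True},
]
print(learn_condition(obs))  # {'on(Man,Ladder)'}

Each informative observation removes at least one literal from the hypothesis, which is the intuition behind the O(nm) bound in the analysis below.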
Pitfall video
DOORMax Analysis
• Let n be the number of terms.
• Assume that:
  – The number of effects per action is bounded by a (small) constant m.
  – Each effect has a unique conjunctive condition.
• As long as effects are observed (that is, some effect occurs given an action a), DOORMax will learn the condition-effect pairs that determine the dynamics of a in O(nm). There is a worst-case bound, when lots of no-effects are observed, of O(n^m).
Results
What about this game?
Representations in Taxi

Algorithm       # of steps   Time per step   What we have to tell it
Q-learning      47157        <1ms            #states, #actions
MaxQ            6298         10ms            Task hierarchy, DBNs for each task
Flat Rmax       4151         ~40ms           #states, #actions
Factored Rmax   1676         44ms            DBN
DSHP            319          11ms            Task hierarchy, DBNs for each task
DOORMax         529          14ms            Object representation
Bigger Taxi

                                 Taxi 5x5     Taxi 10x10    Ratio
# States                         500          7200          14.40
Factored Rmax                    1676 steps   19100 steps   11.39
DOORMax                          529 steps    821 steps     1.55
DOORMax with transfer from 5x5   529 steps    529 steps     1.00
Conclusions and future work
• OO-MDPs provide a natural way of modeling an interesting set of domains, while enabling generalization and smart exploration.
• DOORMax learns deterministic OO-MDPs, outperforming state-of-the-art algorithms for factored-state representations.
• DOORMax scales very nicely with respect to the size of the state space, as long as the transition dynamics between objects do not change.
• We do not have a provably efficient algorithm for stochastic OO-MDPs.
• We do not yet handle inheritance between classes of objects.