An Object-oriented Representation for Efficient Reinforcement Learning
Carlos Diuk, Andre Cohen and Michael L. Littman
Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)
Department of Computer Science, Rutgers University (New Jersey, USA)
ICML 2008 – Helsinki, Finland

Motivation
How would YOU play this game? What's in a state?
  s1 -> a0 -> s5
  s5 -> a2 -> s24
  s24 -> a1 -> s1
A simple hash code that tells you if you've been "there" before. What we (the agent) can actually "see": objects, interactions, spatial relationships. If we know that our agents are interacting in a spatial relation with objects, let's just tell them so.

What we did
• Grab ideas from Relational RL and come up with a representation that:
  – is suitable for a wide-enough range of domains
  – is tractable
  – provides opportunities for generalization
  – enables smart exploration
• Strike a balance between generality and tractability.

OO representation
• Problem defined by a set of objects and their attributes.
• Example: objects in Pitfall are defined by a bounding box on a set of pixels, based on color:
  Man.<x,y>  Hole.<x,y>  Ladder.<x,y>  Log.<x,y>  Wall.<x,y>
• State is the union of all objects' attribute values (a minimal encoding is sketched below).
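To make the representation concrete, here is a minimal Python sketch (ours, not the authors' code) of objects with named attributes and of the state as the union of those attribute values; `Obj` and `state` are illustrative names.

```python
# A minimal sketch (not the authors' code) of the OO representation: each
# object is an instance of a class with named attributes, and the state is
# the union of all objects' attribute values.

class Obj:
    def __init__(self, cls, **attributes):
        self.cls = cls                # object class, e.g. "Man", "Ladder"
        self.attributes = attributes  # attribute values, e.g. {"x": 12, "y": 40}

def state(objects):
    """The OO-MDP state: the union of all objects' attribute values."""
    return tuple(sorted((o.cls, name, value)
                        for o in objects
                        for name, value in o.attributes.items()))

# The Pitfall example from the slides: each object carries its <x, y> position.
world = [Obj("Man", x=12, y=40), Obj("Ladder", x=12, y=40), Obj("Hole", x=30, y=48)]
print(state(world))
```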
OO representation (cont.)
• For any given state s, there is a function c(s) that tells us which relations hold under s.
• Dynamics are defined by preconditions and effects.
• Preconditions are conjunctions of terms:
  – Relations between objects:
    • touchN/S/E/W(object_i, object_j)
    • on(object_i, object_j)
  – Any (boolean) function on the attributes.
  – Any other function encoding prior knowledge.
• Actions have effects that determine how objects' attributes get modified. Example (see the rule sketch below):
  on(Man, Ladder), action Up  =>  Man.y = Man.y + 8

DOORmax
• An algorithm for efficient learning of deterministic OO-MDPs.
• When objects interact and an effect is observed, DOORmax learns the conjunction of terms that enabled the effect (see the learning sketch below).
• Belongs to the R-Max family of algorithms:
  – Guides exploration to make objects interact.

[Pitfall video]

DOORmax analysis
• Let n be the number of terms.
• Assume that:
  – The number of effects per action is bounded by a (small) constant m.
  – Each effect has a unique conjunctive condition.
• As long as effects are observed (that is, some effect occurs given an action a), DOORmax will learn the condition-effect pairs that determine the dynamics of a in O(nm). There is a worst-case bound of O(nm) when lots of no-effects are observed.
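Below is a hedged sketch of how the condition-effect dynamics above could be encoded. `Obj`, `Rule`, `step`, and the toy relation function `c` are our illustrative names, not the paper's API; a real c(s) would also compute touchN/S/E/W and any other terms. The block ends with the on(Man, Ladder) / Up example from the slides.

```python
# A sketch, not the authors' code: one way to encode condition-effect rules.

class Obj:
    def __init__(self, cls, **attributes):
        self.cls, self.attributes = cls, dict(attributes)

def c(objects):
    """Relations holding in the current state (a toy c(s)): here, on(a, b)
    simply means the two bounding boxes coincide."""
    return {("on", a.cls, b.cls)
            for a in objects for b in objects
            if a is not b and a.attributes == b.attributes}

class Rule:
    """A precondition (conjunction of terms) paired with an effect."""
    def __init__(self, action, condition, effect):
        self.action, self.condition, self.effect = action, condition, effect

def step(objects, action, rules):
    """Apply every rule whose action matches and whose condition holds."""
    relations = c(objects)
    for rule in rules:
        if rule.action == action and rule.condition <= relations:
            rule.effect(objects)

# The slide's example: on(Man, Ladder) and action Up => Man.y = Man.y + 8
def climb(objects):
    man = next(o for o in objects if o.cls == "Man")
    man.attributes["y"] += 8

rules = [Rule("Up", {("on", "Man", "Ladder")}, climb)]
world = [Obj("Man", x=12, y=40), Obj("Ladder", x=12, y=40)]
step(world, "Up", rules)
print(world[0].attributes)   # {'x': 12, 'y': 48}
```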
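The learning step can be sketched in the same spirit. Under the assumptions above (deterministic effects, a unique conjunctive condition per effect), every observation of an effect can only remove terms from the hypothesized condition, so each (action, effect) pair needs at most n shrinking updates; this is the intuition behind the O(nm) figure. `ConditionLearner` below is our simplified illustration, not the paper's implementation.

```python
# A simplified sketch of DOORmax-style condition learning: for each
# (action, effect) pair, keep the intersection of all term sets that were
# true when that effect was observed. Deterministic dynamics guarantee the
# true condition is always a subset, so intersecting never overshoots.

class ConditionLearner:
    def __init__(self):
        self.hypothesis = {}   # (action, effect) -> conjunction of terms

    def observe(self, action, true_terms, effect):
        """true_terms: the set of terms (relations, etc.) that held when
        taking `action` produced `effect`."""
        key = (action, effect)
        if key not in self.hypothesis:
            self.hypothesis[key] = set(true_terms)   # first observation
        else:
            self.hypothesis[key] &= true_terms       # drop irrelevant terms

    def predict(self, action, true_terms):
        """Effects whose learned conditions are satisfied by true_terms."""
        return [effect for (a, effect), cond in self.hypothesis.items()
                if a == action and cond <= true_terms]

learner = ConditionLearner()
learner.observe("Up", {"on(Man,Ladder)", "touchE(Man,Wall)"}, "Man.y += 8")
learner.observe("Up", {"on(Man,Ladder)"}, "Man.y += 8")
print(learner.predict("Up", {"on(Man,Ladder)"}))   # ['Man.y += 8']
```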
Results
What about this game?

Representations in Taxi:

  Algorithm       # of steps   Time per step   What we have to tell it
  Q-learning      47157        <1ms            #states, #actions
  MaxQ            6298         10ms            Task hierarchy, DBNs for each task
  Flat Rmax       4151         ~40ms           #states, #actions
  Factored Rmax   1676         44ms            DBN
  DSHP            319          11ms            Task hierarchy, DBNs for each task
  DOORmax         529          14ms            Object representation

Bigger Taxi:

                                   Taxi 5x5     Taxi 10x10   Ratio
  # States                         500          7200         14.40
  Factored Rmax                    1676 steps   19100 steps  11.39
  DOORmax                          529 steps    821 steps    1.55
  DOORmax with transfer from 5x5   529 steps    529 steps    1

Conclusions and future work
• OO-MDPs provide a natural way of modeling an interesting set of domains, while enabling generalization and smart exploration.
• DOORmax learns deterministic OO-MDPs, outperforming state-of-the-art algorithms for factored-state representations.
• DOORmax scales very nicely with respect to the size of the state space, as long as the transition dynamics between objects do not change.
• We do not have a provably efficient algorithm for stochastic OO-MDPs.
• We do not yet handle inheritance between classes of objects.