CS 416 Artificial Intelligence
Lecture 20
Making Complex Decisions
Chapter 17

Midterm Results
AVG: 72  MED: 75  STD: 12
Rough dividing lines at: 58 (C), 72 (B), 85 (A)

Assignment 1 Results
AVG: 87  MED: 94  STD: 19
How to interpret the grade sheet…

Interpreting the grade sheet…
• You see the tests we ran listed in the first column
• The metrics we accumulated are:
  – Solution depth, nodes created, nodes accessed, fringe size
  – All metrics are normalized by dividing by the value obtained using one of the good solutions from last year
• The first four columns show these normalized metrics averaged across the entire class’s submissions
• The next four columns show these normalized metrics for your submission…
  – Ex: A value of “1” for “Solution” means your code found a solution at the same depth as the solution from last year. The class average for “solution” might be 1.28 because some submissions searched longer and thus increased the average

Interpreting the grade sheet
• SLOW = more than 30 seconds to complete
  – 66% credit given to reflect partial credit even though we never obtained firm results
• N/A = the test would not even launch correctly… it might have crashed or ended without output
  – 33% credit given to reflect that frequently N/A occurs when no attempt was made to create an implementation
If you have an N/A but you think your code reflects partial credit, let us know.

Gambler’s Ruin
Consider working out examples of gambler’s ruin for $4 and $8 by hand
Ben created some graphs to show solution of gambler’s ruin for $8
$0 bets are not permitted!

$8-ruin using batch update
Converges after three iterations.
Value vector is only updated after a complete iteration has completed

$8-ruin using in-place updating
Convergence occurs more quickly
Updates to value function occur in-place starting from $1

$100-ruin
A more detailed graph than provided in the assignment

Trying it by hand
Assume value update is working…
$1    $2    $3    $4    $5    $6    $7    $8
.064  .16   .256  .4    .496  .64   .784  1
What’s the best action at $5?
When tied… pick the smallest action

Office hours
Sunday: 4 – 5 in Thornton Stacks
Send email to Ben ([email protected]) by Saturday at midnight to reserve a slot
Also make sure you have stepped through your code (say for the $8 example) to make sure that it is implementing your logic

Compilation
Just for grins
Take your Visual Studio code and compile using g++:
g++ foo.cpp -o foo -Wall

Partially observable Markov Decision Processes (POMDPs)

Relationship to MDPs
• Value and Policy Iteration assume you know a lot about the world:
  – current state, action, next state, reward for state, …
• In real world, you don’t exactly know what state you’re in
  – Is the car in front braking hard or braking lightly?
  – Can you successfully kick the ball to your teammate?
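
Before moving on to POMDPs, the hand-worked $8 gambler's ruin values above can be checked with a short script. This is only a minimal sketch, not the assignment's reference solution. It assumes a win probability of 0.4 (inferred from the value .4 at $4 in the table, not stated in the slides), a reward of 1 only for reaching $8, no discounting, and bets from $1 up to min(s, 8 − s). All function and variable names are hypothetical.

```python
GOAL = 8       # target bankroll from the $8 example
P_WIN = 0.4    # ASSUMPTION: inferred from V($4) = .4 in the table; not stated in the slides


def bet_value(values, s, bet):
    """Expected value of staking `bet` dollars when holding `s` dollars."""
    return P_WIN * values[s + bet] + (1 - P_WIN) * values[s - bet]


def legal_bets(s):
    """Bets of $1 up to min(s, GOAL - s); $0 bets are not permitted."""
    return range(1, min(s, GOAL - s) + 1)


def value_iteration(tol=1e-12, max_sweeps=1000):
    """In-place value iteration, sweeping from $1 upward."""
    values = [0.0] * (GOAL + 1)
    values[GOAL] = 1.0  # reaching the goal is worth 1; ruin ($0) is worth 0
    for sweep in range(1, max_sweeps + 1):
        delta = 0.0
        for s in range(1, GOAL):
            best = max(bet_value(values, s, b) for b in legal_bets(s))
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place: later states in this sweep see the new value
        if delta < tol:
            return values, sweep
    return values, max_sweeps


def best_bet(values, s, eps=1e-9):
    """Smallest bet whose value is within eps of the best (ties -> smallest action)."""
    best = max(bet_value(values, s, b) for b in legal_bets(s))
    return min(b for b in legal_bets(s) if bet_value(values, s, b) >= best - eps)


values, sweeps = value_iteration()
print([round(v, 3) for v in values])
print("best bet at $5:", best_bet(values, 5))
```

Under these assumptions the converged values match the table, and at $5 the $1 and $3 bets tie at .496, so the smallest-action rule selects $1. The small `eps` in `best_bet` is there because floating-point rounding can otherwise break the tie arbitrarily.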
Partially observable
Consider not knowing what state you’re in…
• Go left, left, left, left, left
• Go up, up, up, up, up
  – You’re probably in the upper-left corner
• Go right, right, right, right, right

Extending the MDP model
MDPs have an explicit transition function T(s, a, s’)
• We add O(s, o)
  – The probability of observing o when in state s
• We add the belief state, b
  – The probability distribution over all possible states
  – b(s) = belief that you are in state s

Two parts to the problem
Figure out what state you’re in
• Use Filtering from Chapter 15
Figure out what to do in that state
Update b(s) and π(s) / U(s) after each iteration
• Bellman’s equation is useful again
The optimal action depends only on the agent’s current belief state

Selecting an action
• α is a normalizing constant that makes the belief state sum to 1
• b’ = FORWARD(b, a, o)
• Optimal policy maps belief states to actions
  – Note that the n-dimensional belief state is continuous
    Each belief value is a number between 0 and 1

A slight hitch
The previous slide required that you know the outcome o of action a in order to update the belief state
If the policy is supposed to navigate through belief space, we want to know what belief state we’re moving into before executing action a

Predicting future belief states
Suppose you know action a was performed when in belief state b. What is the probability of receiving observation o?
• b provides a guess about initial state
• a is known
• Any observation could be realized… any subsequent state could be realized… any new belief state could be realized

Predicting future belief states
The probability of perceiving o, given action a and belief state b, is given by summing over all the actual states the agent might reach

Predicting future belief states
We just computed the odds of receiving o
We want new belief state
• Let τ(b, a, b’) be the belief transition function
  – Equal to 1 if b’ = FORWARD(b, a, o)
  – Equal to 0 otherwise

Predicted future belief states
Combining previous two slides
This is a transition model through belief states

Relating POMDPs to MDPs
We’ve found a model for transitions through belief states
• Note MDPs had transitions through states (the real things)
We need a model for rewards based on beliefs
• Note MDPs had a reward function based on state

Bringing it all together
We’ve constructed a representation of POMDPs that makes them look like MDPs
• Value and Policy Iteration can be used for POMDPs
• The optimal policy, π*(b), of the MDP belief-state representation is also optimal for the physical-state POMDP representation

Continuous vs. discrete
Our POMDP in MDP-form is continuous
• Cluster continuous space into regions and try to solve for approximations within these regions

Final answer to POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
• It’s deterministic (it already takes into account the absence of observations)
• It has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …)
• It is successful 86.6% of the time
In general, POMDPs with a few dozen states are nearly impossible to optimize
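
The belief update and the observation prediction described in the slides can be sketched in a few lines. This sketch assumes the usual textbook form of the filtering update, b’(s’) = α O(s’, o) Σ_s T(s, a, s’) b(s), and of the prediction, P(o | a, b) = Σ_s’ O(s’, o) Σ_s T(s, a, s’) b(s). The two-state model, its numbers, and all names below are hypothetical and chosen only for illustration.

```python
def observation_probability(b, a, o, T, O):
    """P(o | a, b): sum over every state s' the agent might actually reach.

    b[s]        : current belief that the agent is in state s
    T[s][a][s2] : transition probability T(s, a, s')
    O[s][o]     : probability of observing o when in state s
    """
    n = len(b)
    return sum(
        O[s2][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    )


def forward(b, a, o, T, O):
    """Belief update b' = FORWARD(b, a, o), normalized so that b' sums to 1."""
    n = len(b)
    unnormalized = [
        O[s2][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # equals P(o | a, b)
    if total == 0.0:
        raise ValueError("observation o has zero probability under belief b")
    alpha = 1.0 / total  # the normalizing constant
    return [alpha * u for u in unnormalized]


# Hypothetical toy model: 2 states, 1 action (index 0), 2 observations.
T = [
    [[0.9, 0.1]],  # from state 0 under action 0
    [[0.2, 0.8]],  # from state 1 under action 0
]
O = [
    [0.7, 0.3],    # observation probabilities in state 0
    [0.1, 0.9],    # observation probabilities in state 1
]
b = [0.5, 0.5]     # start fully uncertain

for o in range(2):
    p_o = observation_probability(b, 0, o, T, O)
    print("o =", o, "P(o|a,b) =", round(p_o, 3), "b' =", [round(x, 3) for x in forward(b, 0, o, T, O)])
```

Because each observation o leads to exactly one successor belief FORWARD(b, a, o), weighting those successors by P(o | a, b) gives the transition model through belief states that the slides build on. The nested sums cost O(n²) per update, which is fine for a toy model but hints at why larger POMDPs become hard.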