Transcript PowerPoint

CS 416
Artificial Intelligence
Lecture 20
Making Complex Decisions
Chapter 17
Midterm Results
AVG: 72
MED: 75
STD: 12
Rough dividing lines at: 58 (C), 72 (B), 85 (A)
Assignment 1 Results
AVG: 87
MED: 94
STD: 19
How to interpret the grade sheet…
Interpreting the grade sheet…
• You see the tests we ran listed in the first column
• The metrics we accumulated are:
– Solution depth, nodes created, nodes accessed, fringe size
– All metrics are normalized by dividing by the value obtained using one of the
good solutions from last year
• The first four columns show these normalized metrics averaged across the
entire class’s submissions
• The next four columns show these normalized metrics for your submission…
– Ex: A value of “1” for “Solution” means your code found a solution at the
same depth as the solution from last year. The class average for “solution”
might be 1.28 because some submissions searched longer and thus
increased the average
Interpreting the grade sheet
• SLOW = more than 30 seconds to complete
– 66% credit given as partial credit, even though we never
obtained firm results
• N/A = the test would not even launch correctly… it might
have crashed or ended without output
– 33% credit given because N/A frequently occurs when no
attempt was made to create an implementation
If you have an N/A but you think your code
deserves partial credit, let us know.
Gambler’s Ruin
Consider working out examples of gambler’s ruin
for $4 and $8 by hand
Ben created some graphs to show the solution of
gambler’s ruin for $8
$0 bets are not permitted!
$8-ruin using batch update
Converges after three iterations.
The value vector is only updated after a complete iteration.
$8-ruin using in-place updating
Convergence occurs more quickly.
Updates to the value function occur in-place, starting from $1.
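A minimal sketch of the two update schemes in C++ (the course language). The win probability of 0.4, the goal of $8, and the legal bets of $1 through min(s, 8 - s) are assumptions, not stated on these slides, but they are consistent with the values on the "Trying it by hand" slide and with the rule that $0 bets are not permitted.

// gamblers_ruin.cpp -- sketch of batch vs. in-place value updates
// Assumptions: win probability p = 0.4, goal = $8,
// legal bets at state s are $1 .. min(s, goal - s), reaching $8 is worth 1.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int    goal    = 8;
    const double p       = 0.4;    // assumed probability of winning a single bet
    const double eps     = 1e-9;
    const bool   inPlace = false;  // flip to true for the in-place scheme

    std::vector<double> V(goal + 1, 0.0);
    V[goal] = 1.0;                 // value of having reached the goal

    for (int sweep = 1; sweep <= 100; ++sweep) {
        std::vector<double> Vold = V;                    // snapshot for the batch scheme
        const std::vector<double>& src = inPlace ? V : Vold;
        double delta = 0.0;
        for (int s = 1; s < goal; ++s) {                 // $0 and $8 are terminal
            double best = 0.0;
            for (int bet = 1; bet <= std::min(s, goal - s); ++bet)
                best = std::max(best, p * src[s + bet] + (1 - p) * src[s - bet]);
            delta = std::max(delta, std::fabs(best - V[s]));
            V[s] = best;           // with inPlace, later states see this right away
        }
        if (delta < eps) {
            std::printf("values stopped changing after %d sweeps\n", sweep - 1);
            break;
        }
    }
    for (int s = 0; s <= goal; ++s)
        std::printf("V($%d) = %.3f\n", s, V[s]);
    return 0;
}

The only difference between the two schemes is which vector the backup reads from: the batch scheme reads from the snapshot taken at the start of the sweep, while the in-place scheme reads from the vector it is currently overwriting, sweeping upward from $1.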
$100-ruin
A more detailed graph than provided in the assignment
Trying it by hand
Assume value update is working…
State: $1    $2   $3    $4   $5    $6   $7    $8
Value: .064  .16  .256  .4   .496  .64  .784  1
What’s the best action at $5?
When tied… pick the smallest action
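Working the question with the numbers above (this assumes the win probability is 0.4 and that the legal bets at $5 are $1, $2, or $3, which is consistent with the values shown):

Bet $1: 0.4 · V($6) + 0.6 · V($4) = 0.4 · .64  + 0.6 · .4   = .496
Bet $2: 0.4 · V($7) + 0.6 · V($3) = 0.4 · .784 + 0.6 · .256 ≈ .467
Bet $3: 0.4 · V($8) + 0.6 · V($2) = 0.4 · 1    + 0.6 · .16  = .496

Bets of $1 and $3 tie at .496, so by the tie-breaking rule the best action at $5 is to bet $1.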
Office hours
Sunday: 4 – 5 in Thornton Stacks
Send email to Ben ([email protected]) by
Saturday at midnight to reserve a slot
Also step through your code (say, for the $8 example) to
make sure it implements your logic
Compilation
Just for grins
Take your Visual Studio code and compile using
g++:
g++ foo.cpp -o foo -Wall
Partially observable Markov Decision
Processes (POMDPs)
Relationship to MDPs
• Value and Policy Iteration assume you know a lot about the
world:
– current state, action, next state, reward for state, …
• In the real world, you don’t know exactly what state you’re in
– Is the car in front braking hard or braking lightly?
– Can you successfully kick the ball to your teammate?
Partially observable
Consider not knowing what state you’re in…
• Go left, left, left, left, left
• Go up, up, up, up, up
– You’re probably in the upper-left corner
• Go right, right, right, right, right
Extending the MDP model
MDPs have an explicit transition function
T(s, a, s’)
• We add O(s, o)
– The probability of observing o when in state s
• We add the belief state, b
– The probability distribution over all possible states
– b(s) = belief that you are in state s
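For example, in the 4×3 grid world behind the left/up/right example above, an agent that initially has no idea where it is would use the uniform belief state: b(s) = 1/9 for each of the nine non-terminal squares and 0 for the two terminal squares.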
Two parts to the problem
Figure out what state you’re in
• Use Filtering from Chapter 15
Figure out what to do in that state
Update b(s) and π(s) / U(s) after each iteration
• Bellman’s equation is useful again
The optimal action depends only on the agent’s
current belief state
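For reference, the Bellman update that value iteration used in the MDP setting (γ is the discount factor):

U(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U(s')

The next several slides build belief-state versions of T and R so that this same machinery can be reused.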
Selecting an action
• α is a normalizing constant that makes the belief state sum to 1
• b’ = FORWARD(b, a, o)
• Optimal policy maps belief states to actions
– Note that the n-dimensional belief state is continuous: each belief
value is a number between 0 and 1
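The update FORWARD performs (the standard filtering step, written with the O and T defined earlier; α is the normalizing constant):

b’(s’) = α O(s’, o) Σ_s T(s, a, s’) b(s)

Each possible next state s’ is weighted by how likely the transition into it was and by how likely it was to produce the observation o, and then the whole vector is normalized by α.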
A slight hitch
The previous slide required that you know the
outcome o of action a in order to update the belief
state
If the policy is supposed to navigate through
belief space, we want to know what belief state
we’re moving into before executing action a
Predicting future belief states
Suppose you know action a was performed when
in belief state b. What is the probability of
receiving observation o?
• b provides a guess about initial state
• a is known
• Any observation could be realized… any subsequent state
could be realized… any new belief state could be realized
Predicting future belief states
The probability of perceiving o, given action a and
belief state b, is given by summing over all the
actual states the agent might reach
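Written out, using the notation above (a standard form of this sum):

P(o | a, b) = Σ_s’ O(s’, o) Σ_s T(s, a, s’) b(s)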
Predicting future belief states
We just computed the odds of receiving o
We want the new belief state
• Let τ(b, a, b’) be the belief transition function
– The term P(b’ | o, a, b) that appears in it equals 1 if
b’ = FORWARD(b, a, o), and 0 otherwise
Predicted future belief states
Combining previous two slides
This is a transition model through belief states
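Spelled out, combining the two previous slides (again a standard form):

τ(b, a, b’) = P(b’ | b, a) = Σ_o P(b’ | o, a, b) P(o | a, b)

where P(b’ | o, a, b) is the 1-or-0 indicator above and P(o | a, b) is the sum from the previous slide. τ(b, a, b’) plays the same role for belief states that T(s, a, s’) plays for physical states.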
Relating POMDPs to MDPs
We’ve found a model for transitions through belief
states
• Note MDPs had transitions through states (the real things)
We need a model for rewards based on beliefs
• Note MDPs had a reward function based on state
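A standard way to define it (the one the book uses) is the expected reward under the belief:

ρ(b) = Σ_s b(s) R(s)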
Bringing it all together
We’ve constructed a representation of POMDPs
that makes them look like MDPs
• Value and Policy Iteration can be used for POMDPs
• The optimal policy, π*(b), of the MDP belief-state
representation is also optimal for the physical-state POMDP
representation
Continuous vs. discrete
Our POMDP in MDP-form is continuous
• Cluster continuous space into regions and try to solve for
approximations within these regions
Final answer to POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
• It’s deterministic (it already takes into account the absence of
observations)
• It has an expected utility of 0.38 (compared with 0.08 for the
simple l, l, l, u, u, u, r, r, r, …)
• It is successful 86.6% of the time
In general, POMDPs with a few dozen states are
nearly impossible to optimize