An Example of Q-Learning

TD-Learning
Shiqi Zhang
AI class at Texas Tech, Fall 2010
Adapted from Kainan University, Taiwan
Model is given?
We say the model is given when the conditional (transition) probabilities are known in all states.
• Model-based: try to learn R and T, and use them to get the value function
• Model-free: estimate the value function directly
Policy is fixed?
• Passive learning: a policy is fixed, and we try to evaluate this policy
• Active learning: the policy is not fixed; we may change it a little at each step
Policy evaluation and Policy improvement
On-policy or Off-policy
• Basic TD-Learning
• SARSA (On-policy)
• Q-Learning (Off-policy)
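As a rough illustration of how these three update rules differ, here is a minimal sketch in Python; the tabular dictionaries V and Q, the step size alpha, and the discount gamma are illustrative assumptions, not notation from these slides.

```python
# Minimal sketch: the one-step updates for TD(0), SARSA, and Q-learning.
# V and Q are plain dictionaries; alpha is a step size, gamma a discount factor.

def td0_update(V, s, r, s_next, alpha, gamma):
    # Basic TD(0): move V(s) toward the one-step return r + gamma * V(s').
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # SARSA (on-policy): the target uses the action a_next the policy actually takes next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Q-learning (off-policy): the target uses the greedy (max) action in s',
    # regardless of which action the behavior policy will take.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```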
Advantages
• TD-learning vs Monte Carlo: TD is an online algorithm
• TD-learning vs Dynamic Programming: TD is model-free
Q-Learning Task
[Figure: agent-environment interaction loop. The agent observes state s_t, takes action a_t, and receives reward r_t, producing the trajectory s0, a0, r0, s1, a1, r1, s2, ...]
Goal: learn to choose actions that maximize the discounted return r0 + γ·r1 + γ²·r2 + ..., where 0 ≤ γ < 1.
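As a small illustration of this objective, the sketch below computes the discounted return of a finite reward sequence; the helper name and the example value γ = 0.8 are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    # Sum r0 + gamma*r1 + gamma^2*r2 + ... for a finite list of rewards.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a reward of 100 received two steps in the future, with gamma = 0.8:
# discounted_return([0, 0, 100], 0.8) == 64.0
```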
Problem Description
• Environment
A to E: rooms, F: outside building (target).
The aim is for an agent to learn how to get out of the building from any of the rooms in an optimal way.
• Modeling of the environment
State, Action, Reward and
Q-value
• Reward matrix
• Q-table and the update rule
Q-table update rule: Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
γ: the learning parameter
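The reward matrix and the update rule can be sketched in Python as follows. The concrete numbers are assumptions for illustration: rooms A to F are numbered 0 to 5, the doors are taken to be A-E, B-D, B-F, C-D, D-E and E-F (consistent with the moves used in the episodes below), any move into F is rewarded with 100, other allowed moves with 0, and impossible moves are marked -1.

```python
import numpy as np

# Assumed reward matrix R: rows = current state, columns = next state.
# States: A=0, B=1, C=2, D=3, E=4, F=5 (F is the goal, outside the building).
# -1 marks pairs of rooms with no door between them.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # A: door to E
    [-1, -1, -1,  0, -1, 100],   # B: doors to D and F
    [-1, -1, -1,  0, -1,  -1],   # C: door to D
    [-1,  0,  0, -1,  0,  -1],   # D: doors to B, C, E
    [ 0, -1, -1,  0, -1, 100],   # E: doors to A, D, F
    [-1,  0, -1, -1,  0, 100],   # F: doors to B and E, plus staying at F
], dtype=float)

Q = np.zeros_like(R)   # Q-table, initialized to zero

def q_update(Q, R, s, a, gamma):
    # Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a').
    # In this example the next state s' is simply the room a we moved into.
    s_next = a
    Q[s, a] = R[s, a] + gamma * Q[s_next].max()
```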
Q-Learning
• Given: a state diagram with a goal state (represented by matrix R)
• Find: the minimum path from any initial state to the goal state (represented by matrix Q)
1. Set the parameter γ and the environment reward matrix R
2. Initialize matrix Q as the zero matrix
3. For each episode:
   ・ Select a random initial state
   ・ Do while the goal state has not been reached
      ・ Select one among all possible actions for the current state
      ・ Using this possible action, consider going to the next state
      ・ Get the maximum Q value of this next state over all possible actions
      ・ Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
      ・ Set the next state as the current state
   End Do
End For
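A compact sketch of this training loop, reusing the assumed R, Q and q_update from the earlier snippet; the episode count and the default γ = 0.8 are illustrative choices rather than values stated on these slides.

```python
import random

GOAL = 5   # state F, the goal

def train(Q, R, episodes=1000, gamma=0.8):
    # gamma = 0.8 is an illustrative default, not a value taken from the slides.
    n_states = R.shape[0]
    for _ in range(episodes):
        s = random.randrange(n_states)              # select a random initial state
        while s != GOAL:                            # inner loop: until the goal is reached
            actions = [a for a in range(n_states) if R[s, a] >= 0]
            a = random.choice(actions)              # select one possible action at random
            q_update(Q, R, s, a, gamma)             # Q(s,a) = R(s,a) + gamma * max Q(s',.)
            s = a                                   # the next state becomes the current state
```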
Algorithm to utilize the Q-table
Input: Q matrix, initial state
1. Set current state = initial state
2. From the current state, find the action that produces the maximum Q value
3. Set current state = next state
4. Go to 2 until current state = goal state
The algorithm above returns the sequence of states from the initial state to the goal state.
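The utilization step can be sketched as follows, again on the assumed Q-table; greedy_path is a hypothetical helper name.

```python
def greedy_path(Q, start, goal=5, max_steps=20):
    # Follow the action with the maximum Q value from each state until the goal
    # (max_steps guards against looping on an untrained Q-table).
    path = [start]
    s = start
    while s != goal and len(path) < max_steps:
        s = int(Q[s].argmax())      # action that produces the maximum Q value
        path.append(s)
    return path

# Example: trace the route out of the building from room C (state 2) after training.
# train(Q, R); print(greedy_path(Q, start=2))
```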
Comments: The parameter γ has a value in the range 0 to 1 (0 ≤ γ < 1). If γ is closer to zero, the agent will tend to consider only the immediate reward. If γ is closer to one, the agent will consider future rewards with greater weight, and is willing to delay the reward.
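For instance (with numbers chosen only for illustration), a reward of 100 received two steps in the future contributes γ²·100 to the return: about 1 when γ = 0.1 but 81 when γ = 0.9. A small γ therefore makes the agent near-sighted, while a γ close to one makes it willing to delay reward for a larger future payoff.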
Numerical Example
Let us set the value of the learning parameter γ and take room B as the initial state.
Episode 1
Look at the second row (state B) of matrix R. There are two possible actions from the current state B: go to state D, or go to state F. By random selection, we choose going to F as our action.
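Following the update rule, Q(B, F) = R(B, F) + γ · max over all actions a' of Q(F, a'). Because Q was initialized as the zero matrix, the maximum term is 0, so Q(B, F) simply becomes R(B, F). F is the goal state, so this episode ends here.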
Episode 2
This time, suppose we randomly get state D as our initial state. From R, state D has three possible actions: go to B, C, or E. We randomly select going to state B as our action.
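By the same rule, Q(D, B) = R(D, B) + γ · max[Q(B, D), Q(B, F)]. The entry Q(B, F) is no longer zero after episode 1, so part of the value of reaching the goal from B now propagates back into Q(D, B).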
Episode 2 (cont’d)
The next state, B, now becomes the current state. We repeat the inner loop of the Q-learning algorithm because state B is not the goal state. There are two possible actions from the current state B: go to state D, or go to state F. By a lucky draw, the action selected is to go to state F.
No change: the Q entries for state F are still zero, so this update leaves Q(B, F) at the value it received in episode 1.
After Many Episodes
If our agent gains more and more experience through many episodes, the Q matrix will finally reach its convergence values.
[Converged Q matrix, normalized to percentages]
Once the Q matrix has almost reached its convergence values, our agent can reach the goal in an optimal way. To trace the sequence of states, it simply finds, in each state, the action that gives the maximum Q value.
For example, from initial state C, using the Q matrix we can obtain the sequence C-D-B-F or C-D-E-F.
The End…