Reinforcement Learning
KI2 - 11
Reinforcement Learning
Johan Everts
Kunstmatige Intelligentie / RuG
What is Learning?
Learning takes place as a result of interaction
between an agent and the world. The idea
behind learning is that
percepts received by an agent should
be used not only for acting, but also for
improving the agent's ability to behave
optimally in the future to achieve its
goal.
Learning Types
Supervised learning:
Situation in which sample (input, output) pairs
of the function to be learned can be perceived or
are given
Reinforcement learning:
The agent acts on its environment and
receives some evaluation of its action
(reinforcement), but is not told which action is
the correct one to achieve its goal
Unsupervised Learning:
No information at all about the correct output is given
Reinforcement Learning
Task
Learn how to behave successfully to achieve a
goal while interacting with an external
environment
Learn through experience
Examples
Game playing: The agent knows it has won or lost,
but it doesn’t know the appropriate action in each
state
Control: a traffic system can measure the delay of
cars, but does not know how to decrease it.
Elements of RL
Agent
State
Policy
Reward
Action
Environment
[Diagram: agent–environment loop — from state s0, action a0 yields reward r0 and state s1; action a1 yields r1 and s2; and so on]
Transition model δ: how actions influence states
Reward r: immediate value of a state-action transition
Policy π: maps states to actions
Elements of RL
[Grid-world figure: r(state, action) immediate reward values — 0 for every transition except those entering the goal state G, which have reward 100]
Elements of RL
[Grid-world figure: the same r(state, action) immediate reward values, together with the resulting V*(state) values — 100 for the state next to the goal G, then 90 and 81 for states two and three steps away]
Value function: maps states to state values
V^π(s) ≡ r(t) + γ r(t+1) + γ² r(t+2) + …
Discount factor γ ∈ [0, 1)
(here γ = 0.9)
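The discounted sum above can be sketched directly; a minimal example, assuming γ = 0.9 as on the slide (the reward sequence and helper name are illustrative):

```python
# Sketch: discounted return sum_t gamma^t * r(t) for a finite reward
# sequence; gamma = 0.9 matches the slides, rewards are illustrative.

def discounted_return(rewards, gamma=0.9):
    """Return r(0) + gamma*r(1) + gamma^2*r(2) + ... for a finite list."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reaching the goal (reward 100) after two zero-reward steps:
print(discounted_return([0, 0, 100]))  # ~81.0, i.e. 0.9**2 * 100
```

This matches the V* values in the grid-world figures: a state three steps from the goal is worth 0.9² · 100 = 81.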
RL task (restated)
Execute actions in environment,
observe results.
Learn action policy π : state → action that
maximizes the expected discounted reward
E[r(t) + γ r(t+1) + γ² r(t+2) + …]
from any starting state in S
Reinforcement Learning
Target function is π : state → action
RL differs from other function
approximation tasks
Partially observable states
Exploration vs. Exploitation
Delayed reward -> temporal credit
assignment
Reinforcement Learning
Target function is π : state → action
However…
We have no training examples of form
<state, action>
Training examples are of form
<<state, action>, reward>
Utility-based agents
Try to learn V^π* (abbreviated V*) and
perform lookahead search to choose the best
action from any state s
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
This works well if the agent knows
δ : state × action → state
r : state × action → ℝ
When the agent doesn't know δ and r, it cannot
choose actions this way
Q-learning
Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal
action even without knowing δ or r
Using Learned Q
π*(s) = argmax_a Q(s, a)
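As a sketch, the argmax over actions can be read straight off a Q-table; here the table is assumed to be a dict mapping (state, action) pairs to Q-values, and the states/actions are hypothetical:

```python
# Sketch: greedy action selection pi*(s) = argmax_a Q(s, a), assuming a
# Q-table stored as a dict of (state, action) -> value. Names are illustrative.

def greedy_policy(Q, state, actions):
    """Return the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s1", "left"): 72.0, ("s1", "right"): 90.0}
print(greedy_policy(Q, "s1", ["left", "right"]))  # right
```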
Learning the Q-value
Note: Q and V* are closely related:
V*(s) = max_a' Q(s, a')
This allows us to write Q recursively:
Q(s(t), a(t)) = r(s(t), a(t)) + γ V*(δ(s(t), a(t)))
              = r(s(t), a(t)) + γ max_a' Q(s(t+1), a')
Learning the Q-value
FOR each <s, a> DO
Initialize table entry: Q̂(s, a) ← 0
Observe current state s
WHILE (true) DO
Select an action a and execute it
Receive immediate reward r
Observe the new state s'
Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ max_a' Q̂(s', a')
Move: s ← s'
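The loop above can be sketched in a few lines. This is a minimal illustration on a 1-D corridor with an absorbing goal (a stand-in for the slides' grid world); the environment, episode count, and purely random exploration are assumptions, not part of the original:

```python
import random

# Sketch: tabular Q-learning on a deterministic 1-D corridor with goal
# state 3. The corridor, rewards, and episode count are illustrative;
# gamma = 0.9 matches the slides.
GAMMA = 0.9
GOAL = 3
ACTIONS = [-1, +1]  # move left / move right

def step(s, a):
    """Transition delta(s, a) and immediate reward r(s, a)."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (100 if s2 == GOAL else 0)

# Initialize every table entry Qhat(s, a) to 0
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)  # explore with random actions
        s2, r = step(s, a)
        # Qhat(s, a) <- r + gamma * max_a' Qhat(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2

print(Q[(2, +1)])  # 100.0 (the action entering the goal)
print(Q[(1, +1)])  # 90.0, i.e. 0.9 * 100
```

The learned values reproduce the pattern in the figures: 100 next to the goal, then 90, 81, 72 at increasing distance.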
Q-learning
Q-learning learns the expected utility of
taking a particular action a in a particular
state s (the Q-value of the pair (s, a))
[Grid-world figure: r(state, action) immediate reward values (0 everywhere except 100 for actions entering the goal G), the resulting V*(state) values (100, 90, 81, …), and the learned Q(state, action) values — e.g. 100 for the action entering G, 90 = 0.9 · 100 one step away, then 81 and 72 further from G]
Q-learning
Demonstration
http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html
eps: probability of taking a random action
instead of the one given by the current policy
gam: discount factor; the closer to 1, the more weight
is given to future reinforcements
alpha: learning rate
Temporal Difference Learning:
Q-learning estimates a one-time-step difference:
Q^(1)(s(t), a(t)) = r(t) + γ max_a Q̂(s(t+1), a)
Why not look ahead n steps?
Q^(n)(s(t), a(t)) = r(t) + γ r(t+1) + … + γ^(n−1) r(t+n−1) + γ^n max_a Q̂(s(t+n), a)
Temporal Difference Learning:
TD(λ) formula:
Q^λ(s(t), a(t)) = (1 − λ) [ Q^(1)(s(t), a(t)) + λ Q^(2)(s(t), a(t)) + λ² Q^(3)(s(t), a(t)) + … ]
Intuitive idea: use a constant 0 ≤ λ ≤ 1 to
combine estimates from various lookahead
distances (note the normalization factor (1 − λ))
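The n-step and λ-weighted returns can be sketched for a single finite episode. This is an illustrative assumption-laden sketch: `q_tail[t]` stands in for the bootstrap estimate max_a Q̂(s(t), a), and the leftover λ-weight is given to the final full return, a common finite-episode truncation not spelled out on the slide:

```python
# Sketch: n-step returns Q^(n) and their TD(lambda) combination for one
# finite episode. rewards[t] = r(t); q_tail[t] = max_a Qhat(s(t), a).
# Both lists and the truncation rule are illustrative assumptions.

def n_step_return(rewards, q_tail, t, n, gamma=0.9):
    """Q^(n): n discounted rewards plus a bootstrapped tail estimate."""
    g = sum(gamma ** k * rewards[t + k]
            for k in range(min(n, len(rewards) - t)))
    if t + n < len(q_tail):  # bootstrap only if the episode continues
        g += gamma ** n * q_tail[t + n]
    return g

def lambda_return(rewards, q_tail, t, lam=0.5, gamma=0.9):
    """(1 - lambda) * sum_n lambda^(n-1) * Q^(n), truncated at episode end."""
    N = len(rewards) - t  # steps remaining in the episode
    total = sum((1 - lam) * lam ** (n - 1)
                * n_step_return(rewards, q_tail, t, n, gamma)
                for n in range(1, N))
    # remaining weight lambda^(N-1) goes to the full (Monte Carlo) return
    return total + lam ** (N - 1) * n_step_return(rewards, q_tail, t, N, gamma)
```

With λ = 0 this reduces to the one-step Q-learning target Q^(1); with λ = 1 it reduces to the full discounted return.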
Genetic algorithms
Imagine the individuals as agent functions
Fitness function as performance measure or
reward function
No attempt made to learn the relationship
between the rewards and actions taken by an
agent
The algorithm simply searches directly in the space
of individuals to find one that maximizes the fitness
function
Genetic algorithms
Represent an individual as a binary string
Selection works like this: if individual X scores twice
as high as Y on the fitness function, then X is twice
as likely to be selected for reproduction as Y.
Reproduction is accomplished by cross-over and
mutation
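The selection and cross-over steps above can be sketched as follows; the bitstring encoding matches the slide, while the toy fitness function and population are illustrative assumptions:

```python
import random

# Sketch: fitness-proportional ("roulette wheel") selection plus
# single-point crossover on binary strings. The fitness function and
# population are toy examples.

def fitness(bits):
    """Toy fitness: the number of 1-bits in the string (illustrative)."""
    return bits.count("1")

def select(population):
    """Pick one parent with probability proportional to its fitness,
    so an individual with twice the fitness is twice as likely to be chosen."""
    weights = [fitness(b) for b in population]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(x, y):
    """Single-point crossover of two equal-length bitstrings."""
    p = random.randrange(1, len(x))
    return x[:p] + y[p:]

pop = ["1100", "1110", "0001"]
child = crossover(select(pop), select(pop))  # a new 4-bit individual
```

Mutation (flipping a random bit with small probability) would be applied to `child` in the same way; no reward-to-action relationship is modeled anywhere in this loop.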
Cart – Pole balancing
Demonstration
http://www.bovine.net/~jlawson/hmc/pole/sane.html
Summary
RL addresses the problem of learning control
strategies for autonomous agents
In Q-learning an evaluation function over
states and actions is learned
TD-algorithms learn by iteratively reducing
the differences between the estimates
produced by the agent at different times
In the genetic approach, the relation between
rewards and actions is not learned; the space of
individuals is simply searched for one that maximizes fitness.