Reinforcement Learning


KI2 - 11
Reinforcement Learning
Johan Everts
Kunstmatige Intelligentie / RuG
What is Learning?

Learning takes place as a result of interaction
between an agent and the world. The idea
behind learning is that percepts received by an
agent should be used not only for acting, but
also for improving the agent's ability to behave
optimally in the future to achieve its goal.
Learning Types

Supervised learning:
Situations in which sample (input, output) pairs
of the function to be learned can be perceived or
are given

Reinforcement learning:
The agent acts on its environment and receives
some evaluation of its action (reinforcement),
but is not told which action is the correct one
to achieve its goal

Unsupervised learning:
No information at all is given about the correct output
Reinforcement Learning

Task
Learn how to behave successfully to achieve a
goal while interacting with an external
environment
Learn through experience

Examples

Game playing: the agent knows whether it has won
or lost, but it doesn't know the appropriate
action in each state
Control: a traffic system can measure the delay
of cars, but does not know how to decrease it
Elements of RL
Agent
State
Policy
Reward
Action
Environment
[Diagram: the agent-environment interaction loop: from state s0 the agent
takes action a0, receives reward r0 and moves to s1, then a1 and r1 lead
to s2, and so on.]
Transition model δ: how actions influence states
Reward r: immediate value of a state-action transition
Policy π: maps states to actions

Elements of RL
[Figure: a grid world showing r(state, action), the immediate reward
values; actions that enter the goal state G receive reward 100, all
other actions receive reward 0.]
Elements of RL
[Figure: the same grid world, showing the r(state, action) immediate
reward values together with the resulting V*(state) values (100 for the
state next to the goal G, then 90, 81, ... further away).]
Value function: maps states to state values
Vπ(s) ≡ r(t) + γ r(t+1) + γ² r(t+2) + ...
Discount factor γ ∈ [0, 1)  (here 0.9)
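
As a small illustration of the discounted sum above, here is a sketch in Python that evaluates it for a finite list of observed rewards; the reward sequence and γ = 0.9 are just the slide's example values.

# Sketch: discounted return r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * rew for k, rew in enumerate(rewards))

# A reward of 100 received two steps from now is worth 0.9^2 * 100 = 81 today;
# one step away it is worth 90 -- the values shown in the figures.
print(discounted_return([0, 0, 100]))   # approximately 81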
RL task (restated)

Execute actions in the environment,
observe the results

Learn an action policy π : state → action that
maximizes the expected discounted reward
E[ r(t) + γ r(t+1) + γ² r(t+2) + ... ]
from any starting state in S
Reinforcement Learning

Target function is π : state → action

RL differs from other function
approximation tasks:

Partially observable states
Exploration vs. exploitation
Delayed reward → temporal credit assignment
Reinforcement Learning

Target function is π : state → action

However…

We have no training examples of the form
<state, action>

Training examples are of the form
<<state, action>, reward>
Utility-based agents


Try to learn V * (abbreviated V*)
perform lookahead search to choose best
action from any state s
π* (s )  arg max r (s, a ) + V * (δ (s, a ))
a


Works well if agent knows

 : state  action  state

r : state  action  R
When agent doesn’t know  and r, cannot
choose actions this way
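
A sketch in Python of this lookahead rule, assuming the agent really does know δ, r and V*; the lookup tables and their values are illustrative, not taken from the slides.

# Illustrative known model of a short chain ending in the goal G.
delta = {("s1", "right"): "s2", ("s2", "right"): "G", ("s2", "left"): "s1"}
r     = {sa: (100 if nxt == "G" else 0) for sa, nxt in delta.items()}
V     = {"s1": 90.0, "s2": 100.0, "G": 0.0}          # assumed V* values

def greedy_action(s, actions, gamma=0.9):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    legal = [a for a in actions if (s, a) in delta]
    return max(legal, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])

# From s2, "right" scores 100 + 0.9*0 and "left" scores 0 + 0.9*90, so "right" wins.
print(greedy_action("s2", ["left", "right"]))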
Q-learning

Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal
action even without knowing δ or r

Using the learned Q:
π*(s) = argmax_a Q(s, a)
Learning the Q-value

Note: Q and V* are closely related:
V*(s) = max_a' Q(s, a')

This allows us to write Q recursively as
Q(s(t), a(t)) ≡ r(s(t), a(t)) + γ V*(δ(s(t), a(t)))
             = r(s(t), a(t)) + γ max_a' Q(s(t+1), a')
Learning the Q-value

FOR each <s, a> DO
    Initialize table entry: Q̂(s, a) ← 0

Observe the current state s

WHILE (true) DO
    Select an action a and execute it
    Receive the immediate reward r
    Observe the new state s'
    Update the table entry for Q̂(s, a) as follows:
        Q̂(s, a) ← r + γ max_a' Q̂(s', a')
    Move: s ← s'
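
Below is a compact Python sketch of this loop for a deterministic world; the chain of states is the illustrative one used earlier, and actions are chosen at random purely to guarantee that every state-action pair keeps being visited.

import random

# Illustrative deterministic world: transition model and immediate rewards.
delta = {("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "right"): "G",
         ("s1", "left"): "s0", ("s2", "left"): "s1"}
r = {sa: (100 if nxt == "G" else 0) for sa, nxt in delta.items()}
gamma = 0.9

Q = {sa: 0.0 for sa in delta}                 # initialize every table entry to 0

for episode in range(200):
    s = "s0"                                  # observe the current state s
    while s != "G":
        # Select an action a that is legal in s (random here, for exploration) and execute it
        a = random.choice([act for (st, act) in delta if st == s])
        reward, s_next = r[(s, a)], delta[(s, a)]     # receive r, observe s'
        # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
        best_next = max((Q[(st, act)] for (st, act) in Q if st == s_next), default=0.0)
        Q[(s, a)] = reward + gamma * best_next
        s = s_next                            # move: s <- s'

# The table converges to values like Q(s2, right) = 100, Q(s1, right) = 90, Q(s0, right) = 81.
print(Q)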
Q-learning
Q-learning learns the expected utility of
taking a particular action a in a particular
state s (the Q-value of the pair (s, a))

[Figure: the grid world again, now annotated with the r(state, action)
immediate reward values, the V*(state) values (100, 90, 81), and the
learned Q(state, action) values (100, 90, 81, 72) for γ = 0.9.]
Q-learning

Demonstration
http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html



eps: probability of taking a random action
instead of following the optimal policy
gam: discount factor; the closer it is to 1, the more
weight is given to future reinforcements
alpha: learning rate
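
A sketch in Python of how these three parameters typically enter a single learning step, assuming the standard stochastic Q-learning update Q(s,a) ← Q(s,a) + alpha·(r + gam·max_a' Q(s',a') − Q(s,a)); the tiny environment and all names are illustrative and not part of the demo itself.

import random

def q_step(Q, s, actions, env_step, eps=0.1, gam=0.9, alpha=0.5):
    # eps: with probability eps pick a random action instead of the greedy one
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q.get((s, act), 0.0))
    reward, s_next = env_step(s, a)                      # execute a, observe r and s'
    best_next = max(Q.get((s_next, act), 0.0) for act in actions)
    old = Q.get((s, a), 0.0)
    # alpha: learning rate; gam: discount on future reinforcement
    Q[(s, a)] = old + alpha * (reward + gam * best_next - old)
    return s_next

# Tiny illustrative environment: action "go" reaches the goal G and pays 100.
def env_step(s, a):
    return (100, "G") if a == "go" else (0, s)

Q, s = {}, "start"
for _ in range(50):
    s = q_step(Q, s, ["go", "stay"], env_step)
    if s == "G":
        s = "start"
print(Q)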
Temporal Difference Learning:

Q-learning estimates a one-time-step difference:
Q^(1)(s(t), a(t)) ≡ r(t) + γ max_a Q̂(s(t+1), a)

Why not do it for n steps?
Q^(n)(s(t), a(t)) ≡ r(t) + γ r(t+1) + ... + γ^(n-1) r(t+n-1) + γ^n max_a Q̂(s(t+n), a)
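
A sketch in Python of the n-step estimate above, assuming the rewards r(t), ..., r(t+n-1) have been recorded and that max_a Q̂(s(t+n), a) is supplied as a bootstrap value; all numbers are illustrative.

# Q^(n) = r(t) + gam*r(t+1) + ... + gam^(n-1)*r(t+n-1) + gam^n * max_a Q-hat(s(t+n), a)
def n_step_estimate(rewards, bootstrap_value, gam=0.9):
    n = len(rewards)
    discounted = sum((gam ** k) * rewards[k] for k in range(n))
    return discounted + (gam ** n) * bootstrap_value

# With n = 1 this reduces to the ordinary one-step Q-learning target: 0 + 0.9 * 100 = 90.
print(n_step_estimate([0], bootstrap_value=100))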
Temporal Difference Learning:

TD(λ) formula:

Q^λ(s(t), a(t)) ≡ (1 − λ) [ Q^(1)(s(t), a(t)) + λ Q^(2)(s(t), a(t)) + λ² Q^(3)(s(t), a(t)) + ... ]

Intuitive idea: use a constant 0 ≤ λ ≤ 1 to
combine estimates from various lookahead
distances (note the normalization factor (1 − λ))
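
A sketch in Python of this weighting, assuming the n-step estimates Q^(1), Q^(2), ... have already been computed and are simply passed in as a list; the example values are illustrative.

# Q^lambda = (1 - lam) * [ Q^(1) + lam*Q^(2) + lam^2*Q^(3) + ... ]
def td_lambda_estimate(n_step_estimates, lam=0.5):
    weighted = sum((lam ** (n - 1)) * q for n, q in enumerate(n_step_estimates, start=1))
    return (1.0 - lam) * weighted

# lam = 0 keeps only the one-step estimate; lam close to 1 spreads weight over longer lookaheads.
print(td_lambda_estimate([90.0, 81.0, 100.0], lam=0.0))   # 90.0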

Genetic algorithms




Imagine the individuals as agent functions
Fitness function as a performance measure or
reward function
No attempt is made to learn the relationship
between the rewards and the actions taken by an
agent
Simply searches directly in the space of
individuals to find one that maximizes the
fitness function
Genetic algorithms



Represent an individual as a binary string
Selection works like this: if individual X scores twice
as high as Y on the fitness function, then X is twice
as likely to be selected for reproduction as Y
Reproduction is accomplished by cross-over and
mutation
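
A compact Python sketch of one generation of this scheme, with binary strings, fitness-proportionate selection, single-point cross-over and bit-flip mutation; the fitness function (the number of 1-bits) and all parameter values are purely illustrative.

import random

def fitness(bits):                     # illustrative fitness: count of 1-bits
    return sum(bits) + 1e-9            # tiny offset so all-zero strings can still be selected

def next_generation(population, mutation_rate=0.01):
    weights = [fitness(ind) for ind in population]
    new_pop = []
    for _ in range(len(population)):
        # Fitness-proportionate selection: twice the fitness means twice the chance.
        p1, p2 = random.choices(population, weights=weights, k=2)
        cut = random.randrange(1, len(p1))                 # single-point cross-over
        child = p1[:cut] + p2[cut:]
        child = [b ^ 1 if random.random() < mutation_rate else b for b in child]
        new_pop.append(child)
    return new_pop

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for _ in range(50):
    pop = next_generation(pop)
print(max(pop, key=fitness))           # typically close to the all-ones string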
Cart-Pole balancing

Demonstration
http://www.bovine.net/~jlawson/hmc/pole/sane.html
Summary




RL addresses the problem of learning control
strategies for autonomous agents
In Q-learning an evaluation function over
states and actions is learned
TD algorithms learn by iteratively reducing
the differences between the estimates
produced by the agent at different times
In the genetic approach, the relation between
rewards and actions is not learned; the space of
individuals is simply searched directly