Reinforcement Learning


KI2 - 11
Reinforcement Learning
Sander van Dijk
Kunstmatige Intelligentie / RuG
What is Learning?

Percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
Interaction between an agent and the world.
Learning Types

Supervised learning:
(Input, output) pairs of the function to be learned can be perceived or are given.
Example: back-propagation.

Unsupervised learning:
No information at all is given about the desired output.
Example: self-organizing maps (SOM).

Reinforcement learning:
The agent receives no examples and starts with no model of the environment and no utility function.
The agent gets feedback through rewards, or reinforcement.
Reinforcement Learning

Task
Learn how to behave successfully to achieve a goal while interacting with an external environment.
Learn through experience, from trial and error.

Examples
Game playing: the agent knows whether it has won or lost, but it does not know the appropriate action in each state.
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
Elements of RL
Agent, State, Action, Reward, Policy, Environment

[Diagram: the agent-environment interaction loop, producing the sequence s0 -a0-> s1 -a1-> s2 -a2-> ... with rewards r0, r1, r2, ...]

Transition model δ: how actions influence states
Reward R: immediate value of a state-action transition
Policy π: maps states to actions

Elements of RL
[Figure: grid world with goal state G; r(state, action) immediate reward values are 0 for every move except 100 for moves into G.]
Elements of RL
[Figure: the same grid world, showing the r(state, action) immediate reward values together with the resulting V*(state) values, e.g. 100 next to G, 90 two steps away and 81 three steps away.]
Value function: maps states to state values

V^π(s) ≡ r(t) + γ r(t+1) + γ^2 r(t+2) + ...

Discount factor γ ∈ [0, 1) (here 0.9)
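As a worked example of the grid just described (assuming the single reward of 100 on entering G and γ = 0.9): a state one step from G has V* = 100, a state two steps away has V* = 0 + 0.9 · 100 = 90, and a state three steps away has V* = 0 + 0.9 · 90 = 81.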
RL task (restated)

Execute actions in the environment, observe the results.
Learn an action policy π : state → action that maximizes the expected discounted reward

E[ r(t) + γ r(t+1) + γ^2 r(t+2) + ... ]

from any starting state in S.
Reinforcement Learning

The target function is π : state → action

However...

We have no training examples of the form <state, action>.
Training examples are of the form <<state, action>, reward>.
Utility-based agents

Try to learn V^π* (abbreviated V*).
Perform look-ahead search to choose the best action from any state s:

π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

This works well if the agent knows
δ : state × action → state
r : state × action → ℝ
When the agent doesn't know δ and r, it cannot choose actions this way.
Q-values

Define a new function, very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ or r.

Using Q:

π*(s) ≡ argmax_a Q(s, a)
Learning the Q-value

Note: Q and V* are closely related:

V*(s) = max_{a'} Q(s, a')

This allows us to write Q recursively as

Q(s(t), a(t)) ≡ r(s(t), a(t)) + γ V*(δ(s(t), a(t)))
             = r(s(t), a(t)) + γ max_{a'} Q(s(t+1), a')

This is the basis of Temporal Difference (TD) learning.
Learning the Q-value

FOR each <s, a> DO
    Initialize table entry: Q̂(s, a) ← 0
Observe current state s
WHILE (true) DO
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s'
    Update the table entry for Q̂(s, a):
        Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
    Move: record the transition from s to s' (i.e. s ← s')
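A minimal sketch of this tabular update in Python, assuming a hypothetical environment object with reset() and step() methods and a fixed action set (none of these names come from the lecture); action selection here is ε-greedy, one common choice that the slides leave open.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)                       # all table entries start at 0

    for _ in range(episodes):
        s = env.reset()                          # observe current state s
        done = False
        while not done:
            # select an action a and execute it (epsilon-greedy exploration)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # receive reward r, observe s'
            # update the table entry for Q(s, a)
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] = r + gamma * best_next
            s = s_next                           # move: s <- s'
    return Q
```

Because the grid world in the slides is deterministic, the update can overwrite the old value directly; in a stochastic setting one would blend old and new estimates with a learning rate.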
Q-learning
Q-learning learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a)).

[Figure: the grid world's r(state, action) immediate reward values, the resulting V*(state) values (100, 90, 81, ...), and the corresponding Q(state, action) values (100, 90, 81, 72, ...).]
Representation


Explicit: a table of Q-values

State  Action     Q(s, a)
2      MoveLeft   81
2      MoveRight  100
...    ...        ...

Implicit: a weighted linear function or neural network, with classical weight updating.
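One way to read the implicit representation is as a weighted linear function over state-action features, with the weights moved toward the usual TD target; the feature vectors, step size α and NumPy usage below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def q_value(w, phi_sa):
    """Implicit representation: Q(s, a) approximated as the dot product w . phi(s, a)."""
    return float(np.dot(w, phi_sa))

def weight_update(w, phi_sa, r, phi_best_next, gamma=0.9, alpha=0.1):
    """Classical weight update: nudge w so Q(s, a) moves toward r + gamma * Q(s', a*)."""
    target = r + gamma * q_value(w, phi_best_next)
    error = target - q_value(w, phi_sa)
    return w + alpha * error * phi_sa            # gradient step on the squared error
```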
Exploration

The agent follows the policy deduced from its learned Q-values.
It then always performs the same action in a given state, but perhaps there is an even better action?
Exploration is a trade-off: be safe versus learn more, greed versus curiosity.
It is extremely hard, if not impossible, to obtain an optimal exploration policy.
A practical compromise: randomly try actions that have not been tried often before, but avoid actions that are believed to be of low utility (a small sketch of such a rule follows).
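A minimal sketch of such a rule, favouring rarely tried actions via a count-based bonus; the bonus form 1/(1 + N(s, a)) and the dictionary-based tables are illustrative assumptions, not the lecture's prescription.

```python
from collections import defaultdict

def select_exploring_action(Q, counts, s, actions, bonus=1.0):
    """Prefer actions rarely tried in s, unless their learned Q-value is low.

    Q and counts are defaultdicts keyed by (state, action); the exploration
    bonus shrinks as (s, a) is tried more often.
    """
    def score(a):
        return Q[(s, a)] + bonus / (1.0 + counts[(s, a)])
    a = max(actions, key=score)
    counts[(s, a)] += 1
    return a
```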
Enhancement: Q(λ)

Q-learning estimates a one-time-step difference:

Q^(1)(s(t), a(t)) ≡ r(t) + γ max_a Q̂(s(t+1), a)

Why not do this for n steps?

Q^(n)(s(t), a(t)) ≡ r(t) + γ r(t+1) + ... + γ^(n-1) r(t+n-1) + γ^n max_a Q̂(s(t+n), a)
Enhancement: Q(λ)

Q(λ) formula:

Q^λ(s(t), a(t)) ≡ (1-λ) [ Q^(1)(s(t), a(t)) + λ Q^(2)(s(t), a(t)) + λ^2 Q^(3)(s(t), a(t)) + ... ]

Intuitive idea: use a constant 0 ≤ λ ≤ 1 to combine the estimates from various look-ahead distances (note the normalization factor (1-λ)).
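A small sketch of this forward-view combination, computing the n-step estimates from a recorded episode and blending them with the (1-λ) normalization, truncated at the episode end; the episode format (states holding one more entry than rewards) and the Q̂ table as a dict are assumptions for illustration.

```python
def n_step_estimate(rewards, states, Q, actions, t, n, gamma=0.9):
    """Q^(n): n discounted rewards, then bootstrap from max_a Q(s(t+n), a)."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * max(Q[(states[t + n], a)] for a in actions)

def q_lambda_estimate(rewards, states, Q, actions, t, gamma=0.9, lam=0.5):
    """Q^lambda: (1 - lambda) * sum over n of lambda^(n-1) * Q^(n)."""
    horizon = len(rewards) - t                   # steps remaining in the episode
    return (1 - lam) * sum(
        lam ** (n - 1) * n_step_estimate(rewards, states, Q, actions, t, n, gamma)
        for n in range(1, horizon + 1)
    )
```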

Enhancement: Eligibility Traces

Look backward instead of forward.
Weigh updates by an eligibility trace e(s, a).
On each step, decay all traces by γλ and increment the trace for the current state-action pair by 1.
Update all state-action pairs in proportion to their eligibility (one update step is sketched below).
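A minimal sketch of one backward-view update step in the style of the tabular Q-learning above; the decay by γλ and the +1 trace increment follow the slide, while the dict-based tables, the TD-error form and the step size α are illustrative assumptions.

```python
from collections import defaultdict

def trace_update(Q, e, s, a, r, s_next, actions, gamma=0.9, lam=0.7, alpha=0.1):
    """One step of Q-learning with eligibility traces (Q and e are defaultdict(float))."""
    # TD error for the current transition
    best_next = max(Q[(s_next, a_)] for a_ in actions)
    delta = r + gamma * best_next - Q[(s, a)]

    # decay all traces by gamma*lambda, then bump the trace of the current pair by 1
    for key in e:
        e[key] *= gamma * lam
    e[(s, a)] += 1.0

    # update all state-action pairs in proportion to their eligibility
    for key, trace in e.items():
        Q[key] += alpha * delta * trace
```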
Genetic algorithms

Imagine the individuals as agent functions.
The fitness function plays the role of the performance measure or reward function.
No attempt is made to learn the relationship between the rewards and the actions taken by the agent.
The algorithm simply searches directly in the space of individuals to find one that maximizes the fitness function.
Genetic algorithms

Represent an individual as a binary string.
Selection works like this: if individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y.
Reproduction is accomplished by cross-over and mutation (a small sketch of these operators follows).
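A compact sketch of these operators, assuming binary-string individuals, fitness-proportionate selection, single-point cross-over and bit-flip mutation; the fitness function is left as a parameter since the lecture does not fix one.

```python
import random

def select(population, fitness):
    """Fitness-proportionate selection: scoring twice as high means being
    twice as likely to be chosen for reproduction."""
    weights = [fitness(ind) for ind in population]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(x, y):
    """Single-point cross-over of two binary strings."""
    point = random.randrange(1, len(x))
    return x[:point] + y[point:]

def mutate(ind, rate=0.01):
    """Flip each bit with a small probability."""
    return ''.join(b if random.random() > rate else '10'[int(b)] for b in ind)

def evolve(population, fitness, generations=100):
    """Search the space of individuals directly for one maximizing fitness."""
    for _ in range(generations):
        population = [mutate(crossover(select(population, fitness),
                                       select(population, fitness)))
                      for _ in population]
    return max(population, key=fitness)
```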
Cart-Pole balancing

Demonstration:
http://www.bovine.net/~jlawson/hmc/pole/sane.html
Summary

RL addresses the problem of learning control strategies for autonomous agents.
TD algorithms learn by iteratively reducing the differences between the estimates produced by the agent at different times.
In Q-learning, an evaluation function over states and actions is learned.
In the genetic approach, the relation between rewards and actions is not learned; the space of individuals is simply searched for one that maximizes the fitness function.