Reinforcement Learning
KI2 - 11
Reinforcement Learning
Sander van Dijk
Kunstmatige Intelligentie / RuG
What is Learning?
Percepts received by an agent should be used
not only for acting, but also for improving the
agent’s ability to behave optimally in the
future to achieve its goal.
Interaction between an agent and the world
Learning Types
Supervised learning:
(Input, output) pairs of the function to be learned
can be perceived or are given.
Back-propagation
Unsupervised Learning:
No information at all about the correct output
SOM
Reinforcement learning:
Agent receives no examples and starts with no
model of the environment and no utility function.
Agent gets feedback through rewards, or
reinforcement.
Reinforcement Learning
Task
Learn how to behave successfully to achieve a
goal while interacting with an external
environment
Learn through experience from trial and error
Examples
Game playing: The agent knows it has won or lost,
but it doesn’t know the appropriate action in each
state
Control: a traffic system can measure the delay of
cars, but not know how to decrease it.
Elements of RL
Agent
State
Reward
Policy
Action
Environment
[Diagram: agent-environment interaction loop producing a trajectory
s0 --a0/r0--> s1 --a1/r1--> s2 --a2/r2--> ...]
Transition model δ: how actions influence states
Reward function r: immediate value of a state-action transition
Policy π: maps states to actions
Elements of RL
[Figure: grid world labelled with r(state, action) immediate reward values;
every transition is worth 0 except those entering the goal state G, which are worth 100.]
Elements of RL
[Figure: the same grid world showing the r(state, action) immediate reward values
alongside the V*(state) values; with γ = 0.9 the states read 100, 90, 81, ...
receding from the goal G.]
Value function: maps states to state values
Vπ(s) ≡ r(t) + γ r(t+1) + γ² r(t+2) + ...
Discount factor γ ∈ [0, 1) (here γ = 0.9)
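The discounted sum above can be computed directly; a minimal sketch (the reward sequence is illustrative, γ = 0.9 as on the slide):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r(t) over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A trajectory that earns 100 on its third step:
print(round(discounted_return([0, 0, 100]), 6))  # 0 + 0.9*0 + 0.81*100 = 81.0
```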
RL task (restated)
Execute actions in environment,
observe results.
Learn action policy π: state → action that
maximizes expected discounted reward
E[r(t) + γ r(t+1) + γ² r(t+2) + ...]
from any starting state in S
Reinforcement Learning
Target function is π: state → action
However…
We have no training examples of form
<state, action>
Training examples are of form
<<state, action>, reward>
Utility-based agents
Try to learn Vπ* (abbreviated V*)
Perform look-ahead search to choose the best
action from any state s:
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Works well if the agent knows
δ: state × action → state
r: state × action → ℝ
When the agent doesn't know δ and r, it cannot
choose actions this way
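When δ and r are known, the rule above is a single one-step look-ahead argmax. A sketch on a tiny hypothetical three-state chain (the transition table, rewards, and V* values are made up for illustration, γ = 0.9):

```python
# Hypothetical 3-state chain; entering state 2 (the goal) pays 100.
delta = {(0, "left"): 0, (0, "right"): 1,
         (1, "left"): 0, (1, "right"): 2,
         (2, "left"): 1, (2, "right"): 2}
r = {sa: (100 if s_next == 2 and sa[0] != 2 else 0)
     for sa, s_next in delta.items()}
V_star = {0: 90, 1: 100, 2: 0}  # assumed already known here

def pi_star(s, gamma=0.9):
    """Greedy look-ahead: argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ]."""
    actions = [a for (st, a) in delta if st == s]
    return max(actions, key=lambda a: r[(s, a)] + gamma * V_star[delta[(s, a)]])

print(pi_star(1))  # "right": 100 + 0.9*0 beats 0 + 0.9*90
```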
Q-values
Define new function very similar to V*
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal
action even without knowing δ or r
Using Q
π*(s) = argmax_a Q(s, a)
Learning the Q-value
Note: Q and V* are closely related:
V*(s) = max_a' Q(s, a')
This allows us to write Q recursively as
Q(s(t), a(t)) = r(s(t), a(t)) + γ V*(δ(s(t), a(t)))
             = r(s(t), a(t)) + γ max_a' Q(s(t+1), a')
Temporal Difference learning
Learning the Q-value
FOR each <s, a> DO
    Initialize table entry: Q̂(s, a) ← 0
Observe current state s
WHILE (true) DO
    Select action a and execute it
    Receive immediate reward r
    Observe new state s'
    Update table entry for Q̂(s, a):
        Q̂(s, a) ← r(s, a) + γ max_a' Q̂(s', a')
    Move: s ← s'
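The loop above, as a tabular Q-learning sketch on a hypothetical corridor environment (states 0..3, goal at 3, reward 100 on arrival; the environment itself is made up for illustration):

```python
import random

def q_learning(n_states=4, goal=3, gamma=0.9, episodes=500, seed=0):
    """Tabular Q-learning on a corridor: actions -1/+1, reward 100 at goal."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice((-1, 1))                  # explore randomly
            s2 = min(max(s + a, 0), n_states - 1)    # deterministic transition
            r = 100 if s2 == goal else 0
            # Update rule: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in (-1, 1))
            s = s2
    return Q

Q = q_learning()
print(Q[(2, 1)], Q[(1, 1)], Q[(0, 1)])  # converges to 100, 90, 81 with gamma=0.9
```

Note the 100, 90, 81 pattern matches the grid-world values on the slides: each step away from the goal multiplies the Q-value by γ.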
Q-learning
Q-learning learns the expected utility of
taking a particular action a in a particular
state s (the Q-value of the pair (s, a))
[Figure: grid world showing the r(state, action) immediate reward values,
the resulting V*(state) values, and the learned Q(state, action) values;
with γ = 0.9 the Q-values form the pattern 100, 90, 81, 72, ...
receding from the goal G.]
Representation
Explicit: a lookup table

  State   Action      Q(s, a)
  -----   ---------   -------
  2       MoveLeft    81
  2       MoveRight   100
  ...     ...         ...

Implicit:
Weighted linear function/neural network
Classical weight updating
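An implicit representation replaces the table with a parameterized function. A minimal sketch of a weighted linear Q-function with a classical delta-rule weight update (the feature encoding is an assumption made for illustration):

```python
def features(s, a):
    """Hypothetical feature vector for a 1-D state and action in {-1, +1}."""
    return [1.0, float(s), float(a), float(s) * a]

def q_value(w, s, a):
    """Q(s, a) as a weighted linear function of the features."""
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def update(w, s, a, target, alpha=0.1):
    """Classical delta-rule step: move Q(s, a) toward the given target."""
    err = target - q_value(w, s, a)
    return [wi + alpha * err * xi for wi, xi in zip(w, features(s, a))]

w = [0.0] * 4
for _ in range(100):
    w = update(w, 2, 1, 100.0)   # repeatedly fit Q(2, +1) toward 100
print(round(q_value(w, 2, 1), 2))
```

The payoff is generalization: one weight vector covers all state-action pairs, at the cost of possible interference between them.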
Exploration
Agent follows the policy deduced from the learned Q-values
It then always performs the same action in a given state,
but perhaps there is an even better action?
Exploration trades off playing it safe against learning
more: greed versus curiosity.
Extremely hard, if not impossible, to obtain optimal
exploration policy.
Randomly try actions that have not been tried often
before but avoid actions that are believed to be of
low utility
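A common compromise between greed and curiosity is ε-greedy selection: with a small probability explore a random action, otherwise exploit the learned Q-values. A sketch (the Q-table entries are illustrative, reusing the MoveLeft/MoveRight example from the table above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {(2, "MoveLeft"): 81.0, (2, "MoveRight"): 100.0}
rng = random.Random(0)
picks = [epsilon_greedy(Q, 2, ["MoveLeft", "MoveRight"], rng=rng)
         for _ in range(1000)]
print(picks.count("MoveRight"))  # mostly greedy: roughly 950 of 1000
```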
Enhancement: Q(λ)
Q-learning estimates a one-time-step difference:
Q^(1)(s(t), a(t)) = r(t) + γ max_a Q̂(s(t+1), a)
Why not for n steps?
Q^(n)(s(t), a(t)) = r(t) + γ r(t+1) + ... + γ^(n-1) r(t+n-1) + γ^n max_a Q̂(s(t+n), a)
Enhancement: Q(λ)
Q(λ) formula:
Q^λ(s(t), a(t)) = (1-λ) [ Q^(1)(s(t), a(t)) + λ Q^(2)(s(t), a(t)) + λ² Q^(3)(s(t), a(t)) + ... ]
Intuitive idea: use a constant 0 ≤ λ ≤ 1 to
combine estimates from various look-ahead
distances (note the normalization factor (1-λ))
Enhancement: Eligibility Traces
Look backward instead of forward.
Weigh updates by eligibility trace e(s, a).
On each step, decay all traces by γλ and
increment the trace for the current state-action pair by 1.
Update all state-action pairs in proportion to
their eligibility.
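The backward view above can be sketched as a single step: decay every stored trace, bump the current pair's trace, then update all pairs in proportion to their trace (the γ, λ, α values and the toy visits are illustrative):

```python
def trace_step(Q, e, s, a, td_error, gamma=0.9, lam=0.8, alpha=0.5):
    """One eligibility-trace step: decay traces, bump (s,a), update all pairs."""
    for key in e:
        e[key] *= gamma * lam               # decay every trace by gamma*lambda
    e[(s, a)] = e.get((s, a), 0.0) + 1.0    # increment current state-action pair
    for key, trace in e.items():
        Q[key] = Q.get(key, 0.0) + alpha * td_error * trace
    return Q, e

Q, e = {}, {}
Q, e = trace_step(Q, e, 0, "right", td_error=0.0)   # visit (0, right), no surprise
Q, e = trace_step(Q, e, 1, "right", td_error=10.0)  # surprise propagates backward
print(Q[(1, "right")], Q[(0, "right")])
```

Note how the earlier pair (0, "right") also gets credit, scaled by its decayed trace γλ = 0.72; that is exactly the "update all state-action pairs in proportion to their eligibility" rule.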
Genetic algorithms
Imagine the individuals as agent functions
Fitness function as performance measure or
reward function
No attempt made to learn the relationship
between the rewards and actions taken by an
agent
Simply searches directly in the space of
individuals to find one that maximizes the
fitness function
Genetic algorithms
Represent an individual as a binary string
Selection works like this: if individual X scores twice
as high as Y on the fitness function, then X is twice
as likely to be selected for reproduction as Y.
Reproduction is accomplished by cross-over and
mutation
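The selection, cross-over, and mutation steps above can be sketched as roulette-wheel sampling over binary-string individuals (the population and the bit-counting fitness function are made up for illustration):

```python
import random

def select(population, fitness, rng):
    """Roulette wheel: selection probability proportional to fitness."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(x, y, rng):
    """Single-point cross-over of two equal-length bit strings."""
    point = rng.randrange(1, len(x))
    return x[:point] + y[point:]

def mutate(ind, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return "".join(b if rng.random() > rate else "10"[int(b)] for b in ind)

rng = random.Random(0)
fitness = lambda ind: ind.count("1")   # toy fitness: number of ones
pop = ["0000", "0011", "0111"]
child = mutate(crossover(select(pop, fitness, rng),
                         select(pop, fitness, rng), rng), rng)
print(child)  # a new 4-bit individual
```

Note that, as the slide says, nothing here relates rewards to individual actions: the whole agent function is scored at once and the search happens in individual space.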
Cart – Pole balancing
Demonstration
http://www.bovine.net/~jlawson/hmc/pole/sane.html
Summary
RL addresses the problem of learning control
strategies for autonomous agents
TD-algorithms learn by iteratively reducing
the differences between the estimates
produced by the agent at different times
In Q-learning an evaluation function over
states and actions is learned
In the genetic approach, the relation between
rewards and actions is not learned. You
simply search the fitness function space.