Reward Functions for Accelerated Learning
Presented by Alp Sardağ
Why RL?
RL is a methodology of choice for learning in a variety of different domains:
– Convergence properties.
– Potential biological relevance.
RL is good at:
– Game playing
– Simulations
Cause of Failure
The fundamental assumption of RL models is that the agent-environment interaction can be modeled as an MDP (a minimal interaction loop is sketched below):
– The agent A and the environment E are synchronized finite state automata.
– A and E interact in discrete time intervals.
– A can sense the state of E and use it to act.
– After A acts, E transitions to a new state.
– A receives a reward after performing an action.
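A minimal sketch of this interaction loop under the assumptions above; the class names, toy dynamics, and reward values are illustrative and not taken from the paper.

```python
# Sketch of the assumed agent-environment loop: A and E as finite automata
# stepping in discrete time, with A sensing E's state, acting, and receiving
# a reward. All dynamics and rewards are toy values.

class Environment:
    def __init__(self):
        self.state = 0                            # E is a finite state automaton

    def step(self, action):
        # After A acts, E transitions to a new state and emits a reward.
        self.state = (self.state + action) % 4    # toy deterministic dynamics
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward

class Agent:
    def act(self, state):
        # A senses the state of E and uses it to choose an action.
        return 1

env, agent = Environment(), Agent()
for t in range(8):                                # discrete time intervals
    action = agent.act(env.state)
    state, reward = env.step(action)
    print(t, action, state, reward)
```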
States vs. Descriptors
Traditional RL depends on accurate state information, whereas in physical robot environments:
– Even for the simplest agents, the state space is very large.
– Sensor inputs are noisy.
– The agent usually perceives only local information.
Transitions vs. Events
World and agent states change asynchronously, in response to events, not all of which are caused by the agent.
The same event can vary in duration under different circumstances and have different consequences.
Nondeterministic and stochastic models are closer to the real world; however, the information needed to establish a stochastic model is usually not available.
Learning Trials
Generating a complete policy requires a search over a large state space.
In the real world, the agent cannot choose which states it will transition to, and cannot visit all states.
Convergence in the real world depends on focusing only on the relevant parts of the state space.
The better the problem is formulated, the fewer learning trials are needed.
Reinforcement vs. Feedback
Current RL work uses two types of reward:
– Immediate
– Delayed
Real-world situations tend to fall in between the two popular extremes:
– Some immediate rewards
– Plenty of intermittent rewards
– Few very delayed rewards
Multiple Goals
Traditional RL deals with specialized problems in which the learning task can be specified with a single goal. The problems:
– A very specific task is learned.
– It conflicts with any future learning.
The extensions (one option is sketched below):
– Sequentially formulated goals, where the state space explicitly encodes which goals have been reached so far.
– A separate state space and reward function for each goal.
– W-learning: competition among selfish Q-learners.
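A minimal sketch of the second option above, a separate table and reward function per goal; the goal names, state encoding, and update rule are illustrative assumptions, not the formulation used in the paper.

```python
# Per-goal value tables and per-goal reward functions (illustrative).

goals = ["get-puck", "go-home"]

tables = {g: {} for g in goals}                   # one condition-behavior table per goal

reward_fns = {
    "get-puck": lambda s: 1.0 if s["have-puck?"] else 0.0,
    "go-home":  lambda s: 1.0 if s["at-home?"] else 0.0,
}

def update(goal, condition, behavior, state, alpha=0.1):
    """Update only the table of the goal whose reward fired."""
    r = reward_fns[goal](state)
    old = tables[goal].get((condition, behavior), 0.0)
    tables[goal][(condition, behavior)] = old + alpha * (r - old)
```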
Goal
Given the complexity and uncertainty of real-world domains, the goal is a learning model that minimizes the state space and maximizes the amount of learning at each trial.
Intermediate Rewards
Intermittent rewards can be introduced by:
– Reinforcing multiple goals.
– Using progress estimators.
Heterogeneous reinforcement function: since multiple goals exist in the real world, it is natural to reinforce each of them individually rather than through a single monolithic goal.
Progress Estimators
Partial internal critics associated with a specific goal provide a metric of improvement relative to that goal. They are important in noisy worlds:
– They decrease the learner's sensitivity to intermittent errors.
– They encourage exploration; without them, the agent can thrash, repeatedly attempting inappropriate behaviors.
An example is sketched below.
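A hedged sketch of what such a partial critic could look like for a homing goal; the function signature, distance source, and reward magnitudes are assumptions.

```python
# Progress estimator for a homing goal: while a puck is carried, reinforce any
# decrease in distance to home and penalize moving away.

def homing_progress(prev_distance, curr_distance, have_puck):
    """Partial internal critic tied to the homing goal."""
    if not have_puck:
        return 0.0     # the estimator only judges progress toward its own goal
    if curr_distance < prev_distance:
        return 0.1     # measurable progress: reinforce, even amid noisy steps
    if curr_distance > prev_distance:
        return -0.1    # moving away: discourage thrashing on a bad behavior
    return 0.0
```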
Experimental Design
To validate the proposed approach, experiments were designed to compare the new RL formulation with traditional RL, covering:
– Robots
– Learning task
– Learning algorithm
– Control algorithm
Robots
The experiments use four fully autonomous R2 mobile robots, each consisting of:
– A differentially steerable base
– A gripper for lifting objects
– A piezo-electric bump sensor for detecting contact collisions and monitoring the grasping force
– A set of IR sensors for obstacle avoidance
– Radio transceivers, used for determining absolute position
Robot Algorithm
The robots are programmed in the Behavior Language:
– Based on the subsumption architecture.
– A parallel control system formed from concurrently active behaviors, some of which gather information, some drive effectors, and some monitor progress and contribute reinforcement.
The Learning Task
The learning task consists of finding a mapping from conditions to behaviors that yields the most efficient policy for group foraging.
Basic behaviors from which to learn behavior selection:
– Avoiding
– Searching
– Resting
– Dispersing
– Homing
The Learning Task Cont.
The state space can be reduced to the cross-product of the following state variables (sketched below):
– Have-puck?
– At-home?
– Near-intruder?
– Night-time?
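A small sketch of the reduced state space implied by these four binary variables; the tuple-of-booleans encoding is an illustrative choice.

```python
# The cross-product of four binary predicates gives at most 2**4 = 16 states,
# versus the robot's full sensor space.

from itertools import product

state_variables = ["have-puck?", "at-home?", "near-intruder?", "night-time?"]
states = list(product([False, True], repeat=len(state_variables)))
print(len(states))   # 16
```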
Learning Task Cont.
Some behaviors are kept instinctive because learning them has a high cost:
– As soon as the robot detects a puck between its fingers, it grasps it.
– As soon as the robot reaches the home region, it drops the puck if it is carrying one.
– Whenever the robot is too near an obstacle, it avoids it.
The Learning Algorithm
The algorithm produces and maintains a matrix that keeps the appropriateness of each behavior associated with each state.
The values in the matrix fluctuate over time based on the received reinforcement, and are updated asynchronously, with any received reward.
The Learning Algorithm
The algorithm sums the reinforcement over time.
The influence of the different types of feedback is weighted by feedback constants, as in the sketch below.
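A hedged sketch of such an update: weighted feedback is accumulated into a condition-behavior matrix. The feedback names and constant values are assumptions, since the slide's own formulas are not reproduced in this transcript.

```python
# Appropriateness values A[(condition, behavior)] accumulate reinforcement over
# time, with each type of feedback scaled by a constant (values assumed).

from collections import defaultdict

A = defaultdict(float)        # condition-behavior appropriateness matrix

weights = {"event": 1.0, "intruder-progress": 0.5, "homing-progress": 0.5}

def reinforce(condition, behavior, feedback):
    """Asynchronously add weighted reinforcement for the active pair."""
    A[(condition, behavior)] += sum(weights[k] * v for k, v in feedback.items())

# e.g. reinforce(("have-puck?", "not-at-home"), "homing", {"homing-progress": 0.1})
```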
The Control Algorithm
Whenever an event is detected, the following control sequence is executed:
– Appropriate reinforcement is delivered for the current condition-behavior pair,
– the current behavior is terminated,
– another behavior is selected.
Behaviors are selected according to the following rule (sketched below):
– Choose an untried behavior if one is available.
– Otherwise, choose the best behavior.
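A hedged sketch of this event-driven rule; the data layout (a dictionary of appropriateness values and a set of tried pairs) is an assumption rather than the paper's code.

```python
import random

def select_behavior(condition, behaviors, A, tried):
    untried = [b for b in behaviors if (condition, b) not in tried]
    if untried:
        return random.choice(untried)                                 # prefer an untried behavior
    return max(behaviors, key=lambda b: A.get((condition, b), 0.0))   # otherwise the best one

def on_event(condition, current_behavior, reward, A, behaviors, tried):
    # Reinforce the current condition-behavior pair, terminate it, select another.
    A[(condition, current_behavior)] = A.get((condition, current_behavior), 0.0) + reward
    tried.add((condition, current_behavior))
    return select_behavior(condition, behaviors, A, tried)
```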
Experimental Results
The following three approaches are compared (spelled out in the sketch below):
1. A monolithic single-goal (puck delivery to the home region) reward function using Q-learning: R(t) = P(t)
2. A heterogeneous reinforcement function using multiple goals: R(t) = E(t)
3. A heterogeneous reinforcement function using multiple goals and two progress estimator functions: R(t) = E(t) + I(t) + H(t)
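The three reward functions written out as code, with the components left as stubs; what P, E, I, and H compute internally is only summarized in the comments, based on the surrounding slides (I and H being the two progress estimators).

```python
def P(t): return 0.0          # placeholder: reward for puck delivery to the home region
def E(t): return 0.0          # placeholder: heterogeneous reinforcement over multiple goals
def I(t): return 0.0          # placeholder: first progress estimator
def H(t): return 0.0          # placeholder: second progress estimator

def R_monolithic(t):          # approach 1: single goal, Q-learning
    return P(t)

def R_heterogeneous(t):       # approach 2: multiple goals
    return E(t)

def R_full(t):                # approach 3: multiple goals + two progress estimators
    return E(t) + I(t) + H(t)
```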
Experimental Results
Values are collected twice per minute.
The final learning values are collected after a 15-minute run.
Convergence is defined in terms of the relative ordering of condition-behavior pairs.
Evaluation
Given the nondeterminism and the noisy sensor inputs, the single-goal approach provides insufficient feedback, and it was vulnerable to interference.
The second learning strategy outperforms the first because it detects the achievement of subgoals on the way to the top-level goal of depositing pucks at home.
The complete heterogeneous reinforcement with progress estimators outperforms the others because it uses all of the available information for every condition and behavior.
Additional Evaluation
Each part of the policy was evaluated separately, according to the following criteria:
1. Number of trials required,
2. Correctness,
3. Stability.
Some condition-behavior pairs proved to be much more difficult to learn than others:
– those without a progress estimator,
– those involving rare states.
Discussion
– Summing reinforcement
– Scaling
– Transition models
Summing Reinforcement
Summing allows for oscillations.
In theory, the more reinforcement, the faster the learning; in practice, noise and error can have the opposite effect.
The experiments described here demonstrate that
even with a significant amount of noise, multiple
reinforcers and progress estimators significantly
accelerate learning.
Scaling
Interference was detrimental to all three approaches.
In terms of the amount of time required, the learned group foraging strategy outperformed hand-coded greedy agent strategies.
Foraging can be improved further by minimizing interference, for example by having only one robot move at a time.
Transition Models
In noisy and uncertain environments, a transition model is not available to aid the learner.
The absence of a model made it difficult to compute discounted future reward.
Future work: applying this approach to problems that involve incomplete and approximate state transition models.