Reward Functions for Accelerated Learning


Reward Functions for
Accelerated Learning
Presented by Alp Sardağ
Why RL?

RL is a methodology of choice for learning in
a variety of different domains:
– Convergence property.
– Potential biological relevance.

RL works well in:
– Game playing
– Simulations
Cause of Failure

The fundamental assumption of RL models is the belief
that the interaction between the agent (A) and the
environment (E) can be modeled as an MDP
(a minimal formulation is sketched below):
– A and E are synchronized finite state automata.
– A and E interact in discrete time intervals.
– A can sense the state of E and use it to act.
– After A acts, E transitions to a new state.
– A receives a reward after performing an action.
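For reference, a minimal sketch of the standard MDP formulation these assumptions describe; the notation is generic and not taken from the slides:

```latex
% Generic MDP tuple assumed by traditional RL (standard notation, not from the slides).
\[
  \mathcal{M} = \langle S, A, T, R \rangle, \qquad
  T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad
  R : S \times A \to \mathbb{R}
\]
```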
States vs. Descriptors

Traditional RL depends on accurate state
information, whereas in physical robot
environments:
– Even for the simplest agents the state space is very
large.
– Sensor inputs are noisy.
– The agent usually perceives only local information.
Transitions vs. Events

World and agent states change asynchronously, in
response to events, not all of which are caused by the agent.

The same event can vary in duration under different
circumstances and have different consequences.

Nondeterministic and stochastic models are closer
to the real world. However, the information needed to
establish a stochastic model is usually not available.
Learning Trials

Generating a complete policy requires a search in
a large state space.
 In the real world, the agent cannot choose which states
it will transition to, and cannot visit all states.
 Convergence in the real world depends on
focusing only on the relevant parts of the state space.
 The better the problem is formulated, the fewer
learning trials are needed.
Reinforcement vs. Feedback

Current RL work uses two types of reward:
– Immediate
– Delayed

Real world situations tend to fall in between
the two popular extremes.
– Some immediate rewards
– Plenty of intermittent rewards
– Few very delayed rewards
Multiple Goals

Traditional RL deals with specialized problems in
which the learning task can be specified with a
single goal. The problems:
– Only a very specific task is learned.
– It conflicts with any future learning.

Proposed extensions:
– Sequentially formulated goals, where the state space
explicitly encodes which goals have been reached so far.
– Using a separate state space and reward function for each
goal.
– W-learning: competition among selfish Q-learners
(loosely sketched below).
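A loose sketch of the "competition among selfish Q-learners" idea behind W-learning, assuming one Q-learner per goal; this is a simplified caricature rather than the exact W-value update from Humphrys' formulation, and all names here are illustrative:

```python
from collections import defaultdict

class SelfishQLearner:
    """One Q-learner per goal; each keeps its own Q-table plus a rough W-value
    measuring how much it loses when its preferred action is not executed."""
    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # (state, action) -> value for this goal
        self.w = defaultdict(float)   # state -> estimated loss when ignored
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma

    def preferred(self, state):
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Ordinary Q-learning update against this learner's own reward signal.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
        # Rough W estimate: loss relative to this learner's own preference.
        loss = self.q[(state, self.preferred(state))] - self.q[(state, action)]
        self.w[state] += self.alpha * (loss - self.w[state])

def w_select(learners, state):
    """The learner with the highest W-value in this state wins and acts."""
    winner = max(learners, key=lambda l: l.w[state])
    return winner.preferred(state)
```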
Goal

Given the complexity and uncertainty of
real-world domains, the goal is a learning model that
minimizes the state space and maximizes
the amount of learning at each trial.
Intermediate Rewards

Intermittent rewards can be introduced by:
– Reinforcing multiple goals and using progress
estimators.

Heterogeneous Reinforcement Function: in
real-world domains multiple goals exist, so it is natural
to reinforce each goal individually rather than a single
monolithic goal (a minimal sketch follows below).
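A minimal sketch of a heterogeneous reinforcement function; the event names and reward magnitudes here are illustrative assumptions, not values from the original experiments:

```python
def heterogeneous_reward(events):
    """Sum individual reinforcements, one per goal, instead of a single
    monolithic reward for the final goal only."""
    reward = 0.0
    if events.get("grasped_puck"):       # subgoal: picked up a puck
        reward += 1.0
    if events.get("dropped_puck_home"):  # top-level goal: delivered a puck at home
        reward += 3.0
    if events.get("bumped_obstacle"):    # negative reinforcement for collisions
        reward -= 1.0
    return reward
```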
Progress Estimators

Progress estimators are partial internal critics associated with
specific goals that provide a metric of
improvement relative to those goals. They
are important in noisy worlds (a minimal sketch follows this list):
– They decrease the learner's sensitivity to intermittent
errors.
– They encourage exploration; without them, the
agent can thrash, repeatedly attempting
inappropriate behaviors.
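A minimal sketch of a progress estimator for the homing goal, assuming a hypothetical position sensor; the class, thresholds, and reward values are illustrative, not the exact estimators used in the original experiments:

```python
class HomingProgressEstimator:
    """Gives small positive feedback when the robot measurably moves toward home
    while carrying a puck, and small negative feedback otherwise, instead of
    waiting for the delayed puck-delivery reward."""
    def __init__(self, home_position):
        self.home = home_position
        self.last_distance = None

    def __call__(self, position, carrying_puck):
        if not carrying_puck:
            self.last_distance = None   # estimator only applies to the homing subgoal
            return 0.0
        distance = ((position[0] - self.home[0]) ** 2 +
                    (position[1] - self.home[1]) ** 2) ** 0.5
        progress = 0.0
        if self.last_distance is not None:
            # Reward shrinking distance to home, punish lack of progress.
            progress = 0.1 if distance < self.last_distance else -0.1
        self.last_distance = distance
        return progress
```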
Experimental Design

To validate the proposed approach,
experiments were designed to compare the new
RL formulation with traditional RL:
– Robots
– Learning Task
– Learning Algorithm
– Control Algorithm
Robots

The experiments use four fully autonomous R2
mobile robots, each equipped with:
– Differential steering
– A gripper for lifting objects
– Piezo-electric bump sensors for detecting contact
collisions and monitoring the grasping force
– A set of IR sensors for obstacle avoidance
– Radio transceivers, used for determining absolute
position
Robot Algorithm

The robots are programmed in the Behavior
Language:
– Based on the subsumption architecture.
– A parallel control system formed of concurrently
active behaviors, some of which gather
information, some drive effectors, and some
monitor progress and contribute reinforcement.
The Learning Task


The learning task consists of finding a mapping from
conditions to behaviors that yields the most efficient policy for
group foraging.
The basic behaviors from which behavior selection is learned:
– Avoiding
– Searching
– Resting
– Dispersing
– Homing
The Learning Task Cont.

The state space can be reduced to the cross-product of the following state variables (enumerated in the sketch below):
– Have-puck?
– At-home?
– Near-intruder?
– Night-time?
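The four yes/no variables above give a small condition space; a minimal sketch of enumerating it (assuming each variable is binary, as the question marks suggest):

```python
from itertools import product

STATE_VARIABLES = ("have_puck", "at_home", "near_intruder", "night_time")

# Cross-product of four binary variables: 2**4 = 16 conditions in total.
STATES = [dict(zip(STATE_VARIABLES, values))
          for values in product((False, True), repeat=len(STATE_VARIABLES))]

print(len(STATES))   # 16
print(STATES[0])     # {'have_puck': False, 'at_home': False, 'near_intruder': False, 'night_time': False}
```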
Learning Task Cont.

Some behaviors are kept instinctive because learning them
has a high cost:
– As soon as the robot detects a puck between its
fingers, it grasps it.
– As soon as the robot reaches the home region, it
drops the puck if it is carrying one.
– Whenever the robot is too near an obstacle, it
avoids it.
The Learning Algorithm

The algorithm produces and maintains a
matrix in which the appropriateness of each
behavior in each state is kept.
 The values in the matrix fluctuate over
time based on received reinforcement, and
are updated asynchronously whenever a
reward is received (a minimal sketch follows below).
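A minimal sketch of this condition-behavior matrix; conditions are assumed to be hashable tuples of the four state variables, and the plain running sum mirrors the summing of reinforcement described on the next slide:

```python
from collections import defaultdict

BEHAVIORS = ("avoiding", "searching", "resting", "dispersing", "homing")

class AppropriatenessMatrix:
    def __init__(self):
        # (condition, behavior) -> accumulated reinforcement
        self.values = defaultdict(float)

    def reinforce(self, condition, behavior, reward):
        """Asynchronous update: add whatever reinforcement arrives while
        `behavior` is active under `condition`."""
        self.values[(condition, behavior)] += reward

    def best(self, condition):
        """Return the behavior currently rated most appropriate for `condition`."""
        return max(BEHAVIORS, key=lambda b: self.values[(condition, b)])
```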
The Learning Algorithm Cont.

The algorithm sums the reinforcement received over time.

The influence of the different types of feedback is
weighted by feedback constants (a plausible form is sketched below):
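A hedged sketch of one plausible form of the summed, weighted feedback, using the E, I, and H terms that appear on the results slides; the weights c_E, c_I, c_H stand in for the feedback constants and are assumptions, not values from the original work:

```latex
% Assumed form only: the value of a condition-behavior pair (s, b) accumulates the
% weighted reinforcement received while b was active under s.
\[
  A(s, b) \;=\; \sum_{t} R(t), \qquad
  R(t) \;=\; c_E\, E(t) + c_I\, I(t) + c_H\, H(t)
\]
```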
The Control Algorithm

Whenever an event is detected, the following
control sequence is executed:
– Appropriate reinforcement is delivered for the current
condition-behavior pair,
– The current behavior is terminated,
– Another behavior is selected.
 Behaviors are selected according to the following
rule:
– Choose an untried behavior if one is available.
– Otherwise, choose the best behavior (as sketched below).
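A minimal sketch of this event-driven control sequence; `matrix` is assumed to expose the reinforce()/best() interface sketched earlier, and `tried` records which behaviors have already been attempted under each condition (both are illustrative assumptions):

```python
import random

def on_event(matrix, behaviors, condition, current_behavior, reward, tried):
    """Deliver reinforcement, terminate the current behavior, and select the next one."""
    # 1. Appropriate reinforcement for the condition-behavior pair that was active.
    matrix.reinforce(condition, current_behavior, reward)
    # 2. The current behavior is terminated (here it is simply superseded by the return value).
    # 3. Choose an untried behavior if one is available, otherwise the best-rated one.
    untried = [b for b in behaviors if b not in tried.setdefault(condition, set())]
    next_behavior = random.choice(untried) if untried else matrix.best(condition)
    tried[condition].add(next_behavior)
    return next_behavior
```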
Experimental Results

The following three approaches are compared (each sketched below):
1. A monolithic single-goal (puck delivery to the home
region) reward function using Q-learning: R(t)=P(t)
2. A heterogeneous reinforcement function using
multiple goals: R(t)=E(t)
3. A heterogeneous reinforcement function using
multiple goals and two progress estimator functions:
R(t)=E(t)+I(t)+H(t)
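A minimal sketch of the three compared reward functions; P, E, I, and H are assumed here to be callables returning the corresponding reinforcement at time t (puck delivery, goal events, and the two progress estimators):

```python
def reward_monolithic(P, t):
    return P(t)                # 1. single delayed goal: puck delivered at home

def reward_heterogeneous(E, t):
    return E(t)                # 2. event-driven reinforcement for multiple goals

def reward_heterogeneous_progress(E, I, H, t):
    return E(t) + I(t) + H(t)  # 3. adds the two progress estimators
```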
Experimental Results Cont.

Values are collected twice per minute.
The final learning values are collected after a 15-minute run.
Convergence is defined as the relative ordering of condition-behavior
pairs (a sketch of this check follows below).
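A minimal sketch of the convergence check as defined above, where two learned value tables are considered equivalent when they induce the same relative ordering of behaviors for every condition; the data layout is an illustrative assumption:

```python
def same_relative_ordering(values_a, values_b, conditions, behaviors):
    """values_* map (condition, behavior) -> learned appropriateness."""
    for c in conditions:
        order_a = sorted(behaviors, key=lambda b: values_a[(c, b)], reverse=True)
        order_b = sorted(behaviors, key=lambda b: values_b[(c, b)], reverse=True)
        if order_a != order_b:
            return False
    return True
```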
Evaluation

Given the nondeterminism and noisy sensor inputs,
the single-goal reward provides insufficient feedback. It
was vulnerable to interference.
 The second learning strategy outperforms the first
because it detects the achievement of subgoals on
the way to the top-level goal of depositing pucks at
home.
 The complete heterogeneous reinforcement with
progress estimators outperforms the others because
it uses all available information for every
condition and behavior.
Additional Evaluation

Each part of the policy was evaluated separately,
according to the following criteria:
1. Number of trials required,
2. Correctness,
3. Stability.

Some condition-behavior pairs proved to be
much more difficult to learn than others:
– Those learned without progress estimators,
– Those involving rare states.
Discussion

Summing reinforcement
 Scaling
 Transition models
Summing Reinforcement

Summing allows for oscillations in the learned values.
 In theory, the more reinforcement, the faster the
learning. In practice, noise and error could have the
opposite effect.
 The experiments described here demonstrate that
even with a significant amount of noise, multiple
reinforcers and progress estimators significantly
accelerate learning.
Scaling

Interference was detrimental to all three
approaches.
 In terms of the amount of time required, the
learned group foraging strategy
outperformed hand-coded greedy agent
strategies.
 Foraging can be improved further by
minimizing interference, e.g., by letting only
one robot move at a time.
Transition Models

In noisy and uncertain environments, a
transition model is not available to aid the
learner.
 The absence of a model made it difficult to
compute discounted future reward.
 Future work: applying this approach to
problems that involve incomplete and
approximate state transition models.