
W-Learning: Competition Among Selfish Q-Learners
Presented by Alp Sardağ
Autonomous Mobile Robots
• Behaviour-Based AI: emphasizing intelligence as emerging from ongoing interaction with the world.
– Subsumption Architecture: by Brooks
Ideas of the Subsumption Architecture
• Default behaviour: the 'avoid all things' layer (layer 1) takes control of the robot whenever the 'look for food' layer (layer 2) is idle.
• Multiple parallel goals: which one should be given control?
The Action Selection Problem
• Brooks gives the modules full sensing and acting powers, but action selection is the job of the programmer.
• W-learning modules compete for control; in this kind of robot, action selection is not designed but learnt.
Competition Among Selfish Agents
• Make the layers peers.
• The layers compete for control.
Definition and Terms
• The agents A1,...,An are:
– selfish agents
– no cooperation
– no knowledge of each other
• Each agent Ai suggests an action ai(x), where x is the world state.
• The robot chooses one of these actions, ak(x), and executes it.
How Does the Robot Work?
• We need some way of resolving the competition.
• The idea: an agent always has an action to suggest, but it will care more at some times than at others.
Example: 'avoid the predator' and 'wander around looking for food'.
How to resolve the competition
• Each agent Ai suggests some action ai(x) with weight Wi(x); the robot executes the action ak(x) where:
Wk(x) = max Wi(x), i ∈ {1,2,...,n}
• Ak is the leader of the competition for state x.
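A minimal Python sketch of this winner-take-all arbitration (the class and function names below are illustrative assumptions, not from the original slides):

class Agent:
    # One competing module: it holds a suggested action and a W-value per state.
    def __init__(self, name, policy, W):
        self.name = name
        self.policy = policy   # policy[x]: action this agent suggests in state x
        self.W = W             # W[x]: how much this agent cares about state x

    def suggest(self, x):
        return self.policy[x], self.W.get(x, 0.0)

def choose_action(agents, x):
    # Winner-take-all: obey the agent Ak whose Wk(x) is largest.
    k = max(range(len(agents)), key=lambda i: agents[i].W.get(x, 0.0))
    action, _ = agents[k].suggest(x)
    return k, action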
Example
• No agent is explicitly aware of the existence of any other. An agent can still 'use' another agent by ceding control to it.
Example Cont.
W-Values as Action-Selection
• As opposed to agents that share information and make a compromise action, this is a winner-take-all action-selection scheme.
• The division of control is state-based rather than time-based.
– Blumberg points out that animals sometimes appear to engage in a form of time-sharing.
– The same effect can be achieved by a suitable state representation x.
Example
• Let x = (e,i) be the state.
• e: information from the external sensors.
• i = (f,c): information from the internal sensors.
• f: very hungry (2), hungry (1), not hungry (0)
• c: very dirty (2), dirty (1), clean (0)
• The weights may be ordered:
Wf((e,(2,c))) > Wf((e,(1,c))) > Wc((e,(f,2))) > Wf((e,(0,c))) > Wc((e,(f,1))) > Wc((e,(f,0)))
Engage Opportunistic Behaviour
• Hungry and thirsty animal example: food is found only in the north, water only in the south. The animal treks north and eats, but as soon as its hunger is only partially satisfied, its thirst is now higher. Even before it gets south, it will be starving again.
– 1st solution: time-based agents, which get control for some minimum amount of time.
– 2nd solution: the agents can tell the difference between immediate and distant likely payoff, and present their W-values accordingly.
• Assigning W-values to actions:
– Previous work: treated as a design problem.
– Here: use learning methods that automatically assign values to actions.
Reinforcement Learning
• By trial and error, the agent learns to take the actions which maximise its rewards.
Q-learning
[Figure: (a) a simple stochastic environment; (b) Mij is provided in PL, Maij is provided in AL]
NOTE: Transitions are probabilistic. Pxa(y) is the probability that doing a in x will lead to state y.
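One minimal way to represent such a probabilistic transition model in Python (the states, actions and probabilities here are made up for illustration):

import random

# P[(x, a)] lists the possible next states y with their probabilities Pxa(y).
P = {("start", "left"):  [("A", 0.5), ("B", 0.5)],
     ("start", "right"): [("C", 1.0)]}

def sample_next_state(x, a):
    # Draw the next state according to Pxa(y).
    states, probs = zip(*P[(x, a)])
    return random.choices(states, weights=probs, k=1)[0]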
Q-learning
• The agent is interested not in immediate rewards, but in the total discounted reward:
R = rt + γ·rt+1 + γ²·rt+2 + ...   where 0 ≤ γ < 1
• The expected total discounted reward:
V(xt) = E(R) = E(rt) + γ·E(rt+1) + γ²·E(rt+2) + ...
= E(rt) + γ·[E(rt+1) + γ·E(rt+2) + ...]
= E(rt) + γ·V(xt+1)
= Σr r·Pxa(r) + γ·Σy V(y)·Pxa(y)
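As a small numerical illustration of the discounted return (plain Python, not from the slides):

def discounted_return(rewards, gamma):
    # R = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. discounted_return([1, 0, 2], gamma=0.9) = 1 + 0.9*0 + 0.81*2 = 2.62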
Q-learning
• In the learning phase, the agent tries to build up Q-values for each pair (x,a).
– Temporal-difference learning:
Q(x,a) ← Q(x,a) + α·(r + γ·maxb Q(y,b) - Q(x,a))
where α is the learning rate and γ is the discount rate.
• Convergence of Q-learning:
Q(x,a) ← (1 - α)·Q(x,a) + α·(r + γ·maxb Q(y,b))
where α takes decreasing successive values α1, α2, ...
Let n(x,a) = 1, 2, ... be the number of times (x,a) has been visited:
α(x,a) = 1/n(x,a)
α = 1, 1/2, 1/3, ...
Q-learning
• The optimal policy:
π*(x) = a*(x), where Q(x, a*(x)) = maxa Q(x,a)
• The exploration problem (sketched below):
U(i) ← R(i) + maxa F(Σj Maij·U(j), N(a,i))
where F(u,n) = R+ if n < Ne, and u otherwise.
• A new approach to the exploration problem:
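A brief sketch of the exploration function F(u,n) defined above (R+ is the optimistic reward estimate and Ne the visit threshold; the default values are arbitrary assumptions):

def F(u, n, R_plus=2.0, N_e=5):
    # Be optimistic: assume the best possible value R+ until the
    # state-action pair has been tried at least N_e times.
    return R_plus if n < N_e else u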
Multi-Module RL
• Most work in RL has focused on single agents. In theory, any problem can be seen as just another I/O mapping to be learnt by a single agent. The scalability problem leads us to combine simple agents to solve a complex task. Some approaches are:
– Top-down: identify the task and decompose it into subtasks. Moore does this by hand; Tham learns the decomposition, where subtasks combine sequentially to solve the main task.
– Bottom-up: study the behaviour that emerges when multiple RL agents are combined in different ways. Tan studies the benefits of cooperation among agents, like ants.
Selfish Q-learners:
• Each agent is a Q-learning agent, with its own reward function and Q-values.
• Co-operation is involuntary and emerges from competition among agents.
• Let the agents be A1,…,An.
The robot works as follows:
observe x
for (all agents i):
    get suggested action ai with strength Wi(x)
find Wk(x) = max Wi(x)
execute ak
observe y
for (all agents i):
    get reward ri
    update Q and/or W
W-values
• For updating W, use the numerical Q-values.
– Static W-values: the agent promotes its action with the same strength no matter what its competition is doing:
W(x) = Q(x,a)
– W = importance: W could be the difference between the suggested action and the worst possible action (see the sketch below):
W(x) = Q(x,a) - minb Q(x,b)
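A brief sketch of these two static schemes, assuming the tabular Q from the earlier sketch (names illustrative):

def static_w(Q, x, a):
    # W(x) = Q(x, a): promote the suggested action with its raw Q-value.
    return Q[(x, a)]

def importance_w(Q, x, a, actions):
    # W(x) = Q(x, a) - min_b Q(x, b): how much is at stake if the worst action were taken.
    return Q[(x, a)] - min(Q[(x, b)] for b in actions)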
Example
Dynamic (learnt) W-values
• The previous W-values fail to take into account what the other agents are doing. Examples:
– The suggested actions may be the same.
– Another agent might be suggesting an action which would be disastrous for this agent.
• There are two kinds of state that Ai need not compete for:
– a state which is relatively unimportant to it;
– a state which is important, but where some agent Ak is suggesting an action that is also good for Ai.
Meaning of W-values
• W = (P - A): the difference between P (what is predicted to happen if we are listened to) and A (what actually happened).
– An agent does not need explicit knowledge of who it is competing with. It becomes aware of them when they stop its action from being obeyed, and it observes the y and r caused as a result.
– The agents set their own W-values in an incremental way, using the Q-values.
W-learning
• Q-learning process:
P := (1 - αQ)·P + αQ·A
• W-learning process:
W := (1 - αW)·W + αW·(P - A)
• For updating Q-values:
Qi(x,ak) := (1 - αQ)·Qi(x,ak) + αQ·(ri + γ·maxb Qi(y,b))
• For updating W-values:
Wi(x) := (1 - αW)·Wi(x) + αW·(Qi(x,ai) - (ri + γ·maxb Qi(y,b)))
NOTE: only the W-values of agents that were not obeyed are updated.
W-learning pseudo-code
State x := observe();
for (all i)
    a[i] := A[i].suggestAction(x);
find k such that W[k](x) = max over all i of W[i](x);
execute(a[k]);
State y := observe();
for (all i)
{
    r[i] := A[i].reward(x, y);
    A[i].updateQ(x, a[k], y, r[i]);
    if (i != k)
        A[i].updateW(x, a[k], y, r[i]);
}
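A compact Python sketch of one such agent and of this loop, implementing the update rules from the previous slide (class and method names are assumptions for illustration, not the original code):

from collections import defaultdict

class WLearner:
    # One selfish Q-learner with its own reward function, Q-values and W-values.
    def __init__(self, reward_fn, actions, alpha_q=0.2, alpha_w=0.2, gamma=0.9):
        self.reward_fn = reward_fn
        self.actions = actions
        self.alpha_q, self.alpha_w, self.gamma = alpha_q, alpha_w, gamma
        self.Q = defaultdict(float)   # Q[(x, a)]
        self.W = defaultdict(float)   # W[x]

    def suggest_action(self, x):
        # The action this agent would most like to see taken in x.
        return max(self.actions, key=lambda a: self.Q[(x, a)])

    def update_q(self, x, a_k, y, r):
        # Qi(x,ak) := (1 - aQ)·Qi(x,ak) + aQ·(ri + gamma·maxb Qi(y,b))
        target = r + self.gamma * max(self.Q[(y, b)] for b in self.actions)
        self.Q[(x, a_k)] = (1 - self.alpha_q) * self.Q[(x, a_k)] + self.alpha_q * target

    def update_w(self, x, y, r):
        # Wi(x) := (1 - aW)·Wi(x) + aW·(Qi(x,ai) - (ri + gamma·maxb Qi(y,b)))
        predicted = self.Q[(x, self.suggest_action(x))]
        actual = r + self.gamma * max(self.Q[(y, b)] for b in self.actions)
        self.W[x] = (1 - self.alpha_w) * self.W[x] + self.alpha_w * (predicted - actual)

def step(agents, x, execute, observe):
    # Winner-take-all arbitration followed by selfish updates.
    suggestions = [ag.suggest_action(x) for ag in agents]
    k = max(range(len(agents)), key=lambda i: agents[i].W[x])
    execute(suggestions[k])
    y = observe()
    for i, ag in enumerate(agents):
        r = ag.reward_fn(x, y)
        ag.update_q(x, suggestions[k], y, r)
        if i != k:   # only agents that were not obeyed update W
            ag.update_w(x, y, r)
    return y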
Learning Q (somewhat) Before learning W
• Ideally: 'learn Q first, then W'.
• But it is impossible to learn Q completely in finite time.
• Alternatively, learn W while Q is still being learnt:
Wi(x) := (1 - αW)·Wi(x) + αW·(1 - αQ)^T·(Qi(x,ai) - (ri + γ·maxb Qi(y,b)))
where T > 0 is the delaying rate.
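A one-method variation of the WLearner sketch above, assuming the (1 - αQ)^T reading of the delaying factor and an arbitrary value for T:

def update_w_delayed(self, x, y, r, T=2.0):
    # Scale the W update by (1 - alpha_Q)^T so that W barely moves
    # while Q is still changing quickly; larger T delays W-learning more.
    predicted = self.Q[(x, self.suggest_action(x))]
    actual = r + self.gamma * max(self.Q[(y, b)] for b in self.actions)
    delta = (1 - self.alpha_q) ** T * (predicted - actual)
    self.W[x] = (1 - self.alpha_w) * self.W[x] + self.alpha_w * delta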
After Q has been learnt
• Imagine a dynamically changing collection, with agents being continually created and destroyed over time, and the surviving agents adjusting their W-values as the nature of their competition changes. Q is learnt once, whereas W is relearnt again and again.
• Edelman's biological theory of Neural Darwinism.
Self-modifying W-values
• The update of W for Ai if ak is chosen:
Wi(x) := (1 - αW)·Wi(x) + αW·dki(x)
where dki(x) is the difference between P and A.
• If Ak leads from the start to infinity:
Wi(x) → E(dki(x))
This is why we do not update Wk(x): because E(dkk(x)) = 0.
• Benefit:
– The W-learning algorithm can handle any number of switches of leader.
Will competition ever be resolved?
• What we need to show is that the leader will not keep changing forever.
Convergence of W-learning
This process will terminate within n² steps, resolving the competition with a winner Ak such that:
Wk(x) ≥ E(dki(x))   ∀i, i ≠ k
Remark 1: More than one possible winner
Consider the matrix of expected deviations, where the entry in row i, column k is E(dki(x)):
( 0 3 0 )
( 0 0 9 )
( 0 0 0 )
Start with all Wi(x) = 0 and choose A2's action first:
W1(x) = (1 - 1)·0 + 1·d21 = 3
Now A1 is the leader.
Start again with all Wi(x) = 0 and choose A3's action first:
W2(x) = (1 - 1)·0 + 1·d32 = 9
Now A2 is the leader.
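A tiny simulation of this remark in Python, using the matrix above and an initial learning rate of 1 (illustrative only):

# D[i][k] = E(d_ki): what agent i+1 loses when agent k+1 is obeyed (0-indexed).
D = [[0, 3, 0],
     [0, 0, 9],
     [0, 0, 0]]

def first_update(obeyed_k):
    # All W start at 0; the first update uses alpha = 1, so Wi := d_ki for i != k.
    W = [0.0, 0.0, 0.0]
    for i in range(3):
        if i != obeyed_k:
            W[i] = D[i][obeyed_k]
    leader = max(range(3), key=lambda i: W[i])
    return W, "A%d" % (leader + 1)

print(first_update(1))   # obey A2 first -> W = [3, 0, 0], leader A1
print(first_update(2))   # obey A3 first -> W = [0, 9, 0], leader A2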
Remark 2: Should we score the winner's W?
• If we did, Wk(x) → E(dkk(x)) = 0:
the leader's W would converge to 0, and hence there would be back-and-forth competition forever under any such system.
Remark 3: Scaling, peers and unequal agents
• An agent with high rewards will end up with high W-values.
• The agents are peers because they compete on the same basis.
• All concerns may not be of equal importance.