Multi-Agent Systems
Lecture 10 & 11
University “Politehnica” of Bucharest
2004-2005
Adina Magda Florea
[email protected]
http://turing.cs.pub.ro/blia_2005
Machine Learning
Lecture outline
1 Learning in AI (machine learning)
2 Learning decision trees
3 Version space learning
4 Reinforcement learning
5 Learning in multi-agent systems
5.1 Learning action coordination
5.2 Learning individual performance
5.3 Learning to communicate
5.4 Layered learning
6 Conclusions
1 Learning in AI

What is machine learning?
Herbert Simon defines learning as:
“any change in a system that allows it to perform better the second time on repetition of the same task or of another task drawn from the same population” (Simon, 1983).
In ML the agent learns:
- knowledge representation of the problem domain
- problem solving rules, inferences
- problem solving strategies
3
In MAS learning the agents should learn:
- what an agent learns in ML, but in the context of a MAS (both cooperative and self-interested agents)
- how to cooperate for problem solving (cooperative agents)
- how to communicate (both cooperative and self-interested agents)
- how to negotiate (self-interested agents)
Classifying learning
Different dimensions:
- explicitly represented domain knowledge
- how the critic component (performance evaluation) of a learning agent works
- the use of knowledge of the domain/environment
4
Single agent learning
[Figure: a single learning agent. The learning process receives data from the environment and feedback from a teacher and from the performance evaluation component; its learning results (K&B, inferences, strategy) feed the problem solving component, whose results are evaluated against the environment.]
5
Self-interested learning agent
Communication
Agent
Data
Environment
Actions
Learning results
Problem Solving
K&B
Inferences
Strategy
Agent
Feed-back
Learning
Process
Self
Other
agents
Results
Agent
Feed-back
Performance
Evaluation
NB: Both in this diagram and the next, not all components or flow arrows are
always present - it depends on the type of agent (cognitive, reactive), type
of learning, etc.
6
Cooperative learning agents
[Figure: two cooperative learning agents. Each agent has its own learning process and problem solving component (K&B, inferences, strategy) with models of self and of the other agents; the agents exchange learning results through communication, receive data and feedback from the environment and from performance evaluation, and act on the environment.]
7
2 Learning decision trees
ID3 (Quinlan, '80)
- the ID3 algorithm classifies training examples into several classes
- training examples: attributes and values
2 phases:
- build the decision tree
- use the tree to classify unknown instances
Decision tree - definition
[Table: six training examples described by the attributes Shape (circle / triangle), Color (red / yellow) and Size (small / big), each labelled + or - depending on whether it is a positive example of the concept to be learned.]
8
The problem of estimating an individual's credit risk on the basis of:
- credit history,
- current debt,
- collateral,
- income

No.  Risk (Classification)  Credit History  Debt  Collateral  Income
1    High                   Bad             High  None        $0 to $15k
2    High                   Unknown         High  None        $15k to $35k
3    Moderate               Unknown         Low   None        $15k to $35k
4    High                   Unknown         Low   None        $0 to $15k
5    Low                    Unknown         Low   None        Over $35k
6    Low                    Unknown         Low   Adequate    Over $35k
7    High                   Bad             Low   None        $0 to $15k
8    Moderate               Bad             Low   Adequate    Over $35k
9    Low                    Good            Low   None        Over $35k
10   Low                    Good            High  Adequate    Over $35k
11   High                   Good            High  None        $0 to $15k
12   Moderate               Good            High  None        $15k to $35k
13   Low                    Good            High  None        Over $35k
14   High                   Bad             High  None        $15k to $35k
9
Decision tree
Income?
- $0 to $15k: High risk
- $15k to $35k: Credit history?
  - Bad: High risk
  - Unknown: Debt?
    - High: High risk
    - Low: Moderate risk
  - Good: Moderate risk
- Over $35k: Credit history?
  - Bad: Moderate risk
  - Unknown: Low risk
  - Good: Low risk
ID3 assumes that the simplest decision tree which covers all the training examples is the one that should be picked.
Ockham's (Occam's) Razor, 1324:
“It is vain to do with more what can be done with less…
Entities should not be multiplied beyond necessity.”
Information theoretic test selection in ID3
Information theory - the information content of a message
M = {m1, ..., mn}, with probabilities p(mi)
The information content of the message M:
  I(M) = Sum_{i=1..n} [-p(mi) * log2(p(mi))]
Fair coin: I(Coin_toss) = -p(heads)*log2(p(heads)) - p(tails)*log2(p(tails)) = -1/2*log2(1/2) - 1/2*log2(1/2) = 1 bit
Biased coin (p(heads) = 3/4): I(Coin_toss) = -3/4*log2(3/4) - 1/4*log2(1/4) = 0.811 bits
For the credit risk examples:
  p(risk_high) = 6/14, p(risk_moderate) = 3/14, p(risk_low) = 5/14
The information in any tree that covers the 14 examples:
  I(Tree) = -6/14*log2(6/14) - 3/14*log2(3/14) - 5/14*log2(5/14) = 1.531 bits
The information gain provided by making a test on attribute A at the root of the current tree:
if A has n values, it partitions the set of examples C into {C1, C2, ..., Cn}.
The amount of information still needed to complete the tree after making A the root:
  E(A) = Sum_{i=1..n} (|Ci| / |C|) * I(Ci)
  Gain(A) = I(C) - E(A)
For the income attribute:
  C1 = {1,4,7,11}, C2 = {2,3,12,14}, C3 = {5,6,8,9,10,13}
  E(income) = 4/14*I(C1) + 4/14*I(C2) + 6/14*I(C3) = 0.564
  Gain(income) = 1.531 - 0.564 = 0.967
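As a quick illustration, here is a small Python sketch of the I(...) and Gain(...) computations above; the dictionary-based example format and the helper names are my own, not from the lecture.

import math
from collections import Counter

def information(labels):
    """I(M): entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_information(examples, attribute, class_attr="risk"):
    """E(A): information still needed after splitting on `attribute`."""
    n = len(examples)
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[attribute], []).append(ex[class_attr])
    return sum(len(part) / n * information(part) for part in partitions.values())

def gain(examples, attribute, class_attr="risk"):
    """Gain(A) = I(C) - E(A)."""
    return information([ex[class_attr] for ex in examples]) - \
           expected_information(examples, attribute, class_attr)

# A few of the credit-risk examples (1, 2, 5) to illustrate the call:
examples = [
    {"history": "bad", "debt": "high", "collateral": "none",
     "income": "0-15k", "risk": "high"},
    {"history": "unknown", "debt": "high", "collateral": "none",
     "income": "15-35k", "risk": "high"},
    {"history": "unknown", "debt": "low", "collateral": "none",
     "income": "over35k", "risk": "low"},
]
print(gain(examples, "income"))  # ID3 picks the attribute with the highest gain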
Assessing the performance of ID3
Training set and test set - average prediction quality, "happy graphs" (learning curves)
Broadening the applicability of decision trees:
- Missing data: how to classify an instance that is missing one of the test attributes? Pretend the instance has all possible values for the attribute, weight each value according to its frequency among the examples, follow all branches and multiply the weights along the path.
- Multivalued attributes: for an attribute with a large number of possible values, use the gain ratio, which selects attributes according to Gain(A)/I(CA)
- Continuous-valued attributes: discretize
3 Version space learning
P and Q - the sets of instances that match the expressions p and q in FOPL
Expression p is more general than q iff P ⊇ Q - we say that p covers q
Example: color(X,red) covers color(ball,red)
If a concept p is more general than a concept q, then for the descriptions p(x), q(x) that classify objects as being positive examples:
  ∀x p(x) → positive(x)
  ∀x q(x) → positive(x)
p covers q iff q(x) → positive(x) is a logical consequence of p(x) → positive(x)
Concept space
The space of all candidate concepts, ordered by generality; the most general concept is obj(X,Y,Z).
A concept c is maximally specific if it covers all positive examples, none of the negative examples, and for any other concept c' that covers the positive examples, c' is more general than c. The set of such concepts is S.
A concept c is maximally general if it covers none of the negative examples, and for any other concept c' that covers none of the negative examples, c is more general than c'. The set of such concepts is G.
14
The candidate elimination algorithm
Algorithms for searching the concept space; they must avoid overgeneralization and overspecialization.
Specific to general search for the hypothesis set S: maximally specific generalizations
- Initialize S to the first positive training instance
- Let N be the set of all negative instances seen so far
- for each positive instance p do
  - for every s ∈ S do: if s does not match p, replace s with its most specific generalization that matches p
  - delete from S all hypotheses more general than some other hypothesis in S
  - delete from S all hypotheses that match a previously observed negative instance in N
- for every negative instance n do
  - delete all members of S that match n
  - add n to N to check future hypotheses for overgeneralization
15
The candidate elimination algorithm
General to specific search for the hypothesis set G: maximally general specializations
- Initialize G to contain the most general concept in the space
- Let P be the set of all positive instances seen so far
- for each negative instance n do
  - for every g ∈ G that matches n do: replace g with its most general specializations that do not match n
  - delete from G all hypotheses more specific than some other hypothesis in G
  - delete from G all hypotheses that fail to match some positive instance in P
- for every positive instance p do
  - delete all members of G that fail to match p
  - add p to P to check future hypotheses for overspecialization
16
Candidate elimination: bidirectional search, maintaining both S and G
- Initialize S to the first positive training instance and G to the most general concept in the space
- for each positive instance p do
  - delete from G all hypotheses that fail to match p
  - for every s ∈ S do: if s does not match p, replace s with its most specific generalization that matches p
  - delete from S all hypotheses more general than some other hypothesis in S
  - delete from S all hypotheses more general than some hypothesis in G
- for every negative instance n do
  - delete all members of S that match n
  - for every g ∈ G that matches n do: replace g with its most general specializations that do not match n
  - delete from G all hypotheses more specific than some other hypothesis in G
  - delete from G all hypotheses more specific than some other hypothesis in S
17
Trace of candidate elimination:
G: {obj(X, Y, Z)}        S: { }
Positive: obj(small, red, ball)
G: {obj(X, Y, Z)}        S: {obj(small, red, ball)}
Negative: obj(small, blue, ball)
G: {obj(X, red, Z)}      S: {obj(small, red, ball)}
Positive: obj(large, red, ball)
G: {obj(X, red, Z)}      S: {obj(X, red, ball)}
Negative: obj(large, red, cube)
G: {obj(X, red, ball)}   S: {obj(X, red, ball)}
18
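The following Python sketch mirrors the candidate elimination trace above for conjunctive concepts over fixed attribute tuples, with "?" standing for a variable such as X; the function names and data layout are illustrative assumptions, not from the lecture.

# Concepts are tuples like ("small", "red", "ball"); "?" matches any value.
def matches(concept, instance):
    return all(c == "?" or c == v for c, v in zip(concept, instance))

def more_general(c1, c2):
    """c1 is at least as general as c2 (covers everything c2 covers)."""
    return all(a == "?" or a == b for a, b in zip(c1, c2))

def generalize(s, positive):
    """Most specific generalization of s that matches the positive instance."""
    return tuple(a if a == b else "?" for a, b in zip(s, positive))

def specialize(g, negative, values):
    """Most general specializations of g that exclude the negative instance."""
    specs = []
    for i, a in enumerate(g):
        if a == "?":
            for v in values[i]:
                if v != negative[i]:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

def candidate_elimination(examples, values):
    G = [("?",) * len(values)]
    S = None
    for instance, label in examples:
        if label == "+":
            G = [g for g in G if matches(g, instance)]
            S = instance if S is None else generalize(S, instance)
        else:
            if S is not None and matches(S, instance):
                S = None  # inconsistent data
            G = [spec for g in G if matches(g, instance)
                 for spec in specialize(g, instance, values)
                 if S is None or more_general(spec, S)]
    return S, G

values = [("small", "large"), ("red", "blue"), ("ball", "cube")]
examples = [(("small", "red", "ball"), "+"),
            (("small", "blue", "ball"), "-"),
            (("large", "red", "ball"), "+"),
            (("large", "red", "cube"), "-")]
print(candidate_elimination(examples, values))
# -> (('?', 'red', 'ball'), [('?', 'red', 'ball')])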
4 Reinforcement learning
- Combines dynamic programming and AI machine learning techniques
- Trial-and-error interactions with a dynamic environment
- The feedback of the environment - reward or reinforcement
Two main approaches:
- search in the space of behaviors - genetic algorithms
- learn utilities based on statistical techniques and dynamic programming methods
19
A reinforcement-learning model
[Figure: the agent's behavior B receives an input i = the current state of the environment (via the input function I) and a reinforcement value r (via the reinforcement function R), and produces an action a; the environment E changes its state s according to its transition model T.]
B - agent's behavior
i - input = current state of the environment
r - value of the reinforcement (reinforcement signal)
T - model of the world
The model consists of:
- a discrete set of environment states S (s ∈ S)
- a discrete set of agent actions A (a ∈ A)
- a set of scalar reinforcement signals, typically {0, 1} or real numbers
- the transition model of the world, T
The environment may be nondeterministic:
  T : S x A → P(S), where T(s, a, s') is the probability of reaching s' by executing a in s
Environment history = a sequence of states that leads to a terminal state
20
Features varying RL
- accessible / inaccessible environment
- has (T known) / has not a model of the environment
- learn behavior / learn behavior + model
- reward received only in terminal states or in any state
- passive / active learner:
  - passive learner - learns the utilities of states (or state-action pairs)
  - active learner - learns also what to do
- how the agent represents B, namely its behavior:
  - utility functions on states or state histories (T is known)
  - action-value functions (T is not necessarily known) - assign an expected utility to taking a given action in a given state:
    E(a,e) = Sum_{e' in env(e,a)} prob(ex(a,e) = e') * utility(e')
21
The RL problem
- The agent has to find a policy π = a function which maps states to actions and which maximizes some long-run measure of reinforcement.
- The agent has to learn an optimal behavior = an optimal policy = a policy which yields the highest expected utility.
The utility function depends on the environment history (a sequence of states).
In each state s the agent receives a reward R(s).
Uh([s0, s1, ..., sn]) - utility function on histories
22
Models of optimal behavior
- Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps
  E( Sum_{t=0..h} R(s_t) )
  R(s_t) is the reward received t steps into the future.
- Infinite-horizon model: optimize the long-run reward
  E( Sum_{t=0..inf} R(s_t) )
- Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor γ, 0 ≤ γ < 1
  E( Sum_{t=0..inf} γ^t * R(s_t) )
23
Models of optimal behavior
- Additive rewards:
  Uh([s0, s1, ..., sn]) = R(s0) + R(s1) + R(s2) + ...
  if Uh is separable / additive:
  Uh([s0, ..., sn]) = R(s0) + Uh([s1, ..., sn])
- Discounted rewards:
  Uh([s0, s1, ..., sn]) = R(s0) + γ*R(s1) + γ^2*R(s2) + ...
  0 ≤ γ < 1
γ can be interpreted in several ways: as an interest rate, as a probability of living another step, or as a mathematical trick to bound an infinite sum.
24
Exploitation versus exploration
- Difference between RL and supervised learning: a reinforcement learner must explicitly explore its environment.
- The representative problem is the n-armed bandit problem.
- The agent might believe that a particular arm has a fairly high payoff probability; should it choose that arm all the time, or should it choose another one that it has less information about, but seems to be worse?
- Answers to these questions depend on how long the agent is expected to play the game; the longer the game lasts, the worse the consequences of prematurely converging to a sub-optimal arm, and the more the agent should explore.
25
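One standard way to trade off exploitation and exploration on the n-armed bandit is an epsilon-greedy rule; the lecture does not prescribe a particular strategy, so the Python sketch below is only an illustration, and the arm payoff probabilities and the epsilon value are made up.

import random

def epsilon_greedy_bandit(payoff_probs, steps=1000, epsilon=0.1):
    """Play an n-armed bandit: exploit the best estimate with prob. 1-epsilon,
    explore a random arm with prob. epsilon."""
    n = len(payoff_probs)
    counts = [0] * n          # how many times each arm was pulled
    estimates = [0.0] * n     # running estimate of each arm's payoff
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n)                        # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if random.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total

estimates, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, total)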
Utilities of states
- The utility of a state is defined in terms of the utility of state sequences: it is the expected utility of the state sequences that might follow it, if the agent follows the policy π
  U^π(s) = E( Uh(H(s,π)) | T ) = Sum_H P(H(s,π) | T) * Uh(H(s,π))
  - H - a history (state sequence) beginning in s
  - E - expected utility, Uh - utility function on histories
  - π - a policy; the expectation is determined by the transition model T and the utility function on histories Uh
- If Uh is separable / additive:
  Uh([s0, ..., sn]) = R(s0) + Uh([s1, ..., sn])
  s_t - the state reached after executing t steps using π
  U^π(s) = E( Sum_{t=0..inf} γ^t * R(s_t) | π, s0 = s )
- U(s) captures the long-term value, R(s) the short-term reward
26
A 4 x 3 environment
- The intended outcome of an action occurs with probability 0.8, and with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have rewards +1 and -1; all other states have a reward of -0.04; γ = 1.
[Figure: the 4 x 3 grid world with its state utilities:
  row 3:  0.812   0.868   0.918   +1
  row 2:  0.762   (wall)  0.660   -1
  row 1:  0.705   0.655   0.611   0.388 ]
27
- The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle; the optimal policy is
  π*(s) = argmax_a Sum_{s'} T(s,a,s') * U(s')
- The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
  U(s) = R(s) + γ max_a Sum_{s'} T(s,a,s') * U(s')
  (the Bellman equation; the utilities U(s) are its unique solution)
  U(s) = E( Sum_{t=0..inf} γ^t * R(s_t) | π, s0 = s )
28
Bellman equation for the 4 x 3 world
Equation for the state (1,1):
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),     (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                  (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                  (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }    (Right)
Up is the best action (using the state utilities of the 4 x 3 grid above).
Value Iteration
- Given the maximal expected utilities, the optimal policy is:
  π*(s) = argmax_a ( R(s) + γ Sum_{s'} T(s,a,s') * U(s') )
- Compute U*(s) using an iterative approach - Value Iteration:
  U_0(s) = R(s)
  U_{t+1}(s) = R(s) + γ max_a ( Sum_{s'} T(s,a,s') * U_t(s') )
  As t → inf, the utility values converge to the optimal values.
- When do we stop the algorithm?
  - RMS (root mean square) error of the utility values
  - policy loss
30
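A minimal value-iteration sketch in Python for a generic MDP given as a transition table T[s][a] = list of (s', probability) pairs; the data structures, the tiny two-state example, and the stopping threshold are illustrative assumptions, not part of the lecture.

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Iterate U_{t+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') * U_t(s')."""
    U = {s: R[s] for s in states}                  # U_0(s) = R(s)
    while True:
        U_new = {}
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in T[s][a]) for a in actions)
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                            # simple stopping test
            return U

# Tiny 2-state example: "stay" keeps the state, "go" moves to the other state.
states = ["s0", "s1"]
actions = ["stay", "go"]
T = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {"s0": 0.0, "s1": 1.0}
print(value_iteration(states, actions, T, R))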
Policy iteration
- Manipulate the policy directly, rather than finding it indirectly via the optimal value function
- choose an arbitrary policy π
- compute the utilities under π, i.e. solve the equations
  U_π(s) = R(s) + γ Sum_{s'} T(s, π(s), s') * U_π(s')
- improve the policy at each state:
  π(s) ← argmax_a Sum_{s'} T(s,a,s') * U_π(s')
- repeat until the policy no longer changes
31
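A corresponding policy-iteration sketch, using a fixed number of sweeps of iterative policy evaluation instead of solving the linear system exactly (a simplification on my part):

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}       # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U(s')
        for _ in range(eval_sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[s][policy[s]])
                 for s in states}
        # Policy improvement: pi(s) <- argmax_a sum_s' T(s, a, s') * U(s')
        new_policy = {s: max(actions,
                             key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
                      for s in states}
        if new_policy == policy:                   # policy is stable
            return policy, U
        policy = new_policy

# Usage: policy_iteration(states, actions, T, R) with the tiny MDP
# defined in the value-iteration sketch above.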
a) Passive reinforcement learning
ADP (Adaptive Dynamic Programming) learning
- The problem of calculating an optimal policy in an accessible, stochastic environment.
- A Markov Decision Problem (MDP) consists of <S, A, T, R>:
  S - a set of states
  A - a set of actions
  R - reward function, R: S x A → R
  T : S x A → Π(S), with Π(S) the probability distribution over the states S
  The model is Markov if the state transitions are independent of any previous environment states or agent actions.
- MDPs may have finite or infinite state and action spaces; the focus here is on finite-state, finite-action problems.
- Finding the optimal policy given a model T = calculate the utility of each state U(s) and use the state utilities to select an optimal action in each state.
32
- The utility of a state s is given by the expected utility of the history beginning at that state and following an optimal policy.
- If the utility function is separable:
  U(s) = R(s) + max_a Sum_{s'} T(s,a,s') * U(s'),  for all s        (1)
  (this is a dynamic programming formulation)
- The equation asserts that the utility of a state s is the expected instantaneous reward plus the expected (discounted) utility of the next state, using the best available action.
- The maximal expected utility, considering moments of time t0, t1, ..., is
  U*(s) = max_π E( Sum_{t=0..inf} γ^t * r_t )
  γ - the discount factor
  This is the expected infinite discounted sum of reward that the agent will gain if it is in state s and executes an optimal policy. This optimal value is unique and can be defined as the solution of the simultaneous equations (1).
33
ADP (Adaptive Dynamic Programming) learning
function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             mdp, an MDP with model T, rewards R, discount γ
             U, a table of utilities, initially empty
             Nsa, a table of frequencies for state-action pairs, initially zero
             Nsas', a table of frequencies of state-action-state triples, initially zero
             s, a, the previous state and action, initially null

  if s' is new then U[s'] ← r'; R[s'] ← r'
  if s is not null then
      increment Nsa[s,a] and Nsas'[s,a,s']
      for each t such that Nsas'[s,a,t] ≠ 0 do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← Value-Determination(π, U, mdp)
  if Terminal[s'] then s,a ← null else s,a ← s', π[s']
  return a
34
Temporal difference learning
Passive learning in an unknown environment
- The value function is no longer computed by solving a set of linear equations; it is computed iteratively.
- Use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations:
  U(s) ← U(s) + α ( R(s) + γ U(s') - U(s) )
  α is the learning rate.
- Whatever state is visited, its estimated value is updated to be closer to R(s) + γ U(s'), since R(s) is the instantaneous reward received and U(s') is the estimated value of the actually occurring next state.
35
Temporal difference learning
function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             U, a table of utilities, initially empty
             Ns, a table of frequencies for states, initially zero
             s, a, r, the previous state, action, and reward, initially null

  if s' is new then U[s'] ← r'
  if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) ( r + γ U[s'] - U[s] )
  if Terminal[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
36
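A minimal Python sketch of the TD(0) update used by the passive agent above; the per-transition reward signature and the constant learning rate are simplifying assumptions.

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U

# Example: one observed trajectory as (state, reward, next state) triples.
trajectory = [("s0", -0.04, "s1"), ("s1", -0.04, "s2"), ("s2", 1.0, "terminal")]
U = {"terminal": 0.0}
for s, r, s_next in trajectory:
    td_update(U, s, r, s_next)
print(U)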
Temporal difference learning
- Does not need a model to perform its updates
- The environment supplies the connections between neighboring states in the form of observed transitions
ADP and TD comparison
- ADP and TD both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
- TD adjusts a state to agree with the observed successor
- ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities
37
b) Active reinforcement learning
- A passive learning agent has a fixed policy that determines its behavior
- An active learning agent must decide what actions to take
- The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
Active learning of action-value functions - Q-learning
- action-value function = assigns an expected utility to taking a given action in a given state (Q-values)
- Q(a, s) - the value of doing action a in state s (expected utility)
- Q-values are related to utility values by the equation:
  U(s) = max_a Q(a, s)
38
Active learning of action-value functions - Q-learning
TD learning, unknown environment:
  Q(a,s) ← Q(a,s) + α ( R(s) + γ max_{a'} Q(a', s') - Q(a,s) )
calculated after each transition from state s to s'.
- Is it better to learn a model and a utility function, or to learn an action-value function with no model?
39
Q-learning
function Q-Learning-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: Q, a table of action values indexed by state and action
             Nsa, a table of frequencies for state-action pairs
             s, a, r, the previous state, action, and reward, initially null

  if s is not null then
      increment Nsa[s,a]
      Q[a,s] ← Q[a,s] + α(Nsa[s,a]) ( r + γ max_{a'} Q[a',s'] - Q[a,s] )
  if Terminal[s'] then s, a, r ← null
  else s, a, r ← s', argmax_{a'} f(Q[a',s'], Nsa[a',s']), r'
  return a
f(u,n) - exploration function, increasing in u and decreasing in n
40
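A small Python sketch of the tabular Q-learning update, with an epsilon-greedy choice standing in for the exploration function f(u,n); the environment interface (reset/step) is an assumption for illustration, not from the lecture.

import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode, applying
    Q(a,s) <- Q(a,s) + alpha * (r + gamma * max_a' Q(a',s') - Q(a,s)).
    Assumes env.reset() returns a state and env.step(s, a) returns (s', r, done)."""
    s = env.reset()
    done = False
    while not done:
        if random.random() < epsilon:                          # explore
            a = random.choice(actions)
        else:                                                  # exploit
            a = max(actions, key=lambda act: Q[(act, s)])
        s_next, r, done = env.step(s, a)
        best_next = 0.0 if done else max(Q[(a2, s_next)] for a2 in actions)
        Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])
        s = s_next
    return Q

Q = defaultdict(float)   # default-zero table of Q-values, keyed by (action, state)
# After defining an environment with reset()/step(), run e.g.:
# Q = q_learning_episode(env, Q, actions=["up", "down", "left", "right"])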
Generalization of RL
- The problem of learning in large state spaces is addressed through generalization techniques, which allow compact storage of learned information and transfer of knowledge between "similar" states and actions.
- The RL algorithms include a variety of mappings: S → A (policies), S → R (value functions), S x A → R (Q functions and rewards), S x A → S (deterministic transitions), and S x A x S → [0, 1] (transition probabilities).
- Some of these, such as transitions and immediate rewards, can be learned using straightforward supervised learning.
- Popular techniques: various neural network methods, fuzzy logic, logical approaches to generalization.
41
5 Learning in MAS
- The credit-assignment problem (CAP) = the problem of assigning feedback (credit or blame) for an overall performance change of the MAS (increase, decrease) to each agent that contributed to that change
- inter-agent CAP = assigns credit or blame to the external actions of agents
- intra-agent CAP = assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
- the distinction is not always obvious; an approach may address one or the other
42
5.1 Learning action coordination
- s - the current environment state
- Agent i determines the set of actions it can execute in s: Ai(s) = {Aij(s)}
- It computes the goal relevance of each action: Eij(s)
- Agent i announces a bid for each action with Eij(s) > threshold:
  Bij(s) = (α + β) * Eij(s)
  α - risk factor (small), β - noise term (to prevent convergence to local minima)
43
- The action with the highest bid is selected
- Incompatible actions are eliminated
- The process is repeated until all actions in the bids are either selected or eliminated
- A - the set of selected actions = the activity context
- Execute the selected actions
- Update the goal relevance of the actions in A:
  Eij(s) ← Eij(s) - Bij(s) + (R / |A|)
  R - external reward received
- Update the goal relevance of the actions in the previous activity context Ap (actions Akl):
  Ekl(sp) ← Ekl(sp) + (Sum_{Aij ∈ A} Bij(s)) / |Ap|
44
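A possible reading of this bidding-and-update cycle as Python; the relevance table layout, the compatibility test, and the parameter values are illustrative assumptions (and the update of the previous activity context is omitted).

import random

def select_activity_context(E, s, alpha=0.1, beta=0.05, threshold=0.0,
                            compatible=lambda a, b: True):
    """One coordination round: bids (alpha + noise) * E[agent, action, s] are
    announced for relevant actions; the highest bids are selected greedily,
    discarding actions incompatible with those already selected."""
    bids = {(ag, act): (alpha + random.uniform(0, beta)) * rel
            for (ag, act, st), rel in E.items()
            if st == s and rel > threshold}
    selected = []
    for (ag, act), bid in sorted(bids.items(), key=lambda kv: -kv[1]):
        if all(compatible(act, other) for _, other in selected):
            selected.append((ag, act))
    return selected, bids

def update_relevance(E, s, selected, bids, reward):
    """E_ij(s) <- E_ij(s) - B_ij(s) + R / |A| for the selected actions."""
    for ag, act in selected:
        E[(ag, act, s)] += -bids[(ag, act)] + reward / len(selected)
    return E

# E maps (agent, action, state) -> estimated goal relevance.
E = {("a1", "lift", "s0"): 0.6, ("a2", "push", "s0"): 0.4}
selected, bids = select_activity_context(E, "s0")
E = update_relevance(E, "s0", selected, bids, reward=1.0)
print(selected, E)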
5.2 Learning individual performance
The agent learns how to improve its individual performance in a multi-agent setting
Examples:
- Cooperative agents - learning organizational roles
- Competitive agents - learning from market conditions
45
5.2.1 Learning organizational roles (Nagendra et al.)
- Agents learn to adopt a specific role in a particular situation (state) in a cooperative MAS
- Aim = to increase the utility of the final states reached
- Each agent may play several roles in a situation
- The agents learn to select the most appropriate role
- Use reinforcement learning
- Utility, Probability, and Cost (UPC) estimates of a role in a situation
- Utility - the agent's estimate of the worth of a final state for a specific role in a situation; world states are mapped to a smaller set of situations
  S = {s0, ..., sf}
  Urs = U(sf), for the path s0 → ... → sf
46
- Probability - the likelihood of reaching a final state for a specific role in a situation
  Prs = p(sf), for the path s0 → ... → sf
- Cost - the computational cost of reaching a final state for a specific role in a situation
- Potential of a role - estimates the usefulness of a role for discovering pertinent global information and constraints (orthogonal to utilities)
- Representation:
  - Sk - vector of situations for agent k: Sk1, ..., Skn
  - Rk - vector of roles for agent k: Rk1, ..., Rkm
  - |Sk| x |Rk| x 4 values to describe UPC and Potential
47
Functioning
Phase I: Learning
Several learning cycles; in each cycle:
- each agent goes from s0 to sf and, in each situation, selects its role as the one with the highest selection probability
Probability of selecting role r in situation s:
  Pr(r) = f(Urs, Prs, Crs, Potrs) / Sum_{j ∈ Rk} f(Ujs, Pjs, Cjs, Potjs)
f - objective function used to rate the roles (e.g., f(U,P,C,Pot) = U*P*C + Pot); it depends on the domain
48
- Use reinforcement learning to update UPC and the potential of a role
- For every s in [s0, ..., sf] and the role r chosen in s:
  Urs^(i+1) = (1-α) * Urs^i + α * Usf
    i - the learning cycle
    Usf - the utility of the final state
    0 ≤ α ≤ 1 - the learning rate
  Prs^(i+1) = (1-α) * Prs^i + α * O(sf)
    O(sf) = 1 if sf is successful, 0 otherwise
49
  Potrs^(i+1) = (1-α) * Potrs^i + α * Conf(Path)
    Path = [s0, ..., sf]
    Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise
- The update rules for cost are domain dependent
Phase II: Performing
In a situation s the role r is chosen such that:
  r ∈ argmax_{j ∈ Rk} f(Ujs, Pjs, Cjs, Potjs)
50
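A sketch of the UPC bookkeeping in Python; the table layout, the example objective function f, and the parameter values are illustrative assumptions based on the update rules above.

# UPC[(role, situation)] = [Utility, Probability, Cost, Potential]
def rate(upc):
    U, P, C, Pot = upc
    return U * P * C + Pot          # example objective function f

def choose_role(UPC, situation, roles):
    """Phase II: pick the role with the best rating in this situation."""
    return max(roles, key=lambda r: rate(UPC[(r, situation)]))

def learn_from_episode(UPC, path, chosen_roles, U_final, success, conflict,
                       alpha=0.2):
    """Phase I: after one run s0..sf, update U, P and Potential of the
    role chosen in every visited situation."""
    for s, r in zip(path, chosen_roles):
        U, P, C, Pot = UPC[(r, s)]
        U = (1 - alpha) * U + alpha * U_final
        P = (1 - alpha) * P + alpha * (1.0 if success else 0.0)
        Pot = (1 - alpha) * Pot + alpha * (0.0 if conflict else 1.0)
        UPC[(r, s)] = [U, P, C, Pot]
    return UPC

UPC = {("leader", "s0"): [0.5, 0.5, 1.0, 0.0],
       ("helper", "s0"): [0.5, 0.5, 1.0, 0.0]}
UPC = learn_from_episode(UPC, ["s0"], ["leader"], U_final=0.9,
                         success=True, conflict=False)
print(choose_role(UPC, "s0", ["leader", "helper"]))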
5.2.2 Learning in market environments (Vidal & Durfee)
Agents use past experience and learned models of other agents to better sell and buy goods
- Environment = a market in which agents buy and sell information (electronic marketplace)
- Open environment
- The agents are self-interested (they maximize local utility)
{g} - a set of goods
P - set of possible prices for goods
Qg - set of possible qualities for a good g
51
- information has a cost for the seller and a value for the buyer
- information is sold at a certain price
- a buyer announces a good it needs
- sellers bid their prices for delivering the good
- the buyer selects from these bids and pays the corresponding price
- the buyer assesses the quality of the information after it receives it from the seller
Profit of a seller s for selling the good g at price p:
  Profit_sg(p) = p - c_sg
  c_sg - the cost of producing the good g by s; p - the price
Value of a good g for a buyer b:
  V_bg(p,q)
  p - the price b paid for g, q - the quality of good g
Goal:
  seller - maximize profit in a transaction
  buyer - maximize value in a transaction
52
3 types of agents
0-level agents
- they set their buying and selling prices based on their own past experience
- they do not model the behavior of other agents
1-level agents
- model other agents based on previous interactions
- they set their buying and selling prices based on these models and on past experience
- they model the other agents as 0-level agents
2-level agents
- same as 1-level agents, but they model the other agents as 1-level agents
53
Strategy of 0-level agents
0-level buyer
- learns the expected value function, f_g(p), of buying g at price p
- uses reinforcement learning:
  f_g^(i+1)(p) = (1-α) * f_g^i(p) + α * V_bg(p,q),  α_min ≤ α ≤ 1, with α = 1 for i = 0
- chooses the seller s* for supplying a good g:
  s* ∈ argmax_{s ∈ S} f_g(p_sg)
0-level seller
- learns the expected profit function, h_g(p), if it offers good g at price p
- uses reinforcement learning:
  h_g^(i+1)(p) = (1-α) * h_g^i(p) + α * Profit_sg(p)
  where Profit_sg(p) = p - c_sg if it wins the auction, 0 otherwise
- chooses the price p_s* to sell the good g so as to maximize its expected profit:
  p_s* ∈ argmax_{p ∈ P, p ≥ c_sg} h_sg(p)
54
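A sketch of the 0-level buyer and seller updates in Python; the discrete price grid, the learning-rate values, and the invented auction outcomes are illustrative assumptions.

def update_estimate(table, price, observed, alpha):
    """Common RL update: est(p) <- (1-alpha)*est(p) + alpha*observed."""
    table[price] = (1 - alpha) * table.get(price, 0.0) + alpha * observed
    return table

# 0-level buyer: f_g(p) = expected value of buying good g at price p.
f_g = {}
f_g = update_estimate(f_g, price=5, observed=8.0, alpha=1.0)   # first sample
f_g = update_estimate(f_g, price=5, observed=6.0, alpha=0.3)
# choose the seller whose asked price p_sg maximizes f_g(p_sg)
asked = {"seller1": 5, "seller2": 7}
best_seller = max(asked, key=lambda s: f_g.get(asked[s], 0.0))

# 0-level seller: h_g(p) = expected profit of offering good g at price p.
cost = 3.0
h_g = {}
for p in range(4, 9):                       # observed auctions at each price
    won = p <= 6                            # pretend that lower bids won
    h_g = update_estimate(h_g, p, (p - cost) if won else 0.0, alpha=0.5)
best_price = max((p for p in h_g if p >= cost), key=lambda p: h_g[p])
print(best_seller, best_price)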
Strategy of 1-level agents
1-level buyer
- models the sellers of good g
- does not model the other buyers
- uses a probability distribution function q_sg(x) over the qualities x of a good g delivered by seller s
- computes the expected utility, E_sg, of buying good g from seller s:
  E_sg = (1/|Qg|) Sum_{x ∈ Qg} q_sg(x) * V_bg(p_sg, x)
- chooses the seller s* for supplying good g that maximizes this expected utility:
  s* ∈ argmax_{s ∈ S} E_sg
55
1-level seller
- models the buyers of good g
- models the other sellers s' of good g
- Buyer modeling:
  uses a probability distribution function m_bg(p) - the probability that buyer b will choose price p for good g
- Seller modeling:
  uses a probability distribution function n_s'g(y) - the probability that seller s' will bid price y for good g
- computes the probability of bidding lower than a given seller s' with the price p:
  Prob_of_bidding_lower_than_s'(p) = Sum_{p'} (Prob of a bid p' by s' for which s wins)
                                   = Sum_{p'} N(g,b,s; s',p,p')
  N(g,b,s; s',p,p') = n_s'g(p') if m_bg(p') ≤ m_bg(p), and 0 otherwise
56
- computes the probability of bidding lower than all the other sellers with the price p:
  Prob_of_bidding_lower_with_p = Prod_{s' ∈ S - {s}} Prob_of_bidding_lower_than_s'(p)
- chooses the best price p* to bid so as to maximize the expected profit:
  p* ∈ argmax_{p ∈ P} (p - c_sg) * Prob_of_bidding_lower_with_p
57
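A sketch of the 1-level seller's bid choice in Python, following the formulas above; the probability tables for the buyer model m_b and the competitor model n_s' are made-up illustrations.

def prob_win_against(n_s2, m_b, p):
    """Probability of beating one competitor s' when bidding p:
    sum of n_s'(p') over the prices p' the buyer likes no more than p."""
    return sum(prob for p2, prob in n_s2.items()
               if m_b.get(p2, 0.0) <= m_b.get(p, 0.0))

def best_bid(prices, cost, m_b, competitor_models):
    """p* = argmax_p (p - cost) * prod_s' Prob_of_bidding_lower_than_s'(p)."""
    def expected_profit(p):
        win = 1.0
        for n_s2 in competitor_models.values():
            win *= prob_win_against(n_s2, m_b, p)
        return (p - cost) * win
    return max(prices, key=expected_profit)

m_b = {4: 0.5, 5: 0.3, 6: 0.2}                        # buyer's price-choice model
competitor_models = {"s2": {4: 0.2, 5: 0.5, 6: 0.3}}  # bid model of seller s2
print(best_bid(prices=[4, 5, 6], cost=3.0, m_b=m_b,
               competitor_models=competitor_models))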
5.3 Learning to communicate
- What to communicate (e.g., what information is of interest to the others)
- When to communicate (e.g., when to try doing something by itself and when to look for help)
- With which agents to communicate
- How to communicate (e.g., language, protocol, ontology)
58
Learning with which agents to communicate (Ohko et al.)
Learning which agents to ask to perform a task
- Used in a contract net protocol for task allocation, to reduce the communication needed for task announcement
- Goal = acquire and refine knowledge about other agents' task solving abilities
- Case-based reasoning is used for knowledge acquisition and refinement
A case consists of:
(1) A task specification
(2) Information about which agents solved that task or similar tasks in the past, and the quality of the provided solution
59
(1) Task specification
  Ti = {(Ai1, Vi1), ..., (Aimi, Vimi)}
  Aij - task attribute, Vij - value of the attribute
- Similar tasks:
  Sim(Ti, Tj) = Sum_r Sum_s Dist(Air, Ajs), for Air ∈ Ti, Ajs ∈ Tj
  Dist(Air, Ajs) = Sim_Attr(Air, Ajs) * Sim_Vals(Vir, Vjs)
- Set of similar tasks:
  S(T) = {Tj : Sim(T, Tj) ≥ 0.85}
60
(2) Which agents performed T or similar tasks in the past
Suitability of agent Ak for task T:
  Suit(Ak, T) = (1 / |S(T)|) Sum_{Tj ∈ S(T)} Perform(Ak, Tj)
  Perform(Ak, Tj) - the quality of the solution for Tj provided by agent Ak when it performed Tj in the past
- The agent computes { Suit(Ak, T) : Suit(Ak, T) > 0 } and selects the agent k* such that
  k* ∈ argmax_k Suit(Ak, T)
  or the first m agents with the best suitability
- After each encounter, the agent stores the tasks performed by the other agents and the solution quality
- Tradeoff between exploitation and exploration
61
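A sketch of the case-based addressee selection in Python; the similarity computation is a normalized stand-in for Sim_Attr and Sim_Vals, and the case data are invented.

def task_similarity(t1, t2):
    """Sim(Ti,Tj): sum over attribute pairs of Sim_Attr * Sim_Vals
    (here: 1 for an identical attribute, 1 for an identical value), normalized."""
    return sum((1.0 if a1 == a2 else 0.0) * (1.0 if v1 == v2 else 0.0)
               for a1, v1 in t1 for a2, v2 in t2) / max(len(t1), 1)

def suitability(agent, task, cases, threshold=0.85):
    """Average past solution quality of `agent` over tasks similar to `task`."""
    similar = [c for c in cases if task_similarity(task, c["task"]) >= threshold
               and c["agent"] == agent]
    if not similar:
        return 0.0
    return sum(c["quality"] for c in similar) / len(similar)

cases = [{"task": [("type", "deliver"), ("size", "big")], "agent": "A1", "quality": 0.9},
         {"task": [("type", "deliver"), ("size", "small")], "agent": "A2", "quality": 0.6}]
new_task = [("type", "deliver"), ("size", "big")]
best = max({"A1", "A2"}, key=lambda ag: suitability(ag, new_task, cases))
print(best)  # the agent asked first in the contract net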
5.4 Layered learning (Stone & Veloso)
- A hierarchical machine learning paradigm for MAS
- Used in simulated robotic soccer - RoboCup
Learning a direct mapping Input → Output is intractable, so the learning task L is decomposed into subtasks L1, ..., Ln.
Characteristics of the environment:
- Cooperative MAS
- Teammates and adversaries
- Hidden state - agents have only a partial world view at any given moment
- Agents have noisy sensors and actuators
- Perception and action cycles are asynchronous
- Agents must make their decisions in real time
62
- Problem: the agent receives a moving ball and must decide what to do with it: dribble, pass to a teammate, or shoot towards the goal
- Decompose the problem into 3 subtasks:
  Layer  Behavior type  Example
  L1     Individual     Ball interception
  L2     Multiagent     Pass evaluation
  L3     Team           Pass selection
- The decomposition into subtasks enables the learning of more complex behaviors
- The hierarchical task decomposition is constructed bottom-up, in a domain dependent fashion
- Learning methods are chosen to suit each task
- Learning in one layer feeds into the next layer, either by providing a portion of the behavior used for training (ball interception → pass evaluation) or by creating the input representation and pruning the action space (pass evaluation → pass selection)
63
L1 - Ball interception (individual behavior)
- Aim:
  - block or intercept opponents' shots or passes, or
  - receive passes from teammates
- Learning method: a fully connected backpropagation neural network
- Training: repeatedly shooting the ball towards a defender in front of a goal; the defender collects training examples by acting randomly and noticing when it successfully stops the ball
- Classification:
  - Saves = successful interceptions
  - Goals = unsuccessful attempts
  - Misses = shots that went wide of the goal
64
L2 - Pass evaluation (multiagent behavior)
- Uses the learned ball-interception skill as part of the behavior used for training the multiagent behavior
- Aim: the agent must decide
  - whether to pass the ball to a teammate, and
  - whether the teammate will successfully receive the ball (based on the positions and abilities of the teammates and opponents to receive or intercept a pass)
- Learning method: decision trees (C4.5)
- Training: kick the ball towards randomly placed teammates interspersed with randomly placed opponents; the intended pass recipient and the opponents all use the learned ball-interception behavior
- Classification of a potential pass to a receiver:
  - Success, with a confidence factor in (0, 1]
  - Failure, with a confidence factor in [-1, 0)
  - Miss (confidence factor = 0)
65
L3 - Pass selection (team behavior)
- Uses the learned pass-evaluation capability to create the input and output sets for learning pass selection
- Aim: the agent has the ball and must decide
  - to which teammate to pass the ball, or
  - whether to shoot on goal
- Learning method: Q-learning of a function that depends on the agent's position on the field
- Training: simulate 2 teams playing with identical behaviors other than their pass-selection policies
- Reinforcement = total goals scored
- Learns:
  - when to shoot at the goal
  - the teammate to which to pass
66
6 Conclusions
- There is no unique method or set of methods for learning in MAS
- Many approaches are based on extending ML techniques to a MAS setting
- Many approaches use reinforcement learning, but also neural networks or genetic algorithms
67
References
- S. Sen, G. Weiss. Learning in Multiagent Systems. In Multiagent Systems - A Modern Approach to Distributed Artificial Intelligence, G. Weiss (Ed.), The MIT Press, 2001, p.257-298.
- T. Ohko, et al. Addressee learning and message interception for communication load reduction in multiple robot environment. In Distributed Artificial Intelligence Meets Machine Learning, G. Weiss (Ed.), Lecture Notes in Artificial Intelligence, Vol. 1221, Springer-Verlag, 1997, p.242-258.
- M.V. Nagendra, et al. Learning organizational roles in a heterogeneous multi-agent system. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p.291-298.
- J.M. Vidal, E.H. Durfee. The impact of nested agent models in an information economy. In Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p.377-384.
- P. Stone, M. Veloso. Layered Learning. In Proc. of the Eleventh European Conference on Machine Learning, ECML-2000.
68
Web References
- An interesting set of training examples and the connection between decision trees and rules:
  http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
- Decision trees construction:
  http://www.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html
- Building Classification Models: ID3 and C4.5:
  http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
- Introduction to Reinforcement Learning:
  http://www.cs.indiana.edu/~gasser/Salsa/rl.html
- On-line book on Reinforcement Learning:
  http://www-anw.cs.umass.edu/~rich/book/the-book.html
69