Autonomic Computing
Omer F. Rana (Cardiff University)
Overview
• Illustrative example:
– Managing Web Servers
– Reference to IBM’s AC vision
• Use of SLAs to support system
management
– SLA standards, use of SLA in adaptation
• Approaches to adaptation
– Stigmergy (social insects)
– Utility-based approaches
• Toolkits
Recap … AC
• Automating the management of computer
resources
• System components are increasingly complex
– Better functionality
– Harder to fully understand that functionality
– Interaction between components not always obvious
• System admins under increasing pressure
to respond to complexity
AC … 2
• Manual tuning
– Generally script driven (requires updates to
configuration files)
– Error-prone process (requires skilled personnel)
• Automated tuning
– Try to model behaviour of the system
– Use this behaviour as a “predictive” tool to determine
likely response from system
– Design feedback control mechanisms (and use online operation to adjust control)
AC application
Can be applied at two levels:
• Individual component level
– Make each component more intelligent
– Provide support infrastructure around this
intelligent component
• Interaction level
– Facilitate better interaction between
components in some way
– Allow “useful” interactions to “emerge”
Four Concepts
• Self-configuring:
– Dynamic adaptation to changing environment
– Addition of new features dynamically
• Self-healing:
– Discover, diagnose and react to disruptions
– Handling failure and isolating a component
• Self-Optimising:
– Monitor and tune resource utilisation
– Includes: dynamic partitioning, workload management
• Self-Protecting:
– Anticipate/Identify, detect and protect from attacks
– Extend existing security infrastructure to achieve this
Relationship to other themes
• Machine Learning and AI
• Knowledge Management (Semantics)
• Coordination Mechanisms and Protocols
• System Administration
• Performance Engineering and Monitoring
• Related Emerging areas
– Ambient Intelligence
– Amorphous Computing
– Computational “Fabrics”
From Alan Ganek, IBM
(Figure sequence: a data-centre scenario plotted over time, showing #Active Servers, #Requested Servers, Actual BOPS, Predicted BOPS and Response Time)
1. Steady State
2. Monitor, Detect Surge
3. Forecast, Provision Servers
4. Monitor, Remove Servers
Apache Web Server Tuning
• Operates on a client-server basis, with limits set by the MaxClients and KeepAlive parameters
– Tuning is equivalent to modifying MaxClients and KeepAlive
• Performance Metrics
– End-user response time
– Resource utilisation
– CPU and memory utilisation
• Measure parameters on the server side
• Over-utilisation leads to thrashing and potential failure
Basis for Metrics
• Master process + pool of worker processes
• Each worker process handles interaction with a
Client
• Worker processes limited by MaxClients
• Worker Process: idle, waiting and busy
– Idle (no TCP connection made)
– Waiting (waiting for HTTP request from client)
– Busy (processing request)
• Persistent HTTP/1.1: TCP connection remains
open between consecutive HTTP requests
(reduces time to set up a connection)
• Persistent connection can be terminated by
master or client process – if waiting time
exceeds max. allowed by KeepAlive
Manual Tuning
Desired CPU level=0.5, and Memory=0.6
Dynamic Workload (additional requests at 20th Control Interval)
Manual Tuning … 2
Dynamic Workload
• To maintain CPU and Memory criteria, it is
necessary to tune manually
• Achieved by adjusting MaxClients and
KeepAlive parameters
• Dynamic workload (generally unpredictable)
requires continuous re-tuning
• Trying to follow changes resulting from dynamic
workload can be continuous process
AutoTune agents
• Autotune Adaptor Bean
– Interfaces with target system for service level
metrics
– Sets tuning parameters
• Autotune Controller Bean
– Specifies control strategies (based on data
captured)
– Interacts with system admin to configure
control strategy
AutoTune Functionality
Can set (1) control and (2) sample intervals
Manages (1) timer and (2) asynchronous events
AutoTune Architecture
Data set generator
AutoTune Agent Operations
• Three agents:
– Feedback controller design
• Model based controller
• Linear Quadratic Regulation (LQR) controller
– Modelling
• Non-production/testing mode
• Alters tuning parameters: MaxClients and KeepAlive
• Records performance metrics: CPU and memory
• Constructs a dynamic model (based on a time series)
– Run-time control
• Production mode
• Uses output from controller – dynamically adjusts MaxClients
and KeepAlive
Modelling agent
• Build a mathematical model of the system
– Queuing theory
– Data analysis based
• Mathematical model
– Requires understanding of inner workings of server
– May need to know about particular properties (exceptions) of the
way the server operates
• Data-based model (“blackbox” approach)
– Gather data of system in the “wild”
– Assume have covered sufficient number of test cases
• User Input
– Range of tuning parameters: MaxClients [1,1024]; KeepAlive [1,50]
– Maximum delay for the tuning parameters to take effect on the performance metrics: MaxClients (10m); KeepAlive (20m)
Linear Model
Feedback Control
• PID (proportional-integral-derivative) control
– Corrects the error between a measured process variable and a desired set point
– Calculates and outputs a corrective action to adjust the process accordingly
Feedback Control … 2
• Proportional: reaction to current error
• Integral: reaction based on recent error
(time based)
• Derivative: reaction based on the rate at which the error has been changing
• Use a weighted sum of the three modes
• Output as a corrective action to a control
element
Proportional Mode
• Output responds in proportion to the current measured error value
• Multiply the error by a constant Kp (the proportional gain):
m = Kp * e
m: output signal
Kp: proportional gain (Kp = 100/PB)
e: error (expected – actual)
PB: proportional band
Integral Mode
• Controller output is proportional to the amount and duration of the error
• The algorithm accumulates the proportional offset (the integral of the error) over time
• Leads to the controller approaching the required value more quickly – but contributes to system instability and may cause “overshoot”
m = (Kp / Ti) * integral of e dt
m: output signal
Ti: integral time
e: error (expected – actual)
Derivative Mode
• Acts as a braking or damping action on the controller response as it overshoots
• Uses the slope of error vs. time (the rate of error change)
• Controller may be slower to reach the required point (counters the work of the integral mode)
m = Kp * Td * (de/dt)
m: output signal
Td: derivative time
e: error (expected – actual)
Combining the three
• Output(t) = P + I + D = Kp*e(t) + Ki*(integral of e dt) + Kd*(de/dt)
• Kp = K; Ki = K/Ti; Kd = K*Td
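To make the three modes concrete, the following is a minimal sketch (not from the original slides) of a discrete PID controller in Java; the gains Kp, Ki, Kd and the control interval dt are assumed to come from a separate design step, such as the controller design agent in the AutoTune architecture.

/** Minimal discrete PID controller sketch (illustrative only). */
public class PidController {
    private final double kp, ki, kd; // proportional, integral and derivative gains
    private final double dt;         // control interval in seconds
    private double integral = 0.0;   // accumulated error
    private double prevError = 0.0;  // error at the previous interval

    public PidController(double kp, double ki, double kd, double dt) {
        this.kp = kp; this.ki = ki; this.kd = kd; this.dt = dt;
    }

    /** Returns the corrective action for the current measurement. */
    public double update(double setPoint, double measured) {
        double error = setPoint - measured;            // e = expected - actual
        integral += error * dt;                        // integral mode: accumulate error over time
        double derivative = (error - prevError) / dt;  // derivative mode: rate of error change
        prevError = error;
        return kp * error + ki * integral + kd * derivative; // weighted sum of the three modes
    }
}

In the Apache example, two such loops (or a multivariable equivalent) would drive MaxClients and KeepAlive towards the desired CPU and memory utilisation.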
Run-time Control agent
• Implements an error
feedback controller
• Makes use of a (1)
desired, and (2) actual
system utilisation
• Kp and Ki matrices
obtained by the controller
design agent
• Controller performance
– Time to recover from a
workload change in the
system
Kp = proportional control gain; Ki = integral control gain (removes steady-state error)
e(k) = error between actual and desired value at the kth interval; the integral term acts on the accumulated error
Controller Design Agent
• Relies on output of modelling
agent
• Aims to minimise a quadratic
cost function (J(Kp,Ki))
• Q, R are weighting matrices: Q is a 4x4 matrix and R is a 2x2 matrix
• Q = diag(q1,q2,q3,q4), and R = diag(r1,r2)
– q1=1, q2=2, q3=(1/10^2),
q4=(1/2^2) (10% random CPU
fluctuation, and 2% memory)
– r1=(1/50^2), r2=(1/1000^2)
Implementation
• Undertaken with ABLE – extend AutoTune agent
• Modelling agent
– Data generator extends AutotuneController bean
(extends the process() method)
– ApacheAdaptor extends AutotuneAdaptor bean
(implements socket connection with Apache Web
server)
• Run-time Controller agent
– Extends the AutotuneController bean
– Also uses the ApacheAdaptor
• Controller Design agent
– Extends the AutotuneController bean
– Extends AutotuneAdaptor to read in model
parameters from Modelling agent
Experiment setup
• Linux (v2.2.16) Apache HTTP v1.3.19
• MaxClients and KeepAlive parameters made dynamically modifiable
• Multiple clients running a workload generator
– WAGON (Web trAffic GeneratOr and beNchmark) –
Liu et al. (INRIA)
– Httperf to generate synthetic HTTP requests
– File access distributions from Webstone 2.5
• Static and Dynamic workloads used
– Static: Web page requests – session arrivals followed
a Poisson distribution (20 sessions/second)
– Dynamic: Web page requests – session arrivals
followed a Poisson distribution (10 sessions/second)
• Control Parameters
– Control interval (adaptation time): 5 seconds
– Goal: CPU=0.5 and Memory=0.6
Automatic tuning of Apache Web Server (about 50 control intervals to converge)
With Dynamic Workload (at 20th Interval) – takes 20 intervals to adjust
Types of system components
• Computer Servers
• Web Servers
• Database systems
• Devices
– Pervasive Computing
– Ubiquitous Computing
Upgrades and Problem Diagnosis
Faulty Modules
Upgrades and Problem Diagnosis
• Upgrade has 5 new autonomic modules
• Three modules found to be faulty (system
reverts to old version)
• Analyse module dependencies
• Analyse log files to infer which of the three
modules is the culprit
• Generate a “problem ticket” to software
developer
QoS Management
• QoS has been explored in:
– Computer Networks
• Bandwidth, Delay, Packet loss rate and Jitter.
– Multimedia Applications
• Frame rate and computation resource.
– Grid Computing
• Network QoS, computation and storage
requirements.
Continue …
• QoS management:
– Covers a range of different activities, from resource
specification, selection and allocation through to
resource release.
• QoS system should address the following:
– Specifying QoS requirements
– Mapping of QoS requirements to resource capability
– Negotiating QoS with resource owners
– Establishing contracts / SLAs with clients
– Reserving and allocating resources
– Monitoring parameters associated with QoS sessions
– Adapting to varying resource quality characteristics
– Terminating QoS sessions
• User Expectations vs. Resource Management
When is QoS needed?
• Interactive sessions
– Computation steering (control parameters & data
exchange)
– Interactive visualization (visualization & simulations
services)
• Response within a limited time span
• Co-scheduling or co-location support
• Application QoS
– User perception, response time, application security, etc.
• Middleware QoS
– Computation, memory and storage
• Network QoS
– Bandwidth, packet loss, delay, jitter
(Figure from SCIRun, University of Utah)
What is a Service Level Agreement
(SLA) and why is it useful for AC?
A relationship between a client and provider in the context of a particular
capability (service) provision
(Diagram: the Client asks the Provider “Can you do X for me for Y in return?” as an SLA-Offer; the Provider replies “Yes”, leading to SLA-Accept, or declines, leading to SLA-Reject)
Distinguish between:
• Discovery of a suitable provider (e.g. P2P search, directory service)
• Establishment of an SLA
• SLA as a basis to support adaptive behaviour
What is an SLA?
(Diagram: the Client sends an SLA-Offer “Can you do X for me for Y in return?”; the Provider responds “No, but I can do Z for Y” as an SLA-CounterOffer; the Client then accepts (SLA-Accept) or rejects (SLA-Reject))
What is an SLA?
(Diagram: the Client sends an SLA-Offer “Can you do X for me for Y in return?”; the Provider answers “No” but issues a dependent SLA-Offer to another provider, “Can you do Z for me for Y in return?”, creating an SLA dependency. The exchange of offers and counter-offers forms a negotiation phase, which may be single- or multi-round)
Variations
• Multi-provider SLA: a single SLA is divided across multiple providers (e.g. workflow composition)
• SLA dependencies: for an SLA to be valid, another SLA has to be agreed (e.g. co-allocation)
What is an SLA?
• Dynamically established and managed
relationship between two parties
• Objective is “delivery of a service” by one of the
parties in the context of the agreement
• Delivery involves:
– Functional and non-functional properties of service
• Management of delivery:
– Roles, rights and obligations of parties involved
Forming the Agreement
• Distinguish between:
– Agreement itself
– Mechanisms that lead to the formation of the
agreement
• Mechanisms that lead to agreement:
– Negotiation (single or multi-shot)
– One-shot creation
– Policy-based creation of agreements, etc.
SLA Life Cycle
• Identify Provider
– On completion of a discovery phase
• Define SLA
– Define what is being requested
• Agree on SLA terms
– Agree on Service Level Objectives
• Monitor SLA Violation
– Confirm whether SLOs are being violated
• Destroy SLA
– Expire SLA
• Penalty for SLA Violation
WS-Agreement
• Framework for SLA creation – interface
conforming to Web Services standards
• Service Client/Provider does not need to
be a Web Service
• Provides a two layered model:
– Agreement layer: Web Service-based
interface to create, represent and monitor
agreements
– Service layer: Application specific-layer of
service being provided
WS-Agreement
(Figure: the Agreement Layer sits above the Service Layer; the Agreement Initiator may be the Service Consumer or the Service Provider)
WS-Agreement
An agreement document contains:
• Agreement Name/ID
• Context – information about the agreement: initiator, responder, expiration time
• Service Description Terms – information about the service (generally domain dependent)
• Guarantee Terms – information about the service level: Service Level Objectives, qualifying conditions for the agreement to be valid, penalty terms, etc.
• Terms composition – how the terms are combined
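To show how these pieces fit together, the following is a minimal sketch of the agreement structure as plain Java classes; the field names mirror the terms listed above rather than the exact GFD.107 schema, so treat them as indicative only.

import java.util.List;

/** Illustrative model of a WS-Agreement-style document (names are indicative only). */
class Agreement {
    String name;                                           // agreement name/ID
    Context context;                                       // information about the agreement
    List<ServiceDescriptionTerm> serviceDescriptionTerms;  // domain-dependent description of the service
    List<GuaranteeTerm> guaranteeTerms;                    // information about the service level
}

class Context {
    String initiator;       // who initiated the agreement
    String responder;       // who responded to it
    String expirationTime;  // when the agreement ceases to be valid
}

class ServiceDescriptionTerm {
    String name;
    String description;     // e.g. the job or resource requirements being agreed
}

class GuaranteeTerm {
    String serviceLevelObjective;  // e.g. "response time < 2 seconds"
    String qualifyingCondition;    // condition under which the objective applies
    String penalty;                // consequence of violating the objective
}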
WS-Agreement Terms
From: Viktor Yarmolenko (U Manchester)
WS-Agreement
• Specification for Service Level Agreements
– Developed through GRAAP WG at the Open Grid
Forum
– WSLA (from IBM) – previous efforts
• Provides:
– Schema for agreement terms
– A very simple protocol (two stage)
– A state sequence
– Support for penalty clauses
• No support for negotiation
WS-Agreement Specification Document (GFD.107)
Data Center Scenario … 1
• Identical servers – dynamically allocated among
multiple Web apps
• For each application:
– An Application Manager (performance optimisation) interacting with a Resource Arbiter (server allocation)
– Optimisation goal (“expected business value”) defined
by an “objective function”
• Resource Arbiter goal:
– Allocate servers to maximise sum of expected
business value over all applications
– Local value functions must share a common scale
Data Center Scenario … 2
(Figure: each Application Manager reports a utility curve Vi(.) – an estimate of expected business value, e.g. payments minus penalties – to the Resource Arbiter, which returns the list of assigned servers; reinforcement learning is used to build the value estimates)
Resource Arbiter goal: allocate servers to maximize the sum of expected business value over all applications (assuming a common scale).
A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation
Gerald Tesauro et al., Proceedings of ICAC 2006, Dublin, Ireland.
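As a simple illustration of the arbiter’s optimisation step (not the hybrid reinforcement-learning method of the paper above), the sketch below allocates identical servers greedily using each application’s value curve Vi(n); greedy allocation is optimal when the curves are concave, and is only meant to show how “maximise the sum of expected business value” can be operationalised.

/**
 * Illustrative greedy resource arbiter: repeatedly give the next server to the
 * application with the largest marginal gain in expected business value.
 * valueCurves[i][n] = Vi(n), the expected value of application i with n servers.
 */
public class GreedyArbiter {
    public static int[] allocate(double[][] valueCurves, int totalServers) {
        int apps = valueCurves.length;
        int[] allocation = new int[apps];                 // servers currently assigned to each application
        for (int s = 0; s < totalServers; s++) {
            int best = -1;
            double bestGain = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < apps; i++) {
                int n = allocation[i];
                if (n + 1 >= valueCurves[i].length) continue;            // curve not defined beyond this point
                double gain = valueCurves[i][n + 1] - valueCurves[i][n]; // marginal value of one more server
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            if (best < 0 || bestGain <= 0) break;         // no application benefits from another server
            allocation[best]++;
        }
        return allocation;
    }
}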
Not all SLAs are equal
• Application: delivers events for stock trade data
• Customer classes:
– Gold customers: pay for data
– Public customers: connected over Internet
• Public customers get less information than Gold
• Gold customers expect reliable delivery
– Need for acks → increasing overhead in the system
• Cannot alter flow rate to tolerate delays
– But can support “admission” control
Utility → an abstract measure of benefit to the user (seek to maximize this given available resources)
Assumes the existence
of multiple QoS
classes
SLA Classes
Risk-Aware Limited Lookahead Control for Dynamic Resource Provisioning in Enterprise Computing
Systems, Dara Kusic and Nagarajan Kandasamy, Proceedings of ICAC 2006, Dublin, Ireland.
Control System Architecture
• r_alloc: rate given to a flow when it enters the system
• n_alloc: number of consumers admitted for each class
Utility-aware Resource Allocation in an Event Processing System, Sumeer Bhola, Mark Astley, Robert
Saccone and Michael Ward, Proceedings of ICAC 2006, Dublin, Ireland.
Control System Strategies
• Assumes knowledge of some “good” (ideal) state
• Move system towards the good/ideal state
• Impacted by:
– Response time (current → good state transition)
– Variability in the operational environment (stability of approach)
– Execution time
– Discrete domain (tuning options from a finite set)
• Feedback control
– PID
– Kalman filter
• Neural network-based control
– Use of learning approaches
• Rule-based approaches
– Use of event recognition and triggers
Kalman Filters
• Discrete time linear dynamic systems
• Modelled on a Markov chain (with noise)
• A linear operator is applied to the state to generate the new state:
x_k = F_k * x_(k-1) + B_k * u_k + w_k
F_k: state transition model applied to the previous state x_(k-1)
B_k: control input model applied to the control vector u_k
w_k: process noise (normally distributed)
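The slide only gives the process model; as a rough illustration of how such a filter operates, here is a minimal scalar (one-dimensional) Kalman filter sketch in Java. The models f, b, h and the noise variances q and r are assumed parameters, not values from any particular autonomic toolkit.

/** Minimal scalar Kalman filter sketch (illustrative; all parameters assumed). */
public class ScalarKalmanFilter {
    private final double f, b, h; // state transition, control input and observation models
    private final double q, r;    // process and measurement noise variances
    private double x, p;          // current state estimate and its variance

    public ScalarKalmanFilter(double f, double b, double h, double q, double r,
                              double initialState, double initialVariance) {
        this.f = f; this.b = b; this.h = h; this.q = q; this.r = r;
        this.x = initialState; this.p = initialVariance;
    }

    /** Predict step: x_k = f*x_(k-1) + b*u_k + process noise. */
    public void predict(double u) {
        x = f * x + b * u;
        p = f * p * f + q;
    }

    /** Update step: fold in a new measurement z_k = h*x_k + measurement noise. */
    public void update(double z) {
        double k = p * h / (h * p * h + r); // Kalman gain
        x = x + k * (z - h * x);
        p = (1 - k * h) * p;
    }

    public double estimate() { return x; }
}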
Differentiated Quality of Service
(Figure, from Joe Bigus (IBM): Silver, Gold and Platinum customers are served by a SAN Manager, which applies the corresponding Silver, Gold or Platinum policy when allocating SAN storage)
SAN Manager Scenario
Overview
• Uses the new AbleRuleAgent as a rules-based policy manager
• Models multiple quality-of-service levels (represented by rule sets)
• N systems are defined, each with an associated QoS level
• Requests include the system identifier and current utilization
• The SAN Manager (sketched below):
– Looks up the QoS level for that system
– Invokes the corresponding QoS rule set
– Rule sets recommend that allocations are either unchanged, increased or decreased
– Evaluates the recommendations and changes allocations, based on the total capacity limit
From Joe Bigus (IBM)
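A minimal sketch of the manager loop just described; the class and method names here (QosRuleSet, Recommendation, handleRequest) are hypothetical stand-ins for the ABLE rule agents used in the actual scenario.

import java.util.Map;

/** Illustrative SAN manager loop (names hypothetical; the real scenario uses ABLE rule agents). */
public class SanManager {
    enum Recommendation { NO_ACTION, INCREASE_ALLOCATION, DECREASE_ALLOCATION }

    interface QosRuleSet {
        Recommendation evaluate(double allocation, double utilization);
    }

    private final Map<String, QosRuleSet> qosBySystem; // system id -> QoS rule set (Silver/Gold/Platinum)
    private final Map<String, Double> allocations;     // system id -> current allocation
    private final double totalCapacity;                // overall capacity limit
    private final double step = 10.0;                  // illustrative allocation increment

    public SanManager(Map<String, QosRuleSet> qosBySystem, Map<String, Double> allocations,
                      double totalCapacity) {
        this.qosBySystem = qosBySystem;
        this.allocations = allocations;
        this.totalCapacity = totalCapacity;
    }

    /** Handle a request carrying the system identifier and its current utilization. */
    public void handleRequest(String systemId, double utilization) {
        QosRuleSet rules = qosBySystem.get(systemId);               // look up the QoS level for this system
        double current = allocations.get(systemId);
        Recommendation rec = rules.evaluate(current, utilization);  // invoke the corresponding rule set
        double allocated = allocations.values().stream().mapToDouble(Double::doubleValue).sum();
        if (rec == Recommendation.INCREASE_ALLOCATION && allocated + step <= totalCapacity) {
            allocations.put(systemId, current + step);              // only increase within the total capacity
        } else if (rec == Recommendation.DECREASE_ALLOCATION && current - step >= 0) {
            allocations.put(systemId, current - step);
        }
    }
}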
Platinum QoS RuleSet
// Low allocation
: if Allocation is Low and Utilization is Low
then RecommendedAction = NoAction;
: if Allocation is Low and Utilization is Normal
then RecommendedAction = NoAction;
: if Allocation is Low and Utilization is High
then
RecommendedAction = IncreaseAllocation;
// Normal allocation
: if Allocation is Normal and Utilization is Low
then RecommendedAction = DecreaseAllocation;
: if Allocation is Normal and Utilization is Normal
then RecommendedAction = NoAction;
: if Allocation is Normal and Utilization is High
then RecommendedAction = IncreaseAllocation;
// High allocation
: if Allocation is High and Utilization is Low
then RecommendedAction = DecreaseAllocation;
: if Allocation is High and Utilization is Normal
then RecommendedAction = DecreaseAllocation;
: if Allocation is High and Utilization is High
then RecommendedAction = Send.Warning_LowMem;
: if Allocation is positively High and Utilization is positively High
then RecommendedAction = Send.Warning_CritMem;
From Joe Bigus (IBM)
Dynamic SLA
• Limitations of a single agreement
– Modifications since agreement was in place
• Cost of doing re-establishment
– Not fully aware of operating environment
• Flexibility in describing Service Level
Objectives
– Not sure what to ask for (not fully aware of the
environment in which operating)
– Too many violations
Dynamic WS-Agreement
• Case 1: Static Agreement
– Identify Service Description Terms,
– Guarantee Terms, and
– Service Level Objectives (SLOs)
• Case 2: Dynamic Agreement
– Identify Service Description Terms,
– Guarantee Terms: defined as ranges or as
functions
– Service Level Objectives: defined as ranges
or as functions
From: Viktor Yarmolenko
Function-based SLA (Yarmolenko et al.)
• Express initial SLA-Offer as a function of
provider capability
From: Viktor Yarmolenko
Guarantee terms as functions
From: Viktor Yarmolenko
SLA Classes
• Guaranteed
– constraints to be exactly observed
– SLA is precisely/exactly defined
– adaptation algorithm/optimization heuristics
• Controlled-load
– some constraints may be observed
– Range-oriented SLA
– optimization heuristics
• Best-effort
– any resources will do
– no adaptation support
SLA Adaptation
Aim: compensate for QoS degradation for the ‘guaranteed’ class only
• Assume total capacity C = CG + CA + CB (guaranteed, adaptive and best-effort shares)
• ‘Best effort’ can use the adaptive capacity, as long as it is not being used by the ‘guaranteed’ class
• When QoS degrades for the ‘guaranteed’ class, the adaptive capacity is utilised to compensate for the degradation
• ‘Best effort’ can still utilise the remaining adaptive capacity, as long as it is not used by the ‘guaranteed’ class
• When the congested capacity is restored, the adaptive capacity can again be used entirely by the ‘best effort’ class
• Before invoking the adaptive function:
– Ensure that the request at time t ≤ the amount agreed upon in the SLA
– Ensure that the total capacities within all SLAs at time t ≤ CG
(Figure: capacity bands G, A and B shifting as adaptive capacity is borrowed and returned)
(Figure: a Grid node exposes a Grid QoS service interface; the QoS Grid Service contains a Policy Manager, Allocation Manager and Reservation Manager, sitting above the resources)
Main components
• Policy Manager
– Provides dynamic information about domain-specific resource characteristics and policy
• Reservation Manager
– Provides advance/immediate resource reservation
• Data structure contains reservation entries
• Interacts with the policy manager for resource characteristics
• Allocation Manager
– Interacts with the underlying resource manager for resource allocation (e.g. DSRT, Bandwidth Broker)
(Figure: a QoS Broker establishes SLAs with the QoS services on Grid nodes 1-3, each containing Policy, Reservation and Allocation components; the client's application discovers QoS-enabled services through UDDIe. Joint work with Argonne National Lab., Gregor von Laszewski et al.)
Reservation Approaches
• Resource reservation / allocation based on two
strategies:
– Time-domain: reserve the whole ‘compute’
power of Grid node.
• Guaranteed exclusive access
– Resource-domain: reserve a CPU slot of the
Grid node.
• Shared access – guaranteed resource capacity
• Suitable for light weight applications/services.
G-QoSM Architecture
(Figure: applications, portals, Swing and legacy clients use the CoG QoS Broker, built on the Java CoG Kit Core with GT2/GT3, UDDIe, QoS and Reputation handlers; the CoG QoS Grid Service comprises the Reservation, Allocation and Policy Managers, a CoG Reputation Service and a UDDIe registry, and manages Grid resources such as CPU, disk and network under a service agreement with the client)
Implementation Status
• References:
– Rashid Al-Ali, Kaizar Amin, Gregor von Laszewski, Omer Rana and David Walker. An OGSA-Based Quality of Service Framework. Proceedings of the Second International Workshop on Grid and Cooperative Computing (GCC 2003), Shanghai, China, December 2003.
– Rashid Al-Ali, Omer Rana, David Walker, Sanjay Jha and Shaleeza Sohail. G-QoSM: Grid Service Discovery Using QoS Properties. Computing and Informatics Journal, Special Issue on Grid Computing, 21 (4), 2002.
• The QoS implementation is open source, available for download from the Java CoG site http://www.globus.org/cog/java
Application Integration
1. Prepare: QoS negotiation Task
Returns: Agreement ID
2. Prepare: QoS job submission Task
3. Submit job to QoS service
QoS Job Submission Task
private void prepareQosJobSubmissionTask()
{
    // create a QoS JobSubmission Task
    Task task = new TaskImpl("myTask", QoS.JOBSUBMISSION);
    task.setAttribute("agreementToken", token);
    // create a remote job specification
    JobSpecification spec = new JobSpecificationImpl();
    // set all the job related parameters
    spec.setExecutable("/rashid/myExecutable");
    spec.setRedirected(false);
    spec.setStdOutput("QosOutput");
    // associate the specification with the task
    task.setSpecification(spec);
    // create a Globus version of the security context
    SecurityContextImpl securityContext = new GlobusSecurityContextImpl();
    securityContext.setCredential(null);
    task.setSecurityContext(securityContext);
    // contact details for the QoS Grid service
    Contact contact = new Contact("myQoScontact");
    ServiceContact service = new ServiceContactImpl(qosServiceURL);
    contact.setServiceContact("QGSurl", service);
    task.setContact(contact);
}
QoS Task Submission
/*** QoS: Task Submission to QoS Handler ***/
private void QosTaskSubmission(Task task)
{
TaskHandler handler = new QoSTaskHandlerImpl();
// submit the task to the handler
handler.submit(task);
}
(Results with Globus Toolkit 2: best-effort vs. guaranteed service classes)
Web Services Distributed
Management (WSDM)
• Management USING Web Services (MUWS)
– Web services to describe and access manageability
of resources
– Management applications use Web services
just like other applications use Web services
• Management OF Web Services (MOWS)
– An application of Management Using Web Services
for the Web Service as the IT resource
• Use Web Services as the distributed computing
platform to enable interoperability between
managers and manageable resources
WSDM Presentation
WSMF Presentation
WSDM
Disturbance Benchmarking
From Aaron Brown and Peter Shum (IBM)
Useful to compare this with performance benchmarks, which we are much more aware of; compare also with automated testing mechanisms
Behaviours and Interactions
• Interactions are not “hard coded” – but are expressed as high-level objectives, e.g.:
– Maximise this utility function
– Find a reputable message translation service
• Autonomic Service providers can say “No”
– Service provision must be consistent with
local policy and long term goals
• Policies may be expressed using logic or
other formalisms
Emergence and Self-Organisation
• Increased complexity and autonomy implies that
“global” coherent behaviours may be hard to
specify
• Concept of “Emergence”
• Interactions between autonomous systems that
can lead to useful global behaviours
– How can we constrain each individual element within
such a system?
– How can useful global behaviours be recognised
effectively?
Self Organisation
• Self-organisation is a set of dynamical processes whereby structures or order appear at the global level of a system from the interactions between its lower-level entities. The rules underlying the behaviour, and that specify the interactions among the entities, are implemented on the basis of local information, without any reference to the global pattern.
Emergence
• A dynamic, non-linear process that results in “macro-level” structures forming, based on interactions of system parts at the micro-level.
• Such emergence is “novel” – i.e. cannot
be easily understood by taking the system
apart and looking at the parts
(reductionism)
Issues
• Macro-Micro effect
• Novelty
– Global behaviour is novel
• Coherence
– Emergence has some sense of identity (i.e.
persists over some time)
• Dynamic
– Emergence arises as the system evolves over time
• Non-Linear
• Distributed/Non-Centralised Control
– Not possible to control the entire system
Influences
• Social Societies
– Emerging area of “Socionics”
• Biological Paradigms (Stigmergy)
– Ant Colonies (Social Insects)
– Swarms
• Particle Systems (fluidity and elasticity)
– Chemical reactions
– Spin Glass theory (due to temperature
changes)
Concepts of Utility
• What is considered “important”
• Value assigned to actions and operations
• Utility
– Cost
– Performance
– Availability
• Some kind of “measurable” metric
Utility … 2
• Payoff function
– assess behaviour of a particular action
(reward signal)
• Analysis tool
– relationship between local utility vs. utility of
the community
• Cost function
– success w.r.t. a particular task
• Trust measure
– measure of trust in a particular participant
Economic Utility: Metrics “Pyramid”
Utility Optimisation
• Expected utility E(U): the sum over time of discounted rewards g^t * U(t), with discount factor 0 < g < 1 (“U” may be negative)
• Finite horizon: the sum runs over t = 0 … T
• Infinite horizon: discounting makes long-term rewards count for less
Social Insect Behaviour
• Self-organising Behaviour
• The idea of simple behaviours interacting in a manner that produces a range of interesting complex behaviours is very useful and exciting for designing complex systems:
• Positive Feedback (Autocatalytic) – Recruitment and Reinforcement
• Negative Feedback – Saturation, Exhaustion, or Competition
• Fluctuations and Randomness – Random Walks, Errors, Random Task-Switching, etc.
• Multiple Interactions
• Stigmergetic Behaviour
• Waggle and Tremble dances (Bees)
From: Ashish Umre
Stigmergy
• Indirect communication via interaction with the environment [Grassé, 59]
– Sematonic [Wilson, 75] stigmergy
• action of agent directly related to problem solving
and affects behavior of other agents.
– Sign-based stigmergy
• action of agent affects environment not directly
related to problem solving activity.
Self-organised behaviour can be characterised by key properties such as:
• The creation of spatiotemporal structures in an initially homogeneous medium, e.g. nest architectures, foraging trails, or social organisation
• Multistability – the possible coexistence of several stable states
• The existence of bifurcations when some parameters are varied (“snowball effect”)
From: Ashish Umre
What do Ants do?
• A few examples of collective behaviour that have been observed in
several species of Ants are:
– regulating nest temperature within limits of 1°C;
– forming bridges;
– raiding particular areas of food;
– building and protecting their nest;
– sorting brood and food items;
– co-operating in carrying large items;
– emigration of a colony;
– complex patterns of egg and brood care;
– finding the shortest routes from the nest to a food source;
– preferentially exploiting the richest available food source;
– task partitioning and division of labour
From: Ashish Umre
Ants in Nature
From: Ashish Umre
Adapting to Environment Changes
Pheromone Trails
(Figure: the classic double-bridge experiment – 30 ants travel between nest and food source E over two branches of length d = 0.5 and d = 1.0, through nodes A, B, C, D and H. At T = 0 the ants split roughly equally (15/15) over the two branches; as pheromone accumulates faster on the shorter branch, by T = 1 more ants (about 20 vs. 10) follow the shorter route)
What do Bees do?
• Foraging Behaviour
(Waggle Dance)
• Task Partitioning and
Division of Labour
• Scout-Recruit Concept
(Tremble Dance)
• Group Decision Making
and Colony Cooperation
• Regulating Hive
temperature
• Communication : Food
sources are exploited
according to quality and
distance from the hive
Waggle Dance
From: Ashish Umre
Wasps
• Pulp foragers, water
foragers & builders
• Complex nests
– Horizontal columns
– Protective covering
– Central entrance
hole
Pervasive Ants : Resource Discovery in
Dynamic and Reconfigurable Networks
using Artificial Ants
• Ants continuously explore new solutions
• Pulses (“drumming”) used to update resource tables
(The modulatory communication signal category of drumming in the European carpenter ants Camponotus herculeanus and C. ligniperda: the worker ants strike the surface of the wooden chambers and galleries in which they live with their mandibles and gasters, producing vibrations that can be perceived by nestmates for 20 centimetres or more. Much of the behaviour is classifiable as direct alarm communication. The behaviour of some categories is “tightened up”: transition probabilities are raised, and hence uncertainty is reduced. Modulatory communication appears to be a primitive phenomenon in ants and other social insects.)
• Adaptive to continuous node failure and addition of new
nodes and resources, and change in traffic conditions
From: Ashish Umre
Ant-Based Control Introduction
• Ant Based Control (ABC) is introduced to
route calls on a circuit-switched telephone
network
– ABC is the first SI routing algorithm for
telecommunications networks
• 1996
R. Schoonderwoerd, O. Holland, J. Bruten, L. Rothkranz, Ant-based
load balancing in telecommunications networks, 1996.
ABC: Overview
• Ant packets are control packets
• Ants discover and maintain routes
– Pheromone is used to identify routes to each node
– Pheromone determines path probabilities
• Calls are placed over routes managed by ants
• Each node has a pheromone table maintaining
the amount of pheromone for each destination it
has seen
– Pheromone Table is the Routing Table
ABC: Route Maintenance
• Ants are launched regularly to random
destinations in the network
• Ants travel to their destination according to the next-hop probabilities at each intermediate node (see the sketch after this slide)
– With a small exploration probability an ant will uniformly randomly choose a next hop
• Ants are removed from the network when
they reach their destination
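A minimal sketch of this next-hop choice (not taken from the original paper's implementation): roulette-wheel selection over the pheromone entries for the ant's destination, with a small exploration probability; all names and the table layout are illustrative.

import java.util.Random;

/** Illustrative next-hop selection for an ABC-style ant. */
public class AntRouting {
    private final Random random = new Random();
    private final double explorationProbability = 0.05; // small chance of a uniformly random hop

    /**
     * pheromone[n] = probability of choosing neighbour n for this ant's destination;
     * the entries are assumed to sum to 1.
     */
    public int chooseNextHop(double[] pheromone) {
        if (random.nextDouble() < explorationProbability) {
            return random.nextInt(pheromone.length);      // exploration: uniformly random neighbour
        }
        double r = random.nextDouble();                   // roulette-wheel selection
        double cumulative = 0.0;
        for (int n = 0; n < pheromone.length; n++) {
            cumulative += pheromone[n];
            if (r <= cumulative) return n;
        }
        return pheromone.length - 1;                      // guard against floating-point rounding
    }
}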
ABC: Routing Probability
Update
• Ants traveling from source s to destination
d lay s’s pheromone
– Ants lay a pheromone trail back to their
source as they move
– Pheromone is unidirectional
• When a packet arrives at node n from
previous hop r, and having source s, the
routing probability to r from n for
destination s increases
Ant Algorithm
Ants going from node 1 to 3
An ant in the network launched at
node 3 with destination node 2, and has just
travelled from node 4 to node 1.
This ant will first alter node 1’s table
corresponding to node 3 (its source node) by
increasing the probability of selection of
node 4; it will then select its next node
randomly according to the probabilities in the
table corresponding to its destination node,
node 2.
• Every node has a pheromone table for every destination node in the network
• A node with four neighbours in a 30-node network has 29 pheromone tables with four entries each.
Updating Pheromone table
• Ants can be launched from any node
• Select next node according to probabilities
in the pheromone table for their
destination nodes
• When ants arrive at a node – they update
the probabilities of that node’s pheromone
table (corresponding to their source node)
• Alter table to increase probability pointing
to their previous node
• On reaching destination – ants die
Update law
• P = new probability (or pheromone)
increase
• Probability can be reduced by operation of
normalization (increase in another cell in
table)
• Prob. can approach zero but never
reaches it
Ant Algorithm
r^i_{s,m}(t+1) = (r^i_{s,m}(t) + Δr) / (1 + Δr)
This equation specifies the new reinforced weight for the entry (at node i, for source s) that corresponds to the ant’s last node m.
r^i_{s,l}(t+1) = r^i_{s,l}(t) / (1 + Δr)
This equation specifies the weight for all other entries l that do not correspond to the ant’s last node.
Δr = 0.25 / age
This equation specifies the reinforcement parameter that is employed in the first two equations.
From: Ashish Umre
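A minimal sketch of this update rule (field names assumed): when an ant from source s arrives from previous hop m, the node reinforces the entry for m in its pheromone table for s; dividing every entry by (1 + Δr) keeps the probabilities summing to 1.

/** Illustrative ABC pheromone-table update: table[n] = probability of next hop n for one source node. */
public class PheromoneTable {
    /**
     * Reinforce the entry for the ant's previous node and shrink the rest.
     * deltaR is the age-dependent reinforcement, e.g. deltaR = 0.25 / age.
     */
    public static void update(double[] table, int previousNode, double deltaR) {
        for (int n = 0; n < table.length; n++) {
            if (n == previousNode) {
                table[n] = (table[n] + deltaR) / (1 + deltaR); // reinforced entry
            } else {
                table[n] = table[n] / (1 + deltaR);            // all other entries shrink
            }
        }
        // The entries still sum to 1, so no explicit renormalisation is needed.
    }
}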
Ageing
• Δp changes with the age of the ant
– Age grows with path length (each hop increases the ant’s age)
– Ants moving along shorter routes arrive younger, and so reinforce the pheromone tables more strongly
– Age also grows with the delay an ant experiences at congested nodes
– Delay → the ant’s age increases more quickly
• Congestion also reduces the flow rate of ants towards the affected neighbours – limiting how much those ants can affect the pheromone tables
ABC: Route Selection (Call Placement)
• When a call is originated, a circuit must be
established
• The highest probability next hop is
followed to the destination from the source
• If no circuit can be established in this way,
the call is blocked
• Calls operate independently of ants
ABC: Initialization
• Pheromone Tables are randomly initialized
• Ants are released onto the network to
establish routes
• When routes are sufficiently short, actual
calls are placed onto the network
• Calls and ants dynamically interact
• New calls influence the load on nodes → this influences the ants by means of a delay mechanism
Relationship between calls, node utilisation, pheromone
tables and ants. An arrow indicates the direction of
influence
From: Ashish Umre
Average Packet Delay (With the Algorithm)
From: Ashish Umre
Average Packet Delay (Without the Algorithm)
From: Ashish Umre
Packet and Pulse Loss (With the Algorithm)
From: Ashish Umre
Packet and Pulse Loss (Without the Algorithm)
From: Ashish Umre
Design Concerns
• Swarm Intelligent Systems are hard to
‘program’ since the problems are usually
difficult to define
– Solutions are emergent in the systems
– Solutions result from behaviors and
interactions among and between individual
agents
Summary of ABC
• Ants regularly launched with random destinations
• Ants walk randomly according to probabilities in
pheromone tables for their particular destination
• Ants update the probabilities in the pheromone table for the location they were launched from, by increasing the probability of selection of their previous location by subsequent ants.
• The increase in these probabilities is a decreasing
function of the age of the ant, and of the original
probability.
• This probability increase could also be a function of
penalties or rewards the ant has gathered on its way.
• The ants get delayed on parts of the system that are
heavily used.
• The ants could eventually be penalised or rewarded as a
function of local system utilisation.
• To avoid overtraining through freezing of pheromone
trails, some noise can be added to the behaviour of the
ants.
Possible Solutions to Create Swarm
Intelligence Systems
• Create a catalog of the collective behaviours
• Model how social insects collectively perform
tasks
– Use this model as a basis upon which artificial
variations can be developed
– Model parameters can be tuned within a biologically
relevant range or by adding non-biological factors to
the model
What are Ad Hoc Networks?
• Ad Hoc networks are
– self-organising multi-hop wireless networks;
– no fixed infrastructure, such as base stations
or routers, is required;
– ad hoc networks are rapidly deployable
networks;
– all mobile hosts are embedded with packet
forwarding capabilities;
From: Ashish Umre
Current Routing Algorithms for Ad hoc
Mobile Wireless Networks
• Table Driven routing Protocols:
• Destination-Sequenced Distance Vector Routing (DSDV)
• Clustered Gateway Switch Routing (CGSR)
• The Wireless Routing Protocol (WRP)
• Source-Initiated On-Demand Routing:
• Ad hoc On-Demand Distance Vector Routing (AODV)
• Dynamic Source Routing (DSR)
• Temporally-Ordered Routing Algorithm (TORA)
• Associativity-Based Routing (ABR)
• Signal Stability Routing (SSR)
From: Ashish Umre
Four Ingredients of
Self Organization
• Positive Feedback
• Negative Feedback
• Amplification of Fluctuations – randomness
• Reliance on multiple interactions
Positive Feedback
Positive Feedback reinforces good solutions
• Ants are able to attract more help when a
food source is found
• More ants on a trail increases pheromone
and attracts even more ants
Negative Feedback
Negative Feedback removes bad or old
solutions from the collective memory
• Pheromone Decay
• Distant food sources are exploited last
– Pheromone has less time to decay on closer
solutions
Randomness
Randomness allows new solutions to arise
and directs current ones
• Ant decisions are random
– Exploration probability
• Food sources are found randomly
• Initially an ant will attempt to follow a
random path to “explore” possible food
sources
Multiple Interactions
No individual can solve a given problem.
Only through the interaction of many can a
solution be found
• One ant cannot forage for food;
pheromone would decay too fast
• Many ants are needed to sustain the
pheromone trail
• More food can be found faster
• “Swarm” behaviour
Stigmergy in Action
This general “Clustering” behaviour is a key theme
in such approaches
Ants → Agents
• Stigmergy can be operational
– Coordination by indirect interaction is
more appealing than direct communication
– Stigmergy reduces (or eliminates)
communications between agents
SI Advantages for Routing
SI based algorithms generally enjoy:
• Multipath routing
– Probabilistic routing will send packets all over the
network
• Fast route recovery
– Packets can easily be sent to other neighbors by
recomputing next-hop probabilities
• Low Complexity
– Little special purpose information must be maintained
aside from pheromone/probability information
More SI Advantages
for Routing
• Scalability
– As with any colonies numbering in the
millions, SI algorithms can potentially scale
across several orders of magnitude
• Distributed Algorithm
– SI based algorithms are inherently distributed
SI Disadvantages for Routing
SI also suffers from:
• Directional Links
– Bidirectional links are generally assumed by
using reverse paths
• Novelty
– SI is a relatively new approach to routing. It
has not been characterized very well,
analytically
Pharaoh Ant (Monomorium Pharaonis)
• Colony Behaviours
• Multiple Queening
• Nest Conflict and
Cooperation
• Migration
• Clustering
• Analogies
• Resource Allocation,
Discovery and Sharing
• Adaptive Clustering
From: Ashish Umre
Current Issues in Mobile Agent
Technologies
• Application Issues
• Jumping Agents (Shopping, Taxi/Airport)
• Location Sensitive (Bluetooth, HomeRF)
• Profile Oriented
• Deployment Issues
• Is the Infrastructure ready?
• Security Issues
• Physical Mobility
• Logical Mobility
From: Ashish Umre
Mobile Agents
• Generalizing the “ant” based approach as a mobile agent
• A paradigm based on code mobility
– Remote Evaluation
– Code-on-demand (the Java Applet model)
– Peer-2-Peer
• Migrate from one host to another “autonomously”
– “Intelligent Viruses”? (do we really want these?)
– Lead to security nightmares
– Require writing in obscure languages (Tcl, Java etc)
• Provide an interesting paradigm for Grid computing
– Assuming other Grid infrastructure is there
How do they differ from other DC paradigms
• Host supported mobility vs. autonomous migration
– weak vs. strong mobility (Bradshaw and Suri’s
work on Nomad, vs. Aglets or Voyager)
• What’s in a message?
– state
– code or data
• How large should a mobile agent be?
• Tracking a mobile agent (forwarders, location service,
pheromone trails)
• Host assisted
– state persistence (vs. soft state)
– introspection
The overhyped differences between mobile objects and agents
• Mobile objects do not migrate autonomously
– control transfer issues
• Mobile objects are generally part of some application
– limited or no access to a separate execution context
• Mobile object granularity is generally much finer
– agents must carry code to interact with the host (context or place)
• Mobile objects do not support a well defined API
– such as moveTo, retract, dispatch etc.
• Division of an application into agents vs. objects will be different
• Absence of any standard framework
The overhyped reasons for why mobile agents are
(apparently) useful
• Reduction in network load
• Overcome network latency
• Can encapsulate a protocol
• Can execute autonomously and asynchronously
• Can dynamically adapt their itinerary
• May be heterogeneous
• Are robust and can sustain faults in their environment
and why not …
• all of the above can be done via messaging
• too many security issues to be useful
• unlikely to be supported by host platforms (standardisation has not resulted in anything useful)
• too hard to code, and abstraction is not obvious
Standardisation
• MASIF (Mobile Agent System Interoperability Facility)
– Crystaliz, General Magic, IBM, GMD Fokus, Open
Group
• Address interface between agent systems, and not
agent applications
• MASIF Aim: Enable mobile agents to travel across
various hosts in an open environment
• Support for locating an agent (MAFFinder)
• Released via OMG
MASIF
Standardise on four areas:
• Agent Management
– use of standard operations to manage agents from
different vendors
• Agent Transfer
– use of standard operations to create and migrate
agents from different agent systems
• Agent and Agent System Naming
– use of standard Syntax and Semantics of parameters
– part of MAFFinder
• Agent System Type and Location Syntax
– use of standard syntax for location
– part of MAFFinder
IDL Definition
MASIF … 2
void create_agent (
in Name agent_name,
in AgentProfile agent_profile,
in OctetString agent,
in string place_name,
in Arguments arguments,
in ClassNameList class_names,
in string code_base,
in MAFAgentSystem class_provider)
raises (ClassUnknown, ArgumentInvalid,
SerializationFailed,MAFExtendedException);
IDL Definition
MASIF … 3
Location find_nearby_agent_system_of_profile(
in AgentProfile profile)
raises (EntrynotFound);
void resume_agent(
in Name agent_name)
raises (NameInvalid, ResumeFailed);
void list_all_agents_of_authority(
in Authority authority) ;
NameList list_all_agents() ;
Location list_all_places() ;
IDL Definition
MASIF … 4
interface MAFFinder{
void register_agent(
in Name agent_name,
in Location agent_location,
in AgentProfile agent_profile)
raises (NameInvalid);
void register_agent_system(
in Name agent_system_name,
in Location agent_system_location,
in AgentSystemInfo agent_system_info)
raises (NameInvalid);
IDL Definition
MASIF … 5
Location lookup_agent(
in Name agent_name,
in AgentProfile agent_profile)
raises (EntryNotFound);
Location register_place(
in string place_name,
in Location place_location)
raises (NameInvalid);
At each host ...
• An Agent Server
– one or more such servers can co-exist on a
particular machine
– an agent server must be identifiable by a unique
URL
– must also be able to launch and subsequently
support tracking of the agent
• System support for migratable, non-persistent code
– memory, CPU
• System support for handling local security policy
– sandbox, authentication/access control
mechanism, certificate verification mechanism,
etc
Based on IBM Aglets
MA Lifecycle
(Figure: an agent and its class file move through the lifecycle operations create, dispatch, retract, activate, deactivate and dispose)
Why are they useful in Grids?
• Important code delivery paradigm
• Must operate in the context of existing Grid systems
– may alleviate some issues with mobility
• Support essential needs of Grid computing
– software and protocol updates
– load balancing and migration
– user migration
• Most importantly -- they support a “Demand Oriented”
style of computing
– move computation and data “on demand”
– move a limited set of functionality “on demand”
Achieving Parallelism
• Mobile Agents also useful to support parallelism at a
coarser granularity
– simultaneous dispatch of agents to multiple sites
– simultaneous dispatch of messages to multiple
sites via specialised group formation (aspect of
“Spaces” -- formed through multicast groups)
– Integration with existing message passing libraries
(MPI or PVM) via the host machine
• Achieved parallelism can be more dynamic
– Agents can decide where to migrate vs. predefined message transfer based on MPI or PVM
• May not be useful for “production grade” parallelism
Supporting Mobility
• Object Identity: Killing old object as copy sent to a
remote host (address space) -- use of Java garbage
collection when no references exist to object
– mobile object pool
• Object Serialisation: what happens to private,
transient and state variables -- when to move?
– java.io.Serializable
– serialization of threads?
• State synchronisation and sharing: HORUS -- object
server?
• Concurrency through Actors (objects that own their
own thread) -- Actors are non-blocking
Explicit Serialization
• Via the Externalizable interface in Java (see the sketch at the end of this list)
– must be manually implemented by the programmer
– can customise how an object’s fields are mapped to a
stream
– means of checkpointing state (includes the object’s field values + metadata about class version and field types)
– Write out all visible states of a thread to a stream, read
back state, initiate a thread
• Consider method invocation as a “single” unit of
computation
– allow thread read only before or after a method
invocation (i.e. no active threads)
• Access to stack variables
– stack variables made part of object’s state
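A minimal sketch of explicit serialization via java.io.Externalizable; the class and its fields are illustrative rather than taken from any particular mobile-agent system.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

/** Illustrative agent state that controls exactly how its fields are mapped to a stream. */
public class AgentState implements Externalizable {
    private String currentHost;   // example field: where the agent is executing
    private int hopsCompleted;    // example field: progress through its itinerary

    public AgentState() { }       // public no-arg constructor required by Externalizable

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(currentHost); // the programmer decides the exact mapping of fields to the stream
        out.writeInt(hopsCompleted);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        currentHost = in.readUTF(); // read the fields back in the same order on the destination host
        hopsCompleted = in.readInt();
    }
}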
Custom Classloaders
• Can also implement custom classloaders
• Classloader used to:
– dynamically determine which code to migrate
– which code should be released
– how code interacts with the operating environment
• Classloaders are a useful way to extend existing Grid
systems
– use of the CoG Java toolkit or OGSA to link to Globus
– interactivity between existing scheduling systems
• Offer class loading features as a Grid Service
– characterised by application features?
• Classloaders take away intelligence from migrating code - hence not the ideal solution
Write your own Classloader()
• Extend “Primordial Classloader” in Java
– invoked after calling main() method
– Matrix m = new Matrix() ; -- execute
“new” bytecode
– System.out.println() -- invoke static
reference to class (putstatic, getstatic etc)
• Class loaders enable Java apps (EMACS or
Scientific codes) to be dynamically extended
• Byte code verifier - defineClass,
ClassFormatError
• Package over-write/addition: java.lang.hackit – protect the system namespace
• Multiple Classloaders can co-exist (a minimal example follows)
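A minimal sketch of a custom class loader that could fetch migrating code before defining it locally; fetchClassBytes is a placeholder for whatever transport the agent system uses (e.g. reading bytes from a remote host or cache).

/** Illustrative custom class loader: decides where migrating code comes from. */
public class AgentClassLoader extends ClassLoader {

    public AgentClassLoader(ClassLoader parent) {
        super(parent); // delegate to the parent loader for classes we do not manage
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytecode = fetchClassBytes(name);                 // obtain the migrating class bytes
        if (bytecode == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytecode, 0, bytecode.length);  // checked by the byte code verifier
    }

    /** Placeholder: obtain the class bytes for a migrating agent (transport-specific). */
    private byte[] fetchClassBytes(String name) {
        return null; // illustrative stub only
    }
}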
Dynamic Itinerary
• A mobile agent may visit a number of hosts
• This itinerary may change over time
– based on data collected at intermediate hosts
– may not return to host machine
• Itinerary may be dictated by a particular host
– agent may override this
• Dynamic itinerary useful in Grid context
– load may not be known beforehand
– hosts may not always be available or reliable
– services may not always be present
– users/experts may migrate
Locating an agent
• Use of proxy
– local proxy to track agent
• Forwarders
– creating a chain of non-persistent forwarders
– pheromone based approaches
• A location service
– event notification service
– query service
Application scenario: Load gathering
• Sensors measure network load
– similar to SNMP
• Report this to an event gateway and monitor this at a
given control site
• JAMM system an example
– other work taking place in the Global Grid Forum
Network Monitoring group
• Mobile agent may be used to gather load
– carry a schema for gathering parameters
– interact via local host to SNMP gateway
– record local parameters and carry statistics
– pass through a given host to lodge results
– itinerary may be application dependent
Java Agent Measurement and Monitoring (JAMM) - LBNL
JAMM scenario
Load gathering
Application Profiles
• Application categories:
– restrict itinerary
– identify common patterns
• Resource suggestions
– identify common patterns
– resource characteristics
• MA-MA interaction
– used to inform about other resources
– share application requirements
– determine commonality in applications
Load imposed by Mobile Agents
• MA performance becomes an issue
• Issues
– where should a mobile agent visit next?
– What should the mobile agent carry vs. leave behind?
– How long does a mobile agent spend on a given host?
– How long does it take for a mobile agent to visit from
A->B
• Need for tools that can help gather this data
– Recorded within each agent
– Support for specialised services which gather this
– Data can be queried based on MA authorisation
David Kotz, Guofei Jiang et al.
(Dartmouth College)
Fernando Pinel, Omer F. Rana (Cardiff)
Benchmarking
• MA benchmarking efforts are also important in this context.
• Benchmarks can be micro:
– create (locally or remotely) and dispatch an agent
– retrieve an agent
– blocking and non-blocking message exchanges
• or macro:
– forwarding
– roaming
– proxy servers
M. Dikaiakos, M. Kyriakou, G. Samaras, "Performance Evaluation of Mobile-agent
Middleware: A Hierarchical Approach." In Proceedings of the 5th IEEE International
Conference on Mobile Agents, J.P. Picco (ed.), Lecture Notes of Computer Science
series, vol. 2240, pages 244-259, Springer, Atlanta, USA, December 2001
Additional uses: Consumer Grids
• More open perspective on Grids
• Individuals and organisations can operate as
suppliers of services/resources
• Service providers must be able to:
– Dynamically download software to participate on
the Grid
– Varying resource capabilities
– Dynamically determine resource properties
• Resource aware visualisation
– Remotely configure resource
• Mobile agents provide an important abstraction
• Many existing technologies are useful contenders:
Peer-2-Peer and Web Services
Resource sharing
• Peer-2-Peer
– CPU sharing (Entropia, Parabon, UD, SETI@HOME)
– File sharing (Napster, Gnutella, Freenet)
• CPU sharing
– Utilisation of free cycles via standard downloads
– Requires upload of data on which to operate
– Generally high redundancy and replication
• File Sharing
– Search for common file types, and support file
placement
– Use of indexing or intermediate servers
• Development libraries: JXTA
Resource Sharing … 2
• In MA:
– CPU sharing: migration of mobile agent
– File sharing: migration of associated data and state
• Migration and execution can be more intelligent
• Use of forwarding and location services can be coupled
with additional services:
– Work distribution and current state of computation
– Resource events to support migration
• P2P infrastructure also useful:
– Development of itineraries via overlay networks or
index servers
– Security issues (?)
File Space Management
• Cache management
– migration support for files (temporary results,
configuration etc)
• File space re-ordering
– sharing of directory space across machines
– virtual “file stores”
• Results to common queries
– file placement closer to computation
– file replication to support availability levels
• Managing user and project groups
Common Themes
• Load balancing and migration
• Data capture (especially performance
related)
• Trigger and configuration
– set up of execution at remote sites
– updates to execution or changes
– user set up
• Establishing dynamic resource groups
• Resource provisioning beyond regional and
national centres
Concerns
• Dealing with licensed software
– proprietary code or data
• Dealing with production codes
– highly tuned performance
– issues of Grid computing are questionable here
• Domain decomposition
– issues in translating large scale codes to mobile
agents
– where is the abstraction most suitable/relevant
• Interfaces between Grid systems and Mobile Agent
systems
Issues … Swarm/Ant Systems
• Tragedy of the Commons: Self Organisation
does not always produce the desired outcome
(Thomas Schelling's Micromotives and
Macrobehavior):
– El Farol Bar problem
– Sheep Grazing problem
• Some individuals and organizations are more comfortable and more efficient with hierarchical organizations that are more centrally controlled
Issues … 2
• Useful in an “experiment” and “explorative”
environment
• System must be “non-conservative” in its
approach to experiment and evaluate
different system behaviours
El Farol Bar … 2
• Agents select a night (1—7) – based on
expected attendance or reward (from prior
experience)
• Agent attends the bar
– Attendance on selected night
– Output of the reward function
• Update agent’s model of the system
• Agents cannot communicate with each other
• Global objective: Maximise cumulative reward of
entire system
Tragedy of the Commons
• Self-interested gain of one member of the
community is to the detriment of the whole
community
• Pasture on which each agent keeps cattle
– Utility increases as number of animals
increase
– Overgrazing affects all agents detrimentally
• Agent needs to decide whether to
cooperate or defect
Braess’ Paradox
• Agents traverse a network consisting of a set of
nodes – and a number of connections between
the nodes
• Aim: each agent must reach its destination as
quickly as possible
– Traffic networks, water supply networks, electrical
networks etc
• BP: Addition of an extra link has a detrimental
effect on performance
• Introducing a shortest path link in a network that
has reached equilibrium
(Figure: two networks over nodes A, B, C, D – one without and one with the added link)
Occurs when a community of agents is unable to coordinate their activities to take advantage of changes in the environment.
Collective Intelligence (COIN)
• Developed at NASA by Wolpert et al.
• Scalable coordination technique for
adaptable, learning based multiagent
systems (MAS).
• All agents strive to maximise their local
utility function.
• The goal of the system is to maximise the
global utility function.
Collective Intelligence (COIN)
Local utility functions are derived from the
global utility functions so that:
• Maximisation of the local utility functions maximises the global utility function – the global optimum ‘lines up’ with the Nash Equilibrium.
• Local utility functions are learnable: good
signal-to-noise ratio for learning algorithms.
• Agents are coordinated indirectly. Emergent
behaviour is still possible as agents are not
given explicit instructions and behaviour is
not predefined.
Adapting Collective Intelligence
• We are aiming to adapt this technique for
agents that can be deployed via the
internet.
• COIN concentrates on specific applications: coordinating communications satellites, robotic rovers.
• We want to apply this technique
dynamically and concentrate on software
agents.
LEAF – Learning Agent FIPA
Compliant Community Toolkit
• Utility functions assigned dynamically.
• Utility extended to form two separate
types: functional utility and performance
utility.
• Assignment of multiple utility functions
possible.
• Java API provided to support development
of FIPA compliant agents.
FIPA - Foundation for Intelligent
Physical Agents
• Standards for interoperable agent systems.
• FIPA ACL: conversations consisting of FIPA
performatives such as inform, request,
query etc.
• Agent management system (AMS) and
directory facilitator (DF) part of the FIPA
platform.
• LEAF utilises FIPA-OS implementation from
Emorphia.
Community Building Kit: LEAF
Four core concepts:
• LEAF agents
• LEAF utility functions
• ESNs (Environment Service Nodes)
• LEAF tasks
Provides support for:
• JESS-based policy description
• Reinforcement learning
LEAF Agent
LEAF: Learning Agent FIPA-Compliant Community Toolkit
Implementation of LEAF is based on FIPAOS
[Class diagram: the LEAF classes (ESN, LeafNode, LeafTask) are built on top of the FIPA-OS classes (FIPAOSAgent, Task)]
LEAF: Learning Agent FIPA-Compliant Community Toolkit
• Coordination: utility functions are
assigned to agents by an environment
service node.
[Diagram: an ESN assigning utility functions f1 and f2 to agents in a community]
LEAF: Learning Agent FIPA-Compliant Community Toolkit
• Multiple utility functions can be assigned
[Diagram: two ESNs assign utility functions f1, f2 and f3 across Community a and Community b; an agent belonging to both communities optimises the sum of f2 and f3]
LEAF: Learning Agent FIPA-Compliant Community Toolkit
• Utility functions can have parameters that
are not available locally to the agent.
[Diagram: the ESN evaluates f1 for an agent using observable properties (O) reported by the agent and remote parameters (R) obtained elsewhere]
LEAF: Learning Agent FIPA-Compliant Community Toolkit
[Diagram: utility is split into Performance utility (P) – speed of execution, number of tasks, CPU usage, etc. – and Functional utility (F) – decision making, learning and other high-level behaviour]
Performance Utility
• Provides a utility measure based on performance
engineering related aspects
– Comms metrics:
• number of messages exchanged, size of message, response
time
– Execution metrics:
• execution time, time to convergence, queue time
– Memory and I/O metrics:
• Memory access time, disk access time
• The effect of implementation decisions (algorithms;
languages) and deployment decisions (platforms;
networks) can be assessed.
Functional Utility
• Utility based on “problem solving” capability
• Statically defined
– related to service properties (capability based)
– degree of match between task properties and service
capability
• syntax match (exact match)
• range match
• semantic match (subsumption/subclass)
• Dynamically defined
– related to execution output (MSE)
Utility Function Implementation
• Extend the LocalUtilityFunction
abstract class.
• Implement the compute() method.
• Functions have access to remote
parameters and observable properties.
Utility functions
• Global Utility: G = Σi Local Utility (Ui)
• U = (jobs of type X processed) / (jobs of type X submitted)
• U = 1 / (idle time)
For students: can you consider other utility functions that may be relevant?
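A minimal sketch of what such a function might look like, assuming the LocalUtilityFunction abstract class and compute() method named earlier; the map-based access to observable properties and remote parameters is a hypothetical stand-in for however LEAF actually exposes them.

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical stand-in for the LEAF base class named on the slides above.
  abstract class LocalUtilityFunction {
      protected Map<String, Double> observableProperties = new HashMap<>();   // assumed access mechanism
      protected Map<String, Double> remoteParameters = new HashMap<>();       // assumed access mechanism
      public abstract double compute();
  }

  // Example: U = (jobs of type X processed) / (jobs of type X submitted)
  class ThroughputUtility extends LocalUtilityFunction {
      @Override
      public double compute() {
          double processed = observableProperties.getOrDefault("jobsProcessed", 0.0);
          double submitted = remoteParameters.getOrDefault("jobsSubmitted", 0.0);
          return submitted > 0 ? processed / submitted : 0.0;
      }
  }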
Access to utility functions
• double computeFunctionalUtility() – computes the sum of all currently assigned functional utility functions.
• double computePerformanceUtility() – computes the sum of all currently assigned performance utility functions.
• String[] getFunctionalUtilityRequiredProperties() – returns the observable properties required to compute the agent’s functional utility functions.
• String[] getPerformanceUtilityRequiredProperties() – returns the observable properties required to compute the agent’s performance utility functions.
Resource management
• The objective is to provide users with on-demand
access to the resources needed to execute applications.
• Each peer/agent can undertake three
different roles: application agent, resource
agent, broker agent.
• Multiple roles may be undertaken by the
same peer.
• Each peer is an autonomous agent capable
of learning within its environment, with the
goal of local utility maximisation.
Application Agents
• Accept applications from users.
• Decompose applications into tasks.
• Identify suitable resources for task
execution, via broker agents.
• Schedule and submit tasks to resource
agents.
• Manage dynamic application execution
process.
• Coordinated learning may be of benefit in
resource selection.
Resource Agents
• Manage access to a particular resource.
• Resources may be computational,
visualisation, scientific, or instrumentation
based.
• Resource agents allow tasks to be submitted
and executed on the resource.
• Coordinated learning may allow resource
agents to optimise resource properties, and
prioritise tasks.
Broker Agents
• Maintain information about discovered
resource agents.
• Offer a matchmaking service, aimed at
allowing application agents to discover
resource agents.
• Coordinated learning may allow brokers to
optimise their matchmaking service.
Agent based resource
management
• Previous work used planning based BDI
agents within the same framework.
• Current research involves investigating
whether agents can benefit from
coordinated learning.
• The eventual goal is to integrate the two
techniques.
Agent Communities
• Communities are centred on the
application/resource type: computational (C),
visualisation (V), scientific (S),
instrumentation (I) – there can be multiple
communities of the same type.
• When an agent joins a community, it is
assigned a local utility function.
• The agent learns to optimise this function to
benefit the community.
• Agents are allowed to join multiple
communities in an attempt to maximise their
utility.
Agent Communities
Each community has a global utility
function, based on community objectives:
1. Peers acting as application agents process as
many applications as possible.
2. Peers acting as resource agents process as
many tasks as possible.
3. Peers acting as broker agents facilitate (1)
and (2).
Global Utility Functions
where A is the number of applications processed
by the community, idlei is the amount of time
agent i spends idle. c1,c2 are constants
Application agent utility
functions
where Aa is the number of applications processed
by agent a, and Ja is the total resource usage
time used by a. c1,c2 are constants
Resource agent utility functions
where Tr is the number of tasks processed by
resource agent r, and idler is the total time spent
idle by the resource. c1,c2 are constants
Broker agent utility functions
where n resources have been recommended by
the resource agent, and Ul(i)Ti is the local utility of
the recommended resource at the time of
recommendation.
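The formula images on the four utility slides above did not survive extraction. Purely as a hedged reconstruction, consistent with the variable descriptions but not necessarily the authors’ actual definitions, the utilities might take forms such as:

  G  = c1*A  – c2*Σi idlei          (community global utility)
  Ua = c1*Aa – c2*Ja                (application agent)
  Ur = c1*Tr – c2*idler             (resource agent)
  Ub = (1/n)*Σi Ul(i)Ti             (broker: mean local utility of the n recommended resources at recommendation time)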
Simulations
• 4 communities – (C, V, S, I)
• 10 resource agents
• 3 application agents
• 1 broker agent
• The current focus is on resource agent
learning – joining communities and updating
resource properties
• Peers attempt to join communities in order to
increase their utility, and will only remain in
the community as long as their utility is above
a certain threshold.
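An illustrative-only sketch of that join/leave rule (the names and threshold below are assumptions, not part of the LEAF API):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  class CommunityMembershipSketch {
      static final double UTILITY_THRESHOLD = 0.2;   // assumed value

      // Keep only the communities in which the peer's utility is still above the threshold.
      static List<String> updateMemberships(List<String> joined, Map<String, Double> utilityPerCommunity) {
          List<String> stillMember = new ArrayList<>();
          for (String community : joined) {
              if (utilityPerCommunity.getOrDefault(community, 0.0) >= UTILITY_THRESHOLD) {
                  stillMember.add(community);          // utility high enough: remain a member
              }                                        // otherwise the peer leaves this community
          }
          return stillMember;
      }
  }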
[Plot: global utility and number of members over time – computational community]
[Plot: global utility and number of members over time – visualisation community]
[Plot: global utility and number of members over time – storage community]
[Plot: global utility and number of members over time – instrumentation community]
Current research objectives
• The aim is to allow peers to form communities in
which the collection of peers is ‘greater than the
sum of its parts’.
• Current work involves engineering this application
and evolving the utility functions to include a
greater degree of social context
• Learning is currently very difficult for the
agents – need to allow learning algorithms to
converge.
Common Themes
• Load balancing and migration
• Data capture (especially performance
related)
• Trigger and configuration
– set up of execution at remote sites
– updates to execution or changes
– user set up
• Establishing dynamic resource groups
• Resource provisioning beyond regional and
national centres
Toolkits: ABLE
• ABLE (Agent Building and Learning
Environment)
• Supports use of JavaBeans
• Provides a host of pre-built functionality
• Also provides Tuning agents for:
– Load Balancing
– System Control function
AbleBeans – Java Agent Building
Blocks
[Diagram: two AbleBeans interacting through direct method calls and AbleEvents (notification and action events)]
AbleBean, AbleRemoteBean: a Java interface (local and remote)
AbleObject: AbleBean instantiation with autonomous thread
Bean interactions: Direct method calls and event passing
AbleEvents: Notification and Action events with synchronous
and asynchronous event handling
AbleBeanInfo and Customizer required for use in Agent Editor
Set of core data access and algorithm beans supplied
From Joe Bigus (IBM)
AbleAgents – Intelligent
JavaBeans
[Diagram: an AbleAgent containing AbleBeans A, B and C; a Sensor gets data from App/Service 1 and an Effector calls actions on App/Service 2]
AbleAgent, AbleRemoteAgent: a Java interface (extends AbleBean)
Composable: can contain other AbleBeans and AbleAgents
Sensors and Effectors: Allow agents to interface with apps
Can be distributed, synchronous or asynchronous (autonomous)
From Joe Bigus (IBM)
ABLE Component Library
• Agents: Classification, Clustering, Prediction, Autotune (closed-loop control), Storage manager (multiple QoS)
• Machine Learning: Back propagation, Self-organizing maps, Radial Basis Functions, TD-Lambda, Decision Trees, Naive Bayes
• Machine Reasoning: Script (procedures), Forward/Backward chaining, Predicate logic (Prolog), Rete-based pattern match, Fuzzy systems, Planning (STRIPS)
• Data Access/Analysis: Text/DB read/write, Cache/Filter/Transform, Statistical routines, Genetic algorithms, other math analysis
From Joe Bigus (IBM)
ABLE Application Design
[Diagram: an Application hosting an Agent built from ABLE Core Beans (from the ABLE Library) plus domain-specific Custom Beans]
From Joe Bigus (IBM)
AbleBean Wrapper Design Pattern
[Class diagram: myAlgorithmBean (with its myAlgorithmBeanInfo and myAlgorithmCustomizer) exposes init(), process(), processTimerEvent() and getters/setters, and holds an instance of theAlgorithm, which exposes init(), process() and getters/setters]
Allows easy integration of existing Java algorithms into the Able environment
Requires creation of 3 Java classes, Bean wrapper, BeanInfo and Customizer
Bean contains an instance of the algorithm and calls methods on it
No (or minimal) source changes required in the algorithm class
From Joe Bigus (IBM)
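A hedged Java sketch of this wrapper pattern, reusing the illustrative class names from the diagram (capitalised to Java convention); the lifecycle methods follow the slide, while the algorithm’s internals are invented for the example.

  // Existing algorithm class: no (or minimal) source changes required.
  class TheAlgorithm {
      private double threshold;
      void init() { threshold = 0.5; }
      double process(double input) { return input > threshold ? 1.0 : 0.0; }
      double getThreshold() { return threshold; }
      void setThreshold(double t) { threshold = t; }
  }

  // Bean wrapper: holds an instance of the algorithm and delegates to it.
  class MyAlgorithmBean {
      private final TheAlgorithm algorithm = new TheAlgorithm();
      private double lastResult;

      public void init() { algorithm.init(); }                      // delegate lifecycle calls
      public void process(double input) { lastResult = algorithm.process(input); }
      public void processTimerEvent() { /* periodic work would go here */ }

      public double getLastResult() { return lastResult; }          // standard bean getters/setters
      public double getThreshold() { return algorithm.getThreshold(); }
      public void setThreshold(double t) { algorithm.setThreshold(t); }
  }

The BeanInfo and Customizer classes mentioned on the slide (needed for the Agent Editor) are omitted here.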
Rule Blocks
<type> <name>() using <engine> { ruleList } ;
• Semantically equivalent to Java methods
• Can specify a return data type
• Can use pre-defined or user-defined name
• No formal parameter lists, use global vars
• Specify inference engine via the using <engine> clause
• <engine> can be any AbleInferenceEngine Java subclass
• Body of ruleblock contains one or more Rules
• Use setControlParameter() built-in function to set goals,
options, etc.
• Ruleblock can have local or shared working memory
ARL Rule Syntax
<ruleLabel> { preConditions } [priority] : <ruleBody>;
• ruleLabel – unique identifier in ruleset
• preConditions – list of Java objects
(e.g. TimePeriods)
• priority – used in conflict resolution during
inferencing
• Rule body must be one of the ARL rule types
• myRule { weekdaysOnly } [ 3.0 ] : println("wow");
ABLE Rule Templates
• Allow IT Developer or Programmer to create rulesets and
templates using the WSAD editor
• Minimize external meta-data or artifacts
• Business user can create rules from templates using a web-based UI
• Allow easy parameterization of rules and rule logic,
with constraints on parameter values
• Reuse existing ABLE data types and ARL syntax
• Allow users to customize rule templates and create new rules
• Variable values are constrained based on ruleset author constraints
• Can generate individual rules or entire rulesets via templates
• Can edit generated rules using same authoring environment
ARL Rule Template Syntax
Ruleset myRuleTemplateExample {
  import com.ibm.myclass.Customer;

  variables {
    Customer             customer      = new Customer();                               // myclass type
    template Categorical customerLevel = new Categorical("gold", "silver", "platinum");
    template String      salesMsg      = new String("Thank you for shopping IBM");     // example msg
    template Continuous  discountValue = new Continuous(0.01, 0.50);                   // allow range from 1% to 50%
    Double               discount      = new Double(0.0);
  }

  inputs  { customer };
  outputs { discount };

  void process() {
    Rule1: if (a > b)  then println("regular old rule");
    Rule2: if (a <= b) then println("another regular old rule");

    // NOTE: the following Rule is a template
    template myRuleTemplate1: if (customer.level == customerLevel)
      then { discount = discountValue;
             println(salesMsg); }
  }
}
Autotune Agent Web-Tuning
Scenario
[Diagram: Users send requests to the Apache Web Server; the AutoTune Agent (modeling and run-time control) monitors CPU and MEM utilisation and adjusts KeepAlive and MaxClients to track a desired utilization level. Agent properties: flexible, autonomic, generic.]
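Not IBM’s AutoTune implementation – just a hand-rolled sketch of the feedback idea in the diagram, nudging MaxClients towards a desired CPU utilisation. The starting value, target and gain are assumptions.

  class AutoTuneSketch {
      private int maxClients = 150;                  // starting value (assumed)
      private final double desiredCpu = 0.66;        // desired utilisation level (assumed)
      private final double gain = 50.0;              // proportional gain (assumed)

      // One control step: measuredCpu is the observed CPU utilisation in [0, 1].
      int adjust(double measuredCpu) {
          double error = desiredCpu - measuredCpu;   // positive => server under-utilised, allow more clients
          maxClients = Math.max(1, (int) Math.round(maxClients + gain * error));
          return maxClients;                         // new MaxClients value to write back to the configuration
      }
  }

A KeepAlive adjustment would follow the same pattern, driven by memory utilisation rather than CPU.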
Design Phase I: System Modeling
iSeries System Administration using ABLE
[Diagram: a SysAdmin Agent driven by SysAdminBrain and SysAdminActions rulesets; Task/Info Agents (CPUWatcher, DiskWatcher, DiskPredictor, NOJWatcher) feed it data, and Action Agents (FindRunawayJobs, FindDuplicateJobs, Cleanup, FindLargeObjects) carry out its decisions]
WinGamma
• Data analysis toolkit – especially for time
series data
• Can support identification of:
– Time series “embedding” dimension
– Level of noise present within data
– Based on the “Gamma” statistic
• Can be used prior to training a neural
network
WEKA: Waikato Environment for
Knowledge Analysis
Explorer: building “classifiers”
• Classifiers in WEKA are models for
predicting nominal or numeric quantities
• Implemented learning schemes include:
– Decision trees and lists, instance-based
classifiers, support vector machines, multilayer perceptrons, logistic regression, Bayes’
nets, …
• “Meta”-classifiers include:
– Bagging, boosting, stacking, error-correcting
output codes, locally weighted learning, …
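A small example of driving one of these learning schemes from the WEKA Java API (standard WEKA 3.x class names; the ARFF file path is a placeholder):

  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class WekaExample {
      public static void main(String[] args) throws Exception {
          Instances data = new DataSource("weather.arff").getDataSet();   // placeholder ARFF file
          data.setClassIndex(data.numAttributes() - 1);                   // last attribute is the class

          J48 tree = new J48();                                           // decision-tree learner
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
          System.out.println(eval.toSummaryString());                     // 10-fold cross-validation summary
      }
  }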
Monitoring Tools
• NWS (Network Weather Service)
– Support a forecasting model
– Work at “application-level” and not necessarily at the
network (resource) level
• NetLogger
– Now supports instrumentation for Globus calls
– Useful data capture process (event based)
– Manage level of data captured
• Specialist support via Apache Web Server
– Messaging and Execution time
From Brian Tierney (LBNL)
From: G. Obertelli (UCSB)
Additional Info.
• IBM Autonomic Computing Web site
– http://www.research.ibm.com/autonomic/
• IBM Autonomic Computing Library
– http://www-03.ibm.com/autonomic/library.html
• LEAF project
– http://users.cs.cf.ac.uk/O.F.Rana/leaf/
• DIPSO/FAEHIM project
– http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/
• WinGamma
– http://www.cs.cf.ac.uk/wingamma/
• WEKA
– http://www.cs.waikato.ac.nz/ml/weka/
• ABLE Toolkit – Tutorial
– http://www.cs.iastate.edu/~colloq/docs/able2_bigus.ppt