Preemptive Process Migration


A COMPLETE GUIDE TO OPENMOSIX (updated to 2.4.26)

Note: this presentation is best viewed under Microsoft PowerPoint XP because of its use of animations.
Authors
Students of IIT Delhi, India
• Aastha Jain (B.Tech, 3rd year)
• Shruti Garg (B.Tech, 3rd year)
• Akram Khan (Dual Degree, 3rd year)
• Rohan Choudhary (Dual Degree, 3rd year)
This presentation was made as part of a reading assignment in the Operating Systems course (Spring 2006) taught by Dr. Subhashis Banerjee.
OPEN MOSIX PROJECT HISTORY
• Born in the early 80s on the PDP-11/70: one full PDP and one disk-less PDP, hence the process-migration idea.
• First implementation on BSD/PDP as an M.Sc. thesis.
• VAX-11/780 implementation (different word size, different memory architecture).
• Motorola/VME bus implementation as a Ph.D. thesis in 1993, under contract from the IDF (Israeli Defence Forces).
• 1994 BSDi version; GNU and Linux since 1997.
• Contributed dozens of patches to the standard Linux kernel.
• Mosix / openMosix split in November 2001.
What is openMosix (today: version 2.4.26)
• A Linux kernel extension (2.4.26) for clustering
• Single System Image - like an SMP:
 • No need to modify applications
 • Adaptive resource management to dynamic load characteristics (CPU intensive, RAM intensive, I/O, etc.)
 • Linear scalability (unlike SMP)
Single System Image Cluster
• Users can start from any node in the cluster, or the sysadmin sets up a few nodes as "login" nodes
 • e.g., use round-robin DNS: "hpc.qlusters" with many IPs assigned to the same name
• Each process has a Home-Node
 • Migrated processes always seem to run at the home node; e.g., "ps" shows all your processes, even if they run elsewhere
A two-level technology
1. Information gathering and dissemination
 • Supports scalable configurations by probabilistic dissemination algorithms
 • Same overhead for 16 nodes or 2056 nodes
2. Pre-emptive process migration that can migrate any process, anywhere, anytime - transparently
 • Supervised by adaptive algorithms that respond to global resource availability
 • Transparent to applications, no change to the user interface
Organization of topics
• Gossip algorithms for information dissemination
• Process migration by adaptive resource management algorithms
 • Load balancing
 • Memory ushering
 • How actual migration takes place
• Direct File System Access
• MigSHM (a special study)

Gossip Algorithms and openMosix
Agenda
• A short introduction to gossip algorithms
• Cluster/Grid information services requirements
 • How good is old information?
• The distributed bulletin board model
• Implementation
A Problem
• In an n-node system, assume that every pair of nodes can communicate directly
• Node i wishes to send a message (a rumor) to all other nodes
• Possible deterministic solutions:
 • Broadcast (only in a broadcast medium)
 • Defining a static tree between the nodes and sending the message along the edges of this tree
A Gossip-Style Solution
• Starting with the round in which a rumor is generated, each node that holds the rumor selects another node independently and uniformly at random and sends the rumor to that node (a small simulation sketch follows)
• The distribution of the rumor is terminated after some fixed number, O(ln n), of rounds
• At this point all players are informed with high probability
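A minimal sketch of the uniform push-gossip round structure described above (illustrative only; the node count, round budget, and rand()-based peer selection are assumptions, not openMosix code):

/* Minimal simulation of uniform push gossip: in each round, every node
 * that already holds the rumor pushes it to one peer chosen uniformly
 * at random. After O(ln n) rounds almost all nodes are informed. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                            /* number of nodes (assumed) */

int main(void)
{
    int informed[N] = {0};
    int next[N];
    informed[0] = 1;                      /* node 0 generates the rumor */
    int rounds = (int)(3 * log(N));       /* a fixed O(ln n) round budget */

    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < N; i++)
            next[i] = informed[i];
        for (int i = 0; i < N; i++) {
            if (!informed[i])
                continue;
            int peer = rand() % N;        /* uniform random destination */
            next[peer] = 1;               /* "send" the rumor */
        }
        for (int i = 0; i < N; i++)
            informed[i] = next[i];
    }

    int count = 0;
    for (int i = 0; i < N; i++)
        count += informed[i];
    printf("informed after %d rounds: %d of %d nodes\n", rounds, count, N);
    return 0;
}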
Uniform Gossip Example (animation: rounds t = 1 to 5)
Gossip benefits
• Robustness to the presence of node failures
 • Messages continue to propagate due to the random selection of destinations
 • Failure of F nodes results in only O(F) uninformed players
• Simplicity
 • All nodes run the same algorithm
• Scalability
 • The number of messages each node sends (and possibly receives) each round is fixed
Gossip taxonomy
• Other names are
 • Epidemic algorithms (Demers et al.)
 • Randomized communication (Karp et al.)
• Propagation can be done by
 • Push - sending the information from the node to the selected node
 • Pull - the other way around
 • Push & Pull - both ways
• We distinguish between 2 conceptual layers
 • A basic gossip algorithm, by which nodes choose other nodes for communication
 • A gossip-based protocol, built on top of a gossip algorithm, which determines the content of the messages that are sent and the way received messages cause nodes to update their internal state
Rumor spreading bounds
• Time complexity: from a single node to all nodes in O(ln n) rounds
• Message complexity: a lower bound of Ω(n ln ln n) on the number of messages (Karp et al.)
Spatial Gossip (Kempe et al.)
• New information is most interesting to nodes that are nearby
• Combines the benefits of
 • Uniform gossip
 • Deterministic flooding
• The gossip algorithm chooses the destination according to a distance-dependent distribution, p_{x,y} ∝ c / d(x,y)^(D+1)
• New information is spread to nodes at distance d with high probability in O(log^(1+ε) d) rounds
Aggregating values
• Gossip can also be used to aggregate a value over all nodes
 • Average, maximum, minimum, ...
 • In this case the question is how fast the local value in each node converges to the desired value (a toy example follows)
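As a concrete illustration of gossip-style aggregation (not taken from the slides; a toy pairwise-averaging scheme): each round a node exchanges its local value with a random peer and both keep the mean, so every local value converges to the global average.

/* Toy gossip averaging: nodes repeatedly average their value with a
 * randomly chosen peer; all local values converge to the global mean. */
#include <stdio.h>
#include <stdlib.h>

#define N 8                     /* number of nodes (assumed) */

int main(void)
{
    double value[N];
    for (int i = 0; i < N; i++)
        value[i] = (double)i;   /* initial local values 0..N-1 */

    for (int round = 0; round < 100; round++) {
        for (int i = 0; i < N; i++) {
            int peer = rand() % N;            /* random gossip partner */
            double mean = (value[i] + value[peer]) / 2.0;
            value[i] = value[peer] = mean;    /* both keep the average */
        }
    }

    for (int i = 0; i < N; i++)
        printf("node %d: %.4f\n", i, value[i]);   /* all close to 3.5 */
    return 0;
}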
Cluster/Grid Information services
• Basic properties of a Grid environment
 • Information sources are distributed
 • Individual sources are subject to failure
 • The total number of information providers is large
 • Both the types of information sources and the ways the information is used can vary
• We cannot in general provide users with accurate information: any information delivered to a user is "old"
 • How useful is old information? (Mitzenmacher)
 • How to build an information service with guaranteed age properties?
Distributed Bulletin Board
• The system
 • Consists of N nodes (or clusters)
 • Distributed
 • Nodes are subject to failure
• Each node maintains a data structure that holds an entry on selected (or all) nodes in the system
 • We refer to this data structure as "the vector"
• Each vector entry holds:
 • the state of the resources (static and dynamic) of the corresponding node
 • the age of the information (tuned to the local clock)
• The vector is a distributed bulletin board that serves information requests locally
Algorithm 1 - Information dissemination
• Each time unit
 • Update the local information
 • Find all vector entries which are up to age t
 • Choose a random node
 • Send the above entries to that node
• Upon receiving a message
 • Compute the age of the received entries
 • Update the entries for which the newly received information is fresher (a code sketch follows the example below)
Example (animation): per-node vectors such as A:1 C:2 D:4 and A:4 B:12 C:2 D:4 E:11 are merged entry by entry, keeping the fresher age for each node.
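A minimal sketch of the bulletin-board vector and the Algorithm 1 send/receive steps (field names and the in-memory layout are assumptions made for illustration; the real openMosix implementation lives in the kernel):

/* Sketch of Algorithm 1: every node keeps a "vector" with one entry per
 * node (resource state + age). Each time unit a node refreshes its own
 * entry, ages the others, and sends the entries not older than a window
 * t to one randomly chosen node, which keeps whichever copy is fresher. */
#include <stdio.h>
#include <stdlib.h>

#define NODES 8
#define AGE_WINDOW 2                      /* t: max age of entries to forward */

struct entry {
    int valid;                            /* do we know anything about node j? */
    int age;                              /* age of that knowledge             */
    double load;                          /* example resource state            */
};

static struct entry vec[NODES][NODES];    /* vec[i][j]: node i's view of node j */

static void time_unit(int i)
{
    vec[i][i] = (struct entry){ .valid = 1, .age = 0,
                                .load = (rand() % 100) / 100.0 };

    for (int j = 0; j < NODES; j++)       /* age all other entries */
        if (j != i && vec[i][j].valid)
            vec[i][j].age++;

    int target = rand() % NODES;          /* choose a random node */
    for (int j = 0; j < NODES; j++) {     /* send the young entries to it */
        struct entry e = vec[i][j];
        if (!e.valid || e.age > AGE_WINDOW)
            continue;
        e.age++;                          /* the copy is one unit older on arrival */
        if (!vec[target][j].valid || e.age < vec[target][j].age)
            vec[target][j] = e;           /* receiver keeps the fresher copy */
    }
}

int main(void)
{
    for (int t = 0; t < 32; t++)
        for (int i = 0; i < NODES; i++)
            time_unit(i);

    for (int j = 0; j < NODES; j++)       /* print node 0's final view */
        printf("node %d: valid=%d age=%d load=%.2f\n",
               j, vec[0][j].valid, vec[0][j].age, vec[0][j].load);
    return 0;
}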
Algorithm 1 example (animation: t = 2, rounds 1 to 5)
Handling inactive nodes
• The presence of inactive nodes causes problems
 • The age quality of the information deteriorates
 • The number of ARP broadcasts increases linearly
• Using a fixed-size window improves the age quality, but the number of ARP broadcasts stays the same
Algorithm 2
• Algorithm 2 solves the above 2 issues
• It works basically the same as Algorithm 1, with the following difference when sending a message (sketched below):
 • Calculate l, the number of active nodes (from the local vector)
 • Generate a random number k = 0 ... l
 • If k = 0, send the window to a node chosen among all nodes
 • Else, send the window to a node chosen only among the active nodes
• Using Algorithm 2, the maximal expected number of messages to inactive nodes is ≤ 1
 • from all nodes together, at each round
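A minimal sketch of that destination-choice rule (interpreting the slide; the active-node bookkeeping is an assumption for illustration):

/* Algorithm 2 destination choice: with probability 1/(l+1) pick any node
 * (so inactive nodes are still probed occasionally), otherwise pick only
 * among the l nodes currently believed to be active. */
#include <stdio.h>
#include <stdlib.h>

int choose_destination(const int *active, int n_nodes)
{
    int l = 0;
    for (int i = 0; i < n_nodes; i++)    /* count active nodes in the vector */
        l += active[i] ? 1 : 0;

    int k = rand() % (l + 1);            /* uniform k in 0..l */
    if (k == 0 || l == 0)
        return rand() % n_nodes;         /* probe among all nodes */

    int pick = rand() % l;               /* otherwise pick the pick-th active node */
    for (int i = 0; i < n_nodes; i++)
        if (active[i] && pick-- == 0)
            return i;
    return -1;                           /* unreachable */
}

int main(void)
{
    int active[8] = {1, 1, 0, 1, 0, 0, 1, 1};   /* example activity map */
    printf("destination: %d\n", choose_destination(active, 8));
    return 0;
}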
Algorithm 2 - minimizing messages to inactive nodes (animation: rounds 1 to 4)
Supporting urgent information
• In the previous algorithms, information is propagated from all nodes constantly
• In some cases we wish to send an important message urgently to all nodes
 • such as the detection of a newly dead node
 • In this case the source node gives the message a high priority of 2*log(n)
• When a node assembles the window it is about to send, it takes the entries with the highest priority first, and only then the younger entries (a sketch of this ordering follows)
• The priority of an entry is decremented every time unit
• The result is that urgent messages are disseminated in O(log(n)) steps, and regular information is disseminated a bit slower
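A minimal sketch of that window-assembly order (sorting by priority first, then by age; the entry fields and window size are assumptions):

/* Assemble the gossip window: highest-priority (urgent) entries first,
 * then the remaining entries ordered youngest-first. */
#include <stdio.h>
#include <stdlib.h>

struct entry {
    int node;
    int age;        /* younger = fresher information      */
    int priority;   /* urgent entries start at ~2*log2(n) */
};

static int by_priority_then_age(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->priority != y->priority)
        return y->priority - x->priority;   /* higher priority first */
    return x->age - y->age;                 /* then younger first */
}

int main(void)
{
    struct entry vec[] = {
        { .node = 3, .age = 4, .priority = 0 },
        { .node = 7, .age = 1, .priority = 6 },  /* urgent: dead-node report */
        { .node = 1, .age = 0, .priority = 0 },
        { .node = 5, .age = 2, .priority = 0 },
    };
    int n = sizeof vec / sizeof vec[0], window = 3;

    qsort(vec, n, sizeof vec[0], by_priority_then_age);
    for (int i = 0; i < window && i < n; i++)
        printf("send node %d (age %d, prio %d)\n",
               vec[i].node, vec[i].age, vec[i].priority);
    return 0;
}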
Conclusions
• Constructed a distributed bulletin board
 • Age properties are guaranteed
 • The administrator can configure it to the desired properties
 • No two nodes have the same view of the system
 • Information requests are served locally
 • Noise level (messages to inactive nodes) is constant
 • Urgent messages are propagated quickly
Load Balancing
Load calculation and distribution

The key idea
Convert the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous "cost". Processes are then assigned to the machine where they have the lowest cost.
The Model
• Cluster of n machines
• Machine i has
 1) a CPU resource of speed rc(i)
 2) a memory resource of size rm(i)
• Each process (job) j is defined by three parameters:
 1) its arrival time, a(j),
 2) the number of CPU seconds it requires, t(j), and
 3) the amount of memory it requires, m(j)
• Assumption: m(j) is known when a job arrives, but t(j) is not.
• A job must be assigned to a machine immediately upon its arrival, and may or may not be able to move to another machine later.
• Let J(t,i) be the set of jobs on machine i at time t.
• The CPU load and the memory load of machine i at time t are
 lc(t,i) = |J(t,i)|
 lm(t,i) = ∑_{j ∈ J(t,i)} m(j)
• When a machine runs out of main memory it is slowed down by a multiplicative factor r, due to disk paging. The effective CPU load of machine i at time t, L(t,i), is therefore
 L(t,i) = lc(t,i)        if lm(t,i) < rm(i)
 L(t,i) = lc(t,i) * r    otherwise
• At time t, each job on machine i receives 1/L(t,i) of the CPU resource. A job's completion time, c(j), therefore satisfies
 ∫_{a(j)}^{c(j)} rc(i) / L(t,i) dt = t(j)
 (a numeric sketch follows)
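A minimal numeric sketch of those definitions (the job set, machine speed, memory size, and paging factor are made-up illustration values): it computes L(t,i) from the memory condition and integrates rc(i)/L(t,i) over time until one job has accumulated its required CPU seconds.

/* Effective load L(t,i) and completion-time integration for one machine:
 * integrate rc / L(t) until job 0 has received its t(j) CPU seconds. */
#include <stdio.h>

#define NJOBS 3

int main(void)
{
    double rc = 2.0;                 /* CPU speed of the machine            */
    double rm = 100.0;               /* memory size of the machine          */
    double r  = 10.0;                /* slowdown factor when paging         */
    double m[NJOBS] = { 60.0, 30.0, 50.0 };   /* memory of each job         */
    double t_need   = 5.0;           /* CPU seconds job 0 still requires    */
    double dt = 0.01, elapsed = 0.0;

    while (t_need > 0.0) {
        double lc = NJOBS;           /* CPU load: number of resident jobs   */
        double lm = 0.0;             /* memory load: total requested memory */
        for (int j = 0; j < NJOBS; j++)
            lm += m[j];
        double L = (lm < rm) ? lc : lc * r;   /* effective CPU load         */

        t_need  -= (rc / L) * dt;    /* job 0's share of CPU work in dt     */
        elapsed += dt;
    }
    printf("job 0 finishes after ~%.1f time units\n", elapsed);
    return 0;
}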
Machine models
1. Identical machines. All of the machines are identical, and the speed of a job on a given machine is determined only by the machine's load.
2. Related machines. The machines are identical except that some of them have different speeds, i.e., they have different rc values.
3. Unrelated machines. Many different factors can influence the effective load of the machine and the completion times of jobs running there.
Jobs:
1. Permanent jobs.
2. Temporary jobs.
Main goal
• Our goal is a method for job assignment and/or reassignment that will minimize the average slowdown over all jobs.
• Slowdown of job j: (c(j) - a(j)) / t(j)
• The algorithm described from here onwards is called the ASSIGN-U algorithm.
ASSIGN-U algorithm for a single resource
Identical and related machines
• CPU time can be considered the only resource, hence when no reassignments are possible a simple greedy algorithm performs well (its competitive ratio is 2 - 1/n).
• An algorithm ALG is c-competitive if for any input sequence I
 ALG(I) <= c * OPT(I) + d
 where OPT is an optimal algorithm and d is a constant.
Unrelated machines: the concept of marginal cost
Let:
 • a be a constant, 1 < a < 2,
 • l(i,j) be the load of machine i before assigning job j,
 • p(i,j) be the load job j will add to machine i.
Assign j to the machine i that minimizes the marginal cost
 H_i(j) = a^(l(i,j) + p(i,j)) - a^l(i,j)
This algorithm is O(log n)-competitive for unrelated machines and permanent jobs, and can be extended to temporary jobs using up to O(log n) reassignments per job with the same competitive ratio.
ASSIGN-U for k resources
• Cost = ∑_{i=1..k} a^(load on resource i), where 1 < a < 2
• The load on a given resource equals its usage divided by its capacity.
• Assign each resource a capacity = (size) * some constant factor δ, chosen so that the optimal algorithm achieves a maximum load of 1. The online algorithm can then reach loads as high as O(log n).
• So (a placement sketch follows),
 Cost = ∑_{i=1..k} a^(O(log n) * utilized r_i / max usage of r_i)
      = ∑_{i=1..k} n^(utilized r_i / max usage of r_i)
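A minimal sketch of the ASSIGN-U placement rule under these definitions (machine count, loads, and the choice a = 1.5 are illustration values): compute each machine's marginal cost a^(l+p) - a^l summed over the resources for the new job and pick the minimizer.

/* ASSIGN-U style placement: choose the machine whose exponential cost
 * increases the least when the new job's load is added. */
#include <math.h>
#include <stdio.h>

#define MACHINES 3
#define RESOURCES 2            /* e.g. resource 0 = CPU, resource 1 = memory */

int main(void)
{
    double a = 1.5;            /* constant 1 < a < 2 */
    /* current normalized load of each machine on each resource (usage/capacity) */
    double l[MACHINES][RESOURCES] = {
        { 0.50, 0.70 },
        { 0.20, 0.90 },
        { 0.80, 0.10 },
    };
    /* load the new job would add on each machine (may differ per machine) */
    double p[MACHINES][RESOURCES] = {
        { 0.30, 0.10 },
        { 0.30, 0.10 },
        { 0.30, 0.10 },
    };

    int best = -1;
    double best_cost = 1e300;
    for (int i = 0; i < MACHINES; i++) {
        double h = 0.0;        /* marginal cost summed over the k resources */
        for (int r = 0; r < RESOURCES; r++)
            h += pow(a, l[i][r] + p[i][r]) - pow(a, l[i][r]);
        printf("machine %d: marginal cost %.4f\n", i, h);
        if (h < best_cost) {
            best_cost = h;
            best = i;
        }
    }
    printf("assign job to machine %d\n", best);
    return 0;
}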
Pseudo Code (contd.)
Performance
Memory ushering
• Memory ushering means migrating a process from a node that has nearly exhausted its free memory, to prevent paging.
• The memory ushering algorithm is executed only when free memory becomes scarce; otherwise only the load-balancing algorithm is used.
Memory ushering algorithm
There are three parts to the memory ushering algorithm:
1. When to migrate a process.
2. Which process to migrate.
3. Choice of target node.
mosix_mem_daemon logs memory usage for each process for memory ushering.
When to migrate a process
• The algorithm is triggered when the node's free memory falls below the threshold memory.
• The choice of threshold memory is critical (it should be adjusted according to the page allocation rate).
Choosing the process to migrate
• In the first stage, the process with the lowest migration overhead is selected from among the processes whose memory requirement is more than the memory overflow value.
• If there is no such process, or there is no target node that can accommodate this process, move to stage 2.

Stage 2
• In the second stage the algorithm finds the node with the largest free-memory index.
• It finds the largest process which can fit into this node.
• The algorithm then moves back to stage 1.
Choosing the target node
• The target node is chosen from among the subset of nodes whose free-memory indices are available.
• A node whose free-memory index is larger than the requirement of the process to be migrated is chosen.
• Among those, the node with the lowest load is chosen, to avoid repeated migrations (a selection sketch follows).
• Before process migration starts, approval is required from the destination node, to avoid simultaneous migrations from different nodes.
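A minimal sketch of the selection logic just described (the process table, free-memory indices, and threshold are illustration values, not the kernel's data structures): pick the cheapest-to-migrate process that covers the memory overflow, then pick a target node with enough free memory and the lowest load.

/* Memory ushering selection sketch: which process to evict and where. */
#include <stdio.h>

struct proc { int pid; long mem; long mig_cost; };
struct node { int id; long free_mem; double load; };

int main(void)
{
    long threshold = 64, free_local = 20;          /* MB, illustration only */
    long overflow = threshold - free_local;        /* memory we must free   */

    struct proc procs[] = { {101, 30, 80}, {102, 50, 40}, {103, 10, 5} };
    struct node nodes[] = { {1, 40, 0.9}, {2, 70, 0.3}, {3, 120, 0.6} };
    int np = 3, nn = 3;

    /* Stage 1: lowest-migration-cost process whose memory covers the overflow. */
    int victim = -1;
    for (int i = 0; i < np; i++)
        if (procs[i].mem >= overflow &&
            (victim < 0 || procs[i].mig_cost < procs[victim].mig_cost))
            victim = i;

    /* Target: enough free memory for the victim, lowest load among those. */
    int target = -1;
    for (int i = 0; i < nn && victim >= 0; i++)
        if (nodes[i].free_mem > procs[victim].mem &&
            (target < 0 || nodes[i].load < nodes[target].load))
            target = i;

    if (victim >= 0 && target >= 0)
        printf("migrate pid %d (%ld MB) to node %d\n",
               procs[victim].pid, procs[victim].mem, nodes[target].id);
    else
        printf("no suitable victim/target; fall back to stage 2\n");
    return 0;
}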
Preemptive Process Migration

Unique home node
• Each process has a Unique Home-Node (UHN), the node where it was created.
• Processes that migrate to other (remote) nodes use local (remote-node) resources whenever possible, but interact with the user's environment through the UHN.
• To the user, the process seems to be running at the UHN (transparency) irrespective of where it may have migrated (SSI).
Does the process migrate completely?
• User context (the "remote") - contains the program code, stack, data, memory maps, and registers of the process. The remote encapsulates the process when it is running at the user level; hence it migrates easily.
• System context (the "deputy") - contains a description of the resources that the process is attached to, and a kernel stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel; hence it must remain at the UHN of the process.
On the local node
Contacts the remote mig daemon. After a successful handshake, calls mig_do_send():
• send the mm area, i.e., some fields (e.g. start, end, ...) of the process's mm.
• send the vma areas.
• send pages to populate the memory area.
• send the floating-point context (fp registers, fp state).
• send the process context (e.g. credentials, normal registers, misc ...).
When this function returns, mark the process DDEPUTY here.
On the remote node
When the mig daemon accepts a contact on the remote, it starts a new user process which calls mig_do_receive() after a successful handshake. As it knows neither how many memory pages nor how many vmas it is going to receive, it will:
• loop, waiting for data to receive.
• identify the type of data it is about to receive.
• pass the data to the right receive function.
• break the loop on receiving MIG_TASK, calling arch_kickstart(), which jumps into a kernel return-to-userspace path (ret_from_kickstart). (A user-space sketch of this loop follows.)
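A minimal user-space sketch of that send/receive structure (the message types and layout are assumptions made for illustration; the real mig_do_send()/mig_do_receive() are kernel code and transfer mm fields, vmas, pages and register state over the daemon connection):

/* Sketch of a typed-message migration stream: the sender emits a sequence
 * of tagged sections and finishes with MIG_TASK; the receiver loops,
 * dispatching each tag to the right handler until MIG_TASK arrives. */
#include <stdio.h>

enum mig_type { MIG_MM, MIG_VMA, MIG_PAGE, MIG_FPU, MIG_CONTEXT, MIG_TASK };

struct mig_msg { enum mig_type type; size_t len; /* payload would follow */ };

static void handle(const struct mig_msg *m)
{
    switch (m->type) {
    case MIG_MM:      printf("received mm fields (%zu bytes)\n", m->len); break;
    case MIG_VMA:     printf("received a vma (%zu bytes)\n", m->len);     break;
    case MIG_PAGE:    printf("received a page (%zu bytes)\n", m->len);    break;
    case MIG_FPU:     printf("received fp context\n");                    break;
    case MIG_CONTEXT: printf("received process context\n");               break;
    case MIG_TASK:    printf("MIG_TASK: kick-start the process\n");       break;
    }
}

int main(void)
{
    /* A canned "stream" standing in for data read from the daemon socket. */
    struct mig_msg stream[] = {
        { MIG_MM, 64 }, { MIG_VMA, 96 }, { MIG_PAGE, 4096 }, { MIG_PAGE, 4096 },
        { MIG_FPU, 512 }, { MIG_CONTEXT, 256 }, { MIG_TASK, 0 },
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++) {
        handle(&stream[i]);
        if (stream[i].type == MIG_TASK)
            break;              /* receiver leaves the loop and runs the task */
    }
    return 0;
}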
What happens to the deputy stub?
• It must not be killed, because we need it to complete tasks that the remote process can't do (e.g. open a file, read from disk). It also can't return to userspace, for lack of userspace memory and free CPU cycles.
• Before returning to userspace, the kernel tests whether a task is DMIGRATED (migrating); if so it calls openmosix_pre_usermode().
• This function calls deputy_main_loop(), which loops until the process has been marked DDEPUTY and then waits for either incoming (from the remote) or pending (to be sent to the remote) messages.
Handling syscalls
There are 2 kinds:
1) those that can be executed like a local syscall, e.g. ones that interact with the migrated memory,
2) those that need to be handled on the home node, like file-related ones.
Calling sys_close on the remote
[remote]
program calls sys_close(13);
--- entering kernel ---
execute om_sys_remote
eax = 6 (NR_CLOSE)
ebx = 13 (FIRST ARG)
call do_remote_syscall
[deputy]
receive remote syscall request
call sys_close
reply to the remote syscall
[remote]
return value
--- resume userspace ---
Passing arguments has a problem!
• Arguments which are pointers are user addresses. On the deputy there is no user memory, so any syscall that refers to at least one memory address would fail. We therefore need to pack, on the remote, all data associated with each pointer, then send it along with the arguments and the syscall number.
• When the deputy tries to copy to or from user memory, deputy_copy_(from|to)_user is called. The address is searched for in the packed data and, on success, data is copied to/from the real address into the ucache data (sketched below).
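A minimal sketch of that packing idea (the structures and lookup are assumptions for illustration, not the openMosix deputy code): the remote ships the bytes behind each pointer argument together with the user address, and the deputy resolves copy requests against that packed table instead of real user memory.

/* Sketch: resolve a "user address" against data packed by the remote,
 * the way deputy_copy_from_user conceptually works on the deputy side. */
#include <stdio.h>
#include <string.h>

struct packed_arg {
    unsigned long uaddr;    /* user address the pointer argument had      */
    size_t len;             /* number of bytes shipped with the request   */
    const void *data;       /* copy of the user data, sent by the remote  */
};

/* Copy len bytes "from user" by looking the address up in the packed table. */
static int deputy_copy_from_packed(void *dst, unsigned long uaddr, size_t len,
                                   const struct packed_arg *tbl, int n)
{
    for (int i = 0; i < n; i++) {
        if (uaddr >= tbl[i].uaddr && uaddr + len <= tbl[i].uaddr + tbl[i].len) {
            memcpy(dst, (const char *)tbl[i].data + (uaddr - tbl[i].uaddr), len);
            return 0;
        }
    }
    return -1;              /* address was not packed: the copy must fail */
}

int main(void)
{
    const char path[] = "/tmp/example";
    struct packed_arg tbl[] = {
        { .uaddr = 0x8048000UL, .len = sizeof path, .data = path },
    };

    char buf[32];
    if (deputy_copy_from_packed(buf, 0x8048000UL, sizeof path, tbl, 1) == 0)
        printf("deputy sees pathname: %s\n", buf);
    return 0;
}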
Fork and Clone
• Requires the creation of a new link between the child on the deputy and the child on the remote.
• On receiving a remote_sys_fork, the remote creates a socket waiting for communication; the socket's sockaddr is then sent, with the registers, to the deputy along with a request to fork.
• On the deputy, after getting the request to fork, connect to the received sockaddr, then create the new task. If all goes well, assign the new socket to the new task, then reply to the remote's request.
• After receiving the reply on the remote, fork the task there as well and assign the socket to the new task.
Signal delivery
• Other forms of interaction: signal delivery and process wakeup events, such as when network data arrives.
• In a typical scenario, the kernel at the UHN informs the deputy of the event. The deputy checks whether any action needs to be taken, and if so, informs the remote. The remote monitors the communication channel for reports of asynchronous events, like signals, just before resuming user-level execution (by calling openmosix_pre_usermode()).
Direct File System Access: an openMosix special

Direct File System Access
• As the name implies, DFSA is a provision that allows a migrated process to directly access files at its current location.
• When combined with an appropriate cluster file system, it substantially increases I/O performance and decreases network congestion.
The need for DFSA
• As we have seen, openMosix is efficient for load-balancing CPU-bound processes.
• However, it is inefficient for running processes with a significant amount of I/O or file system operations.
• For example, a migrated process must forward each file-related syscall to its home node (the figure in the original slides shows the message flow).
• So we see that a simple read() syscall requires 4 messages, and an I/O-intensive process has the potential to congest the entire network, not to mention slow itself down.
• To overcome this problem, the scheme of direct file system access was devised.
What is DFSA all about?
• Whenever openMosix encounters a highly I/O-oriented process, it migrates it to the node (file server) where the necessary files reside.
• On this node DFSA acts as a re-routing switch which intercepts and performs most of the file-related syscalls on the node itself!
Advantages of DFSA
• Substantial increase in I/O performance
• Reduction in network congestion
• It becomes possible to partition large files into several smaller ones and place them on different file servers, so that parallel access to the data (through parallel migrated processes) is possible.
Requirements from the supporting file systems
• Single-node consistency.
• Time-stamps on files in the same partition must be consistent and non-decreasing, regardless of which node the modifications are made from.
Single-node consistency
• The results of any sequence of read/write operations on a file by processes running on a set of nodes must be the same as if all those processes were running on one node.
• NFS does not provide single-node consistency when used across multiple nodes.
The Mosix File System
• At the time DFSA was developed, there was no production-level, scalable (without shared hardware) file system available which satisfied the above criteria.
• So, to test the performance of DFSA, the Mosix file system (MFS/oMFS) was developed!
The Mosix File System
Features of MFS
• Provides a unified view of all files on all mounted file systems on all the nodes of a MOSIX cluster, as if they were all on a single partition.
 • E.g. if MFS is mounted on the /mfs mount-point, then the file /mfs/14/usr/tmp/m.c refers to the file /usr/tmp/m.c on node #14 (a small parsing sketch follows).
• Scalability and single-node consistency.
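A minimal sketch of that naming convention (a user-space parser written only for illustration; MFS itself implements this inside the kernel): split an /mfs path into the node number and the per-node path.

/* Parse an MFS-style path "/mfs/<node>/<path>" into node number and path. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int mfs_split(const char *mfs_path, int *node, const char **rest)
{
    const char *prefix = "/mfs/";
    if (strncmp(mfs_path, prefix, strlen(prefix)) != 0)
        return -1;
    char *end;
    long n = strtol(mfs_path + strlen(prefix), &end, 10);
    if (end == mfs_path + strlen(prefix) || *end != '/')
        return -1;                 /* no node number or no following path */
    *node = (int)n;
    *rest = end;                   /* the path as seen on that node */
    return 0;
}

int main(void)
{
    int node;
    const char *rest;
    if (mfs_split("/mfs/14/usr/tmp/m.c", &node, &rest) == 0)
        printf("node #%d, file %s\n", node, rest);   /* node #14, /usr/tmp/m.c */
    return 0;
}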
The Mosix File System
Client/server model
• When a process issues an MFS-related syscall, the local kernel acts as a client and redirects the request to the appropriate MFS server.
• Thus each node can be used both as a client and as a server.
The Mosix File System
Disadvantages
• The lack of a cache on clients is a major drawback for I/O operations with smaller block sizes.
• High availability is not supported, i.e., failure of a node prevents any access to the files which were on that node.
Bringing the process to the file
• Each process collects statistics on its MFS usage, including usage rates and the amount of data accessed on each node.
• The list of nodes is continuously updated according to the process's most recent I/O activities.
• These statistics are incorporated, along with the other collected information like CPU usage, into the process migration policy.
Bringing the process to the file
• If the supervising algorithm detects that a process's I/O is exceeding a certain threshold, the process is marked as a candidate for migration.
• After weighing the amount of I/O operations against the CPU load-balancing considerations, the process is migrated (if required) to the node on which it does most of its I/O.

Performance
MigSHM: the Indian touch

Shared memory handling
• Earlier versions of openMosix did not have a provision for migrating processes that use shared memory.
• The Migshm patch for openMosix was introduced by the MAASK team (5 girls from Pune).
• Migshm's job is to intercept function calls like shmget(), shmat() and shmdt(), which handle shared memory.
Modules in Migshm
• Migration of shared-memory processes - this module enables shared-memory processes to be picked up for migration during their execution.
• It keeps track of information such as the identifier of the process and the virtual memory address of the shared memory region to which the process is attached.
• It also keeps track of the identifier of the shared memory region and the access counts on the shared memory region by all processes attached to it.
Consistency module
• Its task is to maintain consistency among the copies of shared memory pages that exist on different nodes.
• Local copies of modified pages are written back to the original owner only when the lock on that memory segment is released, thus not on every write.
• After writing to the owner node, an invalidate message is sent to all the nodes. When a process on some other node has to access the modified page, it page-faults to the owner node.
• When a process tries to access an invalid page, it page-faults to the owner node.
• The function sys_semop is used to acquire or release a semaphore lock. The release of the semaphore is used as the event for flushing out dirty pages of the shared memory region and sending the invalidate message.
Communication module
• To enable communication between two processes on different nodes, a daemon called MigSharedMemD has been implemented.
• This daemon is responsible for receiving messages, which are identified with the help of a header type.
Access logs and migration decision
• This module keeps track of the number of times shared memory has been accessed by a process.
• A process is strongly linked if it accesses the shared memory more than all other processes taken together (see the small test sketched below).
• If a process is strongly linked, we migrate the shared memory along with the process.
• The function log_access increments the count for read and write accesses. It is invoked from mosix_mem_daemon, which is the memsorter and logs memory usage for each process.
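A minimal sketch of the "strongly linked" test described above (the counters and names are illustrative, not Migshm's actual data structures):

/* A process is "strongly linked" to a shared memory region if its access
 * count exceeds the accesses of all other attached processes combined. */
#include <stdio.h>

static int is_strongly_linked(const long *access, int nprocs, int p)
{
    long others = 0;
    for (int i = 0; i < nprocs; i++)
        if (i != p)
            others += access[i];
    return access[p] > others;
}

int main(void)
{
    long access[] = { 120, 30, 25 };     /* per-process access counts */
    for (int p = 0; p < 3; p++)
        printf("process %d strongly linked: %s\n",
               p, is_strongly_linked(access, 3, p) ? "yes" : "no");
    return 0;
}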
Migration of shared memory
• openMosix uses logical migration, i.e. the owner node of the shared memory has the latest copy, without creating a new entry in the IPC array on the new node.
• MigShm also implements the owner page-fault policy by fetching the page from the new owner node and not from the initial owner.
• If a page was not referenced and is not present on the owner node, then it is fetched from the home node.
• The function consider performs this task. Here shared memory may be migrated along with a strongly linked task; it is logically migrated and the change in ownership is broadcast.
The openMosix API
Kernel 2.4 API and implementation
• No new system calls
• Everything is done through /proc
 /proc/hpc
 /proc/hpc/admin          Administration
 /proc/hpc/info           Cluster-wide information
 /proc/hpc/nodes/nnnn/    Per-node information
 /proc/hpc/remote/pppp/   Remote proc. information
Impact on the kernel
• MOSIX for the 2.2.19 kernel:
 • 80 new files (40,000 lines)
 • 109 modified files (7,000 lines changed/added)
 • About 3,000 lines are load-balancing algorithms
• openMosix for Linux 2.4.26:
 • 47 new files (38,500 lines)
 • 126 kernel files modified (5,200 lines changed/added)
 • 48 user-level files (12,000 lines)
People behind openMosix
• Copyright for openMosix: Moshe Bar
• Barak and Moshe Bar were co-project managers of Mosix until Nov 2001
• Team members
 • Danny Getz (migration)
 • Avraham Ben Yehudah (MFS and 2.5.x)
 • David Santo Orcero (user-space utilities)
 • Michael Farnbach (external patch matching, i.e. XFS, JFS etc.)
 • Many others, including help from Ingo Molnar, Alan Cox, Andrea Arcangeli and Rik van Riel
Present and future of openMosix
Current projects
 • Migrating sockets
 • Network RAM
 • Checkpoint / Restart
 • Queue manager / scheduler
Future plans
 • Inclusion in Linux 2.6
 • Re-writing MFS
 • Increasing the number of developers to 20-30
References
• Amir Y., Awerbuch B., Barak A., Borgstrom R.S. and Keren A., An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster. IEEE Trans. Parallel and Distributed Systems 11(7), pp. 760-768, July 2000.
• Barak A. and La'adan O., The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems 13(4-5), pp. 361-372, March 1998.
• Barak A., The MOSIX Organizational Grid - A White Paper, August 2005.
• Amar L., Barak A. and Shiloh A., The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems. Cluster Computing 7(2), pp. 141-150, April 2004.
• Peer I., Barak A. and Amar L., A Gossip-Based Distributed Bulletin Board with Guaranteed Age Properties, submitted for publication.
• Introduction to openMosix, presented by Moshe Bar and Maya, Anu, Asmita, Snehal, Krushna (MAASK).
• openMosix Internals - Hacker's Guide, by Vincent Hanquez.
• MigShm Project: Migration of Shared Memory, by MAASK.
• Barak A., Memory Ushering in a Scalable Computing Cluster, 1997.

For downloading openMosix, visit http://openmosix.sourceforge.net
For a live example of openMosix, please log in to yogini

THANK YOU