HPCC - Chapter 1

High Performance Cluster Computing
Architectures and Systems
Hai Jin
Internet and Cluster Computing Center
Scheduling Parallel Jobs on Clusters







Outline
- Introduction
- Background
- Rigid Jobs with Process Migration
- Malleable Jobs with Dynamic Parallelism
- Communication-Based Coscheduling
- Batch Scheduling
- Summary
Introduction (I)

- Clusters are increasingly being used for HPC applications
  - High cost of MPPs
  - Wide availability of networked workstations and PCs
- How to add the HPC workload to the original general-purpose workload on the cluster
  - Without degrading the service of the original workload
Introduction (II)

- Issues in supporting HPC applications:
  - The acquisition of resources
    - How to distinguish between workstations that are in active use and those that have spare resources available
  - The requirement to give priority to workstation owners
    - The HPC workload must not cause noticeable degradation of their work
    - Possible use of admission control and scheduling policies to regulate the additional HPC workload
  - The requirement to support different styles of parallel programs
    - Different styles place different constraints on the scheduling of their processes
- These issues are interdependent
Background (I)

- Cluster Usage Modes
  - NOW (Network Of Workstations)
    - Based on tapping the idle cycles of existing resources
    - Each machine has an owner; when the owner is inactive, the resources become available for general use
    - Examples: Berkeley NOW, Condor, and MOSIX
  - PMMPP (Poor Man's MPP)
    - A dedicated cluster acquired for running HPC applications
    - Fewer constraints on the interplay between the regular workload and the HPC workload
    - Examples: the Beowulf project, the RWC PC cluster, and ParPar
- This chapter concentrates on scheduling in a NOW environment
Background (II)

- Job Types and Requirements
  - Job structure and interaction types place various requirements on the scheduling system
  - Three most common types:
    - Rigid jobs with tight coupling
      - A fixed number of processes
      - Processes communicate and synchronize at a high rate
      - In MPP environments, each job runs in a dedicated partition of the machine
      - When gang scheduling is used, time slicing replaces dedicated partitions
Background (III)

- Job Types and Requirements
  - Three most common types (continued):
    - Rigid jobs with balanced processes and loose interactions
      - Do not require that the processes execute simultaneously
      - Do require that the processes progress at about the same rate
    - Jobs structured as a workpile of independent tasks
      - Tasks are executed by a number of worker processes that take tasks from the workpile and execute them, possibly creating new tasks in the process
      - A very flexible model that allows the number of workers to change at runtime (see the sketch below)
      - Leads to malleable jobs that are very suitable for a NOW environment
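As a concrete illustration, here is a minimal sketch of the workpile model in Python, assuming a task is a callable that may return a list of new tasks; a real system would distribute the pile across machines:

```python
# Minimal workpile sketch: workers repeatedly take tasks from a shared
# pile; a task may create new tasks. All names here are illustrative.
import queue
import threading

workpile = queue.Queue()

def worker(stop_event):
    """Take tasks from the workpile and execute them, putting any
    newly created tasks back on the pile."""
    while not stop_event.is_set():
        try:
            task = workpile.get(timeout=0.1)
        except queue.Empty:
            continue
        for new_task in task() or []:   # a task may spawn more tasks
            workpile.put(new_task)
        workpile.task_done()

# Malleability: the number of workers can change at runtime simply by
# starting more threads or signalling existing ones to stop.
stop = threading.Event()
workers = [threading.Thread(target=worker, args=(stop,)) for _ in range(4)]
for w in workers:
    w.start()

workpile.put(lambda: None)   # a task that creates no new tasks
workpile.join()              # wait until the pile is drained
stop.set()
```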
Rigid Jobs with Process Migration

- Process Migration
  - The subsystem responsible for the HPC applications does not have full control over the system
  - Process migration involves the remapping of processes to processors during execution
- Reasons for migration
  - The need to relinquish a workstation and return it to its owner
  - The desire to achieve a balanced load on all workstations
- Metrics
  - Overhead, and the degree to which the migrated process detaches from its source node
- Algorithmic aspects (a decision sketch follows)
  - Which process to migrate
  - Where to migrate it
  - These decisions depend on the data that is available about the workload on each node
  - Issues of load measurement and information dissemination
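To make the two decisions concrete, here is a hedged sketch that first evacuates workstations reclaimed by their owners and otherwise migrates from the most loaded node to the least loaded idle one; the data layout and the load-gap threshold are invented for illustration:

```python
def choose_migration(nodes):
    """Return (process, target_node) or None.
    nodes: name -> {'load': float, 'owner_active': bool, 'processes': list}"""
    # Evacuation has priority: a workstation reclaimed by its owner
    # must be vacated regardless of load considerations.
    sources = [n for n, i in nodes.items()
               if i['owner_active'] and i['processes']]
    if not sources:
        # Otherwise consider the most loaded node, for load balance.
        sources = [max(nodes, key=lambda n: nodes[n]['load'])]
    src = sources[0]
    if not nodes[src]['processes']:
        return None
    # Candidate targets: other nodes whose owner is not active.
    targets = [n for n, i in nodes.items()
               if n != src and not i['owner_active']]
    if not targets:
        return None
    dst = min(targets, key=lambda n: nodes[n]['load'])
    # A balance-driven migration is only worthwhile if the load gap
    # outweighs the migration overhead (threshold invented here).
    if not nodes[src]['owner_active'] and \
            nodes[src]['load'] - nodes[dst]['load'] < 1.0:
        return None
    return nodes[src]['processes'][0], dst

print(choose_migration({
    "ws1": {"load": 3.0, "owner_active": True,  "processes": ["p7"]},
    "ws2": {"load": 0.2, "owner_active": False, "processes": []},
}))  # -> ('p7', 'ws2')
```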
Case Study: PVM with Migration (I)

- PVM is a software package for writing and executing parallel applications on a LAN
  - Communication / synchronization operations
  - Configuration control
  - Dynamic spawning of processes
- To create a virtual parallel machine
  - A user spawns PVM daemon processes on a set of workstations
  - The daemons establish communication links among themselves, creating the infrastructure of the parallel machine
- PVM distributes its processes in round-robin manner among the workstations being used
  - This may create unbalanced loads and lead to unacceptable degradation in service
[Figure: Communication in PVM is mediated by a daemon (pvmd) on each node. Local processes talk to their node's pvmd, and the daemons communicate with one another.]
Case Study: PVM with Migration (II)

- Several experimental versions of PVM exist
  - Migratable PVM and Dynamic PVM
  - Both include migration in order to move processes to more suitable locations
  - PVM has also been coupled with MOSIX and Condor to achieve similar benefits
Case Study: PVM with Migration (III)

- Migration decisions
  - Made by a global scheduler
  - Based on information regarding load and owner activity
- Four steps (sketched below):
  1. The global scheduler notifies the responsible PVM daemon that one of its processes should be migrated
  2. That daemon notifies all of the other PVM daemons about the pending migration
  3. The process state is transferred to the new location and a new process is created
  4. The new process connects to the local PVM daemon in the new location and notifies all the other PVM daemons
- Migration in this system is asynchronous
  - It can happen at any time
  - It affects only the migrating process and other processes that may try to communicate with it
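The four steps might look as follows in a self-contained sketch; the Daemon class and its method names are illustrative stand-ins, not the actual mpvmd interface:

```python
# Illustrative sketch of the four-step asynchronous migration protocol.
class Daemon:
    def __init__(self, host):
        self.host = host
        self.blocked = set()            # processes we must not send to

    def block_and_flush(self, proc):
        self.blocked.add(proc)          # step 2: block sends, flush queue
        print(f"{self.host}: blocked sends to {proc}")

    def unblock(self, proc, new_host):
        self.blocked.discard(proc)      # step 4: resume sends to new home
        print(f"{self.host}: {proc} now lives on {new_host}")

def migrate(daemons, proc, src, dst):
    # Step 1: the global scheduler notifies the source daemon.
    print(f"GS -> {src}: migrate {proc} to {dst}")
    # Step 2: all other daemons block sends to the migrating process
    # and flush messages already addressed to it.
    for d in daemons.values():
        if d.host != src:
            d.block_and_flush(proc)
    # Step 3: transfer the state (code, data, heap, stack) and create
    # a new process on the destination (elided here).
    print(f"{src} -> {dst}: state transfer of {proc}")
    # Step 4: the new process connects to its local daemon, and all
    # daemons are told to redirect traffic and unblock sends.
    for d in daemons.values():
        d.unblock(proc, dst)

daemons = {h: Daemon(h) for h in ("host1", "host2", "host3")}
migrate(daemons, "VP1", "host1", "host2")
```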
[Figure: Migratable PVM migrating VP1 from Host1 to Host2. The global scheduler tells the mpvmd on Host1 to migrate VP1; the daemons block sends to VP1 and flush messages already sent to it; the process state (code, data, heap, stack) is transferred over UDP to a skeleton process VP1' on Host2; VP1' is then restarted and sends to it are unblocked.]
Case Study: MOSIX (I)



- MOSIX: Multicomputer Operating System for unIX
- Supports adaptive resource sharing in a scalable computing cluster through dynamic process migration
- Based on a Unix kernel augmented with
  - A process migration mechanism
  - A scalable facility to distribute load information
- All processes enjoy about the same level of service
  - Both sequential jobs and components of parallel jobs
- Maintains a balanced load on all the workstations in the cluster
[Figure: MOSIX infrastructure. A Unix kernel (BSD or Linux) is augmented with PPM (Preemptive Process Migration) and ARSA (Adaptive Resource Sharing Algorithms).]
Case Study: MOSIX (II)

- The load information is distributed using a randomized algorithm (simulated in the sketch below)
  - Each workstation maintains a load vector with data about its own load and the loads of other machines
  - At certain intervals (e.g., once every minute), it sends this information to another, randomly selected machine
  - With high probability, it will also be selected by some other machine and receive such a load vector
  - If some other machine turns out to have a significantly different load, a migration operation is initiated
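The randomized dissemination can be simulated in a few lines; the synchronous interval loop, the Node fields, and the migration threshold are assumptions made for the example:

```python
# Simulation of randomized load-vector dissemination.
import random

class Node:
    def __init__(self, name, load):
        self.name = name
        self.load = load
        self.vector = {name: load}   # own load plus what we have heard

    def gossip(self, nodes):
        """Send the local load vector to one randomly chosen machine."""
        self.vector[self.name] = self.load
        peer = random.choice([n for n in nodes if n is not self])
        peer.vector.update(self.vector)

    def maybe_migrate(self, threshold=2.0):
        """Initiate a migration if some machine looks much less loaded."""
        target = min(self.vector, key=self.vector.get)
        if self.load - self.vector[target] > threshold:
            print(f"{self.name}: migrate one process to {target}")

nodes = [Node(f"ws{i}", load=random.uniform(0.0, 5.0)) for i in range(8)]
for _ in range(10):              # ten dissemination intervals
    for n in nodes:
        n.gossip(nodes)
        n.maybe_migrate()
```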
Case Study: MOSIX (III)

- Migration is based on the home-node concept
  - Each process has a home node: its own workstation, where it was created
  - When a process is migrated, it is split into two parts: the body and the deputy
    - The body contains all the user-level context and the site-independent kernel context; it is migrated to another node
    - The deputy contains the site-dependent kernel context; it is left on the home node
  - A communication link is established between the two parts, so that the process can access its local environment via the deputy and so that other processes can access it
- Running PVM over MOSIX leads to improvements in performance
[Figure: Process migration in MOSIX divides the process into a migratable body, which moves from the overloaded workstation to an underloaded one, and a site-dependent deputy, which remains in the home node. Site-dependent system calls are forwarded to the deputy; return values and signals flow back to the body.]
Malleable Jobs with Dynamic Parallelism

- Parallel jobs should adjust to the available resources
  - Workstations may be reclaimed by their owners at unpredictable times
  - Parallel jobs should adjust to such varying resources
- This emphasizes the dynamics of workstation clusters
Identifying Idle Workstations


- Use only idle workstations
- Using idle workstations to run parallel jobs requires
  - The ability to identify idle workstations
    - e.g., by monitoring keyboard and mouse activity (a detection sketch follows)
  - The ability to retreat from a workstation
    - When a workstation has to be evicted, the worker is killed and its tasks are reassigned to other workers
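One plausible way to detect idleness, sketched under the assumption that input devices are visible under /dev on a Unix-like system; the device paths and the 10-minute threshold are invented, and production systems such as Condor also weigh load average:

```python
# Idleness detection by watching input-device access times.
import glob
import os
import time

IDLE_AFTER = 10 * 60        # seconds without keyboard/mouse activity

def seconds_since_input(patterns=("/dev/input/event*", "/dev/tty*")):
    """The newest access time across input devices approximates the
    time of the last keyboard or mouse event."""
    latest = 0.0
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                latest = max(latest, os.stat(path).st_atime)
            except OSError:
                continue
    return time.time() - latest

if seconds_since_input() > IDLE_AFTER:
    print("workstation looks idle: eligible for HPC work")
```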
Case Study: Condor and WoDi (1)

- Condor is a system for running batch jobs in the background on a LAN, using idle workstations
  - When a workstation is reclaimed by its owner, the batch process is suspended
  - It can later be restarted from a checkpoint on another node
  - Condor is the basis for the LoadLeveler product used on IBM workstations
Case Study: Condor and WoDi (2)

- Condor was augmented with CARMI
  - CARMI: Condor Application Resource Management Interface
  - Allows jobs
    - To request additional resources
    - To be notified if resources are taken away
Case Study: Condor and WoDi (3)

- WoDi (Work Distributor)
  - Supports simple programming of master-worker applications
  - The master process sends work requests (tasks) to the WoDi server, which coordinates their execution
[Figure: Master-worker applications use the WoDi server to coordinate task execution. The application master exchanges tasks and results with the WoDi server, which relays them to the application workers; resource requests go through a CARMI server to the Condor scheduler, which makes allocations and spawns workers.]
Case Study: Piranha and Linda (1)

- Linda
  - A parallel programming language; more precisely, a coordination language that can be added to Fortran or C
  - Based on an associative tuple space that acts as a distributed data repository
  - Parallel computations are created by injecting unevaluated tuples into the tuple space
  - The tuple space can also be viewed as a workpile of independent tuples that need to be evaluated (see the toy tuple space below)
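A toy, single-process tuple space conveys the associative-matching idea; the method names echo Linda's out/in vocabulary, but this is a sketch, not a distributed implementation:

```python
# Toy associative tuple space in the spirit of Linda.
import threading

class TupleSpace:
    def __init__(self):
        self.tuples = []
        self.cond = threading.Condition()

    def out(self, *tup):
        """Deposit a tuple into the space."""
        with self.cond:
            self.tuples.append(tup)
            self.cond.notify_all()

    def take(self, pattern):
        """Associative, blocking withdrawal: None matches any field."""
        with self.cond:
            while True:
                for tup in self.tuples:
                    if len(tup) == len(pattern) and all(
                            p is None or p == f
                            for p, f in zip(pattern, tup)):
                        self.tuples.remove(tup)
                        return tup
                self.cond.wait()

ts = TupleSpace()
ts.out("work", 1, "square")          # inject a work tuple
tag, n, op = ts.take(("work", None, None))
ts.out("result", n, n * n)           # the evaluated tuple becomes a result
```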
Case Study: Piranha and Linda (2)

- Piranha
  - A system for executing Linda applications on a NOW
  - Programs that run under Piranha must include three special user-defined functions: feeder, piranha, and retreat (skeletons follow the figure below)
    - The feeder function generates work tuples (w)
    - The piranha function is executed automatically on idle workstations and transforms work tuples into result tuples (r)
    - The retreat function is called when the workstation is reclaimed by its owner
[Figure: Piranha programs include a feeder function that generates work tuples (w), and piranha functions that are executed automatically on idle workstations and transform the work tuples into result tuples (r). If a workstation is reclaimed by its owner, the retreat function is called to return its unfinished work to the tuple space.]
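Skeletons of the three functions, written against the toy TupleSpace from the earlier Linda sketch; the calling conventions are illustrative, not the real Piranha API:

```python
# Skeletons of the three user-defined Piranha functions.
def feeder(ts, n_tasks):
    """Runs on the user's machine and generates work tuples (w)."""
    for i in range(n_tasks):
        ts.out("w", i)

def piranha(ts):
    """Invoked automatically on an idle workstation: turn one work
    tuple into a result tuple (r)."""
    _, i = ts.take(("w", None))
    ts.out("r", i, i * i)

def retreat(ts, unfinished):
    """Called when the workstation is reclaimed by its owner: return
    unfinished work to the tuple space so another node can take it."""
    for i in unfinished:
        ts.out("w", i)
```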
Communication-Based Coscheduling

- If the processes of a parallel application communicate and synchronize frequently, they should execute simultaneously on different processors
  - This saves the overhead of frequent context switches
  - And reduces the need for buffering during communication
- Combined with time slicing, this is what gang scheduling provides
  - Gang scheduling implies that the participating processes are known in advance
- The alternative is to identify them during execution
  - Then only a subset of the processes may be scheduled together, leading to coscheduling rather than gang scheduling
Demand-Based Coscheduling (1)


- Base the decision about which processes should be scheduled together on actual observations of the communication patterns
- Requires the cooperation of the communication subsystem, which
  - Monitors the destination of incoming messages
  - Raises the priority of the destination process
  - As a result, the sender process may be coscheduled with the destination process
- Problem
  - Raising the priority of any process that receives a message is unfair when multiple parallel jobs co-exist
  - Left unchecked, it would allow one parallel job to take over the whole cluster in the face of another active job
  - The solution is epoch numbers
[Figure: Epoch numbers on processors P1 through P6 over time. Each node's epoch number is incremented when it makes a spontaneous context switch.]
Demand-Based Coscheduling (2)

- The epoch number on each node is incremented when a spontaneous context switch is made
  - i.e., one that is not the result of an incoming message
- The epoch number is appended to all outgoing messages
- When a node receives a message, it compares its local epoch number with the one in the incoming message
- It switches to the destination process only if the incoming epoch number is greater (see the sketch below)
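A compact sketch of the epoch-number rule; the message format and the scheduler hooks are assumptions for illustration:

```python
# Demand-based coscheduling with epoch numbers.
class NodeScheduler:
    def __init__(self):
        self.epoch = 0

    def spontaneous_switch(self):
        """A context switch not caused by a message increments the epoch."""
        self.epoch += 1

    def stamp(self, payload):
        """Append the local epoch number to an outgoing message."""
        return {"epoch": self.epoch, "payload": payload}

    def on_receive(self, msg, dest_process):
        """Switch to the destination process only if the incoming
        epoch number is greater than the local one."""
        if msg["epoch"] > self.epoch:
            print(f"switch to {dest_process} (coscheduling demand)")
        else:
            print(f"no switch; {dest_process} waits for its turn")

a, b = NodeScheduler(), NodeScheduler()
b.spontaneous_switch()                # b has moved on to another job
a.on_receive(b.stamp("req"), "P2")    # incoming epoch 1 > 0: switch
b.on_receive(a.stamp("ack"), "P1")    # incoming epoch 0 <= 1: no switch
```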
Implicit Coscheduling (1)

- Explicit control may be unnecessary
  - Using standard Unix facilities (sockets)
  - Unix processes that perform I/O (including communication) get a higher priority
    - So processes participating in a communication phase will get high priority on their nodes
    - Without any explicit measures being taken
  - This suits jobs that alternate between long phases of computation and phases of intensive communication
Implicit Coscheduling (2)

- Make sure that a communicating process is not de-scheduled while it is waiting for a reply from another process
  - Using two-phase blocking (spin blocking), sketched below
    - A waiting process initially busy waits (spins) for some time, waiting for the anticipated response
    - If the response does not arrive within the prespecified time, the process blocks and relinquishes its processor in favor of another ready process
- Implicit coscheduling keeps processes in step only when they are communicating
  - During computation phases they do not need to be coscheduled
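Two-phase blocking is easy to express as a spin deadline followed by a blocking wait; here the reply is modeled as a threading.Event, and the 1 ms spin budget is an invented parameter that would in practice be tuned against the context-switch cost:

```python
# Two-phase (spin-then-block) waiting.
import threading
import time

def spin_block_wait(reply: threading.Event, spin_time=0.001):
    """Busy-wait briefly for the anticipated response; if it does not
    arrive in time, block and relinquish the processor."""
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:        # phase 1: spin
        if reply.is_set():
            return True
    return reply.wait()                       # phase 2: block

reply = threading.Event()
threading.Timer(0.01, reply.set).start()      # response arrives later
spin_block_wait(reply)                        # spins, then blocks
```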
Batch Scheduling

- There is a qualitative difference between the work done by workstation owners and the parallel jobs that try to use spare cycles
- Workstation owners
  - Do interactive work
  - Require immediate response
- Parallel jobs
  - Are compute-intensive
  - Run for long periods
  - So it is natural to queue them until suitable resources become available
Admission Controls (1)



- HPC applications place a heavy load on the system
- Consideration for interactive users implies that these HPC applications be curbed if they hog the system
- Alternatively, the system can refuse to admit them in the first place
Admission Controls (2)

- A general solution: a batch scheduling system (e.g., DQS or PBS), sketched below
  - Defines a set of queues to which batch jobs are submitted
    - Each queue contains jobs that are characterized by attributes such as expected run time and memory requirements
  - The batch scheduler chooses jobs for execution
    - Based on their attributes and the available resources
    - Other jobs are queued so as not to overload the system
- A policy choice: use only idle workstations, vs. use all workstations with preference for those that are lightly loaded
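A minimal sketch of queue-based admission: jobs declare attributes and start only when resources allow; the attribute names are modeled loosely on DQS/PBS-style job descriptions, not on either system's actual format:

```python
# Batch admission sketch: queued jobs run only when resources allow.
from collections import namedtuple, deque

Job = namedtuple("Job", "name expected_runtime mem_mb")

class BatchQueue:
    def __init__(self, free_mem_mb):
        self.queue = deque()
        self.free_mem_mb = free_mem_mb

    def submit(self, job):
        self.queue.append(job)

    def schedule(self):
        """Start queued jobs that fit the available resources; the
        rest stay queued so the system is not overloaded."""
        still_waiting = deque()
        while self.queue:
            job = self.queue.popleft()
            if job.mem_mb <= self.free_mem_mb:
                self.free_mem_mb -= job.mem_mb
                print(f"start {job.name}")
            else:
                still_waiting.append(job)
        self.queue = still_waiting

q = BatchQueue(free_mem_mb=4096)
q.submit(Job("simulate", expected_runtime=3600, mem_mb=3000))
q.submit(Job("render", expected_runtime=600, mem_mb=2000))
q.schedule()          # "simulate" starts; "render" waits for memory
```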
Case Study: Utopia/LSF (1)

- Utopia is an environment for load sharing on large-scale heterogeneous clusters
  - A mechanism for collecting load information
  - A mechanism for transparent remote execution
  - A library for using them from applications
- Collection of load information
  - Done by a set of LIM (Load Information Manager) daemons, one on each node
  - The master LIM, on the host with the lowest host ID, collects load vectors and distributes them to all slave nodes
    - A load vector includes recent CPU queue length, memory usage, and the number of users
  - The slaves can use this information to make placement decisions for new processes (see the sketch after the figure below)
  - Using a centralized master does not scale to large systems, hence the two-level design shown below
[Figure: Utopia uses a two-level design to spread load information in large systems. Ordinary nodes (n), each running a LIM (L), are grouped under strong servers (s1, s2, s3), which exchange load information among themselves.]
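A placement decision from LIM-style load vectors might look like this; the metric fields mirror those named above, but the weighting is an invented example rather than Utopia's actual load index:

```python
# Placement decision from per-host load vectors.
def best_host(load_vectors):
    """load_vectors: host -> dict of load metrics. Pick the host with
    the lowest combined load index for a new process."""
    def index(v):
        # Invented weighting: CPU queue dominates, users count a little.
        return v["cpu_queue"] + v["mem_used_frac"] + 0.1 * v["users"]
    return min(load_vectors, key=lambda h: index(load_vectors[h]))

vectors = {
    "ws1": {"cpu_queue": 2.0, "mem_used_frac": 0.8, "users": 3},
    "ws2": {"cpu_queue": 0.5, "mem_used_frac": 0.4, "users": 1},
}
print(best_host(vectors))   # -> ws2
```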
Case Study: Utopia/LSF (2)



- Support for load sharing across clusters is provided by communication among the master LIMs of the different clusters
- It is possible to create virtual clusters that group together powerful servers that are physically dispersed across the system
- Utopia's batch scheduling system
  - Queueing and allocation decisions are made by a master batch daemon, which is co-located with the master LIM
  - The actual execution and control of the batch processes are done by slave batch daemons on the various nodes
Summary (1)

- It is most important
  - To balance the loads on the different machines
    - So that all processes get equal service
  - Not to interfere with workstation owners
    - By using only idle workstations
  - To provide parallel programs with a suitable environment
    - Simultaneous execution of interacting processes
  - Not to flood the system with low-priority compute-intensive jobs
    - Admission controls and batch scheduling are necessary
Summary (2)

- There is room for improvement
  - By considering how to combine multiple assumptions and merge the approaches used in different systems
- The following combination is possible:
  - Have a tunable parameter that selects whether workstations are shared in general, or used only when idle
  - Provide migration to enable jobs to evacuate workstations that become overloaded, are reclaimed by their owner, or become too slow relative to the other nodes running the parallel job
  - Provide communication-based coscheduling for jobs that seem to need it, preferably without requiring the user to specify it
  - Provide batch queueing with a checkpoint facility, so that heavy jobs run only when they do not degrade performance for others