Platform LSF HPC Business Presentation

Integrated Workload Management for Beowulf Clusters
Bill DeSalvo – April 14, 2004
What We’ll Cover
Platform LSF Family of Products
What is Platform LSF HPC
Key Features & Benefits
How it Works
Q&A
© Platform Computing Inc. 2003
What is the Platform LSF Family of Products?
What Problems Are We Solving?
Solve large, grand challenge, complex problems by optimizing the
placement of workload in High Performance Computing environments
Platform LSF HPC
Intelligent, policy-driven high performance computing (HPC) workload
processing
Parallel & sequential batch workload management for High
Performance Computing (HPC)
Includes patent-pending topology-based scheduling
Intelligently schedules parallel batch jobs
Virtualizes resources
Prioritizes service levels based on policies
Based on Platform LSF:
Standards-based, OGSI-compliant, grid-enabled solution
Commercial production quality product
Platform Customers
Platform LSF HPC
Platform LSF HPC for AlphaServer SC
Platform LSF HPC for IBM
Platform LSF HPC for Linux
Platform LSF HPC for SGI
Platform LSF HPC for Cray
Extensive Hardware Support
HP
HP AlphaServer SC
HP XC
HP Superdome
HP-UX 11i
SGI
SGI IRIX
SGI TRIX
SGI Altix, SGI Propack
IBM
IBM RS/6000 AIX
IBM SP2/SP3
Linux
IA-64 systems with RedHat
Intel, AMD 32-bit systems with LINUX kernel
Sun
SUN Solaris
High Performance Interconnects
Myrinet with GM
Quadrics QsNet
SGI Numa Flex, SGI NumaLink
IBM SP Switch
Platform LSF HPC – Linux Support
HP
HP XC Systems running Unlimited Linux
HP Itanium 2 systems running LINUX 2.4.x kernel, glibc 2.2 with RMS on
Quadrics QsNet/Elan3
HP Alpha/AXP systems running LINUX 2.4.x kernel, glibc 2.2.x with RMS on
Quadrics QsNet/Elan3
Linux
IA-64 systems, Kernel 2.4.x, compiled with glibc 2.2.x, tested on RedHat 7.3
x86 systems:
Kernel 2.2.x, compiled with glibc 2.1.x, tested on Debian 2.2, OpenLinux 2.4,
RedHat 6.2 and 7.0, SuSE 6.4 and 7.0, TurboLinux 6.1
Kernel 2.4.x, compiled with glibc 2.1.x, tested on RedHat 7.x and 8.0,
SuSE 7.0, and RedHat Linux Advanced Server 2.1
Clustermatic Linux 3.0 Kernel 2.4.x, compiled with glibc 2.2.x, tested on
RedHat 8.0
Scyld Linux, Kernel 2.4.x, compiled with glibc 2.2.x.
SGI
SGI Altix systems running Linux Kernel 2.4.x compiled with glibc 2.2.x and
SGI Propack 2.2 and higher
Key Features and Benefits
Platform LSF HPC
Key Features
Optimized Application, System and Hardware Performance
Enhanced Accounting, Auditing & Control
Commercial Grade System Scalability & Reliability
Extensive Hardware Support
Comprehensive, Extensible and Standards-based Security
Adaptive Interconnect Performance Optimization
Scheduling that takes advantage of unique interconnect
properties
IBM SP Switch at the POE software level
RMS on AlphaServer SC (Quadrics)
SGI topology hardware graph
Out-of-the-box functionality without any customization
required
Generic Parallel Job Launcher
Generic support for all different types of Parallel Job
Launchers
LAM/MPI, MPICH-GM, MPICH-P4, POE, SCALI,
CHAMPION PRO, etc.
Customizable for any vendor or publicly available parallel
solution
Control - ensuring no jobs can escape the workload
management system
Integrated out-of-the-box Parallel Launcher Support
Full integration with IRIX MPI and array session daemon
Full integration with SGI MPI for Linux
Full integration with Sun HPC ClusterTools providing full MPI control,
accounting and integration with Sun's Prism debugger
Vendor MPI libraries provide better performance than open source
libraries
Full support for vendor MPI libraries
Vendor integration supported by Platform
Seamless control and accounting
HPC Workload Scheduling
Dynamic load balancing supporting heterogeneous workloads
IBM SP switch aware scheduling
Scheduling of parallel jobs
Number of CPUs, min/max, node span
Backfill on processor & memory
Processor & memory reservation
Topology aware scheduling
Exclusive scheduling
Advance Reservation
Fairshare, Preemption
Accounting
High Performing, Open, Scalable Architecture
Scalable scheduler architecture
Modularized, support for over 500,000 active jobs per cluster
More than 2,000 multi-processor hosts per cluster
Process 5x more work & achieve 100% utilization
Scale with business growth
External executable support
Collect information from multiple external resources to track site specific local and
global resources
Extends out-of-the-box capabilities to manage additional resources and customer
application execution
Differentiation
Multiple vs single external resource collector
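The external executable support described above can be illustrated with a minimal sketch of an external resource collector (ELIM). It assumes the ELIM reporting convention of writing one line to standard output that holds the number of resources followed by name/value pairs; that convention should be checked against the Platform LSF documentation for the installed version. The resource names `scratch_free_gb` and `app_licenses` are hypothetical.

```python
def format_elim_report(resources):
    """Build one report line: '<count> <name1> <value1> <name2> <value2> ...'."""
    parts = [str(len(resources))]
    for name, value in resources.items():
        parts.append(name)
        parts.append(str(value))
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical site-specific resources; a real collector would measure them.
    sample = {"scratch_free_gb": 120.5, "app_licenses": 4}
    print(format_elim_report(sample), flush=True)
```

A real collector would repeat the report at a fixed interval; the sketch prints a single line so that the format is easy to inspect.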
Job Groups
Organize jobs into higher level work units - hierarchical tree
Easy to manage and control work to increase user productivity by reducing
complexity
OGSI compliance
Future-proof & protect grid investment using standards-based solutions,
interoperate with third-party systems
Intelligent Scheduling Policies
Fairshare (User & Project-based)
Ensure job resources are used for the right work
Guarantees that resource allocations among users and projects are met
Co-ordinate access to the right number of resources for different users and
projects according to pre-defined shares
Differentiation
Hierarchical & guaranteed
Goal-oriented SLA-driven policies
Based on customer SLA-driven goals: Deadline, Velocity, Throughput
Guarantees projects are completed on time
Reduces project and administration costs
Provides visibility into the progress of projects
Allows the admin to focus on “What work and When” needs to be done, not
“how” the resources are to be allocated
Policy-based Preemption
Intelligent Scheduler
Maximizes throughput of high priority critical work based on priority and
load conditions
Prevents starvation of lower priority work
Differentiation
Platform LSF supports multiple preemption policies
[Figure: plug-in scheduler modules: Fairshare, Preemption, Resource Reservation, Advance Reservation, License Scheduling, SLA (Service Level Agreement) Scheduling, MultiCluster, other scheduling modules]
Advanced Self-Management
Flexible, Comprehensive Resource Definitions
Resources defined on a node basis across an entire cluster or subset of the nodes in a
cluster
Auto-detectable or user defined resources
Adaptive membership – nodes join and leave Platform LSF clusters dynamically and
automatically without administration effort
Dynamic or static resources
Job Level Exception Management
Exception-based error detection to take automatic, configurable, corrective actions
Increased job reliability & predictability
Improved visibility on job and system errors & reduced administration overhead and
costs
Automatic Job Migration and Requeue
Automatically migrate and requeue jobs based on policies in the event of host or
network failures
Reduce user and administrator overhead in managing failures & reduce risk of running
critical workloads
Master Scheduler Failover
Automatically fails over to another host if the master host is unavailable
Provides continuous scheduling service and job execution, and eliminates manual intervention
Backfill
Policy configured at the queue level and applies to all jobs in a queue
Smaller sequential jobs are ‘backfilled’ behind larger parallel jobs
Improves hardware utilization
Users provided with an accurate time when their job will start
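The backfill idea above can be sketched as a simple eligibility test. This is an illustrative simplification, not Platform LSF's scheduler logic: it assumes every job declares a run limit, and the function and parameter names are hypothetical.

```python
def can_backfill(job_slots, job_run_limit_min, free_slots, minutes_until_reservation):
    """Return True if a job may run on slots reserved for a larger parallel job.

    The job must fit in the currently idle slots and must be guaranteed
    (by its run limit) to finish before the reserved job is due to start.
    """
    fits = job_slots <= free_slots
    finishes_in_time = job_run_limit_min <= minutes_until_reservation
    return fits and finishes_in_time


# Four slots stay idle for 30 minutes while a large parallel job waits for more.
print(can_backfill(1, 20, 4, 30))  # short sequential job
print(can_backfill(1, 45, 4, 30))  # would delay the parallel job
```

Because backfilled jobs end before the reservation starts, the start time promised to the large parallel job does not slip.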
Key New Features & Benefits
Platform LSF V6.0
Feature Overview
OGSI Compliance
Goal-Oriented SLA-Driven Scheduling
License-Aware Scheduling
Job-Level Exception Management (Self Management Enhancement)
Job Group Support
Other Scheduling Enhancements
Queue-Based Fairshare
User Fairshare by Queue Priority
Job Starvation Prevention plug-in
Feature Overview (Cont.)
HPC Enhancements
Dynamic ptile Enforcement
Resource Requirement Specification for Advance Reservation
Thread Limit Enforcement
General Parallel Support
Parallel Job Size Scheduling
Job Limit Enhancements
Non-normalized Job Run Limit
Resource Allocation Limit Display
Administration and Diagnostics
Scheduler Dynamic Debug
Administrator Action Messages
Goal-Oriented SLA-Driven Scheduling
What is it?
A new scheduling policy.
Unlike current scheduling policies based on configured shares or limits,
SLA-driven scheduling is based on customer provided goals:
Deadline based goal: Specify the deadline for a group of jobs.
Velocity based goal: Specify the number of jobs running at any one time.
Throughput based goal: Specify the number of finished jobs per hour.
This scheduling policy works on top of queues and host partitions.
Benefits
Guarantees projects are completed on time according to explicit SLA
definitions.
Provides visibility into the progress of projects to see how well projects are
tracking to SLAs
Allows the admin to focus on “What work and When” needs to be done, not
“how” the resources are to be allocated.
Guarantees service level deliveries to the user community, and reduces
project risk and administration cost.
Use case
Problem: we need to finish all simulation jobs before 15:00.
Solution: Configure a deadline service class in the lsb.serviceclasses file.
Begin ServiceClass
NAME=simulation
PRIORITY=100
GOALS = [deadline timeWindow (13:00-15:00)]
DESCRIPTION = A simple deadline demo
End ServiceClass
Submitting and monitoring jobs
$ bsub -sla simulation -W 10 -J A[1-50] mySimulation
$ date; bsla
Wed Aug 20 14:00:16 EDT 2003
SERVICE_CLASS_NAME: simulation
GOAL: DEADLINE ACTIVE_WINDOW: (13:00-15:00)
STATUS: Active:Ontime
DEAD_LINE: (Wed Aug 20 15:00)
ESTIMATED_FINISH_TIME: (Wed Aug 20 14:30)
Optimum Number of Running Jobs: 5
NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
50     25    5    0      0      20
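A rough way to see where a figure such as "Optimum Number of Running Jobs" could come from is to ask how many jobs must run side by side to finish the remaining work before the deadline. The sketch below is an assumption for illustration only, not the algorithm Platform LSF uses; it treats the -W run limit as each job's worst-case run time, and the function name is hypothetical.

```python
import math


def min_running_jobs(unfinished_jobs, run_limit_min, minutes_to_deadline):
    """Smallest number of concurrent jobs that finishes all work by the deadline."""
    # How many jobs one slot can run back to back before the deadline.
    jobs_per_slot = minutes_to_deadline // run_limit_min
    if jobs_per_slot < 1:
        raise ValueError("not enough time left to run even one job")
    return math.ceil(unfinished_jobs / jobs_per_slot)


# At 14:00 the example has 25 pending and 5 running jobs, a 10-minute
# run limit, and 60 minutes left until the 15:00 deadline.
print(min_running_jobs(25 + 5, 10, 60))
```

With the numbers from the example, each slot can run six 10-minute jobs in the remaining hour, so 30 unfinished jobs need five slots.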
Job-Level Exception Management (Self Management Enhancement)
What is it?
Platform LSF can monitor jobs and hosts for exceptional behavior and take action accordingly.
Benefits
Increased reliability of job execution
Improved visibility on job and system errors
Reduced administration overhead and costs
How it works
Platform LSF V6 handles the following exceptions:
“Job eating” machine (or “black-hole” machine): for some reason, jobs keep exiting
abnormally on a machine (e.g. no processes, mount daemon dies, etc.)
Job underrun (job run time less than configured minimum time)
Job overrun (job run time more than configured maximum time)
Job idle (job runs without its CPU usage increasing)
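The three job-level exceptions listed above can be expressed as simple checks. The sketch below is illustrative only; the function name, parameter names and the way CPU usage is sampled are assumptions, not Platform LSF internals.

```python
def job_exceptions(run_time_min, cpu_time_delta_s, finished,
                   underrun_min=None, overrun_min=None):
    """Return the exception names that apply to one job.

    run_time_min      elapsed run time in minutes
    cpu_time_delta_s  CPU seconds consumed since the previous sample
    finished          True once the job has ended
    underrun_min      configured minimum run time, or None if not set
    overrun_min       configured maximum run time, or None if not set
    """
    found = []
    if finished and underrun_min is not None and run_time_min < underrun_min:
        found.append("underrun")
    if overrun_min is not None and run_time_min > overrun_min:
        found.append("overrun")
    if not finished and cpu_time_delta_s == 0:
        found.append("idle")
    return found


print(job_exceptions(200, 5.0, finished=False, overrun_min=180))
print(job_exceptions(1, 0.5, finished=True, underrun_min=2))
print(job_exceptions(30, 0, finished=False))
```

Underrun can only be judged once a job has ended, whereas overrun and idle are detected while the job is still running.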
Use Case 1:
Requirement: If more than 30 jobs have exited on a host in the past 5
minutes, I want LSF to close that machine, then notify me and tell me
the machine name.
Solution:
Configure host exceptions (EXIT_RATE in lsb.hosts). A limit of 30 jobs in
5 minutes corresponds to an exit rate of 6 jobs per minute.
Begin Host
HOST_NAME   MXJ   EXIT_RATE   # Keywords
Default     !     6
End Host
Configure JOB_EXIT_RATE_DURATION = 5 in lsb.params
(default value is 10 minutes)
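The relationship between the requirement (30 jobs in 5 minutes) and the configured EXIT_RATE of 6 can be checked with a short sketch. It assumes the rate is an average number of exited jobs per minute over the configured duration; the function names and the timestamp representation are hypothetical.

```python
def exit_rate_threshold(max_exited_jobs, window_min):
    """Convert 'N jobs in M minutes' into a jobs-per-minute threshold."""
    return max_exited_jobs / window_min


def host_exceeds_exit_rate(exit_times_min, now_min, window_min, threshold_per_min):
    """True if the average exit rate over the last window is above the threshold."""
    recent = [t for t in exit_times_min if now_min - window_min < t <= now_min]
    return len(recent) / window_min > threshold_per_min


threshold = exit_rate_threshold(30, 5)
print(threshold)

# 31 exits spread over the last five minutes: the host would be flagged.
exits = [100 - 4.9 + i * 0.15 for i in range(31)]
print(host_exceeds_exit_rate(exits, now_min=100, window_min=5, threshold_per_min=threshold))
```

Exactly 30 exits in the window gives a rate equal to the threshold, which this sketch does not treat as an exception because the requirement says "more than 30".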
Use Case 2:
Requirement: If any job runs more than 3 hours, I want LSF to notify me
and tell me the jobID.
Solution:
Configure job exceptions (lsb.queues)
Begin Queue
…
JOB_OVERRUN = 3*60 # run time in minutes
End Queue
Job Starvation Prevention Plug-in
What is it?
An external scheduler plug-in that allows users to define their own equation
for job priority
Benefits
Low priority work is guaranteed to run after ‘waiting’ for a specified time
ensuring that the job does not wait forever (i.e. starvation).
How it works
By default, the scheduler provides the following calculation:
Job priority = A * q_priority * MIN(1, int(wait_time/T0))
               * (B*requested_processors
                  + MAX(C*wait_time*(1 + 1/run_time), D)
                  + E*requested_memory)
where A, B, C, D, E are coefficients and T0 is the grace period. Default
run_time = INFINIT
Admin can define different coefficients for each queue with the following
format:
MANDATORY_EXTSCHED=JOBWEIGHT[A=val1; B=val2; …]
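The default job weight calculation can be transcribed directly into a small function to experiment with coefficients. This is a sketch of the formula as printed on the slide, not the plug-in's source code; the slide does not state units, so the example treats wait_time and T0 as being in the same unit, and the Python names are chosen for illustration.

```python
def job_priority(q_priority, wait_time, requested_processors, requested_memory,
                 A, B, C, D, E, T0, run_time=float("inf")):
    """Evaluate the default job weight formula shown above."""
    # The grace factor is 0 until the job has waited at least T0, then 1.
    grace = min(1, int(wait_time / T0))
    # The aging term grows with wait time; D acts as a floor.
    aging = max(C * wait_time * (1 + 1 / run_time), D)
    return A * q_priority * grace * (
        B * requested_processors + aging + E * requested_memory
    )


# Coefficients from the use case that follows, for a job in a queue with
# PRIORITY = 20 that has waited 10 time units.
print(job_priority(20, 10, requested_processors=4, requested_memory=0,
                   A=1, B=0, C=10, D=1, E=0, T0=0.1))
```

With the default infinite run_time the term 1/run_time vanishes, so the aging term reduces to MAX(C*wait_time, D).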
Use Case:
Requirement: Jobs in the lowest priority queue should wait no more than 10 hours.
Solution: Suppose the highest priority queue has PRIORITY = 100 and the lowest
priority queue has PRIORITY = 20. Configure the following in the lowest queue:
MANDATORY_EXTSCHED=JOBWEIGHT[A=1;B=0;C=10;D=1;E=0;T0=0.1]
After waiting 10 hours, a job in the lowest queue will have higher priority
than jobs in the highest priority queue.
Note: The formula for calculating job weight is open source and
customers can customize it.
Resource Requirement Specification for Advance Reservation
What is it?
Enables users to select the hosts for an advance reservation based on
resource requirements.
Benefit
More flexibility in reserving host slots for mission-critical jobs.
How it works
The brsvadd command supports a select string:
brsvadd -R "select[type==LINUX]" -n 4 -u xwei -b 10:00 -e 12:00
Job Termination Reasons
Accounting log with detailed audit & error information for
every job in the system
Indicates why a job was terminated
Distinguishes between an abnormal termination and one caused by
Platform LSF HPC
Enterprise Proven
Running on several of the top 10 supercomputers in the
world on the “TOP500” (#2,4,5,6)
More than 250,000 licenses in use spanning 1,500 customer
sites
Scales to over 100 clusters, 200,000 CPUs and 500,000
active jobs per cluster
11+ years of experience in distributed & grid computing
Risk free investment – proven solution
Commercial production quality
Comprehensive, Extensible, Standards-based Security
Scalable scheduler architecture
Multiple scheduler plug-in API support
External executable support
Web GUI
Open source components
Risk free investment – proven solution
Commercial grade
Scalability and flexibility as a business grows
How It Works
Platform LSF HPC
Fault Tolerance via Master Election
[Figure: Host 1 runs the master LIM with mbd, mbsched and sbd; Host i through Host N each run a slave LIM and sbd. The LIMs ask “Am I master?”]
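The "Am I master?" question can be pictured as each host walking an ordered candidate list and picking the first one that is reachable. The sketch below is a simplified illustration under that assumption; the slide does not describe the election protocol itself, and the function and host names are hypothetical.

```python
def elect_master(candidates, is_alive):
    """Return the first live host in priority order, or None if all are down."""
    for host in candidates:
        if is_alive(host):
            return host
    return None


hosts = ["host1", "host_i", "host_n"]

# Normal operation: the first candidate is the master.
print(elect_master(hosts, lambda h: True))

# host1 fails: the next live candidate takes over.
down = {"host1"}
print(elect_master(hosts, lambda h: h not in down))
```

A host that finds its own name returned would act as master; every other host stays a slave until the candidates ahead of it become unavailable.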
Virtual Server Technology
LIM: Collects & centralizes status of all resources in cluster
RES: Transparent remote task execution
[Figure: Virtual Server Technology. System Monitor, Workload Management and Admin Tools sit on top of the Cluster APIs. The master LIM receives load information from the slave LIMs; each host also runs a RES. Reported values: free memory, disk I/O rate, idle time, free swap space, host status, number of CPUs, and custom status (via ELIM).]
Executing Work
Chooses best, available resource to process the job
[Figure: clients submit jobs (Gaussian Distribution job, BLAST Sequence job, Computational Chemistry job, Protein Modeling job) to MBD. The master LIM and slave LIMs report host status, an ELIM supplies custom information, and an SBD on each host runs the jobs.]
Grid-enabled, Scalable Architecture
Open, modular plug-in schedulers scale
with the growth of your business
Scheduler Framework
The framework hides the complexity of
interacting with core services.
The Resource Broker is responsible for
collecting resource information from
other core services.
Scheduler Modules
Minimize the inter-dependencies
between scheduling policies
Maximize extensibility through the
plug-in scheduler module stack
The Four Scheduling Phases
Input: Pre-Selected Jobs
1. Pre-Processing
• Localized setup
2. Matching / Limits
• Match eligible resources to
nodes
3. Order / Allocation
• Prioritize jobs and allocate
resources
• Allocation adjustments
4. Post-Processing
Output: Scheduling Decisions / Job Control Decisions
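The four phases above can be sketched as a stack of plug-in modules that the framework calls phase by phase. This is a hypothetical illustration of the structure, not Platform LSF's plug-in API: the class names, method names and the one-job-per-host allocation rule are all assumptions.

```python
class SchedulerModule:
    """Base plug-in: every phase defaults to a no-op."""

    def pre_process(self, jobs, hosts):
        pass

    def match(self, job, hosts):
        return hosts

    def order(self, jobs):
        return jobs

    def post_process(self, decisions):
        pass


class MemoryMatch(SchedulerModule):
    """Matching / Limits: keep only hosts with enough free memory."""

    def match(self, job, hosts):
        return [h for h in hosts if h["free_mem"] >= job["mem"]]


class PriorityOrder(SchedulerModule):
    """Order / Allocation: highest priority first."""

    def order(self, jobs):
        return sorted(jobs, key=lambda j: -j["priority"])


def run_cycle(modules, jobs, hosts):
    for m in modules:                      # 1. Pre-Processing
        m.pre_process(jobs, hosts)
    candidates = {}
    for job in jobs:                       # 2. Matching / Limits
        eligible = hosts
        for m in modules:
            eligible = m.match(job, eligible)
        candidates[job["name"]] = eligible
    ordered = jobs
    for m in modules:                      # 3. Order / Allocation
        ordered = m.order(ordered)
    decisions, used = [], set()
    for job in ordered:
        for host in candidates[job["name"]]:
            if host["name"] not in used:   # simplification: one job per host
                used.add(host["name"])
                decisions.append((job["name"], host["name"]))
                break
    for m in modules:                      # 4. Post-Processing
        m.post_process(decisions)
    return decisions


jobs = [{"name": "small", "mem": 1, "priority": 10},
        {"name": "big", "mem": 8, "priority": 50}]
hosts = [{"name": "h1", "free_mem": 16}, {"name": "h2", "free_mem": 2}]
print(run_cycle([MemoryMatch(), PriorityOrder()], jobs, hosts))
```

Each policy lives in its own module and only overrides the phase it cares about, which is one way to keep scheduling policies independent of each other.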
Multiple Scheduling Modules
[Figure: the internal module and add-on modules 1 to N each implement the four phases (Pre-Processing, Matching / Limits, Order / Allocation, Post-Processing). Add-on modules allow vendor-specific matching policies without changing the existing scheduler, and provide support for an external scheduler.]
Maui Integration
[Figure: the MAUI plugin sits in the scheduler framework (SCH_FM), which syncs with MBD. After pre-processing, the plugin passes job, host and resource information to the MAUI scheduler (RMGetInfo) and its event handler waits for a GO event. MAUI orders jobs (QueueScheduleSJobs, QueueScheduleRJobs, QueueScheduleIJobs, QueueBackFill, UIProcessClients) and returns decisions and acknowledgements before post-processing.]
Linux-specific Solutions
Controlling an MPI job
On a distributed system (Linux cluster) there are many problems to
address:
1. Job launch across multiple nodes
2. Gather resource usage while job executes
3. Propagate signals
4. Job “clean-up” to eliminate “dangling” MPI processes
5. Comprehensive job accounting
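The control problems listed for MPI jobs (launch, resource usage, signals, clean-up) can be illustrated on a single POSIX host with ordinary child processes. The sketch below is only a local analogy under that assumption; it does not show how Platform LSF or any MPI launcher works across nodes, and the function name is hypothetical.

```python
import os
import resource
import signal
import subprocess
import sys


def run_tasks(commands, timeout_s):
    """Launch tasks, kill stragglers on timeout, and report CPU time used."""
    # Launch: each task gets its own session, so its process group id
    # equals its pid and the whole group can be signalled later.
    procs = [subprocess.Popen(cmd, start_new_session=True) for cmd in commands]
    exit_codes = []
    for proc in procs:
        try:
            exit_codes.append(proc.wait(timeout=timeout_s))
        except subprocess.TimeoutExpired:
            # Clean-up: signal the whole group so no child is left dangling.
            os.killpg(proc.pid, signal.SIGKILL)
            exit_codes.append(proc.wait())
    # Accounting: CPU seconds consumed by all children that were waited for.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return exit_codes, usage.ru_utime + usage.ru_stime
```

Calling `run_tasks([[sys.executable, "-c", "pass"]], timeout_s=5)` returns the exit codes and the accumulated CPU time; a task that outlives the timeout is killed together with anything it spawned, and its exit code is reported as a negative signal number.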
“traditional” MPI sequence
[Figure: a job is submitted to the resource manager; the job launcher runs the job script, which calls mpirun to start the a.out processes.]
Platform LSF HPC for Linux - MPICH-GM
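[Figure: bsub submits to mbatchd, which dispatches to sbatchd. On the first execution host, res runs the job script, which starts pam, gmmpirun_wrapper and mpirun. On each execution host a res starts TS (TaskStarter), which runs a.out; PIM monitors the processes.]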
Platform LSF HPC for Linux/Myrinet - Generic PJL
[Figure: Generic PJL architecture. Submission host: bsub, queues (high, med, hpc_queue). Master host: master LIM, PIM, MBD, MBSCHD, SBD. Execution host H1: LIM, PIM, SBD, SBD child, pam (with lsblib), PJL wrapper, PJL, root res, TaskStarter, a.out process 1. Host H2: LIM, root res, TaskStarter, a.out process 2. Labelled flows: hostname & pid; signals and rusage collection.]
Platform LSF HPC for Linux/Myrinet - MPICH_GM
[Figure: MPICH-GM architecture. Submission host: bsub, esub (sets LSF_PJL_TYPE to mpich_gm), queues (high, med, hpc_queue). Master host: master LIM, PIM, MBD, MBSCHD, SBD, elim. Execution host H1: LIM, PIM, elim, SBD, SBD child, mpirun.lsf, pam (with lsblib), gmmpirun_wrapper, mpirun.ch_gm, root res, TaskStarter, a.out process 1. Host H2: LIM, elim, root res, TaskStarter, a.out process 2, reached via rsh. Labelled flows: report resource availability; hostname & pid; signals and rusage collection.]
Platform LSF HPC for Linux/Myrinet - LAM/MPI
[Figure: LAM/MPI architecture. Submission host: bsub, esub (sets LSF_PJL_TYPE to lammpi), queues (high, med, hpc_linux). Master host: master LIM, PIM, MBD, MBSCHD, SBD, elim. Execution host H1: LIM, PIM, elim, SBD, SBD child, mpirun.lsf, pam (with lsblib), lammpirun_wrapper, mpirun, lamd, root res, TaskStarter, a.out process 1. Host H2: LIM, elim, lamd, root res, TaskStarter, a.out process 2. Labelled flows: report resource availability; hostname & pid; signals and rusage collection.]
Platform LSF HPC for Linux/Myrinet - Scali MPI
[Figure: Scali MPI architecture. Submission host: bsub, queues (high, med, low). Master host: master LIM, PIM, MBD, MBSCHD, SBD. Execution host H1: LIM, PIM, SBD, SBD child, pam (with lsblib), Scali MPI wrapper, mpimon, mpid, mpisubmon, root res, TaskStarter, a.out process 1. Host H2: LIM, mpid, mpisubmon, root res, TaskStarter, a.out process 2. Labelled flows: hostname & pid; signals and rusage collection.]
Platform LSF HPC for Linux/QsNet
[Figure: QsNet architecture. Submission host: bsub, queues (high, med, low). Master host: master LIM, PIM, MBD, MBSCHD with RMS plugin, SBD, RLA. LSF execution host / RMS node n0: LIM, PIM, SBD, SBD child (exec() res), res (rms_run(), with lsblib). The job's allocation spans nodes n1 and n2, where the user job runs.]
Scyld Beowulf Integration
• Scyld Beowulf handles the systems management challenge
effectively
• No OS to distribute / synchronize
• Central point of control from master
• Single process space makes it appear as large SMP
• Platform integrates with Scyld treating cluster as SMP and
allocating resources
• Integrate with mpirun, mpprun or bpsh to start tasks
• Collect resource usage from BPROC
• Collect load information via BPROC APIs
• Single user interface across Scyld & non-Scyld env.
Platform LSF HPC for Linux/BProc
[Figure: BProc architecture, with numbered steps 1A through 6C. Submission host: bsub, esub (modify submission options), job file, queues (high, med, low). Master host: master LIM, PIM, MBD, MBSCHD, SBD. BProc front-end node H3: LIM, PIM, SBD, SBD child (exec() res), res (with lsblib), bpsh/mpirun. Computing nodes: allocated nodes running the user job processes.]
More info at:
• www.platform.com/customers
• www.platform.com/barriers
Q&A