Jefferson Lab and the
Portable Batch System
Walt Akers
High Performance Computing Group
Jefferson Lab and PBS: Motivating Factors
• New Computing Cluster
  – Alpha-based compute nodes
  – 16 XP1000 single-processor nodes (LINPACK 5.61 GFlop/sec)
  – 8 UP2000 dual-processor nodes (LINPACK 7.48 GFlop/sec)
• Heterogeneous Job Mix
  – Combination of parallel and non-parallel jobs
  – Job execution times range from a few hours to weeks
  – Data requirements range from minimal to several gigabytes
• Modest Budget
  – Much of our funding was from internal sources
  – Initial hardware expense was relatively high
• Expandability
  – Can the product be expanded from a few nodes to hundreds?
Jefferson Lab and PBS: Alternative Systems
• PBS - Portable Batch System
  – Open source product developed at NASA Ames Research Center
• DQS - Distributed Queuing System
  – Open source product developed by SCRI at Florida State University
• LSF - Load Sharing Facility
  – Commercial product from Platform Computing
  – Already deployed by the Computer Center at Jefferson Lab
• Codine
  – Commercial version of DQS from Gridware, Inc.
• Condor
  – A restricted-source ‘cycle stealing’ product from the University of Wisconsin
• Others too numerous to mention
Jefferson Lab and PBS: Why We Chose PBS
• Portability
  – The PBS distribution compiled and ran immediately on both the 64-bit Alpha and 32-bit Intel platforms.
• Documentation
  – PBS comes with comprehensive documentation, including an Administrator's Guide, External Reference, and Internal Reference.
• Active Development Community
  – There is a large worldwide community that continues to improve and refine PBS.
• Modularity
  – PBS is a component-oriented system.
  – A well-defined API is provided to allow components to be replaced with locally defined modules.
• Open Source
  – The source code for the PBS system is available without restriction.
• Price
  – Hey, it's free…
Jefferson Lab and PBS: The PBS View Of The World
• PBS Server
  – Mastermind of the PBS System
  – Central point of contact
• PBS Scheduler
  – Prioritizes jobs
  – Signals Server to start jobs
• Machine Oriented Mini-Server (MOM)
  – Executes scripts on compute nodes
  – Performs user file staging
Jefferson Lab and PBS: The PBS Server
• Routing Queues
  – Can move jobs between multiple PBS Servers
• Execution Queues
  – Define default characteristics for submitted jobs
  – Define a priority level for queued jobs
  – Hold jobs before, during and after execution
• Node Capabilities
  – The server maintains a table of nodes, their capabilities and their availability.
• Job Requirements
  – The server maintains a table of submitted jobs that is independent of the queues.
• Global Policy
  – The server maintains global policies and default job characteristics.
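Keeping the node table and the job table separate makes the server's basic match-making step easy to picture. A minimal sketch in Python — the dictionaries and field names (`ncpus`, `mem_mb`, `state`) are illustrative, not PBS's actual data model:

```python
# Sketch: match a job's resource requirements against a node table.
# Field names are invented for illustration, not PBS's real schema.

def eligible_nodes(job, nodes):
    """Return the names of free nodes that satisfy the job's requirements."""
    return [
        name for name, attrs in nodes.items()
        if attrs["state"] == "free"
        and attrs["ncpus"] >= job["ncpus"]
        and attrs["mem_mb"] >= job["mem_mb"]
    ]

nodes = {
    "xp1000-01": {"state": "free", "ncpus": 1, "mem_mb": 512},
    "up2000-01": {"state": "free", "ncpus": 2, "mem_mb": 1024},
    "up2000-02": {"state": "busy", "ncpus": 2, "mem_mb": 1024},
}

job = {"ncpus": 2, "mem_mb": 768}
print(eligible_nodes(job, nodes))  # only up2000-01 is free and large enough
```

Because the job table is independent of the queues, the same lookup works no matter which queue a job arrived through.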
Jefferson Lab and PBS: The PBS Scheduler
• Prioritizes Jobs
  – Called periodically by the PBS Server
  – Downloads job lists from the server and sorts them based on locally defined requirements.
• Tracks Node Availability
  – Examines executing jobs to determine the projected availability time for nodes.
  – Using this data, the scheduler can calculate future deployments and determine when back-filling should be performed.
• Recommends Job Deployment
  – At the end of the scheduling cycle, the scheduler submits to the server a list of jobs that can be started immediately.
  – The PBS Server is responsible for verifying that the jobs can be started, and then deploying them.
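The cycle described above — sort, project node availability, then backfill — can be sketched in a simplified single-resource model. The job fields, the free-node count, and the `shadow_time` projection are all illustrative; a real scheduler derives them from the server's job and node tables:

```python
# Simplified one-pass scheduling cycle with backfill.
# All inputs are illustrative; a real PBS scheduler uses the server API.

def schedule_cycle(jobs, free_nodes, shadow_time):
    """jobs: priority-ordered list of dicts with 'name', 'nodes', 'walltime'.
    free_nodes: nodes idle right now.
    shadow_time: hours until enough nodes free up for the blocked top job.
    Returns the names of jobs that can be started immediately."""
    start_now = []
    blocked = False
    for job in jobs:
        if job["nodes"] <= free_nodes and (
            not blocked or job["walltime"] <= shadow_time
        ):
            # Either nothing is blocked yet, or this job backfills: it fits
            # in the idle nodes and finishes before the blocked job's
            # projected start, so it cannot delay that job.
            start_now.append(job["name"])
            free_nodes -= job["nodes"]
        else:
            # A job that cannot start reserves the front of the queue;
            # jobs after it may only backfill.
            blocked = True
    return start_now

jobs = [
    {"name": "A", "nodes": 8, "walltime": 24},  # needs more than 4 free nodes
    {"name": "B", "nodes": 2, "walltime": 2},   # short: safe to backfill
    {"name": "C", "nodes": 2, "walltime": 48},  # too long: would delay A
]
print(schedule_cycle(jobs, free_nodes=4, shadow_time=6))  # only B starts now
```

The list returned at the end of the pass corresponds to the recommendation the scheduler hands back to the PBS Server, which still verifies and deploys the jobs itself.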
Jefferson Lab and PBS: Machine Oriented Mini-Server
• Executes Scripts
  – At the direction of the PBS Server, MOM executes the user-provided scripts.
  – For parallel jobs, the primary MOM (Mother Superior) starts the job on itself and all other assigned nodes.
• Stages Data Files
  – Prior to script execution, the MOM is responsible for remotely copying user-specified data files to Mother Superior.
  – Following execution, the resultant data files are remotely copied back to the user-specified host.
• Tracks Resource Usage
  – MOM tracks the CPU time, wall time, memory and disk that have been used by the job.
• Kills Rogue Jobs
  – Kills jobs at the PBS Server's request.
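Tracking usage and killing rogue jobs amount to comparing periodic usage samples against the job's requested limits. A minimal sketch — the resource names and limit values are hypothetical, not MOM's actual accounting records:

```python
# Sketch: decide whether a job has exceeded its resource limits.
# Resource names and values are invented for illustration.

LIMITS = {"cput_s": 3600, "walltime_s": 7200, "mem_mb": 512}

def over_limit(usage, limits=LIMITS):
    """Return the names of any resources the job has exceeded."""
    return [res for res, cap in limits.items() if usage.get(res, 0) > cap]

sample = {"cput_s": 4100, "walltime_s": 5000, "mem_mb": 480}
exceeded = over_limit(sample)
if exceeded:
    print("job exceeds limits:", exceeded)  # would trigger a kill
```

In the real system the kill itself is issued at the PBS Server's request; the sketch only shows the limit check that motivates it.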
Jefferson Lab and PBS: Our Current Implementation
Jefferson Lab and PBS: What We’ve Learned So Far
• PBS Is Reasonably Reliable, But Has Room For Improvement
  – PBS Server and PBS Scheduler components work well and behave predictably
  – PBS MOM works okay, but behaves bizarrely in certain situations
    • Disk full = chaos
    • Out of process slots = chaos
    • Improper file transfer or staging = chaos
  Note: The first two can be avoided by conscientious system management; the last is the responsibility of the job submitter.
• Red Hat Linux 6.2
  – We’ve seen many problems associated with NFS. After upgrading to kernel 2.2.16-3, many of these problems went away.
  – klogd occasionally spins out of control and uses all available CPU cycles.
  – sshd on SMP machines dies for no apparent reason.
  – crontab works intermittently on SMP nodes.
  We’re considering experimenting with Tru64 UNIX to see if these problems exist there.
• Writing a Scheduler Is Hard Work
  – We have developed two interim schedulers and are now working on the ‘final’ implementation.
Jefferson Lab and PBS: Ongoing Development
• Underlord Scheduling System
  – Built on the existing PBS Scheduler framework
    • Plug-in replacement for the default scheduler
    • Uses an object-oriented interface to the PBS Server
  – Comprehensive match-making scheme
    • Starts from an ordered list of jobs
    • Works with a collection of homogeneous or heterogeneous nodes
    • Locates the optimal node or combination of nodes where a job should be deployed
    • Uses user-specified job parameters to project future job deployment
    • Uses future job scheduling in combination with backfilling to maximize system utilization
  – Multi-layered job sorting algorithm
    • Time in queue
    • Projected execution time
    • Number of processors requested
    • Queue priority
    • Progressive user share (similar to the LSF scheme)
  – Generates a projection table
    • Allows users to determine when their job is projected to start
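A multi-layered sort like the one listed above can be expressed as a composite sort key, where each later layer only breaks ties in the earlier ones. A minimal Python sketch — the field names, weights, and the ordering direction of each layer are invented for illustration; the actual Underlord ordering rules are site-defined:

```python
# Sketch of a multi-layered job sort: later keys break ties in earlier keys.
# Field names and per-layer directions are illustrative.

def sort_jobs(jobs):
    """Order jobs by queue priority, then user share, then time in queue,
    then shorter projected runtime, then fewer requested processors."""
    return sorted(
        jobs,
        key=lambda j: (
            -j["queue_priority"],   # higher-priority queues first
            j["user_share"],        # users with less recent usage first
            -j["wait_hours"],       # longest-waiting jobs first
            j["projected_hours"],   # shorter projected jobs first
            j["nprocs"],            # smaller jobs first
        ),
    )

jobs = [
    {"name": "a", "queue_priority": 1, "user_share": 0.2,
     "wait_hours": 10, "projected_hours": 4, "nprocs": 2},
    {"name": "b", "queue_priority": 2, "user_share": 0.9,
     "wait_hours": 1, "projected_hours": 24, "nprocs": 8},
    {"name": "c", "queue_priority": 1, "user_share": 0.2,
     "wait_hours": 10, "projected_hours": 2, "nprocs": 2},
]
print([j["name"] for j in sort_jobs(jobs)])  # b first, then c, then a
```

Because Python's sort is stable and tuple keys compare element by element, adding or reordering layers is just a matter of editing the key tuple.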
Jefferson Lab and PBS: Future Directions
• Data Grid Server
  – In order to provide greater flexibility to the Batch System and allow it to accommodate data provided through the proposed Data Grid system, a Data Grid Server will be added to the existing system components.
  – This module will have the following capabilities:
    • Will provide time projections for when data will be available
    • Will perform data migration to a script-accessible host
    • Will provide mechanisms to transfer resultant data to a specified location
    • Will replace the existing staging capabilities of the PBS Server and PBS MOM
• PBS Meta-Facility - The Overlord Scheduler
  – The Overlord Scheduler will be a centralized location where jobs are submitted that can be forwarded to other PBS Clusters for execution. The Overlord Scheduler will have the following capabilities:
    • Will prioritize and sort all jobs based on global Meta-Facility rules
    • Will consider job requirements, data location and network throughput, and will forward each job to the PBS Server where it will be scheduled earliest
    • Will not forward a job to one of the ‘Underlord’ systems until it is eligible for immediate execution there
  – We don’t have all of this figured out yet… but we are confident.
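Forwarding each job "to the PBS Server where it will be scheduled earliest" reduces to taking a minimum over per-cluster start projections. A minimal sketch — the cluster names and projection values, including the data-transfer term, are invented for illustration:

```python
# Sketch of Overlord-style cluster selection: forward a job to the cluster
# with the earliest projected start, counting the time needed to move its
# input data there. All numbers are illustrative.

def pick_cluster(projections):
    """projections: cluster -> (queue_wait_h, data_transfer_h).
    Returns (cluster, earliest_start_h); the job can only start once
    both a queue slot and the input data are available."""
    return min(
        ((c, max(wait, xfer)) for c, (wait, xfer) in projections.items()),
        key=lambda item: item[1],
    )

projections = {
    "jlab":   (12.0, 0.0),  # data is already local, but the queue is long
    "remote": (2.0, 3.5),   # short queue, but data must be copied first
}
print(pick_cluster(projections))  # remote wins: hour 3.5 vs hour 12.0
```

This is also where the "do not forward until eligible" rule pays off: holding the job centrally keeps the projection current, since the minimum can be recomputed as queue and network conditions change.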
Jefferson Lab and PBS: Places On The Web
• Jefferson Lab HPC Home Page
  – http://www.jlab.org/hpc
    • Currently we have most of the PBS documentation and some statistics about our cluster and its development.
• PBS Home Page
  – http://www.openpbs.org
    • Register and download PBS and all documentation from this site.