
Introduction to
Linux Clusters
Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004
Cluster Components
Hardware
 Nodes
 Disk array
 Networking gear
 Backup device
 Admin front end
 UPS
 Rack units
Software
 Operating system
 MPI
 Compilers
 Scheduler
Cluster Components
Hardware
 Nodes
  Compute nodes
  Admin node
  I/O node
  Login node
 Disk array
 Networking gear
 Backup device
 Admin front end
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Hardware
 Disk array
  RAID5
  SCSI 320
  10k+ RPM, TB+ capacity
  NFS-mounted from I/O node
 Networking gear
 Backup device
 Admin front end
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Hardware
 Networking gear
  Myrinet, gigE, 10/100
  Switches
  Cables
  Networking cards
 Backup device
 Admin front end
 UPS
 Rack units
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Hardware
 Backup device
  AIT3, DLT, LTO
  N-slot cartridge drive
  SAN
 Admin front end
 UPS
 Rack units
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Hardware
 Admin front end
  Console (keyboard, monitor, mouse)
  KVM switches
  KVM cables
 UPS
 Rack units
Cluster Components
Hardware
 UPS
  APC SmartUPS 3000
  3 per 42U rack
 Rack units
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Hardware
 Rack units
  42U, standard or deep
Software
 Operating system
 Compilers
 Scheduler
 MPI
Cluster Components
Software
 Operating system
  Red Hat 9+ Linux
  Debian Linux
  SUSE Linux
  Mandrake Linux
  FreeBSD and others
 MPI
 Compilers
 Scheduler
Cluster Components
Software
 MPI
  MPICH
  LAM/MPI
  MPI-GM
  MPI Pro
 Compilers
 Scheduler
Cluster Components
Software
 Compilers
  gnu
  Portland Group
  Intel
 Scheduler
Cluster Components
Software
 Scheduler
  OpenPBS
  PBS Pro
  Maui
Filesystem Requirements
 Journalled filesystem
  Reboots happen more quickly after a crash
  Slight performance hit for this feature
  ext3 is a popular choice (the old ext2 was not journalled); a quick example follows
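For example, a minimal sketch on a Red Hat 9-era system; the device name and mount point are illustrative, not a recommendation:
  # Add a journal to an existing ext2 partition, turning it into ext3
  tune2fs -j /dev/sda5
  # Then mount it as ext3 via an /etc/fstab entry like:
  # /dev/sda5   /scratch   ext3   defaults   1 2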
Space and Power Requirements
 Space
  Standard 42U rack is about 24”W x 80”H x 40”D
  Blade units give you more than 1 node per 1U of space in a deeper rack
  Cable management inside the rack
  Consider overhead or raised-floor cabling for the external cables
 Power
  A 67-node Xeon cluster consumes 19,872W, which takes about 5.65 tons of A/C to keep cool
  Ideally, each UPS plug should connect to its own circuit
  Clusters (especially blades) run very hot; make sure there is adequate A/C and ventilation
Network Requirements
 External Network
  One 10 Mbps network line is adequate (all computation and message passing stays within the cluster)
 Internal Network
  gigE
  Myrinet
  Some combination of the two
  Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound
Network Choices Compared
 Fast Ethernet (100BT)
  0.1 Gb/s (or 100 Mb/s) bandwidth
  Essentially free
 gigE
  0.4 Gb/s to 0.64 Gb/s bandwidth
  ~$400 per node
 Myrinet
  1.2 Gb/s to 2.0 Gb/s bandwidth
  ~$1000 per node
  Scales to thousands of nodes
  Buy fiber instead of copper cables
Networking Gear Speeds
 [Bar chart comparing bandwidth (in Mb/s) of Fast Ethernet, gigE, and Myrinet]
I/O Node
 Globally accessible filesystem (RAID5 disk array)
 Backup device
I/O Node
 Globally accessible filesystem (RAID5 disk array)
  NFS share it (see the export example below)
  Put user home directories, apps, and scratch space directories on it so all compute nodes can access them
  Enforce quotas on home directories
 Backup device
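A minimal sketch of the NFS export from the I/O node; the hostname and path follow the coffee.chem naming used later in this talk, and the subnet is illustrative:
  # On the I/O node, /etc/exports:
  /data   192.168.1.0/255.255.255.0(rw,no_root_squash,sync)
  exportfs -ra
  # On each compute node, an /etc/fstab entry so /data mounts at boot:
  coffeecompute00:/data   /data   nfs   rw,hard,intr   0 0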
I/O Node
 Globally accessible filesystem (RAID5 disk array)
 Backup device
  Make sure your device and software are compatible with your operating system
  Plan a good backup strategy
  Test how long it takes to restore a single file or an entire filesystem from backup (see the sketch below)
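Since the cluster described later backs up with tar, a restore test can be as simple as timing an extraction from tape; the device and path names here are assumptions:
  # Time the restore of a single file from the tape drive
  time tar -xvf /dev/nst0 data/staff/din/somefile
  # Time a full restore into a scratch area to measure worst-case recovery
  time tar -xvf /dev/nst0 -C /restore_test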
Admin Node
 Only sysadmins log into this node
 Runs cluster management software
Admin Node
 Only sysadmins log into this node
 Accessible only from within the cluster
 Runs cluster management software
Admin Node
 Only admins log into this node
 Runs cluster management software
 User and quota management (a sketch follows this slide)
 Node management
 Rebuild dead nodes
 Monitor CPU utilization and network traffic
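A minimal sketch of the routine user and quota tasks, assuming user quotas are enabled on the shared /data filesystem; the username is illustrative:
  # Add a user with a home directory on the shared filesystem
  useradd -d /data/staff/newuser -m newuser
  passwd newuser
  # Set the user's soft/hard block quotas, then report usage for everyone
  edquota -u newuser
  repquota /data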
Compute Nodes
 Buy the fastest CPUs and bus speed you
can afford.
 Memory size of each node depends on the
application mix.
 Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
Compute Nodes
 Buy the fastest CPUs and bus speed you
can afford.
 Don’t forget that some software companies
license their software per node, so factor in
software costs
 Stick with a proven technology over future
promise
 Memory size of each node depends on the
application mix.
Compute Nodes
 Buy the fastest CPUs and bus speed you
can afford.
 Memory size of each node depends on the
application mix.
 2 GB+ for large calculations
 < 2 GB for financial databases
 Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
Compute Nodes
 Buy the fastest CPUs and bus speed you
can afford.
 Memory size of each node depends on the
application mix.
 Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
 Disks are cheap nowadays... 40GB EIDE is
standard per node
Compute Nodes
 Choose a CPU architecture you’re
comfortable with
 Intel: P4, Xeon, Itanium
 AMD: Opteron, Athlon
 Other: G4/G5
 Consider that some algorithms require 2^n nodes
 32-bit Linux is free or close-to-free, 64-bit
Red Hat Linux costs $1600 per node
Login Node
 Users login here
 Only way to get into the cluster
 Compile code
 Job control
Login Node
 Users login here
  ssh or ssh -X
  Cluster designers recommend 1 login node per 64 compute nodes
  Update /etc/profile.d so all users get the same environment when they log in (see the sketch below)
 Only way to get into the cluster
 Compile code
 Job control
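A minimal sketch of a site-wide environment script on the login node; the PBS, MPICH, and PGI paths are assumptions, not the actual coffee.chem layout:
  # /etc/profile.d/cluster.sh -- sourced by every bash login shell
  export PATH=$PATH:/usr/local/pbs/bin:/usr/local/mpich/bin:/usr/local/pgi/linux86/bin
  export MANPATH=$MANPATH:/usr/local/pbs/man:/usr/local/mpich/man
  # Provide a matching cluster.csh for tcsh users (the PBS scripts in this talk use tcsh)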
Login Node
 Users login here
 Only way to get into the cluster
 Static IP address (vs. DHCP addresses on all other cluster nodes)
 Turn on built-in firewall software
 Compile code
 Job control
Login Node
 Users login here
 Only way to get into the cluster
 Compile code
 Licenses should be purchased for this node only
 Don’t pay for more than you need
 2 licenses might be sufficient for code compilation for a department
 Job control
Login Node
 Users login here
 Only way to get into the cluster
 Compile code
 Job control (using a scheduler)
 Choice of queues to access a subset of resources
 Submit, delete, terminate jobs
 Check on job status
Spare Nodes
 Offline nodes that are put into service
when an existing node dies
 Use for spare parts
 Use for testing environment
Cluster Install Software
 Designed to make cluster installation
easier (“cluster in a box” concept)
 Decreases install time through automated steps
 Decreases chance of user error
 Choices:
 OSCAR
 Felix
 IBM XCAT
 IBM CSM
Cluster Management Software
 Run parallel commands via GUI
 Or write Perl scripts for command-line control
 Install new nodes, rebuild corrupted nodes
 Check on status of hardware (nodes,
network connections)
 Ganglia
 xpbsmon
 Myrinet tests (gm_board_info)
Cluster Management Software
 xpbsmon shows jobs running that were submitted via the scheduler
Cluster Consistency
 Rsync or rdist the /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes (see the sketch below)
 Also consider (automatically or manually) rsync’ing the /etc/profile.d files, PBS config files, /etc/fstab, etc.
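A minimal sketch of pushing the account files out from the login node; the node names follow the coffeecompute01..64 convention and are otherwise an assumption:
  #!/bin/sh
  # Copy account files from the login node to every compute node over ssh
  for n in $(seq -w 1 64); do
      rsync -a -e ssh /etc/passwd /etc/shadow /etc/gshadow /etc/group \
          root@coffeecompute$n:/etc/
  done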
Local and Remote Management
 Local management
  GUI desktop from console monitor
  KVM switches to access each node
 Remote management
  Console switch
  ssh in and see what’s on the console monitor screen from your remote desktop
  Web-based tools
   Ganglia - ganglia.sourceforge.net
   Netsaint - www.netsaint.org
   Big Brother - www.bb4.com
Ganglia
 Tool for monitoring clusters of up to 2000 nodes
 Used on over 500 clusters worldwide
 For multiple OS’s and CPU architectures
# ssh -X coffee.chem.upenn.edu
# ssh coffeeadmin
# mozilla &
Open http://coffeeadmin/ganglia
Periodically auto-refreshes web page
Ganglia
 [Screenshots of the Ganglia web interface]
Scheduling Software (PBS)
 Set up queues for different groups of users based on resource needs (i.e. not everyone needs Myrinet; some users only need 1 node); a qmgr sketch follows this slide
 The world does not end if one node goes down; the scheduler will run the job on another node
 Make sure pbs_server and pbs_sched are running on the login node
 Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes
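A minimal sketch of checking the daemons and creating a queue with qmgr; the queue name and limits are illustrative, not the real coffee.chem settings:
  # On the login node: confirm the server and scheduler are up
  ps ax | grep pbs_server
  ps ax | grep pbs_sched
  # Create and enable a general-purpose execution queue (run as root on the PBS server node)
  qmgr -c "create queue shortq queue_type=execution"
  qmgr -c "set queue shortq resources_max.cput=12:00:00"
  qmgr -c "set queue shortq enabled=true"
  qmgr -c "set queue shortq started=true"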
Scheduling Software
 OpenPBS
 PBS Pro
 Others
Scheduling Software
 OpenPBS
 Limit users by number of jobs
 Good support via messageboards
 *** FREE ***
 PBS Pro
 Others
Scheduling Software
 OpenPBS
 PBS Pro
 The “pro” version of OpenPBS
 Limit by nodes, not just jobs per user
 Must pay for support ($25 per CPU, or $3200 for a 128-CPU cluster)
 Others
Scheduling Software
 OpenPBS
 PBS Pro
 Others
 Load Sharing Facility (LSF)
 Codeine
 Maui
MPI Software
 MPICH (Argonne National Labs)
 LAM/MPI (OSC/Univ. of Notre Dame)
 MPI-GM (Myricom)
 MPI Pro (MSTi Software)
 Programmed by one of the original developers of MPICH
 Claims to be 20% faster than MPICH
 Costs $1200 plus support per year
Compilers and Libraries
 Compilers
  gcc/g77 - www.gnu.org/software
  Portland Group - www.pgroup.com
  Intel - www.developer.intel.com
 Libraries (a link-line sketch follows this slide)
  BLAS
  ATLAS - portable BLAS (www.math-atlas.sourceforge.net)
  LAPACK
  SCALAPACK - MPI-based LAPACK
  FFTW - Fast Fourier Transform (www.fftw.org)
  many, many more
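A minimal link-line sketch; library names and install locations vary by system, so treat these as assumptions:
  # Link a Fortran program against LAPACK and BLAS with the Portland Group compiler
  pgf77 -o solver solver.f -L/usr/local/lib -llapack -lblas
  # Link a C program against FFTW (2.x library naming shown)
  gcc -o transform transform.c -lfftw -lm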
Cluster Security
 Securing/patching your Linux cluster is
much like securing/patching your Linux
desktop
 Keep an eye out for the latest patches
 Install a patch only if necessary and do it
on a test machine first
 Make sure there’s a way to back out of a
patch before installing it
Cluster Security
 Get rid of unneeded software
 Limit who installs and what gets installed
 Close unused ports and services
 Limit login service to ssh between the login node and the outside world
 Use ssh to tunnel X connections safely
 Limit access using hosts.allow/deny (example below)
 Use scp and sftp for secure file transfer
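A minimal TCP wrappers example for the login node; the address patterns shown are illustrative:
  # /etc/hosts.deny -- deny everything not explicitly allowed
  ALL: ALL
  # /etc/hosts.allow -- allow ssh from campus addresses and from inside the cluster
  sshd: 128.91. 192.168.1.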
Cluster Security
 Carefully configure NFS
 Upgrade to the latest, safest Samba version, if used
 Disable Apache if not needed
 Turn on the built-in Linux firewall software (a sketch follows)
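A minimal iptables sketch for the login node (iptables is the built-in firewall on Red Hat 9); the internal interface name is an assumption:
  # Accept everything on the internal cluster interface, ssh and established
  # traffic from outside, and drop the rest of the inbound traffic
  iptables -A INPUT -i eth1 -j ACCEPT
  iptables -A INPUT -p tcp --dport 22 -j ACCEPT
  iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  iptables -A INPUT -j DROP
  service iptables save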
Troubleshooting
 Make sure the core cluster services are
running
 Scheduler, MPI, NFS, cluster managers
 Make sure software licenses are up-to-date
 Scan logs for break-in attempts
 Keep a written journal of all patches, installs, and upgrades
Troubleshooting
 Sometimes a reboot will fix the problem
 If you reboot the login node where the
scheduler is running, be sure the scheduler is
started after the reboot
 Any jobs in the queues will be flushed
 Hard-rebooting hardware, such as tape
drives, usually fixes the problem
Troubleshooting
 Reboot order: I/O node, login node, admin
node, compute nodes (i.e. master nodes
first, then slave nodes)
 Rebuilding a node takes 30 minutes with
the cluster manager; reconfiguring it may
take an hour more
Vendor Choices
 Dell
 IBM
 Western Scientific
 Aspen Systems
 Racksaver
 eRacks
 Penguin Computing
 Many, many others
 Go with a proven vendor
 Get every vendor to spec out the same hardware and software before you compare prices
 Compare service agreements
 How fast can they deliver a working cluster?
Buying Commercial Software
 Is it worth the money?
 Is it proven software?
 Are all the bells and whistles really necessary?
 Paid software does not necessarily have the best support
Cluster Tips
 Keep all sysadmin scripts in an easily accessible place
  /4sysadmin
  /usr/local/4sysadmin
Cluster Tips
 Force everyone to use the scheduler to run their jobs (even uniprocessor jobs)
  Police it
  Don’t let users get away with things
 Wrapping some applications into a scheduler script can be tricky (see the sketch below)
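A minimal sketch of such a wrapper for a uniprocessor job; the application name and limits are illustrative:
  #!/bin/tcsh
  #PBS -l nodes=1:ppn=1
  #PBS -l cput=4:00:00
  #PBS -q coffeeq
  #PBS -k oe
  # Run the serial application from the directory the job was submitted from
  cd $PBS_O_WORKDIR
  ./my_serial_app < input.dat > output.dat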
Cluster Upgrades
 Nodes become obsolete in 2 to 3 years
 Upgrade banks of nodes at a time
 If upgrading to a new CPU, check for
compatibility problems and new A/C
requirements
 Upgrading memory and disk space is easy
but tedious
Cluster Upgrades
 Upgrading the OS can be a major task
 Even installing patches can be a major
task
Common Sense Cluster Administration
 Plan a little before you do anything
 Keep a journal of everything you do
 Create procedures that are easy to follow
in times of stress
 Document everything!
Common Sense Cluster Administration
 Test software before announcing it
 Educate and “radiate” your cluster
knowledge to your support team
coffee.chem
 6 P.I.’s in Chemistry funded it
 Located in FBA121 next to A/C3
 69 dual-CPU node cluster
  64 compute nodes
  1 login node
  1 admin node
  1 I/O node
  1 backup node
  1 firewall node
coffee.chem
 Myrinet on 32 compute nodes, gigE on the other 32
 2 TB RAID5 array (1.7 TB formatted)
 12-slot, 4.8 TB capacity LTO tape drive
 2U fold-out console with LCD monitor, keyboard, trackpad
coffee.chem
 5 daisy-chained KVM switches
 9 APC 3000 UPS units, each connected to its own circuit
 3 42U racks
coffee.chem
 Red Hat 9
 Felix cluster install and management software
 PBS Pro
 MPICH, LAM/MPI, MPICH-GM
 gnu and Portland Group compilers
 BLAS, SCALAPACK, ATLAS libraries
 Gaussian98 (Gaussian03 + Linda soon)
coffee.chem
 /data on the I/O node (coffeecompute00) holds common apps and user home directories
 Admin node (coffeeadmin) runs the Felix cluster manager
 Compute nodes (coffeecompute01..64)
 Every node in the cluster can access /data via NFS
coffee.chem
 Can ssh into the compute nodes, admin node, and I/O node only via the login node
 Backup node (javabean) temporarily has our backup device attached (we use tar right now)
Logging Into coffee.chem
 Everyone in this room will have user
accounts on coffee.chem and home
directories in /data/staff
 Our existence on the system is for
Chemistry’s benefit
 Support scripts are found in /4sysadmin
 If a reboot is necessary, make sure that
PBS is started (/etc/init.d/pbs start)
Compiling and Running Code
 pgCC -Mmpi -o test hello.cpp
 mpirun -np 8 test
Compiling Code
 pgCC -Mmpi -o test hello.cpp
 MPICH includes mpicc and mpif77 to compile and link MPI programs
  These are scripts that pass the MPI library arguments to cc and f77
Running Code
 mpirun -np XXX -machinefile YYY -nolocal test   (a worked example follows)
  -np = number of processors
  -machinefile = filename with the list of processors you want to run the job on
  -nolocal = don’t run the job locally
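A worked sketch using the coffee.chem node names; the machinefile contents are illustrative:
  # machines.LINUX -- one hostname per line (list a node twice to use both of its CPUs)
  coffeecompute01
  coffeecompute02
  coffeecompute03
  coffeecompute04
  # Run the compiled MPI program "test" on 4 processors, one per listed node
  mpirun -np 4 -machinefile machines.LINUX -nolocal test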
Submitting a Job
 3 queues to choose from
  coffeeq
   General-purpose queue
   12 hours max run time
   16 processors max
  espressoq
   Higher priority than coffeeq
   3 weeks max run time
  Some may still use piq, but this will go away soon
Submitting a Job
 Prepare a scheduler script:
  #!/bin/tcsh
  #PBS -l arch=linux       {define architecture}
  #PBS -l cput=1:00:00     {define CPU time needed}
  #PBS -l mem=400mb        {define memory space needed}
  #PBS -l nodes=64:ppn=1   {define number of nodes needed}
  #PBS -m e                {mail me the results}
  #PBS -c c                {minimal checkpointing}
  #PBS -k oe               {keep the output and errors}
  #PBS -q coffeeq          {run the job on coffeeq}
  mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello
 qsub the scheduler script
More PBS Commands
 Check on the status of all submitted jobs with: qstat
 Submit a job with: qsub
 Delete a job with: qdel
 Shut down the batch server with: qterm
 See all your available compute node resources with: pbsnodes -a
Node Terms
 Login node = Service node = Head node = the node users log into
 Master scheduler node = node where scheduler runs, usually login node
 Admin node = the node the sysadmin logs into to gain access to cluster management apps
 Compute node = one or more nodes that perform pieces of a larger computation
 Storage node = the node that has the RAID array or SAN attached to it
 Backup node = the node that has the backup solution attached to it
 I/O node = can combine the features of storage and backup nodes
 Visualization node = the node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
 Spare node = nodes that are not in service, but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node
References
 Bookman, Charles. Linux Clustering: Building and Maintaining Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
 Howse, Martin. "Dropping the Bomb: AMD Opteron" in Linux User & Developer, Issue 33, pp. 33-36.
 Robertson, Alan. "Highly-Affordable High Availability" in Linux Magazine, November 2003, pp. 16-21.
 The Seventh LCI Workshop Systems Track Notes. Linux Clusters Institute, March 24-28, 2003.
 Sterling, Thomas et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, Massachusetts, 1999.
 Vrenios, Alex. Linux Cluster Architecture. Sams Publishing, Indianapolis, Indiana, 2002.
coffee.chem Contact List
 Dell hardware problems 800-234-1490
 Myrinet problems [email protected]
 “Very limited” software support [email protected]
 PGI Compiler issues [email protected]
Introduction to
Linux Clusters
Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004