Introduction to
Linux Clusters
Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004
Cluster Components
Hardware
Nodes
Disk array
Networking gear
Backup device
Admin front end
UPS
Rack units
Software
Operating system
MPI
Compilers
Scheduler
Cluster Components
Hardware
Nodes
Compute nodes
Admin node
I/O node
Login node
Disk array
Networking gear
Backup device
Admin front end
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Hardware
Disk array
RAID5
SCSI 320
10k+ RPM, TB+
capacity
NFS-mounted from I/O
node
Networking gear
Backup device
Admin front end
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Hardware
Networking gear
Myrinet, gigE, 10/100
Switches
Cables
Networking cards
Backup device
Admin front end
UPS
Rack units
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Hardware
Backup device
AIT3, DLT, LTO
N-slot cartridge drive
SAN
Admin front end
UPS
Rack units
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Hardware
Admin front end
Console (keyboard,
monitor, mouse)
KVM switches
KVM cables
UPS
Rack units
Cluster Components
Hardware
UPS
APC SmartUPS 3000
3 per 42U rack
Rack units
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Hardware
Rack units
42U, standard or deep
Software
Operating system
Compilers
Scheduler
MPI
Cluster Components
Software
Operating system
Red Hat 9+ Linux
Debian Linux
SUSE Linux
Mandrake Linux
FreeBSD and others
MPI
Compilers
Scheduler
Cluster Components
Software
MPI
MPICH
LAM/MPI
MPI-GM
MPI Pro
Compilers
Scheduler
Cluster Components
Software
Compilers
gnu
Portland Group
Intel
Scheduler
Cluster Components
Software
Scheduler
OpenPBS
PBS Pro
Maui
Filesystem Requirements
Journalled filesystem
Reboots happen more quickly after a
crash
Slight performance hit for this feature
ext3 is a popular choice (old ext2 was
not journalled)
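For example, a minimal sketch of turning an existing ext2 partition into journalled ext3 (the device name and mount point below are assumptions; substitute your own):
  tune2fs -j /dev/sda5              # add a journal to the existing ext2 filesystem
  mount -t ext3 /dev/sda5 /data     # mount it as ext3 (or update /etc/fstab)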
Space and Power
Requirements
Space
Standard 42U rack is
about 24”W x 80”H x
40”D
Blade units give you
more than 1 node per
1U space in a deeper
rack
Cable management
inside the rack
Consider overhead or
raised floor cabling for
the external cables
Power
A 67-node Xeon cluster
consumes 19,872 W and
needs about 5.65 tons of
A/C to keep it cool
Ideally, each UPS plug
should connect to its
own circuit
Clusters (especially
blades) run real hot;
make sure there is
adequate A/C and
ventilation
Network Requirements
External Network
One 10 Mbps
network line is
adequate (all
computation and
message passing
is within the
cluster)
Internal Network
gigE
Myrinet
Some combo
Base your net gear
selection on
whether most of
your jobs are CPU-bound or I/O-bound
Network Choices Compared
Fast Ethernet (100BT)
0.1 Gb/s (or 100 Mb/s) bandwidth
Essentially free
gigE
0.4 Gb/s to 0.64 Gb/s bandwidth
~$400 per node
Myrinet
1.2 Gb/s to 2.0 Gb/s bandwidth
~$1000 per node
Scales to thousands of nodes
Buy fiber instead of copper cables
Networking Gear Speeds
[Bar chart: measured bandwidth in Mb/s (0 to 2500) for Fast Ethernet, gigE, and Myrinet]
I/O Node
Globally accessible filesystem (RAID5 disk
array)
Backup device
I/O Node
Globally accessible filesystem (RAID5 disk
array)
NFS share it
Put user home directories, apps, and scratch
space directories on it so all compute nodes
can access them
Enforce quotas on home directories (see the sketch after this slide)
Backup device
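A rough sketch of the NFS export and quota setup above (the /data path and node names follow this talk's coffee.chem example; the subnet, I/O node hostname, and username are assumptions):
  # On the I/O node, add a line like this to /etc/exports, then re-export:
  #   /data  192.168.1.0/255.255.255.0(rw,sync)
  exportfs -a
  # On each compute node, mount the share (or list it in /etc/fstab):
  mount -t nfs ionode:/data /data
  # Enforce a home directory quota for one user:
  edquota -u someuser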
I/O Node
Globally accessible filesystem (RAID5 disk
array)
Backup device
Make sure your device and software are
compatible with your operating system
Plan a good backup strategy
Test how long it takes to restore a single file
or a whole filesystem from backups
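For example, one hedged way to time a single-file restore from a tar backup on tape (the tape device and archive path are assumptions):
  time tar -xvf /dev/nst0 data/staff/someuser/results.dat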
Admin Node
Only sysadmins log into this node
Runs cluster management software
Admin Node
Only sysadmins log into this node
Accessible only from within the cluster
Runs cluster management software
Admin Node
Only admins log into this node
Runs cluster management software
User and quota management
Node management
Rebuild dead nodes
Monitor CPU utilization and network traffic
Compute Nodes
Buy the fastest CPUs and bus speed you
can afford.
Memory size of each node depends on the
application mix.
Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
Compute Nodes
Buy the fastest CPUs and bus speed you
can afford.
Don’t forget that some software companies
license their software per node, so factor in
software costs
Stick with a proven technology over future
promise
Memory size of each node depends on the
application mix.
Compute Nodes
Buy the fastest CPUs and bus speed you
can afford.
Memory size of each node depends on the
application mix.
2 GB+ for large calculations
< 2 GB for financial databases
Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
Compute Nodes
Buy the fastest CPUs and bus speed you
can afford.
Memory size of each node depends on the
application mix.
Lots of hard disk space is not so much a
priority since the nodes will primarily use
shared space on the I/O node.
Disks are cheap nowadays... 40GB EIDE is
standard per node
Compute Nodes
Choose a CPU architecture you’re
comfortable with
Intel: P4, Xeon, Itanium
AMD: Opteron, Athlon
Other: G4/G5
Consider that some algorithms require 2^n nodes
32-bit Linux is free or close-to-free, 64-bit
Red Hat Linux costs $1600 per node
Login Node
Users login here
Only way to get into the cluster
Compile code
Job control
Login Node
Users login here
ssh or ssh -X
Cluster designers recommend 1 login node
per 64 compute nodes
Update /etc/profile.d so all users get the same
environment when they log in (sample file after this slide)
Only way to get into the cluster
Compile code
Job control
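A minimal sketch of such a shared environment file, e.g. /etc/profile.d/cluster.sh (the file name and install paths are assumptions):
  # Put the MPI tools on everyone's PATH and MANPATH
  export PATH=/usr/local/mpich/bin:$PATH
  export MANPATH=/usr/local/mpich/man:$MANPATH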
Login Node
Users login here
Only way to get into the cluster
Static IP address (vs. DHCP addresses on all
other cluster nodes)
Turn on built-in firewall software
Compile code
Job control
Login Node
Users login here
Only way to get into the cluster
Compile code
Licenses should be purchased for this node
only
Don’t pay for more than you need
2 licenses might be sufficient for code compilation
for a department
Job control
Login Node
Users login here
Only way to get into the cluster
Compile code
Job control (using a scheduler)
Choice of queues to access subset of
resources
Submit, delete, terminate jobs
Check on job status
Spare Nodes
Offline nodes that are put into service
when an existing node dies
Use for spare parts
Use for testing environment
Cluster Install Software
Designed to make cluster installation
easier (“cluster in a box” concept)
Shortens the install process by
automating the steps
Decreases chance of user error
Choices:
OSCAR
Felix
IBM XCAT
IBM CSM
Cluster Management Software
Run parallel commands via GUI
Or write Perl scripts for command-line control
Install new nodes, rebuild corrupted nodes
Check on status of hardware (nodes,
network connections)
Ganglia
xpbsmon
Myrinet tests (gm_board_info)
Cluster Management Software
xpbsmon shows jobs running that were
submitted via the scheduler
Cluster Consistency
Rsync or rdist
/etc/passwd,
shadow, gshadow,
and group files from
login node to compute
nodes
Also consider (auto or
manually) rsync’ing
/etc/profile.d files, pbs
config files, /etc/fstab,
etc.
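A minimal sketch of the push from the login node (node names follow the coffeecompute naming used later in this talk; adjust to your own):
  for n in $(seq -w 1 64); do
    rsync -av -e ssh /etc/passwd /etc/shadow /etc/gshadow /etc/group \
      coffeecompute$n:/etc/
  done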
Local and Remote
Management
Local management
GUI desktop from console monitor
KVM switches to access each node
Remote management
Console switch
ssh in and see what’s on the console monitor screen
from your remote desktop
Web-based tools
Ganglia
ganglia.sourceforge.net
Netsaint
www.netsaint.org
Big Brother
www.bb4.com
Ganglia
Tool for monitoring clusters of up to 2000
nodes
Used on over 500 clusters worldwide
For multiple OS’s and CPU architectures
# ssh -X coffee.chem.upenn.edu
# ssh coffeeadmin
# mozilla &
Open http://coffeeadmin/ganglia
Periodically auto-refreshes web page
[Ganglia web interface screenshots]
Scheduling Software (PBS)
Set up queues for different groups of users
based on resource needs (i.e. not everyone
needs Myrinet; some users only need 1 node)
The world does not end if one node goes down;
the scheduler will run the job on another node
Make sure pbs_server and pbs_sched are running
on the login node
Make sure pbs_mom is running on all compute
nodes, but not on login, admin, or I/O nodes
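A quick, hedged check (init script names vary by install; /etc/init.d/pbs is the one used on coffee.chem later in this talk):
  ps ax | grep '[p]bs_server'   # should be running on the login node
  ps ax | grep '[p]bs_sched'    # also on the login node
  ps ax | grep '[p]bs_mom'      # on every compute node
  /etc/init.d/pbs start         # restart whatever is missing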
Scheduling Software
OpenPBS
PBS Pro
Others
Scheduling Software
OpenPBS
Limit users by number of jobs
Good support via messageboards
*** FREE ***
PBS Pro
Others
Scheduling Software
OpenPBS
PBS Pro
The “pro” version of OpenPBS
Limit by nodes, not just jobs per user
Must pay for support ($25 per CPU, or $3200
for a 128 CPU cluster)
Others
Scheduling Software
OpenPBS
PBS Pro
Others
Load Sharing Facility (LSF)
Codine
Maui
MPI Software
MPICH (Argonne National Labs)
LAM/MPI (OSC/Univ. of Notre Dame)
MPI-GM (Myricom)
MPI Pro (MSTi Software)
Programmed by one of the original
developers of MPICH
Claims to be 20% faster than MPICH
Costs $1200 plus support per year
Compilers and Libraries
Compilers
gcc/g77
Portland Group
Intel
www.gnu.org/software
www.pgroup.com
www.developer.intel.com
Libraries
BLAS
ATLAS - portable BLAS
www.math-atlas.sourceforge.net
LAPACK
SCALAPACK - MPI-based LAPACK
FFTW - Fast Fourier Transform
www.fftw.org
many, many more
Cluster Security
Securing/patching your Linux cluster is
much like securing/patching your Linux
desktop
Keep an eye out for the latest patches
Install a patch only if necessary and do it
on a test machine first
Make sure there’s a way to back out of a
patch before installing it
Cluster Security
Get rid of unneeded software
Limit who installs and what gets installed
Close unused ports and services
Limit login service to ssh between login
node and outside world
Use ssh to tunnel X connections safely
Limit access using hosts.allow/deny
Use scp and sftp for secure file transfer
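Hedged examples of the above (the hostname is the one used later in this talk; the allowed domain in hosts.allow is an assumption):
  ssh -X coffee.chem.upenn.edu            # tunnels X traffic inside ssh
  scp results.dat coffee.chem.upenn.edu:  # secure replacement for ftp/rcp
  # /etc/hosts.deny:   ALL: ALL           (deny everything by default)
  # /etc/hosts.allow:  sshd: .upenn.edu   (then allow ssh from campus only)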
Cluster Security
Carefully configure NFS
Upgrade to the latest, safest Samba
version, if used
Disable Apache if not needed
Turn on built-in Linux firewall software
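On a Red Hat-style system, a rough sketch of those last steps might look like this:
  chkconfig httpd off       # keep Apache off unless it is needed
  chkconfig smb off         # same for Samba, if it is not used at all
  service iptables start    # turn on the built-in firewall now...
  chkconfig iptables on     # ...and at every boot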
Troubleshooting
Make sure the core cluster services are
running
Scheduler, MPI, NFS, cluster managers
Make sure software licenses are up-to-date
Scan logs for break-in attempts
Keep a written journal of all patch installs
and upgrades
Troubleshooting
Sometimes a reboot will fix the problem
If you reboot the login node where the
scheduler is running, be sure the scheduler is
started after the reboot
Any jobs in the queues will be flushed
Hard-rebooting hardware, such as tape
drives, usually fixes the problem
Troubleshooting
Reboot order: I/O node, login node, admin
node, compute nodes (i.e. master nodes
first, then slave nodes)
Rebuilding a node takes 30 minutes with
the cluster manager; reconfiguring it may
take an hour more
Vendor Choices
Dell
IBM
Western Scientific
Aspen Systems
Racksaver
eRacks
Penguin Computing
Many, many others
Go with a proven vendor
Get every vendor to spec
out the same hardware
and software before you
compare prices
Compare service
agreements
How fast can they deliver
a working cluster?
Buying Commercial Software
Is it worth the
money?
Is it proven software?
Are all the bells and
whistles really
necessary?
Paid software does
not necessarily have
the best support
Cluster Tips
Keep all sysadmin
scripts in an easily
accessible place
/4sysadmin
/usr/local/4sysadmin
Cluster Tips
Force everyone to
use the scheduler to
run their jobs (even
uniprocessor jobs)
Police it
Don’t let users get
away with things
Wrapping some
applications into a
scheduler script can
be tricky
Cluster Upgrades
Nodes become obsolete in 2 to 3 years
Upgrade banks of nodes at a time
If upgrading to a new CPU, check for
compatibility problems and new A/C
requirements
Upgrading memory and disk space is easy
but tedious
Cluster Upgrades
Upgrading the OS can be a major task
Even installing patches can be a major
task
Common Sense Cluster
Administration
Plan a little before you do anything
Keep a journal of everything you do
Create procedures that are easy to follow
in times of stress
Document everything!
Common Sense Cluster
Administration
Test software before announcing it
Educate and “radiate” your cluster
knowledge to your support team
coffee.chem
6 P.I.’s in Chemistry funded it
Located in FBA121 next to A/C3
69 dual-CPU node cluster
64 compute nodes
1 login node
1 admin node
1 I/O node
1 backup node
1 firewall node
coffee.chem
Myrinet on 32 compute nodes, gigE on
other 32
2 TB RAID5 array (1.7 TB formatted)
12-slot, 4.8 TB capacity LTO tape
drive
2U fold-out console with LCD monitor,
keyboard, trackpad
coffee.chem
5 daisy-chained KVM switches
9 APC 3000 UPS units, each connected to
its own circuit
3 42U racks
coffee.chem
Red Hat 9
Felix cluster install and management
software
PBS Pro
MPICH, LAM/MPI, MPICH-GM
gnu and Portland Group compilers
BLAS, SCALAPACK, ATLAS libraries
Gaussian98 (Gaussian03 + Linda soon)
coffee.chem
/data on I/O node (coffeecompute00)
holds common apps and user home
directories
Admin node (coffeeadmin) runs Felix
cluster manager
Compute nodes (coffeecompute01..64)
Every node in the cluster can access /data
via NFS
coffee.chem
Can ssh into compute nodes, admin, and
I/O node only via login node
Backup node (javabean) temporarily has
our backup device attached (we use tar
right now)
Logging Into coffee.chem
Everyone in this room will have user
accounts on coffee.chem and home
directories in /data/staff
Our existence on the system is for
Chemistry’s benefit
Support scripts are found in /4sysadmin
If a reboot is necessary, make sure that
PBS is started (/etc/init.d/pbs start)
Compiling and Running Code
pgCC -Mmpi -o test hello.cpp
mpirun -np 8 test
Compiling Code
pgCC -Mmpi -o test hello.cpp
MPICH includes mpicc and mpif77 to
compile and link MPI programs
Scripts that pass the MPI library arguments to
cc and f77
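For example (the source file names are placeholders):
  mpicc -o hello hello.c      # C source; the wrapper adds the MPI flags
  mpif77 -o hello hello.f     # Fortran 77 source, same idea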
Running Code
mpirun -np XXX -machinefile YYY -nolocal
test
-np = number of processors
-machinefile = filename with list of processors
you want to run job on
-nolocal = don’t run the job locally
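Putting the options together with the machine file used later in this talk:
  # machines_gige_32.LINUX lists one hostname per line, e.g.
  #   coffeecompute01
  #   coffeecompute02
  mpirun -np 8 -machinefile machines_gige_32.LINUX -nolocal test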
Submitting a Job
3 queues to choose from
coffeeq
general purpose queue
12 hours max run time
16 processors max
espressoq
Higher priority than coffeeq
3 weeks max run time
Some may still use piq, but this will go away
soon
Submitting a Job
Prepare a scheduler script
#!/bin/tcsh
#PBS -l arch=linux       {define architecture}
#PBS -l cput=1:00:00     {define CPU time needed}
#PBS -l mem=400mb        {define memory space needed}
#PBS -l nodes=64:ppn=1   {define number of nodes needed}
#PBS -m e                {mail me the results}
#PBS -c c                {minimal checkpointing}
#PBS -k oe               {keep the output and errors}
#PBS -q coffeeq          {run the job on coffeeq}
mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello
qsub the scheduler script
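For example, if the script above were saved as myjob.pbs (a file name assumed here):
  qsub myjob.pbs      # prints the new job's id
  qstat               # confirm it is queued or running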
More PBS Commands
Check on the status of all submitted jobs
with: qstat
Submit a job with: qsub
Delete a job with: qdel
Shut down the PBS server itself with: qterm
(qdel also kills a running job)
See all your available compute node
resources with: pbsnodes -a
Node Terms
Login node = Service node = Head node = the node users log into
Master scheduler node = node where scheduler runs, usually login node
Admin node = the node the sysadmin logs into to gain access to cluster
management apps
Compute node = one or more nodes that perform pieces of a larger
computation
Storage node = the node that has the RAID array or SAN attached to it
Backup node = the node that has the backup solution attached to it
I/O node = can combine features of storage and backup nodes
Visualization node = the node that contains a graphics card and
graphics console; multiple visualization nodes can be combined in a
matrix to form a video wall
Spare node = nodes that are not in service, but can be rebuilt to take the
place of a compute node or, in some cases, an admin or login node
References
Bookman, Charles. Linux Clustering: Building and Maintaining
Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
Howse, Martin. "Dropping the Bomb: AMD Opteron" in Linux User
& Developer, Issue 33. pp 33-36.
Robertson, Alan. "Highly-Affordable High Availability" in Linux
Magazine, November 2003. pp 16-21.
The Seventh LCI Workshop Systems Track Notes. Linux Clusters
Institute, March 24-28, 2003.
Sterling, Thomas et al. How to Build a Beowulf: A Guide to the
Implementation and Application of PC Clusters. The MIT Press,
Cambridge, Massachusetts, 1999.
Vrenios, Alex. Linux Cluster Architecture. Sams Publishing,
Indianapolis, Indiana, 2002.
coffee.chem Contact List
Dell hardware problems 800-234-1490
Myrinet problems [email protected]
“Very limited” software support [email protected]
PGI Compiler issues [email protected]
Introduction to
Linux Clusters
Clare Din
SAS Computing
University of Pennsylvania
March 15, 2004