Tools for Cluster Administration and Applications (ancient technology – from 2001…) Oak Ridge National Laboratory.

Large Cluster Administration: what is the...
...Problem
• System Administrators DO NOT scale
• Install / update operating system
• Install applications
• Add / Remove users
• etc.
• Users DO NOT scale
• Install applications
• Move data files
• Launch applications
• Interact with active jobs
• etc.
Oak Ridge National Laboratory -- U.S. Department of Energy
...Solution
• Tools that…
• Treat cluster as single machine
• Scale from 1-to-N nodes
• 10,000’s of nodes
• Scale to Federated clusters
• Easy to learn – use – adapt
Tool Review
• Systemimager
• LUI – Linux Utility for cluster Install
• VA Cluster Management (VACM)
• Alert
• Parallel UNIX Commands – (Ptools)
• dsh
• prsh
• Webmin
• ALINKA LCM - Linux Cluster Manager
• ALINKA RAISIN
• SCMS – Smile Cluster Management System
• C3 – Cluster Command & Control
• M3C – Managing Multiple Multi-User Clusters
Systemimager
• Disk image / system administration
• maintain disk coherency across cluster
• administrator level tool
• image server stores images
• can build image server database of site disk images
• Pros:
• supported by VA Linux as open source
• architecture independent
• Cons:
• requires each node to request image (“pull image”)
• only operates at disk image level (not individual file)
• Dependencies:
• rsync, DHCP
• http://download.sourceforge.net/systemimager
Linux Utility for cluster Install – (LUI)
• System install / restore
• administrator level tool
• easy to duplicate install by resource
• linux kernel, system map, partition table, RPMs, “user exits”, local &
remote NFS file systems
• no need to store disk images
• Pros:
• LUI available as an RPM
• supported by IBM as open source
• architecture independent
• machine & resource groups
• Cons:
• only useful for system initialization
• manually installed packages will have to be reinstalled
• Dependencies:
• NFS, tftp-hpa, bootp or dhcp, perl
• http://oss.software.ibm.com/developer/opensource/linux/projects/lui
VA Cluster Management - (VACM)
• GUI-based hardware-level monitor
• device power control, hardware reset, remote BIOS control, chassis intrusion,
CPU fan status
• Intel Intelligent Platform Management Interface motherboards
• Pros:
• monitoring does not impact performance, as IPMI runs in hardware
microcontrollers
• Cons:
• only available for Intel IPMI-compliant motherboards
• does not monitor power supply fan or external fan
• Dependencies:
• IPMI motherboard:
• NB440BX Server Platform (Nightshade)
• T440BX Server Platform (Nightlight)
• L440GX Server Platform (Lancewood)
• GTK+ v1.02, Gnome-libs, GDK v1.2, imlib v1.0.6
• http://www.valinux.com/software/vacm/
Alert
• Web based UNIX cluster monitoring tool
• local clients on each node report to monitor node(s)
• clients are scripts running as cron jobs
• monitors run daemon to receive reports from clients
• Monitors
• alerts
• print web pages
• email notification of events
• Pros
• supports cluster configuration files, allowing definitions of subclusters
• errors can be categorized
• notifications can be assigned for each category
• uses a special Alert log as opposed to having to search syslog
• clients can be written to handle new monitoring tasks
• Cons
• no proactive event correction ability
• http://www.cs.virginia.edu/~jdm2d/alert/
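Alert's monitor side — receive a client report, categorize the error, and route a notification for that category — can be sketched as below. The `category: message` report format and the routing table contents are illustrative assumptions; the real Alert takes its definitions from cluster configuration files.

```python
# Hypothetical category -> notification routing table; the real Alert reads
# its definitions from cluster configuration files.
ROUTES = {
    "disk": ["root@monitor"],
    "temp": ["root@monitor", "pager@site"],
}

def route_report(report, routes=ROUTES):
    """Parse a client report of the form 'category: message' and decide
    which addresses should be notified for that category."""
    category, _, message = report.partition(":")
    category = category.strip()
    return {
        "category": category,
        "message": message.strip(),
        # Uncategorized reports notify nobody by default.
        "notify": routes.get(category, []),
    }
```

A report such as `disk: /home 95% full` would be filed under the `disk` category and forwarded to that category's recipients, rather than left to be found later in syslog.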
Parallel UNIX Commands – (Ptools project)
• Parallel version of common UNIX commands
• cp, cat, ls, rm, mv, find, ps, kill, exec, and test
• Other parallel tools
• parallel process find, command execution on satisfied condition, command
execution on collection of files, display command output
• Target Architecture
• MPP with full Unix environment on each node
• SP-1
• Meiko CS-2
• Unix NOWs
• Argonne National Laboratory
• William Gropp
• Ewing Lusk
• Status: vaporware -- latest reference ‘94 SHPCC paper
• http://www.ptools.org/
• http://www.ptools.org/projects.html#PUC
Distributed Shell – (dsh)
• Command line based
• sequential execution across collection of hosts
• rsh to access nodes
• output prepended with host name
• Pros:
• single or multiple remote commands
• can create node groups
• command can specify individual hosts or use node groups
• Cons:
• no concurrent execution
• no interactive operation
• Dependencies:
• rsh, Perl
• environment vars:
• BEOWULF_ROOT – directory with beowulf related files
• WCOLL – location of file with default working collective
• http://www.ccr.buffalo.edu/dsh.htm
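dsh's core behavior — walk the host list one host at a time, run the command remotely, and prepend each output line with the host name — can be sketched in Python. The `runner` hook standing in for rsh is an illustrative assumption, not dsh's actual implementation:

```python
import subprocess

def run_remote(host, command):
    # dsh reaches nodes over rsh: rsh <host> <command>
    result = subprocess.run(["rsh", host, command],
                            capture_output=True, text=True)
    return result.stdout

def dsh(hosts, command, runner=run_remote):
    """Sequentially run `command` on each host, prefixing every
    output line with the host name (no concurrency, as in dsh)."""
    lines = []
    for host in hosts:  # strictly one host at a time
        for line in runner(host, command).splitlines():
            lines.append(f"{host}: {line}")
    return lines
```

Because execution is strictly sequential, output from different hosts never interleaves — convenient for reading, but slow on large node counts.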
Parallel Remote Shell – (prsh)
• Command line based
• concurrent execution across collection of hosts
• run UNIX command across nodes
• stderr & stdout returned to originating computer
• Pros:
• ability to use rsh or ssh
• hosts and options can be specified in environment variables
• output can be associated with hostname using --prepend
• Cons:
• not able to perform interactive tasks (stdin set to /dev/null)
• using --status with rsh unreliable
• Dependencies:
• rsh, ssh, Perl
• environment vars:
• PRSH_OPTIONS – used before command line options
• PRSH_HOSTS – default host list
• http://www.cacr.caltech.edu/projects/beowulf/GrendelWeb/software/index.html
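The difference from dsh is concurrency: prsh fans the command out to all hosts at once. A thread-pool sketch of that pattern follows; the `runner` hook stands in for rsh/ssh, and `prepend` models the --prepend flag — both are illustrative assumptions:

```python
import concurrent.futures

def prsh(hosts, command, runner, prepend=True):
    """Run `command` on all hosts concurrently.

    `runner(host, command)` stands in for rsh/ssh; with prepend=True each
    output line is tagged with its host name, like prsh's --prepend flag.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        # Submit every host's command before collecting any result.
        futures = {pool.submit(runner, host, command): host for host in hosts}
        for fut in concurrent.futures.as_completed(futures):
            host = futures[fut]
            lines = fut.result().splitlines()
            results[host] = [f"{host}: {line}" for line in lines] if prepend else lines
    return results
```

Results arrive in completion order, which is why the host-name tag matters: without it, concurrent output from many nodes is unattributable.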
Webmin
• web interface for system administration
• designed for use on individual systems – not clusters
• web server and CGI programs to perform administration tasks
• Pros:
• quick, graphical interface to most common system administration tasks
• telnet module for console access to hosts
• ability to define custom commands
• view and manage running processes
• easy addition of user written modules, and standards for writing them
• Cons:
• not intended for clusters
• must have web server on every host
• modules must be written entirely in Perl
• Dependencies:
• Perl 5 or later
• web server
• http://www.webmin.com/webmin/
ALINKA LCM - Linux Cluster Manager
• Command line based management and configuration
• Pros:
• cluster-wide command execution, except superuser commands
• ability to define and manage subclusters
• load monitoring of nodes
• MPI/PVM job execution support
• Cons:
• master node is NFS server for /home, /etc, and /var, limiting scalability
• no support for using SSH, and cluster command doesn't work as root
• no support for NIS or Shadow passwords
• limited to homogeneous clusters
• difficult to install and operate
• Dependencies:
• rsh, tar, nfs-server, sudo, php cgi-bin with pgsql support, bootpd, tcpdump,
postgresql, gawk
• http://www.alinka.com/download.htm#lcm
ALINKA RAISIN
• GUI based management and configuration
• same functionality as ALINKA LCM
• added GUI
• Pros:
• cluster-wide command execution, except superuser commands
• ability to define and manage subclusters
• load monitoring of nodes
• MPI/PVM job execution support
• Cons:
• all cons of ALINKA LCM
• commercial license
• Dependencies:
• same as ALINKA LCM
• apache
• php module for apache with postgresql support
• gnuplot
• http://www.alinka.com/araisin.htm
Smile Cluster Management System – (SCMS)
• Command line and GUI environment
• designed for managing Beowulf-type clusters as a single machine
• latest version looks promising, with a Ptools-like command line interface
• Pros:
• many system utilities (e.g. node status, node control panel, node file system,
disk space, ftp, process status, reboot/shutdown, rpm package manager, telnet,
parallel UNIX commands, alarm services, and motherboard monitoring)
• performance monitoring/logging of CPU, memory, I/O, and network
• user-definable alarm levels with e-mail or script notifications
• Cons:
• no support for job scheduling and cluster resource allocation
• no MPI/PVM job submission tool
• no support for using SSH
• Dependencies:
• rsh, Java, Perl
• http://smile.cpe.ku.ac.th/
Cluster Command & Control (C3) Tools
• Command line based
• single machine interface
• cluster configuration file
• serial & parallel versions
• Pros:
• serial version – deterministic execution, good for debugging
• parallel version – efficient execution
• ability to rapidly deploy software updates and update system images
• command line list option allows subcluster management
• distributed file scatter and gather operations
• execution of any non-interactive command
• Cons:
• no support for interactive command execution
• Dependencies:
• DHCP, rsync 2.4.3 or later, OpenSSL, OpenSSH, DNS, SystemImager v0.23,
Perl v5.6.0 or later
• http://www.csm.ornl.gov/clusterpowertools
Cluster Command & Control (C3) Tools
• System administration
• cpushimage - “push” image across cluster
• cshutdown - Remote shutdown to reboot or halt cluster
• User tools
• cpush - push a single file or a whole directory to each node
• crm - delete a single file or a whole directory on each node
• cget - retrieve files from each node
• cexec - execute arbitrary command on each node
• cps - run ps and retrieve the output from each node
• ckill - kill a process on each node
• Add “s” to end for serial version -- cshutdowns, cpushs, etc...