
“Beowulfery” – Cluster Computing
Using GoldSim to Solve Embarrassingly
Parallel Problems
Presented to:
GoldSim Users Conference - 2007
October 25, 2007
San Francisco, CA
Presented by:
Patrick D. Mattie, M.S., P.G.
Senior Member of Technical Staff
Sandia National Laboratories
Contributions by: Stefan Knopf, GTG and Randy Dockter, SNL-YMP
Presentation Outline
• Cluster Computing Defined
– GoldSim and Beowulf?
• ‘COTS’ Cluster Computing using GoldSim
– GoldSim and E.T.?
• Example Cluster
– TSPA-Wulf
• What is next? Pushing the limits….
Background
What is Cluster Computing?
What is a Beowulf Cluster?
Cluster Computing Defined
• What is a compute cluster?
– “Cluster” is a widely used term for independent computers combined into a unified system through software and networking. At the most fundamental level, when two or more computers are used together to solve a problem, they are considered a cluster.
– Clusters are typically used either for High Availability (HA), to provide greater reliability, or for High Performance Computing (HPC), to provide greater computational power than a single computer can.
Beowulf Class Cluster
• A Beowulf class cluster is a simple design for high-performance computing clusters built on inexpensive personal computer hardware.
– Originally developed in 1994 by Thomas Sterling and Donald Becker at NASA
• Beowulf clusters
– are scalable performance clusters
– are based on commodity hardware
– require no custom hardware or software
• A Beowulf cluster is constructed from commodity computer hardware (Dell, HP, IBM, etc.) and can be as simple as two networked computers sharing a file system on the same LAN, or as complex as thousands of nodes with high-speed, low-latency interconnects (networking)
• Common uses include traditional technical applications such as simulation, biotechnology, and petroleum engineering; financial market modeling; data mining; and stream processing
• http://www.beowulf.org
Advantages of a Beowulf Class Cluster
• Less computation time than running a serial process
• COTS – ‘Commodity Off the Shelf’
– Doesn’t require a big budget
– Doesn’t require a specialized skill set
• Can be built using existing computer resources and Local Area Networks (LANs)
• Can be constructed across different system configurations/brands/resources
• Useful for solving embarrassingly parallel problems
Why do I need a cluster?
• An embarrassingly parallel problem is one for
which no particular effort is needed to segment
the problem into a very large number of parallel
tasks, and there is no essential dependency (or
communication) between those parallel tasks
– A Monte Carlo simulation is an embarrassingly
parallel problem
• For example, a 100-realization simulation can be broken into 100 separate problems, each solved independently of the others (see the sketch below).
• http://en.wikipedia.org/wiki/Embarrassingly_parallel
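GoldSim's distributed processing handles this split itself (covered later in the deck), but as a minimal illustration of the idea — not GoldSim's mechanism, and with a dart-throwing pi estimate as a hypothetical stand-in for a realization — here is a Python sketch that farms 100 independent realizations out to a pool of worker processes:

```python
import random
from multiprocessing import Pool

def run_realization(seed):
    """One independent Monte Carlo realization: estimate pi by dart-throwing."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(100_000))
    return 4.0 * hits / 100_000

if __name__ == "__main__":
    # 100 realizations, each an independent task: no communication between
    # them, so they can be handed to any number of workers in any order.
    with Pool(processes=4) as pool:
        results = pool.map(run_realization, range(100))
    print(f"mean estimate over {len(results)} realizations: "
          f"{sum(results) / len(results):.4f}")
```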
Why do I need a cluster?
• A 100-realization run, at 1 minute per realization:
– One computer (or core): ~1.6 hours
– Four computers (or cores): 25 minutes
– Ten computers (or cores): 10 minutes
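The arithmetic is simply the total work divided among the workers. A minimal sketch (ideal speedup, ignoring transfer and start-up overhead) reproduces the figures above:

```python
import math

def wall_time_minutes(realizations, minutes_each, workers):
    # Ideal wall-clock time: the independent realizations are divided
    # evenly, so the run lasts ceil(N / workers) tasks of equal length.
    return math.ceil(realizations / workers) * minutes_each

for workers in (1, 4, 10):
    minutes = wall_time_minutes(100, 1, workers)
    print(f"{workers:2d} computer(s): {minutes} minutes")
# 1 computer: 100 minutes (~1.6 hours); 4: 25 minutes; 10: 10 minutes
```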
Cluster Computing Using GoldSimPro
• GoldSim Distributed Processing Module
– The Distributed Processing Module uses multiple
copies of GoldSim running on multiple machines
(and/or multiple processes within a single machine
that has a multi-core CPU)
– Grid Computing: [diagram: a Master machine distributing work to multiple Slave machines]
Cluster Computing - Distributed Processing
"Distributed" or "grid computing" - in general is a special
type of parallel computing which relies on complete
computers (with onboard CPU, storage, power supply,
network interface, etc.) connected to a network (private,
public or the internet) by a conventional network
interface, such as Ethernet.
Examples include:
– SETI@home Project (http://setiathome.ssl.berkeley.edu/): analyzing radio telescope data in search of extraterrestrial intelligence
Cluster Computing Using GoldSimPro
There are two versions of the Distributed Processing
Module:
– GoldSim DP (comes with all versions of GoldSim)
– GoldSim DP Plus (licensed separately)
“Beowulfery” - YMP & GoldSim
A Cluster Computing Example
TSPA-Wulf – Cluster Configuration
• Windows Server 2003 and Windows 2000 Advanced Server (3 GB)
• Network simulations (master-slave):
– About 220 Intel Xeon 3.6 GHz dual-processor nodes with 8 GB RAM per machine, on a GigE LAN
– 60 Intel Xeon 3.0 GHz dual-processor, dual-core nodes with 16 GB RAM per machine, on a GigE LAN
– One realization per slave CPU: after a slave CPU finishes one realization, it accepts another from the master server (illustrated in the sketch after this list)
– 680 processors available (plus 62 legacy processors), 742 in total
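That “pull” dispatch is a classic master/slave work queue. The sketch below is not GoldSim's actual protocol, just a minimal Python illustration of the idea: each worker takes one realization at a time and asks for another as soon as it finishes, so faster CPUs naturally do more of the work:

```python
from multiprocessing import Process, Queue

def slave(task_queue, result_queue):
    # Pull one realization at a time; ask for another when finished.
    while True:
        realization = task_queue.get()
        if realization is None:  # sentinel: no work left
            break
        result_queue.put((realization, f"result for realization {realization}"))

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=slave, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for r in range(100):   # hand out 100 realizations
        tasks.put(r)
    for _ in workers:      # one stop sentinel per worker
        tasks.put(None)
    collected = [results.get() for _ in range(100)]
    for w in workers:
        w.join()
    print(f"collected {len(collected)} realization results")
```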
Running the Model -- Overview
[Diagram: the distributed run spans the File Server, the Master Computer, and the Slave Computers]
• File Server: storage area for the TSPA model file; controlled storage area for the Parameter Database, DLLs, and input files; storage area for completed TSPA cases.
• Master Computer: cases are run by GoldSim as a distributed process from a directory on the Master.
• Slave Computers: individual realizations are run by GoldSim processes on the Slaves.
Set-Up On the Master Computer
[Diagram: transfers over the LAN from the File Server storage areas to a directory on the Master computer]
Storage areas on the file server:
• TSPA model file
• Parameter Database (parameter values, links to DLLs, links to input files)
• input files
• DLLs
Directory on the master computer:
• TSPA model file
• input files
• DLLs
Steps (see the sketch after this list):
(1) Manually move the model file to the Master computer.
(2) Set up the model file to run the specific case.
(3a) Global download of parameter values to the model file.
(3b) Global download transfers the input files and DLLs to the Master computer.
(4) Document changes: conceptual write-up, check list, version control file.
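A hedged sketch of steps (1) and (3b): the paths below are hypothetical stand-ins for the file-server storage areas and the Master directory, and the real run uses GoldSim's global download rather than a hand-rolled copy:

```python
import shutil
from pathlib import Path

# Hypothetical locations, standing in for the controlled storage areas
# on the file server and the run directory on the Master computer.
FILE_SERVER = Path(r"\\fileserver\tspa")
MASTER_DIR = Path(r"C:\tspa_runs\case_01")

MASTER_DIR.mkdir(parents=True, exist_ok=True)

# (1) Move the model file to the Master computer.
shutil.copy2(FILE_SERVER / "tspa_model.gsm", MASTER_DIR)

# (3b) Transfer the input files and DLLs to the Master computer.
for subdir in ("input_files", "dlls"):
    shutil.copytree(FILE_SERVER / subdir, MASTER_DIR / subdir,
                    dirs_exist_ok=True)
```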
Running - Transfers to Slaves
[Diagram: transfers over the LAN from the Master computer to the Slave computers]
Directory on the master server:
• TSPA model file
• DLLs
• input files
Slave computers: PA02, PA03, PA04 (each with Networked1 and Networked2 directories), plus 144 other slave computers.
(1) At the start of the distributed process (see the sketch below):
• A “Networked” directory is created for each processor on each Slave computer.
• A GoldSim slave process is started for each processor on each Slave computer.
• The model file, DLLs, and input files are transferred.
(2) Information (i.e., the LHS sampling) for each realization is transferred to the slave processes as they become available.
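As a hedged sketch of step (1), with the same hypothetical paths (in practice GoldSim's distributed process performs this staging itself): create a “Networked” working directory per processor on each slave and stage the model file, DLLs, and input files into it:

```python
import shutil
from pathlib import Path

MASTER_DIR = Path(r"C:\tspa_runs\case_01")  # hypothetical master directory
SLAVES = ["PA02", "PA03", "PA04"]           # ...plus the other slave computers
DIRS_PER_SLAVE = 2                          # one working directory per processor

for slave in SLAVES:
    for cpu in range(1, DIRS_PER_SLAVE + 1):
        workdir = Path(rf"\\{slave}\runs") / f"Networked{cpu}"
        workdir.mkdir(parents=True, exist_ok=True)
        # Stage the model file, DLLs, and input files for this processor.
        shutil.copy2(MASTER_DIR / "tspa_model.gsm", workdir)
        for subdir in ("dlls", "input_files"):
            shutil.copytree(MASTER_DIR / subdir, workdir / subdir,
                            dirs_exist_ok=True)
```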
Running - Transfers from Slaves
[Diagram: transfers over the LAN from the Slave computers back to the Master computer]
Slave computers: PA02, PA03, PA04 (each with Networked1 and Networked2 directories), plus 144 other slave computers.
(1) The .gsr files (one per realization) are transferred as each realization is completed (see the sketch below).
(2) GoldSim loads the .gsr files into the model file when all realizations are completed.
Directory on the master computer:
• TSPA model file
• .gsr files (one per realization)
• DLLs
• input files
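And the collection side, again with hypothetical paths: gather each completed realization's .gsr file from the slaves' Networked directories back to the master (step 1); GoldSim then merges them into the model file (step 2):

```python
import shutil
from pathlib import Path

MASTER_DIR = Path(r"C:\tspa_runs\case_01")  # hypothetical master directory
SLAVES = ["PA02", "PA03", "PA04"]           # ...plus the other slave computers

results_dir = MASTER_DIR / "gsr_files"
results_dir.mkdir(exist_ok=True)

# (1) Pull each completed realization's .gsr file back to the master.
for slave in SLAVES:
    for workdir in Path(rf"\\{slave}\runs").glob("Networked*"):
        for gsr in workdir.glob("*.gsr"):
            shutil.move(str(gsr), str(results_dir / gsr.name))
```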
TSPA Model Architecture
• File size and count:
– 645 input files (approximately 5 GB in total)
– 14 DLLs
– GoldSim file with no results (pre-run) is about 200 MB in size
– GoldSim file after a run is about 5 to 6 GB in size (compressed); however, there is no intrinsic limitation other than the slowness of file manipulation on a 32-bit operating system
TSPA-Wulf Benchmarks
• 1,000 realizations at 90 minutes per realization:
– 62.5 days to run in serial mode
– 120 processors would take ~12.5 hours
– 99% faster
• A typical 1,000,000-year, 1,000-realization run (about 470 time steps) requires 24 hours on 150 CPUs (75 dual-processor, single-core nodes, 32-bit, 2.8–3.0 GHz)
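Checking the arithmetic: 1,000 realizations × 90 minutes = 90,000 minutes = 1,500 hours ≈ 62.5 days in serial; spread over 120 processors, 90,000 / 120 = 750 minutes ≈ 12.5 hours, a roughly 99% reduction in wall-clock time.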
What comes next?
SNL/GoldSim HPCC R&D
• GoldSim evolution/migration to Microsoft HPC
– Migration from 32-bit to 64-bit architecture?
• Optimize the modeling system for Microsoft HPC
– Combined SNL/Microsoft/GoldSim task
– Link GoldSim with the Microsoft CCS scheduler tool to automatically queue jobs and prioritize or re-prioritize job resources ‘on the fly’
– Microsoft’s developers working with GoldSim
• True parallel processing?
– Using OpenMP to take advantage of multiple cores
• Optimize HPC software for large compute clusters
– Combined SNL/Microsoft task