Parallel Libraries and Parallel I/O
John Urbanic
Pittsburgh Supercomputing Center
September 14, 2004
Outline
- Libraries
- I/O Solutions
  - Code Level
  - Parallel Filesystems
Scientific Libraries
Leveraging libraries for your code.
Libraries
- Math Libraries
  - Parallel
  - Serial
- Graphic Libraries
- File I/O Libraries
- Communication
  - MPI, Grid
- Application Specific
  - Protein/Nucleic Sequencing
Serial Math Libraries
- CXML (Alphas)
- BLAS
- EISPACK
- LAPACK
- SCILIB (portable version)
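To show what leveraging one of these serial libraries looks like from C, here is a minimal sketch (not from the original talk) that calls LAPACK's dgesv to solve a small dense system; it assumes the common Fortran-style interface with a trailing underscore and column-major storage.

/* Minimal sketch: solve a 2x2 system A*x = b with LAPACK's dgesv.
   Assumes the usual Fortran calling convention (dgesv_, column-major).
   Link with something like: cc solve.c -llapack -lblas */
#include <stdio.h>

extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    /* A stored column-major: column 1 = (3,1), column 2 = (1,2) */
    double a[4] = { 3.0, 1.0, 1.0, 2.0 };
    double b[2] = { 9.0, 8.0 };   /* right-hand side; overwritten with x */

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%f, %f)\n", b[0], b[1]);   /* expect (2, 3) */
    else
        printf("dgesv failed, info = %d\n", info);
    return 0;
}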
Some “Preferred” Parallel Math Libraries
- PDE solvers (PETSc)
- Parallel Linear Algebra (ScaLAPACK)
- Fourier transforms (FFTW)
PETSc
- PETSc, the Portable, Extensible Toolkit for Scientific Computation, is a suite of data structures and routines for the uniprocessor and parallel-processor solution of large-scale scientific application problems modeled by partial differential equations. PETSc employs the MPI standard for all message-passing communication.
- As a framework, it does have a learning curve.
- Very scalable.
PETSc Codes
Some examples of applications that use PETSc:
- Quake – earthquake simulation code. This year’s Gordon Bell prize winner. Runs at over 1 TFLOPS on Lemieux.
- Multiflow – curvilinear, multiblock, multiprocessor flow solver for multiphase flows.
- FIDAP 8.5 – Fluent’s commercial finite element fluid code; uses PETSc for parallel linear solves.
- Many, many others.
PETSc Design
PETSc integrates a hierarchy of components, enabling the user to employ the level of abstraction that is most natural for a particular problem. Some of the components are:
- Mat - a suite of data structures and code for the manipulation of parallel sparse matrices;
- PC - a collection of preconditioners;
- KSP - data-structure-neutral implementations of many popular Krylov subspace iterative methods;
- SLES - a higher-level interface for the solution of large-scale linear systems;
- SNES - data-structure-neutral implementations of Newton-like methods for nonlinear systems.

Further details at http://www-unix.mcs.anl.gov/petsc
Parallel Programming with MPI, Peter Pacheco, Morgan Kaufmann, 1997, devotes a couple of sections to PETSc.
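To give a flavor of the Mat/Vec/KSP components listed above, here is a minimal sketch (mine, not from the talk) that assembles a distributed tridiagonal system and solves it. It uses the present-day KSP interface, which plays the role the SLES layer played at the time.

/* Minimal PETSc sketch: assemble a tridiagonal system and solve it with KSP.
   Run with e.g.: mpiexec -n 4 ./ex -ksp_type cg -pc_type jacobi */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PetscInt i, n = 100, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Parallel sparse matrix; PETSc decides the row distribution. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* Right-hand side and solution vectors with a matching layout. */
    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Krylov solver; method and preconditioner can be chosen at run time. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
}

Note that the solver and preconditioner are selected at run time from the command line, which is a large part of why the framework scales across so many problem types.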
ScaLAPACK
- ScaLAPACK is a linear algebra library for parallel computers.
- Routines are available to solve the linear system A*x=b, or to find the matrix eigensystem, for a variety of matrix types.
- One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their LAPACK equivalents as much as possible.
- ScaLAPACK implements the block-oriented LAPACK linear algebra routines, adding a special set of communication routines to copy blocks of data between processors as needed.
- As with LAPACK, a single subroutine call typically carries out the requested computation.
- However, ScaLAPACK requires the user to configure the processors and distribute the matrix data before the problem can be solved (see the sketch after this list).
- Similarly to PETSc, the user is spared the mechanics of the parallelization.
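The sketch below (an illustration, not code from the talk) shows that configuration burden for the pdgesv solver: set up a BLACS process grid, create block-cyclic array descriptors, fill the locally owned pieces, then make the single solver call. The 2x2 grid, block size, problem size, and the pdelset fill helper are assumptions chosen for illustration.

/* ScaLAPACK sketch: solve A*x = b with pdgesv on a 2x2 process grid.
   Assumes C bindings for BLACS and Fortran-style ScaLAPACK names.
   Run with at least 4 MPI ranks; link details vary by site. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                      int *icsrc, int *ictxt, int *lld, int *info);
extern void pdelset_(double *a, int *i, int *j, int *desc, double *alpha);
extern void pdgesv_(int *n, int *nrhs, double *a, int *ia, int *ja, int *desca,
                    int *ipiv, double *b, int *ib, int *jb, int *descb, int *info);

int main(int argc, char **argv)
{
    int n = 500, nrhs = 1, nb = 64;   /* global size and block size (assumed) */
    int nprow = 2, npcol = 2;         /* 2x2 process grid (assumed) */
    int izero = 0, ione = 1, info, ctxt;
    int mypnum, nprocs, myrow, mycol, locr, locc, locb, lld;
    int descA[9], descB[9];
    int i, j;

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&mypnum, &nprocs);
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* Local storage sizes for the block-cyclic distribution. */
    locr = numroc_(&n, &nb, &myrow, &izero, &nprow);
    locc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    locb = numroc_(&nrhs, &nb, &mycol, &izero, &npcol);
    lld  = (locr > 1) ? locr : 1;

    double *A = calloc((size_t)locr * locc, sizeof(double));
    double *B = calloc((size_t)locr * (locb > 1 ? locb : 1), sizeof(double));
    int *ipiv = malloc((locr + nb) * sizeof(int));

    descinit_(descA, &n, &n,    &nb, &nb, &izero, &izero, &ctxt, &lld, &info);
    descinit_(descB, &n, &nrhs, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    /* Fill a diagonally dominant matrix and the RHS; pdelset only touches
       locally owned entries, so every rank can loop over global indices. */
    for (i = 1; i <= n; i++) {
        double one = 1.0, diag = 2.0 * n;
        for (j = 1; j <= n; j++) {
            double v = (i == j) ? diag : one;
            pdelset_(A, &i, &j, descA, &v);
        }
        pdelset_(B, &i, &ione, descB, &one);
    }

    /* One call does the distributed LU factorization and solve. */
    pdgesv_(&n, &nrhs, A, &ione, &ione, descA, ipiv, B, &ione, &ione, descB, &info);
    if (mypnum == 0) printf("pdgesv info = %d\n", info);

    free(A); free(B); free(ipiv);
    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}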
ScaLAPACK Project
The ScaLAPACK project was a collaborative effort involving several institutions and comprised four components:
- dense and band matrix software (ScaLAPACK)
- large sparse eigenvalue software (PARPACK and ARPACK)
- sparse direct systems software (CAPSS and MFACT)
- preconditioners for large sparse iterative solvers (ParPre)
Includes parallel versions of EISPACK routines.
TCS and general information at:
- http://www.psc.edu/general/software/packages/scalapack/scalapack.html
- http://www.netlib.org/scalapack/
FFTW
- FFTW is a C subroutine library for computing the Discrete Fourier Transform in one or more dimensions, of both real and complex data, of arbitrary input size.
- FFTW is callable from Fortran. It works on any platform with a C compiler.
- Parallelization through library calls.
- The API of FFTW 3.x is incompatible with that of FFTW 2.x, for reasons of performance and generality (see the FAQ and manual). MPI parallel transforms are still only available in 2.1.5.
- FFTW Web Page at http://www.fftw.org/
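For a sense of the library-call style, here is a minimal sketch (not from the talk) of a serial 1-D complex transform with the FFTW 3.x API; the MPI-parallel transforms mentioned above use the older 2.1.5 interface and look different.

/* Minimal FFTW 3.x sketch: forward 1-D complex DFT of length N.
   Link with: cc fft.c -lfftw3 */
#include <fftw3.h>

#define N 1024

int main(void)
{
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);

    /* Plan first (FFTW picks a strategy), then fill the data. */
    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) {
        in[i][0] = (double)i;   /* real part */
        in[i][1] = 0.0;         /* imaginary part */
    }

    fftw_execute(plan);         /* out[] now holds the transform */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}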
Other Common Packages
- CACTUS
- CHOMBO
- NAG - Parallel Version (built on ScaLAPACK)
Resources
- At PSC
  - Staff (Hotline, [email protected])
  - Web (www.psc.edu/general/software/categories/categories.html)
- In General
  - Netlib (http://netlib.bell-labs.com/netlib/master/readme.html)
Parallel I/O
Achieving scalable I/O.
Motivation
Many best-in-class codes spend significant amounts of time doing file I/O. By significant I mean upwards of 20%, and often approaching 40%, of total run time. These are mainstream applications running on dedicated parallel computing platforms.
Terminology
A few terms will be useful here:
- Start/Restart File
- Checkpoint File
- Visualization File

- Start/Restart File(s): The file(s) used by the application to start or restart a run. May be about 25% of total application memory.
- Checkpoint File(s): A periodically saved file used to restart a run which was disrupted in some way. May be exactly the same as a Start/Restart file, but may also be larger if it stores higher-order terms. If it is automatically or system generated, it will be 100% of app memory.
- Visualization File(s): Used to generate interim data which is usually for visualization or similar analysis. These are often only a small fraction of total app memory (5-15%) each.
How Often Are These Generated?
- Start/Restart File: Once at startup and perhaps at completion of run.
- Checkpoint: Depends on MTBF of machine environment. This is getting worse, and will not be better on a PFLOP system. On order of hours.
- Visualization: Depends on data analysis requirements but can easily be several times per minute.
Latest (Most Optimistic) Numbers
- Blue Gene/L
  - 16 TB memory
  - 40 GB/s I/O bandwidth
  - 400 s to checkpoint memory
- ASCI Purple
  - 50 TB memory
  - 40 GB/s I/O bandwidth
  - 1250 s to checkpoint memory
The latest machines will still take on the order of minutes to tens of minutes to do any substantial I/O.
Example Numbers
We’ll use Lemieux, PSC’s main machine,
as most of these high-demand applications
have similar requirements on other
platforms, and we’ll pick an application
(Earthquake Modeling) that won the
Gordon Bell prize this past year.
3000 PE Earthquake Run
- Start/Restart: 3000 files totaling 150 GB
- Checkpoint: 40 GB every 8 hours
- Visualization: 1.2 GB every 30 seconds
Although this is the largest unstructured mesh ever run, it still doesn’t push the available memory limit. Many apps are closer to being memory bound.
A Slight Digression:
Visualization Cluster
What was once a neat idea has now
become a necessity. Real time volume
rendering is the only way to render down
these enormous data sets to a storable
size.
Actual Route
- Pre-load startup data from FAR to SCRATCH (~12 hr)
- Start holding breath (no node remapping)
- Move from SCRATCH to LOCAL (4 hr)
- Run (16 hours; little I/O time with a 70 GB/s path)
- Move from LOCAL to SCRATCH (6 hr)
- Release breath
- Move to FAR/offsite (~12 hr)
Bottom Line (which is always some bottleneck)
Like most of the TFLOP class machines,
we have several hierarchical levels of file
systems. In this case we want to leverage
the local disks to keep the app humming
along (which it does), but we eventually
need to move the data off (and on) to these
drives. The machine does not give us free
cycles to do this. This pre/post run file
migration is the bottleneck here.
Skip local disk?
Only if we want to spend 70X more time
during the run. Although users love a nice
DFS solution, it is prohibitive for 3000 PE’s
writing simultaneously and frequently.
Where’s the DFS?
It’s on our giant SMP ☺
Just as the difficulty in creating a massive SMP revolves around contention, so does the difficulty in making a DFS (NFS, AFS, GPFS, etc.) that can deal with thousands of simultaneous file writes. Our SCRATCH (~1 GB/s) is as close as we get. It is a globally accessible filesystem. But we still use locally attached disks when it really counts.
Parallel Filesystem Test Results
Parallel filesystems were tested with a simple MPI program that reads and writes a file from each rank. These tests were run in January 2004 on the clusters while they were in production mode. The filesystems and clusters were not in dedicated mode, and so these results are only a snapshot.

Hosts * ppn   Approx. size of test file   Filesystem      Agg. transfer rate [MB/s]
32*4          4 gigabytes                 PSC /scratch    3000 (5/2/04)
110*2         5 gigabytes                 SDSC /gpfs      753
128*2         5 gigabytes                 NCSA /gpfs      423
32*2          2.5 gigabytes               Caltech /pvfs   99
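A minimal sketch of the kind of per-rank test described above (my reconstruction, not the actual benchmark); the file names, transfer size, and timing scheme are assumptions.

/* Sketch of a per-rank filesystem bandwidth test: each MPI rank writes its
   own file and rank 0 reports the aggregate rate. Names/sizes are made up. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const size_t chunk = 1 << 20;    /* 1 MB per write */
    const int nchunks  = 1024;       /* ~1 GB per rank */
    int rank, size;
    char fname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(chunk);
    memset(buf, rank & 0xff, chunk);
    snprintf(fname, sizeof(fname), "testfile.%d", rank);

    MPI_Barrier(MPI_COMM_WORLD);     /* start everyone together */
    double t0 = MPI_Wtime();

    FILE *fp = fopen(fname, "wb");
    for (int i = 0; i < nchunks; i++)
        fwrite(buf, 1, chunk, fp);
    fclose(fp);

    MPI_Barrier(MPI_COMM_WORLD);     /* wait for the slowest rank */
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double total_mb = (double)size * nchunks * chunk / (1024.0 * 1024.0);
        printf("aggregate write rate: %.1f MB/s\n", total_mb / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}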
Data path jumps through hoops, how about the code?
Most parallel code has naturally modular, isolated I/O routines. This makes the above issue much less painful. This is very unlike computational algorithm scalability issues, which often permeate a code.
How many lines/hours?
Quake, which has thousands of lines of code, has only a few dozen lines of I/O code in several routines (startup, checkpoint, viz). Accommodating this particular mode of operation (as compared to the default “magic DFS” mode) took only a couple of hours of recoding.
How Portable?
This is one area where we have to forego
strict portability. However, once we modify
these isolated areas of code to deal with
the notion of local/fragmented disk spaces,
we can bend to any new environment with
relative ease.
Pseudo Code (writing to local)
synch
if (not subgroup #X master)
    send data to subgroup #X master
else
    openfile datafile.data.X
    for (1 to number_in_subgroup)
        receive data
        write data
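Rendered as real MPI code, the write pattern above might look like the following sketch (mine, not the Quake source); the subgroup size, tags, and file naming are assumptions, and reading back from local disk is simply the mirror image.

/* Sketch: each subgroup's master gathers data from its members and writes
   one local file per subgroup. Group size and data layout are made up. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NDATA (1 << 20)   /* doubles per rank (assumed) */
#define GROUP 16          /* ranks per I/O subgroup (assumed) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *data = malloc(NDATA * sizeof(double));
    for (int i = 0; i < NDATA; i++) data[i] = rank;   /* stand-in payload */

    /* Split COMM_WORLD into subgroups; local rank 0 is the subgroup master. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP, rank, &sub);
    int subrank, subsize;
    MPI_Comm_rank(sub, &subrank);
    MPI_Comm_size(sub, &subsize);

    MPI_Barrier(MPI_COMM_WORLD);   /* "synch" */

    if (subrank != 0) {
        /* Not the master: ship the buffer to the subgroup master. */
        MPI_Send(data, NDATA, MPI_DOUBLE, 0, 0, sub);
    } else {
        /* Master: write own data, then receive and append each member's. */
        char fname[64];
        snprintf(fname, sizeof(fname), "datafile.data.%d", rank / GROUP);
        FILE *fp = fopen(fname, "wb");
        fwrite(data, sizeof(double), NDATA, fp);
        for (int src = 1; src < subsize; src++) {
            MPI_Recv(data, NDATA, MPI_DOUBLE, src, 0, sub, MPI_STATUS_IGNORE);
            fwrite(data, sizeof(double), NDATA, fp);
        }
        fclose(fp);
    }

    MPI_Comm_free(&sub);
    free(data);
    MPI_Finalize();
    return 0;
}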
Pseudo Code (reading from local)
synch
if (not subgroup #X master)
    receive data
else
    openfile datafile.data.X
    for (1 to number_in_subgroup)
        read data
        send data
Pseudo Code (writing to DFS)
synch
openfile SingleGiantFile
Setfilepointer(based on PE #)
write data
Platform and Run Size Issues
- Various platforms will strongly suggest different numbers or patterns of designated I/O nodes (sometimes all nodes, sometimes a very few). Simple to accommodate in code.
- Different numbers of total PE's or I/O PE's will require different distributions of data in local files. This can be done offline.
File Migration Mechanics
 ftp, scp, gridco, gridftp, etc.
 tcsio (a local solution)
How about MPI-IO?
- Not many (any?) full MPI-2 implementations. It is more that some vendor/site combinations have implemented the features to accomplish the above type of customization for a particular disk arrangement. Or:
- Portable-looking code that runs very, very slowly.
- You can explore this separately via ROMIO: http://www-unix.mcs.anl.gov/romio/
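For reference, this is roughly what the earlier "writing to DFS" pseudo code looks like when expressed through ROMIO/MPI-IO; buffer sizes and the file name are illustrative assumptions, and, as noted above, performance depends entirely on the underlying implementation.

/* Sketch: every rank writes its slice of one shared file at an offset
   computed from its rank, using MPI-IO (ROMIO). Sizes are made up. */
#include <mpi.h>
#include <stdlib.h>

#define NDATA (1 << 20)   /* doubles per rank (assumed) */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *data = malloc(NDATA * sizeof(double));
    for (int i = 0; i < NDATA; i++) data[i] = rank;

    /* "openfile SingleGiantFile" */
    MPI_File_open(MPI_COMM_WORLD, "SingleGiantFile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* "Setfilepointer(based on PE #)" + "write data" in one collective call */
    MPI_Offset offset = (MPI_Offset)rank * NDATA * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, NDATA, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}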
Parallel Filesystems
- PVFS: http://www.parl.clemson.edu/pvfs/index.html
- LUSTRE: http://www.lustre.org/
Current deployments
- Summer 2003 (3 of the top 8 run Linux; Lustre on all 3)
  - LLNL MCR: 1,100 node cluster
  - LLNL ALC: 950 node cluster
  - PNNL EMSL: 950 node cluster
- Installing in 2004
  - NCSA: 1,000 nodes
  - SNL/ASCI Red Storm: 8,000 nodes
  - LANL Pink: 1,000 nodes
LUSTRE = Linux + Cluster
- Provides
  - Caching
  - Failover
  - QOS
  - Global Namespace
  - Security and Authentication
- Built on
  - Portals
  - Kernel mods
Interface (for striping control)
- Shell: lstripe
- Code: ioctl
Performance
From http://www.lustre.org/docs/lustre-datasheet.pdf
- File I/O % of raw bandwidth: >90%
- Achieved client I/O: >650 MB/s
- Aggregate I/O, 1,000 clients: 11.1 GB/s
- Attribute retrieval rate: 7500/s (in 10M file directory, 1,000 clients)
- Creation rate: 5000/s (one directory, 1,000 clients)
Benchmarks
- FLASH: http://flash.uchicago.edu/~zingale/flash_benchmark_io/#intro
- PFS: http://www.osc.edu/~djohnson/gelato/pfbs-0.0.1.tar.gz
Didn’t Cover (too trivial for us)
- Formatted/Unformatted
- Floating Point Representations
- Byte Ordering
- XDF – No help for parallel performance