
HPC At PNNL
March 2004
R. Scott Studham,
Associate Director
Advanced Computing
April 13, 2004
HPC Systems at PNNL
Molecular Science Computing Facility
  11.8TF Linux-based supercomputer using Intel Itanium2 processors and Elan4 interconnect
  A balance for our users: 500TB disk, 6.8TB memory
PNNL Advanced Computing Center
  128-processor SGI Altix
  NNSA-ASC “Spray Cool” cluster
2
William R. Wiley
Environmental Molecular Sciences Laboratory
Who are we?

A 200,000 square-foot U.S. Department of
Energy national scientific user facility

Operated by Pacific Northwest National
Laboratory in Richland, Washington
What we provide for you

Free access to over 100 state-of-the-art
research instruments

A peer-review proposal process

Expert staff to assist or collaborate
Why use EMSL?

EMSL provides - under one roof - staff and
instruments for fundamental research on
physical, chemical, and biological
processes.
3
HPCS2 Configuration
1,976 next-generation Itanium® processors in 928 compute nodes
Elan4 and Elan3 interconnects
4 login nodes with 4Gb-Enet
2 system management nodes
Lustre filesystem on a 2Gb SAN (53TB)
11.8TF peak, 6.8TB memory
The 11.8TF system is now in full operation.
4
Who uses the MSCF, and what do they run?
[Pie charts, FY02 numbers: usage by organization (EMSL, Academia, PNNL (not EMSL), Other DOE Labs, Other Gov. Agencies, Private Industry, Other); by code (NWChem Ab Initio, NWChem - PW, NWChem - MD, Gaussian, VASP, ADF, Jaguar, Climate Code, Own Code, Other); and by project type (Grand Challenge, Pilot Project, Support).]
5
MSCF is focused on grand challenges
Fewer users, focused on longer, larger runs and Big Science.
Demand for access to this resource is high.
More than 67% of the usage is for large jobs.
[Bar chart, FY98-FY02: % node-hours used vs. percent of system used by a single job (<3%, 3-6%, 6-12%, 12-15%, 25-50%, >50%).]
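To make the metric concrete, here is a minimal Python sketch of how node-hours can be binned by the fraction of the 928-node system a single job used. The accounting-record format and names are hypothetical; only the bin labels and the system size come from the slides.

# Hypothetical sketch of the metric plotted above: the share of total node-hours
# consumed by jobs of each size class. The record format is an assumption; the
# bin labels follow the chart (note the source chart skips the 15-25% range).
SYSTEM_NODES = 928  # HPCS2 compute nodes

BINS = [("<3%", 0.00, 0.03), ("3-6%", 0.03, 0.06), ("6-12%", 0.06, 0.12),
        ("12-15%", 0.12, 0.15), ("25-50%", 0.25, 0.50), (">50%", 0.50, 1.01)]

def node_hours_by_job_size(jobs):
    """jobs: iterable of (nodes, hours) pairs, one per completed job."""
    totals = {label: 0.0 for label, _, _ in BINS}
    grand_total = 0.0
    for nodes, hours in jobs:
        node_hours = nodes * hours
        grand_total += node_hours
        fraction = nodes / SYSTEM_NODES
        for label, low, high in BINS:
            if low <= fraction < high:
                totals[label] += node_hours
                break
    if grand_total == 0:
        return totals
    return {label: 100.0 * nh / grand_total for label, nh in totals.items()}

# Example: one 64-node 10-hour job and one 512-node 3-hour job.
print(node_hours_by_job_size([(64, 10), (512, 3)]))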
6
World-class science is enabled by systems that deliver the fastest time-to-solution for our science
Significant improvement in time to solution (25-45% for moderate processor counts) from upgrading the interconnect to Elan4.


Improved efficiency
Improved scalability
HPCS2 is a science-driven computer architecture that delivers the fastest time-to-solution for our users' science of any system we have benchmarked.
7
Accurate binding energies for
large water clusters
These results provide unique information
on the transition from the cluster to the
liquid and solid phases of water.
Code: NWChem
Kernel: MP2 (Disk Bound)
Sustained Performance: ~0.6 Gflop/s per
processor (10% of peak)
Choke point: sustained 61GB/s of disk IO and 400TB of scratch space used.
The run took only 5 hours on 1,024 CPUs of the HP cluster. This is a capability-class problem that could not be completed on any other system.
8
Energy calculation of a protein complex
The Ras-RasGAP protein complex
is a key switch in the signaling
network initiated by the epidermal
growth factor (EGF). This signal
network controls cell death and
differentiation, and mutations in the
protein complex are responsible for
30% of all human tumors.
Code: NWChem
Kernel: Hartree-Fock
Time to solution: ~3 hours for one iteration on 1,400 processors
Computation of 107 residues of the
full protein complex using
approximately 15,000 basis
functions. This is believed to be the
largest calculation of its type.
9
Biogeochemistry:
Membranes for Bioremediation
HPCS1: Molecular dynamics of a lipopolysaccharide (LPS)
HPCS2: Classical molecular dynamics of the LPS membrane of Pseudomonas aeruginosa and mineral
HPCS3: Quantum mechanical/molecular mechanics molecular dynamics of membrane plus mineral
10
A new trend is emerging
[Diagram: Supercomputer, Computational Archive, Experimental.]
The MSCF provides a synergy between computational scientists and experimentalists.
With the expansion into biology, the need for storage has drastically increased. EMSL users have stored >50TB in the past 8 months; more than 80% of the data is from experimentalists.
[Chart: Projected Growth Trend for Biology. Petabytes stored (log scale, 0.00001 to 1000) vs. year (1988-2016) for proteomic data and GenBank.]
11
Storage Drivers
We support three different domains with different requirements:
High Performance Computing – Chemistry
  Low storage volumes (10 TB)
  High-performance storage (>500MB/s per client, GB/s aggregate)
  POSIX access
High Throughput Proteomics – Biology
  Large storage volumes (PBs) and exploding
  Write once, read rarely if used as an archive
  Modest latency okay (<10s to data)
  If analysis could be done in place, it would require faster storage
Atmospheric Radiation Measurement – Climate
  Modest-sized storage requirements (100s of TB)
  Shared with the community and replicated to ORNL
12
PNNL's Lustre Implementation
PNNL and the ASCI Tri-Labs
are currently working with
CFS and HP to develop
Lustre.
Lustre has been in full production since last August and is used for aggressive IO from our supercomputer.
[Chart: aggregate bandwidth (GB/s, 0-3.5) vs. number of clients (1, 2, 4, 8) for Lustre over Elan4, Lustre over Elan3, aggregate local IO, and NFS over GigE.]
660MB/s from a single client with a simple “dd” is faster than any local or global filesystem we have tested (a rough sketch of such a test appears below).

Highly stable
Still hard to manage
We are expanding our use of
Lustre to act as the
filesystem for our archival
storage.

Deploying a ~400TB filesystem
We are finally in the era where global filesystems provide faster access than local filesystems.
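For reference, here is a minimal sketch of the kind of single-client sequential-write test mentioned above, roughly what a simple dd measures. The target path and transfer sizes are placeholder assumptions, not our actual benchmark procedure.

# Minimal single-client sequential-write sketch, roughly what a simple
# "dd if=/dev/zero of=<file> bs=1M" measures. The path and sizes below are
# placeholders; point it at a file on the filesystem under test.
import os
import time

def write_throughput_mb_s(path, total_mb=4096, block_mb=1):
    block = b"\0" * (block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really hit storage
    return total_mb / (time.time() - start)

if __name__ == "__main__":
    print("%.0f MB/s" % write_throughput_mb_s("/lustre/scratch/iotest.bin"))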
13
Security
Open computing requires a trust
relationship between sites.
A user logs into siteA and sshes to siteB; if siteA is compromised, the attacker has probably sniffed the password for siteB.
Reaction #1: Teach users to minimize jumping through hosts they do not personally know are secure (why did the user trust siteA in the first place?)
Reaction #2: Implement one-time passwords (SecureID); a generic sketch of the idea follows below.
Reaction #3: Turn off open access (Earth Simulator?)
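For illustration, here is a generic time-based one-time-password sketch (the standard HMAC construction, not the proprietary SecureID algorithm) showing why a sniffed code is useless once its time window expires. The shared secret and parameters are assumptions.

# Generic time-based one-time-password sketch (illustration only; NOT the
# proprietary SecureID algorithm). A sniffed code expires with its 30-second
# window, so capturing it on a compromised host yields no reusable credential.
import hashlib
import hmac
import struct
import time

def one_time_code(secret: bytes, timestep: int = 30, digits: int = 6) -> str:
    counter = int(time.time()) // timestep          # current time window
    msg = struct.pack(">Q", counter)                # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                      # dynamic truncation
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)

if __name__ == "__main__":
    # The shared secret would be provisioned with the user's token/card.
    print(one_time_code(b"example-shared-secret"))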
14
Thoughts about one-time passwords
A couple of different hurdles to cross:
We would like to avoid forcing our users to carry a different SecureID card for each site they have access to.
However, the distributed nature of security (it is run by local site policy) will probably end up with something like this for the short term.
As of April 8th, the MSCF has converted to the PNNL SecureID system for all remote ssh logins.
Lots of FedEx’ed SecureID cards
15
Summary
HPCS2 is running well and the IO capabilities of
the system are enabling chemistry and biology
calculations that could not be run on any other
system in the world.
Storage for proteomics is on a super-exponential
trend.
Lustre is great: 660MB/s from a single client, and we are building a 1/2PB single filesystem.
We rapidly implemented SecureID authentication
methods last week.
16