Transcript Slide 1

Higher Ed & Research

May 8th, 2012

Sections Included *

Molecular Dynamics Applications Overview

AMBER

NAMD

GROMACS

LAMMPS

* In fullscreen mode, click on a link to view a particular module. Click on the NVIDIA logo on each slide to return to this page.

Molecular Dynamics (MD) Applications

Application: AMBER
Features Supported: PMEMD Explicit Solvent & GB Implicit Solvent
GPU Perf: 89.44 ns/day JAC NVE on 16X 2090s
Release Status: Released; multi-GPU, multi-node
Notes/Benchmarks: AMBER 12, http://ambermd.org/gpus/benchmarks.htm#Benchmarks (launch sketch below the table)

Application: NAMD
Features Supported: Full electrostatics with PME and most simulation features
GPU Perf: 6.44 ns/day STMV on 585X 2050s
Release Status: Released; 100M atom capable; multi-GPU, multi-node
Notes/Benchmarks: NAMD 2.8; 2.9 version April 12; http://biowulf.nih.gov/apps/namd/namd_bench.html

Application: GROMACS
Features Supported: Implicit (5x) and Explicit (2x) Solvent via OpenMM
GPU Perf: 165 ns/day DHFR on 4X C2075s
Release Status: 4.5 single GPU released; 4.6 multi-GPU released
Notes/Benchmarks: http://biowulf.nih.gov/apps/gromacs gpu.html

Application: LAMMPS
Features Supported: Lennard-Jones, Gay-Berne, Tersoff
GPU Perf: 3.5-15x
Release Status: Released; multi-GPU, multi-node
Notes/Benchmarks: 1 billion atoms on Lincoln; http://lammps.sandia.gov/bench.html

GPU Perf compared against a multi-core x86 CPU socket.

GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
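AMBER is listed above as multi-GPU, multi-node capable through the CUDA build of PMEMD. A minimal launch sketch, assuming placeholder input file names (mdin, prmtop, inpcrd) and a four-GPU job; this is illustrative, not the benchmark's exact command line:

  # Single GPU: CUDA build of PMEMD (AMBER 11/12)
  pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd

  # Multi-GPU / multi-node: MPI+CUDA build, one MPI rank per GPU
  mpirun -np 4 pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd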

New/Additional MD Applications Ramping

Application: Abalone
Features Supported: Simulations (TBD)
GPU Perf: 4-29X (on 1060 GPU)
Release Status: Released
Notes: Single GPU. Agile Molecule, Inc.

Application: ACEMD
Features Supported: Written for use on GPUs
GPU Perf: 160 ns/day
Release Status: Released
Notes: Production bio-molecular dynamics (MD) software specially optimized to run on single and multiple GPUs

Application: DL_POLY
Features Supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
GPU Perf: 4x
Release Status: Released, V 4.0, source only
Notes: Multi-GPU, multi-node supported

Application: HOOMD-Blue
Features Supported: Written for use on GPUs
GPU Perf: 2X (32 CPU cores vs. 2 10XX GPUs)
Release Status: Released, Version 0.9.2; results published
Notes: Single and multi-GPU (script sketch below the table)

GPU Perf compared against a multi-core x86 CPU socket.

GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
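HOOMD-Blue jobs are ordinary Python scripts that run on the GPU when one is available. A minimal sketch for the 0.9.x series listed above; the system size, packing fraction, and pair coefficients are illustrative assumptions, not benchmark settings:

  # HOOMD-Blue 0.9.x job script (Python)
  from hoomd_script import *

  init.create_random(N=64000, phi_p=0.2)        # random initial configuration
  lj = pair.lj(r_cut=2.5)                       # short-range Lennard-Jones pair potential
  lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

  integrate.mode_standard(dt=0.005)
  integrate.nve(group=group.all())              # constant-energy (NVE) integration

  run(10000)                                    # advance 10,000 time steps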

GPU Value to Molecular Dynamics

What: Study disease & discover drugs; predict drug and protein interactions.

Why: Speed of simulation is critical. GPUs enable study of longer timeframes, larger systems, and more simulations.

How: GPUs increase throughput & accelerate simulations. AMBER 11 shows a 4.6x performance increase with 2 GPUs at only 54% added cost.*

* AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node). Cost of the CPU node is assumed to be $9,333; the cost of adding two (2) C2090s to a single node is assumed to be $5,333.
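Using only the figures quoted above, the throughput-per-dollar gain can be checked with a small Python sketch (the dollar amounts are the slide's stated assumptions):

  # Price/performance check using the slide's own numbers
  cpu_node_cost = 9333.0    # assumed cost of the CPU-only node ($)
  gpu_addon_cost = 5333.0   # assumed cost of adding two GPUs ($)
  speedup = 4.6             # AMBER 11 Cellulose NPT speed-up with 2 GPUs

  cost_ratio = (cpu_node_cost + gpu_addon_cost) / cpu_node_cost
  print("Added cost: %.0f%%" % ((cost_ratio - 1.0) * 100))         # ~57%, close to the 54% quoted
  print("Throughput per dollar: %.1fx" % (speedup / cost_ratio))   # ~2.9x better than CPU-only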

GPU Ready Applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD

GPU Test Drive Pre-configured Applications: AMBER 11, NAMD 2.8

All Key MD Codes are GPU Ready

AMBER, NAMD, GROMACS, LAMMPS (life and material sciences): great multi-GPU performance

Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue

Focus: scaling to large numbers of GPUs

Outstanding AMBER Results with GPUs

Run AMBER Faster

Up to 5x Speed Up With GPUs

Cluster Performance Scaling

[Chart: AMBER 11 JAC DHFR (NVE, 23,558 atoms) simulation throughput in ns/day at 1, 2, 4, and 8 nodes, GPU+CPU (CPU + 1x C2050 per node) vs. CPU only, compared against a CPU supercomputer reference of 46.01 ns/day on NICS Kraken (Athena).]

“…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind.”

Axel Kohlmeyer, Temple University

AMBER: Make Research More Productive with GPUs

[Chart: cost of one node vs. performance speed-up, with GPU and without GPU. AMBER 11 on 2X E5670 CPUs (per node) vs. AMBER 11 on 2X E5670 CPUs + 2X Tesla M2090s (per node): 318% higher performance for 54% additional expense.]

Base node configuration: dual Xeon X5670 CPUs and dual Tesla M2090 GPUs per node

Adding two M2090 GPUs to a node yields a > 4x performance increase

Run NAMD Faster

Up to 7x Speed Up With GPUs

[Chart: ApoA-1 benchmark (92,224 atoms): 2.94 ns/day GPU+CPU vs. 0.51 ns/day CPU only. Test platform: 1 node, dual Tesla M2090 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on.]

Visit www.nvidia.com/simcluster for more information on speed up results, configuration and test models.

NAMD 2.8 Benchmark

[Chart: STMV benchmark (1,066,628 atoms), ns/day for GPU+CPU vs. CPU only on 1, 2, 4, 8, 12, and 16 compute nodes. NAMD 2.8b1 + unreleased patch. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs; GPU+CPU numbers are for 2x M2070 + 8 cores vs. 8 cores CPU only.]
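A hedged sketch of how a CUDA-enabled NAMD 2.8 run such as the ones above is typically launched; the config file name, core counts, and device list are placeholders, not the benchmark's exact settings:

  # Single node: 8 worker processes sharing 2 GPUs (CUDA build of NAMD 2.8)
  charmrun +p8 namd2 +idlepoll +devices 0,1 stmv.namd

  # Multiple nodes: processes spread over the hosts named in a charmrun nodelist file
  charmrun +p64 ++nodelist nodelist namd2 +idlepoll stmv.namd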

Make Research More Productive with GPUs

[Chart: cost of one node vs. performance speed-up, with GPU and without GPU. NAMD 2.8 on 2X E5670 CPUs (per node) vs. NAMD 2.8 on 2X E5670 CPUs + 2X Tesla C2070s (per node): 250% higher performance for 54% additional expense.]

Get up to a 250% performance increase (STMV, 1,066,628 atoms)

GROMACS Partnership Overview

Erik Lindahl, David van der Spoel, and Berk Hess are the lead authors and project leaders; Szilárd Páll is a key GPU developer.

2010: single-GPU support (OpenMM library in GROMACS 4.5)

NVIDIA DevTech resources allocated to the GROMACS code

2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters

GROMACS 4.6 Release Features

GROMACS Multi-GPU Expected in April 2012

Multi-GPU support: GPU acceleration is one of the main focus areas, and the majority of features will be accelerated in 4.6 in a transparent fashion (see the launch sketch below)

PME simulations get special attention, and most of the effort will go into making these algorithms well load-balanced

Reaction-field and cut-off simulations also run accelerated

The list of features without GPU acceleration will be quite short
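A minimal sketch of launching a multi-GPU GROMACS 4.6 run on a single node; the rank count, GPU mapping, and file prefix are illustrative assumptions, not release defaults:

  # GROMACS 4.6: 4 thread-MPI ranks on one node, one GPU per PP rank
  mdrun -ntmpi 4 -gpu_id 0123 -deffnm topol

  # On clusters, the MPI build (commonly installed as mdrun_mpi) is launched under mpirun the same way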

GROMACS 4.6 Alpha Release

Absolute Performance

Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown.

Benchmark systems: RNAse in water, with 24,040 atoms in a cubic box and 16,816 atoms in a truncated dodecahedron box

Settings: electrostatics cut-off auto-tuned >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps

Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
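The settings above translate roughly into the following .mdp fragment. This is a sketch assuming standard GROMACS 4.6 parameter names; the cut-off auto-tuning mentioned on the slide happens at run time, not in this file:

  ; GROMACS 4.6 run-parameter sketch matching the benchmark settings above
  cutoff-scheme   = Verlet     ; scheme used by the native GPU non-bonded kernels
  coulombtype     = PME
  rcoulomb        = 0.9        ; nm; tuned upward automatically at run time
  rvdw            = 0.9        ; nm, LJ cut-off
  dt              = 0.002      ; ps (2 fs; the 5 fs runs use virtual interaction sites)
  nsteps          = 50000      ; placeholder run length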

GROMACS 4.6 Alpha Release

Strong Scaling

Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on up to 40 cluster nodes with 80 GPUs.

Benchmark system: water box with 1.5M particles

Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps

Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR InfiniBand 40 Gb/s

GROMACS 4.6 Alpha Release

PME Weak Scaling

Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes that fall beyond the typical single-node production size.

Benchmark systems: water boxes with sizes ranging from 1.5k to 3M particles

Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps

Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075

GROMACS 4.6 Alpha Release

Rxn-Field Weak Scaling

Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes that fall beyond the typical single-node production size.

Benchmark systems: water boxes with sizes ranging from 1.5k to 3M particles

Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps

Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075

GROMACS 4.6 Alpha Release

Weak Scaling

Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs: perfect weak scaling; strong scaling remains the challenge.

Benchmark systems: water boxes with sizes ranging from 1.5k to 3M particles

Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps

Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075

LAMMPS Released GPU Features and Future Plans

LAMMPS August 2009

First GPU-accelerated support

LAMMPS Aug. 22, 2011

Selected accelerated non-bonded short-range potentials (SP, MP, DP support):
Lennard-Jones (several variants, with & without coulombic interactions)
Morse
Buckingham
CHARMM
Tabulated
Coarse grain SDK
Anisotropic Gay-Berne
RE-squared
"Hybrid" combinations (GPU accel & no GPU accel)
Particle-Particle Particle-Mesh (SP or DP)
Neighbor list builds
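A hedged sketch of how these accelerated styles are enabled in a LAMMPS input script of that era; the package arguments and cut-off are illustrative, and GPU pair styles of this vintage typically also require the pairwise Newton flag to be off:

  # LAMMPS GPU package (circa 2011): accelerate forces and neighbor builds on 1 GPU per node
  package     gpu force/neigh 0 0 1.0
  newton      off

  # GPU-accelerated Lennard-Jones pair style
  pair_style  lj/cut/gpu 2.5
  pair_coeff  * * 1.0 1.0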

Longer Term*

Improve performance for smaller particle counts (the neighbor list build is the problem)

Improve long-range performance (the MPI/Poisson solve is the problem)

Additional pair potential support (including expensive advanced force fields); see the "Tremendous Opportunity for GPUs" slide*

Performance improvements focused on specific science problems

* Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs

LAMMPS

8.6x

Speed-up with GPUs

W.M. Brown, “GPU Acceleration in LAMMPS”, 2011 LAMMPS Workshop

LAMMPS

4x

Faster on a Billion Atoms

Billion Atom Lennard-Jones Benchmark

[Chart: billion-atom Lennard-Jones benchmark: 29 seconds on 288 GPUs + CPUs vs. 103 seconds on 1,920 x86 CPUs.]

Test platform: NCSA Lincoln cluster with S1070 1U GPU servers attached; CPU-only cluster: Cray XT5
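For reference, the standard LAMMPS Lennard-Jones benchmark input (bench/in.lj) that this result scales up looks roughly like the sketch below; the lattice extent here is a small placeholder, not the billion-atom problem size:

  # Lennard-Jones melt benchmark (scaled-down sketch of bench/in.lj)
  units        lj
  atom_style   atomic
  lattice      fcc 0.8442
  region       box block 0 20 0 20 0 20
  create_box   1 box
  create_atoms 1 box
  mass         1 1.0
  velocity     all create 1.44 87287 loop geom
  pair_style   lj/cut 2.5
  pair_coeff   1 1 1.0 1.0 2.5
  neighbor     0.3 bin
  neigh_modify delay 0 every 20 check no
  fix          1 all nve
  run          100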

LAMMPS

4X-15X Speedups: Gay-Berne and RE-Squared

From the August 2011 LAMMPS Workshop, courtesy of W. Michael Brown, ORNL

LAMMPS Conclusions

Runs on individual multi-GPU nodes as well as on GPU clusters

Outstanding raw performance: 3x-40x higher than equivalent CPU code

Impressive linear strong scaling

Good weak scaling; scales to a billion particles

Tremendous opportunity to GPU-accelerate other force fields