Blacklight_PSC - Pittsburgh Supercomputing Center Staff


PSC Blacklight, a Large Hardware-Coherent Shared Memory Resource

In TeraGrid Production Since 1/18/2011

Why Shared Memory?

• Enable memory-intensive computation
• Change the way we look at data
• Increase users’ productivity
• Boost scientific output
• Broaden participation

Application areas: machine learning, graph-based informatics, high-productivity languages, rapid prototyping, data exploration, statistics, algorithm expression, interactivity, visualization, ISV apps, …


PSC’s Blacklight (SGI Altix® UV 1000)

Programmability + Hardware Acceleration → Productivity

• 2 × 16 TB of cache-coherent shared memory
  – hardware coherency unit: 1 cache line (64 B)
  – 16 TB exploits the processor’s full 44-bit physical address space
  – ideal for fine-grained shared-memory applications, e.g. graph algorithms, sparse matrices
• 32 TB addressable with PGAS languages, MPI, and hybrid approaches
  – low latency and high injection rate support one-sided messaging
  – also ideal for fine-grained shared-memory applications
• NUMAlink® 5 interconnect
  – fat-tree topology spanning the full UV system; low latency, high bisection bandwidth
  – transparent hardware support for cache-coherent shared memory, message pipelining and transmission, collectives, barriers, and optimization of fine-grained, one-sided communications
  – hardware acceleration for PGAS, MPI, gather/scatter, remote atomic memory operations, etc.
• Intel Nehalem-EX processors: 4096 cores (2048 cores per SSI)
  – 8 cores per socket, 2 hardware threads per core, 4 flops/clock, 24 MB L3, Turbo Boost, QPI
  – 4 memory channels per socket → strong memory bandwidth
  – x86 instruction set with SSE 4.2 → excellent portability and ease of use
• SUSE Linux operating system
  – supports OpenMP, p-threads, MPI, and PGAS models → high programmer productivity
  – supports a huge number of ISV applications → high end-user productivity
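To make the “fine-grained shared memory” point concrete, here is a minimal illustrative sketch (not from the original slides) of the kind of access pattern meant: an OpenMP sparse matrix-vector multiply in C over a CSR matrix, where each thread gathers scattered entries of a shared vector and the coherency hardware keeps those accesses consistent. The matrix, sizes, and values are hypothetical.

/* Minimal sketch: fine-grained shared-memory access via a CSR
 * sparse matrix-vector multiply, parallelized with OpenMP.
 * The matrix data below is a small hypothetical example. */
#include <stdio.h>

#define NROWS 4

int main(void)
{
    /* CSR representation of a 4x4 sparse matrix (hypothetical values). */
    int    row_ptr[NROWS + 1] = {0, 2, 4, 6, 8};
    int    col_idx[8]         = {0, 1, 1, 2, 0, 3, 2, 3};
    double val[8]             = {4.0, -1.0, 3.0, 2.0, -2.0, 5.0, 1.0, 6.0};
    double x[NROWS]           = {1.0, 2.0, 3.0, 4.0};
    double y[NROWS];

    /* Each thread reads scattered entries of the shared vector x. */
    #pragma omp parallel for
    for (int i = 0; i < NROWS; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }

    for (int i = 0; i < NROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Built with an OpenMP-capable compiler (e.g. gcc -fopenmp), the same gather pattern scales to matrices and graphs that occupy terabytes of the shared address space.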


Programming Models & Languages

• UV supports an extremely broad range of programming models and languages for science, engineering, and computer science
  – Parallelism
    • Coherent shared memory: OpenMP, POSIX threads (“p-threads”), OpenMPI, q-threads
    • Distributed shared memory: UPC, Co-Array Fortran*
    • Distributed memory: MPI, Charm++
    • Linux OS and standard languages enable users’ domain-specific languages, e.g. NESL
  – Languages
    • C, C++, Java, UPC, Fortran, Co-Array Fortran*
    • R, R-MPI
    • Python, Perl, …
→ Rapidly express algorithms that defy distributed-memory implementation.

→ To existing codes, offer 16-32 TB memory and high concurrency.

* pending F2008-compliant compilers
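As a small, hedged illustration of how these models combine in practice (again, not from the slides), the C sketch below mixes MPI ranks with OpenMP threads; on Blacklight the MPI ranks could span SSIs while the threads share memory within one. MPI_THREAD_FUNNELED is assumed to suffice here because only the master thread calls MPI.

/* Minimal hybrid MPI + OpenMP sketch: MPI ranks for distributed
 * memory, OpenMP threads for coherent shared memory within a rank. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Ask for an MPI library that tolerates threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("MPI rank %d/%d, OpenMP thread %d/%d\n",
               rank, size, tid, nthreads);
    }

    MPI_Finalize();
    return 0;
}

Compile with an MPI wrapper plus OpenMP support, e.g. mpicc -fopenmp hybrid.c (the file name is just an example).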


ccNUMA memory (a brief review; 1)

ccNUMA: cache-coherent non-uniform memory access

• Memory is organized into a non-uniform hierarchy, where each level takes longer to access:
  – registers: 1 clock
  – L1 cache, ~32 kB per core: ~4 clocks
  – L2 cache, ~256-512 kB per core: ~11 clocks
  – L3 cache, ~1-3 MB per core, shared between cores (1 socket): ~40 clocks
  – DRAM attached to a processor (“socket”): O(200) clocks
  – DRAM attached to a neighboring processor on the node (~2-4 sockets): O(200) clocks
  – DRAM attached to processors on other nodes (many sockets): O(1500) clocks

Cache coherency protocols ensure that all data is maintained consistently in all levels of the memory hierarchy. The unit of consistency should match the processor, i.e. one cache line. Hardware support is required to maintain this memory consistency at acceptable speeds.
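Because the coherency hardware keeps data consistent but does not move pages closer to the threads that use them, data placement still matters on a ccNUMA system. The sketch below (illustrative only, assuming Linux’s default first-touch page placement) shows the common idiom of initializing an array in parallel with the same loop schedule used later, so each page lands in DRAM near the socket that will touch it.

/* First-touch placement sketch: pages of 'a' are physically allocated
 * on the NUMA node of whichever thread first writes them, so we
 * initialize in parallel with the same schedule used by the compute loop. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 26;             /* ~64M doubles, ~512 MB (hypothetical size) */
    double *a = malloc((size_t)n * sizeof *a);
    if (!a) { perror("malloc"); return 1; }

    /* Parallel first touch: each thread writes its own contiguous chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* The compute phase reuses the same static schedule, so threads
     * mostly touch pages that are local to their own socket. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++) {
        a[i] += 1.0;
        sum  += a[i];
    }

    printf("sum = %g\n", sum);   /* expect sum == n */
    free(a);
    return 0;
}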


Blacklight Architecture: Blade

Blade diagram: each blade is a “node pair”; each node comprises two Intel Nehalem EX-8 sockets (64 GB RAM per socket) connected over QPI to a UV Hub, and the UV Hubs attach to the NUMAlink-5 (NL5) fabric.

Topology
• fat tree, spanning all 4096 cores

Per SSI:
• 128 sockets
• 2048 cores
• 16 TB
• hardware-enabled coherent shared memory

Full system:
• 256 sockets
• 4096 cores
• 32 TB
• PGAS, MPI, or hybrid parallelism
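As a small illustration of what a single system image means for a program (a hypothetical sketch, assuming Linux/glibc where these sysconf queries are available), the code below reports the logical CPUs and physical memory that one process can see; on one Blacklight SSI this reflects the thousands of hardware threads and the ~16 TB visible to a single OS image.

/* Report the logical CPUs and physical memory visible to a single
 * process, i.e. to one single system image (SSI). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long cpus      = sysconf(_SC_NPROCESSORS_ONLN);  /* online logical CPUs */
    long pages     = sysconf(_SC_PHYS_PAGES);        /* physical pages */
    long page_size = sysconf(_SC_PAGESIZE);          /* bytes per page */

    double tib = (double)pages * (double)page_size
                 / (1024.0 * 1024.0 * 1024.0 * 1024.0);

    printf("online logical CPUs: %ld\n", cpus);
    printf("physical memory:     %.2f TiB\n", tib);
    return 0;
}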


I/O and Grid

• /bessemer – PSC’s center-wide Lustre filesystem
• $SCRATCH: Zest-enabled – high-efficiency scalability (designed for O(10^6) cores), low-cost commodity components, lightweight software layers, end-to-end parallelism, client-side caching and software parity, and a unique model of load-balancing outgoing I/O onto high-speed intermediate storage followed by asynchronous reconstruction to a 3rd-party parallel file system
  – P. Nowoczynski, N. T. B. Stone, J. Yanovich, and J. Sommerfield, “Zest Checkpoint Storage System for Large Supercomputers,” Petascale Data Storage Workshop ’08. http://www.pdsi-scidac.org/events/PDSW08/resources/papers/Nowoczynski_Zest_paper_PDSW08.pdf
• Gateway ready: Gram5, GridFTP, comshell, Lustre WAN, …


Memory-Intensive Analysis Use Cases

• Algorithm Expression
  – Implement algorithms and analyses, e.g. graph-theoretical, for which distributed-memory implementations have been elusive or impractical.
  → Enable rapid, innovative analyses of complex networks.
• Interactive Analysis of Large Datasets
  – Example: fit the whole ClueWeb09 corpus into RAM to enable development of rapid machine-learning algorithms for inferring relationships.
  → Foster totally new ways of exploring large datasets. Interactive queries and deeper analyses limited only by the community’s imagination.
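To illustrate the algorithm-expression point (a hypothetical sketch, not an actual Blacklight analysis), here is a breadth-first search over an adjacency list held entirely in one address space; because the whole graph sits in shared memory, no partitioning or message passing is needed, and the same idea extends to far larger graphs.

/* BFS over an in-memory adjacency list (CSR-style).  With the whole
 * graph in shared memory, no partitioning or messaging is required. */
#include <stdio.h>
#include <stdlib.h>

#define NV 6   /* number of vertices in this hypothetical graph */

int main(void)
{
    /* CSR adjacency: vertex v's neighbors are adj[off[v] .. off[v+1]-1]. */
    int off[NV + 1] = {0, 2, 4, 6, 7, 8, 8};
    int adj[8]      = {1, 2,  0, 3,  0, 4,  5,  5};

    int dist[NV];
    for (int v = 0; v < NV; v++) dist[v] = -1;

    int *queue = malloc(NV * sizeof *queue);
    int head = 0, tail = 0;

    dist[0] = 0;                 /* start BFS from vertex 0 */
    queue[tail++] = 0;

    while (head < tail) {
        int u = queue[head++];
        for (int k = off[u]; k < off[u + 1]; k++) {
            int v = adj[k];
            if (dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
        }
    }

    for (int v = 0; v < NV; v++)
        printf("dist(0 -> %d) = %d\n", v, dist[v]);
    free(queue);
    return 0;
}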


User Productivity Use Cases

• Rapid Prototyping
  – Rapid development of algorithms for large-scale data analysis
  – Rapid development of “one-off” analyses
  → Enable creativity and exploration of ideas
• Familiar Programming Languages
  – Java, R, Octave, etc.
• ISV Applications
  – ADINA, Gaussian, VASP, …
  – Vast memory accessible from even a modest number of cores
→ Leverage tools that scientists, engineers, and computer scientists already know and use. Lower the barrier to using HPC.


Data crisis: genomics
• DNA sequencing machine throughput is increasing at a rate of 5x per year
• Hundreds of petabytes of data will be produced in the next few years
• Moving and analyzing these data will be the major bottleneck in this field


Image source: http://www.illumina.com/systems/hiseq_2000.ilmn


Genomics analysis: two basic flavors
• Loosely-coupled problems
  – Sequence alignment: read many short DNA sequences from disk and map them to a reference genome
  – Lots of disk I/O
  – Fits well with the MapReduce framework
• Tightly-coupled problems
  – De novo assembly: assemble a complete genome from short genome fragments generated by sequencers
  – Primarily a large graph problem
  – Works best with a lot of shared memory


Sequence Assembly of Sorghum

Sarah Young and Steve Rounsley (University of Arizona)
PSC Blacklight: EARLY illumination
• Tested various genomes, assembly codes, and parameters to determine the best options for plant genome assemblies
• Performed assembly of a 600+ Mbase genome of a member of the Sorghum genus on Blacklight using ABySS
• Sequence assemblies of this type will be key to the iPlant Collaborative. Larger plant assemblies are planned for the future.


What can a machine with 16 TB shared memory do for genomics?

Exploring efficient solution of both loosely- and tightly-coupled problems:
• Sequence alignment:
  – Experimenting with use of a ramdisk to alleviate I/O bottlenecks and increase performance
  – Configuring Hadoop to work on a large shared-memory system
  – Increasing productivity by allowing researchers to use the simple, familiar MapReduce framework
• De novo assembly of huge genomes:
  – A human genome, with 3 gigabases (Gb) of DNA, typically requires 256-512 GB of RAM to assemble
  – Cancer research requires hundreds of these assemblies
  – Certain important species, e.g. Loblolly pine, have genomes ~10x larger than the human genome, requiring terabytes of RAM to assemble
  – Metagenomics (sampling unknown microbial populations): no theoretical limit to how many base pairs one might assemble together (100x more than a human assembly!)

Photo: Pinus taeda (Loblolly Pine)
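A rough scaling check (not from the original slides, assuming assembly memory grows roughly linearly with genome size): if a 3-gigabase human genome takes 256-512 GB of RAM, a loblolly pine genome ~10x larger would take roughly 2.5-5 TB, comfortably within Blacklight’s 16 TB per SSI, while a metagenomic assembly 100x the human case would approach 25-50 TB and begin to exceed even the full 32 TB system.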


Thermodynamic Stability of Quasicrystals

Max Hutchinson and Mike Widom (Carnegie Mellon University)
PSC Blacklight: EARLY illumination
• A leading proposal for the thermodynamic stability of quasicrystals depends on the configurational entropy associated with tile flips (“phason flips”).

• Exploring the entropy of symmetry-broken structures whose perimeter is an irregular octagon will allow an approximate theory of quasicrystal entropy to be developed, replacing the actual discrete tilings with a continuum system modeled as a dilute gas of interacting tiles.

• Quasicrystals are modeled by rhombic/octagonal tilings, for which enumeration exposes thermodynamic properties.

• Breadth-first search over a graph that grows super-exponentially with system size; very little locality.

• Nodes must carry arbitrary-precision integers.

Figure: graph for the 3,3,3,3 quasicrystal; T(1) = 8, T(7) = 10042431607269542604521005988830015956735912072
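As a tiny, hypothetical illustration of carrying arbitrary-precision integers on graph nodes (not the actual enumeration code), the C sketch below uses the GNU MP library to accumulate exact path counts while sweeping a small DAG in level order; the real tiling graph grows super-exponentially, which is why terabytes of coherent shared memory help.

/* Accumulate exact (arbitrary-precision) path counts on the nodes of a
 * small hypothetical DAG, processed level by level as in a BFS sweep.
 * Build with:  cc count_paths.c -lgmp */
#include <stdio.h>
#include <gmp.h>

#define NV 5

int main(void)
{
    /* edges[u][v] = 1 if there is an edge u -> v (hypothetical DAG,
     * vertices already listed in topological/level order). */
    int edges[NV][NV] = {
        {0, 1, 1, 0, 0},
        {0, 0, 0, 1, 0},
        {0, 0, 0, 1, 1},
        {0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0},
    };

    mpz_t count[NV];                /* big-integer count per node */
    for (int v = 0; v < NV; v++)
        mpz_init_set_ui(count[v], 0);
    mpz_set_ui(count[0], 1);        /* one way to be at the source */

    /* Sweep nodes in order, pushing counts along outgoing edges. */
    for (int u = 0; u < NV; u++)
        for (int v = 0; v < NV; v++)
            if (edges[u][v])
                mpz_add(count[v], count[v], count[u]);

    gmp_printf("paths from node 0 to node %d: %Zd\n", NV - 1, count[NV - 1]);

    for (int v = 0; v < NV; v++)
        mpz_clear(count[v]);
    return 0;
}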


Performance Profiling of Million-core Runs

Sameer Shende (ParaTools and University of Oregon)
PSC Blacklight: EARLY illumination
• ~500 GB of shared memory successfully applied to the visual analysis of very large-scale performance profiles, using TAU.

• Profile data: synthetic million-core dataset assembled from 32k-core LS3DF runs on ANL’s BG/P.

Screenshots: metadata information about the 1-million-core profile dataset (TAU ParaProf Manager window); execution time breakdown of LS3DF subroutines over all MPI ranks; LS3DF routine profiling data on rank 1,048,575; histogram of MPI_Barrier showing the distribution of the routine’s calls over the execution time.


Summary

• On PSC’s Blacklight resource, hardware-supported cache-coherent shared memory is enabling new data-intensive and memory-intensive analytics and simulations. In particular, Blacklight is:
  – enabling new kinds of analyses on large data,
  – bringing new communities into HPC, and
  – increasing the productivity of both “traditional HPC” and new users.

• PSC is actively working with the research community to bring this new analysis capability to diverse fields of research. This will entail development of data-intensive workflows, new algorithms, scaling and performance engineering, and software infrastructure.

Interested? Contact [email protected], [email protected]

SG-WG Update | Sanielevici | March 18, 2011

© 2010 Pittsburgh Supercomputing Center