Emulating PetaFLOPS Machines and Blue Gene

Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Parallel Programming Laboratory, Department of Computer Science
http://charm.cs.uiuc.edu
Roadmap
• BlueGene Architecture
• Need for an Emulator
• Charm++ BlueGene
• Converse BlueGene
• Future Work
Blue Gene: Processor-in-memory Case Study
Five steps to a PetaFLOPS, taken from:
– http://www.research.ibm.com/bluegene/
• PROCESSOR: 1 GFlop/s, 0.5 MB
• NODE/CHIP: 25 GFlop/s, 12.5 MB
• BOARD
• TOWER
• BLUE GENE: 1 PFlop/s, 0.5 TB
FUNCTIONAL MODEL: 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node
• 25 processors
• 200 processing elements
• Input/Output buffer: 32 x 128 bytes
• Network: connected to six neighbors via duplex links
  – 16 bit @ 500 MHz = 1 Gigabyte/s
• Latencies (a worked estimate follows):
  – 5 cycles per hop
  – 75 cycles per turn
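These figures permit a back-of-the-envelope latency estimate. The sketch below is an illustration added here, not from the slides; it assumes a 34 x 34 x 36 mesh with no wraparound links and ignores injection overhead and contention:

// Back-of-the-envelope estimate of worst-case network latency, assuming a
// 34 x 34 x 36 mesh without wraparound links; software overhead and
// contention are ignored.
#include <cstdio>

int main() {
  const double cycle_ns   = 2.0;           // 500 MHz clock => 2 ns per cycle
  const int    hops       = 33 + 33 + 35;  // corner-to-corner hop count
  const int    turns      = 2;             // one x->y turn, one y->z turn
  const double latency_ns = (hops * 5 + turns * 75) * cycle_ns;
  std::printf("worst-case latency ~ %.0f ns (%.2f us)\n",
              latency_ns, latency_ns / 1000.0);   // ~1310 ns
  return 0;
}

Under these assumptions, even a corner-to-corner message costs on the order of a microsecond of raw network time.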
Processor
STATS:
• 500 MHz
• Memory-side cache eliminates coherency problems
• 10 cycles local cache
• 20 cycles remote cache
• 10 cycles cache miss
• 8 integer units sharing 2 floating point units
• 8 x 25 x ~40,000 = ~8 x 10^6 processing elements!
Need for an Emulator
• An emulator enables the programmer to develop, compile, and run software using the same programming interface that will be used on the actual machine.
Emulator Objectives
• Emulate Blue Gene and other petaFLOPS machines.
• Memory and time limitations on a single processor require that the simulation be performed on a parallel architecture.
• Issues:
  – Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging.
  – Therefore no complex event queue/rollback is needed.
Emulator Implementation
• What are the basic data structures/interface? (a sketch follows this list)
  – Machine configuration (topology), handler registration
  – Nodes with node-level shared data
  – Threads (associated with each node) representing processing elements
  – Communication between nodes
• How do we handle all these objects on a parallel architecture? How do we handle object-to-object communication?
• Difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
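To make the interface concrete, here is a minimal single-process sketch of these pieces: handler registration, a node's incoming-message buffer, and a send/schedule loop. All names are hypothetical illustrations, not the actual Charm++/Converse BlueGene API.

// Minimal single-process sketch of an emulator-style interface.
// All names here are illustrative, not the real emulator's API.
#include <cstddef>
#include <cstdio>
#include <queue>
#include <vector>

typedef void (*HandlerFn)(void *msg);           // registered message handler

struct Msg { int handlerID; std::vector<char> data; };

static std::vector<HandlerFn> handlerTable;     // handler registration
static std::queue<Msg> inBuffer;                // a node's incoming messages

int registerHandler(HandlerFn fn) {             // returns a handler ID that
  handlerTable.push_back(fn);                   // travels with each message
  return (int)handlerTable.size() - 1;
}

// In a real emulation this would route through the 3-D topology to the
// destination node/thread; here everything loops back to one node.
void sendPacket(int x, int y, int z, int threadID,
                int handlerID, const void *data, std::size_t bytes) {
  (void)x; (void)y; (void)z; (void)threadID;
  Msg m;
  m.handlerID = handlerID;
  m.data.assign((const char *)data, (const char *)data + bytes);
  inBuffer.push(m);
}

void scheduler() {                              // a thread's event loop:
  while (!inBuffer.empty()) {                   // pull a message and invoke
    Msg m = inBuffer.front(); inBuffer.pop();   // its registered handler
    handlerTable[m.handlerID](m.data.data());
  }
}

void hello(void *msg) { std::printf("got: %s\n", (const char *)msg); }

int main() {
  int h = registerHandler(hello);
  const char text[] = "ping";
  sendPacket(0, 0, 0, 0, h, text, sizeof text);
  scheduler();
  return 0;
}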
Experiments on Emulator
• Sample applications implemented:
  – Primes
  – Jacobi relaxation
  – MD prototype
    • 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
    • Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI-Red processors
[Figure: ApoA-I, 92k atoms]
Collective Operations
• Explore different algorithms for broadcasts and reductions:
  – RING
  – LINE
  – OCTREE
• Used a "primitive" 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster. (An octree-style scheme is sketched below.)
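For intuition, an octree-style broadcast can be organized as an 8-ary spanning tree over node ranks, so its depth grows as log base 8 of the machine size rather than linearly as in a ring. The sketch below is illustrative only; the emulator's actual algorithm may differ:

// Illustrative octree-style broadcast: node r forwards to children
// 8r+1 .. 8r+8, forming an 8-ary spanning tree over node ranks.
#include <algorithm>
#include <cstdio>

const int NUM_NODES = 30 * 30 * 20;   // 18,000 emulated nodes
static int maxDepth = 0;

// Each receiving node forwards to up to 8 children; in the emulator this
// would be one sendPacket per child. Here we only trace the spanning tree.
void forward(int rank, int depth) {
  maxDepth = std::max(maxDepth, depth);
  for (int c = 8 * rank + 1; c <= 8 * rank + 8; ++c) {
    if (c >= NUM_NODES) return;       // children are increasing, so done
    forward(c, depth + 1);
  }
}

int main() {
  forward(0, 0);                      // root initiates the broadcast
  std::printf("octree depth = %d forwarding steps\n", maxDepth);  // prints 5
  return 0;
}

For the 18,000-node configuration above this gives a tree only 5 forwarding steps deep, versus thousands of sequential steps for a ring.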
Converse BlueGene Emulator: Objectives
• Performance estimation (with proper time stamping)
• Provide an API for building Charm++ on top of the emulator.
BlueGene Emulator: Node Structure
• Communication threads
• Worker threads
• inBuffer
• Non-affinity message queue
• Affinity message queue
(A simplified sketch of this message flow follows.)
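The sketch below illustrates one plausible reading of this structure: a communication thread drains the node's inBuffer, routing each message either to a specific worker's affinity queue or to the shared non-affinity queue, and workers prefer their own queue. This is an illustration added here, not the emulator's exact code; the affinity-first policy is an assumption.

// Simplified, single-threaded stand-in for the node's message flow.
#include <cstdio>
#include <queue>

const int WORKERS = 4;                    // worker threads per node (example)

struct Msg { int targetThread; };         // -1 means "no affinity"

std::queue<Msg> inBuffer;                 // filled by the network
std::queue<Msg> nonAffinityQ;             // any worker may execute these
std::queue<Msg> affinityQ[WORKERS];       // bound to a specific worker

// Communication thread: sort arriving messages by affinity.
void commThreadStep() {
  while (!inBuffer.empty()) {
    Msg m = inBuffer.front(); inBuffer.pop();
    if (m.targetThread >= 0) affinityQ[m.targetThread].push(m);
    else                     nonAffinityQ.push(m);
  }
}

// Worker thread: prefer its own affinity queue, then the shared queue.
void workerStep(int me) {
  if (!affinityQ[me].empty()) {
    affinityQ[me].pop();  std::printf("worker %d: affinity msg\n", me);
  } else if (!nonAffinityQ.empty()) {
    nonAffinityQ.pop();   std::printf("worker %d: shared msg\n", me);
  }
}

int main() {
  inBuffer.push(Msg{2});                  // addressed to worker 2
  inBuffer.push(Msg{-1});                 // anyone may take it
  commThreadStep();
  for (int w = 0; w < WORKERS; ++w) workerStep(w);
  return 0;
}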
Performance
• Pingpong:
  – Close to Converse pingpong: 81-103 µs vs. 92 µs RTT
  – Charm++ pingpong: 116 µs RTT
  – Charm++ BlueGene pingpong: 134-175 µs RTT
Charm++ on top of Emulator
• A BlueGene thread represents a Charm++ node.
• Name conflicts between the two layers:
  – Cpv, Ctv
  – MsgSend, etc.
  – CkMyPe(), CkNumPes(), etc.
  (one possible remapping is sketched below)
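One common way out of such clashes is to give the emulator-side versions distinct names and remap the familiar calls at build time. The guard macro and Bg* names below are hypothetical, added purely for illustration:

/* Illustrative only: the familiar Charm++ names are remapped onto
   emulator-specific ones. The Bg* names and guard macro are hypothetical. */
#ifdef BUILDING_CHARM_ON_BLUEGENE_EMULATOR
#  define CkMyPe()    BgMyPe()
#  define CkNumPes()  BgNumPes()
#endif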
Future Work: Simulator
• LeanMD: a fully functional MD application with only cutoff interactions
• How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
• Several layers of detail to measure:
  – Basic: correctly model performance; timestamp messages with correction for out-of-order execution (see the sketch below)
  – More detailed: network performance, memory access, modeling sharing of the floating-point units, estimation techniques
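As a first approximation of the timestamping idea: each message carries its virtual send time, and a node may not account a message as processed before it could physically have arrived, even if the emulation delivers it early or out of order. The latency and work-time constants below are placeholders, not measured values:

// Simplified sketch of timestamp-based performance estimation.
#include <algorithm>
#include <cstdio>

struct Msg { double sentAt; };            // virtual send time, in seconds

double nodeClock = 0.0;                   // this node's virtual time

// A message cannot be processed before it could physically arrive, even
// if the emulation happens to deliver it early or out of order.
void execute(const Msg &m, double netLatency, double workTime) {
  double arrival = m.sentAt + netLatency;
  nodeClock = std::max(nodeClock, arrival) + workTime;
}

int main() {
  Msg late{5.0e-6}, early{0.0e-6};
  execute(late,  1.3e-6, 2.0e-6);         // delivered out of order...
  execute(early, 1.3e-6, 2.0e-6);         // ...timestamps keep the clock sane
  std::printf("node virtual time = %.1f us\n", nodeClock * 1e6);
  return 0;
}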