ECE562 Mini/Microcomputers


BlueGene/L
IBM Journal of Research and Development, Vol. 49, No. 2-3.
<http://www.research.ibm.com/journal/rd49-23.html>
Main Design Principles
• Some science & engineering applications scale up to and
beyond 10,000 parallel processes;
• Improve computing capability while holding total system cost constant;
• Cost/perf trade-offs considering the end-use:
– Applications <> Architecture <> Packaging
• Reduce complexity and size.
– ~25 kW/rack is the maximum for air cooling in a standard room.
– Need to improve the performance/power ratio (a back-of-the-envelope sketch follows this list).
– The 700 MHz PowerPC 440 embedded core gives excellent FLOPS/Watt.
• Maximize Integration:
– On chip: ASIC with everything except main memory.
– Off chip: Maximize the number of nodes in a rack.
• Large systems require excellent reliability, availability,
serviceability (RAS)
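
To make the performance/power target concrete, here is a back-of-the-envelope sketch in C. The 700 MHz clock, 2 processors per node, 1024 nodes per rack, and ~25 kW per rack come from these slides; the figure of 4 floating-point operations per cycle per core is an assumption about the PowerPC 440 floating-point unit, not something stated here.

/* Back-of-the-envelope peak-performance arithmetic for one BG/L rack.
 * From the slides: 700 MHz clock, 2 cores/node, 1024 nodes/rack, ~25 kW/rack.
 * The 4 flops/cycle/core figure (two fused multiply-adds per cycle) is an
 * assumption, not stated in the slides.
 */
#include <stdio.h>

int main(void) {
    const double clock_hz        = 700e6;   /* 700 MHz PowerPC 440         */
    const double flops_per_cycle = 4.0;     /* assumed: 2 FMAs/cycle/core  */
    const double cores_per_node  = 2.0;
    const double nodes_per_rack  = 1024.0;
    const double watts_per_rack  = 25e3;    /* air-cooling limit per slide */

    double node_peak = clock_hz * flops_per_cycle * cores_per_node;
    double rack_peak = node_peak * nodes_per_rack;

    printf("peak per node: %.1f GFLOPS\n", node_peak / 1e9);
    printf("peak per rack: %.1f TFLOPS\n", rack_peak / 1e12);
    printf("peak per watt: %.2f GFLOPS/W\n", rack_peak / watts_per_rack / 1e9);
    return 0;
}
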
Physical Layout of BG/L
The Compute Chip
• System-on-a-chip (SoC)
• 1 ASIC containing:
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring hardware (JTAG)
Compute and Node Cards
Node Architecture
• The node uses IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques.
• 11.1-mm square die size, allowing for a very high density
of processing.
• The ASIC uses IBM CMOS CU-11 0.13 micron
technology.
• 700 MHz processor speed, close to memory speed.
• Two processors per node.
• The second processor is intended primarily for handling message-passing operations.
Midplane and Rack
• 1 rack holds 1024 nodes
• Nodes optimized for low power
• ASIC based on SoC technology
– Outperform commodity clusters while saving
on power
– Aggressive packaging of processor, memory
and interconnect
– Power efficient & space efficient
– Allows for latencies and bandwidths that are significantly better than those of nodes typically used in ASC-scale supercomputers
The Torus Network
• 64 x 32 x 32
• Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (see the sketch after this list)
• Compute card is 1x2x1
• Node card is 4x4x2
– 16 compute cards in 4x2x2 arrangement
• Midplane is 8x8x8
– 16 node cards in 2x2x4 arrangement
• Each unidirectional link is 1.4 Gb/s (175 MB/s).
• Each node can send and receive at 1.05 GB/s.
• Supports cut-through
routing, along with both
deterministic and adaptive
routing.
• Variable-sized packets of 32, 64, 96, …, 256 bytes
• Guarantees reliable delivery
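
A minimal C sketch of the torus neighbor arithmetic implied above. The dimensions are the 64 x 32 x 32 figures from this slide; the helper names are illustrative only and not taken from any BG/L software.

/* 3-D torus neighbor arithmetic for a 64 x 32 x 32 machine.
 * Coordinates wrap around in each dimension, so every node has exactly
 * six neighbors: x+, x-, y+, y-, z+, z-.
 */
#include <stdio.h>

#define DIM_X 64
#define DIM_Y 32
#define DIM_Z 32

typedef struct { int x, y, z; } coord_t;

/* Wrap-around (torus) increment/decrement of one coordinate. */
static int wrap(int c, int delta, int dim) {
    return (c + delta + dim) % dim;
}

int main(void) {
    coord_t me = { 0, 31, 15 };                  /* an arbitrary node */
    coord_t nbr[6] = {
        { wrap(me.x, +1, DIM_X), me.y, me.z },   /* x+ */
        { wrap(me.x, -1, DIM_X), me.y, me.z },   /* x- */
        { me.x, wrap(me.y, +1, DIM_Y), me.z },   /* y+ */
        { me.x, wrap(me.y, -1, DIM_Y), me.z },   /* y- */
        { me.x, me.y, wrap(me.z, +1, DIM_Z) },   /* z+ */
        { me.x, me.y, wrap(me.z, -1, DIM_Z) },   /* z- */
    };

    for (int i = 0; i < 6; i++)
        printf("neighbor %d: (%d,%d,%d)\n", i, nbr[i].x, nbr[i].y, nbr[i].z);

    /* Per-node bandwidth: six links each way at 175 MB/s,
     * i.e. 6 * 175 MB/s = 1.05 GB/s in each direction, as on the slide. */
    printf("aggregate per direction: %d MB/s\n", 6 * 175);
    return 0;
}
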
BG/L System Software
• System software supports efficient execution of parallel
applications
• Compiler support for MPI-based C, C++, and Fortran (a minimal example follows this list)
• Front-end nodes are commodity PCs running Linux
• I/O nodes run a customized Linux kernel
• Compute nodes: extremely lightweight custom kernel
– Space sharing, single-thread/processor (dual-threaded per node)
– Flat address space, no paging
– Physical resources are memory-mapped
• Service node is a single multiprocessor machine running a
custom OS
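
Since applications on the compute nodes are MPI programs in C, C++, or Fortran, here is a minimal MPI example in C. It uses only standard MPI calls; nothing in it is specific to BG/L.

/* Minimal MPI program of the kind the BG/L toolchain targets:
 * each rank exchanges a value with its neighbors in a ring.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Ring exchange: send to the next rank, receive from the previous. */
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;
    int send = rank, recv = -1;

    MPI_Sendrecv(&send, 1, MPI_INT, next, 0,
                 &recv, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d received %d from rank %d\n", rank, size, recv, prev);
    MPI_Finalize();
    return 0;
}
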
Space Sharing
• The BG/L system can be partitioned into electronically isolated sets of nodes (power-of-2 sizes; see the sketch after this list)
• Single-user, reservation-based for each
partition
• Faulty hardware is electrically isolated so that other nodes can continue to run in the presence of component failures.
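
A small C sketch of the power-of-2 partition-size rule described above; the function name is hypothetical and not part of any BG/L control-system interface.

/* Illustrative check that a requested partition size is a power of two,
 * as BG/L space sharing requires.  Not a real BG/L API.
 */
#include <stdbool.h>
#include <stdio.h>

static bool is_valid_partition(unsigned nodes) {
    /* A power of two has exactly one bit set. */
    return nodes != 0 && (nodes & (nodes - 1)) == 0;
}

int main(void) {
    unsigned requests[] = { 32, 512, 768, 1024 };
    for (int i = 0; i < 4; i++)
        printf("%u nodes: %s\n", requests[i],
               is_valid_partition(requests[i]) ? "valid" : "not a power of 2");
    return 0;
}
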