Blue Gene Experience at the National Center for Atmospheric Research
October 4, 2006
Theron Voran
[email protected]
Computer Science Section
National Center for Atmospheric Research
Department of Computer Science
University of Colorado at Boulder
Why Blue Gene?
- Extreme scalability, balanced architecture, simple design
- Efficient energy usage
- A change from IBM Power systems at NCAR
- But familiar
  - Programming model
  - Chip (similar to Power4)
  - Linux on front-end and IO nodes
- Interesting research platform
Outline
- System Overview
- Applications
- In the Classroom
- Scheduler Development
- TeraGrid Integration
- Other Current Research Activities
Frost Fun Facts
- Collaborative effort
  - Univ of Colorado at Boulder (CU)
  - NCAR
  - Univ of Colorado at Denver
- Debuted in June 2005, tied for 58th place on Top500
- 5.73 Tflops peak, 4.71 sustained (arithmetic for the peak figure sketched below)
- 25KW loaded power usage
- 4 front-ends, 1 service node
- 6TB usable storage
- Why is it leaning?
[Photo: Henry Tufo and Rich Loft, with Frost]
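As a sanity check on the 5.73 Tflops figure, here is a minimal sketch of the peak-performance arithmetic. The 1024-node, 2-cores-per-node, 700 MHz numbers are consistent with these slides; the 4 flops per cycle per core (double FPU doing fused multiply-adds) is an assumed BG/L design figure, not something stated here.

    # Back-of-the-envelope peak performance for a single Blue Gene/L rack (Frost).
    # 1024 nodes, 2 cores/node, and 700 MHz match the slides; 4 flops/cycle/core
    # (double FPU, fused multiply-add) is an assumed BG/L design figure.
    nodes = 1024
    cores_per_node = 2
    flops_per_cycle = 4              # 2 FMA pipes x 2 flops per fused multiply-add
    clock_hz = 700e6

    peak_tflops = nodes * cores_per_node * flops_per_cycle * clock_hz / 1e12
    print(f"Theoretical peak: {peak_tflops:.2f} Tflops")   # ~5.73 Tflops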
System Internals
[Diagram: Blue Gene/L system-on-a-chip. Two PPC440 cores, each with a "double FPU" and 32k/32k L1, shared L2 and 4MB EDRAM L3, and a 144-bit-wide DDR interface with ECC. On-chip network interfaces: 3D torus (6 out and 6 in, each at 1.4 Gbit/s per link), tree (3 out and 3 in, each at 2.8 Gbit/s per link), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.]
More Details
- Chips
  - PPC440 @ 700MHz, 2 cores per node
  - 512 MB memory per node
  - Coprocessor vs Virtual Node mode (see the sketch below)
  - 1:32 IO-to-compute node ratio
- Storage
  - 4 Power5 systems as GPFS cluster
  - NFS export to BGL IO nodes
- Interconnects
  - 3D Torus (154 MB/s in one direction)
  - Tree (354 MB/s)
  - Global Interrupt
  - GigE
  - JTAG/IDO
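On the node-mode point, as a summary of standard BG/L behavior rather than anything stated on this slide: coprocessor mode runs one MPI task per node and uses the second core to help with communication, while virtual node mode runs a task on each core and splits the node memory. A tiny sketch of what that means for a full rack, using the 512 MB and 2-cores-per-node figures above:

    # Rough comparison of BG/L node modes for a 1024-node rack with 512 MB/node
    # (mode semantics summarized from the BG/L design, not from this slide):
    # coprocessor mode runs one MPI task per node, with the second core helping
    # with communication; virtual node mode runs a task on each core and halves
    # the memory available to each task.
    NODES, CORES_PER_NODE, MB_PER_NODE = 1024, 2, 512

    modes = {
        "coprocessor":  (NODES,                  MB_PER_NODE),
        "virtual node": (NODES * CORES_PER_NODE, MB_PER_NODE // CORES_PER_NODE),
    }
    for name, (tasks, mb) in modes.items():
        print(f"{name:>12}: {tasks:5d} MPI tasks, {mb} MB per task")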
Frost Utilization
[Chart: BlueGene/L (frost) usage, showing utilization (0% to 100%) from 1/1/06 through 9/10/06]
HOMME
- High Order Method Modeling Environment
- Spectral element dynamical core
- Proved scalable on other platforms
- Cubed-sphere topology
- Space-filling curves (see the sketch below)
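The space-filling-curve idea is what makes the element-to-processor assignment both load balanced and local. A minimal sketch of the approach (not HOMME's actual code; it assumes a single N x N face and a simple snake ordering rather than HOMME's cubed-sphere curve):

    # Minimal illustration of space-filling-curve partitioning (not HOMME's code):
    # order the elements of one N x N face along a snake (boustrophedon) curve,
    # then hand out contiguous chunks of that ordering so each processor gets a
    # compact, nearly equal share of elements.

    def snake_order(n):
        """(i, j) indices of an n x n face, row by row, alternating direction."""
        order = []
        for j in range(n):
            row = [(i, j) for i in range(n)]
            order.extend(row if j % 2 == 0 else reversed(row))
        return order

    def partition(elements, nprocs):
        """Split the curve-ordered element list into nprocs contiguous chunks."""
        base, extra = divmod(len(elements), nprocs)
        chunks, start = [], 0
        for p in range(nprocs):
            size = base + (1 if p < extra else 0)
            chunks.append(elements[start:start + size])
            start += size
        return chunks

    if __name__ == "__main__":
        elems = snake_order(6)                # 36 elements on a 6 x 6 face
        for rank, chunk in enumerate(partition(elems, 4)):
            print(rank, chunk)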
HOMME Performance
- Ported in 2004 on BG/L prototype at TJ Watson, with eventual goal of Gordon Bell submission in 2005
- Serial and parallel obstacles:
  - SIMD instructions
  - Eager vs Adaptive routing
  - Mapping strategies (see the sketch below)
- Result:
  - Good scalability out to 32,768 processors (3 elements per processor)
[Figure: Snake mapping on 8x8x8 3D torus]
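In the spirit of the snake-mapping figure, here is a small sketch of a snake task-to-torus mapping (an illustration of the idea, not the exact mapping HOMME used): walk the torus plane by plane, reversing direction on alternate rows and planes, so that consecutive MPI ranks land on neighboring nodes.

    # Sketch of a snake task-to-torus mapping (an illustration, not HOMME's exact
    # mapping): consecutive MPI ranks are placed on nearest-neighbor torus nodes,
    # so communication between consecutive ranks stays one hop.

    def snake_map(nx, ny, nz):
        """Return a list whose r-th entry is the (x, y, z) coordinate of rank r."""
        coords = []
        for z in range(nz):
            ys = range(ny) if z % 2 == 0 else reversed(range(ny))
            for yi, y in enumerate(ys):
                xs = range(nx) if (z * ny + yi) % 2 == 0 else reversed(range(nx))
                coords.extend((x, y, z) for x in xs)
        return coords

    if __name__ == "__main__":
        mapping = snake_map(8, 8, 8)          # 512 ranks on an 8x8x8 torus
        # Every pair of consecutive ranks differs by exactly one hop.
        for r in range(1, len(mapping)):
            a, b = mapping[r - 1], mapping[r]
            assert sum(abs(ai - bi) for ai, bi in zip(a, b)) == 1
        print(mapping[:10])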
HOMME Scalability on 32 Racks
Other Applications
- Popular codes on Frost
  - WRF
  - CAM, POP, CICE
  - MPIKAIA
  - EULAG
  - BOB
  - PETSc
- Used as a scalability test bed, in preparation for runs on the 20-rack BG/W system
Classroom Access
- Henry Tufo’s ‘High Performance Scientific Computing’ course at the University of Colorado
- Let students loose on 2048 processors
- Thinking BIG
- Throughput and latency studies
- Scalability tests - Conway’s Game of Life (see the sketch below)
- Final projects
- Feedback from ‘novice’ HPC users
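For flavor, a minimal sketch of the kind of Game of Life scalability exercise the students ran (not the course code): a 1D strip decomposition with halo exchange, written here with mpi4py and numpy.

    # Minimal sketch (not the course code) of a distributed Game of Life run:
    # a 1D strip decomposition with halo exchange, using mpi4py and numpy.
    # Launch with something like: mpiexec -n 4 python life_mpi.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    N = 512                                   # global grid is N x N; N % size == 0
    rows = N // size
    rng = np.random.default_rng(rank)
    # Local strip plus one halo row above and one below.
    grid = (rng.random((rows + 2, N)) < 0.25).astype(np.uint8)
    up, down = (rank - 1) % size, (rank + 1) % size   # periodic in y

    def step(g):
        # Swap halo rows with the neighboring ranks.
        comm.Sendrecv(g[1], dest=up, recvbuf=g[-1], source=down)
        comm.Sendrecv(g[-2], dest=down, recvbuf=g[0], source=up)
        # Count the eight neighbors of every cell (periodic in x via roll).
        n = sum(np.roll(np.roll(g, dy, 0), dx, 1)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
        alive = (n == 3) | ((g == 1) & (n == 2))
        g[1:-1] = alive[1:-1].astype(np.uint8)        # update interior rows only

    t0 = MPI.Wtime()
    for _ in range(100):
        step(grid)
    if rank == 0:
        print(f"100 steps on {size} ranks took {MPI.Wtime() - t0:.3f} s")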
Cobalt
- Component-Based Lightweight Toolkit
- Open source resource manager and scheduler
- Developed by ANL along with NCAR/CU
- Component Architecture
  - Communication via XML-RPC (see the sketch below)
  - Process manager, queue manager, scheduler
  - ~3000 lines of Python code
- Also manages traditional clusters
http://www.mcs.anl.gov/cobalt
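The XML-RPC component pattern is easy to sketch with Python's standard library; the component and method names below are hypothetical, not Cobalt's real interfaces.

    # Toy illustration of the XML-RPC component pattern (hypothetical names, not
    # Cobalt's real interfaces): a tiny "queue manager" exposes a couple of
    # methods that another component or a command-line tool can call remotely.
    from xmlrpc.server import SimpleXMLRPCServer

    class ToyQueueManager:
        def __init__(self):
            self.jobs = {}
            self.next_id = 1

        def add_job(self, user, nodes, walltime):
            jobid = self.next_id
            self.next_id += 1
            # XML-RPC structs need string keys.
            self.jobs[str(jobid)] = {"user": user, "nodes": nodes,
                                     "walltime": walltime, "state": "queued"}
            return jobid

        def get_jobs(self):
            return self.jobs

    if __name__ == "__main__":
        server = SimpleXMLRPCServer(("localhost", 8010), allow_none=True)
        server.register_instance(ToyQueueManager())
        print("toy queue manager listening on port 8010")
        server.serve_forever()

A second process (a stand-in for a scheduler or a user command) could then call xmlrpc.client.ServerProxy("http://localhost:8010").get_jobs() to inspect the queue.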
Cobalt Architecture
Cobalt Development Areas
- Scheduler improvements
  - Efficient packing
  - Multi-rack challenges
  - Simulation ability (see the sketch below)
  - Tunable scheduling parameters
- Visualization
  - Aid in scheduler development
  - Give users (and admins) better understanding of machine allocation
- Accounting / project management and logging
- Blue Gene/P
- TeraGrid integration
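To give a flavor of what "efficient packing" and "simulation ability" mean here, a toy replay of a job queue against a fixed menu of partition sizes (not Cobalt code; it ignores the fact that real Blue Gene partitions nest and overlap, and assumes every job fits in the largest partition):

    # Toy scheduler replay (not Cobalt): run a FIFO queue of (nodes, runtime)
    # jobs against a fixed menu of partition sizes, always picking the smallest
    # idle partition that fits, and report node utilization.
    import heapq

    def simulate(jobs, partitions):
        idle = sorted(partitions)            # sizes of currently idle partitions
        running = []                         # min-heap of (finish_time, size)
        now = 0.0
        busy_node_seconds = 0.0
        for nodes, runtime in jobs:
            # Wait for running jobs to finish until some partition is big enough.
            while not any(size >= nodes for size in idle):
                finish, size = heapq.heappop(running)
                now = max(now, finish)
                idle.append(size)
                idle.sort()
            size = next(s for s in idle if s >= nodes)   # smallest fit
            idle.remove(size)
            heapq.heappush(running, (now + runtime, size))
            busy_node_seconds += nodes * runtime
        end = max(f for f, _ in running) if running else now
        return busy_node_seconds / (sum(partitions) * end) if end else 0.0

    if __name__ == "__main__":
        parts = [512, 256, 128, 64, 32, 32]              # hypothetical menu
        workload = [(32, 3600), (512, 1800), (128, 7200), (64, 900), (256, 3600)]
        print(f"utilization: {simulate(workload, parts):.1%}")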
NCAR joins the TeraGrid, June 2006
TeraGrid Testbed
[Diagram: TeraGrid testbed, showing the experimental environment (CU experimental storage cluster and computational cluster behind the CSS switch) and the production environment (NCAR behind the NETS switch), with connections via TeraGrid, NLR, FRGP, and the NCAR 1GBNet to the Datagrid]
TeraGrid Activities
- Grid-enabling Frost
  - Common TeraGrid Software Stack (CTSS)
  - Grid Resource Allocation Manager (GRAM) and Cobalt interoperability
  - Security infrastructure
- Storage Cluster
  - 16 OSTs, 50-100 TB usable storage
  - 10G connectivity
  - GPFS-WAN
  - Lustre-WAN
Other Current Research Activities
- Scalability of CCSM components
  - POP
  - CICE
- Scalable solver experiments
- Efficient communication mapping
  - Coupled climate models
  - Petascale parallelism
- Meta-scheduling
  - Across sites
  - Cobalt vs other schedulers
- Storage
  - PVFS2 + ZeptoOS
  - Lustre
Frost has been a success as a …
- Research experiment
  - Utilization rates
- Educational tool
  - Classroom
  - Fertile ground for grad students
- Development platform
  - Petascale problems
  - Systems work
Questions?
[email protected]
https://wiki.cs.colorado.edu/BlueGeneWiki