Blue Gene Experience at the National Center
for Atmospheric Research
October 4, 2006
Theron Voran
[email protected]
Computer Science Section
National Center for Atmospheric Research
Department of Computer Science
University of Colorado at Boulder
Why Blue Gene?
Extreme scalability, balanced architecture, simple design
Efficient energy usage
A change from IBM Power systems at NCAR
But familiar
Programming model
Chip (similar to Power4)
Linux on front-end and IO nodes
Interesting research platform
Outline
System Overview
Applications
In the Classroom
Scheduler Development
TeraGrid Integration
Other Current Research Activities
Frost Fun Facts
Collaborative effort
Univ of Colorado at Boulder (CU)
NCAR
Univ of Colorado at Denver
Debuted in June 2005, tied for 58th place on the Top500
5.73 Tflops peak – 4.71 sustained
25 kW loaded power usage
4 front-ends, 1 service node
6TB usable storage
[Photo: Henry Tufo and Rich Loft with Frost. Why is it leaning?]
System Internals
[Block diagram: Blue Gene/L system-on-a-chip. Two PPC440 CPUs (one can act as the I/O processor), each with 32k/32k L1 caches, an L2, and a "double FPU", sharing a multiported SRAM buffer and a 4MB EDRAM L3 cache, backed by 256MB of external DDR through a 144-bit-wide controller with ECC. Off-chip links: torus (6 out and 6 in, each at 1.4 Gbit/s), tree (3 out and 3 in, each at 2.8 Gbit/s), 4 global barriers or interrupts, Gbit Ethernet, and JTAG access.]
More Details
Chips
PPC440 @ 700 MHz, 2 cores per node
512 MB memory per node
Coprocessor vs Virtual Node mode
1:32 I/O-to-compute node ratio
Storage
4 Power5 systems as GPFS cluster
NFS export to BG/L I/O nodes
Interconnects
3D torus (154 MB/s per direction)
Tree (354 MB/s)
Global Interrupt
GigE
JTAG/IDO
Frost Utilization
[Chart: BlueGene/L (frost) usage, 0-100% utilization from 1/1/06 through 9/10/06]
HOMME
High Order Method Modeling Environment
Spectral element dynamical core
Proved scalable on other platforms
Cubed-sphere topology
Space-filling curves (see the sketch after this list)
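To make the space-filling curve idea concrete, here is a minimal sketch, assuming a Hilbert ordering of elements on a single 2^k x 2^k cubed-sphere face; it is not HOMME's actual partitioning code, and the names (hilbert_d2xy, partition, order, nproc) are illustrative only. Elements that are consecutive along the curve go to the same processor, so each processor receives a spatially compact patch with relatively few off-processor neighbours.

```python
# Sketch only: Hilbert-curve ordering of elements on one face, then a
# contiguous split among processors. HOMME stitches such curves across
# all six cubed-sphere faces; that step is omitted here.
def hilbert_d2xy(order, d):
    """Map distance d along a 2**order x 2**order Hilbert curve to (x, y)."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def partition(order, nproc):
    """Assign the elements of one face to nproc contiguous chunks along the curve."""
    n = (1 << order) ** 2
    curve = [hilbert_d2xy(order, d) for d in range(n)]
    chunk = -(-n // nproc)                # ceiling division
    return [curve[i:i + chunk] for i in range(0, n, chunk)]

if __name__ == "__main__":
    for p, elems in enumerate(partition(order=3, nproc=4)):
        print(f"proc {p}: {len(elems)} elements, first few: {elems[:3]}")
```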
HOMME Performance
Ported in 2004 on the BG/L prototype at TJ Watson, with the eventual goal of a Gordon Bell submission in 2005
Serial and parallel obstacles:
SIMD instructions
Eager vs adaptive routing
Mapping strategies (see the sketch below)
Result:
Good scalability out to 32,768 processors (3 elements per processor)
[Figure: snake mapping on an 8x8x8 3D torus]
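As a companion to the figure, here is a minimal sketch of a snake (boustrophedon) ordering of ranks over an X x Y x Z torus, so that consecutive MPI ranks sit on neighbouring nodes. It only illustrates the ordering; it is not the BG/L mapfile format or HOMME's mapping code.

```python
# Sketch only: enumerate torus coordinates along a snake path so that
# rank r lives at coords[r] and rank r+1 is always one hop away.
def snake_map(X=8, Y=8, Z=8):
    coords = []
    for z in range(Z):
        ys = range(Y) if z % 2 == 0 else reversed(range(Y))
        for y in ys:
            xs = range(X) if (y + z) % 2 == 0 else reversed(range(X))
            for x in xs:
                coords.append((x, y, z))
    return coords

if __name__ == "__main__":
    coords = snake_map()
    # every pair of consecutive ranks differs by one hop in exactly one dimension
    assert all(sum(abs(i - j) for i, j in zip(a, b)) == 1
               for a, b in zip(coords, coords[1:]))
    print(f"{len(coords)} ranks mapped; rank 0 -> {coords[0]}, rank 1 -> {coords[1]}")
```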
HOMME Scalability on 32 Racks
Other Applications
Popular codes on Frost
WRF
CAM, POP, CICE
MPIKAIA
EULAG
BOB
PETSc
Used as a scalability test bed, in preparation for runs on the 20-rack BG/W system
Classroom Access
Henry Tufo’s ‘High Performance Scientific Computing’ course at the University of Colorado
Let students loose on 2048 processors
Thinking BIG
Throughput and latency studies
Scalability tests - Conway’s Game of Life (see the sketch after this list)
Final projects
Feedback from ‘novice’ HPC users
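For a rough sense of what the Game of Life scalability exercise looks like, here is a minimal sketch with a 1-D row decomposition and ghost-row exchange, written with mpi4py and NumPy (both assumptions; the students' actual codes are not reproduced here, and the grid size and step count are arbitrary).

```python
# Sketch only: Conway's Game of Life with a 1-D row decomposition and halo exchange.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1024                        # global grid is N x N; assumes size divides N
rows = N // size                # interior rows owned by this rank
up, down = (rank - 1) % size, (rank + 1) % size   # periodic neighbours

# local block plus one ghost row above and below
grid = np.random.randint(0, 2, (rows + 2, N), dtype=np.int8)

def step(g):
    """One Life update; ghost rows of g are assumed current."""
    nbrs = sum(np.roll(np.roll(g, i, 0), j, 1)
               for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((nbrs == 3) | ((g == 1) & (nbrs == 2))).astype(np.int8)

t0 = MPI.Wtime()
for _ in range(100):
    # refresh ghost rows from the neighbouring ranks
    comm.Sendrecv(grid[1], dest=up, recvbuf=grid[-1], source=down)
    comm.Sendrecv(grid[-2], dest=down, recvbuf=grid[0], source=up)
    grid[1:-1] = step(grid)[1:-1]       # wrap-around in ghost rows is discarded
elapsed = MPI.Wtime() - t0

if rank == 0:
    print(f"{size} ranks: {elapsed:.3f} s for 100 steps on a {N}x{N} grid")
```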
Cobalt
Component-Based Lightweight Toolkit
Open source resource manager and scheduler
Developed by ANL along with NCAR/CU
Component Architecture
Communication via XML-RPC (see the toy sketch after this list)
Process manager, queue manager, scheduler
~3,000 lines of Python code
Also manages traditional clusters
http://www.mcs.anl.gov/cobalt
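A toy sketch of the component pattern, using Python's standard xmlrpc library: one process per component, each exposing its operations over XML-RPC. The method names (add_job, get_jobs) and the port are made up for illustration and are not Cobalt's real interfaces.

```python
# Sketch only: a tiny "queue manager" component speaking XML-RPC.
from xmlrpc.server import SimpleXMLRPCServer
import itertools

class QueueManager:
    """Minimal in-memory job queue exposed over XML-RPC."""
    def __init__(self):
        self.jobs = {}
        self.ids = itertools.count(1)

    def add_job(self, spec):
        """Queue a job described by a dict, e.g. {'user': ..., 'nodes': ..., 'walltime': ...}."""
        jobid = next(self.ids)
        self.jobs[jobid] = dict(spec, state="queued")
        return jobid

    def get_jobs(self):
        """Return all known jobs, keyed by id (XML-RPC wants string keys)."""
        return {str(k): v for k, v in self.jobs.items()}

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("localhost", 9090), allow_none=True)
    server.register_instance(QueueManager())
    print("toy queue manager listening on port 9090")
    server.serve_forever()
```

Another component (the scheduler, say) would then reach it with xmlrpc.client.ServerProxy("http://localhost:9090").get_jobs(); the same loose coupling is what lets Cobalt's process manager, queue manager, and scheduler run as separate pieces.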
Cobalt Architecture
Cobalt Development Areas
Scheduler improvements
Efficient packing (toy simulation sketch after this list)
Multi-rack challenges
Simulation ability
Tunable scheduling parameters
Visualization
Aid in scheduler development
Give users (and admins) better understanding of machine allocation
Accounting / project management and logging
Blue Gene/P
TeraGrid integration
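As an illustration of the packing and simulation questions, here is a toy sketch (not Cobalt's scheduler): jobs requesting power-of-two node counts are best-fit onto a fixed set of free partitions, and a single tunable knob (queue order) changes how much of the machine is left idle. All job sizes and partition sizes below are made up.

```python
# Sketch only: best-fit packing of jobs onto fixed Blue Gene-style partitions.
def simulate(jobs, partitions, largest_first=True):
    """jobs: requested node counts; partitions: free partition sizes."""
    queue = sorted(jobs, reverse=largest_first)
    free = sorted(partitions)
    placed, idle = [], 0
    for size in queue:
        fit = next((p for p in free if p >= size), None)   # smallest partition that fits
        if fit is None:
            continue                     # job has to wait for a later pass
        free.remove(fit)
        placed.append((size, fit))
        idle += fit - size               # nodes stranded inside the allocated partition
    return placed, idle

if __name__ == "__main__":
    jobs = [512, 512, 64, 64, 32]
    partitions = [32, 64, 512, 1024]
    for flag in (True, False):
        placed, idle = simulate(jobs, partitions, largest_first=flag)
        print(f"largest_first={flag}: {len(placed)} jobs placed, {idle} nodes stranded")
```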
NCAR joins the TeraGrid, June 2006
TeraGrid Testbed
[Network diagram: experimental environment (CU experimental storage cluster, CSS switch) and production environment (NCAR computational cluster and datagrid), with NETS switch connections to the TeraGrid, NLR, FRGP, and the NCAR 1GB network]
TeraGrid Activities
Grid-enabling Frost
Common TeraGrid Software Stack (CTSS)
Grid Resource Allocation Manager (GRAM) and Cobalt interoperability
Security infrastructure
Storage Cluster
16 OSTs, 50-100 TB usable storage
10G connectivity
GPFS-WAN
Lustre-WAN
Other Current Research Activities
Scalability of CCSM components
POP
CICE
Scalable solver experiments
Efficient communication mapping
Coupled climate models
Petascale parallelism
Meta-scheduling
Across sites
Cobalt vs other schedulers
Storage
PVFS2 + ZeptoOS
Lustre
Frost has been a success as a …
Research experiment
Utilization rates
Educational tool
Classroom
Fertile ground for grad students
Development platform
Petascale problems
Systems work
Questions?
[email protected]
https://wiki.cs.colorado.edu/BlueGeneWiki