BlueGene/L System Software
Derek Lieber
IBM T. J. Watson Research Center
February 2004

Topics

Programming environment
  Compilation
  Execution
  Debugging
Programming model
  Processors
  Memory
  Files
  Communications
What happens under the covers

Programming on BG/L

A single application program image
Running on tens of thousands of compute nodes
Communicating via message passing
Each image has its own copy of
  Memory
  File descriptors

Programming on BG/L

A “job” is encapsulated in a single host-side process
A merge point for compute node stdout streams

A control point for




Signaling (ctl-c, kill, etc)
Debugging (attach, detach)
Termination (exit status collection and summary)
4
Programming on BG/L

Cross compile the source code
Place executable onto BG/L machine’s shared filesystem
Run it
  “blrun <job information> <program name> <args>”
  Stdout of all program instances appears as stdout of blrun
  Files go to user-specified directory on shared filesystem
  blrun terminates when all program instances terminate
  Killing blrun kills all program instances

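To make this concrete, here is a minimal sketch of the kind of program that runs under this model: an ordinary MPI program in C, cross compiled on the workstation and launched with blrun. Nothing BG/L-specific is assumed; each instance simply reports its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which instance am I?        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances in total */

        /* Each instance prints one line; blrun merges them all on its stdout. */
        printf("hello from instance %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Run through blrun, the job produces one such line per program instance, all appearing on blrun’s own stdout.
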
Compiling and Running on BG/L

[diagram: sources are cross compiled with cross-tools on the workstation; programs and datafiles are placed on the shared filesystem seen by the BG/L machine, which also has a local filesystem; stdout flows back to the workstation]

Programming Models

“Coprocessor model”
  64k instances of a single application program
  each has 255M address space
  each with two threads (main, coprocessor)
    non-coherent shared memory
“Virtual node model”
  128k instances
  127M address space
  one thread (main)

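One way to read these numbers (an inference from the slide, not something it states): with 256 MB of memory per compute node and roughly 1 MB kept by the kernel per image, the single image of the coprocessor model gets about 256 - 1 = 255 MB, while the virtual node model splits the node between two images, (256 - 2) / 2 = 127 MB each.
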
Programming Model

Does a job behave like
  A group of processes?
  Or a group of threads?
A little bit of each

A process group?

Yes
  Each program instance has its own
    Memory
    File descriptors
No
  Can’t communicate via mmap, shmat
  Can’t communicate via pipes or sockets
  Can’t communicate via signals (kill)

A thread group?

Yes
  Job terminates when
    All program instances terminate via exit(0)
    Any program instance terminates
      Voluntarily, via exit(nonzero)
      Involuntarily, via uncaught signal (kill, abort, segv, etc)
No
  Each program instance has its own set of file descriptors
  Each has its own private memory space

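A small sketch of what these termination rules mean in practice. The input-file check is made up for illustration; the point is that a nonzero exit from any single instance ends the whole job, much as an uncaught fault would, while a normal end requires every instance to exit 0.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        FILE *f = fopen("input.dat", "r");      /* illustrative setup step */
        if (f == NULL) {
            /* Thread-group behavior: one instance exiting nonzero takes the
               whole job down, and blrun collects and reports the status. */
            fprintf(stderr, "instance %d: cannot open input.dat\n", rank);
            exit(1);
        }
        fclose(f);

        /* ... real work would go here ... */

        MPI_Finalize();
        return 0;   /* the job ends normally only when every instance exits 0 */
    }
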
Compilers and libraries

GNU C, Fortran, and C++ compilers can be used with BG/L, but they do not exploit the 2nd FPU
IBM xlf/xlc compilers have been ported to BG/L, with code generation and optimization features for the dual FPU
Standard glibc library
MPI for communications

System calls

Traditional ANSI + “a little” POSIX
  I/O
    Open, close, read, write, etc
  Time
    Gettimeofday, etc
  Signal catchers
    Synchronous (sigsegv, sigbus, etc)
    Asynchronous (timers and hardware events)

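A sketch of the kind of signal-catcher code this implies, using only standard calls: a synchronous catcher for SIGSEGV and an asynchronous one driven by an interval timer. Whether setitimer is in the exact subset CNK accepts is not stated on the slide, so treat the timer half as an assumption.

    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Synchronous catcher: runs if the program touches a bad address. */
    static void on_segv(int sig)
    {
        (void)sig;
        static const char msg[] = "caught SIGSEGV, exiting\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);                       /* async-signal-safe exit */
    }

    /* Asynchronous catcher: runs when the interval timer fires. */
    static volatile sig_atomic_t ticks;
    static void on_alarm(int sig)
    {
        (void)sig;
        ticks++;                        /* e.g. drive periodic sampling */
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_segv;
        sigaction(SIGSEGV, &sa, NULL);  /* synchronous event */

        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);  /* asynchronous event */

        struct itimerval it = { {1, 0}, {1, 0} };   /* fire once per second */
        setitimer(ITIMER_REAL, &it, NULL);

        /* ... application work ... */
        return 0;
    }
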
System calls

No “unix stuff”
  fork, exec, pipe
  mount, umount, setuid, setgid
No system calls needed to access most hardware
  Tree and torus fifos
  Global OR
  Mutexes and barriers
  Performance counters
Mantra
  Keep the compute nodes simple
  Kernel stays out of the way and lets the application program run

Software Stack in BG/L Compute Node

CNK controls all access to hardware, and enables bypass for application use
User-space libraries and applications can directly access torus and tree through bypass
As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy
[stack diagram, top to bottom: Application code / User-space libraries / CNK and Bypass / BG/L ASIC]

What happens under the covers?

The machine
The job allocation, launch, and control system
The machine monitoring and control system

The machine

Nodes
  IO nodes
  Compute nodes
  Link nodes
Communications networks
  Ethernet
  Tree
  Torus
  Global OR
  JTAG

The IO nodes

1024 nodes
talk to outside world via ethernet
talk to inside world via tree network
not connected to torus
embedded linux kernel
purpose is to run
  network filesystem
  job control daemons

The compute nodes

64k nodes, each with 2 cpus and 4 fpus
application programs execute here
custom kernel
  non-preemptive
  kernel and application share same address space
  application program has full control of all timing issues
  kernel is memory protected
kernel provides
  program load / start / debug / termination
  file access
  all via message passing to IO nodes

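From the application’s point of view, that file access looks like ordinary glibc I/O; the forwarding of each call to the IO node happens underneath. A minimal sketch, with the output directory name made up for illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        char path[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One output file per instance on the shared filesystem.  The
           open/write/close below are ordinary glibc calls; the compute node
           kernel ships them to the IO node that serves this compute node. */
        snprintf(path, sizeof path, "/shared/results/out.%d", rank);
        FILE *f = fopen(path, "w");
        if (f != NULL) {
            fprintf(f, "instance %d finished\n", rank);
            fclose(f);
        }

        MPI_Finalize();
        return 0;
    }
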
The link nodes

Signal routing, no computation
Stitch together cards and racks of io and compute nodes into “blocks” suitable for running independent jobs
Isolate each block’s tree, torus, and global OR network

Machine configuration

[diagram: the machine manager on the host reaches into the core over the JTAG/ethernet control network to configure the link nodes]

Kernel booting and monitoring

[diagram: the machine manager on the host uses the JTAG/ethernet control network to boot and monitor the core: 1024 ciod daemons (one per IO node) and the cnk kernels on the compute nodes, 64 per IO node]

Job execution

[diagram: blrun on the host talks over TCP/ethernet to the 1024 ciod daemons in the core; each ciod controls its 64 cnk compute nodes over the tree network]

Blue Gene/L System Software Architecture

[diagram: 1024 psets (Pset 0 .. Pset 1023), each an I/O node running Linux and ciod plus 64 compute nodes (C-Node 0 .. C-Node 63) running CNK, connected by the tree and torus networks; an Ethernet links the psets to front-end nodes, a console, file servers, a scheduler, and the service node, which runs MMCS and DB2 and controls the hardware through IDo chips over JTAG and I2C]

Conclusions

BG/L system software stack has
  Custom solution (CNK) on compute nodes for high performance
  Linux solution on I/O nodes for flexibility and functionality
  MPI as default programming model
BG/L system software must scale to very large machines
  Hierarchical organization for management
  Flat organization for programming
  Mixed conventional/special-purpose operating systems
Many challenges ahead, particularly in performance, scalability and reliability