BlueGene/L System Software Derek Lieber IBM T. J. Watson Research Center February 2004
Topics
Programming environment
Compilation
Execution
Debugging
Programming model
Processors
Memory
Files
Communications
What happens under the covers
2
Programming on BG/L
A single application program image
Running on tens of thousands of compute nodes
Communicating via message passing
Each image has its own copy of
Memory
File descriptors
3
Programming on BG/L
A “job” is encapsulated in a single host-side process
A merge point for compute node stdout streams
A control point for
Signaling (ctl-c, kill, etc)
Debugging (attach, detach)
Termination (exit status collection and summary)
4
Programming on BG/L
Cross compile the source code
Place executable onto BG/L machine’s shared filesystem
Run it
“blrun <job information> <program name> <args>”
Stdout of all program instances appears as stdout of blrun
Files go to user-specified directory on shared filesystem
blrun terminates when all program instances terminate
Killing blrun kills all program instances
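The launch behavior listed above (merged stdout, launcher ends when all instances end) can be tried out on an ordinary workstation. This is only a rough sketch: it starts a few local Python processes rather than BG/L compute nodes, and `run_job` and `PROGRAM` are hypothetical names, not part of blrun. Output is collected one instance at a time here, whereas a real merge point would likely interleave the streams.

```python
import subprocess
import sys


def run_job(num_instances, program_source):
    """Start num_instances copies of a program, merge their stdout,
    and return (merged_lines, job_status)."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", program_source, str(rank)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for rank in range(num_instances)
    ]
    merged = []
    statuses = []
    for proc in procs:
        out, _ = proc.communicate()  # blocks until this instance exits
        merged.extend(out.splitlines())
        statuses.append(proc.returncode)
    # The launcher returns only after every instance has terminated.
    job_status = 0 if all(s == 0 for s in statuses) else 1
    return merged, job_status


# Hypothetical stand-in for the application program image.
PROGRAM = "import sys; print('hello from instance', sys.argv[1])"

lines, status = run_job(4, PROGRAM)
for line in lines:
    print(line)
```

All four instances write to what looks, from the caller's side, like a single stdout stream.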
5
Compiling and Running on BG/L
[Diagram: Workstation, BG/L Machine, local filesystem, shared filesystem; labels: sources, cross-tools, program, programs + datafiles, datafiles, stdout]
6
Programming Models
“Coprocessor model”
64k instances of a single application program
each has 255M address space
each with two threads (main, coprocessor)
non-coherent shared memory
“Virtual node model”
128k instances
127M address space
one thread (main)
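The instance counts in the two models follow from the node count. The sketch below assumes the figure given later in the deck (64k compute nodes, each with 2 CPUs) and reads the virtual node model as one instance per CPU; the variable names are illustrative only.

```python
NODES = 64 * 1024                     # "64k" compute nodes
CPUS_PER_NODE = 2

# Coprocessor model: one program instance per node.
coprocessor_instances = NODES

# Virtual node model: one program instance per CPU (assumed reading).
virtual_node_instances = NODES * CPUS_PER_NODE

# Per-instance address space in MB, as quoted on the slide.
COPROCESSOR_MB = 255
VIRTUAL_NODE_MB = 127

print(coprocessor_instances)          # 65536
print(virtual_node_instances)         # 131072
print(2 * VIRTUAL_NODE_MB)            # 254
```

Doubling the instance count roughly halves the address space each instance gets.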
7
Programming Model
Does a job behave like
A group of processes?
Or a group of threads?
A little bit of each
8
A process group?
Yes
Each program instance has its own
Memory
File descriptors
No
Can’t communicate via mmap, shmat
Can’t communicate via pipes or sockets
Can’t communicate via signals (kill)
9
A thread group?
Yes
Job terminates when
All program instances terminate via exit(0), or
Any program instance terminates
Voluntarily, via exit(!0)
Involuntarily, via uncaught signal (kill, abort, segv, etc)
No
Each program instance has own set of file descriptors
Each has own private memory space
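The termination rule above can be written as a small predicate. This is a sketch of the rule as stated on the slide, not of the actual control system; `job_finished` and its status encoding (None for "still running", nonzero for exit(!0) or death by signal) are assumptions made for illustration.

```python
def job_finished(statuses):
    """statuses holds one entry per program instance:
    None     -> instance is still running
    0        -> instance called exit(0)
    nonzero  -> instance called exit(!0) or died from an uncaught signal
    """
    # Any abnormal termination ends the whole job at once.
    if any(s is not None and s != 0 for s in statuses):
        return True
    # Otherwise the job ends only when every instance has exited cleanly.
    return all(s == 0 for s in statuses)


print(job_finished([0, 0, 0]))        # True
print(job_finished([0, None, 0]))     # False
print(job_finished([None, 1, None]))  # True
```

This is the thread-like part of the model: one failing instance takes the whole job down.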
10
Compilers and libraries
GNU C, Fortran, C++ compilers can be used with BG/L, but they do not exploit 2nd FPU
IBM xlf/xlc compilers have been ported to BG/L, with code generation and optimization features for dual FPU
Standard glibc library
MPI for communications
11
System calls
Traditional ANSI + “a little” POSIX
I/O: open, close, read, write, etc
Time: gettimeofday, etc
Signal catchers
Synchronous (sigsegv, sigbus, etc)
Asynchronous (timers and hardware events)
12
System calls
No “unix stuff”
fork, exec, pipe
mount, umount, setuid, setgid
No system calls needed to access most hardware
Tree and torus fifos
Global OR
Mutexes and barriers
Performance counters
Mantra
Keep the compute nodes simple
Kernel stays out of the way and lets the application program run
13
Software Stack in BG/L Compute Node
CNK controls all access to hardware, and enables bypass for application use
User-space libraries and applications can directly access torus and tree through bypass
As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy
[Diagram: stack, top to bottom: Application code; User-space libraries; CNK, with Bypass path; BG/L ASIC]
14
What happens under the covers?
The machine
The job allocation, launch, and control system
The machine monitoring and control system
15
The machine
Nodes
IO nodes
Compute nodes
Link nodes
Communications networks
Ethernet
Tree
Torus
Global OR
JTAG
16
The IO nodes
1024 nodes
talk to outside world via ethernet
talk to inside world via tree network
not connected to torus
embedded linux kernel
purpose is to run
network filesystem
job control daemons
17
The compute nodes
64k nodes, each with 2 cpus and 4 fpus
application programs execute here
custom kernel
non-preemptive
kernel and application share same address space
application program has full control of all timing issues
kernel is memory protected
kernel provides
program load / start / debug / termination
file access
all via message passing to IO nodes
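The idea that compute-node file access happens "all via message passing to IO nodes" can be sketched as follows. Everything here is hypothetical: the message fields, the class names, and the in-memory storage are invented for illustration, and a direct method call stands in for the tree network.

```python
import io


class IoNodeDaemon:
    """Stands in for the daemon on an IO node that performs the real file I/O."""

    def __init__(self):
        self.storage = {}     # path -> io.BytesIO, a stand-in for the shared filesystem
        self.open_files = {}  # (compute node id, fd) -> io.BytesIO
        self.next_fd = 3      # 0, 1, 2 are conventionally stdin/stdout/stderr

    def handle(self, msg):
        key = (msg["node"], msg.get("fd"))
        if msg["op"] == "open":
            fd = self.next_fd
            self.next_fd += 1
            backing = self.storage.setdefault(msg["path"], io.BytesIO())
            self.open_files[(msg["node"], fd)] = backing
            return {"result": fd}
        if msg["op"] == "write":
            return {"result": self.open_files[key].write(msg["data"])}
        if msg["op"] == "close":
            del self.open_files[key]
            return {"result": 0}
        return {"result": -1}  # unknown operation


class ComputeNodeKernel:
    """Implements file calls by shipping messages, never by touching a disk."""

    def __init__(self, node_id, daemon):
        self.node_id = node_id
        self.daemon = daemon  # direct call replaces the tree network in this sketch

    def _ship(self, **fields):
        return self.daemon.handle({"node": self.node_id, **fields})["result"]

    def open(self, path):
        return self._ship(op="open", path=path)

    def write(self, fd, data):
        return self._ship(op="write", fd=fd, data=data)

    def close(self, fd):
        return self._ship(op="close", fd=fd)


daemon = IoNodeDaemon()
kernel = ComputeNodeKernel(node_id=7, daemon=daemon)
fd = kernel.open("out.dat")
written = kernel.write(fd, b"hello")
kernel.close(fd)
print(written, daemon.storage["out.dat"].getvalue())
```

The compute-node side holds no filesystem state of its own in this sketch, which is one way to keep that kernel small.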
18
The link nodes
Signal routing, no computation
Stitch together cards and racks of io and compute nodes into “blocks” suitable for running independent jobs
Isolate each block’s tree, torus, and global OR network
19
Machine configuration
[Diagram: machine manager on the host, connected over jtag/ethernet to the link nodes in the core]
20
Kernel booting and monitoring
[Diagram: machine manager on the host, connected over jtag/ethernet to the core; the core shows 1024 ciod instances, each with 64 cnk instances]
21
Job execution
[Diagram: blrun processes on the host, connected over tcp/ethernet to 1024 ciod instances in the core; each ciod connected over the tree to 64 cnk instances]
22
Blue Gene/L System Software Architecture
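[Diagram: Front-end Nodes, File Servers, Console, and Service Node (MMCS, DB2, Scheduler) connected over Ethernet to Pset 0 … Pset 1023. Each pset contains one I/O node (I/O Node 0 … I/O Node 1023, running Linux and ciod) and compute nodes C-Node 0 … C-Node 63 (each running CNK), joined by the tree; compute nodes are also joined by the torus. Control path labels: Ethernet, IDo chip, JTAG, I2C]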
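The pset layout in the diagram (1024 I/O nodes, C-Node 0 through C-Node 63 in each pset) matches the node counts given earlier: 64k compute nodes divided by 1024 I/O nodes is 64. The sketch below assumes, purely for illustration, that compute nodes are numbered contiguously so that a pset can be found by integer division; the real numbering scheme is not given in the slides, and `pset_of` is a hypothetical helper.

```python
COMPUTE_NODES = 64 * 1024
IO_NODES = 1024
NODES_PER_PSET = COMPUTE_NODES // IO_NODES  # 64


def pset_of(global_node):
    """Return (pset index, C-Node index within the pset),
    assuming contiguous numbering of compute nodes."""
    if not 0 <= global_node < COMPUTE_NODES:
        raise ValueError("compute node index out of range")
    return divmod(global_node, NODES_PER_PSET)


print(NODES_PER_PSET)   # 64
print(pset_of(63))      # (0, 63)
print(pset_of(64))      # (1, 0)
print(pset_of(65535))   # (1023, 63)
```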
23
Conclusions
BG/L system software must scale to very large machines
Hierarchical organization for management
Flat organization for programming
Mixed conventional/special-purpose operating systems
BG/L system software stack has
Custom solution (CNK) on compute nodes for high performance
Linux solution on I/O nodes for flexibility and functionality
MPI as default programming model
Many challenges ahead, particularly in performance, scalability and reliability
24