Programming the IBM Power3 SP

Programming the IBM Power3 SP
Eric Aubanel
Advanced Computational Research Laboratory
Faculty of Computer Science, UNB
Advanced Computational
Research Laboratory
• High Performance Computational Problem-Solving and Visualization Environment
• Computational Experiments in multiple
disciplines: CS, Science and Eng.
• 16-Processor IBM SP3
• Member of C3.ca Association, Inc.
(http://www.c3.ca)
Advanced Computational
Research Laboratory
www.cs.unb.ca/acrl
• Virendra Bhavsar, Director
• Eric Aubanel, Research Associate &
Scientific Computing Support
• Sean Seeley, System Administrator
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
POWER chip: 1990 to 2003
1990
– Performance Optimized with Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate = 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB
data cache
POWER chip: 1990 to 2003
1991: SP1
– IBM’s first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5
MHz) connected by internal switch network
– Parallel Environment & system software
POWER chip: 1990 to 2003
1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher bandwidth switch for
larger systems
POWER chip: 1990 to 2003
1993: POWERPC
– Supports SMP
1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz
POWER chip: 1990 to 2003
Feb. ‘99: POWER3
– Combined P2SC & POWERPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including L2 cache of 1-16 MB
– Instruction & data prefetch
POWER3+ chip: Feb. 2000
Winterhawk II - 375 MHz:
• 4-way SMP
• 2 MULT/ADD - 1500 MFLOPS
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 1.6 GB/s Memory Bandwidth
• 6 GFLOPS/Node
Nighthawk II - 375 MHz:
• 16-way SMP
• 2 MULT/ADD - 1500 MFLOPS
• 64 KB Level 1 - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 - 45 nsec / 6.4 GB/sec
• 14 GB/s Memory Bandwidth
• 24 GFLOPS/Node
The Clustered SMP
ACRL’s SP: Four 4-way SMPs
Each node has its own copy
of the O/S
Processors on the node are
closer than those on different
nodes
Power3 Architecture
Power4 - 32 way
[Diagram: sixteen 2-processor chips, each pair with private L1 and L2 caches, connected by GX buses]
• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 Gflops total peak
Going to NUMA
NUMA up to 256 processors - 1.1 Teraflops
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Uni-processor Optimization
• Compiler options:
– start with -O3 -qstrict, then -O3, -qarch=pwr3
• Cache re-use (see the loop-ordering sketch below)
• Take advantage of superscalar architecture
– give enough operations per load/store
• Use ESSL - its routines are already fully optimized for the architecture
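As a minimal sketch of cache re-use (array names and sizes are hypothetical): Fortran stores arrays column-major, so running the inner loop over the first index gives unit-stride access and re-uses each 128-byte cache line before it is evicted.

      real*8 a(1000,1000), b(1000,1000)
      integer i, j
      b = 1.0d0
! Poor re-use: an inner loop over j would jump one whole column (8000 bytes) per iteration
! Better re-use: the inner loop walks down a column with unit stride
      do j = 1, 1000
         do i = 1, 1000
            a(i,j) = 2.0d0 * b(i,j)
         end do
      end do
      end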
Memory Access Times
                Memory to L2 or L1     L2 to L1                    L1 to Registers
Width           16 bytes / 2 cycles    32 bytes / cycle            2 x 8 bytes / cycle
Rate            1.6 GB/s               6.4 GB/s                    3.2 GB/s
Latency         35 cycles (approx.)    6 to 7 cycles (approx.)     1 cycle

Cache: 128-byte cache line
L2 cache: 4-way set-associative, 8 MB total (four 2 MB banks)
L1 cache: 128-way set-associative, 64 KB
How to Monitor Performance?
• IBM’s hardware monitor: HPMCOUNT
– Uses hardware counters on chip
– Cache & TLB misses, fp ops, load-stores, …
– Beta version
– Available soon on ACRL’s SP
HPMCOUNT sample output
      real*8 a(256,256), b(256,256), c(256,256)
      common a,b,c
      do j=1,256
         do i=1,256
            a(i,j)=b(i,j)+c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)              :   66543
Average number of loads per TLB miss  :   5.916
Total loads and stores                :   0.525 M
Instructions per load/store           :   2.749
Cycles per instruction                :   2.378
Instructions per cycle                :   0.420
Total floating point operations       :   0.066 M
Hardware float point rate (Mflop/sec) :   2.749
HPMCOUNT sample output
      real*8 a(257,256), b(257,256), c(257,256)
      common a,b,c
      do j=1,256
         do i=1,257
            a(i,j)=b(i,j)+c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)              :   1634
Average number of loads per TLB miss  :   241.876
Total loads and stores                :   0.527 M
Instructions per load/store           :   2.749
Cycles per instruction                :   1.271
Instructions per cycle                :   0.787
Total floating point operations       :   0.066 M
Hardware float point rate (Mflop/sec) :   3.525

Padding the leading dimension from 256 to 257 breaks the power-of-two stride between the arrays, so far fewer accesses collide in the TLB and cache, and the flop rate improves.
ESSL
• Linear algebra, Fourier & related
transforms, sorting, interpolation,
quadrature, random numbers
• Fast!
– 560x560 real*8 matrix multiply (see the sketch below)
• Hand coding: 19 Mflops
• dgemm: 1.2 GFlops
• Parallel (threaded and distributed) versions
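A minimal sketch of calling the BLAS-compatible DGEMM supplied by ESSL (the matrix size is from the slide; the initial values and the link step against ESSL are assumptions):

      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
! C = alpha*A*B + beta*C, computed by ESSL's DGEMM
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      print *, ' c(1,1) = ', c(1,1)
      end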
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
ACRL’s IBM SP
• 4 Winterhawk II nodes
– 16 processors
• Each node has:
– 1 GB RAM
– 9 GB (mirrored) disk
– Switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with six 18.2 GB disks
IBM Power3 SP Switch
• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bi-directional
• 1.2 µsec latency
General Parallel File System
[Diagram: Node 1 runs the GPFS Server (Application / GPFS Server / RVSD/VSD); Nodes 2-4 run GPFS Clients (Application / GPFS Client / RVSD/VSD); all nodes are connected by the SP Switch]
ACRL Software
• Operating System: AIX 4.3.3
• Compilers
– IBM XL Fortran 7.1 (HPF not yet installed)
– VisualAge C for AIX, Version 5.0.1.0
– VisualAge C++ Professional for AIX, Version 5.0.0.0
– IBM VisualAge Java - not yet installed
• Job Scheduler: LoadLeveler 2.2
• Parallel Programming Tools
– IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 )
• Visualization: OpenDX (not yet installed)
• E-Commerce software (not yet installed)
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Parallel Computing?
• Solve large problems in reasonable time
• Many algorithms are inherently parallel
– image processing, Monte Carlo
– simulations (e.g. CFD)
• High performance computers have parallel
architectures
– Commercial off-the-shelf (COTS) components
• Beowulf clusters
• SMP nodes
– Improvements in network technology
[Figure: NRL Layered Ocean Model, Naval Research Laboratory, running on an IBM Winterhawk II SP]
Parallel Computational Models
• Data Parallelism
– Parallel program looks like serial program
• parallelism in the data
– Vector processors
– HPF
Parallel Computational Models
[Diagram: two processes exchanging messages via Send and Receive]
• Message Passing (MPI)
– Processes have only local memory but can
communicate with other processes by sending &
receiving messages
– Data transfer between processes requires operations to
be performed by both processes
– Communication network not part of computational
model (hypercube, torus, …)
Parallel Computational Models
[Diagram: multiple processes sharing one address space]
• Shared Memory (threads)
– P(osix)threads
– OpenMP: higher level standard
Parallel Computational Models
[Diagram: a process using Put and Get to access another process's memory]
• Remote Memory Operations
– “One-sided” communication
• MPI-2, IBM’s LAPI
– One process can access the memory of another without
the other’s participation, but does so explicitly, not the
same way it accesses local memory
Parallel Computational Models
[Diagram: several nodes, each with processes sharing an address space, connected by a network]
• Combined: Message Passing & Threads
– Driven by clusters of SMPs
– Leads to software complexity!
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Message Passing Interface
• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - IBM support
• MPI 2.0 in 1997
– Includes 1.1 but adds new features
• MPI-IO
• One-sided communication
• Dynamic processes
Advantages of MPI
• Universality
• Expressivity
– Well suited to formulating a parallel algorithm
• Ease of debugging
– Memory is local
• Performance
– Explicit association of data with process allows
good use of cache
MPI Functionality
• Several modes of point-to-point message passing
– blocking (e.g. MPI_SEND)
– non-blocking (e.g. MPI_ISEND; see the sketch below)
– synchronous (e.g. MPI_SSEND)
– buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
– e.g. MPI_REDUCE, MPI_BARRIER
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
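A minimal sketch of the non-blocking mode (the buffer names and the two-process exchange are illustrative, not from the slides): the receive is posted first, the send is posted next, neither call blocks, and MPI_WAIT completes both.

      Program NonBlocking
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr, Req_S, Req_R
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_IRECV ( B, Nx, MPI_REAL, Other_Id, 0, MPI_COMM_WORLD, Req_R, Ierr )
      call MPI_ISEND ( A, Nx, MPI_REAL, Other_Id, 0, MPI_COMM_WORLD, Req_S, Ierr )
! Useful computation could overlap with the communication here
      call MPI_WAIT ( Req_R, Status, Ierr )
      call MPI_WAIT ( Req_S, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      end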
Simple MPI Example
[Diagram: the process with My_Id 0 prints "This is from MPI process number 0"; processes with other My_Id values print "This is from MPI processes other than 0"]
Simple MPI Example
      Program Trivial
      implicit none
      include "mpif.h"   ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )
      print *, ' My_Id, Numb_of_Procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( Ierr )   ! bad things happen if you forget Ierr
      stop
      end
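To build and run this example on the SP, the usual IBM Parallel Environment workflow (assumed here, not shown on the slides) is to compile with the mpxlf wrapper and launch with poe, giving the number of tasks with the -procs option.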
MPI Example with send/recv
[Diagram: processes 0 and 1 each send an array to the other and receive one back]
MPI Example with send/recv
      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end
What Will Happen?
/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);
MPI Message Passing Modes
Mode           Protocol
Ready          Ready
Standard       Eager (message <= eager limit)
               Rendezvous (message > eager limit)
Synchronous    Rendezvous
Buffered       Buffered

Default eager limit on the SP is 4 KB (can be set up to 64 KB)
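One way to make the exchange two slides back independent of the eager limit is MPI_SENDRECV, which pairs the send and the receive in a single call so neither process can block the other. A minimal sketch, reusing the names from the earlier example:

      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SENDRECV ( A, Nx, MPI_REAL, Other_Id, 0, B, Nx, MPI_REAL, Other_Id, 0, MPI_COMM_WORLD, Status, Ierr )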
MPI Performance Visualization
• ParaGraph
– Developed by University of Illinois
– Graphical display system for visualizing
behaviour and performance of MPI programs
Message Passing on SMP
[Diagram: MPI_SEND copies the data to send into a buffer, the memory crossbar or switch moves it, and MPI_RECV copies it out of a buffer as the received data]
export MP_SHARED_MEMORY=yes|no
Shared Memory MPI
MP_SHARED_MEMORY=<yes|no>
                                       Latency (µsec)   Bandwidth (Mbytes/sec)
– between 2 nodes:                     24               133
– same node (MP_SHARED_MEMORY=no):     30               80
– same node (MP_SHARED_MEMORY=yes):    10               270
Message Passing off Node
MPI Across all the processors
Many more messages going
through the fabric
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
OpenMP
• 1997: group of hardware and software vendors
announced their support for OpenMP, a new API
for multi-platform shared-memory programming
(SMP) on UNIX and Microsoft Windows NT
platforms.
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP
• All processors can access all the memory in
the parallel system
• Parallel execution is achieved by generating
threads which execute in parallel
• Overhead for SMP parallelization is large (100-200 µsec) - the parallel work construct must be large enough to overcome the overhead
OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread (a minimal sketch follows)
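In the sketch below (a hypothetical program), each thread in the forked team prints its own number; only the master thread runs before the FORK and after the JOIN.

      Program ForkJoin
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
      print *, ' master thread before the FORK'
!$OMP PARALLEL
      print *, ' hello from thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
!$OMP END PARALLEL
      print *, ' master thread after the JOIN'
      end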
OpenMP
How is OpenMP typically used?
• OpenMP is usually used to parallelize
loops:
– Find your most time consuming loops.
– Split them up between threads.
• Better scaling can be obtained using
OpenMP parallel regions, but can be
tricky!
OpenMP Loop Parallelization
!$OMP PARALLEL DO
      do i=0,ilong
         do k=1,kshort
            ...
         end do
      end do

#pragma omp parallel for
for(i=0; i <= ilong; i++)
   for(k=1; k <= kshort; k++) {
      ...
   }
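With IBM XL Fortran these directives are typically enabled by compiling with the thread-safe invocation xlf_r and the -qsmp=omp option, and the thread count is set with the standard OMP_NUM_THREADS environment variable (assumed usage; check the installed compiler documentation).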
Variable Scoping
• Most difficult part of Shared Memory
Parallelization
– What memory is Shared
– What memory is Private - each processor has its own
copy
• Compare MPI: all variables are private
• Variables are shared by default, except:
– loop indices
– scalars that are set and then used in the loop (see the scoping sketch below)
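A minimal scoping sketch (variable names are illustrative): the loop index and the scalar temporary are made private, while the arrays stay shared.

!$OMP PARALLEL DO PRIVATE(i, tmp) SHARED(a, b, n)
      do i = 1, n
         tmp = 2.0d0 * b(i)     ! set and then used inside the loop, so it must be private
         a(i) = tmp + b(i)
      end do
!$OMP END PARALLEL DO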
How Does Sharing Work?
Shared x, initially 0

THREAD 1:                        THREAD 2:
increment(x) { x = x + 1; }      increment(x) { x = x + 1; }

THREAD 1:                        THREAD 2:
10 LOAD A, (x address)           10 LOAD A, (x address)
20 ADD A, 1                      20 ADD A, 1
30 STORE A, (x address)          30 STORE A, (x address)

Result could be 1 or 2 - synchronization is needed (see the sketch below)
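A sketch of the synchronization this example needs (written with OpenMP in Fortran, as elsewhere in this talk): an ATOMIC directive makes each thread's read-modify-write of the shared x indivisible, so with two threads the result is always 2.

!$OMP PARALLEL
!$OMP ATOMIC
      x = x + 1      ! the load/add/store sequence now completes without interference
!$OMP END PARALLEL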
False Sharing
[Diagram: memory blocks 0-7; one block at a time occupies a cache line, selected by its address tag]
Processor 1:
!$OMP PARALLEL DO
      do I=1,20
         A(I)= ...
      enddo

Processor 2: say A(1)-A(5) starts on a cache line; then some of A(6)-A(10) will be on that first cache line, so it won't be accessible until the first thread has finished (one remedy is sketched below).
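One common remedy (a sketch, not from the slides): make each thread's chunk of iterations a multiple of the cache line, so the blocks written by different threads start on different 128-byte lines (16 real*8 elements per line; the array is assumed to start on a line boundary).

!$OMP PARALLEL DO SCHEDULE(STATIC, 16)
      do I = 1, 20
         A(I) = 2.0d0 * I
      end do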
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Hybrid MPI-OpenMP?
• To optimize performance on “mixed-mode”
hardware like the SP
• MPI is used for “inter-node”
communication, and OpenMP is used for
“intra-node” communication
– threads have lower latency
– threads can alleviate network contention of a
pure MPI implementation
Hybrid MPI-OpenMP?
• Unless you are forced against your will, for the
hybrid model to be worthwhile:
– There has to be obvious parallelism to exploit
– The code has to be easy to program and maintain
• easy to write bad OpenMP code
– It has to promise to perform at least as well as the
equivalent all-MPI program
• Experience has shown that converting working
MPI code to a hybrid model rarely results in better
performance
– especially true with applications having a single level of
parallelism
Hybrid Scenario
• Thread the computational portions of the code that exist between MPI calls (sketched below)
• MPI calls are “single-threaded” and therefore use only a single CPU
• Assumes:
– the application has two natural levels of parallelism
– or that, in splitting an MPI code with one level of parallelism across threads, communication between the resulting threads is little or none
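A minimal sketch of this scenario (the array name, sizes and the reduction are hypothetical): each MPI task threads its local computation with OpenMP and makes its MPI call outside the parallel region.

      Program Hybrid
      implicit none
      include "mpif.h"
      integer Ierr, My_Id, Nlocal, i
      parameter ( Nlocal = 100000 )
      real*8 A ( Nlocal ), Local_Sum, Global_Sum
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      A = 1.0d0
      Local_Sum = 0.0d0
! Computation between MPI calls is threaded with OpenMP
!$OMP PARALLEL DO REDUCTION(+:Local_Sum)
      do i = 1, Nlocal
         Local_Sum = Local_Sum + A(i)
      end do
! The MPI call is made by the master thread only, outside the parallel region
      call MPI_REDUCE ( Local_Sum, Global_Sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, Ierr )
      if ( My_Id .eq. 0 ) print *, ' global sum = ', Global_Sum
      call MPI_FINALIZE ( Ierr )
      end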
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
MPI-IO
[Diagram: data moves between each process's memory and a shared file]
• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See “Using MPI-2”, by Gropp et al. (MIT Press)
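A minimal MPI-IO sketch (the file name and sizes are illustrative): each process writes its own block of one shared file at an offset computed from its rank.

      Program WriteShared
      implicit none
      include "mpif.h"
      integer Ierr, My_Id, Fh, Nx
      integer Status ( MPI_STATUS_SIZE )
      parameter ( Nx = 100 )
      integer ( kind = MPI_OFFSET_KIND ) Offset
      real*8 A ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      A = My_Id
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'data.out', MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, Fh, Ierr )
      Offset = My_Id * Nx * 8                  ! 8 bytes per real*8 element
      call MPI_FILE_WRITE_AT ( Fh, Offset, A, Nx, MPI_DOUBLE_PRECISION, Status, Ierr )
      call MPI_FILE_CLOSE ( Fh, Ierr )
      call MPI_FINALIZE ( Ierr )
      end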
Conclusion
• Don’t forget uni-processor optimization
• If you choose one parallel programming
API, choose MPI
• Mixed MPI-OpenMP may be appropriate in
certain cases
– More work needed here
• Remote memory access model may be the
answer