MPI in uClinux on Microblaze
Neelima Balakrishnan
Khang Tran
05/01/2006
Project Proposal
 Port uClinux to work on Microblaze
 Add MPI implementation on top of uClinux
 Configure NAS parallel benchmarks and port
them to work on RAMP
What is Microblaze?
 Soft core processor,
implemented using general
logic primitives
 32-bit Harvard RISC
architecture
 Supported in the Xilinx
Spartan and Virtex series of
FPGAs
 Customizability of the core makes
porting challenging while opening up
new possibilities for kernel
configuration
Components
 uClinux - kernel v2.4
 MPICH2 - portable, high performance
implementation of the entire MPI-2
standard
 Communication via different channels: sockets, shared memory, etc.
 In the MPI port for Microblaze, communication
is over FSL (Fast Simplex Link)
Components (contd.)
 NASPB v2.4 - MPI-based source
code implementations written
and distributed by NAS
 5 kernels
 3 pseudo-applications
Porting uClinux to Microblaze
 Done by Dr. John Williams - Embedded Systems
group, University of Queensland in Brisbane,
Australia
 Part of their reconfigurable computing research
program; the work is still ongoing
 http://www.itee.uq.edu.au/~jwilliams/mblazeuclinux
Challenge in porting uClinux to
Microblaze
 Linux derivative for microprocessors that lack a
memory management unit (MMU)
 No memory protection
 No virtual memory
 For most user applications, the fork() system
call is unavailable (see the vfork() sketch below)
 The malloc() call needs to be modified
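A minimal sketch in C of what the missing fork() implies: on MMU-less uClinux, process creation typically uses vfork() followed immediately by exec(), since fork() would require duplicating an address space, which cannot be done without an MMU. The echo command spawned here is purely illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* fork() is unavailable without an MMU; vfork() shares the parent's
     * address space, so the child must call exec() or _exit() right away
     * and must not modify data or return before doing so. */
    pid_t pid = vfork();
    if (pid < 0) {
        perror("vfork");
        return EXIT_FAILURE;
    }
    if (pid == 0) {
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);  /* reached only if exec fails */
    }

    /* The parent resumes once the child has exec'd or exited. */
    int status;
    waitpid(pid, &status, 0);
    return EXIT_SUCCESS;
}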
MPI implementation
 MPI – Message Passing Interface
 Standard API used to create parallel
applications
 Designed primarily to support the SPMD (single
program multiple data) model (see the sketch after this slide)
 Advantages over older message-passing libraries
  Portability
  Speed, since each implementation is optimized for the
hardware it runs on
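As a minimal sketch of the SPMD model, the program below is what every MPI process runs; each process differentiates its behavior only by its rank. It assumes an MPI implementation such as MPICH2 (compile with mpicc, launch with mpiexec).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    /* Single program, multiple data: same code, different role per rank. */
    if (rank == 0)
        printf("running with %d processes\n", size);
    else
        printf("rank %d handling its share of the work\n", rank);

    MPI_Finalize();
    return 0;
}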
Interactions between Application
and MPI
[Diagram: the initiating application and the applications on other
processors each sit on an MPI interface and MPI process manager,
connected to one another through a communication channel.]
NAS parallel benchmarks
 Set of 8 programs intended to aid in evaluating
the performance of parallel supercomputers
 Derived from computational fluid dynamics
(CFD) applications
 5 kernels
 3 pseudo-applications
 Used NPB2.4 version – MPI-based source code
implementation
Phases
 Studied uClinux and found the initial port
done for Microblaze
 Latest kernel (2.4) and distribution from
uClinux.org
 Successful compilation for Microblaze
architecture
 MPI: chose MPICH2 out of the many available MPI implementations
 Investigated the MPICH2 implementation
available from Argonne National Laboratory
 Encountered challenges in porting MPI onto
uClinux
Challenges in porting MPI to
uClinux
 Use of fork and a complex state machine
 The default process manager for Unix platforms is MPD, written in
Python, which uses a wrapper to call fork
 A simple fork -> vfork substitution is not possible, as the function is
called deep inside other functions and requires a lot of stack unwinding
Alternate Approaches
 Port SMPD, which is written in C
  This will involve a complex state machine and stack
unwinding after the fork
 Use pthreads (see the sketch after this list)
  Might involve a lot of reworking of the code, as the current
implementation does not use pthreads
  Need to ensure thread safety
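A minimal sketch in C of the pthreads alternative mentioned above: worker tasks are started with pthread_create() instead of fork(), which avoids process creation entirely but means all workers share one address space (hence the thread-safety concern). The worker function and thread count here are illustrative only and are not part of MPICH2.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

/* Hypothetical stand-in for work that MPD/SMPD would have forked off. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("worker %d running as a thread, not a forked process\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NWORKERS];
    int ids[NWORKERS];

    for (int i = 0; i < NWORKERS; i++) {
        ids[i] = i;
        /* Threads share global state, so any shared data needs locking. */
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}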
NAS Parallel Benchmark
 Used NAS PB v2.4
 Compiled and executed it on a desktop and
Millennium Cluster
 Obtained information about
  MOPS
  Type of operation
  Execution time
  Number of nodes involved
  Number of processes and iterations
NAS PB simulation result
(Millennium cluster, Class A)
Simulation result (cont.)
Estimated statistics for the floating
point group
 The 4 benchmarks that use floating-point ops heavily
are BT, CG, MG, and SP
 Very few fp comparison ops in any of them
 BT (Block Tridiagonal): nearly all fp ops are add,
subtract, and multiply; about 5% of all ops are
division
 CG (Conjugate Gradient) has the highest share of
sqrt ops, about 30%; add/multiply is about 60%,
divide about 10%
 MG (Multigrid): about 5% is sqrt, 20% is division;
the rest is add, subtract, and multiply
 SP (Scalar Pentadiagonal): almost all ops are
add, about 10% are division
Floating Point Operation
Frequency
[Pie charts: per-benchmark floating-point operation frequency for SP, MG, BT, and CG.]
Most frequently used MPI
functions in NASPB v2.4
MPI Function        Percent Frequency
MPI_IRECV           14.4%
MPI_SEND            10.6%
MPI_ISEND           10.2%
MPI_BCAST            9.7%
MPI_WAIT             9.7%
MPI_ALLREDUCE        7.2%
MPI_BARRIER          7.2%
MPI_ABORT            4.7%
MPI_COMM_SIZE        4.2%
MPI_WAITALL          3.4%
MPI_FINALIZE         3.0%
MPI_COMM_RANK        2.5%
MPI_INIT             2.5%
MPI_REDUCE           2.5%
MPI_ALLTOALL         1.7%
MPI_COMM_DUP         1.7%
MPI_COMM_SPLIT       1.7%
MPI_RECV             1.7%
MPI_WTIME            0.8%
MPI_ALLTOALLV        0.4%
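Since nonblocking receives (MPI_IRECV), blocking sends, and MPI_WAIT dominate the table above, the following C sketch shows that pattern in isolation: two ranks each post an MPI_Irecv, send with MPI_Send, then complete the receive with MPI_Wait. The buffer names and tag are illustrative and not taken from NASPB; run with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 1) {
        int peer = 1 - rank;              /* rank 0 pairs with rank 1 */
        int sendbuf = rank, recvbuf = -1;
        MPI_Request req;

        /* Post the receive first so the exchange cannot deadlock. */
        MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);
        MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recvbuf, peer);
    }

    MPI_Finalize();
    return 0;
}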
Observations about NASPB
 NASPB suite – 6 out of 8 benchmarks are
predictive of parallel performance
 EP (Embarrassingly Parallel) – little/negligible
communication between processors
 IS (Integer Sort) – high communication overhead
Project status
 Compiled uClinux and put it on Microblaze
 Worked on porting MPI, but it is not yet
complete
 Compiled and executed NASPB on desktop and
Millennium (which currently uses 8 computing
nodes)