MPI in uClinux on Microblaze
Neelima Balakrishnan
Khang Tran
05/01/2006
Project Proposal
Port uClinux to work on Microblaze
Add MPI implementation on top of uClinux
Configure NAS parallel benchmarks and port
them to work on RAMP
What is Microblaze?
Soft core processor,
implemented using general
logic primitives
32-bit Harvard RISC
architecture
Supported in the Xilinx
Spartan and Virtex series of
FPGAs
Customizability of the core makes porting
challenging, but also opens up many
possibilities for kernel configuration
Components
uClinux - kernel v2.4
MPICH2 - portable, high performance
implementation of the entire MPI-2
standard
Communication via different channels: sockets, shared memory, etc.
In the MPI port for Microblaze, communication is
over FSL (Fast Simplex Link)
Components (contd.)
NASPB v2.4 - MPI-based source
code implementations written
and distributed by NAS
5 kernels
3 pseudo-applications
Porting uClinux to Microblaze
Done by Dr. John Williams - Embedded Systems
group, University of Queensland in Brisbane,
Australia
Part of their reconfigurable computing research
program; the work is still ongoing
http://www.itee.uq.edu.au/~jwilliams/mblazeuclinux
Challenge in porting uClinux to
Microblaze
Linux derivative for microprocessors that lack a
memory management unit (MMU)
No memory protection
No virtual memory
For most user applications, the fork() system
call is unavailable (see the vfork() sketch below)
The malloc() function call needs to be modified
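A minimal sketch of the usual MMU-less workaround (our illustration, not code from the uClinux port): replace fork() with vfork() followed immediately by exec, since the child borrows the parent's address space until it execs or exits.

    /* Sketch only: spawn a child on MMU-less uClinux, where fork() is missing. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int spawn(char *const argv[])
    {
        pid_t pid = vfork();           /* child runs in the parent's address space */
        if (pid == 0) {
            execvp(argv[0], argv);     /* must exec (or _exit) immediately */
            _exit(127);                /* exec failed; never return into the parent */
        }
        if (pid < 0)
            return -1;
        int status;
        waitpid(pid, &status, 0);      /* parent resumes once the child has exec'd */
        return status;
    }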
MPI implementation
MPI – Message Passing Interface
Standard API used to create parallel
applications
Designed primarily to support the SPMD (single
program multiple data) model
Advantages over older message passing libraries
Portability
Speed, as each implementation is optimized for the
hardware it runs on
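A minimal SPMD sketch (our illustration, using the standard MPI C bindings): every process runs the same program and branches on its rank.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }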
Interactions between Application
and MPI
[Diagram: the initiating application uses its MPI interface and MPI process manager, which talk over a communication channel to the MPI interfaces and process managers of the applications on the other processors]
NAS parallel benchmarks
Set of 8 programs intended to aid in evaluating
the performance of parallel supercomputers
Derived from computational fluid dynamics
(CFD) applications
5 kernels
3 pseudo-applications
Used NPB2.4 version – MPI-based source code
implementation
Phases
Studied uClinux and found the initial port
done for Microblaze
Latest kernel (2.4) and distribution from
uClinux.org
Successful compilation for Microblaze
architecture
MPI - chose MPICH2 out of the many available MPI implementations
Investigated the MPICH2 implementation
available from Argonne National Laboratory
Encountered challenges in porting MPI onto
uClinux
Challenges in porting MPI to
uClinux
Use of fork() and a complex state machine
The default process manager for Unix platforms is MPD, written in
Python, which uses a wrapper to call fork
A simple fork -> vfork substitution is not possible, as the call sits deep
inside other functions and requires a lot of stack unwinding
Alternate Approaches
Port SMPD, which is written in C
This would still involve a complex state machine and stack
unwinding after the fork
Use pthreads (see the sketch below)
Might involve a lot of code rework, as the current
implementation does not use pthreads
Need to ensure thread safety
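A hedged sketch of the pthreads approach (our assumption; the name start_worker is hypothetical): the process manager starts a worker thread in the same address space instead of forking a child process, so shared state must be protected.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        const char *name = arg;            /* hypothetical per-worker argument */
        printf("worker %s running\n", name);
        return NULL;
    }

    int start_worker(const char *name)
    {
        pthread_t tid;
        /* All workers share one address space, hence the thread-safety
         * concern noted on this slide. */
        if (pthread_create(&tid, NULL, worker, (void *)name) != 0)
            return -1;
        return pthread_join(tid, NULL);    /* sketch: wait for the worker to finish */
    }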
NAS Parallel Benchmark
Used NAS PB v2.4
Compiled and executed it on a desktop and
Millennium Cluster
Obtained information about
MOPS
Type of operation
Execution time
Number of nodes involved
Number of processes and iterations
NAS PB simulation result
(Millennium cluster, Class A)
Simulation result (cont.)
Estimated statistics for the floating
point group
The 4 benchmarks that use floating point ops
heavily are BT, CG, MG, and SP
Very few fp comparison ops in any of them
BT (Block Tridiagonal): nearly all fp ops are add,
subtract, and multiply; about 5% of all ops are
divisions
CG (Conjugate Gradient): has the highest percentage
of sqrt ops, about 30%; add/multiply are about 60%
and division about 10%
MG (Multigrid): about 5% of ops are sqrt and 20% are
divisions; the rest are add, subtract, and multiply
SP (Scalar Pentadiagonal): almost all ops are adds;
about 10% are divisions
Floating Point Operation Frequency
[Charts: per-benchmark breakdowns of floating point operation frequency for BT, CG, MG, and SP]
Most frequently used MPI
functions in NASPB v2.4
MPI Function        Percent Frequency
MPI_IRECV           14.4%
MPI_SEND            10.6%
MPI_ISEND           10.2%
MPI_BCAST            9.7%
MPI_WAIT             9.7%
MPI_ALLREDUCE        7.2%
MPI_BARRIER          7.2%
MPI_ABORT            4.7%
MPI_COMM_SIZE        4.2%
MPI_WAITALL          3.4%
MPI_FINALIZE         3.0%
MPI_COMM_RANK        2.5%
MPI_INIT             2.5%
MPI_REDUCE           2.5%
MPI_ALLTOALL         1.7%
MPI_COMM_DUP         1.7%
MPI_COMM_SPLIT       1.7%
MPI_RECV             1.7%
MPI_WTIME            0.8%
MPI_ALLTOALLV        0.4%
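The top entries in this table form the standard nonblocking exchange pattern; a small sketch of that pattern (our illustration, not code taken from NASPB) follows.

    #include <mpi.h>

    /* Post the receive first, send to the peer, then wait for the receive. */
    void exchange(double *inbuf, double *outbuf, int n, int peer, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Irecv(inbuf, n, MPI_DOUBLE, peer, 0, comm, &req);   /* nonblocking receive */
        MPI_Send(outbuf, n, MPI_DOUBLE, peer, 0, comm);         /* blocking send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                      /* complete the receive */
    }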
Observations about NASPB
NASPB suite – 6 out of 8 benchmarks are
predictive of parallel performance
EP – little/negligible communication
between processors.
IS – high communication overhead.
Project status
Compiled uClinux and put it on Microblaze
Worked on the MPI port, but it is not yet
complete
Compiled and executed NASPB on a desktop and on
Millennium (which currently uses 8 computing
nodes)