BigSim Tutorial - Parallel Programming Laboratory

BigSim Tutorial
Presented by
Gengbin Zheng, Ryan Mokos
Charm++ Workshop 2009
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
Outline
Overview
BigSim Emulator
BigSim Simulator
Post-mortem simulation
BigNetSim build flow
Generic network model: Simple Latency Model
Specific network models
Extensibility
BigSim Infrastructure
BigSim for whole-system simulation of a large
parallel machine.
Goal: Support early application development and
identification of performance bottlenecks.
What BigSim can do:
An execution environment that can run both Charm++ and MPI
applications on large-scale target machines
Few or no changes to MPI application source code
Facilitates code development and debugging
Predict parallel performance at varying levels of
resolution
Tune/scale performance
Machine vendors designing future machines
BigSim Components
BigSim Emulator
Run AMPI/Charm++ on emulator
Capture computation and communication information
Parallel: Each physical processor is used to emulate multiple
target processors, leveraging Charm++’s virtualization support
BigSim Simulator
PDES, Network contention
Produce performance data in a format compatible with
the Projections graphical browser
What BigSim Cannot Do
BigSim itself
does not predict cycle-accurate timing
(needs instruction-level simulation)
does not predict cache effects or virtual memory behavior
does not model OS jitter
BigSim Emulator
Emulate full machine on existing parallel
machines
Actually run a parallel program
E.g. multi-million objects on 128K target processors
Emulator is implemented on Charm++
Libraries that link to user application
Simple architecture abstraction
Many multiprocessor (SMP) nodes connected via
message passing
BigSim Emulator: functional view
[Figure: each target node contains communication processors, worker processors, an inBuff, a CorrectionQ, and affinity and non-affinity message queues; the target nodes run on top of the Converse scheduler and Converse Q of a real processor.]
Install BigSim Emulator
Download Charm++ v6.1.2
http://charm.cs.uiuc.edu/download/downloads.shtml
Compile Charm++/AMPI with “bigemulator”
option:
./build AMPI net-linux-x86_64 bigemulator -O
This builds charm++ and emulator libraries under
net-linux-x86_64-bigemulator
Compiler wrapper for MPI applications:
charm/net-linux-x86_64-bigemulator/bin/mpicc,
mpicxx, mpif90, etc
Prepare MPI Applications
Make sure applications are AMPI-compliant
Adaptive MPI – an implementation of the MPI standard on
Charm++
Multithreaded
Changes that may be needed:
Fortran: Program Main => Program MPI_Main
Handle global/static variables
Manual: group globals into a big structure, and allocate on
heap
Semi-automatic: use thread local storage
static __thread int var;
Automatic: -swapglobals compiler option (ELF binaries)
Only handles globals, not statics
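The manual option above can be sketched as follows. This is a minimal illustration, not code from the tutorial: the struct name AppGlobals, its fields, and both helper functions are hypothetical. The idea is that each AMPI thread allocates its own copy on the heap and passes the pointer around instead of touching true globals.

```c
#include <stdlib.h>

/* Hypothetical container for what used to be global variables. */
typedef struct {
    int value;       /* formerly a global "int value" */
    int iterations;  /* formerly a global loop counter */
} AppGlobals;

/* Allocate one zero-initialized copy on the heap (one per rank/thread). */
AppGlobals *app_globals_create(void)
{
    return (AppGlobals *)calloc(1, sizeof(AppGlobals));
}

/* Code that used the globals now receives the structure explicitly. */
void accumulate(AppGlobals *g, int rank)
{
    g->value += rank;
    g->iterations++;
}
```

Each rank would call app_globals_create() once after MPI_Init and free the structure before MPI_Finalize.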
Ring Example (ring.c)
#include <stdio.h>
#include "mpi.h"
#define TIMES 10

#if CMK_BLUEGENE_CHARM
extern void BgPrintf(const char *);
#define BGPRINTF(x) if (myid == 0) BgPrintf(x);
#else
#define BGPRINTF(x)
#endif

/* global variable: not AMPI-compliant */
int value = 0;

int main(int argc, char *argv[])
{
  int myid, numprocs, i;
  double time;
  MPI_Status status;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);

  time = MPI_Wtime();
  BGPRINTF("Start of major loop at %f \n");
  for (i=0; i<TIMES; i++) {
    if (myid == 0) {
      MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD);
      MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status);
    }
    else {
      MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
      value += myid;
      MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
    }
  }
  BGPRINTF("End of major loop at %f \n");
  if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
  MPI_Finalize();
}
Ring Example (AMPI-compliant)
#include <stdio.h>
#include "mpi.h"
#define TIMES 10

#if CMK_BLUEGENE_CHARM
extern void BgPrintf(const char *);
#define BGPRINTF(x) if (myid == 0) BgPrintf(x);
#else
#define BGPRINTF(x)
#endif

int main(int argc, char *argv[])
{
  /* value is now a local variable instead of a global */
  int myid, numprocs, i, value=0;
  double time;
  MPI_Status status;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);

  time = MPI_Wtime();
  BGPRINTF("Start of major loop at %f \n");
  for (i=0; i<TIMES; i++) {
    if (myid == 0) {
      MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD);
      MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status);
    }
    else {
      MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
      value += myid;
      MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
    }
  }
  BGPRINTF("End of major loop at %f \n");
  if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
  MPI_Finalize();
}
How to Compile and Run MPI
Applications for the Emulator
Compile with AMPI and emulator
charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c
with performance trace module:
charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c -tracemode projections
Run:
Use mpirun provided by AMPI
Give the number of target processors as well as the number of real processors
Define target machine
Command line options
+x +y +z
+cth +wth
E.g.
mpirun -np 4 ./ring +x10 +y10 +z10 +cth2 +wth4
Or, use a config file
mpirun -np 4 ./ring +bgconfig config
Bgconfig File Format
+bgconfig ./bg_config
x 10
y 10
z 10
cth 2
wth 4
stacksize 4000
timing walltime
#timing bgelapse
#timing counter
#cpufactor 1.0
fpfactor 5e-7
traceroot /tmp
log yes
correct no
network bluegene
Ring Std Output
Justice> mpirun -np 4 ./pgm +bgconfig ./bg_config
Reading Bluegene Config file ./bg_config ...
BG info> Simulating 8x1x1 nodes with 1 comm + 1 work threads each.
BG info> Network type: ibmpower.
alpha: 1.000000e-06
bandwidth :1.700000e+09.
BG info> cpufactor is 1.000000.
BG info> floating point factor is 0.000000.
BG info> BG stack size: 30000 bytes.
BG info> Using WallTimer for timing method.
BG info> Generating timing log.
BG info> bgTrace root is './'.
LB> Load balancer ignores processor background load.
Start of major loop at 0.268719
End of major loop at 0.273697
Sum=280, Time=0.00497856
[0] Number is numX:8 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:4 totalWorkerProcs:8 bglog_ver:5
[2] Wrote to disk for 2 BG nodes.
[3] Wrote to disk for 2 BG nodes.
[1] Wrote to disk for 2 BG nodes.
[0] Wrote to disk for 2 BG nodes.
BG> BlueGene emulator shutdown gracefully!
BG> Emulation took 0.692498 seconds!
Ring Output Files
Justice> ls -l
-rwxr-xr-x 1 gzheng kale 2194434 2009-04-15 00:03 ring
-rw-r--r-- 1 gzheng kale 10105 2009-04-15 00:04 pgm.sts
-rw-r--r-- 1 gzheng kale 0 2009-04-15 00:04 pgm.projrc
-rw-r--r-- 1 gzheng kale 4557 2009-04-15 00:04 pgm.7.log
-rw-r--r-- 1 gzheng kale 4557 2009-04-15 00:04 pgm.6.log
-rw-r--r-- 1 gzheng kale 4557 2009-04-15 00:04 pgm.5.log
-rw-r--r-- 1 gzheng kale 4559 2009-04-15 00:04 pgm.4.log
-rw-r--r-- 1 gzheng kale 4861 2009-04-15 00:04 pgm.3.log
-rw-r--r-- 1 gzheng kale 5163 2009-04-15 00:04 pgm.2.log
-rw-r--r-- 1 gzheng kale 5167 2009-04-15 00:04 pgm.1.log
-rw-r--r-- 1 gzheng kale 6670 2009-04-15 00:04 pgm.0.log
-rw-r--r-- 1 gzheng kale 23901 2009-04-15 00:04 bgTrace3
-rw-r--r-- 1 gzheng kale 23938 2009-04-15 00:04 bgTrace2
-rw-r--r-- 1 gzheng kale 24663 2009-04-15 00:04 bgTrace1
-rw-r--r-- 1 gzheng kale 24242 2009-04-15 00:04 bgTrace0
-rw-r--r-- 1 gzheng kale 60 2009-04-15 00:04 bgTrace
Projections logs (pgm.*.log): 8 files
bgTrace data files (bgTrace0 to bgTrace3): only 4 files
What is in the Trace Logs?
[Figure: traces for 2 target processors]
Each SEB (sequential execution block) has:
• startTime, endTime
• Incoming Message ID
• Outgoing messages
• Dependences
Tools for reading bgTrace binary files:
1. charm/examples/bigsim/tools/loadlog
Convert to human-readable format
2. charm/examples/bigsim/tools/log2proj
Convert to trace projections log files
Ring Projections Timeline
Performance Prediction
How to predict sequential performance?
Different levels of fidelity:
User supplied timing expression
Wall clock time
Performance counters
Instruction level simulation
Sequential Time - BgElapse
BgElapse
Manually advance processor time
MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
value += myid;
...
BgElapse(0.000005);
MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
Run with +bgelapse
Sequential Time – using Wallclock
Wallclock measurement of the time can be
used via a suitable multiplier (scale factor)
T * factor
Run application with +bgwalltime and
+bgcpufactor, or
+bgconfig ./bgconfig:
timing walltime
cpufactor 0.7
Good for predicting a larger machine using
a fraction of the machine
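The scaling rule T * factor can be written as a one-line helper. This is only a sketch of the arithmetic; the function name predict_target_time is hypothetical and is not part of the BigSim API, which applies the factor internally when cpufactor is set.

```c
/* Hypothetical helper: scale a wall-clock time measured on the host
   (in seconds) by cpufactor to estimate the time on the target machine. */
double predict_target_time(double host_seconds, double cpufactor)
{
    return host_seconds * cpufactor;
}
```

With cpufactor 0.7, a block measured at 2.0 seconds on the host would be recorded as about 1.4 seconds of target time.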
Sequential Time – Performance Counters
Count floating-point, integer, memory and
branch instructions (for example) with hardware
counters
Convert these hardware counter values to expected time on
the target machine.
Cache performance and memory footprint
effects can be approximated
by the percentage of memory accesses and the cache hit/miss ratio.
Example of use, for a floating-point intensive
code:
+bgconfig ./bg_config
timing counter
fpfactor 5e-7
Perfex and PAPI are supported
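One way to picture the counter-based method is a weighted sum of instruction counts. The sketch below is an assumption about the general idea, not BigSim's implementation: the InstrMix type and estimate_time function are hypothetical, and only the floating-point weight (fpfactor 5e-7, read here as seconds per floating-point instruction) comes from the example configuration.

```c
/* Hypothetical per-class instruction counts, or per-class cost factors
   in seconds per instruction, depending on use. */
typedef struct {
    double fp;
    double integer;
    double memory;
    double branch;
} InstrMix;

/* Estimated time = sum over classes of (count * seconds per instruction). */
double estimate_time(const InstrMix *counts, const InstrMix *sec_per_instr)
{
    return counts->fp      * sec_per_instr->fp
         + counts->integer * sec_per_instr->integer
         + counts->memory  * sec_per_instr->memory
         + counts->branch  * sec_per_instr->branch;
}
```

For a floating-point intensive code, only the fp factor would be non-zero, so 1000 floating-point instructions at 5e-7 would be charged 5e-4 seconds.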
Sequential Time – Instruction level
simulation
Run instruction-level simulator separately to
get accurate timing information
Issues:
It is a different third-party hardware simulator
Hard to integrate with BigSim
Sequential
Does not model communication
Slow!
Interpolation
BigSim and instruction-level simulator interact through
logs
Reduce the problem size by sampling: An interpolation-based scheme
Run a smaller sized problem, or
Run just one processor
Assume computation can be modelled by a set of
parameters:
TC = Fn(p1, p2, p3, ...)
Use sample data from the instruction-level simulation to
interpolate large dataset
With sampling data, do a least-squares fit to determine the
coefficients of an approximation polynomial function
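The least-squares step can be illustrated for the simplest case, a degree-1 polynomial in a single parameter, T = a + b*p. This is a plain C sketch under that assumption; the function name fit_line is hypothetical, and the actual interpolation tool relies on GSL rather than on code like this.

```c
#include <stddef.h>

/* Fit t ~ a + b*p by ordinary least squares over n samples.
   Returns 0 on success, -1 if there are too few samples or
   all p values are identical. */
int fit_line(const double *p, const double *t, size_t n,
             double *a, double *b)
{
    double sp = 0.0, st = 0.0, spp = 0.0, spt = 0.0, denom;
    size_t i;

    if (n < 2) return -1;
    for (i = 0; i < n; i++) {
        sp  += p[i];
        st  += t[i];
        spp += p[i] * p[i];
        spt += p[i] * t[i];
    }
    denom = (double)n * spp - sp * sp;
    if (denom == 0.0) return -1;

    *b = ((double)n * spt - sp * st) / denom;  /* slope */
    *a = (st - *b * sp) / (double)n;           /* intercept */
    return 0;
}
```

Sample timings from the instruction-level simulator supply the (p, t) pairs; the fitted coefficients then predict the time of sequential blocks whose parameter values were not simulated.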
Case study: BigSim / Mambo
void func( )
{
startTraceBigSim( );
…
endTraceBigSim( );
}
[Figure: workflow. BigSim parallel emulation produces trace files and parameter files for sequential blocks. Mambo provides cycle-accurate prediction of sequential blocks on the POWER7 processor. Interpolation replaces the sequential timing and produces adjusted trace files. BigSim parallel simulation then gives the prediction for the target system.]
Ring Example
MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
startTraceBigSim();
value += myid;
endTraceBigSim();
char param[128];
sprintf(param, "sum %d", myid);
tagTraceBigSim(param);
MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,
999,MPI_COMM_WORLD);
Output Files
justice>ls -l
total 2328
-rw-r--r-- 1 gzheng kale 60 2009-04-15 11:08 bgTrace
-rw-r--r-- 1 gzheng kale 36757 2009-04-15 11:08 bgTrace0
-rw-r--r-- 1 gzheng kale 37023 2009-04-15 11:08 bgTrace1
-rwxr-xr-x 1 gzheng kale 94886 2009-04-14 09:46 charmrun*
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.0
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.1
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.2
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.3
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.4
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.5
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.6
-rw-r--r-- 1 gzheng kale 3 2009-04-15 11:08 param.7
-rwxr-xr-x 1 gzheng kale 2153700 2009-04-15 11:07 ring*
-rw-r--r-- 1 gzheng kale 2965 2009-04-15 11:07 ring.C
justice>cat param.7
48 sum 7
Run ring Through Instruction-level
Simulator
Compile normal version of ring (not
emulator)
Run sequentially through an instruction-level simulator
Sample line of Mambo output:
10900820693: (10718653772): TRACE_END: sum 7
Compile and Run Interpolation Tool
Install GSL, the GNU Scientific Library
cd charm/examples/bigsim/tools/rewritelog
Modify the file interpolatelog.C to match your
setup.
OUTPUTDIR specifies a directory for the new
logfiles
CYCLE_TIMES_FILE specifies the file which
contains accurate timing information
make
Run interpolation tool under bgTrace dir:
./interpolatelog
Record/Replay
Record only a subset of special logs when
running full size emulation
With the special logs, replay the execution
of a particular target processor through
hardware simulator
Example:
./pgm +x 32768 +y 1 +z1 +bgrecord +bgrecordprocessors 0-32767:1024
./pgm +bgreplay 31744
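The processor list in the example can be read as first-last:stride, so that every 1024th target processor from 0 to 32767 is recorded. That reading is an assumption, though it is consistent with replaying processor 31744 (31 * 1024). A hypothetical membership check, not part of BigSim, might look like this:

```c
/* Hypothetical check: is target processor pe selected by a
   "first-last:stride" specification? Returns 1 if yes, 0 if no. */
int is_recorded(long pe, long first, long last, long stride)
{
    if (stride <= 0 || pe < first || pe > last)
        return 0;
    return (pe - first) % stride == 0;
}
```

Under this reading, 31744 is a valid replay target for the run above, while 31745 is not, because no special log was recorded for it.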
Out-of-core Emulation
Motivation
Applications with large
memory footprint
VM system cannot handle them
well
Use hard drive
Similar to checkpointing
Message-driven execution
Peek at the message queue => what
executes next? (prefetch)
Using Out-of-core
Change bigsim configuration file:
Charm/tmp/Conv-mach-bigemulator.h
#define BIGSIM_OUT_OF_CORE 1
Recompile Charm++ and application
Run the application through the emulator,
with an additional command line option:
+bgooc 1024
Postmortem Simulation
Run application once, get trace logs, and run
simulation with logs for a variety of network
configurations
Big Network Simulator (BigNetSim) implemented
on POSE simulation framework
Particularly useful when message passing
performance is critical and strongly affected by
network contention
Note: the BigSim emulator and BigSim simulator
both use the same network models for latency-only
calculations, located in
charm/src/langs/bluegene/bigsim_network.h
Implementation
Post-Mortem Network simulators are
Parallel Discrete Event Simulations
Parallel Object Simulation Environment (POSE)
Network layer constructs (NIC, Switch, Node,
etc.) implemented as poser simulation objects
Network data constructs (message, packet, etc.)
implemented as event methods on simulation
objects
POSE
Terms
Several network models available
Specific: e.g., BlueGene
Latency-only model – does not account for contention
Network contention model
Generic: Simple Latency Model – uses a simple
equation for determining message transmission time
Emulating processors – physical processors on
which emulation is run (+p?)
Simulating processors – physical processors on
which simulation (BigNetSim) is run (+p?)
Target processors – virtual (or simulated)
processors on which emulation and simulation
are run (+vp?)
BigNetSim Build Flow
Download and compile charm
Compile POSE
Compile bigsim
Download BigNetSim
Compile BigNetSim
Run simulator
Output
Download and compile charm (if not
done already)
Download the latest version of charm from
the PPL archives:
http://charm.cs.uiuc.edu/download/downloads.shtml
Compile charm
cd charm
./build charm++ net-linux
Compile POSE
cd charm
./build pose net-linux
options are set in pose_config.h
stats enabled by POSE_STATS_ON=1
user event tracing TRACE_DETAIL=1
more advanced configuration options
speculation
checkpoints
load balancing
Compile bigsim
cd charm/net-linux/tmp
make bigsim
Download BigNetSim
Download latest revision from repository:
svn co https://charm.cs.uiuc.edu/svn/repos/BigNetSim
Directory structure: BigNetSim/trunk/
BlueGene/ RedStorm/ and others - network models
SimpleLatency/ - Simple Latency Model
Topology/ Routing/ InputVcSelection/ OutputVcSelection/ - network configuration choices
Main/ - main simulation files
tools/ - tools directory
tmp/ - working directory created during build
Compile BigNetSim
Fix BigNetSim/trunk/Makefile.common so
CHARMBASE points to your charm
directory
For the Simple Latency Model:
cd BigNetSim/trunk/SimpleLatency
For parallel simulator:
make
For sequential simulator (runs only on 1
simulating processor):
make SEQUENTIAL=1
Run Simulator
cd BigNetSim/trunk/tmp
Copy bgTrace files into this tmp directory
For parallel build, run with:
./charmrun +p4 bigsimulator -lat 1 -bw 1
For sequential build, run with:
./bigsimulator -lat 1 -bw 1
Output
Simulation completion time
Specified in “GVT ticks” (GVT = Global Virtual
Time)
GVT tick length is determined by the value of
#define factor in BigNetSim/trunk/Main/TCsim.h
Divide final GVT by factor to get simulation time
in seconds
factor = 1e8 => 1 tick = 10ns
factor = 1e9 => 1 tick = 1ns
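The conversion from GVT ticks to seconds is a single division. The helper below is a hypothetical sketch of that arithmetic (gvt_to_seconds is not a BigNetSim function); the factor argument stands for the value of #define factor in TCsim.h.

```c
/* Hypothetical helper: convert a final GVT (in ticks) to seconds,
   given the number of ticks per second (the "factor"). */
double gvt_to_seconds(long long gvt_ticks, double factor)
{
    return (double)gvt_ticks / factor;
}
```

For example, a final GVT of 38129444 with factor = 1e8 corresponds to roughly 0.381 seconds of simulated time.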
Output (continued)
Use BgPrintf(char *) in source code to print
event times
Each BgPrintf() called at execution time in online
execution mode is stored in the trace log as a
printing event
In postmortem simulation, strings associated with
BgPrintf() events are printed when the event is
committed
“%f” in the string will be replaced by the committed
time
Useful for determining iteration times during
simulation as well as emulation
Output (continued)
Projections
Copy emulation Projections logs and sts file into
BigNetSim/trunk/tmp
Two ways to use:
Command-line parameter: -projname <name>
Creates a new set of logs by updating the emulation logs
Assumes emulation Projections logs are: <name>.*.log
Output: <name>-bg.*.log
Disadvantage: emulation Projections overhead included
Command-line parameter: -tproj
Creates a new set of logs from the trace files, ignoring the
emulation logs
Must first copy <name>.sts file to tproj.sts
Output: tproj.*.log
Advantage: no emulation Projections overhead included
Ring Example
./bigsimulator -lat 1 -bw 1
Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on
in Kernel, run 'echo 0 >
/proc/sys/kernel/randomize_va_space' as root to disable
it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on
0
0
Info> timing factor 1.000000e+08 ...
Info> invoking startup task from proc 0 ...
[0:RECV_RESUME] Start of major loop at 0.347418
[0:RECV_RESUME] End of major loop at 0.349147
Simulation inactive at time: 38129444
Final GVT = 38129444
1 PE Simulation finished at 0.052671.
Program finished.
Projections - Ring Example
Emulation
Simulation: -lat 1 (latency = 1s) generated with -tproj
Projections - Ring Example
Simulation: -lat 1 (latency = 1s) generated with -tproj
Simulation: -lat 20 (latency = 20s) generated with -tproj
Simple Latency Model
Calculates message transmission time using a
simple equation:
lat + (N / bw) + [cpp * (N / psize)]
where
lat = latency in s
bw = bandwidth in GB/sec
cpp = cost per packet in s
psize = packet size in bytes
N = number of bytes sent
Only lat and bw are required; cpp and psize are
optional
Does not account for contention or topology,
except for intra-node vs. inter-node messages
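The equation can be evaluated directly once the units are made consistent. The sketch below assumes latency and cost per packet are in microseconds and that 1 GB = 1e9 bytes, so 1 GB/sec equals 1e3 bytes per microsecond. The function name simple_latency_us is hypothetical; it mirrors the formula as written (N / psize is not rounded up) rather than the simulator's source.

```c
/* Hypothetical sketch of lat + (N / bw) + [cpp * (N / psize)].
   lat_us, cpp_us: microseconds; bw_gb_per_s: GB/sec (1 GB = 1e9 bytes
   assumed); psize_bytes, n_bytes: bytes. Returns microseconds.
   Pass cpp_us <= 0 or psize_bytes <= 0 to omit the optional term. */
double simple_latency_us(double lat_us, double bw_gb_per_s,
                         double cpp_us, double psize_bytes,
                         double n_bytes)
{
    /* 1 GB/sec = 1e3 bytes per microsecond */
    double t = lat_us + n_bytes / (bw_gb_per_s * 1e3);

    if (cpp_us > 0.0 && psize_bytes > 0.0)
        t += cpp_us * (n_bytes / psize_bytes);
    return t;
}
```

With -lat 1 -bw 1, a 1000-byte message costs about 2 microseconds: 1 for latency plus 1 for transfer.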
BigNetSim Design – Simple Latency
Model
[Figure: block diagram with BGnode, Transceiver, BGproc (x2), Net Interface, Switch, and six Channels.]
Simple Latency Model Parameters
No netconfig file used for Simple Latency Model;
only command-line parameters
Required command-line parameters:
-lat : latency in s (double)
-bw : bandwidth in GB/sec (double)
Optional parameters:
-cpp : cost-per-packet in s (double)
-psize : packet size in bytes (int)
-lat_in : intra-node latency in s (double)
If not specified, lat_in defaults to charmcost (0.5 s)
-bw_in : intra-node bandwidth in GB/sec (double)
If not specified, bw_in defaults to the value from -bw
-winsize : window size in # of log entries (int)
Simple Latency Model Parameters
More optional parameters:
-winthresh : threshold beyond which windows will not be
kept when looking at forward dependencies; specified in #
of windows (int)
-print_params : echoes the parameters fed into the
simulation
-debuglevel : specify level of debugging info displayed
0: display nothing extra
1: display some basic data, including memory usage and
timing stats
2: display function-level debugging messages
3: display all debugging messages
-skip : skip to a predefined point specified in the emulation
source code
-profile : display data for forward dependent distance
histogram
Other Parameters
Other useful parameters for all network
models:
-check : at the end of the simulation, print all
uncommitted events for each target processor
None should be displayed, except for perhaps event 0
on each target processor
-projname : create Projections logs based on
existing emulation logs
-tproj : create Projections logs from trace files,
ignoring existing emulation logs
Window Size Parameter
Specifies the window size used for reading the bgTrace
logs incrementally
Necessary for large traces
Available for all network models:
Simple Latency Model: use -winsize on command line
Other models: specify FILE_WINDOW_SIZE in netconfig file
Reduces memory footprint of simulation at the cost of
increasing runtime
Setting to 0 sets size to the whole timeline (no windowing)
Next window is loaded when simulation needs to look at
forward dependencies of a particular event
Doesn’t help much if dependencies span most of a timeline =>
use window threshold parameter
Window Size Memory Savings
Depends a lot on trace log profiles
Use -profile command-line option in Simple
Latency Model to get a forward dependent
distance histogram (in text format)
Window Size Memory Savings - NAMD
Sequential Simple Latency Model, 280 target procs, 1 simulating proc,
~4000 events per time line, no thresholding (-winthresh 0)
Note: Memory savings is good: save ~55%
[Figures: "NAMD Memory Results" plots max memory (millions of bytes) and "NAMD Run Time Results" plots run time (seconds), each against window size (events) for Old and for window sizes 4000 and 100.]
Window Size Memory Savings - MILC
Sequential Simple Latency Model, 280 target procs, 1 simulating proc,
~4000 events per time line, no thresholding (-winthresh 0)
Note: Memory savings is bad: only ~3%
[Figures: "MILC 8 Memory Results" plots max memory (millions of bytes) and "MILC 8 Run Time Results" plots run time (seconds), each against window size (events, 1 to 100000); series: New, Old.]
Window Load Threshold Parameter
Specifies a threshold, in number of event
windows, beyond which the simulator will not keep
a window in memory when looking at forward
dependencies
Available for all network models:
Simple Latency Model: use -winthresh on command line
Other models: specify WINDOW_LOAD_THRESHOLD in
netconfig file
Balancing required
Too small => extra window loading and longer runtime
Too large => more windows loaded and larger memory
footprint
Setting to 0 sets threshold to infinity (no
thresholding)
Threshold Usage
Note: each box is an event window; lines represent forward dependents
Without thresholding, the simulator loads and stores a new window when it looks
at the backward dependents of an event’s forward dependents to see if it can
queue it for execution.
This is fine for time lines where the forward dependents are “close”:
But when they’re “far away,” this could load most of the timeline, which won’t be
discarded until the events are executed:
Solution: define a threshold beyond which the windows aren’t saved.
Threshold Time Savings - MILC
Sequential Simple Latency Model, 280 target procs, 1 simulating proc,
~4000 events per time line, with thresholding (-winthresh 1)
Note: Memory savings is much better: save ~80%
[Figure: "MILC 8 Memory Results" plots max memory (millions of bytes) against window size (events, 1 to 100000); series: New – No Thresh, New – With Thresh, Old.]
Debuglevel Example - Ring
./bigsimulator -lat 1 -bw 1 -winsize 5 -debuglevel 1
Does this for each target processor:
Beginning first pass on PE 0
[0] Max memory during init: 6660424 (6.4 MB)
[0] FileWindower: totalTlineLength=60
[0] First pass took 0.004848 seconds
[0] Max memory during first pass: 1297480 (1.2 MB)
Beginning execution on PE 0
Shows execution updates:
*** Entire first pass sequence took about 0.049207 seconds
*** Execution update: at event 0 of 60 on PE 0 (run time = 0.000000 seconds)
[0:RECV_RESUME] Start of major loop at 0.347418
[0:RECV_RESUME] End of major loop at 0.349147
Simulation inactive at time: 38129444
Final GVT = 38129444
Prints this at the end for each target processor:
On proc 0:
[0] Max memory during execution: 1643560 (1.6 MB)
[0] Max windows open: 2
[0] Open windows at completion: 0
[0] Window deletes: first pass=12 later=12
Networks
Indirect Network
Direct Network
BigNetSim Design
[Figure: block diagram with BGnode, Transceiver, BGproc (x2), Net Interface, Switch, and six Channels.]
BigNetSim: Network Data Flow
[Figure: for nodes 1 and 2, BGproc, BGnode, and Net Interface exchange messages; Net Interface, Channel 1, and Switch 1 exchange packets.]
Interconnection Networks
Flexible Interconnection Network
modeling:
Choose from a variety of
Topologies
Routing Algorithms
Input Virtual Channel Selection strategies
Output Virtual Channel Selection strategies
Topology
Topologies available
HyperCube;
Mesh; generalized k-ary-n-mesh; n-mesh;
Torus; generalized k-ary-n-cube;
FatTree; generalized k-ary-n-tree;
Low Diameter Regular graphs (LDR)
Hybrid topologies
HyperCube-Fattree;
HyperCube-LDR;
Routing Algorithms
K-ary-N-mesh / N-mesh
Direction Ordered;
Planar Routing;
Static Direction Reversal Routing
Optimally Fully Adaptive Routing (modified
too)
K-ary-N-tree
UpDown (modified, non-minimal)
HyperCube
Hamming
P-Cube (modified too)
Input/Output VC selection
Input Virtual Channel Selection
Round Robin;
Shortest Length Queue
Output Buffer length
Output Virtual Channel Selection
Max. available buffer length
Max. available buffer bubble VC
Output Buffer length
Configuring BigNetSim
netconfig file parameters:
USE_TRANSCEIVER 0            For network analysis: ignore trace and generate random traffic
NUM_NODES 0                  Number of nodes, taken from trace file or set for transceiver
MAX_PACKET_SIZE 256          Maximum packet size
SWITCH_VC 4                  The number of switch virtual channels
SWITCH_PORT 8                Number of ports in switch, calculated automatically for direct networks
SWITCH_BUF 1024              Size in memory of each virtual channel
CHANNELBW 1.75               Bandwidth in units of 100 MB/s
CHANNELDELAY 9               Delay in units of 10 ns, so 9 => 90 ns
RECEPTION_SERIAL 0           Used for direct networks where reception FIFO access has to be serialized
INPUT_SPEEDUP 8              Used to limit simultaneous access by VC in a port. Should be less than or equal to number of VC. Currently used only for bluegene.
ADAPTIVE_ROUTING 1           Additional flag to use adaptive/deterministic routing
COLLECTION_INTERVAL 1000000  COLLECTION_INTERVAL * 10 ns gives statistics bin size
DISPLAY_LINK_STATS 1         Display statistics for each link
DISPLAY_MESSAGE_DELAY 1      Display message delay statistics
FILE_WINDOW_SIZE 0           Window size for incremental log reading
DEBUG_PRINT_LEVEL 1          Level of debugging messages to display
WINDOW_LOAD_THRESHOLD 0      Threshold beyond which windows are discarded
INTRA_NODE_LATENCY 0.5       Intra-node latency in µs
INTRA_NODE_BANDWIDTH 1.0     Intra-node bandwidth in GB/sec
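Several netconfig values are expressed in scaled units, which is easy to misread. The helpers below are a hypothetical sketch of the unit conversions described in the parameter notes (100 MB/s per CHANNELBW unit, 10 ns per CHANNELDELAY and COLLECTION_INTERVAL unit); none of these function names exist in BigNetSim.

```c
/* CHANNELBW is in units of 100 MB/s. */
double channel_bw_mb_per_s(double channelbw)
{
    return channelbw * 100.0;
}

/* CHANNELDELAY is in units of 10 ns. */
double channel_delay_ns(double channeldelay)
{
    return channeldelay * 10.0;
}

/* COLLECTION_INTERVAL * 10 ns gives the statistics bin size. */
double stats_bin_ns(double collection_interval)
{
    return collection_interval * 10.0;
}
```

With the sample values, CHANNELBW 1.75 means 175 MB/s, CHANNELDELAY 9 means 90 ns, and COLLECTION_INTERVAL 1000000 means a bin size of 1e7 ns (10 ms).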
Running Specific Network Model
Simulations
./charmrun +p4 bigsimulator 1 1
Parameters
First parameter selects detailed network
simulation
1 will use the detailed model (w/ or w/o contention)
0 will use latency-only model
Second parameter controls simulation skip
1 will skip forward to a time stamp set during trace
creation
0 will not skip - use if no skip points were set in the
emulation code or if the network startup portion of the
simulation is interesting
Artificial Network Loads - Transceiver
Generate traffic patterns instead of using trace files
Additional command line parameters:
Pattern
Frequency
Pattern
1 kshift
2 ring
3 bittranspose
4 bitreversal
5 bitcomplement
6 poisson
Frequency
0 linear
1 uniform
2 exponential
Adding a Network
mkdir new subdir in trunk
copy boilerplate InitNetwork.h
copy boilerplate Makefile
change MACHINE make variable to your
dirname
new InitNetwork.C
Define switch, channel, nic mappings
Define how switches route and select virtual
channels
Define topology and default routing
Adding a Topology
New *.h *.C in trunk/Topology
constructor()
getNeighbours()
getNext()
getNextChannel()
getStartPort()
getStartVC()
getStartSwitch()
getStartNode()
getEndNode()
Adding a Routing Strategy
New *.h *.C files in trunk/Routing
constructor()
selectRoute()
populateRoute()
loadTable()
getNextSwitch()
sourceToSwitchRoutes()
Adding a VC Selector
Either Input or Output VC Selector
New *.h *.C in [Input/Output]VCSelector
constructor()
select[Input/Output]VC()
BlueWaters
BlueWaters network model design
currently in progress
Thank you!
Free download of Charm++ and BigSim:
http://charm.cs.uiuc.edu
Send questions and comments to:
[email protected]