Parallel and Distributed Simulation


RESAM Laboratory Univ. Lyon 1, France

led by Prof. B. Tourancheau
Laurent Lefèvre, CongDuc Pham, Pascale Primet
PhD students: Patrick Geoffray, Roland Westrelin

Research interests

• High-performance communication systems
– Myrinet-based clusters, cluster management
– BIP, MPI-BIP, BIP-SMP
• Distributed Shared Memory systems
– DOSMOS system
• Network support for Multimedia and Cooperative applications
– QoS, multicast
– CoTool environment
• Parallel simulation, synchronization algorithms, communication network models
– CSAM tools

Parallel and Distributed Simulation of Communication Networks

(towards cluster-based solution)
C.D. Pham, RESAM laboratory, Univ. Lyon 1, France

Outline

• Introduction
– Discrete Event Simulation (DES)
– Parallel DES and the synchronization problems
• Conservative protocols
– Architecture of a conservative LP
– The Chandy-Misra-Bryant protocol
– The lookahead ability
• Optimistic protocols
– Architecture of an optimistic LP
– Time Warp

Outline, more...

• CSAM, a tool for ATM network models
– kernel characteristics
– results
• Cluster-based solutions
– Myrinet, BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP
– Fast Ethernet, Gamma?
– GigaEthernet?

Introduction

• Discrete Event Simulation (DES)
• Parallel DES and synchronization problems

Discrete Event Simulation (DES)

• assumption that a system changes its state at discrete points in simulation time

[Figure: time line from 0 to 6 t showing the states S1, S2, S3 and the events a1 to a4 and d1 to d3 at which the state changes]

DES concepts

• fundamental concepts:
– system state (variables)
– state transitions (events)
– simulation time: totally ordered set of values representing time in the system being modeled
• the system state can only be modified upon reception of an event
• modeling can be
– event-oriented
– process-oriented

Life cycle of a DES

• a DES system can be viewed as a collection of simulated objects and a sequence of event computations
• each event computation contains a time stamp indicating when that event occurs in the physical system
• each event computation may:
– modify state variables
– schedule new events into the simulated future
• events are stored in a local event list
– events are processed in time stamp order
– usually, no more events = termination
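The life cycle described here can be sketched as a short sequential event loop. This is a minimal illustration, not the kernel of any particular simulator: the names `run_des`, `on_arrival`, `on_departure` and the toy model (each arrival schedules a departure 5 time units later) are assumptions made for the example.

```python
import heapq

def run_des(initial_events, handlers):
    """Process events in time stamp order until the event list is empty."""
    event_list = list(initial_events)      # entries: (timestamp, seq, kind, data)
    heapq.heapify(event_list)
    seq = len(event_list)                  # tie-breaker for equal time stamps
    trace = []
    while event_list:                      # no more events = termination
        now, _, kind, data = heapq.heappop(event_list)
        trace.append((now, kind, data))
        # a handler may modify state and returns (delay, kind, data) tuples
        for delay, new_kind, new_data in handlers[kind](now, data):
            assert delay >= 0, "events cannot be scheduled in the past"
            heapq.heappush(event_list, (now + delay, seq, new_kind, new_data))
            seq += 1
    return trace

# Hypothetical model: each arrival schedules a departure 5 time units later.
state = {"in_system": 0}

def on_arrival(now, pkt):
    state["in_system"] += 1
    return [(5, "departure", pkt)]

def on_departure(now, pkt):
    state["in_system"] -= 1
    return []

arrivals = [(5, 0, "arrival", "P1"), (12, 1, "arrival", "P2"), (22, 2, "arrival", "P3")]
trace = run_des(arrivals, {"arrival": on_arrival, "departure": on_departure})
```

Because the smallest time stamp is always popped first and handlers can only schedule into the future, the trace comes out in non-decreasing time stamp order without any extra bookkeeping.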

A simple DES model

• a link model: delay = 5, send processing time = 5, receive processing time = 1
• packet arrivals: P1 at 5, P2 at 12, P3 at 22
[Figure: nodes A and B joined by a link of delay 5, with a single local event list holding the events e1 to e9: A receives packet P1, A sends P1 to B, A receives packet P2, A sends P2 to B, A receives ACK(P1), A receives packet P3, B receives P1 from A, B sends ACK(P1) to A, B receives P2 from A]

Why does it work?

• events are processed in time stamp order
• an event at time t can only generate future events with timestamp greater than or equal to t (no event in the past)
• generated events are put and sorted in the event list, according to their timestamp
– the event with the smallest timestamp is always processed first,
– causality constraints are implicitly maintained.

Why change? It's so simple!

• models become larger and larger
• the simulation time is overwhelming or the simulation is just intractable
• example:
– parallel programs with millions of lines of code,
– mobile networks with millions of mobile hosts,
– ATM networks with hundreds of complex switches,
– multicast model with thousands of sources,
– ever-growing Internet,
– and much more...

Some figures to convince...

• ATM network models
– simulation at the cell-level,
– 200 switches,
– 1000 traffic sources, 50 Mbit/s,
– 155 Mbit/s links,
– 1 simulation event per cell arrival.
• More than 26 billion events to simulate 1 second!
• 30 hours if 1 event is processed in 1 µs
– simulation time increases as link speed increases,
– usually more than 1 event per cell arrival,
– how scalable is traditional simulation?

Parallel simulation - principles

• execution of a discrete event simulation on a parallel or distributed system with several physical processors.
• the simulation model is decomposed into several sub-models that can be executed in parallel
– spatial partitioning,
– temporal partitioning,
• radically different from simple simulation replications.

Parallel simulation - pros & cons

• pros
– reduction of the simulation time,
– increase of the model size,
• cons
– causality constraints are difficult to maintain,
– need for special mechanisms to synchronize the different processors,
– increases both the model and the simulation kernel complexity.
• challenges
– ease of use, transparency.

Parallel simulation - example

[Figure: the model is split into logical processes (LPs) that run in parallel; figure labels: packet, event, h, t]

A simple PDES model

• link model: delay = 5, send processing time = 5, receive processing time = 1
• packet arrivals: P1 at 5, P2 at 12, P3 at 22
[Figure: A and B each hold their own local event list on a time axis t. A: A rec. packet P1, A sends P1 to B, A rec. packet P2, A sends P2 to B, A rec. packet P3, A rec. ACK(P1). B: B rec. P1 from A, B sends ACK(P1), B rec. P2 from A. The events are labelled e1 to e9 and the figure marks a causality error (violation)]

Synchronization problems

• fundamental concepts
– each Logical Process (LP) can be at a different simulation time
– local causality constraints: events in each LP must be executed in time stamp order
• synchronization algorithms
– Conservative: avoids local causality violations by waiting until it's safe
– Optimistic: allows local causality violations but provisions are made to recover from them at runtime

Conservative protocols

• Architecture of a conservative LP
• The Chandy-Misra-Bryant protocol
• The lookahead ability

Architecture of a conservative LP

– LPs communicate by sending non-decreasing timestamped messages
– each LP keeps a static FIFO channel for each LP with incoming communication
– each FIFO channel (input channel, IC) has a clock c_i that ticks according to the timestamp of the topmost message, if any, otherwise it keeps the timestamp of the last message
[Figure: LP A with one input channel each from LP B, LP C and LP D. Channel from LP B holds tB1, tB2 with clock c_1 = tB1; channel from LP C holds tC3, tC4, tC5 with clock c_2 = tC3; channel from LP D holds tD4 with clock c_3 = tD3]

A simple conservative algorithm

• each LP has to process events in time-stamp order to avoid local causality violations

The Chandy-Misra-Bryant algorithm

while (simulation is not over) {
    determine the IC_i with the smallest c_i
    if (IC_i empty)
        wait for a message
    else {
        remove topmost event from IC_i
        process event
    }
}
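The loop above can be turned into a small runnable sketch. It is an illustration under assumptions rather than a full protocol implementation: the class `InputChannel` and the function `run_until_blocked` are hypothetical names, there is a single LP with three input channels holding the time stamps 3, 6, 10; 1, 4, 7; and 5, 9, and instead of waiting for a message the function returns as soon as the LP would block.

```python
from collections import deque

class InputChannel:
    """FIFO input channel whose clock follows the topmost message."""
    def __init__(self, timestamps=()):
        self.queue = deque(timestamps)   # non-decreasing time stamps
        self.last = 0                    # time stamp of the last removed message

    @property
    def clock(self):
        # topmost message if any, otherwise the last message seen
        return self.queue[0] if self.queue else self.last

    def pop(self):
        self.last = self.queue.popleft()
        return self.last

def run_until_blocked(channels):
    """Return the processed (channel number, time stamp) pairs and the
    number of the empty channel the LP blocks on."""
    processed = []
    while True:
        # determine the input channel with the smallest clock
        idx = min(range(len(channels)), key=lambda i: channels[i].clock)
        if not channels[idx].queue:
            return processed, idx + 1    # a real LP would wait for a message here
        processed.append((idx + 1, channels[idx].pop()))

ics = [InputChannel([3, 6, 10]), InputChannel([1, 4, 7]), InputChannel([5, 9])]
processed, blocked_on = run_until_blocked(ics)
```

With these inputs the LP processes the events 1, 3, 4, 5, 6, 7 and then blocks on channel 2, even though channels 1 and 3 still hold events: the empty channel with the smallest clock could still deliver a message with time stamp 7.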

Safe but has to block

[Figure: LP A, LP B, LP C, LP D; one LP with three input channels, time stamps listed head first]
IC 1: 3, 6, 10
IC 2: 1, 4, 7
IC 3: 5, 9

min IC:  2  1  2  3  1  2
event:   1  3  4  5  6  7  BLOCK

Blocks and even deadlocks!

[Figure: S, A, B and a merge point M. S sends all messages to B, so M is BLOCKED. A second drawing shows a cycle of LPs with channel time stamps 4, 4, 4 and 6]

How to solve deadlock: null-messages

• null-messages for artificial propagation of simulation time
[Figure: S sends a null-message with time stamp 10 towards A and M; the merge point M is UNBLOCKED]
• What frequency?

How to solve deadlock: null-messages

• a null-message indicates a Lower Bound Time Stamp
• minimum delay between links is 4
• LP C initially at simulation time 0
[Figure: cycle of LP A, LP B and LP C with pending events at time stamps 7, 9, 10 and 11]
– LP C sends a null-message with time stamp 4
– LP A sends a null-message with time stamp 8
– LP B sends a null-message with time stamp 12
– LP C can process event with time stamp 7

The lookahead ability

• null-messages are sent by an LP to indicate a lower bound time stamp on the future messages that will be sent
• null-messages rely on the « lookahead » ability
– communication link delays
– server processing time (FIFO)
• lookahead is very application model dependent and needs to be explicitly identified

Lookahead for concurrent processing

[Figure: event time lines of LP A, LP B, LP C and LP D with the window from T_A to T_A + L_A; legend: safe event (s), unsafe event]

What if lookahead is small?

• a null-message indicates a Lower Bound Time Stamp
• LP C initially at simulation time 0
[Figure: the cycle of LP A, LP B and LP C with a small lookahead: null-messages with time stamps 1, 2, 3, 5, 6 and 7 travel round the cycle before LP C can process the event with time stamp 7]
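The cost of a small lookahead can be made concrete by counting null-messages. The sketch below is a simplification under stated assumptions: three LPs in a cycle C, A, B, the same lookahead on every link, only LP C has a pending event, every LP forwards a null-message as soon as it receives one, and the event is considered safe once the incoming bound is greater than or equal to its time stamp. The function name `null_messages_needed` and the lookahead value 1 are hypothetical.

```python
def null_messages_needed(lookahead, pending_event_ts, ring=("C", "A", "B")):
    """Circulate null-messages around a cycle of LPs until the first LP
    in `ring` may safely process its pending event."""
    if lookahead <= 0:
        raise ValueError("zero lookahead never advances the bound")
    sent = []            # (sender, null-message time stamp)
    bound = 0            # the first LP starts at simulation time 0
    while True:
        for lp in ring:
            bound += lookahead           # lower bound = received bound + lookahead
            sent.append((lp, bound))
        # the last message of the lap arrives back at the first LP
        if bound >= pending_event_ts:
            return sent

large = null_messages_needed(lookahead=4, pending_event_ts=7)
small = null_messages_needed(lookahead=1, pending_event_ts=7)
```

With a lookahead of 4 one lap (time stamps 4, 8, 12) is enough, while a lookahead of 1 needs three laps, i.e. nine null-messages, before the same event becomes safe.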

Conservative: pros & cons

• pros
– simple, easy to implement
– good performance when lookahead is large (communication networks, FIFO queue)
• cons
– pessimistic in many cases
– large lookahead is essential for performance
– no transparent exploitation of parallelism
– performance may drop even with small changes in the model (adding preemption, adding one small lookahead link…)

Optimistic protocols

• Architecture of an optimistic LP
• Time Warp

Architecture of an optimistic LP

– LPs send timestamped messages, not necessarily in non-decreasing time stamp order
– no static communication channels between LPs, dynamic creation of LPs is easy
– each LP processes events as they are received, no need to wait for safe events
– local causality violations are detected and corrected at runtime
[Figure: LP A receives the messages tB1, tB2, tC3, tC4, tC5 and tD4 from LP B, LP C and LP D]

Processing events as they arrive

• what to do with late messages?
[Figure: LP A receives messages from LP B, LP C and LP D; the messages with time stamps 11, 13, 18, 22, 25, 28, 32 and 36 are shown, with the label "processed!"]

TimeWarp. Rollback? How?

• Late messages (stragglers) are handled with a rollback mechanism
– undo false/incorrect local computations,
  • state saving: save the state variables of an LP
  • reverse computation
– undo false/incorrect remote computations,
  • anti-messages: anti-messages and (real) messages annihilate each other
– process late messages
– re-process previous messages: processed events are NOT discarded!
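The local part of the rollback (state saving, undo, re-processing) can be sketched as follows. This is a toy under assumptions: the class `OptimisticLP` is hypothetical, the state is just the sum of the processed time stamps, a state copy is saved before every event, and anti-messages for remote computations are left out.

```python
class OptimisticLP:
    """Processes events as they arrive; rolls back when a straggler shows up."""
    def __init__(self):
        self.state = 0                 # toy state: sum of processed time stamps
        self.processed = []            # time stamps, in processing order
        self.saved = []                # state saved *before* each processed event
        self.rollbacks = 0

    def _execute(self, ts):
        self.saved.append(self.state)  # state saving before every event
        self.state += ts
        self.processed.append(ts)

    def receive(self, ts):
        if self.processed and ts < self.processed[-1]:
            self._rollback(ts)         # late message (straggler)
        else:
            self._execute(ts)

    def _rollback(self, straggler_ts):
        self.rollbacks += 1
        redo = []
        # undo every event with a time stamp later than the straggler
        while self.processed and self.processed[-1] > straggler_ts:
            redo.append(self.processed.pop())
            self.state = self.saved.pop()   # restore the saved state
        # process the late message, then re-process the undone events
        for ts in [straggler_ts] + sorted(redo):
            self._execute(ts)

lp = OptimisticLP()
for ts in [11, 13, 18, 22, 25, 20]:    # 20 arrives late
    lp.receive(ts)
```

The undone events 22 and 25 are kept and executed again after the straggler, which is the sense in which processed events are not discarded. Saving the state before every event keeps the rollback short at the price of the largest state saving overhead.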

A pictured-view of a rollback

[Figure: input queue with time stamps 11, 13, 18, 22, 25, 28, 32, 36 marked processed or unprocessed, the saved state points, and an anti-msg with time stamp 45]
– The real rollback distance depends on the state saving period: a short period reduces rollback overhead but increases state saving overhead

Reception of an anti-message

– may initiate a rollback if the corresponding positive message has already been processed,
  [Figure: input queue 45, 43, 36, 28, 25, 22; anti-message 25; rollback]
– may annihilate the corresponding positive message if it is still unprocessed,
  [Figure: input queue 45, 43, 36, 28, 25, 22; anti-message 43]
– may wait in the input queue if the corresponding positive message has not been received yet.
  [Figure: input queue 45, 43, 36, 28, 25, 22; anti-message 48 waits in the queue]
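The three cases can be written as one small dispatch function. This is a sketch under assumptions: messages are identified by their time stamp only, `receive_anti_message` is a hypothetical name, and the split of the queue into processed (22, 25, 28) and unprocessed (36, 43, 45) events is chosen for the example.

```python
def receive_anti_message(ts, processed, unprocessed, pending_anti):
    """Apply the three cases for an anti-message with time stamp `ts`.
    The three lists hold time stamps and are modified in place."""
    if ts in processed:
        # positive message already processed: roll back to just before it and
        # put the later processed events back into the input queue
        idx = processed.index(ts)
        unprocessed.extend(processed[idx + 1:])
        unprocessed.sort()
        del processed[idx:]
        return "rollback"
    if ts in unprocessed:
        unprocessed.remove(ts)          # message and anti-message annihilate
        return "annihilate"
    pending_anti.append(ts)             # positive message not received yet
    return "wait"

processed, unprocessed, pending = [22, 25, 28], [36, 43, 45], []
results = [receive_anti_message(ts, processed, unprocessed, pending)
           for ts in (25, 43, 48)]
```

After the three calls only 22 remains processed, event 28 is back in the input queue for re-processing, and the anti-message 48 waits for its positive counterpart.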

Need for a Global Virtual Time

• Motivations
– an indicator that the simulation time advances
– reclaim memory (fossil collection)
• Basically, GVT is the minimum of
– all LPs' logical simulation time
– timestamp of messages in transit
• GVT guarantees that
– events below GVT are definitive events (I/O)
– no rollback can occur before the GVT
– state points before GVT can be reclaimed
– anti-messages before GVT can be reclaimed
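The definition and the fossil collection step can be sketched in a few lines. The sketch assumes that a consistent snapshot of all local times and in-transit messages is already available, which is the hard part in a distributed setting; `compute_gvt`, `fossil_collect` and the sample values are hypothetical. Keeping the most recent state point at or below GVT is a common refinement added here as an assumption, since a rollback to GVT still needs a restart point.

```python
def compute_gvt(lp_times, in_transit):
    """GVT = min over all LP local times and time stamps of messages in transit."""
    return min(list(lp_times.values()) + list(in_transit))

def fossil_collect(saved_states, gvt):
    """Drop state points older than GVT, keeping the latest one at or below it
    as the restart point of a possible rollback to GVT."""
    older = [t for t in saved_states if t <= gvt]
    keep_from = max(older) if older else None
    return {t: s for t, s in saved_states.items()
            if t > gvt or t == keep_from}

gvt = compute_gvt({"A": 32, "B": 28, "C": 36, "D": 25}, in_transit=[22, 40])
states = {11: "s11", 18: "s18", 20: "s20", 24: "s24", 27: "s27"}
remaining = fossil_collect(states, gvt)
```

Here the message in transit with time stamp 22 holds GVT below every LP clock, which is why in-transit messages have to be part of the minimum.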

A pictured-view of the GVT

[Figure: time lines of LP A, LP B, LP C and LP D with the old GVT and the new GVT; legend: conditional event (c), definitive event (D)]

Optimistic overheads

• Periodic state savings
– states may be large, very large!
– copies are very costly
• Periodic GVT computations
– costly in a distributed architecture,
– may block computations,
• Rollback thrashing
– cascaded rollback, no advancement!
• Memory!
– memory is THE limitation

Optimistic: pros & cons

• pros
– exploits all the parallelism in the model, lookahead is less important,
– transparent to the end-user
– interactive simulations can be enabled
– can hopefully be general-purpose
• cons
– very complex, needs lots of memory,
– large overheads (state saving, GVT, rollbacks…)

Optimizations, variations

• Conservative
• Optimistic
• Mixed approaches, adaptive approaches

Conservative: outline

• Add more information to reduce the number of null-messages
– special msg: carrier null-messages [Cai90]
– topology information: [DeVries90]
– time/delay information: Bounded Lag [Lubachevsky89]
– time window: CTW [Ayani92]
• In general, one tries to add additional knowledge of the model in the simulator, which may then not be general-purpose

Optimistic: outline

• Reduce rollback-related overhead
– lazy-cancellation, lazy re-evaluation
– limit optimism (time window, blocking)
• Reduce memory consumption
– fast GVT algorithms, hardware support
– incremental state saving, reverse computation
– cancelback, artificial rollback
• In general, one tries to reduce the optimism to avoid too many speculative computations

Mixed/adaptive approaches

• General framework that (automatically) switches to conservative or optimistic
• Adaptive approaches may determine at runtime the amount of conservatism or optimism
[Figure: performance of the conservative, optimistic and mixed approaches; axis labels: performance, messages]

Parallel simulation today

• Lots of algorithms have been proposed
– variations on conservative and optimistic
– adaptive approaches
• Few end-users
– impossible to compete with sequential simulators in terms of user interface, generality, ease of use etc.
• Research mainly focuses on
– applications, ultra-large scale simulations
– tools and execution environments (clusters)
– composability issues

CSAM (Pham, UCBL)

• CSAM: Conservative Simulator for ATM network Model
• Simulation at the cell-level
• Conservative and/or sequential
• C++ programming-style, predefined generic models of sources, switches, links…
• New models can be easily created by deriving from base classes
• Configuration file that describes the topology

CSAM - Kernel characteristics

• Exploits the lookahead of communication links: transparent for the user
• Virtual Input Channels
– reduces overhead for event manipulation,
– reduces overhead for null-messages handling.
• Cyclic event execution
• Message aggregation
– static aggregation size,
– asymmetric aggregation size on CLUMPS,
– sender-initiated,
– receiver-initiated.

CSAM - Life cycle

[Figure: CSAM life cycle. Incoming messages (channel number, time stamp) wait in the MPI buffers; last[i] holds the last time stamp received on channel i and safetime = min(last[i]); events sit in the Future Event List up to an end marker. (a) end of cycle, send a null-message (time stamp t2+L). (b) get new messages, begin new cycle]
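The cycle in the figure can be approximated by a generic conservative step built around safetime = min(last[i]). This is not CSAM's actual code: the function `run_cycle`, the rule that events with a time stamp up to and including safetime are safe, and the choice of safetime + lookahead as the null-message time stamp are all assumptions made for the illustration.

```python
import heapq

def run_cycle(future_events, last, lookahead):
    """One cycle: process every safe event, then compute the null-message.
    future_events: heap of (timestamp, label)
    last: last time stamp received on each input channel."""
    safetime = min(last)                     # safetime = min(last[i])
    done = []
    while future_events and future_events[0][0] <= safetime:
        done.append(heapq.heappop(future_events))
    null_ts = safetime + lookahead           # lower bound promised to neighbours
    return done, null_ts

fel = [(3, "a"), (4, "b"), (5, "c"), (7, "d")]
heapq.heapify(fel)
done, null_ts = run_cycle(fel, last=[5, 4, 6], lookahead=2)
```

In this run the channel that last delivered time stamp 4 bounds the cycle: events 3 and 4 are processed, 5 and 7 stay in the Future Event List until new messages raise the minimum.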

Test case: 78-switch ATM network

• Distance-Vector Routing with dynamic link cost functions
• Connection setup, admission control protocols

CSAM - Some results...

Routing protocol’s reconfiguration time

CSAM - Some results...

End-to-end delays

Cluster-based solution

• Myrinet-based cluster of 12 Pentium Pro at 200MHz, 64 MBytes, Linux
• Myrinet-based cluster of 4 dual Pentium Pro 450MHz, 128 MBytes, Linux
• Myrinet board with LANai 4.1, 256KB
• BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries

CSAM - speedup on a Myrinet cluster

[Figure: speedup (axis from 0 to 6) versus number of processors (2 to 10), Pentium Pro 200MHz]
More than 53 million events to simulate 0.31s

CSAM - speedup with CLUMPS

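[Figure: speedup (axis from 0 to 2.5) on dual Pentium Pro 450MHz for the configurations 2 ext., 2 int., 4 ext. and 2x2 int.; aggregation settings on the horizontal axis: no aggr, 156, 256, 512, 1024, 256 156, 512 156, 1024 156]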

Increasing the model size (CLUMPS)

[Figure: speedup (axis from 0 to 5) on dual Pentium Pro 450MHz, 4x2 int, for 78 switches and 156 switches; aggregation settings on the horizontal axis: no aggr, 156, 256, 512, 1024, 256 156, 512 156, 1024 156]

Conclusions

• Parallel simulation techniques can be successfully applied to communication network models
• To enable the PSTTL approach (parallel simulation to the labs), we need
– powerful software,
– powerful BUT cheap and accessible execution environment
– Myrinet? Fast Ethernet? Giga Ethernet?
• We will always take the cheapest if performance is good

References

• Parallel simulation
– K. M. Chandy and J. Misra, Distributed Simulation: A Case Study in Design and Verification of Distributed Programs, IEEE Trans. on Soft. Eng., 1979, pp. 440-452
– R. Fujimoto, Parallel Discrete Event Simulation, Comm. of the ACM, Vol. 33(10), Oct. 1990, pp. 31-53
– http://