
The Parallel Packet Switch
Sundar Iyer, Amr Awadallah & Nick McKeown
High Performance Networking Group, Stanford University
Web Site: http://klamath.stanford.edu/fjr
Stanford University © 1999
Contents
• Motivation
• Introduction
• Key Ideas
  – Speedup, Concentration, Constraints
• Centralized Algorithm
  – Theorems, Results & Summary
• Motivation for a Distributed Algorithm
• Concepts
  – Independence, Trade-Off, Request Duplication
• Performance of DPA
• Conclusions & Future Work
Motivation

• To build
  – a switch with memories running slower than the line rate
  – a switch with a highly scalable architecture
• To build
  – an extremely high-speed packet switch
  – a switch with extremely high line rates
• Quality of Service
• Redundancy

“I want an ideal switch”
Architecture Alternatives
[Figure: architectures compared along three axes — X: Ease of Implementation, Y: QoS Support, Z: Memory Speeds (1x, 2x, Nx). The labeled points are Input Queued, Output Queued, CIOQ Switch, "PPS Switch?", and "Ideal!".]

An Ideal Switch:
• runs its memory at lower than line-rate speeds
• supports QoS
• is easy to implement
What is a Parallel Packet Switch?
A parallel packet switch (PPS) comprises multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet by packet, by a demultiplexer across the slower packet switches, then recombined by a multiplexer at the output.
[Figure: a PPS with N=4 external ports at line rate R. Demultiplexers spread arriving packets over k=3 parallel output-queued switches, whose internal links run at rate R/k; multiplexers recombine the streams at the outputs at rate R.]
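To make the inverse-multiplexing idea concrete, here is a minimal sketch in Python. The round-robin spreading policy and the class name are illustrative assumptions, not the talk's algorithm (which is developed below as CPA and DPA):

```python
from collections import deque

class NaivePPS:
    """Toy single-port PPS: a demultiplexer spreads cells round-robin
    over k slower output-queued layers and a multiplexer recombines
    them. Round-robin is a placeholder policy, not the talk's CPA/DPA."""

    def __init__(self, k):
        self.k = k
        self.layers = [deque() for _ in range(k)]  # one FIFO per layer
        self.next_layer = 0                        # round-robin pointer

    def demux(self, cell):
        # Spread the incoming stream cell-by-cell across the k layers.
        self.layers[self.next_layer].append(cell)
        self.next_layer = (self.next_layer + 1) % self.k

    def mux(self):
        # Recombine: take the oldest cell across all non-empty layers.
        nonempty = [q for q in self.layers if q]
        if not nonempty:
            return None
        q = min(nonempty, key=lambda q: q[0])  # cells are (arrival, payload)
        return q.popleft()

pps = NaivePPS(k=3)
for t, payload in enumerate("ABCDE"):
    pps.demux((t, payload))
print([pps.mux() for _ in range(5)])  # cells depart in arrival order
```

Each internal FIFO here sees only every k-th cell, which is exactly why the internal memories can run at R/k.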
Key Ideas in a Parallel Packet Switch
• Key Concept: “Inverse Multiplexing”
• Buffering occurs only in the internal switches!
• By choosing a large value of “k”, we would like to arbitrarily reduce the memory speeds within the switch

Can such a switch work “ideally”?
Can it give the advantages of an output-queued switch?
What should the multiplexer and demultiplexer do?
Doesn’t the switch behave well in a trivial manner?
Definitions

• Output Queued Switch
  – A switch in which arriving packets are placed immediately in queues at the output, where they contend with packets destined to the same output while waiting their turn to depart.
  – “We would like to perform as well as an output-queued switch”
• Mimic (Black Box Model)
  – Two different switches are said to mimic each other if, under identical inputs, identical packets depart from each switch at the same time.
• Work Conserving
  – A system is said to be work-conserving if its outputs never idle unnecessarily.
  – “If you’ve got something to do, do it now!”
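The work-conserving property can be checked mechanically. A minimal sketch (the trace format is my assumption, not from the talk): an output passes iff it never stays silent during a slot in which it has cells queued.

```python
def is_work_conserving(trace):
    """trace: iterable of (backlog, transmitted) pairs, one per time
    slot, where backlog counts cells queued for the output at the
    start of the slot and transmitted says whether a cell departed."""
    return all(transmitted for backlog, transmitted in trace if backlog > 0)

# The output idles in the third slot despite a backlog of one cell.
print(is_work_conserving([(1, True), (2, True), (1, False)]))  # False
```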
Ideal Scenario
[Figure: the ideal case — cells destined to external output port 2 are spread evenly by the demultiplexers across the k=3 layers, so every internal link carries at most R/3; the multiplexer at output 2 recombines them at rate R. N=4, k=3.]
Potential Pitfalls - Concentration
“Concentration is when a large number of cells destined to the same output
are concentrated on a small fraction of internal layers”
[Figure: concentration — cells destined to output port 2 land disproportionately on one layer, whose internal link must then carry 2R/3 instead of R/3. N=4, k=3.]
Can concentration always be avoided?

[Figure: panels (a)–(d) with ports A, B, C at rate R and k=3 layers. Cells C1:A,1; C2:A,2; C3:A,1 arrive at t=0 and depart at t=1; cells C4:B,2 and C5:B,2 arrive at t=0’ and depart at t=1’. Whichever layers the earlier cells are assigned to, the later arrivals can collide on a layer, illustrating that concentration cannot always be avoided.]
Link Constraints

• Input Link Constraint (ILC)
  – This constraint is due to the switch architecture: an external input port can send a cell to a specific layer at most once every ceil(k/S) time slots.
  – Each arriving cell must adhere to this constraint.
• Output Link Constraint (OLC)
  – A similar constraint exists for each output port.

[Figure: a demultiplexer with a speedup of 2 and 10 links, shown after t=4 and after t=5.]
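To put numbers on the constraint (the arithmetic is mine, using the figure's parameters): with k = 10 links and a speedup of S = 2, ceil(k/S) = ceil(10/2) = 5, so an input that sends a cell to some layer at t = 4 may not send to that same layer again before t = 9.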
AIL and AOL Sets

• Available Input Link Set: AIL(i,n) is the set of layers to which external input port i can start sending a cell in time slot n.
  – This is the set of layers to which external input i has not started sending any cells within the last ceil(k/S) time slots.
  – AIL(i,n) evolves over time.
  – AIL(i,n) is full when no cells have arrived at input i for ceil(k/S) time slots.
• Available Output Link Set: AOL(j,n’) is the set of layers that can send a cell to external output j at time slot n’ in the future.
  – This is the set of layers that have not started sending a new cell to external output j in the last ceil(k/S) time slots before time slot n’.
  – AOL(j,n’) evolves over time as cells are sent to output j.
  – AOL(j,n’) is never full as long as there are cells in the system destined to output j.
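As a minimal sketch of how AIL(i,n) can be maintained (the class and method names are mine, not from the talk; AOL(j,n’) is symmetric on the output side):

```python
import math

class AILTracker:
    """AIL(i, n) for one input i: the layers it has not started
    sending a cell to within the last ceil(k/S) time slots."""

    def __init__(self, k, S):
        self.k = k
        self.window = math.ceil(k / S)
        self.last_sent = {}          # layer -> time slot of last send

    def ail(self, n):
        # A layer is available iff its last use is at least window slots old.
        return {l for l in range(self.k)
                if n - self.last_sent.get(l, -self.window) >= self.window}

    def send(self, layer, n):
        assert layer in self.ail(n), "violates the input link constraint"
        self.last_sent[layer] = n

tracker = AILTracker(k=10, S=2)      # window = ceil(10/2) = 5
tracker.send(3, n=0)
print(3 in tracker.ail(1))           # False: layer 3 is blocked until slot 5
print(3 in tracker.ail(5))           # True: available again
```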
Bounding AIL and AOL
• Lemma 1: |AIL(i,n)| >= k - ceil(k/S) + 1
• Lemma 2: |AOL(j,n’)| >= k - ceil(k/S) + 1

The reasoning behind Lemma 1: in the ceil(k/S) - 1 time slots before slot n, input i can have started sending at most ceil(k/S) - 1 cells, each of which excludes at most one of the k layers, so at least k - ceil(k/S) + 1 layers remain available. For example, with k = 10 and S = 2, at least 10 - 5 + 1 = 6 layers are always in AIL(i,n).

[Figure: a demultiplexer with k links at t = n; at most ceil(k/S) - 1 links are excluded, leaving at least k - ceil(k/S) + 1 in AIL(i,n).]
Theorems

• Theorem 1 (Sufficiency): If a PPS guarantees that each arriving cell is allocated to a layer l such that l ∈ AIL(i,n) and l ∈ AOL(j,n’) (i.e. if it meets both the ILC and the OLC), then the switch is work-conserving.

[Figure: AIL(i,n) and AOL(j,n’) drawn as overlapping sets; a cell may be sent to any layer in the intersection.]

• Theorem 2 (Sufficiency): A speedup of 2k/(k+2) is sufficient for a PPS to meet both the input and output link constraints for every cell.
  – Corollary: A PPS is work-conserving if S >= 2k/(k+2).
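A quick evaluation of the Theorem 2 bound (my arithmetic) shows how modest the required speedup is:

```python
# Sufficient speedup 2k/(k+2) from Theorem 2, for a few layer counts k.
for k in (2, 3, 10, 100):
    print(k, round(2 * k / (k + 2), 3))
# k=2 -> 1.0, k=3 -> 1.2, k=10 -> 1.667, k=100 -> 1.961: always below 2.
```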
Theorems, contd.

• Theorem 3 (Sufficiency): A PPS can exactly mimic an FCFS-OQ switch with a speedup of 2k/(k+2).

Analogy to Clos networks?
Summary of Results
CPA - Centralized PPS Algorithm
• Each input maintains its AIL set.
• The AIL sets are broadcast to a central scheduler.
• CPA computes the intersection of AIL and AOL.
• CPA timestamps the cells.
• Cells are output in the order of their global timestamps.

If the speedup S >= 2, then
– CPA is work-conserving
– CPA is perfectly load-balancing
– CPA can perfectly mimic an FCFS OQ switch
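A minimal sketch of CPA's control flow (the class layout and names are mine; the talk gives only the steps above, and I simplify by checking the OLC window at the arrival slot n rather than the cell's future departure slot n’):

```python
import math, itertools

class CPA:
    """Sketch of the Centralized PPS Algorithm: for each arriving
    cell, pick a layer in AIL(i,n) ∩ AOL(j,n') and timestamp it."""

    def __init__(self, N, k, S=2):
        self.k = k
        self.window = math.ceil(k / S)
        self.in_last = [{} for _ in range(N)]   # input i  -> {layer: last slot}
        self.out_last = [{} for _ in range(N)]  # output j -> {layer: last slot}
        self.stamp = itertools.count()          # global timestamp source

    def _avail(self, last, n):
        # Layers not used within the last ceil(k/S) time slots.
        return {l for l in range(self.k)
                if n - last.get(l, -self.window) >= self.window}

    def schedule(self, i, j, n):
        choices = (self._avail(self.in_last[i], n)
                   & self._avail(self.out_last[j], n))
        if not choices:
            raise RuntimeError("no layer satisfies both ILC and OLC")
        l = min(choices)                        # any member of the set works
        self.in_last[i][l] = self.out_last[j][l] = n
        return l, next(self.stamp)              # (layer, global timestamp)

cpa = CPA(N=4, k=3)
print(cpa.schedule(i=0, j=2, n=0))  # (0, 0)
print(cpa.schedule(i=1, j=2, n=0))  # (1, 1): output 2's layer 0 is taken
```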
Motivation for a Distributed Solution

• The centralized algorithm is not practical
  – N sequential decisions must be made
  – Each decision is a set intersection
  – It does not scale with N, the number of input ports
• Ideally, we would like a distributed algorithm where each input makes its decision independently.
• Caveats
  – A totally distributed solution leads to concentration
  – A speedup of k might be required
Potential Pitfall
“If inputs act independently, the PPS can immediately become non-work-conserving”

[Figure: N=4, k=3 — acting independently, several inputs choose the same internal layer for cells destined to the same output, overloading one R/k link while the other layers (and the output) sit idle.]

Possible remedies:
• Decrease the number of inputs which request simultaneously
• Give the scheduler choice
• Increase the speedup appropriately
DPA - Distributed PPS Algorithm
• Inputs are partitioned into k groups of size floor(N/k)
• N schedulers
  – One for each output
  – Each maintains AOL(j,n’)
• There are ceil(N/k) scheduling stages, each with three phases (see the sketch after this list):
  – Broadcast phase
  – Request phase
    • Each input requests a layer which satisfies the ILC & OLC (primary request)
    • Each input also requests a duplicate layer (duplicate request), chosen by the duplication function
  – Grant phase
    • The scheduler grants each input one of its two requests
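A rough sketch of one grant phase under these rules (the grant policy and data layout are my simplifications, not the talk's exact rule):

```python
def dpa_stage(requests, grants_per_layer):
    """One DPA grant phase, heavily simplified.
    requests: input -> (primary_layer, duplicate_layer_or_None);
    grants_per_layer caps how many cells one layer accepts per stage
    (in the real algorithm this capacity is tied to the speedup S).
    Returns input -> granted layer."""
    load, granted = {}, {}
    for i, (primary, duplicate) in requests.items():
        for layer in (primary, duplicate):
            if layer is not None and load.get(layer, 0) < grants_per_layer:
                granted[i] = layer
                load[layer] = load.get(layer, 0) + 1
                break
    return granted

# Inputs 1, 3, 4 (as in the example two slides below); input 4 is in
# group k and therefore sends no duplicate request.
print(dpa_stage({1: (1, 2), 3: (1, 3), 4: (1, None)}, grants_per_layer=2))
# {1: 1, 3: 1} -- input 4 goes ungranted in this toy version; DPA's
# speedup bound is chosen precisely so a feasible grant always exists.
```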
The Duplicate Request Function
l’ = (l + g) mod k

where:
• input i ∈ group g
• the primary request is to layer l
• l’ is the duplicate request layer
• k is the number of layers

“Inputs belonging to group k do not send duplicate requests”

Duplicate layer l’ for each group g and primary layer l (k = 3):

            Layer 1   Layer 2   Layer 3
  Group 1      2         3         1
  Group 2      3         1         2
  Group 3      1         2         3

Group 3 (= group k) maps every layer to itself, which is why its inputs send no duplicate requests.
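A few lines reproduce the table; since layers and groups are numbered from 1, the mod is applied in 1-based form (this 1-based adjustment is my reading of the table, not stated on the slide):

```python
def duplicate_layer(l, g, k):
    """Duplicate request layer l' = (l + g) mod k, with 1-based
    layer/group numbering so the mod wraps k back to k, not 0."""
    return (l + g - 1) % k + 1

k = 3
for g in range(1, k + 1):
    print(f"Group {g}:", [duplicate_layer(l, g, k) for l in range(1, k + 1)])
# Group 1: [2, 3, 1]; Group 2: [3, 1, 2]; Group 3: [1, 2, 3] (identity)
```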
Key Idea - Duplicate Requests
[Figure: N=4, k=3 — cells C1, C2, C3, and C4, all destined to output B, arrive at the four external inputs and must be spread across the three output-queued switches.]

Group 1 = {1, 2}; Group 2 = {3}; Group 3 = {4}
Inputs 1, 3, and 4 participate in the first scheduling stage.
Input 4 belongs to group 3 and does not send a duplicate request.
Understanding the Scheduling Stage in DPA
[Figure: five layers drawn as nodes 1–5, with request tuples drawn between them.]

• A set of x nodes can pack at most x(x-1) + 1 request tuples.
• A set of x request tuples spans at least ceil[sqrt(x)] layers.
• The maximum number of requests which need to be granted to a single layer in a given scheduling stage is therefore bounded by ceil[sqrt(k)].

So a speedup of around sqrt(k) suffices?
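Putting numbers on the bound (my arithmetic): with k = 16 layers, at most ceil(sqrt(16)) = 4 requests must be granted to any one layer per scheduling stage, so a speedup near sqrt(k) = 4 absorbs the worst case.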
DPA … results

• Fact 1 (Work conservation - necessary condition for a PPS)
  – For the PPS to be work-conserving, we require that no more than S cells be scheduled to depart from the same layer in a given window of k time slots.
• Fact 2 (Work conservation - sufficiency for DPA)
  – If in any scheduling stage we present only layers which have fewer than S - ceil[sqrt(k)] cells belonging to the present k-window in the AOL, then DPA always remains work-conserving.
• Fact 3: We must ensure that there always exist two layers l and l’ such that
  – l ∈ AIL & AOL
  – l’ is the duplicate of l
  – l’ ∈ AIL & AOL as well
• A speedup of S suffices, where
  – S > ceil[sqrt(k)] + 3, k > 16
  – S > ceil[sqrt(k)] + 4, k > 2
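For instance (my arithmetic on the stated bound): k = 25 gives ceil(sqrt(25)) = 5, so a speedup S > 8 suffices — far below the speedup of k that a totally distributed scheme might require, though above the centralized bound of 2.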
Conclusions & Future Work
• CPA is not practical
• DPA has to be made simpler
• Extend the results to non-FIFO QoS policies in a PPS
• Study multicasting in a PPS
Questions, please!