The Parallel Packet Switch
Sundar Iyer, Amr Awadallah & Nick McKeown
High Performance Networking Group,
Stanford University.
Web Site: http://klamath.stanford.edu/fjr
Stanford University © 1999
Contents
Motivation
Introduction
Key Ideas
Speedup, Concentration, Constraints
Centralized Algorithm
– Theorems, Results & Summary
Motivation for a Distributed Algorithm
Concepts
Independence, Trade-Off, Request Duplication
Performance of DPA
Conclusions & Future Work
Motivation
To build
– a switch with memories running slower than the line rate
– a switch with a highly scalable architecture
To build
– an extremely high-speed packet switch
– a switch with extremely high line rates
Quality of Service
Redundancy
“I want an ideal switch”
Architecture Alternatives
[Figure: a three-axis design space (X: ease of implementation, Y: QoS support, Z: memory speed at 1x, 2x, ... Nx) placing the input-queued, output-queued and CIOQ switches, with the PPS as a candidate for the “Ideal!” corner.]
An Ideal Switch:
• The memory runs at lower than line rate speeds
• Supports QoS
• Is easy to implement
What is a Parallel Packet Switch?
A parallel packet switch (PPS) consists of multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet by packet, by a demultiplexer across the slower packet switches, then recombined by a multiplexer at the output.
[Figure: a PPS with N=4 external ports at rate R and k=3 internal output-queued switches. Each input’s demultiplexer spreads cells over the three layers on links of rate R/k, and a multiplexer at each output recombines them at rate R.]
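To make the data path concrete, here is a minimal Python sketch of the spread-and-recombine structure. The round-robin spreading policy and all names are illustrative assumptions; the following slides develop what the demultiplexer should actually do.

```python
from collections import deque

class PPS:
    """Minimal PPS data-path sketch. Round-robin spreading is an assumed
    placeholder policy, not the algorithm developed in these slides."""
    def __init__(self, num_ports, num_layers):
        self.k = num_layers
        self.next_layer = [0] * num_ports       # per-input round-robin pointer
        # layers[l][j]: FIFO of cells buffered in layer l for output j
        # (buffering happens only inside the internal switches)
        self.layers = [[deque() for _ in range(num_ports)]
                       for _ in range(num_layers)]

    def demux(self, in_port, out_port, cell):
        # Spread arriving cells, cell by cell, across the k slower switches.
        l = self.next_layer[in_port]
        self.next_layer[in_port] = (l + 1) % self.k
        self.layers[l][out_port].append(cell)

    def mux(self, out_port):
        # Recombine: pull the next cell for this output from any layer holding one.
        for layer in self.layers:
            if layer[out_port]:
                return layer[out_port].popleft()
        return None   # the output idles this slot
```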
Key Ideas in a Parallel Packet Switch
• Key concept: “inverse multiplexing”
• Buffering occurs only in the internal switches!
• By choosing a large value of k, we would like to make the memory speed within the switch arbitrarily low
Can such a switch work “ideally”?
Can it give the advantages of an output-queued switch?
What should the multiplexer and demultiplexer do?
Doesn’t the switch trivially behave well?
Definitions
Output Queued Switch
– A switch in which arriving packets are placed immediately in queues at the output, where they contend with packets destined to the same output while waiting their turn to depart.
– “We would like to perform as well as an output queued switch”
Mimic (Black Box Model)
– Two different switches are said to mimic each other if, under identical inputs, identical packets depart from each switch at the same time.
Work Conserving
– A system is said to be work-conserving if its outputs never idle unnecessarily.
– “If you’ve got something to do, do it now!”
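The work-conserving property can be phrased as a simple check over a departure trace. A minimal sketch, with an assumed trace format of (cells queued for the output, did a cell depart) per time slot:

```python
def is_work_conserving(trace):
    """True iff the output never idles unnecessarily: in every time slot
    where at least one cell for it is buffered somewhere, a cell departs.
    trace: iterable of (queued, departed) pairs, one per time slot."""
    return all(departed or queued == 0 for queued, departed in trace)

# Example: the third slot has a queued cell but no departure,
# so this system is not work-conserving.
print(is_work_conserving([(1, True), (2, True), (1, False)]))  # False
```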
Ideal Scenario
[Figure: the N=4, k=3 PPS with internal links at rate R/3. Successive packets destined to output port two are spread evenly, one per layer, so every internal link and memory runs at only R/3.]
Potential Pitfalls - Concentration
“Concentration is when a large number of cells destined to the same output
are concentrated on a small fraction of internal layers”
[Figure: the same N=4, k=3 PPS, but packets destined to output port two are concentrated on layer 2, whose link to the multiplexer would have to run at 2R/3, faster than the internal rate of R/3.]
Can concentration always be avoided?
[Figure: a worked example with ports A, B and C. In panel (c), cells C1:A,1, C2:A,2 and C3:A,1 arrive at t=0 and depart at t=1; in panel (d), cells C4:B,2 and C5:B,2 arrive at t=0’ and depart at t=1’. Earlier layer choices constrain later ones, suggesting that concentration cannot always be avoided.]
Link Constraints
Input Link Constraint (ILC)
– This constraint is due to the switch architecture: an external input port can send a cell to a specific layer at most once every ceil(k/S) time slots.
– Each arriving cell must adhere to this constraint.
Output Link Constraint (OLC)
– A similar constraint exists for each external output port.
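As a concrete instance, matching the figure that follows: with k = 10 internal links and a speedup of S = 2, ceil(k/S) = ceil(10/2) = 5, so an input may send a cell to any given layer at most once every 5 time slots.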
[Figure: a demultiplexer with 10 links and a speedup of 2, shown after t=4 and after t=5; each link is reused at most once every ceil(k/S) time slots.]
AIL and AOL Sets
Available Input Link Set: AIL(i,n) is the set of layers to which external input port i can start sending a cell in time slot n.
– This is the set of layers to which external input i has not started sending any cells within the last ceil(k/S) time slots.
– AIL(i,n) evolves over time.
– AIL(i,n) is full when no cells have arrived at input i for ceil(k/S) time slots.
Available Output Link Set: AOL(j,n’) is the set of layers that can send a cell to external output j at time slot n’ in the future.
– This is the set of layers that have not started to send a new cell to external output j in the last ceil(k/S) time slots before time slot n’.
– AOL(j,n’) evolves over time as cells are sent to output j.
– AOL(j,n’) is never full as long as there are cells in the system destined to output j.
Bounding AIL and AOL
Lemma 1: |AIL(i,n)| >= k - ceil(k/S) + 1
Lemma 2: |AOL(j,n’)| >= k - ceil(k/S) + 1
[Figure: of the k links at a demultiplexer, at most ceil(k/S) - 1 can be unavailable to input i at t=n, leaving at least k - ceil(k/S) + 1 layers in AIL(i,n).]
Theorems
Theorem 1 (Sufficiency): If a PPS guarantees that each arriving cell is allocated to a layer l such that l ∈ AIL(i,n) and l ∈ AOL(j,n’) (i.e. if it meets both the ILC and the OLC), then the switch is work-conserving.
[Figure: AIL(i,n) and AOL(j,n’) drawn as overlapping sets; each cell is sent to a layer in the intersection.]
Theorem 2 (Sufficiency): A speedup of 2k/(k+2) is sufficient for a PPS to meet both the input and output link constraints for every cell.
– Corollary: A PPS is work-conserving if S >= 2k/(k+2).
Theorems (contd.)
Theorem 3 (Sufficiency): A PPS can exactly mimic an FCFS-OQ switch with a speedup of 2k/(k+2).
Analogy to Clos?
Summary of Results
CPA - Centralized PPS Algorithm
Each input maintains its AIL set.
The AIL sets are broadcast to a central scheduler.
CPA computes the intersection of the AIL and AOL sets (a sketch follows below).
CPA timestamps the cells.
The cells are output in the order of the global timestamp.
If the speedup S >= 2, then
– CPA is work conserving
– CPA is perfectly load balancing
– CPA can perfectly mimic an FCFS OQ Switch
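A hypothetical Python sketch of this bookkeeping. Class and variable names are assumptions, and the real CPA also handles timestamping and departure ordering, which are omitted here:

```python
import math

class CPAScheduler:
    """Sketch of CPA's layer-selection step; an illustration, not the
    authors' implementation. Layers are numbered 1..k."""
    def __init__(self, num_layers, speedup):
        self.k = num_layers
        self.window = math.ceil(num_layers / speedup)   # ceil(k/S)
        self.last_sent = {}   # (input i, layer l) -> last arrival slot used
        self.last_recv = {}   # (output j, layer l) -> last departure slot used

    def ail(self, i, n):
        # AIL(i,n): layers input i has not sent to in the last ceil(k/S) slots.
        return {l for l in range(1, self.k + 1)
                if n - self.last_sent.get((i, l), -self.window) >= self.window}

    def aol(self, j, n_dep):
        # AOL(j,n'): layers that have not sent a new cell to output j in the
        # ceil(k/S) slots before departure slot n'.
        return {l for l in range(1, self.k + 1)
                if n_dep - self.last_recv.get((j, l), -self.window) >= self.window}

    def schedule(self, i, j, n, n_dep):
        # Allocate a layer in AIL(i,n) ∩ AOL(j,n'); by Theorem 1 this keeps
        # the PPS work-conserving, and Theorem 2's speedup bound is what
        # guarantees the intersection is non-empty.
        layer = min(self.ail(i, n) & self.aol(j, n_dep))  # any member works
        self.last_sent[(i, layer)] = n
        self.last_recv[(j, layer)] = n_dep
        return layer
```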
Motivation for a Distributed Solution
Centralized Algorithm not practical
– N sequential decisions to be made
– Each decision is a set intersection
– Does not scale with N, the number of input ports
Ideally, we would like a distributed algorithm where
each input makes its decision independently.
Caveats
– A totally distributed solution leads to concentration
– A speedup of k might be required
Potential Pitfall
“If inputs act independently, the PPS can immediately become non-work-conserving”
[Figure: an N=4, k=3 PPS in which independently acting inputs choose the same layer for cells to the same output, concentrating traffic on one internal link while other layers idle.]
• Decrease the number of inputs which request simultaneously
• Give the scheduler choice
• Increase the speedup appropriately
DPA - Distributed PPS Algorithm
Inputs are partitioned into k groups of size floor(N/k)
N schedulers
– One for each output
– Each maintains AOL(j,n’)
There are ceil(N/k) scheduling stages:
– Broadcast phase
– Request phase
  Each input requests a layer which satisfies the ILC & OLC (its primary request)
  Each input also requests a duplicate layer (its duplicate request), given by the duplication function on the next slide
– Grant phase
  The scheduler grants each input one of its two requests (sketched in code below)
The Duplicate Request Function
l’ = (l + g) mod k
Input i ∈ group g
The primary request is to layer l
l’ is the duplicate request layer
k is the number of layers
“Inputs belonging to group k do not send duplicate requests”
Group \ Layer | Layer 1 | Layer 2 | Layer 3
Group 1       |    2    |    3    |    1
Group 2       |    3    |    1    |    2
Group 3       |    1    |    2    |    3
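A hypothetical Python sketch of the duplication function and of one request/grant stage. The names, the least-loaded tie-break in the grant phase, and the omission of the duplicate-feasibility check (which Fact 3 on a later slide addresses) are all assumptions:

```python
def duplicate_layer(l, g, k):
    """Duplicate request l' = (l + g) mod k, with layers and groups numbered
    1..k as in the table above; a result of 0 wraps to layer k. Group k maps
    every layer to itself, which is why its inputs send no duplicate."""
    if g == k:
        return None
    lp = (l + g) % k
    return lp if lp != 0 else k

def dpa_stage(stage_inputs, ail, aol, k):
    """One DPA scheduling stage for a single output's scheduler (simplified).
    stage_inputs: (input_id, group) pairs taking part in this stage.
    ail: input_id -> that input's AIL set; aol: this scheduler's AOL set.
    Returns input_id -> granted layer."""
    # Request phase: a primary layer meeting ILC & OLC, plus its duplicate.
    requests = {}
    for i, g in stage_inputs:
        primary = min(ail[i] & aol)                # any feasible layer works
        dup = duplicate_layer(primary, g, k)
        requests[i] = [primary] if dup is None else [primary, dup]

    # Grant phase: grant one of the (at most) two requests per input,
    # steering grants toward less-loaded layers to limit concentration.
    grants, load = {}, {l: 0 for l in range(1, k + 1)}
    for i, reqs in requests.items():
        choice = min(reqs, key=lambda l: load[l])
        grants[i] = choice
        load[choice] += 1
    return grants
```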
Key Idea - Duplicate Requests
[Figure: an N=4, k=3 PPS; cells C1, C2, C3 and C4, all destined to output B, arrive at inputs 1–4.]
Group 1 = {1, 2}; Group 2 = {3}; Group 3 = {4}
Inputs 1, 3 and 4 participate in the first scheduling stage
Input 4 belongs to group 3 (= k) and does not send a duplicate request
Understanding the Scheduling Stage in DPA
[Figure: five nodes, labeled 1 to 5, illustrating how request tuples pack onto layers.]
A set of x nodes can pack at most x(x-1) + 1 request tuples
A set of x request tuples spans at least ceil[sqrt(x)] layers
The maximum number of requests which need to be granted to a single layer in a given scheduling stage is bounded by ceil[sqrt(k)]
So a speedup of around sqrt(k) suffices?
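To make the bound concrete: with k = 16 layers, at most ceil[sqrt(16)] = 4 requests can land on a single layer in one scheduling stage, so a speedup of roughly sqrt(k) = 4, plus the small constant quantified on the next slide, absorbs them.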
DPA … Results
Fact 1 (Work conservation, a necessary condition for the PPS):
– For the PPS to be work conserving, we require that no more than S cells be scheduled to depart from the same layer in a given window of k time slots.
Fact 2 (Work conservation, sufficiency for DPA):
– If in any scheduling stage we present only layers which have fewer than S - ceil[sqrt(k)] cells belonging to the present k-window slot in the AOL, then DPA will always remain work conserving.
Fact 3: We have to ensure that there always exist two layers l and l’ such that
– l ∈ AIL & AOL
– l’ is the duplicate of l
– l’ is also ∈ AIL & AOL
A speedup of S suffices, where
– S > ceil[sqrt(k)] + 3, for k > 16
– S > ceil[sqrt(k)] + 4, for k > 2
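As an illustration: for k = 25 layers, S > ceil[sqrt(25)] + 3 = 8 suffices, far below the speedup of k = 25 that a fully independent scheme might require.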
Conclusions & Future Work
CPA is not practical
DPA has to be made simpler
• Extend the results to non-FIFO QoS policies in a PPS
• Study multicasting in a PPS
Questions, please!