
Router Architecture
A. Jantsch / I. Sander / Z. Lu
[email protected]


The discussion concentrates on a typical virtual-channel router.
We will cover:
- Functionality
- Router pipeline and pipeline stalls

- Modern routers are pipelined and work at the flit level
- A minimal buffer size is needed to allow full operation speed
- Most routers use credits to allocate buffer space

July 21, 2015
SoC Architecture
A typical virtual channel router

A router's functional blocks can be divided into:
- Datapath: handles storage and movement of a packet's payload
  - Input buffers
  - Switch
  - Output buffers
- Control plane: coordinates the movement of packets through the resources of the datapath
  - Route computation
  - VC allocator
  - Switch allocator
A typical virtual channel router

[Figure: pipeline stages: route computation, VC allocation, switch allocation, switch traversal, VC deallocation]
A typical virtual channel router

The input unit:
- contains a set of flit buffers
- maintains the state for each virtual channel:
  - G = global state
  - R = route (output port)
  - O = output VC
  - P = head and tail pointers
  - C = credit count
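The per-VC state fields listed above can be sketched as a small record. The field names G, R, O, P, and C follow the slide; the enum values match the global states (I, R, V, A) used later in the pipeline walkthrough, while the concrete types and defaults are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class GlobalState(Enum):
    # Global state values used in the pipeline walkthrough.
    IDLE = "I"        # G = I: no packet assigned to this VC
    ROUTING = "R"     # G = R: waiting for route computation
    WAITING_VC = "V"  # G = V: waiting for an output VC
    ACTIVE = "A"      # G = A: forwarding flits

@dataclass
class InputVCState:
    """State fields maintained per input virtual channel (sketch)."""
    G: GlobalState = GlobalState.IDLE  # global state
    R: int = -1                        # route (output port), -1 = not yet computed
    O: int = -1                        # allocated output VC, -1 = none
    P: tuple = (0, 0)                  # head and tail pointers into the flit buffer
    C: int = 0                         # credit count for the downstream buffer

vc = InputVCState()
```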
Virtual channel state fields (Input)
A typical virtual channel router

- During route computation, the output port for the packet is determined
- Then the packet requests an output virtual channel from the virtual-channel allocator
A typical virtual channel router

- Flits are forwarded via the virtual channel by allocating a time slot on the switch and the output channel using the switch allocator
- Flits are forwarded to the appropriate output during this time slot
- The output unit forwards the flits to the next router in the packet's path
Virtual channel state fields (Output)
Packet Rate and Flit Rate

The control of the router operates at two distinct frequencies:
- Packet rate (performed once per packet, only with head flits):
  - Route computation
  - Virtual-channel allocation
- Flit rate (performed once per flit, for all flits):
  - Switch allocation
  - Pointer and credit count update
The Router Pipeline

A typical router pipeline includes the following stages:
- RC (Routing Computation)
- VA (Virtual Channel Allocation)
- SA (Switch Allocation)
- ST (Switch Traversal)

(no pipeline stalls)
The Router Pipeline

Cycle 0:
- Head flit arrives and the packet is directed to a virtual channel of the input port (G = I)

(no pipeline stalls)
The Router Pipeline

Cycle 1:
- Routing computation
- Virtual channel state changes to routing (G = R)
- Head flit enters the RC stage
- First body flit arrives at the router

(no pipeline stalls)
The Router Pipeline

Cycle 2: Virtual Channel Allocation
- Route field (R) of the virtual channel is updated
- Virtual channel state is set to "waiting for output virtual channel" (G = V)
- Head flit enters the VA stage
- First body flit enters the RC stage
- Second body flit arrives at the router

(no pipeline stalls)
The Router Pipeline

Cycle 2: Virtual Channel Allocation
- The result of the routing computation is input to the virtual-channel allocator
- If successful, the allocator assigns a single output virtual channel
- The state of the virtual channel is set to active (G = A)

(no pipeline stalls)
The Router Pipeline

Cycle 3: Switch Allocation
- All further processing is done on a flit basis
- Head flit enters the SA stage
- Any active VC (G = A) that contains buffered flits (indicated by P) and has downstream buffers available (C > 0) bids for a single-flit time slot through the switch from its input VC to the output VC

(no pipeline stalls)
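The bidding condition above can be written down directly: a VC competes for a switch time slot only if it is active, has at least one buffered flit, and holds at least one credit. A minimal sketch, with illustrative names:

```python
def bids_for_switch(state, buffered_flits, credits):
    """A VC bids for a single-flit switch time slot iff it is active
    (G = A), has at least one buffered flit (indicated by P), and the
    downstream buffer has space (C > 0)."""
    return state == "A" and buffered_flits > 0 and credits > 0
```

All three conditions must hold; failing any one of them is exactly a flit stall (lack of flit, lack of credit) or the VC simply not being eligible yet.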
The Router Pipeline

Cycle 3: Switch Allocation
- If successful, the pointer field is updated
- The credit field is decremented

(no pipeline stalls)
The Router Pipeline

Cycle 4: Switch Traversal
- Head flit traverses the switch

Cycle 5:
- Head flit starts traversing the channel to the next router

(no pipeline stalls)
The Router Pipeline

Cycle 7:
- Tail flit traverses the switch
- Output VC is set to idle
- Input VC is set to idle (G = I) if the buffer is empty
- Input VC is set to routing (G = R) if another head flit is in the buffer

(no pipeline stalls)
The Router Pipeline

- Only the head flits enter the RC and VA stages
- The body and tail flits are stored in the flit buffers until they can enter the SA stage

(no pipeline stalls)
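The stall-free cycle walkthrough above can be condensed into a small schedule: the head flit passes RC, VA, SA, ST in consecutive cycles, while body and tail flits skip RC and VA and follow one cycle apart through SA and ST. A sketch, assuming arrival at cycle 0 and no stalls:

```python
def schedule(num_flits):
    """Return {flit_index: {stage: cycle}} for a packet arriving at
    cycle 0, assuming no pipeline stalls."""
    sched = {}
    # Head flit: RC in cycle 1, VA in cycle 2, SA in cycle 3, ST in cycle 4.
    sched[0] = {"RC": 1, "VA": 2, "SA": 3, "ST": 4}
    for i in range(1, num_flits):
        # Body/tail flits skip RC and VA: one flit per cycle through SA.
        sa = sched[i - 1]["SA"] + 1
        sched[i] = {"SA": sa, "ST": sa + 1}
    return sched

s = schedule(3)  # head flit plus two body flits
```

With three flits this reproduces the walkthrough: the head flit traverses the switch in cycle 4, and each following flit traverses one cycle later.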
Pipeline Stalls

Pipeline stalls can be divided into:
- Packet stalls
  - can occur if the virtual channel cannot advance to its R, V, or A state
- Flit stalls
  - occur if a virtual channel is in the active state and a flit cannot successfully complete switch allocation due to
    - lack of a flit
    - lack of credit
    - losing arbitration for the switch time slot
Example of a Packet Stall

Virtual-channel allocation stall: the head flit of packet A can only enter the VA stage once the tail flit of packet B completes switch allocation and releases the virtual channel.
Example of a Flit Stall

Switch allocation stall: the second body flit fails to allocate the requested connection in cycle 5.
Example of a Flit Stall

Buffer empty stall: body flit 2 is delayed three cycles. However, since it does not have to enter the RC and VA stages, the output is only delayed one cycle.
Credits

- A buffer is allocated in the SA stage on the upstream (transmitting) node
- To reuse the buffer, a credit is returned over a reverse channel after the same flit departs the SA stage of the downstream (receiving) node
- When the credit reaches the input unit of the upstream node, the buffer is available to be reused
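The upstream-side bookkeeping this implies is just a counter: one credit per free downstream buffer, decremented when a flit wins switch allocation and incremented when a credit returns. A minimal sketch with illustrative names:

```python
class CreditCounter:
    """Upstream-side credit bookkeeping for one virtual channel.

    A credit is consumed when a flit wins switch allocation (a downstream
    buffer is reserved for it) and returned when the flit departs the
    downstream SA stage and the credit travels back upstream."""

    def __init__(self, buffers):
        self.credits = buffers  # initially, all downstream buffers are free

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        if not self.can_send():
            raise RuntimeError("credit stall: no downstream buffer available")
        self.credits -= 1

    def receive_credit(self):
        self.credits += 1

cc = CreditCounter(4)  # e.g. four downstream flit buffers
```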
Credits

The credit loop can be viewed as a token that:
- starts at the SA stage of the upstream node
- travels downstream with the flit
- reaches the SA stage of the downstream node
- returns upstream as a credit
Credit Loop Latency

The credit loop latency t_crt, expressed in flit times, gives a lower bound on the number of flit buffers needed on the upstream side for the channel to operate at full bandwidth.

t_crt in flit times is given by

    t_crt = t_f + t_c + 2*T_w + 1

where t_f is the flit pipeline delay, t_c is the credit pipeline delay, and T_w is the one-way wire delay.
Credit Loop Latency

If the number of flit buffers available per virtual channel is F, the duty factor of the channel will be

    d = min(1, F / t_crt)

The duty factor will be 100% as long as there are sufficient flit buffers to cover the round-trip latency.
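Plugging in the numbers from the credit-stall example on the next slide (t_f = 4, t_c = 2, T_w = 2, four flit buffers per VC) makes both formulas concrete:

```python
def credit_loop_latency(t_f, t_c, T_w):
    # t_crt = t_f + t_c + 2*T_w + 1: flit pipeline delay, credit pipeline
    # delay, wire delay in both directions, plus one flit time.
    return t_f + t_c + 2 * T_w + 1

def duty_factor(F, t_crt):
    # d = min(1, F / t_crt): full bandwidth only if the F buffers
    # cover the credit round trip.
    return min(1.0, F / t_crt)

t_crt = credit_loop_latency(t_f=4, t_c=2, T_w=2)  # = 11
d = duty_factor(F=4, t_crt=t_crt)                 # 4/11, i.e. about 36%
```

With only 4 buffers against a round trip of 11 flit times, the channel runs well below full bandwidth; 11 buffers would bring the duty factor back to 100%.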
Credit Stall

[Figure: credit stall in a virtual-channel router with 4 flit buffers. White: upstream pipeline stages; grey: downstream pipeline stages; credit transmit and credit update are shown on the reverse channel. With t_f = 4, t_c = 2, and T_w = 2, t_crt = 11.]
Flit and Credit Encoding

(a) Flits and credits are sent over separate lines with separate widths.
(b) Flits and credits are transported over the same line. This can be done by
- including credits in flits
- multiplexing flits and credits at the phit level
Network Interfaces
A. Jantsch / I. Sander / Z. Lu
[email protected]
Network-on-Chip

[Figure: a network-on-chip as a fabric of switches (S) connecting terminals (T)]

- Information in the form of packets is routed via channels and switches from one terminal node to another
- The interface between the interconnection network and the terminals (clients) is called the network interface
Network Interface

[Figure: Network (Switch) connected via a Network Interface to a Terminal Node (Resource)]

- Different terminals with different interfaces shall be connected to the network
- The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol
Network Interface

The network interface plays an important role in a network-on-chip:
- it shall translate between the terminal protocol and the protocol of the network
- it shall enable the client to use the full bandwidth at the lowest latency offered by the network itself

A poorly designed network interface is a bottleneck and can increase the latency considerably.
Network Interfaces

- For message passing: symmetric
  - Processor-network interface
- For shared memory: asymmetric, load & store
  - Processor-network interface
  - Memory-network interface
- Line-fabric interface
  - connects an external network channel to an interconnection network that is used as a switching fabric
  - input queuing and output queuing
Network Interfaces for Message Passing

- Two-register interface
- Register-mapped interface
- Descriptor-based interface
- Message reception
Two-Register Interface

- For sending, the processor writes to a specific Net-out register
- For receiving, the processor reads a specific Net-in register

Pro:
- Efficient for short messages

Cons:
- Inefficient for long messages
- The processor acts as a DMA controller
- Not safe: a misbehaving processor can send the first part of a message and then delay sending the end of the message indefinitely. The partial message can tie up network resources indefinitely.

[Figure: register file R0-R31 beside the Net-out and Net-in registers connecting to the network]
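What sending looks like from software can be sketched in a few lines; the names are illustrative, not from the slides. The loop makes both drawbacks visible: the processor itself moves every word (acting as a DMA controller), and nothing forces it to finish once the first word has entered the network.

```python
def send_message(net_out, message):
    """Send a message word by word through the Net-out register.

    `net_out` stands in for the memory-mapped Net-out register: each call
    pushes one word into the network. The processor performs every
    transfer itself, and if it stops mid-loop the partial message stays
    in the network, tying up resources indefinitely."""
    for word in message:
        net_out.append(word)  # one register write per word

network = []  # stand-in for the channel behind Net-out
send_message(network, [0xCAFE, 0xBEEF, 0x1234])
```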
Register-Mapped Interface

To solve the safety problem of the two-register interface, the processor composes a message in its registers and sends the message atomically into the network interface.

Pro:
- Safe, since it is impossible to leave a partial message in the network
- Efficient for short messages

Cons:
- Inefficient for long messages
- The processor acts as a DMA controller
Descriptor-Based Interface

- The processor composes a message in a set of dedicated message descriptor registers
- Each descriptor contains
  - an immediate value, or
  - a reference to a processor register, or
  - a reference to a block of memory
- A co-processor steps through the descriptors and composes the message
- Safe, because the network is protected from the processor's software

[Figure: descriptor list (Send Start, Immediate, RN, Addr/Length, End) referencing processor registers R0-R31 and memory]
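The co-processor's walk over the descriptor list can be sketched as follows. The descriptor encoding (tagged pairs, an END terminator) and all register and memory contents are hypothetical; only the three descriptor kinds come from the slide.

```python
regs = {5: 0xAAAA, 6: 0xBBBB}   # processor registers (illustrative)
memory = {0x100: [1, 2, 3, 4]}  # memory blocks keyed by address (illustrative)

def compose(descriptors):
    """Co-processor loop: flatten a descriptor list into a message."""
    message = []
    for kind, payload in descriptors:
        if kind == "IMM":    # an immediate value
            message.append(payload)
        elif kind == "REG":  # a reference to a processor register
            message.append(regs[payload])
        elif kind == "MEM":  # a reference to a block of memory
            addr, length = payload
            message.extend(memory[addr][:length])
        elif kind == "END":  # end of the descriptor list
            break
    return message

msg = compose([("IMM", 7), ("REG", 5), ("MEM", (0x100, 2)), ("END", None)])
```

Because the co-processor, not the processor's software, performs the walk, the message enters the network only after the full descriptor list is in place, which is what makes the scheme safe.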
Receiving Messages

- A co-processor or a dedicated thread is triggered upon reception of an incoming message
- It unpacks the message and stores it in local memory
- It informs the receiving task via an interrupt or a status register update
Shared Memory Interfaces

- The interconnection network is used to transmit memory read/write transactions between processors and memories
- We will further discuss
  - the processor-network interface
  - the memory-network interface
Processor-Network Interface

- Requests are stored in the request register
- Requests are tagged so that an answer can be associated with its request
- In case of a cache miss, requests are stored in an MSHR (miss status holding register)
Processor-Network Interface

- An uncacheable read request results in a pending read
- After forming and transmitting the message, the status changes to read requested
- When the network returns the reply, the status changes to read complete
- Completed MSHRs are forwarded to the reply register, and the status changes to idle
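The status sequence above is a small state machine: idle, read pending, read requested, read complete, and back to idle. A sketch, where the state names follow the slide and the event names are illustrative:

```python
# Legal MSHR status transitions for an uncacheable read (sketch).
TRANSITIONS = {
    ("idle", "uncacheable_read"): "read_pending",
    ("read_pending", "message_sent"): "read_requested",
    ("read_requested", "reply_received"): "read_complete",
    ("read_complete", "forwarded_to_reply_register"): "idle",
}

def step(status, event):
    """Advance the MSHR status; raises KeyError on an illegal transition."""
    return TRANSITIONS[(status, event)]

s = "idle"
for event in ("uncacheable_read", "message_sent",
              "reply_received", "forwarded_to_reply_register"):
    s = step(s, event)
# After the full read round trip, the MSHR entry is idle again.
```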
Processor-Network Interface

Cache coherence protocols change the operation of the processor-network interface:
1. Complete cache lines are loaded into the cache
2. The protocol requires a larger vocabulary
   - exclusive read requests
   - invalidation and updating of cache lines
3. The cache coherence protocol requires the interface to send messages and update state in response to received messages
Memory-Network Interface

- The interface receives memory request messages and sends replies
- Messages received from the network are stored in the TSHR (transaction status holding register)
Memory-Network Interface

- A request queue is used to hold request messages when all TSHRs are busy
- The TSHR tracks messages in the same way as the MSHR
- The bank control and message transmit units monitor changes in the TSHR
Memory-Network Interface

- A read request initializes a TSHR entry with status read pending
- The subsequent memory access changes the status to bank activated
- The first word is returned from the memory bank; the status is changed to read complete
- The message transmit unit formats the reply message and injects it into the network; finally, the TSHR entry is marked idle
- Requests can be handled out of order
Memory-Network Interface

- Cache coherence protocols can be implemented with this structure; however, the TSHR must be extended
- A directory is used to record the state (e.g., shared) of the requested cache line
Line-Fabric Network Interface

- A stream-processing (data-flow) based communication model deals with messages or streams (not memory transactions)
- Queues are needed to store packets that
  - cannot enter the network because of congestion in the network
  - cannot enter the terminal
    - if packets cannot be stored in the network interface, they have to stay in the network, which would degrade performance
Line-Fabric Interface

Why parallel queues rather than a single FIFO? If there are traffic classes with different priorities, there should be a queue for every traffic class:
- high-priority traffic is not blocked by low-priority traffic
- head-of-line blocking is alleviated
- an admission/ejection control policy based on priority, rate, deadline, etc. can be implemented
Basic Functionality of Network Interfaces

- Packetization/depacketization
  - The network delivers packets; it does not know about messages and transactions
  - Sender side: packetization (messages to packets); receiver side: depacketization (packets to messages)
- Queuing, multiplexing/demultiplexing
  - Scheduling packets to be sent and received
  - Multiple threads running
  - Sender: multiplexing; receiver: demultiplexing
- Re-ordering
  - A network service may not guarantee in-order delivery
- End-to-end flow control
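The packetization and re-ordering duties listed above fit in a few lines: the sender splits a message into sequence-numbered packets, and the receiver sorts by sequence number before reassembly, since the network may not deliver in order. A minimal sketch with illustrative names:

```python
def packetize(message, payload_size):
    """Sender side: split a message into (sequence number, payload) packets."""
    return [(seq, message[i:i + payload_size])
            for seq, i in enumerate(range(0, len(message), payload_size))]

def depacketize(packets):
    """Receiver side: re-order by sequence number, then reassemble."""
    ordered = sorted(packets, key=lambda p: p[0])
    return b"".join(payload for _, payload in ordered)

pkts = packetize(b"hello, network", payload_size=4)
pkts.reverse()           # simulate out-of-order delivery
msg = depacketize(pkts)  # the original message is recovered
```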
Summary

- Network interfaces bridge processor and processor, and processor and memory
- Network interfaces for message passing:
  - two-register
  - register-mapped
  - descriptor-based
- Network interfaces for shared memory:
  - complicated by cache coherency