
Router Architecture
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapter 16, 17
Overview
- Interconnect Network Introduction (concepts, implementation, performance analysis and QoS, evaluation)
- Deadlock, Livelock
- Topology (regular, irregular)
- Router Architecture (pipelined, classic)
- Routing (algorithms, mechanics)
- Network Interface (message passing, shared memory)
- Flow Control (circuit switching, packet switching (SAF, wormhole, virtual channel))
- Summary

July 21, 2015
SoC Architecture
Network-on-Chip
[Figure: a network-on-chip; packets travel between switch nodes (S) and terminal nodes (T)]
Information in the form of packets is routed via channels and switches from one terminal node to another.
The interface between the interconnection network and the terminals (clients) is called the network interface.
Router Architecture: First thinking questions
- Functions
  - What functions must a router realize?
    - Wormhole router without virtual channels
    - Virtual channel routers
  - What are the minimum functions?
- Modules
  - What functional blocks should a router have to implement the required functions?
  - Which functions are on the data path, and which on the control path?
  - What are the minimum functional units?
Router Architecture
- The discussion concentrates on a typical virtual-channel router.
- Modern routers are pipelined and work at the flit level.
- Head flits proceed through pipeline stages that perform routing and virtual-channel allocation.
- All flits pass through the switch allocation and switch traversal stages.
- Most routers use credits to allocate buffer space.
A typical virtual channel router
- A router's functional blocks can be divided into
  - Datapath: handles storage and movement of a packet's payload
    - Input buffers
    - Switch
    - Output buffers
  - Control plane: coordinates the movement of packets through the resources of the datapath
    - Route computation
    - VC allocator
    - Switch allocator
This is a generic model. Can we skip the output buffers or input buffers?
A typical virtual channel router
- The input unit
  - contains a set of flit buffers
  - maintains the state for each virtual channel:
    - G = Global state
    - R = Route
    - O = Output VC
    - P = Pointers
    - C = Credits
Virtual channel state fields (Input)
[Table: the input virtual channel state fields G, R, O, P, C and their meanings]
A typical virtual channel router
- During route computation, the output port for the packet is determined.
- Then the packet requests an output virtual channel from the virtual-channel allocator.
A typical virtual channel router
- Flits are forwarded via the virtual channel by allocating a time slot on the switch and output channel, using the switch allocator.
- Flits are forwarded to the appropriate output during this time slot.
- The output unit forwards the flits to the next router in the packet's path.
Virtual channel state fields (Output)
[Table: the output virtual channel state fields and their meanings]
Packet Rate and Flit Rate
- The control of the router operates at two distinct frequencies:
  - Packet rate (performed once per packet)
    - Route computation
    - Virtual-channel allocation
  - Flit rate (performed once per flit)
    - Switch allocation
    - Pointer and credit count update
The Router Pipeline
- A typical router pipeline includes the following stages:
  - RC (Route Computation)
  - VA (Virtual-Channel Allocation)
  - SA (Switch Allocation)
  - ST (Switch Traversal)
[Timing diagram: no pipeline stalls]
Do all types of flits experience the four stages? Why?
Can we design the pipeline in fewer than 4 stages?
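The answer to the first question above can be made concrete in a few lines: only head flits need the per-packet stages, because body and tail flits inherit the head's route and output VC. A minimal sketch, assuming the stage names from this slide:

```python
def stages_for(flit_type: str) -> list[str]:
    """Return the pipeline stages a flit of the given type passes through.

    Only head flits perform route computation (RC) and virtual-channel
    allocation (VA); body and tail flits reuse the head's route and VC,
    so they only compete for the switch (SA) and then traverse it (ST).
    """
    if flit_type == "head":
        return ["RC", "VA", "SA", "ST"]
    if flit_type in ("body", "tail"):
        return ["SA", "ST"]
    raise ValueError(f"unknown flit type: {flit_type}")


print(stages_for("head"))  # ['RC', 'VA', 'SA', 'ST']
print(stages_for("body"))  # ['SA', 'ST']
```

This is also why the "buffer empty stall" example later in the deck loses only one output cycle: a delayed body flit re-enters at SA, not at RC.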
The Router Pipeline
- Cycle 0
  - The head flit arrives and the packet is directed to a virtual channel of the input port (G = I).
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 1
  - Routing computation: the head flit enters the RC stage.
  - The virtual channel state changes to routing (G = R).
  - The first body flit arrives at the router.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 2: Virtual-Channel Allocation
  - The route field (R) of the virtual channel is updated.
  - The head flit enters the VA stage.
  - The first body flit enters the RC stage.
  - The second body flit arrives at the router.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 2: Virtual-Channel Allocation (continued)
  - The result of the routing computation is input to the virtual-channel allocator.
  - If successful, the allocator assigns a single output virtual channel.
  - The state of the virtual channel is set to active (G = A).
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 3: Switch Allocation
  - All further processing is done on a flit basis.
  - The head flit enters the SA stage.
  - Any active VC (G = A) that contains buffered flits (indicated by P) and has downstream buffers available (C > 0) bids for a single-flit time slot through the switch from its input VC to the output VC.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 3: Switch Allocation (continued)
  - If successful, the pointer field is updated.
  - The credit field is decremented.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 4: Switch Traversal
  - The head flit traverses the switch.
- Cycle 5
  - The head flit starts traversing the channel to the next router.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Cycle 7
  - The tail flit traverses the switch.
  - The output VC is set to idle.
  - The input VC is set to idle (G = I) if the buffer is empty.
  - The input VC is set to routing (G = R) if another head flit is in the buffer.
[Timing diagram: no pipeline stalls]
The Router Pipeline
- Only the head flits enter the RC and VA stages.
- The body and tail flits are stored in the flit buffers until they can enter the SA stage.
[Timing diagram: no pipeline stalls]
How does the timing diagram look if the pipeline is stalled?
Under what circumstances will the pipeline stall?
Pipeline Stalls
- Pipeline stalls can be divided into
  - Packet stalls
    - can occur if the virtual channel cannot advance to its R, V, or A state
  - Flit stalls
    - occur if a virtual channel is in the active state and the flit cannot successfully complete switch allocation, due to
      - lack of a flit
      - lack of credit
      - losing arbitration for the switch time slot
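The three flit-stall causes can be phrased as one eligibility test: an active VC may bid for a switch time slot only if it has a buffered flit and a downstream credit, and even an eligible VC can still lose arbitration. A sketch under these assumptions (function and parameter names are illustrative, not from the slides):

```python
def can_bid_for_switch(state: str, buffered_flits: int, credits: int) -> bool:
    """An active VC (G = A) bids for a switch time slot only if it has a
    flit to send (otherwise: flit stall for lack of flit) and a free
    downstream buffer (otherwise: flit stall for lack of credit).
    Winning the slot is then up to the switch allocator's arbitration,
    which the VC can still lose -- the third stall cause."""
    return state == "A" and buffered_flits > 0 and credits > 0


print(can_bid_for_switch("A", buffered_flits=2, credits=0))  # False (no credit)
print(can_bid_for_switch("A", buffered_flits=0, credits=3))  # False (no flit)
print(can_bid_for_switch("A", buffered_flits=2, credits=3))  # True (may still lose arbitration)
```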
Example for Packet Stall
[Timing diagram: virtual-channel allocation stall]
The head flit of packet A can only enter the VA stage once the tail flit of packet B completes switch allocation and releases the virtual channel.
Example for Flit Stalls
[Timing diagram: switch allocation stall]
The second body flit fails to allocate the requested connection in cycle 5.
Example for Flit Stalls
[Timing diagram: buffer empty stall]
Body flit 2 is delayed three cycles. However, since it does not have to enter the RC and VA stages, the output is only delayed one cycle!
Credits
- A buffer is allocated in the SA stage of the upstream (transmitting) node.
- To reuse the buffer, a credit is returned over a reverse channel after the same flit departs the SA stage of the downstream (receiving) node.
- When the credit reaches the input unit of the upstream node, the buffer is available again and can be reused.
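The bookkeeping above amounts to a single counter: the upstream node's view of the free downstream buffers. A simplified sketch (real routers keep one such counter per virtual channel; the class name and methods are illustrative):

```python
class CreditChannel:
    """Credit-based flow control over one channel, seen from upstream."""

    def __init__(self, buffers: int):
        self.credits = buffers  # free downstream flit buffers (upstream's view)

    def upstream_send(self) -> bool:
        """Upstream SA stage: a downstream buffer is allocated per flit sent."""
        if self.credits == 0:
            return False        # credit stall: every downstream buffer is in use
        self.credits -= 1
        return True

    def downstream_depart(self) -> None:
        """Downstream SA stage: the flit frees its buffer; the credit returns."""
        self.credits += 1


ch = CreditChannel(buffers=2)
print(ch.upstream_send(), ch.upstream_send(), ch.upstream_send())
# True True False -> the third flit stalls until a credit comes back
ch.downstream_depart()
print(ch.upstream_send())  # True
```

In reality the credit also takes time to travel back; that round-trip delay is exactly the credit loop latency discussed on the following slides.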
Credits
- The credit loop can be viewed as a token that
  - starts at the SA stage of the upstream node,
  - travels downstream with the flit,
  - reaches the SA stage of the downstream node, and
  - returns upstream as a credit.
Credit Loop Latency
- The credit loop latency tcrt, expressed in flit times, gives a lower bound on the number of flit buffers needed on the upstream side for the channel to operate at full bandwidth.
- tcrt in flit times is given by

  tcrt = tf + tc + 2*Tw + 1

  where tf is the flit pipeline delay, tc is the credit pipeline delay, and Tw is the one-way wire delay.
- Why plus 1 here?
Credit Round-trip Time and Credit Stall
[Figure: credit round-trip for a virtual channel router with 4 flit buffers; white = upstream pipeline stages, grey = downstream pipeline stages; credit transmit and credit update marked across the span tcrt]
With tf = 4, tc = 2, Tw = 2, this gives tcrt = 11.
What if the virtual channel has 5 flit buffers?
When does the pipeline stall start, and for how many cycles?
Credit Loop Latency
- If the number of buffers available per virtual channel is F, the duty factor of the channel will be

  d = min(1, F / tcrt)

- The duty factor will be 100% as long as there are sufficient flit buffers to cover the round-trip latency.
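Plugging the example numbers from the round-trip slide (tf = 4, tc = 2, Tw = 2) into the two formulas makes the buffer sizing concrete:

```python
def credit_loop_latency(t_f: int, t_c: int, t_w: int) -> int:
    """tcrt = tf + tc + 2*Tw + 1, in flit times."""
    return t_f + t_c + 2 * t_w + 1


def duty_factor(buffers: int, t_crt: int) -> float:
    """d = min(1, F / tcrt): the fraction of channel bandwidth sustainable
    with F flit buffers per virtual channel."""
    return min(1.0, buffers / t_crt)


t_crt = credit_loop_latency(t_f=4, t_c=2, t_w=2)
print(t_crt)                            # 11, matching the slide example
print(round(duty_factor(4, t_crt), 3))  # 0.364: 4 buffers cannot cover the loop
print(duty_factor(11, t_crt))           # 1.0: 11 buffers sustain full bandwidth
```

With only the 4 buffers from the example, the channel idles for 7 of every 11 flit times waiting for credits to return; 11 buffers are the smallest count that keeps it fully busy.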
Flit and Credit Encoding
(a) Flits and credits are sent over separate lines with separate widths.
(b) Flits and credits are transported over the same line. This can be done by
  - including credits in flits, or
  - multiplexing flits and credits at the phit level.
Network Interface
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapter 20
Network-on-Chip
[Figure: a network-on-chip; packets travel between switch nodes (S) and terminal nodes (T)]
Information in the form of packets is routed via channels and switches from one terminal node to another.
The interface between the interconnection network and the terminals (clients) is called the network interface.
Network Interface
[Figure: terminal node (resource) connected through a network interface to a switch of the network]
- Different terminals with different interfaces shall be connected to the network.
- The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol.
Network Interface
- The network interface plays an important role in a network-on-chip:
  - it shall translate between the terminal protocol and the protocol of the network
  - it shall enable the client to communicate at the speed of the network
    - it shall not further reduce the available bandwidth of the network
    - it shall not increase the latency imposed by the network
- A poorly designed network interface is a bottleneck and can increase the latency considerably.
Network Interfaces
- For message passing: symmetric
  - Processor-Network Interface to Processor-Network Interface
- For shared memory: asymmetric, load & store
  - Processor-Network Interface
  - Memory-Network Interface
- Line-card interface: connects an external network channel with an interconnection network used as a switching fabric
What are the differences between message passing and shared memory communication?
Network Interfaces for Message Passing
- Two-register interface
- Descriptor-based interface
- Message reception
Two-Register Interface
[Figure: register file R0..R31 extended with Net-out and Net-in registers connected to the network]
- For sending, the processor writes to a specific Net-out register.
- For receiving, the processor reads a specific Net-in register.
- Pro:
  - efficient for short messages
- Cons:
  - inefficient for long messages
  - the processor acts as a DMA controller
  - not safe, because it does not protect the network from the SW running on the processor:
    - A misbehaving processor can send the first part of a message and then delay indefinitely sending the end of the message.
    - A processor can tie up the network by failing to read a message from the input register.
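The "processor acts as a DMA controller" drawback can be made concrete: for a long message, the processor itself must loop over every word and write it to Net-out. A hypothetical sketch, with a plain list standing in for the memory-mapped Net-out register:

```python
def send_message(net_out: list, payload: list) -> None:
    """Two-register send: the processor copies the message word by word
    into the Net-out register. Cheap for a short message, but for a long
    one the processor is busy for the whole transfer, doing the job a
    DMA engine would normally do -- and nothing stops it from pausing
    indefinitely in the middle of the loop."""
    for word in payload:
        net_out.append(word)  # each append models one write to Net-out


outbox = []
send_message(outbox, ["hdr", 1, 2, 3, "tail"])
print(len(outbox))  # 5 register writes for a 5-word message
```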
Descriptor-Based Interface
[Figure: descriptor list (Send Start; immediate value; register RN; Addr/Length; END) drawing on the register file and on memory]
- The processor composes a message in a set of dedicated message descriptor registers.
- Each descriptor contains
  - an immediate value, or
  - a reference to a processor register, or
  - a reference to a block of memory.
- A co-processor steps through the descriptors and composes the messages.
- Safe, because the network is protected from the processor's SW.
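The co-processor's walk through the descriptor list can be sketched as a small interpreter over the three descriptor kinds named above. The descriptor encoding (tag strings, tuple layout) is an assumption for illustration:

```python
def compose_message(descriptors, registers, memory):
    """Step through the descriptor list the way the co-processor does,
    gathering the message payload. The processor only fills in the
    descriptors; it never touches the network directly, which is what
    makes this interface safe."""
    message = []
    for kind, arg in descriptors:
        if kind == "IMM":        # an immediate value
            message.append(arg)
        elif kind == "REG":      # a reference to a processor register
            message.append(registers[arg])
        elif kind == "MEM":      # a reference to a block of memory
            addr, length = arg
            message.extend(memory[addr:addr + length])
        elif kind == "END":      # end of the descriptor list
            break
    return message


regs = {"R3": 42}
mem = [10, 20, 30, 40, 50]
msg = compose_message(
    [("IMM", 7), ("REG", "R3"), ("MEM", (1, 3)), ("END", None)],
    regs, mem)
print(msg)  # [7, 42, 20, 30, 40]
```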
Receiving Messages
- A co-processor or a dedicated thread is triggered upon reception of an incoming message.
- It unpacks the message and stores it in local memory.
- It informs the receiving task via an interrupt or a status register update.
How does a processor know that something has happened at an I/O device?
Shared Memory Interfaces
- The interconnection network is used to transmit memory read/write transactions between processors and memories.
- We will further discuss
  - Processor-Network Interface
  - Memory-Network Interface
What does shared memory communication do?
Processor-Network Interface
- Load/store requests are stored in the request register.
  - Type: read/write, cacheable or uncacheable, etc.
- Requests are tagged, usually encoding how the reply is to be handled, e.g., store in register R10.
- In case of a cache miss, requests are stored in an MSHR (miss status holding register).
Processor-Network Interface
Consider a read operation:
- An uncacheable read request results in a pending read.
- After the message is formed and transmitted, the status changes to read requested.
- When the network returns the reply, the status changes to read complete.
- Completed MSHRs are forwarded to the reply register, and the status changes to idle.
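The read sequence above is a small state machine per MSHR entry. A sketch of the legal transitions, with status names taken from the slide and the event names being illustrative assumptions:

```python
# Legal MSHR status transitions for an uncacheable read, as described above.
MSHR_TRANSITIONS = {
    ("idle", "load_miss"): "read pending",
    ("read pending", "message_sent"): "read requested",
    ("read requested", "reply_received"): "read complete",
    ("read complete", "forwarded_to_reply_register"): "idle",
}


def mshr_step(status: str, event: str) -> str:
    """Advance one MSHR entry; anything off the cycle above is illegal."""
    try:
        return MSHR_TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {status!r}")


s = "idle"
for ev in ("load_miss", "message_sent", "reply_received",
           "forwarded_to_reply_register"):
    s = mshr_step(s, ev)
print(s)  # idle -- the entry has completed the loop and can be reused
```

The memory-side TSHR described later tracks its entries in the same fashion, just with its own statuses (read pending, bank activated, read complete, idle).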
Processor-Network Interface
- Cache coherence protocols change the operation of the processor-network interface:
  1. Complete cache lines are loaded into the cache.
  2. The protocol requires a larger vocabulary of messages:
     - exclusive read request
     - invalidation and updating of cache lines
  3. The cache coherence protocol requires the interface to send messages and update state in response to received messages.
How will a cache change the operation of the processor-network interface?
Memory-Network Interface
- The interface receives memory request messages and sends replies.
- Messages received from the network are stored in the TSHR (transaction status holding register).
Memory-Network Interface
- The request queue is used to hold request messages when all TSHRs are busy.
- The TSHR tracks messages in the same way as the MSHR.
- The bank control and message transmit unit monitors changes in the TSHR.
Is a reply queue needed here? Why?
Memory-Network Interface
Consider a read operation:
- A read request initializes a TSHR entry with status read pending.
- The subsequent memory access changes the status to bank activated.
- Right before the first word is returned from the memory bank, the status is changed to read complete.
- The message transmit unit formats the reply message and injects it into the network, and the TSHR entry is marked idle.
Memory-Network Interface
- Cache coherence protocols can be implemented with this structure; however, the TSHR must be extended, e.g., with directory state.
Summary
- Network interfaces bridge the processor with the network, and the memory with the network.
- Message passing interfaces:
  - Two-register interface
  - Descriptor-based interface
- Shared memory interfaces, complicated by cache coherency:
  - Processor-Network Interface
  - Memory-Network Interface
ARM Workshop
- Compulsory attendance
- Tuesday, March 29, Sal D, 10:30 to 12:00
- Content: ARM processors and architectures, programmers' models, the ARM Instruction Set Architecture, basic system design, core pipelines, power issues, development tools, and a demonstration of the latest ARM technology.