COMPLEXITY EFFECTIVE SUPERSCALAR PROCESSORS


Hardware
Microarchitecture
Lecture-1
<Ch. 16,17,18,19>
ELE-580i
PRESENTATION-I
04/01/2003
Canturk ISCI
ROUTER ARCHITECTURE
Router:
Registers
Switches
Functional Units
Control Logic
Implements:
Routing & Flow Control
Pipelined
Use credits for buffer space
Flits → downstream
Credits → upstream
Constitute the credit loop
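The credit loop in miniature, as a Python sketch; class and method names (UpstreamPort, send_flit, ...) are illustrative assumptions, not from the lecture.

# Credit-based flow control in miniature. All names are assumptions.
class UpstreamPort:
    def __init__(self, downstream_buffer_size):
        self.credits = downstream_buffer_size  # one credit per downstream flit cell

    def send_flit(self, flit, downstream):
        if self.credits == 0:
            return False               # credit stall: no guaranteed buffer space
        self.credits -= 1              # consume a credit per flit sent downstream
        downstream.buffer.append(flit)
        return True

    def receive_credit(self):
        self.credits += 1              # credit returned: a downstream cell freed up

class DownstreamPort:
    def __init__(self):
        self.buffer = []

    def forward_flit(self, upstream):
        flit = self.buffer.pop(0)      # flit leaves the input buffer...
        upstream.receive_credit()      # ...so a credit flows back upstream
        return flit

With UpstreamPort(4), send_flit only stalls once all 4 downstream cells are accounted for; each forward_flit frees one credit back.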
ROUTER Diagram
 Virtual Channel Router
 Datapath: Input Units | Switch | Output Units
 Control: Router, VC Allocator, Switch Allocator
 Input Unit:
State vector (for each VC) & flit buffer (for each VC)
State vector fields: GROPC
 Output Unit:
Latches outgoing flits
State vector fields: GIC
 Switch:
Connects I/ps to o/ps according to SA
 VCA:
Arbitrates o/p channel RQs from each I/p packet
Once for each packet!!
 SA:
Arbitrates o/p port RQs from I/p ports
Done for each flit
 Router:
Determines o/p ports for packets
VC State Fields
 Input virtual channel:
G → Global state: I, R, V, A, or C
(×(# of VCs): one state vector per VC)
R → Route: o/p port for the packet
O → O/p VC: o/p VC of port R for the packet
P → Pointers: flit head and tail pointers
C → Credit count: # of credits for o/p VC R.O
 Output virtual channel:
G → Global state: I, A, or C
(×(# of VCs): one state vector per VC)
I → I/p VC: i/p port.VC forwarding to this o/p VC
C → Credit count: # of free buffers at the downstream node
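A minimal sketch of the two state vectors (GROPC / GIC) as Python dataclasses; field names follow the slide, while types and defaults are assumptions.

# Per-VC state vectors as dataclasses; layout follows the slide, details assumed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputVCState:               # GROPC, one per input VC
    G: str = "I"                  # global state: I, R, V, A, or C
    R: Optional[int] = None       # route: o/p port chosen for the packet
    O: Optional[int] = None       # o/p VC of port R chosen for the packet
    P_head: Optional[int] = None  # pointer to first buffered flit
    P_tail: Optional[int] = None  # pointer to last buffered flit
    C: int = 0                    # credits available for o/p VC R.O

@dataclass
class OutputVCState:              # GIC, one per output VC
    G: str = "I"                  # global state: I, A, or C
    I: Optional[tuple] = None     # (i/p port, i/p VC) forwarding to this o/p VC
    C: int = 0                    # free buffer cells at the downstream router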
How it works
1) Packet → input controller →
Router → o/p port (e.g. P3)
VCA → o/p VC (e.g. P3.VC2)
→ Route determined
2) Each flit → input controller →
SA → timeslot over switch
Flit forwarded to o/p unit
3) Each flit → output unit →
Drives downstream physical channel
→ Flit transferred
Router Pipeline
 Route Compute:
Define the o/p port for packet header
 VC Allocate:
Assign a VC from the port if available
 Switch Allocate:
Schedule switch state according to o/p port requests
 Switch Traverse:
I/p drives the switch for o/p port
 Transmit:
Transmit the flit over downstream channel
<Pipeline diagram: successive flits staggered one cycle apart through RC | VA | SA | ST | TX>
 RC, VA → only for the header
O/p channel is assigned for the whole packet
 SA, ST, TX → for all flits
Flits from different packets compete continuously
 Flits transmitted sequentially for routing in next hops
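A small sketch (assumed names, no stalls) of how flits occupy the pipeline: RC and VA happen once, for the head flit; every flit then takes SA, ST, and TX one cycle apart.

# Which cycle each flit spends in which stage, assuming no stalls.
HEAD_STAGES = ["RC", "VA", "SA", "ST", "TX"]   # route + VC alloc once per packet
BODY_STAGES = ["SA", "ST", "TX"]               # every flit arbitrates the switch

def schedule(num_flits):
    """Map each flit to {cycle: stage}, head flit entering RC at cycle 0."""
    table = {}
    for f in range(num_flits):
        stages = HEAD_STAGES if f == 0 else BODY_STAGES
        start = 0 if f == 0 else 2 + f   # body flit f enters SA right behind flit f-1
        table[f] = {start + i: s for i, s in enumerate(stages)}
    return table

# schedule(3) -> head: RC@0 VA@1 SA@2 ST@3 TX@4; flit1: SA@3 ST@4 TX@5;
# flit2: SA@4 ST@5 TX@6 -- flits are transmitted sequentially.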
Pipeline Walkthrough
 (0):<start>
P4.VC3: (I/p VC)
G=I | R=x | O=x | P=x | C=x
Packet Arrives at I/p port P4
Packet header → VC3 →
Packet stored in P4.VC3
 (1):<RC>
P4.VC3:
G=R | R=x | O=x | P=<head>,<tail??> | C=x
Packet header → Router → select o/p port: P3
 (2):<VA>
P4.VC3:
G=V | R=P3| O=x | P=<head>,<tail??> | C=x
P3.VC2: (o/p VC)
G=I | I=x | C=x
P3 → VCA → allocate VC for o/p port P3: VC2
…Pipeline Walkthrough
 (3):<SA>
P4.VC3: (i/p VC)
G=A | R=P3| O=VC2 | P=<head>,<tail??> | C=#
P3.VC2: (o/p VC)
G=A | I=P4.VC3 | C=#
Packet Processing complete
Flit by flit switch allocation/traverse & Transmit
Head flit allocated on switch →
Move pointers
Decrement P4.VC3.Credit
Send a credit to upstream node to declare the available buffer space
 (4):<ST>
Head flit arrives at output VC
 (5):<TX>
Head flit transmitted to downstream
 (6):<Tail in SA>
Packet done
 (7):<Release Resources>
P4.VC3: (i/p VC)
G=I, or R if a new packet is already waiting | R=x | O=x | P=x | C=x
P3.VC2: (o/p VC)
G=I | I=x | C=x
Pipeline Stalls
 Packet Stalls:
P1) I/p VC Busy stall
P2) Routing stall
P3) VC Allocation stall
 Flit Stalls:
F1) Switch Allocation Stall
F2) Buffer empty stall
F3) Credit stall
Credit return latency:
pipeline(4) + round trip(4) + CT(1) + CU(1) + next SA(1) = 11 cycles
Channel Reallocation
 1) Conservative
Wait until the credit for the tail is received from downstream before reallocating the o/p VC
 2) Aggressive – single global state
Reallocate the o/p VC when the tail passes SA
(same as a VA stall)
Reallocate the downstream I/p VC when the tail passes SA
(same as an I/p VC busy stall)
…Channel Reallocation
 2) Aggressive – Double Global state
Reallocate o/p VC when tail passes SA
(same as VA stall)
Eliminates the I/p VC busy stall
 Needs 2 I/p VC state vectors at the downstream node:
For A: G=A | R=Px | O=VCx | P=<head A>,<tail A> | C=#
For B: G=R | R=x | O=x | P=<head B>,<tail??> | C=x
Speculation and Lookahead
 Reduce latency by reducing pipe stages → speculation (and lookahead)
 Speculate virtual channel allocation:
Do VA and SA concurrently
If the VC set from RC spans more than 1 port, speculate that as well
 Lookahead:
Do route compute for node i at node i-1
Start at VA at each node; overlap NRC (next route compute) & VA
Flit and Credit Format
Two ways to distinguish credits from flits:
1) Piggybacked credit:
Include a credit field on each flit
No types required
2) Define types:
E.g. 10 → start credit, 11 → start flit, 0x → idle
Flit format:
Head flit: (Credit) | VC | Type | Route info | Payload | CRC
Body flit: (Credit) | VC | Type | Payload | CRC
Credit format:
Credit: VC | Type | Check
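A sketch of decoding the 2-bit type field from this slide (10 → start credit, 11 → start flit, 0x → idle); the function name and anything beyond those bit patterns is an assumption.

# Decode the 2-bit phit type field per the slide's encoding.
def classify_phit(type_bits):
    if type_bits == 0b11:
        return "start of flit"
    if type_bits == 0b10:
        return "start of credit"
    return "idle"                  # 0b00 or 0b01

assert classify_phit(0b11) == "start of flit"
assert classify_phit(0b01) == "idle"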
ROUTER COMPONENTS
Datapath:
Input buffer
Hold waiting flits
Switch
Route flits from I/p → o/p
Output unit
Send flits to downstream
Control:
Arbiter
Grant access to shared resources
Allocator
Allocate VCs to packets and switch time to flits
Input Buffer
Smooths out flit traffic
Hold flits awaiting:
VCs
Switch BW or
Channel BW
Organization:
Centralized
Partitioned into physical channels
Partitioned into VCs
Centralized Input Buffer
 Combined single memory across entire router
 No separate switch, but
Need to multiplex I/ps to memory
Need to demultiplex memory o/p to o/p ports
 PRO:
Flexibility in allocating memory space
 CONs:
High memory BW requirement:
2I per flit time (I writes from the I/ps, I reads to the o/ps)
Flit deserialization / reserialization latency:
Need to gather I flits from VCs before writing to MEM
Partitioned Input Buffers
 1 buffer per physical I/p port:
Each Memory BW: 2 (1 read, 1 write)
Buffers shared across VCs for a fixed port
Buffers not shared across ports
Less flexibility
 1 buffer per VC:
Enable switch I/p speedup
Obviously, bigger switch
Too fine a granularity → inefficient mem usage
 Intermediate solutions:
E.g. one memory for the even VCs, one for the odd VCs
Input Buffer Data Structures
 Data structures required to:
Track flit/ packet locations in Memory
Manage available free memory
Allocate multiple VCs
Prevent blocking
 Two common types:
Circular buffers
Static, simpler yet inefficient mem usage
Linked Lists
Dynamic, complex, but fairer mem usage
 Nomenclature:
Buffer (flit buffer): entire structure
Cell (flit cell): storage for a single flit
Circular Buffer
 FIXED! First and Last ptrs
Specify the memory boundary for a VC
 Head and Tail specify the current content boundary
Flit added at the tail
Tail incremented (modulo)
Tail = Head → buffer full
Flit removed from the head
Head incremented (modulo)
Head = Tail → buffer empty
 Choose size N a power of 2 so the low-order log2(N) bits implement the circular increment
E.g. like a cache line index & byte offset
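A runnable sketch of the circular flit buffer above, with N a power of two so the low-order bits wrap the pointers; since the slide's tail==head test alone cannot distinguish full from empty, this sketch adds a count (an assumption).

# Circular flit buffer sketch; structure per the slide, details assumed.
class CircularBuffer:
    def __init__(self, n):
        assert n & (n - 1) == 0, "size must be a power of two"
        self.cells = [None] * n
        self.mask = n - 1          # low-order bits implement the modular increment
        self.head = 0              # next flit to remove
        self.tail = 0              # next free cell
        self.count = 0             # disambiguates full vs empty when head == tail

    def add(self, flit):           # flit added at the tail
        if self.count == len(self.cells):
            raise IndexError("buffer full")   # tail caught up with head
        self.cells[self.tail] = flit
        self.tail = (self.tail + 1) & self.mask
        self.count += 1

    def remove(self):              # flit removed from the head
        if self.count == 0:
            raise IndexError("buffer empty")  # head caught up with tail
        flit = self.cells[self.head]
        self.head = (self.head + 1) & self.mask
        self.count -= 1
        return flit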
Linked List Buffer
 Each cell has a ptr field for next cell
 Head and Tail specify 1st and last cells
NULL for empty buffers
 Free List: Linked list of free cells
Free points to head of list
 Counter registers
Count of allocated cells for each buffer
Count of cells in free list
Bit errors have a more severe effect than in a circular buffer (a corrupted next-pointer can lose or cross-link cells)
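A sketch of the linked-list organization above: cells carry a next-pointer, a free list holds unused cells, and counters track allocation. The cell layout and method names are assumptions; a real router would share one pool across all VC queues.

# Linked-list flit buffer with a shared free list, per the slide.
class LinkedListBuffer:
    def __init__(self, pool_size):
        # each cell: [flit, next_index]; all cells start on the free list
        self.cells = [[None, i + 1] for i in range(pool_size)]
        self.cells[-1][1] = None
        self.free = 0              # head of the free list
        self.free_count = pool_size
        self.head = None           # first flit cell of this queue
        self.tail = None           # last flit cell
        self.count = 0             # allocated cells for this buffer

    def add(self, flit):
        if self.free is None:
            raise IndexError("no free cells")
        cell = self.free                     # pop a cell off the free list
        self.free = self.cells[cell][1]
        self.free_count -= 1
        self.cells[cell] = [flit, None]
        if self.tail is None:
            self.head = cell                 # queue was empty
        else:
            self.cells[self.tail][1] = cell  # link behind the old tail
        self.tail = cell
        self.count += 1

    def remove(self):
        if self.head is None:
            raise IndexError("buffer empty")
        cell = self.head
        flit, self.head = self.cells[cell]
        if self.head is None:
            self.tail = None
        self.cells[cell] = [None, self.free] # push the cell back on the free list
        self.free = cell
        self.free_count += 1
        self.count -= 1
        return flit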
Buffer Memory Allocation
 Prevent a greedy VC from flooding all of memory and blocking the rest!
 Add a count register to each I/p VC state vector
Keep number of allocated cells
 Additional counter for free list
 Simple policy: reserve 1 cell for each VC
Add flit to buffer_VCi if:
buffer_VCi is empty, or #(free list) > #(empty VCs)
 Detailed policy: Sliding Limit Allocator
(r: # reserved cells per buffer,
f: fraction of free space a VC may use)
Add flit to buffer_VCi if:
|buffer_VCi| < r, or r ≤ |buffer_VCi| < f·#(free list) + r
f = r = 1 → same as the simple policy
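The sliding-limit condition above as a one-function sketch; the signature and defaults are assumptions.

# Sliding-limit allocation policy: r reserved cells per VC buffer,
# f = fraction of free space one VC may claim beyond its reservation.
def may_add_flit(buffer_len, free_list_len, r=1, f=1.0):
    if buffer_len < r:                          # VC still within its reservation
        return True
    return buffer_len < f * free_list_len + r   # sliding limit on extra cells

# e.g. may_add_flit(3, 10, r=1, f=0.5) -> 3 < 0.5*10 + 1 = 6 -> True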
SWITCH
Core of the router: directs packets/flits to their destinations
Speedup = provided switch BW / min. switch BW required for full thruput on all I/ps and o/ps of the router
Adding speedup simplifies allocation and yields higher thruput and lower latency
Realizations:
Bus switch
Crossbar
Network switch
Bus Switches
 Switches in time
 Input port accumulates P phits of a flit, arbitrates for the
bus, transmits P phits over the bus to any o/p unit
E.g. P=3 <fig. 17.5: P=4>
Feasible only if flits have # phits ≥ P
(preferably an integer multiple of P)
 Fragmentation Loss:
If phits per flit not multiple of P
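A small sketch of the fragmentation loss just described: when a flit's phit count is not a multiple of P, the final bus cycle is partly wasted. The function name is an assumption.

# Fraction of bus slots that actually carry phits for one flit.
import math

def bus_efficiency(phits_per_flit, P):
    cycles = math.ceil(phits_per_flit / P)   # bus cycles to move one flit
    return phits_per_flit / (cycles * P)     # wasted slots in the last cycle

# e.g. 5 phits over a P=3 bus: 2 cycles, 5/6 ~ 83% efficient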
Bus timing diagram
<fig. 17.6: bus timing diagram; annotation: 'Could actually start here!'>
Bus Pros & Cons
 Simple switch allocation
I/p port owning bus can access all o/p ports
Multicast made easy
 Wasted port BW
Port BW: b → router BW = Pb → bus BW = Pb →
I/p deserializer BW = Pb → o/p serializer BW = Pb →
Available internal BW: P × Pb = P²b
Used bus BW: Pb (speedup = 1)
 Increased Latency
2P worst case <see 17.6-bus timing diagram>
Can vary from P+1 to 2P (phit times)
Xbar Switches
 Primary issue: speedup
1. k×k → no speedup <fig 17.10(a)>
2. sk×k → I/p speedup = s <fig 17.10(b)>
3. k×sk → o/p speedup = s <fig 17.11(a)>
4. sk×sk → I/p & o/p speedup = s <fig 17.11(b)>
(Speedup simplifies allocation)
Xbar Throughput
 Ex: random separable allocator, I/p speedup = s, uniform traffic:
Thruput = P{at least one of the sk flits is destined for a given o/p}
= 1 − P{none of the sk I/ps chooses the given o/p}
= 1 − [(k−1)/k]^(sk)
s = k → thruput ≈ 100% (doesn't verify exactly against the formula above!!)
 O/p speedup:
Need to implement reverse allocator
More complicated for same gain
 Overall speedup (both I/p & o/p)
Can achieve > 100% thruput
Cannot sustain since:
o/p buffer will expand to inf.
and I/p buffers need to be initially filled with inf. # of flits
I/p speedup s_i & o/p speedup s_o (s_i > s_o):
Similar to I/p speedup = s_i/s_o, with overall speedup s_o →
Thruput = s_o · (1 − [(k−1)/k]^(s_i·k/s_o))
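Evaluating the two throughput estimates above; the formulas come from the slide, while the script itself is only illustrative.

# Separable-allocator throughput estimates under uniform traffic.
def thruput_input_speedup(k, s):
    return 1 - ((k - 1) / k) ** (s * k)

def thruput_both(k, si, so):
    return so * (1 - ((k - 1) / k) ** (si * k / so))

for s in (1, 2, 3):
    print(f"k=8, s={s}: {thruput_input_speedup(8, s):.3f}")
# k=8: s=1 -> 0.656, s=2 -> 0.882, s=3 -> 0.959
# (note s=k gives ~1, not exactly 100%, as remarked above)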
Network Switches
 A network of smaller switches
 Reduces # of crosspoints
Localize logic
Reduces wire length
 Requires complex control or
intermediate buffering
Not very profitable!
 Ex: 7x7 switch as 3 3x3 switches
3×9 = 27 crosspoints instead of 7×7 = 49
OUTPUT UNIT
Essentially a FIFO to match switch speed to channel speed
If switch o/p speedup = 1:
merely latches the flits to downstream
No need to partition across VCs
Provides backpressure to SA to prevent buffer overflow
SA should block traffic to a congested o/p buffer
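A sketch of this slide's output unit: a FIFO that latches flits toward downstream and asserts backpressure to the switch allocator when nearly full. Depth, threshold, and names are illustrative assumptions.

# Output unit as a FIFO with a backpressure signal to SA.
from collections import deque

class OutputUnit:
    def __init__(self, depth=4):
        self.fifo = deque()
        self.depth = depth

    def backpressure(self):
        # SA should stop scheduling flits to this o/p when this is True
        return len(self.fifo) >= self.depth

    def accept(self, flit):
        assert not self.backpressure(), "SA violated backpressure"
        self.fifo.append(flit)

    def transmit(self):
        # drive one flit onto the downstream physical channel per cycle
        return self.fifo.popleft() if self.fifo else None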
ARBITER
 Resolves multiple requests for a single resource (N:1)
Building block for allocators (N1 × N2)
 Communication and timing: <fig.>
Arbiter Types
 Types:
Fixed priority: r0 > r1 > r2 > …
Variable (iterative) priority: rotate priorities
Make a carry chain; a hot 1 is inserted from the priority inputs
E.g. r1 > r2 > … > r0 → (p0,p1,p2,…,pn) = 010…0
Matrix: implements a least-recently-served (LRS) scheme
Uses a triangular array
M[r][c] = 1 → RQ r beats RQ c
Queueing: first come, first served
<the bank/STA Travel style>
Ticket counter:
Gives the current ticket number to each requester
Increments with each ticket
Served counter:
Stores the currently served requester's number
Increments for the next customer
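A sketch of the matrix (least-recently-served) arbiter described above: M[r][c] = 1 means request r currently beats request c, and the winner drops to lowest priority. Class structure and initial priority order are assumptions.

# Matrix arbiter: winner beats every other active requester, then
# is demoted to lowest priority (its row cleared, its column set).
class MatrixArbiter:
    def __init__(self, n):
        self.n = n
        # start with fixed priority r0 > r1 > ...: row r beats all c > r
        self.M = [[1 if c > r else 0 for c in range(n)] for r in range(n)]

    def arbitrate(self, requests):
        for r in range(self.n):
            if not requests[r]:
                continue
            # r wins if it beats every other active requester
            if all(self.M[r][c] for c in range(self.n)
                   if c != r and requests[c]):
                for c in range(self.n):
                    if c != r:
                        self.M[r][c] = 0   # r no longer beats anyone
                        self.M[c][r] = 1   # everyone now beats r
                return r
        return None                        # no active requests

# arb = MatrixArbiter(4); arb.arbitrate([1,1,0,0]) -> 0, then -> 1:
# the last-served requester always loses the next tie.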
ALLOCATOR
 Provides a matching:
Multiple resources ↔ multiple requesters
E.g. switch allocator:
Every cycle, match I/p ports → o/p ports
1 flit leaves each I/p port
1 flit goes to each o/p port
 nxm allocator
rij: requester i wants access to resource j
gij: requester i granted access to resource j
Request & grant matrices: R, G
 Allocation rules:
gij ⇒ rij: grant only if requested
gij ⇒ no other gik: at most 1 grant per requester (row)
gij ⇒ no other gkj: at most 1 grant per resource (column)
Allocation Problem
 Can be represented as finding the maximum matching grant matrix
 Equivalently, a maximum matching in a bipartite graph
 Exact algorithms:
Augmenting path method
Always finds maximum matching
Not feasible within the router's time budget
 Faster Heuristics:
Separable allocators:
2 stages of arbitration:
Across I/ps & across o/ps
In either order: I/p-first OR o/p-first
4x3 Input-first Separable Allocator
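A hedged sketch of an input-first separable allocator like the 4x3 example above: one arbitration per input row, then one per output column. Random choices stand in for the matrix or round-robin arbiters a real router would use; names are assumptions.

# Input-first separable allocation: R[i][j] = 1 if requester i wants
# resource j; returns a grant matrix obeying the three rules above.
import random

def separable_input_first(R):
    n, m = len(R), len(R[0])
    # Stage 1: an m:1 arbiter per input row picks one requested resource
    x = [[0] * m for _ in range(n)]
    for i in range(n):
        wants = [j for j in range(m) if R[i][j]]
        if wants:
            x[i][random.choice(wants)] = 1
    # Stage 2: an n:1 arbiter per output column picks one surviving input
    g = [[0] * m for _ in range(n)]
    for j in range(m):
        asks = [i for i in range(n) if x[i][j]]
        if asks:
            g[random.choice(asks)][j] = 1
    return g

# Each call yields a legal (not necessarily maximum) matching; iterating
# and keeping earlier grants improves the match, as heuristics do.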