COMPLEXITY EFFECTIVE SUPERSCALAR PROCESSORS
Hardware
Microarchitecture
Lecture-1
<Ch. 16,17,18,19>
ELE-580i
PRESENTATION-I
04/01/2003
Canturk ISCI
ROUTER ARCHITECTURE
Router:
Registers
Switches
Functional Units
Control Logic
Implements:
Routing & Flow Control
Pipelined
Use credits for buffer space
Flits <downstream>
Credits <upstream>
Constitute the credit loop
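The credit loop above can be sketched in a few lines; `CreditChannel`, its field names, and the buffer size are illustrative assumptions, not from the lecture:

```python
# A minimal sketch of credit-based flow control: the upstream side may send
# a flit only while it holds credits; the downstream side returns a credit
# whenever a buffer slot frees up.

class CreditChannel:
    def __init__(self, buffer_slots):
        self.credits = buffer_slots      # free downstream slots known upstream
        self.downstream_buffer = []      # flits waiting downstream

    def send_flit(self, flit):
        """Upstream side: consume one credit per flit sent."""
        if self.credits == 0:
            return False                 # credit stall: must wait
        self.credits -= 1
        self.downstream_buffer.append(flit)
        return True

    def drain_flit(self):
        """Downstream side: freeing a slot returns a credit upstream."""
        flit = self.downstream_buffer.pop(0)
        self.credits += 1                # credit travels back upstream
        return flit

ch = CreditChannel(buffer_slots=2)
assert ch.send_flit("head") and ch.send_flit("body")
assert not ch.send_flit("tail")          # out of credits: credit stall
ch.drain_flit()                          # downstream frees a slot
assert ch.send_flit("tail")              # credit returned, sending resumes
```

The flits flowing downstream and the credits flowing back upstream together form the credit loop.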
7/18/2015
ROUTER Diagram
Virtual Channel Router
Datapath:
Input Unit:
State vector (for each VC) & flit buffer (for each VC)
Input Units | Switch | Output Units
State vector fields: G, R, O, P, C
Control:
Router, VC allocator, Switch allocator
Output Unit:
Latches outgoing flits
State vector fields: G, I, C
Switch:
Connect I/p to o/p according to SA
VCA:
Arbitrates o/p VC RQs from each I/p packet
Done once for each packet!!
SA:
Arbitrates o/p port RQs from I/p ports
Done for each flit
Router:
Determines o/p ports for packets
VC State Fields
Input virtual channel (one state vector per VC):
G Global State: I, R, V, A, C
R Route: o/p port for packet
O O/p VC: o/p VC of port R for packet
P Pointers: flit head and tail pointers
C Credit Count: # of credits for o/p VC R.O
Output virtual channel (one state vector per VC):
G Global State: I, A, C
I I/p VC: I/p port.VC forwarding to this o/p VC
C Credit Count: # of free buffers at the downstream node
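The two state vectors can be written out as plain records; the field names follow the GROPC / GIC mnemonics above, while the types, defaults, and VC count are illustrative assumptions:

```python
# A sketch of the per-VC state vectors kept by the input and output units.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class InputVCState:
    G: str = "I"                         # global state: I, R, V, A, or C
    R: Optional[int] = None              # route: o/p port for the packet
    O: Optional[int] = None              # o/p VC of port R
    P: Tuple[Optional[int], Optional[int]] = (None, None)  # head/tail ptrs
    C: int = 0                           # credits for o/p VC R.O

@dataclass
class OutputVCState:
    G: str = "I"                         # global state: I, A, or C
    I: Optional[str] = None              # i/p port.VC forwarding to this VC
    C: int = 0                           # free buffers at the downstream node

# One state vector per VC, as the slide notes:
NUM_VCS = 4                              # illustrative
input_vcs = [InputVCState() for _ in range(NUM_VCS)]
assert all(vc.G == "I" for vc in input_vcs)
```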
How it works
1) Per packet, the input controller:
Router selects o/p port (e.g. P3)
VCA selects o/p VC (e.g. P3.VC2)
Route determined
2) Per flit, the input controller:
SA timeslot over switch
Flit forwarded to o/p unit
3) Per flit, the output unit:
Drives downstream physical channel
Flit transferred
Router Pipeline
Route Compute:
Define the o/p port for packet header
VC Allocate:
Assign a VC from the port if available
Switch Allocate:
Schedule switch state according to o/p port requests
Switch Traverse:
I/p drives the switch for o/p port
Transmit:
Transmit the flit over downstream channel
<pipeline diagram: RC VA SA ST TX stages of successive flits overlapped>
RC, VA only for the header:
O/p channel is assigned for the whole packet
SA, ST, TX for all flits:
Flits from different packets compete continuously
Flits transmitted sequentially for routing in next hops
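The stage schedule can be sketched as follows; the function names and the no-stall assumption (one flit wins SA per cycle) are illustrative, not from the lecture:

```python
# Head flits traverse all five stages; body flits skip RC and VA because
# the o/p channel was already assigned for the whole packet.

def flit_stages(is_head):
    return ["RC", "VA", "SA", "ST", "TX"] if is_head else ["SA", "ST", "TX"]

def packet_schedule(num_flits):
    """Start cycle per stage per flit, assuming no stalls and that each
    body flit enters SA one cycle after the previous flit."""
    sched = []
    for i in range(num_flits):
        stages = flit_stages(is_head=(i == 0))
        start = 0 if i == 0 else 2 + i   # body flit i reaches SA at cycle 2+i
        sched.append({stage: start + k for k, stage in enumerate(stages)})
    return sched

s = packet_schedule(3)
assert s[0] == {"RC": 0, "VA": 1, "SA": 2, "ST": 3, "TX": 4}
assert s[1]["SA"] == 3 and s[2]["SA"] == 4   # flits transmitted sequentially
```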
Pipeline Walkthrough
(0):<start>
P4.VC3: (I/p VC)
G=I | R=x | O=x | P=x | C=x
Packet Arrives at I/p port P4
Packet header VC3
Packet stored in P4.VC3
(1):<RC>
P4.VC3:
G=R | R=x | O=x | P=<head>,<tail??> | C=x
Packet header: router selects o/p port P3
(2):<VA>
P4.VC3:
G=V | R=P3| O=x | P=<head>,<tail??> | C=x
P3.VC2: (o/p VC)
G=I | I=x | C=x
VCA allocates VC for o/p port P3: VC2
…Pipeline Walkthrough
(3):<SA>
P4.VC3: (i/p VC)
G=A | R=P3| O=VC2 | P=<head>,<tail??> | C=#
P3.VC2: (o/p VC)
G=A | I=P4.VC3 | C=#
Packet Processing complete
Flit by flit switch allocation/traverse & Transmit
Head flit allocated on switch
Move pointers
Decrement P4.VC3.Credit
Send a credit to upstream node to declare the available buffer space
(4):<ST>
Head flit arrives at output VC
(5):<TX>
Head flit transmitted to downstream
(6):<Tail in SA>
Packet done
(7):<Release Resources>
P4.VC3: (i/p VC)
G=I or R (if new packet already waiting) | R=x| O= x | P= x | C= x
P3.VC2: (o/p VC)
G=I | I=x | C=x
Pipeline Stalls
Packet Stalls:
P1) I/p VC Busy stall
P2) Routing stall
P3) VC Allocation stall
Flit Stalls:
F1) Switch Allocation Stall
F2) Buffer empty stall
F3) Credit stall
Credit Return cycles:
pipeline(4)+RndTrip(4)+CT(1)+CU(1)+NextSA(1)=11
Channel Reallocation
1) Conservative
Wait until credit received for tail from downstream to reallocate o/p VC
2) Aggressive – single Global state
Reallocate o/p VC when tail passes SA
(same as VA stall)
Reallocate downstream I/p VC when tail passes SA
(Same as I/p VC busy stall)
…Channel Reallocation
2) Aggressive – Double Global state
Reallocate o/p VC when tail passes SA
(same as VA stall)
Eliminate I/p VC busy stall
Needs 2 I/p VC state vectors at downstream:
For A:
G=A | R=Px | O=VCx | P=<head A>,<tail A> | C=#
For B:
G=R | R=x | O=x | P=<head B>,<tail??> | C=x
Speculation and Lookahead
Reduce latency by reducing pipe stages: speculation (and lookahead)
Speculate virtual channel allocation:
Do VA and SA concurrently
If VC set from RC spans more than 1 port speculate that as well
Lookahead:
Do route compute for node i at node i-1
Start at VA at each node; overlap NRC & VA
Flit and Credit Format
Two ways to distinguish credits/flits:
Piggybacking Credit:
Include a credit field on each flit
No types required
Define types:
E.g. 10 = start credit, 11 = start flit, 0x = idle
Flit Format:
Head flit: (Credit) | VC | Type | Route info | Payload | CRC
Body flit: (Credit) | VC | Type | Payload | CRC
Credit Format:
Credit: VC | Type | Check
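The type-field scheme suggested above (10 = start credit, 11 = start flit, 0x = idle) can be sketched as bit packing; the field widths and helper names are illustrative assumptions:

```python
# Pack and unpack a [type | vc | payload] symbol as one integer, MSB first.

TYPE_IDLE, TYPE_CREDIT, TYPE_FLIT = 0b00, 0b10, 0b11

def encode_symbol(type_bits, vc, payload, vc_bits=2, payload_bits=8):
    assert type_bits < 4 and vc < (1 << vc_bits) and payload < (1 << payload_bits)
    return (type_bits << (vc_bits + payload_bits)) | (vc << payload_bits) | payload

def decode_symbol(word, vc_bits=2, payload_bits=8):
    payload = word & ((1 << payload_bits) - 1)
    vc = (word >> payload_bits) & ((1 << vc_bits) - 1)
    type_bits = word >> (vc_bits + payload_bits)
    return type_bits, vc, payload

w = encode_symbol(TYPE_FLIT, vc=2, payload=0xAB)
assert decode_symbol(w) == (TYPE_FLIT, 2, 0xAB)
```

With piggybacked credits, a credit field would simply be one more slice of the same word; with explicit types, the type bits alone distinguish credits from flits.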
ROUTER COMPONENTS
Datapath:
Input buffer
Hold waiting flits
Switch
Route flits from I/p to o/p
Output unit
Send flits to downstream
Control:
Arbiter
Grant access to shared resources
Allocator
Allocate VCs to packets and switch time to flits
Input Buffer
Smooths flit traffic
Hold flits awaiting:
VCs
Switch BW or
Channel BW
Organization:
Centralized
Partitioned into physical channels
Partitioned into VCs
Centralized Input Buffer
Combined single memory across entire router
No separate switch, but
Need to multiplex I/ps to memory
Need to demultiplex memory o/p to o/p ports
PRO:
Flexibility in allocating memory space
CONs:
High Memory BW requirement:
2×I (I writes from I/ps + I reads to o/ps per flit time)
Flit deserialization / reserialization latency:
Need to gather I flits from the VCs before writing to MEM
Partitioned Input Buffers
1 buffer per physical I/p port:
Each Memory BW: 2 (1 read, 1 write)
Buffers shared across VCs for a fixed port
Buffers not shared across ports
Less flexibility
1 buffer per VC:
Enable switch I/p speedup
Obviously, bigger switch
Too fine granularity
Inefficient mem usage
Intermediate solutions:
E.g. two memories: Mem[even VCs] and Mem[odd VCs]
Input Buffer Data Structures
Data structures required to:
Track flit/ packet locations in Memory
Manage available free memory
Allocate multiple VCs
Prevent blocking
Two common types:
Circular buffers
Static, simpler yet inefficient mem usage
Linked Lists
Dynamic, complex, but fairer mem usage
Nomenclature:
Buffer (flit buffer): entire structure
Cell (flit cell): storage for a single flit
Circular Buffer
Fixed First and Last ptrs specify the memory boundary for a VC
Head and Tail specify the current content boundary
Flit added at the tail:
Tail incremented (modular)
Tail = Head => buffer full
Flit removed from the head:
Head incremented (modular)
Head = Tail => buffer empty
Choose size N a power of 2 so that the low log2(N) bits implement the circular increment
(like a cache line index & byte offset)
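A minimal circular buffer along these lines might look as follows. Note one design choice: since Tail = Head is ambiguous between full and empty, this sketch adds an occupancy count to disambiguate; the class and field names are illustrative:

```python
# A circular flit buffer with power-of-2 size, so masking the low log2(N)
# bits performs the modular increment.

class CircularBuffer:
    def __init__(self, size):
        assert size & (size - 1) == 0        # size must be a power of 2
        self.cells = [None] * size
        self.mask = size - 1
        self.head = 0                        # next cell to remove
        self.tail = 0                        # next cell to fill
        self.count = 0                       # disambiguates full vs empty

    def full(self):
        return self.count == len(self.cells)

    def empty(self):
        return self.count == 0

    def add(self, flit):                     # flit added at the tail
        assert not self.full()
        self.cells[self.tail] = flit
        self.tail = (self.tail + 1) & self.mask   # modular increment via mask
        self.count += 1

    def remove(self):                        # flit removed from the head
        assert not self.empty()
        flit = self.cells[self.head]
        self.head = (self.head + 1) & self.mask
        self.count -= 1
        return flit

buf = CircularBuffer(4)
for f in ["head", "b1", "b2", "tail"]:
    buf.add(f)
assert buf.full()
assert buf.remove() == "head"
buf.add("next")                              # tail wraps around
```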
Linked List Buffer
Each cell has a ptr field for next cell
Head and Tail specify 1st and last cells
NULL for empty buffers
Free List: Linked list of free cells
Free points to head of list
Counter registers
Count of allocated cells for each buffer
Count of cells in free list
Bit errors have a more severe effect than in a circular buffer
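The linked-list organization above can be sketched as follows: each cell holds a flit and a next-pointer, free cells are chained on a shared free list, and counters track per-buffer and free-list occupancy. Class and field names are illustrative assumptions:

```python
class LinkedListBuffer:
    def __init__(self, num_cells):
        self.flit = [None] * num_cells
        self.next = list(range(1, num_cells)) + [None]  # chain all cells
        self.free = 0                       # Free: head of the free list
        self.free_count = num_cells         # counter: cells in free list
        self.head = {}                      # per-VC first cell
        self.tail = {}                      # per-VC last cell
        self.count = {}                     # per-VC allocated-cell counter

    def add(self, vc, flit):
        assert self.free_count > 0
        cell, self.free = self.free, self.next[self.free]  # pop free list
        self.free_count -= 1
        self.flit[cell], self.next[cell] = flit, None
        if self.count.get(vc, 0) == 0:
            self.head[vc] = cell            # first cell of this VC's list
        else:
            self.next[self.tail[vc]] = cell # append after current tail
        self.tail[vc] = cell
        self.count[vc] = self.count.get(vc, 0) + 1

    def remove(self, vc):
        assert self.count.get(vc, 0) > 0
        cell = self.head[vc]
        flit = self.flit[cell]
        self.head[vc] = self.next[cell]
        self.next[cell], self.free = self.free, cell       # push on free list
        self.free_count += 1
        self.count[vc] -= 1
        return flit

buf = LinkedListBuffer(4)
buf.add("VC0", "a"); buf.add("VC1", "b"); buf.add("VC0", "c")
assert buf.remove("VC0") == "a" and buf.remove("VC0") == "c"
assert buf.free_count == 3
```

A corrupted next-pointer here can cross-link two VCs' lists or lose cells, which is why bit errors hurt more than in a circular buffer.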
Buffer Memory Allocation
Prevent a greedy VC from flooding all memory and blocking!
Add a count register to each I/p VC state vector
Keep number of allocated cells
Additional counter for free list
Simple Policy: Reserve 1 cell for each VC
Add flit to bufferVCi if:
bufferVCi empty or #(free list) > #(empty VCs)
Detailed policy: Sliding Limit Allocator
(r: # reserved cells per buffer,
f: fraction of empty space to use)
Add flit to bufferVCi if:
|bufferVCi| < r, or r < |bufferVCi| < f·#(free list) + r
f=r=1 same as simple policy
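The sliding-limit acceptance test reads directly as a predicate; the function and parameter names are illustrative, and the second branch uses `<` against the sliding limit exactly as in the slide's condition:

```python
# Accept a flit for VC i if its buffer holds fewer than the r reserved
# cells, or fewer than f * (free cells) + r cells overall.

def accept_flit(buffer_len, free_cells, r=1, f=0.5):
    """buffer_len: cells already held by this VC; free_cells: free-list size."""
    if buffer_len < r:                       # within the reserved allotment
        return True
    return buffer_len < f * free_cells + r   # sliding limit on shared space

# With r = 1, a VC's first flit is always accepted:
assert accept_flit(buffer_len=0, free_cells=0)
# A VC already holding many cells is throttled as free space shrinks:
assert not accept_flit(buffer_len=4, free_cells=2, r=1, f=0.5)
assert accept_flit(buffer_len=4, free_cells=20, r=1, f=0.5)
```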
SWITCH
Core: directs packets/flits to their destination
Speedup: provided switch BW / min. required switch BW for full thruput on all I/ps and o/ps of router
Adding speedup simplifies allocation and yields higher thruput and lower latency
Realizations:
Bus switch
Crossbar
Network switch
Bus Switches
Switches in time
Input port accumulates P phits of a flit, arbitrates for the bus, transmits P phits over the bus to any o/p unit
E.g. P=3 <fig. 17.5: P=4>
Feasible only if flits have # phits > P
(preferably an integer multiple of P)
Fragmentation Loss:
If phits per flit not multiple of P
Bus timing diagram
<fig. 17.6 annotation: "Could actually start here!">
Bus Pros & Cons
Simple switch allocation
I/p port owning bus can access all o/p ports
Multicast made easy
Wasted port BW
Port BW: b; Router BW = Pb; Bus BW = Pb
I/p deserializer BW = Pb; o/p serializer BW = Pb
Available internal BW: P × Pb = P^2·b
Used bus BW: Pb (speedup = 1)
Increased Latency
2P worst case <see 17.6-bus timing diagram>
Can vary from P+1 to 2P (phit times)
Xbar Switches
Primary issue: speedup
1. k×k: no speedup - fig 17.10(a)
2. sk×k: I/p speedup = s - fig 17.10(b)
3. k×sk: o/p speedup = s - fig 17.11(a)
4. sk×sk: speedup = s - fig 17.11(b)
(Speedup simplifies allocation)
Xbar Throughput
Ex: Random separable allocator, I/p speedup=s, uniform traffic:
Thruput = P{at least one of the sk flits is destined for a given o/p}
= 1 - P{none of the sk I/ps choose the given o/p}
= 1 - ((k-1)/k)^(sk)
s = k => thruput = 100% (doesn't verify with the formula above!!)
O/p speedup:
Need to implement reverse allocator
More complicated for same gain
Overall speedup (both I/p & o/p)
Can achieve > 100% thruput
Cannot sustain since:
o/p buffer will expand to inf.
and I/p buffers need to be initially filled with inf. # of flits
I/p speedup si & o/p speedup so (si > so):
Similar to I/p speedup = (si/so), with overall speedup so
Thruput = so·(1 - ((k-1)/k)^(si·k/so))
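The input-speedup expression above is easy to evaluate numerically; the function name is illustrative:

```python
# Throughput of a random separable allocator with input speedup s on a
# k x k crossbar under uniform traffic: 1 - ((k-1)/k)^(s*k).

def xbar_throughput(k, s=1):
    return 1 - ((k - 1) / k) ** (s * k)

# With no speedup the familiar ~63% saturation limit appears for large k:
assert abs(xbar_throughput(k=64, s=1) - 0.635) < 0.001
# Input speedup s = 2 already recovers most of the loss:
assert xbar_throughput(k=64, s=2) > 0.86
# Even s = k does not reach exactly 100%, matching the slide's remark
# that the 100% claim doesn't verify with this formula:
assert xbar_throughput(k=8, s=8) < 1.0
```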
Network Switches
A network of smaller switches
Reduces # of crosspoints
Localize logic
Reduces wire length
Requires complex control or intermediate buffering
Not very profitable!
Ex: 7×7 switch as 3 3×3 switches
3×9 = 27 crosspoints instead of 7×7 = 49
OUTPUT UNIT
Essentially a FIFO to match switch speed
If switch o/p speedup = 1:
Merely latch the flits to downstream
No need to partition across VCs
Provide backpressure to SA to prevent buffer overflow
SA should block traffic to the choking o/p buffer
ARBITER
Resolve multiple requests for a single resource (N:1)
Building blocks for allocators (N:M)
Communication and timing:
Arbiter Types
Types:
Fixed Priority: r0 > r1 > r2 > …
Variable (iterative) Priority: rotate priorities
Make a carry chain, with a hot 1 inserted from the priority inputs
E.g. r1 > r2 > … > r0: (p0,p1,p2,…,pn) = 010…0
Matrix: implements a least-recently-served (LRS) scheme
Uses a triangular array
M[r][c] = 1 => RQr has priority over RQc
Queueing: first come, first served
<The bank/STA Travel style>
Ticket counter:
Gives current ticket to requester
Increments with each ticket
Served counter:
Stores current served requester's number
Increments for next customer
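The matrix arbiter's least-recently-served behavior can be sketched as follows; the class name and the sequential grant loop are illustrative (real hardware evaluates all requests combinationally):

```python
# M[r][c] = 1 means requester r currently has priority over requester c.
# After a grant, the winner's row is cleared and its column set, making
# it the least-recently-served (lowest-priority) requester.

class MatrixArbiter:
    def __init__(self, n):
        # Start with a fixed priority order: r0 > r1 > ... > r(n-1).
        self.M = [[1 if r < c else 0 for c in range(n)] for r in range(n)]
        self.n = n

    def grant(self, requests):
        """requests: list of bools. Returns the winning index, or None."""
        for r in range(self.n):
            if not requests[r]:
                continue
            # r wins if no other active requester has priority over it.
            if all(not (requests[c] and self.M[c][r])
                   for c in range(self.n) if c != r):
                for c in range(self.n):       # demote the winner
                    if c != r:
                        self.M[r][c] = 0
                        self.M[c][r] = 1
                return r
        return None

arb = MatrixArbiter(3)
assert arb.grant([True, True, True]) == 0   # r0 starts with highest priority
assert arb.grant([True, True, True]) == 1   # r0 dropped to the back
assert arb.grant([True, False, True]) == 2
assert arb.grant([True, True, True]) == 0   # r0 is least recently served again
```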
ALLOCATOR
Provides matching:
Multiple requesters <=> multiple resources
I.e. switch allocator:
Every cycle, match I/p ports to o/p ports
1 flit per I/p port
1 flit goes to each o/p port
nxm allocator
rij: requester i wants access to resource j
gij: requester i granted access to resource j
Request & Grant Matrices:
Allocation rules
gij => rij: Grant if requested
gij => No other gik: Only 1 grant for each requester I/p
gij => No other gkj: Only 1 grant for each resource o/p
Allocation Problem
Can be represented as finding the maximum-matching grant matrix
Also a maximum matching in a bipartite graph:
Exact algorithms:
Augmenting path method
Always finds maximum matching
Not feasible in time budget
Faster Heuristics:
Separable allocators:
2 sets of arbitration:
Across I/ps & across o/ps
In either order: I/p first OR o/p first
4x3 Input-first Separable Allocator
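An input-first separable allocator like the 4×3 one in the figure can be sketched in two arbitration stages; fixed-priority arbiters are used here for simplicity, and the function names and request matrix are illustrative:

```python
# Stage 1: each input arbitrates among its own output requests.
# Stage 2: each output arbitrates among the surviving input requests.
# The grant matrix obeys the rules above: at most one grant per row
# (requester) and per column (resource), and grants only where requested.

def fixed_priority_arbiter(requests):
    """Return the index of the first asserted request, or None."""
    for i, r in enumerate(requests):
        if r:
            return i
    return None

def separable_allocate(R):
    """R: n x m request matrix (r_ij). Returns n x m grant matrix (g_ij)."""
    n, m = len(R), len(R[0])
    choice = [fixed_priority_arbiter(R[i]) for i in range(n)]   # stage 1
    G = [[0] * m for _ in range(n)]
    for j in range(m):                                          # stage 2
        winner = fixed_priority_arbiter([choice[i] == j for i in range(n)])
        if winner is not None:
            G[winner][j] = 1
    return G

# 4x3 example: inputs 0 and 1 both pick output 0 in stage 1.
R = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]
G = separable_allocate(R)
assert G[0] == [1, 0, 0]      # input 0 wins output 0
assert sum(G[1]) == 0         # input 1 lost output 0; its request for
                              # output 2 was dropped in stage 1
assert G[2] == [0, 1, 0] and G[3] == [0, 0, 1]
```

The idle input 1 shows why separable allocation is a heuristic: it can miss matchings that the exact augmenting-path method would find.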