Bluespec technical deep dive - Massachusetts Institute of


IP Lookup
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
February 17, 2009
http://csg.csail.mit.edu/arvind
L06-1
IP Lookup block in a router
[Figure: router block diagram. Labels: Line Card (LC), Packet Processor, SRAM (lookup table), IP Lookup, Arbitration, Control Processor, Switch, Queue Manager, Exit functions]
A packet is routed based on the “Longest Prefix Match” (LPM) of its IP address with entries in a routing table.
Line rate and the order of arrival must be maintained.
Line rate: 15 Mpps for 10GE
L06-2
Sparse tree representation
Routing table (prefix, result):
7.14.*.*      A
7.14.7.3      B
10.18.200.*   C
10.18.200.5   D
5.*.*.*       E
*             F
Example lookups (IP address, Result, M Ref):
7.13.7.3      F   2
10.18.201.5   F   3
7.14.7.2      A   4
5.13.7.2      E   1
10.18.200.7   C   4
[Figure: the routing table drawn as a sparse tree of lookup tables, indexed by successive bytes of the IP address]
In this lecture:
Level 1: 16 bits
Level 2: 8 bits
Level 3: 8 bits
So a lookup takes 1 to 3 memory accesses
L06-3
“C” version of LPM
int
lpm (IPA ipa)
/* 3 memory lookups */
{ int p;
  /* Level 1: 16 bits */
  p = RAM [ipa[31:16]];
  if (isLeaf(p)) return value(p);
  /* Level 2: 8 bits */
  p = RAM [ptr(p) + ipa [15:8]];
  if (isLeaf(p)) return value(p);
  /* Level 3: 8 bits */
  p = RAM [ptr(p) + ipa [7:0]];
  return value(p);   /* must be a leaf */
}
[Figure: RAM layout with a level-1 table indexed 0 … 2^16 - 1 and level-2 and level-3 tables indexed 0 … 2^8 - 1]
Not obvious from the C code how to deal with
- memory latency
- pipelining
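The three-level walk can be tried out on a small table. The following is a minimal Python sketch of the same lookup, not the lecture's implementation: the entry encoding (`("leaf", value)` or `("ptr", base)`), the flat list `ram`, and the helpers `ip` and `build_example` are all assumptions made for illustration, and the example table holds only a few of the routes from the sparse tree slide.

```python
# Entries are ("leaf", value) or ("ptr", base); this encoding is an assumption.
def lpm(ram, ipa):
    """Return (result, number_of_memory_reads) for a 32-bit address."""
    p = ram[(ipa >> 16) & 0xFFFF]            # Level 1: top 16 bits
    if p[0] == "leaf":
        return p[1], 1
    p = ram[p[1] + ((ipa >> 8) & 0xFF)]      # Level 2: next 8 bits
    if p[0] == "leaf":
        return p[1], 2
    p = ram[p[1] + (ipa & 0xFF)]             # Level 3: last 8 bits
    assert p[0] == "leaf"                    # must be a leaf
    return p[1], 3


def ip(a, b, c, d):
    """Pack a dotted quad into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d


def build_example():
    """Hypothetical table: * -> F, 5.*.*.* -> E, 10.18.200.* -> C, 10.18.200.5 -> D."""
    ram = [("leaf", "F")] * (1 << 16)        # level-1 table, default route F
    for low in range(256):                   # 5.*.*.* covers 256 level-1 slots
        ram[(5 << 8) | low] = ("leaf", "E")
    l2 = len(ram)                            # level-2 block for 10.18.*.*
    ram += [("leaf", "F")] * 256
    ram[(10 << 8) | 18] = ("ptr", l2)
    l3 = len(ram)                            # level-3 block for 10.18.200.*
    ram += [("leaf", "C")] * 256
    ram[l2 + 200] = ("ptr", l3)
    ram[l3 + 5] = ("leaf", "D")
    return ram


if __name__ == "__main__":
    table = build_example()
    for addr in [(5, 13, 7, 2), (10, 18, 201, 5), (10, 18, 200, 7)]:
        print(addr, lpm(table, ip(*addr)))
```

Each returned read count is the number of dependent memory accesses for that address, which is the quantity the latency discussion is concerned with.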
Must process a packet every 1/15 µs, or 67 ns
Memory latency: ~30 ns to 40 ns
Must sustain 3 dependent memory lookups in 67 ns
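The arithmetic behind these numbers can be checked directly. This is a small Python sketch using only the figures quoted on the slide (15 Mpps, 3 lookups, 30 to 40 ns latency); the function names are hypothetical, and the "packets in flight" estimate assumes the memory can accept a new request while earlier ones are outstanding.

```python
import math

LINE_RATE_PPS = 15e6   # 15 Mpps for 10GE, from the slide
LOOKUPS = 3            # worst-case dependent memory reads per packet


def budget_ns(rate_pps=LINE_RATE_PPS):
    """Time available per packet at the given line rate."""
    return 1e9 / rate_pps


def packets_in_flight(latency_ns, lookups=LOOKUPS):
    """Packets that must overlap so serial lookups still meet the budget."""
    serial_ns = lookups * latency_ns
    return math.ceil(serial_ns / budget_ns())


if __name__ == "__main__":
    print("budget per packet (ns):", budget_ns())
    for latency in (30, 40):
        print(latency, "ns latency:", LOOKUPS * latency, "ns serial,",
              packets_in_flight(latency), "packets in flight")
```

Three dependent reads take 90 to 120 ns, which exceeds the 67 ns budget, so several packets have to be processed concurrently.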
L06-4
Longest Prefix Match for IP lookup:
3 possible implementation architectures
- Rigid pipeline: inefficient memory usage but simple design
- Linear pipeline: efficient memory usage through memory port replicator
- Circular pipeline: efficient memory with most complex control
Designer’s Ranking (rigid, linear, circular): 1, 2, 3
Which is “best”?
Arvind, Nikhil, Rosenband & Dave, ICCAD 2004
L06-5
Circular pipeline
[Figure: inQ → enter? → RAM → done? → (yes) outQ; (no) back to enter? through fifo]
The fifo holds the request while the memory access is in progress.
The architecture has been simplified for the sake of the lecture. Otherwise, a “completion buffer” has to be added at the exit to make sure that packets leave in order.
L06-6
FIFO
interface FIFO#(type t);
  method Action enq(t x);   // enqueue an item
  method Action deq();      // remove oldest entry
  method t first();         // inspect oldest item
endinterface
[Figure: FIFO module ports. enq: n-bit data, enab, rdy (not full). deq: enab, rdy (not empty). first: n-bit data, rdy (not empty)]
n = # of bits needed to represent a value of type t
L06-7
Request-Response Interface for Synchronous Memory
interface Mem#(type addrT, type dataT);
  method Action req(addrT x);
  method Action deq();
  method dataT peek();
endinterface
Making a synchronous component latency-insensitive
[Figure: synchronous memory with latency N (Addr, Data) behind the req, peek and deq methods. Labels: Enable, Ready, Ack, Ready (ctr > 0), a counter ctr with ctr++ and ctr--, and an internal buffer with enq and deq]
L06-8
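The idea of wrapping a fixed-latency memory behind req/peek/deq can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not the lecture's hardware: the class name, the `capacity` parameter (standing in for the counter that bounds outstanding requests), and the explicit `tick` method are all hypothetical.

```python
from collections import deque


class LatencyInsensitiveMem:
    """Behavioral model of req/peek/deq around a fixed-latency memory."""

    def __init__(self, contents, latency, capacity):
        self.contents = contents      # indexable backing store
        self.latency = latency        # cycles from req to data available
        self.capacity = capacity      # max requests in flight or buffered
        self.pipe = deque()           # (ready_cycle, data) inside the memory
        self.out = deque()            # responses waiting for the consumer
        self.cycle = 0

    def can_req(self):
        # Accept a request only if its response is guaranteed a buffer slot.
        return len(self.pipe) + len(self.out) < self.capacity

    def req(self, addr):
        assert self.can_req(), "req is not ready"
        self.pipe.append((self.cycle + self.latency, self.contents[addr]))

    def can_peek(self):
        return len(self.out) > 0

    def peek(self):
        assert self.can_peek(), "peek is not ready"
        return self.out[0]

    def deq(self):
        assert self.can_peek(), "deq is not ready"
        self.out.popleft()

    def tick(self):
        """Advance one clock and move finished reads into the output buffer."""
        self.cycle += 1
        while self.pipe and self.pipe[0][0] <= self.cycle:
            self.out.append(self.pipe.popleft()[1])


if __name__ == "__main__":
    mem = LatencyInsensitiveMem(["a", "b", "c"], latency=3, capacity=2)
    mem.req(0)
    mem.tick()
    mem.req(1)
    print("can_req while two are outstanding:", mem.can_req())
    mem.tick()
    mem.tick()
    print("first response:", mem.peek())
```

The caller never counts cycles; it only checks whether `req`, `peek` and `deq` are ready, which is what makes the component latency-insensitive.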
Circular Pipeline Code
[Figure: inQ → enter? → RAM → done?, with fifo on the return path]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
rule recirculate (True);
  TableEntry p = ram.peek(); ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else begin
    fifo.enq(rip << 8);
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
“done?” is the same as isLeaf.
When can enter fire? inQ has an element and ram & fifo each has space.
L06-9
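The behavior of the two rules can be explored with a small Python model that fires one rule per step. This is a sketch under several assumptions that are not in the slides: the RAM responds immediately, `enter` is tried first whenever the fifo has space, the table is a `defaultdict` with a default leaf, and the names `simulate`, `make_table` and `capacity` are hypothetical.

```python
from collections import defaultdict, deque


def make_table():
    """Hypothetical table: default F, 5.13.*.* -> E, 10.18.200.* -> C."""
    ram = defaultdict(lambda: ("leaf", "F"))
    ram[(5 << 8) | 13] = ("leaf", "E")
    ram[(10 << 8) | 18] = ("ptr", 0x10000)      # level-2 block
    ram[0x10000 + 200] = ("ptr", 0x10100)       # level-3 block
    for low in range(256):
        ram[0x10100 + low] = ("leaf", "C")
    return ram


def simulate(ram, packets, capacity):
    """Fire enter/recirculate one at a time; return results in completion order."""
    inq, fifo, resp, outq = deque(packets), deque(), deque(), []
    while inq or fifo:
        if inq and len(fifo) < capacity:         # guard of rule enter
            ipa = inq.popleft()
            resp.append(ram[(ipa >> 16) & 0xFFFF])
            fifo.append(ipa & 0xFFFF)
        elif resp and fifo:                      # guard of rule recirculate
            p, rip = resp.popleft(), fifo.popleft()
            if p[0] == "leaf":                   # done? is the same as isLeaf
                outq.append(p[1])
            else:
                fifo.append((rip << 8) & 0xFFFF)
                resp.append(ram[p[1] + ((rip >> 8) & 0xFF)])
        else:
            break                                # no rule can fire
    return outq


if __name__ == "__main__":
    pkts = [(10 << 24) | (18 << 16) | (200 << 8) | 7,   # 10.18.200.7, 3 lookups
            (5 << 24) | (13 << 16) | (7 << 8) | 2]      # 5.13.7.2, 1 lookup
    print(simulate(make_table(), pkts, capacity=2))
    print(simulate(make_table(), pkts, capacity=1))
```

With two requests in flight the short lookup can finish before the long one that arrived earlier, which is the reordering that the completion buffer later in the lecture is meant to undo.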
Circular Pipeline Code:
discussion
[Figure: inQ → enter? → RAM → done?, with fifo on the return path]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
rule recirculate (True);
  TableEntry p = ram.peek(); ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else begin
    fifo.enq(rip << 8);
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
When can recirculate fire? ram & fifo each has an element and ram, fifo & outQ each has space.
Is this possible?
L06-10
One Element FIFO
module mkFIFO1 (FIFO#(t));
  Reg#(t) data <- mkRegU();
  Reg#(Bool) full <- mkReg(False);
  method Action enq(t x) if (!full);
    full <= True;
    data <= x;
  endmethod
  method Action deq() if (full);
    full <= False;
  endmethod
  method t first() if (full);
    return (data);
  endmethod
  method Action clear();
    full <= False;
  endmethod
endmodule
[Figure: FIFO module ports. enq: n-bit data, enab, rdy (not full). deq: enab, rdy (not empty)]
enq and deq cannot even be enabled together, much less fire concurrently!
The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally.
We can build such a FIFO (more on this later).
L06-11
Dead cycles
[Figure: inQ → enter? → RAM → done?, with fifo on the return path]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]); inQ.deq();
endrule
rule recirculate (True);
  TableEntry p = ram.peek(); ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else begin
    fifo.enq(rip << 8);
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
Assume simultaneous enq & deq is allowed.
Can a new request enter the system when an old one is leaving?
Is this worth worrying about?
L06-12
The Effect of Dead Cycles
[Figure: Circular Pipeline. in → enter → RAM → done? → (yes) out; (no) back to enter through fifo]
- RAM takes several cycles to respond to a request
- Each IP request generates 1-3 RAM requests
- FIFO entries hold base pointer for next lookup and unprocessed part of the IP address
What is the performance loss if “exit” and “enter” don’t ever happen in the same cycle?
>33% slowdown! Unacceptable
L06-13
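The slide does not show how the 33% figure is derived. One plausible reading, offered here only as an assumption, is that the RAM port is the bottleneck and every packet costs one extra idle cycle because its exit and the next entry cannot share a cycle. The Python sketch below computes the extra time per packet under that assumption; the function name is hypothetical.

```python
def slowdown(ram_requests, dead_cycles=1):
    """Extra time per packet, as a fraction of its useful RAM cycles.

    Assumes the RAM port is the bottleneck and each packet adds
    `dead_cycles` idle cycles on top of its `ram_requests` busy cycles.
    """
    return dead_cycles / ram_requests


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, "RAM requests per packet:", f"{slowdown(k):.0%}", "extra time")
```

Under this reading a packet needing all three lookups loses about a third, and packets that finish in fewer lookups lose proportionally more.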
The compiler issue
Can the compiler detect all the conflicting conditions? yes
- Important for correctness
Does the compiler detect conflicts that do not exist in reality? yes
- False positives lower the performance
- The main reason is that sometimes the compiler cannot detect under what conditions the two rules are mutually exclusive or conflict free
What can the user specify easily?
- Rule priorities to resolve nondeterministic choice
In many situations the correctness of the design is not enough; the design is not done unless the performance goals are met.
L06-14
Scheduling conflicting rules
When two rules conflict on a shared resource, they cannot both execute in the same clock cycle.
The compiler produces logic that ensures that, when both rules are applicable, only one will fire.
- Which one? This is specified with source annotations:
(* descending_urgency = "recirculate, enter" *)
L06-15
So is there a dead cycle?
[Figure: inQ → enter? → RAM → done?, with fifo on the return path]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]); inQ.deq();
endrule
rule recirculate (True);
  TableEntry p = ram.peek(); ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else begin
    fifo.enq(rip << 8);
    ram.req(p + rip[15:8]);
  end
  fifo.deq();
endrule
In general these two rules conflict, but when isLeaf(p) is true there is no apparent conflict!
L06-16
Rule Splitting
rule foo (True);
  if (p) r1 <= 5;
  else r2 <= 7;
endrule
can be split into
rule fooT (p);
  r1 <= 5;
endrule
rule fooF (!p);
  r2 <= 7;
endrule
Rules fooT and fooF can each be scheduled independently with some other rule.
L06-17
Splitting the recirculate rule
rule recirculate (!isLeaf(ram.peek()));
  IP rip = fifo.first(); fifo.enq(rip << 8);
  ram.req(ram.peek() + rip[15:8]);
  fifo.deq(); ram.deq();
endrule
rule exit (isLeaf(ram.peek()));
  outQ.enq(ram.peek()); fifo.deq(); ram.deq();
endrule
rule enter (True);
  IP ip = inQ.first(); ram.req(ip[31:16]);
  fifo.enq(ip[15:0]); inQ.deq();
endrule
Now rules enter and exit can be scheduled simultaneously, assuming fifo.enq and fifo.deq can be done simultaneously.
L06-18
Back to the fifo problem
module mkFIFO1 (FIFO#(t));
  Reg#(t) data <- mkRegU();
  Reg#(Bool) full <- mkReg(False);
  method Action enq(t x) if (!full);
    full <= True;
    data <= x;
  endmethod
  method Action deq() if (full);
    full <= False;
  endmethod
  method t first() if (full);
    return (data);
  endmethod
  method Action clear();
    full <= False;
  endmethod
endmodule
[Figure: FIFO module ports. enq: n-bit data, enab, rdy (not full). deq: enab, rdy (not empty)]
The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally.
L06-19
RWire to rescue
interface RWire#(type t);
  method Action wset(t x);
  method Maybe#(t) wget();
endinterface
Like a register in that you can read and write it, but unlike a register
- read happens after write
- data disappears in the next cycle
RWires can break the atomicity of a rule if not used properly.
L06-20
One Element “Loopy” FIFO
module mkLFIFO1 (FIFO#(t));
  Reg#(t) data <- mkRegU();
  Reg#(Bool) full <- mkReg(False);
  RWire#(void) deqEN <- mkRWire();
  method Action enq(t x) if (!full || isValid (deqEN.wget()));
    full <= True;
    data <= x;
  endmethod
  method Action deq() if (full);
    full <= False; deqEN.wset(?);
  endmethod
  method t first() if (full);
    return (data);
  endmethod
  method Action clear();
    full <= False;
  endmethod
endmodule
This works correctly in both cases (fifo full and fifo empty).
[Figure: FIFO module ports. enq: enab, rdy (!full or deq enab). deq: enab, rdy (not empty)]
L06-21
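The difference between the plain one-element FIFO and the loopy one shows up only when enq and deq are requested in the same clock. The Python sketch below models one clock at a time; it is a behavioral illustration, not a translation of the BSV. The class names, the `cycle` method and the `deq_this_cycle` argument (standing in for the RWire signal) are assumptions.

```python
class FIFO1:
    """One-element FIFO: enq needs !full, deq needs full."""

    def __init__(self):
        self.full = False
        self.data = None

    def can_enq(self, deq_this_cycle):
        return not self.full

    def cycle(self, enq=None, deq=False):
        """Attempt deq and/or enq in one clock; return (deq_fired, enq_fired)."""
        deq_fired = bool(deq and self.full)
        enq_fired = enq is not None and self.can_enq(deq_fired)
        if deq_fired:                 # deq "happens" first
            self.full = False
        if enq_fired:                 # then enq overwrites the slot
            self.full = True
            self.data = enq
        return deq_fired, enq_fired


class LoopyFIFO1(FIFO1):
    """enq is also ready when deq fires in the same clock (the RWire signal)."""

    def can_enq(self, deq_this_cycle):
        return (not self.full) or deq_this_cycle


if __name__ == "__main__":
    for cls in (FIFO1, LoopyFIFO1):
        f = cls()
        f.cycle(enq=1)                               # fill the single slot
        print(cls.__name__, f.cycle(enq=2, deq=True), "now holds", f.data)
```

When the FIFO is full, the plain version lets only deq fire and the slot sits empty for a cycle, while the loopy version replaces the departing element in the same clock. In the empty case both behave the same: deq cannot fire and enq proceeds normally.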
Problem solved!
LFIFO fifo <- mkLFIFO;   // use a loopy fifo
rule recirculate (True);
  TableEntry p = ram.peek();
  ram.deq();
  IP rip = fifo.first();
  if (isLeaf(p)) outQ.enq(p);
  else
    begin
      fifo.enq(rip << 8);
      ram.req(p + rip[15:8]);
    end
  fifo.deq();
endrule
RWire has been safely encapsulated inside the Loopy FIFO; users of the Loopy FIFO need not be aware of RWires.
L06-22
Packaging a module:
Turning a rule into a method
[Figure: inQ → enter? → RAM → done? → outQ, with fifo on the return path]
rule enter (True);
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
  inQ.deq();
endrule
becomes
method Action enter (IP ip);
  ram.req(ip[31:16]);
  fifo.enq(ip[15:0]);
endmethod
Similarly, a method can be written to extract elements from the outQ.
L06-23
Circular pipeline
with Completion Buffer
[Figure: inQ → enter? → RAM → done? → (yes) cbuf; (no) back to enter? through fifo. Labels: getToken, luReq, luResp]
Completion buffer
- gives out tokens to control the entry into the circular pipeline
- ensures that departures take place in order even if lookups complete out-of-order
The fifo holds the token while the memory access is in progress: Tuple2#(Bit#(16), Token), where the Bit#(16) field is labeled remainingIP.
L06-24
Circular Pipeline Code
with Completion Buffer
[Figure: inQ → enter? → RAM → done? → cbuf, with fifo on the return path]
rule enter (True);
  Token tok <- cbuf.getToken();
  IP ip = inQ.first();
  ram.req(ip[31:16]);
  fifo.enq(tuple2(ip[15:0], tok)); inQ.deq();
endrule
rule recirculate (True);
  TableEntry p <- ram.resp();
  match {.rip, .tok} = fifo.first();
  if (isLeaf(p)) cbuf.put(tok, p);
  else begin
    fifo.enq(tuple2(rip << 8, tok));
    ram.req(p+rip[15:8]);
  end
  fifo.deq();
endrule
L06-25
Completion buffer
interface CBuffer#(type t);
  method ActionValue#(Token) getToken();
  method Action put(Token tok, t d);
  method ActionValue#(t) getResult();
endinterface
typedef Bit#(TLog#(n)) TokenN#(numeric type n);
typedef TokenN#(16) Token;
[Figure: buf, an array of slots marked I (Invalid) or V (Valid), with input index i, output index o and count cnt]
module mkCBuffer (CBuffer#(t))
    provisos (Bits#(t,sz));
  RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull();
  Reg#(Token) i <- mkReg(0);        // input index
  Reg#(Token) o <- mkReg(0);        // output index
  Reg#(Int#(32)) cnt <- mkReg(0);   // number of filled slots
  …
Elements must be representable as bits (the Bits#(t,sz) proviso).
L06-26
Completion buffer
[Figure: buf, an array of slots marked I (Invalid) or V (Valid), with input index i, output index o and count cnt]
// state elements
// buf, i, o, cnt ...
method ActionValue#(Token) getToken() if (cnt < maxToken);
  cnt <= cnt + 1; i <= i + 1;
  buf.upd(i, Invalid);
  return i;
endmethod
method Action put(Token tok, t data);
  buf.upd(tok, Valid data);
endmethod
method ActionValue#(t) getResult() if (cnt > 0) &&&
    (buf.sub(o) matches tagged (Valid .x));
  o <= o + 1; cnt <= cnt - 1;
  return x;
endmethod
Homework: Think about concurrency issues, i.e., can these methods be executed concurrently? Do they need to?
L06-27
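The ordering guarantee of the completion buffer can be seen in a sequential Python model of the three methods. This is a sketch under stated assumptions: `None` stands in for the Invalid tag (so stored results must not be `None`), the `can_*` methods play the role of the method guards, and the class and method names are hypothetical. It does not address the concurrency questions posed as homework.

```python
class CompletionBuffer:
    """Sequential model of getToken / put / getResult."""

    def __init__(self, size):
        self.size = size
        self.buf = [None] * size      # None plays the role of Invalid
        self.i = 0                    # input index
        self.o = 0                    # output index
        self.cnt = 0                  # number of filled slots

    def can_get_token(self):
        return self.cnt < self.size

    def get_token(self):
        assert self.can_get_token(), "no free slot"
        tok = self.i
        self.buf[tok] = None          # mark the slot Invalid
        self.i = (self.i + 1) % self.size
        self.cnt += 1
        return tok

    def put(self, tok, data):
        self.buf[tok] = data          # mark the slot Valid with its result

    def can_get_result(self):
        return self.cnt > 0 and self.buf[self.o] is not None

    def get_result(self):
        assert self.can_get_result(), "oldest result is not ready"
        x = self.buf[self.o]
        self.o = (self.o + 1) % self.size
        self.cnt -= 1
        return x


if __name__ == "__main__":
    cb = CompletionBuffer(4)
    first, second = cb.get_token(), cb.get_token()
    cb.put(second, "E")               # the later packet finishes first
    print("result ready?", cb.can_get_result())
    cb.put(first, "C")
    print(cb.get_result(), cb.get_result())
```

Results leave in token order: the early completion of the second packet is held back until the first packet's slot becomes valid.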
Longest Prefix Match for IP lookup:
3 possible implementation architectures
- Rigid pipeline: inefficient memory usage but simple design
- Linear pipeline: efficient memory usage through memory port replicator
- Circular pipeline: efficient memory with most complex control
Which is “best”?
Arvind, Nikhil, Rosenband & Dave, ICCAD 2004
L06-28
Implementations of Static pipelines
Two designers, two results
LPM versions                 Best Area (gates)   Best Speed (ns)
Static V (Replicated FSMs)   8898                3.60
Static V (Single FSM)        2271                3.56
[Figure: two block diagrams from IP addr to result. Replicated: several FSMs between MUX / De-MUX stages sharing one RAM; each packet is processed by one FSM. BEST: a single shared FSM with a MUX and the RAM. Other labels: Counter]
L06-29
Synthesis results
LPM versions   Code size (lines)   Best Area (gates)   Best Speed (ns)    Mem. util. (random workload)
Static V       220                 2271                3.56               63.5%
Static BSV     179                 2391 (5% larger)    3.32 (7% faster)   63.5%
Linear V       410                 14759               4.7                99.9%
Linear BSV     168                 15910 (8% larger)   4.7 (same)         99.9%
Circular V     364                 8103                3.62               99.9%
Circular BSV   257                 8170 (1% larger)    3.67 (2% slower)   99.9%
V = Verilog; BSV = Bluespec System Verilog
Synthesis: TSMC 0.18 µm lib
- Bluespec results can match carefully coded Verilog
- Micro-architecture has a dramatic impact on performance
- Architecture differences are much more important than language differences in determining QoR
L06-30