Cycle Accurate modeling using Bounded Dataflow Networks


Transforming an implementation
into a cycle-accurate simulator
using BDN
Murali Vijayaraghavan and Arvind
Computer Science and Artificial Intelligence Laboratory
M.I.T.
RAMP Workshop, Austin, TX
June 25, 2009
http://csg.csail.mit.edu
IBM/MIT Collaboration
Sept 2007 –
Motivation: Create an ecosystem to foster and
promote the use of Power architecture in system
research
Initial Goal: Create a flexible and synthesizable
multithreaded, multicore PowerPC model that
facilitates rapid architectural exploration

  Parameterized for the number of threads and for the number
  and functionality of pipeline stages
Current Goals:
  Cycle-accurate modeling
  Open source distribution on widely available FPGAs by
  summer 2010
The Team
Architecture and Bluespec Coding


K. Ekanadham, Jessica Tseng
MIT: Asif Khan, Murali Vijayaraghavan
Linux OS Bring-up Team

Hubertus Franke, Jimi Xenidis
FPGA Prototyping Team

Richard Kaufman, Kai Schleupen
Managers
Nancy Greco, Pratap Pattnaik
MIT: Arvind
Results
A 64-bit embedded PowerPC was created from
scratch in Bluespec SystemVerilog (BSV)
  Implemented on an IBM internal FPGA platform that
  uses a Xilinx Virtex-5 LX330 chip
  Linux was booted on it by Nov 2008
Jessica has ported this design onto Xilinx
XUPV5
  Takes up 92% of the area
  Running at 20 MHz, but can probably be raised to
  40 MHz
Issues in “Prototype” RTL to
FPGA mapping
Some structures consume a disproportionate
amount of FPGA resources
  multiported register file
  CAM
  multiply, divide
  (These can be implemented in multiple cycles to save resources)
Prototype RTL implementations on FPGAs need
to compensate for external memory timing
Lack of tools for mapping onto multiple FPGAs
⇒ Cycle-accurate modeling
Bounded Dataflow Networks (BDNs) as a
theoretical framework for cycle-accurate modeling
of synchronous sequential machines (SSMs)
Murali Vijayaraghavan & Arvind [MEMOCODE 2009]
Implementing RTL on FPGAs
[Figure: the target RTL (ASIC) contains a 3-read, 2-write register file; on the FPGA it is simulated on BRAMs in multiple cycles]
In general, functional correctness requires
cycle accuracy
BDN as a refinement of an SSM
[Figure: SSM S and BDN R, each with inputs I1 ... In and outputs O1 ... Om]
There is a bijective mapping between the
inputs (outputs) of S and R
Cycle accuracy: for all n > 0,
I(k) matches for S and R (1 ≤ k ≤ n)
⇒ O(j) matches for S and R (1 ≤ j ≤ n)
For a BDN, I(k) refers to the kth enqueue in each input FIFO
Patient SSMs: SSMs with a “start”
signal to update registers
[Figure: an SSM's combinational logic, redrawn with an added enable signal]
[Figure: SSM to BDN and refinements. An SSM made of S1, S2 (big) and S3 (big) is cut into patient SSMs S1, S2 and S3; the patient SSMs are turned into BDNs; BDN refinements then give R1, R2' (small) and R3' (small)]
SSM to BDN
The translation has to be done such
that the generated BDN is latency-insensitive,
i.e., the input-output
behavior of the BDN does not change if
we change the latency of one of its
component BDNs or the size of the
FIFOs connecting the components
Implementing an SSM as a BDN
[Figure: an SSM with inputs a, b and outputs c = f(a, b), d = b, and the corresponding BDN with FIFOs a, b, c, d]
rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
  ⇒ c.enq(f(a.first, b.first)); d.enq(b.first);
     a.deq; b.deq
This description can be easily translated into logic
that serves as a wrapper for the original logic
The SSM and BDN have the same input-output
behavior
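The claim that the SSM and the BDN have the same input-output behavior can be checked with a small simulation. The following Python sketch is not from the talk. It models the four FIFOs as bounded queues, picks an arbitrary f, assumes a FIFO depth of 2, and lets a random environment decide when inputs arrive and when outputs are drained. The names ssm_outputs, bdn_outputs and DEPTH are illustrative only.

```python
import random
from collections import deque

DEPTH = 2  # assumed FIFO depth (not from the slides)


def f(x, y):
    """Hypothetical stand-in for the combinational function f."""
    return x * 10 + y


def ssm_outputs(a_vals, b_vals):
    """Reference SSM: in cycle t, c(t) = f(a(t), b(t)) and d(t) = b(t)."""
    return [f(x, y) for x, y in zip(a_vals, b_vals)], list(b_vals)


def bdn_outputs(a_vals, b_vals, seed=0):
    """Run rule O while inputs arrive and outputs drain at random times.

    Assumes a_vals and b_vals have the same length.
    """
    rng = random.Random(seed)
    a, b, c, d = deque(), deque(), deque(), deque()
    a_src, b_src = deque(a_vals), deque(b_vals)
    c_out, d_out = [], []
    n = len(a_vals)
    while len(c_out) < n or len(d_out) < n:
        # Environment: enqueue inputs and dequeue outputs at arbitrary times.
        if a_src and len(a) < DEPTH and rng.random() < 0.5:
            a.append(a_src.popleft())
        if b_src and len(b) < DEPTH and rng.random() < 0.5:
            b.append(b_src.popleft())
        if c and rng.random() < 0.5:
            c_out.append(c.popleft())
        if d and rng.random() < 0.5:
            d_out.append(d.popleft())
        # Rule O: guard is not a.empty, not b.empty, not c.full, not d.full.
        if a and b and len(c) < DEPTH and len(d) < DEPTH:
            c.append(f(a[0], b[0]))
            d.append(b[0])
            a.popleft()
            b.popleft()
    return c_out, d_out


if __name__ == "__main__":
    print(ssm_outputs([1, 2, 3], [4, 5, 6]))
    print(bdn_outputs([1, 2, 3], [4, 5, 6], seed=1))
```

Whatever the seed, the nth value dequeued from c and d should equal the SSM's outputs in cycle n, which is the sense in which the wrapper preserves cycle accuracy.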
Deadlocks
[Figure: the same BDN, with input FIFOs a, b, function f, and output FIFOs c, d]
rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
  ⇒ c.enq(f(a.first, b.first)); d.enq(b.first);
     a.deq; b.deq
Extraneous dependencies: d unnecessarily depends upon a and c
Another behavior for the same BDN
[Figure: the same BDN with added state bits cDone and dDone]
rule O1 when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬cDone)
  ⇒ c.enq(f(a.first, b.first)); cDone <= True
rule O2 when (¬b.empty ∧ ¬d.full ∧ ¬dDone)
  ⇒ d.enq(b.first); dDone <= True
rule In when (cDone ∧ dDone)
  ⇒ a.deq; b.deq; cDone <= False; dDone <= False
No extraneous dependencies – No deadlock
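To illustrate the difference between the single rule O and the split rules O1, O2 and In, here is a minimal Python sketch (not from the talk). The Fifo class, the choice f(x, y) = x + y and the depth of 2 are assumptions made for illustration. Each step fires every enabled rule once in a fixed order, which is one legal schedule of the atomic rules.

```python
from collections import deque


class Fifo:
    """Bounded FIFO with the empty/full/first/enq/deq interface used on the slides."""

    def __init__(self, depth=2):  # depth is an assumed parameter
        self.depth = depth
        self.items = deque()

    def empty(self):
        return len(self.items) == 0

    def full(self):
        return len(self.items) >= self.depth

    def first(self):
        return self.items[0]

    def enq(self, x):
        assert not self.full()
        self.items.append(x)

    def deq(self):
        return self.items.popleft()


def f(x, y):
    """Hypothetical combinational function."""
    return x + y


def single_rule_step(a, b, c, d):
    """Rule O: needs both inputs and room in both outputs."""
    if not a.empty() and not b.empty() and not c.full() and not d.full():
        c.enq(f(a.first(), b.first()))
        d.enq(b.first())
        a.deq()
        b.deq()
        return True
    return False


class SplitRules:
    """Rules O1, O2 and In with the cDone/dDone state bits."""

    def __init__(self):
        self.c_done = False
        self.d_done = False

    def step(self, a, b, c, d):
        fired = False
        # Rule O1: c depends on a and b.
        if not a.empty() and not b.empty() and not c.full() and not self.c_done:
            c.enq(f(a.first(), b.first()))
            self.c_done = True
            fired = True
        # Rule O2: d depends on b only.
        if not b.empty() and not d.full() and not self.d_done:
            d.enq(b.first())
            self.d_done = True
            fired = True
        # Rule In: consume the inputs once both outputs are produced.
        if self.c_done and self.d_done:
            a.deq()
            b.deq()
            self.c_done = False
            self.d_done = False
            fired = True
        return fired
```

With a value in b but nothing yet in a, single_rule_step cannot fire, whereas SplitRules.step already produces d. The output c follows once a arrives, and only then are both inputs dequeued.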
Latency-Insensitive BDNs
No extraneous dependency property: if output
Oi is not enqueued n times, assuming it is not
full and all the inputs are enqueued n-1 times,
then it must be that one of the inputs in
Depends-on(Oi) is not enqueued n times
Self Cleaning property: If all outputs are
enqueued n times then all inputs must be
dequeued n times
BDNs with these properties are latency-insensitive and do not deadlock
Writing an LI-BDN wrapper for an
SSM
Given the SSM:
oj(t) = fj(ij1(t), ... , ijIj(t), s(t))
// ij1, ij2, ... , ijIj are in Depends-on(oj)
s(t+1) = g(i1(t), i2(t), ... , s(t))
LI-BDN:
rule Oj when (¬donej)
  ⇒ donej <= True;
     oj.enq(fj(ij1.first, ... , ijIj.first, s))
rule Finish when (done1 ∧ done2 ∧ ...)
  ⇒ done1 <= False; done2 <= False; ...
     s <= g(i1.first, i2.first, ... , s);
     i1.deq; i2.deq; ...
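The wrapper scheme can be sketched generically in Python. This is an illustrative model under stated assumptions, not the authors' Bluespec. The class name LiBdnWrapper, the FIFO depth of 2, and the example SSM (out0, out1, next_state) are all hypothetical. Each output j is described by its function fj and the tuple of input indices in Depends-on(oj).

```python
from collections import deque


class LiBdnWrapper:
    """LI-BDN wrapper for an SSM given by output functions, Depends-on sets and g."""

    def __init__(self, n_inputs, outputs, g, s0, depth=2):
        # outputs: list of (fj, deps), where deps is a tuple of input indices.
        self.ins = [deque() for _ in range(n_inputs)]
        self.outs = [deque() for _ in outputs]
        self.outputs = outputs
        self.g = g
        self.s = s0
        self.done = [False] * len(outputs)
        self.depth = depth  # assumed FIFO depth

    def enq(self, i, value):
        assert len(self.ins[i]) < self.depth, "input FIFO full"
        self.ins[i].append(value)

    def step(self):
        """Fire every enabled rule once; return True if any rule fired."""
        fired = False
        for j, (fj, deps) in enumerate(self.outputs):
            # Rule Oj: needs only the inputs in Depends-on(oj) and room in oj.
            if (not self.done[j]
                    and all(self.ins[i] for i in deps)
                    and len(self.outs[j]) < self.depth):
                self.outs[j].append(fj(*(self.ins[i][0] for i in deps), self.s))
                self.done[j] = True
                fired = True
        # Rule Finish: every output produced and every input present.
        if all(self.done) and all(self.ins):
            self.s = self.g(*(q[0] for q in self.ins), self.s)
            for q in self.ins:
                q.popleft()
            self.done = [False] * len(self.done)
            fired = True
        return fired


# Hypothetical example SSM with inputs x, y and state s.
def out0(s):
    return s            # depends on no input


def out1(x, s):
    return x + s        # depends on x only


def next_state(x, y, s):
    return s + x + y


def run_ssm(xs, ys, s0):
    """Cycle-by-cycle reference behavior of the example SSM."""
    s, o0, o1 = s0, [], []
    for x, y in zip(xs, ys):
        o0.append(out0(s))
        o1.append(out1(x, s))
        s = next_state(x, y, s)
    return o0, o1
```

In this sketch out0 can be enqueued before any input has arrived and out1 as soon as x is present, while the state update waits for both inputs. The dequeued output streams should still match run_ssm cycle for cycle.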
The Wrapper Circuit
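[Figure: wrapper circuit. A patient SSM (with enable) sits between input FIFOs Ii (signals first, deq, not-empty) and output FIFOs Oj (signals value, enq, not-full); per-output donei flags, the inputs in Depends-on(Oj), and an "all dones" signal drive enq, enable and all input deqs]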
PPC In-order Pipeline
[Figure: PPC in-order pipeline. Stage labels: PC, Fetch, BrPred, Crack, Decode, AddrCalc, BrRes, RegRd, Mem, Mem2, ALU, Excep, RegWr; cache/TLB labels: I$/ITlb, I$/ITlb2, D$/DTlb, D$/DTlb2, each connected to Mem; signal labels: stall, bypass, epochs]
The designer specifies the FSM for each stage
The FIFOs are latency-insensitive, that is, the
correctness of the specification does not
depend upon the depth of FIFOs or the
number of stages
The steps in Cycle-accurate
implementation on FPGAs
(These steps can be mechanized)
The specs are turned into Bluespec code to give a target
SSM
  Once the size of FIFOs is fixed the whole design has a
  precise timing specification
If the FPGA implementation requires refining some
stages then cuts are made in the design to isolate the
stages (SSMs) to be refined
Each SSM is turned into a BDN by introducing FIFOs for
each input and output wire, including the wires going in
and out of model FIFOs of the SSM
  This converts the nth time cycle of the SSM into the nth
  enqueue into input FIFOs and nth dequeue from output
  FIFOs
Atomic rules for the operation of each BDN are defined
so that no extraneous dependencies are introduced
  This also ensures deadlock-free operation
Preliminary results
Cycle-accurate refinements onto Xilinx XUPV5
(Asif & Murali)

Slice Logic Utilization:
 Number of Slice Registers: 15448 out of 69120 22%
 Number of Slice LUTs: 16702 out of 69120 24%

Specific Feature Utilization:
 Number of Block RAM/FIFO: 1 out of 148 0% (only 1
BRAM for the register file)
 Number of DSP48Es: 12 out of 64 18% (these are used
for the divider)


Minimum period: 7.988ns (Maximum Frequency:
125.188MHz)
Partially verified by running a 50-instruction program
No numbers yet for actual work done
Compared to Jessica's port onto Xilinx XUPV5:
  Takes up 92% of the area
  20 MHz → 40 MHz
Conclusion
Cycle-accurate modeling of processors on
FPGAs is feasible and offers a three orders of
magnitude improvement in performance over
software simulators
BDNs offer a way to refine RTL without losing
cycle-accuracy
Bluespec makes quick RTL generation
feasible
  The generation of BDNs can be automated
We plan to release our Bluespec designs under
open source licensing to strengthen the PowerPC
ecosystem.
Related work
Luca Carloni et al.: latency-insensitive refinements
HAsim: Joel Emer, Michael Pellauer, et al. at Intel/MIT
  Cycle-accurate modeling using the A-Ports abstraction
UTFast: Derek Chiou and students at UT Austin
  Speculative functional model, corrected by the timing model when
  necessary
ProtoFlex: James Hoe, Eric Chung, et al. at CMU
RAMP Gold: Krste Asanovic et al. at Berkeley
Thanks!
BDN Input/Output notation
Ii(n) represents the nth value enqueued in
input buffer Ii
  I(n) represents the nth values enqueued in all input
  buffers
Oj(n) represents the nth value dequeued from
output buffer Oj
  O(n) represents the nth values dequeued from all
  output buffers
[Figure: BDN R with input buffers I1 ... In (collectively I) and output buffers O1 ... Om (collectively O)]
Examples of primitive BDNs:
Register
[Figure: register r with input FIFO a, output FIFO b, and state bit bDone]
A register whose
reads and writes must
match
Behavior
rule RO when (¬b.full ∧ ¬bDone)
  ⇒ b.enq(r); bDone <= True
rule RI when (¬a.empty ∧ bDone)
  ⇒ r <= a.first; a.deq; bDone <= False
Initial values
bDone = False
r = r0
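As a rough illustration of the register primitive, the two rules can be modeled in Python as follows. This is a sketch under assumptions: the class name RegisterBdn and the FIFO depth of 2 are not from the talk, and each step fires RO and then RI if their guards hold.

```python
from collections import deque


class RegisterBdn:
    """Register primitive: each write (enqueue on a) is matched by one read (on b)."""

    def __init__(self, r0, depth=2):
        self.a = deque()      # input FIFO
        self.b = deque()      # output FIFO
        self.r = r0           # register contents, initially r0
        self.b_done = False   # True once the current value has been emitted
        self.depth = depth    # assumed FIFO depth

    def step(self):
        """Fire RO and RI once each if enabled; return True if any rule fired."""
        fired = False
        # Rule RO: emit the current register value exactly once per cycle.
        if len(self.b) < self.depth and not self.b_done:
            self.b.append(self.r)
            self.b_done = True
            fired = True
        # Rule RI: latch the next value only after the current one was emitted.
        if self.a and self.b_done:
            self.r = self.a.popleft()
            self.b_done = False
            fired = True
        return fired


if __name__ == "__main__":
    reg = RegisterBdn(r0=0)
    for v in [7, 8, 9]:
        reg.a.append(v)
        reg.step()
        print(reg.b.popleft())
```

The nth value dequeued from b is the value the register held in cycle n, so the output stream is the input stream delayed by one position, starting with r0.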
Examples of primitive BDNs:
Mux
[Figure: mux with select input p, data inputs a, b, output c, and counters aCnt, bCnt]
A mux that accepts an
input value on each input
port but passes only the
appropriate value to the
output
Behavior
rule MuxO when (¬c.full ∧ ¬p.empty)
  ⇒ if (p.first ∧ ¬a.empty)
       then c.enq(a.first); p.deq; a.deq; bCnt <= bCnt+1
     else if (!(p.first) ∧ ¬b.empty)
       then c.enq(b.first); p.deq; b.deq; aCnt <= aCnt+1
rule MuxI1 when (aCnt > 0 ∧ ¬a.empty)
  ⇒ a.deq; aCnt <= aCnt-1
rule MuxI2 when (bCnt > 0 ∧ ¬b.empty)
  ⇒ b.deq; bCnt <= bCnt-1
Initial values
aCnt = 0
bCnt = 0
Composition of BDNs
If R1 and R2 are BDNs then so is the parallel composition R of
R1 and R2
[Figure: R1 and R2 placed side by side form R]
If R1 is a BDN then so is the (Ii, Oj) iterative composition R of
R1, in which Oj is connected to Ii (Ii = Oj), provided Ii ∉ Depends-on(Oj)*
* No direct combinational path
[Figure: R1 with output Oj fed back to input Ii forms R]
Deadlock-free BDN
[Figure: BDN R with input buffers I1 ... In (I) and output buffers O1 ... Om (O)]
Assuming an infinite sink, a BDN is deadlock-free
if for all n > 0, if n values are enqueued
into I then eventually n values will be
dequeued from both O and I
  We need a stronger property for deadlock-freeness to
  be preserved under composition