Architecture-Level Synthesis
for Automatic Interconnect Pipelining
Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab
Computer Science Department
University of California, Los Angeles
{cong, fanyp, zhiruz}@cs.ucla.edu
Funded by GSRC, NSF, and Altera Corp.
Outline
Motivation
Our contributions
RDR-Pipe micro-architecture
• Regular Distributed Register micro-architecture with interconnect pipelining
Synthesis flow and algorithms
• MCAS-Pipe: automatic interconnect pipelining and sharing
Experimental results
Conclusions
Interconnect Bottleneck in Nanometer Designs
Challenge: single-cycle full-chip communication will no longer be possible
Not supported by the current CAD toolset
Example (IPEM BIWS estimations)
• ITRS’01 0.07um technology, 5.63 GHz across-chip clock
• Die area 800 mm2 (28.3mm x 28.3mm)
• Semi-global layer (Tier 3); buffer size 100x; driver/receiver size 100x
• A signal can travel up to 11.4mm in one cycle
• 5 clock cycles are needed from corner to corner
[Figure: die map showing the regions reachable in 1 to 5 cycles from one corner; axis marks at 0, 11.4, 22.8, and 28.3 mm]
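The corner-to-corner figure can be reproduced with a short calculation. This is a sketch that assumes wires follow Manhattan (horizontal plus vertical) routes, which the slide does not state explicitly; the constant and function names are illustrative.

```python
import math

DIE_EDGE_MM = 28.3          # die is 28.3mm x 28.3mm (from the slide)
REACH_MM_PER_CYCLE = 11.4   # distance a signal covers in one clock cycle (from the slide)

def cycles_needed(distance_mm, reach_mm=REACH_MM_PER_CYCLE):
    """Clock cycles needed to cover distance_mm, rounded up, at least one."""
    return max(1, math.ceil(distance_mm / reach_mm))

# Assumption: corner-to-corner wiring is one full edge across plus one full edge up.
corner_to_corner_mm = 2 * DIE_EDGE_MM
print(cycles_needed(corner_to_corner_mm))  # 5
print(cycles_needed(DIE_EDGE_MM))          # 3 (along a single edge)
```

Under that assumption the 56.6mm route needs 5 cycles, which agrees with the number quoted on the slide.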
Related Work
Retiming with placement or floorplanning
• Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]
• Retiming + floorplanning [Chong & Brayton, IWLS’01]
• Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
Global wire pipelining in the Itanium™ processor
• [McInerney et al. ISPD’00]
Buffer and flip-flop insertion in RTL
• [Lu et al. DATE’02]
• [Cocchini, ICCAD’02]
Limitations of Exploring Multicycle Communication at the Logic/Physical Level
The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]
• In a loop, 4 logic cells, 2 registers
• Cell delay = 1ns per cell
• Interconnect delay = 1ns per connection
• DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns
• Clock period ≥ 4ns
Interconnect pipelining by flip-flop insertion?
• Requires a considerable amount of manual rework on the original RTL descriptions
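The DR-ratio bound from the loop example can be checked numerically. The sketch below is only an illustration of the formula on the slide; the function names and the list-based loop description are assumptions made for this example.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    if num_registers <= 0:
        raise ValueError("a sequential loop must contain at least one register")
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers

def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# The loop from the slide: 4 cells at 1ns, 4 connections at 1ns, 2 registers.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(dr_ratio(*slide_loop))                   # 4.0
print(clock_period_lower_bound([slide_loop]))  # 4.0
```

With the interconnect delay included, the loop caps the clock period at 4ns; with the 4ns of wire delay removed, the same loop would allow 2ns.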
Our Approach
Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al, ISPD’03] [Cong et al. ICCAD’03]
Regular Distributed Register (RDR) micro-architecture
• Highly regular
• Direct support of multicycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
• Efficiently maps the behavioral descriptions to RDR uArch
• Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
This work
Extension of RDR and MCAS for interconnect pipelining
Regular Distributed Register Micro-Architecture
[Figure: chip divided into a grid of islands of width Wi and height Hi; each island contains a Local Computational Cluster (LCC) with units such as ALU, MUL, and MUX, register files, and an FSM; global interconnect paths between islands are labeled 1 cycle, 2 cycles, ..., K cycles]
Distribute registers to each “island”
Choose the island size such that local computation and communication in each island can be done in a single cycle
Use register banks: the registers in each island are partitioned into k banks, one each for 1-cycle, 2-cycle, …, k-cycle interconnect communication
Wiring Overhead in RDR Designs
[Figure: schedule over cycles 1 to 5 with additions on ALU1 and multiplications on MUL1, sender registers r1 and r2, receiver registers r3 and r4, and interconnects with a delay of 2 cycles]
Data transfers r1→r3 and r2→r4 are overlapped
Two dedicated global wires are needed
Architectural Solution: RDR-Pipe
Keep the intra-island structures
Inter-island pipeline register station (PRS) for global communications
Synchronous design
• No global control signal needed for PRS
• PRS performs autonomous store-and-forward
[Figure: grid of islands, each with an LCC, FSM, and Reg. File, with PRS blocks placed in the H channel and V channel between islands]
Reducing Wiring Overhead in RDR-Pipe
[Figure: schedule over cycles 1 to 5 with additions on ALU1 and multiplications on MUL1, registers r1 to r4, and a pipeline register on the 2-cycle communication path between the sender and receiver registers]
Data transfers are pipelined
One wire with a pipeline register is enough
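The difference between dedicated wires and a pipelined wire can be illustrated by counting how many transfers compete for the same piece of wire in the same cycle. This is a sketch under stated assumptions: the launch cycles (2 and 3) are chosen for illustration rather than read off the figure, and `wires_needed` is a hypothetical helper, not part of MCAS-Pipe.

```python
from collections import Counter

def wires_needed(launch_cycles, delay, pipelined):
    """Count the parallel wires needed between one pair of islands.

    launch_cycles: cycle in which each transfer leaves its sender register.
    delay: interconnect delay in cycles.
    Unpipelined: a transfer holds the whole wire for `delay` cycles.
    Pipelined: pipeline registers cut the wire into `delay` one-cycle
    segments, and a transfer holds each segment for a single cycle.
    """
    busy = Counter()
    for launch in launch_cycles:
        for stage in range(delay):
            cycle = launch + stage
            key = (stage, cycle) if pipelined else cycle
            busy[key] += 1
    return max(busy.values(), default=0)

# Two 2-cycle transfers launched in consecutive cycles (assumed values).
launches = [2, 3]
print(wires_needed(launches, delay=2, pipelined=False))  # 2
print(wires_needed(launches, delay=2, pipelined=True))   # 1
```

The two transfers overlap in cycle 3 on an unpipelined wire, so two wires are needed; with a pipeline register they occupy different segments in that cycle and can share one wire.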
Synthesis Flow: MCAS-Pipe System
MCAS-Pipe flow (input: C / VHDL; output: RTL VHDL & floorplan constraints)
• CDFG generation (produces the CDFG)
• Resource allocation & functional unit binding (produces the ICG)
• Scheduling-driven placement (produces locations)
• Placement-driven rescheduling & rebinding
• Global interconnect sharing
• Register and port binding
• Datapath & FSM generation
Global interconnect sharing
• After scheduling and functional unit binding
• Before register and port binding
• Enables multiple data communications to share a physical link (a wire with pipeline registers)
Advantages over MCAS
• Expected to reduce global wiring demand
• No multicycle path constraint needed
Global Interconnect Sharing
[Figure: two schedules over cycles 1 to 7 for data transfers e and g from A to B with interconnect delay D = 2, showing producers pe and pg, consumers ce and cg, and the sender, pipeline, and receiver registers]
Conflicted data transfers
• Two physical links are needed to support the concurrent data transfers
Compatible data transfers
• Now, only one physical link is required to support the scheduled data transfers
• Two producer registers can be merged, since their lifetimes become compatible
Global Pipelined Interconnect Minimization
Definitions
Data links: pipelined global interconnects
Channel: set of data links between two islands
• Width of a channel: number of its data links
Data transfer: movement of data from a producer to a consumer
Architectural assumption
Channels cannot share interconnects
Theorem
Global pipelined interconnects are minimized if and only if the
width of every channel is minimized
Transfer Scheduling for a Single Channel
A decision problem formulation
Given:
• A channel (A, B) containing m data links
• A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e is associated with an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and unit effective occupancy time
Fact: for every time slot, at most one transfer can be issued on a data link
Objective: to find a feasible transfer schedule on these data links
Transfer scheduling is polynomially solvable
A special real-time scheduling problem [J. Blazewicz, 1979]
• Binary search for minimum feasible channel width m
• For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
• Overall time complexity: O(n log² n)
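The binary search over channel widths with an EDF feasibility check can be sketched as follows. This is a minimal illustration under assumptions, not the MCAS-Pipe implementation: time slots are integers, arrival and deadline are both inclusive issue slots, and the names `transfer_window`, `edf_schedule`, and `min_channel_width` as well as the sample schedule times are hypothetical.

```python
import heapq

def transfer_window(t_producer, t_consumer, delay):
    """Return (arrival, deadline) = (T(pe) + 1, T(ce) - D(A, B))."""
    return t_producer + 1, t_consumer - delay

def edf_schedule(windows, width):
    """Issue unit-time transfers on `width` data links, earliest deadline first.

    windows: list of (arrival, deadline) integer time slots, both inclusive.
    Returns {transfer index: (slot, link)}, or None if a deadline is missed.
    """
    if width < 1:
        return None if windows else {}
    order = sorted(range(len(windows)), key=lambda i: windows[i][0])
    ready = []          # min-heap of (deadline, index) for arrived transfers
    schedule = {}
    nxt = 0
    slot = windows[order[0]][0] if order else 0
    while nxt < len(order) or ready:
        if not ready:   # idle gap: skip ahead to the next arrival
            slot = max(slot, windows[order[nxt]][0])
        while nxt < len(order) and windows[order[nxt]][0] <= slot:
            i = order[nxt]
            heapq.heappush(ready, (windows[i][1], i))
            nxt += 1
        # At most one transfer is issued per data link in each time slot.
        for link in range(min(width, len(ready))):
            deadline, i = heapq.heappop(ready)
            if deadline < slot:
                return None
            schedule[i] = (slot, link)
        slot += 1
    return schedule

def min_channel_width(windows):
    """Binary-search the smallest width for which EDF finds a schedule."""
    if not windows:
        return 0
    lo, hi = 1, len(windows)
    if edf_schedule(windows, hi) is None:
        return None     # some window is empty (deadline before arrival)
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_schedule(windows, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical (T(pe), T(ce)) pairs on a channel with D(A, B) = 2.
windows = [transfer_window(tp, tc, 2) for tp, tc in [(0, 3), (0, 4), (1, 4), (1, 5)]]
print(windows)
print(min_channel_width(windows))
```

For these four sample transfers a single data link misses a deadline, while two links suffice. The binary search is valid because adding a link can never make a feasible set of transfers infeasible.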
EDF-Based Transfer Scheduling Example
[Figure: transfers 1 to 6 plotted over time slots 1 to 6 and assigned to Data Link 1 and Data Link 2 under two orderings]
Ordered by Earliest-Deadline-First
• Successfully scheduled onto 2 data links
Ordered by left edge
• Failed for 2 data links!
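The contrast between the two orderings can be reproduced on a small instance. The five transfers below are a made-up example, not the ones in the figure, and "left edge" is interpreted here as giving priority to the earliest arrival time; `greedy_schedule` is a hypothetical helper written for this sketch.

```python
import heapq

def greedy_schedule(transfers, width, priority):
    """Issue up to `width` ready transfers per slot, smallest priority key first.

    transfers: {name: (arrival, deadline)} with inclusive integer slots.
    priority: function (arrival, deadline) -> sort key.
    Returns {name: slot}, or None if a transfer would miss its deadline.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    by_arrival = sorted(transfers.items(), key=lambda kv: kv[1][0])
    ready, issued, nxt = [], {}, 0
    slot = by_arrival[0][1][0] if by_arrival else 0
    while nxt < len(by_arrival) or ready:
        if not ready:   # idle gap: skip ahead to the next arrival
            slot = max(slot, by_arrival[nxt][1][0])
        while nxt < len(by_arrival) and by_arrival[nxt][1][0] <= slot:
            name, (arrival, deadline) = by_arrival[nxt]
            heapq.heappush(ready, (priority(arrival, deadline), name, deadline))
            nxt += 1
        for _ in range(min(width, len(ready))):
            _, name, deadline = heapq.heappop(ready)
            if deadline < slot:
                return None
            issued[name] = slot
        slot += 1
    return issued

# Hypothetical instance: three relaxed transfers, then two urgent ones.
transfers = {"a": (1, 3), "b": (1, 3), "c": (1, 3), "d": (2, 2), "e": (2, 2)}

edf = greedy_schedule(transfers, 2, priority=lambda arrival, deadline: deadline)
left_edge = greedy_schedule(transfers, 2, priority=lambda arrival, deadline: arrival)
print(edf)        # {'a': 1, 'b': 1, 'd': 2, 'e': 2, 'c': 3}
print(left_edge)  # None
```

With arrival-time priority, the older transfer c takes a link in slot 2 that the urgent transfer e needed, so two links are not enough. EDF defers c to slot 3 and fits everything onto two links.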
Experiment Settings
Inputs: C / VHDL, uArch. spec., target clock period
[Figure: three flows that share CDFG generation, functional unit allocation & binding, register and port binding, and datapath & control generation. The Conventional flow uses conventional scheduling; the MCAS flow uses scheduling-driven placement followed by placement-driven rebinding & rescheduling; the MCAS-Pipe flow additionally performs global interconnect sharing.]
Outputs passed to Altera QuartusII + Stratix
• RTL VHDL files (for all flows)
• Floorplan constraints (for MCAS and MCAS-Pipe)
• Multicycle path constraints (for MCAS only)
Experimental Results: Register and LE Usage
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. Conventional flow:
• Uses more registers and logic elements (LE)
MCAS-Pipe vs. MCAS:
• Slightly more registers, and comparable logic element cost
Designs   Node#   MCAS Reg#   MCAS LE   CONV/MCAS Reg#   CONV/MCAS LE   MCAS-Pipe/MCAS Reg#   MCAS-Pipe/MCAS LE
PR        46      31          1181      0.71             0.95           1.19                  0.95
WANG      52      40          1435      0.63             0.81           1.20                  0.85
LEE       53      29          988       0.76             0.96           1.00                  0.84
MCM       98      57          2467      0.75             1.00           1.05                  1.19
HONDA     101     41          2542      0.83             0.90           1.05                  1.01
DIR       152     44          2260      0.75             0.95           1.05                  1.01
Average   -       -           -         0.74             0.93           1.09                  0.98
Experimental Results: Performance
Design environment: Altera QuartusII, Stratix EP1S40
MCAS vs. Conventional flow:
• 36% reduction in clock period and 30% in total latency
MCAS-Pipe vs. MCAS:
• Comparable design performance (4% better)
[Figure: two bar charts, clock period (ns) and total latency (ns), comparing Conventional, MCAS, and MCAS-Pipe on PR, WANG, LEE, MCM, HONDA, DIR, and Average]
Interconnect Structure of Altera’s Stratix
[Figure: Stratix interconnect with local wires (LL, LO), horizontal wires H4, H8, and global H24, and vertical wires V4, V8, and global V16]
Experimental Results: Wirelength
Wire types
• LL, LO: local wires; H4, V4, H8, V8: short global wires
• V16, H24: long global wires
MCAS-Pipe vs. MCAS:
• 28.8% reduction in long global wires, 19.3% reduction in total wirelength
[Figure: bar chart of wirelength by wire type (LL+LO, H4+V4, H8+V8, V16+H24, Total) for PR, WANG, LEE, MCM, HONDA, DIR, and Average; vertical axis from 0 to 1.4]
Conclusions
High-level automatic on-chip interconnect pipelining
RDR-Pipe: extension of RDR micro-architecture
• Micro-architecture supporting interconnect pipelining
MCAS-Pipe: enhancement of MCAS synthesis system
• Adds a novel global interconnect sharing algorithm to effectively reduce the global wiring
Experimental results
• Matches or exceeds the RDR-based approach in performance (about 4% better than MCAS)
• Reduces wiring demand: 28.8% fewer long global wires and 19.3% less total wirelength than MCAS
Thank you