Architecture-Level Synthesis
for Automatic Interconnect Pipelining
Jason Cong, Yiping Fan, Zhiru Zhang
VLSI CAD Lab
Computer Science Department
University of California, Los Angeles
{cong, fanyp, zhiruz}@cs.ucla.edu
Funded by GSRC, NSF, and Altera Corp.
Outline
 Motivation
 Our contributions
 RDR-Pipe micro-architecture
• Regular Distributed Register micro-architecture with interconnect pipelining
 Synthesis flow and algorithms
• MCAS-Pipe: automatic interconnect pipelining and sharing
 Experimental results
 Conclusions
Interconnect Bottleneck in Nanometer Designs
 Challenge: single-cycle full-chip communication will no longer be possible
 Not supported by the current CAD toolset
[Figure: die map marking the regions reachable in 1 to 5 clock cycles on the semi-global layer (Tier 3); axis ticks at 0, 11.4, 22.8, and 28.3 mm]
ITRS’01 0.07um Tech
5.63 GHz across-chip clock
800 mm2 (28.3mm x 28.3mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x
Can travel up to 11.4mm in one cycle
Need 5 clock cycles from corner to corner
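The 5-cycle figure can be checked with simple arithmetic. The sketch below assumes rectilinear (Manhattan) routing from corner to corner, which the slide does not state explicitly; the function and constant names are illustrative only.

```python
import math


def cycles_to_cross(distance_mm, reach_mm_per_cycle):
    """Clock cycles needed to cover a distance when a signal travels at most
    `reach_mm_per_cycle` in one cycle."""
    return math.ceil(distance_mm / reach_mm_per_cycle)


DIE_EDGE_MM = 28.3  # die edge from the slide (28.3mm x 28.3mm)
REACH_MM = 11.4     # one-cycle reach on the semi-global layer, from the slide

# Along one die edge: 28.3 / 11.4 is about 2.5, so 3 cycles.
edge_cycles = cycles_to_cross(DIE_EDGE_MM, REACH_MM)

# Corner to corner, assuming a Manhattan route of two edges: 56.6 / 11.4 is about 4.96.
corner_cycles = cycles_to_cross(2 * DIE_EDGE_MM, REACH_MM)

print(edge_cycles, corner_cycles)
```

Under that routing assumption the result agrees with the 5 cycles quoted on the slide.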
Related Work
 Retiming with placement or floorplanning
 Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]
 Retiming + floorplanning [Chong & Brayton, IWLS’01]
 Retiming + placement for FPGAs [Singh & Brown, FPGA’02]
 Global wire pipelining in Itanium™ processor
 [McInerney et al., ISPD’00]
 Buffer and flip-flop insertion in RTL
 [Lu et al., DATE’02]
 [Cocchini, ICCAD’02]
Limitations of Exploring Multicycle Communication at the Logic/Physical Level
 The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio over the loops in the circuit [Papaefthymiou, MST’94]
• In a loop, 4 logic cells, 2 registers
• Cell delay = 1ns
• Interconnect delay = 1ns
• DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2 = 4ns
• Clock period ≥ 4ns
 Interconnect pipelining by flip-flop insertion?
 Requires a considerable amount of manual rework on the original RTL descriptions
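The loop bound above can be reproduced in a few lines. This is a minimal sketch of the DR-ratio arithmetic only, not of any retiming tool; the function names are illustrative, and the three-register variant is a hypothetical case added for comparison.

```python
def dr_ratio(logic_delay_ns, interconnect_delay_ns, num_registers):
    """Delay-to-register ratio of one loop, in ns per register."""
    return (logic_delay_ns + interconnect_delay_ns) / num_registers


def min_clock_period(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Loop from the slide: Dlogic = 4 ns, Dint = 4 ns, 2 registers.
slide_loop = (4.0, 4.0, 2)
bound = min_clock_period([slide_loop])  # (4 + 4) / 2

# Hypothetical variant: one extra flip-flop on the loop lowers the bound,
# but it also changes how many cycles the loop takes, so the surrounding
# RTL has to be adjusted by hand.
pipelined_bound = min_clock_period([(4.0, 4.0, 3)])

print(bound, pipelined_bound)
```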
Our Approach
 Consideration of multicycle communication during architectural (or behavioral) synthesis
 [Cong et al., ISPD’03] [Cong et al., ICCAD’03]
 Regular Distributed Register (RDR) micro-architecture
• Highly regular
• Direct support of multicycle on-chip communication
 MCAS: Architectural Synthesis for Multi-cycle Communication
• Efficiently maps the behavioral descriptions to RDR uArch
• Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
 This work
 Extension of RDR and MCAS for interconnect pipelining
Regular Distributed Register Micro-Architecture
[Figure: RDR micro-architecture. A grid of islands of width Wi and height Hi joined by global interconnect; each island holds a Local Computational Cluster (LCC) with units such as MUL, ALU, and MUX, an FSM, and register files for 1-cycle, 2-cycle, …, K-cycle communication]
Distribute registers to each “island”
Choose the island size such that local computation and communication within each island can be done in a single cycle
Use register banks: the registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
Wiring Overhead in RDR Designs
[Figure: five-cycle schedule of additions on ALU1 and multiplications on MUL1; values move from sender registers r1, r2 to receiver registers r3, r4 over interconnects with a delay of 2 cycles]
 Data transfers r1→r3 and r2→r4 are overlapped
 Two dedicated global wires are needed
Architectural Solution: RDR-Pipe
 Keep the intra-island structures
 Inter-island pipeline register station (PRS) for global communications
 Synchronous design
 No global control signal needed for PRS
 PRS performs autonomous store-and-forward
[Figure: islands (LCC, FSM, Reg. File) separated by H channels and V channels, with Pipeline Register Stations (PRS) placed in the channels]
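The store-and-forward behavior of a pipelined link can be pictured as a shift register that advances on every clock edge with no control input. The sketch below is only a behavioral model under that assumption; the class name `PipelinedLink` and the choice to count the receiver register as the last stage are illustrative, not taken from the slides.

```python
from collections import deque


class PipelinedLink:
    """A wire cut into `delay` one-cycle segments by registers
    (shift-register model of an inter-island link)."""

    def __init__(self, delay):
        self.stages = deque([None] * delay)

    def tick(self, value=None):
        """Advance one clock edge: latch `value` at the sender side and
        return whatever reaches the receiver side on this edge."""
        self.stages.appendleft(value)
        return self.stages.pop()


link = PipelinedLink(delay=2)
# Two values are launched on consecutive cycles over the same wire.
received = [link.tick(v) for v in ("x", "y", None, None)]
print(received)
```

Each stage forwards its content unconditionally, which is why the model needs no global control signal, and two back-to-back values travel over the same wire without colliding.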
Reducing Wiring Overhead in RDR-Pipe
[Figure: the same five-cycle schedule on RDR-Pipe, with sender, pipeline, and receiver registers on a 2-cycle communication link between ALU1 and MUL1]
 Data transfers are pipelined
 One wire with a pipeline register is enough
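The difference between the two wiring styles comes down to how long a transfer occupies a wire. A minimal sketch, assuming two transfers issued in consecutive cycles (the exact issue cycles in the figure are not recoverable, so the values below are illustrative): a dedicated multicycle wire is held for the full delay, while a pipelined wire is held for one cycle per transfer.

```python
def links_needed(issue_cycles, occupancy):
    """Peak number of transfers holding a wire at the same time, when each
    transfer holds it for `occupancy` cycles starting at its issue cycle."""
    events = []
    for start in issue_cycles:
        events.append((start, 1))               # wire acquired
        events.append((start + occupancy, -1))  # wire released
    # Tuples sort so that a release at cycle t precedes an acquire at cycle t.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak


DELAY = 2              # 2-cycle communication, as on the slide
issue_cycles = [1, 2]  # assumed issue cycles of the two transfers

dedicated = links_needed(issue_cycles, occupancy=DELAY)  # non-pipelined RDR wire
pipelined = links_needed(issue_cycles, occupancy=1)      # RDR-Pipe wire

print(dedicated, pipelined)
```

With these assumed issue cycles the count drops from two wires to one, matching the claim on the two slides.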
Synthesis Flow: MCAS-Pipe System
MCAS-Pipe flow: C / VHDL → CDFG generation (CDFG) → Resource allocation & functional unit binding (ICG) → Scheduling-driven placement (locations) → Placement-driven rescheduling & rebinding → Global interconnect sharing → Register and port binding → Datapath & FSM generation → RTL VHDL & floorplan constraints
 Global interconnect sharing
 After scheduling and functional unit binding
 Before register and port binding
 Enables multiple data communications to share a physical link (a wire with pipeline registers)
 Advantages over MCAS
 Expected to reduce global wiring demand
 No multicycle path constraint needed
Global Interconnect Sharing
[Figure: two schedules over cycles 1 to 7 for data transfers from producers pe, pg in island A to consumers ce, cg in island B, with D = 2; legend: sender register, pipeline register, receiver register]
 Conflicted data transfers
 Two physical links are needed to support the concurrent data transfers
 Compatible data transfers
 Only one physical link is required to support the scheduled data transfers
 Now, two producer registers can be merged, since their life-times become compatible
Global Pipelined Interconnect Minimization
 Definitions
 Data links: pipelined global interconnects
 Channel: set of data links between two islands
• Width of a channel: number of its data links
 Data transfer: movement of data from a producer to a consumer
 Architectural assumption
 Channels cannot share interconnects
 Theorem
 Global pipelined interconnects are minimized if and only if the width of every channel is minimized
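One way to see why the theorem follows from the architectural assumption is to write the total as a sum. The symbols below are introduced here for illustration only: C is the set of channels, W(c) is the width of channel c, and N is the total number of global pipelined interconnects.

```latex
N = \sum_{c \in C} W(c),
\qquad
\min N = \sum_{c \in C} \min W(c)
```

Because channels cannot share interconnects, each W(c) depends only on the transfers of its own channel, so the terms can be minimized independently, and the sum is minimal exactly when every term is.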
Transfer Scheduling for a Single Channel
 A decision problem formulation
 Given:
• A channel (A, B) containing m data links
• A data transfer set {e | pe ∈ A and ce ∈ B}, where each transfer e has an arrival time T(pe)+1, a deadline T(ce)-D(A, B), and a unit effective occupancy time
 Fact: in every time slot, at most one transfer can be issued on a data link
 Objective: find a feasible transfer schedule on these data links
 Transfer scheduling is solvable in polynomial time
 A special real-time scheduling problem [J. Blazewicz, 1979]
• Binary search for the minimum feasible channel width m
• For each width, apply Earliest-Deadline-First (EDF) scheduling: O(n log n)
• Overall time complexity: O(n log^2 n)
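The formulation above can be sketched as follows. This is an illustrative implementation under the stated assumptions (integer time slots, unit occupancy, arrival T(pe)+1, deadline T(ce)-D), not the MCAS-Pipe source code; all function names and the sample times are hypothetical.

```python
import heapq


def transfer_window(t_producer, t_consumer, delay):
    """(arrival, deadline) slot pair: arrival T(pe)+1, deadline T(ce)-D."""
    return (t_producer + 1, t_consumer - delay)


def edf_feasible(windows, width):
    """True if every transfer can be issued inside its window when at most
    `width` transfers are issued per time slot (EDF order)."""
    if width < 1:
        return not windows
    pending = sorted(windows)  # by arrival slot
    ready = []                 # min-heap of deadlines of arrived transfers
    i, t = 0, 0
    while i < len(pending) or ready:
        if not ready:
            t = max(t, pending[i][0])  # skip idle slots
        while i < len(pending) and pending[i][0] <= t:
            heapq.heappush(ready, pending[i][1])
            i += 1
        for _ in range(min(width, len(ready))):
            if heapq.heappop(ready) < t:  # most urgent transfer is already late
                return False
        t += 1
    return True


def min_channel_width(windows):
    """Smallest feasible width by binary search; None if no width works."""
    if not windows:
        return 0
    lo, hi = 1, len(windows)
    if not edf_feasible(windows, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_feasible(windows, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical (T(pe), T(ce)) pairs on a channel with D = 2.
windows = [transfer_window(tp, tc, 2) for tp, tc in [(0, 3), (0, 4), (1, 4)]]
print(windows, min_channel_width(windows))
```

Each feasibility check sorts once and performs heap operations, so it runs in O(n log n); the binary search over widths adds the second logarithmic factor.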
EDF-Based Transfer Scheduling Example
[Figure: the same set of transfers scheduled over time slots 1 to 6 under two orderings]
Ordered by Earliest-Deadline-First: all transfers are successfully scheduled onto 2 data links
Ordered by left edge: scheduling fails for 2 data links
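The transfer windows of the slide's example cannot be recovered from the figure, so the sketch below uses a hypothetical instance that shows the same effect. It assumes that "left edge" means serving ready transfers in order of arrival slot; the function and transfer names are illustrative.

```python
import heapq


def greedy_schedule(transfers, width, key):
    """Slot-by-slot list scheduling. In every slot, issue up to `width` ready
    transfers with the smallest key(arrival, deadline). Returns a
    {name: slot} map, or None if some transfer misses its deadline."""
    pending = sorted(transfers.items(), key=lambda item: item[1][0])
    ready, schedule = [], {}
    i, t = 0, 0
    while i < len(pending) or ready:
        if not ready:
            t = max(t, pending[i][1][0])  # skip idle slots
        while i < len(pending) and pending[i][1][0] <= t:
            name, (arrival, deadline) = pending[i]
            heapq.heappush(ready, (key(arrival, deadline), name, deadline))
            i += 1
        for _ in range(min(width, len(ready))):
            _, name, deadline = heapq.heappop(ready)
            if deadline < t:
                return None
            schedule[name] = t
        t += 1
    return schedule


def edf(arrival, deadline):
    return deadline


def left_edge(arrival, deadline):
    return arrival


# Hypothetical windows (arrival slot, deadline slot), not the slide's data.
demo = {"a": (1, 4), "b": (1, 4), "c": (1, 4), "d": (2, 2), "e": (2, 2)}

edf_result = greedy_schedule(demo, width=2, key=edf)
left_edge_result = greedy_schedule(demo, width=2, key=left_edge)
print(edf_result, left_edge_result)
```

In slot 2 the arrival-ordered scheduler spends one of the two links on `c`, which could have waited, so `e` misses its deadline; EDF serves `d` and `e` first and defers `c` to slot 3.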
Experiment Settings
Inputs: C / VHDL, uArch. spec., target clock period
Common front end: CDFG generation, then functional unit allocation & binding
Conventional flow: conventional scheduling
MCAS flow: scheduling-driven placement, then placement-driven rebinding & rescheduling
MCAS-Pipe flow: the MCAS steps followed by global interconnect sharing
Common back end: register and port binding, then datapath & control generation
Outputs: RTL VHDL files (for all flows); floorplan constraints (for MCAS and MCAS-Pipe); multicycle path constraints (for MCAS only)
Implementation: Altera QuartusII + Stratix
Experimental Results: Register and LE Usage
 Design environment: Altera QuartusII, Stratix EP1S40
 MCAS vs. Conventional flow:
 Uses more registers and logic elements (LE)
 MCAS-Pipe vs. MCAS:
 Slightly more registers, and comparable logic element cost

Designs   Node#   MCAS          CONV / MCAS     MCAS-Pipe / MCAS
                  Reg#   LE     Reg#   LE       Reg#   LE
PR        46      31     1181   0.71   0.95     1.19   0.95
WANG      52      40     1435   0.63   0.81     1.20   0.85
LEE       53      29     988    0.76   0.96     1.00   0.84
MCM       98      57     2467   0.75   1.00     1.05   1.19
HONDA     101     41     2542   0.83   0.90     1.05   1.01
DIR       152     44     2260   0.75   0.95     1.05   1.01
Average   -       -      -      0.74   0.93     1.09   0.98
Experimental Results: Performance
 Design environment: Altera QuartusII, Stratix EP1S40
 MCAS vs. Conventional flow:
 36% reduction in clock period and 30% in total latency
 MCAS-Pipe vs. MCAS:
 Comparable design performance (4% better)
[Figure: two bar charts, clock period (ns) and total latency (ns), comparing the Conventional, MCAS, and MCAS-Pipe flows on PR, WANG, LEE, MCM, HONDA, DIR, and the average]
Interconnect Structure of Altera’s Stratix
[Figure: Stratix routing resources. Local: LL, LO. Horizontal: H4, H8, and H24 (global). Vertical: V4, V8, and V16 (global).]
Experimental Results: Wirelength


 Wire types
 LL, LO: local wires; H4, V4, H8, V8: short global wires
 V16, H24: long global wires
 MCAS-Pipe vs. MCAS:
 28.8% reduction in long global wires, 19.3% reduction in total wirelength
[Figure: bar chart (vertical axis 0 to 1.4) of wirelength by wire class (LL+LO, H4+V4, H8+V8, V16+H24, Total) for PR, WANG, LEE, MCM, HONDA, DIR, and the average]
Conclusions
 High-level automatic on-chip interconnect pipelining
 RDR-Pipe: extension of RDR micro-architecture
• Micro-architecture supporting interconnect pipelining
 MCAS-Pipe: enhancement of MCAS synthesis system
• Adds a novel global interconnect sharing algorithm that effectively reduces global wiring
 Experimental results
 Matches or exceeds the RDR-based approach in performance
 Reduces long global wires by 28.8% and total wirelength by 19.3% relative to MCAS
Thank you