Transcript Document

Architecture and Synthesis for Multi-Cycle Communication
SOC Group, VLSICAD Lab
Led by Jason Cong
Yiping Fan, Guoling Han, Xun Yang, Zhiru Zhang
 Motivation
• What is happening now:
  Interconnect delays dominate the timing in DSM tech.
• What is about to happen:
  Single-cycle full chip synchronization is no longer possible.
 Our Approach
 Regular Distributed Register (RDR) micro-architecture
   Highly regular
   Direct support of multi-cycle on-chip communication
 MCAS: Architectural Synthesis for Multi-cycle Communication
   Integrated architectural synthesis (e.g. binding, scheduling) with physical planning
   Targets the RDR architecture
MCAS vs. Conventional Flow
MCAS achieves a 31% clock period reduction and a 24% total latency reduction on average, with an 18% resource overhead and an 11% clock cycle increase.
[Figure: Clock period (ns) comparison, Conventional Flow vs. MCAS Flow]
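The MCAS vs. Conventional Flow averages can be sanity-checked with a little arithmetic, since total latency is the cycle count times the clock period. The sketch below is only a rough consistency check: the function name is hypothetical, and averages taken per benchmark do not have to multiply exactly, so a small gap from the reported 24% is expected.

```python
def implied_latency_reduction(period_reduction, cycle_increase):
    """Fractional latency reduction implied by latency = cycles * period.

    Both arguments are fractions, e.g. 0.31 for a 31% shorter clock period.
    """
    ratio = (1.0 - period_reduction) * (1.0 + cycle_increase)
    return 1.0 - ratio


# Reported averages: 31% shorter clock period, 11% more clock cycles.
implied = implied_latency_reduction(0.31, 0.11)
print(f"implied latency reduction: {implied:.1%}")
```

With the reported averages this gives roughly 23%, close to the 24% total latency reduction quoted for the benchmark set.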
MCAS (Multi-Cycle Architectural Synthesis)
[Figure: MCAS flow. C program; CDFG generation; Control Data Flow Graph (CDFG); Resource allocation & Functional unit binding; Interconnected Component Graph (ICG); Scheduling-driven placement; Locations; Placement-driven rescheduling & rebinding; Register and port binding; Datapath & FSM generation; outputs: RTL VHDL, Floorplan constraints, Multi-cycle path constraints]
[Figure: example CDFG of operations 1 to 12 (*, +, -) bound to four units (Alu1: 1,5,10; Alu2: 2,6,9; Mul1: 4,8,12; Mul2: 3,7,11), with the corresponding ICG and RDR Placement]
 RDR Architecture
[Figure: RDR architecture. Labels: Island, Cluster with area constraint, Wi, Hi, Local Computational Cluster (LCC), MUX, ALU, MUL, Reg. file, FSM, Global Interconnect; transfers labeled 1 cycle, 2 cycles, ... K cycles]
[Figure: distance marks 7.52, 15.04, 22.56 and 24.9 (mm), with labels 1 clock to 7 clock]
 Distribute registers to each “island”
 Choose the island size such that local computation and communication in each island can be done in a single cycle:
$D_{intra-island} = D_{logic} + D_{opt-int} \le D_{logic} + 2 D_{opt-int}(W_i + H_i) \le T$
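To make the island-size condition concrete, here is a minimal sketch that checks whether an island of width Wi and height Hi fits in one clock period T. It assumes, purely for illustration, that the delay of an optimally buffered wire grows linearly with length; the function names, the `delay_per_mm` parameter and all numbers are hypothetical and do not come from the poster.

```python
def opt_int_delay(length_mm, delay_per_mm):
    """Assumed linear delay model (ns) for an optimally buffered wire."""
    return delay_per_mm * length_mm


def island_fits(w_mm, h_mm, t_clk, d_logic, delay_per_mm):
    """Check D_logic + 2 * D_opt-int(Wi + Hi) <= T for one island."""
    return d_logic + 2.0 * opt_int_delay(w_mm + h_mm, delay_per_mm) <= t_clk


def max_half_perimeter(t_clk, d_logic, delay_per_mm):
    """Largest Wi + Hi (mm) that still satisfies the bound under the linear model."""
    if t_clk <= d_logic:
        raise ValueError("clock period leaves no time for interconnect")
    return (t_clk - d_logic) / (2.0 * delay_per_mm)


# Hypothetical numbers: 2 ns clock, 1 ns logic delay, 0.1 ns per mm of wire.
T_CLK, D_LOGIC, DELAY_PER_MM = 2.0, 1.0, 0.1
limit = max_half_perimeter(T_CLK, D_LOGIC, DELAY_PER_MM)
small_ok = island_fits(2.0, 2.0, T_CLK, D_LOGIC, DELAY_PER_MM)
large_ok = island_fits(3.0, 3.0, T_CLK, D_LOGIC, DELAY_PER_MM)
print(limit, small_ok, large_ok)
```

Under these assumed numbers a 2 mm by 2 mm island meets the single-cycle bound while a 3 mm by 3 mm island does not, which illustrates how the clock period target caps the island dimensions.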
[Figure: Total latency (ns) comparison, Conventional Flow vs. MCAS Flow. Benchmark labels on the clock period and total latency charts: pr, wang, lee, mcm, honda, dir, chem, u5ml12 matmul, cftmdl]
 Scheduling-driven placement
Integrate list scheduling with a simulated-annealing (SA) based global placement to minimize the total latency.
Employ a net-weighting technique to shorten the critical global connections.
 Placement-driven rescheduling & rebinding
Integrate force-directed list scheduling with simultaneous rescheduling & rebinding to further minimize the latency.
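The interaction between placement and scheduling can be illustrated with a small sketch. It is not the MCAS algorithm: it omits the SA-based placement, net weighting and force-directed priorities, and only shows a greedy, list-style scheduler in which a value moving between islands costs extra cycles. The one-cycle-per-grid-step model, the unit names and the tiny three-operation graph are all assumptions made for illustration.

```python
def comm_cycles(src, dst):
    """Assumed model: one extra cycle per Manhattan grid step between islands."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def schedule(order, preds, binding, location, latency):
    """Schedule ops in the given priority (topological) order.

    Each op starts once its bound unit is free and every operand has
    arrived, including the multi-cycle transfer from the producer's island.
    Returns the start cycle of each op and the total latency in cycles.
    """
    unit_free, start, finish = {}, {}, {}
    for op in order:
        unit = binding[op]
        earliest = unit_free.get(unit, 0)
        for p in preds.get(op, ()):
            delay = comm_cycles(location[binding[p]], location[unit])
            earliest = max(earliest, finish[p] + delay)
        start[op] = earliest
        finish[op] = earliest + latency[unit]
        unit_free[unit] = finish[op]
    return start, max(finish.values())


# Hypothetical example: c consumes the results of a and b.
order = ["a", "b", "c"]
preds = {"c": ["a", "b"]}
binding = {"a": "alu1", "b": "mul1", "c": "alu1"}
latency = {"alu1": 1, "mul1": 2}

near = {"alu1": (0, 0), "mul1": (0, 1)}  # multiplier in a neighboring island
far = {"alu1": (0, 0), "mul1": (2, 2)}   # multiplier four grid steps away

_, near_total = schedule(order, preds, binding, near, latency)
_, far_total = schedule(order, preds, binding, far, latency)
print(near_total, far_total)
```

With the same binding, moving the multiplier from a neighboring island to a distant one lengthens the schedule from 4 to 7 cycles in this toy case, which is the kind of effect that motivates shortening critical global connections during placement.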
MCAS vs. Synopsys Behavioral Compiler
MCAS achieves a 21% clock period reduction and a 29% total latency reduction on average, without area overhead.
[Figures: Total latency (ns) comparison and Clock period (ns) comparison, Behavioral Compiler vs. MCAS; benchmark labels: pr, wang, mcm, honda. The panel title MCAS System and the benchmark label cft1st also appear in this region.]