Architecture and Synthesis for Multi-Cycle Communication
SOC Group, VLSICAD Lab
Led by Jason Cong
Yiping Fan, Guoling Han, Xun Yang, Zhiru Zhang

Motivation
• What is happening now:
  • Interconnect delays dominate the timing in DSM tech.
• What is about to happen:
  • Single-cycle full chip synchronization is no longer possible.

[Figure: number of clock cycles needed to reach across the chip; regions labeled 1 clock through 7 clock; distance labels 7.52, 15.04, 22.56, 24.9 (mm)]

Our Approach
• Regular Distributed Register (RDR) micro-architecture
  • Highly regular
  • Direct support of multi-cycle on-chip communication
• MCAS: Architectural Synthesis for Multi-cycle Communication
  • Integrated architectural synthesis (e.g. binding, scheduling) with physical planning
  • Targets the RDR architecture

RDR Architecture
• Distribute registers to each "island".
• Choose the island size such that local computation and communication in each island can be done in a single cycle:

  $D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}} \le D_{\text{logic}} + 2\,D_{\text{opt-int}}(W_i + H_i) \le T$

[Figure: RDR architecture. A grid of islands of width W_i and height H_i; each island holds a Local Computational Cluster (LCC, a cluster with area constraint containing MUX, ALU, and MUL units), a register file, and an FSM; islands are connected by global interconnect with latencies of 1 cycle, 2 cycles, ..., K cycles]

MCAS (Multi-Cycle Architectural Synthesis)
MCAS System flow:
1. C program
2. CDFG generation, producing the Control Data Flow Graph (CDFG)
3. Resource allocation & functional unit binding, producing the Interconnected Component Graph (ICG)
4. Scheduling-driven placement, producing locations
5. Placement-driven rescheduling & rebinding
6. Register and port binding
7. Datapath & FSM generation
Outputs: RTL VHDL, floorplan constraints, multi-cycle path constraints

[Figure: example CDFG with operations 1 to 12 (+, -, *); its ICG with the binding Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11; and the RDR placement of Alu1, Alu2, Mul1, and Mul2]

Scheduling-driven placement
• Integrate list-scheduling with an SA-based global placement to minimize the total latency.
• Employ a net weighting technique to shorten the critical global connections.

Placement-driven rescheduling & rebinding
• Integrate force-directed list-scheduling with simultaneous rescheduling & rebinding to further minimize the latency.

MCAS vs. Conventional Flow
MCAS achieves a 31% clock period reduction and a 24% total latency reduction, with 18% resource overhead and an 11% clock cycle increase, on average.

[Figure: Clock period (ns) comparison, Conventional Flow vs. MCAS Flow; benchmarks pr, wang, lee, mcm, honda, dir, chem, u5ml12, matmul, cftmdl, cft1st]
[Figure: Total latency (ns) comparison, Conventional Flow vs. MCAS Flow; benchmarks pr, wang, lee, mcm, honda, dir, chem, u5ml12, matmul, cftmdl, cft1st]

MCAS vs. Synopsys Behavioral Compiler
MCAS achieves a 21% clock period reduction and a 29% total latency reduction on average, without area overhead.

[Figure: Clock period (ns) comparison, Behavioral Compiler vs. MCAS; benchmarks pr, wang, mcm, honda]
[Figure: Total latency (ns) comparison, Behavioral Compiler vs. MCAS; benchmarks pr, wang, mcm, honda]
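
To make concrete why placement and scheduling are coupled in MCAS, here is a minimal sketch of list scheduling in which a value crossing between islands costs extra cycles. It is not the MCAS implementation. The names `comm_cycles` and `list_schedule`, the toy dependency graph, the one-extra-cycle-per-island-hop cost model, and the assumption that every operation takes one cycle are all hypothetical choices made for illustration.

```python
from functools import lru_cache


def comm_cycles(src, dst):
    """Extra cycles to move a value between islands at grid coordinates.

    Assumption: one extra cycle per island hop (Manhattan distance);
    a transfer inside the same island costs no extra cycle.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def list_schedule(deps, binding, location):
    """Return {op: start_cycle} for single-cycle ops on bound functional units.

    deps:     op -> list of predecessor ops (must form a DAG)
    binding:  op -> functional unit name
    location: functional unit name -> (row, col) of its island
    """
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    @lru_cache(maxsize=None)
    def height(op):
        # Priority: longest path from op to a sink, ignoring communication.
        return 1 + max((height(s) for s in succs[op]), default=0)

    start, busy, pending, cycle = {}, set(), set(deps), 0
    while pending:
        ready = []
        for op in pending:
            if any(p not in start for p in deps[op]):
                continue
            # An operand arrives after its producer finishes plus transfer time.
            earliest = max(
                (start[p] + 1 + comm_cycles(location[binding[p]],
                                            location[binding[op]])
                 for p in deps[op]),
                default=0,
            )
            if earliest <= cycle:
                ready.append(op)
        # Highest priority first; each unit starts at most one op per cycle.
        for op in sorted(ready, key=lambda o: (-height(o), o)):
            unit = binding[op]
            if (unit, cycle) not in busy:
                busy.add((unit, cycle))
                start[op] = cycle
                pending.discard(op)
        cycle += 1
    return start


# Hypothetical 4-operation graph: c needs a and b, d needs c.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
binding = {"a": "Alu1", "b": "Alu1", "c": "Mul1", "d": "Alu2"}
spread = {"Alu1": (0, 0), "Mul1": (0, 1), "Alu2": (0, 0)}
packed = {"Alu1": (0, 0), "Mul1": (0, 0), "Alu2": (0, 0)}
print(list_schedule(deps, binding, spread))
print(list_schedule(deps, binding, packed))
```

With the same binding, moving `Mul1` into a neighboring island delays the final operation from cycle 3 to cycle 5 in this toy model, which is the kind of effect a placement step that is aware of the schedule can try to avoid on critical connections.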