Transcript Introduction to basic concepts on asynchronous circuit design
Industrial Experiences
Pioneering Asynchronous Commercial Design
Peter A. Beerel, Fulcrum Microsystems, Calabasas Hills, CA, USA
Agenda
• Introduction to Fulcrum
• Description of Integrated Pipelining: Fulcrum’s clockless circuit architecture
• Description of Fulcrum’s Design Flow
• Overview of Nexus: Fulcrum’s Terabit crossbar
• Overview of PivotPoint: Fulcrum’s first commercial product
[Figure: design flow between Circuit A and Circuit B: Specification, Design & Verification, Synthesis & Floor Planning, Physical Design, Database Release to Manufacturing]
Company Snapshot
• “Clockless” semiconductor company
• Formed out of Caltech (1/00)
• Technology proven in large-scale designs
• Located in Calabasas, CA (30 people)
• Backed by top-tier investors (raised $14M in June)
Fulcrum’s Integrated Pipelining
Robust, power efficient, and high performance
[Figure: pipeline of dual-rail domino logic stages linked by acknowledge signals]
Fast delay-insensitive style using domino logic without latches (developed at Caltech by Fulcrum’s founders)
Integrated Pipelining
[Figure: leaf cells A, B, and C, each with dual-rail domino logic and a control block, plus input completion detection and output completion detection]
Harnessing the power of domino logic
• Addresses delay variability with completion sensing
• Addresses power inefficiency with async handshakes
• Leverages more efficient “N” transistors
Hierarchical Design
• Multi-level hierarchy of communicating blocks
• At each level blocks communicate along channels
[Figure: example hierarchy. Labels: ASIC (Main FSM, Memory, Reg A, Reg B, Reg C, Adder, Adder/Mult., Subtract/Divider), Register Bank (B N-1, B N-2, B N-3), full adders (FA N-1, FA N-2, FA N-3, FA 0), leaf cells, channels]
Leaf Cells
[Figure: leaf cell with blocks labeled C, F, LCD, RCD, and D]
Definition
• Smallest block that performs logic and communicates via channels
• Based on small number of pipeline templates guiding design
• Forms basic building block for physical design
Features
• Facilitates high throughput and low latency
• Provides easy timing validation and analog verification
• ~1,000 digital leaf cell types compose our leaf cell library
• ~200 additional subtypes for different physical environments (e.g., loads)
Template-Based Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Library uses a blueprint to implement the lowest level blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage, with a 2-input 1-output pipeline stage and a 1-input 2-output pipeline stage as examples (blocks labeled C, F, LCD, RCD)]
Summary of Characteristics
• Delay-insensitive timing model
  • Gates and wires can have arbitrary delays
• 4 phase 1of4 handshake
  • Uses 4 wires to send 2 bits
  • Plus an acknowledge wire for flow control
  • Returned to neutral between each data transfer
  • Self shielding
• Precharge domino logic plus async handshake
  • Low latency; high frequency; robust
  • Auto power conservation; zero standby power
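The 1of4 encoding and four-phase handshake summarized above can be illustrated with a small behavioral model. This is a minimal sketch, not Fulcrum's circuit: the function names, the active-high acknowledge polarity, and the least-significant-pair-first ordering are all assumptions made for illustration.

```python
"""Sketch of a 1-of-4 channel with a four-phase (return-to-neutral) handshake."""

NEUTRAL = (0, 0, 0, 0)


def encode_1of4(value):
    """Encode a 2-bit value (0..3) by raising exactly one of four wires."""
    if not 0 <= value <= 3:
        raise ValueError("a 1-of-4 code carries exactly 2 bits")
    return tuple(1 if i == value else 0 for i in range(4))


def is_valid(wires):
    """Completion detection: data is valid when exactly one wire is high."""
    return sum(wires) == 1


def decode_1of4(wires):
    """Recover the 2-bit value from a valid code word."""
    if not is_valid(wires):
        raise ValueError("code word is neutral or illegal")
    return wires.index(1)


def transfer(value):
    """Return the (data wires, acknowledge) states of one four-phase transfer."""
    data = encode_1of4(value)
    return [
        (data, 0),     # 1. sender drives one wire high
        (data, 1),     # 2. receiver detects validity and raises acknowledge
        (NEUTRAL, 1),  # 3. sender returns the wires to neutral
        (NEUTRAL, 0),  # 4. receiver lowers acknowledge; channel is idle again
    ]


def send_byte(byte):
    """Send 8 bits as four 1-of-4 symbols, least significant pair first (assumed order)."""
    trace = []
    for shift in range(0, 8, 2):
        trace.extend(transfer((byte >> shift) & 0b11))
    return trace


def receive(trace):
    """Rebuild the symbols by sampling the wires whenever acknowledge rises."""
    symbols, prev_ack = [], 0
    for wires, ack in trace:
        if ack == 1 and prev_ack == 0:
            symbols.append(decode_1of4(wires))
        prev_ack = ack
    return symbols
```

Because exactly one wire rises per symbol, the receiver needs no clock: validity is visible in the data itself, and the return to neutral separates consecutive symbols.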
Fulcrum Design Flow
[Figure: flow stages: Design Specification, Architecture Design & Verification, Micro-architecture Design & Verification, Synthesis & Floor Planning, Physical Design, Database Release to Manufacturing]
• Hierarchical design flow
  • Executable specifications
  • Formal decomposition
  • Creates design hierarchy
• Semi-custom synthesis & layout
  • Hierarchical floor planning
  • Automated transistor sizing
  • Semi-automated physical design
• Supports synchronous & asynchronous designs
  • Hard macro from place & route
Managing Design Hierarchy
• Proprietary object-oriented hardware language
  • Integrated hierarchical design/verification language
  • Defines cell specification & implementation
• Specification: Java or communicating sequential processes (CSP)
• Implementation: multiple forms
  • Sub-cells
  • Sub-cells defined in terms of specification or implementation
• Defines integrated test environment for each cell
  • Enables verification at all pairs of levels
• Efficiency features
  • Supports refinement of cells and channels
Physical Design
• Layout hierarchy based on design hierarchy
• Hierarchical floor-planning semi-automated
  • Large scale hand placement before sizing
  • Long distance channels planned carefully
• Timing closure by construction
  • Placement drives sizing
  • Can insert extra pipelining on long wires late in design
• Tradeoffs between performance and design time
  • Hand layout where necessary
  • Automated layout where possible
• Goals
  • Full-custom density and speed within ASIC design time
Design Verification: System-Level
[Figure: test bench (test cases, configuration manager, traffic generator & checker, bus functional model, monitor) around the device under test (executable spec or gate-level Verilog model)]
Mission
• Verify that executable spec = written spec + gate level model
• Use industry-standard tools & methods
  • Cadence NCSIM and efficient Java-Verilog interface
  • Directed random testing
  • Line & functional coverage
Design Verification: Unit-Level
[Figure: a test engine feeds copies of the same stimulus to a high-level (Java/CSP) model and a low-level (CSP/PRS/CDL) model; the outputs are compared (==) and logged]
Mitered co-simulation for unit-level verification
• Check correctness of digital model by comparing it to golden CSP/Java model
Features
• Framework automated and regressed
• Checks correctness
• Checks delay insensitivity and/or throughput and latency
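The mitering idea can be sketched in a few lines: the same stimulus goes to a golden model and to a lower-level model, and any disagreement is logged. The adder pair, the function names, and the seeded random stimulus below are hypothetical stand-ins chosen for illustration; they are not taken from Fulcrum's framework, which compares CSP/Java models against PRS/CDL implementations.

```python
"""Sketch of mitered co-simulation: one stimulus, two models, outputs compared."""
import random


def golden_adder(a, b, width=8):
    """High-level reference model (stands in for the Java/CSP specification)."""
    return (a + b) % (1 << width)


def ripple_adder(a, b, width=8):
    """Lower-level model: a chain of full adders, one bit at a time."""
    carry, total = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total


def miter(golden, candidate, stimuli):
    """Feed a copy of each stimulus to both models and log every mismatch."""
    log = []
    for args in stimuli:
        expected, actual = golden(*args), candidate(*args)
        if expected != actual:
            log.append((args, expected, actual))
    return log


def random_stimuli(count, width=8, seed=0):
    """Seeded random operand pairs, so a regression run is repeatable."""
    rng = random.Random(seed)
    return [(rng.randrange(1 << width), rng.randrange(1 << width))
            for _ in range(count)]
```

An empty log from `miter(golden_adder, ripple_adder, random_stimuli(200))` means the two models agreed on every stimulus. This sketch checks functional equivalence only; checking delay insensitivity or throughput and latency would need a timed simulation.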
Analog Verification: Charge Sharing
[Figure: flow from synthesis through the charge sharing test generator to SPICE]
• SPICE-based charge sharing analysis
  • Test case generation and analysis automated
• Charge-sharing problems solved in numerous ways
  • Symmetrization
  • Less transistor sharing
  • Delay perturbations
Synthesis: Gate Generation / Sizing
• Automated generation of transistor netlists
• Dynamic logic generation
  • Transistor sharing
  • Symmetrization
  • Gate-library matching
• Transistor sizing
  • Path-based sizing to meet amortized unit-delay model
• Micro-architecture feedback
  • Identifies where fanout limits performance
[Figure: CSP and a gate library feed logic synthesis; transistor sizing uses floor planning information and produces a CDL netlist]
Fulcrum QDI v. Synchronous Flows
• Saves clock tree design, analysis, optimization, and verification
• No timing closure problems
  • Unexpected long-wire bottlenecks easily solved with additional pipeline buffers late in design cycle
  • QDI/DI timing model reduces timing analysis challenges
• Fulcrum QDI hierarchical design facilitates:
  • Composability, re-use, and early bug detection
  • Hierarchical floorplanning improves predictability of wires
  • Template-based leaf cell design simplifies logic design
  • Design reuse reduces criticality of high-level synthesis
  • Decomposition methodology amenable to formal verification
Globally Asynchronous, Locally Synchronous
• SoC designs: many cores with different clock domains
• Async circuits can interconnect multiple sync cores in an SoC design, eliminating global clock distribution and simplifying clock domain crossing
• Fulcrum’s “Nexus” is a high speed on-chip interconnect:
  • 16 port, 36 bit asynchronous crossbar
  • Asynchronous cross-chip channels
  • Async-sync clock domain converters
  • Runs at 1.35GHz in 130nm process
Nexus System-on-Chip Interconnect
[Figure: generic Nexus example. Legend: synchronous IP block, asynchronous IP block, pipelined repeater, clock domain converter]
• Non-blocking crossbar
• 16 full-duplex ports
• Flow control extends through the crossbar
• Full speed arbitration
• Arbitrary length “bursts”
• Bridges clock domains
• Scales in bit width and ports
• Process portable
Nexus Burst Format
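[Figure: burst format. A burst carries 36-bit data words D1 … DN, a 1-bit tail flag (1 on DN, 0 on the other words), and a 4-bit control field. Incoming from the source module, the control field holds the “To” target module; outgoing to the target module, it holds the “From” source module.]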
Arbitrary-length source-routed bursts provide flexibility
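A source-routed burst of this kind can be modeled as a small data structure. This is a sketch under stated assumptions: the slide gives the field widths (36-bit data, 1-bit tail, 4-bit control) and the To/From relabeling, but the class and function names are hypothetical, and carrying the control field once per burst rather than once per word is an assumption.

```python
"""Sketch of a source-routed Nexus-style burst (field layout partly assumed)."""
from dataclasses import dataclass

DATA_BITS = 36     # from the slide: 36-bit data words
CONTROL_BITS = 4   # from the slide: 4-bit control field (enough for 16 ports)


@dataclass(frozen=True)
class Burst:
    control: int   # "To" port on the way in, "From" port on the way out
    words: tuple   # data words D1..DN

    def flits(self):
        """Return (data, tail) pairs; tail is 1 only on the last word."""
        last = len(self.words) - 1
        return [(w, 1 if i == last else 0) for i, w in enumerate(self.words)]


def make_burst(target_port, words):
    """Build an incoming burst addressed to target_port."""
    if not 0 <= target_port < (1 << CONTROL_BITS):
        raise ValueError("port does not fit in the control field")
    if not words:
        raise ValueError("a burst needs at least one data word")
    if any(not 0 <= w < (1 << DATA_BITS) for w in words):
        raise ValueError("data word wider than 36 bits")
    return Burst(control=target_port, words=tuple(words))


def route(source_port, burst):
    """Model the crossbar: deliver to the 'To' port, relabeled with 'From'."""
    target_port = burst.control
    return target_port, Burst(control=source_port, words=burst.words)
```

Since the tail flag marks the end of a burst, the length never has to be declared up front, which is what allows bursts of arbitrary length.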
Sync-to-Async Conversion
• Synchronous request/grant FIFO protocol
  • Data transferred if request and grant both high on rising edge of clock
• Compensates for any skew on asynchronous side
• Low latency: 1/2 to 3/2 clock cycles at A2S
[Figure: S2A and A2S converters, each linking a synchronous datapath (request, grant, clock) to an asynchronous datapath]
Seamlessly bridges different clock domains
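The synchronous side of the converter follows a request/grant rule: a word moves only on a rising clock edge where request and grant are both high. A minimal behavioral sketch of that rule follows; the function name and the list-per-edge representation are assumptions, and the asynchronous side is not modeled.

```python
"""Sketch of a synchronous request/grant FIFO interface, sampled per clock edge."""
from collections import deque


def run_clock(requests, grants, data):
    """Return the words transferred, one decision per rising clock edge.

    requests[i] and grants[i] are the levels seen at edge i. A word moves
    only when both are high; otherwise the sender holds the same word.
    """
    pending = deque(data)
    transferred = []
    for req, grant in zip(requests, grants):
        if req and grant and pending:
            transferred.append(pending.popleft())
    return transferred
```

For example, `run_clock([1, 1, 0, 1, 1], [1, 0, 1, 1, 1], ["a", "b", "c"])` transfers on edges 0, 3, and 4 only, so no word is lost or duplicated when either side stalls.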
Arbitration and Ordering
• Unrelated sender/receiver links are independent
• Bursts sent from multiple input ports to the same output port are serviced fairly by built-in arbitration circuitry
• Bursts from A to B remain ordered
• Producer-consumer and global-store-ordering satisfied
  • A sends X to B, A notifies C, C can read X from B
  • A writes X to B, A writes Y to C, if D reads Y from C, it can read X from B
• Split transactions implement loads
  • Load request and load completion bursts
  • Load completions returned out-of-order
Can tunnel common bus and cache coherence protocols
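Fair service at one output port, with per-sender ordering preserved, can be sketched as follows. The slides say only that bursts are "serviced fairly" by built-in arbitration circuitry; the round-robin policy, the class name, and the method names here are assumptions chosen for illustration.

```python
"""Sketch of fair arbitration for one output port (round-robin is an assumption)."""
from collections import deque


class OutputArbiter:
    def __init__(self, num_inputs):
        # One FIFO per input port keeps bursts from the same sender in order.
        self.queues = [deque() for _ in range(num_inputs)]
        self.next_input = 0  # round-robin pointer

    def submit(self, input_port, burst):
        """Queue a burst from input_port for this output."""
        self.queues[input_port].append(burst)

    def grant(self):
        """Pick the next waiting input after the last winner, or None if idle."""
        n = len(self.queues)
        for offset in range(n):
            port = (self.next_input + offset) % n
            if self.queues[port]:
                self.next_input = (port + 1) % n
                return port, self.queues[port].popleft()
        return None
```

With bursts `a1`, `a2` queued from port 0 and `c1` from port 2, successive grants return `a1`, `c1`, `a2`: the two senders alternate, while bursts from the same sender stay in order.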
Example: Load/Store Systems
Option 1: Pure Master/Target Ports
• Masters send Requests to Targets, which may return Completions
• Each port must either be a Master or a Target so that Completions are never blocked by Requests
• Devices which need to be both Masters and Targets are given two separate full-duplex ports
• Could use two separate Nexus crossbars
Option 2: Peers
• Modules which are both Masters and Targets implement an internal buffer to hold Requests so that Completions can bypass them
• All Masters or Peers restrict number of outstanding Requests to avoid overflowing Request buffers
Example: Switch Fabric
• Each module maintains input/output queues for traffic to/from each other module
• Data is sent from an input queue to an output queue over Nexus as a series of short bursts
• Flow control credits for each output queue are sent backward
• Eliminates head-of-line blocking
• Segmentation, buffering, and overspeed optimize performance during congestion
• Used in PivotPoint, Fulcrum’s first chip product
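Credit-based flow control between one input queue and one output queue can be sketched as below. The class and method names are hypothetical, and the one-credit-per-burst accounting is an assumption; the slides state only that credits for each output queue are sent backward.

```python
"""Sketch of credit-based flow control between an input and an output queue."""
from collections import deque


class CreditLink:
    def __init__(self, output_queue_depth):
        self.credits = output_queue_depth   # free slots the sender may use
        self.output_queue = deque()

    def send(self, burst):
        """Send only if a credit is available; never overflow the receiver."""
        if self.credits == 0:
            return False          # sender keeps the burst in its input queue
        self.credits -= 1
        self.output_queue.append(burst)
        return True

    def drain(self):
        """Receiver removes one burst and returns a credit to the sender."""
        burst = self.output_queue.popleft()
        self.credits += 1
        return burst
```

Keeping one such link per destination means a sender that runs out of credits for one output can still send to the others, which is how per-destination queues avoid head-of-line blocking.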
Nexus Silicon Validation
[Figure: block diagram of Nexus validation chip (labels: S1 to S7, ALU, Serial IO)]
TSMC 130nm LV results

Proc    V     GHz    ns    pJ/bit
Low-K   1.2   1.35   2.0   10.4
Low-K   1.0   1.11   2.4   7.0
FSG     1.2   1.10   2.5   11.2
FSG     1.0   0.87   3.1   7.6

• Crossbar area: 1.75mm^2
• Total interconnect area: 4.15mm^2
• Peak cross-section bandwidth: 778Gb/s
[Figure: plot of Nexus crossbar]
Nexus Summary
• Nexus is an asynchronous crossbar interconnect designed to connect up to 16 synchronous modules in a SoC
• Nexus can be used to implement load/store systems as well as switch fabrics
• Systems using Nexus can be tested with standard equipment
• Nexus runs up to 1.35GHz in TSMC 130nm
• Asynchronous interconnect is now viable for very high performance SoC designs
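The peak cross-section bandwidth quoted for the validation chip (778Gb/s) appears consistent with 16 ports, each moving 36 data bits at 1.35GHz. The short check below assumes that this product is how the figure was derived; the slides do not state the formula, and the function name is hypothetical.

```python
# Figures taken from the slides.
PORTS = 16        # crossbar ports
DATA_BITS = 36    # data bits per port
FREQ_GHZ = 1.35   # peak transfer rate (Low-K process, 1.2 V)


def cross_section_gbps(ports, bits, freq_ghz):
    """Aggregate bandwidth if every port transfers one word per cycle (assumed formula)."""
    return ports * bits * freq_ghz


peak = cross_section_gbps(PORTS, DATA_BITS, FREQ_GHZ)
print(f"{peak:.1f} Gb/s")
```

The result is 777.6 Gb/s, which rounds to the quoted 778Gb/s.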
PivotPoint Blade Interconnect
World’s first high-performance clockless chip
• Large-scale SoC design
  • >32.5M transistors (83% async)
  • 14 separate clock domains
• Includes key Fulcrum IP
  • Nexus Terabit Crossbar
  • Quad-port 600MHz async SRAM
• Operates at over 1GHz
  • Delivers 192Gbps of non blocking switching capacity
• Testable via standard tools
  • JTAG; scan chain
• Activity-based power scaling
• 9-month project
[Figure: generic system “blade” with CPU, NPU, ASIC, and FPGA devices connected over SPI-4, plus I/O (Phy/MAC) and an X8 backplane interface]
PivotPoint Leverages Nexus
[Figure: PivotPoint block diagram. Labels: SPI-4 interfaces with route tables and 16KB buffers, CPU interface, JTAG interface, control bus (serial tree), boundary scan, 3ns latency]
A true SoC GALS design
• Flexible architecture
  • 6 duplex SPI-4.2 interfaces
  • All paths are independent
• Optimized for performance
  • Up to 14.4Gbps per interface
  • Up to 32Gbps per Nexus port
  • Full-rate buffer memories
  • Lossless flow control
• Easily configurable
  • 16-bit CPU interface
  • JTAG support
• Modest size and power
  • ~2 Watt per active interface
  • 1036 ball package
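The 192Gbps switching capacity quoted for PivotPoint appears consistent with six Nexus ports at 32Gbps each, and the per-port and per-interface rates imply an internal overspeed of a little over 2x. The check below assumes one Nexus port per SPI-4.2 interface, which the slides suggest but do not state; the function names are hypothetical.

```python
# Figures taken from the slides.
INTERFACES = 6           # duplex SPI-4.2 interfaces
NEXUS_PORT_GBPS = 32.0   # peak rate per Nexus port
INTERFACE_GBPS = 14.4    # peak rate per SPI-4.2 interface


def switching_capacity_gbps(ports, port_gbps):
    """Aggregate capacity, assuming one Nexus port per interface."""
    return ports * port_gbps


def overspeed(port_gbps, interface_gbps):
    """Ratio of internal port rate to external interface rate."""
    return port_gbps / interface_gbps


capacity = switching_capacity_gbps(INTERFACES, NEXUS_PORT_GBPS)
ratio = overspeed(NEXUS_PORT_GBPS, INTERFACE_GBPS)
print(f"capacity {capacity:.0f} Gbps, overspeed {ratio:.2f}x")
```

The capacity works out to 192 Gbps and the overspeed to about 2.22x, the kind of headroom that lets a fabric absorb congestion.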
Testing – A Multi-Dimensional Approach
DFT
• Synchronous scan chains for synchronous logic
• Asynchronous scan-chain-like structures for asynchronous logic and sync-async interfaces
• Standardized JTAG interface for testing
Fault-Grading
• Verilog fault-model for domino logic
• Industry-standard fault grading tools
BIST
• Use Nexus for observability in Nexus-based SoCs
• RAM self test and repair
Differentiating Through Technology
Leveraging our clockless technology foundation
Differentiated Product Offering
• High performance (latency, capacity)
• Power efficient (linear scaling)
• Robust in operation
Unique IP Blocks
• Unmatched performance
• Extremely robust (power and temperature)
• Easy to integrate (benign behavior)
Clockless Technology Foundation
• Silicon proven and customer validated
• Mature CAD flow (integrated with commercial tools)
• Robust cell library (thousands of unique cells)
Thank You!
Peter A. Beerel, PhD
VP Strategic CAD
pabeerel@fulcrummicro.com
818.871.8100
www.fulcrummicro.com
26775 Malibu Hills Road, Suite 200, Calabasas Hills, CA 91301
“A group of engineers wants to turn the microprocessor world on its head by doing the unthinkable: tossing out the clock and letting the signals move about unencumbered. For those designers, inspired by research conducted at Caltech, clocks are for wimps.”
Anthony Cataldo, EE Times