PPT attached - USC Asynchronous CAD/VLSI Group

Design Flows and Tools
Peter A. Beerel
University of Southern California
USC Asynchronous CAD/VLSI Group (async.usc.edu)
Part II - Agenda
Design Flows
• Design via decomposition
• Modeling design using SystemVerilog
Design Automation – The Proteus-A flow
• Legacy RTL
• Added SystemVerilog CSP front-end
• Asynchronous optimizations
Final Flow Considerations
• Analog Verification
• Design for Test and Debug
Design via Process Decomposition
• Collection of Processes linked by Channels
• Channels pass messages with guaranteed delivery
• Processes synchronize
• Processes can be decomposed into smaller processes
Modeling Asynchronous Design via SystemVerilogCSP (SVC)
• SystemVerilog interface abstracts channel wires as well as communication protocol
• Send/Receive
• Blocking tasks (Flow control)
[Figure: Sender and Receiver connected through an SVC interface (abstract communication)]
module Sender (interface R);
  parameter WIDTH = 8;
  logic [WIDTH-1:0] data;
  always
  begin
    // produce data
    R.Send(data);
  end
endmodule

module Receiver (interface L);
  parameter WIDTH = 8;
  logic [WIDTH-1:0] data;
  always
  begin
    L.Receive(data);
    // consume data
  end
endmodule
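The blocking Send/Receive behaviour above can be mimicked in software. The following is a minimal Python sketch, not the SVC implementation: the `Channel` class, its `send`/`receive` methods, and `run_demo` are hypothetical names introduced only for illustration, and threads stand in for hardware processes. It assumes a rendezvous channel in which each side blocks until the other arrives.

```python
import threading


class Channel:
    """Hypothetical rendezvous channel (illustration only, not the SVC library).

    send() blocks until a receiver has taken the value, and receive()
    blocks until a sender offers one, loosely mirroring blocking Send/Receive.
    """

    def __init__(self):
        self._lock = threading.Lock()          # serializes concurrent senders
        self._data = None
        self._full = threading.Semaphore(0)    # released when data is valid
        self._taken = threading.Semaphore(0)   # released when data was consumed

    def send(self, value):
        with self._lock:
            self._data = value
            self._full.release()    # announce valid data
            self._taken.acquire()   # wait for the receiver's acknowledge

    def receive(self):
        self._full.acquire()        # wait for a sender
        value = self._data
        self._taken.release()       # acknowledge, unblocking the sender
        return value


def run_demo(values):
    """Run one sender and one receiver thread over a single channel."""
    ch = Channel()
    received = []

    def sender():
        for v in values:
            ch.send(v)              # produce data

    def receiver():
        for _ in values:
            received.append(ch.receive())  # consume data

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Calling `run_demo([1, 2, 3])` returns the values in the order they were sent, since every send completes only after the matching receive.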
SVC - Waveform view
// Sender (DataGen)
always
begin
  #Delay;
  R.Send(data);
end

// Receiver
always
begin
  L.Receive(data);
  #FL;
  R.Send(data);
  #BL;
end
Waveform annotations:
• Receiver pending on Receive
• Sender performs Send, communication happens
• No one is sending or receiving
• Sender pending on Send
• Receiver performs Receive, communication happens
The Proteus-A Flow – Legacy RTL
Key Features
• Re-uses synchronous EDA tools
• Seamless integration into existing flows
• Back-end design style agnostic
• Up to 2X higher performance
Tool Status
• Commercialized version in production for 2+ years
• Uses proprietary QDI library
• Academic version (Proteus-A) enhanced significantly at USC
Recent Advances
• Power optimization algorithms
[Figure: flow diagram. Labels: Synth. RTL, Design Goals, Synthesis, Proteus/Sync Library, Image Netlist, Clock Gating, Clock Tree Synthesis, ClockFree, Constraints, Async Netlist, Physical Design, Final Layout]
Flow Demo – Legacy RTL
[Figure: flow demo. Labels: Legacy RTL Specification, Synth. RTL, Synthesis, Synthesized Image Netlist, Clockfree, Asynchronous Gate-level Netlist, Physical Design, Final Layout]
Amber23 – Proteus-A Case Study
• Download from http://opencores.com/project,amber
• ARM-compatible 32-bit RISC processor
• 3 stages : FETCH, DECODE and EXECUTE
[Figure: Amber23 block diagram. Labels: Cache; Bus interface; Instruction; Decode; State machine control; Register bank; Barrel shifter; ALU; Multiplexer; Read data; Address, write data]
Zhang, USC Summer Research, 2012
Amber23 – Performance Comparison
Zhang, USC Summer Research, 2012
The Proteus-A Flow – SVCRTL
Key New Features
• Supports SystemVerilog CSP front-end
• Enables user-defined conditional communication
• Saves power at architectural level
Tool Status
• Proprietary version starting from CAST developed at Fulcrum
• SystemVerilog version subsequently developed at USC
• Used in current research at USC and Technion and 40+ person async class
[Figure: flow diagram. Labels: SystemVerilog, SVC2RTL, Synth. RTL, Design Goals, Synthesis, Proteus/Sync Library, Image Netlist, Clock Gating, Clock Tree Synthesis, ClockFree, Constraints, Async Netlist, Physical Design, Final Layout]
Key to Low-Power Conditional Communication
[Figure: datapath example. Labels: op, A,B, DEMUX, Add/Sub, Mult, MUX]
Conditional communication reduces token flow, saving power
• Traditionally - manually introduced via user-created decomposition
• Recent research - automatically introduced via Operand Isolation
Saifhashemi, PATMOS 2012
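As a rough illustration of why conditional communication reduces token flow, the Python sketch below counts how many operand tokens reach each functional unit. It is a toy model under stated assumptions, not the authors' tool: the function name `count_tokens`, the opcode strings, and the broadcast-versus-steer model are all hypothetical.

```python
def count_tokens(ops, conditional):
    """Count operand tokens entering each functional unit.

    ops         -- iterable of opcodes, "add" or "mul", one per operand pair
    conditional -- True: a DEMUX steers each operand pair only to the
                   selected unit; False: operands are broadcast to both
                   units and a MUX discards the unused result
    (Toy model for illustration; names are assumptions.)
    """
    tokens = {"add": 0, "mul": 0}
    for op in ops:
        if conditional:
            tokens[op] += 1          # only the selected unit sees a token
        else:
            tokens["add"] += 1       # both units compute every time
            tokens["mul"] += 1
    return tokens


# Example: three additions and one multiplication.
ops = ["add", "add", "add", "mul"]
unconditional = count_tokens(ops, conditional=False)  # {'add': 4, 'mul': 4}
steered = count_tokens(ops, conditional=True)         # {'add': 3, 'mul': 1}
```

In this toy stream the multiplier receives one token instead of four once the operands are steered, which is the kind of activity reduction the slide refers to.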
SVC2RTL – Enables User-Defined Conditional Communication
[Figure: conditional communication example. Annotations: Dummy value, Not received, Not sent]
Power Optimization Overview
• Conditioning: automatically add conditional communication
• Reconditioning: optimize the existing conditionality
Power Saving - The Opportunity
[Figure: adder annotated "Unnecessary calculation"]
Our Solution - Adding Isolation Cells
• All inputs/outputs are unconditional
• Operand Isolation
• AND-based isolation cells
• Generated by synchronous RTL synthesizer
• Does not prevent switching in asynchronous circuits
Isolation cells are not effective in asynchronous circuits
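One way to see why forcing operands to zero does not stop switching is to count rail transitions. The Python sketch below assumes a dual-rail, four-phase (return-to-neutral) data encoding, which is common in QDI design but is not stated on the slide; the function names are hypothetical. Under that assumption every bit asserts exactly one rail per token, so an all-zero word costs as many transitions as any other word.

```python
def dual_rail_encode(value, width):
    """Encode an integer as (true_rail, false_rail) pairs, LSB first.

    Assumed dual-rail code: bit 1 -> (1, 0), bit 0 -> (0, 1).
    The neutral (no data) state (0, 0) is not produced here.
    """
    pairs = []
    for i in range(width):
        bit = (value >> i) & 1
        pairs.append((bit, 1 - bit))
    return pairs


def transitions_per_token(value, width):
    """Rail transitions for one four-phase token.

    Each asserted rail rises when the token becomes valid and falls
    when the channel returns to neutral, so it contributes two transitions.
    """
    asserted = sum(t + f for t, f in dual_rail_encode(value, width))
    return 2 * asserted


zero_cost = transitions_per_token(0x00, 8)   # 16
data_cost = transitions_per_token(0xA5, 8)   # 16
```

Both calls give 16 transitions for an 8-bit word, so in this model an AND-gated zero operand still switches the downstream logic; only withholding the token removes the activity.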
Our Solution - Conditioning
[Figure: conditioned adder datapath. Annotation: "No Activity"]
Power Optimization Results
• Case study: 32-bit ALU placed and routed
• Back annotated switching activity using a VCD file
• Results:
• Isolating ADD and SUB is detrimental for rADD and rSUB > 0.2
• 53% power reduction when only isolating MUL (rf=0.25)
• Area cost of isolating MUL is about 4%, with no performance penalty
Saifhashemi, PATMOS 2012
Power Savings – The Opportunity
[Figure: circuit example annotated "Unnecessary activity" at two locations]
Conditional communication is explicit and only at primary IO
The Reconditioning Problem
Definition (The Reconditioning Problem): Rearrange the location of RECEIVE and SEND cells to minimize power consumption while preserving functional behavior.
Power Results
[Figure: three bar charts titled "Power Comparison: 32 bit". Y-axis: Power. X-axis: Operational factor (0.25, 0.5, 0.75). Series: Original, Greedy0, MILP. Panels: RECON1 (Dual-mode arithmetic unit), RECON2 (Conditional multiplier), ALU-OI (ALU after operand isolation)]
Saifhashemi, PhD Thesis, 2012
Mode Based Conditional Slack Matching
[Figure: mode-based datapath. Labels: op, S, R, A,B, DEMUX, Add/Sub, Mult, MUX]
Conditional Slack Matching Advantage – Conditional behavior yields fewer stalls, so fewer pipeline buffers are needed
• Previously ignored – conservatively modeled as unconditional
Najibii, 2012
Conditional Slack Matching - Results
33% fewer buffers on average
Najibii, 2012
Design Flow Demo
[Figure: flow demo. Labels: SystemVerilog, SVC2RTL, Synth. RTL, Design Goals, Synthesis, Proteus/Sync Library, Image Netlist, ClockFree, Constraints, Async Netlist, Physical Design, Final Layout]
Final Flow Considerations
Static Timing Analysis
• Verifying timing constraints and performance is a must
• Trick traditional tools into working with asynchronous circuits
Analog Verification
• Domino logic used in QDI flows is sensitive to charge sharing
• Asynchronous channels cannot tolerate cross-talk glitches
• Special SPICE-based tools have been developed
Asynchronous Scan
• Asynchronous scan is a must but doable
Design for Silicon Debug
• Chip deadlock is still difficult to debug
Conclusions
The Asynchronous Design Flow/CAD Landscape
• Synchronous design rigidity continues to hamper quality design
• Asynchronous design offers solutions but has many design flow challenges
Design Flow Requirements
• Design flows must easily integrate into synchronous designs
• Circuit quality must compete very well to warrant switching design styles
Our approach
• Proteus provides a good design framework for automation of both legacy RTL and SystemVerilog CSP
• Final considerations of analog and timing verification, scan, and debug should not be overlooked
Acknowledgements
http://ee.usc.edu/async2013