SPARK Overview - University of California, San Diego
SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems
University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation & Intel Inc
System Level Synthesis
[Figure: system-level synthesis flow. Boxes: System Level Model; Task Analysis; HW/SW Partitioning; Hardware Behavioral Description; High Level Synthesis; Software Behavioral Description; Software Compiler. Target platform components: ASIC, FPGA, Memory, I/O, Processor Core.]
Copyright Sumit Gupta 2003
High Level Synthesis
Transform behavioral descriptions to RTL/gate level
From C to CDFG to Architecture
x = a + b;
c = a < b;
if (c) then
d = e - f;
else
g = h + i;
j = d × g;
l = e + x;
[Figure: CDFG of the code above. Nodes x=a+b and c=a<b lead to an If Node with true branch (T) d=e-f and false branch (F) g=h+i, followed by j=d×g and l=e+x. Architecture blocks: Control, Memory, ALU, Data path.]
Problem # 1: Poor quality of HLS results beyond straight-line behavioral descriptions
Problem # 2: Poor/No controllability of the HLS results
High-level Synthesis
Well-researched area: from early 1980s
Renewed interest due to new system level design methodologies
Large number of synthesis optimizations have been proposed
Either operation level: algebraic transformations on DSP codes
or logic level: Don't Care based control optimizations
In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
Parallelizing Compiler Transformations
Different optimization objectives and cost models than HLS
Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
Beyond scheduling results: in Circuit Area and Delay
For large designs with complex control flow (nested conditionals/loops)
Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. C Input -> Original CDFG -> Source-Level Compiler Transformations -> Optimized CDFG -> Scheduling & Binding (with Scheduling Compiler & Dynamic Transformations) -> VHDL Output.]
Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
Source-level code refinement using Pre-synthesis transformations
Code Restructuring by Speculative Code Motions
Operation replication to improve concurrency
Dynamic transformations: exploit new opportunities during scheduling
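The idea behind speculative code motion can be sketched as a hand-written before/after pair in plain C. This is an illustration only, not SPARK output; the function names `select_original` and `select_speculated` are hypothetical. In the restructured version both branch operations are computed unconditionally into temporaries, so a scheduler is free to place them alongside the comparison, and the conditional shrinks to a selection (a multiplexer in hardware).

```c
/* Original form: the subtraction and the addition each wait for the
   comparison result before they can be scheduled. */
int select_original(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int r;
    if (c) {
        r = e - f;
    } else {
        r = h + i;
    }
    return r;
}

/* Hand-applied speculation (an assumption about what such a
   transformation produces, not actual tool output): both operations
   run unconditionally into temporaries, and only the final selection
   depends on the condition. This is valid here because neither
   operation has side effects; operand ranges are assumed not to
   overflow. */
int select_speculated(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int t_true = e - f;   /* speculated: result used only if c is true  */
    int t_false = h + i;  /* speculated: result used only if c is false */
    return c ? t_true : t_false;
}
```

Both functions return the same value for every input; the difference is that the second form exposes two independent operations that can share a cycle, at the cost of always spending both functional units.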
SPARK High Level Synthesis Framework
SPARK Parallelizing HLS Framework
C input and Synthesizable RTL VHDL output
Tool-box of Transformations and Heuristics
Each of these can be developed independently of the other
Script-based application of transformations, passes, and heuristics: similar to Synopsys Design Compiler
Hierarchical Intermediate Representation (HTGs)
Retains structural information about design (conditional blocks, loops)
Enables efficient and structured application of transformations
Complete HLS tool: Does Resource Binding & Control Synthesis
Enables Graphical Visualization of Design description and intermediate results (CDFG, DFG, HTG)
Benchmarked on large set of multimedia & image processing designs
SPARK System Release available for download
User Manual for running tool and changing synthesis scripts
Tutorial for the synthesis of a portion of an MPEG player
100,000+ lines of C++ code
PHLS Transformations
Organized into Four Groups
Pre-Synthesis Source-to-Source Transformations
Loop-Invariant Code Motions, Loop Unrolling, CSE
Scheduling synthesis & compiler transformations
Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Shifting (Incremental Loop Pipelining technique)
Dynamic: Transformations applied dynamically during scheduling
Dynamic CSE & Copy Propagation, Dynamic Branch Balancing
Basic Compiler Transformations
Copy Propagation, Dead Code Elimination, Constant Propagation
Application of these transformations is guided by Synthesis Scripts
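Two of the transformations named above, common subexpression elimination (CSE) and loop-invariant code motion, can be illustrated with a small hand-transformed C function. This is a sketch under assumptions: the functions `sum_before` and `sum_after` and the array length `N` are hypothetical, and the rewrite is done by hand rather than by any tool.

```c
#define N 8

/* Before: (a + b) is evaluated twice per iteration, and scale * 2 is
   re-evaluated on every iteration although it never changes. */
int sum_before(const int x[N], int a, int b, int scale)
{
    int sum = 0;
    int i;
    for (i = 0; i < N; i++) {
        int p = (a + b) * x[i];
        int q = (a + b) + scale * 2;
        sum += p + q;
    }
    return sum;
}

/* After: the shared expression a + b is computed once (CSE), and both
   it and scale * 2 are hoisted out of the loop (loop-invariant code
   motion). Only the multiply and the accumulation stay in the loop. */
int sum_after(const int x[N], int a, int b, int scale)
{
    int sum = 0;
    int i;
    int ab = a + b;     /* common subexpression, also loop-invariant */
    int s2 = scale * 2; /* loop-invariant */
    for (i = 0; i < N; i++) {
        sum += ab * x[i] + (ab + s2);
    }
    return sum;
}
```

In a synthesis setting the benefit is fewer operations to schedule and bind per loop iteration, which is why such rewrites are worthwhile before scheduling begins.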
Experiments
We used SPARK to synthesize designs derived from several industrial designs
Examples: MPEG-1, MPEG-2, GIMP image processing software
Case study of the Intel Instruction Length Decoder
Quantified the effects of individual transformations on quality of results (QoR)
Pre-synthesis transformations
Speculative Code Motions, Loop Pipelining
Dynamic Transformations
Scheduling results: Number of States in FSM, Cycles on Longest Path through Design
VHDL logic synthesis results: Critical Path Length (ns), Unit Area
Scheduling & Logic Synthesis Results
[Figure: two bar charts, "MPEG-1 Pred1 Function" and "MPEG-1 Pred2 Function". Bars show Longest Path (l cyc), Critical Path (c ns), Total Delay (c*l), and Unit Area on a vertical axis from 0 to 1.2. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Pre-Synthesis Transforms; + Speculative Code Motions; + Dynamic CSE. Percentage labels on the bars: 36%, 39%, 42%, 10%, 8%, 36%.]
Overall: 63-66 % improvement in Delay
Almost constant Area
Example Design: ILD Block from Intel
Case Study: A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
Characteristics of Microprocessor functional blocks
Low Latency: Single or Dual cycle implementation
Consist of several small computations
Intermix of control and data logic
Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle design
Our toolbox approach enables us to develop a script to synthesize applications from different domains
Final design looks close to the actual implementation done by Intel
Key Insights from Project
Coarse-grain and Fine-grain Parallelizing transformations and basic compiler transformations are essential to achieving high quality of synthesis results
Language-level pre-synthesis optimizations are important due to the high level of abstraction of behavioral C
Also important for coarse-grain design space exploration
Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired results
Transformations from compilers and parallelizing compilers do not directly translate over to synthesis
Need to be radically changed with completely different cost models and guiding principles
New parallelizing transformations (or transformations that are not useful for compilers) have to be developed for synthesis
Key Insights from Project
Designers want script-based control over transformations and passes, similar to Synopsys Design Compiler
Designer insights can be used to guide transformations, especially coarse-grain code restructuring for design space exploration
Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay, because they can lengthen the critical path (i.e., the clock period)
For example, loop unrolling and loop pipelining increase the number of operations in the design and hence resource utilization, which in turn increases the size of multiplexers and controllers
Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex control flow
A hierarchical intermediate representation (Hierarchical Task Graphs, HTGs) is required for retaining control and structural information for efficient coarse-level optimizations
The full set of data dependencies (RAW, WAR, WAW) is required for correlating output VHDL and C with input C.
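The point about schedule length versus circuit delay follows from the "Total Delay (c*l)" metric used in the result charts: total delay is the number of cycles on the longest path multiplied by the clock period set by the critical path. A minimal sketch of that arithmetic, with hypothetical function names and hypothetical numbers (none of them measured results from the talk):

```c
/* Total delay through the design: cycles on the longest path (l)
   times the clock period imposed by the critical path (c, in ns). */
double total_delay_ns(int longest_path_cycles, double critical_path_ns)
{
    return longest_path_cycles * critical_path_ns;
}

/* Returns 1 if the candidate design has a smaller total delay than
   the baseline, 0 otherwise. */
int improves_delay(int base_cycles, double base_clock_ns,
                   int new_cycles, double new_clock_ns)
{
    return total_delay_ns(new_cycles, new_clock_ns)
         < total_delay_ns(base_cycles, base_clock_ns);
}
```

With made-up numbers: a baseline of 10 cycles at a 5.0 ns clock has a total delay of 50 ns. A transformation that cuts the schedule to 8 cycles but stretches the critical path to 7.0 ns yields 56 ns, so `improves_delay(10, 5.0, 8, 7.0)` returns 0 even though the cycle count went down.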
Conclusions
Parallelizing code transformations enable a new range of HLS transformations
Provide the needed improvement in quality of HLS results
Possible to be competitive against manually designed circuits.
Can enable productivity improvements in microelectronic design
Built a C-to-VHDL synthesis system with a range of code transformations
Platform for applying Coarse and Fine-grain Optimizations
Tool-box approach where transformations and heuristics can be developed
Enables the designer to find the right synthesis script for different application domains
Performance improvements of 60-70 % across a number of designs
We have shown its effectiveness on an Intel design
SPARK Release
Available for download
http://www.cecs.uci.edu/~spark
User Manual
Running the tool
Customizing the synthesis scripts
Tutorial
Synthesis of a portion of the Motion Compensation algorithm in an MPEG-1 player
Thank You
Ongoing Work: Interface Synthesis Co-Design
Targeting an FPGA Platform
Developed novel memory mapping algorithm to fit memory elements/application onto FPGA platform
[Figure: co-design flow. C Input (MPEG-1 Pred Block) -> Execution Profiling -> Manual HW/SW Partitioning. Hardware C Description -> SPARK High-Level Synthesis; Software C Description -> Software Compiler; Interface Synthesis. FPGA Platform components: FPGA, I/O, Memory, Processor Core.]
Future Plans or What is Missing
Need for ability to specify timing of signals
Interface with logic synthesis tools to enable better module selection, operator chaining/merging
Time-constrained synthesis
Power Analysis of parallelizing optimizations
More transformations such as loop fusion, range analysis required
Synthesizable C
ANSI-C front end from Edison Design Group (EDG)
Features of C not supported for synthesis
Pointers
However, Arrays and passing by reference are supported
Recursive Function Calls
Gotos
Features for which support has not been implemented
Multi-dimensional arrays
Structs
Continue, Breaks
Hardware component generated for each function
A called function is instantiated as a hardware component in calling function
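As a rough illustration of the restrictions listed above, the following C fragment stays inside them: one-dimensional arrays only, no structs, no explicit pointer arithmetic, no recursion, no goto, and no break or continue. The function names and the array length `LEN` are hypothetical, and whether SPARK accepts this exact code has not been verified here; it is a sketch of the style of input the slide describes.

```c
#define LEN 4

/* Callee: per the slide, a function like this would be generated as
   its own hardware component. */
int abs_diff(int a, int b)
{
    int d;
    if (a > b) {
        d = a - b;
    } else {
        d = b - a;
    }
    return d;
}

/* Caller: uses one-dimensional array arguments and a loop with a
   fixed bound and no break/continue. The call to abs_diff would be
   instantiated as a component inside this function's hardware. */
int sum_abs_diff(int x[LEN], int y[LEN])
{
    int i;
    int sum = 0;
    for (i = 0; i < LEN; i++) {
        sum += abs_diff(x[i], y[i]);
    }
    return sum;
}
```

Writing the conditional as a full if/else with a single return, instead of early returns, keeps the control structure simple for a hierarchical representation with explicit conditional blocks.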
Graph Visualization
[Figure: screenshots of the HTG and DFG visualizations.]
Scheduling
[Figure: screenshot of the Resource Utilization Graph.]
Example of Complex HTG
Example of a real design: MPEG-1 pred2 function
Just for demonstration; you are not expected to read the text
Multiple nested loops and conditionals
Target Applications
Design             # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1       4          2            17                         123
MPEG-1 pred2       11         6            45                         287
MPEG-2 dp_frame    18         4            61                         260
GIMP tiler         11         2            35                         150
Scheduling & Logic Synthesis Results
[Figure: two bar charts, "MPEG-2 DpFrame Function" and "GIMP Tiler Function". Bars show Longest Path (l cyc), Critical Path (c ns), Total Delay (c*l), and Unit Area on a vertical axis from 0 to 1.2. Legend: Non-speculative CMs: Within BBs & Across Hier Blocks; + Pre-Synthesis Transforms; + Speculative Code Motions; + Dynamic CSE. Percentage labels on the bars: 33%, 20%, 1%, 52%, 41%, 14%.]
Overall: 48-76 % improvement in Delay
Almost constant Area
Case Study: Intel Instruction Length Decoder
[Figure: a Stream of Instructions fills the Instruction Buffer; the Instruction Length Decoder marks the First Instruction, Second Instruction, and Third Instruction in the buffer.]
ILD Synthesis: Resulting Architecture
[Figure: Multi-cycle Sequential Architecture -> (Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable) -> Single cycle Parallel Architecture.]
Our toolbox approach enables us to develop a script to synthesize applications from different domains
Final design looks close to the actual implementation done by Intel
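The three steps named on this slide (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy length decoder. Everything concrete here is an assumption made for illustration: the 4-byte buffer, the length rule in `length_of`, and the function names are hypothetical and are not the Intel ILD or SPARK output.

```c
#define BUF 4

/* Hypothetical length rule, for illustration only: low two bits + 1. */
static int length_of(unsigned char byte)
{
    return (byte & 3) + 1;
}

/* Sequential form: each iteration needs `next` from the previous
   iteration, so the bytes are examined one after another. */
void mark_sequential(const unsigned char buf[BUF], int is_start[BUF])
{
    int i;
    int next = 0; /* index at which the next instruction starts */
    for (i = 0; i < BUF; i++) {
        is_start[i] = (i == next);
        if (is_start[i]) {
            next = i + length_of(buf[i]);
        }
    }
}

/* Hand-transformed form: the length at every byte position is computed
   speculatively (whether or not an instruction starts there), the loop
   is fully unrolled, and the index variable disappears. What remains
   is a small network of comparisons with no loop-carried counter. */
void mark_parallel(const unsigned char buf[BUF], int is_start[BUF])
{
    int len0 = length_of(buf[0]); /* speculated lengths */
    int len1 = length_of(buf[1]);
    int len2 = length_of(buf[2]);

    is_start[0] = 1;
    is_start[1] = (len0 == 1);
    is_start[2] = (len0 == 2) || (is_start[1] && len1 == 1);
    is_start[3] = (len0 == 3) || (is_start[1] && len1 == 2)
               || (is_start[2] && len2 == 1);
}
```

Both functions produce the same start markers for any buffer contents. The second trades extra length computations for the removal of the sequential dependence on `next`, which is the kind of trade that lets a multi-cycle description collapse toward a single-cycle one.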