SPARK Overview - University of California, San Diego


SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems
University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation & Intel Inc
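
System Level Synthesis
[Figure: system-level design flow. A system-level model goes through task analysis and HW/SW partitioning. The hardware behavioral description is passed to high-level synthesis (targeting ASIC or FPGA); the software behavioral description is passed to a software compiler (targeting a processor core). Memory and I/O complete the platform.]
Copyright Sumit Gupta 2003
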
High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture

    x = a + b;
    c = a < b;
    if (c) then
        d = e - f;
    else
        g = h + i;
    j = d × g;
    l = e + x;

[Figure: the CDFG for this code (x=a+b, c=a<b, an If Node with T branch d=e-f and F branch g=h+i, then j=d×g and l=e+x), mapped to an architecture with Control, Memory, ALU, and Data path.]
- Problem # 1: Poor quality of HLS results beyond straight-line behavioral descriptions
- Problem # 2: Poor/No controllability of the HLS results

High-level Synthesis
- Well-researched area: from early 1980's
  - Renewed interest due to new system level design methodologies
- Large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
  - In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)

Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. C Input, Original CDFG, Source-Level Compiler Transformations, Optimized CDFG, Scheduling & Binding (with Scheduling Compiler & Dynamic Transformations), VHDL Output.]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling

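A minimal sketch of what a speculative code motion does at the source level, reusing the variable names from the earlier `if (c)` example. The function names `before_speculation` and `after_speculation` are hypothetical, and this is hand-written C for illustration, not SPARK output. Both branch operations are computed unconditionally so that they no longer wait on the comparison, and the condition only selects which result is kept.

```c
/* Original form: the subtraction or the addition can only start
   once the condition c is known. Returns d if c holds, else g. */
int before_speculation(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int result;
    if (c) {
        result = e - f;   /* d = e - f */
    } else {
        result = h + i;   /* g = h + i */
    }
    return result;
}

/* Speculated form: both operations execute unconditionally, so they have
   no control dependence on c and may run in parallel with the comparison.
   The condition is reduced to a select (a multiplexer in hardware). */
int after_speculation(int a, int b, int e, int f, int h, int i)
{
    int c = a < b;
    int d_spec = e - f;   /* speculated true-branch operation */
    int g_spec = h + i;   /* speculated false-branch operation */
    return c ? d_spec : g_spec;
}
```

The two functions return the same value for every input; the speculated form trades an extra operation for a shorter dependence chain.
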
SPARK High Level Synthesis Framework

SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
  - Script-based application of transformations, passes, and heuristics: similar to Synopsys Design Compiler
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Resource Binding & Control Synthesis
- Enables Graphical Visualization of Design description and intermediate results (CDFG, DFG, HTG)
- Benchmarked on large set of multimedia & image processing designs
- SPARK System Release available for download
  - User Manual for running tool and changing synthesis scripts
  - Tutorial for the synthesis of a portion of an MPEG player
  - 100,000+ lines of C++ code

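To make the idea of a hierarchical representation concrete, here is a small sketch of a tree-shaped task graph in which compound nodes (conditionals, loops) contain sub-nodes. The type and field names (`htg_node`, `htg_kind`, `MAX_CHILDREN`, `htg_count_ops`) are assumptions made for illustration; they are not taken from SPARK, whose internal representation is written in C++.

```c
#include <stddef.h>

/* Node kinds, assumed for illustration: single nodes hold operations
   directly; IF and LOOP nodes are compound and contain sub-nodes. */
enum htg_kind { HTG_SINGLE, HTG_IF, HTG_LOOP };

#define MAX_CHILDREN 4

struct htg_node {
    enum htg_kind kind;
    int num_ops;                                    /* operations held directly */
    size_t num_children;                            /* used entries in children */
    const struct htg_node *children[MAX_CHILDREN];  /* sub-nodes of a compound node */
};

/* Count the operations in the sub-graph rooted at node by walking the
   hierarchy. Because the structure of conditionals and loops is kept,
   a transformation can reason about a whole block at once. */
int htg_count_ops(const struct htg_node *node)
{
    int total = node->num_ops;
    for (size_t k = 0; k < node->num_children; k++)
        total += htg_count_ops(node->children[k]);
    return total;
}
```

For example, an IF node that holds the condition operation and has two single-operation branches reports three operations in total.
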
PHLS Transformations
Organized into Four Groups
- Pre-Synthesis Source-to-Source Transformations
  - Loop-Invariant Code Motions, Loop Unrolling, CSE
- Scheduling synthesis & compiler transformations
  - Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Shifting (Incremental Loop Pipelining technique)
- Dynamic: Transformations applied dynamically during scheduling
  - Dynamic CSE & Copy Propagation, Dynamic Branch Balancing
- Basic Compiler Transformations
  - Copy Propagation, Dead Code Elimination, Constant Propagation
Application of these transformations is guided by Synthesis Scripts

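As an illustration of two of the pre-synthesis transformations named above, the sketch below shows a loop before and after common sub-expression elimination (CSE) and loop-invariant code motion. The functions `sum_before` and `sum_after` and the constant `N` are hypothetical; the code is a hand-written example, not the output of any tool.

```c
#define N 4

/* Before: a * b is recomputed in every iteration and appears twice. */
int sum_before(const int x[N], int a, int b)
{
    int sum = 0;
    for (int k = 0; k < N; k++) {
        int t1 = a * b + x[k];
        int t2 = a * b - x[k];   /* a * b is a common sub-expression */
        sum += t1 * t2;
    }
    return sum;
}

/* After: CSE shares the single product, and loop-invariant code
   motion hoists it out of the loop because a and b never change. */
int sum_after(const int x[N], int a, int b)
{
    int ab = a * b;              /* computed once, outside the loop */
    int sum = 0;
    for (int k = 0; k < N; k++) {
        int t1 = ab + x[k];
        int t2 = ab - x[k];
        sum += t1 * t2;
    }
    return sum;
}
```

In a synthesis setting the saving is one multiplier-sized operation per iteration rather than a few instructions, which is why such source-level clean-up matters before scheduling.
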
Experiments
- We used SPARK to synthesize designs derived from several industrial designs
  - Example: MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Quantified effects of individual transformations on QOR
  - Pre-synthesis transformations
  - Speculative Code Motions, Loop Pipelining
  - Dynamic Transformations
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area

Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-1 Pred1 Function and MPEG-1 Pred2 Function. Each plots normalized Longest Path (l cycles), Critical Path (c ns), Total Delay (c*l), and Unit Area as transformations are added: non-speculative code motions (within basic blocks and across hierarchical blocks), + Pre-Synthesis Transforms, + Speculative Code Motions, + Dynamic CSE. Percentage labels on the bars: 36%, 39%, 42%, 10%, 8%, 36%.]
- Overall: 63-66 % improvement in Delay
- Almost constant Area

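The Total Delay metric in these charts is the product of the number of cycles on the longest path (l) and the critical path length, i.e. the clock period (c). A small sketch of that arithmetic follows; the helper names `total_delay_ns` and `improvement` are hypothetical, and the numbers used in the note below are made up for illustration, not read off the charts.

```c
/* Total delay through the design: cycles on the longest path (l)
   times the clock period set by the critical path (c, in ns). */
double total_delay_ns(int longest_path_cycles, double critical_path_ns)
{
    return longest_path_cycles * critical_path_ns;
}

/* Fractional improvement of a new delay relative to a baseline;
   0.5 means the new design needs half the time of the baseline. */
double improvement(double baseline_ns, double new_ns)
{
    return (baseline_ns - new_ns) / baseline_ns;
}
```

With an assumed baseline of 20 cycles at 10 ns (200 ns) and a transformed design of 8 cycles at 12.5 ns (100 ns), the total delay improves by 50 % even though the clock period grew by 25 %. The reverse can also happen, which is why cycle counts alone are not a sufficient measure.
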
Example Design: ILD Block from Intel
- Case Study: A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
- Characteristics of Microprocessor functional blocks
  - Low Latency: Single or Dual cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
- Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle design
  - Our toolbox approach enables us to develop a script to synthesize applications from different domains
  - Final design looks close to the actual implementation done by Intel

Key Insights from Project
- Coarse-grain and fine-grain parallelizing transformations and basic compiler transformations are essential to achieving high quality of synthesis results
- Language-level pre-synthesis optimizations are important due to the high level of abstraction of behavioral C
  - Also important for coarse-grain design space exploration
- Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired results
- Transformations from compilers and parallelizing compilers do not directly translate over to synthesis
  - Need to be radically changed with completely different cost models and guiding principles
  - New parallelizing transformations (or transformations that are not useful for compilers) have to be developed for synthesis

Key Insights from Project
- Designers want script-based control over transformations and passes, similar to Synopsys Design Compiler
  - Designer insights can be used to guide transformations, especially coarse-grain code restructuring for design space exploration
- Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay (due to longer critical paths, i.e., clock period)
  - For example, loop unrolling and loop pipelining increase the number of operations in the design and hence resource utilization; in turn, the size of multiplexers and controllers increases
- Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex control flow
  - A hierarchical intermediate representation (Hierarchical Task Graphs, HTGs) is required for retaining control and structural information for efficient coarse-level optimizations
  - Full set of data dependencies (RAW, WAR, WAW) is required for correlating output VHDL and C with input C

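The three dependence kinds mentioned above (read-after-write, write-after-read, write-after-write) can be told apart by comparing the variables two operations read and write. The sketch below does this for simple three-address operations; the names `dep_kind`, `op`, and `classify`, and the priority order when more than one kind applies, are assumptions made for this example.

```c
/* Kinds of data dependence from an earlier operation to a later one. */
enum dep_kind { DEP_NONE, DEP_RAW, DEP_WAR, DEP_WAW };

/* A three-address operation "dst = src1 op src2"; variables are
   identified by single characters, as in the slide example. */
struct op {
    char dst;
    char src1;
    char src2;
};

/* Classify the dependence of later on earlier.
   When several kinds apply, RAW is reported first, then WAW, then WAR. */
enum dep_kind classify(struct op earlier, struct op later)
{
    if (later.src1 == earlier.dst || later.src2 == earlier.dst)
        return DEP_RAW;   /* later reads what earlier wrote */
    if (later.dst == earlier.dst)
        return DEP_WAW;   /* both write the same variable */
    if (later.dst == earlier.src1 || later.dst == earlier.src2)
        return DEP_WAR;   /* later overwrites what earlier read */
    return DEP_NONE;
}
```

For instance, `x = a + b` followed by `l = e + x` is a RAW dependence on `x`, so the second operation cannot be moved above the first.
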
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
    - Possible to be competitive against manually designed circuits
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
    - Enables the designer to find the right synthesis script for different application domains
  - Performance improvements of 60-70 % across a number of designs
  - We have shown its effectiveness on an Intel design

SPARK Release
- Available for download: http://www.cecs.uci.edu/~spark
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of Portion of the Motion Compensation algorithm in MPEG-1 player

Thank You

Ongoing Work: Interface Synthesis Co-Design
Targeting a FPGA Platform
[Figure: co-design flow. C Input (MPEG-1 Pred Block) goes through Execution Profiling and Manual HW/SW Partitioning. The Hardware C Description goes to SPARK High-Level Synthesis and the FPGA; the Software C Description goes to a Software Compiler and the Processor Core. Interface Synthesis, I/O, and Memory complete the FPGA Platform.]
- Developed novel memory mapping algorithm to fit memory elements/application onto FPGA platform

Future Plans or What is Missing
- Need for ability to specify timing of signals
- Interface with logic synthesis tools to enable better module selection, operator chaining/merging
- Time-constrained synthesis
- Power Analysis of parallelizing optimizations
- More transformations such as loop fusion, range analysis required

Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis
  - Pointers
    - However, Arrays and passing by reference are supported
  - Recursive Function Calls
  - Gotos
- Features for which support has not been implemented
  - Multi-dimensional arrays
  - Structs
  - Continue, Breaks
- Hardware component generated for each function
  - A called function is instantiated as a hardware component in calling function

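A sketch of C written to stay inside the restrictions listed above: one-dimensional arrays only, no explicit pointer manipulation, no recursion, gotos, structs, `break`, or `continue`. The function names and the constant `LEN` are hypothetical, and whether SPARK accepts this exact code has not been checked; it only illustrates the coding style the list implies, such as replacing an early `break` with a flag.

```c
#define LEN 8

/* Saturating sum of a one-dimensional array. The loop has a fixed
   bound, and the flag `saturated` stands in for an early break. */
int saturating_sum(int data[LEN], int limit)
{
    int sum = 0;
    int saturated = 0;
    int k;
    for (k = 0; k < LEN; k++) {
        if (!saturated) {
            sum = sum + data[k];
            if (sum >= limit) {
                sum = limit;
                saturated = 1;
            }
        }
    }
    return sum;
}

/* A caller of saturating_sum. Per the slide, a called function would be
   instantiated as a hardware component inside the calling function. */
int scaled_sum(int data[LEN], int limit, int shift)
{
    return saturating_sum(data, limit) >> shift;
}
```
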
Graph Visualization
[Figure: screenshots of the HTG and DFG views.]

Resource Utilization Graph
[Figure: screenshot of resource utilization during scheduling.]

Example of Complex HTG
- Example of a real design: MPEG-1 pred2 function
  - Just for demonstration; you are not expected to read the text
- Multiple nested loops and conditionals

Target Applications

Design            # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1      4          2            17                         123
MPEG-1 pred2      11         6            45                         287
MPEG-2 dp_frame   18         4            61                         260
GIMP tiler        11         2            35                         150

Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-2 DpFrame Function and GIMP Tiler Function. Each plots normalized Longest Path (l cycles), Critical Path (c ns), Total Delay (c*l), and Unit Area as transformations are added: non-speculative code motions (within basic blocks and across hierarchical blocks), + Pre-Synthesis Transforms, + Speculative Code Motions, + Dynamic CSE. Percentage labels on the bars: 33%, 20%, 1%, 52%, 41%, 14%.]
- Overall: 48-76 % improvement in Delay
- Almost constant Area

Case Study: Intel Instruction Length Decoder
[Figure: a stream of instructions enters the Instruction Buffer; the Instruction Length Decoder marks the first, second, and third instructions.]

ILD Synthesis: Resulting Architecture
[Figure: a Multi-cycle Sequential Architecture is turned into a Single cycle Parallel Architecture by three steps: Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable.]
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel

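The three steps named in the figure (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy length decoder. The encoding used here is invented for illustration and is not Intel's: an instruction is assumed to continue into the next byte while bit 7 of the current byte is set. The names `length_sequential`, `length_parallel`, and `BUF` are likewise hypothetical.

```c
#define BUF 4

/* Sequential form: bytes are examined one per loop iteration, which
   corresponds to a multi-cycle design. The flag `done` stops counting
   once a byte without the continuation bit is seen. */
int length_sequential(const unsigned char buf[BUF])
{
    int len = 1;
    int done = 0;
    int k;
    for (k = 0; k < BUF - 1; k++) {
        if (!done) {
            if (buf[k] & 0x80)
                len = len + 1;
            else
                done = 1;
        }
    }
    return len;
}

/* Parallel form: the loop is fully unrolled, the index variable is gone,
   and every byte's continuation test is computed speculatively and
   independently of the others. */
int length_parallel(const unsigned char buf[BUF])
{
    int c0 = (buf[0] & 0x80) != 0;
    int c1 = (buf[1] & 0x80) != 0;
    int c2 = (buf[2] & 0x80) != 0;
    /* A byte extends the instruction only if all earlier bytes did. */
    return 1 + c0 + (c0 & c1) + (c0 & c1 & c2);
}
```

Both functions compute the same length; in the parallel form no test waits on another, so all of them can be evaluated in the same cycle and combined by a small amount of logic.
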