High level synthesis tools


Some Trends in High-level Synthesis Research Tools

Tanguy Risset

Compsys, Lip, ENS-Lyon http://www.ens-lyon.fr/COMPSYS

Outline

• Context: Why High level synthesis?

• HLS hard problems
• Some solutions in existing tools
• Some on-going projects

Context: Embedded Computing Systems design

• SoC or MPSoC for multimedia applications will soon include:
  - Network on chip
  - Dozens of initiators (CPU, DMA, …)
  - Mbytes of code
  - Operating systems
  - Shared memory coherency protocols
  - …
• SoC design problems:
  - Time to market
  - Design space exploration
  - Software complexity

Some envisaged solutions

• Time to market
  - IP re-use
  - High level design
• Design space exploration
  - Fast prototyping and performance evaluation, refinement methodology (specification, algorithm, TLM, CABA)
• Software complexity
  - Tools for embedded code generation/embedded OS
• High level synthesis is only a small part of the « High level Design » process

Definition of High Level Synthesis

• HLS: generates a register-transfer level description from a behavioral specification, in an automatic or semi-automatic way.
• Input:
  - A behavioral specification
  - Design constraints
  - Library of available RTL components
• Output:
  - RTL description
  - Performance evaluations

Refinement : from algorithm to hardware

[Figure: refinement flow from the algorithm domain (algorithmic exploration in Matlab or C), through system application design (SoC intermediate representation, Transaction Level Modeling) and SoC platform design (abstract architecture, virtual prototype, Architecture Description Language), down to IP block design (block specification).]

Abstraction levels for HLS

• AL = Algorithm
  - Prior to HW/SW partition
• TLM = Transaction-Level Model
  - After HW/SW partition
  - Models bit-true behavior, register bank, data transfers, system synchronisation
  - No timing needed
• T-TLM = Timed TLM (also PVT)
  - TLM + timing annotation
  - Refined communication model
• CABA = Cycle Accurate-Bit Accurate
  - Models state at each clock edge
• RT = Register Transfer (ASIC flow entry point)
  - Synthesisable model

Pros and Cons

• « Traditional » motivations:
  - Fast design
  - Safe design: formal refinement approach
  - « Must be used » to cope with Moore’s law
• But!
  - Commercial tools are not here
  - A new tool is a big investment
  - Designers have managed without it

New motivations ?

• IP re-use
  - Slightly change design parameters to re-use an IP
• New target technologies and languages (FPGA, SystemC, etc.)
  - Tools can easily re-target the designs
• CAD tool companies are investing a lot in « high level like » synthesis tools
  - Monet, Behavioral Compiler, VCC, …
• Technological advantage
  - Traditional RTL design will be relocated to Asia

Outline

• Context: Why High level synthesis?

• HLS hard problems
• Some solutions in existing tools
• Some on-going projects

HLS Hard Problems

• Huge design space
  - Complex design space exploration
  - Multi-criteria optimization techniques
• Integration into a design environment
  - Lack of a standard interchange format
  - SoC simulation time is a crucial issue
• Acceptance by the designers
  - Find a language common to SoC designers and tool designers
• Refinement technical problems
  - (detailed hereafter)

HLS technical problems

• Compilation occurs when the target architecture is precisely known
• In HLS, the target architecture is only partially specified. Examples:
  - Data-flow architecture/systolic arrays: pure RTL description
  - FSM + data path: closer to a processor description
• HLS technical problems:
  - Initial specification format/language
  - Specification refinement: fixed point arithmetic
  - Scheduling/Mapping refinement: resource constraints
  - Technological mapping refinement

Initial specification format

• Restrictions on the expressivity of the input language are necessary
• … but designers hate new languages
• C-like languages (Handel-C, Silicon-C, Hardware-C, etc.) are actually hardware description languages
• Main problems:
  - How to express parallelism/sequentiality
    - Data-flow, CSP-like, process network, event-driven
  - How to express both algorithmic and RTL descriptions
  - How much expressivity
    - Dynamic control, loops
  - How to introduce constraints/hints

Fixed point arithmetic

• Problem: translate a floating point computation into a fixed point computation
• Most of the tools start with an initial fixed point specification found by extensive simulation.
• Automatic techniques do not handle loops
• In the case of signal processing applications, signal processing theory can help (the transfer function is used to compute the signal-to-noise ratio).
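As a hedged illustration of the floating point to fixed point translation, the sketch below quantizes values to a 16-bit Q15 format (1 sign bit, 15 fractional bits). The Q15 choice and the names to_q15, from_q15 and mul_q15 are assumptions made for this example; they are not taken from any of the tools presented in these slides.

```c
#include <stdint.h>

/* Number of fractional bits: Q15 is an assumption made for this sketch. */
#define FRAC_BITS 15

/* Convert a real value to Q15, rounding to nearest and saturating
   to the 16-bit range when the value falls outside [-1, 1). */
int16_t to_q15(double v)
{
    double scaled = v * (double)(1 << FRAC_BITS);
    /* Round half away from zero without relying on libm. */
    long r = (long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    if (r > INT16_MAX) r = INT16_MAX;
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* Convert a Q15 value back to a real value. */
double from_q15(int16_t q)
{
    return (double)q / (double)(1 << FRAC_BITS);
}

/* Q15 x Q15 multiplication: the 32-bit product is in Q30,
   so it is rounded and scaled back to Q15, then saturated. */
int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    int32_t half = 1 << (FRAC_BITS - 1);
    p = (p >= 0 ? p + half : p - half) / (1 << FRAC_BITS);
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

With rounding to nearest, the quantization error of a single in-range value is at most half a least significant bit, that is 2^-16 in Q15. Bounding the error accumulated through a loop is the harder part, which is the limitation the slide points at.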

Scheduling/Mapping

• For a « basic block », resource-constrained scheduling is NP-hard, but widely studied.
• Computations
  - Currently, two ways to handle loops:
    - Unroll them
    - Keep them sequential
  - Other solutions:
    - Use software pipelining theory
    - Use the polyhedral model
• Memory and communication
  - Memory mapping is usually strongly guided by the user
    - Highly active research field (Catthoor, Darte)
  - Communication refinement is also an important issue
    - Highly dependent on the chosen computation model (Gajski, Kienhuis)
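Since the exact problem is NP-hard, tools rely on heuristics. The sketch below shows list scheduling, one classical heuristic for resource-constrained scheduling of a basic block. It is only an illustration under simplifying assumptions (unit-latency operators, priority given by operation index, operations supplied in topological order); the types and the function name list_schedule are hypothetical and do not describe how any specific tool works.

```c
enum op_kind { OP_ADD = 0, OP_MUL = 1, OP_KINDS = 2 };

#define MAX_PREDS 2

struct op {
    enum op_kind kind;
    int npreds;
    int preds[MAX_PREDS];   /* indices of the producer operations */
};

/*
 * Resource-constrained list scheduling with unit-latency operators.
 * ops[] must be in topological order (each predecessor has a smaller index).
 * units[k] is the number of available operators of kind k.
 * start[i] receives the cycle in which operation i executes.
 * Returns the schedule length in cycles, or -1 if a needed kind has no unit.
 */
int list_schedule(const struct op *ops, int n,
                  const int units[OP_KINDS], int *start)
{
    int scheduled = 0, cycle = 0;

    for (int i = 0; i < n; i++) {
        if (units[ops[i].kind] <= 0)
            return -1;
        start[i] = -1;                      /* not scheduled yet */
    }
    while (scheduled < n) {
        int used[OP_KINDS] = {0};           /* operators busy in this cycle */
        for (int i = 0; i < n; i++) {
            if (start[i] >= 0)
                continue;                   /* already scheduled */
            if (used[ops[i].kind] >= units[ops[i].kind])
                continue;                   /* no free operator of this kind */
            int ready = 1;
            for (int p = 0; p < ops[i].npreds; p++) {
                int s = start[ops[i].preds[p]];
                if (s < 0 || s >= cycle)
                    ready = 0;              /* operand not produced yet */
            }
            if (ready) {
                start[i] = cycle;
                used[ops[i].kind]++;
                scheduled++;
            }
        }
        cycle++;
    }
    return cycle;
}
```

For the data-flow graph of a fully unrolled 4-tap multiply-accumulate (4 multiplications feeding a chain of 3 additions), this sketch needs 5 cycles with one multiplier and one adder, and 4 cycles with four multipliers and one adder.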

Technological mapping refinement

• Fine technological mappings are very target-dependent
• Predefined libraries are not precise enough
  - Delays on wires
  - Power consumption
• VLSI designers' « tricks » are difficult to integrate in tools
• Sub-micron technology constraints are changing too fast for high level tools
  - Cross talk
  - Capacitance

Outline

• Context: Why High level synthesis?

• HLS hard problems
• Some solutions in existing tools
• Some on-going projects

Some solutions in existing tools

• Digital signal processing circuits:
  - Gaut: http://lester.univ-ubs.fr:8080
  - Source: signal processing (one infinite loop)
  - Target: RTL + FSM
• FSM + data path
  - Ugh: http://www-asim.lip6.fr/recherche/disydent/
  - Source: restricted C
  - Target: FSM + data path
• Regular computation and polyhedral model
  - MMAlpha: http://www.irisa.fr/cosi/ALPHA/
  - Source: functional specification
  - Target: systolic-like architectures

GAUT: Génération Automatique d'Unités de Traitement (automatic generation of processing units)

• Developed first at LASTI (Lannion) and then at LESTER (Lorient): free
• Generates an RTL description from a behavioral description of a signal processing algorithm
• Kernel technology: highly optimized resource-constrained scheduling
• Inputs are
  - a behavioral VHDL description (one process repeated infinitely)
  - libraries of pre-characterized operators
  - some design constraints
• Outputs are
  - a synthesizable RTL VHDL description (data path, memory, and communication units)
  - a Gantt chart for I/O specification

Gaut design flow

[Figure: Gaut design flow. Inputs: behavioral VHDL description, user constraints (latency, clock frequency, operators, allocation, etc.), operator library (.lib), memory and I/O specifications (.mem). Steps: compiling (analysis, loop unrolling, graph generation), then synthesis (selection, scheduling, mapping). Output: RTL description (data path + control, .vhd). Other file extensions shown: .src, .gc.]

Gaut : VHDL Input code

• Sequential instruction in one single process (no clock, no reset, no sensitivity list)

    ENTITY fir IS
      PORT (xn: IN INTEGER;
            yn: OUT INTEGER);
    END fir;

    ARCHITECTURE behavioral OF fir IS
      ...
    BEGIN
      PROCESS
        VARIABLE H, x: vecteur;
        VARIABLE tmp: INTEGER;
        VARIABLE i: CONTROL;
      BEGIN
        tmp := xn * H(0);
        FOR i IN 1 TO N-1 LOOP
          tmp := tmp + x(i) * H(i);
        END LOOP;
        yn <= tmp;
        FOR i IN N-1 DOWNTO 2 LOOP
          x(i) := x(i-1);
        END LOOP;
        x(1) := xn;
        WAIT FOR cadence;
      END PROCESS;
    END behavioral;

Gaut : Input code

• Types
  - Bit, boolean, std_logic, Integer (single size), Bit_Vector, Std_Logic_Vector
  - Arrays (to be inlined)
• Sequential instructions
  - Signal and variable assignments
  - Only one level of if
  - For and While loops (to be inlined)
  - Procedure calls (to be inlined)
  - Function calls corresponding to library elements

Gaut step1: Source code transformation

• Control dependence elimination
  - Loop unrolling
  - Procedure inlining

  Before unrolling:

      y(0) := x(0) * h(0);
      for i in 1 to n-1 loop
        y(i) := y(i-1) + x(i) * h(i);
      end loop;

  After unrolling (shown for n = 4):

      y(0) := x(0) * h(0);
      y(1) := y(1-1) + x(1) * h(1);
      y(2) := y(2-1) + x(2) * h(2);
      y(3) := y(3-1) + x(3) * h(3);

• Static single assignment

  Before:

      b := x + z;
      a := b + c;
      b := e + f;
      y := b;

  After:

      b := x + z;
      a := b + c;
      b0001 := e + f;
      y := b0001;
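The effect of unrolling combined with single assignment can be sketched in C. This is an illustration written for this transcript, not Gaut output: the function names and the fixed tap count TAPS are assumptions, and the accumulation is written on a scalar rather than on the array y used in the slide.

```c
#define TAPS 4   /* fixed at compile time so that the loop can be fully unrolled */

/* Rolled form: the loop introduces a control dependence
   and the variable y is assigned several times. */
int mac_rolled(const int x[TAPS], const int h[TAPS])
{
    int y = x[0] * h[0];
    for (int i = 1; i < TAPS; i++)
        y = y + x[i] * h[i];
    return y;
}

/* Unrolled, single-assignment form: every intermediate value has its own
   name, so the body is a pure data-flow graph (4 multiplications, 3 additions)
   that a scheduler can work on directly. */
int mac_unrolled(const int x[TAPS], const int h[TAPS])
{
    int y0 = x[0] * h[0];
    int y1 = y0 + x[1] * h[1];
    int y2 = y1 + x[2] * h[2];
    int y3 = y2 + x[3] * h[3];
    return y3;
}
```

Both functions compute the same value; only the second exposes every operation and every dependence explicitly, at the price of a code size that grows with the trip count.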

Gaut step1: Source code transformation

• Simple expression generation

      b := x + z * u;

  becomes

      tmp := z * u;
      b := x + tmp;

• Constant propagation
• Generation of the GC graph (data-flow graph format of synchronous programming)

GAUT step 2: Scheduling/Mapping

• In addition to throughput and clock cycle, the user can give:
  - Resource constraints and mapping constraints
  - Memory constraints
  - I/O constraints
  - Optimization type
• The result is an architecture and Gantt charts
  - For computations
  - For I/O
  - For memory

Gaut step 3: memory and communication synthesis

• Optimizing memory layout and minimizing buses

[Figure: generated architecture with I/O, control, communication unit, datapath and memory unit.]

Gaut: summary

• Advantages
  - Advanced development status (still a research tool)
  - User-guided synthesis
  - Open library
  - Active research team: memory optimization, communication synthesis
• Drawbacks
  - Loop flattening (complexity problem)
  - Predefined timing characteristics
  - Hard to get out of 1D signal processing

Ugh: User Guided High Level Synthesis

• Developed at LIP6 (Paris), as part of the Disydent project (Digital System Design Environment): open source
• Behavioral level synthesis tool for control-dominated coprocessors
• Emphasis on precise timing estimation
• Kernel technology: resource-constrained scheduling and (GNU-like) compiler construction technology
• Inputs are
  - a C or VHDL behavioral description with KPN communication primitives
  - a draft data path
  - a cycle time constraint TC
• Outputs are
  - a synthesizable RTL VHDL model
  - a cycle-accurate simulation model

Coprocessor System Environment

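[Figure: coprocessor system environment. Blocks shown: R3000 processor with instruction and data caches, RAM, bus controller unit, PI-BUS, and the coprocessor with its master/slave (M/S) interface.]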

UGH Structure

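[Figure: UGH structure. The Ugh C description and the draft data path feed the coarse grain scheduler (UGH-CGS), which produces a VHDL data path. Synthesis and characterization of this data path depend on the synthesis tool (Synopsys) and its cell library, and yield timing annotations. With the clock constraint CK, the fine grain scheduler (UGH-FGS) then produces the VHDL FSM, giving a VHDL data path + FSM and a CABA simulation model.]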

Input 1 : UGH-C

C description:

    #include <…>

    ugh_inChannel32 work2hcfa;
    ugh_inChannel32 work2hcfb;
    ugh_outChannel32 hcf2work;
    uint32 a, b;

    void hcf(void)
    {
      while (a != b)
        if (a < b)
          b = b - a;
        else
          a = a - b;
    }

    int ugh_main()
    {
      while (1) {
        channelRead(work2hcfa, &a);
        channelRead(work2hcfb, &b);
        hcf();
        channelWrite(hcf2work, &a);
      }
    }

VHDL interface:

    library IEEE;
    use ieee.std_logic_arith.all;

    entity HCF is
      port (CK    : in bit;
            DINA  : in integer;
            READA : out bit;
            ROKA  : in bit;
            DINB  : in integer;
            READB : out bit;
            ROKB  : in bit;
            DOUT  : out integer;
            WRITE : out bit;
            WOK   : in bit);
    end HCF;
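The behavior of the hcf() loop can be checked in plain software before any synthesis. The sketch below is a standalone C model written for this transcript: it replaces the UGH channels and global variables with function parameters, and the names hcf_model and hcf_iterations are hypothetical. It assumes both inputs are nonzero, since the subtraction loop never terminates otherwise.

```c
#include <stdint.h>

/* Highest common factor by repeated subtraction, same loop as hcf().
   Precondition: a != 0 and b != 0, otherwise the loop does not terminate. */
uint32_t hcf_model(uint32_t a, uint32_t b)
{
    while (a != b) {
        if (a < b)
            b = b - a;
        else
            a = a - b;
    }
    return a;
}

/* Number of loop iterations for a given input pair: a rough,
   data-dependent proxy for the latency of a sequential implementation. */
unsigned hcf_iterations(uint32_t a, uint32_t b)
{
    unsigned count = 0;
    while (a != b) {
        if (a < b)
            b = b - a;
        else
            a = a - b;
        count++;
    }
    return count;
}
```

The iteration count depends on the data, which is typical of the control-dominated applications targeted here: the latency cannot be fixed at synthesis time, so the coprocessor has to synchronize with its environment through its channels.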

Input 2 : Draft Data-path

    model Hcf(sofifo hcf2work; sififo work2hcfa, work2hcfb)
    {
      DFFl a, b;
      SUB subst;
      subst.A = a.Q, b.Q;
      subst.B = a.Q, b.Q;
      a.D = subst.S, work2hcfa;
      b.D = subst.S, work2hcfb;
      hcf2work = subst.S;
    }

[Figure: draft data path with registers a and b (ports D and Q) and the subtractor Subst (inputs A and B, output S).]

OUTPUT 1 : Refined Data path

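[Figure: refined data path. Elements shown: registers RegA and RegB (write enables we_ra, we_rb), multiplexers M1 to M4 (selects sel_m1 to sel_m4), subtractor Subst (operation select op_subst), inputs dina and dinb, clock ck, output dout, and the flags inf and zero.]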

OUTPUT 2 : FSM for control

[Figure: control FSM. Labels shown include RESET, START, READY, S1, S2, WHILE and IF, the input flags ROKA, ROKB and WOK, and the output commands READA, READB and WRITE.]

Ugh summary

• Advantages
  - Precise timing information
  - Multi-cycle operations
  - Almost a compiler approach (restricted target architecture)
  - Interfacing (integrated in a SoC design environment)
• Drawbacks
  - Development status (research tool)
  - Low level information given by the user
  - Highly dependent on a commercial tool (Synopsys)
  - Dedicated to control-oriented applications

MMAlpha

• Developed at Irisa (Rennes): open source
• High level synthesis of highly pipelined accelerators
• Kernel technology: polyhedral model and systolic design methodology
• Emphasis on loop transformations
• Input:
  - functional specification (Alpha language)
• Output:
  - RTL description of a systolic-like architecture (Alpha or VHDL)

MMAlpha design flow

[Figure: MMAlpha design flow, from a loop nest (For i=1:1:N, For j=1:1:N) to Alpha, then uniformization, scheduling and RTL derivation, down to VHDL for an FPGA connected to a host through a bus.]

What is polyhedral model?

• Abstracts a loop nest by the polyhedron described by the loop indices during execution of the loop
• Can be used for any index-based structure: memory (arrays), communications (accesses), etc.
• Example: convolution (FIR filter)

  $y(i) = \sum_{n=0}^{N-1} H(n)\, x(i-n)$

      for (i = N; i <= M; i++) {
        y[i] = 0;
        for (n = 0; n <= N-1; n++) {
          y[i] = y[i] + H[n] * x[i-n];
        }
      }
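A runnable version of this loop nest makes the iteration domain concrete. The sketch below is an illustration only: the function names fir and fir_domain_size and the parameter names ntaps and m (standing for N and M) are chosen for this example.

```c
/* Direct form of the FIR loop nest:
   y[i] = sum over n in [0, ntaps-1] of h[n] * x[i-n], for ntaps <= i <= m.
   x and y must hold at least m + 1 elements. */
void fir(const int *h, int ntaps, const int *x, int *y, int m)
{
    for (int i = ntaps; i <= m; i++) {
        y[i] = 0;
        for (int n = 0; n <= ntaps - 1; n++)
            y[i] = y[i] + h[n] * x[i - n];
    }
}

/* Number of integer points in the iteration domain
   { (i, n) | ntaps <= i <= m, 0 <= n <= ntaps - 1 },
   which is also the number of multiply-accumulate operations. */
long fir_domain_size(int ntaps, int m)
{
    if (ntaps <= 0 || m < ntaps)
        return 0;
    return (long)(m - ntaps + 1) * (long)ntaps;
}
```

The domain is a rectangle in the (i, n) plane, described by four affine inequalities whatever the values of N and M. Polyhedral tools manipulate this symbolic description instead of enumerating the points.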

FIR: iteration space

[Figure: iteration space of the FIR loop nest, with axes i and n; labels H(0) to H(N-1) along n, x(N), x(N+1) along i, and outputs y(N), y(N+1).]

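FIR polyhedral representation (MMAlpha input language)

$(i, n) \in \{ N \le i \le M;\ 0 \le n \le N-1 \}: \quad Y[i,n] = Y[i,n-1] + H[n] \cdot x[i-n]$

[Figure: the same iteration space in the (i, n) plane, with labels H(0) to H(N-1), x(N), x(N+1), y(N), y(N+1).]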

MMAlpha polyhedral scheduling

$(i, n) \in \{ N \le i \le M;\ 0 \le n \le N-1 \}: \quad Y[i,n] = Y[i,n-1] + H[n] \cdot x[i-n]$

[Figure: the FIR iteration space in the (i, n) plane with the schedule drawn as successive time steps (t = 4, 5, 6).]

MMAlpha space time transformation

$(t, p) \in \{ p+N \le t \le p+M;\ 0 \le p \le N-1 \}: \quad Y[t,p] = Y[t-1,p-1] + H[p] \cdot x[t-2p]$

[Figure: the iteration space redrawn in the (t, p) plane, with time steps t = 4, 5, 6 along the t axis.]
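The transformed recurrence corresponds to the change of basis t = i + n, p = n applied to the original one: Y[i,n-1] becomes Y[t-1,p-1] and x[i-n] becomes x[t-2p]. The C sketch below checks on a small instance that both forms compute the same outputs. The sizes NT (standing for N) and M, and the function names, are assumptions made for this example; the code is not produced by MMAlpha.

```c
enum { NT = 3, M = 6 };   /* small, hypothetical problem size */

/* Original recurrence: Y[i,n] = Y[i,n-1] + H[n]*x[i-n], with Y[i,-1] = 0.
   The output is y[i] = Y[i,NT-1]. */
void fir_original(const int H[NT], const int x[M + 1], int y[M + 1])
{
    for (int i = NT; i <= M; i++) {
        int acc = 0;                        /* Y[i,-1] */
        for (int n = 0; n <= NT - 1; n++)
            acc = acc + H[n] * x[i - n];    /* Y[i,n] */
        y[i] = acc;
    }
}

/* After the change of basis t = i + n, p = n:
   Y[t,p] = Y[t-1,p-1] + H[p]*x[t-2p], for p+NT <= t <= p+M, 0 <= p <= NT-1.
   The outer loop is time; the inner loop enumerates the processors p,
   which only depend on values computed at time t-1. */
void fir_spacetime(const int H[NT], const int x[M + 1], int y[M + 1])
{
    int Y[M + NT][NT];                      /* indexed [t][p], t <= M+NT-1 */

    for (int t = NT; t <= M + NT - 1; t++) {
        for (int p = 0; p <= NT - 1; p++) {
            if (t < p + NT || t > p + M)
                continue;                   /* (t, p) outside the domain */
            int prev = (p == 0) ? 0 : Y[t - 1][p - 1];
            Y[t][p] = prev + H[p] * x[t - 2 * p];
            if (p == NT - 1)
                y[t - p] = Y[t][p];         /* back to i = t - p */
        }
    }
}
```

In the second form every dependence goes from time t-1 to time t and from processor p-1 to processor p, which is what allows a direct mapping onto a linear array of processors with neighbor-to-neighbor connections.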

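MMAlpha mapping

$(t, p) \in \{ p+N \le t \le p+M;\ 0 \le p \le N-1 \}: \quad Y[t,p] = Y[t-1,p-1] + H[p] \cdot x[t-2p]$

[Figure: the (t, p) iteration space with its mapping onto processors; the labels y, H and x mark the data flows, and 0 marks the initial value.]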

MMAlpha resulting architecture

[Figure: resulting architecture, a linear array of N processors (p = 0 to p = N-1) holding the weights w_0 to w_{N-1}. Signals shown: x(n+D-1), a delay D-1, x(n), d(n), y(n), e(n), x(n-2N+2) and e(n-N+1).]

MMAlpha current features

• Tool box for designers:
  - Powerful analysis tools
  - Pipelining, change of basis, multi-dimensional scheduling, control signal generation
  - Code generation (C, VHDL)
  - Hierarchical design methodology
• Work in progress:
  - Resource-constrained scheduling (extension to Z-polyhedra)
  - Multi-dimensional scheduling and memory synthesis

MMAlpha summary

• Advantages
  - Design tool integrating loop transformations
  - Parameterised design (N, the size of the filter, is not fixed until VHDL generation)
  - Formal approach for refinement (functional to operational)
  - A real language that syntactically captures HLS input restrictions
• Drawbacks
  - Does not yet handle resource constraints
  - A language (Alpha) and design methodology very different from designers' habits
  - Implementation status (research tool)

Some Design results

• Ugh is compared with CoWare and Gaut on an IDCT, but the results are highly dependent upon design parameters

                             Ck period (ns)   #cycles (execution)   Exec time (µs)   Area (mm^2)   Area (#inverter)
    Manual (time optimised)  10.41            118                   1.228            N-A           242.1
    CoWare                   21               1 645                 34.545           19.94         165.6
    Gaut                     17.5             526                   9.2              19            123.5
    Ugh                      17               1 466                 25.922           10.9          70.9

• MMAlpha demonstrates a real implementation on an FPGA co-processor board (DLMS algorithm)

    8-tap DLMS filter   MMAlpha
    Area                2600 slices
    Clock               35 MHz
    Synthesis time      112 s

Outline

• Context: Why High level synthesis?

• HLS hard problems
• Some solutions in existing tools
• Conclusion and on-going projects

HLS conclusion

• HLS tools are not mature enough to produce the famous « C-to-VHDL » magic tool
• Most tool designers agree that a highly « user guided » approach is mandatory
• CAD tool vendors are still actively developing tools (Mentor: Catapult-C, CoWare: Cocentric, …)
• Some progress has been made
  - Domain-specific constraints are more clearly identified (control oriented or data flow)
  - Interfacing is studied together with the synthesis
  - Fast simulation is an important issue addressed by HLS tools

On-going project: Data-Flow IP interface

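• Gaut (Lester) and MMAlpha (Irisa, Lip) are developing a common interface for their IPs (data-flow IPs)

[Figure: data-flow IP interface. Input side: I_FIFO 1, I_FIFO 2 and an IN CTRL block handling the input patterns. Output side: O_FIFO 1, O_FIFO 2 and an OUT CTRL block handling the output patterns.]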

On-going project: SocLib

• SocLib environment
  - Public domain SystemC simulation models for SoC IPs: cycle-accurate hardware simulation, TLM simulation
  - VCI interconnection standard
  - French open academic initiative (should become European through EuroSoc): http://soclib.lip6.fr/
• Typical platform:

[Figure: typical SocLib platform. Four MIPS processors, each with a cache and a VCI interface, are connected through a bus / network on chip (SPIN) to RAM (prog, boot), TTY, ASIC and DMA. The program prog.c is compiled with GCC-MIPS into the MIPS executable prog.]

On-going project: Loop transformation for compilation

• Unified loop nest transformation framework for the optimization of compute/data intensive programs (Alchemy Inria project: http://www-rocq.inria.fr/~acohen/software.html).
• WRaP-IT: an Open64/ORC interface tool

Thanks

• Slides with help from Lester, LIP6
• Here are some tools I did not talk about: Amical, Cathedral, High 2, RapidPath, Flash, A/RT, Compaan, Syndex, Phideo, Bach, SPARK, CriticalBlue, Chinook, SCE, CodeSign, Esterel, precisionC, Polis, Atomium, Ptolemy, Handel C, Cyber, Bridge, MCSE, Madeo, SpecC, and many more…

Any Questions ?
