Transcript Topic 4

Embedded Computer Architecture

VLIW architectures: Generating VLIW code

TU/e 5kk73 Henk Corporaal

VLIW lectures overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
  – compiler basics
  – mapping and scheduling
  – TTA code generation
  – Design space exploration
• Hands-on

5/1/2020 Embedded Computer Architecture H. Corporaal, and B. Mesman 2

Compiler basics

• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
  – Loop transformations (separate lecture)

Compiler basics: trajectory

  Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

Library code is brought in by the loader/linker; error messages are produced along the way.

Compiler basics: structure / passes

Source code
  ↓ Lexical analyzer / Parsing
      – token generation
      – check syntax
      – check semantics
      – parse tree generation
  ↓ Intermediate code
  ↓ Code optimization
      – data flow analysis
      – local optimizations
      – global optimizations
  ↓ Code generation
      – code selection
      – peephole optimizations
  ↓ Register allocation
      – making interference graph
      – graph coloring
      – spill code insertion
      – caller / callee save and restore code
  ↓ Sequential code
  ↓ Scheduling and allocation
      – exploiting ILP
  ↓ Object code

Compiler basics: structure

Simple example: from HLL to (sequential) assembly code

  position := initial + rate * 60

Lexical analyzer:

  id1 := id2 + id3 * 60

Syntax analyzer (parse tree):

  :=
  ├─ id1
  └─ +
     ├─ id2
     └─ *
        ├─ id3
        └─ 60

Intermediate code generator:

  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

Code optimizer:

  temp1 := id3 * 60.0
  id1 := id2 + temp1

Code generator:

  movf id3, r2
  mulf #60, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1

Compiler basics: Control flow graph (CFG)

The CFG shows the flow between basic blocks.

C input code:

  if (a > b) { r = a % b; }
  else       { r = b % a; }

CFG:

  BB1:  sub t1, a, b
        bgz t1, 2, 3
  BB2:  rem r, a, b
        goto 4
  BB3:  rem r, b, a
        goto 4
  BB4:  ...

A Program is a collection of Functions; each function is a collection of Basic Blocks; each BB contains a set of Instructions; each instruction consists of several Transports, ...

Compiler basics: Basic optimizations

• Machine independent optimizations
• Machine dependent optimizations

Compiler basics: Basic optimizations

• Machine independent optimizations
  – Common subexpression elimination
  – Constant folding
  – Copy propagation
  – Dead-code elimination
  – Induction variable elimination
  – Strength reduction
  – Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
  – Note: not always allowed (due to limited precision)
• For details check any good compiler book!

Compiler basics: Basic optimizations

• Machine dependent optimization example – what is the optimal implementation of a*34?
  – Use the multiplier: mul Tb, Ta, 34
    • Pro: no thinking required
    • Con: may take many cycles
  – Alternative (34 = 2 + 32, so shift and add):
    – SHL Tb, Ta, 1
    – SHL Tc, Ta, 5
    – ADD Tb, Tb, Tc
    • Pro: may take fewer cycles
    • Cons: uses more registers; additional instructions (I-cache load / code size)
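As a side note (not from the slides), the shift-and-add trick generalizes to any constant: one shift per set bit of the multiplier. A small Python sketch with hypothetical helper names:

```python
def shift_amounts(c):
    """Shift distances for multiplying by constant c: one per set bit,
    so a*c == sum of (a << s).  For c = 34 = 0b100010 this yields
    [1, 5], matching the SHL 1 / SHL 5 / ADD sequence above."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def mul_by_const(a, c):
    # What the generated shift/add sequence computes.
    return sum(a << s for s in shift_amounts(c))
```

Whether this beats a `mul` depends on the number of set bits versus the multiplier latency of the target machine.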


Compiler basics: Register allocation

• Register organization: conventions are needed for parameter passing and for register usage across function calls.

Example register file convention:

  r31–r21: callee-saved registers
  r20–r11: caller-saved registers / other temporaries
  r10–r1 : function argument and result transfer
  r0     : hard-wired 0

Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?

Some definitions:
• A variable is defined at a point in a program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between definitions and uses of the variable.

Register allocation using graph coloring

Example program (each variable runs from its define to its last use):

  a :=
  c :=
  b :=
     := b
  d :=
     := a
     := c
     := d

Live ranges: a spans from its definition to ":= a", b to ":= b", c to ":= c", and d to ":= d". Note that the live ranges of b and d do not overlap.

Register allocation using graph coloring

Interference graph: connect two variables when their live ranges overlap. For the example: a interferes with b, c, and d; c interferes with b and d; b and d do not interfere.

Coloring: a = red, b = green, c = blue, d = green.

The graph needs 3 colors => the program needs 3 registers.

Question: map coloring requires (at most) 4 colors; what is the maximum number of colors (= registers) needed for register interference graph coloring?
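The simplify/select procedure described on the following slides can be sketched in a few lines. This is a minimal illustration, not the course's actual allocator; names and the max-degree spill heuristic are my own choices:

```python
def color_interference_graph(edges, k):
    """Chaitin-style simplify/select: repeatedly remove a node of
    degree < k and push it on a stack; then pop nodes and assign each
    a color distinct from its already-colored neighbors.
    Returns {node: color}, with None meaning the node must be spilled."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    work = {n: set(ns) for n, ns in adj.items()}
    stack = []
    while work:
        n = next((n for n in work if len(work[n]) < k), None)
        if n is None:
            # all remaining nodes have degree >= k: optimistically push
            # a spill candidate (here simply the highest-degree node)
            n = max(work, key=lambda x: len(work[x]))
        stack.append(n)
        for m in work.pop(n):
            work.get(m, set()).discard(n)
    colors = {}
    while stack:                       # select phase: reverse removal order
        n = stack.pop()
        used = {colors[m] for m in adj[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        colors[n] = free[0] if free else None
    return colors
```

For the slide's example graph (edges a-b, a-c, a-d, b-c, c-d) three colors suffice, and b and d share a color, matching the coloring shown above.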


Register allocation using graph coloring

Spill / reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

  a :=
  c :=
  store c        (spill)
  b :=
     := b
  d :=
     := a
  load c         (reload)
     := c
     := d

Spilling c to memory splits its live range, so at most two live ranges overlap at any point.

Register allocation for a monolithic RF

Scheme of the optimistic register allocator:

  Renumber → Build → Spill costs → Simplify → Select → Spill code

The Select phase selects a color (= machine register) for a variable that minimizes the heuristic h:

  h = fdep(col, var) + caller_callee(col, var)

where:
  fdep(col, var): a measure for the introduction of false dependencies
  caller_callee(col, var): the cost of mapping var onto a caller- or callee-saved register


Some explanation of the register allocation phases

[Renumber:] The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.

[Build:] This phase constructs the interference graph.

[Spill costs:] In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.

[Simplify:] This phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.

[Select:] Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph, and given a color distinct from its neighbors. Whenever no color is available for some node, it is left uncolored and the phase continues with the next node.

[Spill code:] In the final phase, spill code is inserted for the live ranges of all uncolored nodes.

Some symbolic registers must be mapped onto a specific machine register (like the stack pointer). These registers get their color in the simplify stage instead of being pushed on the stack.

The other machine registers are divided into caller-saved and callee-saved registers. The allocator computes the caller-saved and callee-saved cost. The caller-saved cost for a symbolic register is computed when it has a live range across a procedure call; the cost per symbolic register is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.

Compiler basics: Code selection

CISC era (before 1985):
– Code size important
– Determine the shortest sequence of code; many options may exist
– Pattern matching. Example (M68020):
    D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
  maps onto the single instruction
    ADD ([10,A1], D2*16, 20), D1

RISC era:
– Performance important
– Only few possible code sequences
– New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020

Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
  – Compiler basics
  – Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework

Mapping / Scheduling = placing operations in space and time

Example code:

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f – e;
  x = z + y;

The corresponding Data Dependence Graph (DDG): inputs a, b, 2, z, y feed the operations; d = a*b feeds both e = a + d and f = 2*b + d, which feed r = f – e; x = z + y is independent of the rest.

How to map these operations?

Architecture constraints:
• One function unit
• All operations have single-cycle latency

With a single FU at most one operation can be issued per cycle, so the schedule length equals the number of operations in the DDG.

How to map these operations?

Architecture constraints:
• One add-sub unit and one mul unit
• All operations have single-cycle latency

Now a multiply and an add/subtract can be issued in the same cycle (Mul: *, * ; Add-sub: +, +, +, –), shortening the schedule.

There are many mapping solutions

The solution space can be plotted as a Pareto graph: each point is one mapping, with cost on one axis and execution time on the other.

A point x is Pareto ⟺ there is no point y for which y_i ≤ x_i for all dimensions i, i.e. no other solution is at least as good in every dimension.

Scheduling: Overview

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
    perform function inlining
  for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
      perform instruction scheduling
  write out the parallel program

Basic Block Scheduling

• Basic block = a piece of code that can only be entered from the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assigning resources and a cycle to every operation
• List scheduling = a heuristic scheduling approach, scheduling the operations one by one
  – Time complexity: O(N), where N is the number of operations
• Optimal scheduling has time complexity O(exp(N))
• Question: what is a good scheduling heuristic?

Basic Block Scheduling

• Make a Data Dependence Graph (DDG)
• Determine the minimal length of the DDG (for the given architecture): the minimal number of cycles needed to schedule the graph, assuming sufficient resources
• Determine:
  – ASAP (As Soon As Possible) cycle = earliest cycle an instruction can be scheduled
  – ALAP (As Late As Possible) cycle = latest cycle an instruction can be scheduled
  – Slack of each operation = ALAP – ASAP
  – Priority of operations = f(slack, #descendants, register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  – Basic block = a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  – Scheduling order is sequential
  – Scheduling priority is determined by the heuristic used, e.g. slack plus other contributions

Basic Block Scheduling: determine ASAP and ALAP cycles

(Figure: a DDG whose nodes (ADD, SUB, NEG, LD, MUL) are annotated with <ASAP cycle, ALAP cycle>; the slack is the difference. For example, an ADD with <1,1> has slack 0 and lies on the critical path, while a LD with <2,4> has slack 2. We assume all operations are single cycle!)
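The ASAP/ALAP/slack computation sketched above can be written down directly. A minimal sketch, assuming unit-latency operations as on the slide; the dictionary encoding of the DDG is my own:

```python
def asap_alap(succ):
    """ASAP/ALAP cycles and slack for a DDG given as {op: [successors]},
    with all operations single cycle (as assumed on the slide)."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    pred = {n: [] for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    # Topological order: visit all predecessors before a node.
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p in pred[n]:
                visit(p)
            order.append(n)
    for n in nodes:
        visit(n)
    asap = {}
    for n in order:                    # earliest start: after all preds
        asap[n] = max((asap[p] + 1 for p in pred[n]), default=1)
    depth = max(asap.values())         # critical path length
    alap = {}
    for n in reversed(order):          # latest start: before all succs
        alap[n] = min((alap[s] - 1 for s in succ.get(n, [])), default=depth)
    slack = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, slack
```

Operations with zero slack form the critical path; the list scheduler on the next slide typically gives them the highest priority.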


Cycle based list scheduling

  proc Schedule(DDG = (V,E))
  beginproc
    ready = { v | ¬∃ (u,v) ∈ E }          /* operations without predecessors */
    ready' = ready
    sched = ∅
    current_cycle = 0
    while sched ≠ V do
      for each v ∈ ready' (select in priority order) do
        if ¬ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched ∪ {v}
        endif
      endfor
      current_cycle = current_cycle + 1
      ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E: u ∈ sched }
      ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
  endproc
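A runnable sketch of the pseudocode above. The FU-count resource model, the priority key function, and the data encoding are simplifying assumptions of mine, not part of the slides:

```python
def list_schedule(ops, edges, delay, num_fus, priority):
    """Cycle-based list scheduling.
    ops: the operations; edges: set of (u, v) dependences;
    delay(u, v): latency of u's result toward v, in cycles;
    num_fus: identical FUs per cycle (the ResourceConfl check);
    priority: sort key, lowest value scheduled first."""
    ops = list(ops)
    preds = {v: [u for u, w in edges if w == v] for v in ops}
    cycle, sched = {}, set()
    t = 0
    while len(sched) < len(ops):
        # ready': all predecessors scheduled and results available at t
        ready = [v for v in ops if v not in sched and
                 all(u in sched and cycle[u] + delay(u, v) <= t
                     for u in preds[v])]
        issued = 0
        for v in sorted(ready, key=priority):
            if issued < num_fus:          # resource conflict check
                cycle[v] = t
                sched.add(v)
                issued += 1
        t += 1
    return cycle
```

For the five-operation DDG of the earlier slides (d feeds e and f, which feed r; x is independent), one unit-latency FU yields a five-cycle schedule, as on the "one function unit" slide.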


Extended Scheduling Scope: look at the CFG

Code:

  A;
  if cond then B else C;
  D;
  if cond then E else F;
  G;

CFG (Control Flow Graph): A branches to B and C, which join at D; D branches to E and F, which join at G.

Q: Why enlarge the scheduling scope?

Extended basic block scheduling: Code Motion

Q: Why move code?

  A: a) add r3, r4, 4
     b) beq ...
  B: c) add r1, r1, r2
  C: d) sub r3, r3, r2
  D: e) mul r1, r1, r3

(A branches to B and C; B and C join at D.)

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A

Possible Scheduling Scopes

• Trace
• Superblock
• Decision tree
• Hyperblock / region

Create and Enlarge Scheduling Scope

(Figure: a CFG with blocks A–G. A trace selects one path, e.g. A–B–D–E–G, while side entries from C and F remain. A superblock is formed from the trace by tail duplication: the join blocks are duplicated (D', E', G') so that the superblock has no side entries.)

Create and Enlarge Scheduling Scope

(Figure: from the same CFG, a decision tree is created by tail-duplicating all join blocks (D', E', F', G', G''), removing every join. A hyperblock / region instead keeps the join points, covering e.g. A–B–C–D–E–F–G.)

Comparing scheduling scopes

                             Trace   Sup.block   Hyp.block   Dec.Tree   Region
  Multiple exc. paths         No       Yes         Yes         Yes        Yes
  Side-entries allowed        Yes      No          No          No         No
  Join points allowed         Yes      No          Yes         No         Yes
  Code motion down joins      No       No          No          No         No
  Must be if-convertible      No       No          Yes         No         No
  Tail dup. before sched.     No       Yes         No          Yes        No

Code movement (upwards) within regions: what to check?

(Figure: an operation, e.g. an add, is moved upwards from its source block to a destination block. Along the way, each intermediate block must be checked: is a copy of the operation needed on another incoming path, and is the operation's result off-live, i.e. live on a path that bypasses the destination?)

Extended basic block scheduling: Code Motion

• A dominates B ⟺ A is always executed before B
  – Consequently: if A does not dominate B, code motion from B to A requires code duplication
• B post-dominates A ⟺ B is always executed after A
  – Consequently: if B does not post-dominate A, code motion from B to A is speculative

(Example CFG with blocks A–F.)

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
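Dominance questions like these can be answered mechanically with the classic iterative dataflow computation. A sketch under my own encoding (block names are placeholders; every non-entry block is assumed reachable); post-dominance is the same computation on the reversed CFG, starting from the exit block:

```python
def dominators(succ, entry):
    """Iterative solution of dom(n) = {n} ∪ ⋂ dom(p) over CFG
    predecessors p.  succ maps each block to its successor list.
    Assumes every non-entry block has at least one predecessor."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    pred = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    dom = {n: set(nodes) for n in nodes}   # start from "everything"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```

"A dominates B" then reads `"A" in dom["B"]`; running this on the slide's CFG (and its reverse) answers Q1–Q4.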


Scheduling: Loops

Loop optimizations (figures):
• Loop peeling: one copy of the loop body (C') is peeled off and executed before the loop, which then executes the remaining iterations.
• Loop unrolling: the loop body is duplicated (C, C', C'') so that each loop iteration executes several original iterations.

Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

(Figure: basic block scheduling, basic block scheduling with unrolling, and software pipelining compared over time.)

Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished, or:
  – Moving operations across the backedge

Example: y = a·x, where each iteration is a load (LD), multiply (ML), and store (ST):
• Sequential: 3 cycles/iteration
• Unrolling (3 times): 5/3 cycles/iteration
• Software pipelining: 1 cycle/iteration, since in steady state the LD of iteration i+2, the ML of iteration i+1, and the ST of iteration i execute in parallel

Software pipelining (cont'd)

Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
  – this is the algorithm most used in commercial compilers
• Kernel recognition techniques
  – unroll the loop, schedule the iterations, identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill the first cycle of the iteration
  – copy this instruction over the backedge

Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

(a) Example loop:

  for (i = 0; i < n; i++)
    A[i+6] = 3*A[i] - 1;

(b) Code (without loop control):

  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

(c) Software pipeline: a new iteration is started every cycle, so in steady state (the kernel) a ld, mul, sub, and st from four different iterations execute together.

• The prologue fills the SW pipeline with iterations
• The epilogue drains the SW pipeline

Software pipelining: determine II, the Initiation Interval

Example: for (i = 0; ...) A[i+6] = 3*A[i] - 1

Cyclic data dependences, with edges labeled (delay, iteration distance):

  ld r1,(r2) → mul r3,r1,3 → sub r4,r3,1 → st r4,(r5)

Each forward edge is a flow dependence (1,0); each backward edge between consecutive operations is a loop-carried dependence (0,1); and the store feeds the load 6 iterations later: st → ld with (1,6).

The schedule must satisfy, for every dependence (u,v):

  cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v)

(Figure: the overlapped issue pattern ld_1 ... ld_7, st_1 illustrating this constraint.)

Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:

  MII = max{ ResMinII, RecMinII }

Resources:

  ResMinII = max over all resources r: ⌈ used(r) / available(r) ⌉

Cycles: summing the constraint cycle(v) ≥ cycle(u) + delay(e) - II·distance(e) around a dependence cycle c gives

  cycle(v) ≥ cycle(v) + Σ_{e ∈ c} ( delay(e) - II·distance(e) )

Therefore, for every cycle c:

  Σ_{e ∈ c} ( delay(e) - II·distance(e) ) ≤ 0

Or:

  RecMinII = min{ II ∈ ℕ | ∀ c ∈ cycles: Σ_{e ∈ c} ( delay(e) - II·distance(e) ) ≤ 0 }
           = max over all cycles c: ⌈ Σ_{e ∈ c} delay(e) / Σ_{e ∈ c} distance(e) ⌉
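Both bounds are straightforward to compute once the resource usage and the dependence cycles are known. A sketch with my own encodings (each cycle is a list of (delay, distance) edge labels):

```python
from math import ceil

def res_min_ii(used, available):
    """ResMinII = max over resources r of ceil(used(r) / available(r))."""
    return max(ceil(used[r] / available[r]) for r in used)

def rec_min_ii(cycles):
    """RecMinII = max over dependence cycles of
    ceil(sum of delays / sum of iteration distances)."""
    return max(ceil(sum(d for d, _ in c) / sum(k for _, k in c))
               for c in cycles)

def mii(used, available, cycles):
    # minimum initiation interval = max of the two lower bounds
    return max(res_min_ii(used, available), rec_min_ii(cycles))
```

For the A[i+6] = 3*A[i] - 1 example, the single recurrence cycle ld → mul → sub → st → ld has delays (1,1,1,1) and distances (0,0,0,6), so RecMinII = ⌈4/6⌉ = 1; with sufficient FUs, ResMinII = 1, hence MII = 1, matching the 1 cycle/iteration achieved earlier.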


Let's go back to: The Role of the Compiler

9 steps are required to translate an HLL program (see the online book chapter):

1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports

Division of responsibilities between hardware and compiler

The boundary between compiler responsibility and hardware responsibility shifts per architecture class. With steps (1) frontend, (2) determine dependencies, (3) binding of operands, (4) scheduling, (5) binding of operations, (6) binding of transports, (7) execute:

  Superscalar:    compiler does (1); hardware does (2)–(7)
  Dataflow:       compiler does (1)–(2); hardware does (3)–(7)
  Multi-threaded: compiler does (1)–(3); hardware does (4)–(7)
  Indep. arch.:   compiler does (1)–(4); hardware does (5)–(7)
  VLIW:           compiler does (1)–(5); hardware does (6)–(7)
  TTA:            compiler does (1)–(6); hardware does (7)

Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework

Mapping applications to processors: the MOVE framework

(Figure: the MOVE framework. User interaction and an optimizer feed back architecture parameters; a parametric compiler produces parallel object code, and a hardware generator produces the TTA-based chip. The resulting solution space is shown as a Pareto curve of execution time versus cost.)

TTA (MOVE) organization

(Figure: function units and register files connected through sockets to a set of transport buses: two load/store units attached to the data memory, two integer ALUs, a float ALU, an integer RF, a float RF, a boolean RF, an instruction unit attached to the instruction memory, and an immediate unit.)

Code generation trajectory for TTAs

• Frontend: GCC or SUIF (adapted)

  Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code

The sequential code can be simulated (sequential simulation, with input/output) to obtain profiling data for the backend; the parallel code is validated by parallel simulation.

Exploration: TTA resource reduction


Exploration: TTA connectivity reduction

(Figure: cycle time and execution time as a function of the number of connections removed; beyond some point the FU stage, not the interconnect, constrains the cycle time.)

Can we do better? How?

• Code transformations
• SFUs: Special Function Units
• Vector processing
• Multiple processors

Transforming the specification (1)

Tree height reduction, based on associativity of the + operation:

  a + (b + c) = (a + b) + c

A chain of dependent additions is rebalanced into a tree of smaller height, exposing parallelism.

Transforming the specification (2)

Original code:

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f – e;
  x = z + y;

Since r = f – e = (2*b + d) – (a + d) = 2*b – a, the code simplifies to:

  r = 2*b – a;    (implemented as r = (b << 1) – a)
  x = z + y;

which needs only three operations: a shift, a subtract, and an add.

Changing the architecture: adding SFUs (special function units)

(Figure: three dependent 2-input additions are replaced by a single 4-input adder SFU.)

Why is this faster?

Changing the architecture: adding SFUs (special function units)

In the extreme case, put everything into one unit: a spatial mapping with no control flow.

However: no flexibility / programmability!! (But one could use FPGAs.)

SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports a whole application domain!
    • coarse grain SFUs would only help certain specific applications
• Which patterns need support?
  – Detection of recurring operation patterns is needed

SFUs: covering results

Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!

Exploration: resulting architecture

(Figure: an architecture for image processing with stream input and stream output, 4 adder-compare FUs, 2 multiplier FUs, 2 diff-add SFUs, 4 RFs, and 9 buses. Note the several SFUs and the reduced connectivity.)

Conclusions

• Billions of embedded processing systems per year
  – How to design these systems quickly, cheaply, correctly, and with low power?
  – What will their processing platform look like?
• VLIWs are very powerful and flexible
  – They can easily be tuned to the application domain
• TTAs are even more flexible, scalable, and lower power

Conclusions

• Compilation for ILP architectures is mature
  – used in commercial compilers
• However, there is a great discrepancy between available and exploitable parallelism
  – Advanced code scheduling techniques are needed to exploit ILP

Bottom line:


Hands-on 1 (2014): how far are you?

• VLIW processor of Silicon Hive (Intel)
• Map your algorithm
• Optimize the mapping
• Optimize the architecture
• Perform DSE (Design Space Exploration), trading off (=> Pareto curves):
  – Performance
  – Energy
  – Area (= cost)