Transcript Topic 4
Embedded Computer Architecture
VLIW architectures: Generating VLIW code
TU/e 5kk73 Henk Corporaal
VLIW lectures overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
  – compiler basics
  – mapping and scheduling
  – TTA code generation
  – Design space exploration
• Hands-on
5/1/2020 Embedded Computer Architecture H. Corporaal, and B. Mesman 2
Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
  – Loop transformations (separate lecture)
Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
(library code is linked in by the loader/linker; every stage can produce error messages)
Compiler basics: structure / passes
Source code
  ↓ Lexical analyzer: token generation
  ↓ Parsing: check syntax, check semantics, parse tree generation
Intermediate code
  ↓ Code optimization: data flow analysis, local optimizations, global optimizations
  ↓ Code generation: code selection, peephole optimizations
  ↓ Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
Sequential code
  ↓ Scheduling and allocation: exploiting ILP
Object code
Compiler basics: structure
Simple example: from HLL to (Sequential) Assembly code
position := initial + rate * 60
Lexical analyzer
id1 := id2 + id3 * 60
Syntax analyzer
(syntax tree for id1 := id2 + id3 * 60)
Intermediate code generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
Code generator
movf id3, r2
mulf #60.0, r2, r2
movf id2, r1
addf r2, r1
movf r1, id1
Compiler basics: Control flow graph (CFG)
A CFG shows the flow between basic blocks.

C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }

CFG:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...

A Program is a collection of Functions, each function is a collection of Basic Blocks, each BB contains a set of Instructions, and each instruction consists of several Transports.
Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations
Compiler basics: Basic optimizations
• Machine independent optimizations
  – Common subexpression elimination
  – Constant folding
  – Copy propagation
  – Dead-code elimination
  – Induction variable elimination
  – Strength reduction
  – Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
    • Note: not always allowed (due to limited precision)
• For details check any good compiler book!
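Two of these machine-independent optimizations can be sketched on a tiny expression IR. This is an illustrative sketch, not the compiler's actual data structures: expressions are nested tuples `(op, left, right)`, and the helper names are invented for the example.

```python
# Minimal sketch of constant folding and strength reduction on a tiny
# tuple-based expression IR (illustrative, not from the lecture).

def fold(expr):
    """Recursively fold constant subexpressions, e.g. ('*', 2, 3) -> 6."""
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    l, r = fold(l), fold(r)
    if isinstance(l, int) and isinstance(r, int):
        return {'+': l + r, '-': l - r, '*': l * r}[op]
    return (op, l, r)

def strength_reduce(expr):
    """Replace multiplication by a power of two with a cheaper shift."""
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    l, r = strength_reduce(l), strength_reduce(r)
    if op == '*' and isinstance(r, int) and r > 0 and r & (r - 1) == 0:
        return ('<<', l, r.bit_length() - 1)
    return (op, l, r)

print(fold(('+', 'a', ('*', 2, 3))))    # ('+', 'a', 6)
print(strength_reduce(('*', 'a', 8)))   # ('<<', 'a', 3)
```

A real compiler performs these rewrites on its intermediate code after data flow analysis; the recursion here stands in for a bottom-up IR walk.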
Compiler basics: Basic optimizations
• Machine dependent optimization example
  – What is the optimal implementation of a * 34?
  – Use a multiplier: mul Tb, Ta, 34
    • Pro: no thinking required
    • Con: may take many cycles
  – Alternative (34 = 2 + 32):
      SHL Tb, Ta, 1
      SHL Tc, Ta, 5
      ADD Tb, Tb, Tc
    • Pro: may take fewer cycles
    • Cons: uses more registers; additional instructions (I-cache load / code size)
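A quick check that the shift/add sequence really computes a*34 (plain Python standing in for the three instructions):

```python
# a*34 = a*2 + a*32 = (a << 1) + (a << 5)
def mul34(a):
    tb = a << 1       # SHL Tb, Ta, 1
    tc = a << 5       # SHL Tc, Ta, 5
    return tb + tc    # ADD Tb, Tb, Tc

assert all(mul34(a) == a * 34 for a in range(-1000, 1000))
```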
Compiler basics: Register allocation
• Register Organization
  – Conventions needed for parameter passing and register usage across function calls

  Example convention for a 32-register file:
    r31–r21: callee saved registers
    r20–r11: caller saved registers / other temporaries
    r10–r1:  function argument and result transfer
    r0:      hard-wired 0
Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
Some definitions:
• A variable is defined at a point in a program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between definitions and uses of the variable.
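For straight-line code these definitions are directly computable. The sketch below uses an illustrative def/use sequence (positions assumed for the example, matching the style of the slide's program) and reports each variable's live range as (first definition, last use):

```python
# Sketch: live ranges for a straight-line program, as (first def, last use).
# The program and position numbering are illustrative.
prog = [("a", "def"), ("c", "def"), ("b", "def"), ("b", "use"),
        ("d", "def"), ("a", "use"), ("c", "use"), ("d", "use")]

def live_ranges(prog):
    ranges = {}
    for pos, (var, kind) in enumerate(prog):
        lo, hi = ranges.get(var, (pos, pos))
        ranges[var] = (min(lo, pos), max(hi, pos))
    return ranges

print(live_ranges(prog))
# a spans positions 0..5, so its live range overlaps those of b, c and d
```

Two variables interfere exactly when their live ranges overlap, which is what the interference graph on the next slides records.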
Register allocation using graph coloring
Example program (define ... use) and the live ranges of a, b, c and d:
  a := ...
  c := ...
  b := ...
  ... := b
  d := ...
  ... := a
  ... := c
  ... := d
Variable a stays live from the first statement until its last use, so its live range overlaps those of b, c and d.
Register allocation using graph coloring
Interference Graph
Coloring: a = red, b = green, c = blue, d = green
The graph needs 3 colors => the program needs 3 registers.
Question: map coloring requires at most 4 colors (four-color theorem); what is the maximum number of colors (= registers) needed for register interference graph coloring?
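A greedy coloring of the slide's interference graph shows the 3-register result. The edge set below is an assumption reconstructed from the coloring on the slide (b and d share green, so they must not interfere):

```python
# Greedy coloring of the example interference graph (edges assumed to
# match the slide's coloring: a=red, b=green, c=blue, d=green).
interf = {"a": {"b", "c", "d"}, "b": {"a", "c"},
          "c": {"a", "b", "d"}, "d": {"a", "c"}}

def greedy_color(graph):
    color = {}
    for node in graph:
        taken = {color[n] for n in graph[node] if n in color}
        color[node] = next(c for c in range(len(graph)) if c not in taken)
    return color

col = greedy_color(interf)
assert all(col[u] != col[v] for u in interf for v in interf[u])
print(col, "->", len(set(col.values())), "registers")   # 3 registers
```

Note that greedy coloring is only a heuristic; finding the minimum number of colors is NP-hard in general, which is why the allocators on the following slides use heuristics too.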
Register allocation using graph coloring
Spill / reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!
  a := ...
  c := ...
  store c        (spill c)
  b := ...
  ... := b
  d := ...
  ... := a
  load c         (reload c)
  ... := c
  ... := d
Spilling c frees a register during the live ranges of b and d.
Register allocation for a monolithic RF
Scheme of the optimistic register allocator
Phases: Renumber → Build → Spill costs → Simplify → Select → Spill code

The Select phase selects a color (= machine register) for a variable such that the heuristic h is minimized:

  h = fdep(col, var) + caller_callee(col, var)
where:
  fdep(col, var): a measure for the introduction of false dependencies
  caller_callee(col, var): cost for mapping var on a caller- or callee-saved register
Some explanation of reg allocation phases
[Renumber:] The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.
[Build:] This phase constructs the interference graph.
[Spill Costs:] In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.
[Simplify:] This phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.
[Select:] Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph, and given a color distinct from its neighbors. Whenever the allocator discovers that no color is available for some node, it leaves the node uncolored and continues with the next node.
[Spill Code:] In the final phase spill code is inserted for the live ranges of all uncolored nodes.
Some symbolic registers must be mapped on a specific machine register (like stack pointer). These registers get their color in the simplify stage instead of being pushed on the stack.
The other machine registers are divided into caller-saved and callee-saved registers. The allocator computes the caller-saved and callee-saved costs.
The caller-saved cost for a symbolic register is computed when it has a live range across a procedure call; the cost per symbolic register is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.
Compiler basics: Code selection
• CISC era (before 1985)
  – Code size important
  – Determine shortest sequence of code
    • Many options may exist
  – Pattern matching
    Example M68020: D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
    maps onto the single instruction:
      ADD ([10,A1], D2*16, 20), D1
• RISC era
  – Performance important
  – Only few possible code sequences
  – New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
  – Compiler basics
  – Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework
Mapping / Scheduling = placing operations in space and time

Example code and its Data Dependence Graph (DDG):
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
The DDG has inputs a, b, z, y and the constant 2, and produces results r and x.
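A DDG for straight-line code like this can be built by tracking which statement last defined each value. The sketch below encodes each statement as (destination, sources); the representation is illustrative:

```python
# Build the flow-dependence edges of the DDG for the slide's fragment:
# an edge (u, v) means statement v uses the value produced by statement u.
code = [("d", ("a", "b")),   # 0: d = a * b
        ("e", ("a", "d")),   # 1: e = a + d
        ("f", ("b", "d")),   # 2: f = 2*b + d
        ("r", ("f", "e")),   # 3: r = f - e
        ("x", ("z", "y"))]   # 4: x = z + y

defs, edges = {}, []
for i, (dst, srcs) in enumerate(code):
    for s in srcs:
        if s in defs:                 # source produced by an earlier statement
            edges.append((defs[s], i))
    defs[dst] = i

print(edges)   # [(0, 1), (0, 2), (2, 3), (1, 3)]
```

Statement 4 (x = z + y) gets no edges: it is independent of the rest and can be scheduled in parallel with it, which is exactly the freedom the scheduler exploits.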
How to map these operations?

Architecture constraints:
• One function unit
• All operations single-cycle latency
With one FU the five operations (*, *, +, +, +) occupy one cycle each: five cycles in total.
How to map these operations?

Architecture constraints:
• One Add-sub and one Mul unit
• All operations single-cycle latency
Now the two multiplies go to the Mul unit and the three additions/subtractions to the Add-sub unit, and the schedule fits in three cycles.
There are many mapping solutions

Pareto graph (solution space): each point is one mapping, plotted as execution time versus cost. Point x is Pareto-optimal if there is no point y that is at least as good as x in every dimension and strictly better in at least one.

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
    perform function inlining
  for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
      perform instruction scheduling
  write out the parallel program

Basic Block Scheduling

• Basic Block = a piece of code that can only be entered at the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assign resources and a cycle to every operation
• List scheduling = heuristic scheduling approach, scheduling the operations one by one
  – Time complexity = O(N), where N is the number of operations
• Optimal scheduling has time complexity O(exp(N))
• Question: what is a good scheduling heuristic?

Basic Block Scheduling (cont'd)

• Make a Data Dependence Graph (DDG)
• Determine the minimal length of the DDG (for the given architecture) = the minimal number of cycles needed to schedule the graph (assuming sufficient resources)
• Determine:
  – ASAP (As Soon As Possible) cycle = earliest cycle in which an operation can be scheduled
  – ALAP (As Late As Possible) cycle = latest cycle in which an operation can be scheduled
  – Slack of each operation = ALAP – ASAP
  – Priority of operations = f(slack, #descendants, register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  – Scheduling order is sequential
  – Scheduling priority is determined by the chosen heuristic, e.g. slack plus other contributions
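ASAP, ALAP and slack follow directly from these definitions: a forward pass over the DDG and a backward pass against it. The graph below is a small illustrative DDG (single-cycle operations, one independent node), not the slide's exact example:

```python
# ASAP/ALAP/slack for a small DDG, assuming single-cycle operations and
# unlimited resources. Nodes are listed in topological order.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (2, 3), (1, 3)]   # node 4 is independent

preds = {n: [u for u, v in edges if v == n] for n in nodes}
succs = {n: [v for u, v in edges if u == n] for n in nodes}

asap = {}
for n in nodes:                           # forward pass
    asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
length = max(asap.values())               # minimal schedule length

alap = {}
for n in reversed(nodes):                 # backward pass
    alap[n] = min((alap[s] for s in succs[n]), default=length + 1) - 1

slack = {n: alap[n] - asap[n] for n in nodes}
print(asap, alap, slack)
```

Nodes on the critical path get slack 0; the independent node can be placed anywhere in the schedule, which its large slack reflects.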
Determine ASAP and ALAP cycles

Example DDG with each operation annotated <ASAP, ALAP> (we assume all operations are single cycle): critical-path operations such as ADD <1,1>, NEG <2,2>, SUB <3,3> and the final ADD <4,4> have slack 0, while off-path operations such as LD <2,3>, ADD <1,3>, LD <2,4> and MUL <1,4> have slack = ALAP – ASAP of up to 3.

Basic Block Scheduling: Cycle-based list scheduling

  proc Schedule(DDG = (V, E))
    ready  = { v | v has no incoming edge in E }
    ready' = ready
    sched  = {}
    current_cycle = 0
    while sched != V do
      for each v in ready' (selected in priority order) do
        if not ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched + { v }
        endif
      endfor
      current_cycle = current_cycle + 1
      ready  = { v | v not in sched and for all (u,v) in E: u in sched }
      ready' = { v | v in ready and for all (u,v) in E: cycle(u) + delay(u,v) <= current_cycle }
    endwhile
  endproc

Extended Scheduling Scope: look at the CFG

Code:
  A;
  if cond then B else C;
  D;
  if cond then E else F;
  G;
This gives a CFG of two diamonds: A → (B | C) → D → (E | F) → G.
Q: Why enlarge the scheduling scope?

Code Motion

Example (operations a–e in basic blocks A–D, where A branches to B and C, which join in D):
  A: a) add r3, r4, 4
     b) beq ...
  B: c) add r1, r1, r2
  C: d) sub r3, r3, r2
  D: e) mul r1, r1, r3
Q: Why move code?
• Downward code motions? a→B, a→C, a→D, c→D, d→D
• Upward code motions? c→A, d→A, e→B, e→C, e→A

Possible scheduling scopes: Trace, Superblock, Decision tree, Hyperblock/region.

Create and enlarge scheduling scope: a trace follows one path through the CFG (e.g. A-B-D-E-G); a superblock additionally removes side-entries by tail duplication (duplicating join blocks such as D and G into D' and G'); a decision tree duplicates even more aggressively (D', E', F', G', G''); a hyperblock/region keeps the join points.
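The cycle-based list scheduling procedure (proc Schedule above) can be sketched as runnable Python. This is a simplified illustration: priority order is plain node order, and the only resource constraint is the number of identical single-cycle function units:

```python
# Runnable sketch of cycle-based list scheduling with n_fus identical FUs.
def list_schedule(nodes, edges, n_fus=1, delay=1):
    preds = {n: {u for u, v in edges if v == n} for n in nodes}
    cycle, sched, t = {}, set(), 0
    while len(sched) < len(nodes):
        # ready': all predecessors scheduled and their results available
        ready = [n for n in nodes if n not in sched
                 and all(p in sched and cycle[p] + delay <= t for p in preds[n])]
        for n in ready[:n_fus]:          # resource constraint: n_fus slots/cycle
            cycle[n] = t
            sched.add(n)
        t += 1
    return cycle

nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (2, 3), (1, 3)]
print(list_schedule(nodes, edges, n_fus=2))   # {0: 0, 4: 0, 1: 1, 2: 1, 3: 2}
```

With one FU the same graph needs five cycles; with two FUs it fits in three, mirroring the two "how to map these operations" slides.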
Comparing scheduling scopes

                           Trace  Sup. block  Hyp. block  Dec. Tree  Region
  Multiple exc. paths       No       Yes         Yes         Yes       Yes
  Side-entries allowed      Yes      No          No          No        No
  Join points allowed       Yes      No          Yes         No        Yes
  Code motion down joins    No       No          No          No        No
  Must be if-convertible    No       No          Yes         No        No
  Tail dup. before sched.   No       Yes         No          Yes       No

Code movement (upwards) within regions: what to check?

When an operation (e.g. an add) moves from its source block up to a destination block, copies may be needed in intermediate blocks, and the moved result must be checked for off-liveness along the other paths.

Extended basic block scheduling: Code Motion

• A dominates B ⟺ A is always executed before B
  – Consequently: if A does not dominate B, code motion from B to A requires code duplication
• B post-dominates A ⟺ B is always executed after A
  – Consequently: if B does not post-dominate A, code motion from B to A is speculative

For the example CFG with blocks A–F:
  Q1: does C dominate E?
  Q2: does C dominate D?
  Q3: does F post-dominate D?
  Q4: does D post-dominate B?

Loop Optimizations

Loop peeling moves the first iteration(s) of the body (C', C'') out in front of the loop; loop unrolling duplicates the body (C, C', C'') inside the loop.

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion
(Compare over time: basic block scheduling, basic block scheduling with unrolling, software pipelining.)

Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished, or:
  – Moving operations across the backedge

Example: y = a·x with a LD–MUL–ST loop body:
  – plain basic block scheduling: 3 cycles/iteration
  – unrolling (3 times): 5/3 cycles/iteration
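The dominance test that decides whether an upward code motion is safe or speculative is itself easy to compute. The CFG edges below are an assumption shaped like the two-diamond example (A → B|C → D → E|F → G), not taken verbatim from the slide:

```python
# Iterative dominator computation for an assumed two-diamond CFG.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}

def dominators(cfg, entry="A"):
    nodes = set(cfg)
    preds = {n: {u for u in cfg if n in cfg[u]} for n in nodes}
    dom = {n: set(nodes) for n in nodes}      # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:                            # iterate to a fixpoint
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

dom = dominators(cfg)
assert "C" not in dom["E"]   # C does not dominate E (E is reachable via B)
assert "A" in dom["D"]       # the entry dominates every block
```

Post-dominators are computed the same way on the reversed CFG, starting from the exit block.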
Software pipelining achieves 1 cycle/iteration in this example.

Scheduling: Loops

Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
  – this algorithm is the one most used in commercial compilers
• Kernel recognition techniques
  – unroll the loop, schedule the iterations, identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill the first cycle of an iteration
  – copy this instruction over the backedge

Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

  for (i = 0; i < n; i++)
    A[i+6] = 3 * A[i] - 1;

Loop body (without loop control):
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

Overlapping successive copies of this body forms the software pipeline:
• the Prologue fills the SW pipeline with iterations
• the Kernel repeats the steady state
• the Epilogue drains the SW pipeline

Cyclic data dependences

Each dependence edge carries a pair (delay, iteration distance): ld → mul → sub → st within one iteration, plus the loop-carried edge st → ld with (delay, distance) = (1, 6), since A[i+6] depends on A[i]. The Initiation Interval II must satisfy, for every edge (u,v):

  cycle(v) >= cycle(u) + delay(u,v) - II · distance(u,v)

Software pipelining: determine II, the Initiation Interval

MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:

  MII = max{ ResMinII, RecMinII }

Resources:

  ResMinII = max over resources r of ⌈ used(r) / available(r) ⌉

Cycles: summing the edge constraint above around a dependence cycle c gives

  0 >= Σ_{e∈c} delay(e) - II · Σ_{e∈c} distance(e)

Therefore:

  RecMinII = min{ II ∈ N | for all cycles c: 0 >= Σ_{e∈c} delay(e) - II · Σ_{e∈c} distance(e) }
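The MII bounds above can be evaluated numerically. The numbers below are assumptions for the example loop: one unit per operation type, every delay 1, and the single loop-carried cycle with total distance 6:

```python
# MII = max(ResMinII, RecMinII) for the example loop (assumed latencies).
from math import ceil

uses  = {"ld": 1, "mul": 1, "sub": 1, "st": 1}   # ops per resource per iteration
avail = {"ld": 1, "mul": 1, "sub": 1, "st": 1}   # units available per resource
res_min_ii = max(ceil(uses[r] / avail[r]) for r in uses)

# One dependence cycle ld->mul->sub->st->ld, as (delay, distance) per edge;
# the st->ld backedge has iteration distance 6 (A[i+6] depends on A[i]).
cycles = [[(1, 0), (1, 0), (1, 0), (1, 6)]]
rec_min_ii = max(ceil(sum(d for d, _ in c) / sum(k for _, k in c))
                 for c in cycles)

mii = max(res_min_ii, rec_min_ii)
print(res_min_ii, rec_min_ii, mii)   # 1 1 1
```

With these assumptions MII = 1, i.e. the recurrence distance of 6 is long enough that one new iteration can start every cycle, consistent with the 1 cycle/iteration result claimed for software pipelining.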
Or:

  RecMinII = max over cycles c of ⌈ Σ_{e∈c} delay(e) / Σ_{e∈c} distance(e) ⌉

Let's go back to: The Role of the Compiler

9 steps required to translate an HLL program (see online book chapter):
  1. Front-end compilation
  2. Determine dependencies
  3. Graph partitioning: make multiple threads (or tasks)
  4. Bind partitions to compute nodes
  5. Bind operands to locations
  6. Bind operations to time slots: Scheduling
  7. Bind operations to functional units
  8. Bind transports to buses
  9. Execute operations and perform transports

Division of responsibilities between hardware and compiler

Going down the list (1) Frontend, (2) Determine dependencies, (3) Binding of operands, (4) Scheduling, (5) Binding of operations, (6) Binding of transports, (7) Execute, the boundary between compiler and hardware responsibility shifts: a Superscalar leaves everything after (1) to hardware; Dataflow after (2); Multi-threaded after (3); Independence architectures after (4); VLIW after (5); TTA after (6), leaving only execution to hardware.

Overview

• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework

Mapping applications to processors: the MOVE framework

User interaction and feedback steer an optimizer that chooses architecture parameters; a parametric compiler produces parallel object code while a hardware generator produces the chip, together forming a TTA based system. The design space is again explored along a Pareto curve of execution time versus cost.

TTA (MOVE) organization

Function units (two load/store units, two integer ALUs, a float ALU), register files (integer RF, float RF, boolean RF), an instruction unit and an immediate unit are connected through sockets to the transport buses, between Data Memory and Instruction Memory.
Code generation trajectory for TTAs

Application (C) → compiler frontend (GCC or SUIF, adapted) → sequential code (with sequential simulation and profiling data) → compiler backend → parallel code (with parallel simulation); input/output is attached to both simulations.

Exploration: TTA resource reduction — removing FUs, RFs or buses trades performance against cost.

Exploration: TTA connectivity reduction — as the number of connections removed grows, cost drops, until eventually the FU stage constrains the cycle time.

Can we do better? How?
• Code transformations
• SFUs: Special Function Units
• Vector processing
• Multiple processors

Transforming the specification (1)

Tree height reduction: based on associativity of the + operation, a + (b + c) = (a + b) + c, a chain of additions can be rebalanced into a tree, shortening the critical path.

Transforming the specification (2)

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

algebraically simplifies to

  r = 2*b - a;    (one shift and one subtract)
  x = z + y;

Changing the architecture: adding SFUs (special function units)

Example: a 4-input adder built from 2-input adders. Why is this faster?

In the extreme case put everything into one unit! Spatial mapping, no control flow. However: no flexibility / programmability!! (though FPGAs could be used)

SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports a whole application domain!
    • coarse grain would only help certain specific applications
• Which patterns need support?
  – Detection of recurring operation patterns needed

SFUs: covering results

Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!!
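The specification transformation above can be verified numerically: substituting d, e and f into r = f - e cancels the a*b terms, leaving r = 2*b - a. A quick randomized check (plain Python, shift standing in for the 2*b):

```python
# Check that the original dataflow and the simplified version agree:
# r = f - e = (2*b + a*b) - (a + a*b) = 2*b - a.
import random

for _ in range(1000):
    a, b = random.randint(-99, 99), random.randint(-99, 99)
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    assert r == (b << 1) - a   # one shift and one subtract
```

This is exactly why the transformed specification needs far less hardware: the multiply disappears entirely from the computation of r.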
Exploration: resulting architecture

Architecture for image processing: stream input and stream output, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, 4 RFs and 9 buses.
• Several SFUs
• Note the reduced connectivity

Conclusions

• Billions of embedded processing systems per year
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can be easily tuned to the application domain
• TTAs are even more flexible, scalable, and lower power

Conclusions (cont'd)

• Compilation for ILP architectures is mature
  – used in commercial compilers
• However
  – there is a great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques are needed to exploit ILP

Hands-on

• HOW FAR ARE YOU?
• VLIW processor of Silicon Hive (Intel)
• Map your algorithm
• Optimize the mapping
• Optimize the architecture
• Perform DSE (Design Space Exploration) trading off (=> Pareto curves)
  – Performance,
  – Energy and
  – Area (= Cost)