Transcript Topic 4
Embedded Computer Architecture
VLIW architectures: Generating VLIW code
TU/e 5kk73 Henk Corporaal
VLIW lectures overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples – C6 – TM – TTA
• Clustering and Reconfigurable components
• Code generation
  – compiler basics
  – mapping and scheduling
  – TTA code generation
  – Design space exploration
• Hands-on
5/1/2020 Embedded Computer Architecture H. Corporaal, and B. Mesman 2
Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
  – Loop transformations (separate lecture)
Compiler basics: trajectory

Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
(further input: library code; output along the way: error messages)
Compiler basics: structure / passes

Source code
  → Lexical analyzer (token generation)
  → Parsing (check syntax, check semantics, parse tree generation)
  → Intermediate code
  → Code optimization (data flow analysis, local optimizations, global optimizations)
  → Code generation (code selection, peephole optimizations)
  → Register allocation (making interference graph, graph coloring, spill code insertion, caller/callee save and restore code)
  → Sequential code
  → Scheduling and allocation (exploiting ILP)
  → Object code
Compiler basics: structure

Simple example: from HLL to (sequential) assembly code

  position := initial + rate * 60

Lexical analyzer:
  id1 := id2 + id3 * 60

Syntax analyzer:
  builds the parse tree for id1 := id2 + id3 * 60

Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1

Code generator:
  movf id3, r2
  mulf #60, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
Compiler basics: Control flow graph (CFG)

The CFG shows the flow between basic blocks.

C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }

CFG:
  1: sub t1, a, b
     bgz t1, 2, 3
  2: rem r, a, b        3: rem r, b, a
     goto 4                goto 4
  4: ...

A Program is a collection of Functions; each Function is a collection of Basic Blocks; each Basic Block contains a set of Instructions; each instruction consists of several Transports, ...
Compiler basics: Basic optimizations

• Machine independent optimizations
• Machine dependent optimizations
Compiler basics: Basic optimizations

• Machine independent optimizations
  – Common subexpression elimination
  – Constant folding
  – Copy propagation
  – Dead-code elimination
  – Induction variable elimination
  – Strength reduction
  – Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
  – Note: not always allowed (due to limited precision)
• For details check any good compiler book!
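To make one of these concrete, here is a small constant-folding sketch; the tuple-based expression representation is an assumption for illustration only, not something from the lecture:

```python
# Constant folding on a tiny expression tree.
# Nodes are tuples (op, left, right); leaves are numbers or variable names.
import operator

OPS = {'+': operator.add, '*': operator.mul, '-': operator.sub}

def fold(node):
    """Recursively evaluate subtrees whose operands are all constants."""
    if not isinstance(node, tuple):
        return node                      # leaf: constant or variable name
    op, l, r = node
    l, r = fold(l), fold(r)
    if isinstance(l, (int, float)) and isinstance(r, (int, float)):
        return OPS[op](l, r)             # both operands constant: fold now
    return (op, l, r)

# id2 + id3 * (6 * 10)  folds to  id2 + id3 * 60
expr = ('+', 'id2', ('*', 'id3', ('*', 6, 10)))
print(fold(expr))
```

Note the lecture's caveat applies here too: reassociating floating-point expressions this way is not always allowed, due to limited precision.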
Compiler basics: Basic optimizations

• Machine dependent optimization example
  – What's the optimal implementation of a*34?
  – Use the multiplier: mul Tb, Ta, 34
    • Pro: no thinking required
    • Con: may take many cycles
  – Alternative:
      SHL Tb, Ta, 1
      SHL Tc, Ta, 5
      ADD Tb, Tb, Tc
    • Pro: may take fewer cycles
    • Cons: uses more registers; additional instructions (I-cache load / code size)
Compiler basics: Register allocation

• Register organization
  – Conventions needed for parameter passing and register usage across function calls
Example register usage convention:
  r31–r21: callee saved registers
  r20–r11: caller saved registers / other temporaries
  r10–r1:  function argument and result transfer
  r0:      hard-wired 0
Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
Some definitions:
• A variable is defined at a point in the program when a value is assigned to it.
• A variable is used at a point in the program when its value is referenced in an expression.
• The live range of a variable is the execution range between definitions and uses of the variable.
Register allocation using graph coloring
Program (definitions and uses):      Live ranges:
  a :=                                 a: 1–6
  c :=                                 c: 2–7
  b :=                                 b: 3–4
     := b                              d: 5–8
  d :=
     := a
     := c
     := d
Register allocation using graph coloring
Interference Graph
Coloring: a = red, b = green, c = blue, d = green

The graph needs 3 colors => the program needs 3 registers.

Question: map coloring requires (at most) 4 colors; what's the maximum number of colors (= registers) needed for register interference graph coloring?
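The coloring above can be reproduced with a few lines of code; a minimal sketch, with the live ranges read off the slide's example (two variables interfere when their live ranges overlap):

```python
# live ranges as (start, end) line numbers, from the slide's program
live = {"a": (1, 6), "c": (2, 7), "b": (3, 4), "d": (5, 8)}

def interfere(r1, r2):
    """Two live ranges interfere when they overlap."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# build the interference graph
graph = {v: {u for u in live if u != v and interfere(live[v], live[u])}
         for v in live}

# greedy coloring: give each variable the lowest color unused by a neighbor
colors = {}
for v in live:
    taken = {colors[u] for u in graph[v] if u in colors}
    colors[v] = next(c for c in range(len(live)) if c not in taken)

print(max(colors.values()) + 1)  # number of registers needed
```

As on the slide, three colors suffice: b and d have non-overlapping live ranges, so they can share a register. (Note this greedy pass is only the coloring step; a real allocator follows the optimistic scheme with simplify/select and spill code, described later.)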
Register allocation using graph coloring

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

Program with spill code inserted:
  a :=
  c :=
  store c      <- spill
  b :=
     := b
  d :=
     := a
  load c       <- reload
     := c
     := d

(Figure: the live ranges of a, b, c, d; the spill splits c's live range so that at most two ranges overlap at any point.)
Register allocation for a monolithic RF
Scheme of the optimistic register allocator
Phases: Renumber → Build → Spill costs → Simplify → Select (with Spill code insertion when coloring fails)

The Select phase selects a color (= machine register) for a variable that minimizes the heuristic h:

  h = fdep(col, var) + caller_callee(col, var)

where:
  fdep(col, var): a measure for the introduction of false dependencies
  caller_callee(col, var): the cost of mapping var on a caller- or callee-saved register
Some explanation of reg allocation phases
[Renumber:] The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.

[Build:] This phase constructs the interference graph.

[Spill costs:] In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.

[Simplify:] This phase removes nodes with degree < k (k = the number of available registers) in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.

[Select:] Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph, and given a color distinct from its neighbors. Whenever the allocator discovers that no color is available for some node, it leaves the node uncolored and continues with the next node.

[Spill code:] In the final phase, spill code is inserted for the live ranges of all uncolored nodes.
Some symbolic registers must be mapped on a specific machine register (like the stack pointer). These registers get their color in the simplify stage instead of being pushed on the stack.

The other machine registers are divided into caller-saved and callee-saved registers. The allocator computes the caller-saved and callee-saved cost.

The caller-saved cost for a symbolic register is computed when it has a live range across a procedure call; the cost per symbolic register is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.
Compiler basics: Code selection

• CISC era (before 1985)
  – Code size important
  – Determine the shortest sequence of code
    • Many options may exist
  – Pattern matching
    Example M68020: D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
    ADD ([10,A1], D2*16, 20), D1
• RISC era
  – Performance important
  – Only few possible code sequences
  – New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples – C6 – TM – TTA
• Clustering
• Code generation
  – Compiler basics
  – Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework
Mapping / Scheduling = placing operations in space and time

Example code:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

(Figure: the corresponding Data Dependence Graph (DDG), with inputs a, b, 2, z, y and outputs r, x.)
How to map these operations?

Architecture constraints:
• One function unit
• All operations have single cycle latency

(Figure: the DDG operations placed in a schedule table over cycles 1–6; with a single FU, the operations issue one per cycle.)
How to map these operations?

Architecture constraints:
• One add-sub unit and one mul unit
• All operations have single cycle latency

(Figure: schedule table with a Mul row and an Add-sub row over cycles 1–6; the two multiplies go to the Mul unit and the additions/subtractions to the Add-sub unit, so several operations issue per cycle.)
There are many mapping solutions

(Figure: Pareto graph of the solution space, execution time versus cost; each x marks one solution.)

A point x is Pareto if there is no point y for which y_i <= x_i in all dimensions i (with y ≠ x), i.e. no other solution is at least as good in every dimension.
Scheduling: Overview

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
      perform function inlining
  for each procedure do
      transform an irreducible CFG into a reducible CFG
      perform control flow analysis
      perform loop unrolling
      perform data flow analysis
      perform memory reference disambiguation
      perform register allocation
      for each scheduling scope do
          perform instruction scheduling
  write out the parallel program
Basic Block Scheduling

• Basic block = a piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assign resources and a cycle to every operation
• List scheduling = a heuristic scheduling approach that schedules the operations one by one
  – Time complexity = O(N), where N is the number of operations
  – Optimal scheduling has time complexity O(exp(N))
• Question: what is a good scheduling heuristic?
Basic Block Scheduling

• Make a Data Dependence Graph (DDG)
• Determine the minimal length of the DDG (for the given architecture)
  – the minimal number of cycles to schedule the graph (assuming sufficient resources)
• Determine:
  – ASAP (As Soon As Possible) cycle = earliest cycle an instruction can be scheduled
  – ALAP (As Late As Possible) cycle = latest cycle an instruction can be scheduled
  – Slack of each operation = ALAP – ASAP
  – Priority of operations = f(slack, #descendants, register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  – Basic block = a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  – Scheduling order is sequential
  – Scheduling priority is determined by the heuristic used; e.g. slack plus other contributions
Basic Block Scheduling: determine ASAP and ALAP cycles

Each operation is annotated <ASAP cycle, ALAP cycle>; the slack is their difference. We assume all operations are single cycle!

(Figure: a DDG over inputs A, B, C, X, y, z with the critical path ADD <1,1> → SUB <2,2> → NEG <3,3> → ADD <4,4>, and off-critical operations ADD <1,3>, LD <2,3>, MUL <1,4>, LD <2,4> that have slack.)
Cycle based list scheduling

proc Schedule(DDG = (V,E))
beginproc
    ready  = { v | there is no (u,v) in E }        /* operations without predecessors */
    ready' = ready
    sched  = empty set
    current_cycle = 0
    while sched != V do
        for each v in ready' (select in priority order) do
            if not ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched + {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v not in sched and for all (u,v) in E: u in sched }
        ready' = { v | v in ready and for all (u,v) in E: cycle(u) + delay(u,v) <= current_cycle }
    endwhile
endproc
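The pseudocode above can be turned into a compact runnable sketch; the DDG, delays, and the simple counting resource model below are illustrative assumptions:

```python
def list_schedule(ops, edges, delay, num_fus=1, priority=lambda v: 0):
    """Cycle-based list scheduling.
    edges: set of (u, v) dependences; delay[(u, v)]: latency of u seen by v;
    num_fus: how many operations may issue per cycle (resource model)."""
    preds = {v: {u for (u, w) in edges if w == v} for v in ops}
    cycle, sched = {}, set()
    current = 0
    while len(sched) < len(ops):
        # ready': predecessors scheduled and their results available now
        ready = [v for v in ops if v not in sched
                 and all(u in sched and cycle[u] + delay[(u, v)] <= current
                         for u in preds[v])]
        issued = 0
        for v in sorted(ready, key=priority):   # select in priority order
            if issued < num_fus:                # resource conflict check
                cycle[v] = current
                sched.add(v)
                issued += 1
        current += 1
    return cycle

ops = ["mul1", "mul2", "add1", "add2", "sub"]
edges = {("mul1", "add1"), ("mul2", "add2"), ("add1", "sub"), ("add2", "sub")}
delay = {e: 1 for e in edges}
print(list_schedule(ops, edges, delay, num_fus=2))
```

With two FUs, both multiplies issue in cycle 0, both adds in cycle 1, and the subtract in cycle 2. A slack-based `priority` function (slack computed as in the ASAP/ALAP pass) is the usual heuristic plugged in here.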
Extended Scheduling Scope: look at the CFG

Code:
  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;

CFG (Control Flow Graph): A branches to B or C, which join at D; D branches to E or F, which join at G.

Q: Why enlarge the scheduling scope?
Extended basic block scheduling: Code Motion

Q: Why move code?

Example CFG:
  A: a) add r3, r4, 4
     b) beq ...
  B: c) add r1, r1, r2      C: d) sub r3, r3, r2
  D: e) mul r1, r1, r3

• Downward code motions?
  – a→B, a→C, a→D, c→D, d→D
• Upward code motions?
  – c→A, d→A, e→B, e→C, e→A
Possible Scheduling Scopes

Trace, Superblock, Decision tree, Hyperblock/region
Create and Enlarge Scheduling Scope

(Figure: a CFG with blocks A–G. A Trace selects one likely path, e.g. A–B–D–E–G, but still has side entries and join points. A Superblock removes the side entries by tail duplication, creating copies D', E', G' for the off-trace paths.)
Create and Enlarge Scheduling Scope

(Figure: the same CFG turned into a Decision Tree by further tail duplication (creating D', E', F', G', G''), and into a Hyperblock/region, which keeps the join points.)
Comparing scheduling scopes

                          Trace  Superblock  Hyperblock  Dec. Tree  Region
Multiple exec. paths       No      Yes         Yes          Yes       Yes
Side-entries allowed       Yes     No          No           No        No
Join points allowed        Yes     No          Yes          No        Yes
Code motion down joins     No      No          No           No        No
Must be if-convertible     No      No          Yes          No        No
Tail dup. before sched.    No      Yes         No           Yes       No
Code movement (upwards) within regions: what to check?

(Figure: an add operation is moved from a source block up to a destination block, passing intermediate blocks. Legend: copy needed; intermediate block; check for off-liveness; code movement.)
Extended basic block scheduling: Code Motion

• A dominates B <=> A is always executed before B
  – Consequently: if A does not dominate B, code motion from B to A requires code duplication
• B post-dominates A <=> B is always executed after A
  – Consequently: if B does not post-dominate A, code motion from B to A is speculative

(Figure: example CFG with blocks A–F.)

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
Scheduling: Loops

Loop optimizations (figure):
• Loop peeling: the first iteration(s) of the loop body are peeled off and placed before the loop.
• Loop unrolling: the loop body is duplicated inside the loop (C, C', C'').
Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

(Figure: timelines comparing basic block scheduling, basic block scheduling with unrolling, and software pipelining.)
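To make the transformation concrete, here is a 3x unrolling sketch on a hypothetical loop; the unrolled body gives the basic block scheduler three independent multiply-adds per iteration, at the cost of code expansion and an epilogue for leftover iterations:

```python
def saxpy(a, x, y):
    # original loop: one operation chain per iteration
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_unrolled(a, x, y):
    n = len(x)
    out = [0.0] * n
    i = 0
    while i + 3 <= n:          # body unrolled by a factor of 3
        out[i]     = a * x[i]     + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        i += 3
    while i < n:               # epilogue for the remaining iterations
        out[i] = a * x[i] + y[i]
        i += 1
    return out

x, y = [1.0, 2.0, 3.0, 4.0], [0.5] * 4
assert saxpy(2.0, x, y) == saxpy_unrolled(2.0, x, y)
```

This illustrates the first problem listed above: parallelism is only exposed within each group of 3 iterations, not across the back edge, which is what software pipelining addresses.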
Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x, with an LD–ML–ST chain per iteration:
  Sequential:            3 cycles/iteration
  Unrolling (3 times):   5/3 cycles/iteration
  Software pipelining:   1 cycle/iteration
Software pipelining (cont'd)

Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
  – this algorithm is the one most used in commercial compilers
• Kernel recognition techniques
  – unroll the loop, schedule the iterations, identify a repeating pattern
  – Examples: Perfect pipelining (Aiken and Nicolau), URPR (Su, Ding and Xia), Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill the first cycle of the iteration
  – copy this instruction over the backedge
Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

  for (i = 0; i < n; i++)
      A[i+6] = 3*A[i] - 1;
  (a) Example loop

  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
  (b) Code (without loop control)

(c) Software pipeline: copies of the four-instruction body start one initiation interval apart; the overlapping copies form the Prologue, the steady-state Kernel, and the Epilogue.

• The prologue fills the SW pipeline with iterations
• The epilogue drains the SW pipeline
Software pipelining: determine II, the Initiation Interval

For (i=0; ...) A[i+6] = 3*A[i] - 1

Cyclic data dependences, labelled (delay, iteration distance):
  ld r1,(r2) → mul r3,r1,3 → sub r4,r3,1 → st r4,(r5), each edge (1,0),
  with anti-dependence edges (0,1) in the reverse direction,
  and a loop-carried dependence st → ld with (1,6): the store of iteration i feeds the load of iteration i+6.

For every dependence (u,v) the schedule must satisfy:

  cycle(v) >= cycle(u) + delay(u,v) - II . distance(u,v)

(Figure: the resulting schedule, issuing ld_1, ld_2, ... ld_7 in successive cycles with st_1 placed so that the (1,6) recurrence is met.)
Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:

  MII = max{ ResMinII, RecMinII }

Resources:

  ResMinII = max over all resources r of:  ceil( used(r) / available(r) )

Cycles: every dependence e = (u,v) on a cycle c in the dependence graph requires

  cycle(v) >= cycle(u) + delay(e) - II . distance(e)

Summing these constraints around a cycle c, the cycle(.) terms cancel. Therefore:

  RecMinII = min { II in N | for all cycles c: 0 >= sum_{e in c} delay(e) - II . sum_{e in c} distance(e) }

Or:

  RecMinII = max over all cycles c of:  ceil( sum_{e in c} delay(e) / sum_{e in c} distance(e) )
Let's go back to: The Role of the Compiler
9 steps are required to translate an HLL program (see the online book chapter):

1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: Scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports
Division of responsibilities between hardware and compiler

The steps (1) Frontend, (2) Determine Dependencies, (3) Binding of Operands, (4) Scheduling, (5) Binding of Operations, (6) Binding of Transports, (7) Execute divide differently between compiler and hardware per architecture:

  Superscalar:     compiler does (1); hardware does (2)–(7)
  Dataflow:        compiler does (1)–(2); hardware does (3)–(7)
  Multi-threaded:  compiler does (1)–(3); hardware does (4)–(7)
  Indep. arch:     compiler does (1)–(4); hardware does (5)–(7)
  VLIW:            compiler does (1)–(5); hardware does (6)–(7)
  TTA:             compiler does (1)–(6); hardware does (7)
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples – C6 – TM – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework
Mapping applications to processors: the MOVE framework

(Figure: user interaction and feedback drive an Optimizer, which sets architecture parameters; a Parametric compiler produces parallel object code and a Hardware generator produces the chip of the TTA based system; a Pareto curve (cost versus performance) spans the solution space.)
TTA (MOVE) organization

(Figure: function units — load/store units, integer ALUs, a float ALU, an integer RF, a float RF, a boolean RF, an instruction unit and an immediate unit — connect through sockets to the transport buses, between the Data Memory and the Instruction Memory.)
Code generation trajectory for TTAs

Application (C) → Compiler frontend (GCC or SUIF, adapted) → Sequential code → Compiler backend → Parallel code

The sequential code is validated by sequential simulation (with input/output) and profiled; the profiling data steers the backend. The parallel code is validated by parallel simulation.
Exploration: TTA resource reduction
Exploration: TTA connectivity reduction

(Figure: cycle time and cost as a function of the number of connections removed; with 0 connections removed, the FU stage constrains the cycle time.)
Can we do better? How?

• Code Transformations
• SFUs: Special Function Units
• Vector processing
• Multiple Processors
Transforming the specification (1)

Tree height reduction, based on associativity of the + operation: a + (b + c) = (a + b) + c. (Figure: a chain of + operations rebalanced into a tree of + operations.)
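The payoff of tree height reduction is the DDG depth; a small sketch comparing a chain of additions with a balanced tree:

```python
from math import ceil, log2

def tree_sum(vals):
    """Pairwise (balanced-tree) summation: depth ~ log2(n) instead of n-1."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

data = [1, 2, 3, 4, 5, 6, 7]
print(tree_sum(data), sum(data))             # same result, shallower DDG
print(len(data) - 1, ceil(log2(len(data))))  # chain depth vs. tree depth
```

With enough adders, the tree form finishes in 3 dependent steps instead of 6. As the lecture notes earlier, the reassociation is not always allowed for floating point, due to limited precision.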
Transforming the specification (2)

Original:               Transformed:
  d = a * b;              r = 2*b - a;
  e = a + d;              x = z + y;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

(DDGs: the five-operation graph collapses to a shift (for 2*b) and a subtract for r, plus the add for x.)
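The transformation can be checked algebraically: r = f − e = (2b + d) − (a + d) = 2b − a, so the d terms cancel and the multiply a*b is not needed for r at all. A quick exhaustive check:

```python
import random

# verify that the slide's rewrite preserves r for many random inputs
for _ in range(100):
    a, b = random.randint(-50, 50), random.randint(-50, 50)
    d = a * b
    e = a + d
    f = 2 * b + d
    r_orig = f - e
    r_new = (b << 1) - a      # 2*b implemented as a shift
    assert r_orig == r_new
print("transformation verified")
```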
Changing the architecture: adding SFUs (special function units)

(Figure: a tree of three + operations collapsed into a single 4-input adder SFU.)

Why is this faster?
Changing the architecture: adding SFUs (special function units)

In the extreme case, put everything into one unit!
• Spatial mapping – no control flow
• However: no flexibility / programmability!
  – but one could use FPGAs
SFUs: fine grain patterns

• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports the whole application domain!
    • coarse grain SFUs would only help certain specific applications
• Which patterns need support?
  – Detection of recurring operation patterns is needed
SFUs: covering results
Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!
Exploration: resulting architecture

(Figure: an architecture for image processing with stream input and stream output, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, 4 RFs and 9 buses.)

• Several SFUs
• Note the reduced connectivity
Conclusions

• Billions of embedded processing systems per year
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can easily be tuned to the application domain
• TTAs are even more flexible, scalable, and lower power
Conclusions

• Compilation for ILP architectures is mature
  – used in commercial compilers
• However
  – there is a great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques are needed to exploit ILP
Bottom line:
Hands-on 1 (2014)

HOW FAR ARE YOU?

• VLIW processor of Silicon Hive (Intel)
• Map your algorithm
• Optimize the mapping
• Optimize the architecture
• Perform DSE (Design Space Exploration), trading off (=> Pareto curves):
  – Performance,
  – Energy, and
  – Area (= Cost)