Compiling for VLIWs and ILP Profiling Region formation Acyclic scheduling

Download Report

Transcript Compiling for VLIWs and ILP Profiling Region formation Acyclic scheduling

Compiling for VLIWs and ILP




Profiling
Region formation
Acyclic scheduling
Cyclic scheduling
1
Profiling




Many crucial ILP optimizations require good
profile information
ILP optimizations try to maximize
performance/price by increasing the IPC
Compiler techniques are needed to expose and
enhance ILP
Two types of profiles: point profiles and path
profiles
2
Compiling with Profiling
3
Point Profiles



“Point profiles” collect
statistics about points in call
graphs and control flow graphs
gprof produces call graph
profiles, statistics on how
many times a function was
called, who called it, and
(sometimes) how much time
was spent in that function
Control flow graph profiles
give statistics on nodes (node
profiles) and edges (edge
profiles)
4
Path Profiles
B1

B2
B3
B5
B4
B6


B7
Path 1 {B1, B2, B3, B5, B7} count = 7
Path 2 {B1, B2, B3, B6, B7} count = 9
Path 3 {B1, B2, B4, B6, B7} count = 123

“Path profiles” measure the
execution frequency of a
sequence of basic blocks on a
path a CFG
A “hot path” is a path that is
(very) frequently executed
Types include forward paths
(no backedges), boundedlength paths (start/stop points),
and whole-program paths
(interprocedural)
The choice is a tradeoff
between accuracy and
efficiency to collect the profile
5
Profile Collection



Instrumentation inserts
instructions to record
edge profiling events

Data collected through code
instrumentation is very
detailed, but instrumentation
overhead affects execution
Hardware counters have very
low overhead but information
is not exhaustive
Interrupt-based sampling
examines machine state in
intervals
Collecting path profiles
requires enumerating the set of
paths encountered during
runtime
6
Profile Bookkeeping





Problem: compiler optimization modifies (instrumented)
code in ways that change the use and applicability of
profile information for later compilation stages
Apply profiling right before profile data is needed
Axiom of profile uniformity: “When one copies a chunk of
a program, one should equally divide the profile
frequency of the original chunk among the copies.”
Use this axiom for point profiles as a simple heuristic
Path profiles correlate branches and thus path-based
compiler optimizations preserve these profiles
7
Instruction Scheduling

Region shape
Acyclic
Basic
block
Superblock
Trace
Cyclic


Instruction scheduling is the
most fundamental ILP-oriented
compilation phase
Responsible for identifying
and grouping operations that
can be executed in parallel
Two approaches:

Cyclic schedulers operate

Acyclic schedulers
DAG
on loops to exploit ILP in
(tight) loop nests usually
without control flow
consider loop-free regions
8
Acyclic Scheduling of Basic
Block Region Shapes

add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9


add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.
bundle
bundle
Region is restricted to
single basic block
Local scheduling of
instructions in a single
basic block is simple
ILP is exposed by
bundling operations into
VLIW instructions
(instruction formation or
instruction compaction)
9
Intermezzo: VLIW Encoding
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.


A VLIW schedule can be
encoded compactly using
horizontal and vertical
nops
Start bits, stop bits, or
instruction templates are
used to compress the
VLIW instructions into
variable-width instruction
bundles
10
Intermezzo: VLIW Execution
Model Subtleties
mov $r1 = 2
;;
mov $r0 = $r1
mov $r1 = 3
;;
ld.w $r0 = 0[$r1]
;;
add $r0 = $r1, $r2
;;
sub $r3 = $r0, $r4
…
# load completed:
add $r3 = $r3, $r0

Horizontal issues within an
instruction:
A read sees the original
value of a register
 A read sees the value
written by the write
 Read and write to same
register is illegal
 Also exception issues


Vertical issues across pipelined
instructions:
EQ model
 LEQ model

EQ model allows $r0 to be reused
between issue of 1st instruction and
its completion when latency expires
11
Acyclic Region Scheduling for
Loops
DO I =
A(I)
ENDDO
DO I =
D(I)
ENDDO
1, N
= C*A(I)
1, N
= A(I)*B(I)
DO I = 1, N
A(I) = C*A(I)
D(I) = A(I)*B(I)
ENDDO

To fulfill the need to
enlarge the region size of
a loop body to expose
more ILP, apply:
Loop fusion
 Loop peeling
 Loop unrolling

DO I = 1, N, 2
A(I) = C*A(I)
D(I) = A(I)*B(I)
A(I+1) = C*A(I+1)
D(I+1) = A(I+1)*B(I+1)
(Assuming 2 divides N)
ENDDO
12
Region Scheduling Across
Basic Blocks

Move operation
from here to there

B3
B6

B4
But operation is now
missing on this path
Region scheduling schedules
operations across basic blocks,
usually on hot paths
Fulfill the need to increase the
region size by merging
operations from block to
expose more ILP
But problem with conditional
flow: how to move operations
from one block to another for
instruction scheduling?
13
Region Scheduling Across
Basic Blocks

Move operation
from here to there
B3

B6
Problem: how to move
operations from one block
to another for instruction
scheduling?
Affected branches need to
be compensated
B4
But operation is now
inserted on this path
14
Trace Scheduling

10
10
B1
B1
70
30
B2
B5
70
30
70
80
B6
B5
70
30
1.
B3
80
20
80
B6
20
80
B4
90
30
B2
B3

10
B4
90
10
2.
Earliest region scheduling
approach has restrictions
A trace consists of a the
operations from a list of basic
blocks B0, B1, …, Bn
Each Bi is a predecessor (falls
through or branches to) the
next Bi+1on the list
For any i and k there is no
path BiBkBi except for
i=0, i.e. the code is cycle free
except that the entire region
can be part of a loop
15
Superblocks

10
10
B1
B1
70
30
B2
70
30
70
B5
30
70
B3’
B3
14
80
1.
30
20
56
6
24
B6
20
10
B4
50.4
5.6
B4’
4.4
2.
3.
20
B4
90
B5
B2
B3
B6

Superblocks are single-entry
multiple-exit traces
Superblock formation uses tail
duplication to to eliminate side
entrances
Each Bi is a predecessor of the
next Bi+1on the list (fall through)
For any i and k there is no path
BiBkBi except for i=0
There are no branches into a
block in the region (no side
entrances), except to B0
39.6
16
Hyperblocks

10
10
B1
B1
70
30
B2

B5
70
B2,B5
30
B3
B3
20
20
B6
80
80
B6
20
20
B4
90
Hyperblocks are single-entry
multiple-exit traces with internal
control flow effectuated via
instruction predication
If-conversion folds flow into
single block using instruction
predication
10
B4
72
8
B4’
2
18
20
17
Intermezzo: Predication
cmpgt $b1 = $r5, 0
;;
br $b1, L1
;;
mpy $r3 = $r1, $r2
;;
L1: stw 0[$r10] = $r3
;;
cmpgt $p1 = $r5, 0
;;
($p1) mpy $r3 = $r1, $r2
;;
stw 0[$r10] = $r3
;;
mpy $r4 = $r1, $r2
;;
cmpgt $b1 = $r5, 0
;;
slct $r3 = $b1, $r4, $r3
;;
stw 0[$r10] = $r3
;;
Original
After full
predication
After partial
prediction




If-conversion translates control
dependences into data
dependences by instruction
predication to conditionally
execute them
Predication requires hardware
support
Full predication adds a
boolean operand to (all or
selected) instructions
Partial predication executes
all instructions, but selects the
final result based on a
condition
18
Treegions

Treegion 1

Treegion 2
Treegions are regions
containing a trees of
blocks such that no block
in a treegion has side
entrances
Any path through a
treegion is a superblock
Treegion 3
19
Region Formation

Region
selection

Region
enlargement

Schedule
construction
The scheduler constructs
schedules for a single region at
a time
Need to select which region to
optimize (within limits of
regions shape), i.e. group
traces of frequently executed
blocks into regions
May need to enlarge regions to
expose enough ILP for
scheduler
20
Region Selection by Trace
Growing



55
A
10
5
5
40
40
B


Trace growing uses the mutual
most likely heuristic:
Suppose A is last block in trace
Add block B to trace if B is
most likely successor of A and
A is B’s most likely
predecessor
Also works to grow backward
Requires edge profiling, but
result can be poor because
edge profiling does not
correlate branch probabilities
21
Region Selection by Path
Profiling

B1
B1

B2
B5
B3
B2
B5
B3
B3’
B6
B6
B4
B4
B4’
Treat trace as a path and
consider its execution
frequency by path profiling
Correlations are preserved
in the region formation
process
path 1: {B1, B2, B3, B4}
path 2: {B1, B2, B3, B6, B4}
path 3: {B1, B5, B3, B4}
path 4: {B1, B5, B3, B6, B4}
count = 44
count = 0
count = 16
count = 12
22
Superblock Enlargement by
Target Expansion

20
80
80
B1
B1
B2
B2
70
B3
B4
90
20
10
B3
B4
B3’
B4’
70
10

Target expansion is useful
when the branch at the
end of a superblock has a
high probability but the
superblock cannot be
enlarged due to a side
entrance
Duplicate sequence of
target blocks to a create
larger superblock
20
23
Superblock Enlargement by
Loop Peeling

10
B1
10
B2
B1
10
B1’

B2
10
0
B2’
10
0
B1”
10
Peel a number of
iterations of a small loop
body to create a larger
superblock that branches
into the loop
Useful when profiled loop
iterations is bounded to a
small constant (two
iterations in the example)
B2”
0
0
24
Superblock Enlargement by
Loop Unrolling

10
B1
10
B2
B1
10
B1’
3.3
B2
90
B2’

Loops with a superblock
body and a backedge with
high probability are called
superblock loops
When a superblock loop
is small we can unroll the
loop
10
B1”
3.3
B2”
30
3.3
25
Exposing ILP After Loop
Unrolling


B1
B2
10
B1’
Split point


Loop unrolling exposes limited
amount of ILP
Cross-iteration dependences on
the loop counter’s updates
prevent parallel execution of
the copies of the loop body
Cannot generally move
instructions across split points
Note: can use speculative
execution to hoist instructions
above split points
26
Exposing ILP with Renaming
and Copy Propagation
27
Schedule Construction

Region
selection
Region
enlargement
Schedule
construction

The schedule constructor
(scheduler) uses compaction
techniques to produce a
schedule for a region after
region formation
The goal is to minimize an
objective cost function while
maintaining program
correctness and obeying
resource limitations:
Increase speed by reducing
completion time
 Reduce code size
 Increase energy efficiency

28
Schedule Construction and
Explicitly Parallel Architectures

add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.

bundle
bundle
A scheduler for an explicitly
parallel architecture such as
VLIW and EPIC uses the
exposed ILP to statically
schedule instructions in
parallel
Instruction compaction must
obey data dependences (RAW,
WAR, and WAW) and control
dependences to ensure
correctness
29
Schedule Construction and
Instruction Latencies

Takes 2 cycles
to complete
Takes 1 cycle
to complete

mul $r3 = $r3, $r1

add $r13 = $r2, $r3
ld.w $r14 = 0[$r5]
add $r13 = $r13, $r14

ld.w $r15 = 0[$r6]
Takes >3 cycles
(4 cycles ave.)
Instruction latencies must be
taken into account by the
scheduler, but they’re not always
fixed or the same for all ops
A scheduler can assume average
or worst-case instruction latencies
Hide instruction latencies by
ensuring that there is sufficient
height between instruction issue
and when result is needed to avoid
pipeline stalls
Also recall the difference between
the EQ versus the LEQ model
RAW hazards
30
Linear Scheduling Techniques
cycle
mul $r3 = $r3, $r1
add $r13 = $r2, $r3
ld.w $r14 = 0[$r5]
add $r13 = $r13, $r14
ld.w $r15 = 0[$r6]
0
2
0
3
1

Instruction compaction using
linear-time scans over region:


mul $r3 = $r3, $r1
ld.w $r14 = 0[$r5]
;;
ld.w $r15 = 0[$r6]
;;
add $r13 = $r2, $r3
;;
add $r13 = $r13, $r14
;;


As-soon-as-possible (ASAP)
scheduling places ops in the
earliest possible cycle using
top-down scan
As-late-as-possible (ALAP)
scheduling places ops in the
latest possible cycle using
bottom-up scan
Critical-path (CP) scheduling
uses ASAP followed by ALAP
Resource hazard detection is local
At most one
load per inst.
31
List Scheduling

for each root r in the PDG sorted by priority do
enqueue(r)
while DRQ is non-empty do
h = dequeue()
schedule(h)
for each DAG successor s of h do
if all predecessors of s have been scheduled then
enqueue(s)

List scheduling schedules
operations from the global region
based on a data dependence
graph (DDG) or program
dependence graph (PDG) which
both have O(n2) complexity
Repeatedly selects an operation
from a data-ready queue (DRQ),
where an operation is ready when
all if its DDG predecessors have
been scheduled
32
Data Dependence Graph

The data dependence
graph (DDG)
Nodes are operations
 Edges are RAW, WAR,
and WAW dependences

33
Control Flow Dependence
34
Compensation Code

X
Scheduler
interchanges
A with B
A
entry

B
C
exit
Y
Entry and/or exit
must be compensated
Compensation code is
needed when operations
are scheduled across basic
blocks in a region
Compensation code
corrects scheduling
changes by duplicating
code on entries and exits
from a scheduled region
35
No Compensation

X
X
A
B
B
A
C
C
Y
Y
No compensation code is
needed when block B
does not have an entry
and exit
36
Join Compensation

X
A
X
B
Z
B
A
B’
C
C
Y
Y
Z

Join compensation is
applied when block B has
an entry
Duplicate block B
37
Split Compensation

X
X
A
B
B
A
A’
C
W
C
Y
W

Split compensation is
applied when block B has
an exit
Duplicate block A
Y
38
Join-Split Compensation

X
X
Z
A
Z
B
B
B’
A
A’
C

Join-split compensation is
applied when block B has
an entry and an exit
Duplicate block A and B
W
W
C
W
Y
Y
39
Resource Management with
Reservation Tables

Cycle
Integer
ALU
0
busy
1
busy
2
busy
3
FP
ALU
MEM
Branch

busy

busy
busy
busy
A resource reservation table
records which resources are
busy per cycle
Reservation tables allow easy
scheduling of operations by
matching the operation’s
required resources to empty
slots
Construction of reservation
table at a join point in the CFG
is constructed by merging busy
slots from both branches
40
Software
Pipelining
DO i = 0, 6
A
B
C
D
E
F
G
H
ENDDO
prologue
kernel
Assuming that the initiation
interval (II) is 3 cycles
epilogue
41
Software Pipelining Example
> 3 cycles
> 2 cycles
>1 cycle
42
Modulo Scheduling
DDG
MRT
43
Constructing Kernel-Only Code
by Predicate Register Rotation
BRT branches to the top and rotates the predicate registers:
p1 = p0, p2 = p1, p3 = p2, p0 = p3
44
Modulo Variable Expansion (1)
45
Modulo Variable Expansion (2)
46