Chapter 12 - Software Optimisation

This chapter consists of three parts:
Part 1: Optimisation Methods.
Part 2: Software Pipelining.
Part 3: Multi-cycle Loop Pipelining.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004
Part 1 - Optimisation Methods
Objectives
Introduction to optimisation and optimisation procedure.
Optimisation of C code using the code generation tools.
Optimisation of assembly code.
Introduction
Software optimisation is the process of manipulating software code to achieve two main goals:
Faster execution time.
Smaller code size.
Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.
To implement efficient software, the programmer must be familiar with:
Processor architecture.
Programming language (C, assembly or linear assembly).
The code generation tools (compiler, assembler and linker).
Code Optimisation Procedure
1. Optimise the algorithm.
2. Program in C and compile without any optimisation. If the code is not functioning, make the necessary correction(s).
3. Profile the code. If the result is satisfactory, no further optimisation is required.
4. Otherwise set n = 0 and compile the code with the -On option. If the code is not functioning, make the necessary correction(s). Profile the code; if the result is satisfactory, no further optimisation is required. If not, pass to the next step of optimisation (n = n + 1) and repeat while n < 3.
5. Identify the code functions to be further optimised from the profiling result and use intrinsics. Check that the code is functioning and profile it; if the result is satisfactory, no further optimisation is required.
6. Otherwise convert the code needing optimisation to linear assembly. Check that the code is functioning and profile it; if the result is satisfactory, no further optimisation is required.
7. Otherwise write the code in hand assembly.
Code Optimisation Procedure
Optimising compiler: C source file (.c) -> Parser -> .if -> Optimiser -> .opt -> Code generator -> .asm
Optimising C Compiler Options
The ‘C6x optimising C compiler takes ANSI C source code and can currently reach up to about 80% of the performance of hand-scheduled assembly.
However, to achieve this level of optimisation, knowledge of the different levels of optimisation is essential. Optimisation is performed at different stages and levels.
Assembly Optimisation
To develop an appreciation of how to optimise code, let us optimise an FIR filter:
$$ y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k] $$
For simplicity we write:
$$ y[n] = \sum_{i=0}^{N-1} h[i] \, x[i] \qquad [1] $$
Assembly Optimisation
To implement Equation 1, we need to perform the following steps:
(1) Load the sample x[i].
(2) Load the coefficients h[i].
(3) Multiply x[i] and h[i].
(4) Add (x[i] * h[i]) to the content of an accumulator.
(5) Repeat steps 1 to 4 N-1 times.
(6) Store the value in the accumulator to y.
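The six steps above can be sketched in C before any assembly is written. This is a minimal host-side illustration, not code from the original slides: the function name fir_sum and the choice of a 32-bit int accumulator are assumptions made here.

```c
#include <stddef.h>

/* Sum of products following steps (1) to (6).
 * fir_sum is a hypothetical name; the int accumulator is an assumption. */
int fir_sum(const short *h, const short *x, size_t n_taps)
{
    int acc = 0;                          /* accumulator, cleared before the loop */
    for (size_t i = 0; i < n_taps; i++) { /* (5) repeat for every tap */
        short xi = x[i];                  /* (1) load the sample */
        short hi = h[i];                  /* (2) load the coefficient */
        int prod = (int)xi * (int)hi;     /* (3) multiply */
        acc += prod;                      /* (4) add to the accumulator */
    }
    return acc;                           /* (6) the caller stores this value to y */
}
```

For example, with h = {1, 2, 3} and x = {4, 5, 6} the function returns 1*4 + 2*5 + 3*6 = 32.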
Assembly Optimisation
Steps 1 to 6 can be translated into the following ‘C6x assembly code:
          MVK  .S2  0,B0        ; Initialise the loop counter
          MVK  .S1  0,A5        ; Initialise the accumulator
loop      LDH  .D1  *A8++,A2    ; Load the samples x[i]
          LDH  .D1  *A9++,A3    ; Load the coefficients h[i]
          NOP       4           ; Add "nop 4" because the LDH has a latency of 5
          MPY  .M1  A2,A3,A4    ; Multiply x[i] and h[i]
          NOP                   ; Multiply has a latency of 2 cycles
          ADD  .L1  A4,A5,A5    ; Add x[i]*h[i] to the accumulator
   [B0]   SUB  .L2  B0,1,B0     ; Loop overhead
   [B0]   B    .S1  loop        ; Loop overhead
          NOP       5           ; The branch has a latency of 6 cycles
Assembly Optimisation

In order to optimise the code, we need
to:
(1) Use instructions in parallel.
(2) Remove the NOPs.
(3) Remove the loop overhead (remove SUB
and B: loop unrolling).
(4) Use word access or double-word access
instead of byte or half-word access.
Step 1 - Using Parallel Instructions
Before: one instruction per cycle (16 cycles per iteration).
Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2   NOP
1      ldh
2      ldh
3                                                      nop
4                                                      nop
5                                                      nop
6                                                      nop
7                  mpy
8                                                      nop
9                              add
10                                   sub
11                                         b
12                                                     nop
13                                                     nop
14                                                     nop
15                                                     nop
16                                                     nop
Step 1 - Using Parallel Instructions
After: the two loads are issued in parallel.
Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2   NOP
1      ldh   ldh
2                                                      nop
3                                                      nop
4                                                      nop
5                                                      nop
6                  mpy
7                                                      nop
8                              add
9                                    sub
10                                         b
11                                                     nop
12                                                     nop
13                                                     nop
14                                                     nop
15                                                     nop
Note: Not all instructions can be put in parallel since the result of one unit is used as an input to the following unit.
Step 2 - Removing the NOPs
Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2   NOP
1      ldh   ldh
2                                    sub
3                                          b
4                                                      nop
5                                                      nop
6                  mpy
7                                                      nop
8                              add
The SUB and B instructions are moved into the delay slots of the loads:
loop      LDH  .D1  *A8++,A2
          LDH  .D1  *A9++,A3
   [B0]   SUB  .L2  B0,1,B0
   [B0]   B    .S1  loop
          NOP       2
          MPY  .M1  A2,A3,A4
          NOP
          ADD  .L1  A4,A5,A5
Step 3 - Loop Unrolling
The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead).
Loop with branch overhead:
loop      LDH  .D1   *A8++,A2
          LDH  .D1   *A9++,A3
   [B0]   SUB  .L2   B0,1,B0
   [B0]   B    .S1   loop
          NOP        2
          MPY  .M1   A2,A3,A4
          NOP
          ADD  .L1   A4,A5,A5
Unrolled loop (no SUB, no B):
          LDH  .D1   *A8++,A2    ; Start of iteration 1
   ||     LDH  .D2   *B9++,B3
          NOP        4
          MPY  .M1X  A2,B3,A4    ; Use of cross path
          NOP
          ADD  .L1   A4,A5,A5
          LDH  .D1   *A8++,A2    ; Start of iteration 2
   ||     LDH  .D2   *B9++,B3
          NOP        4
          MPY  .M1X  A2,B3,A4
          NOP
          ADD  .L1   A4,A5,A5
          :
          :
          LDH  .D1   *A8++,A2    ; Start of iteration n
   ||     LDH  .D2   *B9++,B3
          NOP        4
          MPY  .M1X  A2,B3,A4
          NOP
          ADD  .L1   A4,A5,A5
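The same trade-off can be seen at C level. The sketch below is an illustration only, not code from the slides: the name dotp_unrolled4 and the unroll factor of four are assumptions. One loop test and branch now serve four multiply-accumulates, at the cost of a longer loop body.

```c
#include <stddef.h>

/* Dot product unrolled by four (hypothetical name and unroll factor).
 * The second loop handles the 0 to 3 elements left over when n is
 * not a multiple of four. */
int dotp_unrolled4(const short *x, const short *h, size_t n)
{
    int acc = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {      /* one branch per four products */
        acc += x[i]     * h[i];
        acc += x[i + 1] * h[i + 1];
        acc += x[i + 2] * h[i + 2];
        acc += x[i + 3] * h[i + 3];
    }
    for (; i < n; i++) {              /* remainder */
        acc += x[i] * h[i];
    }
    return acc;
}
```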
Step 4 - Word or Double Word Access
The ‘C6711 has two 64-bit data buses for data memory access and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2).
In addition the ‘C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2).
Note: Store can only be up to 32-bit.
Step 4 - Word or Double Word Access
Using word access, MPY and MPYH the previous code can be written as:
loop      LDW   .D1   *A9++,A3    ; 32-bit word is loaded in a single cycle
   ||     LDW   .D2   *B6++,B1
          NOP         4
   [B0]   SUB   .L2   B0,1,B0
   [B0]   B     .S1   loop
          NOP         2
          MPY   .M1X  A3,B1,A4
   ||     MPYH  .M2X  A3,B1,B3
          NOP
          ADD   .L1X  A4,B3,A5
Note: By loading words and using MPY and MPYH instructions the execution time has been halved, since in each iteration two 16x16-bit multiplications are performed.
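A host-side C model can make the word-access idea concrete: one 32-bit load delivers two 16-bit samples, MPY multiplies the two lower halves and MPYH the two upper halves. The helper names pack2, mpy_lo, mpy_hi and dotp_words are hypothetical, and the model keeps two partial sums that are added once at the end, which is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hypothetical helper). */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Models MPY: signed product of the lower 16 bits of each word. */
int32_t mpy_lo(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(uint16_t)(a & 0xFFFFu) *
           (int32_t)(int16_t)(uint16_t)(b & 0xFFFFu);
}

/* Models MPYH: signed product of the upper 16 bits of each word. */
int32_t mpy_hi(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(uint16_t)(a >> 16) *
           (int32_t)(int16_t)(uint16_t)(b >> 16);
}

/* Two products per loaded word pair, so half as many iterations. */
int32_t dotp_words(const uint32_t *x, const uint32_t *h, size_t n_words)
{
    int32_t sum_lo = 0, sum_hi = 0;
    for (size_t i = 0; i < n_words; i++) {
        sum_lo += mpy_lo(x[i], h[i]);
        sum_hi += mpy_hi(x[i], h[i]);
    }
    return sum_lo + sum_hi;
}
```

Packing the samples {1, 2, 3, -4} and coefficients {5, 6, 7, 8} into two words each gives 5 + 12 + 21 - 32 = 6 after only two loop iterations.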
Optimisation Summary
It has been shown that there are four complementary methods for code optimisation:
Using instructions in parallel.
Filling the delay slots with useful code.
Using word or double word load.
These three increase performance and reduce code size.
Loop unrolling.
This increases performance but increases code size.
Part 2 - Software Pipelining
Objectives
Understand software pipelining concepts.
Use software pipelining procedure.
Code the word-wide software pipelined dot-product routine.
Determine if your pipelined code is more efficient with or without prolog and epilog.
Why Use Software Pipelining (SP)?
SP creates highly optimised loop code by:
Putting several instructions in parallel.
Filling delay slots with useful code.
Maximising the use of functional units.
SP is implemented by simply using the tools:
Compiler options -o2 or -o3.
Assembly Optimizer for linear assembly (.sa) files.
Software Pipeline Concept
To explain the concept of software pipelining, we will assume that all instructions execute in one cycle.
          LDH
   ||     LDH
          MPY
          ADD
How many cycles would it take to perform this loop 5 times? (Disregard delay slots.)
5 x 3 = 15 cycles
Let’s examine hardware (functional units) usage ...
Non-Pipelined Code
Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
1      ldh   ldh
2                  mpy
3                              add
4      ldh   ldh
5                  mpy
6                              add
7      ldh   ldh
8                  mpy
9                              add
Pipelining Code
Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
1      ldh   ldh
2      ldh   ldh   mpy
3      ldh   ldh   mpy         add
4      ldh   ldh   mpy         add
5      ldh   ldh   mpy         add
6                  mpy         add
7                              add
Pipelining these instructions takes only 7 cycles, about half of the 15 cycles needed by the non-pipelined code!
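The two cycle counts follow from simple arithmetic, sketched below under the same simplification used on these slides (every instruction takes one cycle and a new iteration can start every cycle). The function names are hypothetical.

```c
/* Cycle counts for a loop whose body has `stages` dependent steps
 * (here: load, multiply, add), assuming one cycle per step. */

/* Non-pipelined: each iteration runs to completion before the next starts. */
unsigned cycles_serial(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Pipelined: a new iteration starts every cycle, and the last one
 * needs (stages - 1) further cycles to drain. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    if (iterations == 0 || stages == 0) {
        return 0;
    }
    return iterations + stages - 1;
}
```

With 5 iterations and 3 stages this gives 5 x 3 = 15 cycles without pipelining and 5 + 3 - 1 = 7 cycles with it.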
Pipelining Code
             Cycle  .D1   .D2   .M1   .L1
Prolog       1      ldh   ldh                 Staging for loop.
             2      ldh   ldh   mpy
Loop Kernel  3      ldh   ldh   mpy   add     Single-cycle "loop" iterated three times.
             4      ldh   ldh   mpy   add
             5      ldh   ldh   mpy   add
Epilog       6                  mpy   add     Completing final operations.
             7                        add
Pipelined Code
prolog:   LDH           ; load 1
   ||     LDH
          MPY           ; mpy 1
   ||     LDH           ; load 2
   ||     LDH
loop:     ADD           ; add 1
   ||     MPY           ; mpy 2
   ||     LDH           ; load 3
   ||     LDH
          ADD           ; add 2
   ||     MPY           ; mpy 3
   ||     LDH           ; load 4
   ||     LDH
          .
          .
Software Pipelining Procedure
1. Write algorithm in C code & verify.
2. Write ‘C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to ‘C6x code.
Software Pipelining Example (Step 1)
short DotP(short *m, short *n, short count)
{
    int i;
    short product;
    short sum = 0;
    for (i=0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}
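DotP keeps both the product and the sum in a short, so any product or running total outside the 16-bit range is truncated. When verifying the algorithm on a host it can help to compare against a wider reference. The variant below is a sketch for that purpose only and is not part of the original material; the name dotp_wide and the int accumulator are assumptions.

```c
/* Reference dot product with a 32-bit accumulator (hypothetical helper
 * for host-side verification; not the slide's DotP routine). */
int dotp_wide(const short *m, const short *n, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += m[i] * n[i];   /* operands are promoted to int before the multiply */
    }
    return sum;
}
```

For small inputs such as {1, 2, 3} and {4, 5, 6} both routines should agree (32); inputs such as {300, 300} and {300, 300} give 180000, which no longer fits in a short.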
Write code in Linear Assembly (Step 2)
; for (i=0; i < count; i++)
;   prod = m[i] * n[i];
;   sum += prod;
loop:     ldh   *p_m++, m
          ldh   *p_n++, n
          mpy   m, n, prod
          add   prod, sum, sum
[count]   sub   count, 1, count
[count]   b     loop
1. No NOPs required.
2. No parallel instructions required.
3. You don’t have to specify functional units or registers.
Dependency Graph Terminology
[Figure: dependency graph terminology. Parent nodes LDH (a) and LDH (b) on .D units, each with a path of latency 5 leading to a child node NOT (na) on a .L unit. Labels: Parent Node, Child Node, Path, Conditional Path.]
Dependency Graph Steps
(a) Draw the algorithm nodes and paths.
(b) Write the number of cycles it takes for each instruction to complete execution.
(c) Assign "required" functional units to each node.
(d) Partition the nodes to A and B sides and assign sides to all functional units.
Dependency Graph (Step a)
In this step each instruction is represented by a node.
The node is represented by a circle, where:
Outside: write instruction.
Inside: register where result is written.
Nodes are then connected by paths showing the data flow.
Note: Conditional paths are represented by dashed lines.
Dependency Graph (Step a)
[Figure: dependency graph built up node by node. LDH (m) and LDH (n) feed MPY (prod); MPY feeds ADD (sum), which also feeds back into itself; SUB (count) feeds itself and B (loop).]
Dependency Graph (Step b)
In this step the number of cycles it takes for each instruction to complete execution is added to the dependency graph.
It is written along the associated data path.
Dependency Graph (Step b)
[Figure: the same dependency graph with latencies written along the paths. LDH: 5 cycles, MPY: 2 cycles, ADD: 1 cycle, SUB: 1 cycle, B: 6 cycles.]
Dependency Graph (Step c)
In this step functional units are assigned to each node.
It is advantageous to start allocating units to instructions which require a specific unit:
Load/Store.
Branch.
We do not need to be concerned with multiply as this is the only operation that the .M unit performs.
Note: The side is not allocated at this stage.
Dependency Graph (Step c)
[Figure: the dependency graph with units assigned to the nodes that require a specific unit. LDH (m): .D, LDH (n): .D, MPY (prod): .M, B (loop): .S. ADD and SUB are not yet assigned.]
Dependency Graph (Step d)
The data path is partitioned into side A and B at this stage.
To optimise code we need to ensure that a maximum number of units are used with a minimum number of cross paths.
To make the partition visible on the dependency graph a line is used.
The side can then be added to the functional units associated with each instruction or node.
Dependency Graph (Step d)
[Figure: the dependency graph split by a line into an A side and a B side. A side: LDH (m) on .D1, MPY (prod) on .M1x, ADD (sum) on .L1. B side: LDH (n) on .D2, SUB (count) on .L2, B (loop) on .S2.]
Step 4 - Allocate Functional Units
Do we have enough functional units to code this algorithm in a single-cycle loop?
A side                   B side
.L1   sum   (ADD)        .L2   count (SUB)
.M1   prod  (MPY)        .M2
.D1   m     (LDH)        .D2   n     (LDH)
.S1                      .S2   loop  (B)
x1    .M1x               x2
Step 4 - Allocate Registers
Content of Register File A     Content of Register File B
A0                             B0    count
A1    &a                       B1    &b
A2    a                        B2    b
A3    prod                     B3
A4    sum                      B4
...                            ...
A15                            B15
Step 5 - Create Scheduling Table
[Table: empty scheduling table with one row per functional unit (.L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2) and one column per cycle. PROLOG: cycles 1 to 7. LOOP: cycle 8.]
How do we know the loop ends up in cycle 8?
Length of Prolog
Answer: Count up the length of the longest path. In this case we have LDH (5), MPY (2) and ADD (1):
5 + 2 + 1 = 8 cycles
The loop therefore starts in cycle 8 and the prolog occupies cycles 1 to 7.
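Counting the longest path can be automated for larger graphs. The sketch below is illustrative only: the struct edge type, the MAX_NODES limit and the requirement that nodes are numbered so that every edge goes from a lower to a higher index are all assumptions made here, not part of the original procedure.

```c
#define MAX_NODES 8   /* arbitrary limit for this sketch */

/* A path from one node to another in the dependency graph. */
struct edge { int from; int to; };

/* Returns the length in cycles of the longest path through the graph,
 * or -1 if the graph is too large. latency[v] is the latency of node v.
 * Nodes must be numbered so that edges[e].from < edges[e].to. */
int longest_path(const int *latency, int n_nodes,
                 const struct edge *edges, int n_edges)
{
    int finish[MAX_NODES] = {0};   /* cycle in which each result is ready */
    int longest = 0;

    if (n_nodes > MAX_NODES) {
        return -1;
    }
    for (int v = 0; v < n_nodes; v++) {
        int start = 0;             /* latest finish among the parents */
        for (int e = 0; e < n_edges; e++) {
            if (edges[e].to == v && finish[edges[e].from] > start) {
                start = finish[edges[e].from];
            }
        }
        finish[v] = start + latency[v];
        if (finish[v] > longest) {
            longest = finish[v];
        }
    }
    return longest;
}
```

For the dot product, numbering the nodes LDH m = 0, LDH n = 1, MPY = 2, ADD = 3 with latencies {5, 5, 2, 1} and edges 0->2, 1->2, 2->3 gives 8.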
Scheduling Table
       PROLOG                                     LOOP
Unit   1      2     3     4     5     6     7     8
.L1                                               add
.L2           sub   *     *     *     *     *     *
.S1
.S2                 B     *     *     *     *     *
.M1                                   mpy   *     *
.M2
.D1    ldh m  *     *     *     *     *     *     *
.D2    ldh n  *     *     *     *     *     *     *
An asterisk marks a cycle in which the same instruction is issued again for a later iteration.
Where do we want to branch? Branch to the LOOP column (cycle 8). The B is therefore placed in cycle 3, so that after its five delay slots the branch takes effect in cycle 8.
Translate Scheduling Table to ‘C6x Code
Each column of the scheduling table becomes one execute packet (C1 to C7 for the prolog, then the loop):
PROLOG
C1:            ldh   .D1   *A1++,A2
       ||      ldh   .D2   *B1++,B2
C2:            ldh   .D1   *A1++,A2
       ||      ldh   .D2   *B1++,B2
       || [B0] sub   .L2   B0,1,B0
C3:            ldh   .D1   *A1++,A2
       ||      ldh   .D2   *B1++,B2
       || [B0] sub   .L2   B0,1,B0
       || [B0] B     .S2   loop
C4:            (same as C3)
C5:            (same as C3)
C6:            ldh   .D1   *A1++,A2
       ||      ldh   .D2   *B1++,B2
       || [B0] sub   .L2   B0,1,B0
       || [B0] B     .S2   loop
       ||      mpy   .M1x  A2,B2,A3
C7:            (same as C6)
LOOP (single-cycle loop)
loop:          ldh   .D1   *A1++,A2
       ||      ldh   .D2   *B1++,B2
       || [B0] sub   .L2   B0,1,B0
       || [B0] B     .S2   loop
       ||      mpy   .M1x  A2,B2,A3
       ||      add   .L1   A4,A3,A4
See Chapter 14 for practical examples
Translate Scheduling Table to ‘C6x Code
With this method we have only created the prolog and the loop.
Therefore if the filter has 100 taps, then we need to repeat the loop 100 times as we need 100 adds.
This means that we are performing 107 loads. These 7 extra loads may lead to some illegal memory accesses.
[Table: the scheduling table from cycle 1 to cycle 8, with ldh m (.D1) and ldh n (.D2) issued in every one of the eight cycles, sub (.L2) from cycle 2, B (.S2) from cycle 3, mpy (.M1) from cycle 6 and add (.L1) in cycle 8.]
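The figure of 107 loads can be checked by stepping through the schedule cycle by cycle. The sketch below assumes the schedule described above (a load issued in every cycle, mpy from cycle 6, add from cycle 8, a 7-cycle prolog followed by one loop cycle per tap); the struct and function names are hypothetical.

```c
/* Number of instructions issued on each unit (loads counted per input array). */
struct op_counts { unsigned loads; unsigned mpys; unsigned adds; };

/* Step through a prolog plus loop with no epilog, for a filter of `taps` taps. */
struct op_counts run_prolog_and_loop(unsigned taps)
{
    const unsigned mpy_cycle = 6;   /* first cycle in which mpy is issued */
    const unsigned add_cycle = 8;   /* first cycle in which add is issued (the loop) */
    const unsigned last_cycle = (add_cycle - 1) + taps;  /* 7 prolog cycles + taps loop cycles */
    struct op_counts n = {0, 0, 0};

    for (unsigned cycle = 1; cycle <= last_cycle; cycle++) {
        n.loads++;                          /* ldh is issued in every cycle */
        if (cycle >= mpy_cycle) n.mpys++;
        if (cycle >= add_cycle) n.adds++;
    }
    return n;
}
```

For 100 taps this reports 100 adds but 107 loads and 102 multiplies: the surplus operations are the ones an epilog removes.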
Solution: The Epilog
We only created the
Prolog and Loop …
What about the Epilog?
The Epilog can be extracted from
your results as described below.
See example in the next slide.
Dot-Product with Epilog
Prolog
p1:    ldh || ldh
p2:    ldh || ldh || [] sub
p3:    ldh || ldh || [] sub || [] b
p4:    ldh || ldh || [] sub || [] b
p5:    ldh || ldh || [] sub || [] b
p6:    ldh || ldh || mpy || [] sub || [] b
p7:    ldh || ldh || mpy || [] sub || [] b
Loop
loop:  ldh || ldh || mpy || add || [] sub || [] b
Epilog
e1:    mpy || add
e2:    mpy || add
e3:    mpy || add
e4:    mpy || add
e5:    mpy || add
e6:    add
e7:    add
Epilog = Loop - Prolog
And there is no sub or b in the epilog.
Scheduling Table: Prolog, Loop and Epilog
Prologue: cycles 1 to 7. Loop: cycle 8. Epilogue: cycles 9 to 15.
Unit  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
.L1                                      ADD  ADD  ADD  ADD  ADD  ADD  ADD  ADD
.L2        SUB  SUB  SUB  SUB  SUB  SUB  SUB
.S1
.S2             B    B    B    B    B    B
.M1                            MPY  MPY  MPY  MPY  MPY  MPY  MPY  MPY
.M2
.D1   LDH  LDH  LDH  LDH  LDH  LDH  LDH  LDH
.D2   LDH  LDH  LDH  LDH  LDH  LDH  LDH  LDH
Loop only!

Can the code be written as a loop only (i.e.
no prolog or epilog)?
Yes!
To write the code as a loop only:
(i) Remove all instructions except the branch from the prolog.
(ii) Zero input registers, accumulator and product registers.
(iii) Adjust the number of subtractions.
[Table: after these steps the scheduling table has six columns. Cycles 1 to 5 contain only the branches (.S2), the zeroing of a, b, prod and sum, and one sub (.L2). Cycle 6 is the single-cycle LOOP holding add (.L1), sub (.L2), B (.S2), mpy (.M1), ldh m (.D1) and ldh n (.D2).]
Loop Only - Final Code
Overhead
            b     loop
            b     loop
            b     loop
   ||       zero  m        ;input register
   ||       zero  n        ;input register
            b     loop
   ||       zero  prod     ;product register
   ||       zero  sum      ;accumulator
            b     loop
   ||       sub            ;modify count register
Loop
loop        ldh
   ||       ldh
   ||       mpy
   ||       add
   || []    sub
   || []    b     loop
Laboratory exercise
Software pipeline using the LDW version of the Dot-Product routine:
(1) Write linear assembly.
(2) Create dependency graph.
(3) Complete scheduling table.
(4) Transfer table to ‘C6000 code.
To Epilog or Not to Epilog?
Determine if your pipelined code is more efficient with or without prolog and epilog.
Lab Solution: Step 1 - Linear Assembly
; for (i=0; i < count; i++)
;   prod = m[i] * n[i];
;   sum += prod;
*** count becomes 20 ***
loop:     ldw   *p_m++, m
          ldw   *p_n++, n
          mpy   m, n, prod
          mpyh  m, n, prodh
          add   prod, sum, sum
          add   prodh, sumh, sumh
[count]   sub   count, 1, count
[count]   b     loop
; Outside of Loop
          add   sum, sumh, sum
Step 2 - Dependency Graph
[Figure: dependency graph for the LDW dot product. A side: LDW (m) on .D1, MPY (prod) on .M1x, ADD (sum) on .L1, B (loop) on .S1. B side: LDW (n) on .D2, MPYH (prodh) on .M2x, ADD (sumh) on .L2, SUB (count) on .S2. Latencies: LDW 5, MPY and MPYH 2, ADD 1, SUB 1, B 6.]
Step 2 - Functional Units
A side                   B side
.L1   sum                .L2   sumh
.M1   prod               .M2   prodh
.D1   m                  .D2   n
.S1   loop               .S2   count
x1    .M1x               x2    .M2x
Do we still have enough functional units to code this algorithm in a single-cycle loop?
Yes!
Step 2 - Registers
Register File A            Register File B
A0                         B0    count
A1                         B1
A2                         B2
A3                         B3    return address
A4    &a / ret value       B4    &x
A5    a                    B5    x
A6    count / prod         B6    prodh
A7    sum                  B7    sumh
Step 3 - Schedule Algorithm
       PROLOG                                            LOOP
Unit   1      2      3      4      5      6      7      8
.L1                                                     add
.L2                                                     add
.S1                  B1     B2     B3     B4     B5     B6
.S2           sub1   sub2   sub3   sub4   sub5   sub6   sub7
.M1                                       mpy    mpy2   mpy3
.M2                                       mpyh   mpyh2  mpyh3
.D1    ldw m  ldw2   ldw3   ldw4   ldw5   ldw6   ldw7   ldw8
.D2    ldw n  ldw2   ldw3   ldw4   ldw5   ldw6   ldw7   ldw8
Step 4 - ‘C6000 Code

The complete code is available in the
following location:
 \Links\DotP LDW.pdf
Why Conditional Subtract?
loop:     ldh   *p_m++, m
          ldh   *p_n++, n
          mpy   m, n, prod
          add   prod, sum, sum
[count]   sub   count, 1, count
[count]   b     loop
Without Cond. Subtract:              With Cond. Subtract:
loop (count = 1)    B taken          loop (count = 1)    B taken
loop (count = 0)    B not taken      loop (count = 0)    B not taken
loop (count = -1)   B taken          loop (count = 0)    B not taken
loop (count = -2)   B taken          loop (count = 0)    B not taken
loop (count = -3)   B taken          loop (count = 0)    B not taken
loop (count = -4)   B taken          loop (count = 0)    B not taken
Loop never ends                      Loop ends
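The two columns can be reproduced with a few lines of C. This is a behavioural sketch only: the function name and parameters are hypothetical, and it assumes that each pass first evaluates the conditional branch on the current count and then performs the subtraction, as the rows above are laid out.

```c
/* Counts how many of `passes` consecutive [count] b instructions are taken.
 * conditional_sub != 0 models "[count] sub"; 0 models an unconditional sub. */
int branches_taken(int count, int passes, int conditional_sub)
{
    int taken = 0;
    for (int k = 0; k < passes; k++) {
        if (count != 0) {
            taken++;                      /* [count] b loop is taken */
        }
        if (!conditional_sub || count != 0) {
            count--;                      /* sub count, 1, count */
        }
    }
    return taken;
}
```

Starting from count = 1 and following six passes, the unconditional version still branches five times because count goes negative, while the conditional version branches once and then stops at zero.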
Part 3 - Pipelining Multi-cycle Loops
Objectives
Software pipeline the weighted vector sum algorithm.
Describe four iteration interval constraints.
Calculate minimum iteration interval.
Convert and optimize the dot-product code to floating point code.
What Requires Multi-Cycle Loops?
Resource Limitations
Running out of resources (Functional Units, Registers, Bus Accesses).
Weighted Vector Sum example requires three .D units.
Live Too Long
Minimum iteration interval defined by length of time a variable is required to exist.
Loop Carry Path
Latency required between loop iterations.
FIR example and SP floating-point dot product examples are demonstrated.
Functional Unit Latency > 1
A few ‘C67x instructions require functional units for 2 or 4 cycles rather than one. This defines a minimum iteration interval.
What Requires Multi-Cycle Loops?
Four reasons:
1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).
Use these four constraints to determine the smallest Iteration Interval (Minimum Iteration Interval or MII).
Chapter 12, Slide 98
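One way to read "use these four constraints to determine the MII" is that each constraint gives a lower bound in cycles and the MII is the largest of them. The sketch below assumes exactly that; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Lower bound, in cycles, imposed by each of the four constraints.
   A constraint that does not apply is entered as 1. */
typedef struct {
    int resource;       /* functional units, registers, bus accesses */
    int live_too_long;  /* how long a variable must stay valid */
    int loop_carry;     /* latency fed back between iterations */
    int unit_latency;   /* functional unit latency (FUL) */
} Constraints;

static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* The smallest iteration interval that satisfies every bound. */
int min_iteration_interval(Constraints c)
{
    assert(c.resource >= 1 && c.live_too_long >= 1);
    assert(c.loop_carry >= 1 && c.unit_latency >= 1);
    return max2(max2(c.resource, c.live_too_long),
                max2(c.loop_carry, c.unit_latency));
}
```

For example, a resource bound of 2 with a loop carry path of 9 gives MII = 9, and a resource bound of 1 with a loop carry path of 4 gives MII = 4.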
Resource Limitation: Weighted Vector Sum
Step 1 - C Code
void WVS(short *c, short *b, short *a, short r, short n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        /* two loads (a[i], b[i]) and one store (c[i]): three .D operations */
        c[i] = a[i] + ((r * b[i]) >> 15);   /* shift the product only */
    }
}
a, b:  input arrays
c:     output array
n:     length of arrays
r:     weighting factor
Requires 3 .D units
Chapter 12, Slide 99
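A minimal, testable C sketch of the weighted vector sum follows. It assumes r is a Q15 weight and that the shift is meant to apply to the product only, as the linear assembly version does with SHR before ADD. In C, `a + x >> 15` parses as `(a + x) >> 15`, so the parentheses are written out. The name `wvs` and the `const` qualifiers are choices made for this example.

```c
#include <assert.h>

/* c[i] = a[i] + ((r * b[i]) >> 15), with r treated as a Q15 weight.
   Right-shifting a negative int is implementation-defined in C, so this
   sketch is only exercised with non-negative products. */
void wvs(short *c, const short *b, const short *a, short r, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int prod = r * b[i];                  /* 16 x 16 -> 32-bit product */
        c[i] = (short)(a[i] + (prod >> 15));  /* scale, then add a[i] */
    }
}
```

With r = 16384 (0.5 in Q15), a = {100, 200} and b = {1000, 2000}, the result is c = {600, 1200}. Shifting the whole sum instead would give 500 for the first element.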
Software Pipelining Procedure
1. Write algorithm in C code & verify.
2. Write ‘C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to ‘C6x code.
Chapter 12, Slide 100
Step 2 - ‘C6x Linear Code
c[i] = a[i] + ((r * b[i]) >> 15);
loop:   LDH   *a++, ai
        LDH   *b++, bi
        MPY   r, bi, prod
        SHR   prod, 15, sum
        ADD   ai, sum, ci
        STH   ci, *c++
[i]     SUB   i, 1, i
[i]     B     loop
The full code is available here: \Links\Wvs.sa
Chapter 12, Slide 101
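Step 3 - Dependency Graph
[Figure: dependency graph for the weighted vector sum. Each entry gives the instruction, its functional unit and the latency in cycles to the next node.]
A side:        LDH ai (.D1, 5) -> ADD ci (.L1, 1) -> STH *c++ (.D1, 1)
B side:        LDH bi (.D2, 5) -> MPY r, prod (.M2, 2) -> SHR 15, sum (.S2, 1) -> ADD ci
Loop control:  SUB i (.L2, 1) -> B loop (.S1, 6)
Chapter 12, Slide 102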
Step 4 - Allocate Functional Units
.L1   ci          .L2   i
.M1               .M2   prod
.D1   ai, *c      .D2   bi
.S1   loop        .S2   sum
This requires 3 .D units, therefore it cannot fit into a single-cycle loop.
This may fit into a 2-cycle loop if there are no other constraints.
Chapter 12, Slide 103
2 Cycle Loop
loop:
Cycle 1:   .L1  .L2  .S1  .S2  .M1  .M2  .D1  .D2
Cycle 2:   .L1  .L2  .S1  .S2  .M1  .M2  .D1  .D2
2 cycles per loop iteration, so each of the eight functional units is available once in cycle 1 and once in cycle 2.
Iteration Interval (II): # cycles per loop iteration.
Chapter 12, Slide 104
Multi-Cycle Loop Iterations
[Figure: three successive iterations of the 2-cycle loop drawn against the eight functional units. Loop 1 occupies cycles 1 and 2, loop 2 cycles 3 and 4, loop 3 cycles 5 and 6. Each iteration issues ldh, ldh, mpy, shr, add, sub, b and sth.]
Chapter 12, Slide 105
How long is the Prolog?
What is the length of the longest path? 10
    LDH bi (5) + MPY prod (2) + SHR sum (1) + ADD ci (1) + STH *c++ (1) = 10
    The LDH ai path joins at ADD ci after only 5 cycles, so it is shorter.
How many cycles per loop? 2
Chapter 12, Slide 111
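The longest path can also be computed mechanically from the dependency graph. The sketch below is an illustration only: the `Node` layout and function names are hypothetical, and the latencies are the ones shown on the dependency graph for the weighted vector sum (LDH 5, MPY 2, SHR 1, ADD 1, STH 1).

```c
#include <assert.h>

#define MAX_PRED  2
#define MAX_NODES 16

typedef struct {
    const char *name;
    int latency;          /* cycles charged to this node on a path */
    int npred;            /* number of valid entries in pred[] */
    int pred[MAX_PRED];   /* indices of earlier nodes this one depends on */
} Node;

/* Longest path in cycles through a graph listed in topological order. */
int longest_path(const Node *g, int n)
{
    int finish[MAX_NODES];
    int longest = 0;
    assert(n > 0 && n <= MAX_NODES);
    for (int i = 0; i < n; i++) {
        int start = 0;
        for (int k = 0; k < g[i].npred; k++) {
            int p = g[i].pred[k];
            assert(p >= 0 && p < i);   /* predecessors must come first */
            if (finish[p] > start)
                start = finish[p];
        }
        finish[i] = start + g[i].latency;
        if (finish[i] > longest)
            longest = finish[i];
    }
    return longest;
}

/* Weighted vector sum graph. */
static const Node wvs_graph[] = {
    {"LDH bi", 5, 0, {0, 0}},
    {"LDH ai", 5, 0, {0, 0}},
    {"MPY",    2, 1, {0, 0}},
    {"SHR",    1, 1, {2, 0}},
    {"ADD",    1, 2, {3, 1}},
    {"STH",    1, 1, {4, 0}},
};

int wvs_longest_path(void)
{
    return longest_path(wvs_graph, 6);
}
```

`wvs_longest_path()` gives 10. If the kernel is taken to be the final II = 2 cycles of that path (an inference from the scheduling chart, not something the slide states), the prolog would be 10 - 2 = 8 cycles.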
Step 5 - Create Scheduling Chart
The chart is filled in one instruction at a time, following the longest path. Even and odd cycles are drawn as separate tables because, with an iteration interval of 2, every instruction repeats two cycles later. A repeat is marked *.

Even cycles
Unit\cycle   0         2         4         6         8
.L1                                                  ADD ci
.L2
.S1
.S2
.M1
.M2
.D1
.D2          LDH bi    *         *         *         *

Odd cycles
Unit\cycle   1         3         5         7         9
.L1
.L2
.S1
.S2                                        SHR sum   *
.M1
.M2                              MPY mi    *         *
.D1                    LDH ai    *         *         STH c[i]
.D2

LDH ai must issue five cycles before ADD ci, which puts it in cycle 3 on .D1. Its repeats then fall on the same unit and the same odd cycle as STH c[i], so the two instructions conflict.
Chapter 12, Slides 112-119
Conflict Solution
LDH ai and STH c[i] both need .D1 in the odd cycle.
Here are two possibilities ... Which is better?
[Figure: scheduling chart showing the candidate positions for LDH ai.]
Move the LDH to cycle 2.
(so you don’t have to go back and recheck crosspaths)
Chapter 12, Slides 120-121
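The conflict can be expressed as a simple rule: in a loop with iteration interval II, two instructions collide if they use the same functional unit and their issue cycles are equal modulo II. The C sketch below checks a schedule against that rule. The `Slot` type and `has_conflict` are hypothetical names for this illustration, not part of any TI tool.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *unit;   /* functional unit, e.g. ".D1" */
    int cycle;          /* issue cycle in the scheduling chart */
} Slot;

/* Returns 1 if any two instructions need the same unit in the same
   cycle of the kernel (issue cycles equal modulo ii), otherwise 0. */
int has_conflict(const Slot *s, int n, int ii)
{
    assert(ii >= 1);
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (strcmp(s[i].unit, s[j].unit) == 0 &&
                s[i].cycle % ii == s[j].cycle % ii)
                return 1;
        }
    }
    return 0;
}
```

With II = 2, LDH ai on .D1 in cycle 3 collides with STH on .D1 in cycle 9 because both cycles are odd. Moving LDH ai to cycle 2 removes the collision.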
Step 5 - Create Scheduling Chart
With LDH ai placed in cycle 2, the loop control is added last: [i] SUB i in cycle 3 and [i] B in cycle 4. A repeat of an instruction two cycles later is marked *.

Even cycles
Unit\cycle   0         2         4         6         8
.L1                                                  ADD ci
.L2
.S1                              [i] B     *         *
.S2
.M1
.M2
.D1                    LDH ai    *         *         *
.D2          LDH bi    *         *         *         *

Odd cycles
Unit\cycle   1         3         5         7         9
.L1
.L2                    [i] SUB i *         *         *
.S1
.S2                                        SHR sum   *
.M1
.M2                              MPY mi    *         *
.D1                                                  STH c[i]
.D2

2 Cycle Loop Kernel
By cycles 8 and 9 every instruction of the loop is active, so these two columns form the 2-cycle loop kernel.
Chapter 12, Slides 122-126
Live Too Long - Example
c = (a >> 5) + a
[Figure: dependency graph. LDH ai (.D1, 5 cycles) feeds both SHR x (.S1, shift by 5, 1 cycle) and ADD ci (.L1). SHR x also feeds ADD ci.]
Single-cycle loop:
Cycle   0        1     2     3     4     5          6
.D1     LDH ai   LDH   LDH   LDH   LDH
a                                        a0 valid   a1
.S1                                      SHR
x                                                   x0 valid
.L1                                                 ADD
Oops, rather than adding a0 + x0 we got a1 + x0.
Let’s look at one solution ...
Chapter 12, Slides 128-131
Live Too Long - 2 Cycle Solution
Cycle   0        1     2     3     4     5          6          7
.D1     LDH ai         LDH         LDH
a                                        a0 valid   a0 valid   a1
.S1                                      SHR
x                                                   x0 valid   x0 valid
.L1                                                 ADD
With a 2 cycle loop, a0 is valid for 2 cycles.
Notice, a0 and x0 are both valid for 2 cycles, which is the length of the Iteration Interval.
Adding them ... Works!
But what’s the drawback? 2 cycle loop is slower.
Here’s a better solution ...
Chapter 12, Slides 132-134
Live Too Long - 1 Cycle Solution
[Figure: dependency graph. LDH ai (.D1, 5 cycles) feeds SHR x (.S1, shift by 5, 1 cycle) and MV b (.S2, 1 cycle). Both x and b feed ADD ci (.L1).]
Cycle   0        1     2     3     4     5          6
.D1     LDH ai   LDH   LDH   LDH   LDH
a                                        a0 valid   a1
.S2                                      MV b
b                                                   b valid
.S1                                      SHR
x                                                   x0 valid
.L1                                                 ADD
Using a temporary register solves this problem without increasing the Minimum Iteration Interval.
Chapter 12, Slide 135
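The three cases (single-cycle loop, 2-cycle loop, single-cycle loop with a copy) can be compared with a small C model of what register a holds at each cycle. This is a sketch under simplifying assumptions: loads start at cycle 0, LDH results arrive 5 cycles after issue, SHR and MV take 1 cycle, and only the first result c0 is examined. The function names are hypothetical.

```c
#include <assert.h>

#define LOAD_LATENCY 5   /* an LDH result is usable 5 cycles after issue */

/* Index of the input element held in register a at `cycle`, when one LDH
   is issued every `ii` cycles starting at cycle 0.
   Returns -1 before the first load has landed. */
int a_index_at(int cycle, int ii)
{
    assert(ii >= 1);
    if (cycle < LOAD_LATENCY)
        return -1;
    return (cycle - LOAD_LATENCY) / ii;
}

/* First result of c = (a >> 5) + a as the pipelined loop would compute it.
   `in` must hold at least two elements. If use_copy is non-zero, ADD reads
   the copy b made by MV instead of register a. */
int first_result(const short *in, int ii, int use_copy)
{
    int shr_cycle = LOAD_LATENCY;    /* SHR and MV issue when a0 lands */
    int add_cycle = shr_cycle + 1;   /* x0 and b are valid one cycle later */
    int x = in[a_index_at(shr_cycle, ii)] >> 5;
    int b = in[a_index_at(shr_cycle, ii)];   /* MV b: copy of a0 */
    int a = in[a_index_at(add_cycle, ii)];   /* whatever a holds by then */
    return x + (use_copy ? b : a);
}
```

With in = {64, 320} the intended result is (64 >> 5) + 64 = 66. The single-cycle loop without the copy returns 2 + 320 = 322, because a already holds the next element when ADD executes.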
Loop Carry Path
The loop carry path is a path which feeds one variable from part of the algorithm back to another.
e.g. Loop carry path = 3.
[Figure: fragment of a dependency graph showing MPY.M2 producing p2 (2 cycles), a 1-cycle edge, and STH.D1 storing st_y0.]
Note: The loop carry path is not the code loop.
Chapter 12, Slide 137
Loop Carry Path, e.g. IIR Filter
IIR Filter Example
y0 = a0*x0 + b1*y1
IIR.SA
IIR:    ldh   *a_1, A1
        ldh   *x1, A3
        ldh   *b_1, B1
        ldh   *y0, B0                ; y1 is previous y0
        mpy   A1, A3, prod1
        mpy   B1, B0, prod2
        add   prod1, prod2, prod2
        sth   prod2, *y0
Chapter 12, Slides 138-139
Loop Carry Path - IIR Example
IIR Filter Loop
y0 = a1*x1 + b1*y1
[Figure: dependency graph. LDH.D1 loads x1 and A1, LDH.D2 loads B1 and y1 (5 cycles). MPY.M1 gives p1 and MPY.M2 gives p2 (2 cycles). ADD.L1 gives y0 (1 cycle). STH.D1 st_y0 (1 cycle) feeds the next load of y1.]
Result carries over from one iteration of the loop to the next.
Min Iteration Interval
    Resource = 2 (need 3 .D units)
    Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1)
    therefore, MII = 9
Can it be minimized?
Chapter 12, Slide 140
Loop Carry Path - IIR Example (Solution)
IIR Filter Loop
y0 = a1*x1 + b1*y1
[Figure: the same dependency graph, with y0 fed from ADD.L1 straight back to MPY.M2 instead of through STH.D1 and LDH.D2.]
Since y0 is stored in a CPU register, it can be used directly by MPY (after the first loop iteration).
Min Iteration Interval
    Resource = 2 (need 3 .D units)
    New Loop Carry Path = 3 (3 = 2 + 1)
    therefore, MII = 3
Chapter 12, Slide 141
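The same idea can be shown in C: the previous output is either read back from memory on every pass or carried in a local variable that a compiler can keep in a register. This is only an illustrative sketch. The function names, the `y_init` parameter and the absence of fixed-point scaling are assumptions of this example, and the inputs must be small enough that the result fits in a short.

```c
#include <assert.h>

/* y[n] = a1*x[n] + b1*y[n-1], reading y[n-1] back from the output array.
   The store and the following load sit on the loop carry path. */
void iir_reload(const short *x, short *y, int n, short a1, short b1, short y_init)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        short y1 = (i == 0) ? y_init : y[i - 1];   /* load previous output */
        y[i] = (short)(a1 * x[i] + b1 * y1);
    }
}

/* Same filter, but the previous output is carried in a local variable,
   so only the multiply and the add remain on the loop carry path. */
void iir_carry(const short *x, short *y, int n, short a1, short b1, short y_init)
{
    short y1 = y_init;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        short y0 = (short)(a1 * x[i] + b1 * y1);
        y[i] = y0;   /* still stored, but not reloaded */
        y1 = y0;     /* carried to the next iteration */
    }
}
```

Both versions produce the same outputs. For x = {1, 2, 3}, a1 = 2, b1 = 3 and y_init = 0 the result is {2, 10, 36}.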
Reminder: Fixed-Point Dot-Product Example
[Figure: dependency graph. LDH m (.D1) and LDH n (.D2), 5 cycles, feed MPY prod (.M1x), 2 cycles, which feeds ADD sum (.L1), 1 cycle. sum is fed back into ADD.]
Is there a loop carry path in this example? Yes, but it’s only “1”.
Min Iteration Interval
    Resource = 1
    Loop Carry Path = 1
    therefore, MII = 1
For the fixed-point implementation, the Loop Carry Path was not taken into account because it is equal to 1.
Chapter 12, Slide 142
Loop Carry Path
IIR Example.
Enhancing the IIR.
Fixed-Point Dot-Product Example.
Floating-Point Dot Product Example.
Chapter 12, Slide 143
Loop Carry Path due to FUL > 1
Floating-Point Dot-Product Example
[Figure: dependency graph. LDW m (.D1) and LDW n (.D2), 5 cycles, feed MPYSP prod (.M1x), 4 cycles, which feeds ADDSP sum (.L1), 4 cycles. sum is fed back into ADDSP.]
Min Iteration Interval
    Resource = 1
    Loop Carry Path = 4
    therefore, MII = 4
Chapter 12, Slide 144
Unrolling the Loop
If the MII must be four cycles long, then use all of them to calculate four results.
[Figure: four copies of the floating-point dot-product graph side by side. In copy k (k = 1 to 4), LDW mk (.D1) and LDW nk (.D2) feed MPYSP prodk (.M1x), 4 cycles, which feeds ADDSP sumk (.L1), 4 cycles.]
Chapter 12, Slide 145
ADDSP Pipeline (Staggered Results)
Cycle   Instruction            Result
0       ADDSP x0, sum, sum     sum = 0
1       ADDSP x1, sum, sum     sum = 0
2       ADDSP x2, sum, sum     sum = 0
3       ADDSP x3, sum, sum     sum = 0
4       ADDSP x4, sum, sum     sum = x0
5       ADDSP x5, sum, sum     sum = x1
6       ADDSP x6, sum, sum     sum = x2
7       ADDSP x7, sum, sum     sum = x3
8       ADDSP x8, sum, sum     sum = x0 + x4
9                              sum = x1 + x5
10                             sum = x2 + x6
11                             sum = x3 + x7
12                             sum = x0 + x4 + x8
ADDSP takes 4 cycles or three delay slots to produce the result.
Chapter 12, Slide 146
ADDSP Pipeline (Staggered Results)
After the last ADDSP in cycle 8, four NOPs (cycles 9 to 12) let the remaining results arrive.
There are effectively four running sums:
sum (i)   = x(i) + x(i+4) + x(i+8) + …
sum (i+1) = x(i+1) + x(i+5) + x(i+9) + …
sum (i+2) = x(i+2) + x(i+6) + x(i+10) + …
sum (i+3) = x(i+3) + x(i+7) + x(i+11) + …
These need to be combined after the last addition is complete...
Chapter 12, Slides 154-155
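The four running sums can be written out in plain C as four separate accumulators, one per pipeline stage. This sketch only models the arithmetic, not the timing, and the function names are hypothetical. Note that floating-point addition is not associative, so in general the staggered result can differ in rounding from a strictly sequential sum.

```c
#include <assert.h>

#define ADDSP_LATENCY 4   /* one running sum per cycle of ADDSP latency */

/* Accumulate x[0..n-1] into four interleaved running sums:
   out[k] = x[k] + x[k+4] + x[k+8] + ... */
void running_sums(const float *x, int n, float out[ADDSP_LATENCY])
{
    assert(n >= 0);
    for (int k = 0; k < ADDSP_LATENCY; k++)
        out[k] = 0.0f;
    for (int i = 0; i < n; i++)
        out[i % ADDSP_LATENCY] += x[i];
}

/* Combine the four running sums once the last addition is complete. */
float staggered_sum(const float *x, int n)
{
    float s[ADDSP_LATENCY];
    running_sums(x, n, s);
    return (s[1] + s[2]) + (s[3] + s[0]);
}
```

For x = 1, 2, ..., 9 the running sums are 15, 8, 10 and 12, and the combined result is 45.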
ADDSP Pipeline (Combining Results)
Cycle   Instruction              Result
0       ADDSP x0, sum, sum       sum = 0
1       ADDSP x1, sum, sum       sum = 0
2       ADDSP x2, sum, sum       sum = 0
3       ADDSP x3, sum, sum       sum = 0
4       ADDSP x4, sum, sum       sum = x0
5       ADDSP x5, sum, sum       sum = x1
6       ADDSP x6, sum, sum       sum = x2
7       ADDSP x7, sum, sum       sum = x3
8       ADDSP x8, sum, sum       sum = x0 + x4
9       MV    sum, temp          sum = x1 + x5
10      ADDSP sum, temp, sum1    sum = x2 + x6, temp = x1 + x5
11      MV    sum, temp          sum = x3 + x7
12      ADDSP sum, temp, sum2    sum = x0 + x4 + x8, temp = x3 + x7
13      NOP
14      NOP                      sum1 = x1 + x2 + x5 + x6
15      NOP
16      ADDSP sum1, sum2, sum    sum2 = x0 + x3 + x4 + x7 + x8
17      NOP
18      NOP
19      NOP
20      NOP                      sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
Chapter 12, Slides 156-162
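The combining sequence can be replayed with a small cycle-by-cycle model in C. This is a sketch under stated assumptions: an ADDSP result becomes visible 4 cycles after issue and an MV result 1 cycle after issue, operands are read in the issue cycle, and one instruction issues per cycle. The `Instr` layout and `simulate_addsp` are hypothetical names, not a model of the real hardware.

```c
#include <assert.h>

enum Reg { SUM, TEMP, SUM1, SUM2, NREG };
enum Op  { OP_NOP, OP_ADDSP, OP_MV };

typedef struct {
    enum Op  op;
    int      xi;    /* index into x[] for the first operand, or -1 to read src1 */
    enum Reg src1;
    enum Reg src2;
    enum Reg dst;
} Instr;

#define NCYCLES       21
#define ADDSP_LATENCY 4   /* result visible 4 cycles after issue */
#define MV_LATENCY    1   /* result visible 1 cycle after issue */
#define NOP_ROW       {OP_NOP, -1, SUM, SUM, SUM}

/* One instruction per cycle, cycles 0 to 20. */
static const Instr program[NCYCLES] = {
    {OP_ADDSP, 0, SUM, SUM, SUM}, {OP_ADDSP, 1, SUM, SUM, SUM},
    {OP_ADDSP, 2, SUM, SUM, SUM}, {OP_ADDSP, 3, SUM, SUM, SUM},
    {OP_ADDSP, 4, SUM, SUM, SUM}, {OP_ADDSP, 5, SUM, SUM, SUM},
    {OP_ADDSP, 6, SUM, SUM, SUM}, {OP_ADDSP, 7, SUM, SUM, SUM},
    {OP_ADDSP, 8, SUM, SUM, SUM},
    {OP_MV,    -1, SUM,  SUM,  TEMP},   /* cycle 9  */
    {OP_ADDSP, -1, SUM,  TEMP, SUM1},   /* cycle 10 */
    {OP_MV,    -1, SUM,  SUM,  TEMP},   /* cycle 11 */
    {OP_ADDSP, -1, SUM,  TEMP, SUM2},   /* cycle 12 */
    NOP_ROW, NOP_ROW, NOP_ROW,          /* cycles 13 to 15 */
    {OP_ADDSP, -1, SUM1, SUM2, SUM},    /* cycle 16 */
    NOP_ROW, NOP_ROW, NOP_ROW, NOP_ROW  /* cycles 17 to 20 */
};

/* Runs the 21-cycle sequence on x[0..8] and returns the final sum. */
float simulate_addsp(const float *x)
{
    float reg[NREG] = {0};
    float pend_val[NCYCLES] = {0};
    int   pend_dst[NCYCLES] = {0};
    int   pend_due[NCYCLES];

    assert(x != 0);
    for (int t = 0; t < NCYCLES; t++)
        pend_due[t] = -1;   /* nothing in flight yet */

    for (int t = 0; t < NCYCLES; t++) {
        /* Write back results whose latency has elapsed. */
        for (int k = 0; k < t; k++)
            if (pend_due[k] == t)
                reg[pend_dst[k]] = pend_val[k];

        /* Issue this cycle's instruction using the values visible now. */
        const Instr *in = &program[t];
        if (in->op == OP_ADDSP) {
            float a = (in->xi >= 0) ? x[in->xi] : reg[in->src1];
            pend_val[t] = a + reg[in->src2];
            pend_dst[t] = in->dst;
            pend_due[t] = t + ADDSP_LATENCY;
        } else if (in->op == OP_MV) {
            pend_val[t] = reg[in->src1];
            pend_dst[t] = in->dst;
            pend_due[t] = t + MV_LATENCY;
        }
    }
    return reg[SUM];
}
```

With x = 1, 2, ..., 9 the model ends with sum = 45, the sum of all nine inputs, which arrives in cycle 20.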
Simple FUL Example
[Figure: MPYDP on .M1 producing prod, labelled 10 (4.9), with a timeline in which MPYDP issues in cycle 1 and cannot issue again until cycle 5.]
Cycle   1       2   3   4   5       6   ...
.M1     MPYDP               MPYDP
MPYDP ties up the functional unit for 4 cycles.
Chapter 12, Slide 164
A Better Way to Diagram this ...
.M1    1  MPYDP     5  MPYDP     9   MPYDP      13  MPYDP
.M1    2            6            10             14
.M1    3            7            11  prod1      15  prod2
.M1    4            8            12             16
Since the MPYDP instruction has a functional unit latency (FUL) of “4”, .M1 cannot be used again until the fifth cycle.
Hence, MII ≥ 4.
Chapter 12, Slide 165
What Requires Multi-Cycle Loops?
1. Resource Limitations.
2. Live Too Long.
3. Loop Carry Path.
4. Double Precision (FUL > 1).
Lab: Converting your dot-product code to Single-Precision Floating-Point.
Chapter 12, Slide 166
Chapter 12
Software Optimisation
- End -
