Platform-based Design Exploiting ILP VLIW architectures TU/e 5kk70

Platform-based Design
Exploiting ILP
VLIW architectures
TU/e 5kk70
Henk Corporaal
Bart Mesman
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Instruction format example:
operation 1 operation 2 operation 3 operation 4 operation 5
5/1/2020
Platform Design
H. Corporaal and B. Mesman
2
VLIW: Topics Overview
• Enhance performance:
– What options do you have?
• Instruction Level Parallelism
– Limits on ILP
• VLIW
– Examples
• Clustering
• Code generation
• Hands-on
Enhance performance:
4 architecture methods
• (Super)-pipelining
• Powerful instructions
– MD-technique
• multiple data operands per operation
– MO-technique
• multiple operations per instruction
• Multiple instruction issue
Architecture methods
Pipelined Execution of Instructions
IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

CYCLE           1    2    3    4    5    6    7    8
INSTRUCTION 1   IF   DC   RF   EX   WB
INSTRUCTION 2        IF   DC   RF   EX   WB
INSTRUCTION 3             IF   DC   RF   EX   WB
INSTRUCTION 4                  IF   DC   RF   EX   WB
Simple 5-stage pipeline
Purpose of pipelining:
• Reduce #gate_levels in critical path
• Reduce CPI close to one (instead of a large number for the
multicycle machine)
• More efficient Hardware
Problems
• Hazards: pipeline stalls
• Structural hazards: add more hardware
• Control hazards, branch penalties: use branch prediction
• Data hazards: bypassing required
Architecture methods
Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages
• Superpipelining degree S:
S(architecture) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)
where:
f(op) is frequency of operation op
lt(op) is latency of operation op
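A minimal sketch of the superpipelining degree as a frequency-weighted average latency. The instruction mix below is hypothetical and only illustrates the arithmetic; it is not taken from the slides.

```python
def superpipelining_degree(mix):
    """mix maps an operation name to (frequency f, latency lt in cycles)."""
    total = sum(f for f, _ in mix.values())
    assert abs(total - 1.0) < 1e-9, "frequencies should sum to 1"
    # S = sum over all operations of f(Op) * lt(Op)
    return sum(f * lt for f, lt in mix.values())

# Hypothetical instruction mix (assumption, for illustration only).
example_mix = {
    "alu":    (0.50, 1),
    "load":   (0.25, 2),
    "store":  (0.125, 1),
    "branch": (0.125, 2),
}

S = superpipelining_degree(example_mix)
print(S)
```

With all latencies equal to 1 the degree is exactly 1; splitting a frequently used stage raises S.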
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];
or
c = a + 5*b

Assembly:
set    vl,64
ldv    v1,0(r2)
mulvi  v2,v1,5
ldv    v1,0(r1)
addv   v3,v1,v2
stv    v3,0(r3)
Architecture methods
Powerful Instructions (1)
SIMD computing
SIMD Execution Method
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD execution method. Time axis with Instruction 1, Instruction 2, Instruction 3, ..., Instruction n executed across node1, node2, ..., node-K.]
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
– SIMD on restricted scale:
– Used for Multi-media instructions
• Examples
– MMX, SSX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II
– Example: \sum_{i=1}^{4} |a_i - b_i|
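A minimal sketch of the sum-of-absolute-differences example on four 8-bit sub-words packed into one 32-bit word. The helper names pack4 and sad4 are hypothetical; the Python loop only models the lanes that a multimedia instruction would process in a single operation.

```python
def pack4(values):
    """Pack four unsigned 8-bit values into one 32-bit word (first value in the low byte)."""
    assert len(values) == 4 and all(0 <= v <= 255 for v in values)
    word = 0
    for lane, v in enumerate(values):
        word |= v << (8 * lane)
    return word

def sad4(word_a, word_b):
    """Sum over the four 8-bit lanes of |a_i - b_i|."""
    total = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total

a = pack4([10, 200, 30, 40])
b = pack4([12, 190, 30, 100])
print(sad4(a, b))
```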
Architecture methods
Powerful Instructions (2)
MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
field         FU 1            FU 2             FU 3             FU 4           FU 5
instruction   sub r8, r5, 3   and r1, r5, 12   mul r6, r5, r2   ld r3, 0(r5)   bnez r5, 13
VLIW instruction example
VLIW architecture: central Register File
[Figure: one central register file connected to three issue slots. Issue slot 1: exec units 1-3. Issue slot 2: exec units 4-6. Issue slot 3: exec units 7-9.]
Q: How many ports does the register file need for n-issue?
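One way to reason about the question, as a sketch under a stated assumption: every issue slot executes a RISC-like operation that reads two operands and writes one result per cycle. The slide itself does not give the answer; the per-operation port counts are assumptions. Under that assumption a 5-issue machine needs 15 ports, which agrees with the TriMedia TM1000 figures in this deck (5 issue slots, 15 ports).

```python
def register_ports(n_issue, reads_per_op=2, writes_per_op=1):
    """Ports on a central register file, assuming each issue slot
    reads `reads_per_op` operands and writes `writes_per_op` results per cycle."""
    read_ports = n_issue * reads_per_op
    write_ports = n_issue * writes_per_op
    return read_ports, write_ports, read_ports + write_ports

for n in (1, 3, 5):
    print(n, register_ports(n))
```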
TriMedia TM32A processor
• 0.18 micron
• area: 16.9 mm2
• 200 MHz (typ)
• 1.4 W
• 7 mW/MHz (MIPS = 0.9 mW/MHz)
[Figure: TM32A floorplan with I/O interface, I-cache and D-cache with tags, sequencer / decode, and the function units DSPMUL1, DSPMUL2, IFMUL1 (FLOAT), IFMUL2 (FLOAT), FALU0, FALU3, FCOMP2, FTOUGH1, ALU0, ALU1, ALU2, ALU3, ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2.]
Architecture methods: Powerful Instructions (2)
VLIW Characteristics
• Only RISC like operation support
– Short cycle times
• Flexible: Can implement any FU mixture
• Extensible
• Tight inter FU connectivity required
• Large instructions (up to 1000 bits)
• Not binary compatible !!!
• But good compilers exist
Architecture methods
Multiple instruction issue (per cycle)
Who guarantees semantic correctness? (Can instructions be executed in parallel?)
• User: he specifies multiple instruction streams
– Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: Run-time detection of ready instructions
– Superscalar
• Compiler: Compile into dataflow representation
– Dataflow processors
Multiple instruction issue
Three Approaches
Example code
a := b + 15;
c := 3.14 * d;
e := c / f;
Translation to DDG (Data Dependence Graph)
[Figure: DDG. ld(&b) and the constant 15 feed +, whose result goes to st(&a). ld(&d) and the constant 3.14 feed *, whose result goes to st(&c) and to /. ld(&f) also feeds /, whose result goes to st(&e).]
Generated Code
Instr.   Sequential Code       Dataflow Code
I1       ld   r1,M(&b)         ld M(&b)     -> I2
I2       addi r1,r1,15         addi 15      -> I3
I3       st   r1,M(&a)         st M(&a)
I4       ld   r1,M(&d)         ld M(&d)     -> I5
I5       muli r1,r1,3.14       muli 3.14    -> I6, I8
I6       st   r1,M(&c)         st M(&c)
I7       ld   r2,M(&f)         ld M(&f)     -> I8
I8       div  r1,r1,r2         div          -> I9
I9       st   r1,M(&e)         st M(&e)
Notes:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
– No dependencies between streams; in practice communication and
synchronization required between streams
• A superscalar issues multiple instructions from sequential stream
– Obey dependencies (True and name dependencies)
– Reverse engineering of DDG needed at run-time
• Dataflow code is direct representation of DDG
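A minimal sketch of how the DDG bounds parallel execution: an as-soon-as-possible (ASAP) schedule over the true dependencies of I1..I9. The successor lists are derived from the example code (each value flows to the instructions that consume it). The unit latency and unlimited resources are assumptions.

```python
# Successors of each instruction in the data dependence graph.
successors = {
    "I1": ["I2"], "I2": ["I3"], "I3": [],
    "I4": ["I5"], "I5": ["I6", "I8"], "I6": [],
    "I7": ["I8"], "I8": ["I9"], "I9": [],
}

def asap_schedule(successors, latency=1):
    """Earliest start cycle (1-based) of every node in an acyclic dependence graph."""
    preds = {n: [] for n in successors}
    for n, succs in successors.items():
        for s in succs:
            preds[s].append(n)
    cycle = {}

    def start(n):
        if n not in cycle:
            # A node starts `latency` cycles after its latest predecessor started.
            cycle[n] = 1 + max((start(p) + latency - 1 for p in preds[n]), default=0)
        return cycle[n]

    for n in successors:
        start(n)
    return cycle

schedule = asap_schedule(successors)
length = max(schedule.values())
print(schedule)
print("cycles:", length, "average ILP:", len(schedule) / length)
```

A superscalar has to rediscover these edges at run-time from register names; dataflow code carries them explicitly.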
Multiple Instruction Issue: Dataflow processor
[Figure: dataflow processor with the blocks Result Tokens, Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations, and function units FU-1, FU-2, ..., FU-K.]
Instruction Pipeline Overview
[Figure: pipeline diagrams per architecture class; stage labels as recovered from the figure.
CISC: IF, DC, RF, EX, WB
RISC: IF, DC/RF, EX, WB
Superscalar: k parallel pipelines IF1..IFk, DC1..DCk, ISSUE, RF1..RFk, EX1..EXk, ROB, WB1..WBk
Superpipelined: IF1, IF2, ---, IFs, DC, RF, EX1, EX2, ---, EX5, WB
VLIW: IF, DC, followed by k parallel lanes RF1..RFk, EX1..EXk, WB1..WBk
DATAFLOW: k parallel lanes RF1..RFk, EX1..EXk, WB1..WBk]
Four dimensional representation of the architecture design space <I, O, D, S>
[Figure: logarithmic axes (0.1 to 100) for Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining Degree 'S', with CISC, RISC, VLIW, Superscalar, Superpipelined, Vector, SIMD, MIMD, and Dataflow placed in this space.]
Architecture design space
Typical values of K (# of functional units or processor nodes), and
<I, O, D, S> for different architectures
Architecture     K     I     O     D     S     Mpar
CISC             1     0.2   1.2   1.1   1     0.26
RISC             1     1     1     1     1.2   1.2
VLIW             10    1     10    1     1.2   12
Superscalar      3     3     1     1     1.2   3.6
Superpipelined   1     1     1     1     3     3
Vector           7     0.1   1     64    5     32
SIMD             128   1     1     128   1.2   154
MIMD             32    32    1     1     1.2   38
Dataflow         10    10    1     1     1.2   12

S(architecture) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)
Mpar = I*O*D*S
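The Mpar column can be checked directly from Mpar = I*O*D*S. The sketch below uses the typical <I, O, D, S> values listed on this slide; the dictionary layout and the function name are illustrative choices.

```python
# (I, O, D, S) per architecture, typical values from the slide.
design_space = {
    "CISC":           (0.2, 1.2, 1.1, 1),
    "RISC":           (1,   1,   1,   1.2),
    "VLIW":           (1,   10,  1,   1.2),
    "Superscalar":    (3,   1,   1,   1.2),
    "Superpipelined": (1,   1,   1,   3),
    "Vector":         (0.1, 1,   64,  5),
    "SIMD":           (1,   1,   128, 1.2),
    "MIMD":           (32,  1,   1,   1.2),
    "Dataflow":       (10,  1,   1,   1.2),
}

def mpar(i, o, d, s):
    """Parallelism a machine must find to stay busy: Mpar = I*O*D*S."""
    return i * o * d * s

for name, point in design_space.items():
    print(f"{name:15s} {mpar(*point):8.2f}")
```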
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
– limits on ILP
• VLIW
– Examples
• Clustering
• Code generation
• Hands-on
General organization of an
ILP architecture
[Figure: CPU with instruction fetch unit, instruction decode unit, register file, bypassing network, and function units FU-1 .. FU-5, connected to instruction memory and data memory.]
Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
– multi-media (image, audio, video, 3-D)
– intelligent search and filtering engines
– neural, fuzzy, genetic computing
• More functionality
• Use of existing Code (Compatibility)
• Low Power: P = fCVdd2
5/1/2020
Platform Design
H. Corporaal and B. Mesman
23
Low power through parallelism
• Sequential Processor
– Switching capacitance C
– Frequency f
– Voltage V
– P = f C V^2
• Parallel Processor (two times the number of units)
– Switching capacitance 2C
– Frequency f/2
– Voltage V' < V
– P = (f/2) (2C) V'^2 = f C V'^2
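A numeric sketch of the argument: with twice the units at half the frequency, the power ratio reduces to (V'/V)^2. The frequency, capacitance, and both supply voltages below are hypothetical values chosen only to illustrate the formula.

```python
def dynamic_power(f, c, v):
    """Dynamic power P = f * C * V^2."""
    return f * c * v ** 2

# Hypothetical operating point (assumptions, not from the slides).
f, c, v = 200e6, 1e-9, 1.8   # Hz, F, V
v_par = 1.2                  # lowered supply voltage V' < V

p_seq = dynamic_power(f, c, v)              # sequential processor
p_par = dynamic_power(f / 2, 2 * c, v_par)  # two units at half the frequency
ratio = p_par / p_seq

print(f"P_seq = {p_seq:.3f} W, P_par = {p_par:.3f} W, ratio = {ratio:.3f}")
```

The saving comes entirely from the lower voltage; at V' = V the two designs dissipate the same power.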
Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
– Using existing compiler
– Using trace analysis
• Track all the real data dependencies (RaWs) of instructions from issue window
– register dependence
– memory dependence
• Check for correct branch prediction
– if prediction correct continue
– if wrong, flush schedule and start in next cycle
Trace analysis
Program
For i := 0..2
    A[i] := i;
S := X+3;

Compiled code
      set  r1,0
      set  r2,3
      set  r3,&A
Loop: st   r1,0(r3)
      add  r1,r1,1
      add  r3,r3,4
      brne r1,r2,Loop
      add  r1,r5,3

Trace
set  r1,0
set  r2,3
set  r3,&A
st   r1,0(r3)
add  r1,r1,1
add  r3,r3,4
brne r1,r2,Loop
st   r1,0(r3)
add  r1,r1,1
add  r3,r3,4
brne r1,r2,Loop
st   r1,0(r3)
add  r1,r1,1
add  r3,r3,4
brne r1,r2,Loop
add  r1,r5,3

How parallel can this code be executed?
Trace analysis
Parallel Trace (one row per cycle)
set r1,0          set r2,3      set r3,&A
st r1,0(r3)       add r1,r1,1   add r3,r3,4
st r1,0(r3)       add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
st r1,0(r3)       add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
brne r1,r2,Loop
add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
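The speedup figure can be recomputed from the parallel trace: the serial length is the number of instructions, the parallel length is the number of cycles. The grouping of instructions per cycle below is read from the slide's parallel trace and is an assumption of this sketch.

```python
# One inner list per cycle of the parallel trace.
parallel_trace = [
    ["set r1,0", "set r2,3", "set r3,&A"],
    ["st r1,0(r3)", "add r1,r1,1", "add r3,r3,4"],
    ["st r1,0(r3)", "add r1,r1,1", "add r3,r3,4", "brne r1,r2,Loop"],
    ["st r1,0(r3)", "add r1,r1,1", "add r3,r3,4", "brne r1,r2,Loop"],
    ["brne r1,r2,Loop"],
    ["add r1,r5,3"],
]

l_serial = sum(len(cycle) for cycle in parallel_trace)  # instructions in the trace
l_parallel = len(parallel_trace)                        # cycles when run in parallel
max_ilp = l_serial / l_parallel

print(l_serial, l_parallel, round(max_ilp, 1))
```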
Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all
register WAW & WAR hazards avoided
2. Branch and Jump prediction – Perfect => all program
instructions available for execution
3. Memory-address alias analysis – addresses are known. A store
can be moved before a load provided addresses not equal
Also:
– unlimited number of instructions issued/cycle (unlimited resources), and
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (FP *,/)
Programs were compiled using MIPS compiler with maximum optimization level
Upper Limit to ILP: Ideal Processor
Integer: 18 - 60
FP: 75 - 150
[Figure: bar chart of instruction issues per cycle (IPC), scale 0 to 160, for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Bar labels in the figure: 150.1, 118.7, 75.2, 62.6, 54.8, 17.9.]
Window Size and Branch Impact
• Change from infinite window to examine 2000 and issue at most 64 instructions per cycle
FP: 15 - 45
Integer: 6 – 12
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Legend: Perfect; Selective predictor (Tournament); Standard 2-bit, BHT(512); Static (Profile); None (No prediction).]
Limiting nr. of Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)
FP: 11 - 45
Integer: 5 - 15
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Legend (number of renaming registers): Infinite, 256, 128, 64, 32, None.]
Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
FP: 4 - 45 (Fortran, no heap)
Integer: 4 - 9
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Legend (alias analysis): Perfect; Global/stack perfect; Inspection; None.]
Reducing Window Size
• Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window
FP: 8 - 45
Integer: 6 - 12
[Figure: bar chart of instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doducd, tomcatv. Legend (window size): Infinite, 256, 128, 64, 32, 16, 8, 4.]
How to Exceed ILP Limits of
This Study?
• WAR and WAW hazards through memory:
eliminated WAW and WAR hazards through register
renaming, but not in memory
• Unnecessary dependences
– the compiler did not unroll loops, so the iteration variable dependence remains
• Overcoming the data flow limit: value prediction,
predicting values and speculating on prediction
– Address value prediction and speculation predicts
addresses and speculates by reordering loads and stores.
Could provide better aliasing analysis
Conclusions
• Amount of parallelism is limited
– higher in Multi-Media and Signal Processing appl.
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
– Examples
• C6
• TM
• IA-64: Itanium, ....
• TTA
• Clustering
• Code generation
• Hands-on
VLIW concept
A VLIW architecture with 7 FUs
[Figure: Instruction Memory feeds an Instruction register that controls the function units: three Int FUs, two LD/ST units, and two FP FUs. The units connect to an Int Register File and a Floating Point Register File; the LD/ST units access the Data Memory.]
VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC like operation support
– Short cycle times
– Easier to compile for
• Flexible: Can implement any FU mixture
• Extensible / Scalable
However:
• tight inter FU connectivity required
• not binary compatible !!
– (new long instruction format)
• low code density
VelociTI C6x datapath
[Figure: VelociTI C6x datapath]
VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
– 8 FUs: 4 FUs / cluster : (2 Multipliers, 6 ALUs)
– 2 x 16 registers
– One bus available to write in register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM
VLIW example: Philips TriMedia TM1000
Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
[Figure: Register file (128 regs, 32 bit, 15 ports) feeding five exec units; Instruction register (5 issue slots); PC; Instruction cache (32 kB); Data cache (16 kB).]
Intel EPIC Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC)
• IA-64 architecture -> Itanium, first realization 2001
Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers
See http://en.wikipedia.org/wiki/Itanium
EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
– 3 * 41-bit instruction
– 5 template bits, indicate type and stop location
• Each 41-bit instruction
– starts with 4-bit opcode, and
– ends with 6-bit guard (boolean) register-id
• Supports speculative loads
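A minimal sketch of the bundle arithmetic: 5 template bits plus 3 * 41 instruction bits give 128 bits. The slide only states the field widths; placing the template in the lowest bits of the bundle and the guard in the lowest 6 bits of each instruction is an assumption of this sketch, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3

def make_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == SLOTS_PER_BUNDLE
    bundle = template
    for k, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        bundle |= slot << (TEMPLATE_BITS + k * SLOT_BITS)
    return bundle

def split_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) of a 128-bit bundle."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + k * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for k in range(SLOTS_PER_BUNDLE)]
    return template, slots

def slot_fields(slot):
    """4-bit opcode at the top of the instruction, 6-bit guard register id at the bottom."""
    opcode = slot >> (SLOT_BITS - 4)
    guard = slot & 0x3F
    return opcode, guard

slots = [(0x8 << 37) | 5, (0x1 << 37) | 0, (0xF << 37) | 63]
bundle = make_bundle(0x10, slots)
template, decoded = split_bundle(bundle)
print(template, [slot_fields(s) for s in decoded])
```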
Itanium
[Figure: Itanium]
Itanium 2: McKinley
[Figure: Itanium 2 (McKinley)]
EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
– Function unit assignment performed at run-time
– Lock when FU results not available
• See other website for more info on IA-64:
– www.ics.ele.tue.nl/~heco/courses/ACA
– (look at related material)
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Instruction format:
operation 1 operation 2 operation 3 operation 4 operation 5
VLIW evaluation
Strong points of VLIW:
– Scalable (add more FUs)
– Flexible (an FU can be almost anything; e.g. multimedia support)
Weak points:
• With N FUs:
– Bypassing complexity: O(N^2)
– Register file complexity: O(N)
– Register file size: O(N^2)
• Register file design restricts FU flexibility
Solution: .................................................. ?
VLIW evaluation
[Figure: general ILP organization (CPU with instruction fetch unit, instruction decode unit, register file, bypassing network, FU-1 .. FU-5, instruction memory, data memory), annotated with "Control problem", O(N^2), and O(N)-O(N^2).]
With N function units
Solution
TTA: Transport Triggered Architecture
[Figure: dataflow graphs built from the operations +, >, *, and st.]
Transport Triggered Architecture
General organization of a TTA
[Figure: CPU with instruction fetch unit, instruction decode unit, register file, bypassing network, and function units FU-1 .. FU-5, connected to instruction memory and data memory.]
TTA structure; datapath details
[Figure: TTA datapath. Two load/store units connected to the Data Memory, two integer ALUs, a float ALU, a boolean RF, an integer RF, a float RF, an immediate unit, and an instruction unit connected to the Instruction Memory; the units attach to the transport buses through sockets.]
TTA hardware characteristics
• Modular: building blocks easy to reuse
• Very flexible and scalable
– easy inclusion of Special Function Units (SFUs)
• Very low complexity
– > 50% reduction on # register ports
– reduced bypass complexity (no associative matching)
– up to 80 % reduction in bypass connectivity
– trivial decoding
– reduced register pressure
– easy register file partitioning (a single port is enough!)
TTA software characteristics
add r3, r1, r2
[Figure: adder FU with operand ports o1 and o2 and result port r]
r1 -> add.o1;
r2 -> add.o2;
add.r -> r3
That does not look like an improvement !?!
• More difficult to schedule !
• But: extra scheduling optimizations
Program TTAs
How to do data operations ?
1. Transport of operands to FU
• Operand move (s)
• Trigger move
2. Transport of results from FU
• Result move (s)
[Figure: FU pipeline with Operand and Trigger registers, an internal stage, and a Result register]
Example
Add r3,r1,r2
becomes
r1 -> Oint      // operand move to integer unit
r2 -> Tadd      // trigger move to integer unit
………….           // addition operation in progress
Rint -> r3      // result move from integer unit
How to do Control flow ?
1. Jumps:   #jump-address -> pc
2. Branch:  #displacement -> pcd
3. Call:    pc -> r; #call-address -> pcd
Scheduling example
VLIW
add r1,r2,r2
sub r4,r1,95
TTA
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4
[Figure: datapath with a load/store unit, two integer ALUs, an integer RF, and an immediate unit]
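A minimal sketch of how three-operand operations map to moves, and of the "extra scheduling optimization" of forwarding a result directly to the next FU. The functions to_moves and bypass are hypothetical helpers, and the temporary register r3 is assumed to be dead after the second operation; the operands are chosen for illustration only.

```python
def to_moves(op):
    """Translate 'add r3, r1, r2' into two operand moves and one result move."""
    mnemonic, operands = op.split(None, 1)
    dst, src1, src2 = [x.strip() for x in operands.split(",")]
    return [(src1, f"{mnemonic}.o1"), (src2, f"{mnemonic}.o2"), (f"{mnemonic}.r", dst)]

def bypass(moves, dead_registers):
    """Software bypassing: if a dead register is written once and read once afterwards,
    move the FU result straight to the consumer and drop the register write."""
    out = list(moves)
    for reg in dead_registers:
        writers = [i for i, (_, dst) in enumerate(out) if dst == reg]
        readers = [i for i, (src, _) in enumerate(out) if src == reg]
        if len(writers) == 1 and len(readers) == 1 and writers[0] < readers[0]:
            w, r = writers[0], readers[0]
            out[r] = (out[w][0], out[r][1])  # read from the FU result port instead
            del out[w]                       # the register is never written
    return out

program = ["add r3, r1, r2", "sub r4, r3, 95"]
moves = [m for op in program for m in to_moves(op)]
optimized = bypass(moves, dead_registers=["r3"])
formatted = [f"{src} -> {dst}" for src, dst in optimized]
print(len(moves), "moves before,", len(optimized), "after")
print(formatted)
```

The bypassed version saves one register write and one register read, which is where the reduced register port pressure of a TTA comes from.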
TTA Instruction format
General MOVE field:   g | i | src | dst
g   : guard specifier
i   : immediate specifier
src : source
dst : destination
General MOVE instructions: multiple fields
move 1 | move 2 | move 3 | move 4
How to use immediates?
Small, 6 bits:   g | 1 | imm | dst
Long, 32 bits:   g | 0 | Ir-1 | dst, with the value in a separate imm field
Programming TTAs
How to do conditional execution
Each move is guarded
Example
r1 -> cmp.o1    // operand move to compare unit
r2 -> cmp.o2    // trigger move to compare unit
cmp.r -> g      // put result in boolean register g
g:r3 -> r4      // guarded move takes place when r1=r2
Register file port pressure for TTAs
Read and write ports required
[Figure: 3-D plot of ILP degree (1.00 to 3.50) against the number of read ports (1 to 5) and write ports (1 to 5).]
Summary of TTA Advantages
• Better usage of transport capacity
– Instead of 3 transports per dyadic operation, about 2 are needed
– # register ports reduced by at least 50%
– Inter FU connectivity reduced by 50-70%
• No full connectivity required
• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #reg.files, etc. can be changed
• FU splitting results in extra exploitable concurrency
• TTAs are easy to design and can have short cycle times
TTA automatic DSE
[Figure: Move framework. User interaction and an Optimizer choose the architecture parameters, which drive a Parametric compiler (producing parallel object code) and a Hardware generator (producing the chip). Both give feedback to the Optimizer, which explores the Pareto curve (solution space) with cost on one axis.]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
– C6
– TM
– TTA
• Clustering and Reconfigurable components
• Code generation
• Hands-on
Clustered VLIW
• Clustering = Splitting up the VLIW data path (the same can be done for the instruction path)
[Figure: three clusters, each with a loop buffer, three FUs, and a register file; Level 1 Instruction Cache, Level 1 Data Cache, and Level 2 (shared) Cache.]
Clustered VLIW
Why clustering?
• Timing: faster clock
• Lower Cost
– silicon area
– time-to-market (T2M)
• Lower Energy
What’s the disadvantage?
Fine-Grained reconfigurable:
Xilinx XC4000 FPGA
[Figure: XC4000 structure. Configurable Logic Blocks (CLBs) containing F, G, and H function generators, S/R control, and two flip-flops with outputs X and Y; Programmable Interconnect with Switch Matrices; I/O Blocks (IOBs) with input and output buffers, slew rate control, delay, and passive pull-up/pull-down.]
Coarse-Grained reconfigurable:
Chameleon CS2000
Highlights:
• 32-bit datapath (ALU/Shift)
• 16x24 Multiplier
• distributed local memory
• fixed timing
Recent Coarse Grain Reconfigurable
Architectures
• SmartCell 2009
– read http://www.hindawi.com/journals/es/2009/518659.html
• Montium (reconfigurable VLIW)
• RAPID
• NIOS II
• RAW
• PicoChip
• PACT XPP64
• many more ….
Hybrid FPGAs: Virtex II-Pro
[Figure: Virtex II Pro with GHz IO (up to 16 serial transceivers), embedded PowerPCs, memory blocks, and reconfigurable logic blocks. Courtesy of Xilinx (Virtex II Pro).]
HW or SW reconfigurable?
[Figure: reconfiguration time (from reset down to 1 cycle) against data path granularity (fine to coarse). Points shown: FPGA (spatial mapping), loopbuffer, context, temporal mapping, subword parallelism, VLIW.]
Granularity Makes Differences
                     Fine-Grained Architecture   Coarse-Grained Architecture
Clock Speed          Low                         High
Configuration Time   Long                        Short
Unit Amount          Large                       Small
Flexibility          High                        Low
Power                High                        Low
Area                 Large                       Small