Platform-based Design
Exploiting ILP
VLIW architectures (part a)
TU/e 5kk70
Henk Corporaal
Bart Mesman
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Instruction format:
operation 1 operation 2 operation 3 operation 4 operation 5
4/23/2020
Platform Design
H. Corporaal and B. Mesman
2
VLIW: Topics Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
– Limits on ILP
• VLIW
– Examples
• Clustering
• Code generation
• Hands-on
Enhance performance:
3 architecture methods
• (Super)-pipelining
• Powerful instructions
– MD-technique
• multiple data operands per operation
– MO-technique
• multiple operations per instruction
• Multiple instruction issue
Architecture methods
Pipelined Execution of Instructions
Stages:
IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

             CYCLE
INSTRUCTION  1   2   3   4   5   6   7   8
1            IF  DC  RF  EX  WB
2                IF  DC  RF  EX  WB
3                    IF  DC  RF  EX  WB
4                        IF  DC  RF  EX  WB
Simple 5-stage pipeline
Purpose of pipelining:
• Reduce #gate_levels in critical path
• Reduce CPI close to one
• More efficient Hardware
Problems
• Hazards: pipeline stalls
• Structural hazards: add more hardware
• Control hazards, branch penalties: use branch prediction
• Data hazards: bypassing required
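The effect of pipelining on CPI can be checked with a little arithmetic. The sketch below assumes an ideal linear pipeline in which each instruction enters one cycle after its predecessor; the function names and the `stall_cycles` parameter are illustrative and do not come from the slides.

```python
def pipeline_cycles(n_instructions, n_stages=5, stall_cycles=0):
    """Cycles to run n instructions through a linear pipeline.

    The first instruction needs n_stages cycles; every further
    instruction completes one cycle later. Hazard stalls (an assumed
    input here) are simply added on top.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def cpi(n_instructions, n_stages=5, stall_cycles=0):
    """Average cycles per instruction for the same model."""
    return pipeline_cycles(n_instructions, n_stages, stall_cycles) / n_instructions


four_instr_cycles = pipeline_cycles(4)   # the 4-instruction, 5-stage case
long_run_cpi = cpi(1000)                 # approaches 1 for long runs
```

With four instructions and five stages the model gives eight cycles, and for long instruction sequences the CPI tends towards one, which is the stated purpose of pipelining.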
Architecture methods
Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction: c = a + 5*b

    for (i=0; i<64; i++)
        c[i] = a[i] + 5*b[i];

Assembly:

    set    vl,64
    ldv    v1,0(r2)
    mulvi  v2,v1,5
    ldv    v1,0(r1)
    addv   v3,v1,v2
    stv    v3,0(r3)
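To make the MD-technique concrete, the following sketch emulates the six vector instructions with Python lists and compares the result with the scalar loop. It assumes that r1, r2 and r3 point at the arrays a, b and c, as the assembly suggests; the helper functions are hypothetical stand-ins for the vector instructions, not a real simulator.

```python
VL = 64  # vector length, as set by "set vl,64"


def ldv(memory):
    """ldv: load VL consecutive elements into a vector register."""
    return list(memory[:VL])


def mulvi(v, imm):
    """mulvi: multiply every element by an immediate value."""
    return [x * imm for x in v]


def addv(va, vb):
    """addv: element-wise add of two vector registers."""
    return [x + y for x, y in zip(va, vb)]


# Example input data (arbitrary values chosen for illustration).
a = list(range(64))
b = [2 * i for i in range(64)]

v1 = ldv(b)          # ldv   v1,0(r2)
v2 = mulvi(v1, 5)    # mulvi v2,v1,5
v1 = ldv(a)          # ldv   v1,0(r1)
v3 = addv(v1, v2)    # addv  v3,v1,v2
c = v3               # stv   v3,0(r3)

# Scalar reference: the original loop, one element per iteration.
scalar = [a[i] + 5 * b[i] for i in range(64)]
matches = (c == scalar)
```

Six vector instructions replace 64 iterations of the scalar loop, which is where the dense encoding of vector code comes from.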
Architecture methods
Powerful Instructions (1)
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD Execution Method. Over time, instruction 1, instruction 2, instruction 3, ..., instruction n are each executed on all nodes node1, node2, ..., node-K.]
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
– SIMD on restricted scale:
– Used for Multi-media instructions
• Examples
– MMX, SUN-VIS, HP MAX-2, AMDK7/Athlon 3Dnow, Trimedia II
– Example: $\sum_{i=1}^{4} |a_i - b_i|$
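The sum of absolute differences over four sub-words can be sketched as follows. The code assumes four unsigned 8-bit values packed into one 32-bit word, which is one common layout for multimedia instructions; the function names are illustrative, and a real sub-word instruction would compute all four differences in a single operation instead of a Python loop.

```python
def unpack_bytes(word):
    """Split a 32-bit word into its four unsigned 8-bit sub-words."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


def sad4(a_word, b_word):
    """Sum of |a_i - b_i| over the four packed bytes of two words."""
    pairs = zip(unpack_bytes(a_word), unpack_bytes(b_word))
    return sum(abs(a - b) for a, b in pairs)


# Bytes of a: 0x40, 0x30, 0x20, 0x10; bytes of b: 0x41, 0x30, 0x22, 0x0F
result = sad4(0x10203040, 0x0F223041)   # 1 + 0 + 2 + 1
```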
Architecture methods
Powerful Instructions (2)
MO-technique: multiple operations per instruction
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example:

field        FU 1            FU 2             FU 3             FU 4           FU 5
instruction  sub r8, r5, 3   and r1, r5, 12   mul r6, r5, r2   ld r3, 0(r5)   bnez r5, 13
VLIW architecture: central Register File
[Figure: a central register file connected to nine exec units, three per issue slot: exec units 1-3 in issue slot 1, exec units 4-6 in issue slot 2, exec units 7-9 in issue slot 3.]
TM1000 DSPCPU
5 constant
5 ALU
2 memory
2 shift
2 DSP-ALU
2 DSP-mul
3 branch
2 FP ALU
2 Int/FP ALU
1 FP compare
1 FP div/sqrt
[Figure: TM1000 DSPCPU block diagram. Register file (128 regs, 32 bit, 15 ports) feeding 5 exec units; data cache (16 kB); instruction register (5 issue slots); PC; instruction cache (32 kB).]
TriMedia TM32A processor
0.18 micron
area : 16.9mm2
200 MHz (typ)
1.4 W
7 mW/MHz
(MIPS = 0.9 mW/MHz)
[Figure: TM32A die floorplan. Labelled blocks: I/O interface, I-cache and D-cache with their tags, sequencer / decode, and the function units DSPMUL1, DSPMUL2, IFMUL1 (float), IFMUL2 (float), FALU0, FALU3, FCOMP2, FTOUGH1, ALU0 to ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2.]
Architecture methods: Powerful Instructions (2)
VLIW Characteristics
• Only RISC like operation support
– Short cycle times
• Flexible: Can implement any FU mixture
• Extensible
• Tight inter FU connectivity required
• Large instructions (up to 1000 bits)
• Not binary compatible
• But good compilers exist
Architecture methods
Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
– can instructions be executed in parallel
• User specifies multiple instruction streams
– MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
– Superscalar
• Compile into dataflow representation
– Dataflow processors
Multiple instruction issue
Three Approaches
Example code

    a := b + 15;
    c := 3.14 * d;
    e := c / f;

Translation to DDG (Data Dependence Graph)
[Figure: DDG of the example code, with three chains.
 ld &b -> + (with constant 15) -> st &a
 ld &d -> * (with constant 3.14) -> st &c
 the result of * and ld &f -> / -> st &e]
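The data dependence graph of this example can be written down directly and used to estimate how much parallelism the three statements offer. The sketch below assumes unit latency for every operation and unlimited function units; the node names and the `asap_levels` helper are illustrative only.

```python
# Each node maps to the nodes whose results it consumes.
DDG = {
    "ld b": [],
    "add": ["ld b"],          # b + 15
    "st a": ["add"],
    "ld d": [],
    "mul": ["ld d"],          # 3.14 * d
    "st c": ["mul"],
    "ld f": [],
    "div": ["mul", "ld f"],   # c / f, using the multiply result directly
    "st e": ["div"],
}


def asap_levels(ddg):
    """Earliest cycle for each node: one after its latest predecessor."""
    levels = {}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in ddg[node]), default=0)
        return levels[node]

    for node in ddg:
        level(node)
    return levels


levels = asap_levels(DDG)
critical_path = max(levels.values())       # ld d, mul, div, st e
parallelism = len(DDG) / critical_path     # operations per cycle
```

Under these assumptions the nine operations fit in four cycles, because the chain ld d, mul, div, st e cannot be shortened.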
Generated Code

Instr.  Sequential Code      Dataflow Code
I1      ld   r1,M(&b)        ld M(&b)    -> I2
I2      addi r1,r1,15        addi 15     -> I3
I3      st   r1,M(&a)        st M(&a)
I4      ld   r1,M(&d)        ld M(&d)    -> I5
I5      muli r1,r1,3.14      muli 3.14   -> I6, I8
I6      st   r1,M(&c)        st M(&c)
I7      ld   r2,M(&f)        ld M(&f)    -> I8
I8      div  r1,r1,r2        div         -> I9
I9      st   r1,M(&e)        st M(&e)
Notes:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
– No dependencies between streams; in practice communication and
synchronization required between streams
• A superscalar issues multiple instructions from sequential stream
– Obey dependencies (True and name dependencies)
– Reverse engineering of DDG needed at run-time
• Dataflow code is direct representation of DDG
Multiple Instruction Issue: Dataflow processor
[Figure: dataflow processor. Result tokens enter token matching (backed by a token store), followed by instruction generate (backed by an instruction store) and the reservation stations, which feed FU-1, FU-2, ..., FU-K.]
Instruction Pipeline Overview
[Figure: pipeline structure of CISC, RISC, superscalar, superpipelined, VLIW and dataflow processors, drawn with the stages IF, DC, RF, EX and WB. The RISC pipeline is IF, DC/RF, EX, WB. The superscalar pipeline has k parallel lanes (IF1..IFk, DC1..DCk, ISSUE, RF1..RFk, EX1..EXk, ROB, WB1..WBk). The superpipelined pipeline splits stages into sub-stages (IF1..IFs, EX1..EX5). VLIW and dataflow show k parallel RF, EX and WB lanes (RF1..RFk, EX1..EXk, WB1..WBk).]
Four dimensional representation of the
architecture design space <I, O, D, S>
[Figure: four axes on logarithmic scales: Instructions/cycle ‘I’, Operations/instruction ‘O’, Data/operation ‘D’ and Superpipelining Degree ‘S’. CISC, RISC, VLIW, Superscalar, Superpipelined, Vector, SIMD, MIMD and Dataflow are placed in this space.]
Architecture design space
Typical values of K (# of functional units or processor nodes), and
<I, O, D, S> for different architectures
Architecture     K     I     O     D     S     Mpar
CISC             1     0.2   1.2   1.1   1     0.26
RISC             1     1     1     1     1.2   1.2
VLIW             10    1     10    1     1.2   12
Superscalar      3     3     1     1     1.2   3.6
Superpipelined   1     1     1     1     3     3
Vector           7     0.1   1     64    5     32
SIMD             128   1     1     128   1.2   154
MIMD             32    32    1     1     1.2   38
Dataflow         10    10    1     1     1.2   12

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
Mpar = I*O*D*S
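The Mpar column follows from the other four by simple multiplication, which the short sketch below recomputes. The `ARCHITECTURES` dictionary only restates the I, O, D and S values listed on this slide; the names in the code are illustrative.

```python
# (I, O, D, S) per architecture, as listed on the slide.
ARCHITECTURES = {
    "CISC":           (0.2, 1.2, 1.1, 1.0),
    "RISC":           (1.0, 1.0, 1.0, 1.2),
    "VLIW":           (1.0, 10.0, 1.0, 1.2),
    "Superscalar":    (3.0, 1.0, 1.0, 1.2),
    "Superpipelined": (1.0, 1.0, 1.0, 3.0),
    "Vector":         (0.1, 1.0, 64.0, 5.0),
    "SIMD":           (1.0, 1.0, 128.0, 1.2),
    "MIMD":           (32.0, 1.0, 1.0, 1.2),
    "Dataflow":       (10.0, 1.0, 1.0, 1.2),
}


def mpar(i, o, d, s):
    """Mpar = I * O * D * S."""
    return i * o * d * s


mpar_table = {name: mpar(*values) for name, values in ARCHITECTURES.items()}
```

The products agree with the Mpar column after rounding, for example 0.2 * 1.2 * 1.1 * 1 = 0.264 for CISC and 1 * 1 * 128 * 1.2 = 153.6 for SIMD.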
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
– limits on ILP
• VLIW
– Examples
• Clustering
• Code generation
• Hands-on
General organization of an
ILP architecture
[Figure: general organization of an ILP architecture. The CPU contains an instruction fetch unit, an instruction decode unit, a register file, a bypassing network and function units FU-1 to FU-5; instruction memory and data memory are attached to it.]
Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
– multi-media (image, audio, video, 3-D)
– intelligent search and filtering engines
– neural, fuzzy, genetic computing
• More functionality
• Use of existing Code (Compatibility)
• Low Power: P = fCVdd2
4/23/2020
Platform Design
H. Corporaal and B. Mesman
24
Low power through parallelism
• Sequential Processor
– Switching capacitance C
– Frequency f
– Voltage V
– $P = f C V^2$
• Parallel Processor (two times the number of units)
– Switching capacitance 2C
– Frequency f/2
– Voltage V’ < V
– $P = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
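A small numeric example shows what the two formulas imply. All numbers below (frequency, capacitance and the two supply voltages) are hypothetical values picked for illustration; only the formulas come from the slide.

```python
def dynamic_power(f, c, v):
    """Dynamic power P = f * C * V^2."""
    return f * c * v ** 2


# Hypothetical operating point of the sequential processor.
f = 200e6        # Hz
c = 1e-9         # F, switched capacitance
v = 1.2          # V

# Parallel processor: twice the units at half the frequency, which is
# assumed here to permit a lower supply voltage.
v_scaled = 0.9   # V (assumption)

p_sequential = dynamic_power(f, c, v)
p_parallel = dynamic_power(f / 2, 2 * c, v_scaled)
ratio = p_parallel / p_sequential   # equals (v_scaled / v) ** 2
```

Because f and C cancel, the saving depends only on the squared voltage ratio: in this example the parallel design uses about 56% of the power at the same throughput.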
Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
– Using existing compiler
– Using trace analysis
• Track all the real data dependencies (RaWs) of instructions from issue
window
– register dependence
– memory dependence
• Check for correct branch prediction
– if prediction correct continue
– if wrong, flush schedule and start in next cycle
Trace analysis
Program:

    For i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:

          set  r1,0
          set  r2,3
          set  r3,&A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

Trace (16 instructions):

    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can this code be executed?
Trace analysis
Parallel Trace

    Cycle 1:  set r1,0     | set r2,3    | set r3,&A
    Cycle 2:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4
    Cycle 3:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    Cycle 4:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    Cycle 5:  brne r1,r2,Loop
    Cycle 6:  add r1,r5,3

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
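The trace analysis can be sketched in a few lines of Python. The model below tracks only true (RaW) register dependences, uses a latency of one cycle, and ignores memory dependences because the three stores go to different addresses. It also assumes that the loop-exit branch is the only mispredicted branch, so the instruction after it starts in the next cycle; that assumption is one way to arrive at the six-cycle schedule and is not stated on the slide.

```python
def loop_body(mispredicted_exit):
    """One loop iteration: (text, destination, sources, mispredicted)."""
    return [
        ("st r1,0(r3)", None, ["r1", "r3"], False),
        ("add r1,r1,1", "r1", ["r1"], False),
        ("add r3,r3,4", "r3", ["r3"], False),
        ("brne r1,r2,Loop", None, ["r1", "r2"], mispredicted_exit),
    ]


TRACE = (
    [("set r1,0", "r1", [], False),
     ("set r2,3", "r2", [], False),
     ("set r3,&A", "r3", [], False)]
    + loop_body(False) + loop_body(False) + loop_body(True)
    + [("add r1,r5,3", "r1", ["r5"], False)]
)


def schedule(trace):
    """Assign each trace instruction the earliest cycle allowed by RaW."""
    ready = {}     # register -> cycle in which its current value is written
    barrier = 0    # cycle of the last mispredicted branch
    cycles = []
    for _text, dest, sources, mispredicted in trace:
        earliest = max([barrier] + [ready.get(r, 0) for r in sources])
        cycle = earliest + 1
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle
        if mispredicted:
            barrier = cycle   # flush: later instructions start afterwards
    return cycles


cycles = schedule(TRACE)
l_serial = len(TRACE)          # 16 instructions, one per cycle
l_parallel = max(cycles)       # length of the parallel schedule
max_ilp = l_serial / l_parallel
```

Processing the trace in program order and always reading the most recent writer of a register has the same effect as perfect register renaming, so WAR and WAW hazards do not constrain this schedule.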
Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all
register WAW & WAR hazards avoided
2. Branch and Jump prediction – Perfect => all program
instructions available for execution
3. Memory-address alias analysis – addresses are known. A store
can be moved before a load provided addresses not equal
Also:
– unlimited number of instructions issued/cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (FP *,/)
Programs were compiled using MIPS compiler with maximum optimization level
Upper Limit to ILP: Ideal Processor
Integer: 18 - 60
FP: 75 - 150
[Chart: instruction issues per cycle (IPC) per program]

Program    IPC
gcc        54.8
espresso   62.6
li         17.9
fpppp      75.2
doducd     118.7
tomcatv    150.1
Window Size and Branch Impact
• Change from infinite window to examine 2000 and issue at most 64 instructions per cycle
Integer: 6 – 12
FP: 15 - 45
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd and tomcatv, with one bar per branch prediction scheme: Perfect, Selective predictor (tournament), Standard 2-bit BHT (512), Static (profile), None (no prediction).]
Impact of Limited Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)
Integer: 5 - 15
FP: 11 - 45
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd and tomcatv, with one bar per number of renaming registers: Infinite, 256, 128, 64, 32, None.]
Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd and tomcatv, with one bar per alias analysis model: Perfect, Global/stack perfect, Inspection, None.]
Window Size Impact
• Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window
Integer: 6 - 12
FP: 8 - 45
[Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd and tomcatv, with one bar per window size: Infinite, 256, 128, 64, 32, 16, 8, 4.]
How to Exceed ILP Limits of
This Study?
• WAR and WAW hazards through memory:
eliminated WAW and WAR hazards through register
renaming, but not in memory
• Unnecessary dependences
– the compiler did not unroll loops, so the dependence on the
iteration variable remains
• Overcoming the data flow limit: value prediction,
predicting values and speculating on prediction
– Address value prediction and speculation predicts
addresses and speculates by reordering loads and stores.
Could provide better aliasing analysis
Conclusions
• Amount of parallelism is limited
– higher in Multi-Media
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
– Examples
   • C6
   • TM
   • IA-64: Itanium, ....
   • TTA
• Clustering
• Code generation
• Hands-on
VLIW concept
A VLIW architecture with 7 FUs
[Figure: instruction memory feeds the instruction register, which controls the function units: 3 Int FU, 2 LD/ST and 2 FP FU. The units are connected to an Int register file and a floating point register file; the LD/ST units access data memory.]
VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC like operation support
– Short cycle times
– Easier to compile for
• Flexible: Can implement any FU mixture
• Extensible / Scalable
However:
• tight inter FU connectivity required
• not binary compatible !!
– (new long instruction format)
[Figure: VelociTI C6x datapath]
VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
– 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
– 2 x 16 registers
– One bus available to write in register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM
VLIW example: Trimedia TM1000 DSPCPU
5 constant
5 ALU
2 memory
2 shift
2 DSP-ALU
2 DSP-mul
3 branch
2 FP ALU
2 Int/FP ALU
1 FP compare
1 FP div/sqrt
[Figure: TM1000 DSPCPU block diagram. Register file (128 regs, 32 bit, 15 ports) feeding 5 exec units; data cache (16 kB); instruction register (5 issue slots); PC; instruction cache (32 kB).]
Intel Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC)
• IA-64 architecture -> Itanium, first realization
Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers
EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
– 3 * 41-bit instruction
– 5 template bits, indicate type and stop location
• Each 41-bit instruction
– starts with 4-bit opcode, and
– ends with 6-bit guard (boolean) register-id
• Supports speculative loads
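The bundle arithmetic (3 * 41 + 5 = 128 bits) can be illustrated with a small packing routine. The sketch assumes that the template occupies the five least significant bits and that the three instruction slots follow in ascending order; the function names are hypothetical and the slot contents are arbitrary bit patterns, not encoded IA-64 instructions.

```python
SLOT_BITS = 41       # width of one instruction slot
TEMPLATE_BITS = 5    # width of the template field
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


example_slots = [0x1, 0x1FFFFFFFFFF, 0x123456789]
example_bundle = pack_bundle(0x10, example_slots)
bundle_bits = SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS
```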
[Figure: Itanium]
[Figure: Itanium 2 (McKinley)]
EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a
plain VLIW:
– Function unit assignment performed at run-time
– Lock when FU results not available
• See other website for more info on IA-64:
– www.ics.ele.tue.nl/~heco/courses/ACA
– (look at related material)