Transcript Course HJ86

ASCI Winterschool on
Embedded Systems
March 2004
Renesse
Processor Components
the cornerstones of future platforms
with emphasis on ILP exploitation
Henk Corporaal
Peter Knijnenburg
Future
We foresee that many characteristics of current high-performance
architectures will find their way into the embedded domain.
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
Processor Components
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
Motivation for ILP
(and other types of parallelism)
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
– Multi-media (image, audio, video, 3-D)
– intelligent search and filtering engines
– neural, fuzzy, genetic computing
• More functionality
• Use of existing Code (Compatibility)
• Low Power: P = fCV2
Low power through parallelism
• Sequential processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – P = f·C·V²
• Parallel processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – P = (f/2) · 2C · V'² = f·C·V'² < f·C·V²
ILP Goals
• Making the most powerful single chip processor
• Exploiting parallelism between independent instructions
(or operations) in programs
• Exploit hardware concurrency
– multiple FUs, buses, reg files, bypass paths, etc.
• Code compatibility
– binary: superscalar and super-pipelined
– HLL: VLIW
• Incorporate enhanced functionality (ASIP)
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
Trends in Computer Architecture
• Bridging the semantic gap
• Performance increase
• VLSI developments
• Architecture developments: design space
• The role of the compiler
• Right match
Very simple processor
[Figure: a very simple processor: a register file (r0, r1, r2, ...) and function unit(s) form the processor datapath; MAR and MDR connect the datapath to data memory; decode logic driven by the instruction register controls the datapath.]
Bridging the Semantic Gap
Programming domains
• Application domain
• Architecture domain
• Data path domain
Example:

L_application:    A := B + C

    (SW compilation or interpretation)

L_architecture:   LD r1, M(&B)
                  LD r2, M(&C)
                  ADD r1, r1, r2
                  ST r1, M(&A)

    (HW interpretation)

L_datapath:       &B → MAR
                  MDR → r1
                  &C → MAR
                  MDR → r2
                  r1 → ALUinput-1
                  r2 → ALUinput-2
                  ALUoutput := ALUinput-1 + ALUinput-2
                  ALUoutput → r1
                  r1 → MDR
                  &A → MAR
Bridging the Semantic Gap:
Different Methods
[Figure: four ways to map an application onto operations & data transports:
• Direct Execution Architectures: direct hardware interpretation of the application
• CISC Architectures: compilation and/or software interpretation down to the architecture level, then micro-code interpretation
• RISC Architectures: compilation and/or software interpretation directly to an architecture whose operations & data transports execute directly
• Microcoded Architectures: compilation and/or software interpretation, then micro-code interpretation of the architecture]
Bridging the Semantic Gap:
What happens to the semantic level ?
[Figure: semantic level of architectures versus year (1950-2010): CISC raised the architecture level from the datapath domain toward the application domain; RISC brought it back down, and the trend beyond 2000 is an open question. The gap above the architecture is bridged by the compiler and/or interpretation; the gap below it by hardware interpretation.]
Performance Increase

[Figure: microprocessor SPECint92 and SPECfp92 ratings versus year (1978-2002), log scale from 0.1 to 1000, with exponential growth trend lines.]
• 50% SPECint improvement / year
• 60% SPECfp improvement / year
VLSI Developments
[Figure: minimum feature size (um) and density (transistors/chip, DRAM) versus year (1970-2000); density grows roughly as 2^((year-1956)·2/3).]

Cycle time:

  t_cycle ~ t_gate · #gate_levels + wiring_delay + pad_delay

What happens to these contributions?
Architecture Developments
How to improve performance?
• (Super)-pipelining
• Powerful instructions
  – MD-technique: multiple data operands per operation
  – MO-technique: multiple operations per instruction
• Multiple instruction issue
Architecture Developments
Pipelined Execution of Instructions
[Figure: simple 5-stage pipeline: instruction i enters IF in cycle i, so the IF, DC, RF, EX, and WB stages of consecutive instructions overlap; from cycle 5 on the pipeline is full.]

Stages:
• IF: Instruction Fetch
• DC: Instruction Decode
• RF: Register Fetch
• EX: Execute instruction
• WB: Write Result Register

Purpose:
• Reduce #gate_levels in the critical path
• Reduce CPI close to one
• More efficient hardware

Problems: hazards cause pipeline stalls (see the CPI sketch below)
• Structural hazards: add more hardware
• Control hazards, branch penalties: use branch prediction
• Data hazards: bypassing required

Superpipelining: split one or more of the critical pipeline stages
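A minimal sketch of how hazard stalls push CPI above the ideal of one (the stall frequencies and penalties are illustrative assumptions, not figures from the slides):

```python
# Effective CPI of a pipelined processor: the ideal CPI of 1 plus the
# average stall cycles contributed per instruction by each hazard class.
def effective_cpi(ideal_cpi=1.0, hazards=None):
    # hazards: {name: (stall events per instruction, penalty in cycles)}
    hazards = hazards or {}
    return ideal_cpi + sum(freq * penalty for freq, penalty in hazards.values())

# Hypothetical workload: 20% branches mispredicted 10% of the time at a
# 3-cycle penalty, 30% loads causing a 1-cycle load-use stall 25% of the time.
cpi = effective_cpi(hazards={
    "control": (0.20 * 0.10, 3),   # mispredicted branches
    "data":    (0.30 * 0.25, 1),   # load-use stalls
})
print(f"effective CPI = {cpi:.2f}")  # about 1.14 with these assumptions
```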
Architecture Developments
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation, e.g. a = B*c + d

Two styles:
• Vector
• SIMD

[Figure: vector execution streams instructions 1, 2, 3, ... through pipelined function units FU1, FU2, FU3, ..., FU-K over time; SIMD execution issues instructions 1, 2, ..., n to node1, node2, ..., node-K simultaneously.]
Architecture Developments
Powerful Instructions (1)
Vector Computing
• FU mix may match the application domain
• Use of interleaved memory
• FUs need to be tightly connected
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• SIMD on restricted scale: multimedia instructions
  – MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia, ...
  – Example: Σ_{i=1..4} |a_i − b_i| (see the sketch after this list)
Architecture Developments
Powerful Instructions (2)
MO-technique: multiple operations per instruction
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per FU):

  FU 1: sub r8, r5, 3
  FU 2: and r1, r5, 12
  FU 3: mul r6, r5, r2
  FU 4: ld r3, 0(r5)
  FU 5: bnez r5, 13
Architecture Developments: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support → short cycle times
• Flexible: Can implement any FU mixture
• Extensible
• Tight inter FU connectivity required
• Large instructions
• Not binary compatible
Architecture Developments
Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
• User specifies multiple instruction streams
– MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
– Superscalar
• Compile into dataflow representation
– Dataflow processors
Multiple instruction issue
Three Approaches
Example code:

  a := b + 15;
  c := 3.14 * d;
  e := c / f;

Translation to DDG (Data Dependence Graph):

[Figure: the DDG: ld &b and the constant 15 feed '+', whose result is stored to &a; ld &d and the constant 3.14 feed '*', whose result is stored to &c and also feeds '/' together with ld &f; the quotient is stored to &e.]
Generated Code

Instr.  Sequential Code        Dataflow Code
I1      ld   r1, M(&b)         ld  M(&b)   -> I2
I2      addi r1, r1, 15        addi 15     -> I3
I3      st   r1, M(&a)         st  M(&a)
I4      ld   r1, M(&d)         ld  M(&d)   -> I5
I5      muli r1, r1, 3.14      muli 3.14   -> I6, I8
I6      st   r1, M(&c)         st  M(&c)
I7      ld   r2, M(&f)         ld  M(&f)   -> I8
I8      div  r1, r1, r2        div         -> I9
I9      st   r1, M(&e)         st  M(&e)
Notes:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
  – No dependencies between streams; in practice communication and synchronization are required between streams
• A superscalar issues multiple instructions from a sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of the DDG needed at run-time
• Dataflow code is a direct representation of the DDG
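A minimal sketch (a hypothetical representation, not the course's tooling) of the DDG behind this example, making the dataflow column's "-> I.." successor lists explicit:

```python
# Data Dependence Graph for: a := b + 15; c := 3.14 * d; e := c / f
# Each instruction lists the instructions that consume its result,
# mirroring the successor annotations of the dataflow code above.
ddg = {
    "I1": {"op": "ld &b",     "succ": ["I2"]},
    "I2": {"op": "addi 15",   "succ": ["I3"]},
    "I3": {"op": "st &a",     "succ": []},
    "I4": {"op": "ld &d",     "succ": ["I5"]},
    "I5": {"op": "muli 3.14", "succ": ["I6", "I8"]},
    "I6": {"op": "st &c",     "succ": []},
    "I7": {"op": "ld &f",     "succ": ["I8"]},
    "I8": {"op": "div",       "succ": ["I9"]},
    "I9": {"op": "st &e",     "succ": []},
}

# Instructions that consume no earlier result are ready to fire at once.
consumed = {s for node in ddg.values() for s in node["succ"]}
ready = [i for i in ddg if i not in consumed]
print(ready)  # ['I1', 'I4', 'I7'] - the three loads can issue in parallel
```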
Instruction Pipeline Overview

[Figure: instruction pipelines compared:
• CISC: IF, DC, RF, EX
• RISC: IF, DC/RF, EX, WB
• Superscalar: k parallel ways, each IF, DC, ISSUE, RF, EX, ROB (reorder buffer), WB
• Superpipelined: critical stages split further, e.g. IF1...IFs and EX1...EX5
• VLIW: shared IF and DC, then k parallel RF, EX, WB lanes
• Dataflow: shown for comparison]
Four dimensional representation of the
architecture design space <I, O, D, S>
[Figure: the design space spanned by four axes: instructions/cycle 'I', operations/instruction 'O', data/operation 'D', and superpipelining degree 'S' (each running to roughly 10-100). CISC and RISC sit near the origin; Superscalar, Dataflow, and MIMD extend along I; VLIW along O; Vector and SIMD along D; Superpipelined along S.]
Architecture design space
Typical values of K (# of functional units or processor nodes), and
<I, O, D, S> for different architectures
Architecture     K    I     O    D    S    Mpar
CISC             1    0.2   1.2  1.1  1    0.26
RISC             1    1     1    1    1.2  1.2
VLIW             10   1     10   1    1.2  12
Superscalar      3    3     1    1    1.2  3.6
Superpipelined   1    1     1    1    3    3
Vector           7    0.1   1    64   5    32
SIMD             128  1     1    128  1.2  154
MIMD             32   32    1    1    1.2  38
Dataflow         10   10    1    1    1.2  12

S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

Mpar = I · O · D · S
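Reading two rows back through the formula as a check: VLIW gives Mpar = 1 · 10 · 1 · 1.2 = 12, and SIMD gives 1 · 1 · 128 · 1.2 = 153.6 ≈ 154. Note that K itself does not enter Mpar.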
The Role of the Compiler
9 steps required to translate an HLL program
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports
Division of responsibilities between hardware and compiler
[Figure: the steps Frontend → Determine Dependencies → Binding of Operands → Scheduling → Binding of Operations → Binding of Transports → Execute, split between compiler (above the line) and hardware (below it). The split shifts toward the compiler across architectures: Superscalar (compiler does only the frontend), Dataflow (plus dependence determination), Multi-threaded (plus operand binding), independence architectures (plus scheduling), VLIW (plus operation binding), and TTA (plus transport binding).]
The Right Match
[Figure: transistors per CPU chip versus year (1972-2000), log scale 10^3 to 10^8: 8-bit microprocessors near 10^4; 32-bit CISC and RISC cores near 10^5; RISC + MMU + 64-bit FP near 10^6; VLIW, Superscalar, and Dataflow near 10^7; MIMD near 10^8.]
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
RISC basics

[Figure: a 4-stage RISC pipeline (IF, DC, EX, WB) overlapping successive instructions, and the RISC datapath: a register file feeding the function unit's operand registers through an immediate mux, the ALU, a memory unit, and bypass buses with forwarding (BP-1); the instruction-fetch path is not shown.]
Why RISC?
Make the common case fast
• Reduced number of instructions
• Limited addressing modes
– load-store architecture
• Large uniform register set
• Limited number of instruction sizes
(preferably one)
– know directly where the following instruction
starts
• Limited number of instruction formats
Enables pipelining
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
ILP Processors
• Overview
• General ILP organization
• VLIW concept
– examples like: TriMedia, Mpact, TMS320C6x,
IA-64
• Superscalar concept
– examples like: HP PA-8000, Alpha 21264, MIPS R10k/R12k, Pentium I-IV, AMD K5-K7, UltraSparc
– (Ref: IEEE Micro, April 1996, HotChips issue)
• Comparing Superscalar and VLIW
General ILP processor organization
[Figure: general ILP organization: an instruction fetch unit and an instruction decode unit read from instruction memory and feed function units FU-1 ... FU-K, which share a register file and access data memory.]
ILP processor characteristics
• Issue multiple operations/instructions per
cycle
• Multiple concurrent Function Units
• Pipelined execution
• Shared register file
• Four Superscalar variants
– In-order/Out-of-order execution
– In-order/Out-of-order completion
VLIW concept
A VLIW architecture with 7 FUs:

[Figure: instruction memory issues one long instruction to three integer FUs and two load/store units connected to an integer register file, and two FP FUs connected to a floating-point register file; the load/store units access data memory.]
VLIW example: TriMedia

[Figure: TriMedia block diagram: VLIW processor (32K I$, 16K D$) with an SDRAM memory interface, a 32-bit 33 MHz PCI interface, video in (19 Mpix/s), video out (40 Mpix/s), audio in/out with stereo digital audio, 208-channel digital audio, I2C and serial interfaces, timers, and a VLD coprocessor (Huffman decoder, MPEG-1/2).]

TriMedia overview:
• 5-issue
• 128 registers
• 27 FUs
• 32-bit
• 8-way set-associative caches
• dual-ported data cache
• guarded operations
VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (32-bit each) per instruction (256 bits)
• Two clusters
  – 8 FUs: 4 FUs/cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – One port available to read from the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All operations conditional
• 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM
[Figure: VelociTI (C64) datapath, showing one cluster.]
VLIW example: IA-64
Intel/HP 64-bit VLIW-like architecture
• 128 bit instruction bundle containing 3 instructions
• 128 Integer + 128 Floating Point registers : 7-bit reg id.
• Guarded instructions
– 64-entry boolean register file; relies heavily on if-conversion to remove branches
• Specify instruction independence
– some extra bits per bundle
• Fully interlocked
– i.e. no delay slots: operations are latency compatible within family
of architectures
• Split loads
– non trapping load + exception check
Intel Itanium 2
• EPIC
• 0.18 um, 6ML
• 8 issue slots
• 1 GHz (8000 MIPS)
• 130 W (max)
• 61 MOPS/W
• 128b bundle (3x41b + 5b)
Superscalar: Concept
[Figure: superscalar organization: instruction memory feeds an instruction cache and decoder; decoded instructions wait in reservation stations before issuing to a branch unit, ALU-1, ALU-2, logic & shift, a load unit, and a store unit; the load/store units exchange address and data with the data cache (backed by data memory); results retire through a reorder buffer into the register file.]
Intel Pentium 4
• Superscalar
• 0.12 um, 6ML
• 1.0 V
• 3 issue
• >3 GHz
• 58 W
• 20-stage pipeline
• ALUs clocked at 2X
• Trace cache
Pentium 4
• Trace cache
• Hyper threading
• Add with ½-cycle throughput (1½-cycle latency)

[Figure: the add is staggered over consecutive half-cycles: add the least-significant 16 bits, forward the carry, add the most-significant 16 bits, then calculate the flags.]
P4 vs PII, PIII pipeline

Basic P6 pipeline (10 stages; introduced at 733 MHz in 0.18µ):
  1-2   Fetch, Fetch
  3-5   Decode, Decode, Decode
  6     Rename
  7     ROB Rd
  8     Rdy/Sch
  9     Dispatch
  10    Exec

Basic Pentium® 4 processor pipeline (20 stages; introduced at 1.4 GHz in 0.18µ):
  1-2    TC Nxt IP
  3-4    TC Fetch
  5      Drive
  6      Alloc
  7-8    Rename
  9      Que
  10-12  Sch, Sch, Sch
  13-14  Disp, Disp
  15-16  RF, RF
  17     Ex
  18     Flgs
  19     Br Ck
  20     Drive
Example with Higher IPC and Faster Clock!

Code sequence: Ld; Add; Add; Ld; Add; Add

• P6 @ 1 GHz: 10 clocks = 10 ns, IPC = 0.6
• Pentium® 4 processor @ 1.4 GHz: 6 clocks ≈ 4.3 ns, IPC = 1.0
Superscalar Issues
• How to fetch multiple instructions in time (across basic block boundaries)? Trace cache
• Handling control hazards: branch prediction
• Non-blocking memory system: hit over miss
• Handling dependencies: renaming
• How to support precise interrupts? ROB
• How to recover from a mispredicted branch path? ROB
Renaming
Example:

#   Original Code     Dependence   Latency   Renamed Version
1   mul r1, r2, r3                 4         mul p1, p2, p3
2   st  r1, 3(r2)     RaW          1         st  p1, 3(p2)
3   add r1, r5, #4    WaW, WaR     1         add p4, p5, #4
4   shl r2, r1, r3    RaW, WaR     1         shl p6, p4, p3
All four instructions may issue simultaneously
– (If resources are available)
Renaming is implemented using
– Reorder buffer: Pentium II/III, HP PA-8000, PowerPC 604,
SPARC64
– Direct register remapping: MIPS 10k/12k, DEC 21264
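A minimal sketch (illustrative only, not any particular pipeline's logic) of the rename-stage bookkeeping behind the table above: sources read the current architectural-to-physical mapping, and every destination write gets a fresh physical register, which is what dissolves the WaW and WaR dependencies. Physical numbering differs from the slide's (which pre-maps r1→p1 etc.), but the mechanism is the same:

```python
# Register renaming sketch: a Register Alias Table (RAT) maps each
# architectural register to its newest physical register.
from itertools import count

phys = count(1)   # stand-in for the free list: p1, p2, p3, ...
rat = {}          # architectural name -> current physical name

def rename(op, dst, *srcs):
    # Immediates (#n) pass through; registers read the current mapping
    # (first use allocates, so the example is self-contained).
    s = [r if r.startswith("#") else rat.setdefault(r, f"p{next(phys)}")
         for r in srcs]
    rat[dst] = f"p{next(phys)}"   # fresh register breaks WaW/WaR deps
    return f"{op} {rat[dst]}, {', '.join(s)}"

print(rename("mul", "r1", "r2", "r3"))  # mul p3, p1, p2
print(rename("add", "r1", "r5", "#4"))  # add p5, p4, #4 (no WaW with mul)
print(rename("shl", "r2", "r1", "r3"))  # shl p6, p5, p2 (reads renamed r1)
```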
Renaming
Mapping (after I4):

Logical register   Physical register
r1                 p4
r2                 p6
r3                 p3
r5                 p5

Note: the old mapping r1→p1 is not needed anymore; however, p1 is still active.

When may we reuse physical register p1?
– The old mapping has changed (r1→p4)
– p1 has been committed
Branch Prediction
• Branch Prediction techniques, why?
– Speculatively execute beyond branches
– Reduce branch penalties
• Classification
– Static techniques; prediction based on:
• Profiling information
• Static analysis of code: use of heuristics
– Dynamic techniques
• 1-level: Branch prediction buffer with n-bit prediction counters
• 2-level: Branch correlation using branch history
• Hybrid methods (e.g. Alpha 21264)
– Combinations of static and dynamic
Static Techniques: Heuristic Based
(Ball and Larus’93)
• Loop Branch Heuristic
  – The back-edge will be taken 88% of the time
• Pointer Heuristic
  – A comparison of two pointers will fail 60% of the time
• Call Heuristic
  – A successor block containing a call and which does not post-dominate the block containing the branch will not be taken 78% of the time
• Opcode Heuristic
  – A test of an integer for '< 0', '<= 0', or '= some constant' will fail 84% of the time
• Loop Exit Heuristic
  – A branch in a loop in which no successor block is a loop head will not exit the loop 80% of the time
Static Heuristic Based
(Ball and Larus’93)
• Return Heuristic
– A successor block containing a return will not be taken 72% of the
time
• Store Heuristic
– A successor block containing a store instruction and which does not
post-dominate will not be taken 55% of the time
• Loop Header Heuristic
  – A successor block which is a loop header or a loop pre-header (i.e. passes control unconditionally to a loop head which it dominates) and which does not post-dominate will be taken 75% of the time
• Guard Heuristic
– A successor block in which a register is used before being defined
and which does not post-dominate will be taken 62% of the time if
that register is an operand of the branch
Static Heuristic Based Prediction
When multiple predictors apply we use 'Dempster-Shafer' evidence combination:

  P_new = (P_old · P_heuristic) / (P_old · P_heuristic + (1 − P_old) · (1 − P_heuristic))

For example, if both the 'Loop Exit' heuristic (taken probability 0.8) and the 'Store' heuristic (not taken 55%, so taken probability 0.45) apply:

  P_new = 0.8 · 0.45 / (0.8 · 0.45 + (1 − 0.8) · (1 − 0.45)) = 0.766
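A direct transcription of the combination formula, reproducing the worked example:

```python
# Dempster-Shafer combination of branch-taken probabilities from
# independent static heuristics (Ball & Larus '93 style).
def combine(p_old, p_heuristic):
    return (p_old * p_heuristic) / (
        p_old * p_heuristic + (1 - p_old) * (1 - p_heuristic))

# Loop Exit heuristic: taken probability 0.8.
# Store heuristic: successor not taken 55% -> taken probability 0.45.
print(f"{combine(0.8, 0.45):.3f}")  # 0.766, matching the slide
```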
Dynamic Techniques:
Branch Prediction Buffer: 1 bit prediction
[Figure: the lower k bits of the branch address index a 2^k-entry buffer of 1-bit predictions.]

Problems
• Aliasing: the lower k bits of different branch instructions could be the same
  – Soln: store tags as well (making the buffer a cache); however, very expensive
• Loops are predicted wrong twice
  – Soln: use n-bit saturating counter prediction:
    * taken if counter >= 2^(n-1)
    * not-taken if counter < 2^(n-1)
  – A 2-bit saturating counter predicts a loop wrong only once
Using n-bit Saturating Counters
[Figure: the branch address indexes a table of n-bit saturating up/down counters; the counter value gives the prediction.]

2-bit saturating counter scheme (simulated in the sketch below):

[State diagram: states 11/T and 10/T predict taken, 01/N and 00/N predict not-taken; a taken branch (T) moves the counter up, a not-taken branch (N) moves it down, saturating at 11 and 00.]
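A minimal simulation sketch of the saturating counter scheme (the branch address and loop pattern are made up for illustration):

```python
# Bimodal predictor: a table of n-bit saturating up/down counters
# indexed by the low k bits of the branch address.
class BimodalPredictor:
    def __init__(self, k, n=2):
        self.mask = (1 << k) - 1
        self.max = (1 << n) - 1                      # 3 for 2-bit counters
        self.table = [self.max // 2 + 1] * (1 << k)  # start weakly taken

    def predict(self, pc):
        # Predict taken when the counter is in its upper half (>= 2^(n-1)).
        return self.table[pc & self.mask] > self.max // 2

    def update(self, pc, taken):
        i = pc & self.mask
        c = self.table[i]
        self.table[i] = min(c + 1, self.max) if taken else max(c - 1, 0)

bp = BimodalPredictor(k=10)
mispredicts = 0
for taken in ([True] * 9 + [False]) * 3:   # loop branch: 9 taken, 1 exit
    if bp.predict(0x400) != taken:
        mispredicts += 1
    bp.update(0x400, taken)
print(mispredicts)  # 3: exactly one miss per loop exit, as the slide claims
```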
Branch Correlation Using Branch History
Two schemes (a, k, m, n):
• PA: per-address history, a > 0
• GA: global history, a = 0

[Figure: the low a bits of the branch address select a k-bit history register in the branch history table; the history, together with m address bits, indexes pattern history tables of n-bit saturating up/down counters that deliver the prediction.]

Table size (usually n = 2): #bits = k · 2^a + 2^k · 2^m · n

Variant: Gshare (Scott McFarling '93): a GA scheme that takes the logical XOR of PC address bits and branch history bits (see the sketch below).
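A compact sketch of the gshare variant just mentioned (the table size and 2-bit counters are typical choices, not mandated by the slide):

```python
# Gshare (McFarling '93): XOR global branch history with PC bits to
# index a single table of 2-bit saturating counters.
class Gshare:
    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.history = 0                   # global history register
        self.table = [2] * (1 << k)        # 2-bit counters, weakly taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)                # index before shifting history
        self.table[i] = min(self.table[i] + 1, 3) if taken \
            else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | taken) & self.mask
```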
Predicting the Target Address
1. Branch Target Buffer (BTB)
2. Branch Folding (Store instruction in BTB)
3. Return Stack
Accuracy (taking the best combination of parameters):

[Figure: branch prediction accuracy (%) versus predictor size (64 bytes to 64 KB) for bimodal, GAs, and PAs predictors; accuracy climbs from roughly 89% toward saturation, with the best configurations shown, GA(0,11,5,2) and PA(10,6,4,2), reaching about 97-98% at large sizes.]
Comparing Superscalar and VLIW
Characteristic           Superscalar         VLIW
Architecture type        Multiple issue      Multiple operations
Complexity               High                Low
Binary code compat.      Yes                 No
Source code compat.      Yes                 Yes, if good compiler
Scheduling               Dynamic             Static
Scheduling window        10 instructions     100-1000 instructions
Speculation              Dynamic             Static
Branch prediction        Dynamic             Static
Mem ref disambiguation   Dynamic             Static
Scalability              Medium              High
Functional flexibility   High                Very high
Application              General purpose     Special purpose
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
Reducing Datapath Complexity: TTA
TTA: Transport Triggered Architecture
Overview
Philosophy
MIRROR THE PROGRAMMING PARADIGM
• Program transports, operations are side effects of
transports
• Compiler is in control of hardware transport capacity
Transport Triggered Architecture
General Structure of TTA
[Figure: general structure of a TTA: several FUs, an integer register file, an FP register file, and a boolean register file, all attached through sockets to a set of data-transport buses (move buses).]
Program TTAs

How to do data operations?
1. Transport of operands to the FU
   • operand move(s)
   • trigger move
2. Transport of results from the FU
   • result move(s)

[Figure: FU pipeline with operand and trigger registers at the input, internal stages, and a result register at the output.]

Example: add r3, r1, r2 becomes

  r1 → O_int     // operand move to integer unit
  r2 → T_add     // trigger move to integer unit
  ...            // addition operation in progress
  R_int → r3     // result move from integer unit

How to do control flow?
1. Jump:   #jump-address → pc
2. Branch: #displacement → pcd
3. Call:   pc → r; #call-address → pcd
Program TTAs

Scheduling advantages of Transport Triggered Architectures

1. Software bypassing
     R_int → r1; r1 → T_add    ⇒    R_int → r1; R_int → T_add
2. Dead writeback removal
     R_int → r1; R_int → T_add    ⇒    R_int → T_add
3. Common operand elimination
     #4 → O_int; r1 → T_add    ⇒    #4 → O_int; r1 → T_add
     #4 → O_int; r2 → T_add         r2 → T_add
4. Decouple operand, trigger and result moves completely
     r1 → O_int; r2 → T_add    ⇒    r1 → O_int
     R_int → r3                     ...
                                    r2 → T_add
                                    ...
                                    R_int → r3
TTA Advantages
Summary of advantages of TTAs
• Better usage of transport capacity
  – Instead of 3 transports per dyadic operation, about 2 are needed
  – # register ports reduced by at least 50%
  – Inter-FU connectivity reduced by 50-70%
• No full connectivity required
• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #reg.files, etc. can be changed
• TTAs are easy to design and can have short cycle times
TTA automatic DSE
[Figure: automatic design-space exploration in the Move framework: an optimizer proposes architecture parameters to a parametric compiler and a hardware generator; feedback from both, under user interaction, guides the search; candidate designs populate a Pareto curve of the solution space (performance versus cost); the chosen point yields parallel object code and a chip.]
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions
Tensilica Xtensa
• Configurable RISC
• 0.13 um
• 0.9 V
• 1 issue slot / 5-stage pipeline
• 490 MHz typical
• 39.2 mW (no mem.)
• 12500 MOPS/W
• Tool support
• Optional vector unit
• Special function units
Fine-Grained reconfigurable:
Xilinx XC4000 FPGA
[Figure: XC4000 fabric: an array of Configurable Logic Blocks (CLBs) linked by switch matrices and programmable interconnect, ringed by I/O Blocks (IOBs) with input/output buffers, slew-rate control, and passive pull-up/pull-down. Each CLB holds the F, G, and H function generators, two flip-flops with set/reset and clock-enable control, and inputs C1-C4, F1-F4, G1-G4, H1, DIN, S/R, EC, and clock K.]
Coarse-Grained reconfigurable:
Chameleon CS2000
Highlights:
• 32-bit datapath (ALU/shift)
• 16x24 multiplier
• distributed local memory
• fixed timing
Hybrid FPGAs: Virtex II-Pro
[Figure: Virtex II-Pro die: up to 16 GHz-range serial transceivers, embedded PowerPC cores, memory blocks, and reconfigurable logic blocks. Courtesy of Xilinx (Virtex II Pro).]
Reconfiguration time
HW or SW reconfigurable?

[Figure: design space spanned by datapath granularity (fine to coarse) and reconfiguration time: fine-grained FPGAs are spatially mapped and reconfigure slowly (at reset); loop buffers and multiple contexts shorten reconfiguration; coarse-grained VLIWs are temporally mapped and reconfigure every cycle; subword parallelism sits in between.]
Granularity Makes Differences
                    Fine-Grained    Coarse-Grained
                    Architecture    Architecture
Clock speed         Low             High
Configuration time  Long            Short
Unit amount         Large           Small
Flexibility         High            Low
Power               High            Low
Area                Large           Small
Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Multi-threading
• Summary and Conclusions
Simultaneous Multithreading Characteristics
• An SMT has separate front-ends for the different
threads but shares the back-end between all
threads.
• Each thread has its own
– Re-order buffer
– Branch History Register
• Registers, caches, branch prediction tables,
instruction queues, FUs etc. are shared.
Multi-threading in Uniprocessor Architectures
[Figure: issue slots versus clock cycles for a superscalar (one thread, many empty slots), concurrent multithreading (threads 1-4 alternate cycle by cycle), and simultaneous multithreading (slots of several threads filled within the same cycle); empty slots mark wasted issue bandwidth.]
Instruction Fetch Policies
• The instruction fetch policy decides from which threads to fetch each cycle.
• Performance and throughput are highly sensitive to the instruction fetch policy.
• The "standard" icount policy fetches from the thread with the fewest instructions in the front-end (see the sketch below).
• Performance of a thread depends on the policy as well as the workload, and becomes highly unpredictable.
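A minimal sketch of the icount selection rule just described (simplified: real icount counts instructions in the decode, rename, and issue-queue stages; the thread IDs and counts here are made up):

```python
# icount fetch policy: each cycle, fetch from the thread that has the
# fewest instructions sitting in the front-end pipeline stages.
def icount_pick(front_end_counts):
    # front_end_counts: {thread_id: #instructions in decode/rename/queues}
    return min(front_end_counts, key=front_end_counts.get)

print(icount_pick({0: 12, 1: 5, 2: 9}))  # thread 1: least likely to clog
```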
Resource Allocation in SMT
• Better to perform dynamic resource allocation to drive instruction fetch.
• DCRA outperforms icount in many cases.
• Possible to use resource allocation to guarantee a certain percentage of single-thread performance.
• Improves predictability and hence the suitability of SMT for real-time embedded systems.
Future Processor Components

• The new TriMedia has a deep pipeline, L1 and L2 caches, and branch prediction.
• META is a (simple) simultaneous multithreaded architecture.
• Calistro is an embedded multi-processor platform for mobile applications.
• Imagine (Stanford) combines operation-level (VLIW) and data-level parallelism (SIMD).
• TRIPS (UT Austin / IBM) and SCALE (MIT) processors combine task, operation, and data level parallelism.
Summary and Conclusions
ILP architectures have great potential
• Superscalars
– Binary compatible upgrade path
• VLIWs
– Very flexible ASIPs
• TTAs
  – Avoid control and datapath bottlenecks
  – Completely compiler controlled
  – Very good cost-performance ratio
  – Low power
• Multi-threading
– Surpass exploitable ILP in applications
– How to choose threads ?