Platform Design
Exploiting Data Level Parallelism (DLP)
SIMD architectures
TU/e 5kk70
Henk Corporaal
Bart Mesman
[Figure: spectrum from flexibility to efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]
7/20/2015
Platform Design
H. Corporaal and B. Mesman
SIMD Performance
[Figure: computational efficiency [MOPS/W], 10^0 to 10^6, versus feature size [um] (2, 1, 0.5, 0.25, 0.13, 0.07), with regions labelled application specific cores, SIMD, and programmable processors. Source: [Roza]]
SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
– Application area
– Subword parallelism
• Locally connected SIMDs
– Xetal
• Fully connected SIMDs
– Imagine
• Communication in SIMD processors
– RCSIMD
– DCSIMD
Enhance performance:
4 architecture methods
• (Super)-pipelining
• Powerful instructions
– MD-technique
• multiple data operands per operation
– MO-technique
• multiple operations per instruction
• Multiple instruction issue
Characteristics of Media Applications
• Poorly matched to conventional architectures
– Caches
– Instruction-Level Parallelism
– Few arithmetic units
• Well-matched to modern VLSI technology
– Lots (100’s - 1000’s) of ALUs fit on a single chip
• Communication/synchronization is often the bottleneck
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];

Vector instruction: c = a + 5*b

Assembly:
set    vl,64
ldv    v1,0(r2)
mulvi  v2,v1,5
ldv    v1,0(r1)
addv   v3,v1,v2
stv    v3,0(r3)
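The vector assembly sequence for c = a + 5*b can be mimicked in plain C to make the "one instruction, many operands" idea concrete. This is only a sketch: the `vreg` type and the helper functions are hypothetical stand-ins named after the mnemonics, and it assumes r1, r2 and r3 hold the base addresses of a, b and c (which the instruction order suggests but the slide does not state).

```c
#include <assert.h>
#include <stddef.h>

#define VL 64  /* vector length, matching "set vl,64" */

/* One emulated vector register: VL elements handled by one instruction. */
typedef struct { int e[VL]; } vreg;

/* ldv: load VL consecutive elements from memory. */
static vreg ldv(const int *base) {
    vreg v;
    for (size_t i = 0; i < VL; i++) v.e[i] = base[i];
    return v;
}

/* mulvi: multiply every element by an immediate value. */
static vreg mulvi(vreg a, int imm) {
    for (size_t i = 0; i < VL; i++) a.e[i] *= imm;
    return a;
}

/* addv: element-wise addition of two vector registers. */
static vreg addv(vreg a, vreg b) {
    for (size_t i = 0; i < VL; i++) a.e[i] += b.e[i];
    return a;
}

/* stv: store VL elements back to memory. */
static void stv(vreg v, int *base) {
    for (size_t i = 0; i < VL; i++) base[i] = v.e[i];
}

/* c = a + 5*b, one call per vector instruction.
   Assumption: r1 = &a[0], r2 = &b[0], r3 = &c[0]. */
void vector_axpy(const int *a, const int *b, int *c) {
    assert(a && b && c);
    vreg v1 = ldv(b);        /* ldv   v1,0(r2) */
    vreg v2 = mulvi(v1, 5);  /* mulvi v2,v1,5  */
    v1 = ldv(a);             /* ldv   v1,0(r1) */
    vreg v3 = addv(v1, v2);  /* addv  v3,v1,v2 */
    stv(v3, c);              /* stv   v3,0(r3) */
}
```

Five vector instructions replace 64 iterations of the scalar loop; on real hardware the inner loops of each helper would run in parallel rather than sequentially.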
Architecture methods
Powerful Instructions (1)
SIMD computing
SIMD Execution Method
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: execution over time: node1, node2, ..., node-K all execute Instruction 1, Instruction 2, Instruction 3, ..., Instruction n]
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
– SIMD on restricted scale:
– Used for Multi-media instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
– MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, Trimedia II
– Example: $\sum_{i=1}^{4} |a_i - b_i|$
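The sum of absolute differences over four 16-bit lanes can be sketched in portable C. Plain C has no sub-word instructions, so the loops below only emulate the lane-wise semantics; on a machine with multimedia extensions each helper would map to one or a few packed operations. All function names are hypothetical, and lane 0 is assumed to sit in the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned 16-bit lanes into one 64-bit word (lane 0 = low bits). */
uint64_t pack4x16(const uint16_t v[4]) {
    assert(v);
    uint64_t w = 0;
    for (int i = 0; i < 4; i++) w |= (uint64_t)v[i] << (16 * i);
    return w;
}

/* Lane-wise |a_i - b_i|; carries never cross a 16-bit lane boundary. */
uint64_t absdiff4x16(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        uint16_t d = (uint16_t)(x > y ? x - y : y - x);
        r |= (uint64_t)d << (16 * i);
    }
    return r;
}

/* Horizontal sum of the four 16-bit lanes. */
uint32_t sum4x16(uint64_t w) {
    uint32_t s = 0;
    for (int i = 0; i < 4; i++) s += (uint16_t)(w >> (16 * i));
    return s;
}

/* Sum of absolute differences over i = 1..4. */
uint32_t sad4(const uint16_t a[4], const uint16_t b[4]) {
    return sum4x16(absdiff4x16(pack4x16(a), pack4x16(b)));
}
```

The result is accumulated in 32 bits because four 16-bit differences can exceed the 16-bit range.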
LCSIMD
LC-SIMD (Locally connected; e.g. Xetal, Imap)
-> long communication delays: shift operations
[Figure: instruction bus driving PE0, PE1, PE2, ..., PE319]
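Why shift operations imply long communication delays can be seen in a small sketch, assuming each PE holds one value and can only pass it to its right-hand neighbour per step. The function names and the `fill` value for PE0 are hypothetical; the array size of 320 follows the PE0..PE319 labels.

```c
#include <assert.h>
#include <string.h>

#define NUM_PE 320  /* PE0 .. PE319 as in the figure */

/* One shift step: every PE takes the value of its left neighbour.
   PE0 has no left neighbour and receives `fill`. */
void shift_right_once(int reg[NUM_PE], int fill) {
    assert(reg);
    memmove(&reg[1], &reg[0], (NUM_PE - 1) * sizeof reg[0]);
    reg[0] = fill;
}

/* Move data `distance` PEs to the right. Returns the number of shift
   steps spent, which grows linearly with the distance. */
int shift_right(int reg[NUM_PE], int distance, int fill) {
    assert(distance >= 0 && distance <= NUM_PE);
    int steps = 0;
    for (int d = 0; d < distance; d++) {
        shift_right_once(reg, fill);
        steps++;
    }
    return steps;
}
```

Sending a value across the whole array therefore costs on the order of the array width in steps, whereas a fully connected network could in principle do it in one.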
FCSIMD
FC-SIMD (Fully Connected; Imagine)
-> expensive communication network
[Figure: instruction bus driving PE0, PE1, PE2, ..., PE319, all attached to a fully connected communication network]
LC: Xetal Objectives
• High degree of system integration
-> CMOS imaging + DSP
-> low cost camera systems
• Low power consumption
-> mobile & remote sensing
• Flexibility
-> programmable DSP and control functions
Xetal Architecture
Parallel Processing (SIMD)
• 2 columns / processor
• neighbour communication
• low-speed clock (16 MHz)
• clock gating
• shared address decoding
• minimal memory read access
-> LOW-POWER
Xetal Specs & Performance
Simulation Results (1-input)
Simulation Results (1-output)
Simulation Results (2)
Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure labels: Render; Encode/Decode; Encoded 2D Data; 2D Video Stream]
Stream Processing
[Figure: stream graph with input data Image 0 and Image 1, convolve kernels applied to each image, a SAD kernel, and the output data Depth Map; kernels are connected by streams]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
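The kernel-and-stream structure can be sketched in C with two small functions, one per kernel. This is a simplified illustration, not Imagine code: the 3-tap 1D filter stands in for the 2D convolutions of the stereo depth example, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Kernel 1: 3-tap 1D convolution. Consumes a stream of n samples and
   produces n-2 samples. Each output depends only on inputs, never on
   other outputs, so all outputs could be computed in parallel. */
size_t convolve3(const int *in, size_t n, const int k[3], int *out) {
    assert(in && k && out);
    if (n < 3) return 0;
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = k[0] * in[i] + k[1] * in[i + 1] + k[2] * in[i + 2];
    return n - 2;
}

/* Kernel 2: sum of absolute differences between two streams. */
long sad(const int *a, const int *b, size_t n) {
    assert(a && b);
    long s = 0;
    for (size_t i = 0; i < n; i++) s += labs((long)a[i] - (long)b[i]);
    return s;
}
```

A stream program would chain these: convolve both images, then feed the two filtered streams into `sad`, without writing the intermediate streams back to main memory.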
Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: SDRAMs, Stream Register File, and ALU clusters under SIMD/VLIW control]
Peak BW: 2GB/s (SDRAM), 32GB/s (Stream Register File), 544GB/s (ALU clusters)
Application Data Bandwidth Usage
[Figure: SDRAMs, Stream Register File, and ALU clusters]

                     Memory BW    Global RF BW   Local RF BW
Peak                 2 GB/s       32 GB/s        544 GB/s
Depth Extractor      0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder         0.47 GB/s    2.46 GB/s      121.05 GB/s
Polygon Rendering    0.78 GB/s    4.06 GB/s      102.46 GB/s
QR Decomposition     0.46 GB/s    3.67 GB/s      234.57 GB/s
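To see how strongly the hierarchy filters traffic, the table figures can be turned into ratios. The sketch below hard-codes the values from the slide (in GB/s); the struct and function names are hypothetical.

```c
#include <assert.h>

/* One table row, all bandwidths in GB/s. */
typedef struct {
    const char *name;
    double mem;        /* Memory BW    */
    double global_rf;  /* Global RF BW */
    double local_rf;   /* Local RF BW  */
} bw_row;

const bw_row PEAK = {"Peak", 2.0, 32.0, 544.0};

const bw_row APPS[4] = {
    {"Depth Extractor",   0.80, 18.45, 210.85},
    {"MPEG Encoder",      0.47,  2.46, 121.05},
    {"Polygon Rendering", 0.78,  4.06, 102.46},
    {"QR Decomposition",  0.46,  3.67, 234.57},
};

/* How many bytes move through the local register files for every
   byte fetched from memory. */
double local_to_memory_ratio(const bw_row *r) {
    assert(r && r->mem > 0.0);
    return r->local_rf / r->mem;
}

/* Fraction of the peak memory bandwidth an application uses. */
double memory_utilization(const bw_row *r) {
    assert(r);
    return r->mem / PEAK.mem;
}
```

For the depth extractor this gives a local-to-memory ratio of roughly 264 at 40% of peak memory bandwidth, which is the point of the hierarchy: most operand traffic stays inside the clusters.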
Stream Register File: Details
[Figure: SRF: single-ported 128KB SRAM (1024 x 32W), 32W/cycle, with an arbiter and stream buffers to/from the arithmetic clusters]
Arithmetic Cluster: Details
[Figure: arithmetic cluster with units labelled +, +, +, *, *, / and CU, local register files, a cross point, connections from and to the SRF, and the intercluster network]
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
– 4-cycle pipelined FMUL,FADD,FSUB,FTOI,ITOF,FFRAC
– 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
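The difference between latency and issue rate in these figures can be worked out with a one-line formula: n independent operations on a pipelined unit finish after latency + (n - 1) * interval cycles. The helper below is a hypothetical sketch; it assumes the 4-cycle units accept a new operation every cycle, which the slide implies with "pipelined" but does not state explicitly.

```c
#include <assert.h>

/* Cycles to complete n independent operations on a pipelined unit with
   the given latency and initiation interval (cycles between issues). */
unsigned long pipelined_cycles(unsigned long n, unsigned latency,
                               unsigned interval) {
    assert(latency > 0 && interval > 0);
    if (n == 0) return 0;
    return latency + (n - 1) * interval;
}
```

Eight FDIVs (latency 17, one issue every 7 cycles) take 17 + 7*7 = 66 cycles, while eight FMULs (latency 4, assumed interval 1) take 4 + 7 = 11 cycles.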
The Imagine Stream Processor
[Figure: Imagine Stream Processor block diagram: host processor, stream controller, microcontroller (2K VLIW instrs), stream register file (32kW SRAM), ALU clusters 0 to 7, network interface to the network, and streaming memory system attached to four SDRAMs]
Imagine Floorplan
• 22 million transistors
• TI GS30KA:
– 0.15 um Ldrawn
– 0.13 um Leff
– CMOS process
• 500 MHz
[Figure: floorplan with dimension labels 7.8mm and 7.6mm: stream controller, SRF control, network interface, micro-controller, ALU clusters 0 to 7, stream buffers, SRF, memory system]
Imagine Programming Environment
[Figure: tool flow. Compile-time: StreamC, KernelC, C++ compiler, stream scheduler, kernel scheduler, microcode. Run-time: Host, Imagine]

StreamC:
StereoDepthExtraction(…)
{
    // Load Input Images
    ...
    // Run Kernels
    convolve7x7 (RawImage,ConvImage);
    convolve3x3 (ConvImage,Conv2Image);
    ...
    // Store Output
}

KernelC:
Convolve7x7(…)
{
    ...
    while(!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
    }
}
Applications
• Algorithms need Dynamic communication:
– lens distortion
– bucket processing
– Mirroring,…
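Mirroring shows why such algorithms need dynamic communication: when a line is stored one pixel per PE, every PE must fetch from a different distance. The sketch below is a hypothetical illustration with made-up function names; it computes the source PE and distance, and performs the mirror with indirect addressing.

```c
#include <assert.h>
#include <stdlib.h>

/* Source PE for horizontal mirroring of a line held one pixel per PE. */
int mirror_source(int pe, int num_pe) {
    assert(num_pe > 0 && pe >= 0 && pe < num_pe);
    return num_pe - 1 - pe;
}

/* Communication distance for one PE. It differs per PE, which a single
   uniform shift applied to all PEs cannot express. */
int mirror_distance(int pe, int num_pe) {
    return abs(mirror_source(pe, num_pe) - pe);
}

/* Mirroring with indirect addressing: out[i] = in[source(i)].
   `in` and `out` must not overlap. */
void mirror_line(const int *in, int *out, int num_pe) {
    assert(in && out && in != out);
    for (int i = 0; i < num_pe; i++)
        out[i] = in[mirror_source(i, num_pe)];
}
```

With 320 PEs the distance ranges from 1 in the middle to 319 at the edges, so a shift-only network has to spend the worst-case number of steps for the whole line.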
Imap
DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2; transfers shown: PE_6 -> PE_3 and PE_4 -> PE_2]
Message format: V | dst-add | data | src-add
DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2]
Larger distance: PE_7 -> PE_1
DC-SIMD Architecture
[Figure: PE_1 to PE_7 with registers R1 to R7 on Bus_0, Bus_1 and Bus_2; transfers PE_7 -> PE_5 and PE_6 -> PE_2, with a "Priority" label]
DC-SIMD: arbitration
[Figure: PEn, PEn+1 and PEn+2 attached to a message register; message format: V | des-add | data | src-add]
Read: des-add is compared (xor) with PEid to read the data
Write: give priority to further PEs
Select (ab) for the next register:
ab   Next reg.
00   v
01   n+2
10   n+1
11   n
a = v'.2'
b = a'.v' + a.1'
Buffer instruction:
n+2 : 2.v
n+1 : (2+v).1
n   : (1+2+v).0
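The arbitration equations on this slide can be checked with a short C sketch. It rests on an interpretation that the slide does not spell out: v is taken to be the valid bit of a message already travelling in the register chain, and 2, 1, 0 are taken to be write requests from PEn+2, PEn+1 and PEn. Under that reading, a = v'.2' and b = a'.v' + a.1' encode a fixed priority order v > n+2 > n+1 > n. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of the two select bits (ab) for the next register. */
typedef enum { SEL_V = 0, SEL_N2 = 1, SEL_N1 = 2, SEL_N = 3 } select_t;

/* v: a valid message is already in flight (assumed meaning);
   r2, r1: write requests from PE n+2 and PE n+1 (assumed meaning). */
select_t arbitrate(bool v, bool r2, bool r1) {
    bool a = !v && !r2;                 /* a = v'.2'         */
    bool b = (!a && !v) || (a && !r1);  /* b = a'.v' + a.1'  */
    return (select_t)((a << 1) | b);
}

/* Buffer (stall) condition for the PE at offset 0, 1 or 2 from n:
   a PE must hold its message when something with higher priority wins. */
bool must_buffer(int pe_offset, bool v, bool r2, bool r1, bool r0) {
    assert(pe_offset >= 0 && pe_offset <= 2);
    switch (pe_offset) {
    case 2:  return r2 && v;                 /* n+2 : 2.v       */
    case 1:  return (r2 || v) && r1;         /* n+1 : (2+v).1   */
    default: return (r1 || r2 || v) && r0;   /* n   : (1+2+v).0 */
    }
}
```

Enumerating the inputs reproduces the select table: a message in flight always wins, then the furthest requesting PE, and PEn is only selected when nothing else is pending.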
Conclusions
• SIMD nicely matches
– Image applications: data-level parallelism
– VLSI efficiency: copy-paste of simple elements
• So
– Very efficient architecture for image processing
– Low power! Also by trading off clock vs performance
– High memory bandwidth with a single memory port
• But
– Programmer is burdened with vector thinking and code rewriting
– Compilers are not good at recognizing opportunities for vector execution
– How to provide the data: vector memories and register files
– Need for a “control” processor for control code and if-then-else
• Communication is a problem:
– Unable to perform indirect PE addressing -> DC-SIMD
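The if-then-else remark can be illustrated with masking, one common way to handle per-element conditions when all PEs must execute the same instruction. This technique is not taken from the slides; it is a sketch under that assumption, with a hypothetical function name, and the loop stands in for the PE array.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar meaning: out[i] = (x[i] < threshold) ? 0 : x[i].
   SIMD-style: every PE evaluates both branches and a per-PE mask
   selects the result, so no PE ever branches on its own. */
void simd_threshold(const int *x, int *out, size_t n, int threshold) {
    assert(x && out);
    for (size_t i = 0; i < n; i++) {       /* one iteration per PE */
        int mask = -(x[i] < threshold);    /* all ones if true, else 0 */
        int then_val = 0;                  /* result of the "then" branch */
        int else_val = x[i];               /* result of the "else" branch */
        out[i] = (then_val & mask) | (else_val & ~mask);
    }
}
```

Both branches cost cycles on every PE, which is one reason data-dependent control flow is awkward on a pure SIMD array.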
