Platform Design
Exploiting DLP
SIMD architectures
TU/e 5kk70
Henk Corporaal
Bart Mesman
[Figure: spectrum of processor types: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor. Axes: flexibility versus efficiency]
7/21/2015
Platform Design
H. Corporaal and B. Mesman
SIMD Performance
[Figure: computational efficiency [MOPS/W] (log scale, 10^0 to 10^6) versus feature size [um] (2, 1, 0.5, 0.25, 0.13, 0.07). Three regions from top to bottom: application specific cores, SIMD, programmable processors. Source: [Roza]]
SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
– Application area
– Subword parallelism
• Locally connected SIMDs
– Xetal
• Fully connected SIMDs
– Imagine
• Communication in SIMD processors
– RCSIMD
– DCSIMD
Enhance performance:
3 architecture methods
• (Super)-pipelining
• Powerful instructions
– MD-technique
• multiple data operands per operation
– MO-technique
• multiple operations per instruction
• Multiple instruction issue
Characteristics of Media Applications
• Poorly matched to conventional architectures
– Caches
– Instruction-Level Parallelism
– Few arithmetic units
• Well-matched to modern VLSI technology
– Lots (100’s - 1000’s) of ALUs fit on a single chip
Communication bandwidth is the scarce resource
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];
Vector instruction:
    c = a + 5*b
Assembly:
    set    vl,64        ; vector length = 64
    ldv    v1,0(r2)     ; v1 = b
    mulvi  v2,v1,5      ; v2 = 5*b
    ldv    v1,0(r1)     ; v1 = a
    addv   v3,v1,v2     ; v3 = a + 5*b
    stv    v3,0(r3)     ; store c
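The vector assembly on this slide can be mimicked in plain C to see what each instruction does to a whole vector. This is only a sketch: the helper names (`vec_set_vl`, `vec_ldv`, `vec_mulvi`, `vec_addv`, `vec_stv`, `vec_axpy5`), the `MAX_VL` limit and the 32-bit integer element type are assumptions made for illustration, not part of any real instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 64                               /* assumed maximum vector length */

typedef struct { int32_t e[MAX_VL]; } vreg_t;   /* one vector register */

static size_t vl = MAX_VL;                      /* vector-length register */

/* set vl,n : choose how many elements each vector instruction touches */
static void vec_set_vl(size_t n) {
    assert(n <= MAX_VL);
    vl = n;
}

/* ldv : load vl consecutive elements from memory */
static void vec_ldv(vreg_t *d, const int32_t *mem) {
    for (size_t i = 0; i < vl; i++) d->e[i] = mem[i];
}

/* mulvi : multiply every element by an immediate */
static void vec_mulvi(vreg_t *d, const vreg_t *s, int32_t imm) {
    for (size_t i = 0; i < vl; i++) d->e[i] = s->e[i] * imm;
}

/* addv : element-wise add of two vector registers */
static void vec_addv(vreg_t *d, const vreg_t *s1, const vreg_t *s2) {
    for (size_t i = 0; i < vl; i++) d->e[i] = s1->e[i] + s2->e[i];
}

/* stv : store vl elements back to memory */
static void vec_stv(const vreg_t *s, int32_t *mem) {
    for (size_t i = 0; i < vl; i++) mem[i] = s->e[i];
}

/* c = a + 5*b, issued in the same order as the six assembly lines */
static void vec_axpy5(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    vreg_t v1, v2, v3;
    vec_set_vl(n);
    vec_ldv(&v1, b);
    vec_mulvi(&v2, &v1, 5);
    vec_ldv(&v1, a);
    vec_addv(&v3, &v1, &v2);
    vec_stv(&v3, c);
}
```

Each helper hides a loop over `vl` elements, which is the point of the MD-technique: six instructions replace 64 iterations of the scalar loop.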
Architecture methods
Powerful Instructions (1)
SIMD computing
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: SIMD execution method. Over time, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed by all nodes (node1, node2, ..., node-K)]
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
– SIMD on a restricted scale
– Used for multimedia instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
– MMX, SUN-VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
– Example: sum over i=1..4 of |ai-bi|
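The idea of using one 64-bit ALU as 4 x 16-bit ALUs can be sketched in portable C. The function names (`pack4x16`, `lane16`, `padd4x16`, `sad4x16`) are made up for this illustration and do not correspond to MMX, VIS or any other real instruction set. `padd4x16` uses a single 64-bit addition plus masking to add four lanes at once; `sad4x16` computes the slide's sum of |ai-bi| lane by lane, which a multimedia instruction would do in one operation.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_MSB 0x8000800080008000ULL   /* top bit of each 16-bit lane */

/* Pack four 16-bit values into one 64-bit word (lane 0 = lowest bits). */
static uint64_t pack4x16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3) {
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Extract lane i (0..3) from a packed word. */
static uint16_t lane16(uint64_t w, unsigned i) {
    assert(i < 4);
    return (uint16_t)(w >> (16 * i));
}

/* Four independent 16-bit additions (wrapping) with one 64-bit add:
   add the low 15 bits of each lane, then fix up the top bits with XOR
   so that no carry crosses a lane boundary. */
static uint64_t padd4x16(uint64_t x, uint64_t y) {
    uint64_t low = (x & ~LANE_MSB) + (y & ~LANE_MSB);
    return low ^ ((x ^ y) & LANE_MSB);
}

/* Sum of absolute differences over the four unsigned 16-bit lanes. */
static uint32_t sad4x16(uint64_t a, uint64_t b) {
    uint32_t sum = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint16_t ai = lane16(a, i), bi = lane16(b, i);
        sum += (ai > bi) ? (uint32_t)(ai - bi) : (uint32_t)(bi - ai);
    }
    return sum;
}
```

The masking in `padd4x16` is what dedicated sub-word hardware avoids: it simply cuts the carry chain at lane boundaries.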
LCSIMD
LC-SIMD (Locally connected; e.g. Xetal, Imap)
=> long communication delays: shift operations
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instructions bus, each PE connected to its neighbours only]
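Why local connections lead to long communication delays can be seen in a small model. The sketch below assumes 320 PEs (PE0 to PE319, as in the figure), one register per PE, and a hypothetical shift step in which every PE copies from its left neighbour; the names `shift_from_left` and `fetch_from_distance` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_PE 320   /* PE0 .. PE319, as in the figure */

/* One communication step: every PE takes over the register of its left
   neighbour. PE0 has no left neighbour and receives 0. The serial loop
   models what all PEs do simultaneously. */
static void shift_from_left(int reg[NUM_PE]) {
    for (size_t i = NUM_PE - 1; i > 0; i--) reg[i] = reg[i - 1];
    reg[0] = 0;
}

/* Give every PE the value held `distance` PEs to its left.
   Returns the number of shift steps that were needed. */
static unsigned fetch_from_distance(const int src[NUM_PE], int dst[NUM_PE],
                                    unsigned distance) {
    assert(distance < NUM_PE);
    memcpy(dst, src, sizeof(int) * NUM_PE);
    unsigned steps = 0;
    for (; steps < distance; steps++) shift_from_left(dst);
    return steps;
}
```

In this model the cost grows linearly with the distance: a neighbour access takes one step, while data from the far end of the array takes up to 319.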
FCSIMD
FC-SIMD (Fully Connected; Imagine)
=> expensive communication network
[Figure: PE0, PE1, PE2, ..., PE319 on a shared instructions bus, all attached to a fully connected communication network]
LC: Xetal Objectives
• High degree of system integration
=> CMOS imaging + DSP
=> low cost camera systems
• Low power consumption
=> mobile & remote sensing
• Flexibility
=> programmable DSP and control functions
Xetal Architecture
Parallel Processing (SIMD)
• 2 columns / processor
• neighbour communication
=> low-speed clock (16 MHz)
=> clock gating
=> shared address decoding
=> minimal memory read access
-> LOW-POWER
Parallel Processing (Contd.)
Xetal Specs & Performance
Simulation Results (1-input)
Simulation Results (1-output)
Simulation Results (2)
Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure: application sketches: Render; Encode/Decode between a 2D Video Stream and Encoded 2D Data (101100 010110 001001)]
Stream Processing
[Figure: stream/kernel graph. Input data Image 0 and Image 1 each pass through convolve kernels; the resulting streams feed a SAD kernel that produces the output data, a Depth Map]
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
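The kernel/stream structure of this slide can be sketched as two small C kernels chained through intermediate buffers. This is a deliberately simplified model: it works on one image row, uses an assumed 3-tap [1 2 1] filter with clamped borders instead of the real convolution kernels, and the names `convolve3`, `sad` and `row_mismatch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Kernel 1: 3-tap filter [1 2 1] over a row; border pixels are clamped.
   Every output pixel depends only on input pixels, never on other outputs. */
static void convolve3(const int *in, int *out, size_t n) {
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        int left  = in[i > 0 ? i - 1 : 0];
        int right = in[i + 1 < n ? i + 1 : n - 1];
        out[i] = left + 2 * in[i] + right;
    }
}

/* Kernel 2: sum of absolute differences between two streams. */
static int sad(const int *a, const int *b, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) sum += abs(a[i] - b[i]);
    return sum;
}

/* Stream program: filter both input rows, then compare the results.
   tmp0/tmp1 play the role of the intermediate streams between kernels. */
static int row_mismatch(const int *row0, const int *row1, size_t n,
                        int *tmp0, int *tmp1) {
    convolve3(row0, tmp0, n);
    convolve3(row1, tmp1, n);
    return sad(tmp0, tmp1, n);
}
```

Each input pixel is read once and the intermediate streams are consumed immediately by the next kernel, which matches the "little data reuse" and "highly data parallel" properties listed above.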
Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: SDRAM banks feed the Stream Register File, which feeds the ALU clusters under SIMD/VLIW control. Peak BW: 2GB/s (SDRAM), 32GB/s (Stream Register File), 544GB/s (ALU clusters)]
Application Data Bandwidth Usage
[Figure: SDRAM -> Stream Register File -> ALU clusters, as on the previous slide]

                     Memory BW    Global RF BW   Local RF BW
Peak                 2 GB/s       32 GB/s        544 GB/s
Depth Extractor      0.80 GB/s    18.45 GB/s     210.85 GB/s
MPEG Encoder         0.47 GB/s    2.46 GB/s      121.05 GB/s
Polygon Rendering    0.78 GB/s    4.06 GB/s      102.46 GB/s
QR Decomposition     0.46 GB/s    3.67 GB/s      234.57 GB/s
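To make the bandwidth figures on this slide easier to compare, the sketch below stores them in a small table and derives two ratios: local register file traffic per unit of memory traffic, and the fraction of the 2 GB/s peak memory bandwidth actually used. The type and function names are illustrative only; the numbers are copied from the slide.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double mem_gbs;   /* memory bandwidth used, GB/s */
    double srf_gbs;   /* global (stream) register file bandwidth, GB/s */
    double lrf_gbs;   /* local register file bandwidth, GB/s */
} bw_row_t;

static const bw_row_t bw_table[] = {
    {"Depth Extractor",   0.80, 18.45, 210.85},
    {"MPEG Encoder",      0.47,  2.46, 121.05},
    {"Polygon Rendering", 0.78,  4.06, 102.46},
    {"QR Decomposition",  0.46,  3.67, 234.57},
};
#define BW_ROWS (sizeof bw_table / sizeof bw_table[0])

static const double PEAK_MEM_GBS = 2.0;   /* peak memory BW from the slide */

/* How many bytes move through the local register files per byte
   fetched from memory. */
static double lrf_to_mem_ratio(size_t row) {
    assert(row < BW_ROWS);
    return bw_table[row].lrf_gbs / bw_table[row].mem_gbs;
}

/* Fraction of the peak memory bandwidth the application actually uses. */
static double mem_utilisation(size_t row) {
    assert(row < BW_ROWS);
    return bw_table[row].mem_gbs / PEAK_MEM_GBS;
}
```

For the depth extractor the local register files carry roughly 264 times the memory traffic, which illustrates how the hierarchy keeps most data movement close to the ALUs.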
Stream Register File: Details
[Figure: SRF: single-ported 128KB SRAM (1024 x 32W), 32W/cycle, with an arbiter and stream buffers to/from the arithmetic clusters]
Arithmetic Cluster: Details
[Figure: arithmetic cluster with functional units (+, +, +, *, *, /) and a CU attached to the intercluster network, local register files and a cross point; data arrives from the SRF and returns to the SRF]
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
– 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
– 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
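The latency and issue-rate figures above translate into a simple cycle count for a run of independent operations. The helper below is a sketch; `pipelined_cycles` is a hypothetical name, and treating the 4-cycle operations as accepting a new operation every cycle is an assumption, since the slide only states their latency.

```c
#include <assert.h>

/* Cycles until the last of n independent operations completes on a unit
   with the given latency that can start a new operation every
   `interval` cycles. */
static unsigned long pipelined_cycles(unsigned long n, unsigned latency,
                                      unsigned interval) {
    assert(latency >= 1 && interval >= 1);
    if (n == 0) return 0;
    return latency + (n - 1) * (unsigned long)interval;
}
```

For example, `pipelined_cycles(10, 17, 7)` gives 80 cycles for ten FDIVs, compared with 170 cycles if each divide had to wait for the previous one to finish.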
The Imagine Stream Processor
[Figure: Imagine Stream Processor block diagram: host processor, stream controller, microcontroller (2K VLIW instrs), ALU clusters 0 to 7, stream register file (32kW SRAM), streaming memory system with four SDRAMs, and a network interface to the network]
Imagine Floorplan
• 22 million transistors
• TI GS30KA:
– 0.15 µm Ldrawn
– 0.13 µm Leff
– CMOS process
• 500 MHz
[Figure: floorplan showing ALU clusters 0 to 7, SRF, stream buffers, micro-controller, stream controller, SRF control, network interface and memory system; dimensions of 7.8mm and 7.6mm are marked]
Imagine Programming Environment
StreamC:
StereoDepthExtraction(…)
{
    // Load Input Images
    ...
    // Run Kernels
    convolve7x7 (RawImage,ConvImage);
    convolve3x3 (ConvImage,Conv2Image);
    ...
    // Store Output
}

KernelC:
Convolve7x7(…)
{
    ...
    while(!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
    }
}

[Figure: tool flow. Compile-time: StreamC goes through the stream scheduler and the C++ compiler to the Host; KernelC goes through the kernel scheduler to microcode. Run-time: Host and Imagine]
Applications
• Some algorithms need dynamic communication:
– lens distortion
– bucket processing
– mirroring, …
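Mirroring is a compact example of why such algorithms stress the communication network: every PE needs data from a different distance. The sketch below assumes a 1D array of `n` PEs with one value each; `mirror_gather` and `mirror_distance` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mirror a row: PE number pe receives the value of PE number n-1-pe.
   The source index is computed per PE, i.e. indirect PE addressing. */
static void mirror_gather(const int *src, int *dst, size_t n) {
    for (size_t pe = 0; pe < n; pe++) dst[pe] = src[n - 1 - pe];
}

/* Distance, in PEs, between a PE and its mirror partner. */
static size_t mirror_distance(size_t pe, size_t n) {
    assert(pe < n);
    size_t partner = n - 1 - pe;
    return partner > pe ? partner - pe : pe - partner;
}
```

With 320 PEs the distance ranges from 1 in the middle of the array to 319 at the edges, so a single uniform shift count cannot serve all PEs at once.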
Imap
Imap
D-SIMD Architecture
[Figure: PE_1 ... PE_7 with R1 ... R7, connected by Bus_0, Bus_1 and Bus_2. Example transfers: PE_6 -> PE_3 and PE_4 -> PE_2]
Message format: V | dst-add | src-add | data
D-SIMD Architecture
[Figure: PE_1 ... PE_7 with R1 ... R7, connected by Bus_0, Bus_1 and Bus_2. Larger distance: PE_7 -> PE_1]
D-SIMD Architecture
[Figure: PE_1 ... PE_7 with R1 ... R7, connected by Bus_0, Bus_1 and Bus_2. Priority: PE_7 -> PE_5 and PE_6 -> PE_2]
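DCSIMD: arbitration
Read: PEid xor des-add of the passing message (V | des-add | src-add | data) -> read data
Write: give priority to further PEs
[Figure: PEn, PEn+1 and PEn+2 feed a selector, controlled by ab, that drives the next reg.; labels: Buffer instruction, Next reg.]
Select (ab), where v, 2, 1 and 0 denote the valid bits of the passing message and of PEn+2, PEn+1 and PEn:
ab = 00 : v
ab = 01 : n+2
ab = 10 : n+1
ab = 11 : n
a = v’.2’
b = a’.v’ + a.1’
Selected when:
n+2 : 2.v’
n+1 : (2+v)’.1
n : (1+2+v)’.0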
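The two select equations on this slide, a = v’.2’ and b = a’.v’ + a.1’, can be written out directly in C. The sketch rests on one reading of the garbled slide, which is an assumption: v, 2 and 1 are the valid bits of the passing message, PEn+2 and PEn+1, and the codes ab = 00, 01, 10, 11 select the passing message, n+2, n+1 and n respectively. The names `select_t` and `dcsimd_select` are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Encoded as the two-bit value ab (a is the high bit). */
typedef enum {
    SEL_PASSING = 0,   /* ab = 00: forward the message already on the bus */
    SEL_N2      = 1,   /* ab = 01: accept the message from PEn+2 */
    SEL_N1      = 2,   /* ab = 10: accept the message from PEn+1 */
    SEL_N       = 3    /* ab = 11: accept the message from PEn   */
} select_t;

/* v: passing message valid; valid2/valid1: PEn+2 / PEn+1 want to write. */
static select_t dcsimd_select(bool v, bool valid2, bool valid1) {
    bool a = !v && !valid2;                    /* a = v'.2'         */
    bool b = (!a && !v) || (a && !valid1);     /* b = a'.v' + a.1'  */
    assert(!(a && v));   /* a can only be set when nothing is passing */
    return (select_t)((a << 1) | b);
}
```

Under this reading the priority order is passing message, then PEn+2, then PEn+1, then PEn, which is one way to realise "give priority to further PEs".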
Conclusions
• SIMD nicely matches
– Image applications: data-level parallelism
– VLSI efficiency: copy-paste of simple elements
• So
– Very efficient architecture for image processing
– Low power! Also by trading off clock vs performance
• But
– Programmer is burdened with vector thinking
– Vectorizing compilers are not good at recognizing opportunities for vector execution
– Need for a “control” processor for control code and if-then-else
• Communication is a problem:
– Dimensioned for peak BW requirements -> RCSIMD
– Unable to perform indirect PE addressing -> DCSIMD
Assignment
• Select and implement a suitable, unique application (algorithm) on the Imagine architecture.
• Download the tools from: http://www.ics.ele.tue.nl/~hfatemi/5kk10/
• To use the tools, you need to install Visual C++ on your computer.
• First read beginner_guide.pdf to get familiar with the tools.
• Remarks:
– All executable files are ready (it is not necessary to compile the tools)
– Instead of adding “iscd_preproc = C:\Program Files\Microsoft Visual Studio\VC98\Bin\CL.EXE” to your environment variables, do the following:
• Add “C:\Program Files\Microsoft Visual Studio\VC98\Bin” to your path.
• Add “iscd_preproc = CL.EXE” to your environment variables.
• Possible exam questions: http://www.ics.ele.tue.nl/~heco/courses/pam/Imagine%20assignment.html
• More questions: contact Hamed Fatemi, PT 9.19