ClearSpeed's CSX600


Programming a Heterogeneous Data Parallel Coprocessor using Cn
Ray McConnell, CTO
Company Confidential © ClearSpeed 2006 | www.clearspeed.com
Extremely Cool - Extremely Fast
For the world's most compute-intensive applications, ClearSpeed provides low-power, high-performance parallel processing solutions.
CSX Technology
What are ClearSpeed’s products?
• A new acceleration coprocessor, the CSX600
  – Assists the serial CPU running compute-intensive math libraries
  – Can be integrated next to the main CPU on the motherboard
  – …or installed on add-in cards, e.g. PCI-X, PCI Express
  – …or embedded, e.g. aerospace, auto, medical, defence
• Significantly accelerates libraries and applications
  – Libraries: Level 3 BLAS, LAPACK, FFTW
  – ISV apps: MATLAB, LS-DYNA, ABAQUS, AMBER, etc.
  – In-house codes: using the SDK to port kernels
• ClearSpeed's Advance™ board: aimed at the server market
  – Dual CSX600 coprocessors
  – R∞ ≈ 50 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – 133 MHz PCI-X
  – Low power: less than 25 watts
CSX600 coprocessor layout
• Array of 96 Processor Elements
• 250 MHz
• IBM 0.13 µm FSG process, 8-layer copper metal
• 47% logic, 53% memory
  – More logic than most processors!
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating-point hardware
• 15 mm × 15 mm die size
• 128 million transistors
• Approx. 10 watts
CSX600 processor core
• Multi-Threaded Array Processing
  – Programmed in high-level languages
  – Hardware multi-threading for latency tolerance
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
  – Bi-endian (compatible with host CPU)
• Array of 96 Processor Elements (PEs)
  – Each is a Very Long Instruction Word (VLIW) core, not just an ALU
  – Flexible data-parallel processing
  – Built-in PE fault tolerance and resiliency
• High performance, low power dissipation
CSX600 Processing Elements
Each PE is a VLIW core:
• Multiple execution units
• 4-stage floating-point adder (32/64-bit, IEEE 754)
• 4-stage floating-point multiplier (32/64-bit, IEEE 754)
• Divide/square-root unit
• Fixed-point MAC (16×16 → 32+64)
• Integer ALU with shifter
• Load/store
• High-bandwidth, 5-port register file (3 read, 2 write)
• Closely coupled 6 KB SRAM for data
• High-bandwidth per-PE DMA (PIO)
• Per-PE address generators
  – Complete pointer model, including parallel pointer chasing and vectors of addresses
Advance™ Dual CSX600 PCI-X accelerator board
– 50 DGEMM GFLOPS sustained
– 0.4 million 1K-point complex single-precision FFTs/s (20 GFLOPS)
– ~200 Gbytes/s aggregate bandwidth to on-chip memories
– 6.4 Gbytes/s aggregate bandwidth to local ECC DDR2 DRAM
– 1 Gbyte of local DRAM (512 Mbytes per CSX600)
– ~1 Gbyte/s to/from the board via PCI-X @ 133 MHz
– < 25 watts for the entire card (8″ single-slot PCI-X)
Advance™ Dual CSX600 PCI-X accelerator board
[Block diagram of the board's data paths: 512 Mbytes–2 Gbytes of local DRAM, two 3.2 Gbytes/s memory interfaces, a 1.6 Gbytes/s link, and a 1 Gbyte/s host connection.]
Which applications can be accelerated?
Any applications with significant data parallelism:
• Fine-grained – vector operations
• Medium-grained – unrolled independent loops
• Coarse-grained – multiple simultaneous data channels/sets
Example applications and libraries include:
• Linear algebra – BLAS, LAPACK
• Bio-informatics – AMBER, GROMACS, GAUSSIAN, CPMD
• Computational finance – Monte Carlo, genetic algorithms
• Signal processing – FFT (1D, 2D, 3D), FIR, Wavelet
• Simulation – FEA, N-body, CFD
• Image processing – filtering, image recognition, DCTs
• Oil & Gas – Kirchhoff Time/Wave Migration
• Intelligent systems – artificial neural networks
ClearSpeed applications strategy
[Diagram (legend: strong vs. weak dependence): application areas – Oil & Gas seismic processing, CAE, computational chemistry (Gaussian, GAMESS, CPMD, AMBER, density functional theory), MATLAB, Mathematica, and Top500 LINPACK – mapped onto standard math libraries: BLAS and LAPACK (matrix arithmetic, linear algebra), FFT (fast Fourier transforms), and OpenMD (molecular dynamics).]
ClearSpeed applications strategy
• Provide transparent acceleration of widely used standard libraries
  – Initially targeting BLAS, LAPACK, FFTW
    • Compatible with Intel MKL, AMD ACML, …
  – Works just like OpenGL, via shared libraries and dynamically-linked libraries (DLLs)
  – Plug-and-play acceleration under Linux and Windows (see the sketch below)
• Port key widely used applications
  – Choose open source where possible, for dissemination
  – Have ported GROMACS; now porting AMBER
• Create “template” example applications
• Encourage the creation and adoption of standard libraries
  – OpenMD, OpenFFT
• Work with customers to port proprietary codes
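To make the plug-and-play point concrete, here is a minimal host-side sketch: the program calls the standard Fortran BLAS symbol dgemm_ and never mentions ClearSpeed. Under the model described above, linking against (or LD_PRELOADing) the accelerated BLAS shared library would be all that is needed to offload the call; the preload path in the comment is illustrative, not ClearSpeed's actual packaging.

/* dgemm_example.c: calls the standard BLAS DGEMM entry point.
 * Build against any BLAS, e.g.:  gcc dgemm_example.c -lblas
 * Hypothetical swap-in of an accelerated BLAS:
 *   LD_PRELOAD=/path/to/accelerated_blas.so ./a.out
 */
#include <stdio.h>
#include <stdlib.h>

/* Standard Fortran-interface prototype: all arguments passed by reference. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void) {
    int n = 1024;
    double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* C = alpha*A*B + beta*C; whichever BLAS is loaded does the work. */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("c[0] = %g\n", c[0]);   /* expect 2048 = 1024 * (1.0 * 2.0) */
    free(a); free(b); free(c);
    return 0;
}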
Application acceleration structure
BLAS/LAPACK/FFTW uses
• Software known to use BLAS, LAPACK, FFTW…
– MATLAB, Mathematica, Maple, Octave, …
– LINPACK, HPCC
– IMSL, BCSLIB-EXT, SuperLU, NAG
• FEA, CFD, Finance codes
– ABAQUS, ANSYS, MSC (Nastran, Marc, ADAMS), …
– LS-DYNA parallel implicit (uses BCSLIB-EXT)
– CPMD, Molpro, NWChem, GAMESS, Gaussian, …
– Some silicon design (EDA) tools
– Numerous Oil & Gas in-house codes
– Many, many more!
• ClearSpeed has a profiler (ClearTrace) for analysing an application’s use of standard libraries
High Performance LINPACK (HPL)
Consider a LINPACK run of 10,000 unknowns, which makes many matrix multiply (DGEMM) calls, starting at a size of ≈ 25×10⁹ FMACs (≈ 5×10¹⁰ flops) and reducing in size each time.

[Diagram: HPL runs on the main CPU(s); each DGEMM call moves data from system memory through the BLAS system library and back. The first DGEMM takes e.g. 10 s at 5 GFLOPS, the next DGEMM takes 9.5 s, etc.]
Speeding up HPL via accelerated BLAS
Consider exactly the same system as before, but with a ClearSpeed accelerator board installed. The ClearSpeed BLAS library intercepts calls to the system BLAS libraries and offloads them for acceleration.

[Diagram: the same flow, but the ClearSpeed BLAS library now routes DGEMM calls to the accelerator board. The first DGEMM takes e.g. 1 s at 50 GFLOPS, the next DGEMM takes 0.95 s, etc.]
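The interception mechanism itself can be sketched generically: a shared library that defines dgemm_ and forwards to the real system BLAS located with dlsym(RTLD_NEXT, ...). This illustrates the shared-library technique the slides describe, not ClearSpeed's actual code; a real accelerated library would hand large calls to the board at the point where this sketch merely reports them.

/* blas_shim.c: build as a shared library and preload it:
 *   gcc -shared -fPIC blas_shim.c -o blas_shim.so -ldl
 *   LD_PRELOAD=./blas_shim.so ./application
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void (*dgemm_fn)(const char *, const char *,
                         const int *, const int *, const int *,
                         const double *, const double *, const int *,
                         const double *, const int *,
                         const double *, double *, const int *);

void dgemm_(const char *ta, const char *tb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    static dgemm_fn real_dgemm = NULL;
    if (!real_dgemm)   /* find the BLAS the application would have used */
        real_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");

    if (*m >= 512 && *n >= 512 && *k >= 512) {
        /* An accelerated BLAS would offload this call to the board;
         * this generic sketch just reports it and falls through. */
        fprintf(stderr, "intercepted DGEMM %d x %d x %d\n", *m, *n, *k);
    }
    real_dgemm(ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}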
CSX600 Level 3 BLAS performance
[Bar chart: Matrix Multiply (DGEMM) GFLOPS by system. ClearSpeed Advance board: 50.0. The comparison systems – Intel Itanium 2 (1.6 GHz), Intel Pentium D 950 (3.4 GHz), AMD Opteron 285 (2.6 GHz), IBM POWER5 (1.9 GHz), IBM PowerPC 970, IBM BG/L (700 MHz), and NEC SX-8 (2 GHz) – range from 5.2 to 14.4 GFLOPS. Source: vendor websites.]
CSX600 Level 3 BLAS power efficiency
[Bar chart: Matrix Multiply (DGEMM) MFLOPS per watt for the same systems. ClearSpeed Advance board: 2000. The others range from 60 to 431 MFLOPS per watt. Source: vendor websites.]
DGEMM performance from hardware
[Plot: sustained DGEMM GFLOPS (0–50) versus matrix size (0–6144, in steps of 384).]
CSX600 LINPACK performance on 4-core machine
ClearSpeed applications strategy
[Diagram: applications placed along an axis from bandwidth-limited to core-limited, with ~1 byte per flop in the middle. Bandwidth-limited: Stream, DAXPY, DDOT, SparseMV, SPECfp2000. In between: weather, seismic, fluid dynamics, ocean models, petroleum reservoir, auto NVH, auto crash, GAMESS, NWChem. Core-limited: LINPACK, DGEMM. ClearSpeed plays at the core-limited end today; with PCI-e and next generations, ClearSpeed plays further along the axis.]
MATLAB acceleration
Plug-and-play MATLAB acceleration:
• Original time on 3.2 GHz x86: 8.1 seconds
• Time with ClearSpeed FFT acceleration: 1.6 seconds
• Time with ClearSpeed convolution acceleration: 1.2 seconds
• 6X acceleration!
Software
Software development environment
Software Development Kit (SDK)
• Cn compiler (ANSI C-based commercial compiler), assembler, libraries, ddd/gdb-based debugger, newlib-based C runtime library, etc.
• Extensive documentation and training
• CSX600 dual-processor development boards
Microcode Development Kit (MDK)
• Microcode compiler, debugger, and standard ISET
Both available for Linux and Windows
Gdb/ddd debugger
Port of standard gdb enables most GUIs to “just work” with the CSX600:
• Hardware supports single-step, breakpoints, etc.
• The gdb port is multi-everything (thread, processor, and board)
• Visualize all the state in the PEs
• Hardware performance counters are also exposed via gdb
Thread profiler
• The CSX600 is 8-way threaded:
– Typically 1 compute thread and 1 or more I/O threads
• The hardware supports tracing in real-time:
– Thread switches
– I/O operation start/finish
CSX600 from a programmer’s perspective
• mono execution unit and poly execution unit
  – Instructions can be executed in either domain
  – mono variables are scalar (a single value)
  – poly variables are vector (multiple values)
• 2 domains, 2 types of memory:
  – mono memory (e.g. card memory)
  – poly memory (embedded in the poly execution unit)
Cn: Extending C for SIMD Array Programming
• New keywords
  – mono and poly storage qualifiers
    • mono is a serial (single) variable
    • poly is a parallel (vector) variable
• Contrast the two types:
  – mono:
    • One copy exists, on the mono execution unit
    • Visible to all processing elements in the poly execution unit
    • mono is assumed unless poly is specified
  – poly:
    • One copy per processing element in the poly execution unit
    • Visible to a single processing element
    • Data can be shared between PEs via “swazzle”
    • Not visible to the mono execution unit
Cn - Variables
• poly variables are akin to an array of mono variables. Consider:

  int ma, mb, mc;        /* mono (scalar): a single copy */
  mc = ma + mb;          /* one addition                 */

  poly int pa, pb, pc;   /* poly: one copy per PE        */
  pc = pa + pb;          /* one addition on every PE     */

• The variables pa, pb, pc exist on all PEs
  – Default configuration: 96 PEs
Cn - Broadcast
  int ma;
  poly int pb, pc;
  pc = ma + pb;   /* ma is broadcast to every PE */

• The mono variable ma is broadcast to all PEs in the poly execution unit
Cn - Pointers
• mono and poly can be used to qualify both a pointer and its target:

  mono int * mono mPmi;   /* mono ptr to mono int */
  poly int * mono mPpi;   /* mono ptr to poly int */
  mono int * poly pPmi;   /* poly ptr to mono int */
  poly int * poly pPpi;   /* poly ptr to poly int */

• Most commonly used type:
  – mono ptr to poly type:
    poly <type> * mono <varname>
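A minimal sketch of that common case, assuming the Cn semantics above and the default 96-PE configuration: the pointer value is held once in mono space, but because its target is poly, one dereference acts on the corresponding location in every PE.

  poly int data;              /* one instance of 'data' in each PE's memory */
  poly int * mono p = &data;  /* a single pointer value, held in mono       */
  *p = 42;                    /* writes 'data' on every PE at once          */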
Cn mono to poly pointers
• mono ptr to poly int:

  poly int * mono mPpi;

[Diagram: the pointer value is stored once, in mono memory; it addresses the same location within each PE's poly memory. Note: points to the same location in each PE.]
Cn – Poly to mono pointers
• Dereferencing a poly pointer to mono is not permitted:

  mono int * poly pPmi;
  poly int Pi;
  Pi = *pPmi;   // Not permitted

• Instead, the access is available through a Cn library function call:

  mono int * poly pPmi;
  poly int Pi;
  memcpym2p(&Pi, pPmi, sizeof(int));   // OK: copy from mono to poly memory
Cn - Conditionals
• if statements in Cn depend on the multiplicity of the condition
• A poly execution unit (SIMD) can NOT skip poly conditional code
  – Single instruction stream for all PEs
  – All PEs execute instructions in lockstep
  – All code must be issued, but not necessarily executed
• Example:

  poly int a, b, x;
  if (a == b)   /* true on some PEs, false on some     */
    x = 1;      /* always issued, ignored where a != b */
  else
    x = 0;      /* always issued, ignored where a == b */
Porting code
/* Serial C version: */
void daxpy(double *c, double *a, double alpha, uint N) {
  uint i;
  for (i = 0; i < N; i++)
    c[i] = c[i] + a[i]*alpha;
}

/* Cn port: element i+pe_num is handled by PE number pe_num, so each
   pass of the loop processes num_pes elements in parallel.
   (Assumes N is a multiple of num_pes.) */
void daxpy(double *c, double *a, double alpha, uint N) {
  uint i;
  poly double cp, ap;
  for (i = 0; i < N; i += num_pes) {
    memcpym2p(&cp, &c[i+pe_num], sizeof(double));   /* mono -> poly */
    memcpym2p(&ap, &a[i+pe_num], sizeof(double));
    cp = cp + ap*alpha;          /* alpha (mono) is broadcast to all PEs */
    memcpyp2m(&c[i+pe_num], &cp, sizeof(double));   /* poly -> mono */
  }
}
Example: Cn radix-2 FFT
void cn_fft(poly float *xy, poly float *w, short n) {
  poly short n1, n2, ie, ia, i, j, k, l;
  poly float xt, yt, c, s;

  n2 = n;
  ie = 1;
  for (k = n; k > 1; k = (k >> 1)) {     /* log2(n) stages */
    n1 = n2;
    n2 = n2 >> 1;
    ia = 0;
    for (j = 0; j < n2; j++) {
      c = w[2*ia];                       /* twiddle factor: (cos, sin) */
      s = w[2*ia+1];
      ia = ia + ie;
      for (i = j; i < n; i += n1) {      /* butterflies for this twiddle */
        l = i + n2;
        xt        = xy[2*l]   - xy[2*i];
        xy[2*i]   = xy[2*i]   + xy[2*l];
        yt        = xy[2*l+1] - xy[2*i+1];
        xy[2*i+1] = xy[2*i+1] + xy[2*l+1];
        xy[2*l]   = (c*xt + s*yt);
        xy[2*l+1] = (c*yt - s*xt);
      }
    }
    ie = ie << 1;
  }
}
ClearSpeed Advance™ and CSX600 summary
ClearSpeed’s Advance™ board delivers new levels of floating-point and integer performance, performance per watt, and ease of use:
• For acceleration of standard libraries and applications, such as Level 3 BLAS, LAPACK, FFTW, MATLAB, Mathematica, ANSYS, ABAQUS, GAMESS, Gaussian, AMBER, …
• 50 GFLOPS sustained from a ClearSpeed Advance™ board
• Callable from C/C++, Fortran, etc.
• ~25 watts per single-slot board
• Multiple Advance™ boards for even higher performance