Carnegie Mellon
High Performance Linear Transform Program Generation for the Cell BE
Vas Chellappa
Franz Franchetti
Markus Püschel
Electrical & Computer Engineering
Carnegie Mellon University
Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.
Cell Broadband Engine
[Figure: Cell BE chip. Eight SPEs, each with its own local store (LS), and main memory (Main Mem) are connected through the EIB.]

Multicore CPU (8 SPEs + 1 PPE)
SPEs: SIMD cores designed for numerical computing
256 KB “local store” per SPE (scratchpad-like)
Programmer-driven DMA
204 Gflop/s peak
How do we harness the Cell’s impressive peak performance?
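As a rough check on the 204 Gflop/s peak quoted on this slide, the single-precision figure can be rebuilt from per-SPE numbers. The 3.2 GHz clock, the 4-lane single-precision SIMD width, and counting a fused multiply-add as 2 flops are assumptions taken from commonly cited Cell BE specifications, not from the slide itself:

```latex
\[
8~\text{SPEs} \times 3.2~\text{GHz} \times 4~\text{lanes} \times 2~\tfrac{\text{flops}}{\text{FMA}}
  = 204.8~\text{Gflop/s}
\]
```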
DFT on the Cell BE
[Figure: DFT performance on the Cell BE for Spiral-generated code (this paper), FFTC, FFTW, and Numerical Recipes; the annotated gap is 350x.]
Platform-tuned code is 350x faster. But hard to write!
Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong,
Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo:
SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005
“Fitting” Dataflow to Hardware
[Figure: the same five stages (Stage 1 to Stage 5) arranged in three ways: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution (multicore) on Core 0 and Core 1.]
To “fit” DFT to architecture:

Various traversals

Various factorizations
How to map dataflow to architecture automatically?
“Fitting” Dataflow to Platform (contd.)
[Figure: stages 1 to 5 reordered and assigned to Core 0 and Core 1.]
Intuition: rewrite formulas to obtain suitable dataflow
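To make the intuition concrete, here is a minimal sketch of one such rewrite, using a 2-point butterfly F_2 on integers as a stand-in for a real kernel. It shows that a strided loop, (F_2 ⊗ I_n), computes the same result as explicit data reordering around a contiguous loop, (I_n ⊗ F_2). The function names and the `MAX_N` scratch limit are hypothetical and are not taken from Spiral.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 64  /* hypothetical limit for the scratch buffers below */

/* y = (I_n (x) F_2) x: n butterflies on contiguous pairs. */
static void i_tensor_f2(size_t n, const int *x, int *y)
{
    for (size_t i = 0; i < n; i++) {
        y[2 * i]     = x[2 * i] + x[2 * i + 1];
        y[2 * i + 1] = x[2 * i] - x[2 * i + 1];
    }
}

/* y = (F_2 (x) I_n) x: n butterflies on pairs that lie n elements apart. */
static void f2_tensor_i(size_t n, const int *x, int *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i]     = x[i] + x[i + n];
        y[i + n] = x[i] - x[i + n];
    }
}

/* Same result as f2_tensor_i, rewritten as:
 * reorder -> contiguous butterflies -> reorder back. */
static void f2_tensor_i_rewritten(size_t n, const int *x, int *y)
{
    int t[2 * MAX_N], u[2 * MAX_N];
    assert(n <= MAX_N);
    for (size_t i = 0; i < n; i++) {   /* gather each strided pair together */
        t[2 * i]     = x[i];
        t[2 * i + 1] = x[i + n];
    }
    i_tensor_f2(n, t, u);              /* contiguous, block-local work */
    for (size_t i = 0; i < n; i++) {   /* scatter back to the original order */
        y[i]     = u[2 * i];
        y[i + n] = u[2 * i + 1];
    }
}
```

Both forms produce identical output; which one suits a platform depends on whether strided access or explicit data movement is cheaper there.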
Program Generation in Spiral
Transform (user specified)
Fast algorithm in SPL (many choices): parallelization, vectorization
∑-SPL: loop optimizations
C code: constant folding, scheduling, …
Iteration of this process to search for the fastest
Optimization at all abstraction levels
But that’s not all …
Common Abstraction: SPL
SPL: Tensor-product representation
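E.g., Cooley-Tukey fast Fourier transform (FFT):
\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & 1 \\ 1 & \cdot & -1 & \cdot \\ \cdot & 1 & \cdot & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & \cdot & \cdot & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & \cdot & \cdot \\ 1 & -1 & \cdot & \cdot \\ \cdot & \cdot & 1 & 1 \\ \cdot & \cdot & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 \end{bmatrix}
\]
\[
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
\]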
Tensor products in SPL represent loop structures
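As an illustration of tensor products turning into loops, the sketch below evaluates the size-4 Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L one factor at a time and compares it with the direct definition. It assumes the root of unity j and the twiddle matrix T = diag(1, 1, 1, j); the integer complex type `cplx` and all function names are hypothetical helpers, not Spiral output.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int re, im; } cplx;  /* integer parts suffice for DFT_4 */

static cplx c_add(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx c_sub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
static cplx c_mul_j(cplx a)       { return (cplx){ -a.im, a.re }; }  /* a * j */

/* (I_n (x) DFT_2): loop over n contiguous blocks of size 2. */
static void i_tensor_dft2(size_t n, const cplx *x, cplx *y)
{
    for (size_t i = 0; i < n; i++) {
        y[2 * i]     = c_add(x[2 * i], x[2 * i + 1]);
        y[2 * i + 1] = c_sub(x[2 * i], x[2 * i + 1]);
    }
}

/* (DFT_2 (x) I_n): the same butterfly, with operands n elements apart. */
static void dft2_tensor_i(size_t n, const cplx *x, cplx *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i]     = c_add(x[i], x[i + n]);
        y[i + n] = c_sub(x[i], x[i + n]);
    }
}

/* DFT_4 evaluated right to left through its four factors. */
void dft4_factored(const cplx x[4], cplx y[4])
{
    cplx t[4], u[4];
    t[0] = x[0]; t[1] = x[2]; t[2] = x[1]; t[3] = x[3];  /* L: stride permutation */
    i_tensor_dft2(2, t, u);                              /* I_2 (x) DFT_2 */
    u[3] = c_mul_j(u[3]);                                /* T = diag(1, 1, 1, j) */
    dft2_tensor_i(2, u, y);                              /* DFT_2 (x) I_2 */
}

/* Reference: y[k] = sum over l of j^(k*l) * x[l]. */
void dft4_direct(const cplx x[4], cplx y[4])
{
    for (int k = 0; k < 4; k++) {
        cplx acc = { 0, 0 };
        for (int l = 0; l < 4; l++) {
            cplx term = x[l];
            for (int e = 0; e < (k * l) % 4; e++)
                term = c_mul_j(term);
            acc = c_add(acc, term);
        }
        y[k] = acc;
    }
}
```

The two tensor factors differ only in their index expressions: the identity on the left gives a loop over contiguous blocks, the identity on the right gives a strided loop.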
Mapping DFTs to the Cell
Objective: High-performance transform library for Cell BE
[Figure: a DFT mapped onto the Cell BE chip (eight SPEs with local stores, EIB, main memory).]
Cell’s architectural paradigms:
Vectorization: vectorize DFT for a given vector length
Parallelization: parallelize DFT across p SPEs, using a given DMA packet size
Multibuffering: optimize DFT for throughput (s DFTs required)
Tags guide formula rewriting
SPL to Parallel Code
Natural parallel construct in SPL:
[Figure: block-diagonal matrix with four copies of A; each block maps one slice of x to one slice of y on Processor 0 through Processor 3.]
Independent, load-balanced, communication-free operation
Parallelizing other constructs in SPL:
[Figure: a permutation mapping x to y across processors.]
Permutations require message exchange (on-chip DMA comm.)
Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
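The contrast between the two constructs can be sketched in plain C. Below, each worker of a block-diagonal product touches only its own slice, while a transposition-style permutation is checked for how many elements would have to cross between processors. The serial loop stands in for real SPE threads, the permutation's index convention is chosen for illustration only, and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one processor for y = (I_p (x) A) x, where A is an m-by-m
 * row-major matrix: it reads and writes only its own m-element slice. */
void block_diag_worker(size_t pid, size_t m, const int *A,
                       const int *x, int *y)
{
    const int *xin = x + pid * m;
    int *yout = y + pid * m;
    for (size_t r = 0; r < m; r++) {
        int acc = 0;
        for (size_t c = 0; c < m; c++)
            acc += A[r * m + c] * xin[c];
        yout[r] = acc;
    }
}

/* Serial stand-in for launching p independent workers. */
void block_diag_apply(size_t p, size_t m, const int *A,
                      const int *x, int *y)
{
    for (size_t pid = 0; pid < p; pid++)
        block_diag_worker(pid, m, A, x, y);
}

/* For the permutation y[i*n + j] = x[j*m + i] on m*n elements split into
 * p equal contiguous chunks, count the elements whose source chunk and
 * destination chunk differ, i.e. that need inter-processor transfer. */
size_t remote_moves(size_t m, size_t n, size_t p)
{
    assert(p > 0 && (m * n) % p == 0);
    size_t chunk = (m * n) / p;
    size_t count = 0;
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            size_t dst = i * n + j;
            size_t src = j * m + i;
            if (dst / chunk != src / chunk)
                count++;
        }
    }
    return count;
}
```

For a 4-by-4 view on four processors, only the diagonal elements stay local, which is why such permutations map to message exchange rather than to independent loops.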
SPL to Streaming Code

Streaming: Overlapping computation with communication
  On-chip (SPE ↔ SPE) and off-chip (SPE ↔ Main memory)

Idea: tensor loops become multi-buffered loops (trickier for other SPL constructs)
  i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}
[Figure: block-diagonal matrix of A blocks mapping x to y.]
Useful for:
  Throughput-optimized code
  Large, out-of-chip sizes
Idea: rewrite algorithm at SPL level to achieve largest DMA packets
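A minimal sketch of the multi-buffered loop structure follows. `memcpy` stands in for DMA and runs synchronously here, so only the buffer rotation is shown, not real overlap; on the Cell the transfers would be asynchronous and tagged. The block size `BLOCK`, the butterfly `kernel`, and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4  /* hypothetical number of elements per transfer */

/* Stand-in for the kernel A: two butterflies on one block. */
static void kernel(const float *in, float *out)
{
    out[0] = in[0] + in[1];
    out[1] = in[0] - in[1];
    out[2] = in[2] + in[3];
    out[3] = in[2] - in[3];
}

/* y = (I_nblocks (x) A) x with double-buffered input and output.
 * In iteration i: read block i+1, write block i-1, compute block i. */
void stream_blocks(const float *x, float *y, size_t nblocks)
{
    float in_buf[2][BLOCK], out_buf[2][BLOCK];

    if (nblocks == 0)
        return;
    memcpy(in_buf[0], x, sizeof in_buf[0]);  /* prologue: read block 0 */
    for (size_t i = 0; i < nblocks; i++) {
        size_t cur = i & 1, other = cur ^ 1;
        if (i + 1 < nblocks)                 /* read block i+1 */
            memcpy(in_buf[other], x + (i + 1) * BLOCK, sizeof in_buf[other]);
        if (i > 0)                           /* write block i-1 */
            memcpy(y + (i - 1) * BLOCK, out_buf[other], sizeof out_buf[other]);
        kernel(in_buf[cur], out_buf[cur]);   /* compute block i */
    }
    /* epilogue: write the last block */
    memcpy(y + (nblocks - 1) * BLOCK, out_buf[(nblocks - 1) & 1],
           sizeof out_buf[0]);
}
```

Two buffers per direction are enough because the result of iteration i-1 is drained before iteration i+1 overwrites its buffer.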
Generating Cell Code
Transform (user specified)
Rewriting (tag guided)
Fast algorithm in SPL:
  Streamed from memory for throughput
  Load balanced across p SPEs
  SIMD kernel optimized for memory hierarchy
  All-to-all communication (on-chip)
Loop operations in ∑-SPL
Cell-specific optimized C code (intrinsics, DMA, etc.)
Generated Code Sample
/* Complex-to-complex DFT size 64 on 2 SPEs */
/* T1 and T2 are temporary buffers; their declarations are not shown on the slide. */
void dft_c2c_64(float *X, float *Y, int spuid)
{
    int i;

    // Block 1: (I x A) L
    for (i = 0; i <= 7; i++) {  // Right-most gather
        DMA_GATHER(gath_func(X, i), gath_func(T1, i), 4);  // uses spu_mfcdma()
    }
    spu_mfcstat(MFC_TAG_UPDATE_ALL);  // Wait on gather

    // compute vectorized DFT kernel of size m   [slide callout: vectorized]

    for (i = 0; i <= 7; i++) {  // Scatter at interface   [slide callout: DMA]
        DMA_SCATTER(scat_func(T1, i), scat_func(T2, i), 4);
    }
    all_to_all_synchronization_barrier();  // uses mailbox msgs   [slide callout: parallelized]

    // Block 2: (A x I)
    /* Gather is a no-op since the scatter above accounted for it */

    // compute vectorized DFT kernel of size n

    for (i = 0; i <= 7; i++) {  // Left-most scatter
        DMA_SCATTER(scat_func(T1, i), scat_func(Y, i), 4);
    }
    all_to_all_synchronization_barrier();
}
DFT 2^16: 4,000+ lines of code!
Problem Space: Options
Parallelization:
  Base (vectorized) DFT; vectorization assumed
  Single DFT parallelized across multiple SPEs
  Multiple independent DFTs on multiple SPEs
  Multiple parallelized independent DFTs
Main Memory Operations:
  Latency optimized (default)
  Throughput, multibuffered
(Only for small DFTs)
[Figure: one diagram per option showing how DFTs are placed on SPEs.]
Problem Space: Combinations
Latency-optimized usage scenarios:
  Single DFT from main memory
Throughput-optimized usage scenarios:
  Parallel, multibuffered DFT
  Independent DFTs multibuffered in parallel
[Figure: one diagram per scenario showing DFTs placed on SPEs.]
Devise rewrite rules for tags. Nestings describe all scenarios
[Figure: performance of a single DFT parallelized across SPEs; series: 8 SPEs, 4 SPEs, 2 SPEs, 1 SPE.]
[Figure: DFT performance comparison; series: Spiral (8 SPEs), FFTW, FFTC, Spiral (1 SPE).]
4.5x faster than FFTW, 1.63x faster than FFTC
More Performance Results
Single-SPE DFT code
Split/interleaved complex formats
Non-power-of-two sizes
Double precision (PowerXCell 8i)
[Figure: performance plots; series: Spiral, Mercury, Chow, IBM SDK.]
Other Linear Transforms

Discrete Sine, Cosine transforms, DFT with real inputs (single-SPE)

2-D DFTs

Out-of-core sizes

Limited to 2D DFTs on 1-SPE (for now)
More performance results:
Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the
Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009
Conclusion

Automatic generation of transform libraries
 High performance
 Variety of scenarios, formats

High performance on Cell requires:
  Vectorization, multi-core parallelization, streaming, DMA code
  Future processors likely to have similar paradigms, tradeoffs

Spiral approach:
  Common abstraction of transform, algorithm, architecture (SPL)
  Rewrite rules to go from transform to architecture
[Figure: rewrite rules connect the algorithm space to the architecture space.]