Transcript Slide 1
Parallel Applications for Multi-core Processors
Ana Lucia Vârbănescu, TU Delft / Vrije Universiteit Amsterdam
With acknowledgements to: the Multicore Solutions Group @ IBM T.J. Watson, NY, USA; Alexander van Amesfoort @ TUD; Rob van Nieuwpoort @ VU/ASTRON
Outline
► One introduction
► Cell/B.E. case-studies: Sweep3D, MARVEL, CellSort, Radioastronomy; An Empirical Performance Checklist
► Alternatives: GP-MC, GPUs
► Views on parallel applications … and multiple conclusions
One introduction
The history: STI Cell/B.E.
► Sony: main processor for PS3
► Toshiba: signal processing and video streaming
► IBM: high performance computing
The architecture
► 1 x PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB
► 8 x SPE cores: Local Store: 256 KB; 128 x 128-bit vector registers
► Hybrid memory model: PPE: Rd/Wr; SPEs: async DMA
The Programming
► Thread-based model, with push/pull data flow: thread scheduling by the user; memory transfers are explicit
► Five layers of parallelism to be exploited:
  • Task parallelism (MPMD)
  • Data parallelism (SPMD)
  • Data streaming parallelism (DMA double buffering)
  • Vector parallelism (SIMD – up to 16-way; see the sketch below)
  • Pipeline parallelism (dual-pipelined SPEs)
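As a flavor of the vector-parallelism layer, a minimal SPU SIMD sketch, assuming the SDK's spu_intrinsics.h; the saxpy-style loop is purely illustrative, not code from the case studies:

#include <spu_intrinsics.h>

/* 4-way single-precision SIMD: each spu_madd processes 4 floats */
void saxpy4(vector float *x, vector float *y, float a, int nvec)
{
    vector float va = spu_splats(a);      /* broadcast a to all 4 lanes */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(va, x[i], y[i]);  /* y = a*x + y, 4 lanes at once */
}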
Sweep3D application
► Part of the ASCI benchmark suite
► Solves a three-dimensional particle transport problem
► It is a 3D wavefront computation
IPDPS 2007: Fabrizio Petrini, Gordon Fossum, Juan Fernández, Ana Lucia Varbanescu, Michael Kistler, Michael Perrone: "Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine".
Sweep3D computation

SUBROUTINE sweep()
  DO octants                        ! Octant loop
    DO angle_groups                 ! Angle pipelining loop
      DO k=1,kt/mk                  ! K-plane loop
        RECV W/E                    ! Receive W/E I-inflows
        RECV N/S                    ! Receive N/S J-inflows
        DO jkm=1,jt+mk-1+mmi-1      ! JK-diagonals with MMI pipelining
          DO il=1,ndiag             ! I-lines on this diagonal
            IF .NOT. do_fixups
              DO i=1,it             ! Solve Sn equation
              ENDDO
            ELSE
              DO i=1,it             ! Solve Sn equation with fixups
              ENDDO
            ENDIF
          ENDDO                     ! I-lines on this diagonal
        ENDDO                       ! JK-diagonals with MMI
        SEND W/E                    ! Send W/E I-outflows
        SEND N/S                    ! Send N/S J-outflows
      ENDDO                         ! K-plane loop
    ENDDO                           ! Angle pipelining loop
  ENDDO                             ! Octant loop
Application parallelization
► Process-level parallelism: inherits the wavefront parallelism, implemented in MPI
► Thread-level parallelism: assign "chunks" of I-lines to SPEs
► Data streaming parallelism: threads use double buffering, for both RD and WR (see the sketch below)
► Vector parallelism: SIMD-ize the loops (e.g., 2-way for double precision, 4-way for single precision)
► Pipeline parallelism: SPE dual pipeline => multiple logical threads of vectorization
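A minimal double-buffering sketch, assuming the Cell SDK's spu_mfcio.h DMA interface; the chunk size, buffer names, and the compute() hook are illustrative, not the Sweep3D code itself:

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer, illustrative */

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(char *data);   /* hypothetical per-chunk kernel */

void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                       /* start the next DMA early */
            mfc_get(buf[cur ^ 1], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, cur ^ 1, 0, 0);
        mfc_write_tag_mask(1 << cur);              /* wait for the current chunk only */
        mfc_read_tag_status_all();
        compute(buf[cur]);                         /* computation overlaps the next DMA */
        cur ^= 1;
    }
}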
Experiments
► Run on SDK2.0, Q20 (prototype) blade: 2 Cell processors, 16 SPEs available, 3.2GHz, 1GB RAM
Optimization techniques
Performance comparison
Sweep3D lessons:
► Essential SPE-level optimizations:
  • Low-level parallelization: communication, SIMD-ization, dual pipelines
  • Address alignment
  • DMA grouping
► Aggressive low-level optimizations = algorithm tuning!!
Generic CellSort
► Based on bitonic merge/sort; works on 2^k array elements (see the sketch below)
► Sorts 8-byte patterns from an input string
► Keeps track of the original position
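For reference, a minimal scalar sketch of the bitonic sorting network that CellSort builds on (plain C, no SIMD; the array length must be a power of two):

#include <stddef.h>

static void swap_if(int *a, int *b, int dir)
{
    /* dir=1: keep ascending order, dir=0: descending */
    if ((*a > *b) == dir) { int t = *a; *a = *b; *b = t; }
}

void bitonic_sort(int *x, size_t n)   /* n must be a power of two */
{
    for (size_t k = 2; k <= n; k <<= 1)           /* merge stage size */
        for (size_t j = k >> 1; j > 0; j >>= 1)   /* compare distance */
            for (size_t i = 0; i < n; i++) {
                size_t ixj = i ^ j;               /* compare-exchange partner */
                if (ixj > i)
                    swap_if(&x[i], &x[ixj], (i & k) == 0);
            }
}

The fixed, data-independent compare-exchange pattern is what makes the algorithm branch- and SIMD-friendly on the SPEs, as the next slides show.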
Data “compression”
► Memory limitations: SPE LS = 256KB => 128KB data (16K keys) + 64KB indexes
► Interleaved [KEY|INDEX] pairs are replaced by separate KEYS and INDEXES arrays (sketched below)
► Avoid branches (sorting is about if's …)
► SIMD-ization with (2 keys x 8B) per 16B vector
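A minimal illustration of that layout change (plain C; the sizes follow the slide's numbers, the struct name is illustrative):

#define N (16 * 1024)   /* 16K keys */

/* Before: interleaved [KEY|INDEX] pairs (array of structs) */
struct pair { unsigned long long key; unsigned int index; };

/* After: separate arrays (struct of arrays), so two 8-byte keys
   pack into one 16-byte vector and the indexes stay out of the way */
static unsigned long long keys[N];   /* 16K x 8B = 128KB */
static unsigned int indexes[N];      /* 16K x 4B =  64KB */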
Re-implementing the if’s
► if (A > B) can be replaced with 6 SIMD instructions for comparing:
inline int sKeyCompareGT16(SORT_TYPE A, SORT_TYPE B)
{
    VECTORFORMS temp1, temp2, temp3, temp4;
    temp1.vui = spu_cmpeq( A.vect.vui, B.vect.vui );
    temp2.vui = spu_cmpgt( A.vect.vui, B.vect.vui );
    temp3.vui = spu_slqwbyte( temp2.vui, 4 );
    temp4.vui = spu_and( temp3.vui, temp1.vui );
    temp4.vui = spu_or( spu_or(temp4.vui, temp2.vui), temp1.vui );
    return ( spu_extract( spu_gather(temp4.vui), 0 ) >= 8 );
}
The good results
► Input data: 256KB string
► Running on: one PPE of a Cell blade, the PPE of a PS3, and PPE + 16 SPEs on the same Cell blade
► 16 SPEs => speed-up ~46
[Chart: Sorting speed-up using 16 SPEs for 256KB data. PPE @ Q20: 0.70, PPE @ PS3: 1.00, 16 SPEs: 45.88]
The bad results
► Non-standard key types: a lot of effort to implement basic operations efficiently
► SPE-to-SPE communication wastes memory: a larger local SPE sort was more efficient
► The limitation to 2^k elements is killing performance: another basic algorithm may be required
► Cache troubles: the PPE cache is "polluted" by SPE accesses; flushing is not trivial
Lessons from CellSort
► Some algorithms do not fit the Cell/B.E.: it pays off to look for different solutions at a higher level (i.e., a different algorithm); hard to know in advance
► SPE-to-SPE communication may be expensive: not only time-wise, but memory-wise too!
► SPE memory is *very* limited: double buffering wastes memory too!
► Cell does show cache effects
Multimedia Analysis & Retrieval
MARVEL:
► Machine tagging, searching, and filtering of images & video
► Novel approach: semantic models obtained by analyzing visual, audio & speech modalities; automatic classification of scenes, objects, events, people, sites, etc.
► http://www.research.ibm.com/marvel
MARVEL case-study
► Multimedia content retrieval and analysis:
  • Extracts the values for 4 features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram
  • Compares the image features with the model features and generates an overall confidence score
MarCell = MARVEL on Cell
► Identified 5 kernels to port on the SPEs:
  • ColorHistogram (CHExtract)
  • ColorCorrelogram (CCExtract)
  • Texture (TXExtract)
  • EdgeHistogram (EHExtract)
  • Concept Detection (CDetect)
MarCell – Porting
1. Detect & isolate the kernels to be ported
2. Replace the kernels with C++ stubs
3. Implement the data transfers and move the kernels onto the SPEs
4. Iteratively optimize the SPE code
ICPP 2007: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu: "An Effective Strategy for Porting C++ Applications on Cell".
Experiments
► Run on a PlayStation3: 1 Cell processor, 6 SPEs available, 3.2GHz, 256MB RAM
► Double-checked on a Cell blade Q20: 2 Cell processors, 16 SPEs available, 3.2GHz, 1GB RAM
► SDK2.1
MarCell – kernels speed-up
Kernel      SPE [ms]   Speed-up vs. PPE   Speed-up vs. Desktop   Speed-up vs. Laptop   Overall contribution
AppStart    7.17        0.95               0.67                   0.83                  8 %
CHExtract   0.82       52.22              21.00                  30.17                  8 %
CCExtract   5.87       55.44              21.26                  22.45                 54 %
TXExtract   2.01       15.56               7.08                   8.04                  6 %
EHExtract   2.48       91.05              18.79                  30.85                 28 %
CDetect     0.41        7.15               3.75                   4.88                  2 %
Task parallelism – setup
Task parallelism – on Cell blade
Data parallelism – setup
► All SPEs execute the same kernel => SPMD
► Requires SPE reconfiguration: thread re-creation or overlays
► Kernels scale, but the overall application doesn't!!
Combined parallelism – setup
► Different kernels span multiple SPEs
► Load balancing
► CC and TX are ideal candidates
► But we verify all possible solutions
Combined parallelism - Cell blade [1/2]
[Chart: Execution times for all possible mapping scenarios using 16 SPEs]
CCPE 2008: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu: "Evaluating Application Mapping Scenarios on the Cell/B.E.".
Combined parallelism - Cell blade [2/2]
[Chart: Best performance per number of used SPEs, for SPEs/task allocations (CH-CC-TX-EH) from 1-1-1-1 up to 3-5-3-5; Tmin = 8.54 ms]
MarCell lessons:
► Mapping and scheduling are high-level parallelization:
  • Essential for "seeing" the influence of kernel optimizations
  • Platform-oriented: the MPI inheritance may not be good enough
► Context switches are expensive
► Static scheduling can be replaced with dynamic (PPE-based) scheduling
Radioastronomy
► Very large radiotelescopes: LOFAR, ASKAP, SKA, etc.
► Radioastronomy features:
  • Very large data sets
  • Off-line (files) and on-line (streaming) processing
  • Simple computation kernels
  • Time constraints, due to streaming and to storage capability
► Radioastronomy data processing is ongoing research
► Multi-core processors are a challenging solution
Getting the sky image
► The signal path runs from the antenna to the sky image; we focus on imaging
Data imaging
► Two phases for building a sky image:
  • Imaging: takes the measured visibilities and creates a dirty image
  • Deconvolution: "cleans" the dirty image into a sky model
► The more iterations, the better the model; but more iterations = more measured visibilities
Gridding/Degridding
[Diagram: (u,v)-tracks → sampled data (visibilities) V(b(t_i)) → gridding → gridded data (all baselines) → degridding]
► V(b(t_i)) = data read at time t_i on baseline b
► D_j(b(t_i)) contributes to a certain region in the final grid
► Both gridding and degridding are performed by convolution
The code
forall (j = 0..Nfreq; i = 0..Nsamples-1) {      // for all samples
    // the kernel position in C
    compute cindex = C_Offset((u,v,w)[i], freq[j]);
    // the grid region to fill
    compute gindex = G_Offset((u,v,w)[i], freq[j]);
    // for all points in the chosen region
    for (x = 0; x < M; x++) {                   // sweep the convolution kernel
        if (gridding)   G[gindex+x] += C[cindex+x] * V[i,j];
        if (degridding) V'[i,j]     += G[gindex+x] * C[cindex+x];
    }
}

► All operations are performed with complex numbers!
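As a concrete scalar rendering of the gridding update, a minimal C sketch with complex arithmetic; the flattened-array signature is an assumption, and the offset computations (C_Offset/G_Offset) are omitted:

#include <complex.h>

/* Add one sample's contribution to the grid: the convolution kernel,
   weighted by the visibility V, is accumulated into the selected grid
   region. One complex multiply-add = 4 MUL + 4 ADD per point, which is
   where the operation count on the next slide comes from. */
void grid_sample(float complex *G, const float complex *C,
                 float complex V, long gindex, long cindex, int M)
{
    for (int x = 0; x < M; x++)        /* sweep the convolution kernel */
        G[gindex + x] += C[cindex + x] * V;
}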
The computation
[Diagram: HDD → read (u,v,w)(t,b), V(t,b,f) → compute C_ind, G_ind → memory → read SC[k], SG[k], k = 1..m x m → compute SG[k] + D x SC[k] → write SG[k] to G]
► Iterated over samples x baselines x frequency_channels
► Computation/iteration: M * (4 ADD + 4 MUL) = 8 * M FLOPs
► Memory transfers/iteration: RD: 2 * M * 8B; WR: M * 8B
► Arithmetic intensity [FLOPs/byte]: 8M / 24M = 1/3 => memory-intensive app!
► Two consecutive data points "hit" different regions in C/G => dynamic!
The data
► Memory footprint:
  • C: 4MB ~ 100MB
  • V: 3.5GB for 990 baselines x 1 sample/s x 16 freq. channels
  • G: 4MB
► For each data point: a convolution kernel from 15 x 15 up to 129 x 129
Data distribution
► Three ways to distribute samples over workers: "round-robin", "chunks", and queues
[Diagram: the sample stream laid out under each distribution]
Parallelization
[Diagram: HDD → read (u,v,w)(t,b), V(t,b,f) → compute C_ind, G_ind → memory → DMA Rd SC[k], SG[k], k = 1..m x m → compute SG[k] + D x SC[k] → DMA Wr SG[k] to localG → add localG to finalG; iterated over samples x baselines x frequency_channels]
► A master-worker model: "scheduling" decisions on the PPE; SPEs concerned only with computation (see the sketch below)
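A minimal master-worker analogue in portable C with pthreads, just to make the structure concrete; on the Cell the workers are SPE threads pulling work items via DMA, and process() stands in for the gridding kernel:

#include <stddef.h>
#include <pthread.h>

#define NITEMS   64
#define NWORKERS  4

static int queue[NITEMS];                 /* work items filled by the master */
static int head;                          /* next unclaimed item */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void process(int item) { (void)item; /* stand-in for the kernel */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);        /* "scheduling" stays centralized */
        int i = (head < NITEMS) ? head++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0) return NULL;           /* queue drained */
        process(queue[i]);                /* workers only compute */
    }
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NITEMS; i++) queue[i] = i;
    for (int i = 0; i < NWORKERS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++) pthread_join(t[i], NULL);
    return 0;
}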
Optimizations
► Exploit data locality: on the PPE, fill the queues in a "smart" way; on the SPEs, avoid unnecessary DMA
► Tune queue sizes
► Increase queue-filling speed: 2 or 4 threads on the PPE
► Sort queues, by g_ind and/or c_ind (see the sketch below)
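A small sketch of the queue-sorting step (the work_item type is hypothetical); ordering work items by g_ind groups DMA accesses to the same grid region:

#include <stdlib.h>

typedef struct { int g_ind, c_ind; /* plus the sample payload */ } work_item;

static int by_index(const void *a, const void *b)
{
    const work_item *x = a, *y = b;
    if (x->g_ind != y->g_ind)
        return x->g_ind - y->g_ind;    /* primary key: grid region */
    return x->c_ind - y->c_ind;        /* tie-break: kernel position */
}

/* sort a queue before handing it to an SPE, to improve locality */
void sort_queue(work_item *q, size_t n)
{
    qsort(q, n, sizeof *q, by_index);
}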
Experiments set-up
► Collection of 990 baselines: 1 baseline / multiple baselines
► Run gridding and degridding for: 5 different support sizes; different core/thread configurations
► Report: execution time per operation (i.e., per gridding and per degridding):
  Texec/op = Texec / (NSamples x NFreqChans x KernelSize x #Cores)
Results – overall evolution
Lessons from Gridding
► SPE kernels have to be as regular as possible; dynamic scheduling works on the PPE side
► Investigate data-dependent optimizations, to spare memory accesses
► Arithmetic intensity is a very important metric: aggressively optimizing only the computation pays off only when the communication-to-computation ratio is small!!
► I/O can limit the Cell/B.E. performance
Performance checklist [1/2]
► Low-level:
  • No dynamics on the SPEs
  • Memory alignment
  • Cache behavior vs. SPE data contention
  • Double buffering
  • Balance computation optimization with the communication
  • Expect impact on the algorithmic level
Performance checklist [2/2]
► High-level:
  • Task-level parallelization (symmetrical/asymmetrical)
  • Static mapping if possible; dynamic only on the PPE
  • Address data locality, also on the PPE
  • Moderate impact on the algorithm
► Data-dependent optimizations:
  • Enhance data locality
Outline
► One introduction
► Cell/B.E. case-studies: Sweep3D, MARVEL, CellSort, Radioastronomy; An Empirical Performance Checklist
► Alternatives: GP-MC, GPUs
► Views on parallel applications … and multiple conclusions
Other platforms
► General-purpose MC:
  • Easier to program (SMP machines)
  • Homogeneous
  • Complex, traditional, multi-threaded cores
► GPUs:
  • Hierarchical cores
  • Harder to program (more parallelism)
  • Complex memory architecture
  • Less predictable
A Comparison
► Different strategies are required for each platform:
  • Core-specific optimizations are the most important for GPP
  • Dynamic job/data allocation is essential for Cell/B.E.
  • Memory management for high data parallelism is critical for GPU
Efficiency (case-study)
► We have tried the most "natural" programming model for each platform
► The parallelization effort:
  • GPP: 4 days (a master-worker model may improve performance here as well)
  • Cell/B.E.: 3-4 months (very good performance, complex solution)
  • GPU: 1 month (still in progress)
Outline
► One introduction
► Cell/B.E. case-studies: Sweep3D, MARVEL, CellSort, Radioastronomy; An Empirical Performance Checklist
► Alternatives: GP-MC, GPUs
► Views on parallel applications … and multiple conclusions
A view from Berkeley
A view from Holland
Overall …
► Cell/B.E. is NOT hard to program, unless … high performance or high productivity is required
► Still in the case-studies phase: everything can run on the Cell … but how, and when?!
► Optimizations: low-level ones may be delegated to a compiler (difficult); high-level ones must be user-assisted
► Programming models: they offer partial solutions, but none seems complete; various approaches, with limited loss in efficiency
… but …
► Cell/B.E. is NOT the only option: choosing a multi-core platform is *highly* application-dependent; efficiency is essential, more so than performance
► In-core optimizations pay off for *all* platforms, and are roughly predictable too
► Higher-level optimizations make the difference:
  • Data management and distribution
  • Task scheduling
  • Isolation and proper implementation of dynamic behavior (e.g., scheduling)
Take-home messages [1/2]
► It's not that multi-core processors are difficult to program; rather, applications are difficult to parallelize.
► There is no silver bullet that makes all applications perform great on the Cell/B.E., but there are common practices for getting there.
Take-home messages [2/2]
► The application design, implementation, and optimization principles for the Cell/B.E. hold for most multi-core platforms.
► Applications must have a massive influence on next-generation multi-core processor design.
Thank you!
► (Any more) questions?
A.L.Varbanescu@tudelft.nl
http://www.pds.ewi.tudelft.nl/~varbanescu