Parallel considerations of VELO PatPixelTracking
Daniel Hugo Cámpora Pérez
LHCb Online team

Daniel Hugo Cámpora Pérez - Parallel Considerations of VELO PatPixelTracking 21-11-2012

Outline
• PatPixel problem description
• Test setup, some results
• Integration with Gaudi framework

Fast Pixel problem description
• 48 sensors with 12 chips each
• Each chip has 256x256 pixels
• Clustered 2x2 by readout board
• Right and left sensors at different z with overlap

• The algorithm searches for hits starting from the last pixel lattice.
• Per hit, it searches for compatible hits (on a given radius) in the next pixel lattice.
• Finding at least three compatible hits forms a track.

However, the current approach is very sequential (albeit efficient!).
• Hits must not be already used.
• Continue instructions, break the loop and make it fast.
Porting the same algorithm to other programming models as is makes for a proof of concept (produced physics are the same).
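The search described above can be sketched roughly as follows. This is a minimal illustration, not the PatPixel code: the `Hit`, `Sensor` and `Track` types, the `compatible` test (plain distance in x and y to the previous hit) and the `find_tracks` function are all hypothetical names and simplifications introduced here. The sketch mainly shows why the approach is sequential: every seed reads and writes the shared `used` flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hit type: transverse position plus a "used" flag.
struct Hit {
    float x;
    float y;
    bool used;
};

using Sensor = std::vector<Hit>;   // all hits on one sensor (pixel lattice)
using Track  = std::vector<Hit*>;  // hits collected along one track

// Assumed compatibility test: candidate lies within `radius` of the
// previous hit in (x, y). The real criterion is not given in the slides.
inline bool compatible(const Hit& a, const Hit& b, float radius) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    return dx * dx + dy * dy <= radius * radius;
}

// Start from the last sensor and follow each unused hit through the
// remaining sensors; three or more hits form a track.
std::vector<Track> find_tracks(std::vector<Sensor>& sensors, float radius) {
    std::vector<Track> tracks;
    for (std::size_t s = sensors.size(); s-- > 0;) {
        for (Hit& seed : sensors[s]) {
            if (seed.used) continue;  // hits must not be already used
            Track candidate{&seed};
            for (std::size_t n = s; n-- > 0;) {
                for (Hit& h : sensors[n]) {
                    if (h.used) continue;
                    if (compatible(*candidate.back(), h, radius)) {
                        candidate.push_back(&h);
                        break;  // first compatible hit on this sensor wins
                    }
                }
            }
            if (candidate.size() >= 3) {
                // Marking hits as used is the dependency between seeds
                // that keeps this loop sequential.
                for (Hit* h : candidate) h->used = true;
                tracks.push_back(candidate);
            }
        }
    }
    return tracks;
}
```

In this sketch a sensor without a compatible hit is simply skipped; whether the real algorithm tolerates such gaps is not stated in the slides.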
Current test setup
We are interested in the search bit!
• Input is 200 Monte-Carlo generated events.
• Implementations produce exactly the same output as Brunel, unless stated otherwise.
• Current setup runs TBB with a variable number of threads specified by task_scheduler_init init(i);
• 1000 experiments are run per configuration. Results shown are the mean of those; the standard deviation is checked as well.

Comparing apples to…
• Lab13
  ▫ Intel Xeon CPU E5-2650 (2 CPUs)
  ▫ 20M Cache, 2.00 GHz (2.80 GHz TB)
  ▫ 8 cores, 16 HW threads
• Intel MIC (Pre-Production Intel® Xeon Phi™ coprocessors)
  ▫ 1.1 GHz
  ▫ 61 cores, 244 HW threads
• GPU
  ▫ NVIDIA GeForce GTX 680
  ▫ 1 GHz
  ▫ 1536 CUDA cores (96 SIMD cores)

Precision
• Is there a real need for double-precision floating-point operations? How about single precision instead…

Mean 1 (correct / produced tracks): 100%
Mean 2 (correct / total number of tracks): 99.9964%
• We miss one track in 28,000.
• No incorrect tracks are generated.

Ma(g)ny-cores like Single Precision!

Current implementation
• Setup is decoupled from the Gaudi framework.
• Produced physics are the same.
• Parallelism is set up as one thread per event.
• GPU acts as simple SIMD (“speedup” of 0.3x!)
  ▫ divergent branches and warps are not good friends

Event-wise parallelism?
Using a similar idea to the baseline algorithm, we can exploit the inherent parallel nature of the problem.
• Average #hits per sensor: 22.6
• Average multiplicity (hit x hit): 771.15
• Average multiplicity (hit x hit x hit): 1544.7
The early-stage parallel algorithm produces 85% of the correct tracks.
Different results don’t necessarily mean wrong! Physics demonstration!

What’s missing?
The current setup is cool and dandy for comparing results, but not for testing the real setup!

Integration with Gaudi
The current HLT doesn’t consider having coprocessors to help in the execution of any step. The framework is sequential!
Per-event execution on a coprocessor is not realistic. Memory copies will kill us!
• Each event is approximately 50 kB.
• Processing one single event is trivial.
We have to hide the latency!

Pipelining!
E.g. #event chunk = 200

Integration with Gaudi
Gaudihive

In conclusion
• The analysis of the sequential algorithm is complete.
• A speedup of 10.70x has been obtained by properly configuring TBB.
• MIC underperforms because the VPUs are not used; more tweaking is necessary.
• Using floats rather than doubles is beneficial for many-core architectures, and results are practically the same (one track in 28,000 is missed, none are incorrect).
• A parallel version of PatPixel would give a more realistic architecture comparison, and should perform better.
• The current framework with a good pipeline could enable the use of a coprocessor in a production environment.
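As a rough illustration of the chunking behind the pipelining idea (events of about 50 kB grouped into chunks of 200 before being copied to a coprocessor), the grouping step could look like the sketch below. It covers only the bookkeeping, not the overlap of copies and computation. The names `make_chunks`, `chunk_bytes`, `kEventBytes` and `kChunkEvents` are hypothetical, and 50 kB is taken as 50,000 bytes for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed figures from the slides: ~50 kB per event, 200 events per chunk.
constexpr std::size_t kEventBytes  = 50 * 1000;
constexpr std::size_t kChunkEvents = 200;

// Half-open [first, last) ranges of event indices, one per chunk.
using Chunk = std::pair<std::size_t, std::size_t>;

// Split n_events into consecutive chunks; the last one may be shorter.
std::vector<Chunk> make_chunks(std::size_t n_events, std::size_t chunk_events) {
    std::vector<Chunk> chunks;
    if (chunk_events == 0) return chunks;  // nothing sensible to do
    for (std::size_t first = 0; first < n_events; first += chunk_events) {
        chunks.emplace_back(first, std::min(first + chunk_events, n_events));
    }
    return chunks;
}

// Bytes moved to the coprocessor in a single copy for one chunk.
std::size_t chunk_bytes(const Chunk& chunk) {
    return (chunk.second - chunk.first) * kEventBytes;
}
```

Under these assumptions a full chunk of 200 events amounts to about 10 MB per copy, so the fixed cost of a transfer is paid once per 200 events rather than once per event.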