Parallel considerations of
VELO PatPixelTracking
Daniel Hugo Cámpora Pérez
LHCb Online team
Outline
• PatPixel problem description
• Test setup, some results
• Integration with Gaudi framework
Daniel Hugo Cámpora Pérez - Parallel Considerations of VELO PatPixelTracking
21-11-2012
Fast Pixel problem description
• 48 sensors with 12 chips each
• Each chip has 256x256 pixels
• Clustered 2x2 by readout board
• Right and left sensors at different z with overlap
• The algorithm searches for hits starting from the last pixel lattice.
• For each hit, it searches for compatible hits (within a given radius) in the next pixel lattice.
• A track is formed once at least three compatible hits are found.
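The search described above can be sketched roughly as follows. This is a simplified illustration, not the PatPixel implementation: the hit representation as (x, y) tuples, the function name find_tracks, the plain x-y distance used as the compatibility test, and the rule that a track stops at the first sensor without a compatible hit are all assumptions made for the example.

```python
from math import hypot


def find_tracks(sensors, radius, min_hits=3):
    """Greedy sequential track search (hypothetical, simplified).

    sensors: list of lists of (x, y) hits, one inner list per sensor,
             ordered by z.
    radius:  assumed compatibility window in the x-y plane.
    """
    used = set()   # (sensor index, hit index) pairs already on a track
    tracks = []
    # Seed from the last sensor and walk towards the first one.
    for s in range(len(sensors) - 1, -1, -1):
        for h, seed in enumerate(sensors[s]):
            if (s, h) in used:
                continue
            track = [(s, h)]
            last = seed
            for nxt in range(s - 1, -1, -1):
                # Pick the closest unused hit within the radius, if any.
                best, best_d = None, radius
                for k, cand in enumerate(sensors[nxt]):
                    if (nxt, k) in used:
                        continue
                    d = hypot(cand[0] - last[0], cand[1] - last[1])
                    if d <= best_d:
                        best, best_d = k, d
                if best is None:
                    break  # simplification: stop at the first gap
                track.append((nxt, best))
                last = sensors[nxt][best]
            if len(track) >= min_hits:
                used.update(track)
                tracks.append(track)
    return tracks
```

Each accepted track changes the shared set of used hits, so the outcome for one seed depends on the seeds processed before it.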
However, the current approach is very sequential (albeit efficient!).
• Hits must not already be used.
• continue instructions cut the loop short, which makes it fast.
Porting the same algorithm as is to other programming models makes for a proof of concept (the physics produced is the same).
Current test setup
We are interested in the search step!
• Input is 200 Monte Carlo generated events.
• Implementations produce exactly the same output as Brunel, unless stated otherwise.
• The current setup runs TBB with a variable number of threads, specified by task_scheduler_init init(i);
• 1000 experiments are run per configuration. Results shown are the mean of those; the standard deviation is checked as well.
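The measurement procedure above (repeat each configuration many times, report the mean, check the standard deviation, vary the thread count) could look roughly like this. The sketch swaps in Python's ThreadPoolExecutor for the TBB scheduler purely to show the shape of the harness; process_event is a hypothetical stand-in for the per-event search, and because of CPython's global interpreter lock the timings would not reproduce the scaling measured with TBB.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def process_event(event):
    """Hypothetical stand-in for the per-event track search."""
    return sum(event)


def time_configuration(events, n_threads, n_experiments):
    """Return (mean, standard deviation) of the wall time in seconds.

    n_experiments must be at least 2 for the standard deviation.
    """
    timings = []
    for _ in range(n_experiments):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(process_event, events))  # one task per event
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


def scan_thread_counts(events, thread_counts, n_experiments):
    """Time every thread count with the same input events."""
    return {n: time_configuration(events, n, n_experiments)
            for n in thread_counts}
```

Calling scan_thread_counts(events, range(1, 17), 1000) would mirror the setup on the slide, with one (mean, standard deviation) pair per thread count.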
Comparing apples to…
• Lab13
▫ Intel Xeon CPU E5-2650 (2 CPUs)
▫ 20M Cache, 2.00 GHz (2.80 GHz TB)
▫ 8 cores, 16 HW threads
• Intel MIC (Pre-Production Intel® Xeon Phi™ coprocessors)
▫ 1.1 GHz
▫ 61 cores, 244 HW threads
• GPU
▫ NVIDIA GeForce GTX 680
▫ 1 GHz
▫ 1536 CUDA cores (96 SIMD cores)
Precision
• Is there a real need for double-precision floating-point operations?
How about single precision instead…
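As a rough feel for what single precision gives up: IEEE 754 binary32 keeps a 24-bit significand, so storing one value as a float instead of a double introduces a relative rounding error of at most 2^-24 (about 6e-8). The sketch below measures that error for a single number; the helper names and the sample coordinate are made up for illustration, and whether such errors change any tracking decision depends on the cuts used.

```python
import struct


def to_float32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def relative_error(value):
    """Relative error introduced by storing value in single precision."""
    return abs(to_float32(value) - value) / abs(value)


# Hypothetical hit coordinate, chosen only for the example.
coordinate = 12.3456789
error = relative_error(coordinate)
```

Here error stays below 2**-24 for any value in the normal binary32 range.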
Mean 1 (correct / produced tracks): 100%
Mean 2 (correct / total number of tracks): 99.9964%
• We miss one track in 28,000.
• No incorrect tracks are generated.
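The two ratios can be reproduced with a few lines, assuming tracks are compared as sets of hashable identifiers (a representation chosen for this example; the slides do not say how tracks are matched, nor whether the means are taken per event). With 28,000 reference tracks and one missed, the second ratio is 27,999 / 28,000, or 99.9964%.

```python
def track_metrics(produced, reference):
    """Compare produced tracks against reference tracks.

    Both arguments are sets of hashable track identifiers
    (hypothetical representation).
    """
    correct = produced & reference
    mean1 = len(correct) / len(produced)    # correct / produced tracks
    mean2 = len(correct) / len(reference)   # correct / total number of tracks
    return mean1, mean2


# Worked example: one track out of 28,000 is missed, none is wrong.
reference = set(range(28000))
produced = reference - {0}
mean1, mean2 = track_metrics(produced, reference)
```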
Ma(g)ny-cores like Single Precision!
Current implementation
• Setup is decoupled from the Gaudi framework.
• Produced physics are the same.
• Parallelism is set up as one thread per event.
• GPU acts as simple SIMD (“speedup” of 0.3x!)
▫ divergent branches and warps are not good friends
Event-wise parallelism?
Using an idea similar to the baseline algorithm, we can exploit the inherently parallel nature of the problem.
• Average #hits per sensor: 22.6
• Average multiplicity (hit x hit): 771.15
• Average multiplicity (hit x hit x hit): 1544.7
An early-stage parallel algorithm produces 85% of the correct tracks.
Different results don’t necessarily mean wrong results! Physics demonstration!
What’s missing?
The current setup is cool and dandy for comparing results, but not for testing the real setup!
Integration with Gaudi
The current HLT doesn’t consider having coprocessors to help in the execution of any step. The framework is sequential!
Per-event execution on a coprocessor is not realistic. Memory copies will kill us!
• Each event is approximately 50 kB.
• Processing one single event is trivial.
We have to hide the latency!
Pipelining!
E.g. events per chunk = 200
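With roughly 50 kB per event, a chunk of 200 events is about 10 MB per transfer. A minimal sketch of such a pipeline follows, in which a producer thread stages the next chunk while the current one is processed. The names chunked and run_pipeline are invented for the example, and transfer and process are hypothetical callables standing in for the host-to-coprocessor copy and the tracking step; this is not how Gaudi schedules work.

```python
import queue
import threading


def chunked(events, chunk_size):
    """Yield consecutive slices of at most chunk_size events."""
    for start in range(0, len(events), chunk_size):
        yield events[start:start + chunk_size]


def run_pipeline(events, chunk_size, transfer, process):
    """Overlap the transfer of chunk k+1 with the processing of chunk k."""
    staged = queue.Queue(maxsize=2)   # bounded: at most two chunks waiting
    results = []

    def producer():
        for chunk in chunked(events, chunk_size):
            staged.put(transfer(chunk))
        staged.put(None)              # sentinel: no more chunks

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        chunk = staged.get()
        if chunk is None:
            break
        results.extend(process(chunk))
    worker.join()
    return results
```

The bounded queue keeps the producer at most two chunks ahead, which caps the memory held in flight while still hiding the copy latency behind the processing of the previous chunk.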
Integration with Gaudi
Gaudihive
In conclusion
• The analysis of the sequential algorithm is complete.
• A speedup of 10.70x has been obtained by properly configuring TBB.
• MIC underperforms because its VPUs are not being used; more tweaking is necessary.
• Using floats rather than doubles is beneficial for many-core architectures, and results are practically the same (one track in 28,000 is missed).
• A parallel version of PatPixel would allow a more realistic architecture comparison, and should perform better.
• The current framework with a good pipeline could enable the use of a coprocessor in a production environment.