Transcript Slide 1
Image Reconstruction on Multicore Processors
Graduate Students Eric Fontaine and Viraj Paropkari Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee
Agenda
• Background • FDK algorithm – Overview – Parallelization Method – Current Results • Katsevich Algorithm – Overview – Parallelization Method – Current Results • Future Plans 2
Background
• Use 3-D CT scan to identify tumors and other defects inside the body.
• Two common methods – MRI • Complex math and physics • Main function ─ Simple IFFT – Filtered back-projection • Two common filtered back-projection algorithms – FDK • Approximation, fast • Use projections taken on a circular path surrounding the object • More accurate on the plane containing the circle – Katsevich • More accurate, but also more compute-intensive • Use projections taken on a helical path surrounding the object • It can reconstruct long objects, unlike the original FDK.
• Both contain large data parallelism 3
FDK Algorithm Overview
• Cone beam image reconstruction with source on a helix for a flat detector • Reconstruction for 3-D volume • Initialize the helix source parameters • Compute/load cone beam data • Length correction weighting • 1-D horizontal filtering • Linear Pre-interpolation • Back projection • Compare Results with standard phantom 4
Parallelization Strategy
• Based on FDK algorithm for general scanning paths like helix.
* – Each thread is assigned a subset of the total number of projections, and performs length correction weighting, filtering and back projections of its assigned projections. – After all threads are done, there is an implicit barrier necessary for synchronization. Then each thread is assigned a subset of the total volume to reconstruct.
– We use OpenMP • Reconstruct subsets of the total volume in parallel (to fit into individual cache) • Piece the image together at the end (reduced inter-core communication)
Length correction weighting, filtering, back-projection Assign Projections barrier Reconstructed Image Length correction weighting, filtering, back-projection
*Ge Wang, Tein-Hsiang Lin, Ping-chin Cheng, and Douglas M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Trans. On Medical Imaging, 12(3):486-496, September 1993 5
Single and Dual-Thread Performance
Performance (Seconds) Speedup of dual-thread OpenMP code 2 Single Thread Dual Thread 16^3 13 31 32^3 17 33 64^3 49 50 128^3 311 184 256^3 2416 1263 1.5
1 0.5
0 16^3 Speedup 0.419
32^3 0.515
Slowdown 64^3 0.98
128^3 1.69
256^3 1.912
6
FDK Analysis for Memory Behavior
Statistics of Single Thread 45 40 35 30 25 20 15 10 5 0 L1 Miss% L2 Miss% DTLB Miss% 16^3 23.65
0.64
6.67
32^3 38.62
41.74
9.95
64^3 128^3 256^3 36.93
23.75
12.93
34.08
18.6
12.5
21.5
14.39
14.79
Statistics of Two Threads 100 80 60 40 20 0 16^3 L1 Miss% L2 Miss% 81.8
1.64
DTLB Miss% 10.27
32^3 90.55
1.76
12.05
64^3 91.45
1.78
11.67
128^3 256^3 91.58
1.87
12.86
96.58
6.13
11.71
7
Katsevich Algorithm Overview
• Reconstructs a 3-D cylindrical volume exactly from 2-D projections.
[1] – The inputs are projections (b) taken from a helical path surrounding the volume of interest (a). • Implemented the Noo method appropriately (c).
the total volume.
[2] PI-interval containing that voxel.
: – These projections are differentiated and weighted – These undergo a 1-D Hilbert transform along the κ-lines.
• First undergo remapping to κ-line coordinates (d).
• Perform 1-D convolution w/ filter kernel (e).
• Return to projection coordinates by remapping (f).
– To reconstruct the 3-D volume (g), each voxel’s coordinates is back projected the source projections • The cumulative sum is taken for all projections belonging to the • Used similar parallelization strategy to FDK – Each thread processes a subset of the projections.
– After synchronization, each thread reconstructs a subset of
(a) (b) (c) (d) (e) (f) (g)
[1] Alexander Katsevich, " Theoretically exact FBP-type inversion algorithm for spiral CT ", Society for Industrial and Applied Mathematics Journal on Applied Mathematics, 62:2012-2026, 2002. [2] F. Noo, J. Pack, and D. Heuscher, “Exact helical reconstruction using native cone-beam geometries,”
Physics in Medicine and Biology
, vol. 48, pp. 3787 –3818, 2003. 8
Results
Reconstruction Time (single-precision)
1600 1400 1200 1000 800 600 400 200 0 0.5
0.2
64 6.1
3.1
90.5
45.8
128 256
Width of Reconstructed Volume
1454.6
740.1
512 1 Thread 2 Threads
Speedup using 2 Threads (single-precision)
1.993
1.940
1.974
1.965
2 1.9
1.8
1.7
1.6
1.5
1.4
1.3
1.2
1.1
1 64 128 256
Width of Reconstructed Volume
512 • Using Intel Core2 Duo @ 2.66 GHz.
• Close to 2x speedup 2 1.9
1.8
1.7
1.6
1.5
1.4
1.3
1.2
1.1
1
Reconstruction Time (double precision)
2500 2000 1500 1000 500 0 2133.0
1087.0
0.6
0.3
64 8.8
4.4
136.0
68.9
128 256
Width of Reconstructed Volume
512 1 thread 2 threads
Speedup using 2 Threads (double precision)
1.984
1.980
1.974
1.962
64 128 256
Width of Reconstructed Volume
512 9
Image Quality
512^3 Reconstruction 512 Projections per Turn, 512x64 size projections 512^3 original Phantom
10
Benchmark
Comparison to Published Results in [3] (single-threaded, double precision)
7000 6000 5000 4000 3000 2000 1000 0 6013 my results uiowa results 2133 9 226 136 975 128 256
Width of Reconstructed Volume
512 • Compared against the published timing results in [3] , which used 64-bit AMD Opteron processors.
• Unable to determine exact parameters used by author of [3] , so the comparison may be questionable.
[3] Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L., and Wang, G. 2006. A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction.
J. Supercomput.
38, 1 (Oct. 2006), 35-47.
11
Optimizations Used
Time Breakdown for Different Stages (initial version)
Differentiation and Filtering Determine PI intervals Perform Backprojection Other • Majority of time spent during backprojection and determining the PI intervals.
– PI-intervals are constant for a particular helix.
• PI-intervals are precomputed and saved to a file.
– Only necessary to precompute PI-intervals for one horizontal slice.
• PI-intervals for different horizontal slices can be determined by rotation.
– Easy ~25% speedup 12
Optimizations Used
• Next focused on backprojection inner loop.
– Removed trival lookup tables to save cache space.
• ~10% speedup.
– Used sin, cos lookup tables • ~15% speedup.
– Moved if statements for smoothing the ends of the PI interval outside the loop.
• Duplicated inner loop code.
• ~10% speedup.
– Removed if statements required for bounds testing the backprojected coordinates.
• Needed to add extra row and column slack to projection data.
• ~3% speedup.
13
Future work • Explore memory layout to reduce cache misses and page faults.
• Implement the same algorithms on Cell processor for competitive analysis.
14