Transcript Slide 1

Image Reconstruction on Multicore Processors

Graduate Students Eric Fontaine and Viraj Paropkari Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee

Agenda

• Background • FDK algorithm – Overview – Parallelization Method – Current Results • Katsevich Algorithm – Overview – Parallelization Method – Current Results • Future Plans 2

Background

• Use 3-D CT scan to identify tumors and other defects inside the body.

• Two common methods – MRI • Complex math and physics • Main function ─ Simple IFFT – Filtered back-projection • Two common filtered back-projection algorithms – FDK • Approximation, fast • Use projections taken on a circular path surrounding the object • More accurate on the plane containing the circle – Katsevich • More accurate, but also more compute-intensive • Use projections taken on a helical path surrounding the object • It can reconstruct long objects, unlike the original FDK.

• Both contain large data parallelism 3

FDK Algorithm Overview

• Cone beam image reconstruction with source on a helix for a flat detector • Reconstruction for 3-D volume • Initialize the helix source parameters • Compute/load cone beam data • Length correction weighting • 1-D horizontal filtering • Linear Pre-interpolation • Back projection • Compare Results with standard phantom 4

Parallelization Strategy

• Based on FDK algorithm for general scanning paths like helix.

* – Each thread is assigned a subset of the total number of projections, and performs length correction weighting, filtering and back projections of its assigned projections. – After all threads are done, there is an implicit barrier necessary for synchronization. Then each thread is assigned a subset of the total volume to reconstruct.

– We use OpenMP • Reconstruct subsets of the total volume in parallel (to fit into individual cache) • Piece the image together at the end (reduced inter-core communication)

Length correction weighting, filtering, back-projection Assign Projections barrier Reconstructed Image Length correction weighting, filtering, back-projection

*Ge Wang, Tein-Hsiang Lin, Ping-chin Cheng, and Douglas M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Trans. On Medical Imaging, 12(3):486-496, September 1993 5

Single and Dual-Thread Performance

Performance (Seconds) Speedup of dual-thread OpenMP code 2 Single Thread Dual Thread 16^3 13 31 32^3 17 33 64^3 49 50 128^3 311 184 256^3 2416 1263 1.5

1 0.5

0 16^3 Speedup 0.419

32^3 0.515

Slowdown 64^3 0.98

128^3 1.69

256^3 1.912

6

FDK Analysis for Memory Behavior

Statistics of Single Thread 45 40 35 30 25 20 15 10 5 0 L1 Miss% L2 Miss% DTLB Miss% 16^3 23.65

0.64

6.67

32^3 38.62

41.74

9.95

64^3 128^3 256^3 36.93

23.75

12.93

34.08

18.6

12.5

21.5

14.39

14.79

Statistics of Two Threads 100 80 60 40 20 0 16^3 L1 Miss% L2 Miss% 81.8

1.64

DTLB Miss% 10.27

32^3 90.55

1.76

12.05

64^3 91.45

1.78

11.67

128^3 256^3 91.58

1.87

12.86

96.58

6.13

11.71

7

Katsevich Algorithm Overview

• Reconstructs a 3-D cylindrical volume exactly from 2-D projections.

[1] – The inputs are projections (b) taken from a helical path surrounding the volume of interest (a). • Implemented the Noo method appropriately (c).

the total volume.

[2] PI-interval containing that voxel.

: – These projections are differentiated and weighted – These undergo a 1-D Hilbert transform along the κ-lines.

• First undergo remapping to κ-line coordinates (d).

• Perform 1-D convolution w/ filter kernel (e).

• Return to projection coordinates by remapping (f).

– To reconstruct the 3-D volume (g), each voxel’s coordinates is back projected the source projections • The cumulative sum is taken for all projections belonging to the • Used similar parallelization strategy to FDK – Each thread processes a subset of the projections.

– After synchronization, each thread reconstructs a subset of

(a) (b) (c) (d) (e) (f) (g)

[1] Alexander Katsevich, " Theoretically exact FBP-type inversion algorithm for spiral CT ", Society for Industrial and Applied Mathematics Journal on Applied Mathematics, 62:2012-2026, 2002. [2] F. Noo, J. Pack, and D. Heuscher, “Exact helical reconstruction using native cone-beam geometries,”

Physics in Medicine and Biology

, vol. 48, pp. 3787 –3818, 2003. 8

Results

Reconstruction Time (single-precision)

1600 1400 1200 1000 800 600 400 200 0 0.5

0.2

64 6.1

3.1

90.5

45.8

128 256

Width of Reconstructed Volume

1454.6

740.1

512 1 Thread 2 Threads

Speedup using 2 Threads (single-precision)

1.993

1.940

1.974

1.965

2 1.9

1.8

1.7

1.6

1.5

1.4

1.3

1.2

1.1

1 64 128 256

Width of Reconstructed Volume

512 • Using Intel Core2 Duo @ 2.66 GHz.

• Close to 2x speedup 2 1.9

1.8

1.7

1.6

1.5

1.4

1.3

1.2

1.1

1

Reconstruction Time (double precision)

2500 2000 1500 1000 500 0 2133.0

1087.0

0.6

0.3

64 8.8

4.4

136.0

68.9

128 256

Width of Reconstructed Volume

512 1 thread 2 threads

Speedup using 2 Threads (double precision)

1.984

1.980

1.974

1.962

64 128 256

Width of Reconstructed Volume

512 9

Image Quality

512^3 Reconstruction 512 Projections per Turn, 512x64 size projections 512^3 original Phantom

10

Benchmark

Comparison to Published Results in [3] (single-threaded, double precision)

7000 6000 5000 4000 3000 2000 1000 0 6013 my results uiowa results 2133 9 226 136 975 128 256

Width of Reconstructed Volume

512 • Compared against the published timing results in [3] , which used 64-bit AMD Opteron processors.

• Unable to determine exact parameters used by author of [3] , so the comparison may be questionable.

[3] Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L., and Wang, G. 2006. A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction.

J. Supercomput.

38, 1 (Oct. 2006), 35-47.

11

Optimizations Used

Time Breakdown for Different Stages (initial version)

Differentiation and Filtering Determine PI intervals Perform Backprojection Other • Majority of time spent during backprojection and determining the PI intervals.

– PI-intervals are constant for a particular helix.

• PI-intervals are precomputed and saved to a file.

– Only necessary to precompute PI-intervals for one horizontal slice.

• PI-intervals for different horizontal slices can be determined by rotation.

– Easy ~25% speedup 12

Optimizations Used

• Next focused on backprojection inner loop.

– Removed trival lookup tables to save cache space.

• ~10% speedup.

– Used sin, cos lookup tables • ~15% speedup.

– Moved if statements for smoothing the ends of the PI interval outside the loop.

• Duplicated inner loop code.

• ~10% speedup.

– Removed if statements required for bounds testing the backprojected coordinates.

• Needed to add extra row and column slack to projection data.

• ~3% speedup.

13

Future work • Explore memory layout to reduce cache misses and page faults.

• Implement the same algorithms on Cell processor for competitive analysis.

14