Calculation of Line Integrals Through Voxel Volumes Using CUDA


Implementation of Voxel Volume Projection Operators Using CUDA
Applications to Iterative Cone-Beam CT Reconstruction
Three questions to keep in mind during the presentation
1. Name three differences between the GPU and CPU contributing to the observed performance improvements.
2. Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems!
3. Suppose that each thread in a CUDA program is responsible for writing to one unique output element. Give one possible solution for handling "odd sized" output data.
What are voxel volume projectors?
[Figure: X-ray source, object f, and a detector measuring the projection data p]
• Object: f
• Projection data: p (line integrals through f)
• Forward projection: P, mapping f to p
• Backprojection: PT, mapping p to f
The tomography problem
• Given p, find an image/volume f such that the difference between Pf and p is small in some sense, e.g. z(f) = ||Pf - p||² attains a minimum.
• Steepest descent approach:
– Gradient: ∇z(f) = 2PT(Pf - p)
– Update step: f_{k+1} = f_k - 2α PT(Pf_k - p) (sketched below)
• We use this method only to illustrate why P and PT are needed. In practice, faster and better regularized methods are needed.
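To make the update rule concrete, below is a minimal CPU-only sketch of the iteration, using a small dense, row-major matrix P purely for illustration. The function name and the dense storage are assumptions; in the real application P is never stored explicitly, and the two products are computed by the GPU projectors described in the following slides.

// Minimal CPU sketch (illustration only): f_{k+1} = f_k - 2*alpha*PT*(P*f_k - p)
// with a small dense, row-major M x N matrix P. In practice P is matrix-free.
#include <vector>
#include <cstddef>

void steepestDescent(const std::vector<float>& P,  // M x N, row-major (illustrative)
                     const std::vector<float>& p,  // measured projection data, length M
                     std::vector<float>& f,        // reconstruction, length N, updated in place
                     std::size_t M, std::size_t N,
                     float alpha, int iterations)
{
    std::vector<float> r(M), g(N);
    for (int k = 0; k < iterations; ++k)
    {
        // Forward projection of the current estimate minus data: r = P*f - p
        for (std::size_t i = 0; i < M; ++i)
        {
            float s = 0.f;
            for (std::size_t j = 0; j < N; ++j) s += P[i*N + j] * f[j];
            r[i] = s - p[i];
        }
        // Backprojection of the residual: g = PT*r
        for (std::size_t j = 0; j < N; ++j)
        {
            float s = 0.f;
            for (std::size_t i = 0; i < M; ++i) s += P[i*N + j] * r[i];
            g[j] = s;
        }
        // Gradient step: f = f - 2*alpha*g
        for (std::size_t j = 0; j < N; ++j) f[j] -= 2.f * alpha * g[j];
    }
}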
Why implement P and PT on the GPU?
• Reasonable sizes of f and p are around 2 GB, implying a size of 2 GB x 2 GB for P and PT.
• Although these matrices are sparse, the number of nonzero elements is approximately 2000 GB, i.e., matrix-vector products involving these matrices are computationally demanding.
• The GPU offers many features that help speed up the computations:
– Massive parallelism
– Hardware linear interpolation
– Very high memory bandwidth
– Texture caches "optimized for 2D locality"
The Joseph forward projector [1]
• Illustration in the 2D case:
p_i = L (x_0 + x_1 + x_2 + x_3)
[Figure: ray through a 2D pixel grid with interpolated samples x_0 to x_3 and step length L]
• Generalization to 3D is straightforward: instead of linear interpolation, bilinear interpolation is used (a 2D sketch of the method is given below).
[1] Joseph, P. M., "An improved algorithm for reprojecting rays through pixel images," IEEE Transactions on Medical Imaging, vol. MI-1, pp. 192-196, 1982.
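A minimal CPU sketch of the 2D case is given below, assuming an x-driven ray that crosses every column; the function name, the ray parameterization, and the boundary handling are illustrative assumptions, not the code used in the presentation.

// 2D Joseph projector sketch (illustration only): step one column at a time and
// linearly interpolate between the two nearest rows, then scale by the per-step length L.
#include <cmath>

float joseph2D(const float* img, int nx, int ny,   // nx columns, ny rows, row-major
               float yEntry, float slope)          // ray: y(x) = yEntry + slope*x, |slope| <= 1
{
    float L   = std::sqrt(1.f + slope*slope);      // intersection length per unit step in x
    float sum = 0.f;
    for (int ix = 0; ix < nx; ++ix)
    {
        float y  = yEntry + slope * ix;            // interpolation position in this column
        int   iy = static_cast<int>(std::floor(y));
        float w  = y - iy;                         // linear interpolation weight
        if (iy >= 0 && iy + 1 < ny)                // simple boundary handling
            sum += (1.f - w) * img[iy*nx + ix] + w * img[(iy + 1)*nx + ix];
    }
    return L * sum;                                // p_i = L * (x_0 + x_1 + ...)
}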
Implementation sketch for P
• We choose to let one thread correspond to one ray.
• For each ray/thread:
1. Determine the ray origin and direction.
2. Determine the intersection of the ray with the voxel volume.
3. Step along the ray and accumulate values from the voxel volume, using the 3D texture cache.
4. Multiply by L and store to the output element.
Handling of "odd" detector sizes
• The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2).
• To handle the case where the detector dimensions are not divisible by the block size, conditional statements were used:
// Return if outside detector
if (rowIx>=Nrows) return;
if (chIx>=Nch) return;
• Although this reduces efficiency due to divergent threads, the reduction is small for detectors with a reasonably large number of channels and rows. A matching launch configuration is sketched below.
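For reference, the sketch below shows the corresponding thread indexing and launch configuration; the kernel name and argument list are placeholders, but rounding the grid up is the standard way to cover an arbitrary detector size with a fixed block size.

// Thread indexing inside the kernel (matches the bounds check above)
__global__ void forwardProjectKernel(float* projectionData, int Nch, int Nrows /*, ... */)
{
    int chIx  = blockIdx.x * blockDim.x + threadIdx.x;
    int rowIx = blockIdx.y * blockDim.y + threadIdx.y;
    if (rowIx >= Nrows) return;   // return if outside detector
    if (chIx  >= Nch)   return;
    projectionData[rowIx*Nch + chIx] = 0.f;   // placeholder for the accumulated ray value
}

// Host side (sketch): round the grid size up so that the whole detector is covered
void launchForwardProjection(float* devProjectionData, int Nch, int Nrows)
{
    dim3 block(16, 2);
    dim3 grid((Nch   + block.x - 1) / block.x,
              (Nrows + block.y - 1) / block.y);
    forwardProjectKernel<<<grid, block>>>(devProjectionData, Nch, Nrows);
}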
Implementation details 1: Calculation of source and detector positions
// Calculate focus center and displacement vectors
float alpha = theta + D_ALPHA(chIx);
float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0+D_Z(chIx));
float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);
float3 fh = make_float3(-cos(alpha)*sin(aAngle),
                        -sin(alpha)*sin(aAngle), cos(aAngle));
// Calculate detector center and displacement vectors
float3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));
float3 dw = make_float3(sin(theta), -cos(theta), 0.f);
float3 dh = make_float3(0.f, 0.f, 1.f);
// Calculate focus position in texture index coordinates
float3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;
// Calculate detector position in texture index coordinates
float3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;
// Create ray struct
Ray ray;
ray.o = f;
ray.d = normalize(d - f);
Code based on the NVIDIA CUDA SDK sample "volumeRender"
Implementation details 2: Accumulation of contributions
// Accumulate contributions along ray
float dValue = 0;
float t = tnear;
for (int i = 0; i < dimensions.x; i++)
{
    // Calculate current position
    float3 pos = ray.o + ray.d*t + 0.5f; // texture origin: (0.5f, 0.5f, 0.5f)
    // Read from 3D texture
    dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);
    // Increase t
    t += tstep;
    if (t >= tfar) break;
}
// Update detector data value
dValue *= length(ray.d*spacing)*tstep;
projectionData[rowIx*Nch+chIx] += dValue;
Code based on the NVIDIA CUDA SDK sample "volumeRender"
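The kernel reads the volume through the 3D texture reference devT_reconVolume; the host-side setup is not shown in the slides. The sketch below uses the legacy texture reference API from the same SDK generation as the volumeRender sample, and the function and variable names are assumptions.

// Host-side sketch (illustration only): copy the volume into a cudaArray and bind it
// to the 3D texture reference used by the kernel, with hardware trilinear interpolation.
texture<float, 3, cudaReadModeElementType> devT_reconVolume;

void bindReconVolume(const float* hostVolume, int nx, int ny, int nz)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(nx, ny, nz);

    cudaArray* volumeArray = 0;
    cudaMalloc3DArray(&volumeArray, &desc, extent);

    cudaMemcpy3DParms copyParams = {0};
    copyParams.srcPtr   = make_cudaPitchedPtr((void*)hostVolume,
                                              nx*sizeof(float), nx, ny);
    copyParams.dstArray = volumeArray;
    copyParams.extent   = extent;
    copyParams.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copyParams);

    devT_reconVolume.normalized     = false;                 // unnormalized texture coordinates
    devT_reconVolume.filterMode     = cudaFilterModeLinear;  // hardware trilinear interpolation
    devT_reconVolume.addressMode[0] = cudaAddressModeClamp;
    devT_reconVolume.addressMode[1] = cudaAddressModeClamp;
    devT_reconVolume.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(devT_reconVolume, volumeArray, desc);
}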
Experiments – forward projection (P)
• Input data dimensions: 672x24x3000 floats
• Output data dimensions: 512x512x257 floats
• Only very small differences in accuracy, without practical implications, occur between the CPU and GPU implementations.
• Calculation times:
– CPU: 2500 s
– GPU: 47 s
• Speedup factor: approximately 50x
• For a larger collection of problems, speedups between 20x and 50x have been observed.
Efficient implementation of the exact adjoint operator PT is much trickier. Why?
• 2D illustration of the procedure:
[Figure: detector value p_i scattered back into the 2D pixel grid]
• Same interpolation coefficients as for P, but with scattered accumulation instead of reading.
• No hardware interpolation. No 2D/3D textures.
• A new parallelization setup is needed: one ray per thread leads to corrupted results (see the sketch below).
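The hazard can be seen already in a minimal, hypothetical kernel in which every thread accumulates into the same element:

// Hypothetical illustration: all threads add 1.0f to the same element.
__global__ void scatterAccumulate(float* a, int ix)
{
    a[ix] = a[ix] + 1.0f;        // race: read-modify-write sequences from different
                                 // threads interleave, so updates are silently lost
    // atomicAdd(&a[ix], 1.0f);  // correct alternative on hardware with floating-point
                                 // atomics, but conflicting updates are serialized
}

With the plain assignment, the final value is typically far smaller than the number of launched threads, and the scattered traffic is also uncoalesced. This is why the implementation on the next slide instead groups rays so that conflicting updates do not occur.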
One slow but exact implementation of the exact transpose of P
• Let one thread represent only one accumulation from a ray, i.e.,
– calculation of the position in the voxel volume,
– calculation of and multiplication with the interpolation coefficients,
– accumulation to the 4 voxels closest to the intersection of the active ray with the voxel volume plane.
• Very short threads, and very many of them.
• One thread block represents a number of rays, separated so that conflicting updates do not occur:
[Figure: rays grouped into Block 0, Block 1, Block 2, ...]
Experiments – backprojection (PT)
• Input data dimensions: 512x512x257 floats
• Output data dimensions: 672x24x3000 floats
• Calculation times:
– CPU: 2700 s
– GPU: 700 s
• Speedup factor: approximately 4x
• For a larger collection of problems, speedups between 4x and 8x have been observed.
Approximate implementation of the adjoint operator PT
• Bilinear interpolation on the detector.
• 2D illustration:
[Figure: 2D illustration of the approximate adjoint]
• Parallelization is now accomplished by letting one pixel correspond to one thread (see the sketch below).
• This method is approximate, since the interpolation coefficients generally differ from those of PT.
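A structural sketch of this parallelization is given below. The kernel name, the 2D projection texture devT_projection, and the geometry helpers projectToDetector and footprintWeight are assumptions introduced only to show the idea; the slide's one-pixel-per-thread mapping is realized here as one thread per (x, y) column of voxels, which is not necessarily the mapping used in the presentation.

// Hypothetical sketch of the approximate adjoint for one projection: each thread owns
// its own output voxels, so no two threads write to the same element.
texture<float, 2, cudaReadModeElementType> devT_projection;  // assumed bound elsewhere

__device__ float2 projectToDetector(int ix, int iy, int iz)
{
    // Placeholder geometry mapping (assumption): the real mapping depends on the
    // source position, detector geometry, and voxel spacing.
    return make_float2((float)ix, (float)(iy + iz));
}

__device__ float footprintWeight(int ix, int iy, int iz)
{
    // Placeholder backprojection weight (assumption).
    return 1.f;
}

__global__ void approxBackprojectKernel(float* volume, int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;              // handle "odd sized" volumes

    for (int iz = 0; iz < nz; ++iz)
    {
        float2 det = projectToDetector(ix, iy, iz);
        float  w   = footprintWeight(ix, iy, iz);
        // Hardware bilinear interpolation on the detector via the 2D texture cache
        volume[(iz*ny + iy)*nx + ix] += w * tex2D(devT_projection, det.x, det.y);
    }
}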
Performance comparison: approximate versus exact PT operator
• Calculation times:
– CPU PT: 2700 s
– GPU PT (exact): 700 s
– GPU PT (approximate): 60 s
• Axial images of head phantom reconstructions (25 HU - 75 HU):
[Images: Exact | Approximate]
Conclusions and future research
• For operations such as P and PT, speedups of 4x to 50x can be obtained.
• The amount of speedup is highly dependent on
– the possibility to read memory efficiently (coalesced, or by use of texture caches / constant memory),
– the possibility to use hardware interpolation,
– the complexity of the kernel: using too many registers slows down the program; lookup tables in constant memory can help reduce the number of registers.
• Although scattered writes are supported by CUDA, for these problems they did not seem advantageous from a performance point of view. Note that scattered writing also prohibits the use of cudaArray textures.
• Remark: even if the approximate PT operator gives a bad result in the example shown here, there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.