Medical Image Registration: A Quick Win
Richard Ansorge
29 Oct 2008
The problem
• CT, MRI, PET and ultrasound produce 3D volume images.
• Typically 256 x 256 x 256 = 16,777,216 image voxels.
• Combining modalities (inter-modality registration) gives extra information.
• Repeated imaging over time with the same modality (intra-modality registration), e.g. MRI, is equally important.
• The images have to be spatially registered.
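For scale: at 4 bytes per voxel (the float format used later in the talk), 16,777,216 voxels is about 64 MiB per volume, and every cost-function evaluation has to touch all of them.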
Example – brain lesion
[Figure: CT, MRI and PET images of a brain lesion.]
PET-MR Fusion
The PET image shows metabolic activity. This complements the MR structural information.
Registration Algorithm
[Flowchart: Image B is transformed to Image B′ to match Image A; a cost function compares A with B′; if the fit is not yet good, the transform parameters are updated and the loop repeats; when the fit is good, registration is done.]
NB: the cost-function calculation dominates for 3D images and is inherently parallel.
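The outer loop is cheap compared with the cost-function evaluation, so it maps naturally onto a host/device split: the kernel computes the cost, the host updates the parameters. A minimal sketch of that loop (illustrative only; the callables stand in for the CUDA kernel launch and the optimiser step shown later):

    #include <vector>

    // Generic outer registration loop: CostFn evaluates the cost of a parameter
    // vector (on the GPU in this talk), StepFn produces an updated parameter vector.
    template <class CostFn, class StepFn>
    std::vector<double> register_images(std::vector<double> params,
                                        CostFn evaluate_cost, StepFn update_parameters,
                                        double tolerance)
    {
        double cost = evaluate_cost(params);            // cost of the initial guess
        double previous = cost + 2.0*tolerance;
        while (previous - cost > tolerance) {           // stop when improvement stalls
            previous = cost;
            params = update_parameters(params, cost);   // cheap, host side
            cost = evaluate_cost(params);               // dominant cost, device side
        }
        return params;
    }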
Transformations
A general affine transform has 12 parameters:

    ( a11  a12  a13  a14 )
    ( a21  a22  a23  a24 )
    ( a31  a32  a33  a34 )
    (  0    0    0    1  )
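As an illustration (not from the slides), applying this matrix to a voxel position (x, y, z) written in homogeneous coordinates gives the transformed position; the bottom row (0 0 0 1) just carries the homogeneous 1 through unchanged. The row-major layout below matches the c_aff array used in the kernel later:

    struct Vec3 { float x, y, z; };

    // Apply a 4x4 affine matrix a (row-major, bottom row 0 0 0 1) to point p.
    Vec3 apply_affine(const float a[16], Vec3 p)
    {
        Vec3 q;
        q.x = a[0]*p.x + a[1]*p.y + a[2]*p.z  + a[3];    // a11 x + a12 y + a13 z + a14
        q.y = a[4]*p.x + a[5]*p.y + a[6]*p.z  + a[7];    // a21 x + a22 y + a23 z + a24
        q.z = a[8]*p.x + a[9]*p.y + a[10]*p.z + a[11];   // a31 x + a32 y + a33 z + a34
        return q;                                        // fourth row adds nothing new
    }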
Polynomial transformations can be useful for, e.g., pincushion-type distortions:

x′ = a11 x + a12 y + a13 z + a14 + b1 x^2 + b2 xy + b3 y^2 + b4 z^2 + b5 xz + b6 yz
y′ = …
z′ = …
Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
We tried this before
6 Parameter Rigid Registration - done 8 years ago
[Graph: run time in seconds (left axis) and speedup factor (right axis) versus number of processors, 0 to 64, on the SR2201; a 333 MHz PC reference and the perfect-scaling line are also shown.]
Now - Desktop PC - Windows XP
Needs 400 W power supply
Free Software: CUDA & Visual C++ Express
Visual C++ SDK in action
Architecture
9600 GT Device Query
The current GTX 280 has 240 cores!
Matrix Multiply from SDK
NB using 4-byte floats
Matrix Multiply (from SDK)
[Graph: average GPU speedup over the CPU for NxN matrix multiply, plotted for N from 0 to 6144.]
Matrix Multiply (from SDK)
[Graph: GPU speedup over the CPU for NxN matrix multiply, N from 0 to 6144, showing both the per-size speedup and the average speedup.]
Matrix Multiply (from SDK)
[Graph: GPU speedup over the CPU for NxN matrix multiply together with absolute speeds, GPU in mads/ns and CPU in mads/100 ns, for N from 0 to 6144.]
Image Registration
CUDA Code
#include <cutil_math.h>

texture<float, 3, cudaReadModeElementType> tex1;  // Target Image in texture
__constant__ float c_aff[16];                     // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;  // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y;  // Source Image x-y

    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                                // start with slice zero
    float4 v  = make_float4(x, y, z, 1.0f);
    float4 r0 = make_float4(c_aff[ 0], c_aff[ 1], c_aff[ 2], c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4], c_aff[ 5], c_aff[ 6], c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8], c_aff[ 9], c_aff[10], c_aff[11]);
    float4 r3 = make_float4(c_aff[12], c_aff[13], c_aff[14], c_aff[15]); // 0,0,0,1?

    float tx = dot(r0, v);  // Matrix Multiply using dot products
    float ty = dot(r1, v);
    float tz = dot(r2, v);

    float source = 0.0f;
    float target = 0.0f;
    float cost   = 0.0f;
    uint is    = iy*nx + ix;
    uint istep = nx*ny;
    for (int iz = 0; iz < nz; iz++) {  // process all z's in same thread here
        source = s[is];
        target = tex3D(tex1, tx, ty, tz);
        is += istep;
        v.z += 1.0f;
        tx = dot(r0, v);
        ty = dot(r1, v);
        tz = dot(r2, v);
        cost += fabs(source-target);   // other costfuns here as required
    }
    b[iy*nx + ix] = cost;              // store thread sum for host
}
Notes on the kernel, which the original slides annotate line by line:

tex1: the moving image, stored as a 3D texture.
c_aff: the affine transformation matrix, stored in constant memory.
__global__ void d_costfun(...) is the device (kernel) function declaration.
nx, ny & nz: image dimensions (assumed the same for both images).
b: output array for partial sums.
s: reference image (mislabelled as the Source Image in the code comments).
int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;

These two lines answer "which thread am I?" (similar to MPI), except that here there is one thread for each x-y pixel: 240 x 256 = 61,440 threads (compare ~128 nodes for MPI).
float cost = 0.0f;     // accumulates cost function contributions
uint is    = iy*nx+ix; // index of this thread's voxel in the first z-slice
uint istep = nx*ny;    // stride to the same voxel in subsequent z-slices

These are the initialisations and the first matrix multiply: "v" is a 4-vector holding the current voxel x,y,z address, and "tx, ty, tz" hold the corresponding transformed position. (Setting v.z = 0.0f explicitly for the first slice would be redundant, since z is already initialised to zero above.)
for(int iz=0;iz<nz;iz++) {            // process all z's in same thread here
    source = s[is];
    target = tex3D(tex1, tx, ty, tz); // NB very FAST trilinear interpolation!!
    is += istep;
    v.z += 1.0f;                      // step to next z slice
    tx = dot(r0,v);
    ty = dot(r1,v);
    tz = dot(r2,v);
    cost += fabs(source-target);      // other costfuns here as required
}
b[iy*nx+ix]=cost;                     // store thread sum for host

The loop sums contributions for all z values at a fixed x,y position. Each thread updates a different element of the 2D results array b.
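The comment "other costfuns here as required" invites swapping the cost term. As one hedged illustration (not from the slides), a sum-of-squared-differences term is a drop-in replacement for the fabs(source-target) line:

    // Sketch only: squared-difference (SSD) cost term; inside the loop one would
    // write cost += ssd_term(source, target); instead of cost += fabs(source-target);
    __device__ inline float ssd_term(float source, float target)
    {
        float diff = source - target;
        return diff*diff;
    }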
Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16;      // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;

// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float));          // passed as "b" to kernel
bufflen = w2*h2;
Array1D<float> shbuff = Array1D<float>(bufflen);
shbuff.Zero();
hbuff = shbuff.v;
cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float));    // passed as "s" to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);

e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);            // translations
r = make_float3(0.0f);            // rotations
s = make_float3(1.0f,1.0f,1.0f);  // scale factors
t = make_float3(0.0f);            // tans of shears
...
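The fragment does not show how the target volume ends up in the 3D texture tex1. A possible version using the CUDA runtime texture-reference API of the time is sketched below; vol1 (a host copy of the target image) and the clamped addressing are assumptions, but unnormalised coordinates and linear filtering match the tex3D lookups and the fast trilinear interpolation mentioned earlier:

    // Sketch only: copy the target volume into a 3D cudaArray and bind it to tex1.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_vol1 = 0;
    cudaExtent extent = make_cudaExtent(w2, h2, d2);
    cudaMalloc3DArray(&d_vol1, &desc, extent);

    cudaMemcpy3DParms copy = {0};
    copy.srcPtr   = make_cudaPitchedPtr((void*)vol1, w2*sizeof(float), w2, h2);
    copy.dstArray = d_vol1;
    copy.extent   = extent;
    copy.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    tex1.normalized     = false;                 // voxel (not normalised) coordinates
    tex1.filterMode     = cudaFilterModeLinear;  // hardware trilinear interpolation
    tex1.addressMode[0] = cudaAddressModeClamp;
    tex1.addressMode[1] = cudaAddressModeClamp;
    tex1.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(tex1, d_vol1, desc);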
Calling the Kernel
double nr_costfun(Array1D<double> &a)
{
    static Array2D<float> affine = Array2D<float>(4,4);  // a holds current transformation
    double sum = 0.0;

    make_affine_from_a(nr_fit,affine,a);                      // convert to 4x4 matrix of floats
    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));  // load constant mem
    d_costfun<<<gridSize, blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed");                         // OK?
    cudaThreadSynchronize();                                  // make sure all done

    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);
    for(int iy=0;iy<h2;iy++) for(int ix=0;ix<w2;ix++) sum += hbuff[iy*w2+ix]; // final sum

    calls++;
    if(verbose>1){
        printf("call %d costfun %12.0f, a:",calls,sum);
        for(int i=0;i<a.sizex();i++)printf(" %f",a.v[i]);
        printf("\n");
    }
    return sum;
}
Example Run (240x256x176 images)
C:>airwc
airwc v2.5 Usage: AirWc <target> <source> <result> opts(12rtdgsf)
C:>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000
Using device 0: GeForce 9600 GT
Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:127946102
Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187

timer 0 1890 ms
timer 1    0 ms
timer 2 3849 ms
timer 3  448 ms
timer 4    0 ms
Total 6.187 secs

Final Transformation:
 0.944702 -0.184565  0.017164  40.637428
 0.301902  0.866726 -0.003767 -38.923237
-0.028792 -0.100618  0.990019  18.120852
 0.000000  0.000000  0.000000   1.000000
Final rots and shifts
 6.096217 -0.156668 -19.187197
-0.012378  0.072203   0.122495
scales and shears
 0.952886  0.912211  0.995497
 0.150428 -0.101673  0.009023
Desktop 3D Registration
Registration with FLIRT 4.1: 8.5 minutes
Registration with CUDA: 6 seconds
Comments
• This is actually already very useful. Almost interactive (add visualisation).
• Further speedups possible:
  – Faster card
  – Smarter optimiser
  – Overlap I/O and kernel execution (see the sketch after this list)
  – Tweak CUDA code
• Extend to non-linear local registration
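On the overlap point, a hedged sketch of how copies and kernel work could be overlapped with streams (not from the slides; dnextbuff, vol3 and volBytes are made-up names, and the host buffers would need to be allocated with cudaMallocHost for the asynchronous copies to overlap):

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Evaluate the cost function on the current volume in stream s0 ...
    cudaMemcpyAsync(dnewbuff, vol2, volBytes, cudaMemcpyHostToDevice, s0);
    d_costfun<<<gridSize, blockSize, 0, s0>>>(w2, h2, d2, dbuff, dnewbuff);

    // ... while staging the next volume to the device in stream s1.
    cudaMemcpyAsync(dnextbuff, vol3, volBytes, cudaMemcpyHostToDevice, s1);

    cudaStreamSynchronize(s0);   // wait for the cost-function kernel
    cudaStreamSynchronize(s1);   // wait for the staged copy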
Intel Larrabee?
Porting from CUDA to Larrabee should be easy.
Figure 1: Schematic of the Larrabee many-core architecture. The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Thank you