Transcript: Reverse Time Migration on GMAC
NVIDIA GPU Technology Conference – 22nd of September, 2010
Reverse Time Migration on GMAC

Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ RTM
• RTM generates an image of the subsurface layers
• Uses traces recorded by sensors in the field
• RTM's algorithm:
  1. Propagation of a modeled wave (forward in time)
  2. Propagation of the recorded traces (backward in time)
  3. Correlation of the forward and backward wavefields
    • Last forward wavefield with the first backward wavefield
• FDTD is preferred to FFT
  • 2nd-order finite differencing in time
  • High-order finite differencing in space
Introduction
└ Barcelona Supercomputing Center (BSC)
• BSC and Repsol: Kaleidoscope project
  • Develop better algorithms/techniques for seismic imaging
  • We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware
  • PowerPC: scalability issues
  • Cell: good performance (in production @ Repsol), difficult programmability
  • FPGA: potentially best performance, programmability nightmare
  • GPUs: 5x speedup vs Cell (GTX280), what about programmability?
Outline
• Introduction
→ Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ General approach
• We focus on the host-side part of the implementation
  1. Avoid memory transfers between host and GPU memories
    • Implement on the GPU as many computations as possible
  2. Hide latency of memory transfers
    • Overlap memory transfers and kernel execution
  3. Take advantage of the PCIe full-duplex capabilities (Fermi)
    • Overlap deviceToHost and hostToDevice memory transfers
Reverse Time Migration on CUDA
└ General approach
• Forward: 3D-Stencil → Absorbing Boundary Conditions → Source insertion → Compression → Write to disk
• Backward: 3D-Stencil → Absorbing Boundary Conditions → Traces insertion; Read from disk → Decompression → Correlation
Reverse Time Migration on CUDA
└ General approach
• Data structures used in the RTM algorithm
  • Read/Write structures
    • 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
    • State of the wavefield in previous time-steps to compute finite differences in time
    • Some extra points in each direction at the boundaries (halos)
  • Read-Only structures
    • 3D volume of the same size as the wavefield (velocity model)
    • Geophones' recorded traces: time-steps x #geophones
Reverse Time Migration on CUDA
└ General approach
• Data flow-graph (forward): 3D-Stencil → ABC → Source → Compress, reading and writing the wavefields; constant read-only data: velocity model, geophones' traces
Reverse Time Migration on CUDA
└ General approach
• Simplified data flow-graph (forward): the per-step stages are fused into a single RTM Kernel followed by Compress, operating on the wavefields and the constant read-only data (velocity model, geophones' traces)
Reverse Time Migration on CUDA
└ General approach
• Control flow-graph (forward)
  • Start: i = 0
  • RTM Kernel computation (runs on the GPU)
  • Every N steps (i % N == 0): Compress on the GPU, then deviceToHost transfer and Disk I/O (runs on the CPU)
    • Can run in parallel with the next compute steps
  • i++; loop while i < steps, then End
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Disk I/O
• GPU → Disk transfers are very time-consuming
  (timeline: K1 K2 K3 K4 C toHost Disk I/O K5 — the next kernel only starts after the disk write completes)
• Transferring to disk can be overlapped with the next (compute-only) steps
  (timeline: K1 K2 K3 K4 C K5 toHost K6 Disk I/O K7 K8 — kernels run on the GPU, disk I/O runs on the CPU)
Reverse Time Migration on CUDA
└ Disk I/O
• Single transfer: wait for all the data to be in host memory
  (timeline: one long deviceToHost transfer, then Disk I/O)
• Multiple transfers: overlap deviceToHost transfers with disk I/O
  • Double buffering
  (timeline: toH toH toH toH — each small transfer overlapped with the Disk I/O of the previous chunk)
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT limitations
  • GPU memory is accessible by the owner host thread only
  → deviceToHost transfers must be performed by the compute thread
  (diagram: the GPU address space maps into the compute thread's CPU address space only; the I/O thread cannot see it)
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT implementation (single transfer)
  • CUDA streams must be used not to block GPU execution
  → An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Domain decomposition
• But… wait, real-size problems require > 16 GB of data!
• Volumes are split into tiles along the Z-axis (D1, D2, D3, D4)
  • The 3D-Stencil introduces data dependencies between neighboring tiles
Reverse Time Migration on CUDA
└ Domain decomposition
• Multi-node may be required to overcome memory capacity limitations
  • Shared memory for intra-node communication
  • MPI for inter-node communication
  (diagram: Node 1 and Node 2, each with GPU1–GPU4 sharing host memory, connected via MPI)
Reverse Time Migration on CUDA
└ Domain decomposition
• Data flow-graph (multi-domain): one RTM Kernel + Compress pipeline per domain, each working on its own wavefields (domain 1, domain 2); the constant read-only data (velocity model, geophones' traces) is shared
Reverse Time Migration on CUDA
└ Domain decomposition
• Control flow-graph (multi-domain)
  • Kernel, then boundary Exchange and sync every time-step
  • Inter-domain communication blocks execution of the next steps!
  • Every N steps (i % N == 0): Compress and toHost, then Disk I/O (runs on the CPU)
  • i++; loop while i < steps, then End
Reverse Time Migration on CUDA
└ Domain decomposition
• Boundary exchange every time-step is needed
  (timeline: K1 X K2 X K3 X K4 X C K5 toHost X K6 X K7 Disk I/O — every kernel K is followed by an exchange X)
Reverse Time Migration on CUDA
└ Domain decomposition
• Single-transfer exchange
  • "Easy" to program, needs large page-locked buffers
  (timeline: a few long deviceToHost transfers followed by the matching hostToDevice transfers)
• Multiple-transfer exchange to maximize PCI-Express utilization
  • "Complex" to program, needs smaller page-locked buffers
  (timeline: many small toHost transfers overlapped with many small toDevice transfers)
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT limitations
  • Each host thread can only access the memory objects it allocates
  (diagram: four disjoint GPU address spaces, one per GPU, each reachable from a different part of the CPU address space)
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • Page-locked memory buffers can be too big
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange)
  • Uses small page-locked buffers
  • More synchronization code
• Too complex to be represented using PowerPoint!
• Very difficult to implement in real code!
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Problem: boundary exchange blocks the execution of the following time-step
  (timeline: K1 X K2 X K3 X K4 X C K5 toHost X K6 X K7 Disk I/O)
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
  (timeline: each step runs a small stage-1 kernel k and a large stage-2 kernel K; the exchange X overlaps with K, and toHost/Disk I/O overlap the following steps)
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 1: compute the wavefield points to be exchanged (the planes on the boundary between GPU1 and GPU2)
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 2: compute the remaining points while exchanging the boundaries
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• But two-stage execution requires more abstractions and code complexity
  • An additional stream per domain
  • We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→ At this point the code is a complete mess!
  • Requires 4 streams per domain, many page-locked buffers, lots of inter-thread synchronization
  • Poor readability and maintainability
  • Easy to introduce bugs
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  → Features
  • Code examples
• Reverse Time Migration on GMAC
• Conclusions
GMAC at a glance
└ Introduction
• Library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  • Developed by BSC and UIUC
  • NCSA license (BSD-like)
  • Works in Linux and MacOS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom
GMAC at a glance
└ Features
• Unified virtual address space for all the memories in the system
  • Single allocation for shared objects
  • Special API calls: gmacMalloc, gmacFree
  • GPU memory allocated by a host thread is visible to all host threads
→ Brings POSIX thread semantics back to developers
  (diagram: shared data visible to both CPU and GPU in one memory view; CPU-only data remains private)
GMAC at a glance
└ Features
• Parallelism exposed via regular POSIX threads
  • Replaces the explicit use of CUDA streams
  • OpenMP support
• GMAC uses streams and page-locked buffers internally
  • Concurrent kernel execution and memory transfers for free
GMAC at a glance
└ Features
• Optimized bulk memory operations via library interposition
  • File I/O
    • Standard I/O functions: fwrite, fread
    • Automatic overlap of disk I/O and hostToDevice/deviceToHost transfers
• Optimized GPU to GPU transfers via regular memcpy
• Enhanced versions of the MPI send/receive calls
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  • Features
  → Code examples
• Reverse Time Migration on GMAC
• Conclusions
GMAC at a glance
└ Examples
• Single allocation (and pointer) for shared objects

CUDA-RT:

    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<grid, block>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:

    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<grid, block>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }
GMAC at a glance
└ Examples
• Optimized support for bulk memory operations

CUDA-RT:

    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<grid, block>>>(dev_foo, size);
        ...
    }

GMAC:

    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<grid, block>>>(foo, size);
        ...
    }
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Disk I/O
• CUDA-RT implementation (multiple transfers) — recap
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
Reverse Time Migration on GMAC
└ Disk I/O (GMAC)
• GMAC implementation
  • deviceToHost transfers performed by the I/O thread
  • deviceToHost and disk I/O transfers overlap for free
  • Small page-locked buffers are used internally
  (diagram: GPU and CPU share a single global address space)
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Domain decomposition (CUDA-RT)
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • Page-locked memory buffers can be too big
Reverse Time Migration on GMAC
└ Domain decomposition (GMAC)
• GMAC implementation (multiple-transfer exchange)
  • Exchange of boundaries performed using a simple memcpy!
  (diagram: GPU1–GPU4 share a unified global address space)
• Full PCIe utilization: internally GMAC performs several transfers and double buffering
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
  • Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Overlapping computation and communication
• No streams, no page-locked buffers, similar performance: ±2%

CUDA-RT:

    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:

    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  • Inter-domain communication
  → Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Development cycle and debugging
• CUDA-RT
  • Start from a simple, correct sequential code
  • Implement kernels one at a time and check correctness (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
  • To introduce modifications to any kernel
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
Reverse Time Migration on GMAC
└ Development cycle and debugging
• GMAC
  • Allocate objects with gmacMalloc
    • Single pointer
  • Use the same pointer both in the host code and in the GPU kernel implementations (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • No copies
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
→ Conclusions
Conclusions
• Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
• CUDA has programmability issues
  • CUDA provides a good language to expose data parallelism in the code to be run on the GPU
  • The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
• GMAC eases the development of applications for GPU-based systems with no performance penalty
• A single part-time programmer produced the full RTM version in 6 months (5x speedup over the previous Cell implementation)
Acknowledgements
• Barcelona Supercomputing Center
• Repsol
• Universitat Politècnica de Catalunya
• University of Illinois at Urbana-Champaign
Thank you!
Questions?