Reverse Time Migration on GMAC


NVIDIA GTC, 22nd of September, 2010

Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)

Outline

• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA

RTM

• RTM generates an image of the subsurface layers
• It uses traces recorded by sensors in the field
• RTM's algorithm:
  1. Propagation of a modeled wave (forward in time)
  2. Propagation of the recorded traces (backward in time)
  3. Correlation of the forward and backward wavefields (the last forward wavefield with the first backward wavefield)
• FDTD schemes are preferred to FFT
  • 2nd-order finite differencing in time
  • High-order finite differencing in space
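To make the three phases concrete, here is a minimal sketch of the loop structure they imply; the helper names (propagate_forward, insert_traces, correlate) and the snapshot array are illustrative assumptions, not the production code:

    /* Hypothetical sketch of the RTM phases; helper names are assumptions. */
    for (int t = 0; t < nsteps; t++)             /* 1. forward propagation   */
        propagate_forward(wf_fwd[t], velocity);  /* FDTD step of the source  */
    for (int t = nsteps - 1; t >= 0; t--) {      /* 2. backward propagation  */
        insert_traces(wf_bwd, traces, t);        /* recorded data as sources */
        propagate_backward(wf_bwd, velocity);
        correlate(image, wf_fwd[t], wf_bwd);     /* 3. correlation           */
    }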

Introduction

Barcelona Supercomputing Center (BSC)

• BSC and Repsol: Kaleidoscope project
  • Develop better algorithms/techniques for seismic imaging
  • We focused on Reverse Time Migration (RTM), the most popular seismic imaging technique for depth exploration
• The high computational power required started a quest for the most suitable hardware:
  • PowerPC: scalability issues
  • Cell: good performance (in production at Repsol), difficult programmability
  • FPGA: potentially the best performance, a programmability nightmare
  • GPUs: 5x speedup vs Cell (GTX280), but what about programmability?

Outline

• Introduction
• Reverse Time Migration on CUDA
  → General approach
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA

General approach

• We focus on the host-side part of the implementation:
  1. Avoid memory transfers between host and GPU memories
     • Implement as many computations as possible on the GPU
  2. Hide the latency of memory transfers
     • Overlap memory transfers and kernel execution
  3. Take advantage of the PCIe full-duplex capabilities (Fermi)
     • Overlap deviceToHost and hostToDevice memory transfers

Reverse Time Migration on CUDA

General approach

• Forward: 3D-Stencil → Absorbing Boundary Conditions → Source insertion → Compression → Write to disk
• Backward: 3D-Stencil → Absorbing Boundary Conditions → Traces insertion → Read from disk → Decompression → Correlation

Reverse Time Migration on CUDA

General approach

• Data structures used in the RTM algorithm
  • Read/write structures
    • 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
    • State of the wavefield in previous time-steps, to compute the finite differences in time
    • Some extra points in each direction at the boundaries (halos)
  • Read-only structures
    • A 3D volume of the same size as the wavefield (the velocity model)
    • Geophones' recorded traces: time-steps x #geophones
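A minimal sketch of how such a volume might be declared; neither the struct nor the halo width comes from the actual code, they are illustrative assumptions:

    /* Illustrative only: a wavefield volume with halos (names and halo
     * width are assumptions, not the production data structures). */
    #define HALO 4   /* high-order spatial stencil needs extra boundary points */
    typedef struct {
        int    nx, ny, nz;   /* interior points per dimension                  */
        float *curr, *prev;  /* two time levels for the 2nd-order scheme, each
                                (nx+2*HALO)*(ny+2*HALO)*(nz+2*HALO) floats     */
    } Wavefield;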

Reverse Time Migration on CUDA

General approach

• Data flow-graph (forward)

[Figure: the 3D-Stencil, Absorbing Boundary Conditions, Source insertion, and Compress kernels read and write the wavefields; the velocity model and the geophones' traces are constant read-only inputs.]

Reverse Time Migration on CUDA

General approach

• Simplified data flow-graph (forward)

[Figure: a single "RTM Kernel" plus Compress operate on the wavefields; the velocity model and the geophones' traces remain constant read-only inputs.]

Reverse Time Migration on CUDA

General approach

• Control flow-graph (forward)
  • RTM Kernel: the computation, runs on the GPU
  • Compress and transfer to disk: deviceToHost + disk I/O
    • Performed every N steps
    • Can run in parallel with the next compute steps

[Flowchart: start with i = 0; run the RTM Kernel each step; when i % N == 0, Compress and toHost feed the disk I/O; increment i until i == steps. The kernels run on the GPU; the disk I/O runs on the CPU.]
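In host pseudo-C the loop above might look as follows; this is a sketch under assumed helper names, not the actual implementation:

    /* Sketch of the forward control loop; helper names are assumptions. */
    for (int i = 0; i < steps; i++) {
        launch_rtm_kernel(wavefield);        /* runs on the GPU             */
        if (i % N == 0) {                    /* snapshot every N steps      */
            launch_compress(wavefield, buf); /* compression kernel, GPU     */
            copy_to_host(host_buf, buf);     /* deviceToHost                */
            write_to_disk_async(host_buf);   /* CPU, overlaps next steps    */
        }
    }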

Outline

• Introduction
• Reverse Time Migration on CUDA
  • General approach
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA

Disk I/O

• GPU → Disk transfers are very time-consuming

[Timeline: kernels K1–K4 and Compress run on the GPU, then toHost and Disk I/O serialize before K5 can start.]

• Transferring to disk can be overlapped with the next (compute-only) steps

[Timeline: after Compress and K5, toHost runs during K6 and Disk I/O during K7 and K8; the kernels run on the GPU, the disk I/O on the CPU.]

Reverse Time Migration on CUDA

Disk I/O

• Single transfer: wait for all the data to be in host memory before starting the disk I/O

[Timeline: one long deviceToHost transfer followed by one long Disk I/O.]

• Multiple transfers: overlap the deviceToHost transfers with the disk I/O
  • Double buffering

[Timeline: chunked toHost transfers interleaved with overlapping Disk I/O writes.]
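A minimal double-buffering sketch of the multiple-transfer scheme, assuming chunked output, two pinned host buffers, and two streams; all names are illustrative:

    /* Overlap deviceToHost copies with disk writes using two pinned buffers. */
    for (int c = 0; c < nchunks; c++) {
        int b = c & 1;                                 /* alternate buffers    */
        cudaMemcpyAsync(buf[b], dev + c * CHUNK, BYTES,
                        cudaMemcpyDeviceToHost, stream[b]);
        if (c > 0) {                                   /* previous chunk ready */
            cudaStreamSynchronize(stream[1 - b]);
            fwrite(buf[1 - b], 1, BYTES, file);        /* overlaps the copy    */
        }
    }
    cudaStreamSynchronize(stream[(nchunks - 1) & 1]);  /* flush the last chunk */
    fwrite(buf[(nchunks - 1) & 1], 1, BYTES, file);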

Reverse Time Migration on CUDA

Disk I/O

• CUDA-RT limitations
  • GPU memory is accessible by the owner host thread only
  → deviceToHost transfers must be performed by the compute thread

[Figure: only the compute thread bridges the GPU address space and the CPU address space; the I/O thread has no access to GPU memory.]

Reverse Time Migration on CUDA

Disk I/O

• CUDA-RT implementation (single transfer)
  • CUDA streams must be used so that GPU execution is not blocked
  → An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!

[Figure: the wavefield is staged from the GPU address space through a page-locked buffer in the CPU address space.]
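A sketch of that single-transfer scheme; the full-size pinned staging buffer is exactly what can exhaust host memory (variable names are illustrative):

    /* Single transfer through a full-size page-locked staging buffer. */
    float *pinned;
    cudaHostAlloc((void **)&pinned, W_SIZE, cudaHostAllocDefault); /* may exhaust RAM */
    cudaMemcpyAsync(pinned, dev_wavefield, W_SIZE,
                    cudaMemcpyDeviceToHost, io_stream); /* async: kernels on other
                                                           streams keep running    */
    cudaStreamSynchronize(io_stream);                   /* wait for the copy       */
    fwrite(pinned, 1, W_SIZE, file);                    /* then the disk I/O       */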

Reverse Time Migration on CUDA

Disk I/O

• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while the next compute-only steps execute on the GPU
  → Lots of synchronization code in the compute thread

[Figure: the compute thread shuttles chunks from the GPU address space to the CPU address space while the I/O thread writes them to disk.]

Outline

• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA

Domain decomposition

• But… wait, real-size problems require more than 16 GB of data! (a single 1000x1000x1000 single-precision volume is already 4 GB, and RTM keeps several such volumes)
• Volumes are split into tiles (along the Z-axis)
  • The 3D-Stencil introduces data dependencies between neighboring tiles

[Figure: a volume split into domains D1–D4 along the Z-axis.]

Reverse Time Migration on CUDA

Domain decomposition

• Multi-node execution may be required to overcome memory capacity limitations
  • Shared memory for intra-node communication
  • MPI for inter-node communication

[Figure: two nodes with four GPUs each; the GPUs within a node share host memory, and the nodes communicate through MPI.]
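A rough sketch of one halo exchange under this scheme: MPI between nodes, a plain copy through host shared memory within a node. The helper names and the synchronization of the shared-memory path are assumptions:

    /* Exchange one boundary with the neighboring domain (illustrative). */
    copy_halo_to_host(send_buf, d_wavefield);            /* deviceToHost */
    if (neighbor_is_remote)                              /* other node   */
        MPI_Sendrecv(send_buf, H_SIZE, MPI_FLOAT, peer, 0,
                     recv_buf, H_SIZE, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else                                                 /* same node    */
        memcpy(recv_buf, neighbor_shm, H_SIZE * sizeof(float));
    copy_halo_to_device(d_wavefield, recv_buf);          /* hostToDevice */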

Reverse Time Migration on CUDA

Domain decomposition

• Data flow-graph (multi-domain)

[Figure: one RTM Kernel + Compress pair per domain, each operating on its own wavefields; the velocity model and the geophones' traces remain shared, constant read-only inputs.]

Reverse Time Migration on CUDA

Domain decomposition

• Control flow-graph (multi-domain)
  • Boundary exchange every time-step
  • Inter-domain communication blocks the execution of the next steps!

[Flowchart: each step runs the Kernel, then Exchange and a sync; every N steps, Compress and toHost feed the disk I/O; increment i until i == steps. The kernels run on the GPU; the exchange and the disk I/O involve the CPU.]

Reverse Time Migration on CUDA

Domain decomposition

• A boundary exchange is needed at every time-step

[Timeline: every kernel K is followed by a blocking exchange X; Compress, toHost, and Disk I/O occur every N steps.]

Reverse Time Migration on CUDA

Domain decomposition

• Single-transfer exchange
  • "Easy" to program, needs large page-locked buffers

[Timeline: three long deviceToHost transfers followed by three long hostToDevice transfers.]

• Multiple-transfer exchange to maximize PCI-Express utilization
  • "Complex" to program, needs smaller page-locked buffers

[Timeline: many small toHost and toDevice transfers running in both directions at once.]

Reverse Time Migration on CUDA

Domain decomposition

• CUDA-RT limitations
  • Each host thread can only access the memory objects it allocates

[Figure: four GPU address spaces, each reachable only by its owner thread within the CPU address space.]

Reverse Time Migration on CUDA

Domain decomposition

• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • The page-locked memory buffers can be too big

[Figure: each GPU's boundary is staged through the CPU address space.]

Domain decomposition

• CUDA-RT implementation (multiple-transfer exchange)
  • Uses small page-locked buffers
  • More synchronization code
• Too complex to be represented using PowerPoint!
• Very difficult to implement in real code!

Outline

• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA

Overlapping computation and communication

• Problem: the boundary exchange blocks the execution of the following time-step

[Timeline: every kernel K is followed by a blocking exchange X; Compress, toHost, and Disk I/O occur every N steps.]

Reverse Time Migration on CUDA

Overlapping computation and communication

• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains

[Timeline: small stage-1 kernels (k) compute the boundary points first, so each exchange (X) overlaps the large stage-2 kernel (K) that computes the interior; toHost and Disk I/O still overlap the following steps.]

Reverse Time Migration on CUDA

Overlapping computation and communication

• Approach: two-stage execution
  • Stage 1: compute the wavefield points to be exchanged

[Figure: the boundary slices between GPU1's and GPU2's domains are computed first.]

Reverse Time Migration on CUDA

Overlapping computation and communication

• Approach: two-stage execution
  • Stage 2: compute the remaining points while exchanging the boundaries

[Figure: the interior of each GPU's domain is computed while the boundary slices are exchanged.]

Reverse Time Migration on CUDA

Overlapping computation and communication

• But two-stage execution requires more abstractions and more code complexity
  • An additional stream per domain: we already have one to launch kernels, one to overlap transfers to disk, and one to exchange boundaries
→ At this point the code is a complete mess!
  • It requires 4 streams per domain, many page-locked buffers, and lots of inter-thread synchronization
  • Poor readability and maintainability
  • Easy to introduce bugs

Outline

• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  → Features
  • Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance

Introduction

• A library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  • Developed by BSC and UIUC
  • NCSA license (BSD-like)
  • Works on Linux and Mac OS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am in the San Jose Ballroom

GMAC at a glance

Features

• Unified virtual address space for all the memories in the system
  • Single allocation for shared objects
  • Special API calls: gmacMalloc, gmacFree
  • GPU memory allocated by a host thread is visible to all host threads
→ Brings POSIX thread semantics back to developers

[Figure: the CPU and the GPU both map the shared data within a single memory view.]
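A minimal sketch of what those semantics buy: a pointer returned by gmacMalloc (used here as in the code examples later in this deck) can be dereferenced by any host thread. Everything beyond the gmac* calls is an illustrative assumption:

    /* Sketch of cross-thread visibility of a GMAC allocation. */
    float *shared;                         /* single pointer, CPU and GPU */

    void *reader(void *arg) {
        printf("%f\n", shared[0]);         /* any host thread may read it */
        return NULL;
    }

    void run(void) {
        pthread_t tid;
        shared = gmacMalloc(SIZE);
        kernel<<<blocks, threads>>>(shared, SIZE); /* launch config assumed */
        gmacThreadSynchronize();
        pthread_create(&tid, NULL, reader, NULL);
        pthread_join(tid, NULL);
        gmacFree(shared);
    }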

GMAC at a glance

Features

• Parallelism exposed via regular POSIX threads
  • Replaces the explicit use of CUDA streams
  • OpenMP support
• GMAC uses streams and page-locked buffers internally
  • Concurrent kernel execution and memory transfers for free

GMAC at a glance

Features

• Optimized bulk memory operations via library interposition
  • File I/O through the standard functions: fwrite, fread
  • Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers
• Optimized GPU-to-GPU transfers via a regular memcpy
• Enhanced versions of the MPI send/receive calls

Outline

• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  • Features
  → Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance

Examples

• Single allocation (and pointer) for shared objects

CUDA-RT:

    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<...>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:

    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<...>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }

GMAC at a glance

Examples

• Optimized support for bulk memory operations

The code is the same as in the previous example; the highlighted difference is the file I/O path. With the CUDA-RT, fread fills a host buffer that must then be copied to the GPU explicitly:

    fread(foo, size, 1, file);
    cudaMemcpy(dev_foo, foo, size, ToDevice);

With GMAC, fread reads directly into the shared allocation, and the explicit copies disappear:

    fread(foo, size, 1, file);

Outline

• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC

Disk I/O (CUDA-RT)

• CUDA-RT implementation (multiple transfers), as seen before
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while the next compute-only steps execute on the GPU
  → Lots of synchronization code in the compute thread

Reverse Time Migration on GMAC

Disk I/O (GMAC)

• GMAC implementation
  • deviceToHost transfers are performed by the I/O thread
  • deviceToHost transfers and disk I/O overlap for free
  • Small page-locked buffers are used internally

[Figure: the GPU and the CPU share a single global address space; the I/O thread accesses the data directly.]
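Sketched in the same style as before, the I/O thread simply calls fwrite on the shared pointer; the hand-off between the threads is an assumption, while GMAC handles the transfers underneath:

    /* I/O thread (illustrative): a plain fwrite triggers the deviceToHost
     * transfer, which GMAC overlaps with the disk write internally. */
    void *io_thread(void *arg) {
        while (wait_for_snapshot()) {             /* hypothetical hand-off      */
            fwrite(snapshot, 1, SNAP_SIZE, file); /* GPU data straight to disk  */
            signal_snapshot_done();
        }
        return NULL;
    }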

Outline

• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC

Domain decomposition (CUDA-RT)

• CUDA-RT implementation (single-transfer exchange), as seen before
  • Streams and page-locked memory buffers must be used
  • The page-locked memory buffers can be too big

[Figure: each GPU's boundary is staged through the CPU address space.]

Reverse Time Migration on GMAC

Domain decomposition (GMAC)

• GMAC implementation (multiple-transfer exchange)
  • The exchange of boundaries is performed with a simple memcpy!
  • Full PCIe utilization: internally, GMAC performs several transfers and double buffering

[Figure: four GPUs within one unified global address space.]
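As a sketch, each domain thread's exchange then collapses to a single call; the pointer names are illustrative, and GMAC chooses the transfer strategy internally:

    /* Boundary exchange with GMAC (illustrative names). */
    memcpy(neighbor_halo, my_boundary, H_SIZE); /* GPU-to-GPU under the hood:
                                                   chunked transfers with
                                                   double buffering          */
    barrier();                                  /* all domains exchanged     */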

Outline

• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC

Overlapping computation and communication

• No streams, no page-locked buffers, similar performance: ±2%

CUDA-RT:

    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:

    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

Outline

• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  → Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC

Development cycle and debugging

• CUDA-RT
  • Start from a simple, correct sequential code
  • Implement kernels one at a time and check correctness
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
  • To introduce modifications to any kernel, the same applies:
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)

[Figure: the kernel pipeline — 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression — ported one kernel at a time.]

Reverse Time Migration on GMAC

Development cycle and debugging

• GMAC
  • Allocate objects with gmacMalloc
  • Single pointer
    • Use the pointer both in the host code and in the GPU kernel implementations
  • No copies

[Figure: the same kernel pipeline — 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression.]

Outline

• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
→ Conclusions

Conclusions

• Heterogeneous systems based on GPUs are currently the most appropriate platform for implementing RTM
• CUDA has programmability issues
  • CUDA provides a good language to expose data parallelism in the code to be run on the GPU
  • The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
• GMAC eases the development of applications for GPU-based systems with no performance penalty
  • A single part-time programmer produced the full RTM version in 6 months (5x speedup over the previous Cell implementation)

Acknowledgements

• Barcelona Supercomputing Center
• Repsol
• Universitat Politècnica de Catalunya
• University of Illinois at Urbana-Champaign

Thank you!

Questions?
