Embarrassingly Parallel RTM
RTM at Petascale and Beyond
Michael Perrone
IBM Master Inventor
Computational Sciences Center, IBM Research
© 2011 IBM Corporation
RTM (Reverse Time Migration) Seismic Imaging on BGQ
• RTM is a widely-used imaging technique for oil and gas exploration, particularly under subsalts
• Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico
• Imaging subsalt regions of the Earth is extremely challenging
• Industry anticipates exascale need by 2020
IBM Research
Bottom Line: Seismic Imaging
We can make RTM 10 to 100 times faster.
How?
► Abandon embarrassingly parallel RTM
► Use domain-partitioned, multisource RTM
System requirements
► High communication bandwidth
► Low communication latency
► Lots of memory
The approach extends equally well to FWI.
[email protected]
Take Home Messages
Embarrassingly parallel is not always the best approach
It is crucial to know where bottlenecks exist
Algorithmic changes can dramatically improve performance
Compute performance on new hardware
[Chart: run time of the compute kernel on old hardware vs. two generations of new hardware]
Kernel performance improvement
Compute performance on new hardware
[Chart: run time on old hardware vs. two generations of new hardware, now with Disk IO included]
Need to track end-to-end performance
Bottlenecks: Memory IO
GPU: 0.1 B/F
► 100 GB/s
► 1 TF/s
BG/P: 1.0 B/F
► 13.6 GB/s
► 13.6 GF/s
BG/Q: 0.2 B/F
► 43 GB/s
► 204.8 GF/s
BG/Q L2: 1.5 B/F
► > 300 GB/s
► 204.8 GF/s
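The bytes-per-flop ratios above are simply the quoted memory bandwidth divided by the quoted peak compute rate. A quick sketch of the arithmetic, using the figures from this slide:

```python
# Bytes/flop = sustained memory bandwidth / peak compute rate.
# Figures (GB/s, GF/s) are the ones quoted on the slide.
systems = {
    "GPU":     (100.0, 1000.0),   # 100 GB/s, 1 TF/s
    "BG/P":    (13.6,  13.6),
    "BG/Q":    (43.0,  204.8),
    "BG/Q L2": (300.0, 204.8),    # slide quotes L2 bandwidth as > 300 GB/s
}

for name, (gb_per_s, gf_per_s) in systems.items():
    bf = gb_per_s / gf_per_s
    print(f"{name:8s} {bf:.2f} B/F")
```

A stencil kernel that needs more bytes per flop than the hardware supplies is memory-bound, which is why the B/F ratio, not peak flops, predicts RTM throughput.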
GPUs for Seismic Imaging?
x86/GPU [old results, 2x now]
► 17B stencils/second
► nVidia / INRIA collaboration
– Velocity model: 560x560x905
– Iterations: 22760
BlueGene/P
► 40B stencils/second
► Comparable model size/complexity
► Partial optimization
– MPI not overlapped
– Kernel optimization ongoing
Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation (HPCS'09).
► BlueGene/Q will be even faster
Reverse Time Migration (RTM)
Receiver data: R(x, y, z, t)
Source data: S(x, y, z, t)
[Figure: marine acquisition geometry: a ship tows the source and a receiver array, with dimensions on the order of 1 km and 5 km; one firing of the source is "1 shot"]
RTM - Reverse Time Migration
Use 3D wave equation to model sound in Earth
Forward (source):

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_S(x,y,z,t)=S(x,y,z,t)$$

Reverse (receiver):

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_R(x,y,z,t)=R(x,y,z,t)$$

Imaging condition:

$$I(x,y,z)=\sum_t P_S(x,y,z,t)\,P_R(x,y,z,t)$$
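The imaging condition is a zero-lag cross-correlation of the forward and reverse wavefields at every grid point. A minimal 1-D sketch in Python (the wavefield arrays here are toy values, not real RTM output):

```python
# Zero-lag cross-correlation imaging condition:
# I(x) = sum over t of P_S(x, t) * P_R(x, t).
def imaging_condition(p_src, p_rcv):
    """p_src, p_rcv: lists of snapshots, one list of grid values per timestep."""
    nx = len(p_src[0])
    image = [0.0] * nx
    for ps, pr in zip(p_src, p_rcv):
        for x in range(nx):
            image[x] += ps[x] * pr[x]
    return image

# Toy example: 2 timesteps on a 3-point grid.
p_src = [[1.0, 2.0, 0.0], [0.5, 1.0, 1.0]]
p_rcv = [[2.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
print(imaging_condition(p_src, p_rcv))  # [3.0, 1.0, 0.0]
```

The image is bright where the source and receiver wavefields coincide in time, i.e. at reflectors.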
Implementing the Wave Equation
Finite difference in time:
$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_S(x,y,z,t)=S(x,y,z,t)$$

$$\frac{\partial^2}{\partial t^2}P(x,y,z,t)\approx\frac{P(x,y,z,t+1)-2\,P(x,y,z,t)+P(x,y,z,t-1)}{\Delta t^2}$$
Finite difference in space:
$$\frac{\partial^2}{\partial x^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x+n,y,z,t)$$

$$\frac{\partial^2}{\partial y^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x,y+n,z,t)$$

$$\frac{\partial^2}{\partial z^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x,y,z+n,t)$$
Absorbing boundary conditions, interpolation, compression, etc.
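A minimal 1-D sketch of these two discretizations combined into a single leapfrog timestep. The stencil coefficients g(n) here are the simple second-order ones, (1, -2, 1); a production RTM kernel would use a higher-order stencil and 3-D indexing:

```python
# One leapfrog step of the 1-D constant-velocity wave equation:
# p_next[x] = 2*p[x] - p_prev[x] + (v*dt/dx)^2 * (p[x-1] - 2*p[x] + p[x+1])
def step(p_prev, p, v, dt, dx):
    n = len(p)
    p_next = [0.0] * n
    c2 = (v * dt / dx) ** 2
    for x in range(1, n - 1):                    # interior points only
        lap = p[x - 1] - 2.0 * p[x] + p[x + 1]   # g(n) = (1, -2, 1)
        p_next[x] = 2.0 * p[x] - p_prev[x] + c2 * lap
    return p_next

# Toy run: a single spike spreading on a 7-point grid.
p_prev = [0.0] * 7
p = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
for _ in range(2):
    p_prev, p = p, step(p_prev, p, v=1.0, dt=0.5, dx=1.0)
print(p)
```

Each output point reads a small neighborhood of the previous snapshot, which is what makes the kernel bandwidth-bound and sensitive to the B/F ratios discussed earlier.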
RTM Algorithm (for each shot)
Load data
► Velocity model v(x,y,z)
► Source & receiver data
Forward propagation
► Calculate P_S(x,y,z,t)
► Every N timesteps:
– Compress P_S(x,y,z,t)
– Write P_S(x,y,z,t) to disk/memory
Backward propagation
► Calculate P_R(x,y,z,t)
► Every N timesteps:
– Read P_S(x,y,z,t) from disk/memory
– Decompress P_S(x,y,z,t)
– Calculate partial sum of I(x,y,z)
Merge I(x,y,z) with global image
[Diagram: checkpoint timeline showing forward steps F(N), F(2N), F(3N), ..., F(kN), reverse steps R(N), ..., R(kN), and image partial sums I(N), ..., I(kN) feeding the final image]
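The forward/backward structure above can be sketched as follows. The `advance` function is a hypothetical stand-in for one wave-equation timestep, and compression is omitted for clarity:

```python
# Skeleton of the per-shot loop: the forward sweep stores a snapshot of the
# source wavefield every N timesteps; the backward sweep replays them to
# accumulate partial sums of the image.
N = 4            # checkpoint interval
T = 16           # total timesteps
NX = 8           # grid points

def advance(p, t):
    # Placeholder propagation, NOT a real wave-equation step.
    return [v + (1.0 if x == t % NX else 0.0) for x, v in enumerate(p)]

# Forward propagation, checkpointing every N steps.
snapshots = {}
p_src = [0.0] * NX
for t in range(1, T + 1):
    p_src = advance(p_src, t)
    if t % N == 0:
        snapshots[t] = list(p_src)      # would be compressed and written out

# Backward propagation, accumulating the image at each checkpoint.
image = [0.0] * NX
p_rcv = [0.0] * NX
for t in range(T, 0, -1):
    p_rcv = advance(p_rcv, t)
    if t % N == 0:
        ps = snapshots[t]               # would be read back and decompressed
        for x in range(NX):
            image[x] += ps[x] * p_rcv[x]

print(len(snapshots), image)
```

Where the snapshots live, on scratch disk or in node memory, is exactly the design choice that separates the two RTM variants compared in the next slides.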
Embarrassingly Parallel RTM
Process shots in parallel, one per slave node
[Diagram: Data Archive (Disk) feeds a Master Node, which distributes the model across many Slave Nodes, each with its own scratch disk]
Scratch disk bottleneck
Subset of model for each shot (~100k+ shots)
Domain-Partitioned Multisource RTM
Process all data at once with domain decomposition
[Diagram: Data Archive (Disk) feeds a Master Node, which distributes model partitions across Slave Nodes; no local scratch disks]
Shots merged and model partitioned
Small partitions mean the forward wave can be stored locally: no disks
Multisource RTM
[Diagram: full velocity model with receiver data; each source uses only a velocity subset around it]
Linear superposition principle:

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_i(x,y,z,t)=S_i(x,y,z,t)$$

So N sources can be merged:

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)\sum_{i=1}^{N}P_i(x,y,z,t)=\sum_{i=1}^{N}S_i(x,y,z,t)$$

This accelerates RTM by a factor of N.
The finite receiver array acts as a nonlinear filter on the data:

$$R_{\mathrm{measured}}(x,y,z,t)=M\,R_{\mathrm{full}}(x,y,z,t)$$

The nonlinearity leads to "crosstalk" noise, which needs to be minimized.
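Because the wave operator is linear, propagating the summed sources yields the sum of the individual wavefields. A toy 1-D check of this superposition, using the same simple second-order scheme as before rather than a production kernel:

```python
# Linearity check: wavefield(S1 + S2) equals wavefield(S1) + wavefield(S2).
def propagate(sources, nx=16, nt=10, c2=0.25):
    """sources: dict {x: amplitude} injected at t=0; returns final wavefield."""
    p_prev = [0.0] * nx
    p = [0.0] * nx
    for x, amp in sources.items():
        p[x] = amp
    for _ in range(nt):
        p_next = [0.0] * nx
        for x in range(1, nx - 1):
            lap = p[x - 1] - 2.0 * p[x] + p[x + 1]
            p_next[x] = 2.0 * p[x] - p_prev[x] + c2 * lap
        p_prev, p = p, p_next
    return p

p1 = propagate({3: 1.0})            # shot 1 alone
p2 = propagate({10: -2.0})          # shot 2 alone
p12 = propagate({3: 1.0, 10: -2.0}) # both shots in one propagation
assert all(abs(p12[x] - (p1[x] + p2[x])) < 1e-9 for x in range(16))
print("superposition holds")
```

One propagation of N merged sources replaces N separate propagations; the crosstalk problem arises only later, at the imaging condition, which multiplies wavefields and is therefore nonlinear.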
3D RTM Scaling (Partial optimization)
512x512x512 & 1024x1024x1024 models
Scaling improves for larger models
GPU Scaling is Comparatively Poor
Tsubame supercomputer (Japan)
GPUs achieve only 10% of peak performance (100x increase for 1000 nodes)
Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November 2010.
Physical survey size mapped to BG/Q L2 cache
Isotropic RTM with minimum V = 1.5 km/s
10 points per wavelength (5 would reduce number below by 8x)
Mapping entire survey volume – not a subset (enables multisource)
[Chart: survey volume (log scale, 1 to 1000) vs. maximum imaging frequency (0 to 80 Hz), with grid sizes 512^3, 4096^3, and 16384^3 marked]
Snapshot Data Easily Fits in Memory (No disk required)
[Chart: number of uncompressed snapshots that can be stored (0 to 10000) vs. number of nodes (128 to 10240), for model sizes from 500^3 to 1600^3; BG/Q has 4x more capacity]
Comparison
Embarrassingly parallel RTM
► Coarse-grain communication
► Coarse-grain synchronization
► Disk IO Bottleneck
Partitioned RTM
► Fine-grain communication
► Fine-grain synchronization
► No scratch disk
Low latency, high bandwidth: Blue Gene
Conclusion: RTM can be dramatically accelerated
Algorithmic:
► Adopt partitioned, multisource RTM
► Abandon embarrassingly parallel implementations
Hardware:
► Increase communication bandwidth
► Decrease communication latency
► Reduce node nondeterminism
Advantages
► Can process larger models - scales well
► Avoids scratch disk IO bottleneck
► Improves RAS & MTBF: No disk means no moving parts
Disadvantages
► Must handle shot “crosstalk” noise
– Methods exist - research continuing…