Embarrassingly Parallel RTM
RTM at Petascale and Beyond
Michael Perrone
IBM Master Inventor
Computational Sciences Center, IBM Research
© 2011 IBM Corporation
RTM (Reverse Time Migration) Seismic Imaging on BGQ
• RTM is a widely-used imaging technique for oil and gas exploration, particularly under subsalts
• Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico
• Imaging subsalt regions of the Earth is extremely challenging
• Industry anticipates exascale need by 2020
IBM Research
Bottom Line: Seismic Imaging
We can make RTM 10 to 100 times faster.
How?
► Abandon embarrassingly parallel RTM
► Use domain-partitioned, multisource RTM
System requirements
► High communication bandwidth
► Low communication latency
► Lots of memory
The approach extends equally well to FWI.
[email protected]
Take Home Messages
Embarrassingly parallel is not always the best approach
It is crucial to know where bottlenecks exist
Algorithmic changes can dramatically improve performance
Compute performance on new hardware
[Chart: run time of the compute kernel on old hardware vs. two generations of new hardware]
Kernel performance improvement
Compute performance on new hardware
[Chart: run time on old hardware vs. two generations of new hardware, now with Disk IO included]
Need to track end-to-end performance
Bottlenecks: Memory IO
GPU: 0.1 B/F
► 100 GB/s
► 1 TF/s
BG/P: 1.0 B/F
► 13.6 GB/s
► 13.6 GF/s
BG/Q: 0.2 B/F
► 43 GB/s
► 204.8 GF/s
BG/Q L2: 1.5 B/F
► > 300 GB/s
► 204.8 GF/s
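The bytes-per-flop ratios above are simply the quoted memory bandwidth divided by the quoted peak compute rate. A quick sketch of the arithmetic, using the figures from this slide:

```python
# Bytes/flop = sustained memory bandwidth / peak compute rate.
# Figures (GB/s, GF/s) are the ones quoted on the slide.
systems = {
    "GPU":     (100.0, 1000.0),   # 100 GB/s, 1 TF/s
    "BG/P":    (13.6,  13.6),
    "BG/Q":    (43.0,  204.8),
    "BG/Q L2": (300.0, 204.8),    # slide quotes L2 bandwidth as > 300 GB/s
}

for name, (gb_per_s, gf_per_s) in systems.items():
    bf = gb_per_s / gf_per_s
    print(f"{name:8s} {bf:.2f} B/F")
```

A stencil kernel that needs more bytes per flop than the hardware supplies is memory-bound, which is why the B/F ratio, not peak flops, predicts RTM throughput.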
GPUs for Seismic Imaging?
x86/GPU [old results, 2x now]
► 17B stencils/second
► nVidia / INRIA collaboration
– Velocity model: 560x560x905
– Iterations: 22760
BlueGene/P
► 40B stencils/second
► Comparable model size/complexity
► Partial optimization
– MPI not overlapped
– Kernel optimization ongoing
Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation (HPCS'09).
► BlueGene/Q will be even faster
Reverse Time Migration (RTM)
Receiver data: R(x, y, z, t)
Source data: S(x, y, z, t)
[Figure: marine acquisition geometry: a ship tows the source and a receiver array, with dimensions on the order of 1 km and 5 km; one firing of the source is "1 shot"]
RTM - Reverse Time Migration
Use 3D wave equation to model sound in Earth
Forward (source):

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_S(x,y,z,t)=S(x,y,z,t)$$

Reverse (receiver):

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_R(x,y,z,t)=R(x,y,z,t)$$

Imaging condition:

$$I(x,y,z)=\sum_t P_S(x,y,z,t)\,P_R(x,y,z,t)$$
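The imaging condition is a zero-lag cross-correlation of the forward and reverse wavefields at every grid point. A minimal 1-D sketch in Python (the wavefield arrays here are toy values, not real RTM output):

```python
# Zero-lag cross-correlation imaging condition:
# I(x) = sum over t of P_S(x, t) * P_R(x, t).
def imaging_condition(p_src, p_rcv):
    """p_src, p_rcv: lists of snapshots, one list of grid values per timestep."""
    nx = len(p_src[0])
    image = [0.0] * nx
    for ps, pr in zip(p_src, p_rcv):
        for x in range(nx):
            image[x] += ps[x] * pr[x]
    return image

# Toy example: 2 timesteps on a 3-point grid.
p_src = [[1.0, 2.0, 0.0], [0.5, 1.0, 1.0]]
p_rcv = [[2.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
print(imaging_condition(p_src, p_rcv))  # [3.0, 1.0, 0.0]
```

The image is bright where the source and receiver wavefields coincide in time, i.e. at reflectors.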
Implementing the Wave Equation
Finite difference in time:
$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_S(x,y,z,t)=S(x,y,z,t)$$

$$\frac{\partial^2}{\partial t^2}P(x,y,z,t)\approx\frac{P(x,y,z,t+1)-2\,P(x,y,z,t)+P(x,y,z,t-1)}{\Delta t^2}$$
Finite difference in space:
$$\frac{\partial^2}{\partial x^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x+n,y,z,t)$$

$$\frac{\partial^2}{\partial y^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x,y+n,z,t)$$

$$\frac{\partial^2}{\partial z^2}P(x,y,z,t)\approx\sum_n g(n)\,P(x,y,z+n,t)$$
Absorbing boundary conditions, interpolation, compression, etc.
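A minimal 1-D sketch of these two discretizations combined into a single leapfrog timestep. The stencil coefficients g(n) here are the simple second-order ones, (1, -2, 1); a production RTM kernel would use a higher-order stencil and 3-D indexing:

```python
# One leapfrog step of the 1-D constant-velocity wave equation:
# p_next[x] = 2*p[x] - p_prev[x] + (v*dt/dx)^2 * (p[x-1] - 2*p[x] + p[x+1])
def step(p_prev, p, v, dt, dx):
    n = len(p)
    p_next = [0.0] * n
    c2 = (v * dt / dx) ** 2
    for x in range(1, n - 1):                    # interior points only
        lap = p[x - 1] - 2.0 * p[x] + p[x + 1]   # g(n) = (1, -2, 1)
        p_next[x] = 2.0 * p[x] - p_prev[x] + c2 * lap
    return p_next

# Toy run: a single spike spreading on a 7-point grid.
p_prev = [0.0] * 7
p = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
for _ in range(2):
    p_prev, p = p, step(p_prev, p, v=1.0, dt=0.5, dx=1.0)
print(p)
```

Each output point reads a small neighborhood of the previous snapshot, which is what makes the kernel bandwidth-bound and sensitive to the B/F ratios discussed earlier.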
RTM Algorithm (for each shot)
Load data
► Velocity model v(x,y,z)
► Source & receiver data
Forward propagation
► Calculate P_S(x,y,z,t)
► Every N timesteps:
– Compress P_S(x,y,z,t)
– Write P_S(x,y,z,t) to disk/memory
Backward propagation
► Calculate P_R(x,y,z,t)
► Every N timesteps:
– Read P_S(x,y,z,t) from disk/memory
– Decompress P_S(x,y,z,t)
– Calculate partial sum of I(x,y,z)
Merge I(x,y,z) with global image
[Diagram: checkpoint timeline showing forward steps F(N), F(2N), F(3N), ..., F(kN), reverse steps R(N), ..., R(kN), and image partial sums I(N), ..., I(kN) feeding the final image]
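The forward/backward structure above can be sketched as follows. The `advance` function is a hypothetical stand-in for one wave-equation timestep, and compression is omitted for clarity:

```python
# Skeleton of the per-shot loop: the forward sweep stores a snapshot of the
# source wavefield every N timesteps; the backward sweep replays them to
# accumulate partial sums of the image.
N = 4            # checkpoint interval
T = 16           # total timesteps
NX = 8           # grid points

def advance(p, t):
    # Placeholder propagation, NOT a real wave-equation step.
    return [v + (1.0 if x == t % NX else 0.0) for x, v in enumerate(p)]

# Forward propagation, checkpointing every N steps.
snapshots = {}
p_src = [0.0] * NX
for t in range(1, T + 1):
    p_src = advance(p_src, t)
    if t % N == 0:
        snapshots[t] = list(p_src)      # would be compressed and written out

# Backward propagation, accumulating the image at each checkpoint.
image = [0.0] * NX
p_rcv = [0.0] * NX
for t in range(T, 0, -1):
    p_rcv = advance(p_rcv, t)
    if t % N == 0:
        ps = snapshots[t]               # would be read back and decompressed
        for x in range(NX):
            image[x] += ps[x] * p_rcv[x]

print(len(snapshots), image)
```

Where the snapshots live, on scratch disk or in node memory, is exactly the design choice that separates the two RTM variants compared in the next slides.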
Embarrassingly Parallel RTM
Process shots in parallel, one per slave node
[Diagram: Data Archive (Disk) feeds a Master Node, which distributes the model across many Slave Nodes, each with its own scratch disk]
Scratch disk bottleneck
Subset of model for each shot (~100k+ shots)
Domain-Partitioned Multisource RTM
Process all data at once with domain decomposition
[Diagram: Data Archive (Disk) feeds a Master Node, which distributes model partitions across Slave Nodes; no local scratch disks]
Shots merged and model partitioned
Small partitions mean the forward wave can be stored locally: no disks
Multisource RTM
[Diagram: full velocity model with receiver data; each source uses only a velocity subset around it]
Linear superposition principle:

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)P_i(x,y,z,t)=S_i(x,y,z,t)$$

So N sources can be merged:

$$\left(\frac{\partial^2}{\partial x^2}+\frac{\partial^2}{\partial y^2}+\frac{\partial^2}{\partial z^2}-\frac{1}{v^2(x,y,z)}\frac{\partial^2}{\partial t^2}\right)\sum_{i=1}^{N}P_i(x,y,z,t)=\sum_{i=1}^{N}S_i(x,y,z,t)$$

This accelerates RTM by a factor of N.
The finite receiver array acts as a nonlinear filter on the data:

$$R_{\mathrm{measured}}(x,y,z,t)=M\,R_{\mathrm{full}}(x,y,z,t)$$

The nonlinearity leads to "crosstalk" noise, which needs to be minimized.
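Because the wave operator is linear, propagating the summed sources yields the sum of the individual wavefields. A toy 1-D check of this superposition, using the same simple second-order scheme as before rather than a production kernel:

```python
# Linearity check: wavefield(S1 + S2) equals wavefield(S1) + wavefield(S2).
def propagate(sources, nx=16, nt=10, c2=0.25):
    """sources: dict {x: amplitude} injected at t=0; returns final wavefield."""
    p_prev = [0.0] * nx
    p = [0.0] * nx
    for x, amp in sources.items():
        p[x] = amp
    for _ in range(nt):
        p_next = [0.0] * nx
        for x in range(1, nx - 1):
            lap = p[x - 1] - 2.0 * p[x] + p[x + 1]
            p_next[x] = 2.0 * p[x] - p_prev[x] + c2 * lap
        p_prev, p = p, p_next
    return p

p1 = propagate({3: 1.0})            # shot 1 alone
p2 = propagate({10: -2.0})          # shot 2 alone
p12 = propagate({3: 1.0, 10: -2.0}) # both shots in one propagation
assert all(abs(p12[x] - (p1[x] + p2[x])) < 1e-9 for x in range(16))
print("superposition holds")
```

One propagation of N merged sources replaces N separate propagations; the crosstalk problem arises only later, at the imaging condition, which multiplies wavefields and is therefore nonlinear.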
3D RTM Scaling (Partial optimization)
512x512x512 & 1024x1024x1024 models
Scaling improves for larger models
GPU Scaling is Comparatively Poor
Tsubame supercomputer (Japan)
GPUs achieve only 10% of peak performance (100x increase for 1000 nodes)
Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. In Earth Planets Space, November 2010.
Physical survey size mapped to BG/Q L2 cache
Isotropic RTM with minimum V = 1.5 km/s
10 points per wavelength (5 would reduce number below by 8x)
Mapping entire survey volume – not a subset (enables multisource)
[Chart: survey volume (log scale, 1 to 1000) vs. maximum imaging frequency (0 to 80 Hz), with grid sizes 512^3, 4096^3, and 16384^3 marked]
Snapshot Data Easily Fits in Memory (No disk required)
[Chart: number of uncompressed snapshots that can be stored (0 to 10000) vs. number of nodes (128 to 10240), for model sizes from 500^3 to 1600^3; BG/Q has 4x more capacity]
Comparison
Embarrassingly parallel RTM
► Coarse-grain communication
► Coarse-grain synchronization
► Disk IO Bottleneck
Partitioned RTM
► Fine-grain communication
► Fine-grain synchronization
► No scratch disk
Low latency, high bandwidth: Blue Gene
Conclusion: RTM can be dramatically accelerated
Algorithmic:
► Adopt partitioned, multisource RTM
► Abandon embarrassingly parallel implementations
Hardware:
► Increase communication bandwidth
► Decrease communication latency
► Reduce node nondeterminism
Advantages
► Can process larger models - scales well
► Avoids scratch disk IO bottleneck
► Improves RAS & MTBF: No disk means no moving parts
Disadvantages
► Must handle shot “crosstalk” noise
– Methods exist - research continuing…