
SYNCHRONIZATION USING
REMOTE-SCOPE PROMOTION
MARC S. ORR†§, SHUAI CHE§, AYSE YILMAZER§,
BRADFORD M. BECKMANN§, MARK D. HILL†§, DAVID A. WOOD†§
†UW-MADISON, §AMD RESEARCH
ASPLOS, MARCH 16, 2015
EXECUTIVE SUMMARY
Heterogeneous chips, like GPUs, have hierarchical memories
All Global Synchronization
Scoped Synchronization (7% Speedup)
Work Stealing (18% Speedup)
Best of Both?
NEW: Remote-Scope Promotion (25% Speedup)
2 | SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION | MARCH 16, 2015
OUTLINE
Background: Synchronization + Scopes
Synchronization using Remote-Scope Promotion
Results/Conclusion
BACKGROUND: SYNCHRONIZATION + SCOPES
Parallel Synchronization semantics
‒acquire: pull latest data (to me)
‒release: push latest data (to others)
Scopes bound synchronization:
‒Smaller scope → less synchronization overhead
scope        abbrev.   description
work-item    wi        Like a CPU thread
wavefront    wv        work-items executing in lockstep on SIMD
work-group   wg        wavefronts executing on the same CU
component    cmp       work-groups executing on the same GPU
system       sys       All work-items/threads in the process
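The scope table can be read as a containment hierarchy: two work-items can synchronize at any scope that contains both of them, and the smallest such scope is the cheapest. The sketch below is one way to express that in C++. The `WorkItemId` fields and both function names are hypothetical, introduced only for illustration; they are not part of any GPU API.

```cpp
#include <cassert>

// Scopes from the table, ordered from smallest to largest.
enum class Scope { WorkItem, Wavefront, WorkGroup, Component, System };

// Hypothetical coordinates of a work-item (illustrative field names).
struct WorkItemId {
    int device;     // which GPU (component)
    int workgroup;  // work-group index within the device
    int wavefront;  // wavefront index within the work-group
    int lane;       // lane within the wavefront
};

// Smallest scope that contains both work-items.
Scope smallest_common_scope(const WorkItemId& a, const WorkItemId& b) {
    if (a.device != b.device) return Scope::System;
    if (a.workgroup != b.workgroup) return Scope::Component;
    if (a.wavefront != b.wavefront) return Scope::WorkGroup;
    if (a.lane != b.lane) return Scope::Wavefront;
    return Scope::WorkItem;
}

// True when synchronizing at scope `s` is wide enough to cover both.
bool scope_covers(Scope s, const WorkItemId& a, const WorkItemId& b) {
    return s >= smallest_common_scope(a, b);
}

// Usage example: two work-items in different work-groups of one GPU.
void scope_example() {
    WorkItemId a{0, 0, 0, 0};
    WorkItemId b{0, 1, 0, 0};
    assert(smallest_common_scope(a, b) == Scope::Component);
    assert(!scope_covers(Scope::WorkGroup, a, b));  // wg scope is too small
}
```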
ACQUIRE/RELEASE ANIMATION
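void incX_component() {
    while (!CAS_acq_cmp(&L, 0, 1));  // acquire lock L at component scope
    X = X + 1;
    st_rel_cmp(&L, 0);               // release lock L at component scope
}

void incX_workgroup() {
    while (!CAS_acq_wg(&L, 0, 1));   // acquire lock L at work-group scope
    X = X + 1;
    st_rel_wg(&L, 0);                // release lock L at work-group scope
}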
[Animation: CU0 and CU1 each have an L1 cache (wg scope 0 and wg scope 1) above a shared L2 (component scope); the frames step through the values of X and L in each cache]
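For readers who want to run the locking pattern, here is a minimal CPU sketch of the same acquire/increment/release sequence using `std::atomic`. C++ atomics take no scope argument, so this is assumed to behave like the widest scope and cannot show the wg/cmp difference. The function names `incX` and `run` are illustrative, not from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> L{0};  // lock word: 0 = free, 1 = held
int X = 0;              // shared data protected by L

// CPU analogue of the incX_* functions (no scope argument available).
void incX() {
    int expected = 0;
    // CAS with acquire ordering: spin until L changes from 0 to 1.
    while (!L.compare_exchange_weak(expected, 1, std::memory_order_acquire,
                                    std::memory_order_relaxed)) {
        expected = 0;  // a failed CAS overwrites `expected`, so reset it
    }
    X = X + 1;                              // critical section
    L.store(0, std::memory_order_release);  // release: publish X, free L
}

// Runs `threads` threads that each call incX() `iters` times; returns X.
int run(int threads, int iters) {
    X = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iters] {
            for (int i = 0; i < iters; ++i) incX();
        });
    }
    for (auto& th : pool) th.join();
    assert(L.load() == 0);  // the lock must be free once all threads finish
    return X;
}
```

Because every increment happens between an acquire and the matching release on `L`, `run(4, 1000)` is expected to return 4000.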
SCOPED SYNCHRONIZATION’S STRENGTHS
Static local sharing
[Figure: wg scope 0 holds data 0 and wg scope 1 holds data 1, both inside the component scope]
Dynamic global sharing
[Figure: wg scope 0 and wg scope 1 share a global data store]
On current hardware, wg scope can yield >20% speedup over cmp scope
SCOPED SYNCHRONIZATION’S LIMITATIONS
Dynamic local sharing: some threads access shared data less frequently than others in an ad-hoc manner
Example: work stealing
[Figure: queue 0 in wg scope 0 and queue 1 in wg scope 1, both inside the component scope; the animation marks a queue copy as stale]
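The work-stealing pattern in this example can be sketched on a CPU as follows: each worker mostly takes tasks from its own queue and only occasionally reaches into another worker's queue. This is a minimal illustration, not the authors' implementation. `TaskQueue`, `take`, and `next_task` are hypothetical names, and a `std::mutex` stands in for the per-queue lock, so the sketch carries no notion of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// One task queue per work-group; the mutex stands in for the queue's lock.
struct TaskQueue {
    std::mutex lock;
    std::deque<int> tasks;
};

// Takes the front task of `q`, or returns std::nullopt if `q` is empty.
std::optional<int> take(TaskQueue& q) {
    std::lock_guard<std::mutex> guard(q.lock);
    if (q.tasks.empty()) return std::nullopt;
    int t = q.tasks.front();
    q.tasks.pop_front();
    return t;
}

// Worker `self` tries its own queue first (the frequent, local case) and
// only then scans the other queues as a thief (the rare, remote case).
std::optional<int> next_task(std::vector<TaskQueue>& queues, std::size_t self) {
    assert(self < queues.size());
    if (auto t = take(queues[self])) return t;
    for (std::size_t victim = 0; victim < queues.size(); ++victim) {
        if (victim == self) continue;
        if (auto t = take(queues[victim])) return t;  // steal from victim
    }
    return std::nullopt;  // no work anywhere
}
```

The owner's accesses dominate, so it would prefer the cheap wg scope, while the occasional thief sits in a different work-group and needs a wider scope to see fresh queue contents. That mismatch is the limitation this slide describes.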
OUTLINE
Background: Synchronization + Scopes
Synchronization using Remote-Scope Promotion
Results/Conclusion
REMOTE-SCOPE PROMOTION
Insight: wg1 needs to trigger the promotion of scope 0
Contribution: hardware support for scope promotion & ISA instructions that utilize it
[Figure: the stale-queue example again (queue 0 in wg scope 0, queue 1 in wg scope 1, both inside the component scope), now labeled with a promote step]
PROMOTION SEMANTICS
Prior memory models: HRF-direct, HRF-indirect
‒Invariant: acquire/release pair must occur at the same scope
[Animation: work-item 0 (in wg 0) executes st(V,2) and then a release on L, either st_rel_cmp(L, 0) or st_rel_wg(L, 0). Work-item 1 (in wg 1) executes an acquire on L, one of cas_acq_cmp(&L, 0, 1), cas_acq_wg(&L, 0, 1), or cas_rm_acq_cmp(&L, 0, 1), and then ld(R1, V). The frames mark the synchronizes-with relationship as OK, as a RACE!, or as established by promotion.]
Three new memory orders:
‒remoteAcquire: promote the scope of the last release to the scope of this acquire, then perform the acquire
‒remoteRelease: promote the scope of the next acquire to the scope of this release, then perform the release
‒remoteAcquire+Release: combine remote acquire & remote release
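The same-scope invariant and the effect of remoteAcquire can be written as a small predicate. The sketch below is a deliberate simplification and not the paper's formal model: it considers a single release/acquire pair and ignores the transitivity that HRF-indirect allows. It also adds the assumption that the shared scope must contain both work-items. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

enum class Scope { WorkGroup, Component };

// How the acquire is issued: plain, or with the remoteAcquire memory order.
enum class AcquireKind { Plain, Remote };

// Simplified check: does a release by work-item 0 synchronize with a later
// acquire by work-item 1 on the same location L?
bool synchronizes(Scope release_scope, Scope acquire_scope, AcquireKind kind,
                  bool same_workgroup) {
    // remoteAcquire first promotes the last release to the acquire's scope.
    Scope effective_release = release_scope;
    if (kind == AcquireKind::Remote) {
        effective_release = std::max(release_scope, acquire_scope);
    }
    // Invariant from the slide: the pair must occur at the same scope...
    if (effective_release != acquire_scope) return false;
    // ...and (assumed here) that scope must contain both work-items.
    Scope needed = same_workgroup ? Scope::WorkGroup : Scope::Component;
    return acquire_scope >= needed;
}

// Cases built from the operations named on the slide, with the two
// work-items in different work-groups.
void check_cases() {
    // st_rel_cmp + cas_acq_cmp: same scope, covers both work-items.
    assert(synchronizes(Scope::Component, Scope::Component,
                        AcquireKind::Plain, false));
    // st_rel_wg + cas_acq_wg: wg scope does not span two work-groups.
    assert(!synchronizes(Scope::WorkGroup, Scope::WorkGroup,
                         AcquireKind::Plain, false));
    // st_rel_wg + cas_rm_acq_cmp: the release is promoted to cmp scope.
    assert(synchronizes(Scope::WorkGroup, Scope::Component,
                        AcquireKind::Remote, false));
}
```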
IMPLEMENTATION

remote_acq_cmp(L)
1. Promote the scope of the last release on L
2. Perform an acquire operation on L
remote_rel_cmp(L)
1. Perform a release operation on L
2. Promote the scope of the next acquire on L
[Animation: CU0, CU1, and CU2 each have an L1 cache above a shared L2; the frames show a promote request and a FLUSH while tracking the values of V and L in the caches]
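The two steps of remote_acq_cmp can be illustrated with a toy cache model. It assumes write-back L1 caches and no hardware coherence between them, and it omits the lock L entirely. The promotion step is modeled crudely by flushing every other CU's L1, which is likely broader than what real hardware would do. `Gpu`, `Line`, and all method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Line { int value; bool dirty; };

// Toy model: one write-back L1 per CU above a shared L2.
struct Gpu {
    std::vector<std::map<std::string, Line>> l1;
    std::map<std::string, int> l2;  // missing addresses read as 0

    explicit Gpu(int cus) : l1(static_cast<std::size_t>(cus)) {}

    // Loads hit in the local L1 if possible, otherwise fetch from L2.
    int load(int cu, const std::string& addr) {
        auto it = l1[cu].find(addr);
        if (it != l1[cu].end()) return it->second.value;
        int v = l2[addr];
        l1[cu][addr] = Line{v, false};
        return v;
    }
    // Stores stay in the local L1 until a flush writes them back.
    void store(int cu, const std::string& addr, int v) {
        l1[cu][addr] = Line{v, true};
    }
    // FLUSH: write dirty lines back to L2, then drop every line.
    void flush(int cu) {
        assert(cu >= 0 && static_cast<std::size_t>(cu) < l1.size());
        for (const auto& [addr, line] : l1[cu]) {
            if (line.dirty) l2[addr] = line.value;
        }
        l1[cu].clear();
    }
    void release_wg(int) {}                   // wg scope: data stays in L1
    void acquire_cmp(int cu) { flush(cu); }   // drop local, maybe stale, copies
    void remote_acquire_cmp(int cu) {
        // Step 1: promote the last release by flushing the other L1s.
        for (int other = 0; other < static_cast<int>(l1.size()); ++other) {
            if (other != cu) flush(other);
        }
        // Step 2: perform the acquire itself.
        acquire_cmp(cu);
    }
};

// CU0 writes V and releases at wg scope; CU1 then acquires and reads V.
int thief_reads_v(bool use_remote_acquire) {
    Gpu gpu(2);
    gpu.store(0, "V", 2);
    gpu.release_wg(0);
    if (use_remote_acquire) gpu.remote_acquire_cmp(1);
    else gpu.acquire_cmp(1);
    return gpu.load(1, "V");
}
```

In this model a plain component-scope acquire on CU1 still reads the old value of V from L2, while the remote acquire forces CU0's L1 to be flushed first, so CU1 reads the value CU0 wrote.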
IMPLEMENTATION DETAILS
Hardware Support
‒Sending/receiving sub-operations between CUs
‒Cache line locking to resolve races
  ‒Guarantee “coherence order” for read-modify-writes
‒Hardware support to stall new synchronization operations at target scope
Paper formalizes scope promotion
‒Shows that scope promotion is compatible with coherence order
OUTLINE
Background: Synchronization + Scopes
Synchronization using Remote-Scope Promotion
Results/Conclusion
METHODOLOGY
Prototyped remote scoped synchronization in gem5
‒Extended with internal GPU model
Refactored 3 Pannotia workloads to retrieve graph nodes from task queues
‒SSSP, Color, PageRank (each run with 3-4 inputs)
RESULTS
[Figure: bar chart of speedup (y-axis, 0 to 1.8) for baseline, scope-only, steal-only, and rem-sync; callouts of 1.07x, 1.18x, and 1.25x]
scenario     Scope of sync.?   Work stealing?
baseline     global            no
scope-only   local             no
steal-only   global            Yes
rem-sync     local             Yes
CONCLUSION
All Global Synchronization
Scoped Synchronization (7% Speedup)
Work Stealing (18% Speedup)
Best of Both!
NEW: Remote-Scope Promotion (25% Speedup)
Questions?
Backup
µBENCHMARK RESULTS
wg-scope vs. cmp-scope on AMD A10-7850K
[Figure: speedup (y-axis, 0 to 1.6) vs. # of memory operations between acquire and release (x-axis, 4 to 1024); series: All LD, 75% LD, 50% LD, 25% LD, All ST]
Small tasks benefit from scopes
Scopes matter!