Whither Acoherent Shared Memory?
Mark D. Hill
UW-Madison Computer Sciences
Workshop on Negative Outcomes, Post-mortems, and Experiences (NOPE)
December 2015
But NOPE Can Be Fun Too
Acoherent Shared Memory
Derek R. Hower
Ph.D. Defense, July 16, 2012
www.cs.wisc.edu/multifacet/theses/derek_hower_phd.pdf
www.cs.wisc.edu/multifacet/theses/derek_hower_phd_talk.pptx
Executive Summary & Outline
Acoherent Shared Memory [2012]
Coherence is complex & inefficient
Switch to CVS-like checkout/checkin model
Same performance; less energy for CPUs
Whither Acoherent Shared Memory?
CPUs coherence “settled”
GPU/accelerators not ready
Timing wrong; hard to publish out-there ideas
But seeded Heterogeneous Race Free
The Big Picture
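[Figure: processors (P) with L1 caches over a shared L2, plus a GPU, shown under a coherent view and an acoherent view; the diagram labels checkout (CO) and checkin (CI) paths.]
Coherent View: simple abstraction
- Complex implementation
- Hides caches (bad?!)
- High overhead
Acoherent View: simple abstraction
- Simple implementation
- Abstracts caches
- Low overhead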
The Problem With Coherence
Wrong abstraction
Optimized for fine-grained, share-everything
• Programs aren’t!
Makes SW isolation hard
Hypothesis: SW will want control over data placement
Impedes HW specialization
Does your multicore ASIC need a coherence controller?
Coherent GPUs?
Efficiency problems
Directories take space/broadcasts take energy
• e.g., 14% of the cache is dedicated to the directory on a 4-core die (Stackhouse et al., ISSCC 2008)
Rethinking Coherence: Goals
Maintain programmer sanity
Keep shared memory
Minimal compatibility change
Expose hardware capabilities
Let SW guide memory management -> semantics
Simple hardware
Lower cost of entry for accelerators
Solution: Acoherent Shared Memory
ASM Model Basics
Replace black box with simple hierarchy
Still flat, linear address space
SW gets private storage
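[Figure: processors (P) with private storage, exchanging data with shared memory through checkout (CO) and checkin (CI).]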
Manage with CVS-like checkout/checkin
Checkout/Checkin
Granularity?
Checkout/Checkin are not synchronization primitives
- Closer to a FENCE
Checkin: Publish local updates globally
Checkout: Pull data into private storage
[Figure: processors (P) connected by checkout (CO) and checkin (CI) arrows.]
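The checkout/checkin definitions above can be sketched as a small software model. This is only an illustration, not the hardware design: the class names, the dictionary-based memory, and the choice to keep unpublished local writes across a checkout are all assumptions made for the example.

```python
class SharedMemory:
    """Globally visible memory, modeled as an address -> value map."""
    def __init__(self):
        self.data = {}


class Processor:
    """A processor with private storage managed by checkout/checkin."""
    def __init__(self, shared):
        self.shared = shared
        self.private = {}   # local working copy
        self.dirty = set()  # addresses written since the last checkin

    def checkout(self):
        # Checkout: pull data into private storage.
        # Assumption: local writes not yet checked in are kept.
        fresh = dict(self.shared.data)
        for addr in self.dirty:
            fresh[addr] = self.private[addr]
        self.private = fresh

    def checkin(self):
        # Checkin: publish local updates globally.
        for addr in self.dirty:
            self.shared.data[addr] = self.private[addr]
        self.dirty.clear()

    def load(self, addr):
        return self.private.get(addr, 0)

    def store(self, addr, value):
        self.private[addr] = value
        self.dirty.add(addr)


shared = SharedMemory()
producer, consumer = Processor(shared), Processor(shared)

consumer.checkout()
producer.store(0x10, 42)
before_checkin = consumer.load(0x10)   # write is still private to producer
producer.checkin()
before_checkout = consumer.load(0x10)  # consumer still reads its old copy
consumer.checkout()
after_checkout = consumer.load(0x10)   # the published value is now visible
```

Note that neither call blocks or waits for another processor, which is why the slide compares them to a FENCE rather than to a synchronization primitive.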
Segments
Compromise: Memory Segments
– Linear partition of address space
– CO/CI segments at a time
Observation: Programs are already segmented
Can re-use layout
[Figure: program address-space layout (Stack, Heap, Data, BSS, Code), annotated as the typical CO/CI granularity in existing C code.]
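A minimal sketch of segment-granularity checkin, assuming a hypothetical segment table: the base addresses, the 64 KiB address space, and the helper names below are invented for illustration and do not come from the thesis.

```python
import bisect

# Hypothetical linear partition of a flat address space: (base, name),
# sorted by base address. Each segment runs up to the next base.
SEGMENTS = [
    (0x0000, "code"),
    (0x4000, "data"),
    (0x6000, "bss"),
    (0x8000, "heap"),
    (0xF000, "stack"),
]
BASES = [base for base, _ in SEGMENTS]
ADDRESS_SPACE_END = 0x10000


def segment_of(addr):
    """Return the name of the segment that contains addr."""
    if not 0 <= addr < ADDRESS_SPACE_END:
        raise ValueError(f"address {addr:#x} is outside the address space")
    index = bisect.bisect_right(BASES, addr) - 1
    return SEGMENTS[index][1]


def checkin_segment(name, private, dirty, shared):
    """Publish the dirty addresses of one segment; return what stays dirty."""
    published = {addr for addr in dirty if segment_of(addr) == name}
    for addr in published:
        shared[addr] = private[addr]
    return dirty - published


private = {0x8010: 1, 0xF008: 2}   # one heap write, one stack write
dirty = {0x8010, 0xF008}
shared = {}
remaining = checkin_segment("heap", private, dirty, shared)
```

Checking in the heap publishes only the heap write; the stack write stays private until its own segment is checked in.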
Segment Types
Not all memory wants/needs acoherence
Segment types give different “views”
Communicate semantic information to HW
Available Types
- Private
- Coherent-RW
- Acoherent
- Device
- Coherent-RO
[Figure: address-space layout (Stack, Heap, Data, BSS, Code) labeled with segment types: Private; Acoherent (shared); Coherent RO (shared, read-only).]
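One way to picture how segment types communicate semantic information is a type tag per segment. The enum members mirror the available types listed on the slide; the example layout and the rule that only Acoherent segments are managed with checkout/checkin are assumptions made for this sketch.

```python
from enum import Enum


class SegmentType(Enum):
    PRIVATE = "Private"
    COHERENT_RW = "Coherent-RW"
    ACOHERENT = "Acoherent"
    DEVICE = "Device"
    COHERENT_RO = "Coherent-RO"


# Hypothetical assignment of types to program segments.
LAYOUT = {
    "stack": SegmentType.PRIVATE,
    "heap": SegmentType.ACOHERENT,
    "code": SegmentType.COHERENT_RO,
}


def uses_checkout_checkin(segment_type):
    # Assumption: only acoherent segments need explicit CO/CI.
    return segment_type is SegmentType.ACOHERENT


def segments_needing_co_ci(layout):
    """Names of the segments that software manages with CO/CI."""
    return sorted(name for name, kind in layout.items()
                  if uses_checkout_checkin(kind))


managed = segments_needing_co_ci(LAYOUT)
```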
ASM-CMP Overview
Based on MIPS
+ special insns, e.g., checkout, checkin
Uses segments, no paging
• Maintains flat address space
Coherence protocol -> Acoherence Engine
DMA for caches
• Selectively move data
(Skipping the details)
Acoherence Engine
Three main responsibilities:
Checkout: Lazy Flash Invalidate
• Invalidate all segment data
Checkin: Track write set
• Write back all dirty segment data
Order: Timestamp based
• Detect CI-CO pairs
[Diagram label: Decoupled Metastate Cache]
FSM like coherence, but few races, no directory
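The three responsibilities can be sketched in software as follows. The slides do not say how the hardware implements them, so everything here is an assumption: per-segment epoch counters stand in for lazy flash invalidation, a per-segment set stands in for write-set tracking, and a shared logical clock stands in for the timestamps that pair a checkin with a later checkout.

```python
class SharedLevel:
    """Shared storage plus a logical clock (hypothetical model)."""
    def __init__(self):
        self.data = {}
        self.clock = 0
        self.last_checkin = {}  # segment -> timestamp of latest checkin


class AcoherenceEngine:
    def __init__(self, shared):
        self.shared = shared
        self.cache = {}      # (segment, addr) -> (value, epoch at fill)
        self.epoch = {}      # segment -> current epoch
        self.write_set = {}  # segment -> set of dirty addresses

    def load(self, segment, addr):
        current = self.epoch.get(segment, 0)
        entry = self.cache.get((segment, addr))
        if entry is None or entry[1] != current:
            # Miss, or a line invalidated lazily by an epoch bump: refill.
            value = self.shared.data.get(addr, 0)
            self.cache[(segment, addr)] = (value, current)
            return value
        return entry[0]

    def store(self, segment, addr, value):
        self.cache[(segment, addr)] = (value, self.epoch.get(segment, 0))
        self.write_set.setdefault(segment, set()).add(addr)

    def checkout(self, segment):
        # Bumping the epoch invalidates every clean line of the segment
        # without walking the cache; stale lines are caught on next load.
        new_epoch = self.epoch.get(segment, 0) + 1
        self.epoch[segment] = new_epoch
        for addr in self.write_set.get(segment, ()):
            # Assumption: lines not yet checked in stay valid.
            value, _ = self.cache[(segment, addr)]
            self.cache[(segment, addr)] = (value, new_epoch)
        # Timestamp of the checkin this checkout pairs with, if any.
        return self.shared.last_checkin.get(segment)

    def checkin(self, segment):
        # Write back all dirty data of the segment, then timestamp it.
        for addr in self.write_set.pop(segment, set()):
            self.shared.data[addr] = self.cache[(segment, addr)][0]
        self.shared.clock += 1
        self.shared.last_checkin[segment] = self.shared.clock
        return self.shared.clock


shared = SharedLevel()
a, b = AcoherenceEngine(shared), AcoherenceEngine(shared)
first = b.load("heap", 8)     # b caches the initial value
a.store("heap", 8, 7)
ci_time = a.checkin("heap")
stale = b.load("heap", 8)     # b has not checked out yet
co_pair = b.checkout("heap")  # pairs with a's checkin
fresh = b.load("heap", 8)
```

Nothing in this model tracks sharers or sends invalidations to other caches, which matches the slide's point that no directory is needed.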
Energy
[Figure: chart of energy normalized to MOESI, y-axis 0 to 1.2; components e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb.]
Less Energy (Same Performance)
2012 Thesis Conclusions
Going forward:
HW designs must find efficiency
SW will want to see caches/control placement
ASM: viable alternative to coherent shared memory
Semantic cooperation between HW/SW
ASM-CMP: build components w/o coherence engine
Make custom integration easier
Practically:
Will the next x86 core use ASM? No
Will a heterogeneous accelerator? Maybe
View from 2015 by Hower & Hill
Did Coherence need to be revisited?
For CPUs, perhaps “no”
Solutions complex, but this complexity is “sunk cost”
What about coherence to GPUs/accelerators?
Acoherent Shared Memory might be a good match
Hower did not have the needed infrastructure for this
Crude GPU models would have been trashed.
Our timing was wrong
Regrettably hard to publish imperfect visions
Can affect next career steps
Hower’s Previous Work in 2012
Rerun: ISCA 2008 and CACM 2009
Race recorder for deterministic replay
vs. state of the art:
• SAME logging performance, > 10x state reduction
Calvin: HPCA 2011
Coherence for deterministic execution
• i.e., zero-log-size deterministic replay
Selective determinism to match program requirements
Hobbes: WoDet 2011
Strong acoherence in SW runtime
HETEROGENEOUS-RACE-FREE MEMORY MODELS
DEREK R. HOWER, BLAKE A. HECHTMAN,
BRADFORD M. BECKMANN, BENEDICT R. GASTER,
MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD
ASPLOS 3/4/2014
HETEROGENEOUS SOFTWARE: HIERARCHICAL W/ SCOPES
OpenCL Software Hierarchy
‒Sub-group (CUDA Warp)
‒Workgroup (thread block)
‒NDRange (grid)
‒System (system)
Scoped Synchronization
‒Sync w.r.t. subset of threads
‒OpenCL: flag.store(1,…, memory_scope_work_group)
‒CUDA: __threadfence{_block}
Why? See Hardware
[Figure: OpenCL Execution Hierarchy: a grid with Dimensions X, Y, and Z, divided into work-groups, sub-groups (hardware-specific size), and work-items.]
28 | HETEROGENEOUS-RACE-FREE MEMORY MODELS| MARCH 4, 2014
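To make "sync w.r.t. subset of threads" concrete, the sketch below computes the narrowest scope that contains two work-items and checks whether a given synchronization scope covers both. The scope names follow the hierarchy on the slide; the `WorkItem` fields and the assumption that IDs nest (a work-group ID is only meaningful within its NDRange, and so on) are made up for this example.

```python
from dataclasses import dataclass

# Scopes from narrowest to widest, following the slide's hierarchy.
SCOPES = ["sub_group", "work_group", "ndrange", "system"]


@dataclass(frozen=True)
class WorkItem:
    ndrange: int     # which NDRange (grid) the work-item belongs to
    work_group: int  # work-group ID within that NDRange
    sub_group: int   # sub-group ID within that work-group


def smallest_common_scope(a, b):
    """Narrowest scope that includes both work-items."""
    if a.ndrange != b.ndrange:
        return "system"
    if a.work_group != b.work_group:
        return "ndrange"
    if a.sub_group != b.sub_group:
        return "work_group"
    return "sub_group"


def scope_covers(sync_scope, a, b):
    """True if synchronizing at sync_scope includes both work-items."""
    needed = smallest_common_scope(a, b)
    return SCOPES.index(sync_scope) >= SCOPES.index(needed)


wi_a = WorkItem(ndrange=0, work_group=0, sub_group=0)
wi_b = WorkItem(ndrange=0, work_group=0, sub_group=1)  # same work-group
wi_c = WorkItem(ndrange=0, work_group=1, sub_group=0)  # other work-group

same_group = smallest_common_scope(wi_a, wi_b)
cross_group = smallest_common_scope(wi_a, wi_c)
too_narrow = scope_covers("work_group", wi_a, wi_c)
wide_enough = scope_covers("ndrange", wi_a, wi_c)
```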
HETEROGENEOUS HARDWARE: HIERARCHICAL W/ SCOPES
E.g. GPU memory system: Write combining caches
[Figure: work-items WI1 to WI4, each with a write buffer, feeding two L1 caches over a shared L2.]
Scopes have different costs:
‒Sync w/ work-group: flush write buffer
‒Sync w/ NDRange: flush write buffer + L1 cache flush/invalidate
Programming with scoped synchronization?
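The cost difference between scopes can be illustrated with a toy memory system. This is a simplified model under stated assumptions, not a description of any real GPU: each work-item has a write buffer, work-items in one work-group share an L1, and a dictionary plays the role of L2. The class and method names are hypothetical.

```python
class L1Cache:
    def __init__(self, l2):
        self.l2 = l2        # shared L2, modeled as a dict
        self.lines = {}     # addr -> value
        self.dirty = set()  # addresses not yet written to L2


class WorkItemPort:
    """One work-item's path to memory through its write buffer."""
    def __init__(self, l1):
        self.l1 = l1
        self.write_buffer = {}

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def load(self, addr):
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        if addr not in self.l1.lines:
            self.l1.lines[addr] = self.l1.l2.get(addr, 0)
        return self.l1.lines[addr]

    def sync(self, scope):
        """Perform the actions for a scope and return them as a list."""
        if scope not in ("work_group", "ndrange"):
            raise ValueError(f"unsupported scope: {scope}")
        # Both scopes flush the write buffer into the work-group's L1.
        for addr, value in self.write_buffer.items():
            self.l1.lines[addr] = value
            self.l1.dirty.add(addr)
        self.write_buffer.clear()
        actions = ["flush write buffer"]
        if scope == "ndrange":
            # Wider scope: also write L1 back to L2 and invalidate it.
            for addr in self.l1.dirty:
                self.l1.l2[addr] = self.l1.lines[addr]
            self.l1.dirty.clear()
            self.l1.lines.clear()
            actions.append("L1 cache flush/invalidate")
        return actions


l2 = {}
l1_a, l1_b = L1Cache(l2), L1Cache(l2)
wi1, wi2 = WorkItemPort(l1_a), WorkItemPort(l1_a)  # same work-group
wi3 = WorkItemPort(l1_b)                           # different work-group

initial = wi3.load(0)                  # wi3's L1 caches the old value
wi1.store(0, 5)
cheap = wi1.sync("work_group")
seen_in_group = wi2.load(0)            # visible through the shared L1
seen_outside = wi3.load(0)             # not visible across work-groups
costly = wi1.sync("ndrange")
wi3.sync("ndrange")                    # wi3 drops its stale L1 copy
seen_after = wi3.load(0)
```

The work-group sync touches only the write buffer, while the NDRange sync also pays for an L1 flush and invalidate on both sides before the value crosses work-groups.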
HETEROGENEOUS-RACE-FREE MEMORY MODELS
History
‒1979: Sequential Consistency (SC): like multitasking uniprocessor
‒1990: SC for DRF: SC for programs that are data-race-free
‒2005: Java uses SC for DRF (+ more)
‒2008: C++ uses SC for DRF (+ more)
Q: Heterogeneous memory model in < 3 decades?
2014: SC for Heterogeneous-Race-Free: SC for programs
‒With “enough” synchronization (DRF)
‒Of “enough” scope (HRF)
‒Variants for current & future SW/HW
(Already questioned at MICRO’15)
2014: Heterogeneous System Architecture (HSA) ADOPTS!