
Whither Acoherent Shared Memory?
Mark D. Hill
UW-Madison Computer Sciences
Workshop on Negative Outcomes,
Post-mortems, and Experiences (NOPE)
December 2015
But NOPE Can Be Fun Too
Acoherent Shared Memory
Derek R. Hower
Ph.D. Defense, July 16, 2012
www.cs.wisc.edu/multifacet/theses/derek_hower_phd.pdf
www.cs.wisc.edu/multifacet/theses/derek_hower_phd_talk.pptx
Executive Summary & Outline
 Acoherent Shared Memory [2012]
 Coherence is complex & inefficient
 Switch to CVS-like checkout/checkin model
 Same performance; less energy for CPUs
 Whither Acoherent Shared Memory?
 CPU coherence “settled”
 GPU/accelerators not ready
 Timing wrong; hard to publish out-there ideas
 But seeded Heterogeneous Race Free
The Big Picture
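[Figure: processors (P) with L1 caches, an L2, and a GPU (marked "?"), shown in a Coherent View and an Acoherent View; the Acoherent View adds checkout (CO) and checkin (CI). Both views are labeled "Simple abstraction".]
Coherent View
- Complex implementation
- Hides caches (bad?!)
- High overhead
Acoherent View
- Simple implementation
- Abstracts caches
- Low overhead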
The Problem With Coherence
 Wrong abstraction
 Optimized for fine-grained, share-everything
• Programs aren’t!
 Makes SW isolation hard
 Hypothesis: SW will want control over data placement
 Impedes HW specialization
 Does your multicore ASIC need a coherence controller?
 Coherent GPUs?
 Efficiency problems
 Directories take space/broadcasts take energy
• e.g., 14% of cache is dedicated to the directory on a 4-core die [1]
[1] Stackhouse et al., ISSCC 2008
Rethinking Coherence: Goals
 Maintain programmer sanity
 Keep shared memory
 Minimal compatibility change
 Expose hardware capabilities
 Let SW guide memory management -> semantics
 Simple hardware
 Lower cost of entry for accelerators
 Solution: Acoherent Shared Memory
ASM Model Basics
 Replace black box with simple hierarchy
 Still flat, linear address space
 SW gets private storage
[Figure: two processors (P) connected to shared memory through checkout (CO) and checkin (CI)]
 Manage with CVS-like checkout/checkin
Checkout/Checkin
Checkout:
Pull data into private storage
Checkin:
Publish local updates globally
Checkout/Checkin are not synchronization primitives
- Closer to a FENCE
Granularity?
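A minimal sketch of the checkout/checkin idea described on this slide, written in Python for illustration only. The class and method names (SharedMemory, Processor, checkout, checkin) are hypothetical, and the choice to keep not-yet-checked-in local writes across a checkout is an assumption of this sketch, not something the slides state.

```python
class SharedMemory:
    """Globally visible memory, modeled as a dict from address to value."""

    def __init__(self):
        self.data = {}


class Processor:
    """A processor whose private storage is synced only on checkout/checkin."""

    def __init__(self, shared):
        self.shared = shared
        self.private = {}   # private copy of the data
        self.dirty = set()  # addresses written since the last checkin

    def load(self, addr):
        # Loads see only private storage (0 if never checked out).
        return self.private.get(addr, 0)

    def store(self, addr, value):
        # Stores stay private until the next checkin.
        self.private[addr] = value
        self.dirty.add(addr)

    def checkout(self):
        # Checkout: pull the current global data into private storage.
        # Assumption of this sketch: unpublished local writes are kept.
        local = {a: self.private[a] for a in self.dirty}
        self.private = dict(self.shared.data)
        self.private.update(local)

    def checkin(self):
        # Checkin: publish local updates globally.
        for a in self.dirty:
            self.shared.data[a] = self.private[a]
        self.dirty.clear()


def demo():
    shared = SharedMemory()
    p0, p1 = Processor(shared), Processor(shared)
    p0.store(0x10, 42)
    seen = [p1.load(0x10)]      # p0 has not checked in yet
    p0.checkin()
    seen.append(p1.load(0x10))  # p1 has not checked out yet
    p1.checkout()
    seen.append(p1.load(0x10))  # now p1 sees p0's update
    return seen
```

Calling demo() returns [0, 0, 42]: the update becomes visible to the second processor only after one side checks in and the other checks out, which matches the "closer to a FENCE" remark rather than a lock.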
Segments
 Compromise: Memory Segments
– Linear partition of address space
– CO/CI segments at a time
 Observation: Programs are already segmented
 Can re-use layout
[Figure: process memory layout (Stack, Heap, Data, BSS, Code), annotated "Typical CO/CI granularity in existing C code"]
Segment Types
 Not all memory wants/needs acoherence
 Segment types give different “views”
 Communicate semantic information to HW
Available Types
- Private
- Coherent-RW
- Acoherent
- Device
- Coherent-RO
[Figure: memory layout (Stack, Heap, Data, BSS, Code) labeled with segment types; labels include Private, Acoherent (Shared), and Coherent RO (Shared, Read-Only)]
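One way to picture segments as a linear partition of the address space with a type per segment is the small lookup sketch below. The base addresses, the three-segment layout, and the helper names (segment_of, needs_checkout_checkin) are hypothetical; the assignment of types to Code, Heap, and Stack is one reading of the figure, and treating only Acoherent segments as needing explicit CO/CI is an assumption.

```python
from bisect import bisect_right
from enum import Enum


class SegmentType(Enum):
    PRIVATE = "Private"
    COHERENT_RW = "Coherent-RW"
    ACOHERENT = "Acoherent"
    DEVICE = "Device"
    COHERENT_RO = "Coherent-RO"


# Hypothetical layout: (base address, name, type), sorted by base address.
SEGMENTS = [
    (0x0000, "Code", SegmentType.COHERENT_RO),
    (0x4000, "Heap", SegmentType.ACOHERENT),
    (0xC000, "Stack", SegmentType.PRIVATE),
]
BASES = [base for base, _, _ in SEGMENTS]


def segment_of(addr):
    """Return (name, type) of the segment that contains addr."""
    if addr < BASES[0]:
        raise ValueError("address below the first segment")
    # The segment is the last one whose base is <= addr.
    i = bisect_right(BASES, addr) - 1
    _, name, seg_type = SEGMENTS[i]
    return name, seg_type


def needs_checkout_checkin(addr):
    # Assumption: only Acoherent segments rely on explicit CO/CI.
    return segment_of(addr)[1] is SegmentType.ACOHERENT
```

Because the address space stays flat, software keeps using ordinary addresses; the segment type is the only extra semantic information handed to hardware.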
ASM-CMP Overview
 Based on MIPS
 + special insns, e.g., checkout, checkin
 Uses segments, no paging
• Maintains flat address space
 Coherence protocol -> Acoherence Engine
 DMA for caches
• Selectively move data
(Skipping the Details)
Acoherence Engine
 Three main responsibilities:
 Checkout:
• Invalidate all segment data
• Mechanism: Lazy Flash Invalidate
 Checkin:
• Write back all dirty segment data
• Mechanism: Track write set; Decoupled Metastate Cache
 Order:
• Detect CI-CO pairs
• Mechanism: Timestamp based
 FSM like coherence, but few races, no directory
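The checkout and checkin responsibilities above can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware design from the thesis: the epoch counter used to model "lazy flash invalidate", the AcoherentCache class, and its method names are all hypothetical, and the timestamp-based ordering of CI-CO pairs is not modeled.

```python
class AcoherentCache:
    """Private cache for one segment: lazy invalidation plus a write set."""

    def __init__(self, memory):
        self.memory = memory    # shared backing store: dict addr -> value
        self.epoch = 0          # bumped on every checkout
        self.lines = {}         # addr -> (value, epoch when filled)
        self.write_set = set()  # addresses written since the last checkin

    def checkout(self):
        # Lazy flash invalidate (assumed mechanism): a single counter bump
        # makes every older clean line stale without walking the cache.
        self.epoch += 1

    def load(self, addr):
        entry = self.lines.get(addr)
        # A line is usable if it was filled in this epoch or is locally dirty.
        if entry is not None and (entry[1] == self.epoch or addr in self.write_set):
            return entry[0]
        value = self.memory.get(addr, 0)
        self.lines[addr] = (value, self.epoch)
        return value

    def store(self, addr, value):
        self.lines[addr] = (value, self.epoch)
        self.write_set.add(addr)

    def checkin(self):
        # Write back only the tracked write set; return how many lines moved.
        for addr in self.write_set:
            self.memory[addr] = self.lines[addr][0]
        written = len(self.write_set)
        self.write_set.clear()
        return written


def demo():
    memory = {0: 1}
    a, b = AcoherentCache(memory), AcoherentCache(memory)
    first = b.load(0)     # fills b's cache with the old value
    a.store(0, 7)
    wrote = a.checkin()   # one dirty line written back
    stale = b.load(0)     # b has not checked out, so it still hits
    b.checkout()
    fresh = b.load(0)     # refetched after the lazy invalidate
    return first, wrote, stale, fresh
```

demo() returns (1, 1, 1, 7). Note that no directory or sharer list appears anywhere: each cache decides locally what is stale and what to write back.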
Energy
[Figure: Energy Normalized to MOESI (y-axis 0 to 1.2), broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb]
Less Energy
(Same Performance)
2012 Thesis Conclusions
 Going forward:
 HW designs must find efficiency
 SW will want to see caches/control placement
 ASM: viable alternative to coherent shared memory
 Semantic cooperation between HW/SW
 ASM-CMP: build components w/o coherence engine
 Make custom integration easier
 Practically:
 Will the next x86 core use ASM? No
 Will a heterogeneous accelerator? Maybe
View from 2015 by Hower & Hill
 Did Coherence need to be revisited?
 For CPUs, perhaps “no”
 Solutions complex, but this complexity is “sunk cost”
 What about coherence to GPUs/accelerators?
 Acoherent Shared Memory might be a good match
 Hower did not have the needed infrastructure for this
 Crude GPU models would have been trashed.
 Our timing was wrong
 Regrettably hard to publish imperfect visions
 Can affect next career steps
Hower’s Previous Work in 2012
 Rerun: ISCA 2008 and CACM 2009
 Race recorder for deterministic replay
 vs. state of the art:
• SAME logging performance, > 10x state reduction
 Calvin: HPCA 2011
 Coherence for deterministic execution
• i.e., zero-log-size deterministic replay
 Selective determinism to match program requirements
 Hobbes: WoDet 2011
 Strong acoherence in SW runtime
HETEROGENEOUS-RACE-FREE
MEMORY MODELS
DEREK R. HOWER, BLAKE A. HECHTMAN,
BRADFORD M. BECKMANN, BENEDICT R. GASTER,
MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD
ASPLOS 3/4/2014
HETEROGENEOUS SOFTWARE
HIERARCHICAL W/ SCOPES
OpenCL Software Hierarchy
‒Sub-group (CUDA Warp)
‒Workgroup (thread block)
‒NDRange (grid)
‒System (system)
Scoped Synchronization
‒Sync w.r.t. subset of threads
‒OpenCL: flag.store(1,…, memory_scope_work_group)
‒CUDA: __threadfence{_block}
Why? See Hardware
[Figure: OpenCL Execution Hierarchy; labels: Grid, Work-group, Sub-group (Hardware-specific size), Work-item, Dimension X, Dimension Y, Dimension Z]
28 | HETEROGENEOUS-RACE-FREE MEMORY MODELS| MARCH 4, 2014
HETEROGENEOUS HARDWARE
HIERARCHICAL W/ SCOPES
[Figure: work-items WI1 to WI4 with write buffers, two L1 caches, and a shared L2]
E.g. GPU memory system: Write combining caches
Scopes have different costs:
‒Sync w/ work-group: flush write buffer
‒Sync w/ NDRange: flush write buffer + L1 cache flush/invalidate
 Programming with scoped synchronization?
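To make the two scope costs concrete, here is a toy Python model of the memory system on this slide: a write buffer per work-item, one L1 per work-group, and a shared L2. Everything here (GPUMemory, Scope, the sync method, and folding release and acquire into one sync call) is a hypothetical simplification for illustration, not OpenCL, CUDA, or any real GPU interface.

```python
from enum import Enum


class Scope(Enum):
    WORK_GROUP = "work-group"
    NDRANGE = "NDRange"


class GPUMemory:
    """Write buffer per work-item, L1 per work-group, one shared L2."""

    def __init__(self, group_of):
        self.group_of = group_of  # work-item name -> work-group name
        groups = set(group_of.values())
        self.l2 = {}
        self.l1 = {g: {} for g in groups}
        self.l1_dirty = {g: set() for g in groups}
        self.wbuf = {wi: {} for wi in group_of}

    def store(self, wi, addr, value):
        self.wbuf[wi][addr] = value

    def load(self, wi, addr):
        if addr in self.wbuf[wi]:
            return self.wbuf[wi][addr]
        l1 = self.l1[self.group_of[wi]]
        if addr not in l1:
            l1[addr] = self.l2.get(addr, 0)  # fill L1 from L2
        return l1[addr]

    def sync(self, wi, scope):
        g = self.group_of[wi]
        # Both scopes: flush this work-item's write buffer into its L1.
        self.l1[g].update(self.wbuf[wi])
        self.l1_dirty[g].update(self.wbuf[wi])
        self.wbuf[wi].clear()
        if scope is Scope.NDRANGE:
            # Wider scope: also flush dirty L1 lines to L2 and invalidate L1.
            for addr in self.l1_dirty[g]:
                self.l2[addr] = self.l1[g][addr]
            self.l1_dirty[g].clear()
            self.l1[g].clear()


def demo():
    mem = GPUMemory({"WI1": "g0", "WI2": "g0", "WI3": "g1", "WI4": "g1"})
    mem.store("WI1", "x", 1)
    mem.sync("WI1", Scope.WORK_GROUP)
    seen = [mem.load("WI2", "x")]     # same work-group: sees the store
    seen.append(mem.load("WI3", "x")) # other work-group: does not
    mem.sync("WI1", Scope.NDRANGE)
    seen.append(mem.load("WI3", "x")) # WI3 still holds a stale L1 line
    mem.sync("WI3", Scope.NDRANGE)
    seen.append(mem.load("WI3", "x")) # visible after WI3 also syncs
    return seen
```

demo() returns [1, 0, 0, 1]. Work-group scope is cheap but only reaches work-items behind the same L1; reaching another work-group needs the more expensive NDRange-scope sync on both sides.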
HETEROGENEOUS-RACE-FREE MEMORY MODELS
History
‒1979: Sequential Consistency (SC): like multitasking uniprocessor
‒1990: SC for DRF: SC for programs that are data-race-free
‒2005: Java uses SC for DRF (+ more)
‒2008: C++ uses SC for DRF (+ more)
Q: Heterogeneous memory model in < 3 decades?
2014: SC for Heterogeneous-Race-Free: SC for programs
‒With “enough” synchronization (DRF)
‒Of “enough” scope (HRF)
‒Variants for current & future SW/HW
(Already questioned at MICRO’15)
2014: Heterogeneous System Architecture (HSA) ADOPTS!
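The phrase "enough" scope can be illustrated with a small check: a synchronization scope is enough for two work-items if it is at least as wide as the smallest scope that contains both. This is a simplified reading for illustration, not the formal HRF definition from the paper; the WorkItem fields and function names are hypothetical.

```python
from enum import IntEnum
from typing import NamedTuple


class Scope(IntEnum):
    # Ordered from narrowest to widest, following the software hierarchy.
    SUB_GROUP = 0
    WORK_GROUP = 1
    NDRANGE = 2
    SYSTEM = 3


class WorkItem(NamedTuple):
    ndrange: int
    work_group: int
    sub_group: int


def smallest_common_scope(a, b):
    """Narrowest scope that includes both work-items."""
    if a.ndrange != b.ndrange:
        return Scope.SYSTEM
    if a.work_group != b.work_group:
        return Scope.NDRANGE
    if a.sub_group != b.sub_group:
        return Scope.WORK_GROUP
    return Scope.SUB_GROUP


def scope_is_enough(sync_scope, a, b):
    # Simplification: treat the scope as sufficient if it covers both sides.
    return sync_scope >= smallest_common_scope(a, b)
```

For example, scope_is_enough(Scope.WORK_GROUP, WorkItem(0, 0, 0), WorkItem(0, 1, 0)) is False, because the two work-items sit in different work-groups and would need at least NDRange scope.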