
Theory: Asleep at the Switch to Many-Core
Phillip B. Gibbons
Intel Research Pittsburgh
Workshop on Theory and Many-Core
May 29, 2009
Slides are © Phillip B. Gibbons
Two decades after the peak of Theory's interest in parallel computing…
The Age of Many-Core is finally underway
• Fueled by Moore's Law: 2X cores per chip every 18 months
• (Almost) the only way to faster apps
All aboard the parallelism train!
All Aboard the Parallelism Train?
Switch to Many-Core…Many Challenges
• Interest waned long ago
• Yet problems were NOT solved
Research needed in all aspects of Many-Core
• Computer Architecture: YES!
• Programming Languages & Compilers: YES!
• Operating & Runtime Systems: YES!
• Theory: Who has answered the call?
Theory: Asleep at the Switch
“Engineer driving derailed Staten Island train may
have fallen asleep at the switch.” (12/26/08)
Theory needs to wake up & regain a leadership role in parallel computing
Theory’s Strengths
• Conceptual Models
– Abstract models of computation
• New Algorithmic Paradigms
– New algorithms, new protocols
• Provable Correctness
– Safety, liveness, security, privacy,…
• Provable Performance Guarantees
– Approximation, probabilistic, new metrics
• Inherent Power/Limitations
– Of primitives, features,…
…among others
Five Areas in Which Theory Can (Should)
Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power
[Image: Montparnasse train derailment, 1895]
Impact Area: Parallel Thinking
Key: Good Model of Parallel Computation
• Express parallelism
• Good parallel programmer's model
• Good for teaching, teaching "how to think"
• Can be engineered to good performance
Impact Area: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers
Impact Area: Asymmetry/Heterogeneity
• Fat/Thin cores
• SIMD extensions
• Multiple coherence domains
• Mixed-mode parallelism
• Virtual Machines
• ...
Impact Area: Concurrency Primitives
• Parallel prefix
• Hash map [Herlihy08]
• Map reduce [Karloff09]
• Transactional memory
• Memory block transactions [Blelloch08]
• Graphics primitives [Ha08]
Theory's role here:
• Make the case that Many-Core should (or should not) support them
• Improve the algorithms
• Recommend new primitives (prescriptive)
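As an illustration of the first primitive above, here is a minimal Python sketch (mine, not from the talk) of the classic work-efficient parallel prefix (exclusive scan). It runs sequentially, but each inner loop visits independent index pairs that a many-core implementation could execute in parallel.

```python
# Illustrative sketch (not from the talk): Blelloch-style
# work-efficient exclusive prefix sum. Each inner loop iterates over
# independent index pairs, so every tree level could run in parallel:
# O(n) total work, O(log n) depth.

def exclusive_scan(a):
    n = len(a)
    assert n and (n & (n - 1)) == 0, "sketch assumes power-of-two length"
    t = list(a)

    # Up-sweep (reduce): build partial sums up a balanced tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):        # independent pairs: parallel
            t[i + 2 * d - 1] += t[i + d - 1]
        d *= 2

    # Down-sweep: seed the root with the identity, push prefixes down.
    t[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):        # independent pairs: parallel
            left = t[i + d - 1]
            t[i + d - 1] = t[i + 2 * d - 1]
            t[i + 2 * d - 1] += left
        d //= 2
    return t

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```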
Impact Area: Power
Many-cores provide features for reducing power
• Voltage scaling [Albers07]
• Dynamically run on fewer cores, fewer banks
A fertile area for Theory to help
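To make the voltage-scaling point concrete, here is a hedged sketch using the cube-law power model common in the speed-scaling literature (the line of work cited as [Albers07]); the exponent ALPHA = 3 is a modeling assumption, not a measured chip characteristic.

```python
# Toy speed-scaling model (assumption: dynamic power grows roughly as
# speed**ALPHA with ALPHA = 3). Energy = power * time.

ALPHA = 3

def energy(work, speed):
    """Energy to do `work` units at constant `speed`."""
    return speed ** ALPHA * (work / speed)

# One core at speed 2 vs. two cores at speed 1, same total throughput:
one_fast = energy(work=100, speed=2.0)
two_slow = 2 * energy(work=50, speed=1.0)
print(one_fast, two_slow)  # 400.0 vs 100.0: running slower in parallel saves energy
```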
Deep Dive: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers
Good Performance Requires Effective Use of the Memory Hierarchy
[Figure: hierarchy: CPU, L1, L2 Cache, Main Memory, Magnetic Disks]
Performance:
• Running/response time
• Throughput
• Power
Two new trends, Pervasive Multicore & Pervasive Flash, bring new challenges and opportunities
New Trend 1: Pervasive Multicore
[Figure: multiple CPUs with private L1 caches over a shared L2 cache, Main Memory, and Magnetic Disks]
Opportunity
• Rethink apps & systems to take advantage of more CPUs on chip
Challenges
• Cores compete for the hierarchy
• Hard to reason about parallel performance
• Hundred cores coming soon
• Cache hierarchy design in flux
• Hierarchies differ across platforms
Makes effective use of the hierarchy much harder
New Trend 2: Pervasive Flash
[Figure: CPUs with private L1s over a shared L2 cache, with Flash devices joining Main Memory and Magnetic Disks]
Opportunity
• Rethink apps & systems to take advantage
Challenges
• Performance quirks of Flash
• Technology in flux, e.g., the Flash Translation Layer (FTL)
A new type of storage in the hierarchy
How the Hierarchy is Treated Today
Algorithm designers & application/system developers often tend toward one of two extremes:
• Ignorant: an API view of Memory + I/O; parallelism often ignored [performance iffy]
• (Pain)fully aware: hand-tuned to the platform [effort high, not portable, limited sharing scenarios]
Or they focus on one or a few aspects, but without a comprehensive view of the whole
Hierarchy-Savvy Parallel Algorithm Design
(Hi-Spade) project
…seeks to enable:
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
"Hierarchy-Savvy":
• Hide what can be hidden; expose what must be exposed for good performance
• Robust: many platforms, many resource-sharing scenarios
• A sweet spot between ignorant and (pain)fully aware
http://www.pittsburgh.intel-research.net/projects/hi-spade/
Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design
& systems for emerging parallel hierarchies
Research agenda includes
• Theory: conceptual models, algorithms, analytical guarantees
• Systems: runtime support, performance tools, architectural features
• Applications: databases, operating systems, application kernels
Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Cache Hierarchies: Sequential
External Memory (EM) Algorithms [see Vitter's ACM Computing Surveys article]
External Memory Model: Main Memory (size M); External Memory accessed in blocks of size B
+ Simple model
+ Minimize I/Os
– Only 2 levels
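A small sketch of what "minimize I/Os" means in the EM model, using the standard scan and multiway-mergesort bounds (the function names and this packaging are mine, for illustration):

```python
import math

# EM cost model sketch: count block transfers (I/Os), not instructions.
# M = internal memory size, B = block size (both in elements).

def scan_ios(n, B):
    """Scanning n elements touches ceil(n/B) blocks."""
    return math.ceil(n / B)

def sort_ios(n, M, B):
    """Standard EM sorting bound: O((n/B) * log_{M/B}(n/B)) I/Os."""
    nb = n / B
    return nb * math.log(nb, M / B)

# A billion elements with 1000-element blocks: a million transfers.
print(scan_ios(10**9, B=10**3))
```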
Cache Hierarchies: Sequential
Alternative: Cache-Oblivious Algorithms [Frigo99]
Cache-Oblivious Model: a twist on the EM model in which M & B are unknown to the algorithm
Key goal: good performance for any M & B
+ Simple model
+ Guaranteed good cache performance at all levels of the hierarchy
– Single CPU only
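To illustrate the cache-oblivious style, here is a minimal recursive matrix transpose in Python (an illustrative sketch, not code from the talk): no M or B appears anywhere, yet the recursion keeps shrinking subproblems until they fit in any cache level.

```python
# Cache-oblivious recursive transpose in the style of [Frigo99].
# Split the longer dimension in half until the tile is small; the
# recursion automatically produces cache-friendly tiles for every
# (M, B), without naming either parameter.

def transpose(A, T, r0=0, c0=0, rows=None, cols=None):
    rows = len(A) if rows is None else rows
    cols = len(A[0]) if cols is None else cols
    if rows <= 8 and cols <= 8:                # small base-case tile
        for i in range(r0, r0 + rows):
            for j in range(c0, c0 + cols):
                T[j][i] = A[i][j]
    elif rows >= cols:                         # split the longer side
        h = rows // 2
        transpose(A, T, r0, c0, h, cols)
        transpose(A, T, r0 + h, c0, rows - h, cols)
    else:
        h = cols // 2
        transpose(A, T, r0, c0, rows, h)
        transpose(A, T, r0, c0 + h, rows, cols - h)

A = [[i * 3 + j for j in range(3)] for i in range(2)]   # 2x3 matrix
T = [[0] * 2 for _ in range(3)]
transpose(A, T)
print(T)  # [[0, 3], [1, 4], [2, 5]]
```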
Cache Hierarchies: Parallel
Explicit multi-level hierarchy: Multi-BSP Model [Valiant08]
Goal: approach the simplicity of the cache-oblivious model, the Hierarchy-Savvy sweet spot
Challenge:
– The theory of cache-oblivious algorithms falls apart once parallelism is introduced: good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy
Key reason: caches are not fully shared
[Figure: CPU1, CPU2, CPU3 with private L1 caches over a shared L2 cache, all wanting to write block B]
What's good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write B at ≈ the same time
– Parallel cache-obliviousness is too strict a goal
Key new dimension: Scheduling of parallel threads
Scheduling has a LARGE impact on cache performance
Recall our problem scenario: all CPUs want to write B at ≈ the same time
[Figure: CPU1, CPU2, CPU3 with private L1 caches over a shared L2 cache, contending for block B]
We can mitigate (but not solve) the problem if we can schedule the writes to be far apart in time
Existing Parallel Cache Models [slide from Rezaul Chowdhury]
Parallel Shared-Cache Model: CPUs 1…p share a single cache of size C, with block transfers (size B) to main memory
Parallel Private-Cache Model: CPUs 1…p each have a private cache of size C, with block transfers (size B) to main memory
Competing Demands of Private and Shared Caches [slide from Rezaul Chowdhury]
[Figure: CPUs 1…p with private caches over a shared cache and main memory]
• Shared cache: cores work on the same set of cache blocks
• Private cache: cores work on disjoint sets of cache blocks
• Experimental results have shown that on CMP architectures:
– work stealing, the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses
– parallel depth-first, the best scheduler for the shared-cache model, can incur excessive private-cache misses
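The shared-cache side of this tension can be imitated with a toy experiment (entirely illustrative; the "PDF-like" and "WS-like" traces below are stylized access orders, not real schedules): count misses in one shared LRU cache when tasks that reuse the same blocks run close together in time, versus when each core runs a distant subtree.

```python
from collections import OrderedDict

# Toy model: one shared LRU cache. A PDF-like order keeps all cores
# near the same point of the sequential order (one subtree's working
# set at a time); a WS-like order runs 4 distant subtrees at once, so
# the shared cache sees 4 unrelated working sets simultaneously.

def misses(trace, capacity):
    cache, m = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)           # LRU hit
        else:
            m += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict least recently used
    return m

# 4 subtrees, each touching 8 blocks twice.
subtrees = [[(t, i) for i in range(8)] for t in range(4)]
pdf_like = [b for tree in subtrees for b in tree * 2]                     # one subtree at a time
ws_like = [b for step in zip(*[t * 2 for t in subtrees]) for b in step]   # 4 subtrees interleaved

print(misses(pdf_like, capacity=8), misses(ws_like, capacity=8))
```

With a cache that fits one subtree's working set (capacity 8), the PDF-like order gets all the reuse hits while the WS-like order misses on every access.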
Private vs. Shared Caches
[Figure: CPU1, CPU2, CPU3 with private L1 caches over a shared L2 cache]
• Parallel all-shared hierarchy:
+ Provably good cache performance for cache-oblivious algorithms
• 3-level multi-core model: insights on private vs. shared
+ Designed a new scheduler with provably good cache performance for a class of divide-and-conquer algorithms [Blelloch08]
– Results require exposing the working-set size of each recursive subproblem
Parallel Tree of Caches
[Figure: a tree of caches spanning many cores]
Approach [Blelloch09]: design a low-depth (depth D) cache-oblivious algorithm
Thm: for each level i, only O(M_i · P · D / B_i) misses more than the sequential schedule
Good miss bound
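Spelling out the theorem (the notation $Q_i$ for the number of level-$i$ misses is an addition here, not the slide's):

```latex
Q_i(P) \;\le\; Q_i(1) + O\!\left(\frac{M_i \, P \, D}{B_i}\right)
\qquad \text{for every cache level } i,
```

where $Q_i(P)$ counts level-$i$ misses on $P$ cores, $M_i$ and $B_i$ are the capacity and block size at level $i$, and $D$ is the algorithm's depth.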
Five Areas in Which Theory Can (Should)
Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power