The Case for Hardware Transactional Memory
The Stanford Pervasive Parallelism Lab
A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan,
J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis,
K. Olukotun, M. Rosenblum, S. Thrun
Pervasive Parallelism Laboratory
Stanford University
The Looming Crisis
Software developers will soon face systems with
> 1 TFLOP of compute power
20+ cores, 100+ hardware threads
Heterogeneous cores (CPU+GPUs), app-specific accelerators
Deep memory hierarchies
Challenge: harness these devices productively
Improve performance, power, reliability and security
The parallelism gap
Yawning divide between the capabilities of today’s programming
environments, the requirements of emerging applications, and
the challenges of future parallel architectures
The Stanford Pervasive Parallelism Laboratory
Goal: the parallel computing platform for 2012
Make parallel programming practical for the masses
Algorithms, programming models, runtimes, and
architectures for scalable parallelism (10,000s of threads)
Parallel computing a core component of CS education
PPL is a combination of
Leading Stanford researchers across multiple domains
Leading companies in computer systems and software
Applications, languages, software systems, architecture
Sun, AMD, Nvidia, IBM, Intel, HP
An exciting vision for pervasive parallelism
Open laboratory; all results in open source
The PPL Team
Applications
  Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
Programming & software systems
  Alex Aiken, Pat Hanrahan, Mendel Rosenblum
Architecture
Bill Dally, John Hennessy, Mark Horowitz,
Christos Kozyrakis, Kunle Olukotun (Director)
The PPL Team
Research expertise
Applications: graphics, physics simulation, visualization, AI,
robotics, …
Software systems: virtual machines, GPGPU, stream
programming, transactional programming, speculative
parallelization, optimizing compilers, bug detection, security,…
Architecture: multi-core & multithreading, scalable shared memory, transactional memory hardware, interconnect networks, low-power processors, stream processors, vector processors, …
Commercial success
MIPS & SGI, Rambus, VMware, Niagara processors, Renderman,
Stream Processors Inc, Avici, Tableau, …
Guiding Principles
Top down research
Scalability
In hardware resources (10,000s of threads)
Developer productivity (ease of use)
HW provides flexible primitives
App & developer needs drive system
High-level info flows to low-level system
Software synthesizes complete solutions
Build real, full system prototypes
The PPL Vision
Virtual
Worlds
Rendering
DSL
Autonomous
Vehicle
Physics
DSL
Scripting
DSL
Financial
Services
Probabilistic
DSL
Analytics
DSL
Parallel Object Language
Common Parallel Runtime
Explicit / Static
Implicit / Dynamic
Hardware Architecture
OOO Cores
Scalable
Interconnects
Partitionable
Hierarchies
SIMD Cores
Scalable
Coherence
Threaded Cores
Isolation &
Atomicity
Pervasive
Monitoring
Demanding Applications
[Diagram: the PPL at the center, surrounded by application domains]
Existing Stanford research centers: Media-X, NIH NCBC, DOE ASC
Existing Stanford CS research groups: seismic modeling, geophysics, environmental science, AI/ML, vision, web & mining, streaming DB, mobile, HCI, graphics, games
Leverage domain expertise at Stanford
CS research groups & national centers for scientific computing
From consumer apps to neuroinformatics
Virtual Worlds Application
Next-gen web platform
  Immersive collaboration
  Social gaming
  Millions of players in a vast landscape
Challenges
  Client-side game engine
  Server-side world simulation
  AI, physics, large-scale rendering
  Dynamic content, huge datasets
More at http://vw.stanford.edu/
Autonomous Vehicle Application
Cars that drive autonomously in traffic
  Save lives & money
  Improve highway throughput
  Improve productivity
Challenges
  Client-side sensing, perception, planning, & control
  Server-side data merging, pre-processing & post-processing, traffic control, model generation
  Real-time, huge datasets
More at http://www.stanfordracing.org
Domain Specific Languages (DSL)
Leverage success of DSL across application domains
SQL (data manipulation), Matlab (scientific), Ruby/Rails (web),…
DSLs
High-level data types & ops tailored to domain
E.g., relations, triangles, matrices, …
Express high-level intent without specific implementation artifacts → higher productivity for developers
Programmer isolated from details of specific system
DSLs → scalable parallelism for the system
Declarative description of parallelism & locality patterns
E.g., ops on relation elements, sub-array being processed, …
Portable and scalable specification of parallelism
Automatically adjust data structures, mapping, and scheduling as
systems scale up
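A hypothetical sketch of the embedded-DSL idea above (the deck proposes Scala as the host language; Python is used here only to keep the sketch short). A high-level domain type such as a relation exposes declarative ops, so an implementation is free to parallelize, repartition, or reschedule them as the system scales:

```python
# Hypothetical sketch, not the PPL implementation: a toy relational type,
# as in the deck's "relations, triangles, matrices" list of DSL data types.

class Relation:
    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, pred):
        # Declarative filter: element-independent, so an implementation
        # may partition rows across any number of threads.
        return Relation(r for r in self.rows if pred(r))

    def select(self, fn):
        # Declarative projection: same data-parallel freedom as where().
        return Relation(fn(r) for r in self.rows)

trades = Relation([{"sym": "A", "px": 10}, {"sym": "B", "px": 20}])
big = trades.where(lambda r: r["px"] > 15).select(lambda r: r["sym"])
```

The user states what to compute; nothing in the program fixes a thread count, data layout, or schedule, which is what makes the specification portable.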
DSL Research & Challenges
Goal: create the tools for DSL development
Initial DSL targets
Rendering, physics simulation, analytics, probabilistic
computations
Challenges
DSL implementation → embed in base PL
DSL-specific optimizations → telescoping compilers
Start with Scala (OO, type-safe, functional, extensible)
Use Scala as a scripting DSL that ties multiple DSLs together
Use domain knowledge to optimize & annotate code
Feedback to programmers?
…
Common Parallel Runtime (CPR)
Goals
Provide common, portable, abstract target for all DSLs
Manages parallelism & locality
Write once, run everywhere model
Achieve efficient execution (performance, power, …)
Handles specifics of HW system
Approach
Compile DSLs to common IR
  Base language + low-level constructs & pragmas
    Forall, async/join, atomic, barrier, …
  Per-object capabilities
    Read-only or write-only, output data, private, relaxed coherence, …
Combine static compilation + dynamic management
Explicit management of regular tasks & predictable patterns
Implicit management of irregular parallelism
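The static/dynamic split above can be sketched as two forall strategies (hypothetical code, Python for brevity; not the PPL runtime): regular work is chunked explicitly up front, while irregular work is pulled from a shared queue so load balances itself at run time.

```python
# Hypothetical sketch of the two CPR execution styles.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def forall_static(items, body, workers=4):
    # Explicit management: regular, predictable work is split up front.
    chunks = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(workers) as pool:
        for chunk in chunks:
            pool.submit(lambda c=chunk: [body(x) for x in c])
    # exiting the with-block waits for every chunk to finish

def forall_dynamic(items, body, workers=4):
    # Implicit management: irregular work is pulled from a shared queue.
    q = Queue()
    for x in items:
        q.put(x)
    def worker():
        while True:
            try:
                body(q.get_nowait())
            except Empty:
                return
        # workers drain the queue at whatever rate their items allow
    with ThreadPoolExecutor(workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

static_out, dynamic_out = [], []
forall_static(list(range(8)), static_out.append)
forall_dynamic(list(range(8)), dynamic_out.append)
```

The static version has no scheduling overhead but assumes items cost roughly the same; the dynamic version pays per-item queue traffic in exchange for load balance, which is the trade-off CPR has to manage.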
CPR Research & Challenges
Integrating & balancing opposing approaches
Task-level & data-level parallelism
Static & dynamic concurrency management
Explicit & implicit memory management
Utilize high-level information from DSLs
The key to overcoming difficult challenges
Adapt to changes in application behavior, OS decisions,
runtime constraints
Manage heterogeneous HW resources
Utilize novel HW primitives
To reduce overhead of communication, synchronization, …
To understand runtime behavior on specific HW & adapt to it
Hardware Architecture @ 2012
The many-core chip
  100s of cores
  OOO, threaded, & SIMD
  Hierarchy of shared memories
  Scalable, on-chip network
The system
  Few many-core chips
  Per-chip DRAM channels
  Global address space
  I/O
The data-center
  Cluster of systems
[Diagram: groups of threaded cores (TC), OOO cores, and SIMD cores, each with private L1 caches, sharing L2 memories over the on-chip network, backed by an L3 memory, DRAM controllers, and I/O]
Architecture Challenges
Heterogeneity
Support for parallelism & locality management
Synchronization, communication, …
Explicit vs. implicit locality management
Runtime monitoring
Scalability
Balance of resources, granularity of parallelism
On-chip/off-chip bandwidth & latency
Scalability of key abstractions (e.g., coherence)
Beyond performance
Power, fault-tolerance, QoS, security, virtualization
Architecture Research
Revisit architecture & micro-architecture for parallelism
Software synthesizes primitives into execution systems
Define semantics & implementation of key primitives
Communication, atomicity, isolation, partitioning, coherence,
consistency, checkpoint
Fine-grain & bulk support
Streaming system: partitioning + bulk communication
Thread-level speculation: isolation + fine-grain communication
Transactional memory: atomicity + isolation + consistency
Security: partitioning + isolation
Fault tolerance: isolation + checkpoint + bulk communication
Challenges: interactions, scalability, cost, virtualization
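The transactional-memory row above composes atomicity + isolation + consistency. A minimal optimistic-versioning sketch in software shows the semantics the primitives must provide (hypothetical code; the deck's argument is precisely that hardware support is needed to cut the overhead this software version pays):

```python
# Hypothetical software sketch of transactional atomicity + isolation.
import threading

class TVar:
    # A transactional variable; the version number enables conflict detection.
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()

def atomic(txn):
    # Optimistic concurrency: read freely, buffer writes, validate at commit.
    while True:
        reads, writes = {}, {}
        txn(reads, writes)                 # run the transaction body
        with _commit_lock:                 # commits are serialized
            if all(v.version == ver for v, ver in reads.items()):
                for v, val in writes.items():   # atomicity: all-or-nothing
                    v.value = val
                    v.version += 1
                return
        # a conflicting commit raced us; isolation demands a retry

def tx_read(var, reads):
    reads[var] = var.version
    return var.value

acct = TVar(100)

def withdraw(reads, writes):
    writes[acct] = tx_read(acct, reads) - 10

atomic(withdraw)
```

Because writes stay buffered until a validated commit, no other transaction ever observes a half-finished update; the per-access logging and validation is exactly the cost hardware TM would hide.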
Architecture Research
Software-managed HW primitives
Exploit high-level knowledge from DSLs & CPR
E.g., scale coherence using coarse-grain techniques
Support for programmability & management
Coarse-grain in time: force coherence only when needed
Coarse-grain in space: object-based, selective coherence
Fine-grain monitoring, HW-assisted invariants
Build upon primitives for concurrency
Efficient interface to CPR
Scalable on-chip & off-chip interconnects
High-radix network
Adaptive routing
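The "coarse-grain in time" idea above can be illustrated with a small software model (hypothetical; Python for brevity): a worker buffers its writes locally and forces coherence only at a phase boundary, instead of on every store.

```python
# Hypothetical model of relaxed, phase-based coherence.
class RelaxedShared:
    def __init__(self, data):
        self.master = dict(data)   # the globally coherent copy
        self.local = {}            # this worker's buffered writes

    def write(self, key, value):
        self.local[key] = value    # no coherence traffic yet

    def read(self, key):
        # A worker sees its own writes; others see the last published state.
        return self.local.get(key, self.master[key])

    def publish(self):
        # Coherence is forced only here, once per phase.
        self.master.update(self.local)
        self.local.clear()

cell = RelaxedShared({"x": 0})
cell.write("x", 5)
stale = cell.master["x"]   # other workers still observe 0
cell.publish()
fresh = cell.master["x"]   # now everyone observes 5
```

This is safe only when the DSL or runtime can guarantee no one needs the value before the phase boundary, which is why the high-level information flowing down from DSLs and the CPR matters to the hardware.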
Research Methodology
Conventional approaches are still useful
Develop app & SW system on existing platforms
Simulate novel HW mechanisms
Need a method that bridges HW & SW research
Makes new HW features available for SW research
Does not compromise HW speed, SW features, or scale
Allows for full-system prototypes
Multi-core, accelerators, clusters, …
Needed for research, convincing for industry, exciting for students
Approach: commodity chips + FPGAs in memory system
Commodity chips: fast system with rich SW environment
FPGAs: prototyping platform for new HW features
Scale through cluster arrangement
FARM: Flexible Architecture Research Machine
[Figure 1: baseline — a multi-socket commodity system of quad-core chips, each with attached memory]
[Figure 2: one socket replaced by an FPGA with local SRAM, placing the FPGA in the memory system, plus I/O]
[Figure 3: a GPU/stream engine added alongside the FPGA for heterogeneous experiments]
[Figure 4 (scalable): nodes linked through the FPGAs over an Infiniband or PCIe interconnect]
Example FARM Uses
Software research
SW development for heterogeneous systems
Scheduling system for large-scale parallelism
Code generation & resource management
Thread state management & adaptive control
Hardware research
Scalable streaming & transactional HW
Scalable shared memory
FPGA extends protocols throughout cluster
FPGA provides coarse-grain tracking
Hybrid memory systems
Custom processors & accelerators
HW support for monitoring, scheduling, isolation,
virtualization, …
Conclusions
PPL: a full system vision for pervasive parallelism
Applications, programming models, software systems, and
hardware architecture
Key initial ideas
Domain-specific languages
Combine implicit & explicit management
Flexible HW features