The Case for Hardware Transactional Memory

Transcript

The Stanford Pervasive Parallelism Lab
A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M. Rosenblum, S. Thrun
Pervasive Parallelism Laboratory
Stanford University
The Looming Crisis

Software developers will soon face systems with
  - > 1 TFLOP of compute power
  - 20+ cores, 100+ hardware threads
  - Heterogeneous cores (CPU + GPUs), app-specific accelerators
  - Deep memory hierarchies

Challenge: harness these devices productively
  - Improve performance, power, reliability, and security
The parallelism gap

Yawning divide between the capabilities of today's programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
The Stanford Pervasive Parallelism Laboratory

Goal: the parallel computing platform for 2012
  - Make parallel programming practical for the masses
  - Algorithms, programming models, runtimes, and architectures for scalable parallelism (10,000s of threads)
  - Parallel computing a core component of CS education

PPL is a combination of
  - Leading Stanford researchers across multiple domains: applications, languages, software systems, architecture
  - Leading companies in computer systems and software: Sun, AMD, Nvidia, IBM, Intel, HP
  - An exciting vision for pervasive parallelism

Open laboratory; all results in the open source
The PPL Team

  - Applications: Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
  - Programming & software systems: Alex Aiken, Pat Hanrahan, Mendel Rosenblum
  - Architecture: Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (Director)
The PPL Team

Research expertise
  - Applications: graphics, physics simulation, visualization, AI, robotics, …
  - Software systems: virtual machines, GPGPU, stream programming, transactional programming, speculative parallelization, optimizing compilers, bug detection, security, …
  - Architecture: multi-core & multithreading, scalable shared memory, transactional memory hardware, interconnection networks, low-power processors, stream processors, vector processors, …

Commercial success
  - MIPS & SGI, Rambus, VMware, Niagara processors, Renderman, Stream Processors Inc, Avici, Tableau, …
Guiding Principles

  - Top-down research
    - App & developer needs drive the system
    - High-level info flows to the low-level system
  - Scalability
    - In hardware resources (10,000s of threads)
    - In developer productivity (ease of use)
  - HW provides flexible primitives
    - Software synthesizes complete solutions
  - Build real, full-system prototypes
The PPL Vision

[Diagram: the PPL system stack, top to bottom]
  - Applications: Virtual Worlds, Autonomous Vehicle, Financial Services
  - Domain-specific languages: Rendering DSL, Physics DSL, Scripting DSL, Probabilistic DSL, Analytics DSL
  - Parallel Object Language
  - Common Parallel Runtime: Explicit / Static, Implicit / Dynamic
  - Hardware Architecture: OOO Cores, SIMD Cores, Threaded Cores; Scalable Interconnects, Partitionable Hierarchies, Scalable Coherence, Isolation & Atomicity, Pervasive Monitoring
Demanding Applications

[Diagram: PPL drawing on existing Stanford research centers and CS research groups]
  - Existing Stanford research centers: Media-X, DOE ASC, NIH NCBC (seismic modeling, geophysics, environmental science)
  - Existing Stanford CS research groups: AI/ML, Vision, Graphics, Games, Web & Mining, Streaming DB, Mobile, HCI

Leverage domain expertise at Stanford
  - CS research groups & national centers for scientific computing
  - From consumer apps to neuroinformatics
Virtual Worlds Application

Next-gen web platform
  - Immersive collaboration
  - Social gaming
  - Millions of players in a vast landscape

Challenges
  - Client-side game engine
  - Server-side world simulation
  - AI, physics, large-scale rendering
  - Dynamic content, huge datasets

More at http://vw.stanford.edu/
Autonomous Vehicle Application

Cars that drive autonomously in traffic
  - Save lives & money
  - Improve highway throughput
  - Improve productivity

Challenges
  - Client-side sensing, perception, planning, & control
  - Server-side data merging, pre-processing & post-processing, traffic control, model generation
  - Real-time, huge datasets

More at http://www.stanfordracing.org
Domain-Specific Languages (DSLs)

Leverage the success of DSLs across application domains
  - SQL (data manipulation), Matlab (scientific), Ruby/Rails (web), …

DSLs → higher productivity for developers
  - High-level data types & ops tailored to the domain (e.g., relations, triangles, matrices, …)
  - Express high-level intent without specific implementation artifacts
  - Programmer isolated from details of the specific system

DSLs → scalable parallelism for the system
  - Declarative description of parallelism & locality patterns (e.g., ops on relation elements, sub-array being processed, …)
  - Portable and scalable specification of parallelism
  - Automatically adjust data structures, mapping, and scheduling as systems scale up
DSL Research & Challenges

Goal: create the tools for DSL development

Initial DSL targets
  - Rendering, physics simulation, analytics, probabilistic computations

Challenges
  - DSL implementation → embed in a base PL
    - Start with Scala (OO, type-safe, functional, extensible)
    - Use Scala as a scripting DSL that ties multiple DSLs together
  - DSL-specific optimizations → telescoping compilers
    - Use domain knowledge to optimize & annotate code
  - Feedback to programmers → ?
  - …
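The embedding idea can be sketched in Scala, the deck's chosen base language. The mini-DSL below is purely illustrative (the names `Expr`, `Vec`, `Scale`, `Add`, and `simplify` are invented for this sketch, not PPL code); it shows how domain knowledge, here the algebra of scalings, enables an optimization that a generic host-language compiler would not apply on its own:

```scala
// Hypothetical mini-DSL for vector expressions, embedded in Scala.
sealed trait Expr
case class Vec(name: String) extends Expr           // a named vector
case class Scale(k: Double, e: Expr) extends Expr   // k * e
case class Add(a: Expr, b: Expr) extends Expr       // a + b

// Domain-specific optimization pass: fold nested scalings using
// the algebraic fact k * (j * x) = (k * j) * x, which the host
// compiler cannot assume for user-defined types.
def simplify(e: Expr): Expr = e match {
  case Scale(k, Scale(j, x)) => simplify(Scale(k * j, x))
  case Add(a, b)             => Add(simplify(a), simplify(b))
  case other                 => other
}

val expr = Add(Scale(2.0, Scale(3.0, Vec("x"))), Vec("y"))
val opt  = simplify(expr)   // Add(Scale(6.0, Vec("x")), Vec("y"))
```

Because the program is captured as data (an `Expr` tree) rather than executed directly, the same tree can later be lowered to parallel or accelerator code, which is the essence of the telescoping-compiler approach.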
Common Parallel Runtime (CPR)

Goals
  - Provide a common, portable, abstract target for all DSLs
    - Write once, run everywhere model
  - Achieve efficient execution (performance, power, …)
    - Manages parallelism & locality
    - Handles specifics of the HW system

Approach
  - Compile DSLs to a common IR
    - Base language + low-level constructs & pragmas (forall, async/join, atomic, barrier, …)
    - Per-object capabilities (read-only or write-only, output data, private, relaxed coherence, …)
  - Combine static compilation + dynamic management
    - Explicit management of regular tasks & predictable patterns
    - Implicit management of irregular parallelism
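The IR constructs named above can be approximated in software to make the model concrete. The sketch below is a minimal stand-in, not the actual CPR API: `forall` runs iterations on plain JVM threads, and `atomic` uses a single lock, whereas a real runtime would map both onto the hardware primitives discussed later:

```scala
// Minimal software stand-ins for two CPR-style constructs.
// forall: run n iterations concurrently, then join them all.
def forall(n: Int)(body: Int => Unit): Unit = {
  val threads = (0 until n).map(i => new Thread(() => body(i)))
  threads.foreach(_.start())
  threads.foreach(_.join())         // async/join semantics: wait for all
}

// atomic: here just a global lock; transactional memory or
// fine-grain synchronization would replace this in a real runtime.
val globalLock = new Object
def atomic[T](block: => T): T = globalLock.synchronized(block)

var sum = 0
forall(8) { i => atomic { sum += i } }   // accumulates 0 + 1 + ... + 7
```

Even this toy version shows why high-level information matters: if the DSL tells the runtime that the eight updates commute, it can choose privatization plus a reduction instead of serializing every update through one lock.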
CPR Research & Challenges

  - Integrating & balancing opposing approaches
    - Task-level & data-level parallelism
    - Static & dynamic concurrency management
    - Explicit & implicit memory management
  - Utilize high-level information from DSLs
    - The key to overcoming difficult challenges
  - Adapt to changes in application behavior, OS decisions, runtime constraints
  - Manage heterogeneous HW resources
  - Utilize novel HW primitives
    - To reduce the overhead of communication, synchronization, …
    - To understand runtime behavior on specific HW & adapt to it
Hardware Architecture @ 2012

The many-core chip
  - 100s of cores: OOO, threaded, & SIMD
  - Hierarchy of shared memories
  - Scalable, on-chip network

[Diagram: tiles of threaded cores (TC), OOO cores, and SIMD cores, each with private L1 caches, sharing L2 memories over the on-chip network, backed by L3 memory and a DRAM controller]

The system
  - A few many-core chips
  - Per-chip DRAM channels
  - Global address space
  - I/O

The data center
  - Cluster of systems
Architecture Challenges

  - Heterogeneity
  - Support for parallelism & locality management
    - Synchronization, communication, …
    - Explicit vs. implicit locality management
    - Runtime monitoring
  - Scalability
    - Balance of resources, granularity of parallelism
    - On-chip/off-chip bandwidth & latency
    - Scalability of key abstractions (e.g., coherence)
  - Beyond performance
    - Power, fault tolerance, QoS, security, virtualization
Architecture Research

Revisit architecture & micro-architecture for parallelism
  - Define semantics & implementation of key primitives: communication, atomicity, isolation, partitioning, coherence, consistency, checkpoint
  - Fine-grain & bulk support

Software synthesizes primitives into execution systems
  - Streaming system: partitioning + bulk communication
  - Thread-level speculation: isolation + fine-grain communication
  - Transactional memory: atomicity + isolation + consistency
  - Security: partitioning + isolation
  - Fault tolerance: isolation + checkpoint + bulk communication

Challenges: interactions, scalability, cost, virtualization
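The transactional-memory composition (atomicity + isolation + consistency) can be illustrated by its software analogue on a single word: an optimistic speculate, validate, retry loop. The sketch below uses `AtomicInteger.compareAndSet` from the JVM; hardware TM generalizes the same shape to whole read/write sets tracked in the cache, so take this only as a shape, not a TM implementation:

```scala
import java.util.concurrent.atomic.AtomicInteger

val balance = new AtomicInteger(100)

// Optimistic update: read speculatively, compute privately, commit
// only if no other thread changed the value in between; else retry.
def deposit(delta: Int): Unit = {
  var done = false
  while (!done) {
    val seen = balance.get()                   // speculative read
    val next = seen + delta                    // private computation
    done = balance.compareAndSet(seen, next)   // commit iff unchanged
  }
}

val workers = (1 to 4).map(_ => new Thread(() => deposit(25)))
workers.foreach(_.start())
workers.foreach(_.join())
// balance ends at 100 + 4 * 25 regardless of interleaving
```

The appeal of doing this in hardware is that conflict detection and rollback come for free from the coherence machinery, instead of being re-coded per data structure as above.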
Architecture Research

  - Software-managed HW primitives
    - Exploit high-level knowledge from DSLs & the CPR
    - E.g., scale coherence using coarse-grain techniques
      - Coarse-grain in time: force coherence only when needed
      - Coarse-grain in space: object-based, selective coherence
  - Support for programmability & management
    - Fine-grain monitoring, HW-assisted invariants
    - Build upon primitives for concurrency
    - Efficient interface to the CPR
  - Scalable on-chip & off-chip interconnects
    - High-radix networks
    - Adaptive routing
Research Methodology

Conventional approaches are still useful
  - Develop apps & SW systems on existing platforms
  - Simulate novel HW mechanisms

Need a method that bridges HW & SW research
  - Makes new HW features available for SW research
  - Does not compromise HW speed, SW features, or scale
  - Allows for full-system prototypes (multi-core, accelerators, clusters, …)
  - Needed for research, convincing for industry, exciting for students

Approach: commodity chips + FPGAs in the memory system
  - Commodity chips: fast system with a rich SW environment
  - FPGAs: prototyping platform for new HW features
  - Scale through a cluster arrangement
FARM: Flexible Architecture Research Machine

[Slide sequence of diagrams building up the FARM prototype:]
  - Baseline: multiple quad-core commodity chips, each with its own memory
  - An FPGA with SRAM and I/O added into the coherent memory system
  - A GPU/stream accelerator added alongside the FPGA
  - Scaled out to a cluster over an Infiniband or PCIe interconnect
Example FARM Uses

Software research
  - SW development for heterogeneous systems
    - Code generation & resource management
  - Scheduling systems for large-scale parallelism
    - Thread state management & adaptive control

Hardware research
  - Scalable streaming & transactional HW
    - FPGA extends protocols throughout the cluster
  - Scalable shared memory
    - FPGA provides coarse-grain tracking
  - Hybrid memory systems
  - Custom processors & accelerators
  - HW support for monitoring, scheduling, isolation, virtualization, …
Conclusions

PPL: a full-system vision for pervasive parallelism
  - Applications, programming models, software systems, and hardware architecture

Key initial ideas
  - Domain-specific languages
  - Combine implicit & explicit management
  - Flexible HW features