Transcript slides

Efficient HPC Data Motion via Scratchpad Memory
Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
PMaC
Performance Modeling and Characterization
Question 1
Do HPC workloads benefit from
software managed Scratchpads?
YES!
If so, how will we manage it?
Outline
• Motivation
• Scratchpad Background
• Simulation Framework and Methodology
• Initial Study
• Current Direction
Problem: HPC Powerwall
• Can't scale old systems
– Powerwall already reached by petaflop systems
– Must redesign for power savings
• Efficiency must increase by 2x
Source: Exascale Report (Kogge, 2008)
How to get Energy Savings
1. Redesign Hardware
– Simpler hardware
– Transfer complexity to software
2. Minimize expensive data movement
– Memory is slower than the cores
– More cores = more contention
– HPC codes have large working set sizes
What is a Scratchpad?
[Figure: cache (tagging array, memory array, decoders) VS scratchpad (memory array)]
• Scratchpad (SPM)
– Local memory (like a cache)
– SPM: software-allocated memory
• Simpler hardware
Scratchpad Allocation
• Dynamic
– Move block of code
– Iterate over code
– Move another block
• Static: move block of code once
• Strategies
– Knapsack
– Graph coloring (the register allocation problem)
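The knapsack strategy for static allocation can be sketched as a 0/1 knapsack: each candidate block has a size and an estimated benefit, and the scratchpad capacity is the weight limit. This is a minimal illustration, not the allocator used in the study; the block names, sizes, and benefit values are hypothetical.

```python
def knapsack_allocate(blocks, capacity):
    """Pick the set of blocks that maximizes benefit within the SPM capacity.

    blocks: list of (name, size, benefit) tuples (hypothetical units).
    Returns (total_benefit, tuple_of_chosen_names).
    """
    # best[c] holds the best (benefit, chosen) using at most c units of space.
    best = [(0, ())] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    return best[capacity]


# Hypothetical candidates: (name, size in KB, estimated accesses saved).
blocks = [("A", 4, 40), ("B", 3, 50), ("C", 5, 60), ("D", 2, 15)]
result = knapsack_allocate(blocks, capacity=8)
print(result)  # (110, ('B', 'C'))
```

A static scheme would run this once before execution; a dynamic scheme would have to repeat a decision like this each time a new block is moved in.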
The Idea: Less Data Movement
• Scratchpad saves energy
– Allocation burden now on software
• Less complexity in hardware
• Move only what you use
– Uses temporal locality
• Cache
– Spatial locality can fail: superfluous data movement
(Spatial locality is built into cache design; note the 8-word line size in most architectures)
[Figure: words A, B, C, D, E, all moved into cache]
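A small sketch of why spatial locality can fail: with a strided access pattern, only one word per cache line is used, yet the whole line is moved. The example assumes 8-byte words, counts only the first fetch of each line (no capacity or conflict effects), and uses an illustrative stride; none of these numbers come from the study.

```python
def bytes_moved(addresses, line_size):
    """Bytes fetched if every distinct line touched is moved exactly once."""
    lines = {addr // line_size for addr in addresses}
    return len(lines) * line_size


WORD = 8  # bytes per word (assumption)

# Stride of 8 words: each access lands in a different 64-byte line.
addrs = [i * 8 * WORD for i in range(1000)]

cache_bytes = bytes_moved(addrs, line_size=64)  # whole 8-word lines are moved
spm_bytes = bytes_moved(addrs, line_size=8)     # only the words actually used

print(cache_bytes, spm_bytes)  # 64000 8000
```

Under these assumptions the word-granularity transfer moves one eighth of the data, which is the kind of superfluous movement the slide refers to.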
Implication of Scratchpads
• Current use: embedded systems
– Smaller working set size
– Predictable code
• GPUs
– Coding overhead
• Issue: HPC codes
– Large, unpredictable codes
– How to generalize codes?
– How to make it practical and efficient?
Question 2
Are there computation patterns
which get the most benefit from
SPM?
Why idioms?
• Pattern of computation/memory access
• Characterize application data movement
• Metric to compare different scientific codes (good coverage)
• Easy to port
[Figure: HPC code]
The Methodology
1. Idiom characterization study: does each idiom favor SPM or cache?
2. Find idioms in HPC codes
3. Port SPM-favorable idioms in HPC codes to scratchpad
Tool: PEBIL
• Binary instrumentation tool
– Executable Binary => Identify Basic Blocks => Cache Simulation
• Cache Simulator built on top of PEBIL
– User Defined Cache Structures
– Profiles executables (hit/miss)
[Figure: Executable Binary => Stage 1 (Block1, Block2; e.g. "A op B", "A=b+3") => Stage 2 (Cache)]
PEBIL Output
Block 1 {#hits} {#misses}
Block 2 {#hits} {#misses}
…….
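The per-block hit/miss profile can be illustrated with a toy simulator. This is not PEBIL's interface: the class, the function, and the trace format of (block name, address) pairs are hypothetical, and LRU replacement is an assumption. Leaving `assoc` unset gives a fully associative structure.

```python
from collections import OrderedDict, defaultdict


class LRUCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, size_bytes, line_size, assoc=None):
        self.line_size = line_size
        num_lines = size_bytes // line_size
        # assoc=None means fully associative: a single set holding every line.
        self.assoc = assoc if assoc is not None else num_lines
        self.num_sets = num_lines // self.assoc
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_size
        lines_in_set = self.sets[line % self.num_sets]
        if line in lines_in_set:
            lines_in_set.move_to_end(line)  # mark as most recently used
            return True
        if len(lines_in_set) >= self.assoc:
            lines_in_set.popitem(last=False)  # evict least recently used
        lines_in_set[line] = True
        return False


def profile(trace, cache):
    """Count [hits, misses] per basic block for a trace of (block, address)."""
    stats = defaultdict(lambda: [0, 0])
    for block, addr in trace:
        hit = cache.access(addr)
        stats[block][0 if hit else 1] += 1
    return dict(stats)


cache = LRUCache(size_bytes=64 * 1024, line_size=64, assoc=8)
trace = [("Block1", a) for a in range(0, 512, 8)]
trace += [("Block2", a) for a in range(0, 512, 8)]
print(profile(trace, cache))  # {'Block1': [56, 8], 'Block2': [64, 0]}
```

Block1 misses once per 64-byte line (8 lines) and Block2 reuses the same lines, so it hits every time.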
Simulation Environment
Configuration  Cache Size (KB)  Cache Assoc.  Cache Line Size (Bytes)  SPM Size (KB)  SPM Assoc.  SPM Line Size (Bytes)
Cache          64               8             64                       -              -           -
Scratchpad     -                -             -                        64             Full        8
Hybrid         32               8             64                       32             Full        8
Cache/SPM only
[Figure: Executable Binary => Stage 1 (Block1, Block2) => Stage 2, with the blocks simulated on the Cache and on the SPM]
Hybrid System
[Figure: Executable Binary => Stage 1 (Block1, Block2) => Stage 2 (Hybrid), with blocks sent to either the SPM or the Cache]
Tool: PIR (find Idioms in HPC)
• Used for: automatically identifying idioms in large-scale HPC applications
• Input: Idioms.txt
– Idioms are defined using a pattern language
• Output:
– Idioms matched to source line number
[Figure: example output, with Loop1 matched to Gather and Loop2 matched to Transpose]
Under the hood: HPC Results
• Fundamental question: is there a benefit of SPM for HPC codes?
– Simulate full apps on cache and SPM
– Use a simple heuristic to define the mappings
– Simulate on hybrid
• Pitfalls:
– Sometimes SPM moves more data than cache: LRU
Metrics
Data Moved = (Cache Misses) * (Cache Line Size)
Data Movement Ratio = (SPM Data Movement) / (Cache Data Movement)
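The two metrics can be written as a short helper. This sketch assumes SPM data movement is computed the same way as cache data movement, using SPM misses and the SPM line size; the miss counts below are made-up numbers for illustration, not results from the study.

```python
def data_moved(misses, line_size_bytes):
    """Data Moved = misses * line size, in bytes."""
    return misses * line_size_bytes


def data_movement_ratio(spm_misses, spm_line, cache_misses, cache_line):
    """SPM data movement divided by cache data movement (below 1 favors SPM)."""
    return data_moved(spm_misses, spm_line) / data_moved(cache_misses, cache_line)


# Hypothetical counts: the SPM misses more often but moves 8 bytes per miss,
# while the cache moves 64 bytes per miss.
ratio = data_movement_ratio(spm_misses=4000, spm_line=8,
                            cache_misses=1000, cache_line=64)
print(ratio)  # 0.5
```

A ratio above 1 would correspond to the pitfall noted earlier, where the SPM moves more data than the cache.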
HPC Applications
• Graph500
– Construct and traverse weighted undirected graph
• HYCOM
– Ocean model: hybrid isopycnal-sigma-pressure, generalized coordinate
• SMG2000
– Parallel semicoarsening multigrid solver
• Sequoia Benchmarks
– SPHOT: Monte Carlo photon transport code
– UMT: unstructured-mesh deterministic radiation transport code
– AMG2006: algebraic multigrid linear system solver for unstructured meshes
HPC Results
Question 1
Do HPC workloads benefit from
software managed Scratchpads?
YES!
Idiom Gather/Scatter
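For readers unfamiliar with the idiom names, a gather reads through an index array and a scatter writes through one. The sketch below shows the access patterns only, using the common definitions of the two idioms; the function names and data are illustrative and are not taken from PIR's pattern language.

```python
def gather(src, idx):
    """Gather: out[i] = src[idx[i]] (indirect reads)."""
    return [src[i] for i in idx]


def scatter(values, idx, size, fill=0):
    """Scatter: out[idx[i]] = values[i] (indirect writes)."""
    out = [fill] * size
    for value, i in zip(values, idx):
        out[i] = value
    return out


src = [10, 20, 30, 40]
idx = [3, 0, 2]
gathered = gather(src, idx)
print(gathered)                         # [40, 10, 30]
print(scatter(gathered, idx, size=4))   # [10, 0, 30, 40]
```

Because the indices can point anywhere, consecutive accesses need not fall in the same cache line, which is where line-granularity transfers tend to move unused data.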
Using Methodology for HYCOM
1. Gather Idiom: Prefers SPM
2. Find gather in HYCOM: 33 instances
3. Port Idiom Blocks: Hybrid Structure
– Port Gather Basic Blocks to SPM
– Rest on Cache
Result HYCOM (Ocean Modeling Code)
Savings: 20% in data motion
Real SPM for PEBIL?
• Extension of PEBIL Simulator
– Fully associative cache
• Rethink replacement policy
• Dynamic Allocation Scheme
– Idioms determine loops for allocation
– Reuse distance library
• Track how often used
• Track distance of use
[Figure: access sequence A, B, C, A; Reuse Distance = 2]
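Reuse distance, as in the A, B, C, A example, counts the distinct other addresses touched between two uses of the same address. The sketch below is a simple list-based version for illustration; it is not the reuse distance library mentioned on the slide, and the function name is hypothetical.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access (None for a first use).

    The distance is the number of distinct other addresses accessed since
    the previous access to the same address.
    """
    stack = []      # addresses ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above `pos` was touched since the last use of addr.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


print(reuse_distances(["A", "B", "C", "A"]))  # [None, None, None, 2]
```

Counting how many times each address appears in the same trace would give the "how often used" figure that the slide lists alongside distance.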
Results Summary
• SPM
– Simpler Hardware
– Efficient Data Movement
• Developed Methodology for SPM
– Idiom characterization
– Idiom identification in HPC codes
– Port SPM hotspots
– 20% Data Movement Savings for HYCOM
• Scratchpad shows potential
– Good when spatial locality fails
– HPC applications
– SPM only: Average 22% Data Movement Saved
– Hybrid: Average 39%, Max 69% Data Movement Saved
– 4x Improvement for Gather idiom
– Current work on creating SPM for PEBIL
Acknowledgements
• PMaC team
– Laura Carrington
– Ananta Tiwari
– Michael Laurenzano
– Pietro Cicotti
– Mitesh Meswani
• Dedicated to: Allan Snavely
EXTRA
Idioms: Strided Access
Looking Forward
• Idiom-Driven Allocation
– PIR determines loops for allocation
• Pre-Allocated array for SPM
– Pointers to loops: trigger replacement
• Mimic Dynamic Compiler Replacement Policy