Transcript slides
Efficient HPC Data Motion via Scratchpad Memory
Kayla Seager, Ananta Tiwari, Michael Laurenzano, Joshua Peraza, Pietro Cicotti, Laura Carrington
PMaC
Performance Modeling and Characterization
Question 1
Do HPC workloads benefit from
software managed Scratchpads?
YES!
If so, how will we manage it?
Outline
Motivation
Scratchpad Background
Simulation Framework and Methodology
Initial Study
Current Direction
Problem: HPC Power Wall
Can't scale old systems
– Power wall already reached by petaflop systems
– Must redesign for power savings
Efficiency must increase by 2x
Source: Exascale Report (Kogge, 2008)
How to get Energy Savings
1. Redesign Hardware
– Simpler hardware
– Transfer complexity to software
2. Minimize expensive data movement
– Memory slower
– More cores=more contention
– HPC codes have large working set sizes
Outline
Motivation
Scratchpad Background
Simulation Framework and Methodology
Initial Study
Current Direction
What is a Scratchpad?
[Figure: cache (tagging array, memory array, decoder) vs. scratchpad (memory array, decoder)]
Scratchpad (SPM)
– Local memory (like a cache)
– SPM: software allocated memory
Simpler Hardware
Scratchpad Allocation
Dynamic
– Move block of code
– Iterate over code
– Move another block
Static: Move block of code once
Strategies
– Knapsack
– Graph Coloring (as in the register allocation problem)
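The knapsack strategy for static allocation can be sketched as below. This is a minimal illustration and not the allocator used in the study: the block names, sizes, and access counts are invented, and the "benefit" of a block is assumed to be the number of accesses it would serve from the SPM.

```python
def knapsack_allocate(blocks, capacity):
    """Pick the blocks to place in the SPM once (static allocation).

    blocks:   list of (name, size_bytes, benefit) tuples (hypothetical data)
    capacity: SPM size in bytes
    Returns (total_benefit, sorted list of chosen block names).
    """
    # best[c] = (benefit, chosen names) achievable using at most c bytes
    best = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    total, chosen = best[capacity]
    return total, sorted(chosen)


# Invented example: four blocks competing for a 64-byte SPM.
blocks = [("A", 32, 900), ("B", 48, 1000), ("C", 16, 400), ("D", 64, 1100)]
total, chosen = knapsack_allocate(blocks, capacity=64)
print(total, chosen)
```

Dynamic allocation would repeat a decision like this at run time, for example once per loop, instead of once for the whole program.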
The Idea: Less Data Movement
Scratchpad saves energy
– Allocation burden now on software
Less complexity on hardware
Move only what you use
– Uses temporal locality
Cache
– Spatial locality can fail: Superfluous data movement
(Spatial locality is built into cache design – note the 8-word line size in most architectures)
[Figure: blocks A, B, C, D, E moved into cache]
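The superfluous data movement caused by line-granularity transfers can be illustrated with a small count. This sketch assumes 8-byte words, 64-byte lines, and a cold-miss-only model in which every distinct line (or word) is moved exactly once; the access patterns are invented.

```python
LINE_BYTES = 64  # 8 words of 8 bytes each
WORD_BYTES = 8   # assumed word size


def bytes_moved_by_lines(addresses, line_bytes=LINE_BYTES):
    """Bytes fetched when every distinct touched line is moved once."""
    lines = {addr // line_bytes for addr in addresses}
    return len(lines) * line_bytes


def bytes_moved_by_words(addresses, word_bytes=WORD_BYTES):
    """Bytes fetched when only the distinct words actually used are moved."""
    words = {addr // word_bytes for addr in addresses}
    return len(words) * word_bytes


# Sparse pattern: one word used out of every four lines.
sparse = [i * 256 for i in range(100)]
# Dense pattern: 100 consecutive words.
dense = [i * 8 for i in range(100)]

print(bytes_moved_by_lines(sparse), bytes_moved_by_words(sparse))
print(bytes_moved_by_lines(dense), bytes_moved_by_words(dense))
```

For the sparse pattern, line transfers move eight times as many bytes as are used; for the dense pattern the two counts are nearly equal, which is the case where spatial locality pays off.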
Implication of Scratchpads
Current use: Embedded Systems
– Smaller working set size
– Predictable code
GPUs
– Coding overhead
Issue: HPC codes
– Large unpredictable codes
– How to generalize codes?
– How to make it practical and efficient?
Outline
Motivation
Scratchpad Background
Simulation Framework and Methodology
Initial Study
Current Direction
Question 2
Are there computation patterns
which get the most benefit from
SPM?
Why idioms?
– Pattern of computation/memory access
– Characterize application data movement (HPC code)
– Metric to compare different scientific codes (good coverage)
– Easy to port
The Methodology
1. Idiom characterization study: determine whether each idiom favors SPM or cache
2. Find idioms in HPC codes
3. Port SPM-favorable idioms in HPC codes to scratchpad
Tool: PEBIL
Binary instrumentation tool
– Executable Binary => Identify Basic Blocks => Cache Simulation
Cache Simulator built on top of PEBIL
– User Defined Cache Structures
– Profiles executables (hit/miss)
PEBIL Output
Block 1 {#hits} {#misses}
Block 2 {#hits} {#misses}
…
[Figure: Executable Binary -> Stage 1 -> basic blocks (Block1, Block2) -> Stage 2 -> Cache]
Simulation Environment
Configuration | Cache Size (KB) | Cache Assoc. | Cache Line Size (Bytes) | SPM Size (KB) | SPM Assoc. | SPM Line Size (Bytes)
Cache | 64 | 8 | 64 | - | - | -
Scratchpad | - | - | - | 64 | Full | 8
Hybrid | 32 | 8 | 64 | 32 | Full | 8
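The cache and scratchpad configurations above can be modeled with a small LRU simulator. This is a hedged sketch and not PEBIL's simulator: the class and function names are hypothetical, the scratchpad is modeled as a fully associative structure with 8-byte lines as in the table, and the strided trace is invented.

```python
from collections import OrderedDict


class LRUCache:
    """Set-associative cache model with LRU replacement (hypothetical sketch)."""

    def __init__(self, size_bytes, assoc, line_bytes):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (assoc * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        entries = self.sets[line % self.num_sets]
        if line in entries:
            entries.move_to_end(line)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(entries) >= self.assoc:
                entries.popitem(last=False)  # evict least recently used line
            entries[line] = True

    def data_moved(self):
        """Bytes moved = misses * line size."""
        return self.misses * self.line_bytes


def fully_associative(size_bytes, line_bytes):
    """Fully associative = one set that holds every line."""
    return LRUCache(size_bytes, size_bytes // line_bytes, line_bytes)


cache = LRUCache(64 * 1024, assoc=8, line_bytes=64)  # "Cache" row
spm = fully_associative(64 * 1024, line_bytes=8)     # "Scratchpad" row
for i in range(1000):
    addr = i * 256  # invented strided trace: one word per access
    cache.access(addr)
    spm.access(addr)
print(cache.data_moved(), spm.data_moved())
```

On this invented trace every access misses in both structures, so the difference in data moved comes entirely from the line size (64 bytes versus 8 bytes).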
Cache/SPM only
[Figure: Executable Binary -> Stage 1 -> basic blocks (Block1, Block2) -> Stage 2 -> Cache (Block1, Block2) or SPM (Block1, Block2)]
Hybrid System
[Figure: Executable Binary -> Stage 1 -> basic blocks (Block1, Block2) -> Stage 2 -> Hybrid (SPM and Cache) with Block1, Block2]
Tool: PIR (find Idioms in HPC)
Used for: automatically identifying idioms in large-scale HPC applications
Input: Idioms.txt
– Idioms are defined using a pattern language
Output:
– Idioms matched to source line number
– Example: Loop1 -> Gather, Loop2 -> Transpose
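For readers unfamiliar with the idiom names, the loops below show the conventional form of gather, scatter, and transpose. They are illustrative only: PIR's pattern language is not shown in the slides, and these Python functions are not its input format.

```python
def gather(src, index):
    """Gather idiom: dst[i] = src[index[i]] (indirect, data-dependent reads)."""
    dst = [None] * len(index)
    for i in range(len(index)):
        dst[i] = src[index[i]]
    return dst


def scatter(values, index, size):
    """Scatter idiom: dst[index[i]] = values[i] (indirect writes)."""
    dst = [0] * size
    for i in range(len(index)):
        dst[index[i]] = values[i]
    return dst


def transpose(matrix):
    """Transpose idiom: dst[c][r] = src[r][c] (one side is always strided)."""
    rows, cols = len(matrix), len(matrix[0])
    dst = [[None] * rows for _ in range(cols)]
    for r in range(rows):
        for c in range(cols):
            dst[c][r] = matrix[r][c]
    return dst


print(gather([10, 20, 30, 40], [3, 0, 2]))
print(scatter([7, 8], [2, 0], size=4))
print(transpose([[1, 2, 3], [4, 5, 6]]))
```

In a gather, the addresses read depend on the contents of the index array, so neighboring words in a fetched line are often never used.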
Outline
Motivation
Scratchpad Background
Simulation Framework and Methodology
Initial Study
Current Direction
Under the hood: HPC Results
Fundamental question: Is there a benefit of SPM
for HPC codes?
– Simulate full apps on cache and SPM
– Use simple heuristic to define the mappings
– Simulate on hybrid
Pitfalls:
– Sometimes SPM moves more data than cache: LRU
Metrics
Data Moved = (Cache Misses) * (Cache Line Size)
Data Movement Ratio = (SPM Data Movement) / (Cache Data Movement)
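A worked instance of the two metrics, as a sketch. The miss counts are invented for illustration and are not results from the study; the line sizes match the simulation environment (8 bytes for SPM, 64 bytes for cache).

```python
def data_moved(misses, line_bytes):
    """Data Moved = misses * line size, in bytes."""
    return misses * line_bytes


def data_movement_ratio(spm_misses, spm_line_bytes, cache_misses, cache_line_bytes):
    """SPM data movement divided by cache data movement."""
    spm_bytes = data_moved(spm_misses, spm_line_bytes)
    cache_bytes = data_moved(cache_misses, cache_line_bytes)
    return spm_bytes / cache_bytes


# Invented counts: the SPM misses 4x as often but moves 8x less per miss.
ratio = data_movement_ratio(
    spm_misses=4000, spm_line_bytes=8,
    cache_misses=1000, cache_line_bytes=64,
)
print(ratio)  # 32000 / 64000
```

A ratio below 1 means the SPM configuration moved less data than the cache; a ratio above 1 corresponds to the pitfall where the SPM moves more.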
HPC Applications
Graph500
– Construct and traverse weighted undirected graph
HYCOM
– Ocean model: hybrid isopycnal-sigma-pressure generalized coordinate
SMG2000
– Parallel semicoarsening multigrid solver
Sequoia Benchmarks
– SPHOT: Monte Carlo photon transport code
– UMT: Unstructured-mesh deterministic radiation transport code
– AMG2006: Algebraic multigrid linear system solver for unstructured mesh
HPC Results
Question 1
Do HPC workloads benefit from
software managed Scratchpads?
YES!
Idiom Gather/Scatter
Using Methodology for HYCOM
1. Gather Idiom: Prefers SPM
2. Find gather in HYCOM: 33 instances
3. Port Idiom Blocks: Hybrid Structure
– Port Gather Basic Blocks to SPM
– Rest on Cache
Result: HYCOM (Ocean Modeling Code)
Savings: 20% in data motion
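The hybrid port in step 3 can be sketched as a router that sends accesses from selected basic blocks to the SPM and everything else to the cache. This is a simplified illustration under stated assumptions: a cold-miss-only model where each distinct line is moved once, invented block names and addresses, and the 64-byte and 8-byte line sizes from the simulation environment. It does not reproduce the HYCOM result.

```python
def hybrid_data_moved(trace, spm_blocks, cache_line=64, spm_line=8):
    """Bytes moved when accesses from spm_blocks go to SPM, the rest to cache.

    trace:      list of (basic_block_id, address) pairs (hypothetical format)
    spm_blocks: set of basic block ids ported to the scratchpad
    """
    cache_lines, spm_lines = set(), set()
    for block, addr in trace:
        if block in spm_blocks:
            spm_lines.add(addr // spm_line)
        else:
            cache_lines.add(addr // cache_line)
    return len(cache_lines) * cache_line + len(spm_lines) * spm_line


# Invented trace: a sparse gather block and a dense streaming block.
trace = [("gather", i * 256) for i in range(50)]
trace += [("stream", 65536 + i * 8) for i in range(64)]

all_cache = hybrid_data_moved(trace, spm_blocks=set())
hybrid = hybrid_data_moved(trace, spm_blocks={"gather"})
print(all_cache, hybrid)
```

Only the gather block is moved to the SPM here; the dense block stays on the cache, where full-line transfers are not wasted.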
Outline
Motivation
Scratchpad Background
Simulation Framework and Methodology
Initial Study
Current Direction
Real SPM for PEBIL?
Extension of PEBIL Simulator
– Fully associative cache
Rethink replacement policy
Dynamic Allocation Scheme
– Idioms determine loops for allocation
– Reuse distance library
Track how often used
Track distance of use
[Figure: access sequence A, B, C, A; reuse distance of A = 2]
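Reuse distance as drawn in the figure (two distinct addresses, B and C, between the two uses of A) can be computed as below. This is a minimal sketch, not the reuse distance library mentioned on the slide; the function name is hypothetical and the quadratic scan is chosen for clarity over speed.

```python
from collections import Counter


def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (None on first use)."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


trace = ["A", "B", "C", "A"]
distances = reuse_distances(trace)  # distance of the second A is 2
use_counts = Counter(trace)         # "track how often used"
print(distances, use_counts["A"])
```

A block with a high use count and a short reuse distance is a natural candidate to keep in the SPM, while one with a long reuse distance is a candidate for replacement.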
Results Summary
SPM
– Simpler Hardware
– Efficient Data Movement
Developed Methodology for SPM
– Idiom characterization
– Idiom identification in HPC codes
– Port SPM hotspots
– 20% Data Movement Savings for HYCOM
Scratchpad shows potential
– Good when spatial locality fails
– HPC applications
– SPM only: Average 22% Data Movement Saved
– Hybrid: Average 39% Max 69% Data Movement Saved
– 4x Improvement for Gather idiom
– Current work on creating SPM for PEBIL
Acknowledgements
PMaC team
– Laura Carrington
– Ananta Tiwari
– Michael Laurenzano
– Pietro Cicotti
– Mitesh Meswani
Dedicated to: Allan Snavely
EXTRA
Idioms: Strided Access
Looking Forward
Idiom Driven Allocation
– PIR determines loops for allocation
Pre-Allocated array for SPM
– Pointers to loops: trigger replacement
Mimic Dynamic Compiler Replacement Policy