Pin Presentations - Aamer Jaleel

Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP:
A Case Study of Parallel Bioinformatics Workloads
Aamer Jaleel (Intel, VSSAD; University of MD)
Matthew Mattina (Tilera Corporation)
Bruce Jacob (University of MD, ECE Department)
Paper Motivation
• Growth of CMPs and Design Issues
[Figure: CMP diagram with CPU and Cache blocks]
• Growth of Data and Emergence of New Workloads:
[Figure: the world's data is increasing: database, finance, medicine, spatial, stock]
Recognition, Mining, and Synthesis (RMS) Workloads
Paper Contributions
• First to characterize memory behavior of parallel
data-mining workloads on a CMP
– Bioinformatics workloads
• Sharing Analysis:
– Varying amount of data shared between threads
– Shared data frequently accessed
– Degree of sharing is f(cache size)
• Cache Performance Studies:
– Private vs shared cache studies
– Greater sharing → better shared cache performance
Bioinformatics
• Using software to understand
and analyze biological data
• Why bioinformatics?
– Sophisticated algorithms and
huge data sets
• Use mathematical and
statistical methods to solve
biological problems
– Sequence analysis
– Protein structure prediction
– Gene classification
– And many, many more…
Src: http://www.imb-jena.de/~rake/Bioinformatics_WEB
Parallel Bioinformatics Workloads
• Structure Learning:
– GeneNet – Hill Climbing, Bayesian network learning
– SNP – Hill Climbing, Bayesian network learning
– SEMPHY – Structural Expectation Maximization algorithm
• Optimization:
– PLSA – Dynamic Programming
• Recognition:
– SVM-RFE – Feature Selection
• OpenMP workloads developed by Intel Corporation
– Donated to Northwestern University, NU-MineBench Suite
– http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html
– Also made available at: http://www.ece.umd.edu/biobench/
Experimental Methodology - Pin
• Pin – x86 Dynamic Binary
Instrumentation Tool
– Developed at VSSAD, Intel
– ATOM-like tool for Intel XScale, IA-32,
IPF Linux binaries
– Provides infrastructure for writing
program analysis tools (Pintools)
– Supports instrumentation of multithreaded workloads
– Hosted at:
http://rogue.colorado.edu/Pin
The simCMPcache Pin tool
• Instruments all memory references of an application
• Gathers numerous cache performance statistics
• Captures time varying behavior of applications
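What a tool of this kind does with each instrumented reference can be sketched roughly as follows. This is not the simCMPcache source and does not use the Pin API; it assumes a single set-associative cache with LRU replacement, a hypothetical trace given as a list of byte addresses, and logging per fixed number of references (the slides log per instruction count instead). The names `LRUCache` and `miss_rate_per_interval` are made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One OrderedDict per set; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one byte address; return True on a hit."""
        line = address // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # mark as most recently used
            self.hits += 1
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict least recently used
        cache_set[line] = True
        self.misses += 1
        return False


def miss_rate_per_interval(trace, cache, interval):
    """Return the miss rate of each full window of `interval` references."""
    rates, misses, count = [], 0, 0
    for address in trace:
        if not cache.access(address):
            misses += 1
        count += 1
        if count == interval:
            rates.append(misses / count)
            misses, count = 0, 0
    return rates


# Hypothetical trace: a 4-line working set swept twice.
trace = [i * 64 for i in range(4)] * 2
tiny = LRUCache(size_bytes=256, ways=4)      # 256 B, 4-way, 64 B lines: 1 set
print(miss_rate_per_interval(trace, tiny, interval=4))  # -> [1.0, 0.0]
```

The first window is all cold misses and the second is all hits, which is the kind of time-varying behavior a per-interval log exposes.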
Experimental Methodology
Simulation Platform: 4/8-way SMP system of Pentium 4 processors with hyper-threading (can simulate 1 to 16 core CMPs)
Data Logging: Every 10 million instructions per thread
Instruction Cache: Not Modeled
L1 Data Cache: 32 KB, 4-way, 64B lines, write-through, inclusive, private cache
L2 Data Cache: 256 KB, 8-way, 64B lines, write-back, inclusive, private cache
LLC/L3 Data Cache: 4/8/16/32/64 MB, 16-way, 64B lines, LRU, write-back, private/shared cache
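As a quick check on the cache configurations above, the number of sets in each cache follows from size / (ways × line size). The sketch below assumes 1 KB = 1024 bytes and 1 MB = 1024 KB; the helper name `num_sets` is made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


print("L1 :", num_sets(32 * KB, ways=4))     # 128 sets
print("L2 :", num_sets(256 * KB, ways=8))    # 512 sets
for size_mb in (4, 8, 16, 32, 64):
    # 16-way LLC: 4096 sets at 4 MB, doubling with each size step
    print(f"LLC {size_mb:>2} MB:", num_sets(size_mb * MB, ways=16))
```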
Measuring Data Sharing
• Shared Cache Line:
– More than one core accesses the same cache line during its lifetime in the cache
• Shared Access:
– Access to a shared cache line
• Active-Shared Access:
– Access to a shared cache line where the last core to access it ≠ the current core
– Ex: in the sequence below, each access whose core id differs from the previous one is active-shared (shown in red on the slide)
Core IDs: …1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2…
[Figure: cores C0, C1, C2, C3 around a shared cache]
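The definitions above can be applied to the example core-id sequence with a few lines of code. This is only a sketch: it looks at a single cache line's lifetime, and it assumes the first access in the list is not active-shared because its predecessor is elided by the leading "…". The function name `classify_line_accesses` is made up for illustration.

```python
def classify_line_accesses(core_ids):
    """Classify the accesses made to one cache line during its lifetime.

    Returns (is_shared, active_flags):
      is_shared    -- True if more than one distinct core touched the line
      active_flags -- per access, True if the previous access came from a
                      different core (an active-shared access)
    """
    is_shared = len(set(core_ids)) > 1
    active_flags = [
        i > 0 and core_ids[i] != core_ids[i - 1]
        for i in range(len(core_ids))
    ]
    return is_shared, active_flags


# Core-id sequence from the slide (leading and trailing "…" dropped).
sequence = [1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2]
is_shared, active = classify_line_accesses(sequence)
print(is_shared)      # True: cores 1, 2, 3, and 4 all touch the line
print(sum(active))    # 6 active-shared accesses
```

Repeated accesses by the same core (the runs of 2s) count as shared accesses but not as active-shared ones.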
Data Sharing Behavior
[Figure: "How Much Shared?" and "Access Frequency" panels (axis 0 to 100) for PLSA, GeneNet, SEMPHY, SNP, and SVM, 4-threaded run, LLC sizes 4MB, 8MB, 16MB, 32MB, 64MB; legend: Cache Miss; 1, 2, 3, 4 Core; 1, 2, 3, 4 Thread]
• Sharing is dependent on algorithm and varies with cache size
• Workloads fully utilize a 64MB LLC
• Reducing cache misses improves data sharing
• Despite size of shared footprint, shared data frequently referenced
Sharing Phase Dependent & f (cache size)
[Figure: "How Much Shared?" over time for (a) SEMPHY and (b) SVM, 4-threaded run, with 4 MB, 16 MB, and 64 MB LLCs; legend: 1 Thread, 2 Thread, 3 Thread, 4 Thread]
Shared/Private Cache – SEMPHY
[Figure: miss rate vs. total instructions (billions), two panels: Private Cache (16MB total LLC, 4MB/core) and Shared Cache (16MB total LLC)]
• SEMPHY with 4 threads
• Shared cache outperforms private caches
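One way to see why a shared LLC can beat private LLCs of the same total size is a toy model. The sketch below is not the study's simulator and does not reproduce its results: it assumes fully associative LRU caches, no coherence traffic, and a synthetic trace in which all four cores sweep the same working set, a case chosen to favor sharing. All names and sizes are hypothetical.

```python
from collections import OrderedDict


class LRU:
    """Fully associative LRU cache that tracks line ids only."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit, False on a miss (filling the line)."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[line] = True
        return False


def miss_rate(trace, caches):
    """trace: list of (core, line); caches: dict mapping core -> cache."""
    misses = sum(1 for core, line in trace if not caches[core].access(line))
    return misses / len(trace)


CORES, TOTAL_LINES, WORKING_SET, SWEEPS = 4, 16, 8, 10

# Every core reads the same 8-line working set, sweep after sweep.
trace = [(core, line)
         for _ in range(SWEEPS)
         for line in range(WORKING_SET)
         for core in range(CORES)]

# Private: each core gets a quarter of the capacity (4 lines).
private = {c: LRU(TOTAL_LINES // CORES) for c in range(CORES)}
# Shared: all cores map to one cache holding the full capacity (16 lines).
llc = LRU(TOTAL_LINES)
shared = {c: llc for c in range(CORES)}

print(miss_rate(trace, private))  # 1.0: 8-line sweep thrashes each 4-line cache
print(miss_rate(trace, shared))   # 0.025: only the 8 cold misses remain
```

With private caches each core keeps its own copy of the shared lines, so the working set does not fit in any one slice; with a shared cache a single copy serves all cores. If the threads shared nothing, this advantage would disappear in the model.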
Shared Refs & Shared Caches…
[Figure: GeneNet, 16MB LLC, 4-threaded run: % total accesses (legend: Cache Miss, 1 Thread, 2 Thread, 3 Thread, 4 Thread) and miss rate for private LLC and shared LLC, with phases A and B marked]
• Phase A: Shared caches
perform better than
private caches (25%)
• Phase B: Shared caches
marginally better than
private caches (5%)
• Shared caches BETTER
when shared data
frequently referenced
• Most workloads frequently
reference shared data
Summary
• This Paper:
– Memory behavior of parallel bioinformatics workloads
• Key Points:
– Workloads exhibit a large amount of data sharing
– Data sharing is a function of the total cache available
• Eliminating cache misses improves data sharing
– Shared data frequently referenced
– Shared caches outperform private caches especially
when shared data is frequently used
Ongoing Work on Bio-Workloads
University of Maryland
BioBench: A Benchmark Suite for
Bioinformatics Applications
BioParallel: Parallel Bioinformatics
Applications Suite (In Progress)
Brought to you by Maryland Memory Systems Research
"BioBench: A benchmark suite of bioinformatics
applications." K. Albayraktaroglu, A. Jaleel, X. Wu,
M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung.
Proc. 2005 IEEE International Symposium on
Performance Analysis of Systems and Software
(ISPASS 2005), pp. 2-9. Austin TX, March 2005.
http://www.ece.umd.edu/biobench/