Transcript Minjang Kim
Prospector: A Toolchain To Help Parallel Programming
Minjang Kim, Hyesoon Kim (HPArch Lab), and Chi-Keung Luk (Intel)
This work will also be supported by Samsung
Motivation (1/2)
Parallel programming is hard
What if there is a tool that helps parallel programming?
We already have some tools, such as race detectors
However, few tools guide parallel programming itself
A programmer wants to parallelize serial code
Where to parallelize?
How to parallelize?
Motivation (2/2)
We propose Prospector
A set of dynamic program analyzers to help
parallelization of serial code
Goals
Give information for finding the right parallelization targets
Provide advice on writing correct and optimized
parallelized code
Overview of Prospector
[Figure: Prospector takes source code or a binary as input (e.g., Func1() containing Loop1, Loop2, and a call to Func2(), which contains Loop3) and passes it to the following tools:
Loop-Centric Profiler (per-loop statistics, e.g., Invocation: 8, Iteration: 5,000, Max Iter: 1,600, Min Iter: 40)
Parallel Speedup Predictor (speedup vs. # of cores: 2, 4, 8)
Parallelizable Section Finder
Parallelism Pattern Advisor
Architecture Advisor (speedup on CPU vs. GPU)
Parallel Performance Analyzer (e.g., Loop3 with Lock() and Unlock() around statements)]
Prospector: Loop-Centric Profiler
Q: Which code section would be good for parallelization?
Most frequently executed loops
Legacy profilers only report hot functions and instructions
We provide details of loop execution
Example report for Loop1: Invocation: 8, Iteration: 5,000, Max Iter: 1,600, Min Iter: 40
Trip count: sufficient work?
Number of invocations: low fork/join overhead?
Statistics of loop iteration length (min, max, stdev): balanced?
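The statistics listed above could be derived from a per-iteration trace roughly as follows. This is a minimal sketch, not Prospector's implementation: the function name `summarize_loop`, the input layout (one list of iteration lengths per loop invocation), and the reading of "Max Iter" and "Min Iter" as per-invocation trip counts are all assumptions made for illustration.

```python
import statistics

def summarize_loop(invocations):
    """Summarize one loop (hypothetical helper, not part of Prospector).

    `invocations` holds one entry per loop invocation; each entry is a
    list of per-iteration lengths (e.g., cycles or instruction counts).
    """
    trip_counts = [len(iters) for iters in invocations]
    lengths = [t for iters in invocations for t in iters]
    return {
        "invocations": len(invocations),          # low fork/join overhead?
        "total_iterations": sum(trip_counts),     # sufficient work?
        "max_trip_count": max(trip_counts),
        "min_trip_count": min(trip_counts),
        "iter_len_min": min(lengths),             # balanced?
        "iter_len_max": max(lengths),
        "iter_len_stdev": statistics.pstdev(lengths),
    }

# Two invocations of the same loop, with made-up iteration lengths.
report = summarize_loop([[10, 12, 11], [10, 50]])
print(report)
```

A large gap between `iter_len_min` and `iter_len_max`, or a large standard deviation, would hint that a static work split across threads may be unbalanced.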
Prospector: Parallel Speedup Predictor (1/2)
Q: What would be the expected speedup?
Analytical models (e.g., Amdahl’s Law) are not
practical for predicting speedup in the presence of locks
Our approach
Dynamically predict speedup based on
lightweight profiling
[Plot: speedup vs. # of cores (2, 4, 8)]
Challenges
How to model architecture factors (e.g., caches, memory)?
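For reference, Amdahl's Law, the analytical model mentioned on this slide, can be written in a few lines. The sketch below only illustrates why such a model is limited: its single input is the parallel fraction, so lock contention, caches, and memory behavior do not appear anywhere. The function name and the 0.9 parallel fraction are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Same core counts as the plot on the slide; 0.9 is a made-up fraction.
for cores in (2, 4, 8):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```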
Prospector: Parallel Speedup Predictor (2/2)
Mechanisms
Programmers annotate the serial code
Fast and light profiling
Measure time between annotations
Emulation
Describe the behaviors of parallel execution + locks
Obtain estimated parallel execution time for speedup
Modeling architectural parameters
Sampling memory accesses
Using an analytical model for cache hit/miss prediction
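The profile-then-emulate idea above might look like the following. This is a rough sketch under stated assumptions, not the predictor's actual algorithm: it assumes the annotations yield, for each loop iteration, a list of `(duration, holds_lock)` segments, that there is a single global lock, that iterations are assigned to cores round-robin, and it ignores the architectural modeling (memory sampling, cache model) entirely. All names are hypothetical.

```python
import heapq

def emulate_parallel_time(iterations, cores):
    """Estimate the parallel execution time of a loop with one global lock.

    `iterations` is a list of iterations; each iteration is a list of
    (duration, holds_lock) segments measured between annotations.
    """
    # Per-core queue of segments, in program order (round-robin split).
    queues = [[] for _ in range(cores)]
    for i, segments in enumerate(iterations):
        queues[i % cores].extend(segments)

    # Heap of (time at which the core is ready, core id, next segment index).
    heap = [(0.0, c, 0) for c in range(cores) if queues[c]]
    heapq.heapify(heap)
    lock_free_at = 0.0
    finish = 0.0
    while heap:
        now, core, idx = heapq.heappop(heap)
        duration, holds_lock = queues[core][idx]
        if holds_lock:
            start = max(now, lock_free_at)  # wait until the lock is released
            end = start + duration
            lock_free_at = end
        else:
            end = now + duration
        finish = max(finish, end)
        if idx + 1 < len(queues[core]):
            heapq.heappush(heap, (end, core, idx + 1))
    return finish

def predicted_speedup(iterations, cores):
    """Serial time (sum of all segments) divided by emulated parallel time."""
    serial = sum(d for segs in iterations for d, _ in segs)
    return serial / emulate_parallel_time(iterations, cores)

# Four iterations: 8 time units of independent work, then 2 under the lock.
profile = [[(8.0, False), (2.0, True)]] * 4
for cores in (1, 2, 4):
    print(cores, round(predicted_speedup(profile, cores), 2))
```

Because segments are popped in order of their start time, lock requests are served in arrival order, which is one simple way to account for contention that a closed-form model does not capture.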
Prospector: Parallelizable Section Finder (1/3)
Q: Is this code section parallelizable?
Data dependences determine parallelizability
Compiler analysis is often imprecise because of pointers and complex
control flow
[Figure: Func1() containing Loop1, Loop2, and a call to Func2(), with a section marked "Parallelizable!"]
Our approach
Dynamic data-dependence profiling
Provides detailed dependence information for a given input
Challenges
High overhead; a smart algorithm is needed
Prospector: Parallelizable Section Finder (2/3)
Mechanisms
A dynamic profiler based on instrumentation
At instrumentation time (or static time)
Instrumentation can be at either the binary or the source level
Analyzes control flow graphs and loop structures
At runtime
We observe memory addresses (no points-to analysis)
These memory addresses are stored and analyzed to
discover data dependences
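The runtime step above (store observed addresses, then look for dependences) can be sketched as follows. This is an illustrative, unoptimized version, not Prospector's algorithm: the trace format `(iteration, op, address)` and the function name are assumptions, and only the most recent read and write per address are remembered, so some write-after-read pairs would be missed.

```python
def find_loop_carried_dependences(trace):
    """Detect loop-carried dependences from a memory-access trace.

    `trace` is a list of (iteration, op, address) tuples in execution
    order, where op is "R" or "W".
    Returns a set of (kind, address, source_iteration, sink_iteration).
    """
    last_write = {}  # address -> iteration of the most recent write
    last_read = {}   # address -> iteration of the most recent read
    deps = set()
    for iteration, op, addr in trace:
        if op == "R":
            w = last_write.get(addr)
            if w is not None and w != iteration:
                deps.add(("RAW", addr, w, iteration))
            last_read[addr] = iteration
        else:
            w = last_write.get(addr)
            if w is not None and w != iteration:
                deps.add(("WAW", addr, w, iteration))
            r = last_read.get(addr)
            if r is not None and r != iteration:
                deps.add(("WAR", addr, r, iteration))
            last_write[addr] = iteration
    return deps

# Trace of `a[i] = a[i-1] + 1` for i = 1, 2 (addresses are array indices).
carried = find_loop_carried_dependences(
    [(1, "R", 0), (1, "W", 1), (2, "R", 1), (2, "W", 2)]
)
# Trace of `b[i] = a[i]` for i = 0, 1 (disjoint addresses per iteration).
independent = find_loop_carried_dependences(
    [(0, "R", 100), (0, "W", 200), (1, "R", 101), (1, "W", 201)]
)
print(carried, independent)
```

An empty result only says that no dependence was observed for this particular input, which matches the "for a given input" caveat of dynamic profiling.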
Prospector: Parallelizable Section Finder (3/3)
Mechanisms
Scalability
Current tools require too much memory and time to
analyze data dependence
Prospector implements a new scalable algorithm for data
dependence profiling
Key ideas
Using compression and parallelization (MICRO ’10)
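The slide only says "compression", so the following is one plausible illustration rather than the published algorithm: regular address sequences, such as those produced by array traversals, can be stored as `(base, stride, count)` runs instead of individual addresses. The function names and the greedy run detection are assumptions.

```python
def compress_strides(addresses):
    """Greedily compress an address sequence into (base, stride, count) runs."""
    runs = []
    i = 0
    n = len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 == n:
            runs.append((base, 0, 1))  # trailing single address
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while consecutive addresses keep the same stride.
        while i + count < n and addresses[i + count] - addresses[i + count - 1] == stride:
            count += 1
        runs.append((base, stride, count))
        i += count
    return runs

def expand(runs):
    """Inverse of compress_strides, used here only to check the round trip."""
    return [base + k * stride for base, stride, count in runs for k in range(count)]

addresses = [100, 104, 108, 112, 500]
runs = compress_strides(addresses)
print(runs)
```

With this representation, memory use grows with the number of runs rather than the number of accesses, which is the kind of saving a scalable profiler needs.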
Prospector: Parallelism Pattern Advisor
Q: How can I transform the serial code?
If dependences are easily removable
I.e., embarrassingly parallel loops with some reductions
Guide parallelization strategy directly
E.g., use an OpenMP pragma here
If severe dependences exist
Can we give advice on avoiding these dependences?
General solutions are extremely hard
Instead, data-dependence pattern analysis
E.g., pipeline parallelism, a certain form of locking
[Figure: Loop3 with Lock() and Unlock() around a block of statements]
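The "easily removable" case can be illustrated with a reduction. In the sketch below, the serial loop carries a dependence on `total`, which is removed by giving each worker a private partial sum and combining the partial sums at the end. This shows the shape of the transformation only: the function names are hypothetical, and in CPython a thread pool will not speed up this CPU-bound loop, so it is not a performance demonstration.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_sum_of_squares(values):
    total = 0
    for v in values:
        total += v * v  # loop-carried dependence on `total` (a reduction)
    return total

def parallel_sum_of_squares(values, workers=4):
    # Split the input into chunks; each worker accumulates a private sum.
    chunk = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(serial_sum_of_squares, chunks))
    # Combine the partial sums; this step replaces the shared accumulator.
    return sum(partials)

data = list(range(1000))
print(serial_sum_of_squares(data), parallel_sum_of_squares(data))
```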
Prospector: Parallel Architecture Advisor
Q: Which parallel hardware would be better?
Can we predict performance on different hardware?
E.g., Speedups on multicore and GPGPU
Challenges
Need to model more architectural factors
[Plot: speedup on CPU vs. GPU]
Prospector: Parallel Performance Analyzer
Q: What is the reason for poor speedup?
There are a couple of profilers for this purpose
They analyze the degree of concurrency
They profile lock contention (wait time)
But the information is too low-level to understand the problems
Alternative
Macroscopic profiling of parallelized programs
Alternative forms of visualization
[Figure: Loop3 with Lock() and Unlock() around a block of statements]
Related Work
State-of-the-art tools
Parallel Advisor from Intel Parallel Studio 2011
Speedup Predictor: cannot model architectures
Parallelizable Section Finder: scalability issues
vfAnalyst from Vector Fabrics
Parallelizable Section Finder: scalability issues
Current Status and Timeline
June 2010
The initial idea of Prospector was presented at HotPar ’10
Dec 2010
Scalable data-dependence profiling algorithm
(for Parallelizable Section Finder and Pattern Advisor) will be presented at
MICRO ’10
Beta version will be released as open source
Loop-centric profiler
Parallelizable Section Finder (i.e. Data-Dependence profiler)
Parallel speedup predictor
Mar 2010
Parallel Speedup Predictor will be released
Aug 2010
First Parallelism Pattern Advisor will be released
Conclusion
We need a new type of tool to help parallel
programming
Prospector is a set of parallel programming advisors
based on dynamic program analysis
Finds good parallelization targets
Analyzes serial code to understand the behavior
Predicts speedup
Provides advice on code changes
Thank you!
Q&A
References
Overall tool architecture
Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "Prospector: Helping Parallel
Programming by A Data-Dependence Profiler", 2nd USENIX Workshop on Hot Topics in
Parallelism (HotPar '10), June 2010.
Scalable data-dependence profiling
Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "SD3: A Scalable Approach To Dynamic
Data-Dependence Profiling", Proceedings of the 43rd IEEE/ACM International
Symposium on Microarchitecture (MICRO), December 2010.