Transcript Minjang Kim

Prospector: A Toolchain To Help Parallel Programming
Minjang Kim and Hyesoon Kim (HPArch Lab), and Chi-Keung Luk (Intel)
This work will also be supported by Samsung

Motivation (1/2)
- Parallel programming is hard
- What if there were a tool that helps with parallel programming?
  - We already have some tools, such as race detectors
  - However, few tools guide the parallelization itself
- A programmer wants to parallelize serial code
  - Where to parallelize?
  - How to parallelize?

Motivation (2/2)
- We propose Prospector
  - A set of dynamic program analyzers that help parallelize serial code
- Goals
  - Provide information for finding the right parallelization targets
  - Provide advice on writing correct and optimized parallel code

Overview of Prospector
[Figure: Prospector toolchain overview. Input: source code or binary, e.g., Func1() containing Loop1, Loop2, and a call to Func2(), where Func2() contains Loop3. Components: Loop-Centric Profiler (per-loop statistics such as invocation and iteration counts), Parallelizable Section Finder, Parallel Speedup Predictor (speedup vs. number of cores: 2, 4, 8), Parallelism Pattern Advisor, Architecture Advisor (speedup on CPU vs. GPU), and Parallel Performance Analyzer.]

Prospector: Loop-Centric Profiler
Q: Which code sections would be good for parallelization?
- Mostly the frequently executed loops
- Legacy profilers only report hot functions and instructions
- We provide details of loop execution
  - Trip count -> Sufficient work?
  - Number of invocations -> Low fork/join overhead?
  - Statistics on the length of loop iterations (min, max, stdev) -> Balanced?
[Figure: example report for Loop1. Invocation: 8; Iteration: 5,000; Max Iter: 1,600; Min Iter: 40]
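The loop statistics listed above can be sketched in a few lines. This is only an illustration, not Prospector's profiler: the trace format (one list of iteration durations per loop invocation) and the names `loop_stats`, `max_iter`, and `min_iter` are assumptions, and reading "Max Iter" and "Min Iter" as the largest and smallest trip count of a single invocation is an interpretation of the example report.

```python
from statistics import pstdev


def loop_stats(invocations):
    """Summarize one loop from a hypothetical trace.

    `invocations` holds one list per loop invocation; each inner list
    contains the measured duration of every iteration in that invocation.
    """
    trip_counts = [len(iters) for iters in invocations]
    all_times = [t for iters in invocations for t in iters]
    return {
        "invocations": len(invocations),        # fork/join overhead indicator
        "iterations": sum(trip_counts),         # total work indicator
        "max_iter": max(trip_counts),           # largest trip count
        "min_iter": min(trip_counts),           # smallest trip count
        "iter_time_min": min(all_times),        # balance indicators
        "iter_time_max": max(all_times),
        "iter_time_stdev": pstdev(all_times),
    }


stats = loop_stats([[1.0, 1.0, 4.0], [2.0, 2.0]])
```

A large standard deviation relative to the mean iteration time suggests unbalanced iterations, which would call for dynamic rather than static scheduling.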

Prospector: Parallel Speedup Predictor (1/2)
Q: What speedup can be expected?
- Analytical models (e.g., Amdahl's Law) are not practical for predicting speedup in the presence of locks
Our approach
- Dynamically predict speedup based on lightweight profiling
Challenges
- How to model architectural factors (e.g., caches, memory)?
[Figure: predicted speedup vs. number of cores (2, 4, 8)]
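For reference, the standard form of Amdahl's Law is shown below, where p is the fraction of serial execution time that can be parallelized and n is the number of cores (both symbols are introduced here, not taken from the slides).

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
\qquad\text{e.g.}\qquad
S(8)\big|_{p = 0.9} = \frac{1}{0.1 + 0.1125} \approx 4.71
```

The formula has no term for time spent waiting on locks, which is why it tends to overestimate speedup for lock-heavy code.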

Prospector: Parallel Speedup Predictor (2/2)
Mechanisms
- Programmers annotate the serial code
- Fast and light profiling
  - Measure time between annotations
- Emulation
  - Describe the behaviors of parallel execution + locks
  - Obtain estimated parallel execution time for speedup
- Modeling architectural parameters
  - Sampling memory accesses
  - Using an analytical model for cache hit/miss prediction
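The emulation step can be illustrated with a small event-driven sketch. It is not Prospector's emulator; the segment format, the single global FIFO lock, dynamic task scheduling, and the function name `emulate` are all assumptions made for the example, and architectural effects such as caches are ignored.

```python
import heapq
from collections import deque


def emulate(tasks, cores):
    """Estimate parallel execution time from serially measured segments.

    Each task is a list of (kind, duration) pairs measured between
    annotations: "work" runs freely, "lock" is serialized by one lock.
    """
    pending = deque(tasks)      # tasks not yet started, in program order
    lock_free_at = 0.0          # time at which the lock is next available
    finish = 0.0
    # One heap entry per core: (current time, core id, remaining segments).
    heap = [(0.0, core, deque()) for core in range(cores)]
    heapq.heapify(heap)
    while heap:
        # Always advance the core with the earliest clock, so lock
        # requests are granted in request-time (FIFO) order.
        now, core, segs = heapq.heappop(heap)
        if not segs:
            if not pending:
                finish = max(finish, now)   # this core has no more work
                continue
            segs = deque(pending.popleft())
        if segs:
            kind, dur = segs.popleft()
            if kind == "lock":
                start = max(now, lock_free_at)  # wait if the lock is held
                lock_free_at = start + dur
                now = lock_free_at
            else:
                now += dur
        heapq.heappush(heap, (now, core, segs))
    return finish


tasks = [[("work", 3.0), ("lock", 1.0)]] * 4
serial_time = emulate(tasks, 1)
speedup_on_2 = serial_time / emulate(tasks, 2)
```

With these four tasks the serial time is 16.0 and the emulated time on two cores is 9.0, so the predicted speedup is about 1.78 rather than 2, because the lock sections cannot overlap.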

Prospector: Parallelizable Section Finder (1/3)
Q: Is this code section parallelizable?
- Data dependences determine parallelizability
- Compilers may not do well because of pointers and complex control flow
Our approach
- Dynamic data-dependence profiling
- Provides detailed dependence information for a given input
Challenges
- Too much overhead; a smart algorithm is needed
[Figure: Func1() containing Loop1, Loop2, and a call to Func2(), with one section marked "Parallelizable!"]

Prospector: Parallelizable Section Finder (2/3)
Mechanisms
- A dynamic profiler based on instrumentation
- At instrumentation time (or static time)
  - Instrumentation can be at either the binary or the source level
  - Analyzes control flow graphs and loop structures
- At runtime
  - We observe memory addresses (no points-to analysis)
  - These memory addresses are stored and analyzed to discover data dependences
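The runtime step, storing observed addresses and analyzing them for dependences, can be sketched as follows. This is a naive illustration and not Prospector's algorithm: the trace format of (iteration, op, address) tuples and the function name `loop_carried_dependences` are assumptions, and keeping a table entry per address is exactly the kind of overhead a scalable profiler has to avoid.

```python
def loop_carried_dependences(trace):
    """Find cross-iteration dependences in a memory-access trace.

    `trace` is a hypothetical format: (iteration, op, address) tuples in
    execution order, where op is "R" (read) or "W" (write).
    """
    last_write = {}   # address -> iteration of the most recent write
    readers = {}      # address -> iterations that read since that write
    deps = set()
    for it, op, addr in trace:
        writer = last_write.get(addr)
        if op == "R":
            if writer is not None and writer != it:
                deps.add(("RAW", addr, writer, it))   # flow dependence
            readers.setdefault(addr, set()).add(it)
        else:
            if writer is not None and writer != it:
                deps.add(("WAW", addr, writer, it))   # output dependence
            for reader in readers.get(addr, ()):
                if reader != it:
                    deps.add(("WAR", addr, reader, it))  # anti dependence
            last_write[addr] = it
            readers[addr] = set()
    return deps


# a[i] = a[i-1] + 1: iteration i reads address i-1 and writes address i.
chain = [(1, "R", 0), (1, "W", 1), (2, "R", 1), (2, "W", 2)]
# a[i] = b[i] * 2: every iteration touches its own addresses only.
independent = [(0, "R", 100), (0, "W", 0), (1, "R", 101), (1, "W", 1)]
```

An empty result only shows that no dependence occurred for the profiled input; a different input may exercise a different path.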

Prospector: Parallelizable Section Finder (3/3)
Mechanisms
- Scalability
  - Current tools require too much memory and time to analyze data dependences
  - Prospector implements a new scalable algorithm for data-dependence profiling
- Key ideas
  - Using compression and parallelization (MICRO '10)
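The slide does not say what is compressed or how. As one plausible illustration only, assuming that many loops touch memory at a fixed stride, a stream of observed addresses can be collapsed into (base, stride, count) triples. The function name `compress_strides` and the triple format are hypothetical and are not taken from Prospector.

```python
def compress_strides(addresses):
    """Collapse runs of fixed-stride addresses into (base, stride, count)."""
    runs = []
    i = 0
    n = len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 == n:
            runs.append((base, 0, 1))   # a single trailing address
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while consecutive addresses keep the same stride.
        while (i + count < n
               and addresses[i + count] - addresses[i + count - 1] == stride):
            count += 1
        runs.append((base, stride, count))
        i += count
    return runs


compressed = compress_strides([0, 4, 8, 12, 100, 108, 116])
```

Seven addresses become two triples here; an array walk of a million elements would still be one triple, so memory use no longer grows with the trip count.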

Prospector: Parallelism Pattern Advisor
Q: How can I transform the serial code?
- If dependences are easily removable
  - I.e., embarrassingly parallel loops with some reductions
  - Guide the parallelization strategy directly
    - E.g., "Use OpenMP pragma here"
- If severe dependences exist
  - Can we give advice on avoiding these dependences?
  - General solutions are extremely hard
  - Instead, use data-dependence pattern analysis
    - E.g., pipeline parallelism, a certain form of locking
[Figure: Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]
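The "easily removable" case, a loop whose only cross-iteration dependence is a reduction, can be sketched as below. This is a generic illustration of the reduction pattern and not output from the Pattern Advisor; the function name and the use of Python threads are assumptions, and in CPython the threads show the structure rather than deliver a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(values, workers=4):
    """Reduction pattern: private partial sums, combined once at the end."""
    # Give each worker its own slice so no shared accumulator is updated.
    chunks = [values[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: sum(x * x for x in c), chunks))
    # The only remaining serial step is combining the partial results.
    return sum(partials)


total = parallel_sum_of_squares(list(range(10)))
```

The serial loop's dependence on the running total disappears because each worker accumulates privately, which is the same idea a reduction clause expresses in OpenMP.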

Prospector: Parallel Architecture Advisor
Q: Which parallel hardware would be better?
- Can we predict performance on different hardware?
  - E.g., speedups on multicore and GPGPU
Challenges
- Need to model more architectural factors
[Figure: predicted speedup, CPU vs. GPU]

Prospector: Parallel Performance Analyzer
Q: What is the reason for poor speedup?
- There are a couple of profilers for this purpose
  - They analyze the degree of concurrency
  - They profile lock contention (wait time)
  - Their information is too low-level for understanding the problems
Alternative
- Macroscopic profiling of parallelized programs
- An alternative form of visualization
[Figure: Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]
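As a small illustration of what "degree of concurrency" means, the sketch below computes how long a program spent with each number of active threads. The input format (one (start, end) activity interval per thread) and the function name `concurrency_histogram` are assumptions for the example and do not describe any specific profiler.

```python
from collections import defaultdict


def concurrency_histogram(intervals):
    """Return {number of active threads: total time spent at that level}."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a thread becomes active
        events.append((end, -1))     # a thread becomes idle
    events.sort()
    hist = defaultdict(float)
    active, prev = 0, None
    for t, delta in events:
        if prev is not None and t > prev:
            hist[active] += t - prev  # time spent at the current level
        active += delta
        prev = t
    return dict(hist)


hist = concurrency_histogram([(0, 4), (1, 3), (2, 6)])
```

Here three threads overlap for only 1 of 6 time units, the kind of summary that points to poor speedup without yet explaining its cause.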

Related Work
State-of-the-art tools
- Parallel Advisor from Intel Parallel Studio 2011
  - Speedup Predictor: cannot model architectures
  - Parallelizable Section Finder: scalability issues
- vfAnalyst from Vector Fabrics
  - Parallelizable Section Finder: scalability issues

Current Status and Timeline
- June 2010
  - The initial idea of Prospector was presented at HotPar '10
- Dec 2010
  - Scalable data-dependence profiling algorithm (for Parallelizable Section Finder and Pattern Advisor) will be presented at MICRO '10
  - Beta version will be released as open source
    - Loop-Centric Profiler
    - Parallelizable Section Finder (i.e., data-dependence profiler)
    - Parallel Speedup Predictor
- Mar 2010
  - Parallel Speedup Predictor will be released
- Aug 2010
  - First Parallelism Pattern Advisor will be released

Conclusion
- We need a new type of tool to help parallel programming
- Prospector is a set of parallel programming advisors based on dynamic program analysis
  - Finds good parallelization targets
  - Analyzes serial code to understand its behavior
  - Predicts speedup
  - Provides advice on code changes

Thank you!


Q&A

References
- Overall tool architecture
  - Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "Prospector: Helping Parallel Programming by A Data-Dependence Profiler", 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10), June 2010.
- Scalable data-dependence profiling
  - Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "SD3: A Scalable Approach To Dynamic Data-Dependence Profiling", Proceedings of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010.