GM_0119_v4.pptx

Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan
Department of Computer Science and Engineering
University of California, Riverside
2011 IEEE International Symposium on Workload Characterization
(IISWC Nov. 2011)
Outline
• Motivation
• Related Work
• Scalability Study and Analysis
• Thread Reinforcer Framework
• Experiment
• Conclusion
19/01/2012
ESL CSIE CCU
2
Motivation
• The performance of a multithreaded
application depends on the number of
threads it runs with on a multi-core system
• Using too few threads
→ Underutilization of system resources
• Using too many threads
→ Contention for locks and shared resources
Motivation (cont.)
• With the binding model (one thread per core),
performance is significantly worse for most programs
Motivation (cont.)
• Develop a simple technique for dynamically
determining the appropriate number of threads
without recompiling the application, using
complex compilation techniques, or modifying
operating system policies.
Related Work
• OpenSolaris utilities and libraries:
libmtmalloc, prstat, mpstat, iostat
• PARSEC Benchmark suite
• Previous work
libmtmalloc
• Multi-threaded memory allocator
library
• Provides concurrent access to
heap space
• Avoids the high overhead of malloc
prstat
• Reports active process statistics
• Used to measure the VCX and ICX data
– Voluntary Context-Switches (VCX)
– Involuntary Context-Switches (ICX)
mpstat
• Reports per-processor statistics
• Used to measure the thread migration rate
iostat
• Reports input/output statistics for
devices and partitions
PARSEC Benchmark suite
• The Princeton Application
Repository for Shared-Memory
Computers
PARSEC Benchmark suite
• Open source
• Parallel, diverse, emerging workloads that use
state-of-the-art algorithms
• Addresses the shortcomings of existing
benchmark suites such as SPLASH-2, SPEC CPU2006, and SPEC OMP2001
• Supports Pthreads, OpenMP, and TBB
PARSEC Benchmark suite
• Swaption
– Financial Analysis
– Heath-Jarrow-Morton (HJM) framework
– Uses Monte Carlo simulation
PARSEC Benchmark suite
• Ferret
– Similarity Search
PARSEC Benchmark suite
• Bodytrack
– Computer Vision application
– Tracks a human body with multiple
cameras
PARSEC Benchmark suite
• Blackscholes
– Financial Analysis
– Prices a portfolio of options with
the Black-Scholes partial differential equation
PARSEC Benchmark suite
• Canneal
– Electronic Design Automation
– Minimizes the routing cost of a chip
design with cache-aware simulated
annealing
PARSEC Benchmark suite
• Fluidanimate
– Computer animation application
– Simulates the underlying physics of
fluid motion for real-time animation
purposes with the SPH (smoothed particle hydrodynamics) algorithm
PARSEC Benchmark suite
• Facesim
– Computer animation application
– Simulates motions of a human face
for visualization purposes
PARSEC Benchmark suite
• Streamcluster
– Machine learning application
– Computes an approximation for the
optimal clustering of a stream of
data points
Previous work
• Dynamically finding a suitable number
of threads to optimize performance is
an important problem.
• While this issue has been studied in
the context of quad-core and 8-core
systems, it has not been studied for
systems with a larger number of cores.
Scalability Study and Analysis
• Experiment Setup
– Dell PowerEdge R905 server
– 24 Cores: 4 x 6-Core AMD Opteron 2.4GHz
– 32GB RAM
– OpenSolaris 2009.06
• Eight programs from the PARSEC suite:
swaption, ferret, bodytrack, blackscholes, canneal,
fluidanimate, facesim, streamcluster
Scalability Study and Analysis (cont.)
• First, for programs that make extensive use of
heap memory, we used the libmtmalloc library
to allow multiple threads to access the heap
concurrently.
• Second, in some applications where the input
load is not evenly distributed across worker
threads, we improved the load distribution
code.
Performance for Varying Number of Threads
• The value of OPT-Threads (the thread count that
gives the best performance) varies widely across
programs, from 16 to 63
• The number of threads created is therefore
an important choice
Performance for Varying Number of Threads
[Figure: performance for varying numbers of threads, with groups labeled #Threads > 24 and #Threads < 24]
Scalability Study and Analysis (cont.)
• OPT-Threads > Number of Cores
– Scalable Performance
• Erratic
• Steady Decline
• Continued Increase
– Performance Does Not Scale
• OPT-Threads < Number of Cores
Factors Determining Scalability
• User (USR):
The percentage of time a thread spends in user mode.
• System (SYS):
The percentage of time a thread spends processing
system events.
• Lock-contention (LCK):
The percentage of time a thread spends waiting for user locks,
condition variables, etc.
• Latency (LAT):
The percentage of time a thread spends waiting for a CPU.
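The four factors above are shares of a thread's time. As a minimal sketch of the arithmetic, assuming only these four states are tracked and that per-state times are available in seconds (the function name and the sample timings are hypothetical, not taken from the slides):

```python
def microstate_percentages(usr, sys_time, lck, lat):
    """Convert seconds spent in each state into percentages of the total."""
    states = {"USR": usr, "SYS": sys_time, "LCK": lck, "LAT": lat}
    total = sum(states.values())
    if total == 0:
        # No time recorded yet: report zero for every state.
        return {name: 0.0 for name in states}
    return {name: 100.0 * seconds / total for name, seconds in states.items()}


# Hypothetical timings for one thread, in seconds (illustrative only).
shares = microstate_percentages(usr=6.0, sys_time=1.0, lck=2.0, lat=1.0)
print(shares)
```

A high LCK share points to lock contention, while a high LAT share points to threads waiting for a free core.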
Scalability Study and Analysis (cont.)
• Erratic (swaption □)
Scalability Study and Analysis (cont.)
• Erratic (swaption)
• The speedup behavior is a direct consequence of
changes in the thread migration rate
• When the number of threads is divisible by the
number of cores, migrations are less likely than
when it is not.
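One way to picture the divisibility effect is to spread the threads as evenly as possible over the cores and look at the leftover imbalance. This is an idealized sketch, not the scheduler's actual policy; the function names are hypothetical.

```python
def threads_per_core(num_threads, num_cores):
    """Spread threads as evenly as possible; return the count on each core."""
    base, extra = divmod(num_threads, num_cores)
    # 'extra' cores carry one additional thread each.
    return [base + 1] * extra + [base] * (num_cores - extra)


def imbalance(num_threads, num_cores):
    """Difference between the most and least loaded core."""
    counts = threads_per_core(num_threads, num_cores)
    return max(counts) - min(counts)


for n in (24, 36, 48, 50):
    print(n, "threads on 24 cores -> imbalance", imbalance(n, 24))
```

When the imbalance is zero, no core is more loaded than another, so a load balancer has less reason to move threads; any non-divisible count leaves some cores with one extra thread.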
Scalability Study and Analysis (cont.)
• Steady Decline (bodytrack +)
Scalability Study and Analysis (cont.)
• Steady Decline (bodytrack)
• Both lock contention and I/O increase
the thread migration rate
Scalability Study and Analysis (cont.)
• Continued Increase (ferret ▽)
Scalability Study and Analysis (cont.)
• OPT-Threads < Number of Cores
Scalability Study and Analysis (cont.)
Performance Does Not Scale
• Increased lock contention leads to
slowdowns because of an increased
context-switch rate
• The main thread accounts for a large share
of the program's execution time
Scalability Study and Analysis (cont.)
• Context-Switches
– Involuntary Context-Switches (ICX)
(e.g., when a thread's time quantum expires)
– Voluntary Context-Switches (VCX)
(e.g., when a thread blocks for I/O or fails to acquire a lock)
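Both counters can be read for the current process on Unix-like systems. The sketch below uses Python's standard `resource` module rather than prstat, so it only illustrates the two counters and is not the measurement method used in the study; the helper name is hypothetical.

```python
import resource
import time


def context_switches():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


before = context_switches()
time.sleep(0.01)  # sleeping gives up the CPU, which is normally a voluntary switch
after = context_switches()

print("voluntary switches during sleep:", after[0] - before[0])
print("involuntary switches during sleep:", after[1] - before[1])
```

Dividing each difference by the elapsed time gives a rate, which is the form in which context switches are usually compared across runs.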
Thread Reinforcer Framework
• Goal :
To find the appropriate number of threads and
to do so quickly so as to minimize runtime
overhead
Thread Reinforcer Framework (cont.)
Steps
1. The application is run multiple times for short
durations, during which its behavior is monitored,
to search for the appropriate number of
threads
2. The application is fully re-executed with the
number of threads determined in the first
step
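The first step can be pictured as a search over thread counts driven by short trial runs. The sketch below is a simple hill climb under stated assumptions, not the algorithm from the paper: `measure` is a hypothetical callback that runs the application briefly with `n` threads and returns a throughput figure, and the step size and stopping threshold are arbitrary.

```python
def find_thread_count(measure, start, step, max_threads, min_gain=0.01):
    """Increase the thread count while short trial runs keep improving."""
    best_n = start
    best_speed = measure(start)
    n = start + step
    while n <= max_threads:
        speed = measure(n)
        # Stop as soon as a trial fails to improve by at least min_gain.
        if speed <= best_speed * (1 + min_gain):
            break
        best_n, best_speed = n, speed
        n += step
    return best_n


def toy_measure(n):
    """Synthetic throughput curve peaking at 32 threads (not measured data)."""
    return 2000 - (n - 32) ** 2


chosen = find_thread_count(toy_measure, start=8, step=8, max_threads=96)
print("chosen thread count:", chosen)
```

A search of this kind stops at the first point where improvement stalls, so it can miss a later peak on a curve that rises and falls repeatedly.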
Thread Reinforcer Framework (cont.)
• Four profiles:
– LOCK (lock contention)
– MIGR_RATE (thread migration rate)
– VCX_RATE (voluntary context switch rate)
– CPU_UTIL (processor utilization)
Experiment
• Experiment Setup
– Dell PowerEdge R905 server
– 24 Cores: 4 x 6-Core AMD Opteron 2.4GHz
– 32GB RAM
– OpenSolaris 2009.06
• Eight programs from the PARSEC suite:
swaption, ferret, bodytrack, blackscholes, canneal,
fluidanimate, facesim, streamcluster
Conclusion
• The conventional wisdom that, to maximize
performance, the number of threads created should
equal the number of cores does not hold for systems
with a significantly larger number of cores
Conclusion
• Nt < Nc
Context switch rate plays an important role.
• Nt > Nc
Thread migrations performed by the OS can
limit the speedups.
Nt: Number of threads
Nc: Number of cores
Conclusion
• Thread Reinforcer: dynamically determines the
number of threads via OS-level monitoring
– Approximates optimum performance
– Overhead of only 0.3% to 4.2%
– Without recompiling the program
– Without modifying the OS
Limitations
• Initialization period:
Works well for applications that have a short
initialization period
• i.e., applications in which worker threads are
created early in the execution
Thanks for Listening...
Your Queries?