4-Expoiting-Asymmetr..

-Sam Ganzfried
-Ryan Sukauye
-Aniket Ponkshe
Outline
 Effects of asymmetry and how to handle them
 Design Space Exploration for Core Architecture
 Accelerating ‘Critical Sections’
Asymmetric Chip Multiprocessors
 Most current programs assume that all computation cores are equal
 When cores are not equal, application stability and scalability can suffer
 Sources of asymmetry:
 Process Variation
 Frequency Scaling
 Explicit in processor design
Ways to Improve Stability
 Asymmetry-aware scheduler
 Asymmetry-aware applications
 Fine grained threading
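To make the scheduler and fine-grained threading points concrete, here is a toy sketch (not the simulation behind these slides). It compares an asymmetry-unaware round-robin split with a greedy asymmetry-aware assignment on one fast core and three slow ones. The core speeds, task sizes, and function names are all hypothetical, chosen only for illustration.

```python
from typing import List


def makespan_round_robin(task_work: List[float], core_speeds: List[float]) -> float:
    """Asymmetry-unaware: deal tasks out evenly, ignoring core speed."""
    finish = [0.0] * len(core_speeds)
    for i, work in enumerate(task_work):
        core = i % len(core_speeds)
        finish[core] += work / core_speeds[core]
    return max(finish)


def makespan_asymmetry_aware(task_work: List[float], core_speeds: List[float]) -> float:
    """Asymmetry-aware: give each task to the core that would finish it earliest."""
    finish = [0.0] * len(core_speeds)
    for work in sorted(task_work, reverse=True):
        core = min(
            range(len(core_speeds)),
            key=lambda c: finish[c] + work / core_speeds[c],
        )
        finish[core] += work / core_speeds[core]
    return max(finish)


# Hypothetical machine: one core twice as fast as the other three.
core_speeds = [2.0, 1.0, 1.0, 1.0]
# Fine-grained threading: many small tasks instead of a few large ones.
fine_grained_tasks = [1.0] * 32

unaware = makespan_round_robin(fine_grained_tasks, core_speeds)
aware = makespan_asymmetry_aware(fine_grained_tasks, core_speeds)
print(f"round-robin makespan: {unaware}, asymmetry-aware makespan: {aware}")
```

In this toy model the round-robin split is limited by the slow cores, while the greedy assignment moves extra tasks to the fast core. Small tasks matter here: with only a few large tasks there is less room to rebalance.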
Simulation Results
Prior Heterogeneous Approaches
 Architecture given:
 Existing architectures
 Different generations of same processor family
 Scaled editions of same processor (e.g., Balakrishnan et al., ’05)
 Monotonicity:
 Total ordering among the cores in terms of performance that remains the same for all applications (e.g., EV6 vs. EV5).
 Greatly outperformed homogeneous CMPs.
Increasing the Design Space
[Kumar et al., ’06]
 Full space of heterogeneous processors is huge:
 Can change various architectural parameters on single processor
 Combined performance of multiple different cores on arbitrary permutations of the applications.
 Simplifying assumptions:
 Separability: performance is sum of individual performances.
 Good static scheduling of threads to cores.
 Good static scheduling of threads to cores.
 Only consider 4-core processors.
 Private L2 caches.
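The separability and static-scheduling assumptions can be illustrated with a minimal sketch. If chip performance is just the sum of per-thread performances, a "good static schedule" for 4 threads on 4 cores can be found by trying every thread-to-core mapping. The performance table and the function name `best_static_mapping` are hypothetical and not taken from the paper.

```python
from itertools import permutations
from typing import List, Tuple


def best_static_mapping(perf: List[List[float]]) -> Tuple[Tuple[int, ...], float]:
    """Return (mapping, total), where mapping[t] is the core assigned to thread t.

    perf[t][c] is the (hypothetical) performance of thread t when run on core c.
    Under separability, the chip's performance is the sum over threads.
    """
    n = len(perf)

    def total(mapping: Tuple[int, ...]) -> float:
        return sum(perf[t][mapping[t]] for t in range(n))

    best = max(permutations(range(n)), key=total)
    return best, total(best)


# Hypothetical per-thread performance on four different cores (rows: threads).
perf = [
    [1.0, 1.8, 1.2, 1.1],
    [1.0, 1.1, 1.9, 1.2],
    [1.0, 1.2, 1.1, 1.4],
    [1.0, 1.0, 1.0, 1.0],
]

mapping, total_perf = best_static_mapping(perf)
print(mapping, round(total_perf, 3))
```

With only 4 cores there are 24 mappings, so exhaustive enumeration is cheap. For larger chips an assignment-problem solver would be the usual substitute.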
Methodology
 480 possible cores: over 2.2 billion 4-core MPs.
 Wide range of area and power budgets.
 10 benchmarks for constructing workloads:
 E.g., chemistry, chess, combinatorial optimization.
 Considered all possible 4-threaded combinations.
 250 million cycles of each application on each core.
 Evaluated using weighted speedup.
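One reading that agrees with the "over 2.2 billion" figure, offered here as an assumption rather than something the slides state, is that a 4-core MP is an unordered choice of 4 cores from the 480 candidates, with repeats allowed. That count can be checked directly:

```python
from math import comb

NUM_CORE_TYPES = 480  # candidate core designs (from the slides)
CORES_PER_CHIP = 4    # only 4-core processors are considered (from the slides)

# Multisets of size k drawn from n types: C(n + k - 1, k).
# Assumption: core order on the chip does not matter and a core may repeat.
num_designs = comb(NUM_CORE_TYPES + CORES_PER_CHIP - 1, CORES_PER_CHIP)
print(num_designs)
```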
Experimental results
 For a particular given 4-thread workload:
 Best CMP has all cores different.
 7% higher throughput over best homogeneous CMP.
 16.7% improvement with dynamic mapping.
 Workload with given budget:
 Advantage of diversity even for all-same workloads!
 Significant benefit to diversity if either area or power is reasonably constrained.
 The best heterogeneous CMP is not constructed of cores that make good general-purpose uniprocessors.
Experiments cont’d
 Quantifying inefficiency due to monotonicity
 Best non-monotonic design outperformed best monotonic design by 7.5%.
 Outperformed best homogeneous CMP design by 15.4%.
 Search techniques
 Mostly brute-force search was used (~2.2 billion options).
 Used hill-climbing to speed up search.
 11% better than best homogeneous CMP
 4.5% worse than exhaustive search.
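The slides do not give the details of the hill-climbing search, so what follows is a generic steepest-ascent sketch under stated assumptions. A design is a sorted tuple of core-type indices, a neighbor changes exactly one core, and the scoring function is a made-up stand-in for the simulated throughput. `CORE_PERF`, `CORE_AREA`, `AREA_BUDGET`, and `toy_score` are hypothetical.

```python
from typing import Callable, Iterator, Tuple

Design = Tuple[int, ...]  # sorted core-type indices, one entry per core slot


def neighbors(design: Design, num_core_types: int) -> Iterator[Design]:
    """Yield every design that differs from `design` in exactly one core slot."""
    for slot, old_type in enumerate(design):
        for new_type in range(num_core_types):
            if new_type != old_type:
                changed = design[:slot] + (new_type,) + design[slot + 1:]
                # Sort so that reorderings of the same chip compare equal.
                yield tuple(sorted(changed))


def hill_climb(start: Design, num_core_types: int,
               score: Callable[[Design], float]) -> Tuple[Design, float]:
    """Move to the best neighbor until no neighbor strictly improves the score."""
    current = tuple(sorted(start))
    current_score = score(current)
    while True:
        best_neighbor = max(neighbors(current, num_core_types), key=score)
        best_neighbor_score = score(best_neighbor)
        if best_neighbor_score <= current_score:
            return current, current_score  # local optimum reached
        current, current_score = best_neighbor, best_neighbor_score


# Hypothetical core catalogue: bigger cores are faster but cost more area.
CORE_PERF = [1.0, 1.6, 2.1, 2.5, 2.8, 3.0]
CORE_AREA = [1, 2, 3, 4, 5, 6]
AREA_BUDGET = 12
NUM_CORE_TYPES = len(CORE_PERF)


def toy_score(design: Design) -> float:
    """Sum of per-core performance; designs over the area budget are rejected."""
    if sum(CORE_AREA[c] for c in design) > AREA_BUDGET:
        return float("-inf")
    return sum(CORE_PERF[c] for c in design)


start = (0, 0, 0, 0)
best, best_score = hill_climb(start, NUM_CORE_TYPES, toy_score)
print(best, best_score)
```

Hill climbing evaluates only a handful of designs per step rather than the whole space, which is why it is fast. It can stop at a local optimum, which is consistent with the slides reporting a result somewhat worse than exhaustive search.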
Accelerating Critical Sections
Questions
 Critical Sections vs. Serial Bottleneck
 What would a traditional CMP do on encountering a critical section?
 What does ACS do?
ACS
 Advantage:
 Lock and shared data reside in the cache hierarchy of the large core
 Downside:
 Private data must be transferred from the small core to the large core on demand
 False serialization
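As a loose software analogy only (ACS itself concerns cores on an asymmetric CMP, not Python threads), the idea of shipping critical sections to one designated core can be sketched with a single-worker executor. Here `large_core`, `small_core_worker`, and the shared counter are hypothetical names. The worker threads stand in for small cores, and the one executor thread stands in for the large core that keeps the shared data local.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A single worker thread stands in for the large core: every critical
# section runs there, so the shared data is only ever touched by one thread.
large_core = ThreadPoolExecutor(max_workers=1)
shared_counter = {"value": 0}

NUM_SMALL_CORES = 4
UPDATES_PER_CORE = 200


def critical_section(amount: int) -> int:
    """Runs only on the 'large core'; no lock is needed around the update."""
    shared_counter["value"] += amount
    return shared_counter["value"]


def small_core_worker() -> None:
    """A 'small core' ships each critical section and waits for it to finish."""
    for _ in range(UPDATES_PER_CORE):
        large_core.submit(critical_section, 1).result()


threads = [threading.Thread(target=small_core_worker) for _ in range(NUM_SMALL_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
large_core.shutdown()

total = shared_counter["value"]
print(total)
```

The sketch also hints at the false-serialization downside: any two critical sections submitted to `large_core` run one after the other, even if they would have protected unrelated data.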
Critical Sections vs. Serial Bottleneck
[Figure: a) Serial, Parallel and Critical Parts; b) On a CMP; c) With ACS]
Some results…
[Figure: # of cores above which ACS gives better performance]
 Performance Trade Offs in ACS
 Access private data vs. shared data
 Faster Critical Sections vs. Fewer Threads
ACS…
 Provides performance benefits as the number of cores increases
 Increases scalability
 Issues:
 False Serialization: Bit Vector at each small core
 Fine grained locks: Problem on Saturation
 Future Research:
 Accommodating Multiple Large Cores
 Either for different critical sections
 Or for different operations
 More than one application