Transcript Talk

TimeCube
A Manycore Embedded Processor with
Interference-agnostic Progress Tracking
Anshuman Gupta
Jack Sampson
Michael Bedford Taylor
University of California, San Diego
Multicore Processors in Embedded Systems
Intel Atom, Qualcomm Snapdragon, Apple A6, Applied Micro Green Mamba
• Standard in domains such as smartphones
• Higher Energy-Efficiency
• Higher Area-Efficiency
2
Towards Manycore Embedded Systems
Unicore
Dualcore: Shared Mem
Quadcore: Shared Cache, Shared Mem
Many(64)core: Shared OCN, Shared Cache, Shared Mem etc.
• Number of cores in a processor is increasing
• So is sharing!
3
What’s Great About Manycores
• Lots of resources
                      Tile GX 8072    Xeon Phi 7120X
  Cores               72              61
  Caches              23MB            30.5MB
  DDR channels        4               16
  Memory Bandwidth    100GB/s         352GB/s
4
What’s Not So Great: Sharing
• Low per-core resources
                      Tile Gx 8072    Intel Xeon 4650    Gap
  Cache / core        327 KB          2.5 MB             > 7X
  Memory BW / core    1.16 B/cyc      4.26 B/cyc         > 3X
The applications fight with each other over
the limited resources.
5
Sharing at its Worst
SPEC2K, SPEC2K6 + I/O-centric suite
• 32 cores, 16 MB L2 Cache, 96Gb/s DRAM bandwidth, 32GB DDR3
• 12X worst-case slowdowns!
6
Key Problems With Sharing
• I know how I’d run by myself, but how much
are others slowing me down?
• How do I get guarantees of how much
performance I’ll get?
• How do we allocate the resources for the good
of the many, but without punishing the few, or
the one?
7
I know how I’d run by myself, but how much are
others slowing me down?
Solution: We introduce a new metric –
Progress-Time
Time the application would have
taken, were it to have been
allocated all CPU resources.
• This Paper: With the right hardware, we can calculate the
Progress-Time in real time.
• Useful Because: Key building block for the hardware, for the
operating system, and for the application to create guarantees
about execution quality.
8
How do I get guarantees of how much performance
I’ll get?
Solution: We introduce a new hardware-generated data structure –
Progress Tables
For each application, how much
Progress-Time it gets for every
possible resource allocation
– and we extend the hardware to dynamically partition resources.
• This Paper: With a little more hardware, we can compute the
Progress Tables accurately and accordingly partition resources to
guarantee performance, in real time.
• Useful Because: We can determine exactly how many resources
are required to attain a given level of performance.
9
Sneak Preview
[Figure: Incremental Progress Tables for astar, hmmer, and specrand; axes: Cache (64KB to 16MB) and Bandwidth (%); color scale 0 to 1]
• Graphical images of real Incremental Progress Tables generated in real time by our hardware
• Red = attaining the full 1ms of Progress-Time in 1ms of real time
10
How do we allocate the resources for the good of the
many, but without punishing the few, or the one*?
Solution: We introduce a new hardware-generated data structure –
SPOT (Simultaneous Performance Optimization Table)
For each application, how many resources should be allocated to
maximize geomean of Progress-Times across the system.
• This Paper: With 3% more hardware, we can find near-optimal
resource allocations, in real time.
• Useful Because: Greatly improve system performance and
fairness.
* Star Trek reference.
11
TimeCube: A Demonstration Vehicle for These Ideas
[Figure: TimeCube tile array with Memory Controllers and DIMMs; legend: C = Cores, D = L2 Cache Block]
• Scalable manycore architecture, in-order memory system
• Critical resources spatially distributed over tiles
12
Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion
13
Measuring Execution Progress:
Progress-Time
• What do we need to compute Progress-Time?
[Figure: Current Universe with Execution Counters alongside Ideal (Shadow) Universe with Shadow Counters]
16
Shadow Structures
• Shadow Tags
• Measure cache miss rates for full cache allocation
• Set-sampling reduces overhead
• Shadow Prefetchers
• Measure prefetches issued and prefetch hit rate
• Track cache miss stream from Shadow Tags
• Launch fake prefetches, no data buffers
• Shadow Banking
• Measure DRAM page hits, misses, and conflicts
• Tracks current state of DRAM row buffers using DDR protocol
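As a rough illustration of the Shadow Tags bullets above, the following Python sketch keeps a tag-only LRU directory sized like the full cache but tracks only a sample of its sets. The class name, its parameters, and the modulo-based sampling rule are assumptions made for illustration; the slides do not state the sampling policy.

```python
from collections import OrderedDict


class SampledShadowTags:
    """Tag-only LRU directory modelling a full cache allocation.

    Only one in `sample_every` sets is tracked (set sampling), so the
    miss rate is an estimate taken from the sampled sets.
    """

    def __init__(self, num_sets, ways, line_bytes=64, sample_every=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sample_every = sample_every
        self.sets = {}  # set index -> OrderedDict of tags, LRU first
        self.sampled_accesses = 0
        self.sampled_misses = 0

    def access(self, address):
        line = address // self.line_bytes
        set_index = line % self.num_sets
        if set_index % self.sample_every != 0:
            return  # unsampled set: no shadow state is kept
        tag = line // self.num_sets
        tags = self.sets.setdefault(set_index, OrderedDict())
        self.sampled_accesses += 1
        if tag in tags:
            tags.move_to_end(tag)  # hit: mark as most recently used
        else:
            self.sampled_misses += 1
            tags[tag] = True
            if len(tags) > self.ways:
                tags.popitem(last=False)  # evict the least recently used tag

    def estimated_miss_rate(self):
        if self.sampled_accesses == 0:
            return 0.0
        return self.sampled_misses / self.sampled_accesses
```

No data is stored, only tags, which is why such a structure can be much smaller than the cache it models.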
17
A Shadow Performance Model for
Progress-Time
• Analytical model to estimate Progress-Time
• Takes into account the critical memory resources
• Assumes no change in core pipeline execution cycles
• Uses events collected from the shadow structures
• Reuses average latencies for accessing individual resources
ExecutionTime = corecycles
              + L2Hit x L2HitLatency
              + PrefHit x PrefHitLatency
              + PageHit x PageHitLatency
              + PageMiss x PageMissLatency
              + PageConflict x PageConflictLatency
(left factors: Shadow Events; right factors: Average Latencies for current allocation)
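The model above can be sketched in a few lines of Python. The container and function names are hypothetical; the only thing taken from the slide is the structure of the sum: core cycles plus each shadow event count multiplied by the average latency measured under the current allocation.

```python
from dataclasses import dataclass


@dataclass
class ShadowEvents:
    """Event counts reported by the shadow structures for one interval."""
    l2_hits: int
    pref_hits: int
    page_hits: int
    page_misses: int
    page_conflicts: int


@dataclass
class AvgLatencies:
    """Average cycles per event, measured under the current allocation."""
    l2_hit: float
    pref_hit: float
    page_hit: float
    page_miss: float
    page_conflict: float


def estimate_execution_time(core_cycles, events, latencies):
    """Core cycles plus the latency-weighted sum of shadow events."""
    return (
        core_cycles
        + events.l2_hits * latencies.l2_hit
        + events.pref_hits * latencies.pref_hit
        + events.page_hits * latencies.page_hit
        + events.page_misses * latencies.page_miss
        + events.page_conflicts * latencies.page_conflict
    )
```

Feeding in the shadow counters for the full allocation gives the estimated execution time in the ideal (shadow) universe, which is what Progress-Time measures.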
18
Accounting for Bandwidth Stalls
• L2 misses and prefetcher statistics determine required bandwidth
• No bandwidth stall assumed if sufficient bandwidth
• If insufficient bandwidth, performance (IPC) degrades proportionally
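One possible reading of these three bullets, as a Python sketch. Both function names are hypothetical, and two details are assumptions: that required bandwidth is the line traffic from L2 misses plus issued prefetches divided by the stall-free execution time, and that "degrades proportionally" means execution time scales by required over allocated bandwidth.

```python
def required_bandwidth(l2_misses, prefetches_issued, line_bytes, exec_time_cycles):
    """Bytes per cycle needed to serve all misses and prefetches without stalls."""
    return (l2_misses + prefetches_issued) * line_bytes / exec_time_cycles


def apply_bandwidth_stall(exec_time_cycles, required_bw, allocated_bw):
    """Stretch execution time when the allocation cannot cover the demand."""
    if allocated_bw <= 0:
        raise ValueError("allocated bandwidth must be positive")
    if required_bw <= allocated_bw:
        return exec_time_cycles  # sufficient bandwidth: no stall assumed
    # Insufficient bandwidth: IPC drops in proportion to the shortfall,
    # so time grows by the same factor.
    return exec_time_cycles * (required_bw / allocated_bw)
```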
19
Evaluation Methodology
• Evaluate a 32-core instance similar to modern manycore processors
• 26 benchmarks from SPEC2K, SPEC2K6, and an I/O-centric suite
• Near unlimited combinations of simultaneous runs
• Compress run-space by classifying apps into streams, cliffs, and
slopes based on cache sensitivity
[Figure: three panels (stream, cliff, slope) plotting Requests per Kilo Insts against L2 Cache Size, with Prefetcher and NoPrefetcher curves]
20
Shadow Performance Model and Shadow Structures
Accurately Compute Progress-Time
[Figure: Average Estimation Accuracy (%) for each stream/slope composition and the AVERAGE; average accuracy 99%]
• TimeCube tracks Progress-Times with ~1% error
• No latency overheads
21
Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion
22
Progress-Tables in TimeCube
[Figure: Progress-Table grid with Cache (0% to 100%) and Bandwidth (0% to 100%) axes; each cell holds Execution-Time for app i, cache c and bandwidth b]
• One Progress-Table (Ptable) per application
• Memory bandwidth binned in 1% increments
• Last-level cache arrays allocated in powers of two
• Progress-Time accumulated over intervals using last cell
23
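A minimal sketch of how such a table could be filled, assuming that for every power-of-two cache size we already have a stall-free time estimate and a required bandwidth. The function name, the input layout, and the proportional slowdown rule are assumptions for illustration, not the hardware's actual procedure.

```python
def build_progress_table(per_cache, total_bw, bw_bins=100):
    """Return {cache_size: [time at 1% bandwidth, ..., time at 100%]}.

    per_cache maps a cache size to (stall_free_time, required_bw).
    Row index k corresponds to a bandwidth share of (k + 1) bins.
    """
    table = {}
    for cache_size, (time_cycles, required_bw) in per_cache.items():
        row = []
        for share in range(1, bw_bins + 1):
            allocated_bw = total_bw * share / bw_bins
            # Assumed rule: no stall if the share covers the demand,
            # otherwise time grows in proportion to the shortfall.
            slowdown = max(1.0, required_bw / allocated_bw)
            row.append(time_cycles * slowdown)
        table[cache_size] = row
    return table
```

With 1% bins and a handful of power-of-two cache sizes the table stays small, which fits the slide's point that it can be produced every interval.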
Shadow Structures 2.0
• Shadow Tags
• Measure cache miss rates for all power-of-two cache allocations
• LRU-stacking reduces overhead
• Shadow Prefetchers
• Add one instance for each cache allocation
• Shadow Banking
• Add one instance for each cache allocation
Same performance model is used as for Progress-Time.
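The LRU-stacking bullet can be illustrated with the classic stack-distance property of LRU: one pass over the access stream yields miss counts for every cache size at once. The sketch below is a software analogy under the simplifying assumption of a fully associative cache measured in lines; the function name is hypothetical and the actual hardware organization is not described on the slide.

```python
def misses_per_allocation(line_addresses, sizes_in_lines):
    """Count LRU misses for several cache sizes in a single pass.

    An access hits in a cache of `size` lines exactly when its LRU
    stack depth (1 = most recently used) is at most `size`.
    """
    stack = []  # most recently used line at index 0
    misses = {size: 0 for size in sizes_in_lines}
    for line in line_addresses:
        if line in stack:
            depth = stack.index(line) + 1
            stack.remove(line)
        else:
            depth = None  # first touch: a miss at every size
        stack.insert(0, line)
        for size in sizes_in_lines:
            if depth is None or depth > size:
                misses[size] += 1
    return misses
```

For example, `misses_per_allocation([1, 2, 3, 1, 2, 3, 4, 1], [1, 2, 4])` reports that only the 4-line cache avoids the re-reference misses.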
24
Progress-Tables Examples
[Figure: Progress-Tables for astar, hmmer, and specrand; axes: Cache (64KB to 16MB) and Bandwidth (%); color scale 0 to 1]
• Ptables provide accurate mapping from resource allocation to slowdown
• TimeCube can use these maps to guarantee QoS for applications
• Overall as well as per-interval QoS control
25
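To make the QoS use of a Ptable concrete, here is a small Python sketch that scans a table for the allocations meeting a slowdown bound. The function name and the table layout (a dict from cache size to a list of times per bandwidth bin) are assumptions for illustration.

```python
def feasible_allocations(ptable, full_time, max_slowdown):
    """For each cache size, find the smallest bandwidth bin that keeps
    execution time within max_slowdown * full_time.

    Returns {cache_size: bin_number}; cache sizes that cannot meet the
    bound at any bandwidth are omitted. Bin numbers start at 1.
    """
    limit = max_slowdown * full_time
    result = {}
    for cache_size, row in ptable.items():
        for index, time_cycles in enumerate(row):
            if time_cycles <= limit:
                result[cache_size] = index + 1
                break
    return result
```

A resource manager could then pick whichever feasible pair leaves the most resources for other applications.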
Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion
26
Allocating Execution Resources: SPOT
• Key Idea: Run optimization algorithm over application
Progress-Tables to maximize an objective function
• Objective Function: Mean Progress-Times of all
applications, accumulated over all intervals so far and
the upcoming one
• Geometric-Mean balances throughput and fairness
• The geomean can be approximated to:
27
Implementation: Maximizing the
Mean Progress-Time
[Figure: Simultaneous Performance Optimization Table (SPOT), a cube indexed by Applications, Cache, and Bandwidth (each 0, 1, …, All); each cell holds the Mean Progress-Time for i Apps, j Cache, k BW; the corner cell holds the Max Mean Progress-Time]
• Bin-packing: Distribute resources among applications to maximize mean
• Clever algorithm allows optimal solution in pseudo-polynomial time
• <All,All,All> corner gives maximum mean and corresponding allocation
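The bullets above describe a table filled over applications, cache, and bandwidth. A textbook dynamic program with that shape is sketched below; it is not claimed to be the authors' algorithm. The function name and input layout are assumptions, resources are counted in abstract units, and the geometric mean is maximized by summing logarithms of per-application relative progress.

```python
import math


def spot_allocate(progress_tables, total_cache, total_bw):
    """Choose one (cache, bw) option per app to maximize the geomean.

    progress_tables[i] maps (cache_units, bw_units) to the relative
    progress (0 < p <= 1) of app i under that allocation.
    Returns (geomean, [(cache_units, bw_units) for each app]).
    """
    # best[c][b] = (best log-sum, allocations) for the apps handled so
    # far, using at most c cache units and b bandwidth units.
    best = [[(0.0, []) for _ in range(total_bw + 1)]
            for _ in range(total_cache + 1)]
    for table in progress_tables:
        nxt = [[(float("-inf"), None) for _ in range(total_bw + 1)]
               for _ in range(total_cache + 1)]
        for c in range(total_cache + 1):
            for b in range(total_bw + 1):
                for (ci, bi), progress in table.items():
                    if ci > c or bi > b or progress <= 0:
                        continue
                    prev_score, prev_alloc = best[c - ci][b - bi]
                    if prev_alloc is None:
                        continue  # earlier apps do not fit in the remainder
                    score = prev_score + math.log(progress)
                    if score > nxt[c][b][0]:
                        nxt[c][b] = (score, prev_alloc + [(ci, bi)])
        best = nxt
    score, alloc = best[total_cache][total_bw]  # the <All,All,All> corner
    if alloc is None:
        raise ValueError("no feasible allocation")
    return math.exp(score / len(progress_tables)), alloc
```

The running time grows with the number of applications times the number of cache units times the number of bandwidth units times the options per application, which is pseudo-polynomial in the resource totals.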
28
Real-Time TimeCube Resource Allocation
• Interval-based TimeCube execution
• Statistics collected during execution
• Every interval:
• Estimate Progress-Times
• Allocate resource partitions
• Reconfigure partitions
[Figure: timeline for Interval n along a time axis: Execute and Collect Stats, Create pTables, Resource Allocation, Reconfiguration]
• Done in parallel with execution
29
Progress-Based Allocation Improves Throughput
[Figure: Normalized System Throughput, TimeCube vs Baseline, for each stream/slope composition and the AVERAGE; annotations: 77% and 36%]
• Allocating resources simultaneously increases throughput
• As much as 77% increase, 36% improvement on average
30
Maximizing Geometric Mean Provides Fairness
[Figure: Normalized performance for slowest application, Progress-based vs Miss-based, for each stream/slope composition and the AVERAGE; annotations: 57% and 19%]
• Worst-case performance improves by 19% on average
• As much as 57% worst-case improvement
31
TimeCube’s Mechanisms are Energy-Efficient
Others (0.45%)
Memory Access (11.16%)
L2 Access (0.50%)
L1 Access (12.96%)
Pipeline (34.51%)
pTables (0.01%)
Prefetcher (12.52%)
L1 Evict (1.06%)
L2 Evict (26.84%)
• Progress-Time Mechanisms consume < 0.5% energy
• Shadow structures consume 0.23%
• Ptable calculation consumes just 0.01%
• SPOT calculation consumes 0.18%
32
TimeCube’s Mechanisms are Area-Efficient
• Progress-Time Mechanisms consume < 7% area
• Shadow Tags consume 1.40%
• Ptables consume 1.11%
• SPOT consumes 3.20%
33
Related Work
• Measuring Execution Quality [Progress-Time]
• Analytical: Solihin [SC’99], Kaseridis [HPCA’10]
• Regression: Eyerman [ISPASS’11]
• Sampling: Yang [ISCA’13]
• Enforcing Execution Guarantees [Progress-Tables]
• RT systems: Lipari [RTAS’00], Bernat [RTS’02], Beccari [RTS’05]
• Offline: Mars [ISCA’13], Fedorova [ATC’05]
• Allocating Execution Resources [SPOT]
• Adaptive: Hsu [PACT’06], Guo [MICRO’07]
• Offline: Bitirgen [MICRO’08], Liu [HPCA’04]
34
Conclusions
• Problem: Interference on multicore processors can lead to
large unpredictable slowdowns.
• How to measure execution quality: Progress-Time
• We can track live application progress with high accuracy (~1% error) and low
overheads (0.5% performance, < 0.5% energy, < 7% area).
• How to enforce execution guarantees: Progress-Tables
• We can use Progress-Tables to precisely control the QoS provided, on-the-fly.
• How to allocate execution resources: SPOT
• We can use SPOT to improve both throughput and fairness (36% and 19% on
average, 77% and 57% in best-case).
• Multicore processors can employ these three mechanisms, demonstrated
through TimeCube, to make them more attractive for embedded systems.
35
Thank You
Questions?
36
Backup Slides
37
Problem: Resource Sharing Causes Interference
[Figure: Execution Time Normalized to Standalone Execution Time for Benchmarks (with 0−3 Background Applications): 175.vpr, 181.mcf, 300.twolf, 401.bzip2, 429.mcf, 458.sjeng, 462.libquantum, 470.lbm]
• Unpredictable slowdown during concurrent execution
• Can lead to failed QoS guarantees
38
Progress-Tables
[Figure: Progress-Table with axes RESOURCE 0, RESOURCE 1, …, RESOURCE n and Applications; each cell holds Execution-Time [res0] [res1]…[resn] for app i]
• Progress-Time for a spectrum of resource allocations
• Provide information for resource management at the right granularity
39
Dynamic Execution Isolation Reduces Interference
• TimeCube partitions shared resources for dynamic execution isolation
• Last-Level Cache Partitioning
• Associative Cache Partitioning allocates cache ways to applications
• Virtual Private Caches [Nesbit ISCA 2007]
• Memory Bandwidth Partitioning
• Memory bandwidth is dynamically allocated between applications
• Fair Queuing Arbiter [Nesbit MICRO 2006] for memory scheduling
• DRAM Capacity Partitioning
• DRAM memory banks are split between applications
• Row buffers fronting these banks are also partitioned as a result
• OS page management maintains physical memory bank allocation
40
Prefetcher Throttling Increases Bandwidth Utilization
[Figure: prefetcher throttling block diagram with L2, Prefetch Buffer, Stream Tracker, Prefetcher, Shadow Prefetcher, Prefetch Filter, Prefetch Aggression Controller, Throttler, and Memory; signals: Misses, Prefetches, Aggression Level, Allocated BW, Required BW with and w/o Prefetching]
• Filter fixed ratio of prefetches based on aggression level, such that
required BW is just above allocated BW
• Shadow Performance Model augmented to give required BW
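A hedged sketch of how a throttler of this kind might pick its level. Everything here is an assumption layered on the bullets: level k of 9 forwards k/8 of the prefetches, required bandwidth is interpolated linearly between the model's estimates without and with prefetching, and the chosen level is the lowest one whose required bandwidth reaches the allocation. All names are hypothetical.

```python
NUM_LEVELS = 9  # aggression levels 0..8; level k forwards k/8 of prefetches


def required_bw_at(bw_without_pref, bw_with_pref, level):
    """Assumed linear interpolation between the two model estimates."""
    fraction = level / (NUM_LEVELS - 1)
    return bw_without_pref + (bw_with_pref - bw_without_pref) * fraction


def choose_aggression_level(bw_without_pref, bw_with_pref, allocated_bw):
    """Lowest level whose required bandwidth reaches the allocation."""
    for level in range(NUM_LEVELS):
        if required_bw_at(bw_without_pref, bw_with_pref, level) >= allocated_bw:
            return level
    return NUM_LEVELS - 1  # even full aggression fits: do not throttle


def should_forward(prefetch_index, level):
    """Forward `level` out of every 8 prefetches, evenly spaced."""
    n = NUM_LEVELS - 1
    return (prefetch_index * level) // n != ((prefetch_index + 1) * level) // n
```

The filter is deterministic so that the forwarded ratio is fixed for a given level, matching the "fixed ratio" wording of the slide.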
41
Prefetcher Throttling Chooses the Right-Level
[Figure: Throughput per app against Bits per cycle per core (BW); one panel compares NoThrottling and Throttling, another compares NoPrefetching and ThrottledPrefetching]
• Nine Aggression-Levels used
• Throttler chooses the right level to give pareto-optimal curve
• Prefetcher throttling efficiently utilizes the available bandwidth
42
Multicore Processors Share Resources
Low-Power
Intel “Haswell”
Architecture
• Leads to increased utilization
• Lower per core resources on manycore processors
• Increasing pressure to share resources
44
Shadow Performance Model and Shadow Structures
Accurately Compute Progress-Time
• TimeCube tracks Progress-Times with ~1% error
• Performance overheads due to reconfiguration are < 0.5%
46
Objective: Maximizing Mean Progress-Time
• TimeCube allocates resources between applications
to maximize the Mean Progress-Times
• Geometric-Mean balances throughput and fairness
• The geometric mean can be approximated to:
48
Measuring Execution
Progress: Progress-Time
• What do we need to compute Progress-Time?
Execution
Stats
Ideal (Shadow) Universe
Current Universe
49
Solution: Track Live Application Progress
[Figure: App1 and App2 each alone on a Processor, then App1 and App2 sharing one Processor, along a time axis]
• Determine and control QoS provided to applications “online”
• We quantify application progress using Progress-Time:
Progress-Time is the amount of time required for an application to
complete the same amount of work it has done so far, were it to have
been allocated all CPU resources.
50
TimeCube: A Progress-Tracking Processor
[Figure: App1 and App2 each alone on a Processor, then sharing one TimeCube Processor that can Track & Use Progress-Time, along a time axis]
• TimeCube is a manycore processor
• Augmented to track & use live Progress-Times
• Embedded domains can use TimeCube to guarantee QoS
51
TimeCube Periodically Estimates Progress-Times
[Figure: TimeCube block diagram: Dynamic Execution Isolation (Last Level Cache, Memory Bandwidth, DRAM Banks), Shadow Performance Modeling (Shadow Cache, Shadow Prefetcher, Shadow Banking), and Resource Management, linked by Execution Stats, Progress-Time Tables, and Resource Allocations]
• Concurrent execution on dynamically isolated resources
• Dynamically partition critical shared resources
• Fine-grained QoS control
• Shadow performance model estimates Progress-Time
• Uses execution statistics
• Statistics from shadow structures
• Progress-Time estimates used for shared resource management
52
Isolation Can’t Remove Performance Interference
[Figure: Progress Time Tables for astar, hmmer, and specrand; Relative Progress Time (0 to 1) over Cache (2^0 to 2^10) and Bandwidth (%)]
• Isolation removes resource interference only
• Performance not linearly related to resource allocation
• Same resource allocations can lead to different performance
• TimeCube uses Shadow Performance Modeling to estimate
performance impact of different resource allocations
55