Accelerating Asynchronous Programs through Event Sneak Peek Gaurav Chadha, Scott Mahlke, Satish Narayanasamy 17 June 2015 University of Michigan Electrical Engineering and Computer Science.


Asynchronous programs are ubiquitous
• Web
• Mobile
• Servers (node.js)
• Internet-of-Things
• Sensor networks
Asynchronous programming hides I/O latency
[Figure: in the synchronous (sequential) model, Task 1, Task 2, and Task 3 each wait for I/O before the next begins; the asynchronous model overlaps computation with the I/O waits, yielding a speedup.]
Asynchronous programming is well-suited
to handle a wide array of asynchronous inputs
• Computation is driven by events
• The Hollywood Principle (“Don’t call us, we’ll call you”)
Illustration: Asynchronous Programming Model
[Figure: web events such as onClick, getLocation, and onImageLoad are pushed onto an event queue; a looper thread waits on events and pops one event at a time for execution.]
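The looper-thread model above can be sketched as a minimal event loop. This is a pedagogical Python sketch; the `EventQueue` class and its method names are illustrative, not browser internals:

```python
from collections import deque

class EventQueue:
    """FIFO queue of pending events; the looper pops one at a time."""
    def __init__(self):
        self.pending = deque()

    def post(self, name, handler):
        # Events arrive asynchronously and are queued for later execution.
        self.pending.append((name, handler))

    def run(self):
        # The looper thread waits on events and executes handlers
        # one at a time, in arrival order.
        log = []
        while self.pending:
            name, handler = self.pending.popleft()
            handler()
            log.append(name)
        return log

q = EventQueue()
q.post("onClick", lambda: None)
q.post("getLocation", lambda: None)
q.post("onImageLoad", lambda: None)
order = q.run()  # handlers run sequentially, in queue order
```

This one-at-a-time discipline is exactly the sequential-execution guarantee of the HTML5 model discussed later.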
Conventional architecture is not optimized for asynchronous programs
Short events execute varied tasks:
• Large instruction footprint
• Destroys cache locality
• Little hot code causes poor branch prediction
[Figure: processor view — a stream of short, varied events drawn from the event queue.]
Large performance improvement potential in asynchronous programs
[Charts: web apps suffer much higher branch misprediction rates, L1-I MPKI, and L1-D miss rates than SPECint 2006 and PARSEC; the maximum performance improvement from removing these bottlenecks in web apps is 52–79%.]
Execute asynchronous program on a specialized Event Sneak Peek (ESP) core
[Figure: a heterogeneous multi-core processor; the browser engine (WebCore) runs parse, layout, CSS, and render tasks on the CPU, while asynchronous JavaScript events run on the ESP core. (Zhu & Reddi, ISCA ’14)]
How to customize a core
for asynchronous programs?
HTML5 asynchronous programming model guarantees sequential execution of events
[Figure: the looper thread executes events from the event queue one at a time, in order.]
Opportunity: Event-Level Parallelism (ELP)
• Advance knowledge of future events
• Events are functionally independent
• How to exploit this ELP?
#1: Parallel Execution
• Events are not provably independent
#2: Optimistic Concurrency
Speculative parallelization (e.g., transactions)
• >99% of event pairs conflict
• Primarily, low-level memory dependencies
  – Maintenance code
  – Memory pool recycling
  – …
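Optimistic concurrency would run events transactionally and roll back on conflict. A minimal sketch of the conflict check, with hypothetical read/write sets (the addresses and event contents are illustrative, not the paper's data), shows why shared low-level state dooms this approach:

```python
def conflicts(ev_a, ev_b):
    """Two events conflict if one writes an address the other
    reads or writes (the standard transactional-conflict test)."""
    return bool(ev_a["writes"] & (ev_b["reads"] | ev_b["writes"]) or
                ev_b["writes"] & (ev_a["reads"] | ev_a["writes"]))

# Shared low-level state, e.g. a memory pool's free-list pointer,
# is touched by almost every event, so almost every pair conflicts.
POOL_FREE_LIST = 0xF00
ev1 = {"reads": {0x10, POOL_FREE_LIST}, "writes": {POOL_FREE_LIST}}
ev2 = {"reads": {0x20, POOL_FREE_LIST}, "writes": {POOL_FREE_LIST}}
conflict = conflicts(ev1, ev2)
```

Even though the events' application-level work (0x10 vs. 0x20) is disjoint, the maintenance code makes them conflict, matching the >99% figure above.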
Observation: speculative pre-execution is a good match
• 98% of events “match” with a 99% accuracy
  – Control flow paths
  – Addresses
How to customize a core
for asynchronous programs?
Exploit ELP using
speculative pre-execution
ESP Design: Expose event queue to hardware
[Figure: the software event queue is exposed through the ISA to a hardware event queue.]
ESP Design: Speculatively pre-execute future events on stalls
[Figure: on an LLC miss, ESP pre-executes future events from the hardware event queue, isolating and memoizing their effects; because those events may run millions of instructions later, the memoized state is used to trigger timely warm-up, yielding a speedup.]
Realizing ESP design
Isolation · Memoization · Triggering
• Correctness
  – Isolate speculative updates
• Performance
  – Avoid destructive interference between execution contexts
Isolation of multiple execution contexts: Register State
[Figure: the core pipeline adds an ESP copy of the PC and the RRAT beside the fetch unit, isolating speculative register state.]
Isolation of multiple execution contexts: Memory State
• Cachelets isolate speculative updates
• Performance:
  – Avoid L1 pollution
  – Capture 95% of reuse
[Figure: small I- and D-cachelets sit beside the L1-I and L1-D caches for ESP-mode accesses.]
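A cachelet can be sketched as a tiny direct-mapped buffer that absorbs speculative fills so the L1 is never polluted. The sizes and the `Cachelet` class are illustrative assumptions, not the paper's exact design:

```python
class Cachelet:
    """Tiny direct-mapped buffer isolating speculative accesses
    from the L1 cache (illustrative geometry: 8 lines of 64 B)."""
    def __init__(self, n_lines=8, line_size=64):
        self.n_lines = n_lines
        self.line_size = line_size
        self.lines = {}  # index -> tag

    def access(self, addr):
        index = (addr // self.line_size) % self.n_lines
        tag = addr // (self.line_size * self.n_lines)
        hit = self.lines.get(index) == tag
        # On a miss, the fill goes into the cachelet, never the L1,
        # so speculative pre-execution cannot evict the main event's
        # working set.
        self.lines[index] = tag
        return hit

c = Cachelet()
first = c.access(0x1000)   # cold miss: speculative fill into cachelet
second = c.access(0x1000)  # reuse within ESP mode hits the cachelet
```

Because ESP-mode reuse is highly local, even a small buffer like this captures most of it (95% in the slide above) while keeping the L1 clean.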
Isolation of multiple execution contexts: Branch Predictor
• PIR tracks path history
• Isolating the PIR is adequate
[Figure: ESP keeps its own PIR; the predictor tables are shared.]
Realizing ESP design
Isolation · Memoization · Triggering
Warm-up during speculative pre-execution is ineffective
• Future events might execute millions of instructions later
Memoization of architectural bottlenecks: Addresses
• Record instruction and data addresses, along with instruction count
[Figure: ESP adds an I-List beside the I-cachelet and L1-I cache, and a D-List beside the D-cachelet and L1-D cache.]
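The I-List/D-List recording step can be sketched as follows. The trace format (a list of `(pc, data_addr_or_None)` pairs) and the function name are illustrative assumptions:

```python
def memoize_event(trace):
    """During speculative pre-execution, record instruction and data
    addresses along with the instruction count at which they occur.
    `trace` is a list of (pc, data_addr_or_None) pairs."""
    i_list, d_list = [], []
    for count, (pc, data_addr) in enumerate(trace):
        i_list.append((pc, count))          # I-List: fetch addresses
        if data_addr is not None:
            d_list.append((data_addr, count))  # D-List: load/store addrs
    return i_list, d_list

# A toy pre-executed event: four instructions, two memory accesses.
trace = [(0x400, None), (0x404, 0x9000), (0x408, None), (0x40C, 0x9040)]
i_list, d_list = memoize_event(trace)
```

Attaching an instruction count to every entry is what later lets ESP replay the lists at the right time, rather than all at once when the event finally runs.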
Memoization of architectural bottlenecks: Branches
• Record branch outcomes
  – Branch address, directions and targets, instruction count
[Figure: a B-List augments the branch predictor alongside the PIR and predictor tables.]
Realizing ESP design
Isolation · Memoization · Triggering
Use memoized lists:
• Launch timely prefetches
• Warm up the branch predictor ahead of branches
Triggering timely prefetches using memoized information
[Figure: each memoized entry pairs an address with an instruction count; ESP launches the prefetch ~100 instructions before the current instruction count reaches the recorded count.]
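The trigger logic can be sketched as a comparison between the current instruction count and each memoized entry's recorded count, with a fixed lead. The ~100-instruction lead comes from the slide; the function and variable names are illustrative:

```python
LEAD = 100  # launch prefetches ~100 instructions ahead of need

def prefetches_due(d_list, current_count, issued):
    """Return addresses whose recorded instruction count is within
    LEAD instructions of the current count, issuing each only once."""
    due = []
    for addr, count in d_list:
        if addr not in issued and count <= current_count + LEAD:
            due.append(addr)
            issued.add(addr)
    return due

# Memoized D-List from pre-execution: (address, instruction count).
d_list = [(0x9000, 50), (0x9040, 250)]
issued = set()
first = prefetches_due(d_list, 0, issued)     # 50 <= 0 + 100: due now
second = prefetches_due(d_list, 160, issued)  # 250 <= 160 + 100: due now
```

Issuing on this sliding window, rather than dumping the whole list at event start, is what makes the prefetches timely instead of merely early.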
Baseline Architecture
[Figure: core pipeline with fetch unit, PC, and RRAT; branch predictor with PIR and predictor tables; L1-I and L1-D caches with next-line (NL-I) and next-line + stride (NL-D,S) prefetchers; shared L2 cache.]
ESP Architecture
[Figure, built up over four slides: the baseline core is augmented with a hardware event queue and an ESP mode bit; ESP copies of the PC and PIR; I- and D-cachelets beside the L1 caches; the memoized I-, D-, and B-Lists; and finally a second ESP context (ESP-1, ESP-2).]
Methodology
• Timing: trace-driven simulation with Sniper
  – Instrumented Chromium
  – Collected and simulated traces of JavaScript events
• Energy: McPAT and CACTI

Architectural Model
• Core: 4-wide issue, OoO, 1.66 GHz
• L1-(I,D) Cache: 32 KB, 2-way
• L2 Cache: 2 MB, 16-way
• Energy Modeling: Vdd = 1.2 V, 32 nm
Limitations of Runahead [Dundas et al. ’97; Mutlu et al. ’03]
• Reduces data cache misses
  – Not a significant problem in web applications
• Cannot mitigate I-cache misses
• Does not exploit ELP
  – No notion of events
  – Future events are a rich source of independent instructions
[Figure: runahead speculatively pre-executes past a data cache miss within the current event only.]
Events are short and execute varied tasks
Example action: “Buy headphones” — 7,787 events, 433 million instructions.
[Table: per-app event sizes for amazon, bing, cnn, facebook, gdocs, gmaps, and pixlr, ranging from ~53k to ~472k instructions per event.]
• Large instruction footprint
• Destroys cache locality
• Little hot code causes poor branch prediction
ESP outperforms other designs
Performance improvement w.r.t. no prefetching (%):
• Baseline (next-line (NL) + stride): 14.0
• Runahead: 12.5 (21.3 with NL)
• ESP: 21.8 (32.1 with NL)
Largest performance improvement comes from improved I-cache performance
[Chart: performance improvement (%) broken down by I-cache, branch predictor, and D-cache; ESP achieves 21, 28, and 32% against maxima of 52, 69, and 79%.]
ESP consumes less static energy, but expends more dynamic energy
[Chart: static and dynamic energy w.r.t. no prefetching, for ESP and NL.]
ESP executes 21% more instructions, but consumes only 8% more energy.
Hardware area overhead
[Chart: storage for cachelets, lists, and registers — ESP-1: 12.6 KB; ESP-2: 1.2 KB.]
Summary
• Accelerators for asynchronous programs
• ESP exploits Event-Level Parallelism (ELP)
  – Expose event queue to hardware
  – Speculatively pre-execute future events
• Performance: 16%
Jumping ahead two events is sufficient
[Chart: number of cache lines covered (log scale, 1–10,000) vs. events jumped ahead, with 85% and 95% of the maximum marked.]
Impact of JS execution on response time
[Chart: client delay broken down into JavaScript, DOM, CSS, network, and server components. (Chow et al. ’14)]