Accelerating Asynchronous Programs through Event Sneak Peek
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
17 June 2015
University of Michigan, Electrical Engineering and Computer Science
Asynchronous programs are ubiquitous
• Web
• Mobile
• Servers (node.js)
• Internet-of-Things
• Sensor networks

Asynchronous programming hides I/O latency
• Synchronous, sequential model: Task 1 blocks while waiting for I/O, delaying Task 2 and Task 3
• Asynchronous model: Task 2 and Task 3 execute while Task 1's I/O is outstanding, yielding a speedup

Asynchronous programming is well suited to handling a wide array of asynchronous inputs
• Computation is driven by events
• The Hollywood Principle ("Don't call us, we'll call you")

Illustration: the asynchronous programming model
• Events such as onClick, getLocation, and onImageLoad are appended to an event queue
• A looper thread waits on the queue and pops one event at a time for execution (a minimal code sketch appears at the end of this section)

Conventional architecture is not optimized for asynchronous programs
• Short events execute varied tasks
• The large instruction footprint destroys cache locality
• Little hot code causes poor branch prediction

Large performance improvement potential in asynchronous programs
[Figure: branch misprediction rate, L1-I MPKI, and L1-D miss rate for web apps vs. SPECint 2006 and PARSEC; web apps fare markedly worse on all three metrics. An idealization study bounds the maximum performance improvement for web apps at 52%, 69%, and 79% as the L1-I cache, branch predictor, and L1-D cache are successively idealized.]

Execute asynchronous programs on a specialized Event Sneak Peek (ESP) core
• Heterogeneous multi-core processor: the browser engine (WebCore: parse, layout, CSS, render) runs on a conventional CPU core [Zhu & Reddi, ISCA '14]
• Asynchronous JavaScript events are dispatched to the specialized ESP core

How to customize a core for asynchronous programs?
• The HTML5 asynchronous programming model guarantees sequential execution of events
• Opportunity: Event-Level Parallelism (ELP)
  – Advance knowledge of future events
  – Events are functionally independent

How to exploit this ELP?
• Option #1, parallel execution: events are not provably independent
• Option #2, optimistic concurrency (speculative parallelization, e.g., transactions): >99% of event pairs conflict, primarily through low-level memory dependencies (maintenance code, memory pool recycling, ...)
• Observation: speculative pre-execution is a good match; 98% of events "match" with 99% accuracy in their control-flow paths and addresses
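The looper model in the illustration above is easy to state in code. The following is a minimal C++ sketch, assuming a single-threaded run-to-completion loop; the event names are taken from the illustration, and real browser engines and node.js implement this loop internally:

```cpp
// Minimal sketch of the looper-thread model (illustrative, not Chromium's
// actual implementation). A single thread pops events off a FIFO queue and
// runs each to completion, one at a time: the HTML5 sequential-execution
// guarantee that ESP relies on.
#include <functional>
#include <iostream>
#include <queue>

using Event = std::function<void()>;

int main() {
    std::queue<Event> eventQueue;

    // Events arrive from many asynchronous sources (clicks, timers,
    // geolocation callbacks, network completions).
    eventQueue.push([] { std::cout << "onClick\n"; });
    eventQueue.push([] { std::cout << "getLocation\n"; });
    eventQueue.push([] { std::cout << "onImageLoad\n"; });

    // The looper thread: each event runs to completion before the next
    // starts. Pending events sit visibly in the queue, which is exactly
    // the "advance knowledge of future events" that ESP exploits.
    while (!eventQueue.empty()) {
        Event e = std::move(eventQueue.front());
        eventQueue.pop();
        e();  // run to completion; no other event interleaves
    }
}
```

Note that the queue already holds the next events while the current one runs; ESP's contribution is making that queue visible to the hardware.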
How to customize a core for asynchronous programs? Exploit ELP using speculative pre-execution.

ESP design: expose the event queue to hardware
• The software event queue is mirrored in a hardware event queue that the core can inspect

ESP design: speculatively pre-execute future events on stalls
• When the current event stalls on an LLC miss, the core pre-executes a future event from the hardware event queue
• Speculative updates are isolated, and useful outcomes are memoized for later, yielding a speedup when the future event eventually runs

Realizing the ESP design requires three mechanisms: isolation, memoization, and triggering.

Isolation of multiple execution contexts
• Correctness: isolate speculative updates
• Performance: avoid destructive interference between execution contexts
• Register state: a separate PC and an ESP RRAT alongside the core pipeline and fetch unit
• Memory state: small cachelets (an I-Cachelet and a D-Cachelet in front of the L1-I and L1-D caches) isolate speculative updates, avoid polluting the L1s, and still capture 95% of reuse
• Branch predictor: the PIR tracks path history, and isolating just the PIR (a separate ESP PIR over shared predictor tables) is adequate

Memoization of architectural bottlenecks
• Warming up structures during speculative pre-execution is ineffective, because the future event might execute millions of instructions later
• Addresses: record instruction and data addresses, along with the dynamic instruction count, in an I-List and a D-List
• Branches: record branch outcomes (branch address, directions, targets, and instruction count) in a B-List

Triggering
• When the event executes for real, use the memoized lists to launch timely prefetches and to warm up the branch predictor just ahead of each branch
• A prefetch is started when the memoized instruction count is within ~100 instructions of the current instruction count (see the sketch at the end of this section)

Baseline architecture
• Core pipeline, RRAT, and fetch unit (PC); branch predictor with PIR and predictor tables
• Next-line instruction prefetcher (NL-I) and next-line + stride data prefetcher (NL-D, S)
• L1-I and L1-D caches backed by an L2 cache

ESP architecture (additions over the baseline)
• A hardware event queue and an ESP mode bit
• A second PC and an ESP PIR for the speculative context
• An I-Cachelet and a D-Cachelet in front of the L1 caches
• The I-List, D-List, and B-List memoization structures
• Two speculative contexts, ESP-1 and ESP-2, so the core can jump up to two events ahead

Methodology
• Timing: trace-driven simulation with Sniper; instrumented Chromium to collect and simulate traces of JavaScript events
• Energy: McPAT and CACTI

Architectural model
• Core: 4-wide issue, out-of-order, 1.66 GHz
• L1-I and L1-D caches: 32 KB, 2-way
• L2 cache: 2 MB, 16-way
• Energy modeling: Vdd = 1.2 V, 32 nm
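To tie isolation, memoization, and triggering together, here is an illustrative software model of the memoize-and-replay idea. The structure and function names (ListEntry, MemoLists, memoize, trigger) are hypothetical, not the hardware's actual interfaces; the ~100-instruction lead comes from the triggering slide above:

```cpp
// Hypothetical software model of ESP's memoize-and-trigger mechanism:
// during speculative pre-execution, miss addresses are recorded with the
// dynamic instruction count at which they occurred; during the event's real
// execution, entries are replayed as prefetches shortly before they are
// needed.
#include <cstdint>
#include <cstdio>
#include <vector>

struct ListEntry {
    std::uint64_t addr;        // instruction or data address that missed
    std::uint64_t instrCount;  // dynamic instruction count at the miss
};

// One list per bottleneck: I-List (instruction addresses) and D-List (data
// addresses). The B-List analogously records branch PCs, directions, and
// targets for branch-predictor warm-up; it is omitted here for brevity.
struct MemoLists {
    std::vector<ListEntry> iList, dList;
};

// Called during speculative pre-execution of a future event.
void memoize(MemoLists& m, std::uint64_t pc, std::uint64_t dataAddr,
             std::uint64_t count) {
    m.iList.push_back({pc, count});
    m.dList.push_back({dataAddr, count});
}

// Called as the event executes non-speculatively. `lead` is the prefetch
// lead time in instructions (~100 in the talk).
void trigger(const MemoLists& m, std::uint64_t currentCount,
             void (*prefetch)(std::uint64_t), std::uint64_t lead = 100) {
    for (const ListEntry& e : m.dList)
        if (e.instrCount > currentCount && e.instrCount - currentCount <= lead)
            prefetch(e.addr);  // timely: arrives just before the demand access
}

int main() {
    MemoLists m;
    // Recorded during speculative pre-execution of a future event:
    memoize(m, /*pc=*/0x400100, /*dataAddr=*/0x7fff00, /*count=*/150);
    // Later, during real execution at instruction 60, the entry falls within
    // the ~100-instruction lead window, so a prefetch fires:
    trigger(m, /*currentCount=*/60, [](std::uint64_t a) {
        std::printf("prefetch 0x%llx\n", (unsigned long long)a);
    });
}
```

In hardware, each list would be walked with a simple pointer as the instruction count advances rather than scanned in full; the sketch trades that detail for clarity.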
Limitations of runahead execution, a prior speculative pre-execution technique [Dundas et al., '97; Mutlu et al., '03]
• Runahead is triggered by a data cache miss and chiefly reduces data cache misses, which are not a significant problem in web applications
• It cannot mitigate I-cache misses
• It does not exploit ELP: runahead has no notion of events, yet future events are a rich source of independent instructions

Events are short and execute varied tasks
• Example action: "Buy headphones" comprises 7,787 events and 433 million instructions, roughly 55k instructions per event

  Web app    Avg. event size (instr)
  amazon     55k
  bing       53k
  cnn        91k
  facebook   232k
  gdocs      372k
  gmaps      472k
  pixlr      56k

• The resulting large instruction footprint destroys cache locality, and the scarcity of hot code causes poor branch prediction

ESP outperforms other designs (performance improvement over no prefetching, in %)
• ESP: 21.8; runahead: 12.5; baseline (next-line + stride prefetching): 14.0
• With next-line prefetching added: ESP + NL: 32.1; runahead + NL: 21.3; baseline: 14.0

The largest performance improvement comes from improved I-cache performance
• Cumulative improvement (%): I-cache alone, ESP 21 (max 52); adding the branch predictor, ESP 28 (max 69); adding the D-cache, ESP 32 (max 79)

ESP consumes less static energy but expends more dynamic energy
• Static and dynamic energy relative to no prefetching, ESP vs. next-line prefetching
• ESP executes 21% more instructions, but consumes only 8% more energy

Hardware area overhead
[Figure: storage added by the ESP-1 and ESP-2 contexts (cachelets, memoization lists, registers); the reported totals are 12.6 KB and 1.2 KB.]

Summary
• Accelerators for asynchronous programs
• ESP exploits Event-Level Parallelism (ELP) by exposing the event queue to hardware and speculatively pre-executing future events
• Performance: 16% over a baseline with conventional prefetching (32.1% vs. 14.0% over no prefetching)

Backup: jumping ahead two events is sufficient
[Figure: number of cache lines, on a log scale from 1 to 10,000, as a function of lookahead; annotations at 95% and 85%.]

Backup: impact of JavaScript execution on response time
[Figure: response time broken down into JavaScript, DOM, CSS, network, server, and client delay; Chow et al., '14.]
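One way to read the backup result is as a footprint-overlap measurement: how much of an event's cache-line working set already appears among its recent neighbors. The sketch below is hypothetical (the coverage function and the toy footprints are stand-ins for traces from an instrumented browser) but shows the set arithmetic such a study needs:

```cpp
// Hypothetical measurement sketch: fraction of event i's cache-line
// footprint that appears in any of its k preceding events. Toy footprints
// stand in for real traces; the point is the overlap computation.
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

using Footprint = std::unordered_set<std::uint64_t>;  // 64B-line addresses

double coverage(const std::vector<Footprint>& events, std::size_t i,
                std::size_t k) {
    std::size_t hit = 0;
    for (std::uint64_t line : events[i]) {
        // Was this line touched by any of the k events just before event i?
        for (std::size_t j = (i > k ? i - k : 0); j < i; ++j)
            if (events[j].count(line)) { ++hit; break; }
    }
    return events[i].empty() ? 0.0 : double(hit) / events[i].size();
}

int main() {
    // Made-up per-event footprints; real data would come from traces.
    std::vector<Footprint> events = {
        {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}, {1, 3, 5, 7}};
    for (std::size_t k = 1; k <= 3; ++k)
        std::cout << "lookahead " << k << ": "
                  << 100.0 * coverage(events, 3, k) << "% covered\n";
}
```

Sweeping k over real traces would show where the coverage curve flattens; the backup figure indicates it flattens by a lookahead of two, which is why ESP provisions exactly two speculative contexts (ESP-1 and ESP-2).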