Transcript [talk]
Karma: Scalable Deterministic Record-Replay
Arkaprava Basu, Jayaram Bobba, Mark D. Hill
Work done at University of Wisconsin-Madison
UW-Madison Computer Sciences, Multifacet Group © 2011

Executive summary
• Applications of deterministic record-replay
  – Debugging
  – Fault tolerance
  – Security
• Existing hardware record-replayer
  – Fast record but
  – Slow replay or
  – Requires major hardware changes
• Karma: Faster Replay with nearly-conventional h/w
  – Extends Rerun
  – Records more parallelism

Outline
• Background & Motivation
• Rerun Overview
• Karma Insights
• Karma Implementation
• Evaluation
• Conclusion

Deterministic Record-Replay
• Multi-threaded execution non-deterministic
• Deterministic record-replay to reincarnate past execution
• Record:
  – Record selective events in a log
• Replay:
  – Use the log to reincarnate past execution
• Key Challenge: Memory races

Record-Replay Motivation
• Debugging
  – Ensures bugs faithfully reappear (no heisenbugs)
  – Replay speed matters
• Fault-Tolerance
  – Enable hot backup for primary server to shadow primary & take over on failure
• Security
  – Real time intrusion detection & attack analysis

Previous work
• Record Dependence
  – Wisconsin Flight Data Recorder [ISCA’03, etc.]: Too much state
  – UCSD Strata [ASPLOS’06]: Log size grows rapidly with #cores
• Record Independence
  – UIUC DeLorean [ISCA’08]: Non-conventional BulkSC H/W
  – Wisconsin Rerun [ISCA’08]: Sequential replay
  – Intel MRR [MICRO’09]: Only for snoop based systems
  – Timetraveler [ISCA’10]: Extends Rerun to lower log size
• Our Goal
  – Retain Rerun’s near-conventional hardware
  – Enable Faster Replay

Rerun’s Recording
• Most code executes without races
  – Use race-free regions for ordering
• Episodes: independent execution regions
  – Defined per thread
[Figure: memory-access streams of threads T0, T1, and T2]
  T0: LD A, ST B, ST C, LD F, LD X, LD Q, ST Q, ST K, LD Z
  T1: ST E, LD B, ST X, LD R, ST T, ST C, ST E, ST X
  T2: ST V, ST Z, LD W, LD J, LD J, LD V
(Partially adopted from ISCA’08 talk)

Rerun’s Recording (Contd.)
• Capturing causality:
  – Timestamp via Lamport scalar clock [Lamport ‘78]
• Replay in timestamp order
  – Episodes with same timestamp can be replayed in parallel
[Figure: episodes of T0, T1, and T2 labeled with scalar timestamps (values between 22 and 62)]

Rerun’s Replay
[Figure: replay timeline for T0, T1, and T2 with episodes ordered by timestamp: TS=22, TS=43, TS=44, TS=45, TS=60, TS=61]

Karma’s Insight 1:
• Capture order with DAG (not scalar clock)
• Recording: DAG captured with episode predecessor & successor sets
[Figure: the same episodes of T0, T1, and T2, connected by DAG arcs]

Karma’s Insight 1:
[Figure: replay timelines for T0, T1, and T2, comparing "Rerun’s Replay" with "Karma’s Replay"]

Karma’s Insight 1: (Contd.)
• Naïve approach: DAG arcs point to episodes
  – Episode represented by integers
  – Too much log size overhead !!
• Our approach: DAG arcs point to cores
  – Recording: Only one “active” episode per core
  – Replay: Send wakeup message(s) to core(s) of successor episode(s)

Karma’s Insight 1:
[Figure: "Anatomy of a log entry", drawn over the T0/T1/T2 episode diagram; the example entry shows the value 84 and two three-bit fields, 0|0|1 and 0|0|1]

Karma Insight 2:
• Not necessary to end the episode on every conflict:
  – As long as the episodes can be ordered during replay
[Figure: memory-access streams of threads T0, T1, and T2]
  T0: LD A, ST B, ST C, LD F, LD X, LD Q, ST Q, ST K, LD Z
  T1: ST E, LD B, ST X, LD R, ST T, ST C, ST E, ST X
  T2: ST V, ST Z, LD W, LD J, LD J, LD V

Karma Hardware
[Figure: base system with cores (Core 0, Core 1, …), L2 banks, interconnect, and DRAM]
• Per-core state: Address Filter (FLT), Reference (REFS), Predecessor (PRED), Successor (SUCC), Timestamp (TS)
• Figure labels: "14 bytes/core"; "Total State: 148"

Evaluation:
• Were we able to speed up the replay?
[Figure: Apache. Bars: Base, Rerun Replay, Karma Replay. x-axis: Number of cores-L2 cache size (4core-4MB, 8core-8MB, 16core-16MB); y-axis: Speedup normalized to "Base" of corresponding configuration]

Evaluation:
• Were we able to speed up the replay?
[Figure: four panels (Apache, Jbb, Oltp, Zeus). Bars: Base, Rerun Replay, Karma Replay. x-axis: Number of cores-L2 cache size (4core-4MB, 8core-8MB, 16core-16MB); y-axis: Speedup normalized to "Base" of corresponding configuration]
• On average ~4X improvement in replay speed over Rerun

Evaluation
• Did we blow up log size?
[Figure: Karma log size normalized to Rerun's log size, for Apache, Zeus, Oltp, and Jbb. x-axis: Maximum allowable Episode size (128, 256, 512, 1024, 2048, 4096, 8192, Unbounded)]
• On average Karma does not increase the size of the log but instead reduces it by as much as 40% as we allow larger episodes

Conclusion
• Applications of deterministic replay
  – Debugging
  – Fault tolerance
  – Security
• Existing hardware record-replayer
  – Slow replay or
  – Requires major hardware changes
• Karma: Faster Replay with nearly-conventional h/w
  – Extends Rerun
  – Uses DAG instead of Scalar clock
  – Extend episodes past conflicts
• Widen Application + Lower Cost → More Attractive

Questions?
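
Illustrative sketch (not from the talk): the difference between replaying in scalar-timestamp order and replaying along a DAG can be shown with a toy timing model. Everything here is an assumption made for illustration: the episode names, their lengths, the single cross-thread arc, and the simplification that timestamp-ordered replay behaves like one barrier per timestamp value. It is not the simulator or the hardware mechanism used in the evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    name: str
    length: int  # hypothetical replay cost of the episode
    preds: list = field(default_factory=list)  # episodes that must finish first


def lamport_timestamps(episodes):
    """Scalar clock: one more than the largest predecessor timestamp."""
    ts = {}
    for ep in episodes:  # assumes episodes are listed in topological order
        ts[ep.name] = 1 + max((ts[p] for p in ep.preds), default=0)
    return ts


def timestamp_replay_time(episodes):
    """Toy model: equal timestamps run in parallel, distinct ones in sequence."""
    ts = lamport_timestamps(episodes)
    level_cost = {}
    for ep in episodes:
        level = ts[ep.name]
        level_cost[level] = max(level_cost.get(level, 0), ep.length)
    return sum(level_cost.values())


def dag_replay_time(episodes):
    """Toy model: an episode starts as soon as all its predecessors finish."""
    finish = {}
    for ep in episodes:
        start = max((finish[p] for p in ep.preds), default=0)
        finish[ep.name] = start + ep.length
    return max(finish.values())


# Hypothetical log: two episodes per thread, program order as explicit arcs,
# plus one cross-thread arc (c1 must follow b0).
episodes = [
    Episode("a0", 10),
    Episode("b0", 1),
    Episode("c0", 1),
    Episode("a1", 1, ["a0"]),
    Episode("b1", 10, ["b0"]),
    Episode("c1", 1, ["c0", "b0"]),
]

print(timestamp_replay_time(episodes))  # 20
print(dag_replay_time(episodes))        # 11
```

In this made-up log the long episode a0 shares a timestamp with the short b0, so the scalar order holds b1 back until a0 is done even though the two are unrelated. Following only the recorded arcs lets b1 start right after b0.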