Relaxed Consistency Deterministic Computer
“deterministic deeds, done dirt cheap”
Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, Dan Grossman

determinism improves the software development cycle (Test, Deploy, Debug)
• tested inputs behave identically in production
• testing results are reproducible
• no need to stress test
• more robust production code
• reverse debugging is possible
• production bugs can be reproduced in-house
[slide 3]

determinism improves the software development cycle: Test, Deploy, Debug
[slide 4]

History of Deterministic Execution
• Deterministic Execution for Arbitrary Programs: DMP [ASPLOS ’09], CoreDet [ASPLOS ’10], dOS [OSDI ’10], Determinator [OSDI ’10], Calvin [HPCA ’11], RCDC [ASPLOS ’11]
• Deterministic Execution for Restricted Programs: Kendo [ASPLOS ’09], Grace [OOPSLA ’09]
[slide 5]

History of Deterministic Execution
• DMP [ASPLOS ’09], CoreDet [ASPLOS ’10]: seq. consistency, total store order
• RCDC [ASPLOS ’11]: DRF0 [ISCA ’90]
Comic: "Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com, © 2008 Jorge Cham
[slide 6]

Contributions / Outline
1. DMP-HB: a new deterministic consistency model based on DRF0, with improved performance
2. a low-complexity hw/sw deterministic execution system (hw: store buffers and instruction counting; sw: everything else)
3. simulation using Pin
4. C/C++ compiler based on LLVM, runs on commodity multicore hardware
[slide 7]

starting simple: serialization
deterministic quantum size + deterministic scheduling → determinism
[Diagram: threads T1, T2, T3 against time; labels: quantum, quantum round]
[slide 8]

recovering parallelism with DMP-TSO
• parallel mode: buffer all stores (no communication)
• commit mode: deterministically publish buffers
• serial mode: for atomic ops
[Diagram: threads T1, T2, T3 against time through parallel, commit, and serial phases; operations shown: wr A, rd A, lock A, rd A, lock B]
[slide 9]

Why is DMP-TSO slow?
• serialization
• imbalance
Kendo [ASPLOS ’09]
[Diagram: threads T1, T2, T3 against time; phases: parallel, commit, serial]
[slide 10]

Why is DMP-TSO slow?
• serialization
• imbalance
Kendo [ASPLOS ’09], DMP-HB
parallel-mode synchronization complements relaxed consistency
[Diagram: threads T1, T2, T3 against time; phases: parallel, commit]
[slide 11]

synchronization in parallel mode with Kendo [Olszewski et al., ASPLOS ’09]
thread with globally min insn count can do atomic op
[Diagram: threads T1, T2, T3 against instruction count; T2 attempts lock A; annotations: "T2 is not globally min insn count", "globally min insn count"]
[slide 12]

Why is DMP-TSO slow?
• serialization
• imbalance
Kendo [ASPLOS ’09]
[Diagram: threads T1, T2, T3 against time; phases: parallel, commit, serial]
[slide 13]

Why is DMP-TSO slow?
• serialization
• imbalance
Kendo [ASPLOS ’09], DMP-HB
[Diagram: threads T1, T2, T3 against time; phases: parallel, commit]
[slide 14]

DRF0: happens-before consistency [Adve and Hill, ISCA ’90]
• happens-before edges defined by synchronization operations
• remote updates visible via cross-thread happens-before edges
• SC for DRF programs
• upholds C++/Java memory models
• programmer-visible model doesn’t change
[slide 15]

DMP-HB
• sync in parallel mode (Kendo)
• relaxed consistency (DRF0)
• deterministic scheduling (DMP)
[slide 16]

DMP-HB: happens-before determinism
• no serial mode
• less imbalance
• explicit fences rarely necessary
• explicit fence iff inter-thread HB edge doesn’t cross commit
[Diagram: threads T1, T2, T3 against time; phases: parallel, commit; operations shown: lock A, unlock A, lock A; consistency models shown: TSO, RC, DRF0]
[slide 17]

Outline
1. DMP-HB: a new deterministic consistency model with improved performance
2. a low-complexity hw/sw deterministic execution system (hw: store buffers and instruction counting; sw: everything else)
3. simulation using Pin
4. C/C++ compiler based on LLVM, runs on commodity multicore hardware
[slide 18]

Architecture
• Store Buffers in Private $: StoreToSB, CommitSB, SaveSB, RestoreSB
• Precise Insn Counting: StartInsnCount, StopInsnCount, ReadInsnCount
• Traps: SBFull, QuantumReached
application/OS can choose nondeterminism
align context switches with quantum boundaries
[Diagram: runtime system, L2$, two L1$, two cores]
[slide 19]

Outline
1. DMP-HB: a new deterministic consistency model with improved performance
2. a low-complexity hw/sw deterministic execution system (hw: store buffers and instruction counting; sw: everything else)
3. simulation using Pin
4. C/C++ compiler based on LLVM, runs on commodity multicore hardware
[slide 20]

Experimental Setup
Pin-based simulator; 1 IPC, except for memory ops; PARSEC v2.1 with simsmall inputs

structure     size            access latency
private L1    8-way, 32KB     1 cycle
private L2    8-way, 256KB    10 cycles
shared L3     16-way, 8MB     35 cycles
memory        -               120 cycles

extended CoreDet C/C++ compiler [ASPLOS ’10]; 8-core Intel Harpertown @ 2.8GHz, 10GB RAM; PARSEC v2.1 with simlarge inputs
[slide 21]

Simulation: overhead < 60% in worst case
[Chart: Overheads, % overhead compared to nondet (y-axis 0% to 70%); series: 2p, 4p, 8p, 16p; one group of bars per benchmark]
quantum size (insns): blacksch 50k, dedup 50k, ferret 25k, fluid 1k, streamcl 1k, swaptions 50k, vips 50k, x264 50k
[slide 22]

Compiler: DMP-HB vs. DMP-TSO
[Chart: % overhead compared to nondet (y-axis 0% to 450%); series: hb, tso; 2, 4, and 8 threads per benchmark]
quantum size (insns): blackscholes 200k, swaptions 200k, fluidanimate 50k, fmm 50k
[slide 23]

Conclusions
• DMP-HB: a new deterministic consistency model
• RCDC: a new deterministic multiprocessor design
  – no speculation
  – lightweight hardware support
• Relaxed consistency is a natural optimization for determinism
source code and data available at http://sampa.cs.washington.edu
[slide 24]

Thanks! Questions?
source code and data available at http://sampa.cs.washington.edu
[slide 25]

DRF0 hardware requirements [ISCA ’90]
1. Intra-processor dependencies are preserved.
2. All writes to the same location can be totally ordered based on their commit times, and this is the order in which they are observed by all processors.
3. All synchronization operations to the same location can be totally ordered based on their commit times, and this is also the order in which they are globally performed. Further, if S1 and S2 are synchronization operations and S1 is committed and globally performed before S2, then all components of S1 are committed and globally performed before any in S2.
4. A new access is not generated by a processor until all its previous synchronization operations (in program order) are committed.
5. Once a synchronization operation S by processor Pi is committed, no other synchronization operations on the same location by another processor can commit until after all reads of Pi before S (in program order) are committed and all writes of Pi before S are globally performed.
[slide 26]
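
Illustration (not from the slides): the parallel-mode and commit-mode steps on the DMP-TSO slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' implementation: the names `BufferedThread`, `run_round`, and `demo` are hypothetical, each quantum is a whole function rather than a fixed instruction count, and buffers are published in ascending thread-ID order as one possible deterministic commit order. Serial mode and Kendo-style synchronization are left out.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]


class BufferedThread:
    """One thread's view of memory during parallel mode (hypothetical name)."""

    def __init__(self, tid: int, shared: Memory) -> None:
        self.tid = tid
        self.shared = shared
        self.store_buffer: Memory = {}

    def load(self, addr: str) -> int:
        # A thread sees its own buffered stores first, then committed memory.
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        return self.shared.get(addr, 0)

    def store(self, addr: str, value: int) -> None:
        # Parallel mode: stores stay private, so threads do not communicate.
        self.store_buffer[addr] = value


Quantum = Callable[[BufferedThread], None]


def run_round(shared: Memory, quanta: List[Quantum], run_order: List[int]) -> None:
    """Run one quantum round: parallel mode, then commit mode."""
    threads = [BufferedThread(tid, shared) for tid in range(len(quanta))]

    # Parallel mode. run_order stands in for arbitrary hardware timing;
    # it cannot affect the result because no store is visible yet.
    for tid in run_order:
        quanta[tid](threads[tid])

    # Commit mode. Assumption: publish buffers in ascending thread-ID order,
    # so a later thread's store to the same address wins.
    for thread in threads:
        shared.update(thread.store_buffer)


def demo(run_order: List[int]) -> Memory:
    """Three racy threads; the outcome should not depend on run_order."""
    shared: Memory = {"A": 5}

    def t0(t: BufferedThread) -> None:
        t.store("A", t.load("A") + 1)

    def t1(t: BufferedThread) -> None:
        t.store("B", t.load("A"))
        t.store("A", 10)

    def t2(t: BufferedThread) -> None:
        t.store("C", t.load("A") + t.load("B"))

    run_round(shared, [t0, t1, t2], run_order)
    return shared


if __name__ == "__main__":
    outcomes = [demo(order) for order in ([0, 1, 2], [2, 1, 0], [1, 2, 0])]
    assert outcomes[0] == outcomes[1] == outcomes[2]
    print(outcomes[0])
```

Every thread reads only committed memory plus its own buffer, so the three interleavings produce the same final memory even though the threads race on A. In this model the price is that a store becomes visible to other threads only at the next commit, which is the communication delay the slides attribute to buffering all stores.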