Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin¹, Mark D. Hill², David A. Wood²
¹Department of Electrical & Computer Engineering, Duke University
²Computer Sciences Department, University of Wisconsin-Madison
(C) 2003 Daniel Sorin, Duke Architecture
DSN 2003 – Daniel Sorin

My Talk in One Slide
• Commercial server availability is important
  – System model: Symmetric Multiprocessor (SMP)
  – Fault model: Mostly transient, some permanent
• Recent work developed efficient checkpoint/recovery
  – But we can only recover from hardware errors we detect
  – Many hardware errors are hard to detect
• Proposal: Dynamic verification of invariants
  – Online checking of end-to-end system invariants
  – Checking performed with distributed signature analysis
  – Triggers recovery if invariant is violated

Outline
• Background
  – SMPs and availability
  – Existing hardware error detection
• Invariant checking with distributed signature analysis
• Two invariant checkers
• Evaluation
• Conclusions

Symmetric Multiprocessor (SMP) System Model
[Figure: four processors (P) on a shared wire bus; cache coherence transaction with block state labels I and M]
Cache Coherence Transaction
• Issue request
• Wait for response
• Receive response

Symmetric Multiprocessor (SMP) System Model
[Figure: four processors (P) connected through switches; the same cache coherence transaction (issue request, wait for response, receive response), annotated with times t1, t2, t3 and the labels "request arrives" and "response arrives"]
– Broadcast request not delivered to subset of nodes
– Broadcast requests delivered out of order to subset of nodes
– More chances for incorrect state transitions

Backward Error Recovery
• Can improve availability with backward error recovery
• If error detected, then recover to pre-fault state
• Backward error recovery (BER) requires:
  – Checkpoint/recovery mechanism
  – Error detection mechanisms

SafetyNet Checkpoint/Recovery
• SafetyNet: all-hardware scheme [ISCA 2002]
  – Periodically take logical checkpoint of multiprocessor
    • MP State: processor registers, caches, memory
  – Incrementally log changes to caches and memory
  – Consistent checkpointing performed in logical time
    • E.g., every 3000 broadcast cache coherence requests
  – Can tolerate >100,000 cycles of error detection latency
[Figure: execution timeline with checkpoints CP 1 to CP 4; labels: Validated, Pending validation, Still detecting errors, Active execution]

Error Detection
• Error model: mostly due to transient faults
• Example error detection mechanisms:
  – Parity bit on cache line
  – Checksum on incoming message
  – Timeout on cache coherence transaction
• But error detection for servers is still weak
• Why?
  – Error detection is often on critical path and must be fast
  – Fast error detection can’t incorporate info from other nodes

Why Local Information Isn’t Sufficient
[Figure, step 1: P1 to P4 connected through switches; block states shown: Shared, Owned]
[Figure, step 2: a Broadcast Request for Exclusive travels through the switches and is marked "fault!"; block states shown: Shared, Owned]
[Figure, step 3: same request and fault; block states shown: Shared, Owned, Invalid; a Data Response is sent]
[Figure, step 4: block states shown: Modified, Shared]
Neither P1 nor P2 can detect that an error has occurred!
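The scenario on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assignment of states to processors (P1 requests exclusive access, P2 holds a Shared copy and misses the broadcast, P4 is the owner) is a guess, since the transcript does not preserve which label belongs to which node, and the helper names `local_check`, `global_check`, and `broadcast_request_for_exclusive` are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    OWNED = "O"
    MODIFIED = "M"


def local_check(state):
    """A node only sees its own state, and every single state is legal."""
    return isinstance(state, State)


def global_check(states):
    """End-to-end rule: a Modified copy may not coexist with another valid copy."""
    valid = [s for s in states.values() if s is not State.INVALID]
    return not (State.MODIFIED in valid and len(valid) > 1)


def broadcast_request_for_exclusive(states, requester, dropped_at=()):
    """Requester becomes Modified; every node that receives the broadcast invalidates."""
    new_states = dict(states)
    for node in states:
        if node == requester:
            new_states[node] = State.MODIFIED
        elif node not in dropped_at:
            new_states[node] = State.INVALID
        # Nodes in dropped_at never see the request and keep their old state.
    return new_states


# Assumed initial placement of the Shared and Owned copies.
before = {"P1": State.INVALID, "P2": State.SHARED,
          "P3": State.INVALID, "P4": State.OWNED}

# The fault drops the broadcast on its way to P2.
after = broadcast_request_for_exclusive(before, "P1", dropped_at={"P2"})

every_node_looks_fine = all(local_check(s) for s in after.values())
system_is_coherent = global_check(after)
```

Here `every_node_looks_fine` is true while `system_is_coherent` is false: P1 holds the block Modified, P2 still holds it Shared, and only a check that combines information from several nodes exposes the violation.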

Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions

Distributed Signature Analysis
• Reduces long history of events into small signature
  – Signatures map almost-uniquely to event histories
[Figure: at each of P1 and P2, Event 1 through Event N are reduced to a signature (P1’s signature, P2’s signature); both signatures feed a Checker]
• Check periodically in logical time (every 3000 requests)

Designing Signature Analysis Schemes
• Must devise two functions: Update and Check
• Signature(Pi) = Update(Signature(Pi), Event)
• Check(Signature(P1),…,Signature(PN)) = true if error
• Simple example: check that message inflow=outflow
  – Assume only unicast messages
  – Update: +1 for receive, -1 for send
  – Check: true if sum of all signatures doesn’t equal 0

Implementing Distributed Signature Analysis
• All components cooperate to perform checking
  – Component = cache controller or memory controller
• Each component contains:
  – Local signature register
  – Logic to compute signature updates
• System contains:
  – System controller that performs check function
• Use distributed signature analysis for dynamic verification
  – Verify end-to-end invariants

Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
  – Message invariant
  – Cache coherence invariant
• Evaluation
• Conclusions

A Message-Level Invariant Checker
• Context: symmetric multiprocessor (SMP)
  – Cache coherence with broadcast snooping protocol
• Invariant: all nodes see same total order of broadcast cache coherence requests
• Update: for each incoming broadcast, “add” Address
  – Not quite this simple (e.g., doesn’t detect reorderings)
• Check: error if all signatures aren’t equal

Aliasing
• Aliasing occurs if two histories have same signature
• 3 possible sources of aliasing
  – Finite resources: b bits can only distinguish 2^b histories
  – Fault in signature analysis hardware itself
  – Inherent flaw in scheme
• Examples of inherent aliasing in previous scheme
  – Arrival of message with Address=0 doesn’t change signature
  – Reordering of messages doesn’t change signature
  – We solve aliasing issues in paper
    • Tricks: hash more than 1 field of message, use LFSRs, etc.

A Cache Coherence Invariant Checker
• Invariant: all coherence upgrades cause downgrades
  – Upgrade: increase permissions to block (e.g., none → read)
  – Downgrade: decrease permissions (e.g., write → read)
• Update: add Address for upgrade, subtract Address for downgrade
• Check: error if sum of all signatures doesn’t equal 0
• Challenges
  – Can be more than one downgrade per upgrade
  – Upgrader doesn’t know how many downgraders exist
  – See paper for solutions to these challenges

Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions

Methodology
• Full-system simulation of 16-processor machine
  – Simics provides functional simulation of everything
  – We added timing simulation for memory system & SafetyNet
• Commercial workloads running on Solaris 8
  – Database: IBM’s DB2 running online transaction processing
  – Static web server: Apache
  – Dynamic web server: Slashdot
  – Java middleware

Detection Coverage
• How do we know if our checkers work?
• Inject errors periodically
  – Corrupt messages
  – Drop messages
  – Reorder messages
  – Improperly process cache coherence messages
• Global invariant checkers detected all errors

Performance
[Figure: performance results]
• Error bars represent +/- one standard deviation

Conclusions
• Goal: improve multiprocessor availability
• How? Dynamic verification of end-to-end invariants
  – Implemented with distributed signature analysis
• Results
  – Detects previously undetectable hardware errors
  – Negligible performance overhead for error-free execution
• Duke FaultFinder Project
  – http://www.ee.duke.edu/~sorin/faultfinder
• Wisconsin Multifacet Project
  – http://www.cs.wisc.edu/multifacet/
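
As a supplement to the "Designing Signature Analysis Schemes" and "Aliasing" slides, the Update and Check functions might be sketched as follows. This is a minimal software illustration, not the hardware design from the talk. The first pair implements the unicast inflow=outflow example exactly as stated on the slide. The second part contrasts the naive "add Address" update with `update_ordered`, a hypothetical order-sensitive hash chosen here only to show how aliasing can be reduced; the paper's actual scheme (the slides mention hashing several fields and LFSRs) is not reproduced. The 16-bit width `B` and all function names are assumptions.

```python
from functools import reduce

B = 16                 # assumed signature width in bits
MASK = (1 << B) - 1    # b bits can only distinguish 2^b histories


# Unicast inflow = outflow example from the slides.
def update_count(signature, event):
    """Update: +1 for a receive, -1 for a send."""
    return signature + (1 if event == "receive" else -1)


def check_count(signatures):
    """Check: true (error) if the sum of all signatures doesn't equal 0."""
    return sum(signatures) != 0


# Broadcast total-order example.
def update_add(signature, address):
    """Naive Update from the slides: "add" the Address."""
    return (signature + address) & MASK


def update_ordered(signature, address):
    """Hypothetical order-sensitive Update; NOT the paper's scheme.

    Multiplying the old signature makes the result depend on arrival order,
    and the +1 makes a message with Address=0 change the signature.
    """
    return (signature * 31 + address + 1) & MASK


def check_equal(signatures):
    """Check: true (error) if all signatures aren't equal."""
    return len(set(signatures)) != 1


def signature_of(update, events, initial=0):
    """Fold a node's event history into its signature register."""
    return reduce(update, events, initial)


# P1 sends two unicast messages to P2; a fault drops the second one.
p1 = signature_of(update_count, ["send", "send"])
p2_ok = signature_of(update_count, ["receive", "receive"])
p2_faulty = signature_of(update_count, ["receive"])
no_fault = check_count([p1, p2_ok])
dropped = check_count([p1, p2_faulty])

# Two nodes see the same broadcasts, but one sees them reordered.
in_order = [0x40, 0x80, 0xC0]
reordered = [0x80, 0x40, 0xC0]
naive_flags_error = check_equal(
    [signature_of(update_add, in_order), signature_of(update_add, reordered)])
ordered_flags_error = check_equal(
    [signature_of(update_ordered, in_order),
     signature_of(update_ordered, reordered)])
```

With these inputs the counting checker reports an error only for the dropped message. The additive update aliases on the reordering (`naive_flags_error` is false), while the order-sensitive variant flags it. Finite-width aliasing remains possible in either case.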