Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
(1) Department of Electrical & Computer Engineering, Duke University
(2) Computer Sciences Department, University of Wisconsin-Madison
(C) 2003 Daniel Sorin
Duke Architecture
My Talk in One Slide
• Commercial server availability is important
– System model: Symmetric Multiprocessor (SMP)
– Fault model: Mostly transient, some permanent
• Recent work developed efficient checkpoint/recovery
– But we can only recover from hardware errors we detect
– Many hardware errors are hard to detect
• Proposal: Dynamic verification of invariants
– Online checking of end-to-end system invariants
– Checking performed with distributed signature analysis
– Triggers recovery if invariant is violated
slide 2
DSN 2003 – Daniel Sorin
Outline
• Background
– SMPs and availability
– Existing hardware error detection
• Invariant checking with distributed signature analysis
• Two invariant checkers
• Evaluation
• Conclusions
Symmetric Multiprocessor (SMP) System Model
Cache Coherence Transaction
[Figure: four processors (P) on a shared wire bus, with state labels I and M. Steps: issue request, wait for response, receive response.]
Symmetric Multiprocessor (SMP) System Model
Cache Coherence Transaction
[Figure: four processors (P) connected through switches, with state labels I and M. Steps: issue request, wait for response, receive response.]
– Broadcast request not delivered to subset of nodes
– Broadcast requests delivered out of order to subset of nodes
Symmetric Multiprocessor (SMP) System Model
Cache Coherence Transaction
[Figure: four processors (P) connected through switches, with state labels I and M. Time markers t1, t2, and t3 annotate the events: issue request, request arrives, response arrives.]
– Broadcast request not delivered to subset of nodes
– Broadcast requests delivered out of order to subset of nodes
– More chances for incorrect state transitions
Backward Error Recovery
• Can improve availability with backward error recovery
• If error detected, then recover to pre-fault state
• Backward error recovery (BER) requires:
– Checkpoint/recovery mechanism
– Error detection mechanisms
SafetyNet Checkpoint/Recovery
• SafetyNet: all-hardware scheme [ISCA 2002]
– Periodically take logical checkpoint of multiprocessor
• MP State: processor registers, caches, memory
– Incrementally log changes to caches and memory
– Consistent checkpointing performed in logical time
• E.g., every 3000 broadcast cache coherence requests
– Can tolerate >100,000 cycles of error detection latency
[Figure: execution timeline across checkpoints CP 1 to CP 4, with regions labeled validated execution, pending validation (still detecting errors), and active execution.]
Error Detection
• Error model: mostly due to transient faults
• Example error detection mechanisms:
– Parity bit on cache line
– Checksum on incoming message
– Timeout on cache coherence transaction
• But error detection for servers is still weak
• Why?
– Error detection is often on critical path and must be fast
– Fast error detection can’t incorporate info from other nodes
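As an illustration of the first mechanism in the list above, here is a minimal Python sketch of a parity check on a cache line. The function names, the even-parity convention, and the 64-byte line are assumptions made for this example, not details from the talk. It shows that the check uses only data held at one node.

```python
def parity_bit(line: bytes) -> int:
    """Even-parity bit: 1 if the line contains an odd number of 1 bits."""
    ones = sum(bin(byte).count("1") for byte in line)
    return ones & 1


def has_parity_error(line: bytes, stored_parity: int) -> bool:
    """Local check: recompute parity and compare with the stored bit."""
    return parity_bit(line) != stored_parity


def flip_bit(line: bytes, bit_index: int) -> bytes:
    """Model a transient fault by flipping one bit of the line."""
    corrupted = bytearray(line)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    line = bytes(range(64))          # hypothetical 64-byte cache line
    stored = parity_bit(line)        # parity written alongside the data
    print(has_parity_error(flip_bit(line, 5), stored))
```

A single flipped bit changes the parity and is caught, while two flipped bits cancel out and pass the check. Either way, the check says nothing about what other nodes observed.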
Why Local Information Isn’t Sufficient
[Figure, four builds: processors P1 to P4 connected through switches. (1) Cache state labels Shared and Owned. (2) A broadcast Request for Exclusive encounters a fault at a switch. (3) State labels Shared, Owned, and Invalid, with a Data Response. (4) Final state labels Modified and Shared.]
Neither P1 nor P2 can detect that an error has occurred!
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Distributed Signature Analysis
• Reduces long history of events into small signature
– Signatures map almost-uniquely to event histories
[Figure: P1 and P2 each fold their event history (Event 1 through Event N) into a local signature; P1’s signature and P2’s signature both feed a checker.]
• Check periodically in logical time (every 3000 requests)
Designing Signature Analysis Schemes
• Must devise two functions: Update and Check
• Signature(Pi) = Update(Signature(Pi), Event)
• Check(Signature(P1),…,Signature(PN)) = true if error
• Simple example: check that message inflow=outflow
– Assume only unicast messages
– Update: +1 for receive, -1 for send
– Check: true if sum of all signatures doesn’t equal 0
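The inflow=outflow example can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the names `update` and `check`, the string event labels, and the requirement that no messages are in flight when the check runs are choices made for this example.

```python
from typing import Iterable


def update(signature: int, event: str) -> int:
    """Update rule from the slide: +1 for a receive, -1 for a send."""
    if event == "receive":
        return signature + 1
    if event == "send":
        return signature - 1
    raise ValueError(f"unknown event: {event}")


def check(signatures: Iterable[int]) -> bool:
    """Check rule from the slide: True (error) if the sum is not 0."""
    return sum(signatures) != 0


def run(histories: list[list[str]]) -> bool:
    """Fold each node's event history into a signature, then check."""
    signatures = []
    for events in histories:
        signature = 0
        for event in events:
            signature = update(signature, event)
        signatures.append(signature)
    return check(signatures)


if __name__ == "__main__":
    # P1 sends one unicast message and P2 receives it: no error.
    print(run([["send"], ["receive"]]))
    # The message is dropped in the network: P2 never records a receive.
    print(run([["send"], []]))
```

Each node touches only its own signature, and only the final check combines information from every node.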
Implementing Distributed Signature Analysis
• All components cooperate to perform checking
– Component = cache controller or memory controller
• Each component contains:
– Local signature register
– Logic to compute signature updates
• System contains:
– System controller that performs check function
• Use distributed signature analysis for dynamic verification
– Verify end-to-end invariants
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
– Message invariant
– Cache coherence invariant
• Evaluation
• Conclusions
A Message-Level Invariant Checker
• Context: symmetric multiprocessor (SMP)
– Cache coherence with broadcast snooping protocol
• Invariant: all nodes see same total order of broadcast cache coherence requests
• Update: for each incoming broadcast, “add” Address
– Not quite this simple (e.g., doesn’t detect reorderings)
• Check: error if all signatures aren’t equal
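The simplified Update and Check rules on this slide can be sketched in Python. This is not the scheme from the paper: the 16-bit register width, modular addition as the "add", and the function names are assumptions for this example. It reproduces the limitation the slide mentions, namely that reorderings go undetected.

```python
SIGNATURE_BITS = 16                     # assumed register width
MASK = (1 << SIGNATURE_BITS) - 1


def update(signature: int, address: int) -> int:
    """Simplified Update: add the broadcast's address to the signature."""
    return (signature + address) & MASK


def check(signatures: list[int]) -> bool:
    """Check: True (error) if the signatures are not all equal."""
    return len(set(signatures)) > 1


def run(broadcasts_seen: list[list[int]]) -> bool:
    """Each inner list holds the addresses one node observed, in order."""
    signatures = []
    for addresses in broadcasts_seen:
        signature = 0
        for address in addresses:
            signature = update(signature, address)
        signatures.append(signature)
    return check(signatures)


if __name__ == "__main__":
    ok = [[0x40, 0x80, 0xC0]] * 4
    dropped = [[0x40, 0x80, 0xC0], [0x40, 0xC0]]
    reordered = [[0x40, 0x80, 0xC0], [0x80, 0x40, 0xC0]]
    print(run(ok), run(dropped), run(reordered))
```

A broadcast that never reaches one node changes that node's sum and is flagged. A reordering leaves the sum unchanged, so this simplified version misses it.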
Aliasing
• Aliasing occurs if two histories have same signature
• 3 possible sources of aliasing
– Finite resources – b bits can only distinguish 2^b histories
– Fault in signature analysis hardware itself
– Inherent flaw in scheme
• Examples of inherent aliasing in previous scheme
– Arrival of message with Address=0 doesn’t change signature
– Reordering of messages doesn’t change signature
– We solve aliasing issues in paper
• Tricks: hash more than 1 field of message, use LFSRs, etc.
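The two inherent aliasing cases listed above can be reproduced in a small Python sketch, alongside one order-sensitive alternative. The alternative (multiply by an odd constant, then add Address+1) is a stand-in chosen for this example and is not the scheme from the paper, which mentions LFSRs and hashing more than one message field. The 16-bit width and all names are also assumptions.

```python
from typing import Callable

MASK = (1 << 16) - 1                    # assumed 16-bit signature register


def additive_update(signature: int, address: int) -> int:
    """Simplified Update from the previous slide: plain modular addition."""
    return (signature + address) & MASK


def mixing_update(signature: int, address: int) -> int:
    """Hypothetical order-sensitive Update: scale the old value, then add.

    Adding address + 1 makes a message with Address=0 change the signature.
    """
    return (signature * 31 + address + 1) & MASK


def signature_of(update: Callable[[int, int], int], addresses: list[int]) -> int:
    """Fold an ordered history of addresses into one signature."""
    signature = 0
    for address in addresses:
        signature = update(signature, address)
    return signature


if __name__ == "__main__":
    in_order, swapped = [0x40, 0x80], [0x80, 0x40]
    for update in (additive_update, mixing_update):
        same = signature_of(update, in_order) == signature_of(update, swapped)
        print(update.__name__, "aliases on reordering:", same)
```

The mixing version separates these two cases, but any b-bit signature still aliases on some pairs of histories because of finite resources.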
A Cache Coherence Invariant Checker
• Invariant: all coherence upgrades cause downgrades
– Upgrade: increase permissions to block (e.g., none → read)
– Downgrade: decrease permissions (e.g., write → read)
• Update: add Address for upgrade, subtract Address for downgrade
• Check: error if sum of all signatures doesn’t equal 0
• Challenges
– Can be more than one downgrade per upgrade
– Upgrader doesn’t know how many downgraders exist
– See paper for solutions to these challenges
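The basic Update and Check rules above can be sketched in Python for the easy case only. It assumes that every upgrade is matched by exactly one downgrade at some other component, so it sidesteps the challenges listed on the slide, which the paper addresses. The 16-bit register width and the function names are also assumptions for this example.

```python
MASK = (1 << 16) - 1                    # assumed 16-bit signature register


def on_upgrade(signature: int, address: int) -> int:
    """Update for an upgrade: add the block's address."""
    return (signature + address) & MASK


def on_downgrade(signature: int, address: int) -> int:
    """Update for a downgrade: subtract the block's address."""
    return (signature - address) & MASK


def check(signatures: list[int]) -> bool:
    """Check: True (error) if the signatures do not sum to 0."""
    return (sum(signatures) & MASK) != 0


if __name__ == "__main__":
    block = 0x100

    # Correct transaction: P1 upgrades the block, P2 downgrades it.
    p1 = on_upgrade(0, block)
    p2 = on_downgrade(0, block)
    print("error detected:", check([p1, p2]))

    # Faulty transaction: P2 never sees the request and never downgrades.
    p1 = on_upgrade(0, block)
    p2 = 0
    print("error detected:", check([p1, p2]))
```

In the faulty case neither signature looks wrong on its own. The error only becomes visible when the check sums them.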
Outline
• Background
• End-to-end invariant checking
• Two invariant checkers
• Evaluation
• Conclusions
Methodology
• Full-system simulation of 16-processor machine
– Simics provides functional simulation of everything
– We added timing simulation for memory system & SafetyNet
• Commercial workloads running on Solaris 8
– Database: IBM’s DB2 running online transaction processing
– Static web server: Apache
– Dynamic web server: Slashdot
– Java middleware
Detection Coverage
• How do we know if our checkers work?
• Inject errors periodically
– Corrupt messages
– Drop messages
– Reorder messages
– Improperly process cache coherence messages
Global invariant checkers detected all errors
Performance
• Error bars represent +/- one standard deviation
Conclusions
• Goal: improve multiprocessor availability
• How? Dynamic verification of end-to-end invariants
– Implemented with distributed signature analysis
• Results
– Detects previously undetectable hardware errors
– Negligible performance overhead for error-free execution
• Duke FaultFinder Project
– http://www.ee.duke.edu/~sorin/faultfinder
• Wisconsin Multifacet Project
– http://www.cs.wisc.edu/multifacet/