Debugging Concurrent Software by Context-Bounded Analysis
Shaz Qadeer, Microsoft Research
Joint work with:
• Jakob Rehof, Microsoft Research
• Dinghao Wu, Princeton University
Concurrent software
[Figure: Thread 1 to Thread 4, Processor 1 and Processor 2]
• Operating systems, device drivers
• Databases, web servers, browsers, GUIs, ...
• Modern languages: C#, Java
Concurrency is increasingly important
• Single-chip multiprocessors are an architectural inflexion point
  – Software running on these chips will be even more concurrent
• Embedded systems
  – Airplanes, cars, PDAs, cellphones
• Web services
Reliable concurrent software?
• Correctness problem
  – Does the program behave correctly for all inputs and all interleavings?
• Bugs due to concurrency are insidious
  – nondeterministic, timing dependent
  – difficult to detect, reproduce, eliminate
  – coverage from testing very poor
Analysis of concurrent programs is difficult (1)
• Finite-data single-procedure program
  – n lines
  – m states for global data variables
• 1 thread
  – n * m states
• K threads
  – n^K * m states
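The growth in the number of states can be checked with a few lines of C. This is only an arithmetic sketch; `state_count` is a hypothetical helper name, not something from the talk, and the result overflows quickly for realistic n and K.

```c
#include <stdint.h>

/* Number of states of a finite-data, single-procedure program with
 * n lines, m global-data states and k threads: n^k * m.
 * (Hypothetical helper; no overflow checking.) */
uint64_t state_count(uint64_t n, uint64_t m, unsigned k) {
    uint64_t states = m;
    for (unsigned i = 0; i < k; i++) {
        states *= n;   /* one program counter per thread */
    }
    return states;
}
```

For example, with n = 100 and m = 8, one thread gives 800 states while three threads already give 8,000,000.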
Analysis of concurrent programs is difficult (2)
• Finite-data program with procedures
  – n lines
  – m states for global data variables
• 1 thread
  – Infinite number of states
  – Can still decide assertions in O(n * m^3)
  – SLAM, ESP, BLAST implement this algorithm
• K ≥ 2 threads
  – Undecidable! (Ramalingam 00)
Context-bounded verification of concurrent software
[Figure: an execution drawn as Context, context switch, Context, context switch, Context]
Analyze all executions with a small number of context switches!
Why context-bounded analysis?
• Many subtle concurrency errors are manifested in executions with a small number of contexts
• Context-bounded analysis can be performed efficiently
KISS: A static checker for concurrent software
• An implementation of context-bounded analysis
  – Technique to use any sequential checker to perform context-bounded concurrency analysis
• Has found a number of concurrency errors in NT device drivers
KISS: A static checker for concurrent software
[Figure: Concurrent program P → KISS → Sequential program Q → Sequential checker (SDV, PREfix, or ESP)]
• No error found
• Error in Q indicates error in P
Inside a static checker for sequential programs

int x, y, z;
void foo ( ) {
  if (x > y) {
    y = x;
  }
  if (y > z) {
    z = y;
  }
  assert (x <= z);
}

• Symbolically analyze all paths
• Check the assertion for each path
• Interprocedural analysis
  – e.g., PREfix, ESP, SLAM, BLAST
KISS strategy
[Figure: Concurrent program P → KISS → Sequential program Q → SDV]
• Q encodes executions of P with a small number of context switches
  – instrumentation introduces lots of extra paths to mimic context switches
• Leverage all-path analysis of sequential checkers
DispatchRoutine( ) {
  int t;
  if (! de->stopping) {
    AtomicIncr(& de->count);
    // do useful work
    // …
    t = AtomicDecr(& de->count);
    if (t == 0)
      SetEvent(& de->stopEvent);
  }
}

PnpStop( ) {
  int t;
  de->stopping = T;
  t = AtomicDecr(& de->count);
  if (t == 0)
    SetEvent(& de->stopEvent);
  WaitEvent(& de->stopEvent);
}
DispatchRoutine( ) {
  int t;
  if (! de->stopping) {
    AtomicIncr(& de->count);
    // do useful work
    // …
    t = AtomicDecr(& de->count);
    if (t == 0)
      SetEvent(& de->stopEvent);
  }
}

PnpStop( ) {
  int t;
  if ($) return;
  de->stopping = T;
  if ($) return;
  t = AtomicDecr(& de->count);
  if ($) return;
  if (t == 0)
    SetEvent(& de->stopEvent);
  if ($) return;
  WaitEvent(& de->stopEvent);
}
bool done = F;

CODE stands for:
  if ( !done ) {
    if ($) {
      done = T;
      PnpStop( );
    }
  }

DispatchRoutine( ) {
  int t;
  CODE;
  if (! de->stopping) {
    CODE;
    AtomicIncr(& de->count);
    // do useful work
    // …
    CODE;
    t = AtomicDecr(& de->count);
    CODE;
    if (t == 0)
      SetEvent(& de->stopEvent);
    CODE;
  }
}

PnpStop( ) {
  int t;
  if ($) return;
  de->stopping = T;
  if ($) return;
  t = AtomicDecr(& de->count);
  if ($) return;
  if (t == 0)
    SetEvent(& de->stopEvent);
  if ($) return;
  WaitEvent(& de->stopEvent);
}

main( ) {
  DispatchRoutine( );
}
bool done = F;

CODE stands for:
  if ( !done ) {
    if ($) {
      done = T;
      DispatchRoutine( );
    }
  }

DispatchRoutine( ) {
  int t;
  if ($) return;
  if (! de->stopping) {
    if ($) return;
    AtomicIncr(& de->count);
    // do useful work
    // …
    if ($) return;
    t = AtomicDecr(& de->count);
    if ($) return;
    if (t == 0)
      SetEvent(& de->stopEvent);
  }
}

PnpStop( ) {
  int t;
  CODE;
  de->stopping = T;
  CODE;
  t = AtomicDecr(& de->count);
  CODE;
  if (t == 0)
    SetEvent(& de->stopEvent);
  CODE;
  WaitEvent(& de->stopEvent);
  CODE;
}

main( ) {
  PnpStop( );
}
KISS features
• KISS trades off soundness for scalability
• Cost of analyzing a concurrent program P = cost of analyzing a sequential program Q
  – Size of Q asymptotically same as size of P
• Unsoundness is precisely quantifiable
  – for a 2-thread program, explores all executions with up to two context switches
  – for an n-thread program, explores up to 2n-2 context switches
• Allows any sequential checker to analyze concurrency
Experimental Evaluation of KISS
Driver Stopping Error in Bluetooth Driver (1 KLOC)

DispatchRoutine() {
  int t;
  if (! de->stopping) {
    AtomicIncr(& de->count);
    assert ! driverStopped;
    // do useful work
    // …
    t = AtomicDecr(& de->count);
    if (t == 0)
      SetEvent(& de->stopEvent);
  }
}

PnpStop() {
  int t;
  de->stopping = T;
  t = AtomicDecr(& de->count);
  if (t == 0)
    SetEvent(& de->stopEvent);
  WaitEvent(& de->stopEvent);
  driverStopped = T;
}
DispatchRoutine                         PnpStop

int t;
if (! de->stopping) {
          (context switch)
                                        int t;
                                        de->stopping = T;
                                        t = AtomicDecr(& de->count);
                                        if (t == 0)
                                          SetEvent(& de->stopEvent);
                                        WaitEvent(& de->stopEvent);
                                        driverStopped = T;
          (context switch)
  AtomicIncr(& de->count);
  assert ! driverStopped;               Assertion fails!
  // do useful work
  // …
  t = AtomicDecr(& de->count);
  if (t == 0)
    SetEvent(& de->stopEvent);
}
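The KISS-style sequential program for this driver can be mimicked in plain C by enumerating every resolution of the nondeterministic choice `$`. The sketch below is an illustration under stated assumptions, not the KISS tool itself: the struct `DeviceExt`, the initial value `count = 1`, the bit-string encoding of `$` in `choices`, and the function names `nondet`, `schedule` and `kiss_finds_violation` are all hypothetical. `WaitEvent` is modeled by abandoning `PnpStop` when the event is not set.

```c
#include <stdbool.h>

/* Hypothetical model of the device extension (count starts at 1: assumption). */
typedef struct { bool stopping; int count; bool stopEvent; } DeviceExt;

static DeviceExt de;
static bool driverStopped, done, violated;
static unsigned choices;            /* bit string that resolves every '$' */

/* '$': consume the next bit of the current choice string. */
static bool nondet(void) { bool b = choices & 1u; choices >>= 1; return b; }

/* PnpStop with "if ($) return;" before each statement. */
static void PnpStop(void) {
    int t;
    if (nondet()) return;
    de.stopping = true;
    if (nondet()) return;
    t = --de.count;                 /* AtomicDecr */
    if (nondet()) return;
    if (t == 0) de.stopEvent = true;
    if (nondet()) return;
    if (!de.stopEvent) return;      /* WaitEvent would block: no further progress */
    driverStopped = true;
}

/* CODE: run the other thread at most once, at a nondeterministic point. */
static void schedule(void) {
    if (!done && nondet()) { done = true; PnpStop(); }
}

static void DispatchRoutine(void) {
    int t;
    schedule();
    if (!de.stopping) {
        schedule();
        de.count++;                             /* AtomicIncr */
        if (driverStopped) violated = true;     /* assert !driverStopped */
        schedule();
        t = --de.count;
        schedule();
        if (t == 0) de.stopEvent = true;
        schedule();
    }
}

/* Try every choice string of nbits bits (nbits < 32); true if the assertion can fail. */
bool kiss_finds_violation(unsigned nbits) {
    for (unsigned mask = 0; mask < (1u << nbits); mask++) {
        de = (DeviceExt){ false, 1, false };
        driverStopped = done = violated = false;
        choices = mask;
        DispatchRoutine();
        if (violated) return true;
    }
    return false;
}
```

At most nine choices are consumed on any path, so `kiss_finds_violation(10)` covers every path of this model. A real sequential checker explores these paths symbolically rather than by enumeration.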
IRP Cancellation Error in Packet Driver (2.5 KLOC)

DispatchRoutine(IRP *irp) {
  …
  irp->CancelRoutine = PacketCancelRoutine;
  Enqueue(irp);
  IoMarkIrpPending(irp);
  …
}

IoCancelIrp(IRP *irp) {
  IoAcquireCancelSpinLock();
  if (irp->CancelRoutine) {
    (irp->CancelRoutine)(irp);
  }
  …
}

PacketCancelRoutine(IRP *irp) {
  …
  Dequeue(irp);
  IoCompleteRequest(irp);
  IoReleaseCancelSpinLock();
  …
}
DispatchRoutine                               IoCancelIrp

…
irp->CancelRoutine = PacketCancelRoutine;
Enqueue(irp);
                                              IoAcquireCancelSpinLock();
                                              if (irp->CancelRoutine) {
                                                // inline PacketCancelRoutine(irp)
                                                …
                                                Dequeue(irp);
                                                IoCompleteRequest(irp);
                                                IoReleaseCancelSpinLock();
IoMarkIrpPending(irp);

Error: An irp should not be marked pending after it has been completed!
Data-race Conditions in DDK Sample Drivers
• Device extension shared among threads
• Data-races on device extension fields
• 18 sample DDK drivers
  – Range 0.5-9.2 KLOC
  – Total 70 KLOC
• Each field checked separately with resource limit of 20 minutes and 800MB
• Two threads: each calls nondeterministically chosen dispatch routine
Driver              KLOC   # Fields   # Races
Tracedrv             0.5        3         0
Moufiltr             1.0       14         0
Kbfiltr              1.1       15         0
Imca                 1.1        5         1
Startio              1.1        9         0
Toaster/toastmon     1.4        8         1
Diskperf             2.4       16         0
1394diag             2.7       18         0
1394vdev             2.8       18         1
Fakemodem            2.9       39         6
Toaster/bus          5.0       30         0
Serenum              5.9       41         2
Toaster/func         6.6       24         5
Mouclass             7.0       34         1
Kbdclass             7.4       36         1
Mouser               7.6       34         1
Fdc                  9.2       92         9
Total: 30 races
DevicePnpState Field in Toaster/toastmon
ToastMon_DispatchPnp(DEVICE_OBJECT *obj, IRP *irp)
{
  …
  IoAcquireRemoveLock();
  …
  case IRP_MN_QUERY_STOP_DEVICE:
    // Race: write access
    deviceExt->DevicePnPState = StopPending;
    …
    break;
  …
  IoReleaseRemoveLock();
  …
}

ToastMon_DispatchPower(DEVICE_OBJECT *obj, IRP *irp)
{
  …
  // Race: read access
  if (deviceExt->DevicePnpState == Deleted) {
    …
  }
  …
}
Acknowledgments
• Tom Ball
• Byron Cook
• John Henry
• Doron Holan
• Vladimir Levin
• Jakob Lichtenberg
• Adrian Oney
• Sriram Rajamani
• Peter Wieland
• …
Keep It Simple and Sequential
• Context-bounded analysis by leveraging existing sequential checkers
• Validates the hypothesis that many concurrency errors require few context switches to show up
However…
• Hard limit on number of explored contexts
  – e.g., two context switches for a concurrent program with two threads
• Case study: Concurrent transaction management code written in C# (Naik-Rehof 04)
  – Analyzed by the Zing model checker after automatically translating to the Zing input language
  – Found three bugs, each requiring between three and four context switches
Is a tuning knob possible?
Given a concurrent boolean program P, does P go wrong by failing an assertion?
  Undecidable
Given a concurrent boolean program P and a positive integer c, does P go wrong by failing an assertion via an execution with at most c contexts?
  Decidable
Given a concurrent boolean program P with unbounded fork-join parallelism and a positive integer c, does P go wrong by failing an assertion via an execution with at most c contexts?
  Decidable
[Figure: an execution drawn as Context, context switch, Context, context switch, Context]
Problem:
• Unbounded computation possible within each context!
• Unbounded execution depth and reachable state space
• Different from bounded-depth model checking
Sequential boolean program
Global store g, valuation to global variables
Local store l, valuation to local variables
Stack s, sequence of local stores
State (g, s)
bool a = F;

void main( ) {
L1:  a = T;
L2:  flip(a);
L3:
}

void flip(bool x) {
L4:  a = !x;
L5:
}
Example
State: (a, stack of ⟨x, pc⟩ frames)
(F, ⟨_, L1⟩)
(T, ⟨_, L2⟩)
(T, ⟨_, L3⟩ ⟨T, L4⟩)
(F, ⟨_, L3⟩ ⟨T, L5⟩)
(F, ⟨_, L3⟩)
(F, )
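The trace above can be reproduced by a small interpreter with an explicit stack of local stores. This is a minimal sketch for this one example only; the types `Frame` and `State` and the functions `step`, `example_transitions` and `example_final_a` are hypothetical names, and the unused local `x` of `main` is represented as `false`.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { L1, L2, L3, L4, L5 } Pc;
typedef struct { bool x; Pc pc; } Frame;          /* local store plus program counter */
typedef struct { bool a; Frame stack[4]; size_t depth; } State;   /* state (g, s) */

/* Perform one transition; returns false once the stack is empty. */
static bool step(State *st) {
    if (st->depth == 0) return false;
    Frame *top = &st->stack[st->depth - 1];
    switch (top->pc) {
    case L1: st->a = true; top->pc = L2; break;                /* a = T */
    case L2: top->pc = L3;                                     /* return address */
             st->stack[st->depth++] = (Frame){ st->a, L4 };    /* call flip(a) */
             break;
    case L4: st->a = !top->x; top->pc = L5; break;             /* a = !x */
    case L3:
    case L5: st->depth--; break;                               /* return */
    }
    return true;
}

static State initial(void) {
    State st = { false, { { false, L1 } }, 1 };   /* (F, <_, L1>) */
    return st;
}

/* Number of transitions until the stack is empty. */
int example_transitions(void) {
    State st = initial();
    int n = 0;
    while (step(&st)) n++;
    return n;
}

/* Value of the global a in the final state. */
bool example_final_a(void) {
    State st = initial();
    while (step(&st)) { }
    return st.a;
}
```

The run takes five transitions through the six states listed above and ends with a = F and an empty stack.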
Sequential boolean program
Global store g, valuation to global variables
Local store l, valuation to local variables
Stack s, sequence of local stores
State (g, s)
Transition relation: (g, s) → (g’, s’)
Reachability problem for sequential boolean program
Given (g, s), is there s’ such that (g, s) →* (error, s’)?
Aggregate state
Set of stacks ss
Aggregate state (g, ss) = { (g, s) | s ∈ ss }
Reach(g, ss) = { (g’, s’) | exists s ∈ ss such that (g, s) →* (g’, s’) }
Aggregate transition relation
Observations:
• There is a unique smallest partition of Reach(g, ss) into aggregate states: (g’_1, ss’_1) … (g’_n, ss’_n)
• The number of elements in the partition is bounded by the number of global stores

Aggregate transitions: (g, ss) ⇒ (g’_1, ss’_1), …, (g, ss) ⇒ (g’_n, ss’_n)
Theorem (Büchi, Schwoon 00)
• If ss is regular and (g, ss) ⇒ (g’, ss’), then ss’ is regular.
• If ss is given as a finite automaton A, then a finite automaton A’ for ss’ can be constructed from A in polynomial time.
Algorithm
Problem: Given (g, s), is there s’ such that (g, s) →* (error, s’)?
Solution: Compute automaton for ss’ such that (g, {s}) ⇒ (error, ss’) and check if ss’ is nonempty.
Concurrent boolean program
Global store g, valuation to global variables
Local store l, valuation to local variables
Stack s, sequence of local stores
State (g, s_1, s_2)
Transition relation:
  If (g, s_1) → (g’, s’_1) in thread 1, then (g, s_1, s_2) →_1 (g’, s’_1, s_2)
  If (g, s_2) → (g’, s’_2) in thread 2, then (g, s_1, s_2) →_2 (g’, s_1, s’_2)
Reachability problem for concurrent boolean program
Given (g, s_1, s_2), are there s’_1 and s’_2 such that (g, s_1, s_2) reaches (error, s’_1, s’_2) via an execution with at most c contexts?
Aggregate transition relation
(g, ss_1, ss_2) = { (g, s_1, s_2) | s_1 ∈ ss_1, s_2 ∈ ss_2 }
If (g, ss_1) ⇒ (g’, ss’_1) in thread 1, then (g, ss_1, ss_2) ⇒_1 (g’, ss’_1, ss_2)
If (g, ss_2) ⇒ (g’, ss’_2) in thread 2, then (g, ss_1, ss_2) ⇒_2 (g’, ss_1, ss’_2)
Algorithm: 2 threads, c contexts
[Figure: tree of aggregate states rooted at (g, {s_1}, {s_2}), edges labeled 1 or 2, depth c]
Compute the set of reachable aggregate states.
Report an error if (g, ss_1, ss_2) is reachable and g = error, ss_1 is nonempty, and ss_2 is nonempty.
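To make the role of the context bound concrete, here is a deliberately simplified sketch. It is not the aggregate-state algorithm: it assumes finite-state threads without procedures, so ordinary states stand in for aggregate states and a plain depth-first search suffices. The two threads model the Bluetooth driver routines shown earlier; the program-counter encoding, the initial `count = 1`, and the names `State`, `step`, `search` and `fails_within` are assumptions made for illustration.

```c
#include <stdbool.h>

typedef struct {
    int pc[2];            /* program counters: thread 0 = DispatchRoutine, 1 = PnpStop */
    int t[2];             /* each thread's local variable t */
    bool stopping, stopEvent, driverStopped, violated;
    int count;
} State;

/* Execute one statement of thread th; false if it has finished or is blocked. */
static bool step(State *s, int th) {
    int pc = s->pc[th];
    if (th == 0) {
        switch (pc) {
        case 0: s->pc[0] = s->stopping ? 5 : 1; return true;   /* if (!stopping) */
        case 1: s->count++; break;                              /* AtomicIncr */
        case 2: if (s->driverStopped) s->violated = true; break; /* assert */
        case 3: s->t[0] = --s->count; break;                    /* AtomicDecr */
        case 4: if (s->t[0] == 0) s->stopEvent = true; break;
        default: return false;
        }
    } else {
        switch (pc) {
        case 0: s->stopping = true; break;
        case 1: s->t[1] = --s->count; break;
        case 2: if (s->t[1] == 0) s->stopEvent = true; break;
        case 3: if (!s->stopEvent) return false; break;         /* WaitEvent blocks */
        case 4: s->driverStopped = true; break;
        default: return false;
        }
    }
    s->pc[th] = pc + 1;
    return true;
}

/* Explore: either continue the current context or spend one context switch. */
static bool search(State s, int thread, int switches_left) {
    if (s.violated) return true;
    State next = s;
    if (step(&next, thread) && search(next, thread, switches_left)) return true;
    return switches_left > 0 && search(s, 1 - thread, switches_left - 1);
}

/* Can the assertion fail in an execution with at most `contexts` contexts? */
bool fails_within(int contexts) {
    State init = { {0, 0}, {0, 0}, false, false, false, false, 1 };
    if (contexts < 1) return false;
    return search(init, 0, contexts - 1) || search(init, 1, contexts - 1);
}
```

In this model the assertion cannot fail with two contexts but does fail with three, that is, with two context switches.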
Complexity: 2 threads, c contexts
[Figure: tree of aggregate states rooted at (g, {s_1}, {s_2}), edges labeled 1 or 2, depth c]
• Depth of tree = context bound c
• Branching factor bounded by G × 2 (G = # of global stores)
• Number of edges bounded by (G × 2)^(c+1)
• Each edge computable in polynomial time
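One way to arrive at an edge bound of this form, assuming each node has at most 2G children (at most G successor aggregate states for each of the two threads) and each level of the tree corresponds to one context:

```latex
\#\text{edges} \;\le\; \sum_{d=1}^{c} (2G)^{d}
\;=\; \frac{(2G)^{c+1} - 2G}{2G - 1}
\;\le\; (2G)^{c+1} \qquad (G \ge 1)
```

The bound is exponential in the context bound c but, for fixed c, polynomial in the number of global stores G.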
Context-bounded analysis of concurrent software
• Many subtle concurrency errors are manifested in executions with few context switches
  – Experience with KISS on Windows drivers
  – Experience with Zing on transaction manager
• Algorithms for context-bounded analysis are more efficient than those for unbounded analysis
  – Reducibility to sequential checking with KISS
  – Decidability of assertion checking for concurrent boolean programs
Applications of context-bounded analysis
• Coverage metric for testing concurrent software
• Analysis of computer protocols
  – networking
  – cache-coherence
Unbounded fork-join parallelism
• Fork operation: x = fork
• Join operation: join(x)
• Copy thread identifier from one variable to another
Algorithm: unbounded fork-join parallelism, c contexts
• At most c threads may perform a transition
• Reduce to previously solved problem with c threads and c contexts
  – Nondeterministically pick c forked threads for execution
start : {1, …, c} → boolean, initialized to λi. (i == 1)
end : {1, …, c} → boolean, initialized to λi. false
• c statically created threads
• thread i starts execution when start[i] is true
• thread i sets end[i] to true on termination
count : {1, …, c}, initialized to 1

x = fork translates to
  if ($) {
    assume(count < c);
    count = count + 1;
    x = count;
    start[count] = true;
  } else {
    x = c + 1;
  }

join(x) translates to
  assume(x <= c);
  assume(end[x]);
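The translation of fork and join can be sketched in C as follows. This is an illustration under assumptions: `MAX_THREADS` plays the role of c (fixed to 3 here), `assume` is modeled by returning `false` for an infeasible path, the nondeterministic choice `$` is passed in as the parameter `pick`, and the names `Sched`, `sched_init`, `translated_fork` and `translated_join` are hypothetical.

```c
#include <stdbool.h>

#define MAX_THREADS 3   /* the context bound c (assumed value for illustration) */

typedef struct {
    bool start[MAX_THREADS + 1];   /* indices 1..c are used */
    bool end[MAX_THREADS + 1];
    int count;
} Sched;

/* start = (i == 1), end = false, count = 1 */
void sched_init(Sched *s) {
    for (int i = 1; i <= MAX_THREADS; i++) {
        s->start[i] = (i == 1);
        s->end[i] = false;
    }
    s->count = 1;
}

/* x = fork. `pick` resolves '$'. Returns false if an assume fails. */
bool translated_fork(Sched *s, bool pick, int *x) {
    if (pick) {
        if (!(s->count < MAX_THREADS)) return false;   /* assume(count < c) */
        s->count = s->count + 1;
        *x = s->count;
        s->start[s->count] = true;     /* one of the c static threads may now run */
    } else {
        *x = MAX_THREADS + 1;          /* forked thread is never scheduled */
    }
    return true;
}

/* join(x). Returns false if an assume fails (the path is infeasible). */
bool translated_join(const Sched *s, int x) {
    if (!(x <= MAX_THREADS)) return false;   /* assume(x <= c) */
    return s->end[x];                        /* assume(end[x]) */
}
```

A thread forked with `pick == false` receives the identifier c + 1, so any later join on it is infeasible, which matches the idea that only the c picked threads ever perform a transition.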