Automatic Detection and Repair of Errors in Data Structures Brian Demsky Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology.


Automatic Detection and Repair of Errors
in Data Structures
Brian Demsky
Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology
Motivation
[Figure: Broken Data Structure, with nodes carrying field values such as F = 20, G = 10; F = 20, G = 5; I = 5, J = 2]
Errors
• Missing elements
• Inappropriate sharing
• Dangling references
• Out of bounds array indices
• Inconsistent values
Goal
[Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure]
Goal
[Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure, with Consistency Properties From Developer as a further input to the Repair Algorithm]
What Does Repair Algorithm Produce?
• Data structure that
• Satisfies consistency properties, and
• Heuristically close to broken data structure
• Not necessarily the same data structure as
(hypothetical) correct program would produce
• But enough to keep program operating
successfully
Precursors
• Data structure repair has historically appeared
in systems with extreme reliability goals
• 5ESS switch – hand coded audit routines
• IBM MVS operating system – hand coded
failure recovery routines
• Key component of these systems
Where Is This Likely To Be Useful?
• Not for systems with slack - can just reboot
• Cause of error must go away after reboot
• Must be OK to lose volatile state
• Must be OK to wait for reboot
• Persistent data structures
(file systems, application files)
• Autonomous and/or safety critical systems
• Monitor/control unstable physical phenomena
• Largely independent subcomputations
• Moving time window
Architecture
[Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]
Architecture Rationale
Why go through the abstract model?
• Simple, uniform structure
• Sets of objects
• Relations between objects
• Simplifies both
• Expression of consistency properties
• Repair algorithm
• Enables system to support full range of
efficient, heavily encoded data structures
File System Example
[Figure: Directory Entries (abst, 0) and (intro, 2); Disk Blocks holding the nextBlock values 1, -5, 1, -1]
struct Entry {
    byte name[Length];
    int firstBlock;
}
struct Block {
    int nextBlock;
    byte data[BlockSize];
}
struct Disk {
    Entry dir[NumEntries];
    Block block[NumBlocks];
}
Disk D;
Model Definition
• Sets of objects
set blocks of integer : partition used | free;
• Relations between objects – values of object
fields, referencing relationships between objects
relation next : used, used;
[Figure: set blocks partitioned into used and free, with relation next from used to used]
Model Translation
Bits translated to sets and relations in abstract model
using statements of the form:
Quantifiers, Condition ⇒ Inclusion Constraint
for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and
D.dir[i].firstBlock < NumBlocks ⇒
D.dir[i].firstBlock in used
for b in used, 0 ≤ D.block[b].nextBlock and
D.block[b].nextBlock < NumBlocks ⇒
⟨b, D.block[b].nextBlock⟩ in next
for ⟨b,n⟩ in next, true ⇒ n in used
for b in 0..NumBlocks, not (b in used) ⇒ b in free
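The four translation rules can be read as a small fixed-point computation. The sketch below is only an illustration under stated assumptions: the concrete field values, the function name `build_model`, and the choice to apply the rule with negation after the other three have stabilized are not taken from the slides.

```python
# Hypothetical example values (assumed for illustration, not read from a real disk).
NUM_BLOCKS = 4
first_block = [0, 2]           # D.dir[i].firstBlock for each directory entry
next_block = [1, -1, 1, -5]    # D.block[b].nextBlock for each disk block


def build_model(first_block, next_block, num_blocks):
    """Apply the translation rules until no rule adds anything new."""
    used, nxt = set(), set()
    changed = True
    while changed:
        changed = False
        # Rule 1: in-range firstBlock values belong to used.
        for fb in first_block:
            if 0 <= fb < num_blocks and fb not in used:
                used.add(fb)
                changed = True
        # Rule 2: in-range nextBlock values of used blocks give pairs in next.
        for b in list(used):
            n = next_block[b]
            if 0 <= n < num_blocks and (b, n) not in nxt:
                nxt.add((b, n))
                changed = True
        # Rule 3: the target of every next pair belongs to used.
        for _, n in list(nxt):
            if n not in used:
                used.add(n)
                changed = True
    # Rule 4 (contains a negation, so it runs last): the rest is free.
    free = set(range(num_blocks)) - used
    return used, nxt, free


used, nxt, free = build_model(first_block, next_block, NUM_BLOCKS)
print(sorted(used), sorted(nxt), sorted(free))
# [0, 1, 2] [(0, 1), (2, 1)] [3]
```

With these assumed values, blocks 0 and 2 both point at block 1, which is the inconsistency the later slides repair.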
Model in Example
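[Figure: the example file system (Directory Entries abst, 0 and intro, 2; Disk Blocks) alongside its model: blocks partitioned into used = {0, 1, 2} and free = {3}, with two next pairs leading into block 1]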
Internal Consistency Properties
Quantifiers, Body
• Body is first-order property of basic propositions
• Inequality constraints on values of numeric fields
• V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
• Presence of required number of objects
• size(S) = C, size(S) ≤ C, size(S) ≥ C
• Topology of region surrounding each object
• size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
• size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
• Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R
• Example: for b in used, size(next.b) ≤ 1
Internal Consistency Violations
Evaluate consistency properties, find violations
for b in used, size(next.b) ≤ 1 is false for b = 1
[Figure: model with used = {0, 1, 2}, free = {3}, and next pairs ⟨0,1⟩ and ⟨2,1⟩]
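Evaluating the example property amounts to counting, for each used block, how many next pairs lead into it. A minimal sketch, assuming the model is held in plain Python sets (the helper names `inverse_image` and `violations` are made up for illustration):

```python
used = {0, 1, 2}
nxt = {(0, 1), (2, 1)}   # pairs (b, n): block b is followed by block n


def inverse_image(relation, target):
    """next.b in the slides' notation: all objects related to target."""
    return {src for (src, dst) in relation if dst == target}


def violations(used, relation, limit=1):
    """Bindings of b for which size(next.b) <= limit is false."""
    return sorted(b for b in used if len(inverse_image(relation, b)) > limit)


print(violations(used, nxt))  # [1]
```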
Repairing Violations of Internal
Consistency Properties
• Violation provides binding for quantified variables
• Convert Body to disjunctive normal form
(p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm)
p1 … pn , q1 … qm are basic propositions
• Choose a conjunction to satisfy
• Repair violated basic propositions in conjunction
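The steps above might be sketched as follows. Everything here is hypothetical: the `Prop` class, the example body, and in particular the rule for choosing a conjunction (fewest violated propositions), since the slides only say that one conjunction is chosen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Prop:
    """A basic proposition with a check and a repair action (both hypothetical)."""
    name: str
    holds: Callable[[State], bool]
    repair: Callable[[State], None]


def repair_body(dnf: List[List[Prop]], state: State) -> List[str]:
    """Satisfy one conjunction of a DNF body; return the names of repaired props."""
    def violated(conj):
        return [p for p in conj if not p.holds(state)]

    # Assumed heuristic: pick the conjunction with the fewest violated propositions.
    target = min(dnf, key=lambda conj: len(violated(conj)))
    repaired = []
    for p in violated(target):
        p.repair(state)
        repaired.append(p.name)
    return repaired


# Assumed body: (F = 2*G) or (F = 0 and G = 0); field names are illustrative only.
state = {"F": 20, "G": 5}
dnf = [
    [Prop("F = 2*G", lambda s: s["F"] == 2 * s["G"],
          lambda s: s.update(F=2 * s["G"]))],
    [Prop("F = 0", lambda s: s["F"] == 0, lambda s: s.update(F=0)),
     Prop("G = 0", lambda s: s["G"] == 0, lambda s: s.update(G=0))],
]
print(repair_body(dnf, state), state)
# ['F = 2*G'] {'F': 10, 'G': 5}
```

This sketch does not recheck a conjunction after repairing it, so it relies on the repairs within one conjunction not undoing each other.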
Repairing Violations of Basic Propositions
• Inequality constraints on values of numeric fields
• V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
• Compute value of expression, assign field
• Presence of required number of objects
• size(S) = C, size(S) ≤ C, size(S) ≥ C
• Remove or insert objects from/to set
• Topology of region surrounding each object
• size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
• size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
• Remove or insert pairs from/to relation
• Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R
• Remove or add the object or pair from/to set or relation
Repair in Example
for b in used, size(next.b) ≤ 1 is false for b = 1
Must repair size(next.1) ≤ 1
Can remove either ⟨0,1⟩ or ⟨2,1⟩ from next
[Figure: model before repair, with used = {0, 1, 2}, free = {3}, and both next pairs into block 1 present]
Repair in Example
for b in used, size(next.b) ≤ 1 is false for b = 1
Must repair size(next.1) ≤ 1
Can remove either ⟨0,1⟩ or ⟨2,1⟩ from next
[Figure: model after repair, with one of the two next pairs into block 1 removed]
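The repair in this example removes pairs from the relation until the size bound holds. A minimal sketch; the function name and the policy for deciding which pair survives (smallest source index) are assumptions, as the slides only say that either pair can be removed:

```python
def repair_indegree(relation, b, limit=1):
    """Remove pairs (x, b) until at most `limit` remain; return the removed pairs."""
    incoming = sorted(pair for pair in relation if pair[1] == b)
    # Assumed policy: keep the pairs with the smallest source index.
    removed = incoming[limit:]
    relation.difference_update(removed)
    return removed


nxt = {(0, 1), (2, 1)}
removed = repair_indegree(nxt, 1)
print(removed, sorted(nxt))  # [(2, 1)] [(0, 1)]
```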
Acyclic Repair Dependences
• Questions
• Isn’t it possible for the repair of one
constraint to invalidate another constraint?
• What about infinite repair loops?
• What about unsatisfiable specifications?
• Answer
• We require specifications to have no cyclic
repair dependences between constraints
• So all repair sequences terminate
• Repair can fail only because of resource
limitations
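One way to picture the acyclicity requirement is as a cycle check on a graph whose edges say "repairing this constraint may violate that one". The slides do not say how the dependences are derived, so the graph below is hand-written and hypothetical; only the cycle check itself (a standard depth-first search) is shown.

```python
def has_cycle(deps):
    """deps maps a constraint to the constraints its repair may violate."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in deps}

    def visit(node):
        color[node] = GREY
        for succ in deps.get(node, ()):
            mark = color.get(succ, WHITE)
            if mark == GREY:              # back edge: a repair loop exists
                return True
            if mark == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(deps))


acyclic = {"C1": ["C2"], "C2": ["C3"], "C3": []}
cyclic = {"C1": ["C2"], "C2": ["C1"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```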
External Consistency Constraints
Quantifiers, Condition ⇒ Body
• Body of form V = E, V.F = E, V.F[I] = E
• Example
for b in free, true ⇒ D.block[b].nextBlock = -2
for ⟨i,j⟩ in next, true ⇒ D.block[i].nextBlock = j
for b in used, size(b.next) = 0 ⇒
D.block[b].nextBlock = -1
• Repair simply performs assignments
• Translates model repairs to bit repairs
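The three example constraints can be read as plain assignments driven by the repaired model. A minimal sketch, assuming the nextBlock fields are held in a Python list and the model in sets (the function name and the pre-repair values are illustrative assumptions):

```python
def apply_external_constraints(next_block, used, free, nxt):
    """Write the repaired model back into the nextBlock fields."""
    for b in free:                 # for b in free: nextBlock = -2
        next_block[b] = -2
    for i, j in nxt:               # for <i,j> in next: nextBlock = j
        next_block[i] = j
    sources = {i for i, _ in nxt}
    for b in used:                 # for b in used with size(b.next) = 0
        if b not in sources:
            next_block[b] = -1
    return next_block


bits = [1, -1, 1, -5]   # assumed pre-repair nextBlock values
print(apply_external_constraints(bits, {0, 1, 2}, {3}, {(0, 1)}))
# [1, -1, -1, -2]
```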
Repair in Example
[Figure: Inconsistent File System (Directory Entries abst, 0 and intro, 2; Disk Blocks holding nextBlock values 1, -5, 1, -1) next to Repaired File System (same Directory Entries; Disk Blocks holding nextBlock values 1, -1, -1, -2)]
When to Test for Consistency and Repair
• Persistent data structures
• Repair can be independent activity, or
• Repair when data written out or read in
• Volatile data structures in running program
• Under programmer control
• Transaction-based approach
• Identify transaction start and end
• Repair at start, end, or both
• Failure-based approach
• Wait until program fails
• Repair and restart from latest safe point
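The transaction-based approach could be wrapped around an update roughly as follows. This is a sketch under assumptions: the `transaction` helper, the `check` and `repair` callbacks, and the toy ledger are all invented for illustration; the slides do not describe an API.

```python
from contextlib import contextmanager


@contextmanager
def transaction(structure, check, repair, at_start=True, at_end=True):
    """Check (and if needed repair) the structure at transaction boundaries."""
    if at_start and not check(structure):
        repair(structure)
    try:
        yield structure
    finally:
        # Runs whether or not the body raised an exception.
        if at_end and not check(structure):
            repair(structure)


def consistent(s):
    return s["total"] == sum(s["items"])


def fix(s):
    s["total"] = sum(s["items"])


ledger = {"items": [1, 2], "total": 3}
with transaction(ledger, consistent, fix) as s:
    s["items"].append(4)   # buggy update: forgets to refresh total
print(ledger["total"])  # 7
```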
Experience
• We acquired four benchmarks (written in C/C++)
• CTAS (air-traffic control tool)
• Simplified Linux file system
• Freeciv interactive game
• Microsoft Word files
• We developed specifications for all four
• Very little development time (days, not weeks)
• Most of time spent figuring out Freeciv and CTAS
• Each benchmark has
• Workload
• Fault insertion methodology
• Ran benchmarks with and without repair
CTAS
• Set of air-traffic control tools
• Traffic management
• Arrival planning
• Flow visualization
• Shortcut planning
• Deployed in centers around country
(Dallas/Ft. Worth, Los Angeles, Denver, Miami,
Minneapolis/St. Paul, Atlanta, Oakland)
• Approximately 1 million lines of C/C++ code
CTAS Screen Shot
Results
• Workload – recorded radar feed from DFW
• Fault insertion
• Simulate error in flight plan processing
• Bad airport index in flight plan data structure
• Without repair
• System crashes – segmentation fault
• With repair
• Aircraft has different origin or destination
• System continues to execute
• Anomaly eventually flushed from system
Aspects of CTAS
• Lots of independent subcomputations
• System processes hundreds of aircraft –
problem with one should not affect others
• Multipurpose system
(visualization, arrival planning, shortcuts, …) –
problem in one purpose should not affect others
• Sliding time window: anomalies eventually flushed
• Rebooting ineffective – system will crash again as
soon as it sees the problematic flight plan
Simplified Linux File System
[Figure: disk blocks of the simplified file system: super block, group block, inode bitmap block, block bitmap block, inode block (inode … inode), and directory block (entry intro, 0)]
Some Consistency Properties
• inode bitmap consistent with inode usage
• block bitmap consistent with block usage
• directory entries refer to valid inodes
• files contain valid blocks only
• files do not share blocks
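Three of these properties (valid blocks, no sharing, bitmap agreement) can be checked with simple counting. The sketch below uses an invented in-memory representation and ignores metadata blocks; it is not the benchmark's actual on-disk format.

```python
from collections import Counter


def check_blocks(files, block_bitmap):
    """files maps a name to its block numbers; block_bitmap[i] is True if allocated."""
    counts = Counter(b for blocks in files.values() for b in blocks)
    # Files contain valid blocks only.
    invalid = sorted(b for b in counts if not 0 <= b < len(block_bitmap))
    # Files do not share blocks.
    shared = sorted(b for b, c in counts.items() if c > 1)
    # Block bitmap consistent with block usage.
    bitmap_errors = sorted(
        i for i, bit in enumerate(block_bitmap) if bit != (i in counts)
    )
    return invalid, shared, bitmap_errors


files = {"a": [0, 1], "b": [1, 2]}
bitmap = [True, True, False, True]
print(check_blocks(files, bitmap))  # ([], [1], [2, 3])
```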
Results
• Workload – write and verify several files
• Fault insertion – crash file system
• Inode and block bitmap errors
• Partially initialized directory and inode entries
• Without repair
• Incorrect file contents because of inode and
disk block sharing
• With repair
• Bitmaps repaired preventing illegal sharing,
correct file contents
Freeciv
[Figure: Terrain Grid of tiles labeled O = Ocean, P = Plain, M = Mountain, next to City Structures with loc: 3,0 and loc: 2,3]
Consistency Properties
• Tiles have valid terrain values
• Cities are not in the ocean
• Each city has exactly one reference from city location grid
• City locations are consistent in city structures and tile grid
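The first two properties lend themselves to a direct check. The grid, the city list, and the `check_map` helper below are made up for illustration and do not reflect Freeciv's real data structures:

```python
OCEAN, PLAIN, MOUNTAIN = "O", "P", "M"
VALID = {OCEAN, PLAIN, MOUNTAIN}


def check_map(grid, cities):
    """grid[y][x] is a terrain code; cities is a list of (x, y) locations."""
    problems = []
    # Tiles have valid terrain values.
    for y, row in enumerate(grid):
        for x, terrain in enumerate(row):
            if terrain not in VALID:
                problems.append(("bad terrain", (x, y)))
    # Cities are not in the ocean.
    for (x, y) in cities:
        if grid[y][x] == OCEAN:
            problems.append(("city in ocean", (x, y)))
    return problems


grid = ["OOPP", "OPPM", "PMPM", "MM?M"]   # hypothetical map with one corrupt tile
cities = [(3, 0), (0, 1)]                 # hypothetical city locations
print(check_map(grid, cities))
# [('bad terrain', (2, 3)), ('city in ocean', (0, 1))]
```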
Results
• Workload – Freeciv software plays against itself
• Fault insertion – randomly corrupt terrain values
• Without repair – program fails (seg fault)
• With repair
• Game runs just fine
• But game plays out differently because of the different terrain values
Microsoft Word Files
• Files consist of a sequence of streams
• Streams stored using FAT-based data structure
[Figure: Directory Entries (abst, intro), FAT, and Disk Blocks of a Word file]
• Consistency Properties
• FAT blocks exist and contain valid entries
• FAT streams are properly terminated
• Free blocks properly marked
• Streams contain valid blocks
• No sharing of blocks between streams
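Checking that a stream is properly terminated comes down to walking its chain through the FAT. In the sketch below the end and free markers (-1 and -2) are borrowed from the earlier file system example as an assumption; they are not the values Word files actually use, and `walk_stream` is an invented helper.

```python
END, FREE = -1, -2   # assumed markers, mirroring the earlier file system example


def walk_stream(fat, start):
    """Return the block chain of a stream, or None if it is not properly terminated."""
    chain, seen = [], set()
    b = start
    while b != END:
        # Reject out-of-range entries, cycles, and chains running into free blocks.
        if not 0 <= b < len(fat) or b in seen or fat[b] == FREE:
            return None
        seen.add(b)
        chain.append(b)
        b = fat[b]
    return chain


fat = [1, 2, -1, 4, 3, -2]   # hypothetical FAT: 0→1→2→end, 3→4→3 (cycle), 5 free
print(walk_stream(fat, 0), walk_stream(fat, 3))  # [0, 1, 2] None
```

Collecting the chains of all streams and counting block occurrences would then cover the no-sharing property as well.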
Results
• Workload – several Microsoft Word files
• Fault insertion – scramble FAT
• Without repair
• If blocks containing the FAT were
incorrectly marked as free, Word
successfully loads file
• Otherwise,
“The document name or path is not valid”
• With repair
• Word loads all files
Recent Work
[Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]
• External consistency properties translate model
repairs to data structure repairs
• Errors may cause data structures to remain
inconsistent even after repair
Recent Work
[Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]
• Current strategy
• Eliminate external consistency properties
• Analyze model definition rules and internal
consistency properties
• Automatically generate data structure repairs
Recent Work
[Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Abstract Repair → Repaired Abstract Model, with an Automatically Generated Concrete Repair turning Broken Bits into Repaired Bits]
Result: Repaired bits guaranteed to satisfy
consistency constraints
Recent Work
• Efficient evaluation of consistency properties
• Compilation to remove interpreter overhead (4.7x
speedup)
• Fixed point elimination (210x speedup)
• Relation construction elimination (500x speedup)
• Set construction elimination (3900x speedup)
• Model-based error localization
• User study shows benefit from approach
• Users with tool take 11 minutes on average to find
and fix a bug
• Users without tool mostly failed to find a bug
within the hour allocated
Related Work
• Hand-coded repair
• Lucent 5ESS switch
• IBM MVS operating system
• Integrity Maintenance in Databases (Ceri,
Widom, Urban)
• Self-stabilizing algorithms
• Log-based recovery for database systems
• Recovery-oriented computing
• Recursive restartability
• Undo framework
Conclusion
• Data structure repair is an interesting way to
(potentially) improve reliability
• Specification-based approach promises to
make technique more widely applicable
• Moving towards more robust, probabilistic,
continuous concept of system behavior