Efficient On-the-Fly Data Race Detection in C++


What is a Data Race?

Two concurrent accesses to a shared location, at least one of them for writing.

Indicative of a bug.

Example:
  Thread 1: X++;   Z = 2;
  Thread 2: T = Y; T = X;
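For illustration (my own minimal C++ sketch, not part of the original slides), the following program contains exactly this kind of race on the shared variable x:

#include <iostream>
#include <thread>

int x = 0;   // shared location, no synchronization

void writer() {
    for (int i = 0; i < 100000; ++i)
        ++x;                     // concurrent write to x
}

void reader() {
    int t = 0;
    for (int i = 0; i < 100000; ++i)
        t = x;                   // concurrent read of the same location: data race
    (void)t;
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
    std::cout << x << '\n';      // undefined behavior under the C++ memory model
    return 0;
}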
How Can Data Races be Prevented?

Explicit synchronization between threads:
  Locks
  Critical Sections
  Barriers
  Mutexes
  Semaphores
  Monitors
  Events
  Etc.

Example:
  Thread 1: Lock(m);  X++;   Unlock(m);
  Thread 2: Lock(m);  T = X; Unlock(m);
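A minimal C++ sketch of the same pattern (my illustration: std::mutex and std::lock_guard stand in for the slide's Lock/Unlock pseudo-API):

#include <mutex>
#include <thread>

int x = 0;
int t = 0;
std::mutex m;

void thread1() {
    std::lock_guard<std::mutex> guard(m);   // Lock(m)
    ++x;                                    // X++
}                                           // Unlock(m) when guard is destroyed

void thread2() {
    std::lock_guard<std::mutex> guard(m);   // Lock(m)
    t = x;                                  // T = X
}                                           // Unlock(m)

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    return 0;
}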
Is This Sufficient?


Yes and no – it is programmer dependent.

Correctness – the programmer may forget to synchronize.
  Need tools to detect data races.

Efficiency – to achieve correctness, the programmer may overdo it, which is expensive.
  Need tools to remove excessive synchronization.
Where is Waldo?
#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i;
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;   // Found!!!
    if (i == number) i = -1;            // Not found... return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );
}
Can You Find the Race?
A similar problem was found in java.util.Vector.

#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;                       // write to g_counter (under g_lock)
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i;
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;   // Found!!!
    if (i == number) i = -1;            // Not found... return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );      // read of g_counter without holding g_lock: the race
}
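One possible fix (a sketch in the slide's pseudo-API, not from the deck; the helper find_locked is mine) is to make sure g_counter is only read while g_lock is held:

// Caller must already hold g_lock.
static int find_locked( Type& obj, int number ) {
    int i;
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;
    if (i == number) i = -1;
    return i;
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i = find_locked( obj, number );
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    lock(g_lock);
    int i = find_locked( obj, g_counter );   // g_counter now read under g_lock
    unlock(g_lock);
    return i;
}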
Detecting Data Races?

NP-hard [Netzer & Miller 1990]
  Input size = # of instructions performed
  Even for only 3 threads
  Even with no loops/recursion

Execution orders/schedulings: (#threads)^(thread length)
# of inputs
Detection code's side effects
Weak memory, instruction reordering, atomicity
Motivation
Run-time framework goals:
  Collect a complete trace of a program's user-mode execution
  Keep the tracing overhead, for both space and time, low
  Re-simulate the traced execution deterministically, based on the collected trace, with full fidelity down to the instruction level
    Full fidelity means user mode only: no tracing of the kernel, only of user-mode I/O callbacks

Advantages:
  A complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
  The trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)

Challenges: trace size and performance
Original Record-Replay Approaches

InstantReplay '87
  Records the order of memory accesses
  Overhead may affect program behavior

RecPlay '00
  Records only synchronizations
  Not deterministic if there are data races

Netzer '93
  Records an optimal trace
  Too expensive to keep track of all memory locations

Bacon & Goldstein '91
  Records memory bus transactions with hardware
  High logging bandwidth
Motivation
Increasing use of, and development for, multi-core processors.

MT program behavior is non-deterministic: shared-memory updates can happen in a different order from run to run.

To debug software effectively, developers must be able to replay executions that exhibit concurrency bugs.
Related Concepts

Runtime interpretation/translation of binary instructions
  Requires no static instrumentation or special symbol information
  Handles dynamically generated code and self-modifying code
  Recording/logging overhead: ~100-200x

More recent logging:
  Proposed hardware support (for the MT domain)
  FDR (Flight Data Recorder)
  BugNet (cache bits set on first load)
  RTR (Regulated Transitive Reduction)
  DeLorean (ISCA 2008 - chunks of instructions)
  Strata (a time layer across all the logs of the running threads)
  iDNA (diagnostic infrastructure using Nirvana - Microsoft)
Deterministic Replay
Re-execute the exact same sequence of instructions as recorded in a previous run.

Single-threaded programs:
  Record the load values needed to reproduce the behavior of a run (Load Log)
  Registers updated by system calls and signal handlers (Reg Log)
  Output of special instructions such as RDTSC and CPUID (Reg Log)
  System calls (virtualization: cloning of arguments and updates)
  Checkpointing (log summary, ~10 million)

Multi-threaded programs:
  Also log the interleaving among threads (shared-memory update ordering - SMO Log)
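As a rough illustration (my own sketch; record layouts and field names are assumptions, not the actual PinPLAY/iDNA formats), the logs named above can be pictured as streams of records such as:

#include <cstdint>
#include <vector>

struct LoadLogEntry {      // Load Log: value returned by a logged load
    uint64_t instr_count;  // per-thread instruction count at the load
    uint64_t address;      // location that was read
    uint64_t value;        // value the replayer must supply
};

struct RegLogEntry {       // Reg Log: registers changed by syscalls, signals, RDTSC, CPUID
    uint64_t instr_count;
    uint8_t  reg_id;
    uint64_t value;
};

struct SmoLogEntry {       // SMO Log: shared-memory ordering between threads
    uint32_t waiting_thread;   // this thread...
    uint64_t waiting_ref;      // ...may not pass this memory-reference count
    uint32_t source_thread;    // until this thread...
    uint64_t source_ref;       // ...has executed this many references
};

struct TraceLogs {
    std::vector<LoadLogEntry> load_log;
    std::vector<RegLogEntry>  reg_log;
    std::vector<SmoLogEntry>  smo_log;
};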
PinSEL – System Effect Log (SEL)
Logs the program load values needed for deterministic replay:
  The first access to a memory location
  Values modified by the system (system effect) and read by the program
  Machine- and time-sensitive instructions (cpuid, rdtsc)

Example program execution (loads annotated according to the rules above):
  Store A;  (A <- 111)
  Store B;  (B <- 55)
  Load C;   (C = 9)     logged
  Load D;   (D = 10)    logged
  system call: modifies B -> 0 and C -> 99
  Load A;   (A = 111)   not logged
  Load B;   (B = 0)     logged
  Load C;   (C = 99)    logged
  Load D;   (D = 10)    not logged
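A simplified sketch of this logging decision (my illustration; the map-based "known" state and the function names are assumptions, not PinSEL internals):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct SelEntry { uint64_t addr; uint64_t value; };

std::unordered_map<uint64_t, uint64_t> known;   // values the replayer can reproduce on its own
std::vector<SelEntry> sel_log;                  // system effect log

// Called for every program load: log the value only if the replayer could not
// reproduce it (first access, or changed behind the program's back by the system).
void on_load(uint64_t addr, uint64_t value) {
    auto it = known.find(addr);
    if (it == known.end() || it->second != value)
        sel_log.push_back({addr, value});       // must be logged
    known[addr] = value;
}

// Called for every program store: the replayer will reproduce this value itself.
void on_store(uint64_t addr, uint64_t value) {
    known[addr] = value;
}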
Trace size is ~4-5 bytes per instruction when all reads are logged.

Observation: hardware caches eliminate most off-chip reads.

Optimize logging:
  The logger and replayer simulate identical cache memories.
  A simple cache (the memory-copy structure) decides which values to log: there are no tags or valid bits to check; if the values mismatch, they are logged.
  Average trace size is < 1 bit per instruction.

Example:
  i = 1;
  for (j = 0; j < 10; j++) {
      i = i + j;
  }
  k = i;           // value read is 46
  System_call();
  k = i;           // value read is 0 (not predicted)

The only read that is not predicted, and therefore logged, is the one following the system call.
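A minimal sketch of this value-prediction idea, with an identical "memory copy" on the logger and replayer sides (table size, hashing, and names are my assumptions, not the actual PinSEL/PinPLAY structures):

#include <cstdint>
#include <vector>

constexpr size_t kSlots = 1 << 12;              // size of the memory-copy structure (assumed)
struct LogEntry { uint64_t load_index; uint64_t value; };

struct Predictor {
    uint64_t copy[kSlots] = {};                 // no tags, no valid bits
    uint64_t loads_seen = 0;
    size_t slot(uint64_t addr) const { return (addr >> 3) % kSlots; }
};

// Logger: log a load only when the memory copy mispredicts its value.
void logger_load(Predictor& p, std::vector<LogEntry>& log,
                 uint64_t addr, uint64_t value) {
    ++p.loads_seen;
    size_t s = p.slot(addr);
    if (p.copy[s] != value)
        log.push_back({p.loads_seen, value});   // mispredicted: must be logged
    p.copy[s] = value;
}

// Stores update the copy identically on both sides.
void on_store(Predictor& p, uint64_t addr, uint64_t value) {
    p.copy[p.slot(addr)] = value;
}

// Replayer: its copy evolves identically, so it consumes a log entry exactly at
// the loads the logger recorded and predicts every other value itself.
uint64_t replayer_load(Predictor& p, const std::vector<LogEntry>& log,
                       size_t& next, uint64_t addr) {
    ++p.loads_seen;
    size_t s = p.slot(addr);
    uint64_t value = p.copy[s];                 // predicted value
    if (next < log.size() && log[next].load_index == p.loads_seen)
        value = log[next++].value;              // prediction failed during logging
    p.copy[s] = value;
    return value;
}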
Example Overhead

PinSEL and PinPLAY

Initial work (2006) with single-threaded programs:
  SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (without inlining)
  A subset of the SPLASH2 benchmarks: 230x slowdown for PinSEL

Now (geometric mean over SPEC2006):
  Pin alone: 1.4x
  Logger: 83.6x
  Replayer: 1.4x
Example: Microsoft iDNA Trace Writer Performance

Application | Simulated Instructions (millions) | Trace File Size | Trace File Bits/Instruction | Native Execution Time | Execution Time While Tracing | Execution Overhead
Gzip        | 24,097 |  245 MB | 0.09 | 11.7s  | 187s  | 15.98
Excel       |  1,781 |   99 MB | 0.47 | 18.2s  | 105s  |  5.76
PowerPoint  |  7,392 |  528 MB | 0.60 | 43.6s  | 247s  |  5.66
IE          |    116 |    5 MB | 0.50 | 0.499s | 6.94s | 13.90
Vulcan      |  2,408 |  152 MB | 0.53 | 2.74s  | 46.6s | 17.01
Satsolver   |  9,431 | 1300 MB | 1.16 | 9.78s  | 127s  | 12.98

Memchecker and Valgrind are in the 30-40x range on CPU2006.
iDNA is ~11x (it does not log shared-memory dependences explicitly).
It uses a sequential number for every lock-prefixed memory operation for offline data race analysis.
Logging Shared Memory Ordering (Cristiano's PinSEL/PLAY Overview)

Emulation of directory-based cache coherence:
  Identifies RAW, WAR, and WAW dependences
  Indexed by hashing the effective address
  Each entry represents an address range

[Figure: a Store A and a Load B from the program execution are hashed to entries of the directory.]
Directory Entries

Every DirEntry maintains:
  The thread id of the last_writer
  A timestamp: the number of memory references the thread has executed
  A vector of timestamps of the last access to that entry by each thread

On loads: update the timestamp for the thread in the entry.
On stores: update the timestamp and the last_writer fields.

[Figure: Thread T1 executes 1: Store A, 2: Load A, 3: Store F; Thread T2 executes 1: Load F, 2: Store A, 3: Load F. The DirEntry for range [A:D] ends up with last_writer T2 and last-access timestamps T1: 2, T2: 2; the DirEntry for range [E:H] ends up with last_writer T1 and timestamps T1: 3, T2: 3.]
Detecting Dependences

A RAW dependence between threads T and T' is established if:
  T executes a load that maps to directory entry A, and
  T' is the last_writer for the same entry.

A WAW dependence between T and T' is established if:
  T executes a store that maps to directory entry A, and
  T' is the last_writer for the same entry.

A WAR dependence between T and T' is established if:
  T executes a store that maps to directory entry A, and
  T' has accessed the same entry in the past and is not the last_writer.
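A compact sketch of this directory emulation (my own illustration; the entry granularity, table organization, and names are assumptions, not the actual PinSEL/PinPLAY data structures):

#include <cstdint>
#include <cstdio>
#include <unordered_map>

constexpr int kMaxThreads = 8;        // assumed bound for the sketch
constexpr uint64_t kRangeBytes = 64;  // each entry covers an address range (assumed)

struct DirEntry {
    int      last_writer = -1;                 // thread id of the last writer
    uint64_t last_access[kMaxThreads] = {};    // per-thread timestamp of last access
};

std::unordered_map<uint64_t, DirEntry> directory;    // indexed by hashed address range
uint64_t refs_executed[kMaxThreads] = {};            // per-thread memory-reference count

DirEntry& lookup(uint64_t addr) { return directory[addr / kRangeBytes]; }

// Record a dependence edge for the SMO log: thread 'to' at reference 'to_ref'
// must wait until thread 'from' has executed 'from_ref' memory references.
void log_edge(int from, uint64_t from_ref, int to, uint64_t to_ref) {
    std::printf("T%d %llu -> T%d %llu\n",
                from, (unsigned long long)from_ref, to, (unsigned long long)to_ref);
}

void on_load(int tid, uint64_t addr) {
    uint64_t ts = ++refs_executed[tid];
    DirEntry& e = lookup(addr);
    if (e.last_writer != -1 && e.last_writer != tid)              // RAW
        log_edge(e.last_writer, e.last_access[e.last_writer], tid, ts);
    e.last_access[tid] = ts;                                      // on loads: update timestamp
}

void on_store(int tid, uint64_t addr) {
    uint64_t ts = ++refs_executed[tid];
    DirEntry& e = lookup(addr);
    if (e.last_writer != -1 && e.last_writer != tid)              // WAW
        log_edge(e.last_writer, e.last_access[e.last_writer], tid, ts);
    for (int t = 0; t < kMaxThreads; ++t)                         // WAR: prior accessors
        if (t != tid && t != e.last_writer && e.last_access[t] != 0)
            log_edge(t, e.last_access[t], tid, ts);
    e.last_access[tid] = ts;                                      // on stores: update both
    e.last_writer = tid;
}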
Example
Program (memory references numbered per thread):
  Thread T1: 1: Store A   2: Load A   3: Store F
  Thread T2: 1: Load F    2: Store A  3: Load F

Dependences detected through the directory:
  WAW on [A:D]: T1's Store A (ref 1) -> T2's Store A (ref 2)
  RAW on [A:D]: T2's Store A (ref 2) -> T1's Load A (ref 2)
  WAR on [E:H]: T2's Load F (ref 3) -> T1's Store F (ref 3)

SMO logs:
  T2 2 T1 1 - thread T2 cannot execute its memory reference 2 until T1 has executed its memory reference 1
  T1 2 T2 2 - thread T1 cannot execute its memory reference 2 until T2 has executed its memory reference 2
  T1 3 T2 3 - thread T1 cannot execute its memory reference 3 until T2 has executed its memory reference 3
Ordering Memory Accesses (Reducing Log Size)

Preserving the order of conflicting accesses will reproduce the execution:
  a -> b means "a happens-before b"
  Ordering is transitive: a -> b and b -> c imply a -> c

Two instructions must be ordered if:
  they both access the same memory location, and
  one of them is a write.
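A tiny sketch of the conflict test implied by this definition (the names are mine):

#include <cstdint>

struct Access {
    uint64_t addr;      // memory location touched
    bool     is_write;  // true for stores, false for loads
};

// Two accesses must be ordered in the replay log iff they conflict:
// same location and at least one of them is a write.
bool conflicts(const Access& a, const Access& b) {
    return a.addr == b.addr && (a.is_write || b.is_write);
}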
Constraints: Enforcing Order
P1 executes a and then b; P2 executes c and then d (program order within each processor).

To guarantee a -> d, any one of the following constraints suffices:
  a -> d
  b -> d
  a -> c
  b -> c
Enforcing more than one of them is overconstrained.

Suppose we also need b -> c. Then:
  b -> c is necessary
  a -> d becomes redundant (it is implied transitively through b -> c)
Problem Formulation
Thread I: ld A, st B, st C, ld D, sub, ld B
Thread J: add, st C, ld B, st A, st C, st D

Conflicts: pairs of accesses from the two threads to the same location where at least one access is a write.

Recording: detect the conflicts between Thread I and Thread J and write a dependence log.
Replay: reproduce exactly the same conflicts - no more, no less.
Log All Conflicts
Thread I: 1: ld A   2: st B   3: st C   4: ld D   5: sub    6: ld B
Thread J: 1: add    2: st C   3: ld B   4: st A   5: st C   6: st D

Each log entry is a pair (remote reference count, local reference count): the local operation may not execute until the remote thread has executed that many memory references.

Log J: (2,3) (1,4) (3,5) (4,6)
Log I: (2,3)

Log size: 5 entries x 16 bytes = 80 bytes (10 integers)

Replay:
  Assign ICs (logical timestamps)
  Detect conflicts
  But: too many conflicts are logged, at 16 bytes per written entry
Netzer’s Transitive Reduction
Thread I: 1: ld A   2: st B   3: st C   4: ld D   5: sub    6: ld B
Thread J: 1: add    2: st C   3: ld B   4: st A   5: st C   6: st D

The entry (1,4) in Log J is implied transitively: I's op 1 precedes its op 2 in program order, (2,3) is already logged, and J's op 3 precedes its op 4. It can therefore be removed.

TR-reduced log:
  Log J: (2,3) (3,5) (4,6)
  Log I: (2,3)

Log size: 4 entries x 16 bytes = 64 bytes (8 integers)
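A sketch of the reduction test using per-thread vector clocks (my illustration of the general idea, not Netzer's exact algorithm; thread ids and reference counts are small integers and operations are processed in the recorded global order):

#include <algorithm>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

constexpr int kThreads = 2;

struct Clock { long v[kThreads] = {0, 0}; };

std::vector<Clock> current(kThreads);                  // current vector clock per thread
std::map<std::pair<int, long>, Clock> at_op;           // clock snapshot after (thread, op)

// Thread t executes its op number j, possibly depending on op i of thread u
// (u < 0 means no incoming dependence). Returns true if the edge must be logged,
// false if it is already implied by logged edges plus program order.
bool execute(int t, long j, int u = -1, long i = 0) {
    bool logged = false;
    if (u >= 0) {
        if (current[t].v[u] >= i) {
            // redundant: drop the edge
        } else {
            logged = true;
            const Clock& src = at_op[{u, i}];          // what u knew after its op i
            for (int k = 0; k < kThreads; ++k)
                current[t].v[k] = std::max(current[t].v[k], src.v[k]);
        }
    }
    current[t].v[t] = j;                               // program order within t
    at_op[{t, j}] = current[t];                        // snapshot for later edges
    return logged;
}

int main() {
    // The slides' example: Thread I = 0, Thread J = 1.
    execute(0, 1);                 // I: ld A
    execute(0, 2);                 // I: st B
    execute(1, 1);                 // J: add
    execute(1, 2);                 // J: st C
    bool e1 = execute(0, 3, 1, 2); // I: st C depends on J op 2 -> logged
    bool e2 = execute(1, 3, 0, 2); // J: ld B depends on I op 2 -> logged
    bool e3 = execute(1, 4, 0, 1); // J: st A depends on I op 1 -> implied, dropped
    execute(0, 4);                 // I: ld D
    execute(0, 5);                 // I: sub
    bool e4 = execute(1, 5, 0, 3); // J: st C depends on I op 3 -> logged
    execute(0, 6);                 // I: ld B
    bool e5 = execute(1, 6, 0, 4); // J: st D depends on I op 4 -> logged
    std::printf("%d %d %d %d %d\n", e1, e2, e3, e4, e5);  // prints: 1 1 0 1 1
    return 0;
}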
RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization

Thread I: 1: ld A   2: st B   3: st C   4: ld D   5: sub    6: ld B
Thread J: 1: add    2: st C   3: ld B   4: st A   5: st C   6: st D

RTR may log a stricter (artificial) dependence than the one actually observed: here the single entry (4,5) in Log J subsumes both (3,5) and (4,6) from the TR-reduced log.

New reduced log:
  Log J: (2,3) (4,5)
  Log I: (2,3)

Log size: 3 entries x 16 bytes = 48 bytes (6 integers)

4% overhead for RTR+FDR (simulated on GEMS)
0.2 MB/core/second of logging (Apache)