Relaxed Consistency Deterministic Computer
“deterministic deeds, done dirt cheap”
Joseph Devietti, Jacob Nelson, Tom Bergan
Luis Ceze, Dan Grossman
determinism improves the software development cycle
Test
• testing results are reproducible
• no need to stress test
Deploy
• tested inputs behave identically in production
• more robust production code
Debug
• production bugs can be reproduced in-house
• reverse debugging is possible
History of Deterministic Execution
Deterministic Execution for Arbitrary Programs
• DMP [ASPLOS ‘09]
• CoreDet [ASPLOS ‘10]
• dOS [OSDI ‘10]
• Determinator [OSDI ‘10]
• Calvin [HPCA ‘11]
• RCDC [ASPLOS ‘11]
Deterministic Execution for Restricted Programs
• Kendo [ASPLOS ‘09]
• Grace [OOPSLA ‘09]
History of Deterministic Execution
• DMP [ASPLOS ‘09]: seq. consistency
• CoreDet [ASPLOS ‘10]: total store order
• RCDC [ASPLOS ‘11]: DRF0 [ISCA ‘90]
[comic: "Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com, © 2008]
Contributions / Outline
1. DMP-HB: a new deterministic consistency model based on DRF0 with improved performance
2. a low-complexity hw/sw deterministic execution system
3. simulation using Pin
   – hw: store buffers and instruction counting
   – sw: everything else
4. C/C++ compiler based on LLVM, runs on commodity multicore hardware
starting simple: serialization
deterministic quantum size + deterministic scheduling = determinism
[figure: threads T1, T2, T3 each run one quantum in turn; one pass over all threads is a quantum round; time →]
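The serialized scheme on this slide can be illustrated with a toy simulation. This is a minimal sketch, assuming each thread is modeled as a Python generator that yields once per "instruction"; `worker`, `run_serialized`, and the quantum size used in the example are hypothetical names and values chosen for illustration, not part of the system described in the talk.

```python
from typing import Dict, Iterator, List


def worker(name: str, n_insns: int) -> Iterator[str]:
    """Hypothetical thread body: yields one label per 'instruction'."""
    for i in range(n_insns):
        yield f"{name}.{i}"


def run_serialized(threads: Dict[str, Iterator[str]], quantum: int) -> List[str]:
    """Run threads one at a time, `quantum` instructions each, in a fixed order.

    One pass over all live threads corresponds to a quantum round.
    """
    trace: List[str] = []
    live = dict(threads)
    while live:
        for name in sorted(live):        # deterministic scheduling order
            it = live[name]
            for _ in range(quantum):     # deterministic quantum size
                try:
                    trace.append(next(it))
                except StopIteration:
                    del live[name]       # thread finished; drop it from later rounds
                    break
    return trace


def make_threads() -> Dict[str, Iterator[str]]:
    return {"T1": worker("T1", 3), "T2": worker("T2", 1), "T3": worker("T3", 2)}


first = run_serialized(make_threads(), quantum=2)
second = run_serialized(make_threads(), quantum=2)
```

Because both the quantum size and the thread order are fixed, `first` and `second` are the same interleaving on every run. The cost is that only one thread makes progress at a time.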
recovering parallelism with DMP-TSO
• parallel mode: buffer all stores (no communication)
• commit mode: deterministically publish buffers
• serial mode: for atomic ops
[figure: T1, T2, T3 across parallel, commit, and serial phases; operations shown: wr A, rd A, lock A, rd A, lock B; time →]
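The parallel and commit modes can be sketched in a few lines. This is a simplified model under stated assumptions: memory is a dictionary, buffers are published in ascending thread-id order (one possible deterministic order; the slide does not say which order is used), and serial mode for atomic ops is left out. `BufferedThread` and `commit` are hypothetical names.

```python
from typing import Dict, List

Memory = Dict[str, int]


class BufferedThread:
    """A thread whose stores stay private until the commit phase."""

    def __init__(self, tid: int, shared: Memory) -> None:
        self.tid = tid
        self.shared = shared
        self.store_buffer: Memory = {}

    def store(self, addr: str, value: int) -> None:
        # Parallel mode: buffer the store, no communication with other threads.
        self.store_buffer[addr] = value

    def load(self, addr: str) -> int:
        # A thread sees its own buffered stores first, then shared memory.
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        return self.shared.get(addr, 0)


def commit(threads: List[BufferedThread], shared: Memory) -> None:
    """Commit mode: publish all buffers in a fixed order (assumed: by thread id)."""
    for t in sorted(threads, key=lambda t: t.tid):
        shared.update(t.store_buffer)
        t.store_buffer.clear()


shared: Memory = {}
t1 = BufferedThread(1, shared)
t2 = BufferedThread(2, shared)

# Parallel mode: the order of these calls does not affect the committed result.
t2.store("B", 20)
t1.store("A", 1)
t1.store("B", 10)
seen_by_t2_before_commit = t2.load("A")   # T1's write to A is still buffered

commit([t2, t1], shared)
seen_by_t2_after_commit = t2.load("A")
```

In this sketch the conflicting writes to `B` always resolve the same way, because the publish order depends only on thread ids and not on timing.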
Why is DMP-TSO slow?
• serialization
• imbalance
Kendo [ASPLOS ‘09], DMP-HB: parallel-mode synchronization complements relaxed consistency
[figure: T1, T2, T3 across parallel, commit, and serial phases; time →]
synchronization in parallel mode with Kendo [Olszewski et al., ASPLOS ‘09]
thread with globally min insn count can do atomic op
[figure: T1, T2, T3 plotted by instruction count →; T2 attempts lock A; first build: T2 is not globally min insn count; second build: T2 is globally min insn count]
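The rule on this slide (the thread with the globally minimal instruction count may perform an atomic op) can be written as a small predicate. This is a simplified sketch, not Kendo's full algorithm: ties are broken by thread id, which is an assumption made here for illustration, and `may_do_atomic_op` is a hypothetical name.

```python
from typing import Dict


def may_do_atomic_op(tid: int, insn_counts: Dict[int, int]) -> bool:
    """Return True if `tid` has the globally minimal (insn count, tid) pair.

    Comparing (count, tid) tuples breaks ties by thread id, so exactly one
    thread is allowed to proceed at any moment.
    """
    me = (insn_counts[tid], tid)
    return all(me <= (count, other) for other, count in insn_counts.items())


# T2 wants to lock A. With these counts it is the global minimum and may go.
t2_is_min = may_do_atomic_op(2, {1: 120, 2: 90, 3: 150})

# Here T1 is still behind T2, so T2 must wait until T1's count passes 90.
t2_must_wait = not may_do_atomic_op(2, {1: 80, 2: 90, 3: 150})
```

The decision depends only on instruction counts, which are a deterministic property of the program's execution, rather than on wall-clock timing.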
DRF0: happens-before consistency [Adve and Hill, ISCA ‘90]
• happens-before edges defined by synchronization operations
• remote updates visible via cross-thread happens-before edges
• SC for DRF programs
• upholds C++/Java memory models
• programmer-visible model doesn’t change
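For intuition, the idea that remote updates become visible only through cross-thread happens-before edges can be modeled with a toy. This sketch assumes a lock carries the releasing thread's writes to the next acquirer; `ToyLock`, `ToyThread`, `release`, and `acquire` are hypothetical names, and the model ignores everything else a real memory system does.

```python
from typing import Dict


class ToyLock:
    def __init__(self) -> None:
        # Writes that have been released through this lock so far.
        self.published: Dict[str, int] = {}


class ToyThread:
    def __init__(self) -> None:
        # The values this thread is currently able to observe.
        self.view: Dict[str, int] = {}

    def write(self, addr: str, value: int) -> None:
        self.view[addr] = value

    def read(self, addr: str) -> int:
        return self.view.get(addr, 0)

    def release(self, lock: ToyLock) -> None:
        # Unlock: source of a happens-before edge.
        lock.published.update(self.view)

    def acquire(self, lock: ToyLock) -> None:
        # Lock: sink of a happens-before edge; earlier releases become visible.
        self.view.update(lock.published)


lock_a = ToyLock()
t1, t2 = ToyThread(), ToyThread()

t1.write("x", 1)
before_release = t2.read("x")     # no edge yet, T2 cannot see the write
t1.release(lock_a)
before_acquire = t2.read("x")     # released, but T2 has not acquired
t2.acquire(lock_a)
after_acquire = t2.read("x")      # edge T1.release -> T2.acquire exists
```

A data-race-free program only reads shared data after such an edge exists, which is why it cannot tell this relaxed behavior apart from sequential consistency.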
sync in parallel mode (Kendo)
relaxed consistency (DRF0)
deterministic scheduling (DMP)
DMP-HB
DMP-HB: happens-before determinism
• no serial mode
• less imbalance
• explicit fences rarely necessary: explicit fence iff inter-thread HB edge doesn’t cross commit
[figure: T1, T2, T3 across parallel and commit phases; operations shown: lock A, unlock A, lock A; consistency models shown: TSO, RC, DRF0; time →]
Architecture
runtime system
• application/OS can choose nondeterminism
• align context switches with quantum boundaries
Store Buffers in Private $: StoreToSB, CommitSB, SaveSB, RestoreSB
Precise Insn Counting: StartInsnCount, StopInsnCount, ReadInsnCount
Traps: SBFull, QuantumReached
[figure: runtime system above two cores, each with an L1$, and an L2$]
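To make the interface on this slide concrete, here is a toy software model of one core. The operation names mirror the slide (StoreToSB, CommitSB, SaveSB, RestoreSB, StartInsnCount, StopInsnCount, ReadInsnCount, SBFull, QuantumReached), but their exact semantics are assumptions made for this sketch; `CoreModel`, `retire`, and the capacity and quantum parameters are hypothetical.

```python
from typing import Dict


class SBFull(Exception):
    """Trap: the store buffer has no room for a new address."""


class QuantumReached(Exception):
    """Trap: the instruction count reached the quantum size."""


class CoreModel:
    """Toy model of a core's store buffer and instruction counter."""

    def __init__(self, memory: Dict[int, int], sb_capacity: int, quantum: int) -> None:
        self.memory = memory
        self.sb: Dict[int, int] = {}
        self.sb_capacity = sb_capacity
        self.quantum = quantum
        self.insn_count = 0
        self.counting = False

    def store_to_sb(self, addr: int, value: int) -> None:
        # Assumed: a store to a new address traps when the buffer is full.
        if addr not in self.sb and len(self.sb) >= self.sb_capacity:
            raise SBFull()
        self.sb[addr] = value

    def commit_sb(self) -> None:
        # Publish buffered stores to memory and empty the buffer.
        self.memory.update(self.sb)
        self.sb.clear()

    def save_sb(self) -> Dict[int, int]:
        # Assumed: used at a context switch to take the buffer off the core.
        saved = dict(self.sb)
        self.sb.clear()
        return saved

    def restore_sb(self, saved: Dict[int, int]) -> None:
        self.sb = dict(saved)

    def start_insn_count(self) -> None:
        self.insn_count = 0
        self.counting = True

    def stop_insn_count(self) -> None:
        self.counting = False

    def read_insn_count(self) -> int:
        return self.insn_count

    def retire(self, n: int = 1) -> None:
        # Hypothetical hook: called as instructions retire while counting.
        if self.counting:
            self.insn_count += n
            if self.insn_count >= self.quantum:
                raise QuantumReached()
```

In this model the runtime system would catch `SBFull` and `QuantumReached` and decide when to call `commit_sb`, which matches the slide's split between lightweight hardware and a software runtime.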
Experimental Setup
Pin-based simulator
• 1 IPC, except for memory ops
• PARSEC v2.1 with simsmall inputs
structure     size            access latency
private L1    8-way, 32KB     1 cycle
private L2    8-way, 256KB    10 cycles
shared L3     16-way, 8MB     35 cycles
memory        -               120 cycles
extended CoreDet C/C++ compiler [ASPLOS ‘10]
• 8-core Intel Harpertown @ 2.8GHz, 10GB RAM
• PARSEC v2.1 with simlarge inputs
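The simulator's timing model (1 IPC except for memory ops, with the latencies in the table) can be illustrated with a small cost function. This is a sketch under an assumption the slide does not spell out: each memory op is charged the latency of the level that services it. `LATENCY_CYCLES`, `simulated_cycles`, and the instruction mix in the example are hypothetical.

```python
from typing import Dict

# Access latencies from the table above, in cycles.
LATENCY_CYCLES: Dict[str, int] = {"L1": 1, "L2": 10, "L3": 35, "memory": 120}


def simulated_cycles(non_memory_insns: int, memory_ops: Dict[str, int]) -> int:
    """Cycle count for one thread: 1 cycle per non-memory instruction, plus
    the servicing level's latency for each memory op (assumed cost model)."""
    memory_cycles = sum(LATENCY_CYCLES[level] * count
                        for level, count in memory_ops.items())
    return non_memory_insns + memory_cycles


# Made-up instruction mix, for illustration only.
example = simulated_cycles(1000, {"L1": 200, "L2": 20, "L3": 5, "memory": 1})
# 1000 + 200*1 + 20*10 + 5*35 + 1*120 = 1695
```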
Simulation: Overheads
overhead < 60% in worst case
[figure: bar chart of % overhead compared to nondet (y-axis 0% to 70%) at 2p, 4p, 8p, and 16p; benchmarks with quantum size (insns): blacksch 50k, dedup 50k, ferret 25k, fluid 1k, streamcl 1k, swaptions 50k, vips 50k, x264 50k]
Compiler: DMP-HB vs. DMP-TSO
[figure: bar chart of % overhead compared to nondet (y-axis 0% to 450%) for hb and tso at 2, 4, and 8 threads; benchmarks with quantum size (insns): blackscholes 200k, swaptions 200k, fluidanimate 50k, fmm 50k]
Conclusions
• DMP-HB: a new deterministic consistency model
• RCDC: a new deterministic multiprocessor design
  – no speculation
  – lightweight hardware support
• Relaxed consistency is a natural optimization for determinism
source code and data available at http://sampa.cs.washington.edu
Thanks!
Questions?
DRF0 hardware requirements [ISCA ‘90]
1. Intra-processor dependencies are preserved.
2. All writes to the same location can be totally ordered based on their commit times, and this is the order in which they are observed by all processors.
3. All synchronization operations to the same location can be totally ordered based on their commit times, and this is also the order in which they are globally performed. Further, if S1 and S2 are synchronization operations and S1 is committed and globally performed before S2, then all components of S1 are committed and globally performed before any in S2.
4. A new access is not generated by a processor until all its previous synchronization operations (in program order) are committed.
5. Once a synchronization operation S by processor Pi is committed, no other synchronization operations on the same location by another processor can commit until after all reads of Pi before S (in program order) are committed and all writes of Pi before S are globally performed.