Transcript [talk]

Karma:
Scalable Deterministic Record-Replay
Arkaprava Basu
Jayaram Bobba
Mark D. Hill
Work done at University of Wisconsin-Madison
UW-Madison Computer Sciences Multifacet Group
© 2011
Executive summary
• Applications of deterministic record-replay
– Debugging
– Fault tolerance
– Security
• Existing hardware record-replayer
– Fast record but
– Slow replay or
– Requires major hardware changes
• Karma: Faster Replay with nearly-conventional h/w
– Extends Rerun
– Records more parallelism
Outline
• Background & Motivation
• Rerun Overview
• Karma Insights
• Karma Implementation
• Evaluation
• Conclusion
Deterministic Record-Replay
• Multi-threaded execution non-deterministic
• Deterministic record-replay to reincarnate past execution
• Record:
– Record selective events in a log
• Replay:
– Use the log to reincarnate past execution
• Key Challenge: Memory races
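The record and replay phases above can be illustrated in software with a minimal sketch. It logs only the order in which threads perform a racy shared access, then forces the same order on a second run. The names `OrderLog` and `execute` are hypothetical and chosen for illustration; this is not the hardware scheme discussed in the talk.

```python
import threading

class OrderLog:
    """Records, then replays, the order in which threads perform a shared access."""

    def __init__(self, log=None):
        self.replaying = log is not None
        self.log = list(log) if log else []
        self.pos = 0                        # next log entry to replay
        self.cond = threading.Condition()

    def run(self, tid, action):
        with self.cond:
            if self.replaying:
                # Block until the log says it is this thread's turn.
                self.cond.wait_for(lambda: self.log[self.pos] == tid)
                action()
                self.pos += 1
                self.cond.notify_all()
            else:
                action()
                self.log.append(tid)        # record the selected event

def execute(order_log, n_threads=3, steps=4):
    trace = []                              # observable outcome of the racy run

    def worker(tid):
        for _ in range(steps):
            order_log.run(tid, lambda: trace.append(tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace

recorder = OrderLog()
original = execute(recorder)                     # record phase
replayed = execute(OrderLog(recorder.log))       # replay phase
```

The recorded interleaving may differ from run to run, but the replayed trace always matches the log it was given.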
Record-Replay Motivation
• Debugging
– Ensures bugs faithfully reappear (no heisenbugs)
Replay speed matters
• Fault-Tolerance
– Enable hot backup for primary server to
shadow primary & take over on failure
• Security
– Real time intrusion detection & attack analysis
Previous work
• Record Dependence
– Wisconsin Flight Data Recorder [ISCA’03, etc.]: Too much state
– UCSD Strata [ASPLOS’06]: Log size grows rapidly with #cores
• Record Independence
– UIUC DeLorean [ISCA’08]: Non-conventional BulkSC H/W
– Wisconsin Rerun [ISCA’08]: Sequential replay
– Intel MRR [MICRO’09]: Only for snoop based systems
– Timetraveler [ISCA’10]: Extends Rerun to lower log size
• Our Goal
– Retain Rerun’s near-conventional hardware
– Enable Faster Replay
Rerun’s Recording
• Most code executes without races
– Use race-free regions for ordering
• Episodes: independent execution regions
– Defined per thread
[Figure: memory operations of each thread, in program order]
T0: LD A, ST B, ST C, LD F, LD X, LD Q, ST Q, ST K, LD Z
T1: ST E, LD B, ST X, LD R, ST T, ST C, ST E, ST X
T2: ST V, ST Z, LD W, LD J, LD J, LD V
Partially adopted from ISCA’08 talk
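The idea of race-free, per-thread episodes can be sketched as follows. The sketch assumes a hypothetical global interleaving of accesses and exact read/write sets. It ends a thread's current episode whenever another thread's access conflicts with it. The function name `split_into_episodes` and the trace are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict

def split_into_episodes(trace):
    """trace: list of (thread, op, addr) in a global interleaving; op is 'LD' or 'ST'.
    Returns {thread: [episode, ...]}, each episode a list of (op, addr)."""
    reads = defaultdict(set)      # addresses read in each thread's open episode
    writes = defaultdict(set)     # addresses written in each thread's open episode
    current = defaultdict(list)   # open episode per thread
    done = defaultdict(list)      # finished episodes per thread

    def end_episode(t):
        if current[t]:
            done[t].append(current[t])
        current[t] = []
        reads[t].clear()
        writes[t].clear()

    for thread, op, addr in trace:
        for other in list(current):
            if other == thread:
                continue
            # Conflict: the other thread wrote addr, or we write what it read.
            if addr in writes[other] or (op == "ST" and addr in reads[other]):
                end_episode(other)
        current[thread].append((op, addr))
        (writes if op == "ST" else reads)[thread].add(addr)

    for t in list(current):
        end_episode(t)
    return dict(done)

# Hypothetical interleaving: T1's load of B conflicts with T0's earlier store to B.
trace = [
    ("T0", "LD", "A"), ("T0", "ST", "B"),
    ("T1", "ST", "E"), ("T1", "LD", "B"),
    ("T0", "ST", "C"), ("T1", "ST", "X"),
]
episodes = split_into_episodes(trace)
```

In this trace T0 is split into two episodes at the conflict on B, while T1 runs as a single episode.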
Rerun’s Recording (Contd.)
• Capturing causality:
– Timestamp via Lamport scalar clock [Lamport ‘78]
• Replay in timestamp order
– Episodes with same timestamp can be replayed in parallel
[Figure: episodes of T0, T1, and T2 labeled with scalar timestamps (values shown range from 22 to 62)]
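A minimal sketch of the timestamping idea, under stated assumptions: each episode gets one more than the largest timestamp among its program-order predecessor and the episodes it depends on, and replay proceeds in batches of equal timestamp. The episode names, the dependences, and the helpers `assign_timestamps` and `replay_batches` are hypothetical.

```python
from itertools import groupby

def assign_timestamps(episodes, deps):
    """episodes: list of (thread, name), ordered so that every dependence source
    appears before its target. deps: {name: [names it depends on]}."""
    ts = {}
    last_on_thread = {}
    for thread, name in episodes:
        preds = list(deps.get(name, []))
        if thread in last_on_thread:
            preds.append(last_on_thread[thread])      # program order
        ts[name] = 1 + max((ts[p] for p in preds), default=0)
        last_on_thread[thread] = name
    return ts

def replay_batches(ts):
    """Group episodes by timestamp; each batch could be replayed in parallel."""
    ordered = sorted(ts, key=lambda n: (ts[n], n))
    return [list(g) for _, g in groupby(ordered, key=lambda n: ts[n])]

episodes = [("T0", "a0"), ("T1", "b0"), ("T2", "c0"), ("T1", "b1"), ("T0", "a1")]
deps = {"b1": ["a0"], "a1": ["c0"]}       # b1 depends on a0, a1 depends on c0
ts = assign_timestamps(episodes, deps)
batches = replay_batches(ts)
```

A scalar clock imposes a total order on timestamps, so two unrelated episodes with different timestamps still replay one after the other.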
Rerun’s Replay
[Figure: replay timeline for T0, T1, and T2; episodes execute in timestamp order (TS=22, 43, 44, 45, 60, 61)]
Karma’s Insight 1:
• Capture order with DAG (not scalar clock)
– Recording: DAG captured with episode predecessor & successor sets
[Figure: the timestamped episodes of T0, T1, and T2 from the Rerun recording example]
Karma’s Insight 1:
[Figure: Rerun’s Replay versus Karma’s Replay of the same episodes on T0, T1, and T2]
Karma’s Insight 1: (Contd.)
• Naïve approach: DAG arcs point to episodes
– Episodes represented by integers
– Too much log size overhead!
• Our approach: DAG arcs point to cores
– Recording: Only one “active” episode per core
– Replay: Send wakeup message(s) to core(s) of successor episode(s)
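A minimal software sketch of replay driven by arcs that point to cores. It assumes each log entry carries a set of predecessor cores and a set of successor cores, and that an episode consumes one wakeup from each predecessor core before it starts. These semantics, the function `replay`, and the example log are illustrative assumptions rather than the exact hardware protocol.

```python
from collections import Counter

def replay(logs):
    """logs: {core: [(episode, pred_cores, succ_cores), ...]} in program order.
    Returns a list of rounds; episodes within one round could run in parallel."""
    next_idx = {c: 0 for c in logs}
    pending = {c: Counter() for c in logs}       # wakeups received, per sender core
    rounds = []
    while any(next_idx[c] < len(logs[c]) for c in logs):
        ready = []
        for c in logs:
            if next_idx[c] == len(logs[c]):
                continue                         # this core has finished its log
            _, preds, _ = logs[c][next_idx[c]]
            if all(pending[c][p] > 0 for p in preds):
                ready.append(c)
        if not ready:
            raise RuntimeError("log is not replayable: every core is waiting")
        rounds.append([logs[c][next_idx[c]][0] for c in ready])
        for c in ready:
            _, preds, succs = logs[c][next_idx[c]]
            for p in preds:
                pending[c][p] -= 1               # consume one wakeup per predecessor
            for s in succs:
                pending[s][c] += 1               # wakeup message to successor core
            next_idx[c] += 1
    return rounds

# Hypothetical log: core 1's episode waits for core 2; core 0 is independent.
logs = {
    0: [("T0.e0", [], []), ("T0.e1", [], [])],
    1: [("T1.e0", [2], [])],
    2: [("T2.e0", [], [1])],
}
rounds = replay(logs)
```

Because the arcs name cores rather than episode identifiers, each entry only needs space proportional to the core count.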
Karma’s Insight 1:
Anatomy of a log entry
[Figure: a log entry shown for the example episodes of T0, T1, and T2; fields such as 0|0|1 appear in the entry]
Karma’s Insight 2:
• Not necessary to end the episode on every conflict:
– As long as the episodes can be ordered during replay
[Figure: memory operations of each thread, in program order]
T0: LD A, ST B, ST C, LD F, LD X, LD Q, ST Q, ST K, LD Z
T1: ST E, LD B, ST X, LD R, ST T, ST C, ST E, ST X
T2: ST V, ST Z, LD W, LD J, LD J, LD V
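One way to read "can be ordered during replay" is that the ordering arcs between episodes must not form a cycle. The sketch below checks that condition with Kahn's topological sort. It illustrates the condition only; it is not the mechanism Karma's hardware uses, and the episode names are hypothetical.

```python
from collections import deque

def can_be_ordered(episodes, arcs):
    """Kahn's algorithm: True if the (before, after) arcs admit a replay order."""
    indegree = {e: 0 for e in episodes}
    successors = {e: [] for e in episodes}
    for before, after in arcs:
        successors[before].append(after)
        indegree[after] += 1
    queue = deque(e for e in episodes if indegree[e] == 0)
    visited = 0
    while queue:
        e = queue.popleft()
        visited += 1
        for s in successors[e]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return visited == len(episodes)        # all episodes placed means no cycle

# A single conflict A -> B can be ordered without ending either episode.
one_way = can_be_ordered(["A", "B"], [("A", "B")])
# Conflicts in both directions form a cycle, so some episode has to end.
two_way = can_be_ordered(["A", "B"], [("A", "B"), ("B", "A")])
# Splitting A into A1 and A2 at the second conflict restores an order.
after_split = can_be_ordered(["A1", "A2", "B"],
                             [("A1", "A2"), ("A1", "B"), ("B", "A2")])
```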
Karma Hardware
[Figure: Base System with Core 0 through Core 15 and L2 banks 0 through 15 connected by an Interconnect, plus DRAM]
• State added at each core
– Address Filter (FLT)
– Reference (REFS)
– Predecessor (PRED)
– Successor (SUCC)
– Timestamp (TS)
• Total State: 148 bytes/core
Evaluation:
• Were we able to speed up the replay?
[Figure: Apache. Bars: Base, Rerun Replay, Karma Replay. X-axis: Number of cores-L2 cache size (4core-4MB, 8core-8MB, 16core-16MB). Y-axis: Speedup normalized to "Base" of corresponding configuration]
Evaluation:
• Were we able to speed up the replay?
[Figure: four panels (Apache, Jbb, Zeus, Oltp). Bars: Base, Rerun Replay, Karma Replay. X-axis: Number of cores-L2 cache size (4core-4MB, 8core-8MB, 16core-16MB). Y-axis: Speedup normalized to "Base" of corresponding configuration]
On average ~4X improvement in replay speed over Rerun
Evaluation
• Did we blow up log size?
[Figure: Karma log size normalized to Rerun's log size for Apache, Zeus, Oltp, and Jbb. X-axis: Maximum allowable Episode size (128, 256, 512, 1024, 2048, 4096, 8192, Unbounded)]
On average Karma does not increase the size of the log but instead improves it by as much as 40% as we allow larger episodes
Conclusion
• Applications of deterministic replay
– Debugging
– Fault tolerance
– Security
• Existing hardware record-replayer
– Slow replay or
– Requires major hardware changes
• Karma: Faster Replay with nearly-conventional h/w
– Extends Rerun
– Uses DAG instead of Scalar clock
– Extend episodes past conflicts
• Widen Application + Lower Cost → More Attractive
Questions?