Low-Overhead Software Transactional Memory
with Progress Guarantees and Strong Semantics
Minjia Zhang,
Jipeng Huang, Man Cao, Michael D. Bond
1
Do We Need Efficient STM?
2
Problem Solved!
Blue Gene/Q
3
Problem Solved?
HTM is limited…
4
Problem Solved?
Best-effort HTM: no completion guarantee [1]
Performance penalty: short transactions [2]
Language-level support for atomic blocks: STM fallback
atomic {  // transaction
    from.balance -= amount;
    to.balance += amount;
}
[1] I. Calciu et al. Invyswell: A Hybrid Transactional Memory for Haswell’s Restricted Transactional Memory. In PACT, 2014.
[2] R. M. Yoo et al. Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing. In SC, 2013.
5
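Java has no atomic keyword, so the transfer above cannot be run as written. The sketch below is only a stand-in for what an atomic block promises: it runs the body under one global lock, which is far simpler than an STM. The names Account, AtomicBlockSketch, and atomic(...) are hypothetical and do not come from the talk.

```java
import java.util.concurrent.locks.ReentrantLock;

final class Account {
    long balance;
    Account(long balance) { this.balance = balance; }
}

final class AtomicBlockSketch {
    // Assumption: a single global lock stands in for the TM runtime.
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    // Runs the body so that no other atomic body can interleave with it.
    static void atomic(Runnable body) {
        GLOBAL.lock();
        try {
            body.run();
        } finally {
            GLOBAL.unlock();
        }
    }

    // The transfer from the slide, expressed with the stand-in helper.
    static void transfer(Account from, Account to, long amount) {
        atomic(() -> {
            from.balance -= amount;
            to.balance += amount;
        });
    }
}
```

A global lock serializes every atomic block, which is exactly the scalability cost an STM tries to avoid.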
Software Transactional Memory Is Slow
Existing STMs add high overhead [1, 2, 3]
Related challenges: scalability, progress guarantees, strong semantics
[1] C. Cascaval et al. Software Transactional Memory: Why Is It Only a Research Toy? In CACM, 2008.
[2] A. Dragojević et al. Why STM Can Be More than a Research Toy. In CACM, 2011.
[3] R. M. Yoo et al. Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough. In SPAA, 2008.
7
Challenge
Expensive to detect conflicts
T1:
atomic {
    …
    … = o.f;
    … = p.g;
    …
    o.f = …;
    p.g = …;
    …
}
T2 (one concurrent access per build of this slide):
    o.f = …
    p.g = …
    t.k = …
    ?
T1’s accesses are labeled: instrumentation
11
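To make the cost concrete, here is a generic sketch of per-access instrumentation in an eager-locking style. It is not the barrier of any system named in this talk; OwnedObject, NaiveBarriers, and the owner encoding are hypothetical. The point is that the transaction pays for atomic operations to acquire and release ownership even when no other thread ever touches the object.

```java
import java.util.concurrent.atomic.AtomicLong;

final class OwnedObject {
    // Assumed encoding: 0 means unowned, otherwise the owning transaction id.
    final AtomicLong owner = new AtomicLong(0);
    int f;
}

final class NaiveBarriers {
    // Write barrier: returns false when another transaction owns the object.
    static boolean write(OwnedObject o, long txnId, int value) {
        if (o.owner.get() != txnId) {
            // Atomic operation paid on the first access in every transaction.
            if (!o.owner.compareAndSet(0, txnId)) {
                return false; // conflict detected
            }
        }
        o.f = value;
        return true;
    }

    // At commit or abort, ownership is given back with another atomic operation.
    static void release(OwnedObject o, long txnId) {
        o.owner.compareAndSet(txnId, 0);
    }
}
```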
12
LarkTM Contributions
• Adds very low overhead
• Achieves good scalability by using a hybrid approach
• Provides strong progress guarantees
• Provides strong atomicity
13
Key Insight
Avoid high instrumentation costs by minimizing
instrumentation costs for non-conflicting accesses
14
LarkTM Design
Per-object biased reader-writer locks [1, 2]
Eager concurrency control
Piggybacking conflict detection and conflict resolution on lock transfers
• Minimal instrumentation and synchronization for both transactional and non-transactional non-conflicting accesses
• Does not release locks even if transactions commit
[1] M. D. Bond et al. Octet: Capturing and Controlling Cross-Thread Dependences Efficiently. In OOPSLA, 2013.
[2] B. Hindman and D. Grossman. Atomicity via Source-to-Source Translation. In MSPC, 2006.
16
Biased Locks
object o: lock state ∈ {WrExT, RdExT, RdSh}, field f
18
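A minimal sketch of how such a lock state could gate the fast path, reading WrExT as write-exclusive for thread T, RdExT as read-exclusive for thread T, and RdSh as read-shared. The class BiasedLock and its methods are hypothetical, and the sketch leaves out details a real implementation needs (for example memory ordering for read-shared objects).

```java
final class BiasedLock {
    enum Kind { WR_EX, RD_EX, RD_SH }

    final Kind kind;
    final long owner; // owning thread id; unused for RD_SH

    BiasedLock(Kind kind, long owner) {
        this.kind = kind;
        this.owner = owner;
    }

    // True when thread t may write without changing the lock state.
    boolean writeFastPath(long t) {
        return kind == Kind.WR_EX && owner == t;
    }

    // True when thread t may read without changing the lock state.
    boolean readFastPath(long t) {
        return kind == Kind.RD_SH || owner == t;
    }
}
```

When the fast path applies, the access needs only a load and a compare of the lock state, with no atomic operation.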
Multi-thread Execution
object o: lock state = WrExT1, last txn, field f
T1: transaction start (txn id: 42)
T1: o.f = 1
• update o’s last txn to 42
• add o.f to the undo log
• update o.f to 1
object o: lock state = WrExT1, last txn = 42, f = 1
T2: o.f = 2
Problem! No synchronization on T1’s accesses to o
T2 starts coordination
26
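The steps T1 performs for o.f = 1 can be sketched as a write barrier. This is a simplified illustration under stated assumptions: the lock state is reduced to a single owner field, only one int field is logged, and the names TxObject, Txn, and writeF are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TxObject {
    long wrExOwner; // thread holding the object in WrEx state (simplified lock state)
    long lastTxn;   // id of the last transaction that accessed the object
    int f;
    TxObject(long wrExOwner) { this.wrExOwner = wrExOwner; }
}

final class Txn {
    private static final class UndoEntry {
        final TxObject obj;
        final int oldF;
        UndoEntry(TxObject obj, int oldF) { this.obj = obj; this.oldF = oldF; }
    }

    final long threadId;
    final long id;
    private final Deque<UndoEntry> undoLog = new ArrayDeque<>();

    Txn(long threadId, long id) { this.threadId = threadId; this.id = id; }

    // Returns false when the fast path does not apply and coordination is needed.
    boolean writeF(TxObject o, int value) {
        if (o.wrExOwner != threadId) return false;
        o.lastTxn = id;                      // update last txn
        undoLog.push(new UndoEntry(o, o.f)); // add old value to the undo log
        o.f = value;                         // update the field in place
        return true;
    }

    // Roll back in reverse order of the writes.
    void abort() {
        while (!undoLog.isEmpty()) {
            UndoEntry e = undoLog.pop();
            e.obj.f = e.oldF;
        }
    }

    // Commit discards the log; the lock state is left as it is.
    void commit() { undoLog.clear(); }
}
```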
Coordination
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f
T2: o.f = 2
• update o’s lock state to IntT2
• send a request to T1
T1: reaches a safe point and starts Detecting Conflicts
30
A Transactional Conflict
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f
T2: o.f = 2; request
T1 at safe point: Detecting Conflicts; detected conflicts; Resolving Conflicts (Contention Management)
31
Not A Transactional Conflict
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 43)
T2: o.f = 2; request
T1 at safe point: Detecting Conflicts; no conflict
32
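The two cases differ only in T1’s current transaction id (42 versus 43) compared with the object’s last txn (42). One reading of the slides, stated here as an assumption, is that the responding thread detects a conflict by comparing those two ids. ConflictCheck and NO_TXN are hypothetical names.

```java
final class ConflictCheck {
    // Hypothetical marker meaning the responding thread is not in a transaction.
    static final long NO_TXN = -1L;

    // True when the responder's running transaction is the one that last
    // accessed the object, so handing the object over would break atomicity.
    static boolean isTransactionalConflict(long objectLastTxn, long responderCurrentTxn) {
        return responderCurrentTxn != NO_TXN && objectLastTxn == responderCurrentTxn;
    }
}
```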
Coordination
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f
T2: o.f = 2; request; waiting
T1 at safe point: Detecting Conflicts; response
34
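The request, safe point, and response steps can be sketched as below. This is an illustration under assumptions, not the talk’s implementation: the state encoding, the inbox queue, and the names CoordState, CoordObject, CoordRequest, and CoordThread are all hypothetical, and conflict handling is left as a comment.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

final class CoordState {
    enum Kind { WR_EX, INT }
    final Kind kind;
    final String owner;
    CoordState(Kind kind, String owner) { this.kind = kind; this.owner = owner; }
}

final class CoordObject {
    final AtomicReference<CoordState> state;
    CoordObject(CoordState initial) { state = new AtomicReference<>(initial); }
}

final class CoordRequest {
    final CoordObject target;
    final CoordState intermediate; // the Int state installed by the requester
    volatile boolean answered;
    CoordRequest(CoordObject target, CoordState intermediate) {
        this.target = target;
        this.intermediate = intermediate;
    }
}

final class CoordThread {
    final String name;
    final ConcurrentLinkedQueue<CoordRequest> inbox = new ConcurrentLinkedQueue<>();
    CoordThread(String name) { this.name = name; }

    // Requester: install the intermediate state, then ask the current owner.
    CoordRequest request(CoordObject o, CoordThread owner) {
        CoordState seen = o.state.get();
        if (seen.kind == CoordState.Kind.INT) return null; // another transfer is in flight
        CoordState mid = new CoordState(CoordState.Kind.INT, name);
        if (!o.state.compareAndSet(seen, mid)) return null; // lost a race; caller retries
        CoordRequest r = new CoordRequest(o, mid);
        owner.inbox.add(r);
        return r;
    }

    // Owner: at a safe point, answer every pending request.
    void safePoint() {
        for (CoordRequest r = inbox.poll(); r != null; r = inbox.poll()) {
            // Conflict detection and resolution would run here.
            r.answered = true; // the response
        }
    }

    // Requester: once answered, finish the transfer by taking the lock.
    boolean tryFinish(CoordRequest r) {
        return r.answered
            && r.target.state.compareAndSet(r.intermediate,
                   new CoordState(CoordState.Kind.WR_EX, name));
    }
}
```

The owner pays nothing until it reaches a safe point, which is why its own accesses to the object need no synchronization.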
Strong Progress Guarantees
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f; safe point; Detecting Conflicts; response; may abort
T2: o.f = 2; request; waiting; may abort
Starvation and livelock freedom
36
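The slides say only that either side may abort and that the result is starvation and livelock freedom. One policy that can give such a guarantee when either side can be aborted is age-based contention management, shown here as an assumption rather than as the talk’s stated policy. AgeBasedPolicy and its parameters are hypothetical.

```java
final class AgeBasedPolicy {
    enum Victim { REQUESTER, RESPONDER }

    // The older transaction (smaller start time) wins; the younger one aborts.
    // Ties are broken by thread id so that exactly one side is chosen.
    static Victim choose(long reqStart, long reqThread, long respStart, long respThread) {
        if (reqStart != respStart) {
            return reqStart < respStart ? Victim.RESPONDER : Victim.REQUESTER;
        }
        return reqThread < respThread ? Victim.RESPONDER : Victim.REQUESTER;
    }
}
```

For this kind of policy to avoid starvation, an aborted transaction would keep its original start time when it retries, so it eventually becomes the oldest.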
Strong Atomicity Semantics
Transactional vs. Transactional Conflict
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f
T2: transaction start; transactional access o.f = 2; request; waiting
T1 at safe point: Detecting Conflicts; response; abort; retry
38
Strong Atomicity Semantics
Transactional vs. Non-transactional Conflict
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction start (txn id: 42); o.f = 1; …; … = o.f
T2: non-transactional access o.f = 2; request; waiting
T1 at safe point: Detecting Conflicts; response; abort
40
Strong Atomicity Semantics
T1: retry; transaction end; safe point; response
T2: non-transactional access o.f = 2; request
… = o.f
o.f = …
Non-transactional accesses ≈ short transactions
no setting up/tearing down cost
41
No Transactional Conflict
object o: lock state = IntT2, last txn = 42, f = 1
T1: transaction end; safe point; Detecting Conflicts; response
T2: transaction start (txn id: 51); o.f = 2; request; waiting
T2: acquire lock (lock state = WrExT2)
• update o’s last txn to 51
• add o.f to the undo log
• update o.f to 2
object o: lock state = WrExT2, last txn = 51, f = 2
Two versions of coordination protocol
45
LarkTM-O
Adds very low overhead and scales well for
low-contention cases
46
High-Contention Applications
T1: txn 42, then txn 43, each with accesses o.f = … and … = o.f
T2: txn 51, then txn 52, each with accesses o.f = … and … = o.f
Accesses to o.f alternate between T1 and T2; each change of owner costs a request, a safe point, and a response
48
LarkTM-S
Handling High Contention
49
LarkTM-S: Hybrid with Traditional Locking
T1: txn 42 (o.f = 1; … = o.f; …), then txn 43
T2: txn 51, then txn 52, with accesses o.f = … and … = o.f
o causes high contention
51
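A minimal sketch of how a hybrid could decide per object which locking style to use. The slides state only that o causes high contention and that LarkTM-S is a hybrid with traditional locking. The counter, the threshold value, and the names HybridObject and onCoordination are assumptions made for illustration.

```java
final class HybridObject {
    enum Mode { BIASED, UNBIASED }

    // Hypothetical cutoff: coordinated ownership changes tolerated before switching.
    static final int THRESHOLD = 4;

    Mode mode = Mode.BIASED;
    int ownerChanges;

    // Record one ownership transfer that required coordination.
    // Past the cutoff, the object switches to traditional (unbiased) locking.
    void onCoordination() {
        if (mode == Mode.BIASED && ++ownerChanges >= THRESHOLD) {
            mode = Mode.UNBIASED;
        }
    }
}
```

Objects that never see contention stay in biased mode and keep the cheap fast path.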
Comparison Of Concurrency Control
STM | Write concurrency control | Read concurrency control
LarkTM-O | Eager per-object biased reader–writer lock | Eager per-object biased reader–writer lock
LarkTM-S | IntelSTM–LarkTM-O hybrid | IntelSTM–LarkTM-O hybrid
IntelSTM [1, 2] | Eager per-object lock | Lazy version validation
NOrec [3] | Lazy global seqlock | Lazy value validation
[1] B. Saha et al. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. In PPoPP, 2006.
[2] T. Shpeisman et al. Enforcing Isolation and Ordering in STM. In PLDI, 2007.
[3] L. Dalessandro et al. NOrec: Streamlining STM by Abolishing Ownership Records. In PPoPP, 2010.
52
Comparison Of Instrumentation
STM | Instrumented accesses
LarkTM-O | All accesses
LarkTM-S | All accesses
IntelSTM | All accesses
NOrec | All transactional accesses
Callout on the slide: except redundant accesses
53
Comparison Of Progress Guarantees
STM | Progress guarantee
LarkTM-O | Livelock and starvation free
LarkTM-S | Livelock and starvation free
IntelSTM | None
NOrec | Livelock free
54
Comparison Of Semantics
STM | Semantics
LarkTM-O | Strong atomicity
LarkTM-S | Strong atomicity
IntelSTM | Strong atomicity
NOrec | Single global lock atomicity (SLA)
55
Implementation
• LarkTM-O, LarkTM-S, IntelSTM (McRT), and NOrec
• Developed in Jikes RVM 3.1.3
• All STMs share features as much as possible (e.g., inlining decisions, redundant barrier analysis, name-mangling)
• Source code publicly available on the Jikes RVM Research Archive
56
Evaluation Methodology
• TM programs
  • STAMP benchmarks
• STM comparison
  • NOrec
  • IntelSTM
  • LarkTM-O
  • LarkTM-S
• Platform
  • Eight 8-core processors (AMD Opteron 6272)
  • Four 8-core processors (Intel Xeon E5-4620)
57
Single-Thread Performance
[Bar chart. Y-axis: Overhead (%), 0 to 300. Series: NOrec, IntelSTM, LarkTM-O, LarkTM-S. Off-scale bar labels: 2870, 610. Callout values: 73%, 40%.]
63
Speedup Geomean
[Line chart. Y-axis: Speedup, 0 to 2. X-axis: Threads (1, 2, 4, 8). Series: NOrec, IntelSTM, LarkTM-O, LarkTM-S.]
67
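For readers who want to reproduce numbers of this kind, the sketch below shows the usual definitions of percentage overhead and geometric-mean speedup. The talk does not spell out its formulas, so these definitions are assumptions, and Metrics and its method names are hypothetical.

```java
final class Metrics {
    // Overhead of an instrumented run relative to a baseline run, in percent.
    static double overheadPercent(double baselineSeconds, double stmSeconds) {
        return (stmSeconds / baselineSeconds - 1.0) * 100.0;
    }

    // Speedup of a parallel run relative to a baseline run.
    static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    // Geometric mean, computed through logarithms to avoid overflow.
    static double geomean(double[] values) {
        double sumOfLogs = 0.0;
        for (double v : values) {
            sumOfLogs += Math.log(v);
        }
        return Math.exp(sumOfLogs / values.length);
    }
}
```

The geometric mean is the common choice for averaging ratios such as speedups, because it treats a 2x gain and a 2x loss symmetrically.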
Toward Practical STM
[Line chart. Y-axis: Speedup, 0 to 2. X-axis: Threads (1, 2, 4, 8). Series: LarkTM-S, LarkTM-O, NOrec, IntelSTM.]
• Low instrumentation overhead
• Scales well
• Strong progress guarantees
• Strong semantics
Thank you
72