
Software Transactional Memory:
Where Do We Come From? What Are We?
Where Are We Going?
Nir Shavit
Tel-Aviv University and Sun Labs
Traditional Software Scaling

[Chart: speedup of unchanged user code on a traditional uniprocessor grows over time with Moore's law: 1.8x, 3.6x, 7x]
Multicore Software Scaling

[Chart: hoped-for speedup of user code on multicore as cores are added: 1.8x, 3.6x, 7x]

Unfortunately, it is not so simple…
Real-World Multicore Scaling

[Chart: actual speedup of user code on multicore: 1.8x, 2x, 2.9x]

Parallelization and synchronization require great care…
Why? Amdahl's Law:

Speedup = 1/(ParallelPart/N + SequentialPart)

Pay for N = 8 cores with SequentialPart = 25%:
Speedup = only 2.9 times!

As the number of cores grows, the effect of the 25% becomes more acute:
2.3x on 4 cores, 2.9x on 8, 3.4x on 16, 3.7x on 32…
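To make the arithmetic concrete, here is a minimal C sketch (mine, not from the talk) that evaluates this form of Amdahl's law and reproduces the 2.3x/2.9x/3.4x/3.7x figures:

#include <stdio.h>

/* Speedup = 1 / (SequentialPart + ParallelPart/N) */
static double amdahl_speedup(double sequential, int n_cores) {
    double parallel = 1.0 - sequential;
    return 1.0 / (sequential + parallel / n_cores);
}

int main(void) {
    int cores[] = { 4, 8, 16, 32 };
    for (int i = 0; i < 4; i++)
        /* Prints roughly 2.3, 2.9, 3.4, 3.7 for a 25% sequential part. */
        printf("N = %2d  speedup = %.1f\n",
               cores[i], amdahl_speedup(0.25, cores[i]));
    return 0;
}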
Shared Data Structures

[Diagram: many cores operating on data that is 25% shared and 75% unshared, comparing coarse-grained and fine-grained locking of the shared part]

The 25% of shared data is the reason we get only a 2.9x speedup; fine-grained parallelism on it has a huge performance benefit.
A FIFO Queue

[Diagram: queue holding a, b, c with Head and Tail pointers; Dequeue() => a, Enqueue(d)]
A Concurrent FIFO Queue

Simple code, easy to prove correct.

[Diagram: a single object lock protects the whole queue; P: Dequeue() => a, Q: Enqueue(d)]

Contention and a sequential bottleneck. (A coarse-grained sketch follows.)
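As a rough illustration of the single-object-lock design, here is a sketch of mine in C with pthreads; the node and queue types are hypothetical, not the talk's code:

#include <pthread.h>
#include <stdlib.h>

struct node  { struct node *next; int value; };

struct queue {
    pthread_mutex_t lock;              /* one object lock protects everything */
    struct node *head, *tail;
};

void enqueue(struct queue *q, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next  = NULL;
    pthread_mutex_lock(&q->lock);      /* every operation serializes here */
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_mutex_unlock(&q->lock);
}

int dequeue(struct queue *q, int *out) {
    pthread_mutex_lock(&q->lock);
    struct node *n = q->head;
    if (n == NULL) {                   /* empty queue */
        pthread_mutex_unlock(&q->lock);
        return 0;
    }
    q->head = n->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    *out = n->value;
    free(n);
    return 1;
}

The code is trivially correct, but enqueuers and dequeuers all contend on one lock, which is exactly the sequential bottleneck the slide points out.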
Fine-Grained Locks

Finer granularity, more complex code.

[Diagram: separate locks for the Head and the Tail; P: Dequeue() => a, Q: Enqueue(d)]

Verification nightmare: worry about deadlock, livelock…
Fine-Grained Locks

Complex boundary cases: empty queue, last item.

[Diagram: a nearly empty queue; P: Dequeue() => a, Q: Enqueue(b)]

Worry about how to acquire multiple locks. (A two-lock sketch follows.)
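A sketch in the spirit of the classic two-lock queue of Michael and Scott, which sidesteps the boundary cases with a permanent dummy node so enqueuers and dequeuers never touch the same node; the types and names are mine, not the talk's:

#include <pthread.h>
#include <stdlib.h>

struct node  { struct node *next; int value; };

struct queue {
    pthread_mutex_t head_lock, tail_lock;   /* one lock per end */
    struct node *head, *tail;               /* head always points at a dummy node */
};

void queue_init(struct queue *q) {
    struct node *dummy = calloc(1, sizeof *dummy);
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void enqueue(struct queue *q, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next  = NULL;
    pthread_mutex_lock(&q->tail_lock);      /* enqueuers only contend with enqueuers */
    q->tail->next = n;
    q->tail = n;
    pthread_mutex_unlock(&q->tail_lock);
}

int dequeue(struct queue *q, int *out) {
    pthread_mutex_lock(&q->head_lock);      /* dequeuers only contend with dequeuers */
    struct node *dummy = q->head;
    struct node *first = dummy->next;
    if (first == NULL) {                    /* empty: only the dummy is left */
        pthread_mutex_unlock(&q->head_lock);
        return 0;
    }
    *out = first->value;
    q->head = first;                        /* `first` becomes the new dummy */
    pthread_mutex_unlock(&q->head_lock);
    free(dummy);
    return 1;
}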
Lock-Free (JDK 1.5+)

Even finer granularity, even more complex code.

[Diagram: a CAS-based queue; P: Dequeue() => a, Q: Enqueue(d)]

Worry about starvation, subtle bugs, hardness to modify… (A CAS-based enqueue sketch follows.)
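To give a flavor of the extra complexity, here is a stripped-down C11 sketch of a CAS-based enqueue in the style of the Michael-Scott lock-free queue (a variant of which is used by the JDK's ConcurrentLinkedQueue). It is my own rendering, assumes the queue starts with a dummy node, and ignores memory reclamation and the dequeue side:

#include <stdatomic.h>
#include <stdlib.h>

struct node  { _Atomic(struct node *) next; int value; };
struct queue { _Atomic(struct node *) head, tail; };

void queue_init(struct queue *q) {
    struct node *dummy = malloc(sizeof *dummy);
    dummy->value = 0;
    atomic_store(&dummy->next, NULL);
    atomic_store(&q->head, dummy);          /* head and tail start at the dummy */
    atomic_store(&q->tail, dummy);
}

void enqueue(struct queue *q, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    atomic_store(&n->next, NULL);
    for (;;) {
        struct node *last = atomic_load(&q->tail);
        struct node *next = atomic_load(&last->next);
        if (last != atomic_load(&q->tail))
            continue;                       /* tail moved under us: retry */
        if (next == NULL) {
            struct node *expected = NULL;
            /* Try to link the new node after the current last node. */
            if (atomic_compare_exchange_weak(&last->next, &expected, n)) {
                /* Swing the tail; it is fine if another thread already did. */
                atomic_compare_exchange_strong(&q->tail, &last, n);
                return;
            }
        } else {
            /* The tail is lagging: help advance it, then retry. */
            atomic_compare_exchange_strong(&q->tail, &last, next);
        }
    }
}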
Real Applications

Complex: move data atomically between structures.

[Diagram: P executes Dequeue(Q1, a) and Enqueue(Q2, a), so node a moves from Q1 to Q2]

More than twice the worry…
Transactional Memory
[HerlihyMoss93]
Promise of Transactional Memory

Great performance, simple code.

[Diagram: the same queue; P: Dequeue() => a, Q: Enqueue(d)]

Don't worry about deadlock, livelock, subtle bugs, etc…
Promise of Transactional Memory

Don't worry about which locks need to cover which variables when…

[Diagram: P: Dequeue() => a and Q: Enqueue(d) on a nearly empty queue]

TM deals with the boundary cases under the hood.
For Real Applications

It will be easy to modify multiple structures atomically.

[Diagram: P executes Dequeue(Q1, a) and Enqueue(Q2, a) as one step; a moves from Q1 to Q2]

Provide serializability…
Using Transactional Memory

enqueue(Q, newnode) {
    Q.tail->next = newnode
    Q.tail = newnode
}
Using Transactional Memory

enqueue(Q, newnode) {
    atomic {
        Q.tail->next = newnode
        Q.tail = newnode
    }
}
Transactions Will Solve Many of Locks' Problems

No need to think about what needs to be locked, what doesn't, and at what granularity
No worry about deadlocks and livelocks
No need to think about read-sharing
Can compose concurrent objects in a way that is safe and scalable (see the sketch below)
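The atomic{…} blocks above are pseudocode. One concrete way to write the earlier "move a node from Q1 to Q2" example is GCC's transactional-memory extension (compile with -fgnu-tm); this is a minimal sketch of mine, with hypothetical node and queue types, not the talk's code:

/* Compile with: gcc -fgnu-tm move.c */
#include <stddef.h>

struct node  { struct node *next; int value; };
struct queue { struct node *head, *tail; };

/* Atomically move the front node of q1 to the back of q2. Either both
 * queues are updated, or (if the transaction aborts and retries) no
 * intermediate state is visible to other threads. */
void move_front(struct queue *q1, struct queue *q2)
{
    __transaction_atomic {
        struct node *n = q1->head;
        if (n != NULL) {
            q1->head = n->next;                 /* Dequeue(Q1) */
            if (q1->head == NULL)
                q1->tail = NULL;
            n->next = NULL;
            if (q2->tail != NULL)               /* Enqueue(Q2, n) */
                q2->tail->next = n;
            else
                q2->head = n;
            q2->tail = n;
        }
    }
}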
Hardware TM [HerlihyMoss93]

Hardware transactions are 20-30 instructions long… but not ~1000
Different machines… expect different hardware support
Hardware is not flexible… abort policies, retry policies, all application dependent…
Software Transactional Memory [ShavitTouitou94]

The semantics of hardware transactions… today
Tomorrow: serve as a standard interface to hardware
Allow hardware features to be extended when they arrive
Today's focus…
Still, we need to have reasonable performance…
The Brief History of STM

[Timeline: lock-free designs, then obstruction-free, then lock-based; 2007-9: new lock-based STMs from IBM, Intel, Sun, Microsoft]
As Good As Fine-Grained Locking

Postulate (i.e. take it or leave it):
If we could implement fine-grained locking with the same simplicity as coarse-grained, we would never think of building a transactional memory.

Implication:
Let's try to provide STMs that get as close as possible to hand-crafted fine-grained locking.
Transactional Consistency

• Memory transactions are collections of reads and writes executed atomically
• Transactions should maintain internal and external consistency
  – External: with respect to the interleavings of other transactions.
  – Internal: the transaction itself should operate on a consistent state.
External Consistency

Invariant: x = 2y

[Diagram: application memory in which transaction A updates x from 8 to 4 and y from 4 to 2]

Transaction A: Write x, Write y
Transaction B: Read x, Read y, Compute z = 1/(x-y) = 1/4
Locking STM Design Choices

[Diagram: application memory mapped to an array of versioned write-locks (V#)]

PS = lock per stripe (separate array of locks)
PO = lock per object (embedded in the object)
Encounter Order Locking (Undo Log)
[Ennals, Saha, Harris, TinySTM…]

[Diagram: memory locations X and Y with versioned write-locks; a write locks its location in place, and each lock is released with version v#+1]

1. To read: load the lock + the location
2. Check it is unlocked, add to the read-set
3. To write: lock the location, store the value
4. Add the old value to the undo-set
5. Validate that the read-set v#'s are unchanged
6. Release each lock with v#+1

(Reading does not change memory; writing does.) Benefit: quick reads of values freshly written by the reading transaction itself. (A sketch follows.)
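A compressed C11 sketch of the six steps above, assuming one versioned write-lock per memory stripe. The stripe map, the fixed-size logs, and the rule that each stripe is read or written at most once per transaction are my simplifications; this is not TinySTM's code, and the abort/rollback path is omitted:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Versioned write-lock word: bit 0 = locked, bits 1.. = version number v#. */
typedef atomic_ulong vlock;
#define LOCKED(w)   ((w) & 1UL)
#define VERSION(w)  ((w) >> 1)

#define NSTRIPES 1024
static vlock stripe_locks[NSTRIPES];
static vlock *lock_for(void *addr) {            /* address-to-stripe map */
    return &stripe_locks[((size_t)addr >> 3) % NSTRIPES];
}

struct undo  { long *addr; long old; vlock *lk; unsigned long w; };
struct rread { vlock *lk; unsigned long w; };
struct tx    { struct undo undo[64]; int nu; struct rread reads[64]; int nr; };

/* Steps 1-2: load the lock and the location, check unlocked, log the read. */
bool tx_read(struct tx *t, long *addr, long *out) {
    vlock *lk = lock_for(addr);
    unsigned long w = atomic_load(lk);
    if (LOCKED(w)) return false;                /* abort: caller rolls back */
    *out = *addr;
    t->reads[t->nr++] = (struct rread){ lk, w };
    return true;
}

/* Steps 3-4: lock the location at encounter time, log the old value,
 * and update memory in place. */
bool tx_write(struct tx *t, long *addr, long val) {
    vlock *lk = lock_for(addr);
    unsigned long w = atomic_load(lk);
    if (LOCKED(w) || !atomic_compare_exchange_strong(lk, &w, w | 1UL))
        return false;                           /* abort */
    t->undo[t->nu++] = (struct undo){ addr, *addr, lk, w };
    *addr = val;
    return true;
}

/* Steps 5-6: validate the read-set versions, then release each held lock
 * with version v#+1. On abort, the undo values would be written back and
 * the locks released with their old versions (not shown). */
bool tx_commit(struct tx *t) {
    for (int i = 0; i < t->nr; i++)
        if (atomic_load(t->reads[i].lk) != t->reads[i].w)
            return false;
    for (int i = 0; i < t->nu; i++)
        atomic_store(t->undo[i].lk, (VERSION(t->undo[i].w) + 1) << 1);
    return true;
}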
Commit Time Locking (Write Log)
[TL, TL2]

[Diagram: memory locations X and Y with versioned write-locks; locks are acquired only at commit time and released with version v#+1]

1. To read: load the lock + the location
2. Is the location in the write-set? (Bloom filter)
3. Check it is unlocked, add to the read-set
4. To write: add the value to the write-set
5. Acquire the locks
6. Validate that the read/write v#'s are unchanged
7. Release each lock with v#+1

Locks are held for a very short duration. (A sketch follows.)
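A matching C11 sketch of the commit-time (write-log) protocol, reusing the same simplified stripe map and lock word as the encounter-time sketch. It is my own rendering, not the TL code; releasing locks on an abort is omitted, and each stripe is assumed to appear in at most one of the read and write sets:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef atomic_ulong vlock;                     /* bit 0 = locked, rest = v# */
#define LOCKED(w)   ((w) & 1UL)
#define VERSION(w)  ((w) >> 1)

#define NSTRIPES 1024
static vlock stripe_locks[NSTRIPES];
static vlock *lock_for(void *a) { return &stripe_locks[((size_t)a >> 3) % NSTRIPES]; }

struct wentry { long *addr; long val; };
struct rentry { vlock *lk; unsigned long w; };
struct tx { struct wentry writes[64]; int nw; struct rentry reads[64]; int nr; };

/* Steps 1-3: reads first consult the write-set (a real STM uses a Bloom
 * filter for this membership test), else check unlocked and log the read. */
bool tx_read(struct tx *t, long *addr, long *out) {
    for (int i = t->nw - 1; i >= 0; i--)
        if (t->writes[i].addr == addr) { *out = t->writes[i].val; return true; }
    vlock *lk = lock_for(addr);
    unsigned long w = atomic_load(lk);
    if (LOCKED(w)) return false;                /* abort */
    *out = *addr;
    t->reads[t->nr++] = (struct rentry){ lk, w };
    return true;
}

/* Step 4: writes only buffer the new value; memory is untouched until commit. */
void tx_write(struct tx *t, long *addr, long val) {
    t->writes[t->nw++] = (struct wentry){ addr, val };
}

/* Steps 5-7: acquire the write locks, validate the read-set, apply the
 * buffered writes, and release each lock with v#+1. The locks are held
 * only for this short window. */
bool tx_commit(struct tx *t) {
    unsigned long old[64];
    for (int i = 0; i < t->nw; i++) {
        vlock *lk = lock_for(t->writes[i].addr);
        unsigned long w = atomic_load(lk);
        if (LOCKED(w) || !atomic_compare_exchange_strong(lk, &w, w | 1UL))
            return false;                       /* abort */
        old[i] = w;
    }
    for (int i = 0; i < t->nr; i++)
        if (atomic_load(t->reads[i].lk) != t->reads[i].w)
            return false;                       /* abort */
    for (int i = 0; i < t->nw; i++) {
        *t->writes[i].addr = t->writes[i].val;
        atomic_store(lock_for(t->writes[i].addr), (VERSION(old[i]) + 1) << 1);
    }
    return true;
}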
COM vs. ENC, High Load

[Graph: red-black tree, 20% delete / 20% update / 60% lookup; curves for hand-crafted locking, COM (commit-time), ENC (encounter-time), and a single lock]
COM vs. ENC, Low Load

[Graph: red-black tree, 5% delete / 5% update / 90% lookup; curves for hand-crafted locking, COM, ENC, and a single lock]
Subliminal Cut
Problem: Internal Inconsistency

• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us
Internal Inconsistency

Invariant: x = 2y

[Diagram: application memory in which transaction A updates x from 4 to 8 and y from 2 to 4]

Transaction B: Read x = 4
Transaction A: Write x, Write y
Transaction B: Read y = 4   {the transaction is now a zombie}
               Compute z = 1/(x-y)
               DIV by 0 ERROR
Past Approaches

1. Design STMs that allow internal inconsistency.
2. To detect zombies, introduce validation into user code at fixed intervals or in loops, or use traps and OS support.
3. Still, there are cases where zombies cannot be detected: infinite loops in user code…
Global Clock [TL2/SnapIsolation]
[DiceShalevShavit06/RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as
infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed
by a transaction is always consistent
TL2 Version Clock: Read-Only Transactions

[Diagram: memory stripes with version numbers (87, 34, 88, 99, 44, 50, …), a shared version clock VClock = 100, and a private read version RV]

1. RV ← VClock
2. To read: read the lock, read the memory location, re-read the lock; check it is unlocked, unchanged, and that v# <= RV
3. Commit.

Reads form a snapshot of memory. No read-set! (A sketch follows.)
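A minimal C11 sketch of the read-only mode just described, reusing the simplified stripe map and lock word from the earlier sketches; the global clock here is my own variable name, not the TL2 sources:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef atomic_ulong vlock;                     /* bit 0 = locked, rest = v# */
#define LOCKED(w)   ((w) & 1UL)
#define VERSION(w)  ((w) >> 1)

#define NSTRIPES 1024
static vlock stripe_locks[NSTRIPES];
static atomic_ulong vclock;                     /* the shared global version clock */
static vlock *lock_for(void *a) { return &stripe_locks[((size_t)a >> 3) % NSTRIPES]; }

struct ro_tx { unsigned long rv; };             /* the private read version RV */

void ro_begin(struct ro_tx *t) {
    t->rv = atomic_load(&vclock);               /* 1. RV <- VClock */
}

/* 2. Read: lock, memory, lock again; the stripe must be unlocked,
 * unchanged across the read, and no newer than RV. A false return
 * means abort and retry the whole transaction. */
bool ro_read(struct ro_tx *t, long *addr, long *out) {
    vlock *lk = lock_for(addr);
    unsigned long before = atomic_load(lk);
    long val = *addr;
    unsigned long after = atomic_load(lk);
    if (LOCKED(before) || before != after || VERSION(before) > t->rv)
        return false;
    *out = val;
    return true;
}

/* 3. Commit: nothing to do. The reads already form a consistent snapshot
 * of memory as of RV, and no read-set was ever recorded. */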
TL2 Version Clock: Writing Transactions

[Diagram: at commit the writer locks the stripes of X and Y, increments the shared VClock (100 → 121 in the example) to obtain WV, and releases the locks tagged with WV]

1. RV ← VClock
2. To read/write: check unlocked and v# <= RV, then add to the Read/Write-Set
3. Acquire the locks
4. WV = Fetch&Increment(VClock)
5. Validate each v# <= RV
6. Release the locks with v# ← WV

Reads + Increment + Writes = serializable. (A sketch follows.)
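A C11 sketch of the writing-transaction path, again my own simplification of the scheme described above (not the TL2 code): fixed-size sets, disjoint read and write stripes, and no lock release on the abort paths:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef atomic_ulong vlock;                     /* bit 0 = locked, rest = v# */
#define LOCKED(w)   ((w) & 1UL)
#define VERSION(w)  ((w) >> 1)

#define NSTRIPES 1024
static vlock stripe_locks[NSTRIPES];
static atomic_ulong vclock;                     /* shared global version clock */
static vlock *lock_for(void *a) { return &stripe_locks[((size_t)a >> 3) % NSTRIPES]; }

struct wentry { long *addr; long val; };
struct tx { unsigned long rv; struct wentry writes[64]; int nw;
            vlock *reads[64]; int nr; };

void tx_begin(struct tx *t) {
    t->rv = atomic_load(&vclock);               /* 1. RV <- VClock */
    t->nw = t->nr = 0;
}

/* 2. Read: must be unlocked with v# <= RV; remember the stripe. */
bool tx_read(struct tx *t, long *addr, long *out) {
    vlock *lk = lock_for(addr);
    unsigned long w = atomic_load(lk);
    if (LOCKED(w) || VERSION(w) > t->rv) return false;      /* abort */
    *out = *addr;
    t->reads[t->nr++] = lk;
    return true;
}

/* 2. Write: buffer the update in the write-set. */
void tx_write(struct tx *t, long *addr, long val) {
    t->writes[t->nw++] = (struct wentry){ addr, val };
}

bool tx_commit(struct tx *t) {
    for (int i = 0; i < t->nw; i++) {           /* 3. acquire the write locks */
        vlock *lk = lock_for(t->writes[i].addr);
        unsigned long w = atomic_load(lk);
        if (LOCKED(w) || !atomic_compare_exchange_strong(lk, &w, w | 1UL))
            return false;                       /* abort (lock release elided) */
    }
    unsigned long wv = atomic_fetch_add(&vclock, 1) + 1;    /* 4. WV = F&I(VClock) */
    for (int i = 0; i < t->nr; i++) {           /* 5. validate each read v# <= RV */
        unsigned long w = atomic_load(t->reads[i]);
        if (LOCKED(w) || VERSION(w) > t->rv)
            return false;                       /* real TL2 skips locks it holds itself */
    }
    for (int i = 0; i < t->nw; i++) {           /* 6. write back, release with v# = WV */
        *t->writes[i].addr = t->writes[i].val;
        atomic_store(lock_for(t->writes[i].addr), wv << 1);
    }
    return true;
}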
How we learned to stop worrying and love the clock

The version-clock rate is a progress concern, not a safety concern, so:
– (GV4) if the CAS to increment VClock fails, use the VClock value set by the winner (sketched below)
– (GV5) use WV = VClock + 2; increment VClock on abort
– (GV7) localized clocks…
[AvniShavit08]
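A tiny sketch of the GV4 rule, on top of the same global clock variable used in the TL2 sketches above. The cited work argues that letting the loser adopt the winner's value remains safe; I only show the clock-acquisition step, as I understand it:

#include <stdatomic.h>

static atomic_ulong vclock;       /* the shared global version clock */

/* GV4: try once to advance the clock; if the CAS loses, adopt the value
 * installed by the winning transaction instead of looping. */
static unsigned long acquire_wv_gv4(void) {
    unsigned long seen = atomic_load(&vclock);
    unsigned long wv = seen + 1;
    if (!atomic_compare_exchange_strong(&vclock, &seen, wv))
        wv = seen;                /* `seen` now holds the winner's value */
    return wv;
}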
Uncontended Large Red-Black Tree

[Graph: 5% delete / 5% update / 90% lookup; curves for hand-crafted locking, TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, and the Fraser-Harris lock-free algorithm]
Contended Small Red-Black Tree

[Graph: 30% delete / 30% update / 40% lookup; curves for TL/PO, TL2/PO, Ennals]
Implicit Privatization [Menon et al]
• In real apps: often want to “privatize”
data
• Then operate on it non-transactionally
• Many STMs (like TL2) based on
“Invisible Readers”
• Invisible readers are a problem if we
want implicit privatization…
Privatization Pathology

P privatizes node b and then modifies it non-transactionally:

P: atomically {
       a.next = c;
   }
   // b is private
   b.value = 0;

[Diagram: list a → b → c → d; the transaction unlinks b, then P writes b.value = 0 outside any transaction]
Privatization Pathology

The invisible reader Q cannot detect the non-transactional modification to node b:

P: atomically {
       a.next = c;
   }
   // b is private
   b.value = 0;

Q: atomically {
       tmp = a.next;
       foo = (1/tmp.value)
   }

Q: divide-by-0 error
Visible Readers

• Use read-write locks; transactions also lock to read.
• Privatization is immediate…
• But RW-locks will make us burn in coherence-traffic hell: a CAS to increment/decrement the reader count
• Which is why we had invisible readers in the first place…
Read-Write Bytelocks
[DiceShavit09]

• A new read-write lock for multicores
• Common case: no CAS, only a store + membar to read
• Claim: on modern multicores the cost of coherent stores is not too bad…

[Diagram: application memory mapped to an array of read-write byte-locks; each lock is a bytelock]
The ByteLock Lock Record

• Writer ID
• Visible readers:
  – Reader count for unslotted threads (traditional: CAS to increment and decrement)
  – Reader array for slotted threads: an array of atomically addressable bytes, 48 or 112 slots; write + membar to modify

[Diagram: a bytelock fits in a single cache line: the writer id (wrtid), a byte per slot, and a reader counter (rdcnt) for unslotted threads. A struct sketch follows.]
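A sketch of how such a record might be declared in C11. The field names follow the slide; the exact layout, the GCC alignment attribute, and the choice of 48 slots are my assumptions, not the TLRW sources:

#include <stdatomic.h>

#define BYTELOCK_SLOTS 48                   /* one byte per slotted reader */

struct bytelock {
    atomic_uint  wrtid;                     /* writer ID; 0 means no writer */
    atomic_uchar slots[BYTELOCK_SLOTS];     /* visible slotted readers, 1 byte each */
    atomic_uint  rdcnt;                     /* unslotted reader count (CAS-managed) */
} __attribute__((aligned(64)));             /* keep the record in a single cache line */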
ByteLock Write

[Diagram: writer i CASes its id into wrtid, then spins until every reader byte and rdcnt drain to 0]

Writers wait until the readers drain out: CAS to install the writer id, then spin until all slots are 0.
(Intel, AMD, and Sun machines read 8 or 16 bytes at a time, so scanning the slot array is cheap.)
ByteLock Slotted Read

[Diagram: slotted reader i stores 1 into its byte slot; seeing no writer, it proceeds to read memory]

Readers give preference to writers: store 1 into your slot, then check wrtid; if there is no writer, read memory. Release is a simple store of 0 into the slot.
(On Intel, AMD, and Sun machines, a store to a byte + membar is very fast.)
ByteLock Slotted Read Slow-Path

[Diagram: slotted reader i finds wrtid non-zero]

If wrtid is non-zero, the reader gives preference to the writer: it clears its slot, spins until the writer is gone, and then retries.
ByteLock Unslotted Read

[Diagram: unslotted reader i increments rdcnt with a CAS]

Unslotted readers behave as in a traditional RW-lock: CAS to increment rdcnt; if a writer is present, decrement using CAS and wait for the writer to go away. (A read/write acquire sketch follows.)
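Putting the last few slides together, a minimal C11 sketch of the slotted-reader and writer acquire paths for the bytelock record sketched earlier. This is my own rendering of the described protocol, not the TLRW code; the unslotted reader path and any fairness or back-off policy are omitted:

#include <stdatomic.h>

#define BYTELOCK_SLOTS 48
struct bytelock {                            /* same record as sketched above */
    atomic_uint  wrtid;
    atomic_uchar slots[BYTELOCK_SLOTS];
    atomic_uint  rdcnt;
};

/* Slotted read: announce yourself with a plain store + membar (no CAS),
 * then check for a writer. If one is present, back off and wait. */
void bytelock_read_lock(struct bytelock *l, int slot) {
    for (;;) {
        atomic_store(&l->slots[slot], 1);
        atomic_thread_fence(memory_order_seq_cst);   /* the "membar" on the fast path */
        if (atomic_load(&l->wrtid) == 0)
            return;                                  /* no writer: go read memory */
        atomic_store(&l->slots[slot], 0);            /* give preference to the writer */
        while (atomic_load(&l->wrtid) != 0)
            ;                                        /* spin until the writer is gone */
    }
}

void bytelock_read_unlock(struct bytelock *l, int slot) {
    atomic_store(&l->slots[slot], 0);                /* release: a simple store */
}

/* Write: CAS your id into wrtid, then spin until all readers drain out. */
void bytelock_write_lock(struct bytelock *l, unsigned my_id) {
    unsigned expected = 0;
    while (!atomic_compare_exchange_weak(&l->wrtid, &expected, my_id))
        expected = 0;                                /* wait out another writer */
    for (int i = 0; i < BYTELOCK_SLOTS; i++)
        while (atomic_load(&l->slots[i]) != 0)
            ;                                        /* wait for slotted readers */
    while (atomic_load(&l->rdcnt) != 0)
        ;                                            /* wait for unslotted readers */
}

void bytelock_write_unlock(struct bytelock *l) {
    atomic_store(&l->wrtid, 0);
}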
ByteLock Performance

[Graph: curves for TLRW with 48-slot bytelocks, TLRW with 128-slot bytelocks, TL2 GV6/PS, a single mutex, and TLRW with traditional inc/dec read counters]
Where we are heading…

• A lot more work on performance
• Think GC: the game has just begun
  – Improve single-threaded performance
  – Amazing possibilities for compiler optimization
  – OS support
• Explosion of new STMs
  – Many new STMs: Java, C#, compilers, additions to languages,…
  – ~100 new TM papers in the last couple of years
A bit further down the road…
• Transactional Languages
– No Implicit Privatization Problem…
– Composability
• And when hardware TM arrives…
– Contention management
– New possibilities for extending and
interfacing…
Need Experience with Apps

• Today
  – MSF, Quake, Apache, FenixEDU (a large distributed app),…
• We need a lot more transactification of applications
  – Not just rewriting of existing concurrent apps
  – But applications that are parallelized from scratch using TM
Thanks!