Software Transactional Memory

Nir Shavit, Tel-Aviv University and Sun Labs

“Where Do We Come From? What Are We? Where Are We Going?”


Traditional Software Scaling

[Chart: the same user code on a traditional uniprocessor speeds up over time, 1.8x, 3.6x, 7x, riding Moore’s law.]

Multicore Software Scaling

[Chart: the hoped-for multicore scaling, where the same user code speeds up 1.8x, 3.6x, 7x as cores are added.]

Unfortunately, not so simple…

Real-World Multicore Scaling

[Chart: real-world multicore scaling, where user code reaches only about 1.8x, 2x, 2.9x as cores are added.]

Parallelization and synchronization require great care…

Why?

Amdahl’s Law:

Speedup = 1 / (SequentialPart + ParallelPart/N)

Pay for N = 8 cores with a SequentialPart of 25% and you get only about 2.9x. The effect of that 25% becomes more acute as the number of cores grows: 2.3x on 4 cores, 2.9x on 8, 3.4x on 16, 3.7x on 32… approaching a limit of 4x.
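These numbers fall straight out of the formula; as a quick check, the short Java sketch below (my own illustration, not from the talk) evaluates Amdahl’s Law for a 25% sequential part at several core counts.

    // Amdahl's Law: speedup = 1 / (s + (1 - s)/N), where s is the sequential fraction.
    public class Amdahl {
        static double speedup(double sequentialPart, int cores) {
            return 1.0 / (sequentialPart + (1.0 - sequentialPart) / cores);
        }

        public static void main(String[] args) {
            double s = 0.25; // 25% sequential, as on the slide
            for (int n : new int[] {4, 8, 16, 32, 1024}) {
                System.out.printf("N = %4d  speedup = %.1fx%n", n, speedup(s, n));
            }
            // Prints roughly 2.3x, 2.9x, 3.4x, 3.7x, ... approaching 4x.
        }
    }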

Shared Data Structures

[Diagram: cores accessing a data structure that is 25% shared and 75% unshared. With coarse-grained locking all cores serialize on the 25% shared part, which is the reason we get only a 2.9x speedup benefit; with fine-grained locking, cores can operate on the shared part concurrently.]

A FIFO Queue

[Diagram: a linked-list queue with Head and Tail pointers; Dequeue() removes a from the head, Enqueue(d) appends d at the tail.]

A Concurrent FIFO Queue

A single object lock: simple code, easy to prove correct.

[Diagram: P: Dequeue() => a at the Head and Q: Enqueue(d) at the Tail both wait on the same lock.]

Contention and sequential bottleneck.
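As a concrete sketch of this slide (my Java, not code from the talk), a coarse-grained queue hangs one lock on the whole object, so every enqueue and dequeue serializes on it:

    import java.util.ArrayDeque;

    // A minimal coarse-grained concurrent FIFO queue: one lock guards the whole object.
    public class CoarseQueue<T> {
        private final ArrayDeque<T> items = new ArrayDeque<>();

        public synchronized void enqueue(T x) {   // every operation takes the object lock
            items.addLast(x);
        }

        public synchronized T dequeue() {         // dequeuers and enqueuers contend on the same lock
            return items.pollFirst();             // returns null if the queue is empty
        }
    }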

Fine Grain Locks

Finer Granularity, More Complex Code

[Diagram: separate locks at the Head and Tail, so P: Dequeue() => a and Q: Enqueue(d) can proceed in parallel.]

Verification nightmare: worry about deadlock, livelock…

Fine Grain Locks

Complex boundary cases: empty queue, last item.

[Diagram: with a single item left, P: Dequeue() and Q: Enqueue() touch the same node, so the Head and Tail locks interact.]

Worry about how to acquire multiple locks.
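For flavor, here is a simplified Java sketch in the spirit of the classic Michael-Scott two-lock queue (my code, with a dummy node assumed): Head and Tail get separate locks, and the dummy node is what keeps the empty-queue and last-item cases manageable.

    import java.util.concurrent.locks.ReentrantLock;

    // Two-lock FIFO queue with a dummy node (Michael & Scott style), simplified.
    public class TwoLockQueue<T> {
        private static final class Node<T> {
            final T value; Node<T> next;
            Node(T value) { this.value = value; }
        }

        private Node<T> head = new Node<>(null);   // dummy node
        private Node<T> tail = head;
        private final ReentrantLock headLock = new ReentrantLock();
        private final ReentrantLock tailLock = new ReentrantLock();

        public void enqueue(T x) {
            Node<T> node = new Node<>(x);
            tailLock.lock();                        // only enqueuers contend here
            try {
                tail.next = node;
                tail = node;
            } finally { tailLock.unlock(); }
        }

        public T dequeue() {
            headLock.lock();                        // only dequeuers contend here
            try {
                Node<T> first = head.next;
                if (first == null) return null;     // empty-queue boundary case
                head = first;                       // first becomes the new dummy
                return first.value;
            } finally { headLock.unlock(); }
        }
    }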

Lock-Free (JDK 1.5+)

Even Finer Granularity, Even More Complex Code

[Diagram: P: Dequeue() => a and Q: Enqueue(d) race on the Head and Tail using atomic compare-and-swap instead of locks.]

Worry about starvation, subtle bugs, and how hard the code is to modify…
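The lock-free enqueue alone shows the jump in complexity. This is a simplified Java sketch in the spirit of the Michael-Scott algorithm behind java.util.concurrent.ConcurrentLinkedQueue; it is my own illustration, not JDK source.

    import java.util.concurrent.atomic.AtomicReference;

    // Lock-free enqueue in the spirit of the Michael-Scott queue (simplified sketch).
    class LockFreeQueue<T> {
        static final class Node<T> {
            final T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>(null);
            Node(T value) { this.value = value; }
        }

        final AtomicReference<Node<T>> head = new AtomicReference<>(new Node<>(null)); // dummy
        final AtomicReference<Node<T>> tail = new AtomicReference<>(head.get());

        void enqueue(T x) {
            Node<T> node = new Node<>(x);
            while (true) {
                Node<T> last = tail.get();
                Node<T> next = last.next.get();
                if (last == tail.get()) {                       // are last and next still consistent?
                    if (next == null) {
                        if (last.next.compareAndSet(null, node)) {  // link the new node
                            tail.compareAndSet(last, node);         // swing tail (may fail; others help)
                            return;
                        }
                    } else {
                        tail.compareAndSet(last, next);         // tail was lagging: help advance it
                    }
                }
            }
        }
    }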

Real Applications

Complex: Move data atomically between structures

[Diagram: P dequeues a from queue Q1 and enqueues it onto Q2, and the two operations must appear as one atomic step.]

More than twice the worry…

Transactional Memory [HerlihyMoss93]

Promise of Transactional Memory

Great Performance, Simple Code

[Diagram: P: Dequeue() => a and Q: Enqueue(d) each run as a transaction on the shared queue.]

Don’t worry about deadlock, livelock, subtle bugs, etc.…

Promise of Transactional Memory

Don’t worry about which locks need to cover which variables and when…

[Diagram: the single-item boundary case again, with P: Dequeue() => a and Q: Enqueue(d) running as transactions.]

TM deals with boundary cases under the hood

For Real Applications

Will be easy to modify multiple structures atomically

[Diagram: P: Dequeue(Q1, a) followed by Enqueue(Q2, a), moving a between two queues in one transaction.]

Provide Serializability …

Using Transactional Memory

enqueue(Q, newnode) {
    Q.tail->next = newnode
    Q.tail = newnode
}

Using Transactional Memory

enqueue(Q, newnode) {
    atomic {
        Q.tail->next = newnode
        Q.tail = newnode
    }
}

Transactions Will Solve Many of Locks’ Problems

• No need to think about what needs to be locked, what does not, and at what granularity
• No worry about deadlocks and livelocks
• No need to think about read-sharing
• Can compose concurrent objects in a way that is safe and scalable, as in the sketch below
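To illustrate the composability point above (the sketch referenced in the last bullet), here is a Java fragment written against a hypothetical STM.atomic helper and TxQueue interface; both names are mine, declared inline only so the sketch is self-contained, and the placeholder atomic method does not implement any real transactional machinery.

    // Hypothetical transactional building blocks, declared only so this sketch is self-contained.
    interface TxQueue<T> {
        void enqueue(T x);
        T dequeue();
    }

    // A stand-in for an STM runtime; a real one would run the body as a transaction.
    interface STM {
        static void atomic(Runnable body) {
            body.run(); // placeholder: no actual transactional machinery here
        }
    }

    class MoveExample {
        // Dequeue from one transactional queue and enqueue onto another as one atomic step.
        static <T> void move(TxQueue<T> from, TxQueue<T> to) {
            STM.atomic(() -> {
                T item = from.dequeue();
                if (item != null) {
                    to.enqueue(item);   // both effects commit together, or neither does
                }
            });
        }
    }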

Hardware TM [HerlihyMoss93]

• Hardware transactions are 20-30 instructions long… but not ~1000
• Different machines… expect different hardware support
• Hardware is not flexible… abort policies, retry policies are all application dependent…

Software Transactional Memory

[ShavitTouitou94]
• Provides the semantics of hardware transactions… in software, today
• Tomorrow: serve as a standard interface to hardware
• Allow extending hardware features when they arrive
• Today’s focus… still, we need to have reasonable performance…

The Brief History of STM

Designs moved from lock-free, to obstruction-free, to lock-based; by 2007-9, new lock-based STMs from IBM, Intel, Sun, and Microsoft.

As Good As Fine Grained Locking

Postulate (i.e. take it or leave it): if we could implement fine-grained locking with the same simplicity as coarse-grained locking, we would never think of building a transactional memory. Implication: let’s try to provide STMs that get as close as possible to hand-crafted fine-grained locking.

Subliminal Cut


Transactional Consistency

• Memory transactions are collections of reads and writes executed atomically
• Transactions should maintain internal and external consistency
– External: with respect to the interleavings of other transactions.
– Internal: the transaction itself should operate on a consistent state.

External Consistency

[Diagram: application memory holds x and y with the invariant x = 2y. Transaction A writes x and y; transaction B reads x and y and computes z = 1/(x - y) = 1/4. Because the transactions appear in some serial order, B always sees a consistent pair.]

Locking STM Design Choices

[Diagram: application memory is mapped to an array of versioned write-locks, each holding a version number v#.]

PS = Lock per Stripe (separate array of locks)
PO = Lock per Object (embedded in the object)
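A minimal Java sketch of the lock-per-stripe (PS) mapping, assuming a made-up stripe count and a lock-word layout where bit 0 is the locked flag and the remaining bits hold the version number v# (my illustration, not any particular STM’s code):

    import java.util.concurrent.atomic.AtomicLongArray;

    // Lock-per-stripe (PS): a separate array of versioned write-locks covering application memory.
    class StripedVersionedLocks {
        private static final int STRIPES = 1 << 16;               // stripe count is an arbitrary choice here
        private final AtomicLongArray locks = new AtomicLongArray(STRIPES);

        // Map an object (here: its identity hash) to a lock stripe.
        int stripeOf(Object o) {
            int h = System.identityHashCode(o);
            h ^= (h >>> 16);                                      // spread the hash a little
            return h & (STRIPES - 1);
        }

        // Lock word layout: bit 0 = locked flag, bits 1..63 = version number v#.
        boolean isLocked(long lockWord) { return (lockWord & 1L) != 0; }
        long versionOf(long lockWord)   { return lockWord >>> 1; }

        boolean tryLock(int stripe, long expectedVersion) {
            long unlocked = expectedVersion << 1;
            return locks.compareAndSet(stripe, unlocked, unlocked | 1L);
        }

        void unlockWithNewVersion(int stripe, long newVersion) {
            locks.set(stripe, newVersion << 1);                   // the store releases the lock and bumps v#
        }
    }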

Encounter Order Locking (Undo Log) [Ennals, Saha, Harris, TinySTM, SwissTM…]

1. To read: load the lock + location
2. Check unlocked, add to the read-set
3. To write: lock the location, store the value
4. Add the old value to the undo-set
5. Validate that read-set v#’s are unchanged
6. Release each lock with v#+1

(On the slide, code that does not change memory is shown in blue; code that does is shown in red.) Advantage: quick reads of values freshly written by the transaction itself.
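A sketch of the encounter-order write path in Java (my own illustration, not TinySTM or SwissTM code): plain ReentrantLocks stand in for versioned locks and read-set validation is omitted, to isolate the undo-log idea that writes hit memory immediately while old values are logged for rollback.

    import java.util.ArrayDeque;
    import java.util.concurrent.locks.ReentrantLock;

    // Encounter-order locking: writes go to memory immediately, old values go to an undo log.
    class UndoLogTx {
        static final class Cell { volatile int value; final ReentrantLock lock = new ReentrantLock(); }
        record Undo(Cell cell, int oldValue) {}

        private final ArrayDeque<Undo> undoLog = new ArrayDeque<>();
        private final ArrayDeque<Cell> locked  = new ArrayDeque<>();

        void write(Cell c, int newValue) {
            if (!c.lock.isHeldByCurrentThread()) { c.lock.lock(); locked.push(c); } // lock at first encounter
            undoLog.push(new Undo(c, c.value));   // remember the old value for rollback
            c.value = newValue;                   // update memory in place
        }

        int read(Cell c) { return c.value; }      // a read after a write sees the fresh value directly

        void abort() {                            // roll back in reverse order, then release the locks
            while (!undoLog.isEmpty()) { Undo u = undoLog.pop(); u.cell().value = u.oldValue(); }
            while (!locked.isEmpty())  { locked.pop().lock.unlock(); }
        }

        void commit() {                           // read-set validation and v#+1 release are omitted here
            undoLog.clear();
            while (!locked.isEmpty())  { locked.pop().lock.unlock(); }
        }
    }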

Commit Time Locking (Write Log) [TL, TL2, SkySTM…]

1. To read: load the lock + location
2. Is the location in the write-set? (Bloom filter)
3. Check unlocked, add to the read-set
4. To write: add the value to the write-set
5. At commit: acquire locks
6. Validate that read/write v#’s are unchanged
7. Release each lock with v#+1

Advantage: locks are held for a very short duration.
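The complementary sketch for commit-time locking (again my own simplified Java, not TL or TL2 source): writes are buffered in a write log, reads consult that log first, and memory is only touched inside the short lock-holding window at commit. Version validation is omitted here; the TL2 sketches further below add the clock.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReentrantLock;

    // Commit-time locking: writes are buffered, locks are held only across commit.
    class WriteLogTx {
        static final class Cell { volatile int value; final ReentrantLock lock = new ReentrantLock(); }

        private final Map<Cell, Integer> writeSet = new LinkedHashMap<>();

        void write(Cell c, int newValue) { writeSet.put(c, newValue); }    // no lock, no memory update yet

        int read(Cell c) {
            Integer buffered = writeSet.get(c);     // the slide's Bloom filter is replaced by a map lookup
            return buffered != null ? buffered : c.value;
        }

        void commit() {                             // read-set validation against version numbers is omitted
            for (Cell c : writeSet.keySet()) c.lock.lock();        // acquire all write locks
            try {                                                  // (a real STM bounds or orders acquisition)
                for (Map.Entry<Cell, Integer> e : writeSet.entrySet()) e.getKey().value = e.getValue();
            } finally {
                for (Cell c : writeSet.keySet()) c.lock.unlock();  // release (a real STM bumps v# here)
            }
            writeSet.clear();
        }
    }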

COM vs. ENC High Load

[Chart: red-black tree with 20% delete, 20% update, 60% lookup, comparing a hand-crafted lock, commit-time locking (COM), and encounter-order locking (ENC).]

COM vs. ENC Low Load

[Chart: red-black tree with 5% delete, 5% update, 90% lookup, comparing a hand-crafted lock, COM, and ENC.]

Problem: Internal Inconsistency

• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us


Internal Inconsistency

[Diagram: invariant x = 2y. Transaction B reads x = 4; transaction A then writes x and y; transaction B reads y = 4. B is now a zombie, and computing z = 1/(x - y) divides by zero.]

Past Approaches

1. Design STMs that allow internal inconsistency.
2. To detect zombies, introduce validation into user code at fixed intervals or in loops, or use traps and OS support.
3. Still, there are cases where zombies cannot be detected: infinite loops in user code…

Global Clock [TL2/SnapIsolation]

[DiceShalevShavit06 / RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed by a transaction is always consistent

TL2 Version Clock: Read-Only Transactions

[Diagram: a shared VClock (here at 100) and a private RV copied from it (also 100), alongside the memory stripes and their locks.]

1. RV ← VClock
2. To read: read the lock, read memory, re-read the lock; check it is unlocked, unchanged, and v# <= RV
3. Commit.

Reads form a snapshot of memory. No read set!
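A sketch of this read-only fast path in Java (my code, loosely following the published TL2 algorithm; the lock-word encoding matches the earlier stripe sketch): sample the global clock once into RV, then validate each read with a pre/post check of the lock word.

    import java.util.concurrent.atomic.AtomicLong;

    // TL2-style read-only transaction: no read set; every read is validated against RV on the spot.
    class Tl2ReadOnly {
        static final AtomicLong vclock = new AtomicLong();        // shared global version clock

        static final class Cell {                                 // lock word: bit 0 = locked, bits 1.. = v#
            final AtomicLong lockWord = new AtomicLong();
            volatile int value;
        }

        static class AbortException extends RuntimeException {}

        private final long rv = vclock.get();                     // 1. RV <- VClock, sampled at transaction begin

        int read(Cell c) {
            long before = c.lockWord.get();                       // 2. read the lock word...
            int v = c.value;                                      //    ...then the location...
            long after = c.lockWord.get();                        //    ...then the lock word again
            boolean locked  = (after & 1L) != 0;
            boolean changed = before != after;
            boolean tooNew  = (after >>> 1) > rv;                 // written after this snapshot began
            if (locked || changed || tooNew) throw new AbortException();
            return v;                                             // 3. commit needs no further work
        }
    }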

TL2 Version Clock: Writing Transactions

[Diagram: the shared VClock (at 100), the private RV, the memory stripes with their locks, and the commit step.]

1. RV ← VClock
2. To read/write: check unlocked and v# <= RV, then add to the read/write-set
3. Acquire locks
4. WV = Fetch&Increment(VClock)
5. Validate each read-set v# <= RV
6. Release locks, setting v# ← WV

Reads + increment + writes = serializable.
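And a matching sketch of the writer’s commit (same caveats: my simplified Java, reusing the Cell, vclock, and AbortException declarations from the read-only sketch; the read path, bounded spinning, and lock release on abort are omitted): lock the write-set, fetch-and-increment the clock to get WV, validate the read-set against RV, then write back and release with WV.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // TL2-style writer commit, reusing Cell, vclock and AbortException from the read-only sketch above.
    class Tl2Writer {
        private final long rv = Tl2ReadOnly.vclock.get();                  // 1. RV <- VClock at begin
        private final List<Tl2ReadOnly.Cell> readSet = new ArrayList<>();  // filled by the (omitted) read path
        private final Map<Tl2ReadOnly.Cell, Integer> writeSet = new LinkedHashMap<>();

        void write(Tl2ReadOnly.Cell c, int v) { writeSet.put(c, v); }      // 2. buffer the write

        void commit() {
            for (Tl2ReadOnly.Cell c : writeSet.keySet()) {                 // 3. acquire write locks
                long w = c.lockWord.get();
                if ((w & 1L) != 0 || !c.lockWord.compareAndSet(w, w | 1L))
                    throw new Tl2ReadOnly.AbortException();
            }
            long wv = Tl2ReadOnly.vclock.incrementAndGet();                // 4. WV = Fetch&Increment(VClock)
            for (Tl2ReadOnly.Cell c : readSet) {                           // 5. validate each read: v# <= RV
                long w = c.lockWord.get();                                 //    and not locked by someone else
                if ((w >>> 1) > rv || ((w & 1L) != 0 && !writeSet.containsKey(c)))
                    throw new Tl2ReadOnly.AbortException();
            }
            for (Map.Entry<Tl2ReadOnly.Cell, Integer> e : writeSet.entrySet()) {
                e.getKey().value = e.getValue();                           //    write back the buffered values
                e.getKey().lockWord.set(wv << 1);                          // 6. release each lock with v# = WV
            }
            writeSet.clear();
            readSet.clear();
        }
    }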

How we learned to stop worrying and love the clock

The version clock rate is a progress concern, not a safety concern, so…

– (GV4) if the CAS to increment VClock fails, use the VClock value set by the winner (see the sketch below)
– (GV5) use WV = VClock + 2; increment VClock on abort
– (GV7) localized clocks… [AvniShavit08]
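GV4, for instance, fits in a few lines of Java (my illustration of the idea, not TL2 source): a committing writer tries to bump the clock once with a CAS, and if it loses the race it simply adopts the value the winner installed.

    import java.util.concurrent.atomic.AtomicLong;

    // GV4 sketch: a failed CAS on the global clock is not retried; the loser uses the winner's value.
    class Gv4Clock {
        private final AtomicLong vclock = new AtomicLong();

        long acquireWriteVersion() {
            long seen = vclock.get();
            long wv = seen + 1;
            if (vclock.compareAndSet(seen, wv)) {
                return wv;                 // we advanced the clock ourselves
            }
            return vclock.get();           // someone else moved it past `seen`; their value serves as our WV
        }
    }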

Uncontended Large Red-Black Tree

[Chart: 5% delete, 5% update, 90% lookup, comparing hand-crafted locking, TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, and the Fraser-Harris lock-free tree.]

Contended Small RB-Tree

[Chart: 30% delete, 30% update, 40% lookup, comparing TL/PO, TL2/PO, and Ennals.]

Implicit Privatization [Menon et al]

• In real apps we often want to “privatize” data
• Then operate on it non-transactionally
• Many STMs (like TL2) are based on “invisible readers”
• Invisible readers/writers are a problem if we want implicit privatization…

Privatization Pathology

P privatizes node b then modifies it non-transactionally

[Diagram: list a → b → c → d.]

P: atomically { a.next = c; }  // b is private
   b.value = 0;

Privatization Pathology

Invisible reader Q cannot detect non-transactional modification to node b

[Diagram: Q runs concurrently with P.]

P: atomically { a.next = c; }  // b is private
   b.value = 0;

Q: atomically {
     tmp = a.next;
     foo = (1 / tmp.value);
   }

Q: divide-by-zero error

Solving the Privatization Problem

• Visible Writers – reads are made aware of overlapping writes [DiceShavit07/M4, GottschlichConnors07, SpearMichaelVonPraun08…]
• Visible Readers – writes are made aware of overlapping reads [EllenLevLuchangcoMoir07/SNZI, DiceShavit09/Bytelocks…]

Visible Readers

• Use read-write locks. Transactions also lock to read.
• Privatization is immediate…
• But RW-locks will make us burn in coherence-traffic hell: a CAS to increment/decrement the reader count
• Which is why we had invisible readers in the first place

Read-Write Bytelocks

[DiceShavit09]
• A new read-write lock for multicores
• Common case: no CAS, only a store + memory barrier to read
• Claim: on modern multicores the cost of coherent stores is not too bad…

[Diagram: application memory mapped to an array of bytelocks.]

The ByteLock Lock Record

• Writer ID (wrtid)
• Visible readers:
– Reader count (rdcnt) for unslotted threads: traditional CAS to increment and decrement
– Reader array for slotted threads: an array of atomically addressable bytes, 48 or 112 slots, modified with a write + memory barrier

[Diagram: a lock record fitting in a single cache line, holding the writer id, one byte per slotted reader, and the counter for unslotted readers.]


ByteLock Write

Writers wait till readers drain out

[Diagram: writer i CASes its id into wrtid, then spins until every reader byte and the reader count are 0. Intel, AMD, and Sun chips can read 8 or 16 bytes of the slot array at a time.]

Slotted Read

Readers give preference to writers

[Diagram: slotted reader i stores 1 into its byte, issues a memory barrier, and checks that wrtid is 0 (no writer) before reading memory. Release is a simple store to its byte; the store + membar path is very fast.]

Unslotted Read

Unslotted readers work as in a traditional RW lock

[Diagram: unslotted reader i increments the reader count (rdcnt) with a CAS; if wrtid is non-zero it decrements rdcnt using a CAS and waits for the writer to go away.]
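To make the structure concrete, here is a compressed Java sketch of a bytelock-style read-write lock; it is my approximation of the idea in [DiceShavit09], not the TLRW code, and the slot count and field layout are assumptions. Slotted readers use only a store plus a fence, unslotted readers fall back to a CAS-style counter, and writers spin until the reader bytes and the count drain to zero.

    import java.lang.invoke.VarHandle;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Bytelock-style read-write lock sketch: slotted readers avoid CAS on the read path.
    class ByteLockSketch {
        static final int SLOTS = 48;                        // assumed slot count, as on the slide
        final AtomicInteger wrtid = new AtomicInteger(0);   // 0 = no writer, otherwise the writer's id
        final AtomicIntegerArray slots = new AtomicIntegerArray(SLOTS); // one "byte" per slotted reader
        final AtomicInteger rdcnt = new AtomicInteger(0);   // unslotted reader count

        void slottedReadLock(int slot) {
            while (true) {
                slots.set(slot, 1);                         // announce ourselves: a plain store...
                VarHandle.fullFence();                      // ...plus a memory barrier, no CAS
                if (wrtid.get() == 0) return;               // no writer: we hold the read lock
                slots.set(slot, 0);                         // writer present: back off (readers defer)
                while (wrtid.get() != 0) Thread.onSpinWait();
            }
        }

        void slottedReadUnlock(int slot) { slots.set(slot, 0); }     // release is a simple store

        void unslottedReadLock() {                          // traditional RW-lock style fallback
            while (true) {
                rdcnt.incrementAndGet();                    // atomic increment on the shared counter
                if (wrtid.get() == 0) return;
                rdcnt.decrementAndGet();                    // writer present: undo and wait
                while (wrtid.get() != 0) Thread.onSpinWait();
            }
        }

        void unslottedReadUnlock() { rdcnt.decrementAndGet(); }

        void writeLock(int myId) {
            while (!wrtid.compareAndSet(0, myId)) Thread.onSpinWait();   // one writer at a time
            for (int i = 0; i < SLOTS; i++)                 // wait for slotted readers to drain
                while (slots.get(i) != 0) Thread.onSpinWait();
            while (rdcnt.get() != 0) Thread.onSpinWait();   // and for unslotted readers
        }

        void writeUnlock() { wrtid.set(0); }
    }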

TLRW ByteLock Performance

[Chart: TLRW with a 48-slot bytelock and a 128-slot bytelock versus TL2 (GV6, PS), a mutex, and TLRW with CAS-based inc/dec read counters. Transact 2009.]

Where we are heading…

• A lot more work on performance
– Visible writers, visible readers
• Think GC: the game has just begun
– Improve single-threaded performance
– Amazing possibilities for compiler optimization
– OS support
• Explosion of new STMs
– ~100 TM papers in the last couple of years

A bit further down the road…

• Transactional languages
– No implicit privatization problem…
– Composability
• And when hardware TM arrives…
– Contention management
– New possibilities for extending and interfacing…

Need Experience with Apps

• Today – MSF, Quake, Apache, FenixEDU (a large distributed app), student trials in Germany and the US…
• Need a lot more transactification of applications
– Not just rewriting of existing concurrent apps
– But applications that are parallelized from scratch using TM

Thanks!