Transcript slides

IBM T. J. Watson Research Center
Overview of POWER HTM
Maged Michael
IBM T J Watson Research Center
WTTM 2014
15 July 2014
Outline
• POWER HTM features
• Use cases
• Performance results
Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto
Any errors in describing POWER HTM features and performance in this presentation are my own.
WTTM 2014 - POWER HTM
POWER HTM Features
Basic Transactional Instructions
• TBEGIN: Begins an outermost transaction (or increments nesting level)
• TEND: Commits an outermost transaction (or decrements nesting level)
• TBEGIN sets a condition register to indicate success or failure
• TEND sets a condition register to indicate whether it was executed in a transaction or not (i.e., extraneous TEND)
• Transaction failure transfers control to the instruction following TBEGIN
• Basic example
tbegin.                         # begin transaction
beq failure_handler             # branch to failure handler if failure code is set
...
tend.
bgt was_not_in_a_transaction    # (optional) check if tend was extraneous
Features of Basic Transactions
• No hardware progress guarantee. Failure handlers must include an alternative non-HTM software path.
• Strong isolation. Hardware detection of conflicts with non-transactional accesses.
• Flat nesting. Transaction failure transfers control to the instruction following the outermost TBEGIN.
• Order guarantee for successful transactions among three groups of (cacheable write-back) memory accesses:
– Before TBEGIN
– Inside the transaction
– After TEND
Example: Initially X == Y == 0. The outcome r1 == r2 == 0 is not allowed.
Thread 1          Thread 2
st X = 1          st Y = 1
tbegin.           tbegin.
ld r1 = Y         ld r2 = X
tend.             tend.
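The forbidden outcome in the ordering example can be checked with a small model. The sketch below does not run any HTM code. It assumes that the order guarantee makes each thread's store take effect before its transactional load, and that the four accesses fall into a single global order. It then enumerates every such interleaving. The function name outcome_reachable is hypothetical.

```c
#include <stdbool.h>

/* Thread A: st X = 1; ld r1 = Y.   Thread B: st Y = 1; ld r2 = X.
   Returns true if some interleaving that respects each thread's
   program order produces the outcome (want_r1, want_r2). */
bool outcome_reachable(int want_r1, int want_r2) {
    /* Bit i of sched selects which thread runs at step i (1 = B, 0 = A). */
    for (int sched = 0; sched < 16; sched++) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        int pa = 0, pb = 0;          /* per-thread program counters */
        bool valid = true;           /* false if a thread would run a third step */
        for (int step = 0; step < 4 && valid; step++) {
            if ((sched >> step) & 1) {
                if (pb == 0) y = 1;
                else if (pb == 1) r2 = x;
                else valid = false;
                pb++;
            } else {
                if (pa == 0) x = 1;
                else if (pa == 1) r1 = y;
                else valid = false;
                pa++;
            }
        }
        if (valid && r1 == want_r1 && r2 == want_r2)
            return true;
    }
    return false;
}
```

Under these assumptions, outcome_reachable(0, 0) is false while the other three outcomes are reachable, which matches the slide's statement.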
Transaction Abort
• TABORT: Causes transaction failure
• Unconditional variants with and without 8-bit code
• Conditional variants with 32/64-bit register or immediate parameters
• Example: Transactional lock elision entry
tbegin.
beq- tle_failure_handler
ld r=LOCK                 # load lock
cmpi r==FREE              # compare with free value
beq+ $+8                  # if free, start critical section
tabort.                   # if not free, abort TLE transaction
<critical section>

tbegin.
beq- tle_failure_handler
ld r=LOCK                 # load lock
tabort[wd]ci. r!=FREE     # if not free, abort TLE transaction
<critical section>
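A minimal C sketch of the lock elision entry pattern above, assuming GCC's PowerPC HTM builtins (__builtin_tbegin, __builtin_tabort, __builtin_tend), which are available when compiling with -mhtm (this defines __HTM__). The names fallback_lock, shared_counter, MAX_HTM_ATTEMPTS and increment_counter are hypothetical. The simple spinlock stands in for a real mutex. Without HTM support, only the lock path is compiled.

```c
#include <stdatomic.h>

enum { MAX_HTM_ATTEMPTS = 3 };      /* assumed retry budget */

static atomic_int fallback_lock;    /* hypothetical lock: 0 = free, 1 = held */
static long shared_counter;

static void lock_acquire(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
        expected = 0;               /* CAS overwrote expected; reset and retry */
}

static void lock_release(void) {
    atomic_store(&fallback_lock, 0);
}

void increment_counter(void) {
#if defined(__HTM__)
    for (int attempt = 0; attempt < MAX_HTM_ATTEMPTS; attempt++) {
        if (__builtin_tbegin(0)) {
            /* Inside the transaction: read the lock so that a later
               acquisition by another thread conflicts with us. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
                __builtin_tabort(0);    /* lock is busy: abort the transaction */
            shared_counter++;           /* critical section */
            __builtin_tend(0);
            return;
        }
        /* Transaction failed: control resumed after tbegin; try again. */
    }
#endif
    /* Non-HTM software path, required because there is no progress guarantee. */
    lock_acquire();
    shared_counter++;
    lock_release();
}
```

Calling increment_counter() N times from one thread leaves shared_counter at N on either path. A production version would also use the failure cause to decide whether to retry.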
Transactional Registers and Failure Causes
• TFHAR: Address of failure handler, i.e., outermost TBEGIN + 4
• TFIAR: Address of failure instruction when applicable
• TEXASR: Transaction exception and status register. Includes cause of transaction failure.
• TEXASR register contains a summary bit that provides a hint of whether the cause of failure is likely to be persistent or transient
• TEXASR register also contains an 8-bit software code that may have been provided with a TABORT instruction
• Failure causes include conflicts, abort instructions, footprint overflow, I/O, access to non-write-back memory, nesting level overflow, disallowed instructions (e.g., sleep, cache invalidation).
Suspending/Resuming Transactional State
• TSUSPEND: Suspends the current transaction, i.e., transitions from transactional state to suspended state
• TRESUME: Resumes the suspended transaction
• Loads and stores in suspended state are performed non-speculatively as they occur and do not use hardware transactional resources
• No new transactions can be initiated in suspended state
• Transaction failure is recorded but failure handling is deferred until the transaction is resumed
• Loads in suspended state from locations written transactionally return the written values as long as the transaction has not failed
• Stores in suspended state to locations accessed transactionally cause transaction failure
• TCHECK: Checks for transaction failure and validity of prior memory operations (may be used in transactional state too)
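One way to picture suspended-state stores is a statistics counter that survives a transaction failure. This is a single-threaded sketch, assuming GCC's PowerPC HTM builtins (__builtin_tsuspend, __builtin_tresume, and the others used below) under -mhtm. The names attempts_seen, value and add_one are hypothetical. The 128-byte alignment is also an assumption: it keeps the two variables in separate cache lines, so the suspended-state store does not touch a line the transaction has written. Without HTM support, only the plain path is compiled.

```c
static _Alignas(128) volatile long attempts_seen;  /* updated non-speculatively */
static _Alignas(128) long value;                   /* updated transactionally */

void add_one(int force_abort) {
#if defined(__HTM__)
    if (__builtin_tbegin(0)) {
        value++;                  /* speculative: undone if the transaction fails */
        __builtin_tsuspend();
        attempts_seen++;          /* performed as it occurs; kept after a failure */
        __builtin_tresume();
        if (force_abort)
            __builtin_tabort(0);  /* failure: control resumes after tbegin */
        __builtin_tend(0);
        return;
    }
    /* Failure handler: the speculative value++ was rolled back. */
#else
    (void)force_abort;
    attempts_seen++;
#endif
    value++;                      /* non-HTM path (single-threaded sketch only) */
}
```

After add_one(0) and add_one(1), value is 2 on either path. On HTM hardware, the forced abort in the second call rolls back the speculative increment, but not the update to attempts_seen.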
Rollback Only Transactions (ROT)
• Intended for single thread speculation
• Not intended for shared data
• No conflict detection
• Keeps track only of transactional stores
• No order guarantees
• May be nested with atomic transactions
Use Cases
Transactional Lock Elision
• Transactional lock elision - Entry
pthread_mutex_lock(mutex) {
  if (do_tle(mutex)) {                      // Check TLE state and collect stats if needed
    attempts = 0;                           // Count TLE attempts for current
  TRY_TLE:
    if (__TM_begin()) {                     // Inside HW transaction
      if (!is_free(mutex)) __TM_abort();    // If mutex is busy abort HW transaction
      return 0;                             // return SUCCESS
    }
    // HW transaction failed
    // Failure handler:
    //   Decide to retry TLE or fallback on conventional implementation
    //   based on number of failed attempts, cause of failure, and lock recursion
    //   May update TLE stats for the mutex
    if (decide_to_try_TLE_again(mutex, ++attempts, __TM_is_failure_persistent())) {
      wait_until_free(mutex);
      backoff(attempts);
      goto TRY_TLE;
    }
  }
  <Fallback on conventional non-TLE lock acquisition implementation>
}
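The retry decision in the failure handler above could look like the following sketch. It is an illustration only: the threshold MAX_TLE_ATTEMPTS, the capped exponential backoff, and the simplified signatures (no mutex argument, no per-mutex statistics or lock recursion check) are all assumptions, not the policy used in the measured implementation.

```c
#include <stdbool.h>

enum { MAX_TLE_ATTEMPTS = 5 };   /* assumed retry budget */

/* Retry TLE only if the failure looks transient and the budget is not spent. */
bool decide_to_try_tle_again(int attempts, bool failure_persistent) {
    if (failure_persistent)
        return false;            /* e.g. footprint overflow: retrying is unlikely to help */
    return attempts < MAX_TLE_ATTEMPTS;
}

/* Number of spin iterations to wait before the next attempt:
   doubles with each failed attempt, capped at 1024. */
unsigned backoff_spins(int attempts) {
    if (attempts < 0)
        return 1u;
    int shift = attempts < 10 ? attempts : 10;
    return 1u << shift;
}
```

With these assumptions, a persistent failure falls back to the conventional lock at once, and a transient failure is retried up to four times with growing delays.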
Transactional Lock Elision
• Transactional lock elision - Exit
pthread_mutex_unlock(mutex) {
  if (is_free(mutex))
    if (__TM_end())       // End TLE transaction
      return 0;           // return success
  <Follow conventional non-TLE path>
}
Path Length Reduction
• Example: java.util.concurrent ConcurrentLinkedQueue.offer() critical path of CAS-based implementation
No TM
 1        l      t=[tail]
 2        isync
 3        l      s=[t.next]
 4        isync
 5        l      r=[tail]
 6        isync
 7        cmp    r,t
 8        bne    start_over
 9        cmpi   s,0
10        bne    fix_tail
11        hwsync
12   L1:  larx   r=[t.next]
13        cmp    r,s
14        bne    start_over
15        stcx   [t.next]=n
16        bne-   L1
17        hwsync
18   L2:  larx   r=[tail]
19        cmp    r,t
20        bne    skip_stcx
21        stcx   [tail]=n
22        bne-   L2
23        isync
Path Length Reduction
• CLQ with TM
TM
 1        tbegin
 2        beq-   failure_handler
 3        l      t=[tail]
 4        l      s=[t.next]
 5        cmpi   s,0
 6        beq+   L1              # skip next instruction
          mr     t=s             # not common case
 7   L1:  st     [t.next]=n
 8        st     [tail]=n
 9        tend
• Fallback on conventional CAS-based implementation in case of TM failure
• Aggregation of memory barriers
Other Use Case Examples
• Hybrid HW/SW high-level transactions, e.g., HTM commit acceleration, spin-waiting in suspended state
• Thread-level speculation with commit ordering using suspended-mode accesses
• Single thread speculation using Rollback-Only Transactions: assume an optimization is safe, and roll back if it turns out to be unsafe
Performance
Single Thread
• An empty Pthreads TLE critical section is 6% faster than a conventional Pthreads critical section.
• 71% reduction in execution time (warm caches) of CLQ offer()/poll() pairs using TM path length reduction and memory barrier aggregation
• The execution time of an empty transaction with suspend/resume is 3.4x that of an empty transaction without suspend/resume
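To compare the 71% figure with results quoted as ratios, a reduction in execution time can be converted to a speedup as 1 / (1 - reduction). The helper below is a hypothetical illustration of that arithmetic and is not part of the original measurements.

```c
/* Speedup implied by a fractional reduction in execution time,
   e.g. reduction = 0.71 for a 71% reduction. Requires reduction < 1. */
double speedup_from_reduction(double reduction) {
    return 1.0 / (1.0 - reduction);
}
```

For the CLQ result, speedup_from_reduction(0.71) is about 3.45, so the TM version runs offer()/poll() pairs roughly 3.4 times as fast as the CAS-based version.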
Pthreads TLE - Microbenchmarks
• Pattern 1: high contention, no conflicts, data set fits in TM capacity
• Pattern 2: high contention, data set that overflows TM capacity
• Pattern 3: mixed pattern
– 80% high contention, no conflict, fits in TM capacity
– 20% medium contention, overflows TM capacity
[Three plots: Speedup versus Threads (0 to 96), each comparing TLE and Locking]
Pthreads TLE - Memcached
• Memcached server with a varying number of threads
• Client running on the same machine
• 96 hardware threads: 12 cores, SMT8
• Best TLE throughput (on 16 threads) is 26.9% higher than best locking throughput (on 12 threads)
• On 16 threads, TLE throughput is 37.5% higher than locking throughput
[Plot: Speedup versus Memcached Server Threads (0 to 48), comparing TLE and Locking]
Summary
• POWER HTM Instruction Set
• Suspend / Resume
• Rollback Only Transactions
• Low HTM overheads
• Caution: do not learn the wrong lessons from specific implementations of specific HTM architectures, e.g., POWER HTM and BG/Q HTM
Thank You