IBM T. J. Watson Research Center
Overview of POWER HTM
Maged Michael
IBM T J Watson Research Center
WTTM 2014
15 July 2014
Outline
POWER HTM features
Use cases
Performance results
Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto
Any errors in describing POWER HTM features and performance in this presentation are my own.
WTTM 2014 - POWER HTM
POWER HTM Features
Basic Transactional Instructions
TBEGIN: Begins an outermost transaction (or increments nesting level)
TEND: Commits an outermost transaction (or decrements nesting level)
TBEGIN sets a condition register to indicate success or failure
TEND sets a condition register to indicate whether it was executed in a
transaction or not (i.e., extraneous TEND)
Transaction failure transfers control to the instruction following TBEGIN
Basic example
    tbegin.                         # begin transaction
    beq  failure_handler            # branch to failure handler if failure code is set
    ...
    tend.
    bgt  was_not_in_a_transaction   # (optional) check if tend was extraneous
Features of Basic Transactions
No hardware progress guarantee. Failure handlers must include an
alternative non-HTM software path.
Strong isolation. Hardware detection of conflicts with non-transactional
accesses.
Flat nesting. Transaction failure transfers control to the instruction
following the outermost TBEGIN.
Order guarantee for successful transactions among three groups of
(cacheable write-back) memory accesses:
– Before TBEGIN
– Inside the transaction
– After TEND
Example: Initially X == Y == 0. The outcome r1 == r2 == 0 is not allowed.
    Thread 1          Thread 2
    st X = 1          st Y = 1
    tbegin.           tbegin.
    ld r1 = Y         ld r2 = X
    tend.             tend.
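The forbidden outcome can be checked with a small model. The sketch below is not a hardware test: it assumes each thread's store is ordered before its transactional load (the guarantee stated above) and enumerates every interleaving of the four steps. The names `run_interleaving` and `both_zero_possible` are hypothetical.

```c
#include <stdbool.h>

/* Result of one run of the two-thread example. */
typedef struct { int r1, r2; } outcome_t;

/* Execute one interleaving. order[i] is 1 or 2: the thread taking step i.
   Each thread has two steps: a store, then a load that stands in for the
   transaction "tbegin. ld tend.". */
static outcome_t run_interleaving(const int order[4]) {
    int X = 0, Y = 0;
    int pc1 = 0, pc2 = 0;              /* per-thread step counters */
    outcome_t out = { -1, -1 };
    for (int i = 0; i < 4; i++) {
        if (order[i] == 1) {
            if (pc1++ == 0) X = 1; else out.r1 = Y;   /* thread 1 */
        } else {
            if (pc2++ == 0) Y = 1; else out.r2 = X;   /* thread 2 */
        }
    }
    return out;
}

/* Returns true if some order-preserving interleaving gives r1 == r2 == 0. */
bool both_zero_possible(void) {
    for (int mask = 0; mask < 16; mask++) {
        int order[4], steps_of_thread1 = 0;
        for (int i = 0; i < 4; i++) {
            order[i] = ((mask >> i) & 1) ? 1 : 2;
            steps_of_thread1 += (order[i] == 1);
        }
        if (steps_of_thread1 != 2) continue;   /* each thread takes 2 steps */
        outcome_t o = run_interleaving(order);
        if (o.r1 == 0 && o.r2 == 0) return true;
    }
    return false;
}
```

Under this assumption no interleaving produces r1 == r2 == 0: whichever load runs last always follows the other thread's store.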
Transaction Abort
TABORT: Causes transaction failure
Unconditional variants with and without 8-bit code
Conditional variants with 32/64-bit register or immediate parameters
Example: Transactional lock elision entry
    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK                # load lock
    cmpi  r==FREE               # compare with free value
    beq+  $+8                   # if free, start critical section
    tabort.                     # if not free, abort TLE transaction
    <critical section>
The same entry using a conditional abort:
    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK                # load lock
    tabort[wd]ci. r!=FREE       # if not free, abort TLE transaction
    <critical section>
Transactional Registers and Failure Causes
TFHAR: Address of failure handler, i.e., outermost TBEGIN + 4
TFIAR: Address of failure instruction when applicable
TEXASR: Transaction exception and status register. Includes cause of
transaction failure.
TEXASR register contains a summary bit that provides a hint of whether
the cause of failure is likely to be persistent or transient
TEXASR register also contains an 8-bit software code that may have
been provided with a TABORT instruction
Failure causes include conflicts, abort instructions, footprint overflow,
I/O, access to non-write-back memory, nesting level overflow, and disallowed
instructions (e.g., sleep, cache invalidation).
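The persistent/transient hint is typically fed into a retry policy. The following is a minimal sketch of one such policy, assuming the caller has already extracted the hint from TEXASR as a boolean. The function names and the constants are hypothetical illustrations, not tuned values.

```c
#include <stdbool.h>

#define MAX_HTM_ATTEMPTS 8    /* hypothetical retry budget */
#define MAX_BACKOFF_SHIFT 10  /* hypothetical cap: at most 1024 spins */

/* Decide whether another hardware attempt is worthwhile.
   attempts: number of failed attempts so far.
   failure_persistent: the TEXASR hint that the cause is likely to recur
   (e.g., footprint overflow), in which case retrying is pointless. */
bool should_retry_htm(int attempts, bool failure_persistent) {
    if (failure_persistent)
        return false;                 /* go straight to the software path */
    return attempts < MAX_HTM_ATTEMPTS;
}

/* Exponential backoff: number of spin iterations before the next attempt. */
unsigned backoff_spins(int attempts) {
    int shift = attempts;
    if (shift < 0) shift = 0;
    if (shift > MAX_BACKOFF_SHIFT) shift = MAX_BACKOFF_SHIFT;
    return 1u << shift;
}
```

Treating the hint as advisory keeps the policy safe: a wrong hint costs only performance, since the software path is always correct.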
Suspending/Resuming Transactional State
TSUSPEND: Suspends the current transaction, i.e., transitions from
transactional state to suspended state.
TRESUME: Resumes the suspended transaction.
Loads and stores in suspended state are performed non-speculatively as
they occur and do not use hardware transactional resources
No new transactions can be initiated in suspended state
Transaction failure is recorded but failure handling is deferred until the
transaction is resumed
Loads in suspended state from locations written transactionally return the
written values as long as the transaction has not failed
Stores in suspended state to locations accessed transactionally cause
transaction failure
TCHECK: Checks for transaction failure and validity of prior memory
operations. (May be used in transactional state too)
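One use of the suspended state is to record information that must survive a rollback. The sketch below assumes GCC's PowerPC HTM built-ins (`__builtin_tbegin`, `__builtin_tend`, `__builtin_tsuspend`, `__builtin_tresume`, compiled with `-mhtm`); on any other target the macros make every attempt fail so the code still builds and the caller takes its software path. `add_with_trace` and `attempts_started` are hypothetical names, and the counter is not thread safe as written.

```c
#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN()   __builtin_tbegin(0)   /* nonzero: transaction started */
#define HTM_END()     __builtin_tend(0)
#define HTM_SUSPEND() __builtin_tsuspend()
#define HTM_RESUME()  __builtin_tresume()
#else
/* Assumption for portability: without HTM, every attempt fails. */
#define HTM_BEGIN()   0
#define HTM_END()     ((void)0)
#define HTM_SUSPEND() ((void)0)
#define HTM_RESUME()  ((void)0)
#endif

/* Written only in suspended state, so the update is non-speculative
   and is kept even if the transaction later fails. */
static unsigned long attempts_started;

/* Returns 1 if the update committed in a hardware transaction,
   0 if the caller must use its non-HTM software path. */
int add_with_trace(long *target, long delta) {
    if (HTM_BEGIN()) {
        HTM_SUSPEND();
        attempts_started++;      /* performed immediately, not rolled back */
        HTM_RESUME();            /* a failure recorded meanwhile fires here */
        *target += delta;        /* speculative until HTM_END() */
        HTM_END();
        return 1;
    }
    return 0;                    /* failed or no HTM: *target is unchanged */
}
```

The counter must not be touched inside the transaction itself, since a store in suspended state to a transactionally accessed location causes failure.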
Rollback Only Transactions (ROT)
Intended for single thread speculation
Not intended for shared data
No conflict detection
Keeps track only of transactional stores
No order guarantees
May be nested with atomic transactions
Use Cases
Transactional Lock Elision
Transactional lock elision - Entry
pthread_mutex_lock(mutex) {
  if (do_tle(mutex)) {          // Check TLE state and collect stats if needed
    attempts = 0;               // Count TLE attempts for current
  TRY_TLE:
    if (__TM_begin()) {         // Inside HW transaction
      if (!is_free(mutex))      // If mutex is busy abort HW transaction
        __TM_abort();
      return 0;                 // return SUCCESS
    }
    // HW transaction failed
    // Failure handler:
    //   Decide to retry TLE or fall back on conventional implementation
    //   based on number of failed attempts, cause of failure, and lock recursion
    //   May update TLE stats for the mutex
    if (decide_to_try_TLE_again(mutex, ++attempts, __TM_is_failure_persistent())) {
      wait_until_free(mutex);
      backoff(attempts);
      goto TRY_TLE;
    }
  }
  <Fallback on conventional non-TLE lock acquisition implementation>
}
Transactional Lock Elision
Transactional lock elision - Exit
pthread_mutex_unlock(mutex) {
  if (is_free(mutex))
    if (__TM_end())             // End TLE transaction
      return 0;                 // return success
  <Follow conventional non-TLE path>
}
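The entry and exit paths can be combined into a small self-contained lock. This is a sketch under stated assumptions, not the Pthreads implementation: it uses GCC's PowerPC HTM built-ins (`__builtin_tbegin`, `__builtin_tend`, `__builtin_tabort`, compiled with `-mhtm`) where available, and otherwise defines the macros so that every attempt fails and the conventional path is always taken. `tle_lock_t`, `tle_lock`, `tle_unlock`, and `TLE_MAX_ATTEMPTS` are hypothetical names.

```c
#include <stdatomic.h>

#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN()  __builtin_tbegin(0)   /* nonzero: transaction started */
#define HTM_END()    __builtin_tend(0)
#define HTM_ABORT()  __builtin_tabort(0)
#else
/* Assumption for portability: without HTM, every attempt fails. */
#define HTM_BEGIN()  0
#define HTM_END()    ((void)0)
#define HTM_ABORT()  ((void)0)
#endif

typedef struct { atomic_int held; } tle_lock_t;   /* 0 = free, 1 = held */

enum { TLE_MAX_ATTEMPTS = 4 };   /* hypothetical retry budget */

void tle_lock(tle_lock_t *l) {
    for (int attempts = 0; attempts < TLE_MAX_ATTEMPTS; attempts++) {
        if (HTM_BEGIN()) {
            /* Reading the lock adds it to the transaction's read set, so a
               conventional acquisition by another thread aborts us. */
            if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
                HTM_ABORT();
            return;              /* critical section runs transactionally */
        }
        /* Transaction failed: wait until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
            ;
    }
    /* Fallback: conventional acquisition with compare-and-swap. */
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &l->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;
}

void tle_unlock(tle_lock_t *l) {
    if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0)
        HTM_END();               /* lock is free: we were eliding it */
    else
        atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A caller brackets its critical section with `tle_lock(&l)` and `tle_unlock(&l)` exactly as with a conventional mutex; whether the lock word was ever written is invisible to the caller.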
Path Length Reduction
Example: java.util.concurrent ConcurrentLinkedQueue.offer() critical path
of CAS-based implementation
No TM
     1        l      t=[tail]
     2        isync
     3        l      s=[t.next]
     4        isync
     5        l      r=[tail]
     6        isync
     7        cmp    r,t
     8        bne    start_over
     9        cmpi   s,0
    10        bne    fix_tail
    11        hwsync
    12  L1:   larx   r=[t.next]
    13        cmp    r,s
    14        bne    start_over
    15        stcx   [t.next]=n
    16        bne-   L1
    17        hwsync
    18  L2:   larx   r=[tail]
    19        cmp    r,t
    20        bne    skip_stcx
    21        stcx   [tail]=n
    22        bne-   L2
    23        isync
Path Length Reduction
CLQ with TM
TM
     1        tbegin
     2        beq-   failure_handler
     3        l      t=[tail]
     4        l      s=[t.next]
     5        cmpi   s,0
     6        beq+   L1              # skip next instruction
              mr     t=s             # not common case
     7  L1:   st     [t.next]=n
     8        st     [tail]=n
     9        tend
Fallback on conventional CAS-based implementation in case of TM
failure
Aggregation of memory barriers
Other Use Case Examples
Hybrid HW/SW high-level transactions. E.g., HTM commit acceleration,
spin-waiting in suspended state.
Thread-level speculation with commit ordering using suspended-mode
accesses
Single thread speculation using Rollback-Only Transaction. Assume safe
optimization and rollback if optimization was unsafe.
Performance
Single Thread
An empty Pthreads TLE critical section is 6% faster than a conventional
Pthreads critical section.
71% reduction in execution time (warm caches) of CLQ offer()/poll()
pairs using TM path length reduction and memory barrier aggregation
The execution time of an empty transaction with suspend/resume is 3.4x
that of an empty transaction without suspend/resume
Pthreads TLE - Microbenchmarks
Pattern 1: high contention, no conflicts, data set fits in TM capacity
Pattern 2: high contention, data set that overflows TM capacity
Pattern 3: Mixed pattern
– 80% high contention, no conflict, fits in TM capacity
– 20% medium contention, overflows TM capacity
[Figures, one per pattern: speedup (y-axis) versus number of threads (x-axis, 0 to 96) for TLE and Locking]
Pthreads TLE - Memcached
Memcached server with varying number of threads
Client running on the same machine.
96 hardware threads. 12 cores. SMT 8
Best TLE throughput (on 16 threads) is 26.9% higher than best locking
throughput (on 12 threads)
On 16 threads, TLE throughput is 37.5% higher than locking throughput
[Figure: speedup (y-axis) versus number of Memcached server threads (x-axis, 0 to 48) for TLE and Locking]
Summary
POWER HTM Instruction Set
Suspend / Resume
Rollback Only Transactions
Low HTM overheads
Caution not to learn wrong lessons from specific implementations of
specific HTM architectures. E.g., POWER HTM and BG/Q HTM
Thank You