Performance of memory reclamation for lockless synchronization


Performance of memory reclamation
for lockless synchronization
By Thomas E. Hart, Paul E. McKenney,
Angela Demke Brown, Jonathan Walpole
Handwaved about by Jim Cotillier
CS510 – Concurrent Systems
The Problem

Why not just stick with classical locks?
o Performance issues (blocking)
o CAS-class instruction overhead
o Susceptible to:
  • Deadlock
  • Priority Inversion
  • Convoying
Lockless synchronization addresses this, but is exposed to Read/Reclaim Races
o Reclamation of shared data elements without coordination with all contenders leads to an inconsistent global state
  • Such ex post facto references to deleted data yield unpredictable results
Uncoordinated reclamation…
Some Approaches to Solutions

QSBR -- Quiescent State Based Reclamation
EBR/NEBR -- Epoch Based Reclamation
HPBR -- Hazard Pointers Based Reclamation
LFRC -- Lock-free Reference Counting

Functionality provided by a client/library interface
o But no single, invariant set of interface semantics exists across all schemes
QSBR




Permits the reclamation of data only after a time interval elapses, called a Grace Period
QSBR defines a Grace Period to be the temporal interval (a,b), such that any data element deleted before a can be reclaimed after b
A Quiescent State is a state of a thread, T, in which T holds no references to shared elements, active or deleted (zombie)
Any interval in which each thread passes through a Quiescent State is a QSBR Grace Period
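
The Grace Period definition above can be checked mechanically against a trace. The sketch below is only an illustration: the function name `is_grace_period` and the trace format (thread name mapped to the times at which it was quiescent) are assumptions made for this example, not part of any QSBR library.

```python
def is_grace_period(a, b, quiescent_times):
    """Return True if every thread passes through a quiescent state inside (a, b).

    quiescent_times maps a thread name to the times at which that thread
    declared a quiescent state (a hypothetical trace format).
    """
    return all(
        any(a < t < b for t in times)
        for times in quiescent_times.values()
    )


# Three threads, as in a three-thread QSBR scenario.
trace = {"T1": [2, 9], "T2": [4], "T3": [7]}
short_interval = is_grace_period(1, 5, trace)   # T3 is not quiescent in (1, 5)
long_interval = is_grace_period(1, 8, trace)    # all three threads are quiescent
```

Under this model, an element deleted before time 1 could be reclaimed after time 8, but not after time 5.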
Three-thread QSBR example…
QSBR Fuzzy Barriers



Protect access to “protected” code which no thread should execute before all other threads reach a specified point
Do not absolutely block, a la hard barriers, only prevent execution of “protected” code until barrier opens
Thus, can be used to synchronize reclamation
Using QSBR

Client explicitly declares Quiescent State:
… and thereby enters a fuzzy barrier
Problem: thread failure

A dead thread cannot call quiescent_state() and thus can force QSBR to block…
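
The thread-failure problem can be seen in a toy, single-threaded model. This is a sketch under stated assumptions: the class `QSBR` and its `retire()` method are hypothetical names, real QSBR batches elements per grace period rather than tracking them one by one, and only `quiescent_state()` is a name taken from the slide.

```python
class QSBR:
    """Toy single-threaded model of quiescent-state-based reclamation."""

    def __init__(self, thread_ids):
        self.threads = set(thread_ids)
        self.pending = []      # [element, threads that have not been quiescent yet]
        self.reclaimed = []

    def retire(self, element):
        # A deleted (zombie) element must wait until every thread has
        # passed through a quiescent state.
        self.pending.append([element, set(self.threads)])

    def quiescent_state(self, tid):
        # Thread `tid` declares that it holds no references to shared elements.
        still_pending = []
        for element, waiting in self.pending:
            waiting.discard(tid)
            if waiting:
                still_pending.append([element, waiting])
            else:
                self.reclaimed.append(element)   # grace period has elapsed
        self.pending = still_pending


qsbr = QSBR(["T1", "T2", "T3"])
qsbr.retire("node_a")
qsbr.quiescent_state("T1")
qsbr.quiescent_state("T2")
# T3 has died and never calls quiescent_state(), so node_a stays a zombie.
blocked = [element for element, _ in qsbr.pending]
```

As long as T3 stays silent, `blocked` keeps holding `node_a` and nothing is reclaimed.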
EBR (Fraser)

Uses Grace Periods, like QSBR
o But does not rely upon explicit client Quiescent State declarations, as QSBR does
Encapsulates lockless operations within Critical Sections
o …which the client explicitly declares, via the functions critical_enter() and critical_exit()
Counts the number of Critical Region invocations, and then attempts to enter a fuzzy barrier to reclaim memory
Linked list search using EBR
EBR Epochs






Epochs are modeled after [3], the group of equivalence classes modulo 3
Epochs are hierarchical: Global and Local
Each epoch has an associated zombie element list
Fuzzy barrier for reclamation is entered upon entry to each new epoch
A thread entering a Critical Region (CR) updates its Local Epoch (LE) to match the Global Epoch (GE)
After M (magic number) LE updates, a thread will attempt to increment the GE
EBR Epochs Cont’d.



A GE update attempt only succeeds if the LE of each thread in a CR matches the GE
Since threads update their LE only at the start of a CR, whenever a thread T finds its LE = GE, all lockless operations of other threads that were in progress the last time T was in epoch GE have completed
Thankfully, a grace period has expired!
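
The epoch rules above can be sketched as a small single-threaded model. The names `critical_enter()` and `critical_exit()` come from the slides; the class `EBR`, the `retire()` and `try_advance()` methods, and the single shared set of three zombie lists are assumptions for illustration. The "after M updates" trigger is left out, so the caller invokes `try_advance()` directly, and no fences or atomics are modeled.

```python
class EBR:
    """Toy single-threaded model of epoch-based reclamation (epochs modulo 3)."""

    def __init__(self, n_threads):
        self.global_epoch = 0
        self.local_epoch = [0] * n_threads
        self.in_critical = [False] * n_threads
        self.limbo = [[], [], []]      # one zombie list per epoch
        self.reclaimed = []

    def critical_enter(self, tid):
        self.in_critical[tid] = True
        self.local_epoch[tid] = self.global_epoch   # LE := GE on entry

    def critical_exit(self, tid):
        self.in_critical[tid] = False

    def retire(self, tid, element):
        # Zombies go on the list of the retiring thread's local epoch.
        self.limbo[self.local_epoch[tid]].append(element)

    def try_advance(self):
        # The GE update succeeds only if every thread in a critical
        # region has a local epoch equal to the global epoch.
        for tid, active in enumerate(self.in_critical):
            if active and self.local_epoch[tid] != self.global_epoch:
                return False
        self.global_epoch = (self.global_epoch + 1) % 3
        # Entering a new epoch: its old zombie list is now safe to free.
        self.reclaimed.extend(self.limbo[self.global_epoch])
        self.limbo[self.global_epoch] = []
        return True


ebr = EBR(2)
ebr.critical_enter(0)
ebr.retire(0, "x")
ebr.critical_exit(0)
ebr.critical_enter(1)              # thread 1 enters with LE = 0
first = ebr.try_advance()          # succeeds: GE becomes 1
second = ebr.try_advance()         # fails: thread 1 is still in epoch 0
ebr.critical_exit(1)
ebr.try_advance()                  # GE becomes 2
ebr.try_advance()                  # GE wraps to 0 and "x" is reclaimed
```

The stalled second attempt shows how a thread lingering in a critical region holds the epoch, and therefore reclamation, back.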
EBR Epoch Cycle
NEBR – a Modest EBR Improvement


EBR must pay for the expensive fences at the beginning and end of a CS
Modeled a little after QSBR: have the application set/reset a “critical section(s) may be in here” flag
o NEBR then does not “automatically” do this in each CS
o “Application independence” dies in favor of performance
Reduces EBR’s overhead modestly--closer to QSBR
o NEBR is attractive as the programmer’s responsibilities are limited to marking sections that might contain lockless operations
HPBR/SMR (Michael)

Each thread T has (magic) K Hazard Pointers used to protect elements from reclamation by other threads
o Thus, for N threads, H = NK HP’s exist in toto
o K is small, often 2 (queues and lists); 1 (stacks)
T caches removed elements privately in a list P of size (magic) R
After R removals, T reclaims each element in P that does not have a corresponding HP
If T fails, a maximum of K+R removed elements can be leaked
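
The retire-and-scan step described above can be modeled as follows. This is a single-threaded sketch, not Michael's implementation: the class name `HazardPointerDomain` and its method names are assumptions, elements are plain strings, and memory ordering is not modeled at all.

```python
class HazardPointerDomain:
    """Toy model: N threads, K hazard pointers each, private list P of size R."""

    def __init__(self, n_threads, k=2, r=4):
        self.k, self.r = k, r
        self.hp = [[None] * k for _ in range(n_threads)]   # H = N*K hazard pointers
        self.plist = [[] for _ in range(n_threads)]        # private list P per thread
        self.reclaimed = []

    def protect(self, tid, slot, element):
        self.hp[tid][slot] = element

    def clear(self, tid, slot):
        self.hp[tid][slot] = None

    def retire(self, tid, element):
        # Removed elements are cached privately; after R removals, scan.
        self.plist[tid].append(element)
        if len(self.plist[tid]) >= self.r:
            self.scan(tid)

    def scan(self, tid):
        # Reclaim every element in P that no hazard pointer refers to.
        hazards = {e for row in self.hp for e in row if e is not None}
        keep = []
        for element in self.plist[tid]:
            if element in hazards:
                keep.append(element)
            else:
                self.reclaimed.append(element)
        self.plist[tid] = keep


dom = HazardPointerDomain(n_threads=2, k=2, r=3)
dom.protect(1, 0, "b")             # thread 1 still uses element "b"
for element in ["a", "b", "c"]:
    dom.retire(0, element)         # third removal triggers a scan
after_first_scan = list(dom.reclaimed)
dom.clear(1, 0)
dom.scan(0)                        # "b" is no longer protected
```

Because each thread holds at most R cached removals plus whatever its K pointers protect, the amount of unreclaimed memory stays bounded in this model.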
HPBR Paradigm
HPBR Paradigm Cont’d.

Hazardous References: references to shared elements that may now be zombies or subject to ABA situations
o Algorithms using HPBR must identify a Hazardous Reference, set a Hazard Pointer, then check for element removal
o If an element has not been removed, it continues to be referentially safe
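
The identify, set, then check sequence can be sketched as a retry loop. Everything here is an assumption for illustration: the function `acquire`, the dictionary standing in for shared memory, and the `interleave` hook that simulates another thread running between the two steps. A real implementation needs a memory fence between setting the hazard pointer and re-reading the shared location.

```python
def acquire(shared, hazard_slot, interleave=lambda: None):
    """Return (element, attempts) once the hazard pointer is known to be valid."""
    attempts = 0
    while True:
        attempts += 1
        candidate = shared["head"]        # 1. identify the hazardous reference
        hazard_slot[0] = candidate        # 2. set a hazard pointer to it
        interleave()                      # simulated concurrent activity
        if shared["head"] == candidate:   # 3. check the element was not removed
            return candidate, attempts


shared = {"head": "n1"}
slot = [None]
pending = ["n2"]                          # one simulated concurrent update


def remover():
    # Replaces the head once, as another thread removing "n1" would.
    if pending:
        shared["head"] = pending.pop()


node, attempts = acquire(shared, slot, remover)
```

The first attempt fails because the head changed after the hazard pointer was set; the second attempt succeeds and leaves the pointer on the element actually returned.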
LFRC (Valois, Detlefs, et al.)

Threads track the instantaneous count of references to elements
o When count = 0, element can be reclaimed
Many variations on this scheme may or may not allow element types to change upon reclamation
o May require type invariance (Valois); type independence requires DCAS (Detlefs, et al.)
o Zombies may consume unbounded memory
Performance may be worse than lock-based
o CAS, FAA (Intel: LOCK XADD) very expensive
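
A toy model makes the per-element cost visible. The classes `Counted` and `RefCountPool` are hypothetical names, the link from the list is treated as one reference (an assumption of this sketch), and each `acquire()` or `release()` stands in for one atomic CAS or FAA instruction on real hardware.

```python
class Counted:
    def __init__(self, value):
        self.value = value
        self.refs = 0


class RefCountPool:
    """Toy single-threaded model of lock-free reference counting."""

    def __init__(self):
        self.reclaimed = []
        self.atomic_ops = 0        # stands in for CAS/FAA instructions issued

    def acquire(self, element):
        element.refs += 1
        self.atomic_ops += 1
        return element

    def release(self, element):
        element.refs -= 1
        self.atomic_ops += 1
        if element.refs == 0:      # count = 0: element can be reclaimed
            self.reclaimed.append(element.value)


pool = RefCountPool()
nodes = [Counted(v) for v in "abc"]
for node in nodes:
    pool.acquire(node)             # reference held by the list itself

reader = pool.acquire(nodes[1])    # a reader starts using "b"
pool.release(nodes[1])             # a writer unlinks "b" from the list
reclaimed_while_read = list(pool.reclaimed)
pool.release(reader)               # last reference dropped
```

Even this tiny scenario issues six counter updates, which hints at why a traversal that touches every element pays heavily under LFRC.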
Summary of Schemes




QSBR -- Detects grace periods using application-specified quiescent states
EBR -- Detects grace periods using application-independent epochs
HPBR -- Uses per-thread Hazard Pointers to synchronize reclamation
LFRC -- Uses per-element reference counts to synchronize reclamation
Performance Factors…

Depends on a lot of stuff
o Memory consistency and constraints
o Workload, contention and thread scheduling
Sequentially consistent memory model is still generally assumed by the lock-free literature
o But the hardware trends are toward weaker models
  • Coder needs to rely on fences (memory barriers), which add overhead
  • HPBR, EBR and LFRC require per-operation fences, but QSBR does not; this is shown to be a distinct advantage
Performance Factors Cont’d.

Thread preemption
o Can start when number of threads > number of CPUs
Descheduled threads are blocked threads, as far as reclamation schemes are concerned
o Anything that prevents a Grace Period from closing is bad
Threads may sometimes need to borrow memory from a locked, global pool
o A thread may be preempted whilst holding such a lock, setting up a thread convoy on memory
o HPBR bounds memory stress and has an advantage here
The μBenchmark
The μBenchmark Cont’d.

Master thread flow logic
o Create N children
o Start a timer
o When timer expires, stop children
Average execution time/measured operation = test duration/number of operations
Net CPU time = execution time * number of threads
o If thread count > CPU count, report execution time; otherwise report CPU time.
Driver parameters were selected not to be biased toward any particular reclamation scheme
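
The reporting arithmetic on this slide is small enough to write out. The function name `report`, its parameters, and the sample numbers are assumptions made for this example; they are not measurements from the paper.

```python
def report(test_duration_s, total_operations, n_threads, n_cpus):
    """Return (label, seconds per operation) following the slide's reporting rule."""
    # Average execution time per measured operation.
    exec_time = test_duration_s / total_operations
    # Net CPU time: execution time multiplied by the number of threads.
    cpu_time = exec_time * n_threads
    if n_threads > n_cpus:
        return ("execution time", exec_time)
    return ("CPU time", cpu_time)


# Hypothetical runs: 10 s test, 5 million operations, 8 CPUs.
no_preemption = report(10.0, 5_000_000, n_threads=4, n_cpus=8)
with_preemption = report(10.0, 5_000_000, n_threads=16, n_cpus=8)
```

With 4 threads the model reports CPU time (about 8 microseconds per operation); with 16 threads on 8 CPUs it reports execution time (about 2 microseconds per operation).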
The μBenchmark Cont’d.


CAS implemented on POWER via larx/stcx (LL/SC)
Fences implemented via eieio (“Enforce In-order Execution of I/O”)
Spin locks implemented via CAS and fences
Statically allocated HPBR Hazard Pointers
o Some algorithms may require unbounded HP counts
Choice of placement of QSBR Quiescent State declarations may not be obvious in some algorithms
Performance Measurement Guidelines

Measure the base costs first
o Single-threaded execution, small data structures
  • No contention, preemption, or long list traversals
  • Non-blocking queues, single-element linked lists…
Then move toward complexity
o Pedagogical approach--try to change only one factor at a time
o Consider the R/O, the W/O and the R/W cases in each of the examined reclamation schemes
Base Performance Costs
Scalability with Fractional Workload
Scalability with Traversal Length
Scalability of LFRC
No Preemption; R/O Workload
No Preemption; W/O Workload
Preemption; R/O Workload
Preemption; W/O Workload
Memory Stress Busy Wait
Hash Tables; Update Fraction Workload
No Preemption; R/O Workload with NEBR
Case Study—RCU API in Linux

RCU concepts: “Read/Copy/Update”
o Lockless concurrent reads with deferred destruction of zombie elements
o Writers may not prevent readers from accessing shared data
o Writers must coordinate with each other in some way
  • RCU does not specify what way
o RCU neither blocks nor fails for readers
o Preemptable kernels necessitate the use of rcu_read_lock() and rcu_read_unlock() to toggle kernel preemption
  • …so that context switches do not occur at intolerable times
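
The read, copy, update pattern with deferred destruction can be sketched in a single-threaded model. This is not the Linux kernel API: the class `RcuModel` and its methods are hypothetical, readers are identified by plain strings, and a zombie version is "destroyed" once every reader that was active at update time has finished.

```python
class RcuModel:
    """Toy model of read/copy/update with deferred destruction of old versions."""

    def __init__(self, data):
        self.current = data
        self.active = set()        # readers currently inside a read-side section
        self.zombies = []          # [old version, readers that must finish first]
        self.destroyed = []

    def read_lock(self, rid):
        self.active.add(rid)
        return self.current        # readers neither block nor fail

    def read_unlock(self, rid):
        self.active.discard(rid)
        remaining = []
        for old, waiting in self.zombies:
            waiting.discard(rid)
            if waiting:
                remaining.append([old, waiting])
            else:
                self.destroyed.append(old)
        self.zombies = remaining

    def update(self, change):
        new = dict(self.current)   # copy
        new.update(change)         # update the copy
        old, self.current = self.current, new   # publish the new version
        if self.active:
            self.zombies.append([old, set(self.active)])
        else:
            self.destroyed.append(old)


rcu = RcuModel({"sem": 1})
seen = rcu.read_lock("r1")
rcu.update({"sem": 2})             # writer does not wait for r1
old_value = seen["sem"]            # r1 still sees a consistent old version
destroyed_early = list(rcu.destroyed)
new_value = rcu.read_lock("r2")["sem"]
rcu.read_unlock("r1")              # last pre-existing reader leaves
```

The writer never delays either reader; only the destruction of the old version waits for `r1`.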
Case Study—RCU Cont’d.

QSBR is a natural choice for memory reclamation
o EBR could be used as well, but would not offer any advantages over QSBR
RCU is best targeted to read-mostly data structures
o Rare updates imply rare reclamation
Case Study—RCU Cont’d.

SysV IPC subsystem implemented in Linux via CRQSBR
o Implements semaphores, message queues and shared memory
o Apps use an integer Accessor ID to access in-kernel data structures (essentially a “resource handle”)
o The dynamic, mostly-read (AID/resource) array, formerly spinlocked in stock Linux, was protected here through CRQSBR instead, and benchmarked
Case Study—RCU Cont’d.
Semopbench, 8-CPU, 700 MHz Intel P-III
Case Study—RCU Cont’d.
DBT1 Database Benchmark Raw Results
Case Study—RCU Cont’d.
DBT1 database benchmark results (TPS)
Conclusions

Reclamation has a huge effect on lockless algorithm performance
o So one must tune to the design of the application
Both QSBR and EBR can suffer in the face of memory exhaustion
HPBR and EBR have higher base costs than QSBR due to fences
The NEBR enhancements modestly improve EBR
LFRC has the highest overhead due to the per-element atomic instruction requirement
Conclusions Cont’d.


HPBR scales poorly as the traversal length increases
QSBR is, overall, the best performing reclamation scheme
o …and best suited to an OS kernel environment
o Lockless approaches using QSBR can outperform locking approaches by a large margin
Rantings -- STAE

STAE – Specified Thread Abnormal Exit
o User provides Exit code to be run on condition of thread error trap
o Exit is driven by the etrap interrupt logic; Exit is called immediately after etrap is detected, e.g., SEGV
o Exit has full access to environment of failing thread; may modify any data, etc.
o Exit may:
  • Allow failing thread to die (the status quo)
  • Resuscitate failing thread by telling the dispatcher to restart the thread at an Exit-specified point in its code
  • Call a completely new program to run in place of failing thread (with all of the failing thread’s credentials and context)
Rantings -- PLO

PLO – Perform Locked Operation (IBM z Platform)
o Meta instruction that atomically encapsulates all of: CAL, CAS, DCAS, CASAS, CASADS, CASATS into single-instruction global atomicity
o 32, 64, or 128-bit operands are supported
o Acquires a global hardware interlock unique to PLO
o Is very powerful and flexible, but is so complex that it may require a pre-built parameter list just to “program” it!
o Usually needs to be coded with a zillion operands
o Its proprietary μalgorithm has to be huge, but whether its utility outstrips its cost enough to yield a net gain in performance has not yet been answered (afaik)
Questions/Musings

Suppose DCAS were “improved” so that it used an order of magnitude fewer clocks than today.
o To what extent could macroscopically faster hardware atomicity affect the utility of these lockless schemes?
Could the STAE formalism provably solve the failed thread blocking problem in QSBR?
o If you believe the answer is yes, based on the empirical data in this paper, would the paradigm (QSBR+STAE) satisfy Ockham’s Razor and thus become the overall best solution to the lockless reclamation problem?