The read-copy-update mechanism for supporting real-time applications
on shared-memory multiprocessor systems with Linux
Guniguntala et al.
Publish-Subscribe
◦ insertion
◦ reader-writer synchronization
Wait for pre-existing readers to complete
◦ deletion
◦ change – wait for readers – free
◦ safe memory reclamation
Maintain multiple versions of update objects
◦ for readers
Starting List
New Node
Copy B to B’ and Modify
Move A.Next to B’
B still visible, but not for new readers
Readers complete, remove B
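The copy-and-replace sequence above can be sketched in user-space C11. This is a minimal illustration, not the kernel's list code: `struct node`, `node_new()` and `replace_next()` are hypothetical names, and the caller is assumed to wait for pre-existing readers before freeing the node that is returned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical list node; the slides only name the nodes A, B and C. */
struct node {
    int value;
    _Atomic(struct node *) next;
};

static struct node *node_new(int value, struct node *next)
{
    struct node *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->value = value;
    atomic_init(&n->next, next);
    return n;
}

/*
 * Replace the successor of `prev` with a modified copy.
 * Returns the old node. The caller may free it only after all
 * pre-existing readers have finished with it.
 */
static struct node *replace_next(struct node *prev, int new_value)
{
    struct node *old = atomic_load_explicit(&prev->next, memory_order_relaxed);

    /* Copy B to B' and modify the copy; B itself is never changed. */
    struct node *copy = node_new(new_value,
        atomic_load_explicit(&old->next, memory_order_relaxed));

    /* Move A.next to B'. The release store makes the initialized copy
     * visible before the pointer that leads to it. */
    atomic_store_explicit(&prev->next, copy, memory_order_release);
    return old;
}
```

After `replace_next()` returns, the old node still points at its successor, so a reader that already holds it can finish its traversal; new readers only see the copy.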
RCU Semantics: A First Attempt
McKenney & Walpole
Reader-Writer
◦ rcu_assign_pointer()
p->a = 1;
p->b = 2;
p->c = 3;
rcu_assign_pointer(gp, p);
◦ rcu_dereference()
◦ Memory barriers embedded
in API
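As a rough user-space analogue of the barriers embedded in these two calls, the sketch below uses C11 atomics. `publish()` and `subscribe()` are hypothetical stand-ins, not the kernel macros; an acquire load is used on the read side for simplicity, whereas the kernel relies on cheaper dependency ordering on most architectures.

```c
#include <stdatomic.h>
#include <stddef.h>

struct foo { int a, b, c; };

static _Atomic(struct foo *) gp;  /* global pointer, NULL until published */

/* Stand-in for rcu_assign_pointer(): a release store orders the
 * initialization of *p before the pointer becomes visible. */
static void publish(struct foo *p)
{
    atomic_store_explicit(&gp, p, memory_order_release);
}

/* Stand-in for rcu_dereference(): an acquire load guarantees that a
 * reader that sees the pointer also sees the initialized fields. */
static struct foo *subscribe(void)
{
    return atomic_load_explicit(&gp, memory_order_acquire);
}

/* Reader: returns -1 if nothing has been published yet. */
static int reader_sum(void)
{
    struct foo *p = subscribe();
    if (p == NULL)
        return -1;
    return p->a + p->b + p->c;
}
```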
Writer-Collection
◦ synchronize_rcu() blocks
caller until safe to collect
◦ call_rcu() is asynchronous
call for collection
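The difference between the blocking and the asynchronous call can be pictured with a deferred-callback queue. The sketch below is single-threaded and all names (`struct deferred`, `defer_call()`, `run_deferred()`) are hypothetical; it only shows the idea that the writer queues a callback and returns, and the collector runs it once it has decided a grace period is over.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback record, loosely modeled on the idea of call_rcu(). */
struct deferred {
    struct deferred *next;
    void (*func)(void *);
    void *arg;
};

static struct deferred *pending;  /* callbacks waiting for a grace period */

/* Asynchronous path: queue the callback and return immediately.
 * Returns 0 on success, -1 if the record could not be allocated. */
static int defer_call(void (*func)(void *), void *arg)
{
    struct deferred *d = malloc(sizeof(*d));
    if (d == NULL)
        return -1;
    d->func = func;
    d->arg = arg;
    d->next = pending;
    pending = d;
    return 0;
}

/* Collector path: run every queued callback. The caller is assumed to
 * have established that a grace period has elapsed. Returns the count. */
static int run_deferred(void)
{
    int n = 0;
    while (pending != NULL) {
        struct deferred *d = pending;
        pending = d->next;
        d->func(d->arg);
        free(d);
        n++;
    }
    return n;
}
```

A blocking variant would instead wait for the grace period inside the writer and then free the object directly.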
Reader-Collection (?)
General issues in non-blocking & swap-free
Non-blocking queue
◦ When is it safe to free memory?
◦ Memory reclamation tracking can be relatively costly
◦ Expensive atomic operations / memory barriers required
◦ Atomic operation expense
CAS (15-25 clock cycles on P4)
◦ Retry on contention
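A CAS retry loop of the kind referred to above might look like the following lock-free push. This is a sketch with hypothetical names (`struct item`, `top`, `push()`); it covers insertion only, since removal is where the "when is it safe to free memory?" question arises.

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
    int value;
    struct item *next;
};

static _Atomic(struct item *) top;  /* head of a lock-free stack */

/* Push with compare-and-swap. Returns how many times the CAS had to be
 * retried (contention or a spurious failure of the weak CAS). */
static int push(struct item *it)
{
    int retries = 0;
    struct item *old = atomic_load_explicit(&top, memory_order_relaxed);

    for (;;) {
        it->next = old;
        /* On failure, `old` is refreshed with the current value of top. */
        if (atomic_compare_exchange_weak_explicit(&top, &old, it,
                memory_order_release, memory_order_relaxed))
            return retries;
        retries++;
    }
}
```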
Non-blocking synchronization
◦ Atomic operation expense
store_conditional
◦ Data structure copy expense
With interactions between reader, writer and
collector, when is it time to reclaim memory?
◦ Writer identifies what to collect and trigger
collection to occur (synchronously or asynch)
◦ Readers (indirectly) indicate when to collect by no
longer referencing the freed object
One solution for collector:
◦ Track copies of global pointer into thread-local
memory
Each thread maintains a list of its currently active
pointers
◦ Collector checks the thread-local list prior to
memory reclamation
Sounds a lot like hazard pointers!
Hazard Pointer Disadvantages:
◦ Requires manual identification of hazard references
◦ Expensive on the read path
Requires two memory barriers on the read path
Copy of the global pointer to local reference
Entry of hazard pointer into the list
Every read thread incurs this extra overhead as the
cost for correct memory reclamation. Expensive for
many-reader situations
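The read-path cost can be seen in a small hazard-pointer sketch. The names (`shared`, `hazard[]`, `hp_protect()`, `hp_release()`, `hp_can_reclaim()`) and the fixed `MAX_THREADS` are assumptions for illustration; the default sequentially consistent atomics stand in for the two barriers on the read path.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 4  /* hypothetical fixed number of reader threads */

static _Atomic(void *) shared;               /* the global pointer */
static _Atomic(void *) hazard[MAX_THREADS];  /* one hazard slot per thread */

/* Read path: copy the global pointer, record it as hazardous, then
 * re-check that it is still current. Both the load and the store are
 * sequentially consistent, which is the overhead every reader pays. */
static void *hp_protect(int tid)
{
    for (;;) {
        void *p = atomic_load(&shared);
        atomic_store(&hazard[tid], p);
        if (atomic_load(&shared) == p)
            return p;
    }
}

/* Reader is done with the object. */
static void hp_release(int tid)
{
    atomic_store(&hazard[tid], NULL);
}

/* Collector: an unpublished object may be freed only if no thread
 * currently lists it as a hazard. */
static bool hp_can_reclaim(void *p)
{
    for (int i = 0; i < MAX_THREADS; i++)
        if (atomic_load(&hazard[i]) == p)
            return false;
    return true;
}
```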
RCU -> Collection based on ‘quiescent state’
◦ Threads prevent the occurrence of quiescent state
while their local memory is alive
◦ Collector indirectly observes state of all threads to
infer when safe to reclaim memory
◦ The definition chosen for ‘quiescent state’ will
significantly impact performance
Best choice: Infer by operations that occur anyway
Reader-Collection
◦ rcu_read_lock()
◦ rcu_read_unlock()
◦ read-side critical section
rcu_read_lock();
retval = rcu_dereference(gbl_foo)->a;
rcu_read_unlock();
return retval;
Non-preemptible kernel
◦ Programming convention is
to avoid yielding in the read-side critical section
◦ Memory reclamation on
voluntary context switch
◦ rcu_read_lock/unlock calls
do nothing in non-preemptible kernel
‘Simple case’: Non-preemptible kernel
◦ All threads use read-side critical section with no voluntary yield
no context switch within a read-side critical section
◦ Collector observes all CPUs to determine when all threads have
undergone a context switch
Indicates a pass into a quiescent state
All previous read-side critical sections are now guaranteed to have
exited
Any new threads no longer have visibility to removed object
◦ Safe, but conservative and imprecise; degrades real-time response
Detection of quiescent state occurs after last reader use
Collector waits for all readers to finish even if not all readers were
accessing the memory to be reclaimed
Delays real-time response due to refusal to yield within read-side critical sections
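The context-switch test for the non-preemptible case can be sketched as a single-threaded simulation. `NCPUS`, the counter arrays and the function names are hypothetical; a real kernel gathers this information per CPU without a central loop like this one.

```c
#include <stdbool.h>

#define NCPUS 4  /* hypothetical CPU count */

static unsigned long ctxsw[NCPUS];     /* context switches seen per CPU */
static unsigned long snapshot[NCPUS];  /* counts when the grace period began */

/* Called whenever a CPU performs a context switch. */
static void note_context_switch(int cpu)
{
    ctxsw[cpu]++;
}

/* Collector: remember where every CPU stood when the object was removed. */
static void grace_period_start(void)
{
    for (int i = 0; i < NCPUS; i++)
        snapshot[i] = ctxsw[i];
}

/* The grace period is over once every CPU has switched context at least
 * once, because no read-side critical section spans a context switch. */
static bool grace_period_done(void)
{
    for (int i = 0; i < NCPUS; i++)
        if (ctxsw[i] == snapshot[i])
            return false;
    return true;
}
```

The check is conservative: it waits for every CPU, including those whose readers never touched the removed object.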
Read-side critical section
◦ Readers can now be preempted in their read-side critical sections
◦ Disable preemption on entry and re-enable on exit
Memory freed using synchronize_sched()
◦ Counts scheduler preemptions
Benefits and trade-offs
◦ Allows use of RCU with preemptible kernel
◦ Read-side critical sections cannot be preempted by RT
events, which hurts RT responsiveness
◦ Additional read-side work to disable/enable preemption
Global counter
◦ Atomic increment in rcu_read_lock()
◦ Atomic decrement in rcu_read_unlock()
Quiescent state defined as global counter=0
Not practical
◦ As CPU count increases, counter may never reach 0
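A minimal sketch of the global-counter scheme, with hypothetical names (`active_readers`, `reader_enter()`, `reader_exit()`, `quiescent()`), shows why it is impractical: as long as readers overlap, the counter never returns to zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int active_readers;  /* one counter shared by all CPUs */

/* Stand-in for the atomic increment in rcu_read_lock(). */
static void reader_enter(void)
{
    atomic_fetch_add(&active_readers, 1);
}

/* Stand-in for the atomic decrement in rcu_read_unlock(). */
static void reader_exit(void)
{
    atomic_fetch_sub(&active_readers, 1);
}

/* Quiescent state: no reader is inside a critical section right now. */
static bool quiescent(void)
{
    return atomic_load(&active_readers) == 0;
}
```

With readers A, B and C entering and leaving in overlapping fashion (A in, B in, A out, C in, B out), `quiescent()` stays false until the last of them leaves, and every reader contends on the same cache line.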
Use two-element array as counter
◦ Atomically increment/decrement as
matched pair of ‘current’ and ‘last’
counter
◦ Grace period starts – swap sense of
‘current’ and ‘last’, proceed to only
decrement the ‘last’ counter
◦ Counter eventually reaches 0, marking
end of grace period
High overhead due to memory
contention / cache misses
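The two-element counter can be sketched as follows. Names are hypothetical, and the sketch deliberately ignores the window in which a reader loads the index just before a flip and increments the old element afterwards; a real implementation has to close that window.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int readers[2];  /* 'current' and 'last' reader counts */
static atomic_int current;     /* which element is 'current': 0 or 1 */

/* Reader entry: increment the 'current' element and remember which one. */
static int reader_enter(void)
{
    int idx = atomic_load(&current);
    atomic_fetch_add(&readers[idx], 1);
    return idx;
}

/* Reader exit: decrement the element that was incremented on entry. */
static void reader_exit(int idx)
{
    atomic_fetch_sub(&readers[idx], 1);
}

/* Grace period start: swap the sense of 'current' and 'last'.
 * Returns the index that is now 'last' and can only be decremented. */
static int grace_period_start(void)
{
    return atomic_fetch_xor(&current, 1);
}

/* Grace period end: the 'last' element has drained to zero. */
static bool grace_period_done(int last)
{
    return atomic_load(&readers[last]) == 0;
}
```

New readers that arrive after the flip count against the other element, so they cannot keep the 'last' counter from reaching zero.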
2xN arrays, N=thread
count (2 per thread)
Global index
Updated with
rcu_read_lock() and
rcu_read_unlock()
Requires a grace-period detection state
machine
Improves read-side performance
◦ Avoids cache-miss
◦ Does not require (expensive) atomic instructions
◦ Does not require (expensive) memory barriers
Requires state-machine for grace period
detection
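The per-thread variant can be pictured with plain, non-atomic counters. This is a single-threaded simulation under assumed names (`NTHREADS`, `counters`, `gidx`); in a real implementation the ordering that the plain increments lack is supplied by the grace-period state machine, which is not modeled here.

```c
#include <stdbool.h>

#define NTHREADS 4  /* hypothetical thread count N */

static int counters[NTHREADS][2];  /* the 2xN array: two counters per thread */
static int gidx;                   /* global index selecting the 'current' column */

/* Read-side entry: a plain increment of the thread's own counter,
 * with no atomic instruction and no shared cache line. */
static int reader_enter(int tid)
{
    int idx = gidx;
    counters[tid][idx]++;
    return idx;
}

/* Read-side exit: decrement the counter chosen at entry. */
static void reader_exit(int tid, int idx)
{
    counters[tid][idx]--;
}

/* Grace-period side: flip the global index, returning the old column. */
static int flip(void)
{
    int last = gidx;
    gidx = 1 - gidx;
    return last;
}

/* The old column has drained when the sum over all threads is zero. */
static bool column_drained(int last)
{
    int sum = 0;
    for (int t = 0; t < NTHREADS; t++)
        sum += counters[t][last];
    return sum == 0;
}
```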
Indefinite delays in read-side critical sections
Priority boost would work – but relatively
expensive and not required for every reader
Solution is to defer priority boosting
Extends grace period
Exhausts memory since no collection can occur
Writers cannot allocate memory
Need to prevent low-priority threads from being
indefinitely preempted
◦ Preempted read-side critical threads added to list
◦ List serves as an ‘aging’ tracker
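One way to picture the aging list is the simulation below. Everything in it is an assumption made for illustration: the fixed-size table, the `BOOST_AFTER_TICKS` threshold and the function names are hypothetical, and the `boosted` flag merely marks where a real system would raise the thread's priority.

```c
#include <stdbool.h>

#define MAX_READERS 8        /* hypothetical table size */
#define BOOST_AFTER_TICKS 3  /* hypothetical age at which a reader is boosted */

struct preempted_reader {
    int tid;
    int age;       /* ticks spent preempted inside a critical section */
    bool boosted;  /* stands in for an actual priority boost */
    bool in_use;
};

static struct preempted_reader aging[MAX_READERS];

/* A reader was preempted inside its read-side critical section.
 * Returns its slot in the aging list, or -1 if the list is full. */
static int note_preempted(int tid)
{
    for (int i = 0; i < MAX_READERS; i++) {
        if (!aging[i].in_use) {
            aging[i] = (struct preempted_reader){ .tid = tid, .in_use = true };
            return i;
        }
    }
    return -1;
}

/* The reader left its critical section; drop it from the list. */
static void note_exited(int slot)
{
    aging[slot].in_use = false;
}

/* Periodic tick: age every listed reader and boost only those that have
 * waited long enough. Returns how many were boosted on this tick. */
static int aging_tick(void)
{
    int boosted = 0;
    for (int i = 0; i < MAX_READERS; i++) {
        struct preempted_reader *r = &aging[i];
        if (!r->in_use || r->boosted)
            continue;
        if (++r->age >= BOOST_AFTER_TICKS) {
            r->boosted = true;
            boosted++;
        }
    }
    return boosted;
}
```

Readers that resume quickly leave the list before the threshold and never pay for a boost, which is the point of deferring it.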
Issue List
Global definition of grace period
◦ Single delayed thread in read-side critical section
can stall memory reclamation for everyone
◦ Stall occurs even though reader’s data is unrelated
to memory trying to be reclaimed
RCU Control Block
◦ Reader/updater invocations share defined control
blocks
◦ Readers won’t block reclamation for unrelated
control blocks
idx = srcu_read_lock(&scb);
/* read-side critical section */
srcu_read_unlock(&scb, idx);
/* collection */
synchronize_srcu(&scb);
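The effect of separate control blocks can be sketched in user space. `struct domain` and the `dom_*` functions are hypothetical stand-ins, not the kernel's SRCU implementation, and the drain check is a non-blocking test rather than a blocking wait.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical control block: each one tracks only its own readers. */
struct domain {
    atomic_int idx;       /* which counter new readers use */
    atomic_int count[2];  /* readers per index */
};

/* Enter a read-side critical section of this control block. */
static int dom_read_lock(struct domain *d)
{
    int idx = atomic_load(&d->idx);
    atomic_fetch_add(&d->count[idx], 1);
    return idx;
}

/* Leave it again, using the index returned at entry. */
static void dom_read_unlock(struct domain *d, int idx)
{
    atomic_fetch_sub(&d->count[idx], 1);
}

/* Collector: start a grace period for this control block only. */
static int dom_flip(struct domain *d)
{
    return atomic_fetch_xor(&d->idx, 1);
}

/* Collector: have the pre-existing readers of this block finished? */
static bool dom_drained(struct domain *d, int last)
{
    return atomic_load(&d->count[last]) == 0;
}
```

A reader that holds control block `x` delays collection for `x` only; a flip on an unrelated block `y` drains immediately.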
RCU Performance Comparisons
Fast concurrent reads
Relatively slow writers
Preemption & RT support require increased read-side work