Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Ravi Rajwar and James R. Goodman
Presented by Yang Liu
CPS221 Spring 2008
Outline
Why elide locks?
How to elide locks? – SLE
Why does SLE work correctly?
How to implement SLE?
How is performance improved?
Why elide locks?
Locks force serialization
Not all lock acquisitions are required for correct execution
How to elide locks?
- Atomicity conditions
Within a speculatively executing critical section:
Data read is not modified by another thread before the section ends
Data written is not accessed by another thread before the section ends
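The two conditions above can be expressed as a check against the addresses a speculating thread has touched. This is a minimal sketch, not the paper's hardware: the names SpeculativeSection and violates_atomicity are hypothetical, and per-address sets stand in for whatever granularity the coherence protocol actually tracks.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeSection:
    """Addresses touched by one thread inside a speculative critical section."""
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def violates_atomicity(section, remote_addr, remote_is_write):
    """Return True if an access by another thread breaks either condition."""
    # Condition 1: data we read must not be modified by another thread.
    if remote_is_write and remote_addr in section.read_set:
        return True
    # Condition 2: data we wrote must not be accessed (read or written)
    # by another thread.
    if remote_addr in section.write_set:
        return True
    return False


section = SpeculativeSection(read_set={0x100, 0x104}, write_set={0x200})
shared_read_ok = not violates_atomicity(section, 0x100, remote_is_write=False)
remote_write_conflicts = violates_atomicity(section, 0x100, remote_is_write=True)
remote_read_of_write_conflicts = violates_atomicity(section, 0x200, remote_is_write=False)
```

Note that two threads reading the same data do not conflict, which is why critical sections that only read shared state can run concurrently.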
How to elide locks?
- Eliding silent pairs
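A lock acquire and its matching release form a silent store pair: the release writes back the value the lock held before the acquire, so once both stores have executed the lock word is unchanged. The sketch below illustrates this, assuming a test-and-set style lock where 0 means free and 1 means held. The encoding and the helper name is_silent_pair are assumptions made for illustration.

```python
FREE, HELD = 0, 1  # assumed lock encoding


def is_silent_pair(value_before, first_store, second_store):
    """A store pair is silent if the second store restores the original value."""
    return first_store != value_before and second_store == value_before


memory = {"lock": FREE}
before = memory["lock"]

memory["lock"] = HELD   # lock acquire store
# ... critical section body would run here ...
memory["lock"] = FREE   # lock release store

pair_is_silent = is_silent_pair(before, HELD, FREE)
lock_word_unchanged = memory["lock"] == before
```

If the critical section appears to execute atomically, skipping both stores leaves memory in the same state, and other threads continue to see the lock as free.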
How to elide locks? - Steps
Predict lock release store (silent pairs)
Predict atomicity and elide lock acquire
Execute the critical section speculatively and buffer results
If atomicity is violated, trigger misspeculation, recover, and explicitly acquire the lock
If the lock release store is seen, elide the lock release, commit state, and exit the section
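The steps above can be sketched as a single-threaded software simulation. This is only an illustration of the control flow, not the hardware mechanism: run_critical_section is a hypothetical name, conflict_detected is a hypothetical callable standing in for the coherence-based atomicity check, and the sketch assumes the lock is free when the section starts.

```python
from collections import ChainMap

FREE, HELD = 0, 1  # assumed lock encoding


def run_critical_section(memory, lock_addr, body, conflict_detected):
    """Try to run body with the lock elided; fall back to taking the lock."""
    assert memory[lock_addr] == FREE  # sketch assumes an uncontended lock word

    # Elide the acquire: the lock word is read but never written.
    # Speculative stores land in `buffer`; loads fall through to memory.
    buffer = {}
    body(ChainMap(buffer, memory))

    if not conflict_detected():
        # Release store seen with no violation: elide it and commit.
        memory.update(buffer)
        return "elided"

    # Misspeculation: drop the buffered state and acquire the lock explicitly.
    memory[lock_addr] = HELD
    body(memory)
    memory[lock_addr] = FREE
    return "locked"


def increment(view):
    view["counter"] = view["counter"] + 1


memory = {"lock": FREE, "counter": 0}
first = run_critical_section(memory, "lock", increment, lambda: False)
second = run_critical_section(memory, "lock", increment, lambda: True)
```

In the elided path the lock word is never written, so the section leaves no trace on the lock. In the fallback path the buffered result is thrown away and the body runs again under the real lock.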
How to elide locks? - Example
[Figure sequence: example with the Lock Acquire and Lock Release annotated]
Why does SLE work correctly?
Two predictions
Predict lock release store
Resolved by monitoring memory location of the stores
Predict memory operation atomicity
Resolved by checking atomicity conditions using cache coherence mechanisms
How to implement SLE? – Four Aspects
Initiating speculation
Filter, index, confidence metric
Buffering speculative state
Speculative register state
ROB
Register checkpointing
Speculative memory state
Augmented write-buffers
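The speculative memory state listed above can be pictured as a write buffer that holds stores privately until commit. The sketch below is a software analogy under stated assumptions: the class name SpeculativeWriteBuffer, the per-address granularity, and the capacity of 2 are all hypothetical, and a full buffer is simply reported to the caller as a reason to stop speculating.

```python
class SpeculativeWriteBuffer:
    """Holds speculative stores until they are committed or squashed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def store(self, addr, value):
        """Buffer a store. Returns False if the buffer has no room left."""
        if addr not in self.entries and len(self.entries) >= self.capacity:
            return False
        self.entries[addr] = value  # repeated stores to one address merge
        return True

    def load(self, addr, memory):
        """Loads see the thread's own speculative stores first."""
        if addr in self.entries:
            return self.entries[addr]
        return memory[addr]

    def commit(self, memory):
        """Make all buffered stores visible, then empty the buffer."""
        memory.update(self.entries)
        self.entries.clear()

    def squash(self):
        """Discard all speculative stores on misspeculation."""
        self.entries.clear()


memory = {"a": 1, "b": 2, "c": 3}
wb = SpeculativeWriteBuffer(capacity=2)
wb.store("a", 10)
wb.store("a", 11)                  # merges with the earlier store to "a"
wb.store("b", 20)
overflowed = not wb.store("c", 30)  # third distinct address does not fit
seen_by_owner = wb.load("a", memory)
seen_in_memory_before_commit = memory["a"]
wb.commit(memory)
```

Until commit, memory still holds the old values, so a squash only needs to clear the buffer.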
How to implement SLE? – Four Aspects
Misspeculation conditions and detection
Atomicity violations
Violations due to limited resources
ROB
Register checkpointing with access bit
Finite cache/write-buffer/ROB size
Uncached accesses or events
Committing speculative memory state
How to implement SLE?
How is performance improved?
- Evaluation methodology
Three multiprocessor systems
CMP/SMP/DSM
A simple microbenchmark and six applications
Single register checkpoint
32-entry lock predictor indexed by PC
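A PC-indexed lock predictor with a confidence metric might look like the sketch below. Only the 32 entries and the PC indexing come from the slides. The saturating-counter width, the threshold, the optimistic initial value, and the index function are assumptions made for illustration.

```python
class LockPredictor:
    """Table of saturating confidence counters indexed by instruction PC."""

    def __init__(self, entries=32, max_confidence=3, threshold=2):
        self.entries = entries
        self.max_confidence = max_confidence  # assumed 2-bit counter
        self.threshold = threshold            # assumed elision threshold
        self.table = [max_confidence] * entries  # assumed optimistic start

    def _index(self, pc):
        # Assumed index function: drop low bits, wrap to the table size.
        return (pc >> 2) % self.entries

    def should_elide(self, pc):
        """Elide only when confidence for this PC is high enough."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, success):
        """Raise confidence after a successful elision, lower it otherwise."""
        i = self._index(pc)
        if success:
            self.table[i] = min(self.table[i] + 1, self.max_confidence)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


predictor = LockPredictor()
acquire_pc = 0x4000  # hypothetical PC of a lock-acquire store
elide_initially = predictor.should_elide(acquire_pc)
predictor.update(acquire_pc, success=False)
predictor.update(acquire_pc, success=False)
elide_after_failures = predictor.should_elide(acquire_pc)
```

A lock that repeatedly causes misspeculation drops below the threshold and is acquired normally until elision starts succeeding again.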
How is performance improved?
- Microbenchmark result
How is performance improved?
- Application result
Some thoughts
Idea is good
Remove unnecessary serialization
Make programmers’ work easier
No results are reported for the ROB-based implementation
Why pad the benchmarks to reduce false sharing?
Shouldn’t SLE handle this?