Memory Consistency
Arbob Ahmad, Henry DeYoung, Rakesh Iyer
15-740/18-740: Recent Research in Architecture
October 14, 2009
“Memory Model = Instruction Reordering + Store Atomicity”
Arvind and Jan-Willem Maessen
“Memory consistency models exist to describe and constrain the behavior of [memory systems]”
● Gives a unifying framework for SC and relaxed models with an atomic memory
Instruction Reordering vs. Store Atomicity
● Instruction reordering rules:
  ● Consistency within a thread
  ● e.g.:
● Store atomicity rules:
  ● Ordering which must exist in every serialization
  ● Consistency across threads
Store Atomicity
1. Predecessor Stores of a Load are ordered before its source.
2. Successor Stores of a Store are ordered after its observers.
3. Mutual ancestors of Loads are ordered before the mutual successors of the distinct Stores they observe.
Example for rules 1 and 2: Stores x ← 1 and x ← 2, and a Load x → 2 that observes x ← 2.
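Rules 1 and 2 can be read as edge-insertion steps on an ordering graph. The following is a minimal Python sketch under stated assumptions, not the authors' formulation: the operation names (`St_x1`, `St_x2`, `Ld_x2`), the helper names, and the initial ordering edge chosen in each scenario are hypothetical, picked to mirror the x ← 1, x ← 2, x → 2 example.

```python
def transitive_closure(edges):
    """Return all (a, b) pairs such that a is ordered before b."""
    closure = set(edges)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def apply_rules_1_and_2(ops, edges, source):
    """ops: name -> (kind, address); source: load name -> store it observes."""
    # A Load is always ordered after the Store it observes.
    edges = set(edges) | {(store, load) for load, store in source.items()}
    while True:
        order = transitive_closure(edges)
        new = set()
        for load, src in source.items():
            for name, (kind, addr) in ops.items():
                if kind != "store" or addr != ops[load][1] or name == src:
                    continue
                if (name, load) in order:   # rule 1: predecessor Store of the Load
                    new.add((name, src))
                if (src, name) in order:    # rule 2: successor Store of the source
                    new.add((load, name))
        if new <= order:                    # fixpoint reached, nothing left to add
            return order
        edges |= new


OPS = {"St_x1": ("store", "x"), "St_x2": ("store", "x"), "Ld_x2": ("load", "x")}
SOURCE = {"Ld_x2": "St_x2"}

# Scenario for rule 1 (assumed): x <- 1 is already ordered before the Load.
rule1 = apply_rules_1_and_2(OPS, {("St_x1", "Ld_x2")}, SOURCE)
# Scenario for rule 2 (assumed): x <- 1 is already ordered after x <- 2.
rule2 = apply_rules_1_and_2(OPS, {("St_x2", "St_x1")}, SOURCE)

print(("St_x1", "St_x2") in rule1)  # rule 1 orders x <- 1 before x <- 2
print(("Ld_x2", "St_x1") in rule2)  # rule 2 orders the Load before x <- 1
```

Rule 3 is left out of this sketch; it needs the mutual-ancestor and mutual-successor sets of two Loads that observe distinct Stores.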
Thread A    Thread B    Thread C
x ← 1       y ← 2       y ← 4
Fence       Fence       Fence
y → 2       z ← 6       z → 6
y → 4                   Fence
                        x ← 8
                        x → ?
Local ordering constraints
Observation constraints: each Load is ordered after the Store it observes (y ← 2 before y → 2, y ← 4 before y → 4, z ← 6 before z → 6).
Question: Are there any ordering constraints not represented?
Order is
y←2 : y→2 : y←4 : y→4
or
y←4 : y→4 : y←2 : y→2
● x ← 1 must precede both y → 2 and y → 4
● z → 6 must follow both y ← 2 and y ← 4
Store atomicity constraint: x ← 1 is ordered before z → 6, a mutual successor of y ← 2 and y ← 4.
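The effect of the store atomicity constraint on x → ? can be checked by brute force. The sketch below enumerates every serialization of the three-thread example against a single atomic memory and keeps those in which the Loads return the values shown on the slide. It rests on assumptions not stated in the slides: the thread assignment is read column-wise from the figure (A: x ← 1, Fence, y → 2, y → 4; B: y ← 2, Fence, z ← 6; C: y ← 4, Fence, z → 6, Fence, x ← 8, x → ?), the two Loads of y in Thread A are left unordered with respect to each other, all locations start at 0, and the operation names are hypothetical labels.

```python
# Memory operations: name -> (kind, address, value stored).
OPS = {
    "A:x<-1": ("store", "x", 1),
    "A:y->2": ("load", "y", None),
    "A:y->4": ("load", "y", None),
    "B:y<-2": ("store", "y", 2),
    "B:z<-6": ("store", "z", 6),
    "C:y<-4": ("store", "y", 4),
    "C:z->6": ("load", "z", None),
    "C:x<-8": ("store", "x", 8),
    "C:x->?": ("load", "x", None),
}

# Local ordering constraints: fences, plus same-address order in Thread C.
LOCAL = [
    ("A:x<-1", "A:y->2"), ("A:x<-1", "A:y->4"),
    ("B:y<-2", "B:z<-6"),
    ("C:y<-4", "C:z->6"), ("C:z->6", "C:x<-8"), ("C:x<-8", "C:x->?"),
]

# Values the slide says the Loads observe.
EXPECTED = {"A:y->2": 2, "A:y->4": 4, "C:z->6": 6}


def serializations(placed, remaining):
    """Yield every total order of the operations that respects LOCAL."""
    if not remaining:
        yield placed
        return
    for op in sorted(remaining):
        if all(a in placed for a, b in LOCAL if b == op):
            yield from serializations(placed + [op], remaining - {op})


def run(order):
    """Execute one serialization on an atomic memory; return what each Load saw."""
    memory = {"x": 0, "y": 0, "z": 0}  # assumed initial values
    seen = {}
    for name in order:
        kind, addr, value = OPS[name]
        if kind == "store":
            memory[addr] = value
        else:
            seen[name] = memory[addr]
    return seen


all_x, matching_x = set(), set()
for order in serializations([], frozenset(OPS)):
    seen = run(order)
    all_x.add(seen["C:x->?"])
    if all(seen[load] == value for load, value in EXPECTED.items()):
        matching_x.add(seen["C:x->?"])

print(sorted(all_x))       # values x -> ? can take if the other Loads are unconstrained
print(sorted(matching_x))  # values x -> ? can take given y -> 2, y -> 4, z -> 6
```

Under these assumptions the unconstrained set contains both 1 and 8, while the filtered set contains only 8: once both Stores to y have been observed, x ← 1 is forced before x ← 8.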
Sequential Consistency
● Programmer's gold standard
● Question: How can we have the clarity of SC without sacrificing performance?
Improving the Performance of SC
Key Idea: Rather than turning the switch at individual memory access boundaries, do it only at chunk boundaries.
This is the topic of:
“BulkSC: Bulk Enforcement of Sequential Consistency”
Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas
“Mechanisms for Store-wait-free Multiprocessors”
Thomas Wenisch, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos
Coarse Grain Enforcement of SC
● Similar to tasks in TLS and transactions in TM
● But chunks are created dynamically by hardware; tasks and transactions are specified statically in code
Common Ground
• Dynamically divide the program into ‘chunks’ or ‘atomic sequences’.
• ASO begins an atomic sequence when an ordering constraint would stall instruction retirement.
• BulkSC assumes chunks are around 1000 instructions.
• Re-ordering allowed within chunks/atomic sequences.
• Updates not visible until the commit.
• Evaluated on a full system simulator (Simics/Flexus).
BulkSC: Bulk Enforcement of Sequential Consistency
Computes minimum serialization requirement.
Enables BulkSC on machines without broadcast capabilities.
Chunk executes, updates L1
Commit made, R and W signatures broadcast
Bulk Disambiguator computes intersection
- Restart computation if non-empty
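The signature intersection step above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the BulkSC hardware design: the 64-bit signature width, the 64-byte cache line, the single hash function, and the names `Chunk` and `must_restart` are hypothetical. The check used here, the committing chunk's W signature against a running chunk's R and W signatures, is one plausible reading of "computes intersection".

```python
SIG_BITS = 64    # hypothetical signature width
LINE_SHIFT = 6   # assumes 64-byte cache lines


def sig_bit(addr):
    """Map a byte address to the signature bit of its cache line."""
    return 1 << ((addr >> LINE_SHIFT) % SIG_BITS)


class Chunk:
    """Tracks the read (R) and write (W) signatures of one executing chunk."""

    def __init__(self, name):
        self.name = name
        self.R = 0
        self.W = 0

    def read(self, addr):
        self.R |= sig_bit(addr)

    def write(self, addr):
        self.W |= sig_bit(addr)


def must_restart(committing, running):
    """True if the committing chunk's writes may overlap the running chunk."""
    return (committing.W & (running.R | running.W)) != 0


committer = Chunk("committer")
committer.write(0x100)

conflicting = Chunk("conflicting")
conflicting.read(0x100)    # same cache line as the committed write

independent = Chunk("independent")
independent.read(0x140)    # a different line that maps to a different bit

aliased = Chunk("aliased")
aliased.read(0x1100)       # a different line that maps to the same bit

print(must_restart(committer, conflicting))  # real conflict
print(must_restart(committer, independent))  # empty intersection
print(must_restart(committer, aliased))      # false positive from aliasing
```

Because a signature is a lossy summary of an address set, a non-empty intersection can be a false positive, as in the aliased case. That costs an unnecessary restart but never misses a real conflict.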
Atomic Sequence Ordering
• Scalable Store Buffer
  • Eliminates store buffer capacity-related stalls.
  • No associative lookup required.
• ASO Implementation
  • Eliminates ordering-related stalls.
  • Atomic sequence tracking.
  • Detecting atomicity violations.
  • Rollback on violation.
  • Commit atomic sequences.
Performance Results
Bulk SC
ASO
More realistic
workloads
Open Research Questions in Memory Consistency
● Memory model framework was descriptive. What are the prescriptive consequences?
● Can the “big-step” semantics of transactions be explained with the “small-step” framework?
● Can the same hardware in a single system be used for all of coarse-grain SC, TLS, and TM?
● ...
Thank you!
Extra Slides
Thread A    Thread B
x ← 1       y ← 3
Fence       Fence
y ← 2       x ← 4
y → 3       x → ?
Local ordering constraints
Observation constraint: y ← 3 is ordered before y → 3.
Question: We need one more edge to capture the ordering. Where should it go?
Store atomicity constraint: y ← 2 is ordered before y ← 3.
Moral: When a store is observed to have been overwritten, the stores must be ordered.
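The moral can be traced on the two-thread example with a short Python sketch. Assumptions: the thread assignment is read column-wise from the figure (A: x ← 1, Fence, y ← 2, y → 3; B: y ← 3, Fence, x ← 4, x → ?), and the operation labels and helper names are hypothetical. Only rule 1 is applied, in a single pass, which is enough for this example.

```python
# Operations of the two-thread example: name -> (kind, address).
OPS = {
    "A:x<-1": ("store", "x"), "A:y<-2": ("store", "y"), "A:y->3": ("load", "y"),
    "B:y<-3": ("store", "y"), "B:x<-4": ("store", "x"), "B:x->?": ("load", "x"),
}

# Local ordering constraints (fences and same-address program order).
LOCAL = {
    ("A:x<-1", "A:y<-2"), ("A:y<-2", "A:y->3"),
    ("B:y<-3", "B:x<-4"), ("B:x<-4", "B:x->?"),
}

# Observation: the Load y -> 3 in Thread A reads from the Store y <- 3 in Thread B.
SOURCE = {"A:y->3": "B:y<-3"}


def ordered_before(edges, a, b):
    """True if b is reachable from a by following ordering edges."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dst for src, dst in edges if src == node)
    return False


def add_rule1_edges(edges):
    """Add observation edges, then order predecessor Stores before each Load's source."""
    edges = set(edges) | {(store, load) for load, store in SOURCE.items()}
    for load, src in SOURCE.items():
        for name, (kind, addr) in OPS.items():
            other_store = kind == "store" and addr == OPS[load][1] and name != src
            if other_store and ordered_before(edges, name, load):
                edges.add((name, src))  # the overwritten Store precedes the source
    return edges


before = ordered_before(LOCAL, "A:x<-1", "B:x<-4")
after = ordered_before(add_rule1_edges(LOCAL), "A:x<-1", "B:x<-4")
print(before, after)
```

Before the rule is applied, x ← 1 and x ← 4 are unordered. Afterwards x ← 1 is ordered before x ← 4 through the new edge from y ← 2 to y ← 3, so under these assumptions x → ? can only return 4.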