Parallel Architectures - Cavium University Program
Programming Multi-Core
Processors based Embedded
Systems
A Hands-On Experience on Cavium
Octeon based Platforms
Lecture 3 (Complexities of Parallelism)
Course Outline
Introduction
Multi-threading on multi-core processors
Multi-core applications and their complexities
Multi-core parallel applications
Complexities of multi-threading and parallelism
Application layer computing on multi-core
Performance measurement and tuning
Copyright © 2009
3-2
KICS, UET
Agenda for Today
Multi-core parallel applications space
Scientific/engineering applications
Commercial applications
Complexities due to parallelism
Threading related issues
Memory consistency and cache coherence
Synchronization
Parallel Applications
Science/engineering applications, general-purpose applications, and desktop applications
David E. Culler and Jaswinder Pal Singh, Parallel
Computer Architecture: A Hardware/Software
Approach, Morgan Kaufmann, 1998
Parallel Application Trends
There is an ever-increasing demand for high
performance computing in a number of application
areas
Scientific and engineering applications:
Computational fluid dynamics
Weather modeling
Number of applications from physics, chemistry, biology, etc.
General-purpose computing applications
Video encoding/decoding, graphics, games
Database management
Networking applications
Application Trends (2)
Demand for cycles fuels advances in hardware, and vice versa
Cycle drives exponential increase in microprocessor performance
Drives parallel architecture harder: most demanding applications
Range of performance demands
Need range of system performance with progressively increasing cost
Platform pyramid
Goal of applications in using multi-core machines: Speedup
Speedup (p cores) = Performance (p cores) / Performance (1 core)
For a fixed problem size (input data set), performance = 1/time, so:
Speedup fixed problem (p cores) = Time (1 core) / Time (p cores)
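The fixed-problem speedup formula above can be sketched directly; the timings below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical timings for a fixed-size problem (seconds); illustrative only.
def speedup(time_1core, time_pcores):
    """Fixed-problem speedup: Time(1 core) / Time(p cores)."""
    return time_1core / time_pcores

print(speedup(100.0, 25.0))  # 4 cores with perfect scaling -> 4.0
```

In practice the measured speedup falls below the core count because of serial sections and communication overhead.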
Scientific Computing Demand
Engineering Application Demands
Large parallel machines a mainstay in many
industries
Petroleum (reservoir analysis)
Automotive (crash simulation, drag analysis, combustion
efficiency),
Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
Computer-aided design
Pharmaceuticals (molecular modeling)
Visualization
in all of the above
entertainment (films like Toy Story)
architecture (walk-throughs and rendering)
Application Trends Example: ASCI
Accelerated Strategic Computing Initiative
(ASCI) is a US DoE program that proposes
the use of high performance computing for 3D modeling and simulation
Promised to provide 5 orders of magnitude greater computing power in 8 years (1996 to 2004) than the state of the art (1 GFLOPS to 100 TFLOPS)
Application Trends Example (2)
Platforms
ASCI Red: 3.1 TOPS peak performance; developed by Intel with 4,510 nodes
ASCI Blue Mountain: 3 TOPS peak performance; developed by SGI with 48 128-node Origin2000s
ASCI White: 12 TOPS peak performance; developed by IBM as a cluster of SMPs
Commercial Applications
Databases, online-transaction processing,
decision support, data mining
Also relies on parallelism for high end
Scale not so large, but use much more widespread
High performance means performing more work
(transactions) in a fixed time
Commercial Applications (2)
TPC benchmarks (TPC-C order entry, TPC-D
decision support)
Explicit scaling criteria provided
Size of enterprise scales with size of system
Problem size no longer fixed as p increases, so
throughput is used as a performance measure
(transactions per minute or tpm)
Desktop applications
Video applications
Secure computing and web services
Parallel Applications Landscape
HPCC
(Science/
engineering)
Data Center Appls.
(Search, e-commerce,
Enterprise, SOA)
Desktop Applications
(WWW browser, office,
multimedia applications)
Embedded Applications
(Wireless and mobile devices, PDAs,
consumer electronics)
Summary of Application Trends
Transition to parallel computing has occurred for
scientific and engineering computing
Rapid progress is underway in commercial computing
Desktop also uses multithreaded programs, which are
a lot like parallel programs
Demand for improving throughput on sequential
workloads
Greatest use of small-scale multiprocessors
Currently employ multi-core processors
Solid application demand exists and will increase
Solutions to Common
Parallel Programming
Problems using
Multiple Threads
Chapter 7
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
Common Problems
Too many threads
Data races, deadlocks, and livelocks
Heavily contended locks
Non-blocking algorithms
Thread-safe functions and libraries
Memory issues
Cache-related issues
Pipeline stalls
Data organization
Too Many Threads
If a little threading is good, wouldn't more be better?
Not always true
Excessive threading can degrade performance
Two types of impacts of excessive threads
Too little work per thread
Overhead of starting and maintaining dominates
Fine granularity of work hides any performance benefits
Excessive contention for hardware resources
OS uses time-slicing for fair scheduling
May result in excessive context switching overhead
Thrashing at virtual memory level
Data Races, Deadlocks, and Livelocks
Race condition
Due to unsynchronized accesses to shared data
Program results are non-deterministic
Can be handled through locking
Deadlock
Depend on relative timings of threads
A problem due to incorrect locking
Results due to cyclic dependence that stops forward
progress by threads
Livelock
Threads continuously conflict with each other and back off
No thread makes any progress
Solution: back off with release of acquired locks to allow at
least one thread to make progress
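A minimal sketch of handling a race through locking, as described above: four threads increment a shared counter, and the lock makes each read-modify-write atomic. Without the lock, updates can be lost nondeterministically. The names here are illustrative, not from the slides:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:          # without this lock, the read-modify-write
            counter += 1    # on `counter` races between threads

threads = [threading.Thread(target=add, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000 with the lock; unpredictable without it
```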
Races among Unsynchronized Threads
Race Conditions Hiding Behind Language Syntax
A Higher-Level Race Condition Example
Race conditions are possible even with synchronization
If synchronization is applied at too low a level
the higher level may still have data races
Example
Each key should occur only once in the list
Individual list operations have locks
Problem: two threads may simultaneously find that a key does not exist and then insert the same key, one after the other
Solution: lock at the level of the whole list operation as well, to protect against key duplication
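The check-then-insert fix can be sketched as follows. The container and function names are hypothetical; the point is that the lock is held across both the existence check and the insert, so no second thread can slip in between them:

```python
import threading

items = {}                 # hypothetical shared key -> value store
list_lock = threading.Lock()

def insert_if_absent(key, value):
    # Holding the lock across the check AND the insert closes the
    # higher-level race: no other thread can insert between them.
    with list_lock:
        if key not in items:
            items[key] = value
            return True
        return False

insert_if_absent("a", 1)
insert_if_absent("a", 2)   # duplicate key: rejected
print(items)  # {'a': 1}
```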
Deadlock Caused by Cycle
Conditions for a Deadlock
Deadlock can occur only if the following four
conditions are true:
Access to each resource is exclusive;
A thread is allowed to hold one resource while requesting another;
No thread is willing to relinquish a resource
that it has acquired; and
There is a cycle of threads trying to acquire
resources, where each resource is held by
one thread and requested by another
Locks Ordered by their Addresses
Consistent ordering of lock acquisition
Prevents deadlock
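A minimal sketch of the idea: every thread acquires its two locks in a fixed global order (here Python's `id()` stands in for the lock's address), so a cycle of waiting threads can never form. The helper names are illustrative:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def acquire_in_order(l1, l2):
    # Always acquire locks in a fixed global order (here: by id()),
    # regardless of the order the caller passed them in.
    first, second = sorted((l1, l2), key=id)
    first.acquire()
    second.acquire()
    return first, second

def release_pair(first, second):
    second.release()
    first.release()

f, s = acquire_in_order(lock_b, lock_a)  # caller order doesn't matter
release_pair(f, s)
print("no deadlock")
```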
Try and Backoff Logic
One reason for deadlocks: no thread willing to give
up a resource
Solution: thread gives up resource if it cannot acquire
another one
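The try-and-back-off logic can be sketched like this: take the first lock, *try* the second, and on failure release the first and retry after a short randomized pause, so at least one thread makes progress. The function name and retry bound are assumptions for illustration:

```python
import threading, random, time

def acquire_with_backoff(l1, l2, max_tries=10):
    # Take l1, then try l2 without blocking; on failure give l1 back
    # and pause a random interval before retrying. Releasing l1
    # avoids deadlock; the randomized pause makes livelock unlikely.
    for _ in range(max_tries):
        l1.acquire()
        if l2.acquire(blocking=False):
            return True
        l1.release()
        time.sleep(random.uniform(0, 0.01))
    return False

a, b = threading.Lock(), threading.Lock()
ok = acquire_with_backoff(a, b)
print(ok)  # True: both locks were free
b.release(); a.release()
```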
Heavily Contended Locks
Locks ensure correctness
By preventing race conditions
By preventing deadlocks
Performance impact
When locks become heavily contested among
threads
Threads try to acquire the lock at a rate faster
than the rate at which a thread can execute the
corresponding critical section
If the thread holding the lock is descheduled, all waiting threads are held up
Priority Inversion Scenario
Solution: Spreading out Contention
Hash Table with Fine-Grained Locking
Mutexes protecting each bucket
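A sketch of the per-bucket mutex scheme, assuming a simple chained hash table (class and method names are illustrative): threads operating on different buckets take different locks and so never contend with each other.

```python
import threading

class StripedHashTable:
    # Fine-grained locking: one mutex per bucket, so threads
    # touching different buckets never contend.
    def __init__(self, nbuckets=16):
        self.buckets = [{} for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:
            self.buckets[i][key] = value

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:
            return self.buckets[i].get(key)

table = StripedHashTable()
table.put("x", 42)
print(table.get("x"))  # 42
```

Operations that span the whole table (such as resizing) still need a coarser protocol; that is the usual cost of striping.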
Non-Blocking Algorithms
How about not using locks at all!
To resolve the locking problems
Such algorithms are called non-blocking
Stopping one thread does not prevent rest of the system
from making progress
Non-blocking guarantees:
Obstruction freedom: a thread makes progress as long as there is no contention; livelock is possible, so exponential backoff is used to avoid it
Lock freedom: the system as a whole makes progress
Wait freedom: every thread makes progress even when faced with contention; practically difficult to achieve
Thread-Safe Functions
A function is thread-safe if it behaves correctly when called concurrently, even on different objects
Implementer should ensure thread safety of any hidden shared state
Memory Issues
Speed disparity
Processing is fast
Memory access is slow
Multiple cores can exacerbate the problem
Specific memory issues
Bandwidth
Working in the cache
Memory contention
Memory consistency
Bandwidth
Working in the Cache
Memory Contention
Types of memory accesses
Between a core and main memory
Between two cores
Two types of data dependences between cores:
Read-write dependency: a core writes a cache line and then a different core reads it
Write-write dependency: a core writes a cache line and then a different core writes it
Interactions among cores
Consume bandwidth
Are avoided when multiple cores only read from cache lines
Can be avoided by minimizing the shared locations
False Sharing
Cache blocks may also introduce artifacts
Two distinct variables in the same cache block
Example problem: an array of ints, one written frequently by each processor (many ints per cache line)
Technique: allocate data used by each processor contiguously, or at least avoid interleaving in memory
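The layout arithmetic behind false sharing can be sketched as follows. The 64-byte line and 4-byte int sizes are assumptions (typical, but platform-dependent); the point is that adjacent ints share a line, while counters padded to the line size do not:

```python
# Assumed sizes: 64-byte cache lines, 4-byte ints (platform-dependent).
LINE = 64
INT = 4

def same_line(i, j, stride_bytes):
    # Two array elements falsely share iff they map to the same line.
    return (i * stride_bytes) // LINE == (j * stride_bytes) // LINE

print(same_line(0, 1, INT))    # True: adjacent ints share one line
print(same_line(0, 1, LINE))   # False: line-padded counters do not
```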
Performance Impact of False Sharing
What is Memory Consistency?
Itanium Architecture
Shared Memory without a Lock
Memory Consistency and Cache Coherence
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
(Advanced Topics—can be skipped)
Memory Consistency for Multi-Core Architectures
Memory consistency issue
Programs are written for a conceptual sequential machine
with memory
Programs for parallel architectures:
Written for multiple concurrent instruction streams
Memory accesses may occur in any order
May result in incorrect computation
This is a well-known problem
Traditional parallel architectures deal with it
Multi-core architectures inherit this complexity
Presented in this section for the sake of completeness
More relevant for HPCC applications
Not as complex for multi-threading, where thread-level solutions exist
Memory Consistency
Consistency requirement:
writes to a location become visible to all in
the same order
But when does a write become visible?
How to establish orders between a write and a read by different processes?
Typically use event synchronization
By using more than one location
Memory Consistency (2)

/* Assume initial value of A and flag is 0 */
P1: A = 1;               P2: while (flag == 0); /* spin idly */
    flag = 1;                print A;

Sometimes expect memory to respect order between accesses to different locations issued by a given processor
...and to preserve orders among accesses to the same location by different processes
Coherence doesn't help: it pertains only to a single location
An Example of Orders

/* Assume initial values of A and B are 0 */
P1: (1a) A = 1;          P2: (2a) print B;
    (1b) B = 2;              (2b) print A;

We need an ordering model for clear semantics
across different locations as well
so programmers can reason about what results are possible
This is the memory consistency model
Memory Consistency Model
Specifies constraints on the order in which
memory operations (from any process) can
appear to execute with respect to one
another
What orders are preserved?
Given a load, constrains the possible values
returned by it
Without it, can’t tell much about an SAS
program’s execution
Memory Consistency Model (2)
Implications for both programmer and system
designer
Programmer uses to reason about correctness and
possible results
System designer can use to constrain how much
accesses can be reordered by compiler or
hardware
Contract between programmer and system
Sequential Consistency

[Figure: processors P1, P2, ..., Pn issue memory references in program order to a single memory; a conceptual "switch" selecting whose reference reaches memory next is randomly set after each memory reference, as if there were no caches and a single memory]
Sequential Consistency (2)
Total order achieved by interleaving accesses from
different processes
Maintains program order, and memory operations, from all
processes, appear to [issue, execute, complete] atomically
w.r.t. others
Programmer’s intuition is maintained
“A multiprocessor is sequentially consistent if the
result of any execution is the same as if the
operations of all the processors were executed in
some sequential order, and the operations of each
individual processor appear in this sequence in the
order specified by its program.” [Lamport, 1979]
What Really is Program Order?
Intuitively, order in which operations appear in
source code
Straightforward translation of source code to assembly
At most one memory operation per instruction
But not the same as order presented to hardware by
compiler
So which is program order?
Depends on which layer, and who’s doing the
reasoning
We assume order as seen by programmer
Sequential Consistency: Example

/* Assume initial values of A and B are 0 */
P1: (1a) A = 1;          P2: (2a) print B;
    (1b) B = 2;              (2b) print A;

possible outcomes for (A,B): (0,0), (1,0), (1,2)
impossible under SC: (0,2)
we know 1a→1b and 2a→2b by program order
A = 0 implies 2b→1a, which implies 2a→1b
B = 2 implies 1b→2a, which leads to a contradiction
BUT, actual execution 1b→1a→2b→2a is SC, despite not being in program order
appears just like 1a→1b→2a→2b as visible from results
actual execution 1b→2a→2b→1a is not SC
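The outcome set above can be checked mechanically by enumerating every interleaving that preserves each processor's program order, which is exactly what SC permits:

```python
import itertools

# Enumerate SC interleavings of P1: (1a) A=1; (1b) B=2
# and P2: (2a) print B; (2b) print A, collecting P2's (A, B) outputs.
outcomes = set()
for order in set(itertools.permutations(["1a", "1b", "2a", "2b"])):
    if order.index("1a") > order.index("1b"):   # keep P1 program order
        continue
    if order.index("2a") > order.index("2b"):   # keep P2 program order
        continue
    A = B = 0
    printed_A = printed_B = None
    for op in order:
        if op == "1a": A = 1
        elif op == "1b": B = 2
        elif op == "2a": printed_B = B
        elif op == "2b": printed_A = A
    outcomes.add((printed_A, printed_B))

print(sorted(outcomes))  # [(0, 0), (1, 0), (1, 2)] -- never (0, 2)
```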
Implementing SC
Two kinds of requirements:
Program order
memory operations issued by a process must appear to
become visible (to others and itself) in program order
Atomicity
in the overall total order, one memory operation should
appear to complete with respect to all processes before
the next one is issued
needed to guarantee that total order is consistent across
processes
tricky part is making writes atomic
Write Atomicity
Write Atomicity: Position in total order at
which a write appears to perform should be
the same for all processes
Nothing a process does after it has seen the new
value produced by a write W should be visible to
other processes until they too have seen W
In effect, extends write serialization to writes from
multiple processes
Write Atomicity (2)

Transitivity implies A should print as 1 under SC
Problem if P2 leaves loop, writes B, and P3 sees new B but old A (from its cache, say)

P1: A = 1;
P2: while (A == 0); B = 1;
P3: while (B == 0); print A;
Formal Definition of SC
Each process’s program order imposes partial
order on set of all operations
Interleaving of these partial orders defines a
total order on all operations
Many total orders may be SC (SC does not
define particular interleaving)
Formal Definition of SC (2)
SC Execution:
An execution of a program is SC if the results it
produces are the same as those produced by
some possible total order (interleaving)
SC System:
A system is SC if any possible execution on that
system is an SC execution
Sufficient Conditions for SC
Every process issues memory operations in program
order
After a write operation is issued, the issuing process
waits for the write to complete before issuing its next
operation
After a read operation is issued, the issuing process
waits for the read to complete, and for the write
whose value is being returned by the read to
complete, before issuing its next operation (provides
write atomicity)
Sufficient Conditions for SC (2)
Sufficient, not necessary, conditions
Clearly, compilers should not reorder for SC, but they
do!
Even if issued in order, hardware may violate for
better performance
Loop transformations, register allocation (eliminates!)
Write buffers, out of order execution
Reason: uniprocessors care only about dependences
to same location
Makes the sufficient conditions very restrictive for
performance
Summary of SC Implementation

Assume for now that compiler does not reorder
Hardware needs mechanisms to:
Detect write completion (read completion is easy)
Ensure write atomicity
For all protocols and implementations, we will see
How they satisfy coherence, particularly write serialization
How they satisfy sufficient conditions for SC (write completion and write atomicity)
How they can ensure SC but not through sufficient conditions
Will see that a centralized bus interconnect makes it easier
Cache Coherence
CC for SMP architectures
One memory location in multiple caches
Not a problem for read accesses
Write accesses drive coherence requirements
No need to update memory immediately; computation can continue on the local processor
Memory needs to be updated eventually
Need to invalidate cache copies in other processors
Multiple ways to deal with updates
Update memory immediately: write-through caches
Update later: write-back caches
Cache Coherence (2)
CC is a well-known problem
For traditional SMP style multiprocessors
Inherited by multi-core processors
Multiple solutions
Can be resolved in software
However, traditionally resolved in hardware
Hardware supports CC protocols
A mechanism to detect cache coherence related events
Mechanisms to keep the caches coherent
Presented here for the sake of completeness
Programmer does not have to worry about it
However, a key consideration for a multi-core architecture
SC in Write-through
Provides SC, not just coherence
Extend arguments used for coherence
Writes and read misses to all locations serialized
by bus into bus order
If read obtains value of write W, W guaranteed to
have completed
since it caused a bus transaction
When write W is performed w.r.t. any processor,
all previous writes in bus order have completed
Design Space for Snooping Protocols

No need to change processor, main memory, cache…
Extend cache controller and exploit bus (provides serialization)
Focus on protocols for write-back caches
Dirty state now also indicates exclusive ownership
Exclusive: only cache with a valid copy
Owner: responsible for supplying block upon a request for it
Design space
Invalidation versus update-based protocols
Set of states
Invalidation-based Protocols
Exclusive means can modify without notifying anyone
else
i.e. without bus transaction
Must first get block in exclusive state before writing into it
Even if already in valid state, need transaction, so called a
write miss
Store to non-dirty data generates a read-exclusive
bus transaction
Invalidation-based Protocols (2)
The read-exclusive bus transaction (cont’d)
Tells others about impending write, obtains exclusive
ownership
makes the write visible, i.e. write is performed
may be actually observed (by a read miss) only later
write hit made visible (performed) when block updated in
writer’s cache
Only one RdX can succeed at a time for a block: serialized
by bus
Read and Read-exclusive bus transactions drive
coherence actions
Writeback transactions also, but not caused by memory
operation and quite incidental to coherence protocol
note: replaced block that is not in modified state can be
dropped
Update-based Protocols
A write operation updates values in other
caches
New, update bus transaction
Advantages
Other processors don’t miss on next access:
reduced latency
In invalidation protocols, they would miss and cause
more transactions
Single bus transaction to update several caches
can save bandwidth
Also, only the word written is transferred, not whole
block
Update-based Protocols (2)
Disadvantages
Multiple writes by same processor cause multiple
update transactions
In invalidation, first write gets exclusive ownership,
others local
Detailed tradeoffs more complex
Invalidate versus Update
Basic question of program behavior
Is a block written by one processor read by others before it
is rewritten?
Invalidation:
Yes => readers will take a miss
No => multiple writes without additional traffic, and clears out copies that won't be used again
Update:
Yes => readers will not miss if they had a copy previously; single bus transaction updates all copies
No => multiple useless updates, even to dead copies
Invalidation protocols much more popular
Some systems provide both, or even hybrid
Protocols
3-state writeback invalidation protocol
4-state writeback invalidation protocol
4-state writeback update protocol
Basic MSI Writeback Invalidation Protocol

States
Invalid (I)
Shared (S): one or more
Dirty or Modified (M): one only
Processor events
PrRd (read)
PrWr (write)
Bus transactions
BusRd: asks for copy with no intent to modify
BusRdX: asks for copy with intent to modify
BusWB: updates memory
Actions
Update state, perform bus transaction, flush value onto bus
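The MSI transitions for a single cache line can be sketched as a small state machine. This is a simplified model built from the states and events listed above (it omits replacements and the BusUpgr optimization); the table layout is an assumption for illustration:

```python
# Simplified MSI state machine for one cache line.
# Each entry maps (state, event) -> (new_state, bus_action_or_None).
TRANSITIONS = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),   # supply data, drop to Shared
    ("M", "BusRdX"): ("I", "Flush"),   # supply data, invalidate
}

def step(state, event):
    return TRANSITIONS.get((state, event), (state, None))

s, action = step("I", "PrWr")
print(s, action)          # M BusRdX: must gain exclusive ownership
s, action = step(s, "BusRd")
print(s, action)          # S Flush: another core's read demotes us
```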
State Transition Diagram

[Figure: MSI state transition diagram. In M: PrRd/— and PrWr/— loop; BusRd/Flush takes M to S; BusRdX/Flush takes M to I. In S: PrRd/— and BusRd/— loop; PrWr/BusRdX takes S to M; BusRdX/— takes S to I. In I: PrRd/BusRd takes I to S; PrWr/BusRdX takes I to M.]

Write to a shared block: already have the latest data, so can use an upgrade (BusUpgr) instead of BusRdX
Replacement changes the state of two blocks: outgoing and incoming
Satisfying Coherence
Write propagation is clear
Write serialization?
All writes that appear on the bus (BusRdX)
ordered by the bus
Write performed in writer’s cache before it handles other
transactions, so ordered in same way even w.r.t. writer
Reads that appear on the bus ordered wrt these
Satisfying Coherence (2)
Write serialization? (cont’d)
Write that don’t appear on the bus:
sequence of such writes between two bus trnsactions for
the block must come from same processor, say P
in serialization, the sequence appears between these two
bus transactions
reads by P will see them in this order w.r.t. other bus
transactions
reads by other processors separated from sequence by a
bus transaction, which places them in the serialized
order w.r.t the writes
so reads by all processors see writes in same order
Satisfying Sequential Consistency

Appeal to definition:
Bus imposes total order on bus xactions for all locations
Between transactions, processors perform reads/writes locally in program order
So any execution defines a natural partial order
Mj subsequent to Mi if (i) Mj follows Mi in program order on the same processor, or (ii) Mj generates a bus xaction that follows the memory operation for Mi
In a segment between two bus transactions, any interleaving of ops from different processors leads to a consistent total order
In such a segment, writes observed by processor P are serialized as follows:
Writes from other processors by the previous bus xaction P issued
Writes from P by program order
Satisfying Sequential Consistency (2)

Show sufficient conditions are satisfied
Write completion: can detect when a write appears on the bus
Write atomicity: if a read returns the value of a write, that write has already become visible to all others (can reason through the different cases)
Lower-level Protocol Choices
BusRd observed in M state: what transition to make?
Depends on expectations of access patterns
S: assumption that I'll read again soon, rather than another processor writing
good for mostly read data
what about “migratory” data
I read and write, then you read and write, then X reads and
writes...
better to go to I state, so I don’t have to be invalidated on your
write
Synapse transitioned to I state
Sequent Symmetry and MIT Alewife use adaptive protocols
Choices can affect performance of memory system
MESI (4-state) Invalidation Protocol
Problem with MSI protocol
Reading and modifying data is 2 bus xactions,
even if none sharing
e.g. even in sequential program
BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add exclusive state: write locally without
xaction, but not modified
Main memory is up to date, so cache not
necessarily owner
MESI (4-state) Invalidation Protocol (2)
Add exclusive state: (cont’d)
States
invalid
exclusive or exclusive-clean (only this cache has copy,
but not modified)
shared (two or more caches may have copies)
modified (dirty)
I → E on PrRd if no one else has a copy
needs a "shared" signal on the bus: wired-OR line asserted in response to BusRd
MESI State Transition Diagram

[Figure: MESI state transition diagram over states M, E, S, I. PrRd/BusRd with the shared line not asserted takes I to E; PrRd/BusRd(S) takes I to S; PrWr/BusRdX takes I or S to M; PrWr/— takes E to M; BusRd/Flush takes M to S; BusRdX/Flush takes M, E, and S to I.]

BusRd(S) means the shared line is asserted on the BusRd transaction
Flush': if cache-to-cache sharing (see next), only one cache flushes data
MOESI protocol adds an Owned state: exclusive, but memory not valid
Lower-level Protocol Choices
Who supplies data on a miss when not in M state: memory or cache?
Original (Illinois) MESI: the cache, since it was assumed faster than memory (cache-to-cache sharing)
Not true in modern systems
Intervening in another cache is more expensive than getting from memory
Lower-level Protocol Choices (2)
Cache-to-cache sharing also adds complexity
How does memory know it should supply data
(must wait for caches)
Selection algorithm if multiple caches have valid
data
But valuable for cache-coherent machines
with distributed memory
May be cheaper to obtain from nearby cache than
distant memory
Especially when constructed out of SMP nodes
(Stanford DASH)
Dragon Write-back Update Protocol
4 states
Exclusive-clean or exclusive (E): I and memory
have it
Shared clean (Sc): I, others, and maybe memory,
but I’m not owner
Shared modified (Sm): I and others but not
memory, and I’m the owner
Sm and Sc can coexist in different caches, with only one
Sm
Modified or dirty (M): I have it, and no one else does
Dragon Write-back Update Protocol (2)

No invalid state
If in cache, cannot be invalid
If not present in cache, can view as being in a not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
Introduced to specify actions when the block is not present in the cache
New bus transaction: BusUpd
Broadcasts the single word written on the bus; updates other relevant caches
Dragon State Transition Diagram

[Figure: Dragon state transition diagram over states E, Sc, Sm, M. PrRdMiss/BusRd(S) enters E or Sc depending on the shared line; PrWrMiss enters M via BusRd(S), or Sm via BusRd(S) followed by BusUpd; PrWr/BusUpd(S) moves among Sc, Sm, and M depending on the shared line; BusUpd/Update refreshes Sc and Sm copies; BusRd/Flush is serviced from Sm and M; PrRd/— and PrWr/— loop where no bus action is needed.]
Lower-level Protocol Choices
Can shared-modified state be eliminated?
If update memory as well on BusUpd transactions
(DEC Firefly)
Dragon protocol doesn’t (assumes DRAM memory
slow to update)
Should replacement of an Sc block be
broadcast?
Would allow last copy to go to E state and not
generate updates
Replacement bus transaction is not in critical path,
later update may be
Lower-level Protocol Choices (2)
Shouldn’t update local copy on write hit
before controller gets bus
Can mess up serialization
Coherence, consistency considerations much
like write-through case
In general, many subtle race conditions in
protocols
But first, let’s illustrate quantitative
assessment at logical level
Synchronization
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
(Advanced Topic—can be skipped)
Synchronization
Synchronization is a fundamental concept of
parallel computing
“A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.”
Types
Mutual Exclusion
Event synchronization
point-to-point
group
global (barriers)
Synchronization (2)
Synchronization is a well-known problem
Resolution requires hardware and software
In traditional parallel computing
Inherited by multi-core architectures
Processor instruction set needs to provide an atomic test-and-set instruction
System software uses it to provide synchronization
mechanisms
Presented here for the sake of completeness
To provide exposure to idea behind it
Multithreading software provides synchronization primitives
History and Perspectives
Much debate over hardware primitives over
the years
Conclusions depend on technology and
machine style
speed vs flexibility
Most modern methods use a form of atomic
read-modify-write
IBM 370: included atomic compare&swap for
multiprogramming
x86: any instruction can be prefixed with a lock
modifier
History and Perspectives (2)

Atomic read-modify-write (cont'd)
High-level language advocates want hardware locks/barriers
but it goes against the "RISC" flow
SPARC: atomic register-memory ops (swap, compare&swap)
MIPS, IBM Power: no atomic operations but a pair of instructions
load-locked, store-conditional
later used by PowerPC and DEC Alpha too
Rich set of tradeoffs
Components of a Synchronization Event

Acquire method
Acquire right to the synch (enter critical section, go past event)
Waiting algorithm
Wait for synch to become available when it isn't
Release method
Enable other processors to acquire right to the synch
Waiting algorithm is independent of type of synchronization
Waiting Algorithms
Blocking
Waiting processes are descheduled
High overhead
Allows processor to do other things
Busy-waiting
Waiting processes repeatedly test a location until
it changes value
Releasing process sets the location
Lower overhead, but consumes processor
resources
Can cause network traffic
Waiting Algorithms (2)
Busy-waiting better when
Scheduling overhead is larger than expected wait
time
Processor resources are not needed for other
tasks
Scheduler-based blocking is inappropriate (e.g. in
OS kernel)
Hybrid methods: busy-wait a while, then
block
Role of System and User
User wants to use high-level synchronization
operations
Locks, barriers...
Doesn’t care about implementation
System designer: how much hardware
support in implementation?
Speed versus cost and flexibility
Waiting algorithm difficult in hardware, so provide
support for others
Role of System and User (2)
Popular trend:
System provides simple hardware primitives
(atomic operations)
Software libraries implement lock, barrier
algorithms using these
But some propose and implement full-hardware
synchronization
Challenges
Same synchronization may have different needs at
different times
Lock accessed with low or high contention
Different performance requirements: low latency or high
throughput
Different algorithms best for each case, and need different
primitives
Multiprogramming can change synchronization
behavior and needs
Process scheduling and other resource interactions
May need more sophisticated algorithms, not so good in
dedicated case
Challenges (2)
Rich area of software-hardware interactions
Which primitives available affects what algorithms
can be used
Which algorithms are effective affects what
primitives to provide
Need to evaluate using workloads
Mutual Exclusion
Mutual exclusion = lock-unlock operation
Wide range of algorithms to implement these
operations
Role of contention for locks
Simple algorithms are fast when low contention for locks
Sophisticated algorithms deal with contention in a better
way but have higher cost
Types of locks
Hardware locks
Simple lock algorithms
Advanced lock algorithms
Hardware Locks
Separate lock lines on the bus: holder of a lock asserts the line
Locking algorithm: busy-wait with timeout
Lock registers (Cray XMP)
Set of registers shared among processors
Priority mechanism for multiple requestors
Inflexible, so not popular for general-purpose use
few locks can be in use at a time (one per lock line)
hardwired waiting algorithm
Primarily used to provide atomicity for higher-level software locks
First Attempt at Simple Software Lock
lock:    ld   register, location  /* copy location to register */
         cmp  register, #0        /* compare with 0 */
         bnz  lock                /* if not 0, try again */
         st   location, #1        /* store 1 to mark it locked */
         ret                      /* return control to caller */

unlock:  st   location, #0        /* write 0 to location */
         ret                      /* return control to caller */
First Attempt at Simple Software
Lock (2)
Problem: lock needs atomicity in its own
implementation
Read (test) and write (set) of lock
variable by a process not atomic
Solution: atomic read-modify-write or
exchange instructions
atomically test value of location and set it
to another value, return success or failure
somehow
Atomic Exchange Instruction
Specifies a location and register. In atomic
operation:
Value in location read into a register
Another value (function of value read or not)
stored into location
Many variants
Varying degrees of flexibility in second part
Atomic Exchange Instruction (2)
Simple example: test&set
Value in location read into a specified register
Constant 1 stored into location
Successful if value loaded into register is 0
Other constants could be used instead of 1 and 0
Can be used to build locks
Simple Test&Set Lock
lock:    t&s  register, location  /* atomically test and set */
         bnz  register, lock      /* if not 0, try again */
         ret                      /* return control to caller */

unlock:  st   location, #0        /* write 0 to location */
         ret                      /* return control to caller */
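The same loop maps directly onto C11’s `atomic_flag`, which compilers lower to the machine’s atomic exchange or test&set. A minimal sketch (the type and function names are ours, not from the slides):

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } tas_lock_t;   /* illustrative name */

void tas_init(tas_lock_t *l) { atomic_flag_clear(&l->locked); }

void tas_acquire(tas_lock_t *l) {
    /* atomic_flag_test_and_set is t&s: stores 1, returns the old value */
    while (atomic_flag_test_and_set(&l->locked))
        ;                                   /* if not 0, try again */
}

void tas_release(tas_lock_t *l) {
    atomic_flag_clear(&l->locked);          /* write 0 to location */
}
```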
Simple Test&Set Lock (2)
Other read-modify-write primitives can be
used too
Swap
Fetch&op
Compare&swap
Three operands: location, register to compare with,
register to swap with
Not commonly supported by RISC instruction sets
Can be cacheable or uncacheable (we
assume cacheable)
Simple Test&Set Lock (3)
Microbenchmark on SGI Challenge:
lock; delay(c); unlock;
Same total number of lock calls as p increases
Measure time per lock transfer
T&S Lock Microbenchmark Performance
[Figure: time per lock transfer (μs) vs. number of processors (1–15), for test&set (c = 0), test&set with exponential backoff (c = 3.64 μs), test&set with exponential backoff (c = 0), and ideal]
Performance degrades because unsuccessful test&sets generate traffic
Enhancements to Simple Lock
Algorithm
Reduce frequency of issuing test&sets while
waiting
Test&set lock with backoff
Don’t back off too much or will be backed off
when lock becomes free
Exponential backoff works quite well empirically:
ith delay = k * c^i
Enhancements to Simple Lock
Algorithm (2)
Busy-wait with read operations rather than
test&set
Test-and-test&set lock
Keep testing with ordinary load
cached lock variable will be invalidated when release
occurs
When value changes (to 0), try to obtain lock with
test&set
only one attemptor will succeed; others will fail and start
testing again
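The test-and-test&set idea above, combined with the exponential backoff from the previous slide, can be sketched in C11 atomics (a sketch; function names and the backoff bound are our assumptions):

```c
#include <stdatomic.h>
#include <sched.h>

/* Test-and-test&set lock with exponential backoff. */
void tts_acquire(atomic_int *lock) {
    unsigned delay = 1;
    for (;;) {
        while (atomic_load(lock) != 0)      /* test: ordinary load, spins in cache */
            ;
        /* looked free: one test&set attempt (exchange returns old value) */
        if (atomic_exchange(lock, 1) == 0)
            return;                         /* only one attemptor succeeds */
        for (unsigned i = 0; i < delay; i++)
            sched_yield();                  /* lost the race: back off */
        if (delay < 1024)
            delay *= 2;                     /* exponential backoff */
    }
}

void tts_release(atomic_int *lock) {
    atomic_store(lock, 0);                  /* invalidates cached copies */
}
```

The ordinary load keeps waiters spinning on their cached copy; only the release-time invalidation sends them out on the bus.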
Performance Criteria (T&S Lock)
Uncontended latency
Very low if repeatedly accessed by same processor; indept. of p
Traffic
Lots if many processors compete; poor scaling with p
Each t&s generates invalidations, and all rush out again to t&s
Storage
Very small (single variable); independent of p
Performance Criteria (2)
Fairness
Poor, can cause starvation
Test&set with backoff: similar, but less traffic
Test-and-test&set: slightly higher latency, much less traffic
But still all rush out to read miss and test&set on release
Traffic for p processors to access once each: O(p²)
Luckily, better hardware primitives as well as algorithms exist
Improved Hardware Primitives:
LL-SC
Goals:
Test with reads
Failed read-modify-write attempts don’t generate
invalidations
Nice if single primitive can implement range of r-m-w operations
Two instructions: Load-Locked (or -linked),
Store-Conditional
LL reads variable into register
Improved Hardware Primitives
(2)
Follow with arbitrary instructions to
manipulate its value
SC tries to store back to location if and only if
no one else has written to the variable since
this processor’s LL
If SC succeeds, means all three steps happened
atomically
If fails, doesn’t write or generate invalidations
(need to retry LL)
Success indicated by condition codes
Simple Lock with LL-SC
lock:    ll    reg1, location   /* LL location to reg1 */
         sc    location, reg2   /* SC reg2 into location */
         beqz  reg2, lock       /* if failed, start again */
         ret                    /* return control to caller */

unlock:  st    location, #0     /* write 0 to location */
         ret                    /* return control to caller */
Simple Lock with LL-SC (2)
Can do more fancy atomic ops by changing what’s between LL & SC
But keep it small so SC likely to succeed
Don’t include instructions that would need to be undone (e.g. stores)
SC can fail (without putting transaction on bus) if:
Detects intervening write even before trying to get bus
Tries to get bus but another processor’s SC gets bus first
LL, SC are not lock, unlock respectively
Only guarantee no conflicting write to lock variable between them
But can use directly to implement simple operations on shared variables
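Machines without LL-SC (x86, and C11’s portable atomics) expose compare&swap instead, but the same “load, compute, conditionally store” pattern applies: the CAS retry loop plays the role of a failed SC. A sketch of fetch&add built this way (the function name is ours):

```c
#include <stdatomic.h>

/* Fetch&add via a CAS retry loop, mirroring the LL/SC idiom:
   load ("LL"), compute, conditionally store ("SC"), retry on failure. */
int fetch_and_add(atomic_int *loc, int n) {
    int old = atomic_load(loc);                              /* "LL" */
    while (!atomic_compare_exchange_weak(loc, &old, old + n))
        ;   /* "SC" failed: old was refreshed by the CAS, retry */
    return old;
}
```

As with LL-SC, the work between the load and the conditional store should be small so the attempt is likely to succeed.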
More Efficient SW Locking
Algorithms
Problem with Simple LL-SC lock
No invals on failure, but read misses by all waiters
after both release and successful SC by winner
No test-and-test&set analog, but can use backoff
to reduce burstiness
Doesn’t reduce traffic to minimum, and not a fair
lock
More Efficient SW Locking (2)
Better SW algorithms for bus (for r-m-w
instructions or LL-SC)
Only one process to try to get lock upon release
valuable when using test&set instructions; LL-SC does it
already
Only one process to have read miss upon release
valuable with LL-SC too
Ticket lock achieves first
Array-based queueing lock achieves both
Both are fair (FIFO) locks as well
Ticket Lock
Only one r-m-w (from only one processor)
per acquire
Works like waiting line at a bank
Two counters per lock (next_ticket, now_serving)
Acquire: fetch&inc next_ticket; wait for
now_serving to equal it
atomic op when arrive at lock, not when it’s free (so less
contention)
Release: increment now_serving
FIFO order, low latency for low-contention if
fetch&inc cacheable
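The two-counter scheme above can be sketched with C11 atomics (`atomic_fetch_add` standing in for fetch&inc; struct and function names are ours):

```c
#include <stdatomic.h>

/* Ticket lock: one atomic op per acquire, FIFO hand-off on release. */
typedef struct {
    atomic_uint next_ticket;    /* fetch&inc on acquire */
    atomic_uint now_serving;    /* incremented on release */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    /* atomic op at arrival, not when the lock frees (so less contention) */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my)
        ;                       /* spin with ordinary loads */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);   /* serve the next ticket */
}
```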
Ticket Lock (2)
Works like waiting line at a bank (cont’d)
Still O(p) read misses at release, since all spin on same variable
like simple LL-SC lock, but no inval when SC succeeds, and fair
Can be difficult to find a good amount to delay on backoff
exponential backoff not a good idea due to FIFO order
backoff proportional to now_serving - next_ticket may work well
Wouldn’t it be nice to poll different locations...
Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p
Acquire: fetch&inc to obtain address on which to spin (next array element)
ensure that these addresses are in different cache lines or memories
Release: set next location in array, thus waking up process spinning on it
O(1) traffic per acquire with coherent caches
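A sketch of the array-based queueing lock in C11 atomics, with each spin flag padded toward its own cache line (the padding size, bound `MAXPROCS`, and all names are illustrative assumptions):

```c
#include <stdatomic.h>

#define MAXPROCS 16   /* p: illustrative bound on waiters */

typedef struct {
    struct { atomic_int can_go; char pad[60]; } slot[MAXPROCS]; /* one line each */
    atomic_uint next_slot;      /* fetch&inc picks a spin location */
} qlock_t;

void qlock_init(qlock_t *l) {
    for (int i = 0; i < MAXPROCS; i++)
        atomic_store(&l->slot[i].can_go, 0);
    atomic_store(&l->slot[0].can_go, 1);    /* first arrival proceeds */
    atomic_store(&l->next_slot, 0);
}

unsigned qlock_acquire(qlock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_slot, 1) % MAXPROCS;
    while (!atomic_load(&l->slot[my].can_go))
        ;                                   /* each waiter spins on its own slot */
    return my;                              /* caller hands this to release */
}

void qlock_release(qlock_t *l, unsigned my) {
    atomic_store(&l->slot[my].can_go, 0);                   /* reset my slot */
    atomic_store(&l->slot[(my + 1) % MAXPROCS].can_go, 1);  /* wake next in FIFO */
}
```

Release touches only the next waiter’s line, which is where the O(1) traffic per acquire comes from.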
Array-based Queuing Locks (2)
Waiting processes poll on different locations
in an array of size p (cont’d)
FIFO ordering, as in ticket lock
But, O(p) space per lock
Good performance for bus-based machines
Not so great for non-cache-coherent machines
with distributed memory
array location I spin on not necessarily in my local
memory
Lock Performance on SGI Challenge
Loop: lock; delay(c); unlock; delay(d);
[Figure: time per lock transfer (μs) vs. number of processors (1–15) for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks; three panels: (a) null (c = 0, d = 0), (b) critical-section (c = 3.64 μs, d = 0), (c) delay (c = 3.64 μs, d = 1.29 μs)]
Lock Performance on SGI Challenge
(2)
Simple LL-SC lock does best at small p due to
unfairness
Not so with delay between unlock and next lock
Need to be careful with backoff
Ticket lock with proportional backoff scales
well, as does array lock
Methodologically challenging, and need to
look at real workloads
Point to Point Event
Synchronization
Software methods:
Interrupts
Busy-waiting: use ordinary variables as flags
Blocking: use semaphores
Full hardware support: full-empty bit with
each word in memory
Set when word is “full” with newly produced data
(i.e. when written)
Unset when word is “empty” due to being
consumed (i.e. when read)
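The busy-waiting software method above (ordinary variables as flags) looks like this in C11, where release/acquire ordering makes the data visible before the flag; a sketch, with names and the single-word payload as our assumptions:

```c
#include <stdatomic.h>

int data;                 /* the "word" being produced/consumed */
atomic_int full = 0;      /* software analogue of a full-empty bit */

void produce(int value) {
    data = value;
    /* release: data is guaranteed visible before full is seen as 1 */
    atomic_store_explicit(&full, 1, memory_order_release);
}

int consume(void) {
    /* busy-wait on the flag with acquire loads */
    while (!atomic_load_explicit(&full, memory_order_acquire))
        ;
    int v = data;
    atomic_store_explicit(&full, 0, memory_order_relaxed);  /* mark empty */
    return v;
}
```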
Point to Point Event Synchronization
(2)
Full hardware support: (cont’d)
Natural for word-level producer-consumer
synchronization
producer: write if empty, set to full; consumer: read if
full; set to empty
Hardware preserves atomicity of bit manipulation
with read or write
Problem: flexibility
multiple consumers, or multiple writes before consumer
reads?
needs language support to specify when to use
composite data structures?
Barriers
Software algorithms implemented using locks,
flags, counters
Hardware barriers
Wired-AND line separate from address/data bus
Set input high when arrive, wait for output to be
high to leave
In practice, multiple wires to allow reuse
Useful when barriers are global and very frequent
Barriers (2)
Hardware barriers
Difficult to support arbitrary subset of processors
Difficult to dynamically change number and
identity of participants
even harder with multiple processes per processor
e.g. latter due to process migration
Not common today on bus-based machines
Let’s look at software algorithms with simple
hardware primitives
A Simple Centralized Barrier
Shared counter maintains number of processes that
have arrived
increment when arrive (lock), check until reaches numprocs
struct bar_type {int counter; struct lock_type lock; int flag;} bar_name;

BARRIER (bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)
    bar_name.flag = 0;                 /* reset flag if first to reach */
  mycount = ++bar_name.counter;        /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p) {                  /* last to arrive */
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = 1;                 /* release waiters */
  }
  else while (bar_name.flag == 0) {};  /* busy wait for release */
}
Problem?
A Working Centralized Barrier
Consecutively entering the same barrier
doesn’t work
Must prevent process from entering until all have
left previous instance
Could use another counter, but increases latency
and contention
Sense reversal: wait for flag to take different
value consecutive times
Toggle this value only when all processes reach
A Working Centralized Barrier
(2)
BARRIER (bar_name, p) {
  local_sense = !(local_sense);        /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = ++bar_name.counter;        /* mycount is private */
  if (bar_name.counter == p) {         /* last to arrive */
    UNLOCK(bar_name.lock);
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = local_sense;       /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {};  /* wait for flag to toggle */
  }
}
Centralized Barrier Performance
Latency
Want short critical path in barrier
Centralized has critical path length at least proportional to p
Traffic
Barriers likely to be highly contended, so want traffic to scale well
About 3p bus transactions in centralized
Storage cost
Very low: centralized counter and flag
Centralized Barrier Performance
(2)
Fairness
Same processor should not always be last to exit
barrier
No such bias in centralized
Key problems for centralized barrier are
latency and traffic
Especially with distributed memory, traffic goes to
same node
Improved Barrier Algorithms for
a Bus
Software combining tree
Only k processors access the same location, where k is degree of tree
Flat structure: contention at the single counter
Tree structure: little contention
Improved Barrier Algorithms for a
Bus (2)
Separate arrival and exit trees, and use sense
reversal
Valuable in distributed network: communicate
along different paths
On bus, all traffic goes on same bus, and no less
total traffic
Higher latency (log p steps of work, and O(p)
serialized bus xactions)
Advantage on bus is use of ordinary reads/writes
instead of locks
Barrier Performance on SGI Challenge
[Figure: barrier completion time (μs) vs. number of processors (1–8) for centralized, combining tree, tournament, and dissemination barriers]
Centralized does quite well
Synchronization Summary
Rich interaction of hardware-software tradeoffs
Must evaluate hardware primitives and software algorithms together
primitives determine which algorithms perform well
Evaluation methodology is challenging
use of delays, microbenchmarks
should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on bus
Key Takeaways for this Session
Multi-core processors are here
These are multiprocessor/MIMD systems
We need to understand parallel programming
System support for multi-core is available
Strengths, weaknesses, opportunities, and threats
No “free lunch” for performance improvement
OS: both Linux and Windows support them
Compilers/language support: gcc, C#, Java
Two types of development tracks
High performance computing
High throughput computing
Both have their unique challenges
Key Takeaways (2)
High performance computing
Most scientific/engineering applications
Available programming models: message-passing (MPI) or
shared-memory processing (OpenMP)
Challenge: performance scalability with cores and problem
size while dealing with data/function partitioning
High throughput computing
Most business applications
Available programming model: multi-threading (shared-memory processing)
Challenge: performance scalability while dealing with
deadlocks, locking, cache, and memory issues