Parallel Architectures - Cavium University Program
Programming Multi-Core
Processors based Embedded
Systems
A Hands-On Experience on Cavium
Octeon based Platforms
Lecture 3 (Complexities of Parallelism)
Course Outline
Introduction
Multi-threading on multi-core processors
Multi-core applications and their complexities
Multi-core parallel applications
Complexities of multi-threading and parallelism
Application layer computing on multi-core
Performance measurement and tuning
Copyright © 2009
3-2
KICS, UET
Agenda for Today
Multi-core parallel applications space
Scientific/engineering applications
Commercial applications
Complexities due to parallelism
Threading related issues
Memory consistency and cache coherence
Synchronization
Parallel Applications
Science/engineering applications, general-purpose applications, and desktop applications
David E. Culler and Jaswinder Pal Singh, Parallel
Computer Architecture: A Hardware/Software
Approach, Morgan Kaufmann, 1998
Parallel Application Trends
There is an ever-increasing demand for high
performance computing in a number of application
areas
Scientific and engineering applications:
Computational fluid dynamics
Weather modeling
Number of applications from physics, chemistry, biology, etc.
General-purpose computing applications
Video encoding/decoding, graphics, games
Database management
Networking applications
Application Trends (2)
Demand for cycles fuels advances in hardware, and vice versa
Cycle drives exponential increase in microprocessor performance
Drives parallel architecture harder: most demanding applications
Range of performance demands
Need range of system performance with progressively increasing cost
Platform pyramid
Goal of applications in using multi-core machines: Speedup
Speedup (p cores) = Performance (p cores) / Performance (1 core)
For a fixed problem size (input data set), performance = 1/time, so:
Speedup fixed problem (p cores) = Time (1 core) / Time (p cores)
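The fixed-problem speedup formula above can be sketched directly; the timings below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical timings for a fixed-size problem (seconds); illustrative only.
def speedup(time_1core, time_pcores):
    """Fixed-problem speedup: Time(1 core) / Time(p cores)."""
    return time_1core / time_pcores

print(speedup(100.0, 25.0))  # 4 cores with perfect scaling -> 4.0
```

In practice the measured speedup falls below the core count because of serial sections and communication overhead.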
Scientific Computing Demand
Engineering Application Demands
Large parallel machines a mainstay in many
industries
Petroleum (reservoir analysis)
Automotive (crash simulation, drag analysis, combustion
efficiency),
Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
Computer-aided design
Pharmaceuticals (molecular modeling)
Visualization
in all of the above
entertainment (films like Toy Story)
architecture (walk-throughs and rendering)
Application Trends Example: ASCI
Accelerated Strategic Computing Initiative
(ASCI) is a US DoE program that proposes
the use of high performance computing for 3D modeling and simulation
Promised to provide 5 orders of magnitude greater computing power in 8 years (1996 to 2004) than the state of the art (1 GFLOPS to 100 TFLOPS)
Application Trends Example (2)
Platforms
ASCI Red: 3.1 TOPS peak performance; developed by Intel with 4,510 nodes
ASCI Blue Mountain: 3 TOPS peak performance; developed by SGI with 48 128-node Origin2000s
ASCI White: 12 TOPS peak performance; developed by IBM as a cluster of SMPs
Commercial Applications
Databases, online-transaction processing,
decision support, data mining
Also relies on parallelism for high end
Scale not so large, but use much more widespread
High performance means performing more work
(transactions) in a fixed time
Commercial Applications (2)
TPC benchmarks (TPC-C order entry, TPC-D
decision support)
Explicit scaling criteria provided
Size of enterprise scales with size of system
Problem size no longer fixed as p increases, so
throughput is used as a performance measure
(transactions per minute or tpm)
Desktop applications
Video applications
Secure computing and web services
Parallel Applications Landscape
HPCC
(Science/
engineering)
Data Center Appls.
(Search, e-commerce,
Enterprise, SOA)
Desktop Applications
(WWW browser, office,
multimedia applications)
Embedded Applications
(Wireless and mobile devices, PDAs,
consumer electronics)
Summary of Application Trends
Transition to parallel computing has occurred for
scientific and engineering computing
Rapid progress is underway in commercial computing
Desktop also uses multithreaded programs, which are
a lot like parallel programs
Demand for improving throughput on sequential
workloads
Greatest use of small-scale multiprocessors
Currently employ multi-core processors
Solid application demand exists and will increase
Solutions to Common
Parallel Programming
Problems using
Multiple Threads
Chapter 7
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
Common Problems
Too many threads
Data races, deadlocks, and livelocks
Heavily contended locks
Non-blocking algorithms
Thread-safe functions and libraries
Memory issues
Cache-related issues
Pipeline stalls
Data organization
Too Many Threads
If a little threading is good, wouldn't more be better?
Not always true
Excessive threading can degrade performance
Two types of impacts of excessive threads
Too little work per thread
Overhead of starting and maintaining dominates
Fine granularity of work hides any performance benefits
Excessive contention for hardware resources
OS uses time-slicing for fair scheduling
May result in excessive context switching overhead
Thrashing at virtual memory level
Data Races, Deadlocks, and Livelocks
Race condition
Due to unsynchronized accesses to shared data
Program results are non-deterministic
Can be handled through locking
Deadlock
Depend on relative timings of threads
A problem due to incorrect locking
Results due to cyclic dependence that stops forward
progress by threads
Livelock
Threads continuously conflict with each other and back off
No thread makes any progress
Solution: back off with release of acquired locks to allow at
least one thread to make progress
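A minimal sketch of handling a race through locking, as described above: four threads increment a shared counter, and the lock makes each read-modify-write atomic. Without the lock, updates can be lost nondeterministically. The names here are illustrative, not from the slides:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:          # without this lock, the read-modify-write
            counter += 1    # on `counter` races between threads

threads = [threading.Thread(target=add, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000 with the lock; unpredictable without it
```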
Races among Unsynchronized Threads
Race Conditions Hiding Behind Language Syntax
A Higher-Level Race Condition Example
Race conditions are possible even with synchronization
If synchronization is applied at too low a level
the higher level may still have data races
Example
Each key should occur only once in the list
Individual list operations have locks
Problem: two threads may simultaneously find that a key does not exist and then insert the same key, one after the other
Solution: lock at the level of the whole list operation as well, to protect against key duplication
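The check-then-insert fix can be sketched as follows. The container and function names are hypothetical; the point is that the lock is held across both the existence check and the insert, so no second thread can slip in between them:

```python
import threading

items = {}                 # hypothetical shared key -> value store
list_lock = threading.Lock()

def insert_if_absent(key, value):
    # Holding the lock across the check AND the insert closes the
    # higher-level race: no other thread can insert between them.
    with list_lock:
        if key not in items:
            items[key] = value
            return True
        return False

insert_if_absent("a", 1)
insert_if_absent("a", 2)   # duplicate key: rejected
print(items)  # {'a': 1}
```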
Deadlock Caused by Cycle
Conditions for a Deadlock
Deadlock can occur only if the following four
conditions are true:
Access to each resource is exclusive;
A thread is allowed to hold one resource while requesting another;
No thread is willing to relinquish a resource
that it has acquired; and
There is a cycle of threads trying to acquire
resources, where each resource is held by
one thread and requested by another
Locks Ordered by their Addresses
Consistent ordering of lock acquisition
Prevents deadlock
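A minimal sketch of the idea: every thread acquires its two locks in a fixed global order (here Python's `id()` stands in for the lock's address), so a cycle of waiting threads can never form. The helper names are illustrative:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def acquire_in_order(l1, l2):
    # Always acquire locks in a fixed global order (here: by id()),
    # regardless of the order the caller passed them in.
    first, second = sorted((l1, l2), key=id)
    first.acquire()
    second.acquire()
    return first, second

def release_pair(first, second):
    second.release()
    first.release()

f, s = acquire_in_order(lock_b, lock_a)  # caller order doesn't matter
release_pair(f, s)
print("no deadlock")
```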
Try and Backoff Logic
One reason for deadlocks: no thread willing to give
up a resource
Solution: thread gives up resource if it cannot acquire
another one
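The try-and-back-off logic can be sketched like this: take the first lock, *try* the second, and on failure release the first and retry after a short randomized pause, so at least one thread makes progress. The function name and retry bound are assumptions for illustration:

```python
import threading, random, time

def acquire_with_backoff(l1, l2, max_tries=10):
    # Take l1, then try l2 without blocking; on failure give l1 back
    # and pause a random interval before retrying. Releasing l1
    # avoids deadlock; the randomized pause makes livelock unlikely.
    for _ in range(max_tries):
        l1.acquire()
        if l2.acquire(blocking=False):
            return True
        l1.release()
        time.sleep(random.uniform(0, 0.01))
    return False

a, b = threading.Lock(), threading.Lock()
ok = acquire_with_backoff(a, b)
print(ok)  # True: both locks were free
b.release(); a.release()
```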
Heavily Contended Locks
Locks ensure correctness
By preventing race conditions
By preventing deadlocks
Performance impact
When locks become heavily contested among
threads
Threads try to acquire the lock at a rate faster
than the rate at which a thread can execute the
corresponding critical section
If the thread holding the lock is descheduled, all waiting threads are held up
Priority Inversion Scenario
Solution: Spreading out Contention
Hash Table with Fine-Grained Locking
Mutexes protecting each bucket
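A sketch of the per-bucket mutex scheme, assuming a simple chained hash table (class and method names are illustrative): threads operating on different buckets take different locks and so never contend with each other.

```python
import threading

class StripedHashTable:
    # Fine-grained locking: one mutex per bucket, so threads
    # touching different buckets never contend.
    def __init__(self, nbuckets=16):
        self.buckets = [{} for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:
            self.buckets[i][key] = value

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:
            return self.buckets[i].get(key)

table = StripedHashTable()
table.put("x", 42)
print(table.get("x"))  # 42
```

Operations that span the whole table (such as resizing) still need a coarser protocol; that is the usual cost of striping.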
Non-Blocking Algorithms
How about not using locks at all!
To resolve the locking problems
Such algorithms are called non-blocking
Stopping one thread does not prevent rest of the system
from making progress
Non-blocking guarantees:
Obstruction freedom: a thread makes progress as long as there is no contention; livelock is possible, so exponential backoff is used to avoid it
Lock freedom: the system as a whole makes progress
Wait freedom: every thread makes progress even when faced with contention; practically difficult to achieve
Thread-Safe Functions
A function is thread-safe if it behaves correctly when called concurrently, even on different objects
Implementer should ensure thread safety of any hidden shared state
Memory Issues
Speed disparity
Processing is fast
Memory access is slow
Multiple cores can exacerbate the problem
Specific memory issues
Bandwidth
Working in the cache
Memory contention
Memory consistency
Bandwidth
Working in the Cache
Memory Contention
Types of memory accesses
Between a core and main memory
Between two cores
Two types of data dependences between cores:
Read-write dependency: a core writes a cache line and then a different core reads it
Write-write dependency: a core writes a cache line and then a different core writes it
Interactions among cores
Consume bandwidth
Are avoided when multiple cores only read from cache lines
Can be avoided by minimizing the shared locations
False Sharing
Cache blocks may also introduce artifacts
Two distinct variables in the same cache block
Example problem: an array of ints, one written frequently by each processor (many ints per cache line)
Technique: allocate data used by each processor contiguously, or at least avoid interleaving in memory
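The layout arithmetic behind false sharing can be sketched as follows. The 64-byte line and 4-byte int sizes are assumptions (typical, but platform-dependent); the point is that adjacent ints share a line, while counters padded to the line size do not:

```python
# Assumed sizes: 64-byte cache lines, 4-byte ints (platform-dependent).
LINE = 64
INT = 4

def same_line(i, j, stride_bytes):
    # Two array elements falsely share iff they map to the same line.
    return (i * stride_bytes) // LINE == (j * stride_bytes) // LINE

print(same_line(0, 1, INT))    # True: adjacent ints share one line
print(same_line(0, 1, LINE))   # False: line-padded counters do not
```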
Performance Impact of False Sharing
What is Memory Consistency?
Itanium Architecture
Shared Memory without a Lock
Memory Consistency and Cache Coherence
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
(Advanced Topics—can be skipped)
Memory Consistency for Multi-Core Architectures
Memory consistency issue
Programs are written for a conceptual sequential machine
with memory
Programs for parallel architectures:
Written for multiple concurrent instruction streams
Memory accesses may occur in any order
May result in incorrect computation
This is a well-known problem
Traditional parallel architectures deal with it
Multi-core architectures inherit this complexity
Presented in this section for the sake of completeness
More relevant for HPCC applications
Not as complex for multi-threading, where thread-level solutions exist
Memory Consistency
Consistency requirement:
writes to a location become visible to all in
the same order
But when does a write become visible?
How to establish orders between a write and a read by different processes?
Typically use event synchronization
By using more than one location
Memory Consistency (2)

/* Assume initial value of A and flag is 0 */
P1: A = 1;               P2: while (flag == 0); /* spin idly */
    flag = 1;                print A;

Sometimes expect memory to respect order between accesses to different locations issued by a given processor
...and to preserve orders among accesses to the same location by different processes
Coherence doesn't help: it pertains only to a single location
An Example of Orders

/* Assume initial values of A and B are 0 */
P1: (1a) A = 1;          P2: (2a) print B;
    (1b) B = 2;              (2b) print A;

We need an ordering model for clear semantics
across different locations as well
so programmers can reason about what results are possible
This is the memory consistency model
Memory Consistency Model
Specifies constraints on the order in which
memory operations (from any process) can
appear to execute with respect to one
another
What orders are preserved?
Given a load, constrains the possible values
returned by it
Without it, can’t tell much about an SAS
program’s execution
Memory Consistency Model (2)
Implications for both programmer and system
designer
Programmer uses to reason about correctness and
possible results
System designer can use to constrain how much
accesses can be reordered by compiler or
hardware
Contract between programmer and system
Sequential Consistency

[Figure: processors P1, P2, ..., Pn issue memory references in program order to a single memory; a conceptual "switch" selecting whose reference reaches memory next is randomly set after each memory reference, as if there were no caches and a single memory]
Sequential Consistency (2)
Total order achieved by interleaving accesses from
different processes
Maintains program order, and memory operations, from all
processes, appear to [issue, execute, complete] atomically
w.r.t. others
Programmer’s intuition is maintained
“A multiprocessor is sequentially consistent if the
result of any execution is the same as if the
operations of all the processors were executed in
some sequential order, and the operations of each
individual processor appear in this sequence in the
order specified by its program.” [Lamport, 1979]
What Really is Program Order?
Intuitively, order in which operations appear in
source code
Straightforward translation of source code to assembly
At most one memory operation per instruction
But not the same as order presented to hardware by
compiler
So which is program order?
Depends on which layer, and who’s doing the
reasoning
We assume order as seen by programmer
Sequential Consistency: Example

/* Assume initial values of A and B are 0 */
P1: (1a) A = 1;          P2: (2a) print B;
    (1b) B = 2;              (2b) print A;

possible outcomes for (A,B): (0,0), (1,0), (1,2)
impossible under SC: (0,2)
we know 1a→1b and 2a→2b by program order
A = 0 implies 2b→1a, which implies 2a→1b
B = 2 implies 1b→2a, which leads to a contradiction
BUT, actual execution 1b→1a→2b→2a is SC, despite not being in program order
appears just like 1a→1b→2a→2b as visible from results
actual execution 1b→2a→2b→1a is not SC
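The outcome set above can be checked mechanically by enumerating every interleaving that preserves each processor's program order, which is exactly what SC permits:

```python
import itertools

# Enumerate SC interleavings of P1: (1a) A=1; (1b) B=2
# and P2: (2a) print B; (2b) print A, collecting P2's (A, B) outputs.
outcomes = set()
for order in set(itertools.permutations(["1a", "1b", "2a", "2b"])):
    if order.index("1a") > order.index("1b"):   # keep P1 program order
        continue
    if order.index("2a") > order.index("2b"):   # keep P2 program order
        continue
    A = B = 0
    printed_A = printed_B = None
    for op in order:
        if op == "1a": A = 1
        elif op == "1b": B = 2
        elif op == "2a": printed_B = B
        elif op == "2b": printed_A = A
    outcomes.add((printed_A, printed_B))

print(sorted(outcomes))  # [(0, 0), (1, 0), (1, 2)] -- never (0, 2)
```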
Implementing SC
Two kinds of requirements:
Program order
memory operations issued by a process must appear to
become visible (to others and itself) in program order
Atomicity
in the overall total order, one memory operation should
appear to complete with respect to all processes before
the next one is issued
needed to guarantee that total order is consistent across
processes
tricky part is making writes atomic
Write Atomicity
Write Atomicity: Position in total order at
which a write appears to perform should be
the same for all processes
Nothing a process does after it has seen the new
value produced by a write W should be visible to
other processes until they too have seen W
In effect, extends write serialization to writes from
multiple processes
Write Atomicity (2)

Transitivity implies A should print as 1 under SC
Problem if P2 leaves loop, writes B, and P3 sees new B but old A (from its cache, say)

P1: A = 1;
P2: while (A == 0); B = 1;
P3: while (B == 0); print A;
Formal Definition of SC
Each process’s program order imposes partial
order on set of all operations
Interleaving of these partial orders defines a
total order on all operations
Many total orders may be SC (SC does not
define particular interleaving)
Formal Definition of SC (2)
SC Execution:
An execution of a program is SC if the results it
produces are the same as those produced by
some possible total order (interleaving)
SC System:
A system is SC if any possible execution on that
system is an SC execution
Sufficient Conditions for SC
Every process issues memory operations in program
order
After a write operation is issued, the issuing process
waits for the write to complete before issuing its next
operation
After a read operation is issued, the issuing process
waits for the read to complete, and for the write
whose value is being returned by the read to
complete, before issuing its next operation (provides
write atomicity)
Sufficient Conditions for SC (2)
Sufficient, not necessary, conditions
Clearly, compilers should not reorder for SC, but they
do!
Even if issued in order, hardware may violate for
better performance
Loop transformations, register allocation (eliminates!)
Write buffers, out of order execution
Reason: uniprocessors care only about dependences
to same location
Makes the sufficient conditions very restrictive for
performance
Summary of SC Implementation

Assume for now that compiler does not reorder
Hardware needs mechanisms to:
Detect write completion (read completion is easy)
Ensure write atomicity
For all protocols and implementations, we will see
How they satisfy coherence, particularly write serialization
How they satisfy sufficient conditions for SC (write completion and write atomicity)
How they can ensure SC but not through sufficient conditions
Will see that a centralized bus interconnect makes it easier
Cache Coherence
CC for SMP architectures
One memory location in multiple caches
Not a problem for read accesses
Write accesses drive coherence requirements
No need to update memory immediately; computation can continue on the local processor
Memory needs to be updated eventually
Need to invalidate cache copies in other processors
Multiple ways to deal with updates
Update memory immediately: write-through caches
Update later: write-back caches
Cache Coherence (2)
CC is a well-known problem
For traditional SMP style multiprocessors
Inherited by multi-core processors
Multiple solutions
Can be resolved in software
However, traditionally resolved in hardware
Hardware supports CC protocols
A mechanism to detect cache coherence related events
Mechanisms to keep the caches coherent
Presented here for the sake of completeness
Programmer does not have to worry about it
However, a key consideration for a multi-core architecture
SC in Write-through
Provides SC, not just coherence
Extend arguments used for coherence
Writes and read misses to all locations serialized
by bus into bus order
If read obtains value of write W, W guaranteed to
have completed
since it caused a bus transaction
When write W is performed w.r.t. any processor,
all previous writes in bus order have completed
Design Space for Snooping Protocols

No need to change processor, main memory, cache…
Extend cache controller and exploit bus (provides serialization)
Focus on protocols for write-back caches
Dirty state now also indicates exclusive ownership
Exclusive: only cache with a valid copy
Owner: responsible for supplying block upon a request for it
Design space
Invalidation versus update-based protocols
Set of states
Invalidation-based Protocols
Exclusive means can modify without notifying anyone
else
i.e. without bus transaction
Must first get block in exclusive state before writing into it
Even if already in valid state, need transaction, so called a
write miss
Store to non-dirty data generates a read-exclusive
bus transaction
Invalidation-based Protocols (2)
The read-exclusive bus transaction (cont’d)
Tells others about impending write, obtains exclusive
ownership
makes the write visible, i.e. write is performed
may be actually observed (by a read miss) only later
write hit made visible (performed) when block updated in
writer’s cache
Only one RdX can succeed at a time for a block: serialized
by bus
Read and Read-exclusive bus transactions drive
coherence actions
Writeback transactions also, but not caused by memory
operation and quite incidental to coherence protocol
note: replaced block that is not in modified state can be
dropped
Update-based Protocols
A write operation updates values in other
caches
New, update bus transaction
Advantages
Other processors don’t miss on next access:
reduced latency
In invalidation protocols, they would miss and cause
more transactions
Single bus transaction to update several caches
can save bandwidth
Also, only the word written is transferred, not whole
block
Update-based Protocols (2)
Disadvantages
Multiple writes by same processor cause multiple
update transactions
In invalidation, first write gets exclusive ownership,
others local
Detailed tradeoffs more complex
Invalidate versus Update
Basic question of program behavior
Is a block written by one processor read by others before it
is rewritten?
Invalidation:
Yes => readers will take a miss
No => multiple writes without additional traffic, and clears out copies that won't be used again
Update:
Yes => readers will not miss if they had a copy previously; single bus transaction updates all copies
No => multiple useless updates, even to dead copies
Invalidation protocols much more popular
Some systems provide both, or even hybrid
Protocols
3-state writeback invalidation protocol
4-state writeback invalidation protocol
4-state writeback update protocol
Basic MSI Writeback Invalidation Protocol

States
Invalid (I)
Shared (S): one or more
Dirty or Modified (M): one only
Processor events
PrRd (read)
PrWr (write)
Bus transactions
BusRd: asks for copy with no intent to modify
BusRdX: asks for copy with intent to modify
BusWB: updates memory
Actions
Update state, perform bus transaction, flush value onto bus
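The MSI transitions for a single cache line can be sketched as a small state machine. This is a simplified model built from the states and events listed above (it omits replacements and the BusUpgr optimization); the table layout is an assumption for illustration:

```python
# Simplified MSI state machine for one cache line.
# Each entry maps (state, event) -> (new_state, bus_action_or_None).
TRANSITIONS = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),   # supply data, drop to Shared
    ("M", "BusRdX"): ("I", "Flush"),   # supply data, invalidate
}

def step(state, event):
    return TRANSITIONS.get((state, event), (state, None))

s, action = step("I", "PrWr")
print(s, action)          # M BusRdX: must gain exclusive ownership
s, action = step(s, "BusRd")
print(s, action)          # S Flush: another core's read demotes us
```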
State Transition Diagram

[Figure: MSI state transition diagram. In M: PrRd/— and PrWr/— loop; BusRd/Flush takes M to S; BusRdX/Flush takes M to I. In S: PrRd/— and BusRd/— loop; PrWr/BusRdX takes S to M; BusRdX/— takes S to I. In I: PrRd/BusRd takes I to S; PrWr/BusRdX takes I to M.]

Write to a shared block: already have the latest data, so can use an upgrade (BusUpgr) instead of BusRdX
Replacement changes the state of two blocks: outgoing and incoming
Satisfying Coherence
Write propagation is clear
Write serialization?
All writes that appear on the bus (BusRdX)
ordered by the bus
Write performed in writer’s cache before it handles other
transactions, so ordered in same way even w.r.t. writer
Reads that appear on the bus ordered wrt these
Satisfying Coherence (2)
Write serialization? (cont’d)
Write that don’t appear on the bus:
sequence of such writes between two bus trnsactions for
the block must come from same processor, say P
in serialization, the sequence appears between these two
bus transactions
reads by P will see them in this order w.r.t. other bus
transactions
reads by other processors separated from sequence by a
bus transaction, which places them in the serialized
order w.r.t the writes
so reads by all processors see writes in same order
Satisfying Sequential Consistency

Appeal to definition:
Bus imposes total order on bus xactions for all locations
Between transactions, processors perform reads/writes locally in program order
So any execution defines a natural partial order
Mj subsequent to Mi if (i) Mj follows Mi in program order on the same processor, or (ii) Mj generates a bus xaction that follows the memory operation for Mi
In a segment between two bus transactions, any interleaving of ops from different processors leads to a consistent total order
In such a segment, writes observed by processor P are serialized as follows:
Writes from other processors by the previous bus xaction P issued
Writes from P by program order
Satisfying Sequential Consistency (2)

Show sufficient conditions are satisfied
Write completion: can detect when a write appears on the bus
Write atomicity: if a read returns the value of a write, that write has already become visible to all others (can reason through the different cases)
Lower-level Protocol Choices
BusRd observed in M state: what transition to make?
Depends on expectations of access patterns
S: assumption that I'll read again soon, rather than another processor writing
good for mostly read data
what about “migratory” data
I read and write, then you read and write, then X reads and
writes...
better to go to I state, so I don’t have to be invalidated on your
write
Synapse transitioned to I state
Sequent Symmetry and MIT Alewife use adaptive protocols
Choices can affect performance of memory system
MESI (4-state) Invalidation Protocol
Problem with MSI protocol
Reading and modifying data is 2 bus xactions,
even if none sharing
e.g. even in sequential program
BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add exclusive state: write locally without
xaction, but not modified
Main memory is up to date, so cache not
necessarily owner
MESI (4-state) Invalidation Protocol (2)
Add exclusive state: (cont’d)
States
invalid
exclusive or exclusive-clean (only this cache has copy,
but not modified)
shared (two or more caches may have copies)
modified (dirty)
I → E on PrRd if no one else has a copy
needs a "shared" signal on the bus: wired-OR line asserted in response to BusRd
MESI State Transition Diagram

[Figure: MESI state transition diagram over states M, E, S, I. PrRd/BusRd with the shared line not asserted takes I to E; PrRd/BusRd(S) takes I to S; PrWr/BusRdX takes I or S to M; PrWr/— takes E to M; BusRd/Flush takes M to S; BusRdX/Flush takes M, E, and S to I.]

BusRd(S) means the shared line is asserted on the BusRd transaction
Flush': if cache-to-cache sharing (see next), only one cache flushes data
MOESI protocol adds an Owned state: exclusive, but memory not valid
Lower-level Protocol Choices
Who supplies data on a miss when not in M state: memory or cache?
Original (Illinois) MESI: the cache, since it was assumed faster than memory (cache-to-cache sharing)
Not true in modern systems
Intervening in another cache is more expensive than getting from memory
Lower-level Protocol Choices (2)
Cache-to-cache sharing also adds complexity
How does memory know it should supply data
(must wait for caches)
Selection algorithm if multiple caches have valid
data
But valuable for cache-coherent machines
with distributed memory
May be cheaper to obtain from nearby cache than
distant memory
Especially when constructed out of SMP nodes
(Stanford DASH)
Dragon Write-back Update Protocol
4 states
Exclusive-clean or exclusive (E): I and memory
have it
Shared clean (Sc): I, others, and maybe memory,
but I’m not owner
Shared modified (Sm): I and others but not
memory, and I’m the owner
Sm and Sc can coexist in different caches, with only one
Sm
Modified or dirty (M): I have it, and no one else does
Dragon Write-back Update Protocol (2)

No invalid state
If in cache, cannot be invalid
If not present in cache, can view as being in a not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
Introduced to specify actions when the block is not present in the cache
New bus transaction: BusUpd
Broadcasts the single word written on the bus; updates other relevant caches
Dragon State Transition Diagram

[Figure: Dragon state transition diagram over states E, Sc, Sm, M. PrRdMiss/BusRd(S) enters E or Sc depending on the shared line; PrWrMiss enters M via BusRd(S), or Sm via BusRd(S) followed by BusUpd; PrWr/BusUpd(S) moves among Sc, Sm, and M depending on the shared line; BusUpd/Update refreshes Sc and Sm copies; BusRd/Flush is serviced from Sm and M; PrRd/— and PrWr/— loop where no bus action is needed.]
Lower-level Protocol Choices
Can shared-modified state be eliminated?
If update memory as well on BusUpd transactions
(DEC Firefly)
Dragon protocol doesn’t (assumes DRAM memory
slow to update)
Should replacement of an Sc block be
broadcast?
Would allow last copy to go to E state and not
generate updates
Replacement bus transaction is not in critical path,
later update may be
Lower-level Protocol Choices (2)
Shouldn’t update local copy on write hit
before controller gets bus
Can mess up serialization
Coherence, consistency considerations much
like write-through case
In general, many subtle race conditions in
protocols
But first, let’s illustrate quantitative
assessment at logical level
Synchronization
David E. Culler and Jaswinder Pal Singh,
Parallel Computer Architecture: A
Hardware/Software Approach, Morgan
Kaufmann, 1998
(Advanced Topic—can be skipped)
Synchronization
Synchronization is a fundamental concept of
parallel computing
“A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.”
Types
Mutual Exclusion
Event synchronization
point-to-point
group
global (barriers)
Synchronization (2)
Synchronization is a well-known problem
Resolution requires hardware and software
In traditional parallel computing
Inherited by multi-core architectures
Processor instruction set needs to provide an atomic test-and-set instruction
System software uses it to provide synchronization
mechanisms
Presented here for the sake of completeness
To provide exposure to idea behind it
Multithreading software provides synchronization primitives
History and Perspectives
Much debate over hardware primitives over
the years
Conclusions depend on technology and
machine style
speed vs flexibility
Most modern methods use a form of atomic
read-modify-write
IBM 370: included atomic compare&swap for
multiprogramming
x86: any instruction can be prefixed with a lock
modifier
History and Perspectives (2)

Atomic read-modify-write (cont'd)
High-level language advocates want hardware locks/barriers
but it goes against the "RISC" flow
SPARC: atomic register-memory ops (swap, compare&swap)
MIPS, IBM Power: no atomic operations but a pair of instructions
load-locked, store-conditional
later used by PowerPC and DEC Alpha too
Rich set of tradeoffs
Components of a Synchronization Event

Acquire method
Acquire right to the synch (enter critical section, go past event)
Waiting algorithm
Wait for synch to become available when it isn't
Release method
Enable other processors to acquire right to the synch
Waiting algorithm is independent of type of synchronization
Waiting Algorithms
Blocking
Waiting processes are descheduled
High overhead
Allows processor to do other things
Busy-waiting
Waiting processes repeatedly test a location until
it changes value
Releasing process sets the location
Lower overhead, but consumes processor
resources
Can cause network traffic
Waiting Algorithms (2)
Busy-waiting better when
Scheduling overhead is larger than expected wait
time
Processor resources are not needed for other
tasks
Scheduler-based blocking is inappropriate (e.g. in
OS kernel)
Hybrid methods: busy-wait a while, then
block
Role of System and User
User wants to use high-level synchronization
operations
Locks, barriers...
Doesn’t care about implementation
System designer: how much hardware
support in implementation?
Speed versus cost and flexibility
Waiting algorithm difficult in hardware, so provide
support for others
Role of System and User (2)
Popular trend:
System provides simple hardware primitives
(atomic operations)
Software libraries implement lock, barrier
algorithms using these
But some propose and implement full-hardware
synchronization
Challenges
Same synchronization may have different needs at
different times
Lock accessed with low or high contention
Different performance requirements: low latency or high
throughput
Different algorithms best for each case, and need different
primitives
Multiprogramming can change synchronization
behavior and needs
Process scheduling and other resource interactions
May need more sophisticated algorithms, not so good in
dedicated case
Challenges (2)
Rich area of software-hardware interactions
Which primitives available affects what algorithms
can be used
Which algorithms are effective affects what
primitives to provide
Need to evaluate using workloads
Mutual Exclusion
Mutual exclusion = lock-unlock operation
Wide range of algorithms to implement these
operations
Role of contention for locks
Simple algorithms are fast when low contention for locks
Sophisticated algorithms deal with contention in a better
way but have higher cost
Types of locks
Hardware locks
Simple lock algorithms
Advanced lock algorithms
Hardware Locks
Separate lock lines on the bus: holder of a lock asserts the line
Locking algorithm: busy-wait with timeout
Lock registers (Cray XMP)
Set of registers shared among processors
Priority mechanism for multiple requestors
Inflexible, so not popular for general-purpose use
few locks can be in use at a time (one per lock line)
hardwired waiting algorithm
Primarily used to provide atomicity for higher-level software locks
First Attempt at Simple Software Lock
lock:    ld   register, location  /* copy location to register */
         cmp  register, #0        /* compare with 0 */
         bnz  lock                /* if not 0, try again */
         st   location, #1        /* store 1 to mark it locked */
         ret                      /* return control to caller */

unlock:  st   location, #0        /* write 0 to location */
         ret                      /* return control to caller */
First Attempt at Simple Software
Lock (2)
Problem: lock needs atomicity in its own
implementation
Read (test) and write (set) of lock
variable by a process not atomic
Solution: atomic read-modify-write or
exchange instructions
atomically test value of location and set it
to another value, return success or failure
somehow
Atomic Exchange Instruction
Specifies a location and register. In atomic
operation:
Value in location read into a register
Another value (function of value read or not)
stored into location
Many variants
Varying degrees of flexibility in second part
Atomic Exchange Instruction (2)
Simple example: test&set
Value in location read into a specified register
Constant 1 stored into location
Successful if value loaded into register is 0
Other constants could be used instead of 1 and 0
Can be used to build locks
Simple Test&Set Lock
lock:    t&s  register, location  /* atomically test and set */
         bnz  register, lock      /* if not 0, try again */
         ret                      /* return control to caller */

unlock:  st   location, #0        /* write 0 to location */
         ret                      /* return control to caller */
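The same loop maps directly onto C11’s `atomic_flag`, which compilers lower to the machine’s atomic exchange or test&set. A minimal sketch (the type and function names are ours, not from the slides):

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } tas_lock_t;   /* illustrative name */

void tas_init(tas_lock_t *l) { atomic_flag_clear(&l->locked); }

void tas_acquire(tas_lock_t *l) {
    /* atomic_flag_test_and_set is t&s: stores 1, returns the old value */
    while (atomic_flag_test_and_set(&l->locked))
        ;                                   /* if not 0, try again */
}

void tas_release(tas_lock_t *l) {
    atomic_flag_clear(&l->locked);          /* write 0 to location */
}
```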
Simple Test&Set Lock (2)
Other read-modify-write primitives can be
used too
Swap
Fetch&op
Compare&swap
Three operands: location, register to compare with,
register to swap with
Not commonly supported by RISC instruction sets
Can be cacheable or uncacheable (we
assume cacheable)
Simple Test&Set Lock (3)
Microbenchmark on SGI Challenge:
lock; delay(c); unlock;
Same total number of lock calls as p increases
Measure time per lock transfer
T&S Lock Microbenchmark Performance
[Figure: time per lock transfer (μs) vs. number of processors (1–15), for test&set (c = 0), test&set with exponential backoff (c = 3.64 μs), test&set with exponential backoff (c = 0), and ideal]
Performance degrades because unsuccessful test&sets generate traffic
Enhancements to Simple Lock
Algorithm
Reduce frequency of issuing test&sets while
waiting
Test&set lock with backoff
Don’t back off too much or will be backed off
when lock becomes free
Exponential backoff works quite well empirically:
ith delay = k * c^i
Enhancements to Simple Lock
Algorithm (2)
Busy-wait with read operations rather than
test&set
Test-and-test&set lock
Keep testing with ordinary load
cached lock variable will be invalidated when release
occurs
When value changes (to 0), try to obtain lock with
test&set
only one attemptor will succeed; others will fail and start
testing again
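The test-and-test&set idea above, combined with the exponential backoff from the previous slide, can be sketched in C11 atomics (a sketch; function names and the backoff bound are our assumptions):

```c
#include <stdatomic.h>
#include <sched.h>

/* Test-and-test&set lock with exponential backoff. */
void tts_acquire(atomic_int *lock) {
    unsigned delay = 1;
    for (;;) {
        while (atomic_load(lock) != 0)      /* test: ordinary load, spins in cache */
            ;
        /* looked free: one test&set attempt (exchange returns old value) */
        if (atomic_exchange(lock, 1) == 0)
            return;                         /* only one attemptor succeeds */
        for (unsigned i = 0; i < delay; i++)
            sched_yield();                  /* lost the race: back off */
        if (delay < 1024)
            delay *= 2;                     /* exponential backoff */
    }
}

void tts_release(atomic_int *lock) {
    atomic_store(lock, 0);                  /* invalidates cached copies */
}
```

The ordinary load keeps waiters spinning on their cached copy; only the release-time invalidation sends them out on the bus.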
Performance Criteria (T&S Lock)
Uncontended latency
Very low if repeatedly accessed by same processor; indept. of p
Traffic
Lots if many processors compete; poor scaling with p
Each t&s generates invalidations, and all rush out again to t&s
Storage
Very small (single variable); independent of p
Performance Criteria (2)
Fairness
Poor, can cause starvation
Test&set with backoff: similar, but less traffic
Test-and-test&set: slightly higher latency, much less traffic
But still all rush out to read miss and test&set on release
Traffic for p processors to access once each: O(p²)
Luckily, better hardware primitives as well as algorithms exist
Improved Hardware Primitives:
LL-SC
Goals:
Test with reads
Failed read-modify-write attempts don’t generate
invalidations
Nice if single primitive can implement range of r-m-w operations
Two instructions: Load-Locked (or -linked),
Store-Conditional
LL reads variable into register
Improved Hardware Primitives
(2)
Follow with arbitrary instructions to
manipulate its value
SC tries to store back to location if and only if
no one else has written to the variable since
this processor’s LL
If SC succeeds, means all three steps happened
atomically
If fails, doesn’t write or generate invalidations
(need to retry LL)
Success indicated by condition codes
Simple Lock with LL-SC
lock:    ll    reg1, location   /* LL location to reg1 */
         sc    location, reg2   /* SC reg2 into location */
         beqz  reg2, lock       /* if failed, start again */
         ret                    /* return control to caller */

unlock:  st    location, #0     /* write 0 to location */
         ret                    /* return control to caller */
Simple Lock with LL-SC (2)
Can do more fancy atomic ops by changing what’s between LL & SC
But keep it small so SC likely to succeed
Don’t include instructions that would need to be undone (e.g. stores)
SC can fail (without putting transaction on bus) if:
Detects intervening write even before trying to get bus
Tries to get bus but another processor’s SC gets bus first
LL, SC are not lock, unlock respectively
Only guarantee no conflicting write to lock variable between them
But can use directly to implement simple operations on shared variables
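Machines without LL-SC (x86, and C11’s portable atomics) expose compare&swap instead, but the same “load, compute, conditionally store” pattern applies: the CAS retry loop plays the role of a failed SC. A sketch of fetch&add built this way (the function name is ours):

```c
#include <stdatomic.h>

/* Fetch&add via a CAS retry loop, mirroring the LL/SC idiom:
   load ("LL"), compute, conditionally store ("SC"), retry on failure. */
int fetch_and_add(atomic_int *loc, int n) {
    int old = atomic_load(loc);                              /* "LL" */
    while (!atomic_compare_exchange_weak(loc, &old, old + n))
        ;   /* "SC" failed: old was refreshed by the CAS, retry */
    return old;
}
```

As with LL-SC, the work between the load and the conditional store should be small so the attempt is likely to succeed.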
More Efficient SW Locking
Algorithms
Problem with Simple LL-SC lock
No invals on failure, but read misses by all waiters
after both release and successful SC by winner
No test-and-test&set analog, but can use backoff
to reduce burstiness
Doesn’t reduce traffic to minimum, and not a fair
lock
More Efficient SW Locking (2)
Better SW algorithms for bus (for r-m-w
instructions or LL-SC)
Only one process to try to get lock upon release
valuable when using test&set instructions; LL-SC does it
already
Only one process to have read miss upon release
valuable with LL-SC too
Ticket lock achieves first
Array-based queueing lock achieves both
Both are fair (FIFO) locks as well
Ticket Lock
Only one r-m-w (from only one processor)
per acquire
Works like waiting line at a bank
Two counters per lock (next_ticket, now_serving)
Acquire: fetch&inc next_ticket; wait for
now_serving to equal it
atomic op when arrive at lock, not when it’s free (so less
contention)
Release: increment now_serving
FIFO order, low latency for low-contention if
fetch&inc cacheable
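The two-counter scheme above can be sketched with C11 atomics (`atomic_fetch_add` standing in for fetch&inc; struct and function names are ours):

```c
#include <stdatomic.h>

/* Ticket lock: one atomic op per acquire, FIFO hand-off on release. */
typedef struct {
    atomic_uint next_ticket;    /* fetch&inc on acquire */
    atomic_uint now_serving;    /* incremented on release */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    /* atomic op at arrival, not when the lock frees (so less contention) */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my)
        ;                       /* spin with ordinary loads */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);   /* serve the next ticket */
}
```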
Ticket Lock (2)
Works like waiting line at a bank (cont’d)
Still O(p) read misses at release, since all spin on same variable
like simple LL-SC lock, but no inval when SC succeeds, and fair
Can be difficult to find a good amount to delay on backoff
exponential backoff not a good idea due to FIFO order
backoff proportional to now_serving - next_ticket may work well
Wouldn’t it be nice to poll different locations...
Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p
Acquire: fetch&inc to obtain address on which to spin (next array element)
ensure that these addresses are in different cache lines or memories
Release: set next location in array, thus waking up process spinning on it
O(1) traffic per acquire with coherent caches
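A sketch of the array-based queueing lock in C11 atomics, with each spin flag padded toward its own cache line (the padding size, bound `MAXPROCS`, and all names are illustrative assumptions):

```c
#include <stdatomic.h>

#define MAXPROCS 16   /* p: illustrative bound on waiters */

typedef struct {
    struct { atomic_int can_go; char pad[60]; } slot[MAXPROCS]; /* one line each */
    atomic_uint next_slot;      /* fetch&inc picks a spin location */
} qlock_t;

void qlock_init(qlock_t *l) {
    for (int i = 0; i < MAXPROCS; i++)
        atomic_store(&l->slot[i].can_go, 0);
    atomic_store(&l->slot[0].can_go, 1);    /* first arrival proceeds */
    atomic_store(&l->next_slot, 0);
}

unsigned qlock_acquire(qlock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_slot, 1) % MAXPROCS;
    while (!atomic_load(&l->slot[my].can_go))
        ;                                   /* each waiter spins on its own slot */
    return my;                              /* caller hands this to release */
}

void qlock_release(qlock_t *l, unsigned my) {
    atomic_store(&l->slot[my].can_go, 0);                   /* reset my slot */
    atomic_store(&l->slot[(my + 1) % MAXPROCS].can_go, 1);  /* wake next in FIFO */
}
```

Release touches only the next waiter’s line, which is where the O(1) traffic per acquire comes from.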
Array-based Queuing Locks (2)
Waiting processes poll on different locations
in an array of size p (cont’d)
FIFO ordering, as in ticket lock
But, O(p) space per lock
Good performance for bus-based machines
Not so great for non-cache-coherent machines
with distributed memory
array location I spin on not necessarily in my local
memory
Lock Performance on SGI Challenge
Loop: lock; delay(c); unlock; delay(d);
[Figure: time per lock transfer (μs) vs. number of processors (1–15) for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks; three panels: (a) null (c = 0, d = 0), (b) critical-section (c = 3.64 μs, d = 0), (c) delay (c = 3.64 μs, d = 1.29 μs)]
Lock Performance on SGI Challenge
(2)
Simple LL-SC lock does best at small p due to
unfairness
Not so with delay between unlock and next lock
Need to be careful with backoff
Ticket lock with proportional backoff scales
well, as does array lock
Methodologically challenging, and need to
look at real workloads
Point to Point Event
Synchronization
Software methods:
Interrupts
Busy-waiting: use ordinary variables as flags
Blocking: use semaphores
Full hardware support: full-empty bit with
each word in memory
Set when word is “full” with newly produced data
(i.e. when written)
Unset when word is “empty” due to being
consumed (i.e. when read)
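The busy-waiting software method above (ordinary variables as flags) looks like this in C11, where release/acquire ordering makes the data visible before the flag; a sketch, with names and the single-word payload as our assumptions:

```c
#include <stdatomic.h>

int data;                 /* the "word" being produced/consumed */
atomic_int full = 0;      /* software analogue of a full-empty bit */

void produce(int value) {
    data = value;
    /* release: data is guaranteed visible before full is seen as 1 */
    atomic_store_explicit(&full, 1, memory_order_release);
}

int consume(void) {
    /* busy-wait on the flag with acquire loads */
    while (!atomic_load_explicit(&full, memory_order_acquire))
        ;
    int v = data;
    atomic_store_explicit(&full, 0, memory_order_relaxed);  /* mark empty */
    return v;
}
```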
Point to Point Event Synchronization
(2)
Full hardware support: (cont’d)
Natural for word-level producer-consumer
synchronization
producer: write if empty, set to full; consumer: read if
full; set to empty
Hardware preserves atomicity of bit manipulation
with read or write
Problem: flexibility
multiple consumers, or multiple writes before consumer
reads?
needs language support to specify when to use
composite data structures?
Barriers
Software algorithms implemented using locks,
flags, counters
Hardware barriers
Wired-AND line separate from address/data bus
Set input high when arrive, wait for output to be
high to leave
In practice, multiple wires to allow reuse
Useful when barriers are global and very frequent
Barriers (2)
Hardware barriers
Difficult to support arbitrary subset of processors
Difficult to dynamically change number and
identity of participants
even harder with multiple processes per processor
e.g. latter due to process migration
Not common today on bus-based machines
Let’s look at software algorithms with simple
hardware primitives
A Simple Centralized Barrier
Shared counter maintains number of processes that
have arrived
increment when arrive (lock), check until reaches numprocs
struct bar_type {int counter; struct lock_type lock; int flag;} bar_name;

BARRIER (bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)
    bar_name.flag = 0;                 /* reset flag if first to reach */
  mycount = ++bar_name.counter;        /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p) {                  /* last to arrive */
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = 1;                 /* release waiters */
  }
  else while (bar_name.flag == 0) {};  /* busy wait for release */
}
Problem?
A Working Centralized Barrier
Consecutively entering the same barrier
doesn’t work
Must prevent process from entering until all have
left previous instance
Could use another counter, but increases latency
and contention
Sense reversal: wait for flag to take different
value consecutive times
Toggle this value only when all processes reach
A Working Centralized Barrier
(2)
BARRIER (bar_name, p) {
  local_sense = !(local_sense);        /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = ++bar_name.counter;        /* mycount is private */
  if (bar_name.counter == p) {         /* last to arrive */
    UNLOCK(bar_name.lock);
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = local_sense;       /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {};  /* wait for flag to toggle */
  }
}
Centralized Barrier Performance
Latency
Want short critical path in barrier
Centralized has critical path length at least proportional to p
Traffic
Barriers likely to be highly contended, so want traffic to scale well
About 3p bus transactions in centralized
Storage cost
Very low: centralized counter and flag
Centralized Barrier Performance
(2)
Fairness
Same processor should not always be last to exit
barrier
No such bias in centralized
Key problems for centralized barrier are
latency and traffic
Especially with distributed memory, traffic goes to
same node
Improved Barrier Algorithms for
a Bus
Software combining tree
Only k processors access the same location, where k is degree of tree
Flat structure: contention at the single counter
Tree structure: little contention
Improved Barrier Algorithms for a
Bus (2)
Separate arrival and exit trees, and use sense
reversal
Valuable in distributed network: communicate
along different paths
On bus, all traffic goes on same bus, and no less
total traffic
Higher latency (log p steps of work, and O(p)
serialized bus xactions)
Advantage on bus is use of ordinary reads/writes
instead of locks
Barrier Performance on SGI Challenge
[Figure: barrier completion time (μs) vs. number of processors (1–8) for centralized, combining tree, tournament, and dissemination barriers]
Centralized does quite well
Synchronization Summary
Rich interaction of hardware-software tradeoffs
Must evaluate hardware primitives and software algorithms together
primitives determine which algorithms perform well
Evaluation methodology is challenging
use of delays, microbenchmarks
should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on bus
Key Takeaways for this Session
Multi-core processors are here
These are multiprocessor/MIMD systems
We need to understand parallel programming
System support for multi-core is available
Strengths, weaknesses, opportunities, and threats
No “free lunch” for performance improvement
OS: both Linux and Windows support them
Compilers/language support: gcc, C#, Java
Two types of development tracks
High performance computing
High throughput computing
Both have their unique challenges
Key Takeaways (2)
High performance computing
Most scientific/engineering applications
Available programming models: message-passing (MPI) or
shared-memory processing (OpenMP)
Challenge: performance scalability with cores and problem
size while dealing with data/function partitioning
High throughput computing
Most business applications
Available programming model: multi-threading (shared-memory processing)
Challenge: performance scalability while dealing with
deadlocks, locking, cache, and memory issues