ECE200 – Computer Organization
Chapter 9 – Multiprocessors
What we’ll cover today
Multiprocessor motivation
Multiprocessor organizations
Shared memory multiprocessors
Cache coherence
Synchronization
Multiprocessor motivation, part 1
Many scientific applications take too long to run
on a single processor machine
Modeling of weather patterns, astrophysics, chemical reactions, ocean currents, etc.
Many of these are parallel applications that largely consist of loops operating on independent data
Such applications can make efficient use of a
multiprocessor machine with each loop iteration
running on a different processor and operating
on independent data
Multiprocessor motivation, part 2
Many multi-user environments require more
compute power than available from a single
processor machine
Airline reservation system, department store chain inventory system, file server for a large department, web server for a major corporation, etc.
These largely consist of parallel transactions which operate on independent data
Such applications can make efficient use of a multiprocessor machine with each transaction running on a different processor and operating on independent data
Multiprocessor organizations
Shared memory multiprocessors
All processors share the same memory address space
Single copy of the OS (although some parts may be parallel)
Relatively easy to program and port sequential code to
Difficult to scale to large numbers of processors
Uniform memory access (UMA) machine block diagram
Multiprocessor organizations
Distributed memory multiprocessors
Processors have their own memory address space
Message passing used to access another processor’s memory
Multiple copies of the OS
Usually commodity hardware and network (e.g., Ethernet)
More difficult to program
Easier to scale hardware and more inherently fault resilient
Multiprocessor variants
Non-uniform memory access (NUMA) shared
memory multiprocessors
All memory can be addressed by all processors, but access to a processor’s own local memory is faster than access to another processor’s remote memory
Looks like a distributed machine, but interconnection network
is usually custom-designed switches and/or buses
Multiprocessor variants
Distributed shared memory (DSM)
multiprocessors
Commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory
Operating system handles accesses to remote memory
“transparently” on behalf of the application
Relieves application developer of the burden of memory
management across the network
Multiprocessor variants
Shared memory machines connected together
over a network (operating as a distributed
memory or DSM machine)
(block diagram: shared memory machines, each with a network controller, connected over a network)
Shared memory multiprocessors
Major design issues
Cache coherence: ensuring that stores to cached data are seen by other processors
Synchronization: the coordination among processors accessing
shared data
Memory consistency: definition of when a processor must
observe a write from another processor
Cache coherence problem
Two writeback caches becoming incoherent
(1) CPU 0 reads block A
(2) CPU 1 reads block A
(3) CPU 0 writes block A
(diagram: after the write, CPU 0’s cache holds the new value of A, while CPU 1’s cache and main memory hold old, out of date copies of block A)
Cache coherence protocols
Ensure that writes to cached blocks are observable by all processors
Assign a state field to each cached block
Define actions for performing reads and writes to blocks in each state that ensure cache coherence
In a real machine with a split-transaction bus, the actions are much more complicated than described here
MESI cache coherence protocol
Commonly used (or variant thereof) in shared
memory multiprocessors
Idea is to ensure that when a cache wants to
write to a cache block that other remote caches
invalidate their copies first
Each cache block is in one of four states (2 bits
stored with each cache block)
Invalid: contents are not valid
Shared: other processor caches may have the same copy;
main memory has the same copy
Exclusive: no other processor cache has a copy; main memory
has the same copy
Modified: no other processor cache has a copy; main memory
has an old copy
MESI cache coherence protocol
Actions on a load that results in a cache hit
Local cache actions
Read block
Remote cache actions
None
Actions on a load that results in a cache miss
Local cache actions
Request block from bus
If not in a remote cache, set state to Exclusive
If also in a remote cache, set state to Shared
Remote cache actions
Look up cache tags to see if the block is present
If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Shared
MESI cache coherence protocol
Actions on a store that results in a cache hit
Local cache actions
Check state of block
If Shared, send an Invalidation bus command to all remote caches
Write the block and change the state to Modified
Remote cache actions
Upon receipt of an Invalidation command on the bus, look up cache tags to see if the block is present
If so, change the state of the block to Invalid
Actions on a store that results in a cache miss
Local cache actions
Simultaneously request block from bus and send an Invalidation command
After block received, write the block and set the state to Modified
Remote cache actions
Look up cache tags to see if the block is present
If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Invalid
Cache coherence problem revisited
(1) CPU 0 reads block A
(diagram: CPU 0’s cache holds A in state Exclusive)
(2) CPU 1 reads block A
(diagram: both caches hold A in state Shared)
(3) CPU 0 cache invalidates remote block A
(diagram: Invalidate command sent on the bus; CPU 1’s copy of A changes to Invalid)
(4) CPU 0 writes block A
(diagram: CPU 0’s copy of A changes to Modified; main memory now holds an old copy)
Synchronization
For parallel programs to share data, we must
make sure that accesses to a given memory
location are ordered
Example: database of available inventory at a department store simultaneously accessed from different store computers; only one computer must “win the race” to reserve a particular item
Solution
The architecture defines a special atomic swap instruction in which a memory location is tested for 0, and if so, is set to 1
Software associates a lock variable with each data item that needs to be ordered (e.g., a particular class of merchandise) and uses the atomic swap instruction to try to set it
Software acquires the lock before modifying the associated
data (e.g., reserving the merchandise)
Software releases the lock by setting it to 0 when done
Synchronization flowchart
(flowchart: the processor keeps retrying the atomic swap until it succeeds, i.e., “spinning”)
Synchronization and coherence example
Questions?