ECE200 – Computer Organization
Chapter 9 – Multiprocessors
What we'll cover today

Multiprocessor motivation
Multiprocessor organizations
Shared memory multiprocessors
- Cache coherence
- Synchronization
Multiprocessor motivation, part 1

Many scientific applications take too long to run on a single processor machine
- Modeling of weather patterns, astrophysics, chemical reactions, ocean currents, etc.

Many of these are parallel applications which largely consist of loops that operate on independent data. Such applications can make efficient use of a multiprocessor machine, with each loop iteration running on a different processor and operating on independent data.
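The loop-level parallelism described above can be sketched in a few lines. This is an illustrative Python sketch (the thread pool and the `body` function are assumptions of this example, not part of the course material): independent loop iterations are handed to a pool of workers, just as a multiprocessor would hand them to different processors.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "iteration" operates on independent data: here, squaring one element.
# On a real multiprocessor, each iteration could run on a different processor.
# (A Python thread pool only models the mapping; the GIL limits true
# parallelism for CPU-bound work.)
def body(x):
    return x * x

data = list(range(8))

# Distribute the independent iterations across a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(body, data))

print(results)  # same answer as the sequential loop [x * x for x in data]
```

Because no iteration reads another iteration's data, the workers need no coordination and the result matches the sequential loop.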
Multiprocessor motivation, part 2

Many multi-user environments require more compute power than is available from a single processor machine
- Airline reservation system, department store chain inventory system, file server for a large department, web server for a major corporation, etc.

These consist largely of parallel transactions which operate on independent data. Such applications can make efficient use of a multiprocessor machine, with each transaction running on a different processor and operating on independent data.
Multiprocessor organizations

Shared memory multiprocessors
- All processors share the same memory address space
- Single copy of the OS (although some parts may be parallel)
- Relatively easy to program and port sequential code to
- Difficult to scale to large numbers of processors
- Uniform memory access (UMA) machine block diagram
Multiprocessor organizations

Distributed memory multiprocessors
- Processors have their own memory address space
- Message passing used to access another processor's memory
- Multiple copies of the OS
- Usually commodity hardware and network (e.g., Ethernet)
- More difficult to program
- Easier to scale hardware and inherently more fault resilient
Multiprocessor variants

Non-uniform memory access (NUMA) shared memory multiprocessors
- All memory can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory
- Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses
Multiprocessor variants

Distributed shared memory (DSM) multiprocessors
- Commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory
- Operating system handles accesses to remote memory "transparently" on behalf of the application
- Relieves the application developer of the burden of memory management across the network
Multiprocessor variants

Shared memory machines connected together over a network (operating as a distributed memory or DSM machine)

[Diagram: multiple shared memory machines, each with a network controller, connected by a network]
Shared memory multiprocessors

Major design issues
- Cache coherence: ensuring that stores to cached data are seen by other processors
- Synchronization: the coordination among processors accessing shared data
- Memory consistency: definition of when a processor must observe a write from another processor
Cache coherence problem

Two writeback caches becoming incoherent:
(1) CPU 0 reads block A: CPU 0's cache holds A; the main memory copy is current
(2) CPU 1 reads block A: both caches hold A; the main memory copy is current
(3) CPU 0 writes block A: CPU 0's cache holds the new value; CPU 1's cache and main memory hold old, out-of-date copies of block A

[Diagrams: CPU 0 and CPU 1, each with a private cache, connected to shared main memory]
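The three steps above can be made concrete with a toy sketch. Here Python dictionaries stand in for the two caches and main memory (an illustrative assumption; real caches hold blocks in hardware):

```python
# Two private "caches" copy block A from memory; a write to one copy
# leaves the other copy and memory stale (writeback: memory is only
# updated later, when the block is evicted).
memory = {"A": 5}
cache0 = {"A": memory["A"]}      # (1) CPU 0 reads block A
cache1 = {"A": memory["A"]}      # (2) CPU 1 reads block A
cache0["A"] = 7                  # (3) CPU 0 writes block A

print(cache1["A"], memory["A"])  # both still 5: old, out-of-date copies
```

Without a coherence protocol, nothing ever tells CPU 1 (or memory) that its copy of A is stale.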
Cache coherence protocols

- Ensure that writes to cached blocks are observable by all processors
- Assign a state field to each cached block
- Define actions for performing reads and writes to blocks in each state that ensure cache coherence
- In a real machine with a split-transaction bus, the actions are much more complicated than described here
MESI cache coherence protocol

Commonly used (or a variant thereof) in shared memory multiprocessors

The idea is to ensure that, before a cache writes to a cache block, all other remote caches invalidate their copies first

Each cache block is in one of four states (2 bits stored with each cache block):
- Invalid: contents are not valid
- Shared: other processor caches may have the same copy; main memory has the same copy
- Exclusive: no other processor cache has a copy; main memory has the same copy
- Modified: no other processor cache has a copy; main memory has an old copy
MESI cache coherence protocol

Actions on a load that results in a cache hit
- Local cache actions
  - Read the block
- Remote cache actions
  - None

Actions on a load that results in a cache miss
- Local cache actions
  - Request the block from the bus
  - If not in a remote cache, set the state to Exclusive
  - If also in a remote cache, set the state to Shared
- Remote cache actions
  - Look up cache tags to see if the block is present
  - If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Shared
MESI cache coherence protocol

Actions on a store that results in a cache hit
- Local cache actions
  - Check the state of the block
  - If Shared, send an Invalidation bus command to all remote caches
  - Write the block and change the state to Modified
- Remote cache actions
  - Upon receipt of an Invalidation command on the bus, look up cache tags to see if the block is present
  - If so, change the state of the block to Invalid

Actions on a store that results in a cache miss
- Local cache actions
  - Simultaneously request the block from the bus and send an Invalidation command
  - After the block is received, write the block and set the state to Modified
- Remote cache actions
  - Look up cache tags to see if the block is present
  - If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Invalid
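The load and store actions above can be sketched as a toy state machine. This Python sketch (the names `Cache`, `load`, and `store` are invented for illustration) tracks only the MESI state of a single block in each cache, and abstracts away the data transfer and the bus:

```python
from enum import Enum

class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    MODIFIED = "M"

class Cache:
    """Tracks the MESI state of one block (data transfer abstracted away)."""
    def __init__(self):
        self.state = State.INVALID

def load(local, remotes):
    if local.state != State.INVALID:
        return                              # cache hit: just read the block
    holders = [r for r in remotes if r.state != State.INVALID]
    if holders:
        for r in holders:                   # a Modified holder would supply the
            r.state = State.SHARED          # block; every copy becomes Shared
        local.state = State.SHARED
    else:
        local.state = State.EXCLUSIVE       # no other cache has a copy

def store(local, remotes):
    if local.state in (State.SHARED, State.INVALID):
        for r in remotes:                   # Invalidation command on the bus
            r.state = State.INVALID
    local.state = State.MODIFIED            # write the block

# Replay the two-CPU example:
c0, c1 = Cache(), Cache()
load(c0, [c1])    # (1) CPU 0 reads block A  -> c0 Exclusive
load(c1, [c0])    # (2) CPU 1 reads block A  -> both Shared
store(c0, [c1])   # (3)+(4) CPU 0 writes A   -> c0 Modified, c1 Invalid
print(c0.state, c1.state)  # State.MODIFIED State.INVALID
```

Note the silent upgrade in `store`: a block already in Exclusive or Modified can be written with no bus traffic, which is exactly why MESI distinguishes Exclusive from Shared.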
Cache coherence problem revisited

The same sequence under MESI:
(1) CPU 0 reads block A: CPU 0's copy enters state Exclusive
(2) CPU 1 reads block A: both copies enter state Shared
(3) CPU 0 cache invalidates remote block A: an Invalidate command on the bus puts CPU 1's copy in state Invalid
(4) CPU 0 writes block A: CPU 0's copy enters state Modified; main memory's copy is old, but no other cache holds a valid copy

[Diagrams: CPU 0 and CPU 1 caches with per-block MESI state, connected to shared main memory]
Synchronization

For parallel programs to share data, we must make sure that accesses to a given memory location are ordered
- Example: a database of available inventory at a department store is accessed simultaneously from different store computers; only one computer must "win the race" to reserve a particular item

Solution
- Architecture defines a special atomic swap instruction in which a memory location is tested for 0 and, if so, is set to 1
- Software associates a lock variable with each data item that needs ordered access (e.g., a particular class of merchandise) and uses the atomic swap instruction to try to set it
- Software acquires the lock before modifying the associated data (e.g., reserving the merchandise)
- Software releases the lock by setting it to 0 when done
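The acquire/release scheme can be sketched as a spin lock built on the atomic swap. In this illustrative Python sketch, a low-level `threading.Lock` stands in for the atomicity that the hardware instruction provides (an assumption of the sketch, not how real hardware works); the `SpinLock` class and `reserve` function are invented for the example.

```python
import threading

class SpinLock:
    """Spin lock built on an atomic test-and-set (swap) of a lock variable."""
    def __init__(self):
        self._word = 0                    # the lock variable in memory
        self._atomic = threading.Lock()   # emulates the instruction's atomicity

    def atomic_swap(self):
        # Atomically: if the word is 0, set it to 1 and report success.
        with self._atomic:
            if self._word == 0:
                self._word = 1
                return True
            return False

    def acquire(self):
        while not self.atomic_swap():     # "spinning" until the swap succeeds
            pass

    def release(self):
        self._word = 0                    # release: store 0 to the lock variable

# Usage: several threads race to reserve inventory; the lock orders them.
inventory = {"widgets": 10}
lock = SpinLock()

def reserve():
    lock.acquire()
    inventory["widgets"] -= 1            # critical section: modify shared data
    lock.release()

threads = [threading.Thread(target=reserve) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(inventory["widgets"])  # 10 - 4 = 6
```

Without the lock, two threads could read the same count and both "reserve" the last item; with it, exactly one thread wins each race.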
Synchronization flowchart

[Flowchart: processor repeatedly attempts the atomic swap, "spinning" until it acquires the lock]
Synchronization and coherence example
Questions?