CSE 431. Computer Architecture
Review: Where are We Now?
[Figure: two computers, each with a processor (control and datapath), memory, input, and output]
Multiprocessor – multiple processors with a single shared
address space
Cluster – multiple computers (each with their own
address space) connected over a local area network
(LAN) functioning as a single system
BusMultis.1
Multiprocessor Basics
Q1 – How do they share data?
Q2 – How do they coordinate?
Q3 – How scalable is the architecture? How many
processors?
                                             # of Proc
Communication model  Message passing         8 to 2048
                     Shared address, NUMA    8 to 256
                     Shared address, UMA     2 to 64
Physical connection  Network                 8 to 256
                     Bus                     2 to 36
Single Bus (Shared Address UMA) Multi’s
[Figure: Proc1 through Proc4, each with its own caches, connected by a single bus to memory and I/O]
Caches are used to reduce latency and to lower bus traffic
Write-back caches used to keep bus traffic at a minimum
Must provide hardware to ensure that caches and memory
are consistent (cache coherency)
Must provide a hardware mechanism to support process
synchronization
Multiprocessor Cache Coherency
Cache coherency protocols
Bus snooping – cache controllers monitor shared bus traffic with
duplicate address tag hardware (so they don’t interfere with
processor’s access to the cache)
[Figure: Proc1 through ProcN, each with snoop tags and a data cache (DCache), connected by a single bus to memory and I/O]
Bus Snooping Protocols
Multiple copies are not a problem when reading
Processor must have exclusive access to write a word
What happens if two processors try to write to the same shared
data word in the same clock cycle? The bus arbiter decides
which processor gets the bus first (and this will be the
processor with the first exclusive access). Then the second
processor will get exclusive access. Thus, bus arbitration
forces sequential behavior.
This sequential consistency is the most conservative of the
memory consistency models. With it, the result of any
execution is the same as if the accesses of each processor
were kept in order and the accesses among different
processors were interleaved.
All other processors sharing that data must be informed
of writes
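The sequential consistency definition above can be checked by brute force on a small case. The sketch below is only an illustration: the two-processor program (P1 and P2, variables x and y, registers r1 and r2) is hypothetical and does not come from the slides. It enumerates every interleaving that keeps each processor's accesses in order and collects the results that can occur.

```python
from itertools import combinations

# Hypothetical two-processor program (an assumption, not from the slides).
# Each access is (kind, variable, value-to-write or register-to-load).
P1 = [("write", "x", 1), ("read", "y", "r1")]
P2 = [("write", "y", 1), ("read", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list in its own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "write":
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs["r1"], regs["r2"]

outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))  # (0, 0) never appears under sequential consistency
```

Both processors reading 0 would require each read to precede the other processor's write, which no in-order interleaving allows.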
Handling Writes
Ensuring that all other processors sharing data are
informed of writes can be handled two ways:
1. Write-update (write-broadcast) – writing processor
broadcasts new data over the bus, all copies are
updated
- All writes go to the bus, so bus traffic is higher
- Since new values appear in caches sooner, can reduce latency
2. Write-invalidate – writing processor issues invalidation
signal on bus, cache snoops check to see if they have a
copy of the data, if so they invalidate their cache block
containing the word (this allows multiple readers but
only one writer)
- Uses the bus only on the first write, so bus traffic is lower and
bus bandwidth is better used
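The traffic difference between the two schemes can be made concrete with a small counting model. This is a rough sketch under stated assumptions, not a full protocol: it tracks a single shared word, assumes every processor starts with a valid copy, and counts each broadcast, invalidate, or miss as one bus transaction. The function name and trace format are invented for the illustration.

```python
def bus_transactions(trace, policy):
    """Count bus transactions for accesses to one shared word.

    trace  -- list of (processor, "read" | "write") accesses, in bus order
    policy -- "update" or "invalidate"
    Assumption: every processor in the trace starts with a valid copy.
    """
    valid = {p for p, _ in trace}   # processors holding a valid copy
    exclusive = None                # processor that may write without the bus
    count = 0
    for proc, op in trace:
        if op == "read":
            if proc not in valid:   # read miss: fetch the block over the bus
                count += 1
                valid.add(proc)
                exclusive = None    # the block is shared again
        elif policy == "update":
            count += 1              # every write is broadcast
        elif exclusive != proc:     # write-invalidate, first write only
            count += 1
            valid = {proc}          # all other copies are invalidated
            exclusive = proc
    return count

# P0 writes five times, P1 reads once, P0 writes twice more.
trace = [(0, "write")] * 5 + [(1, "read")] + [(0, "write")] * 2
print(bus_transactions(trace, "update"))      # 7
print(bus_transactions(trace, "invalidate"))  # 3
```

Under write-update the reader never misses but all seven writes use the bus. Under write-invalidate only the first write of each burst and the intervening read miss use it.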
Write-Invalidate CC Examples
I = invalid (many), S = shared (many), M = modified (only one)
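Example 1: Proc 2 has a read miss for A; Proc 1 holds A in state S
1. P2: read miss for A (its copy is in state I)
2. P2: read request for A on the bus
3. P1: snoop sees read request for A & lets MM supply A
4. P2: gets A from MM & changes its state from I to S
Example 2: Proc 2 has a read miss for A; Proc 1 holds A in state M
1. P2: read miss for A (its copy is in state I)
2. P2: read request for A on the bus
3. P1: snoop sees read request for A, writes back A to MM
4. P2: gets A from MM & changes its state from I to M
5. P2 sends invalidate for A; P1 changes A's state to I
Example 3: Proc 2 has a write miss for A; Proc 1 holds A in state S
1. P2: write miss for A (its copy is in state I)
2. P2: writes A & changes its state from I to M
3. P2 sends invalidate for A
4. P1: changes A's state from S to I
Example 4: Proc 2 has a write miss for A; Proc 1 holds A in state M
1. P2: write miss for A (its copy is in state I)
2. P2: writes A & changes its state from I to M
3. P2 sends invalidate for A
4. P1: changes A's state from M to I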
SMP Data Miss Rates
Shared data has lower spatial and temporal locality
Shared data misses often dominate cache behavior even though
they may account for only 10% to 40% of the data accesses
[Figure: capacity miss rate and coherence miss rate versus number of processors (1, 2, 4, 8, 16) for FFT and Ocean; 64KB 2-way set associative data cache with 32B blocks. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach]
Block Size Effects
Writes to one word in a multi-word block mean
either the full block is invalidated (write-invalidate)
or the full block is exchanged between processors (write-update)
- Alternatively, could broadcast only the written word
Multi-word blocks can also result in false sharing: when
two processors are writing to two different variables in
the same cache block
With write-invalidate false sharing increases cache miss rates
[Figure: Proc1 writes A and Proc2 writes B, where A and B lie in the same 4 word cache block]
Compilers can help reduce false sharing by allocating
highly correlated data to the same cache block
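The false sharing effect described above can be sketched with a toy write-invalidate model. The following is a simplified illustration, not a cache simulator: it assumes 4-byte words (so the 4 word block in the figure is 16 bytes), ignores cold misses, and uses invented names (BLOCK_BYTES, coherence_misses). Proc1 and Proc2 alternate writes to two different addresses.

```python
BLOCK_BYTES = 16  # 4-word block; the 4-byte word size is an assumption

def block_of(addr):
    """Cache block number that a byte address falls into."""
    return addr // BLOCK_BYTES

def coherence_misses(addr_a, addr_b, rounds):
    """Proc1 writes addr_a and Proc2 writes addr_b, alternating.

    Returns how many writes miss because the other processor's
    write invalidated the block (cold misses are not counted).
    """
    holder = {}   # block number -> processor holding it in state M
    misses = 0
    for _ in range(rounds):
        for proc, addr in ((1, addr_a), (2, addr_b)):
            blk = block_of(addr)
            if blk in holder and holder[blk] != proc:
                misses += 1       # our copy was invalidated by the other writer
            holder[blk] = proc    # this write invalidates any other copy
    return misses

print(coherence_misses(0, 4, 10))   # A and B in the same block: 19
print(coherence_misses(0, 16, 10))  # A and B in different blocks: 0
```

Placing B one block away from A removes every coherence miss in this model, even though neither processor ever touches the other's variable.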
Other Coherence Protocols
There are many variations on cache coherence protocols
Another write-invalidate protocol used in the Pentium 4
(and many other microprocessors) is MESI with four states:
Modified – same as in the three-state (I, S, M) protocol
Exclusive – only one copy of the shared data is allowed to be
cached; memory has an up-to-date copy
- Since there is only one copy of the block, write hits don’t need to
send invalidate signal
Shared – multiple copies of the shared data may be cached (i.e.,
data permitted to be cached with more than one processor);
memory has an up-to-date copy
Invalid – same as in the three-state (I, S, M) protocol
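The benefit of the Exclusive state shows up on write hits. The table below is a minimal sketch of that one case only (write hits, not misses or snooped requests), written from the state descriptions above; the names WRITE_HIT and write_hit are made up for the illustration.

```python
# Write-hit behavior per MESI state:
# state -> (next state, must an invalidate be sent on the bus?)
WRITE_HIT = {
    "M": ("M", False),  # already the only copy, and already modified
    "E": ("M", False),  # only copy, so no other cache needs to be told
    "S": ("M", True),   # other caches may hold copies and must invalidate
}

def write_hit(state):
    """Return (next_state, needs_invalidate) for a write hit."""
    if state == "I":
        raise ValueError("an Invalid block cannot hit; this is a write miss")
    return WRITE_HIT[state]

for s in "MES":
    print(s, write_hit(s))
```

Only a write hit to a Shared block needs the bus, which is why a block that is read and then written by a single processor is cheaper under MESI than under the three-state protocol.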
Process Synchronization
Need to be able to coordinate processes working on a
common task
Lock variables (semaphores) are used to coordinate or
synchronize processes
Need an architecture-supported arbitration mechanism to
decide which processor gets access to the lock variable
Single bus provides arbitration mechanism, since the bus is the
only path to memory – the processor that gets the bus wins
Need an architecture-supported operation that locks the
variable
Locking can be done via an atomic swap operation (processor
can both read a location and set it to the locked state – test-and-set – in the same bus operation)
Spin Lock Synchronization
1. Read lock variable
2. Unlocked? (=0?) If no, spin: go back to step 1
3. If yes, try to lock variable using swap (an atomic operation): read lock variable and set it to locked value (1)
4. Succeed? (=0?) If no, go back to step 1
5. If yes, begin update of shared data ... finish update of shared data
6. Unlock variable: set lock variable to 0
The single winning processor will read a 0; all other processors will read the 1 set by
the winning processor
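The spin lock steps above can be sketched in Python threads. This is only an approximation of the hardware behavior: a threading.Lock stands in for the atomic bus operation that makes swap indivisible, and the class and function names (SpinLock, spin_demo) are invented for the example.

```python
import threading
import time

class SpinLock:
    def __init__(self):
        self.value = 0                    # 0 = unlocked, 1 = locked
        self._bus = threading.Lock()      # stands in for the atomic bus operation

    def _swap(self, new):
        """Atomically write `new` and return the old value."""
        with self._bus:
            old, self.value = self.value, new
            return old

    def lock(self):
        while True:
            while self.value != 0:        # spin on a plain read of the lock
                time.sleep(0)             # let other Python threads run
            if self._swap(1) == 0:        # swap returned 0, so this thread won
                return

    def unlock(self):
        self._swap(0)                     # set lock variable back to 0

def spin_demo(n_threads=4, increments=100):
    lock, shared = SpinLock(), {"count": 0}

    def worker():
        for _ in range(increments):
            lock.lock()
            shared["count"] += 1          # update of shared data
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]

print(spin_demo())
```

Spinning on the plain read before attempting the swap mirrors the flowchart: the swap, which needs the bus, is attempted only after the lock has been seen as 0.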
Review: Summing Numbers on a SMP
Pn is the processor’s number, vectors A and sum are
shared variables, i is a private variable, half is a private
variable initialized to the number of processors
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
  sum[Pn] = sum[Pn] + A[i];       /* each processor sums its subset of vector A */
repeat                            /* adding together the partial sums */
  synch();                        /* synchronize first */
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
  half = half/2;
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);                /* final sum in sum[0] */
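To check the reduction logic, the algorithm above can be replayed sequentially, with each "processor" taking its turn inside a loop. This is a sketch for tracing the arithmetic, not a parallel program: synch() is replaced by the loop structure, and the function name parallel_sum and the requirement len(A) == 1000 * P are assumptions taken from the loop bounds.

```python
def parallel_sum(A, P):
    """Sequentially replay the summing algorithm for P processors.

    Assumes len(A) == 1000 * P, matching the loop bounds above.
    """
    sums = [0] * P
    for Pn in range(P):                   # phase 1: private partial sums
        for i in range(1000 * Pn, 1000 * (Pn + 1)):
            sums[Pn] += A[i]
    half = P
    # A while loop is used instead of repeat/until so that P == 1
    # performs no reduction step at all.
    while half > 1:                       # phase 2: tree reduction
        # on real hardware every processor would call synch() here
        if half % 2 != 0:                 # odd count: P0 absorbs the last element
            sums[0] += sums[half - 1]
        half //= 2
        for Pn in range(half):            # each processor Pn < half adds one pair
            sums[Pn] += sums[Pn + half]
    return sums[0]

A = list(range(10000))
print(parallel_sum(A, 10) == sum(A))
```

With 10 processors, half takes the values 10, 5, 2, 1; the odd case fires once, when half is 5 and P0 adds sum[4] before the halving.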
An Example with 10 Processors
synch(): Processors must synchronize before the
“consumer” processor tries to read the results from the
memory location written by the “producer” processor
Barrier synchronization – a synchronization scheme where
processors wait at the barrier, not proceeding until every processor
has reached it
[Figure: reduction tree for 10 processors; P0 through P9 each hold a partial sum, sum[P0] through sum[P9], which P0 through P4 combine in the first step]
Barrier Implemented with Spin-Locks
n is a shared variable initialized to the number of
processors, count is a shared variable initialized to 0,
arrive and depart are shared spin-lock variables where
arrive is initially unlocked and depart is initially locked
procedure synch()
  lock(arrive);
  count := count + 1;        /* count the processors as they arrive at barrier */
  if count < n
    then unlock(arrive)
    else unlock(depart);
  lock(depart);
  count := count - 1;        /* count the processors as they leave barrier */
  if count > 0
    then unlock(depart)
    else unlock(arrive);
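The synch() procedure above translates almost line for line into Python. The sketch below makes one substitution that should be kept in mind: threading.Lock blocks rather than spins, so it only stands in for the spin-lock variables arrive and depart. The class name and the barrier_demo helper are invented for the illustration.

```python
import threading

class SpinLockBarrier:
    def __init__(self, n):
        self.n = n                        # number of processors
        self.count = 0                    # processors currently at the barrier
        self.arrive = threading.Lock()    # initially unlocked
        self.depart = threading.Lock()
        self.depart.acquire()             # initially locked

    def synch(self):
        self.arrive.acquire()
        self.count += 1                   # count the processors as they arrive
        if self.count < self.n:
            self.arrive.release()         # let the next processor arrive
        else:
            self.depart.release()         # last arrival opens the exit
        self.depart.acquire()
        self.count -= 1                   # count the processors as they leave
        if self.count > 0:
            self.depart.release()         # let the next processor leave
        else:
            self.arrive.release()         # last one out re-arms the barrier

def barrier_demo(n=4):
    barrier, log = SpinLockBarrier(n), []

    def worker(i):
        log.append(("before", i))
        barrier.synch()
        log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

print(barrier_demo())
```

Because arrive stays locked from the last arrival until the last departure, a fast processor cannot re-enter the barrier while slower ones are still leaving it.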
Spin-Locks on Bus Connected ccUMAs
With bus based cache coherency, spin-locks allow
processors to wait on a local copy of the lock in their
caches
Reduces bus traffic – once the processor with the lock releases
the lock (writes a 0) all other caches see that write and invalidate
their old copy of the lock variable. Unlocking restarts the race to
get the lock. The winner gets the bus and writes the lock back to
1. The other caches then invalidate their copy of the lock and on
the next lock read fetch the new lock value (1) from memory.
This scheme has problems scaling up to many
processors because of the communication traffic when
the lock is released and contested
Commercial Single Backplane Multiprocessors
System        Processor     # proc.  MHz  BW/system
Compaq PL     Pentium Pro   4        200  540
IBM R40       PowerPC       8        112  1800
AlphaServer   Alpha 21164   12       440  2150
SGI Pow Chal  MIPS R10000   36       195  1200
Sun 6000      UltraSPARC    30       167  2600
Summary
Key questions
Q1 - How do processors share data?
Q2 - How do processors coordinate their activity?
Q3 - How scalable is the architecture (what is the maximum
number of processors)?
Bus-connected multiprocessors (shared address UMAs, also called SMPs)
Cache coherency hardware to ensure data consistency
Synchronization primitives to coordinate processes
Scalability of bus connected UMAs limited (< ~ 36 processors)
because the three desirable bus characteristics
- high bandwidth
- low latency
- long length
are incompatible
Network connected NUMAs are more scalable