Snooping Protocols (II)

Assessing Protocol Tradeoffs

- Tradeoffs affected by performance and organization characteristics
- Part art and part science
  - Art: experience, intuition, and aesthetics of designers
  - Science: workload-driven evaluation for cost-performance
    - want a balanced system: no expensive resource heavily underutilized
2
Assessing Protocol Tradeoffs

Methodology:
- Use simulator; choose parameters per earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64 KB cache for some)
- Focus on frequencies, not end performance for now
- Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
  - Cheap simulation: no need to model contention
3
Impact of Protocol Optimizations

(Computing traffic from state transitions discussed in book)
Effect of E state, and of BusUpgr instead of BusRdX

[Figure: bus traffic (MB/s), split into address-bus and data-bus components, for the parallel programs (Barnes, LU, Ocean, Radiosity, Radix, Raytrace) and for the multiprogrammed workload (OS-Data, OS-Code, Appl-Data, Appl-Code), each under the Ill (Illinois/MESI), 3St (3-state MSI), and 3St-RdEx (3-state using BusRdX instead of BusUpgr) protocols.]

- MSI versus MESI doesn't seem to matter for bandwidth for these workloads
- Upgrades instead of read-exclusive helps
4
Impact of Cache Block Size

- Multiprocessors add a new kind of miss to the 3 C's
  - Coherence misses: true sharing and false sharing (a small false-sharing example follows this slide)
    - the latter due to the granularity of coherence being larger than a word
  - Both miss rate and traffic matter
- Reducing misses architecturally in invalidation protocol
  - Capacity: enlarge cache; increase block size (if spatial locality)
  - Conflict: increase associativity (up to full associativity?)
  - Cold and Coherence: only block size
- Increasing block size has advantages and disadvantages
  - Can reduce misses if spatial locality is good
  - Can hurt too
    - increases misses due to false sharing if spatial locality not good
    - increases misses due to conflicts in fixed-size cache
    - increases traffic due to fetching unnecessary data and due to false sharing
    - can increase miss penalty and perhaps hit cost
5
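
To make false sharing concrete, here is a minimal C sketch (thread and field names are mine, not from the slides): two threads update different words that happen to sit in the same cache block, so under an invalidation protocol the block ping-pongs between the two caches even though no word is truly shared.

    #include <pthread.h>

    /* Two counters that are logically private to two threads but, being
     * adjacent words, typically land in the same cache block.  Every write
     * by one thread invalidates the other thread's copy of the block:
     * a false-sharing (coherence) miss, not a cold or capacity miss. */
    struct shared_counters {
        long a;    /* written only by thread 0 */
        long b;    /* written only by thread 1 */
    };

    static struct shared_counters counters;

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++)
            counters.a++;          /* invalidates the block in the other cache */
        return 0;
    }

    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++)
            counters.b++;
        return 0;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, 0, bump_a, 0);
        pthread_create(&t1, 0, bump_b, 0);
        pthread_join(t0, 0);
        pthread_join(t1, 0);
        return 0;
    }

With a 64-byte block both counters share one block; padding them apart (the "better data structuring" of a later slide) removes the coherence misses without changing the computation.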
A Classification of Cache Misses

[Figure: miss-classification decision tree. Starting from whether this is the first reference to the memory block by the processor, whether it is the first access system-wide, the reason for replacement or invalidation (including elimination of the last copy), whether the old copy (with state = invalid) is still there, whether the block has been modified since replacement, and whether modified word(s) were accessed during the block's lifetime, each miss falls into one of twelve categories: 1. cold, 2. cold, 3. false-sharing-cold, 4. true-sharing-cold, 5. false-sharing-inval-cap, 6. true-sharing-inval-cap, 7. pure-false-sharing, 8. pure-true-sharing, 9. pure-capacity, 10. true-sharing-capacity, 11. false-sharing-cap-inval, 12. true-sharing-cap-inval.]

- Conflict misses are considered to be capacity misses
- Lifetime: the time interval during which the block remains valid in the cache
- Many mixed categories because a miss may have multiple causes
6
Impact of Block Size on Miss Rate

Results shown only for default problem size: varied behavior
- Need to examine impact of problem size and p as well (see text)

[Figure: miss rate (%) versus block size (8 to 256 bytes) for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace, with each bar broken into cold, capacity (includes conflict), true-sharing, false-sharing, and upgrade components.]

- Working set doesn't fit: impact on capacity misses much more critical
7
Impact of Block Size on Traffic

Traffic affects performance indirectly through contention

[Figure: bus traffic versus block size (8 to 256 bytes), split into address-bus and data-bus components and measured in bytes per instruction or bytes per FLOP, for Barnes, LU, Radiosity, Radix, Raytrace, and Ocean.]

- Results different than for miss rate: traffic almost always increases
- When working sets fit, overall traffic still small, except for Radix
- Fixed overhead is significant component
  - So total traffic often minimized at 16-32 byte block, not smaller
8
Making Large Blocks More Effective

- Software
  - Improve spatial locality by better data structuring (more later; a padding sketch follows this slide)
  - Compiler techniques
- Hardware
  - Retain granularity of transfer but reduce granularity of coherence
    - use subblocks: same tag but different state bits
    - one subblock may be valid but another invalid or dirty
  - Reduce both granularities, but prefetch more blocks on a miss
  - Proposals for adjustable cache size
  - More subtle: delay propagation of invalidations and perform all at once
    - But can change consistency model: discuss later in course
  - Use update instead of invalidate protocols to reduce false sharing effect
9
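
As a sketch of the "better data structuring" point (the names are mine, and 64 bytes is just the block size assumed throughout this lecture), padding and aligning per-processor data to block boundaries removes false sharing with no protocol change, at the cost of some space:

    /* Pad each processor's counter out to its own 64-byte block so that
     * writes by different processors never touch the same coherence unit. */
    #define BLOCK_SIZE 64   /* cache block size assumed in this lecture */

    struct padded_counter {
        long value;
        char pad[BLOCK_SIZE - sizeof(long)];
    };

    /* one entry per processor, each starting on its own block boundary */
    static _Alignas(BLOCK_SIZE) struct padded_counter counters[16];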
Update versus Invalidate

- Much debate over the years: tradeoff depends on sharing patterns
- Intuition:
  - If the processors that have used the data continue to use it, and writes between uses are few, update should do better
    - e.g. producer-consumer pattern
  - If the processors holding the data are unlikely to use it again, or there are many writes between reads, updates are not good
    - useless updates where only the last one will be used
- Can construct scenarios where one or the other is much better
- Can combine them in hybrid schemes (see text)
  - E.g. competitive: observe patterns at runtime and change protocol
- Let's look at real workloads
10
Update vs Invalidate: Miss Rates

[Figure: miss rate (%) for LU, Ocean, Raytrace, and Radix under invalidate (inv), update (upd), and hybrid (mix) protocols, with each bar broken into cold, capacity, true-sharing, and false-sharing components.]

- Lots of coherence misses: updates help
- Lots of capacity misses: updates hurt (keep data in cache uselessly)
- Updates seem to help, but this ignores upgrade and update traffic
11
Upgrade and Update Rates (Traffic)

[Figure: upgrade/update rate (%) for LU, Ocean, Raytrace, and Radix under invalidate (inv), hybrid (mix), and update (upd) protocols.]

- Update traffic is substantial
- Main cause is multiple writes by a processor before a read by another
  - many bus transactions versus one in invalidation case
  - could delay updates or use merging
- Overall, trend is away from update-based protocols as default
  - bandwidth, complexity, large-block trend
  - Will see later that updates have greater problems for scalable systems
12
Synchronization

- "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Types of Synchronization
  - Mutual Exclusion
  - Event synchronization
    - point-to-point
    - group
    - global (barriers)
14
History and Perspectives

- Much debate over hardware primitives over the years
- Conclusions depend on technology and machine style
  - speed vs flexibility
- Most modern methods use a form of atomic read-modify-write
  - IBM 370: included atomic compare&swap for multiprogramming
  - x86: any instruction can be prefixed with a lock modifier
  - High-level language advocates want hardware locks/barriers
    - but it goes against the "RISC" flow, and has other problems
  - SPARC: atomic register-memory ops (swap, compare&swap)
  - MIPS, IBM Power: no atomic operations but a pair of instructions
    - load-locked, store-conditional
    - later used by PowerPC and DEC Alpha too
- Rich set of tradeoffs
15
Components of a Synchronization Event

- Acquire method
  - Acquire right to the synch (enter critical section, go past event)
- Waiting algorithm
  - Wait for synch to become available when it isn't
- Release method
  - Enable other processors to acquire right to the synch
- Waiting algorithm is independent of type of synchronization
16
Waiting Algorithms

- Blocking
  - Waiting processes are descheduled
  - High overhead
  - Allows processor to do other things
- Busy-waiting
  - Waiting processes repeatedly test a location until it changes value
  - Releasing process sets the location
  - Lower overhead, but consumes processor resources
  - Can cause network traffic
- Busy-waiting better when
  - Scheduling overhead is larger than expected wait time
  - Processor resources are not needed for other tasks
  - Scheduler-based blocking is inappropriate (e.g. in OS kernel)
- Hybrid methods: busy-wait a while, then block (sketched after this slide)
17
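
A minimal sketch of the hybrid method (spin a bounded number of times, then fall back to blocking). The spin limit and the use of a pthread mutex as the blocking primitive are my illustrative choices, not from the slides:

    #include <pthread.h>

    #define SPIN_LIMIT 1000   /* illustrative: busy-wait this many attempts before blocking */

    /* Busy-wait briefly in case the lock is about to be released; if it
     * stays held longer than roughly the scheduling overhead would cost,
     * give up the processor and block instead. */
    static void hybrid_lock(pthread_mutex_t *m) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;            /* acquired while busy-waiting */
        }
        pthread_mutex_lock(m);     /* block until the holder releases */
    }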
Role of System and User

- User wants to use high-level synchronization operations
  - Locks, barriers...
  - Doesn't care about implementation
- System designer: how much hardware support in implementation?
  - Speed versus cost and flexibility
  - Waiting algorithm difficult in hardware, so provide support for others
- Popular trend:
  - System provides simple hardware primitives (atomic operations)
  - Software libraries implement lock and barrier algorithms using these
  - But some propose and implement full-hardware synchronization
18
Mutual Exclusion: Hardware Locks

- Separate lock lines on the bus: holder of a lock asserts the line
  - Priority mechanism for multiple requestors
- Inflexible, so not popular for general-purpose use
  - few locks can be in use at a time (one per lock line)
  - hardwired waiting algorithm
19
First Attempt at Simple Software Lock

    lock:    ld   register, location  /* copy location to register */
             cmp  location, #0        /* compare with 0 */
             bnz  lock                /* if not 0, try again */
             st   location, #1        /* store 1 to mark it locked */
             ret                      /* return control to caller */

    unlock:  st   location, #0        /* write 0 to location */
             ret                      /* return control to caller */

- Problem: lock needs atomicity in its own implementation
  - Read (test) and write (set) of lock variable by a process not atomic
- Solution: atomic read-modify-write or exchange instructions
  - atomically test value of location and set it to another value, return success or failure somehow
20
Atomic Exchange Instruction

- Atomic operation:
  - Value in location read into a register
  - Another value (function of value read or not) stored into location
- Simple example: test&set
  - Value in location read into a specified register
  - Constant 1 stored into location
  - Successful if value loaded into register is 0
  - Other constants could be used instead of 1 and 0
- Can be used to build locks
21
Simple Test&Set Lock

    lock:    t&s  register, location
             bnz  lock                /* if not 0, try again */
             ret                      /* return control to caller */

    unlock:  st   location, #0        /* write 0 to location */
             ret                      /* return control to caller */

- Other read-modify-write primitives can be used too
  - Swap
  - Fetch&op
  - Compare&swap
    - Three operands: location, register to compare with, register to swap with
    - Not commonly supported by RISC instruction sets
- Can be cacheable or uncacheable (we assume cacheable)
22
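
For reference, the same test&set lock can be written portably with the C11 atomic_flag type, whose test-and-set operation compiles to an atomic exchange (or an LL-SC loop) on most machines; the function names here are mine:

    #include <stdatomic.h>

    /* atomic_flag_test_and_set atomically returns the old value and sets the
     * flag: exactly the t&s primitive above.  The lock is held while the flag
     * is set; unlock simply clears it, like "st location, #0". */
    static atomic_flag lock_var = ATOMIC_FLAG_INIT;

    static void ts_lock(void) {
        while (atomic_flag_test_and_set(&lock_var))
            ;   /* spin: each failed t&s still needs exclusive ownership of the block */
    }

    static void ts_unlock(void) {
        atomic_flag_clear(&lock_var);
    }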
T&S Lock Microbenchmark Performance

On SGI Challenge. Code: lock; delay(c); unlock;
Same total no. of lock calls as p increases; measure time per transfer

[Figure: time per lock transfer (μs) versus number of processors (1 to 15) for test&set with c = 0, test&set with exponential backoff and c = 3.64 μs, test&set with exponential backoff and c = 0, and the ideal case.]

- Performance degrades because unsuccessful test&sets generate traffic
23
Enhancements to Simple Lock Algorithm

- Reduce frequency of issuing test&sets while waiting
  - Test&set lock with backoff
  - Don't back off too much or will be backed off when lock becomes free
  - Exponential backoff works quite well empirically: i-th backoff delay = k*c^i
- Busy-wait with read operations rather than test&set
  - Test-and-test&set lock (a combined sketch follows this slide)
  - Keep testing with ordinary load
    - cached lock variable will be invalidated when release occurs
  - When value changes (to 0), try to obtain lock with test&set
    - only one attempter will succeed; others will fail and start testing again
24
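
A sketch of both enhancements together, a test-and-test&set lock with exponential backoff, using C11 atomics (the initial delay k, the factor c = 2, the cap, and the crude delay loop are my illustrative choices):

    #include <stdatomic.h>

    static atomic_int the_lock = 0;    /* 0 = free, 1 = held */

    /* Spin with ordinary loads so waiters hit in their own caches, attempt
     * the expensive read-modify-write only when the lock looks free, and on
     * failure back off exponentially (i-th failed attempt waits about k*c^i). */
    static void tts_lock(void) {
        unsigned delay = 16;                           /* k: initial backoff */
        for (;;) {
            while (atomic_load(&the_lock) != 0)
                ;                                      /* test: cache-local reads */
            if (atomic_exchange(&the_lock, 1) == 0)
                return;                                /* test&set won the race */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                      /* crude backoff delay */
            if (delay < (1u << 16))
                delay *= 2;                            /* c = 2: exponential backoff */
        }
    }

    static void tts_unlock(void) {
        atomic_store(&the_lock, 0);
    }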
Performance Criteria (T&S Lock)

- Uncontended Latency
  - Very low if repeatedly accessed by same processor; independent of p
- Traffic
  - Lots if many processors compete; poor scaling with p
  - Each t&s generates invalidations, and all rush out again to t&s
- Scalability
  - not asymptotic, within the realistic range
- Storage
  - Very small (single variable); independent of p
- Fairness
  - Poor, can cause starvation
25
Performance Criteria (T&S Lock)

- Test&set with backoff similar, but less traffic
- Test-and-test&set: slightly higher latency, much less traffic
- But still all rush out to read miss and test&set on release
  - Traffic for p processors to access once each: O(p^2) bus traffic
  - One invalidation and p-1 subsequent read misses
26
Improved Hardware Primitives: LL-SC

- Goals:
  - Test with reads
  - Failed read-modify-write attempts don't generate invalidations
- Load-Locked (or -linked), Store-Conditional
  - LL reads variable into register
  - Follow with arbitrary instructions to manipulate its value
  - SC tries to store back to location if and only if no one else has written to the variable since this processor's LL
    - If SC succeeds, means all three steps happened atomically
    - If it fails, doesn't write or generate invalidations (need to retry from the LL)
  - Success indicated by condition codes; implementation later
27
Simple Lock with LL-SC

    lock:    ll    reg1, location   /* LL location to reg1 */
             bnz   reg1, lock
             sc    location, reg2   /* SC reg2 into location */
             beqz  reg2, lock       /* if failed, start again */
             ret
    unlock:  st    location, #0     /* write 0 to location */
             ret

- Can do more fancy atomic ops by changing what's between LL & SC (see the fetch&inc sketch after this slide)
  - But keep it small so SC likely to succeed
  - Don't include instructions that would need to be undone (e.g. stores)
- SC can fail (without putting transaction on bus) if:
  - Detects intervening write even before trying to get bus
  - Tries to get bus but another processor's SC gets bus first
28
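
LL and SC are not exposed directly in portable C, but their usual compiled analog is a compare-and-swap retry loop; here is a hedged sketch of an atomic fetch&increment built that way, to illustrate "fancy atomic ops by changing what's between LL & SC" (names are mine):

    #include <stdatomic.h>

    /* The plain load plays the role of LL, the arithmetic is the "arbitrary
     * instructions in between", and compare_exchange plays the role of SC:
     * it stores only if nobody has changed the variable since it was read,
     * and reports failure so the loop can retry (like a failed SC). */
    static int fetch_and_inc(atomic_int *p) {
        int old = atomic_load(p);                              /* "LL" */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* "SC" failed: old now holds the current value; retry */
        return old;
    }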
More Efficient SW Locking Algorithms

- Problem with simple LL-SC lock
  - No invals on failure, but read misses by all waiters after both release and successful SC by winner (O(p) bus transactions per lock acquisition)
  - No test-and-test&set analog, but can use backoff to reduce burstiness
  - Doesn't reduce traffic to minimum, and not a fair lock
- Better SW algorithms for bus (for r-m-w instructions or LL-SC)
  - Only one process tries to get the lock upon release
    - valuable when using test&set instructions; LL-SC does it already
  - Only one process has a read miss upon release
    - valuable with LL-SC too
  - Ticket lock achieves the first
  - Array-based queuing lock achieves both
  - Both are fair (FIFO) locks as well
29
Ticket Lock

- Only one r-m-w (from only one processor) per acquire
- Works like the waiting line at a deli or bank
  - Two counters per lock (next_ticket, now_serving)
  - Acquire: fetch&inc next_ticket; wait for now_serving to equal it
    - atomic op when arriving at lock, not when it's free (so less contention)
  - Release: increment now_serving
  - FIFO order, low latency for low contention if fetch&inc cacheable
- Still O(p) read misses at release, since all spin on same variable
  - like simple LL-SC lock, but no inval when SC succeeds, and fair
- Can be difficult to find a good amount to delay on backoff
  - to reduce the bursty read-miss traffic
  - exponential backoff not a good idea due to FIFO order
  - backoff proportional to the gap between your ticket and now_serving may work well
- Wouldn't it be nice to poll different locations ... (a C sketch of the ticket lock follows this slide)
30
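
A minimal C11 sketch of the ticket lock (type and function names are mine): fetch&inc on next_ticket hands out tickets in arrival order, and release simply bumps now_serving.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&inc'd once by each arriving acquirer */
        atomic_uint now_serving;   /* advanced only by the releaser */
    } ticket_lock_t;               /* initialize both counters to 0 */

    static void ticket_acquire(ticket_lock_t *l) {
        /* one r-m-w per acquire, performed on arrival rather than on release */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;   /* all waiters spin on the same variable: O(p) read misses per release */
    }

    static void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->now_serving, 1);   /* wake the next ticket holder */
    }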
Ticket Lock

[Figure: ticket lock in operation. An arriving processor performs fetch&inc on next_ticket, obtaining ticket n+6, while the processors holding tickets n through n+5 spin reading now_serving, which currently equals n.]
31
Array-based Queuing Locks

- Waiting processes poll on different locations in an array of size p (a C sketch follows this slide)
- Acquire
  - fetch&inc to obtain address on which to spin (next array element)
  - ensure that these addresses are in different cache lines or memories
- Release
  - set next location in array, thus waking up process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in ticket lock, but O(p) space per lock
- Good performance for bus-based machines
- Not so great for non-cache-coherent machines with distributed memory
  - array location I spin on not necessarily in my local memory (solution later)
32
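
A C11 sketch of the array-based queuing lock; P, the per-slot padding, and the names are my illustrative choices. Each waiter spins on its own block-aligned flag, and release touches only the next waiter's flag:

    #include <stdatomic.h>

    #define P      16   /* number of processors, as in this lecture's runs */
    #define BLOCK  64   /* one flag per cache block, to avoid false sharing */

    typedef struct {
        struct {
            atomic_int must_wait;                    /* 0 means "go" */
            char pad[BLOCK - sizeof(atomic_int)];
        } slot[P];
        atomic_uint next_slot;                       /* fetch&inc'd to hand out spin locations */
    } array_lock_t;

    static void array_lock_init(array_lock_t *l) {
        for (int i = 0; i < P; i++)
            atomic_store(&l->slot[i].must_wait, i != 0);  /* only slot 0 starts open */
        atomic_store(&l->next_slot, 0);
    }

    /* Acquire: claim the next slot and spin on it; only that one line moves. */
    static unsigned array_acquire(array_lock_t *l) {
        unsigned me = atomic_fetch_add(&l->next_slot, 1) % P;
        while (atomic_load(&l->slot[me].must_wait))
            ;                                        /* O(1) traffic per acquire */
        atomic_store(&l->slot[me].must_wait, 1);     /* re-arm my slot for its next use */
        return me;                                   /* caller passes this to release */
    }

    /* Release: wake exactly the next waiter, giving FIFO order. */
    static void array_release(array_lock_t *l, unsigned me) {
        atomic_store(&l->slot[(me + 1) % P].must_wait, 0);
    }

As with the slide's description, this assumes at most P processes contend for the lock at once.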
Array-based Queuing Locks

[Figure: array-based queuing lock in operation. An arriving processor performs fetch&inc to claim the next spin location (n+6), while the processors that claimed locations n through n+5 each spin on their own array element; only one element is set to let its owner proceed.]
33
Lock Performance on SGI Challenge

Loop: lock; delay(c); unlock; delay(d);

[Figure: time per lock transfer (μs) versus number of processors (1 to 15) for the array-based lock, the LL-SC lock, the LL-SC lock with exponential backoff, the ticket lock, and the ticket lock with proportional backoff, under three conditions: (a) null critical section (c = 0, d = 0), (b) critical section (c = 3.64 μs, d = 0), and (c) delay (c = 3.64 μs, d = 1.29 μs).]
34
Lock Performance on SGI Challenge

- Simple LL-SC lock does best at small p due to unfairness
  - Not so with delay between unlock and next lock
- Need to be careful with backoff
- Ticket lock with proportional backoff scales well, as does array lock
- Methodologically challenging, and need to look at real workloads
35
Point-to-Point Event Synchronization

- Software methods:
  - Busy-waiting: use ordinary variables as flags (sketched after this slide)
  - Blocking: use semaphores
- Hardware support: full-empty bit with each word in memory
  - Set when word is "full" with newly produced data (i.e. when written)
  - Unset when word is "empty" due to being consumed (i.e. when read)
  - Natural for word-level producer-consumer synchronization
    - producer: write if empty, set to full; consumer: read if full, set to empty
  - Hardware preserves atomicity of bit manipulation with read or write
  - Problem: flexibility
    - multiple consumers, or multiple writes before consumer reads?
    - needs language support to specify when to use
    - composite data structures?
36
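
Without hardware full-empty bits, the busy-waiting flag method amounts to a shared flag with release/acquire ordering; here is a minimal C11 sketch of single-producer, single-consumer word-level synchronization (variable and function names are mine):

    #include <stdatomic.h>

    static int        data;         /* the produced word */
    static atomic_int full = 0;     /* software stand-in for a full-empty bit */

    /* Producer: wait until "empty", write the word, then mark it "full". */
    static void produce(int v) {
        while (atomic_load_explicit(&full, memory_order_acquire))
            ;
        data = v;
        atomic_store_explicit(&full, 1, memory_order_release);
    }

    /* Consumer: busy-wait until "full", read the word, then mark it "empty". */
    static int consume(void) {
        while (!atomic_load_explicit(&full, memory_order_acquire))
            ;                       /* ordinary-variable flag, as on the slide */
        int v = data;
        atomic_store_explicit(&full, 0, memory_order_release);
        return v;
    }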