Caches and Memory Hierarchies
Deborah Katz and Ameya Ambardekar
7/24/2016
Last Level Caches
• Key Properties:
o “The capacity of the last-level cache primarily dictates the amount of the working set that can be stored on-chip.”*
o The last level cache is the last line of defense before you
need to go to memory
• Going to memory is a major performance hit
• Want to optimize last level caches to minimize requests
to memory
o Communications among caches and memory consume
on-chip bandwidth, which is a limited resource
o Last level caches are often shared among multiple
processors, reducing per-processor cache size
• Increases the need to use last level caches efficiently
*FLEXclusion: Balancing Cache Capacity and On-Chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, and Hyesoon Kim
Some Issues in Last Level Caches
• Space in the last level cache could be used more efficiently to store more data and reduce memory requests
• Cache replacement policies do not always choose
the optimal block to evict from the cache,
increasing cache misses
• Advantages and disadvantages of inclusive or
exclusive caches depend on program behavior,
which is not known at design time
• Last level cache misses can stall the pipeline longer
than necessary due to inefficient memory
bandwidth utilization
Writeback Efficiency
• In a writeback cache, dirty blocks are written to main memory when they are evicted
o Evicted dirty blocks are placed in a write buffer before being written to memory (a minimal sketch follows below)
• Writebacks can delay memory reads
• Spatial locality in writes improves write performance
o Row buffer locality and bank-level parallelism
o A write buffer is small, so buffered writes are unlikely to have good locality
• Inefficiencies in writebacks consume unnecessary
memory bandwidth
o If there were more writes to choose from, they might be scheduled more
efficiently
o Larger writeback caches are impractical
Improving Writeback Efficiency with Decoupled Last-Write Prediction
Zhe Wang, Samira M. Khan, and Daniel A. Jimenez
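
As a rough illustration of the write-buffer path described above, here is a Python sketch. The class names, buffer size, and the write_to_memory stub are assumptions for illustration, not the paper's design.

from collections import deque

def write_to_memory(address, data):
    """Stand-in for the memory controller actually performing the write (hypothetical)."""
    pass

class WriteBuffer:
    """Small buffer that holds dirty victim blocks until they are written to memory."""
    def __init__(self, capacity=4):            # assumed size; real write buffers are small
        self.capacity = capacity
        self.entries = deque()

    def add(self, address, data):
        if len(self.entries) == self.capacity:
            # A full buffer forces a drain, which can delay pending memory reads.
            self.drain_one()
        self.entries.append((address, data))

    def drain_one(self):
        address, data = self.entries.popleft()
        write_to_memory(address, data)

def on_eviction(block, write_buffer):
    """Only dirty (modified) blocks consume write bandwidth on eviction."""
    if block["dirty"]:
        write_buffer.add(block["address"], block["data"])

Because the buffer holds only a handful of writes, the memory controller has little freedom to group them for row-buffer locality, which is the inefficiency the following slides address.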
Proposed Solution: Early Writeback
• If you can predict which blocks will not be written again before they are evicted, you can write those blocks back early, at whatever time is convenient for the memory controller
• Keep a buffer for early writeback blocks
• The memory controller has more writes to choose
from when scheduling writes for good locality
• There is only a small penalty for incorrectly predicting that a block will not be written again
o The early writeback scheduler minimizes the interference with program
progress
Improving Writeback Efficiency with Decoupled Last-Write Prediction
Zhe Wang, Samira M. Khan, and Daniel A. Jimenez
Predicting “Last-Write” Cache Blocks
• When a block is known to be a “last write” block, its
associated program counter value is stored
• Future blocks associated with those program
counter values are predicted to also be last writes
• A prediction table, indexed by program counter values, is maintained (a minimal sketch follows below)
o The prediction table is designed to minimize overhead
Improving Writeback Efficiency with Decoupled Last-Write Prediction
Zhe Wang, Samira M. Khan, and Daniel A. Jimenez
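
The following is a minimal sketch of the PC-based prediction idea described above. The table size, counter width, and function names are assumptions for illustration; the paper's predictor and its overhead-reduction details differ.

TABLE_SIZE = 1024        # assumed number of entries
COUNTER_MAX = 3          # assumed 2-bit saturating confidence counters

prediction_table = [0] * TABLE_SIZE

def _index(pc):
    return pc % TABLE_SIZE

def train(pc, was_last_write):
    """Update the predictor once a block's true last-write status is known (at eviction)."""
    i = _index(pc)
    if was_last_write:
        prediction_table[i] = min(prediction_table[i] + 1, COUNTER_MAX)
    else:
        prediction_table[i] = max(prediction_table[i] - 1, 0)

def predict_last_write(pc):
    """Predict whether a write from this PC will be the block's last write before eviction."""
    return prediction_table[_index(pc)] >= 2

A block whose write is predicted to be its last can then be queued for early writeback, giving the memory controller a larger pool of writes to schedule for locality.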
Advantages / Disadvantages
• Advantages:
o Reduction of interference with read requests
• Disadvantages:
o Additional overhead to maintain the prediction data
o Additional space and power consumption for early writeback buffer
Improving Writeback Efficiency with Decoupled Last-Write Prediction
Zhe Wang, Samira M. Khan, and Daniel A. Jimenez
Last Level Cache Organization
• Inclusive caches
o Avoid cache coherence snooping requests
o Reduce write back requests
o However, inclusive caches waste space by storing a copy of data already
present on chip
• Exclusive caches
o Store more data than inclusive caches
o However, increase write back requests
• Non-inclusive caches may contain upper level
cache blocks
o But they are not required to
FLEXclusion: Balancing Cache Capacity and On-Chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, and Hyesoon Kim
Need for Dynamic Organization
• Exclusive caches perform better than non-inclusive
caches since they do not duplicate data
• However, LLC insertion traffic is much higher in an exclusive cache than in a non-inclusive one
• Exclusive and non-inclusive caches have similar
numbers of cache misses when the working set fits
into the LLC
o However, the extra space from an exclusive cache helps when the
working set grows larger
o Insertion traffic remains larger for exclusive caches
• This suggests that the best cache configuration depends on the workload and should be adaptive
FLEXclusion: Balancing Cache Capacity and On-Chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, and Hyesoon Kim
Proposed Solution: Flexible Cache Organization
• A FLEXclusion cache operates in two modes: exclusive mode and non-inclusive mode
• The mode is determined at run time by a traffic-monitoring and decision mechanism (a minimal sketch follows below)
o Extra hardware needed for the decision logic
o Hardware also needed for correct operation of the cache in exclusive
and non-inclusive modes
FLEXclusion: Balancing Cache Capacity and On-Chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, and Hyesoon Kim
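
A hedged sketch of the run-time decision logic follows. The threshold, monitoring interval, and names are assumptions for illustration, not the paper's exact mechanism, which the authors implement in hardware.

EXCLUSIVE, NON_INCLUSIVE = "exclusive", "non-inclusive"
TRAFFIC_THRESHOLD = 10_000       # assumed LLC insertions per monitoring interval

class FlexclusionModeController:
    def __init__(self):
        self.mode = EXCLUSIVE
        self.insertions = 0

    def record_llc_insertion(self):
        self.insertions += 1

    def end_of_interval(self):
        # High insertion traffic: give up some capacity to save on-chip bandwidth.
        # Low insertion traffic: keep the extra effective capacity of exclusive mode.
        self.mode = NON_INCLUSIVE if self.insertions > TRAFFIC_THRESHOLD else EXCLUSIVE
        self.insertions = 0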
Advantages / Disadvantages
• Advantages
o Can provide the performance benefits of an exclusive cache or the bandwidth savings of a non-inclusive cache, selected at run time
o Better performance for a wider range of programs
• Disadvantages
o Needs extra hardware
o Switching time effects?
FLEXclusion: Balancing Cache Capacity and On-Chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, and Hyesoon Kim
Cache Replacement Policies
• Attempt to choose a block for eviction that is
unlikely to be used in the near future
• The Least Recently Used (LRU) policy evicts the block that was read or written least recently
• LRU does not perform well for some types of data access patterns (a toy example follows below)
• For example:
o A burst of references to unrelated data, followed by a return to the
previous working set
o When the application’s working set is larger than the cache
• Incorrect prediction increases costly memory
accesses
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr., and Joel Emer
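
To make the second failure mode concrete, here is a small LRU model. The class and method names are assumptions for illustration: with a cyclic working set one block larger than the cache, LRU misses on every access.

from collections import OrderedDict

class LRUCache:
    """Toy fully-associative cache with least-recently-used replacement."""
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()

    def access(self, address):
        if address in self.blocks:
            self.blocks.move_to_end(address)        # mark as most recently used
            return "hit"
        if len(self.blocks) == self.num_blocks:
            self.blocks.popitem(last=False)         # evict the least recently used block
        self.blocks[address] = True
        return "miss"

cache = LRUCache(num_blocks=4)
trace = [0, 1, 2, 3, 4] * 3                         # working set of 5 blocks
results = [cache.access(a) for a in trace]
print(results.count("miss"), "misses in", len(trace), "accesses")   # 15 misses in 15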
Proposed Solution: Re-Reference Interval Prediction
• LRU predicts that when a block is used, it is likely to be re-used soon
o This translates to a “near-immediate re-reference interval”
• RRIP proposes making more sophisticated predictions (a minimal sketch of the static version follows below)
o Use bits in each cache block to store a re-reference interval prediction
o When blocks are initially inserted into the cache, predict a “long” re-reference interval
• “Long” re-reference interval is the equivalent of being near the end of
an LRU queue
• Shorten the predicted re-reference interval at every hit
o Static and dynamic variations
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr., and Joel Emer
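
The following is a minimal sketch of the static policy for one cache set, using 2-bit re-reference prediction values (RRPVs). The class layout and names are assumptions for illustration, and the hit update shown here sets the prediction to near-immediate on every hit.

RRPV_BITS = 2
RRPV_DISTANT = (1 << RRPV_BITS) - 1     # 3: predicted "distant" re-reference, evicted first
RRPV_LONG = RRPV_DISTANT - 1            # 2: "long" interval used on insertion

class SRRIPSet:
    """One set of a set-associative cache managed with static RRIP."""
    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [RRPV_DISTANT] * ways

    def access(self, tag):
        if tag in self.tags:
            self.rrpv[self.tags.index(tag)] = 0     # hit: predict near-immediate re-use
            return "hit"
        way = self._find_victim()
        self.tags[way] = tag
        self.rrpv[way] = RRPV_LONG                  # insert with a "long" prediction
        return "miss"

    def _find_victim(self):
        while True:
            for way, value in enumerate(self.rrpv):
                if value == RRPV_DISTANT:
                    return way
            self.rrpv = [v + 1 for v in self.rrpv]  # no distant block yet: age all and retry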
RRIP Variations
• Static version: Blocks inserted into the cache always
have a long re-reference prediction
o Shorten the predicted re-reference interval at every hit
• Dynamic version: Alters the re-reference prediction policy based on cache misses observed at run time (a set-dueling sketch follows below)
o Insert with a “long” predicted interval versus a “distant” interval
• A “distant” re-reference interval is the equivalent of being at the end
of an LRU queue, about to be evicted
o Behaves better when working set is larger than the cache
• Evicting some data almost immediately allows a portion of the
working set to stay in the cache
• Avoids thrashing
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr., and Joel Emer
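
The dynamic variation can be sketched as set dueling between the two insertion policies. The counter width, sampling probability, and names below are assumptions, and the RRPV constants match the static sketch above.

import random

RRPV_DISTANT, RRPV_LONG = 3, 2     # as in the static sketch above
PSEL_MAX = 1023
psel = PSEL_MAX // 2               # saturating policy-selection counter

def miss_in_long_insertion_sample_set():
    """A miss in a set dedicated to "long" insertion counts as a vote against it."""
    global psel
    psel = min(psel + 1, PSEL_MAX)

def miss_in_distant_insertion_sample_set():
    """A miss in a set dedicated to mostly-"distant" insertion counts as a vote against it."""
    global psel
    psel = max(psel - 1, 0)

def insertion_rrpv_for_follower_set():
    """Follower sets insert with whichever policy is currently missing less."""
    if psel > PSEL_MAX // 2:
        # Mostly-"distant" insertion: new blocks are evicted quickly, so part of an
        # oversized working set stays resident instead of thrashing the whole cache.
        return RRPV_DISTANT if random.random() > 1 / 32 else RRPV_LONG
    return RRPV_LONG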
Advantages / Disadvantages
• Advantages:
o Reduces cache misses and therefore costly memory accesses
• Disadvantages:
o Requires keeping a 2-bit register per cache block
o Requires additional hardware to implement the replacement policy
• Authors claim this does not increase cache access time
o Only performs well if the last level cache is large enough
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr., and Joel Emer
Data Compression & Caches
• To reduce off-chip memory accesses, we need to store more data on chip
• LLC size is limited by chip area, and per-core cache area is decreasing
• Use data compression to “increase” effective cache size
o An ideal compression technique needs to be fast, simple, and effective
o Requires low decompression latency
Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry
Proposed Solution: BΔI Compression
• Key observation: for many cache lines, the values stored in the line lie within a small range
• A substantial part of the cache line can be represented by storing one base value plus small per-word offsets (a minimal sketch follows below)
o Data value = base value + offset
• Many cache lines store small immediate values
o Use zero as a second base
Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry
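
The following is an illustrative sketch of the base + delta idea, not the full BΔI algorithm (which tries several base and delta sizes): encode word-sized values as one explicit base plus narrow signed deltas, falling back to an implicit zero base for small immediate values. Names and sizes are assumptions.

def try_compress(words, delta_bytes=1):
    """Try to encode a cache line (a list of word-sized ints) as base + narrow deltas.
    Returns (base, tagged_deltas) on success, or None if the line is incompressible."""
    limit = 1 << (8 * delta_bytes - 1)               # signed range for the chosen delta size
    base = words[0]
    tagged_deltas = []
    for w in words:
        if -limit <= w - base < limit:
            tagged_deltas.append(("base", w - base))  # small offset from the explicit base
        elif -limit <= w < limit:
            tagged_deltas.append(("zero", w))         # small immediate: offset from zero
        else:
            return None
    return base, tagged_deltas

line = [0x1000_0000, 0x1000_0008, 0x1000_0010, 7, 0, 0x1000_0018, 0x1000_0020, 3]
print(try_compress(line))   # pointers share one base; small integers use the zero base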
BΔI Operation
• Check whether a cache line can be compressed
• Requires an increased tag store size to hold extra tag bits
• Might need to evict more than one compressed cache line if a fetched cache line cannot be compressed
• For decompression, decide whether to add the base or zero to each stored delta, using information stored with the compressed data (see the sketch below)
Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry
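
Decompression is the inverse of the encoding sketched above: a per-word flag stored with the compressed line selects the explicit base or the implicit zero base, and the delta is added back. The representation and names are the same illustrative assumptions as before.

def decompress(base, tagged_deltas):
    """Rebuild the original words from the base and the (which_base, delta) pairs."""
    return [(base if which == "base" else 0) + delta for which, delta in tagged_deltas]

print(decompress(0x1000_0000, [("base", 0), ("base", 8), ("zero", 7), ("zero", 0)]))
# [268435456, 268435464, 7, 0]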
Results
• Advantages
o Doubles effective cache size for single-processor systems
• Effective cache size also increases for multi-core systems
o Decreases on-chip traffic
o Low decompression latency
o Simple implementation
• Disadvantages
o A single eviction can displace multiple compressed cache lines
o Might not work for all data sets
Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry
Questions?