Nektarios Paisios. An Overview of the Techniques of Space and Energy Reduction


Nektarios Paisios.

An Overview of the Techniques of Space and Energy Reduction using data compression.

Introduction:

• Data Compression: a technique of data reduction.

• Space is costly.

• Study: "The cost is not for the processor but for the memory".

• In the past: memory provided enough space for the then-current application footprint,

• but disk space was too small to hold data.

• Compression: An old method of saving disk space.

• 1994 advertisement: Up to 50%-100% more free disk space.

Data compression: why?

• Explosive growth of disk space & drop in prices.

• But still: network links are slow & processors will soon reach their chip limit according to Moore's Law (2010).

• E.g. 2 GHz 4 years ago, 3 GHz 2 years ago, what now?

• New methods of speeding up need to be invented.

• By: bringing data closer to processor & by providing data faster & by making predictions more accurate.

• Compression can:

• Store more data in caches = closer to processor.

• Store more data in predictors = more accurate predictions.

• But faster?

Data compression: why?

• Clusters of computers built out of commodity equipment can work wonders.

• Less cost due to commoditisation, but:

• More energy needed & more energy means more cooling.

• Compression can:

• Reduce data structures = less energy requirements.

• But maintain equal performance.

Data Compression: what?

• Two forms: Lossless and lossy.

• Pictures, music = lossy; other files = lossless.

• Both useful in processors.

• Data caches: lossless because of program accuracy and integrity.

• Predictors: lossy up to an acceptable point. Why?

• Lossy = faster.

• Prediction needs to be faster than actual program execution

Data compression: how?

• Commonest method:

• Finds common patterns.

• Isolates them.

• Replaces them with a pointer.

• Example: The fat cat sat like that.

• at + the space: a common pattern.

• Three techniques proposed in processors:

• Pattern matching,

• pattern differentiation,

• common repeating bit elimination.
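
The pattern-and-pointer idea from the "fat cat" example can be sketched in a few lines. This is only an illustration: the function names and the choice of "#" as the pointer symbol are assumptions made here, not part of the original slides.

```python
def compress(text, pattern, token):
    """Replace every occurrence of `pattern` with a one-character pointer."""
    if token in text:
        raise ValueError("token must not already occur in the text")
    dictionary = {token: pattern}
    return dictionary, text.replace(pattern, token)


def decompress(dictionary, packed):
    """Expand every pointer back into the pattern it stands for."""
    for token, pattern in dictionary.items():
        packed = packed.replace(token, pattern)
    return packed


sentence = "The fat cat sat like that."
# "at" followed by a space is the common pattern named on the slide.
dictionary, packed = compress(sentence, "at ", "#")
print(packed)
print(len(sentence), "->", len(packed), "characters, plus the dictionary")
```

Note that the final "that." is left alone because it ends in "at." rather than "at" plus a space, and that the dictionary itself also has to be stored, so the saving only pays off when the pattern repeats often enough.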

Three techniques.

• 1. Pattern matching:

• Produces a dictionary of common items.

• But:

• How to make the dictionary (what to choose)?

• When to update it?

• How big is the dictionary (speed)?

• 2. Pattern differentiation:

• Finds common changes: increments - decrements.

• Used when we have series of data with an expected dispersion.

• eg. value predictors.

• Can it be used in other cases?

• 3. Common repeating bit elimination:

• Large memory blocks are all zeros.

• A series of 0s or a series of 1s can be replaced with a code.
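
Techniques 2 and 3 can be illustrated with a small sketch. The function names and the (bit, run length) code format are assumptions chosen for this example; real hardware schemes use fixed-width encodings.

```python
from itertools import groupby


def delta_encode(values):
    """Pattern differentiation: keep the first value, then only the changes."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original series by accumulating the changes."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_encode(bits):
    """Repeating bit elimination: replace each run with a (bit, length) code."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def run_decode(codes):
    """Expand each (bit, length) code back into the run it stands for."""
    return "".join(bit * length for bit, length in codes)


series = [100, 104, 108, 112, 116]   # e.g. a strided sequence of values
print(delta_encode(series))          # small, repeating differences
print(run_encode("0000000011110000"))
```

The delta form is useful when the differences are small and regular, which is why it suits series with an expected dispersion such as value predictors.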

Example1: Compression in caches and memory.

• From: Technical Report 1500, Computer Sciences Dept., UW Madison, April 2004.

• Aims:

• Increase effective memory size,

• reduce memory address and data bandwidth,

• increase effective cache size.

• Three approaches: dictionary, differential, significance.

• Dictionary: Common patterns are stored in a separate table and a pointer to them is placed in the compressed data.

• Differential: The common patterns are stored with the compressed data together with a list of differences amongst the various data parts.

• Significance: Not all bits are required and the upper ones are usually zero.

Dictionary-based compression in main memory.

• From: Technical Report 1500, Computer Sciences Dept., UW Madison, April 2004.

• IBM’s Memory Compression.

• IBM’s MXT technology [26] employs real-time main-memory content compression.

• Effectively doubles memory.

• Implemented in the Pinnacle single-chip memory controller.

• Franaszek, et al. = CRAM (MXT).

• Kjelso, et al. = X-Match hardware compression (4-byte entries).

• Lempel-Ziv (LZ77) sequential algorithm:

• Block-Referential Compression with Directory Sharing,

• divides the input data block (1 KB in MXT) into four 256-byte sub-blocks,

• cooperatively constructs dictionaries while compressing all sub-blocks in parallel.

Dictionary-based compression in caches.

• Lee, et al. = selectively compress L2 cache & memory blocks if they can be reduced to half their original size.

• (SCMS) use of the X-RL compression algorithm similar to X-Match.

• Speed considerations?

• Parallel decompression.

• Selective compression: not everything is compressed if it is not worth it.

• Chen, et al. = divide the cache into different sections of compressibility.

• Use of LZ algorithm.

Dictionary-based compression in caches.

• Frequent-Value-Based Compression.

• Yang and Gupta = analysed the SPECint95 benchmarks.

• Discovered that a small number of distinct values occupy a large fraction of memory access values.

• This value locality enabled the design of energy-efficient caches & compressed data caches.

• How? Each line in the L1 cache can be either one uncompressed line or two lines compressed to at least half based on frequent values.

• Zhang, et al. = value-centric data cache design called the frequent value cache (FVC).

• Added a small direct-mapped cache with values frequently found in the benchmarks.

• Greatly reduces the cache miss rate.

• Is this the right approach?
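
The "two lines compressed to at least half" test can be sketched as follows. The frequent-value list, the 2-bit index and the one flag bit per word are assumptions made for illustration; the slides do not give the actual encoding used by Yang and Gupta.

```python
# Hypothetical frequent values, e.g. obtained by profiling a benchmark.
FREQUENT_VALUES = [0x00000000, 0xFFFFFFFF, 0x00000001, 0x00000004]
INDEX_BITS = 2      # enough to index 4 frequent values
WORD_BITS = 32


def compressed_bits(line):
    """Bits needed if frequent words shrink to a flag plus a short index."""
    bits = 0
    for word in line:
        bits += 1  # flag: frequent or not
        bits += INDEX_BITS if word in FREQUENT_VALUES else WORD_BITS
    return bits


def fits_in_half(line):
    """True if the line compresses to at most half its original size."""
    return compressed_bits(line) <= len(line) * WORD_BITS // 2


mostly_frequent = [0, 0, 1, 0xFFFFFFFF, 4, 0, 0xDEADBEEF, 0]
mostly_unique = [0xDEADBEEF] * 4 + [0] * 4
print(compressed_bits(mostly_frequent), fits_in_half(mostly_frequent))
print(compressed_bits(mostly_unique), fits_in_half(mostly_unique))
```

Only lines that pass the test can share a physical cache line with a second compressed line; the rest are stored uncompressed.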

Differential-based compression in caches.

• Benini, et al. = uncompressed caches but compressed memory.

• Assumption: it is likely for data words in same cache line to have some bits in common.

• Zhang and Gupta = added 6 new data compression instructions to MIPS.

• New instructions:

• Compress 32-bit data and addresses into 15 bits.

• By common prefixes and narrow data transformations.

Significance-Based Compression.

• Most significant bits are shared amongst data and instruction & data addresses.

• Addresses: Why transfer long addresses with repeating patterns?

• Farrens and Park: "many address references transferred between processor and memory have redundant information in their high order (most significant) portions".

• Solution: cache these high-order bits in a group of dynamically allocated base registers,

• only transfer small register indexes rather than the high-order address bits between the processor and memory.

• Also: Citron and Rudolph store common high-order bits in address and data words in a table, • transfer only an index plus the low order bits between the processor and memory.
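
A rough sketch of the base-register idea follows. The 16-bit split, the four registers and the round-robin replacement are assumptions picked for the example; the sender side only is modelled, and a real design would keep a mirrored copy of the registers on the memory side.

```python
LOW_BITS = 16            # assumption: low-order bits are always sent in full
NUM_BASE_REGISTERS = 4   # assumption: tiny table, so the index is 2 bits


class HighOrderBitCache:
    """Sender-side table of recently seen high-order address parts."""

    def __init__(self):
        self.base_registers = [None] * NUM_BASE_REGISTERS
        self.next_victim = 0  # round-robin replacement (an assumption)

    def encode(self, address):
        high = address >> LOW_BITS
        low = address & ((1 << LOW_BITS) - 1)
        if high in self.base_registers:
            # Hit: send a small register index instead of the high bits.
            return ("index", self.base_registers.index(high), low)
        # Miss: send the full high part and remember it for next time.
        self.base_registers[self.next_victim] = high
        self.next_victim = (self.next_victim + 1) % NUM_BASE_REGISTERS
        return ("full", high, low)


sender = HighOrderBitCache()
addresses = (0x10001000, 0x10001004, 0x10002000, 0x7FFF0010)
messages = [sender.encode(a) for a in addresses]
for message in messages:
    print(message)
```

With these assumptions a hit transfers 2 index bits plus 16 low bits instead of a full 32-bit address.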

Significance-Based Compression.

• Canal, et al. = compress addresses & instructions.

• Keep only the significant bytes.

• Keep two to three extension bits to record the significant byte positions.

• Results: Reduces power consumption in the pipeline.

• Kant and Iyer = most significant bits of address can be predicted with high accuracy whilst data with lower accuracy.

• Simple solution:

• Compress individual cache lines on a word-by-word basis by storing common word patterns in a compressed format.

• Store each word with an appropriate prefix.
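
As an illustration of keeping only the significant bytes, the sketch below stores a signed 32-bit word as its low-order bytes plus a 2-bit extension field. Using the extension bits to hold the byte count minus one is an assumption made here, not the exact encoding of Canal et al.

```python
def significant_bytes(value):
    """Smallest n in 1..4 such that a signed 32-bit value fits in n bytes."""
    for n in range(1, 4):
        low, high = -(1 << (8 * n - 1)), (1 << (8 * n - 1)) - 1
        if low <= value <= high:
            return n
    return 4


def compress_word(value):
    """Return (extension_bits, payload) holding only the significant bytes."""
    n = significant_bytes(value)
    payload = (value & 0xFFFFFFFF).to_bytes(4, "little")[:n]
    return n - 1, payload  # 2 extension bits encode 1..4 bytes


def decompress_word(extension_bits, payload):
    """Sign-extend the stored bytes back to the full value."""
    assert len(payload) == extension_bits + 1
    return int.from_bytes(payload, "little", signed=True)


for value in (5, -1, 300, -129, 0x12345678):
    ext, payload = compress_word(value)
    print(value, "->", ext, payload.hex())
```

Small positive and negative values shrink to one or two bytes because their upper bytes are pure sign extension.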

Significance-Based Compression.

• Significant bits of processor structure entries are the same or are to be found in a small data set:

• A 256-entry BTB table can store 99% of the higher bits.

• Data bits: Why have multiple instances of them in every BTB, cache, etc., entry?

• Solution: Use multiple tables with different sizes,

• use pointers amongst the different table levels.

Frequent value caches: how do they work?

• They work as follows: the cache is divided into two arrays.

• One holds, let's say, the 5 lower bits and the other the 27 upper bits.

• If the lower 5 bits belong to a value which is frequent, the remaining 27 bits are not read from the cache; they are read instead from a smaller high-speed register file containing 2^5 = 32 entries.

• Otherwise, if the lower 5 bits do not belong to a frequent value, the remaining 27 bits are read from the second cache array.

• Thus, the actual value sharing is not done between the two cache tables but between 3 tables: the two cache tables and the smaller fast register file.

• Also, an extra flag bit is used to indicate whether a value is frequent. So, although there is always an indirection (either between the two cache tables, or between the first cache table and the special register file), and thus a delay, there is no extra pointer, and so the first of the two delays could in theory have been avoided.

• Why don't they do it more simply?
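
The lookup described above can be modelled roughly as below. This is a sketch under one reading of the description: it assumes that, for a frequent value, the 5 stored bits act as an index into the register file, which holds the complete value. The class and attribute names are hypothetical.

```python
LOW_BITS = 5
LOW_MASK = (1 << LOW_BITS) - 1


class FrequentValueArrays:
    """Two cache arrays (5-bit and 27-bit) plus a small register file."""

    def __init__(self, frequent_values, num_words):
        assert len(frequent_values) <= 1 << LOW_BITS  # at most 2^5 entries
        self.register_file = list(frequent_values)
        self.low = [0] * num_words       # first array: 5 bits per word
        self.flag = [False] * num_words  # extra flag bit per word
        self.high = [0] * num_words      # second array: 27 bits per word
        self.high_reads = 0              # counts accesses to the wide array

    def write(self, i, value):
        if value in self.register_file:
            self.flag[i] = True
            self.low[i] = self.register_file.index(value)
        else:
            self.flag[i] = False
            self.low[i] = value & LOW_MASK
            self.high[i] = value >> LOW_BITS

    def read(self, i):
        if self.flag[i]:
            # Frequent: the wide array is never touched.
            return self.register_file[self.low[i]]
        self.high_reads += 1
        return (self.high[i] << LOW_BITS) | self.low[i]


arrays = FrequentValueArrays([0, 1, 0xFFFFFFFF], num_words=4)
arrays.write(0, 0xFFFFFFFF)
arrays.write(1, 0x12345678)
print(hex(arrays.read(0)), hex(arrays.read(1)), arrays.high_reads)
```

The energy saving comes from the reads that skip the 27-bit array, which the `high_reads` counter makes visible.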

cache compression schemes: a summary.

• Cache compression schemes:

• 1. Indirect tags: "The IIC does not associate a tag with a specific data block; instead, each tag contains a pointer into a data array which contains the blocks."

• 2. FVC: "The Frequent Value Cache (FVC) replaces the top N frequently used 32-bit values with log(N) bits. When built as a separate structure the FVC can increase cache size if an entire cache block is made up of frequent values."

• Probability decreases though with larger caches, since larger cache = more uniqueness in the data.

• So suitable for small structures: the paper mentions only the L1 cache.

• 3. Dynamic Zero Compression (DZC): If a byte is all zero, then only one bit is used to signify this, saving the other 7 bits.

cache compression schemes: a summary.

• Cache compression schemes:

• 4. Separate banks: Kim et al. utilize the knowledge that most of the bits of values stored in an L1 data cache are merely sign bits.

• Their scheme compresses the upper portion of a word to a single bit if it is all 1s or all 0s.

• These high-order bits can be stored in a separate cache bank and accessed only if needed, or tags can be further modified to indicate whether an access to the second cache bank is necessary.

• 5. Alameldeen and Wood = algorithm called frequent pattern compression (FPC).

• What? An adaptive scheme: it sometimes compresses and sometimes does not, based on whether the penalty of decompression is more or less than the potential penalties incurred by cache misses. Very elegant!

• 6. "general compression algorithm. Cache lines are compressed in pairs (where the line address is the same except for the low-order bit). If both lines compress by 50% or more, they are stored in a single cache line, freeing a cache line in an adjacent set."

• The paper doesn't specify the compression algorithm though. Also, it does not specify how these lines are tagged differently.

cache compression schemes: a summary.

• G. Hallnor and S. K. Reinhardt, "A Compressed Memory Hierarchy using an Indirect Index Cache".

• Compression through an indirect table of tags.

• The cache is fully associative and lines are referenced through a pointer stored alongside the tag.

• More than one pointer slot is present to allow compression.

• Algorithm used = LZSS.

• Compression is carried out only if the line can be compressed to fit into the architecturally specified sector size; otherwise no compression.

• Attains greater than 50% of the performance gain of doubling the cache size, with about one tenth the area overhead.

• Disadvantages: The speed of LZSS is dependent on the number of simultaneous compressions.

• 6 bytes per tag extra for the pointers.

• Pointers may be unused if no compression is possible for that line.

• Resulting in: 134 KB for a 1 MB cache goes to the indirection table (tags, pointers, etc.). Bad!

cache compression schemes: a summary.

• N. Kim, T. Austin, T. Mudge, “Low-Energy Data Cache using Sign Compression and Cache Line Bisection”.

• How does the sign compression work?

• "each word fetched during a cache line miss is not changed,

• but the upper half-words are replaced by a zero or one when the upper half-words are all zeros or all ones respectively."

• Uses some sign compression bits instead.

• However:

• Allows uncompressed words in the line too.

• Extra bits to indicate: uncompressed / compressed / sign bits.

• Innovation:

• Two tags per cache line instead of one.
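
The sign compression step can be sketched as follows, following the slide's description literally (an upper half-word of all zeros or all ones becomes a single bit). The tuple layout standing in for the "uncompressed / compressed / sign" indicator bits is an assumption for this example.

```python
def sign_compress(word):
    """Replace an all-zero or all-one upper half-word by a single bit."""
    upper, lower = word >> 16, word & 0xFFFF
    if upper == 0x0000:
        return ("compressed", 0, lower)
    if upper == 0xFFFF:
        return ("compressed", 1, lower)
    return ("uncompressed", None, word)  # stored as a full 32-bit word


def sign_decompress(kind, sign_bit, payload):
    """Rebuild the 32-bit word from the indicator, sign bit and payload."""
    if kind == "uncompressed":
        return payload
    upper = 0xFFFF if sign_bit else 0x0000
    return (upper << 16) | payload


for word in (0x00001234, 0xFFFF8000, 0x12345678):
    print(hex(word), "->", sign_compress(word))
```

A compressed word needs only its lower 16 bits plus the indicator bits, which is what frees room in the line for another block.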

cache compression schemes: a summary.

• N. Kim, T. Austin, T. Mudge, “Low-Energy Data Cache using Sign Compression and Cache Line Bisection”.

• It allows energy savings, as only half the line is accessed, based on where the block in question is and given that sign compression is carried out.

• Energy: precharge using an MRU mechanism.

• Uses empty spaces in a block to store new blocks fetched having the same index.

• Reduces misses.

cache compression schemes: a summary.

• Alaa R. Alameldeen and David A. Wood, "Adaptive Cache Compression for High-Performance Processors", Computer Sciences Department, University of Wisconsin-Madison, {alaa, david}@cs.wisc.edu

• Adaptive simply means that sometimes you compress and sometimes not, based on two factors: 1. decompression latency and 2. avoided cache misses.

• If the cost of decompressing is more than the time that would be saved by avoiding potential misses if compression were used, then compression is not performed; otherwise compression is carried out.

• How? A single global saturating counter predicts whether the L2 cache should store a line in compressed or uncompressed form.

• Counter updated by the L2 controller.

• Based on whether "compression could (or did) eliminate a (potential) miss or incurs an unnecessary decompression overhead."

• Not a new idea though: virtual memory.
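
The global saturating counter can be modelled with a short sketch. The penalty values, the saturation limit and the rule "compress when the counter is not negative" are assumptions chosen for illustration, not figures taken from the paper.

```python
class CompressionPredictor:
    """Global saturating counter weighing avoided misses against overhead."""

    def __init__(self, miss_penalty=400, decompression_latency=5, limit=1000):
        self.miss_penalty = miss_penalty                    # cycles, assumed
        self.decompression_latency = decompression_latency  # cycles, assumed
        self.limit = limit                                  # saturation bound
        self.counter = 0

    def avoided_miss(self):
        """Compression eliminated (or could have eliminated) a miss."""
        self.counter = min(self.limit, self.counter + self.miss_penalty)

    def penalized_hit(self):
        """A hit paid the decompression latency without needing compression."""
        self.counter = max(-self.limit,
                           self.counter - self.decompression_latency)

    def should_compress(self):
        return self.counter >= 0


predictor = CompressionPredictor()
for _ in range(10):
    predictor.penalized_hit()
print(predictor.counter, predictor.should_compress())
predictor.avoided_miss()
print(predictor.counter, predictor.should_compress())
```

Because a miss costs far more than a decompression, a single avoided miss outweighs many penalized hits, which matches the cost comparison described above.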

cache compression schemes: a summary.

• L. Villa, M. Zhang, K. Asanovic, “Dynamic Zero Compression for Cache Energy Reduction”.

• "Dynamic Zero Compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte."

• Invisible to software.

• Basically, for each byte that is all zeros, it uses only one bit to store it.

• Disadvantages:

• Compression scheme for every byte of the cache line,

• increases the complexity of the cache architecture.

• Loses the opportunity to compress all ones, as it deals only with zeros.
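
A minimal model of the per-byte zero indicator is given below. It assumes one indicator bit per byte and counts the bits that must be read, which is a simplification of the energy argument; the function names are hypothetical.

```python
def dzc_encode(line):
    """Split a cache line into zero-indicator bits and the non-zero bytes."""
    zero_flags = [byte == 0 for byte in line]
    payload = bytes(byte for byte in line if byte != 0)
    return zero_flags, payload


def dzc_decode(zero_flags, payload):
    """Rebuild the line: a set flag yields a zero byte, else the next byte."""
    remaining = iter(payload)
    return bytes(0 if is_zero else next(remaining) for is_zero in zero_flags)


def bits_accessed(zero_flags, payload):
    """One indicator bit per byte, plus 8 bits for each non-zero byte."""
    return len(zero_flags) + 8 * len(payload)


line = bytes([0, 0, 0, 7, 0, 0, 1, 0])
flags, payload = dzc_encode(line)
print(bits_accessed(flags, payload), "bits instead of", 8 * len(line))
```

A byte of all ones gains nothing here and still costs 9 bits, which is the lost opportunity noted in the disadvantages.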

cache compression schemes: a summary.

• Chuanjun Zhang, Jun Yang and Frank Vahid, Low Static-Power Frequent-Value Data Caches.

• "Recently, a frequent value low power data cache design was proposed based on the observation that a major portion of data cache accesses involves frequent values that can be separated and stored only once."

• Basically it means that if a cache line value is "frequent" then you store it only once and you keep a pointer to it.

• Same idea.

• But: Proposes a method to shut off the unused bits to conserve energy in the case that a pointer is used.

• They are also proposing to reduce the latency of reading both a frequent value table and the ordinary cache.

Compression in caches: conclusion.

• Cache designers might consider using cache compression to increase cache capacity and reduce off-chip bandwidth.

• "A key challenge in the design of a compressed data store is the management of variable-sized data blocks."

• Generally, in the studies carried out, a lot of work has been done.

• Compression has been examined from a thousand angles.

Compression in caches: conclusion.

• Compression has been examined from a thousand angles: • Most are using the idea that 0s and 1s come together in great numbers.

• Some deal with common "frequent" bit patterns.

• However, I found none that shows a mechanism for finding those "frequent" values.

• They rely on profiling or on hard-coding the values, from what I understand.

• Marios paper?

Example 2: compression in predictors.

• Prediction important for high parallelism.

• Branches 15% of program.

• Pentium: 4k BTB.

• Do branch targets exhibit the same pattern behaviour as cache lines?

• Surely targets might not be as compressible as cache lines by the removal of leading zero bits but there might be pattern repetition in them.

Compression in predictors.

• Ideal: Dynamic allocation of target space according to the needs of each instruction.

• Rehashable BTB:

• Recognises polymorphic branches and stores them in a common BTB space.

• Value predictors:

• Gabriel H. Loh = stores values in separate tables based on length.

• Energy saving up to 25% & space up to 75%.

• However: Cannot be used with the BTB.

What did we do with the BTB?

• Mission: Minimize the waste of space in the BTB.

• Data compression to avoid duplicate entries, meaning bit sharing.

• How?

• Simple: Two-table structure.

Methodology:

• Aim: Find all Entries/branches that have the same or partially the same target.

• We used: a BTB with no replacements & a BTB with multiple tables.

Results.

• Questions:

• 1. What width will each table have?

• 2. How many entries?

• 3. How to join them up?

Q1: Bit Ranges.

• GCC95 results, BTB performance:

BTB type                   Num of branches that are correct hits   Performance
Normal BTB                 19164012                                87.8982%
BTB with no replacements   19892653                                91.2402%
Bits 1-16                  21802503                                99.9999%
Bits 5-20                  21791021                                99.9473%
Bits 9-24                  21706107                                99.5578%
Bits 13-28                 21304116                                97.714%
Bits 17-32                 20013651                                91.7951%
Bits 25-32                 21801499                                99.9953%
Bits 1-24                  21706107                                99.5578%

• Bits 1-24 & 25-32 give better performance than bits 1-16 & 17-32.

Q2: How much space for each table.

BTB type             1-32bit hits   %          25-32bit hits   %          1-24bit hits   %
4k normal            19164012       87.8982%   19164385        87.8999%   19847639       91.0337%
4k no replacement    19892653       91.2402%   21801499        99.9953%   21706107       99.5578%
2k normal            18571627       85.1811%   18571985        85.1827%   19248618       88.2862%
2k no replacement    19567018       89.7466%   21801090        99.9934%   21701108       99.5349%
1k normal            17522962       80.3713%   17523248        80.3726%   18187048       83.4172%
1k no replacement    18885228       86.6195%   21799372        99.9856%   21688593       99.4775%
512 normal           16149816       74.0732%   16150001        74.074%    16786062       76.9914%
512 no replacement   17720797       81.2787%   21795542        99.968%    21671880       99.4008%
256 normal           14327910       65.7168%   14328057        65.7174%   14924424       68.4528%
256 no replacement   15905413       72.9522%   21770601        99.8536%   21582680       98.9917%

• For the BTB without replacements, the critical point is at 256 places for the lower 8 bits: they are 8 bits after all (2^8 = 256).

Upper 24 bits very common!
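
One possible way of sharing the common upper bits is sketched below. It is a hypothetical model, not the structure actually evaluated: the first table keeps the lower 8 target bits plus a pointer per branch, the second table keeps each distinct upper 24-bit part once, and no replacement policy is modelled.

```python
LOW_BITS = 8
LOW_MASK = (1 << LOW_BITS) - 1


class TwoTableBTB:
    """First table: per-branch low bits + pointer. Second: shared upper bits."""

    def __init__(self, upper_capacity=256):
        self.upper = []                # distinct upper 24-bit target parts
        self.upper_capacity = upper_capacity
        self.entries = {}              # branch address -> (pointer, low bits)

    def update(self, branch_address, target):
        high, low = target >> LOW_BITS, target & LOW_MASK
        if high in self.upper:
            pointer = self.upper.index(high)       # share the existing copy
        elif len(self.upper) < self.upper_capacity:
            self.upper.append(high)
            pointer = len(self.upper) - 1
        else:
            return False                           # second table is full
        self.entries[branch_address] = (pointer, low)
        return True

    def predict(self, branch_address):
        if branch_address not in self.entries:
            return None
        pointer, low = self.entries[branch_address]
        return (self.upper[pointer] << LOW_BITS) | low


btb = TwoTableBTB()
btb.update(0x400100, 0x00401234)
btb.update(0x400200, 0x00401280)   # same upper 24 bits, stored only once
print(hex(btb.predict(0x400100)), len(btb.upper))
```

With a 256-entry second table the pointer is 8 bits wide, so each first-table entry carries 16 bits of target information instead of 32.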

Results.

Benchmark    BTB size    Correct hits in normal BTB   %          Correct hits in improved BTB   %
GCC95        8k places   19528170                     89.5684%   19384014                       88.9072%
GCC95        4k places   19164012                     87.8982%   19034779                       87.3054%
GCC95        2k places   18571627                     85.1811%   18456969                       84.6552%
MCF2000      8k places   149389263                    99.2259%   149389263                      99.2259%
MCF2000      4k places   149387899                    99.225%    149387899                      99.225%
MCF2000      2k places   147731903                    98.125%    147731903                      98.125%
Vortex2000   8k places   89371437                     86.9625%   88953923                       86.5563%
Vortex2000   4k places   88405444                     86.0226%   87995598                       85.6238%
Vortex2000   2k places   85185389                     82.8893%   84805232                       82.5194%

• Up to 80% of original size!

Costs.

Num of 1st table entries   Size of normal BTB   Size of improved BTB   Reduction
8k entries                 376832 bits          299520 bits            20.516%
4k entries                 192512 bits          156160 bits            18.882%
2k entries                 98304 bits           82432 bits             16.145%

• Generally reduces size requirements by roughly 16% to 20%.
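
The reduction column follows directly from the two size columns. The short check below recomputes it from the figures in the table; the variable and function names are chosen here for illustration.

```python
# (normal BTB bits, improved BTB bits) per first-table size, from the table.
sizes = {
    "8k": (376832, 299520),
    "4k": (192512, 156160),
    "2k": (98304, 82432),
}


def reduction_percent(normal_bits, improved_bits):
    """Percentage of the normal BTB size saved by the improved BTB."""
    return 100.0 * (normal_bits - improved_bits) / normal_bits


for entries, (normal_bits, improved_bits) in sizes.items():
    saved = reduction_percent(normal_bits, improved_bits)
    print(f"{entries} entries: {saved:.3f}% smaller")
```

The recomputed values agree with the listed percentages to within rounding in the third decimal place.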

Don't use the page number, but a pointer to it.

• André Seznec = Brilliant proposal.

• Caches: Relative size of addresses (tags) is huge especially in small blocks.

• Predictors: Accuracy affected due to large addresses (targets & tags).

• Curious finding: Addresses represented 3+ times, in cache tags, in instructions, in BTB, in TLB.

• Removed by:

• 1. Store page numbers only once.

• 2. Do not use the page number, but a pointer to it.

Don't use the page number, but a pointer to it.

• André Seznec = Brilliant proposal.

• How?

• Page number stored in a page number cache.

• Can be the TLB when virtual addresses are used, or another buffer if physical.

• Store 5-bit pointers in place of addresses.

• Reduces cache tags and predictor tags.

• Cache: If a page pointer is invalidated (page miss), all entries are invalidated.

• But: why not invalidate the BTB entries as well as the TLB ones?
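
The pointer-instead-of-page-number idea can be sketched as follows. The 4 KB page size, the round-robin replacement and the class name are assumptions for this example; the 5-bit pointer width is the one quoted on the slide.

```python
PAGE_BITS = 12       # assumption: 4 KB pages
POINTER_BITS = 5     # from the slide: 5-bit pointers, so 32 page numbers


class PageNumberCache:
    """Stores each page number once; other structures keep only a pointer."""

    def __init__(self):
        self.pages = [None] * (1 << POINTER_BITS)
        self.next_victim = 0   # round-robin replacement (an assumption)
        self.evicted = []      # pointers whose users must be invalidated

    def compress(self, address):
        """Return (pointer, page offset) in place of the full address."""
        page = address >> PAGE_BITS
        offset = address & ((1 << PAGE_BITS) - 1)
        if page in self.pages:
            return self.pages.index(page), offset
        slot = self.next_victim
        if self.pages[slot] is not None:
            # Entries in caches and predictors still pointing at this slot
            # would now refer to the wrong page and must be invalidated.
            self.evicted.append(slot)
        self.pages[slot] = page
        self.next_victim = (slot + 1) % len(self.pages)
        return slot, offset

    def expand(self, pointer, offset):
        """Rebuild the full address from a pointer and a page offset."""
        return (self.pages[pointer] << PAGE_BITS) | offset


page_cache = PageNumberCache()
pointer, offset = page_cache.compress(0x12345678)
print(pointer, hex(offset), hex(page_cache.expand(pointer, offset)))
```

A tag or target then needs 5 pointer bits plus the page offset rather than the full page number, at the price of invalidations whenever a page number is replaced.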

Don't use the page number, but a pointer to it.

• How does it compare?

• André Seznec: comparison with other schemes (isolated compression scheme vs. Seznec scheme):

• Isolated: 8-bit+ pointers. Seznec: 6-bit pointers according to the paper, not much.

• Isolated: touches only the targets. Seznec: touches both tags and targets.

• Isolated: predictor size dependent on address width. Seznec: predictor size independent of address width.

• Isolated: second-level table accessed every time. Seznec: table with page pointers accessed only when getting outside the processor, i.e. to RAM.

• Isolated: a specific predictor-only solution. Seznec: a BTB, cache and TLB solution.

• Isolated: not affected by page misses. Seznec: affected by misses though.

• Isolated: only the specific predictor is changed. Seznec: cache, BTB and even the program counter have to be modified to the new scheme for it to be effective.

Conclusion.

• Compression: a huge field and we have touched the surface.

• The key to a successful algorithm:

• 1. Speed, speed, speed!

• 2. Simple to implement in hardware.

• 3. Balances space & energy savings with overhead.

• Based on the above, are decision trees, classification algorithms, etc. worth it?

References:

• Technical Report 1500, Computer Sciences Dept., UW-Madison, April 2004.

• Po-Yung Chang, Eric Hao, Yale N. Patt, "Target Prediction for Indirect Jumps".

• André Seznec, "Don't use the page number, but a pointer to it".

• A. R. Alameldeen and D. Wood, "Adaptive Cache Compression for High-Performance Processors", Proc. of the 31st International Symposium on Computer Architecture, June 2004, pp. 212-223.

• G. Hallnor and S. K. Reinhardt, "A Compressed Memory Hierarchy using an Indirect Index Cache", Technical Report CSE-TR-488-04, 2004.

• L. Villa, M. Zhang, K. Asanovic, “Dynamic Zero Compression for Cache Energy Reduction”, In the proceedings of the 33rd International Symposium on Microarchitecture, Dec. 2000.

• P. R. Wilson, S. F. Kaplan, Y. Smaragdakis, “The Case for Compressed Caching in Virtual Memory Systems”, In the proceedings of USENIX 1999.

• J. Yang, R. Gupta, “Energy Efficient Frequent Value Data Cache Design”, In the proceedings of the 35th Annual International Symposium on Microarchitecture, 2002, (MICRO

• Y. Zhang, J. Yang, R. Gupta, “Frequent Value Locality and Value-Centric Data Cache Design”, In the proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, Nov. 2000.

• N. Kim, T. Austin, T. Mudge, “Low-Energy Data Cache using Sign Compression and Cache Line Bisection”, 2nd Annual Workshop on Memory Performance Issues, May 2002.

• Chuanjun Zhang, Jun Yang and Frank Vahid, Low Static-Power Frequent-Value Data Caches.

References:

• Li, T. & John, L. K. (2001). Rehashable BTB: An Adaptive Branch Target Buffer to Improve the Target Predictability of Java Code. The University of Texas at Austin.

• Sazeides Y. & Smith J. E. (1998). Implementations of the Context-Based Value Predictors. University of Wisconsin-Madison.

• Loh G. H. (2003). Width-Partitioned Load Value Predictors. Journal of Instruction-Level Parallelism. College of Computing, Georgia Institute of Technology, Atlanta.

• Gifford S. & Huang C.-W. & Yang Z. & Yu C. (2003). A Comprehensive Front-end Architecture for the VeriSimple Alpha Pipeline. University of Michigan.

• Yung R. (1996). Design of the UltraSPARC Instruction Fetch Unit. Sun Microsystems.

• Chang P.-Y. & Hao E. & Patt Y. N. (1997). Target Prediction for Indirect Jumps. Department of Electrical Engineering and Computer Science, the University of Michigan.

• Calder B. & Grunwald D. (1995). Next Cache Line and Set Prediction. Department of Computer Science, University of Colorado.

• McFarling S. (1993). Combining Branch Predictors. Western Research Laboratory, California.

• Hinton G. & Sager D. & Upton M. & Boggs D. & Carmean D. & Kyker A. & Roussel P. (2001). The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal Q1.

• Loh G. H. & Henry D. S. & Krishnamurthy A. (2003). Exploiting Bias in the Hysteresis Bit of Two-bit Saturating Counters in Branch Predictors. Journal of Instruction Level Parallelism.

• Kalla R. & Sinharoy B. & Tendler J. M. (2004). IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Computer Society.

• Arora K. & Sharangpani H. (2000). Itanium Processor Microarchitecture. IEEE Computer Society.

• Perleberg C. H. & Smith A. J. (1993). Branch Target Buffer Design and Optimization. IEEE Transactions on Computers.

Other interesting references:

• Gabriel H. Loh, "Simulation Differences Between Academia and Industry: A Branch Prediction Case Study". To appear in the International Symposium on Performance Analysis of Software and Systems (ISPASS), March 2005, Austin, TX, USA.

• Gabriel H. Loh, "The Frankenpredictor: Stitching Together Nasty Bits of Other Predictors". In the 1st Championship Branch Prediction Contest (CBP1), pp. 1-4, Dec 6, 2004, Portland, OR, USA. (Held in conjunction with MICRO-37.)

• Gabriel H. Loh, "The Frankenpredictor: Satisfying Multiple Objectives in a Balanced Branch Predictor Design". Invited to appear in the Journal of Instruction Level Parallelism (JILP).

• Gabriel H. Loh, Dana S. Henry, "Predicting Conditional Branches With Fusion-Based Hybrid Predictors". In the 11th Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 165-176, September 22-25, 2002, Charlottesville, VA, USA.