PPT slides - Electrical and Computer Engineering


Based on the paper by Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot, Fellow, IEEE, in IEEE Computer Architecture Letters, Volume 10, Issue 1

Outline: Introduction / Motivation and Background / Previous Work / Methodology / Prefetcher Performance / Energy Efficiency / Energy Consumption Analysis / Energy Efficiency Model / Conclusion

 Data prefetching is the process of fetching data that a program will need in advance, before the instruction that requires it is executed.

 It hides apparent memory latency.

 Data prefetching has been a successful technique in modern high-performance computing platforms.

 It was found, however, that prefetching significantly increases power consumption.

 Embedded mobile systems typically have space, cost, and power constraints.

 This means that they cannot afford power-hungry techniques.

 Hence, prefetching was long considered unsuitable for embedded systems.

 Embedded mobile systems are now powered by powerful processors, such as Nvidia's dual-core Tegra 2. Smartphone applications include web browsing, multimedia, gaming, and Webtop control, all of which demand very high performance from the computing system.

 To meet this requirement, methods such as prefetching, which were earlier shunned, can now be reconsidered.

 With better, more power-efficient technology, the energy consumption behavior may also have changed.

 For this reason, we study and model the energy efficiency of different types of prefetchers.

 Over the years, the main bottleneck preventing systems from speeding up has been the slowness of memory, not processor speed.

 Prefetching can be implemented in hardware by observing fetching patterns, such as prefetching the most recently used data first.

 Sequential prefetching takes advantage of spatial locality in memory.

 Tagged prefetching associates a tag bit with every memory block and prefetches based on its value.

Stride-based prefetching detects the stride pattern in the address stream like fetches from different iterations from the same loop.

Stream prefetchers try to capture sequential nearby misses and prefetch an entire block at a time.

Correlated prefetchers issue prefetches based on the previously recorded correlations between addresses of cache misses.
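To make the stride-based idea above concrete, here is a minimal software sketch of a stride detector. The class name, table layout, and two-confirmation rule are our own illustrative simplification, not the hardware design evaluated in the paper:

```python
class StridePrefetcher:
    """Minimal stride-based prefetcher sketch: one table entry per
    load PC, tracking the last address and the last observed stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a demand access; return an address to prefetch,
        or None if no stable stride has been observed yet."""
        last_addr, last_stride = self.table.get(pc, (None, None))
        prefetch = None
        if last_addr is not None:
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride  # same stride seen twice
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, None)
        return prefetch

pf = StridePrefetcher()
# A loop reading one 8-byte element per iteration (stride 8):
hits = [pf.access(pc=0x400, addr=0x1000 + 8 * i) for i in range(4)]
# After two identical strides, the detector starts issuing prefetches.
```

Real stride prefetchers add confidence counters and limited table capacity; this sketch only shows the core pattern-detection step.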

 There have been some studies focusing on improving the energy efficiency of hardware prefetching. PARE is one such technique: it constructs a power-aware hardware prefetching engine. It categorizes memory accesses into different groups and maintains an indexed, continuously updated hardware history table, and prefetching decisions are based on the information in this table.

 Modern embedded mobile systems execute a wide variety of workloads.

Table 1: Benchmark Set

Xerces-C++      SAX, DOM
MediaBench II   JPEG2000 Encode, JPEG2000 Decode, H.264 Encode, H.264 Decode
PARSEC          Fluidanimate, Freqmine

 The first set includes two XML data-processing benchmarks taken from Xerces-C++.

 They implement event-based parsing, which is data-centric (SAX), and tree-based parsing, which is document-centric (DOM).

 The second set is taken from MediaBench II, which provides application-level benchmarks representing multimedia and entertainment workloads based on the ISO JPEG-2000 standard.

 It also includes the H.264 video compression standard.

 The third set is taken from the PARSEC (Princeton Application Repository for Shared-Memory Computers) benchmark suite for multithreaded processors, which is used in many gaming applications.

Cache hierarchy indicates the level of cache that the prefetcher covers.

Prefetching degree shows whether the prefetching degree of the prefetcher is static or dynamically adjusted.

Trigger L1 and Trigger L2 respectively show what triggers a prefetch at each cache level.

Table 2: Summary of Prefetchers

     cache hierarchy   prefetching degree   trigger L1   trigger L2
P1   L1 & L2           Dynamic              miss         access
P2   L1                Static               miss         N/A
P3   L1 & L2           Dynamic              miss         miss
P4   L2                Static               N/A          miss
P5   L1 & L2           Dynamic              miss         miss
P6   L2                Static               N/A          access

 To study the performance of the selected prefetchers, we use CMP$im, a Pin-based multi-core cache simulator, to model high-performance embedded systems. The simulation parameters, shown in Table 3, resemble modern smartphone and e-book systems.

Table 3: Simulation Parameters

Frequency            1 GHz
Issue Width          4
Instruction Window   128 entries
L1 Data Cache        32 KB, 8-way, 1 cycle
L1 Inst. Cache       32 KB, 8-way, 1 cycle
L2 Unified Cache     512 KB, 16-way, 20 cycles
Memory               256 MB, 200-cycle latency

 To study the impact of prefetching on the energy consumption of the memory subsystem, we use CACTI to model the energy parameters of different technology implementations. In a simulator, a hardware prefetcher can be defined by a set of hardware tables, so its energy consumption can be modeled from the accesses to those tables.

 Prefetching techniques are effective, improving performance by more than 5% on average. In detail, the effectiveness of a prefetcher depends on both the prefetching technique itself and the nature of the application.

 P3 delivers the best average performance because it is the most aggressive prefetcher.

 The JPEG2000 decoding and encoding programs gain up to 22% in performance due to their streaming nature.

 Fig1 Performance Improvement

 We study the energy efficiency of both 90 nm and 32 nm technologies. The results are summarized in Figures 2 and 3 respectively.

 The baseline for comparison is energy consumption without any prefetcher, thus a positive number shows that with the prefetcher the system dissipates more energy.

 For instance, 0.1 means that with the prefetcher, the system dissipates 10% more energy compared to baseline.

 In 90 nm technology, most prefetchers significantly increase overall energy consumption, which confirms the findings of previous studies. Thus, in 90 nm technology, only very conservative prefetchers can be energy-efficient.

 Fig 2 90nm

 Fig 3 32nm

 In 32 nm technology, P4 is still the most energy efficient prefetcher, reducing overall energy by almost 4% on average; when running JPEG 2000 Decode, it achieves close to 10% energy saving.

 P2 and P3 are still the most energy-inefficient prefetchers due to their aggressiveness. However, in the worst case they consume only 25% extra energy, a four-fold reduction compared to the 90 nm implementations.

 Thus, most prefetchers are able to provide performance gains with less than 5% energy overhead, and P1 and P4 even result in 2% to 5% energy reductions.

 In Equation 1, the total energy consumption consists of two contributors: static energy (E_static) and dynamic energy (E_dynamic).

 E_static is the product of the overall execution time (t) and the system static power consumption (P_static).

 E_dynamic is the number of read/write memory accesses (N_m) multiplied by the energy dissipated on the bus and memory subsystem by each access (E'_m).

 When prefetchers accelerate the program, the reduced execution time reduces the static energy consumption.

 However, prefetchers generate a significant number of extra memory-subsystem accesses, leading to pure dynamic-energy overheads.

Equation 1: E = E_static + E_dynamic = (P_static x t) + (N_m x E'_m)
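Equation 1 translates directly into code. The numbers below are placeholders chosen only to illustrate how prefetching trades static energy against dynamic energy; they are not measurements from the paper:

```python
def total_energy(p_static, t, n_m, e_m):
    """Equation 1: E = E_static + E_dynamic = (P_static * t) + (N_m * E'_m)."""
    return p_static * t + n_m * e_m

# Hypothetical baseline vs. prefetching run (illustrative units):
e_base = total_energy(p_static=1.0, t=100.0, n_m=1000, e_m=0.05)  # 100 + 50
e_pref = total_energy(p_static=1.0, t=85.0,  n_m=1200, e_m=0.05)  #  85 + 60
# Prefetching cut execution time (less static energy) but added
# extra memory accesses (more dynamic energy); here the net effect
# is a saving, as in the 32 nm results.
```

Shrinking E'_m and raising the static share (as the 32 nm results show) is exactly what tips this trade-off in prefetching's favor.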

Table 4: Energy Categories

Dynamic memory     dynamic activities of the memory subsystem
Static memory      memory subsystem static power consumption
Dynamic prefetch   dynamic activities of the prefetcher
Static prefetch    prefetcher hardware static power consumption

Fig 4

 In 90 nm technology, dynamic energy contributes up to 66% of the total energy consumption: 14% from the prefetcher and 52% from the memory subsystem. Static energy accounts for only 34% of the total energy consumption.

 Hence, although the prefetchers are able to reduce execution time, there is little room for total energy saving, leading to energy inefficiency for most prefetchers in 90 nm implementations.

 In 32 nm technology, static energy contributes over 66% of the total energy consumption: 65% from the memory subsystem and 1% from the prefetcher hardware. Dynamic energy is far smaller than static energy. As a result, in 32 nm technology prefetchers become energy-efficient in many different cases.

 We propose an analytical model to evaluate energy efficiency.

Equation 2: E_no-pref > E_pref ?

 To simplify the model, we assume there is only one level in the memory subsystem. Compared to E_no-pref, E_pref has two more contributors: the static and dynamic energy consumption of the prefetcher hardware.

Equation 3: P_m-static x t1 + N_m1 x E'_m > P_m-static x t2 + N_m2 x E'_m + P_p-static x t2 + N_p x E'_p

Equation 4: (t1 - t2)/t1 > [(N_m2 - N_m1) x E'_m + N_p x E'_p + P_p-static x t2] / (P_m-static x t1)

 The left-hand side shows the performance gain resulting from prefetching.

 The numerator of the right-hand side contains three terms: the dynamic energy overhead incurred by the extra memory accesses, the dynamic energy of the prefetcher hardware, and the static energy consumption of the prefetcher hardware.

 The denominator of the right-hand side represents the static energy of the original design without prefetching.
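Equation 4 can be evaluated numerically. The parameter values below are arbitrary placeholders, used only to exercise both sides of the inequality and to confirm that it agrees with the direct energy comparison of Equation 3:

```python
# Illustrative (made-up) parameters:
p_m_static, p_p_static = 1.0, 0.02   # static power: memory, prefetcher
e_m, e_p = 0.05, 0.01                # per-access dynamic energy
t1, t2 = 100.0, 90.0                 # execution time without / with prefetching
n_m1, n_m2, n_p = 1000, 1100, 200    # memory accesses and prefetcher lookups

gain = (t1 - t2) / t1                           # left-hand side of Eq. 4
overhead = (n_m2 - n_m1) * e_m + n_p * e_p + p_p_static * t2
ratio = overhead / (p_m_static * t1)            # right-hand side of Eq. 4
energy_efficient = gain > ratio                 # Eq. 4 satisfied?
```

With these numbers a 10% speedup outweighs an 8.8% relative overhead, so the hypothetical prefetcher comes out energy-efficient; shrinking the speedup or growing the extra accesses flips the verdict.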

 As summarized in Equation 5, for a prefetcher to be energy-efficient, the performance gain (G) it brings must be greater than the ratio of the energy overhead (E_overhead) it incurs to the original static energy (E_no-pref-static).

Equation 5: G > E_overhead / E_no-pref-static

Equation 6: EEI = G - E_overhead / E_no-pref-static

 We define the metric Energy Efficiency Indicator (EEI) in Equation 6. A positive EEI indicates the prefetcher is energy-efficient; a negative EEI indicates it is not.
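The EEI metric reduces to a single expression. The inputs below are placeholder values for a hypothetical prefetcher, not the paper's measured results:

```python
def eei(gain, e_overhead, e_no_pref_static):
    """Equation 6: EEI = G - E_overhead / E_no-pref-static.
    Positive => the prefetcher is energy-efficient."""
    return gain - e_overhead / e_no_pref_static

# Hypothetical prefetcher: 10% speedup, 8.8 units of energy overhead
# against 100 units of baseline static energy:
indicator = eei(gain=0.10, e_overhead=8.8, e_no_pref_static=100.0)
# indicator > 0, so this prefetcher would be classified energy-efficient.
```

A designer can plug estimated gains and overheads into this one-liner before committing to a hardware design.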

 We have validated the analytical results against the empirical results shown in the table below, indicating the simplicity and effectiveness of our analytical model.

EEI values

       90 nm    32 nm
P1     -0.10     0.03
P2     -0.50    -0.05
P3     -0.69    -0.07
P4      0.03     0.05
P5     -0.27     0.00
P6     -0.31    -0.14

 With the new trend toward highly capable embedded mobile applications, it appears worthwhile to adopt high-performance techniques such as prefetching. They no longer impose a significant burden on energy consumption and should therefore be implemented.

 A simple analytical model has been demonstrated that effectively estimates the energy effects of prefetching.

 System designers can estimate the energy efficiency of their hardware prefetcher designs and make changes accordingly.