張文賢 Homework Report Title: ??? (Course: SoC Hw/Sw Codesign)



A Framework for Coarse-Grain
Optimizations in the
On-Chip Memory Hierarchy
張文賢
Dept. of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, R.O.C.
Email: [email protected]
Introduction (1/2)
- Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks.
  - Further useful patterns emerge with a macroscopic, coarse-grain view.
- Several recently proposed techniques exploit this coarse-grain behavior for coherence traffic reduction and prefetching.
  - To exploit coarse-grain behavior, we must extend conventional caches with additional coarse-grain tracking and management structures.
  - These structures increase overall cost and complexity.
- Coarse-grain tracking and management no longer need to be an afterthought, since multi-megabyte caches have become commonplace.
Introduction (2/2)
- Coarse-grain tracking: tracking information about multiple blocks belonging to coarser memory regions.
  - Region: a contiguous portion of memory.
- Coarse-grain management: managing the corresponding blocks of a region.
- Region information and management have been shown to facilitate:
  (1) Performance, bandwidth, and power improvements for snoop-coherent shared-memory multiprocessors.
  (2) Prefetching for applications with demanding memory footprints.
- These techniques rely on two types of information:
  (1) Whether any block in a region is cached.
  (2) Which specific blocks of a region are cached.
- They also require support for selectively fetching or invalidating the blocks of a region.
RegionTracker
- RegionTracker (RT): a dual-grain cache design.
  - Coarse-grain functionality: region-level management actions such as region invalidation, lookup, migration, and replacement.
  - Communication still uses fine-grain blocks, avoiding a bandwidth explosion.
- RT also has other desirable features:
  - Reduced area overhead.
  - Little impact on performance.
  - A single lookup determines which, if any, blocks of a region are cached and where.
  - High associativity is not necessary.
- RT requires less area to achieve a nearly identical miss rate (within 1%), compared to a block-centric conventional cache of the same data capacity.
RegionTracker Requirements (1/2)
- The starting point for the RT cache design is a conventional cache whose performance and area have been tuned appropriately.
  - Only the tag array is replaced, with a structure that can inspect and manipulate regions.
- The RT cache must meet the same requirements:
  1. Communication still uses fine-grain blocks.
  2. The miss rate does not increase.
  3. The cache latency does not increase.
  4. The cache area does not increase.
  5. Lookups do not require higher associativity.
  6. No additional cache accesses are needed as a result of regular cache operation (e.g., for replacements).
  7. Banking and interleaving are possible.
RegionTracker Requirements (2/2)
- In addition, the new design provides the following region-level functionality:
  1. A single lookup can determine:
     - Whether a region is cached (for snoop reduction).
     - Which blocks of a region are cached and where (for prefetching).
  2. Support for region invalidation, migration, and replacement:
     - For on-chip streaming and coherence optimizations.
     - For optimizing data placement by moving regions close to where they are accessed.
RegionTracker Structure
- In this paper, the L2 cache configuration is:
  - Size: 8MB
  - 16-way set-associative
  - Block size: 64 bytes
  - Physical address: 50 bits
  - Region size: 1KB
- The tag array of the conventional L2 cache is replaced with RT (see the sketch after the figure below).
[Figure: A conventional L2 cache (Tag Array + Data Array) serving several L1 caches, compared with a RegionTracker L2 in which the tag array is replaced by the RVA, ERB, and BST alongside the same Data Array.]
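As a quick illustration (ours, not from the paper), the following Python sketch derives the structure sizes and address-field widths implied by the configuration above; all variable names are illustrative.

```python
# Derived parameters for the L2 configuration above (illustrative arithmetic only).
CACHE_SIZE     = 8 * 1024 * 1024   # 8MB of data
ASSOCIATIVITY  = 16                # data-array ways
BLOCK_SIZE     = 64                # bytes per block
REGION_SIZE    = 1024              # bytes per region (1KB)
PHYS_ADDR_BITS = 50

blocks_per_region = REGION_SIZE // BLOCK_SIZE             # 16 -> 16 BLOFs per region vector
total_blocks      = CACHE_SIZE // BLOCK_SIZE              # 131072 cached blocks
data_sets         = total_blocks // ASSOCIATIVITY         # 8192 data-array sets

block_offset_bits    = (BLOCK_SIZE - 1).bit_length()          # 6: bits 5-0 select a byte in a block
block_in_region_bits = (blocks_per_region - 1).bit_length()   # 4: bits 9-6 select a block in a region

print(blocks_per_region, total_blocks, data_sets, block_offset_bits, block_in_region_bits)
```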
Conventional Cache Design
- Conventional caches need higher associativity as the cache size grows, in order to keep the miss rate from increasing.
Region Vector Array
- Region Vector Array (RVA):
  - Comprises region vectors.
  - Its organization is independent of the data array.
- Each RVA entry contains:
  - A region tag and LRU information.
  - Several BLOFs, one per block in the region, each identifying in which data way the corresponding block is cached, if any.
[Figure: Physical address fields (Region Tag: bits 49-21, RVA index: bits 20-10, block: bits 9-6) and the Region Vector Entry format: a region tag that is matched against the address, plus one BLOF per block, each holding a valid bit (V) and a data-way pointer.]
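A minimal Python sketch of the address split and RVA entry layout just described, using the bit positions shown in the figure; the split_address function and RegionVectorEntry class are our own illustrative names, and the valid bit is folded into a None way pointer.

```python
BLOCKS_PER_REGION = 16   # 1KB region / 64-byte blocks

def split_address(paddr):
    """Split a 50-bit physical address into the RegionTracker lookup fields."""
    block_in_region = (paddr >> 6) & 0xF      # bits 9-6: selects one BLOF in the region
    rva_index       = (paddr >> 10) & 0x7FF   # bits 20-10: selects one of 2K RVA sets
    region_tag      = paddr >> 21             # bits 49-21: compared against stored region tags
    return region_tag, rva_index, block_in_region

class RegionVectorEntry:
    """One RVA entry: a region tag, LRU state, and one BLOF per block in the region."""
    def __init__(self, region_tag=None):
        self.region_tag = region_tag
        self.lru = 0
        # Each BLOF records the data-array way holding the block, or None if not cached
        # (the valid bit is represented by the None value in this sketch).
        self.blofs = [None] * BLOCKS_PER_REGION

    def block_way(self, block_in_region):
        return self.blofs[block_in_region]

# Example: find the fields that would drive a lookup for one address.
tag, idx, blk = split_address(0x2ABCD12345)
print(hex(tag), idx, blk)
```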
Evicted Region Buffer
- Evicted Region Buffer (ERB):
  - Evicted RVA entries are copied into it.
  - It eliminates the need for multiple simultaneous block evictions.
  - The ERB is not a victim buffer, since it does not contain any data blocks.
  - A small ERB (e.g., 12 entries) is sufficient to avoid performance losses.
- The ERB eagerly evicts blocks from the oldest one third of its entries.
  - Eager evictions ensure that an empty ERB entry is available whenever a region is evicted from the RVA.
- A standard back-pressure mechanism is used when no empty entry is available: the cache stalls until one of the entries becomes available.
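A rough sketch of the ERB behavior described above, reusing the RegionVectorEntry sketch from the previous slide; the entry count, the evict_block callback, and the retirement details are illustrative assumptions rather than the paper's exact mechanism.

```python
from collections import deque

ERB_ENTRIES = 12   # e.g., 12 entries, as suggested above

class EvictedRegionBuffer:
    """FIFO of region vectors evicted from the RVA (no data blocks are stored)."""
    def __init__(self, evict_block):
        self.entries = deque()
        self.evict_block = evict_block   # callback that evicts one block from the data array

    def insert(self, region_vector):
        if len(self.entries) >= ERB_ENTRIES:
            return False                 # no empty entry: apply back-pressure, stall the cache
        self.entries.append(region_vector)
        self._eager_evict()
        return True

    def _eager_evict(self):
        # Eagerly evict blocks belonging to the oldest third of the entries, so an
        # empty ERB entry is available whenever the RVA evicts another region.
        for rv in list(self.entries)[: ERB_ENTRIES // 3]:
            for blk, way in enumerate(rv.blofs):
                if way is not None:
                    self.evict_block(rv.region_tag, blk, way)
                    rv.blofs[blk] = None
        # Retire entries once all of their blocks have been evicted.
        while self.entries and all(w is None for w in self.entries[0].blofs):
            self.entries.popleft()
```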
Block Status Table
- Block Status Table (BST):
  - Stores block status information.
  - Only needed for accesses that require additional state information (e.g., dirty or exclusive).
- The BST uses the same indexing scheme as the data array, but it is an un-tagged structure.
- Addresses that map to a specific BST set can map to multiple different RVA sets.
  - To avoid searching multiple RVA sets when performing block replacements, optional BST backpointers store the RVA index bits that are not contained in the BST index (bits 20 and 21 in the example), and only for blocks that are resident in the cache.
- The BST relies on the result of an RVA or ERB lookup to perform the final selection of a single way from the BST.
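A toy model of the BST's role as an un-tagged status store; the class and field names are ours, and the optional backpointers mentioned above are not modelled.

```python
class BlockStatusTable:
    """Un-tagged per-block status store, indexed like the data array (set, way)."""
    def __init__(self, num_sets, num_ways):
        # Only status bits (e.g., dirty, exclusive) are kept; no tags are stored.
        self.status = [[{"dirty": False, "exclusive": False}
                        for _ in range(num_ways)] for _ in range(num_sets)]

    def read(self, set_index, way_from_rva_or_erb):
        # The BST cannot select a way on its own (it has no tags); the final way
        # selection comes from the parallel RVA or ERB lookup result.
        return self.status[set_index][way_from_rva_or_erb]

# Example for the 8MB, 16-way, 64-byte-block configuration: 8192 sets x 16 ways.
bst = BlockStatusTable(num_sets=8192, num_ways=16)
print(bst.read(set_index=42, way_from_rva_or_erb=7))
```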
RegionTracker Lookup Procedure
- When servicing a cache request, there are several different scenarios, as described in Figure 2.
[Figure 2. RegionTracker lookup procedure]
RegionTracker Lookup Procedure
1. A hit for both the region and the block:
   - The access proceeds in parallel to the RVA, ERB, and BST.
   - On a block hit, the result of the RVA or ERB lookup determines which way in the BST is selected in the final stage of the BST access.
2. An RVA hit and a block miss:
   - A block from the same data set must be replaced.
   - This process determines the address of the victim block so it can be evicted.
   - The two RVA entries (for the requested and victim blocks) are updated to reflect their new states.
3. A region miss (no matching RVA or ERB entry is found):
   - A victim RVA entry is selected and copied to the ERB.
   - Once the RVA entry has been copied to the ERB, a new entry can be allocated in its place for the requested region.
   - At this point the access proceeds as if it had originally found the newly initialized RVA entry.
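The following toy Python sketch walks through the three scenarios above on simplified dict-based structures; it is illustrative only, and victim selection and state updates are not modelled.

```python
def make_entry(tag, blocks_per_region=16):
    # Toy region vector: a tag plus one way pointer (or None) per block.
    return {"tag": tag, "blofs": [None] * blocks_per_region}

def lookup(rva_set, erb, region_tag, block):
    """The three scenarios of Figure 2 on toy structures (illustrative only)."""
    entry = next((e for e in rva_set if e["tag"] == region_tag), None) \
            or next((e for e in erb if e["tag"] == region_tag), None)

    if entry is None:
        # 3. Region miss: a victim RVA entry is copied into the ERB, and a fresh
        #    entry for the requested region is allocated in its place.
        erb.append(rva_set.pop(0))
        entry = make_entry(region_tag)
        rva_set.append(entry)

    way = entry["blofs"][block]
    if way is not None:
        # 1. Region hit and block hit: this way drives the final BST/data-array selection.
        return "block_hit", way

    # 2. Region hit, block miss: a victim block in the same data set is chosen and
    #    evicted, and both affected RVA entries are updated (not modelled here).
    return "block_miss", None

# Usage: one toy RVA set holding one region, and an empty ERB.
rva_set, erb = [make_entry(0x12345)], []
print(lookup(rva_set, erb, region_tag=0xABCDE, block=3))   # region miss, then block miss
```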
Optimization Framework: Snoop Elimination
- Avoiding broadcasts:
  - Remote nodes perform an RVA lookup to determine whether they have any blocks of the region cached.
  - If none do, the originating node marks the region as non-shared using an extra RVA bit.
  - Subsequent requests from the same node for other blocks in the region need not use broadcast.
  - If another node requests a block in the region, it uses a broadcast, which invalidates the non-shared status of the region.
- BlockScout, a new optimization:
  - A single sharing bit is added to each BLOF to indicate whether a specific block is shared or not.
  - It requires communicating sharing vectors in snoop replies.
  - In this case, the originating node collects a sharing vector on the first block access in the region.
  - Broadcasts are then avoided for accesses to non-shared blocks in the region.
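A toy sketch of the non-shared-region optimization described above; the class name and the set-based bookkeeping are illustrative assumptions, not the paper's implementation.

```python
class RegionSnoopFilter:
    """Toy model of the non-shared region bit used to avoid broadcasts."""
    def __init__(self):
        self.non_shared = set()   # region tags this node may access without broadcasting

    def needs_broadcast(self, region_tag):
        return region_tag not in self.non_shared

    def record_snoop_replies(self, region_tag, remote_has_region):
        # After a broadcast, each remote node reports (via its RVA lookup) whether it
        # caches any block of the region; if none do, mark the region non-shared.
        if not any(remote_has_region):
            self.non_shared.add(region_tag)

    def remote_request(self, region_tag):
        # Another node touching the region invalidates its non-shared status.
        self.non_shared.discard(region_tag)

# Usage: after a broadcast where no remote node caches the region, later accesses skip broadcast.
f = RegionSnoopFilter()
f.record_snoop_replies(0x1234, remote_has_region=[False, False, False])
print(f.needs_broadcast(0x1234))   # False
```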
Optimization Framework: Prefetching
- Stealth Prefetching technique:
  - Tracks which blocks in a given region have been previously fetched and uses this information to later prefetch those blocks.
  - RT BLOFs are extended with an additional bit vector to support this, implemented using the invalid states of each BLOF.
  - These bit vectors track which blocks have been cached in the past and which are currently present in the cache.
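A toy sketch of the per-region tracking that Stealth Prefetching relies on; the class and method names are ours, and the real design encodes this information in the BLOF invalid states rather than in separate bit vectors as modelled here.

```python
BLOCKS_PER_REGION = 16

class StealthPrefetchRecord:
    """Toy per-region record: which blocks were ever fetched and which are present now."""
    def __init__(self):
        self.previously_fetched = [False] * BLOCKS_PER_REGION
        self.present = [False] * BLOCKS_PER_REGION

    def on_fetch(self, block):
        self.previously_fetched[block] = True
        self.present[block] = True

    def on_evict(self, block):
        self.present[block] = False

    def prefetch_candidates(self):
        # Blocks that were useful before but are not currently cached.
        return [b for b in range(BLOCKS_PER_REGION)
                if self.previously_fetched[b] and not self.present[b]]

# Usage: blocks 2 and 5 were fetched and later evicted, so they become prefetch candidates.
rec = StealthPrefetchRecord()
for b in (2, 5):
    rec.on_fetch(b)
    rec.on_evict(b)
print(rec.prefetch_candidates())   # [2, 5]
```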
Previous Coarse-Grain Cache Design
- In previous designs, the blocks belonging to a region all map to a data set region (DSR).
  - A DSR is a contiguous portion of data-array sets.
- Sectored Cache (SC): allocates a single RVA entry for each data way in each DSR.
- Sector Pool Cache (SPC): adds flexibility by providing more RVA entries per DSR.
Decoupled-Sectored Cache
- Decoupled-Sectored Cache (DSC): closely related prior work.
  - Combines a region tag array (RTA) with a block status table (BST).
  - Each BST entry is extended with an 8-bit backpointer that uniquely identifies an RTA tag.
- DSC lookups proceed in two phases:
  - The first phase accesses the RTA, comparing region tags against the address, and, in parallel, accesses the BST.
  - The second phase compares the BST backpointers against the way of the matching region tag.
- DSC overcomes the poor miss rates and high associativity suffered by SC and SPC, respectively.
- The optimized DSC (oDSC) smooths out region evictions through the use of a modified ERB.
- RegionTracker-DSC (RT-DSC) extends oDSC by adding single-bit BLOFs to each region tag.
Simulation Environment
- This paper simulated a four-core CMP with a shared L2 cache based on the Piranha cache design [1], using the Flexus simulator [7].
- Table 1 details the processor cores.
Simulated Workloads
- Table 2 describes the simulated workloads:
  (1) The TPC-C v3.0 online transaction processing workload running on both IBM DB2 v8 ESE and Oracle 10g Enterprise Database Server.
  (2) Five queries from the TPC-H DSS workload running on IBM DB2 v8 ESE.
  (3) The SPECweb99 benchmark running over Apache HTTP Server v2.0 and Zeus Web Server v4.3.
RegionTracker vs. Sector Cache
- Figure 5 shows the relative miss rates of SC (left) and RT (right).
  - The SC design uses 512-byte sectors, because SC performs much worse with 1KB sectors (more than 50% miss rate in some benchmarks).
  - The RT design uses 2K to 64K 12-way set-associative RVA sets, depending on the cache size.
  - The y-axis shows the miss rate relative to a conventional cache with equal capacity.
[Figure 5. Relative miss rate for a sector cache with 512-byte sectors (left) and a RegionTracker cache with 1KB regions (right), with capacities of 8MB to 256MB.]
Area Estimation
- This paper estimated the relative sizes from full-custom implementations of the cache structures in a 130nm commercial technology.
- The 3rd and 4th columns respectively show:
  - The size relative to the conventional tag array.
  - The bit density relative to the conventional tag array.
- SC: 512-byte sectors.
- SPC: a 52-way set-associative tag array with 1KB sectors.
- The last three designs all have 2K 15-way set-associative sets with 1KB regions.
Relative Size and Miss Rate
- Figure 6 reports the relative miss rates of the SC, SPC, oDSC, RT, and RT-DSC designs as a function of the relative area of each design.
  - The x-axis shows the area of each structure relative to the area of a conventional tag array.
  - The y-axis shows the miss rate relative to the conventional cache.
  - Points beneath the Sqrt(2) Rule curve perform better than a conventional cache with the same total area.
Conclusion
- Compared to previous dual-grain tracking cache designs, RegionTracker offers the following advantages:
  (1) It avoids the significant miss rate increase suffered by sectored caches.
  (2) It avoids the highly associative RVA lookups required by the sector pool design.
- For an 8MB cache, RT:
  - Reduces area by 3-9%.
  - Uses lower associativity for lookups.
  - Increases the miss rate by less than 1%.
- As caches grow larger, coarse-grain management and optimization techniques will become increasingly important.