Fetch Directed Instruction Prefetching

Glenn Reinman, Brad Calder,
Department of Computer Science and Engineering,
University of California San Diego
and Todd Austin
Department of Electrical Engineering and Computer Science,
University of Michigan
Introduction
• Instruction supply critical to processor performance
[Figure: front-end pipeline — Instruction Fetch → Issue Buffer → Execution Core]
– Complicated by instruction cache misses
– Instruction cache miss solutions:
• Increasing size or associativity of instruction cache
• Instruction cache prefetching
– Which cache blocks to prefetch?
– Timeliness of prefetch
– Interference with demand misses
Prior Instruction Prefetching Work
• Next line prefetching (NLP) (Smith)
– Each cache block is tagged with an NLP bit
– When block is accessed during a fetch
• NLP bit determines whether next sequential block is prefetched
– Prefetch into fully associative buffer
• Streaming buffers (Jouppi)
– On a cache miss, sequential cache blocks, starting with the block that missed, are prefetched into a buffer
• Buffer can use fully associative lookup
• Uniqueness filter can avoid redundant prefetches
• Multiple streaming buffers can be used together
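The streaming-buffer scheme above can be sketched roughly as follows; the block size, buffer depth, and structure names are illustrative assumptions, and prefetch latency is ignored:

```python
# Hypothetical sketch of a single streaming buffer (Jouppi-style)
# with a uniqueness filter and fully associative lookup.
BLOCK = 32          # cache block size in bytes (assumed)
DEPTH = 4           # blocks prefetched per miss (assumed)

class StreamingBuffer:
    def __init__(self):
        self.blocks = []            # fully associative lookup over these
        self.issued = set()         # uniqueness filter: blocks already prefetched

    def on_miss(self, miss_addr):
        """On an icache miss, prefetch sequential blocks starting at the miss."""
        base = miss_addr // BLOCK
        for i in range(DEPTH):
            blk = base + i
            if blk in self.issued:   # uniqueness filter avoids redundant prefetch
                continue
            self.issued.add(blk)
            self.blocks.append(blk)  # model: prefetch completes instantly

    def lookup(self, addr):
        """Fully associative probe of the buffer."""
        return (addr // BLOCK) in self.blocks
```

Multiple such buffers can run in parallel, each following a different miss stream.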
Our Prefetching Approach
• Desirable characteristics
– Accuracy of prefetch
• Useful prefetches
– Timeliness of prefetch
• Maximize prefetch gain
• Fetch Directed Prefetching
– Branch predictor runs ahead of instruction cache
– Instruction cache prefetch guided by instruction stream
Talk Overview
• Fetch Target Queue (FTQ)
• Fetch Directed Prefetching (FDP)
• Filtering Techniques
• Enhancements to Streaming Buffers
• Bandwidth Considerations
• Conclusions
Fetch Target Queue
[Figure: decoupled front end — Branch Predictor → FTQ → Instruction Fetch → Issue Buffer → Execution Core]
• Queue of instruction fetch addresses
• Latency tolerance
– Branch predictor can continue in the face of an icache miss
– Instruction fetch can continue in the face of a branch predictor stall
• When combined with high bandwidth branch predictor
– Provides stream of instr addresses far in advance of current PC
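A minimal model of the FTQ's decoupling role; the queue size and method names are assumptions:

```python
from collections import deque

# Illustrative model of a Fetch Target Queue decoupling the branch
# predictor from instruction fetch.
class FetchTargetQueue:
    def __init__(self, size=32):
        self.q = deque()
        self.size = size

    def enqueue(self, fetch_block_pc):
        """Branch predictor pushes predicted fetch addresses each cycle,
        even while the icache is stalled on a miss."""
        if len(self.q) < self.size:
            self.q.append(fetch_block_pc)
            return True
        return False          # FTQ full: predictor stalls

    def dequeue(self):
        """Instruction fetch drains addresses; it can keep consuming the
        queue even while the predictor itself stalls."""
        return self.q.popleft() if self.q else None
```

Because the predictor runs ahead, the queued addresses are exactly the stream a prefetcher can inspect.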
Fetch Directed Prefetching
[Figure: fetch directed prefetching — the Branch Predictor fills a 32-entry FTQ; the current FTQ prefetch candidate passes through filtration mechanisms into the prefetch enqueue/PIQ, and prefetched blocks are placed in a 32-entry fully associative buffer alongside Instruction Fetch]
• Stream of PCs contained in FTQ guides prefetch
– FTQ is searched in-order for entries to prefetch
– Prefetched cache blocks are stored in a fully associative queue
– Fully associative queue and instruction cache are probed in parallel
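The prefetch loop can be sketched as below; the replacement policy, per-cycle issue limit, and all names are assumptions, and prefetch timing is ignored:

```python
from collections import OrderedDict

# Hedged sketch of fetch directed prefetching: walk FTQ entries in order
# and prefetch blocks not already in flight.
class FDPrefetcher:
    def __init__(self, icache, buffer_entries=32):
        self.icache = icache                  # set of cache-resident block addrs
        self.cap = buffer_entries             # fully associative buffer capacity
        self.buf = OrderedDict()              # prefetch buffer, in insertion order

    def prefetch_from_ftq(self, ftq_entries, max_issue=1):
        """Each cycle, search the FTQ in order and enqueue up to
        max_issue prefetches; returns the number issued."""
        issued = 0
        for pc in ftq_entries:
            if issued >= max_issue:
                break
            if pc in self.buf:                # already prefetched, skip
                continue
            if len(self.buf) >= self.cap:
                self.buf.popitem(last=False)  # evict oldest entry (assumed policy)
            self.buf[pc] = True               # model: prefetch completes instantly
            issued += 1
        return issued

    def fetch(self, pc):
        """Probe the prefetch buffer and the icache in parallel."""
        return pc in self.buf or pc in self.icache
```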
Methodology
• SimpleScalar Alpha 3.0 tool set (Burger, Austin)
– SPEC95 C Benchmarks
• Fast forwarded past initialization portion of benchmarks
– Can issue 8 instructions per cycle
– 128 entry reorder buffer
– 32 entry load/store buffer
– Variety of instruction cache sizes
• 16K 2-way and 4-way associative
• 32K 2-way associative
• Tried both single and dual ported configurations
– Instruction cache size for this talk is 16K 2-way
– 32K 4-way associative data cache
– Unified 1MB 4-way associative second level cache
Bandwidth Concerns
• Prefetching can disrupt demand fetching
– Need to model bus utilization
• Modified SimpleScalar’s memory hierarchy
– Accurate modeling of bus usage
– Two configurations of L2 cache bus to main memory
• 32 bytes/cycle
• 8 bytes/cycle
– Single port on L2 cache
• Shared by both data and instruction caches
Performance of Fetch Directed Prefetch
[Chart: % IPC speedup of fetch directed prefetching on groff, gcc, go, m88ksim, perl, vortex, and the average; peak speedup 89.9%. Annotated bus utilizations: 66% at 32 bytes/cycle, 41% at 8 bytes/cycle]
Reducing Wasted Prefetches
• Reduce bus utilization while retaining speedup
– How to identify useless or redundant prefetches?
• Variety of filtration techniques
– FTQ Position Filtering
– Cache Probe Filtering
• Use idle instruction cache ports to validate prefetches
– Remove CPF
– Enqueue CPF
– Evict Filtering
Cache Probe Filtering
• Use instruction cache to validate FTQ entries for prefetch
– FTQ entries are initially unmarked
– If cache block is in i-cache, invalidate FTQ entry
– If cache block is not in i-cache, validate FTQ entry
• Validation can occur whenever a cache port is idle
– When the instruction window is full
– Instruction cache miss
• Lockup-free instruction cache
Cache Probe Filtering Techniques
• Enqueue CPF
– Only enqueue Valid prefetches
– Conservative, low bandwidth approach
• Remove CPF
– By default, prefetch all FTQ entries.
– If idle cache ports are available for validation
• Do not prefetch entries which are found Invalid
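Both CPF policies can be sketched together, treating the icache as a set of resident block addresses; the entry states and field names are assumptions:

```python
# Illustrative sketch of cache probe filtering over FTQ entries.
# Each entry is a dict with 'pc' and 'state'; states follow the talk
# (unmarked / valid / invalid), the rest is assumed.
def cache_probe_filter(ftq, icache, idle_ports, policy="remove"):
    """Validate FTQ entries with idle icache ports, then pick prefetches.

    remove  CPF: prefetch everything by default, but drop entries a
                 probe has proven are already cached.
    enqueue CPF: conservative, low bandwidth; prefetch only entries a
                 probe has proven will miss.
    """
    probes = 0
    for entry in ftq:
        if probes >= idle_ports:              # only idle ports may probe
            break
        if entry["state"] != "unmarked":
            continue
        probes += 1
        entry["state"] = "invalid" if entry["pc"] in icache else "valid"

    if policy == "remove":
        return [e["pc"] for e in ftq if e["state"] != "invalid"]
    else:  # enqueue CPF
        return [e["pc"] for e in ftq if e["state"] == "valid"]
```

With few idle ports, Remove CPF still prefetches unprobed entries while Enqueue CPF holds them back, which is why the latter uses less bus bandwidth.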
Performance of Filtering Techniques
[Chart: % IPC speedup at 8 bytes/cycle for No Filt, Remove CPF, and Enqueue CPF on groff, gcc, go, m88ksim, perl, vortex, and the average; annotated bus utilizations of 30% and 55%]
Eviction Prefetching Example
• If branch predictor holds more state than instruction cache
– Mark evicted cache blocks in branch predictor
– Prefetch those blocks when predicted
[Figure: eviction prefetching example — each instruction cache block records the FTB index that fetched it; on a cache miss or block eviction, the evict bit of that FTB entry is set, and the bit triggers a prefetch the next time that entry is predicted]
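The evict-bit mechanism above can be sketched minimally; here the FTB is reduced to a table of evict bits, and all names are assumptions:

```python
# Hedged sketch of eviction prefetching: the fetch target buffer (FTB)
# holds more state than the icache, so an FTB entry can remember that
# the cache block it maps was evicted.
class EvictFTB:
    def __init__(self):
        self.evict_bit = {}      # FTB index -> True if its block was evicted

    def on_evict(self, ftb_index):
        """Cache block evicted: mark it in the predictor's state."""
        self.evict_bit[ftb_index] = True

    def predict(self, ftb_index, pc):
        """On the next prediction of this entry, issue a prefetch if the
        evict bit is set, then clear the bit."""
        if self.evict_bit.pop(ftb_index, False):
            return ("prefetch", pc)
        return ("no_prefetch", pc)
```

Unlike Enqueue CPF, this needs no idle cache port: the prefetch fires as soon as the prediction is made.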
Performance of Filtering Techniques
[Chart: % IPC speedup at 8 bytes/cycle for No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict on groff, gcc, go, m88ksim, perl, vortex, and the average; annotated bus utilizations of 31% and 20%]
Enqueue CPF and Eviction Prefetching
• Effective combination of two low bandwidth approaches
• Both attempt to prefetch entries not in instruction cache
• Enqueue CPF needs to wait on idle cache port to prefetch
• Eviction Prefetching can prefetch when prediction is made
• Combined
– Eviction Prefetching gives basic coverage
– Enqueue CPF finds additional prefetches that Evict misses
Streaming Buffer Enhancements
• All configurations used uniqueness filters and fully
associative lookup
• Base configurations
– Single streaming buffer (SB1)
– Dual streaming buffers (SB2)
– Eight streaming buffers (SB8)
• Cache Probe Filtering (CPF) enhancements
– Filter out streaming buffer prefetches already in icache
– Stop filtering
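A hedged sketch of the stop-filtering enhancement, assuming the stream stops at the first sequential block a cache probe finds already resident; parameters are illustrative:

```python
# Illustrative CPF "stop" enhancement for a streaming buffer: probe each
# sequential candidate against the icache and end the stream at the
# first block already cached (the missing block itself is always fetched).
def stream_with_stop_filter(miss_block, icache_blocks, depth=4):
    """Return the block numbers a streaming buffer would prefetch."""
    prefetched = []
    for i in range(depth):
        blk = miss_block + i
        if i > 0 and blk in icache_blocks:   # probe hit: stop the stream
            break
        prefetched.append(blk)
    return prefetched
```

Stopping early keeps sequential streams from burning bus bandwidth on blocks the cache already holds.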
Streaming Buffer Results
[Chart: % IPC speedup at 8 bytes/cycle for SB1, SB2, SB8 and their stop-filtered variants (SB1-Stop, SB2-Stop, SB8-Stop) on groff, gcc, go, m88ksim, perl, vortex, and the average; annotated bus utilizations of 36% and 58%]
Selected Low Bandwidth Results
[Chart: % IPC speedup and % bus utilization at 8 bytes/cycle for NLP, SB1 Filt Stop, SB8 Filt Stop, No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict]
Selected High Bandwidth Results
[Chart: % IPC speedup and % bus utilization at 32 bytes/cycle for NLP, SB1 Filt Stop, SB8 Filt Stop, No Filt, Rem CPF, Enq CPF, Evicted, and Enq CPF + Evict]
Conclusion
• Fetch Directed Prefetching
– Accurate, just-in-time prefetching
• Cache Probe Filtering
– Reduces bus bandwidth of fetch directed prefetching
– Also useful for Streaming Buffers
• Evict Filter
– Provides accurate prefetching by identifying evicted cache blocks
• Fully associative versus in-order prefetch buffer
– Comparison available in upcoming tech report by end of year
Prefetching Tradeoffs
• NLP
– Simple, low bandwidth approach
– No notion of prefetch usefulness
– Limited timeliness
• Streaming Buffers
– Takes advantage of latency of a cache miss
– Can use low to moderate bandwidth with filtering
– No notion of prefetch usefulness
• Fetch Directed Prefetching
– Prefetch based on prediction stream
– Can use low to moderate bandwidth with filtering
– Most useful with accurate branch prediction