Effective Non-Blocking Cache Architecture for High

Dukki Hong1, Youngduke Seo1, Youngsik Kim2, Kwon-Taek Kwon3, Sang-Oak Woo3,
Seok-Yoon Jung3, Kyoungwoo Lee4, Woo-Chan Park1
1 Media Processor Lab., Sejong University
2 Korea Polytechnic University
3 SAIT of Samsung Electronics Co., Ltd.
4 Yonsei University
[email protected]
http://rayman.sejong.ac.kr
October 3, 2013


Introduction
Related Work
◦ Texture mapping
◦ Non-blocking Scheme

Proposed Non-Blocking Texture Cache
◦ The Proposed Architecture
◦ Buffers for Non-blocking scheme
◦ Execution Flow of The NBTC


Experimental Results
Conclusion

Texture mapping
◦ Core technique for 3D graphics
◦ Maps texture images to the surface

Problem: a huge amount of memory access is required
◦ Major bottleneck in graphics pipelines
◦ Modern GPUs generally use texture caches to solve this problem

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time

The visual quality of mobile 3D games has evolved enough
to be comparable with that of PC games.
◦ Detailed texture images
 e.g., Infinity Blade : 2048 [GDC 2011]
◦ Demand high texture mapping throughput
<Epic Games: Infinity Blade Series>
<Gameloft: Asphalt Series>

Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time

“Our approach”
In this presentation, we introduce a non-blocking texture
cache (NBTC) architecture
◦ Out-of-order (OOO) execution
◦ Conditional in-order (IO) completion
 IO completion for texture requests with the same screen coordinate, to
support the standard API effectively
Texture mapping

Texture mapping glues n-D images onto geometrical objects
◦ To increase realism
<Texture> <Object> <Texture Mapped Object>

Texture filtering

Texture filtering is an operation that reduces the texture aliasing
artifacts caused by texture mapping
Bi-linear filtering : four samples per texture access
Tri-linear filtering : eight samples per texture access
<Results of the texture filtering>
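
The sample counts above can be illustrated with a minimal Python sketch. It is not taken from the slides: the single-channel list-of-rows texture, the clamping at the edges, and the `u / 2, v / 2` mapping between mipmap levels are simplifying assumptions made only for illustration.

```python
import math

def bilinear_sample(texture, u, v):
    """Bi-linearly filter a single-channel texture (list of rows) at texel coords (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp the integer texel coordinates to the texture edges (assumed wrap mode).
    x0 = min(max(int(math.floor(u)), 0), width - 1)
    y0 = min(max(int(math.floor(v)), 0), height - 1)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx = min(max(u - x0, 0.0), 1.0)
    fy = min(max(v - y0, 0.0), 1.0)
    # Four texel reads: the "four samples per texture access".
    t00, t10 = texture[y0][x0], texture[y0][x1]
    t01, t11 = texture[y1][x0], texture[y1][x1]
    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy

def trilinear_sample(mip_a, mip_b, u, v, level_fraction):
    """Blend bilinear samples from two adjacent mipmap levels (2 x 4 = 8 texel reads)."""
    a = bilinear_sample(mip_a, u, v)
    b = bilinear_sample(mip_b, u / 2, v / 2)  # coarser level: assumed half resolution
    return a * (1 - level_fraction) + b * level_fraction
```

Each texture access therefore turns into four or eight texel addresses, and any of them may hit or miss in the texture cache.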

Cache performance study
◦ In [Hakura and Gupta 1997], the performance of a texture cache was
measured with regard to various benchmarks
◦ In [Igehy et al. 1999], the performance of a texture cache was studied with
regard to multiple pixel pipelines

Pre-fetching scheme
◦ In [Igehy et al. 1998], the latency generated during texture cache misses can be
hidden by applying an explicit pre-fetching scheme

Survey of texture cache
◦ The introduction of a texture cache and the integration of texture cache
architectures into modern GPUs were studied in [Doggett 2012]

Non-blocking cache (NBC)
◦ Allows subsequent cache requests to proceed while a cache miss is being handled
 Reduces miss-induced processor stalls
◦ Kroft first published an NBC that uses miss information/status holding
registers (MSHRs) to keep track of multiple outstanding misses [Kroft 1981]
<Blocking Cache: the CPU stalls for the miss penalty on each miss>
<Non-blocking Cache with MSHR: miss penalties overlap; the CPU stalls only when the result is needed>
<Kroft’s MSHR: block valid bit, block request address, and comparator, plus a valid bit, destination, and format field for each word (Word 0 ... Word n)>
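
A minimal Python sketch of the bookkeeping that an MSHR performs is given below. It is a toy model, not Kroft's hardware design: the class and method names are hypothetical, and the rule that a second miss to an already-pending word must stall is an assumption based on each word having only one destination field.

```python
class MSHRFile:
    """Toy miss status holding registers: one entry per outstanding block."""

    def __init__(self, num_entries, words_per_block):
        self.num_entries = num_entries
        self.words_per_block = words_per_block
        self.entries = {}  # block address -> {word offset: destination}

    def record_miss(self, address, destination):
        """Return 'primary', 'secondary', or 'stall' for a missing word address."""
        block, word = divmod(address, self.words_per_block)
        if block in self.entries:
            # One destination field per word: a repeated miss to the same
            # word cannot be recorded (assumption of this sketch).
            if word in self.entries[block]:
                return "stall"
            self.entries[block][word] = destination
            return "secondary"  # merged into the outstanding request
        if len(self.entries) == self.num_entries:
            return "stall"  # no free MSHR
        self.entries[block] = {word: destination}
        return "primary"  # a new memory request is needed

    def fill(self, block):
        """Block arrived: free the entry and return the waiting (word, destination) pairs."""
        return sorted(self.entries.pop(block, {}).items())
```

Only a "stall" result blocks the processor; primary and secondary misses let later requests continue.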

Performance study with regard to non-blocking cache
◦ Comparison of four different MSHRs [Farkas and Jouppi 1994]
 Implicitly addressed MSHR : Kroft’s MSHR
 Explicitly addressed MSHR : complementary version of the implicitly addressed MSHR
 In-cache MSHR : each cache line serves as an MSHR
 The first three MSHRs : only one entry per missed block address
 Inverted MSHR : a single entry per possible destination
 The number of entries = usable registers in a processor (possible destinations)
<Inverted MSHR organization: one entry per destination (Reg #1, Reg #2, ..., PC), each with a valid bit, request address, comparator, format, and address in block; a match encoder outputs the matching register number>
◦ Recent study of a high-performance out-of-order (OOO) processor using the latest SPEC
benchmark [Li et al. 2011]
 A hit-under-two-misses non-blocking cache improved the OOO processor’s performance
17.76% more than the one using a blocking data cache
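
The inverted organization can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name, the dictionary layout, and the destination names are hypothetical, and the format field is omitted.

```python
class InvertedMSHR:
    """Toy inverted MSHR: one entry per possible destination (registers, PC)."""

    def __init__(self, destinations):
        # destination -> (block address, offset in block), or None when invalid
        self.entries = {d: None for d in destinations}

    def record_miss(self, destination, block, offset):
        """Mark the destination as waiting for a word of the given block."""
        self.entries[destination] = (block, offset)

    def match(self, block):
        """Compare all valid entries with an arriving block and release the matches."""
        matched = []
        for destination, pending in self.entries.items():
            if pending is not None and pending[0] == block:
                matched.append((destination, pending[1]))
                self.entries[destination] = None  # entry is free again
        return matched
```

Because every destination has its own entry, the number of misses that can be outstanding is bounded by the number of destinations rather than by the number of blocks.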
Proposed Non-Blocking Texture Cache

<Proposed NBTC architecture: shading unit, retry buffer lookup, texture address generation, L1 cache, hit/miss router, waiting list buffer, block address buffer, request address queue, DRAM or L2 cache, MUX, texture mapping pipeline, retry buffer update>
This architecture includes a typical blocking texture cache (BTC) as the
level 1 (L1) cache, as well as three kinds of buffers for the non-blocking
scheme:
◦ Retry buffer
 Guarantees IO completion
◦ Waiting list buffer
 Keeps track of miss information
◦ Block address buffer
 Removes duplicate block addresses
Retry Buffer
<Retry buffer entry: Valid Bit | Screen Coordinate | Texture Request (Filtering Information, Texture Address) | Ready Bit | Filtered Texture Data>

Feature
◦ The most important property of the retry buffer (RB) is its support of IO
completion
 The RB stores fragment information in input order
 The RB is designed as a FIFO

Data Format of each RB entry
◦ Valid bit : 0 = empty, 1 = occupied
◦ Screen coordinate : screen coordinate (x, y) for the output display unit
◦ Texture request : filtering information and texture address
◦ Ready bit : 0 = invalid filtered texture data, 1 = valid filtered texture data
◦ Filtered texture data : texture data resulting from the completed texture mapping
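
The RB entry format and its FIFO behaviour can be sketched in Python as below. This is a simplified model, not the authors' implementation: all names are hypothetical, the valid bit is represented implicitly by an entry being present in the queue, and completion is strictly in order for every entry rather than conditional on the screen coordinate.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RetryEntry:
    screen_xy: Tuple[int, int]
    texture_request: str                    # stands in for filtering info + texture address
    ready: bool = False                     # ready bit
    filtered_texel: Optional[float] = None  # filtered texture data

class RetryBuffer:
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.fifo = deque()

    def push(self, screen_xy, texture_request):
        """Store fragment information in input order; return None when full."""
        if len(self.fifo) == self.capacity:
            return None
        entry = RetryEntry(screen_xy, texture_request)
        self.fifo.append(entry)
        return entry

    def complete(self, entry, filtered_texel):
        """Texture mapping finished (possibly out of order): set the ready bit."""
        entry.filtered_texel = filtered_texel
        entry.ready = True

    def drain(self):
        """Release ready entries from the head only, so results leave in input order."""
        out = []
        while self.fifo and self.fifo[0].ready:
            entry = self.fifo.popleft()
            out.append((entry.screen_xy, entry.filtered_texel))
        return out
```

A later entry that finishes first simply waits in the queue until every older entry has also become ready.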
Waiting List Buffer
<Waiting list buffer entry: Valid Bit | Texture ID | Filtering Information | Texel Addr 0…7 | Texel Data 0…7 | Ready Bit 0…7>

Features
◦ The waiting list buffer (WLB) is similar to the inverted MSHR proposed in
[Farkas and Jouppi 1994]
 The WLB stores information on both missed and hit addresses
 The texture address in the WLB plays a role similar to that of a register in the
inverted MSHR

Data format of each WLB entry
◦ Valid bit : 0 = empty, 1 = occupied
◦ Texture ID : ID number of a texture request
◦ Filtering information : the information needed to complete the texture mapping
◦ Texel addr N : the texture address of the required texel data
◦ Texel data N : the texel data of texel addr N
◦ Ready bit N : 0 = invalid texel data N, 1 = valid texel data N
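
A minimal Python sketch of a WLB built from the entry fields listed above follows. The class, the dictionary-based storage, and the rule that a texel address divided by the block size gives the block address are assumptions made for illustration only.

```python
class WaitingListBuffer:
    """Toy WLB: tracks hit and missed texels of texture requests that missed."""

    def __init__(self, capacity=32, words_per_block=8):
        self.capacity = capacity
        self.words_per_block = words_per_block
        self.entries = {}  # texture ID -> entry fields (presence = valid bit)

    def insert(self, texture_id, filtering_info, texel_addrs, hit_data):
        """hit_data maps texel address -> data for the texels that hit in L1."""
        if len(self.entries) == self.capacity:
            return False  # WLB full: the pipeline would have to stall
        self.entries[texture_id] = {
            "filtering": filtering_info,
            "addrs": list(texel_addrs),
            "data": [hit_data.get(a) for a in texel_addrs],
            "ready": [a in hit_data for a in texel_addrs],
        }
        return True

    def fill_block(self, block, block_data):
        """A block arrived (address -> data); return the IDs of fully ready entries."""
        done = []
        for texture_id, entry in self.entries.items():
            for i, addr in enumerate(entry["addrs"]):
                if not entry["ready"][i] and addr // self.words_per_block == block:
                    entry["data"][i] = block_data[addr]
                    entry["ready"][i] = True
            if all(entry["ready"]):
                done.append(texture_id)
        return done

    def pop(self, texture_id):
        """Invalidate a ready entry and hand its data to the texture mapping unit."""
        entry = self.entries.pop(texture_id)
        return entry["filtering"], entry["data"]
```

Storing the hit texels next to the missed ones means that a ready entry carries everything the filtering stage needs, with no second cache lookup.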
<Block address buffer diagram: miss address, block address entries, block request address, request address queue>
Feature
◦ The block address buffer issues DRAM accesses sequentially for the texel
requests that caused a cache miss
 The block address buffer removes duplicate DRAM requests
 When the data are loaded, all the requests whose DRAM accesses were removed are found
 The block address buffer is designed as a FIFO
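
The duplicate removal can be sketched in Python as follows. This is a hedged illustration: the names are hypothetical, and keeping a block marked as pending until its data has been loaded is an assumption of this sketch, not something the slides state.

```python
from collections import deque

class BlockAddressBuffer:
    """Toy BAB: FIFO of missed block addresses with duplicate removal."""

    def __init__(self, words_per_block=8):
        self.words_per_block = words_per_block
        self.fifo = deque()   # blocks waiting to be requested, oldest first
        self.pending = set()  # blocks queued or in flight (assumed policy)

    def push_miss(self, texel_addr):
        """Queue the block of a missed texel; return False for a duplicate."""
        block = texel_addr // self.words_per_block
        if block in self.pending:
            return False
        self.pending.add(block)
        self.fifo.append(block)
        return True

    def next_request(self):
        """Issue the oldest block request, or None when nothing is queued."""
        return self.fifo.popleft() if self.fifo else None

    def loaded(self, block):
        """The block's data arrived; later misses to it need a new request."""
        self.pending.discard(block)
```

Only the first miss to a block reaches DRAM or the L2 cache, while the later misses wait for the same data.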
Start
◦ Execute lookup RB
◦ Generate texture addresses
◦ Execute tag compare with texel requests
 All hits : hit handling case
 A miss occurred : miss handling case
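
The tag-compare step that selects between the two cases can be sketched in Python as below. The function name, the set of cached block addresses that stands in for the L1 tag array, and the block size are hypothetical choices for this illustration.

```python
def route_texture_request(texel_addrs, cached_blocks, words_per_block=8):
    """Tag-compare every texel address of one texture request.

    Returns ('hit', hits, []) when all texels hit, otherwise
    ('miss', hits, misses) so that hit texels can still be read from L1.
    """
    hits, misses = [], []
    for addr in texel_addrs:
        block = addr // words_per_block
        if block in cached_blocks:
            hits.append(addr)
        else:
            misses.append(addr)
    if not misses:
        return "hit", hits, []
    return "miss", hits, misses
```

For example, with blocks 0 and 1 cached, a request for texels 0, 1, 8, and 9 takes the hit handling case; with only block 0 cached, texels 8 and 9 are routed to the miss handling case.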
Hit handling case
◦ Read texel data from L1 cache
◦ Input texel data to texture mapping unit via MUX
◦ Execute texture mapping
◦ Update RB
Miss handling case
◦ “Concurrent execution”
 Read hit texel data from L1 cache
 Input missed texture requests to WLB
 Input missed texel requests to BAB
◦ Remove duplicate texel requests
◦ Process the next texture request
Miss handling case (continued)
◦ Complete memory request
◦ Forward the loaded data to WLB and cache
◦ Determine the ready entry in WLB
◦ Invalidate the entry
◦ Input texel data to texture mapping unit via MUX
◦ Execute texture mapping
◦ Update RB
Update RB
◦ Determine the ready entry in RB
◦ Determine whether IO completion is satisfied
◦ Forward the ready entry to the shading unit
◦ Process the next fragment information
Experimental Results

Simulator configuration
◦ mRPsim : announced by SAIT [Yoo et al. 2010]
 Execution driven cycle-accurate simulator for SRP-based GPU
 Modification of the texture mapping unit
 Eight pixel processors
 DRAM access latency cycles : 50, 100, 200, and 300 cycles
◦ Benchmark
 Taiji, which has nearest, bi-linear, and tri-linear filtering modes

Cache configuration
◦ Four-way set associative, eight-word block size, and 32 KByte cache size
◦ Number of entries in each buffer : 32
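
As a worked example of what this configuration implies, the Python sketch below derives the block size, block count, and set count. The 4-byte word size is an assumption, since the slides do not state it, and the function name is hypothetical.

```python
def cache_geometry(cache_bytes, ways, words_per_block, bytes_per_word=4):
    """Return (block size in bytes, number of blocks, number of sets)."""
    block_bytes = words_per_block * bytes_per_word  # 4-byte words are assumed
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    return block_bytes, num_blocks, num_sets

# 32 KByte, four-way set associative, eight-word blocks
print(cache_geometry(32 * 1024, ways=4, words_per_block=8))  # (32, 1024, 256)
```

Under that assumption, the cache holds 1024 blocks of 32 bytes arranged in 256 sets of four ways.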
<Chart: total PS cycles (M cycles) vs. DRAM access latency (cycles), split into PS run cycles, PS stall cycles, and NBTC stall cycles>

Pixel shader cycle/frame
◦ PS run cycle : running cycles
◦ PS stall cycle : stall cycles
◦ NBTC stall cycle : stall cycles due to the WLB being full
◦ The pixel shader’s execution cycles decreased by between 12.47% (latency 50)
and 41.64% (latency 300)
<Chart: miss rate (%) of the BTC and NBTC vs. DRAM access latency (50, 100, 200, and 300 cycles)>

Cache miss rates
◦ The NBTC’s cache miss rate was slightly higher than the BTC’s
cache miss rate
 The NBTC can handle subsequent cache accesses even when a cache
update has not yet completed
<Chart: memory bandwidth (MBytes) of the BTC and NBTC vs. DRAM access latency (50, 100, 200, and 300 cycles)>

Memory bandwidth requirement
◦ The memory bandwidth requirement of the NBTC was up to
11% higher than that of the BTC
 Since the block address buffer removes duplicate DRAM requests, the
increase in the memory bandwidth requirement was relatively small

A non-blocking texture cache to improve the performance of
texture caches
◦ Basic OOO execution that maintains IO completion for texture requests with
the same screen coordinate
◦ Three buffers to support the non-blocking scheme:
 The retry buffer : IO completion
 The waiting list buffer : tracking the miss information
 The block address buffer : deleting the duplicate block address

We also plan to implement the proposed NBTC architecture in hardware
and then measure both its power consumption and its hardware area
Thank you
for your attention
http://rayman.sejong.ac.kr