Caching Strategies for Textures
Paul Arthur Navratil
Overview
• Conceptual summary
• Design and Analysis of a Cache Architecture for
Texture Mapping (Hakura and Gupta 1997)
• Prefetching in a Texture Cache Architecture
(Igehy, Eldridge, and Proudfoot 1998)
• Discussion!
Mip mapping
• Achieves acceptable texture-mapping performance
• Interpolation between fixed levels of detail is a constant
computation cost per fragment (sketched after this list)
• Reduces aliasing [Williams p.4]
• Efficient memory use
• Memory access pattern is well understood
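To make the constant per-fragment cost concrete, here is a minimal sketch of trilinear mipmap filtering in C. fetch_texel() is a hypothetical stand-in for the memory system (stubbed with dummy data so the sketch is self-contained) and returns a single channel. Every fragment costs exactly eight texel reads plus fixed interpolation arithmetic, independent of scene complexity.

    #include <math.h>

    /* Hypothetical texel fetch: one channel of the texel at (u, v) on
       mipmap level `level`. Stubbed so the sketch compiles on its own. */
    static float fetch_texel(int level, int u, int v)
    {
        return (float)((u + v + level) & 0xFF) / 255.0f;  /* dummy data */
    }

    /* Bilinear sample at continuous texel coordinates (s, t) on one
       level: four texel reads and fixed arithmetic. */
    static float bilinear(int level, float s, float t)
    {
        int   u = (int)floorf(s), v = (int)floorf(t);
        float fu = s - (float)u,  fv = t - (float)v;
        float t00 = fetch_texel(level, u,     v);
        float t10 = fetch_texel(level, u + 1, v);
        float t01 = fetch_texel(level, u,     v + 1);
        float t11 = fetch_texel(level, u + 1, v + 1);
        float top = t00 + fu * (t10 - t00);
        float bot = t01 + fu * (t11 - t01);
        return top + fv * (bot - top);
    }

    /* Trilinear sample: blend the bilinear results from the two mipmap
       levels that bracket the desired level of detail. Always eight
       texel reads per fragment: the constant cost noted above. */
    static float trilinear(float s, float t, float lod)
    {
        int   l  = (int)floorf(lod);
        float fl = lod - (float)l;
        float a  = bilinear(l,     s,        t);
        float b  = bilinear(l + 1, s * 0.5f, t * 0.5f);  /* coords halve per level */
        return a + fl * (b - a);
    }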
Hakura and Gupta: Problem
• Motivation: need high bandwidth, low latency
memory access for texture mapping
• Previous work uses brute-force
– Dedicated DRAM for each fragment generator [Akeley p.3]
– SGI RealityEngine can have 320MB texture memory,
but only 16MB of unique texture memory!
Hakura and Gupta: Idea
• Observation:
If textures exhibit spatial and temporal localities, design a
system to exploit them
• Use SRAM cache for each fragment generator
• Have a single, shared DRAM texture memory
• Advantages
– Unique texture memory is larger
– Cheaper overall: small SRAM caches replace replicated dedicated DRAM
– SRAM gives higher bandwidth and lower latency
Hakura and Gupta: Locality
• Mip mapping has inherent spatial locality
– Four contiguous texels on each of two levels for
trilinear interpolation, with texel area close to pixel area
• Texture mapping exhibits two forms of temporal locality
– Overlapping texel usage along contiguous fragment
generation
– Repeating texture across image [color images.ps]
Hakura and Gupta: Caching
• Observation: increasing DRAM density has decreased aggregate
DRAM bandwidth (fewer chips, hence fewer pins, per unit of capacity)!
• Cache decreases bandwidth requirement by decreasing
accesses to texture memory
• Block transfers from memory to cache maximize DRAM
bandwidth utilization
• Texture memory can be shared (not dedicated)
• No cache coherence issues
• Cache characterized by (address split sketched below):
– Cache size
– Cache line size
– Associativity
• Which combination is best?
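The three parameters above determine how a texel's memory address splits into a tag, a set index, and an offset within a line. A minimal sketch with illustrative values (the 16 KB, 2-way figures match the conclusions later in the talk; the 64-byte line is an assumption):

    #include <stdint.h>
    #include <stdio.h>

    enum {
        CACHE_BYTES = 16 * 1024,  /* cache size */
        LINE_BYTES  = 64,         /* cache line size */
        WAYS        = 2,          /* associativity */
        NUM_SETS    = CACHE_BYTES / (LINE_BYTES * WAYS)  /* 128 sets */
    };

    /* Split a texture-memory byte address into the fields the cache uses. */
    static void decompose(uint32_t addr)
    {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t set    = (addr / LINE_BYTES) % NUM_SETS;
        uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);
        printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
    }

    int main(void)
    {
        decompose(0x00012345u);
        return 0;
    }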
Hakura and Gupta: Texture Representation in Memory
• Base case: Linear (Non-Blocked); addressing sketched below
– Williams' original representation misses spatial locality
– Use contiguous RGBA values per texel [Hakura p.5]
• Observations:
– Gradual level-of-detail change uses more of a fetched cache line
– Higher line size drops cold miss rate
– Principle of Texture Thrift: amount of texture info required to
render is proportional to the resolution of the image, and is
independent of the number of surfaces and the size of the texture
[Peachey 90]
– In the examples, the working set is limited to one texture; the
worst case is bounded by either texture size or screen size
– This representation is sensitive to the texture orientation on screen.
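A minimal sketch of the linear layout's addressing (one contiguous 4-byte RGBA value per texel). Rows are contiguous, so a cache line only ever holds horizontally adjacent texels, which is exactly why the layout is sensitive to texture orientation on screen:

    #include <stdint.h>

    /* Byte offset of texel (u, v) in a linear (non-blocked) RGBA texture.
       Horizontal neighbors are 4 bytes apart; vertical neighbors are a
       full row apart, so a vertical access pattern touches a new cache
       line on every texel. */
    static uint32_t linear_offset(uint32_t u, uint32_t v, uint32_t width)
    {
        return (v * width + u) * 4;  /* 4 bytes per RGBA texel */
    }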
Hakura and Gupta: Texture Representation in Memory
• Blocked case: convert 2-D arrays into 4-D arrays.
– Address calculation is a two-step process (sketched below)
– Block size remains constant across mipmap levels
• Observations:
– Reduces dependency on texture orientation, and utilizes spatial
locality
– Lowest miss rates occur when block size matches cache line size
[Hakura p.7]
– Increasing line size alone creates worse miss rates
– Can use 2-way associative cache to avoid conflict with blocks of
different mipmap levels (see Igehy)
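A sketch of the two-step addressing for the blocked (4-D) layout. B = 4 is an illustrative choice that makes one 4x4 RGBA block exactly a 64-byte cache line, per the block-size-equals-line-size observation:

    #include <stdint.h>

    enum { B = 4 };  /* B x B texels per block; 4*4*4 bytes = one 64-byte line */

    /* Byte offset of texel (u, v) in a blocked RGBA texture whose width
       is assumed to be a multiple of B. Step 1 locates the block; step 2
       locates the texel within the block. */
    static uint32_t blocked_offset(uint32_t u, uint32_t v, uint32_t width)
    {
        uint32_t blocks_per_row = width / B;
        uint32_t block  = (v / B) * blocks_per_row + (u / B);  /* step 1 */
        uint32_t within = (v % B) * B + (u % B);               /* step 2 */
        return (block * B * B + within) * 4;  /* 4 bytes per RGBA texel */
    }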
Hakura and Gupta: Rasterization
• Rasterization order affects texture access pattern,
and thus cache behavior also
• Use tiling (chunking) to exploit spatial locality (loop sketch below)
– If tiles are too large, the working set will be larger than
the cache size, and capacity misses will result [Hakura p.9]
– Smaller triangles in image reduce this effect
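A minimal sketch of the tiled traversal order; shade_pixel is a hypothetical per-fragment callback:

    /* Visit the screen tile by tile rather than scan line by scan line.
       `tile` must be small enough that one tile's texture working set
       fits in the cache, per the capacity-miss caveat above. */
    static void rasterize_tiled(int width, int height, int tile,
                                void (*shade_pixel)(int x, int y))
    {
        for (int ty = 0; ty < height; ty += tile)
            for (int tx = 0; tx < width; tx += tile)
                for (int y = ty; y < ty + tile && y < height; y++)
                    for (int x = tx; x < tx + tile && x < width; x++)
                        shade_pixel(x, y);
    }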
Hakura and Gupta: Performance
• Rendering performance and memory bandwidth are good
measures of a texture mapping system
• Fragment generator observations
– Machine must access more than one texel per cycle
– Must hide memory latency to achieve maximum throughput
(address precomputation)
• SRAM cache observations
– Multiple banks with interleaved lines for multi-texel access
– Interleave texels within each block (bank sketch below)
– Without multi-texel access, trilinear interpolation can complete
only once every two cycles!
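A sketch of one way to interleave texels across four banks (an assumption consistent with the bullets above, not the paper's exact circuit): keying the bank on the low bit of each coordinate guarantees that any 2x2 bilinear footprint touches all four banks exactly once, so its four texels can be read in the same cycle.

    #include <stdint.h>

    /* Bank index in [0, 3] for texel (u, v). The neighbors (u+1, v),
       (u, v+1), and (u+1, v+1) always land in the other three banks. */
    static uint32_t bank_of(uint32_t u, uint32_t v)
    {
        return (u & 1u) | ((v & 1u) << 1);
    }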
Hakura and Gupta: Conclusions
• Caching yields a three-fold to fifteen-fold
reduction in memory bandwidth requirements
• Cache should be at least 16 KB and 2-way
associative
• Long cache lines better utilize DRAM bandwidth
(at the cost of a slight increase in data fetched)
• Block size should match cache line size
• Rasterization pattern should be tiled
Igehy et al: Problem
• Motivation: Memory bandwidth and latency are
(becoming) bottleneck for texture systems
• Previous work shows caching benefits [Hakura97; Cox98],
but fails to hide memory latency
• Little literature on prefetching texels:
– used in some systems, but the algorithms are not
described (proprietary) e.g. [Torborg and Kajiya, 1996]
Igehy et al: Idea
• Combine prefetching and caching in an
architecture with a clear description
• Advantages:
– Simple
– Robust to variations in bandwidth requirements and
latencies
– Achieves performance within 3% of a zero-latency system
Igehy et al: Traditional Prefetching (no cache)
• When a fragment is ready for texturing, queue it and
request its texels (sketch after this list)
• Fragment stays in queue for time equal to memory latency
• If the queue is sized correctly, latency will be masked
• Problems:
– When sized to cover a large request rate × latency product, an
early prefetch can be evicted before it is used, causing a miss
– Tags must be checked at double rate to maximize throughput
(prefetch check and read check)
– Prefetch buffer size must increase as request rate and latency
increase
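A minimal simulation sketch of the queue mechanism (names and sizes are illustrative): texel requests are issued when the fragment enters the FIFO, and the FIFO is deep enough that the data has arrived by the time the fragment reaches the head.

    #include <stdbool.h>

    enum { QUEUE_DEPTH = 64 };  /* >= memory latency in cycles */

    typedef struct { int id; /* plus texel addresses, weights, ... */ } Fragment;

    static Fragment queue[QUEUE_DEPTH];
    static int head = 0, count = 0;

    /* Hypothetical hook: issue the fragment's texel reads to memory. */
    static void request_texels(const Fragment *f) { (void)f; }

    /* Fragment ready for texturing: request its texels, then park it. */
    static bool enqueue(Fragment f)
    {
        if (count == QUEUE_DEPTH)
            return false;                 /* queue full: stall upstream */
        request_texels(&f);
        queue[(head + count) % QUEUE_DEPTH] = f;
        count++;
        return true;
    }

    /* One fragment retires per cycle; with QUEUE_DEPTH >= latency its
       texels are guaranteed to have arrived, masking the latency. */
    static bool dequeue(Fragment *out)
    {
        if (count == 0)
            return false;
        *out = queue[head];
        head = (head + 1) % QUEUE_DEPTH;
        count--;
        return true;
    }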
Igehy et al: Texture Prefetching
• Differences from traditional prefetch (sketch after this list):
– Tag checks occur once per texel, before cache access
– Add reorder buffer to handle early return of texel data
– New cache blocks only put in cache when associated fragment
reaches head of the queue
• Cache organization:
– Two caches with four banks each; adjacent mipmap levels in
alternating caches
– Data interleaved so the four accesses for bilinear interpolation can
occur in parallel
– Can process 8 requests in parallel, which is enough for trilinear
interpolation
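A heavily simplified sketch of the flow these bullets describe; all names are illustrative and the memory system is reduced to an instant stub. The essentials: one tag check per texel block, a reorder buffer that absorbs out-of-order returns, deferred cache commit at the head of the queue, and a reorder-buffer slot reserved before the request is sent (the deadlock-avoidance rule from the conclusions).

    #include <stdbool.h>

    enum { REORDER_SLOTS = 32, MAX_MISSES = 8 };  /* 8 = trilinear worst case */

    typedef struct { bool used, arrived; unsigned block_addr; } ReorderSlot;
    typedef struct { int n_misses; int slot[MAX_MISSES]; } Fragment;

    static ReorderSlot reorder[REORDER_SLOTS];

    /* Stubs standing in for the real cache and memory system. */
    static bool tags_hit(unsigned block) { (void)block; return false; }
    static void commit_to_cache(unsigned block) { (void)block; }
    static void send_memory_request(unsigned block, int s)
    {
        (void)block;
        reorder[s].arrived = true;  /* real hardware: arrives later, any order */
    }

    static int reserve_slot(void)
    {
        for (int s = 0; s < REORDER_SLOTS; s++)
            if (!reorder[s].used) { reorder[s].used = true; return s; }
        return -1;  /* buffer full: stall rather than issue the request */
    }

    /* Issue: one tag check per block; on a miss, reserve a slot and only
       then send the request. The fragment then enters the FIFO (not shown). */
    static bool issue_fragment(Fragment *f, const unsigned *blocks, int n)
    {
        f->n_misses = 0;
        for (int i = 0; i < n && f->n_misses < MAX_MISSES; i++) {
            if (tags_hit(blocks[i]))
                continue;
            int s = reserve_slot();
            if (s < 0)
                return false;  /* stall (a real design also rolls back) */
            reorder[s].arrived = false;
            reorder[s].block_addr = blocks[i];
            send_memory_request(blocks[i], s);
            f->slot[f->n_misses++] = s;
        }
        return true;
    }

    /* Retire: at the head of the queue, wait for all blocks, commit them
       to the cache, free the slots, then filter the texels. */
    static bool retire_fragment(Fragment *f)
    {
        for (int i = 0; i < f->n_misses; i++)
            if (!reorder[f->slot[i]].arrived)
                return false;             /* stall until data arrives */
        for (int i = 0; i < f->n_misses; i++) {
            ReorderSlot *s = &reorder[f->slot[i]];
            commit_to_cache(s->block_addr);
            s->used = false;
        }
        return true;                      /* texels now in cache; filter */
    }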
Igehy et al: Texture Properties
• Texture caching effectiveness is scene dependent
• Observation: the unique-texel-to-fragment ratio sets a lower
bound on the number of texels that must be fetched per frame
(unless inter-frame locality is exploited); see the sketch after this list
• Want a low unique-texel-to-fragment ratio!
• Ratio affected by:
– Magnification (lowers ratio)
– Repetition (lowers ratio if cache holds entire texture)
– Minification (ratio depends on texel-area-to-pixel-area ratio)
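A sketch of computing the ratio offline from one frame's access trace (function names are mine, not the paper's): sort the texel addresses, count the distinct ones, and divide by the fragment count.

    #include <stddef.h>
    #include <stdlib.h>

    static int cmp_u32(const void *a, const void *b)
    {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Unique-texel-to-fragment ratio for one frame. Sorts the trace in
       place. A low ratio means heavy reuse (magnification, repetition);
       a ratio above 1 means minification fetches more unique texels
       than there are fragments. */
    static double unique_texel_ratio(unsigned *texel_addrs, size_t n_accesses,
                                     size_t n_fragments)
    {
        if (n_accesses == 0 || n_fragments == 0)
            return 0.0;
        qsort(texel_addrs, n_accesses, sizeof *texel_addrs, cmp_u32);
        size_t unique = 1;
        for (size_t i = 1; i < n_accesses; i++)
            if (texel_addrs[i] != texel_addrs[i - 1])
                unique++;
        return (double)unique / (double)n_fragments;
    }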
Igehy et al: Memory Organization
• Use a 6-D texture representation (blocking applied at two levels,
extending Hakura's 4-D layout) [Igehy p.5]
• Rasterize in tiled pattern (not scan-line)
• Cache associativity does not appreciably affect
miss rate
– Design minimizes conflict misses
• General formula for determining associativity:
– m independent n-way associative caches can handle a
rate of m bilinear accesses (four texels) per cycle to
m*n textures (or texture levels in mipmap)
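– Worked example (illustrative numbers, not from the paper): with
m = 2 independent 2-way caches (n = 2), the system sustains two
bilinear accesses (eight texels) per cycle, enough for one trilinear
access, across up to m*n = 4 distinct textures or mipmap levels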
Igehy et al: Bandwidth
• Average texel requests per frame are not enough to
determine actual requirements
– High-request bursts occur [Igehy p.6], e.g. a color map vs. a light map
• When system misses ideal (zero-latency)
performance, bandwidth is to blame [Igehy p.8]
– e.g. AGP vs. NUMA
Igehy et al: Conclusions
• A system approximating zero-latency performance is possible
– Achieved 97% utilization of available resources
• Fragment queue should slightly exceed latency of
memory system to account for miss bursts
• Reserve reorder-buffer slot when memory request
is made to avoid deadlock
Discussion!