TPUTCACHE: HIGH-FREQUENCY,
MULTI-WAY CACHE FOR HIGH-THROUGHPUT
FPGA APPLICATIONS
Aaron Severance
University of British Columbia
Advised by Guy Lemieux
Our Problem
We use overlays for data processing
Partially/fully fixed processing elements
Virtual CGRAs, soft vector processors
Memory:
Large register files/scratchpad in overlay
Low latency, local data
Trivial (large DMA): burst to/from DDR
Non-trivial (e.g. scatter/gather)?
Scatter/Gather
Data-dependent store/load
vscatter adr_ptr, idx_vect, data_vect
  for i in 1..N:
    adr_ptr[idx_vect[i]] <= data_vect[i]
Random narrow (32-bit) accesses
Waste bandwidth on DDR interfaces
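In scalar C terms, the slide's pseudocode amounts to the following (function names are illustrative, not MXP's actual API):

#include <stddef.h>
#include <stdint.h>

/* Scalar models of vector scatter and gather. Each iteration
 * touches an arbitrary 32-bit word, so a DDR interface sees
 * random narrow accesses instead of efficient bursts. */
void vscatter(uint32_t *adr_ptr, const uint32_t *idx_vect,
              const uint32_t *data_vect, size_t n) {
    for (size_t i = 0; i < n; i++)
        adr_ptr[idx_vect[i]] = data_vect[i];
}

void vgather(const uint32_t *adr_ptr, const uint32_t *idx_vect,
             uint32_t *data_vect, size_t n) {
    for (size_t i = 0; i < n; i++)
        data_vect[i] = adr_ptr[idx_vect[i]];
}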
If Data Fits on the FPGA…
BRAMs with interconnect network
General network…
Memory-mapped BRAM
Not customized per application
Shared: all masters <-> all slaves
Double-pump (2x clk) if possible
Banking/LVT (live value tables)/etc. for further ports
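As a rough model of the shared, memory-mapped arrangement, here is a minimal sketch assuming the low address bits select the bank (the bank count and mapping are illustrative assumptions):

#include <stdint.h>

#define NUM_BANKS 4          /* illustrative; a power of two */
#define WORDS_PER_BANK 8192  /* 32 KB of 32-bit words per bank */

static uint32_t bram[NUM_BANKS][WORDS_PER_BANK];

/* Word-addressed access: low bits pick the bank, so consecutive
 * words land in different banks and can be served in parallel.
 * Double-pumping (clocking the BRAM at 2x) would double the
 * effective ports per bank on top of this. */
uint32_t mem_read(uint32_t word_addr) {
    uint32_t bank = word_addr & (NUM_BANKS - 1);
    uint32_t idx = (word_addr / NUM_BANKS) % WORDS_PER_BANK;
    return bram[bank][idx];
}

void mem_write(uint32_t word_addr, uint32_t data) {
    uint32_t bank = word_addr & (NUM_BANKS - 1);
    uint32_t idx = (word_addr / NUM_BANKS) % WORDS_PER_BANK;
    bram[bank][idx] = data;
}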
Example BRAM system
But if data doesn’t fit…
(oversimplified)
So Let’s Use a Cache
But a throughput-focused cache
Low-latency data is held in local memories
Amortize latency over multiple accesses
Focus on bandwidth
Replace on-chip memory or
augment memory controller?
Data fits on-chip
Want BRAM-like speed, bandwidth
Low overhead compared to shared BRAM
Data doesn’t fit on-chip
Use ‘leftover’ BRAMs for performance
TputCache Design Goals
Fmax near BRAM Fmax
Fully pipelined
Support multiple outstanding misses
Write coalescing
Associativity
TputCache Architecture
Replay-based architecture
Reinsert misses back into the pipeline
Separate line fill/evict logic runs in the background
Token FIFO completes requests in order
No MSHRs for tracking misses
Fewer muxes (only a single replay-request mux)
6-stage pipeline -> 6 outstanding misses
Good performance with high hit rate
Common case fast
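A toy software model of the replay loop, with illustrative sizes and helpers (a sketch of the idea, not the RTL; the token FIFO and the pipeline overlap between requests are not modeled):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SETS 64
#define WAYS 4

typedef struct { uint32_t tag; bool valid; } line_t;
static line_t cache_mem[SETS][WAYS];

static bool lookup(uint32_t addr) {             /* tag check: hit? */
    uint32_t set = (addr >> 5) % SETS, tag = addr >> 11;
    for (int w = 0; w < WAYS; w++)
        if (cache_mem[set][w].valid && cache_mem[set][w].tag == tag)
            return true;
    return false;
}

static void fill(uint32_t addr) {               /* background evict/fill */
    uint32_t set = (addr >> 5) % SETS;
    int victim = rand() % WAYS;                 /* random replacement */
    cache_mem[set][victim] = (line_t){ addr >> 11, true };
}

/* On a miss there is no MSHR: the request simply recirculates
 * (replays) until the fill lands. Here the fill is instantaneous,
 * so one replay suffices; in hardware the request loops while the
 * line is fetched and other requests keep flowing past it. */
int serve(uint32_t addr) {
    int replays = 0;
    while (!lookup(addr)) { fill(addr); replays++; }
    return replays;
}

int main(void) {
    serve(0x12345678);                       /* miss -> fill -> replay */
    return serve(0x12345678) == 0 ? 0 : 1;   /* now a hit */
}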
TputCache Architecture
Cache Hit
Cache Miss
Evict/Fill Logic
Area & Fmax Results
•Reaches 253MHz compared to 270MHz BRAM Fmax on Cyclone IV
•423MHz compared to 490MHz BRAM Fmax on Stratix IV
•Minor degradation with increasing size, associativity
•13% to 35% extra BRAM usage for tags, queues
Benchmark Setup
TputCache
128kB, 4-way, 32-byte lines
MXP soft vector processor
16 lanes, 128kB scratchpad memory
Scatter/Gather memory unit
Indexed loads/stores per lane
Double-pumping port adapters
TputCache runs at 2x the frequency of MXP
MXP Soft Vector Processor
[Block diagram: a Nios II/f host with I$ and D$; DMA and vector work queues with instruction decode & control, custom instructions, and address generation; a 4-bank vector scratchpad feeding ALU0-ALU3 through alignment networks (Align 1 SrcA, Align 2 SrcB, Align 3 DstC) plus an accumulator and custom vector instructions; the scatter/gather unit exchanges addresses, scatter data, and gather data over the Avalon fabric with the Throughput Cache and the DDR controller]
Histogram
•Instantiate a number of Virtual Processors (VPs) mapped across lanes
•Each VP histograms part of the image
•Final pass to sum VP partial histograms
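In scalar C, the strategy looks roughly like this (the VP count, bin count, and interleaved work split are illustrative assumptions):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_VPS 64   /* illustrative virtual-processor count */
#define BINS 256     /* 8-bit pixels */

/* Each VP accumulates a private partial histogram over its share
 * of the pixels (no conflicts on shared bins); a final pass sums
 * the partials. On MXP the data-dependent bin updates become
 * scatter/gather traffic into TputCache. */
void histogram(const uint8_t *img, size_t npix, uint32_t hist[BINS]) {
    static uint32_t partial[NUM_VPS][BINS];
    memset(partial, 0, sizeof partial);
    for (size_t vp = 0; vp < NUM_VPS; vp++)
        for (size_t i = vp; i < npix; i += NUM_VPS)
            partial[vp][img[i]]++;
    memset(hist, 0, BINS * sizeof hist[0]);
    for (size_t b = 0; b < BINS; b++)
        for (size_t vp = 0; vp < NUM_VPS; vp++)
            hist[b] += partial[vp][b];
}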
Hough Transform
•Convert an image to 2D Hough Space (angle, radius)
•Each vector element calculates the radius for a given angle
•Adds pixel value to counter
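Roughly, per pixel and per quantized angle (the quantization and accumulator sizing below are illustrative):

#include <math.h>
#include <stdint.h>

#define N_ANGLES 180
#define N_RADII 512   /* must cover the image diagonal */

/* For each pixel and each angle theta, compute the radius
 * r = x*cos(theta) + y*sin(theta) and add the pixel value to the
 * (angle, radius) counter. These data-dependent accumulator
 * updates are the scatter traffic the cache absorbs. */
void hough(const uint8_t *img, int w, int h,
           uint32_t acc[N_ANGLES][N_RADII]) {
    const double pi = 3.14159265358979323846;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            for (int a = 0; a < N_ANGLES; a++) {
                double th = a * pi / N_ANGLES;
                int rbin = (int)lround(x * cos(th) + y * sin(th))
                           + N_RADII / 2;   /* shift to non-negative */
                if (rbin >= 0 && rbin < N_RADII)
                    acc[a][rbin] += img[y * w + x];
            }
}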
Motion Compensation
•Load block from reference image, interpolate
•Offset by small amount from location in current image
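A scalar sketch of the reference-block gather (the block size, offsets, and 2-tap half-pixel filter are illustrative, not taken from the benchmark code):

#include <stdint.h>

#define BLK 8   /* illustrative block size */

/* Gather a BLK x BLK block from the reference frame at a small
 * (dx, dy) offset from the current block's (x, y), applying a
 * simple 2-tap average when the offset lands on a half pixel.
 * The reads from the reference image are the gathers that
 * exercise the cache. Caller must keep indices in bounds. */
void mc_block(const uint8_t *ref, int stride, int x, int y,
              int dx, int dy, int half_pel_x, uint8_t out[BLK][BLK]) {
    for (int j = 0; j < BLK; j++)
        for (int i = 0; i < BLK; i++) {
            const uint8_t *p = &ref[(y + dy + j) * stride + (x + dx + i)];
            out[j][i] = half_pel_x ? (uint8_t)((p[0] + p[1] + 1) >> 1)
                                   : p[0];
        }
}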
Future Work
More ports needed for scalability
Write cache
Share evict/fill BRAM port with 2nd request
Banking (sharing same evict/fill logic)
Multiported BRAM designs
Currently allocate-on-write
Track dirty state of bytes in the BRAMs' 9th bits
Non-blocking behavior
Multiple token FIFOs (one per requestor)?
FAQ
Coherency?
Envisioned as the only/last-level cache
Coherency support is future work
Replay loops/livelock?
Avoided by random replacement + associativity
Power? Expected to be not great…
Conclusions
TputCache: alternative to shared BRAM
Low overhead (13%-35% extra BRAM)
Nearly as high Fmax (253MHz vs 270MHz)
More flexible than shared BRAM
Performance degrades gradually
Cache behavior instead of manual filling
Questions?
Thank you