Microprocessors with FPGAs: Implementation and
Workload Partitioning of the DARPA HPCS Integer
Sort Benchmark within the SRC-6e Reconfigurable
Computer
Allen Michalski
CSE Department – Reconfigurable Computing Lab
University of South Carolina
Outline
Reconfigurable Computing – Introduction
SRC-6e architecture, programming model
Sorting Algorithms
Design guidelines
Testing Procedures, Results
Conclusions, Future Work
Lessons learned
What is a Reconfigurable Computer?
Combination of:
Microprocessor workstation for frontend processing
FPGA backend for specialized coprocessing
Typical PC bus for communications
What is a Reconfigurable Computer?
PC Characteristics
High clock speed
Superscalar, pipelined
Out of order issue
Speculative execution
High-Level Language
FPGA Characteristics
Low clock speed
Large number of configurable elements
• LUTs, Block RAMs, CPAs
• Multipliers
HDL programming
What is the SRC-6e?
SRC = Seymour R. Cray
Reconfigurable computer with a high-throughput memory (SNAP) interface
1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
For comparison, PCI-X (1.0) = 1.064 GB/s
SRC-6e Development
Programming does not require knowledge of HW design
C code can compile to hardware
SRC Design Objectives
FPGA Considerations
Superscalar design
• Parallel, pipelined execution
SRC Considerations
High overall data throughput
• Streaming versus non-streaming data transfer?
Reduction of FPGA data processing stalls due to data dependencies and data read/write delays
• FPGA Block RAM versus SRC OnBoard Memory?
Evaluate software/hardware partitioning
Algorithm partitioning
Data size partitioning
Sorting Algorithms
Traditional Algorithms
Comparison Sorts: Θ(n lg n) comparisons at best
• Insertion sort
• Merge sort
• Heapsort
• Quicksort
Counting Sorts
• Radix sort: Θ(d(n+k))
HPCS FORTRAN code baseline
Radix sort in combination with heapsort
This research focuses on 128-bit operands
• Simplifies SRC data transfer and management
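
For concreteness, one counting pass of an LSD radix sort over 128-bit keys held as high/low 64-bit halves (the same data layout used in the MAP code later in these slides) can be sketched in C as below. This is an illustrative sketch only, not the HPCS FORTRAN baseline; the function and parameter names are assumptions. The radices 4, 8, and 16 compared later correspond to r = 2, 3, and 4.

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative sketch only – not the HPCS FORTRAN baseline.
       One counting-sort pass of an LSD radix sort, keyed on an r-bit digit
       taken from the low 64-bit word of each 128-bit value. Runs in
       Θ(n + k) with k = 2^r buckets; d passes give Θ(d(n + k)). */
    void radix_pass(const uint64_t *key_hi, const uint64_t *key_lo,
                    uint64_t *out_hi, uint64_t *out_lo,
                    int n, int shift, int r)
    {
        int k = 1 << r;
        int *count = calloc((size_t)k, sizeof *count);
        int i, b, offset = 0;

        /* Histogram the current digit. */
        for (i = 0; i < n; i++)
            count[(key_lo[i] >> shift) & (k - 1)]++;

        /* Prefix sum: bucket start offsets. */
        for (b = 0; b < k; b++) {
            int c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Stable scatter of the 128-bit values into their buckets. */
        for (i = 0; i < n; i++) {
            int d = (int)((key_lo[i] >> shift) & (k - 1));
            out_hi[count[d]] = key_hi[i];
            out_lo[count[d]] = key_lo[i];
            count[d]++;
        }
        free(count);
    }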
Sorting – SRC FPGA Implementation
Memory Constraints
SRC onboard memory
• 6 banks × 4 MB
• Pipelined read or write access
• 5 clock latency
FPGA BRAM memory
• 144 blocks, 18 Kbit each
• 1 clock read and write latency
Initial Choices
Parallel Insertion Sort (BubbleSort)
• Produces sorted blocks
• Use of onboard memory pipelined processing
– Minimize data access stalls
Parallel Heapsort
• Random access merge of sorted lists
• Use of BRAM for low latency access
– Good for random data access
Parallel Insertion Sort (BubbleSort)
Systolic array of cells
Pipelined SRC processing from OnBoard Memory
Keeps highest value, passes other values
Latency = 2 × number of cells
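
As an illustration of the cell behavior described above, here is a small software model of one comparator cell in C. It is a sketch under assumed names (sort_cell, cell_step); the actual hardware cell is the parsort macro instantiated in the MAP C code two slides ahead.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative software model of one systolic comparator cell; the real
       hardware is the parsort macro used in the MAP C code shown later. */
    typedef struct {
        uint64_t hold_hi, hold_lo;   /* largest 128-bit value seen so far */
        bool     valid;              /* cell currently holds a value      */
    } sort_cell;

    /* Push one value (or a flush token) through the cell. Returns true if
       a value was forwarded to the next cell via out_hi/out_lo. */
    static bool cell_step(sort_cell *c, bool flush,
                          uint64_t in_hi, uint64_t in_lo,
                          uint64_t *out_hi, uint64_t *out_lo)
    {
        if (flush) {                            /* drain: release held value */
            if (!c->valid)
                return false;
            *out_hi = c->hold_hi;  *out_lo = c->hold_lo;
            c->valid = false;
            return true;
        }
        if (!c->valid) {                        /* empty cell absorbs value  */
            c->hold_hi = in_hi;  c->hold_lo = in_lo;
            c->valid = true;
            return false;
        }
        /* 128-bit compare: keep the larger value, forward the smaller one. */
        bool in_greater = (in_hi > c->hold_hi) ||
                          (in_hi == c->hold_hi && in_lo > c->hold_lo);
        if (in_greater) {
            *out_hi = c->hold_hi;  *out_lo = c->hold_lo;
            c->hold_hi = in_hi;    c->hold_lo = in_lo;
        } else {
            *out_hi = in_hi;  *out_lo = in_lo;
        }
        return true;
    }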
Parallel Insertion Sort (BubbleSort)
Systolic array of cells
Results passed out in reverse order of comparison
• N = # comparator cells
Sorts a list completely in Θ(L²), where L = list size
Limit sort size to some number a < L
• Creates multiple sorted lists
• Each list sorted in Θ(a)
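
Building on the cell sketch above, the block-sorting behavior can be modeled by feeding a block of a ≤ N values through a chain of cells and then draining it; in this sequential model the block comes back out largest value first, and the feed-plus-drain phases mirror the roughly 2× latency noted earlier. Again an illustrative sketch with assumed names (sort_block, CELLS), not the MAP C implementation.

    #include <string.h>

    #define CELLS 100   /* mirrors the 100-cell FPGA build reported later */

    /* Sort one block of a <= CELLS values in place (descending order),
       using sort_cell and cell_step from the sketch above. */
    void sort_block(uint64_t *hi, uint64_t *lo, int a)
    {
        sort_cell chain[CELLS];
        memset(chain, 0, sizeof chain);

        for (int v = 0; v < a; v++) {            /* feed phase */
            uint64_t vh = hi[v], vl = lo[v];
            for (int c = 0; c < a; c++)
                if (!cell_step(&chain[c], false, vh, vl, &vh, &vl))
                    break;                       /* absorbed by an empty cell */
        }
        /* Drain phase: cell 0 holds the largest value, cell a-1 the smallest. */
        for (int c = 0; c < a; c++)
            cell_step(&chain[c], true, 0, 0, &hi[c], &lo[c]);
    }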
Parallel Insertion Sort (BubbleSort)
#include <libmap.h>

/* MAP C routine (excerpt as shown on the slide): DMA the 128-bit keys,
   split into high/low 64-bit halves, into OnBoard Memory banks A/B, then
   stream each block of 'sortsize' values through the parsort systolic
   sorter, writing sorted output to banks C/D. */
void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno) {
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    /* DMA the high words from the host into bank A. */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    ….                                   /* low-word DMA, timer reads, etc. elided on the slide */

    while (arrayindex < arraysize) {     /* one iteration per block of 'sortsize' values */
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;
        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                /* Feed one value into the systolic sorter; the first argument
                   is asserted on the block's final value. Sorted output
                   drains into banks C/D. */
                data_high_in = a[i];  data_low_in = b[i];
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);
                c[i] = data_high_out;  d[i] = data_low_out;
                /* … remainder of the routine elided on the slide … */
Parallel Heapsort
Tree structure of cells
Asynchronous operation
• Acknowledged data transfer
Merges sorted lists in Θ(n lg n)
Designed for independent BRAM block accesses
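
As a software analogue of this merge tree, a k-way merge driven by a binary min-heap of list heads gives the same asymptotic behavior; the rough C sketch below uses 64-bit keys for brevity and illustrative names (head, sift_down, kway_merge). In the FPGA design each tree node is a comparator cell with acknowledged transfers rather than a heap entry, so this is intuition only, not the hardware implementation.

    #include <stdint.h>
    #include <stdlib.h>

    /* One heap entry: the current head of a sorted input list. */
    typedef struct { uint64_t key; int list; } head;

    static void sift_down(head *h, int n, int i)
    {
        for (;;) {
            int l = 2*i + 1, r = l + 1, m = i;
            if (l < n && h[l].key < h[m].key) m = l;
            if (r < n && h[r].key < h[m].key) m = r;
            if (m == i) return;
            head t = h[i]; h[i] = h[m]; h[m] = t;
            i = m;
        }
    }

    /* lists[i] is sorted list i of length len[i]; output receives the merge. */
    void kway_merge(const uint64_t **lists, const int *len, int k, uint64_t *output)
    {
        head *heap = malloc((size_t)k * sizeof *heap);
        int  *pos  = calloc((size_t)k, sizeof *pos);
        int n = 0, out = 0;

        for (int i = 0; i < k; i++)              /* load first element of each list */
            if (len[i] > 0) { heap[n].key = lists[i][0]; heap[n].list = i; n++; }
        for (int i = n/2 - 1; i >= 0; i--) sift_down(heap, n, i);

        while (n > 0) {
            int src = heap[0].list;
            output[out++] = heap[0].key;         /* emit current minimum            */
            if (++pos[src] < len[src])           /* refill from the same list       */
                heap[0].key = lists[src][pos[src]];
            else
                heap[0] = heap[--n];             /* list exhausted: shrink heap     */
            sift_down(heap, n, 0);
        }
        free(heap); free(pos);
    }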
Parallel Heapsort
BRAM Limitations
144 Block RAMs @ 512 32-bit values each = only ~18K 128-bit values (144 × 512 / 4 = 18,432)
OnBoard Memory
SRC constraint – up to 64 reads and 8 writes in one MAP C file
Cascading clock delays as the number of reads increases
Explore the use of MUXed access: search and update only 6 of 48 leaf nodes at a time in round-robin fashion
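
The round-robin servicing might be organized as in the control-flow sketch below. The helper names (merge_not_done, leaf_needs_refill, refill_leaf_from_obm) and constants are hypothetical stand-ins for the MAP-side logic, not real SRC API calls.

    #include <stdbool.h>

    #define NUM_LEAVES       48
    #define LEAVES_PER_GROUP 6
    #define NUM_GROUPS       (NUM_LEAVES / LEAVES_PER_GROUP)

    /* Hypothetical stand-ins for the MAP-side operations. */
    extern bool merge_not_done(void);
    extern bool leaf_needs_refill(int leaf);
    extern void refill_leaf_from_obm(int leaf);   /* one OnBoard Memory read */

    /* Service the 48 leaf nodes in round-robin groups of 6, so at most
       6 OnBoard Memory reads are issued per iteration. */
    void service_leaves(void)
    {
        int group = 0;
        while (merge_not_done()) {
            int base = group * LEAVES_PER_GROUP;
            for (int i = 0; i < LEAVES_PER_GROUP; i++) {
                int leaf = base + i;
                if (leaf_needs_refill(leaf))      /* child consumed its value */
                    refill_leaf_from_obm(leaf);
            }
            group = (group + 1) % NUM_GROUPS;     /* rotate to the next group */
        }
    }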
FPGA Initial Results
Baseline: one Virtex-II 6000 FPGA
PAR options: -ol high -t 1
Bubblesort Results – 100 Cells
29,354 Slices (86%)
37,131 LUTs (54%)
13.608 ns = 73 MHz (verified operational at 100 MHz)
Heapsort Results – 95 Cells (48 leaves)
21,011 Slices (62%)
24,467 LUTs (36%)
11.770 ns = 85 MHz (verified operational at 100 MHz)
Testing Procedures
All tests utilize one chip for baseline results
Evaluate the fastest radix for the software baseline
Hardware/Software Partitioning
Five cases – Case 5 utilizes FPGA reconfiguration
Data size partitioning – 100, 500, 1000, 5000, 10000
10 runs for each test case/data partitioning combination
List size: 500,000 values
Results
[Chart: Software Datasize Partitioning – Radixsort vs. Radixsort + Heapsort; time (sec.) by test case/radix (4, 8, 16) for Radixsort alone and for Radixsort + Heapsort at list sizes 100, 500, 1000, 5000, and 10000]
Fastest Software Operations (Baseline)
Comparison of Radixsort and Heapsort Combinations
• Radix 4, 8 and 16 evaluated
Minimum Time: Radix-8 Radixsort + Heapsort (Size = 5000 or 10000)
Radix-16 has too many buckets for sort size partitions evaluated
Heapsort comparisons faster than radixsort index updates
Results
Fastest SW-only time = 3.41 sec.
Fastest time including HW = 3.89 sec.
• Bubblesort (HW), Heapsort (SW)
• Partition list size of 1000
[Chart: SRC Software/Hardware Executions (500K data); time (sec.) per SW/HW test case at data partitions 100, 500, 1000, 5000, and 10000; series: Heapsort (HW), Heapsort Config (HW), Heapsort (SW), Bubblesort (HW), Bubblesort Config (HW), Radixsort (SW)]
Heapsort times:
Dominated by data access
Significantly slower than software
Results – Bubblesort vs. Radixsort
Some cases where HW is faster than SW
List sizes < 5000 benefit from SRC pipelined data access
Fastest SW case was for list size = 10000
[Chart: Radixsort (SW) vs. Bubblesort (HW); time (sec.) per data size/test case (100, 500, 1000, 5000, 10000); HW time broken into data transfer in, data processing, and data transfer out]
MAP data transfer time less significant than data processing time
For size = 1000: Input (11.3%), Analyze (76.9%), Output (11.5%)
Results - Limitations
Heapsort is limited by the overhead of input servicing
Random accesses of OBM not ideal
Overhead of loop search, sequentially dependent processing
Bubblesort limited by number of cells
Can increase by approximately 13 cells
Two-chip streaming
Reconfiguration time assumed to be a one-time setup factor
Reconfiguration case exception – solve by having a core per Virtex-II 6000
Conclusions
Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
Bubblesort works well on small data sets
Heapsort’s random data access cannot exploit SRC benefits
SRC high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs
Future Work
Heapsort’s random data access cannot exploit SRC benefits
Look for possible speedups using BRAM?
Unroll leaf memory access
Exploit the SRC “periodic macro” paradigm
Currently evaluating radix sort in hardware
This works better than bubblesort for larger sort sizes
Compare MAP C to VHDL when the baseline VHDL is faster than SW