Microprocessors with FPGAs: Implementation and
Workload Partitioning of the DARPA HPCS Integer
Sort Benchmark within the SRC-6e Reconfigurable
Computer
Allen Michalski
CSE Department – Reconfigurable Computing Lab
University of South Carolina
Outline
Reconfigurable Computing – Introduction
SRC-6e architecture, programming model
Sorting Algorithms
Design guidelines
Testing Procedures, Results
Conclusions, Future Work
Lessons learned
What is a Reconfigurable Computer?
Combination of:
Microprocessor workstation for frontend processing
FPGA backend for specialized coprocessing
Typical PC bus for communications
What is a Reconfigurable Computer?
PC Characteristics
High clock speed
Superscalar, pipelined
Out of order issue
Speculative execution
High-Level Language
FPGA Characteristics
Low clock speed
Large number of configurable elements
• LUTs, Block RAMs, CPAs
• Multipliers
HDL programming
What is the SRC-6e?
SRC = Seymour R. Cray
Reconfigurable computer with a high-throughput memory (SNAP) interface
1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
For comparison, PCI-X (1.0) = 1.064 GB/s
SRC-6e Development
Programming does not require knowledge of HW design
C code can compile to hardware
SRC Design Objectives
FPGA Considerations
Superscalar design
• Parallel, pipelined execution
SRC Considerations
High overall data throughput
• Streaming versus non-streaming data transfer?
Reduction of FPGA data processing stalls due to data dependencies and data read/write delays
• FPGA Block RAM versus SRC OnBoard Memory?
Evaluate software/hardware partitioning
Algorithm partitioning
Data size partitioning
Sorting Algorithms
Traditional Algorithms
Comparison Sorts: Θ(n lg n) comparisons at best
• Insertion sort
• Merge sort
• Heapsort
• Quicksort
Counting Sorts
• Radix sort: Θ(d(n+k))
HPCS FORTRAN code baseline
Radix sort in combination with heapsort
This research focuses on 128-bit operands
• Simplifies SRC data transfer and management
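
For concreteness, one counting pass of an LSD radix sort over 128-bit keys held as high/low 64-bit halves (the same data layout used in the MAP code later in these slides) can be sketched in C as below. This is an illustrative sketch only, not the HPCS FORTRAN baseline; the function and parameter names are assumptions. The radices 4, 8, and 16 compared later correspond to r = 2, 3, and 4.

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative sketch only – not the HPCS FORTRAN baseline.
       One counting-sort pass of an LSD radix sort, keyed on an r-bit digit
       taken from the low 64-bit word of each 128-bit value. Runs in
       Θ(n + k) with k = 2^r buckets; d passes give Θ(d(n + k)). */
    void radix_pass(const uint64_t *key_hi, const uint64_t *key_lo,
                    uint64_t *out_hi, uint64_t *out_lo,
                    int n, int shift, int r)
    {
        int k = 1 << r;
        int *count = calloc((size_t)k, sizeof *count);
        int i, b, offset = 0;

        /* Histogram the current digit. */
        for (i = 0; i < n; i++)
            count[(key_lo[i] >> shift) & (k - 1)]++;

        /* Prefix sum: bucket start offsets. */
        for (b = 0; b < k; b++) {
            int c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Stable scatter of the 128-bit values into their buckets. */
        for (i = 0; i < n; i++) {
            int d = (int)((key_lo[i] >> shift) & (k - 1));
            out_hi[count[d]] = key_hi[i];
            out_lo[count[d]] = key_lo[i];
            count[d]++;
        }
        free(count);
    }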
Sorting – SRC FPGA Implementation
Memory Constraints
SRC onboard memory
• 6 banks × 4 MB
• Pipelined read or write access
• 5 clock latency
FPGA BRAM memory
• 144 blocks, 18 Kbit each
• 1 clock read and write latency
Initial Choices
Parallel Insertion Sort (BubbleSort)
• Produces sorted blocks
• Use of onboard memory pipelined processing
– Minimize data access stalls
Parallel Heapsort
• Random access merge of sorted lists
• Use of BRAM for low latency access
– Good for random data access
Parallel Insertion Sort (BubbleSort)
Systolic array of cells
Pipelined SRC processing from OnBoard Memory
Keeps highest value, passes other values
Latency = 2 × number of cells
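
As an illustration of the cell behavior described above, here is a small software model of one comparator cell in C. It is a sketch under assumed names (sort_cell, cell_step); the actual hardware cell is the parsort macro instantiated in the MAP C code two slides ahead.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative software model of one systolic comparator cell; the real
       hardware is the parsort macro used in the MAP C code shown later. */
    typedef struct {
        uint64_t hold_hi, hold_lo;   /* largest 128-bit value seen so far */
        bool     valid;              /* cell currently holds a value      */
    } sort_cell;

    /* Push one value (or a flush token) through the cell. Returns true if
       a value was forwarded to the next cell via out_hi/out_lo. */
    static bool cell_step(sort_cell *c, bool flush,
                          uint64_t in_hi, uint64_t in_lo,
                          uint64_t *out_hi, uint64_t *out_lo)
    {
        if (flush) {                            /* drain: release held value */
            if (!c->valid)
                return false;
            *out_hi = c->hold_hi;  *out_lo = c->hold_lo;
            c->valid = false;
            return true;
        }
        if (!c->valid) {                        /* empty cell absorbs value  */
            c->hold_hi = in_hi;  c->hold_lo = in_lo;
            c->valid = true;
            return false;
        }
        /* 128-bit compare: keep the larger value, forward the smaller one. */
        bool in_greater = (in_hi > c->hold_hi) ||
                          (in_hi == c->hold_hi && in_lo > c->hold_lo);
        if (in_greater) {
            *out_hi = c->hold_hi;  *out_lo = c->hold_lo;
            c->hold_hi = in_hi;    c->hold_lo = in_lo;
        } else {
            *out_hi = in_hi;  *out_lo = in_lo;
        }
        return true;
    }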
Parallel Insertion Sort (BubbleSort)
Systolic array of cells
Results passed out in reverse order of comparison
• N = # comparator cells
Sorts a list completely in Θ(L²), where L = list size
Limit sort size to some number a < L
• Creates multiple sorted lists
• Each list sorted in Θ(a)
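
Building on the cell sketch above, the block-sorting behavior can be modeled by feeding a block of a ≤ N values through a chain of cells and then draining it; in this sequential model the block comes back out largest value first, and the feed-plus-drain phases mirror the roughly 2× latency noted earlier. Again an illustrative sketch with assumed names (sort_block, CELLS), not the MAP C implementation.

    #include <string.h>

    #define CELLS 100   /* mirrors the 100-cell FPGA build reported later */

    /* Sort one block of a <= CELLS values in place (descending order),
       using sort_cell and cell_step from the sketch above. */
    void sort_block(uint64_t *hi, uint64_t *lo, int a)
    {
        sort_cell chain[CELLS];
        memset(chain, 0, sizeof chain);

        for (int v = 0; v < a; v++) {            /* feed phase */
            uint64_t vh = hi[v], vl = lo[v];
            for (int c = 0; c < a; c++)
                if (!cell_step(&chain[c], false, vh, vl, &vh, &vl))
                    break;                       /* absorbed by an empty cell */
        }
        /* Drain phase: cell 0 holds the largest value, cell a-1 the smallest. */
        for (int c = 0; c < a; c++)
            cell_step(&chain[c], true, 0, 0, &hi[c], &lo[c]);
    }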
Parallel Insertion Sort (BubbleSort)
#include <libmap.h>

/* MAP C routine (excerpt as shown on the slide): DMA the 128-bit keys,
   split into high/low 64-bit halves, into OnBoard Memory banks A/B, then
   stream each block of 'sortsize' values through the parsort systolic
   sorter, writing sorted output to banks C/D. */
void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno) {
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    /* DMA the high words from the host into bank A. */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    ….                                   /* low-word DMA, timer reads, etc. elided on the slide */

    while (arrayindex < arraysize) {     /* one iteration per block of 'sortsize' values */
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;
        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                /* Feed one value into the systolic sorter; the first argument
                   is asserted on the block's final value. Sorted output
                   drains into banks C/D. */
                data_high_in = a[i];  data_low_in = b[i];
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);
                c[i] = data_high_out;  d[i] = data_low_out;
                /* … remainder of the routine elided on the slide … */
Parallel Heapsort
Tree structure of cells
Asynchronous operation
• Acknowledged data transfer
Merges sorted lists in Θ(n lg n)
Designed for independent BRAM block accesses
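
As a software analogue of this merge tree, a k-way merge driven by a binary min-heap of list heads gives the same asymptotic behavior; the rough C sketch below uses 64-bit keys for brevity and illustrative names (head, sift_down, kway_merge). In the FPGA design each tree node is a comparator cell with acknowledged transfers rather than a heap entry, so this is intuition only, not the hardware implementation.

    #include <stdint.h>
    #include <stdlib.h>

    /* One heap entry: the current head of a sorted input list. */
    typedef struct { uint64_t key; int list; } head;

    static void sift_down(head *h, int n, int i)
    {
        for (;;) {
            int l = 2*i + 1, r = l + 1, m = i;
            if (l < n && h[l].key < h[m].key) m = l;
            if (r < n && h[r].key < h[m].key) m = r;
            if (m == i) return;
            head t = h[i]; h[i] = h[m]; h[m] = t;
            i = m;
        }
    }

    /* lists[i] is sorted list i of length len[i]; output receives the merge. */
    void kway_merge(const uint64_t **lists, const int *len, int k, uint64_t *output)
    {
        head *heap = malloc((size_t)k * sizeof *heap);
        int  *pos  = calloc((size_t)k, sizeof *pos);
        int n = 0, out = 0;

        for (int i = 0; i < k; i++)              /* load first element of each list */
            if (len[i] > 0) { heap[n].key = lists[i][0]; heap[n].list = i; n++; }
        for (int i = n/2 - 1; i >= 0; i--) sift_down(heap, n, i);

        while (n > 0) {
            int src = heap[0].list;
            output[out++] = heap[0].key;         /* emit current minimum            */
            if (++pos[src] < len[src])           /* refill from the same list       */
                heap[0].key = lists[src][pos[src]];
            else
                heap[0] = heap[--n];             /* list exhausted: shrink heap     */
            sift_down(heap, n, 0);
        }
        free(heap); free(pos);
    }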
Parallel Heapsort
BRAM Limitations
144 Block RAMs @ 512 32-bit values each = only ~18K 128-bit values (144 × 512 / 4 = 18,432)
OnBoard Memory
SRC constraint – up to 64 reads and 8 writes in one MAP C file
Cascading clock delays as the number of reads increases
Explore the use of MUXed access: search and update only 6 of 48 leaf nodes at a time in round-robin fashion
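
The round-robin servicing might be organized as in the control-flow sketch below. The helper names (merge_not_done, leaf_needs_refill, refill_leaf_from_obm) and constants are hypothetical stand-ins for the MAP-side logic, not real SRC API calls.

    #include <stdbool.h>

    #define NUM_LEAVES       48
    #define LEAVES_PER_GROUP 6
    #define NUM_GROUPS       (NUM_LEAVES / LEAVES_PER_GROUP)

    /* Hypothetical stand-ins for the MAP-side operations. */
    extern bool merge_not_done(void);
    extern bool leaf_needs_refill(int leaf);
    extern void refill_leaf_from_obm(int leaf);   /* one OnBoard Memory read */

    /* Service the 48 leaf nodes in round-robin groups of 6, so at most
       6 OnBoard Memory reads are issued per iteration. */
    void service_leaves(void)
    {
        int group = 0;
        while (merge_not_done()) {
            int base = group * LEAVES_PER_GROUP;
            for (int i = 0; i < LEAVES_PER_GROUP; i++) {
                int leaf = base + i;
                if (leaf_needs_refill(leaf))      /* child consumed its value */
                    refill_leaf_from_obm(leaf);
            }
            group = (group + 1) % NUM_GROUPS;     /* rotate to the next group */
        }
    }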
FPGA Initial Results
Baseline: one Virtex-II 6000 FPGA
PAR options: -ol high -t 1
Bubblesort Results – 100 Cells
29,354 Slices (86%)
37,131 LUTs (54%)
13.608 ns = 73 MHz (verified operational at 100 MHz)
Heapsort Results – 95 Cells (48 leaves)
21,011 Slices (62%)
24,467 LUTs (36%)
11.770 ns = 85 MHz (verified operational at 100 MHz)
Testing Procedures
All tests utilize one chip for baseline results
Evaluate the fastest radix for the software baseline
Hardware/Software Partitioning
Five cases – Case 5 utilizes FPGA reconfiguration
Data size partitioning – 100, 500, 1000, 5000, 10000
10 runs for each test case/data partitioning combination
List size: 500,000 values
Results
[Chart: Software Datasize Partitioning – Radixsort vs. Radixsort + Heapsort; time (sec.) by test case/radix (4, 8, 16) for Radixsort alone and for Radixsort + Heapsort at list sizes 100, 500, 1000, 5000, and 10000]
Fastest Software Operations (Baseline)
Comparison of Radixsort and Heapsort Combinations
• Radix 4, 8 and 16 evaluated
Minimum Time: Radix-8 Radixsort + Heapsort (Size = 5000 or 10000)
Radix-16 has too many buckets for sort size partitions evaluated
Heapsort comparisons faster than radixsort index updates
Results
Fastest SW-only time = 3.41 sec.
Fastest time including HW = 3.89 sec.
• Bubblesort (HW), Heapsort (SW)
• Partition list size of 1000
[Chart: SRC Software/Hardware Executions (500K data); time (sec.) per SW/HW test case at data partitions 100, 500, 1000, 5000, and 10000; series: Heapsort (HW), Heapsort Config (HW), Heapsort (SW), Bubblesort (HW), Bubblesort Config (HW), Radixsort (SW)]
Heapsort times:
Dominated by data access
Significantly slower than software
Results – Bubblesort vs. Radixsort
Some cases where HW is faster than SW
List sizes < 5000 benefit from SRC pipelined data access
Fastest SW case was for list size = 10000
[Chart: Radixsort (SW) vs. Bubblesort (HW); time (sec.) per data size/test case (100, 500, 1000, 5000, 10000); HW time broken into data transfer in, data processing, and data transfer out]
MAP data transfer time less significant than data processing time
For size = 1000: Input (11.3%), Analyze (76.9%), Output (11.5%)
Results - Limitations
Heapsort is limited by the overhead of input servicing
Random accesses of OBM not ideal
Overhead of loop search, sequentially dependent processing
Bubblesort limited by number of cells
Can increase by approximately 13 cells
Two-chip streaming
Reconfiguration time assumed to be a one-time setup factor
Reconfiguration case exception – solve by having a core per Virtex-II 6000
Conclusions
Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
Bubblesort works well on small data sets
Heapsort’s random data access cannot exploit SRC benefits
SRC high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs
Future Work
Heapsort’s random data access cannot exploit SRC benefits
Look for possible speedups using BRAM?
Unroll leaf memory access
Exploit the SRC “periodic macro” paradigm
Currently evaluating radix sort in hardware
This works better than bubblesort for larger sort sizes
Compare MAP C to VHDL when the baseline VHDL is faster than SW