Transcript Slide 1

Data Parallel FPGA Workloads: Software Versus Hardware
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
FPL 2009

Transcript Slide 2
FPGA Systems and Soft Processors
[Diagram: two ways to implement computation in an FPGA digital system.]
- Soft processor: Software + Compiler, takes weeks. Easier, configurable. Used in 25% of designs [source: Altera, 2009].
- Custom HW: HDL + CAD, takes months. Faster, smaller, less power.
The two approaches compete.
Goal: simplify FPGA design by customizing the soft processor architecture.
Target: data-level parallelism → vector processors.

Transcript Slide 3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
  c[i] = a[i] + b[i];

// Vectorized code
set vl, 16
vload vr0, a
vload vr1, b
vadd vr2, vr0, vr1
vstore vr2, c

Each vector instruction holds many units of independent operations: the single vadd computes vr2[i] = vr0[i] + vr1[i] for all 16 elements (i = 15 down to 0).
[Diagram: with 1 vector lane, the 16 element operations pass through a single lane.]

Transcript Slide 4
Vector Processing Primer
Same C code and vectorized code as the previous slide.
[Diagram: with 16 vector lanes, the vadd's 16 element operations (vr2[i] = vr0[i] + vr1[i]) execute across the lanes in parallel → 16x speedup.]
Previous work (on soft vector processors) examined:
1. Scalability
2. Flexibility
3. Portability
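
The lane idea can be sketched in plain C (a behavioural illustration only, not VESPA code; the macro and function names below are mine): a vector instruction of length VL completes NUM_LANES element operations per cycle, so 1 lane takes 16 cycles for the vadd above and 16 lanes take 1 cycle.

    #include <stdio.h>

    #define VL        16   /* vector length, as in "set vl,16" */
    #define NUM_LANES 16   /* hardware lanes: 1 => 16 cycles, 16 => 1 cycle */

    /* Behavioural sketch of vadd vr2,vr0,vr1: each cycle, every lane
     * finishes one element operation, so the instruction needs
     * ceil(VL / NUM_LANES) cycles. */
    static int vadd_cycles(int vr2[], const int vr0[], const int vr1[])
    {
        int cycles = 0;
        for (int base = 0; base < VL; base += NUM_LANES) {  /* one group per cycle */
            for (int lane = 0; lane < NUM_LANES && base + lane < VL; lane++)
                vr2[base + lane] = vr0[base + lane] + vr1[base + lane];
            cycles++;
        }
        return cycles;
    }

    int main(void)
    {
        int a[VL], b[VL], c[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 2 * i; }
        printf("vadd of %d elements: %d cycle(s) on %d lane(s)\n",
               VL, vadd_cycles(c, a, b), NUM_LANES);
        return 0;
    }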

Transcript Slide 5
Soft Vector Processors vs HW
[Diagram: the design-flow comparison again, now with a soft vector processor.]
- Soft vector processor: Software + Compiler + Vectorizer, takes weeks. Easier, scalable, fine-tunable, customizable; built from vector lanes (1, 2, 3, ..., 16).
- Custom HW: HDL + CAD, takes months. Faster, smaller, less power, but by how much?
Question: what is the soft vector processor vs FPGA custom HW gap? (Also vs the scalar soft processor.)

Transcript Slide 6
Measuring the Gap
[Flow: EEMBC benchmarks are implemented three ways, each is evaluated for speed and area, and the results are compared to draw conclusions.]
- Soft vector processor → evaluate speed and area
- Scalar soft processor → evaluate speed and area
- HW circuits → evaluate speed and area

Transcript Slide 7
VESPA Architecture Design
(Vector Extended Soft Processor Architecture)
[Block diagram; legend: pipe stage, logic, storage.]
- Scalar pipeline (3-stage): Icache, decode, register file (RF), MUX, ALU, writeback; shares the Dcache.
- Vector control pipeline (3-stage): decode, VC RF / VS RF, replicate, hazard check, VC WB / VS WB.
- Vector pipeline (6-stage): per-lane vector register files (VR RF) and writeback (VR WB); 32-bit lanes (Lane 1: ALU, Mem unit; Lane 2: ALU, Mem, Mul).
Supports integer and fixed-point operations [VIRAM].

Transcript Slide 8
VESPA Parameters

Compute architecture:
  Description                Symbol  Values
  Number of Lanes            L       1, 2, 4, 8, ...
  Memory Crossbar Lanes      M       1, 2, ..., L
  Multiplier Lanes           X       1, 2, ..., L

Instruction set architecture:
  Maximum Vector Length      MVL     2, 4, 8, ...
  Width of Lanes (in bits)   W       1-32
  Instruction Enable (each)  -       on/off

Memory hierarchy:
  Data Cache Capacity        DD      any
  Data Cache Line Size       DW      any
  Data Prefetch Size         DPK     < DD
  Vector Data Prefetch Size  DPV     < DD/MVL
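
As a concrete way to read this table, one point in the design space can be captured in a configuration record (a hypothetical sketch; the struct and field names are mine, not part of the VESPA RTL or tools):

    #include <stdbool.h>

    /* One point in the VESPA design space, mirroring the table above. */
    struct vespa_config {
        /* Compute architecture */
        unsigned lanes;              /* L:   1, 2, 4, 8, ...  */
        unsigned memory_xbar_lanes;  /* M:   1 .. L           */
        unsigned multiplier_lanes;   /* X:   1 .. L           */
        /* Instruction set architecture */
        unsigned max_vector_length;  /* MVL: 2, 4, 8, ...     */
        unsigned lane_width_bits;    /* W:   1 .. 32          */
        bool     insn_enabled[64];   /* per-instruction on/off (array size illustrative) */
        /* Memory hierarchy */
        unsigned dcache_bytes;       /* DD                    */
        unsigned dcache_line_bytes;  /* DW                    */
        unsigned prefetch_size;      /* DPK: < DD             */
        unsigned vector_prefetch;    /* DPV: < DD/MVL         */
    };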

Transcript Slide 9
VESPA Evaluation Infrastructure
SOFTWARE flow: EEMBC C benchmarks compiled with GCC, vectorized assembly subroutines assembled with GNU as, and everything linked with ld into an ELF binary.
HARDWARE flow: Verilog for the scalar μP plus the VPU (vector decode, replicate, hazard check, vector register files, ALUs, MUXes, multiply with saturation, right shift, saturate, memory unit) compiled with Altera Quartus II v8.1, targeting the TM4 platform.
Measurements: instruction set simulation and RTL simulation for verification, cycle counts, and area and clock frequency from Quartus.
A realistic and detailed evaluation.

Transcript Slide 10
Measuring the Gap
[Same flow as Slide 6: EEMBC benchmarks on the soft vector processor, the scalar soft processor, and HW circuits; each is evaluated for speed and area and the results are compared to draw conclusions.]

Transcript Slide 11
Designing HW Circuits
(with simplifying assumptions)
[Diagram: HW core datapath, with memory requests handled by idealized control to DDR.]
- The datapath is compiled with Altera Quartus II v8.1 to obtain area and clock frequency.
- The cycle count is modelled: assume the datapath is fed at full DDR bandwidth and calculate execution time from the data size.
These are optimistic HW implementations compared against real processors.
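
A minimal sketch of this style of model (illustrative only; the helper and constants below are made up, not the authors' exact model): if the datapath is always fed at full DDR bandwidth, the modelled cycle count is simply the number of clock cycles needed to stream the benchmark's data set.

    #include <stdio.h>

    /* Idealized cycle count: data is consumed as fast as DDR can supply it. */
    static unsigned long modelled_cycles(unsigned long data_bytes,
                                         double ddr_bytes_per_sec,
                                         double clock_hz)
    {
        double bytes_per_cycle = ddr_bytes_per_sec / clock_hz;
        return (unsigned long)(data_bytes / bytes_per_cycle + 0.5);
    }

    int main(void)
    {
        /* Made-up example numbers: 64 KB of data, 3.2 GB/s DDR, a 400 MHz circuit. */
        unsigned long cycles = modelled_cycles(64 * 1024, 3.2e9, 400e6);
        printf("modelled cycles: %lu (%.2f us)\n", cycles, cycles / 400e6 * 1e6);
        return 0;
    }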

Transcript Slide 12
Benchmarks Converted to HW
(Stratix III 3S200C2)

  Benchmark           ALMs  DSPs  M9Ks  Clock (MHz)  Cycles
  EEMBC autcor         592    32     1          323    1057
  EEMBC conven          46     0     0          476     226
  EEMBC rgbcmyk        527     0     0          447  237784
  EEMBC rgbyiq         706   108     0          274  144741
  EEMBC ip_checksum    158     0     0          457    2567
  VIRAM imgblend       302    32     0          443   14414

HW clock: 275-475 MHz. VESPA clock: 120-140 MHz.
HW advantage: 3x faster clock frequency.

Transcript Slide 13
Performance/Area Space (vs HW)
[Scatter plot: HW speed advantage (slowdown vs HW) against HW area advantage (area vs HW); the optimistic HW point sits at (1,1).]
- Scalar soft processor: 432x slower, 7x larger than HW.
- Fastest VESPA: 17x slower, 64x larger than HW.
Soft vector processors can significantly close the performance gap.

Transcript Slide 14
Area-Delay Product
- Commonly used to measure efficiency in silicon.
- Considers both performance and area.
- Inverse of performance-per-area.
- Calculated as: (Area) × (Wall Clock Execution Time).
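
In the notation of the comparison metrics defined on Slide 29 (Reporting Comparison Results); the symbols below are mine, the area-delay advantage plotted on the next slide is just the product of the area advantage and the speed advantage:

\[
\text{HW area-delay advantage}
  = \frac{A_{\text{proc}} \times T_{\text{proc}}}{A_{\text{HW}} \times T_{\text{HW}}}
  = \frac{A_{\text{proc}}}{A_{\text{HW}}} \times \frac{T_{\text{proc}}}{T_{\text{HW}}}
\]

For example, the scalar numbers from Slide 13 (7x area, 432x time) multiply to about 3000x, consistent with the roughly 2900x area-delay figure on the next slide.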

Transcript Slide 15
Area-Delay Space (vs HW)
[Plot: HW area-delay advantage (0-3500) against HW area advantage (0-80) for the scalar processor and VESPA with 1, 2, 4, 8, and 16 lanes.]
- Scalar soft processor: about 2900x worse area-delay than HW.
- Best VESPA points: about 900x worse area-delay than HW.
VESPA achieves up to 3 times better silicon usage than the scalar processor.

Transcript Slide 16
Reducing the Performance Gap
- Previously: VESPA was 50x slower than HW.
- Reducing loop overhead:
  - VESPA: decoupled pipelines (+7% speed).
- Improving data delivery:
  - VESPA: parameterized cache (2x speed, 2x area).
  - VESPA: data prefetching (+42% speed).
These enhancements were key to reducing the gap; combined, they give a 3x performance improvement.

Transcript Slide 17
Wider Cache Line Size
[Diagram: VESPA with 16 lanes (scalar core plus vector coprocessor). A vld.w loads 16 sequential 32-bit words; the lanes reach the data cache through the vector memory crossbar. Dcache: 4KB with 16-byte lines, so one vld.w needs several cache accesses.]

Transcript Slide 18
Wider Cache Line Size
[Same diagram, with the Dcache enlarged to 16KB with 64-byte lines: 4x the capacity and 4x the line size.]
Result: 2x speed for 2x area, from reduced cache accesses plus some prefetching.
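
A quick arithmetic check of why the wider line helps (assuming an aligned access): one vld.w of 16 sequential 32-bit words moves 64 bytes, so

\[
\left\lceil \frac{64\,\mathrm{B}}{16\,\mathrm{B\ line}} \right\rceil = 4 \ \text{cache accesses}
\qquad\text{vs}\qquad
\left\lceil \frac{64\,\mathrm{B}}{64\,\mathrm{B\ line}} \right\rceil = 1 \ \text{cache access.}
\]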

Transcript Slide 19
Hardware Prefetching Example
[Diagram comparing two cases for a pair of vld.w instructions.]
- No prefetching: each vld.w misses in the Dcache and pays the roughly 10-cycle DDR penalty.
- Prefetching 3 blocks: the first vld.w misses and also brings in the next 3 blocks from DDR, so the second vld.w hits.
Result: 42% speed improvement from reduced miss cycles.
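
The prefetch policy can be mimicked with a toy software cache model (a sketch with made-up names and sizes, not the VESPA memory system itself): on a miss, install the missed line plus the next few sequential lines so that the following vector loads hit.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define LINE_BYTES     64
    #define NUM_LINES      64   /* toy 4KB direct-mapped cache           */
    #define PREFETCH_LINES 3    /* "prefetching 3 blocks" from the slide */
    #define MISS_PENALTY   10   /* cycles per trip to DDR                */

    static size_t tags[NUM_LINES];
    static bool   valid[NUM_LINES];
    static long   stall_cycles;

    /* On a miss, fetch the missed line and the next PREFETCH_LINES lines. */
    static void access_addr(size_t addr)
    {
        size_t line = addr / LINE_BYTES;
        if (valid[line % NUM_LINES] && tags[line % NUM_LINES] == line)
            return;                              /* hit: no DDR access        */
        stall_cycles += MISS_PENALTY;            /* one DDR transaction       */
        for (size_t l = line; l <= line + PREFETCH_LINES; l++) {
            valid[l % NUM_LINES] = true;         /* install line + prefetches */
            tags[l % NUM_LINES]  = l;
        }
    }

    int main(void)
    {
        for (size_t a = 0; a < 16 * 1024; a += 4)   /* stream of 32-bit loads */
            access_addr(a);
        printf("miss stall cycles with %d-line prefetch: %ld\n",
               PREFETCH_LINES, stall_cycles);
        return 0;
    }

Setting PREFETCH_LINES to 0 in this toy model quadruples the stall cycles for the streaming loop, which is the effect the slide is illustrating.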

Transcript Slide 20
Reducing the Area Gap
(by Customizing the Instruction Set)
- FPGAs can be reconfigured between applications.
- Observations: not all applications
  1. operate on 32-bit data types;
  2. use the entire vector instruction set.
- So: eliminate unused hardware.

Transcript Slide 21
VESPA Parameters

  Description                Symbol  Values
  Number of Lanes            L       1, 2, 4, 8, ...
  Maximum Vector Length      MVL     2, 4, 8, ...
  Width of Lanes (in bits)   W       1-32        ← reduce width
  Memory Crossbar Lanes      M       1, 2, ..., L
  Multiplier Lanes           X       1, 2, ..., L
  Instruction Enable (each)  -       on/off      ← subset instruction set
  Data Cache Capacity        DD      any
  Data Cache Line Size       DW      any
  Data Prefetch Size         DPK     < DD
  Vector Data Prefetch Size  DPV     < DD/MVL

The customizations applied here: reduce the lane width (W) and subset the instruction set (per-instruction enables).

Transcript Slide 22
Customized VESPA vs HW
[Plot: HW speed advantage (slowdown vs HW, 0-200) against HW area advantage (area vs HW, 0-70) for Full, Subsetted, and Subsetted + Width Reduced VESPA configurations.]
Up to 45% area saved with width reduction and instruction subsetting.

Transcript Slide 23
Summary
- VESPA is more competitive with HW design:
  - The fastest VESPA is only 17x slower than HW; the scalar soft processor was 432x slower.
  - Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching.
  - Further enhancements can reduce the gap more.
- VESPA improves the efficiency of silicon usage:
  - 900x worse area-delay than HW, versus 2900x for the scalar soft processor.
  - Subsetting/width reduction can further reduce this to 561x.
- This enables software implementation of non-critical data-parallel computation.

Transcript Slide 24
Thank You!
Stay tuned for public release of:
1. GNU assembler ported for VIRAM (integer only)
2. VESPA hardware design (DE3 ready)

Transcript Slide 25
Breaking Down Performance
[Diagram: several copies of a loop (Loop: <work>; goto Loop) running side by side.]
Components of performance:
(a) iteration-level parallelism,
(b) cycles per iteration,
(c) clock period.
We measure the HW advantage in each of these components.
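
Putting the three components together (the symbols are mine; the decomposition is the slide's):

\[
T_{\text{exec}}
  = \frac{N_{\text{iterations}}}{\text{iteration-level parallelism}}
    \times \text{cycles per iteration}
    \times \text{clock period}
\]

As a sanity check, the per-component geometric means on the next slide multiply out to roughly the overall 17x gap: 3.2 × 0.64 × 8.2 ≈ 17.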

Transcript Slide 26
Breakdown of Performance Loss
(16-lane VESPA vs HW)

  Benchmark     Clock Frequency  Iteration-Level Parallelism  Cycles Per Iteration
  autcor        2.6x             1x                           9.1x
  conven        3.9x             1x                           6.1x
  rgbcmyk       3.7x             0.375x                       13.8x
  rgbyiq        2.2x             0.375x                       19.0x
  ip_checksum   3.7x             0.5x                         4.8x
  imgblend      3.6x             1x                           4.4x
  GEOMEAN       3.2x             0.64x                        8.2x

Total: 17x. Cycles per iteration is the largest factor; it was previously worse and was recently improved.

Transcript Slide 27
1-Lane VESPA vs Scalar
1. Efficient pipeline execution
2. Large vector register file for storage
3. Amortization of loop control instructions
4. More powerful ISA (VIRAM vs MIPS):
   - support for fixed-point operations
   - predication
   - built-in min/max/absolute instructions
5. Execution in both the scalar core and the vector co-processor
6. Manual vectorization in assembly versus scalar GCC

Transcript Slide 28
Measuring the Gap
- Scalar: MIPS soft processor (complete & real), running C from the EEMBC C benchmarks.
- VESPA: VIRAM soft vector processor (complete & real), running vectorized assembly.
- HW: custom circuit for each benchmark (simplified & idealized), written in Verilog.
The implementations are compared against one another.

Transcript Slide 29
Reporting Comparison Results
1. Scalar (C) vs HW (Verilog)
2. VESPA (vector assembly) vs HW (Verilog)
3. HW (Verilog) is the reference in both comparisons.
- Performance (wall clock time):
  HW Speed Advantage = Execution Time of Processor / Execution Time of Hardware
- Area (actual silicon area):
  HW Area Advantage = Area of Processor / Area of Hardware

Transcript Slide 30
Cache Design Space – Performance (Wall Clock Time)
[Plot: speedup over the 4KB, 16B-line baseline for cache capacities of 4KB to 64KB and line sizes of 16B (129 MHz), 32B (126 MHz), 64B (123 MHz), and 128B (122 MHz); speedups range from 1.00 up to 1.93.]
- The best cache design almost doubles the performance of the original VESPA.
- Cache line size is more important than cache depth (lots of streaming).
- More pipelining/retiming could reduce the clock frequency penalty.

Transcript Slide 31
Vector Length Prefetching Performance
[Plot: speedup versus amount of prefetching (None, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, autcor, imgblend, filt3x3, and GMEAN. Annotations: peak 29%, 21%, 2.2x, "no cache pollution", "not receptive".]
1*VL prefetching provides good speedup without tuning; 8*VL is best.

Transcript Slide 32
Overall Memory System Performance
[Bar chart (16 lanes): fraction of total cycles spent as memory unit stall cycles and as miss cycles, for three configurations: 16-byte line (4KB), 64-byte line (16KB), and 64-byte line + prefetch 15. Values marked: 67%, 48%, 31%, and 4%.]
- A wider line plus prefetching reduces memory unit stall cycles significantly.
- A wider line plus prefetching eliminates all but 4% of miss cycles.