Transcript
Slide 1: Data Parallel FPGA Workloads: Software Versus Hardware
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
FPL 2009

Slide 2: FPGA Systems and Soft Processors
Two ways to implement computation in a digital system on an FPGA:
- Soft processor (software + compiler): weeks of design time; easier, configurable. Used in 25% of designs [source: Altera, 2009].
- Custom HW (HDL + CAD): months of design time; faster, smaller, less power.
The two compete. Our goal: simplify FPGA design by customizing the soft processor architecture.
Target: data-level parallelism → vector processors.

Slide 3: Vector Processing Primer

  // C code
  for (i = 0; i < 16; i++)
      c[i] = a[i] + b[i];

  // Vectorized code
  set    vl, 16
  vload  vr0, a
  vload  vr1, b
  vadd   vr2, vr0, vr1
  vstore vr2, c

Each vector instruction holds many units of independent operations: the single vadd computes vr2[i] = vr0[i] + vr1[i] for all i from 15 down to 0. With 1 vector lane, these 16 element operations execute one at a time.

Slide 4: Vector Processing Primer (continued)
Same C code and vectorized code, but now with 16 vector lanes: all 16 element operations of the vadd execute at once, a 16x speedup (see the C sketch after this slide).
Previous work (on soft vector processors) demonstrated:
1. Scalability
2. Flexibility
3. Portability
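To make the lane/vector-length mechanics above concrete, here is a minimal C model of a strip-mined vector add. It is an editor's sketch, not VESPA source: the function name, the MVL and LANES values, and the one-element-op-per-lane-per-cycle accounting are all assumptions.

  #include <stdio.h>

  #define MVL   16   /* maximum vector length (assumed) */
  #define LANES 16   /* number of vector lanes (assumed) */

  /* Strip-mined vector add over n elements: each outer step issues one
   * vadd of at most MVL element operations; with LANES lanes, one vadd
   * takes roughly vl/LANES cycles instead of vl cycles. */
  static long vadd_model(int *c, const int *a, const int *b, int n) {
      long cycles = 0;
      for (int i = 0; i < n; i += MVL) {
          int vl = (n - i < MVL) ? (n - i) : MVL;  /* set vl */
          for (int e = 0; e < vl; e++)             /* one vadd */
              c[i + e] = a[i + e] + b[i + e];
          cycles += (vl + LANES - 1) / LANES;      /* element ops / lanes */
      }
      return cycles;
  }

  int main(void) {
      int a[16], b[16], c[16];
      for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 2 * i; }
      long cyc = vadd_model(c, a, b, 16);
      printf("c[15] = %d, modelled vadd cycles = %ld\n", c[15], cyc);
      return 0;                 /* prints c[15] = 45, cycles = 1 */
  }

With LANES = 1 the same model takes 16 cycles for the vadd, which is exactly the 16x gap between slides 3 and 4.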
Slide 5: Soft Vector Processors vs HW
- Soft vector processor (software + compiler + vectorizer; lanes 1, 2, ..., 16): weeks of design time; scalable, fine-tunable, customizable, easier.
- Custom HW (HDL + CAD): months of design time; faster, smaller, less power. But by how much?
Question: what is the gap between a soft vector processor and FPGA custom HW? (And versus a scalar soft processor?)

Slide 6: Measuring the Gap
EEMBC benchmarks are run on three implementations: a scalar soft processor, the soft vector processor, and custom HW circuits. For each we evaluate speed and area, compare the results, and draw conclusions.

Slide 7: VESPA Architecture Design (Vector Extended Soft Processor Architecture)
[Pipeline diagram; the legend distinguishes pipe stages, logic, and storage.]
- Scalar pipeline (3-stage): Icache fetch, decode + register file, ALU and writeback, with a shared Dcache.
- Vector control pipeline (3-stage): decode, VC/VS register files, writeback.
- Vector pipeline (6-stage): decode, replicate, hazard check, VR register file read, execute in 32-bit lanes (Lane 1: ALU and memory unit; Lane 2: ALU, memory, multiplier), VR writeback.
Supports integer and fixed-point operations [VIRAM].

Slide 8: VESPA Parameters

Compute architecture:
  Number of lanes            L    1, 2, 4, 8, ...
  Memory crossbar lanes      M    1, 2, ..., L
  Multiplier lanes           X    1, 2, ..., L
Instruction set architecture:
  Maximum vector length      MVL  2, 4, 8, ...
  Width of lanes (in bits)   W    1-32
  Instruction enable (each)  -    on/off
Memory hierarchy:
  Data cache capacity        DD   any
  Data cache line size       DW   any
  Data prefetch size         DPK  < DD
  Vector data prefetch size  DPV  < DD/MVL

Slide 9: VESPA Evaluation Infrastructure
Software flow: EEMBC C benchmarks are compiled with GCC, vectorized assembly subroutines are assembled with GNU as, and everything is linked with ld into an ELF binary.
Hardware flow: Verilog for the scalar μP plus the vector coprocessor (decode, replicate, hazard check, VC/VS/VR register files, lanes with ALUs, saturation, multiply, shift, and memory units).
Measurement: instruction set simulation and RTL simulation for verification and cycle counts; Altera Quartus II v8.1 for area and clock frequency; the TM4 platform for execution. A realistic and detailed evaluation.

Slide 10: Measuring the Gap (recap)
Same outline as slide 6; next up is evaluating the HW circuits and comparing speed and area.

Slide 11: Designing HW Circuits (with simplifying assumptions)
Each benchmark's core datapath is written in Verilog with idealized control, and its memory requests go to DDR. Area and clock frequency come from Altera Quartus II v8.1; the cycle count is modelled by assuming the datapath is fed at full DDR bandwidth, so execution time is calculated from the data size.
These are optimistic HW implementations compared against real processors.

Slide 12: Benchmarks Converted to HW (Stratix III 3S200C2)

  Benchmark      ALMs  DSPs  M9Ks  Clock (MHz)  Cycles
  EEMBC
  autcor          592    32     1          323    1057
  conven           46     0     0          476     226
  rgbcmyk         527     0     0          447  237784
  rgbyiq          706   108     0          274  144741
  ip_checksum     158     0     0          457    2567
  VIRAM
  imgblend        302    32     0          443   14414

HW clock: 275-475 MHz; VESPA clock: 120-140 MHz. HW advantage: roughly 3x faster clock frequency.

Slide 13: Performance/Area Space (vs HW)
[Scatter plot of HW speed advantage versus HW area advantage, with optimistic HW at (1,1).]
- Scalar soft processor: 432x slower and 7x larger than HW.
- Fastest VESPA: 17x slower and 64x larger than HW.
Soft vector processors can significantly close the performance gap.

Slide 14: Area-Delay Product
Commonly used to measure efficiency in silicon, since it considers both performance and area; it is the inverse of performance-per-area. Calculated as (area) × (wall-clock execution time). (A small C helper follows this slide.)
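A minimal C helper for the metric just defined; normalizing HW to (1, 1) and reusing the slide-13 headline numbers are my assumptions for illustration.

  #include <stdio.h>

  /* Area-delay product: (area) x (wall-clock execution time).
   * Lower is better; it is the inverse of performance-per-area. */
  static double area_delay(double area, double exec_time) {
      return area * exec_time;
  }

  int main(void) {
      /* Normalize custom HW to area 1 and time 1; the fastest VESPA
       * point from slide 13 is then ~64x the area and ~17x the time. */
      double hw    = area_delay(1.0, 1.0);
      double vespa = area_delay(64.0, 17.0);
      printf("VESPA area-delay vs HW: %.0fx\n", vespa / hw);  /* 1088x */
      return 0;
  }

The product of these rounded headline numbers (~1100x) lands in the same ballpark as the ~900x measured geomean the deck reports for 16 lanes; the difference is rounding and per-benchmark variation.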
Slide 15: Area-Delay Space (vs HW)
[Plot of HW area-delay advantage (0-3500) versus HW area advantage (0-80) for Scalar and for VESPA with 1, 2, 4, 8, and 16 lanes.]
Scalar is ~2900x worse than HW in area-delay; 16-lane VESPA is ~900x worse.
VESPA makes up to 3 times better use of silicon than Scalar.

Slide 16: Reducing the Performance Gap
Previously, VESPA was 50x slower than HW. Two lines of attack:
- Reducing loop overhead: decoupled pipelines (+7% speed).
- Improving data delivery: a parameterized cache (2x speed, 2x area) and data prefetching (+42% speed).
These enhancements were key parts of reducing the gap: combined, a 3x performance improvement.

Slide 17: Wider Cache Line Size
A vld.w on a 16-lane VESPA loads 16 sequential 32-bit words through the vector memory crossbar. With the baseline Dcache (4KB, 16B lines), one such vector load spans several cache lines and therefore needs several cache accesses.

Slide 18: Wider Cache Line Size (continued)
With a 16KB Dcache and 64B lines (4x the capacity, 4x the line size), the same vld.w is satisfied with far fewer cache accesses: 2x speed for 2x area, from reduced cache accesses plus some implicit prefetching.

Slide 19: Hardware Prefetching Example
Without prefetching, consecutive vld.w instructions each miss in the Dcache, and each miss pays the 10-cycle DDR penalty. Prefetching 3 blocks on a miss turns the following accesses into hits, so only the first access pays the penalty. (A toy model of this follows.)
Result: a 42% speed improvement from reduced miss cycles.
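A toy C model of the prefetching example above; the sequential access pattern, the access count, and the bookkeeping are editor's assumptions: only the 10-cycle penalty and the 3-block prefetch depth come from the slide.

  #include <stdio.h>

  #define MISS_PENALTY 10  /* cycles per DDR miss, as on the slide */

  /* Sequential block accesses: with no prefetching every new block
   * misses; prefetching k further blocks on each miss makes the
   * next k accesses hit. */
  static long miss_cycles(int accesses, int prefetch_blocks) {
      long cycles  = 0;
      int  covered = 0;            /* hits left from the last prefetch */
      for (int i = 0; i < accesses; i++) {
          if (covered > 0) {
              covered--;           /* HIT: covered by earlier prefetch */
          } else {
              cycles += MISS_PENALTY;        /* MISS: pay DDR penalty */
              covered = prefetch_blocks;
          }
      }
      return cycles;
  }

  int main(void) {
      printf("no prefetch: %ld miss cycles\n", miss_cycles(12, 0)); /* 120 */
      printf("prefetch 3:  %ld miss cycles\n", miss_cycles(12, 3)); /*  30 */
      return 0;
  }

In this model, prefetching 3 blocks cuts miss cycles by 4x on a streaming pattern; the 42% end-to-end speedup on the slide is smaller because miss cycles are only part of total execution time.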
Slide 20: Reducing the Area Gap (by Customizing the Instruction Set)
FPGAs can be reconfigured between applications. Observations: not all applications
1. operate on 32-bit data types, or
2. use the entire vector instruction set.
So we can eliminate the unused hardware.

Slide 21: VESPA Parameters (revisited)
The same parameter table as slide 8. The customizations used here are reducing the lane width W and subsetting the instruction set via the per-instruction enables.

Slide 22: Customized VESPA vs HW
[Plot of HW speed advantage (0-200) versus HW area advantage (0-70) for full, subsetted, and subsetted + width-reduced VESPA configurations.]
Up to 45% area is saved with width reduction and instruction subsetting.

Slide 23: Summary
VESPA is more competitive with HW design:
- The fastest VESPA is only 17x slower than HW; the scalar soft processor was 432x slower.
- Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching.
- Further enhancements can reduce the gap more.
VESPA improves the efficiency of silicon usage:
- 900x worse area-delay than HW, versus 2900x for the scalar soft processor.
- Subsetting and width reduction can further reduce this to 561x.
Together these enable software implementation of non-critical data-parallel computation.

Slide 24: Thank You!
Stay tuned for public release of:
1. the GNU assembler ported for VIRAM (integer only), and
2. the VESPA hardware design (DE3 ready).

Slide 25: Breaking Down Performance (backup)
Components of performance for a loop (Loop: <work>; goto Loop):
a) iteration-level parallelism;
b) cycles per iteration;
c) clock period.
Execution time is (cycles per iteration) × (clock period), spread across the iterations that run in parallel. We measure the HW advantage in each of these components.

Slide 26: Breakdown of Performance Loss (16-lane VESPA vs HW)

  Benchmark     Clock Frequency  Iteration-Level Parallelism  Cycles Per Iteration
  autcor                   2.6x                           1x                  9.1x
  conven                   3.9x                           1x                  6.1x
  rgbcmyk                  3.7x                       0.375x                 13.8x
  rgbyiq                   2.2x                       0.375x                 19.0x
  ip_checksum              3.7x                         0.5x                  4.8x
  imgblend                 3.6x                           1x                  4.4x
  GEOMEAN                  3.2x                        0.64x                  8.2x

Total HW advantage: 17x; the geomean components multiply to this total (see the check at the end of this transcript). Cycles per iteration is the largest factor; it was previously worse and was recently improved.

Slide 27: 1-Lane VESPA vs Scalar (backup)
Why even a 1-lane VESPA outperforms the scalar soft processor:
1. Efficient pipeline execution.
2. Large vector register file for storage.
3. Amortization of loop control instructions.
4. More powerful ISA (VIRAM vs MIPS):
   1. support for fixed-point operations;
   2. predication;
   3. built-in min/max/absolute instructions.
5. Execution in both the scalar pipeline and the vector co-processor.
6. Manual vectorization in assembly versus scalar GCC.

Slide 28: Measuring the Gap (methodology)
The EEMBC C benchmarks are implemented three ways and compared:
- Scalar: MIPS soft processor, in C (complete and real).
- VESPA: VIRAM soft vector processor, in assembly (complete and real).
- HW: a custom circuit per benchmark, in Verilog (simplified and idealized).

Slide 29: Reporting Comparison Results
1. Scalar (C) vs HW (Verilog)
2. VESPA (vector assembly) vs HW (Verilog)
3. … vs HW (Verilog)
Performance (wall-clock time):
  HW Speed Advantage = (Execution Time of Processor) / (Execution Time of Hardware)
Area (actual silicon area):
  HW Area Advantage = (Area of Processor) / (Area of Hardware)

Slide 30: Cache Design Space – Performance (Wall Clock Time)
[Plot: speedup over the 4KB, 16B-line baseline for capacities 4KB-64KB and line sizes 16B (129 MHz), 32B (126 MHz), 64B (123 MHz), 128B (122 MHz); speedups range from about 1.13x to 1.93x.]
- The best cache design almost doubles the performance of the original VESPA.
- Cache line size is more important than cache depth (lots of streaming).
- More pipelining/retiming could reduce the clock frequency penalty.

Slide 31: Vector Length Prefetching – Performance
[Plot: speedup versus amount of prefetching (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, autcor, imgblend, filt3x3, and GMEAN; peak single-benchmark speedup 2.2x; GMEAN peaks at 29%, with 21% already at 1*VL.]
- 1*VL prefetching provides good speedup without tuning; 8*VL is best.
- Some benchmarks are not receptive, and no cache pollution is observed.

Slide 32: Overall Memory System Performance (16 lanes)
[Bar chart: fraction of total cycles (0-0.8) spent on memory unit stalls and cache miss cycles for three configurations: 16-byte line (4KB), 64-byte line (16KB), and 64-byte line + prefetching; memory unit stalls start at 67% of cycles, and miss cycles fall from 48% to 31% to 4%.]
- The wider line plus prefetching reduces memory unit stall cycles significantly.
- The wider line plus prefetching eliminates all but 4% of miss cycles.
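Finally, a one-line sanity check (an editor's addition, not from the deck) that the geomean component advantages from slide 26 multiply to the overall 17x gap:

  #include <stdio.h>

  int main(void) {
      /* Geomean HW advantages, 16-lane VESPA vs HW (slide 26). */
      double clock_freq = 3.2;   /* HW clock is ~3.2x faster             */
      double iter_par   = 0.64;  /* VESPA exploits more iteration-level
                                    parallelism, so HW is at 0.64x       */
      double cpi        = 8.2;   /* HW uses ~8.2x fewer cycles/iteration */
      printf("total HW advantage: %.1fx\n",
             clock_freq * iter_par * cpi);  /* 16.8x, i.e. the 17x gap */
      return 0;
  }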