F5 7-16-07 - Greg Stitt, University of Florida


RC Device Characterizations & Tradeoff Analysis

Jason Williams

August 30, 2007

Introduction

• Reconfigurable Computing (RC) is an emerging field that utilizes devices with a programmable fabric, allowing the hardware to be configured and adapted to solve changing problems.
• RC systems have typically been built using Field Programmable Gate Arrays (FPGAs), but there are other architectures that could implement RC systems, such as Field Programmable Object Arrays (FPOAs) and Field Programmable Compute Arrays (FPCAs, e.g. MONARCH).

Subject & Purpose

 

Subject

• To survey the landscape of various RC devices
  • Characterize these devices using various metrics (performance, price, power)
  • Create a comparison framework using the characterizations

Purpose

• Will give the end user a quantitative framework to aid in the selection of an appropriate RC device to meet their application needs
• Lays groundwork for understanding performance impacts of architectural components

Problem Definition

Problems

• RC devices can be vastly different from one another
  • Various architectural differences and very few standard/common parameters
  • Memory example: Xilinx BRAM vs. Altera M-RAM/M4K/M512 vs. FPOA RF/IRAM vs. CPU cache
• RC devices differ from traditional microprocessors
  • Typically slower clock rates
  • Potential for massive parallelism
  • Different power consumption trends
  • Different on-die memory configurations
• All of these differences make direct device comparisons difficult

Problem Background

Users have a variety of requirements/concerns – What key parameters do we need to compare?

• Computational performance (integer/fixed point, floating point, fine grained/bit level)
• On-chip memory performance (latency, bandwidth)
• Off-chip communications and I/O
• Power consumption
• Price

Scope Statement

Devices to be included in study

• Xilinx Virtex 4 LX200, LX100, SX55
• Altera Stratix II S180
• Freescale PowerPC MPC7447 + AltiVec
• MathStar Arrix FPOA (1 GHz)
• Raytheon Monarch PCA
• Sony/Toshiba/IBM Cell

Methods

Literature review
  • Apply and extend characterizations and metrics to devices under study

Datasheet analysis

Experiments using vendor development tools/simulation environments
  • Example: utilization and timing analysis results from post place and route for common ALU/FP structures

Combine characterization study results into a QFD style matrix

FPGA Theoretical Floating Point Performance

Methodology

• Adapted from Jeff Mason’s (Xilinx) presentation at RSSI ’07, “FPGA HPC – The road beyond processors”, with input from Dave Strenski (Cray). A similar methodology is also reported in “An overview of FPGAs and FPGA programming; Initial experiences at Daresbury”, Richard Wain, Ian Bush, Martyn Guest, Miles Deegan, Igor Kozin and Christine Kitchen. November 2006. Distributed Computing Group at Daresbury Laboratory.
• Using datasheet information, Altera and Xilinx Floating Point cores, ISE and Quartus, estimate FP add and FP multiply performance.

FPGA Floating Point Performance

Xilinx Example

• Data from Virtex 4 Family Overview (DS112) and Coregen Floating Point Operator v3.0 (DS335)
• Assumptions:
  • 15% slice overhead (routing, I/O, etc.).
  • Use DSP resources first, then logic only implementation to fill device.
  • Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
  • Assume 2 storage elements (BRAM) per operation (operands, overwrite with result). Limit the number of operations if there is not enough BRAM to support them.
  • Use speed optimized, highest effort for Synthesis, Map, PAR.

FPGA Floating Point Performance

Xilinx Example Continued (LX200 –10)

Double Precision Floating Point Multiply

Per Instance                                 Max Frequency (MHz)  DSPs Used  LUTs Used        FFs Used
DSP Implementation                           303                  16         550              774
Logic Only Implementation                    185                  0          2311             2457
Device Maximum (less 15% LUT for overhead)   500                  96         178176 (151449)  178176 (151449)

• 96 / 16 = 6 DSP Multipliers
• 151449 – (774 * 6) = 146805 remaining LUT for Logic Multipliers
• 146805 / 2457 = ~59 Logic Only Multipliers
• 65 total multipliers in 1 context @ 185 MHz = ~12 Gflop/s
• Limit total number of multipliers to 85 due to BRAM limitation = ~11.1 Gflop/s
• LX200 has 336 18Kb dual port BRAM. For 64-bit (DP), ((336 * 2) / 4) / 2 = 85 function units
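The multiplier count above can be reproduced with a few lines of integer arithmetic. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, and it assumes one result per functional unit per clock cycle, with DSP-based units placed first and the remaining fabric budget filled with logic-only units, as the assumptions slide describes. The per-unit fabric costs (774 and 2457) are the figures used in the slide's own arithmetic.

```python
def fill_device(dsp_total, dsp_per_unit, fabric_budget,
                fabric_per_dsp_unit, fabric_per_logic_unit):
    """Place DSP-based units first, then fill the rest with logic-only units."""
    dsp_units = dsp_total // dsp_per_unit
    remaining = fabric_budget - dsp_units * fabric_per_dsp_unit
    logic_units = remaining // fabric_per_logic_unit
    return dsp_units, logic_units


def peak_gflops(units, clock_mhz):
    """Peak throughput assuming one result per unit per cycle (an assumption)."""
    return units * clock_mhz / 1000.0


# Figures taken from the LX200 -10 double precision multiply slide.
dsp_units, logic_units = fill_device(
    dsp_total=96, dsp_per_unit=16, fabric_budget=151449,
    fabric_per_dsp_unit=774, fabric_per_logic_unit=2457)

# Use the lower of the two clock speeds (DSP vs. logic only).
clock_mhz = min(303, 185)
peak = peak_gflops(dsp_units + logic_units, clock_mhz)
print(dsp_units, logic_units, peak)
```

With the slide's inputs this yields 6 DSP multipliers and 59 logic-only multipliers, matching the 65 total and roughly 12 Gflop/s reported before the BRAM limit is applied.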

Theoretical Floating Point Performance

Methodology

• FPOA floating point performance is reported as 0. This device could have a floating point core designed for it, but its architecture (16 bit ALUs) would not implement FP efficiently.
• PowerPC, AltiVec, MONARCH, and Cell floating point performance numbers are available/derivable from their respective datasheets.

Floating Point Performance Results

[Figure: Floating Point Performance (BRAM Limitation). Bar chart of DP Multiply, DP Add, SP Multiply, and SP Add for each device; vertical axis 0 to 200.]

Floating Point Performance Results

[Figure: Floating Point Performance (No BRAM Limitation). Bar chart of DP Multiply, DP Add, SP Multiply, and SP Add for each device; vertical axis 0 to 200.]

Floating Point Performance Results

Theoretical Floating Point Performance (GFlops, BRAM Limitation)

Device                        DP Multiply  DP Add  SP Multiply  SP Add
Xilinx Virtex 4 LX200         12.025       24.14   46.032       61.824
Xilinx Virtex 4 LX100         7.03         17.04   32.88        44.16
Xilinx Virtex 4 SX55          7.03         11.016  38.36        33.998
Altera Stratix II S180        8.14         17.304  71.68        48.334
Freescale PowerPC MPC7447     1            1       1            1
Freescale PowerPC + AltiVec   1            1       5            5
MathStar Arrix FPOA (1 GHz)   0            0       0            0
Raytheon Monarch PCA          0            0       64           64
Sony/Toshiba/IBM Cell         20           20      200          200

Theoretical Floating Point Performance (GFlops, No BRAM Limitation)

Device                        DP Multiply  DP Add  SP Multiply  SP Add
Xilinx Virtex 4 LX200         12.025       34.08   63.568       95.68
Xilinx Virtex 4 LX100         7.03         18.744  36.716       53.36
Xilinx Virtex 4 SX55          7.03         11.016  38.36        33.998
Altera Stratix II S180        8.14         17.304  71.68        48.334
Freescale PowerPC MPC7447     1            1       1            1
Freescale PowerPC + AltiVec   1            1       5            5
MathStar Arrix FPOA (1 GHz)   0            0       0            0
Raytheon Monarch PCA          0            0       64           64
Sony/Toshiba/IBM Cell         20           20      200          200

Floating Point Conclusions

• For FPGAs, floating point performance depends on the FP core implementation, which impacts resource utilization and maximum achievable frequency.
• For Xilinx devices, available on-chip memory also greatly impacts performance if we assume there has to be enough on-chip memory to buffer operands and results.
  • Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
• Xilinx adder cores can use on-chip DSP resources; Altera adder cores do not.
• MONARCH only supports single precision floating point.
• Cell is the clear leader in theoretical floating point performance (using all processing elements).

Theoretical Integer Performance

• Utilize the same basic methodology as the Floating Point Performance Comparison
  • 15% slice overhead (routing, I/O, etc.).
  • Use DSP resources first, then logic only implementation to fill device.
  • Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
  • Use vendor software (Quartus, ISE) to find resource utilization for 1 functional unit. Calculate the number of parallel functional units that fit in 1 context using datasheet values.
  • Assume 2 storage elements (BRAM) per functional unit (operands, overwrite with result). Limit the number of parallel functional units if there is not enough BRAM to support 2 storage elements per functional unit.
  • Use speed optimized, highest effort for Synthesis, Map, PAR.
• Use standard integer widths (32 bit and 16 bit).
• Analyze Addition and Multiplication operations separately.

Theoretical Integer Performance

Methodology

• FPOA 32 bit integer performance is reported as 0. This device could have a 32 bit ALU core designed for it, but it is natively a 16 bit device.
• PowerPC, AltiVec, MONARCH, and Cell integer performance numbers are available/derivable from their respective datasheets.

Integer Performance Results

[Figure: Integer Performance (BRAM Limitation). Bar chart of 32 bit Multiply, 32 bit Add, 16 bit Multiply, and 16 bit Add for each device; vertical axis 0 to 400.]

Integer Performance Results

[Figure: Integer Performance (No BRAM Limitation). Bar chart of 32 bit Multiply, 32 bit Add, 16 bit Multiply, and 16 bit Add for each device; vertical axis 0 to 2500.]

Integer Performance Results

Theoretical Integer Performance (GOPs, BRAM Limitation)

Device                        32 bit Multiply  32 bit Add  16 bit Multiply  16 bit Add
Xilinx Virtex 4 LX200         37.848           69.216      115.584          161.28
Xilinx Virtex 4 LX100         23.406           49.44       82.56            115.2
Xilinx Virtex 4 SX55          38.346           65.92       110.08           153.6
Altera Stratix II S180        74.5             17.304      257.07           48.334
Freescale PowerPC MPC7447     3                3           3                3
Freescale PowerPC + AltiVec   7                7           11               11
MathStar Arrix FPOA (1 GHz)   0                0           384              384
Raytheon Monarch PCA          64               64          64               64
Sony/Toshiba/IBM Cell         125              125         250              250

Theoretical Integer Performance (GOPs, No BRAM Limitation)

Device                        32 bit Multiply  32 bit Add  16 bit Multiply  16 bit Add
Xilinx Virtex 4 LX200         37.848           979.736     198.144          2243.04
Xilinx Virtex 4 LX100         23.406           549.608     122.464          1238.88
Xilinx Virtex 4 SX55          38.346           371.624     201.928          733.92
Altera Stratix II S180        74.5             17.304      257.07           48.334
Freescale PowerPC MPC7447     3                3           3                3
Freescale PowerPC + AltiVec   7                7           11               11
MathStar Arrix FPOA (1 GHz)   0                0           384              384
Raytheon Monarch PCA          64               64          64               64
Sony/Toshiba/IBM Cell         125              125         250              250

Integer Performance Conclusions

• In some cases, the BRAM limitation is again an important performance limiter for Xilinx devices.
  • Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
• Quartus II 6.0 typically reports a higher maximum achievable frequency for post place and route timing analysis than ISE 9.2.
  • Used speed grade -10 for Virtex 4 devices.
  • Used speed grade -3 for Stratix II device.
  • 32 bit multiply example: Quartus reports 500 MHz for both DSP and Logic Only implementations; ISE reports 421 MHz for DSP, 249 MHz for Logic Only.
• Xilinx adder cores can use on-chip DSP resources, which could improve add performance if there was enough memory support. Altera adder cores do not support DSP utilization and therefore suffer a performance hit compared to Xilinx devices.
• Without the BRAM limitation, Xilinx devices show the highest performance for Integer Add operations.
• With the BRAM limitation, the FPOA has the highest 16 bit integer performance.
• Cell has the highest 32 bit integer performance (using all processing elements).

Bit-level Computational Performance

Methodology

• Based on DeHon’s Computational Density calculations

\[
\text{Computational Density} = \frac{(\text{ALU bit operations/cycle}) \times \text{frequency}}{\text{Die area} / \lambda^{2}}
\]

• Normalizes performance by die (or package) area and minimum feature size/process technology
• Bit operations for FPGAs are the number of 4 input LUTs
• Bit operations for GPP and other “hybrid” devices are based on number of cores, number of issued instructions, and width of ALU/Functional Units
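As a rough illustration of the computational density metric, the sketch below divides bit operations per second by the die area expressed in units of lambda squared. The function name and the example device figures are hypothetical and are not taken from the study; area and lambda must be given in consistent length units.

```python
def computational_density(bit_ops_per_cycle, frequency_hz, die_area, feature_lambda):
    """Bit operations per second, per lambda^2 of die area.

    die_area and feature_lambda must use consistent units
    (for example m^2 and m).
    """
    if die_area <= 0 or feature_lambda <= 0:
        raise ValueError("die_area and feature_lambda must be positive")
    area_in_lambda_squared = die_area / feature_lambda ** 2
    return bit_ops_per_cycle * frequency_hz / area_in_lambda_squared


# Hypothetical FPGA-like device: 100,000 4-input LUTs (one bit operation
# each per cycle), 500 MHz, 400 mm^2 die, lambda of 45 nm.
example = computational_density(
    bit_ops_per_cycle=100_000,
    frequency_hz=500e6,
    die_area=4.0e-4,        # m^2
    feature_lambda=45e-9,   # m
)
print(f"{example:.3e} bit ops per lambda^2 per second")
```

Normalizing by lambda squared is what lets devices built on different process nodes be compared on the same axis.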

Bit-level Computation Performance

[Figure: Bit Level Computational Density. Bar chart per device; vertical axis 0 to 160.]

• As expected, fine-grained FPGAs dominate performance in this metric

External Memory Bandwidth Methodology

• Methodology varies by platform due to available information and architecture differences.
• In all cases, choose maximum throughput available based on vendor IP for memory controllers.
• Saturated Case uses the maximum amount of I/O for the external memory interface; Balanced Case assumes a balance of I/O and memory interface.
• Altera Stratix II
  • Influenced by speed grade, number of I/O
  • Used new high performance ALTMEMPHY core (vs. legacy memory interface core)
  • Support for 333 MHz DDR2 RAM
  • Number of controllers limited by the number of on-chip delay-locked loops (2)

External Memory Bandwidth Methodology

• Xilinx Virtex 4
  • Influenced by speed grade, number of I/O
  • Memory Interface Generator v1.73 (Coregen) forces use of slower “Direct Clocking” to support multiple banks vs. SERDES strobe implementation; for -10 speed grade, maximum frequency is 220 – 240 MHz (depending on bus width)
• Mathstar FPOA
  • Datasheet information for total external memory interface bandwidth (RLDRAM II)
• Cell
  • External Memory Bandwidth (Rambus XDRAM) reported in presentation “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM)
• MONARCH
  • External Memory Bandwidth (DDR2) reported in presentation “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)

External Memory Bandwidth Results

[Figure: External Memory Bandwidth. Bar chart in GB/s (vertical axis 0 to 25) for Stratix II S180, Virtex 4 LX200, Virtex 4 LX100, Virtex 4 SX55, Cell, FPOA, and MONARCH; series: Saturated, Balanced.]

External Memory Bandwidth Conclusions

• External memory bandwidth is important to prevent a data bottleneck into the device.
• For FPGAs, the type and speed of external memory supported depends on the family and speed grade of the device.
• In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between saturated and balanced.
• Stratix II S180 and Virtex 4 SX55 configurations support 2 simultaneous controllers; Virtex 4 LX100 and LX200 support 3 simultaneous controllers, which shows up as the performance difference in the saturated case.
• Although the Stratix II controller supports faster DDR2 RAM (333 MHz vs. 220 MHz in this configuration), Virtex 4 SX55 has higher bandwidth due to support for a wider bus.
• Xilinx claims higher bandwidth on its website, but that figure assumes a wider bus than existing memories.
• For the balanced case, Cell is the performance leader, primarily due to its specialized RAM format (XDRAM).

I/O Bandwidth Methodology

• Methodology varies by platform due to available information and architecture differences.
• In all cases, choose the protocol/signaling level with the maximum available throughput.
• Saturated Case uses the maximum amount of I/O for the I/O interface; Balanced Case assumes a balance of I/O and 1 memory interface.
• Altera Stratix II
  • Datasheet information for concurrent receive pairs and transmit pairs @ 1.040 Gb/s per pair.
• Xilinx Virtex 4
  • Datasheet information for concurrent receive pairs and transmit pairs @ 1 Gb/s per pair.
• Mathstar FPOA
  • Datasheet information for concurrent total transmit and receive bandwidth.

I/O Bandwidth Methodology

Cell
  • I/O Bandwidth reported in presentation “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM)

MONARCH
  • I/O Bandwidth reported in presentation “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)

I/O Bandwidth Results

[Figure: I/O Bandwidth. Bar chart in GB/s (vertical axis 0 to 80) for Altera Stratix II S180, Xilinx Virtex 4 SX55, Xilinx Virtex 4 LX100, Xilinx Virtex 4 LX200, Cell, Mathstar FPOA, and MONARCH; series: Saturated, Balanced.]

I/O Bandwidth Conclusions

• I/O bandwidth is important to prevent I/O and data bottlenecks.
• In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between saturated and balanced.
• All devices except for FPOA have at least 40 GB/s throughput.
• FPGAs are shown in both fully utilized and balanced cases.
• Stratix II uses separate I/O for the single ended memory interface and differential pairs, so there is no distinction between saturated and balanced cases.
• Cell has the highest I/O performance for both cases.

Internal Device Memory Bandwidth

Methodology

• FPGAs
  • Xilinx: all BRAMs are the same; calculation = number of BRAMs * port width * number of ports * memory access frequency
  • Altera: 3 levels of internal memory hierarchy; calculation similar to above for all levels of hierarchy
• FPOA: similar to above with 2 levels of memory hierarchy (Register File and Internal RAM)
• GPP: bus width * frequency * ports
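The block-memory calculation above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the 36-bit port width and 500 MHz access frequency are assumed inputs rather than figures confirmed by the study, and summing across hierarchy levels is one reading of "calculation similar to above for all levels".

```python
def block_memory_bandwidth(num_blocks, port_width_bits, num_ports, access_mhz):
    """Aggregate bandwidth in GB/s for identical on-chip memory blocks."""
    bits_per_second = num_blocks * port_width_bits * num_ports * access_mhz * 1e6
    return bits_per_second / 8 / 1e9


def hierarchy_bandwidth(levels):
    """Sum the bandwidth of each level of a multi-level memory hierarchy.

    Each level is a (num_blocks, port_width_bits, num_ports, access_mhz) tuple.
    """
    return sum(block_memory_bandwidth(*level) for level in levels)


# 336 dual-port BRAMs with an assumed 36-bit port accessed at an assumed 500 MHz.
single_level = block_memory_bandwidth(
    num_blocks=336, port_width_bits=36, num_ports=2, access_mhz=500)

# Hypothetical two-level hierarchy (for example a register file plus internal RAM).
two_level = hierarchy_bandwidth([(2, 8, 1, 1000), (1, 16, 2, 500)])
print(single_level, two_level)
```

Because every block can be accessed in parallel, the total grows linearly with the block count, which is why devices with many small memories score well on this metric.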

Internal Memory Bandwidth

[Figure: Internal Memory Bandwidth. Bar chart per device; vertical axis 0 to 3000.]

• The large number of parallel accesses gives FPGAs the advantage in this metric

Device Characterization Matrix

Goal: enable comparison of different devices on key parameters
  • Tie all device characterizations into a unifying framework
  • User weights allow adjustment to specific application needs
  • Scores quickly show comparison results based on input weights

Approach:
  • Scale each characterization study from 1 to 10
  • Generate a weighted average score for each device, taking into account user weights

Justification
  • Significant architectural differences have historically made these devices difficult to compare

Single-Precision Floating-Point scaling example
  • Use min and max values to scale from 1 to 10

Device                   SP FP Multiply Throughput (GFlops)  Scaled SP FP Multiply Performance
Altera Stratix II S180   71.68                               4
Xilinx Virtex 4 LX200    46.03                               3
Xilinx Virtex 4 LX100    32.88                               2
Xilinx Virtex 4 SX55     38.36                               3
PowerPC                  1                                   1
PowerPC + AltiVec        5                                   1
Cell                     200                                 10
Monarch                  64                                  4

min = 1, max = 200

\[
\text{Characterization}_i = scale_{min} + (x_i - x_{min}) \cdot \frac{scale_{max} - scale_{min}}{x_{max} - x_{min}}
\]

where

\[
x_{min} = \min_{j=1}^{N}(x_j), \quad x_{max} = \max_{j=1}^{N}(x_j), \quad scale_{min} = 1, \quad scale_{max} = 10
\]
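The scaling step can be checked against the single-precision multiply example with a short script. This is a minimal sketch: the function name is hypothetical, rounding to the nearest integer is an assumption inferred from the integer scores on the slide, and returning the minimum scale value when all inputs are equal is a choice made here to avoid dividing by zero.

```python
def scale_characterization(values, scale_min=1.0, scale_max=10.0):
    """Linearly map raw values onto [scale_min, scale_max] using their min and max."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # All devices identical: assumed fallback, not specified in the slides.
        return [scale_min for _ in values]
    return [scale_min + (x - x_min) * (scale_max - scale_min) / span
            for x in values]


# SP FP multiply throughput (GFlops) from the scaling example.
sp_fp_multiply = {
    "Altera Stratix II S180": 71.68,
    "Xilinx Virtex 4 LX200": 46.03,
    "Xilinx Virtex 4 LX100": 32.88,
    "Xilinx Virtex 4 SX55": 38.36,
    "PowerPC": 1,
    "PowerPC + AltiVec": 5,
    "Cell": 200,
    "Monarch": 64,
}

scaled = scale_characterization(list(sp_fp_multiply.values()))
rounded = {name: round(score) for name, score in zip(sp_fp_multiply, scaled)}
for name, score in rounded.items():
    print(f"{name:24s} {score}")
```

Because the mapping is anchored on the observed minimum and maximum, a single outlier such as Cell at 200 GFlops compresses the remaining devices into the low end of the scale.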

Device Characterization Matrix

User Weight  PPC  AltiVec  Xilinx Virtex-4 LX100  Xilinx Virtex-4 SX55  Xilinx Virtex-4 LX200  Altera Stratix-II S180  Cell  Mathstar FPOA  MONARCH
10           1    1        3                      2                     5                      3                       10    0              4
10           1    1        3                      3                     4                      4                       10    0              4
10           1    1        6                      4                     10                     6                       6     0              1
10           1    1        4                      4                     6                      5                       10    0              1
10           1    1        6                      4                     10                     8                       2     0              2
10           1    2        3                      4                     4                      6                       10    0              6
10           1    1        6                      4                     10                     6                       2     3              1
10           1    1        4                      6                     6                      7                       7     10             2
10           1    1        6                      4                     8                      10                      1     2              1
10           1    1        4                      5                     6                      10                      3     3              1
10           1    1        3                      3                     3                      3                       10    2              4
10           1    1        6                      4                     6                      6                       10    2              6
10           10   10       8                      8                     8                      8                       1     7              6

\[
\text{Score} = \frac{\sum_{i=1}^{N} w_i \cdot \text{characterization}_i}{\sum_{i=1}^{N} w_i}
\]

where w_i is a non-negative weight.

Examples with other weights:
A. Power & cost (10), internal & external memory BW (5), 16-bit integer performance (7): FPOA & V4SX55 lead
B. DP FP performance (5), power (10): Stratix-II S180 and V4LX200 lead
C. External & I/O BW (10), power (10), cost (10): MONARCH and Cell lead
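The weighted score is a plain weighted average of the scaled characterizations. The sketch below uses a hypothetical function name and made-up weights and scores purely for illustration; it is not a reproduction of any row of the matrix.

```python
def weighted_score(weights, characterizations):
    """Weighted average of scaled characterization values for one device."""
    if len(weights) != len(characterizations):
        raise ValueError("weights and characterizations must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(w * c for w, c in zip(weights, characterizations))
    return weighted_sum / total_weight


# Hypothetical device with three scaled characterizations; the third
# characterization is ignored by giving it a weight of 0.
weights = [10, 5, 0]
scores = [4, 10, 1]
device_score = weighted_score(weights, scores)
print(device_score)
```

Setting a weight to 0 removes that characterization from the comparison entirely, which is how the example scenarios narrow the matrix down to the few parameters a given application cares about.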

References

• DeHon, A. The Density Advantage of Configurable Computing. Computer, vol. 33, no. 4, pp. 41-49, Apr 2000.
• DeHon, A. Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586, Massachusetts Institute of Technology, 1996.
• Compton, K. and Hauck, S. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv. 34, 2 (Jun. 2002), 171-210.
• Memory Bandwidth, http://en.wikipedia.org/wiki/Memory_bandwidth .
• Mason, J. FPGA HPC – The road beyond processors, Xilinx Corporation. RSSI 2007.
• Wain, R., Bush, I., Guest, M., Deegan, M., Kozin, I. and Kitchen, C. An overview of FPGAs and FPGA programming; Initial experiences at Daresbury. November 2006. Distributed Computing Group at Daresbury Laboratory.
• Bolsens, I. Programming Modern FPGAs. Xilinx Corporation. MPSOC August, 2006.
• Underwood, K. 2004. FPGAs vs. CPUs: trends in peak floating-point performance. In Proceedings of the 2004 ACM/SIGDA 12th international Symposium on Field Programmable Gate Arrays (Monterey, California, USA, February 22 - 24, 2004). FPGA '04. ACM Press, New York, NY, 171-180.
• HPEC Challenge Benchmarks. http://www.ll.mit.edu/HPECchallenge .
• Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Virtex-4 Family Overview (DS112), January 23, 2007.
• Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Floating-Point Operator v3.0 (DS335). September 28, 2006.
• “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM).
• “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon).
• Strenski, Dave. “FPGA Floating Point Performance – a pencil and paper evaluation”. http://www.hpcwire.com/hpc/1195762.html .
• Strenski, Dave. 2006. Computational Bottlenecks and Hardware Decisions for FPGAs. FPGA and Structured ASIC Journal.
• Altera Corporation. 101 Innovation Drive, San Jose, CA 95134. Stratix II Device Handbook v 4.3, May 2007.
• Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. MPC7450 RISC Microprocessor Family Reference Manual, Rev. 5. January 2005.
• Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. AltiVec Technology Programming Environments Manual, Rev. 3. April 2006.
• MathStar Corporation. 19075 NW Tanasbourne Dr. Suite 200, Hillsboro, OR 97124. Arrix Family Product Brief, August 2006.