Transcript F5 7-16-07 - Greg Stitt, University of Florida
RC Device Characterizations & Tradeoff Analysis
Jason Williams
August 30, 2007
Introduction
Reconfigurable Computing (RC) is an emerging field that uses devices with a programmable fabric, allowing the hardware to be configured and adapted to solve changing problems.
RC systems have typically been built using Field Programmable Gate Arrays (FPGAs), but other architectures could also implement RC systems, such as Field Programmable Object Arrays (FPOAs) and Field Programmable Compute Arrays (FPCAs, e.g. MONARCH).
Subject & Purpose
Subject
To survey the landscape of various RC devices
Characterize these devices using various metrics (performance, price, power)
Create a comparison framework using the characterizations
Purpose
Will give the end user a quantitative framework to aid in the selection of an appropriate RC device to meet their application needs
Lays groundwork for understanding performance impacts of architectural components
Problem Definition
Problems
RC devices can be vastly different from one another:
Various architectural differences and very few standard/common parameters
Memory example: Xilinx BRAM vs. Altera M-RAM/M4K/M512 vs. FPOA RF/IRAM vs. CPU cache
RC devices differ from traditional microprocessors:
Typically slower clock rates
Potential for massive parallelism
Different power consumption trends
Different on-die memory configurations
All of these differences make direct device comparisons difficult
Problem Background
Users have a variety of requirements/concerns – What key parameters do we need to compare?
Computational performance (integer/fixed point, floating point, fine grained/bit level)
On-chip memory performance (latency, bandwidth)
Off-chip communications and I/O
Power consumption
Price
Scope Statement
Devices to be included in study
Xilinx Virtex 4 LX200, LX100, SX55
Altera Stratix II S180
Freescale PowerPC MPC7447 + AltiVec
MathStar Arrix FPOA (1 GHz)
Raytheon Monarch PCA
Sony/Toshiba/IBM Cell
Methods
Literature review
Apply and extend characterizations and metrics to devices under study
Datasheet analysis
Experiments using vendor development tools/simulation environments
Example: Utilization and timing analysis results from post place and route for common ALU/FP structures
Combine characterization study results into a QFD style matrix
FPGA Theoretical Floating Point Performance
Methodology
Adapted from Jeff Mason’s (Xilinx) presentation at RSSI ’07, “FPGA HPC – The road beyond processors”, with input from Dave Strenski (Cray).
Similar methodology also reported in “An overview of FPGAs and FPGA programming; Initial experiences at Daresbury”, Richard Wain, Ian Bush, Martyn Guest, Miles Deegan, Igor Kozin and Christine Kitchen. November 2006. Distributed Computing Group at Daresbury Laboratory.
Using datasheet information, Altera and Xilinx Floating Point cores, ISE and Quartus, estimate FP add and FP multiply performance.
FPGA Floating Point Performance
Xilinx Example
Data from Virtex 4 Family Overview (DS112) and Coregen Floating Point Operator v3.0 (DS335)
Assumptions:
15% slice overhead (routing, I/O, etc.)
Use DSP resources first, then logic only implementation to fill device.
Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
Assume 2 storage elements (BRAM) per operation (operands, overwrite with result). Limit the number of operations if there is not enough BRAM to support.
Use speed optimized, highest effort for Synthesis, Map, PAR.
FPGA Floating Point Performance
Xilinx Example Continued (LX200 –10)
Double Precision Floating Point Multiply

Per Instance                                Max Frequency (MHz)  DSPs Used  LUTs Used        FF Used
DSP Implementation                          303                  16         550              774
Logic Only Implementation                   185                  0          2311             2457
Device Maximum (less 15% LUT for overhead)  500                  96         178176 (151449)  178176 (151449)

96 / 16 = 6 DSP Multipliers
151449 - (774 * 6) = 146805 remaining LUT for Logic Multipliers
146805 / 2457 = ~59 Logic Only Multipliers
65 total multipliers in 1 context @ 185 MHz = ~12 Gflop/s
Limit total number of multipliers to 85 due to BRAM limitation = ~11.1 Gflop/s
LX100 has 336 18Kb dual port BRAM. For 64-bit (DP), ((336 * 2) / 4) / 2 = 85 function units
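The resource arithmetic in the LX200 example can be sketched in a few lines of Python. This is only an illustration of the steps on the slide, not the tooling used in the study; the function and constant names are hypothetical. It follows the slide in budgeting against the per-instance FF counts (774 and 2457) and in using the lower of the two clock speeds, and it leaves out the BRAM limit.

```python
# Figures taken from the LX200 (-10) double precision multiply example.
DEVICE_LUTS = 178_176        # device maximum before overhead
DEVICE_DSPS = 96
OVERHEAD_PERCENT = 15        # slice overhead assumed by the methodology

DSP_IMPL = {"mhz": 303, "dsps": 16, "luts": 550, "ffs": 774}
LOGIC_IMPL = {"mhz": 185, "dsps": 0, "luts": 2311, "ffs": 2457}


def estimate_multipliers():
    """Return (dsp_units, logic_units, total_units, gflops) for one context."""
    # Integer arithmetic avoids float rounding: 178176 * 85 // 100 = 151449.
    usable = DEVICE_LUTS * (100 - OVERHEAD_PERCENT) // 100
    # Use DSP resources first.
    dsp_units = DEVICE_DSPS // DSP_IMPL["dsps"]
    # The slide subtracts the per-instance FF count of the DSP version.
    remaining = usable - DSP_IMPL["ffs"] * dsp_units
    # Fill the rest of the device with logic only multipliers.
    logic_units = remaining // LOGIC_IMPL["ffs"]
    # Use the lower of the two clock speeds for all calculations.
    clock_mhz = min(DSP_IMPL["mhz"], LOGIC_IMPL["mhz"])
    total_units = dsp_units + logic_units
    gflops = total_units * clock_mhz / 1000.0
    return dsp_units, logic_units, total_units, gflops


dsp_units, logic_units, total_units, gflops = estimate_multipliers()
print(dsp_units, logic_units, total_units, gflops)
```

The result matches the 6 + 59 = 65 multipliers and the 12.025 GFlops figure reported for the LX200 double precision multiply.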
Theoretical Floating Point Performance
Methodology
FPOA floating point performance is reported as 0. This device could have a floating point core designed for it, but its architecture (16 bit ALUs) would not implement FP efficiently.
PowerPC, AltiVec, MONARCH, and Cell floating point performance numbers are available/derivable from their respective datasheets.
Floating Point Performance Results
Floating Point Performance (BRAM Limitation)
[Figure: chart of DP Multiply, DP Add, SP Multiply, and SP Add performance for each device under study; vertical axis 0 to 200]
Floating Point Performance Results
Floating Point Performance (No BRAM Limitation)
[Figure: chart of DP Multiply, DP Add, SP Multiply, and SP Add performance for each device under study; vertical axis 0 to 200]
Floating Point Performance Results
Theoretical Floating Point Performance (GFlops, BRAM Limitation)

Device                        DP Multiply  DP Add   SP Multiply  SP Add
Xilinx Virtex 4 LX200         12.025       24.14    46.032       61.824
Xilinx Virtex 4 LX100         7.03         17.04    32.88        44.16
Xilinx Virtex 4 SX55          7.03         8.14     38.36        33.998
Altera Stratix II S180        11.016       17.304   71.68        48.334
Freescale PowerPC MPC7447     1            1        1            1
Freescale PowerPC + AltiVec   1            1        5            5
MathStar Arrix FPOA (1 GHz)   0            0        0            0
Raytheon Monarch PCA          0            0        64           64
Sony/Toshiba/IBM Cell         20           20       200          200
Theoretical Floating Point Performance (GFlops, No BRAM Limitation)

Device                        DP Multiply  DP Add   SP Multiply  SP Add
Xilinx Virtex 4 LX200         12.025       34.08    63.568       95.68
Xilinx Virtex 4 LX100         7.03         18.744   36.716       53.36
Xilinx Virtex 4 SX55          7.03         8.14     38.36        33.998
Altera Stratix II S180        11.016       17.304   71.68        48.334
Freescale PowerPC MPC7447     1            1        1            1
Freescale PowerPC + AltiVec   1            1        5            5
MathStar Arrix FPOA (1 GHz)   0            0        0            0
Raytheon Monarch PCA          0            0        64           64
Sony/Toshiba/IBM Cell         20           20       200          200
Floating Point Conclusions
For FPGAs, floating point performance depends on the FP core implementation, which affects resource utilization and maximum achievable frequency.
For Xilinx devices, available on-chip memory also greatly impacts performance if we assume there must be enough on-chip memory to buffer operands and results.
Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
Xilinx adder cores can use on-chip DSP resources, Altera adder cores do not.
MONARCH only supports single precision floating point.
Cell is the clear leader in theoretical floating point performance (using all processing elements).
Theoretical Integer Performance
Utilize same basic methodology as Floating Point Performance Comparison
15% slice overhead (routing, I/O, etc.).
Use DSP resources first, then logic only implementation to fill device.
Use lower of the two clock speeds for all calculations (DSP vs. Logic only).
Use vendor software (Quartus, ISE) to find resource utilization for 1 functional unit. Calculate the number of parallel functional units that fit in 1 context using datasheet values.
Assume 2 storage elements (BRAM) per functional unit (operands, overwrite with result). Limit the number of parallel functional units if there is not enough BRAM to support 2 storage elements per functional unit.
Use speed optimized, highest effort for Synthesis, Map, PAR.
Use standard integer widths (32 bit and 16 bit).
Analyze Addition and Multiplication operations separately.
Theoretical Integer Performance
Methodology
FPOA 32 bit integer performance is reported as 0. This device could have a 32 bit ALU core designed for it, but it is natively a 16 bit device.
PowerPC, AltiVec, MONARCH, and Cell integer performance numbers are available/derivable from their respective datasheets.
Integer Performance Results
Integer Performance (BRAM Limitation)
[Figure: chart of 32 bit Multiply, 32 bit Add, 16 bit Multiply, and 16 bit Add performance for each device under study; vertical axis 0 to 400]
Integer Performance Results
Integer Performance (No BRAM Limitation)
[Figure: chart of 32 bit Multiply, 32 bit Add, 16 bit Multiply, and 16 bit Add performance for each device under study; vertical axis 0 to 2500]
Integer Performance Results
Theoretical Integer Performance (GOPs, No BRAM Limitation)

Device                        32 bit Multiply  32 bit Add  16 bit Multiply  16 bit Add
Xilinx Virtex 4 LX200         37.848           979.736     198.144          2243.04
Xilinx Virtex 4 LX100         23.406           549.608     122.464          1238.88
Xilinx Virtex 4 SX55          38.346           371.624     201.928          733.92
Altera Stratix II S180        74.5             17.304      257.07           48.334
Freescale PowerPC MPC7447     3                3           3                3
Freescale PowerPC + AltiVec   7                7           11               11
MathStar Arrix FPOA (1 GHz)   0                0           384              384
Raytheon Monarch PCA          64               64          64               64
Sony/Toshiba/IBM Cell         125              125         250              250

Theoretical Integer Performance (GOPs, BRAM Limitation)

Device                        32 bit Multiply  32 bit Add  16 bit Multiply  16 bit Add
Xilinx Virtex 4 LX200         37.848           69.216      115.584          161.28
Xilinx Virtex 4 LX100         23.406           49.44       82.56            115.2
Xilinx Virtex 4 SX55          38.346           65.92       110.08           153.6
Altera Stratix II S180        74.5             17.304      257.07           48.334
Freescale PowerPC MPC7447     3                3           3                3
Freescale PowerPC + AltiVec   7                7           11               11
MathStar Arrix FPOA (1 GHz)   0                0           384              384
Raytheon Monarch PCA          64               64          64               64
Sony/Toshiba/IBM Cell         125              125         250              250
Integer Performance Conclusions
In some cases, the BRAM limitation is again an important performance limiter for Xilinx devices.
Stratix II S180 has more on-chip RAM (1.5x V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers, higher device memory bandwidth) and does not have this issue.
Quartus II 6.0 typically reports higher maximum achievable frequency for post place and route timing analysis versus ISE 9.2.
Used speed grade –10 for Virtex 4 devices.
Used speed grade –3 for Stratix II device.
32 bit multiply example: Quartus reports 500 MHz for both DSP and Logic Only implementations, ISE reports 421 MHz for DSP, 249 MHz for Logic Only.
Xilinx adder cores can use on-chip DSP resources, which could improve add performance if there was enough memory support. Altera adder cores do not support DSP utilization and therefore suffer a performance hit compared to Xilinx devices.
Without the BRAM limitation, Xilinx devices show the highest performance for Integer Add operations.
With the BRAM limitation, the FPOA has the highest 16 bit integer performance.
Cell has the highest 32 bit integer performance (using all processing elements).
Bit-level Computational Performance
Methodology
Based on DeHon’s Computational Density calculations
$$\text{Computational Density} = \frac{\text{ALU bit operations per cycle} \times \text{frequency}}{\text{Die area} / \lambda^{2}}$$
Normalizes performance by die (or package) area and minimum feature size/process technology
Bit operations for FPGAs are number of 4 input LUTs
Bit operations for GPP and other “hybrid” devices based on number of cores, number of issued instructions, and width of ALU/Functional Units
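A minimal sketch of the density calculation, assuming it is bit operations per cycle times clock frequency, divided by die area normalized by the square of a process feature-size parameter. The function name, parameter names, units, and the example numbers are all hypothetical and are not taken from the study.

```python
def computational_density(bit_ops_per_cycle, frequency_hz, die_area_m2, lambda_m):
    """Bit operations per second per unit of feature-size-normalized area.

    All inputs are illustrative assumptions:
      bit_ops_per_cycle: e.g. the number of 4 input LUTs for an FPGA
      frequency_hz:      clock frequency in Hz
      die_area_m2:       die (or package) area in square meters
      lambda_m:          process feature-size parameter in meters
    """
    if die_area_m2 <= 0 or lambda_m <= 0:
        raise ValueError("area and feature size must be positive")
    normalized_area = die_area_m2 / (lambda_m ** 2)
    return bit_ops_per_cycle * frequency_hz / normalized_area


# Hypothetical device: 1000 bit operations per cycle at 100 MHz,
# 1 cm^2 die, 0.1 micrometer feature-size parameter.
density = computational_density(1000, 100e6, 1e-4, 1e-7)
print(density)
```

Normalizing by feature size lets devices built on different process technologies be compared on architecture rather than on fabrication generation.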
Bit-level Computation Performance
[Figure: Bit Level Computational Density; vertical axis 0 to 160]
As expected, fine-grained FPGAs dominate performance in this metric
External Memory Bandwidth Methodology
Methodology varies by platform due to available information and architecture differences.
In all cases, choose maximum throughput available based on vendor IP for memory controllers.
Saturated Case uses maximum amount of I/O for external memory interface, Balanced Case assumes a balance of I/O and memory interface.
Altera Stratix II
Influenced by speed grade, number of I/O
Used new high performance ALTMEMPHY core (vs. legacy memory interface core)
Support for 333 MHz DDR2 RAM
Number of controllers limited by the number of on-chip delay-locked loops (2)
External Memory Bandwidth Methodology
Xilinx Virtex 4
Influenced by speed grade, number of I/O
Memory Interface Generator v1.73 (Coregen) forces use of slower “Direct Clocking” to support multiple banks vs. SERDES strobe implementation; for -10 speed grade, maximum frequency is 220 – 240 MHz (depending on bus width)
Mathstar FPOA
Datasheet information for total external memory interface bandwidth (RLDRAM II)
Cell
External Memory Bandwidth (Rambus XDRAM) reported in presentation “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM)
MONARCH
External Memory Bandwidth (DDR2) reported in presentation “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
External Memory Bandwidth Results
[Figure: External Memory Bandwidth (GB/s), Saturated and Balanced cases, for Stratix II S180, Virtex 4 LX200, Virtex 4 LX100, Virtex 4 SX55, Cell, FPOA, and MONARCH; vertical axis 0 to 25]
External Memory Bandwidth Conclusions
External Memory Bandwidth is important to prevent a data bottleneck into the device.
For FPGAs, the type and speed of external memory supported depends on the family and speed grade of the device.
In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is not a distinction between saturated and balanced.
Stratix II S180 and Virtex 4 SX55 configurations support 2 simultaneous controllers, Virtex 4 LX100 and LX200 support 3 simultaneous controllers which is shown in the performance difference for the saturated case.
Although Stratix II controller supports faster DDR2 RAM (333 MHz vs. 220 MHz in this configuration), Virtex 4 SX55 has higher bandwidth due to support for a wider bus.
Xilinx claims higher bandwidth on website, assumes wider bus than existing memories.
For the balanced case, Cell is the performance leader, primarily due to specialized RAM format (XDRAM).
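The observation that a slower but wider memory interface can outperform a faster, narrower one follows from simple peak-bandwidth arithmetic. The sketch below assumes DDR signaling (two transfers per clock) and uses hypothetical bus widths; the study does not state the widths it used, so the numbers here only illustrate the effect. The function name is also an assumption.

```python
def ddr_bandwidth_gbytes(clock_mhz, bus_width_bits, controllers=1):
    """Peak bandwidth in GB/s for a DDR interface (two transfers per clock)."""
    transfers_per_second = clock_mhz * 1e6 * 2
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer * controllers / 1e9


# Clock rates are from the conclusions above; bus widths are hypothetical.
faster_narrow = ddr_bandwidth_gbytes(333, bus_width_bits=64, controllers=2)
slower_wide = ddr_bandwidth_gbytes(220, bus_width_bits=144, controllers=2)
print(faster_narrow, slower_wide)
```

With these assumed widths the 220 MHz interface delivers more peak bandwidth than the 333 MHz one, because the wider bus more than offsets the lower clock.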
I/O Bandwidth Methodology
Methodology varies by platform due to available information and architecture differences.
In all cases, choose the protocol/signaling level with the maximum available throughput.
Saturated Case uses maximum amount of I/O for I/O interface, Balanced Case assumes a balance of I/O and 1 memory interface.
Altera Stratix II Datasheet information for concurrent receive pairs and transmit pairs @ 1.040 Gb/s per pair.
Xilinx Virtex 4 Datasheet information for concurrent receive pairs and transmit pairs @ 1 Gb/s per pair.
Mathstar FPOA Datasheet information for concurrent total transmit and receive bandwidth.
I/O Bandwidth Methodology
Cell
I/O Bandwidth reported in presentation “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM)
MONARCH
I/O Bandwidth reported in presentation “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
I/O Bandwidth Results
[Figure: I/O Bandwidth (GB/s), Saturated and Balanced cases, for Altera Stratix II S180, Xilinx Virtex 4 SX55, Xilinx Virtex 4 LX100, Xilinx Virtex 4 LX200, Cell, Mathstar FPOA, and MONARCH; vertical axis 0 to 80]
I/O Bandwidth Conclusions
I/O Bandwidth is important to prevent I/O and data bottleneck.
In this study, non-FPGA devices have separate I/O and memory controllers/interfaces, so there is not a distinction between saturated and balanced.
All devices except for FPOA have at least 40 GB/s throughput.
FPGAs are shown in both fully utilized and balanced cases.
Stratix II uses separate I/O for single ended memory interface and differential pairs so there is no distinction between saturated and balanced cases.
Cell has the highest I/O performance for both cases.
Internal Device Memory Bandwidth
Methodology
FPGAs
Xilinx – all BRAMs are the same, calculation = number of BRAMs * port width * number of ports * memory access frequency
Altera – 3 levels of internal memory hierarchy, calculation similar to above for all levels of hierarchy
FPOA – similar to above with 2 levels of memory hierarchy (Register File and Internal RAM)
GPP – bus width * frequency * ports
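The per-level calculation above (number of blocks * port width * number of ports * access frequency) can be sketched as follows. The function names and every numeric value in the example are hypothetical and chosen only to show the arithmetic; the port width and access frequency used in the study are not given here.

```python
def level_bandwidth_gbytes(num_blocks, port_width_bits, num_ports, freq_mhz):
    """Aggregate bandwidth of one memory level in GB/s."""
    bits_per_second = num_blocks * port_width_bits * num_ports * freq_mhz * 1e6
    return bits_per_second / 8 / 1e9


def device_bandwidth_gbytes(levels):
    """Sum the bandwidth of every level in a memory hierarchy.

    `levels` is a list of (num_blocks, port_width_bits, num_ports, freq_mhz)
    tuples, one per level (one level for uniform BRAMs, several for a
    multi-level hierarchy).
    """
    return sum(level_bandwidth_gbytes(*level) for level in levels)


# Hypothetical single-level device: 336 dual port blocks, 36 bit ports, 500 MHz.
single_level = device_bandwidth_gbytes([(336, 36, 2, 500)])
print(single_level)
```

Because every block can be accessed in parallel, the total grows linearly with the block count, which is why devices with many small memories score well on this metric.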
Internal Memory Bandwidth
[Figure: Internal Memory Bandwidth; vertical axis 0 to 3000]
Large amount of parallel accesses give FPGAs the advantage in this metric
Device Characterization Matrix
Goal: enable comparison of different devices on key parameters
Tie all device characterizations into unifying framework
User weights allow adjustment to specific application needs
Scores quickly show comparison results based on input weights
Approach:
Scale each characterization study from 1 to 10
Generate weighted average score for each device taking into account user weights
Justification:
Significant architectural differences have historically made these devices difficult to compare
Single-Precision Floating-Point scaling example
Use min and max values to scale from 1 to 10
Device                   SP FP Multiply Throughput (GFlops)  Scaled SP FP Multiply Performance
Altera Stratix II S180   71.68                               4
Xilinx Virtex 4 LX200    46.03                               3
Xilinx Virtex 4 LX100    32.88                               2
Xilinx Virtex 4 SX55     38.36                               3
PowerPC                  1                                   1
PowerPC + AltiVec        5                                   1
Cell                     200                                 10
Monarch                  64                                  4

min = 1, max = 200
$$\text{Characterization}_i = scale_{min} + (x_i - x_{min}) \cdot \frac{scale_{max} - scale_{min}}{x_{max} - x_{min}}$$
where $x_{min} = \min_{j=1}^{N}(x_j)$, $x_{max} = \max_{j=1}^{N}(x_j)$, $scale_{min} = 1$, and $scale_{max} = 10$.
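The min/max scaling can be checked against the single-precision multiply example with a short Python sketch. The function name is hypothetical, and rounding to the nearest integer is an assumption made here because the scaled values in the example are whole numbers.

```python
def scale_1_to_10(values, scale_min=1.0, scale_max=10.0):
    """Linearly map values so the smallest becomes scale_min and the largest scale_max."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # All devices identical on this metric: no spread to scale.
        return [scale_min for _ in values]
    slope = (scale_max - scale_min) / (x_max - x_min)
    return [scale_min + (x - x_min) * slope for x in values]


# SP FP multiply throughput (GFlops) from the scaling example.
sp_multiply = {
    "Altera Stratix II S180": 71.68,
    "Xilinx Virtex 4 LX200": 46.03,
    "Xilinx Virtex 4 LX100": 32.88,
    "Xilinx Virtex 4 SX55": 38.36,
    "PowerPC": 1,
    "PowerPC + AltiVec": 5,
    "Cell": 200,
    "Monarch": 64,
}

scaled = scale_1_to_10(list(sp_multiply.values()))
rounded = [round(v) for v in scaled]  # rounding is an assumption
for device, score in zip(sp_multiply, rounded):
    print(device, score)
```

With rounding, the output reproduces the scaled scores listed in the example (4, 3, 2, 3, 1, 1, 10, 4).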
Device Characterization Matrix
User Weight  PPC  AltiVec  Xilinx Virtex-4 LX100  Xilinx Virtex-4 SX55  Xilinx Virtex-4 LX200  Altera Stratix-II S180  Cell  Mathstar FPOA  MONARCH
10           1    1        3                      2                     5                      3                       10    0              4
10           1    1        3                      3                     4                      4                       10    0              4
10           1    1        6                      4                     10                     6                       6     0              1
10           1    1        4                      4                     6                      5                       10    0              1
10           1    1        6                      4                     10                     8                       2     0              2
10           1    2        3                      4                     4                      6                       10    0              6
10           1    1        6                      4                     10                     6                       2     3              1
10           1    1        4                      6                     6                      7                       7     10             2
10           1    1        6                      4                     8                      10                      1     2              1
10           1    1        4                      5                     6                      10                      3     3              1
10           1    1        3                      3                     3                      3                       10    2              4
10           1    1        6                      4                     6                      6                       10    2              6
10           10   10       8                      8                     8                      8                       1     7              6

$$\text{Score} = \frac{\sum_{i=1}^{N} w_i \cdot \text{characterization}_i}{\sum_{i=1}^{N} w_i}$$
where $w_i$ is a non-negative weight.

Examples with other weights:
A. Power & cost (10), internal & external memory BW (5), 16-bit integer performance (7): FPOA & V4SX55 lead
B. DP FP performance (5), power (10): Stratix-II S180 and V4LX200 lead
C. External & I/O BW (10), power (10), cost (10): MONARCH and Cell lead
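The weighted average score can be sketched in Python as below. The function name is hypothetical. The example uses only the first three matrix rows for the Cell and MONARCH columns, with every weight set to 10, so the resulting numbers are illustrative and are not the full device scores.

```python
def weighted_score(weights, characterizations):
    """Weighted average of scaled characterizations using non-negative user weights."""
    if len(weights) != len(characterizations):
        raise ValueError("one weight is required per characterization")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(w * c for w, c in zip(weights, characterizations))
    return weighted_sum / total_weight


# First three matrix rows only (illustrative subset), all weights equal to 10.
weights = [10, 10, 10]
cell_rows = [10, 10, 6]
monarch_rows = [4, 4, 1]

cell_score = weighted_score(weights, cell_rows)
monarch_score = weighted_score(weights, monarch_rows)
print(cell_score, monarch_score)
```

Setting a weight to 0 removes that characterization from the score entirely, which is how a user can restrict the comparison to the metrics that matter for a given application.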
References
DeHon, A. The Density Advantage of Configurable Computing. Computer, vol. 33, no. 4, pp. 41-49, Apr 2000.
DeHon, A. Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586, Massachusetts Institute of Technology, 1996.
Compton, K. and Hauck, S. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv. 34, 2 (Jun. 2002), 171-210.
Memory Bandwidth, http://en.wikipedia.org/wiki/Memory_bandwidth .
Mason, J. FPGA HPC – The road beyond processors, Xilinx Corporation. RSSI 2007.
Wain, R., Bush, I., Guest, M., Deegan, M., Kozin, I. and Kitchen, C. An overview of FPGAs and FPGA programming; Initial experiences at Daresbury. November 2006. Distributed Computing Group at Daresbury Laboratory.
Bolsens, I. Programming Modern FPGAs. Xilinx Corporation. MPSOC August, 2006.
Underwood, K. 2004. FPGAs vs. CPUs: trends in peak floating-point performance. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (Monterey, California, USA, February 22 - 24, 2004). FPGA '04. ACM Press, New York, NY, 171-180.
HPEC Challenge Benchmarks. http://www.ll.mit.edu/HPECchallenge .
Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Virtex-4 Family Overview (DS112), January 23, 2007.
Xilinx Corporation. 2100 Logic Drive, San Jose, CA 95124-3400. Floating-Point Operator v3.0 (DS335). September 28, 2006.
“Introduction to the Cell Processor” from Dr. Michael Perrone (IBM)
“World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
Strenski, Dave. “FPGA Floating Point Performance – a pencil and paper evaluation”. http://www.hpcwire.com/hpc/1195762.html .
Strenski, Dave. 2006. Computational Bottlenecks and Hardware Decisions for FPGAs. FPGA and Structured ASIC Journal.
Altera Corporation. 101 Innovation Drive, San Jose, CA 95134. Stratix II Device Handbook v 4.3, May 2007.
Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. MPC7450 RISC Microprocessor Family Reference Manual, Rev. 5. January 2005.
Freescale Semiconductor Inc. 6501 William Cannon Drive West, Austin, TX 78735. AltiVec Technology Programming Environments Manual, Rev. 3. April 2006.
MathStar Corporation. 19075 NW Tanasbourne Dr. Suite 200, Hillsboro, OR 97124. Arrix Family Product Brief, August 2006.