Comprehensive environment for benchmarking using FPGAs: ATHENa - Automated Tool for Hardware EvaluatioN.

Modern Benchmarking:
Natural Progression of Tools
[Diagram: software benchmarking already has an established tool, eBACS (D. Bernstein, T. Lange); equivalent tools for FPGAs and ASICs remain open questions]
2
ATHENa – Automated Tool for Hardware EvaluatioN
http://cryptography.gmu.edu/athena
A set of scripts written in Perl aimed at the
AUTOMATED generation of
OPTIMIZED results for
MULTIPLE hardware platforms
Currently under development at
George Mason University.
Version 0.3.1
3
Why Athena?
"The Greek goddess Athena was frequently
called upon to settle disputes between
the gods or various mortals. She was
known for her superb logic and intellect.
Her decisions were usually well-considered,
highly ethical, and seldom motivated
by self-interest."
from "Athena, Greek Goddess
of Wisdom and Craftsmanship"
4
Designers of ATHENa
Venkata "Vinny" - MS CpE student
Ekawat "Ice" - MS CpE student
Marcin - PhD ECE student
Xin - PhD ECE student
Michal - PhD exchange student from Slovakia
Rajesh - PhD ECE student
Basic Dataflow of ATHENa
[Diagram: the designer obtains interfaces + testbenches and downloads scripts and configuration files from the ATHENa server; the user supplies HDL + scripts + configuration files, which ATHENa passes to the FPGA synthesis and implementation tools; the inputs and outputs - configuration files, synthesizable source files, testbenches, constraint files, a result summary (user-friendly), and database entries (machine-friendly) - are returned to the user, and the database entries are uploaded to the server, where a database query yields a ranking of designs]
ATHENa Major Features (1)
• synthesis, implementation, and timing analysis in the batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family
9
ATHENa Major Features (2)
• automated verification of the design through simulation in the batch mode
• exhaustive search for optimum options of tools
  OR
• heuristic adaptive optimization strategies aimed at maximizing selected performance measures (e.g., speed, area, speed/area ratio, power, cost, etc.)
10
Multi-Pass Place-and-Route Analysis
GMU SHA-512, Xilinx Virtex 5
100 runs for different placement starting points
[Histogram of minimum clock period across runs (the smaller the better): ~20% spread between best and worst]
12
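The spread shown above can be quantified directly from the per-run results. A minimal sketch (the clock periods are synthetic stand-ins for values parsed from place-and-route reports, and only 5 runs are shown instead of the 100 used in the experiment):

```python
# Spread of minimum clock period across multi-pass place-and-route runs
# with different placement starting points.

def clock_period_spread(periods_ns):
    """Return (best, worst, relative spread) for a list of minimum clock
    periods. Smaller periods are better (higher achievable frequency)."""
    best = min(periods_ns)
    worst = max(periods_ns)
    spread = (worst - best) / best  # e.g. ~0.20 means a ~20% spread
    return best, worst, spread

# Illustrative values only - not from any real report
periods = [6.0, 6.4, 6.7, 7.0, 7.2]
best, worst, spread = clock_period_spread(periods)
print(f"best={best} ns, worst={worst} ns, spread={spread:.0%}")
```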
Dependence of Results on Requested Clock Frequency
13
ATHENa Applications
• single_run
  - one set of options
• placement_search
  - one set of options
  - multiple starting points for placement
• exhaustive_search
  - multiple sets of options
  - multiple starting points for placement
  - multiple requested clock frequencies
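The three modes differ only in which dimensions they sweep. A sketch of the exhaustive_search idea, where the hypothetical evaluate() is a toy stand-in for a full synthesis + place-and-route run (a real flow would invoke the vendor tools and parse their reports):

```python
import itertools

def evaluate(options, seed, req_freq_mhz):
    """Toy model of one tool run: achieved frequency depends on the
    option set, the placement seed, and the requested frequency."""
    base = 100 + 10 * options["effort"] - 2 * (seed % 3)
    return min(base, req_freq_mhz * 1.05)

def exhaustive_search(option_sets, seeds, req_freqs):
    """Sweep all (options, seed, requested frequency) combinations and
    return the best achieved frequency with the settings that produced it."""
    best = None
    for opts, seed, freq in itertools.product(option_sets, seeds, req_freqs):
        achieved = evaluate(opts, seed, freq)
        if best is None or achieved > best[0]:
            best = (achieved, opts, seed, freq)
    return best

option_sets = [{"effort": 1}, {"effort": 2}]   # multiple sets of options
seeds = [1, 1000, 2000]                        # placement starting points
req_freqs = [100, 125]                         # requested frequencies (MHz)
print(exhaustive_search(option_sets, seeds, req_freqs))
```

single_run and placement_search are the degenerate cases of the same loop: one option set and one (or several) seeds, with no frequency sweep.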
SHA-1 Results
[Bar chart: throughput in Mbit/s of different architectures on Virtex 5, Virtex 4, and Spartan 3]
15
ATHENa Results for SHA-1, SHA-256 & SHA-512
[Bar chart: throughput in Mb/s (0-2000) of sha1, sha256, and sha512 across FPGA families: spartan3, virtex4, virtex5, cyclone2, cyclone3, stratix2, stratix3]
16
Ideas (1)
• Select several representative FPGA platforms with significantly different properties, e.g.:
  - vendor: Xilinx vs. Altera
  - process: 90 nm vs. 65 nm
  - LUT size: 4-input vs. 6-input
  - optimization: low-cost vs. high-performance
• Use ATHENa to characterize all SHA-3 candidates and SHA-2 using these platforms in terms of the target performance metrics (e.g., throughput/area ratio)
17
Ideas (2)
• Calculate the ratio of SHA-3 candidate performance vs. SHA-2 performance (for the same security level)
• Calculate the geometric mean over multiple platforms
18
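The geometric mean proposed above can be computed directly from the per-platform ratios. A sketch with made-up ratio values (not measured results):

```python
import math

def geometric_mean(ratios):
    """Geometric mean of candidate-vs-SHA-2 performance ratios across
    platforms. Preferred over the arithmetic mean for ratios because a
    2x speedup on one platform and a 2x slowdown on another cancel out."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical throughput/area ratios (candidate / SHA-2) on four platforms
ratios = [2.0, 0.5, 1.0, 1.0]
print(geometric_mean(ratios))  # → 1.0
```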
Xilinx FPGA Devices

Technology   Low-cost    High-performance
120/150 nm   -           Virtex 2, 2 Pro
90 nm        Spartan 3   Virtex 4
65 nm        -           Virtex 5
45 nm        Spartan 6   -
40 nm        -           Virtex 6
Xilinx FPGA Device Support by Tools

Version               Low-cost                        High-performance
Xilinx ISE 10.1       All up to Virtex 5              All up to Virtex 5
Xilinx WebPACK 11.1   Smallest up to Virtex 5         Smallest up to Virtex 5
Xilinx WebPACK 11.3   Smallest up to Virtex 5;        Smallest up to Virtex 5;
                      smallest Spartan 6, Virtex 6    smallest Spartan 6, Virtex 6
Altera FPGA Devices

Technology   Low-cost      Mid-range   High-performance
130 nm       Cyclone       -           Stratix
90 nm        Cyclone II    -           Stratix II
65 nm        Cyclone III   Arria I     Stratix III
40 nm        Cyclone IV    Arria II    Stratix IV
Altera FPGA Device Support by Tools

Version            Low-cost              Mid-range              High-performance
Quartus 7.1        Cyclone III all,      Arria GX all,          Stratix II smallest,
                   Cyclone IV none       Arria II GX none       Stratix III none
Quartus 8.1        Cyclone III all,      Arria GX all,          Stratix I, II, III
                   Cyclone IV none       Arria II GX none       smallest
Quartus 9.0 sp2    Cyclone III all,      Arria GX all,          Stratix I, II, III
(Sep. 09)          Cyclone IV none       Arria II GX none       smallest
Quartus 9.1        Cyclone III all,      Arria GX all,          Stratix I, II, III all;
(Nov. 09)          Cyclone IV smallest   Arria II GX smallest   Stratix IV none
FPGA and ASIC Performance Measures
23
The common ground is vague
• Hardware Performance: cycles per block, cycles per
byte, Latency (cycles), Latency (ns), Throughput for long
messages, Throughput for short messages, Throughput
at 100 KHz, Clock Frequency, Clock Period, Critical
Path Delay, Modexp/s, PointMul/s
• Hardware Cost: Slices, Slices Occupied, LUTs, 4-input
LUTs, 6-input LUTs, FFs, Gate Equivalents (GE), Size on
ASIC, DSP Blocks, BRAMs, Number of Cores, CLBs,
MULs, XOR, NOT, AND
• Hardware efficiency:
Hardware performance/Hardware cost
24
Our Favorite Hardware Performance Metrics:
Mbit/s for Throughput
ns for Latency
Allows for easy cross-comparison among implementations in software (microprocessors), FPGAs (various vendors), and ASICs (various libraries)
25
But how to define and measure
throughput and latency for hash functions?
Time to hash N blocks of message = Htime(N, TCLK) =
  Initialization Time(TCLK)
  + N * Block Processing Time(TCLK)
  + Finalization Time(TCLK)

Latency = Time to hash ONE block of message = Htime(1, TCLK) =
  Initialization Time + Block Processing Time + Finalization Time

Throughput (for long messages) =
  Block size / (Htime(N+1, TCLK) - Htime(N, TCLK)) =
  Block size / Block Processing Time(TCLK)
26
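The definitions above translate directly into code. A sketch (the cycle counts, block size, and TCLK below are illustrative, not taken from any real core):

```python
def htime_ns(n_blocks, tclk_ns, cycles_i, cycles_p, cycles_f):
    """Time to hash n blocks: initialization + n * block processing
    + finalization, each expressed as a cycle count times TCLK."""
    return (cycles_i + n_blocks * cycles_p + cycles_f) * tclk_ns

def latency_ns(tclk_ns, cycles_i, cycles_p, cycles_f):
    """Latency = time to hash ONE block of message."""
    return htime_ns(1, tclk_ns, cycles_i, cycles_p, cycles_f)

def throughput_mbps(block_size_bits, tclk_ns, cycles_p):
    """Long-message throughput = block size / block processing time;
    Htime(N+1) - Htime(N) collapses to cycles_p * TCLK."""
    return block_size_bits / (cycles_p * tclk_ns) * 1e3  # bits/ns -> Mbit/s

# Illustrative numbers: 512-bit block, 65 cycles per block, 10 ns clock
print(latency_ns(10.0, 5, 65, 4))   # (5 + 65 + 4) * 10 = 740 ns
print(throughput_mbps(512, 10.0, 65))
```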
But how to define and measure
throughput and latency for hash functions?
Initialization Time(TCLK)   = cyclesI ⋅ TCLK
Block Processing Time(TCLK) = cyclesP ⋅ TCLK
Finalization Time(TCLK)     = cyclesF ⋅ TCLK

where:
  Block size - from specification
  TCLK - from place & route report (or experiment)
  cyclesI, cyclesP, cyclesF - from analysis of block diagram and/or functional simulation
27
How to compare
hardware speed vs. software speed?
eBASH reports (http://bench.cr.yp.to/results-hash.html)

In graphs:
  Time(n) = time in clock cycles vs. message size in bytes for n-byte messages, with n = 0, 1, 2, 3, ..., 2048, 4096
In tables:
  Performance in cycles/byte for n = 8, 64, 576, 1536, 4096, and long messages

Performance for long message = (Time(4096) - Time(2048)) / 2048
28
How to compare
hardware speed vs. software speed?
Throughput [Gbit/s] = (8 bits/byte ⋅ clock frequency [GHz]) / Performance for long message [cycles/byte]
29
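Combining the two formulas above, software throughput in Gbit/s follows from two eBASH-style timing measurements and the clock frequency. A sketch with illustrative numbers (not actual eBASH measurements):

```python
def cycles_per_byte_long(time_4096_cycles, time_2048_cycles):
    """eBASH long-message performance: the marginal cost, in cycles per
    byte, of the extra 2048 bytes between the two measurements."""
    return (time_4096_cycles - time_2048_cycles) / 2048

def throughput_gbps(cpb, clock_ghz):
    """Throughput [Gbit/s] = 8 bits/byte * clock frequency [GHz]
    / performance for long message [cycles/byte]."""
    return 8 * clock_ghz / cpb

# Illustrative: 4096 bytes take 65536 cycles, 2048 bytes take 36864 cycles
cpb = cycles_per_byte_long(65536, 36864)   # (65536 - 36864) / 2048 = 14
print(throughput_gbps(cpb, 3.0))           # 8 * 3.0 / 14 ≈ 1.71 Gbit/s
```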
How to measure hardware cost in FPGAs?
1. Stand-alone cryptographic core on FPGA
   Cost of the smallest FPGA that can fit the core.
   Unit: USD [FPGA vendors would need to publish the MSRP (manufacturer's suggested retail price) of their chips - not very likely] or size of the chip in mm2 - easy to obtain
2. Part of an FPGA System-on-Chip
   Vector: (CLB slices, BRAMs, MULs, DSP units) for Xilinx
           (LEs, memory bits, PLLs, MULs, DSP units) for Altera
3. FPGA prototype of an ASIC implementation
   Force the implementation to use only reconfigurable logic (no DSPs or multipliers, distributed memory instead of BRAM); use CLB slices as the metric [LEs for Altera]
30
How to measure hardware cost in ASICs?
1. Stand-alone cryptographic core
Cost = f(die area, pin count)
Tables/formulas available from semiconductor foundries
2. Part of an ASIC System On-Chip
Cost ~ circuit area
Units:
μm2
or
GE (gate equivalent) = size of a NAND2 cell
31
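The GE unit above normalizes area by the size of a NAND2 cell in the target library, so areas from different libraries become comparable. A minimal sketch (the NAND2 cell area is an illustrative value, not from any particular library):

```python
def area_in_ge(area_um2, nand2_um2):
    """Convert circuit area in um^2 to gate equivalents (GE),
    where 1 GE = the area of one NAND2 cell in the same library."""
    return area_um2 / nand2_um2

# Illustrative: a 50,000 um^2 design in a library with a 2.0 um^2 NAND2 cell
print(area_in_ge(50_000, 2.0))  # 25000.0 GE
```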
Deliverables (1)
1. Detailed block diagram of the Datapath
with names of all signals matching VHDL code
[electronic version a bonus]
2. Interface with the division into the Datapath
and the Controller [electronic version]
3. ASM charts of the Controller, and a block diagram of
connections among FSMs (if more than one used)
[electronic version a bonus]
4. RTL VHDL code of the Datapath, the Controller, and
the Top-Level Circuit
5. Updated timing and area analysis
formulas for timing confirmed through simulation
32
Deliverables (2)
6. Report on verification
   - highest-level entity verified for functional correctness:
     • Functional simulation
     • Post-synthesis simulation
     • Timing simulation [bonus]
   - verification of lower-level entities:
     - Name of entity
     - Testbench used for verification
     - Result of verification, incorrect behavior, possible source of error
33
Deliverables (3)
7. Results of benchmarking using ATHENa
   - Entire core or the highest-level entity verified for correct functionality
   - Xilinx Spartan 3, Virtex 4, Virtex 5
   - Three methods of testing:
     • Single_run
     • Placement_search [cost table = 1, 11, 21]
     • Exhaustive_search [cost_table = 31, 41, 51; speed or area; two sets of requested frequencies]
   - Results generated by ATHENa
   - Your own graphs and charts
   - Observations and conclusions
34
Bonus Deliverables (4)
8. Pseudo-code [but not C code]
9. Bugs and suspicious behavior of ATHENa
10. Additional results of benchmarking using ATHENa
    - Altera Cyclone II, Stratix II, Cyclone III, Arria I, Stratix III
    - Three methods of testing:
      • Single_run
      • Placement_search [seed = 1, 1000, 2000]
      • Exhaustive_search [seed = 3000, 4000, 5000; speed or area; two sets of requested frequencies]
    - Results generated by ATHENa
    - Your own graphs and charts
    - Observations and conclusions
35
Bonus Deliverables (5)
11. Report from the meeting with students working on the same SHA core
    - Summary of major differences
    - Advantages and disadvantages of your design
12. Bugs found in the:
    - Padding script
    - Testbench
    - Class examples
    - Slides
    - Documentation
    - SHA-3 Packages
    - Etc.
36
Bonus Deliverables (6)
13. Extending the design to cover all hash function variants
    - Hash value sizes: 512 [highest priority], 384, 224
    - Other variant/parameter support specific to a given hash function
    - Support through generics or constants
14. Padding in hardware
    Assuming that the message size before padding is already a multiple of:
    - the word size
    - the byte size
    - a single bit
37
Composition of Students
4 GWU
PhD candidates
14 local
students
(with 3 former
BSCpE graduates)
14 international
students
38
After Grading
1. Summary of results published on the course web page
2. Selected students invited to develop articles/reports
to be posted on the
- ATHENa web page
- SHA-3 Zoo Web Page
3. Unification, generalization and optimization of codes
by Ice, myself, and other students
4. Presentation to NIST, conference submissions,
presentation at the Second SHA-3 Conference in
Santa Barbara in August 2010.
39