SHA-3 Candidate Evaluation
1
FPGA Benchmarking - Phase 1
• 14 Round-2 SHA-3 Candidates implemented by
33 graduate students following the same design
methodology (the same function implemented
independently by 2-3 students)
• Uniform Input/Output Interface
• Uniform Generic Testbench
• Optimization for maximum throughput to cost ratio
• Benchmarking on multiple FPGA platforms from
Xilinx and Altera using ATHENa
• Comparing vs. optimized implementations of
SHA-1 & SHA-2
• Compressing all results into one single ranking
2
Division into Datapath and Controller
[Block diagram: the circuit is split into a Datapath (Execution Unit) and a Controller (Control Unit). The Datapath receives Data Inputs and produces Data Outputs; the Controller receives Control & Status Inputs and produces Control & Status Outputs. The Controller drives the Datapath with Control Signals and receives Status Signals in return.]
3
Design Methodology
[Design flow: starting from the Specification, derive the Interface; then design the Execution Unit (block diagram → VHDL code) and the Control Unit (Algorithmic State Machine chart → VHDL code).]
4
Steps of the Design Process (1)
Given
1. Specification
2. Interface
Completed
3. Pseudocode
4. Detailed block diagram of the Datapath
5. Interface with the division into the Datapath
and the Controller
6. Timing and area analysis, architectural-level optimizations
7. RTL VHDL code of the Datapath, and
corresponding Testbenches
5
Steps of the Design Process (2)
Remaining to be done
8. ASM chart of the Controller
9. RTL VHDL code of the Controller
and the corresponding testbench
10. Integration of the Datapath and the Controller
11. Testing using uniform generic testbench
(developed by Ice)
12. Source Code Optimizations
13. Performance characterization using ATHENa
14. Documentation and final report
6
FPGA Benchmarking - Phase 2
• extending source codes to cover all hash function
variants
• padding in hardware
• applying additional architectural optimizations
• extended benchmarking (Actel FPGAs, multiple tools,
adaptive optimization strategies, etc.)
• reconciling differences with other available rankings
• preparing the codes for ASIC evaluation
7
How to compress all results into a single
ranking?
8
Single Ranking (1)
• Select several representative FPGA platforms
with significantly different properties, e.g.:
vendor – Xilinx vs. Altera
process – 90 nm vs. 65 nm
LUT size – 4-input vs. 6-input
optimization target – low-cost vs. high-performance
• Use ATHENa to characterize all SHA-3 candidates
and SHA-2 using these platforms in terms
of the target performance metrics
(e.g. throughput/area ratio)
9
Single Ranking (2)
• Calculate the ratio of
SHA-3 candidate performance vs.
SHA-2 performance (for the same security level)
• Calculate the geometric mean of these ratios
over multiple platforms
10
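The normalization step above can be sketched in Python. The ratio values below are hypothetical; the geometric mean is the n-th root of the product of the per-platform ratios:

```python
from math import prod

def geometric_mean(ratios):
    """Geometric mean of performance ratios across FPGA platforms."""
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical throughput/area ratios of one SHA-3 candidate vs. SHA-2,
# measured on four different FPGA platforms.
ratios = [1.8, 2.1, 1.5, 1.9]
score = geometric_mean(ratios)  # single number used for the ranking
```

The geometric mean (rather than the arithmetic mean) keeps the result independent of which platform's units dominate: doubling performance on any one platform scales the score by the same factor.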
FPGA and ASIC Performance Measures
11
The common ground is vague
• Hardware Performance: cycles per block, cycles per
byte, Latency (cycles), Latency (ns), Throughput for long
messages, Throughput for short messages, Throughput
at 100 KHz, Clock Frequency, Clock Period, Critical
Path Delay, Modexp/s, PointMul/s
• Hardware Cost: Slices, Slices Occupied, LUTs, 4-input
LUTs, 6-input LUTs, FFs, Gate Equivalents (GE), Size on
ASIC, DSP Blocks, BRAMs, Number of Cores, CLBs,
MUL, XOR, NOT, AND
• Hardware efficiency:
Hardware performance/Hardware cost
12
Our Favorite Hardware Performance Metrics:
Mbit/s for Throughput
ns for Latency
These allow easy cross-comparison among implementations
in software (microprocessors), FPGAs (various vendors),
and ASICs (various libraries).
13
But how to define and measure
throughput and latency for hash functions?
Time to hash N blocks of message = Htime(N, TCLK) =
Initialization Time(TCLK)
+ N ⋅ Block Processing Time(TCLK)
+ Finalization Time(TCLK)

Latency = Time to hash ONE block of message = Htime(1, TCLK) =
Initialization Time + Block Processing Time + Finalization Time

Throughput (for long messages) =
Block size / (Htime(N+1, TCLK) − Htime(N, TCLK)) =
Block size / Block Processing Time(TCLK)
14
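The definitions on this slide can be computed directly from the three component times; a minimal sketch (the cycle times below are hypothetical, given in ns):

```python
def htime(n_blocks, t_init, t_block, t_final):
    """Time to hash n_blocks message blocks: init + n * block + final."""
    return t_init + n_blocks * t_block + t_final

def latency(t_init, t_block, t_final):
    """Time to hash ONE block of message."""
    return htime(1, t_init, t_block, t_final)

def throughput_long(block_size_bits, t_init, t_block, t_final):
    """Block size / (Htime(N+1) - Htime(N)); the difference reduces to
    the block processing time, so init/final overhead cancels out."""
    n = 1000  # any N gives the same difference
    dt = (htime(n + 1, t_init, t_block, t_final)
          - htime(n, t_init, t_block, t_final))
    return block_size_bits / dt

# Hypothetical example: 512-bit blocks, times in ns
tp = throughput_long(512, t_init=10.0, t_block=210.0, t_final=40.0)
# 512 bits / 210 ns ≈ 2.44 Gbit/s (bits per ns equal Gbit/s)
```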
But how to define and measure
throughput and latency for hash functions?
Initialization Time(TCLK) = cyclesI ⋅ TCLK
Block Processing Time(TCLK) = cyclesP ⋅ TCLK
Finalization Time(TCLK) = cyclesF ⋅ TCLK

where the Block size comes from the specification, TCLK comes from the
place & route report (or experiment), and the cycle counts (cyclesI,
cyclesP, cyclesF) come from analysis of the block diagram and/or
functional simulation.
15
How to compare
hardware speed vs. software speed?
eBASH reports (http://bench.cr.yp.to/results-hash.html)
In graphs:
Time(n) = time in clock cycles vs. message size in bytes for
n-byte messages, with n = 0, 1, 2, 3, …, 2048, 4096
In tables:
Performance in cycles/byte for n = 8, 64, 576, 1536, 4096, and long messages

Performance for long message =
(Time(4096) − Time(2048)) / 2048
16
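The eBASH long-message figure subtracts the two largest measurements so that per-message overhead cancels, leaving the marginal cost per byte; a small sketch with made-up cycle counts:

```python
def cycles_per_byte_long(time_4096, time_2048):
    """eBASH long-message performance: (Time(4096) - Time(2048)) / 2048."""
    return (time_4096 - time_2048) / 2048

# Hypothetical measurements, in clock cycles
cpb = cycles_per_byte_long(time_4096=45_000, time_2048=24_520)
# (45000 - 24520) / 2048 = 10.0 cycles/byte
```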
How to compare
hardware speed vs. software speed?
Throughput [Gbit/s] =
8 bits/byte ⋅ clock frequency [GHz] / Performance for long message [cycles/byte]
17
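Converting the software cycles/byte figure into the hardware-comparable Gbit/s unit with the formula above (the CPU frequency and cycles/byte values are illustrative):

```python
def throughput_gbit_s(clock_ghz, cycles_per_byte):
    """Throughput [Gbit/s] = 8 bits/byte * f [GHz] / performance [cycles/byte]."""
    return 8.0 * clock_ghz / cycles_per_byte

# Hypothetical: a 3.2 GHz CPU hashing long messages at 10 cycles/byte
tp = throughput_gbit_s(3.2, 10.0)  # 2.56 Gbit/s
```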
How to measure hardware cost in FPGAs?
1. Stand-alone cryptographic core on FPGA
Cost of the smallest FPGA that can fit the core.
Unit: USD [FPGA vendors would need to publish the MSRP
(manufacturer's suggested retail price) of their chips] – not very likely
or the size of the chip in mm² – easy to obtain
2. Part of an FPGA System-on-Chip
Cost vector: (CLB slices, BRAMs, MULs, DSP units) for Xilinx,
(LEs, memory bits, PLLs, MULs, DSP units) for Altera
3. FPGA prototype of an ASIC implementation
Force the implementation using only reconfigurable logic
(no DSPs or multipliers, distributed memory vs. BRAM):
Use CLB slices as a metric.
[LEs for Altera]
18
How to measure hardware cost in ASICs?
1. Stand-alone cryptographic core
Cost = f(die area, pin count)
Tables/formulas available from semiconductor foundries
2. Part of an ASIC System-on-Chip
Cost ~ circuit area
Units: μm²
or GE (gate equivalent) = the size of a NAND2 cell
19