ECE 545 Digital System Design with VHDL Course web page: ECE web page  Courses  Course web pages  ECE 545 http://ece.gmu.edu/coursewebpages/ECE/ECE545/F11/


ECE 545
Digital System Design with VHDL
Course web page:
ECE web page → Courses → Course web pages → ECE 545
http://ece.gmu.edu/coursewebpages/ECE/ECE545/F11/
Kris Gaj
Research and teaching interests:
• reconfigurable computing
• computer arithmetic
• cryptography
• network security
Contact:
The Engineering Building, room 3225
[email protected]
Office hours: Thursday, 7:30-8:30 PM,
Tuesday, 6:00-7:00 PM,
and by appointment
ECE 545
Part of:
MS in Computer Engineering
One of five core courses
(must be passed with B or better)
Strongly suggested for two specialization areas:
Digital Systems Design
Microprocessor and Embedded Systems
Elective course in the remaining specialization areas
MS in Electrical Engineering
Elective
ECE 545
Part of:
PhD in Electrical and Computer Engineering
Knowledge tested at the
Technical Qualifying Exam (TQE)
Topic 2: Digital Design and Computer Organization
ECE 545 Class of Fall 2011
[Pie chart: students by program. MS SE: 1, NDG: 2, PhD ECE: 1, MS CpE: 6, MS EE: 8]
• 18 students total
• 7 admitted in Fall 2011
• 5 admitted in Spring 2011
I am interested
in…
I want to specialize
primarily in…
CAD tools & Design Automation
VLSI
Hardware Description Languages
Recommended
program &
specialization
MS CpE
Digital Systems Design
Digital Systems Design FPGAs & Reconfigurable computing
ASICs & FPGAs
Computer Arithmetic
VHDL/Verilog
Front-end ASIC Design
(algorithmic downto gate level)
CAD Tools
Reconfigurable
Computing
Back-end ASIC Design
(circuit and mask layout levels)
Analog & Digital Circuit Design
Microelectronics
VLSI Fabrication
VLSI Fabrication
Microelectronics
Nanoelectronics
Nanoelectronics
Semiconductor Devices
MS EE
Microelectronics/
Nanoelectronics
Courses
Design levels: algorithmic, register-transfer, gate, transistor, layout, devices
ECE 545 Digital System Design with VHDL
ECE 645 Computer Arithmetic
ECE 681 VLSI Design for ASICs
ECE 682 VLSI Test Concepts
ECE 586 Digital Integrated Circuits
ECE 680 Physical VLSI Design
ECE 584 Semiconductor Device Fundamentals
ECE 684 MOS Device Electronics
CpE
Digital Systems Design
CpE
Microprocessors and
Embedded Systems
PreApproved
Electives
ECE 545 Digital System Design
with VHDL
ECE 645 Computer Arithmetic
ECE 681 VLSI Design for ASICs
ECE 682 VLSI Test Concepts
ECE 586 Digital Integrated Circuits
ECE 511 Microprocessors
ECE 545 Digital System Design
with VHDL
ECE 611 Advanced Microprocessors
ECE 612 Real-Time Embedded
Systems
Suggested
Electives
CS 540, 583 (languages, algorithms)
CS 635
(parallel machines)
ECE 584, 684, … (technology)
ECE 511, 611, … (microprocessors) ECE 542, 642, 742 (networks)
ECE 645, 681 (digital design)
ECE 646, 746, … (applications)
ECE 548 (sequential mach. theory)
Professors
K. Gaj, J. Kaps,
T. Storey, T.K. Ramesh
J. Kaps, K. Gaj, D. Tabak,
C. Sabzevari
DIGITAL SYSTEMS DESIGN
Concentration advisors: Kris Gaj, Jens-Peter Kaps, Ken Hintz
1. ECE 545 Digital System Design with VHDL
– K. Gaj, project, FPGA design with VHDL,
Aldec/Mentor Graphics, Xilinx/Altera
2. ECE 645 Computer Arithmetic
– K. Gaj, project, FPGA design with VHDL
Aldec/Mentor Graphics, Xilinx/Altera
3. ECE 681 VLSI Design for ASICs
– T.K. Ramesh, project/lab, front-end and back-end ASIC design with
Synopsys tools
4. ECE 586 Digital Integrated Circuits
– D. Ioannou, R. Mulpuri,
5. ECE 682 VLSI Test Concepts
– T. Storey
Grading Scheme
• Homework
- 10%
• Project
- 40%
• Midterm Exam
- 20%
• Final Exam
- 30%
Midterm exam 1
• 2 hours 30 minutes
• in class
• design-oriented
• open-books, open-notes
• practice exams available on the web
Tentative date:
Thursday, October 27th
Final exam
• 2 hours 45 minutes
• in class
• design-oriented
• open-books, open-notes
• practice exams available on the web
Date:
Monday, December 15, 4:30-7:15pm
Textbooks
Required Textbook
Pong P. Chu, RTL Hardware Design Using VHDL,
Wiley-Interscience, 2006.
Supplementary Textbook – Basics Refresher
Stephen Brown and Zvonko Vranesic,
Fundamentals of Digital Logic with VHDL Design,
McGraw-Hill, 3rd or 2nd Edition
Supplementary Textbook – Advanced
Hubert Kaeslin, Digital Integrated Circuit Design:
From VLSI Architectures to CMOS Fabrication,
Cambridge University Press; 1st Edition, 2008.
Used in ECE 681
“VLSI Design for ASICs”
Technology
&
Tools
What is an FPGA?
[Figure: FPGA structure showing Configurable Logic Blocks, I/O Blocks, and Block RAMs]
Two competing implementation
approaches
ASIC
Application Specific
Integrated Circuit
• designed all the way
from behavioral description
to physical layout
• designs must be sent
for expensive and time
consuming fabrication
in semiconductor foundry
FPGA
Field Programmable
Gate Array
• no physical layout design;
design ends with
a bitstream used
to configure a device
• bought off the shelf
and reconfigured by
designers themselves
FPGAs vs. ASICs
ASICs
• High performance
• Low power
• Low cost (but only in high volumes)
FPGAs
• Off-the-shelf
• Low development costs
• Short time to the market
• Reconfigurability
FPGA Design process (1)
Design and implement a simple unit permitting to
speed up encryption with RC5-similar cipher with
fixed key set on 8031 microcontroller. Unlike in
the experiment 5, this time your unit has to be able
to perform an encryption algorithm by itself,
executing 32 rounds…..
Specification / Pseudocode
On-paper hardware design
(Block diagram & ASM chart)
VHDL description (Your Source Files)
library IEEE;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
  port(
    clock, reset, encr_decr: in std_logic;
    data_input:  in std_logic_vector(31 downto 0);
    data_output: out std_logic_vector(31 downto 0);
    out_full:    in std_logic;
    key_input:   in std_logic_vector(31 downto 0);
    key_read:    out std_logic  -- no semicolon after the last port
  );
end RC5_core;  -- the name must match the entity declaration
Functional simulation
Synthesis
Post-synthesis simulation
FPGA Design process (2)
Implementation
Timing simulation
Configuration
On chip testing
Simulation Tools
FPGA Synthesis Tools
Logic Synthesis
VHDL description
architecture MLU_DATAFLOW of MLU is
  signal A1: STD_LOGIC;
  signal B1: STD_LOGIC;
  signal Y1: STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
  A1 <= A when (NEG_A='0') else
        not A;
  B1 <= B when (NEG_B='0') else
        not B;
  Y  <= Y1 when (NEG_Y='0') else
        not Y1;
  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;
  with (L1 & L0) select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;
Circuit netlist
FPGA Implementation
• After synthesis the entire implementation
process is performed by FPGA vendor tools
Design Process control from Active-HDL
Xilinx FPGA Tools (ECE Labs)
Aldec Active-HDL Design Flow (IDE: Aldec Active-HDL)
  simulation: Aldec Active-HDL
  synthesis: Xilinx XST & Synopsys Synplify Premier
  implementation: Xilinx ISE Design Suite
Xilinx ISE Design Flow (IDE: Xilinx ISE Design Suite)
  simulation: Mentor Graphics ModelSim SE
  synthesis: Xilinx XST & Synopsys Synplify Premier
  implementation: Xilinx ISE Design Suite
Xilinx FPGA Tools (Home)
Xilinx ISE Design Flow (IDE: Xilinx ISE WebPACK, restricted)
  simulation: Mentor Graphics ModelSim PE Student Edition
  synthesis: Xilinx XST (restricted)
  implementation: Xilinx ISE WebPACK (restricted)
Aldec Active-HDL Design Flow (IDE: Aldec Active-HDL Student Edition)
  simulation: Aldec Active-HDL Student Edition
  synthesis: Xilinx XST (restricted)
  implementation: Xilinx ISE WebPACK (restricted)
Altera FPGA Tools (ECE Labs)
Altera Design Flow
  simulation: Mentor Graphics ModelSim-Altera
  synthesis & implementation: Altera Quartus II Subscription Edition
Altera FPGA Tools (Home)
Altera Design Flow
  simulation: Mentor Graphics ModelSim-Altera Starter (restricted)
  synthesis & implementation: Altera Quartus II Web Edition (restricted)
Project
Project
• semester-long
• related to the research project conducted by the Cryptographic Engineering Research Group (CERG) at GMU
• supporting NIST (National Institute of Standards and Technology) in the evaluation of candidates for a new cryptographic standard
CERG @ GMU
http://cryptography.gmu.edu/
10 PhD students
8 MS students
co-advised by Kris Gaj & Jens-Peter Kaps
Collaborators
Joint 3-year project (2010-2012) on benchmarking
cryptographic algorithms in software and hardware
sponsored by
software: Daniel J. Bernstein, University of Illinois at Chicago
FPGAs: Jens-Peter Kaps, George Mason University
FPGAs/ASICs: Patrick Schaumont, Virginia Tech
ASICs: Leyla Nazhandali, Virginia Tech
Background
Outline
• Crypto 101
• Cryptographic standard contests
• Progress in evaluation methods
  – AES
  – eSTREAM
  – SHA-3
• Benchmarking tools for software and FPGAs
• Open problems
Crypto 101
Cryptography is Everywhere
Buying a book on-line
Teleconferencing
over Intranets
Withdrawing cash from ATM
Backing up files
on remote server
Cryptographic Transformations
Most Often Implemented in Practice
Secret-Key Ciphers
Block Ciphers
Hash Functions
Stream Ciphers
encryption
message & user
authentication
Public-Key Cryptosystems
digital signatures
key agreement
key exchange
Digital Signature
Signature
HANDWRITTEN
DIGITAL
A6E3891F2939E38C745B
25289896CA345BEF5349
245CBA653448E349EA47
Main Goals:
• unique identification
• proof of agreement to the contents
of the document
Handwritten and Digital Signatures
Common Features
Handwritten signature
Digital signature
1. Unique
2. Impossible to be forged
3. Impossible to be denied by the author
4. Easy to verify by an independent judge
5. Easy to generate
Handwritten and Digital Signatures
Differences
Handwritten signature                         Digital signature
6. Associated physically with the document    6. Can be stored and transmitted independently of the document
7. Almost identical for all documents         7. Function of the document
8. Usually at the last page                   8. Covers the entire document
Hash Functions in Digital Signature Schemes
Alice
Bob
Message
Message
Signature
Signature
Hash
function
Hash
function
Hash value 1
Hash value
yes
no
Hash value 2
Public key
cipher
Alice’s private key
Public key
cipher
Alice’s public key
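The hash-then-sign flow in the diagram above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "public key cipher" is textbook RSA with the toy modulus n = 61 · 53 = 3233 (insecure, chosen so the arithmetic is easy to follow), SHA-256 stands in for the hash function, and the function names are hypothetical.

```python
import hashlib

# Toy textbook-RSA parameters (assumption, insecure): n = 61 * 53, e * d = 1 mod lcm(60, 52).
N, E, D = 3233, 17, 2753

def hash_value(message: bytes) -> int:
    """Hash the message, then reduce the digest so it fits below the toy modulus."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") % N

def sign(message: bytes) -> int:
    """Alice: apply the public-key cipher with her private key to the hash value."""
    return pow(hash_value(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Bob: compare hash value 1 (recomputed) with hash value 2 (recovered from the signature)."""
    hash_1 = hash_value(message)
    hash_2 = pow(signature, E, N)  # public-key cipher with Alice's public key
    return hash_1 == hash_2

msg = b"ECE 545 project report"
sig = sign(msg)
print(verify(msg, sig))                # True: the two hash values match
print(verify(msg, (sig + 1) % N))      # False: an altered signature is rejected
```

Signing the short hash value rather than the whole message is what lets the signature cover the entire document while staying a fixed size.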
Hash Function
arbitrary length
m
message
h
Collision Resistance:
It is computationally
infeasible to find two
distinct messages m ≠ m’ such that
h(m) = h(m’)
h(m)
fixed length
hash
function
hash value
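As a quick illustration of "arbitrary-length message in, fixed-length hash value out", the sketch below uses SHA-256 from Python's standard `hashlib` module. SHA-256 is only one example of a hash function h; the slide does not single out a specific algorithm here.

```python
import hashlib

def h(message: bytes) -> bytes:
    """Example hash function h: SHA-256, which always returns 256 bits."""
    return hashlib.sha256(message).digest()

# Messages of very different lengths all map to a 256-bit hash value.
for m in (b"", b"abc", b"a" * 1_000_000):
    print(len(m), "bytes ->", len(h(m)) * 8, "bits")
```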
Cryptographic
Standard
Contests
Cryptographic Standards Before 1997
[Timeline figure, 1970 to 2010.
Secret-key block ciphers: DES – Data Encryption Standard (IBM & NSA), Triple DES.
Hash functions (NSA): SHA, SHA-1 – Secure Hash Algorithm, SHA-2.
Dates marked on the timeline: 1977, 1993, 1995, 1999, 2003, 2005.]
Why a Contest for
a Cryptographic Standard?
• Avoid back-door theories
• Speed-up the acceptance of the standard
• Stimulate non-classified research on methods of
designing a specific cryptographic transformation
• Focus the effort of a relatively small cryptographic
community
Cryptographic Standard Contests
AES:               IX.1997 – X.2000,   15 block ciphers → 1 winner
NESSIE, CRYPTREC:  I.2000 – XII.2002
eSTREAM:           XI.2004 – V.2008,   34 stream ciphers → 4 HW winners + 4 SW winners
SHA-3:             X.2007 – XII.2012,  51 hash functions → 1 winner
[Timeline axis: 1996 to 2013]
Cryptographic Contests - Evaluation Criteria
Security
Software Efficiency
μProcessors
Hardware Efficiency
μControllers
Flexibility
Simplicity
FPGAs
ASICs
Licensing
Specific Challenges of Evaluations
in Cryptographic Contests
• Very wide range of possible applications and, as a result,
of performance and cost targets
throughput:
from a few Mbit/s to hundreds of Gbit/s
cost:
from a few cents to thousands of dollars
• Winner in use for the next 20-30 years, implemented using
technologies not in existence today
• Large number of candidates
• Limited time for evaluation
• Only one winner and the results are final
Mitigating Circumstances
• Security is a primary criterion
• Performance of competing algorithms tends to vary significantly
(sometimes by as much as 500 times)
• Only relatively large differences in performance matter
(typically at least 20%)
• Multiple groups independently implement the same algorithms
(catching mistakes, comparing best results, etc.)
• Second best may be good enough
AES
Contest
1997-2000
Rules of the Contest
Each team submits
• Detailed cipher specification
• Justification of design decisions
• Tentative results of cryptanalysis
• Source code in C
• Source code in Java
• Test vectors
AES: Candidate Algorithms
Canada: CAST-256, Deal
USA: Mars, RC6, Twofish, Safer+, HPC
Costa Rica: Frog
Germany: Magenta
Belgium: Rijndael
France: DFC
Israel, UK, Norway: Serpent
Korea: Crypton
Japan: E2
Australia: LOKI97
[Map figure; counts marked on the map: 2, 8, 4, 1]
AES Contest Timeline
June 1998: 15 candidates
  CAST-256, Crypton, Deal, DFC, E2,
  Frog, HPC, LOKI97, Magenta, Mars,
  RC6, Rijndael, Safer+, Serpent, Twofish
Round 1 (criteria: security, software efficiency)
August 1999: 5 final candidates
  Mars, RC6, Twofish (USA)
  Rijndael, Serpent (Europe)
Round 2 (criteria: security, software efficiency, hardware efficiency)
October 2000: 1 winner: Rijndael (Belgium)
NIST Report: Security & Simplicity
Security
High
MARS
Twofish
Serpent
Rijndael
Adequate
RC6
Complex
Simple
Simplicity
Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++
[Bar chart: throughput [Mbits/s], scale 0 to 30, for 128-bit, 192-bit, and 256-bit keys; ciphers: Rijndael, RC6, Twofish, Mars, Serpent]
NIST Report: Software Efficiency
Encryption and Decryption Speed
          32-bit processors          64-bit processors    DSPs
high      RC6                        Rijndael, Twofish    Rijndael, Twofish
medium    Rijndael, Mars, Twofish    Mars, RC6            Mars, RC6
low       Serpent                    Serpent              Serpent
Efficiency in FPGAs: Speed
Xilinx Virtex XCV-1000
[Bar chart: throughput [Mbit/s], scale 0 to 500, for Serpent x8, Rijndael, Twofish, Serpent x1, RC6, and Mars; results from George Mason University, University of Southern California, and Worcester Polytechnic Institute]
Efficiency in ASICs: Speed
MOSIS 0.5μm, NSA Group
Throughput [Mbit/s]
              128-bit key scheduling    3-in-1 (128, 192, 256 bit) key scheduling
Rijndael      606                       443
Serpent x1    202                       202
Twofish       105                       105
RC6           103                       104
Mars          57                        57
Lessons Learned
Results for ASICs matched results for FPGAs very well,
and both were very different from the software results
FPGA
ASIC
x8
x1
GMU+USC, Xilinx Virtex XCV-1000
x1
NSA Team, ASIC, 0.5μm MOSIS
Serpent fastest in hardware, slowest in software
Lessons Learned
Hardware results matter!
Final round of the AES Contest, 2000
Speed in FPGAs
GMU results
Votes at the AES 3 conference
Limitations of the AES Evaluation
• Optimization for maximum throughput
• Single high-speed architecture per candidate
• No use of embedded resources of FPGAs
  (Block RAMs, dedicated multipliers)
• Single FPGA family from a single vendor:
  Xilinx Virtex
eSTREAM
Contest
2004-2008
eSTREAM - Contest for a new
stream cipher standard
PROFILE 1 (SW)
• Stream cipher suitable for
software implementations optimized for high speed
• Key size - 128 bits
• Initialization vector – 64 bits or 128 bits
PROFILE 2 (HW)
• Stream cipher suitable for
hardware implementations with limited memory,
number of gates, or power supply
• Key size - 80 bits
• Initialization vector – 32 bits or 64 bits
eSTREAM Contest Timeline
              PROFILE 1 (SW)           PROFILE 2 (HW)
April 2005    23 Phase 1 candidates    25 Phase 1 candidates
July 2006     13 Phase 2 candidates    20 Phase 2 candidates
April 2007    8 Phase 3 candidates     8 Phase 3 candidates
May 2008      4 winners:               4 winners:
              HC-128, Rabbit,          Grain v1, Mickey v2,
              Salsa20, SOSEMANUK       Trivium, F-FCSR-H v2
Lessons Learned
Very large differences among
8 leading candidates
~30 x in terms of area (Grain v1 vs. Edon80)
~500 x in terms of the throughput to area ratio
(Trivium (x64) vs. Pomaranch)
Hardware Efficiency in FPGAs
Xilinx Spartan 3, GMU SASC 2007
[Scatter plot: throughput [Mbit/s], scale 0 to 12000, vs. area [CLB slices], scale 0 to 1400, for Trivium, Grain, Mickey-128, and AES, with parallelization factors from x1 to x64 marked on the curves]
ASIC Evaluations
• Two major projects
  – T. Good, M. Benaissa, University of Sheffield, UK
    (Phases 1-3) – 0.13μm CMOS
  – F.K. Gürkaynak, et al., ETH Zurich, Switzerland
    (Phase 1) – 0.25μm CMOS
• Two representative applications
  – WLAN @ 10 Mbits/s
  – RFID / WSN @ 100 kHz clock
eSTREAM ASIC Evaluations
New compared to AES:
• Post-layout results, followed by
• Actually fabricated ASIC chips
  (0.18μm CMOS)
• More complex performance measures
  – Power x Area x Time
• New types of analyses
  – Power x Latency vs. Area
  – Throughput/Area vs. Energy per bit
SHA-3
Contest
2007-2012
NIST SHA-3 Contest - Timeline
Oct. 2008: Round 1, 51 candidates
July 2009: Round 2, 14 candidates
Dec. 2010: Round 3, 5 candidates
Mid 2012: 1 winner
SHA-3 Round 2
Features of the SHA-3 Round 2 Evaluation
• Optimization for maximum throughput to area ratio
• 10 FPGA families from two major vendors:
  Xilinx and Altera
But still…
• Single high-speed architecture per candidate
• No use of embedded resources of FPGAs (Block RAMs,
  dedicated multipliers, DSP units)
Throughput vs. Area Normalized to Results for SHA-256
and Averaged over 11 FPGA Families – 256-bit variants
Throughput vs. Area Normalized to Results for SHA-512
and Averaged over 11 FPGA Families – 512-bit variants
Performance Metrics
Primary
1. Throughput (single message)
2. Area
3. Throughput / Area
Secondary
• Hash Time for Short Messages (up to 1000 bits)
Overall Normalized Throughput: 256-bit variants of algorithms
Normalized to SHA-256, Averaged over 10 FPGA families
[Bar chart, scale 0 to 8; candidate names not preserved. Bar values in descending order: 7.47, 7.21, 5.40, 3.83, 3.46, 2.98, 2.21, 1.82, 1.74, 1.70, 1.69, 1.66, 1.51, 0.98]
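The normalization behind such a chart can be sketched as follows. The slides do not say which kind of average is taken over the FPGA families; this sketch assumes a geometric mean of per-family ratios, and the family names and throughput numbers are hypothetical placeholders, not measured results.

```python
from statistics import geometric_mean

def normalized_throughput(candidate: dict, reference: dict) -> float:
    """Average, over FPGA families, of candidate throughput divided by reference throughput.

    Assumption: a geometric mean is used; the slides do not state the averaging method.
    """
    ratios = [candidate[family] / reference[family] for family in reference]
    return geometric_mean(ratios)

# Hypothetical throughputs in Mbit/s for two made-up families.
sha256_reference = {"family_a": 1000.0, "family_b": 500.0}
some_candidate = {"family_a": 2000.0, "family_b": 4000.0}

# Per-family ratios are 2 and 8, so the geometric mean is 4.
print(normalized_throughput(some_candidate, sha256_reference))
```

A value above 1 means the candidate is faster than the SHA-256 reference on average; a value below 1 means it is slower.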
[Summary table: the 14 Round 2 candidates (BLAKE, BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, Skein) rated on Thr/Area, Thr, Area, and Short msg. for the 256-bit and 512-bit variants; cell contents not preserved]
SHA-3 Round 3
SHA-3 Contest Finalists
New in Round 3
• Multiple Hardware Architectures
• Effect of the Use of Embedded Resources
• Low-Area Implementations
SHA-3
Multiple
High-Speed
Architectures
Study of Multiple Architectures
• Analysis of multiple hardware architectures
  for each finalist, based on the known design
  techniques, such as
  – Folding
  – Unrolling
  – Pipelining
• Identifying the best architecture in terms of the
  throughput to area ratio
• Analyzing the flexibility of all algorithms in
  terms of the speed vs. area trade-offs
BLAKE-256 in Virtex 5
x1 – basic iterative architecture
/k(h) – horizontal folding by a factor of k
/k(v) – vertical folding by a factor of k
xk – unrolling by a factor of k
xk-PPLn – unrolling by a factor of k with n pipeline stages
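The notation above can be related to clock cycles per message block with a small sketch. It rests on assumptions that the slides do not spell out: the basic iterative architecture (x1) takes one clock cycle per round, unrolling by k executes k rounds per clock cycle, and folding by k spreads one round over k clock cycles. The function name is hypothetical, and the clock period, which also changes between architectures, is deliberately left out.

```python
from math import ceil

def cycles_per_block(rounds: int, unroll: int = 1, fold: int = 1) -> int:
    """Clock cycles needed to process one block.

    Assumptions: x1 takes one cycle per round, xk runs k rounds per cycle,
    /k needs k cycles per round.
    """
    if rounds < 1 or unroll < 1 or fold < 1:
        raise ValueError("rounds, unroll, and fold must be positive integers")
    return ceil(rounds * fold / unroll)

print(cycles_per_block(64))             # x1: 64 cycles
print(cycles_per_block(64, unroll=2))   # x2: 32 cycles
print(cycles_per_block(64, fold=4))     # /4: 256 cycles
```

Fewer cycles do not automatically mean higher throughput: unrolling lengthens the critical path and folding shortens it, which is why the throughput to area ratio has to be measured for each architecture.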
256-bit variants in Virtex 5
512-bit variants in Virtex 5
256-bit variants in Stratix III
512-bit variants in Stratix III
SHA-3
Lightweight
Implementations
Study of Lightweight Implementations in
FPGAs
• Two major projects
  – J.-P. Kaps, et al., George Mason University, USA
  – F.-X. Standaert, UCL Crypto Group, Belgium
• Target:
  – Low-cost FPGAs (Spartan 3, Spartan 6, etc.)
    for stand-alone implementations
  – High-performance FPGAs (e.g., Virtex 6)
    for system-on-chip implementations
Typical Assumptions – GMU Group
Implementation Results
• Xilinx Spartan 3, ISE 12.3, after P&R, optimized using ATHENa
SHA-3
Implementations
Based on Embedded
Resources
Implementations Based on the Use of
Embedded Resources in FPGAs
[Figure: FPGA floorplan showing logic blocks, multipliers/DSP units, and RAM blocks]
(#Logic blocks, #Multipliers/DSP units, #RAM_blocks)
Graphics based on The Design Warrior’s Guide to FPGAs
Devices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
Resource Utilization Vector
(#Logic blocks, #Multipliers/DSP units, #RAM blocks)
Xilinx
  Spartan 3:   (#CLB_slices, #multipliers, #Block_RAMs)
  Virtex 5:    (#CLB_slices, #DSP units, #Block_RAMs)
Altera
  Cyclone III: (#LEs, #multipliers, #RAM_bits)
  Stratix III: (#ALUTs, #DSP units, #RAM_bits)
Fitting a Single Core
in a Smaller FPGA Device
BLAKE in Altera Cyclone II
(LOGIC, MUL, MEM) in LEs, MULs, bits
EP2C20: (6862, 0, 0)
EP2C5:  (3129, 0, 12k)
Fitting a Larger Number of Identical Cores
in the same FPGA Device
BLAKE in Virtex 5
XC5VSX50: 3 BLAKE cores, cumulative throughput 6.8 Gbit/s
XC5VSX50: 8 BLAKE cores, cumulative throughput 20.6 Gbit/s
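Both fitting questions reduce to component-wise comparisons of resource utilization vectors, which the sketch below illustrates. The two core vectors are the BLAKE numbers quoted above, with "12k" read as 12 × 1024 bits (an assumption). The device capacities and the names `SMALL_DEVICE` and `LARGE_DEVICE` are illustrative placeholders and should be replaced with datasheet values.

```python
from typing import Tuple

Vector = Tuple[int, int, int]  # (logic elements, multipliers, memory bits)

def fits(core: Vector, device: Vector) -> bool:
    """A core fits if every resource it needs is available on the device."""
    return all(need <= have for need, have in zip(core, device))

def max_cores(core: Vector, device: Vector) -> int:
    """Number of identical cores that fit, limited by the scarcest resource."""
    limits = [have // need for need, have in zip(core, device) if need > 0]
    return min(limits) if limits else 0

# Core vectors from the slide: logic-only BLAKE and BLAKE using embedded memory.
BLAKE_LOGIC_ONLY: Vector = (6862, 0, 0)
BLAKE_WITH_MEMORY: Vector = (3129, 0, 12 * 1024)

# Hypothetical device capacities (placeholders, not checked against a datasheet).
SMALL_DEVICE: Vector = (4608, 26, 119_808)
LARGE_DEVICE: Vector = (18_752, 52, 239_616)

print(fits(BLAKE_LOGIC_ONLY, SMALL_DEVICE))    # False: too many logic elements
print(fits(BLAKE_WITH_MEMORY, SMALL_DEVICE))   # True: memory replaces logic
print(max_cores(BLAKE_LOGIC_ONLY, LARGE_DEVICE), max_cores(BLAKE_WITH_MEMORY, LARGE_DEVICE))
```

Moving part of the design into otherwise unused embedded resources is what allows either a smaller device or more cores per device, and with more cores a higher cumulative throughput.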
Cumulative Throughput for the
Largest Device of a Given Family
Basic architectures
Best architectures
SHA-3
in ASICs
Virginia Tech ASIC
• IBM MOSIS 130nm process
• The first ASIC implementing
5 final SHA-3 candidates
• Taped-out in Feb. 2011,
successfully tested
this Summer
• Multiple chips made available
to other research labs
FPGA Evaluations - Summary
                                AES           eSTREAM                  SHA-3
Multiple FPGA families          No            No                       Yes
Multiple architectures          No            Yes                      Yes
Use of embedded resources       No            No                       Yes
Primary optimization target     Throughput    Area, Throughput/Area    Throughput/Area
Experimental results            No            No                       Yes
Availability of source codes    No            No                       Yes
Specialized tools               No            No                       Yes
ASIC Evaluations - Summary
                                AES           eSTREAM                SHA-3
Multiple processes/libraries    No            No                     Yes
Multiple architectures          No            Yes                    Yes
Primary optimization target     Throughput    Power x Area x Time    Throughput/Area
Post-layout results             No            Yes                    Yes
Experimental results            No            Yes                    Yes
Availability of source codes    No            No                     Yes
Specialized tools               No            No                     No
Benchmarking
Tools
Tools for Benchmarking
Implementations of Cryptography
Software: eBACS, D. Bernstein (UIC), T. Lange (TUE), 2006-present
FPGAs: ATHENa, K. Gaj, J. Kaps, et al. (GMU), 2009-present
ASICs: ?
Benchmarking
in Software: eBACS
eBACS: ECRYPT Benchmarking of
Cryptographic Systems:
http://bench.cr.yp.to/
SUPERCOP - toolkit developed by D. Bernstein and T. Lange
for measuring performance of cryptographic software
• measurements on multiple machines (currently over 90)
• each implementation is recompiled multiple times
  (currently over 1600 times) with various compiler options
• time measured in clock cycles/byte for multiple
  input/output sizes
• median, lower quartile (25th percentile), and upper quartile
  (75th percentile) reported
• standardized function arguments (common API)
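The reported statistics can be illustrated with Python's standard `statistics` module. This is not SUPERCOP code: the cycle counts are made up, the function name is hypothetical, and the quartile convention (the module's default "exclusive" method) is an assumption, since several conventions exist.

```python
from statistics import median, quantiles

def cycles_per_byte_summary(cycle_counts, message_bytes):
    """Return (lower quartile, median, upper quartile) in clock cycles per byte."""
    cpb = [cycles / message_bytes for cycles in cycle_counts]
    q1, _, q3 = quantiles(cpb, n=4)  # default method is "exclusive"
    return q1, median(cpb), q3

# Hypothetical cycle counts from repeated timings of one 2048-byte input.
timings = [32768, 20480, 45056, 24576, 36864, 28672, 40960]
print(cycles_per_byte_summary(timings, 2048))
```

Reporting the median and quartiles instead of the mean keeps a few outlier runs, caused for example by interrupts, from distorting the result.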
SUPERCOP Extension for Microcontrollers –
XBX: 2009-present
Allows on-board timing measurements
Developers:
• Christian Wenzel-Benner, ITK Engineering AG, Germany
• Jens Gräf, LiNetCo GmbH, Heiger, Germany
Supports at least the following microcontrollers:
8-bit:
  Atmel ATmega1284P (AVR)
32-bit:
  TI AR7 (MIPS)
  Atmel AT91RM9200 (ARM 920T)
  Intel XScale IXP420 (ARM v5TE)
  Cortex-M3 (ARM)
Benchmarking
in FPGAs: ATHENa
ATHENa – Automated Tool for Hardware
EvaluatioN
http://cryptography.gmu.edu/athena
Open-source benchmarking environment,
written in Perl, aimed at
AUTOMATED generation of
OPTIMIZED results for
MULTIPLE hardware platforms.
The most recent version
0.6.2 released in June 2011.
Full features in ATHENa 1.0
to be released in 2012.
Why Athena?
"The Greek goddess Athena was frequently
called upon to settle disputes between
the gods or various mortals. Athena Goddess
known for her superb logic and intellect.
Her decisions were usually well-considered,
highly ethical, and seldom motivated
by self-interest.”
from "Athena, Greek Goddess
of Wisdom and Craftsmanship"
Basic Dataflow of ATHENa
[Diagram with numbered steps 0 to 6 connecting the Designer, the ATHENa Server, and the User. Labeled flows: interfaces + testbenches; download of scripts and configuration files; HDL + scripts + configuration files; FPGA synthesis and implementation (HDL + FPGA tools); result summary + database entries; database query; ranking of designs]
Three Components of the ATHENa
Environment
• ATHENa Tool
• ATHENa Database of Results
• ATHENa Website
ATHENa - Tool
Inputs:
• configuration files
• constraint files
• testbench
• synthesizable source files
Outputs:
• result summary (user-friendly)
• database entries (machine-friendly)
ATHENa Major Features (1)
• synthesis, implementation, and timing analysis in batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family
ATHENa Major Features (2)
• automated verification of designs through simulation in batch mode
• support for multi-core processing
• automated extraction and tabulation of results
• several optimization strategies aimed at finding
  – optimum options of tools
  – best target clock frequency
  – best starting point of placement
Relative Improvement of Results from Using ATHENa
Virtex 5, 512-bit Variants of Hash Functions
[Bar chart, scale 0 to 3: Area, Throughput, and Throughput/Area]
Ratios of results obtained using ATHENa suggested options
vs. default options of FPGA tools
Other (Somewhat) Similar Tools
ExploreAhead (part of PlanAhead)
Design Space Explorer (DSE)
Boldport Flow
EDAx10 Cloud Platform
Distinguishing Features of ATHENa
• Support for multiple tools from multiple vendors
• Optimization strategies aimed at the best possible
performance rather than design closure
• Extraction and presentation of results
• Seamless integration with the ATHENa database of results
ATHENa – Database
of Results
ATHENa Database
http://cryptography.gmu.edu/athenadb
ATHENa Database – Result View
• Algorithm parameters
• Design parameters
  – Optimization target
  – Architecture type
  – Datapath width
  – I/O bus widths
  – Availability of source code
• Platform
  – Vendor, Family, Device
• Timing
  – Maximum clock frequency
  – Maximum throughput
• Resource utilization
  – Logic blocks (Slices/LEs/ALUTs)
  – Multipliers/DSP units
• Tools
  – Names & versions
  – Detailed options
• Credits
  – Designers & contact information
ATHENa Database – Compare Feature
Matching fields in grey
Non-matching fields in red and blue
Currently in the Database
Hash Functions in FPGAs
GMU Results for
• 20 hash functions
  (14 Round 2 SHA-3 + 5 Round 3 SHA-3 + SHA-2)
  x 2 variants (256-bit output & 512-bit output)
  x 11 FPGA families
  = 440 combinations
  (440 - not_fitting) = 423 optimized results
ATHENa - Website
ATHENa Website
http://cryptography.gmu.edu/athena/
• Download of ATHENa Tool
• Links to related tools
SHA-3 Competition in FPGAs & ASICs
• Specifications of candidates
• Interface proposals
• RTL source codes
• Testbenches
• ATHENa database of results
• Related papers & presentations
GMU Source Codes
• best non-pipelined high-speed architectures for
  14 Round 2 SHA-3 candidates and SHA-2
• best non-pipelined high-speed architectures for
  5 Round 3 SHA-3 candidates
• Each code supports two variants:
  with 256-bit and 512-bit output
Primary Designers of GMU Codes
Ekawat Homsirikamol
a.k.a “Ice”
Marcin Rogawski
Developed optimized VHDL implementations of
5 Round 3 SHA-3 Candidates + 14 Round 2 SHA-3 candidates + SHA-2
in two variants each (256 & 512-bit output),
for some functions using several alternative architectures
ATHENa Result Replication Files
• Scripts and configuration files sufficient to easily
reproduce all results (without repeating optimizations)
• Automatically created by ATHENa for all
results generated using ATHENa
• Stored in the ATHENa Database
In the same spirit of Reproducible Research as:
• J. Claerbout (Stanford University)
“Electronic documents give reproducible research a new meaning,”
in Proc. 62nd Ann. Int. Meeting of the Soc. of Exploration Geophysics, 1992,
http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92
.....
• Patrick Vandewalle (EPFL), Jelena Kovacevic (CMU), and Martin Vetterli (EPFL)
“Reproducible research in signal processing - what, why, and how,”
IEEE Signal Processing Magazine, May 2009. http://rr.epfl.ch/17/
Benchmarking Goals Facilitated by ATHENa
Comparing multiple:
1. cryptographic algorithms
2. hardware architectures or implementations
of the same cryptographic algorithm
3. hardware platforms from the point of view
of their suitability for the implementation of a given algorithm,
(e.g., choice of an FPGA device or FPGA board)
4. tools and languages in terms of quality
of results they generate (e.g. Verilog vs. VHDL,
Synplicity Synplify Premier vs. Xilinx XST,
ISE v. 13.1 vs. ISE v. 12.3)
Open
Problems
Objective Benchmarking Difficulties
• lack of standard one-size-fits-all interfaces
• stand-alone performance vs. performance as a part
  of a bigger system
• heuristic optimization strategies
• time & effort spent on optimization
or
Why Interface Matters?
• Pin limit
  Total number of I/O ports ≤ total number of FPGA I/O pins
• Support for the maximum throughput
  Time to load the next message block ≤ time to process the previous block
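The two conditions above can be checked numerically before any HDL is written. The sketch below is only an illustration: the port list, the pin count of 200, and the 64-cycle processing time are hypothetical, and loading is assumed to move one bus word per clock cycle on the same clock as the core.

```python
from math import ceil

def pin_limit_ok(port_widths: dict, fpga_io_pins: int) -> bool:
    """Total number of I/O ports must not exceed the number of FPGA I/O pins."""
    return sum(port_widths.values()) <= fpga_io_pins

def load_cycles(block_bits: int, bus_width: int) -> int:
    """Cycles to load one message block, assuming one bus word per clock cycle."""
    return ceil(block_bits / bus_width)

def sustains_max_throughput(block_bits: int, bus_width: int, processing_cycles: int) -> bool:
    """Loading the next block must not take longer than processing the previous one."""
    return load_cycles(block_bits, bus_width) <= processing_cycles

def ports(w: int) -> dict:
    """Hypothetical port list: two w-bit data buses plus clock, reset, and handshake signals."""
    return {"clk": 1, "rst": 1, "din": w, "dout": w,
            "src_ready": 1, "src_read": 1, "dst_ready": 1, "dst_write": 1}

print(pin_limit_ok(ports(64), 200), pin_limit_ok(ports(128), 200))
print(sustains_max_throughput(512, 64, 64), sustains_max_throughput(512, 4, 64))
```

A wide bus helps the second condition but works against the first, so the bus width w has to be chosen with both in mind.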
Interface: Two possible solutions
[Figure: SHA core with signals msg_bitlen, message, end_of_msg, zero_word]
1. Length of the message communicated at the beginning
   + easy to implement passive source circuit
   − area overhead for the counter of message bits
2. Dedicated end of message port
   − more intelligent source circuit required
   + no need for internal message bit counter
SHA Core: Interface & Typical Configuration
[Block diagram: Input FIFO → SHA core → Output FIFO, each with clk and rst inputs; all data buses are w bits wide.
Input FIFO ports: din, dout, full, empty, write, read; signals: ext_idata, fifoin_full, fifoin_write, idata, fifoin_empty, fifoin_read.
SHA core ports: din, dout, src_ready, src_read, dst_ready, dst_write.
Output FIFO ports: din, dout, full, empty, write, read; signals: odata, fifoout_full, fifoout_write, ext_odata, fifoout_empty, fifoout_read.]
• SHA core is an active component; surrounding FIFOs are passive and
widely available
• Input interface is separate from an output interface
• Processing a current block, reading the next block, and storing
a result for the previous message can be all done in parallel
Objective Benchmarking Difficulties
• lack of convenient cost metric in FPGAs
• accuracy of power estimators in ASICs & FPGAs
• verifiability of results
• human factor (skills of designers, order of
  implementations, etc.)
How to measure hardware cost in FPGAs?
1. Stand-alone cryptographic core on an FPGA
Cost of the smallest FPGA that can fit the core?
Unit: USD [FPGA vendors would need to publish MSRP
(manufacturer’s suggested retail price) of their chips]
– not very likely, very volatile metric
or size of the chip in mm² – easy to obtain
2. Part of an FPGA System-on-Chip
Resource utilization described by a vector:
(#CLB slices, #MULs/DSP units, #BRAMs) for Xilinx
(#LEs/ALUTs, #MULs/DSP units, #membits) for Altera
Difficulty of turning the vector into a single number
representing cost
Potential Problems with
Publishing Source Codes
• Export control regulations for cryptography
Check: Bert-Jaap Koops, Crypto Law Survey
http://rechten.uvt.nl/koops/cryptolaw/
• Commercial interests
• Competition with other groups for
grants and publications in the most renowned journals
and conference proceedings
Selected SHA-3 Source Codes Available
in Public Domain
• AIST-RCIS: http://www.rcis.aist.go.jp/special/SASEBO/SHA3-en.html
• University College Cork, Queens University Belfast, RMIT University,
Melbourne, Australia:
http://www.ucc.ie/en/crypto/SHA-3Hardware
• Virginia Tech: http://rijndael.ece.vt.edu/sha3/soucecodes.html
• ETH Zurich:
http://www.iis.ee.ethz.ch/~sha3/
• George Mason University: http://cryptography.gmu.edu/athena
• BLAKE Team: http://www.131002.net/blake/
• Keccak Team: http://keccak.noekeon.org/
How to assure verifiability of results?
Level of openness
Source
files
Testimonies
Netlists
for selected FPGAs
Current situation:
Options of tools
Constraint files
conference/journal
papers
Interfaces
Testbenches
Results
FPGA family/device
Tool names+versions
ATHENa space
Initial Evaluation by High-Level
Synthesis Tools?
Initial number of candidates:
  AES: 15
  eSTREAM: 34
  SHA-3: 51
  Next Contest: ???
• All hardware implementations
  so far developed using RTL HDL
• Growing number of candidates
  in subsequent contests
• Each submission includes
  reference implementation in C
• Results from High-Level
  Synthesis could have a large
  impact in early stages of the
  competitions
• Results and RTL codes from
  previous contests form
  interesting benchmarks for
  High-Level Synthesis tools
Turning Thousands of Results
into a Single Fair Ranking
• Choosing which FPGA families / ASIC libraries should
be included in the comparison
 wide range?
 only most recent?
 vendors with the largest market share?
 wide spectrum of vendors?
• Methods for combining multiple results into single
ranking
[Figure: thousands of results on tens of platforms turned into a single ranking, positions 1 to 5]
Turning Thousands of Results
into Fair Ranking
• Deciding on most important application scenarios
  - Throughput – Cost – Power range,
    from RFIDs to high-speed security gateways
  - Assigning weights to different scenarios
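One possible way to combine per-platform results into a single ranking is sketched below in Python. It is only an illustration under stated assumptions: each candidate's throughput/area ratio is normalized to a reference design on the same platform, and the normalized ratios are combined with a weighted geometric mean. The function name, platform names, weights, and numbers are all hypothetical; the slides leave the choice of method open.

```python
import math

def overall_score(ratios, reference, weights):
    """Weighted geometric mean of throughput/area ratios,
    each normalized to a reference design on the same platform.

    ratios, reference, weights: dicts keyed by platform (or scenario) name.
    """
    total_weight = sum(weights.values())
    log_sum = sum(
        weights[p] * math.log(ratios[p] / reference[p]) for p in weights
    )
    return math.exp(log_sum / total_weight)

# Hypothetical numbers for illustration only (not measured results).
reference = {"platform_a": 2.0, "platform_b": 4.0}
candidates = {
    "candidate_x": {"platform_a": 4.0, "platform_b": 4.0},
    "candidate_y": {"platform_a": 1.0, "platform_b": 8.0},
}
weights = {"platform_a": 1.0, "platform_b": 1.0}  # assumed equal weights

scores = {
    name: overall_score(r, reference, weights)
    for name, r in candidates.items()
}
# Candidates sorted from best to worst overall score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Changing the weights shifts the ranking toward the platforms or scenarios considered most important, which is exactly the decision the slide raises.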
Conclusions
– Contests for cryptographic standards are important
  • Stimulate progress in design and analysis of
    cryptographic algorithms
  • Determine future of cryptography for the next decades
  • Promote cryptology: they are easy for a general
    audience to understand
  • Provide immediate recognition and visibility worldwide
– Digital System Designers can play an important role
in these contests
  • Co-designers of new cryptographic algorithms
  • Evaluators
  • Tool developers
  • Early adopters of new standards
More About GMU Designs & Tools
• Cryptology ePrint Archive - 2010/445 (100+ pages)
  • Detailed hierarchical block diagrams
  • Corresponding formulas for execution time and throughput
• FPL 2010 paper
  • ATHENa features
  • Case studies
• CHES 2011 paper
  • Multiple hardware architectures
  • Comprehensive results
Your Project
• 5 SHA-3 candidates left in the contest + SHA-2
• Given:
  - specification of the function
  - reference implementation in C
  - interface
  - testbench and test vectors
  - GMU implementation of the basic version, including
      - block diagrams
      - ASM charts
      - short description
      - formulas for execution time & throughput
      - source codes
      - results for Xilinx and Altera FPGAs
Your Project
Develop:
  - Block diagram
  - ASM chart
  - Formulas for execution time & throughput
  - Synthesizable code in VHDL
  - Results for multiple families of FPGAs from Xilinx and
    Altera
for selected architectures assigned to you individually
by the instructor
Special Focus on
• New High-Speed Hardware Architectures
based on
  • Pipelining
  • Unrolling
  • Use of Embedded Resources of FPGAs
• New Medium-Speed Hardware Architectures
based on
  • Folding
  • Distributed Memory
• Lightweight implementations
Starting Point: Basic Iterative Architecture
• datapath width = state size
• one clock cycle per one round/step
Block processing time = #R ⋅ T
#R = number of rounds/steps
T = clock period
Currently the most common architecture used to implement SHA-1, SHA-2,
and many other hash functions.
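The formula above can be tried out with a short Python sketch. The throughput expression (block size divided by block processing time) is an assumption that holds for long messages when I/O and finalization overhead are ignored; the function names and the 5 ns clock period are hypothetical, while 64 rounds and a 512-bit message block correspond to SHA-256.

```python
def block_processing_time(num_rounds, clock_period_ns):
    """Block processing time = #R * T for the basic iterative architecture."""
    return num_rounds * clock_period_ns

def throughput_gbps(block_size_bits, processing_time_ns):
    """Bits per nanosecond equals Gbit/s.

    Assumes long messages, ignoring I/O and finalization overhead.
    """
    return block_size_bits / processing_time_ns

# Hypothetical example: 64 rounds, 512-bit block, assumed 5 ns clock period.
t_block = block_processing_time(64, 5.0)  # 320 ns per block
tp = throughput_gbps(512, t_block)        # 1.6 Gbit/s
```

A real design usually needs a few extra clock cycles per block for loading and finalization, so the formulas derived in the project should count those cycles explicitly.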
Unrolling - x2
• datapath width = state size
• one clock cycle per two rounds
Block processing time = (#R/2) ⋅ T'
T < T' < 2⋅T
typically T' ≈ 2⋅T
Area/2 < Area' < 2⋅Area
typically Area' ≈ 2⋅Area
Typically Throughput/Area ratio decreases
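A small Python sketch can make the Throughput/Area trade-off of x2 unrolling concrete. All numbers are hypothetical (baseline of 64 rounds, 5 ns clock, 1000 area units, 512-bit block, and assumed scaling factors T' = 1.9⋅T and Area' = 1.8⋅Area); throughput is taken as block size divided by block processing time, which is an assumption valid for long messages.

```python
def throughput_per_area(block_bits, cycles, clock_period_ns, area):
    """Throughput (Gbit/s) divided by area (arbitrary units)."""
    return block_bits / (cycles * clock_period_ns) / area

# Hypothetical baseline values for the basic iterative architecture.
R, T, AREA, BLOCK = 64, 5.0, 1000.0, 512

basic = throughput_per_area(BLOCK, R, T, AREA)

# x2 unrolling: half as many clock cycles per block,
# with assumed T' = 1.9*T and Area' = 1.8*Area.
unrolled = throughput_per_area(BLOCK, R // 2, 1.9 * T, 1.8 * AREA)

# Values below 1 mean the Throughput/Area ratio got worse.
ratio_change = unrolled / basic
```

With these assumed factors the ratio drops to roughly 0.58 of the baseline: throughput improves only slightly because T' is close to 2⋅T, while the area almost doubles.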
Pipelining - x2-PPL2, x1-PPL2
Horizontal Folding - /2(h)
• datapath width = state size
• two clock cycles per one round/step
Block processing time = (2⋅#R) ⋅ T'
T/2 < T' < T
typically T' ≈ T/2
Area/2 < Area' < Area
Typically Throughput/Area ratio increases
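The effect of folding on the Throughput/Area ratio can be expressed relative to the basic iterative architecture. The Python sketch below is an illustration only: the helper name and the scaling factors (T' = 0.6⋅T, Area' = 0.6⋅Area) are assumptions chosen to lie within the bounds given above, and throughput is taken as inversely proportional to block processing time.

```python
def relative_throughput_per_area(cycle_factor, clock_factor, area_factor):
    """Throughput/area of a modified design relative to the basic iterative one.

    cycle_factor: clock cycles per block relative to the basic design
    clock_factor: T'/T
    area_factor:  Area'/Area
    """
    return 1.0 / (cycle_factor * clock_factor * area_factor)

# Folding by 2: twice as many cycles per block,
# with assumed T' = 0.6*T and Area' = 0.6*Area.
folded = relative_throughput_per_area(2, 0.6, 0.6)  # about 1.39, an improvement

# Break-even case: the ratio is unchanged when (T'/T) * (Area'/Area) = 0.5.
break_even = relative_throughput_per_area(2, 0.5, 1.0)
```

Folding pays off in Throughput/Area only when the product of the clock and area factors falls below 0.5, which is why the achieved T' and Area' should be measured for each target FPGA family.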
Distributed Memory vs. Embedded Memory
• Distributed Memory (inside of Configurable Logic Blocks)
• Embedded Memory (Block RAMs)
All Projects - Organization
• Projects divided into phases
• Deliverables for each phase submitted through
Blackboard at selected checkpoints and evaluated
by the instructor and/or TA
• Feedback provided to students on a best-effort basis
• Final report and codes submitted using Blackboard
at the end of the semester
• 6 informal groups with 3 students in each
Honor Code Rules
• All students are expected to write and debug
their codes individually
• Students are encouraged to help and support each
other in all problems related to the
- operation of the CAD tools
- understanding of an investigated algorithm and
existing implementations
- understanding of the project tasks
Course Objectives
• At the end of this course you should be able to:
• Decompose a digital system into a controller (FSM) and datapath,
and code accordingly
• Code in VHDL for synthesis
• Write VHDL testbenches
• Synthesize and implement digital systems on FPGAs
• This knowledge will come about through homework, exams,
and an extensive project
• The project in particular will help you know VHDL and the FPGA
design flow from beginning to end
Additional Skills Learned in the Project
• Reading & understanding specification of a complex
algorithm
• Design of new hardware architectures based on
existing architectures (datapath & controller)
• Reading, understanding, and modifying existing
VHDL code
• Using embedded resources of modern FPGAs
• Characterizing performance of your codes
for multiple FPGA families