ECE 545 Digital System Design with VHDL Course web page: ECE web page Courses Course web pages ECE 545 http://ece.gmu.edu/coursewebpages/ECE/ECE545/F11/
Kris Gaj
Research and teaching interests:
• reconfigurable computing
• computer arithmetic
• cryptography
• network security
Contact: The Engineering Building, room 3225
Office hours: Thursday, 7:30-8:30 PM, Tuesday, 6:00-7:00 PM, and by appointment

ECE 545 Part of:
MS in Computer Engineering
• One of five core courses (must be passed with B or better)
• Strongly suggested for two specialization areas:
  - Digital Systems Design
  - Microprocessor and Embedded Systems
• Elective course in the remaining specialization areas
MS in Electrical Engineering
• Elective

ECE 545 Part of:
PhD in Electrical and Computer Engineering
• Knowledge tested at the Technical Qualifying Exam (TQE), Topic 2: Digital Design and Computer Organization

ECE 545 Class of Fall 2011
[Pie chart of enrollment by program: MS SE 1, NDG 2, PhD ECE 1, MS CpE 6, MS EE 8]
• 18 students total
• 7 admitted in Fall 2011
• 5 admitted in Spring 2011

Recommended program & specialization
[Diagram mapping "I am interested in…" and "I want to specialize primarily in…" to a recommended program and specialization]
MS CpE, Digital Systems Design:
• CAD Tools & Design Automation
• Hardware Description Languages (VHDL/Verilog)
• FPGAs & Reconfigurable Computing
• ASICs & FPGAs
• Computer Arithmetic
• Front-end ASIC Design (algorithmic down to gate level)
MS EE, Microelectronics/Nanoelectronics:
• VLSI
• Back-end ASIC Design (circuit and mask layout levels)
• Analog & Digital Circuit Design
• Microelectronics
• VLSI Fabrication
• Nanoelectronics
• Semiconductor Devices

Courses by design level
[Chart placing courses along the design levels: algorithmic, register-transfer, gate, transistor, layout, devices]
• ECE 545 Digital System Design with VHDL
• ECE 645 Computer Arithmetic
• ECE 681 VLSI Design for ASICs
• ECE 682 VLSI Test Concepts
• ECE 586 Digital Integrated Circuits
• ECE 680 Physical VLSI Design
• ECE 584 Semiconductor Device Fundamentals
• ECE 684 MOS Device Electronics

CpE Digital Systems Design
Pre-Approved Electives:
• ECE 545 Digital System Design with VHDL
• ECE 645 Computer Arithmetic
• ECE 681 VLSI Design for ASICs
• ECE 682 VLSI Test Concepts
• ECE 586 Digital Integrated Circuits
Professors: K. Gaj, J. Kaps, T. Storey, T.K. Ramesh

CpE Microprocessors and Embedded Systems
Pre-Approved Electives:
• ECE 511 Microprocessors
• ECE 545 Digital System Design with VHDL
• ECE 611 Advanced Microprocessors
• ECE 612 Real-Time Embedded Systems
Professors: J. Kaps, K. Gaj, D. Tabak, C. Sabzevari

Suggested Electives:
• CS 540, 583 (languages, algorithms)
• CS 635 (parallel machines)
• ECE 584, 684, … (technology)
• ECE 511, 611, … (microprocessors)
• ECE 542, 642, 742 (networks)
• ECE 645, 681 (digital design)
• ECE 646, 746, … (applications)
• ECE 548 (sequential mach. theory)

DIGITAL SYSTEMS DESIGN
Concentration advisors: Kris Gaj, Jens-Peter Kaps, Ken Hintz
1. ECE 545 Digital System Design with VHDL: K. Gaj, project, FPGA design with VHDL, Aldec/Mentor Graphics, Xilinx/Altera
2. ECE 645 Computer Arithmetic: K. Gaj, project, FPGA design with VHDL, Aldec/Mentor Graphics, Xilinx/Altera
3. ECE 681 VLSI Design for ASICs: T.K. Ramesh, project/lab, front-end and back-end ASIC design with Synopsys tools
4. ECE 586 Digital Integrated Circuits: D. Ioannou, R. Mulpuri
5. ECE 682 VLSI Test Concepts: T. Storey

Grading Scheme
• Homework - 10%
• Project - 40%
• Midterm Exam - 20%
• Final Exam - 30%

Midterm exam 1
• 2 hours 30 minutes
• in class
• design-oriented
• open-books, open-notes
• practice exams available on the web
Tentative date: Thursday, October 27th

Final exam
• 2 hours 45 minutes
• in class
• design-oriented
• open-books, open-notes
• practice exams available on the web
Date: Monday, December 15, 4:30-7:15pm

Textbooks
Required Textbook
Pong P. Chu, RTL Hardware Design Using VHDL, Wiley-Interscience, 2006.
Supplementary Textbook: Basics Refresher
Stephen Brown and Zvonko Vranesic, Fundamentals of Digital Logic with VHDL Design, McGraw-Hill, 3rd or 2nd Edition

Supplementary Textbook: Advanced
Hubert Kaeslin, Digital Integrated Circuit Design: From VLSI Architectures to CMOS Fabrication, Cambridge University Press, 1st Edition, 2008. Used in ECE 681 “VLSI Design for ASICs”

Technology & Tools

What is an FPGA?
[Figure: FPGA structure showing Configurable Logic Blocks, Block RAMs, and I/O Blocks]

Two competing implementation approaches
ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves

FPGAs vs. ASICs
ASICs:
• High performance
• Low power
• Low cost (but only in high volumes)
FPGAs:
• Off-the-shelf
• Low development costs
• Short time to the market
• Reconfigurability

FPGA Design process (1)
Design and implement a simple unit permitting to speed up encryption with RC5-similar cipher with fixed key set on 8031 microcontroller. Unlike in the experiment 5, this time your unit has to be able to perform an encryption algorithm by itself, executing 32 rounds…..
Design flow steps:
• Specification / Pseudocode
• On-paper hardware design (Block diagram & ASM chart)
• VHDL description (Your Source Files)
• Functional simulation
• Synthesis
• Post-synthesis simulation

Example source file:

    library IEEE;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    entity RC5_core is
        port(
            clock, reset, encr_decr: in std_logic;
            data_input: in std_logic_vector(31 downto 0);
            data_output: out std_logic_vector(31 downto 0);
            out_full: in std_logic;
            key_input: in std_logic_vector(31 downto 0);
            key_read: out std_logic
        );
    end RC5_core;

FPGA Design process (2)
• Implementation
• Timing simulation
• Configuration
• On chip testing

Simulation Tools

FPGA Synthesis Tools

Logic Synthesis
VHDL description -> Circuit netlist

    architecture MLU_DATAFLOW of MLU is
        signal A1: STD_LOGIC;
        signal B1: STD_LOGIC;
        signal Y1: STD_LOGIC;
        signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
    begin
        -- optional inversion of the inputs and of the output
        A1 <= A when (NEG_A='0') else not A;
        B1 <= B when (NEG_B='0') else not B;
        Y  <= Y1 when (NEG_Y='0') else not Y1;

        -- four candidate logic functions
        MUX_0 <= A1 and B1;
        MUX_1 <= A1 or B1;
        MUX_2 <= A1 xor B1;
        MUX_3 <= A1 xnor B1;

        -- 4-to-1 multiplexer controlled by L1 and L0
        with (L1 & L0) select
            Y1 <= MUX_0 when "00",
                  MUX_1 when "01",
                  MUX_2 when "10",
                  MUX_3 when others;
    end MLU_DATAFLOW;

FPGA Implementation
• After synthesis, the entire implementation process is performed by FPGA vendor tools

Design Process control from Active-HDL

Xilinx FPGA Tools: ECE Labs
                  Aldec Active-HDL Design Flow            Xilinx ISE Design Flow
simulation        Aldec Active-HDL (IDE)                  Mentor Graphics ModelSim SE
synthesis         Xilinx XST & Synopsys Synplify Premier  Xilinx XST & Synopsys Synplify Premier
implementation    Xilinx ISE Design Suite                 Xilinx ISE Design Suite (IDE)

Xilinx FPGA Tools: Home
                  Aldec Active-HDL Design Flow            Xilinx ISE Design Flow
simulation        Aldec Active-HDL Student Edition (IDE)  Mentor Graphics ModelSim PE Student Edition
synthesis         Xilinx XST (restricted)                 Xilinx XST (restricted)
implementation    Xilinx ISE WebPACK (restricted)         Xilinx ISE WebPACK (IDE) (restricted)

Altera FPGA Tools: ECE Labs (Altera Design Flow)
• simulation: Mentor Graphics ModelSim-Altera
• synthesis & implementation: Altera Quartus II Subscription Edition

Altera FPGA Tools: Home (Altera Design Flow)
• simulation: Mentor Graphics ModelSim-Altera Starter (restricted)
• synthesis & implementation: Altera Quartus II Web Edition (restricted)

Project
• semester-long
• related to the research project conducted by the Cryptographic Engineering Research Group (CERG) at GMU
• supporting NIST (National Institute of Standards and Technology) in the evaluation of candidates for a new cryptographic standard

CERG @ GMU
http://cryptography.gmu.edu/
• 10 PhD students
• 8 MS students
• co-advised by Kris Gaj & Jens-Peter Kaps

Collaborators
Joint 3-year project (2010-2012) on benchmarking cryptographic algorithms in software and hardware, sponsored by [sponsor logo]
• software: Daniel J. Bernstein, University of Illinois at Chicago
• FPGAs: Jens-Peter Kaps, George Mason University
• FPGAs/ASICs: Patrick Schaumont, Virginia Tech
• ASICs: Leyla Nazhand-Ali, Virginia Tech

Background

Outline
• Crypto 101
• Cryptographic standard contests
• Progress in evaluation methods: AES, eSTREAM, SHA-3
• Benchmarking tools for software and FPGAs
• Open problems

Crypto 101

Cryptography is Everywhere
• Buying a book on-line
• Teleconferencing over Intranets
• Withdrawing cash from ATM
• Backing up files on remote server

Cryptographic Transformations Most Often Implemented in Practice
• Secret-Key Ciphers (Block Ciphers, Stream Ciphers): encryption
• Hash Functions: message & user authentication
• Public-Key Cryptosystems: digital signatures, key agreement, key exchange

Digital Signature
[Figure: a HANDWRITTEN signature next to a DIGITAL signature, shown as A6E3891F2939E38C745B 25289896CA345BEF5349 245CBA653448E349EA47]
Main Goals:
• unique identification
• proof of agreement to the contents of the document

Handwritten and Digital Signatures: Common Features
(apply to both the handwritten signature and the digital signature)
1. Unique
2. Impossible to be forged
3. Impossible to be denied by the author
4. Easy to verify by an independent judge
5. Easy to generate

Handwritten and Digital Signatures: Differences
     Handwritten signature                       Digital signature
6.   Associated physically with the document     Can be stored and transmitted independently of the document
7.   Almost identical for all documents          Function of the document
8.   Usually at the last page                    Covers the entire document

Hash Functions in Digital Signature Schemes
[Diagram: Alice hashes the Message and applies the public key cipher with Alice’s private key to the hash value to produce the Signature. Bob hashes the received Message, applies the public key cipher with Alice’s public key to the received Signature, and compares Hash value 1 with Hash value 2 (yes/no).]

Hash Function
• A hash function h maps a message m of arbitrary length to a hash value h(m) of fixed length
• Collision Resistance: It is computationally infeasible to find two different messages m and m’ such that h(m)=h(m’)

Cryptographic Standard Contests

Cryptographic Standards Before 1997
[Timeline, 1970 to 2010; years marked: 1977, 1993, 1995, 1999, 2003, 2005]
• Secret-Key Block Ciphers: DES (Data Encryption Standard), IBM & NSA; Triple DES
• Hash Functions: SHA, SHA-1 (Secure Hash Algorithm), SHA-2; NSA

Why a Contest for a Cryptographic Standard?
• Avoid back-door theories
• Speed up the acceptance of the standard
• Stimulate non-classified research on methods of designing a specific cryptographic transformation
• Focus the effort of a relatively small cryptographic community

Cryptographic Standard Contests
[Timeline, 1996 to 2013]
• AES: IX.1997 to X.2000, 15 block ciphers, 1 winner
• NESSIE: I.2000 to XII.2002
• CRYPTREC
• eSTREAM: XI.2004 to V.2008, 34 stream ciphers, 4 HW winners + 4 SW winners
• SHA-3: X.2007 to XII.2012, 51 hash functions, 1 winner

Cryptographic Contests - Evaluation Criteria
• Security
• Software Efficiency: μProcessors, μControllers
• Hardware Efficiency: FPGAs, ASICs
• Flexibility
• Simplicity
• Licensing

Specific Challenges of Evaluations in Cryptographic Contests
• Very wide range of possible applications, and as a result a very wide range of performance and cost targets
  - throughput: single Mbits/s to hundreds of Gbits/s
  - cost: single cents to thousands of dollars
• Winner in use for the next 20-30 years, implemented using technologies not in existence today
• Large number of candidates
• Limited time for evaluation
• Only one winner and the results are final

Mitigating Circumstances
• Security is a primary criterion
• Performance of competing algorithms tends to vary significantly (sometimes by as much as 500 times)
• Only relatively large differences in performance matter (typically at least 20%)
• Multiple groups independently implement the same algorithms (catching mistakes, comparing best results, etc.)
• Second best may be good enough

AES Contest 1997-2000

Rules of the Contest
Each team submits:
• Detailed cipher specification
• Justification of design decisions
• Source code in C
• Source code in Java
• Tentative results of cryptanalysis
• Test vectors

AES: Candidate Algorithms
[World map of the candidates by country]
• Canada: CAST-256, Deal
• USA: Mars, RC6, Twofish, Safer+, HPC
• Costa Rica: Frog
• Germany: Magenta
• Belgium: Rijndael
• France: DFC
• Korea: Crypton
• Japan: E2
• Israel, UK, Norway: Serpent
• Australia: LOKI97

AES Contest Timeline
• June 1998: 15 Candidates (CAST-256, Crypton, Deal, DFC, E2, Frog, HPC, LOKI97, Magenta, Mars, RC6, Rijndael, Safer+, Serpent, Twofish)
• Round 1 criteria: Security, Software efficiency
• August 1999: 5 final candidates: Mars, RC6, Twofish (USA); Rijndael, Serpent (Europe)
• Round 2 criteria: Security, Software efficiency, Hardware efficiency
• October 2000: 1 winner: Rijndael (Belgium)

NIST Report: Security & Simplicity
[Chart: Security (Adequate to High) vs. Simplicity (Complex to Simple) for MARS, Twofish, Serpent, Rijndael, and RC6]

Efficiency in software: NIST-specified platform
200 MHz Pentium Pro, Borland C++
[Bar chart: Throughput [Mbits/s] for 128-bit, 192-bit, and 256-bit keys; algorithms shown: Rijndael, RC6, Twofish, Mars, Serpent]

NIST Report: Software Efficiency
Encryption and Decryption Speed
          32-bit processors          64-bit processors     DSPs
high      RC6                        Rijndael, Twofish     Rijndael, Twofish
medium    Rijndael, Mars, Twofish    Mars, RC6             Mars, RC6
low       Serpent                    Serpent               Serpent

Efficiency in FPGAs: Speed
Xilinx Virtex XCV-1000
[Bar chart: Throughput [Mbit/s] reported by George Mason University, University of Southern California, and Worcester Polytechnic Institute for Serpent x8, Rijndael, Twofish, Serpent x1, RC6, and Mars]

Efficiency in ASICs: Speed
MOSIS 0.5μm, NSA Group
[Bar chart: Throughput [Mbit/s] with 128-bit key scheduling and with 3-in-1 (128, 192, 256 bit) key scheduling for Rijndael, Serpent x1, Twofish, RC6, and Mars]

Lessons Learned
• Results for ASICs matched results for FPGAs very well, and both were very different from software
• Serpent fastest in hardware, slowest in software
[Charts: FPGA (GMU+USC, Xilinx Virtex XCV-1000) vs. ASIC (NSA Team, 0.5μm MOSIS)]

Lessons Learned
Hardware results matter!
[Charts: final round of the AES Contest, 2000; speed in FPGAs (GMU results) vs. votes at the AES 3 conference]

Limitations of the AES Evaluation
• Optimization for maximum throughput
• Single high-speed architecture per candidate
• No use of embedded resources of FPGAs (Block RAMs, dedicated multipliers)
• Single FPGA family from a single vendor: Xilinx Virtex

eSTREAM Contest 2004-2008

eSTREAM - Contest for a new stream cipher standard
PROFILE 1 (SW)
• Stream cipher suitable for software implementations optimized for high speed
• Key size: 128 bits
• Initialization vector: 64 bits or 128 bits
PROFILE 2 (HW)
• Stream cipher suitable for hardware implementations with limited memory, number of gates, or power supply
• Key size: 80 bits
• Initialization vector: 32 bits or 64 bits

eSTREAM Contest Timeline
              PROFILE 1 (SW)            PROFILE 2 (HW)
April 2005    23 Phase 1 Candidates     25 Phase 1 Candidates
July 2006     13 Phase 2 Candidates     20 Phase 2 Candidates
April 2007    8 Phase 3 Candidates      8 Phase 3 Candidates
May 2008      4 winners                 4 winners
Profile 1 winners: HC-128, Rabbit, Salsa20, SOSEMANUK
Profile 2 winners: Grain v1, Mickey v2, Trivium, F-FCSR-H v2

Lessons Learned
Very large differences among 8 leading candidates:
• ~30 x in terms of area (Grain v1 vs. Edon80)
• ~500 x in terms of the throughput to area ratio (Trivium (x64) vs. Pomaranch)

Hardware Efficiency in FPGAs
Xilinx Spartan 3, GMU SASC 2007
[Scatter plot: Throughput [Mbit/s] vs. Area [CLB slices] for Trivium, Grain, Mickey-128, and AES, with parallelization factors from x1 to x64]

ASIC Evaluations
• Two major projects
  - T. Good, M. Benaissa, University of Sheffield, UK (Phases 1-3), 0.13μm CMOS
  - F.K. Gürkaynak, et al., ETH Zurich, Switzerland (Phase 1), 0.25μm CMOS
• Two representative applications
  - WLAN @ 10 Mbits/s
  - RFID / WSN @ 100 kHz clock

eSTREAM ASIC Evaluations
New compared to AES:
• Post-layout results, followed by
• Actually fabricated ASIC chips (0.18μm CMOS)
• More complex performance measures
  - Power x Area x Time
• New types of analyses
  - Power x Latency vs. Area
  - Throughput/Area vs. Energy per bit

SHA-3 Contest 2007-2012

NIST SHA-3 Contest - Timeline
• Oct. 2008: Round 1, 51 candidates
• July 2009: Round 2, 14 candidates
• Dec. 2010: Round 3, 5 candidates
• Mid 2012: 1 winner

SHA-3 Round 2

Features of the SHA-3 Round 2 Evaluation
• Optimization for maximum throughput to area ratio
• 10 FPGA families from two major vendors: Xilinx and Altera
But still…
• Single high-speed architecture per candidate
• No use of embedded resources of FPGAs (Block RAMs, dedicated multipliers, DSP units)

Throughput vs. Area Normalized to Results for SHA-256 and Averaged over 11 FPGA Families: 256-bit variants
[Plot]

Throughput vs. Area Normalized to Results for SHA-512 and Averaged over 11 FPGA Families: 512-bit variants
[Plot]

Performance Metrics (Primary and Secondary)
1. Throughput (single message)
2. Area
3. Throughput / Area
3. Hash Time for Short Messages (up to 1000 bits)

Overall Normalized Throughput: 256-bit variants of algorithms
Normalized to SHA-256, Averaged over 10 FPGA families
[Bar chart; bar values in descending order: 7.47, 7.21, 5.40, 3.83, 3.46, 2.98, 2.21, 1.82, 1.74, 1.70, 1.69, 1.66, 1.51, 0.98]

[Summary table for 256-bit variants and 512-bit variants, each with columns Thr/Area, Thr, Area, Short msg.; one row per candidate:]
BLAKE, BMW, CubeHash, ECHO, Fugue, Groestl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, Skein

SHA-3 Round 3

SHA-3 Contest Finalists

New in Round 3
• Multiple Hardware Architectures
• Effect of the Use of Embedded Resources
• Low-Area Implementations

SHA-3 Multiple High-Speed Architectures

Study of Multiple Architectures
• Analysis of multiple hardware architectures for each finalist, based on known design techniques, such as
  - Folding
  - Unrolling
  - Pipelining
• Identifying the best architecture in terms of the throughput to area ratio
• Analyzing the flexibility of all algorithms in terms of the speed vs. area trade-offs

BLAKE-256 in Virtex 5
[Plot] Notation:
• x1: basic iterative architecture
• /k(h): horizontal folding by a factor of k
• /k(v): vertical folding by a factor of k
• xk: unrolling by a factor of k
• xk-PPLn: unrolling by a factor of k with n pipeline stages

256-bit variants in Virtex 5
[Plot]

512-bit variants in Virtex 5
[Plot]

256-bit variants in Stratix III
[Plot]

512-bit variants in Stratix III
[Plot]

SHA-3 Lightweight Implementations

Study of Lightweight Implementations in FPGAs
• Two major projects
  - J.-P. Kaps, et al., George Mason University, USA
  - F.-X. Standaert, UCL Crypto Group, Belgium
• Target:
  - Low-cost FPGAs (Spartan 3, Spartan 6, etc.) for stand-alone implementations
  - High-performance FPGAs (e.g., Virtex 6) for system-on-chip implementations

Typical Assumptions: GMU Group

Implementation Results
Xilinx Spartan 3, ISE 12.3, after P&R, Optimized using ATHENa
[Chart]

SHA-3 Implementations Based on Embedded Resources

Implementations Based on the Use of Embedded Resources in FPGAs
[Figure: FPGA fabric with Logic blocks, Multipliers/DSP units, and RAM blocks, summarized as (#Logic blocks, #Multipliers/DSP units, #RAM_blocks)]
Graphics based on The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

Resource Utilization Vector
(#Logic blocks, #Multipliers/DSP units, #RAM blocks)
Xilinx
• Spartan 3: (#CLB_slices, #multipliers, #Block_RAMs)
• Virtex 5: (#CLB_slices, #DSP units, #Block_RAMs)
Altera
• Cyclone III: (#LEs, #multipliers, #RAM_bits)
• Stratix III: (#ALUTs, #DSP units, #RAM_bits)

Fitting a Single Core in a Smaller FPGA Device
BLAKE in Altera Cyclone II, resource vectors in (LEs, MULs, bits):
• EP2C20: (6862, 0, 0)
• EP2C5: (3129, 0, 12k)

Fitting a Larger Number of Identical Cores in the same FPGA Device
BLAKE in Virtex 5 XC5VSX50:
• 3 BLAKE cores: Cumulative Throughput 6.8 Gbit/s
• 8 BLAKE cores: Cumulative Throughput 20.6 Gbit/s

Cumulative Throughput for the Largest Device of a Given Family
[Charts: Basic architectures; Best architectures]

SHA-3 in ASICs

Virginia Tech ASIC
• IBM MOSIS 130nm process
• The first ASIC implementing 5 final SHA-3 candidates
• Taped-out in Feb. 2011, successfully tested this Summer
• Multiple chips made available to other research labs

FPGA Evaluations - Summary
                                 AES           eSTREAM                  SHA-3
Multiple FPGA families           No            No                       Yes
Multiple architectures           No            Yes                      Yes
Use of embedded resources        No            No                       Yes
Primary optimization target      Throughput    Area, Throughput/Area    Throughput/Area
Experimental results             No            No                       Yes
Availability of source codes     No            No                       Yes
Specialized tools                No            No                       Yes

ASIC Evaluations - Summary
                                 AES           eSTREAM                  SHA-3
Multiple processes/libraries     No            No                       Yes
Multiple architectures           No            Yes                      Yes
Primary optimization target      Throughput    Power x Area x Time      Throughput/Area
Post-layout results              No            Yes                      Yes
Experimental results             No            Yes                      Yes
Availability of source codes     No            No                       Yes
Specialized tools                No            No                       No

Benchmarking Tools

Tools for Benchmarking Implementations of Cryptography
• Software: eBACS, D. Bernstein (UIC), T. Lange (TUE), 2006-present
• FPGAs: ATHENa, K. Gaj, J. Kaps, et al. (GMU), 2009-present
• ASICs: ?

Benchmarking in Software: eBACS

eBACS: ECRYPT Benchmarking of Cryptographic Systems: http://bench.cr.yp.to/
SUPERCOP: toolkit developed by D. Bernstein and T. Lange for measuring performance of cryptographic software
• measurements on multiple machines (currently over 90)
• each implementation is recompiled multiple times (currently over 1600 times) with various compiler options
• time measured in clock cycles/byte for multiple input/output sizes
• median, lower quartile (25th percentile), and upper quartile (75th percentile) reported
• standardized function arguments (common API)

SUPERCOP Extension for Microcontrollers: XBX, 2009-present
Allows on-board timing measurements
Developers:
• Christian Wenzel-Benner, ITK Engineering AG, Germany
• Jens Gräf, LiNetCo GmbH, Heiger, Germany
Supports at least the following microcontrollers:
• 8-bit: Atmel ATmega1284P (AVR)
• 32-bit: TI AR7 (MIPS), Atmel AT91RM9200 (ARM 920T), Intel XScale IXP420 (ARM v5TE), Cortex-M3 (ARM)

Benchmarking in FPGAs: ATHENa

ATHENa: Automated Tool for Hardware EvaluatioN
http://cryptography.gmu.edu/athena
Open-source benchmarking environment, written in Perl, aimed at AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms.
The most recent version, 0.6.2, was released in June 2011. Full features in ATHENa 1.0, to be released in 2012.

Why Athena?
"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess known for her superb logic and intellect.
Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest.”
from "Athena, Greek Goddess of Wisdom and Craftsmanship"

Basic Dataflow of ATHENa
[Diagram with steps numbered 0 to 8, connecting the Designer, the User, the ATHENa Server, and FPGA Synthesis and Implementation. Labeled flows: Interfaces + Testbenches; HDL + FPGA Tools; Download scripts and configuration files; HDL + scripts + configuration files; Result Summary + Database Entries; Database Entries; Database query; Ranking of designs]

Three Components of the ATHENa Environment
• ATHENa Tool
• ATHENa Database of Results
• ATHENa Website

ATHENa - Tool
[Diagram: inputs are configuration files, constraint files, testbench, and synthesizable source files; outputs are a result summary (user-friendly) and database entries (machine-friendly)]

ATHENa Major Features (1)
• synthesis, implementation, and timing analysis in batch mode
• support for devices and tools of multiple FPGA vendors
• generation of results for multiple families of FPGAs of a given vendor
• automated choice of a best-matching device within a given family

ATHENa Major Features (2)
• automated verification of designs through simulation in batch mode
• support for multi-core processing
• automated extraction and tabulation of results
• several optimization strategies aimed at finding
  - optimum options of tools
  - best target clock frequency
  - best starting point of placement

Relative Improvement of Results from Using ATHENa
Virtex 5, 512-bit Variants of Hash Functions
[Bar chart: ratios of Area, Throughput, and Throughput/Area obtained using ATHENa suggested options vs. default options of FPGA tools]

Other (Somewhat) Similar Tools
• ExploreAhead (part of PlanAhead)
• Design Space Explorer (DSE)
• Boldport Flow
• EDAx10 Cloud Platform

Distinguishing Features of ATHENa
• Support for multiple tools from multiple vendors
• Optimization strategies aimed at the best possible performance rather than design closure
• Extraction and presentation of results
• Seamless integration with the ATHENa database of results

ATHENa: Database of Results

ATHENa Database
http://cryptography.gmu.edu/athenadb

ATHENa Database: Result View
• Algorithm parameters
• Design parameters
  - Optimization target
  - Architecture type
  - Datapath width
  - I/O bus widths
  - Availability of source code
• Platform
  - Vendor, Family, Device
• Timing
  - Maximum clock frequency
  - Maximum throughput
• Resource utilization
  - Logic blocks (Slices/LEs/ALUTs)
  - Multipliers/DSP units
• Tools
  - Names & versions
  - Detailed options
• Credits
  - Designers & contact information

ATHENa Database: Compare Feature
• Matching fields in grey
• Non-matching fields in red and blue

Currently in the Database
Hash Functions in FPGAs, GMU Results for:
  20 hash functions (14 Round 2 SHA-3 + 5 Round 3 SHA-3 + SHA-2)
  x 2 variants (256-bit output & 512-bit output)
  x 11 FPGA families
  = 440 combinations
  (440 - not_fitting) = 423 optimized results

ATHENa - Website

ATHENa Website
http://cryptography.gmu.edu/athena/
• Download of ATHENa Tool
• Links to related tools
SHA-3 Competition in FPGAs & ASICs
• Specifications of candidates
• Interface proposals
• RTL source codes
• Testbenches
• ATHENa database of results
• Related papers & presentations

GMU Source Codes
• best non-pipelined high-speed architectures for 14 Round 2 SHA-3 candidates and SHA-2
• best non-pipelined high-speed architectures for 5 Round 3 SHA-3 candidates
• Each code supports two variants: with 256-bit and 512-bit output

Primary Designers of GMU Codes
• Ekawat Homsirikamol, a.k.a. “Ice”
• Marcin Rogawski
Developed optimized VHDL implementations of 5 Round 3 SHA-3 candidates + 14 Round 2 SHA-3 candidates + SHA-2, in two variants each (256 & 512-bit output), for some functions using several alternative architectures

ATHENa Result Replication Files
• Scripts and configuration files sufficient to easily reproduce all results (without repeating optimizations)
• Automatically created by ATHENa for all results generated using ATHENa
• Stored in the ATHENa Database
In the same spirit of Reproducible Research as:
• J. Claerbout (Stanford University), “Electronic documents give reproducible research a new meaning,” in Proc. 62nd Ann. Int. Meeting of the Soc. of Exploration Geophysics, 1992, http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92
.....
• Patrick Vandewalle (EPFL), Jelena Kovacevic (CMU), and Martin Vetterli (EPFL), “Reproducible research in signal processing - what, why, and how,” IEEE Signal Processing Magazine, May 2009. http://rr.epfl.ch/17/

Benchmarking Goals Facilitated by ATHENa
Comparing multiple:
1. cryptographic algorithms
2. hardware architectures or implementations of the same cryptographic algorithm
3. hardware platforms from the point of view of their suitability for the implementation of a given algorithm (e.g., choice of an FPGA device or FPGA board)
4. tools and languages in terms of quality of results they generate (e.g., Verilog vs. VHDL, Synplicity Synplify Premier vs. Xilinx XST, ISE v. 13.1 vs. ISE v. 12.3)

Open Problems

Objective Benchmarking Difficulties
• lack of standard one-fits-all interfaces
• stand-alone performance vs. performance as a part of a bigger system
• heuristic optimization strategies
• time & effort spent on optimization

Why Interface Matters?
• Pin limit
  Total number of i/o ports ≤ Total number of an FPGA i/o pins
• Support for the maximum throughput
  Time to load the next message block ≤ Time to process previous block

Interface: Two possible solutions
[Diagram: SHA core with ports msg_bitlen, message, end_of_msg, zero_word]
Length of the message communicated at the beginning
  + easy to implement passive source circuit
  − area overhead for the counter of message bits
Dedicated end of message port
  − more intelligent source circuit required
  + no need for internal message bit counter

SHA Core: Interface & Typical Configuration
[Block diagram: Input FIFO -> SHA core -> Output FIFO, each with clk and rst; data buses are w bits wide.
 Input FIFO ports: din, dout, full, empty, write, read; external signals ext_idata, fifoin_full, fifoin_write; signals to the core idata, fifoin_empty, fifoin_read.
 SHA core ports: din, src_ready, src_read, dout, dst_ready, dst_write.
 Output FIFO ports: din, dout, full, empty, write, read; signals from the core odata, fifoout_full, fifoout_write; external signals ext_odata, fifoout_empty, fifoout_read.]
• SHA core is an active component; surrounding FIFOs are passive and widely available
• Input interface is separate from the output interface
• Processing the current block, reading the next block, and storing a result for the previous message can all be done in parallel

Objective Benchmarking Difficulties
• lack of convenient cost metric in FPGAs
• accuracy of power estimators in ASICs & FPGAs
• verifiability of results
• human factor (skills of designers, order of implementations, etc.)

How to measure hardware cost in FPGAs?
1. Stand-alone cryptographic core on an FPGA
   Cost of the smallest FPGA that can fit the core?
   - Unit: USD [FPGA vendors would need to publish MSRP (manufacturer’s suggested retail price) of their chips]: not very likely, very volatile metric
   - or size of the chip in mm2: easy to obtain
2. Part of an FPGA System On-Chip
   Resource utilization described by a vector:
   - (#CLB slices, #MULs/DSP units, #BRAMs) for Xilinx
   - (#LEs/ALUTs, #MULs/DSP units, #membits) for Altera
   Difficulty of turning vector into a single number representing cost

Potential Problems with Publishing Source Codes
• Export control regulations for cryptography
  Check: Bert-Jaap Koops, Crypto Law Survey, http://rechten.uvt.nl/koops/cryptolaw/
• Commercial interests
• Competition with other groups for grants and publications in the most renowned journals and conference proceedings

Selected SHA-3 Source Codes Available in Public Domain
• AIST-RCIS: http://www.rcis.aist.go.jp/special/SASEBO/SHA3-en.html
• University College Cork, Queens University Belfast, RMIT University, Melbourne, Australia: http://www.ucc.ie/en/crypto/SHA-3Hardware
• Virginia Tech: http://rijndael.ece.vt.edu/sha3/soucecodes.html
• ETH Zurich: http://www.iis.ee.ethz.ch/~sha3/
• George Mason University: http://cryptography.gmu.edu/athena
• BLAKE Team: http://www.131002.net/blake/
• Keccak Team: http://keccak.noekeon.org/

How to assure verifiability of results?
[Diagram: level of openness, contrasting the current situation (conference/journal papers) with the ATHENa space. Items shown: Results; FPGA family/device; Tool names+versions; Interfaces; Testbenches; Options of tools; Constraint files; Netlists for selected FPGAs; Source files; Testimonies]

Initial Evaluation by High-Level Synthesis Tools?
Initial number of candidates: AES 15, eSTREAM 34, SHA-3 51, Next Contest ???
• All hardware implementations so far developed using RTL HDL
• Growing number of candidates in subsequent contests
• Each submission includes reference implementation in C
• Results from High-Level Synthesis could have a large impact in early stages of the competitions
• Results and RTL codes from previous contests form interesting benchmarks for High-Level Synthesis tools

Turning Thousands of Results into a Single Fair Ranking
• Choosing which FPGA families / ASIC libraries should be included in the comparison
  - wide range?
  - only most recent?
  - vendors with the largest market share?
  - wide spectrum of vendors?
• Methods for combining multiple results into a single ranking
  [Illustration: thousands of results on tens of platforms reduced to a ranking of positions 1 to 5]

Turning Thousands of Results into Fair Ranking
• Deciding on most important application scenarios
  - Throughput, Cost, Power
  - range from RFIDs to High-speed security gateways
• Assigning weights to different scenarios

Conclusions
– Contests for cryptographic standards are important
  • Stimulate progress in design and analysis of cryptographic algorithms
  • Determine future of cryptography for the next decades
  • Promote cryptology: Are easy to understand by general audience
  • Provide immediate recognition and visibility worldwide.
– Digital System Designers can play an important role in these contests
• Co-designers of new cryptographic algorithms
• Evaluators
• Tool developers
• Early adopters of new standards

More About GMU Designs & Tools
• Cryptology e-Print Archive - 2010/445 (100+ pages)
  • Detailed hierarchical block diagrams
  • Corresponding formulas for execution time and throughput
• FPL 2010 paper
  • ATHENa features
  • Case studies
• CHES 2011 paper
  • Multiple hardware architectures
  • Comprehensive results

Your Project
• 5 SHA-3 candidates left in the contest + SHA-2
• Given:
  • specification of the function
  • reference implementation in C
  • interface
  • testbench and test vectors
  • GMU implementation of the basic version, including: block diagrams, ASM charts, short description, formulas for execution time & throughput, source codes, results for Xilinx and Altera FPGAs
• Develop (for selected architectures assigned to you individually by the instructor):
  • Block diagram
  • ASM chart
  • Formulas for execution time & throughput
  • Synthesizable code in VHDL
  • Results for multiple families of FPGAs from Xilinx and Altera

Special Focus on
• New High-Speed Hardware Architectures based on
  • Pipelining
  • Unrolling
• Use of Embedded Resources of FPGAs
• New Medium-Speed Hardware Architectures based on
  • Folding
  • Distributed Memory
• Lightweight implementations

Starting Point: Basic Iterative Architecture
• datapath width = state size
• one clock cycle per one round/step
Block processing time = #R ⋅ T
#R = number of rounds/steps
T = clock period
Currently, most common architecture used to implement SHA-1, SHA-2, and many other hash functions.
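Worked example: Basic Iterative Architecture
A minimal numeric sketch of the formula Block processing time = #R ⋅ T, using SHA-256 (64 rounds, 512-bit message block). The 5 ns clock period is a hypothetical value, not a measured result, and the throughput model (block size divided by block processing time) is an assumption that ignores any extra cycles for loading or finalization. The function names are illustrative only.

```python
# Worked example of the slide formula: block processing time = #R * T.
# SHA-256 facts: 64 rounds per block, 512-bit message block.
# The clock period below is a HYPOTHETICAL value chosen for illustration.


def block_processing_time_ns(num_rounds: int, clock_period_ns: float) -> float:
    """Block processing time = #R * T for the basic iterative architecture."""
    return num_rounds * clock_period_ns


def throughput_gbit_per_s(block_size_bits: int, block_time_ns: float) -> float:
    """Assumed model: throughput = block size / block processing time.

    Bits per nanosecond are numerically equal to Gbit/s.
    """
    return block_size_bits / block_time_ns


rounds = 64        # #R for SHA-256
block_bits = 512   # SHA-256 message block size
clock_ns = 5.0     # T, hypothetical (corresponds to 200 MHz)

t_block = block_processing_time_ns(rounds, clock_ns)
tp = throughput_gbit_per_s(block_bits, t_block)

print(f"Block processing time: {t_block:.1f} ns")
print(f"Throughput: {tp:.2f} Gbit/s")
```

With these assumed numbers, one block takes 64 ⋅ 5 ns = 320 ns. Substituting a clock period reported by the FPGA tools gives the corresponding estimate for a real design.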
Unrolling - x2
• datapath width = state size
• one clock cycle per two rounds
Block processing time = (#R/2) ⋅ T'
T < T' < 2⋅T, typically T' ≈ 2⋅T
Area/2 < Area' < 2⋅Area, typically Area' ≈ 2⋅Area
Typically Throughput/Area ratio decreases

Pipelining - x2-PPL2, x1-PPL2

Horizontal Folding - /2(h)
• datapath width = state size
• two clock cycles per one round/step
Block processing time = (2⋅#R) ⋅ T'
T/2 < T' < T, typically T' ≈ T/2
Area/2 < Area' < Area
Typically Throughput/Area ratio increases

Distributed Memory vs. Embedded Memory
• Distributed Memory (inside of Configurable Logic Blocks)
• Embedded Memory (Block RAMs)

All Projects - Organization
• Projects divided into phases
• Deliverables for each phase submitted through Blackboard at selected checkpoints and evaluated by the instructor and/or TA
• Feedback provided to students on a best effort basis
• Final report and codes submitted using Blackboard at the end of the semester
• 6 informal groups with 3 students in each

Honor Code Rules
• All students are expected to write and debug their codes individually
• Students are encouraged to help and support each other in all problems related to the
  - operation of the CAD tools
  - understanding of an investigated algorithm and existing implementations
  - understanding of the project tasks

Course Objectives
• At the end of this course you should be able to:
  • Decompose a digital system into a controller (FSM) and datapath, and code accordingly
  • Code in VHDL for synthesis
  • Write VHDL testbenches
  • Synthesize and implement digital systems on FPGAs
• This knowledge will come about through homework, exams, and an extensive project
• The project in particular will help you know VHDL and the FPGA design flow from beginning to end

Additional Skills Learned in the Project
• Reading & understanding specification of a complex algorithm
• Design of new hardware architectures based on existing architectures (datapath & controller)
• Reading, understanding, and modifying existing VHDL code
• Using embedded resources of modern FPGAs
• Characterizing performance of your codes for multiple FPGA families