AES Microcode And Reconfigurable Crypto Unit

AES Microcode Implementation in IXP2400 and a Study of the Reconfigurable Crypto Unit
Piyush Ranjan Satapathy
CS203B Class Project Presentation
Road Map

AES Algorithm Overview
IXP2400 Platform: A Quick Look
Microcode: Overview
Implementation of AES
Experimental Results

Reconfigurable Crypto unit of Intel IXP2850




Algorithm Overview

• Designed by Daemen and Rijmen for the NIST AES competition
• Originally called Rijndael
• Symmetric-key block substitution cipher
• Replacement for DES
• Successful field testing since inception
• Three bit-modes (128, 192, and 256 bits)
• State defined as a 4x4 array of 16 bytes
• Key size is either 16, 24, or 32 bytes
• A byte is represented as a polynomial, i.e. an element of the Galois field GF(2^8)
Bit Mode   Key Length (Nk words)   State Size (Nb words)   Number of Rounds (Nr)
128        4                       4                       10
192        6                       4                       12
256        8                       4                       14
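The relationship between the columns of the table above can be written down in a few lines. This is only a sketch: the function name `aes_parameters` is made up for illustration, and the rule Nr = max(Nk, Nb) + 6 is the standard Rijndael relation rather than something stated on the slide.

```python
def aes_parameters(key_bits: int) -> tuple[int, int, int]:
    """Return (Nk, Nb, Nr) for an AES key length given in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192, or 256 bits")
    nk = key_bits // 32      # key length in 32-bit words
    nb = 4                   # AES fixes the state at four 32-bit words
    nr = max(nk, nb) + 6     # number of rounds (standard Rijndael rule)
    return nk, nb, nr


if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, aes_parameters(bits))
```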
Stages of AES Algorithm:
[Figure: the result from round n-1 passes through ByteSub, Shift Row, MixColumn, and AddRoundKey (with round key Kn) and is then passed to round n+1.]
Detailed view of round n
• Each round performs the following operations:
• Non-linear Layer: No linear relationship between the input and output of a round
• Linear Mixing Layer: Guarantees high diffusion over multiple rounds
• Very small correlation between the bytes of the round input and the bytes of the output
• Key Addition Layer: Bytes of the input are simply XORed with the expanded round key
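The ordering of the layers can be sketched as a driver loop. This is an illustrative skeleton, not the microcode from the project: the step functions are passed in by the caller, and the names `encrypt_block` and `trace_round_order` are hypothetical. It follows the standard AES convention of an initial key addition and a final round without MixColumns.

```python
def encrypt_block(state, round_keys, sub_bytes, shift_rows, mix_columns, add_round_key):
    """Apply the AES round structure; the four step functions are supplied by the caller."""
    nr = len(round_keys) - 1                      # Nr rounds need Nr + 1 round keys
    state = add_round_key(state, round_keys[0])   # initial key addition
    for n in range(1, nr):                        # rounds 1 .. Nr-1 are full rounds
        state = sub_bytes(state)
        state = shift_rows(state)
        state = mix_columns(state)
        state = add_round_key(state, round_keys[n])
    state = sub_bytes(state)                      # final round omits MixColumns
    state = shift_rows(state)
    return add_round_key(state, round_keys[nr])


def trace_round_order(nr):
    """Run the skeleton with dummy steps and return the order in which they were called."""
    calls = []

    def step(name):
        def run(state, *_):
            calls.append(name)
            return state
        return run

    encrypt_block([], [None] * (nr + 1), step("SubBytes"), step("ShiftRows"),
                  step("MixColumns"), step("AddRoundKey"))
    return calls
```

Calling `trace_round_order(10)` lists the step sequence for the 128-bit mode.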
1. SubBytes Function
• Each byte at the input of a round undergoes a non-linear byte substitution through the substitution ("S")-box
• The transform is an affine transformation applied to the multiplicative inverse of the byte in GF(2^8)
• Direct implementation is complex
• Easily performed by a 16 x 16 LUT ROM
• Simple byte substitution
• Combinational logic
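To show why a lookup table is preferred over direct computation, the S-box can be generated from its definition. This is a minimal sketch assuming the standard AES field polynomial x^8 + x^4 + x^3 + x + 1 and affine constant 0x63; the helper names are illustrative and the inverse is computed by slow repeated multiplication for clarity.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):          # a^254 equals a^-1 in a field of 256 elements
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    """Inverse in GF(2^8) followed by the AES affine transformation."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The 256-entry table that a 16 x 16 LUT would hold.
SBOX = [sbox_entry(a) for a in range(256)]
```

Once `SBOX` is built, SubBytes reduces to one table read per state byte.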
2. Shift Row
• Depending on the block length, each row of the block is cyclically shifted by a fixed offset; for the 4-word AES state, rows 1, 2, and 3 shift by 1, 2, and 3 bytes
• Shifting done only on the bottom three rows of the State
• Left rotate for encryption
• Right rotate for decryption
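The row rotation can be sketched as follows, assuming the state is held as four rows of four bytes and the standard AES offsets of 0, 1, 2, and 3 bytes for rows 0 to 3. The function name and the `decrypt` flag are illustrative choices.

```python
def shift_rows(state, decrypt=False):
    """Rotate row r of a 4x4 state left by r bytes (right by r bytes when decrypting)."""
    shifted = []
    for r, row in enumerate(state):
        k = (-r if decrypt else r) % 4   # a right rotation by r is a left rotation by 4 - r
        shifted.append(row[k:] + row[:k])
    return shifted


if __name__ == "__main__":
    state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(shift_rows(state))
```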
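3. MixColumns Function
• Matrix multiplication in GF(2^8)
• MixColumns functionality resides primarily in the controller and instruction memory
• A series of conditional XOR and left shift operations
• Each column is multiplied by a fixed polynomial, modulo $x^4 + 1$:
$$c(x) = \texttt{03}\,x^3 + \texttt{01}\,x^2 + \texttt{01}\,x + \texttt{02}$$
• This corresponds to the matrix multiplication $b(x) = c(x) \otimes a(x)$:
$$\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} \texttt{02} & \texttt{03} & \texttt{01} & \texttt{01} \\ \texttt{01} & \texttt{02} & \texttt{03} & \texttt{01} \\ \texttt{01} & \texttt{01} & \texttt{02} & \texttt{03} \\ \texttt{03} & \texttt{01} & \texttt{01} & \texttt{02} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}$$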
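The "conditional XOR and left shift" formulation can be sketched for a single column. Multiplication by 02 is one left shift with a conditional XOR of the field polynomial, and multiplication by 03 is that result XORed with the input. The names `xtime` and `mix_column` are illustrative, and the code assumes the standard AES field polynomial 0x11B.

```python
def xtime(a):
    """Multiply a byte by 02 in GF(2^8): left shift, then conditional XOR with 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mul3(a):
    """Multiply a byte by 03 in GF(2^8): 03 * a = (02 * a) XOR a."""
    return xtime(a) ^ a


def mix_column(col):
    """Apply the MixColumns matrix to one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```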
4. Key Expansion and Addition
• Key expansion is performed before both the encrypt and decrypt process
• Byte values from the Key are read and manipulated into the RoundKey
• A series of SubBytes and XOR operations with RCON ROM values and the Key
• Key addition performs an XOR operation between the State and the RoundKey
• This is the only function that needs no separate inverse, because XOR is its own inverse
• Each word is simply XORed with the expanded round key
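A compact sketch of key expansion and key addition for the 128-bit mode is given here. It assumes the standard AES schedule (RotWord, SubWord, and an RCON constant every fourth word); the S-box is rebuilt locally so the block stands on its own, and all function names are illustrative rather than taken from the project's microcode.

```python
def _gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r


def _sbox_entry(a):
    """Inverse in GF(2^8) (by repeated multiplication) plus the affine transformation."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):
            inv = _gf_mul(inv, a)
    rot = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(a) for a in range(256)]


def expand_key_128(key: bytes) -> list[bytes]:
    """Expand a 16-byte key into 11 round keys of 16 bytes each."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]                   # RotWord
            temp = bytes(SBOX[b] for b in temp)          # SubWord
            temp = bytes([temp[0] ^ rcon]) + temp[1:]    # XOR with the RCON value
            rcon = _gf_mul(rcon, 0x02)                   # next RCON constant
        words.append(bytes(x ^ y for x, y in zip(words[i - 4], temp)))
    return [b"".join(words[4 * r:4 * r + 4]) for r in range(11)]


def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR the state with a round key; applying it twice restores the state."""
    return bytes(s ^ k for s, k in zip(state, round_key))
```

Because `add_round_key` is a plain XOR, the same routine serves both encryption and decryption.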
IXP2400 Platform: A Quick Look

Name      Size (bytes)   Transfer size (bytes)   Reference latency (cycles)
GPR/ME    256*4          4                       1
TR/ME     512*4          4                       1
NNR/ME    128*4          4                       1
LM/ME     640*4          4                       3
Scratch   16K            4                       60
SRAM      64M            4                       90
DRAM      1G             16                      120

(ME = microengine, GPR = general-purpose registers, TR = transfer registers, NNR = next-neighbor registers, LM = local memory)

• Achieves high processing performance
• Programming flexibility
• Cheaper than an ASIC
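The latency column suggests why table placement matters for AES. The following back-of-the-envelope sketch uses the latencies listed above and assumes, purely for illustration, 160 S-box lookups per 128-bit block (16 bytes x 10 rounds) with every access fully serialized; multithreading on a real microengine hides part of this latency, so these are upper bounds rather than measurements.

```python
# Reference latencies in cycles, taken from the platform table.
LATENCY_CYCLES = {"LM": 3, "Scratch": 60, "SRAM": 90, "DRAM": 120}

# Assumption: one S-box lookup per state byte per round, AES-128 (16 bytes x 10 rounds).
LOOKUPS_PER_BLOCK = 16 * 10


def lookup_cycles(memory: str, lookups: int = LOOKUPS_PER_BLOCK) -> int:
    """Cycles spent on table lookups if every access waits for the full latency."""
    return LATENCY_CYCLES[memory] * lookups


if __name__ == "__main__":
    for name in LATENCY_CYCLES:
        print(f"{name:8s} {lookup_cycles(name):6d} cycles per block")
```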
Microcode Overview

alu[dest1, a, +, b] -> ALU addition of a and b, result stored in dest1
alu[dest2, dest1, -, c] -> ALU subtraction
move(reg1, reg2) -> Register-to-register move; both operands are GPRs
immed[reg, 0x0020] -> Immediate value assignment to register
local_csr_wr[ACTIVE_LM_ADDR_0, 0x0] -> Local memory indexing with index 0
.begin … endm -> Macro begin and end
.if … .endif -> If conditional
xbuf_alloc($$state, 4, read) -> Buffer allocation in DRAM transfer registers
.reg gen_register $sram_reg $$dram_reg -> Register declaration
.sig sram_sig dram_sig -> Signal declaration
.while … .endw -> While loop
#for round[1,2,3,4,5,6,7,8,9,10] … #endloop -> For loop
alu_shf[index, --, B, s0, >>24] -> ALU shift of the B operand (s0 shifted right by 24 bits)
scratch[read, $T, index, 0, 1], ctx_swap[sram_sig] -> Scratch read instruction
ld_field_w_clr[t1, 1000, $T] -> Writes a byte field of $T into t1, clearing the other bytes
dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig] -> DRAM write
ctx_arb[dram_sig], ctx_arb[kill] -> Signaling
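The `alu_shf[index, --, B, s0, >>24]` line above isolates the top byte of a 32-bit word, which is the usual way to turn a state byte into a table index. A Python equivalent of that shift-and-mask step is sketched here; the function name is hypothetical, and the assumption that `s0` holds four packed state bytes (most significant byte first) is mine, not the slide's.

```python
def state_word_to_indices(s0: int) -> list[int]:
    """Split a 32-bit word into four byte-sized table indices, most significant byte first."""
    s0 &= 0xFFFFFFFF                      # model a 32-bit register
    return [(s0 >> shift) & 0xFF for shift in (24, 16, 8, 0)]


if __name__ == "__main__":
    print([hex(i) for i in state_word_to_indices(0x2B7E1516)])
```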
Implementation Setup

Environmental setup:
• Intel IXP 4.1
• 600-MHz ME configuration
• 200-MHz SRAMs
• 150-MHz RDRAMs
• Executed with multiple threads
• Executed on different microengines
Experimental Results (1)
[Figures: SRAM utilization (SRAM versus non-SRAM) and microengine utilization percentage (executing, aborted, stalled, idle) against the number of threads in execution (1, 2, 4, and 8 threads); command bus arbiter statistics (used, idle due to no request, idle due to memory queue fullness).]
Experimental Results (2)
Throughput Performance Across Threads in 1 ME
[Figures: "Throughput Improvement for 1 MicroEngine with different threads", throughput (MIPS) against the number of threads (1, 2, 4, 8); "AES Throughput Across MicroEngines", throughput (MIPS) against the number of microengines (1, 2, 4, 8).]
Crypto Unit of IXP2850
Intel IXP2850 Encryption Data Flow
Crypto Unit Overview
Simple Encrypt Example
Simple Encrypt and Hash Example
3DES Core
• 2 cores per crypto unit
• Takes a 192-bit key: (56-bit + 8-bit parity) x 3 keys
• Operates on 8-byte blocks
• Result is written to ME transfer registers or a TBUF element
• Result can be passed to the SHA-1 unit for hashing
Security Processing, pipelining, and interleaving using three wires and one core
Multiple keys and IVs
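The 192-bit key layout noted for the 3DES core, three 64-bit keys each carrying 56 key bits and 8 parity bits, can be sketched as below. The code assumes the usual DES convention that the least significant bit of every key byte is an odd-parity bit; the function names are illustrative and do not come from the IXP2850 tool chain.

```python
def with_odd_parity(key: bytes) -> bytes:
    """Set the low bit of every byte so that each byte has an odd number of 1 bits."""
    out = bytearray()
    for byte in key:
        high7 = byte & 0xFE                       # the 7 key bits of this byte
        ones = bin(high7).count("1")
        out.append(high7 | (0 if ones % 2 else 1))
    return bytes(out)


def split_3des_key(key: bytes) -> tuple[bytes, bytes, bytes]:
    """Split a 24-byte (192-bit) 3DES key into its three 8-byte DES keys."""
    if len(key) != 24:
        raise ValueError("a 3DES key is 24 bytes: 3 x (56 key bits + 8 parity bits)")
    return key[0:8], key[8:16], key[16:24]
```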
AES Core
• All AES key sizes are supported (128, 192, or 256 bits)
• Both encryption and decryption supported
• Operates on 16-byte blocks
AES Key Scheduler
SHA1 Core
• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or Crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
• This gives flexibility to run SHA-1 and AES or 3DES at different rates
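Since the SHA-1 core consumes 64-byte (512-bit) blocks, a message has to be padded to a multiple of that size before hashing. The sketch below shows the standard SHA-1 padding rule in software; it illustrates the block framing only and makes no claim about how the IXP2850 hardware or its macros perform padding. The function names are illustrative.

```python
import struct


def sha1_pad(message: bytes) -> bytes:
    """Append the 0x80 marker, zero fill, and 64-bit big-endian bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)   # leave 8 bytes for the length
    return padded + struct.pack(">Q", bit_length)


def split_blocks(padded: bytes) -> list[bytes]:
    """Cut padded data into the 64-byte blocks a SHA-1 core would consume."""
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]
```

A 16-byte AES output block therefore fills a quarter of one SHA-1 block, which is one reason a buffer between the cipher and hash cores is useful.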
SHA1 Critical Path Analysis
Some of the Crypto Commands

crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig) -> Perform and wait for the write
crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig) -> Loading IV data
crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig) -> Loading key
crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig) -> Cipher operation (here 3DES encryption without CBC)
Acknowledgement

• Yan Luo
• Chris Baron
• http://cnscenter.future.co.kr/resource/rsccenter/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
• Mel Tsai, UC Berkeley (for some slides)
• Thomas Sodon et al., EE, College of New Jersey
• Zhangxi Tan et al., Tsinghua University

Questions?