A Novel Memory Architecture for Elliptic
Curve Cryptography with Parallel Modular
Multipliers
Ralf Laue, Sorin A. Huss
Integrated Circuits and Systems Lab, Computer Science Dept.
Technische Universität Darmstadt, Germany
{laue|huss}@iss.tu-darmstadt.de
December 14th, 2006
FPT 2006, Bangkok
Introduction
• Speed-up of today's hardware stems increasingly
from parallelization.
• Cryptographic implementations should take advantage of this by using parallel algorithm versions.
• We begin with a survey of parallelization on different abstraction levels of public key cryptography.
• Then, we present a novel parallel memory
architecture for elliptic curve cryptography in GF(P).
– Allows the execution time to scale with the number of
parallel modular multipliers.
– Direct memory connection leads to low resource usage.
Overview
• Parallelization on Different Abstraction Levels
• Novel Memory Architecture
– Design Considerations
– Proposed Memory Architecture
• Experimental Results
– Number of Parallel Multipliers
– Prototype Implementation
– Application to Another EC Arithmetic Algorithm
Parallelization on Different Abstraction Levels
[Figure: abstraction levels for RSA and ECC/HECC, from top to bottom]
– System: Cryptographic Scheme
– Discrete Logarithm/Integer Factorization: Point Multiplication/Exponentiation
– Elliptic Curve Group: Point Addition and Doubling
– Finite Field: Modular Arithmetic
• In general, parallelization yields greater benefit on lower
levels (as less control logic needs to be duplicated).
• Parallelization on higher levels allows further speed-up and
offers advantages not available on lower levels.
• Parallelization methods on different levels do not exclude
each other.
Parallelization on Finite Field Level
• Modular multi-word multiplication is the
most critical operation. Thus, parallelization on this level is a popular strategy.
• The approaches on this level do not
exclude each other.
• Data-paths of full bit-width:
– Allow for linear time complexity at the cost
of a proportional increase in resources (e.g. systolic array).
– Usual bit-widths: ECC: >100 bit, RSA: >1000 bit
– Problem: the design must target the maximum bit-width. For smaller word
counts resources stay unused; larger ones may be infeasible.
Parallelization on Finite Field Level (cont.)
• Pipelining
– Allows for linear time complexity, too.
– More flexible than buses of full bit-width, because the
number of pipeline stages may be chosen freely.
– Problem: the calculated bit-width always corresponds
to a multiple of the number of stages in words.
• Resources may still stay unused.
• An ECC/RSA combination allows only for pipeline lengths
designed for ECC, as those designed for RSA would waste
resources and execution time if used with ECC.
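The rounding effect described above can be illustrated with a small sketch. The word size of 32 bits and the pipeline lengths used below are assumptions chosen for illustration; the slides do not state them.

```python
import math

def padded_width(bits: int, word_bits: int, stages: int) -> int:
    """Bit-width actually processed by a word-serial pipeline.

    The operand is split into words, and the word count is rounded up
    to the next multiple of the number of pipeline stages.
    """
    words = math.ceil(bits / word_bits)
    padded_words = math.ceil(words / stages) * stages
    return padded_words * word_bits

def utilization(bits: int, word_bits: int, stages: int) -> float:
    """Fraction of the processed bit-width that carries real operand bits."""
    return bits / padded_width(bits, word_bits, stages)

# Hypothetical pipelines: 4 stages (sized for ECC) vs. 32 stages (sized for 1024-bit RSA).
ecc_on_short = utilization(160, 32, 4)    # 160-bit operand on the short pipeline
ecc_on_long = utilization(160, 32, 32)    # 160-bit operand on the long pipeline
```

With these assumed numbers, a 160-bit operand fills 5 of 8 words on the short pipeline but only 5 of 32 words on the long one, which is the waste the last bullet refers to.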
Parallelization on Finite Field Level (cont.)
• Karatsuba multiplication:
– Multiplying two numbers with two words each can be done with
three word multiplications.
$$(x_1 2^b + x_0)(y_1 2^b + y_0) = x_1 y_1\,2^{2b} + [x_1 y_0 + x_0 y_1]\,2^b + x_0 y_0$$
$$= x_1 y_1\,2^{2b} + [(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0]\,2^b + x_0 y_0$$
– Recursion leads to approx. $O(n^{1.585})$.
– As recursion is difficult in hardware, this is usually used for
multiplications in full bit-width (requires fewer resources).
• Residue Number Systems:
– Long numbers are represented relative to a base consisting of
multiple smaller moduli, relatively prime to each other. The Chinese
Remainder Theorem ensures a unique mapping.
– Multiplication, addition and subtraction may be executed in parallel.
– Can be interpreted as special case of buses of full bit-width.
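The two-word Karatsuba identity from this slide can be checked with a minimal sketch. The function name and the choice of Python integers as stand-ins for hardware words are assumptions made for illustration only.

```python
def karatsuba_2word(x: int, y: int, b: int) -> int:
    """Multiply two non-negative integers split into two b-bit words each,
    using three word multiplications instead of four."""
    mask = (1 << b) - 1
    x1, x0 = x >> b, x & mask      # high and low word of x
    y1, y0 = y >> b, y & mask      # high and low word of y

    high = x1 * y1                 # multiplication 1
    low = x0 * y0                  # multiplication 2
    # multiplication 3; the operands x1 + x0 and y1 + y0 may be b + 1 bits wide
    middle = (x1 + x0) * (y1 + y0) - high - low

    return (high << (2 * b)) + (middle << b) + low

product = karatsuba_2word(0xFFFFFFFF, 0x12345678, 16)
```

The three multiplications are independent of each other, so they could run on separate multipliers; only the final additions and shifts combine their results.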
Parallelization on Elliptic Group Level
• EC doubling and addition may be sped up
by using multiple modular units in parallel.
• Literature suggests a maximum of two or
three modular multipliers (data dependencies
limit further improvements).
• One instance of the remaining modular
arithmetic is sufficient, because it is very
fast in comparison.
• This abstraction level is well-suited for parallelization in SIMD
implementations.
• Note that this level does not exist for RSA.
Parallelization on Discrete Logarithm/
Integer Factorization Level
• Both point multiplication and exponentiation allow parallel use of two
instances of group operations.
– E.g. with Montgomery Ladder (parallel point doubling/addition for ECC;
parallel square/multiply for RSA).
• Parallelization on this abstraction
level is (in addition to further speed-ups) often used
as a countermeasure against side-channel
attacks.
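The parallel square/multiply structure of the Montgomery Ladder can be sketched for modular exponentiation as follows. This is an illustrative software model under assumed names, not the implementation from the talk; the Python branch on the exponent bit is not constant-time, so it only shows the data flow.

```python
def montgomery_ladder_pow(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent mod modulus with the Montgomery Ladder.

    Invariant after each step: r1 == r0 * base (mod modulus).
    """
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            # The multiplication and the squaring both read only the old
            # r0 and r1, so two units could compute them in parallel.
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0

result = montgomery_ladder_pow(7, 65537, 1000003)
```

Every exponent bit triggers exactly one squaring and one multiplication, regardless of its value, which is why this structure is attractive as a side-channel countermeasure.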
Parallelization on Cryptographic Primitive/
System Level
• Cryptographic schemes usually use only
one point multiplication/exponentiation.
– We know of no proposal for parallelization
on this level.
• Possible scenario:
flexible coprocessor for RSA/ECC
– Parallelization on lower abstraction levels
is only possible to a certain degree if unused resources
are to be avoided.
– Further parallelization may be done on the level of the
cryptographic primitive to increase throughput.
Design Goals
• ECC implementation for GF(P) on FPGAs.
• Ability to support different key lengths.
• Resource requirements should be relatively low, thus
allowing integration of further functions on the FPGA.
– E.g. other cryptographic modules or functions unrelated to
cryptography.
• Thus, minimum execution time was less important
than a high utilization of the allocated resources.
Design Decisions
• No parallelization on finite field level
– Would lead to unused resources, at least for some key
lengths.
• Instead, parallelization on elliptic group level
– Depends on data dependencies, independent of key
length.
• Modular multiplication is more complex and
time-consuming than the remaining modular operations.
– Chosen architecture consists of multiple modular multipliers
parallel to each other and the module for the remaining
modular arithmetic parallel to the multipliers.
Conventional Memory Architecture
• Memory architecture must allow all operations to be
continuously supplied with data.
• Conventional memory architecture consists of one
memory and modules with input and output registers.
• Registers take up FPGA resources, but contain only
redundant data copied from memory.
[Figure: conventional memory architecture with one RAM and the modules Mult 1 ... Mult n, ALU, ..., Square]
Novel Memory Architecture
• Each modular multiplier is assigned its own memory block via a
direct connection.
– Supports continuous data supply.
– Low general resource usage, slightly increased memory usage.
• Remaining modular arithmetic may access memory blocks via
the second port.
• Execution time scales with the number of modular multipliers.
• Modular arithmetic copies data between local memory blocks,
as multipliers can only access "their" memory block.
– Does not hinder scalability, as the remaining modular arithmetic can
access all memory blocks simultaneously in parallel.
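The access rules above can be made concrete with a toy software model. The class and method names are hypothetical and the model ignores timing; it only shows that each multiplier works on its own block, while the remaining modular arithmetic can write to and copy between all blocks.

```python
class ParallelMultiplierMemory:
    """Toy model: one local memory block per modular multiplier."""

    def __init__(self, num_multipliers: int, modulus: int):
        self.modulus = modulus
        self.blocks = [dict() for _ in range(num_multipliers)]

    def store(self, block: int, name: str, value: int) -> None:
        # Second port: the remaining modular arithmetic may write any block.
        self.blocks[block][name] = value % self.modulus

    def copy(self, src_block: int, dst_block: int, name: str) -> None:
        # Modular arithmetic moves an operand between local blocks.
        self.blocks[dst_block][name] = self.blocks[src_block][name]

    def multiply(self, block: int, dst: str, a: str, b: str) -> None:
        # First port: multiplier `block` sees only its own memory block.
        mem = self.blocks[block]
        mem[dst] = (mem[a] * mem[b]) % self.modulus


mem = ParallelMultiplierMemory(num_multipliers=2, modulus=97)
mem.store(0, "x", 5)
mem.store(0, "y", 7)
mem.store(1, "z", 3)
mem.multiply(0, "t", "x", "y")   # runs on multiplier 0
mem.copy(0, 1, "t")              # result is needed by multiplier 1
mem.multiply(1, "u", "t", "z")   # runs on multiplier 1
```

A multiplication whose operands live in another block fails in this model until the operands are copied, which mirrors why consecutive dependent multiplications are best kept on the same multiplier.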
Novel Memory Architecture (cont.)
[Figure: block diagram of the proposed architecture with the blocks Cryptographic Primitive, Elliptic Curve Arithmetic, Modular Arithmetic, a MUX, and several BRAM/ModMult pairs; signals labeled commands, status, busy, and data]
• Usual memory blocks lack a
third port.
• Cryptographic primitive and
modular arithmetic share the
second memory port.
– Access from the cryptographic
primitive only while no
computation is executed.
– Else: access from the
modular arithmetic.
• Elliptic curve arithmetic does
not directly access the data,
but only indirectly via the
modular arithmetic.
Number of Parallel Multipliers
• Determine the number of multipliers to be used (IEEE 1363):
– ECDbl can utilize only two parallel modular multipliers because of
data dependencies.
– Utilization of modular multipliers for ECAdd (16 multiplications):

#multipliers   multiplier utilization   #consecutive multiplications
2              approx. 98%              8
3              approx. 82%              6
4              approx. 74%              5

• Table highlights scalability.
– (#multipliers * #consecutive multiplications) is the smallest multiple of
the number of multipliers greater than or equal to the overall number of
multiplications.
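The last column of the table can be reproduced with a short sketch. It combines two lower bounds: the balanced share of the 16 multiplications per multiplier, and the longest path of 5 modular multiplications stated on the data flow graph slide. Combining them with a maximum is an interpretation of the table, not a formula given in the talk; the idealized utilization below also ignores any overhead, so it comes out higher than the reported values.

```python
import math

TOTAL_MULTIPLICATIONS = 16   # ECAdd according to the slide
LONGEST_PATH = 5             # longest chain of dependent multiplications

def consecutive_multiplications(num_multipliers: int) -> int:
    """Multiplications each multiplier executes back to back."""
    balanced = math.ceil(TOTAL_MULTIPLICATIONS / num_multipliers)
    return max(balanced, LONGEST_PATH)

def ideal_utilization(num_multipliers: int) -> float:
    """Share of multiplier slots doing useful work, ignoring overhead."""
    slots = num_multipliers * consecutive_multiplications(num_multipliers)
    return TOTAL_MULTIPLICATIONS / slots

rows = [(n, consecutive_multiplications(n), ideal_utilization(n)) for n in (2, 3, 4)]
```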
Data Flow Graph ECAdd, IEEE
• Consecutive multiplications are
always executed on same
multiplier.
– No copying between memory
blocks.
– Dark and light grey multiplications
are executed on different modular
multipliers.
• Longest path contains 5 modular
multiplications.
– No speed-up by using more than 4
multipliers possible.
Schedule ECAdd, IEEE
[Figure: schedule table with the columns ModMultA, ModMultB, ModArith]
• Schedule for two modular multipliers.
– Operations on ModArith: Sub1, Mult8_Add, Sub3, Mult7_Add, Sub2, Sub4, Sub5, Sub6, Sub7, Div1, Mult13_Add
– Operations on ModMultA and ModMultB: Quad1, Mult1, Mult2, Mult3, Quad3, Mult12, Mult11, Mult15, Quad2, Mult5, Mult4, Mult6, Mult9, Quad4, Mult10, Mult14
• Mapping to multipliers as shown in the data flow graph on the previous slide.
Prototype Implementation - Results
            FlipFlops  LUTs   Slices  BRAMs  Cycle Period  Point Multiplication
this work   1128       3015   1806    3      9.898 ns      12.716 ms (160 bit)
[16]        6959       11227  n/a     n/a    10.952 ns     14.414 ms (160 bit)
[30]        5735       11416  n/a     35     25 ns         estimated 3 ms (192 bit)
[5]         n/a        n/a    18314   24     100.1 ns      114.71 µs (191 bit, GF(2^m))

• Taking its smaller resource usage into account, the execution time of
our solution is comparable to previous work.
• However, because of their high resource usage, none of the previous
designs fulfills the given requirements.
• Reference [5] uses GF(2^m) as finite field, thus its execution time is not
comparable. Its memory architecture is similar, but not easily
applicable to GF(P), and it does not scale as well.
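As a rough cross-check of the table, the cycle period and the point multiplication time can be combined into an approximate cycle count. This derived figure is not part of the talk; it assumes the reported time was measured at the reported cycle period.

```python
def approx_cycles(point_mult_seconds: float, cycle_period_seconds: float) -> float:
    """Approximate number of clock cycles for one point multiplication."""
    return point_mult_seconds / cycle_period_seconds

# Values copied from the results table (time, cycle period).
designs = {
    "this work": (12.716e-3, 9.898e-9),
    "[16]": (14.414e-3, 10.952e-9),
    "[30]": (3e-3, 25e-9),
}
cycles = {name: approx_cycles(t, period) for name, (t, period) in designs.items()}
```

Under that assumption, this work needs roughly 1.28 million cycles for a 160-bit point multiplication, which is in the same range as [16] at a fraction of the flip-flops and LUTs.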
Application to Alternative EC Arithmetic
• Application of our memory architecture to an algorithm for atomic point
doubling and addition.
• The algorithm consists of more modular multiplications, thus allowing
better utilization of more modular multipliers.

Source      #multipliers  multiplier utilization  #consecutive multiplications  #consecutive additions
[21]        2             approx. 90%             10                            8
this work   2             approx. 94%             10                            1
this work   3             approx. 90%             7                             1
this work   4             approx. 89%             5                             5
this work   5             approx. 75%             5                             1

• Our architecture allows the parallel execution of modular additions.
• With three multipliers, the atomic algorithm is faster than IEEE point addition
with only two parallel multipliers.
Schedule for Atomic ECAdd&Dbl
[Figure: schedule table with the columns ModMultA, ModMultB, ModMultC, ModArith]
• Schedule for three modular multipliers.
– Operations on ModArith: Add19, Add17, Add26, Sub4, Add11, Sub24, Add3, Add12, Add25, Add16, Add15, Sub18, Add33, Sub32
– Operations on ModMultA, ModMultB and ModMultC: Mult13, Mult6, Mult7, Mult8, Mult10, Mult21, Mult9, Mult2, Mult23, Mult5, Mult14, Mult31, Mult1, Mult22, Mult27, Mult29, Mult20, Mult28, Mult30
Conclusions
• Novel memory architecture for ECC
implementations over GF(P) on FPGAs
features the following advantages:
– Low register usage, because of direct memory
access.
– Execution time scales with the number of
modular multipliers, as long as data dependencies
allow this.
– Remaining modular arithmetic is executed in
parallel to all the modular multiplications.
• Thank you for your attention.
• Any questions?
References
[5] N. A. Saqib, F. Rodríguez-Henríquez, A. Díaz-Pérez, "A Parallel
Architecture for Computing Scalar Multiplication on Hessian
Elliptic Curves," in ITCC, vol. 2, 2004, pp. 493-497.
[16] A. B. Örs, L. Batina, B. Preneel, J. Vandewalle, "Hardware
Implementation of an Elliptic Curve Processor over GF(p)," in
ASAP. IEEE Computer Society, 2003, pp. 433-443.
[21] W. Fischer, C. Giraud, E. W. Knudsen, "Parallel scalar
multiplication on general elliptic curves over Fp hedged against
Non-Differential Side-Channel Attacks," Jan 2002.
[30] G. Orlando, C. Paar, "A Scalable GF(p) Elliptic Curve
Processor Architecture for Programmable Hardware," in CHES,
ser. LNCS, vol. 2162, 2001, pp. 348-363.