Long Modular Multiplication for Cryptographic Applications

Laszlo Hars
Seagate Research
Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004 Boston, MA
Full version of the paper is at: http://www.hars.us/Papers/ModMult.pdf
Outline
• Background (need, algorithms, complexity…)
• Target: occasional PK crypto (smartcard, OSD…)
• Optimizations
– Hardware architecture
• General purpose, support fast modular reduction
• Speed: Parallel operation: multiply || add / load…
• Memory: In-place update
– Algorithmic improvements
• Multiply with short Reciprocal (~trial division)
– Precision – scaling of reciprocals
– Drop insignificant terms
• Modulus scaling
Modular Multiplication
• a×b mod m = remainder of (a×b) ÷ m
• Used in RSA, ECC, ElGamal, Diffie-Hellman, primality tests, BBS-PRNG…
• Assume a,b,m are n-digit numbers
▫ m normalized: ½ d^n ≤ m < d^n
▫ Digit size (machine word) = 16 bits (8…64)
▫ n = 64 for RSA-1024 (10…256)
• Squaring ~twice faster
• Conserve memory
→ Divide after Multiply: double-length product
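The notation on this slide can be made concrete with a minimal sketch. It assumes 16-bit digits and uses Python's built-in big integers as the reference for "divide after multiply"; the helper names (`to_digits`, `from_digits`, `is_normalized`) are illustrative and do not come from the slides.

```python
D_BITS = 16                 # digit size in bits (the slide's default machine word)
D = 1 << D_BITS             # digit base d

def to_digits(x, n):
    """Split x into n base-d digits, least significant digit first."""
    return [(x >> (D_BITS * i)) & (D - 1) for i in range(n)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(dig << (D_BITS * i) for i, dig in enumerate(digits))

def is_normalized(m, n):
    """True if d^n / 2 <= m < d^n, i.e. the top bit of the top digit is set."""
    return (D ** n) // 2 <= m < D ** n

def modmul_divide_after_multiply(a, b, m):
    """Reference method: form the double-length product, then take the remainder."""
    product = a * b         # needs up to 2n digits of storage
    return product % m

n = 4                                   # a 64-bit toy modulus, 4 digits
m = 0xF123456789ABCDEF
a = 0x0123456789ABCDEF
b = 0x0FEDCBA987654321
result = modmul_divide_after_multiply(a, b, m)
```

The double-length product is the memory cost that the interleaved methods on the following slides avoid.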
Modular Multiplication
• Interleaved multiplication and division
• Barrett multiplication
– Multiply with reciprocal ([d^(2n)/m]: extra n digits)
• Quisquater's multiplication
– Scaling the modulus for many MS 1-bits (S: extra n digits storage)
• Montgomery multiplication
– Number representation: a → a×d^n mod m
– Right-to-left (simple) interleaved division
– Needs pre- and post processing
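As an illustration of the Montgomery entry in the list above, here is a minimal sketch of the representation a → a×d^n mod m and of the right-to-left reduction. It assumes 16-bit digits and an odd modulus, forms the full product before reducing (so it is not interleaved), and uses hypothetical function names; `pow(m, -1, D)` needs Python 3.8 or later.

```python
D_BITS = 16
D = 1 << D_BITS

def montgomery_setup(m, n):
    """Precompute -m^(-1) mod d and d^(2n) mod m (m must be odd)."""
    m_inv_neg = (-pow(m, -1, D)) % D
    r2 = pow(D, 2 * n, m)
    return m_inv_neg, r2

def redc(t, m, n, m_inv_neg):
    """Return t * d^(-n) mod m for 0 <= t < m * d^n, clearing one LS digit per step."""
    for _ in range(n):
        q = ((t & (D - 1)) * m_inv_neg) & (D - 1)   # makes the LS digit of t + q*m zero
        t = (t + q * m) >> D_BITS                   # exact division by d
    return t - m if t >= m else t                   # t < 2m here, one subtraction suffices

def mont_mul(x_bar, y_bar, m, n, m_inv_neg):
    """Product of two values that are both in Montgomery form."""
    return redc(x_bar * y_bar, m, n, m_inv_neg)

def modmul_montgomery(a, b, m, n):
    m_inv_neg, r2 = montgomery_setup(m, n)
    a_bar = mont_mul(a, r2, m, n, m_inv_neg)        # pre-processing: a -> a*d^n mod m
    b_bar = mont_mul(b, r2, m, n, m_inv_neg)
    c_bar = mont_mul(a_bar, b_bar, m, n, m_inv_neg)
    return redc(c_bar, m, n, m_inv_neg)             # post-processing: leave Montgomery form
```

For a single multiplication the conversions dominate; the representation pays off in exponentiation, where many multiplications share one pre- and post-processing step.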
Sub-Quadratic time algorithms
• Fast multiplications
▫ Complicated algorithms
▫ Pays off for very long numbers
▫ Karatsuba: O(n^(log2 3)), faster if n > 10…30
▫ Toom-Cook 3,4…way: O(n^α)
▫ 3FT (Finite Field Fourier Transform): O(n·log n·log log n)
• Division = multiplication with reciprocal
• Long Reciprocal [d^(2n)/m]
– Newton iteration: 0.6…2 multiplication time
• Speed-ups for PKC
www.hars.us/Papers/Truncated Products.pdf
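The "division = multiplication with reciprocal" idea can be sketched as follows. This is a plain integer Newton iteration for [2^k/m] followed by a Barrett-style reduction; it is an illustration under the assumption of 16-bit digits, not the truncated-product method of the linked paper, and the function names are hypothetical.

```python
D_BITS = 16  # digit size in bits

def reciprocal_newton(m, k):
    """Return floor(2^k / m) using the iteration x <- x + x*(2^k - m*x) / 2^k."""
    assert 0 < m < (1 << k)
    one = 1 << k
    x = 1 << (k - m.bit_length())        # power of two below 2^k / m
    while True:
        x_next = x + ((x * (one - m * x)) >> k)
        if x_next <= x:                  # no further progress: x is within a few units
            break
        x = x_next
    while m * (x + 1) <= one:            # final fix-up to the exact floor
        x += 1
    return x

def reduce_with_reciprocal(x, m, n, mu):
    """x mod m for 0 <= x < d^(2n), with mu = floor(d^(2n) / m) replacing the division."""
    q = (x * mu) >> (2 * n * D_BITS)     # quotient estimate, never too large
    r = x - q * m
    while r >= m:                        # correct the small underestimate
        r -= m
    return r

n = 4
m = 0xF123456789ABCDEF
mu = reciprocal_newton(m, 2 * n * D_BITS)    # the long reciprocal [d^(2n)/m]
```

The iterate approaches the reciprocal from below and the number of correct bits roughly doubles per step, which is why the whole reciprocal costs only a small multiple of one multiplication.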
Quadratic time algorithms
• School multiplication: n^2 digit products
• School division: k·n^2 digit operations
– Quotient digits estimated with short divisions
• Digit-Multiplications || other operations
+ Simple structure
+ No extra storage when interleaved
– Slower
– Quotient digits with trial-and-error
• Goal: reduce # correction steps
Multiply-Accumulate
• DSP: multiplication parallel to load / store / add / compare…
• Order of the digit-product calculation
▫ Row-order (use input digits sequentially)
for i = 0 … |a|-1, for j = 0 … |b|-1: … a_i b_j …
– More memory access
▫ Column-order (output digits sequentially)
for k = 0 … |a|+|b|-2, for i,j with i+j = k: … a_i b_j …
– Longer accumulator (can be split)
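The two orderings can be compared in a short sketch. It assumes 16-bit digits stored least significant first; the function names are illustrative. Row-order rewrites every partial result digit, while column-order sums a whole column in a long accumulator and writes each output digit once.

```python
D_BITS = 16
D = 1 << D_BITS

def to_digits(x, n):
    return [(x >> (D_BITS * i)) & (D - 1) for i in range(n)]

def from_digits(digits):
    return sum(dig << (D_BITS * i) for i, dig in enumerate(digits))

def mul_row_order(a, b):
    """School multiplication by rows: each a[i] is used against all b[j]."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry       # read-modify-write of r[i+j]
            r[i + j] = t & (D - 1)
            carry = t >> D_BITS
        r[i + len(b)] = carry
    return r

def mul_column_order(a, b):
    """School multiplication by columns: all a[i]*b[j] with i+j = k are summed first."""
    r = [0] * (len(a) + len(b))
    acc = 0                                      # long accumulator
    for k in range(len(a) + len(b) - 1):
        for i in range(max(0, k - len(b) + 1), min(k, len(a) - 1) + 1):
            acc += a[i] * b[k - i]
        r[k] = acc & (D - 1)                     # output digit k is final
        acc >>= D_BITS                           # carry into the next column
    r[len(a) + len(b) - 1] = acc
    return r
```

With n-digit operands a column holds at most n products of 2 digits each, so the accumulator needs roughly 32 + log2(n) bits for 16-bit digits.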
HW Architecture
• General purpose µP with enhancements
– Circuit utilization: Multi-use
• DSP structure: multiplication || others
– Multiplier is large and slow
• Long accumulator
→ Split adder / counter
• In-Accumulator instructions
• Quotient-digit correction circuit
• Updateable memory –circular offset write
HW Architecture
[Figure: memory block diagram with labels DeMUX, Memory Write Lines, Memory Bank0, Bank1, Bank2, MUX0, MUX1, MUX2, Buffer/Inverter (×3), Memory Read Lines]
• 16-bit digits
• || Shift-add = 17.5-bit mult
• In-Accumulator
▫ Shift
▫ Add
[Figure: datapath with labels A (bits 31…16, 15…0), B, C, Multiplier, Shifter, As, P, Accumulator, Register Bank]
Quotient Digits
• No need to store q
• q ← multiplication with short reciprocal µ
– µ is used many times
– µ ← Newton iteration, look-up table…
– All bits - 2 MS digits and 1 bit: error = 0 or 1 (-1)
– More than 1-digit reciprocal: quotient often OK
– Most economical: µ = [d^(n+2) / (2m)] = {µ1, µ0}
scale: ÷2m, making µ exactly 2 digits
• Special case m = ½ d^n ⇒ µ := d^2 − 1
– Usable: µ = [d^(n+1.5) / m], µ = [2d^(n+1) / m]…
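A minimal sketch of the two-digit reciprocal and of one quotient-digit estimate follows. It assumes 16-bit digits and an accumulator value R below d·m whose two MS digits are x_n and x_(n−1). The function names are illustrative, and the estimate is computed as one 2-digit by 2-digit product rather than the four separate digit products of the algorithm.

```python
D_BITS = 16
D = 1 << D_BITS

def short_reciprocal(m, n):
    """mu = floor(d^(n+2) / (2m)), clamped to d^2 - 1 for the special case m = d^n / 2."""
    assert D ** n // 2 <= m < D ** n, "m must be normalized"
    mu = D ** (n + 2) // (2 * m)
    return min(mu, D * D - 1)

def estimate_quotient_digit(R, mu, n):
    """Estimate R // m for 0 <= R < d*m from the two MS digits of R."""
    top = R >> (D_BITS * (n - 1))            # x_n * d + x_(n-1)
    return (2 * mu * top) >> (3 * D_BITS)    # 2 * mu * top / d^3

n = 4
m = 0xF123456789ABCDEF
mu = short_reciprocal(m, n)
mu1, mu0 = divmod(mu, D)                     # the two digits {mu1, mu0}
```

Both truncations round down, so under these assumptions the estimate is never too large, and with 16-bit digits it falls short of the true quotient digit by at most 1.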
The basic algorithm LRL4
R_(n-1…n-3) = a_(na-1) b_(nb-1) d + a_(na-1) b_(nb-2) + a_(na-2) b_(nb-1)   // Col 1,2
for k = na+nb-4 … n-3                              // Columns to left
    R_(n…n-4) += Σ_(i+j=k) a_i b_j                 // Loop-1 to right
    if (overflow) R -= m
    q = (R_(n-1) µ1 d^2 + R_(n-1) µ0 d + R_(n-2) µ1 d + R_(n-2) µ0) / d^3 · 2
    R = (R – q·m) d                                // Loop-2
for k = 0 … n-4                                    // LS digits to left
    R_(n…k) += Σ_(i+j=k) a_i b_j                   // Loop-3 ~ 1
while (R_n > 0) R -= m                             // fix overflow
Left-Right-Left (military step) algorithm
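The quotient-digit step of the algorithm above can be exercised in a simplified sketch. The assumptions are loud ones: the full product a×b is formed first, R is held as a Python big integer, and digits are brought in strictly left to right, so the left-right-left column schedule and the in-place memory update of LRL4 are not reproduced. Only the estimate q ≈ 2µ(x_n d + x_(n−1))/d^3 and the correction step follow the slide.

```python
D_BITS = 16
D = 1 << D_BITS

def modmul_short_reciprocal(a, b, m, n):
    """Return (a*b mod m, number of correction steps), one quotient digit per step."""
    assert D ** n // 2 <= m < D ** n and 0 <= a < m and 0 <= b < m
    mu = min(D ** (n + 2) // (2 * m), D * D - 1)     # two-digit reciprocal
    x = a * b                                        # simplification: product formed up front
    R = 0
    corrections = 0
    for k in reversed(range(2 * n)):
        digit = (x >> (D_BITS * k)) & (D - 1)
        R = R * D + digit                            # shift in the next digit, R < d*m
        top = R >> (D_BITS * (n - 1))                # two MS digits of R
        q = (2 * mu * top) >> (3 * D_BITS)           # estimated quotient digit
        R -= q * m
        while R >= m:                                # estimate was too small
            R -= m
            corrections += 1
    return R, corrections
```

Counting the corrections on random inputs gives a feel for how rarely the two-digit reciprocal misses, which is the motivation for using more than one reciprocal digit.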
Inner Loops (multiply-add)
Σ_(i+j=k) a_i b_j:
Q = 0                                          // 50-bit accumulator
for k = 0 … n-4
    Q = MS(Q) + r_k
    for j = max(0,k+1-na) … min(k+1,nb)
        Q += a_(k-j) b_j
    r_k = D0(Q)
for i = n-3 … n                                // storing digits
    Q = MS(Q) + r_i
    r_i = D0(Q)

(R – q·m) d:
c = 0                                          // 1-digit temp store
Q = 0                                          // 33-bit accumulator
for k = 0 … n-1
    Q = MS(Q) + c – q·m_k
    c = r_k
    r_k = D0(Q)
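The subtraction loop can be sketched with a signed accumulator, taking D0(Q) as the low digit and MS(Q) as the remaining (possibly negative) high part. This is a minimal illustration under the assumption of 16-bit digits stored least significant first; it updates R in place but leaves out the one-digit shift that the slide implements with the temporary c, and the function name is hypothetical.

```python
D_BITS = 16
D = 1 << D_BITS

def sub_q_times_m(r, q, m):
    """In place r <- r - q*m on digit lists; returns 0, or a negative borrow on underflow."""
    Q = 0
    for k in range(len(r)):
        mk = m[k] if k < len(m) else 0
        Q = (Q >> D_BITS) + r[k] - q * mk    # MS(Q) + r_k - q*m_k, fits in about 33 bits
        r[k] = Q & (D - 1)                   # D0(Q): low digit, always non-negative
    return Q >> D_BITS

def to_digits(x, n):
    return [(x >> (D_BITS * i)) & (D - 1) for i in range(n)]

def from_digits(digits):
    return sum(dig << (D_BITS * i) for i, dig in enumerate(digits))
```

Python's `>>` on a negative integer is an arithmetic shift, which matches a signed hardware accumulator. A negative return value signals that q was too large and that m has to be added back.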
Improvements
• Probability of an overflow < n /d.
– When a, b and m uniform random (?)
Sequential quotient correction
• DSP SW mod reduction time = 1.0001n^2 + 4n
– multiply time = 10 additions: 1.00001n^2 + 4n
• HW assisted time = n^2 + 4n
• Variants (Accumulator = x_n d^3 + x_(n−1) d^2 + …)
– LRL4: q = [2(µ1 x_n d^2 + (µ1 x_(n−1) + µ0 x_n) d + µ0 ε x_(n−1)) / d^3]
– LRL3: q = [2(µ1 x_n d + (µ1 x_(n−1) + µ0 x_n)) / d^2]
2
? LRL2: q = [(µ1xn d + µ0xn) / d 2+δ], many corrections
Shorter reciprocal
• 1 digit → error explosion
• 1 digit + 2 bits OK: µ1=1
µ = ½ [2d^(n+1) / m] = d + µ0 + δ, with δ = 0 or ½
• 50-bit Accumulator with carry c = 0 or 1
R = c d^3 + x_n d^2 + x_(n−1) d + x_(n−2)
• Estimated quotient-digit
q = [(R + Rδ/d) / d^2 + µ0 c + µ0 x_n / d] ≈ µR / d^3
Quotient correction
• Mod reduction time
– SW: 1.25n^2 + n (mult = 10 adds: 1.025n^2 + n)
– HW: n^2 + n
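The shorter reciprocal µ = d + µ0 + δ can be checked numerically with a small sketch. It assumes 16-bit digits, stores δ as the integer 2δ (0 or 1), and clamps the edge case m = ½ d^n so that µ0 stays one digit; that clamp and the function names are assumptions of this sketch. The estimate is evaluated as floor(µR/d^3) on the 4-digit accumulator, not with the slide's exact term-by-term formula.

```python
D_BITS = 16
D = 1 << D_BITS

def shorter_reciprocal(m, n):
    """Return (mu0, two_delta) such that mu = d + mu0 + two_delta / 2."""
    assert D ** n // 2 <= m < D ** n, "m must be normalized"
    t = min(2 * D ** (n + 1) // m, 4 * D - 1)    # t = 2*mu; clamp only hits m = d^n / 2
    return (t >> 1) - D, t & 1

def estimate_quotient_digit(acc, mu0, two_delta):
    """floor(mu * acc / d^3) for acc = c d^3 + x_n d^2 + x_(n-1) d + x_(n-2)."""
    t = 2 * (D + mu0) + two_delta                # 2*mu as an integer
    return (t * acc) >> (3 * D_BITS + 1)

n = 4
m = 0xF123456789ABCDEF
mu0, two_delta = shorter_reciprocal(m, n)
```

Because µ carries only about one digit of precision here, the estimate can be off by one noticeably more often than with the two-digit reciprocal, which fits the larger software correction cost quoted above.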
Modulus Scaling
• Special m: NO multiplication for quotient-digit
– Quotient digit: q = rn +1
– (0F) MS digit of m = d − 1 = 11…1₂
– (10) MS 2 digits of m = {1,0}
• Transform m: 1-digit scaling factor S
– mS is n+1-digit
– Last reduction step is with m → n-digit result
Need to store m and mS
• Faster than Montgomery: n^2 + const
→ Montgomery with modulus scaling: n^2 + const
– LS digit of m = d − 1 = 11…1₂ (xF)
– Last reduction step is with m → n-digit result
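The idea of reducing with a scaled modulus can be sketched as follows, for the case where the MS digit of mS is d − 1. Several choices here are assumptions of the sketch rather than the slide's method: S is simply the smallest integer that makes mS an (n+1)-digit number with MS digit d − 1 (its size is not restricted to one digit), the quotient digit is taken as the MS digit of R with a correction loop instead of the slide's exact rule, and the final step with m uses Python's `%`.

```python
D_BITS = 16
D = 1 << D_BITS

def scale_modulus(m, n):
    """Return (S, m*S) with (d-1) d^n <= m*S < d^(n+1)."""
    S = -(-(D - 1) * D ** n // m)        # ceil((d-1) d^n / m)
    return S, m * S

def reduce_scaled(x, m, mS, n):
    """x mod m: digit-serial reduction by mS, then one last step with m."""
    R = 0
    n_digits = (x.bit_length() + D_BITS - 1) // D_BITS
    for k in reversed(range(n_digits)):
        R = R * D + ((x >> (D_BITS * k)) & (D - 1))   # shift in the next digit
        q = R >> (D_BITS * (n + 1))                   # MS digit of R, no multiplication
        R -= q * mS
        while R >= mS:                                # q may be slightly too small
            R -= mS
    return R % m                                      # last reduction step is with m

n = 4
m = 0xF123456789ABCDEF
S, mS = scale_modulus(m, n)
```

Since mS is a multiple of m, every intermediate value stays congruent to x modulo m, so only the last step has to use the original modulus. Both m and mS have to be kept in memory.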
Summary
Algorithm    | Storage beyond operands (#digits) | Pre-processing | Post-processing | #Digit-products + fixes in SW | Extra HW                              | #Digit-products with extra HW
Barrett      | 2n                                | O(n^2)         | −               | n^2+5n                        | −                                     | n^2+5n
LRL4         | −                                 | −              | −               | 1.0001n^2+4n                  | Shifter                               | n^2+4n
LRL3         | −                                 | −              | −               | 1.0001n^2+3n                  | Shifter                               | n^2+3n
LRL2         | −                                 | −              | −               | 2n^2+2n                       | Shifter                               | n^2+2n
LRL1         | −                                 | −              | −               | 1.25n^2+n                     | Shifter, Accu-adder                   | n^2+n
S0F          | n                                 | n              | −               | 1.25n^2                       | Shifter, Accu-adder                   | n^2
S10          | n                                 | n              | −               | (1+ε)n^2                      | Shifter, Accu-adder (signed)          | n^2
S10-2        | n                                 | n              | −               | (1+ε)n^2                      | Accu-adder (signed)                   | n^2 + εn^2 adds
Montgomery   | −                                 | O(n^2)         | O(n^2)          | n^2+n                         | −                                     | n^2+n
Montgomery-T | n                                 | O(n^2)         | O(n^2)          | n^2                           | −                                     | n^2