Solving the Greatest Common Divisor Problem in Parallel



Solving the Greatest Common Divisor Problem in Parallel
Derrick Coetzee
University of California, Berkeley
CS 273, Fall 2010, Prof. Satish Rao

The problem and serial solutions

• Given two positive integers (represented in binary), find the largest positive integer that divides both
• Let n be the total number of bits in the inputs
• Solvable in O(n) divisions and O(n^2) bit operations with the Euclidean algorithm:
  – while b ≠ 0: (a, b) ← (b, a mod b)
• Matrix formulation: each step is the linear map (a, b) ← [0 1; 1 −⌊a/b⌋](a, b), so a run of steps composes into a single 2×2 matrix
  • Schönhage uses this to solve in O(n log^2 n log log n)
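For reference, here is a minimal serial sketch of the Euclidean algorithm and of its matrix formulation (added to this transcript, not from the slides); euclid_gcd_matrix and its variable names are illustrative only.

def euclid_gcd(a, b):
    """Plain serial Euclidean algorithm: O(n) divisions on n-bit inputs."""
    while b != 0:
        a, b = b, a % b
    return a


def euclid_gcd_matrix(a, b):
    """Same algorithm, recording each step as a 2x2 integer matrix.

    One step (a, b) <- (b, a mod b) is the linear map with matrix
    [[0, 1], [1, -q]] where q = a // b; the product M of all step
    matrices maps the original pair (a0, b0) to (gcd, 0).
    """
    M = [[1, 0], [0, 1]]            # accumulated product of step matrices
    while b != 0:
        q = a // b
        step = [[0, 1], [1, -q]]
        M = [[step[0][0] * M[0][0] + step[0][1] * M[1][0],
              step[0][0] * M[0][1] + step[0][1] * M[1][1]],
             [step[1][0] * M[0][0] + step[1][1] * M[1][0],
              step[1][0] * M[0][1] + step[1][1] * M[1][1]]]
        a, b = b, a % b
    return a, M


assert euclid_gcd(105, 30) == 15
g, M = euclid_gcd_matrix(105, 30)
assert (M[0][0] * 105 + M[0][1] * 30, M[1][0] * 105 + M[1][1] * 30) == (g, 0)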

Adding parallelism: Algorithm PM

• Brent & Kung (1982): reduce the time of each step to O(1) (O(n) bit operations per step)
  – α ← n; β ← n  (where a, b ≤ 2^n and a odd)
    while b ≠ 0:
      while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1 }
      if α ≥ β { swap(a, b); swap(α, β) }
      if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
      else: b ← ⌊(a−b)/2⌋
  – Test "a + b ≡ 0 (mod 4)" uses only the 2 lowest bits of the sum, which enables effective pipelining of the steps
  – Number of steps is still linear; requires n processors
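For concreteness, below is a serial Python sketch of the PM recurrence as written above (added to this transcript, not from the talk); the preprocessing that strips common powers of two and makes a odd is an assumed setup step, and the O(1)-per-step parallel evaluation is not modeled.

from math import gcd   # only used to check the sketch


def pm_gcd(a, b):
    """Serial simulation of the Brent & Kung plus-minus (PM) recurrence."""
    if a == 0:
        return abs(b)
    if b == 0:
        return abs(a)
    # Strip common powers of two and make a odd (the slide assumes a odd).
    shift = 0
    while a % 2 == 0 and b % 2 == 0:
        a //= 2; b //= 2; shift += 1
    if a % 2 == 0:
        a, b = b, a
    n = max(a.bit_length(), abs(b).bit_length())
    alpha, beta = n, n                  # upper bounds on the bit lengths of a, b
    while b != 0:
        while b % 2 == 0:               # remove factors of two from b
            b //= 2; beta -= 1
        if alpha >= beta:               # keep the larger bound on b (so alpha <= beta)
            a, b = b, a
            alpha, beta = beta, alpha
        # a and b are both odd, so exactly one of (a+b)/2, (a-b)/2 is even;
        # the mod-4 test selects it using only the two lowest bits.
        if (a + b) % 4 == 0:
            b = (a + b) // 2
        else:
            b = (a - b) // 2
    return abs(a) << shift


assert pm_gcd(105, 30) == gcd(105, 30) == 15
assert pm_gcd(143, 221) == gcd(143, 221) == 13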

Algorithm PM: Example

• GCD(105, 30) (a = 1101001₂, b = 11110₂)
• α ← 7; β ← 7
• b ← 1111₂; β ← 6
• swap (a ← 1111₂, b ← 1101001₂, α ← 6, β ← 7)
• a + b ≡ 11₂ + 01₂ ≡ 00₂ (mod 4)
• Compute trailing zeroes of new b and 2 more bits:
  – b ← ⌊(a+b)/2⌋ = (01111₂ + …01001₂)/2 = …1100₂
• b ← …11₂; β ← 5
• swap (a ← …11₂, b ← 1111₂, α ← 5, β ← 6)

Algorithm PM: Example

• a = …11₂, b = 1111₂, α = 5, β = 6
• a + b ≡ 11₂ + 11₂ ≡ 10₂ (mod 4)
• Compute LSB of new b while the previous iteration computes the next bit of a in parallel:
  – b' = ⌊(a−b)/2⌋ = …0
  – a = …111₂
• With another bit of a, can compute the next bit of b'
• Final result: a = 1111₂, b' = 0
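The short script below replays this example with full-precision integers (the bit-by-bit pipelined evaluation shown on the slides is not modeled); it is an illustrative aid that reuses the loop structure of the earlier pm_gcd sketch and assumes a is odd.

def pm_trace(a, b):
    """Print the (a, b, alpha, beta) values after each PM iteration."""
    n = max(a.bit_length(), b.bit_length())
    alpha = beta = n
    print(f"start: a={a:b}, b={b:b}, alpha={alpha}, beta={beta}")
    while b != 0:
        while b % 2 == 0:
            b //= 2; beta -= 1
        if alpha >= beta:
            a, b = b, a
            alpha, beta = beta, alpha
        b = (a + b) // 2 if (a + b) % 4 == 0 else (a - b) // 2
        print(f"after iteration: a={a:b}, b={b:b}, alpha={alpha}, beta={beta}")
    return a


pm_trace(105, 30)   # final line shows a = 1111 (15) and b = 0, matching the slides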

Sublinear time GCD

• Kannan et al. (1987): Break the problem into n/log n steps, each of which takes log log n time and eliminates log n bits of each input.

– CRCW, O(n log log n / log n) time, O(n^2 log^2 n) processors
– Idea: instead of a mod b, compute pa mod b for all 0 ≤ p ≤ n
  • If gcd(a, b) = d and 0 ≤ p ≤ n, then gcd(pa, b)/d ≤ n
  • Pigeonhole principle gives first log n bits same for at least two p
– gcd(a, b) = gcd(b, (p₁a mod b) − (p₂a mod b)) (ignoring factors ≤ n)
– Only need to know the first log n bits to identify p₁, p₂ – can be computed in log log n time
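Below is a toy, purely serial sketch of one such reduction step (the function name and the way the leading bits are compared are illustrative assumptions, not the paper's construction); the point of the parallel algorithm is that the residues and the collision can be found in O(log log n) time on the processor bound above.

def reduction_step(a, b, n):
    """Find two residues p*a mod b sharing their leading ~log n bits and
    return (b, |difference|), which has the same gcd as (a, b) up to
    factors <= n and whose second component is ~log n bits shorter.

    Assumes b is much longer than log n bits."""
    k = max(1, n.bit_length() - 1)        # ~ floor(log2 n) leading bits
    width = b.bit_length()
    seen = {}                             # leading bits -> residue
    for p in range(n + 1):
        r = (p * a) % b
        top = r >> max(0, width - k)      # leading ~log n bits of the residue
        if top in seen:                   # pigeonhole guarantees a collision
            return b, abs(r - seen[top])
        seen[top] = r

For the example on the next slide, reduction_step(143, 221, 8) returns (221, 13), and gcd(221, 13) = 13 = gcd(143, 221).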

Sublinear time GCD: Example

• GCD(143, 221), n = 8
  – (0×143) mod 221 = 000…₂
  – (1×143) mod 221 = 100…₂
  – (2×143) mod 221 = 010…₂
  – (3×143) mod 221 = 110…₂
  – (4×143) mod 221 = 100…₂
  – (5×143) mod 221 = 001…₂
  – (6×143) mod 221 = 110…₂
  – (7×143) mod 221 = 011…₂
  – (8×143) mod 221 = 001…₂
• (4×143) mod 221 − (1×143) mod 221 = 13
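For checking, the table above can be regenerated in a few lines (the slide shows only the leading three bits of each 8-bit residue):

a, b, n = 143, 221, 8
for p in range(n + 1):
    r = (p * a) % b
    # 5 = 8 - 3: drop all but the leading three bits of the 8-bit residue
    print(f"({p} x 143) mod 221 = {r:08b}  (leading bits {r >> 5:03b})")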

Sublinear time GCD: Using matrix form

• Problem: (p₁a mod b) − (p₂a mod b) takes log n time
  – Carry-lookahead adder needs log n time to add n-bit numbers
  – Idea: delay updates to (a, b) for log n steps, collecting the updates in a 2×2 matrix T, then compute T(a, b) in log n time
  – Results in n/log^2 n phases, each taking log n time
  – Each step performs a constant number of log^2 n-bit operations
    • Ensures that each step still takes log log n time
  – Time dominated by the cost of the O(n/log n) steps
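The sketch below illustrates only the bookkeeping of the delayed-update idea: any run of steps of the form (a, b) ← (b, a − q·b) composes into a single 2×2 integer matrix T, so one full-length update T(a, b) can be applied per phase. It uses plain Euclidean steps driven by full-precision "shadow" values as a stand-in; making the choices from leading bits only, as the algorithm actually does, is not modeled.

def batched_phase(a, b, k):
    """Collect k Euclidean-style steps in a 2x2 matrix T, then apply T once.

    Returns the pair (a, b) after k steps, computed as a single
    matrix-vector product T(a0, b0)."""
    T = [[1, 0], [0, 1]]
    sa, sb = a, b                     # shadow values that drive the choices
    for _ in range(k):
        if sb == 0:
            break
        q = sa // sb
        sa, sb = sb, sa - q * sb
        # One step (x, y) -> (y, x - q*y) as a matrix, composed onto T.
        T = [[T[1][0], T[1][1]],
             [T[0][0] - q * T[1][0], T[0][1] - q * T[1][1]]]
    # The single "long" update that actually needs the carry-lookahead adder:
    return (T[0][0] * a + T[0][1] * b,
            T[1][0] * a + T[1][1] * b)


assert batched_phase(105, 30, 2) == (15, 0)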

Getting rid of the log log n factor

• Chor and Goldreich (1990): Parallelize Brent & Kung's PM algorithm instead of the Euclidean algorithm
  – α ← n; β ← n
    while b ≠ 0:
      while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1 }
      if α ≥ β { swap(a, b); swap(α, β) }
      if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
      else: b ← ⌊(a−b)/2⌋
  – Idea: pack log n iterations of the PM algorithm into each phase
    • Each phase can be done in constant time
  – Requires O(n/log n) time on O(n^(1+ε)) processors
  – Best-known algorithm

log n iterations in constant time

– Rewrite PM to track only δ = α − β (so α and β no longer appear):
    δ ← 0
    while b ≠ 0:
      while b ≡ 0 (mod 2) { b ← b/2; δ ← δ + 1 }
      if δ ≥ 0 { swap(a, b); δ ← −δ }
      if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
      else: b ← ⌊(a−b)/2⌋
– δ together with the last log n + 1 bits of a and b determines the transformation matrix of the next log n iterations
– Precompute a table – look up the matrix, then do the matrix multiplication
  • Technicality: need to treat |δ| ≤ log n and |δ| > log n differently
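The sketch below derives the composed transformation of k PM iterations by simply running them while multiplying out the per-step matrices (halve, swap, plus-minus). Chor & Goldreich instead read this matrix from a precomputed table indexed by δ and the low bits of a and b, so treat this as an illustration of why a single matrix per phase suffices, not as their construction.

from fractions import Fraction


def pm_phase_matrix(a, b, delta, k):
    """Run k PM iterations and return (M, delta') such that M applied to the
    original (a, b) gives the pair after those iterations.

    Entries of M are dyadic rationals (integer over a power of two)."""
    half = Fraction(1, 2)
    M = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]

    def compose(step):                      # M <- step * M
        nonlocal M
        M = [[step[0][0] * M[0][0] + step[0][1] * M[1][0],
              step[0][0] * M[0][1] + step[0][1] * M[1][1]],
             [step[1][0] * M[0][0] + step[1][1] * M[1][0],
              step[1][0] * M[0][1] + step[1][1] * M[1][1]]]

    for _ in range(k):
        if b == 0:
            break
        while b % 2 == 0:                   # b <- b/2
            b //= 2; delta += 1
            compose([[1, 0], [0, half]])
        if delta >= 0:                      # swap(a, b)
            a, b = b, a; delta = -delta
            compose([[0, 1], [1, 0]])
        if (a + b) % 4 == 0:                # b <- (a+b)/2
            b = (a + b) // 2
            compose([[1, 0], [half, half]])
        else:                               # b <- (a-b)/2
            b = (a - b) // 2
            compose([[1, 0], [half, -half]])
    return M, delta


M, d = pm_phase_matrix(105, 30, 0, 2)
assert (M[0][0] * 105 + M[0][1] * 30, M[1][0] * 105 + M[1][1] * 30) == (15, 0)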

Multiplication in constant time

• How to multiply an n-bit number by a (log n)-bit number in constant time?

– Represent both numbers in base 2^(log n)
– Precompute a multiplication table for this base
– Separate the larger number into a sum of two numbers, each having zero for every other digit, e.g. 1234 = 1030 + 204
– Multiply both by the smaller number; no carries are required
  • Example: 1234 × 7 = (1030 + 204) × 7 = 7210 + 1428 = 8638
– Finally, Chandra et al. (1983) show we can add two n-bit numbers in O(1) time with O(n^2) processors
  • But this requires concurrent writes
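A toy model of the digit-splitting trick, using base 10 as the slide's example does (the algorithm uses base 2^(log n)); the helper name and the asserts are illustrative only.

B = 10     # stand-in for the base 2^(log n)


def split_alternating(x):
    """Split x into (even_slots, odd_slots): the base-B digits at even and
    odd positions, so x == even_slots + odd_slots and each part has a zero
    between any two nonzero digits."""
    digits = []
    while x:
        digits.append(x % B)
        x //= B
    even_slots = sum(d * B ** i for i, d in enumerate(digits) if i % 2 == 0)
    odd_slots = sum(d * B ** i for i, d in enumerate(digits) if i % 2 == 1)
    return even_slots, odd_slots


even_slots, odd_slots = split_alternating(1234)
assert (even_slots, odd_slots) == (204, 1030)
# Each one-digit product (from the precomputed table) is at most two digits
# wide and lands next to a zeroed slot, so the digit products never interact:
assert even_slots * 7 == 1428 and odd_slots * 7 == 7210
# The single genuine addition is the one Chandra et al. do in O(1) time:
assert 1428 + 7210 == 1234 * 7 == 8638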

Complexity perspective: an analogy

• P = NP?
  – Open problem, but most practical problems have either been shown to be in P or else NP-hard.
  – Exceptions: integer factorization, graph isomorphism
    • If integer factorization is NP-complete, NP = coNP
    • If graph isomorphism is NP-complete, PH collapses to the 2nd level
• NC = P?
  – Open problem, but most practical problems have either been shown to be in NC or else P-hard.
  – Exceptions: GCD
    • What would GCD being P-complete imply?

Questions?