Transcript CS 584

CS 484
Sorting
One of the most common operations
 Definition:

– Arrange an unordered collection of
elements into a monotonically increasing or
decreasing order.

Two categories of sorting
– internal (fits in memory)
– external (uses auxiliary storage)
Sorting Algorithms

Comparison based
– compare-exchange
– O(n log n)

Noncomparison based
– Uses known properties of the elements
– O(n) - bucket sort etc.
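As a concrete sketch, counting sort (our example; bucket sort is similar in spirit) reaches O(n + k) by exploiting the known value range instead of comparing elements. The function name and `max_value` parameter are ours:

```python
def counting_sort(a, max_value):
    """O(n + k) sort for integers in [0, max_value]; no element comparisons."""
    counts = [0] * (max_value + 1)
    for x in a:                              # tally each value
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):       # emit values in order
        out.extend([value] * c)
    return out

print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], 9))  # → [1, 1, 2, 3, 4, 5, 6, 9]
```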
Parallel Sorting Issues

Input and Output sequence storage
– Where?
– Local to one processor or distributed

Comparisons
– How do we compare elements on different nodes?

# of elements per processor
– One (compare-exchange --> comm.)
– Multiple (compare-split --> comm.)
Parallel Sorting Algorithms
Merge Sort
 Quick Sort
 Bitonic Sort
 Others …

Merge Sort
Simplest parallel sorting algorithm?
 Steps

– Distribute the elements
– Everybody sort their own sequence
– Merge the lists

Problem
– How to merge the lists
Quicksort
Simple, low overhead
 O(n log n) on average
 Divide and conquer
 Divide recursively into smaller
subsequences.

Quicksort
n elements stored in A[1…n]
 Divide

– Divide a sequence into two parts
– A[q…r] becomes A[q…s] and A[s+1…r]
– make all elements of A[q…s] smaller than
or equal to all elements of A[s+1…r]

Conquer
– Recursively apply Quicksort
Quicksort
Partition the sequence A[q…r] by
picking a pivot.
 Performance is greatly affected by the
choice of the pivot.
 If we pick a bad pivot, we end up with an
O(n²) algorithm.
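A sequential sketch of the partition-and-recurse scheme, using a random pivot to make the O(n²) worst case unlikely (variable names follow the slides loosely):

```python
import random

def quicksort(a, q, r):
    """In-place quicksort of a[q..r] (0-based here): partition around a pivot
    so the left part is <= pivot and the right part is >= pivot, then recurse."""
    if q >= r:
        return
    pivot = a[random.randint(q, r)]   # random pivot guards against the O(n^2) case
    i, j = q, r
    while i <= j:
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort(a, q, j)                # conquer: recurse on both parts
    quicksort(a, i, r)
```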

Parallelizing Quicksort

Task parallelism
– At each step of the algorithm 2 recursive
calls are made.
– Farm out one of the recursive calls to
another processor.

Problems
– The work of partitioning is done by one
processor.
Parallelizing Quicksort
Consider domain decomposition.
 Hypercube

– a d dimensional hypercube can be split into two
(d-1) dimensional hypercubes such that each
processor in one cube is connected to one in the
other cube.

If all processors know the pivot, neighbors
split their respective lists: all elements
larger than the pivot go to one subcube and
all smaller elements go to the other subcube.
Parallelizing Quicksort

After we go through each dimension, if
n>p the numbers are not totally sorted.
– Why?
Each processor then sorts its own
sublist using a sequential quicksort.
 Pivot selection is particularly important

– Bad pivots eliminate some processors
Pivot Selection

Random selection
– During the ith split one of the processors in
each subcube picks a random element
from its list and broadcasts to others.

Problem
– What if a bad pivot is selected at first?
Pivot Selection

Median selection
– If the distribution is uniform, then each
processor's list is a representative sample,
and thus its median is representative of the full list.

Problem
– Is the distribution really uniform?
– Can we assume that a single processor's
list has the same distribution as the full list?
Procedure HypercubeQuickSort(B)
  sort B using sequential quicksort
  for i = 1 to d
    select pivot and broadcast or receive pivot
    partition B into B1 and B2 such that B1 <= pivot < B2
    if ith bit of iproc is zero then
      send B2 to neighbor along ith dimension
      C = subsequence received along ith dimension
      merge B1 and C into B
    else
      send B1 to neighbor along ith dimension
      C = subsequence received along ith dimension
      merge B2 and C into B
    endif
  endfor
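A sequential Python simulation of the procedure above (the function name and the median-of-subcube pivot rule are our assumptions; a real implementation communicates via messages and merges sorted runs instead of re-sorting):

```python
def hypercube_quicksort(lists):
    """Simulate hypercube quicksort: lists[iproc] holds the data on
    processor iproc; len(lists) must be a power of two (2^d)."""
    p = len(lists)
    d = p.bit_length() - 1                   # dimension of the hypercube
    lists = [sorted(b) for b in lists]       # each node sorts locally first
    for i in range(d - 1, -1, -1):           # split along one dimension per step
        size = 1 << (i + 1)                  # processors per current subcube
        for base in range(0, p, size):
            # pivot = median of the subcube's data ("broadcast" to members)
            pool = sorted(x for q in range(base, base + size) for x in lists[q])
            if not pool:
                continue
            pivot = pool[len(pool) // 2]
            for low in range(base, base + size // 2):
                high = low | (1 << i)        # neighbor along dimension i
                small = [x for x in lists[low] + lists[high] if x <= pivot]
                large = [x for x in lists[low] + lists[high] if x > pivot]
                lists[low], lists[high] = sorted(small), sorted(large)
    return lists

parts = hypercube_quicksort([[5, 9, 1], [7, 3, 8], [2, 6, 4], [0, 11, 10]])
print([x for part in parts for x in part])  # → [0, 1, ..., 11], globally sorted
```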
Analysis
Iterations = log2(p)
 Select a pivot = O(n)
– keep sublist sorted
 Broadcast pivot = O(log2(p))
 Split the sequence
– split own sequence = O(log(n/p))
– exchange blocks with neighbor = O(n/p)
– merge blocks = O(n/p)
Hypercube Quicksort Model
[Chart: modeled execution time vs. processors (1 to 8192), one curve per problem size n = 10^3 to 10^9]

Execution Time =
  MyPortionSortTime +
  NumSteps * (PivotSelection + Exchange + CompareData)
Speedup

[Chart: speedup vs. processors (1 to 8192), one curve per problem size n = 10^3 to 10^9, with a linear-speedup reference line]

Execution Time =
  n/p * log2(n/p) * CompareTime +
  log2(p) * ((latency + 1/bandwidth) +
             2*(latency + n/(p*bandwidth)) +
             CompareTime * 2*n/p)
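For illustration, the execution-time model above can be evaluated directly; the machine constants below are invented placeholders, not measured values from the slides:

```python
import math

def modeled_time(n, p, compare=1e-8, latency=1e-5, bandwidth=1e8):
    """Evaluate the execution-time model; constants are illustrative only."""
    local = (n / p) * math.log2(n / p) * compare        # sort own portion
    per_step = ((latency + 1 / bandwidth)               # broadcast pivot
                + 2 * (latency + n / (p * bandwidth))   # exchange blocks
                + compare * 2 * n / p)                  # merge/compare data
    return local + math.log2(p) * per_step              # log2(p) split steps

print(modeled_time(10**6, 64))  # modeled time for n = 10^6 on 64 processors
```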

Analysis
Quicksort appears very scalable
 Depends heavily on the pivot
 Easy to parallelize


Hypercube sorting algorithms depend
on the ability to map a hypercube onto
the node communication architecture.
Sorting Networks

Specialized hardware for sorting
– based on comparators

[Diagram: a comparator takes inputs x and y; a decreasing comparator outputs (max{x,y}, min{x,y}); an increasing comparator outputs (min{x,y}, max{x,y})]

Compare-Exchange
Compare-Split
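The two comparator types and the compare-split operation can be sketched as plain functions (the names are ours):

```python
def increasing_comparator(x, y):
    """Compare-exchange: smaller value on the first output wire."""
    return min(x, y), max(x, y)

def decreasing_comparator(x, y):
    """Compare-exchange: larger value on the first output wire."""
    return max(x, y), min(x, y)

def compare_split(a, b):
    """Compare-split for multiple elements per node: merge the two sorted
    blocks, then each side keeps its half of the merged result."""
    merged = sorted(a + b)
    return merged[:len(a)], merged[len(a):]

print(compare_split([1, 4, 9], [2, 3, 7]))  # → ([1, 2, 3], [4, 7, 9])
```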
Sorting Network
Bitonic Sort

Key operation:
– rearrange a bitonic sequence to ordered

Bitonic Sequence
– a sequence of elements <a0, a1, …, a(n-1)> such that either
– there exists i such that <a0, …, ai> is
monotonically increasing and <a(i+1), …, a(n-1)> is
monotonically decreasing, or
– there exists a cyclic shift of indices such that
the above is satisfied.

Bitonic Sequences

<1, 2, 4, 7, 6, 0>
– First it increases then decreases
– i=3

<8, 9, 2, 1, 0, 4>
– Consider a cyclic shift
– i will equal 2 or 3
Rearranging a Bitonic Sequence






Let s = <a0, a1, …, a(n-1)> be bitonic, where a(n/2) begins the decreasing part
Let s1 = <min{a0, a(n/2)}, min{a1, a(n/2+1)}, …, min{a(n/2-1), a(n-1)}>
Let s2 = <max{a0, a(n/2)}, max{a1, a(n/2+1)}, …, max{a(n/2-1), a(n-1)}>
In sequence s1 there is an element bi = min{ai, a(n/2+i)} such that
– all elements before bi come from the increasing part
– all elements after bi come from the decreasing part
Sequence s2 has a similar crossover point
Sequences s1 and s2 are both bitonic
Rearranging a Bitonic Sequence
Every element of s1 is smaller than
every element of s2
 Thus, we have reduced the problem of
rearranging a bitonic sequence of size n
to rearranging two bitonic sequences of
size n/2 then concatenating the
sequences.
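The min/max split and recursive rearrangement can be sketched in Python; the function below assumes n is a power of two, and the name is ours:

```python
def bitonic_merge(s, ascending=True):
    """Rearrange a bitonic sequence into sorted order: take elementwise
    min/max of the two halves (every min is <= every max, and both halves
    are again bitonic), recurse, then concatenate."""
    n = len(s)                       # assumed a power of two
    if n == 1:
        return s
    half = n // 2
    lo = [min(s[k], s[k + half]) for k in range(half)]
    hi = [max(s[k], s[k + half]) for k in range(half)]
    if not ascending:
        lo, hi = hi, lo              # flip halves for a decreasing merge
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

print(bitonic_merge([3, 5, 8, 9, 7, 4, 2, 1]))  # → [1, 2, 3, 4, 5, 7, 8, 9]
```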

Rearranging a Bitonic Sequence
Bitonic Merging Network
What about unordered lists?
– To use the bitonic merge for n items, we must
first have a bitonic sequence of n items.
– Any two elements form a bitonic sequence.
– So any unsorted sequence is a concatenation of
bitonic sequences of size 2.
– Merge those into larger bitonic sequences
until we end up with a single bitonic sequence of
size n.
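That bottom-up construction is the full bitonic sort; a recursive Python sketch follows (helper names are ours; n is assumed to be a power of two):

```python
def bitonic_sort(s, ascending=True):
    """Sort one half ascending and the other descending (their
    concatenation is bitonic), then bitonic-merge the whole sequence."""
    n = len(s)                       # assumed a power of two
    if n == 1:
        return s
    first = bitonic_sort(s[:n // 2], True)     # ascending half
    second = bitonic_sort(s[n // 2:], False)   # descending half
    return merge_step(first + second, ascending)

def merge_step(s, ascending):
    """Min/max split of a bitonic sequence, then recurse on both halves."""
    n = len(s)
    if n == 1:
        return s
    half = n // 2
    lo = [min(s[k], s[k + half]) for k in range(half)]
    hi = [max(s[k], s[k + half]) for k in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return merge_step(lo, ascending) + merge_step(hi, ascending)

print(bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14]))  # → [3, 5, 8, 9, 10, 12, 14, 20]
```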
Creating a Bitonic Sequence
[Diagram: 16-wire bitonic sorting network (wires 0000 through 1111), tracing example values through the compare-exchange stages until the list is sorted]
Mapping onto a hypercube
One element per processor
 Start from the sorting network mapping
 Each wire represents a processor
 Map processors to wires to minimize the
distance traveled during exchange

Bitonic Merge on Hypercube
Bitonic Sort
Procedure BitonicSort
  for i = 0 to d-1
    for j = i downto 0
      if (i+1)st bit of iproc <> jth bit of iproc then
        comp_exchange_max(j, item)
      else
        comp_exchange_min(j, item)
      endif
    endfor
  endfor
comp_exchange_max and comp_exchange_min compare and exchange the item with
the neighbor along the jth dimension, keeping the larger or smaller element
respectively.
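A sequential Python simulation of this procedure, with one element per "processor" (the simulation itself is ours; the keep-max/keep-min rule follows the if statement above):

```python
def hypercube_bitonic_sort(items):
    """Simulate hypercube bitonic sort: items[iproc] is processor iproc's
    single element; len(items) must be a power of two (2^d)."""
    p = len(items)
    d = p.bit_length() - 1
    a = list(items)
    for i in range(d):                          # stages 0 .. d-1
        for j in range(i, -1, -1):              # dimensions i, i-1, ..., 0
            nxt = a[:]
            for iproc in range(p):
                partner = iproc ^ (1 << j)      # neighbor along dimension j
                if ((iproc >> (i + 1)) & 1) != ((iproc >> j) & 1):
                    nxt[iproc] = max(a[iproc], a[partner])  # comp_exchange_max
                else:
                    nxt[iproc] = min(a[iproc], a[partner])  # comp_exchange_min
            a = nxt
    return a

print(hypercube_bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14]))
```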
Bitonic Sort Stages
Assignment
Pick 16 random integers
 Draw the Bitonic Sort network
 Step through the Bitonic sort network to
produce a sorted list of integers.
 Explain how the if statement in the
Bitonic sort algorithm works.
