
Tonga Institute of Higher Education

Design and Analysis of Algorithms

IT 254

Lecture 3:

Sorting


Sorting

Sorting, in computer science, is defined as an algorithm that takes a list of items and returns an ordered list of items based on some key. We will only talk about lists of numbers sorted in ascending order, but sorting can work on any type of data if the algorithm is written correctly. Sorting has always been a very important idea in computer science, and many different algorithms have been developed to sort.

We will look at, and analyze, a few of the familiar sorts, including heap sort, quicksort, and linear-time sorting.

Simple Sorting

• But before we start looking at complicated sorting algorithms that run quickly, it is helpful to understand how even simple sorting works.

• If we were going to sort a list of numbers in our head, how would we do it?

• One simple algorithm (for example, on the list 17 5 32 36 11 8):
  1. Start at the beginning of the list and go to the end of the list
  2. Find the smallest number and swap it with the front
  3. Move one position over and repeat

Simple Sorting

Worked example (following the slide diagrams):

  1. 17 5 32 36 11    start; scanning the whole list, the smallest number is 5
  2. 5 17 32 36 11    5 has been swapped with the front; move the place pointer one position over
  3. 5 17 32 36 11    scanning from the place pointer, the next smallest number is 11
  4. 5 11 32 36 17    11 has been swapped into the next place

Then repeat until the place pointer is pointing to the end of the array. You'll have a sorted list.

Simple Sorting Algorithm

● Let's look at some Pseudo-Code for simple sorting.

SimpleSort(Array A)
    for k = 0 to endof(A)
        int smallest = A[k], int smallIndex = k
        for j = k to endof(A)
            if A[j] < smallest THEN smallest = A[j], smallIndex = j
        end loop
        int temp = A[k]
        A[k] = A[smallIndex]
        A[smallIndex] = temp
    end loop

Simple Sorting Code

● Now let's look at the actual code (almost the same):

for (int k = 0; k < n; k++) {
    int smallest = A[k];
    int sIndex = k;
    for (int j = k; j < n; j++) {
        if (A[j] < smallest) {
            smallest = A[j];
            sIndex = j;
        }
    }
    int temp = A[k];
    A[k] = A[sIndex];
    A[sIndex] = temp;
}

Analyzing the Simple Sort

• Now, let us try to use our algorithm analysis skills to analyze the Simple Sort – There are two loops, one inside of the other.

– The first loop will run O(n) times. (This should be obvious) – But how many times does the second loop run?

– The second loop will start at j = k and go to n, so can we figure out how many times it runs?

– T(n) = (n+(n-1)+(n-2)+(n-3)+…+ 2 + 1) – This looks like a summation:

T(n) = Σ_{k=0}^{n-1} (n − k)

Analyzing Simple Sort

● So can we use this summation?
  – Yes, if we realize that it's really the arithmetic series. In our summation, we start at n and go down to 1. The arithmetic series usually starts at 1 and goes to n, but it's the same either way.
● T(n) = (n + (n-1) + (n-2) + … + 2 + 1) = n(n+1)/2
● n(n+1)/2 = (n² + n)/2 = O(n²)
● So the running time will be O(n²)

Looking at Simple Sort


Looking at Simple Sorting:

http://www.tihe.org/courses/it254/code/SimpleSortC.html

http://www.tihe.org/courses/it254/code/SimpleSortVB.html

Is that all there is to sorting?

● Simple sorting uses a very easy algorithm to sort a list. Do we even need to know other algorithms?
● Unfortunately, the simple sorting algorithm is very, very slow. It is O(n²), which we do not like.
● If your array size is 1 million, then 1 million * 1 million operations is a very large number.
● The rest of the algorithms we will look at have much better running times, like O(n lg n) and even O(n).
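To put rough numbers on that difference (an illustrative calculation, not from the slide), take n = 10^6:

\[ n^2 = 10^{12}, \qquad n \lg n \approx 10^6 \cdot 20 = 2 \times 10^{7}, \]

so an O(n lg n) sort does on the order of 50,000 times fewer basic steps than an O(n²) sort on a million-element array.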

Heaps

● Before we look at the Heapsort algorithm, it's a good idea to understand what a heap is.
● A heap is a way to hold data. We can use different programming structures to hold data in a variety of ways. So we might think a heap looks like the following:

(Figure: a small tree holding the values 5, 3, 1, 4, 6, 8.) This is also known as a binary tree. It consists of "nodes" that are connected through links.

Heapsort

● A heap is like a binary tree:

(Figure: a tree holding the values 16, 14, 8, 7, 9, 2, 4, 10, 3.)

● Sometimes all the nodes are filled in, and this is called a complete binary tree
● Sometimes there are missing leaves, and this is called a "nearly" complete binary tree

Heaps

● Even though we think about heaps as trees, we have to realize that most of the time they will live in arrays

(Figure: the heap drawn as a tree with root 16 is the same as the array 16 14 10 8 7 9 3 2 4 1.)

Heaps

● To represent a complete binary tree as an array:
  – The root node is A[1]
  – Node i is A[i]
  – The parent of node i is A[i/2] (note: integer divide)
  – The left child of node i is A[2i]
  – The right child of node i is A[2i + 1]

(Figure: the same correspondence again; the tree rooted at 16 is stored as the array 16 14 10 8 7 9 3 2 4 1.)

Using heaps

● So if we are going to start using heaps in programming, we'll need functions to return the parent and the child nodes – So…

int Parent(i) { return i/2;     }
int Left(i)   { return 2*i;     }
int Right(i)  { return 2*i + 1; }

This will get us the nodes in the array for the parent and the left and the right child nodes.

The heap property

● Heaps also need to satisfy the heap property: A[Parent(i)] ≥ A[i] for all nodes i > 1
  – In other words, a parent is always bigger than (or equal to) its children
  – This means the largest valued node is stored at the top (also called the root node)
  – Every node below the root node has a smaller value
  – By following different properties of the heaps, we can ensure that heaps stay the way we want them to
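As a quick illustration (a minimal C sketch, not part of the course code; it uses the 1-indexed array layout described on the earlier slide), a function that checks the heap property might look like this:

/* Returns 1 if A[1..n] satisfies the max-heap property, 0 otherwise. */
int IsMaxHeap(int A[], int n) {
    for (int i = 2; i <= n; i++) {    /* every node except the root   */
        if (A[i / 2] < A[i]) {        /* parent must be >= its child  */
            return 0;
        }
    }
    return 1;
}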

Heap Height

● Definitions:
  – The height of a node in the tree = the number of edges on the longest path down to a leaf
  – The height of a tree = the height of its root
● What is the height of an n-element heap?
  – If there are n elements in the heap:
    – On the 0th level there is one element, the 1st level holds two elements, the 2nd level holds 4, the 3rd level holds 8. So at each level k there are 2^k elements.
    – To find how many levels there are (the height), we can take the log of n; thus the height of an n-element heap = ⌊log₂ n⌋
    – Example: if there are 15 elements, the levels hold 1 + 2 + 4 + 8 elements. Since log₂ 15 < 4, the height is 3.
● This is good because many heap operations will only take O(log n)

Heap Operations

● A heap will have a few functions that it uses to make sure nodes stay in the correct order.
● Heapify(): a function that will make sure, after an add or delete, that the heap property is still true
  – If: a node "x" in the heap has children left and right, and the two subtrees that start at left and right are assumed to be heaps
  – Then: the problem is that the subtree rooted at "x" may violate the heap property, because of a new insertion
  – To solve: let the value of the parent node "float down" so the subtree at "x" satisfies the heap property

Heapify

// A is the heap (saved in an array). "i" is the root of the subtree to fix.
void Heapify(A, i)
{
    // l and r are indexes
    largest = i;
    l = Left(i);
    r = Right(i);

    // if the "l" child exists and it is bigger than "i"
    if (l <= heap_size(A) && A[l] > A[i]) { largest = l; }
    else { largest = i; }

    // if the "r" child exists and is bigger than the current largest
    if (r <= heap_size(A) && A[r] > A[largest]) { largest = r; }

    // if largest is not the root node, then swap and
    // "float down" the tree
    if (largest != i) {
        Swap(A, i, largest);
        Heapify(A, largest);
    }
}

Heapify Example

Start with A = 16 4 10 14 7 9 3 2 8 1 and call Heapify(A, 2): the value 4 at node 2 is smaller than its children, so it must "float down".

  1. A = 16 4 10 14 7 9 3 2 8 1    node 2 (value 4) is smaller than its larger child, 14
  2. A = 16 14 10 4 7 9 3 2 8 1    swap 4 and 14; now node 4 (value 4) is smaller than its larger child, 8
  3. A = 16 14 10 8 7 9 3 2 4 1    swap 4 and 8; node 9 has no children, so the heap property is restored

Analyzing Heapsort

● Fixing the relationship between a parent and its child nodes takes Θ(1) time, so the question is: how many times do we need to "float down" a tree?
● If the heap at "x" has n elements, how many elements can the subtrees at left or right have?
  – Answer: 2n/3 (worst case: bottom row 1/2 full)
● So the time taken by Heapify() is given by T(n) = T(2n/3) + O(1)  // recursive
● By case 2 of the Master Theorem (for T(n) = aT(n/b) + n^c): T(n) = O(log n)
● Thus, Heapify() takes "logarithmic" time

Heapsort Functions

● We can build a heap by running Heapify() on subarrays, one after another.
● This works because, for an array of length n, all the elements in the range n/2+1 .. n are already heaps.
● This is because after position n/2 all the nodes are children (leaves), and they will be handled during the heapify of their parents regardless.
● Thus:
  – Walk backwards through the array from n/2 to 1, calling Heapify() on each node.
  – The order of processing guarantees that the children of node i are heaps when i is processed.

BuildHeap()

Build heap will go through the entire array, making heaps out of everything below it, and when it finishes, you know you will have an entire working heap.

// given an unsorted array A, make A a heap
BuildHeap(A)
{
    heap_size(A) = length(A);
    for (i = length(A)/2; i >= 1; i--) {
        Heapify(A, i);
    }
}

BuildHeap Example

● Think about the following example and see if you can walk through a BuildHeap() operation ● A = {4, 1, 3, 2, 16, 9, 10, 14, 8, 7}

4 1 3 2 16 9 10 14 8 7
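One way the walk-through goes (this trace is an illustration, not printed on the slide), using the 1-indexed BuildHeap() above, which calls Heapify on nodes 5 down to 1:

  Heapify(A, 5): 16 is larger than its only child 7, nothing moves   ->  4 1 3 2 16 9 10 14 8 7
  Heapify(A, 4): 2 floats down past 14                               ->  4 1 3 14 16 9 10 2 8 7
  Heapify(A, 3): 3 floats down past 10                               ->  4 1 10 14 16 9 3 2 8 7
  Heapify(A, 2): 1 floats down past 16, then past 7                  ->  4 16 10 14 7 9 3 2 8 1
  Heapify(A, 1): 4 floats down past 16, then 14, then 8              ->  16 14 10 8 7 9 3 2 4 1

The final array, 16 14 10 8 7 9 3 2 4 1, is the heap shown on the earlier "Heaps" slides.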

Analyzing BuildHeap()

● Each call to Heapify() takes O(lg n) time
● There are O(n) such calls (specifically, n/2)
● Remember, we don't worry about any constants, thus
  – O(n/2) -> O(n)
  – O(999n) -> O(n)
  – O(n²/1000) -> O(n²)
● Thus, the running time is O(n lg n)
  – Is this a correct asymptotic upper bound?
  – Is this an asymptotically tight bound?

Heapsort

● Given BuildHeap(), an in-place sorting algorithm is easily constructed:
  – The maximum element is at A[1]
  – Discard it by swapping it with the element at A[n]
    ● Decrement heap_size[A]
    ● A[n] now contains the correct value
  – Restore the heap property at A[1] by calling Heapify()
  – Repeat, always swapping A[1] for A[heap_size(A)]

Heapsort

So our full Heapsort algorithm looks like:

// A is an array
Heapsort(A)
{
    BuildHeap(A);
    for (i = length(A); i > 1; i--) {
        Swap(A[1], A[i]);
        heap_size(A) -= 1;
        Heapify(A, 1);
    }
}
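Putting the pieces together, here is a minimal self-contained C sketch (an illustration rather than the course's official code; it uses 0-indexed arrays, so the child formulas shift to 2i+1 and 2i+2):

#include <stdio.h>

static int heap_size;

static void swap(int A[], int i, int j) {
    int t = A[i]; A[i] = A[j]; A[j] = t;
}

/* Float A[i] down until the subtree rooted at i is a max-heap. */
static void heapify(int A[], int i) {
    int l = 2 * i + 1, r = 2 * i + 2, largest = i;
    if (l < heap_size && A[l] > A[largest]) largest = l;
    if (r < heap_size && A[r] > A[largest]) largest = r;
    if (largest != i) {
        swap(A, i, largest);
        heapify(A, largest);
    }
}

/* Nodes n/2 .. n-1 are leaves, so start heapifying at n/2 - 1. */
static void build_heap(int A[], int n) {
    heap_size = n;
    for (int i = n / 2 - 1; i >= 0; i--)
        heapify(A, i);
}

void heapsort(int A[], int n) {
    build_heap(A, n);
    for (int i = n - 1; i > 0; i--) {
        swap(A, 0, i);    /* move the current maximum to the end */
        heap_size--;      /* shrink the heap                     */
        heapify(A, 0);    /* restore the heap property           */
    }
}

int main(void) {
    int A[] = {4, 1, 3, 2, 16, 9, 10, 14, 8, 7};
    heapsort(A, 10);
    for (int i = 0; i < 10; i++) printf("%d ", A[i]);
    printf("\n");
    return 0;
}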

Analyzing Heapsort

● The call to BuildHeap() takes O(n lg n) time
● Each of the n - 1 calls to Heapify() takes O(lg n) time
● Thus the total time taken by HeapSort()
    = O(n lg n) + (n - 1) O(lg n)
    = O(2 n lg n)
    = O(n lg n)

Heapsort Questions

● What is the running time of Heapsort on an array A of length n that is already sorted in increasing order? What about decreasing order?
● Show that the running time of Heapsort is also Ω(n lg n)

Looking at heapsort:

http://www.tihe.org/courses/it254/code/HeapSortC.html

http://www.tihe.org/courses/it254/code/HeapSortVB.html

Quicksort

● Sorts in place, meaning it rearranges the numbers inside the original array instead of copying them into a second array
● Sorts in O(n lg n) in the average case
● Sorts in O(n²) in the worst case
● Another divide-and-conquer algorithm
  – Divide: the array A[p..r] is partitioned into two subarrays A[p..q] and A[q+1..r]
  – Conquer: the two subarrays are sorted by recursive calls to quicksort
  – Fact: all elements in A[p..q] are less than all elements in A[q+1..r]
● Quicksort is one of the most widely used sorting algorithms because it is very fast in the average case

Quicksort Code

Quicksort(A, p, r)
{
    if (p < r) {
        q = Partition(A, p, r);
        Quicksort(A, p, q);
        Quicksort(A, q+1, r);
    }
}

Partition() function

● Clearly, all the action takes place in the partition() function
  – Rearranges the subarray in place
  – End result:
    ● Two subarrays
    ● All values in the first subarray ≤ all values in the second
  – Returns the index of the "pivot" element separating the two subarrays

Partition() Function

● Partition(A, p, r):
  – Select an element to act as the "pivot"
  – Grow two regions, A[p..i] and A[j..r]
    ● All elements in A[p..i] <= pivot
    ● All elements in A[j..r] >= pivot
  – Increment i until A[i] >= pivot
  – Decrement j until A[j] <= pivot
  – Swap A[i] and A[j]
  – Repeat until i >= j
  – Return j

Partition Code

Partition(A, p, r)
{
    x = A[p];       // choose x as the "pivot"
    i = p - 1;
    j = r + 1;
    while (TRUE) {
        do { i++; } while (A[i] < x);   // move "i" forward
        do { j--; } while (A[j] > x);   // move "j" backwards
        if (i < j)                      // if i < j, swap and
            Swap(A, i, j);              // go around again
        else
            return j;                   // if i >= j, return
    }
}

Partition actually runs in O(n) time: can we prove it? (Hint: i and j only ever move toward each other, so together they make at most one pass over the subarray.)

Partition Example

Partition(A, 1, 8) on the array 5 3 2 6 4 1 3 7, with pivot x = 3 (as chosen on the slide).
i moves forward past elements smaller than x, j moves backwards past elements larger than x, out-of-place pairs are swapped, and when i and j cross, j is returned.
In the slide's trace the returned partition index is 3. Is this correct? Is everything on the left side <= everything on the right side? Remember, the pivot element is not included in this comparison.

Analyzing Quicksort

● What will be the worst case for the algorithm?
  – Partition is always unbalanced
● What will be the best case for the algorithm?
  – Partition is perfectly balanced
● Which is more likely?
  – The latter, by far, except...
● Will any input make it the worst case?
  – Yes: already-sorted input

Analyzing Quicksort

● In the worst case:
    T(1) = Θ(1)
    T(n) = T(n - 1) + Θ(n)
  Works out to T(n) = Θ(n²)  (can we do the recurrence?)
● In the best case:
    T(n) = 2T(n/2) + Θ(n)
  Which comes out to T(n) = Θ(n lg n)  (can we do the recurrence?)

Making Quicksort even better

● The real problem with quicksort is that it runs in O(n²) on already-sorted input
● Two possible solutions to this:
  – Randomize the input array, OR
  – Pick a random pivot element

● How will these solve the problem?
  – They make the pivot choices independent of the input order, so an already-sorted input no longer pushes the algorithm into its O(n²) behaviour
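As a sketch of the second idea (illustrative only; the helper names are mine, and the Partition() body is the Hoare-style routine from the earlier slide, written for a 1-indexed array A[p..r]):

#include <stdlib.h>   /* rand() */

static void Swap(int A[], int i, int j) {
    int t = A[i]; A[i] = A[j]; A[j] = t;
}

/* Hoare-style partition with A[p] as the pivot. */
static int Partition(int A[], int p, int r) {
    int x = A[p], i = p - 1, j = r + 1;
    for (;;) {
        do { i++; } while (A[i] < x);
        do { j--; } while (A[j] > x);
        if (i < j) Swap(A, i, j);
        else return j;
    }
}

/* Swap a randomly chosen element into the pivot slot, then partition. */
static int RandomizedPartition(int A[], int p, int r) {
    int k = p + rand() % (r - p + 1);   /* random index in [p, r] */
    Swap(A, p, k);
    return Partition(A, p, r);
}

void RandomizedQuicksort(int A[], int p, int r) {
    if (p < r) {
        int q = RandomizedPartition(A, p, r);
        RandomizedQuicksort(A, p, q);
        RandomizedQuicksort(A, q + 1, r);
    }
}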

Quicksort: The average case

● Assuming random input, the average-case running time is much closer to O(n lg n) than O(n²)
● Explanation:
  – Suppose that partition() always produces a 9-to-1 split. This is pretty unbalanced, but we can write the recurrence and check how it turns out
  – The recurrence is thus: T(n) = T(9n/10) + T(n/10) + n
  – What do we get as the answer?
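One way to see the answer (a quick sketch, not worked out on the slide): in the recursion tree for T(n) = T(9n/10) + T(n/10) + n, every level costs at most n, and the deepest path (the one that always takes the 9/10 branch) has depth \( \log_{10/9} n \), so

\[ T(n) \le n \log_{10/9} n + O(n) = O(n \lg n). \]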


Average Quicksort

● In real life though, quicksort will produce a mix of "bad" and "good" splits
  – If we choose splits at random…
  – Let us pretend that they alternate between best-case splits (n/2 and n/2) and worst-case splits (n-1 and 1)
● What happens if we take a bad split at the first node, then take a good split of the rest of the array (size n-1)?
  – We end up with three subarrays of size 1, (n-1)/2, and (n-1)/2
  – Combined cost of the splits = T(1) + 2T((n-1)/2) + n ≈ 2T(n/2) + n + 1, which is still the same recurrence
  – No worse than if we had taken a good split at the first node!

Average Quicksort

● The O(n) cost of a bad split (or 2 or 3 bad splits) can be "absorbed" into the O(n) cost of each good split
● Thus, the running time of alternating bad and good splits is still O(n lg n), just with slightly higher constants (which we don't care about)
● Now, can we prove it?


Average case Quicksort

● To make things easy, assume:
  – All inputs are different (no repeats)
  – A slightly different partition() procedure
    ● partition around a random element, which is not included in the subarrays
    ● all splits (0:n-1, 1:n-2, 2:n-3, …, n-1:0) are equally likely
● If the pivot point is random, then any particular split has a chance of 1/n of happening
● Then we can write a recurrence for the expected running time

Average case Quicksort

● Each split has a 1/n probability of occurring
● The splits (from 1 and n-1 through n-1 and 1) give running times of the form T(k) + T(n-1-k)
● The O(n) at the end comes from the running time of Partition
● Observe that for k = 1, 2, 3, …, n-1, each T(k) of the sum occurs once as T(k) and once as T(n-k). We can add these up to get the recurrence written out below.
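Putting those bullets together, the expected running time satisfies (written out here since the slide states it only in words):

\[ T(n) = \frac{1}{n}\sum_{k=0}^{n-1}\bigl(T(k) + T(n-1-k)\bigr) + \Theta(n) = \frac{2}{n}\sum_{k=0}^{n-1} T(k) + \Theta(n). \]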

Solving the average case

● We can solve this recurrence using the dreaded substitution method
  – Guess the answer
    ● T(n) = O(n lg n)
  – Assume that the hypothesis holds
    ● T(n) ≤ an lg n + b for some constants a and b
    ● Substitute it in for some value < n
  – Prove that it follows for n

Quicksort: Average case

● The recurrence we need to solve
● Plug in the inductive hypothesis
● Expand the k = 0 case
● 2b/n is just O(n), so fold it into the O(n) term

Quicksort Average Case

● The recurrence we are trying to solve
● Distribute the summation
● Evaluate the summation: b + b + b + … = b(n - 1)
● Since n - 1 < n, then 2b(n-1)/n < 2b

Quicksort Average Case

● The recurrence we are trying to solve
● Distribute the 2a/n
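A sketch of how those steps fit together (the algebra here is a reconstruction using the inductive hypothesis T(k) ≤ ak lg k + b and the standard bound \( \sum_{k=1}^{n-1} k\lg k \le \tfrac{1}{2}n^2\lg n - \tfrac{1}{8}n^2 \); the constants are illustrative):

\[
\begin{aligned}
T(n) &= \frac{2}{n}\sum_{k=0}^{n-1} T(k) + \Theta(n) \\
     &\le \frac{2}{n}\sum_{k=1}^{n-1} \bigl(ak\lg k + b\bigr) + \frac{2b}{n} + \Theta(n) \\
     &\le \frac{2a}{n}\sum_{k=1}^{n-1} k\lg k + 2b + \Theta(n) \\
     &\le \frac{2a}{n}\Bigl(\tfrac{1}{2}n^2\lg n - \tfrac{1}{8}n^2\Bigr) + 2b + \Theta(n) \\
     &= an\lg n - \tfrac{a}{4}n + 2b + \Theta(n) \\
     &\le an\lg n + b \qquad \text{for $a$ chosen large enough that $\tfrac{a}{4}n$ absorbs $\Theta(n) + b$.}
\end{aligned}
\]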


Quicksort Average case

● So we have actually shown that T(n) ≤ an lg n + b
● So the induction hypothesis holds
● Thus we know Quicksort, in the average case, runs in O(n lg n) time
● This is the kind of work that computer scientists who study algorithms do. You can also see why math and computer science need each other.

Sorting so far...

● We've looked at a few types of sorting so far:
  – Simple Sort:
    ● O(n²) worst case
    ● O(n²) average case (with equally-likely inputs)
  – Heap sort: uses the heap data structure
    ● O(n lg n) worst case
    ● Sorts in place
  – Quick sort: divide and conquer
    ● O(n lg n) average case
    ● Fast in practice
    ● O(n²) worst case
    ● The worst case is when we use sorted input
      ● Use randomized quicksort instead

Faster sorting than O(n lg n)

● All the sorting algorithms so far are O(n lg n) or slower. We can actually make faster sorting algorithms
● First, we need to realize that all of the sorting algorithms so far are comparison sorts
  – The only way they find out ordering information about a list of numbers is by comparing two elements
  – Theorem: all comparison sorts are Ω(n lg n)
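A brief sketch of why the theorem holds (the usual decision-tree argument, which the slide does not spell out): a comparison sort must be able to produce any of the n! possible orderings of its input, so its decision tree has at least n! leaves, and a binary tree with n! leaves has height at least

\[ \lg(n!) = \Theta(n \lg n). \]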

Sorting In Linear Time

● So how can we go faster than O(n lg n)?
● Counting sort
  – No comparisons between elements!
  – Instead, it assumes some things about the numbers being sorted, like:
    ● We assume the numbers are in the range 1..k
  – The algorithm:
    ● Input: A[1..n], where A[j] ∈ {1, 2, 3, …, k}
    ● Output: B[1..n], sorted (notice: it does not sort in place)
    ● Also might need to use an array C[1..k] for auxiliary storage

Counting Sort

CountingSort(A, B, k) {
 1:   for i = 1 to k
 2:       C[i] = 0;
 3:   for j = 1 to n
 4:       C[A[j]] += 1;
 5:   // C[i] now holds the number of elements equal to i
 6:   for i = 2 to k
 7:       C[i] = C[i] + C[i-1];
 8:   // C[i] now holds the number of elements less than or equal to i
 9:   for j = n down to 1
10:       B[C[A[j]]] = A[j];
11:       C[A[j]] -= 1;
}

The loops over i (lines 1-2 and 6-7) take O(k) time; the loops over j (lines 3-4 and 9-11) take O(n) time.
Try to walk through the example where A = {4 1 3 4 3}, k = 4 (one possible trace is shown below).
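One possible walk-through of that example (an illustration, not printed on the slide):

  After lines 1-4 (counting):       C = [1, 0, 2, 2]    (one 1, no 2s, two 3s, two 4s)
  After lines 6-7 (running sums):   C = [1, 1, 3, 5]
  Lines 9-11, scanning A from the back (j = 5 down to 1):
    j = 5: A[5] = 3  ->  B[C[3]] = B[3] = 3, then C[3] becomes 2
    j = 4: A[4] = 4  ->  B[C[4]] = B[5] = 4, then C[4] becomes 4
    j = 3: A[3] = 3  ->  B[C[3]] = B[2] = 3, then C[3] becomes 1
    j = 2: A[2] = 1  ->  B[C[1]] = B[1] = 1, then C[1] becomes 0
    j = 1: A[1] = 4  ->  B[C[4]] = B[4] = 4, then C[4] becomes 3
  Result: B = {1, 3, 3, 4, 4}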

Counting Sort Explanation

● After initialization in lines 1-2, each number is looked at in lines 3-4. If the value of a number is i, we increment C[i]. So after lines 3-4, C[i] holds the total count of numbers equal to i, for each integer i = 1..k.

● In lines 6-7, we find how many numbers are less than or equal to i, by keeping a running sum in C.

● In lines 9-11, we place each number A[j] in its correct sorted position in the output array B. If all elements are unique, then when we reach line 9, for each A[j], the value of C[A[j]] is the correct final position of A[j] in the output array, since there are C[A[j]] numbers less than or equal to A[j].


Counting Sort

● Total time: O(n + k)
  – Usually, k = O(n)
  – Thus counting sort runs in O(n) time
  – It is also a "stable sort"
● But isn't sorting Ω(n lg n)?
  – No contradiction: this is not a comparison sort (in fact, there are no comparisons at all!)
● So why isn't counting sort always used?
  – Because it depends on the range k of the elements
    ● Could we use counting sort to sort 32-bit integers?
    ● Answer: no, k would be far too large (2^32 = 4,294,967,296)

Stable Sorts

● Stability: stable sorts keep the relative order of elements that have an equal key.
● That means a sorting algorithm is stable if, whenever there are two records R and S with the same key and R appears before S in the original list, R will also appear before S in the sorted list.
● Unstable sorting algorithms may ignore (and so rearrange) the order of elements that have the same value for their key.

Radix Sort

● There are even better linear-time sorts than counting sort
● How did IBM become such a huge company?
  – Answer: it made counting machines that took census data and sped up the process of counting everyone
  – In particular, they used a card sorter that could sort cards into different bins
    ● Each column can be punched in 12 places
    ● Decimal digits use 10 places
  – Problem: only one column can be sorted on at a time

Radix Sort

● Intuitively, you might sort on the most significant digit, then the second most significant digit, and so on
● Problem: there would be lots of intermediate arrays to keep track of (using a lot of memory)
● Key idea: sort on the least significant digit first, using a stable sort (like counting sort)

RadixSort(A, d)
    for i = 1 to d
        CountingSort(A) on digit i


Radix Sorting

● Again, the idea is to sort d-digit numbers with d passes, one on each digit, using any stable sort and starting with the ones digit.
● A sample program will run like this:
● Convert the numbers into base 2. The first pass copies each even number into one array and each odd number into another. Then the two arrays are copied back (even numbers first) into the original. The next pass works on bit 1, then bit 2, and so on. The program detects when no more bits need to be sorted on and stops.
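A small C sketch of that program (illustrative only; the bucket arrays are sized for this demo, and the loop stops once the current bit is above the largest value, which plays the role of "detecting when no more bits need to be sorted on"):

#include <stdio.h>

/* LSD binary radix sort: one pass per bit; splitting into a "zero bit"
   bucket and a "one bit" bucket is a stable partition on that bit. */
void binary_radix_sort(unsigned A[], int n) {
    unsigned zeros[64], ones[64];            /* buckets for this small demo */
    unsigned max = 0;
    for (int i = 0; i < n; i++)
        if (A[i] > max) max = A[i];

    for (unsigned bit = 1; bit != 0 && bit <= max; bit <<= 1) {
        int nz = 0, no = 0;
        for (int i = 0; i < n; i++) {        /* stable split on this bit */
            if (A[i] & bit) ones[no++] = A[i];
            else            zeros[nz++] = A[i];
        }
        for (int i = 0; i < nz; i++) A[i] = zeros[i];      /* "evens" first */
        for (int i = 0; i < no; i++) A[nz + i] = ones[i];  /* then the rest */
    }
}

int main(void) {
    unsigned A[] = {329, 457, 657, 839, 436, 720, 355};
    int n = (int)(sizeof A / sizeof A[0]);
    binary_radix_sort(A, n);
    for (int i = 0; i < n; i++) printf("%u ", A[i]);
    printf("\n");
    return 0;
}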

Radix Sort Example

Start  Sort digit 1  Sort digit 2  Sort digit 3 329 457 657 839 436 720 355 720 355 436 457 657 329 839 720 329 436 839 355 457 657 329 355 436 457 657 720 839 The key is that Radix is a stable sort. Numbers will stay in relative order

Radix Sorting

● In general, radix sort (based on counting sort) is
  – Fast
  – Asymptotically fast (i.e., O(n))
  – Simple to code
  – A good choice
● So why not always use radix sort?
  – Does it work on floating point numbers?

Summary

● This chapter on sorting contains a lot of information, but only touches upon some of the most popular sorts
● Visit the online resources for more information about sorting
● Our goal is to be able to look at a sorting algorithm and analyze its running time (either worst case or average case)