Lecture 33
We all know what searching is - looking for something. In a computer program, the search might be for a particular value in a collection of data, such as a number in an array or a Line in a Scribble.
We have done some searching this semester. Remember your contains method for a Scribble?
```java
public boolean contains(Location point) {
    if (first.contains(point)) return true;
    if (!rest.isEmpty()) return rest.contains(point);
    return false;
}
```
We have to search through our collection of Lines that we call a Scribble to see which one, if any, contains the point.
How do we know that we're done searching? At any time, we have access to the Line known as first. If this Line contains the point, then the Scribble contains the point, and there is no need to continue our search. If this Line does not contain the point, it might be the case that one of the other Lines in rest does, so if there are more Lines, we see if any of them contain the point with a recursive call. Otherwise, we have gotten to the end of the list, we have checked every Line, and none contained the point. In that case, we also know we're done and return false.
Let's try to get some idea of how much "work" it takes for us to get an answer. As a rough estimate of work, we will count how many times we call the contains method of a Line.
If our Scribble consists of n Lines, how many calls to the Line contains method will we have to make before we know the answer? It depends. If the Scribble does not contain the point at all, we need to check all n Lines before we know the answer. If the Scribble does contain the point, we can stop as soon as we find the Line that contains it. That might be the first Line, or it might be the last - we just don't know. Assuming that there's an equal probability that the Line containing the point is at any of the n positions, we have to examine, on average, n/2 Lines.
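To see where the n/2 comes from: if the matching Line is equally likely to be at each of the n positions, the expected number of Lines examined in a successful search is

(1 + 2 + 3 + ... + n) / n = (n·(n+1)/2) / n = (n+1)/2

which is about n/2 when n is large.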
In this case, we can't do any better. Perhaps we could if we were not restricted by the fact that the list of Lines forces us to examine the first, then the second, and so on. We can't jump right to the last Line, since our recursive structure does not provide access to it without first going through the whole list.
So let's think about searching in an array, where we have the option to look at any element directly. We will consider an array of int, though most of what we discuss applies to a wider range of "searchable" items.
Class demo: search for a number in an unordered array of numbers.
A method to do this:
```java
/**
 * POST -- Returns index of findElt iff findElt is in elts,
 *         otherwise returns -1.
 */
public int search(int[] elts, int findElt) {
    int index = 0;
    while (index < elts.length) {
        if (findElt != elts[index])
            index++;
        else
            return index;
    }
    return -1;   // Didn't find elt.
}
```
The procedure here is a lot like the search for a Line in a Scribble. We have no way of knowing that we're done until we either find the number we're looking for, or until we get to the end of the array. So again, if the array contains n numbers, we have to examine all n in an unsuccessful search, and, on average, n/2 for a successful search.
Alternately, we could use recursion instead of a while loop for the search:
```java
/**
 * POST -- Returns index of findElt iff findElt is in elts,
 *         otherwise returns -1.
 */
public int search(int[] elts, int findElt) {
    return recSearch(elts, findElt, 0);
}

/**
 * POST -- Returns index of findElt iff findElt is in
 *         elts[start..elts.length-1],
 *         otherwise returns -1.
 */
public int recSearch(int[] elts, int findElt, int start) {
    if (start >= elts.length)          // no more elts to search
        return -1;
    else if (findElt == elts[start])   // found it!
        return start;
    else                               // search rest of the array
        return recSearch(elts, findElt, start + 1);
}
```
Now, suppose the array has been sorted in ascending order.
Class demo: search for a number in an ordered array of numbers.
```java
/**
 * PRE  -- elts is sorted in non-decreasing order
 * POST -- Returns the index of findElt if it is in elts,
 *         otherwise it returns -1
 */
public int search(int[] elts, int findElt) {
    int index = 0;
    while (index < elts.length) {
        if (findElt > elts[index])
            index++;                  // haven't reached findElt's place yet
        else if (findElt == elts[index])
            return index;             // found it!
        else
            return -1;                // passed the place it would have to be
    }
    return -1;
}
```
Well, we can do the same type of search - start at the beginning and keep looking for the number. In the case of a successful search, we still stop when we find it. But now, we can also determine that a search is unsuccessful as soon as we encounter any number larger than our search number. Assuming that the place our search number would occupy is, on average, near the middle of the array, an unsuccessful search now requires that we examine, on average, only n/2 items. This sounds great, but in fact it is not a really significant gain, as we will see. These are all examples of a linear search - we examine items one at a time in some linear order until we find the search item or until we can determine that we will not find it.
Class demo: guessing game.
Class demo: dictionary lookup.
Class demo: binary search.
If you know that there is an order - where do you start your search? In the middle, since then even if you don't find it, you can look at the value you found and see if the search item is smaller or larger. From that, you can decide to look only in the bottom half of the array or in the top half of the array. You could then do a linear search on the appropriate half - or better yet - repeat the procedure and cut the half in half, and so on. This is a binary search. It is an example of a divide and conquer algorithm, because at each step, it divides the problem in half.
A Java method to do this:
```java
/**
 * PRE  -- The elements are in non-decreasing (essentially increasing) order
 * POST -- Returns the index of findElt if it is in elts,
 *         otherwise it returns -1
 */
public int search(int[] elts, int findElt) {
    return binsearch(elts, 0, elts.length - 1, findElt);
}

/**
 * PRE  -- The elements are in non-decreasing (essentially increasing) order
 * POST -- Returns the index of findElt if it is in elts[low..high],
 *         otherwise it returns -1
 **/
protected int binsearch(int[] elts, int low, int high, int findElt) {
    if (low <= high) {
        int mid = (low + high) / 2;
        if (findElt < elts[mid])           // findElt can only be in 1st half
            return binsearch(elts, low, mid - 1, findElt);
        else if (elts[mid] < findElt)      // findElt can only be in 2nd half
            return binsearch(elts, mid + 1, high, findElt);
        else                               // found findElt!
            return mid;
    } else {
        return -1;                         // Didn't find findElt.
    }
}
```
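The same logic is often written with a loop instead of recursion. This iterative version is a sketch (it is not from the lecture), but it behaves the same way as binsearch above:

```java
/**
 * PRE  -- elts is in non-decreasing order
 * POST -- Returns the index of findElt if it is in elts, otherwise -1
 **/
public int iterativeSearch(int[] elts, int findElt) {
    int low = 0;
    int high = elts.length - 1;
    while (low <= high) {
        int mid = (low + high) / 2;       // middle of the current range
        if (findElt < elts[mid])
            high = mid - 1;               // findElt can only be in 1st half
        else if (elts[mid] < findElt)
            low = mid + 1;                // findElt can only be in 2nd half
        else
            return mid;                   // found findElt!
    }
    return -1;                            // didn't find findElt
}
```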
How many steps are needed for this? Each step cuts the range still under consideration in half, so after k steps only about n/2ᵏ elements remain; the search must stop by the time the range shrinks to a single element, which takes about log₂ n steps (and binsearch makes at most two comparisons per step).

So how much better is this, really? In the case of a small array, the difference is not really significant. But as the size grows...

| Search / #Elts | 10 | 100 | 1000 | 1,000,000 |
|---|---|---|---|---|
| linear | 10 | 100 | 1000 | 1,000,000 |
| binary | 8 | 14 | 20 | 40 |
That's pretty huge. Even if you think of the linear search as really needing only n/2 steps on average, for the 1000-element case the binary search is still winning 500 to 20. The logarithmic factor is really important.
We can see this better by looking at graphs of log n and n as functions of n. The difference is large, and gets larger and larger as n gets larger. Even if we multiply by constant factors in an attempt to make the log n graph as large as the n graph, there will always be a value of n large enough that the scaled function for n will be larger than the scaled function for log n. More on this later.
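If you want to see the gap yourself, a few lines of Java will print it (a quick sketch, not from the lecture):

```java
// Print n alongside log2(n); the gap widens without bound as n grows.
for (int n = 10; n <= 1_000_000; n *= 10) {
    System.out.printf("%,9d  %5.1f%n", n, Math.log(n) / Math.log(2));
}
```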
We've looked at searching, but what about sorting?
Class demo: sorting a deck of cards.
Suppose our goal is to take a shuffled deck of cards and to sort it in ascending order. We'll ignore suits, so there is a four-way tie at each rank.
Describing a sorting algorithm precisely can be difficult. Let's consider a few.
First, we will look at this procedure:

- Search through the deck for the largest card and move it to the last position (swap it with whatever card is there).
- Search the remaining, unsorted cards for the largest and swap it into the last unsorted position.
- Repeat until only one card remains unplaced; it must be the smallest, so the deck is sorted.
What I have described is a form of a selection sort - at each step, we select the item that goes into the next position of the array, and put it there. This gets us one step closer to a solution.
This (hopefully) sounds like a job for a recursive algorithm. If we want to describe it that way, we'll need to make sure we have a base case (stopping condition) and a way to reduce the problem to a smaller version of the problem that can be solved by a recursive call.
Let's put playing cards aside and return to an array of int. Our base case is an array of either 0 or 1 elements. In either case, we know that the job is done, since any array of 0 or 1 elements is already sorted. The recursive step involves first finding the largest element and putting it at the end of the array (swap it with whatever happened to be in the last position), then doing a selection sort on the array except for the last element, which we know already contains its final value.
Here is a method that does this:
```java
/**
 * POST -- elementArray is sorted into non-decreasing order
 **/
public void sort(int[] elementArray) {
    recSelSort(elementArray.length - 1, elementArray);
}

/**
 * PRE  -- lastIndex is a legal index of elementArray
 * POST -- elementArray[0..lastIndex] is sorted in non-decreasing order
 **/
private void recSelSort(int lastIndex, int[] elementArray) {
    if (lastIndex > 0) {
        // get index of the element with the largest value in [0..lastIndex]
        int extreme = findIndexOfLargest(elementArray, 0, lastIndex);
        // swap the largest ("extreme") element with the one at lastIndex
        swap(elementArray, extreme, lastIndex);
        // elementArray[lastIndex] is now the largest element in
        // elementArray[0..lastIndex]
        recSelSort(lastIndex - 1, elementArray);
        // elementArray[0..lastIndex] is sorted
    }
}
```
Note that if lastIndex > 0, then the array has at least two elements; if it has 0 or 1, recSelSort does nothing. The method findIndexOfLargest(elementArray, 0, lastIndex) finds the index of the largest element of elementArray[0..lastIndex], while swap(elementArray, extreme, lastIndex) swaps the largest element of the array with the last element. As a result, the largest element in elementArray[0..lastIndex] ends up in elementArray[lastIndex].
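The lecture does not show findIndexOfLargest and swap, but their behavior is fully determined by the comments above. Minimal sketches:

```java
/** POST -- Returns the index of a largest element of elts[low..high] */
private int findIndexOfLargest(int[] elts, int low, int high) {
    int largest = low;
    for (int i = low + 1; i <= high; i++) {
        if (elts[i] > elts[largest])
            largest = i;              // remember the largest seen so far
    }
    return largest;
}

/** POST -- The elements at indices i and j of elts are exchanged */
private void swap(int[] elts, int i, int j) {
    int temp = elts[i];
    elts[i] = elts[j];
    elts[j] = temp;
}
```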
It is also possible to write this iteratively, as sketched below, but the recursive version is more easily understood.
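One possible iterative rendering (a sketch, using the same helper methods shown above; not necessarily the version linked from the original notes):

```java
/** POST -- elementArray is sorted into non-decreasing order **/
public void iterativeSelSort(int[] elementArray) {
    // Each pass moves the largest remaining element into its final slot.
    for (int lastIndex = elementArray.length - 1; lastIndex > 0; lastIndex--) {
        int extreme = findIndexOfLargest(elementArray, 0, lastIndex);
        swap(elementArray, extreme, lastIndex);
        // elementArray[lastIndex..] now holds its final, sorted values.
    }
}
```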
How long does this algorithm take? As we did with searching, we won't try to calculate an exact time, but we will estimate the cost by counting the number of comparisons done in sorting an array. We could alternately choose to count the total number of "visits" to an array element, but the "shape" of the answer will be the same no matter which of these we compute.
Suppose the original array has n elements, where n > 1. Then it takes n-1 comparisons to find the largest element of the array (compare the first with the second, the larger of those with the third, etc.). In general, the number of comparisons needed to find the largest element is one less than the number of elements to be sorted. Once this element has been put into the last slot of the array, we need to sort the remaining n-1 elements of the array. By the argument above, it takes n-2 comparisons to find the largest of these. We continue with successive passes taking n-3, n-4, ... comparisons, all the way down to the last pass, when there are only two elements and only 1 comparison is needed. (Once we get down to 1 element there is nothing to be done.)
Thus it takes S = (n-1) + (n-2) + (n-3) + ... + 3 + 2 + 1 comparisons to sort a list of n elements. We can compute this sum by writing the list forwards and backwards, and then adding the columns:
```
  S = (n-1) + (n-2) + (n-3) + ... +   3   +   2   +   1
  S =   1   +   2   +   3   + ... + (n-3) + (n-2) + (n-1)
 ---------------------------------------------------------
 2S =   n   +   n   +   n   + ... +   n   +   n   +   n     = (n-1)·n
```
Therefore S = (n² - n)/2. The graph of this as n increases looks like n² - a parabola.
Class demo: selection sort of big numbers on the board.
Our second sorting algorithm, merge sort, proceeds as follows: divide the array in half, sort each half (recursively), and then merge the two sorted halves into a single sorted whole.
Class demo: merge sort of numbers on the board.
The Java method to do it:
/** POST -- "elementArray" is sorted into non-decreasing order **/ public void sort(int[] elementArray) { // create tempArray for use in merging tempArray = new int[elementArray.length]; recMergeSort(0, elementArray.length -1,elementArray,tempArray); } /** pre: first, lastIndex are legal indices of elementArray post: elementArray[firstIndex..lastIndex] is sorted in non-decreasing order **/ protected void recMergeSort(int firstIndex, int lastIndex, int[] elementArray, int[] tempArray) { int middleIndex = (firstIndex+lastIndex)/2; // middle element of array if (lastIndex - firstIndex > 0) { // More than 1 elt of array // Sort both halves of list recMergeSort(firstIndex,middleIndex,elementArray,tempArray); recMergeSort(middleIndex+1,lastIndex,elementArray,tempArray); // Merge two halves mergeRuns(firstIndex,middleIndex,lastIndex,elementArray,tempArray); } }
The method mergeRuns takes the sorted elements in elementArray[firstIndex..middleIndex] and elementArray[middleIndex+1..lastIndex] and merges them together using the array tempArray, and then copies them back into elementArray.
Again we'd like to count the number of comparisons necessary in order to sort an array of n elements. Unfortunately, the code shown above doesn't include any comparisons - all of the comparisons are in the mergeRuns method.
Even without looking at the code in mergeRuns we can estimate the number of comparisons made. If we are trying to merge two sorted lists, every time we compare the two elements at the front of the lists we put one of them into its correct position. When we run out of elements in one of the lists, we put the remaining elements into the last slots of the sorted list. As a result, merging two lists which have a total of n elements requires at most n-1 comparisons.
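Although the lecture doesn't show mergeRuns, a sketch consistent with the description above (using the same tempArray scratch space) might look like this:

```java
/**
 * PRE  -- elementArray[firstIndex..middleIndex] and
 *         elementArray[middleIndex+1..lastIndex] are each sorted
 * POST -- elementArray[firstIndex..lastIndex] is sorted
 **/
protected void mergeRuns(int firstIndex, int middleIndex, int lastIndex,
                         int[] elementArray, int[] tempArray) {
    int left = firstIndex;        // next unused element of the first run
    int right = middleIndex + 1;  // next unused element of the second run
    int next = firstIndex;        // next slot to fill in tempArray

    // Repeatedly move the smaller front element into tempArray.
    while (left <= middleIndex && right <= lastIndex) {
        if (elementArray[left] <= elementArray[right])
            tempArray[next++] = elementArray[left++];
        else
            tempArray[next++] = elementArray[right++];
    }
    // One run is exhausted; copy whatever remains of the other.
    while (left <= middleIndex)
        tempArray[next++] = elementArray[left++];
    while (right <= lastIndex)
        tempArray[next++] = elementArray[right++];

    // Copy the merged run back into elementArray.
    for (int i = firstIndex; i <= lastIndex; i++)
        elementArray[i] = tempArray[i];
}
```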
Suppose we start with a list of n elements. Let T(n) be a function telling us the number of comparisons necessary to mergesort an array with n elements. As we noted above, we break the list in half, mergesort each half, and then merge the two pieces. Thus the total number of comparisons needed is the number of comparisons to mergesort each half plus the number of comparisons necessary to merge the two halves. By the remarks above, the number of comparisons for the final merge is no more than n-1. Thus T(n) ≤ T(n/2) + T(n/2) + n-1. For simplicity we'll replace the n-1 comparisons for the merging by the even larger n, to make the result easier to approximate. We have T(n) = 2·T(n/2) + n, and if we find a function that satisfies this equation, then we have an upper bound on the number of comparisons made during a mergesort.
Looking at our algorithm, no comparisons are necessary when the size of the array is 0 or 1. Thus T(0) = T(1) = 0. Let us see if we can solve this for small values of n. Because we are constantly dividing the number of elements in half it will be most convenient to start with values of n which are a power of two. Here we list a table of values:
| n | T(n) |
|---|---|
| 1 = 2⁰ | 0 |
| 2 = 2¹ | 2·T(1) + 2 = 2 = 2·1 |
| 4 = 2² | 2·T(2) + 4 = 8 = 4·2 |
| 8 = 2³ | 2·T(4) + 8 = 24 = 8·3 |
| 16 = 2⁴ | 2·T(8) + 16 = 64 = 16·4 |
| 32 = 2⁵ | 2·T(16) + 32 = 160 = 32·5 |
| ... | ... |
| n = 2ᵏ | 2·T(n/2) + n = n·k |
Notice that if n = 2ᵏ then k = log₂ n. Thus T(n) = n·log₂ n. In fact this works as an upper bound on the number of comparisons for mergesort even if n is not a power of two. If we graph this we see that it grows much, much more slowly than the graph of a quadratic (for example, the one corresponding to the number of comparisons for selection sort).
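As a check, the function T(n) = n·log₂ n really does satisfy the recurrence:

2·T(n/2) + n = 2·(n/2)·log₂(n/2) + n = n·(log₂ n - 1) + n = n·log₂ n

and T(1) = 1·log₂ 1 = 0, matching the base case.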
This explains why, when we run the algorithms, the time for mergesort is almost insignificant compared to that for selection sort.
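One way to check the comparison counts empirically is to instrument both sorts with a counter and sort the same random array. This is a self-contained sketch (the class name and helpers are hypothetical, not code from the lecture); the sorts are compact rewrites of the two algorithms above:

```java
import java.util.Random;

public class CompareCounts {
    static long count;   // comparisons of array elements

    static void selSort(int[] a) {
        for (int last = a.length - 1; last > 0; last--) {
            int big = 0;
            for (int i = 1; i <= last; i++) {
                count++;                        // one comparison per element
                if (a[i] > a[big]) big = i;
            }
            int t = a[big]; a[big] = a[last]; a[last] = t;   // swap
        }
    }

    static void mergeSort(int[] a, int[] tmp, int lo, int hi) {
        if (hi - lo <= 0) return;               // 0 or 1 elements: done
        int mid = (lo + hi) / 2;
        mergeSort(a, tmp, lo, mid);
        mergeSort(a, tmp, mid + 1, hi);
        int l = lo, r = mid + 1, n = lo;        // merge the two runs
        while (l <= mid && r <= hi) {
            count++;                            // one comparison per move
            tmp[n++] = (a[l] <= a[r]) ? a[l++] : a[r++];
        }
        while (l <= mid) tmp[n++] = a[l++];
        while (r <= hi) tmp[n++] = a[r++];
        for (int i = lo; i <= hi; i++) a[i] = tmp[i];
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] data = new Random(42).ints(n, 0, 10000).toArray();

        count = 0;
        selSort(data.clone());
        System.out.println("selection sort: " + count);  // exactly (n^2-n)/2 = 499500

        count = 0;
        mergeSort(data.clone(), new int[n], 0, n - 1);
        System.out.println("merge sort:     " + count);  // at most about n*log2(n) ~ 9966
    }
}
```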