Tuesday, March 10, 2009

Big O Notation 2: Sorting

As a Card-Carrying Nerd (you have to mail away exactly 314 cereal box tops for your card) I find entertainment in some very unusual activities.  For example, I like to execute algorithms by hand. When I find a pack of cards lying around, I often find myself shuffling them, dealing out about 10-20, and operating a QuickSort on them. Sometimes when I have a stack of old bills and I want to eliminate duplicates while retaining the latest copy for my records, I'll even lay them out in a grid on the floor and perform a bucket sort.

Sorting algorithms are material from Intro to Computer Science 1, but I'm going to give a very condensed overview of the topic as background for what I'll cover in a future post. If you're already rock solid on sorting algorithms, I suggest you skip or skim this one, and I promise I'll get to more practical discussions later.

You'll remember that in the last post on Big O, I explained that a "dumb dictionary search" (just viewing each word one at a time) requires O(n) time, while a more efficient search (also known as a "binary search") requires only O(log n) time, which is a lot shorter. But the binary search only works if the words are in some kind of order already. If they're all scrambled up -- if "xylophone" comes before "cereal" comes before "pheasant" -- then you have no choice but to execute a dumb search. There's no other way to find your word besides looking at every entry to make sure you didn't miss any.
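
(Just as a refresher, here's a minimal sketch of that binary search in Perl. The word list and the word we're hunting for are made up for illustration, and the list has to be sorted before any of this works.)

    my @words  = ("cereal", "pheasant", "xylophone");   # already in order
    my $target = "pheasant";

    my ($lo, $hi) = (0, $#words);
    my $where = -1;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);                     # look at the middle entry
        if    ($words[$mid] lt $target) { $lo = $mid + 1; } # too early, look right
        elsif ($words[$mid] gt $target) { $hi = $mid - 1; } # too late, look left
        else  { $where = $mid; last; }                      # found it
    }
    print "found at position $where\n" if $where >= 0;
    print "not in the list\n"          if $where <  0;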

Therefore, it's often helpful to sort a list of things (such as dictionary words). Of course, sorting a dictionary might take longer than actually searching it. But if you have to search the dictionary multiple times, it's worth it. You eat the cost of sorting only once, but you get to do binary searches as many times as you want after that.

Suppose we take a pack of cards and deal out "n" cards at random. Let's make this quick and pick 5. I'll imagine they came out like this:

  • 5, A, 7, K, 3

How long will it take to put these cards in the correct order? Well, naturally it can't take less than O(n) time. It's easy to see why. Suppose you had a magic function which automatically skimmed the entire list in no time at all to find the smallest number remaining. Then you could scan these cards and find the Ace first, and put it in the first spot. Then you find the 3, and put it in the second spot. And so on. Then the sequence of events would look like this:

  • A, 5, 7, K, 3 (we put the ace in the first spot, but we needed to move the 5 out of the way, so we just swap their positions)
  • A, 3, 7, K, 5 (3 swapped with 5)
  • A, 3, 5, K, 7
  • A, 3, 5, 7, K (sorted!) 

This is the most efficient sort possible, but even with our magic function you still have to move each card to the correct location. There are n cards, so it takes n moves. So no matter what happens, we can't do better than O(n). (Technically, it took "n-1" moves to finish, because once the first four cards are sorted, the king MUST be in the right place. There's nothing to swap with. But remember how I said before that constant multipliers don't matter? Subtracting a 1 definitely doesn't matter. In fact, we ignore any constants that are added to or subtracted from the outcome. Order of magnitude is all that counts.)

Unfortunately, we don't have a magic function that takes no time to find the Ace, so we actually have to read the cards in order to find it. How do we do this? Well, we could read the entire list manually, finding the Ace by inspecting every card. This is a technique known as a "selection sort," for the obvious reason that we are looking through the unsorted list to select the smallest number. Like in the dumb dictionary search, that would take O(n) reads to accomplish (technically n/2 on average, but constant multipliers don't matter).

But it's worse than that: we have to move n cards, AND we have to read about n cards each time we move one. So we do n*n things, which means that this sort is O(n^2).
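
(Here's a rough selection sort sketch in Perl, just so you can see the two nested loops that make it n*n. The cards are the same five from above, with the Ace as 1 and the King as 13.)

    my @cards = (5, 1, 7, 13, 3);                        # 5, A, 7, K, 3
    for my $i (0 .. $#cards - 1) {
        my $smallest = $i;
        for my $j ($i + 1 .. $#cards) {                  # read the rest of the list...
            $smallest = $j if $cards[$j] < $cards[$smallest];
        }
        @cards[$i, $smallest] = @cards[$smallest, $i];   # ...then move one card into place
    }
    print "@cards\n";                                    # 1 3 5 7 13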

Terrible. Remember, our dictionary had 170,000 words. You square that, and you're doing almost thirty billion operations. There must be a better way.

There is, and one example is QuickSort. QuickSort is a recursive algorithm that is a little bit tedious to explain in a blog post, but it's so successful that it is implemented as a standard feature of many languages. If you type "sort @mylist" in Perl, you are instructing Perl to perform a QuickSort in most cases. If you have some time to kill, I recommend that you look up QuickSort on Wikipedia, lay out some cards, and try to figure out how it works. It's kind of cool.
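
(If you'd rather poke at code than cards, here's a bare-bones recursive QuickSort sketch in Perl. This isn't how Perl implements its built-in sort; it's just the textbook idea: pick a pivot, split everything else into a smaller pile and a larger pile, and sort each pile the same way.)

    sub quicksort {
        my @list = @_;
        return @list if @list <= 1;                      # nothing left to sort
        my $pivot   = shift @list;                       # naive pivot: the first element
        my @smaller = grep { $_ <  $pivot } @list;
        my @larger  = grep { $_ >= $pivot } @list;
        return (quicksort(@smaller), $pivot, quicksort(@larger));
    }

    print join(", ", quicksort(5, 1, 7, 13, 3)), "\n";   # 1, 3, 5, 7, 13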

QuickSort usually requires a time of O(n*log n), and you can't do much better than that. Well, there are some tricks you can use to speed it up if you know exactly what sort of data you're dealing with, even getting pretty close to O(n) in some cases. But for the most part, you're stuck with n*log n. And this makes sense: in the dictionary example we discovered that you can find something in O(log n) time, and we already know it takes O(n) moves to put everything in place, so the best we can hope for is roughly n moves at about log n work apiece.

If you sort a 170,000-word dictionary in O(n*log n) time, how long will it take?  Log base 2 of 170,000 is about 17, and multiplied by 170,000 itself that's about three million operations.  That's not fast, but it's nowhere near as bad as 30 billion.
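
(A quick back-of-the-envelope check in Perl, if you want to see the numbers yourself:)

    my $n = 170_000;
    printf "n * log2(n) = %.0f\n", $n * log($n) / log(2);   # about 3 million
    printf "n * n       = %.0f\n", $n * $n;                 # about 29 billion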

This is the last purely academic post I'm going to write about Big O notation. By now, I'm sure some readers are asking "What's with all this theory? Am I ever going to use this professionally?"  My answer is: "You might.  For one thing, you can use it to show off in job interviews."  Next time I post about Big O, I'll explain how.

9 comments:

  1. Love the blog. Keep 'em coming. I am currently working for a software development company, and I have a degree in Physics, which makes what you say perfectly logical, but I am missing much of the programming methodology you speak of. Please write more, and when you use an example such as QuickSort, could you include equivalent functionality in other languages where possible?

    Thanks,
    Thundergod

  2. The Wikipedia page on QuickSort is actually really good, although it is a bit long-winded, as tech pages on Wikipedia tend to be. If I've got some time to kill in later posts, I might do a whole post about QuickSort.

  3. Love your posts. I don't know much about programming (I can handle some Basic), but I love science in general, and computer science is no exception.

  4. When I find a pack of cards lying around, I often find myself shuffling them, dealing out about 10-20, and operating a QuickSort on them.

    Not trying to start a geek DSW or anything, but I once wrote a Makefile to help me with a craft project. Because make is basically a tool for doing topological sorting (if that's the right term).

    QuickSort usually requires a time of O(n*log n), and you can't do much better than that.

    I seem to remember seeing a proof of this, that started with the observation that n items can be shuffled n! ways, and your algorithm has to be able to deal with all of them. I don't remember how it goes from there, though.

    some readers are asking "What's with all this theory? Am I ever going to use this professionally?"

If you write code for a living, then yes. I find that in many cases, it's easier to write a slow algorithm than a fast one. In the example above, selection sort is easier to write than QuickSort. When you're writing experimental code or a first draft of something, it's okay to write something inefficient quickly rather than spend too much time getting it right. But it's best to know that it's inefficient, so you can leave a note to yourself to fix it for the production release.

    Likewise, if your code is running too slowly, it helps to figure out the speed of the algorithms you're using, in order to improve it intelligently: if 50% of the time is spent sorting using an O(n*log n) algorithm, there's not much you can do to improve that. If the other 50% of the code uses an O(n^3) algorithm where it could be using an O(n^2) one, you'll get a lot more bang for the buck.

    In his book "Deep C Secrets", Peter van der Linden talks about a slowness bug in Sun's C compiler. The compiler used a hash table to keep track of identifiers in the program being compiled. Hash functions are notoriously hard to write, so the developer had put in a quick and dirty one that simply returned 1 (effectively turning the O(1) hash table into an O(n) linked list), along with a comment saying to write a real hash function later.

  5. Yay, I really like your new blog so far. I did notice one thing, though :-)

    If you type "sort @mylist" in Perl, you are instructing Perl to perform a QuickSort in most cases.

    Actually, since 5.8.0, Perl has switched its default sort algorithm to Merge Sort. I'm not pointing this out to be pedantic, but to highlight this alternative algorithm, as it actually gives slightly higher performance than the QuickSort for most use cases :-)

    Otherwise, great stuff! Keep it up!

  6. Ah, I was afraid I might have been wrong, so I looked it up in advance. Here is what I was reading:

    In Perl versions 5.6 and earlier the quicksort algorithm was used to implement sort(), but in Perl 5.8 a mergesort algorithm was also made available, mainly to guarantee worst case O(N log N) behaviour: the worst case of quicksort is O(N**2). In Perl 5.8 and later, quicksort defends against quadratic behaviour by shuffling large arrays before sorting.

    According to the page, you can choose a merge sort yourself if you desire, but the default behavior is a shuffle followed by a quicksort.

For the benefit of people who don't have this imprint in their memory banks, there's a good reason for shuffling an array before quicksorting it. Ordinarily QuickSort is O(n log n) on a randomized array. However, if your array is already sorted beforehand (and the pivot is chosen naively, say by always taking the first element), QuickSort winds up being O(n^2). Obviously this is not good, because if you accidentally sort something twice, it hurts your performance and you get nothing out of it.
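
    (For anyone who wants to pick the algorithm explicitly rather than take the default, the same page describes a "sort" pragma; something like this, as I read the docs -- the underscore-prefixed names are documented as implementation details that may change:)

        use sort '_mergesort';                 # or '_quicksort', or just 'stable'
        my @numbers = (5, 1, 7, 13, 3);
        my @sorted  = sort { $a <=> $b } @numbers;
        print "@sorted\n";                     # 1 3 5 7 13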

  7. Am I ever going to use this professionally?" My answer is: "You might. For one thing, you can use it to show off in job interviews."

    Sadly, this is the case. While people argue about what is and isn't computer science (engineering), understanding the basics of runtime complexity does seem to have gone by the wayside. I'm very disappointed when people interview for jobs and have absolutely no clue about the subject.

    Even softball questions like: "When would you use an array/vector over a list? vice-versa?" are often unanswerable. Let alone any discussion of sorting...

    All of that aside, I'm enjoying your posts!

  8. Ah, I was afraid I might have been wrong, so I looked it up in advance. Here is what I was reading:

    Aye, but scroll down 3 paragraphs:

    The default algorithm is mergesort, which will be stable even if you do not explicitly demand it. But the stability of the default sort is a side-effect that could change in later versions.

    It's also covered in detail in perldelta580, as the core devs obviously had to take the time to verify and justify their decision.

    Ok, now I admit to being pedantic :-)

    Anyway, it's not really germane to the rest of your article. Algorithmic complexity is indeed overlooked by far too many programmers, who all too often leave others holding the bag by not carefully considering their program design. Thanks for covering this issue; I hope you do a long series on it :-)

  9. For a nice demonstration of some different sorting algorithms side by side and under different conditions, try the site http://www.sorting-algorithms.com/
