Castles of Air: sorting algorithms

Showing posts with label sorting algorithms. Show all posts

Wednesday, March 18, 2009

Recursive sorting 2: QuickSort

I got good feedback from my last post about recursive sorting, some of it by email and in person. I also heard some serious concerns about my world domination scheme. You can all relax, because my Evil Generals are all away on extended vacation for the foreseeable future. This post probably won't be as amusing as the last one, as I've already explained the basic groundwork. But I'll try to give you the basics so you can do QuickSorting yourself.

Here are the fundamentals of how QuickSort works. In Merge Sort, we split up two decks of cards by size, and then sorted each half. How exactly you "sort each half" is up to you. If the remaining divided up decks are still large, you can just divide them in half again and run two more Merge Sorts. But if the decks are small, you can sort them in another way. Even Selection Sort might be a good choice if you have no more than, let's say 10 cards. OR, you can just keep dividing and dividing until you get down to two or three cards, and then perform a couple of swaps to get the cards in order.

QuickSort is similar, but a little different. We still split up the cards, but in this case we split them around a card value instead of a specific deck size. Here's an example. Start with unsorted card:

8, 4, K, A, J, 5, 9

In QuickSort our plan is to start with the first card (8) and put it in the correct place. You remember that in Merge Sort we did the recursive step -- the splitting up into smaller problems -- first, and then the action of actually moving cards around was saved until we reached the lowest level (merging the cards). That's when we actually do something constructive... like telling an evil private "Go kill that guy." In this one, we do the reverse. The action step comes first, and then the recursive step second. After the 8 is in the right place, we'll sort the cards around it, using another call to QuickSort.

When I say we want the 8 in "the right place," what I mean is that we want all the cards that are less than 8 to be on the left side of the 8, and we want all the cards that are greater than 8 on the right side.

How do we get there? The 8 starts in the low position, so put your left hand on the 8, then put your right hand on the right side of the list. That's 9. 9 is bigger than 8. No switch is necessary yet, so move your right hand one space to the left, toward the 8. You're now on 5. 5 is smaller than 8, so it's in the wrong position. Do a swap and you get this:

5, 4, K, A, J, 8, 9

So now the 8 is on the high side. It is correctly positioned with respect to some of the cards: it is between 5 and 9, which is where it should be. But you're not done yet. Now you have your left hand on 5, right hand on 8. Move your left hand up one space, to the 4.

4 is less than 8. Keep going.

King is more than 8. It's in the wrong place. Time to swap.

5, 4, 8, A, J, K, 9

Now the 8 is correctly sorted with respect to the cards you already looked at. It is more than 5 and 4, but less than king and 9. Nearly done.

The 8 is now on the low side. With your right hand on the king, move backwards. The jack is in the right place. The ace is smaller than the 8. Swap.

5, 4, A, 8, J, K, 9

...your hands now meet in the middle, and you're done. The 8 is in the precise position where it should be.

Now what? Here's the recursive step. You have the 8 in the right place, but you have three cards on the left that are totally scrambled, and three cards on the right that are also scrambled. How do you unscramble them? You QuickSort them, of course.

First you sort this array:

5, 4, A

Then you sort this array:

J, K, 9

Do it in the same way as before. For the first set of three cards, put your left hand on the 5, and move it to the proper place (the third position). For the second set, put your left hand on the jack, and move it to the proper place (second position, between J and K). By the time you've finished, we discover that all the cards are sorted. Quick, wasn't it?

Seriously, get your own deck and try this on 20 cards. It's easy and fun once you get the hang of it. Lay them out in a line. As you work, you'll put a card at a time in its final location. When you do this, push the card up slightly so it will stand out from the others. Then work on the unsorted cards from left to right until you're finished.

Complexity of QuickSort

Starting with the first card, you made one complete pass through the array in order to put it in the right place. So this algorithm is at least O(n). Then you performed this pass through the array multiple times. How many?

In the best case, QuickSort does the same thing as Merge Sort: it splits the array almost exactly in half. (In the above case, our first number, 8, landed smack in the middle of the array, with three numbers on each side. That's a perfect split.) So basically, you're dividing by two each time, and after (log₂ n) divisions, you're down to sorting one card -- just like what we did in a binary search. So we simplify this and say we examined "n" cards approximately "log n" times, so as I indicated earlier, this algorithm is O(n*log n).

If you're lucky!

But what happens if you start with an array that's already sorted?

A, 4, 5, 8, 9, J, K

Well, the first card is already the lowest. So you'll start from the king and start moving left, but eventually you'll get to the ace, having examined all the cards but not moved any.

Then you'll "split" around the ace... but there are no cards at all on the left, and (n-1) cards on the right. So you'll do almost as many comparisons as before against the 4. Then against the 5. And so on, until you've compared every card to every other card. This takes O(n²) time. Needless to say, that's bad.

So what we see here is that there's a difference between the complexity of a typical execution of an algorithm and the worst case scenario. Knowing that, we can do things to plan for the worst case.

Unfortunately, sooner or later it's highly likely that you will accidentally try to sort an array that is already sorted. So although it is counter-intuitive, we can guard against this by scrambling the array first and then sorting it. Scrambling is easy... all you do is go through the entire array (n operations), pick a random number (picking random numbers takes constant time, so it's O(1)), and swap the current number with the one at the random location.

This takes 1*n moves, so it's O(n) time. Then "scramble + sort" would take O(n log n + n) time But n is of lower complexity than n log n, and at high numbers it will be trivially small in comparison. So the "scramble + sort" routine is simply expressed as O(n log n).

Of course, you could be super unlucky and wind up with a scrambled array that happens to be perfectly sorted. But as you get a very large number for n, the chances of that happening become extremely small.

Another way to randomize this thing is by picking random pivot point. Instead of always trying to position the first card, pick a RANDOM card, move it to the leftmost location, and put that one in position. This way you fix the problem of potentially starting with the smallest card every time.

Saturday, March 14, 2009

Recursive sorting, part 1

I thought I could get away with just a brief mention of QuickSort, but apparently some people really want to hear details. So here you go. Today I'm going to talk about something that is not QuickSort, but it is a closely related technique called a Merge Sort. This is tangentially related to the discussion about Big O notation, but it's not really my main focus here. I'll just tell you in advance that, like QuickSort, this takes a time of O(n log n).

Along the way, I'm going to have to explain the basics of recursion to the novices, who may have been a little uncertain about what to make of the story about my dream.

Understanding recursion

Imagine that I am playing the board game "Risk" and I have an objective to take over the world. No wait, better idea. Imagine that I am an Evil Genius, and my actual objective is to take over the world.

The world is a big place. Too big to plan my attack in every minute detail by myself. How am I going to go about this?

Presumably I need military commanders, so I'll assume that I have a number of Evil Generals at my Evil Headquarters, who are expert military strategists. Luckily, the world is already divided up into several logical categories, known as continents. So I formulate a plan. First I'll conquer Europe. Then I'll conquer Asia. Then Africa. Then Australia. Then North America. Then South America. Conquering each region will help me gain additional bases, which will aid me in the remainder of my nefarious plans.

So I pick my six finest generals, and I say "Okay, listen up minions! YOU. Conquer Europe. I don't care how you do it, just get the job done and report back to me. YOU. Conquer Asia." And so on. I assign one continent to each general. Then I sit back and relax, confident that my brilliant generals will each do their job thoroughly and well. My part in the plan is finished.

But continents are still big places, so my Generals now have quite a job in front of them. Consider the poor General whose job it is to conquer Europe. Where does he start? Well, like me, the logical thing to do is divide up the work, and the way to do that is country by country. "First I'll conquer France," he says. (Ha ha, get it? That's apparently screamingly funny if you're a neoconservative. I aim to please a wide audience.) "Then I'll conquer England. Then Germany." Etc.

So my Evil General calls in all of his Evil Colonels, and he says "Okay, you get France, you get England, you get Germany," etc.

The Colonel in charge of conquering England then calls all his Evil Majors and explains "You conquer London, you get Southampton, you get Lancashire," etc.

You can expect this process to trickle down through several more layers of authority, but here's the thing: you can't go on subdividing the territories forever. You can't run an organization with only a bunch of managers, because sooner or later somebody has to do the actual work. You can't have some little underling running around saying "Okay, you conquer this microbe, and you conquer that microbe."

What you need is a terminating condition. On some level, you simply have to have a whole lot of Evil Privates, and the Evil Sergeants will point at the enemy and say "Go now and kill that guy over there." Privates don't get to delegate.

So my Privates kill people, which is something that is in some way completely different from what everyone else is doing, which is just splitting up the territories and telling somebody else to do the work.

Eventually, all my Privates report to all my Sergeants, announcing "Territory secured, sir!" Then all the Sergeants report to their respective Evil Lieutenants, and so it goes. Up the chain, eventually reaching my Colonels, who say "Country secured!" and my Generals then come to me and say "Continent secured!" and then I chortle with glee in my Evil Lair. "I've done it!" I say. "I own all the continents, and the world is now mine. MINE! Mooha, mooha, mwahahahahaha haaaaaaaaaaa!!!"

Or words to that effect. I'll work on my speech when the time comes.

Today, Merge Sort. Tomorrow, the world

We've just covered a fundamental principle of recursion, known as "divide and conquer." In this case, we talked about literally dividing and conquering territories of the world. Now we're going to bring it back to the domain of sorting cards.

I give you an entire deck of cards to sort, and I say "Put these cards in order. You have three minutes." I know, those instructions are a little unclear because the suits can come in any order. Typically in a new, unshuffled deck, all the spades come first, then the hearts, then diamonds, then clubs; and then within the suits, the cards are ordered from Ace to King. We'll assume that you and I both know and agree on the order of a "correctly" shuffled deck.

So you have three minutes to sort the cards. Now, you know your card-sorting abilities really well, and you think to yourself, "Hmmm, I know how fast I can work. I can finish about half a deck in two minutes. If I do that I'll finish almost all the work, but not quite. I'm gonna need some help."

So you call in a friend, who also knows everything there is to know about correctly sorted decks, and works just as fast as you do. You say to your friend "Hey, glad you're here. Here's what I want you to do. I'm going to split these cards in half, and give half to you. You sort your half, I'll sort my half. Then when I'm done, we'll deal out our two sorted half-decks into one pile, or 'merge' them, and we'll have a whole sorted deck."

So you divide up the deck into two. You have 26 cards, which you sort using your favorite technique (you can use selection sort if you want). Your friend sorts his pack of 26 cards, and then you get together.

You have the ace of spades. You lay it face up on the table. Your friend has the two of spades. He lays his down. Your friend also has the three, so he lays another one down. You have the four, so you lay it down. You continue until the whole pack is done; flip the pack and you can deal them out in perfect order, ace first.

You finished your deck in two minutes, and your friend finished his deck in two minutes. Both of you worked in parallel, so the two of you were done in just two minutes, not four. Merging your decks goes pretty quickly, only taking up another minute. Three minutes have elapsed, and you're done.

Double your fun

I tell you: "Great job sorting that pack, now I've got more for you to do. I have two different packs of cards this time, all mixed together. I want you to sort them out into two piles, one with blue backs and one with green backs. I know this will take more time... I'll give you five minutes."

How long does it take to sort a whole deck? From the last job, we know it takes three minutes. Clearly five minutes won't be enough for two decks, so we're going to need more friends.

So we get two more people in on the action. You tell your friend: "I'm going to split up these 104 cards into two piles of about 52 cards each. You get your half sorted, and I'll get my half sorted. Sort them the same way you did before, but this time always put blue cards before green cards."

But selection sort is way too slow, so you then take your 52 cards and turn to another friend. You say "Now I have 52 cards. You take 26 and I'll take 26, and we'll each sort them in two minutes."

Two minutes later, you're both done. You take another minute to merge the decks. Now you have 52 sorted cards. It's been 3 minutes.

You turn to your friend and discover that, with the help of the fourth guy, he also has a sorted pack of 52. You merge the cards -- there are more than before, but you finish dealing them together in another 2 minutes. Five minutes are now up, and you have a single pile of perfectly sorted cards. Then all you have to do is separate them in the middle, to get two decks.

Generalizing the solution to "n"

By now I'm sure you've got the idea. I can sort four decks, or eight decks, or any number of cards I want. All I need is more friends. Your friends aren't "free" -- they take up a lot of space, demand lunch breaks, and if you're asking for a lot of work they might want to get paid. If you don't like that, then you're free to go back to selection sorting the whole deck. But with one person performing an O(n²) algorithm, the job will take hours if you have a lot of cards. Much better to keep dividing and subdividing your packs until you have a manageable little pile, and then it gets easier to merge a few piles at a time.

The analogy to taking over the world should be clear. "Divide and conquer" is in fact a well known program-writing technique. Splitting the deck and handing the job to friends is directly analogous to filtering the world by smaller pieces of land, and then handing them to lower ranking underlings.

Another point this illustrates is the frequently recurring theme of trading space for time. As I said before, the more friends you have sorting cards, the more costly the job is. Inside the guts of your computer, what this means is that using a recursive algorithm eats up more of your computer's memory. This can cause everything else in your operating system to run slower, if you start using up too much memory.

On the other hand, computers these days have a lot of memory, and unless you're trying to run World of Warcraft while you're computing, most of your memory is often sitting idle. Why not throw more storage at the problem, if the time saved is so great?

Better yet, thanks to advances in parallel computing, it's now possible to actually physically divide up the work among multiple machines. You order one computer to sort half the cards, and another computer to sort the other half. Then you just take up a little network traffic to have them report to a master server, which consolidates the information.

One example of a popular program that takes this approach is the SETI@home project. We may never discover intelligent life elsewhere in the universe. But if we want to try, we've got massive task collecting readings from different areas of space. SETI takes advantage of the idle time on computers whose owners have voluntarily installed their program. Your desktop runs the program when you enter screen saver mode, analyzing different areas of space, and the SETI server gathers whatever interesting information that you discover.

Now that everyone's up to speed on the basics of divide-and-conquer, I'll go over QuickSort next time.

Tuesday, March 10, 2009

Big O Notation 2: Sorting

As a Card-Carrying Nerd (you have to mail away exactly 314 cereal box tops for your card) I find entertainment in some very unusual activities. For example, I like to execute algorithms by hand. When I find a pack of cards lying around, I often find myself shuffling them, dealing out about 10-20, and operating a QuickSort on them. Sometimes when I have a stack of old bills and I want to eliminate duplicates while retaining the latest copy for my records, I'll even lay them out in a grid on the floor and perform a bucket sort.

Sorting algorithms are material from Intro to Computer Science 1, but I'm going to give a very condensed overview of the topic as background material to what I'm going to go over in a future post. If you're already rock solid on sorting algorithms, I suggest you skip or skim this one, and I promise I'll get to more practical discussions later.

You'll remember that in the last post on Big O, I explained that a "dumb dictionary search" (just viewing each word one at a time) requires O(n) time, while a more efficient search (which is also known as a "binary search") requires only O(log n) time, which is a lot shorter. But the binary search only works if the words are in some kind order already. If they're all scrambled up -- if "xylophone" comes before "cereal" comes before "pheasant" -- then you have no choice but to execute a dumb search. There's no other way to find your word besides looking at every entry to make sure you didn't miss any.

Therefore, it's often helpful to sort a list of things (such as dictionary words). Of course, sorting a dictionary might take longer than actually searching it. But if you have to search the dictionary multiple times, it's worth it. You eat the cost of sorting only once, but you get to do binary searches as many times as you want after that.

Suppose we take a pack of cards and deal out "n" cards at random. Let's make this quick and pick 5. I'll imagine they came out like this:

5, A, 7, K, 3

How long will it take to put these cards in the correct order? Well, naturally it can't take less than O(n) time. It's easy to see why. Suppose you had a magic function which automatically skimmed the entire list in no time at all to find the smallest number remaining. Then you could scan these cards and find the Ace first, and put it in the first spot. Then you find the 3, and put it in the second spot. And so on. Then the sequence of events would look like this:

A, 5, 7, K, 3 (we put the ace in the first spot, but we needed to move the 5 out of the way, so we just swap their positions)
A, 3, 7, K, 5 (3 swapped with 5)
A, 3, 5, K, 7
A, 3, 5, 7, K (sorted!)

This is the most efficient sort possible, but even with our magic function you still have to move each card to the correct location. There are n cards, so it takes n moves. So no matter what happens, we can't do better than O(n). (Technically, it took "n-1" moves to finish, because once the first four cards are sorted, the king MUST be in the right place. There's nothing to swap with. But remember how I said before that constant multipliers don't matter? Subtracting a 1 definitely doesn't matter. In fact, we ignore any constants that are added to or subtracted from the outcome. Order of magnitude is all that counts.)

Unfortunately, we don't have a magic function that takes no time to find the Ace, so we actually have to read the cards in order to find it. How do we do this? Well, we could read the entire list manually, finding the Ace by inspecting every card. This is a technique known as a "selection sort," for the obvious reason that we are looking through the unsorted list to select the smallest number. Like in the dumb dictionary search, that would take O(n) reads to accomplish (technically n/2 on average, but constant multipliers don't matter).

But it's worse than that. Because we have to move n cards, AND we have to read n cards each time we move one. So we do n*n things, which means that this sort is O(n²).

Terrible. Remember, our dictionary had 170,000 words. You square that, and you're doing almost thirty billion operations. There must be a better way.

There is, and one example is QuickSort. QuickSort is a recursive algorithm that is a little bit tedious to explain in a blog post, but it's so successful that it is implemented as a standard feature of many languages. If you type "sort @mylist" in Perl, you are instructing Perl to perform a QuickSort in most cases. If you have some time to kill, I recommend that you look up QuickSort on Wikipedia, lay out some cards, and try to figure out how it works. It's kind of cool.

QuickSort usually requires a time of O(n*log n), and you can't do much better than that. Well, there are some tricks you can use to speed it up if you know exactly what sort of data you're dealing with, even getting pretty close to O(n) in some cases. But for the most part, you're stuck with n*log n. And this makes sense: In the dictionary example we discovered that you can find a word in O(log n) time, and it takes O(n) time to move all the numbers, so in the best case we are going to find a number as quick as possible and then move it.

If you sort a 170,000 word dictionary in O(n*log n) time, how long will it take? Log 170,000 is about 5, and multiplied by 170,000 itself is about 900,000. That's not fast, but it's nowhere near as bad as 30 billion.

This is the last purely academic post I'm going to write about Big O notation. By now, I'm sure some readers are asking "What's with all this theory? Am I ever going to use this professionally?" My answer is: "You might. For one thing, you can use it to show off in job interviews." Next time I post about Big O, I'll explain how.

Castles of Air