Tuesday, March 3, 2009

Speed up your programs with Big O (part 1)

Part of what inspired me to do this blog was that I found myself explaining "Big O Notation" to two coworkers this week. My coworkers are smart people, but I was a little surprised to find that some aspects of computer science aren't common knowledge.

My intent is to make this blog accessible to software amateurs, but many readers will probably have heard this stuff before. Hopefully I can at least make the description entertaining, because I think it's an important one to be familiar with. If you happen to be one of those people who dimly remembers this topic from your "Intro to Computer Science" class, but forgot it promptly, it's time to learn it again. It matters.

Job interviewers often ask questions where they require you to make up a program that solves a programming puzzle. I have impressed the hell out of numerous interviewers by making up a solution, analyzing the solution in Big O, and then coming up with a more efficient solution. I didn't always get those jobs, but the feedback was always "You seem like an excellent candidate, very solid on the fundamentals, but you don't know enough of the specific technologies we work with." When I got the job at Digital Motorworks last year, the director of engineering flat-out told me that mine was the best answer he ever heard to his question. So pay attention!

A "dumb" dictionary search

Suppose you are looking for a word in a printed dictionary. There are two ways you can do it, a dumb way and a smart way. The dumb way is: read the first word. Is that the word you want? No. Go to the next word. Is that the word you want? No. Go to the next word. And so on.

I just looked up how many words are in a typical dictionary. It's about 170,000. So how long will it take you to find the word you want? Well, it depends, of course. If you are looking for "aardvark," then you will have to look at only one word, or just a few. If you are looking for "zymurgy," then you will have to look at nearly all 170,000. On average, you can expect to read through 85,000 words: half of 170,000.

Get Smart

Nobody reads the dictionary that way, of course. Think about how you would really look up a word, such as "program." Open the book to the middle. You find the word "mouse." That is before the word you want, so divide the right side of the dictionary in half, pick another page, and look again. "Teacher." That is after the word you want, so divide the space between "mouse" and "teacher" in half and look again. And so on.

This technique is much more efficient than the dumb search, because it takes advantage of the fact that the dictionary is sorted for you. How long does it take? When you first open the dictionary, you are splitting the search space in half: instead of 170,000 words, you now have 85,000 words to look at. Then you split the search space in half again. With each page turn, you divide the number of words remaining for you to look at. So here's how it goes:

  1. 170,000
  2. 85,000
  3. 42,500
  4. 21,250
  5. 10,625
  6. 5,312
  7. 2,656
  8. 1,328
  9. 664
  10. 332
  11. 166
  12. 83
  13. 41
  14. 20
  15. 10
  16. 5
  17. 2
  18. 1
18 Words we looked at. Instead of 85,000. Much better, right?

170,000 is a very specific number of words for a dictionary to have, so let's generalize this. The number of words in the dictionary is "n". How long did the dumb search take? We read on the order of n/2 words. Therefore, we say that solution took O(n/2) operations. That's all "Big O" means: "O" stands for "Order of." Nothing magical here, right?

How about the smart search? It took 18 searches, because
218 > 170,000
(17 wouldn't be enough. Check.) If you fiddle with that equation, you'll find that
18 > log2 170,000

So, to speak more generally, the smart dictionary search is O(log2 n).

With me so far?  Now I'm going to get slightly tricky.

When I said the dumb search was O(n/2), the "/2" part is a constant multiplier: it represents the number 1/2, which is always the same.  You take O(n) and you multiply it by one half.  But really, we don't need to be that specific.  We don't know exactly how long each of the operations will take, so when we are estimating compute time, it's pointless to factor in a constant.  For instance, if you do 100 operations that each take half a second, or you do 50 operations that each take a whole second, it amounts to the same thing.

So, by convention, we ignore constant multipliers and don't include them.  Instead of saying that the dumb search is O(n/2), let's just say that it takes O(n); all we care about is finding the right order of magnitude anyway.

Similarly, I said the smart search takes O(log2 n).  But the "2" subscript in "log2" is unnecessarily specific too; all we care about is that it's a log.  So the smart search is O(log n) time.

Why should you care?

Let's consider some specific values of "n" and see how much time we save by using an O(log n) search instead of an O(n) search.
n=10 -> log n=1
n=100 -> log n=2
n=1,000 -> log n=3
n=10,000 -> log n=4
n=100,000 -> log n = 5
n=1,000,000 -> log n = 6

Obviously, if you are searching ten values, you can probably get away with scanning the entire list, but if you are searching a million values, you would rather have it take about 6 reads instead of a million.

As obvious as that is when I put it like that, I can tell you that there are a distressingly high number of times when I've sat down and analyzed code to discover that something which should easily take O(log n) time is written in such a way that it takes O(n) or even O(n2).  The reason for this is that the guy who came before me started out with a small list of items, and then failed to plan for a time when the data would grow out of control to a thousand items or more.

You should be planning for this stuff up front.  You should be asking yourself "How many values am I dealing with now?  How many values can I imagine dealing with in the future?"  If your program is dealing with information about cars, for example, you should think about how many cars there are in the world, and what fraction of those cars will eventually be in your database if you're a big success.

Remember that if you don't plan for this, you will have to put in the effort to change it.  But also, by the time it gets to that point, you will have forgotten where the O(n2) function is, so you'll have to put in the effort of finding it and decoding the logic too.  That's why you want to be mindful of future bottlenecks for your code as early as possible.

Next time I get around to posting about Big O, I'll be writing a little about sorting routines, interview challenges, and strategies to make basic logic more efficient -- some or all of those, as time and post length allow.

4 comments:

  1. I was actually relearning (previously learned this in my freshman level CS course about two years ago) this concept in my discrete math course today. What my professor taught was pretty much the same, but I really do like your presentation better.

    Thanks for the blog, I really enjoy it.

    - An undergrad CS major

    ReplyDelete
  2. If you fiddle with that equation, you'll find that
    18 > log2 170,000


    I'd work it from the other end: with one try, you can search a 2-word dictionary (is the first word the one that I want? If not, it must be the second one). With two tries, you can break up a dictionary into two sub-dictionaries that can be searched with one try, i.e., 2×2 = 4 words. With three tries, you can break it up into two 2-try-sized sub-dictionaries: 2×4 = 8.

    More generally, using k tries, you can search a dictionary of n=2^k words. Solving for k gives k = log_2(n).

    ReplyDelete
  3. Okay, a couple of questions from a prog newb.

    Does your system only work if your database is both sorted and static?

    This could be just my lack of math skills, but in your first example you identified 170,000 entries can be found in 18 searches (n=170,000 -> log n = 18), but the second example you said (n=1,000,000 -> log n = 6). What am I missing?

    ReplyDelete
  4. Great questions, James.

    Does your system only work if your database is both sorted and static?

    The reason database tables read information faster than ordinary text representations or spreadsheets is that they are indexed. There are two kinds of indexes: clustered and non-clustered. A clustered index will physically sort your entire table. Obviously you can only have one clustered index. A non-clustered index is simply a sorted list of one column value (for instance, last name) which then points to the memory location of the correct table row.

    If you have a database search that is running very slowly, it's usually because you're doing your initial search on a non-indexed column, which means that you have to read nearly every item in your table and examine the values one at a time. It's a dumb dictionary search.

    This could be just my lack of math skills, but in your first example you identified 170,000 entries can be found in 18 searches (n=170,000 -> log n = 18), but the second example you said (n=1,000,000 -> log n = 6). What am I missing?

    When I say it's "18" or "6" that's really a rough estimate, because remember we eliminated the constant multipliers. So we said that the dumb search was O(n), even though it "really" takes 85k operations and not 170k operations.

    The difference between (log_2 n) and (log
    _10 n) is also a constant multiple -- there's a factor of about 3.32 between them. For example:
    16 = 2^4 = 10^1.204
    and
    4/1.204 = 3.32
    Similarly, 18 is 3 times bigger than 6, so you can how they're similar orders of magnitude. When we eliminate that factor of 3.3, the calculation will be correct.

    Maybe we could make the dictionary search "really" take (log_10 n) operations instead of (log_2 n) operations if we divide up the dictionary into ten equally sized parts instead of just two. Then it would indeed take 6 operations and not 18. But there are additional hidden costs associated with dividing the dictionary up in pieces, so you wouldn't gain any performance that way.

    Because it's a headache to price out all these different hidden costs, and not very reliable anyway, that's why big O is kind of a rule of thumb and we just ignore constant multipliers.

    ReplyDelete