Castles of Air: best practices

Showing posts with label best practices. Show all posts

Friday, October 16, 2009

Ugly numbers in a heap

As I was saying, a good solution to finding ugly numbers would be to somehow enumerate only numbers which are known to be ugly, removing them from memory after they have been read. I couldn't think of the way to do that. Internet to the rescue.

Here's what you do:

Store the number 1 in a set.
Consider the smallest item, "n" in the set. That is counted as the next ugly number.
Add the numbers n*2, n*3, and n*5 to the set.
Remove n from the set.
Repeat steps 2-4 until you reach the desired ugly count (1500).

Each time you search for the next ugly number, you remove one number from the set but add three more, for a net gain of two. Thus, the set of numbers you are storing on each step i is (2*i-1). That's a manageable memory size. But there's still a snag.

Sets are unordered lists, and we want to always find the smallest number of the set as efficiently as possible. How do we do that? We can't just read every number in the set, or the problem scope quickly becomes O(n²). Can't have that.

Turns out there is an ideal data structure for handling this kind of thing, but it's something I haven't looked at since I was an undergrad. It's called a min-heap.

A heap is a special kind of binary tree, where every level of the tree is filled to maximum density. For instance, in a heap with 8 elements, you know that the top level has one node with exactly two children; each of those two children has two children (that makes 7), and the leftmost node at the third level has one child on the left.

The neat thing about using a heap is that because of its predictable structure, you can represent it using an expandable one-dimensional array. You don't have to mess with pointers or the other headaches of data structures. If you are looking for root.left.right, you just calculate the spot in the array and access element 4.

A min-heap is constructed in such a way that the children of each node are guaranteed to be larger than the parent node. In other words, the smallest value is guaranteed to be at the root, and each child of the root is itself a min-heap with the smallest element of that at the root.

(If I'm going too fast for you, speak up. I don't believe I've ever done a post specifically on binary trees, so maybe I should.)

In any case... you can do two things with a min-heap: add a new element, which is rapidly sorted to the right place; and remove the root element, which then moves the next smallest element into the root and fixes the rest of the heap so that it is still a min-heap.

This is exactly what we need. It keeps the cost of searching for the smallest known ugly number to O(log n), where n is the number of elements currently in the list.

So to recap:

Write your own min-heap class, which is the fun part. ;)
Add the number 1 to the heap.
Inspect the root of the heap, and count that number ("r") as ugly.
Add 2r, 3r, and 5r to the heap.
Delete the root.
Repeat steps 3-5 until you reach the desired ugly count (1500).

Actually it turned out that wasn't quite the answer, because you can get duplicate ugly numbers on the heap. For instance, in the second pass you add 2*3, while in the third pass you add 3*2, which results in multiple 6's on the heap. But that's easy to fix. All we have to do is change step 5 to: "Keep deleting the root until it is no longer equal to the last recorded ugly number."

And there you go. Small storage space, quick execution time, ignores non-ugly numbers, and finishes in 3 seconds easily.

SPOILER:

In case you wondered, the answer is 859,963,392.

Ugly numbers and prime numbers

The second wrong approach I tried taking to find ugly numbers was something called a Sieve of Erathosthanes. This was something that my dad taught me about in high school, and it helped me win a programming contest as a sophomore.

Here's where I should give a quick shout out to my favorite HS teacher, Thom Laeser, who I think might read this blog. Thom was kind of a Renaissance man as a teacher; across my four years in high school, I took one geometry class, two programming classes, and a journalism class from him. He also started the school's first Humanities course, which I did not attend. It wasn't until high school that started really programming (rather than just dabbling in BASIC), and I credit Mr. Laeser with giving me that extra boost.

When I was a sophomore, he gave all his programming students a challenge, which was to be the one who could write a program printing the most consecutive prime numbers within five minutes. This was 1990 and the computer lab probably was limited to 80386 processors or something.

It was clear early on that the contest was going to be between me and a senior. The next day, we both proudly came in with programs that printed all primes from 2 to MAXINT in well under 5 minutes. This was in Pascal, I think integers were 16 bit, so the highest number would be 65,536. Mr. Laeser shrugged and said that was very impressive, but we would have to keep going if we wanted to win.

So. Along the way, I had to learn to write my own large integer output, and also save space by representing numbers with bit manipulation instead of wasting entire "int" values on booleans. More importantly, I learned about the Sieve of Erathosthanes, which I mentioned before.

The sieve is a method for quickly weeding out prime numbers from a finite list of numbers. Here's how it works (simplified version): start with an array of booleans from 1 to, say, a million. The first true prime number is two, so cross out all multiples of 2 that are not 2; so, 4, 6, 8, 10, etc. Then we move up to three, crossing out 6 (again), 9, 12, etc. Next we encounter 4, but 4 is already crossed out, so we ignore it and move up to 5, eliminating 10, 15, 20, etc.

It turns out that you only have to filter on numbers up to the square root of the array size, which in this case is 1000 (sqrt(1000000)). After that, you know there are no multiples of 1001 because you would have had to multiply 1001 by something smaller, and hence already visited, to reach a million.

Long story short, I won the contest, but it was a grueling arms race for several weeks as we kept improving our programs and then "stealing" new ideas by observing how the other's program worked.

Sieve of Ugly

So that was a roundabout personal story, but I haven't told it here before so I thought I might as well get it out. And the point is that my second pass at solving the ugly number problem was to use a sieve.

Sort of a simple modification, really. I didn't just use a boolean array, because there are now three kinds of numbers: 1: prime; 2: ugly; 3: neither prime nor ugly. So I made it an integer array, where each integer stores one of three flags. All numbers start prime; then multiples of 2, 3, and 5 get marked as ugly. Then run another pass on all numbers above 5, marking their multiples as non-ugly.

Using integers was super-overkill, as an int is 32 bits, and I only needed 2 bits to represent my flags: 00=prime, 01=ugly, 10=non-ugly. What I found was that I couldn't sieve up to the 1500'th ugly number, because the array was so huge that I ran out of memory.

So first I changed it to an array of bytes (8 bits). Then, still not satisfied, I improved performance by writing some bit-manipulation routines to simulate an array of two-bit numbers using four numbers per byte.

It worked; I found the same number much faster than before, but still with a noticeable delay before the sieve finished running. 3 seconds was still way out of reach.

What next?

While I was coding the sieve, it occurred to me that it was a terrible waste of space to store ALL the numbers, both ugly and non-ugly, in an array until the entire program was finished. It seemed like there ought to be some kind of quicker way to figure out, given the current ugly number, what the next one in the sequence should be. At worst, I could maintain an array of only 1500 numbers to store only ugly numbers (instead of all numbers up to a trillion).

I wracked my brains for this method; I wrote down the factorization of the first 10 numbers or so, looking for a pattern, but couldn't find one. I'm pretty sure there isn't one, but I was on the right track at this point. I know because that's when I "cheated" by looking to see what other people had done.

And maybe they're smarter than me, or maybe they spent more time on it, but there is still a certain grim accomplishment in not having to reinvent the wheel.

There's a well known "hard" problem in computer science called the traveling salesman problem, which goes like this: You have a bunch of cities, separated by known distances. You have to visit all of them while spending as little time as possible in transit.

To solve this problem by brute force becomes impossible very quickly as you add more cities. So the best known approaches use short cuts; they use sneaky tricks that don't necessarily produce the "best" answer, but are guaranteed to produce a pretty good answer most of the time.

Anyway, I heard that some computer science genius was once asked how he would solve the traveling salesman problem, and he said: "I would put it on the back of a cereal box. 'Find the most efficient path through these points. Winner gets $100, mail your answer to the following address'..."

Well, that's one kind of divide and conquer solution right there. Hooray for the internet, though... now we don't even need cereal boxes.

Online programming puzzles and "ugly" numbers

Ran across this site while looking for general programming puzzles to stay in practice.

On the site, my attention was captured by this puzzle involving "ugly" numbers.

Ugly numbers are numbers whose only prime factors are 2, 3 or 5. The sequence

1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, ...

shows the first 11 ugly numbers. By convention, 1 is included.

Write a program to find and print the 1500'th ugly number.

This isn't a particular hard puzzle if you brute force it. All you have to do is write an algorithm to determine whether a particular number is ugly or not, sort of like this:

while (num % 2 == 0) num /= 2;
while (num % 3 == 0) num /= 3;
while (num % 5 == 0) num /= 5;
if (num == 1) return true;

Then you can just loop through as many numbers as it takes, incrementing a counter as you find ugly numbers, until you reach 1500 numbers.

The problem with this approach is that there is a time limit of three seconds. The brute force approach requires you to do an increasingly long operation (dividing repeatedly to find the prime factors) on a series of numbers which turns out to be very large (I won't spoil it, but the answer is just short of one trillion). Early numbers are computer quickly; then it slows down so finding the 1500th number may take several minutes.

I confess I cheated to solve this puzzle. I came up with one other wrong approach after trying the brute force method, then went googling for discussions of this puzzle. As I hoped, I did find a protracted argument about the right way to solve it. One guy hit on a really clever solution, based on a data structure which I haven't used for years. It took me a few tries to implement, but it was very satisfying in the end.

I guess I could feel bad for "cheating," but I don't. I came close to finding the right answer, but really, if you want an optimal answer to a well defined puzzle then there's no harm in hunting around to see if someone did it better than you. Professionally, implementing a GREAT solution that you looked up is preferable to implementing a mediocre solution that you got with only the sweat of your brow. You can learn new tricks from other people that you'll have in your toolbox for next time.

As a tangent, I take the same approach to strategy games. I like to take my time and learn the rules of a game on my own, and I would scorn to look at a strategy guide during that period. However, there comes a point where you have a basic understanding of the rules and can improve much faster by considering input from others who have already mastered the game.

Anyway, more on the wrong approaches and the right one later.

Friday, August 28, 2009

Divide and conquer your project, part 1

James, who is a regular commenter, is a programmer looking for more experience. James has studied a few languages and probably has a decent grasp on the basic logic that it takes to write a program, but he wants to work on a personal project because he doesn't get enough variety of practical applications at work.

This is an admirable goal, first because developing your programming skills requires constant practice, and second because personal projects can make good resume padding. Potential employers like to hear that you program for fun, because it will make you more likely to throw yourself into your work. Also, if they ask "Have you worked with technology X?" you can say "Not professionally, but let me show you this project that I posted on my web page." It gives you the opportunity to talk intelligently about the technology.

However, starting a new project can be overwhelming, and it's hard to know where to start. You have all these things in your head that you would like the program to do, but you can't see how to get from what you have now (a blank screen) to what you would like to have (an awesome program with flawless logic and a clean interface). What do you do first?

I spent a few posts discussing the technique of "divide and conquer." As it turns out, that is also an excellent plan for approaching your programming projects. Here's a little secret of programming: you won't get it right the first time. Fuhgeddaboutit. Don't try. If you sit down for a couple of hours and try to write an enormous, complex program in one shot, and only after you think it is done do you compile it for the first time, you will be confronted by a screen full of errors and another few hours of headache, tracing back through what you did to figure out where each of numerous bugs exists in your thousand line program.

Actually, my proudest moment as an undergrad was writing a single assembly language module that executed perfectly with no debugging on my first try. But that's not something you should count on, and in any case, this was one line in a much larger program.

When you are building a castle of air, you do it brick by brick. Except they're, you know, wind bricks. You don't wave your hands and get a fully formed castle. You break down the project in your mind into easily manageable pieces, and at each step you (1) Look at that step to determine whether you should break it down even further before proceeding, (2) Write complete code that does something visible, and (3) test that component thoroughly.

That third step is crucial, and it's a hallmark of a solid programmer that they develop testable and tested code at every step. When you test a piece of your code until you know it's solid, you will save yourself a lot of time later that would have been spent backtracking to figure out where your code is going wrong.

How do you break it down? Let's consider the project that James set for himself.

One idea I've had is a program that accepts inputs on your credit card balances, minimum payments, etc, and calculates the most efficient payment plan to snowball. I've got a spreadsheet I downloaded from somewhere that I'd model, but the spreadsheet can't jigger your payments to determine the best way to pay off the balances the soonest, or with the least amount of interest. I envision it as an Iphone App, although I learned that without a Mac I can't write it that way.

Actually a spreadsheet probably can do that, but writing a full featured program would be cooler. So let's consider what components this program has.

A popular way of dividing up pieces of a program is to start by splitting it into three layers: a presentation layer (the way your program handles input and output); a data layer (the information that your program will be manipulating); and a business layer (the program logic that figures out what to do). If you split your program up into pieces this way, then you can hopefully change up the pieces at any stage without risking ruining your entire program.

How you start your problem will largely depending on which component is most interesting to you personally. In many projects, the graphics are the most interesting thing. In that case, your top priority would be to get something, anything to display on the screen.

For example: My first real programming job was modeling equations for scientists. There was an eccentric old physicist who wanted me to graph some of the equations he wrote for me, using Fortran on a green and black terminal. As I progressed on getting the equations to display the right answer, he seemed distinctly bored with hearing about the programming details. But then I figured out how to generate a line diagram. It excited the hell out of him, and suddenly he couldn't stop talking about his ideas for what we could do with this.

The diagram, of course, bore almost no relation to the actual equations he had described. But it didn't matter. It was a launching point for his concept of the program. Similarly, as you write your own programs, hopefully you will keep on continually inspiring yourself by producing code that does something that will tickle your interest in the next phase of your project.

Now, in James' case, the interesting part of the project is probably not the graphics. It's the business logic. After all, the ultimate objective is not to create something that looks cool, but something that provides useful information. Since this is a program that you will be using yourself, the inspiration will come from your feeling that "Wow! This is really telling me some valuable stuff! But I bet it would be even more useful if..." And then whatever comes after "if" will determine which piece of the program you conquer next.

So the business layer is the most important part of your program. Therefore, I would recommend that you pick a language, and start with an extremely simple presentation layer and data layer. So for presentation, let's say you'll use plain text input and output. For data, let's say that you'll write a module or an object that hard codes the value of your credit card balances, and then you can tweak the code to change the values.

These aren't good solutions, but they are stepping stones for the program you really want to build. Write good business logic first, using crude presentation and data models, and then make it better later. How will you make it better? Well, you might decide to turn your text input into a web form interface, and perhaps add graphs. You might change your data from hard-coded values, to a separate text file that you will edit by hand, to a database table. As long as you keep these parts of the program reasonably well separated, you can build them up a bit at a time, focusing on the parts that strike your fancy.

Whew, that was a mouthful. Pausing for feedback now. James, do you think this is enough to get you started? Do you have an idea which language you'll start in? Any thoughts or questions?

Friday, February 27, 2009

Search and destroy missions for your bugs

Pop quiz, hotshot. You're new on the job, you have a million lines of code that you've never looked at before, and there's a bug. It's doing something really bad, like taking five minutes to load and then displaying a blank screen. Your boss has some vague idea of what the program should show, or used to show, but it's not doing that. The most specific instruction you can get is "make it look like this" or worse, "just fix it." What do you do??

Take a deep breath. First of all, don't be intimidated by the size of the code. It really doesn't matter. The whole program may have a million lines, but you can be sure it's not actually doing a million things when it tries to run your specific command. Think of the million lines of code as locations on an extremely detailed map. If you want to find a street in Austin, you won't get very far by carefully reading every street name on the whole Austin map. Instead, you want to narrowly focus on the specific path your program takes when you execute your command.

In a nutshell, you are a detective. A crime has been committed: a bug has murdered the effective execution of your program. You know that the perpetrator went from point A (the beginning of the program) to point B (the place where the program is misbehaving). Your first task is to follow a trail of clues and find the suspect. After you know where the error occurred, the last part of your job -- move the suspect into custody by making your program do what it's supposed to -- is relatively easy.

Let me talk math for a minute. I'm going to solve a puzzle live in this post. I have a large number: let's pick 987,654. I want to know what is the square root of this number to the nearest integer, but I'm not allowed to just hit the square root buttonon my calculator. Instead, I can only pick a number and multiply it to see if my answer is right. How will I find the answer?

I'll indent my solution in the next several paragraphs so you can see where the rest of the post continues. Think of how you might solve it before reading on.

We could brute-force it with trial and error. 1*1 = 1. Nope. 2*2 = 4. Nope. 3*3 = 9. Nope. And so on. This could take a while.

Let's be smarter about this. We want to get CLOSE to the number and then home in from there. So let's start with an educated guess. About what order of magnitude is my number? 987654 is pretty close to a million, and it's easy to figure out the square root of 1000000: it's 1000. (Just cut the number of decimal places in half.) So we know the answer is less than 1000.

Is it 900? 900*900 = 810,000. Too small. Let's pick a bigger number. Go halfway up, we get 950.

950*950 = 902,500. Still too small. Let's try 990.

990*990 = 980100. Hey, we're getting closer! But it's still too small. How about 995?

995*995 = 990025. Now it's too big, but just a little bit.

At this point we know that it's between 990 and 995, so let's just step backwards from 995 until we find it.

994*994 = 988036. Too big.
993*993 = 986049. Too small.

Ah ha! We've found the answer! We know that the square root of 987654 is more than 993 and less than 994. In other words, it's 993 point something. If we wanted to, we could keep playing this game to guess more places after the decimal. But I said that we'd only go to the nearest integer, so "993 point something" is good enough.

Let me check with the calculator now that we've settled on an answer:
sqrt(987654) = 993.808. See? I was right.

I've just demonstrated something very much like Newton's Method of approximation. Instead of brute forcing the solution by walking through every possible number, we started in a likely spot and quickly converged on an answer. Now, what insight does this give into debugging?

Figure out a likely entry point for your code. For instance, if your program errors out when you click the button that says "Display all prices" then search all your files for the exact string "Display all prices". If it only exists in one place, you've got a location to start looking, and you're off.

Quick sanity check. Once you've found this spot in the code, you might want to temporarily change the text to "Display all prices!!!" Then reload the program and see if the exclamation points show up. I can't tell you how often I've found what I thought was the right spot in the code, but then wasted a bunch of time wondering "Why didn't THIS change do anything?" when I'm actually not in the right spot at all. This hearkens back to the important principle from yesterday's post: Programming is both theoretical and experimental. Don't just trust your guesses to be right, do the experiment!

So you know where your code started, now where is it going? Off the top of my head, there are three important tools you have for narrowing down the location of your bug.

Use a debugger. Decent program development environments all have a debugger. Learn how yours works, it is your best friend. With the debugger, you can step through your code a line at a time, see exactly where it's going, spot check your variables, and trace back where you've been. However, some types of development don't make debugging easy. For instance, if you're developing web pages, the program runs on your browser and not in your editing environment. Sometimes code can load an unrelated program and you have to go to a different tool. So if you have no debugger or you run out of use for it, you have to go to your backup plan:
Print statements. Lots of print statements. Print what you're doing: "X was set to 15!" or "Entering function..." "leaving function...". Just remember that debug statements in shipped code look really bad, so don't fail to clean up all your print statements when you're done. A good habit is to put a distinctive string in front of all your prints to make sure you don't miss any. For instance, sometimes I will make debugging statements like
print "RG Entering function foo";
The string "RG " rarely shows up in code, so I can search for all occurrences and delete them when I've fixed my problem. (Some languages use "ARGV", but if you include the space after "RG " then this won't show up in the search.) If you are willing to put in a little extra time, the preferred solution is to write a "printDebug" routine that only prints when a flag is set, so that you can globally turn off all debug messages if necessary. This isn't always worth the effort, though, when working across multiple files that don't share libraries.
Comment out large blocks of code. Just remove them entirely. Don't underestimate this technique. If your problem is that the program is slow, and you comment out an entire function, and it STILL takes five minutes, you immediately know "Guess that function isn't causing the slowness!" Then you don't have to waste your time looking there.

A word about print statements: It's not necessarily that easy. Sometimes you will add a "printf" command (C) or "cout" (C++) or "System.out.println" (Java) and you see nothing at all. "Print" in this case means "do whatever you can to make it visible." If you're coding to a web page, write to the web page's output stream. If you're writing Javascript, use "alert" statements to pop up a window. If your program uses a log file, write to the log. If you're writing a windowed program, you might want to make a special panel that you can use to display debug messages. Just figure out what makes the most sense to make your messages visible.

Basically the objective here is to narrow down all the possible locations where the bug might be. If your program crashed, where did it crash? If it was supposed to display a picture of a flower and didn't, did it even reach the "drawFlower()" function? If so, why is drawFlower broken? If not, where did it make a wrong turn?

So you're a detective, casting a wide search net at first but tightening the net. From the million lines of code, we've found that the error must be happening in this 1,000 lines. Then we cut it down to 100 lines, then 10, then 1: this exact line is where it misbehaved.

As you tighten your search net, you will dig deeper into the code. If you get down to one line and discover that it's a function you control, you're not done yet: you have to step into that function and keep going. For instance, I comment out a single line and discover that the Bad Thing no longer happens. I uncomment the line, step into the function, and comment out the entire function body. Same result. Good, my guess is confirmed. Now uncomment half the body. Now does it still do the Bad Thing? Is it getting inside this "if" block, or the "else" block? Why did it go here and not there? What values is it seeing when this decision is made?

Ultimately, fixing a bug in a million lines of code often comes down to changing one line. So finding where the bug is, is actually 99% of the work. Hence, this is probably the single most important skill you can develop.

Castles of Air