Friday, February 27, 2009

Search and destroy missions for your bugs

Pop quiz, hotshot.  You're new on the job, you have a million lines of code that you've never looked at before, and there's a bug.  It's doing something really bad, like taking five minutes to load and then displaying a blank screen.  Your boss has some vague idea of what the program should show, or used to show, but it's not doing that.  The most specific instruction you can get is "make it look like this" or worse, "just fix it."  What do you do??

Take a deep breath.  First of all, don't be intimidated by the size of the code.  It really doesn't matter.  The whole program may have a million lines, but you can be sure it's not actually doing a million things when it tries to run your specific command.  Think of the million lines of code as locations on an extremely detailed map.  If you want to find a street in Austin, you won't get very far by carefully reading every street name on the whole Austin map.  Instead, you want to narrowly focus on the specific path your program takes when you execute your command.

In a nutshell, you are a detective.  A crime has been committed: a bug has murdered the effective execution of your program.  You know that the perpetrator went from point A (the beginning of the program) to point B (the place where the program is misbehaving).  Your first task is to follow a trail of clues and find the suspect.  After you know where the error occurred, the last part of your job -- move the suspect into custody by making your program do what it's supposed to -- is relatively easy.

Let me talk math for a minute.  I'm going to solve a puzzle live in this post.  I have a large number: let's pick 987,654.  I want to know what is the square root of this number to the nearest integer, but I'm not allowed to just hit the square root buttonon my calculator.  Instead, I can only pick a number and multiply it to see if my answer is right.  How will I find the answer?

I'll indent my solution in the next several paragraphs so you can see where the rest of the post continues.  Think of how you might solve it before reading on.

We could brute-force it with trial and error.  1*1 = 1.  Nope.  2*2 = 4.  Nope.  3*3 = 9.  Nope.  And so on.  This could take a while.

Let's be smarter about this.  We want to get CLOSE to the number and then home in from there.  So let's start with an educated guess.  About what order of magnitude is my number?  987654 is pretty close to a million, and it's easy to figure out the square root of 1000000: it's 1000.  (Just cut the number of decimal places in half.)  So we know the answer is less than 1000.

Is it 900?  900*900 = 810,000.  Too small.  Let's pick a bigger number.  Go halfway up, we get 950.

950*950 = 902,500.  Still too small.  Let's try 990.

990*990 = 980100.  Hey, we're getting closer!  But it's still too small.  How about 995?

995*995 = 990025.  Now it's too big, but just a little bit.

At this point we know that it's between 990 and 995, so let's just step backwards from 995 until we find it.

994*994 = 988036.  Too big.
993*993 = 986049.  Too small.

Ah ha! We've found the answer!  We know that the square root of 987654 is more than 993 and less than 994.  In other words, it's 993 point something.  If we wanted to, we could keep playing this game to guess more places after the decimal.  But I said that we'd only go to the nearest integer, so "993 point something" is good enough.

Let me check with the calculator now that we've settled on an answer:
sqrt(987654) = 993.808.  See?  I was right.

I've just demonstrated something very much like Newton's Method of approximation.  Instead of brute forcing the solution by walking through every possible number, we started in a likely spot and quickly converged on an answer.  Now, what insight does this give into debugging?

Figure out a likely entry point for your code.  For instance, if your program errors out when you click the button that says "Display all prices" then search all your files for the exact string "Display all prices".  If it only exists in one place, you've got a location to start looking, and you're off.

Quick sanity check.  Once you've found this spot in the code, you might want to temporarily change the text to "Display all prices!!!"  Then reload the program and see if the exclamation points show up.  I can't tell you how often I've found what I thought was the right spot in the code, but then wasted a bunch of time wondering "Why didn't THIS change do anything?" when I'm actually not in the right spot at all.  This hearkens back to the important principle from yesterday's post: Programming is both theoretical and experimental.  Don't just trust your guesses to be right, do the experiment!

So you know where your code started, now where is it going?  Off the top of my head, there are three important tools you have for narrowing down the location of your bug.
  1. Use a debugger.  Decent program development environments all have a debugger.  Learn how yours works, it is your best friend.  With the debugger, you can step through your code a line at a time, see exactly where it's going, spot check your variables, and trace back where you've been.  However, some types of development don't make debugging easy.  For instance, if you're developing web pages, the program runs on your browser and not in your editing environment.  Sometimes code can load an unrelated program and you have to go to a different tool.  So if you have no debugger or you run out of use for it, you have to go to your backup plan:
  2. Print statements.  Lots of print statements.  Print what you're doing: "X was set to 15!" or "Entering function..." "leaving function...".  Just remember that debug statements in shipped code look really bad, so don't fail to clean up all your print statements when you're done. A good habit is to put a distinctive string in front of all your prints to make sure you don't miss any.  For instance, sometimes I will make debugging statements like
    print "RG Entering function foo";
    The string "RG " rarely shows up in code, so I can search for all occurrences and delete them when I've fixed my problem.  (Some languages use "ARGV", but if you include the space after "RG " then this won't show up in the search.)  If you are willing to put in a little extra time, the preferred solution is to write a "printDebug" routine that only prints when a flag is set, so that you can globally turn off all debug messages if necessary.  This isn't always worth the effort, though, when working across multiple files that don't share libraries.
  3. Comment out large blocks of code.  Just remove them entirely.  Don't underestimate this technique.  If your problem is that the program is slow, and you comment out an entire function, and it STILL takes five minutes, you immediately know "Guess that function isn't causing the slowness!"  Then you don't have to waste your time looking there.
A word about print statements: It's not necessarily that easy.  Sometimes you will add a "printf" command (C) or "cout" (C++) or "System.out.println" (Java) and you see nothing at all.  "Print" in this case means "do whatever you can to make it visible."  If you're coding to a web page, write to the web page's output stream.  If you're writing Javascript, use "alert" statements to pop up a window.  If your program uses a log file, write to the log.  If you're writing a windowed program, you might want to make a special panel that you can use to display debug messages.  Just figure out what makes the most sense to make your messages visible.

Basically the objective here is to narrow down all the possible locations where the bug might be.  If your program crashed, where did it crash?  If it was supposed to display a picture of a flower and didn't, did it even reach the "drawFlower()" function?  If so, why is drawFlower broken?  If not, where did it make a wrong turn?

So you're a detective, casting a wide search net at first but tightening the net.  From the million lines of code, we've found that the error must be happening in this 1,000 lines.  Then we cut it down to 100 lines, then 10, then 1: this exact line is where it misbehaved.

As you tighten your search net, you will dig deeper into the code.  If you get down to one line and discover that it's a function you control, you're not done yet: you have to step into that function and keep going.  For instance, I comment out a single line and discover that the Bad Thing no longer happens.  I uncomment the line, step into the function, and comment out the entire function body.  Same result.  Good, my guess is confirmed.  Now uncomment half the body.  Now does it still do the Bad Thing?  Is it getting inside this "if" block, or the "else" block?  Why did it go here and not there?  What values is it seeing when this decision is made?

Ultimately, fixing a bug in a million lines of code often comes down to changing one line.  So finding where the bug is, is actually 99% of the work.  Hence, this is probably the single most important skill you can develop.

Thursday, February 26, 2009

Coding is both theoretical and experimental

"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
  - Maurice V. Wilkes, inventor of microprogramming

My father is a computational physicist.  That's a job description that didn't exist when he started working a few decades ago, but he made the decision to transition into that role.

In the past, scientists have generally fallen into two broad categories: theoretical, and experimental.  Theoretical science is distantly related to pure math.  You have an arsenal of known scientific facts, you know the equations, you analyze those equations and learn new consequences that result from them.  Experimental science is more directly involved with measuring and testing the real world.  You come up with ideas, you do something to test if your ideas match reality, and if they do, you've got a theory.  (Ultra-simplified version.)

A computational scientist is a different beast altogether.  A computational scientist can create models of reality using a computer simulation to represent theoretical information.  Then he can run the simulation with different parameters, finding out what would "really" happen if you changed the initial configuration.

When we deal with things like planets, we pretty much know how they behave in general, but you can't say "I wonder what would happen if we put three planets at this distance from each other" because there is no practical way to set up an experiment by moving planets around.  Instead, your computer can simulate what would happen, and you might learn some things from the simulation that weren't obvious just by running the equations.

Programming in general is like that: it's both theoretical and experimental.  You can eyeball your code, trace the logic by hand, and figure out what it should do.  That's theory.  Then you can run the code, either logging the output or stepping through it with effective debugging tools, and find out what it really does.  That's experiment.

In decades long past, computer programs ran on expensive mainframe terminals which could take hours to complete a single execution via, for example, punch cards that were hand-fed to the computer.  Because terminal time was at a premium, writing a program with any bugs could be a costly mistake.  It was more economical to spend hours poring over the punch cards or machine code, making dead certain that the program would run correctly before you fed it to a computer.  In other words, programmer time was cheap and computation time was expensive.

Today, the reverse is true.  Rather than hours, it takes seconds to compile and run a computer program.  Programmers are still free to spend their time making sure that the code "should" run before they run it.  However, in many cases it is easier and smarter just to run it and see if something goes wrong.  Following Moore's Law, computer time has become incredibly cheap.

A smart programmer takes advantage of this and does a mixture of both theory and experiment.  Rather than write an entire program from scratch on paper, then running it and hoping for the best, today's programmers should view a program as a series of components, to be built and tested bit by bit, making sure that each part behaves properly through real demonstration.  Unless I'm in the middle of a major rewrite that deliberately broke old code, I will rarely leave my program in an uncompilable state for more than fifteen minutes, and usually a lot less.  Each change can be tested to make sure that it has predictable behavior.

In an optimal testing environment there are no computer errors; logic errors come from the developers.  That's not a failing on the part of the developer; to err is human, and the point of this style of coding is to recognize and anticipate those errors.  The benefit of working this way is that if you make a mistake, you get immediate feedback that something has gone wrong.

If you write a large number of changes and then discover an error after running them all at once, you are temporarily stuck.  You have no way of knowing which of your changes was responsible, so you have no choice but to backtrack and break your code into smaller fragments to isolate the problem... which is exactly how you should have been writing the program in the first place.

So how do you approach a problem that requires debugging?  I advocate an approach that uses something akin to the scientific method to systematically track down and destroy bugs.  I'll discuss this in a future post.

Programming thought-stuff

Hi, my name is Russell Glasser, and I've been writing software professionally since 1995.  I have a BS in computer science from UC San Diego (1997) and an MS in computer engineering from UT Austin (2007).  I currently do Java Enterprise development for a logistics company in Temple, TX.  In my past lives, I've done work that included scientific modeling, 3D virtual reality simulations, educational games, and data mining.  I also tutored a kid for two years and taught programming classes for about six months.

I'm a prolific blogger, and for a while I've been kicking around the idea of starting a new blog to share thoughts with my colleagues about software development.  I decided to take the plunge today.  I don't know if I'll be updating frequently or not, but add me to your feed and you can read my one post a year if that's how it turns out. :)  Also, I encourage people to ask questions, as it will give me more incentive to write new posts.  The only restriction is that I'm going to insist on keeping this as a purely professional blog, which means I will not field any questions that aren't at least marginally related to software.  If you want to discuss other topics, look me up or email [rglasser *at* apollowebworks *dot*  com].

The title of this blog was inspired by a quote from The Mythical Man-Month, by Frederick Brooks.  When I read it, I discovered that it perfectly encapsulated the reason why I decided to pick software as a career path.

Brooks wrote:

"The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Yet the program construct, unlike the poet's words, is real in the sense that it moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on a keyboard, and a display screen comes to life, showing things that never were nor could be.

Programming then is fun because it gratifies creative longings built deep within us and delights sensibilities we have in common with all men."

I hope you enjoy this blog.