Friday, February 27, 2009

Search and destroy missions for your bugs

Pop quiz, hotshot.  You're new on the job, you have a million lines of code that you've never looked at before, and there's a bug.  It's doing something really bad, like taking five minutes to load and then displaying a blank screen.  Your boss has some vague idea of what the program should show, or used to show, but it's not doing that.  The most specific instruction you can get is "make it look like this" or worse, "just fix it."  What do you do??

Take a deep breath.  First of all, don't be intimidated by the size of the code.  It really doesn't matter.  The whole program may have a million lines, but you can be sure it's not actually doing a million things when it tries to run your specific command.  Think of the million lines of code as locations on an extremely detailed map.  If you want to find a street in Austin, you won't get very far by carefully reading every street name on the whole Austin map.  Instead, you want to narrowly focus on the specific path your program takes when you execute your command.

In a nutshell, you are a detective.  A crime has been committed: a bug has murdered the effective execution of your program.  You know that the perpetrator went from point A (the beginning of the program) to point B (the place where the program is misbehaving).  Your first task is to follow a trail of clues and find the suspect.  After you know where the error occurred, the last part of your job -- move the suspect into custody by making your program do what it's supposed to -- is relatively easy.

Let me talk math for a minute.  I'm going to solve a puzzle live in this post.  I have a large number: let's pick 987,654.  I want to know the square root of this number to the nearest integer, but I'm not allowed to just hit the square root button on my calculator.  Instead, I can only pick a number and multiply it by itself to see if my answer is right.  How will I find the answer?

I'll work through my solution in the next several paragraphs.  Think of how you might solve it before reading on.

We could brute-force it with trial and error.  1*1 = 1.  Nope.  2*2 = 4.  Nope.  3*3 = 9.  Nope.  And so on.  This could take a while.

Let's be smarter about this.  We want to get CLOSE to the number and then home in from there.  So let's start with an educated guess.  About what order of magnitude is my number?  987654 is pretty close to a million, and it's easy to figure out the square root of 1000000: it's 1000.  (Just cut the number of zeros in half.)  So we know the answer is less than 1000.

Is it 900?  900*900 = 810,000.  Too small.  Let's pick a bigger number.  Go halfway up, we get 950.

950*950 = 902,500.  Still too small.  Let's try 990.

990*990 = 980100.  Hey, we're getting closer!  But it's still too small.  How about 995?

995*995 = 990025.  Now it's too big, but just a little bit.

At this point we know that it's between 990 and 995, so let's just step backwards from 995 until we find it.

994*994 = 988036.  Too big.
993*993 = 986049.  Too small.

Ah ha! We've found the answer!  We know that the square root of 987654 is more than 993 and less than 994.  In other words, it's 993 point something.  If we wanted to, we could keep playing this game to guess more places after the decimal.  One more guess settles the rounding: 993.5*993.5 = 987042.25 is still too small, so the root is closer to 994 than to 993.  Beyond that, "993 point something" is good enough.

Let me check with the calculator now that we've settled on an answer:
sqrt(987654) = 993.808.  See?  I was right.
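
Here's a rough sketch of that guessing game in Python.  The language and the function name are my own choices for illustration, and it assumes a non-negative whole number as input.  It only ever multiplies a guess by itself, just like the rules of the puzzle, and it always guesses the midpoint of the range that's left rather than eyeballing it the way I did.

```python
def floor_sqrt_by_guessing(target):
    """Return (root, guesses): the largest integer whose square is <= target,
    plus the list of guesses tried along the way."""
    low, high = 0, 1
    # Educated first guess: grow the upper bound until its square overshoots.
    while high * high <= target:
        high *= 10
    guesses = []
    # Invariant: low*low <= target < high*high.
    while high - low > 1:
        mid = (low + high) // 2
        guesses.append(mid)
        if mid * mid <= target:
            low = mid    # too small (or exact): the answer is at or above mid
        else:
            high = mid   # too big: the answer is below mid
    return low, guesses


root, guesses = floor_sqrt_by_guessing(987654)
print(root, "found after", len(guesses), "guesses")
```

Each guess throws away half of the remaining range, which is why a handful of multiplications is enough where brute force would need nearly a thousand.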

I've just demonstrated something in the same spirit as Newton's Method of approximation (strictly speaking it's closer to a bisection search, since each guess narrows the range that must contain the answer).  Instead of brute forcing the solution by walking through every possible number, we started in a likely spot and quickly converged on an answer.  Now, what insight does this give into debugging?

Figure out a likely entry point for your code.  For instance, if your program errors out when you click the button that says "Display all prices" then search all your files for the exact string "Display all prices".  If it only exists in one place, you've got a location to start looking, and you're off.
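
If your editor doesn't have a "find in files" feature, a few lines of script will do the job.  This is only a sketch in Python: the function name find_string is made up, and it assumes your source files are readable as text under a single root folder.

```python
import os


def find_string(root_dir, needle):
    """Yield (path, line_number, line) for every line that contains needle."""
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(folder, name)
            try:
                # errors="ignore" lets us skim binary files without crashing.
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip()
            except OSError:
                continue  # unreadable file: skip it and keep searching


# Example use, run from the top of the source tree:
#     for path, number, line in find_string(".", "Display all prices"):
#         print(f"{path}:{number}: {line}")
```

One hit means you have your starting point.  Several hits means you have a short list of suspects, which is still a lot better than a million lines.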

Quick sanity check.  Once you've found this spot in the code, you might want to temporarily change the text to "Display all prices!!!"  Then reload the program and see if the exclamation points show up.  I can't tell you how often I've found what I thought was the right spot in the code, but then wasted a bunch of time wondering "Why didn't THIS change do anything?" when I'm actually not in the right spot at all.  This hearkens back to the important principle from yesterday's post: Programming is both theoretical and experimental.  Don't just trust your guesses to be right.  Do the experiment!

So you know where your code started, now where is it going?  Off the top of my head, there are three important tools you have for narrowing down the location of your bug.
  1. Use a debugger.  Decent program development environments all have a debugger.  Learn how yours works; it is your best friend.  With the debugger, you can step through your code a line at a time, see exactly where it's going, spot check your variables, and trace back where you've been.  However, some types of development don't make debugging easy.  For instance, if you're developing web pages, the program runs in your browser and not in your editing environment.  Sometimes code can load an unrelated program and you have to go to a different tool.  So if you have no debugger or you run out of use for it, you have to go to your backup plan:
  2. Print statements.  Lots of print statements.  Print what you're doing: "X was set to 15!" or "Entering function..." "leaving function...".  Just remember that debug statements in shipped code look really bad, so don't fail to clean up all your print statements when you're done. A good habit is to put a distinctive string in front of all your prints to make sure you don't miss any.  For instance, sometimes I will make debugging statements like
    print "RG Entering function foo";
    The string "RG " rarely shows up in code, so I can search for all occurrences and delete them when I've fixed my problem.  (Some languages use "ARGV", but if you include the space after "RG " then this won't show up in the search.)  If you are willing to put in a little extra time, the preferred solution is to write a "printDebug" routine that only prints when a flag is set, so that you can globally turn off all debug messages if necessary.  This isn't always worth the effort, though, when working across multiple files that don't share libraries.
  3. Comment out large blocks of code.  Just remove them entirely.  Don't underestimate this technique.  If your problem is that the program is slow, and you comment out an entire function, and it STILL takes five minutes, you immediately know "Guess that function isn't causing the slowness!"  Then you don't have to waste your time looking there.
A word about print statements: It's not necessarily that easy.  Sometimes you will add a "printf" command (C) or "cout" (C++) or "System.out.println" (Java) and you see nothing at all.  "Print" in this case means "do whatever you can to make it visible."  If you're coding to a web page, write to the web page's output stream.  If you're writing JavaScript, use "alert" statements to pop up a window.  If your program uses a log file, write to the log.  If you're writing a windowed program, you might want to make a special panel that you can use to display debug messages.  Just figure out what makes the most sense to make your messages visible.
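
To make the "printDebug" idea from item 2 concrete, here is a minimal sketch in Python.  The names print_debug, DEBUG_ENABLED, and DEBUG_TAG are invented for this example, and writing to standard error is just an assumption about where your messages will be visible; swap in a log file or a debug panel if that's what your program needs.

```python
import sys

DEBUG_ENABLED = True   # flip to False to silence every debug message at once
DEBUG_TAG = "RG "      # distinctive prefix, easy to search for later


def print_debug(message, stream=None):
    """Write a tagged debug message, but only while DEBUG_ENABLED is set."""
    if not DEBUG_ENABLED:
        return
    out = stream if stream is not None else sys.stderr
    # flush=True so the message appears even if the program crashes right after.
    print(f"{DEBUG_TAG}{message}", file=out, flush=True)


def foo(x):
    print_debug("Entering function foo")
    result = x * 2
    print_debug(f"result was set to {result}")
    print_debug("Leaving function foo")
    return result
```

The tag and the flag work together: the flag lets you ship without the noise, and the tag still lets you search for leftovers when you want to clean house.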

Basically the objective here is to narrow down all the possible locations where the bug might be.  If your program crashed, where did it crash?  If it was supposed to display a picture of a flower and didn't, did it even reach the "drawFlower()" function?  If so, why is drawFlower broken?  If not, where did it make a wrong turn?

So you're a detective, casting a wide search net at first but tightening the net.  From the million lines of code, we've found that the error must be happening in these 1,000 lines.  Then we cut it down to 100 lines, then 10, then 1: this exact line is where it misbehaved.

As you tighten your search net, you will dig deeper into the code.  If you get down to one line and discover that it's a function you control, you're not done yet: you have to step into that function and keep going.  For instance, I comment out a single line and discover that the Bad Thing no longer happens.  I uncomment the line, step into the function, and comment out the entire function body.  Same result.  Good, my guess is confirmed.  Now uncomment half the body.  Now does it still do the Bad Thing?  Is it getting inside this "if" block, or the "else" block?  Why did it go here and not there?  What values is it seeing when this decision is made?
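
The "comment out half and try again" routine is the same halving trick as the square root puzzle.  Here's a toy sketch in Python under some big assumptions: exactly one step causes the Bad Thing, the steps can be switched off independently, and you have some way to check whether the bug still shows up.  The names find_culprit and misbehaves are hypothetical; in real life "misbehaves" is you, rerunning the program and looking at the screen.

```python
def find_culprit(steps, misbehaves):
    """Return the one step in `steps` that triggers the bug.

    Assumes `steps` is non-empty, exactly one step is responsible, and
    misbehaves(subset) reports whether the Bad Thing happens when only
    the steps in `subset` are enabled.
    """
    suspects = list(steps)
    while len(suspects) > 1:
        half = len(suspects) // 2
        first_half = suspects[:half]
        # "Comment out" the second half: run with only the first half enabled.
        if misbehaves(first_half):
            suspects = first_half
        else:
            suspects = suspects[half:]
    return suspects[0]


# Toy stand-in for a real program: the bug lives in line 731 of 1,000.
lines = list(range(1, 1001))
experiments = []


def misbehaves(enabled):
    experiments.append(len(enabled))
    return 731 in enabled


print(find_culprit(lines, misbehaves), "found in", len(experiments), "experiments")
```

Real code is rarely this tidy, since commenting out one block can break another, but the arithmetic is the encouraging part: halving 1,000 suspects takes about ten experiments, not a thousand.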

Ultimately, fixing a bug in a million lines of code often comes down to changing one line.  So finding where the bug is, is actually 99% of the work.  Hence, this is probably the single most important skill you can develop.

13 comments:

  1. Distinctive string or not, anytime you have to "make sure you don't miss any", you should automate it somehow. Set a global constant "DEBUG", or use built-in facilities to automatically deal with debug statements when you do a release versus debug build.

    If you have to go back and search and find all occurrences of something, you will miss at least one. It reminds me of a story I once heard of a generic office application that was released with a very obscure bug that would happen only in a specific, very rare program state. A debug window was left that would pop up--complete with a joke picture of one of the developer's bare ass. The program was subsequently re-classified as "Adults Only". Might be partially or completely an urban legend, but it illustrates the point.

  2. Yeah you're absolutely right... I was going to say (but forgot) that the best way to do this is to create your own "debug" function that can be globally turned on or off.

  3. Good post.

    Which takes more of your time: coding, debugging, or maintaining?

  4. I'd have to say it's definitely debugging by far. Even when you're developing new code, you're still "debugging" by thinking ahead all the time to see what you could do to break your own program and trying to catch those errors in advance. Effectively you start looking for bugs even before they exist.

  5. Added Shane's correction to the post body.

  6. One technique I've found useful when debugging layout, in either GUIs or web pages, is to give each element a distinctive background color: red, yellow, orange, purple... That way, you can see that the misplaced button is on a blue background instead of the orange area where you expected it, and narrow down the scope of your search.

    I showed an app to a coworker while I was debugging it, and he christened it "Angry Fruit Salad".

  7. It's funny that you mention that technique, Arensb, because I was just doing exactly that with a stubborn jsp page about 20 minutes ago. Horrible table design, it was. Horrible.

  8. Along the lines of different backgrounds, the Web developer toolbar for Firefox has the option of outlining elements based on different criteria on a page, which is a handy way of doing it without having to add any extra code.

  9. Hi Russell and all, I came across this blog on the atheist blog, good stuff.

    I do QA for a living so I normally _have_ to test my code before checking it in (unlike engineering proper hah! no just kidding).

    Some things I've learnt about writing debuggable code (because I've had to debug so much of it):

    - don't be clever where you don't need to be clever, at least when just trying to get the code to behave the way you want. This includes things like highly clever branching code:

    if(a()-b())
    {...}

    just write the comparison you actually mean:

    if (a() != b())
    {...}
    Much more readable (and most compilers/interpreters are very good at branch optimization anyway).

    - avoid code that obfuscates the flow of control, i.e. I commit this kind of sin a bunch in C# and Java:

    foo = String.Split("/", bar).Replace("\\","/").Length;
    etc...

    I'm paying for this at work already just having to figure out what I wrote a few months ago.
    This kind of stuff puts you at the mercy of your debugger and how well it displays stack crawls, locals etc.

    - don't #define things to death (in compiled languages like C and C++). Macro expansion/conflict bugs are often very hard to figure out both at compile and runtime. Use typedefs instead (which are better anyway because they obey the type system of the language in C/C++). This is partly why languages like Java don't have #define facilities ;).

    - Exception abuse. Please don't do things like

    try
    {
    ...
    }
    catch(...)
    {
    std::cout << "oh shit!";
    return 1;
    }

    most debuggers can't follow exception propagation, so don't use them for control-flow operations like this (this just subverts the purpose of exceptions anyway).
    Some of my code at work has some of this, for which I should be flogged (well I am because I have to maintain it hah!).

    Less obvious things are stuff like unrolling your own loops and things like that....

    Finally, make sure you have debug versions of any 3rd party libraries your code may use. Debuggers get very confused if they have to try to go through code sections without debug information.

    Just some things that have occurred to me drinking my morning coffee...

    LS

  10. that honestly sounds like a very slow, painful way of finding the bugs in a system.

    I prefer the TDD approach where i code my system to make tests pass (unit tests, integration tests, test automation of UI, etc, though i'm only beginning to do UI automation). then when the system doesn't do what it should be doing, i should have a test that is failing, and telling me exactly where the problem is, without having to hunt through so much code.

    In this situation, finding the bug is no longer 99% of the work. You can see a failed test in the test report, go right to where the failure occurred, and eliminate 99.999% of the codebase from the outset. this makes finding the bug a much smaller percentage of the solution.

    I understand the value of this approach, though. There are times when you just have to dig into the debugger or otherwise hunt for a bug, to figure out what's going on. for example, when you have a bug but all tests are passing... the first thing you do is figure out why it's not working with the sort of approach you're talking about. then, write a test to prove that it's a bug by specifying what it should be doing and seeing the test fail. then fix the code so the test (and all other tests) passes. now i have a regression test to ensure that this bug is never introduced again.

    :)

  11. Derick,

    You are correct, of course. In an ideal world, you should be working on code that has unit and integration tests built in, so that it's easy to pinpoint the source of the failure. As a long term developer, you should strive for that to maintain stable code where it's easy to find errors.

    We don't live in an ideal world, though, and we don't always have the power to go back in time and throttle the guy before us into writing correct code. I'm speaking from the perspective of a guy who's relatively new on a particular job, and there's a large project written by predecessors with very few tests built into it already. Inexperienced programmers are going to run into this situation many times in their career, so they might as well learn to deal with it early.

    I strive to replace existing stuff with tests to catch future problems, but in the short term, everyone's faced with the basic problem: "There's a bug. There's inscrutable code that I didn't write. What's causing the problem?"

    After you track the problem down, then you can start wrapping tests around that section of code to get it under control. But short of a clean sweep where you rewrite everything at once to conform to best practices, you can't assume that the error has already got test coverage.

    I admit that test-driven development is still relatively new to me, and something I'm working to get better at.

  12. In my experience it is vital to have a system in which you can turn on logging at runtime, and preferably in steps.
    I know some people advocate the DEBUG preprocessor way of removing all debug statements at compile time, but if you do that, when the shit hits the proverbial fan, you are probably looking at a big black box.
    We tend to have problems which are difficult to reproduce, as most of the time it will be a concurrency/race condition kind of thing. Most of the time there is no good way to try and reproduce the problem in our development environment (the customer may have a way faster machine than we have at our disposal, or we just don't have the external connection/facilities needed to reproduce the problem).
    Extensive logging which we can tune per facility and level saved our bacon numerous times.

  13. Hmm, I should add one caveat.
    when dealing with race conditions .... turning up the debug level ....
    will probably 'turn off' the bug.

    Writing multi-threaded programs ... it's bound to bite you in the behind
    once in a while and still remains a bit of black magic.
    but then again, if it was easy ... anyone could do it, and where would that leave us, the skilled programmers :D )
