Friday, October 16, 2009

Ugly numbers in a heap

As I was saying, a good solution to finding ugly numbers would be to somehow enumerate only numbers which are known to be ugly, removing them from memory after they have been read. I couldn't think of the way to do that. Internet to the rescue.

Here's what you do:
  1. Store the number 1 in a set.
  2. Consider the smallest item, "n" in the set. That is counted as the next ugly number.
  3. Add the numbers n*2, n*3, and n*5 to the set.
  4. Remove n from the set.
  5. Repeat steps 2-4 until you reach the desired ugly count (1500).
Each time you search for the next ugly number, you remove one number from the set but add three more, for a net gain of two. Thus, the set of numbers you are storing on each step i is (2*i-1). That's a manageable memory size. But there's still a snag.

Sets are unordered lists, and we want to always find the smallest number of the set as efficiently as possible. How do we do that? We can't just read every number in the set, or the problem scope quickly becomes O(n2). Can't have that.

Turns out there is an ideal data structure for handling this kind of thing, but it's something I haven't looked at since I was an undergrad. It's called a min-heap.

A heap is a special kind of binary tree, where every level of the tree is filled to maximum density. For instance, in a heap with 8 elements, you know that the top level has one node with exactly two children; each of those two children has two children (that makes 7), and the leftmost node at the third level has one child on the left.

The neat thing about using a heap is that because of its predictable structure, you can represent it using an expandable one-dimensional array. You don't have to mess with pointers or the other headaches of data structures. If you are looking for root.left.right, you just calculate the spot in the array and access element 4.

A min-heap is constructed in such a way that the children of each node are guaranteed to be larger than the parent node. In other words, the smallest value is guaranteed to be at the root, and each child of the root is itself a min-heap with the smallest element of that at the root.

(If I'm going too fast for you, speak up. I don't believe I've ever done a post specifically on binary trees, so maybe I should.)

In any case... you can do two things with a min-heap: add a new element, which is rapidly sorted to the right place; and remove the root element, which then moves the next smallest element into the root and fixes the rest of the heap so that it is still a min-heap.

This is exactly what we need. It keeps the cost of searching for the smallest known ugly number to O(log n), where n is the number of elements currently in the list.

So to recap:
  1. Write your own min-heap class, which is the fun part. ;)
  2. Add the number 1 to the heap.
  3. Inspect the root of the heap, and count that number ("r") as ugly.
  4. Add 2r, 3r, and 5r to the heap.
  5. Delete the root.
  6. Repeat steps 3-5 until you reach the desired ugly count (1500).
Actually it turned out that wasn't quite the answer, because you can get duplicate ugly numbers on the heap. For instance, in the second pass you add 2*3, while in the third pass you add 3*2, which results in multiple 6's on the heap. But that's easy to fix. All we have to do is change step 5 to: "Keep deleting the root until it is no longer equal to the last recorded ugly number."

And there you go. Small storage space, quick execution time, ignores non-ugly numbers, and finishes in 3 seconds easily.

SPOILER:

In case you wondered, the answer is 859,963,392.

Ugly numbers and prime numbers

The second wrong approach I tried taking to find ugly numbers was something called a Sieve of Erathosthanes. This was something that my dad taught me about in high school, and it helped me win a programming contest as a sophomore.

Here's where I should give a quick shout out to my favorite HS teacher, Thom Laeser, who I think might read this blog. Thom was kind of a Renaissance man as a teacher; across my four years in high school, I took one geometry class, two programming classes, and a journalism class from him. He also started the school's first Humanities course, which I did not attend. It wasn't until high school that started really programming (rather than just dabbling in BASIC), and I credit Mr. Laeser with giving me that extra boost.

When I was a sophomore, he gave all his programming students a challenge, which was to be the one who could write a program printing the most consecutive prime numbers within five minutes. This was 1990 and the computer lab probably was limited to 80386 processors or something.

It was clear early on that the contest was going to be between me and a senior. The next day, we both proudly came in with programs that printed all primes from 2 to MAXINT in well under 5 minutes. This was in Pascal, I think integers were 16 bit, so the highest number would be 65,536. Mr. Laeser shrugged and said that was very impressive, but we would have to keep going if we wanted to win.

So. Along the way, I had to learn to write my own large integer output, and also save space by representing numbers with bit manipulation instead of wasting entire "int" values on booleans. More importantly, I learned about the Sieve of Erathosthanes, which I mentioned before.

The sieve is a method for quickly weeding out prime numbers from a finite list of numbers. Here's how it works (simplified version): start with an array of booleans from 1 to, say, a million. The first true prime number is two, so cross out all multiples of 2 that are not 2; so, 4, 6, 8, 10, etc. Then we move up to three, crossing out 6 (again), 9, 12, etc. Next we encounter 4, but 4 is already crossed out, so we ignore it and move up to 5, eliminating 10, 15, 20, etc.

It turns out that you only have to filter on numbers up to the square root of the array size, which in this case is 1000 (sqrt(1000000)). After that, you know there are no multiples of 1001 because you would have had to multiply 1001 by something smaller, and hence already visited, to reach a million.

Long story short, I won the contest, but it was a grueling arms race for several weeks as we kept improving our programs and then "stealing" new ideas by observing how the other's program worked.

Sieve of Ugly

So that was a roundabout personal story, but I haven't told it here before so I thought I might as well get it out. And the point is that my second pass at solving the ugly number problem was to use a sieve.

Sort of a simple modification, really. I didn't just use a boolean array, because there are now three kinds of numbers: 1: prime; 2: ugly; 3: neither prime nor ugly. So I made it an integer array, where each integer stores one of three flags. All numbers start prime; then multiples of 2, 3, and 5 get marked as ugly. Then run another pass on all numbers above 5, marking their multiples as non-ugly.

Using integers was super-overkill, as an int is 32 bits, and I only needed 2 bits to represent my flags: 00=prime, 01=ugly, 10=non-ugly. What I found was that I couldn't sieve up to the 1500'th ugly number, because the array was so huge that I ran out of memory.

So first I changed it to an array of bytes (8 bits). Then, still not satisfied, I improved performance by writing some bit-manipulation routines to simulate an array of two-bit numbers using four numbers per byte.

It worked; I found the same number much faster than before, but still with a noticeable delay before the sieve finished running. 3 seconds was still way out of reach.

What next?

While I was coding the sieve, it occurred to me that it was a terrible waste of space to store ALL the numbers, both ugly and non-ugly, in an array until the entire program was finished. It seemed like there ought to be some kind of quicker way to figure out, given the current ugly number, what the next one in the sequence should be. At worst, I could maintain an array of only 1500 numbers to store only ugly numbers (instead of all numbers up to a trillion).

I wracked my brains for this method; I wrote down the factorization of the first 10 numbers or so, looking for a pattern, but couldn't find one. I'm pretty sure there isn't one, but I was on the right track at this point. I know because that's when I "cheated" by looking to see what other people had done.

And maybe they're smarter than me, or maybe they spent more time on it, but there is still a certain grim accomplishment in not having to reinvent the wheel.

There's a well known "hard" problem in computer science called the traveling salesman problem, which goes like this: You have a bunch of cities, separated by known distances. You have to visit all of them while spending as little time as possible in transit.

To solve this problem by brute force becomes impossible very quickly as you add more cities. So the best known approaches use short cuts; they use sneaky tricks that don't necessarily produce the "best" answer, but are guaranteed to produce a pretty good answer most of the time.

Anyway, I heard that some computer science genius was once asked how he would solve the traveling salesman problem, and he said: "I would put it on the back of a cereal box. 'Find the most efficient path through these points. Winner gets $100, mail your answer to the following address'..."

Well, that's one kind of divide and conquer solution right there. Hooray for the internet, though... now we don't even need cereal boxes.

Online programming puzzles and "ugly" numbers

Ran across this site while looking for general programming puzzles to stay in practice.

On the site, my attention was captured by this puzzle involving "ugly" numbers.

Ugly numbers are numbers whose only prime factors are 2, 3 or 5. The sequence

1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, ...

shows the first 11 ugly numbers. By convention, 1 is included.

Write a program to find and print the 1500'th ugly number.


This isn't a particular hard puzzle if you brute force it. All you have to do is write an algorithm to determine whether a particular number is ugly or not, sort of like this:

while (num % 2 == 0) num /= 2;
while (num % 3 == 0) num /= 3;
while (num % 5 == 0) num /= 5;
if (num == 1) return true;

Then you can just loop through as many numbers as it takes, incrementing a counter as you find ugly numbers, until you reach 1500 numbers.

The problem with this approach is that there is a time limit of three seconds. The brute force approach requires you to do an increasingly long operation (dividing repeatedly to find the prime factors) on a series of numbers which turns out to be very large (I won't spoil it, but the answer is just short of one trillion). Early numbers are computer quickly; then it slows down so finding the 1500th number may take several minutes.

I confess I cheated to solve this puzzle. I came up with one other wrong approach after trying the brute force method, then went googling for discussions of this puzzle. As I hoped, I did find a protracted argument about the right way to solve it. One guy hit on a really clever solution, based on a data structure which I haven't used for years. It took me a few tries to implement, but it was very satisfying in the end.

I guess I could feel bad for "cheating," but I don't. I came close to finding the right answer, but really, if you want an optimal answer to a well defined puzzle then there's no harm in hunting around to see if someone did it better than you. Professionally, implementing a GREAT solution that you looked up is preferable to implementing a mediocre solution that you got with only the sweat of your brow. You can learn new tricks from other people that you'll have in your toolbox for next time.

As a tangent, I take the same approach to strategy games. I like to take my time and learn the rules of a game on my own, and I would scorn to look at a strategy guide during that period. However, there comes a point where you have a basic understanding of the rules and can improve much faster by considering input from others who have already mastered the game.

Anyway, more on the wrong approaches and the right one later.

Friday, August 28, 2009

Divide and conquer your project, part 1

James, who is a regular commenter, is a programmer looking for more experience. James has studied a few languages and probably has a decent grasp on the basic logic that it takes to write a program, but he wants to work on a personal project because he doesn't get enough variety of practical applications at work.

This is an admirable goal, first because developing your programming skills requires constant practice, and second because personal projects can make good resume padding. Potential employers like to hear that you program for fun, because it will make you more likely to throw yourself into your work. Also, if they ask "Have you worked with technology X?" you can say "Not professionally, but let me show you this project that I posted on my web page." It gives you the opportunity to talk intelligently about the technology.

However, starting a new project can be overwhelming, and it's hard to know where to start. You have all these things in your head that you would like the program to do, but you can't see how to get from what you have now (a blank screen) to what you would like to have (an awesome program with flawless logic and a clean interface). What do you do first?

I spent a few posts discussing the technique of "divide and conquer." As it turns out, that is also an excellent plan for approaching your programming projects. Here's a little secret of programming: you won't get it right the first time. Fuhgeddaboutit. Don't try. If you sit down for a couple of hours and try to write an enormous, complex program in one shot, and only after you think it is done do you compile it for the first time, you will be confronted by a screen full of errors and another few hours of headache, tracing back through what you did to figure out where each of numerous bugs exists in your thousand line program.

Actually, my proudest moment as an undergrad was writing a single assembly language module that executed perfectly with no debugging on my first try. But that's not something you should count on, and in any case, this was one line in a much larger program.

When you are building a castle of air, you do it brick by brick. Except they're, you know, wind bricks. You don't wave your hands and get a fully formed castle. You break down the project in your mind into easily manageable pieces, and at each step you (1) Look at that step to determine whether you should break it down even further before proceeding, (2) Write complete code that does something visible, and (3) test that component thoroughly.

That third step is crucial, and it's a hallmark of a solid programmer that they develop testable and tested code at every step. When you test a piece of your code until you know it's solid, you will save yourself a lot of time later that would have been spent backtracking to figure out where your code is going wrong.

How do you break it down? Let's consider the project that James set for himself.

One idea I've had is a program that accepts inputs on your credit card balances, minimum payments, etc, and calculates the most efficient payment plan to snowball. I've got a spreadsheet I downloaded from somewhere that I'd model, but the spreadsheet can't jigger your payments to determine the best way to pay off the balances the soonest, or with the least amount of interest. I envision it as an Iphone App, although I learned that without a Mac I can't write it that way.

Actually a spreadsheet probably can do that, but writing a full featured program would be cooler. So let's consider what components this program has.

A popular way of dividing up pieces of a program is to start by splitting it into three layers: a presentation layer (the way your program handles input and output); a data layer (the information that your program will be manipulating); and a business layer (the program logic that figures out what to do). If you split your program up into pieces this way, then you can hopefully change up the pieces at any stage without risking ruining your entire program.

How you start your problem will largely depending on which component is most interesting to you personally. In many projects, the graphics are the most interesting thing. In that case, your top priority would be to get something, anything to display on the screen.

For example: My first real programming job was modeling equations for scientists. There was an eccentric old physicist who wanted me to graph some of the equations he wrote for me, using Fortran on a green and black terminal. As I progressed on getting the equations to display the right answer, he seemed distinctly bored with hearing about the programming details. But then I figured out how to generate a line diagram. It excited the hell out of him, and suddenly he couldn't stop talking about his ideas for what we could do with this.

The diagram, of course, bore almost no relation to the actual equations he had described. But it didn't matter. It was a launching point for his concept of the program. Similarly, as you write your own programs, hopefully you will keep on continually inspiring yourself by producing code that does something that will tickle your interest in the next phase of your project.

Now, in James' case, the interesting part of the project is probably not the graphics. It's the business logic. After all, the ultimate objective is not to create something that looks cool, but something that provides useful information. Since this is a program that you will be using yourself, the inspiration will come from your feeling that "Wow! This is really telling me some valuable stuff! But I bet it would be even more useful if..." And then whatever comes after "if" will determine which piece of the program you conquer next.

So the business layer is the most important part of your program. Therefore, I would recommend that you pick a language, and start with an extremely simple presentation layer and data layer. So for presentation, let's say you'll use plain text input and output. For data, let's say that you'll write a module or an object that hard codes the value of your credit card balances, and then you can tweak the code to change the values.

These aren't good solutions, but they are stepping stones for the program you really want to build. Write good business logic first, using crude presentation and data models, and then make it better later. How will you make it better? Well, you might decide to turn your text input into a web form interface, and perhaps add graphs. You might change your data from hard-coded values, to a separate text file that you will edit by hand, to a database table. As long as you keep these parts of the program reasonably well separated, you can build them up a bit at a time, focusing on the parts that strike your fancy.

Whew, that was a mouthful. Pausing for feedback now. James, do you think this is enough to get you started? Do you have an idea which language you'll start in? Any thoughts or questions?

Monday, August 17, 2009

Q&A open thread

After a two month spell without posting, I'd like to get more out of this blog again, but I don't have any pressing topics that I feel I need to write about. So I'd like to invite questions in the comments section of this post in order to inspire new topics that we could explore. Any pressing questions about software development that need some discussion? Have I posted anything in the past that I ought to expand on? Is anyone anxiously awaiting a continuation of the chess series of posts?

I haven't spoken at all about my current position, which is a lot different from past jobs I've had. I'm currently doing contract work on a product that helps people to automatically activate and configure their high speed internet connections. The work is all done through customized web pages that use ActiveX to deal with low level PC operations. As you may have already noticed if you've been following either my Twitter or Facebook feeds, I spend a lot of time in a computer lab working with custom configurations for a like DSL modem on various configurations, including released candidates of Windows 7. I've never really been a fan of hardware, but this involves a lot of interesting network puzzles and some really advanced ways to use JavaScript.

Anyway, if you've got any questions, other commenters may answer you before I get to it, at least it will help me feel out some directions to take potential upcoming posts.

Friday, August 14, 2009

Question about web security

Wanted to put out a question to readers. I recently happened to tune in to the Kim Komando show and heard something that sounded like a mistake. However, I'm not a security expert, so I'd like to find out more.

A caller had been making use of his neighbor's unsecured wireless connection when his own internet was down. He was wondering about the security hazards of doing this.

Kim's answer was mostly correct, which was to say that anything you transmit or receive over the web is visible across the entire network. By using freely available software, your neighbor could potentially steal your information. He could have a packet sniffer and log the information on his computer, or even in the router.

Kim then went on to say "So obviously you wouldn't want to do any banking over the unsecured line, because you don't know who's looking at your data."

These struck me as solid words of caution, but potentially incomplete. Presumably your bank would make you log in over an SSL connection. My own understanding is fuzzy, but I believe this means that the data gets encrypted at each end before being sent over the network, and only your client and their server have the information necessary to decrypt the data.

Can anybody confirm or deny that my understanding is correct? Is there some way you know of where the neighbor could read your secure data as plain text? If not, what information is your neighbor lacking that would allow him to see it?

Sunday, May 31, 2009

Artificial intelligence in games, part 3

As I warned when I started this blog, I've allowed it to lie dormant for a while, but I feel like it's high time to pick up where I left off and finish the series about playing chess by look-ahead.

So to review the posts that led up to this one:
Now we're going to turn our attention to chess.

Although chess is a much more complex game than tic-tac-toe, it's easy to see that you can apply the same tactics to systematically list all possible games of chess, in theory. Of course I'm not going to attempt to diagram them, but there's definitely a finite number of things you can do on each move. If you're the white player, on your first turn you have eight pawns that you can move forward one or two spaces, and then you have two knights that can move to two spots (forward two, left or right one). So the total possible moves on the first turn is (8*2)+(2*2), or 20.

Then it gets a little more complicated. Black has the same choices, but he has the same 20 choices for each of the 20 possible moves that white could make. So after each player has made one move, there are 20*20=400 possible states the board could be in. It only gets worse from there.

For instance, if white performed the traditional P-K4 opening, on the second move he has nearly all the other choices remaining (7 pawns to move one or two squares, plus two knights two ways, 18 moves) but he also has a whole lot of new choices. The king's knight may move into the spot occupied by the pawn (19). If the black king's pawn doesn't block it, the white king's pawn may advance one space (20) or, depending on what black did, capture an enemy pawn.

Also, the white king's bishop is now free to move to any of 5 spaces toward the upper left (25) and the queen can move to any of 4 spaces toward the upper right (29). So the number of possible moves keeps increasing as the board opens up, and so does the complexity of the tree. Now it looks like by the time white has moved twice, there are about (20*20*29) possible states the board could be in. (Some starters obviously make less than 29 moves possible, and there is more than one way to get to the same state; i.e., P-K4 followed by P-Q4 has the same result as P-Q4 followed by P-K4).

As we observed in the tic-tac-toe problem, you can calculate roughly how many possible games there are in two ways: by multiplying out the possibilities for all moves, or by figuring out every possible state of the board. The second method tends to be smaller because it eliminates various ways of GETTING to those states. However, in the case of chess, figuring out how many moves are possible from each state is massively complicated, so will go with the second approach.

Each side has 16 pieces. The pieces are not unique; for instance, the eight pawns are interchangeable with each other and so are the two rooks . However, to simplify the calculations we'll assume that all pieces are unique. So there are 32 pieces on the board, and 64 squares where they could be located. The first piece could be on any of the 64 squares, the second on any of the remaining 63, the third on any of the remaining 62, and so on. So as a rough guess, the number of board configurations is (64*63*62*...*35*34*33) and then the remaining 32 squares are blank. That is to say, the answer is "64 choose 32", or (64!/(32!*32!))

Side note: Cool feature of Google. Type "64 choose 32" into the search bar and it will answer the question for you. It's 1.83*1018.

Is this number an accurate portrayal of all chess boards? Not exactly. In the first place, many of the moves are not legal. This set of all possible boards includes boards where the two kings are next to each other, or in an impossible multiple check; it includes pawns on the first row, and positions where all eight pawns are on their home squares, but the king and queen are not.

On the other hand, it does not include any position where one or more pieces have been captured, nor does it include positions with two or more queens due to promotion. So there may be more games that are truly possibly, or there may be fewer, but on the order of 1018 possibilities is a fair guess.

Now, 1018 is a very large number. In point of fact, it is larger than the number of seconds since the universe began (about 4.4*1017... although it's only 189 billion if you're a creationist. Har har.) It's close to the number of grains of sand on earth, and much larger than the number of stars in the milky way.

In theory, you could have a computer run through the win/loss state of every one of those games, but not in a reasonable amount of time. If you could analyze one game a second, it would of course take longer than the universe has been around to decide on one move. Computers can do much better than that, of course. Currently the world's fastest supercomputer can perform 596 TeraFLOPs -- 596 trillion floating point operations per second. That means that if you were so incredibly optimistic that you believed that analyzing one board position could be done in a single operation (believe me, it can't), the world's fastest supercomputer would still take 40 minutes to analyze the outcome for a single move. And that level of optimism is ridiculous.

Granted, computers have become a lot more powerful since this topic was discussed in the 1970's, but it's still fairly impractical to play a decent game of perfect chess, unless you have several supercomputers lying around doing parallel computations. Your desktop PC would take years. More realistically, thousands of years. Needless to say, no computer plays chess based on the tic-tac-toe strategy of exploring every possible move.

Next time: What's a heuristic, and how come a computer can beat the human chess champion of the world?