Friday, October 16, 2009

Ugly numbers and prime numbers

The second wrong approach I tried taking to find ugly numbers was something called a Sieve of Erathosthanes. This was something that my dad taught me about in high school, and it helped me win a programming contest as a sophomore.

Here's where I should give a quick shout out to my favorite HS teacher, Thom Laeser, who I think might read this blog. Thom was kind of a Renaissance man as a teacher; across my four years in high school, I took one geometry class, two programming classes, and a journalism class from him. He also started the school's first Humanities course, which I did not attend. It wasn't until high school that started really programming (rather than just dabbling in BASIC), and I credit Mr. Laeser with giving me that extra boost.

When I was a sophomore, he gave all his programming students a challenge, which was to be the one who could write a program printing the most consecutive prime numbers within five minutes. This was 1990 and the computer lab probably was limited to 80386 processors or something.

It was clear early on that the contest was going to be between me and a senior. The next day, we both proudly came in with programs that printed all primes from 2 to MAXINT in well under 5 minutes. This was in Pascal, I think integers were 16 bit, so the highest number would be 65,536. Mr. Laeser shrugged and said that was very impressive, but we would have to keep going if we wanted to win.

So. Along the way, I had to learn to write my own large integer output, and also save space by representing numbers with bit manipulation instead of wasting entire "int" values on booleans. More importantly, I learned about the Sieve of Erathosthanes, which I mentioned before.

The sieve is a method for quickly weeding out prime numbers from a finite list of numbers. Here's how it works (simplified version): start with an array of booleans from 1 to, say, a million. The first true prime number is two, so cross out all multiples of 2 that are not 2; so, 4, 6, 8, 10, etc. Then we move up to three, crossing out 6 (again), 9, 12, etc. Next we encounter 4, but 4 is already crossed out, so we ignore it and move up to 5, eliminating 10, 15, 20, etc.

It turns out that you only have to filter on numbers up to the square root of the array size, which in this case is 1000 (sqrt(1000000)). After that, you know there are no multiples of 1001 because you would have had to multiply 1001 by something smaller, and hence already visited, to reach a million.

Long story short, I won the contest, but it was a grueling arms race for several weeks as we kept improving our programs and then "stealing" new ideas by observing how the other's program worked.

Sieve of Ugly

So that was a roundabout personal story, but I haven't told it here before so I thought I might as well get it out. And the point is that my second pass at solving the ugly number problem was to use a sieve.

Sort of a simple modification, really. I didn't just use a boolean array, because there are now three kinds of numbers: 1: prime; 2: ugly; 3: neither prime nor ugly. So I made it an integer array, where each integer stores one of three flags. All numbers start prime; then multiples of 2, 3, and 5 get marked as ugly. Then run another pass on all numbers above 5, marking their multiples as non-ugly.

Using integers was super-overkill, as an int is 32 bits, and I only needed 2 bits to represent my flags: 00=prime, 01=ugly, 10=non-ugly. What I found was that I couldn't sieve up to the 1500'th ugly number, because the array was so huge that I ran out of memory.

So first I changed it to an array of bytes (8 bits). Then, still not satisfied, I improved performance by writing some bit-manipulation routines to simulate an array of two-bit numbers using four numbers per byte.

It worked; I found the same number much faster than before, but still with a noticeable delay before the sieve finished running. 3 seconds was still way out of reach.

What next?

While I was coding the sieve, it occurred to me that it was a terrible waste of space to store ALL the numbers, both ugly and non-ugly, in an array until the entire program was finished. It seemed like there ought to be some kind of quicker way to figure out, given the current ugly number, what the next one in the sequence should be. At worst, I could maintain an array of only 1500 numbers to store only ugly numbers (instead of all numbers up to a trillion).

I wracked my brains for this method; I wrote down the factorization of the first 10 numbers or so, looking for a pattern, but couldn't find one. I'm pretty sure there isn't one, but I was on the right track at this point. I know because that's when I "cheated" by looking to see what other people had done.

And maybe they're smarter than me, or maybe they spent more time on it, but there is still a certain grim accomplishment in not having to reinvent the wheel.

There's a well known "hard" problem in computer science called the traveling salesman problem, which goes like this: You have a bunch of cities, separated by known distances. You have to visit all of them while spending as little time as possible in transit.

To solve this problem by brute force becomes impossible very quickly as you add more cities. So the best known approaches use short cuts; they use sneaky tricks that don't necessarily produce the "best" answer, but are guaranteed to produce a pretty good answer most of the time.

Anyway, I heard that some computer science genius was once asked how he would solve the traveling salesman problem, and he said: "I would put it on the back of a cereal box. 'Find the most efficient path through these points. Winner gets $100, mail your answer to the following address'..."

Well, that's one kind of divide and conquer solution right there. Hooray for the internet, though... now we don't even need cereal boxes.

No comments:

Post a Comment