Castles of Air: data structures

Showing posts with label data structures. Show all posts

Monday, September 19, 2011

Operation HackMaster Crit Tables, episode 2

Now that I've explained what data structures are for, I can finally explain how I approached the problem of deciphering those ponderous HackMaster tables.

First of all, I discovered to my dismay that the tables were only available in the form of PDF files -- images, not text. Just as I was about to tell my friend that I didn't want to waste time converting six pages of tiny print to very usable data, my lovely assistant Lynnea (the aforementioned fiancee who actually plays the game) stepped in and volunteered to do it. This is one thing about her personality that I've never been able to understand, but she loves doing data entry, filling out forms, etc. I think it's some kind of OCD thing. But for whatever reason, she's extremely enthusiastic, diligent, and thorough about this kind of work. And by the way, if you need this kind of work done, she's available to hire! Ask me for a resume. :)

With this powerful ~~slave~~ human resource at my disposal, I wrote up a few sample lines of text in a spreadsheet to show how I wanted them, wound her up and let her go at it. She cranked the rest out in a surprisingly short time. I then converted the results to standard comma-separated value format, some of which you can download from here: Hacking weapon table part 1; List of effects. In case you're not familiar with them, .csv files are a generic text-only format which can be read in a spreadsheet program like Excel, or any standard text editor.

Working with just my sample rows, I set out to work out what the abstract properties of the data were. The first thing to consider is the way a body part is selected. In the hacking weapon table, you can see that if you roll a 1-100, you get hit in the "Foot, Top"; if you roll 101-104, you get hit in the heel, 105-136 is Toe, and so on.

This is like a hash table, almost, but it's not one. If it was a hash table, you'd usually have one body part per number: 1 -> Foot Top, 2 -> Heel, and so on. Here, we're working with a range of numbers corresponding to each lookup value.

I decided to start with a generic lookup table, where you start with objects which contain a "low" value, a "high" value, and a generic object which can get returned from the lookup. The declaration looks like this:

public class RangeLookup<T>
{
   private List ranges;

   private class Entry
   {
      protected T item;
      protected int low, high;

      public Entry( T i, int l, int h )
      {
         item = i;
         low = l;
         high = h;
      }
   }
   ...
}

In Java using the "<T>" notation means that "T" could be anything. Even though I wasn't going to be using this lookup table more than once, I like to keep structures as all-purpose as possible. That's partly because I might want to reuse them in the future, and partly because I want to be able to test how the component works without making it dependent on the equally complex item which will be retrieved by the lookup.

Every structure needs an interface -- a means of communicating with it that only does what you want and hides the guts of it from the rest of the program. I created an "addEntry" function to the RangeLookup class, so that you could insert a new entry with a high, a low, and a retrieved object of type T. Then I added a "lookup" function where you send in a number, it gives you an object. In my implementation, the lookup function simply walks through all of the possible results and checks whether the requested number is between the high and the low. This would be inefficient if there were going to be a lot of entries, so I might have come up with some kind of hashing structure or tree search; but since there are only about 20 or so body parts, it wasn't worth the extra effort and runs fine as is.

After verifying that this was working right, I created the following additional structures:

Looking at the Effects table, it is a basic mapping (in my case, placed in a HashMap) from one string to another. You put in the code "f", and the resulting effect is "fall prone and drop items". So, I created a simple object called an "effect," containing "key" and "description."
It's a bit more complicated than that, though. Often the table will contain numbers, but the effects will contain only the symbol "X". For instance, if the table says "d4" then the relevant effect is "dX", which means "reduce Dexterity by X". Therefore I made another class called an "Outcome," which contains an Effect AND a number (which may be zero if it's not necessary).
I made an EffectTable, which implements the HashMap of Effects.
Almost ready to create an actual table object, I first made a class called "CritTableEntry." This represents a cell in the table. It contains: a low roll, a high roll, the name of a body part, and a List of effects (because each cell may result in several outcomes, not just one).
A CritTable class to put them all together. This class has an addEntry method and another method for retrieving the entries.

As a final step, I created a "Reader" class which did the heavy lifting of reading and interpreting the CSV files and adding one row at a time into a generated table. I don't like to reinvent the wheel, so I googled class libraries which would read CSV files and interpret them as lists. I settled on using OpenCSV. I could have written my own parser, but when the task is as common as reading a CSV, I tend to assume that somebody has already done all the work before me and has already been through the process of making all the mistakes and catching the bugs which come up.

Notice that none of these objects deals with input and output directly. It's preferable to test each component of your program separately as much as possible BEFORE trying to decide what kind of user interface to make. Your interface should be tailored to the problem space. As it turns out, I wound up creating several different interfaces before I settled on created a web application. I'll discuss these concerns in a later post.

When testing your data structures it's a good idea to create unit tests. A unit test is a small, self contained application which is designed to test one thing at a time. You need to think about every possible way that your program might break, create a unit test for each one, and make sure that it works right at the boundary conditions.

Off the top of my head, here are some boundaries of the crit tables that needed to be tested:

Spot check several "body part" rolls with random(ish) numbers and see that the returned information matches the table.
Spot check several "outcome" rolls in one row and see that the returned effects match the table.
Test the boundaries of some rolls. For instance, on the table I linked, "4301-4492" corresponds to "Arm, upper inner", and "4493-4588" corresponds to "Elbow". Therefore I have to make sure that a roll of 4492 returns a different part from 4493.
Test when happens when the body part roll is 0 (invalid), 1, 10000, and 10001 (invalid).
Test what happens when the effect roll is 0, 1, 24, and 25.

Keep all your unit tests around forever. If something breaks, that's a quick way of figuring out which part is not working. If it's a problem with your data model rather than your user interface, the unit tests will catch it.

Next time I'll be talking about all the different ways of making an interface on the same models.

Wednesday, September 14, 2011

A bit about data structures

I wanted to write another HackMaster post, but what I wanted to write about was the way I approached deciphering the data in the tables and converting them into data structures. Then I skimmed through some older posts looking for reference points about data structures, and it occurred to me that I've never written any. In order to provide a foundation for the rest of the HackMaster breakdown, I'll have to digress and talk about structures in the abstract.

Whenever you are presented with a problem of modeling some numbers in conceptual space, the first thing you have to figure out before you write a single line of behavioral code is what kind of data structures you are going to use. Going all the way back to the beginning of this blog, I've emphasized the importance of considering the efficiency of your design and the effect that it has on the Big-O performance of your program. Thinking about proper data structures can buy you a lot of speed, and it can also make it really easy to visualize your program in small chunks as the complexity increases.

So what's a data structure? The first thing programmers learn is how to use variables for individual chunks of information, like this:

int x = 3;
String str = "Hello world.";

(Technically, of course, a String object in Java is a whole bunch of characters, which makes it a data structure in itself. But the nice thing about object-oriented programming is that you don't have to think about it if you want to.)

To understand data structures, consider an array. An array is one of the first slightly more advanced concepts that a beginning programmer will run into. Instead of storing just one integer, it can store several. For example, here's a simple representation of part of the fibonacci sequence:

int[] fib = new int[10];
fib[0] = 1;
fib[1] = 1;
fib[2] = 2;
fib[3] = 3;
fib[4] = 5;
fib[5] = 8;
fib[6] = 13;
fib[7] = 21;
fib[8] = 34;
fib[9] = 55;

When you create a single "int," you're asking the program to set aside a chunk of space in memory, large enough to hold one number. When you create an array like this, you're asking the program instead of set aside a bigger chunk of memory ten times that size, plus (for some languages) a little bit of extra information about size constraints and such.

But arrays can be wasteful. What if you want to set aside space that sometimes houses a hundred numbers, and sometimes houses just a few? You could create an array of size 100, but most of the time that space would be wasted. That's when you want to use a linked list, where you ask for new memory only at the moment that you actually need it.

I'm not dedicating this whole post to the implementation fundamentals of lists, but interested beginners should go check out the Wikipedia article to find out how this works. (Sidebar: While relying on Wikipedia for information about controversial topics is often unwise, most of the technical topics that are covered are really good.)

Besides linked lists, there are lots of other data structures that you can use depending on your situation:

A tree (which may or may not be binary) will hierarchically organize information for you, much like the folder structure on your computer does, shortening the search time as long as you know where you are going.
A hash table or map is a structure which will find a value associated with a key, usually very quickly. An example would be a dictionary search: you supply a word, and the program would retrieve a definition.

You can write your own versions of these structures, or if your language supports it, use predefined classes that create common structures.

Understanding what purpose the various structures serve, and when to use each one, is a very key skill in programming interviews. Often when you are asked "How would you solve this problem?" the best answer is not to blurt the first notion that comes into your head, but to start applying data structures to model the problem space: lists (or specifically, stacks or queues), trees (binary or otherwise), tables (sometimes you can just assume the existence of a database, which is centered around associative tables).

When I hear a problem that lends itself to this, I usually make a beeline to the whiteboard and start thinking out loud: "You're asking about a list of items, so let's describe what's in an item first... then build a linked list out of items..." Then I'll be either writing code to illustrate what I'm thinking, or (if the interview is shorter) just sketch out diagrams so that the interviewer understands the description and will probably accept that I know how to implement it.

Software is built a piece at a time. If you start explaining how you visualize the problem in your head, you can give a much better insight into how you think than if you just start solving the problem directly. In fact, if you start off strong with this approach but then go off on the wrong track, often the interviewer will be eager to guide you towards his concept of the solution because he's being carried along with your thought process. This often changes the dynamic of the interview entirely. Instead of being a room with an interrogator and a suspect, the interviewer may start thinking of himself as your ally and not your judge. And that's exactly where you want to be when you're looking for work.

Digression's over. Next time I'll illustrate this when I get back to decoding HackMaster tables.

Thursday, September 8, 2011

Operation HackMaster Crit Tables, episode 1

This post is about a project I recently did for fun. Although the project itself has extremely limited application, it inspired me to do something I've been meaning to do for a long time, namely upgrade the web server for my apollowebworks.com domain to something which supports Java applications with Tomcat. (I chose a company called Arvixe based on scanning the features they offer and reading a bunch of reviews. Shout out to my boyz at Arvixe!)

Because I'll be touching on a wide range of topics, I'll split them up into multiple posts, and that should keep me from neglecting this blog for a little while. So, partly to tell you what's coming and partly just so I can keep track for myself, here's a road map of the topics I plan to hit with this discussion.

General geekery about using programming to automate complex tasks.
Building a project from the ground up, starting by visualizing the data structures instead of just diving in blindly.
How separating interface from implementation makes it easier to translate your program to multiple platforms.
Good riddance to the bad old days: How the web has made program content delivery easier.
A primer for noobs on what web servers do, and why I bought another one.

First let me explain the problem. My fiancee, Lynnea, has gotten into paper-and-pencil roleplaying games. She has been running a light Dungeons & Dragons campaign occasionally for me and my son Ben over the last few months. She also plays with a group of friends one night a week. For my part, I've played RPGs a few times before but never been a strong part of that scene, so I just listen to the stories of her sessions. Currently they're playing with a rules system called "HackMaster."

As I understand it from skimming a few chapters of the rulebook, HackMaster was written by someone who hates doing things the easy way. When D&D was first published in 1974, it was at first an ever-expanding set of complicated rules involving tons of die-rolling for all kinds of special situations. Or as a German exchange student my family hosted in high school put it, "D&D isn't a role-playing game, it's a roll-playing game." Har. Har. Very droll, those gamers. (Oh hey, how are things going, Max? ;)

Still, as the years went by, D&D started to reverse the trend and become more streamlined, or so I've heard. The latest release, fourth edition rules, was clearly heavily influenced by World of Warcraft, and reduces a lot of the unnecessary choices in favor of more interesting combinations of focused options.

The HackMaster author just hates that. Not that I would recognize the difference, but I have the impression that if anything, he's aggressively chosen to make every little action even more complicated than it would have been under old-school D&D, to the point where a party can spend thirty minutes meticulously checking a single room for traps before seeing any combat. And the critical hit tables are an utter monstrosity.

Critical hits (or "crits") for you non-gamers, means that for every attack there is a chance (1 in 20 under D&D rules) that it will be a super-strong attack which does extra damage. In WoW, this is handled automatically by the game; you can build up gear that will increase your crit chance, but usually there's a flat formula that says "If you crit, your strike does twice as much damage." Rules in 4th edition D&D are pretty close to the same.

Not HackMaster.

In HackMaster, you roll a number between 1 and 10,000 for the body part where your blow lands, and there are all kinds of "realistic" outcomes that result if you land a powerful blow on that body part. Furthermore, rolling a 20 is just your cue to pull out a ginormous, six page tables full of teeny little numbers and two accompanying rules pages. You then roll another number that might range from 1-24 to see just how badly your crit hurt the monster (or the monster's crit hurt you).

For instance, if you get hit on the top of the head, the crit damage can range from "8 extra damage" to "twice normal damage AND you lose some hit accuracy AND you lose some dexterity AND you fall down and drop everything you're carrying"... to "brain goo." Do not pass go.

Each one of those individual clauses, by the way, is represented in the table by a little symbol like "dX" which you have to look up on a smaller table of effects. I think you have to look at a sample to see what I'm talking about.

(Click for a larger image.)

Not to put too fine a point on it, this sounds like a game that I would personally hate playing. But no accounting for taste, you know? Even so, problems of managing huge sets of data are piles of fun to solve with a computer. NO, that's NOT sarcasm, you non-programmers, that's fun. Shut up.

So when the DM tentatively asked if it would be possible to write a program that could handle the die rolls and just spit out some nice, clean output, I said "Piece of cake!" I wouldn't even accept his offer to pay for it. Just like when I wrote a Sudoku solver, sometimes figuring out how to automate gameplay is more fun than actually playing. (Disclaimer: I used to claim that Sudoku is a fun math problem and a crappy game. Now I like playing it. It's my Android Evo's fault, damn it... it's the perfect game for a handheld, and it has built in strategy advice. Once I started improving at harder puzzles, there was no escape.)

And so it began.

To be continued...

Friday, October 16, 2009

Ugly numbers in a heap

As I was saying, a good solution to finding ugly numbers would be to somehow enumerate only numbers which are known to be ugly, removing them from memory after they have been read. I couldn't think of the way to do that. Internet to the rescue.

Here's what you do:

Store the number 1 in a set.
Consider the smallest item, "n" in the set. That is counted as the next ugly number.
Add the numbers n*2, n*3, and n*5 to the set.
Remove n from the set.
Repeat steps 2-4 until you reach the desired ugly count (1500).

Each time you search for the next ugly number, you remove one number from the set but add three more, for a net gain of two. Thus, the set of numbers you are storing on each step i is (2*i-1). That's a manageable memory size. But there's still a snag.

Sets are unordered lists, and we want to always find the smallest number of the set as efficiently as possible. How do we do that? We can't just read every number in the set, or the problem scope quickly becomes O(n²). Can't have that.

Turns out there is an ideal data structure for handling this kind of thing, but it's something I haven't looked at since I was an undergrad. It's called a min-heap.

A heap is a special kind of binary tree, where every level of the tree is filled to maximum density. For instance, in a heap with 8 elements, you know that the top level has one node with exactly two children; each of those two children has two children (that makes 7), and the leftmost node at the third level has one child on the left.

The neat thing about using a heap is that because of its predictable structure, you can represent it using an expandable one-dimensional array. You don't have to mess with pointers or the other headaches of data structures. If you are looking for root.left.right, you just calculate the spot in the array and access element 4.

A min-heap is constructed in such a way that the children of each node are guaranteed to be larger than the parent node. In other words, the smallest value is guaranteed to be at the root, and each child of the root is itself a min-heap with the smallest element of that at the root.

(If I'm going too fast for you, speak up. I don't believe I've ever done a post specifically on binary trees, so maybe I should.)

In any case... you can do two things with a min-heap: add a new element, which is rapidly sorted to the right place; and remove the root element, which then moves the next smallest element into the root and fixes the rest of the heap so that it is still a min-heap.

This is exactly what we need. It keeps the cost of searching for the smallest known ugly number to O(log n), where n is the number of elements currently in the list.

So to recap:

Write your own min-heap class, which is the fun part. ;)
Add the number 1 to the heap.
Inspect the root of the heap, and count that number ("r") as ugly.
Add 2r, 3r, and 5r to the heap.
Delete the root.
Repeat steps 3-5 until you reach the desired ugly count (1500).

Actually it turned out that wasn't quite the answer, because you can get duplicate ugly numbers on the heap. For instance, in the second pass you add 2*3, while in the third pass you add 3*2, which results in multiple 6's on the heap. But that's easy to fix. All we have to do is change step 5 to: "Keep deleting the root until it is no longer equal to the last recorded ugly number."

And there you go. Small storage space, quick execution time, ignores non-ugly numbers, and finishes in 3 seconds easily.

SPOILER:

In case you wondered, the answer is 859,963,392.

Ugly numbers and prime numbers

The second wrong approach I tried taking to find ugly numbers was something called a Sieve of Erathosthanes. This was something that my dad taught me about in high school, and it helped me win a programming contest as a sophomore.

Here's where I should give a quick shout out to my favorite HS teacher, Thom Laeser, who I think might read this blog. Thom was kind of a Renaissance man as a teacher; across my four years in high school, I took one geometry class, two programming classes, and a journalism class from him. He also started the school's first Humanities course, which I did not attend. It wasn't until high school that started really programming (rather than just dabbling in BASIC), and I credit Mr. Laeser with giving me that extra boost.

When I was a sophomore, he gave all his programming students a challenge, which was to be the one who could write a program printing the most consecutive prime numbers within five minutes. This was 1990 and the computer lab probably was limited to 80386 processors or something.

It was clear early on that the contest was going to be between me and a senior. The next day, we both proudly came in with programs that printed all primes from 2 to MAXINT in well under 5 minutes. This was in Pascal, I think integers were 16 bit, so the highest number would be 65,536. Mr. Laeser shrugged and said that was very impressive, but we would have to keep going if we wanted to win.

So. Along the way, I had to learn to write my own large integer output, and also save space by representing numbers with bit manipulation instead of wasting entire "int" values on booleans. More importantly, I learned about the Sieve of Erathosthanes, which I mentioned before.

The sieve is a method for quickly weeding out prime numbers from a finite list of numbers. Here's how it works (simplified version): start with an array of booleans from 1 to, say, a million. The first true prime number is two, so cross out all multiples of 2 that are not 2; so, 4, 6, 8, 10, etc. Then we move up to three, crossing out 6 (again), 9, 12, etc. Next we encounter 4, but 4 is already crossed out, so we ignore it and move up to 5, eliminating 10, 15, 20, etc.

It turns out that you only have to filter on numbers up to the square root of the array size, which in this case is 1000 (sqrt(1000000)). After that, you know there are no multiples of 1001 because you would have had to multiply 1001 by something smaller, and hence already visited, to reach a million.

Long story short, I won the contest, but it was a grueling arms race for several weeks as we kept improving our programs and then "stealing" new ideas by observing how the other's program worked.

Sieve of Ugly

So that was a roundabout personal story, but I haven't told it here before so I thought I might as well get it out. And the point is that my second pass at solving the ugly number problem was to use a sieve.

Sort of a simple modification, really. I didn't just use a boolean array, because there are now three kinds of numbers: 1: prime; 2: ugly; 3: neither prime nor ugly. So I made it an integer array, where each integer stores one of three flags. All numbers start prime; then multiples of 2, 3, and 5 get marked as ugly. Then run another pass on all numbers above 5, marking their multiples as non-ugly.

Using integers was super-overkill, as an int is 32 bits, and I only needed 2 bits to represent my flags: 00=prime, 01=ugly, 10=non-ugly. What I found was that I couldn't sieve up to the 1500'th ugly number, because the array was so huge that I ran out of memory.

So first I changed it to an array of bytes (8 bits). Then, still not satisfied, I improved performance by writing some bit-manipulation routines to simulate an array of two-bit numbers using four numbers per byte.

It worked; I found the same number much faster than before, but still with a noticeable delay before the sieve finished running. 3 seconds was still way out of reach.

What next?

While I was coding the sieve, it occurred to me that it was a terrible waste of space to store ALL the numbers, both ugly and non-ugly, in an array until the entire program was finished. It seemed like there ought to be some kind of quicker way to figure out, given the current ugly number, what the next one in the sequence should be. At worst, I could maintain an array of only 1500 numbers to store only ugly numbers (instead of all numbers up to a trillion).

I wracked my brains for this method; I wrote down the factorization of the first 10 numbers or so, looking for a pattern, but couldn't find one. I'm pretty sure there isn't one, but I was on the right track at this point. I know because that's when I "cheated" by looking to see what other people had done.

And maybe they're smarter than me, or maybe they spent more time on it, but there is still a certain grim accomplishment in not having to reinvent the wheel.

There's a well known "hard" problem in computer science called the traveling salesman problem, which goes like this: You have a bunch of cities, separated by known distances. You have to visit all of them while spending as little time as possible in transit.

To solve this problem by brute force becomes impossible very quickly as you add more cities. So the best known approaches use short cuts; they use sneaky tricks that don't necessarily produce the "best" answer, but are guaranteed to produce a pretty good answer most of the time.

Anyway, I heard that some computer science genius was once asked how he would solve the traveling salesman problem, and he said: "I would put it on the back of a cereal box. 'Find the most efficient path through these points. Winner gets $100, mail your answer to the following address'..."

Well, that's one kind of divide and conquer solution right there. Hooray for the internet, though... now we don't even need cereal boxes.

Online programming puzzles and "ugly" numbers

Ran across this site while looking for general programming puzzles to stay in practice.

On the site, my attention was captured by this puzzle involving "ugly" numbers.

Ugly numbers are numbers whose only prime factors are 2, 3 or 5. The sequence

1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, ...

shows the first 11 ugly numbers. By convention, 1 is included.

Write a program to find and print the 1500'th ugly number.

This isn't a particular hard puzzle if you brute force it. All you have to do is write an algorithm to determine whether a particular number is ugly or not, sort of like this:

while (num % 2 == 0) num /= 2;
while (num % 3 == 0) num /= 3;
while (num % 5 == 0) num /= 5;
if (num == 1) return true;

Then you can just loop through as many numbers as it takes, incrementing a counter as you find ugly numbers, until you reach 1500 numbers.

The problem with this approach is that there is a time limit of three seconds. The brute force approach requires you to do an increasingly long operation (dividing repeatedly to find the prime factors) on a series of numbers which turns out to be very large (I won't spoil it, but the answer is just short of one trillion). Early numbers are computer quickly; then it slows down so finding the 1500th number may take several minutes.

I confess I cheated to solve this puzzle. I came up with one other wrong approach after trying the brute force method, then went googling for discussions of this puzzle. As I hoped, I did find a protracted argument about the right way to solve it. One guy hit on a really clever solution, based on a data structure which I haven't used for years. It took me a few tries to implement, but it was very satisfying in the end.

I guess I could feel bad for "cheating," but I don't. I came close to finding the right answer, but really, if you want an optimal answer to a well defined puzzle then there's no harm in hunting around to see if someone did it better than you. Professionally, implementing a GREAT solution that you looked up is preferable to implementing a mediocre solution that you got with only the sweat of your brow. You can learn new tricks from other people that you'll have in your toolbox for next time.

As a tangent, I take the same approach to strategy games. I like to take my time and learn the rules of a game on my own, and I would scorn to look at a strategy guide during that period. However, there comes a point where you have a basic understanding of the rules and can improve much faster by considering input from others who have already mastered the game.

Anyway, more on the wrong approaches and the right one later.

Castles of Air