## Wednesday, September 14, 2011

### A bit about data structures

I wanted to write another HackMaster post, but what I wanted to write about was the way I approached deciphering the data in the tables and converting them into data structures.  Then I skimmed through some older posts looking for reference points about data structures, and it occurred to me that I've never written any. In order to provide a foundation for the rest of the HackMaster breakdown, I'll have to digress and talk about structures in the abstract.

Whenever you are presented with a problem of modeling some numbers in conceptual space, the first thing you have to figure out before you write a single line of behavioral code is what kind of data structures you are going to use.  Going all the way back to the beginning of this blog, I've emphasized the importance of considering the efficiency of your design and the effect that it has on the Big-O performance of your program.  Thinking about proper data structures can buy you a lot of speed, and it can also make it really easy to visualize your program in small chunks as the complexity increases.

So what's a data structure?  The first thing programmers learn is how to use variables for individual chunks of information, like this:
int x = 3;
String str = "Hello world.";
(Technically, of course, a String object in Java is a whole bunch of characters, which makes it a data structure in itself.  But the nice thing about object-oriented programming is that you don't have to think about it if you want to.)

To understand data structures, consider an array.  An array is one of the first slightly more advanced concepts that a beginning programmer will run into.  Instead of storing just one integer, it can store several.  For example, here's a simple representation of part of the Fibonacci sequence:
int[] fib = new int[10];
fib[0] = 1;
fib[1] = 1;
fib[2] = 2;
fib[3] = 3;
fib[4] = 5;
fib[5] = 8;
fib[6] = 13;
fib[7] = 21;
fib[8] = 34;
fib[9] = 55;
When you create a single "int," you're asking the program to set aside a chunk of space in memory, large enough to hold one number.  When you create an array like this, you're asking the program instead to set aside a bigger chunk of memory ten times that size, plus (for some languages) a little bit of extra information about size constraints and such.

But arrays can be wasteful.  What if you want to set aside space that sometimes houses a hundred numbers, and sometimes houses just a few?  You could create an array of size 100, but most of the time that space would be wasted.  That's when you want to use a linked list, where you ask for new memory only at the moment that you actually need it.
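To make the idea concrete, here is a minimal sketch of a singly linked list in Java. It is purely illustrative (the class and method names are my own, and it is not a substitute for `java.util.LinkedList`), but it shows the key property: memory for a node is requested only at the moment a value is actually stored.

```java
// A minimal singly linked list: one node is allocated per stored value,
// only when that value arrives. Illustrative sketch, not java.util.LinkedList.
class IntList {
    private static class Node {
        int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private Node head;
    private int size;

    // Allocate a node only when a value is actually added.
    void addFirst(int value) {
        Node node = new Node(value);
        node.next = head;
        head = node;
        size++;
    }

    int size() { return size; }

    // O(n): walk the chain from the head to the requested index.
    int get(int index) {
        Node cur = head;
        for (int i = 0; i < index; i++) cur = cur.next;
        return cur.value;
    }
}
```

The trade-off against the array is that indexing is now a linear walk instead of a constant-time offset.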

I'm not dedicating this whole post to the implementation fundamentals of lists, but interested beginners should go check out the Wikipedia article to find out how this works.  (Sidebar: While relying on Wikipedia for information about controversial topics is often unwise, most of the technical topics that are covered are really good.)

Besides linked lists, there are lots of other data structures that you can use depending on your situation:
• A tree (which may or may not be binary) will hierarchically organize information for you, much like the folder structure on your computer does, shortening the search time as long as you know where you are going.
• A hash table or map is a structure which will find a value associated with a key, usually very quickly.  An example would be a dictionary search: you supply a word, and the program would retrieve a definition.
You can write your own versions of these structures, or, if your language supports it, use predefined classes that implement the common structures for you.
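In Java, for instance, the dictionary example above is a few lines with the built-in `HashMap`. The words and definitions here are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class DictionaryDemo {
    // A hash table mapping each key (a word) to an associated value
    // (its definition), with lookups that are usually O(1).
    static Map<String, String> buildDictionary() {
        Map<String, String> dict = new HashMap<>();
        dict.put("array", "a fixed-size block of elements addressed by index");
        dict.put("list", "a sequence of nodes allocated as needed");
        return dict;
    }
}
```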

Understanding what purpose the various structures serve, and when to use each one, is a key skill in programming interviews.  Often when you are asked "How would you solve this problem?" the best answer is not to blurt out the first notion that comes into your head, but to start applying data structures to model the problem space: lists (or more specifically, stacks or queues), trees (binary or otherwise), and tables (sometimes you can just assume the existence of a database, which is centered around associative tables).

When I hear a problem that lends itself to this, I usually make a beeline to the whiteboard and start thinking out loud: "You're asking about a list of items, so let's describe what's in an item first... then build a linked list out of items..."  Then I'll either be writing code to illustrate what I'm thinking, or (if the interview is shorter) just sketching out diagrams so that the interviewer understands the description and will probably accept that I know how to implement it.

Software is built a piece at a time.  If you start explaining how you visualize the problem in your head, you can give a much better insight into how you think than if you just start solving the problem directly.  In fact, if you start off strong with this approach but then go off on the wrong track, often the interviewer will be eager to guide you towards his concept of the solution because he's being carried along with your thought process.  This often changes the dynamic of the interview entirely.  Instead of being a room with an interrogator and a suspect, the interviewer may start thinking of himself as your ally and not your judge.  And that's exactly where you want to be when you're looking for work.

Digression's over.  Next time I'll illustrate this when I get back to decoding HackMaster tables.

1. One thing to consider, before you think about the data structure, is what the relationships between the data are, and what operations you want to perform on the data and on those relationships. For example, if there is no relationship between the data that is being stored, the best data structure is often something similar to a hash table. As soon as you want to store, query and possibly manipulate relationships between objects being stored, things become more interesting.

First, it's useful to note that any data structure can be written as a simple container (storing just objects) or an associative container (storing keys and associated data). In the C++ standard template library (STL), the set is a simple container while the map is an associative container. The question is, are there any relationships between the keys or the objects? If there are, these will guide the choice of the data structure.

Some of the most common relationships are linear orderings, hierarchical orderings, partial orderings, equivalence relations, and what I call adjacency relations (is one object adjacent to another or not, i.e., graphs).

Now, the next question is, what queries or manipulations are you going to make on the data? For example, suppose your objects/keys can be linearly ordered. If you're never going to ask "What is the first object/key?" or "What collection of objects/keys are contained in some interval [a, b]?" or "Given X, what is the next largest/smallest object/key?", then you really don't need the linear ordering and it's best to go back to a hash table.
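The queries in the paragraph above are exactly the ones an ordered structure answers and a hash table cannot. As a small sketch in Java (the commenter's examples are in STL terms, but `TreeSet`, a red-black tree, plays the same role as the STL's ordered `set`):

```java
import java.util.TreeSet;

class OrderedQueries {
    // Three queries that genuinely need a linear ordering.
    static int[] demo() {
        TreeSet<Integer> keys = new TreeSet<>();
        for (int k : new int[] {5, 1, 9, 3, 7}) keys.add(k);
        int first = keys.first();                  // "What is the first key?"
        int nextAfter3 = keys.higher(3);           // "Given 3, what is the next largest key?"
        int inRange = keys.subSet(3, true, 7, true).size(); // "How many keys in [3, 7]?"
        return new int[] { first, nextAfter3, inRange };
    }
}
```

A `HashMap` would answer none of these without scanning every key.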

I'll just refer to objects from here on in.

If you're storing linearly ordered data and you're only interested in operations performed at or near the largest or smallest objects, a deque is probably the right solution. If random access is necessary, some form of balanced search tree may be more appropriate. Priority queues, where you have random insertions but fixed accesses and removals, can be implemented as a heap.
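Both structures mentioned above exist off the shelf in Java; a quick sketch of the access patterns the commenter describes (the class and method names are Java's, not the commenter's):

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

class EndsAndHeap {
    // A deque: O(1) operations at either end, nothing in the middle.
    static int dequeDemo() {
        ArrayDeque<Integer> deque = new ArrayDeque<>();
        deque.addFirst(1);
        deque.addLast(2);
        deque.addFirst(0);              // deque is now [0, 1, 2]
        return deque.peekLast();
    }

    // A priority queue backed by a binary heap: random insertions,
    // but access and removal only at the smallest element.
    static int heapDemo() {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int p : new int[] {4, 1, 3}) heap.add(p);
        return heap.poll();
    }
}
```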

Partial orderings are stored as DAGs, while hierarchical orderings are often stored as general trees. (Focusing first on the ordering answers the question of why trees can be used both for linear orderings and for more general hierarchical orderings. The hierarchy of a binary search tree means nothing when it is storing linearly ordered data.)

Another ordering that is used in the STL is a weak ordering. This is a linear ordering of equivalence classes. This can be best explained by ordering people based on their birthday. Given two people, they either have the same birthday or one is older than the other. Two people having the same birthday, however, does not mean they are the same person.

The set, multiset, map, and multimap all use a weak ordering. Any linear ordering is a weak ordering where the equivalence classes are singletons.
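The birthday example translates directly into Java: a `Comparator` that compares only birthdays defines a weak ordering, and a `TreeSet` built on it keeps exactly one representative per equivalence class, just as the STL `set` does. (The `Person` class and sample data here are my own illustration.)

```java
import java.util.Comparator;
import java.util.TreeSet;

class WeakOrderingDemo {
    static class Person {
        final String name;
        final String birthday;   // month-day, e.g. "04-17"
        Person(String name, String birthday) { this.name = name; this.birthday = birthday; }
    }

    // A weak ordering: people with the same birthday compare equal,
    // even though they are not the same person.
    static final Comparator<Person> BY_BIRTHDAY =
        Comparator.comparing(p -> p.birthday);

    // A set over a weak ordering stores one representative per
    // equivalence class; same-birthday "duplicates" are rejected.
    static int countClasses(Person... people) {
        TreeSet<Person> set = new TreeSet<>(BY_BIRTHDAY);
        for (Person p : people) set.add(p);
        return set.size();
    }
}
```

A multiset (in Java, say a `TreeMap` from birthday to a list of people) would keep every member of each class instead.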

Thinking about relationships allows you to think more abstractly about the data and once that is cleared up, you can pick the appropriate data structure.

2. One example is a priority queue. Classically, a priority queue allows you to insert with a random priority, but you can access and remove only the object in the container with the highest priority. A heap is great for this: Fibonacci heaps are great if you want to merge heaps, while simple binary heaps stored as arrays are most appropriate if merging of priority queues is not necessary. If, however, you want to update the priority of objects within the heap, this becomes more interesting. Finding something in most heaps is an O(n) operation. I once fixed this by using a hash table to point into the heap for all objects within the heap. The hash table associated objects within the heap with their location in the heap. If insert( X, p' ) was called on an object X already in the heap with priority p, it could be percolated either up or down to satisfy the new priority p'. Otherwise, if X was not in the hash table, that meant it wasn't in the heap, so the new object X was placed into the heap with priority p'. Because hash table operations are O(1), this did not affect any of the asymptotic run times; however, it did require O(n) additional memory.
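The scheme the commenter describes can be sketched as an array-backed binary min-heap plus a hash map from each object to its current slot. This is my own reconstruction of the idea, not the commenter's code, and it is trimmed to the one interesting operation: insert-or-update in O(log n).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A binary min-heap with a hash table mapping each object to its heap slot,
// so updating an existing object's priority takes O(log n) instead of the
// O(n) search a plain heap would need. Sketch only, not tuned.
class IndexedHeap<T> {
    private final List<T> items = new ArrayList<>();
    private final List<Integer> prio = new ArrayList<>();
    private final Map<T, Integer> pos = new HashMap<>(); // object -> slot

    void insertOrUpdate(T x, int p) {
        Integer i = pos.get(x);
        if (i == null) {                 // not in the hash table: fresh insert
            items.add(x);
            prio.add(p);
            pos.put(x, items.size() - 1);
            siftUp(items.size() - 1);
        } else {                         // already in the heap: percolate in place
            int old = prio.get(i);
            prio.set(i, p);
            if (p < old) siftUp(i); else siftDown(i);
        }
    }

    T peekMin() { return items.get(0); }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (prio.get(parent) <= prio.get(i)) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        int n = items.size();
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < n && prio.get(left) < prio.get(smallest)) smallest = left;
            if (right < n && prio.get(right) < prio.get(smallest)) smallest = right;
            if (smallest == i) break;
            swap(i, smallest);
            i = smallest;
        }
    }

    // Every swap also updates the hash table, keeping the index consistent.
    private void swap(int a, int b) {
        T tx = items.get(a); items.set(a, items.get(b)); items.set(b, tx);
        int tp = prio.get(a); prio.set(a, prio.get(b)); prio.set(b, tp);
        pos.put(items.get(a), a);
        pos.put(items.get(b), b);
    }
}
```

As the commenter notes, the hash table's O(1) operations leave the heap's asymptotic running times untouched, at the cost of O(n) extra memory.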

Anyway, I think it's useful to begin by focusing on what you want from the data and the relationships between the data before you focus on the data structure. Once you know what you want to do, you can pick one or more data structures, or possibly try to combine features from different data structures to efficiently satisfy your requirements.