Tuesday, October 8, 2013

Affordable Care Act rollout blues

I keep hearing journalists speculate about what the explanation might be for the Affordable Care Act website being so unreliable in its first week. While it sounds like a good talking point, the conversations have generally been pretty dumb.

You want an explanation? It's a massive online system that got hit with nine million users during launch week. That is more than World of Warcraft has subscribers right now.

As an IT engineer and an online gamer, I would be shocked NOT to see an online system on that scale experiencing technical problems. I've played WoW since launch day. It had sporadic connections for the first week or two. The same thing happened when the first couple of expansions came out. Most other MMOs and numerous other games experienced similar glitches early in their life cycles too. Who remembers the PR nightmare of the latest rollout in the SimCity series? People who pre-ordered the game mostly couldn't play at all for a week.

Non-functioning websites during launch week are the rule, not the exception. The reason is that an individual server can handle only a certain number of simultaneous connections. For a large project, you use multiple servers running in parallel, so that they can spread out the load. Requests also get passed along to other computers on the network; for instance, the database is housed on another server, and there might also be multiple parallel databases running in some cases. In the case of a big government system accessing social security numbers, it's likely to interface with some really old legacy systems somewhere along the line.

So in the first place, the whole system can only be as fast and reliable as its weakest link. In the second place, systems are designed to handle an expected average load each day. One reason WoW was so slow at first was that an especially high number of people wanted to be the first to play. After some time, usage settles down a bit.
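To put some numbers on the weakest-link idea, here is a minimal Python sketch. Every tier name and capacity in it is invented for illustration; none of it describes the real healthcare.gov setup. It assumes each request has to pass through every tier, and that servers within a tier share the load evenly.

```python
# Hypothetical tiers and capacities, purely for illustration.

def tier_capacity(servers, per_server_limit):
    """Parallel servers add up: each one takes its own share of connections."""
    return servers * per_server_limit

def system_capacity(tiers):
    """Every request crosses every tier, so the smallest tier sets the limit."""
    return min(tiers.values())

def bottleneck(tiers):
    """Name of the tier with the lowest capacity."""
    return min(tiers, key=tiers.get)

tiers = {
    "web": tier_capacity(servers=20, per_server_limit=5000),
    "database": tier_capacity(servers=4, per_server_limit=10000),
    "legacy": tier_capacity(servers=1, per_server_limit=2000),
}

print(system_capacity(tiers), bottleneck(tiers))
```

With these made-up figures, adding more web servers changes nothing: the one legacy box caps the whole system no matter how big the front end gets.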

You can't always predict how high the early load will be on the system. And even if you do guess correctly, it's not necessarily a good idea to buy double the number of computers that you'll need on an average day; that's a lot of expensive computing power that you'll just wind up having to sell off fairly quickly.
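Here is a rough Python sketch of that trade-off. The user counts and the per-server limit are assumptions picked only to make the arithmetic easy, not measurements from any real system.

```python
import math

def servers_needed(concurrent_users, per_server_limit):
    """Round up: a partly loaded server is still a whole server you pay for."""
    return math.ceil(concurrent_users / per_server_limit)

def idle_servers(peak_users, average_users, per_server_limit):
    """Servers bought for the launch spike that sit unused on an average day."""
    return (servers_needed(peak_users, per_server_limit)
            - servers_needed(average_users, per_server_limit))

# Hypothetical figures: a launch spike five times the normal load.
print(idle_servers(peak_users=250000, average_users=50000, per_server_limit=5000))
```

Under these assumptions, sizing for the launch spike means most of the hardware has nothing to do once the rush is over.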

So, don't be surprised that there are problems. Be surprised that it's working at all most of the time.


  1. Not to mention the fact that the government usually has to buy equipment and services from the lowest bidder.
    I'm not saying that buying from the lowest bidder necessarily results in shoddy construction. After all, a thousand lowest bidders got us to the moon. The way the bidding process works is, you write up a document saying what the equipment has to do (e.g., 16 racks capable of providing 100 amps of power; 128 servers with X many cores and Y amount of storage; load-balancing software capable of handling this many parallel HTTP sessions, etc.). Vendors submit their proposals, and you buy the cheapest one that fits the spec.
    So the obvious problem is that you have to be very careful in how you write the spec: it describes the smallest system you're willing to put up with. If you underestimate the size of the job, you may be stuck with equipment that isn't up to the task.
    You may also discover that you've left out some requirement that makes a difference: for instance, the lowest bidder might offer you a router without an intuitive interface, or one very different from the one you're used to, meaning that you now have to get up to speed on the new interface before being able to do anything else.
    So yes, I agree that it would have been remarkable if the exchanges had worked flawlessly on day one.

  2. Dear anonymous guy,

    I'll be happy to approve your comment containing empty snark, but not if you post as "anonymous."

  3. Russel, you are wasting your time trying to convince people who aren't in the business. This happens almost monthly with me when projects are rolled out. Ten years ago, we rolled out the annual enrollment internal Web site for a large bank in which 120,000 employees would be able to enroll for benefits such as ADD, LTC and health insurance (ironically). The enrollment period lasted four months, so we, and the bank, never expected more than 10% of employees to try to enroll on the first day. Predictably, the site crashed. But as people saw this, they began to wait and spread out, and within two days the site was working as expected for everyone. This is standard stuff, but laymen don't understand it.