Castles of Air: Murphy's Law

Showing posts with label Murphy's Law. Show all posts

Wednesday, October 9, 2013

Affordable Care Act Blues 2

Following up on my post from yesterday, a friend linked me to an article on Salon that attempted to diagnose the problems with the healthcare.gov website. The article is full of bad logic and half-baked assumptions that I couldn't resist correcting.

First of all, the article refers to a discussion on Reddit that was just ridiculous.

“The Redditors picking apart the client code have found some genuine issues with it, but healthcare.gov’s biggest problems are most likely not in the front-end code of the site’s Web pages, but in the back-end, server-side code that handles—or doesn’t handle—the registration process, which no one can see. Consequently, I would be skeptical of any outside claim to have identified the problem with the site.”

This is WAY too charitable. I’m not “skeptical.” I think the Redditors are dumb as rocks for wasting any time on this at all.

Server side (back end) code handles pretty much everything that is important about a web application's internal logic. The client side (front end) stuff -- which you can see printed out if you right-click on any page in your browser and select "View Source" -- handles cosmetic stuff. How the website looks, what kind of warning messages you get if you enter bad data, etc. It's possible for a site to break because of bad front end code, but that were true then it would almost certain be broken for everybody, not up and down sporadically.

I passed along the article to a friend who is the architect of our software at my job. His explanation is more thorough than mine, so I'm reprinting it with his permission.

Here’s how I would have written the same article but with my speculation….

I think the biggest and most important challenge is the overall architecture. From an architectural standpoint you need to ensure that your system is

Secure

Can scale appropriately

Handles the scenario where it is overloaded

Secondly, for the end user experience, from a UI perspective you need to be

User Friendly

Ensure proper feedback

Have as much client side validation as possible

Once you go live you are in a situation where you have to deal with all types of scenarios quickly. One issue that I see referenced a lot is the fact that the security question dropdowns aren’t populating. Maybe they load tested this but didn’t verify the contents of the actual html rendered. If you rely on older back ends then that needs to be part of your load testing and you need to reduce talking to those systems the least amount possible.

As far as server load there are 3 types of congestion you need to deal with. Memory, CPU and information I/O (Network and Drive). The dropdown thing might be related to either one of those three.

This, of course, is informed speculation about the nature of the problem itself, drawing on real experience about how websites are designed, and how to go about debugging the problem. The Salon author, David Auerbach, doesn't do this sort of thing. Instead, he casts about wildly to find a place to assign blame. He claims that he can identify it as an Oracle problem based on a single error message:

“Error from: https%3A//www.healthcare.gov/oberr.cgi%3Fstatus%253D500%2520errmsg%253DErrEngineDown%23signUpStepOne.”

To translate, that’s an Oracle database complaining that it can’t do a signup because its “engine” server is down.

What? It is? How did he know that? I looked up “ErrEngineDown” to see if it might be a standard Oracle message. It is not. So my reading is that it’s simply the name that the developers themselves chose to assign to this particular error. There is literally nothing you can determine from this one status result, as far as identifying what kind of database they used, or why the database failed.

After that, Auerbach goes on to state that

That is, the front-end static website and the back-end servers (and possibly some dynamic components of the Web pages) were developed by two different contractors. Coordination between them appears to have been nonexistent, or else front-end architect Development Seed never would have given this interview to the Atlantic a few months back, in which they embrace open-source and envision a new world of government agencies sharing code with one another.

It's true, apparently there are at least two different developers who've had their hands on the system: Development Seed and CGI Federal. This is not a legitimate criticism of the process. The federal government is big. The ACA is big. It is routine and normal to have multiple companies working with one set of data. After all, each individual state seemingly has their own website which has to connect to the ACA. In that case, you typically have a web service host on the back end, which processes results, and the results come from many different clients -- i.e., the state's web site, which was probably developed at least partially by someone in the state.

And that's all we know. There is no evidence I can find in any of those links, that "coordination between them appears to have been nonexistent." Auerbach simply made that up. He also makes up a cute little fictional dialogue that has no grounding in reality whatsoever. That conclusion also does not follow from the fact that they use open source code and are enthusiastic about promoting open source principles, unless I’m missing some other piece of information cited. It may be the case, but it’s not demonstrated at all. Open source is just a model of developing something with transparency. It doesn't say anything at all about the process which the two developers followed.

To reiterate: We're talking about a website that is not working, for some people, some of the time. A friend posted just today that he successfully created his account and is done with the process. This is not a site that is fundamentally broken or flawed; it is a server that is sometimes overloaded by a very high volume of traffic. That's it. That's the whole problem.

Tuesday, October 8, 2013

Affordable Care Act rollout blues

I keep hearing journalists speculate about what the explanation might be for the Affordable Care Act website being so unreliable in its first week. While it sounds like a good talking point, the conversations have generally been pretty dumb.

You want an explanation? It's a massive online system that got hit with nine million users during launch week. That is more than World of Warcraft has subscribers right now.

As an IT engineer and an online gamer, I would be shocked NOT to see an online system on that scale experiencing technical problems. I played WoW since launch day. It had sporadic connections for the first week or two. The same thing happened when the first couple of expansions came out. Most other MMO's and numerous other games experienced similar glitches early in their life cycles also. Who remembers the PR nightmare of the latest rollout in the SimCity series? People who pre-ordered the game, mostly couldn't play at all for a week.

Non-functioning websites during launch week are the rule, not the exception. The reason is, an individual server can handle a certain number of simultaneous connections. For a large project, you use multiple servers running in parallel, so that they can spread out the load. There are also multiple redirections to other computers on the network; for instance, the database is housed on another server, and there might also be multiple parallel databases running in some cases. In the case of a big government system accessing social security numbers, it's likely to interface with some really old legacy systems somewhere along the line.

So in the first place, the throughput of the whole system can only be as reliable as its weakest connection. In the second place, systems are designed to handle an expected average load each day. One reason WoW was so slow at first, was because an especially high number of people wanted to be the first to play. After some time, usage settles down a bit.

You can't always predict how high the early load will be on the system. And even if you do guess correctly, it's not necessarily a good idea to buy double the number of computers that you'll need on an average day; that's a lot of expensive computing power that you'll just wind up having to sell off fairly quickly.

So, don't be surprised that there are problems. Be surprised that it's working at all most of the time.

Monday, August 1, 2011

On learning to love your users, who are idiots

In the process of discussing error checking in the comments section of a recent post, I flippantly remarked that "all users are idiots." This is a sentiment that I expect all veteran programmers will have encountered or stated themselves on some occasion.

From a new programmer struggling with this problem, I hear this: "This would be a heck of a lot easier if users just weren't idiots." And of course, that is something we all wish. But then again, if we didn't have to assume user idiocy, we wouldn't be writing programs.

There's a fundamental difference between constructing a program and writing a novel or painting a picture. For static forms of media, you only have to create one thing from beginning to end. Once you've finished writing your book, it's over. For better or for worse, your characters have finished interacting with each other. They've said what they have to say. People will either like it or not like it; they may debate endlessly about what your words or images "really mean," but they can't affect its behavior.

Unfortunately, programs aren't like that. Once your release your program into the wild, users get to do whatever they want with it. And if they do something utterly crazy and your program breaks, they'll blame you.

Murphy's Law ("Whatever can go wrong, will") was written by an engineer in the 19th century, but it is used most by software engineers. It's not that everything possible will go wrong every single time your program is run. It's just that if you have millions of users (which you will if you are successful) then even a very small chance that one person will do something unexpected, must necessarily magnify into a virtual certainty that somebody, somewhere will find a way to blow up your program.

With that in mind, writing a program is just as much about covering every possible angle of what some idiot user might do to you, as it is about creating a pleasing presentation when the program is Used As Intended.

As a gamer, I have heard several interviews with voice actors who are veterans of film or television, but new to performing voices for video games. The universal sentiment seems to be "This is the craziest thing I've ever had to do. You have to perform a dozen different lines for every single scene, and they're not just different takes for the editor to select from. They're ALL USED in the game. I'll have to perform a scene featuring my dramatic death, and in the very next scene I'm alive again. I'll have to answer the same question five different ways, just in case the player decides to harass me by asking my character over and over again." And so on.

So, because we must cover every possible use of the program, we create elaborate error scenarios and use all kinds of tricks to keep the user on track. One way to handle user error is to simply give the user very strict instructions, like this: "You MUST TYPE A NUMBER from 1-100, and if you do ANYTHING ELSE, then the program WILL CRASH and that is YOUR RESPONSIBILITY." That's not very satisfying, though. People make mistakes, even people who are not idiots. It's much nicer to recover gracefully from an error. At every step, you're asking yourself: "What can go wrong here?" And then you add a clause to your program: "If bad thing xyz occurs, say something polite to guide the user back on track, and try it again."

Often, an even better solution is to tie the user's hands so he can't actually make the mistake in the first place. For example, slider bars and drop boxes exist exactly so that the user can pick an integer from 1-100 without the capability to do something stupid.

Obviously, it takes a lot of work to write a bulletproof program, and the more thoroughly you prepare for bad user behavior, the harder the work is going to be. Often you have to strike a happy medium. The smaller your audience is, the less effort you have to put into mistrusting your users. If you are just writing a program for yourself, it's probably less work to try and give good input than to write error handling routines.

Also, the more general the application, the more you have to just let the errors happen, and your only responsibility becomes to make sure the errors don't cause a crash. For instance, if you're writing a calculator program, you can't just stop all operations that divide by zero. The user might just DECIDE to divide a number by zero. You just have to tell him he made an illegal operation, and move on to the next step.

When you write a program, you're not designing a static thing for people to look at. You're designing a universe of possibilities to cover all uses. The more time you spend thinking about what an idiot might do, the better you can guarantee that you will make it a pleasant experience for those who are not idiots.

Castles of Air