The Robustness Principle

It seems I’m becoming a man of principles. Last week I wrote about the principle of least surprise. This week I was reading this article about martians over at Joel it reminded me about my previous job in some non-obvious ways. And before you say it, no I didn’t work with martians or any sort of aliens for that matter. The main topic of Joel’s blog is very interesting but a side issue is the discussion of the robustness principle as defined by Jon Postel in RFC 793.

2.10. Robustness Principle

TCP implementations will follow a general principle of robustness:  
be conservative in what you do, be liberal in what you accept 
from others.

Whilst Joel is talking about HTML standards Jon is talking about TCP standards. Anyway, whilst this difference is minor the pedant in me is forced to point it out ;-).

However it’s fair to say that such a principle exists and that I’ve been living by it, without actually having a name for it, for some time. The reason I started doing it was because I inherited a large code-base for a distributed system once and I discovered that occasionally components would simply terminate, users would complain and I would have to go and invent some bullshit to explain why our system wasn’t working.

I felt like I had to invent some bullshit because I’d found that some components had statements in them like the following:

    if (such and such)

The ‘such and such’ that caused the termination would be some condition that should not have occurred. You see, the original programmer felt the need to have all unknown conditions delivered to him by aborting the process. This is rather like demanding a criminal’s head on a plate because aborting a process under UNIX will usually cause a core dump. This memory dump file contains the contents of all the memory and registers at the time that abort() was called. With the appropriate tools this gives something to indicate what the cause of the problem might have been. This style of programming is sometimes referred to as kamikaze programming but I think seppuku programming is probably more apt.

My sin was to systematically remove these little time-bombs and attempt to handle the condition that had occurred. The problem with this, of course, is that if any of these conditions have undesirable side-effects then they can store problems up for the future which, I guarantee, will be a lot harder to find and fix.

It seems there is no right solution to this problem either. Extending Joel’s conclusion we could say that the idealist (the original code author) wanted to control the error state and deal with each case appropriately by hand. Perhaps, eventually, all these little abort() statements would get to be unused code and the system would run very smoothly. Perhaps. The pragmatist (me) would have a hard enough time of keeping the system running as it was without having components voluntarily dying all the time.

I stand by my decision to improve stability but I think I probably should have left a couple of the more esoteric abort()’s in place. I did improve the stability of the system to some extent. But my effort to restore faith in the system by making it completely stable would be somewhat thwarted for other reasons (more closely related to Joel’s post about martians). As we shall see next week …