I like war stories. They remind me that I’m not alone on this journey.

This is a war story. It involves a lot of entrails, questionable surgery and plenty of walking wounded. I carry the scars so perhaps you don’t have to …

… when I joined the team the system consisted of:

  1. A few thousand lines of C++ code which ran as 10 process on two Solaris hosts;
  2. Some Java code that ran on an NT4 J2EE server;
  3. A bunch of windows client PCs all running the same (but different to the server) Java VM
  4. A collection of ‘glue’ scripts written in shell script and Tcl.

It ran in two locales, and the hardware was broadly equivalent in both, from what I recall. Now compare this to what we had by the time I left:

  1. A larger body C++ code with 15 or so processes
    on one Solaris host as well as a set of additional x86 UNIX hosts (of mixed hardware pedigree) that were sprinkled with various flavours of Solaris and Linux (RedHat) that would run between 2-4 processes depending on the number of cores;
  2. 2 J2EE servers of the same vendor;
  3. A further 2 J2EE servers of a different vendor (don’t ask!);
  4. A collection of scripts written in shell script & Python (thankfully we stamped out the Tcl);
  5. 3 primary locations each running a 2 different versions of our software and a single satellite location (hanging from a primary location).

Each server release involved somewhere around 10 hosts running different hardware, OS & JVM. It’s similar and different to the problem that software vendors must have when they need to make their product run on multiple platforms. However the difference for us was that our software was a distributed system and each component needed to seamlessly interact with its peers. Something that not many software vendors make a habit of, other than Microsoft I suppose.

Against this background was a team of 10 developers in 3 timezones developing software for a constantly changing and fairly lucrative business. Quickly made enhancements could secure profits, instability and failures might secure losses so it was important to try and keep the system running as smoothly as possible. However, the large code base (>100,000 lines) and confusing deployment array made every release a roller coaster ride. In my last two years of the job the release cycle, whilst somewhat improved from when I started, had increased from 1-2 months to almost 6. This had an unforseen consequence that developers would, out of necessity, place new features onto release branches to be able to get features out faster. That’s when the madness started.

There were too many release versions, operating system versions and client library versions to contend with. Sometimes even trivial changes become enormous chess games where the order of the changes that we made would determine whether the system would actually run or not. Eventually it was bound to grind to a halt because with that many deployment configurations each release had too many testing dependencies. There were two problems here, firstly since this was now a very widely distributed system it would be difficult for us to have an accurate test deployment that worked. Further, making distributed systems work is hard anyway and the more configurations you have to manage the more complex it’s going to be. We were trying to help ourselves by retrospectively adding unit-tests but the coverage was still fairly low and so we could never have very much confidence that a built system was actually going to work. What we really desperately needed were integration tests but we never quite managed it.

That’s where Joel’s post from last week comes in. As described by Joel we essentially had a SEQUENCE-MANY situation. Where to be sure of stability we had to test many releases against many deployment configurations. It would be fair to say that we failed to do this adequately. I sometimes wonder if we could have done it a little better.

  1. Could We Have Had Tighter Control Over The Hardware? Unlike the problem of enforcing standards in Web browsers we of course had full control over the deployment environment so we could have mandated a common platform for it. As enticing as this sounds, talking to system admins now and then would tend to suggest that this is simply not possible if the hardware is to be purchased incrementally. This is because after you buy the first 2 Dell servers with a standard specification, a month later that specification will have changed. As more time passes the drift between the hardware is larger.

    If, however, you sourced a job lot of the hardware in the same place at the same time, you could buy extra (for spares and future requirements) and attempt to keep this variable constant. It would have been expensive to do but it is at least possible in this scenario. I think that this probably would have reduced the number of different cross-compilations that were required and reduced the number of different JVMs that we had to manage. The biggest problem though is that we would have, to a certain extent, needed to know the future to be able to predict what sorts and what amounts of hardware we would need when we set out. That kind of makes it a non-starter, coupled with the fact that I’ve never actually heard of anyone doing this for real.

  2. Could We Have Had Tighter Control Over The Software This is the thing that concerns me the most and is definitely a place we didn’t do as well as we should have. We let people go ahead and implement locale specifc solutions that were unworkable globally but those created internal system dependencies that would later need to be ‘undone’. Anyone who has ever worked on a system after its release will know it’s much easier to get it right first time. This is because if you create an intermediate solution that ends up being used then you have to manage the old intermediate-version ‘out’.

    Indeed, there was a story here too. The original system architect moved on a year or so after I joined. He used to worry about 80% of the code that got committed, when he left no-one really had his insight into the architecture and the rust quickly set in. Related to the loss of architect, as already mentioned, was the lack of integration tests. Both would have helped us to identify which code was bogus and have it fixed before it reached a release stage.

The one thing we did succesfully manage to do was to stop developers changing release branches. But the effect of that was to make us look like chumps when the business had to be denied features until new releases could be rolled out. Ho hum.

The idealist in me thinks we could have done a few things to make it work better but the pragmatist thinks that we did what we had to do. Whilst the idealist in my head makes a lot of noise and gets listened to an awful lot the pragmatist is the one who gets the most results. When you are faced with a daily tightrope walk, like we were, you have to try and be both idealist and pragmatist. Choosing the idealist’s course when you think you can get away with it and the pramatist when you can’t.

But when all else fails just hope for the best. The scars will heal. Eventually.