A Version Aversion

I like war stories. They remind me that I’m not alone on this journey.

This is a war story. It involves a lot of entrails, questionable surgery and plenty of walking wounded. I carry the scars so perhaps you don’t have to …

… when I joined the team the system consisted of:

  1. A few thousand lines of C++ code which ran as 10 processes on two Solaris hosts;
  2. Some Java code that ran on an NT4 J2EE server;
  3. A bunch of Windows client PCs all running the same Java VM (but a different one to the server);
  4. A collection of ‘glue’ scripts written in shell script and Tcl.

It ran in two locales, and the hardware was broadly equivalent in both, from what I recall. Now compare this to what we had by the time I left:

  1. A larger body of C++ code with 15 or so processes on one Solaris host, as well as a set of additional x86 UNIX hosts (of mixed hardware pedigree) that were sprinkled with various flavours of Solaris and Linux (RedHat) and would each run 2-4 processes depending on the number of cores;
  2. 2 J2EE servers of the same vendor;
  3. A further 2 J2EE servers of a different vendor (don’t ask!);
  4. A collection of scripts written in shell script & Python (thankfully we stamped out the Tcl);
  5. 3 primary locations each running 2 different versions of our software, and a single satellite location (hanging from a primary location).

Each server release involved somewhere around 10 hosts running different hardware, OS & JVM. It’s both similar to and different from the problem that software vendors have when they need to make their product run on multiple platforms. The difference for us was that our software was a distributed system and each component needed to interact seamlessly with its peers, which is something not many software vendors make a habit of, other than Microsoft I suppose.

Against this background was a team of 10 developers in 3 timezones developing software for a constantly changing and fairly lucrative business. Quickly made enhancements could secure profits, while instability and failures might secure losses, so it was important to try and keep the system running as smoothly as possible. However, the large code base (>100,000 lines) and confusing deployment array made every release a roller coaster ride. In my last two years of the job the release cycle, whilst somewhat improved from when I started, had increased from 1-2 months to almost 6. This had the unforeseen consequence that developers would, out of necessity, place new features onto release branches to be able to get features out faster. That’s when the madness started.

There were too many release versions, operating system versions and client library versions to contend with. Sometimes even trivial changes became enormous chess games where the order of the changes that we made would determine whether the system would actually run or not. Eventually it was bound to grind to a halt because with that many deployment configurations each release had too many testing dependencies. There were two problems here. Firstly, since this was now a very widely distributed system, it was difficult for us to have an accurate test deployment that worked. Secondly, making distributed systems work is hard anyway, and the more configurations you have to manage the more complex it’s going to be. We were trying to help ourselves by retrospectively adding unit-tests but the coverage was still fairly low and so we could never have very much confidence that a built system was actually going to work. What we really desperately needed were integration tests but we never quite managed it.

That’s where Joel’s post from last week comes in. As described by Joel, we essentially had a SEQUENCE-MANY situation: to be sure of stability we had to test many releases against many deployment configurations. It would be fair to say that we failed to do this adequately. I sometimes wonder if we could have done it a little better.

  1. Could We Have Had Tighter Control Over The Hardware? Unlike the problem of enforcing standards in Web browsers we of course had full control over the deployment environment so we could have mandated a common platform for it. As enticing as this sounds, talking to system admins now and then would tend to suggest that this is simply not possible if the hardware is to be purchased incrementally. This is because after you buy the first 2 Dell servers with a standard specification, a month later that specification will have changed. As more time passes the drift between the hardware is larger.

    If, however, you sourced a job lot of the hardware in the same place at the same time, you could buy extra (for spares and future requirements) and attempt to keep this variable constant. It would have been expensive to do but it is at least possible in this scenario. I think that this probably would have reduced the number of different cross-compilations that were required and reduced the number of different JVMs that we had to manage. The biggest problem though is that we would have, to a certain extent, needed to know the future to be able to predict what sorts and what amounts of hardware we would need when we set out. That kind of makes it a non-starter, coupled with the fact that I’ve never actually heard of anyone doing this for real.

  2. Could We Have Had Tighter Control Over The Software? This is the thing that concerns me the most and is definitely a place we didn’t do as well as we should have. We let people go ahead and implement locale-specific solutions that were unworkable globally, and those created internal system dependencies that would later need to be ‘undone’. Anyone who has ever worked on a system after its release will know it’s much easier to get it right first time. This is because if you create an intermediate solution that ends up being used then you have to manage the old intermediate-version ‘out’.

    Indeed, there was a story here too. The original system architect moved on a year or so after I joined. He used to worry about 80% of the code that got committed. When he left, no-one really had his insight into the architecture and the rust quickly set in. Related to the loss of the architect, as already mentioned, was the lack of integration tests. Both would have helped us to identify which code was bogus and have it fixed before it reached a release stage.

The one thing we did successfully manage to do was to stop developers changing release branches. But the effect of that was to make us look like chumps when the business had to be denied features until new releases could be rolled out. Ho hum.

The idealist in me thinks we could have done a few things to make it work better but the pragmatist thinks that we did what we had to do. Whilst the idealist in my head makes a lot of noise and gets listened to an awful lot, the pragmatist is the one who gets the most results. When you are faced with a daily tightrope walk, like we were, you have to try and be both idealist and pragmatist, choosing the idealist’s course when you think you can get away with it and the pragmatist’s when you can’t.

But when all else fails just hope for the best. The scars will heal. Eventually.

article finance programming python

Calculating peak-to-trough drawdown

Ok, so this is a little bit technical but it’s an intriguing puzzle that got me thinking quite hard. So here’s the problem. Sometimes investors want to be able to judge what the absolute worst case scenario would have been if they’d invested in something. Look at the following random graph of pretend asset prices:


You’ll see that there are two points on the graph (marked in red) where if you had invested at the first point and pulled out on the second point you would have the worst-case loss. This is the point of this analysis and is a way for investors in the asset to see how bad, ‘bad’ has really been in the past. Clearly past prices are not an indicator of future losses. 🙂

The upper one is the ‘peak’ and the lower one is the ‘trough’. Well, finding these two babies by eye is trivial. To do it reliably (and quickly) on a computer is not that straightforward. Part of the problem is coming up with a consistent natural-language description of what you want your peak and trough to be. This took me some time. I believe what I really want is: the largest positive difference of high minus low where the low occurs after the high in time-order. This was the best I could do. This led to the first solution (in Python):

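def drawdown(prices):
    # Brute force: for every possible peak i, find the lowest later price j.
    maxi = 0  # index of the peak of the worst drawdown found so far
    mini = 0  # index of the matching trough
    for i in range(len(prices)):  # start at 0 so the first price can be a peak
        maxj = i
        maxdiff = 0  # largest drop from prices[i] (avoids shadowing max())
        for j in range(i + 1, len(prices)):
            if prices[i] - prices[j] > maxdiff:
                maxj = j
                maxdiff = prices[i] - prices[j]
        if maxdiff > prices[maxi] - prices[mini]:
            maxi = i
            mini = maxj
    return (prices[maxi], prices[mini])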

Now this solution is easy to explain. It’s what I have come to know as a ‘between’ analysis. I don’t know if that’s the proper term but it harks back to the days when I used to be a number-cruncher for some statisticians. The deal is relatively straightforward: compare the first item against every item after it in the list and store the largest positive difference. If this difference is also the largest seen in the data-set so far then make it the largest positive difference of all points. Repeat for each item in turn. At the end you just return the two points you found. This is a natural way to solve the problem because it looks at all possible start points and assesses what the worst outcome would be.

The problem with this solution is that it has quadratic complexity. That is, for any data-series of size N both the best and worst case will result in N * (N-1) / 2 comparisons; in shorthand this is O(N^2). For small N this doesn’t really matter, but for any decently sized data-series this baby will be slow-as-molasses. The challenge then is to find an O(N) solution to the problem and to save-those-much-needed-cycles for something really important:

def drawdown(prices):
  prevmaxi = 0  # peak of the largest drawdown found so far
  prevmini = 0  # trough of the largest drawdown found so far
  maxi = 0      # index of the highest price seen so far

  for i in range(len(prices))[1:]:
    if prices[i] >= prices[maxi]:
      maxi = i
    else:
      # You can only determine the largest drawdown on a downward price!
      if (prices[maxi] - prices[i]) > (prices[prevmaxi] - prices[prevmini]):
        prevmaxi = maxi
        prevmini = i
  return (prices[prevmaxi], prices[prevmini])

This solution is a bit harder to explain. We move through the prices and the first part of the ‘if’ will find the highest part of the peak so far. However, the second part of the ‘if’ is where the magic happens. If the next value is less than the maximum then we see if this difference is larger than any previously encountered difference. If it is, then this is our new peak-to-trough.
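One way to convince yourself that a single pass really does find the same answer is to check it against the brute-force definition on random data. The following is only a sketch of that check: the names drawdown_brute and drawdown_linear are made up for this illustration, and both functions are restated compactly so the block runs on its own. It compares the size of the drawdown rather than the exact pair, since ties can legitimately pick different points.

```python
import random


def drawdown_brute(prices):
    """O(N^2): largest high-minus-low where the low comes after the high."""
    best = (prices[0], prices[0])
    for i in range(len(prices)):
        for j in range(i + 1, len(prices)):
            if prices[i] - prices[j] > best[0] - best[1]:
                best = (prices[i], prices[j])
    return best


def drawdown_linear(prices):
    """O(N): track the running peak and the worst drop seen from it."""
    peak = prices[0]
    best = (prices[0], prices[0])
    for price in prices[1:]:
        if price >= peak:
            peak = price
        elif peak - price > best[0] - best[1]:
            best = (peak, price)
    return best


def check(trials=200, seed=0):
    """Compare drawdown sizes from both versions on random integer series."""
    rng = random.Random(seed)
    for _ in range(trials):
        prices = [rng.randint(1, 1000) for _ in range(rng.randint(1, 50))]
        slow = drawdown_brute(prices)
        fast = drawdown_linear(prices)
        assert slow[0] - slow[1] == fast[0] - fast[1], prices
    return True


prices = [100, 120, 90, 110, 130, 80, 95]
print(drawdown_linear(prices))  # (130, 80)
print(check())                  # True
```

Note that the early drop from 120 to 90 is recorded first and then replaced once the fall from 130 to 80 turns out to be bigger.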

The purist in me likes the fact that the O(N) solution looks like easier code to understand than the O(N^2) solution. Although the O(N^2) solution is, I think, an easier concept to grapple with, when it’s translated into code it just doesn’t read as well.

article project management

You think your code don’t smell?

So, code reviews are great. Get the benefit of some ass-hole telling you that your comments should be C-style (/*) and not C++-style (//) and remind you that the member name ‘mSuckThis’ is not suitable, ever. No really, code reviews are great. It’s just that a lot of times they just don’t work.

The first time I encountered code-review was when my boss of the time had just read some book on how to manage programmers and was keen to inflict it on all his employees. His code-review process was to take all my work, print it out and go through it line-by-line. Master and student style.

This type of code-review, in the way that he implemented it, was meaningless. It concentrated on an important but largely automatable aspect of code review and that is: adherence to coding guidelines.

As I see it there are three types of defect that code review is trying to identify:

  1. Adherence to coding guidelines (or lack of it) and inter-package dependencies.
  2. Identification of localised errors: “that loop is infinite”, “that algorithm should be log(N) and not N^2”, or “that module is way too big”.
  3. Identification of non-local errors. Where local means local to the code-review. For instance the impact of adding a throw on a widely used method and how that affects all the dependent code paths.
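To make the third type concrete, here is a small made-up sketch of a non-local error. None of these functions come from a real code base; parse_quantity_v1, parse_quantity_v2 and total_quantity are hypothetical names invented for the illustration. The change from v1 to v2 looks perfectly reasonable in a review of that one function, yet it breaks a caller the reviewer never sees.

```python
def parse_quantity_v1(text):
    """Original behaviour: return None when the text is not a number."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_quantity_v2(text):
    """'Tidied up' behaviour: let the error propagate instead."""
    return int(text)  # raises ValueError on bad input


def total_quantity(lines, parse):
    """A distant caller written against v1: it skips None and never catches."""
    total = 0
    for line in lines:
        qty = parse(line)
        if qty is not None:
            total += qty
    return total


lines = ["3", "oops", "4"]
print(total_quantity(lines, parse_quantity_v1))  # 7

try:
    total_quantity(lines, parse_quantity_v2)
except ValueError:
    print("the caller nobody reviewed has just fallen over")
```

The diff for parse_quantity_v2 is one line long and arguably cleaner, which is exactly why this kind of defect tends to sail through a review.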

I question anyone’s ability to fully understand the dynamic nature of any reasonable sized piece of software by just looking at a small excerpt. Every time you review that code you have to ‘load’ that supporting information into your head to be able to identify whether the code is totally awesome or tragically bogus. In my experience defects of a local type (type 2) were sadly rarely identified by code review and defects of a non-local type (type 3) almost never.

The improvement of code-quality I’m passionate about. But I don’t see any realistic way to achieve what I want. To identify non-local errors you really need your code reviewer to sit next to you during the development or be as deeply involved in the code as you are. It probably would need a similar approach to reliably find local errors too. However your reviewer is rarely as involved as that. It seems that some judicious use of pair programming might be the answer but that comes with its own problems.

It seems that to get the best out of code-reviews you have to be very careful about how you implement them. Sure, let’s automate what we can automate and pair program on tricky items but the real code-review needs to be extremely skilfully handled to get the best bang-for-your-buck-chuck.

article programming

Time: the unseen global variable

Just about everyone knows that global variables need to be used sparingly. The more you use the more likely you are to capture complex state in places that are hard to maintain. Or something.

As well as all the globals you can see and measure there exists a shadowy league of ‘unseen’ globals in your programs. Some, like environment variables, are clearly designed as global variables and are desirable and understandable. However, some are wistful and ephemeral and dance round your program like wicked elves. Time is the biggest and most scary of these elves.

For most programs you write time probably doesn’t matter, they are to all intents-and-purposes time-less. But as soon as you start entering the shadowy world of time, and the even more nebulous one of time-zones and daylight savings, a whole set of other state is being used. In my experience the programs and components that I have written that have been dependent on time have been some of the most complex to develop and maintain. This is for a variety of reasons but in summary:

time is not constant and can be interpreted in more than one way.

This leads to all manner of difficulties:

  1. Code that depends on the current system time ‘Now()’ and doesn’t pass it as a parameter is always going to be fragile. This is mostly because its behaviour can be non-deterministic unless you properly account for the fact that time is not-constant. This is especially important because your programs are susceptible to hard-to-spot boundary effects if you write expressions that use Now() more than once and depend on it returning the same value for each call. Which of course it never will.
  2. Time and date should never, ever, ever be separated from one another. You get all sorts of tricky errors when you split the two. Especially when you are performing some sort of time zone or daylight savings calculation where the two should change together but do not.
  3. Some programming languages represent the date (no time) as a date with a time of 00:00:00. Which is intuitive, but consider then what happens when you load a date (with no time) from a database in the past, when there were daylight savings, into a time when there are no daylight savings. In the frame of reference of now your localised past time will now be an hour earlier and so will be in the final hour of the previous day. This problem clearly applies to timezones also, and it arises because you made the mistake of not having a consistent view of time.
  4. Not only can the meaning of calendar time change after-the-fact (due to time-zones) but it can also be interpreted differently by different cultures.
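The midnight problem from point 3 is easy to reproduce. The sketch below is an illustration only: it assumes a hypothetical zone that sits one hour ahead of UTC while daylight saving is in force and on UTC otherwise, and it uses fixed offsets so that it runs without a timezone database. The names SUMMER, WINTER and date_as_midnight are mine.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical zone: UTC+1 with daylight saving in force, UTC+0 without.
SUMMER = timezone(timedelta(hours=1), "summer")
WINTER = timezone(timedelta(hours=0), "winter")


def date_as_midnight(d, tz):
    """Represent a date-only value as 00:00:00 on that date in the given zone."""
    return datetime.combine(d, time(0, 0), tzinfo=tz)


# A date stored in the summer ...
stored = date_as_midnight(date(2008, 7, 1), SUMMER)

# ... and read back in the winter frame of reference.
seen_in_winter = stored.astimezone(WINTER)

print(stored.date())          # 2008-07-01
print(seen_in_winter.time())  # 23:00:00
print(seen_in_winter.date())  # 2008-06-30
```

It is the same instant throughout, but the ‘date’ you thought you had stored has slipped back a day.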

There’s probably a lot of other time related pickles you can get yourself into. If Harold Lloyd was a programmer ...

You’d probably not be surprised to hear me say that unit-testing is one way of addressing at least some of these problems. This does two things. If you are to get good coverage for your unit-tests you are practically forced to make time a parameter wherever it’s used, instead of calling Now(). As a direct consequence of this your code can now be called ‘As Of’ and you will be able to offer the historical view where appropriate.
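As a rough sketch of what ‘make time a parameter’ looks like in practice, compare the two staleness checks below. The function names and the five minute threshold are invented for the example. The fragile version reads the clock itself; the other takes an as_of argument, so a test, or an ‘As Of’ rerun, can pin the time to whatever it likes.

```python
from datetime import datetime, timedelta, timezone


def is_stale_fragile(last_update, max_age=timedelta(minutes=5)):
    # Reads the clock itself, so the answer depends on when you call it.
    return datetime.now(timezone.utc) - last_update > max_age


def is_stale(last_update, as_of, max_age=timedelta(minutes=5)):
    # Time is just another argument: same inputs, same answer, every time.
    return as_of - last_update > max_age


def day_window(as_of):
    # Both ends of the window derive from the one as_of value, so they
    # cannot straddle midnight the way two separate Now() calls could.
    start = as_of.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


# In production, read the clock once at the edge and pass it down.
now = datetime.now(timezone.utc)
print(is_stale(now - timedelta(minutes=10), as_of=now))  # True

# In a test, or a historical rerun, pin the time instead.
as_of = datetime(2008, 3, 1, 23, 59, 59, tzinfo=timezone.utc)
start, end = day_window(as_of)
print(start.isoformat(), end.isoformat())
```

The only place left that touches the real clock is the single call at the edge of the program.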

Indeed, I would say that where a piece of software has a time-context, then it will only be a matter of time before someone says: “Ok, that’s what it says today but what if I want to rerun it for that time in the past 3 weeks ago?”.

The time-zone and daylight savings problems can be nailed by having a consistent view on the treatment of time. For instance storing all dates/times as UTC is one thing. But if you ever need to store a local time then it should be clear what frame of reference is being used to store that time. So you might need to additionally know: the calendar, the timezone, and the daylight savings rules before you can correctly store a time.
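Here is one minimal way that might look, assuming we store the instant in UTC together with the offset that was in force when it was recorded. StoredTime is a hypothetical class for illustration only; a fuller version would probably record a named timezone and calendar rather than a bare offset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredTime:
    utc: datetime        # the instant itself, always held in UTC
    offset_minutes: int  # local offset from UTC in force when it was recorded

    @classmethod
    def from_local(cls, local):
        # A time with no frame of reference is ambiguous, so refuse it.
        if local.tzinfo is None or local.utcoffset() is None:
            raise ValueError("refusing to store a time with no frame of reference")
        offset = int(local.utcoffset().total_seconds() // 60)
        return cls(local.astimezone(timezone.utc), offset)

    def as_local(self):
        # Rebuild the local view exactly as it was when recorded.
        return self.utc.astimezone(timezone(timedelta(minutes=self.offset_minutes)))


# Local midnight in a zone one hour ahead of UTC (an assumed example zone).
local = datetime(2008, 7, 1, 0, 0, tzinfo=timezone(timedelta(hours=1)))
stored = StoredTime.from_local(local)

print(stored.utc.date())         # 2008-06-30
print(stored.as_local().date())  # 2008-07-01
```

Because the frame of reference travels with the value, the original local date can be recovered no matter where or when it is read back.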

Then and only then will time become your faithful and obedient friend.

article training

Did you learn anything? I forget …

Something that’s been concerning me for some time is the cost and benefit of courses and seminars. Most employers and employees would perceive programmer’s training as a positive benefit and I think I’d have to agree but it seems that there is a common view that all training is good because it’s personal development. To deny that training to an employee would make you a bad employer because you are stunting your employee’s professional growth. Well I’m not so sure. I’d even go as far as to say:

A lot of technical training is of limited value.

There I said it. It’s out. I’m probably never going to get to go on a course ever again, ever.

The last purely technical course I went on was a compulsory learn Java course in-or-around 1999 (yes I’ve been avoiding courses since then). I remember it not for the content, which was forgettable, but for the fact that I’d snapped my wrist 1 week before and I could only type with one hand. The course, however, was custom designed for our company and our tutors had been briefed about what we needed to know. I would say that this sort of training, i.e. directed, has good benefit but again it only teaches the how. The why is lost.

Compare this with the ‘shrink-wrapped’ course, which is offered by a training company on a technology and is a generic product. In my experience I probably end up using a small-ish fraction of the material learnt on such courses. This is because to attract the candidates they need to give the course a broad appeal. However, the chances that I’m going on a course for the broad appeal are low; it’s more likely I’m doing it for a very narrow reason, usually defined by the next biggest project of the moment. Sure it’s helpful to know all the aspects of a particular technology, but the things I don’t need to know right now will very soon be forgotten.

This is not the only problem with shrink-wrapped courses. There is also a tendency for candidates to choose ‘advanced’ courses that are sometimes beyond their current ability, secure in the knowledge that “it can’t be that hard” and they will pick it up. When this happens the tutor has to work very hard to bring everyone up to the same level so that he/she can teach some of the more advanced aspects.

So this sort of learning is inefficient in that the information that is conveyed is often greater than is needed, but there’s another deeper problem with courses. I think that programmers would sometimes be better schooled if they learnt good approach first and implementation details later. Take security for instance: almost every application these days needs some sort of built-in security. I’d argue that it would be useful for programmers to go on a ‘security for programmers’ course which covered lots of different security aspects relevant to programmers. More useful than, say, a technical course suited to a particular technology which attempts to teach security, amongst a lot of other things. As I’m starting to learn, it’s the principles that matter, not the implementations. At least this way you’ll see the entire security picture and then, when faced with a situation which could be a security risk, you can say ‘here’s a potential risk, now I need to find a way to mitigate it’.

In some ways this sort of training is perhaps best delivered inside the organisation by mentors. Big-brothers (and sisters) who can guide the novice through the general principles, leaving the rookie to grapple with the fine details of the implementation. Sadly, when you get to my age, big-brothers are most likely to be grandparents so I guess I’ll just have to keep getting my training from Amazon.