The Imperfect Hash (but what percentage is beef?)

What would you do? You’ve got a C# List&lt;T&gt; full of thousands of objects of the same class. You suspect there are a few duplicates hiding in there somewhere and you want to root them out. You’d really like to put your objects into a dictionary because, if you did, you could get that shiny O(1)-ness that the cool sleek lines of a dictionary are going to give you.

Only thing is, your keys aren’t exact. They’re doubles or DateTimes or some such, and you can’t use them directly because you really need to apply a tolerance to those values. What you need is an imperfect hash function.

The solution to this problem is obvious to me now, but it wasn’t so obvious at the time, when I was trying to speed up my painfully O(N^2) application. To add your elements to a dictionary and get faster lookup, you must include all the fields of the comparison in the Equals() method but exclude the inexact ones from the GetHashCode() method.

Before I start with this analysis I’d like to apologise to all vegetarians on the planet for eating your friends. Secondly, and more importantly, I’d like to apologise to all decent-folk for producing an example so contrived that it will make you want to vomit. Please pay particular attention to avoid your shoes.

Let’s begin:

public class Beefy
{
    public string name;
    public double percentage;

    public override bool Equals(object o)
    {
        Beefy b = o as Beefy;
        if (b == null) return false;
        return (name.Equals(b.name) && Math.Abs(percentage - b.percentage) < 1);
    }

    public override int GetHashCode()
    {
        return name.GetHashCode();
    }
}
This works by relying only on the string "name" to derive the hashcode. What this means is kind-of subtle and relies on what the Dictionary&lt;T&gt; does to resolve collisions. A collision occurs when two items hash to the same location. Since we know that for Beefy we're not using a unique identifier to place the items into the Dictionary&lt;T&gt;, that dictionary must have a way of coping when two hash-codes resolve to the same place. Now, you would need to check the language you love to find out exactly what it does in this instance, but in .NET 2.0 the Dictionary&lt;T&gt; uses a technique called chaining. That is, it creates a secondary data structure which is commonly referred to as a bucket but is really just another list.

Looking at the following diagram, (and again my sincere apologies) you'll see that we have 3 topsides in the same bucket. This happened because the hashcode for the string "Topside" led us to the fourth bucket. Every time we see "Topside" as the name we will end up inserting the Beefy object into its bucket. Every time we want to retrieve a "Topside" we will start at the first element of the bucket and iterate over the chain until we find a match.

[Figure: Dictionary Beef]

In this way, then, we get the trade-off we need. Hopefully our inexact matching fields will be few and our exact ones many, so that we get close to O(1) performance; however, if it all goes horribly wrong and every piece of beef we have is "Topside" then we will tend towards an O(N) lookup time.
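Incidentally, the same trick carries over to other languages. Here's a sketch in C++ (my translation, not from the original C# example), where std::unordered_set takes the hash and the equality predicate as separate template parameters:

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// C++ sketch of the Beefy trick: hash only the exact field (name) so
// near-matches land in the same bucket; keep the tolerance test in the
// equality predicate alone.
struct Beefy {
    std::string name;
    double percentage;
};

struct BeefyHash {
    std::size_t operator()(const Beefy& b) const {
        // Inexact fields must stay out of the hash, or two "equal"
        // objects could end up in different buckets.
        return std::hash<std::string>()(b.name);
    }
};

struct BeefyEqual {
    bool operator()(const Beefy& a, const Beefy& b) const {
        return a.name == b.name &&
               std::fabs(a.percentage - b.percentage) < 1.0;  // tolerance
    }
};

typedef std::unordered_set<Beefy, BeefyHash, BeefyEqual> BeefySet;
```

Inserting a second "Topside" whose percentage is within 1.0 of one already in a BeefySet reports it as a duplicate, just like the Dictionary&lt;T&gt; version. (One caveat that applies in any language: a tolerance-based equality isn't transitive, so this is a pragmatic hack rather than a mathematically well-behaved key.)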

This started me thinking: what exactly does this continuum look like? In other words, for a trivial class like the one above, what sort of read-access times would I get searching for each of 1000 objects in a variety of data structures? And what if I varied the number of distinct hashcodes between 1 (i.e. everything hashes to the same bucket) and 1000 (everything hashes to a different bucket) and tried to access the same 1000 objects for each hashcode distribution? I performed the process on a Dictionary&lt;T&gt;, my List&lt;T&gt; and a Hashtable, which in C# has a different collision-resolution mechanism from the dictionary.
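To make the sweep concrete, here's a rough reconstruction of the setup in C++ (the original harness was C#, and CoarseHash is my invention): a hash deliberately collapsed onto k distinct values, so k = 1 chains every item into a single bucket and k = 1000 spreads 1000 items out perfectly.

```cpp
#include <cstddef>
#include <unordered_set>

// A deliberately imperfect hash: only k distinct hash codes, however
// many items there are. Lookups always succeed regardless of k; what
// changes is the bucket-chain length, and hence the lookup time.
struct CoarseHash {
    std::size_t k;  // number of distinct hash codes, 1..N
                    // (always construct with an explicit k; k == 0
                    // would divide by zero)
    std::size_t operator()(int v) const {
        return static_cast<std::size_t>(v) % k;
    }
};

typedef std::unordered_set<int, CoarseHash> CoarseSet;
```

Constructing a CoarseSet as CoarseSet(16, CoarseHash{k}), filling it with 1000 ints and timing 1000 lookups while sweeping k from 1 to 1000 traces out the continuum: O(N)-ish chains at k = 1, down towards O(1) at k = 1000.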

[Figure: Table vs Dictionary (Beef Off!)]

The results were almost as I expected. The Dictionary&lt;T&gt; is faster than the Hashtable because the Hashtable resolves collisions by probing rather than chaining: with an imperfect hash the collisions 'spill' over one another, which means that future values can be displaced too, not just the current one. That gap converges as the quality of the hash improves, though. Secondly, the way I set up my List&lt;T&gt;, the items being searched for should on average be found in the middle (requiring iteration over half of the List&lt;T&gt; to find a single element). You can see that for a very imperfect hash, where there are fewer than 100 distinct hashes in 1000 items, it's better to use a List&lt;T&gt; than either a Hashtable or a Dictionary&lt;T&gt;. At a 90% perfect hash, though, a List&lt;T&gt; is 22x slower than a Hashtable and more than 50x slower than a Dictionary&lt;T&gt;.

All good news for me, because my real hashcode was much more discriminating than that, typically sitting above the 90%-perfect range. So the List&lt;T&gt; has gone and my hash is meaty, sister. Very meaty indeed.


The Code That Was Not There and the Badly Fitting Suit

There seems to be a common, and not unfounded, view that programming is really hard. The response from tool vendors is to produce products that make programming simpler. They do this by introducing a user interface that ‘simplifies’ what would ordinarily be a programming task into something that mere mortals can understand and use without knowing too many of the fundamentals of the task at hand. I think the view that such tools succeed at what they set out to do is broadly untrue.

Up until this week I had intimate experience of only one such tool, and that was BusinessObjects. Now I’m going back a bit here, but in 1998 BusinessObjects made a tool that allowed a designer to take your ugly enterprise database and turn it into ‘Business Objects’. These objects carried a little more ‘information’ about the data you were looking at and made it possible to compose those objects in ways of your choosing but, crucially, they did this with the aid of graphical tools that didn’t make it look like you were really writing SQL queries at all. This, in turn, lets the common Jo-user compose a view of that data for a report or particular query they need without having to learn SQL. In concept the idea has tremendous power, because you can do things that you could never do before without a lot of expensive IT expertise. The reality, for us at least, was a little different, and I think for two reasons.

Unapparent Inefficiencies

A normal SQL programmer would know about the indices and relationship cardinalities on a table; hence they would know which joins would suck. The abstraction provided by BusinessObjects happily hid those inefficiencies of the underlying representation, which made it really easy to kill the database with a single mouse-click and drag. You can’t just blame Jo-user for this either; when you hide the inefficiencies, I would not be surprised if a seasoned programmer sometimes made the same mistakes that Jo did.

Apparent Deficiencies

BusinessObjects, of the time, was a language in itself. Indeed, the BusinessObjects language was in fact an abstraction of another, much more general-purpose, language that we know as SQL. Programming languages that are developed for a specific domain (or Domain-Specific Languages) tend to exhibit their own behaviour: they make some things easy at the expense of making other things less so. The trick, with these tools, is to make sure that all the things you make hard are things people don’t really do very often anyway. The problem for us, at the time, was that we were putting a clean layer of abstraction on top of a not-so-great database design. BusinessObjects wasn’t able to cope very well with those apparent deficiencies, and so we had to go hunting for workarounds to put into place until we could get back and refactor the pimple on our software arse away.

In the end the project limped on for a bit, then I got moved onto another task and lost track of it. Perhaps it lives on, but I doubt it. This week I discovered that Microsoft have a related tool with a different purpose: SQL Server Integration Services (SSIS). Apparently it’s been going for years under its previous guise of DTS, but I’d never seen nor heard of it before. Anyway, I was initially very excited when I started working with it; I really believed that I could take my complicated data-import requirements, transform them into a few diagrams and have all the work done for me. Yeah, right. The reality was somewhat different and, like our friend BusinessObjects, SSIS coped badly with the unapparent inefficiencies and the apparent deficiencies.

The conclusion is that tools like this are complex, and that complexity includes, but is not limited to, much of the complexity of the lower-level underlying representation that they must use. Often, then, it will be far better to just use a general-purpose tool (like SQL or Perl) to get your reports or data transformations done. Don’t mess around with some factory-made, ill-fitting suit if you think you can tailor one yourself for less. No matter how pretty the buttons are.

In the end I surmised that, using gawk and bash, I could produce something equivalent to what I wanted to do in Integration Services in a fraction of the time, and hence cost. If I’d used Perl or Python I could have done it even better and/or faster. I had been hoodwinked, my friends. Hoodwinked into thinking that I could write a program that was not there, only to discover in the end that it was far easier to knuckle down and make my own suit, and have it fit nicely, than to use the pretty off-the-shelf one with the gaping crotch.

The problem, it seems, is that there is a market for tools that let us create code that seemingly isn’t there. We still believe, as I did, that tools will somehow save the day. Dammit, why didn’t someone remind me earlier that there is not, never has been and never will be any silver bullet.


It’s imperative that it’s functional

So last week I mentioned that I’d been practicing my Common Lisp by having a go at some puzzles on Project Euler.

A few days later I got an email asking me about my experiences of the project and how far I’d got with the puzzles. Well, now I can reveal to all of you the heady extent of my genius. My correspondent also casually dropped in the fact that he was trying to do his puzzles in a functional, rather than imperative, programming style. Oh … yeah, “functional programming”; I’d conveniently forgotten about that. I was so busy getting to grips with programming in CL that I forgot about the important bit. I guess 15 years bashing your head against C++, Java and C# is going to have some long-term mental-health effects.

Losing the plot

Now, my CS class at university (15 years ago) very briefly touched on functional languages, so I’m not totally clueless. However, we learnt precious little, and there is really no defence for that university not placing a larger emphasis on functional programming. I would argue, now, that it would have been a lot more useful for them to teach us a functional language than what they did teach us, which was COBOL.

So, given my limited understanding, the point of functional programming is to write programs that behave more like mathematical functions. This has certain consequences, since a mathematical function cannot have side-effects like printing “Hello, World!” or delivering spam. Which is surely going to disappoint some people.

It is this “dirty” little secret of programming (i.e. that it should probably do something if we’re gonna get paid) that makes pure functional-programming a little less straightforward. Languages like Haskell have monads that address the problem of purity by effectively differentiating the pure from the impure. But I don’t think I’m quite ready to go there just yet. Maybe in another 15 years.

Taking the tablets

Ok, so where was I? Well, I’m going to try to make my code more functional too. Which means a diet of higher order functions and recursion for me. So let’s pick a trivial example and see how we do.

Problem #6 in Project Euler asks us:

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

My first solution looked like this:

(defun sum-of-arithmetic-series (n)
  (/ (* n (1+ n)) 2))

(- (expt (sum-of-arithmetic-series 100) 2)
   (loop for i from 1 to 100 sum (expt i 2)))

Which I was proud of because I’d figured out that I could use an equation to solve the first part. This isn’t terribly functional though because the second part uses a loop. So I tried again, this time I thought I could do better. I decided that what I was really doing was summing a series of natural numbers in the range 1..n. I could generalise this into: “apply a function (or the identity) over a series of numbers and sum the results”.

Looking at the problem this way generates an abstraction that did not previously exist in the earlier solution, making it a more general result and increasing the possibility of its reuse (perhaps!):

(defun sum-of-series (n &optional f)
  (let ((n-dash (if (null f) n
                    (funcall f n))))
    (if (= n 1) 1
        (+ n-dash (sum-of-series (1- n) f)))))

CL-USER> (- (expt (sum-of-series 100) 2)
            (sum-of-series 100 (lambda (x) (* x x))))

Ok, I know it's trivial, but it made me feel better. There was only one problem though: when testing this for large n (> 30000) I got stack overflows. That surprised me, because I was expecting it to be tail recursive and hence for a tail-call optimization to happen. (Looking again, the (+ n-dash …) wraps the recursive call, so it isn't actually in tail position at all.) I tried playing around with the debug settings and trying different variations, but no amount of playing could make SBCL do the right thing and optimize the recursion away.

Guess I need to do some more investigation into functional programming and that's probably an imperative! (groan)


Common Lisp and the Towers of Blub

So I’ve been teaching myself Common Lisp. It’s a slow process because I don’t get a lot of time to do it, and also I’m a bit slow. The reason I’ve been teaching myself? Well I’m wholly dissatisfied with C++, Java, C#, and to a much lesser extent Python. I still love C though. She’s my lady. We connect.

The dissatisfaction started after I read Paul Graham’s “Hackers and Painters” book. I wrote about it too. Whilst reading the book I came across this paragraph and identified myself as a card-carrying member of team “Blub”:

As long as our hypothetical Blub programmer is looking down the power continuum, he knows he’s looking down. Languages less powerful than Blub are obviously less powerful, because they’re missing some feature he’s used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn’t realize he’s looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

Since then I’ve been on a journey. It’s taken me through books; well, a book. I’ve discovered that Emacs is like a million times more powerful and sexy than her slightly musty-smelling older sister Vi, with whom I’ve had an almost 15-year relationship and whom I’m trying to get out of my head. :x!

If you don’t believe me about Emacs (and why should you?) consider how you’d write a video-editing suite in whatever code editor you use or can extend. I already knew long ago that Emacs is at least partly written in Emacs Lisp, but I should have guessed that arguably the most productive Common Lisp development environment, or IDE, is also an Emacs extension, called SLIME.

Soon after that I discovered some fantastic screencasts in the Ruby on Rails mould for stupid people (like me): SLIME, Hunchentoot, and Selenium. BTW, I know Selenium is not Common Lisp, but the screencast presents a Common Lisp binding to the Selenium Remote Control server that makes it possible to use Selenium from Common Lisp as a testing framework.

So it feels like I’ve come a long way. But today your humble programmer-of-Blub discovered something in the HyperSpec that scared me a little. Through the PlanetLisp RSS feed I also discovered that I am really crap at mathematics, by trying some of these puzzles. In attempting to solve puzzle #2 in Lisp I found I needed to brush up on some Lisp basics. After some digging, and through a series of hyperlinks, I ended up here and before that here:

;; This illustrates the use of SETQ on a symbol macro.
 (let ((x (list 10 20 30)))
   (symbol-macrolet ((y (car x)) (z (cadr x)))
     (setq y (1+ z) z (1+ y))
     (list x y z)))
=>  ((21 22 30) 21 22)

I think I stared at this code snippet for about 2 minutes before it finally made sense, when I at last remembered that under a symbol macro setq behaves like setf, and setf works on a place, not a symbol. Now, I’m guessing an experienced Lisper would look at the sample and say ‘so what?’. I’m guessing that non-Lispers would say that anything that looks that complicated must be bad.

And this is when it struck me again. And this time it hurt. I may have lost a little “Blub” to get this far, but I still have way more “Blub” to lose.


Why not just use Haskell?

I am a fully paid up member of the ACCU. I joined because it seemed most of my now ex-colleagues were members and I’m a sucker for peer pressure.

Anyway, as you’d expect from a magazine dedicated to C and C++ users, it occasionally has articles about C and C++. This quarter’s doozy is from a chap we’ll call Stuart. Now, it looks like some of the things that Stuart is doing for his PhD are quite interesting. However, his article about the similarities he has managed to draw between Haskell and partial template specialization in C++ gives me the shudders. It is based, in part, on the work of Andrei Alexandrescu. As charming as Mr Alexandrescu is in person, I wouldn’t let him anywhere near any of my production C++ code (if I had any). But that’s beside the point.

Now, I have to admit that I don’t know any Haskell, and I’m not much of a C++ lover either, so I can’t bring myself to actually RTFA. However, the article is full of little code snippets like this:

    template <typename T> struct Head;
    template <int x, typename xs>
    struct Head<IntList<x,xs> > {
        enum { value = x };
    };

    template <typename T> struct Tail;
    template <int x, typename xs>
    struct Tail<IntList<x,xs> > {
        typedef xs result;
    };
Why anyone would hate themselves enough to actually do this and write about it I don’t know. Especially when you could just use Haskell and not have to wrap your head around this lunacy.
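For anyone who wants to watch the lunacy actually compile, here's a self-contained version. IntList and Nil are my guesses at the definitions the article omits (a compile-time cons cell and an empty-list terminator):

```cpp
struct Nil;                                    // empty-list terminator
template <int x, typename xs> struct IntList;  // compile-time cons cell

template <typename T> struct Head;
template <int x, typename xs>
struct Head<IntList<x, xs> > {
    enum { value = x };  // the first element, computed by the compiler
};

template <typename T> struct Tail;
template <int x, typename xs>
struct Tail<IntList<x, xs> > {
    typedef xs result;   // the rest of the list
};

// Head<OneTwo>::value is 1, and Head<Tail<OneTwo>::result>::value is 2:
// Haskell's head and tail, done entirely at compile time.
typedef IntList<1, IntList<2, Nil> > OneTwo;
```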

But I know there is a simple answer. I should just unsubscribe from the magazine if I don’t like it. But sadly for me it’s not that simple. Stuart is obviously a clever guy, it almost goes without saying. But he’s wasting his brain cycles. Surely he must have better things to do with his time.

Then I read this 2005 interview with the glorious Alan Kay. If you have the time I suggest you read all of it because he makes some really very excellent points. In reference to my pain though this one stands out:

Perhaps it was commercialization in the 1980s that killed off the next expected new thing. Our plan and our hope was that the next generation of kids would come along and do something better than Smalltalk around 1984 or so. We all thought that the next level of programming language would be much more strategic and even policy-oriented and would have much more knowledge about what it was trying to do. But a variety of different things conspired together, and that next generation actually didn’t show up. One could actually argue—as I sometimes do—that the success of commercial personal computing and operating systems has actually led to a considerable retrogression in many, many respects.

You could think of it as putting a low-pass filter on some of the good ideas from the ’60s and ’70s, as computing spread out much, much faster than educating unsophisticated people can happen.

And while Stuart carries on trying to make C++ like Haskell, rather than just using Haskell, another wasted brain drops into the soup.