Categories
database unit testing

Database Unit Testing: It’s Tricky

I’ve been aware for a while that I really should be doing some database unit testing alongside the unit testing I already do for the rest of my code. With databases there’s a bunch of stuff I’d like to test apart from the logic inside any executable code:

  • Relationships – the database model should support certain types of queries; make sure that the more common ones work as needed
  • Query Execution Time – to verify that indices are being used by SELECTs, to monitor the cost of insertions, and as a general (and fairly gross) measure of performance
  • Arbitrary Data Values – some data in the database is ‘special’. It’s always like that; it’s data that you don’t get from another source. It’s static data that makes your abstractions work. When it’s there everything is OK; when it’s gone you’re screwed
  • Constraints & Triggers – constraints and triggers occasionally get dropped when maintenance occurs; when they don’t get put back, things go south
  • Permissions – certain types of activity in a database should be prohibited; ensure that these protections are in place for each class of users

There’s probably a lot more I could do too, but this will do to begin with. In the past I’ve spent significant investigative time hunting down problems that originated from some initial assumption being violated. Since I don’t like feeling violated, at least not on a Thursday, it seems like I should unit test what I can to get early warnings before the shit-storm hits.

So I did what any lazy, self-respecting developer would do: I went looking for some code that someone else had written that I could steal. TSQLUnit looked like the best match for my needs so I downloaded it and checked it out. Now, before I upset anyone I should say that TSQLUnit is OK. But it suffers from a few fairly major drawbacks. There’s nothing wrong with TSQLUnit per se; it’s just that all database unit testing frameworks written in SQL are broken. Here are a few lowlights:

  1. Results of expressions cannot be passed as inputs to stored procedures, making traditional unit-testing Assert statements awkward
  2. Catching all exceptions and automatically reporting them as failures (à la JUnit) is messy, requiring a BEGIN TRY/END TRY around every test block

It’s the first one that makes life hard because all your assertions have to read something like:

IF @MyVar <> @TheirVar
BEGIN
     EXEC ts_failure 'My variable doesn''t equal your variable'
END

When what you really want to write is something like:

EXEC ts_assert @MyVar <> @TheirVar, 'My variable doesn''t equal your variable'

I don’t know, perhaps I’m just not smart enough, but I could not see any way to make something like the above work without ending up with a syntax so verbose and ugly that even Eurocrats would balk a little. So bad that you might as well have just used a bunch of IF statements in the first place. Also, a direct consequence of not being able to have an ‘assert’ stored procedure is that you can’t easily count the number of assertions you make to report on them later. Whilst this is just icing on the unit-test cake, it’s still a nice feature to have and adds to the warm fuzz after a test run.

If that was hard, then testing permissions-related activities is next to impossible. This is because your unit-testing framework is running as a particular user in a particular session. To be able to test particular permissions you might need to disconnect and reconnect as a different user. Well, it’s not impossible, it’s just … a bit tricky.

The obvious thing to do, then, is to completely give up on SQL unit testing frameworks and go back to your programming language of choice. As far as possible you want to hide all the internals of dealing with the database and leave just the pieces you need to set up, run and tear down your tests. To do this I made a helper class to do all the heavy lifting: connecting to the database, running any query, timing the execution, processing the results and storing them somewhere. Finally, I made my class provide a function-based query interface so that I could write test code using NUnit-style assertions against it. Creating this custom class took only a few hours. Once I’d created it I could hand all the testing framework duties to my old friend NUnit. This technique worked well for me and integrated nicely with my existing code tests.
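To give a flavour of the shape of the thing, here’s a minimal sketch of that approach. Everything in it is illustrative rather than my original code: DbTestHelper, the Config table, the thresholds and the connection strings are all made up for the example.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using NUnit.Framework;

// Hypothetical helper: connects, runs a query, times it and hands back the results.
public class DbTestHelper
{
    private readonly string connectionString;
    private TimeSpan lastQueryTime;

    public DbTestHelper(string connectionString)
    {
        this.connectionString = connectionString;
    }

    public TimeSpan LastQueryTime
    {
        get { return lastQueryTime; }
    }

    public DataTable Query(string sql)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlDataAdapter adapter = new SqlDataAdapter(sql, connection))
        {
            DataTable results = new DataTable();
            Stopwatch timer = Stopwatch.StartNew();
            adapter.Fill(results);
            lastQueryTime = timer.Elapsed;
            return results;
        }
    }
}

[TestFixture]
public class DatabaseTests
{
    [Test]
    public void SpecialRowIsPresentAndQueryIsQuick()
    {
        DbTestHelper db = new DbTestHelper("...connection string...");
        DataTable rows = db.Query("SELECT * FROM Config WHERE KeyName = 'MagicValue'");
        Assert.AreEqual(1, rows.Rows.Count);                   // arbitrary data value check
        Assert.Less(db.LastQueryTime.TotalMilliseconds, 50.0); // crude execution-time check
    }

    [Test]
    [ExpectedException(typeof(SqlException))]
    public void RestrictedUserCannotDelete()
    {
        // Permissions check: reconnect as a locked-down user and expect the DELETE to blow up.
        DbTestHelper db = new DbTestHelper("...restricted user connection string...");
        db.Query("DELETE FROM Config");
    }
}

With the database plumbing hidden away, the permissions test becomes just another NUnit test: the reconnect-as-someone-else problem that was so awkward inside T-SQL reduces to using a different connection string.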

Categories
programming

The Cracked Mirror

I remember reading somewhere (perhaps here but I can’t find the reference now) that the business of programming is the act of producing a simplified mirror-model of the business processes that it is trying to encapsulate.

This seems an intuitive statement. If I come to your company and make software to help your business, it must capture at least some of the essence of what your business does. Indeed, the programs that my colleagues and I write should in some way mirror the businesses that they belong to. If they don’t, then we have to consider that we have probably failed.

In 1997 I entered the finance industry knowing almost nothing about finance. Pure green. Eager to learn, I thought that if I looked at the code I would be able to understand some of what was going on in the business. It turned out that this was as true as it was false. Yes, the code mirrored the business, but that particular mirror was cracked. What I thought I understood about the business was distorted by irrelevant detail.

It’s obvious when I think about it now, but the code that I was looking at had not been placed in front of me by an alien life-force (although some of the dudes were pretty strange); it had evolved. Code had been added to support business ventures that had subsequently been wound up, or, even worse, code had been added that was just plain wrong. In both these circumstances the users of the system compensated for the semantic gap between business and system by doing what humans do best: working around the problem.

It seems then that this cracked mirror is inevitable because software decays. To really know what’s going on in your organisation you have to bang on doors. You have to ask the users the questions that make you look like Mr. Stupid. Only then can you build the model in your head of what is really going on.

What’s that you say? You’ve got business analysts? FIRE THEM! They don’t work, people. I mean, yes, they do work, but unless they are top-notch they create more problems than they solve.

Perhaps I’m preaching to the choir. Perhaps the choir went home. Perhaps I’ve been abducted by aliens and I’m still living in my 1997. Perhaps not.

Categories
education

Why I Make A Better Software Developer Than Parent

A while back a friend and colleague revealed to me that he took an interest in programming from the very young age of six. Today he is all grown up, just a little younger than me, and sadly (for me) a far better developer, in many respects, than I’ll probably ever be. If I’m being honest, though, I don’t actually think it would have helped me very much to have started at six, seven years earlier than I actually did. The truth is I think I’m doomed to a life of mediocrity (in this arena at least) simply because my head isn’t as well wired for the task as his. Still, I wonder whether programmer parents exposing young children to programming is really a form of vanity-torture.

For me it’s the paradox of being a parent. I want to guide my kids into being socially acceptable people, but I don’t want to tell them how to live their lives. My biggest hope is that they are happy people, and I want to help them achieve that. Nothing else really matters. It will matter, of course, if they wind up in jail because they stole my car whilst high on the latest techno-drug. Then I’ll be doing a lot of guiding, telling and most probably shouting. I’m hoping it won’t come to that, but I’ll let you know how I get on. The heads on this tails-up coin is that I want to give my kids enough of everything so that they can make their own choices.

So it was with some trepidation that I wondered if I should teach my children a programming language, as suggested last week by another esteemed ex-colleague. He offhandedly recommended (I haven’t asked him if he actually tried it) a programming language developed by Mitchel Resnick’s Media Lab team at MIT called “Scratch”. Now, I don’t know if you know (I didn’t until two weeks ago), but the Media Lab and Media Lab people seem to be doing good things everywhere. You know that “One Laptop Per Child” thingy everyone’s talking about? Well, the founder of that association is one Nicholas Negroponte, who is pursuing the OLPC project whilst on leave from the MIT Media Lab. Oh, and did I mention that Nicholas also co-founded and directed the Media Lab? Nice.

I was sold, then. I thought I’d give Scratch a try myself and possibly expose my progeny to it. Bearing in mind that my daughter is a few years below the target age of 8+, it was possible that this could very well not work out at all. But she is a fairly competent mouse and keyboard user and is learning to read and write, so I thought there was a chance she might like it if I did the driving.

The first thing I did was to tell her that Scratch was a game, and yes, it seems I lie to my kids too. Next I told her that there was this little cat called Scratch and we could make him do what we wanted. Immediately, and with that childish flair for the ridiculous, she ordered him to go and eat an apple. Dutifully, I hunted about in the clip art for an apple but could find only a banana, so we settled on that. I showed her what I was doing: how, when I took the building pieces from one side, I could build up a list of things for him to do. She made a noise like she understood the concept, at least, so I was pleased.

After a while we got tired of watching the little cat eat the banana, so I said there were a few ready-made games to play. We tried them all out, and the one she really liked was a game where a big fish chased little fish. Sometimes I’m really glad I’m not a psychologist. Anyway, when we were done I said that if she would like me to, I could make her a game of her own. Again, without even a pause for breath, she told me she would like a game where a dragon chased a princess.

I like challenges so I spent 15 minutes after her bath time trying to figure out how to do that and built up this charming fantasy below using all stock clip-art and photos from the Scratch install.

Help!  Save me!

I then built up a little “Scratch” program where the dragon would follow the princess about the screen if you moved her. Very pleased with myself (who’s the child now?), I declared to my daughter that I was done. I showed it to her and she liked it, but then something unexpected happened. She said, and I quote: “That’s nice daddy but it’s not really what I wanted. I want the princess to chase the dragon”. I explained that I’d made it behave the way she’d asked, but that I could change it if she wanted, and so I did. Five more minutes passed while I wired up the two sprites the opposite way around, and then, a little less pleased with myself, I announced that I had finished. We had fun letting the princess chase the dragon, and then she said: “I want to be the dragon”.

I wasn’t so much exasperated by this as surprised. I scratched my head, pointed silently at the screen, wagged my finger at it, made a shape with my mouth like I was going to say something, and then didn’t. I turned to her and said: “But if you control the dragon, where will the princess go?”. She shrugged and said, “You can use the keyboard to move her”. Now then, let’s stop and review. The important point here is that my rudimentary “follow” AI for the dragon sort of just happened. I created it because that’s what I thought she wanted. However, the more I think about it the more I know that I got it wrong. What she wanted was obvious all along. She wanted a version of the big-fish-chases-little-fish game where the chasing character was a dragon (not a big fish) and the chased character was a princess (not a little fish).

In about 15 minutes my daughter had shown me exactly why requirements gathering is crucial. I’d let my guard down and taken one on the chin from a five-year-old. Now, this was play-time, so there weren’t any penalty clauses in our contract, but if it hadn’t been play-time she would have sued my sorry white ass. I joke, of course. Her lawyer won’t touch software contracts, not after the last time. But you get the picture.

Overall, though, I really enjoyed using Scratch and I enjoyed showing it to my daughter. I especially liked the way the code ‘blocks’ are shaped so that each building piece has a limited number of places where it looks like it will fit. I also like the way the whole thing feels ‘chunky’, like big fat Duplo bricks.

Dragon AI

Finally, I also like the way everything is colour-coded. All these things are real genius, since the visual reinforcement teaches the difference between the different elements of the program. More importantly, there are plenty of visual cues about what you could do next if you aren’t really sure. The only thing I didn’t like was that I had to drop my screen resolution so that all the fonts were nice and big. It would have been nice to be able to adjust them to suit my needs. But hey, it’s not a big deal.

So, there you have it. I tried it and nothing bad happened. I might have created a revenue stream for tomorrow’s psychoanalyst, or I might have created tomorrow’s psychoanalyst, or maybe even a programmer. Who knows or cares? It was just fun to do, that’s all.

Categories
linux programming

Caring For Your Environment

I’ve been thinking about where all my spare time goes recently, because I just don’t have the time to do the things I want to do, development-wise. In my life I have a couple of obvious (non-family-related) time sinks:

  • XBOX 360: BioShock, Assassin’s Creed, Gears of War, …
  • Physical Exercise

Now clearly physical exercise is unnecessary for a desk jockey like me, right? But there’s evidence to suggest that physical exercise might make me smarter. I need all the smarts I can get, so I guess I’ll be continuing that for now.

What about gaming? Well, I’ve been gaming since Jet Set Willy, so I don’t think this is really a time luxury any more. It’s now a defining character trait.

That didn’t give me very much wiggle room, so I started to look more closely at what I actually do when I’m the big fat “H” in HCI. I discovered that I was spending some time on Project Euler, which is definitely not time wasted but is perhaps a little frivolous, so I stopped. But after that still no development work was getting done. Then I found I was spending a fair amount of time tending my development environment garden.

Gardening Again

Recent projects have included:

  • Switching my development machine from Gentoo to Ubuntu, ooo-ooo-ooo
  • Setting up SVN over SSH
  • Getting Emacs to provide TrueType font support
  • Upgrading Ubuntu to Gutsy for Compiz-Fusion support
  • Trying to get Gutsy to work with my f*cking ATI Radeon 9600 so I can actually use Compiz-Fusion
  • Trialling Lisp-based tiling X window managers

And so it goes on. I can always think of something to do, and it’s very much like gardening, I think. I admit that I haven’t really done much real gardening, but when I did have a garden I found I could spend hours in it: mowing, pruning, removing stones, painting fences … you get the idea. The only difference is that with real gardening, the garden and its aesthetics are the objective. The objectives of my development environment “gardening” are less clear. I’m clearly not getting much productivity benefit from trying to get Compiz-Fusion to work; it just makes me feel very powerful to be able to make an ATI graphics card work as expected with Linux.

What’s in your garden? Are the weeds ten feet high but you just can’t see them or could you submit it for a competition and win? Is this sort of OCD the preserve of the programmer or have I really lost it this time?

Categories
programming

That Funny Nose Thing Again

In one of my previous jobs we had an unwritten rule that if you wanted to introduce a new programming language into the organisation you had to have a pretty good reason for it. When I joined the company around 1999 it was using C/C++, Tcl and a little Java. By the time I left they were using a lot of Java, a lot of Python (thanks, at least in part, to me), a bunch more C++ and a steadily growing amount of C#. I wasn’t exactly responsible for the addition of the other languages, but I think I contributed code to all of them.

Back then I decided that a small company should not adopt and keep that many technologies simultaneously within a single team without retiring some of the older code. From a company’s point of view it is in their interest to stop this in-house proliferation of tools and programming languages, because it makes the code base both harder to support and harder to integrate. But it seems, to me at least, that tools and programming languages are one part of the programmer condition. I just can’t get enough of them. They look shiny and new and full of promise. They are, quite simply, bewitching to me.

No one nose what I feel for you

Unlike Elizabeth Montgomery, programming tools have little sex appeal and don’t do that funny nose thing. You know, the nose thing that makes everything better when it all turns to shit at the end of the show.

It is therefore in my company’s interest to pick languages and tools that are general purpose, because it reduces the possibility of tool proliferation later. But I know the drill. Hey, I practically wrote the drill. Find something I want to use, find a reason why I want to use it or why what we have now is deficient. Then bitch and moan until I get my way.

Sometimes, though, the benefits of a switch to a different tool or programming language can be compelling. Steve Yegge claimed last month that his MUD Wyvern has so many lines of code in Java that it is simply unsupportable by one tool or person, and so he’s going to use Mozilla’s Rhino to reduce the line count. Yeah, that does sound like a good plan, but I think I’ll check back with Steve in 2010 to see how he’s getting on.

As I already mentioned in the “Towers of blub”, I have been on a personal quest for about a year and a half now to find a more powerful programming language. At the moment I am learning Common Lisp. So it was that this week my Tabitha nose picked up the strong scent of a new programming language gaining ground. The new player is Scala. I read a couple of blog posts about it, had a look at a tutorial and a reference manual or two, and was, as you Americans say, pretty stoked. I was thinking about when I was going to download it to see what it could do for me.

But then I was hit by a 10-foot wall of apathy. Whilst it’s interesting to see as much of the language and tools landscape as is humanly possible, I’m starting to wonder if it’s a very worthwhile use of my time. Perhaps I should stop evaluating all these different tools and languages and actually write some code. In fact, if I were to list all the technologies I’ve learned and subsequently forgotten over the years, instead of coding, it would probably make quite a long list.

So I think I’ll do what Dan Weinreb’s going to do and just keep an eye on Scala to see what happens next. Since he is way smarter than me, I reckon this is a pretty safe bet. By the way, I’ve tried it before, people, and this technique really does work. Pick someone whose opinion you respect (this is obviously never going to be a politician, a teacher or a member of law enforcement) and simply base your opinion on theirs. You don’t really need a great deal of rhetoric to back your arguments up; just remember whose opinion you copied and come back to it later. So I’ll keep an eye on Scala, remember that I might need it one day, and save the rest of my time for some more serious keyboard intercourse and beer.

Then I had the other, slightly larger, epiphany. More important, at least in hindsight, are not the tools and languages I use but the things that I do with them. I’ll be more specific. The important things about what I do can be broken into the technical and the non-technical. The technical things, like distributed computing, defensive coding, testing, multi-threading, relational databases and networking, are knowledge and experience that I draw on all the time, and they are language- and tool-independent. Those are the things I really need to know in order to program. But the things that make me really effective (or would, if I were any good at them!) are the non-technical things: communication, interviewing and planning skills. I need to spend time developing all these other skills rather than finding the next new programming language or tool.

Bewitched first aired in the US on 17 September 1964, a time when COBOL was pretty shiny and new. In 1991 I had to learn COBOL for my undergraduate degree. I have not used COBOL since our last programming assignment and I remember almost nothing about it. But I did learn something valuable from that assignment, because it was the first time I had ever tried to produce a piece of software in a team.

So it seems that I really shouldn’t care very much about what I have to create my solutions ‘in’, as long as I don’t have to use too many. I think there are ways round most programming language deficiencies. Unless you use COBOL, of course.

Categories
lisp

Let the Foreigners In!

In Learning Ruby by Extension I claimed that learning a programming language by writing extensions for it was a worthwhile exercise for really getting to grips with that language, because to write a natural-feeling extension you are forced to learn a fair bit about the internals of the language. I was wrong. Well, sort of. Let me explain.

Last week I bemoaned the poor network programming support in my Common Lisp implementation, so this week I resolved to get my beloved multicast sockets working by breaking the abstraction a little and implementing the missing socket option call as a foreign function call. I was looking forward to it because I had anticipated that it would teach me something about Lisp. It didn’t. Well, at least not in the way I was expecting. Absolutely nothing, and there’s a reason for this.

The foreign function interface for Lisp sits at a higher level of abstraction than what I was doing with Ruby or Python. With Python, especially, I was actually creating Python objects and classes within C++ modules that Python would load as if they were native modules. This meant I had to delve into topics like Python’s garbage collection, threading and more. With Lisp’s CFFI (and all of the other FFIs I’ve seen so far) things don’t work quite like that, because they don’t need to. The language allows much of the mechanics to be completely abstracted away and become almost transparent to the user. Let’s see this in action by looking at what I did:


(require 'cffi)
(use-package :cffi)

(defcstruct ip-mreq
  (multiaddr :uint32)
  (interface :uint32))

(defcfun "setsockopt" :int
  (socket :int)
  (level :int)
  (option :int)
  (option-value :pointer)
  (option-length :int))

(defcfun ("inet_addr" inet-addr) :int
  (cp :string))

(defun socket-add-mcast-membership (socket-fd mcast-addr iface-addr)
  (with-foreign-object (mreq 'ip-mreq)
    (setf (foreign-slot-value mreq 'ip-mreq 'multiaddr)
          (inet-addr mcast-addr)
          (foreign-slot-value mreq 'ip-mreq 'interface)
          (inet-addr iface-addr))
    ;; level = 0 = IPPROTO_IP, option = 12 = IP_ADD_MEMBERSHIP
    ;; (note: IP_ADD_MEMBERSHIP is 12 on BSD/Windows but 35 on Linux)
    (setsockopt socket-fd 0 12 mreq
                (foreign-type-size 'ip-mreq))))

That's it: no external C modules or makefiles, it's all done inside Lisp. I'd make this a bit cleaner before incorporating it into my application, because I've intentionally cut some corners to show the relative simplicity. However, it goes to show how easy it is to do. For example, if you look at the inet_addr C function definition, it only requires a two-line symbolic expression to have a callable version inside my program. Indeed, this solution is at about the same level of complexity as defining an entry point in .NET, which couldn't really be any simpler. And, like .NET, there are no C 'glue' libraries to write, because CFFI is doing all the heavy lifting. All arguments and return values are marshalled into and out of dynamically allocated intermediate values, which is pretty neat because it means all I have to do is concentrate on the abstraction. Even better, CFFI allows the partial definition of structures, which is great when you have a large and complicated struct and only care about some of its values. But that's another story.
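For completeness, a call then looks like the line below. This is an illustrative call rather than anything from my application: sock-fd stands for the file descriptor of an already-created UDP socket, and the addresses are example values.

;; Join the example multicast group 239.1.1.1 on the default interface.
;; SOCK-FD is assumed to be the descriptor of an existing UDP socket.
(socket-add-mcast-membership sock-fd "239.1.1.1" "0.0.0.0")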

Now, to be fair to Ruby, Ruby actually has two ways to write non-local extensions. The first is to use the supplied C programming interface, similar (but simpler) to how Python does it. The second goes a step further: there is a gem that allows you to embed inline C directly in the Ruby code. Although I don't think I could have used it for the tibcorv project (because I needed to support callbacks), that's also pretty impressive. Having said that, Ruby can only achieve it by having access to a C compiler.

At the risk of stating the obvious, a simple method for the definition and invocation of foreign functions is, I think, a requirement for any language to become and remain successful. The easier it is to let the foreigners in, the more possibilities you can exploit, because there's a C library somewhere for just about any task you can imagine. So let me qualify my original statement that writing extensions can teach us a lot about a language. I now think that writing extensions will teach you a lot about a language only if the level of abstraction you have to work at to get the job done forces you to learn. And as far as I'm concerned, the more abstract the better.

Categories
lisp

Talking to Donkeys

Too much choice is challenging, to say the least. Well, it is to me anyway. I get hung up on trying to pick the best choice from a huge list of items and end up doing nothing at all. All too often just making a choice, any choice, would be the best idea. However, when you’re writing software and trying to pick from an array of available libraries, the penalties of a wrong choice are amplified later. The penalty comes from having to bet early on your chosen horse, but here’s the kicker: you could well have a lot of capital in the race before you realise you backed a donkey.

Donkey

For instance, if you choose the wrong core library for the “Next Big Thing” now, you run the risk of tying yourself to an implementation with a flaw in it that you can’t fix, because it’s not open source or because you don’t have the resources to undo it, so you have to adopt a workaround. Thankfully, most of the time things don’t turn out so bad. But if I stop talking to the donkeys for a minute and listen very carefully, I can hear millions of developers happily using Microsoft’s tools and libraries without a care in the world. They don’t have to think about this sort of problem very often, because Microsoft thought about it for them and gave them the library as part of the SDK. Those developers are real happy with what they’ve got, and who can blame them? It’s like having an older brother who looks out for you. So what if he steals your lunch money; it’s part of his charm.

I don’t want to bash Microsoft here, by the way. I think they have a massively important role to play, and I have to admit that some of their more recent tools have been pretty awesome achievements for a gigadollar company. It seems obvious, to me at least, that the higher Microsoft raises the bar on tools, the better it is for pretty much everyone. But the (real?) world outside of Redmond is a little more confusing. I know from (old) experience that when I want to use an SMTP client in Java I’ve got to go through a process of evaluation. In .NET the client that comes free in System.Net.Mail will do just fine; nothing to add. Sure, .NET will sometimes be found lacking, but I’ve been pretty happy with the standard libraries so far, and it’s only when I want something a little unusual that I need to go elsewhere (except for the annoying lack of a built-in Set class, but that’s a different story).

Well, all I wanted to do this week, on my continuing Lisp learning exercise, was send a multicast packet from my Common Lisp project. Guess what: portable network programming in Common Lisp is simply not possible. Gasp. The Common Lisp standard was set in 1994 and so predated the massive expansion of the internet. So CL projects like trivial-sockets and usocket have been created as vendor-neutral APIs. Unfortunately, trivial-sockets seems to be abandonware, so I guess usocket is the way to go. But then, broadcast networking is still a work in progress for usocket, so it seems fair to say that Common Lisp multicast programming, in a platform-independent way at least, is some time off.

Even that’s OK. I don’t mind. I’m not producing any world-changing libraries here. Hell, I might even pitch in on the usocket effort to help out. But right now I’m only messing about trying out some ideas to see what might fly; it’s what Lisp is good at, after all. So I’m happy to use whatever multicast support SBCL has. Crap. They don’t seem to have implemented all the socket options needed. Then I read a very illuminating 2005 paper that lays the CL network programming problem out beautifully, but when you scroll down to the “Proposed Solution” it says:

… To be written

If you stop talking to your own donkeys for a second and listen you might be able to hear me laughing. Very faintly.

OK, I know what I’ll do. I’ll take a look at what I need to do to make a foreign function interface work. Yeah, that’s good, I’ll do that. Crap. I’ve got to choose between UFFI and CFFI. So I already know that I’m going to have to take a look at both to see what’s going on. A short while later I’m starting to believe that CFFI is the cleaner of the two, but then I stumble upon Verrazano and SWIG and my head starts to spin.

At this point it’s hard to be cross or frustrated about anything. This isn’t really anything I can blame CL for; it’s just how it is. If time travel had been available when the 1994 ANSI Common Lisp standard was accepted, I’m sure they would have added a sockets layer. I can’t blame open source, because I can’t stop applauding it. I bow down to those people who have sacrificed their own personal time to make my life easier and better. I know that the fact that I have to do a little bit of work in the decision-making process should make me happy, because it means I have a choice to begin with. And anyway, it’s better than having to do the whole thing from scratch myself.

But here’s the thing. When I stop shouting and beating my donkeys for being a bit lame and look up there’s millions of Microsoft developers cheerfully waving at the crazy guy who talks (and occasionally beats) his donkeys. I know if I was one of them, I’d think I was crazy too.

I think we donkey beaters can turn this around. I still believe that in general what I can hold in my hand for free is far more powerful than anything I can buy. The problem is essentially one of continuing improvement and better communication. All I have to do is stop cursing the donkeys and do something about it.

Categories
.NET

The Imperfect Hash (but what percentage is beef?)

What would you do? You’ve got a C# List<T> full of thousands of objects of the same class. You think there are a few duplicates hiding in there somewhere and you want to root them out. You think that you would really like to put your objects into a dictionary, because if you did you could get that shiny O(1)-ness that the cool sleek lines of a dictionary would give you.

The only thing is, your keys aren’t exact. They’re doubles or DateTimes or some such, and you can’t use them directly because you really need to apply a tolerance to those values. What you need is an imperfect hash function.

The solution to this problem is obvious to me now, but it wasn’t so obvious when I was trying to speed up my painfully O(N^2) application. To put your elements into a dictionary and get faster lookup, you must include all the fields used in the comparison in the Equals() method but exclude the inexact ones from the GetHashCode() method.

Before I start with this analysis I’d like to apologise to all vegetarians on the planet for eating your friends. Secondly, and more importantly, I’d like to apologise to all decent-folk for producing an example so contrived that it will make you want to vomit. Please pay particular attention to avoid your shoes.

Let’s begin:

public class Beefy
{
    public string name;
    public double percentage;

    public override bool Equals(object o)
    {
        Beefy b = o as Beefy;
        if (b == null) return false;
        // Equal when the names match and the percentages are within tolerance.
        return (name.Equals(b.name) && Math.Abs(percentage - b.percentage) < 1.0);
    }

    public override int GetHashCode()
    {
        // Hash only on the exact field; the inexact percentage plays no part.
        return name.GetHashCode();
    }
}

This works by relying only on the string "name" to derive the hashcode. What this means is kind of subtle and relies on what the Dictionary<T> does to resolve collisions. A collision occurs when two items hash to the same location. Since we know that for Beefy we're not using a unique identifier to place the items into the Dictionary<T>, the dictionary must have a way of coping when two hash codes resolve to the same place. You would need to check the language you love to find out exactly what it does in this instance, but the .NET 2.0 Dictionary<T> uses a technique called chaining: it creates a secondary data structure, commonly referred to as a bucket but really just another list.

Looking at the following diagram (and again, my sincere apologies), you'll see that we have three topsides in the same bucket. This happened because the hashcode for the string "Topside" led us to the fourth bucket. Every time we see "Topside" as the name we will end up inserting the Beefy object into that bucket, and every time we want to retrieve a "Topside" we will start at the first element of the bucket and iterate over the chain until we find a match.

Dictionary Beef

In this way we get the trade-off we need. Hopefully our inexact matching fields will be few and our exact ones many, so that we get close to O(1) performance; if it all goes horribly wrong and every piece of beef we have is "Topside", then we tend towards an O(N) lookup time. Duplicate hunting then ends up looking like the sketch after this paragraph.
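This is illustrative rather than my original code; it assumes the Beefy class above, and since .NET 2.0 has no Set class, a Dictionary with throwaway values stands in for one.

using System.Collections.Generic;

public static class Dedup
{
    // Weed duplicates out of a list in roughly O(N), relying on the Beefy
    // class above: lookups hash on the exact name, then walk the bucket's
    // chain calling Equals(), which applies the percentage tolerance.
    public static List<Beefy> RemoveDuplicates(List<Beefy> cuts)
    {
        List<Beefy> unique = new List<Beefy>();
        Dictionary<Beefy, bool> seen = new Dictionary<Beefy, bool>();
        foreach (Beefy b in cuts)
        {
            if (!seen.ContainsKey(b))
            {
                seen.Add(b, true);
                unique.Add(b);
            }
        }
        return unique;
    }
}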

This started me thinking: what exactly does this continuum look like? In other words, for a trivial class like the one above, what sort of read access times would I get searching for each of 1000 objects in a variety of data structures? And what if I vary the number of distinct hashcodes between 1 (i.e. everything hashes to the same bucket) and 1000 (where everything hashes to a different bucket) and try to access the same 1000 objects for each hashcode distribution? I performed the process on a Dictionary<T>, my List<T> and a Hashtable, which in C# has a different collision resolution mechanism from the dictionary.
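A sketch of the sort of harness involved is below. None of it is the original benchmark code: HashKnob and the counts are invented for illustration, and the same lookup loop would be repeated against a Hashtable and a List<T> for the full comparison.

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical key class: Equals() is always exact, but the number of
// distinct hash codes can be dialled from 1 (everything collides) up
// to 1000 (a perfect hash for 1000 objects).
public class HashKnob
{
    private readonly int id;
    private readonly int distinctHashes;

    public HashKnob(int id, int distinctHashes)
    {
        this.id = id;
        this.distinctHashes = distinctHashes;
    }

    public override bool Equals(object o)
    {
        HashKnob h = o as HashKnob;
        return h != null && h.id == id;
    }

    public override int GetHashCode()
    {
        return id % distinctHashes;   // the hash quality knob
    }
}

public static class BeefOff
{
    public static void Main()
    {
        foreach (int distinct in new int[] { 1, 10, 100, 1000 })
        {
            // Fill the dictionary with 1000 objects sharing 'distinct' hashes.
            Dictionary<HashKnob, int> dict = new Dictionary<HashKnob, int>();
            for (int i = 0; i < 1000; i++)
                dict.Add(new HashKnob(i, distinct), i);

            // Time a lookup of every one of the 1000 objects.
            Stopwatch timer = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++)
            {
                int found = dict[new HashKnob(i, distinct)];
            }
            Console.WriteLine("{0} distinct hashes: {1} ticks",
                              distinct, timer.ElapsedTicks);
        }
    }
}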

Table vs Dictionary (Beef Off!)

The results were almost as I expected. The Dictionary<T> is faster than the Hashtable because when a Hashtable has an imperfect hash the collisions 'spill' over one another, which means that other future values can potentially be misplaced, not just the current one; that performance gap closes as the quality of the hash improves. Secondly, the way I set up my List<T>, the items being searched for should on average be found in the middle (requiring iteration over half of the List<T> to find a single element). You can see that for a very imperfect hash, where there are fewer than 100 distinct hashes in 1000 items, it's better to use a List<T> than either a Hashtable or a Dictionary<T>. At a 90% perfect hash, a List<T> is 22x slower than a Hashtable and more than 50x slower than a Dictionary<T>.

All good news for me, because my real hashcode was much more discriminating than that, typically sitting above the 90% perfect range. So the List<T> has gone and my hash is meaty, sister. Very meaty indeed.

Categories
database

The Code That Was Not There and the Badly Fitting Suit

There seems to be a common, not unfounded, view that programming is really hard. The response from tool vendors is to produce products that make programming simpler. They do this by introducing a user interface that ‘simplifies’ what would ordinarily be a programming task into something that mere mortals can understand and operate without having to grasp too many of the fundamentals of the task at hand. I think the view that such tools succeed at what they set out to do is broadly untrue.

Up until this week I had intimate experience of only one such tool: BusinessObjects. Now, I’m going back a bit here, but in 1998 BusinessObjects made a tool that allowed a designer to take your ugly enterprise database and turn it into ‘business objects’. These objects carry a little more ‘information’ about the data you are looking at and make it possible to compose those objects in ways of your choosing but, crucially, they do this with the aid of graphical tools that don’t make it look like you are really writing SQL queries at all. This, in turn, lets a common Jo-user compose a view of that data for a report or a particular query they need, without having to learn SQL. In concept the idea has tremendous power, because you can do things that you could never do before without a lot of expensive IT expertise. The reality, for us at least, was a little different, and I think for two reasons.

Unapparent Inefficiencies

A normal SQL programmer would know about the indices and relationship cardinalities on a table, and hence would know which joins would suck. The abstraction provided by BusinessObjects would happily hide those inefficiencies of the underlying representation. That made it really easy to kill the database with a single click-and-drag. You can’t just blame Jo-user for this, either; when you hide the inefficiencies, I would not be surprised if a seasoned programmer sometimes made the same mistakes that Jo did.

Apparent Deficiencies

BusinessObjects, at the time, was a language in itself. Indeed, the BusinessObjects language is an abstraction of another, much more general-purpose, language that we know as SQL. Programming languages that are developed for a specific domain (or domain-specific languages) tend to exhibit their own behaviour. They make some things easy at the expense of making other things less so. The trick, with these tools, is to make sure that all the things you make hard are things people don’t really do very often anyway. The problem for us, at the time, was that we were putting a clean layer of abstraction on top of a not-so-great database design. BusinessObjects wasn’t able to cope very well with those apparent deficiencies, so we had to go hunting for workarounds to put into place until we could get back and refactor the pimple on our software’s arse away.

In the end the project limped on for a bit, then I got moved onto another task and lost track of it. Perhaps it lives on, but I doubt it. This week I discovered that Microsoft have a related tool with a different purpose: SQL Server Integration Services (SSIS). Apparently it’s been going for years under its previous guise of DTS, but I’d never seen nor heard of it before. Anyway, I was initially very excited when I started working with it. I really believed that I could take my complicated data import requirements, transform them into a few diagrams and have all the work done for me. Yeah, right. The reality was somewhat different and, like our friend BusinessObjects, SSIS coped badly with the unapparent inefficiencies and the apparent deficiencies.

The conclusion is that tools like this are complex, and that complexity includes, but is not limited to, a lot of the complexity of the lower-level underlying representation that they must use. Often, then, it will be far better to just use a general-purpose tool (like SQL or Perl) to get your reports or data transformations done. Don’t mess around with some factory-made, ill-fitting suit if you think you can tailor one yourself for less. No matter how pretty the buttons are.

In the end I surmised that using gawk and bash I could produce something equivalent to what I wanted to do in Integration Services in a fraction of the time, and hence at a fraction of the cost. If I’d used Perl or Python I could have done it even better and/or faster. I had been hoodwinked, my friends. Hoodwinked into thinking that I could write a program that was not there, only to discover in the end that it was far easier to just knuckle down and make my own suit, and have it fit nicely, than to use the pretty off-the-shelf one with the gaping crotch.

The problem, it seems, is that there is a market for tools that let us create code that seemingly isn’t there. We still believe, as I did, that tools will somehow save the day. Dammit, why didn’t someone remind me earlier that there is still not, never has been and never will be any silver bullet.

Categories
lisp

It’s imperative that it’s functional

So last week I mentioned that I’d been practicing my Common Lisp by having a go at some puzzles on Project Euler.

A few days later I got an email asking me what my experiences of the project were and how far I’d got with the puzzles. Well, now I can reveal to all of you the heady extent of my genius. My correspondent also casually dropped in the fact that he was trying to do his puzzles in a functional, rather than imperative, programming style. Oh … yeah, “functional programming”. I’d conveniently forgotten about that. I was so busy getting to grips with programming in CL that I forgot about the important bit. I guess 15 years bashing your head against C++, Java and C# is going to have some long-term mental health effects.

Losing the plot

Now, my CS class at university (15 years ago) very briefly touched on functional languages, so I’m not totally clueless. However, we learnt precious little, and there is really no defense for that university not placing a larger emphasis on functional programming. I would argue, now, that it would have been a lot more useful to teach us a functional language than what they did teach us, which was COBOL.

So, given my limited understanding, the point of functional programming is to write programs that behave more like mathematical functions. This has certain consequences, since a mathematical function cannot have side effects like printing “Hello, World!” or delivering spam. Which surely is going to disappoint some people.

It is this “dirty” little secret of programming (i.e. that it should probably do something if we’re going to get paid) that makes pure functional programming a little less straightforward. Languages like Haskell have monads that address the problem of purity by effectively differentiating the pure from the impure. But I don’t think I’m quite ready to go there just yet. Maybe in another 15 years.

Taking the tablets

OK, so where was I? Well, I’m going to try to make my code more functional too. Which means a diet of higher-order functions and recursion for me. So let’s pick a trivial example and see how we do.

Problem #6 in Project Euler asks us:

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

My first solution looked like this:

(defun sum-of-arithmetic-series (n)
  (/ (* n (1+ n)) 2))

(- (expt (sum-of-arithmetic-series 100) 2)
   (loop for i from 1 to 100 sum (expt i 2)))

I was proud of this because I’d figured out that I could use an equation to solve the first part. It isn’t terribly functional, though, because the second part uses a loop. So I tried again; this time I thought I could do better. I decided that what I was really doing was summing a series of natural numbers in the range 1..n. I could generalise this into: “apply a function (or the identity) over a series of numbers and sum the results”.

Looking at the problem this way generates an abstraction that did not previously exist in the earlier solution, making it a more general result and increasing the possibility of its reuse (perhaps!):

(defun sum-of-series (n &optional f)
  (let ((n-dash (if (null f) n
                    (funcall f n))))
    ;; the base case returns f(1) rather than a bare 1, so this
    ;; works for any f, not just ones where f(1) = 1
    (if (= n 1) n-dash
        (+ n-dash (sum-of-series (1- n) f)))))

CL-USER> (- (expt (sum-of-series 100) 2) 
	    (sum-of-series 100 (lambda (x) (* x x))))
25164150
CL-USER> 

OK, I know it's trivial, but it made me feel better. There was only one problem: when testing this for large n (> 30000) I got stack overflows. This surprised me because I was expecting it to be tail recursive and hence for a tail-call optimization to happen. I tried playing around with the debug settings and different variations, but no amount of playing could make SBCL do the right thing and optimize the recursion away.
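With hindsight the culprit is plain: the recursive call sits inside the +, so there is still work to do when it returns, and it isn't a tail call at all. Threading an accumulator through, as in the sketch below (my illustration, not the original code), puts the call in tail position, which SBCL can compile as a jump at normal optimization settings:

(defun sum-of-series-iter (n &optional f (acc 0))
  (let ((n-dash (if (null f) n (funcall f n))))
    (if (= n 1)
        (+ acc n-dash)
        ;; The recursive call is now the last thing that happens,
        ;; so no stack frame needs to survive it.
        (sum-of-series-iter (1- n) f (+ acc n-dash)))))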

Guess I need to do some more investigation into functional programming and that's probably an imperative! (groan)