Categories
article programming

Floating Point Flotsam

I have never been particularly clear about when to choose single or double floating point arithmetic. I think I have been operating on a sort of ‘trial-and-error’ approach for some time. It seems ludicrous that I should not, after all these years, know exactly when to chose single or double precision arithmetic. The truth is that I don’t, and it’s now time to fix that.

First, a little background. I was, for reasons that are too sad to explain, interested in how to calculate (accurately) the time of the Vernal Equinox. That is the exact time of the year (to the minute – if possible) that the sun is 90o above the earth at the equator. Anyway, I found some code which is an implementation of an astronomical algorithim designed expressly for the purpose of calculating the vernal equinox and is accurate to about 20 minutes. Which is good enough for now.

Now the next piece of background is that I was also doing this in Lisp and up until now I have let the reader interpret all the literals I enter. When I tried to do the calculation in the REPL I could get no closer than being in the same part of the day as I was expecting and with the result somewhat rounded to the nearest half day. After some head scratching it finally occurred to me that the the computation required more significant digits than I really had. It seems that the formula I was entering was being interpreted as single precision floating point numbers because if I had wanted double’s I would have suffixed my literal numbers with a ‘d’. It would seem that d could also stand for D’uh. Seems fair enough. Time to do some research then …

You see, according to the IEEE standard 754-1985 a single precision floating point number has 23 significant bits and a double has 53 significant bits. For me to answer the question of how many significant figures I can get in decimal would require me to know that the smallest decimal fraction that I can represent in binary is. This number would be 1/223 which is about .00000011920928955078125.

Now, floating point numbers are represented from a fractional part and an exponent part to give the decimal representation of the number. Therefore you can never get more accuracy than the smallest binary fraction multiplied by the exponent you have. This means that the significant figures should be something like:

log10(1/(2^23)) = -6.9

Therefore when a number has 7 significant figures you are already losing a little accuracy, the more significant figures you add the worse it will get. It was then clear that my astronomical antics were less than stellar since the first literal in the computation has 11 significant figures.

Indeed whilst I was thinking about this problem it occurred to me that if I wanted to continue using single precision I could split the fractional part from the integer part and continue this way. However, this is still inferior to a double because it would give me 7 significant figures for both parts and therefore a total of 14 significant figures. This is inferior because using my shiny new brain I can show that a double will give about 16 signficant figures. Of course I could have also concluded that double’s are better than two floats by noting that 23 bits+ 23 bits < 53 bits but that would never have been as much fun.

Type Word Size Mantissa Dec Sig. Figs.
Single 32 23 7
Double 64 53 16
Extended 96 63 19
Quad-Extended 128 113 34

You could argue that I could have saved myself 20 minutes of time by looking the answer up but then again I would never remember the answer unless I proved it to myself first. So, now my spring occurs at the same time as everyone else’s and I know why. Oh it happened already? Sheeeeiiiitttt.