Categories
database sql

rozenshtein method for pivoting relational data

I came across this blog entry while trying to make up for the fact that SQL server 8 does not have the PIVOT statement that I needed. It’s taken from a book (that I don’t have) by Rozenshtein on advanced database queries.
The blog explanation is lengthy and a little confusing but the idea is deceptively simple. It hinges around being able make a column expression that can ‘select’ data.

Making a numeric column select data in a row is reasonably straightforward. You need a column expression where the data in the row will return zero if it is non-matching and 1 if it is matching. You can then simply multiply the column value by the expression.

The particular problem I needed to solve was how to present a year report where each column contained the total number of transactions done in a month.

Therefore I had columns like ‘Now’, ‘1 month ago’, ‘2 months ago’, …, ‘5 months ago’

So it seems like I need to work out for each piece of data which month it falls in. This was easy:

 
SELECT MONTH(date) 
FROM table 

Now all I have to do is to figure out the distance from this month to now for the last year. So something like:

SELECT DATEDIFF(month, date, GETDATE()) 
FROM TABLE 
WHERE date >= DATEADD(month, -11, date)

So how does this help? Well all rows in my table for the current month and year will return zero, all preceeding months will return a positive number of months from now. What we’d rather have though is a function that returns 1 for this month and zero for all others. If we could reduce this to returning a 0 for the current month and a 1 for the other months then we’d only need to invert this to get the function we need. We can do this by taking the 1-ABS(SIGN()) of our DATEDIFF() result. The 1-ABS(SIGN()) term is going to make everything that is the current month return 1 and everything that is a different month return zero. This gives us exactly what we need because we can then multiply this result by whatever quantity we like to select it (i.e. 1) or remove it (i.e. 0). You can see this at work in the following table:

xSIGN(x)ABS(x)1-x
-2-110
-1-110
0001
1110
2110

So now we just need to take that and apply it to all the months we want to pivot for. Of course we’re aggregating data here so we’re going to additionally have to group by the month and sum it to get the answer we want. So, say we want 5 months worth, the final SQL looks like this:

SELECT SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 0)) Now
      SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 1)) Month1
      SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 2)) Month2
      SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 3)) Month3
      SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 4)) Month4
      SUM(val * 1-ABS(SIGN(DATEDIFF(month, date, GETDATE()) - 5)) Month5
FROM TABLE 
WHERE date >= DATEADD(month, -11, date)
GROUP BY MONTH(date)

That’s it. Note that the only offensive thing about this is that the column names aren’t very descriptive. I had the benefit of doing this in SQL server 8 so I could build queries inside the stored procedure and build descriptive column names like ‘Jan’, ‘Feb’, ‘Mar’, …. On the whole this is usually a bad thing to do inside a stored procedure because it means that the stored procedure can’t be fully compiled until it is run which is one of the reasons for using stored procedures in the first place! If performance was a consideration and this stored procedure was being called a lot I would probably choose to have the client application rename the columns.