Friday, 11 November 2011

Implementing an Agile ETL process in SQL Server

I was fortunate enough recently to attend the DW 2.0 Conference in Sydney.  One of the presenters, Therese Ahlstam (from the Genesee Academy) advocated a technique for building data warehouse's she referred to as Hyper-agility.  The idea was to avoid the need to constant schema changes by storing Staging & ODS tables as Key/Value pairs.  To me this seems to fit the Serialisation/de-serialisation design pattern so common in many software frameworks these days.

This idea seemed quite radical to me, but I've been thinking on this ever since and I decided to have a look at how feisible this really is.  As it turns out, while the syntax is a bit tricky, this is not only possible but relatively easy to do in MS SQL Server thanks to the PIVOT/UNPIVOT commands.

Here is an example:

(pkID int, 
Name VARCHAR(50), 
Address VARCHAR(250), 
Phone VARCHAR(30) 

INSERT INTO SourceTable VALUES (1,'Bob Smith', '1 Miller St Sydney', '555-1234');
INSERT INTO SourceTable VALUES (2, 'John wayne', '10 california Dr Hollywood', '+1-555-4567');
INSERT INTO SourceTable VALUES (3,' Billy Bowden', '4 Wainui road Wellington', '+64-21-3456789');

--Unpivot the table.
SELECT pkID, KeyCol, ValCol INTO unpivotSourceTable
   (SELECT pkID, CAST(Name as VARCHAR(250)) as Name,Address,cast(Phone as VARCHAR(250)) as Phone from SourceTable) ST
   (ValCol FOR KeyCol IN       (Name,Address,Phone)

select * from  unpivotSourceTable

pkID KeyCol ValCol
1 Address 1 Miller St Sydney
1 Name Bob Smith
1 Phone 555-1234
2 Address 10 california Dr Hollywood
2 Name John wayne
2 Phone +1-555-4567
3 Address 4 Wainui road Wellington
3 Name Billy Bowden
3 Phone +64-21-3456789

Note we need to convert all the columns to the same data type to get this to work (which makes sense) .  This is how we would store the data in our Operational data store and/or staging database.

But how do we get back to a regular row when we need it ?  We pivot it !

SELECT pkID, [Name],[Address],[Phone]
(select pkID, KeyCol, ValCol FROM unpivotSourceTable) uST
(MIN(ValCol) for KeyCol in ([Name],[Address],[Phone]) )as pST

pkID Name Address Phone
1 Bob Smith 1 Miller St Sydney 555-1234
2 John wayne 10 california Dr Hollywood +1-555-4567
3 Billy Bowden 4 Wainui road Wellington +64-21-3456789

Do you think there is value in this approach ?  I've yet to put it into practice, but if you have, let me know.  I'd be a bit concerned about the performance for very large tables, but I can see how this would speed up development.