Friday, 11 November 2011

Implementing an Agile ETL process in SQL Server


I was fortunate enough recently to attend the DW 2.0 Conference in Sydney.  One of the presenters, Therese Ahlstam (from the Genesee Academy) advocated a technique for building data warehouse's she referred to as Hyper-agility.  The idea was to avoid the need to constant schema changes by storing Staging & ODS tables as Key/Value pairs.  To me this seems to fit the Serialisation/de-serialisation design pattern so common in many software frameworks these days.

This idea seemed quite radical to me, but I've been thinking on this ever since and I decided to have a look at how feisible this really is.  As it turns out, while the syntax is a bit tricky, this is not only possible but relatively easy to do in MS SQL Server thanks to the PIVOT/UNPIVOT commands.

Here is an example:

CREATE TABLE SourceTable 
(pkID int, 
Name VARCHAR(50), 
Address VARCHAR(250), 
Phone VARCHAR(30) 
)
GO


INSERT INTO SourceTable VALUES (1,'Bob Smith', '1 Miller St Sydney', '555-1234');
INSERT INTO SourceTable VALUES (2, 'John wayne', '10 california Dr Hollywood', '+1-555-4567');
INSERT INTO SourceTable VALUES (3,' Billy Bowden', '4 Wainui road Wellington', '+64-21-3456789');
GO


--Unpivot the table.
SELECT pkID, KeyCol, ValCol INTO unpivotSourceTable
FROM 
   (SELECT pkID, CAST(Name as VARCHAR(250)) as Name,Address,cast(Phone as VARCHAR(250)) as Phone from SourceTable) ST
UNPIVOT
   (ValCol FOR KeyCol IN       (Name,Address,Phone)
)AS uST
GO


select * from  unpivotSourceTable
ORDER BY pkID, KeyCol

pkID KeyCol ValCol
1 Address 1 Miller St Sydney
1 Name Bob Smith
1 Phone 555-1234
2 Address 10 california Dr Hollywood
2 Name John wayne
2 Phone +1-555-4567
3 Address 4 Wainui road Wellington
3 Name Billy Bowden
3 Phone +64-21-3456789

Note we need to convert all the columns to the same data type to get this to work (which makes sense) .  This is how we would store the data in our Operational data store and/or staging database.

But how do we get back to a regular row when we need it ?  We pivot it !

SELECT pkID, [Name],[Address],[Phone]
FROM 
(select pkID, KeyCol, ValCol FROM unpivotSourceTable) uST
PIVOT
(MIN(ValCol) for KeyCol in ([Name],[Address],[Phone]) )as pST




pkID Name Address Phone
1 Bob Smith 1 Miller St Sydney 555-1234
2 John wayne 10 california Dr Hollywood +1-555-4567
3 Billy Bowden 4 Wainui road Wellington +64-21-3456789


Do you think there is value in this approach ?  I've yet to put it into practice, but if you have, let me know.  I'd be a bit concerned about the performance for very large tables, but I can see how this would speed up development.


No comments: