Monday, 6 July 2009

Best practice for slowly changing Fact tables.

I've been trying to find some evidence of best practice for facts which are updated and have been unable to find much written on this subject. While this is quite common for case management type systems (e.g. HP Service Centre, Salesforce.com), it is not typical for data warehousing in general, as most literature focuses on more traditional transaction based systems (e.g. bank accounts, PABX data, order/billing etc.).

The challenge we have then is to determine when the dimensions of a fact record should be changed, and when they should be left unchanged if the fact itself is altered. Obviously for SCD Type 1, this is a non-issue; we aren't keeping a record of the previous Dimensions value, so the fact must be linked to the newest version anyway. But for SCD Type 2, there are three possible options:
  1. Determine the Dimension at the first instance of the fact being recorded in the warehouse.
  2. Update the fact's dimension's to the version of the dimension that existed at the time of the fact being modified.
  3. Update to the Current version of the dimension at the time the fact is updated in the warehouse.
While 1 is the simplest to implement, it doesn't reflect the real world usage of many source systems. For example an address Dimension may start of with very minimal details initially, but will gain more details over time. It is unlikely that we would want to stay with the minimally populated dimension once we have richer data available.

Option 2 to me is probably the best 'default' option in this case. It makes sense to me that if the fact is updated, it should reflect the dimensions at that point in time. For example if sale was made to the sales rep for region 1 (e.g. 'Bob') and Region 1 is later re-assigned to 'Fred', we don't necessarily want that sale updated to 'Fred' as he never made that sale. I'm sure there are cases though where this may not be the right approach.

Option 3 to me in most cases would give an incorrect view of the data, so therefore is not what we want either, however it may be that in the source system certain data is only updated periodically and that for that system the most recent version is deemed to be the 'correct' version.

I'm include to use Option 2 in most cases, but I would appreciate any feedback on this, as there may be other scenarios I haven't considered. Also, if you have come across any articles linked to this, please pass these on to me.

What are your thoughts ?

2 comments:

Innovator said...

How about creating a factless fact table? Take a look at another blog by Andrew Fryer http://blogs.technet.com/andrew/archive/2009/03/03/changing-dimensions-fast-or-slow.aspx

Leigh Kennedy said...

As far as I can see this just moves complexity down stream. Given the whole point of the Star Schema is to provide a simply and clear model for querying, I don't see this as beneficial to the overall solution.