In a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from the gender (“male”, “female”, “intersex”) over a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slow Changing Dimensions (SCD – “slow” because they are expected to change only infrequently):
- Type 1: can change but no versioning is needed. E.g. date of birth – if it changes then the old entry will simply be overwritten because it must have been wrong – judging from the new information.
- Type 2: can change and is versioned. E.g. the city where somebody lives – to keep old numbers unaffected by migration (for example sales by city of customer) we need to keep track of changes.
Expected Table Schema for SCDs of Type 2
For handling SCDs of Type 2 Pentaho Kettle provides a step answering to the name of “Dimension Lookup/Update“. Let’s assume we want to store some European countries as SCD Type 2:
|ID||Key||Name||Version||Start Date||End Date|
“ID” is unique within the whole dimension table also referred to as the technical key – a fact table will link a specific dimension’s entry using this number. “Key” is unique for a dimension’s category so it can be referred to independent of different versions and is usually chosen with respect to what it represents and hence also called “natural key”. “Version” and the respective time span define the versioning and during what time which version is valid. In the above example a report aggregating population by town could distinguish between Germany and GDR for data from until October 3 1990. For anything later an aggregation for Germany will include the former GDR.
Let’s do a lookup
For the rest of the text I am going to use the same simplified abstract dimension table named “dim_table”. This is its MySQL create statement:
CREATE TABLE `dim_table` (
`id` int(11) NOT NULL,
`natural_key` varchar(45) DEFAULT NULL,
`field_immutable` varchar(45) DEFAULT NULL,
`field_changeable` varchar(45) DEFAULT NULL,
`field_versioned` varchar(45) DEFAULT NULL,
`version` int(11) DEFAULT NULL,
`date_from` date DEFAULT NULL,
`date_until` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
As we progress looking at the different ways of using the dimension lookup/update step it is subject to change. This is its initial content:
We are now going to lookup the technical key (ID) for the natural key “c”. In case you are wondering – this might be something you do when you import aggregated data into a warehouse and you have to substitute the provided natural key with the technical key. Otherwise you can’t feed it to a regular fact table.
Updating the Dimension Table using “Insert” and “Update”
This time we will update the dimension table in two fashions. As indicated by the field names, “field_changeable” is supposed to be just updated without bothering with versioning (SCD Type 1) – “field_versioned” on the other hand we would like to keep track of (SCD Type 2).
What happend we can see easiest when we have a look at what changed in dim_table:
First of all Kettle added an entry with ID 0. This entry serves the purpose of guaranteeing a dimension entry to link to even when no matching entry could be found – it’s effectively saying “I don’t know what this is supposed to be!”. If an “unknown”-entry is not present Kettle adds it automatically.
Now regarding our updates – the entry with natural key “a” has just been updated, while for the entry “b” a new version with ID = 4 was added. This explains why the transformation returns IDs 1 and 4. Also note how the older version is considered to be invalid since the day of execution and the new version is valid from that day on.
Punch Through versus Update
Some fields of a dimension will never change and if they do then this is effectively a correction which might make sense to apply to all versions of a dimension’s entry. This can be achieved using the “Punch Through” update type.
In this case we are going to update the fields “field_changeable” and “field_immutable” for the entry with the natural key “b”. The first one using “Update” and the second one using “Punch through”. The effect is that “field_changeable” will only be updated for the most recent version and “field_immutable” for all versions:
In case a key references multiple versions of an entry Kettle determines which one to choose using the current day. But you can also request a different version by providing a reference date. Check this out:
The field reference_day is provided to “Stream Datefield” and used to determine which version of the entry with natural key “b” to choose – it is by the way of type “Date” with format “yyyy-MM-dd”. And as you see the first lookup yields an ID of 2 and the second an ID of 4.
And SCDs of Type 1?
According to the documentation I would expect that this step can also handle dimension tables keeping SCDs of Type 1 which wouldn’t feature columns for versioning. But it seems that at least the date columns have to be provided, otherwise the lookup won’t work. Not sure if I missed something or this is simply a bug. In case your tables don’t have the necessary schema the functionality for simple unversioned dimensions can be achieved using basic steps – f.x. joining for the lookup and insert/update for adding or changing entries.
Then again SCD of Type 1 can be modelled as Type 2 – if you have the choice – and some might argue that you should always use Type 2 for dimensions with more than a handful of entries.
(original article published on www.joyofdata.de)