Using the Dimension Lookup/Update Step in Pentaho Kettle

dim_lookup_update_iconIn a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from the gender (“male”, “female”, “intersex”) over a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slow Changing Dimensions (SCD – “slow” because they are expected to change only infrequently):

  • Type 1: can change but no versioning is needed. E.g. date of birth – if it changes then the old entry will simply be overwritten because it must have been wrong – judging from the new information.
  • Type 2: can change and is versioned. E.g. the city where somebody lives – to keep old numbers unaffected by migration (for example sales by city of customer) we need to keep track of changes.

Expected Table Schema for SCDs of Type 2

For handling SCDs of Type 2 Pentaho Kettle provides a step answering to the name of “Dimension Lookup/Update“. Let’s assume we want to store some European countries as SCD Type 2:

ID Key Name Version Start Date End Date
1 fr France 1 1900-01-01 2100-12-31
2 de Germany 1 1900-01-01 2100-12-31
3 gdr G.D.R. 1 1900-01-01 1990-10-03
4 gdr Germany 2 1990-10-03 2100-12-31

“ID” is unique within the whole dimension table also referred to as the technical key – a fact table will link a specific dimension’s entry using this number. “Key” is unique for a dimension’s category so it can be referred to independent of different versions and is usually chosen with respect to what it represents and hence also called “natural key”. “Version” and the respective time span define the versioning and during what time which version is valid. In the above example a report aggregating population by town could distinguish between Germany and GDR for data from until October 3 1990. For anything later an aggregation for Germany will include the former GDR.

Let’s do a lookup

For the rest of the text I am going to use the same simplified abstract dimension table named “dim_table”. This is its MySQL create statement:

As we progress looking at the different ways of using the dimension lookup/update step it is subject to change. This is its initial content:

t1 We are now going to lookup the technical key (ID) for the natural key “c”. In case you are wondering – this might be something you do when you import aggregated data into a warehouse and you have to substitute the provided natural key with the technical key. Otherwise you can’t feed it to a regular fact table.

lookupFor this example “Update the dimension?” is switched off to restrict the functionality to the lookup. And as you can see the correct ID for “c” is returned.

Updating the Dimension Table using “Insert” and “Update”

This time we will update the dimension table in two fashions. As indicated by the field names, “field_changeable” is supposed to be just updated without bothering with versioning (SCD Type 1) – “field_versioned” on the other hand we would like to keep track of (SCD Type 2).

update1

What happend we can see easiest when we have a look at what changed in dim_table:

t2First of all Kettle added an entry with ID 0. This entry serves the purpose of guaranteeing a dimension entry to link to even when no matching entry could be found – it’s effectively saying “I don’t know what this is supposed to be!”. If an “unknown”-entry is not present Kettle adds it automatically.

Now regarding our updates – the entry with natural key “a” has just been updated, while for the entry “b” a new version with ID = 4 was added. This explains why the transformation returns IDs 1 and 4. Also note how the older version is considered to be invalid since the day of execution and the new version is valid from that day on.

Punch Through versus Update

Some fields of a dimension will never change and if they do then this is effectively a correction which might make sense to apply to all versions of a dimension’s entry. This can be achieved using the “Punch Through” update type.

update2In this case we are going to update the fields “field_changeable” and “field_immutable” for  the entry with the natural key “b”. The first one using “Update” and the second one using “Punch through”. The effect is that “field_changeable” will only be updated for the most recent version and “field_immutable” for all versions:

t3Changing the Reference Date

In case a key references multiple versions of an entry Kettle determines which one to choose using the current day. But you can also request a different version by providing a reference date. Check this out:

lookup2

The field reference_day is provided to “Stream Datefield” and used to determine which version of the entry with natural key “b” to choose – it is by the way of type “Date” with format “yyyy-MM-dd”. And as you see the first lookup yields an ID of 2 and the second an ID of 4.

And SCDs of Type 1?

According to the documentation I would expect that this step can also handle dimension tables keeping SCDs of Type 1 which wouldn’t feature columns for versioning. But it seems that at least the date columns have to be provided, otherwise the lookup won’t work. Not sure if I missed something or this is simply a bug. In case your tables don’t have the necessary schema the functionality for simple unversioned dimensions can be achieved using basic steps – f.x. joining for the lookup and insert/update for adding or changing entries.

Then again SCD of Type 1 can be modelled as Type 2 – if you have the choice – and some might argue that you should always use Type 2 for dimensions with more than a handful of entries.


(original article published on www.joyofdata.de)

4 thoughts on “Using the Dimension Lookup/Update Step in Pentaho Kettle

  1. I notice in the example, that the surrogate key is not an autoincrement. Is that recommended? I wonder about the “unknown” row. If an autoincrement is used, you can’t put a “0” entry into the field for the unknowns. Is it a better idea to not use auto increment for SCD dimensions?

    If it is used, how does Pentaho handle “unknown” values where you cannot predict the id field?

    Thanks,
    Mike

  2. Thanks for this, though by leaving out the example source data, or limiting the example to one trivial row, it makes it hard to understand the internal operation of the tool, or its scope.

    In my own case, I’m having problems with it adding new dimension rows just fine, but every time there is a “hit” on an existing dimension row, it adds a new row full of nulls.  I still don’t know the problem, but felt like this article could have gotten me so much closer to understanding had the example been complete.

    • This whole subject is far from trivial – so I guess something is missing no matter how much more I write. The examples I use aren’t trivial in my opinion they are just heavily boiled down to exemplify exactly what I am trying to convey.

      My suggestion to you is, write up a question on stackoverflow.com with a simplified and reproducable example. Tell me when you did and I will have a look at it.

Comments are closed.