Pentaho Kettle 4.4 database repository under the hood

We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.

1942 logged MySQL commands for deactivating a hop

To get an idea about what might be the reason for the observed slow down I decided to log the network traffic and database activity. To make a long story short – even the most simple changes to a transformation – like deactivating a hop will lead to a waterfall of almost 2’000 logged MySQL commands. This is because Kettle will basically set up the transformation all-new. So 2000 is the magnitude of the transformation I used for this little benchmark here – which happens to consist of no more than two simple steps. For a realistical scenario this will reach easily 5 digits.

Network traffic is the bottleneck

But given that this does not lead to big delay on a locally running database plus having a look at what is sent back and forth between javaw.exe and the MySQL server it becomes clear that what causes the delay is the fact that every single command will be sent as and replied to with an individual TCP packet. If this is not running uninhibited you’ll experience considerable lagging.

I try to abstain from criticism but considering the toggling of net_write_timeout between 60 and 600 or the apparent redundancy of some statements make it seem like not much effort was put into designing this process.

How to deal with this?

But okay, at least it works and there are ways to bypass those issues. My suggestion would be to simply separate the development from the ETL execution not just structurally but also physically and just do the development on a local database. Using a file based repository locally will additionally simplfy application of a version control system.


Leave a Reply

Your email address will not be published. Required fields are marked *