Pentaho Kettle 4.4 database repository under the hood

We use Pentaho Kettle 4.4 to implement our ETL processes, and to make the jobs involved centrally accessible they are stored in a database repository. Since several proxies separate my VPN access from the database holding the repository, some latency was to be expected, and at first it was not too bothersome. But the time needed to load and save jobs and transformations grew with the complexity of the processes, until it was no longer acceptable (up to a minute to save an edited job, for example) and a solution had to be found.

Continue reading

Generating virtual tables for JOIN operations in MySQL

Let me introduce this post with an example. You offer a web service, and every request to it is logged in a database table – recording the type of request, the user, a lot of other things and, first and foremost, the (date)time it happened. Now you want to know the number of requests per hour, in chronological order, for the last four weeks – to get an idea of when your customers are using the service. If you encounter this situation for the first time, your query might look like this:
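The full query is behind the link, but as a hedged guess at that first attempt (table and column names here are made up for illustration), grouping the log table by the hour is the obvious start:

    -- Naive first attempt: group a (hypothetical) request log by hour.
    -- Note that hours without any requests simply do not show up in the
    -- result; presumably the gap the virtual tables from the title fill.
    SELECT
        DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS hour,
        COUNT(*)                                     AS requests
    FROM request_log
    WHERE created_at >= NOW() - INTERVAL 4 WEEK
    GROUP BY hour
    ORDER BY hour;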

Continue reading

Counting weeks (ISO 8601)

Using the ‘week’ as a time unit can be a bit confusing for at least two reasons:

  1. Does the week start on Sunday or Monday?
  2. How do you count a week that spans two years?

Now your take on those two questions, like anybody else’s, will probably depend not just on cultural upbringing but even more on how it has been done so far. Nobody wants to convert old statistics to a new counting, or to maintain two parallel statistics for one year with both countings. So, to make a long story short, I’ll just stick with the ISO 8601 standard, because it seems reasonable to me. Here is what ISO 8601 has to say about the two questions:
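In short: the ISO week starts on Monday, and week 1 is the week containing the year’s first Thursday, so a week spanning two years is counted with the year that holds the majority of its days. As a minimal sketch of how this plays out in MySQL, mode 3 of WEEK() and YEARWEEK() selects exactly this counting:

    -- ISO 8601 week numbering in MySQL: mode 3 means Monday is the
    -- first day of the week and week 1 is the first week with more
    -- than three days in the new year.
    SELECT
        WEEK('2016-01-01', 3)              AS iso_week,      -- 53: Jan 1, 2016 still belongs
        YEARWEEK('2016-01-01', 3)          AS iso_yearweek,  -- 201553: to week 53 of ISO year 2015
        DATE_FORMAT('2016-01-01', '%x-%v') AS iso_label;     -- '2015-53': %x/%v are ISO year/week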

Continue reading

Make R(ODBC) talk to MySQL on Windows 7 64bit

When you are dealing with a large number of big data sets, it is much more efficient to organize them in database tables instead of CSVs or other files. Just yesterday I set up R to fetch data from a MySQL DBMS, loading a table of stock quotes consisting of more than 300’000 rows into a data frame within seconds. That is pretty cool – and if necessary you can join huge tables in no time, benefiting from the indexing infrastructure of the DBMS of your choice.
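A minimal sketch of the kind of setup meant here (the schema and all names are assumptions, not taken from the post):

    -- Hypothetical table of daily stock quotes, keyed by symbol and date.
    CREATE TABLE quotes (
        symbol      VARCHAR(10)   NOT NULL,
        quote_date  DATE          NOT NULL,
        close_price DECIMAL(12,4) NOT NULL,
        PRIMARY KEY (symbol, quote_date)
    );

    -- Thanks to the primary-key index, even a self-join over hundreds of
    -- thousands of rows (here: each close next to the previous day's)
    -- runs inside the DBMS in no time.
    SELECT t.symbol, t.quote_date, t.close_price, y.close_price AS prev_close
    FROM quotes t
    JOIN quotes y
      ON y.symbol     = t.symbol
     AND y.quote_date = t.quote_date - INTERVAL 1 DAY;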

Continue reading