Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku

What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Pentaho Kettle/PDI, and then how to describe that fact table with a Mondrian 3.6 schema so we can visualize the data with Saiku on the Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.
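To give a first impression of where this is headed, here is a minimal sketch of such a Mondrian schema over a hypothetical fact table – the table and column names (fact_webstats, dim_date, …) are made up for illustration and are not the ones developed in the article:

```xml
<Schema name="WebStats">
  <Cube name="Visits">
    <Table name="fact_webstats"/>
    <Dimension name="Date" foreignKey="date_id">
      <Hierarchy hasAll="true" primaryKey="date_id">
        <Table name="dim_date"/>
        <Level name="Year" column="year"/>
        <Level name="Month" column="month"/>
        <Level name="Day" column="day"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Visits" column="visits" aggregator="sum"/>
  </Cube>
</Schema>
```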

In case you are new to this, I recommend reading my articles on the following topics involved here:

Continue reading

Using the Dimension Lookup/Update Step in Pentaho Kettle

In a traditional star schema the dimensions are located in specialized tables, which are referred to by numeric keys from the fact table. A dimension can represent anything from gender (“male”, “female”, “intersex”) through a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slowly Changing Dimensions (SCD – “slowly” because they are expected to change only infrequently):
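To make the type 2 flavor concrete before we continue: here is a sketch of what such a dimension table might look like on the database side. The technical columns (surrogate key, version, validity range) correspond to what the Dimension Lookup/Update step maintains; all names are illustrative:

```sql
-- A type 2 "user" dimension: instead of overwriting a changed record,
-- a new row with a fresh surrogate key and validity range is inserted.
CREATE TABLE dim_user (
  user_tk    BIGINT PRIMARY KEY,  -- surrogate (technical) key, referenced by the fact table
  version    INT,                 -- incremented with every change of the user's record
  date_from  TIMESTAMP,           -- start of this row's validity
  date_to    TIMESTAMP,           -- end of validity (kept "open" for the current row)
  user_id    VARCHAR(64),         -- natural/business key from the source system
  name       VARCHAR(255),
  address    VARCHAR(255),
  birth_date DATE
);
```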

Continue reading

Pentaho Kettle 5 comes with improved database repository performance

Two weeks ago I wrote about Kettle 4.4 and its database repository, and how working with it is truly no fun due to the excessive latency involved in loading and saving jobs and transformations. Logging the queries sent to a MySQL database made it obvious that the reason is a tornado of thousands of database commands and replies sent back and forth.
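In case you want to reproduce such a measurement: MySQL’s general query log records every statement the server receives and can be switched on at runtime – plain MySQL, nothing Kettle-specific:

```sql
-- Write every incoming statement to a log file.
SET GLOBAL general_log_file = '/tmp/mysql-general.log';
SET GLOBAL log_output  = 'FILE';
SET GLOBAL general_log = 'ON';
-- ... let Kettle load or save a job, inspect the file, then switch off:
SET GLOBAL general_log = 'OFF';
```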

Did Kettle 5 learn something new?

Continue reading

Pentaho Kettle 4.4 database repository under the hood

We are using Pentaho Kettle 4.4 to realize ETL processes, and to make the involved jobs centrally accessible they are stored in a database repository. Given that several proxies separate my VPN access from the database holding the repo, some latency was to be expected and was not too bothersome in the beginning. But the time spent loading and storing jobs and transformations grew in proportion to the complexity of the processes, until it reached a point where it was no longer acceptable (up to a minute for storing an edited job, for example) and a solution had to be found.

Continue reading

Transforming an XML document into a table structure with Pentaho Kettle/PDI

In the past few months I have been using data sets provided by Eurostat a lot, so I crafted a Kettle job that loads SDMX files (an XML format holding the data), sets up custom-tailored tables in a Postgres database and fills them. Because those SDMX files often weigh a couple of hundred MB, it wouldn’t be a good idea to deal with the XML by representing the full structure in memory and reading it with XPath. A more efficient solution is to stream the document (using the “XML Input Stream (StAX)” step) and deal with it in a piecemeal fashion.
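The Kettle step hides the mechanics, but for the sake of the argument here is the underlying idea in plain Java with the StAX pull API – a sketch; the element and attribute names (“Obs”, “value”) are placeholders and real SDMX tags differ by format:

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("data.sdmx.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // Events are pulled one at a time; the full tree is never
            // held in memory (unlike a DOM + XPath approach).
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Obs".equals(reader.getLocalName())) {
                    // hand one observation on, e.g. as a table row
                    System.out.println(reader.getAttributeValue(null, "value"));
                }
            }
            reader.close();
        }
    }
}
```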

Continue reading

SunCertPathBuilderException in PDI/Kettle

To execute an HTTPS request with Java software like Kettle/PDI, the JVM needs to trust the certificate of the domain the request is going to address. If it doesn’t, a request to https://www.example.org using the “HTTP client” step from the “Lookup” section might fail with an error message like this (the exact wording varies by JVM version):
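```
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException:
  PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException:
  unable to find valid certification path to requested target
```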

In this article I’ll show you how to deal with this issue on Windows.
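Spoiler for the impatient: the remedy boils down to importing the site’s certificate into the JVM’s truststore with keytool. A sketch, assuming you have exported the certificate to example.crt and Kettle runs on the JRE under %JAVA_HOME% (the default truststore password is “changeit”):

```
keytool -importcert -alias example.org ^
        -keystore "%JAVA_HOME%\jre\lib\security\cacerts" ^
        -storepass changeit -file example.crt
```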
Continue reading

Google Analytics API with Pentaho Kettle

In my last post I described how to calculate “returning visitors” in a customizable way, depending on how you want to define “returning”. At work as well as for personal projects I use Pentaho Data Integration (PDI), aka Kettle, for ETL processes. PDI provides a step for fetching data from Google Analytics, and in this post I am going to describe how to use this feature on the basis of the job I clicked together for the article on “returning visitors”. I will focus on the steps and aspects relevant to the subject.

Continue reading

Batch downloading files with Pentaho Kettle / PDI

Currently I am working on a project with the goal of downloading all available data sets from Eurostat as SDMX documents and loading the data into a local database, so I can investigate it more conveniently. For this ETL process I use Pentaho Kettle aka PDI (Pentaho Data Integration). One of quite a few small challenges was to download a list of files via HTTP. I must admit I was, and still am, surprised that this cannot be accomplished more easily with Kettle – but what the heck, it’s still a great tool, and first and foremost it IS possible. So in this article I am going to describe the most straightforward way to implement a batch download. I assume basic knowledge of how to use Kettle.

You can download a zip archive containing the job here.

The MAIN job

[Screenshot: http-download-job]

First of all, there is no transformation step to download a file via HTTP, but there is a job entry for that purpose, which is why we have to use a job (DOWNLOAD.kjb) within a job (MAIN.kjb). The file list will most likely have to be provided by steps executed within a transformation, which extracts this information from a file. In my above-mentioned use case, the list of files to be downloaded is provided within an XML document that has to be parsed – which offers a great opportunity for another article.
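To sketch how the pieces fit together – the wiring via “Copy rows to result” and the “Execute for every input row?” option is the standard Kettle pattern for fanning a job out over a row set, so the structure looks roughly like this:

```
MAIN.kjb
├─ START
├─ Transformation: build the file list
│    (last step: "Copy rows to result" – one row per file to download)
└─ Job: DOWNLOAD.kjb
     ("Execute for every input row?" checked, so it runs once per row)
     └─ "HTTP" job entry: fetches the URL and writes the target file
```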

Continue reading