What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.
In case you are new to this I recommend reading my articles on the following topics involved here:
The proposal was not successful and has been deleted :(
A gamified, high-speed, high-quality Q&A-site for topics revolving around making professionally sense of a company’s data – a.k.a. “Business Intelligence” – wouldn’t that be awesome? And let’s face it – asking a question on how to configure a step in Pentaho Kettle does not fit any StackExchange site’s realm yet. Usually this type of question is asked on StackOverflow but the feedback-latency is quite high to say the least. Or let’s take a question on how to design a KPI – this one usually ends up on CrossValidated but will often be greeted with disdain given the statistical triviality – plus most people in statistics are not working with BI and won’t be open for the subject’s specific intricacies. And finally you are wondering about how to configure a MySQL RDMS for a data warehouse – where to ask that? On dba.SE … I guess. And suddenly you get weird issues with TomCat which you need for Pentaho BI Server – hmmm, SuperUser? Or ServerFault?
It’s just too distributed!
Pentaho’s BI Server or BA platform allows you to access business data in the form of dashboards, reports or OLAP cubes via a convient web interface. Additionally it provides an interface to administer your BI setup and schedule processes. The aim of this tutorial is to illustrate how to get started with the BI Server and for that purpose I am going to use a small and artifical data set – as otherwise I would have to get deeper into further technologies – so I can keep this text lean. I am going to elaborate on Mondrian schemas, data warehouse design, MDX and further related concepts in separate articles. When you are through with this text and still hungry, make sure you check out “Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku” for a more advanced use case.
First – let me whet your appetite by showing you what a pretty pretty report you will be able to compose after you finished this tutorial – and there you go (click on the picture to bask in the report’s full glory!):
We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.
Yesterday I was willing to invest a couple of hours to install Excel 2013 Home Premium and give PowerPivot a try using the largest dataset I have available locally. A data set from Eurostat keeping passenger and flight volume data going between pretty much all European airports and a partner country on a monthly basis. Extracted from SDMX and loaded into a Postgres database – the table keeps more than 8 million records already provided in long format. Okay – well, it’s not super huge but large enough to get a feeling about performance and usability of a tool. So I was looking forward to PowerPivot-driven data mining and in for a very bad surprise!
Google Analytics offers a KPI for “returning visitors” but what if you would like to be more specific about the meaning of “returning”? Actually this figure is customizable with basic API requests and a very simple idea – at least for two consecutive time spans.
Let’s assume we want to know how many visitors from calender week 2013-1 (Dec 31 2012 until Jan 6 2013) returned to the web-site in calender week 2013-2 (Jan 7 2013 until Jan 13 2013). I’ll refer to calender week 2013-1 as T1, to 2013-2 as T2 and to both combined as T1+T2. The function v maps the time span onto the number of visitors then – so v(T1) = 5 means in calender week 2013-1 Analytics counted 5 unique visitors. Then the number of visitors in T2 who also visited in T1 is:
“Number visitors from T1 who came back in T2” = v(T1) + v(T2) – v(T1+T2)
Sometimes it happens that PowerPivot crashes and is not loaded again automatically when starting Excel. Here is how you get PowerPivot back on track.
- Open the “File” section from the ribbon
- Choose “Options”
- Choose “Add-Ins”
- At the bottom on the right hand side you’ll find a select box labelled “Manage”
- Choose “Disabled Items”
- Click “Go …”
- Choose the item referring to PowerPivot
- Click “Enable”
- Restart Excel
That’s it. In my experience no handicaps are to be expected. The crashes seem pretty random to me. Like a sudden nervous breakdown – just that Excel doesn’t need time-off to recover.
(This article expects from the reader basic experience with PowerPivot. You should know how to access a PowerPivot table from Excel and how to turn it into a pivot table or a chart.)
KPIs for a grocery store
I think this topic is introduced best using an example. Let’s assume an offline store and we are responsible for the data analysis. The boss wants us to count the people visiting the shop considering their age group and also how many of those visits lead to a purchase.
As you might know, I am working as the Data Analyst for carpooling.com in Munich. carpooling.com is the company maintaining the leading web platform for organizing carpoolings (in the world, actually). Many people don’t know what “carpooling” means, so let me explain it you:
Tanja lives in Stuttgart and wants to visit her family in Hamburg next weekend. This is quite a long distance and hence pretty expensive – and also kind of dull sitting in a car alone for several hours. Having three free seats in her car left, she thinks to herself … “Why not offer those seats to other people and share the expenses with them?!”. So she advertises her planned lift on www.mitfahrgelegenheit.de – the biggest German web-site for carpoolings. Peter who also wants to travel to Hamburg next weekend finds her ad and gives her a call to seal the deal. Okay, so far so good, but few passengers aren’t as reliable as Peter and might just forget about the ride and Tanja would then be left with a free seat and no money. So carpooling.com came up with the idea of a “booking system” to make carpooling agreements more binding. Next weekend Tanja, Peter and further passengers meet and drive to Hamburg together.