In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done – but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.
What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.
In case you are new to this I recommend reading my articles on the following topics involved here:
- Getting Started With Pentaho BI Server 5, Mondrian and Saiku
- Setting Up a Time Dimension Table in MySQL
- Using the Dimension Lookup/Update Step in Pentaho Kettle
- Google Analytics API with Pentaho Kettle
In the past few months I have been using data sets provided by Eurostat a lot and so I crafted a Kettle job that loads SDMX files (an XML document keeping the data), sets up custom-tailored tables in a Postgres database and fills them. Because those SDMX files often weigh a couple of hundred MB it wouldn’t be good idea to deal with the XML by representing the full structure in memory and reading it using XPath. A more efficient solution is to stream the document (using the “XML Input Stream (StAX)” step) and deal with it in a piece-meal fashion.