In the past few months I have been using data sets provided by Eurostat a lot, and so I crafted a Kettle job that loads SDMX files (an XML format for statistical data), sets up custom-tailored tables in a Postgres database and fills them. Because those SDMX files often weigh a couple of hundred MB, it wouldn’t be a good idea to load the full XML tree into memory and read it using XPath. A more efficient solution is to stream the document (using the “XML Input Stream (StAX)” step) and deal with it in a piecemeal fashion.
Dealing with a serialized tree structure
This presents you with the challenge of dealing with a tree structure served as a serialized stream of rows. The key feature making this possible is that the “XML Input Stream” step provides the ID of each XML document element and of its parent element in the fields ‘xml_element_id’ and ‘xml_parent_element_id’. These IDs allow you to understand and use the data structure programmatically.
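To illustrate what such a serialized stream of rows looks like, here is a minimal Python sketch of the same idea: every element gets an ID plus its parent’s ID, mirroring Kettle’s ‘xml_element_id’ and ‘xml_parent_element_id’ fields. The XML below is an illustrative guess at the structure (consistent with the comment in the snippet above), not Eurostat’s actual SDMX schema.

```python
# Sketch: serialise an XML tree into rows carrying element and parent IDs,
# the way a StAX-style input step does. Structure and field names are
# illustrative assumptions, not Kettle's or Eurostat's exact ones.
import xml.etree.ElementTree as ET
from io import StringIO
from itertools import count

XML = """\
<measurements>
  <measurement year="2000">
    <item name="A">8.5</item>
    <item name="B">3.1</item>
  </measurement>
</measurements>"""

def stream_rows(source):
    """Return (element_id, parent_element_id, tag, attributes, text) rows."""
    ids, stack, rows = count(1), [], []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append((next(ids), elem))    # IDs assigned in document order
        else:
            eid, e = stack.pop()
            parent = stack[-1][0] if stack else None
            rows.append((eid, parent, e.tag, dict(e.attrib),
                         (e.text or "").strip()))
            e.clear()                          # keep memory use flat
    rows.sort(key=lambda r: r[0])              # restore document order
    return rows
```

Each ‘item’ row points back to its ‘measurement’ parent via the parent ID, which is exactly what the later steps exploit to reassemble the hierarchy.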
So what I am going to show you is how to map the following XML structure onto a table (long format), which can then be fed to a CSV file or a database table.
<!-- value of item named A in 2000 is 8.5 -->
The five essential steps to take
- XML Input Stream (StAX): Stream input the XML document.
- Filter rows: Discard all document elements that are not relevant.
- Switch / Case: Route the first-level “measurement” rows to a “Select values” step and the second-level “item” rows to a “Row denormaliser” step.
- Row denormaliser: Transform the 2nd-level list of rows for one item into a single row.
- Merge Join: Combine the first-level rows with the rows holding their child elements.
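The net effect of these five steps can be sketched in plain Python: stream the XML, keep only the relevant elements, and emit one long-format row per item. Element and attribute names are illustrative assumptions, not Eurostat’s schema.

```python
# End-to-end sketch of the pipeline's outcome: stream the XML and emit
# (year, item, value) long-format rows. Names are illustrative guesses.
import xml.etree.ElementTree as ET
from io import StringIO

XML = """\
<measurements>
  <measurement year="2000">
    <item name="A">8.5</item>
    <item name="B">3.1</item>
  </measurement>
  <measurement year="2001">
    <item name="A">9.0</item>
  </measurement>
</measurements>"""

def to_long(source):
    rows, year = [], None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "measurement":
            year = elem.get("year")           # 1st-level attribute
        elif event == "end" and elem.tag == "item":
            rows.append((year, elem.get("name"), float(elem.text)))
        elif event == "end" and elem.tag == "measurement":
            elem.clear()                      # free memory: the point of streaming
    return rows
```

The Kettle transformation arrives at the same table, but via the generic element/parent-ID rows, which keeps it adaptable to other document structures.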
The discussed transformation with a sample XML file can be downloaded: here.
The “Row denormaliser” step
The easiest way to understand the steps is, of course, to run a preview at each step to see what it does with the stream. One step that might warrant some extra explanation is “Row denormaliser” – what it does is explained best with an example. Notice how the structure changes and how this is specified in the step’s settings.
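The logic of that step can be sketched as follows: group the 2nd-level rows by their parent key and pivot the item names into columns, one output row per group. Field names and keys here are illustrative, not the step’s actual configuration.

```python
# Minimal sketch of what "Row denormaliser" does: group rows by a key
# field and pivot a grouping field ("A", "B") into columns.
from collections import defaultdict

rows = [  # (parent_element_id, item_name, value) -- illustrative data
    (2, "A", 8.5),
    (2, "B", 3.1),
    (5, "A", 9.0),
]

def denormalise(rows):
    grouped = defaultdict(dict)
    for parent_id, name, value in rows:
        grouped[parent_id][name] = value
    # one row per parent; missing items become None (as in Kettle: null)
    return [(pid, fields.get("A"), fields.get("B"))
            for pid, fields in sorted(grouped.items())]
```

Three input rows become two output rows, with the values of ‘A’ and ‘B’ side by side instead of stacked.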
The “Merge Join” step
The last non-trivial step is to reunite the two streams – the 1st level taking care of the “measurement”s, coming from “Select values”, and the 2nd level taking care of the “item”s, coming from “Row denormaliser”. The “Merge Join” step will be immediately familiar if you have used JOINs in SQL before, because it does essentially the same. For technical reasons – because Kettle doesn’t work on tables but on streamed rows – it is important to ensure that the rows are sorted by the keys used for the join! This is already implicitly ensured by the way the XML is streamed and handled in the beginning – which is why I don’t have to do that explicitly using the “Sort rows” step.
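A sorted merge join is simple to sketch: walk both sorted inputs in lockstep and emit a combined row whenever the keys match. This toy version (illustrative field names, inner join only, no duplicate keys on both sides) shows why the sort order matters – the two pointers only ever move forward.

```python
# Sketch of a sorted (inner) merge join: both inputs must already be
# sorted on the join key, as the "Merge Join" step requires.
def merge_join(left, right, key=lambda r: r[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1                         # advance the stream that lags
        elif kl > kr:
            j += 1
        else:
            out.append(left[i] + right[j][1:])  # combine, drop duplicate key
            i += 1
            j += 1
    return out

measurements = [(2, "2000"), (5, "2001")]        # (element_id, year)
items        = [(2, 8.5, 3.1), (5, 9.0, None)]   # (parent_id, A, B)
```

If either input were unsorted, matching keys could slip past each other and rows would be silently dropped – which is exactly why Kettle insists on sorted inputs for this step.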
And in the end …
… the data is available in a tabular structure and can be fed to a database or a spreadsheet. Whether you choose a long format, as I did here, or a wide structure (‘A’ and ‘B’ would be separate fields/columns holding the respective values) depends on your plans. But usually a long structure is more versatile.
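For completeness, converting between the two shapes is a one-liner-ish pivot; the field names below are the same illustrative ones used throughout.

```python
# Pivot long-format rows into a wide structure: one record per year,
# with item names ("A", "B") as columns. Illustrative data.
long_rows = [("2000", "A", 8.5), ("2000", "B", 3.1), ("2001", "A", 9.0)]

wide = {}
for year, item, value in long_rows:
    wide.setdefault(year, {})[item] = value
# wide: {"2000": {"A": 8.5, "B": 3.1}, "2001": {"A": 9.0}}
```

Note how the wide form forces a decision about missing combinations (‘B’ in 2001), whereas the long form simply has no row for them – one reason the long structure tends to be the more versatile starting point.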