In the past few months I have been working a lot with data sets provided by Eurostat, so I crafted a Kettle job that loads SDMX files (an XML format holding the data), sets up custom-tailored tables in a Postgres database and fills them. Because those SDMX files often weigh a couple of hundred MB, it wouldn’t be a good idea to deal with the XML by representing the full structure in memory and reading it with XPath. A more efficient solution is to stream the document (using the “XML Input Stream (StAX)” step) and deal with it in a piecemeal fashion.
Dealing with a serialized tree structure
This presents you with the challenge of dealing with a tree structure served as a serialized stream of rows. The key feature making this possible is that the “XML Input Stream” step provides the ID of each XML element and of its parent element in the fields ‘xml_element_id’ and ‘xml_parent_element_id’. This allows you to reconstruct and use the data structure programmatically.
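To make the idea concrete: this is not the Kettle step itself, but a minimal Python sketch (field names simplified, the file name is just an example) of the same principle – a streaming parser that serializes the tree into rows carrying an element ID and a parent ID.

import xml.etree.ElementTree as ET

def stream_rows(path):
    next_id = 0
    stack = []                                   # (element_id, parent_id) of the open elements
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            next_id += 1
            parent_id = stack[-1][0] if stack else 0
            stack.append((next_id, parent_id))
        else:                                    # "end": the element and its text are complete
            element_id, parent_id = stack.pop()
            yield (element_id, parent_id, elem.tag,
                   dict(elem.attrib), (elem.text or "").strip())
            elem.clear()                         # drop the finished subtree -> flat memory use

for row in stream_rows("timeseries.xml"):        # "timeseries.xml" is just the sample file
    print(row)

The StAX step works in document order and, roughly speaking, emits separate rows for start elements, attributes, character data and end elements, but the principle is the same: every row knows its own ID and the ID of its parent.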
So what I am going to show you is how to map the following XML structure onto a long-format table, which can then be written to a CSV file or a database table.
<?xml version="1.0"?>
<timeseries>
  <measurement year="2000">
    <!-- value of item named A in 2000 is 8.5 -->
    <item name="A">8.5</item>
    <item name="B">9.8</item>
  </measurement>
  <measurement year="2001">
    <item name="A">12.2</item>
    <item name="B">9.4</item>
  </measurement>
  <measurement year="2002">
    <item name="A">11.1</item>
    <item name="B">7.2</item>
  </measurement>
</timeseries>
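The goal is a long-format table with one row per item and year – the column names here are just illustrative:

year   name   value
2000   A      8.5
2000   B      9.8
2001   A      12.2
2001   B      9.4
2002   A      11.1
2002   B      7.2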
The five essential steps to take
- XML Input Stream (StAX): Read the XML document as a stream of rows.
- Filter rows: Discard all document elements that are not relevant.
- Switch / Case: Route the first-level “measurement” rows to a “Select values” step and the second-level “item” rows to a “Row denormaliser” step.
- Row denormaliser: Collapse the list of second-level rows belonging to one item into a single row.
- Merge Join: Combine the first-level rows with the rows holding their child elements.
The discussed transformation together with the sample XML file can be downloaded: here.
The “Row denormaliser” step
The easiest way to understand the steps is of course to preview the stream at each step and see what it does. One step that might warrant some extra explanation is “Row denormaliser” – what it does is best explained with an example. Notice how the structure changes and how this is specified in the step’s settings.
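A schematic version of such an example (field names are simplified here – the StAX step delivers the key/value pairs in fields along the lines of ‘xml_data_name’ and ‘xml_data_value’): before denormalising, each item is spread over several key/value rows that share the item’s element ID,

item_element_id   key    value
3                 name   A
3                 text   8.5

and after denormalising on the group field ‘item_element_id’, each item occupies a single row with one column per key:

item_element_id   name   text
3                 A      8.5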
The “Merge Join” step
The last non-trivial step reunites the two streams – the first level taking care of the “measurement”s, coming from “Select values”, and the second level taking care of the “item”s, coming from “Row denormaliser”. The “Merge Join” step will be immediately familiar if you have used JOINs in SQL before, because it does essentially the same thing. For technical reasons – because Kettle doesn’t work on tables but on streamed rows – it is important to ensure that the rows are sorted by the keys used for the join! This is already implicitly ensured by the way the XML is streamed and handled in the beginning, which is why I don’t have to do it explicitly with a “Sort rows” step.
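For illustration, here is a minimal Python sketch of what the merge join does with two streams that already arrive sorted on the join key – all field names and IDs below are made up for the example.

measurements = [                      # from the "Select values" branch (IDs are illustrative)
    {"element_id": 2,  "year": "2000"},
    {"element_id": 7,  "year": "2001"},
    {"element_id": 12, "year": "2002"},
]
items = [                             # from the "Row denormaliser" branch
    {"parent_id": 2,  "name": "A", "value": "8.5"},
    {"parent_id": 2,  "name": "B", "value": "9.8"},
    {"parent_id": 7,  "name": "A", "value": "12.2"},
    {"parent_id": 7,  "name": "B", "value": "9.4"},
    {"parent_id": 12, "name": "A", "value": "11.1"},
    {"parent_id": 12, "name": "B", "value": "7.2"},
]

# Because both inputs are sorted on the join key, a single forward pass is
# enough -- this is exactly why the rows have to arrive sorted.
joined, i = [], 0
for m in measurements:
    while i < len(items) and items[i]["parent_id"] == m["element_id"]:
        joined.append({"year": m["year"],
                       "name": items[i]["name"],
                       "value": items[i]["value"]})
        i += 1

for row in joined:
    print(row)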
And in the end …
… the data is available in a tabular structure and can be fed to a database or a spreadsheet. Whether you choose a long format, as I did in this case, or a wide structure (‘A’ and ‘B’ would be separate fields/columns holding the respective values) depends on your plans. But usually a long structure is more versatile.
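For comparison, the wide version of the same sample data would look like this:

year   A      B
2000   8.5    9.8
2001   12.2   9.4
2002   11.1   7.2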
If you have deeply nested XML documents (or JSON, HTML, SQL) with recursive nodes (node ‘folder’ in node ‘folder’ in node ‘folder’) of a mixed complex type and without an XSD/DTD schema, you can use the tool eXtractorONE (eXtractor.ONE).
No programming needed, no XQuery, no XSLT, nearly zero configuration. No limit on size. Just point it to the folder with XML documents, select your target database and run it.
@Marvin:
The original project where I used Kettle for importing XML also had to deal with more levels. The solution is to identify every level and deal with each separately – taking the information you need and storing it in a single row.
A>B>C
Every row needs its own ID and its parent level’s ID. When you are done processing the levels you can recombine the rows by joining on levelA.ID = levelB.parentID, and then joining that result to the next level on levelB.ID = levelC.parentID.
But two more thoughts:
Kettle can be quite a pain in the brain – and it is going to be for everything more complex than the obvious. So expect some tinkering until you get it done. For technical questions I recommend this forum:
http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration-Kettle
Even tough questions are often answered within a few hours on workdays.
The other thing is – in this case I used Kettle for the import mostly out of interest / for academic reasons. For a professional setup I cannot recommend it. I would suggest using a programming language of your choice instead: it is faster, easier to adjust and easier to debug.
You’re welcome
Raffael
Raffael-
Thank you for your recent blog post (Transforming an XML document into a table structure with Pentaho Kettle/PDI). I’m trying to do something very similar, and am looking at your work as a model. However, my structure has one more level than yours and I don’t know PDI well enough to figure out the best way to handle this (perhaps separate the rows into multiple groups using a “Case” step on StAX’s xml_parent_element_id or xml_element_level, and then run your model separately for each case? Or perhaps a bit of tweaking to your model will allow it to support more levels in the XML structure?)
Any advice you can provide would be greatly appreciated!
thank you-
Marvin