In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done – but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.
<key> value </key>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
<root> <record id="1"> <keyA>val_1A</keyA> <keyB>val_1B</keyB> </record> <record id="2"> <keyA>val_2A</keyA> <keyB>val_2B</keyB> </record> <record id="3"> <keyA>val_3A</keyA> <keyB>val_3B</keyB> </record> </root> |
Solution:
1 2 3 4 5 6 7 8 |
> xmlstarlet \ sel -T -t -m /root/record \ -v "concat(@id,';',keyA,';',keyB)" \ -n test.xml 1;val_1A;val_1B 2;val_2A;val_2B 3;val_3A;val_3B |
XMLStarlet can do a lot of stuff – we want to use it for querying / selecting from an XML document which we denote by sel. -T or --text tells xmlstarlet to output text instead of XML. -t -m /root/record or --template --match /root/record specifies the section (or template) of the XML document which we would like to match repetitively – which is every <record> -section below <root> . -v or --value-of followed by an XPath expression specifies the string which we would like to output line by line – hence -n or --nl for newline. test.xml is … correct!
The XPath expressions are concatted and separated with semicolons. I guess there is not much more to add really as XPath is best understood by staring at it and in case you have to write a custom XPath query this site featuring a whole lot of examples for XPath expressions is pretty helpful.
<tag name=”key”> value </tag>
1 2 3 4 5 6 7 8 9 10 11 |
<root> <record id="1"> <key name="C">val_1C</key> </record> <record id="2"> <key name="C">val_2C</key> </record> <record id="3"> <key name="C">val_3C</key> </record> </root> |
Solution:
1 2 3 4 5 6 7 8 |
> xmlstarlet \ sel -T -t -m /root/record \ -v "concat(@id,';',key[@name='C'])" \ -n test.xml 1;val_1C 2;val_2C 3;val_3C |
key[@name='C'] translates to “Get value of tag named ‘key’ if it features an attribute named ‘name’ valued ‘C'”.
<tag name=”key” val=”value” />
1 2 3 4 5 6 7 8 9 10 11 |
<root> <record id="1"> <key name="D" value="val_1D"/> </record> <record id="2"> <key name="D" value="val_2D"/> </record> <record id="3"> <key name="D" value="val_3D"/> </record> </root> |
Solution:
1 2 3 4 5 6 7 8 |
> xmlstarlet \ sel -T -t -m /root/record \ -v "concat(@id,';',key[@name='D']/@value)" \ -n test.xml 1;val_1D 2;val_2D 3;val_3D |
key[@name='D']/@value) translates to “Get value of attribute named ‘value’ of tag named ‘key’ if that tag has an attribute ‘name’ with value ‘D'”.
<item> <key> K </key> <value> V </value> </item>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
<root> <record id="1"> <item> <key>E</key> <value>val_1E</value> </item> </record> <record id="2"> <item> <key>E</key> <value>val_2E</value> </item> </record> <record id="3"> <item> <key>E</key> <value>val_3E</value> </item> </record> </root> |
Solution:
1 2 3 4 5 6 7 8 |
> xmlstarlet \ sel -T -t -m /root/record \ -v "concat(@id,';',item[key='E']/value)" \ -n test.xml 1;val_1E 2;val_2E 3;val_3E |
item[key='E']/value translates to “Get value of tag ‘value’ below ‘item’ if this ‘item’ has an attribute named ‘key’ with value ‘E’.
<object> <K1> V1 </K1> <K2> V2 </K2> </object>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
<root> <record id="1"> <object> <keyF>val_1F</keyF> <keyG>val_1G</keyG> </object> </record> <record id="2"> <object> <keyF>val_2F</keyF> <keyG>val_2G</keyG> </object> </record> <record id="3"> <object> <keyF>val_3F</keyF> <keyG>val_3G</keyG> </object> </record> </root> |
Solution:
1 2 3 4 5 6 7 8 |
> xmlstarlet \ sel -T -t -m /root/record \ -v "concat(@id,';',object/keyF,';',object/keyG)" \ -n test.xml 1;val_1F;val_1G 2;val_2F;val_2G 3;val_3F;val_3G |
Enjoy converting XML to CSV :)
(original article published on www.joyofdata.de)
Hey, thank you!
What if the “record” has no id though?
Like this:
val_1C
val_2C
Good job!
Thank you! This is just what I needed to get a table of data out of an XML file.
Thanks for your kind feedback – you’re most welcome!