In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done – but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.
In case the desired JSON objects structure is just a set of simple attributes this can be achieved by using mongoimport directly. But in case some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and in the course of that teach you the basics of AWK.
Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular” and you can figure this out using the Developer Tools of your favorite web browser. And now we are going to apply a chain of command line tools – each feeding their output to the next tool (that is called piping btw) and in the end we have a layouted text representation of the box’ content: