If the desired JSON structure is just a flat set of simple attributes, this can be achieved with mongoimport directly. But as soon as some of the fields are supposed to be combined into an array or a sub-document, mongoimport alone won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and, along the way, teach you the basics of AWK.
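For the flat case, a single mongoimport call is all it takes. A minimal sketch – assuming a comma-separated file with a header row, since mongoimport’s csv type expects commas; the file and collection names here are just placeholders:

# every CSV column simply becomes a top-level field of the resulting document
> mongoimport -d "geodb" -c "cities_flat" --type csv --headerline --file cities_comma.csv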
For illustration I am going to use the following CSV and will refer to it as cities.csv:
id; city; lat; lon
1; Mainz; 49.994640; 8.264160
2; Tallinn; 59.439451; 24.756094
3; Berlin; 52.515482; 13.402796
4; Munich; 48.140021; 11.580288
And the goal is a JSON object structure which, for the first record, would look as follows:
{
    "_id" : 1,
    "name" : "Mainz",
    "geometry" : {
        "type" : "point",
        "coordinates" : [
            8.26416,
            49.99464
        ]
    }
}
(G)AWK
What I am going to do is take the CSV and have the fields rearranged line by line. This is precisely what AWK was designed for – GAWK is just the GNU flavour of it. AWK – named after its developers Aho, Weinberger and Kernighan – works by walking through a file line by line and applying an action (the commands within the curly braces) to the current line whenever the given regular expression yields a match. I am going to walk you through the logic of the resulting AWK expression, so you will be able to adjust it to your specific use case.
# /M/ is the regular expression matching (in this case) the first and the
# last record which is then printed (the action).
> gawk '/M/{print}' cities.csv
1; Mainz;49.994640;8.264160
4; Munich;48.140021;11.580288
By default AWK splits a line at whitespace and makes the fields available for further processing as variables – $0 for the whole line and $1, $2, ... for the individual fields. As we are using semicolons to separate the fields, we have to tell AWK about it.
# -F specifies the field separator
> gawk -F';' '{print $1,"-",$2}' cities.csv
id - city
1 - Mainz
2 - Tallinn
3 - Berlin
4 - Munich
Now we need a matching expression to specify that we only want to process the lines starting from the second one – NR>1 restricts the action to all records after the header. And to make the string formatting more pleasant to the beholder we are going to use the good old printf function.
> gawk -F';' 'NR>1{printf("%d-%s-%f-%f\n",$1,$2,$3,$4)}' cities.csv
1- Mainz-49.994640-8.264160
2- Tallinn-59.439451-24.756094
3- Berlin-52.515482-13.402796
4- Munich-48.140021-11.580288
Before we can use the variables to build up a JSON document, we need to take care of the white space padding the city names. We do this by simply extending the action with a gsub statement which we place in front of the printf. The gsub removes all white space following a semicolon from the current line ($0). Any white space left in the numeric fields is implicitly taken care of by printf's %d and %f conversions.
> gawk -F';' 'NR>1{gsub(/; +/,";",$0);\
  printf("%d-%s-%f-%f\n",$1,$2,$3,$4)}' cities.csv
1-Mainz-49.994640-8.264160
2-Tallinn-59.439451-24.756094
3-Berlin-52.515482-13.402796
4-Munich-48.140021-11.580288
Feeding MongoDB
Now all that is left is putting the JSON string together …
# note the order of the coordinates - GeoJSON expects longitude/latitude
> gawk -F';' 'NR>1{ \
  gsub(/; +/,";",$0);\
  printf("{_id:%d,name:\"%s\",geometry:{type:\"point\","\
    "coordinates:[%f,%f]}}\n" \
    ,$1,$2,$4,$3)}' \
  cities.csv
{_id:1,name:"Mainz",geometry:{type:"point",coordinates:[8.264160,49.994640]}}
{_id:2,name:"Tallinn",geometry:{type:"point",coordinates:[24.756094,59.439451]}}
{_id:3,name:"Berlin",geometry:{type:"point",coordinates:[13.402796,52.515482]}}
{_id:4,name:"Munich",geometry:{type:"point",coordinates:[11.580288,48.140021]}}
… and feeding it to mongoimport.
> gawk -F';' 'NR>1{ \
  gsub(/; +/,";",$0);\
  printf("{_id:%d,name:\"%s\",geometry:{type:\"point\","\
    "coordinates:[%f,%f]}}\n" \
    ,$1,$2,$4,$3)}' \
  cities.csv \
  | mongoimport -d "geodb" -c "cities"
connected to: 127.0.0.1
2014-07-06T18:25:32.511+0200 imported 4 objects
So let’s see if we can find the data set for “Mainz”:
> mongo
MongoDB shell version: 2.6.3
connecting to: test
> use geodb
switched to db geodb
> db.cities.find({name:"Mainz"}).pretty()
{
    "_id" : 1,
    "name" : "Mainz",
    "geometry" : {
        "type" : "point",
        "coordinates" : [
            8.26416,
            49.99464
        ]
    }
}
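As a small addendum: if you want to run geospatial queries against the imported documents, keep in mind that MongoDB's 2dsphere index expects the GeoJSON type spelled with a capital "P", i.e. "Point". Assuming the documents were created that way, indexing and a proximity search could look roughly like this – a sketch of the standard 2dsphere / $near syntax, not something the import above already covers:

> db.cities.ensureIndex({ geometry : "2dsphere" })
> db.cities.find({ geometry : { $near : { $geometry : { type : "Point", coordinates : [ 8.26, 50.00 ] }, $maxDistance : 100000 } } })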
And on Windows?
If you are a Windows user – do not despair – I bet you can do this also somehow with PowerShell … okay, sorry for that cruel joke – check out Cygwin, which has worked pretty darn well for me so far. Though you might have to call AWK as awk instead of gawk.
(original article published on www.joyofdata.de)
Excellent! I just made a few changes and now it works for me. Thanks
Hi Raffael, great tutorial.
I’ve tried to use this with worldcitiespop.txt and I’ve run into some problems importing all of the records.
I’ve already used iconv to convert the non-UTF-8 entities.
When I import the new file directly with mongoimport, there are no errors.
But when I run the gawk script, mongo fails to import some rows because some entities are not escaped.
The csv header is: “country,city,accentcity,region,population,latitude,longitude”
I want to escape the $1, $2 fields.
In both cases the characters that need to be replaced are single and double quotes.
Examples:
af,dekh"iykh'ya,Dekh"iykh'ya,13,,34.60345,69.2405
af,dekh"yaki-lal'begi-dzhanubi,Dekh"yaki-Lal'begi-Dzhanubi,39,,32.561313,65.864451
Script:
gawk -F',' 'NR>1{ \
gsub(/, +/, ",", $0);\
printf("{country:\"%s\",city:\"%s\",accentcity:\"%s\",region:\"%s
\",population:\"%d\",geometry:{type:\"point\","\"coordinates:[%f,%f]}}\n" \
,$1,$2,$3,$4,$5,$7,$6)}' \
/media/data/dev/maps/worldcitiespop-utf-8.csv \
| /opt/mongodb/bin/mongoimport --db "states_provinces" --collection "worldcitiespop"
Thanks in advance,
Fabio
Hi Fabio,
I would recommend doing it with R, for example – at least the cleaning part. It is of course possible to get this done with bash – in case you insist on doing it the bash way I would have to ask you to post a question with simplified and reproducible code on SO. If you tell me about it I will have a look at it.
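That said, a rough and untested sketch of the escaping: judging from your examples the quotes sit in the city and accentcity columns, i.e. $2 and $3, and single quotes do not need escaping inside JSON strings – only double quotes do. Two extra gsub calls in front of the printf might be enough (file name is a placeholder for your converted file):

# prefix every double quote in $2 and $3 with a backslash before building the JSON
> gawk -F',' 'NR>1{ \
    gsub(/"/,"\\\\&",$2); gsub(/"/,"\\\\&",$3); \
    printf("{country:\"%s\",city:\"%s\",accentcity:\"%s\",region:\"%s\","\
      "population:\"%d\",geometry:{type:\"point\",coordinates:[%f,%f]}}\n" \
      ,$1,$2,$3,$4,$5,$7,$6)}' \
    worldcitiespop-utf-8.csv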
I used that dataset for “Interactive Heatmaps with Google Maps API v3” and I remember having to sort out some issues as well. At the end of the text you find the R code I used for reading the file.
Cheers
Raffael
Great showcase for AWK. When working on Windows (rarely <g>), I also use csvkit from time to time for very quick plain vanilla conversion from CSV to JSON (requires Python)
http://csvkit.readthedocs.org/en/latest/index.html
Hey Armin, thanks for your feedback and link! BTW – given that you wrote three books on Windows technology it seems ironic when you state that you “rarely work on Windows”. But I guess you just did “the switch” at some point – like me.