MongoDB – State of the R

mongodbNaturally there are two reasons for why you need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes along with a bag of goodies making it a tempting choice for a number of use cases:

  • Flexible schema-less data structures
  • spatial and textual indexing
  • spatial queries
  • persistence of data
  • easily accessible from other languages and systems

In case you would like to learn more about MongoDB then I have good news for you – MongoDB Inc. provides a number of very well made online courses catering to various languages. An overview you may find here.

rmongodb versus RMongo

The good news is – there are two packages available for making R talk to MongoDB.

For a larger project at work I decided to go with rmongodb because it does not require Java in contrast to RMongo and it seems to be more actively developed – not to mention that MongoDB Incorporation itself has a finger in the pie apparently. And I can say I did not regret that choice. It’s a great package – and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that – there are a few quirks and a rather uncomfortable not-yet resolved issue one is going to face for non-trivial queries. But before I get to those subjects let me first give you a quick introduction into its application.

UPDATE – 2014 November 15
Supposedly the below described issues have been resolved with the most recent version of rmongodb.

Storing Data with rmongodb

I assume that a MongoDB daemon is running on your localhost. By the way installing and using MongoDB from R is pretty much effortless for Ubuntu as well as Windows. First we are going to establish a connection and check whether we were successful.

Now let’s insert an object which we define using JSON notation:

Maybe this intermediate step is a bit surprising – after all MongoDB stores JSONs!? Well, it doesn’t – it works with BSONs.

BSON [bee · sahn], short for Bin­ary JSON, is a bin­ary-en­coded seri­al­iz­a­tion of JSON-like doc­u­ments. Like JSON, BSON sup­ports the em­bed­ding of doc­u­ments and ar­rays with­in oth­er doc­u­ments and ar­rays. BSON also con­tains ex­ten­sions that al­low rep­res­ent­a­tion of data types that are not part of the JSON spec. For ex­ample, BSON has a Date type and a BinData type. [bsonspec.org]

 

The data structure closest to a JSON in R is a list – so naturally we can use that too for specifying a document to be inserted:

Retrieving Data from MongoDB

Now let me show you how to retrieve those two documents and print them – for illustrative purposes I query for documents whose field “a” holds a value greater or equal to 1:

Also the search query is only superficially a JSON and hence has to be casted to BSON before applying it. The result is a cursor which has to be iterated and leads to the resulting documents – as BSONs, of course. This little ritual with converting to and from BSON feels a bit clumsy at times but one gets used to it eventually and of course nothing keeps you from writing a more comfortable wrapper.

Implicit Coversion of Sub-Document to Vector

For the purpose of illustrating my point I added a document into the collection from MongoDB shell:

Now what you will see is that rmongodb implicitely casts the lowest sub-document as named R vector:

This is something you have to keep in mind. Personally I find this to be a tad uncomfortable. As a list() would work, too, and would allow for a more homogenous processing. But that is just my two cents.

 Formulating Queries Using Arrays is Problematic

At the present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and that query expression needs an array. For example:

Let’s assume three documents in database.collection :

We would expect to receive the two obvious documents. But doing so in rmongodb currently yields an error:

 Building Up the BSON Buffer from Scratch

Internally rmongodb constructs a BSON for a JSON by constructing an initial BSON buffer and then adding to it the elements of the JSON-document. For the suggested query this would be done as follows:

Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries/documents. But then again this method of building up a BSON is also quite generic and straightforward. So, to simplify this task I composed a little package that recursively traverses the representing list and invokes those functions appropriately.

rmongodbHelper at Your Service

The following code will turn the or-query from a JSON into a BSON:

And its result:

Why Oh Why?

The reason why I took the time to craft a solution for this issue is two fold:

  1. I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
  2. The issue seems to be around already for almost a year

I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.

stay-tuned twitter feedly github

Some Details about rmongodbHelper

I answered a couple of questions on stackoverflow.com which will provide more code on how to use the package:

Keep few rules in mind in case you would like to use it:

  • It is version 0.0.0 – so there will be bugs – you are welcome to contribute or complain
  • If you would like to feed it a JSON then keep in mind that all keys have to be placed within double quotes
  • '"x":3'  will be casted as double
  • '"x":"__int(3)"'  will be casted as integer
  • Internally arrays and sub-documents are implemented as list()s. They are differentiated by non-existence of names() for arrays and presence of nams for sub-documents. An empty array has to contain one list()-element .ARR with arbitrary value.

To give an example for the last point:

The resulting document will look pretty-printed on MongoDB shell as follows:


In case you are interested in how to use Hadoop on AWS with R then have a look at …

MapReduce with R on Hadoop and Amazon EMR


(original article published on www.joyofdata.de)

2 thoughts on “MongoDB – State of the R

  1. Excellent article!

    There is another method (full disclosure – I work for this company) – see:

    – URL: jsonstudio.com/wp-content/uploads/2014/04/manual141/_build/html/sonarr.html

    – URL: jsonstudio.com/r-mongodb/

Comments are closed.