Parallel computing in R on Windows and Linux using doSNOW and foreach

"Sarah and Patti Chasing Me 2" by Martin Male [CC BY 2.0]

by Martin Male [CC BY 2.0]

Who would’ve thought that parallel processing in R is THAT simple? Thanks to a collection of packages, having a task use all available cores is a cinch. I call an Intel i7-3632QM my own – which means 4 physical cores, each providing 2 virtual cores, running at around 3 GHz. So let’s have a look at how this feat is accomplished.

The method of parallelization applied here works by starting a specified number of regular R processes and having those run on the available cores. The launching and stopping of those processes and the assignment of tasks are controlled from the original R process; the assignment of the processes to the cores is handled by the OS. Because this method can be extended to cores beyond one CPU or even one computer, the packages use the acronym SNOW for “Simple Network Of Workstations”.

All those complexities are taken care of by three packages (a minimal setup sketch follows the list):

  1. snow: for specifying the cluster
  2. doSNOW: for registering the cluster
  3. foreach: for splitting up a series of computations so they can be assigned to separate cores
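
A minimal sketch of how the three packages play together might look like this – the cluster size of 8 matches my 8 virtual cores, so adjust it to your machine:

    library(snow)      # makeCluster() / stopCluster()
    library(doSNOW)    # registerDoSNOW()
    library(foreach)   # foreach() and %do% / %dopar%

    cl <- makeCluster(8, type = "SOCK")   # start 8 worker R processes
    registerDoSNOW(cl)                    # make the cluster known to %dopar%

    # trivial check: each iteration is handled by one of the worker processes
    foreach(i = 1:8) %dopar% Sys.getpid()

    stopCluster(cl)                       # shut the worker processes down again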

The two scripts

The two scripts do the same thing except for one detail. The first one stores the results of the SHA-256 hashings in a data frame – which is actually quite inefficient, because the data frame indexes the newly added elements every time – the second one doesn’t. So the first one is more realistic but mixes computation and IO; the second one is narrowed down to the computation alone. The comparison helps with interpreting the results.

In both cases the calculation is located in a function and executed serially within it. The function itself is called via foreach(), which takes care of feeding the individual function calls to the available cores. The total number of hashings is constant per run – 100’000 times in the version storing the results and 1’000’000 times in the version not bothering with storage. What varies is the number of hashings done within the function per call (M) and, correspondingly, the number of times the function is called (N) – hence N x M = 100’000 or 1’000’000.
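
To make that parameterization concrete – the numbers are just an example of mine:

    # the total number of hashings per run is fixed
    total <- 100000      # 1'000'000 for the version without storage
    M     <- 500         # hashings performed serially within one function call
    N     <- total / M   # = 200 function calls that foreach() feeds to the cores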

The reason why I chose SHA-256 is that, as far as I know, it is by design computationally non-trivial and rather heavy. That’s all. Actually there might be better choices for that purpose – you are welcome to tell me if you know more about that!

This type of parallelism, where no communication and synchronization between the parallelized tasks takes place, is by the way referred to as “embarrassingly parallel”. According to the Wikipedia article it is called that because it would be embarrassing not to take advantage of such an obvious choice. How true!

So what is going on?

What can we take from those charts? First of all that there is a non-trivial minimum, which leads to the conclusion that it makes sense to look for the sweet spot before it’s crunch time. Now why is there an optimal number of sub-cycles (M) at all? Given that the performance improvement is much higher for the IO-heavy version, I guess this indicates that the bottleneck is either the RAM or the avid writing to a data frame. It seems that the read/write processes are more efficient when the tasks are split up properly. Actually, if you look at the “with storage” version, the best configuration – 8 cores and M=500 – improves the run time by an unbelievable factor of about 100 compared to the stupid way of doing it.

Another observation that at first dazzled me but becomes quite obvious after thinking about it is that when you try to parallelize the single execution of a hashing, for example, no parallelization is going to happen at all. Something like this …
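
(sketched here with digest() doing the hashing – one single hashing per foreach iteration)

    # one hash per iteration – each individual task is far too small
    foreach(i = 1:100000, .packages = "digest") %dopar% digest(i, algo = "sha256")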

… is going to keep no more than a single core busy. This is because – at least that is my theory – the method we apply here needs a master process to instruct the child processes. But when a child process finishes too quickly, the master process will be kept busy instructing it again and again instead of setting up and dealing with further child processes.

The fact that the performance boost from running the no-storage version with eight cores instead of two is merely 2.4 – not even close to 4 – indicates that this approach certainly hasn’t reached the end of the flagpole yet.

With storage of results in a data frame
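
A sketch of my own of how this variant can be put together – the function and variable names are mine and the original script may differ in detail:

    library(digest)

    # performs M hashings and stores every result in a growing data frame
    hash.and.store <- function(i, M) {
      df <- data.frame()
      for (m in 1:M) {
        h  <- digest(paste(i, m), algo = "sha256")
        df <- rbind(df, data.frame(i = i, m = m, hash = h,
                                   stringsAsFactors = FALSE))
      }
      df
    }

    M <- 500; N <- 100000 / M   # N x M = 100'000 hashings in total

    # the N partial data frames are combined with rbind() into one result
    res <- foreach(i = 1:N, .combine = rbind, .inorder = FALSE,
                   .packages = "digest") %dopar% hash.and.store(i, M)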

A few notes about foreach()

For details about how to use foreach I recommend reading this manual. The first part of the provided argument – “i=1:n” in this case – tells foreach how often the statement following %do% / %dopar% has to be executed, serially or in parallel respectively. i is also made available within that statement.
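
A toy example just to show the mechanics:

    # i runs from 1 to 4; the statement after the operator is evaluated once per value
    foreach(i = 1:4) %do%    sqrt(i)   # executed serially in the master process
    foreach(i = 1:4) %dopar% sqrt(i)   # executed in parallel on the registered cluster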

All data structures currently present in the workspace are provided to and made available in the created child processes. But if you want to use functions from non-core packages – like digest() in this case – you have to tell foreach() about that using the parameter “.packages”.

Because you somehow need to get the results from the child processes back to the master, foreach collects the returned values. By default it returns them as a list. In this case I specify that the returned values are to be combined using rbind(), because I am returning a data frame.
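
The difference in a nutshell (a toy example of mine):

    # default: a list holding the four returned data frames
    foreach(i = 1:4) %do% data.frame(i = i, sq = i^2)

    # with .combine = rbind: one data frame with four rows
    foreach(i = 1:4, .combine = rbind) %do% data.frame(i = i, sq = i^2)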

.inorder tells foreach() whether or not to preserve the specified order when combining the results. Because in our case this is not necessary, and because I would guess this preservation takes computational effort, I refrain from it.

Without storing the results
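
Again a sketch of my own of what this variant boils down to – the hashes are computed and immediately discarded:

    library(digest)

    # performs M hashings and keeps nothing
    hash.only <- function(i, M) {
      for (m in 1:M) digest(paste(i, m), algo = "sha256")
      NULL
    }

    M <- 500; N <- 1000000 / M   # N x M = 1'000'000 hashings in total

    # no .combine needed – the returned list of NULLs is of no interest
    invisible(foreach(i = 1:N, .packages = "digest") %dopar% hash.only(i, M))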

At the end of the day …

… this article is a starting point for parallelized programming. It is obvious that there is a lot more to learn in this area and more articles will follow. If you have any insights, corrections or advice to share – please do not hesitate to write a comment!


Update February 19 2014

I used foreach() on Linux / Ubuntu 13.10 64-bit as well and it works like a charm.


(original article published on www.joyofdata.de)
