GPU Powered DeepLearning with NVIDIA DIGITS on EC2

In this tutorial I am going to show you how to set up CUDA 7, cuDNN, caffe and DIGITS on a g2.2xlarge EC2 instance (running Ubuntu 14.04 64 bit) and how to get started with DIGITS. To illustrate DIGITS in practice I use a current Kaggle competition on detecting diabetic retinopathy and its stage from fluorescein angiography.

Convolutional Deep Neural Networks for Image Classification

For classification or regression on images you have two choices:

  • Engineering features by hand and using them to translate an image into a vector
  • Relying on a convolutional DNN to figure out the features

Deep Neural Networks are computationally quite demanding. This is the case for two reasons:

  • The input data is much larger: even a small image resolution of 256 x 256 RGB pixels implies 196’608 input neurons (256 x 256 x 3). If you engineer your features intelligently, then 1000 neurons would already be a lot.
  • Saddling the network with the burden of figuring out the relevant features also requires a more sophisticated network structure and more layers.
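
The input size quoted in the first bullet follows directly from the image dimensions. A minimal sketch of the arithmetic (the function name is only illustrative and not part of any framework):

```python
def input_neurons(width, height, channels=3):
    """Number of input values when every pixel channel feeds one input neuron."""
    return width * height * channels

# A 256 x 256 RGB image has three channels per pixel.
n_raw = input_neurons(256, 256)
print(n_raw)          # 196608

# Ratio to a hand-engineered feature vector of 1000 entries.
print(n_raw / 1000)   # 196.608
```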

Luckily, many of the floating point matrix operations involved are, more or less by accident, exactly what your graphics card’s GPU is good at.


There are three major GPU-utilizing Deep Learning frameworks available: Theano, Torch and caffe. NVIDIA DIGITS is a web server providing a convenient web interface for training and testing Deep Neural Networks based on caffe. I intend to cover how to work with caffe in a future article. Here I will show you how to set up CUDA

First of all you need an AWS account and a g2.2xlarge instance up and running. That is mostly self-explanatory; for the command line parts (and some tips) you might want to have a look at my previous tutorial “Guide to EC2 from the Command Line”. Make sure to add an inbound rule for port 5000 for your IP, because this is where the DIGITS server is made available.

Installing CUDA 7

The main source for this step is Markus Beissinger’s blog post on setting up Theano.

Installing cuDNN

To further speed up deep-learning-related calculations it is a good idea to set up the cuDNN library. For that purpose you will have to get an NVIDIA developer account and join the CUDA registered developer program. The last step requires NVIDIA to unlock your account, which might take one or two days. You can also get started without the cuDNN library, though. As soon as you have the okay from them, download cuDNN and upload it to your instance.

Installing caffe

The main source for this and the following step is the readme of the DIGITS project.

Installing DIGITS

Starting and Configuring DIGITS

The first time you start DIGITS it will ask you a number of questions for the purpose of its configuration. Those settings are pretty much self-explanatory, and you can change them afterwards in ~/.digits/digits.cfg. You might want to consider locating your job directory (jobs_dir) on an EBS volume: the data set of about 140’000 PNGs in the example I feature here consumes about 10 GB of space and the trained model (with all its snapshots) accounts for about 1 GB.

Troubleshooting DIGITS

When you start DIGITS for the first time you might run into a number of errors and warnings. Here’s my take on them.

Getting Started with DIGITS

First you have to create the data set on which you want to train a model. You have to provide at least one large set of pictures for training and optionally two smaller sets for validation and testing. You can either separate those sets (and their correct labels) by means of different folders or, as I’d recommend, by providing corresponding CSVs. Those CSVs are supposed to feature two unnamed, tab-separated columns. The first column holds the full path of the image (don’t use ~ for home, but its full path equivalent) and the second column holds a 0-based index referencing the correct class. You will also have to provide a text file listing the different classes, one per line. For example, if you have two classes “pos” (1st line) and “neg” (2nd line), then an image belonging to class “pos” would have to have a class index of 0 associated with it. Loading might take a while: loading my 140’000 PNGs with 256×256 resolution took about one hour.
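
The file layout described above can be produced with a few lines of Python. This is only a sketch of the format as described in the text, not code from the DIGITS project; the function name and the file names in the usage note are hypothetical.

```python
import csv
import os

def write_digits_inputs(samples, classes, csv_path, labels_path):
    """Write a tab-separated image list and a one-class-per-line labels file.

    samples: iterable of (image_path, class_name) pairs
    classes: list of class names; the position in the list is the 0-based index
    """
    index_of = {name: i for i, name in enumerate(classes)}

    # Labels file: one class name per line, in index order.
    with open(labels_path, "w") as f:
        for name in classes:
            f.write(name + "\n")

    # Image list: two unnamed columns (no header row), separated by a tab.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for image_path, class_name in samples:
            # Expand "~" and make the path absolute, since full paths are required.
            full_path = os.path.abspath(os.path.expanduser(image_path))
            writer.writerow([full_path, index_of[class_name]])
```

Calling `write_digits_inputs([("~/images/a.png", "pos"), ("~/images/b.png", "neg")], ["pos", "neg"], "train.csv", "labels.txt")` would write index 0 for the first image and index 1 for the second.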

Setting up the model you intend to train is even easier provided you stick with the suggested defaults – just choose the data set you want to use, a network and you’re ready to go! Training a GoogLeNet for 30 epochs on the described data set took about one day and 6 hours. This is why you should make sure that …

  • … your bid for a Spot instance is not too low, or you risk it being terminated
  • … you start the server in a tmux session. Otherwise, if you lose the connection (maybe because your IP changes overnight), the server process will be killed

Tackling the Diabetic Retinopathy Kaggle challenge

The provided training set consists of about 35 thousand high-resolution images, zipped and split across five files. The whole zip archive is about 33 GB in size. I downloaded the five parts directly onto an EBS volume using lynx, because you can just log on as usual and initiate the download. The download speed on the g2.2xlarge instance was incredible, by the way: you are granted up to 100 MB per second. I started all five downloads in parallel, each going at 6 MB per second. And yes, it’s megabytes, not megabits (the unit DSL providers use).
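
To put these numbers into perspective, here is a rough back-of-the-envelope estimate. It assumes the rates stay constant over the whole download and that 1 GB equals 1000 MB, so treat the result as an order of magnitude rather than a measurement.

```python
archive_gb = 33           # total size of the zipped training set
parallel_downloads = 5    # one download per archive part
mb_per_s_each = 6         # megabytes per second per download

total_mb_per_s = parallel_downloads * mb_per_s_each
total_mbit_per_s = total_mb_per_s * 8                 # 1 byte = 8 bits
minutes = archive_gb * 1000 / total_mb_per_s / 60     # assumes 1 GB = 1000 MB

print(total_mb_per_s, total_mbit_per_s, round(minutes))   # 30 240 18
```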

The visible indicators of diabetic retinopathy are, as I understand it, mostly leaking (aneurysms) and pathologically growing blood vessels. I figure those features are mirror and rotation invariant. So to increase the available training set I created four versions:

  • (A): As is but resized to 256×256 pixels and saved as PNG
  • (R): 180 degree rotation of (A)
  • Vertical mirroring of (A)
  • Vertical mirroring of (R)
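
The four versions can be illustrated on a tiny image stored as a list of pixel rows. This is a sketch only: the actual processing was done with bash commands, resizing is left out, and “vertical mirroring” is assumed here to mean mirroring about the vertical axis (left to right).

```python
def rotate_180(image):
    """Rotate a row-major image (list of rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]

def mirror(image):
    """Mirror left to right, i.e. about the vertical axis (assumed meaning)."""
    return [row[::-1] for row in image]

def four_versions(image):
    """Return (A), (R), mirrored (A) and mirrored (R), in that order."""
    a = [list(row) for row in image]
    r = rotate_180(a)
    return [a, r, mirror(a), mirror(r)]

tiny = [[1, 2],
        [3, 4]]
for variant in four_versions(tiny):
    print(variant)
```

For an image without symmetries all four variants differ, which is what quadruples the training set.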

Because the task at hand is obviously not a classification but a regression, I abstained from attempting to learn a classification into no DR and the four stages of DR. Instead I labelled all DR cases as “positive” and the no-DR cases as “negative”. This would have to be done for all four possible splits ({0} vs {1,…,4}, …, {0,…,3} vs {4}), and those predictions would finally be regressed against the actual stage.
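
The labelling for the four splits can be sketched as follows. The function name is hypothetical, and the code only illustrates the scheme described above; only the first split ({0} vs {1,…,4}) corresponds to the positive/negative labels actually used here.

```python
def binary_splits(stage, max_stage=4):
    """Binary targets for every split {0..k-1} vs {k..max_stage}, k = 1..max_stage.

    Entry k-1 is 1 if the stage lies in the upper group of split k, else 0.
    """
    return [1 if stage >= k else 0 for k in range(1, max_stage + 1)]

for stage in range(5):
    print(stage, binary_splits(stage))
```

A stage of 0 yields four zeros, a stage of 4 yields four ones, and the sum of the four targets recovers the stage.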

You can find the bash script for this transformation under “bash commands for the processing”.


The Result

Well … on the one hand I would have liked to see a higher accuracy; on the other hand I can barely (if at all) make out the difference between some healthy cases and some extreme stage four cases. As 73.95% is the share of negative cases, this is also where the accuracy of the network started out. In the course of 30 epochs it improved by about 8 p.p. to 81.8%.
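
The starting accuracy equals the share of negative cases because a model that always predicts the most frequent class is right exactly that often. A minimal sketch with toy labels (not the competition data):

```python
def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Toy labels: 3 of 4 cases are negative.
print(majority_baseline(["neg", "neg", "neg", "pos"]))   # 0.75

# Gain over the baseline for the figures reported above, in percentage points.
print(round(81.8 - 73.95, 2))   # 7.85
```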

Any Questions?

I highly recommend the DIGITS Google Group for your questions on features and issues. The developers of DIGITS are very helpful and open to suggestions.

(original article published on

8 thoughts on “GPU Powered DeepLearning with NVIDIA DIGITS on EC2”

  1. Raffael – Awesome Post!  Thanks for clarifying the DIGIT input CSV for image classification is tab delimited!


  2. Great article!

    In your discussion you mention that this is clearly not a classification problem. Can you explain why setting up DIGITS to simply perform a 5 class classification would be inappropriate here?

    Also can you explain in more detail your analysis? It sounds like you performed 4 different binary experiments via DIGITS (0 vs 1, 0 vs 2, 0 vs 3, 0 vs 4) then used those as inputs into a regression? Is that right? Thanks!

    • Hi Andy, wow, this text is too old – I honestly just remember that I did not do a regression. Just simple classification learning with DIGITS. Cheers, Raffael

  3. Thanks Raffael for the awesome post!!

    I got all the way through the installation until i spun up DIGITS and got the

    Gdk-CRITICAL **: gdk_cursor_new_for_display: assertion ‘GDK_IS_DISPLAY (display)’ failed’  error.

    As you suggest i tried
    ssh -i DIGITS.pem
    and i was able to get back in but when i tried to run it again i get the same error.   any suggestions appreciated.
    Thanks!   -rob

    • i reinstalled everything from the beginning and it now works fine!!  thanks again for the great tutorial!! -rob

  4. Great article!

    FYI, to fix the dot_parser warning all you have to do is “pip install pydot2.” Or you can just ignore the warning – pydot works just fine despite the warning.
