In this tutorial I am going to show you how to set up CUDA 7, cuDNN, caffe and DIGITS on a g2.2xlarge EC2 instance (running Ubuntu 14.04 64 bit) and how to get started with DIGITS. For illustrating DIGITS’ application I use a current Kaggle competition about detecting diabetic retinopathy and its state from fluorescein angiography.
Convolutional Deep Neural Networks for Image Classification
For classification or regression on images you have two choices:
- Feature engineering and upon that translating an image into a vector
- Relying on a convolutional DNN to figure out the features
Deep Neural Networks are computationally quite demanding. This is the case for two reasons:
- The input data is much larger if you use even a small image resolution of 256 x 256 RGB-pixel implies 196’608 input neurons (256 x 256 x 3). If you engineer your features intelligently then a 1000 neurons would be a lot already.
- Saddling the network with the burden of figuring out the relevant features also requires a more sophisticated network structure and more layers.
Luckily many of the involved floating point matrix operations have been unintentionally addressed by your graphic card’s GPU.
NVIDIA DIGITS and caffe
There are three major GPU utilizing Deep Learning frameworks available – Theano, Torch and caffe. NVIDIA DIGITS is a web server providing a convenient web interface for training and testing Deep Neural Networks based on caffe. I intend to cover in a future article how to work with caffe. Here I will show you how to set up CUDA
First of all you need an AWS account and g2.2xlarge instance up and running. That is mostly self-explanatory – for the command line parts (and some tips) you might want to have a look at my previous tutorial “Guide to EC2 from the Command Line“. Make sure to add an inbound rule for port 5000 for your IP – b/c this is where the DIGITS server is made available at.
1 2 3 |
# don't forget to get your system up to date sudo apt-get update sudo apt-get dist-upgrade |
Installing CUDA 7
Main source for this step is Markus Beissinger’s blog post on setting up Theano.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
# installation of required tools sudo apt-get install -y gcc g++ gfortran build-essential \ git wget linux-image-generic libopenblas-dev python-dev \ python-pip python-nose python-numpy python-scipy # downloading the (currently) most recent version of CUDA 7 sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb # installing CUDA sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb sudo apt-get update sudo apt-get install cuda # setting the environment variables so CUDA will be found echo -e "\nexport PATH=/usr/local/cuda/bin:$PATH" >> .bashrc echo -e "\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64" >> .bashrc sudo reboot # installing the samples and checking the GPU cuda-install-samples-7.0.sh ~/ cd NVIDIA\_CUDA-7.0\_Samples/1\_Utilities/deviceQuery make ./deviceQuery |
Installing cuDNN
To further speed up deep learning relevant calculations it is a good idea to set up the cuDNN library. For that purpose you will have to get an NVIDIA developer account and join the CUDA registered developer program. The last step requires NVIDIA to unlock your account and that might take one or two days. But you can get started also without cuDNN library. As soon as you have the okay from them – download cuDNN and upload it to your instance.
1 2 3 4 5 6 7 |
# unpack the library gzip -d cudnn-6.5-linux-x64-v2.tar.gz tar xf cudnn-6.5-linux-x64-v2.tar # copy the library files into CUDA's include and lib folders sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda-7.0/include sudo cp cudnn-6.5-linux-x64-v2/libcudnn* /usr/local/cuda-7.0/lib64 |
Installing caffe
Main source for this and the following step is the readme of the DIGITS project.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
sudo apt-get install libprotobuf-dev libleveldb-dev \ libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev \ libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler \ libatlas-base-dev # the version number of the required branch might change # consult https://github.com/NVIDIA/DIGITS/blob/master/README.md git clone --branch v0.11.0 https://github.com/NVIDIA/caffe.git cd ~/caffe/python for req in $(cat requirements.txt); do sudo pip install $req; done cd ~/caffe cp Makefile.config.example Makefile.config # check that USE_CUDNN is set to 1 in case you would # like to use it and to 0 if not make all make py make test make runtest echo -e "\nexport CAFFE_HOME=/home/ubuntu/caffe" >> ~/.bashrc # load the new environmental variables bash |
Installing DIGITS
1 2 3 4 5 |
cd ~ git clone https://github.com/NVIDIA/DIGITS.git digits cd digits sudo apt-get install graphviz gunicorn for req in $(cat requirements.txt); do sudo pip install $req; done |
Starting and Configuring DIGITS
The first time you start DIGITS it will ask you number of questions for the purpose of its configuration. But those settings are pretty much self-explanatory and you can change them afterwards in ~/.digits/digits.cfg . You might want to consider locating your job-directory ( jobs_dir) on an EBS – the data set of about 140’000 PNGs in the example I feature here consumes about 10 GB of space and the trained models (with all its model snapshots) accounts for about 1 GB.
1 2 3 4 5 |
# change into your digits directory cd digits # start the server ./digits-devserver |
Troubleshooting DIGITS
When you start DIGITS for the first time you might run into a number of errors and warnings. Here’s my take on them.
1 2 3 4 |
"libdc1394 error: Failed to initialize libdc1394" # no big deal - either ignore or treat symptomatically sudo ln /dev/null /dev/raw1394 |
1 2 3 4 5 6 7 8 9 10 |
"Gtk-WARNING **: Locale not supported by C library." # not sure how serious this is - but it is easy to resolve sudo apt-get install language-pack-en-base sudo dpkg-reconfigure locales # check what locales are available and then ... locale -a # ... set LC_ALL to it echo -e "\nexport LC_ALL=\"en_US.utf8\"" >> ~/.bashrc |
1 2 3 4 5 |
"Gdk-CRITICAL **: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed" # this is a big deal and will cause the server start up to fail: # connect with ssh flags -Xi ssh -Xi ... |
1 2 3 4 5 6 |
"Couldn't import dot_parser, loading of dot files will not be possible." # reinstall pyparsing: sudo pip uninstall pyparsing sudo pip install pyparsing==1.5.7 sudo pip install pydot |
Getting Started with DIGITS
First you have to create the data set on which you want to train a model. You have to provide at least one large set of pictures for the training and optionally two smaller sets for validation and testing. You can either separate those sets (and their correct labels) by means of different folders or – what I’d recommend – by providing corresponding CSVs. Those CSVs are supposed to feature two unnamed tab separated columns. The first column keeps the full path of the image (don’t use
~ for home, but the its path equivalent) and the second column keeps a 0-based index referencing the correct class. You will also have to provide a text file holding the different classes – one per line.
For example if you have two classes “pos” (1st line) and “neg” (2nd line) – then an image belonging to class “pos” would have to have a class index of 0 associated with it. Loading might take a while. Loading my 140’000 PNGs with 256×256 resolution took about one hour.
Setting up the model you intend to train is even easier provided you stick with the suggested defaults – just choose the data set you want to use, a network and you’re ready to go! Training a GoogLeNet for 30 epochs on the described data set took about one day and 6 hours. This is why you should make sure that …
- … your bidding for a Spot instance is not too low – or you risk it being terminated
- … you start the server in tmux session. Otherwise if you lose connection – maybe b/c your IP changes over night – the server process will be killed
Tackling the Diabetic Retinopathy Kaggle challenge
The provided training set consists of about 35 thousand images of high resolution – zipped and split accross five files. The whole zip archive is about 33 GB large. I downloaded the five components directly onto an EBS using lynx – b/c you can just regularly log on and initiate the download. The download speed on the g2.2xlarge instance btw was incredible – you are granted up to 100 MB per second. I started all five downloads in parallel – each going at 6 MB per second. And yes, its mega byte – not mega bit (the unit DSL providers use).
The visible indicators of diabetic retinopathy are as I understand it mostly leaking (aneurysms) and pathologically growing blood vessels. I figure those features are mirror and rotation invariant. So to increase the available training set I created four versions:
- (A): As is but resized to 256×256 pixels and saved as PNG
- (R): 180 degree rotation of (A)
- Vertical mirroring of (A)
- Vertical mirroring of (R)
Because the task at hand is obviously not a classification but a regression I abstained from attempting to learn a classification into no DR and the four stages of DR. I labelled all DR cases as “positive” and the no-DR cases respectively as “negative”. This would have to be done for all four possible splits ({0} vs {1,…,4}, …, {0,…,3},{4}) and those predictions would finally be regressed against the actual stage.
The bash script for this transformation you may find on bash commands for the processing.
The Result
Well … on one hand I would have liked to see a higher accuracy – on the other hand I can barely (if at all) make out the difference between some healthy cases and some extreme stage four cases. As 73.95% is the share of negative cases – this is also were the accuracy of the network started out at. In the course of 30 epochs it improved about 8 p.p. to 81.8%.
Any Questions?
I highly recommend the DIGITS Google Group for your questions on features and issues. The developers of DIGITS are very helpful and open for suggestions.
(original article published on www.joyofdata.de)
Raffael – Awesome Post! Thanks for clarifying the DIGIT input CSV for image classification is tab delimited!
Bill
Bill, I do what I can … :)
Great article!
In your discussion you mention that this is clearly not a classification problem. Can you explain why setting up DIGITS to simply perform a 5 class classification would be inappropriate here?
Also can you explain in more detail your analysis? It sounds like you performed 4 different binary experiments via DIGITS (0 vs 1, 0 vs 2, 0 vs 3, 0 vs 4) then used those as inputs into a regression? Is that right? Thanks!
Hi Andy, wow, this text is too old – I honestly just remember that I did not do a regression. Just simple classification learning with DIGITS. Cheers, Raffael
Thanks Raffael for the awesome post!!
I got all the way through the installation until i spun up DIGITS and got the
‘Gdk-CRITICAL **: gdk_cursor_new_for_display: assertion ‘GDK_IS_DISPLAY (display)’ failed’ error.
As you suggest i tried
ssh -i DIGITS.pem ubuntu@ec2-xx-xx-xx-xx.eu-west-1.compute.amazonaws.com
and i was able to get back in but when i tried to run it again i get the same error. any suggestions appreciated.
Thanks! -rob
i reinstalled everything from the beginning and it now works fine!! thanks again for the great tutorial!! -rob
Great article!
FYI, to fix the dot_parser warning all you have to do is “pip install pydot2.” Or you can just ignore the warning – pydot works just fine despite the warning.
Hi Luke, thanks a lot for this advice! And of course, thanks for developing digits. I’m looking forward to future versions.