Every year during the holidays I find myself a little technical project outside of my normal SAP world to try out something new, or look into innovation topics that may not have immediate relevance to my job, but might in the longer term. This vacation period – after focusing in 2014 a lot on Big Data and predictive analytics topics – I decided to take a look at deep learning and neural networks.
There have been major breakthroughs in deep learning in the last few years, with perhaps two major milestones being the 2012 DNNResearch success in the ImageNet Large Scale Visual Recognition Challenge, followed by GoogLeNet's win in 2014, about which you can read more here. The leading lights in academia in the field now work for companies like Google, Facebook and Baidu. Interestingly enough, it appears that rather than a major change in approach, progress has primarily come from larger training sets (“Big Data”), as well as more computing power, including clustering and GPU processing. For more context you may want to read this.
In any case, I started playing with this by going through some MNIST-based tutorials, which handle hand-written numbers – for instance, to read postal codes on envelopes. This is a great way to start, and modern deep learning techniques can achieve 98-99% accuracy on that dataset. However, it is not the most challenging task, and why not take it a bit further?
This is when serendipity hit. On December 15th, Kaggle started the National Data Science Bowl competition (which runs till the end of March 2015). The competition consists of classifying images of ocean plankton into 121 different classes, with a supplied training set of around 30,000 labeled images, and a test set of 130,000 for which you have to provide the classification. The images are black and white and come in different sizes and shapes, with widths and heights ranging roughly from 30 pixels to over 200 pixels. This is a real-world problem to tackle, and the leaderboard lets you track your progress as well as how you do compared to others.
Installation and preparing environment
I’ve been in the industry for a while now, and have come to recognize that the state of the tool chain is often a good indication of how mature or how new a particular technical area is. When I started with Java in the 90s, a basic text editor and javac were all we had – no fancy IDEs or frameworks. My initial experience with Hadoop was not that different. Similarly, there aren’t (yet) any easy point-and-click environments for deep learning.
I am relatively handy with Python, but had never used NumPy, SciPy, or scikit-learn before, which are quite major extensions to the language, let alone Theano. Luckily, how to install Theano, including all of the dependencies and the required configuration to get your models running through C packages and on the GPU for substantial acceleration, is very well described here. Be prepared, though, to spend a considerable amount of time on this. I started with a new Linux install and patched it to the latest and greatest, then went through the NumPy, SciPy and OpenBLAS configuration, and compiled an NVIDIA kernel driver, before even getting to Theano. Finally, once Theano is installed, it will run through about 1-2 hours of tests to ensure everything is working and utilizing the GPU. All in all, it took me about a day to get everything working and run my first test. To give a sense of how important GPU acceleration is if you have access to it, I ran an early test on a separate Linux desktop without a GPU, and the same model that took 24 hours to run on the CPU ran in about 20 minutes on the GPU (not entirely a fair comparison, as the PC with the GPU also has a much better CPU, but it gives a sense of how beneficial – crucial, even – GPU acceleration is for deep learning).
My Convolutional Neural Network
I went through a couple of iterations, but ended up with the model below for my best submission to the competition (so far). The model uses 4 convolutional layers, each followed by subsampling (or downsampling). The output of the fourth layer is flattened and fully connected to 1024 hidden nodes (i.e. each layer 4 node connects to each hidden layer node), which are in turn fully connected to 121 output nodes, each representing a single plankton class. The nodes use rectified linear units (ReLUs) rather than sigmoids as activations, and the model uses dropout.
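To make the difference between the two activation functions concrete, here is a minimal NumPy sketch. The function names are illustrative and not taken from the actual model code.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: pass positive values through, clamp negatives to zero.
    return np.maximum(x, 0.0)

def sigmoid(x):
    # Logistic sigmoid: squashes any input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # negatives become 0, positives are unchanged
print(sigmoid(x))  # every value ends up between 0 and 1
```

Unlike the sigmoid, the rectifier does not flatten out for large positive inputs, which is one commonly cited reason it trains well in deeper networks.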
You can read more about how convolutional networks work in this introduction, but the main idea is this: run the source image through a series of filters in order to detect features that may identify individual plankton types. These filters are learned during training through back propagation, and you can see an example below of the filters of the first convolutional layer. In the top image you see the state of these 32 filters after one epoch of training (i.e. after processing one round of the training set), and in the bottom one the result after the final training epoch.
You can see that it is still rather random at first, but after 65 training epochs, there are distinct “shapes”, as for instance the very last one represents a bottom-left to top-right edge. The source image is run through these filters to create 32 different “feature maps”. Each of these goes through the process again, resulting in 64, 128 and finally 256 feature maps, each run through 3×3 filters, followed by a 2×2 max pool subsampling, where the highest value is retained.
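The filter-then-subsample step can be sketched in plain NumPy as follows. This is an illustration only: the real model runs on Theano, the helper names are hypothetical, and the sketch assumes no padding around the image.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image (no padding) and take a weighted sum
    # at each position. As in most deep learning libraries' convolution
    # layers, the kernel is applied without flipping it first.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    # Keep only the highest value in each non-overlapping 2x2 block.
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

# A hand-made 3x3 filter that responds to bottom-left to top-right edges.
edge_filter = np.array([[0.0, 0.0, 1.0],
                        [0.0, 1.0, 0.0],
                        [1.0, 0.0, 0.0]])
image = np.random.default_rng(0).random((48, 48))
feature_map = max_pool_2x2(conv2d_valid(image, edge_filter))
print(feature_map.shape)
```

In the trained network the filter values are learned rather than hand-made, and each layer applies many such filters across all feature maps of the previous layer.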
Finally, these final feature maps are connected to the hidden layer, which itself connects to the 121 output nodes.
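As a rough sanity check on the size of this architecture, the following sketch counts weights and biases layer by layer. It rests on assumptions that the text does not confirm: a single-channel 48×48 input, convolutions padded so that only the 2×2 pooling shrinks the feature maps, and one bias per filter and per node. The function name is hypothetical.

```python
def count_parameters(input_size=48, input_channels=1,
                     conv_channels=(32, 64, 128, 256),
                     filter_size=3, pool_size=2,
                     hidden_nodes=1024, classes=121):
    size, in_ch, total = input_size, input_channels, 0
    for out_ch in conv_channels:
        # One filter_size x filter_size kernel per (input, output) map pair,
        # plus one bias per output map.
        total += in_ch * out_ch * filter_size * filter_size + out_ch
        # Assumed "same" padding: only the max pool halves width and height.
        size //= pool_size
        in_ch = out_ch
    flattened = in_ch * size * size
    total += flattened * hidden_nodes + hidden_nodes  # flattened -> hidden layer
    total += hidden_nodes * classes + classes         # hidden -> output layer
    return size, flattened, total

size, flattened, total = count_parameters()
print(size, flattened, total)  # 3 2304 2872185
```

Under these assumptions, most of the parameters sit in the fully connected layer between the flattened feature maps and the 1024 hidden nodes. A different padding choice would change the totals.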
Training the model and avoiding overfitting
To get any sense of how well your model is performing and whether your training is working, you need a validation set. Unfortunately, a training set of only 30,000 images, which is already not that large, leaves little room for one. I wanted to use 1/5th of the training data as a validation set, and that left only about 25,000 training images. I decided to double the training set by creating a duplicate of each training image, rotated by 90 degrees. I then resized the images to 48×48, and since the images aren’t perfectly square to begin with, the rotated image “skews” in a different way than the original. (In later tests I actually added white space to images before scaling them, avoiding any skewing, but that didn’t materially impact the result.) I finished the preparation by randomizing the order of the images and splitting them into a training set of around 50,000, and a validation set of around 10,000.
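The preparation steps described above can be sketched roughly as follows. This is not the original code: the helper names, the nearest-neighbour resize and the default validation fraction of 1/6 (chosen to match the approximate 50,000/10,000 split) are all assumptions made for illustration.

```python
import numpy as np

def resize_nearest(image, size=48):
    # Nearest-neighbour resize to size x size. Non-square inputs are
    # stretched, which is the "skew" mentioned in the text.
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows][:, cols]

def augment_and_split(images, labels, validation_fraction=1 / 6, seed=0):
    data, targets = [], []
    for image, label in zip(images, labels):
        # Keep the original and add a copy rotated by 90 degrees.
        for variant in (image, np.rot90(image)):
            data.append(resize_nearest(variant))
            targets.append(label)
    data, targets = np.stack(data), np.array(targets)
    # Randomize the order, then carve off the validation set.
    order = np.random.default_rng(seed).permutation(len(data))
    data, targets = data[order], targets[order]
    n_val = int(round(len(data) * validation_fraction))
    return (data[n_val:], targets[n_val:]), (data[:n_val], targets[:n_val])

# Tiny example with four dummy images of different shapes.
images = [np.zeros((30, 50)), np.ones((40, 20)), np.zeros((60, 60)), np.ones((35, 90))]
labels = [0, 1, 2, 3]
(train_x, train_y), (val_x, val_y) = augment_and_split(images, labels, validation_fraction=0.25)
print(train_x.shape, val_x.shape)
```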
We train on 64 images at a time (the batch size), and after all training images have been processed, we run a validation test to conclude an epoch. We then run multiple epochs until the validation set error (the “multi-class log loss” used in this competition) reaches its minimum. Judging a model by its performance on the validation set rather than the training set matters in all of statistics and machine learning, but for neural networks it is crucial, as they are notoriously susceptible to overfitting. (In fact, as Geoff Hinton mentions in the video on dropout above, “if your deep neural net is not overfitting you should be using a bigger one”.) Overfitting tends to occur when you have too many parameters, and neural networks practically always have a large number of parameters. This particular model, for instance, uses millions of them.
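For reference, the multi-class log loss is the average negative log of the probability the model assigns to the correct class. A minimal NumPy sketch follows. The function name and the clipping constant `eps` are assumptions; the clipping only guards against taking the log of zero.

```python
import numpy as np

def multiclass_log_loss(probabilities, true_labels, eps=1e-15):
    # probabilities: array of shape (n_samples, n_classes), one row per image.
    # true_labels: integer class index of the correct class for each image.
    p = np.clip(probabilities, eps, 1.0 - eps)
    p = p / p.sum(axis=1, keepdims=True)  # make each row sum to 1 again
    correct = p[np.arange(len(true_labels)), true_labels]
    return -np.mean(np.log(correct))

# A confident correct prediction scores close to 0,
# a confident wrong one is punished heavily.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.1, 0.8]])
print(multiclass_log_loss(probs, np.array([0, 2])))
print(multiclass_log_loss(probs, np.array([1, 0])))
```

Lower is better, and because of the logarithm a single confident mistake costs far more than a hesitant one.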
The diagram below illustrates this. As training progresses, the neural network will get better and better at predicting the training data. However, the better that gets, the less it will tend to generalize, and while initially the validation performance will follow the training data performance, eventually it will bottom out, and over time give worse performance, while the performance on the training set continues to improve.
Practically, this means monitoring the validation performance closely, detecting the “bottom”, and either manually (early-)stopping the training, or breaking out of the loop once the validation multi-class log loss hits a certain predetermined value (which you may or may not achieve!). Often this meant letting the model run for a long time, finding the lowest point (where performance no longer improves), then running it again and hoping that the next run replicates the prior result.
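One common way to automate this is "patience"-based early stopping: stop once the validation loss has not improved for a set number of epochs. This is a different rule from the manual monitoring and fixed threshold described above, and it is shown only as a sketch. The callables `train_epoch` and `validate`, and the default values, are hypothetical.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=200, patience=10):
    # train_epoch(): runs one pass over the training set.
    # validate(): returns the current validation log loss.
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            # New "bottom": remember it (a real run would also save the weights here).
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: assume we are past the bottom.
            break
    return best_epoch, best_loss

# Simulated validation curve that bottoms out at epoch 2 and then gets worse.
simulated = iter([3.0, 2.0, 1.5, 1.6, 1.7, 1.8, 1.9])
print(train_with_early_stopping(lambda: None, lambda: next(simulated), patience=3))
```

Saving the weights at each new best epoch would avoid having to rerun the training and hope the result repeats.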
(Note that using dropout, it is actually possible to get better validation results than training results, especially initially, since no dropout is used in the validation test, but eventually they will diverge as in the diagram)
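The difference between training and validation behaviour can be sketched as below. This uses "inverted" dropout, where the surviving activations are scaled up during training so that nothing needs rescaling at validation time. Whether the original model used this variant is an assumption, and the function name is illustrative.

```python
import numpy as np

def dropout(activations, drop_probability, training, rng):
    if not training:
        # Validation and test: every node is used, nothing is dropped.
        return activations
    keep_probability = 1.0 - drop_probability
    # Randomly switch nodes off and scale the survivors up, so the
    # expected activation is the same as without dropout.
    mask = rng.random(activations.shape) < keep_probability
    return activations * mask / keep_probability

rng = np.random.default_rng(0)
hidden = np.ones(8)
print(dropout(hidden, 0.5, training=True, rng=rng))   # a mix of 0.0 and 2.0
print(dropout(hidden, 0.5, training=False, rng=rng))  # unchanged
```

During training the network effectively works with only part of its nodes, which is why the training loss can initially look worse than the validation loss.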
So, how well did I do?
My best submission was based on a model that gave me a validation multi-class log loss of 1.154655, which resulted in 1.180529 on the test set on the Kaggle leaderboard. That represents something like a 65% accuracy rate (with 121 categories, uniform random guessing would be right around 0.8% of the time). At the time of submission that was good for 18th place, but that didn’t last as others submitted their own results. At the time of writing, I am in 66th place, which with 395 participants and teams still puts me roughly in the top sixth (top 17%). Frankly, I am delighted with that, considering I knew next to nothing about convolutional neural networks just two months ago.
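To put these numbers in context, the arithmetic behind the baseline and the ranking can be checked in a few lines. The uniform log loss figure is not from the competition pages; it simply follows from assigning each of the 121 classes a probability of 1/121.

```python
import math

n_classes = 121

# Guessing uniformly at random is right 1 time in 121.
uniform_accuracy = 1.0 / n_classes
# Predicting 1/121 for every class gives a log loss of -ln(1/121) = ln(121).
uniform_log_loss = math.log(n_classes)
# 66th place out of 395 participants and teams.
rank_fraction = 66 / 395

print(f"{uniform_accuracy:.2%}")   # 0.83%
print(f"{uniform_log_loss:.3f}")   # 4.796
print(f"{rank_fraction:.1%}")      # 16.7%
```

Seen this way, a log loss of about 1.18 is a long way below the roughly 4.8 that an uninformed uniform prediction would score.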
I tried a number of variations of this model, including larger image sizes, larger filter sizes, and even an additional convolutional layer. I also – as mentioned before – added white space to images to make them square before scaling, but that had next to no impact. My very best model actually beat the submitted result described in this blog, but only by a couple of hundredths; it took over 24 hours to run and wasn’t submitted. I am still running through a few more ideas, but fear the limits of my knowledge and understanding are holding me back. I am certainly curious to read how those in the top 10 approached the problem once their descriptions are out.
Using HANA for Neural Networks
I had a pretty beefy GPU at my disposal with 6GB of memory. Nevertheless, I frequently ran out of memory with my models, and had to implement “batching” for validation and test predictions to avoid Out of Memory issues. And this is not even anywhere near a really “big” neural network. A HANA instance has near-endless memory (compared to 6GB) and has many cores, so while it may not be able to reach GPU speed (my GPU has 2688 CUDA cores) it will certainly be able to accommodate large neural networks, and would therefore be able to process larger image sizes, additional convolutional or hidden neural layers or more complex models as for instance described in this video.
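The “batching” workaround amounts to feeding the images through the model in slices and stitching the outputs back together, so that only one slice has to fit in GPU memory at a time. A minimal sketch, in which `predict_fn` stands in for whatever compiled prediction function the model exposes (a hypothetical name, as is the default batch size):

```python
import numpy as np

def predict_in_batches(predict_fn, images, batch_size=64):
    outputs = []
    for start in range(0, len(images), batch_size):
        # Only this slice needs to be resident in (GPU) memory.
        batch = images[start:start + batch_size]
        outputs.append(predict_fn(batch))
    return np.concatenate(outputs, axis=0)

# Stand-in "model": returns uniform probabilities over 121 classes.
def dummy_predict(batch):
    return np.full((len(batch), 121), 1.0 / 121)

test_images = np.zeros((130, 48, 48))
predictions = predict_in_batches(dummy_predict, test_images)
print(predictions.shape)  # (130, 121)
```

The last batch is simply smaller than the others, so the image count does not need to be a multiple of the batch size.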
In fact, with SPS9, we can already implement back propagation neural networks in HANA, as described here and here. To my knowledge, no support has yet been built in for convolutional networks, but there is no reason my existing code wouldn’t run on HANA right now anyway. It will be interesting to explore SAP HANA further as a platform for deep learning in general, and convolutional networks in particular.
Note: You can now see the follow-up to this blog with the final results and architecture here, now that the competition has ended.