As described in an earlier blog post, I have been participating in the Kaggle National Data Science Bowl competition. This competition has now ended and the final results are in. Since I made considerable progress in the weeks after I wrote up my first blog on this, I wanted to show what I did to get to my final best results in the competition.
The top spot was taken by the Deep Sea team, all academics from the Reservoir research lab of Ghent University in Belgium, who wrote up their approach here. I would highly encourage you to read it, as I am still doing, since it mentions a number of techniques and approaches that never even occurred to me, and some that I don’t even understand (yet). My results were nowhere near theirs, but I still think it is valuable to show what I ended up with and my path towards it, as it was a fantastic learning process, and I believe an “intermediate” description – one that goes beyond tutorial level but is not yet as complex as the top 3-10 models – has value for others who are developing their deep learning skills. My own “breakthrough” came from reading academic papers that gave me new ideas to try, and hopefully this will help others as well – or at least give a sense of what is involved in improving convolutional neural networks for image classification tasks like this one.
Training set augmentation
My previous blog post ended with a multi-class log loss of 1.1805, and I was rapidly running out of ideas. The first thing I did was further expand the training set through rotations and mirroring. I tried a number of approaches, including trying to retain the aspect ratio of each image by first adding white space to make it square – images were of varying sizes, and rarely if ever square. Oddly, that actually performed worse than simply resizing the images (and therefore “skewing” them) to a fixed square size, so I went back to that. Eventually, I settled on the following approach:
- Rotate all images 45 degrees
- Remove all 45 degree images for the top 4 classes
- Rotate all images (including the 45 degree images) 90 degrees
- Rotate all images (including 45 and 90 degree images) by 180 degrees
For the 45-degree rotation I increased the canvas size, so that the rotated image “fits” fully inside the image boundary. This actually creates a “non-skewed” image, so for most plankton classes we have both skewed and non-skewed versions (albeit at a rotated angle). This gave me an overall training set of around 180,000–200,000 images, split randomly into a training (80%) and validation (20%) set, which together with my model just about fit into the VRAM of my GTX Titan GPU. I actually had to remove the 45-degree images from the top 4 classes for the dataset to fit. If I’d gone further with augmentation, I would have had to rewrite the code that handles the training set, and perhaps – like the winners did – generate augmented training images during training itself, rather than before training starts, in order to fit into GPU VRAM.
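The scheme above can be sketched roughly as follows with Pillow (a hypothetical reconstruction: the post doesn’t name the tooling, and the 48×48 target size, white background fill, and `augment` helper are my assumptions; the pruning of 45-degree images for the top 4 classes is left out):

```python
from PIL import Image

def augment(img, size=48):
    """Generate the eight rotation variants of a plankton image.

    Rotating by 45 degrees with expand=True grows the canvas so the
    rotated image fits fully inside the boundary (filled with white),
    giving a non-skewed variant. 90- and 180-degree rotations are then
    applied to all images produced so far, and everything is finally
    resized (and thereby skewed, if non-square) to a fixed square size.
    """
    variants = [img, img.rotate(45, expand=True, fillcolor=255)]
    for angle in (90, 180):
        variants += [v.rotate(angle, expand=True) for v in list(variants)]
    return [v.resize((size, size)) for v in variants]
```

Starting from one image this yields eight variants (0, 45, 90, 135, 180, 225, 270 and 315 degrees), which is consistent with the 180,000–200,000 figure once the top-4-class pruning is taken into account.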
Weeks of dead ends
To see if I could improve my model, I tried 5×5 and 7×7 convolutional filters in the otherwise identical 6-layer network described in the previous blog post, rather than the 3×3 filters I had used before. This only improved my model marginally, and in most cases actually made it substantially worse. However, one really good thing came out of it: by looking at the learned filters – which at 5×5 and 7×7 carry more information – I got a better sense of how they work. Looking at the layer 2 and layer 3 convolutional filters, for instance, I could see that these were indeed compressed representations of plankton, which at least confirmed that the concept was working.
Imagine a box with 500 million knobs, 1,000 light bulbs, and 10 million images to train it with. That’s what a typical Deep Learning system is. – Yann LeCun
But even after returning to 3×3 filters, I didn’t make a lot of progress. There are lots of hyper-parameters to fiddle with and knobs to twiddle, and most adjustments didn’t really help. I added layers, but as long as I had a “maxpool” after each convolutional layer, this just meant that the required image input size became larger and larger (up to 120×120 or so) without producing better results. I came to the realization that it wasn’t really the parameters that were the problem, but the architecture of the network itself.
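The size pressure from pooling is easy to check with a sketch: assuming “same”-padded convolutions (which preserve spatial size), each 2×2 maxpool halves the image, so every extra conv+pool block roughly doubles the input size needed for anything to survive to the top of the network. The helper below is mine, not from the original code:

```python
def output_size(input_size, num_pool_layers, pool=2):
    """Spatial size (per side) after a stack of conv+pool blocks.

    Assumes 'same'-padded convolutions, which keep the size unchanged,
    so only the pool x pool max-pools (integer division) shrink it.
    """
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool
    return size
```

A 120×120 input shrinks to 1×1 after six pooling layers (120 → 60 → 30 → 15 → 7 → 3 → 1); dropping some of the pools is what lets a deeper network keep a manageable input size.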
I took a side tour trying to get something working with Restricted Boltzmann Machines (RBMs) and Deep Belief Nets (DBNs), but the results were very poor and training took more processing power, so I returned to my convolutional network model.
More layers, more feature maps
As my networks were getting bigger and I used more augmented training data, my models started to take longer to run. That gave me lots of time to read up on various academic papers on convolutional nets while the model was training, as well as go through the entire University of Toronto Neural Networks for Machine Learning course on YouTube, to get a better understanding. What eventually caused my breakthrough was a paper by Karen Simonyan and Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, which described a series of tests with different convolutional network architectures against the ImageNet dataset, using 11 to 19 weight layers and a 224×224 color image input size.
This made something click in my head. I’d experimented with dropping a maxpool layer, but didn’t really know whether that was appropriate, and this paper confirmed that it is perfectly legitimate. After an initial test of an 8-layer architecture proved promising and gave me my first validation multi-class log loss below 1.0 (a somewhat arbitrary number, but one that I had the hardest time getting under), I knew there was more to be gained. I saw further improvement by adding another layer, and again with a 10-layer model. You can see a representation of that architecture below: on the left are the input sizes into each layer, on the right a description of the shape of the feature maps, as well as the number of nodes in the bottom two fully connected layers.
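The argument in that paper for why stacking small convolutions without pooling in between is legitimate can be sketched numerically: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, and three cover 7×7, with fewer weights and more non-linearities in between. The helper names below are mine:

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions:
    rf = 1 + sum of (k - 1) over the layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def weight_count(kernel_sizes, channels):
    """Weights in a conv stack with `channels` input and output
    channels per layer (biases ignored): k * k * C * C per layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)
```

For C channels per layer, the stacked pair costs 18·C² weights against 25·C² for a single 5×5 layer, while seeing the same region of the input.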
By now, the models easily took 3 full days to train. In hindsight, I could have reduced the number of nodes in the fully connected layers – the winning team used substantially smaller fully connected layers – and that would have sped up training quite a bit. As it was, I ended up initializing the convolutional layer weights with the saved weights of prior runs, which made training faster. My lowest validation multi-class log loss was 0.79, which got me a final score of 0.943 on the leaderboard and 215th position out of 1,049 participants.
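For reference, the metric behind all of these scores is the multi-class logarithmic loss: the mean negative log of the probability the model assigned to the true class. A minimal sketch (the 1e-15 clipping bound is the value Kaggle is commonly described as using; treat it as an assumption):

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean over samples of -log(p_true), where p_true is the
    predicted probability of the true class, clipped away from
    0 and 1 for numerical stability."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)
```

A uniform prediction over the competition’s 121 classes scores log(121) ≈ 4.80, which puts scores like 1.18, 0.94 and 0.79 in perspective.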
Kaggle even gave me a little badge…
Participating in this competition was one of the most valuable experiences I have gone through in the last couple of years. The combination of a real use case and the complexity of the task made it an excellent learning experience that went well beyond MNIST tutorials, and it made me much more aware of what is involved, what sort of timelines one should expect for a project like this, and which techniques improve the performance of a CNN model. There have been major breakthroughs in deep learning in the past few years, and this is likely to continue. We’ll soon see these techniques applied more widely and adopted in the enterprise, especially in the context of IoT, beyond just search giants like Google and Baidu, for whom image labeling, search, and classification are part of the core business.
Even more interestingly, by talking about this competition internally at SAP, I found various colleagues who are also working on deep learning topics. And while it is still premature to talk about this in more detail, stay tuned, as we’re working on ways to bring these techniques to HANA.