92.45% on CIFAR-10 in Torch

July 30, 2015 by Sergey Zagoruyko

The full code is available at https://github.com/szagoruyko/cifar.torch. Just clone it to your machine and it's ready to play with.

CIFAR-10 contains 60000 labeled 32x32 images in 10 classes; the train set has 50000 images and the test set 10000. There was a Kaggle competition on it: https://www.kaggle.com/c/cifar-10 and there is a slightly outdated table of state-of-the-art results here: http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html


The dataset is quite small by today's standards, but it is still a good playground for machine learning algorithms. The state of the art is of course held by deep convolutional neural networks, but it's hard to say which deep technique is the best. If you don't do any crops, affine transforms or ensembles, just horizontal flips, I'd say 92.45% is the best result. Interestingly, human performance is about 94%, and to beat it one has to use massive data augmentation. Check @nagadomi's training code on 24x24 crops: https://github.com/nagadomi/kaggle-cifar10-torch7

My code is about 400 lines including model definition and data preprocessing, and mostly comes from the 2_supervised tutorial. There were a few things I didn't like about it:

  • everything is in one file: it is nice for a demo, but if you want to do experiments you don't want to wait for your data to be preprocessed each run;
  • the batch is processed example by example. It is not obvious to newbies how to do it correctly, so they copy-paste the code and then it is suuuper slow;
  • the model in it is quite old.

So I modernized it and added some nice tricks, such as saving an HTML report that updates each epoch. It can be copied to Dropbox, for example, which is very useful for tracking model performance and keeping saved reports.

You would need a GPU with at least 2 GB of memory with cudnn, a bit more with the standard nn backend. cudnn is a deep learning library from NVIDIA. It is not publicly available yet, but it is totally worth registering and applying for. It is great: https://developer.nvidia.com/cudnn

An NVIDIA GPU is preferred, and required for BatchNormalization, but an AMD card will do too with OpenCL, thanks to the amazing work of Hugh Perkins: https://github.com/hughperkins/cltorch.git Check out the opencl branch.

The post and the code consist of 3 parts/files:

  • model definition
  • data preprocessing
  • training

The model models/vgg_bn_drop.lua

After the Batch Normalization paper [1] popped up on arXiv this winter, offering a way to speed up training and boost performance by using batch statistics, and after nn.BatchNormalization was implemented in Torch (thanks Facebook), I wanted to check how it plays together with Dropout. CIFAR-10 was a nice playground to start with.

The idea was simple: while the authors of BatchNormalization argued that it can be used instead of Dropout, I thought, why not use them together inside the network to massively regularize training?

The model definition is in models/vgg_bn_drop.lua.

It's a VGG-like network [2] with many 3x3 filters and padding 1,1 so the sizes of feature maps after them are not changed. They are only changed after max-pooling. Weights of convolutional layers are initialized MSR-style [3].
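The size bookkeeping and the initialization scale can be checked with a bit of arithmetic. The sketch below is plain Lua and is not taken from the repository: the function names convOutSize and heStd are made up for illustration, the 2x2 max-pooling with stride 2 is an assumption, and the standard deviation uses the fan-in form sqrt(2/n) from [3] (the paper also describes a fan-out form, and the repository may use either).

```lua
-- Output size of a convolution or pooling layer along one spatial dimension.
local function convOutSize(inSize, kernel, stride, pad)
  return math.floor((inSize + 2 * pad - kernel) / stride) + 1
end

-- A 3x3 convolution with stride 1 and padding 1 keeps the size unchanged.
local size = 32
size = convOutSize(size, 3, 1, 1)
print('after 3x3 conv, pad 1:', size)

-- ASSUMPTION: 2x2 max-pooling with stride 2, which halves the size.
size = convOutSize(size, 2, 2, 0)
print('after 2x2 max-pool:', size)

-- MSR-style standard deviation, fan-in form: sqrt(2 / (kW * kH * nInputPlane)).
local function heStd(kW, kH, nInputPlane)
  return math.sqrt(2 / (kW * kH * nInputPlane))
end
print('std for a 3x3 conv with 64 input planes:', heStd(3, 3, 64))
```

With these assumptions a 32x32 input stays 32x32 after each convolution and shrinks only at the pooling layers.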

Preprocessing provider.lua

I like to train models with BatchProvider -> Network structure. Actually you can even create a model like:

model = nn.Sequential()
model:add(BatchProvider) -- different in train and test mode
model:add(DataAugmentation) -- disabled in test mode

Then your training code is very simple. In the actual code, however, I keep only DataAugmentation in the model, while BatchProvider is a separate class. So let's start preprocessing: clone the code from https://github.com/szagoruyko/cifar.torch, go to the folder cifar.torch and start the interpreter:

th -i provider.lua

It will run the file and go to interactive mode. Preprocessing will take ~40 minutes:

provider = Provider()

In the end you will have all the data stored in the provider.t7 file, 1400 MB. The images are converted to YUV and mean-std normalized.
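To make the two preprocessing steps concrete, here is a minimal plain-Lua sketch, not the code from provider.lua. The function names rgb2yuv and normalize are illustrative, the BT.601 coefficients are an assumption about which YUV variant is meant, and the normalization is a simple global one over a flat list of values (the repository may normalize each channel differently).

```lua
-- RGB -> YUV for a single pixel. ASSUMPTION: BT.601 coefficients.
local function rgb2yuv(r, g, b)
  local y =  0.299   * r + 0.587   * g + 0.114   * b
  local u = -0.14713 * r - 0.28886 * g + 0.436   * b
  local v =  0.615   * r - 0.51499 * g - 0.10001 * b
  return y, u, v
end

-- Mean-std normalization of a flat list of numbers.
-- Assumes the values are not all equal (otherwise std would be zero).
local function normalize(values)
  local n, sum = #values, 0
  for i = 1, n do sum = sum + values[i] end
  local mean = sum / n

  local sq = 0
  for i = 1, n do sq = sq + (values[i] - mean) ^ 2 end
  local std = math.sqrt(sq / n) -- population standard deviation

  local out = {}
  for i = 1, n do out[i] = (values[i] - mean) / std end
  return out, mean, std
end

print(rgb2yuv(1, 1, 1))                 -- a white pixel: Y near 1, U and V near 0
local out, mean, std = normalize({1, 2, 3})
print(mean, std, out[1], out[2], out[3])
```

A common design choice is to compute the mean and standard deviation on the train set only and reuse them for the test set, so that both sets go through the same transformation.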

Training train.lua

That's it, you can now start training by running train.lua with th.


The parameters with which the model achieves the best performance are the defaults in the code. I used SGD with cross-entropy loss, learning rate 1, momentum 0.9 and weight decay 0.0005, dropping the learning rate every 25 epochs. After a few hours you will have the model. The accuracy can be tracked by reloading logs/report.html, for example with an auto-reload plugin for Firefox or Chrome.
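The hyperparameters above can be sketched as a step schedule plus a momentum update. This is plain Lua for illustration, not the code from train.lua: the names learningRateAt and sgdStep are made up, the drop factor of 0.5 is an assumption (the text only says the rate is dropped every 25 epochs), and the update is classic momentum SGD with L2 weight decay, which may differ in detail from Torch's optim.sgd.

```lua
local config = {
  learningRate = 1,       -- from the post
  momentum     = 0.9,     -- from the post
  weightDecay  = 0.0005,  -- from the post
  epochStep    = 25,      -- from the post: drop every 25 epochs
  dropFactor   = 0.5,     -- ASSUMPTION: the post does not give the factor
}

-- Learning rate used during a given (1-based) epoch.
-- ASSUMPTION: epochs 1..25 use the base rate, 26..50 the first drop, and so on.
local function learningRateAt(epoch, cfg)
  local drops = math.floor((epoch - 1) / cfg.epochStep)
  return cfg.learningRate * cfg.dropFactor ^ drops
end

-- One step of momentum SGD with L2 weight decay on a single scalar weight.
local function sgdStep(w, grad, velocity, lr, cfg)
  local g = grad + cfg.weightDecay * w       -- weight decay adds wd * w to the gradient
  velocity = cfg.momentum * velocity + g     -- accumulate momentum
  return w - lr * velocity, velocity
end

for _, epoch in ipairs({1, 25, 26, 51}) do
  print(epoch, learningRateAt(epoch, config))
end

local w, v = sgdStep(1, 0.5, 0, learningRateAt(1, config), config)
print(w, v)
```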

How accuracy improves:


Confusion matrix:

[[     929       4      15       7       4       0       1       2      29       9]   92.900%
 [       2     977       2       0       1       0       0       0       6      12]   97.700%
 [      16       1     923      16      25       8       7       1       3       0]   92.300%
 [      14       2      30     836      19      68      18       5       6       2]   83.600%
 [       6       1      18      16     932       7       6      10       4       0]   93.200%
 [       3       1      25      78      17     867       4       5       0       0]   86.700%
 [       4       1      19      17       6       7     942       0       2       2]   94.200%
 [       6       0      12      15      10      14       0     942       1       0]   94.200%
 [      22       6       4       1       0       1       1       0     961       4]   96.100%
 [       9      38       1       4       0       1       1       1      10     935]]  93.500%
 + average row correct: 92.439999580383%
 + average rowUcol correct (VOC measure): 86.168013215065%
 + global correct: 92.44%
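
The three summary lines can be recomputed from the matrix itself. The plain-Lua sketch below is not Torch's ConfusionMatrix implementation; the function name metrics is illustrative, and it assumes rows are true classes and columns are predictions, with the "rowUcol" measure taken as diagonal / (row sum + column sum - diagonal) averaged over classes.

```lua
-- Confusion matrix copied from the report above.
local confusion = {
  {929,   4,  15,   7,   4,   0,   1,   2,  29,   9},
  {  2, 977,   2,   0,   1,   0,   0,   0,   6,  12},
  { 16,   1, 923,  16,  25,   8,   7,   1,   3,   0},
  { 14,   2,  30, 836,  19,  68,  18,   5,   6,   2},
  {  6,   1,  18,  16, 932,   7,   6,  10,   4,   0},
  {  3,   1,  25,  78,  17, 867,   4,   5,   0,   0},
  {  4,   1,  19,  17,   6,   7, 942,   0,   2,   2},
  {  6,   0,  12,  15,  10,  14,   0, 942,   1,   0},
  { 22,   6,   4,   1,   0,   1,   1,   0, 961,   4},
  {  9,  38,   1,   4,   0,   1,   1,   1,  10, 935},
}

-- Assumes every class appears at least once, so no row sum is zero.
local function metrics(m)
  local n = #m
  local rowSum, colSum, diag, total = {}, {}, 0, 0
  for i = 1, n do rowSum[i], colSum[i] = 0, 0 end
  for i = 1, n do
    for j = 1, n do
      rowSum[i] = rowSum[i] + m[i][j]
      colSum[j] = colSum[j] + m[i][j]
      total = total + m[i][j]
    end
    diag = diag + m[i][i]
  end

  local perClass, averageRow, averageUnion = {}, 0, 0
  for i = 1, n do
    perClass[i] = m[i][i] / rowSum[i]
    averageRow = averageRow + perClass[i] / n
    averageUnion = averageUnion + m[i][i] / (rowSum[i] + colSum[i] - m[i][i]) / n
  end
  return {
    perClass = perClass,
    averageRow = averageRow,
    averageUnion = averageUnion,
    global = diag / total,
  }
end

local r = metrics(confusion)
print(string.format('average row correct:     %.3f%%', 100 * r.averageRow))
print(string.format('average rowUcol correct: %.3f%%', 100 * r.averageUnion))
print(string.format('global correct:          %.3f%%', 100 * r.global))
```

Because every class has exactly 1000 test images, the average row accuracy and the global accuracy coincide here.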

Removing BN or Dropout results in 91.4% accuracy.

I created a small table benchmarking the VGG+BN+Dropout architecture with different backends on a GeForce GTX 980. Batch size was set to 128, and the numbers are in seconds:

                     cunn    cudnn R2   clnn (no BN)
forward              0.292   0.163      1.249
backward             0.407   0.333      0.831
forward + backward   0.699   0.500      2.079
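
To put these timings in perspective, they can be turned into throughput and a rough time per epoch. This is a back-of-the-envelope sketch in plain Lua; the function name epochSeconds is illustrative, and the estimate counts only the forward + backward time on the 50000 train images, ignoring data loading, augmentation and the parameter update.

```lua
local batchSize   = 128    -- from the benchmark above
local trainImages = 50000  -- CIFAR-10 train set size

-- forward + backward time per batch, in seconds, from the table above
local timings = {
  { name = 'cunn',         seconds = 0.699 },
  { name = 'cudnn R2',     seconds = 0.500 },
  { name = 'clnn (no BN)', seconds = 2.079 },
}

-- Returns the estimated seconds per epoch and the number of batches per epoch.
local function epochSeconds(secondsPerBatch)
  local batches = math.ceil(trainImages / batchSize)
  return batches * secondsPerBatch, batches
end

for _, t in ipairs(timings) do
  local perEpoch = epochSeconds(t.seconds)
  print(string.format('%-14s %6.1f images/s  %6.1f s/epoch',
                      t.name, batchSize / t.seconds, perEpoch))
end
```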

cunn is the standard CUDA neural network backend of Torch, and clnn is the OpenCL backend. cudnn is the fastest, as expected. There is also the cuda-convnet2 backend, which might be a bit faster, but I didn't test it on this architecture, mostly because BN is implemented in BDHW format and cuda-convnet2 works in DHWB.
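
The difference between the two layouts comes down to which index varies fastest in memory. The sketch below is plain Lua and assumes that the letters name the dimensions of a contiguous tensor from slowest to fastest (B = batch, D = depth or feature planes, H = height, W = width); the function names are illustrative and the indices are 0-based to keep the arithmetic simple.

```lua
-- Flat offset of element (b, d, h, w) in a contiguous BDHW tensor:
-- neighbouring pixels of one image are adjacent in memory.
-- (B is unused here but kept so both functions share one signature.)
local function offsetBDHW(b, d, h, w, B, D, H, W)
  return ((b * D + d) * H + h) * W + w
end

-- Flat offset of the same element in a contiguous DHWB tensor:
-- the same pixel of neighbouring images in the batch is adjacent in memory.
local function offsetDHWB(b, d, h, w, B, D, H, W)
  return ((d * H + h) * W + w) * B + b
end

local B, D, H, W = 128, 3, 32, 32
print('next pixel, BDHW:', offsetBDHW(0, 0, 0, 1, B, D, H, W))
print('next pixel, DHWB:', offsetDHWB(0, 0, 0, 1, B, D, H, W))
print('next image, BDHW:', offsetBDHW(1, 0, 0, 0, B, D, H, W))
print('next image, DHWB:', offsetDHWB(1, 0, 0, 0, B, D, H, W))
```

Under this assumption, moving data between the two backends means a full transpose of every activation tensor, which is why mixing them is not free.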

The code is a nice playground for deep convnets. For example, it is very easy to implement the Network-In-Network architecture [4], which achieves 92% accuracy with BN (~2% more than they claim in the paper) and 88% without, and NIN is 4 times faster to train than VGG. I put the example in nin.lua.

Thanks to Soumith and IMAGINE lab for helping me to prepare this post!


  1. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [arxiv]
  2. K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition [arxiv]
  3. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification [arxiv]
  4. Min Lin, Qiang Chen, Shuicheng Yan. Network In Network [arxiv]