Training and investigating Residual Nets
The post was coauthored by Sam Gross from Facebook AI Research and Michael Wilber from CornellTech.
In this blog post we implement Deep Residual Networks (ResNets) and investigate ResNets from a modelselection and optimization perspective. We also discuss multiGPU optimizations and engineering bestpractices in training ResNets. We finally compare ResNets to GoogleNet and VGG networks.
We release training code on GitHub, as well as pretrained models for download with instructions for finetuning on your own datasets.
Our released pretrained models have a higher accuracy than the models in the original paper.
Introduction
At the end of last year, Microsoft Research Asia released a paper titled “Deep Residual Learning for Image Recognition”, authored by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. The paper achieved stateoftheart results in image classification and detection, winning the ImageNet and COCO competitions.
The central idea of the paper itself is simple and elegant. They take a standard feedforward ConvNet and add skip connections that bypass (or shortcut) a few convolution layers at a time. Each bypass gives rise to a residual block in which the convolution layers predict a residual that is added to the block’s input tensor.
An example residual block is shown in the figure below.
Deep feedforward conv nets tend to suffer from optimization difficulty. Beyond a certain depth, adding extra layers results in higher training error and higher validation error, even when batch normalization is used. The authors of the ResNet paper argue that this underfitting is unlikely to be caused by vanishing gradients, since this difficulty occurs even with batch normalized networks. The residual network architecture solves this by adding shortcut connections that are summed with the output of the convolution layers.
This post gives some data points for one trying to understand residual nets in more detail from an optimization perspective. It also investigates the contributions of certain design decisions on the effectiveness of the resulting networks.
After the paper was published on Arxiv, both of us who authored this post independently started investigating and reproducing the results of the paper. After learning about each other’s efforts, we decided to collectively write a single post combining our experiences.
Ablation studies (on CIFAR10)
When trying to understand complex machinery such as residual nets, it can be cumbersome to run exploratory studies on a larger scale – like on the ImageNet dataset – since training a full model takes days to converge. Instead, it’s often helpful to run ablation studies on a smaller dataset to independently measure the impact of each aspect of the model. This way, we can quickly identify the parts of the research that are the most important to focus on during further development. Quick turnaround time and continuous validation are helpful when designing the full system because overlooked details can often bite you in the end.
For these experiments, we replicated Section 4.2 of the residual networks paper using the CIFAR10 dataset. In this setting, a small residual network with 20 layers takes about 8 hours to converge for 200 epochs on an Amazon EC2 g2.2xlarge instance. A larger 110layer network takes 24 hours. This is still a long time, but critical bugs that prevent convergence are often immediately apparent because if the code is bugfree the training loss should decrease quickly (within minutes after training starts).
Effects of model depth. The residual networks paper provides a natural starting point for comparison: Figure 6 from the paper simply measures accuracy with respect to the depth of the network. To reproduce this figure, we held the learning rate policy and building block architecture fixed, while varying the number of layers in the network between 20 and 110. Our results come fairly close to those in the paper: accuracy correlates well with model size, but levels off after 40 layers or so.
Residual block architecture.
After validating that our results come fairly close to the original paper, we started considering the impact of slightly different residual block architectures to test the assumptions of the model. For example:

Is it better to put batch normalization after the addition or before the addition at the end of each residual block? If batch normalization is placed after the addition, it has the effect of normalizing the output of the entire block. This could be beneficial. However, this also forces every skip connection to perturb the output. This can be problematic: there are paths that allow data to pass through several successive batch normalization layers without any other processing. Each batch normalization layer applies its own separate distortion which compounds the original input. This has a harmful effect: we found that putting batch normalization after the addition significantly hurts test error on CIFAR, which is in line with the original paper’s recommendations.

The above result seems to suggest that it’s important to avoid changing data that passes through identity connections only. We can take this philosophy one step further: should we remove the ReLU layers at the end of each residual block? ReLU layers also perturb data that flows through identity connections, but unlike batch normalization, ReLU’s idempotence means that it doesn’t matter if data passes through one ReLU or thirty ReLUs. When we remove ReLU layers at the end of each building block, we observe a small improvement in test performance compared to the paper’s suggested ReLU placement after the addition. However, the effect is fairly minor. More exploration is needed.
These results were run on the deeper 110layer models. The effect is much less pronounced on the shallow 20layer baseline.
Alternate optimizers. When running a hyperparameter search, it can often pay off to try fancier optimization strategies than vanilla SGD with momentum. Fancier optimizers that make nuanced assumptions may improve training times, but they may instead have more difficulty training these very deep models. In our experiments, we compared SGD+momentum (as used in the original paper) with RMSprop, Adadelta, and Adagrad. Many of them appear to converge faster initially (see the training curve below), but ultimately, SGD+momentum has 0.7% lower test error than the secondbest strategy.
Solver  Testing error 

Nsize=18, Original paper: Nesterov, 1e1  0.0697 
Nsize=18, Best RMSprop (LR 1e2)  0.0768 
Nsize=18, Adadelta  0.0888 
Nsize=18, Best Adagrad (LR 1e1)  0.1145 
These experiments help verify the model’s correctness and uncover some interesting directions for future work. However, moving to the much larger ImageNet dataset opens its own Pandora’s box of interesting challenges.
For a more indepth report of the ablation studies, read here.
Training at a larger scale: ImageNet
We trained variants of the 18, 34, 50, and 101layer ResNet models on the ImageNet classification dataset.
What’s notable is that we achieved error rates that were better than the published results by using a different data augmentation method.
We are also training a 152layer ResNet model, but the model has not finished converging at the time of this post.
We used the scale and aspect ratio augmentation described in “Going Deeper with Convolutions” instead of the scale augmentation described in the ResNet paper. With ResNet34, this improved top1 validation error by about 1.2% points. We also used the color augmentation described in “Some Improvements on Deep Convolutional Neural Network Based Image Classification,” but found that had a very small effect on ResNet34.
Model changes
We experimented with moving the batch normalization layer to from after the last convolution in a building block to after the addition. We also experimented with moving the stridetwo downsampling in bottleneck architectures (ResNet50 and ResNet101) from the first 1x1 convolution to the 3x3 convolution.
Model  Batch Norm  Stridetwo layer  Top1 singlecrop err (%) 

ResNet18  after conv  3x3  30.6 
ResNet18  after add  3x3  30.4 
ResNet34  after conv  3x3  26.9 
ResNet34  after add  3x3  27.0 
ResNet50  after conv  3x3  24.5 
ResNet50  after add  1x1  24.5 
ResNet50  after add  3x3  24.2 
Batch normalization
Torch uses an exponential moving average to compute the estimates of mean and variance used in the batch normalization layers for inference. By default, Torch uses a smoothing factor of 0.1 for the moving average. We found that decreasing the smoothing factor to 0.003 and recomputing the mean and variance improved top1 error rate by about 0.2% points.
MultiGPU training
Using 4 NVIDIA Kepler GPUs and optimizations described below, training took from 3.5 days for the 18layer model to 14 days for the 101layer model.
To speed up training, we used:
Data parallelism over 4 GPUs : This is a standard way of speeding up training deep learning models. The input is a minibatch of N samples, which are divided into N / 4 subbatches and sent to each GPU separately for training, and the network parameters are communicated across the GPUs in the process. In torch, this can be done using nn.DataParallelTable.
FFT based convolutions via CuDNN4 : Using the CuDNN Torch bindings, one can select the fastest convolution kernels by setting cudnn.fastest
and cudnn.benchmark
to true
. This automatically benchmarks each possible algorithm on your GPU and chooses the fastest one. This sped up the time perminibatch by about 40% on a single GPU, but slowed down the multiGPU case due to the additional kernel launch overhead.
Multithreaded kernel launching : The FFTbased convolutions require multiple smaller kernels that are launched rapidly in succession. Although CUDA kernel launches are asynchronous, they still take some time on the CPU to enqueue. With DataParallelTable
, all the kernels for the first GPU are enqueued before kernels any are enqueued on the second, third, and fourth GPUs. To fix this, we introduced a multithreaded mode for DataParallelTable
that uses a thread per GPU to launch kernels concurrently.
NCCL Collectives : We also used the NVIDIA NCCL multiGPU communication primitives, which sped up training by an additional 4%. 4% might sound insignificant, but for example, when training Resnet101 this amounts to a saving of 13 hours.
GPU memory optimizations
We used a few tricks to fit the larger ResNet101 and ResNet152 models on 4 GPUs, each with 12 GB of memory, while still using batch size 256 (batchsize 128 for ResNet152). In a backwards pass, the gradInput
buffers can be reused once the module’s gradWeight
has been computed. In Torch, an easy way to achieve this is to modify modules of the same type to share their underlying storages. We also used the inplace variants of the ReLU and CAddTable
modules.
Adding these memory optimizations only amount to an extra 10 lines of code.
Speed of ResNets vs GoogleNet and VGGA/D
It is interesting to compare ResNets in terms of training / inference time against other stateoftheart convnet models in the context of image classification. We measured the time for a full forward and backward pass for a minibatch of 32 images for ResNet, VGG A, VGG D, BatchNormalized Inception, and Inception v3 on a NVIDIA Titan X. Also listed is the top1 singlecrop validation error on the Imagenet2012 dataset.
Model  Top1 err (%)  Time (ms) 

VGGA  29.6  372 
VGGD  26.8  687 
ResNet34  26.7  231 
BNInception  25.2  192 
ResNet50  24.0  403 
ResNet101  22.4  649 
Inceptionv3  21.2  494 
While ResNets have definitely improved over Oxford’s VGG models in terms of efficiency, GoogleNet seems to still be more efficient in terms of the accuracy / ms ratio.
Code Release
We are releasing code for training ResNets so that others can train them on their own datasets. Code for training ResNets on ImageNet is at https://github.com/facebook/fb.resnet.torch. This also includes options for training on CIFAR10, and we also describe how one can train Resnets on their own datasets.
The code for the CIFAR10 ablation studes is at https://github.com/gcr/torchresidualnetworks.
Pretrained models
We are releasing the ResNet18, 34, 50 and 101 models for use by everyone in the community. We are hoping that this will help accelerate research in the community. We will release the 152layer model when it finishes training.
The pretrained models are available at this link, and includes instructions for finetuning on your own datasets.
Our models have better accuracy than the original ResNet models, most likely because of the aspect ratio augmentation. The table below shows a comparison of singlecrop top1 validation error rates between the original residual networks paper and our released models.
As you see, our ResNet101 model gets a higher accuracy than MSRA’s ResNet152 model. We did not tune our models heavily over the validation error and so haven’t overfitted to the validation set. In fact, we only ever trained one instance of the ResNet101 model, without any hyperparameter sweeps.
Model  Original top1 err (%)  Our top1 err (%) 

ResNet50  24.7  24.0 
ResNet101  23.6  22.4 
ResNet152  23.0  N/A 
Conclusion
We presented our investigations on modelselection, optimization and our engineering optimizations, in the context of training Deep Residual Nets. We released optimized training code, as well as pretrained models, in the hope that this benefits the community.
#####################################################################################################################
Acknowledgements
Kaiming He for discussing ambiguous and missing details in the original paper and helping us reproduce the results.
Ross Girshick, Piotr Dollar, Tsungyi Lin and Adam Lerer for discussions.
Natalia Gimelshein, Nicolas Vasilache, Jeff Johnson for code and discussions around multiGPU optimization.
References
[1] He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015).
[2] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
[3] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for largescale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
comments powered by Disqus