Fast, Better Training Trick: Random Gradient
Hunan University of Technology
Abstract
In this paper, we present a novel method for accelerating training and improving performance, which we call random gradient (RG). The method can be applied to the training of any model at no extra computational cost. We use image classification, semantic segmentation, and GANs to confirm that it speeds up the training of computer vision models. The central idea is to multiply the loss by a random number so as to randomly reduce the backpropagated gradient. With this method we obtain better results on the Pascal VOC, CIFAR, and Cityscapes datasets.
Keywords: Deep learning, Gradient Weight, Accelerating Convergence
1 Introduction
As deep learning shows excellent results in more and more areas, it is becoming increasingly important to understand its internal working principles and to explain the phenomena it exhibits. Backpropagation Rumelhart et al. (1986) plays an important role in deep learning: the gradients calculated at each iteration are used to update the parameters. It laid the foundation for AlexNet's excellent performance in the ImageNet Russakovsky et al. (2015) competition in 2012. Backpropagation still has many problems, however, such as vanishing and exploding gradients, which remain an open question. One reason is that the derivative in a nonlinear layer may become very small or very large, an effect that is exacerbated by accumulation over multiple layers, so that the network cannot be made deeper or becomes hard to train. Researchers have mostly addressed this problem by using ReLU Nair and Hinton (2010) activation functions, better initialization methods He et al. (2015), and the skip connections of ResNet He et al. (2016a), proposed by Kaiming He et al. The last of these tackles the problem at the level of network structure, because gradients can be transmitted quickly and effectively through the residual structure.
In this paper, we multiply the loss computed by the loss function by a random number between 0 and 1 and use the result as a new loss for parameter optimization; in other words, the new loss is smaller than the original loss because of the random factor. This could also be called a random loss. However, since differentiation turns the random loss into a randomly scaled gradient, we refer to the method as random gradient throughout. We experimented with random gradient under a variety of learning rates and momentum values, drew some conclusions from the experimental phenomena, and obtained the characteristics of models trained with random gradients along with some training tips. Experiments show that the method is faster and gives better results. Our contributions are as follows:

We designed a random gradient for backpropagation, multiplying the loss value with a random number.

Experiments show that the random gradient can effectively accelerate the convergence and reduce oscillation during optimization.

We draw a close connection between RG, learning rate, task category, and momentum, and summarize it as a set of empirical rules.
We do not report results on large datasets such as ImageNet Russakovsky et al. (2015) or MSCOCO Lin et al. (2014), mainly because we cannot afford the training time. The proposed method is evaluated on the Pascal VOC 2012 dataset Everingham et al. (2010) for semantic segmentation, on the CIFAR datasets Krizhevsky and Hinton (2009) for image classification, and on the Cityscapes Cordts et al. (2016) and maps (scraped from Google Maps) datasets for GANs. These experiments approximate the performance of networks at different scales and on different tasks. We hope that researchers who are interested in this work will study it further and improve its performance. The following is a brief introduction to semantic segmentation, image classification, and generative adversarial networks.
The semantic segmentation task contains 20 foreground object classes and one background class. The dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level annotated images. Performance is measured in terms of pixel intersection-over-union averaged across the 21 classes (mIOU). We do not use the commonly used extra annotations of Hariharan et al. (2011) to improve accuracy. Inspired by Hariharan et al. (2011), we use the "poly" learning rate policy, in which the current learning rate equals the base one multiplied by $(1 - \frac{iter}{max\_iter})^{power}$. We set the power to 0.9 and use random mirroring for data augmentation.
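The "poly" policy can be sketched in a few lines. This is a minimal illustration rather than the training code used in this paper: the function name `poly_lr` and its argument names are hypothetical, and the rule assumed is the usual one in which the base learning rate is multiplied by (1 - iter/max_iter) raised to the power.

```python
def poly_lr(base_lr: float, iteration: int, max_iter: int, power: float = 0.9) -> float:
    """Return the learning rate at `iteration` under the "poly" decay policy.

    Assumed rule: base_lr * (1 - iteration / max_iter) ** power.
    """
    if max_iter <= 0:
        raise ValueError("max_iter must be positive")
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power


if __name__ == "__main__":
    # Print the schedule at a few points of a 100-iteration run.
    for it in (0, 25, 50, 75, 100):
        print(it, poly_lr(0.01, it, 100))
```

The rate starts at the base value and decays monotonically to zero at the final iteration; a power below 1 keeps it higher than linear decay for most of the run.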
For the image classification task, we train several well-known networks. The two CIFAR datasets Krizhevsky and Hinton (2009) consist of colored natural images with 32×32 pixels; CIFAR10 consists of images drawn from 10 classes and CIFAR100 from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively. We adopt a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets. For preprocessing, we normalize the data using the channel means and standard deviations.
Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Zhao et al. (2016) have achieved impressive results in image generation Denton et al. (2015); Radford et al. (2015) and representation learning Salimans et al. (2016). The key to GANs' success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real images. This is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We experiment with the pix2pix Zhu et al. (2017) network using a batch size of 8; all other settings strictly follow the original article.
2 Background and Inspiration
In the area of accelerating training, researchers have paid most attention to adjusting hyperparameters such as learning rate, batch size, and momentum, or to designing optimization algorithms that improve performance; there is extensive literature on both. In this paper, we use the point at which convergence stops as the criterion for judging whether training is complete.
In You et al. (2017); Goyal et al. (2017); Smith (2015); Smith et al. (2017), researchers improve the speed of convergence by changing the learning rate or adjusting the batch size. In Jastrzębski et al. (2017), the authors analyze in detail three factors influencing minima, namely the learning rate, the batch size, and the variance of the loss gradients, and verify experimentally that the noise $n = \eta / S$ ($\eta$ is the learning rate, $S$ is the batch size) determines the width and height of the minima towards which SGD converges. In Smith and Topin (2017), the authors describe a super-convergence phenomenon; one of its key elements is training with cyclical learning rates Smith (2015) and a large maximum learning rate.
Another direction is to design new optimization algorithms, such as Kingma and Ba (2014); Ding et al. (2016). These are based on adaptive estimates of lower-order moments and generally converge faster than SGD, although in some cases their final performance is inferior to that of SGD. We also tested the random gradient method with Adam and found that our method is not limited to SGD.
Hawkins and Blakeslee (2004) propose that the information transmitted in the backward direction in the brain is an order of magnitude larger than that passed forward, a view that neuroanatomy also confirms; the book gave us a deeper understanding of the brain's feedback mechanism. Although current artificial neural networks are very different from the brain, we wished to run some analogous experiments, so we tested a variety of weights applied to the loss function to simulate the feedback mechanism of the brain. The right side of Figure 1 shows results measured with PSPNet101 Zhao et al. (2017) under identical conditions; because the learning rate and momentum are fixed, the influence of RG on the model can be seen directly. To our surprise, the result gets worse as the weight becomes larger and better when the weight is randomly reduced, and it is best when the weight is a random number between zero and one. The same effect appears in image classification.
The left side of Figure 1 gives a more intuitive picture. Suppose only two parameters need to be optimized; this can be visualized in a three-dimensional coordinate system whose vertical axis is the value of the loss function, and our final goal is to minimize the loss. We use stochastic gradient descent (SGD) to update the parameters, but the choice of learning rate in SGD is crucial: a value that is too large may overshoot a local minimum or a better solution, while a value that is too small makes parameter updates too slow. The orange line shows gradient descent under normal conditions; it oscillates considerably near the optimal value, so the learning rate usually has to be adjusted constantly. The black line on the right is the method we propose. Multiplying the loss by a random number keeps the optimization from oscillating much under the same conditions, and we use the momentum method Hinton (2012) to make up for the reduced gradient, so that training can actually converge faster to a local minimum or a better solution. In a more complicated example, such as the left side of Figure 2, the descent direction may oscillate during optimization; the random gradient method effectively reduces this by randomly shrinking the gradient used in each update.
3 Analysis of Optimization Methods
Gradient descent is an optimization method that uses the slope, as computed by the derivative, to move in the direction of greatest negative gradient and iteratively update a variable. That is, given an initial point $\theta_0$, gradient descent proposes the next point to be:
$$\theta = \theta_0 - \eta \frac{\partial L}{\partial \theta_0} \tag{1}$$
where $\eta$ is the learning rate and the fraction $\frac{\partial L}{\partial \theta_0}$ is the derivative of the loss function $L$ with respect to $\theta$, evaluated at $\theta_0$. It can be seen that our method randomly reduces the value of the gradient term on the right while the learning rate remains unchanged.
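To make this effect explicit, the short derivation below writes $r$ for the random number multiplied into the loss, $\theta$ for the parameters, $L$ for the loss, and $\eta$ for the learning rate. The symbol $r$ and the uniform distribution used in the last line are assumptions made for illustration; the derivation only relies on $r$ not depending on $\theta$.

```latex
\begin{aligned}
\frac{\partial (r L)}{\partial \theta} &= r \, \frac{\partial L}{\partial \theta}, \\
\theta &\leftarrow \theta - \eta \, r \, \frac{\partial L}{\partial \theta}
        = \theta - (r \eta) \, \frac{\partial L}{\partial \theta}, \\
\mathbb{E}[r \eta] &= \frac{\eta}{2} \quad \text{for } r \sim \mathcal{U}(0, 1).
\end{aligned}
```

For plain gradient descent, scaling the loss is therefore equivalent to drawing a fresh effective learning rate $r\eta \le \eta$ at every step, with half the nominal learning rate on average under the uniform assumption.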
Stochastic gradient descent is currently the most widely used optimization method, and a large body of literature discusses why its solutions generalize so well Smith and Le (2017); Chaudhari et al. (2016); Chaudhari and Soatto (2017). At each step $t$, a mini-batch of $B$ samples $x_i$ is selected from the training set, the gradients of the loss function are computed on this subset, and the network weights are updated with this stochastic gradient:
$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i=1}^{B} \frac{\partial L(x_i; \theta_t)}{\partial \theta_t} \tag{2}$$
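Recently, many researchers no longer use vanilla SGD, instead preferring SGD with momentum Hinton (2012). Momentum-based stochastic gradient descent methods are widely used in practice for training deep networks:
$$v_{t+1} = m v_t - \eta \frac{\partial L}{\partial \theta_t}, \qquad \theta_{t+1} = \theta_t + v_{t+1} \tag{3}$$
where $v$ is the velocity and $m$ is the momentum coefficient.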
It can be seen that the momentum method adds the influence of previous gradients to the present update. In Smith et al. (2017), the authors argue that increasing the momentum coefficient accelerates convergence, though it is likely to cost some accuracy. We found that the random gradient method depends on momentum to compensate for the reduction of the gradient by the random number, which is why rapid convergence can be achieved even though the gradient decreases. Suppose the random number is very small, so that the new gradient is very small too: without momentum, the parameters would be updated very slowly, whereas with momentum the parameters can still be updated substantially even if the current gradient is very small. This sounds reasonable, but in our experiments momentum only helps under certain conditions. In classification tasks the highest momentum we use is only 0.5, whereas in semantic segmentation tasks 0.9 is generally the lowest value we use. We believe the main reason is that the two tasks require different learning rates; it is well known that classification tasks require larger learning rates. Under this assumption, the update in the classification task comes mainly from the current gradient, while the update in the segmentation task comes mainly from the accumulation of previous gradients. As can be seen on the right side of Figure 2, momentum plays a crucial role in the speed of convergence. Compared with the normal method, the random gradient method accelerates convergence without losing accuracy.
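As a rough numerical illustration of the compensation effect described above, the sketch below feeds a constant gradient, scaled at every step by a random number from [0, 1), into a momentum update of the form v ← m·v − lr·r·g. The constant gradient, the uniform distribution, the momentum form, and the function names are all assumptions made for illustration; none of them is taken from the experiments in this paper.

```python
import random


def mean_velocity(momentum: float, lr: float, grad: float,
                  steps: int = 20000, burn_in: int = 1000, seed: int = 0) -> float:
    """Average momentum velocity when a constant gradient is scaled by r ~ U[0, 1)."""
    rng = random.Random(seed)
    v = 0.0
    total = 0.0
    for t in range(steps + burn_in):
        r = rng.random()                  # random loss weight: the gradient becomes r * grad
        v = momentum * v - lr * r * grad  # velocity update of SGD with momentum
        if t >= burn_in:                  # skip the start-up transient before averaging
            total += v
    return total / steps


def expected_velocity(momentum: float, lr: float, grad: float) -> float:
    """Steady-state mean velocity: -lr * E[r] * grad / (1 - momentum), with E[r] = 1/2."""
    return -lr * 0.5 * grad / (1.0 - momentum)


if __name__ == "__main__":
    for m in (0.0, 0.5, 0.9):
        print(m, mean_velocity(m, lr=0.1, grad=1.0), expected_velocity(m, lr=0.1, grad=1.0))
```

Under these assumptions, a momentum of 0.5 brings the expected steady-state step back to −lr·g, the step of plain gradient descent without random scaling, and a momentum of 0.9 makes it five times larger.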
Figure 3 shows our experiments with ResNetDUC+HDC Wang et al. (2017). When the momentum is 0, the random gradient method is not at its most efficient and does not noticeably improve convergence speed. When the momentum is 0.9, the advantage of the random gradient is obvious: its accuracy is better than that of the original gradient, and no accuracy is lost compared with the original gradient method at momentum 0. On the left side of Figure 3, where the momentum is 0.95, the random gradient greatly improves convergence speed and accuracy under the same conditions, although such a high momentum inevitably costs some accuracy.
We can further hypothesize that when the loss weight is increased, lowering the learning rate should in theory recover some of the performance. This is indeed the case: the bottom right of Figure 5 shows the result of this experiment. However, convergence then becomes slower and tuning more complicated, which is not what we want, so we did not pursue more in-depth experiments.
In addition, regularization Jones et al. (1995) is also often used to optimize models. Examples include dropout Srivastava et al. (2014) (which this article does not explain in detail) as well as $L_1$ regularization and $L_2$ regularization (also called weight decay), which push the parameters as close as possible to zero or exactly to zero. As reported by Krizhevsky et al. (2012), regularization sometimes even helps optimization. In our experiments, we used $L_2$ regularization with a coefficient of $\lambda = 0.0005$. The basic formula is as follows:
$$\tilde{L}(\theta) = L(\theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \tag{4}$$
where $\theta$ denotes the parameters, $L$ the loss function, and $\lambda$ the regularization coefficient.
For convenience, we denote the parameters of a network as $\theta$, the loss function as $L$, the nonlinear layers as $f$, and the input as $x$. A network can then be simplified as:
$$L(f(x; \theta)) \tag{5}$$
The Hessian-free optimization method proposed by Martens (2010) suggests a second-order solution that utilizes the slope information contained in the second derivative (i.e., the derivative of the gradient $\nabla_\theta L$). The main idea of the second-order Newton's method is that the loss function can be locally approximated by the quadratic:
$$L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta L(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0) \tag{6}$$
where $H$ is the Hessian, or the second derivative matrix of $L(\theta_0)$. In general, it is not feasible to compute the Hessian matrix, which has $n^2$ elements, where $n$ is the number of parameters in the network, but it is unnecessary to compute the full Hessian. The Hessian expresses the curvature in all directions in a high-dimensional space, but the only relevant curvature direction is the direction of steepest descent that SGD will traverse. This concept is contained within Hessian-free optimization, as Martens (2010) suggests a finite difference approach for obtaining an estimate of the Hessian from two gradients:
$$H(\theta) = \lim_{\delta \to 0} \frac{\nabla_\theta L(\theta + \delta) - \nabla_\theta L(\theta)}{\delta} \tag{7}$$
where $\delta$ should be in the direction of steepest descent. Hessian-free optimization has not been widely used, because it is impractical to invert or even store the Hessian matrix and the improvement it brings is not obvious. We still consider it worth mentioning, since it helps us understand model optimization more deeply from the perspective of second-order methods.
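The finite-difference idea can be sketched numerically: the difference between two gradients taken a small step apart along a direction approximates the Hessian applied to that direction. The example below is an illustration under stated assumptions, not Martens' implementation; the quadratic test loss, the matrix `A`, and the function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical test problem: L(theta) = 0.5 * theta^T A theta, so the Hessian is exactly A.
A = [[3.0, 1.0],
     [1.0, 2.0]]


def grad_quadratic(theta: Vector) -> Vector:
    """Gradient of the quadratic test loss, i.e. A @ theta."""
    return [sum(A[i][j] * theta[j] for j in range(len(theta))) for i in range(len(A))]


def hessian_vector_product(grad_fn: Callable[[Vector], Vector], theta: Vector,
                           direction: Vector, delta: float = 1e-5) -> Vector:
    """Estimate H @ direction from two gradient evaluations (finite difference)."""
    shifted = [t + delta * d for t, d in zip(theta, direction)]
    g_shifted = grad_fn(shifted)
    g_base = grad_fn(theta)
    return [(a - b) / delta for a, b in zip(g_shifted, g_base)]


if __name__ == "__main__":
    theta = [1.0, -1.0]
    steepest_descent = [-g for g in grad_quadratic(theta)]  # direction SGD would follow
    print(hessian_vector_product(grad_quadratic, theta, steepest_descent))
```

Only two gradient evaluations are needed, so the full matrix with its quadratic number of entries is never formed or stored.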
At the end of this section, we mention the widely used optimization method Adam, which can converge more quickly. Adam Kingma and Ba (2014) is a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions, but it requires extra memory and computing resources. We verify in this paper that our method is also effective with this kind of optimization algorithm.
4 Random Gradient
The formulas above show the basics of these optimization methods: the learning rate and the gradient directly determine the size of each update. Because the learning rate is tied to the batch size, which is limited by hardware memory, we focus instead on adjusting the gradient. RG can be implemented in almost any machine learning framework, such as MXNet Chen et al. (2017) or PyTorch Paszke et al. (2017). Furthermore, our approach can in theory be applied to nearly every existing deep learning architecture. The basic code structure is as follows:
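    import random

    for input, target in dataset:
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output, target)
        # Scale the loss by a random number in [0, 1) so that the
        # backpropagated gradient is randomly reduced.
        loss_random = loss * random.random()
        loss_random.backward()
        optimizer.step()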
Using RG can speed up training: although in theory the gradient of every parameter update is attenuated, our experiments still achieve good results. As Equation 3 shows, the momentum method uses previous gradients to correct the current gradient, while RG improves performance by randomly reducing the current gradient; the two can be combined, with momentum making up for the reduction that RG applies to the gradient.
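For readers without a deep learning framework at hand, the same loop structure can be tried on a toy problem. The sketch below is a framework-free illustration, not the code used for our experiments: the linear regression task, the data, the hyperparameters, and the function name `train_linear` are all assumptions chosen so that the example runs in pure Python.

```python
import random
from typing import Tuple


def train_linear(use_rg: bool, lr: float = 0.1, momentum: float = 0.9,
                 steps: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Fit y = w * x + b with momentum SGD, optionally scaling the loss by a random number."""
    rng = random.Random(seed)
    xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
    ys = [2.0 * x + 1.0 for x in xs]      # noise-free targets: true w = 2, b = 1
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                      # momentum buffers
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w and b.
        gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # RG: multiplying the loss by r multiplies every gradient by the same r.
        scale = rng.random() if use_rg else 1.0
        vw = momentum * vw - lr * scale * gw
        vb = momentum * vb - lr * scale * gb
        w += vw
        b += vb
    return w, b


if __name__ == "__main__":
    print("plain SGD + momentum:", train_linear(use_rg=False))
    print("RG + momentum:       ", train_linear(use_rg=True))
```

On this convex toy problem both variants reach the same solution, so the sketch only demonstrates the mechanics of the update; it says nothing about the speed or accuracy effects reported for deep networks.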
Learning Rate   | 0.01    | 0.01    | 0.001   | 0.001   | 0.01    | 0.01    | 0.0001  | 0.0001
Random Gradient | no      | yes     | no      | yes     | no      | yes     | no      | yes
Momentum        | 0.95    | 0.95    | 0.95    | 0.95    | 0.90    | 0.90    | 0.95    | 0.95
Mean IOU        | 31.940% | 28.357% | 65.473% | 68.014% | 28.121% | 47.620% | 69.076% | 68.681%
4.1 Theoretical Analysis
The most common training practice is to decay the learning rate, typically by a factor of 0.1, when the model stops converging. After two or three decays, further decay no longer helps the model converge, which normally means that training can stop. In this paper, however, by using the random gradient strategy, the model can continue to converge under a smaller gradient. To explain random gradient, we can start from the second-order Taylor series expansion of the cost function:
$$L(\theta_0 - \eta g) \approx L(\theta_0) - \eta g^\top g + \frac{1}{2} \eta^2 g^\top H g \tag{8}$$
where $g$ is the gradient and $H$ is the Hessian at $\theta_0$.
Goodfellow et al. (2016) states: There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^\top H g$ term grows by more than an order of magnitude.
When the model has been oscillating without any performance improvement, it can be considered that $\frac{1}{2} \eta^2 g^\top H g$ is already large enough to affect convergence. The result is that learning becomes very slow despite the presence of a strong gradient, and the model continues to oscillate. This is where the random gradient strategy comes to the fore: without loss of convergence speed, the smaller gradient becomes the key factor, because it reduces the curvature term and thereby further improves performance.
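The role of the curvature term can be checked on a small ill-conditioned quadratic. The example below is a sketch under stated assumptions: the loss, the starting point, the step size, and the function names are hypothetical and are not taken from our experiments. It compares the change in loss predicted by the second-order expansion with the actual change, once for a full gradient step and once for a step scaled by a factor of 0.5, as a random loss weight might do.

```python
from typing import List

Vector = List[float]
CURVATURES = [1.0, 100.0]  # diagonal Hessian of the hypothetical test loss


def loss(x: Vector) -> float:
    """Ill-conditioned quadratic: 0.5 * (x0^2 + 100 * x1^2)."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(CURVATURES, x))


def gradient(x: Vector) -> Vector:
    return [h * xi for h, xi in zip(CURVATURES, x)]


def taylor_change(x: Vector, lr: float) -> float:
    """Predicted loss change: -lr * g^T g + 0.5 * lr^2 * g^T H g."""
    g = gradient(x)
    gtg = sum(gi * gi for gi in g)
    gthg = sum(h * gi * gi for h, gi in zip(CURVATURES, g))
    return -lr * gtg + 0.5 * lr * lr * gthg


def actual_change(x: Vector, lr: float) -> float:
    """Loss after one gradient step minus loss before it."""
    g = gradient(x)
    new_x = [xi - lr * gi for xi, gi in zip(x, g)]
    return loss(new_x) - loss(x)


if __name__ == "__main__":
    x0 = [1.0, 1.0]
    for scale in (1.0, 0.5):  # 0.5 stands in for one draw of the random loss weight
        step = 0.03 * scale
        print(scale, taylor_change(x0, step), actual_change(x0, step))
```

With the full step the curvature term outweighs the slope term and the loss increases, while the halved step decreases it; because the curvature term scales with the square of the step, shrinking the gradient reduces it faster than it reduces the expected improvement.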
5 Experiment and Analysis
In this section, we present experimental data and analysis. Unfortunately, we currently have only a single GTX1060, which cannot support large-scale computation. The experimental results may therefore not reach state-of-the-art levels, but the experiments are fair and support the claims above.
Although the random gradient method is very simple, our work covers three areas and a variety of experiments, so the experiment code will be published at https://github.com/leemathew1998/RG to give researchers access to details not mentioned in this paper.
5.1 Semantic Segmentation and Image Classification
In Table 1, it can be observed that with a learning rate of 0.01 and a momentum of 0.95, both of which are too large for the semantic segmentation task, both networks perform very poorly. When only the learning rate is reduced to 0.001, the network with random gradients is better than the original network; when we instead keep the learning rate fixed at 0.01 and reduce the momentum to 0.90, the outcome is similar. From this we conclude that adjusting the learning rate and adjusting the momentum have the same kind of effect with random gradients, but that adjusting the learning rate improves network performance more; this is a characteristic feature of random gradients. When the learning rate is 0.0001 and the momentum is 0.95, the accuracy with the random gradient is slightly worse than with the original gradient, which also shows that a well-chosen learning rate matters more than momentum.
This conclusion also applies to image classification. As shown on the left side of Figure 4, when the learning rate is 0.1 and the momentum is 0.95, both models perform poorly; when either condition is changed, the results improve greatly.
In the image classification task, we selected three high-performing networks, ResNet He et al. (2016a), DenseNet Huang et al. (2017), and MobileNetV2 Sandler et al. (2018), and examined the relationship between momentum and model accuracy. For simplicity, we fixed the initial learning rate at 0.1 and varied the momentum and the loss weight. Figure 5 shows that the optimal momentum value is not the same for every model; in general, a momentum of 0.5 is a good choice for image classification. As for convergence speed, ResNet34 clearly converges faster, but the other models show no obvious speed advantage. We do not discuss in depth how the ResNet architecture relates to accelerated convergence in this article, but we think it is an interesting question: among the many models we tested, ResNet and its variants He et al. (2016a, b); Zagoruyko and Komodakis (2016); Xie et al. (2016) seem to obtain better results more easily. The right side of Figure 4 shows our further experiment. As for accuracy, the improvement was not obvious in some cases, mainly because we fixed the learning rate.
In Fig 6, we focus on the learning rate. The left side shows image classification results with a momentum of 0.5. Because classification requires a higher learning rate, the random gradient method does not perform well when the initial learning rate is small, but when the learning rate is greater than 0.1 the model converges faster and to a better result. The right side shows the same behavior for semantic segmentation.
It can be seen from the above that the learning rate is very important for the convergence of the model. Although the optimal learning rate may differ significantly between model structures, simply keeping the hyperparameters of normal training still gives a better result, which also reduces the burden on researchers.
5.2 GAN
In Fig 7, 8, it can be seen clearly that our method generates clearer and more realistic images. For the generation tasks we did not test the relationship between learning rate and momentum, so we follow the settings of Zhu et al. (2017) in our experiments. We use the Adam solver Kingma and Ba (2014) with a learning rate of 0.0002 and momentum parameters $\beta_1 = 0.5$, $\beta_2 = 0.999$, and train the network for 200 epochs; please refer to the original paper for details. It is particularly noteworthy that the random gradient method also works well with Adam, which gives the random gradient a great degree of freedom.
6 Conclusion and Limitation
In this paper, we presented empirical evidence for a previously unreported phenomenon that we name random gradient. By applying a random weight to the loss we change the gradient, and we were surprised to find that the random gradient method performs well in many fields. It does not depend on a particular optimization algorithm: using it with Adam in generation tasks also achieves better results.
Although our method achieves compelling results in many fields, we have not given a convincing theoretical analysis; we have only summarized a tentative conclusion simply and intuitively from the observed phenomena. Nor have we systematically analyzed the Nesterov method Nesterov and Nemirovskii (1994), although in some simple experimental checks its results are similar to those of the momentum method.
7 Future Work
We can only see a short distance ahead, but we can see plenty there that needs to be done. – Alan Turing
We are pleasantly surprised to find a further improvement when the random number is replaced with the cyclical strategy proposed by Smith (2015). That strategy was proposed for adjusting the learning rate, and in our setting it also exhibits fast convergence under certain conditions. We will conduct more in-depth research in the future.
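One way such a replacement might look is sketched below, using the triangular cyclical policy of Smith (2015) to produce a loss weight instead of a learning rate. This is an assumption-laden illustration: the text above does not specify which cyclical policy, range, or cycle length is used, so the triangular shape, the range from 0 to 1, and the function name `cyclical_loss_weight` are hypothetical choices.

```python
import math


def cyclical_loss_weight(iteration: int, step_size: int,
                         low: float = 0.0, high: float = 1.0) -> float:
    """Triangular cyclical schedule applied to a loss weight.

    The weight rises linearly from `low` to `high` over `step_size` iterations
    and falls back to `low` over the next `step_size` iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return low + (high - low) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # One full cycle with a half-cycle length of 100 iterations.
    for it in (0, 50, 100, 150, 200):
        print(it, cyclical_loss_weight(it, step_size=100))
```

In a training loop, the returned value would take the place of the random number that multiplies the loss, making the gradient scale deterministic and periodic rather than random.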
References
 Chaudhari and Soatto [2017] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks., 2017. arXiv:1710.11029.
 Chaudhari et al. [2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys, 2016. arXiv:1611.01838.
 Chen et al. [2017] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, 2017. arXiv:1512.01274.
 Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
 Denton et al. [2015] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. pages 1486–1494, 2015.
 Ding et al. [2016] Yi Ding, Peilin Zhao, Steven C.H. Hoi, and Yew-Soon Ong. Adaptive subgradient methods for online AUC maximization, 2016. arXiv:1602.00351.
 Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
 Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In International Conference on Neural Information Processing Systems, pages 2672–2680, 2014.
 Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017. arXiv:1706.02677.
 Hariharan et al. [2011] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
 Hawkins and Blakeslee [2004] J. Hawkins and S. Blakeslee. On intelligence: How a new understanding of the brain will lead to truly intelligent machines. 2004.
 He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015. arXiv:1502.01852.
 He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016a.
 He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. pages 630–645, 2016b.
 Hinton [2012] Geoffrey E. Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):599–619, 2012.
 Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2017.
 Jastrzębski et al. [2017] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD, 2017. arXiv:1711.04623.
 Jones et al. [1995] Jones, M, and Poggio. Regularization theory and neural networks architectures. Neural Comp, 7(2):219–269, 1995.
 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. arXiv:1412.6980.
 Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009. Tech Report.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
 Lin et al. [2014] Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
 Martens [2010] James Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning, pages 735–742, 2010.
 Nair and Hinton [2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, 2010.
 Nesterov and Nemirovskii [1994] Iu. E. Nesterov and Arkadii Semenovich Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM Studies in Applied Mathematics, 13, Philadelphia, 1994.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch, 2017.
 Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Computer Science, 2015.
 Rumelhart et al. [1986] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by backpropagating errors. Nature, 323(9):533–536, 1986.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. In IJCV, 2015. doi:10.1007/s11263-015-0816-y.
 Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. 2016.
 Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 2018.
 Smith [2015] Leslie N. Smith. Cyclical learning rates for training neural networks, 2015. arXiv:1506.01186.
 Smith and Topin [2017] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates, 2017.
 Smith and Le [2017] Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent, 2017. arXiv:1710.06451.
 Smith et al. [2017] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size, 2017. arXiv:1711.00489.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Wang et al. [2017] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. pages 1451–1460, 2017.
 Xie et al. [2016] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. pages 5987–5995, 2016.
 You et al. [2017] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017. arXiv:1708.03888.
 Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. 2016.
 Zhao et al. [2017] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 Zhao et al. [2016] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. 2016.
 Zhu et al. [2017] Jun Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. pages 2242–2251, 2017.