Sobolev Training for Neural Networks
Abstract
At the heart of deep learning we aim to use neural networks as function approximators – training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth; however, it is becoming more common to have access to derivatives of the target output with respect to the input – for example when the ground truth function is itself a neural network, as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to approximate not only the function’s outputs but also the function’s derivatives, we encode additional information about the target function within the parameters of the neural network. We can thereby improve the quality of our predictors, as well as the data-efficiency and generalisation capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.
Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. DeepMind, London, UK. {lejlot,osindero,jaderberg,swirszcz,razp}@google.com
1 Introduction
Deep Neural Networks (DNNs) are one of the main tools of modern machine learning. They have consistently proven to be powerful function approximators, able to model a wide variety of functional forms – from image recognition [8, 24], through audio synthesis [27], to human-beating policies in the ancient game of Go [22]. In many applications the process of training a neural network consists of receiving a dataset of input-output pairs from a ground truth function, and minimising some loss with respect to the network’s parameters. This loss is usually designed to encourage the network to produce the same output, for a given input, as that from the target ground truth function. Many of the ground truth functions we care about in practice have an unknown analytic form, e.g. because they are the result of a natural physical process, and therefore we only have the observed input-output pairs for supervision. However, there are scenarios where we do know the analytic form and so are able to compute the ground truth gradients (or higher order derivatives); alternatively, these quantities may sometimes be directly observable. A common example is when the ground truth function is itself a neural network; for instance this is the case for distillation [9, 20], compressing neural networks [7], and the prediction of synthetic gradients [12]. Additionally, if we are dealing with an environment or data-generation process (rather than a predetermined set of data points), then even though we may be dealing with a black box we can still approximate derivatives using finite differences. In this work, we consider how this additional information can be incorporated in the learning process, and what advantages it can provide in terms of data efficiency and performance.
We propose Sobolev Training (ST) for neural networks as a simple and efficient technique for leveraging derivative information about the desired function in a way that can easily be incorporated into any training pipeline using modern machine learning libraries.
The approach is inspired by the work of Hornik [10] which proved the universal approximation theorems for neural networks in Sobolev spaces – metric spaces where distances between functions are defined both in terms of their differences in values and differences in values of their derivatives.
In particular, it was shown that a sigmoid network can not only approximate a function’s value arbitrarily well, but that the network’s derivatives with respect to its inputs can approximate the corresponding derivatives of the ground truth function arbitrarily well too. Sobolev Training exploits this property, and tries to match not only the output of the function being trained but also its derivatives.
There are several related works which have also exploited derivative information for function approximation. For instance Wu et al. [30] and antecedents propose a technique for Bayesian optimisation with Gaussian Processes (GPs), where it was demonstrated that the use of information about gradients and Hessians can improve the predictive power of GPs. In previous work on neural networks, derivatives of predictors have usually been used either to penalise model complexity (e.g. by pushing the Jacobian norm to 0 [19]), or to encode additional, hand-crafted invariances to some transformations (for instance, as in Tangentprop [23]), or have been estimated for dynamical systems [6], and very recently have been used to provide additional learning signal during attention distillation [31] (see Supplementary Materials, Section 5, for details). Similar techniques have also been used in critic-based Reinforcement Learning (RL), where a critic’s derivatives are trained to match its target’s derivatives [29, 15, 5, 4, 26] using small, sigmoid-based models. Finally, Hyvärinen proposed Score Matching Networks [11], which are based on the somewhat surprising observation that one can model unknown derivatives of the function without actual access to its values – all that is needed is a sampling-based strategy and a specific penalty. However, such an estimator has a high variance [28], and is thus of limited use when true derivatives are given.
To the best of our knowledge and despite its simplicity, the proposal to directly match network derivatives to the true derivatives of the target function has been minimally explored for deep networks, especially modern ReLU-based models. We show that by using the additional knowledge of derivatives with Sobolev Training we are able to train better models – models which achieve lower approximation errors and generalise to test data better – and reduce the sample complexity of learning. The contributions of our paper are therefore threefold: (1) We introduce Sobolev Training – a new paradigm for training neural networks. (2) We look formally at the implications of matching derivatives, extending previous results of Hornik [10] and showing that modern architectures are well suited for such training regimes. (3) We provide empirical evidence demonstrating that Sobolev Training leads to improved performance and generalisation, particularly in low data regimes. Example domains are: regression on classical optimisation problems; policy distillation from RL agents trained on the Atari domain; and training deep, complex models using synthetic gradients – we report the first successful attempt to train a large-scale ImageNet model using synthetic gradients.
2 Sobolev Training
We begin by introducing the idea of training using Sobolev spaces. When learning a function $f$, we may have access to not only the output values $f(x_i)$ for training points $x_i$, but also the values of its $j$-th order derivatives with respect to the input, $D_x^j f(x_i)$. In other words, instead of the typical training set consisting of pairs $\{(x_i, f(x_i))\}_{i=1}^N$ we have access to $(K+2)$-tuples $\{(x_i, f(x_i), D_x^1 f(x_i), \ldots, D_x^K f(x_i))\}_{i=1}^N$. In this situation, the derivative information can easily be incorporated into training a neural network model of $f$ by making derivatives of the neural network match the ones given by $f$.
Considering a neural network model $m$ parameterised with $\theta$, one typically seeks to minimise the empirical error in relation to $f$ according to some loss function $\ell$:
$$\sum_{i=1}^N \ell\big(m(x_i|\theta), f(x_i)\big).$$
When learning in Sobolev spaces, this is replaced with:
$$\sum_{i=1}^N \Big[ \ell\big(m(x_i|\theta), f(x_i)\big) + \sum_{j=1}^K \ell_j\big(D_x^j m(x_i|\theta), D_x^j f(x_i)\big) \Big], \tag{1}$$
where $\ell_j$ are loss functions measuring error on $j$-th order derivatives. This causes the neural network to encode derivatives of the target function in its own derivatives. Such a model can still be trained using backpropagation and off-the-shelf optimisers.
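As a minimal sketch of the first-order case of objective (1), assuming squared-error losses for both the value term and the derivative term, the objective can be evaluated as follows. The function `sobolev_loss`, the toy target, and the linear model are illustrative assumptions and are not taken from the original experiments.

```python
import math

def sobolev_loss(model, model_grad, xs, ys, dys):
    """First-order Sobolev objective with squared-error losses.

    model(x) returns a scalar prediction and model_grad(x) returns its
    gradient with respect to the input x (a list of floats).
    """
    total = 0.0
    for x, y, dy in zip(xs, ys, dys):
        value_term = (model(x) - y) ** 2
        grad_term = sum((g - t) ** 2 for g, t in zip(model_grad(x), dy))
        total += value_term + grad_term
    return total

# Hypothetical target function with a known analytic gradient.
def target(x):
    return math.sin(x[0]) + x[1] ** 2

def target_grad(x):
    return [math.cos(x[0]), 2.0 * x[1]]

# Toy linear model m(x) = w . x + b; its input gradient is simply w.
w, b = [0.5, 1.0], 0.1

def linear_model(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linear_model_grad(x):
    return list(w)

# Training tuples (x_i, f(x_i), grad f(x_i)).
xs = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
ys = [target(x) for x in xs]
dys = [target_grad(x) for x in xs]

loss_linear = sobolev_loss(linear_model, linear_model_grad, xs, ys, dys)
loss_exact = sobolev_loss(target, target_grad, xs, ys, dys)
```

A model that reproduces both the values and the gradients of the target attains zero loss, while the linear model is penalised on both terms. In a real pipeline the model gradient would come from automatic differentiation rather than a hand-written function.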
A potential concern is that this optimisation might be expensive when either the output dimensionality of $f$ or the order $K$ is high; however, one can reduce this cost through stochastic approximations. Specifically, if $f$ is a multivariate function, instead of a vector gradient, one ends up with a full Jacobian matrix which can be large. To avoid adding computational complexity to the training process, one can use an efficient, stochastic version of Sobolev Training: instead of computing a full Jacobian/Hessian, one just computes its projection onto a random vector (a direct application of a known estimation trick [19]). In practice, this means that during training we have a random variable $v$ sampled uniformly from the unit sphere, and we match these random projections instead:
$$\sum_{i=1}^N \Big[ \ell\big(m(x_i|\theta), f(x_i)\big) + \sum_{j=1}^K \mathbb{E}_{v^j}\Big[ \ell_j\big(\langle D_x^j m(x_i|\theta), v^j \rangle, \langle D_x^j f(x_i), v^j \rangle\big) \Big] \Big]. \tag{2}$$
Figure 1 illustrates compute graphs for non-stochastic and stochastic Sobolev Training of order 2.
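The random-projection idea can be illustrated with a small sketch, assuming a first-order setting, a squared-error loss, and a projection vector that lives in the output space of the function (so that the projected Jacobian is the gradient of a scalar). The helper names and the two example Jacobians are hypothetical. For $v$ uniform on the unit sphere in $\mathbb{R}^d$ we have $\mathbb{E}[v v^\top] = I/d$, so the expected projected loss equals the full squared Frobenius distance divided by $d$.

```python
import math
import random

def sample_unit_sphere(dim, rng):
    """Draw a vector uniformly from the unit sphere via a normalised Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def project(jacobian, v):
    """Return v^T J for a Jacobian stored as rows (one row per output)."""
    n_in = len(jacobian[0])
    return [sum(v[o] * jacobian[o][i] for o in range(len(v))) for i in range(n_in)]

def projected_loss(jac_model, jac_target, v):
    """Squared error between the two projected Jacobians."""
    pm, pt = project(jac_model, v), project(jac_target, v)
    return sum((a - b) ** 2 for a, b in zip(pm, pt))

# Hypothetical 3-output, 2-input Jacobians at a single training point.
jac_model = [[1.0, 0.0], [0.5, -1.0], [2.0, 2.0]]
jac_target = [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]

# Monte Carlo estimate of the expected projected loss.
rng = random.Random(0)
n_samples = 20000
estimate = sum(
    projected_loss(jac_model, jac_target, sample_unit_sphere(3, rng))
    for _ in range(n_samples)
) / n_samples

# Full squared Frobenius distance, which requires the whole Jacobian.
full = sum(
    (a - b) ** 2
    for row_m, row_t in zip(jac_model, jac_target)
    for a, b in zip(row_m, row_t)
)
```

Here `estimate` should be close to `full / 3`, which is why matching random projections acts as a rescaled stochastic surrogate for matching the whole Jacobian. In practice each projection costs one backward pass instead of one pass per output dimension.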
3 Theory and motivation
While in the previous section we defined Sobolev Training, it is not obvious that modeling the derivatives of the target function is beneficial to function approximation, or that optimising such an objective is even feasible. In this section we motivate and explore these questions theoretically, showing that the Sobolev Training objective is a well posed one, and that incorporating derivative information has the potential to drastically reduce the sample complexity of learning.
Hornik showed [10] that neural networks with non-constant, bounded, continuous activation functions with continuous derivatives up to order $K$ are universal approximators in the Sobolev spaces of order $K$, thus showing that sigmoid networks are indeed capable of approximating elements of these spaces arbitrarily well. However, nowadays we often use activation functions such as ReLU which are neither bounded nor have continuous derivatives. The following theorem shows that for $K = 1$ we can use the ReLU function (or a similar one, like leaky ReLU) to create neural networks that are universal approximators in Sobolev spaces. We will use the standard symbol $C^1(S)$ (or simply $C^1$) to denote the space of functions which are continuous, differentiable, and have a continuous derivative on a space $S$ [14]. All proofs are given in the Supplementary Materials (SM).
Theorem 1.
Let $f$ be a $C^1$ function on a compact set. Then, for every positive $\varepsilon$ there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates $f$ in the Sobolev space of order 1 up to $\varepsilon$ error.
This suggests that the Sobolev Training objective is achievable, and that we can seek to encode the values and derivatives of the target function in the values and derivatives of a ReLU neural network model. Interestingly, we can show that if we seek to encode an arbitrary function in the derivatives of the model then this is impossible not only for neural networks but also for any arbitrary differentiable predictor on compact sets.
Theorem 2.
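Let $f$ be a $C^1$ function. Let $g$ be a continuous function satisfying $\| g - \frac{\partial f}{\partial x} \| > 0$. Then, there exists an $\eta > 0$ such that for any $C^1$ function $h$ there holds either $\| f - h \| \geq \eta$ or $\| g - \frac{\partial h}{\partial x} \| \geq \eta$.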
However, when we move to the regime of finite training data, we can encode any arbitrary function in the derivatives (as well as higher order signals if the resulting Sobolev spaces are not degenerate), as shown in the following Proposition.
Proposition 1.
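Given any two functions $f : S \to \mathbb{R}$ and $g : S \to \mathbb{R}^d$ on $S \subseteq \mathbb{R}^d$ and a finite set $\Sigma \subset S$, there exists a neural network $h$ with a ReLU (or a leaky ReLU) activation such that $\forall x \in \Sigma$: $f(x) = h(x)$ and $g(x) = \frac{\partial h}{\partial x}(x)$ (it has 0 training loss).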
Having shown that it is possible to train neural networks to encode both the values and derivatives of a target function, we now formalise one possible way of showing that Sobolev Training has lower sample complexity than regular training.
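Let $\mathcal{F}$ denote the family of functions parametrised by $\omega$. We define $K_{reg} = K_{reg}(\mathcal{F})$ to be a measure of the amount of data needed to learn some target function $f$. That is, $K_{reg}$ is the smallest number for which there holds: for every $f_\omega \in \mathcal{F}$ and every set of distinct $K_{reg}$ points $(x_1, \ldots, x_{K_{reg}})$ such that $\forall_{i=1,\ldots,K_{reg}}\, f(x_i) = f_\omega(x_i) \Rightarrow f = f_\omega$. $K_{sob}$ is defined analogously, but the final implication is of the form $f(x_i) = f_\omega(x_i) \wedge \frac{\partial f}{\partial x}(x_i) = \frac{\partial f_\omega}{\partial x}(x_i) \Rightarrow f = f_\omega$. Straight from the definition there follows:
Proposition 2.
For any $\mathcal{F}$, there holds $K_{sob}(\mathcal{F}) \leq K_{reg}(\mathcal{F})$.
For many families, the above inequality becomes strict. For example, to determine the coefficients of a polynomial of degree $n$ one needs to compute its values in at least $n + 1$ distinct points. If we know both the values and the derivatives at the sample points, it is a well-known fact that only $\lceil (n+1)/2 \rceil$ points suffice to determine all the coefficients. We present two more examples in a slightly more formal way. Let $\mathcal{F}_G$ denote a family of Gaussian PDFs (parametrised by $\mu$, $\sigma$). Let $D = D_1 \cup \ldots \cup D_n \subset \mathbb{R}^d$ and let $\mathcal{F}_{PL}$ be a family of functions from $D_1 \times \ldots \times D_n$ (Cartesian product of sets $D_i$) to $\mathbb{R}^n$ of the form $f(x) = [A_1 x_1 + b_1, \ldots, A_n x_n + b_n]$ (linear element-wise) (Figure 2 Left).
Proposition 3.
There holds $K_{sob}(\mathcal{F}_G) < K_{reg}(\mathcal{F}_G)$ and $K_{sob}(\mathcal{F}_{PL}) < K_{reg}(\mathcal{F}_{PL})$.
This result relates to Deep ReLU networks as they build a hyperplane-based model of the target function. If those were parametrised independently one could expect a reduction of sample complexity by $d + 1$ times, where $d$ is the dimension of the function domain. In practice the parameters of the hyperplanes in such networks are not independent, and furthermore the hinge positions change, so the Proposition cannot be directly applied; but it can be seen as an intuitive way to see why the sample complexity drops significantly for Deep ReLU networks too.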
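The polynomial example can be checked numerically. The sketch below assumes a cubic sampled at the two points $x = 0$ and $x = 1$; the closed-form solve is the standard cubic Hermite interpolation and the function names are illustrative. With values and derivatives at two points all four coefficients are recovered, whereas values alone at the same two points leave the cubic undetermined.

```python
def cubic_from_hermite_data(p0, d0, p1, d1):
    """Coefficients [a0, a1, a2, a3] of the cubic matching value and
    derivative at x = 0 (p0, d0) and at x = 1 (p1, d1)."""
    a0 = p0
    a1 = d0
    a2 = 3.0 * (p1 - p0) - 2.0 * d0 - d1
    a3 = 2.0 * (p0 - p1) + d0 + d1
    return [a0, a1, a2, a3]

def poly(coeffs, x):
    """Evaluate sum_k coeffs[k] * x^k."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def poly_deriv(coeffs, x):
    """Evaluate the derivative sum_k k * coeffs[k] * x^(k-1)."""
    return sum(k * c * x ** (k - 1) for k, c in enumerate(coeffs) if k > 0)

# Ground truth cubic 1 - 2x + 0.5x^2 + 3x^3.
true_coeffs = [1.0, -2.0, 0.5, 3.0]

# Two points with values and derivatives determine all four coefficients.
recovered = cubic_from_hermite_data(
    poly(true_coeffs, 0.0), poly_deriv(true_coeffs, 0.0),
    poly(true_coeffs, 1.0), poly_deriv(true_coeffs, 1.0),
)

# Values alone at the same two points cannot separate these two cubics:
# adding x^2 - x leaves p(0) and p(1) unchanged but alters the derivatives.
other_coeffs = [1.0, -3.0, 1.5, 3.0]
```

Learning the same cubic from values only would need four distinct points, so in this instance derivative information halves the number of samples required.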
4 Experimental Results
We consider three domains where information about derivatives is available during training. (All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1].)
4.1 Artificial Data
First, we consider the task of regression on a set of well known low-dimensional functions used for benchmarking optimisation methods.
We train neural networks with two hidden layers of 256 ReLU units each to regress towards function values, and verify generalisation capabilities by evaluating the mean squared error on a hold-out test set. Since the task is standard regression, we choose all the losses of Sobolev Training to be L2 errors, and use a first order Sobolev method (second order derivatives of ReLU networks with a linear output layer are constant, zero). The optimisation is therefore:
$$\min_\theta \frac{1}{N} \sum_{i=1}^N \| f(x_i) - m(x_i|\theta) \|_2^2 + \| \nabla_x f(x_i) - \nabla_x m(x_i|\theta) \|_2^2.$$
[Figure 2, right: test errors per dataset for 20 and 100 training samples, each comparing Regular and Sobolev training.]
Figure 2 (right) shows the results for the optimisation benchmarks. As expected, Sobolev trained networks perform extremely well – for six out of seven benchmark problems they significantly reduce the testing error, with the obtained errors orders of magnitude smaller than the corresponding errors of the regularly trained networks. The stark difference in approximation error is highlighted in Figure 3, where we show the Styblinski-Tang function and its approximations with both regular and Sobolev Training. It is clear that even in very low data regimes, the Sobolev trained networks can capture the functional shape.
Looking at the results, we make two important observations. First, the effect of Sobolev Training is stronger in low-data regimes; however, it does not disappear even in the high data regime, when one has 10,000 training examples for training a two-dimensional function. Second, the only case where regular regression performed better is the regression towards Ackley’s function. This particular example was chosen to show that one possible weak point of our approach might be approximating functions with a very high frequency signal component in the relatively low data regime. Ackley’s function is composed of exponents of high frequency cosine waves, thus creating an extremely bumpy surface; consequently, a method that tries to match the derivatives can behave badly during testing if one does not have enough data to capture this complexity. However, once we have enough training data points, Sobolev trained networks are able to approximate this function better.
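For benchmark functions such as Styblinski-Tang the derivative targets are available in closed form, so building a first-order Sobolev training set adds very little cost. The sketch below uses the standard definition $f(x) = \frac{1}{2} \sum_i (x_i^4 - 16 x_i^2 + 5 x_i)$; the helper `make_sobolev_dataset` and the chosen sample points are illustrative assumptions, not the sampling scheme used in the experiments.

```python
def styblinski_tang(x):
    """f(x) = 0.5 * sum_i (x_i^4 - 16 x_i^2 + 5 x_i)."""
    return 0.5 * sum(xi ** 4 - 16.0 * xi ** 2 + 5.0 * xi for xi in x)

def styblinski_tang_grad(x):
    """Analytic gradient: df/dx_i = 2 x_i^3 - 16 x_i + 2.5."""
    return [2.0 * xi ** 3 - 16.0 * xi + 2.5 for xi in x]

def make_sobolev_dataset(points):
    """Build (x, f(x), grad f(x)) tuples for first-order Sobolev Training."""
    return [(x, styblinski_tang(x), styblinski_tang_grad(x)) for x in points]

# Three example inputs; the last one is close to the known global minimiser.
dataset = make_sobolev_dataset(
    [[0.0, 0.0], [1.0, -2.0], [-2.903534, -2.903534]]
)
```

Each tuple carries one scalar value and a full gradient vector, so a single sample in $d$ dimensions provides $d + 1$ supervised quantities instead of one.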
4.2 Distillation
Another possible application of Sobolev Training is to perform model distillation. This technique has many applications, such as network compression [21], ensemble merging [9], or more recently policy distillation in reinforcement learning [20].
We focus here on the task of distilling a policy. We aim to distill a target policy $\pi^*(s)$ – a trained neural network which outputs a probability distribution over actions – into a smaller neural network $\pi(s|\theta)$, such that the two policies $\pi^*$ and $\pi$ have the same behaviour. In practice this is often done by minimising an expected divergence measure between $\pi$ and $\pi^*$, for example, the Kullback–Leibler divergence $D_{KL}(\pi(s) \| \pi^*(s))$, over states gathered while following $\pi^*$. Since policies are multivariate functions, direct application of Sobolev Training would mean producing full Jacobian matrices with respect to the state $s$, which for large action spaces is computationally expensive. To avoid this issue we employ the stochastic approximation described in Section 2, thus resulting in the objective
$$\min_\theta D_{KL}\big(\pi(s|\theta) \,\|\, \pi^*(s)\big) + \mathbb{E}_v\Big[ \big\| \nabla_s \langle \log \pi^*(s), v \rangle - \nabla_s \langle \log \pi(s|\theta), v \rangle \big\| \Big],$$
where the expectation is taken with respect to $v$ coming from a uniform distribution over the unit sphere, and Monte Carlo sampling is used to approximate it.
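A minimal sketch of such a distillation objective for a single state and a single sampled projection vector is given below. It assumes linear-softmax policies (so the projected gradient has a closed form), a KL term from the student to the teacher, and an unweighted Euclidean norm on the mismatch of projected log-policy gradients. These modelling choices, and all names in the code, are assumptions made for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(W, s):
    """Action probabilities of a linear-softmax policy with weight rows W[a]."""
    return softmax([sum(w * x for w, x in zip(row, s)) for row in W])

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def projected_logpolicy_grad(W, s, v):
    """Gradient with respect to s of <log pi(s), v>.

    For logits z = W s, d log pi_a / ds = W[a] - sum_b pi_b W[b].
    """
    pi = policy(W, s)
    mean_row = [sum(pi[b] * W[b][i] for b in range(len(W))) for i in range(len(s))]
    v_sum = sum(v)
    return [
        sum(v[a] * W[a][i] for a in range(len(W))) - v_sum * mean_row[i]
        for i in range(len(s))
    ]

def sobolev_distillation_loss(W_student, W_teacher, s, v):
    """KL term plus the norm of the projected log-policy gradient mismatch."""
    p_student, p_teacher = policy(W_student, s), policy(W_teacher, s)
    g_s = projected_logpolicy_grad(W_student, s, v)
    g_t = projected_logpolicy_grad(W_teacher, s, v)
    grad_term = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_s, g_t)))
    return kl(p_student, p_teacher) + grad_term

# Hypothetical teacher and student with 3 actions and a 2-dimensional state.
W_teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
W_student = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
s = [0.3, -0.7]
v = [0.6, 0.0, -0.8]  # a fixed unit vector standing in for one random sample

loss_student = sobolev_distillation_loss(W_student, W_teacher, s, v)
loss_teacher = sobolev_distillation_loss(W_teacher, W_teacher, s, v)
```

With convolutional policies the projected gradient would be obtained by backpropagating the scalar $\langle \log \pi(s), v \rangle$ to the input, and the expectation would be approximated by averaging over freshly sampled vectors.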
As target policies $\pi^*$, we use agents playing Atari games [17] that have been trained with A3C [16] on three well known games: Pong, Breakout and Space Invaders. The agent’s policy is a neural network consisting of 3 layers of convolutions followed by two fully-connected layers, which we distill to a smaller network with 2 convolutional layers and a single smaller fully-connected layer (see SM for details). Distillation is treated here as a purely supervised learning problem, as our aim is not to re-evaluate known distillation techniques, but rather to show that if the aim is to minimise a given divergence measure, we can improve distillation using Sobolev Training.
[Figure 4: test action prediction error and test $D_{KL}$ during training, for regular distillation and Sobolev distillation.]
Figure 4 shows test error during training with and without Sobolev Training. (Testing is performed on a held-out set of episodes, thus there are no temporal or causal relations between training and testing.) The introduction of Sobolev Training leads to similar effects as in the previous section – the network generalises much more effectively, and this is especially true in low data regimes. Note that the performance gap on Pong is small because the optimal policy is quite degenerate for this game (for the majority of the time the policy in Pong is uniform, since actions taken when the ball is far away from the player do not matter at all; only in crucial situations does it peak so that the ball hits the paddle). In all remaining games one can see a significant performance increase from using our proposed method, as well as minor to no overfitting.
Despite looking like a regularisation effect, we stress that Sobolev Training is not trying to find the simplest models for data or suppress the expressivity of the model. This training method aims at matching the original function’s smoothness/complexity and so reduces overfitting by effectively extending the information content of the training set, rather than by imposing a data-independent prior as with regularisation.
4.3 Synthetic Gradients
Table 1
                                          | Noprop | Direct SG [12] | VFBN [25] | Critic | Sobolev
CIFAR10 with 3 synthetic gradient modules
Top 1 (94.3%)                             | 54.5%  | 79.2%          | 88.5%     | 93.2%  | 93.5%
ImageNet with 1 synthetic gradient module
Top 1 (75.0%)                             | 54.0%  | -              | 57.9%     | 71.7%  | 72.0%
Top 5 (92.3%)                             | 77.3%  | -              | 81.5%     | 90.5%  | 90.8%
ImageNet with 3 synthetic gradient modules
Top 1 (75.0%)                             | 18.7%  | -              | 28.3%     | 65.7%  | 66.5%
Top 5 (92.3%)                             | 38.0%  | -              | 52.9%     | 86.9%  | 87.4%
The previous experiments have shown how information about the derivatives can improve the approximation of function values. However, the core idea of Sobolev Training is broader than that, and can be employed in both directions. Namely, if one ultimately cares about approximating derivatives, then additionally approximating values can help this process too. One recent technique which requires a model of gradients is Synthetic Gradients (SG) [12] – a method for training complex neural networks in a decoupled, asynchronous fashion. In this section we show how we can use Sobolev Training for SG.
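The principle behind SG is that instead of doing full backpropagation using the chain rule, one splits a network into two (or more) parts, and approximates partial derivatives of the loss $L$ with respect to some hidden layer activations $h$ with a trainable function $SG(h, y)$. In other words, given that network parameters up to $h$ are denoted by $\theta$,
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial \theta} \approx SG(h, y) \frac{\partial h}{\partial \theta}.$$
In the original SG paper, this module is trained to minimise $L_{SG}(h, y) = \left\| SG(h, y) - \frac{\partial L(p_h, y)}{\partial h} \right\|_2^2$, where $p_h$ is the final prediction of the main network for hidden activations $h$. For the case of learning a classifier, in order to apply Sobolev Training in this context we construct a loss predictor, composed of a class predictor $p(\cdot|\theta)$ followed by the log loss, which gets supervision from the true loss, and the gradient of the prediction gets supervision from the true gradient:
$$m(h, y|\theta) := L(p(h|\theta), y), \qquad SG(h, y|\theta) := \frac{\partial m(h, y|\theta)}{\partial h},$$
$$L^{sob}_{SG}(h, y) = \ell\big(m(h, y|\theta), L(p_h, y)\big) + \ell_1\Big(\frac{\partial m(h, y|\theta)}{\partial h}, \frac{\partial L(p_h, y)}{\partial h}\Big).$$
In the Sobolev Training framework, the target function is the loss of the main network $L(p_h, y)$, which we train a model $m(h, y|\theta)$ to approximate; in addition we ensure that the model’s derivatives $\partial m(h, y|\theta)/\partial h$ are matched to the true derivatives $\partial L(p_h, y)/\partial h$. The model’s derivatives are used as the synthetic gradient to decouple the main network.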
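The construction of a loss predictor whose input gradient serves as the synthetic gradient can be sketched as follows. The sketch assumes a linear-softmax class predictor, the log loss, and squared-error penalties on both the loss value and its gradient; the weight matrices and the function names are hypothetical and stand in for much larger modules.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, h):
    """Linear-softmax class predictor applied to hidden activations h."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W])

def predicted_loss(W, h, y):
    """Log loss of the class predictor for label y (the loss model)."""
    return -math.log(class_probs(W, h)[y])

def synthetic_gradient(W, h, y):
    """Gradient of predicted_loss with respect to h: W^T (p - onehot(y))."""
    probs = class_probs(W, h)
    delta = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
    return [sum(delta[c] * W[c][i] for c in range(len(W))) for i in range(len(h))]

def sobolev_sg_loss(W, h, y, true_loss, true_grad):
    """Supervise the predicted loss and its input gradient together."""
    value_term = (predicted_loss(W, h, y) - true_loss) ** 2
    sg = synthetic_gradient(W, h, y)
    grad_term = sum((a - b) ** 2 for a, b in zip(sg, true_grad))
    return value_term + grad_term

# Hypothetical "rest of the network" (W_true) and SG module (W_sg).
W_true = [[1.0, -0.5], [0.0, 2.0], [0.5, 0.5]]
W_sg = [[0.2, 0.0], [0.0, 0.3], [0.1, 0.1]]
h, y = [0.2, -0.4], 1

true_loss = predicted_loss(W_true, h, y)
true_grad = synthetic_gradient(W_true, h, y)
loss_sg = sobolev_sg_loss(W_sg, h, y, true_loss, true_grad)
loss_perfect = sobolev_sg_loss(W_true, h, y, true_loss, true_grad)
```

Dropping `grad_term` leaves a loss critic that is supervised only on the loss value, while dropping `value_term` leaves gradient-only supervision of the same architecture.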
This setting closely resembles what is known in reinforcement learning as critic methods [13]. In particular, if we do not provide supervision on the gradient part, we end up with a loss critic. Similarly, if we do not provide supervision at the loss level, but only on the gradient component, we end up with a method that resembles VFBN [25]. In light of these connections, our approach in this application setting can be seen as a generalisation and unification of several existing ones (see Table 1 for illustrations of these approaches).
We perform experiments on decoupling deep convolutional neural network image classifiers using synthetic gradients produced by loss critics that are trained with Sobolev Training, and compare to regular loss critic training, and regular synthetic gradient training. We report results on CIFAR10 for three network splits (and therefore three synthetic gradient modules) and on ImageNet with one and three network splits. (N.b. the experiments presented use learning rates, annealing schedules, etc. optimised to maximise the backpropagation baseline, rather than the synthetic gradient decoupled result; details are in the SM.)
The results are shown in Table 1. With a naive SG model, we obtain 79.2% test accuracy on CIFAR10. Using an SG architecture which resembles a small version of the rest of the model makes learning much easier and leads to 88.5% accuracy, while Sobolev Training achieves 93.5% final performance. The regular critic also trains well, achieving 93.2%, as the critic forces the lower part of the network to provide a representation which it can use to reduce the classification (and not just prediction) error. Consequently it provides a learning signal which is well aligned with the main optimisation. However, this can lead to building representations which are suboptimal for the rest of the network. Adding gradient supervision by constructing our Sobolev SG module avoids this issue by making sure that synthetic gradients are truly aligned, and gives an additional boost to the final accuracy.
For ImageNet [3] experiments based on ResNet50 [8], we obtain qualitatively similar results. Due to the complexity of the model and an almost 40% gap between no backpropagation and full backpropagation results, the difference between methods with vs without loss supervision grows significantly. This suggests that at least for ResNet-like architectures, loss supervision is a crucial component of an SG module. After splitting ResNet50 into four parts the Sobolev SG achieves 87.4% top 5 accuracy, while the regular critic SG achieves 86.9%, confirming our claim about suboptimal representation being enforced by gradients from a regular critic. Sobolev Training results were also much more reliable in all experiments (significantly smaller standard deviation of the results).
5 Discussion and Conclusion
In this paper we have introduced Sobolev Training for neural networks – a simple and effective way of incorporating knowledge about derivatives of a target function into the training of a neural network function approximator. We provided theoretical justification that encoding both a target function’s value as well as its derivatives within a ReLU neural network is possible, and that this results in more data efficient learning. Additionally, we show that our proposal can be efficiently trained using stochastic approximations if computationally expensive Jacobians or Hessians are encountered.
In addition to toy experiments which validate our theoretical claims, we performed experiments to highlight two very promising areas of application for such models: one being distillation/compression of models; the other being the application to various meta-optimisation techniques that build models of other models’ dynamics (such as synthetic gradients, learning-to-learn, etc.). In both cases we obtain significant improvement over classical techniques, and we believe there are many other application domains in which our proposal should give a solid performance boost.
In this work we focused on encoding true derivatives in the corresponding ones of the neural network. Another possibility for future work is to encode information which one believes to be highly correlated with derivatives. For example curvature [18] is believed to be connected to uncertainty. Therefore, given a problem with known uncertainty at training points, one could use Sobolev Training to match the second order signal to the provided uncertainty signal. Finite differences can also be used to approximate gradients for black box target functions, which could help when, for example, learning a generative temporal model. Another unexplored path would be to apply Sobolev Training to internal derivatives rather than just derivatives with respect to the inputs.
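For black box targets, derivative targets can be approximated with central finite differences, at the cost of two extra function evaluations per input dimension. The sketch below is a generic illustration: the step size, the helper name, and the example function are assumptions rather than settings taken from the experiments.

```python
import math

def finite_difference_grad(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of a black-box scalar f at x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grad

# Hypothetical black box; its analytic gradient is used only for checking.
def black_box(x):
    return math.exp(-x[0] ** 2) * math.sin(x[1])

point = [0.5, 1.0]
approx = finite_difference_grad(black_box, point)
exact = [
    -2.0 * point[0] * math.exp(-point[0] ** 2) * math.sin(point[1]),
    math.exp(-point[0] ** 2) * math.cos(point[1]),
]
```

The estimated gradients can then be used in place of the true derivatives as Sobolev Training targets, with the caveat that noisy or non-smooth targets make the choice of step size more delicate.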
References
 [1] Sonnet. https://github.com/deepmind/sonnet. 2017.
 [2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [4] Michael Fairbank and Eduardo Alonso. Value-gradient learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–8. IEEE, 2012.
 [5] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE transactions on neural networks and learning systems, 23(10):1671–1676, 2012.
 [6] A Ronald Gallant and Halbert White. On learning the derivatives of an unknown mapping with multilayer feedforward networks. Neural Networks, 5(1):129–138, 1992.
 [7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [10] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 [11] Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, pages 695–709, 2005.
 [12] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
 [13] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In NIPS, volume 13, pages 1008–1014, 1999.
 [14] Steven G Krantz. Handbook of complex variables. Springer Science & Business Media, 2012.
 [15] W Thomas Miller, Paul J Werbos, and Richard S Sutton. Neural networks for control. MIT press, 1995.
 [16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 [17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [18] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
 [19] Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot. Higher order contractive autoencoder. Machine Learning and Knowledge Discovery in Databases, pages 645–660, 2011.
 [20] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 [21] Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
 [22] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [23] Patrice Simard, Bernard Victorri, Yann LeCun, and John S Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In NIPS, volume 91, pages 895–903, 1991.
 [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [25] Takeru Miyato, Daisuke Okanohara, Shin-ichi Maeda, and Masanori Koyama. Synthetic gradient methods with virtual forward-backward networks. ICLR workshop proceedings, 2017.
 [26] Yuval Tassa and Tom Erez. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE transactions on neural networks, 18(4):1031–1041, 2007.
 [27] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
 [28] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
 [29] Paul J Werbos. Approximate dynamic programming for real-time control and neural modeling. Handbook of intelligent control, 1992.
 [30] Anqi Wu, Mikio C Aoi, and Jonathan W Pillow. Exploiting gradients and hessians in bayesian optimization and bayesian quadrature. arXiv preprint arXiv:1704.00060, 2017.
 [31] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
Supplementary Materials for “Sobolev Training for Neural Networks”
Appendix A Proofs
Theorem 1.
Let $f$ be a $\mathcal{C}^1$ function on a compact set. Then, for every positive $\varepsilon$ there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates $f$ in Sobolev space $\mathcal{S}_1$ up to $\varepsilon$ error.
We start with a definition. We will say that a function $p$ on a set $D$ is piecewise-linear if there exist $D_1, \dots, D_n$ such that $D = D_1 \cup \dots \cup D_n$ and $p|_{D_i}$ is linear for every $i = 1, \dots, n$ (note that we assume finiteness in the definition).
Lemma 1.
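Let $D$ be a compact subset of $\mathbb{R}$ and let $\varphi \in \mathcal{C}^1(D)$. Then, for every $\varepsilon > 0$ there exists a piecewise-linear, continuous function $p : D \to \mathbb{R}$ such that $|\varphi(x) - p(x)| < \varepsilon$ for every $x \in D$ and $|\varphi'(x) - p'(x)| < \varepsilon$ for every $x \in D \setminus P$, where $P$ is the set of points of non-differentiability of $p$.
Proof.
By assumption, the function $\varphi'$ is continuous on $D$. Every continuous function on a compact set has to be uniformly continuous. Therefore, there exists $\delta_1 > 0$ such that for every $x_1$, $x_2$ with $|x_1 - x_2| < \delta_1$ there holds $|\varphi'(x_1) - \varphi'(x_2)| < \varepsilon$. Moreover, $\varphi'$ has to be bounded. Let $M$ denote $\sup_{x} |\varphi'(x)|$. By the Mean Value Theorem, if $|x_1 - x_2| < \frac{\varepsilon}{2M}$ then $|\varphi(x_1) - \varphi(x_2)| < \frac{\varepsilon}{2}$. Let $\delta = \min\{\delta_1, \frac{\varepsilon}{2M}\}$. Let $\xi_i$, $i = 0, \dots, N$, be a sequence satisfying: $\xi_i < \xi_j$ for $i < j$, $|\xi_i - \xi_{i-1}| < \delta$ for $i = 1, \dots, N$, and $\xi_0 < x < \xi_N$ for all $x \in D$. Such a sequence obviously exists, because $D$ is a compact (and thus bounded) subset of $\mathbb{R}$. We define
\[ p(x) = \varphi(\xi_{i-1}) + \frac{\varphi(\xi_i) - \varphi(\xi_{i-1})}{\xi_i - \xi_{i-1}} (x - \xi_{i-1}) \quad \text{for } x \in [\xi_{i-1}, \xi_i] \cap D. \]
It can be easily verified that it has all the desired properties. Indeed, let $x \in D$. Let $i$ be such that $\xi_{i-1} \le x \le \xi_i$. Then $|\varphi(x) - p(x)| = |\varphi(x) - \varphi(\xi_i) + p(\xi_i) - p(x)| \le |\varphi(x) - \varphi(\xi_i)| + |p(\xi_i) - p(x)| < \varepsilon$, as $\varphi(\xi_i) = p(\xi_i)$ and $|\xi_i - x| \le |\xi_i - \xi_{i-1}| < \delta$ by definitions. Moreover, applying the Mean Value Theorem we get that there exists $\zeta \in [\xi_{i-1}, \xi_i]$ such that $\varphi'(\zeta) = \frac{\varphi(\xi_i) - \varphi(\xi_{i-1})}{\xi_i - \xi_{i-1}} = p'(\zeta)$. Thus, $|\varphi'(x) - p'(x)| = |\varphi'(x) - \varphi'(\zeta) + p'(\zeta) - p'(x)| \le |\varphi'(x) - \varphi'(\zeta)| + |p'(\zeta) - p'(x)| < \varepsilon$, as $p'(\zeta) = p'(x)$ and $|\zeta - x| < \delta$.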
∎
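The behaviour described by Lemma 1 can be checked numerically. The sketch below is only an illustration under our own assumptions: the target is $\sin$ on $[0, \pi]$, the knots are equally spaced, and the helper names (`piecewise_linear`, `sobolev_errors`) are hypothetical. It measures the largest value error and the largest derivative error of the continuous piecewise-linear interpolant on a probe grid.

```python
import math

def piecewise_linear(phi, knots):
    """Return (p, dp): the continuous interpolant of phi on sorted knots and its slope."""
    def locate(x):
        # Smallest i with x <= knots[i]; at a knot the left-hand piece is used.
        for i in range(1, len(knots)):
            if x <= knots[i]:
                return i
        return len(knots) - 1

    def slope(i):
        return (phi(knots[i]) - phi(knots[i - 1])) / (knots[i] - knots[i - 1])

    def p(x):
        i = locate(x)
        return phi(knots[i - 1]) + slope(i) * (x - knots[i - 1])

    def dp(x):
        return slope(locate(x))

    return p, dp

def sobolev_errors(n_pieces, a=0.0, b=math.pi, n_probe=2000):
    """Max value error and max derivative error for sin on [a, b] over a probe grid."""
    knots = [a + (b - a) * i / n_pieces for i in range(n_pieces + 1)]
    p, dp = piecewise_linear(math.sin, knots)
    probes = [a + (b - a) * (j + 0.5) / n_probe for j in range(n_probe)]
    val_err = max(abs(math.sin(x) - p(x)) for x in probes)
    der_err = max(abs(math.cos(x) - dp(x)) for x in probes)
    return val_err, der_err

coarse = sobolev_errors(10)   # 10 linear pieces
fine = sobolev_errors(100)    # 100 linear pieces
```

Refining the partition shrinks both errors at once, which is the content of the lemma: a single interpolant is close to the target both in value and in derivative.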
Lemma 2.
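Let $\varphi \in \mathcal{C}^1(\mathbb{R})$ have finite limits $\lim_{x \to -\infty} \varphi(x) = \varphi_-$ and $\lim_{x \to \infty} \varphi(x) = \varphi_+$, and let $\lim_{x \to -\infty} \varphi'(x) = \lim_{x \to \infty} \varphi'(x) = 0$. Then, for every $\varepsilon > 0$ there exists a piecewise-linear, continuous function $p : \mathbb{R} \to \mathbb{R}$ such that $|\varphi(x) - p(x)| < \varepsilon$ for every $x \in \mathbb{R}$ and $|\varphi'(x) - p'(x)| < \varepsilon$ for every $x \in \mathbb{R} \setminus P$, where $P$ is the set of points of non-differentiability of $p$.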
Proof.
Corollary 1.
For every $\varepsilon > 0$ there exists a combination of ReLU functions which approximates a sigmoid function with accuracy $\varepsilon$ in the Sobolev space.
Proof.
It follows immediately from Lemma 2 and the fact that any piecewise-linear, continuous function on $\mathbb{R}$ can be expressed as a finite sum of ReLU activations. ∎
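To make the corollary concrete, the following sketch writes a piecewise-linear approximation of the sigmoid as a finite sum of ReLU units. The knot range $[-10, 10]$, the number of pieces, the use of a constant bias term for the left-hand limit, and the function names are our own assumptions, not part of the proof.

```python
import math

def relu(z):
    return max(z, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_sigmoid(n_pieces=200, lo=-10.0, hi=10.0):
    """Return (p, dp): a ReLU-sum approximation of the sigmoid and its derivative."""
    knots = [lo + (hi - lo) * i / n_pieces for i in range(n_pieces + 1)]
    vals = [sigmoid(k) for k in knots]
    # Slopes: 0 left of the first knot, secant slopes in between, 0 right of the last.
    slopes = [0.0]
    slopes += [(vals[i] - vals[i - 1]) / (knots[i] - knots[i - 1])
               for i in range(1, n_pieces + 1)]
    slopes += [0.0]
    # The coefficient of ReLU(x - knot_i) is the change of slope at knot_i.
    coefs = [slopes[i + 1] - slopes[i] for i in range(n_pieces + 1)]
    bias = vals[0]

    def p(x):
        return bias + sum(c * relu(x - k) for c, k in zip(coefs, knots))

    def dp(x):
        # Derivative of a ReLU sum: add the coefficients of the active units.
        return sum(c for c, k in zip(coefs, knots) if x > k)

    return p, dp

p, dp = relu_sigmoid()
probes = [-15.0 + 0.01 * (j + 0.5) for j in range(3000)]
val_err = max(abs(sigmoid(x) - p(x)) for x in probes)
der_err = max(abs(sigmoid(x) * (1.0 - sigmoid(x)) - dp(x)) for x in probes)
```

Because the slope changes sum to zero, the ReLU combination is constant outside the knot range, matching the finite limits and vanishing derivative assumed in Lemma 2.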
Remark 1.
The authors decided, for the sake of clarity and better readability of the paper, not to treat the issue of non-differentiabilities of the piecewise-linear function at the junction points. It can be approached in various ways, either by noticing that they form a finite set, and thus a set of zero Lebesgue measure, and invoking the formal definition of Sobolev spaces, or by extending the definition of a derivative, but it leads only to uninteresting technical complications.
Proof of Theorem 1.
By Hornik’s result (Hornik [10]) there exists a combination of $N$ sigmoids approximating the function $f$ in the Sobolev space with $\frac{\varepsilon}{2}$ accuracy. Each of those sigmoids can, in turn, be approximated up to $\frac{\varepsilon}{2N}$ accuracy by a finite combination of ReLU (or leaky ReLU) functions (Corollary 1), and the theorem follows. ∎
Theorem 2.
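Let $f$ be a $\mathcal{C}^1$ function. Let $g$ be a continuous function satisfying $\left\| g - \frac{\partial f}{\partial x} \right\| > 0$. Then, there exists an $\eta > 0$ such that for any $\mathcal{C}^1$ function $h$ there holds either $\| f - h \| \ge \eta$ or $\left\| g - \frac{\partial h}{\partial x} \right\| \ge \eta$.
Proof.
Assume that the converse holds. This would imply that there exists a sequence of functions $h_n$ such that $\lim_{n \to \infty} \frac{\partial h_n}{\partial x} = g$ and $\lim_{n \to \infty} h_n = f$. A theorem about term-by-term differentiation implies then that the limit $\lim_{n \to \infty} h_n$ is differentiable, and that the equality $\frac{\partial}{\partial x} \lim_{n \to \infty} h_n = g$ holds. However, $\frac{\partial}{\partial x} \lim_{n \to \infty} h_n = \frac{\partial f}{\partial x}$, contradicting $\left\| g - \frac{\partial f}{\partial x} \right\| > 0$. ∎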
Proposition 1.
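Given any two functions $f : S \to \mathbb{R}$ and $g : S \to \mathbb{R}^d$ on $S \subseteq \mathbb{R}^d$ and a finite set $\Sigma \subset S$, there exists a neural network $A$ with a ReLU (or a leaky ReLU) activation such that $f(x) = A(x)$ and $g(x) = \frac{\partial A}{\partial x}(x)$ for every $x \in \Sigma$ (it has 0 training loss).
Proof.
We first prove the theorem in a special, 1-dimensional case (when $S$ is a subset of $\mathbb{R}$). From now on it will be assumed that $S$ is a subset of $\mathbb{R}$ and $\Sigma = \{\sigma_1 < \sigma_2 < \dots < \sigma_n\}$ is a finite subset of $S$. Let $\varepsilon$ be smaller than $\frac{1}{5} \min_i (\sigma_i - \sigma_{i-1})$, $i = 2, \dots, n$. We define a function $p_i$ as follows
\[ p_i(x) = \begin{cases} \frac{f(\sigma_i) - g(\sigma_i)\varepsilon}{\varepsilon} (x - \sigma_i + 2\varepsilon) & x \in [\sigma_i - 2\varepsilon, \sigma_i - \varepsilon], \\ f(\sigma_i) + g(\sigma_i)(x - \sigma_i) & x \in [\sigma_i - \varepsilon, \sigma_i + \varepsilon], \\ -\frac{f(\sigma_i) + g(\sigma_i)\varepsilon}{\varepsilon} (x - \sigma_i - 2\varepsilon) & x \in [\sigma_i + \varepsilon, \sigma_i + 2\varepsilon], \\ 0 & \text{otherwise.} \end{cases} \]
Note that the functions $p_i$ and $p_j$ have disjoint supports for $i \ne j$. We define $A(x) = \sum_{i=1}^{n} p_i(x)$. By construction, it has all the desired properties.
Now let us move to the general case, when $S$ is a subset of $\mathbb{R}^d$. We will denote by $\pi_k$ the projection of a $d$-dimensional point onto the $k$-th coordinate. The obstacle to repeating the 1-dimensional proof in a straightforward manner (coordinate-by-coordinate) is that two or more of the points $\sigma_i$ can have one or more coordinates equal. We will use a linear change of coordinates to get past this technical obstacle. Let $M \in GL(d, \mathbb{R})$ be a matrix such that there holds $\pi_k(M\sigma_i) \ne \pi_k(M\sigma_j)$ for any $i \ne j$ and any $k = 1, \dots, d$. Such $M$ exists, as every condition $\pi_k(M\sigma_i) = \pi_k(M\sigma_j)$ defines a codimension-one submanifold in the space $GL(d, \mathbb{R})$, thus the complement of the union of all such submanifolds is a full-dimension (and thus nonempty) subset of $GL(d, \mathbb{R})$. Using the one-dimensional construction we define functions $p^k$, $k = 1, \dots, d$, such that $p^k(\pi_k(M\sigma_i)) = \frac{1}{d} f(\sigma_i)$ and $(p^k)'(\pi_k(M\sigma_i)) = 0$. Similarly, we construct $q^k$ in such a manner that $q^k(\pi_k(M\sigma_i)) = 0$ and $(q^k)'(\pi_k(M\sigma_i)) = \pi_k(M^{-\top} g(\sigma_i))$. Note that those definitions are valid because $\pi_k(M\sigma_i) \ne \pi_k(M\sigma_j)$ for $i \ne j$, so the right sides are well-defined unique numbers.
It remains to put all the elements together. This is done as follows. First we extend $p^k$, $q^k$ to the whole space $\mathbb{R}^d$ “trivially”, i.e. for any $x \in \mathbb{R}^d$, we define $P^k(x) := p^k(\pi_k(x))$. Similarly, $Q^k(x) := q^k(\pi_k(x))$. Finally, $A(x) := \sum_{k=1}^{d} P^k(Mx) + \sum_{k=1}^{d} Q^k(Mx)$. This function has the desired properties. Indeed for every $\sigma_i \in \Sigma$ we have
\[ A(\sigma_i) = \sum_{k=1}^{d} p^k(\pi_k(M\sigma_i)) + \sum_{k=1}^{d} q^k(\pi_k(M\sigma_i)) = \sum_{k=1}^{d} \frac{1}{d} f(\sigma_i) + 0 = f(\sigma_i) \]
and
\[ \frac{\partial A}{\partial x}(\sigma_i) = M^{\top} \left[ (p^k)'(\pi_k(M\sigma_i)) + (q^k)'(\pi_k(M\sigma_i)) \right]_{k=1}^{d} = M^{\top} M^{-\top} g(\sigma_i) = g(\sigma_i). \]
This completes the proof. ∎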
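The one-dimensional part of this argument can be sketched in code. The version below is one possible construction and not necessarily the exact function used in the proof: each data point gets a compactly supported piecewise-linear bump that takes the prescribed value and slope at that point, and the width `eps` (here one sixth of the smallest gap, our choice) keeps the supports disjoint. The names `make_bump` and `fit_exactly` are hypothetical.

```python
import math

def make_bump(sigma, f_val, g_val, eps):
    """Piecewise-linear bump with value f_val and slope g_val at sigma, zero far away."""
    def p(x):
        t = x - sigma
        if -2 * eps <= t < -eps:
            # Rises from 0 to the left end of the central linear piece.
            return (f_val - g_val * eps) / eps * (t + 2 * eps)
        if -eps <= t <= eps:
            # Central piece: prescribed value and slope at sigma.
            return f_val + g_val * t
        if eps < t <= 2 * eps:
            # Falls from the right end of the central piece back to 0.
            return -(f_val + g_val * eps) / eps * (t - 2 * eps)
        return 0.0
    return p

def fit_exactly(points, f, g):
    """Sum of disjoint bumps matching f and g on every point (needs >= 2 points)."""
    pts = sorted(points)
    smallest_gap = min(b - a for a, b in zip(pts, pts[1:]))
    eps = smallest_gap / 6.0  # support width 4*eps stays below the smallest gap
    bumps = [make_bump(s, f(s), g(s), eps) for s in pts]
    return lambda x: sum(p(x) for p in bumps)

# Usage: the "derivative" target g is deliberately unrelated to f.
points = [0.0, 0.5, 2.0]
target_value = math.sin
target_slope = lambda x: x * x - 3.0
net = fit_exactly(points, target_value, target_slope)
```

Since every bump is continuous and piecewise linear, the sum can be rewritten as a single-hidden-layer ReLU network. Note that the fit is exact on the training points even when the slope targets are not the derivative of the value targets, which is why zero training loss says nothing about behaviour between the points.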
Proposition 3.
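There holds $K_{sob} < K_{reg}$ both for the family of Gaussian PDF functions and for the family of piecewise-linear functions on a fixed partition.
Proof.
Gaussian PDF functions form a 2-parameter family $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Therefore, determining $f$ in that family is equivalent to determining the values of $\mu$ and $\sigma^2$. Given $f(x)$, $f'(x)$, we get $\frac{f'(x)}{f(x)} = -\frac{x - \mu}{\sigma^2}$ and $\mu = x + \sigma^2 \frac{f'(x)}{f(x)}$. Thus $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\sigma^2}{2} \left( \frac{f'(x)}{f(x)} \right)^2 \right)$. The right hand side is a strictly decreasing function of $\sigma^2$. Substituting its unique solution to $\mu = x + \sigma^2 \frac{f'(x)}{f(x)}$ we determine $\mu$. Thus $K_{sob}$ is equal to 1 for the family of Gaussian PDF functions.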
On the other hand, there holds for the family of Gaussian PDF functions. For example, and have the same values at and (existence of a “real” solution near this approximate solution is an immediate consequence of the Implicit Function Theorem). This ends the proof for the family
We will discuss the family of piecewise-linear functions on a fixed partition now. Every linear function is uniquely determined by its value at a single point and its derivative. Thus, for any function $f$ in this family, as the partition $D_1, \dots, D_n$ is fixed, it is sufficient to know the value and the derivative of $f$ at one point in each $D_i$ to determine it uniquely. On the other hand, we need at least $d + 1$ points (recall that $d$ is the dimension of the domain of $f$) in each of the domains $D_i$ to determine $f$ uniquely, if we are allowed to look only at the values.
∎
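The Gaussian part of the argument is constructive, and a small sketch may help. Assuming the standard parametrisation of the Gaussian PDF by mean `mu` and standard deviation `sigma`, a single pair (value, derivative) at one point determines both parameters: the ratio $f'/f$ ties $\mu$ to $\sigma^2$, and the remaining equation in $\sigma^2$ has a strictly decreasing right-hand side, so bisection finds its unique root. The function names and the bracketing constants are our own.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_pdf_grad(x, mu, sigma):
    # d/dx of the PDF: -(x - mu) / sigma^2 * pdf(x)
    return -(x - mu) / sigma ** 2 * gaussian_pdf(x, mu, sigma)

def identify_gaussian(x0, value, slope, iters=200):
    """Recover (mu, sigma) from f(x0) = value > 0 and f'(x0) = slope."""
    r = slope / value  # equals -(x0 - mu) / sigma^2

    def rhs(s):
        # PDF value at x0 written as a function of s = sigma^2 only;
        # strictly decreasing in s.
        return math.exp(-s * r * r / 2.0) / math.sqrt(2 * math.pi * s)

    lo, hi = 1e-12, 1.0  # assumes value < rhs(1e-12), i.e. sigma^2 > 1e-12
    while rhs(hi) > value:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > value:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return x0 + s * r, math.sqrt(s)

# Usage: one observation of value and derivative at x0 = 0.3.
true_mu, true_sigma, x0 = 1.5, 0.7, 0.3
mu_hat, sigma_hat = identify_gaussian(
    x0, gaussian_pdf(x0, true_mu, true_sigma), gaussian_pdf_grad(x0, true_mu, true_sigma))
```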
Appendix B Artificial Datasets
[Figures 5–11: one figure per dataset, each comparing Regular and Sobolev training with 20 training samples and with 100 training samples.]
Functions used (visualised in Figures 5–11):

Ackley’s
for

Beale’s
for

Booth
for

Bukin
for

McCormick
for

Rosenbrock
for

Styblinski-Tang
for
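For orientation, the sketch below lists the commonly used textbook two-dimensional forms of the seven functions named above. These are an assumption on our part: the exact expressions, constants and domains used in the experiments may differ, and no domains are encoded here. A central finite-difference helper (hypothetical name `finite_difference_grad`) shows how derivative targets could be obtained when no analytic gradient is at hand.

```python
import math

# Commonly used 2-D forms (assumed; not necessarily the exact variants used here).
def ackley(x, y):
    return (-20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
            + math.e + 20.0)

def beale(x, y):
    return ((1.5 - x + x * y) ** 2 + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def booth(x, y):
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def bukin(x, y):
    # Bukin function N.6
    return 100.0 * math.sqrt(abs(y - 0.01 * x * x)) + 0.01 * abs(x + 10.0)

def mccormick(x, y):
    return math.sin(x + y) + (x - y) ** 2 - 1.5 * x + 2.5 * y + 1.0

def rosenbrock(x, y):
    return 100.0 * (y - x * x) ** 2 + (1.0 - x) ** 2

def styblinski_tang(x, y):
    return 0.5 * (x ** 4 - 16 * x ** 2 + 5 * x + y ** 4 - 16 * y ** 2 + 5 * y)

def finite_difference_grad(fn, x, y, h=1e-5):
    """Central-difference estimate of (df/dx, df/dy)."""
    return ((fn(x + h, y) - fn(x - h, y)) / (2 * h),
            (fn(x, y + h) - fn(x, y - h)) / (2 * h))
```

Each training sample for Sobolev training would then pair an input with both the function value and its gradient at that input.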
Networks were trained using the Adam optimiser with learning rate . The training set was sampled uniformly from the domain provided. The test set always consists of 10,000 points sampled uniformly from the same domain.
Appendix C Policy Distillation
The agents' policies are feed-forward networks consisting of:

32 8x8 kernels with stride 4

ReLU nonlinearity

64 4x4 kernels with stride 2

ReLU nonlinearity

64 3x3 kernels with stride 1

ReLU nonlinearity

Linear layer with 512 units

ReLU nonlinearity

Linear layer with 3 (Pong), 4 (Breakout) or 6 outputs (Space Invaders)

Softmax
They were trained with A3C [16] over 80e6 steps, using a history of length 4, greyscale input, and an action repeat of 4. Observations were scaled down to 84x84 pixels.
Data was gathered by running the trained policy for 100K frames (thus for 400K actual steps). The split into train and test sets was done time-wise, ensuring that test frames come from different episodes than the training ones.
The distillation network consists of:

16 8x8 kernels with stride 4

ReLU nonlinearity

32 4x4 kernels with stride 2

ReLU nonlinearity

Linear layer with 256 units

ReLU nonlinearity

Linear layer with 3 (Pong), 4 (Breakout) or 6 outputs (Space Invaders)

Softmax
and was trained using the Adam optimiser, with the learning rate fitted independently per game and per approach between and . The batch size is 200 frames, randomly selected from the training set.
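As a sanity check on the two architectures above, the sketch below computes the spatial size after each convolution and the number of features entering the first linear layer. It assumes unpadded ("valid") convolutions on 84x84 inputs, which the text does not state explicitly; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output side length of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def flat_features(input_size, layers):
    """layers: list of (channels, kernel, stride). Returns (final side, flattened size)."""
    size, channels = input_size, None
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size, channels * size * size

# Teacher policy: 32 8x8/4, 64 4x4/2, 64 3x3/1 (as listed above).
teacher = flat_features(84, [(32, 8, 4), (64, 4, 2), (64, 3, 1)])
# Distillation network: 16 8x8/4, 32 4x4/2.
student = flat_features(84, [(16, 8, 4), (32, 4, 2)])

# 100K gathered frames with action repeat 4 correspond to 400K environment steps.
actual_steps = 100_000 * 4
```

Under this assumption the teacher feeds 3136 features into its 512-unit layer, while the distilled network feeds 2592 features into its 256-unit layer.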
Appendix D Synthetic Gradients
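All models were trained using multi-GPU optimisation, with Sync main network updates and Hogwild SG module updates.
D.1 Meaning of Sobolev losses for synthetic gradients
In the setting considered, the true label $y$ is used only as conditioning; however, one could also provide supervision for the partial derivatives of the predicted loss with respect to $y$. So what is the actual effect these Sobolev losses have on the SG estimator? Write $m(h, y|\theta)$ for the loss predicted by the SG module, $p(h|\theta)$ for its internal class predictions and $p_h$ for the current predictions of the whole model. For $L$ being the log loss, it is easy to show that they are additional penalties on matching $\log p(h|\theta)$ to $\log p_h$, namely:
\[ \left\| \frac{\partial m(h, y|\theta)}{\partial y} - \frac{\partial L(h, y)}{\partial y} \right\|^2 = \left\| \log p(h|\theta) - \log p_h \right\|^2, \]
\[ \left\| m(h, y|\theta) - L(h, y) \right\|^2 = \left( \log p(h|\theta)_y - \log (p_h)_y \right)^2, \]
where $y$ is the index of “1” in the one-hot encoded label vector. Consequently, loss supervision makes sure that the internal prediction $\log p(h|\theta)_y$ for the true label $y$ is close to the current prediction $\log (p_h)_y$ of the whole model. On the other hand, matching partial derivatives with respect to the label makes sure that the predictions for all the classes are close to each other. Finally, if we use both, we get a weighted sum, where the penalty for deviating from the prediction on the true label is more expensive than on all the remaining ones. (Footnote 6: Adding this supervision on toy MNIST experiments increased convergence speed and stability; however, due to TensorFlow currently not supporting differentiating cross entropy with respect to labels, it was omitted in our large-scale experiments.)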
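The two penalties can be illustrated with a few lines of code. This is a minimal sketch under our own assumptions: the SG module is taken to predict the loss as the log loss of its own internal class probabilities, the label vector is treated as a real-valued input so that the loss can be differentiated with respect to it, and all function names are hypothetical.

```python
import math

def log_loss(probs, y):
    """L(p, y) = -sum_c y_c * log p_c, with y treated as a real-valued vector."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, probs))

def log_loss_grad_wrt_label(probs):
    # The loss is linear in y, so dL/dy_c = -log p_c for every class c.
    return [-math.log(pc) for pc in probs]

def sobolev_penalties(module_probs, model_probs, y):
    """Squared gaps between module and model: (loss values, derivatives wrt label)."""
    value_gap = (log_loss(module_probs, y) - log_loss(model_probs, y)) ** 2
    g_module = log_loss_grad_wrt_label(module_probs)
    g_model = log_loss_grad_wrt_label(model_probs)
    deriv_gap = sum((a - b) ** 2 for a, b in zip(g_module, g_model))
    return value_gap, deriv_gap

module_probs = [0.7, 0.2, 0.1]   # internal prediction of the SG module (made up)
model_probs = [0.6, 0.3, 0.1]    # current prediction of the whole model (made up)
one_hot = [1.0, 0.0, 0.0]
value_gap, deriv_gap = sobolev_penalties(module_probs, model_probs, one_hot)
```

With a one-hot label the value penalty only involves the log-probability of the true class, while the derivative penalty sums the squared log-probability gaps over all classes.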
D.2 CIFAR-10
All CIFAR-10 experiments use a deep convolutional network of the following structure:

64 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

64 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

128 3x3 kernels with stride 2

BatchNorm and ReLU nonlinearity

128 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

128 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

256 3x3 kernels with stride 2

BatchNorm and ReLU nonlinearity

256 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

256 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

512 3x3 kernels with stride 2

BatchNorm and ReLU nonlinearity

512 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

512 3x3 kernels with stride 1

BatchNorm and ReLU nonlinearity

Linear layer with 10 outputs

Softmax
with L2 regularisation of . The network is trained in an asynchronous manner, using 10 GPUs in parallel. Each worker uses a batch size of 32. The main optimiser is Stochastic Gradient Descent with momentum of 0.9. The learning rate is initialised to 0.1 and then dropped by an order of magnitude after 40K, 60K and finally after 80K updates.
Each of the three SG modules is a convolutional network consisting of:

128 3x3 kernels with stride 1

ReLU nonlinearity

Linear layer with 10 outputs

Softmax
It is trained using the Adam optimiser with learning rate . No learning rate schedule is applied. Updates of the synthetic gradient module are performed in a Hogwild manner. Losses used for both loss prediction and gradient estimation are L1.
For the direct SG model we used the architecture described in the original paper: 3 resolution-preserving layers of 128 kernels of 3x3 convolutions with ReLU activations in between. The only difference is that we use an L1 penalty instead of L2, as we found empirically that it works better for the tasks considered.
D.3 ImageNet
All ImageNet experiments use a ResNet50 network with L2 regularisation of . The network is trained in an asynchronous manner, using 34 GPUs in parallel. Each worker uses a batch size of 32. The main optimiser is Stochastic Gradient Descent with momentum of 0.9. The learning rate is initialised to 0.1 and then dropped by an order of magnitude after 100K, 150K and finally after 175K updates.
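The step schedules described for the main optimiser on CIFAR-10 and ImageNet can be written as one small function. This is a sketch only: whether the drop takes effect exactly at the boundary update or one update later is not specified in the text, so the `>=` comparison below is our assumption, as is the function name.

```python
def step_lr(update, base_lr=0.1, boundaries=(40_000, 60_000, 80_000), factor=0.1):
    """Piecewise-constant learning rate: multiply by `factor` at each boundary passed."""
    drops = sum(1 for b in boundaries if update >= b)  # boundary inclusion is assumed
    return base_lr * factor ** drops

CIFAR_BOUNDARIES = (40_000, 60_000, 80_000)
IMAGENET_BOUNDARIES = (100_000, 150_000, 175_000)

cifar_lr_mid = step_lr(50_000, boundaries=CIFAR_BOUNDARIES)         # after the first drop
imagenet_lr_mid = step_lr(160_000, boundaries=IMAGENET_BOUNDARIES)  # after the second drop
```

The SG modules do not use this schedule; as stated, they are trained with Adam at a fixed learning rate.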
The SG module is a convolutional network, attached after the second ResNet block, consisting of:

64 3x3 kernels with stride 1

ReLU nonlinearity

64 3x3 kernels with stride 2

ReLU nonlinearity

Global averaging

1000 1x1 kernels

Softmax
It is trained using the Adam optimiser with learning rate . No learning rate schedule is applied. Updates of the synthetic gradient module are performed in a Hogwild manner. Sobolev losses are set to L1.
Standard data augmentation, taken from the original Inception V1 paper, was applied during training.
Appendix E Gradientbased attention transfer
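Zagoruyko and Komodakis [31] recently proposed the following cost for transferring attention from model $f$ to model $g$ parametrised with $\theta$, under the cost $L$:
\[ L_{transfer}(\theta) = L(g(x|\theta)) + \alpha \left\| \frac{\partial L(g(x|\theta))}{\partial x} - \frac{\partial L(f(x))}{\partial x} \right\| \qquad (3) \]
where the first term is simply the original minimisation problem, and the other measures the loss sensitivity of the target ($f$) and tries to match the corresponding quantity in the model $g$. This can be seen as Sobolev training under four additional assumptions:

one does not model $f$, but rather $L(f(x))$ (similarly to our Synthetic Gradient model, one constructs a loss predictor),

$L(f(x)) = 0$ (the target model is perfect),

the loss being estimated is non-negative ($L(\cdot) \ge 0$),

the loss used to measure the difference in predictor values (loss estimates) is $L_1$.
If we combine these four assumptions we get
\[ L_{sobolev}(\theta) = \left\| L(g(x|\theta)) - L(f(x)) \right\|_1 + \alpha \left\| \frac{\partial L(g(x|\theta))}{\partial x} - \frac{\partial L(f(x))}{\partial x} \right\| = L(g(x|\theta)) + \alpha \left\| \frac{\partial L(g(x|\theta))}{\partial x} - \frac{\partial L(f(x))}{\partial x} \right\| = L_{transfer}(\theta). \]
Note, however, that in general these approaches are not the same, but rather share the idea of matching gradients of a predictor and a target in order to build a better model.
In other words, Sobolev training exploits derivatives to find a closer fit to the target function, while the proposed transfer loss instead adds a sensitivity-matching term to the original minimisation problem. The following observation makes this distinction more formal.
Remark 2.
Let us assume that a target function $f$ belongs to the hypotheses space $\mathcal{H}$, meaning that there exists $\theta_f$ such that $g(\cdot|\theta_f) = f(\cdot)$. Then $g(\cdot|\theta_f)$ is a minimiser of the Sobolev loss, but does not have to be a minimiser of the transfer loss defined in Eq. (3).
Proof.
By the definition of the Sobolev loss it is non-negative, thus it suffices to show that $L_{sobolev}(\theta_f) = 0$, but
\[ L_{sobolev}(\theta_f) = \left\| g(x|\theta_f) - f(x) \right\| + \alpha \left\| \frac{\partial g(x|\theta_f)}{\partial x} - \frac{\partial f(x)}{\partial x} \right\| = 0. \]
By the same argument we get for the transfer loss
\[ L_{transfer}(\theta_f) = L(g(x|\theta_f)) + \alpha \left\| \frac{\partial L(g(x|\theta_f))}{\partial x} - \frac{\partial L(f(x))}{\partial x} \right\| = L(f(x)). \]
Consequently, if there exists another $\theta_h$ such that $L_{transfer}(\theta_h) < L(f(x))$, then $\theta_f$ is not a minimiser of the loss considered.
To show that this final constraint does not lead to an empty set, let us consider a class of constant functions $g(x|\theta) = \theta$, and $L(g(x|\theta)) = \|\theta\|^2$. Let us fix some $\theta_f > 0$ that identifies $f$, and we get:
\[ L_{transfer}(\theta_f) = L(f(x)) = \theta_f^2 > 0 \]
and at the same time for any $|\theta_h| < \theta_f$ (e.g. $\theta_h = \theta_f / 2$) we have:
\[ L_{transfer}(\theta_h) = \theta_h^2 + \alpha \cdot 0 = \theta_h^2 < \theta_f^2 = L_{transfer}(\theta_f). \]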
∎
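The counterexample with constant functions is simple enough to check by hand, and the sketch below does so numerically. It assumes, as in the remark, a model that outputs the constant `theta` regardless of the input and a cost equal to the squared output, so every derivative with respect to the input is zero; the function names and the value of `alpha` are our own.

```python
def transfer_loss(theta, theta_f, alpha=1.0):
    """Transfer-style loss for constant models g(x|theta) = theta with L(p) = p**2."""
    # Both L(g(x|theta)) and L(f(x)) are constant in x, so their input
    # derivatives are zero and the sensitivity-matching term vanishes.
    sensitivity_gap = abs(0.0 - 0.0)
    return theta ** 2 + alpha * sensitivity_gap

def sobolev_loss(theta, theta_f, alpha=1.0):
    """Sobolev loss for the same constant models: value gap plus derivative gap."""
    value_gap = abs(theta - theta_f)
    derivative_gap = abs(0.0 - 0.0)  # both functions are constant in x
    return value_gap + alpha * derivative_gap

theta_f = 2.0           # parameter that reproduces the target exactly
theta_h = theta_f / 2   # a different model with a smaller transfer loss
```

Here the Sobolev loss is zero only at `theta_f`, while the transfer loss keeps decreasing as `theta` moves towards zero, away from the target.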