SHADE: Information-Based Regularization for Deep Learning
Abstract
Regularization is a central concern when training deep neural networks. In this paper, we propose a new information-theory-based regularization scheme named SHADE for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer and the learning of correlations between inputs and labels in the data-fitting term. Our second contribution is to derive a stochastic version of the regularizer compatible with deep learning, resulting in a tractable training scheme. We empirically validate the efficiency of our approach in improving classification performance compared to common regularization schemes on several standard architectures.
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
1 Introduction
Deep neural networks (DNN) have shown impressive state-of-the-art results in recent years on numerous tasks, especially image classification alexnet; resnet, which will be our main focus in this paper. One key element of this success is the use of very deep models with engineered architectures that can be optimized efficiently using stochastic gradient descent (SGD) SGD. Moreover, those models have a huge number of parameters and accordingly need to be trained on a lot of data (e.g. ImageNet) in order to control overfitting. Despite the constant progress of DNN performance since alexnet, their generalization ability is still largely misunderstood, and theoretical tools such as PAC-based analysis seem limited in explaining it, as demonstrated by rethinking. Although regularization methods such as weight decay weightdecay, dropout dropout or batch normalization batchnorm are common practices to mitigate the ratio between the number of training samples and the number of model parameters, the issue of DNN regularization remains an open question.
In this article, we study the possibility of designing a regularization scheme that can be applied efficiently to deep learning and that has theoretical motivations. Our approach requires defining a regularization criterion, which is our first contribution. We claim that, for any model, the entropy of its intermediate representation, conditionally to the class variable, is a good criterion to apprehend the generalization potential of the model. More formally, let us note X the input variable, C the class variable, and w the model parameters, with Y = f(X; w) the (deep) representation of the input leading to the class prediction. Then, the objective is to penalize H(Y | C), where H denotes the Shannon entropy measure (see element for definition). As explained in the next section, this measure is perfectly suitable to quantify how invariant the representation is with respect to the underlying task of class prediction. This criterion also stands as a valid instantiation of the "Minimum Description Length" principle mdl, an interpretation of Occam's razor.
Unfortunately, information measures are usually difficult to estimate when the number of samples is small compared to the size of the variable's support space. As a second contribution, we propose an implementation of a tractable loss that proves to reduce our criterion when minimized. Indeed, based on an interpretation of how class information is encoded within neuron activations, we assume that for every neuron there exists a random Bernoulli variable that contains most of the class information. This variable ultimately enables us to derive a batch-wise estimator of the entropy criterion that is scalable and integrates easily into a stochastic gradient descent (SGD) optimization scheme. The resulting loss, called SHADE for SHAnnon DEcay, has the advantage of being layer-wise and, more particularly, neuron-wise.
Finally, as a third contribution we provide extensive experiments on different datasets to motivate and to validate our regularization scheme.
2 Related work and motivation
Regularization in deep learning.
For classification tasks, DNN training schemes usually use an objective function which linearly combines a classification loss L_cls – generally cross-entropy – and a regularization term Ω, weighted by a coefficient λ, that aims at steering the optimization toward a local minimum with good generalization properties:
L(w) = Σ_{k=1}^{K} L_cls(f(x_k; w), c_k) + λ Ω(w)   (1)
For example, the weight decay (WD) loss weightdecay is supposed to penalize the complexity of the model. While there are strong motivations for using WD on linear models, in terms of margins or of Lipschitz coefficient for instance, the extension of those theoretical results to DNN is not straightforward, and the effect of WD on DNN generalization performance is still not clear, as demonstrated in rethinking.
Our SHADE regularization scheme belongs to this family, as we construct a loss Ω_SHADE that aims at steering the optimization toward representations with low class-conditional entropy. We show in the experiments that the SHADE loss has a positive effect on our theoretical criterion H(Y | C), resulting in enhanced generalization performance.
Other popular regularization methods like dropout; conf/icml/WanZZLF13 deactivate part of the neurons during training. Those methods, which tend to lower the dependency of the class prediction on a reduced set of features, are backed by some theoretical interpretations like 2016arXiv161101353A; bayes. Other methods that add stochasticity to the training, such as batch normalization or stochastic poolings batchnorm; zeiler2013stochastic; graham2014fractional, besides being comparable to data augmentation, result in the addition of noise to the gradients during optimization. This property enables the model to converge toward flatter minima, which are known to generalize well, as shown in largebatch. More generally, stochastic methods tend to make the learned parameters less dependent on the training data, which guarantees tighter generalization bounds nipsInformationGeneralization. In this article we focus on another way to make DNN models less dependent on the training data: by favoring representations that are invariant to many transformations of the input variable, the training becomes less dependent on the data. Having invariant representations is the motivation for our criterion H(Y | C).
Quantifying invariance.
Designing DNN models that are robust to variations of the input data while preserving class information is the main motivation of this work. In the same direction, scattering decompositions scattering are appealing transforms. They have been incorporated into adapted network architectures like bruna. However, for tasks like image recognition, it is very difficult to design an explicit modelling of all the transformations a model should be invariant to.
Conversely, a criterion such as H(X | Y) is agnostic to the transformations the representation should be invariant to, and is suitable to quantify how invariant a representation is in a context of class prediction. Indeed, a model that is invariant to many transformations will produce the same representation for different inputs, making it impossible to guess which input produced a given representation. This characteristic is perfectly captured by the entropy H(X | Y), which represents the uncertainty about the variable X knowing its representation Y. For instance, many works relate the reconstruction error to the entropy, like Fano's inequality element in the discrete case. A general discussion, with illustrations on how to bound the reconstruction error with H(X | Y), can be found in Appendix E. Thus, the bigger the measure H(X | Y), the more invariant the representation. Let us now analyze the entropy of the representation when X is discrete and the mapping Y = f(X; w) is deterministic. We have:
H(X) = I(X; Y) + H(X | Y) = H(Y) + H(X | Y)   (2)
H(X) being fixed, H(Y) is inversely related to H(X | Y), making H(Y) also a good measure of invariance.
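To make the link between invariance and H(X | Y) concrete, here is a small self-contained Python sketch (the helper name and the toy mappings are ours, for illustration only): a many-to-one representation yields a strictly positive H(X | Y), while an injective one yields H(X | Y) = 0.

```python
import math
from collections import Counter, defaultdict

def cond_entropy_x_given_y(pairs):
    """H(X|Y) in nats for an empirical list of (x, y) samples."""
    n = len(pairs)
    by_y = defaultdict(list)
    for x, y in pairs:
        by_y[y].append(x)
    h = 0.0
    for xs in by_y.values():
        p_y = len(xs) / n
        counts = Counter(xs)
        # entropy of X within the group of inputs sharing this representation y
        h_y = -sum((c / len(xs)) * math.log(c / len(xs)) for c in counts.values())
        h += p_y * h_y
    return h

# X uniform over 8 inputs; "fine" is injective, "coarse" maps 4 inputs to each code.
xs = list(range(8))
fine = [(x, x) for x in xs]        # no invariance: Y determines X
coarse = [(x, x // 4) for x in xs]  # invariant over groups of 4 inputs

assert cond_entropy_x_given_y(fine) == 0.0
assert abs(cond_entropy_x_given_y(coarse) - math.log(4)) < 1e-12
```

The more inputs collapse onto the same representation, the larger H(X | Y) grows, exactly as the invariance argument above describes.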
However, considering the target classification task, we do not want two inputs of different classes to have the same representation; rather, we want to focus on intra-class invariance. Therefore, all this reasoning should be done conditionally to the class C, explaining our final choice of H(Y | C) as a measure of intra-class invariance.
Informationtheorybased regularization.
Many works like DBLP:journals/corr/PereyraTCKH17; soatto use information measures as regularization criteria. Still with the objective of making the trained model less dependent on the data, soatto built a specific weight regularization but had to model the weight distribution, which is not easy. The information criterion closest to ours is the one defined in the Information Bottleneck framework (IB) proposed in IB, which suggests using the mutual information I(X; Y) (element) as a regularization criterion. IBvariational extends it in a variational context, VIB, by constructing a variational upper bound of the criterion. Along with IB also come some theoretical investigations, with the definition of generalization bounds in IBbound. Using mutual information as a regularizer is also closely connected to invariance, since it attempts to compress as much information as possible from the input data. In the case where X is discrete and the representation mapping is deterministic, our criterion is related to IB's through the development H(Y | C) = I(X; Y) − I(Y; C). In a context of optimization with SGD, minimizing H(Y | C) appears to be more efficient at preserving the term I(Y; C), which represents the mutual information between the representation variable and the class variable and which must stay high to predict C accurately from Y.
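Under the stated assumptions (X discrete and Y = f(X; w) deterministic, so that H(Y | X) = 0), the development linking our criterion to the IB one can be written out as:

```latex
I(X;Y) \;=\; H(Y) - \underbrace{H(Y \mid X)}_{=0}
       \;=\; H(Y)
       \;=\; H(Y \mid C) + I(Y;C)
\quad\Longrightarrow\quad
H(Y \mid C) \;=\; I(X;Y) - I(Y;C).
```

Minimizing H(Y | C) thus compresses I(X; Y) while sparing the discriminative term I(Y; C).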
Compressing the representation, without damaging the class information seems in fact to be a holy grail in machine learning. Our work, resulting in SHADE, goes in this direction.
3 SHADE: A new Regularization Method Based on H(Y | C)
In this section, we further describe SHADE, a new regularization loss based on the conditional entropy H(Y | C), designed to drive the optimization toward more invariant representations. We first show how to derive a layer-wise, neuron-wise criterion before developing a proper tractable estimate of the unit-wise criterion. In order to properly develop entropy inequalities, it is necessary to suppose that X is a discrete variable over the finite space of quantized images. However, it is common to consider X as a discrete quantization of a continuous natural variable, enabling us to exploit some properties verified by continuous variables, such as gradient computation. (A discussion on this topic can be found in discussionIB.)
3.1 A unitwise criterion
Layerwise criterion.
A DNN is composed of a number of layers that sequentially transform the input. Each one can be seen as an intermediate representation variable, noted Y_ℓ for layer ℓ, that is determined by the output Y_{ℓ−1} of the previous layer and a set of parameters w_ℓ. Each layer filters out a certain part of the information from the initial input X. Thus, the following inequalities can be derived from the data-processing inequality in element:
H(Y_L | C) ≤ H(Y_{L−1} | C) ≤ … ≤ H(Y_1 | C) ≤ H(X | C)   (3)
The conditional entropy of every layer is upper-bounded by the conditional entropy of the previous layer. Similarly to the recommendation of IBdeep, we apply this regularization to all the layers, using a layer-wise criterion H(Y_ℓ | C), and producing a global criterion (confirming the intuition, in our experiments, regularizing all layers has proved to be more efficient than regularizing the final layer representation only):
Ω = Σ_{ℓ=1}^{L} β_ℓ H(Y_ℓ | C)   (4)
Where β_ℓ is a weighting term differentiating the importance of the regularization between layers. Those coefficients will be omitted in the following, as in our experiments all β_ℓ are identical; adjusting their values remains open for further research. This is illustrated in Fig. 1, where we see that one quantity (in green) remains constant across layers while the other (in red) decreases.
Unitwise criterion.
Examining one layer ℓ, its representation variable Y_ℓ is a random vector of dimension D_ℓ: Y_ℓ = (Y_{ℓ,1}, …, Y_{ℓ,D_ℓ}). The upper bound H(Y_ℓ | C) ≤ Σ_{i=1}^{D_ℓ} H(Y_{ℓ,i} | C) enables the definition of a unit-wise criterion that SHADE seeks to minimize. For each unit i of every layer ℓ, we design a loss that will be part of the global regularization loss:
Ω = Σ_{ℓ=1}^{L} Σ_{i=1}^{D_ℓ} H(Y_{ℓ,i} | C)   (5)
For the rest of the paper, we use the notation Y instead of Y_{ℓ,i} for simplicity, since the layers and coordinates are all considered independently in the definition of the loss.
3.2 Estimating the Entropy
In this section, we describe how to define a loss based on the measure H(Y | C), with Y being one coordinate variable of one layer output. Defining this loss is not obvious, as the gradient of H(Y | C) with respect to the layer's parameters may be computationally intractable. Y has an unknown distribution, and without modeling it properly it is not possible to compute H(Y | C). Since H(Y | C) = Σ_c p(C = c) H(Y | C = c), a direct approach would consist in computing one entropy per class. This means that, given a batch, the number of samples used to estimate one of these entropies is divided by the number of classes on average, which becomes particularly problematic when dealing with a large number of classes such as the 1,000 classes of ImageNet. Furthermore, entropy estimators are extremely inaccurate considering the number of samples in a batch; for example, LME estimators of entropy in entropyestimation converge very slowly with the number of samples. Finally, most estimators require discretizing the space in order to approximate the distribution via a histogram. This raises issues on the definition of the bins, considering that the variable distribution is unknown and varies during training, in addition to the fact that keeping a histogram for each neuron is costly in computation and memory. Moreover, entropy estimators based on kernel density estimation usually have too high a complexity to be applied efficiently to deep learning models.
To tackle these drawbacks we propose the two following solutions: the introduction of a latent variable that enables the use of more examples to estimate the conditional entropies; and a bound on the entropy of the variable by an increasing function of its variance, which avoids the issue of entropy estimation with a histogram and makes the computation tractable and scalable.
Latent code.
First, considering a single neuron Y (before ReLU), the ReLU activation induces a detector behavior toward a certain pattern. If the pattern is absent from the input, the signal is zero; if it is present, the activation quantifies the resemblance with the pattern. We therefore propose to associate a Bernoulli variable Z with each unit variable Y (before ReLU). This variable indicates whether a particular pattern is present in the input (Z = 1) or not (Z = 0). It acts like a latent code from which the input is generated, as in variational models (e.g. aevb) or in generative models (e.g. infogan).
Furthermore, it is very likely that most intermediate features of a DNN can take similar values for inputs of different classes – this is especially true for low-level features. The semantic information provided by a feature is thus more about a particular pattern than about the class itself. Only the combination of features allows determining the class. So Z represents a semantically meaningful factor about the class C, and the feature value Y is then a quantification of the possibility for this semantic attribute to be present in the input or not.
We thus assume the Markov chain C → Z → Y. Indeed, during training, the distribution of Y varies in order to get as close as possible to a sufficient statistic of X for C (see definition in element). Therefore, we expect Z to be such that Y draws near a sufficient statistic of Z for C as well. By assuming this sufficient-statistic relation, we get the equality H(Y | C) = H(Y | Z), and finally obtain:
H(Y | C) = H(Y | Z) = Σ_{z∈{0,1}} p(Z = z) H(Y | Z = z)   (6)
This modeling of Z as a Bernoulli variable (one for each unit) has the advantage of enabling good estimators of the conditional entropy, since we only divide the batch into two sets for the estimation (Z = 0 and Z = 1), regardless of the number of classes. The fact that most of the information about C is contained in such a variable is validated in the experiments of Sec. 4.4.
Variance bound.
The previous trick allows computing fewer entropy estimates to obtain the global conditional entropy, therefore increasing the sample size used for each entropy estimation. Unfortunately, it does not solve the bin-definition issue. To address this, we propose to use the following bound on H(Y | Z = z), which does not require the definition of bins:
H(Y | Z = z) ≤ ½ log(2πe Var(Y | Z = z))   (7)
This bound holds for any continuous distribution, with equality if the distribution is Gaussian. For many other distributions, such as the exponential one, the entropy is also equal to an increasing function of the variance. In addition, one main advantage is that variance estimators are much more robust than entropy estimators, converging much faster with the number of samples. The use of this bound is well justified in our case because the variable Y is the quantization of a continuous variable. Moreover, even if Y is discrete, this inequality still holds up to a term depending on the quantization steps.
The function t ↦ ½ log(2πe t) being one-to-one and increasing, we only keep the simpler term Var(Y | Z = z) to design our final loss:
Ω_unit = Σ_{z∈{0,1}} p(Z = z) Var(Y | Z = z)   (8)
In the next section, we detail the definition of the differentiable loss, computed on a mini-batch, using Var(Y | Z) as a criterion.
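The bound of Eq. (7) can be sanity-checked analytically for a few classical distributions. The snippet below (ours, for illustration) compares the right-hand side ½ log(2πe·Var) against known differential entropies, in nats.

```python
import math

def gaussian_entropy_bound(var):
    """Right-hand side of Eq. (7): (1/2) * log(2 * pi * e * Var), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

# Uniform on [0, 1): differential entropy h = log(1) = 0, variance = 1/12.
assert gaussian_entropy_bound(1 / 12) >= 0.0

# Exponential with rate 1: h = 1 nat, variance = 1.
assert gaussian_entropy_bound(1.0) >= 1.0

# Gaussian: the bound is tight (equality) for any variance.
sigma2 = 0.37
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)
assert abs(gaussian_entropy_bound(sigma2) - h_gauss) < 1e-12
```

The bound is loose for non-Gaussian distributions but tight in the Gaussian case, which is consistent with the bimodal-Gaussian observation of Sec. 4.4.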
3.3 Instantiating SHADE
For one unit of one layer, the previous criterion writes:

Var(Y | Z) = Σ_{z∈{0,1}} p(Z = z) E[(Y − μ_z)² | Z = z]   (9)
           ≈ (1/K) Σ_{k=1}^{K} Σ_{z∈{0,1}} p(Z = z | y_k) (y_k − μ_z)²   (10)

with μ_z = E[Y | Z = z]. This quantity can be estimated with Monte Carlo sampling on a mini-batch (y_k)_{1≤k≤K} of intermediate representations of input–target pairs, as in Eq. (10).
p(Z = 1 | Y = y) is interpreted as the probability of presence of the attribute Z in the input, and should clearly be modeled such that it increases with y: the more similarities between the input and the pattern represented by Z, the higher the probability of presence of Z. We suggest using a simple increasing function of the unit's value for this purpose (other functions have been experimented with similar results).
For the expected values μ_z, we use a classic moving average that is updated after each batch, as described in Algorithm 1. Note that the expected values are not changed by the optimization, since translating a variable has no influence on its entropy.
The concrete behavior of SHADE can be interpreted by analyzing its gradient as described in Appendix F.
For this proposed instantiation, our SHADE regularization penalty takes the form:

Ω_SHADE = Σ_{ℓ=1}^{L} Σ_{i=1}^{D_ℓ} (1/K) Σ_{k=1}^{K} Σ_{z∈{0,1}} p(Z = z | y_{ℓ,i,k}) (y_{ℓ,i,k} − μ_{ℓ,i,z})²   (11)
We have presented a regularizer that is applied neuron-wise and that can be integrated into the usual optimization process of a DNN. The additional computation and memory usage induced by SHADE is almost negligible (the computation and storage of two moving averages per neuron). For comparison, SHADE adds half as many parameters as batch normalization does.
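As an illustration, here is a minimal pure-Python sketch of the SHADE penalty for a single unit, under our own assumptions: p(Z = 1 | y) is modeled by a sigmoid (the paper's exact choice of function is not fixed here), and the conditional means μ_z are tracked with moving averages as in Algorithm 1. This is a sketch of the mechanism, not the authors' implementation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

class ShadeUnit:
    """SHADE penalty for one neuron, on pre-activation values y.

    mu[z] tracks E[Y | Z=z] with a moving average (not trained by SGD),
    and the batch loss is (1/K) * sum_k sum_z p(Z=z | y_k) * (y_k - mu[z])**2.
    p(Z=1 | y) = sigmoid(y) is an assumption of this sketch.
    """
    def __init__(self, momentum=0.9):
        self.mu = [0.0, 0.0]   # mu[0] = E[Y|Z=0], mu[1] = E[Y|Z=1]
        self.momentum = momentum

    def loss(self, batch):
        total = 0.0
        for y in batch:
            p1 = sigmoid(y)
            total += (1 - p1) * (y - self.mu[0]) ** 2 + p1 * (y - self.mu[1]) ** 2
        return total / len(batch)

    def update_moving_averages(self, batch):
        # Weighted means of the batch under p(Z=z | y), blended into mu[z].
        for z in (0, 1):
            w = [sigmoid(y) if z == 1 else 1 - sigmoid(y) for y in batch]
            mean_z = sum(wi * y for wi, y in zip(w, batch)) / (sum(w) + 1e-12)
            self.mu[z] = self.momentum * self.mu[z] + (1 - self.momentum) * mean_z

unit = ShadeUnit()
batch = [-2.0, -1.5, 0.5, 2.0, 3.0]
unit.update_moving_averages(batch)
penalty = unit.loss(batch)
assert penalty >= 0.0
```

In a real network this loss would be summed over all units and layers and added to the classification loss with a weight λ, while the μ_z updates run outside of backpropagation.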
4 Experiments
4.1 Image Classification with Various Architectures on CIFAR10
Table 1: Classification accuracy (%) on the CIFAR-10 test set.

                MLP     AlexNet   Inception   ResNet
No regul.       64.68   83.25     91.21       92.95
Weight decay    66.52   83.54     92.87       93.84
Dropout         66.70   85.95     91.34       93.31
SHADE           70.45   85.96     93.56       93.87
We perform image classification on the CIFAR-10 dataset, which contains 50k training images and 10k test images of 32×32 RGB pixels, fairly distributed within 10 classes (see cifar for details). Following the architectures used in rethinking, we use a small Inception model, a three-layer MLP, and an AlexNet-like model with 3 convolutional layers (64 filters each) + max pooling and 2 fully connected layers with 1,000 neurons for the intermediate variable. We also use a ResNet architecture from wideresnet (k = 10, N = 4). Those architectures represent a large family of DNNs, and some have been well studied in rethinking within the generalization scope. For training, we use randomly cropped images of size 28×28 with random horizontal flips. For testing, we simply center-crop 28×28 images. We use momentum SGD for optimization (same protocol as rethinking).
We compare SHADE with two regularization methods, namely weight decay weightdecay and dropout dropout. For all architectures, the regularization parameters have been cross-validated to find the best ones for each method, and the obtained accuracies on the test set are reported in Table 1. More details on the experimental protocol can be found in Appendix B.
We obtain the same trends as rethinking, which shows a small improvement of 0.29% with weight decay on AlexNet. The improvement with weight decay is slightly larger with ResNet and Inception (0.89% and 1.66%). In our experiments, dropout significantly improves generalization performance only for AlexNet and MLP. It is known that the use of batch normalization and of only one fully connected layer lowers the benefit of dropout, which is in fact not used in resnet.
We first notice that, for all kinds of architectures, the use of SHADE significantly improves the generalization performance: +5.77% for MLP, +2.71% for AlexNet, +2.35% for Inception and +0.92% for ResNet. This demonstrates the ability of SHADE to regularize the training of deep architectures. Finally, SHADE shows better performance than dropout and weight decay on all architectures.
4.2 Large Scale Classification on ImageNet
In order to experiment with SHADE regularization on a very large-scale dataset, we train on ImageNet ImageNet a WELDON network from weldone adapted from ResNet-101. This architecture changes the forward and pooling strategy by using the network in a fully convolutional way and adding a max+min pooling, thus improving the performance of the baseline network. We used the pretrained weights of ResNet-101 (from the torchvision package of PyTorch), giving performances on the test set of 77.56% for top-1 accuracy and 93.89% for top-5 accuracy. Provided with the pretrained weights, the WELDON architecture obtains 78.51% for top-1 accuracy and 94.65% for top-5 accuracy. After fine-tuning the network using SHADE for regularization, we finally obtain 80.14% for top-1 accuracy and 95.35% for top-5 accuracy, a clear improvement. This demonstrates the ability to apply SHADE successfully to very large-scale image classification.
4.3 Training with a Limited Number of Samples
When datasets are small, DNNs tend to overfit quickly and regularization becomes essential. Because it tends to filter out information and make the network more invariant, SHADE seems well suited to this task. To investigate this, we propose to train DNNs with and without SHADE on CIFAR-10 and MNIST-M with different numbers of samples in the training set.
First, we tested this approach on the digits dataset MNIST-M ganin2015unsupervised. This dataset consists of the MNIST digits where the background and digit have been replaced by colored and textured information (see Fig. 1(c) for examples). The interest of this dataset is that it contains a lot of unnecessary information that should be filtered out, and it is therefore well adapted to measure the effect of SHADE. A simple convolutional network has been trained with different numbers of samples of MNIST-M, and the optimal regularization weight for SHADE has been determined on the validation set (see training details in Appendix C). The results can be seen in Figure 1(a). We can see that, especially for small numbers of training samples (below 1,000), SHADE provides an important gain of 10 to 15 percentage points over the baseline. This shows that SHADE helped the model find invariant and discriminative patterns using fewer data samples.
Additionally, Figure 1(c) shows samples that are misclassified by the baseline model but correctly classified when using SHADE. These images contain a large amount of intra-class variance (color, texture, etc.) that is not useful for the classification task, explaining why adding SHADE, which encourages the model to discard information, allows important performance gains on this dataset, especially when only few training samples are given.
Finally, to confirm this behavior, we also applied the same procedure in a more conventional setting by training an Inception model on CIFAR-10. Figure 1(b) shows the results in that case. We can see that once again SHADE helps the model gain in performance, and that this behavior is more noticeable when the number of samples is limited, allowing a gain of 6% when using 4,000 samples.
4.4 Further experiments: exploration of the latent variable
[Table 2: accuracies obtained after replacing the ReLU activation of a given layer with a binary activation and fine-tuning the upper layers, for the MLP, AlexNet, Inception and ResNet architectures; the numeric entries were not recovered.]
SHADE is based on the intuition that the class information encoded within a neuron is mostly contained in a binary latent variable noted Z. To justify this assumption, we propose two experiments, both studying the neuron variables of trained networks. We first look at the neuron variable distribution and discover that it has two modes; those two modes would be associated with the two states of the latent variable Z. In the second experiment, we investigate the possibility of transforming the ReLU activation function of a layer into a binary activation function that can only take two values. By exhibiting such a binary activation that does not affect the accuracy, we show that we can summarize the class information of a neuron into a binary variable and still get the same prediction accuracy as with the continuous ReLU activation. Both experiments have been done on the CIFAR-10 dataset with the same networks used in Sec. 4.1.
Two-mode neuron variable.
Our first experiment tends to demonstrate that DNN optimization drives the neuron distributions toward bimodal distributions. Focusing on the input neurons Y of the last layer of a trained network (before the class projection), the output of the network is obtained by applying a fully connected layer to Y plus a softmax activation, with W the matrix of weights of the last fully connected layer and b its bias vector. By plotting a histogram of any coordinate (one neuron) of Y, it is not possible to identify two modes. This can be seen on the purple distribution in Figure 3 (left), which represents the distributions of Y on the CIFAR-10 training set for five random coordinates of an Inception model. However, Y contains a lot of information that will not be exploited for the prediction. Indeed, let us rewrite Y = Y⊥ + Y∥, with Y⊥ in the kernel of W (such that W Y = W Y∥) and Y∥ in the complement of W's kernel. The class space generally has many fewer dimensions than the space of Y, thus the kernel of W is not reduced to zero and some information is filtered out. Figure 3 (middle and right) shows the distribution of Y∥ on the training set (blue, middle) and on the validation set (green, right), for the same neurons of the same network as the activations on the left. Y∥ is the information effectively used for the prediction, and its distribution looks very much like a mixture of two Gaussians. We clearly identify two modes, one negative and one positive. This confirms the intuition of a binary latent variable whose values correspond to the two modes. The fact that the distribution looks like a mixture of two Gaussians supports the use of the inequality of Eq. (7) in the definition of SHADE. The same experiment has been done on the other network architectures, and the distributions can be seen in the Appendix. From those experiments we also notice that the better the two modes are separated, the better the DNN behaves, as can be noted by comparing the MLP with the ResNet distributions.
The distributions are obtained via a kernel density estimator, using as data the neuron variables output by a forward pass on the totality of the CIFAR-10 training and test sets. The three distributions are for the same coordinates, taken randomly among all units. Note that the experiment could have been done on other layers, but the computation of Y∥ would be more complicated, as the transformations up to the top of the network are not linear. From another perspective, the way we exhibit the two modes could help compute the probability p(Z = 1 | Y) with a Gaussian mixture model approximation, for instance. We leave this lead for future work.
Binary activation.
In the second experiment, we replace the ReLU activation function of a chosen layer of a trained network with a binary activation function. The binary function maps a positive input to μ⁺, the average of the positive values taken by the variable before any activation function, and a non-positive input to zero. After replacing the activation function, we fine-tune the layers on top of the chosen layer in order to adapt the top of the network to the new values, and we report the obtained accuracies in Table 2 for the different networks and different layers. We note that the differences in accuracy are very small losses, and can even sometimes be slight improvements. This means that, for a given layer, the information used for the class prediction can be summed up in a binary variable, confirming the existence of a binary latent variable containing most of the class information exploited by the rest of the network. The fact that this applies to all layers of the network is consistent with the application of the SHADE loss to all layers. Note that this binary activation could be further researched to improve the modeling integrated in SHADE.
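The binary activation used in this experiment can be sketched as follows. Note one simplification on our part: here μ⁺ is computed from the given list of pre-activations, whereas in the experiment it is the average over the positive values taken by the unit on the dataset.

```python
def binary_activation(pre_acts):
    """Two-valued replacement for ReLU: 0 for y <= 0, and mu_plus
    (the mean of the positive pre-activations) for y > 0."""
    positives = [y for y in pre_acts if y > 0]
    mu_plus = sum(positives) / len(positives) if positives else 0.0
    return [mu_plus if y > 0 else 0.0 for y in pre_acts]

acts = binary_activation([-1.0, 0.5, 2.5, -0.2, 3.0])
# positive values: 0.5, 2.5, 3.0 -> mu_plus = 2.0
assert acts == [0.0, 2.0, 2.0, 0.0, 2.0]
assert len(set(acts)) <= 2   # at most two distinct output values
```

Like ReLU, this function zeroes out negative pre-activations, but it also collapses all positive responses onto a single level, so each unit effectively transmits only the state of its latent Z.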
5 Conclusion
In this paper, we introduced a new regularization method for DNN training, SHADE, which focuses on minimizing the entropy of the representation conditionally to the labels. This regularization aims at increasing the intra-class invariance of the model while keeping class information. SHADE is tractable, adding only a small computational overhead when included in an efficient SGD training. We show that our SHADE regularization method significantly outperforms standard approaches such as weight decay or dropout with various DNN architectures on CIFAR-10. We also validate the scalability of SHADE by applying it to ImageNet. The invariance brought by SHADE is further illustrated by its ability to ignore irrelevant visual information (texture, color) on MNIST-M. Finally, we also highlight the increasing benefit of our regularizer when the number of training examples becomes small. The information-theory-based interpretation of SHADE, from which it has been established, allows for further improvements of SHADE in future work.
References
Appendix A Denoised activations
Appendix B Experiment details on CIFAR-10
Optimization.
For all experiments, the learning rate is initialized to a starting value and a multiplicative decay is applied to it after every batch. The momentum is constant and set to 0.9. We detail here the initial learning rate and the decay for every network in the format (initial learning rate, decay): MLP (,), AlexNet (,), Inception (,), ResNet (,).
Hyperparameter tuning
For weight decay and SHADE, the optimal regularization weight of each model has been chosen among a range of tried values to maximize the accuracy on the validation sets. Dropout was applied on the last two layers of every network; the optimal activation probability for each model has also been chosen to maximize the accuracy on the validation sets.
Appendix C Experiment details on MNIST-M
Dataset splits and creation.
To create MNIST-M, we kept the provided splits of MNIST, so we have 55,000 training samples, 5,000 validation samples, and 10,000 test samples. Each digit of MNIST is processed to add color and texture by taking a crop from images of the BSDS500 dataset. This procedure is explained in ganin2015unsupervised.
Subsets of limited size.
To create the training sets of limited size n, we randomly pick n/10 samples from each class (since there are 10 classes). When increasing n we keep the previously picked samples, so the training samples for a smaller n are a subset of the ones for a larger n. The samples chosen for a given value of n are the same across all models trained using this number of samples.
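A possible implementation of this nested-subset protocol (the function name and seeding scheme are ours, not from the paper):

```python
import random

def nested_subsets(samples_by_class, sizes, seed=0):
    """Build nested training subsets: for each total size n, keep n / n_classes
    samples per class, reusing the previously picked ones as n grows."""
    rng = random.Random(seed)
    # One fixed shuffle per class; prefixes of it give the nesting property.
    shuffled = {c: rng.sample(s, len(s)) for c, s in samples_by_class.items()}
    subsets = {}
    for n in sorted(sizes):
        per_class = n // len(samples_by_class)
        subsets[n] = [x for c in sorted(shuffled) for x in shuffled[c][:per_class]]
    return subsets

data = {c: ["c%d_img%d" % (c, i) for i in range(100)] for c in range(10)}
subs = nested_subsets(data, sizes=[100, 1000])
assert len(subs[100]) == 100 and len(subs[1000]) == 1000
assert set(subs[100]) <= set(subs[1000])   # nesting property
```

Fixing the seed also reproduces the property that, for a given n, all models see exactly the same training samples.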
Image preprocessing.
The only preprocessing applied to the input images is that their values are rescaled to a fixed range.
Optimization.
For the training, we use mini-batches of size 50 and the Adam optimizer with the recommended parameters.
Hyperparameter tuning.
For weight decay and SHADE, the optimal regularization weight of each model (for each value of n) has been chosen among a range of tried values to maximize the accuracy on the validation sets.
Model architecture.
The model has the following architecture:

2D convolution ( kernel, padding 2, stride 1) + ReLU

MaxPooling

2D convolution ( kernel, padding 1, stride 1) + ReLU

MaxPooling

2D convolution ( kernel, padding 1, stride 1) + ReLU

MaxPooling

Fully connected (1024 inputs, 10 outputs) + SoftMax
Appendix D Experiment details on ImageNet
The fine-tuning in the experiments of Section 4.2 has been done with momentum SGD, with a learning rate of , a momentum of , and a batch size of 16 images. It took 8 epochs to converge.
Appendix E Entropy bounding the reconstruction error
In Section 2 we highlighted a link between the entropy H(X | Y) and the difficulty of recovering the input X from its representation Y. Here we exhibit a concrete relation between the reconstruction error, which quantifies the error made by a strategy that predicts X from Y, and the conditional entropy. This relation takes the form of an inequality, bounding the error measure in the best case (with the reconstruction strategy that minimizes the error) by an increasing function of the entropy. We note g the reconstruction model that tries to guess X from Y.
The discrete case
In case the input space is discrete, we consider the zero–one reconstruction error for one representation point y: err(y) = p(X ≠ g(y) | Y = y). This is the probability of error when predicting X from Y = y. The function that minimizes the expected error is g*(y) = argmax_x p(x | y), as shown in Proof 1. We derive the following inequality:
(H(X | Y) − log 2) / log |𝒳| ≤ err(g*) ≤ 1 − e^{−H(X | Y)}   (12)
The left side of the inequality uses Fano's inequality element; the right one is developed in Proof 2. These first inequalities show how the reconstruction error and the entropy are related: for very invariant representations, it is hard to recover the input X from Y, and the entropy H(X | Y) is high.
Besides, there can be an underlying continuity in the input space, and it could be unfair to penalize predictions close to the input as much as predictions far from it. We expose below another case that takes this proximity into account.
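The two sides of the discrete-case relation can be checked numerically on a toy joint distribution. The snippet below uses our own nats-version of the bounds (a weakened Fano lower bound and the Proof 2 upper bound), which follow from the standard statements.

```python
import math
from collections import defaultdict

# Toy joint distribution: X uniform on {0,1,2,3}, Y = X mod 2.
joint = {(x, x % 2): 0.25 for x in range(4)}

p_y = defaultdict(float)
for (x, y), p in joint.items():
    p_y[y] += p

# H(X|Y) in nats, and the error of the optimal (MAP) reconstruction g*.
h_xy, err = 0.0, 0.0
for y in p_y:
    cond = {x: p / p_y[y] for (x, yy), p in joint.items() if yy == y}
    h_xy += p_y[y] * -sum(p * math.log(p) for p in cond.values())
    err += p_y[y] * (1 - max(cond.values()))

n_x = 4
assert err <= 1 - math.exp(-h_xy) + 1e-9            # upper bound (Proof 2)
assert err >= (h_xy - math.log(2)) / math.log(n_x)  # Fano-style lower bound
```

On this example H(X | Y) = log 2 and err = 0.5, so the upper bound 1 − e^{−H(X|Y)} is attained with equality.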
The continuous case
In the case of a convex input space and an input variable with continuous density, we consider the squared 2-norm distance as reconstruction error: err(y) = E[‖X − g(y)‖² | Y = y]. This error penalizes the average distance between the input and its reconstruction from Y. The function that minimizes the expected error is the conditional expected value: g*(y) = E[X | Y = y]. Then err(g*) = E[Var(X | Y)]. Helped by the well-known inequality Var(X) ≥ (1/2πe) e^{2h(X)}, relating the variance to the differential entropy h, we obtain:
err(g*) ≥ (1/2πe) e^{2 h(X | Y)}   (13)
Here again, notice that a high entropy implies a high reconstruction error in the best case.
E.1 Proof 1
We have

err(y) = p(X ≠ g(y) | Y = y)   (14)
       = 1 − p(X = g(y) | Y = y)   (15)
       ≥ 1 − max_x p(x | y).   (16)

Since

p(X = g(y) | Y = y) = max_x p(x | y) when g(y) = argmax_x p(x | y),   (17)

the reconstruction that minimizes the error is g*(y) = argmax_x p(x | y). However, this is theoretical, because in most cases p(x | y) is unknown.
E.2 Proof 2
We have:

H(X | Y = y) = Σ_x p(x | y) (−log p(x | y))   (18)
             ≥ Σ_x p(x | y) (−log max_x′ p(x′ | y))   (19)
             = −log max_x p(x | y),   (20)

so that max_x p(x | y) ≥ e^{−H(X | Y = y)}. Taking the expectation over Y:

E_Y[max_x p(x | Y)] ≥ E_Y[e^{−H(X | Y = y)}]   (21)
                    ≥ e^{−H(X | Y)},   (22)

where Eq. (22) uses Jensen's inequality (t ↦ e^{−t} being convex). Using the result of Proof 1,

err(g*) = 1 − E_Y[max_x p(x | Y)].   (23)

Thus,

err(g*) ≤ 1 − e^{−H(X | Y)}.   (24)
Appendix F SHADE Gradients
Here we study the influence of SHADE on a gradient descent step, for a single neuron of a single layer and for one training sample. We consider the case of a linear layer, so that the neuron's value y is a linear function of the layer input.
The gradient of the SHADE loss with respect to the layer's weights involves the derivative of p(Z = 1 | y), which is positive since this probability increases with y. We can interpret the direction of this gradient by analyzing its two terms as follows:

First term: if p(Z = 1 | y) is large, that means that y is closer to μ₁ than it is to μ₀. Then this term is positive and contributes to increasing y, meaning that it increases the probability of y being from mode 1. In a way, it increases the average margin between positive and negative detections. Note that if there is no ambiguity about the mode of y, meaning that p(Z = 1 | y) is very close to 0 or 1, then this term has a negligible effect.

Second term: this term moves y toward the mean μ_z of the mode that presents the bigger probability. This has the effect of concentrating the outputs around their expectation, depending on their mode, to get sparser activations.