Softmax Is Not an Artificial Trick:
An Information-Theoretic View of Softmax in Neural Networks
Abstract
Despite the great popularity of applying softmax to map the non-normalised outputs of a neural network to a probability distribution over the predicted classes, this normalised exponential transformation still seems artificial. A theoretic framework that incorporates softmax as an intrinsic component is still lacking. In this paper, we view neural networks embedding softmax from an information-theoretic perspective. Under this view, we can naturally and mathematically derive log-softmax as an inherent component of a neural network for evaluating the conditional mutual information between network output vectors and labels given an input datum. We show that training deterministic neural networks through maximising log-softmax is equivalent to enlarging the conditional mutual information, i.e., feeding label information into network outputs. We also generalise our information-theoretic perspective to neural networks with stochasticity and derive information upper and lower bounds of log-softmax. In theory, such an information-theoretic view offers rationality support for embedding softmax in neural networks; in practice, we demonstrate a computer vision application example of how to employ our information-theoretic view to filter out targeted objects in images.
Keywords: Softmax · Mutual Information · Neural Network
1 Introduction
Thanks to AlexNet [11], neural networks have regained attention in the computer vision and machine learning communities and have demonstrated spectacular performance in a wide range of applications [12]. A large number of neural networks, particularly in classification tasks, apply softmax to map the non-normalised network outputs to a categorical probability distribution over classes [3], followed by taking the negative log-softmax as the cross-entropy between the estimated and true class distributions for minimisation [14].
Despite the tremendous popularity of softmax and its extraordinary performance in modelling categorical probability distributions [15], such a transformation of neural network outputs to probability distributions still seems artificial. We desire to build a theoretic framework that can naturally and mathematically derive log-softmax as an inherent ingredient of a neural network, instead of rigidly gluing the theories of neural networks and probabilities together. In this paper, we present such an end-to-end theoretic view, in which log-softmax emerges smoothly as the quantity for evaluating the conditional mutual information between network output vectors and labels given an input datum.
To the best of the authors' knowledge, we are the first to build a coherent information-theoretic view that incorporates log-softmax as an intrinsic building block. We summarise our contributions as follows:
- We show that training deterministic neural networks through maximising log-softmax is equivalent to enlarging the conditional mutual information, i.e., feeding label information into network outputs.
- We generalise our information-theoretic perspective to neural networks with stochasticity and derive information upper and lower bounds of log-softmax.
- We demonstrate a computer vision application example of how to employ our information-theoretic view to filter out targeted objects in images.
In sum, our information-theoretic view offers rationality support in theory for embedding softmax in neural networks, and in practice can be applied to solve concrete tasks with impressive performance.
2 Preliminaries
2.1 Softmax
2.1.1 Definition
Neural networks embed the softmax function, abbreviated as softmax in this paper, to learn multi-class categorical distributions, particularly within classification tasks. We first formally define softmax. We let the training data consist of $K$ classes and $N$ labelled instances $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{1, \dots, K\}$ is the integer class label of the input datum $x_i$. We train a neural network $f$ with parameter $\theta$ to output the likelihood vector $h = f_\theta(x)$, where the greater each element $h_k$ is, the more likely the input datum $x$ belongs to the class $k$. Nonetheless, such an $h_k$ is not a probability, since it can be negative as well as greater than $1$, violating the probability axioms. Therefore, we employ softmax, which is a normalised exponential function, to map each $h_k$ into the range $(0, 1)$. Formally,
(1)  $\mathrm{softmax}(h)_k = \dfrac{e^{h_k}}{\sum_{j=1}^{K} e^{h_j}}$
where $h_k$ denotes the $k$th element of $h$. As a consequence, we can treat the joint outcomes of softmax as a categorical probability distribution over the $K$ classes, since each element of $\mathrm{softmax}(h)$ is bounded between $0$ and $1$, and all the elements of $\mathrm{softmax}(h)$ sum up to $1$.
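As a minimal numerical sketch of Eq. 1 (not code from the paper; the function name and the example logits are our own), softmax is commonly implemented with the max-subtraction trick for numerical stability, which leaves the result unchanged:

```python
import math

def softmax(h):
    """Map a vector of K real-valued logits to a categorical distribution (Eq. 1)."""
    m = max(h)                               # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -3.0])
assert all(0.0 < p < 1.0 for p in probs)     # every element lies in (0, 1)
assert abs(sum(probs) - 1.0) < 1e-12         # the elements sum up to one
```

The two assertions check exactly the two probability-axiom properties stated above.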
2.1.2 Training
It is common to train neural networks embedding softmax by minimising the cross-entropy loss $\mathcal{L}(\theta)$, which has the form
(2)  $\mathcal{L}(\theta) = -\log \mathrm{softmax}(f_\theta(x))_y$
i.e., by adding a negative sign and a logarithm in front of softmax.
Equivalently, and in this paper, we convert Eq. 2 to an objective function $J(\theta)$. As a result, instead of minimising Eq. 2 with respect to the parameter $\theta$, we maximise the objective with respect to the same parameter, which has the form
(3)  $J(\theta) = \log \mathrm{softmax}(f_\theta(x))_y = -\mathcal{L}(\theta)$
through which we have turned the problem of learning a categorical distribution into an optimisation problem. Specifically, via maximising $J(\theta)$ in Eq. 3, or minimising $\mathcal{L}(\theta)$ in Eq. 2 with respect to the parameter $\theta$, we can refine the learned conditional categorical distribution over classes given the input datum $x$.
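A toy sketch of the objective above, under our own notation (the helper `log_softmax`, the logits, and the label index are hypothetical stand-ins for $f_\theta(x)$ and $y$): maximising the log-softmax objective of Eq. 3 is the same as minimising the cross-entropy of Eq. 2.

```python
import math

def log_softmax(h, y):
    """log softmax(h)_y, computed stably as h_y minus logsumexp(h)."""
    m = max(h)
    lse = m + math.log(sum(math.exp(v - m) for v in h))
    return h[y] - lse

logits = [2.0, 0.5, -1.0]                # hypothetical network outputs f_theta(x)
label = 0                                # hypothetical true class index y

objective = log_softmax(logits, label)   # J(theta) of Eq. 3: maximise this ...
loss = -objective                        # ... or minimise the cross-entropy of Eq. 2
assert loss > 0.0                        # -log of a probability in (0, 1) is positive
```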
2.2 The Donsker-Varadhan Lower Bound of Mutual Information
2.2.1 Mutual Information
In information theory, mutual information is a fundamental quantity for evaluating the relationship between two random variables, which is a key problem in science and engineering [17]. Without loss of generality, we denote the two random variables as $X$ and $Y$, where $X$ is a continuous and $Y$ is a discrete variable. Furthermore, their joint probability distribution is defined as $p(x, y)$, and their marginal probability distributions as $p(x)$ and $p(y)$ respectively. We abbreviate these three distributions as $\mathbb{P}_{XY}$, $\mathbb{P}_X$ and $\mathbb{P}_Y$. Then the mutual information of $X$ and $Y$ can be expressed in the form
(4)  $I(X; Y) = \sum_{y} \int p(x, y) \log \dfrac{p(x, y)}{p(x)\, p(y)} \, dx$
We can express the mutual information in Eq. 4 as a KL divergence [10]. Formally,
(5)  $I(X; Y) = D_{\mathrm{KL}}\left( \mathbb{P}_{XY} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Y \right)$
Intuitively, the larger the divergence between the joint and the product of the marginal probabilities of the random variables $X$ and $Y$, the more dependent $X$ and $Y$ are. In contrast, if $X$ and $Y$ are fully independent, the mutual information between them vanishes [4].
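The intuition above can be checked numerically for the fully discrete case of Eqs. 4-5; the 2x2 joint distribution below is an illustrative stand-in, not data from the paper:

```python
import math

# A toy joint distribution p(x, y) over two binary variables (rows: x, cols: y).
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]               # marginal over x
p_y = [sum(col) for col in zip(*p_xy)]         # marginal over y

# Eq. 4 in its discrete form: sum of p(x,y) log p(x,y)/(p(x)p(y)).
mi = sum(p_xy[i][j] * math.log(p_xy[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))
assert mi > 0.0                                # dependent variables: positive MI

# Independent variables: the joint equals the product of marginals, MI vanishes.
q_xy = [[p_x[i] * p_y[j] for j in range(2)] for i in range(2)]
mi_indep = sum(q_xy[i][j] * math.log(q_xy[i][j] / (p_x[i] * p_y[j]))
               for i in range(2) for j in range(2))
assert abs(mi_indep) < 1e-12
```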
Similarly, we have the definition of the conditional mutual information $I(X; Y \mid Z)$ and its KL-divergence form, where $Z$ is the conditioning variable, formally as
(6)  $I(X; Y \mid Z) = \mathbb{E}_{Z}\left[ \sum_{y} \int p(x, y \mid z) \log \dfrac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)} \, dx \right]$
(7)  $I(X; Y \mid Z) = \mathbb{E}_{Z}\left[ D_{\mathrm{KL}}\left( \mathbb{P}_{XY \mid Z} \,\|\, \mathbb{P}_{X \mid Z} \otimes \mathbb{P}_{Y \mid Z} \right) \right]$
where $\mathbb{E}_{Z}$ is the abbreviation of the expectation $\mathbb{E}_{z \sim p(z)}$, and $\mathbb{P}_{XY \mid Z}$, $\mathbb{P}_{X \mid Z}$ as well as $\mathbb{P}_{Y \mid Z}$ are the abbreviations of $p(x, y \mid z)$, $p(x \mid z)$ as well as $p(y \mid z)$.
2.2.2 The Donsker-Varadhan Representation
The KL form of mutual information in Eq. 7 is intractable due to the integration. We utilise a lower bound of the mutual information based on the Donsker-Varadhan (DV) representation [6] to make mutual information computable. Formally,
(8)  $I(X; Y) \geq \mathbb{E}_{\mathbb{P}_{XY}}\left[ T_\theta(x, y) \right] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Y}\left[ e^{T_\theta(x, y)} \right]$
where $\theta$ is the parameter of the function $T_\theta$. In theory, given an optimal neural network to simulate $T_\theta$, Eq. 8 can be infinitesimally close to the mutual information $I(X; Y)$.
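The DV bound of Eq. 8 can be sketched on the toy discrete case; here the critic $T$ is a hypothetical 2x2 table of values rather than a trained network, and the joint distribution is an illustrative stand-in. At the optimal critic $T^{*}(x, y) = \log p(x,y) / (p(x)p(y))$ (plus any constant) the bound is tight; any other critic stays below the true mutual information.

```python
import math

# Toy joint over two binary variables and its marginals.
p_xy = [[0.4, 0.1], [0.1, 0.4]]
p_x = [sum(r) for r in p_xy]
p_y = [sum(c) for c in zip(*p_xy)]

true_mi = sum(p_xy[i][j] * math.log(p_xy[i][j] / (p_x[i] * p_y[j]))
              for i in range(2) for j in range(2))

def dv_bound(T):
    """E_{P_XY}[T] - log E_{P_X (x) P_Y}[e^T], the right-hand side of Eq. 8."""
    e_joint = sum(p_xy[i][j] * T[i][j] for i in range(2) for j in range(2))
    e_marg = sum(p_x[i] * p_y[j] * math.exp(T[i][j])
                 for i in range(2) for j in range(2))
    return e_joint - math.log(e_marg)

T_opt = [[math.log(p_xy[i][j] / (p_x[i] * p_y[j])) for j in range(2)]
         for i in range(2)]
T_bad = [[0.0, 0.0], [0.0, 0.0]]               # an uninformative critic

assert abs(dv_bound(T_opt) - true_mi) < 1e-9   # tight at the optimal critic
assert dv_bound(T_bad) <= true_mi              # a lower bound otherwise
```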
3 Information-Theoretic View
We now provide a coherent information-theoretic view, connecting log-softmax with the conditional mutual information between neural network output vectors and labels, conditioned on an input datum.
We first provide notations. As previously defined in Section 2.1.1, the log-softmax corresponding to the input datum $x$ and label $y$ is
(9)  $\log \mathrm{softmax}(f_\theta(x))_y = \log \dfrac{e^{f_\theta(x)_y}}{\sum_{k=1}^{K} e^{f_\theta(x)_k}}$
where $f_\theta$ is a neural network with parameter $\theta$.
We generalise this neural network from outputting a single result to producing a conditional Dirac delta distribution, whose probability density function (PDF) is defined as
(10)  $p(h \mid x) = \delta\left( h - f_\theta(x) \right)$
That is, instead of regarding $h$ as a direct output of the neural network given the input $x$, we consider it as a result sampled from a conditional distribution, formally written as $h \sim p(h \mid x)$. That is, the output of the neural network given $x$ becomes a conditional distribution. This more generalised definition does not impose any assumptions on the conventional implementations, but makes our theoretic framework more applicable and general. Under this new general form, we have
(11)  $\log \mathrm{softmax}(h)_y = \log \dfrac{e^{h_y}}{\sum_{k=1}^{K} e^{h_k}}, \quad h \sim p(h \mid x)$
where $y$ is the label corresponding to the input datum $x$.
We denote the conditional mutual information between the neural network output $H$ and the class label $Y$ conditioned on $X$ as $I(H; Y \mid X)$, and we define a function $T$ such that
(12)  $T(h, y) = h_y = f_\theta(x)_y$
which depicts the relationships among the neural network output $h$, the input datum $x$, and the label $y$.
Furthermore, we denote the distribution of $h$ given $x$ as $p(h \mid x)$, and the conditional categorical distribution of $y$ given the input datum $x$ as $p(y \mid x)$. Then, based on Eq. 12, we can write the DV lower bound of the conditional mutual information between the neural network output and the class labels conditioned on $x$ as
(13)  $I(H; Y \mid X = x) \geq \mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right] - \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{T(h, y)} \right]$
We are interested in the expectation of log-softmax, i.e., Eq. 11, under the distribution $p(h \mid x)$, which can be written formally as
(14)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right]$
where both $h$ and $y$ correspond to the input datum $x$ in Eq. 13. That is, they form positive sampling pairs.
For simplicity, we abbreviate the right-hand side of Eq. 13 as
(15)  $\mathcal{I}_{\mathrm{DV}}(x) = \mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right] - \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{T(h, y)} \right]$
and abbreviate Eq. 14 as
(16)  $\mathcal{E}_{\mathrm{LS}}(x) = \mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right]$
3.1 Relationships Between Log-Softmax and Conditional Mutual Information
3.1.1 Derivation
We now derive the relationship between the expectation of log-softmax, i.e., Eq. 16, and the DV lower bound of the conditional mutual information, i.e., Eq. 15. We show that, except for a constant $\log K$, where $K$ is the number of classes, they are equivalent.
(17)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] = \mathbb{E}_{p(h \mid x)}\left[ h_y - \log \sum_{k=1}^{K} e^{h_k} \right]$
With the constant $K$ being the number of classes, we further have
(18)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] = \mathbb{E}_{p(h \mid x)}\left[ h_y \right] - \mathbb{E}_{p(h \mid x)}\left[ \log \left( K \cdot \frac{1}{K} \sum_{k=1}^{K} e^{h_k} \right) \right]$
(19)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] = \mathbb{E}_{p(h \mid x)}\left[ h_y \right] - \mathbb{E}_{p(h \mid x)}\left[ \log \frac{1}{K} \sum_{k=1}^{K} e^{h_k} \right] - \log K$
Furthermore, the term $\frac{1}{K} \sum_{k=1}^{K} e^{h_k}$ in Eq. 19 can be viewed as an expectation of $e^{h_y}$ under the conditional distribution $p(y \mid x)$, taken to be uniform over the $K$ classes. Formally,
(20)  $\frac{1}{K} \sum_{k=1}^{K} e^{h_k} = \mathbb{E}_{p(y \mid x)}\left[ e^{h_y} \right]$
Substituting Eq. 20 into Eq. 19 gives
(21)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] = \mathbb{E}_{p(h \mid x)}\left[ h_y \right] - \mathbb{E}_{p(h \mid x)}\left[ \log \mathbb{E}_{p(y \mid x)}\left[ e^{h_y} \right] \right] - \log K$
Since the variable $h$ is sampled from a Dirac delta distribution, we have
(22)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathbb{E}_{p(y \mid x)}\left[ e^{h_y} \right] \right] = \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{h_y} \right]$
Combining Eqs. 21 and 22, and noting that $(h, y)$ form positive pairs so that $\mathbb{E}_{p(h \mid x)}\left[ h_y \right] = \mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right]$, we obtain
(23)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] = \mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right] - \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{T(h, y)} \right] - \log K$
Thereby, we derive an equivalence, up to the constant $\log K$, between the DV representation of the conditional mutual information and the expectation of log-softmax over the entire dataset. Consider the previously defined objective in Eq. 3, where we aim to maximise $J(\theta)$. This objective is equivalent to maximising the DV representation of the lower bound of the conditional mutual information $I(H; Y \mid X = x)$.
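The identity behind this derivation can be checked numerically in the deterministic case: for a single output vector $h$, a uniform label marginal over the $K$ classes, and the critic $T(h, y) = h_y$, log-softmax equals the DV-style expression minus $\log K$. This is a toy sketch under those assumptions; the logits below are arbitrary stand-ins, not values from the paper.

```python
import math

h = [1.5, -0.3, 0.7, 2.1]   # hypothetical deterministic output f_theta(x)
K = len(h)
y = 2                       # hypothetical true label

log_softmax_val = h[y] - math.log(sum(math.exp(v) for v in h))

# DV-style expression with critic T(h, y) = h_y and uniform p(y|x):
# E[T] - log E[e^T] = h_y - log((1/K) * sum_k e^{h_k})
dv_term = h[y] - math.log(sum(math.exp(v) for v in h) / K)

# Eq. 23: log-softmax equals the DV expression minus log K.
assert abs(log_softmax_val - (dv_term - math.log(K))) < 1e-9
```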
3.1.2 Information Flow of Training with Log-Softmax
We employ probabilistic graphical models to help explain the information flow during training with log-softmax. We consider the feedforward and backpropagation stages separately. In the feedforward stage, as in Figure 0(a), the neural network output vector $h$ and the label $y$ are conditionally independent given the input datum $x$. We feed information about $y$ to $h$ through backpropagation. During backpropagation, as in Figure 0(b), the objective $J$ is given. Therefore, the conditional independence between the network output vector and the label is broken. That is, $h$ and $y$ are no longer conditionally independent once $J$ is observed. As a consequence, information can be passed from the label $y$ to the network output vector $h$.
3.2 Generalisation to Neural Networks with Stochastic Outputs
In Section 3.1.1, we derived the theoretic relationship between log-softmax and the DV conditional mutual information lower bound by assuming the distribution $p(h \mid x)$ to be a Dirac delta distribution. In this section, we relax this assumption to derive a more general relationship between log-softmax and the conditional mutual information of stochastic neural networks, i.e., those whose outputs can be stochastic.
3.2.1 Lower Bound of Log-Softmax in Stochastic Neural Networks
Unlike in Eq. 22, where we can switch the expectation of a logarithm to the logarithm of an expectation without losing equality, here we need to apply Jensen's inequality and derive an inequality. Formally,
(24)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathbb{E}_{p(y \mid x)}\left[ e^{h_y} \right] \right] \leq \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{h_y} \right]$
As a consequence, Eq. 23 now becomes
(25)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] \geq \mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right] - \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{T(h, y)} \right] - \log K$
That is, the DV representation minus $\log K$ acts as a lower bound of the log-softmax expectation.
3.2.2 Upper Bound of Log-Softmax in Stochastic Neural Networks
In Section 2.1.1, we showed that softmax returns a value in the range $(0, 1)$. Therefore, its logarithm is a negative number, so the expectation of log-softmax is also negative. In contrast, by definition, mutual information is non-negative due to its equivalence with a KL-divergence. Thus, we have
(26)  $\mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] \leq 0 \leq I(H; Y \mid X = x)$
That is, the conditional mutual information acts as an upper bound of the log-softmax expectation.
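The negativity half of this bound is easy to verify numerically: every softmax probability lies strictly inside $(0, 1)$ for finite logits, so every log-softmax value is strictly negative. The random logits below are illustrative stand-ins for stochastic network outputs.

```python
import math
import random

random.seed(0)
vals = []
for _ in range(100):
    h = [random.uniform(-5.0, 5.0) for _ in range(4)]  # random stand-in logits
    y = random.randrange(4)
    # log softmax(h)_y = h_y - log sum_k e^{h_k}
    vals.append(h[y] - math.log(sum(math.exp(v) for v in h)))

assert max(vals) < 0.0   # every sampled log-softmax value is negative (Eq. 26)
```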
3.2.3 A Sandwich Form of Log-Softmax in Stochastic Neural Networks
Combining Eqs. 25 and 26, we have the sandwich form of the log-softmax expectation in stochastic neural networks
(27)  $\mathbb{E}_{p(h, y \mid x)}\left[ T(h, y) \right] - \log \mathbb{E}_{p(h \mid x)\, p(y \mid x)}\left[ e^{T(h, y)} \right] - \log K \leq \mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] \leq I(H; Y \mid X = x)$
When $T$ obtains its optimality $T^{*}$, the DV lower bound becomes tight, and the sandwich form of Eq. 27 becomes
(28)  $I(H; Y \mid X = x) - \log K \leq \mathbb{E}_{p(h \mid x)}\left[ \log \mathrm{softmax}(h)_y \right] \leq I(H; Y \mid X = x)$
3.2.4 Interpretation of the Information-Theoretic View of Log-Softmax in Stochastic Neural Networks
Optimising stochastic neural networks with log-softmax as the objective can diverge from maximising the conditional mutual information between neural network outputs and labels given the input datum. This is because, as the inequality in Eq. 25 indicates, maximising the expectation of log-softmax cannot ensure the enlargement of its lower bound, which involves the conditional mutual information, even though they share the same parameter $\theta$. Thus, training with log-softmax can cause stochastic neural networks to converge at a point that is non-optimal from the information perspective, i.e., one at which the neural network output vector $h$ does not obtain the most information about the label $y$.
4 Application Example: Information Masking (InfoMasking)
Our information-theoretic view of neural networks can be exploited for visualising and understanding convolutional neural networks. Here we present an example of how to employ it to conduct object masking, that is, to keep only the objects relating to a particular class and disguise all the other, irrelevant objects. Since we leverage our information-theoretic view, we name the approach information masking (InfoMasking).
Specifically, our dataset consists of images, each comprising one or two MNIST digits [13], appearing either left or right. Our classification network aims to predict whether an image contains a digit of the target class. We assume we have trained a neural network embedding log-softmax that can accurately predict whether an image contains the target digit ($y = 1$) or not ($y = 0$).
Proposition 1
The optimal function $T^{*}$ in the DV representation evaluates the pointwise conditional mutual information up to an additive constant.
Due to Proposition 1, we can consider $T^{*}(h, y)$ as an approximation of the pointwise conditional mutual information between the network output vector and the class $y = 1$, i.e., the event that the image contains the target digit.
We then explain the approach of information masking (InfoMasking). Given an input image $x$, we split the image into small patches, and fetch the approximate pointwise mutual information between the learned feature and the class $y$ for each patch, where $x_i$ is the $i$th small patch of the image $x$. Afterwards, for each small patch $x_i$, we use a simple thresholding technique, following [18]: we keep the patches whose mutual information is above a fixed fraction of the maximum conditional mutual information over the image, and mask the rest. That is, we mask the regions of the image $x$ that contain too little information about the target, which also explains the name information masking (InfoMasking).
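The thresholding step can be sketched as follows; the per-patch scores and the threshold fraction are hypothetical stand-ins for the approximate pointwise conditional mutual information values and the fraction used in the paper.

```python
# Illustrative InfoMasking thresholding sketch (not the paper's code).
scores = [0.05, 0.90, 0.40, 0.10, 0.75, 0.02]   # hypothetical score per image patch
threshold_fraction = 0.5                         # assumed fraction of the max score
cutoff = threshold_fraction * max(scores)

# Keep patches informative about the target class; mask the rest of the image.
mask = [s >= cutoff for s in scores]
kept = [i for i, m in enumerate(mask) if m]
assert kept == [1, 4]   # only the two high-information patches survive
```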
Figure 2 demonstrates the performance of our InfoMasking approach, where Figure 1(a) is the original image and Figure 1(b) is the processed outcome.
5 Related Work
The outputs of a multi-output classification network normally do not satisfy the axioms of probability, i.e., that probabilities be non-negative and sum up to one [5].
5.1 Probability Transformation
The softmax nonlinearity was originally proposed to bridge stochastic model paradigms and feedforward nonlinear "neural" networks; that work also referred to the softmax nonlinearity as the normalised exponential [3]. As to stochastic models, the method maximises the class-conditional probabilities based on Bayes' theorem. Formally,
(29)  $P(c \mid x) = \dfrac{p(x \mid c)\, P(c)}{\sum_{c'} p(x \mid c')\, P(c')}$
In contrast to the stochastic model method, a standard "multi-layer perceptron" (MLP) neural network is trained to minimise the squared error between the predicted and true targets, by updating its weights through backpropagation [9]. Formally,
(30)  $E(\theta) = \sum_{n} \left\| f_\theta(x_n) - t_n \right\|^2$
where $E$ represents "error", $f$ is a network and $\theta$ is its parameters, and $t$ stands for the true targets. However, the training method in Eq. 30 is intrinsically prone to overfitting. As a consequence, the softmax nonlinearity was introduced to map neural network outputs to follow the axioms of probability. The negative logarithm of softmax can then be further considered as the cross-entropy between a "true" distribution and an estimated distribution.
5.2 Winner-Take-All Mapping
Softmax generalises maximum picking [8]. It stands for a smooth version of the winner-take-all activation, in the sense that outputs change smoothly and equal inputs produce equal outputs [3]. Moreover, the exponential enhances the dominance of the largest value, so that after the transformation the largest value tends towards $1$ while all the other transformed values approach zero [2]. Work in [7] also shows an equivalence between the "winner-take-all" network and softmax.
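This smoothed winner-take-all behaviour can be demonstrated numerically: amplifying the logits (a hypothetical scale factor of 10 below; the logits themselves are illustrative) drives softmax towards the one-hot maximum-picking outcome.

```python
import math

def softmax(h):
    """Standard stable softmax over a list of logits."""
    m = max(h)
    e = [math.exp(v - m) for v in h]
    s = sum(e)
    return [x / s for x in e]

h = [2.0, 1.0, 0.5]
mild = softmax(h)                        # smooth: mass spread over the classes
sharp = softmax([10.0 * v for v in h])   # amplified: nearly one-hot winner-take-all

assert max(mild) < 0.9                   # the winner does not fully dominate
assert max(sharp) > 0.99                 # the winner takes (almost) all
assert sharp.index(max(sharp)) == h.index(max(h))   # the winner is preserved
```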
6 Anomalies with Differential Entropy
The authors are aware that the conditional mutual information $I(H; Y \mid X)$ is zero when the distribution $p(h \mid x)$ is a Dirac delta distribution. Therefore, the DV representation acts as a lower bound of a conditional mutual information whose value is $0$. This seems problematic, since even under the optimal DV representation, i.e., when the lower bound becomes the highest, $h$ seemingly still has no information about $y$, and most of the time this DV representation is negative. Nonetheless, our information-theoretic view is built upon differential entropy, which loses fundamental associations with discrete entropy; we cannot interpret its values with the interpretations of discrete entropy. Therefore, we are not concerned about the negativity of the DV representation or the zero-information problem, since the concerns mentioned above are based on intuitions from discrete information theory.
7 Acknowledgements
This is a preprint. The authors welcome and thank any criticism and suggestions. Please feel free to contact the authors with any concerns.
References
 [1] (2018) MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062. Cited by: §4.
 [2] (1995) Neural networks for pattern recognition. Oxford University Press. Cited by: §5.2.
 [3] (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Cited by: §1, §5.1, §5.2.
 [4] (2012) Elements of information theory. John Wiley & Sons. Cited by: §2.2.1.
 [5] (1991) Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems, pp. 853–859. Cited by: §5.
 [6] (1975) Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics 28 (1), pp. 1–47. Cited by: §2.2.2.
 [7] (1994) The "softmax" nonlinearity: derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In Advances in Neural Information Processing Systems, pp. 882–887. Cited by: §5.2.
 [8] (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §5.2.
 [9] (1990) Connectionist learning procedures. In Machine Learning, pp. 555–610. Cited by: §5.1.
 [10] (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.2.1.
 [11] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
 [12] (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
 [13] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.
 [14] (2015) CS231n: convolutional neural networks for visual recognition. University lecture. Cited by: §1.
 [15] (2017) Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Processing Magazine 34 (6), pp. 85–95. Cited by: §1.
 [16] (2019) CCMI: classifier based conditional mutual information estimation. arXiv preprint arXiv:1906.01824. Cited by: §4.
 [17] (2019) On variational bounds of mutual information. arXiv preprint arXiv:1905.06922. Cited by: §2.2.1.
 [18] (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §4.