Softmax Is Not an Artificial Trick: An Information-Theoretic View of Softmax in Neural Networks

Zhenyue Qin
Research School of Computer Science
Australian National University
Canberra, Australia
Dongwoo Kim
Research School of Computer Science
Australian National University
Canberra, Australia

Despite the great popularity of applying softmax to map the non-normalised outputs of a neural network to a probability distribution over the predicted classes, this normalised exponential transformation still seems artificial. A theoretical framework that incorporates softmax as an intrinsic component is still lacking. In this paper, we view neural networks embedding softmax from an information-theoretic perspective. Under this view, we can naturally and mathematically derive log-softmax as an inherent component of a neural network for evaluating the conditional mutual information between network output vectors and labels given an input datum. We show that training deterministic neural networks by maximising log-softmax is equivalent to enlarging the conditional mutual information, i.e., feeding label information into network outputs. We also generalise our information-theoretic perspective to neural networks with stochasticity and derive information upper and lower bounds of log-softmax. In theory, such an information-theoretic view offers a rationale for embedding softmax in neural networks; in practice, we demonstrate a computer vision application example of how to employ our information-theoretic view to filter out targeted objects in images.


Keywords: Softmax, Mutual Information, Neural Network

1 Introduction

Thanks to AlexNet [11], neural networks have regained attention in the computer vision and machine learning communities and demonstrate spectacular performance in a wide range of applications [12]. A large number of neural networks, particularly in classification tasks, apply softmax to map the non-normalised network outputs to a categorical probability distribution over classes [3], and then take the negative log-softmax as the cross-entropy between the estimated and true class distributions for minimisation [14].

Despite the tremendous popularity of softmax and its extraordinary performance in modelling categorical probability distributions [15], converting neural network outputs to probability distributions with softmax still seems artificial. We wish to build a theoretical framework that naturally and mathematically derives log-softmax as an inherent ingredient of a neural network, instead of rigidly glueing together the theories of neural networks and probability. In this paper, we present such an end-to-end theoretical view, in which log-softmax arises smoothly from evaluating the conditional mutual information between network output vectors and labels given an input datum.

To the best of the authors' knowledge, we are the first to build a coherent information-theoretic view that incorporates log-softmax as an intrinsic building block. We summarise our contributions as follows:

  1. We show that training deterministic neural networks through maximising log-softmax is equivalent to enlarging the conditional mutual information, i.e., feeding label information into network outputs.

  2. We generalise our information-theoretic perspective to neural networks with stochasticity and derive information upper and lower bounds of log-softmax.

  3. We demonstrate a computer vision application example of how to employ our information-theoretic view to filter out targeted objects in images.

In sum, our information-theoretic view offers a rationale for embedding softmax in neural networks in theory, and can be applied to concrete tasks in practice, as the object-masking example in Section 4 illustrates.

2 Preliminaries

2.1 Softmax

2.1.1 Definition

Neural networks embed the softmax function, abbreviated as softmax in this paper, to learn multi-class categorical distributions, particularly in classification tasks. We first formally define softmax $\sigma$. We let the training data consist of $M$ classes and $N$ labelled instances $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{1, \dots, M\}$ is the integer class label of the input datum $x_i$. We train a neural network to output the likelihood scores $z \in \mathbb{R}^{M}$ for an input datum $x$, where the greater each element $z_c$ is, the more likely this input datum belongs to the class $c$. Nonetheless, such a $z_c$ is not a probability since it can be negative as well as greater than $1$, violating the probability axioms. Therefore, we employ softmax $\sigma$, which is a normalised exponential function, to map each $z_c$ into the range $(0, 1)$. Formally,

$$\sigma(z)_c = \frac{\exp(z_c)}{\sum_{c'=1}^{M} \exp(z_{c'})}, \tag{1}$$

where $z_c$ denotes the $c$-th element of $z$. As a consequence, we can treat the joint outcomes of softmax as a categorical probability distribution over the $M$ classes, since each element of $\sigma(z)$ is bounded between $0$ and $1$, and all the elements of $\sigma(z)$ sum up to $1$.
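As a minimal numerical sketch of the normalised exponential in Eq. 1 (Python with NumPy and the example scores are assumptions made here for illustration, not part of the original work), the following checks that the outputs lie strictly between 0 and 1 and sum to 1:

```python
import numpy as np

def softmax(z):
    """Normalised exponential of a 1-D score vector z."""
    shifted = z - np.max(z)   # subtracting the max leaves the result unchanged but avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])   # hypothetical raw network outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

Note that the raw scores include a negative value and a value greater than 1, while the transformed values form a valid categorical distribution.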

2.1.2 Training

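It is common to train neural networks embedding softmax by minimising the cross-entropy loss $\mathcal{L}$, which has the form

$$\mathcal{L}(\varphi) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\left(n_\varphi(x_i)\right)_{y_i}, \tag{2}$$

where $n_\varphi$ denotes the neural network with parameter $\varphi$, i.e., by taking the logarithm of softmax and negating it.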

Equivalently, in this paper we convert Eq. 2 into an objective function $\mathcal{J}$. As a result, instead of minimising Eq. 2 with respect to the parameter $\varphi$, we maximise the objective with respect to the same parameter, which has the form

$$\mathcal{J}(\varphi) = -\mathcal{L}(\varphi) = \frac{1}{N} \sum_{i=1}^{N} \log \sigma\left(n_\varphi(x_i)\right)_{y_i}, \tag{3}$$

through which we have turned the problem of learning a categorical distribution into an optimisation problem. Specifically, by maximising $\mathcal{J}$ in Eq. 3, or minimising $\mathcal{L}$ in Eq. 2, with respect to the parameter $\varphi$, we refine the learned conditional categorical distribution $p(y \mid x)$ given the input datum $x$.
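The relation between the loss in Eq. 2 and the objective in Eq. 3 can be sketched numerically as follows. This is an illustrative sketch only: averaging over instances, the zero-based labels, and the example scores are assumptions made here.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax of an (N, M) score matrix, computed stably."""
    shifted = z - z.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def objective(z, y):
    """Mean log-softmax of the true classes (the quantity to maximise)."""
    return log_softmax(z)[np.arange(len(y)), y].mean()

def cross_entropy(z, y):
    """Cross-entropy loss: the negated objective (the quantity to minimise)."""
    return -objective(z, y)

z = np.array([[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]])  # hypothetical outputs, N=2, M=3
y = np.array([0, 1])                                # zero-based class labels
J, L = objective(z, y), cross_entropy(z, y)
print(J, L)
```

Raising the score of the true class increases the objective and lowers the loss by the same amount, which is why the two formulations share the same optimal parameter.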

2.2 The Donsker-Varadhan Lower Bound of Mutual Information

2.2.1 Mutual Information

In information theory, mutual information is a fundamental quantity for evaluating the relationship between two random variables, which is a key problem in science and engineering [17]. Without loss of generality, we denote the two random variables as $X$ and $Y$, where $X$ is a continuous and $Y$ is a discrete variable. Furthermore, their joint probability distribution is defined as $p_{XY}(x, y)$, and their marginal probability distributions as $p_X(x)$ and $p_Y(y)$ respectively. We abbreviate these three distributions as $P_{XY}$, $P_X$ and $P_Y$. Then the mutual information of $X$ and $Y$ can be expressed as

$$I(X; Y) = \int_{x} \sum_{y} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} \, dx. \tag{4}$$

We can express the mutual information in Eq. 4 as a KL divergence [10]. Formally,

$$I(X; Y) = D_{\mathrm{KL}}\left(P_{XY} \,\|\, P_X \otimes P_Y\right). \tag{5}$$

Intuitively, the larger the divergence between the joint distribution and the product of the marginal distributions of the random variables $X$ and $Y$, the stronger the dependence between $X$ and $Y$. In contrast, if $X$ and $Y$ are fully independent, then the mutual information between them vanishes [4].
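The claim that mutual information vanishes under independence can be checked on a toy example. The sketch below is a simplification made here for illustration: both variables are taken to be discrete so that the integral becomes a sum, and the probability tables are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    mask = joint > 0                        # 0 * log 0 is taken as 0
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())

independent = np.outer([0.3, 0.7], [0.4, 0.6])        # joint equals product of marginals
dependent = np.array([[0.45, 0.05], [0.05, 0.45]])    # mass concentrated on the diagonal
mi_ind = mutual_information(independent)
mi_dep = mutual_information(dependent)
print(mi_ind, mi_dep)
```

The independent table yields a value that is zero up to floating-point error, whereas the dependent table yields a strictly positive value.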

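Similarly, we have the definition of the conditional mutual information and its KL-divergence form, where $Z$ is the conditioning variable, formally as

$$I(X; Y \mid Z) = \int_{z} p_Z(z) \int_{x} \sum_{y} p_{XY|Z}(x, y \mid z) \log \frac{p_{XY|Z}(x, y \mid z)}{p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)} \, dx \, dz \tag{6}$$

$$= \mathbb{E}_{Z}\left[ D_{\mathrm{KL}}\left(P_{XY|Z} \,\|\, P_{X|Z} \otimes P_{Y|Z}\right) \right], \tag{7}$$

where $\mathbb{E}_{Z}$ is the abbreviation of the expectation $\mathbb{E}_{z \sim P_Z}$, and $P_{XY|Z}$, as well as $P_{X|Z}$ and $P_{Y|Z}$, are the abbreviations of $p_{XY|Z}(x, y \mid z)$, as well as $p_{X|Z}(x \mid z)$ and $p_{Y|Z}(y \mid z)$.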

2.2.2 The Donsker-Varadhan Representation

The KL form of mutual information in Eq. 7 is intractable due to the integration. We utilise a lower bound on the mutual information based on the Donsker-Varadhan (DV) representation [6] to make mutual information computable. Formally,

$$I(X; Y) \geq \mathbb{E}_{P_{XY}}\left[f_\varphi(x, y)\right] - \log \mathbb{E}_{P_X \otimes P_Y}\left[e^{f_\varphi(x, y)}\right], \tag{8}$$

where $\varphi$ is the parameter of the function $f_\varphi$. In theory, given an optimal neural network to simulate $f_\varphi$, Eq. 8 can be arbitrarily close to the mutual information $I(X; Y)$.
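The behaviour of the DV bound in Eq. 8 can be illustrated on a small discrete example. This is a sketch under assumptions made here: both variables are discrete, the probability table is hypothetical, and the function inside the bound is given as a table `T` rather than a neural network.

```python
import numpy as np

def dv_bound(joint, T):
    """DV value E_joint[T] - log E_product[exp(T)] for discrete tables."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float((joint * T).sum() - np.log(((px * py) * np.exp(T)).sum()))

joint = np.array([[0.45, 0.05], [0.05, 0.45]])
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
mi = float((joint * np.log(joint / (px * py))).sum())   # exact mutual information

rng = np.random.default_rng(0)
random_T = rng.normal(size=joint.shape)        # an arbitrary, non-optimal function
optimal_T = np.log(joint / (px * py)) + 3.0    # log density ratio plus an arbitrary constant
bound_random = dv_bound(joint, random_T)
bound_optimal = dv_bound(joint, optimal_T)
print(mi, bound_random, bound_optimal)
```

An arbitrary function gives a value no larger than the mutual information, while the log density ratio, shifted by any constant, attains it. The additive constant cancels between the two terms of the bound.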

3 Information Theoretic View

We now provide a coherent information-theoretic view, connecting log-softmax with conditional mutual information of neural network output vectors and labels, conditioned on an input datum.

We first introduce notation. As defined in Section 2.1.1, the log-softmax corresponding to the input datum $x$ and label $y$ is

$$\log \sigma\left(n_\varphi(x)\right)_{y} = \log \frac{\exp\left(n_\varphi(x)_{y}\right)}{\sum_{y'=1}^{M} \exp\left(n_\varphi(x)_{y'}\right)}, \tag{9}$$

where $n_\varphi$ is a neural network with parameter $\varphi$ and $M$ is the number of classes.

We generalise this neural network from outputting a single result to producing a conditional Dirac delta distribution whose probability density function (PDF) is defined as

$$p(z \mid x) = \delta\left(z - n_\varphi(x)\right). \tag{10}$$

That is, instead of regarding $z$ as a direct output of the neural network $n_\varphi$ for the input $x$, we consider it as a sample from a conditional distribution, formally written as $z \sim p(z \mid x)$. In other words, the output of the neural network given $x$ becomes a conditional distribution. This more general definition does not impose any assumptions on conventional implementations, but makes our theoretical framework more widely applicable. Under this general form, we have

$$\log \sigma(z)_{y} = \log \frac{\exp(z_{y})}{\sum_{y'=1}^{M} \exp(z_{y'})}, \quad z \sim p(z \mid x), \tag{11}$$

where $y$ is the label corresponding to the input datum $x$.

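We denote the conditional mutual information between the neural network output $z$ and the class label $y$ conditioned on $x$ as $I(Z; Y \mid X)$, and we define a function $f_\varphi$ such that

$$f_\varphi(z, y) = z_{y}, \quad z \sim p(z \mid x), \tag{12}$$

which depicts relationships among the neural network $n_\varphi$, the input datum $x$, and the label $y$.

Furthermore, we denote the distribution of $x$ as $P_X$, and the conditional categorical distribution of $y$ given the input datum $x$ as $P_{Y|X}$. Then, based on Eq. 12, we can express the DV lower bound of the conditional mutual information between the neural network output $z$ and the class label $y$ conditioned on $x$ as

$$I(Z; Y \mid X) \geq \mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[e^{f_\varphi(z, y)}\right] \right]. \tag{13}$$

We are interested in the expectation of log-softmax, i.e., Eq. 11, under the distribution of $x$, which can be written formally as

$$\mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[ \log \frac{\exp(z_{y})}{\sum_{y'=1}^{M} \exp(z_{y'})} \right] \right], \tag{14}$$

where both $z$ and $y$ correspond to the same input datum $x$ as in Eq. 13. That is, they form positive sampling pairs.

For simplicity, we abbreviate the right-hand side of Eq. 13 as

$$\mathrm{DV}_\varphi(Z; Y \mid X) = \mathbb{E}_{X}\left[ \mathbb{E}_{ZY|X}\left[f_\varphi\right] - \log \mathbb{E}_{Z|X \otimes Y|X}\left[e^{f_\varphi}\right] \right], \tag{15}$$

and abbreviate Eq. 14 as

$$\mathbb{E}\left[\log \sigma(z)_{y}\right]. \tag{16}$$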

3.1 Relationships Between Log-Softmax and Conditional Mutual Information

3.1.1 Derivation

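We now derive the relationship between the expectation of log-softmax, i.e., Eq. 16, and the DV representation of the lower bound of conditional mutual information, i.e., Eq. 15. We show that, except for a constant $\log M$, where $M$ is the number of classes, they are equivalent. Throughout, $z$ denotes the network output, $y$ its label, and $f_\varphi(z, y)$ the function defined in Eq. 12.

$$\mathbb{E}_{P_X} \mathbb{E}_{P_{ZY|X}}\left[ \log \frac{\exp(z_{y})}{\sum_{y'=1}^{M} \exp(z_{y'})} \right] = \mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[z_{y}\right] - \mathbb{E}_{P_{Z|X}}\left[ \log \sum_{y'=1}^{M} \exp(z_{y'}) \right] \right]. \tag{17}$$

Since the constant $M$ is the number of classes, we further have

$$= \mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[z_{y}\right] - \mathbb{E}_{P_{Z|X}}\left[ \log \frac{1}{M} \sum_{y'=1}^{M} \exp(z_{y'}) \right] \right] - \log M. \tag{18}$$

Taking the definition of $f_\varphi$ in Eq. 12, we can rewrite Eq. 18 as

$$\mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \mathbb{E}_{P_{Z|X}}\left[ \log \frac{1}{M} \sum_{y'=1}^{M} e^{f_\varphi(z, y')} \right] \right] - \log M. \tag{19}$$

Furthermore, the term $\frac{1}{M} \sum_{y'=1}^{M} e^{f_\varphi(z, y')}$ in Eq. 19 can be viewed as an expectation of $e^{f_\varphi}$ under the conditional distribution $P_{Y|X}$, which holds when $P_{Y|X}$ places equal mass $1/M$ on each class. Formally,

$$\frac{1}{M} \sum_{y'=1}^{M} e^{f_\varphi(z, y')} = \mathbb{E}_{y' \sim P_{Y|X}}\left[ e^{f_\varphi(z, y')} \right]. \tag{20}$$

We abbreviate $\mathbb{E}_{y' \sim P_{Y|X}}$ as $\mathbb{E}_{P_{Y|X}}$. Using Eq. 20, we can now express Eq. 19 as

$$\mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \mathbb{E}_{P_{Z|X}}\left[ \log \mathbb{E}_{P_{Y|X}}\left[ e^{f_\varphi(z, y')} \right] \right] \right] - \log M. \tag{21}$$

Since the variable $z$ is sampled from a Dirac delta distribution, we have

$$\mathbb{E}_{P_{Z|X}}\left[ \log \mathbb{E}_{P_{Y|X}}\left[ e^{f_\varphi(z, y')} \right] \right] = \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[ e^{f_\varphi(z, y')} \right]. \tag{22}$$

Combining Eqs. 21 and 22, as well as Eqs. 17, 18, 19 and 20, we obtain

$$\mathbb{E}\left[\log \sigma(z)_{y}\right] = \mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[ e^{f_\varphi(z, y)} \right] \right] - \log M. \tag{23}$$

Thereby, we derive an equivalence between the DV representation of conditional mutual information and the expectation of log-softmax over the entire dataset. Consider the previously defined objective in Eq. 3, where we aim to maximise $\mathcal{J}$. This objective is equivalent to maximising the DV representation of the lower bound for the conditional mutual information $I(Z; Y \mid X)$.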
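The constant offset between the two quantities can be checked numerically. The sketch below is illustrative only and rests on assumptions made here: the network outputs are random stand-ins, labels are zero-based, and the candidate labels in the second DV term are weighted uniformly over the classes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 4                                   # hypothetical dataset size and class count
z = rng.normal(size=(N, M))                   # deterministic outputs, one row per datum
y = rng.integers(0, M, size=N)                # zero-based labels

f_pos = z[np.arange(N), y]                    # true-class score on each positive pair
log_sum_exp = np.log(np.exp(z).sum(axis=1))

# Expected log-softmax over the dataset
expected_log_softmax = (f_pos - log_sum_exp).mean()

# DV-style value: first term on positive pairs, second term averaging
# exp(score) uniformly over the M candidate classes for each datum
dv_value = (f_pos - np.log(np.exp(z).mean(axis=1))).mean()

print(expected_log_softmax, dv_value - np.log(M))
```

The two printed numbers agree up to floating-point error, since replacing the sum over classes by a uniform average shifts the logarithm by exactly the logarithm of the class count.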

3.1.2 Informative Flow of Training with Log-Softmax

(a) Feed-Forward
(b) Back-Propagation
Figure 1: Probabilistic graphical models for visualising information flow of training with logarithm softmax.

We employ probabilistic graphical models to help explain the information flow during training with log-softmax. We consider feed-forward and back-propagation separately. In the feed-forward stage, shown in Figure 1(a), the neural network output vector $z$ and the label $y$ are conditionally independent given the input datum $x$. We feed information about $y$ to $z$ through back-propagation. During back-propagation, shown in Figure 1(b), the objective $\mathcal{J}$ is given. Therefore, the conditional independence between the network output vector $z$ and the label $y$ is broken. That is, $z \not\perp y \mid x, \mathcal{J}$. As a consequence, information can be passed from the label $y$ to the network output vector $z$.

3.2 Generalisation to Neural Networks with Stochastic Outputs

In Section 3.1.1, we derived the theoretical relationship between log-softmax and the DV lower bound of conditional mutual information by assuming the distribution $p(z \mid x)$ to be a Dirac delta distribution. In this section, we relax this assumption to derive a more general relationship between log-softmax and the conditional mutual information of stochastic neural networks, i.e., networks whose outputs can be stochastic.

3.2.1 Lower Bound of Log-Softmax in Stochastic Neural Networks

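Unlike in Eq. 22, where we can exchange the expectation of the logarithm for the logarithm of the expectation without losing equality, here we need to apply Jensen's inequality and obtain an inequality. Formally,

$$\mathbb{E}_{P_{Z|X}}\left[ \log \mathbb{E}_{P_{Y|X}}\left[ e^{f_\varphi(z, y)} \right] \right] \leq \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[ e^{f_\varphi(z, y)} \right]. \tag{24}$$

As a consequence, Eq. 23 now becomes

$$\mathbb{E}\left[\log \sigma(z)_{y}\right] \geq \mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[ e^{f_\varphi(z, y)} \right] \right] - \log M. \tag{25}$$

That is, the DV representation on the right-hand side of Eq. 25, minus $\log M$ with $M$ the number of classes, acts as a lower bound of the expected log-softmax $\mathbb{E}\left[\log \sigma(z)_{y}\right]$.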

3.2.2 Upper Bound of Log-Softmax in Stochastic Neural Networks

In Section 2.1.1, we showed that softmax returns a value in the range $(0, 1)$. Therefore, its logarithm is a negative number, so that the expected log-softmax $\mathbb{E}\left[\log \sigma(z)_{y}\right]$ is also negative. In contrast, by definition, mutual information is non-negative due to its equivalence with a KL divergence. Thus, we have

$$\mathbb{E}\left[\log \sigma(z)_{y}\right] < 0 \leq I(Z; Y \mid X). \tag{26}$$

That is, the conditional mutual information $I(Z; Y \mid X)$ acts as an upper bound of the expected log-softmax $\mathbb{E}\left[\log \sigma(z)_{y}\right]$.

3.2.3 A Sandwich Form of Log-Softmax in Stochastic Neural Networks

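Combining Eqs. 25 and 26, we have the sandwich form of the log-softmax expectation in stochastic neural networks

$$\mathbb{E}_{P_X}\left[ \mathbb{E}_{P_{ZY|X}}\left[f_\varphi(z, y)\right] - \log \mathbb{E}_{P_{Z|X} \otimes P_{Y|X}}\left[ e^{f_\varphi(z, y)} \right] \right] - \log M \leq \mathbb{E}\left[\log \sigma(z)_{y}\right] \leq I(Z; Y \mid X). \tag{27}$$

When $f_\varphi$ attains its optimum $f_{\varphi^{*}}$, the DV representation equals the conditional mutual information and the sandwich form of Eq. 27 becomes

$$I(Z; Y \mid X) - \log M \leq \mathbb{E}\left[\log \sigma(z)_{y}\right] \leq I(Z; Y \mid X). \tag{28}$$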

3.2.4 Interpretation of Information-Theoretic View of Log-Softmax in Stochastic Neural Networks

Optimising stochastic neural networks with log-softmax as the objective can diverge from maximising the conditional mutual information between neural network outputs and labels given the input datum. This is because, as the inequality in Eq. 25 indicates, maximising the expectation of log-softmax does not guarantee that its lower bound, the DV representation of the conditional mutual information (minus a constant), also increases, even though both share the same parameter $\varphi$. Thus, training with log-softmax can cause stochastic neural networks to converge to a point that is not optimal from the information perspective, i.e., one at which the network output vector $z$ does not carry the most information about the label $y$.
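The gap introduced by Jensen's inequality in Section 3.2.1 can be illustrated numerically. The sketch below is an assumption-laden illustration, not the authors' experiment: stochastic outputs are modelled as a hypothetical mean output plus Gaussian noise, and the inner average over classes is uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 4, 1000                                # classes and number of sampled outputs (hypothetical)
mean_output = rng.normal(size=M)              # stand-in for the mean network output of one datum

def both_sides(noise_scale):
    """Return (E_z[log mean_y exp z_y], log E_z[mean_y exp z_y]) over sampled outputs z."""
    z = mean_output + noise_scale * rng.normal(size=(S, M))
    inner = np.exp(z).mean(axis=1)            # uniform average over classes, one value per sample
    return np.log(inner).mean(), np.log(inner.mean())

det_lhs, det_rhs = both_sides(0.0)            # Dirac delta case: every sample is identical
sto_lhs, sto_rhs = both_sides(1.0)            # stochastic case: samples differ
print(det_rhs - det_lhs, sto_rhs - sto_lhs)
```

Without noise the two sides coincide, as in the deterministic derivation. With noise the expectation of the logarithm falls strictly below the logarithm of the expectation, which is the slack that separates the expected log-softmax from its lower bound.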

4 Application Example: Information Masking (Info-Masking)

Our information-theoretic view of neural networks can be exploited for visualising and understanding convolutional neural networks. Here we present an example of how to employ it for object masking, that is, keeping only the objects relating to a particular class and masking all other, irrelevant objects. Since we leverage our information-theoretic view, we name the approach information masking (info-masking).

Specifically, our dataset consists of images, each comprising one or two MNIST digits [13], appearing on either the left or the right. Our classification network aims to predict whether an image contains a digit with the target label. We assume we have trained a neural network embedding log-softmax that can accurately predict whether an image contains the target digit (the positive class) or not (the negative class).

We exploit the following proposition from [16] and [1].

Proposition 1

The optimal function in the DV representation evaluates the point-wise conditional mutual information plus an additional constant.

Due to Proposition 1, we can consider $f_\varphi(z, y)$, i.e., the element $n_\varphi(x)_{y}$ of the network output, as an approximation, up to an additive constant, of the point-wise conditional mutual information between the network output vector $z$ and the class $y$ of the target digit, i.e., of how strongly the network output vector indicates that the target digit is present.

We now explain the info-masking approach. Given an input image $x$, we split this image into small patches and compute, for each patch, the approximate point-wise mutual information between the learned feature and the target class $y$, i.e., $f_\varphi(z^{(k)}, y)$ or $n_\varphi(x^{(k)})_{y}$, where $x^{(k)}$ is the $k$-th small patch of the image $x$ and $z^{(k)}$ is the corresponding network output. Afterwards, following [18], we use a simple thresholding technique to keep only the patches whose mutual information is above a given fraction of the maximum value over all patches. That is, we mask the regions of the image that contain too little information about the target, which is also the reason for naming this approach information masking (info-masking).
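The patch-wise thresholding described above can be sketched as follows. This is a hypothetical illustration rather than the authors' implementation: the names `info_mask`, `score_fn`, `patch` and `fraction`, the patch size, the threshold fraction, and the toy image are all assumptions made here, and mean pixel intensity stands in for the network-based score.

```python
import numpy as np

def info_mask(image, score_fn, patch=7, fraction=0.5):
    """Zero out patches whose score is below `fraction` of the maximum patch score."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    scores = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            scores[r, c] = score_fn(block)              # one score per patch
    keep = scores >= fraction * scores.max()            # relative thresholding
    pixel_mask = np.kron(keep.astype(float), np.ones((patch, patch)))
    return image[:rows * patch, :cols * patch] * pixel_mask, keep

image = np.zeros((14, 28))
image[:7, :7] = 1.0        # toy region standing in for the target object
image[7:, 21:] = 0.2       # toy region standing in for an irrelevant object
masked, keep = info_mask(image, score_fn=lambda block: block.mean())
print(keep)
```

In an actual application `score_fn` would evaluate the trained network on each patch and return the score of the target class. The relative threshold used here presumes that the maximum score is positive, so scores that can be negative would need to be shifted first.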

Figure 2 demonstrates the performance of our info-masking approach, where Figure 2(a) is the original image and Figure 2(b) is the processed outcome.

(a) Original Image
(b) Filtering Out Digit 0
Figure 2: Demonstration of info-masking performance, where we target to filter out all the digit 0 objects.

5 Related Work

The outputs of a multi-output classification network normally do not satisfy the axioms of probability, i.e., that probabilities are non-negative and sum to one [5].

5.1 Probability Transformation

The softmax non-linearity was originally proposed to bridge stochastic model paradigms and feed-forward non-linear "neural" networks, in work that also referred to the softmax non-linearity as the normalised exponential [3]. In stochastic models, the method maximises the class-conditional probabilities based on Bayes' theorem. Formally,

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{\sum_{c'} P(x \mid c')\, P(c')}. \tag{29}$$

In contrast to the stochastic model method, a standard "multi-layer perceptron" (MLP) neural network is trained to minimise the squared error between the predicted and true targets, by updating its weights through back-propagation [9]. Formally,

$$E = \sum_{i} \left\| n_\theta(x_i) - t_i \right\|^{2}, \tag{30}$$

where $E$ represents "error", $n_\theta$ is a network and $\theta$ is its parameters, and $t_i$ stands for the true targets. However, the training method in Eq. 30 is intrinsically prone to overfitting. As a consequence, the softmax non-linearity was introduced to map neural network outputs so that they follow the axioms of probability. Then, the negative logarithm of softmax can be further considered as the cross-entropy between a "true" distribution and an estimated distribution.

5.2 Winner-Take-All Mapping

Softmax generalises maximum picking [8]. It is a smooth version of the winner-take-all activation, in the sense that outputs change smoothly and equal inputs produce equal outputs [3]. Moreover, the exponential enhances the dominance of the largest value, so that as the differences between inputs grow, the largest transformed value approaches $1$ while all the other transformed values approach zero [2]. The work in [7] also shows an equivalence between the "Winner-Take-All" network and softmax.
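The winner-take-all behaviour can be seen by scaling the inputs before applying softmax. The sketch below is illustrative only: the scaling factors and the example scores are assumptions made here.

```python
import numpy as np

def softmax(z):
    """Normalised exponential of a 1-D score vector, computed stably."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.0, 2.0, 2.0, 3.0])        # hypothetical scores with one tie
for scale in (1.0, 10.0, 100.0):               # larger scale widens the gaps between inputs
    print(scale, np.round(softmax(scale * scores), 4))

soft = softmax(scores)
sharp = softmax(100.0 * scores)
```

The two tied inputs always receive equal outputs, and as the scale grows the output for the largest input tends to one while the others tend to zero, approximating a hard maximum.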

6 Anomalies with Differential Entropy

The authors are aware that the conditional mutual information $I(Z; Y \mid X)$ is zero when the distribution $p(z \mid x)$ is a Dirac delta distribution. Therefore, the DV representation acts as a lower bound of a conditional mutual information whose value is 0. This seems problematic since even under the optimal DV representation, i.e., when the lower bound is at its highest, $z$ seemingly still has no information about $y$, and most of the time this DV representation is negative. Nonetheless, our information-theoretic view is built upon differential entropy, which loses fundamental associations with discrete entropy. We cannot interpret its values with the interpretations of discrete entropy. Therefore, we are not concerned about the negativity of the DV representation and the zero-information problem, since the concerns mentioned above are based on intuitions from discrete information theory.

7 Acknowledgements

This is a preprint. The authors welcome and are grateful for any criticism and suggestions. Please feel free to contact the authors if you have any concerns.


References

  • [1] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
  • [2] C. M. Bishop et al. (1995) Neural networks for pattern recognition. Oxford University Press.
  • [3] J. S. Bridle (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236.
  • [4] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons.
  • [5] J. S. Denker and Y. LeCun (1991) Transforming neural-net output levels to probability distributions. In Advances in neural information processing systems, pp. 853–859.
  • [6] M. D. Donsker and S. S. Varadhan (1975) Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics 28 (1), pp. 1–47.
  • [7] I. M. Elfadel and J. L. Wyatt Jr (1994) The "softmax" nonlinearity: derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In Advances in neural information processing systems, pp. 882–887.
  • [8] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • [9] G. E. Hinton (1990) Connectionist learning procedures. In Machine learning, pp. 555–610.
  • [10] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  • [12] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436.
  • [13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [14] F. Li, A. Karpathy, and J. Johnson (2015) CS231n: convolutional neural networks for visual recognition. University Lecture.
  • [15] M. T. McCann, K. H. Jin, and M. Unser (2017) Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Processing Magazine 34 (6), pp. 85–95.
  • [16] S. Mukherjee, H. Asnani, and S. Kannan (2019) CCMI: classifier based conditional mutual information estimation. arXiv preprint arXiv:1906.01824.
  • [17] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. arXiv preprint arXiv:1905.06922.
  • [18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.