Sigsoftmax: Reanalysis of the Softmax Bottleneck

Sekitoshi Kanai
NTT Software Innovation Center, Keio Univ.
kanai.sekitoshi@lab.ntt.co.jp
Yasuhiro Fujiwara
NTT Software Innovation Center
fujiwara.yasuhiro@lab.ntt.co.jp
Yuki Yamanaka
NTT Secure Platform Laboratories
yamanaka.yuki@lab.ntt.co.jp
Shuichi Adachi
Keio Univ.
adachi.shuichi@appi.keio.ac.jp
Abstract

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of the representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and a sigmoid function. Sigsoftmax can break the softmax bottleneck. Experiments on language modeling demonstrate that sigsoftmax and the mixture of sigsoftmax outperform softmax and the mixture of softmax, respectively.

1 Introduction

Deep neural networks are used in many recent applications such as image recognition Krizhevsky et al. (2012); He et al. (2016), speech recognition Graves et al. (2013), and natural language processing Mikolov (2012); Sutskever et al. (2014); Cho et al. (2014). High representational capacity and generalization performance of deep neural networks are achieved by many layers, activation functions and regularization methods Nair and Hinton (2010); He et al. (2016); Srivastava et al. (2014); Ioffe and Szegedy (2015); Goodfellow et al. (2016). Although various model architectures are built in the above applications, softmax is commonly used as an output activation function for modeling categorical probability distributions Bridle (1990b); Goodfellow et al. (2016); He et al. (2016); Mikolov (2012); Sutskever et al. (2014); Cho et al. (2014); Graves et al. (2013). For example, in language modeling, softmax is employed for representing the probability of the next word over the vocabulary in a sentence. When using softmax, we train the model by minimizing negative log-likelihood with a gradient-based optimization method. We can easily calculate the gradient of negative log-likelihood with softmax, and it is numerically stable Bridle (1990a, b).

Even though softmax is widely used, few studies have attempted to improve its modeling performance Chen et al. (2017); de Brébisson and Vincent (2016). This is because deep neural networks with softmax are believed to have a universal approximation property. However, Yang et al. (2018) recently revealed that softmax can be a bottleneck of representational capacity in language modeling. They showed that the representational capacity of the softmax-based model is restricted by the length of the hidden vector in the output layer. In language modeling, the length of the hidden vector is much smaller than the vocabulary size. As a result, the softmax-based model cannot completely learn the true probability distribution, and this is called the softmax bottleneck. For breaking the softmax bottleneck, Yang et al. (2018) proposed mixture of softmax (MoS) that mixes the multiple softmax outputs. However, this analysis of softmax does not explicitly show why softmax can be a bottleneck. Furthermore, MoS is an additional layer or mixture model rather than an alternative activation function to softmax: MoS has learnable parameters and hyper-parameters.

In this paper, we propose a novel output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the point of view of the output set (range) of a function and show why softmax can be a bottleneck. This paper reveals that (i) the softmax bottleneck occurs because softmax uses only exponential functions for nonlinearity and (ii) the range of log-softmax is a subset of the vector space whose dimension depends on the dimension of the input space. As an alternative activation function to softmax, we explore the output functions composed of rectified linear unit (ReLU) and sigmoid functions. In addition, we propose sigsoftmax, which is composed of a multiplication of an exponential function and sigmoid function. Sigsoftmax has desirable properties for output activation functions, e.g., the calculation of its gradient is numerically stable. More importantly, sigsoftmax can break the softmax bottleneck, and the range of softmax can be a subset of that of sigsoftmax. Experiments in language modeling demonstrate that sigsoftmax can break the softmax bottleneck and outperform softmax. In addition, mixture of sigsoftmax outperforms MoS.

2 Preliminaries

2.1 Softmax

Deep neural networks use softmax in learning categorical distributions. For example, in classification, a neural network uses softmax to learn the probability distribution over classes conditioned on the input $\boldsymbol{x}$ as $P_{\theta}(y \mid \boldsymbol{x})$, where $\theta$ is a parameter. Let $\boldsymbol{h} \in \mathbb{R}^{d}$ be a hidden vector and $W \in \mathbb{R}^{M \times d}$ be a weight matrix in the output layer, and let $\boldsymbol{z} = W\boldsymbol{h}$. The output of softmax $\sigma_{\mathrm{s}}(\boldsymbol{z})$ represents the conditional probability of the $i$-th class as follows:

$[\sigma_{\mathrm{s}}(\boldsymbol{z})]_i = \frac{\exp(z_i)}{\sum_{m=1}^{M} \exp(z_m)}, \qquad (1)$

where $[\sigma_{\mathrm{s}}(\boldsymbol{z})]_i$ represents the $i$-th element of $\sigma_{\mathrm{s}}(\boldsymbol{z})$. Each element of $\sigma_{\mathrm{s}}(\boldsymbol{z})$ is bounded between zero and one since the output of the exponential function is non-negative in Eq. (1), and the summation of all elements of $\sigma_{\mathrm{s}}(\boldsymbol{z})$ is obviously one. From these properties, we can regard the output of softmax trained by minimizing negative log-likelihood as a probability Bridle (1990b); Memisevic et al. (2010). If we only need the most likely label, we can find it by comparing the elements of $\boldsymbol{z}$ without computing softmax once we have trained the softmax-based model. This is because the exponential functions in softmax are monotonically increasing.
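For reference, a minimal NumPy sketch of Eq. (1) (ours, not part of the original implementation): the max-shift is the standard trick for avoiding overflow, and the final check illustrates that comparing the elements of $\boldsymbol{z}$ already identifies the most likely label.

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (1); subtracting max(z) leaves the output unchanged
    (shift invariance) but prevents overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)
print(p, p.sum())                    # elements in (0, 1) that sum to 1
print(np.argmax(z) == np.argmax(p))  # True: exp is monotonically increasing
```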

To train softmax-based models, negative log-likelihood (cross entropy) is used as the loss function. Since the loss function is minimized by stochastic gradient descent (SGD), the properties of the gradients of functions are very important Nair and Hinton (2010); Pascanu et al. (2013); Glorot and Bengio (2010). One advantage of softmax is that the gradient of log-softmax is easily calculated as follows Bridle (1990a, b); Bishop (1995); de Brébisson and Vincent (2016):

$\frac{\partial \log [\sigma_{\mathrm{s}}(\boldsymbol{z})]_i}{\partial z_j} = \delta_{ij} - [\sigma_{\mathrm{s}}(\boldsymbol{z})]_j, \qquad (2)$

where $\delta_{ij}$ is the Kronecker delta. Whereas the derivative of the logarithm, $\mathrm{d}\log(x)/\mathrm{d}x = 1/x$, can cause a division by zero as $x$ approaches zero, the derivative of log-softmax cannot. As a result, softmax is numerically stable.
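The identity in Eq. (2) can be checked numerically. The following sketch (ours) compares the analytic gradient of log-softmax with a central finite difference; note that it involves no division by a quantity that can vanish:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_log_softmax(z, i):
    """Gradient of log[softmax(z)]_i w.r.t. z, i.e. Eq. (2): e_i - softmax(z)."""
    g = -softmax(z)
    g[i] += 1.0
    return g

z = np.array([1.5, -0.3, 0.7, 2.2])
i, eps = 2, 1e-6
numeric = np.array([(np.log(softmax(z + eps * np.eye(4)[j])[i])
                     - np.log(softmax(z - eps * np.eye(4)[j])[i])) / (2 * eps)
                    for j in range(4)])
print(np.allclose(numeric, grad_log_softmax(z, i), atol=1e-5))  # True
```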

2.2 Softmax bottleneck

In recurrent neural network (RNN) language modeling, given a corpus of tokens $(X_1, \dots, X_T)$, the joint probability is factorized as $P(X_1, \dots, X_T) = \prod_t P(X_t \mid C_t)$, where $C_t = X_{<t}$ is referred to as the context of the conditional probability. The output of softmax learns $P(X_t \mid C_t)$ as $P_{\theta}(x \mid c) = \frac{\exp(\boldsymbol{h}_c^{\mathrm{T}} \boldsymbol{w}_x)}{\sum_{x'} \exp(\boldsymbol{h}_c^{\mathrm{T}} \boldsymbol{w}_{x'})}$, where (a) $\boldsymbol{h}_c \in \mathbb{R}^{d}$ is the hidden vector corresponding to the context $c$ and (b) $W_{\theta}$, whose $x$-th row $\boldsymbol{w}_x$ is the embedding of token $x$, is a weight matrix in the output layer (embedding layer). A natural language is assumed to be a finite set of pairs of a context and its conditional distribution, $\mathcal{L} = \{(c_1, P^{*}(X \mid c_1)), \dots, (c_N, P^{*}(X \mid c_N))\}$, where $N$ is the number of possible contexts. The objective of language modeling is to learn a model distribution $P_{\theta}(X \mid C)$ parameterized by $\theta$ to match the true data distribution $P^{*}(X \mid C)$. Note that upper- and lower-case letters are used for variables and constants, respectively, in this section. Under the above assumptions, let $x_1, \dots, x_M$ be the possible tokens in the language $\mathcal{L}$; the previous study of Yang et al. (2018) considers the following three matrices:

$H_{\theta} = \begin{bmatrix} \boldsymbol{h}_{c_1}^{\mathrm{T}} \\ \vdots \\ \boldsymbol{h}_{c_N}^{\mathrm{T}} \end{bmatrix}, \quad W_{\theta} = \begin{bmatrix} \boldsymbol{w}_{x_1}^{\mathrm{T}} \\ \vdots \\ \boldsymbol{w}_{x_M}^{\mathrm{T}} \end{bmatrix}, \quad A = \begin{bmatrix} \log P^{*}(x_1 \mid c_1) & \cdots & \log P^{*}(x_M \mid c_1) \\ \vdots & \ddots & \vdots \\ \log P^{*}(x_1 \mid c_N) & \cdots & \log P^{*}(x_M \mid c_N) \end{bmatrix}. \qquad (3)$

$H_{\theta} \in \mathbb{R}^{N \times d}$ is a matrix composed of the hidden vectors, $W_{\theta} \in \mathbb{R}^{M \times d}$ is a weight matrix, and $A \in \mathbb{R}^{N \times M}$ is a matrix composed of the log probabilities of the true distribution. By using these matrices, the rank of $H_{\theta} W_{\theta}^{\mathrm{T}}$ should be greater than or equal to $\mathrm{rank}(A) - 1$ so that the softmax-based model completely learns $P^{*}$ Yang et al. (2018). However, the rank of $H_{\theta} W_{\theta}^{\mathrm{T}}$ is at most $d$, whichever function families are used for $H_{\theta}$ and $W_{\theta}$. Therefore, if we have $d < \mathrm{rank}(A) - 1$, softmax can be the bottleneck of representational capacity, as shown in the following theorem:

Theorem 1 (Softmax Bottleneck (Yang et al., 2018)).

If $d < \mathrm{rank}(A) - 1$, for any function family $\mathcal{U}$ and any model parameter $\theta$, there exists a context $c$ in $\mathcal{L}$ such that $P_{\theta}(X \mid c) \neq P^{*}(X \mid c)$.

This theorem shows that the length $d$ of the hidden vector in the output layer determines the representational power of an RNN with softmax. In language modeling, the rank of $A$ can be extremely high since contexts can vary widely and the vocabulary size $M$ is much larger than $d$. Therefore, softmax can be the bottleneck of the representational power.
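The rank constraint itself is easy to observe numerically. A toy sketch (ours, with made-up sizes): however the hidden vectors are produced, the logit matrix $H_{\theta} W_{\theta}^{\mathrm{T}}$ has rank at most $d$.

```python
import numpy as np

N, M, d = 500, 1000, 32              # contexts, vocabulary size, hidden size
H = np.tanh(np.random.randn(N, d))   # any nonlinear encoder output works here
W = np.random.randn(M, d)            # output (embedding) weight matrix
A_hat = H @ W.T                      # model log-probabilities up to row-wise shifts
print(np.linalg.matrix_rank(A_hat))  # at most d = 32, far below a high-rank true A
```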

2.3 Mixture of softmax

A simple approach to improving the representational capacity is to use a weighted sum of several models. In fact, Yang et al. (2018) use this approach for breaking the softmax bottleneck. As an alternative to softmax, they propose the mixture of softmax (MoS), which is a weighted sum of softmax functions:

$P_{\theta}(x \mid c) = \sum_{k=1}^{K} \pi_{c,k} \frac{\exp(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_x)}{\sum_{x'} \exp(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_{x'})}, \qquad (4)$

where $\pi_{c,k}$ is the prior or mixture weight of the $k$-th component, and $\boldsymbol{h}_{c,k}$ is the $k$-th context vector associated with the context $c$. Let $\boldsymbol{g}_c$ be the input of MoS for the context $c$. The priors and context vectors are parameterized as $\pi_{c,k} = \frac{\exp(\boldsymbol{w}_{\pi,k}^{\mathrm{T}} \boldsymbol{g}_c)}{\sum_{k'} \exp(\boldsymbol{w}_{\pi,k'}^{\mathrm{T}} \boldsymbol{g}_c)}$ and $\boldsymbol{h}_{c,k} = \tanh(W_{h,k} \boldsymbol{g}_c)$, respectively. MoS can break the softmax bottleneck since the rank of the approximated $A$ can be arbitrarily large Yang et al. (2018). Therefore, language modeling with MoS performs better than that with softmax. However, in this method, the number of mixtures $K$ is a hyper-parameter that needs to be tuned. In addition, the weights $\boldsymbol{w}_{\pi,k}$ and $W_{h,k}$ are additional parameters. Thus, MoS can be regarded as an additional layer or a mixing technique rather than an improvement of the activation function.
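A minimal sketch of Eq. (4) (ours, reflecting our reading of the MoS parameterization in Yang et al. (2018); the variable names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mos(g_c, W_prior, W_ctx, W_emb, K):
    """Mixture of softmax, Eq. (4): K mixture weights and K context vectors
    are computed from the input g_c, and K softmax distributions are averaged."""
    pi = softmax(W_prior @ g_c)              # mixture weights pi_{c,k}
    H = np.tanh(W_ctx @ g_c).reshape(K, -1)  # context vectors h_{c,k}
    return sum(pi[k] * softmax(W_emb @ H[k]) for k in range(K))

d, M, K = 8, 20, 3
g_c = np.random.randn(d)
p = mos(g_c, np.random.randn(K, d), np.random.randn(K * d, d), np.random.randn(M, d), K)
print(p.sum())  # 1.0 (up to floating point): a convex combination of probability vectors
```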

2.4 Related work

Previous studies have proposed alternative functions to softmax de Brébisson and Vincent (2016); Mohassel and Zhang (2017); Ollivier (2013). The study of de Brébisson and Vincent (2016) explored spherical-family functions: the spherical softmax and the Taylor softmax. They showed that these functions do not outperform softmax when the length of the output vector is large. In addition, the spherical softmax has a hyper-parameter that should be carefully tuned for reasons of numerical stability de Brébisson and Vincent (2016). On the other hand, the Taylor softmax might suffer from the softmax bottleneck since it approximates softmax. Mohassel and Zhang (2017) proposed a ReLU-based alternative to softmax for privacy-preserving machine learning, since softmax is expensive to compute inside a secure computation. However, it can lead to a division by zero because all ReLU outputs frequently become zero, making the denominator of the normalization zero. Several studies improved the efficiency of softmax Grave et al. (2017); Shim et al. (2017); Titsias (2016); Martins and Astudillo (2016). However, they did not improve the representational capacity.

3 Proposed method

3.1 Reanalysis of the softmax bottleneck

The analysis of the softmax bottleneck Yang et al. (2018) is based on matrix factorization and reveals that the rank of $H_{\theta} W_{\theta}^{\mathrm{T}}$ needs to be greater than or equal to $\mathrm{rank}(A) - 1$. Since the rank of $H_{\theta} W_{\theta}^{\mathrm{T}}$ is at most the length $d$ of the hidden vector in the output layer, the length of the hidden vector determines the representational power, as described in Sec. 2.2. However, this analysis does not explicitly reveal the cause of the softmax bottleneck. To identify the cause, we re-analyze the softmax bottleneck from the perspective of the range of log-softmax, because this range should be large enough to approximate the true log probabilities.

Log-softmax is the logarithm of softmax and is used in the training of deep learning, as mentioned in Sec. 2.1. By using the notation in Sec. 2.1, log-softmax can be represented as $\log(\sigma_{\mathrm{s}}(\boldsymbol{z}))$, where the logarithm is applied element-wise. This function can be expressed as

$\log(\sigma_{\mathrm{s}}(\boldsymbol{z})) = \boldsymbol{z} - \log\Big(\sum_{m=1}^{M} \exp(z_m)\Big)\,\boldsymbol{1}, \qquad (5)$

where $\boldsymbol{1}$ is the vector of all ones. To represent various log probability distributions, the range of $\log(\sigma_{\mathrm{s}}(\boldsymbol{z}))$ should be sufficiently large. Therefore, we investigate the range of $\log(\sigma_{\mathrm{s}}(\boldsymbol{z}))$. We assume that the hidden vector $\boldsymbol{h}$ in the output layer can be an arbitrary vector in $\mathbb{R}^{d}$ where $d < M$, and the weight matrix $W$ is a full-rank matrix; the rank of $W$ is $d$.[1] Under these assumptions, the input vector space of softmax, $Z = \{W\boldsymbol{h} \mid \boldsymbol{h} \in \mathbb{R}^{d}\}$, is a $d$ dimensional vector space, and we have the following theorem:

[1] If neural networks have the universal approximation property, $\boldsymbol{h}$ can be an arbitrary vector in $\mathbb{R}^{d}$. If not, the input space is a subset of a $d$ dimensional vector space, and the range of log-softmax is still a subset of a $(d+1)$ dimensional vector space. When the rank of $W$ is less than $d$, we can examine the range of log-softmax in the same way by replacing $d$ with the rank of $W$. If a bias is used in the output layer, the dimension of the input space can be $d+1$.

Theorem 2.

Let $Z$ be a $d$ dimensional vector space and $\boldsymbol{z} \in Z$ be the input of log-softmax. For every such input space, the range of log-softmax is a subset of a $(d+1)$ dimensional vector space.

Proof.

The input $\boldsymbol{z} = W\boldsymbol{h}$ of log-softmax can be represented by the singular vectors of $W$ since the rank of $W$ is $d$. In other words, the space of input vectors is spanned by $d$ basis vectors. Thus, the input vector space is represented as $Z = \{\sum_{j=1}^{d} a_j \boldsymbol{v}_j\}$, where $\boldsymbol{v}_j$ for $j = 1, \dots, d$ are linearly independent vectors and $a_j \in \mathbb{R}$ are their coefficients. From Eq. (5), by using $\boldsymbol{v}_j$ and $a_j$, the range of log-softmax becomes

$\log(\sigma_{\mathrm{s}}(\boldsymbol{z})) = \sum_{j=1}^{d} a_j \boldsymbol{v}_j + b\,\boldsymbol{1}, \qquad (6)$

where $b = -\log(\sum_{m=1}^{M} \exp(z_m))$. This is a linear combination of the linearly independent vectors $\boldsymbol{v}_1, \dots, \boldsymbol{v}_d$ and $\boldsymbol{1}$. Therefore, we have the following relation:

$\{\log(\sigma_{\mathrm{s}}(\boldsymbol{z})) \mid \boldsymbol{z} \in Z\} \subseteq \mathrm{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_d, \boldsymbol{1}), \qquad (7)$

where $\mathrm{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_d, \boldsymbol{1})$ is the vector space spanned by $\boldsymbol{v}_1, \dots, \boldsymbol{v}_d$ and $\boldsymbol{1}$. Let $V$ be the vector space $\mathrm{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_d, \boldsymbol{1})$; the dimension of $V$ becomes

$\dim(V) = \begin{cases} d + 1 & \text{if } \boldsymbol{1} \notin Z, \\ d & \text{if } \boldsymbol{1} \in Z. \end{cases} \qquad (8)$

We can see that $V$ is a $(d+1)$ or $d$ dimensional linear subspace of $\mathbb{R}^{M}$. From Eqs. (7) and (8), the output vectors of log-softmax exist in a $(d+1)$ dimensional vector space, which completes the proof. ∎

Theorem 2 shows that log-softmax has at most $d+1$ linearly independent output vectors, even if various inputs are applied to the model. Therefore, if the vectors of the true log probabilities contain more than $d+1$ linearly independent vectors, the softmax-based model cannot completely represent the true probabilities. We can prove Theorem 1 by using Theorem 2 as follows:

Proof.

If we have $d < \mathrm{rank}(A) - 1$, i.e., $d + 1 < \mathrm{rank}(A)$, the number of linearly independent row vectors of $A$ is larger than $d + 1$. On the other hand, the output vectors of the model cannot contain more than $d + 1$ linearly independent vectors by Theorem 2. Therefore, the softmax-based model cannot completely learn $P^{*}$, i.e., there exists a context $c$ in $\mathcal{L}$ such that $P_{\theta}(X \mid c) \neq P^{*}(X \mid c)$. ∎

The above analysis shows that the softmax bottleneck occurs because the output of log-softmax is a linear combination of the input $\boldsymbol{z}$ and the vector $\boldsymbol{1}$, as in Eq. (5). A linear combination of the input and one additional vector increases the number of linearly independent vectors by at most one, and as a result, the output vectors contain at most $d+1$ linearly independent vectors. The reason log-softmax becomes such a linear combination is that the logarithm of the exponential function is the identity, $\log(\exp(z)) = z$.

By contrast, the number of linearly independent output vectors of a nonlinear function can be much greater than the number of linearly independent input vectors. Therefore, if the exponential functions are replaced with other nonlinear functions, the logarithm of such functions can remain nonlinear and the softmax bottleneck can be broken without additional parameters.

Our analysis provides the new insight that the range of log-softmax is a subset of a lower dimensional vector space, whose dimension is closely related to the rank of the matrix in the factorization-based analysis. Furthermore, our analysis explicitly shows the cause of the softmax bottleneck.
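Theorem 2 is easy to verify numerically. In the following sketch (ours), no matter how many inputs are pushed through log-softmax, the stacked outputs never exceed rank $d+1$:

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

d, M, N = 10, 100, 1000                  # hidden size, output size, number of inputs
W = np.random.randn(M, d)                # full-rank output weight matrix
Z = np.random.randn(N, d) @ W.T          # inputs z = W h for N random hidden vectors
L = np.apply_along_axis(log_softmax, 1, Z)
print(np.linalg.matrix_rank(L))          # at most d + 1 = 11 (Theorem 2)
```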

3.2 Alternative functions to softmax and desirable properties

In the previous section, we explained that the softmax bottleneck can be broken by replacing the exponential functions with other nonlinear functions. In this section, we explain the desirable properties of an alternative function to softmax. We formulate a new output function $f$ as follows:

$[f(\boldsymbol{z})]_i = \frac{g(z_i)}{\sum_{m=1}^{M} g(z_m)}. \qquad (9)$

The new function is composed of a nonlinear function $g$ and a division for normalization, so that the summation of the elements is one (a short sketch of this template is given at the end of this subsection). As an alternative function to softmax, a new output function $f$ and its $g$ should have all of the following properties:

Nonlinearity of $\log(g(z))$

As mentioned in Secs. 2.2 and 3.1, softmax can be the bottleneck of the representational power because $\log(\exp(z))$ is linear in $z$. Provided that $\log(g(z))$ is a linear function of $z$, the range of $\log(f(\boldsymbol{z}))$ is a subset of a $(d+1)$ dimensional vector space. In order to break the softmax bottleneck, $\log(g(z))$ should be nonlinear.

Numerically stable

In the training of deep learning, we need to calculate the gradient for optimization. The derivative of the logarithm of $g(z)$ with respect to $z$ is

$\frac{\mathrm{d} \log(g(z))}{\mathrm{d} z} = \frac{g'(z)}{g(z)}, \qquad (10)$

where $g'$ denotes the derivative of $g$. We can see that this function has a division by $g(z)$. It can cause a division by zero since $g(z)$ can be close to zero if the network goes completely wrong during training. An alternative function should avoid such a division by zero, as softmax does in Eq. (2).

Non-negative

In Eq. (9), all elements of $f(\boldsymbol{z})$ should be non-negative to limit the output to $[0, 1]$. Therefore, $g$ should be non-negative: $g(z) \geq 0$. Note that if $g$ is non-positive, the outputs of $f$ are also limited to $[0, 1]$. We only mention the non-negative case since a non-positive function can easily be made non-negative as $-g(z)$.

Monotonically increasing

$g$ should be monotonically increasing so that $f$ becomes a smoothed version of the argmax function Bridle (1990b); Bishop (2006). If $g$ is monotonically increasing, we can obtain the label that has the maximum value of $f(\boldsymbol{z})$ by comparing the elements of $\boldsymbol{z}$.

Note that if we use ReLU as $g$, the ReLU-based function does not have all the above properties since the gradient of its logarithm is not numerically stable. If we use sigmoid as $g$, the new sigmoid-based function satisfies the above properties. However, the output of sigmoid is bounded above by one, $\sigma(z) < 1$, and this restriction might limit the representational power. In fact, the sigmoid-based function does not outperform softmax on the larger dataset in Sec. 4. We discuss these functions in detail in the appendix. In the next section, we propose a new output activation function that can break the softmax bottleneck and satisfies all the above properties.
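To make Eq. (9) concrete, the following sketch (ours) instantiates the normalized output function for two candidate choices of $g$; the all-negative input illustrates the division-by-zero risk of the ReLU choice noted above.

```python
import numpy as np

def normalized_output(z, g):
    """Eq. (9): f(z)_i = g(z_i) / sum_m g(z_m)."""
    gz = g(z)
    return gz / np.sum(gz)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

z = np.array([-3.0, -1.0, -2.0])
print(normalized_output(z, sigmoid))  # well defined: sigmoid is strictly positive
print(normalized_output(z, relu))     # nan: every ReLU output is zero, so we divide by zero
```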

3.3 Sigsoftmax

For breaking the softmax bottleneck, we propose sigsoftmax given as follows:

Definition 1.

Sigsoftmax is defined as

$[\sigma_{\mathrm{ss}}(\boldsymbol{z})]_i = \frac{\exp(z_i)\,\sigma(z_i)}{\sum_{m=1}^{M} \exp(z_m)\,\sigma(z_m)}, \qquad (11)$

where $\sigma(z_i) = \frac{1}{1 + \exp(-z_i)}$ represents a sigmoid function.

We theoretically show that sigsoftmax can break the softmax bottleneck and has the desired properties. In the same way as in the analysis of softmax in Sec. 3.1, we examine the range of log-sigsoftmax. Since we have $\log(\exp(z_i)\,\sigma(z_i)) = 2 z_i - \log(1 + \exp(z_i))$, log-sigsoftmax becomes

$\log(\sigma_{\mathrm{ss}}(\boldsymbol{z})) = 2\boldsymbol{z} - \zeta(\boldsymbol{z}) - \log\Big(\sum_{m=1}^{M} \exp(z_m)\,\sigma(z_m)\Big)\,\boldsymbol{1}, \qquad (12)$

where $\zeta(\boldsymbol{z}) = [\log(1 + \exp(z_1)), \dots, \log(1 + \exp(z_M))]^{\mathrm{T}}$, and $\zeta$ is the element-wise nonlinear function called softplus Goodfellow et al. (2016). Since log-sigsoftmax is composed of a nonlinear function, its output vectors can contain more than $d+1$ linearly independent vectors. Therefore, we have the following theorem:

Theorem 3.

Let $Z$ be a $d$ dimensional vector space and $\boldsymbol{z} \in Z$ be the input of log-sigsoftmax. For some such input space, the range of log-sigsoftmax is not a subset of any $(d+1)$ dimensional vector space.

The detailed proof of this theorem is given in the appendix. Theorem 3 shows that sigsoftmax can break the softmax bottleneck; even if the vectors of the true log probabilities contain more than $d+1$ linearly independent vectors, the sigsoftmax-based model can learn the true probabilities.
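A minimal implementation sketch (ours) of Definition 1: the log-probabilities are computed through Eq. (12), so that $\log(\exp(z_i)\sigma(z_i)) = 2z_i - \zeta(z_i)$ is evaluated without forming large exponentials.

```python
import numpy as np

def softplus(z):
    """zeta(z) = log(1 + exp(z)), computed stably."""
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def log_sigsoftmax(z):
    """Log of Eq. (11): 2 z_i - softplus(z_i), normalized by a log-sum-exp."""
    s = 2.0 * z - softplus(z)
    return s - (np.max(s) + np.log(np.sum(np.exp(s - np.max(s)))))

def sigsoftmax(z):
    return np.exp(log_sigsoftmax(z))

z = np.array([4.0, -2.0, 0.3])
p = sigsoftmax(z)
print(p, p.sum())                    # non-negative elements that sum to 1
print(np.argmax(z) == np.argmax(p))  # True: exp(z) * sigmoid(z) is increasing
```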

However, the representational powers of sigsoftmax and softmax are difficult to compare using only a theorem based on vector spaces. This is because both functions are nonlinear and their ranges are not necessarily vector spaces, even though they are subsets of vector spaces. Therefore, we directly compare the ranges of sigsoftmax and softmax in the following theorem:

Theorem 4.

Let $\boldsymbol{z} \in Z$ be the input of sigsoftmax $\sigma_{\mathrm{ss}}(\boldsymbol{z})$ and softmax $\sigma_{\mathrm{s}}(\boldsymbol{z})$. If $Z$ is a $d$ dimensional vector space and $\boldsymbol{1} \in Z$, the range of softmax is a subset of the range of sigsoftmax:

$\{\sigma_{\mathrm{s}}(\boldsymbol{z}) \mid \boldsymbol{z} \in Z\} \subseteq \{\sigma_{\mathrm{ss}}(\boldsymbol{z}) \mid \boldsymbol{z} \in Z\}. \qquad (13)$
Proof.

If we have $\boldsymbol{1} \in Z$, $Z$ can be written as $Z = \{a_0 \boldsymbol{1} + \sum_{j=1}^{d-1} a_j \boldsymbol{v}_j\}$, where $a_j \in \mathbb{R}$ ($j = 0, \dots, d-1$) are coefficients and $\boldsymbol{1}, \boldsymbol{v}_1, \dots, \boldsymbol{v}_{d-1}$ are linearly independent vectors. In addition, an arbitrary element of $Z$ can be written as $\boldsymbol{z} = a_0 \boldsymbol{1} + \boldsymbol{u}$ with $\boldsymbol{u} = \sum_{j=1}^{d-1} a_j \boldsymbol{v}_j$, and thus, $z_i = a_0 + u_i$. For the output of softmax, by substituting this into Eq. (1), we have

$[\sigma_{\mathrm{s}}(\boldsymbol{z})]_i = \frac{\exp(a_0 + u_i)}{\sum_{m=1}^{M} \exp(a_0 + u_m)} = \frac{\exp(u_i)}{\sum_{m=1}^{M} \exp(u_m)}. \qquad (14)$

As a result, the range of softmax becomes as follows:

$\{\sigma_{\mathrm{s}}(\boldsymbol{z}) \mid \boldsymbol{z} \in Z\} = \Big\{ \Big[\tfrac{\exp(u_1)}{\sum_{m} \exp(u_m)}, \dots, \tfrac{\exp(u_M)}{\sum_{m} \exp(u_m)}\Big]^{\mathrm{T}} \,\Big|\, \boldsymbol{u} \in \mathrm{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_{d-1}) \Big\}. \qquad (15)$

On the other hand, by substituting $\boldsymbol{z} = a_0 \boldsymbol{1} + \boldsymbol{u}$ into Eq. (11), the output of sigsoftmax becomes as follows:

$[\sigma_{\mathrm{ss}}(\boldsymbol{z})]_i = \frac{\exp(a_0 + u_i)\,\sigma(a_0 + u_i)}{\sum_{m=1}^{M} \exp(a_0 + u_m)\,\sigma(a_0 + u_m)} = \frac{\exp(u_i)\,\sigma(a_0 + u_i)}{\sum_{m=1}^{M} \exp(u_m)\,\sigma(a_0 + u_m)}. \qquad (16)$

When $u_1, \dots, u_M$ are fixed and $a_0 \to \infty$,[2] we have the following equality:

$\lim_{a_0 \to \infty} [\sigma_{\mathrm{ss}}(\boldsymbol{z})]_i = \frac{\exp(u_i)}{\sum_{m=1}^{M} \exp(u_m)}, \qquad (17)$

since $\sigma(a_0 + u_i) \to 1$ when $u_i$ is fixed. From Eq. (17), sigsoftmax has the following relation:

$\Big\{ \lim_{a_0 \to \infty} \sigma_{\mathrm{ss}}(a_0 \boldsymbol{1} + \boldsymbol{u}) \,\Big|\, \boldsymbol{u} \in \mathcal{H} \Big\} = \Big\{ \Big[\tfrac{\exp(u_1)}{\sum_{m} \exp(u_m)}, \dots, \tfrac{\exp(u_M)}{\sum_{m} \exp(u_m)}\Big]^{\mathrm{T}} \,\Big|\, \boldsymbol{u} \in \mathcal{H} \Big\}, \qquad (18)$

where $\mathcal{H} = \mathrm{span}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_{d-1})$ is a hyperplane of $Z$ with $\dim(\mathcal{H}) = d - 1$, $\boldsymbol{1} \notin \mathcal{H}$. From Eqs. (15) and (18), we can see that the range of sigsoftmax includes the range of softmax. Therefore, we have Eq. (13). ∎

[2] Even though $a_0$ is extremely large, the input vector $\boldsymbol{z} = a_0 \boldsymbol{1} + \boldsymbol{u}$ is still an element of the input space $Z$.

Theorem 4 shows that the range of sigsoftmax can be larger than that of softmax if $\boldsymbol{1} \in Z$. This assumption means that there exist inputs whose outputs are equal probabilities for all labels, i.e., $[\sigma_{\mathrm{s}}(\boldsymbol{z})]_i = 1/M$ for all $i$. This assumption is not very strong in practice. If $\boldsymbol{1} \notin Z$, the range of sigsoftmax can include the range of softmax by introducing one learnable scalar parameter into sigsoftmax. In this case, if softmax can fit the true probability, this parameter can become large enough for sigsoftmax to approximately equal softmax. In the experiments, we did not use this parameter in order to confirm that sigsoftmax can outperform softmax without additional parameters. From Theorems 3 and 4, sigsoftmax can break the softmax bottleneck, and furthermore, the representational power of sigsoftmax can be higher than that of softmax.

We then show that sigsoftmax has the desirable properties introduced in Sec. 3.2, as stated in the following theorem, which follows from Definition 1; its proof is given in the appendix:

Theorem 5.

Sigsoftmax has the following properties:

  1. Nonlinearity of $\log(g(z))$: $\log(\exp(z)\,\sigma(z)) = 2z - \log(1 + \exp(z))$ is nonlinear in $z$.

  2. Numerically stable: $\frac{\mathrm{d}}{\mathrm{d}z} \log(\exp(z)\,\sigma(z)) = 2 - \sigma(z)$, which contains no division.

  3. Non-negative: $\exp(z)\,\sigma(z) \geq 0$.

  4. Monotonically increasing: $\frac{\mathrm{d}}{\mathrm{d}z} \big(\exp(z)\,\sigma(z)\big) = \exp(z)\,\sigma(z)\,(2 - \sigma(z)) > 0$.

Since sigsoftmax is an alternative function to softmax, we can use a weighted sum of sigsoftmax functions in the same way as MoS. The mixture of sigsoftmax (MoSS) is the following function:

$P_{\theta}(x \mid c) = \sum_{k=1}^{K} \pi_{c,k} \frac{\exp(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_x)\,\sigma(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_x)}{\sum_{x'} \exp(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_{x'})\,\sigma(\boldsymbol{h}_{c,k}^{\mathrm{T}} \boldsymbol{w}_{x'})}, \qquad (19)$

where the prior $\pi_{c,k}$ is also composed of sigsoftmax, i.e., the softmax in its parameterization is replaced with sigsoftmax.
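Given the mos sketch in Sec. 2.3, a MoSS sketch (ours, with illustrative names) only swaps both the component distributions and the prior from softmax to sigsoftmax:

```python
import numpy as np

def sigsoftmax(z):
    # exp(z - max(z)) * sigmoid(z) is proportional to exp(z) * sigmoid(z)
    g = np.exp(z - np.max(z)) / (1.0 + np.exp(-z))
    return g / g.sum()

def moss(g_c, W_prior, W_ctx, W_emb, K):
    """Mixture of sigsoftmax, Eq. (19): the MoS structure with each
    softmax (components and prior) replaced by sigsoftmax."""
    pi = sigsoftmax(W_prior @ g_c)           # mixture weights
    H = np.tanh(W_ctx @ g_c).reshape(K, -1)  # context vectors
    return sum(pi[k] * sigsoftmax(W_emb @ H[k]) for k in range(K))
```

It can be called exactly like the mos sketch, with the same shapes for the weight matrices.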

4 Experiments

4.1 Experimental conditions

To evaluate the effectiveness of sigsoftmax, we conducted experiments on language modeling. We compared sigsoftmax with softmax, the ReLU-based function, and the sigmoid-based function. We also compared the mixture of sigsoftmax with that of softmax, i.e., MoSS with MoS. We used the Penn Treebank (PTB) dataset Marcus et al. (1993); Mikolov (2012) and the WikiText-2 (WT2) dataset Merity et al. (2017), following the previous studies Merity et al. (2018); Krause et al. (2017); Yang et al. (2018). PTB is commonly used to evaluate the performance of RNN-based language modeling Mikolov (2012); Zaremba et al. (2014); Merity et al. (2018); Yang et al. (2018). PTB is split into a training set (about 930 k tokens), a validation set (about 74 k tokens), and a test set (about 82 k tokens). The vocabulary size was set to 10 k, and all words outside the vocabulary were replaced with a special token. WT2 is a collection of tokens from a set of Wikipedia articles. WT2 is also split into a training set (about 2100 k tokens), a validation set (about 220 k tokens), and a test set (about 250 k tokens). The vocabulary size was 33,278. Since WT2 is larger than PTB, language modeling on WT2 may require more representational power than on PTB.

We trained a three-layer long short-term memory (LSTM) model with each output function. After training the models, we finetuned them and applied dynamic evaluation Krause et al. (2017). For a fair comparison, the experimental conditions, such as unit sizes, dropout rates, initialization, and the optimization method, were the same as in the previous studies Merity et al. (2018); Yang et al. (2018); Krause et al. (2017), except for the number of epochs, by using their codes.[3] We set the number of epochs to twice the original number used in Merity et al. (2018) since the losses did not converge within the original number of epochs. In addition, we trained each model with various random seeds and evaluated the average and standard deviation of the validation and test perplexities for each method. The detailed conditions and the results at the training and finetuning steps are provided in the appendix.

[3] https://github.com/salesforce/awd-lstm-lm (note that Merity et al. (2018) further tuned some hyper-parameters in their code to obtain results better than those in the original paper); https://github.com/benkrause/dynamic-evaluation; https://github.com/zihangdai/mos

4.2 Experimental results

Validation and test perplexities for PTB and WT2 modeling are listed in Tabs. 1 and 2. Tab. 1 shows that the sigmoid-based function achieved the lowest perplexities among the output activation functions on PTB. However, the sigmoid-based function did not outperform softmax on WT2. This is because sigmoid is bounded above by one, $\sigma(z) < 1$, which may restrict the representational power; as a result, the sigmoid-based function did not perform well on the larger dataset. On the other hand, sigsoftmax achieved lower perplexities than softmax on PTB and achieved the lowest perplexities on WT2. Furthermore, among the mixture models, MoSS achieved lower perplexities than MoS. Even though we trained and finetuned the models under conditions that are highly optimized for softmax and MoS in Merity et al. (2018); Yang et al. (2018), sigsoftmax and MoSS outperformed softmax and MoS, respectively. Therefore, we conclude that sigsoftmax outperforms softmax as an activation function.

Table 1: Results of the language modeling experiment on PTB.

           | Softmax   | g: ReLU   | g: Sigmoid | Sigsoftmax | MoS       | MoSS
Validation | 51.2±0.5  | (4.915)   | 49.2±0.4   | 49.7±0.5   | 48.6±0.2  | 48.3±0.1
Test       | 50.5±0.5  | (2.788)   | 48.9±0.3   | 49.2±0.4   | 48.0±0.1  | 47.7±0.07

Table 2: Results of the language modeling experiment on WT2.

           | Softmax   | g: ReLU   | g: Sigmoid | Sigsoftmax | MoS       | MoSS
Validation | 45.3±0.2  | (1.790.8) | 45.7±0.1   | 44.9±0.1   | 42.5±0.1  | 42.1±0.2
Test       | 43.3±0.1  | (2.302)   | 43.5±0.1   | 42.9±0.1   | 40.8±0.03 | 40.3±0.2

Table 3: The number of linearly independent log-output vectors on the test datasets: ranks of the stacked log-output matrix L.

     | Softmax | g: ReLU | g: Sigmoid | Sigsoftmax | MoS   | MoSS
PTB  | 402     | 8243    | 1304       | 4640       | 9980  | 9986
WT2  | 402     | 31400   | 463        | 5465       | 12093 | 19834

4.3 Evaluation of linear independence

In this section, we evaluate the linear independence of the output vectors of each function. First, we applied the whole test data to the finetuned models and obtained the log-output vector, e.g., log-softmax, at each time step. Next, we stacked these vectors into a matrix $L \in \mathbb{R}^{T \times M}$, where $T$ is the number of tokens in the test data and $M$ is the vocabulary size. $M$ and $T$ were respectively 10,000 and 82,430 on the PTB test set and 33,278 and 245,570 on the WT2 test set. Finally, we examined the rank of $L$, since the rank of a matrix equals the number of linearly independent vectors composing it. Note that numerical approaches for computing ranks suffer from roundoff error, and we used the threshold used in Press et al. (2007); Yang et al. (2018) to detect the ranks. The ranks of $L$ are listed in Tab. 3. The calculated singular values for detecting the ranks are presented in the appendix.

We can see that the log-softmax output vectors have 402 linearly independent vectors. In the experiments, the number of hidden units is 400, and we used a bias vector in the output layer. As a result, the dimension of the input space was at most 401, and the log-softmax output vectors theoretically contain at most 402 linearly independent vectors by Theorem 2. Therefore, we confirmed that the range of log-softmax is a subset of a $(d+1)$ dimensional vector space. On the other hand, the numbers of linearly independent output vectors of sigsoftmax and the ReLU- and sigmoid-based functions are not bounded by 402. Therefore, sigsoftmax and the ReLU- and sigmoid-based functions can break the softmax bottleneck. The ranks of the ReLU-based function are larger than those of the other activation functions. However, the ReLU-based function is numerically unstable, as mentioned in Sec. 3.2, and as a result, it was not trained well, as shown in Tabs. 1 and 2. MoSS has more linearly independent output vectors than MoS. Therefore, MoSS may have more representational power than MoS.
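A sketch (ours) of this rank evaluation: stack the log-output vectors row-wise and count singular values above a relative tolerance (NumPy's default tolerance is used here; the exact threshold of Press et al. (2007) and Yang et al. (2018) may differ):

```python
import numpy as np

def numerical_rank(L):
    """Rank of the stacked log-output matrix L (T x M): count singular values
    above a relative tolerance, as numpy.linalg.matrix_rank does internally."""
    s = np.linalg.svd(L, compute_uv=False)
    tol = s.max() * max(L.shape) * np.finfo(L.dtype).eps
    return int(np.sum(s > tol))

# L would be built by running a finetuned model over the test set and collecting
# one log-output vector per token; here we use stand-in data of the same form.
L = np.log(np.random.dirichlet(np.ones(50), size=200))
print(numerical_rank(L))
```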

5 Conclusion

In this paper, we investigated the range of log-softmax and identified the cause of the softmax bottleneck. We proposed sigsoftmax, which can break the softmax bottleneck and has more representational power than softmax without additional parameters. Experiments on language modeling demonstrated that sigsoftmax outperformed softmax. Since sigsoftmax has the desirable properties for output activation functions, it has the potential to replace softmax in many applications.

References

  • Bishop [1995] Christopher M Bishop. Neural Networks for Pattern Recognition. Oxford university press, 1995.
  • Bishop [2006] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
  • Bridle [1990a] John S Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Proc. NIPS, pages 211–217, 1990a.
  • Bridle [1990b] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990b.
  • Chen et al. [2017] Binghui Chen, Weihong Deng, and Junping Du. Noisy softmax: Improving the generalization ability of DCNN via postponing the early softmax saturation. pages 5372–5381, 2017.
  • Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. EMNLP, pages 1724–1734. ACL, 2014.
  • de Brébisson and Vincent [2016] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Proc. ICLR, 2016.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pages 249–256, 2010.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • Grave et al. [2017] Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for GPUs. In Proc. ICML, pages 1302–1310, 2017.
  • Graves et al. [2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pages 6645–6649. IEEE, 2013.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2016.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pages 448–456, 2015.
  • Krause et al. [2017] Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432, 2017.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105, 2012.
  • Marcus et al. [1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
  • Martins and Astudillo [2016] Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. ICML, pages 1614–1623, 2016.
  • Memisevic et al. [2010] Roland Memisevic, Christopher Zach, Marc Pollefeys, and Geoffrey E Hinton. Gated softmax classification. In Proc. NIPS, pages 1603–1611, 2010.
  • Merity et al. [2017] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proc. ICLR, 2017.
  • Merity et al. [2018] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In Proc. ICLR, 2018.
  • Mikolov [2012] Tomas Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
  • Mohassel and Zhang [2017] Payman Mohassel and Yupeng Zhang. Secureml: A system for scalable privacy-preserving machine learning. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 19–38. IEEE, 2017.
  • Nair and Hinton [2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proc. ICML, pages 807–814. Omnipress, 2010.
  • Ollivier [2013] Yann Ollivier. Riemannian metrics for neural networks i: feedforward networks. arXiv preprint arXiv:1303.0818, 2013.
  • Pascanu et al. [2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proc. ICML, pages 1310–1318, 2013.
  • Press et al. [2007] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
  • Shim et al. [2017] Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, and Wonyong Sung. SVD-softmax: Fast softmax approximation on large vocabulary neural networks. In Proc. NIPS, pages 5469–5479, 2017.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104–3112. 2014.
  • Titsias [2016] Michalis K. Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Proc. NIPS, pages 4161–4169, 2016.
  • Yang et al. [2018] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a high-rank rnn language model. In Proc. ICLR, 2018.
  • Zaremba et al. [2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

Appendix

Appendix A Proofs of theorems

In this section, we provide the proofs of the theorems that are omitted from the main text.

Theorem 3.

Let $Z$ be a $d$ dimensional vector space and $\boldsymbol{z} \in Z$ be the input of log-sigsoftmax. For some such input space, the range of log-sigsoftmax is not a subset of any $(d+1)$ dimensional vector space.

Proof.

We prove this by contradiction. If Theorem 3 does not hold, every range of log-sigsoftmax is a subset of a $(d+1)$ dimensional vector space. When we provide a counterexample to this statement, we prove Theorem 3, since the statement is the negation of Theorem 3. The counterexample is the case in which $Z$ is a one dimensional vector space (i.e., $d = 1$) and $M = 3$. Under the above condition, from Definition 1 in the paper, the outputs of log-sigsoftmax are as follows:

(20)

From $Z$, we choose three inputs and investigate the corresponding outputs. The outputs of log-sigsoftmax are as follows:

(21)
(22)
(23)

To evaluate linear independence, we examine the solutions of the equation that sets a linear combination of the three output vectors to zero. If its only solution is that all three coefficients are zero, the three output vectors are linearly independent. Each element of this equation yields the following equations:

(24)
(25)
(26)

From Eq. (26), we have

(27)

Substituting Eq. (27) into Eqs. (24) and (25), we have

(28)
(29)

From Eq. (