Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Abstract
Neural language models have been widely used in various NLP tasks, including machine translation, next-word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is computing the top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm that exploits the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging because traditional clustering methods are discrete and non-differentiable, and thus cannot be used with back-propagation in the training process. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit the data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-$k$ words in various tasks such as beam search in machine translation or next-word prediction. For example, on a German-to-English machine translation task with a vocabulary of around 25K, we achieve a 20.4x speedup with 98.9% precision@1 and 99.3% precision@5 relative to the original softmax layer prediction, while the state-of-the-art (Zhang et al., 2018) only achieves a 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 on the same task.
Pei-Hung (Patrick) Chen†, Si Si, Sanjiv Kumar, Yang Li, Cho-Jui Hsieh
† This work was conducted during Pei-Hung Chen's internship at Google Research.
Department of Computer Science, University of California, Los Angeles 
Google Research 
patrickchen@g.ucla.edu, {sisidaisy,sanjivk,liyang}@google.com, 
chohsieh@cs.ucla.edu 
1 Introduction
Neural networks have been widely used in many natural language processing (NLP) tasks, including neural machine translation (Sutskever et al., 2014), text summarization (Rush et al., 2015) and dialogue systems (Li et al., 2016). In these applications, a neural network (e.g. an LSTM) summarizes the current state by a context vector, and a softmax layer is used to predict the next output word based on this context vector. The softmax layer first computes the "logit" of each word in the vocabulary, defined by the inner product of the context vector and the word's weight vector, and then a softmax function is used to transform logits into probabilities. For most applications, only the top-$k$ candidates are needed, for example in neural machine translation where $k$ corresponds to the search beam size. In this procedure, the computational complexity of the softmax layer is linear in the vocabulary size, which can easily go beyond 10K. Therefore, the softmax layer has become the computational bottleneck in many NLP applications at inference time.
Our goal is to speed up the prediction time of the softmax layer. In fact, computing top-$k$ predictions in the softmax layer is equivalent to the classical Maximum Inner Product Search (MIPS) problem: given a query, find the $k$ vectors in a database that have the largest inner product values with the query. In neural language model prediction, context vectors are equivalent to queries, and weight vectors are equivalent to the database. MIPS is an important operation in the prediction phase of many machine learning models, and many algorithms have been developed (Bachrach et al., 2014; Shrivastava & Li, 2014; Neyshabur & Srebro, 2015; Yu et al., 2017; Guo et al., 2016). Surprisingly, when we apply recent MIPS algorithms to LSTM language model prediction, there is not much speedup if we need to achieve high precision (see the experimental section for more details). This motivates our work to develop a new algorithm for fast neural language model prediction.
In natural language, some combinations of words appear very frequently, and when a specific combination appears, it is almost sure that the prediction should only be within a small subset of the vocabulary. This observation leads to the following question: can we learn a faster "screening" model that identifies a smaller subset of potential predictions based on a query vector? In order to achieve this goal, we need to design a learning algorithm that exploits the distribution of context vectors (queries). This differs from previous MIPS algorithms, most of which only exploit the structure of the database (e.g., KD-tree, PCA-tree, or small world graph) instead of utilizing the query distribution.
We propose a novel algorithm (L2S: learning to screen) that exploits the distribution of both context embeddings (queries) and word embeddings (database) to speed up the inference time of the softmax layer. To narrow down the search space, we first develop a light-weight screening model to predict the subset of words that are more likely to belong to the top-$k$ candidates, and then conduct an exact softmax only within the subset. The algorithm is illustrated in Figure 1. Our contributions are fourfold:

We propose a screening model to exploit the clustering structure of context features. All the previous neural language models only consider partitioning the embedding matrix, exploiting the clustering structure of the word embeddings to achieve prediction speedup.

To make a prediction for a context embedding, after obtaining the cluster assignment from the screening model, L2S only needs to evaluate the small set of vocabulary words in that cluster. Therefore, L2S can significantly reduce the inference time complexity from $O(Ld)$ to $O((r + \bar{L})d)$ with $r \ll L$ and $\bar{L} \ll L$, where $d$ is the context vector's dimension, $L$ is the vocabulary size, $r$ is the number of clusters, and $\bar{L}$ is the average word/candidate size inside clusters.

We propose to form a joint optimization problem to learn both the screening model for clustering and the candidate label set inside each cluster simultaneously. Using the Gumbel trick (Jang et al., 2017), we are able to train the screening network end-to-end on the training data.

We show in our experiments that L2S can identify the top-$k$ prediction words in the vocabulary an order of magnitude faster than the original softmax layer inference for machine translation and next-word prediction tasks.
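To make the complexity claim in the second contribution concrete, here is a short worked example. The sizes below ($L = 10{,}000$, $d = 200$, $r = 100$, $\bar{L} = 400$) are illustrative assumptions chosen for round numbers, not values taken from the experiments, and the count ignores sorting and memory-access overheads.

```latex
\begin{aligned}
\text{full softmax:} \quad & L \cdot d = 10{,}000 \times 200 = 2 \times 10^{6} \ \text{multiply-adds}, \\
\text{L2S:} \quad & (r + \bar{L}) \cdot d = (100 + 400) \times 200 = 1 \times 10^{5} \ \text{multiply-adds}, \\
\text{ratio:} \quad & \frac{L}{r + \bar{L}} = \frac{10{,}000}{500} = 20 .
\end{aligned}
```

Under these assumed sizes the inner-product work shrinks by a factor of 20; the measured speedups reported later depend on the actual cluster sizes learned from data.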
2 Related Work
We summarize previous works on speeding up the softmax layer computation.
Algorithms for speeding up softmax in the training phase. Many approaches have been proposed for speeding up softmax training. Jean et al. (2014) and Mnih & Teh (2012) proposed importance sampling techniques that select only a small subset as "hard negative samples" to conduct the updates. The hierarchical softmax-based methods (Morin & Bengio, 2005; Grave et al., 2017) use a tree structure to decompose the conditional probabilities, constructed based on an external word semantic hierarchy or on word frequency. Most hierarchical softmax methods cannot be used to speed up inference, since they only provide a faster way to compute the probability of a target word, not to choose the top-$k$ predictions: they still need to compute the logits of all the words at inference time. One exception is the recent work by Grave et al. (2017), which constructs the tree structure by putting frequent words in the first layer, so that in the prediction phase, if the top-$k$ words are found in the first layer, there is no need to go down the tree. We provide a comparison with this approach in our experiments.
Algorithms for Maximum Inner Product Search (MIPS). Given a query vector and a database with $n$ candidate vectors, MIPS aims to identify a subset of vectors in the database that have the top-$k$ inner product values with the query. Top-$k$ softmax can be naturally approximated by conducting MIPS. Here we summarize existing MIPS algorithms:

Graph-based algorithm: Malkov et al. (2014) and Malkov & Yashunin (2016a) recently developed an NNS algorithm based on small world graphs. The main idea is to form a graph with candidate vectors as nodes, where edges are formed between nearby candidate vectors. The query stage can then be done by navigating in this graph. Zhang et al. (2018) apply the MIPS-to-NNS reduction and show that the graph-based approach performs well on neural language model prediction.

Direct solvers for MIPS: Some algorithms are proposed to tackle the MIPS problem directly instead of transforming it to NNS. Guo et al. (2016) and Wu et al. (2017) use a quantization-based approach to approximate the candidate set. Another algorithm, Greedy-MIPS, was recently proposed in (Yu et al., 2017), showing significant improvement over LSH and tree-based approaches.
Algorithms for speeding up softmax at inference time. MIPS algorithms can be used to speed up the prediction phase of the softmax layer, since we can view context vectors as query vectors and weight vectors as the database. In the experiments, we also include comparisons with a hashing-based approach (LSH) (Indyk & Motwani, 1998), a partition-based approach (PCA-tree (Sproull, 1991)) and the Greedy approach (Yu et al., 2017). The results show that they perform worse than the graph-based approach (Zhang et al., 2018) and are not efficient if we want to keep a high precision.
For NLP tasks, there are two previous attempts to speed up softmax layer prediction time. Shim et al. (2017) proposed to approximate the weight matrix in the softmax layer with a singular value decomposition, find a smaller candidate set based on the approximate logits, and then do a fine-grained search within the subset. Zhang et al. (2018) transformed MIPS to NNS and applied a graph-based NNS algorithm to speed up softmax. In the experiments, we show that our algorithm is faster and more accurate than all these previous algorithms. Although they also have a screening component to select an important subset, our algorithm is able to learn the screening component from training data in an end-to-end manner to achieve better performance.
3 Algorithm
The softmax layer is the main bottleneck when making predictions in neural language models. We assume $L$ is the number of output tokens, $W \in \mathbb{R}^{d \times L}$ is the weight matrix of the softmax layer, and $b \in \mathbb{R}^{L}$ is the bias vector. For a given context vector $h \in \mathbb{R}^{d}$ (such as the output of an LSTM), the softmax layer first computes the logits
$$x_s = w_s^\top h + b_s, \qquad s = 1, \dots, L, \tag{1}$$
where $w_s$ is the $s$-th column of $W$ and $b_s$ is the $s$-th entry of $b$, and then transforms the logits into probabilities $p_s = e^{x_s} / \sum_{l=1}^{L} e^{x_l}$ for $s = 1, \dots, L$. Finally it outputs the top-$k$ candidate set by sorting the probabilities $[p_1, \dots, p_L]$, and uses this information to perform beam search in translation or to predict the next word in a language model.
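As a point of reference, the exact computation described above can be sketched as follows. This is a minimal plain-Python illustration rather than the authors' implementation; the function name `exact_topk_softmax`, the list-of-columns layout `W_cols`, and the toy values are assumptions made for the example. Because softmax is monotone in the logits, the top-$k$ set can be read off the logits directly.

```python
import math

def exact_topk_softmax(W_cols, b, h, k):
    """Exact softmax over all L words; returns (top-k indices, probabilities).

    W_cols: list of L weight vectors (the columns w_s of W), each of length d.
    b:      list of L biases.  h: context vector of length d.
    """
    # One inner product per vocabulary word: O(L d) work in total.
    logits = [sum(w_j * h_j for w_j, h_j in zip(w, h)) + b_s
              for w, b_s in zip(W_cols, b)]
    # Numerically stable softmax (subtract the max logit before exponentiating).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Softmax preserves the ordering of the logits, so sorting logits suffices.
    topk = sorted(range(len(logits)), key=lambda s: logits[s], reverse=True)[:k]
    return topk, probs

# Toy example (hypothetical values): L = 4 words, d = 2.
W_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]
topk, probs = exact_topk_softmax(W_cols, b, h=[2.0, 1.0], k=2)
```

In the toy example the logits are 2, 1, 3 and -2, so the two highest-scoring words are indices 2 and 0. Every call touches all $L$ columns, which is the cost the screening model is designed to avoid.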
To speed up the computation of the top-$k$ candidates, all the previous algorithms try to exploit the structure of the $\{w_s\}$ vectors, such as low-rank structure, tree partitioning or small world graphs (Zhang et al., 2018; Shim et al., 2017; Grave et al., 2017). However, in NLP applications, there exists strong structure in the context vectors $\{h_i\}$ that has not been exploited in previous work. In natural language, some combinations of words appear very frequently, and when a specific combination appears, the next word should only be within a small subset of the vocabulary. Intuitively, if two context vectors $h_i$ and $h_j$ are similar, meaning similar context, then their candidate label sets $\mathcal{C}(h_i)$ and $\mathcal{C}(h_j)$ can be shared. In other words, suppose we already know the candidate set $\mathcal{C}(h_i)$ for $h_i$; to find the candidate set for $h_j$, instead of computing the logits for all $L$ tokens in the vocabulary, we can narrow down the candidate set to $\mathcal{C}(h_i)$, and only compute the logits in $\mathcal{C}(h_i)$ to find the top-$k$ prediction for $h_j$.
The Prediction Process. Suppose the context vectors are partitioned into $r$ disjoint clusters and similar ones are grouped in the same partition/cluster. If a vector $h$ falls into one of the clusters, we narrow down to that cluster's label set and only compute the logits of that label set. This screening model is parameterized by clustering weights $v_1, \dots, v_r \in \mathbb{R}^{d}$ and a label candidate set $c_t \in \{0,1\}^{L}$ for each cluster $t = 1, \dots, r$. To predict a hidden state $h$, our algorithm first computes the cluster indicator
$$z(h) = \arg\max_{t} \; v_t^\top h, \tag{2}$$
and then narrows down the search space to $\mathcal{C}(h) := \{s \mid c_{z(h),s} = 1\}$. The exact softmax is then computed within the subset $\mathcal{C}(h)$ to find the top-$k$ predictions (used in language models) or to compute the probabilities used for beam search in neural machine translation. The prediction time thus consists of two steps. The first step performs $r$ inner product operations to find the cluster, which takes $O(rd)$ time. The second step computes the softmax over a subset, which takes $O(\bar{L}d)$ time, where $\bar{L}$ ($\bar{L} \ll L$) is the average number of labels in the subsets. Overall the prediction time for a context embedding $h$ is $O((r + \bar{L})d)$, which is much smaller than the $O(Ld)$ complexity of the vanilla softmax layer. Figure 1 illustrates the overall prediction process.
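The two-step prediction described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the authors' numpy implementation: the names `l2s_predict`, `V` and `candidates` are hypothetical, candidate sets are stored as index lists rather than 0/1 vectors, and every candidate set is assumed to be non-empty.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2s_predict(V, candidates, W_cols, b, h, k):
    """Screen-then-softmax prediction for one context vector h.

    V:          list of r clustering weight vectors.
    candidates: list of r non-empty index lists; candidates[t] holds the
                words kept for cluster t.
    Returns (cluster index, top-k word indices, probabilities on the subset).
    """
    # Step 1: r inner products pick the cluster with the largest score.
    t = max(range(len(V)), key=lambda j: dot(V[j], h))
    # Step 2: exact softmax restricted to that cluster's candidate set.
    subset = candidates[t]
    logits = {s: dot(W_cols[s], h) + b[s] for s in subset}
    m = max(logits.values())
    total = sum(math.exp(x - m) for x in logits.values())
    probs = {s: math.exp(x - m) / total for s, x in logits.items()}
    topk = sorted(subset, key=lambda s: logits[s], reverse=True)[:k]
    return t, topk, probs

# Toy example (hypothetical values): 4 words, 2 clusters, d = 2.
W_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0, 2], [1, 2, 3]]
cluster, topk, probs = l2s_predict(V, candidates, W_cols, b, h=[2.0, 1.0], k=1)
```

Words outside the selected candidate set never have their logits computed and implicitly receive probability zero, which is why the quality of the learned candidate sets determines the precision of the approximation.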
However, how do we learn the clustering parameters $\{v_t\}_{t=1}^{r}$ and the candidate sets $\{c_t\}_{t=1}^{r}$? We found that running spherical k-means on all the context vectors in the training set can lead to reasonable results (as shown in the appendix), but can we learn even better parameters to minimize the prediction error? In the following, we propose an end-to-end procedure to learn both the context clusters and the candidate subsets simultaneously to maximize performance.
Learning the clustering. Traditional clustering algorithms such as k-means with Euclidean distance or cosine similarity have two drawbacks. First, they are discrete and non-differentiable, and thus hard to use with back-propagation in an end-to-end training process. Second, they only consider clustering on $\{h_i\}$, without taking the predicted label space into account. In this paper, we learn the partition through the Gumbel-softmax trick. We briefly summarize the technique and direct the reader to (Jang et al., 2017) for further details. In Table 4 in the appendix, we compare our proposed method to traditional spherical k-means to show that it can further improve the performance.
First, we turn the deterministic clustering in Eq. (2) into a stochastic process: the probability that $h$ belongs to cluster $t$ is modeled as
$$P(t \mid h) = \frac{\exp(v_t^\top h)}{\sum_{j=1}^{r} \exp(v_j^\top h)}, \qquad t = 1, \dots, r. \tag{3}$$
However, since the argmax in Eq. (2) is a discrete operation, we cannot combine this operation with the final objective function to find better clustering weight vectors. To overcome this, we reparameterize Eq. (3) using the Gumbel trick, which provides an efficient way to draw samples from the categorical distribution calculated in Eq. (3):
$$z(h) = \arg\max_{j} \left[ g_j + \log P(j \mid h) \right], \tag{4}$$
where each $g_j$ is an i.i.d. sample drawn from $\mathrm{Gumbel}(0, 1)$. We then use the Gumbel softmax with temperature $\tau$ as a continuous, differentiable approximation to argmax, and generate $r$-dimensional sample vectors $p = [p_1, \dots, p_r]$ that are approximately one-hot with
$$p_j = \frac{\exp\left((\log P(j \mid h) + g_j)/\tau\right)}{\sum_{i=1}^{r} \exp\left((\log P(i \mid h) + g_i)/\tau\right)}, \qquad j = 1, \dots, r. \tag{5}$$
Using the Straight-Through (ST) technique proposed in (Jang et al., 2017), we denote $\bar{p} = p + (\mathrm{one\_hot}(\arg\max_j p_j) - p)$ as the one-hot representation of $p$ and assume back-propagation only goes through the first term. This enables end-to-end training with the loss function defined in the following section. We also use $\bar{p}(h)$ to denote the one-hot entry of $\bar{p}$ (i.e., the position of the "one" entry of $\bar{p}$).
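The sampling step described above can be sketched in plain Python as follows. This is a forward-pass illustration only, under stated assumptions: the function name `gumbel_softmax_sample`, the clamping constant, and the example scores are hypothetical, and the straight-through gradient itself requires an automatic-differentiation framework that is not shown here.

```python
import math
import random

def gumbel_softmax_sample(scores, tau, rng):
    """Draw a relaxed one-hot sample p and its hard one-hot version p_bar.

    scores: the r cluster scores for one context vector.
    tau:    temperature (> 0).   rng: a random.Random instance.
    """
    # Log of the cluster probabilities, via a stable log-softmax of the scores.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    log_p = [s - log_z for s in scores]
    # Gumbel(0, 1) noise: g = -log(-log(u)) with u uniform on (0, 1).
    # u is clamped away from 0 and 1 so both logarithms are defined.
    g = []
    for _ in scores:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g.append(-math.log(-math.log(u)))
    # Softmax of the perturbed log-probabilities at temperature tau.
    y = [(lp + gj) / tau for lp, gj in zip(log_p, g)]
    my = max(y)
    e = [math.exp(v - my) for v in y]
    total = sum(e)
    p = [v / total for v in e]
    # Hard one-hot vector used in the forward pass. In an autodiff framework
    # the gradient would be taken with respect to p only (straight-through).
    hard = max(range(len(p)), key=lambda j: p[j])
    p_bar = [1.0 if j == hard else 0.0 for j in range(len(p))]
    return p, p_bar

rng = random.Random(0)
p, p_bar = gumbel_softmax_sample([2.0, 0.5, -1.0], tau=0.5, rng=rng)
```

Because dividing by a positive temperature does not change the argmax, the position of the one in `p_bar` is a Gumbel-max sample from the categorical distribution over clusters; lowering `tau` pushes `p` itself closer to one-hot.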
Learning the candidate set for each cluster. For a context vector $h_i$, after assigning it to partition $t$, we narrow down the search space of labels to a smaller subset. Let $c_t \in \{0,1\}^{L}$ be the label vector for the $t$-th cluster; we define the following loss to penalize the mismatch between correct predictions and the candidate set:
$$\ell(h_i, y_i) = \sum_{s:\, y_{is} = 1} (1 - c_{ts})^2 + \lambda \sum_{s:\, y_{is} = 0} (c_{ts})^2, \tag{6}$$
where $y_i \in \{0,1\}^{L}$ is the 'ground truth' label vector for $h_i$ that is computed from the exact softmax. We set $y_i$ to be the label vector from the full softmax because our goal is to approximate the full softmax prediction results while having faster inference (the same setting as (Shim et al., 2017; Zhang et al., 2018)). The loss is designed based on the following intuition: when we narrow down the candidate set, there are two types of loss: 1) When a candidate $s$ ($y_{is} = 1$) is a correct prediction but not in the candidate set ($c_{ts} = 0$), our algorithm will miss this label. 2) When a candidate $s$ ($y_{is} = 0$) is not a correct prediction but is in the candidate set ($c_{ts} = 1$), our algorithm will waste the computation of one vector product. Intuitively, 1) is much worse than 2), so we put a much smaller weight $\lambda \in (0, 1)$ on the second term.
The choice of the true candidate set in $y$ can be set according to the application. Throughout this paper, we set $y$ to be the correct top-5 prediction (i.e., the positions of the 5 largest $x_s$ in Eq. (1)). $y_s = 1$ means $s$ is within the correct top-5 prediction of $h$, while $y_s = 0$ means it is outside the top-5 prediction.
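The two kinds of mismatch described above can be made concrete with a small sketch. This is an illustration under stated assumptions: the function name `screening_loss`, the squared-penalty form (equivalent to a plain count for 0/1 vectors), and the weight `lam = 0.01` are hypothetical choices for the example, not values taken from the paper.

```python
def screening_loss(y, c_t, lam):
    """Mismatch between the true top-k labels y and a cluster's candidate set.

    y, c_t: 0/1 lists of length L (ground-truth labels, candidate set).
    lam:    small weight on wasted candidates.
    """
    # Type 1: correct labels missing from the candidate set (costly).
    missed = sum((1 - c) ** 2 for ys, c in zip(y, c_t) if ys == 1)
    # Type 2: candidates that are not correct labels (one wasted product each).
    wasted = sum(c ** 2 for ys, c in zip(y, c_t) if ys == 0)
    return missed + lam * wasted

# Toy example: words 0 and 1 are correct; the set keeps words 0, 2 and 3.
loss = screening_loss(y=[1, 1, 0, 0, 0], c_t=[1, 0, 1, 1, 0], lam=0.01)
```

Here one correct word is missed and two candidates are wasted, so the loss is 1 + 2 * 0.01 = 1.02; the asymmetry in the weights is what lets the candidate sets stay generous where that protects precision.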
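Final objective function. We propose to learn the partition function (parameterized by $\{v_t\}_{t=1}^{r}$) and the candidate sets ($\{c_t\}_{t=1}^{r}$) simultaneously. The joint objective function is
$$\min_{\{v_t\},\,\{c_t\}} \; \sum_{i=1}^{N} \left( \sum_{s:\, y_{is} = 1} \left(1 - c_{\bar{p}(h_i),s}\right)^2 + \lambda \sum_{s:\, y_{is} = 0} \left(c_{\bar{p}(h_i),s}\right)^2 \right) \tag{7}$$
$$\text{s.t.} \quad c_t \in \{0,1\}^{L} \;\; \forall t = 1, \dots, r, \qquad \bar{L} \le B,$$
where $N$ is the number of samples, $\bar{L}$ is the average label size defined as $\bar{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{s=1}^{L} c_{\bar{p}(h_i),s}$, $\bar{p}(h_i)$ is the index $t$ for which $\bar{p}_t = 1$ for sample $h_i$, $i = 1, \dots, N$, and $B$ is the desired average label/candidate size across different clusters, which can be thought of as the prediction time budget. Since $\bar{L}$ is related to the computation time of the proposed method, by enforcing $\bar{L} \le B$ we can make sure the label sets do not grow too large and the desired speedup rate can be achieved. Note that $\bar{p}(h_i)$ is the clustering assignment, and thus a function of the clustering parameters $\{v_t\}$ as shown in Eq. (3).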
Optimization. To solve the optimization problem in Eq. (7), we apply alternating minimization. First, when fixing the clustering (parameters $\{v_t\}$) to update the candidate sets (parameters $\{c_t\}$), the problem is identical to the classic "Knapsack" problem: each $c_{t,s}$ is an item, with weight proportional to the number of samples belonging to this cluster and value defined by the loss function of Eq. (7), and the goal is to maximize the value within the weight capacity $B$. Since Knapsack is NP-hard and no polynomial-time exact algorithm is known, we apply a greedy approach to solve it. We sort items by the value-capacity ratio and add them one-by-one until reaching the upper capacity $B$.
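One way to read the greedy step above is sketched below. This is an interpretation under stated assumptions, not the authors' code: the function name `greedy_candidate_sets` and the data layout are hypothetical; the value of adding word $s$ to cluster $t$ is taken to be the resulting decrease in the loss (one per sample in the cluster that has $s$ as a correct label, minus the wasted-candidate weight for each sample that does not); the weight is the cluster's share of the samples; and only items with positive value are considered.

```python
def greedy_candidate_sets(assign, Y, r, lam, budget):
    """Greedy knapsack update of candidate sets for fixed cluster assignments.

    assign: cluster index of each of the N samples.
    Y:      per-sample sets of true top-k word indices.
    lam:    weight on wasted candidates.
    budget: allowed average candidate-set size per sample.
    Returns a list of r sets; result[t] holds the words kept for cluster t.
    """
    N = len(assign)
    size = [0] * r                      # samples per cluster
    hits = [dict() for _ in range(r)]   # hits[t][s]: samples in t with s correct
    for t, ys in zip(assign, Y):
        size[t] += 1
        for s in ys:
            hits[t][s] = hits[t].get(s, 0) + 1
    items = []
    for t in range(r):
        if size[t] == 0:
            continue
        weight = size[t] / N            # one word's contribution to the average size
        for s, n in hits[t].items():
            value = n - lam * (size[t] - n)   # loss decrease if s joins cluster t
            if value > 0:
                items.append((-value / weight, t, s, weight))
    items.sort()                        # best value/weight ratio first
    chosen = [set() for _ in range(r)]
    used = 0.0
    for _, t, s, weight in items:
        if used + weight > budget + 1e-12:
            break                       # capacity reached
        chosen[t].add(s)
        used += weight
    return chosen

# Toy example: 4 samples, 2 clusters, average candidate-set size capped at 1.
sets = greedy_candidate_sets(
    assign=[0, 0, 0, 1], Y=[{0, 1}, {0}, {0, 2}, {3}], r=2, lam=0.1, budget=1.0)
```

In the toy example word 0 is correct for all three samples of cluster 0 and word 3 for the single sample of cluster 1, so these two items have the best ratios and exactly fill the budget.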
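When fixing $\{c_t\}$ and learning $\{v_t\}$, we convert the cluster size constraint into the objective function with a Lagrange multiplier $\gamma$:
$$\min_{\{v_t\}} \; \sum_{i=1}^{N} \left( \sum_{s:\, y_{is} = 1} \left(1 - c_{\bar{p}(h_i),s}\right)^2 + \lambda \sum_{s:\, y_{is} = 0} \left(c_{\bar{p}(h_i),s}\right)^2 \right) + \gamma \max(0, \bar{L} - B) \tag{8}$$
$$\text{s.t.} \quad c_t \in \{0,1\}^{L} \;\; \forall t = 1, \dots, r,$$
and simply use SGD, since back-propagation is available after applying the Gumbel trick. To deal with $\bar{L}$ in the mini-batch setting, we replace it by a moving average, updated at each iteration when we go through a batch of samples. The overall learning algorithm is given in Algorithm 1.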
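The mini-batch handling of the average label size can be sketched as below. This is a minimal illustration under stated assumptions: the function name `size_penalty`, the exponential form of the moving average, the `momentum` value of 0.9, and the hinge-style penalty on the excess over the budget are all hypothetical choices, since the text only states that a moving average is used.

```python
def size_penalty(avg_size, batch_sizes, budget, gamma, momentum=0.9):
    """Update the running average label size and return (new_avg, penalty).

    avg_size:    running estimate of the average candidate-set size.
    batch_sizes: candidate-set size of the cluster assigned to each sample
                 in the current mini-batch.
    budget:      desired average size.   gamma: penalty weight.
    """
    batch_mean = sum(batch_sizes) / len(batch_sizes)
    # Exponential moving average, refreshed once per mini-batch.
    new_avg = momentum * avg_size + (1.0 - momentum) * batch_mean
    # Penalize only the amount by which the running average exceeds the budget.
    penalty = gamma * max(0.0, new_avg - budget)
    return new_avg, penalty

new_avg, penalty = size_penalty(100.0, [200, 400], budget=110.0, gamma=10.0)
```

With these numbers the running average moves from 100 to about 120, which exceeds the budget of 110 by about 10 and yields a penalty of about 100; a batch that keeps the average under the budget contributes no penalty.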
4 Experiments
We evaluate our method on two tasks: Language Modeling (LM) and Neural Machine Translation (NMT). For LM, we use the Penn Treebank (PTB) dataset (Marcus et al., 1993). For NMT, we use the IWSLT 2014 German-to-English translation task (Cettolo et al., 2014) and the IWSLT 2015 English-Vietnamese data (Luong & Manning, 2015). All the models use a 2-layer LSTM neural network structure. For the IWSLT14 DE-EN task, we use the PyTorch checkpoint provided by OpenNMT (Klein et al., 2017). For the IWSLT15 EN-VE task, we set the hidden size to 200, and the rest follows the default training hyperparameters of OpenNMT. For PTB, we train a 2-layer LSTM-based language model from scratch with two setups: PTB-Small and PTB-Large. The LSTM hidden state sizes are 200 for PTB-Small and 1500 for PTB-Large, and the embedding sizes match the hidden state sizes. We verified that all these models achieve benchmark performance on the corresponding datasets as reported in the literature. We then apply our method to accelerate the inference of these benchmark models.
4.1 Competing Algorithms
We include the following algorithms in our comparisons:

L2S (our proposed algorithm): the proposed learning-to-screen method. The number of clusters $r$ and the average label size across clusters $\bar{L}$ are the main hyperparameters affecting computational time. We control the tradeoff between time and accuracy by fixing the number of clusters and varying the size constraint $B$. The remaining parameters are fixed to the same values for all the experiments. We will show later that L2S is robust to different numbers of clusters.

FGD (Zhang et al., 2018): transforms the softmax inference problem into nearest neighbor search (NNS) and solves it with a graph-based NNS algorithm.

SVD-softmax (Shim et al., 2017): a low-rank approximation approach for fast softmax computation. We vary the rank of the SVD to control the tradeoff between prediction speed and accuracy.

Adaptive-softmax (Grave et al., 2017): a variant of hierarchical softmax that was mainly developed for fast training on GPUs. However, this algorithm can also be used to speed up prediction time (as discussed in Section 2), so we include it in our comparison. The tradeoff is controlled by varying the number of frequent words in the top level of the algorithm.

Greedy-MIPS (Yu et al., 2017): the greedy algorithm for solving the MIPS problem. The tradeoff is controlled by varying the budget parameter of the algorithm.

PCA-MIPS (Bachrach et al., 2014): transforms MIPS into Nearest Neighbor Search (NNS) and then solves NNS with a PCA-tree. The tradeoff is controlled by varying the tree depth.

LSH-MIPS (Neyshabur & Srebro, 2015): transforms MIPS into NNS and then solves NNS with Locality Sensitive Hashing (LSH). The tradeoff is controlled by varying the number of hash functions.
We implement L2S, SVD-softmax and Adaptive-softmax in numpy. For FGD, we use the C++ library implemented in (Malkov & Yashunin, 2016b; Boytsov & Naidan, 2013) for the core NNS operations. The last three algorithms (Greedy-MIPS, PCA-MIPS and LSH-MIPS) have not been used to speed up softmax prediction in the literature and they do not perform well in these NLP tasks, but we still include them in the experiments for completeness. We use the C++ code by (Yu et al., 2017) to run experiments for these three MIPS algorithms.
Since our focus is to speed up the softmax layer, which is known to be the bottleneck of NLP tasks with large vocabulary, we only report the prediction time of the softmax layer in all the experiments. To compare under the same amount of hardware resources, all the experiments were conducted on an Intel Xeon E5-2620 CPU using a single thread.
Table 1: Speedup over the exact softmax and precision (P@1, P@5) of each method.
                    PTB-Small                PTB-Large                NMT: DE-EN
Method              Speedup  P@1    P@5      Speedup  P@1    P@5      Speedup  P@1    P@5
L2S (Our Method)    10.6x    0.998  0.990    45.3x    0.996  0.982    20.4x    0.989  0.993
FGD                 1.3x     0.980  0.989    6.9x     0.975  0.979    6.7x     0.987  0.981
SVD-softmax         0.8x     0.987  0.99     2.3x     0.988  0.981    3.4x     0.98   0.985
Adaptive-softmax    1.9x     0.972  0.981    4.2x     0.974  0.937    3.2x     0.982  0.984
Greedy-MIPS         0.5x     0.998  0.972    1.8x     0.945  0.903    2.6x     0.911  0.887
PCA-MIPS            0.14x    0.322  0.341    0.5x     0.361  0.326    1.3x     0.379  0.320
LSH-MIPS            1.3x     0.165  0.33     2.2x     0.353  0.31     1.6x     0.131  0.137
4.2 Performance Comparisons
To measure the quality of the top-$k$ approximate softmax, we compute Precision@$k$ (P@$k$) defined by $|A_k \cap S_k| / k$, where $A_k$ is the set of top-$k$ candidates computed by the approximate algorithm and $S_k$ is the set of top-$k$ candidates computed by the exact softmax. We present the results for $k = 1, 5$. This measures the accuracy of next-word prediction in LM and NMT. To measure the speed of each algorithm, we report the speedup, defined as the wall clock time of the exact softmax to find the top-$k$ words divided by the wall clock time of the approximate algorithm.
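The two evaluation quantities above are simple to compute; the sketch below is a minimal illustration with hypothetical function names (`precision_at_k`, `speedup`) and made-up inputs rather than measurements from the paper.

```python
def precision_at_k(approx_topk, exact_topk):
    """Fraction of the exact top-k words recovered by the approximate top-k."""
    k = len(exact_topk)
    return len(set(approx_topk) & set(exact_topk)) / k

def speedup(exact_seconds, approx_seconds):
    """Wall clock time of the exact softmax divided by that of the approximation."""
    return exact_seconds / approx_seconds

# Hypothetical example: 3 of the 5 exact top-5 words are recovered.
p_at_5 = precision_at_k([3, 1, 4, 0, 9], [1, 3, 4, 5, 2])
```

Note that the measure compares sets, so the order in which the approximate algorithm returns the $k$ words does not affect the score.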
For each algorithm, we show the prediction accuracy vs. speedup over the exact softmax in Figures 2 to 7 (the last three are in the appendix). We do not show the results for PCA-MIPS and LSH-MIPS in the figures, as their curves fall outside the range of the figures. Some representative results are reported in Table 1. These results indicate that the proposed algorithm significantly outperforms all the previous algorithms for predicting top-$k$ words/tokens on both language modeling (next-word prediction) and neural machine translation.
Next, we measure the BLEU score on the NMT tasks when incorporating the proposed algorithm into beam search. We consider the common settings with beam size 1 or 5, and report the wall clock time of each algorithm excluding the LSTM part. We only calculate log-softmax values on the reduced search space and set the probability of words outside the reduced search space to 0. Since FGD shows better precision than the other competing methods in Table 1, we only compare our method with the state-of-the-art algorithm FGD in Table 2 in terms of BLEU score. Our method achieves more than 13 times speedup with only a 0.14 loss in BLEU score on the DE-EN task with beam size 5. Similarly, our method achieves 20 times speedup on the EN-VE task with only a 0.08 loss in BLEU score. In comparison, FGD only achieves 2.7 to 6.4 times speedup over the exact softmax at a similar BLEU score. We also compare our algorithm with other methods using perplexity as a metric on PTB-Small and PTB-Large, as shown in Table 5 in the appendix. We observe more than 5 times speedup over the full softmax without losing much perplexity (less than 5% difference). More details can be found in the appendix.
In addition, we show some qualitative results of our proposed method on the DE-EN translation task in Table 6 to demonstrate that our algorithm can provide similar translation results with faster inference time.
Table 2: Speedup rate and BLEU score on the NMT tasks with beam search.
Model       Beam  Metric        Original  FGD    Our method
NMT: DE-EN  1     Speedup Rate  1x        2.7x   14.0x
NMT: DE-EN  1     BLEU          29.50     29.43  29.46
NMT: DE-EN  5     Speedup Rate  1x        2.9x   13.4x
NMT: DE-EN  5     BLEU          30.33     30.13  30.19
NMT: EN-VE  1     Speedup Rate  1x        6.4x   12.4x
NMT: EN-VE  1     BLEU          24.58     24.28  24.38
NMT: EN-VE  5     Speedup Rate  1x        4.6x   20x
NMT: EN-VE  5     BLEU          25.35     25.26  25.27
4.3 Selection of the Number of Clusters
Finally, we show the performance of our method with different numbers of clusters in Table 3. When varying the number of clusters, we also vary the time budget so that the total prediction times, including finding the correct cluster and computing the softmax in the candidate set, are similar. The results indicate that our method is quite robust to the number of clusters. Therefore, in practice we suggest simply choosing the number of clusters to be 100 or 200 and tuning the "time budget" in our loss function to get the desired speed-accuracy tradeoff.
Table 3: Performance of L2S with different numbers of clusters.
Number of Clusters  50     100    200    250
Time in ms          0.12   0.17   0.14   0.12
P@1                 0.997  0.998  0.998  0.994
P@5                 0.988  0.99   0.99   0.98
5 Conclusion
In this paper, we proposed a new algorithm for fast softmax inference on large vocabulary neural language models. The main idea is to use a light-weight screening model to predict a smaller subset of candidates, and then conduct an exact search within that subset. By forming a joint optimization problem, we are able to learn the screening network end-to-end using the Gumbel trick. In the experiments, we show that the proposed algorithm achieves much better inference speedup than state-of-the-art algorithms for language modeling and machine translation tasks.
6 Acknowledgement
We are grateful to Ciprian Chelba for the fruitful comments, corrections and inspiration. CJH acknowledges the support of NSF via IIS-1719097, Intel faculty award, Google Cloud and Nvidia.
References
 Bachrach et al. (2014) Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender systems, pp. 257–264. ACM, 2014.
 Boytsov & Naidan (2013) Leonid Boytsov and Bilegsaikhan Naidan. Engineering efficient and effective non-metric space library. In Similarity Search and Applications - 6th International Conference, SISAP 2013, A Coruña, Spain, October 2-4, 2013, Proceedings, pp. 280–293, 2013.
 Cettolo et al. (2014) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, 2014.
 Grave et al. (2017) Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1302–1310, 2017.
 Guo et al. (2016) Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pp. 482–490, 2016.
 Indyk & Motwani (1998) Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM, 1998.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations 2017. OpenReview.net, 2017.
 Jean et al. (2014) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
 Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
 Li et al. (2016) Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
 Luong & Manning (2015) Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.
 Malkov & Yashunin (2016a) Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016a.
 Malkov et al. (2014) Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.
 Malkov & Yashunin (2016b) Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, abs/1603.09320, 2016b.
 Marcus et al. (1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: the penn treebank. Comput. Linguist., 19(2):313–330, 1993.
 Mnih & Teh (2012) Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012.
 Morin & Bengio (2005) Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252, 2005.
 Neyshabur & Srebro (2015) Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric lshs for inner product search. In ICML, 2015.
 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389, 2015.
 Shim et al. (2017) Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, and Wonyong Sung. Svdsoftmax: Fast softmax approximation on large vocabulary neural networks. In Advances in Neural Information Processing Systems 30, pp. 5463–5473. 2017.
 Shrivastava & Li (2014) Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
 Sproull (1991) Robert F Sproull. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6(1-6):579–589, 1991.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Wu et al. (2017) Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N Holtmann-Rice, David Simcha, and Felix Yu. Multiscale quantization for fast similarity search. In NIPS, pp. 5745–5755. 2017.
 Yu et al. (2017) Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit Dhillon. A greedy approach for budgeted maximum inner product search. In NIPS, 2017.
 Zhang et al. (2018) Minjia Zhang, Xiaodong Liu, Wenhan Wang, Jianfeng Gao, and Yuxiong He. Navigating with graph representations for fast and scalable decoding of neural language models. In NIPS, 2018.
Appendix A
A.1 Precision@5 Results
A.2 Comparison to Spherical-KMEANS Initialization
Since we first initialize the parameters of our method by Spherical-KMEANS, we also show in Table 4 that L2S can further improve over this baseline clustering method. Notice that even the basic Spherical-KMEANS can outperform state-of-the-art methods. This shows that the clustering structure of context features is key to fast prediction.
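| Method | PTB-Small Speedup | PTB-Small P@1 | PTB-Small P@5 | PTB-Large Speedup | PTB-Large P@1 | PTB-Large P@5 | NMT: DE-EN Speedup | NMT: DE-EN P@1 | NMT: DE-EN P@5 |
|---|---|---|---|---|---|---|---|---|---|
| Our Method | 10.6x | 0.998 | 0.990 | 45.3x | 0.999 | 0.82 | 20.4x | 0.989 | 0.993 |
| Spherical-KMEANS | 4x | 0.988 | 0.992 | 6.9x | 0.992 | 0.971 | 13.8x | 0.991 | 0.993 |
| FGD | 1.3x | 0.980 | 0.989 | 6.9x | 0.975 | 0.979 | 6.7x | 0.987 | 0.981 |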
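To make the clustering baseline more concrete, the following is a minimal sketch of a generic spherical k-means over context vectors, written with NumPy. It is not the authors' implementation: the function name `spherical_kmeans`, the random initialization from data points, and the fixed iteration count are all assumptions made for illustration, and the sketch assumes every context vector has nonzero norm.

```python
import numpy as np

def spherical_kmeans(H, num_clusters, num_iters=20, seed=0):
    """Cluster the rows of H by cosine similarity (hypothetical helper).

    H is an (n, d) array of context vectors with nonzero norm.
    Returns unit-norm centroids (num_clusters, d) and assignments (n,).
    """
    rng = np.random.default_rng(seed)
    # Project every context vector onto the unit sphere.
    X = H / np.linalg.norm(H, axis=1, keepdims=True)
    # Assumption: initialize centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Assign each point to the centroid with the largest cosine similarity.
        assign = np.argmax(X @ centroids.T, axis=1)
        for c in range(num_clusters):
            members = X[assign == c]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            mean = members.sum(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                # Re-normalize so the centroid stays on the unit sphere.
                centroids[c] = mean / norm
    # Final assignment against the updated centroids.
    assign = np.argmax(X @ centroids.T, axis=1)
    return centroids, assign

# Usage on synthetic data (the shapes are illustrative only).
H_demo = np.random.default_rng(0).normal(size=(100, 16))
centroids_demo, assign_demo = spherical_kmeans(H_demo, num_clusters=8)
```

At inference time, a new context vector would be routed to the cluster with the highest cosine similarity, and the exact softmax would then be restricted to that cluster's candidate words.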
A.3 Perplexity Results
Finally, we go beyond top-$k$ prediction and apply our algorithm to speed up perplexity computation for language models. To compute perplexity, we need the probability of each token that appears in the dataset, which may not be among the top-$k$ softmax predictions. To apply a top-$k$ approximate softmax algorithm to this task, we adopt the low-rank approximation idea proposed in (Shim et al., 2017). For tokens within the candidate set, we compute the logits using exact inner products, while for tokens outside the set we approximate the logits by $\tilde{W}h$, where $\tilde{W}$ is a low-rank approximation of the original weight matrix in the softmax layer and $h$ is the context vector. The probability can then be computed from these logits. For all the algorithms, we set the rank of $\tilde{W}$ to 20 for PTB-Small and 200 for PTB-Large. The results are presented in Table 5. We observe that our method outperforms previous fast softmax approximation methods for computing perplexity on both the PTB-Small and PTB-Large language models.
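| Model | Metric | Original | SVD-softmax | Adaptive-softmax | FGD | Our method |
|---|---|---|---|---|---|---|
| PTB-Small | Speedup Rate | 1x | 0.84x | 1.69x | 0.95x | 5.69x |
| PTB-Small | PPL | 112.28 | 116.64 | 121.43 | 116.49 | 115.91 |
| PTB-Large | Speedup Rate | 1x | 0.61x | 1.76x | 2.27x | 8.11x |
| PTB-Large | PPL | 78.32 | 80.30 | 82.59 | 80.47 | 80.09 |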
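The perplexity procedure described in this subsection can be sketched as follows: exact logits for tokens in the candidate set, low-rank approximate logits for all other tokens, then a normalized log-probability. This is an illustrative sketch under stated assumptions, not the authors' code. The names `W` (softmax weight matrix, vocabulary by hidden size), `h` (context vector), `candidates`, and all function names are hypothetical; the low-rank factors are obtained here by a truncated SVD, and the bias term is omitted for brevity.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Truncated SVD of W (vocab x dim); A @ B is a rank-`rank` approximation."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (vocab, rank)
    B = Vt[:rank]                # shape (rank, dim)
    return A, B

def mixed_log_probs(W, A, B, h, candidates):
    """Log-probabilities with exact logits on `candidates`, low-rank elsewhere."""
    logits = A @ (B @ h)                     # approximate logits for every token
    logits[candidates] = W[candidates] @ h   # overwrite with exact logits
    logits = logits - logits.max()           # shift for numerical stability
    return logits - np.log(np.exp(logits).sum())

def perplexity(target_log_probs):
    """Perplexity from the log-probabilities assigned to the observed tokens."""
    return float(np.exp(-np.mean(target_log_probs)))

# Usage on synthetic data (sizes and candidate set are illustrative only).
rng_demo = np.random.default_rng(0)
W_demo = rng_demo.normal(size=(1000, 64))
h_demo = rng_demo.normal(size=64)
A_demo, B_demo = low_rank_factors(W_demo, rank=20)
cand_demo = np.array([3, 17, 256])
log_p_demo = mixed_log_probs(W_demo, A_demo, B_demo, h_demo, cand_demo)
ppl_demo = perplexity(log_p_demo[[17]])  # perplexity over a single target token
```

Computing `B @ h` first keeps the approximate part at a cost proportional to the rank rather than the hidden size for each vocabulary entry, which is where the savings over a full matrix-vector product would come from.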
A.4 Qualitative Results
Table 6 shows some translated sentences from the DE-EN task, demonstrating that our algorithm produces translations similar to those of the full softmax, but with faster inference.
| Full-softmax | Our method |
|---|---|
| you know , one of the great unk at travel and one of the pleasures at the unk research is to live with the people who remember the old days , who still feel their past in the wind , touch them on the rain of unk rocks , taste them in the bitter sheets of plants . | you know, one of the great unk at travel and one of the joy of the unk research is to live together with the people who remember the old days , who still feel their past in the wind , touch them on the rain of unk rocks , taste them in the bitter sheets of plants. |
| it ’s the symbol of all that we are , and what we’re capable of as astonishingly unk species . | it’s the symbol of all of what we are , and what we’re capable of as astonishingly unk species . |
| when any of you were born in this room , there were 6,000 languages talking on earth . | when everybody was born in this room , there were 6,000 languages spoken on earth . |
| a continent is always going to leave out , because the idea was that in sub-saharan africa there was no religious faith , and of course there was a unk , and unk is just the remains of these very profound religious thoughts that unk in the tragic diaspora of the unk . | a continent is always going to leave out , because the presumption was that in sub-saharan africa there was no religious faith , and of course there was a unk , and unk is just the cheapest of these very profound religious thoughts that unk in the tragic diaspora of unk unk . |
| so , the fact is that , in the 20th century, in 300 years , it is not going to be remembered for its wars or technological innovation , but rather than an era where we were present , and the massive destruction of biological and cultural diversity on earth either on earth is either active or unk. so the problem is not the change . | so , the fact is that , in the 20th century , in 300 years , it is not going to be remembered for its wars or technological innovation , but rather than an era where we were present , and the massive destruction of biological and cultural diversity on earth either on earth is either unk or passive. so the problem is not the change . |
| and in this song , we’re going to be able to connect the possibility of what we are : people with full consciousness , who are aware of the importance that all people and gardens have to thrive , and there are great moments of optimism . | and in this song , we’re going to be able to rediscover the possibility of what we are : people with full consciousness that the importance of the importance of being able to thrive is to be able to thrive , and there are great moments of optimism . |