The Two-Pass Softmax Algorithm
Abstract
The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors the performance of all three passes is limited by memory bandwidth.
We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”.
Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in the AVX-512 implementation, and by up to 18% in the AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors.
To foster reproducibility, we released an open-source implementation of the new Two-Pass softmax algorithm and the other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
1 Introduction
The softmax (also called softargmax) function is a smooth approximation to the argmax function. The softmax function operates on a vector of real-valued scores $x$ and normalizes it into a probability distribution
$$p_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}},$$
where $0 \leq p_i \leq 1$ and $\sum_{i=1}^{N} p_i = 1$.
The softmax function is commonly used in machine learning to give a probabilistic interpretation to the outputs of classification models [2]. Such machine learning models output a probability for every possible object class, and the number of classes in modern datasets can reach millions. For example, natural language processing models may predict a probability distribution over every word in a vocabulary, and recommendation systems may model the probability distribution over users, products, web pages, or their interactions. Table 1 summarizes the number of classes in several public classification datasets.
Dataset  Class description  Class Count 

ImageNet [8]  Image category  21841 
One Billion Word [4]  Unique Words  793471 
Wikilinks [33]  Wikipedia pages  2933659 
DepCC [28]  Web documents  364.80 million 
Hierarchical Softmax (HSM) [13] and its modifications [15] are common techniques to scale classification models to a large number of classes. HSM models jointly consider the softmax function and the matrix-matrix multiplication that produced its input, and replace them with a low-rank approximation. Thus, HSM methods improve performance by reducing the matrix-matrix multiplication cost, and do so by approximating the original machine learning model.
Unlike previous research, we focus on the softmax function in the context of inference with a pre-trained model. This situation commonly arises in machine learning frameworks, such as TensorFlow [1] or PyTorch [31], when the training dataset or the meta-parameters needed to devise an accurate approximation to the model are not available. In this case, the model must be computed exactly according to its specification, and unsafe approximations are not possible.
Our contributions in this paper are as follows:

We demonstrate that a well-optimized implementation of the softmax function can be memory-bound even in single-threaded execution. This result emphasizes the importance of eliminating memory operations for further improvements in the performance of the softmax function.

We introduce a novel algorithm for computing the softmax function. The new algorithm employs an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”. Thanks to this special representation of intermediate results, the new algorithm needs only two passes over an input vector versus three passes for traditional algorithms.

We present and evaluate high-performance implementations of the new Two-Pass softmax algorithm for x86-64 processors with the AVX2 and AVX512F SIMD extensions. The experimental study confirms speedups of up to 28% on an Intel Skylake-X system.
The proposed improvements to the softmax implementation are orthogonal to matrix-matrix multiplication optimizations, and can be combined with sparsification [24, 29], low-rank decomposition [9], low-precision arithmetic [19, 30, 35], or hardware acceleration [16, 20] for the matrix-matrix multiplication that produces the softmax input.
2 Related work
The softmax function is widely used in modern machine learning models and algorithms. In particular, the softmax function gained wide popularity in Natural Language Processing as a building block for language models, which predict a word or n-gram out of a large vocabulary [21]. As the softmax function plays an important role in machine learning models, several approximations have been proposed [23, 36, 34, 32, 22, 37, 15] to make the calculations efficient and fast. We refer the reader to Jozefowicz et al. [15] for a good overview of recent approaches to speeding up softmax calculations.
Although a lot of work has been done to address the computational complexity of the softmax function on large inputs, the proposed solutions either 1) require special hardware (e.g., an FPGA) or 2) require using an approximation to the softmax function during model training. In the latter case the softmax approximation cannot be directly applied to a pre-trained machine learning model, which dramatically limits its use in existing machine learning frameworks, e.g., TensorFlow [1] and PyTorch [31]. Therefore, we developed a novel algorithm that improves the performance of the original softmax function on widely available hardware, can be used with pre-trained machine learning models, and can be implemented in any machine learning framework.
3 The Three-Pass Algorithm
Direct calculation of the softmax function according to this formula is fraught with numerical issues. The single-precision $e^x$ function overflows for $x$ greater than approximately $88.7$, producing a floating-point infinity, and underflows to zero for $x$ less than approximately $-103.3$; as a result, a naïve implementation can produce infinities and NaN outputs.
In order to overcome these numerical instability issues, machine learning frameworks adopt a workaround utilizing the equivalence [12]
$$\mathrm{softmax}(x) = \mathrm{softmax}(x - c),$$
which holds for any constant $c$. In particular, if $c = \max_i x_i$, then:

No inputs to the $e^x$ function exceed zero.

There is at least one zero input to the $e^x$ function, and thus the denominator of the fraction is non-zero.
Together, these properties result in good numerical behavior of the computation and lead to Algorithm 1.
4 The Two-Pass Algorithm
The Three-Pass Algorithms 1 and 2 avoid numerical issues by normalizing inputs relative to their maximum value, but then require an additional memory pass to find that maximum value. In this section we show that it is possible to achieve numerical stability without the extra memory pass, and present a practical Two-Pass algorithm for softmax computation.
The immediate reason for the numerical instability of a naïve softmax implementation is the saturation of the $e^x$ function for inputs outside of the narrow range of approximately $[-103.3, 88.7]$. Therefore, we have to look inside the $e^x$ function for a solution.
The $e^x$ function can be implemented in an infinite number of ways, but practical implementations [11, 18, 25, 10] in IEEE floating-point arithmetic follow the traditional structure of elementary function implementations [27], and include three steps:

Range reduction, where the problem of approximating $e^x$ on the infinite input domain is reduced to a problem of approximating it on a small finite range. For $e^x$, a natural range reduction derives from the equivalence
$$e^x = 2^n \cdot e^t, \quad x = n \ln 2 + t,$$
which decomposes approximating $e^x$ on $(-\infty, +\infty)$ into approximating $e^t$ on $\left[-\frac{\ln 2}{2}, \frac{\ln 2}{2}\right]$ and multiplying the result by $2^n$, where $n$ is an integer.

Approximation of the function on the reduced range, i.e. of $e^t$ for $t \in \left[-\frac{\ln 2}{2}, \frac{\ln 2}{2}\right]$. This step is achieved through a polynomial or rational approximation, often in combination with table lookups.

Reconstruction of the final value from the approximation on the reduced range. For $e^x$, the reconstruction step consists of multiplying $e^t$ by $2^n$, and can be achieved at low cost by manipulating the exponent field of a binary floating-point number. It is in this step that the underflow and overflow situations arise: $e^t \in \left[\frac{\sqrt{2}}{2}, \sqrt{2}\right]$ and thus always fits into a single-precision floating-point number, but $2^n$ can exceed the range of the exponent field, causing underflow or overflow.
The key idea that enables the Two-Pass softmax algorithm is to remove the reconstruction step, and instead keep the result of $e^x$ as a pair of floating-point values $(m, n)$, where $e^x = m \cdot 2^n$. Mathematically, the pair could be collapsed into the single product $m \cdot 2^n$, but in general this expression cannot be evaluated in floating-point arithmetic without overflowing or underflowing the result. The representation of $n$ as a floating-point number is important: although $n$ is by design always an integer, it can have a very large magnitude and fall outside the range of standard integer formats. Therefore, the result must be represented as two, rather than one, floating-point numbers. Using multiple floating-point numbers to represent a real-valued result has a long history in double-double, triple-double, and quad-double [17] representations and Error-Free Transformations [14]. However, these representations use multiple floating-point numbers to improve the precision of floating-point arithmetic, whereas we use two floating-point numbers to extend its dynamic range.
Algorithm 3 presents the softmax computation in just two passes by implementing addition in the $(m, n)$ representation. The reduction pass keeps track of the running maximum of the $n$ values among all elements, and accumulates the scaled $m$ values into the running sum. It avoids floating-point overflow by scaling the $m$ values by two raised to the difference between the corresponding $n$ values and the maximum of the $n$ values. As this difference is never positive, the $m$ values are never scaled up, which ensures the absence of floating-point overflow.
5 Theoretical Analysis
Algorithm  Memory reads  Memory writes  Bandwidth cost

Three-Pass (Recompute), Alg. 1  3N  1N  4N
Three-Pass (Reload), Alg. 2  3N  2N  5N
Two-Pass, Alg. 3  2N  1N  3N
While the number of memory passes in the presented softmax algorithms is evident from their names, the number of actual memory operations is more nuanced. Every pass of the Three-Pass algorithm with Recomputing reads the input array, while the last pass also writes the output array. The Three-Pass algorithm with Reloading just reads the input array in the first pass, reads the input array and writes the output array in the second pass, and reads, modifies, and writes the output array in the last pass. The Two-Pass algorithm reads the input array in both passes, and also writes the output array in the second pass. Thus, the memory bandwidth requirements of the Two-Pass algorithm are similar to just the last two passes of the Three-Pass algorithm with Recomputing.
Table 2 summarizes the number of memory reads, memory writes, and the memory bandwidth cost for the three algorithms on arrays of $N$ elements. Per Table 2, the Two-Pass algorithm has a memory bandwidth advantage of 25% over the Three-Pass algorithm with Recomputing and 40% over the Three-Pass algorithm with Reloading. In practice, we should treat these numbers as upper bounds, because the higher computational complexity of the Two-Pass algorithm cuts into the gains from bandwidth reduction.
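The bandwidth advantages follow directly from the per-element costs in Table 2:

```latex
\underbrace{\frac{4N - 3N}{4N}}_{\text{vs. Three-Pass with Recomputing}} = 25\%,
\qquad
\underbrace{\frac{5N - 3N}{5N}}_{\text{vs. Three-Pass with Reloading}} = 40\%.
```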
6 Experimental Evaluation
6.1 Platform
We evaluate the performance of the three softmax algorithms on an Intel Xeon W-2135 processor based on the Skylake-X microarchitecture, with the characteristics listed in Table 3. To improve performance stability, we disabled dynamic frequency scaling in the processor for the duration of our experiments.
Characteristic  Value 

Microarchitecture  Skylake-X 
Number of cores  6 
Number of hyperthreads  12 
Base frequency  3.7 GHz 
L1 cache size (per core)  32 KB 
L2 cache size (per core)  1 MB 
L3 cache size (shared by all cores)  8.25 MB 
AVX2 / AVX-512 FMA throughput  2/cycle 
AVX2 / AVX512 FMA latency  4 cycles 
Additionally, in Sec. 6.8 we replicate a subset of experiments on a Broadwell-based Intel Xeon E5-2696 v4 processor and on an AMD Ryzen 9 3900X processor with the Zen 2 microarchitecture.
6.2 Protocol
We use the Google Benchmark framework to estimate the sustained performance of the softmax implementations. We set the minimum runtime for each measurement to 5 seconds, repeat each measurement 25 times, and record the median of the 25 runs. In each benchmark run we simulate the cache state during neural network inference: the output vector is evicted from the cache before each iteration, but the input tensor stays in cache as long as it fits.
6.3 Implementation
We developed highly optimized implementations of the Three-Pass Algorithms 1 and 2 and the Two-Pass Algorithm 3 in C. For all three algorithms, we wrote two templated implementations targeting the AVX2 and AVX-512 instruction sets. We expressed high-level optimization parameters, such as the unroll factor for the loops and the number of accumulator variables in reduction functions, as meta-parameters of the templated implementations, and employed auto-tuning to discover their optimal values.
An efficient implementation of the vectorized $e^x$ function is a key component of all softmax variants. For our implementation, we adapted the throughput-optimized methods of Dukhan and Vuduc [10] to single-precision floating-point evaluation. Algorithm 4 presents the resulting table-free, branch-free, and division-free algorithm.
Algorithm 4 follows the traditional structure of elementary function implementations [27], described in Sec. 4. It starts with a range reduction to reduce approximation on the infinite input domain to approximation on a small, finite range. The calculation of the reduced argument $t$ in Algorithm 4 uses Cody-Waite range reduction [6]: $\ln 2$ is represented as a sum of two single-precision constants, $\ln 2_{hi}$ and $\ln 2_{lo}$, to improve the accuracy of this step. Range reduction results in a reduced argument $t$ in the range $\left[-\frac{\ln 2}{2}, \frac{\ln 2}{2}\right]$ and a reduced integer argument $n$. Next, $e^t$ is approximated on this range with a degree-5 polynomial $p(t)$. The polynomial coefficients are produced by the algorithm of Brisebarre and Chevillard [3] as implemented in the Sollya software [5]. Following [10], we evaluate the approximation polynomial with the Horner scheme using Fused Multiply-Add instructions to minimize the number of floating-point instructions and maximize throughput. In the last stage, Algorithm 4 reconstructs the final output value by multiplying the polynomial approximation result by $2^n$. In the AVX2 implementation, we do this multiplication by directly manipulating the floating-point exponent field to construct a scale number $sf = 2^n$, and compute the final step as $e^x = p(t) \cdot sf$. This reconstruction trick has two built-in assumptions: that the argument to the $e^x$ function is negative (always the case for the evaluation in the Three-Pass softmax algorithms), so that $n \leq 0$ and $sf$ never overflows, and that the underflow of $sf$ for very negative arguments is acceptable, since the exact value of $e^x$ would round to zero anyway.
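A scalar model of the exponent-field manipulation described above (`make_scale` is a hypothetical helper name of ours; the AVX2 kernel performs the equivalent shift-and-add on vector registers):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Construct 2^n by placing n + 127 directly into the exponent field
 * of an IEEE single-precision number (sign 0, mantissa 0). Valid only
 * for the normalized range -126 <= n <= 127; the AVX2 kernel relies
 * on n <= 0, which holds in the Three-Pass softmax algorithms. */
static float make_scale(int32_t n) {
    assert(n >= -126 && n <= 127);
    const uint32_t bits = (uint32_t)(n + 127) << 23;
    float sf;
    memcpy(&sf, &bits, sizeof(sf));  /* bit-cast without aliasing UB */
    return sf;
}
```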
The resulting implementation has a maximum error under 2 ULP, validated through exhaustive evaluation on all valid inputs. This accuracy is comparable to other vectorized elementary function implementations; e.g., the SIMD functions in the GNU libm library guarantee a maximum error under 4 ULP.
The implementation of ExtExp in the Two-Pass softmax algorithm is similar to Algorithm 4 with the reconstruction step removed. Thus, implementations of both the Three-Pass and the Two-Pass algorithms use exactly the same range reduction and approximating polynomials to compute the exponential function.
6.4 The Three-Pass Algorithms and Bandwidth Saturation
Fig. 1 presents the performance of the Three-Pass softmax Algorithm 1 with recomputing of exponentials and the Three-Pass softmax Algorithm 2 with reloading of computed exponentials in the AVX-512 implementations. Reloading of exponential computations delivers a speedup when the data is small enough to fit into the private L1 and L2 caches, but turns into a slowdown when operating out of the L3 cache, and eventually levels off at a small advantage when the working set exceeds the last-level cache.
The AVX2 implementations of the same Three-Pass softmax Algorithms 1 and 2 are illustrated in Fig. 2 and demonstrate similar trends. As the working set increases, the speedup from reloading of exponential computations goes down, and eventually levels off for large arrays.
The small difference between recomputing and reloading of exponential computations in Fig. 1 and Fig. 2 suggests that, despite the expensive exponential function, softmax computation might be memory-bound for large arrays. To directly test this hypothesis, we decompose Algorithms 1 and 2 into individual memory passes, and compare the measured bandwidth to the STREAM benchmarks [26].
Figs. 3 and 4 illustrate the memory bandwidth in each pass of the Three-Pass softmax Algorithms 1 and 2, as well as the Copy and Scale STREAM benchmarks [26]. As recommended in the STREAM documentation, we set the array size to four times the size of the last-level cache (8,650,752 single-precision elements for softmax, and 4,325,376 double-precision elements for STREAM). The first softmax pass (max-reduction) is the same in both versions of the Three-Pass algorithm, and thus is presented only once. This pass reads one input array, and doesn't have a direct equivalent in STREAM, but achieves similar bandwidth to the STREAM Copy and Scale benchmarks, which both read one array and write one array. The second pass in Algorithm 1 reads one array, computes exponentials on the inputs, and accumulates them. It achieves a large fraction of the STREAM Copy bandwidth in the AVX-512 version, and a smaller fraction in the AVX2 version. The second pass of Algorithm 2 is similar, but additionally stores the computed exponentials into the output array. Although it achieves higher bandwidth than the second pass in Algorithm 1, it takes substantially longer to complete; the higher bandwidth is due to doubling the number of transferred bytes with a less-than-proportional increase in run time. The third pass of Algorithm 1 reads one array, computes exponentials on the inputs, scales them, and writes the results to another array. This pass does the same number of memory operations as the STREAM Scale benchmark, but substantially more computational operations. Yet, our auto-tuned implementations exceed the performance of the STREAM Scale benchmark in both the AVX-512 and the AVX2 versions. The third pass of Algorithm 2 is an in-place variant of the STREAM Scale benchmark. The processor clearly favors the in-place operation: it is faster than STREAM Scale in both the AVX-512 and the AVX2 versions.
To summarize, passes 1 and 3 of Algorithm 1 with Recomputing demonstrate similar memory performance to the STREAM benchmarks, and pass 2 in the AVX-512 implementation is not far behind. Passes 1 and 2 of Algorithm 2 with Reloading similarly perform close to STREAM bandwidth, and pass 3 is significantly faster than the STREAM Scale benchmark. These results confirm that the performance of the Three-Pass softmax algorithms is limited by the achievable memory bandwidth, and suggest that softmax computation can be further accelerated only through reducing the number of memory operations.
6.5 The Two-Pass Algorithm
Fig. 5 presents the performance of the Two-Pass softmax algorithm in comparison with the two versions of the Three-Pass algorithm in the AVX-512 implementations. On out-of-cache working sets the proposed Two-Pass softmax algorithm outperforms the Three-Pass algorithms by up to 28%.
Fig. 6 similarly compares the performance of the Two-Pass and the Three-Pass algorithms in the AVX2 implementations. Here, the Two-Pass algorithm outperforms the Three-Pass algorithm with Reloading of exponential computations by up to 18% on out-of-cache workloads. The smaller speedups, compared to the AVX-512 implementation, are explained by the relatively higher cost of recomputing exponentials in AVX2. If we compare to the Three-Pass Algorithm 1, which similarly recomputes exponentials, the Two-Pass algorithm wins by an even larger margin.
Fig. 7 decomposes the absolute runtime of the three algorithms and two SIMD instruction sets into individual memory passes, and offers insight into the origin of the performance improvements of the Two-Pass algorithm. The two passes of the Two-Pass softmax algorithm have similar, but slightly higher, absolute runtime than the last two passes of the Three-Pass softmax algorithm with recomputation of the exponential function, which share the same memory access pattern. The slightly higher runtime of the passes of the Two-Pass algorithm can be explained by the larger number of operations needed for accumulation in the $(m, n)$ representation compared to just accumulating scalar floating-point values.
6.6 Multi-Threaded Performance
The benchmarks in Secs. 6.4 and 6.5 presented the performance of single-threaded softmax computation, and demonstrated that on HPC-class systems softmax saturates memory bandwidth even when running on a single core. Utilizing multiple cores increases the available computational resources faster than the achievable memory bandwidth, and therefore increases the advantage of the bandwidth-saving Two-Pass softmax algorithm. To quantify this advantage, we fix the size of the array at 4 times the last-level cache size [26], and vary the number of threads from 1 to 6 (the number of cores in the system) to 12 (the number of logical processors, including hyperthreads, in the system).
Fig. 8 illustrates the multicore scaling of the AVX-512 implementations. As the number of threads grows, the advantage of the Two-Pass over the Three-Pass algorithms remains roughly unchanged. Interestingly, the reloading variant of the Three-Pass algorithm scales worse than the recomputing variant, and the recomputing Algorithm 1 outperforms the reloading Algorithm 2 when at least 3 cores are being utilized.
Fig. 9 similarly illustrates the multicore scaling of the AVX2 implementations. The advantage of the Two-Pass over the Three-Pass algorithms grows as the number of utilized cores increases from 1 to all 6, and grows further when hyperthreads are also used.
6.7 Comparison with Intel DNNL
The results in Secs. 6.5 and 6.6 demonstrate that on out-of-cache inputs the Two-Pass softmax algorithm outperforms the Three-Pass softmax algorithms in our implementation, but leave open the question of whether our implementations are competitive with the state of the art. To demonstrate the absolute effectiveness of the Two-Pass algorithm, we compared our implementations of the three softmax algorithms to the softmax primitive in the Intel Deep Neural Network Library (DNNL) version 1.1.1.
Intel DNNL implements the Three-Pass softmax Algorithm 2 with reloading of computed exponentials. It includes implementations for the SSE4.1, AVX, and AVX-512 instruction sets, and automatically dispatches to the AVX-512 implementation on the Skylake-X processor. Unlike our implementations, Intel DNNL generates its implementation at runtime using Just-in-Time (JIT) compilation. JIT code generation potentially allows adapting the implementation to the parameters of a particular softmax operator (e.g., the number of channels).
Fig. 10 presents the comparison between the two implementations (ours and DNNL's) of the Three-Pass algorithm with reloading of exponentials, and the Two-Pass softmax algorithm in our implementation. For the Three-Pass algorithm with reloading, our implementation outperforms Intel DNNL for the majority of problem sizes. Its advantage is particularly high (over 2X) when the data fits into L1, diminishes within L2, and levels off for out-of-cache problem sizes. As the implementation in Intel DNNL is less efficient than ours, our Two-Pass softmax algorithm outperforms the DNNL softmax primitive on all problem sizes: it is faster both on out-of-cache problem sizes and when the input fits into the L1 cache.
6.8 Validation on Alternative Processors
The results in Secs. 6.4 through 6.7 were all collected on the Xeon W-2135 processor with the Intel Skylake-X microarchitecture, which prompts the question of whether the advantage of the Two-Pass softmax algorithm is restricted to a specific type of processor. To demonstrate that the Two-Pass algorithm generalizes to other types of processors, we replicated the results of Sec. 6.5 on a Xeon E5-2696 v4 processor with the Intel Broadwell microarchitecture and on a Ryzen 9 3900X with the AMD Zen 2 microarchitecture. Both of these processors support the AVX2, but not the AVX-512, instruction set, and have different cache hierarchy parameters than the Intel Skylake-X system.
Fig. 11 presents the performance of the three softmax algorithms on the Intel Broadwell system. Although the Two-Pass softmax algorithm demonstrates inferior performance on problems that fit into the L2 cache, it gets competitive with the Three-Pass softmax algorithms on L3-sized problems, and outperforms them on out-of-cache problems.
Fig. 12 shows a similar picture on the AMD Zen 2 microarchitecture. Here, the Three-Pass algorithms deliver superior performance as long as the data fits into the L3 cache, but lose to the Two-Pass algorithm when the data exceeds the cache size.
7 Conclusion
We presented a novel Two-Pass algorithm for softmax computation and demonstrated that the new Two-Pass algorithm is up to 28% faster than the traditional Three-Pass algorithm on large input vectors. The algorithm, however, offers no advantage over the reloading variant of the Three-Pass algorithm when the data fits into the processor's cache.
This study focused on performance on a CPU, but the algorithm has great potential for GPUs and hardware AI accelerators. These platforms further shift the balance between compute and memory performance toward expensive memory and cheap floating-point operations, and would favor the reduced memory intensity of the presented Two-Pass softmax algorithm.
To foster reproducibility, we released an open-source implementation of the new Two-Pass softmax algorithm and the other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
Acknowledgements
We thank Matthias Grundmann, Sergey Ioffe, Juhyun Lee, Karthik Raveendran, and Richard Vuduc for helpful feedback and discussion, and Vijay Thakkar for sharing access to the Ryzen 9 3900X system.
Footnotes
 Produces a floating-point infinity, because the result is too large to be represented as a finite single-precision floating-point number.
 Produces 0, because the result is too small to be distinguishable from zero in the single-precision floating-point format.
 Not a Number: a special floating-point value, defined by the IEEE 754 standard, indicating an invalid result.
 Always the case for the $e^x$ evaluation in the Three-Pass softmax algorithms.
References
 (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1, §2.
 (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Cited by: §1.
 (2007) Efficient polynomial $L^{\infty}$-approximations. In 18th IEEE Symposium on Computer Arithmetic (ARITH'07), pp. 169–176. Cited by: §6.3.
 (2013) One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. Cited by: Table 1.
 (2010) Sollya: an environment for the development of numerical codes. In International Congress on Mathematical Software, pp. 28–31. Cited by: §6.3.
 (1980) Software manual for the elementary functions (Prentice-Hall Series in Computational Mathematics). Prentice-Hall, Inc. Cited by: §6.3.
 (2015) Intel AVX-512 instructions and their use in the implementation of math functions. Intel Corporation. Cited by: §6.3.
 (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: Table 1.
 (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §1.
 (2013) Methods for high-throughput computation of elementary functions. In International Conference on Parallel Processing and Applied Mathematics, pp. 86–95. Cited by: §4, §6.3, §6.3.
 (1991) An accurate elementary mathematical library for the IEEE floating point standard. ACM Transactions on Mathematical Software (TOMS) 17 (1), pp. 26–45. Cited by: §4.
 (2016) Deep learning. MIT press. Cited by: §3.
 (2001) Classes for fast maximum entropy training. arXiv preprint cs/0108006. Cited by: §1.
 (2007) Error-free transformations in real and complex floating point arithmetic. In Proceedings of the International Symposium on Nonlinear Theory and its Applications, pp. 341–344. Cited by: §4.
 (2017) Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. Cited by: §1, §2.
 (2016) EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. Cited by: §1.
 (2001) Algorithms for quad-double precision floating point arithmetic. In Proceedings 15th IEEE Symposium on Computer Arithmetic (ARITH-15 2001), pp. 155–162. Cited by: §4.
 (2003) An overview of floating-point support and math library on the Intel XScale architecture. In Proceedings 2003 16th IEEE Symposium on Computer Arithmetic, pp. 122–128. Cited by: §4.
 (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1.
 (2017) In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. Cited by: §1.
 (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410. Cited by: §2.
 (2018) Sigsoftmax: reanalysis of the softmax bottleneck. In Advances in Neural Information Processing Systems, pp. 286–296. Cited by: §2.
 (2019) Simplified hardware implementation of the softmax activation function. In 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1–4. Cited by: §2.
 (2015) Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814. Cited by: §1.
 (2000) IA-64 and elementary functions: speed and precision. Vol. 3, Prentice Hall. Cited by: §4.
 (1995) Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25. Cited by: §6.4, §6.4, §6.6.
 (2010) Handbook of floating-point arithmetic. Cited by: §4, §6.3.
 (2017) Building a web-scale dependency-parsed corpus from Common Crawl. arXiv preprint arXiv:1710.01779. Cited by: Table 1.
 (2016) Holistic SparseCNN: forging the trident of accuracy, speed, and size. arXiv preprint arXiv:1608.01409 1 (2). Cited by: §1.
 (2018) Deep learning inference in facebook data centers: characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886. Cited by: §1.
 (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett (Eds.), pp. 8024–8035. Cited by: §1, §2.
 (2017) SVDsoftmax: fast softmax approximation on large vocabulary neural networks. In Advances in Neural Information Processing Systems, pp. 5463–5473. Cited by: §2.
 (2012) Wikilinks: a large-scale cross-document coreference corpus labeled via links to Wikipedia. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2012-015. Cited by: Table 1.
 (2018) A high speed softmax vlsi architecture based on basicsplit. In 2018 14th IEEE International Conference on SolidState and Integrated Circuit Technology (ICSICT), pp. 1–3. Cited by: §2.
 (2017) High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427. Cited by: §1.
 (2018) A high-speed and low-complexity architecture for softmax function in deep learning. In 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 223–226. Cited by: §2.
 (2016) Efficient hardware architecture of softmax layer in deep neural network. In 2016 29th IEEE International SystemonChip Conference (SOCC), pp. 323–326. Cited by: §2.