SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
1 Introduction
Deep Learning (DL) has become a topic of significant interest in the research community. The last few years have seen remarkable growth in the use of DL to significantly improve the state-of-the-art in many applications, particularly image classification, text classification, and speech recognition.
The Need for Hardware Acceleration: Vast amounts of data, powered by the exponential increase in computing capabilities, have been instrumental in the success of DL. More notably, with the advent of the powerful Graphics Processing Unit (GPU) (Owens et al., 2008), the training of DL models has been drastically accelerated.
Fast matrix multiplication has been heavily researched for the past several decades, and we are now reaching a limit on speeding it up further. Furthermore, the need for astronomically large neural networks and the unprecedented growth in data volumes have worsened this problem. As a result, the community is heavily investing in dedicated hardware to take DL further beyond this point (Jouppi et al., 2017). Designing dedicated hardware is risky because it requires significant investment and time to develop. Moreover, dedicated hardware caters to the specific algorithm for which it is designed, so a change in the state-of-the-art algorithms can render specialized hardware less effective in the future. For DL, however, this investment has been justified by the lack of significant progress in algorithmic alternatives for years.
Unsuccessful Alternatives to Matrix Multiplication: On the algorithmic side, there have been several works on replacing costly matrix multiplication with cheaper algorithmic alternatives (Le Gall, 2014). Unfortunately, we have seen minimal practical benefit from this front. So far, there has been no demonstration, even remotely, that a smart algorithmic implementation in any form can outperform the advantages of hardware acceleration.
Exploiting Adaptive Sparsity in Neural Networks: In popular frameworks like Tensorflow (TF), Sampled Softmax (Jean et al., 2015) is deployed to approximate the full softmax efficiently. While sampled softmax offers computational savings, it has high estimation bias (Blanc and Rendle, 2018), which leads to poor convergence behavior; we verify this empirically in our experiments in section 5. In this paper, we exploit the idea of adaptive sparsity (Blanc and Rendle, 2018), also known as adaptive dropout (Ba and Frey, 2013). The idea stems from several recent observations (Makhzani and Frey, 2015, 2013) that we can accurately train neural networks by selectively sparsifying most of the neurons, based on their activations, during every gradient update. (Srivastava et al., 2014) has also shown that selective sparsification can in fact be superior in accuracy due to implicit regularization. However, selective sparsification does not directly lead to computational savings. (Spring and Shrivastava, 2017b) showed the first possibility of an algorithmically efficient solution by employing Locality Sensitive Hash (LSH) tables to identify a sparse set of neurons efficiently during each update. The proposed algorithm has the added advantage of making gradient updates HOGWILD-style parallel (Recht et al., 2011). Such parallelism does not hurt convergence because extremely sparse and independent updates are unlikely to overlap and cause conflicts of considerable magnitude. Despite these appealing properties, the implementation of (Spring and Shrivastava, 2017b) fails to demonstrate that the computational advantage can be translated into a faster implementation when compared head-to-head with hardware-accelerated matrix multiplication. In particular, it is not clear whether we can design a system that effectively leverages the computational advantage while compensating for the hash table overheads using limited (only a few cores) parallelism.
In this paper, we provide the first such implementation for large fully connected neural networks.
1.1 Our Contributions
Our main contributions are as follows:

We show the first C++ OpenMP based system, SLIDE, with modest multicore parallelism on a standard CPU that can outperform the massive parallelism of a powerful V100 GPU in a head-to-head time-vs-accuracy comparison. This unique possibility arises because the parallelism in SLIDE is naturally asynchronous by design. We provide our code and benchmark scripts for reproducibility.^1
We make several novel algorithmic and data-structural choices in designing the LSH-based sparsification to reduce the computational overhead to just a few memory lookups (truly O(1)), without affecting the convergence of the DL algorithm. The implementation further takes advantage of the sparse gradient updates to achieve negligible update conflicts, which creates ideal settings for Asynchronous SGD (Stochastic Gradient Descent) (Recht et al., 2011). These contributions could be of independent interest in both the LSH and DL literature.

We provide a rigorous evaluation of our system on two large benchmark datasets involving fully connected networks. We show that SLIDE, on a modest CPU, can be up to 2.7x faster, in wall-clock time, than the best possible alternative with the best possible choice of hardware, at any accuracy. We perform a CPU-efficiency analysis of SLIDE using the Intel VTune Performance Analyzer and show that memory-bound inefficiencies decrease for SLIDE with an increasing number of cores, while the opposite holds for TF-CPU.

Our analysis suggests that SLIDE is a memory-bound application, prone to the bottlenecks described in appendix D. With careful workload and cache optimizations (e.g., Transparent Hugepages) and data-access-pattern optimizations (e.g., SIMD instructions), we further speed up SLIDE by roughly 1.3x, making it overall up to 3.5x faster than TF-GPU and over 10x faster than TF-CPU.
2 Locality Sensitive Hashing
Our paper is based on several recent and classical ideas in Locality Sensitive Hashing (LSH) and adaptive dropout in neural networks. LSH is a family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. A popular technique for approximate nearest-neighbor search uses the underlying theory of Locality Sensitive Hashing (Indyk and Motwani, 1998). In formal terms, consider F to be a family of hash functions mapping R^D to some set S.
Definition 2.1 (LSH Family)
A family F is called (S_0, cS_0, p_1, p_2)-sensitive if, for any two points x, y chosen uniformly from R^D, the following hold:

if Sim(x, y) >= S_0 then Pr[h(x) = h(y)] >= p_1;

if Sim(x, y) <= cS_0 then Pr[h(x) = h(y)] <= p_2.
Typically, for approximate nearest-neighbor search, we need p_1 > p_2 and c < 1 to hold. An LSH family allows us to construct data structures that give provably efficient query-time algorithms for the approximate nearest-neighbor problem with the associated similarity measure.
One sufficient condition for a hash family F to be an LSH family is that its collision probability should be monotonically increasing with the similarity, i.e.

    Pr[h(x) = h(y)] = f(Sim(x, y)),    (1)
where f is a monotonically increasing function. In fact, most of the popular known LSH families, such as Simhash (Gionis et al., 1999) and WTA hash (Yagnik et al., 2011; Chen and Shrivastava, 2018), satisfy this strong property. It can be noted that Equation 1 automatically guarantees the two required conditions in Definition 2.1.
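To make the monotonicity property of Equation 1 concrete, the following sketch (illustrative only, not part of the SLIDE codebase) uses the closed-form Simhash collision probability, Pr[h(x) = h(y)] = 1 − θ(x, y)/π with θ the angle between x and y, and checks that it increases with cosine similarity:

```python
import math

def simhash_collision_prob(cos_sim: float) -> float:
    """Closed-form Simhash collision probability: 1 - theta/pi,
    where theta is the angle between the two vectors."""
    theta = math.acos(max(-1.0, min(1.0, cos_sim)))
    return 1.0 - theta / math.pi

# The probability is monotonically increasing in cosine similarity,
# which is exactly the property Equation 1 requires.
sims = [-0.5, 0.0, 0.5, 0.9]
probs = [simhash_collision_prob(s) for s in sims]
assert probs == sorted(probs)
```

Identical vectors collide with probability 1, and antipodal vectors with probability 0, as the formula suggests.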
It was shown in (Indyk and Motwani, 1998) that having an LSH family for a given similarity measure is sufficient for efficiently solving nearestneighbor search in sublinear time.
The Algorithm: The LSH algorithm uses two parameters, (K, L). We construct L independent hash tables. Each hash table has a meta-hash function H that is formed by concatenating K random independent hash functions from the collection F. Given a query, we collect one bucket from each hash table and return the union of the L buckets. Intuitively, the meta-hash function makes the buckets sparse and reduces the number of false positives, because only valid nearest-neighbor items are likely to match all K hash values for a given query. The union of the L buckets decreases the number of false negatives by increasing the number of potential buckets that could hold valid nearest-neighbor items. The candidate generation algorithm works in two phases [see (Spring and Shrivastava, 2017a) for details]:

Preprocessing Phase: We construct L hash tables from the data by hashing all elements of the collection. We store only pointers to the vectors in the hash tables, because storing whole data vectors would be very memory inefficient.

Query Phase: Given a query Q, we search for its nearest neighbors. We report the union of the buckets collected from the L hash tables. Note that we do not scan all the elements; we only probe L different buckets, one bucket per hash table.
After generating the set of potential candidates, the nearest neighbor is computed by comparing the distance between each item in the candidate set and the query.
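The two phases above can be sketched as follows. This is a minimal illustrative implementation in Python (SLIDE itself is C++), using Simhash as the concrete hash family; the names `LSHTables`, `insert`, and `query` are our own:

```python
import random

class LSHTables:
    """Minimal (K, L) LSH sketch with Simhash (sign random projections)."""

    def __init__(self, K, L, dim, seed=0):
        rng = random.Random(seed)
        self.K, self.L = K, L
        # K * L random projection vectors with +/-1 entries.
        self.planes = [[[rng.choice((-1, 1)) for _ in range(dim)]
                        for _ in range(K)] for _ in range(L)]
        self.tables = [dict() for _ in range(L)]

    def _meta_hash(self, table_idx, x):
        # Concatenate K one-bit hashes into one meta-hash code.
        bits = 0
        for plane in self.planes[table_idx]:
            dot = sum(p * xi for p, xi in zip(plane, x))
            bits = (bits << 1) | (dot >= 0)
        return bits

    def insert(self, item_id, x):
        # Preprocessing phase: store only the id, never the vector itself.
        for t in range(self.L):
            self.tables[t].setdefault(self._meta_hash(t, x), set()).add(item_id)

    def query(self, q):
        # Query phase: union of one bucket per table (L probes total).
        out = set()
        for t in range(self.L):
            out |= self.tables[t].get(self._meta_hash(t, q), set())
        return out

tables = LSHTables(K=4, L=8, dim=16, seed=1)
v = [float(i % 3 - 1) for i in range(16)]
tables.insert("a", v)
assert "a" in tables.query(v)  # identical vector hits the same buckets
```

Larger K sparsifies the buckets (fewer false positives), while larger L widens the union (fewer false negatives), exactly as the text describes.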
2.1 LSH for Estimation and Sampling
Although LSH provides provably fast retrieval in sublinear time, it is known to be very slow for accurate search because it requires a very large number of tables, i.e., large L. Also, reducing the overhead of bucket aggregation and candidate filtering is a problem in its own right. Consequent research led to the sampling view of LSH (Spring and Shrivastava, 2017b; Chen et al., 2018, 2019; Luo and Shrivastava, 2018) that alleviates costly searching by efficient sampling, as shown in figure 1. It turns out that merely probing a few hash buckets (as low as 1) is sufficient for adaptive sampling. Observe that an item returned as a candidate from a (K, L)-parameterized LSH algorithm is sampled with probability 1 − (1 − p^K)^L, where p is the collision probability of the LSH function (the sampling probability is monotonic in p). Thus, with the LSH algorithm, the candidate set is adaptively sampled, with a sampling probability that depends on the query and on (K, L).
This sampling view of LSH was the key to the algorithm proposed in (Spring and Shrivastava, 2017b), which showed the first possibility of adaptive dropout in near-constant time, leading to an efficient backpropagation algorithm.
MIPS Sampling
Recent advances in maximum inner product search (MIPS) using asymmetric locality sensitive hashing have made it possible to sample large inner products. Given a collection C of vectors and a query vector Q, using a (K, L)-parameterized LSH algorithm with MIPS hashing (Shrivastava and Li, 2014a), we get a candidate set S. Every element in C gets sampled into S with a probability that is a monotonically increasing function of its inner product with Q. Thus, we can pay a one-time linear cost of preprocessing C into hash tables, and any further adaptive sampling for a query only requires a few hash lookups.
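A quick numeric sketch of the candidate-sampling probability 1 − (1 − p^K)^L (symbols as in the text; the particular K and L values here are illustrative, not the paper's settings):

```python
def sample_prob(p: float, K: int, L: int) -> float:
    """Probability that an item survives (K, L)-parameterized LSH candidate
    generation: 1 - (1 - p^K)^L, with p the per-function collision prob."""
    return 1.0 - (1.0 - p ** K) ** L

# Items similar to the query (high p) are sampled almost surely,
# dissimilar ones almost never -- adaptive sampling for free.
K, L = 6, 50
assert sample_prob(0.9, K, L) > 0.99
assert sample_prob(0.3, K, L) < 0.05
```

The sharp separation between high-p and low-p items is what lets a few bucket probes stand in for an exact search.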
3 Proposed System: SLIDE
3.1 Introduction to the overall system
Before introducing SLIDE in detail, we define important notation: 1) B: input batch size; 2) N_l: the neurons in layer l; 3) x_l: the input to layer l of the network; 4) w_a^l: the weights of neuron a in layer l; 5) h_l: the hash functions in layer l; 6) A_l: the set of active neurons in layer l for the current input.
Initialization: Figure 2 shows the modular structure of SLIDE, and algorithm 1 shows the detailed steps. Every layer object contains a list of neurons and a set of LSH sampling hash tables. Each hash table contains the ids of the neurons that are hashed into its buckets. During network initialization, the weights of the network are initialized randomly. Afterwards, the LSH hash functions h_l are initialized along with the hash tables for each of the layers. For instance, the example network in Figure 2 maintains hash tables in two hidden layers as well as the output layer. The details of using various hash functions are discussed in appendix A. The LSH hash codes of the weight vectors of the neurons in a given layer are computed according to the hash functions h_l, and the id of each neuron is saved into the hash bucket mapped to by the LSH function. This construction of the LSH hash tables in each layer is a one-time operation, which can easily be parallelized with multiple threads operating over different neurons in the layer independently.
Sparse Feed-Forward Pass with Hash Table Sampling: In the feed-forward phase, given a single training instance, we compute the network activations up to the final layer, which gives us the output. In SLIDE, instead of calculating all the activations in each layer, the input to each layer l is fed into the hash functions to compute h_l(x_l). The hash codes serve as a query to retrieve the ids of active (or sampled) neurons from the matching buckets in the hash tables. For example, in figure 3, the hash of the input is first computed and then used to retrieve a small subset of the neurons as the active set. Only the activations of active neurons are calculated and passed on as inputs to the next layer. The other activations are directly treated as 0 and never computed. We describe the design choices that reduce the sampling overheads significantly in section 4.1.
The above-described operations are performed sequentially in every layer, starting from the very first layer, where the input is the data itself. Even in the output layer, which has softmax activation, only neurons sampled from the hash tables are treated as active. For softmax, we compute the output of every active neuron from its logit, normalized by the sum of exponentials. Note that the normalizing constant for softmax is no longer the sum over all neurons but only over the active ones.
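A minimal sketch of this sparse output layer with an active-only softmax (illustrative Python, not SLIDE's C++ implementation; `sample_active` is a hypothetical stand-in for the hash-table query):

```python
import math
import random

def sparse_softmax_layer(x, weights, sample_active):
    """Sketch of SLIDE's sparse softmax output layer: `sample_active`
    stands in for the hash-table lookup and returns a set of neuron ids;
    all other activations are treated as exactly 0 and never computed."""
    active = sample_active(x)
    logits = {a: sum(w * xi for w, xi in zip(weights[a], x)) for a in active}
    # Normalizing constant runs over the active neurons only.
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
out = sparse_softmax_layer([1.0, 0.5, -0.2, 0.3], W,
                           sample_active=lambda x: {1, 4, 7})
assert abs(sum(out.values()) - 1.0) < 1e-9
```

Only the sampled neurons' weight vectors are ever touched, which is where the memory-access savings in the experiments come from.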
Sparse Backpropagation or Gradient Update: The backpropagation step follows the feed-forward step. After computing the output of the network, we compare it with the known label of the input and backpropagate the errors layer-by-layer to calculate the gradients and update the weights. Here we use the classical message-passing implementation of backpropagation rather than a vector-multiplication-based one. For every training instance, after updating the weights of a given neuron, the neuron propagates the partial gradients (using error propagation) back only to the active neurons in the previous layer via the connecting weights. As a result, we never access any non-active neuron, or any non-active weight that is not part of the feed-forward process for a given input. This ensures that we take full advantage of sparsity: our computation over each input is only of the order of the number of active neurons and weights, rather than the total number of parameters. It should be noted that if we compute activations for a fraction p of neurons in each layer (on average), the fraction of weights that needs to be updated is only about p^2, which is a significant reduction when p is small (as is the case in our experiments).
Update Hash Tables after Weight Updates: After the weights are updated, we need to modify the positions of neurons in the hash tables accordingly. Updating neurons typically involves deletion from the old bucket followed by an addition to the new bucket, which can be expensive. We discuss several design tricks that we use to overcome this overhead of updating hash tables in section 4.2.
OpenMP Parallelization across a Batch: For any given training instance, both the feed-forward and backpropagation operations are sequential, as they need to be performed layer by layer. SLIDE uses the usual batch gradient descent with the Adam optimizer, where the batch size is generally on the order of hundreds. Each data instance in the batch runs in a separate thread, and its gradients are computed in parallel. To ensure the independence of computation across different threads, every neuron stores two additional arrays, each of whose length is equal to the batch size. These arrays keep track of the input-specific neuron activations and error gradients. Every input is assigned an id, which is used as an index to locate its activation (or error gradient) on any neuron. In addition, each neuron has a bit array to determine whether a particular input activates it or not. This small memory overhead is negligible on CPUs, but it ensures that the gradient computation is independent across different instances in the batch.
The extreme sparsity and randomness in gradient updates allow us to asynchronously parallelize the accumulation step of the gradient across different training data without leading to a considerable amount of overlapping updates. SLIDE heavily capitalizes on the theory of HOGWILD (Recht et al., 2011), which shows that a small amount of overlap is tolerable. It does not hurt the convergence even if we resolve the concurrent updates randomly. Thus, after independently computing the gradients, each thread pushes the updates directly to the weights asynchronously. This asynchronous update avoids synchronization during batch accumulation which is otherwise sequential in the batch.
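The following toy sketch (illustrative Python; real SLIDE uses OpenMP threads in C++, and the names here are our own) mimics HOGWILD-style updates: each thread writes a sparse set of gradient contributions without any locks, and the sparsity keeps the number of potentially conflicting entries small relative to the parameter count:

```python
import random
import threading

# Each thread computes gradients for a sparse, random subset of weights
# and pushes them without any synchronization. With high sparsity,
# concurrent updates rarely touch the same entries.
DIM, SPARSITY, THREADS = 100_000, 100, 8
weights = [0.0] * DIM

def worker(seed):
    rng = random.Random(seed)
    active = rng.sample(range(DIM), SPARSITY)  # sparse active set
    for idx in active:
        weights[idx] += 0.01  # unsynchronized gradient push

threads = [threading.Thread(target=worker, args=(s,)) for s in range(THREADS)]
for t in threads: t.start()
for t in threads: t.join()

# At most THREADS * SPARSITY entries out of DIM are touched at all,
# so the chance of two threads colliding on the same weight is tiny.
touched = sum(1 for w in weights if w != 0.0)
assert touched <= THREADS * SPARSITY
```

The HOGWILD analysis (Recht et al., 2011) is what licenses tolerating the rare overlaps that do occur.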
In section 5.3, we observe that, due to this asynchronous choice, we obtain near-perfect scaling of our implementation with an increasing number of cores. Such scaling is particularly exciting because even the highly optimized implementation of TF on CPUs shows poor scaling behavior beyond 16 cores.
3.2 Details of Hash Functions and Hash Tables
SLIDE provides a natural trade-off between the efficiency of retrieving active neurons and the quality of the retrieved ones. To facilitate this, we have three tunable parameters: K, L, and the bucket size. As mentioned in section 2, L is the number of hash tables. To determine which bucket to choose, we use K hash codes for each hash table. Hence, SLIDE generates K x L randomized hash functions, all belonging to one hash family, for each layer. In every bucket in a hash table, the number of entries is limited to a fixed bucket size. Such a limit helps with memory usage and also balances the load on threads during the parallel aggregation of neurons.
In our implementation of SLIDE, we support four types of hash functions from the LSH family: 1) Simhash, 2) WTA hash, 3) DWTA hash, and 4) Minhash. Each of these hash families preserves a different similarity and hence is useful in different scenarios. We discuss the implementation details of Simhash and DWTA hash below and the others in appendix A. SLIDE also provides an interface to add customized hash functions as needed.
Simhash (Gionis et al., 1999): Simhash is a popular LSH for the cosine similarity measure. We use K x L pre-generated random vectors with components taking only the three values {+1, 0, -1}. The reason for using only +1 and -1 is fast implementation: the inner products then require additions rather than multiplications, reducing the computation and speeding up the hashing process. To further optimize the cost of Simhash in practice, we can adopt the sparse random projection idea (Li et al., 2006). A simple implementation is to treat the random vectors as sparse vectors and store their nonzero indices in addition to the signs. For instance, let the input vector for Simhash be in R^D. If we want to retain only a fraction s of nonzero components, we may uniformly generate sD indices from {1, ..., D}. In this way, the number of addition operations for one inner product during the generation of hash codes reduces from D to sD. Since the random indices are produced by a one-time generation, their cost can be ignored.
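A sketch of this sparse-random-projection Simhash (illustrative Python; the factory name and parameters are our own, not SLIDE's API):

```python
import random

def make_sparse_simhash(dim, k, sparsity, seed=0):
    """Each projection keeps only about `sparsity * dim` nonzero +/-1
    entries, stored as (index, sign) pairs, so hashing needs only
    additions over those indices rather than a full d-dim inner product."""
    rng = random.Random(seed)
    nnz = max(1, int(sparsity * dim))
    projections = [[(i, rng.choice((-1, 1)))
                    for i in rng.sample(range(dim), nnz)] for _ in range(k)]

    def hash_fn(x):
        code = 0
        for proj in projections:
            acc = sum(sign * x[i] for i, sign in proj)  # additions only
            code = (code << 1) | (acc >= 0)
        return code

    return hash_fn

h = make_sparse_simhash(dim=1000, k=8, sparsity=0.1)
rng = random.Random(1)
v = [rng.uniform(-1, 1) for _ in range(1000)]
assert h(v) == h(v)                      # deterministic
assert h(v) == h([2.0 * x for x in v])   # sign codes are scale-invariant
```

The scale-invariance check reflects that Simhash depends only on angles, i.e., on cosine similarity.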
DWTA hash (Chen and Shrivastava, 2018): DWTA hash transforms the input feature space into binary codes such that the Hamming distance in the resulting space closely correlates with a rank-based similarity measure for sparse data. We generate a number of permutations, and every permutation is split into several bins. DWTA loops through all the nonzero (NNZ) indices of the sparse input; for each of them, we update the current maximum index of the corresponding bin according to the mapping in each permutation. It should be noted that the number of comparisons and memory lookups in this step is proportional to the number of nonzeros rather than the dimensionality, which is significantly more efficient than simply applying WTA hash to sparse input. For empty bins, the densification scheme proposed in (Chen and Shrivastava, 2018) is applied.
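A simplified DWTA-style sketch (one permutation split into bins, no densification of empty bins; all names are our own) showing the loop over only the nonzero indices of a sparse input:

```python
import random

def make_dwta_hash(dim, num_bins, seed=0):
    """Simplified DWTA-style sketch: one permutation of the feature
    indices is split into `num_bins` bins; for a sparse input we loop
    only over its nonzero indices and record, per bin, the index whose
    value is largest. Empty bins are left as None here, whereas the
    real scheme densifies them (Chen and Shrivastava, 2018)."""
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    # Precompute which bin each feature index falls into.
    bin_of = {idx: pos * num_bins // dim for pos, idx in enumerate(perm)}

    def hash_fn(sparse_x):  # sparse_x: dict {index: value}
        best = [None] * num_bins              # argmax index per bin
        best_val = [float("-inf")] * num_bins
        for idx, val in sparse_x.items():     # work proportional to nnz
            b = bin_of[idx]
            if val > best_val[b]:
                best_val[b], best[b] = val, idx
        return tuple(best)

    return hash_fn

h = make_dwta_hash(dim=1000, num_bins=4)
x = {3: 0.5, 17: 2.0, 960: -1.0}
assert h(x) == h({k: 10 * v for k, v in x.items()})  # rank-based, scale-free
```

Because only the per-bin argmax matters, the codes are invariant to any monotone rescaling of the input, which is the rank-similarity property the text describes.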
4 Reducing Overhead
4.1 Sampling Overhead
The key idea of using LSH for adaptive sampling of neurons is sketched in section 3.1. We have designed three strategies to sample neurons with large activations: 1) Vanilla Sampling, 2) Top-K Sampling, and 3) Hard Thresholding. We introduce them here and discuss their utility and efficiency in appendix B.
Vanilla Sampling: Denote by m the number of active neurons we target to retrieve in a layer. After computing the hash codes of the input, we randomly choose a table and retrieve only the neurons in its corresponding bucket. We continue retrieving neurons from further randomly chosen tables until m neurons are selected or all the tables have been looked up. Suppose we retrieve from t tables in total; then the probability that a particular neuron gets chosen is 1 − (1 − p^K)^t, where p is the collision probability of the LSH function that SLIDE uses. The time complexity of vanilla sampling is O(m).
Top-K Sampling: The basic idea of this strategy is to select the neurons that occur most frequently across all hash tables. After querying with the input, we first retrieve all the neurons from the corresponding bucket in each hash table and aggregate their frequencies across tables. The frequencies are sorted, and only the neurons with the top frequencies are selected. This requires additional space for maintaining the frequency hashmap and additional time for both aggregation and sorting.
Hard Thresholding: In this strategy, we bypass the sorting step of Top-K sampling by selecting the neurons that appear at least a threshold number of times in the retrieved buckets. Here, the probability that a neuron gets chosen is the probability that it collides with the query in at least that many of the L tables.
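The three strategies can be sketched as follows, operating on the buckets already retrieved from the hash tables (illustrative Python; all function names are ours):

```python
from collections import Counter
import random

def vanilla_sampling(buckets, target, rng=None):
    """Scan tables in random order until `target` neurons are collected
    or every table has been visited."""
    rng = rng or random.Random(0)
    chosen = set()
    for bucket in rng.sample(buckets, len(buckets)):
        chosen |= set(bucket)
        if len(chosen) >= target:
            break
    return chosen

def topk_sampling(buckets, k):
    """Keep the k neurons appearing most frequently across all tables
    (requires a frequency map plus a sort)."""
    freq = Counter(n for bucket in buckets for n in bucket)
    return {n for n, _ in freq.most_common(k)}

def hard_threshold_sampling(buckets, min_count):
    """Keep neurons appearing in at least `min_count` buckets; no sort."""
    freq = Counter(n for bucket in buckets for n in bucket)
    return {n for n, c in freq.items() if c >= min_count}

buckets = [[1, 2, 3], [2, 3], [3, 4], [3]]
assert topk_sampling(buckets, 1) == {3}
assert hard_threshold_sampling(buckets, 2) == {2, 3}
assert len(vanilla_sampling(buckets, 2)) >= 2
```

Hard thresholding trades the exact top-m guarantee for the removal of the sorting step, which is the efficiency distinction appendix B elaborates on.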
4.2 Updating Overhead
We introduce the following heuristics to address the expensive cost of updating the hash tables:
1) Recomputing the hash codes after every gradient update is computationally very expensive. Therefore, we dynamically change the update frequency of the hash tables to reduce the overhead. Assume that we update the hash tables for the first time after t_0 iterations, and let i be the number of times the hash tables have already been updated. We apply exponential decay to the update frequency, so that the gap before the next hash table update grows like t_0 e^{λi}, where λ is a tunable decay constant. The intuition behind this scheme is that the gradient updates in the initial stage of training are much larger than those in the later stage, especially close to convergence.
2) SLIDE needs a policy for adding a new neuron to a bucket that is already full. We use the same solution as (Wang et al., 2018), which employs Vitter's reservoir sampling algorithm (Vitter, 1985) as the replacement strategy. It was shown that reservoir sampling retains the adaptive sampling property of LSH tables, making the process sound. Additionally, we implement a simpler alternative policy based on FIFO (First In, First Out).
3) For Simhash, the hash codes are computed from the signs of the inner products between the weight vector and the random projection vectors. During backpropagation, only the weights connecting the active neurons across layers get updated, and only those weights contribute to changes in the hash codes. Therefore, we can also memorize the inner products in addition to the hash codes. When a weight vector gets updated in only a small number of its dimensions, we only need additions proportional to the number of updated dimensions, rather than the full dimensionality, to compute the new hash codes.
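The first two heuristics can be sketched as follows (illustrative Python; `t0` and `lam` correspond to the initial update period and decay constant from heuristic 1, and the exact schedule form is our assumption):

```python
import math
import random

def rebuild_iterations(t0, lam, num_rebuilds):
    """Exponentially decaying update schedule: the gap before the i-th
    hash-table rebuild grows like t0 * e^(lam * i)."""
    its, t = [], 0
    for i in range(num_rebuilds):
        t += int(t0 * math.exp(lam * i))
        its.append(t)
    return its

def reservoir_insert(bucket, capacity, neuron_id, count, rng):
    """Reservoir-sampling replacement for a full bucket: the count-th
    arriving id displaces a random occupant with prob. capacity/count."""
    if len(bucket) < capacity:
        bucket.append(neuron_id)
    elif rng.random() < capacity / count:
        bucket[rng.randrange(capacity)] = neuron_id

its = rebuild_iterations(t0=50, lam=0.3, num_rebuilds=4)
gaps = [b - a for a, b in zip([0] + its, its)]
assert gaps == sorted(gaps)  # rebuilds get progressively rarer

rng = random.Random(0)
bucket = []
for n in range(1, 100):
    reservoir_insert(bucket, 8, n, n, rng)
assert len(bucket) == 8      # bucket never exceeds its capacity
```

Reservoir sampling keeps each of the `count` candidates equally likely to occupy the bucket, which is what preserves the adaptive sampling property cited above.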
5 Evaluations
In this section, we empirically investigate SLIDE's performance on multiple fronts: 1) SLIDE against TF-GPU with a V100; 2) SLIDE against TF-CPU; 3) SLIDE's adaptive sampling against sampled softmax (plain random sampling); 4) scalability against TF-CPU with CPU core count; 5) the effect of batch size; and 6) the benefits of our design choices. While we focus on evaluating the basic aspects of SLIDE, we additionally perform several CPU optimizations, such as support for Kernel Hugepages to reduce cache misses, which further improve SLIDE's performance. The optimization details are given in appendix D, and the improvement in performance is shown in section 5.4.
Table 1: Statistics of the datasets.

Dataset         | Feature Dim | Feature Sparsity | Label Dim | Training Size | Testing Size
Delicious-200K  | 782,585     | 0.038 %          | 205,443   | 196,606       | 100,095
Amazon-670K     | 135,909     | 0.055 %          | 670,091   | 490,449       | 153,025
Fully-Connected Large Architecture: Fully connected networks are common in most applications. To show SLIDE's real advantage, we need large networks where even a slight decrease in performance is noticeable. Thus, the publicly available extreme classification datasets, which require more than 100 million parameters to train due to their extremely wide last layers, fit this setting appropriately. For these tasks, the vast majority of the computation is in the final layer.
Datasets: We employ two large real datasets, Delicious-200K and Amazon-670K, from the Extreme Classification Repository (Bhatia et al.). The Delicious-200K dataset is subsampled from a vast corpus of almost 150 million bookmarks from Social Bookmarking Systems (del.icio.us). The Amazon-670K dataset is a product-to-product recommendation dataset with 670K labels. The statistics of the datasets are included in Table 1.
Infrastructure: All the experiments are conducted on a server equipped with two 22-core/44-thread processors (Intel Xeon E5-2699A v4 @ 2.40GHz) and one NVIDIA Tesla V100 Volta 32GB GPU. The server runs Ubuntu 16.04.5 LTS with TF-GPU 1.12 installed. We compiled TF-CPU 1.12 from source with GCC 5.4 in order to support FMA, AVX, AVX2, SSE4.1, and SSE4.2 instructions, which boost the performance of TF-CPU. SLIDE is written in C++ and compiled under GCC 5.4 with the OpenMP flag. The most exciting part is that SLIDE uses only vanilla CPU thread parallelism and yet outperforms TF-GPU (V100) by a large margin.
Baselines: We benchmark the tasks with our system, SLIDE, and compare against the highly optimized TF framework on both CPU and GPU. Specifically, the comparison is between the same tasks, with the exact same architecture, running on TF-CPU and TF-GPU. The optimizer and the learning hyperparameters (details later) are also the same, to avoid unfair comparisons. Most of the computation in our architecture is in the softmax layer. Hence, to corroborate the advantage of adaptive sampling (Yen et al., 2018) over vanilla sampling, we also compare against the popular sampled softmax algorithm (Jean et al., 2015), which is a fast proxy for the full softmax. We use the optimized sampled softmax functionality provided in TF-GPU. This comparison sheds light on the necessity of LSH-based, input-dependent adaptive sampling compared to a static sampling scheme, which is the only other alternative in practice.
Hyperparameters: For both datasets, we adopt the same model architecture as (Yen et al., 2018): a standard fully connected neural network with one hidden layer of size 128. We choose a batch size of 128 for the Delicious-200K dataset and 256 for the Amazon-670K dataset, as the input dimension of the former is very large. We run all algorithms until convergence. To quantify the superiority of SLIDE over the baselines, we use the same optimizer, Adam (Kingma and Ba, 2014), with the initial step size tuned for the best convergence in all experiments. For SLIDE, we maintain the hash tables for the last layer, where the computational bottleneck of these models lies. For the LSH setting, we choose Simhash for the Delicious-200K dataset and DWTA hash for the Amazon-670K dataset. We update the hash tables with a fixed initial update period and then exponentially decaying frequency (section 4.1).
Main Results: We show the time-wise and iteration-wise comparisons of SLIDE vs. TF-GPU/CPU in Figure 5. Note that the x-axis is on a log scale, and all the curves have a long, flat converged portion when plotted on a linear scale, indicating clear convergence behavior. Red, blue, and black lines represent the performance of SLIDE, TF-GPU, and TF-CPU, respectively. We can see from the plots that SLIDE on CPU achieves any given accuracy faster than TF on the V100. TF-GPU is always faster than TF-CPU, which is expected. It should be noted that these datasets are very sparse, e.g., the Delicious dataset has only 75 nonzero input features on average, and hence the advantage of GPU over CPU is not always noticeable.
SLIDE is around 1.8 times faster than TF-GPU on Delicious-200K. On the larger Amazon-670K dataset, where more computation is needed, the gains are substantially larger: SLIDE is around 2.7 times (2 hrs vs. 5.5 hrs) faster than TF-GPU. Most of the computational benefit of SLIDE comes from sampling a small subset of active neurons in the output layer. After a few iterations into the training process, only a small fraction of the output-layer neurons are sampled on average for either dataset, and with so few active neurons SLIDE outperforms TF-GPU on time by a huge margin. It is interesting to note that even after compiling TF-CPU with AVX2 instructions, it is nowhere close to the performance of SLIDE or TF-GPU (SLIDE is 8x faster than TF-CPU). Therefore, it is exciting that, without any rigorous optimization in our prototype, SLIDE outperforms both baselines using smart randomized algorithms with OpenMP parallelism.
For the Iteration vs. Accuracy plots in Figure 5, we can observe that SLIDE achieves the same accuracy per iteration even though it adaptively selects neurons in some layers. This observation confirms that adaptively selecting neurons and performing asynchronous SGD does not hurt convergence from an optimization perspective. The plots also confirm that the advantage of SLIDE is not due to any bells and whistles in the optimization process, as the convergence with iterations shows very similar behavior. For these plots, we only show TF-GPU, as the curve for TF-CPU would be identical because the optimization algorithm is the same.
Since SLIDE performs far fewer computations and memory accesses in the last layer, each iteration is faster than in the baselines. This is the critical reason why SLIDE outperforms the baselines in wall-clock time.
Table 2: CPU core utilization (%) with 8, 16, and 32 threads.

Framework       | 8  | 16 | 32
Tensorflow-CPU  | 45 | 35 | 32
SLIDE           | 82 | 81 | 85
Inefficiency Diagnosis: We profile and analyze TF-CPU and SLIDE with a state-of-the-art parallel performance analyzer, the Intel VTune Performance Analyzer (Malladi, 2009). Table 2 shows the core utilization comparison between both frameworks using 8, 16, and 32 threads for the above tasks. We can see that for TF-CPU, the utilization is generally low (at most 45%) and decreases further with more threads. For SLIDE, the core utilization is stable (above 80%) across all thread counts presented in table 2.
Figure 6 presents the distribution of inefficiencies in CPU usage for TF-CPU and SLIDE. Consistent with the core utilization numbers, the overall inefficiencies of TF-CPU are much greater than those of SLIDE. Figure 6 provides a detailed breakdown of the different types of inefficiency. Clearly, being memory-bound is the major issue for any number of threads in the histogram: the biggest bottleneck is that a significant fraction of execution pipeline slots are stalled due to demand memory loads and stores. Observe that the more cores TF-CPU uses, the more memory-bound it gets.
On the other hand, the more cores SLIDE uses, the less memory-bound it becomes. Recall that the critical advantage of SLIDE is that it has far fewer active neurons and sparse gradient updates. Naturally, memory accesses are far fewer than in TF-CPU due to the very sparse memory accesses within each thread. Our choice of using extra arrays to separate the computations of each thread, with asynchronous gradient updates (section 3.1) across all threads, ensures that simple OpenMP parallelism is sufficient to reach near-peak utilization.
5.1 Comparison with other Heuristics
During full softmax training in Tensorflow, for every training instance, the logits (the output of the last layer before applying the softmax function) must be computed for all classes. This step is followed by computing the softmax (normalized exponential) of the logits. In extreme classification tasks (with a large number of classes), computing these logits gets expensive. Therefore, there has been a line of research on reducing this cost (Mikolov et al., 2013; Bengio; Gutmann and Hyvärinen, 2010). The most common methods are sampling-based (with static sampling weights), shortlisting a candidate set of classes for every batch of training data. By doing this, the number of computed logits is reduced significantly. Due to its popularity, Tensorflow supports an optimized implementation of sampled softmax (Jean et al., 2015).
We explore how sampled softmax on Tensorflow-GPU performs compared to SLIDE. The LSH sampling process in SLIDE is in principle very similar to sampled softmax, but with sampling probabilities changing dynamically with the input. We adopt exactly the same settings as in the previous section for the experiments. For sampled softmax, we try various numbers of samples. However, with a number of samples comparable to the average number of classes SLIDE samples on these datasets, sampled softmax leads to poor accuracy. We empirically observe that we have to sample a much larger fraction of the total number of classes to obtain any decent accuracy.
The results are shown in Figure 7. The red lines represent SLIDE, and the green lines represent sampled softmax on Tensorflow-GPU. Both time-wise and iteration-wise, the red lines outperform the green lines significantly. Sampled softmax uses static sampling strategies, which are fast, whereas SLIDE uses adaptively changing hash tables for input-specific dynamic sampling. Unfortunately, the uninformative static sampling of sampled softmax leads to poor accuracy, as shown in the plot. Note that in these plots, sampled softmax uses significantly more neurons than SLIDE and still shows poor convergence behavior.
Figure 7 clearly confirms the need for adaptive sampling of neurons (in proportion to input-dependent activations) when sparsifying neural networks in order to retain good convergence. This phenomenon supports our choice of LSH-based adaptive sampling.
5.2 Effect of Batch Size
Batch size is a crucial parameter that affects both training speed and model quality in machine learning. In general, a large batch size may help reduce the training time per epoch, as we process more gradient updates at a time (Goyal et al., 2017). But large batches are known to be bad from an optimization perspective, as they reduce the generalization capability (Keskar et al., 2016). In extreme classification datasets, the number of computations performed is huge owing to the large input dimension and the large number of classes. Hence, a larger batch size may not necessarily translate into faster training per epoch. To clarify this, we study the effect of varying the batch size on the results, choosing the larger Amazon-670K dataset for this task. Please note that when the batch size is larger than the number of threads, the default scheduling type of OpenMP is static.
In Figure 8, we observe that SLIDE outperforms Tensorflow-GPU by a significant margin irrespective of the batch size. This can be attributed to the fact that SLIDE performs very few computations per instance. Our data structures allow us to process all samples in a batch in parallel, and gradient updates are made asynchronously across threads as described in section 3.1. This enables effective use of parallel threads and is reflected in the superior performance over Tensorflow. It is interesting to note that the gap between SLIDE and Tensorflow widens as the batch size grows from 64 to 256.
5.3 Scalability Tests
In this section, we study the effect of increasing the number of CPU cores on the scalability of SLIDE and Tensorflow-CPU. We also want to determine how many cores SLIDE needs to outperform Tensorflow. As mentioned before, our machine has 44 cores, and each core can run 2 threads. However, we disable hyper-threading, so the effective numbers of threads and cores are the same. Hence, we use the words “threads” and “cores” interchangeably from here on. We benchmark both frameworks with 2, 4, 8, 16, 32, and 44 threads.
For each number of threads, we run the same classification experiments on SLIDE and Tensorflow-CPU for both datasets and clock the corresponding convergence time. Figure 9 presents the results. The red, blue, and black lines represent SLIDE, Tensorflow-GPU, and Tensorflow-CPU, respectively. Note that the blue line is flat because the GPU computations were done on a V100 with thousands of cores and are mostly oblivious to the number of CPU cores. As the number of cores increases, the convergence time for both SLIDE and Tensorflow-CPU decreases. This decrease is expected due to the added parallelism on each training batch. For the Delicious dataset, the red and black lines cross at around 8 cores, which means that with more than 8 cores, SLIDE beats Tensorflow-CPU. The red and blue lines intersect between 16 and 32 cores. Hence, with fewer than 32 cores, SLIDE outperforms Tensorflow-GPU on the Delicious dataset. Similarly, for the larger Amazon dataset, the red and black lines never intersect, and the red and blue lines intersect at 8 cores. This means that SLIDE beats Tensorflow-GPU with as few as 8 CPU cores and Tensorflow-CPU with as few as 2 CPU cores.
5.4 Additional Speedup with Threading Model and Platform Microarchitecture
In this section, we apply several CPU optimizations outlined in Appendix D to reduce cache misses. We first enable Hugepages support on Ubuntu, which offers 2MB and 1GB memory pages in addition to the default 4KB ones. We preallocate 1000 2MB Hugepages and 10 1GB Hugepages, which we find to be enough for both the Delicious-200K and Amazon-670K datasets. To resolve the issue of false sharing with OpenMP multithreading, we make our data structures align on cache line boundaries. Besides using Hugepages, we also use SIMD instructions (specifically, Intel AVX) to facilitate per-thread batching. In Figure 10, we compare the benefit of the aforementioned optimizations against unoptimized SLIDE and TF-GPU. We notice that cache-optimized SLIDE (in green) is roughly 1.3x faster than basic SLIDE (in red): the existing 2.7x speedup over TF-GPU on Amazon-670K translates to a 3.5x speedup over TF-GPU and a 10x speedup over TF-CPU.
In Appendix D.1, we measure the impact of Hugepages on various CPU-counter metrics such as TLB miss rates and page faults. Concisely, we notice that Hugepages reduce the misses by a large margin.
6 Conclusion and Future Work
We provide the first evidence that a smart algorithm with modest CPU OpenMP parallelism can outperform the best available hardware, an NVIDIA V100 GPU, for training large deep learning architectures. Our system SLIDE combines carefully tailored randomized hashing algorithms with the right data structures that allow asynchronous parallelism. We show up to a 3.5x gain over TF-GPU and a 10x gain over TF-CPU in training time, with similar precision, on popular extreme classification datasets. Our next step is to extend SLIDE to include convolutional layers. SLIDE has unique benefits when it comes to random memory accesses and parallelism. We anticipate that a distributed implementation of SLIDE would be very appealing because the communication costs are minimal due to sparse gradients.
7 Acknowledgements
The work was supported by NSF-1652131, NSF-BIGDATA 1838177, AFOSR-YIP FA9550-18-1-0152, an Amazon Research Award, and an ONR BRC grant for Randomized Numerical Linear Algebra.
Appendix A Different Hash Functions
Signed Random Projection (Simhash): Refer to (Gionis et al., 1999) for the theory behind Simhash. We use a set of pre-generated random vectors whose components take only the three values {+1, 0, −1}. The reason for using only ±1 and 0 is fast implementation: hashing then requires additions rather than multiplications, reducing the computation and speeding up the hashing process. To further reduce the cost of Simhash in practice, we can adopt the sparse random projection idea (Li et al., 2006). A simple implementation is to treat the random vectors as sparse vectors and store only their nonzero indices together with the corresponding signs. The number of operations in each inner product during hash-code generation then drops from the full input dimension to the number of stored nonzero indices. Since the random indices come from a one-time generation, that cost can be safely ignored.
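A minimal sketch of the scheme above, assuming {+1, 0, −1} components with only nonzero indices and signs stored; this is an illustrative Python version, not SLIDE's C++ implementation, and the parameter names are ours.

```python
# Sketch of sparse signed random projection (Simhash): each hash bit is the
# sign of a sparse inner product computed with signed additions only.
import random

def make_sparse_projections(dim, num_bits, nnz, seed=0):
    rng = random.Random(seed)
    # One projection per bit: nnz random (index, sign) pairs, sign in {+1, -1};
    # all other components are implicitly 0.
    return [[(rng.randrange(dim), rng.choice((1, -1))) for _ in range(nnz)]
            for _ in range(num_bits)]

def simhash(x, projections):
    code = 0
    for bit, proj in enumerate(projections):
        # Inner product reduces to signed additions over the stored indices.
        s = sum(sign * x[idx] for idx, sign in proj)
        if s >= 0:
            code |= 1 << bit
    return code

projs = make_sparse_projections(dim=8, num_bits=4, nnz=3)
h1 = simhash([1.0] * 8, projs)
h2 = simhash([1.0] * 8, projs)  # same input, same projections -> same code
```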
Winner Takes All Hashing (WTA hash): In SLIDE, we slightly modify the WTA hash algorithm from (Yagnik et al., 2011) for memory optimization. Originally, WTA stores one random permutation of the input dimensions per hash code, where the number of codes is an adjustable hyperparameter. We instead generate far fewer permutations: every permutation is split evenly into parts (bins), and each bin is used to generate one WTA hash code, proportionally reducing the space needed for storing the permutations. Computing the WTA hash codes takes time linear in the permutation length.
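The bin-splitting variant above can be sketched as follows; this is an illustrative Python sketch under our own naming, not the modified implementation inside SLIDE.

```python
# Sketch of memory-optimized WTA hashing: one long random permutation is
# split evenly into bins, and each bin yields one hash code, namely the
# offset (within the bin) of the largest input coordinate.
import random

def wta_codes(x, permutation, num_bins):
    bin_size = len(permutation) // num_bins
    codes = []
    for b in range(num_bins):
        bin_idx = permutation[b * bin_size:(b + 1) * bin_size]
        # Code = position within the bin of the maximum value (winner).
        codes.append(max(range(bin_size), key=lambda j: x[bin_idx[j]]))
    return codes

rng = random.Random(1)
perm = list(range(8))
rng.shuffle(perm)
codes = wta_codes([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6], perm, num_bins=2)
```

Since each code depends only on the rank ordering of coordinates, WTA codes are invariant to any monotone rescaling of the input, which is the "comparative reasoning" property of (Yagnik et al., 2011).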
Densified Winner Takes All Hashing (DWTA hash): As argued in (Chen and Shrivastava, 2018), when the input vector is very sparse, WTA hashing no longer produces representative hash codes. Therefore, we use DWTA hashing, the solution proposed in (Chen and Shrivastava, 2018). Similar to WTA hash, we generate a set of permutations, and every permutation is split into bins. DWTA loops through only the nonzero (NNZ) indices of the sparse input; for each of them, it updates the current maximum of the corresponding bins according to the mapping in each permutation.
Note that the number of comparisons and memory lookups in this step scales with the number of nonzero entries rather than the full dimension, which is significantly more efficient than simply applying WTA hash to sparse input. For empty bins, the densification scheme proposed in (Chen and Shrivastava, 2018) is applied.
Densified One Permutation Minwise Hashing (DOPH): The implementation mostly follows the description of DOPH in (Shrivastava and Li, 2014b). DOPH is mainly designed for binary inputs. However, the inputs to each layer are unlikely to be binary. We therefore use a thresholding heuristic to transform the input vector into a binary representation before applying DOPH: the top-valued dimensions of the input vector are converted to 1s, and the rest become 0s.
We could use sorting algorithms to get the top indices, but full sorting induces unnecessary overhead. Therefore, we keep a priority queue with indices as keys and the corresponding data values as values, which only needs logarithmic work per element to maintain the current top candidates.
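The thresholding heuristic above can be sketched with a heap-based selection; this is an illustrative Python sketch (using `heapq` in place of an explicit priority queue), with names of our choosing.

```python
# Sketch of the DOPH binarization heuristic: set the top-k coordinates of a
# dense vector to 1 and the rest to 0, using heap-based selection instead of
# a full sort.
import heapq

def binarize_top_k(x, k):
    # heapq.nlargest maintains a size-k heap internally, avoiding a full sort.
    top = heapq.nlargest(k, range(len(x)), key=lambda i: x[i])
    out = [0] * len(x)
    for i in top:
        out[i] = 1
    return out

bits = binarize_top_k([0.2, 0.9, 0.1, 0.7, 0.3], k=2)  # indices 1 and 3 win
```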
Appendix B Reducing the Sampling Overhead
The key idea of using LSH for adaptive sampling of neurons with large activations is sketched in the ‘Introduction to overall system’ section in the main paper. We have designed three strategies to sample large inner products: 1) Vanilla Sampling, 2) Top-K Sampling, and 3) Hard Thresholding. We first introduce them one after the other and then discuss their utility and efficiency. Further experiments are reported in Appendix C.
Vanilla Sampling: Suppose we target retrieving a fixed number of active neurons in a layer. After computing the hash codes of the input, we randomly choose a table and retrieve only the neurons in the corresponding bucket of that table. We continue retrieving neurons from further randomly chosen tables until the target number of neurons is selected or all the tables have been looked up. Assume we end up retrieving from t tables in total. Formally, the probability that a given neuron gets chosen is,

P(chosen) = 1 − (1 − p^K)^t,   (2)

where p is the collision probability of the LSH function that SLIDE uses and K is the number of hash functions per table, so p^K is the probability that the neuron lands in the queried bucket of a single table. For instance, if Simhash is used, p = 1 − θ/π, where θ is the angle between the input and the neuron's weight vector.
From the previous process, we can see that the time complexity of vanilla sampling is proportional to the number of neurons retrieved plus the number of tables probed.
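The retrieval loop above can be sketched directly; this is an illustrative Python sketch with hypothetical table and code structures, not SLIDE's hash-table implementation.

```python
# Sketch of Vanilla Sampling: probe hash tables in random order and stop as
# soon as enough distinct neurons have been retrieved, so the cost depends on
# the tables probed, not on the total number of neurons.
import random

def vanilla_sample(tables, query_codes, target, rng):
    active = set()
    order = list(range(len(tables)))
    rng.shuffle(order)                    # random table order
    for t in order:
        bucket = tables[t].get(query_codes[t], [])
        active.update(bucket)             # collect neurons from this bucket
        if len(active) >= target:         # early stop once target is met
            break
    return active

rng = random.Random(0)
# Toy setup: each table maps a hash code to a bucket of neuron ids.
tables = [{0: [1, 2]}, {0: [2, 3]}, {0: [4]}]
query_codes = [0, 0, 0]                   # the input's code in each table
neurons = vanilla_sample(tables, query_codes, target=3, rng=rng)
```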
Top-K Sampling: In this strategy, the basic idea is to select those neurons that occur most frequently across all hash tables. After querying with the input, we first retrieve all the neurons from the corresponding bucket in each hash table. While retrieving, we use a hashmap to keep track of the frequency with which each neuron appears. The hashmap is sorted by frequency, and only the neurons with the top frequencies are selected. This requires additional space for maintaining the hashmap and additional time for both counting and sorting.
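A compact sketch of this frequency-based selection, using Python's `Counter` as the hashmap; illustrative only, with our own toy data.

```python
# Sketch of Top-K Sampling: count how often each neuron appears across the
# queried buckets, then keep the k most frequent ones.
from collections import Counter

def top_k_sample(buckets, k):
    freq = Counter()
    for bucket in buckets:        # one retrieved bucket per hash table
        freq.update(bucket)
    # Sorting by frequency is the expensive step this strategy pays for.
    return [neuron for neuron, _ in freq.most_common(k)]

buckets = [[1, 2, 3], [2, 3], [3, 4]]
chosen = top_k_sample(buckets, k=2)   # neuron 3 appears 3x, neuron 2 twice
```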
Hard Thresholding: Top-K Sampling can be expensive due to the sorting step. To overcome this, we propose a simple variant that collects all neurons occurring more than a certain number of times. This bypasses the sorting step and also provides a guarantee on the quality of sampled neurons. Suppose we only select neurons that appear at least m times among the buckets retrieved from all L tables; the probability that a neuron gets chosen is,

P(chosen) = sum_{i=m}^{L} C(L, i) (p^K)^i (1 − p^K)^{L−i},   (3)

where p^K is the single-table collision probability as in Equation (2).
Figure 11 shows a sweep of curves relating the collision probability p^K of a neuron to the probability that the neuron is selected, under various values of the threshold m. We can visualize the trade-off between collecting more good neurons and omitting bad neurons by tweaking m. For a high threshold, only neurons with a high collision probability have a substantial chance of retrieval; this ensures that bad neurons are eliminated, but the retrieved set might be insufficient. For a low threshold, all good neurons are collected, but bad neurons with lower collision probability are also collected with non-negligible probability. Therefore, depending on the tolerance for bad neurons, we choose an intermediate m in practice.
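The binomial-tail probability in Equation (3) is easy to evaluate numerically, which reproduces the qualitative trade-off described above; the values of q, L, and m below are illustrative, not the paper's settings.

```python
# Numerical sketch of the hard-thresholding retrieval probability: if a
# neuron collides with the query in each of L independent tables with
# probability q = p**K, the chance it appears in at least m buckets is a
# binomial tail P[Binomial(L, q) >= m].
from math import comb

def retrieval_prob(q, L, m):
    return sum(comb(L, i) * q**i * (1 - q)**(L - i) for i in range(m, L + 1))

p_good = retrieval_prob(0.9, L=10, m=3)   # "good" neuron: high collision prob
p_bad = retrieval_prob(0.1, L=10, m=3)    # "bad" neuron: low collision prob
```

With these toy numbers, the good neuron is retrieved almost surely while the bad neuron is mostly filtered out, illustrating how the threshold m separates the two.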
Appendix C Design Choice Comparisons
In the main paper, we presented several design choices in SLIDE that have different tradeoffs and performance behavior, e.g., executing MIPS efficiently to select active neurons and adopting the right policy for neuron insertion in hash tables. In this section, we substantiate those design choices with key metrics and insights. To analyze them in more practical settings, we benchmark them on real classification tasks on the Delicious-200K dataset.
C.1 Evaluating Sampling Strategies
Sampling is a crucial step in SLIDE. The quality and quantity of selected neurons and the overhead of the selection strategy significantly affect SLIDE's performance. We profile the running time of the three strategies, Vanilla Sampling, Top-K Sampling, and Hard Thresholding, for selecting different numbers of neurons from the hash tables during the first epoch of the classification task.
Figure 12 presents the results. The blue, red, and green dots represent Vanilla Sampling, Top-K Sampling, and Hard Thresholding, respectively. It shows that Top-K Sampling consistently takes orders of magnitude more time than Vanilla Sampling and Hard Thresholding across all numbers of samples. Also, the green dots are only slightly higher than the blue dots, meaning that the time complexity of Hard Thresholding is slightly higher than that of Vanilla Sampling. Note that the y-axis is in log scale; therefore, as the number of samples increases, the rate of change for the red dots is much greater than for the others. This is not surprising because Top-K Sampling relies on sorting, which has superlinear running time. Therefore, in practice, we suggest choosing either Vanilla Sampling or Hard Thresholding for efficiency. For instance, we use Vanilla Sampling in our extreme classification experiments because it is the most efficient one. Furthermore, the difference in iteration-wise convergence between Top-K Sampling and Vanilla Sampling is negligible.
C.2 Addition to Hash Tables
SLIDE supports two insertion policies for hash tables, described in section 3.1 of the main paper. We profile the running time of the two strategies, Reservoir Sampling and FIFO. After initializing the weights and hash tables, we clock the time both strategies take to insert all 205,443 neurons of the last layer of the network, where 205,443 is the number of classes in the Delicious dataset. We also benchmark the time of the whole insertion process, including generating the hash codes for each neuron before inserting them into the hash tables.
The results are shown in Table 3. The column “Full Insertion” reports the overall time for adding all neurons to the hash tables. The column “Insertion to HT” reports the exact time of adding all the neurons to the hash tables, excluding the time for computing the hash codes. The Reservoir Sampling strategy is more efficient than FIFO: from an algorithmic view, Reservoir Sampling inserts only with some probability, while FIFO guarantees successful insertions, and we observe more memory accesses with FIFO. However, compared to the full insertion time, the benefit of Reservoir Sampling is negligible. Therefore, we can choose either strategy based on practical utility; for instance, we use FIFO in our experiments.
Strategy             Insertion to HT   Full Insertion
Reservoir Sampling   0.371 s           18 s
FIFO                 0.762 s           18 s
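The two bucket-insertion policies compared above can be sketched for a fixed-capacity bucket; this is an illustrative Python sketch following Vitter's Algorithm R for the reservoir variant, not SLIDE's hash-table code.

```python
# Sketch of the two insertion policies for a fixed-capacity bucket:
# FIFO always inserts (evicting the oldest entry), while reservoir sampling
# inserts with a decreasing probability so the bucket remains a uniform
# sample of every neuron ever offered (Vitter, 1985).
import random

def fifo_insert(bucket, item, capacity):
    bucket.append(item)
    if len(bucket) > capacity:
        bucket.pop(0)                 # evict oldest; insertion always succeeds

def reservoir_insert(bucket, item, capacity, seen, rng):
    if len(bucket) < capacity:
        bucket.append(item)           # bucket not yet full
    else:
        j = rng.randrange(seen)       # keep with probability capacity/seen
        if j < capacity:
            bucket[j] = item          # replace a uniformly random slot

rng = random.Random(0)
fifo, res = [], []
for n, item in enumerate(range(100), start=1):   # offer neurons 0..99
    fifo_insert(fifo, item, capacity=8)
    reservoir_insert(res, item, capacity=8, seen=n, rng=rng)
```

After the loop, the FIFO bucket holds exactly the last 8 items, while the reservoir bucket holds a uniform random subset of all 100, matching the probabilistic-vs-guaranteed insertion contrast noted above.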
Appendix D Threading Model and Platform Microarchitecture Optimization
Our experimental analysis shows that SLIDE is a memory-bound workload. We show that careful workload optimization, designing a threading model and a data access pattern that take the underlying platform architecture into consideration, leads to a significant performance boost.
OpenMP and Cache Optimizations: A key metric for the identification of memory and cache performance bottlenecks in a multithreaded application, e.g., SLIDE, is the number of data misses in the core private caches. This is a significant source of coherence traffic, potentially making the shared bus a bottleneck in a symmetric multiprocessor (SMP) architecture, thus increasing memory latency.
OpenMP provides a standard, easy-to-use model for scaling a workload across all the available platform cores. The master thread forks a specified number of worker threads to run concurrently; by default, threads are kept unbound and spread across the available cores. Generally speaking, an inclusive last-level cache (LLC) can improve data sharing, because a new thread created to run on a remote core can probably find a copy of a shared data structure in the LLC; this is especially true if the accesses are mostly read-only and if we ignore the overhead of evictions from the private core caches (Meng and Skadron, 2009). With the new trend of a non-inclusive LLC in CPU architecture (e.g., Intel's Skylake architecture (Kumar et al., 2017)), multithreaded workloads can operate on more data per thread (due to the increased L2 size). However, with a non-inclusive LLC, a remote thread missing on a shared data structure can cause cache thrashing, invalidation, and bouncing of shared data among cores. We noticed that SLIDE is prone to this bottleneck.
Fortunately, OpenMP provides control over thread affinity, where a mask is set according to an affinity preference and checked at runtime for possible locations for a thread to run. When threads mostly access private, independent data items, it is best to scatter them among the available cores for an almost linear speedup, since there is no data dependency. On the other hand, if the threads access items in a shared data structure, it is generally better to schedule them in a more compact packing (using OpenMP Affinity=close), where threads are placed closer to the master thread (on the same CPU socket).
Furthermore, CPU caches are arranged into cache lines. Multiple threads updating data items that happen to co-locate in the same cache line (called false sharing) can also cause cache thrashing, since these updates need to be serialized to ensure correctness, leading to performance degradation. Much previous work (e.g., (Wicaksono et al., 2011)) has tried to detect and resolve false sharing for OpenMP multithreading, mainly using compiler optimizations and hardware performance counters. However, generally speaking, carefully allocating data structures and aligning them on cache line boundaries (e.g., by padding) significantly reduces the opportunities for false sharing. We chose the latter alternative for SLIDE.
Address Translation and Support for Kernel Hugepages: Virtual memory provides applications with a flat address space and an illusion of sufficiently large and linear memory. The addressed memory is divided into fixed-size pages, and a page table is used to map virtual pages to physical ones. The address lookup is accelerated using Translation Lookaside Buffers (TLBs).
Since SLIDE is a workload with a large memory footprint, the performance of virtual memory paging can suffer due to stagnant TLB sizes. TLB address translation is on the processor's critical path; it requires low access times, which constrains the TLB size (and thus the number of pages it holds). On a TLB miss, the system must walk the page table, which may incur additional cache misses. Recent studies show that workloads with large memory footprints can experience a significant performance overhead due to excessive page table walks (Karakostas et al., 2014; Basu et al., 2013).
We employ Hugepages for SLIDE, a technology on x86-64 architectures that maps pages of 2 MB to 1 GB, far larger than the default 4KB pages. The use of huge pages (Transparent Hugepages and libhugetlbfs (Corbet, 2011)) increases TLB reach substantially and reduces the overhead associated with excessive TLB misses and page table walks.
Vector Processing, Software Pipelining, and Prefetching: We further use software optimization techniques to improve SLIDE's performance. In particular, we use vector processing, which exploits data-level parallelism through Single-Instruction-Multiple-Data (SIMD) execution, where a function is called with a batch of inputs instead of an individual input (e.g., the function that updates a large matrix of weights in the backpropagation phase). The implementation uses SIMD instructions (e.g., Intel AVX (Kumar et al., 2017)) to update multiple weights simultaneously. Implementing a software pipeline is an excellent way to hide memory latency for memory-bound workloads. Our implementation divides the processing of data items into pipeline stages, where an explicit software prefetch stage (using, for example, the x86 PREFETCHT0 instruction) is followed by one or more processing stages. Data items that will be accessed in the future are prefetched into the core caches in advance of the time they are processed. In particular, when updating N weights with vector processing, a software implementation can prefetch weight i+d (where d is the depth of the pipeline) while updating weight i; as a result, when it is time to process weight i+d, it is already in the CPU cache.
D.1 Measuring the Impact of Transparent Hugepages
In Table 4, we show the results of examining the impact of Transparent Hugepages on various CPU-counter metrics.
A direct benefit of employing Transparent Hugepages is the drastic reduction in TLB miss rate for both data (dTLB) and instruction (iTLB) loads, as the first two rows of Table 4 show. Consequently, we expect a large reduction in the page table walks (PTW) incurred by TLB misses; this is corroborated in rows 3 and 4 of Table 4, which report the ratio of CPU cycles spent in PTWs caused by data and instruction TLB misses. As mentioned in Appendix D, TLB misses cause expensive main memory reads; with Hugepages, the memory reads caused by data and instruction TLB misses also drop. Finally, we also observe a reduction in page faults (which can occur on a TLB miss).
Metric                 Without Hugepages   With Hugepages
dTLB load miss rate
iTLB load miss rate
PTW dTLB-miss
PTW iTLB-miss
RAM read dTLB-miss
RAM read iTLB-miss
Page faults
Appendix E More discussion on scalability
Moreover, based on the statistics collected through the experiments mentioned above, we show the ratio of the convergence time with different numbers of cores to the minimum convergence time (using 44 cores). The results are exhibited in Figure 13. Again, the red line represents SLIDE, and the black line represents Tensorflow-CPU. As the number of cores increases, that ratio decreases for both SLIDE and Tensorflow-CPU. However, the ratio clearly drops more drastically for the red line than the black line, indicating that the scalability of SLIDE is much better than that of Tensorflow-CPU. Moreover, in the plot, we observe that the benefits of using more cores are marginal beyond 16 cores for Tensorflow-CPU. Coincidentally, a recent work (Hasabnis, 2018) discusses the hardness of finding optimal parameter settings for Tensorflow's threading model on CPU backends. It argues that getting the best performance from a CPU requires manual, tedious, and time-consuming tuning, and even then the best performance is not guaranteed. While the scalability and core utilization of Tensorflow-CPU can be an independent research interest, we explored a small aspect of it in our analysis of CPU inefficiencies above.
Footnotes
 https://github.com/keroro824/HashingDeepLearning
References
 Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092. Cited by: §1.
 Efficient virtual memory for big memory servers. In International Symposium on Computer Architecture, pp. 237–248. Cited by: Appendix D.
 Quick training of probabilistic neural nets by importance sampling.. Cited by: §5.1.
 The extreme classification repository: multi-label datasets and code. Note: http://manikvarma.org/downloads/XC/XMLRepository.html#Prabhu14 Cited by: §5.
 Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning, pp. 589–598. Cited by: §1.
 Unique entity estimation with application to the Syrian conflict. The Annals of Applied Statistics. Cited by: §2.1.
 Densified winner take all (wta) hashing for sparse datasets. In Uncertainty in artificial intelligence, Cited by: Appendix A, Appendix A, §2, §3.2.
 Fast and accurate stochastic gradient estimation. In Advances in Neural Information Processing Systems, pp. 12339–12349. Cited by: §2.1.
 Transparent huge pages in 2.6.38. http://lwn.net/Articles/423584/. Cited by: Appendix D.
 Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, San Francisco, CA, USA, pp. 518–529. External Links: ISBN 1558606157, Link Cited by: Appendix A, §2, §3.2.
 Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §5.2.
 Noisecontrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §5.1.
 Autotuning tensorflow threading model for CPU backend. CoRR abs/1812.01665. External Links: Link, 1812.01665 Cited by: Appendix E.
 Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. Cited by: §2, §2.
 On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1–10. Cited by: §1, §5.1, §5.
 Indatacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. Cited by: §1.
 Performance analysis of the memory management unit under scaleout workloads. In International Symposium on Workload Characterization, pp. 1–12. Cited by: Appendix D.
 On largebatch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §5.2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
 The new intel xeon scalable processor(formerly skylakesp). In Hot Chips, Cited by: Appendix D, Appendix D.
 Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation, pp. 296–303. Cited by: §1.
 Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 287–296. Cited by: Appendix A, §3.2.
 Scalingup splitmerge mcmc with locality sensitive sampling (lss). arXiv preprint arXiv:1802.07444. Cited by: §2.1.
 Winnertakeall autoencoders. In Advances in neural information processing systems, pp. 2791–2799. Cited by: §1.
 Ksparse autoencoders. arXiv preprint arXiv:1312.5663. Cited by: §1.
 Using Intel® VTune™ Performance Analyzer events/ratios & optimizing applications. http://software.intel.com. Cited by: §5.
 Avoiding cache thrashing due to private data placement in lastlevel cache for manycore scaling. In International Conference on Computer Design, pp. 283–297. Cited by: Appendix D.
 Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §5.1.
 GPU computing. Cited by: §1.
 Hogwild: a lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701. Cited by: 2nd item, §1, §3.1.
 Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329. Cited by: §2.1.1.
 Densifying one permutation hashing via rotation for fast near neighbor search. In International Conference on Machine Learning, pp. 557–565. Cited by: Appendix A.
 A new unbiased and efficient class of lshbased samplers and estimators for partition function computation in loglinear models. arXiv preprint arXiv:1703.05160. Cited by: §2.
 Scalable and sustainable deep learning via randomized hashing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 445–454. Cited by: §1, §2.1, §2.1.
 Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.
 Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11 (1), pp. 37–57. Cited by: §4.2.
 Randomized algorithms accelerated over cpugpu for ultrahigh dimensional similarity search. In ACM SIGMOD Record, pp. 889–903. Cited by: §4.2.
 Detecting false sharing in openmp applications using the darwin framework. In Lecture Notes in Computer Science, pp. 282–288. Cited by: Appendix D.
 The power of comparative reasoning. In 2011 International Conference on Computer Vision, pp. 2431–2438. Cited by: Appendix A, §2.
 Loss decomposition for fast learning in large output spaces. In International Conference on Machine Learning, pp. 5626–5635. Cited by: §5, §5.