Recommender Systems, Distributed Training
1 Introduction
Large-scale recommender systems are critical tools for enhancing user experience and promoting sales and services on many websites and mobile applications. One essential component of the recommender system pipeline is click-through rate (CTR) prediction. Machine learning models with tens or even hundreds of billions of parameters are commonly used to make predictions from massive streaming input data that include user preferences, item features, user-item past interactions, etc. The parameter sizes of current industrial-level recommender systems (RSs) are so large that the asynchronous parameter-server (PS) mode has become the de facto method for building such systems.
Ideally, an efficient distributed recommender system should meet three requirements:
Dynamic Features: In industrial scenarios, more and more recommender systems run in streaming mode because new users and items arrive continuously in infinite data streams. In streaming recommender systems Gholami et al. (2018); Chang et al. (2017), the size of the model parameters is usually temporally dynamic and reaches hundreds of GBs or even several TBs. Parameters at this scale naturally require distributed storage.
Stable Convergence: Before the popularity of DLRMs, the negative impact on accuracy caused by gradient staleness Chen et al. (2016) in asynchronous training was not significant in RSs. As more and more deep learning components are introduced into recommendation models, RSs are required to support fully synchronous training for stable convergence and higher AUC.
Real-time Updating: One vital characteristic of streaming recommendation scenarios is the high velocity of inference queries. An RS therefore needs to update and respond instantly in order to capture users' real-time intentions and demands. As model sizes increase over time, it becomes increasingly important for RSs to reduce the demand on network transmission to maintain timeliness.
The above requirements are affected by two design choices made when building a large-scale distributed recommender system: how to parallelize the training pipeline, and how to synchronize the parameters. For parallelization, we can use either data parallelism (parallelizing over the data dimension) or model parallelism (partitioning the computation on parameters across devices). For synchronization, the system can be synchronous or asynchronous (the latter usually when using PS mode).
However, existing methods cannot be easily adapted to recommender systems for two reasons:
First, for DLRMs with very large parameter sizes, pure data parallelism keeps a replica of the entire model on each device, which is infeasible because recommender systems usually have very large weights to update in the first few layers (we call the operators in these layers weights-rich layers). Also, in the context of recommender systems, the features of different input samples in a batch can vary in length, so pure data parallelism with a linearly-scaled batch size is inapplicable. Pure model parallelism usually treats layers and operators as a whole and optimizes load balance through different device placement policies, which does not apply to most larger-scale recommender systems today either.
Second, current PS-mode implementations of large-scale recommender systems are essentially a hybrid data-and-model parallelism strategy and always need to make a tradeoff between update frequency and communication bandwidth. Applying such an asynchronous strategy to current and future models with even larger parameter sizes will make it more difficult for these models to converge to the same performance while keeping training efficient.
To solve the above two issues, we present a novel distributed training framework for recommender systems that achieves faster training speed with less communication overhead using a strategy we call distributed equivalent substitution (DES). The key idea of DES is to replace each weights-rich layer with an elaborate group of sub-operators, each of which updates only its co-located partial weights. The partial computation results are aggregated to form a computationally equivalent substitution for the original operator. To reduce communication, we choose sub-operators that generate partial results of smaller sizes to form the equivalent substitution. We empirically show that for all the weights-rich operators whose parameters dominate the model, it is easy to find an equivalent substitution strategy that creates an order of magnitude less communication demand. We also discuss how to extend DES to other general models.
The main contributions of this paper are as follows:
We present DES training, a distributed training method for recommender systems that achieves better convergence with less communication overhead in large-scale streaming recommendation scenarios.
We propose a group of strategies that replace the weights-rich layers in multiple popular recommendation models with computationally equivalent sub-operators that update only co-located weights and aggregate partial results at a much smaller communication cost.
We show that for the types of models most often used in recommender systems, we can find corresponding substitution strategies for all of their weights-rich layers.
We present an implementation of the DES training framework that outperforms state-of-the-art recommender systems. In particular, we show that our framework achieves 68.7% communication savings on average compared to other PS-based recommender systems.
2 Related Work
Large-scale recommender systems are distributed systems designed specifically for training recommendation models. This section reviews related work from the perspectives of both fields:
2.1 Large-Scale Distributed Training Systems
Data Parallelism splits the training data along the batch dimension and keeps a replica of the entire model on each device. The popularity of ring-based AllReduce Gibiansky (2017) has enabled large-scale data-parallel training Goyal et al. (2017); Jia (2018); You et al. (2019). Parameter Server (PS) is a primary method for training large-scale recommender systems due to its simplicity and scalability Dean et al. (2012); Li et al. (2014). Each worker processes a subset of the input data and is allowed to use stale weights and to update either its own weights or those on a parameter server. Model Parallelism is another commonly used distributed training strategy Krizhevsky (2014); Dean et al. (2012). More recent model parallelism strategies learn the device placement Mirhoseini et al. (2017) or use pipelining Huang et al. (2018). These works usually focus on enabling the system to process complex models with large amounts of weights.
Previously, there have been several hybrid data-and-model parallelism strategies. Krizhevsky Krizhevsky (2014) proposed a general method for using both data and model parallelism for convolutional neural networks. Gholami et al. Gholami et al. (2018) developed an integrated model, data, and domain parallelism strategy; though it theoretically summarized several possible ways to distribute the training process, the method only focused on a limited set of operations such as convolution and is not applicable to fully connected layers. Jia et al. Jia et al. (2018) proposed another integrated parallelism strategy called "layer parallelism"; however, it also focuses on a limited set of operations and cannot split the computation of a single operation, which makes it difficult to apply to recommender systems. Mesh-TensorFlow Shazeer et al. (2018) implements a more flexible parameter-server-like architecture, but for recommender systems it can introduce unnecessary weight communication between different operations.
2.2 Recommender Systems
The critical problem a recommender system tries to solve is click-through rate (CTR) prediction. Logistic regression (LR) is one of the first methods applied to this problem Richardson et al. (2007) and is still common practice now. The factorization machine (FM) Rendle (2010) utilizes addition and inner-product operations to capture the linear and pairwise interactions between features. More recently, deep-learning-based recommendation models (DLRMs) have gained more and more attention Zhang et al. (2016); Cheng et al. (2016); Guo et al. (2017); Lian et al. (2018); Zhou et al. (2018). The Wide & Deep (W&D) model combines a generalized linear model (the wide part) with a deep learning component (the deep part) to enable the recommender to capture both memorization and generalization. DeepFM seamlessly integrates a factorization machine and a multi-layer perceptron (MLP) to model both the high-order and low-order feature interactions. Other applications of DLRMs include music recommendation Oord et al. (2013) and video recommendation Covington et al. (2016). One common characteristic of existing industrial-level recommender systems is their tens or even hundreds of billions of dynamic features. To the best of the authors' knowledge, the dominant way to build a large-scale recommender system today is still the parameter-server-based approach.
3 Background and Design Methodology
3.1 Recommender System Overview
The typical process of a recommender system begins when a user-generated query arrives. The recommender system returns a list of items with which the user can further interact (clicking or purchasing) or which the user can ignore. These queries and interactions are recorded in logs as training data for future use. Due to the large number of simultaneous queries in recommender systems, it is difficult to score each query in detail within the service latency requirement (usually 100 milliseconds). Therefore, a recall system first picks a most-relevant short list from the global item list, using a combination of machine learning models and manually defined rules. After the candidate pool has been reduced, a ranking system ranks all items according to their scores. The score usually represents the probability of a user behavior label given features that include user characteristics (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., application age, application history statistics). This paper mainly studies the core components of a recommender system: the models used for ranking and online learning.
3.2 Distributed Equivalent Substitution Strategy
Previous PS-based or model-parallelism methods usually do not change the operators at the algorithm level. That means for recommender systems whose first one or more layers are weights-rich, placing operators on different devices still cannot solve the out-of-memory problem for a single weights-rich layer. Some works do split the operator Huang et al. (2018); Jia et al. (2018), but they focus on convolution, which has completely different characteristics from the operators frequently used in recommender systems. Our strategy instead designs a computationally equivalent substitution for the original weights-rich layer, replaces it with a group of computationally equivalent operators that each update only a portion of the weights, and processes the computation on non-overlapping input data. Since each portion of the weights is updated by only one of the new operators, our method can break through the single-node memory limitation and avoid transmitting a large number of parameters between nodes. This strategy is particularly designed for large-scale recommender systems: in models for such systems, the majority of the parameters participate only in very simple computation in the first few layers. Such models include LR, FM, W&D, and many other follow-ups.
Definitions and Notations
To help readers better follow our contributions in later sections, we list some basic definitions and notations used in the context of a distributed training framework for recommender systems. We first define the aggregation operation for convenience of description:
In the context of this paper, the aggregation operation ⊕ is one of the MPI-style collective reduction operations (e.g., sum, max, min); however, it can be any commutative-associative aggregation operation. r_i denotes the local value held by process i, and R = r_1 ⊕ … ⊕ r_P denotes the final result. The following are the definitions we need for the description of the DES strategy:
F: the original operator function;
f_i: the i-th sub-operator function;
F̂: the computationally equivalent substitution of F;
r_i: the local result of one substitution operator of F;
N: the batch size of samples in each iteration;
P: the number of worker processes;
K: the number of sub-operators;
X: the input tensor of an operator;
W: the weights tensor of an operator;
α: the latency of the network;
B: the network bandwidth;
S: the size of features, weights, gradients, or intermediate results in bytes.
Without loss of generality, we suppose that each worker has only one process, so the number of workers equals the number of processes. We also assume that every operator takes only one input tensor X and one weights tensor W.
The key observation is that for models in recommender systems, there are always one or more weights-rich layers holding the dominant portion of the parameters. The core idea of the DES strategy is to find a computationally equivalent substitution for the operators of these weights-rich layers, and a splitting method that reduces the communication among all the sub-operators.
Forward Phase: Figure 1 illustrates the forward pass in the two-worker case and compares our DES strategy with the PS-based strategy. In the PS-based strategy, W is not split, so each operator needs its entire W when doing the computation. Also, W is not co-located with X but is pulled to the device when needed. In the DES strategy, we partition the weights and inputs across processes, perform parallel aggregations on the results of one or more sub-operators f_i, and then use the substitution operator F̂ to get the final result on each process. Algorithm 1 shows this process:
The layers that follow the weights-rich layer get the same aggregated results on every process, so no further inter-process communication is needed in the subsequent forward-phase computation. To guarantee the correctness of Equation 1, it is essential that F̂ is computationally equivalent to the original operator F. We observe that for all the popular models for recommender systems, we can always find such sub-operators to form computationally equivalent substitutions. We show the details of how we obtain the substitutions for operators in different models in Section 4.
Back-propagation Phase: After the forward phase, each process holds the entire result R. Because we do not run AllReduce on the gradients but only on some small intermediate results, and because the aggregation operation distributes gradients equally to all of its inputs, there is no inter-process communication during the back-propagation phase either. Each process simply passes the gradients directly back to its own sub-operator.
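As an illustration of the forward phase, the following NumPy sketch (our own simplified simulation: P processes share one address space, and `allreduce_sum` stands in for the MPI collective) shows a sharded dot-product operator whose aggregated partial results equal the original operator on every process:

```python
import numpy as np

def allreduce_sum(partials):
    # Stand-in for an MPI AllReduce: every process receives the
    # sum of all local partial results.
    total = sum(partials)
    return [total for _ in partials]

# A weights-rich operator F(x, w) = <x, w>, split across P processes.
P = 4
rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.normal(size=16)

# Each process holds a disjoint shard of x and its co-located weights.
x_shards = np.array_split(x, P)
w_shards = np.array_split(w, P)

# Sub-operator f_i computes only on its local shard ...
partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]
# ... and the aggregated result on every process equals the original F.
results = allreduce_sum(partials)
assert np.allclose(results[0], x @ w)
```

Only the scalar partials cross the (simulated) network; the shards of w never leave their owning process.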
Performance & Complexity Analysis
PS-based: Weights are distributed on parameter servers, while P workers process different batches, each with N samples. Per iteration, each worker pulls the weights needed for its batch and pushes the corresponding gradients, so the time cost of the communication for the PS-based mode is:
T_PS = 2(α + N·S_s/B),
where S_s denotes the average size of the keys and weights associated with one sample.
Mesh-based: A special form of PS-based is mesh-based, in which the weights are divided into chunks and co-located with the workers. It has a smaller network cost than the original PS-based strategies. In this strategy each worker processes one batch, and the time cost for P batches in synchronous mode is:
T_mesh = 2(α + ((P−1)/P)·N·S_s/B),
since the 1/P fraction of the weights co-located with a worker requires no network transfer.
AllReduce: A full replica of the weights is stored on each worker. The workers synchronize the gradients every iteration. We use ring-based AllReduce, the most widely adopted AllReduce algorithm, as the default algorithm for the scope of this paper. The time cost of the communication is:
T_AR = 2(P−1)·α + 2·((P−1)/P)·S_g/B,
where S_g is the size of the gradients of the model.
DES: Each aggregation operation uses AllReduce, and DES may use several such aggregation operations to form the final result, so the time cost of the communication is:
T_DES = Σ_{i=1..m} [2(P−1)·α + 2·((P−1)/P)·S_i/B],
where m is the number of aggregation operations and S_i is the size of the intermediate results of the i-th operation. Let
S_DES = Σ_{i=1..m} S_i,
and we can see that if S_i < S_g/m is satisfied for each i (so that S_DES < S_g), DES will reduce the communication cost.
For both PS-mode strategies, the time complexity of the communication is proportional to the batch size N. For the AllReduce and DES-based strategies, the time complexity of the communication is constant (the number of aggregation operations m is usually smaller than 3).
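To make the comparison concrete, here is a small sketch of the alpha-beta cost model above (all constants are illustrative assumptions, not measurements; `ring_allreduce_time` follows the standard ring-AllReduce formula):

```python
# Hypothetical cost comparison: ring-AllReduce on full gradients versus
# DES aggregations on small intermediate results. All sizes/latencies
# below are illustrative assumptions.
def ring_allreduce_time(size_bytes, P, alpha, bandwidth):
    # Standard ring AllReduce: 2(P-1) latency steps, each moving
    # roughly size/P bytes per process.
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * size_bytes / bandwidth

P = 4
alpha = 5e-6          # per-message latency in seconds (assumed)
bandwidth = 1.25e9    # 10 Gb/s expressed in bytes per second

S_grad = 4 * 10**9    # gradients of a weights-rich model: 4 GB (assumed)
t_allreduce = ring_allreduce_time(S_grad, P, alpha, bandwidth)

# DES aggregates m small intermediate tensors instead of the gradients,
# e.g. one scalar and one small vector per sample in a 4096-sample batch.
intermediate_sizes = [4 * 4096, 4 * 4096 * 8]
t_des = sum(ring_allreduce_time(s, P, alpha, bandwidth)
            for s in intermediate_sizes)
assert t_des < t_allreduce
```

Under these assumed numbers the gradient AllReduce is dominated by the bandwidth term, while the DES aggregations stay in the latency-dominated regime.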
The benefits of the DES strategy are three-fold. First, with the new operators and their co-located weights, one can split an operator with a huge amount of weights into sub-operators with arbitrarily small amounts of parameters, given an abundant number of workers; this gives our framework better scalability than traditional PS-based frameworks. Second, the DES strategy does not send weights but instead the intermediate results of sub-operators, which can be much smaller than the original weights; this significantly reduces the total amount of communication our framework needs. Third, with the above two improvements, our framework brings synchronous training to large-scale recommender systems; with full synchronization per iteration, the model converges faster, which makes the training process more efficient.
4 Applications on Models for Recommender Systems
We observe that many models in recommender systems share similar components (Table 1). For example, the LR model is the linear part of the W&D model; almost all models include first-order feature crossing; all FM-based models include second-order feature crossing; and the deep components of the W&D model and the DeepFM model share similar structures. An optimal DES strategy finds substitutions for the first-order, second-order, or higher-order operations, which usually involve simple computation but a large number of weights. The goal is to perform the same computation but with a much smaller communication cost for sending partial results over the network. In this section, we describe how to find such computationally equivalent substitutions for different models.
4.1 Logistic Regression
Logistic Regression (LR) Richardson et al. (2007) is a generalized linear model widely used in recommender systems. Due to its simplicity, scalability, and interpretability, LR is used not only as an independent model, but also as an important component of many DLRMs, such as Wide&Deep and DeepFM. The form of LR is as follows:
F_lr(x, w) = σ(⟨x, w⟩ + b),
where x and w are two d-dimensional vectors representing the inputs and weights respectively, b is the bias, and σ is a non-linear transform, usually the sigmoid function for LR. The major part of the computation in F_lr is the dot product. It is easy to find a P-partition of w, where w_i denotes the subset of w co-located with the i-th process. We then define a local operator f_i on w_i:
f_i(x_i, w_i) = ⟨x_i, w_i⟩.
We have the equivalent substitution of F_lr:
F̂_lr = σ(AllReduce_sum f_i(x_i, w_i) + b).
Assume that all the weights of sparse features are stored in hash tables as float32 values. In the mesh-based strategy, each worker needs to transfer weights with unsigned int64 keys from the hash tables co-located with the other workers. So the total data size to transfer through the network for each worker is:
S_mesh = ((P−1)/P)·(S_k + S_w),
where S_k and S_w denote the total sizes of the feature keys and weights per batch respectively.
Using DES, we only need to synchronize one scalar value with the other workers for every sample, so the total data size to transfer through the network for each worker is:
S_DES = ((P−1)/P)·S_r,
where S_r denotes the size of the intermediate results per batch. So the communication-saving ratio for LR is:
1 − S_DES/S_mesh = 1 − S_r/(S_k + S_w).
4.2 Factorization Machine
Besides linear interactions among features, FM models pairwise feature interactions as inner products of latent vectors. FM is both an independent model and an important component of DLRMs such as DeepFM and xDeepFM Lian et al. (2018). The linear interactions are similar to the LR model, so here we only focus on the order-2 operator (denoted by F_fm2):
F_fm2 = Σ_{i=1..d} Σ_{j=i+1..d} ⟨v_i, v_j⟩ x_i x_j,
where v_i denotes a latent vector, x_i is the feature value of the i-th feature, and ⟨·, ·⟩ denotes the inner product operation.
Applying Algorithm 1 to FM, we get a P-partition {V_1, …, V_P} of the latent vectors using any partition policy that balances the load on each process. We then define two local operators, f_i^(1) and f_i^(2), that process the local subset of weights V_i:
f_i^(1) = Σ_{j∈V_i} v_j x_j,    f_i^(2) = Σ_{j∈V_i} (v_j x_j)²,
where the square is taken element-wise. We have the equivalent substitution of F_fm2:
F̂_fm2 = ½ [(AllReduce_sum f_i^(1))² − AllReduce_sum f_i^(2)],
summed over the latent dimensions.
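The order-2 substitution can be checked numerically; the following NumPy sketch (our own illustration, with plain Python sums standing in for the AllReduce aggregations) verifies that the two sharded partial sums reproduce the brute-force pairwise interaction:

```python
import numpy as np

# FM order-2 term: sum over pairs <v_i, v_j> x_i x_j, computed via the
# linear-time identity 0.5 * (||sum_i v_i x_i||^2 - sum_i ||v_i x_i||^2).
rng = np.random.default_rng(1)
d, k, P = 12, 4, 3              # features, latent dim, processes
V = rng.normal(size=(d, k))     # latent vectors
x = rng.normal(size=d)          # feature values

# Brute-force pairwise reference.
ref = sum(V[i] @ V[j] * x[i] * x[j]
          for i in range(d) for j in range(i + 1, d))

# Each process holds a disjoint shard of the latent vectors and computes
# two local partial sums; both are AllReduce-summed across processes.
shards = np.array_split(np.arange(d), P)
s1 = sum((V[idx] * x[idx, None]).sum(axis=0) for idx in shards)
s2 = sum(((V[idx] * x[idx, None]) ** 2).sum() for idx in shards)
des = 0.5 * (s1 @ s1 - s2)
assert np.allclose(des, ref)
```

Only the k-dimensional vector f^(1) and the scalar f^(2) are exchanged, regardless of how many latent vectors each shard holds.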
In the mesh-based strategy, each worker needs to look up latent vectors by feature IDs from the hash tables co-located with the other workers. The total data size to transfer through the network for each worker is:
S_mesh = ((P−1)/P)·(S_k + S_v),
where S_k and S_v denote the total sizes of the feature keys and latent vectors per batch respectively.
Using DES, the FM order-2 operators only require all workers to exchange f_i^(1) and f_i^(2) among each other, so we have:
S_DES = ((P−1)/P)·(S_{f^(1)} + S_{f^(2)}).
The communication-saving ratio for FM is:
1 − (S_{f^(1)} + S_{f^(2)})/(S_k + S_v).
4.3 Deep Neural Network
Recommender systems use DNNs to learn high-order feature interactions. The features are usually categorical and grouped in fields. A DNN starts with an embedding layer that compresses the latent vectors into dense embedding vectors by fields, and it is usually followed by multiple fully connected (FC) layers, as shown in Figure 4.
Like FM, in DNNs the majority of the weights come from the embedding layer and the first FC layer:
F_dnn = X_e · W_1,
where X_e denotes the concatenated output of the embedding layer and W_1 denotes the weights of the first FC layer.
Using DES, we split X_e and W_1 into P partitions over the fields dimension and use blocked matrix multiplication (Figure 5), which is similar to the method proposed by Gholami et al. Gholami et al. (2018). Our strategy differs in the splitting: we divide X_e and W_1 along the same dimension to ensure that the computation and weights of the different parts do not overlap:
X_e · W_1 = Σ_{i=1..P} X_e^(i) · W_1^(i).
Hence we get the P-partitions of X_e and W_1: X_e = [X_e^(1), …, X_e^(P)] and W_1 = [W_1^(1); …; W_1^(P)], where X_e^(i) and W_1^(i) denote the subsets of X_e and W_1 co-located with the i-th process respectively.
Considering that the embedding layer aggregates the latent vectors by fields before concatenating them, we store the latent vectors of the same field on the same process to avoid unnecessary weight exchange. In this way, we also avoid communication during the back-propagation phase.
Using this P-partition we can define the local operator as follows:
f_i = X_e^(i) · W_1^(i).
The distributed equivalent substitution of F_dnn is hence defined as:
F̂_dnn = AllReduce_sum f_i.
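A minimal NumPy check of this blocked substitution (our own sketch; the final `sum` stands in for the AllReduce-sum across processes, and the shapes are illustrative):

```python
import numpy as np

# Blocked matmul: split the embedding output X and the first-FC weights W
# along the same (fields) dimension; the local products sum to X @ W.
rng = np.random.default_rng(2)
P = 4
X = rng.normal(size=(8, 32))    # batch x concatenated-embedding dim
W = rng.normal(size=(32, 16))   # first fully connected layer

X_blocks = np.array_split(X, P, axis=1)   # split columns (fields)
W_blocks = np.array_split(W, P, axis=0)   # split rows in the same dim

partials = [Xb @ Wb for Xb, Wb in zip(X_blocks, W_blocks)]
out = sum(partials)                       # AllReduce-sum in the real system
assert np.allclose(out, X @ W)
```

Each process only ever touches its own column block of X and the matching row block of W, which is why the weights and computation never overlap.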
In the mesh-based strategy, each worker needs to look up parts of X_e and W_1 by keys (unsigned int64) from the hash tables co-located with the other workers. The total data size to transfer for each worker is:
S_mesh = ((P−1)/P)·(S_k + S_{X_e} + S_{W_1}),
where S_k, S_{X_e}, and S_{W_1} denote the total sizes per batch of the feature keys, X_e, and W_1 respectively. Compared to the mesh-based strategy, a DNN using DES only requires all workers to exchange the partial results f_i among each other (Figure 4):
S_DES = ((P−1)/P)·S_f.
The communication-saving ratio for DNN is:
1 − S_f/(S_k + S_{X_e} + S_{W_1}).
| Batch size | Unique features | LR | W&D | DeepFM |
| 512 | 147,664 | 99.769% | 99.376% | 90.310% |
| 1024 | 257,757 | 99.735% | 99.285% | 86.226% |
| 2048 | 448,814 | 99.696% | 99.179% | 81.658% |
| 4096 | 789,511 | 99.654% | 99.066% | 77.015% |
| 8192 | 1,389,353 | 99.607% | 98.939% | 72.264% |
Using DES does not increase the computation compared to the PS/mesh-based strategy, and it often leads to a smaller computation load. Table 2 shows the number of unique features per batch as well as the communication-saving ratio for the three models with different batch sizes on a real-world recommender system. The communication cost when using DES is reduced by 72.26% (with a batch size of 8192) up to 99.77% (with a batch size of 512) compared to the mesh-based strategy.
Our analysis here only includes the communication cost of transferring the sparse weights. In fact, for most recommender systems, state-of-the-art stateful optimizers such as FTRL McMahan et al. (2013), AdaGrad Duchi et al. (2011), and Adam Kingma and Ba (2014) require saving and transferring the corresponding state variables along with the sparse weights. When using the DES strategy, these variables are kept local, which reduces the communication cost even further.
Extending to General Models: The previous analysis shows that we can apply DES to several state-of-the-art models for recommender systems. We believe this is not a coincidence. Generalizing our observations on the above models, we claim that for any DLRM, as long as the computationally equivalent substitution of its weights-rich layers does not surpass linear complexity, we can apply the DES strategy. FM Rendle (2010) is the work that inspired us to look for linear substitutions of operators. Here linear complexity means O(n), where n is the size of the feature parameters: DES splits an n-dimension feature vector into P parts of size k each, where n = kP, k is a constant, and P is the number of DES worker processes. We have a simple rule to judge whether such a linear-complexity substitution exists: if the computation of the weights-rich layer satisfies the commutative and associative laws, we can apply the DES strategy to reduce the communication cost in the forward phase and eliminate the gradient aggregation in the backward phase.
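A toy check of this rule (our own illustration): aggregations built from commutative-associative operations such as sum and max give the same result for any sharding of the inputs, which is exactly what the partial-aggregation step relies on:

```python
import itertools

# Sum and max are commutative-associative, so sharded partial aggregation
# matches the global aggregation for any partition of the inputs.
values = [3.0, -1.5, 2.0, 7.0, 0.5]

def sharded(op, vals, shard_sizes):
    # Aggregate each shard locally, then aggregate the partial results,
    # mimicking sub-operators followed by an AllReduce.
    it = iter(vals)
    parts = [op(itertools.islice(it, s)) for s in shard_sizes]
    return op(parts)

for op in (sum, max):
    assert sharded(op, values, [2, 2, 1]) == op(values)
    assert sharded(op, values, [1, 3, 1]) == op(values)
```

An operation that is not associative over its inputs (e.g. one that depends on input order) would fail this check and therefore would not admit a DES-style split.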
5 System Implementation
We choose TensorFlow as the backend for our training framework due to its flexibility and natural support for distributed training. More specifically, we implement our system by enhancing TensorFlow in the following two aspects: large-scale sparse features and a dynamic hash table.
Large-scale Sparse Features: As mentioned earlier, an industrial streaming recommender system may have hundreds of billions of dynamic features. With an embedding size of 8, the float32 feature weights alone require at least 3.2 TB of memory. Table 2 shows that within a single iteration, the weight update touches only a sparse set of unique features. To achieve constant-cost data access/update and to overcome the memory constraint of a single node, we use a distributed hash table. We use a simple method to distribute the weights: in a cluster with n nodes, the i-th node holds all the weights whose feature IDs satisfy id mod n = i. There are other methods that could achieve better load balancing, but we found this simple method works well in our case.
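A sketch of this sharding scheme (a plain Python dict per node standing in for the distributed hash table; the feature IDs are illustrative):

```python
# Mod-based sharding of feature weights across n nodes, as described above.
n_nodes = 4

def owner(feature_id: int, n: int = n_nodes) -> int:
    # The i-th node holds every weight whose feature ID satisfies id % n == i.
    return feature_id % n

shards = {i: {} for i in range(n_nodes)}
for fid in [7, 8, 15, 1024, 999999937]:
    shards[owner(fid)][fid] = 0.0   # weight initialized lazily on its owner

assert owner(8) == 0 and owner(7) == 3
```

Because the mapping is a pure function of the ID, any worker can locate the owner of a feature without a directory lookup; load balance then depends only on the ID distribution.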
Dynamic Hash Table: In the DES strategy, there are three places where we operate on hash tables: given a feature ID in a batch of input samples, we look up the corresponding weight; when a new feature ID is given as a key, we insert an initialized weight into the hash table; and given the gradient of a weight, we apply it locally and then update the hash table with the new weight. To achieve this, we provide a modified dynamic hash table implementation in TensorFlow with the key operations adapted to our needs (Figure 6). Compared to alternative design choices, this implementation makes use of as many existing TensorFlow features as possible and introduces hash table operations only during the batch-building and optimizer phases, because after the lookup, the sparse weights are reformed into dense tensors and are fully compatible with the native TensorFlow training pipeline.
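The three hash-table operations can be sketched as follows (a plain Python dict standing in for our TensorFlow hash-table ops; the initializer and learning rate are illustrative assumptions):

```python
import random

class DynamicTable:
    """Toy dynamic hash table: lookup, insert-on-miss, local update."""

    def __init__(self, dim=8, seed=0):
        self.dim, self.table = dim, {}
        self.rng = random.Random(seed)

    def lookup(self, feature_id):
        # Insert an initialized weight the first time a feature ID appears.
        if feature_id not in self.table:
            self.table[feature_id] = [self.rng.gauss(0.0, 0.01)
                                      for _ in range(self.dim)]
        return self.table[feature_id]

    def apply_gradient(self, feature_id, grad, lr=0.01):
        # Gradients are applied locally; only the owning process writes back.
        w = self.lookup(feature_id)
        self.table[feature_id] = [wi - lr * g for wi, g in zip(w, grad)]

t = DynamicTable()
w0 = list(t.lookup(42))           # first lookup inserts an initial weight
t.apply_gradient(42, [1.0] * 8)   # local SGD-style update
assert abs(t.table[42][0] - (w0[0] - 0.01)) < 1e-12
```

In the real system the post-lookup weights are densified into regular tensors, which is what keeps the rest of the TensorFlow pipeline unchanged.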
6 Experiments And Analysis
Hardware: We ran all experiments in this paper on a testing cluster of four Linux servers, each with two hyperthreaded 24-core Intel Xeon E5-2670 v3 (2.3 GHz) CPUs, 128 GB of host memory, and one Intel 10-Gigabit X540-AT2 Ethernet controller without RDMA support.
Software: Our DES framework is based on an enhanced version of TensorFlow 1.13.1 and standard OpenMPI 4.0.1. Considering that mesh-based frameworks are a special form of PS-based frameworks and usually have less communication cost than the original PS-based ones, we use the mesh-based strategy for comparison. The mesh-based strategy we compare with is implemented using a popular open-source framework, DiFacto Li et al. (2016).
Dataset: To verify the performance of DES in a real industrial context, we evaluate our framework on the following two datasets.
1) Criteo Dataset: the Criteo dataset, a public CTR prediction benchmark.
2) Company* Dataset: We extract a continuous segment of samples from a recommender system in use internally. On average, each sample contains 950 unique feature values. The total number of samples is 10,809,440. It is stored in a remote sample server.
Parameter Settings: We set DiFacto to run one worker process on each server; the batch size is 4,096 and the number of concurrent threads is 24. Correspondingly, the inter-op and intra-op parallelism thread counts for DES on TensorFlow are both set to 24, and the batch size for DES is set to 4,096 when testing AUC. Since with DES all workers train on samples from the same batch synchronously in parallel, when testing the communication ratio we set the batch size to 16,384 (for P = 4 workers) to guarantee a fair comparison. We train all models with the same optimizer settings: FTRL for the order-1 components, and AdaGrad or Adam for both the embedding and DNN components.
Evaluation Metrics: We use two evaluation metrics in our experiments: AUC (Area Under ROC) and Logloss (cross entropy).
Performance Summary: We compare our framework with the mesh-based implementation on three widely adopted models in mainstream recommender systems: LR, W&D, and DeepFM. In general, on all three models, DES achieves better AUC in a smaller number of iterations with an order of magnitude less communication cost.
Table 3 shows that during long-term online training, when consuming the same amount of samples with similar distributions, DES shows a better average AUC for all three models. One possible explanation is that with DES the training is synchronous, which usually leads to better and faster convergence than asynchronous mode. The reason we care about small AUC increases is that in several real-world applications we run internally, even a small increase in AUC is amplified roughly 5x when transferred to the final CTR.
Table 4 shows the AUC and log loss of the three models using PS-mode asynchronous training and DES-mode fully-synchronous training on TensorFlow, respectively.
Computation vs. Communication Time: Figure 7 shows that in all experiments, the DiFacto framework spends more time on both computation and communication. The absolute total network communication time using the DiFacto framework is 2.7x, 2.3x, and 3.2x larger for LR, W&D, and DeepFM respectively than using DES. The saving in communication time comes from the smaller amount of intermediate results sent among workers during the forward phase and from the elimination of gradient aggregation during the backward phase. The saving in computation time comes from the reduced time complexity of the computationally equivalent substitution, as well as from several optimizations we have implemented in our DES framework.
Throughput: Table 5 compares the throughput of DES and DiFacto. For deep models with high-order components (W&D and DeepFM), DES has a larger advantage, which indicates greater benefits when applying DES to future DLRMs.
7 Conclusions and Future Work
We propose a novel framework for models with large-scale sparse dynamic features in streaming recommender systems. Our framework achieves efficient synchronous distributed training thanks to its core component: the Distributed Equivalent Substitution (DES) algorithm. We take advantage of the observation that in models for recommender systems, the first one or few weights-rich layers participate only in straightforward computation and can be replaced by a group of distributed operators that form a computationally equivalent substitution. Using DES, the intermediate information that needs to be transferred between workers during the forward phase is reduced, and the AllReduce on gradients between workers during the backward phase is eliminated. The application of DES to popular DLRMs such as FM, DNN, Wide&Deep, and DeepFM shows the generality of our algorithm. Experiments on a public dataset and an internal dataset comparing our implementation with a popular PS-based implementation show that our framework achieves up to 68.7% communication savings and higher AUC.
Future Work: We have shown in Section 6 that our current implementation of DES is bounded by computation. The natural next step is therefore to move the computation of the current bottleneck operators, such as the hash table, to GPUs and to improve the existing kernel implementations. We have also started initial work on applying DES to more models commonly used in industry, such as DCN Wang et al. (2017) and DIN Zhou et al. (2018).
Acknowledgement: We appreciate the technical assistance, advice, and machine access from our colleagues at Tencent: Chaonan Guo and Fei Sun.
- More details in Section 4.
- We use the FTRL optimizer for the LR model and the Adam optimizer for the other two models.
- Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, Republic and Canton of Geneva, CHE, pp. 381–389.
- Revisiting distributed synchronous SGD. CoRR abs/1604.00981.
- Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, New York, NY, USA, pp. 7–10.
- Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, New York, NY, USA, pp. 191–198.
- Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, USA, pp. 1223–1231.
- Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159.
- Integrated model, batch, and domain parallelism in training neural networks. In SPAA'18: 30th ACM Symposium on Parallelism in Algorithms and Architectures.
- Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677.
- DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1725–1731.
- GPipe: efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965.
- Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. CoRR abs/1807.11205.
- Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR abs/1802.04924.
- Adam: a method for stochastic optimization. arXiv e-prints, arXiv:1412.6980.
- One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997.
- Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, Berkeley, CA, USA, pp. 583–598.
- DiFacto: distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, New York, NY, USA, pp. 377–386.
- xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, New York, NY, USA, pp. 1754–1763.
- Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, New York, NY, USA, pp. 1222–1230.
- Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 2430–2439.
- Deep content-based music recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, USA, pp. 2643–2651.
- Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, Washington, DC, USA, pp. 995–1000.
- Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, New York, NY, USA, pp. 521–530.
- Mesh-TensorFlow: deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, USA, pp. 10435–10444.
- Deep & cross network for ad click predictions. CoRR abs/1708.05123.
- Large batch optimization for deep learning: training BERT in 76 minutes. CoRR abs/1904.00962.
- Deep learning over multi-field categorical data: a case study on user response prediction. ArXiv abs/1601.02376.
- Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, New York, NY, USA, pp. 1059–1068.