Recommender Systems, Distributed Training
1 Introduction
Large-scale recommender systems are critical tools for enhancing user experience and promoting sales and services on many online websites and mobile applications. One essential component of the recommender system pipeline is click-through rate (CTR) prediction. Typically, machine learning models with tens or even hundreds of billions of parameters provide predictions based on massive streaming input data that includes user preferences, item features, user-item past interactions, etc. The parameter sizes of current industrial-level recommender systems (RSs) are so large that the asynchronous parameter-server (PS) mode has become the only practical method for building such systems.
Ideally, an efficient distributed recommender system should meet three requirements:

Dynamic Features: In industrial scenarios, more and more recommender systems run in streaming mode because new users and items arrive continuously in infinite data streams. In streaming recommender systems Gholami et al. (2018); Chang et al. (2017), the size of the model parameters is usually temporally dynamic and reaches hundreds of GBs or even several TBs. Parameters at this scale naturally require distributed storage.

Stable Convergence: Before the popularity of DLRMs, the negative impact on accuracy caused by gradient staleness Chen et al. (2016) in asynchronous training was not significant in RSs. As more and more deep learning components are introduced into recommendation models, RSs are required to support fully synchronous training for stable convergence and higher AUC.

Real-time Updating: One vital characteristic of streaming recommendation scenarios is the high velocity of inference queries. An RS therefore needs to update and respond instantly in order to capture users' real-time intentions and demands. With model sizes increasing over time, it becomes more and more important for RSs to reduce the demand on network transmission to maintain timeliness.
The above requirements are affected by two design choices made when building a large-scale distributed recommender system: how to parallelize the training pipeline, and how to synchronize the parameters. For parallelization, we can use either data parallelism (parallelizing over the data dimension) or model parallelism (parallelizing computation on parameters across different devices). For synchronization, the system can be synchronous or asynchronous (the latter usually when using PS mode).
However, existing methods cannot be easily adapted to recommender systems for two reasons:
First, for DLRMs with very large numbers of parameters, pure data parallelism keeps a replica of the entire model on every single device, which is impossible because recommender systems usually have very large weights to update in the first few layers (we call the operators in these layers weights-rich layers). Also, in the context of recommender systems, the features of different input samples in a batch can differ in length, so pure data parallelism with linearly-scaled batch size is inapplicable. Pure model parallelism usually treats layers and operators as a whole and optimizes load balance through different device placement policies, which does not apply to most large-scale recommender systems today either.
Second, current PS-mode implementations of large-scale recommender systems are essentially a hybrid data-and-model parallelism strategy and always need to make a trade-off between update frequency and communication bandwidth. Applying such an asynchronous strategy to current and future models with even larger numbers of parameters will make it more difficult for these models to converge to the same performance while keeping training efficient.
To solve the above two issues, we present a novel distributed training framework for recommender systems that achieves faster training with less communication overhead using a strategy we call distributed equivalent substitution (DES). The key idea of DES is to replace the weights-rich layers with an elaborate group of sub-operators such that each sub-operator only updates its co-located partial weights. The partial computation results are aggregated to form a computationally equivalent substitution for the original operator. To reduce communication, we choose sub-operators that generate partial results with smaller sizes to form the equivalent substitution. We empirically show that for all the weights-rich operators whose parameters dominate the model, it is easy to find an equivalent substitution strategy that creates an order of magnitude less communication demand. We also discuss how to extend DES to other general models.
The main contributions of this paper are as follows:

We present DES training, a distributed training method for recommender systems that achieves better convergence with less communication overhead in large-scale streaming recommendation scenarios.

We propose a group of strategies that replace the weights-rich layers in multiple popular recommendation models with computationally equivalent sub-operators that only update co-located weights and aggregate partial results at much smaller communication cost.

We show that for the model types most often used in recommender systems, we can find corresponding substitution strategies for all of their weights-rich layers.

We present an implementation of the DES training framework that outperforms state-of-the-art recommender systems. In particular, we show that our framework achieves 68.7% communication savings on average compared to other PS-based recommender systems.
2 Related Work
Large-scale recommender systems are distributed systems designed specifically for training recommendation models. This section reviews related work from the perspectives of both fields:
2.1 LargeScale Distributed Training Systems
Data Parallelism splits the training data along the batch dimension and keeps a replica of the entire model on each device. The popularity of ring-based AllReduce Gibiansky (2017) has enabled large-scale data-parallel training Goyal et al. (2017); Jia (2018); You et al. (2019). The Parameter Server (PS) is a primary method for training large-scale recommender systems due to its simplicity and scalability Dean et al. (2012); Li et al. (2014). Each worker processes a subset of the input data and is allowed to use stale weights and update either its own weights or those on a parameter server. Model Parallelism is another commonly used distributed training strategy Krizhevsky (2014); Dean et al. (2012). More recent model parallelism strategies learn the device placement Mirhoseini et al. (2017) or use pipelining Huang et al. (2018). These works usually focus on enabling the system to process complex models with large amounts of weights.
Previously, there have been several hybrid data-and-model parallelism strategies. Krizhevsky Krizhevsky (2014) proposed a general method for using both data and model parallelism for convolutional neural networks. Gholami et al. Gholami et al. (2018) developed an integrated model, data, and domain parallelism strategy. Though it theoretically summarized several possible ways to distribute the training process, the method only focused on limited operations such as convolution and is not applicable to fully-connected layers. Zhihao et al. Jia et al. (2018) proposed another integrated parallelism strategy called "layer parallelism". However, it also focuses on a limited set of operations and cannot split the computation of a single operation, which makes it difficult to apply to recommender systems. Mesh-TensorFlow Shazeer et al. (2018) implements a more flexible parameter-server-like architecture, but for recommender systems it could introduce unnecessary weights communication between different operations.
2.2 Recommender Systems
The critical problem a recommender system tries to solve is click-through rate (CTR) prediction. Logistic regression (LR) is one of the first methods applied to it Richardson et al. (2007) and is still common practice now. The factorization machine (FM) Rendle (2010) utilizes addition and inner-product operations to capture linear and pairwise interactions between features. More recently, deep-learning-based recommendation models (DLRMs) have gained more and more attention Zhang et al. (2016); Cheng et al. (2016); Guo et al. (2017); Lian et al. (2018); Zhou et al. (2018). The Wide & Deep (W&D) model combines a general linear model (the wide part) with a deep learning component (the deep part) to enable the recommender to capture both memorization and generalization. DeepFM seamlessly integrates a factorization machine and a multi-layer perceptron (MLP) to model both high-order and low-order feature interactions. Other applications of DLRMs include music recommendation Oord et al. (2013) and video recommendation Covington et al. (2016). One common characteristic of all existing industrial-level recommender systems is tens or even hundreds of billions of dynamic features. To the best of the authors' knowledge, the dominant way to build a large-scale recommender system today is still the parameter-server-based method.
3 Background and Design Methodology
3.1 Recommender System Overview
The typical process of a recommender system starts when a user-generated query comes in. The recommender system returns a list of items that the user can further interact with (clicking or purchasing) or ignore. These queries and user interactions are recorded in the log as training data for future use. Due to the large number of simultaneous queries in recommender systems, it is difficult to score each query in detail within the service latency requirement (usually 100 milliseconds). Therefore, a recall system first picks a most-relevant short list from the global item list, using a combination of machine learning models and manually defined rules. After reducing the candidate pool, a ranking system ranks all items by their scores. The score usually represents the probability of a user behavior tag given features that include user characteristics (e.g., country, language, demographics), contextual features (e.g., device, hour of the day, day of the week), and impression features (e.g., application age, application history statistics). This paper mainly studies the core components of a recommender system: the models used for ranking and online learning.
3.2 Distributed Equivalent Substitution Strategy
Previous PS-based or model parallelism methods usually do not change the operator at the algorithm level. That means for recommender systems whose first one or more layers are weights-rich, putting operators on different devices still cannot solve the out-of-memory problem for a single weights-rich layer. Some works do split the operator Huang et al. (2018); Jia et al. (2018), but they focus on convolution, which has completely different characteristics from the operators frequently used in recommender systems. Our strategy instead designs a computationally equivalent substitution for the original weights-rich layer, replacing it with a group of computationally equivalent operators that each update only a portion of the weights and process non-overlapping input data. Since each new operator updates only one portion of the weights, our method breaks through the single-node memory limitation and avoids transmitting a large number of parameters between nodes. This strategy is particularly designed for large-scale recommender systems, in whose models the majority of the parameters only participate in very simple computation in the first few layers. Such models include LR, FM, W&D, and many other follow-ups.
Definitions and Notations
To help readers better follow our contributions in later sections, we list some basic definitions and notations used in the context of a distributed training framework for recommender systems. We first define the aggregation operation for convenience of description:

(1) r = ⊕_{i=1}^{M} r_i

In the context of this paper, ⊕ is one of the MPI-style collective operations (e.g., SUM, MAX, MIN); however, it can be any commutative-associative aggregation operation. r_i represents the local value held by processor i, and r represents the final result. The following are some definitions we need for the description of the DES strategy:

F: the original operator function;

f_i: the sub-operator function;

F': the computationally equivalent substitution of F;

r_i: the local result of one substitution operator of F;

N: the batch size of samples in each iteration;

M: the number of worker processes;

K: the number of sub-operators;

X: the input tensor of an operator;

W: the weights tensor of an operator;

α: the latency of the network;

β: the network bandwidth;

S: the size of features, weights, gradients, or intermediate results in bytes.
Without loss of generality, we suppose that each worker has only one process, so the number of workers M is equal to the number of processes. We also assume that all operators take only one input tensor X and one weights tensor W.
Algorithm
The key observation is that for models in recommender systems, there are always one or more weights-rich layers holding the dominant portion of the parameters. The core idea of the DES strategy is to find a computationally equivalent substitution F' for the operator F of these weights-rich layers, and to find a splitting method that reduces the communication among all the sub-operators.
Forward Phase: Figure 1 illustrates the forward pass in the two-worker case and compares our DES strategy with the PS-based strategy. In the PS-based strategy, F is not split, so each operator needs the entire W when doing its computation. Also, W is not co-located with the operator but pulled to the device when needed. In the DES strategy, we partition the weights and inputs across different processes, perform parallel aggregations on the results of one or more sub-operators f_i, and then use the substitution operator F' to get the final result on each process. Algorithm 1 shows this process:
The layers following the weights-rich layer receive the same aggregated results on each process, so no further inter-process communication is needed in subsequent computation of the forward phase. To guarantee the correctness of equation 1, it is very important that F' be computationally equivalent to the original operator F. We observe that for all the popular models for recommender systems, we can always find such sub-operators to form computationally equivalent substitutions. We show how we obtain the substitutions for operators in different models in section 4.
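The forward phase described above can be sketched in a few lines. This is a toy single-process simulation with made-up numbers (the real framework runs one process per worker and aggregates via an MPI-style AllReduce): each "worker" applies its sub-operator to its co-located shard only, and the summed partial results equal the output of the original un-split operator.

```python
M = 4                       # number of worker processes
X = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 3.0, 0.75]   # one sample's features
W = [0.1,  0.2, 0.3, 0.4,  0.5,  0.6, 0.7, 0.8]    # weights of a weights-rich layer

# Partition features and weights into M co-located, non-overlapping shards.
shards = [(X[i::M], W[i::M]) for i in range(M)]

# Each "worker" runs its sub-operator f_i on local data only.
partial = [sum(x * w for x, w in zip(xs, ws)) for xs, ws in shards]

# AllReduce(SUM): every worker ends up with the same aggregated result.
aggregated = sum(partial)

# Computationally equivalent to the original operator F(X, W) = X . W.
assert abs(aggregated - sum(x * w for x, w in zip(X, W))) < 1e-9
```

Note that only the M scalar partial results cross the (simulated) network, never the weights themselves.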
Backpropagation Phase: After the forward phase, each process has the entire result. Because we perform AllReduce not on the gradients but only on some small intermediate results, and because the aggregation operation distributes gradients equally to all its inputs, there is no inter-process communication during the backpropagation phase either. Each process simply passes the gradients directly back to its own sub-operators.
Performance & Complexity Analysis
PS-based: Weights are distributed on parameter servers, while M workers process different batches, each with N samples. Each worker pulls the weights touched by its batch and pushes the corresponding gradients, so the communication time cost for PS-based mode per batch is approximately: T_PS = 2(α + S_b/β), where S_b denotes the total size of the keys, weights, and gradients associated with one batch of N samples.
Mesh-based: A special form of PS-based is mesh-based, in which the weights are divided into chunks co-located with the workers. It has a smaller network cost than original PS-based strategies, because only the fraction (M−1)/M of a batch's weights is remote. In this strategy, each worker processes one batch, and the time cost for M batches in synchronous mode is approximately: T_mesh = 2(α + (M−1)/M · S_b/β).
AllReduce: A full replica of the weights is stored on each worker. The workers synchronize the gradients every iteration. We use ring-based AllReduce, the most widely-adopted AllReduce algorithm, as the default algorithm within the scope of this paper. The time cost of the communication is: T_AR = 2(M−1)(α + S_g/(M·β)), where S_g is the size of the gradients of the model.
DES: Each aggregation operation uses AllReduce, and DES may use several such aggregation operations to form the final result, so the time cost of the communication is: T_DES = Σ_{i=1}^{Q} 2(M−1)(α + S_i/(M·β)), where Q is the number of aggregation operations and S_i is the size of the intermediate results of the i-th operation. Let λ_i = S_i/S_g, and we can see that if λ_i ≪ 1/Q is satisfied for each i, DES will reduce the communication cost.
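The comparison above can be made concrete with a small cost-model sketch (illustrative numbers; the helper name and parameter values are our own, not from the paper): a ring AllReduce over the full gradient versus Q much smaller DES aggregations.

```python
# Hedged sketch of the communication-cost comparison: ring AllReduce of the
# full gradient (size S_g) vs. Q small DES aggregations (sizes S_i).
def ring_allreduce_time(size_bytes, M, alpha, beta):
    # Standard ring AllReduce cost: 2(M-1) steps, each moving size/M bytes.
    return 2 * (M - 1) * (alpha + size_bytes / (M * beta))

M, alpha, beta = 4, 1e-5, 1.25e9          # 4 workers, 10 us latency, 10 Gb/s
S_g = 400e6                                # 100M float32 gradients
S_i = [4 * 4096, 4 * 4096]                 # two per-sample scalar aggregations

t_allreduce = ring_allreduce_time(S_g, M, alpha, beta)
t_des = sum(ring_allreduce_time(s, M, alpha, beta) for s in S_i)
assert t_des < t_allreduce                 # each S_i << S_g, so DES wins
```

With these (assumed) numbers the DES aggregations are orders of magnitude cheaper, which is exactly the λ_i ≪ 1/Q condition in action.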
For both PS-mode strategies, the time complexity of communication is proportional to the batch size N. For the AllReduce-based and DES-based strategies, the time complexity of communication is constant (because the number of aggregation operations Q is usually smaller than 3).
The benefits of the DES strategy are threefold: first, with new operators and their co-located weights, one can split an operator with a huge number of weights into sub-operators with arbitrarily small numbers of parameters, given an abundant number of workers; this gives our framework better scalability than traditional PS-based frameworks. Second, the DES strategy does not send weights but instead the intermediate results of sub-operators, which can be much smaller than the original weights; this significantly reduces the total amount of communication. Third, with the above two improvements, our framework brings synchronous training to large-scale recommender systems; with full synchronization per iteration, the model converges faster, which makes the training process more efficient.
4 Applications on Models for Recommender Systems
We observe that many models in recommender systems share similar components (Table 1). For example, the LR model is the linear part of the W&D model; almost all models include first-order feature crossover; all FM-based models include second-order feature crossover; and the deep components of the W&D model and the DeepFM model share similar structures. An optimal DES strategy finds substitutions for the first-order, second-order, or higher-order operations, which usually involve simple computation but a large number of weights. The goal is to achieve the same computation with a much smaller communication cost for sending partial results over the network. In this section, we describe how to find such computationally equivalent substitutions for different models.
Model  first-order  second-order  high-order 
LR  ✓  
W&D  ✓  ✓  
FM  ✓  ✓  
DeepFM  ✓  ✓  ✓ 
4.1 Logistic Regression
Logistic Regression (LR) Richardson et al. (2007) is a generalized linear model that is widely used in recommender systems. Due to its simplicity, scalability, and interpretability, LR can be used not only as an independent model, but also as an important component in many DLRMs, such as Wide & Deep and DeepFM. The form of LR is as follows:

F(X, W) = σ(⟨X, W⟩ + b)

where X and W are two d-dimension vectors representing the inputs and weights respectively, b is the bias, and σ is a nonlinear transform, usually a sigmoid function for LR. The major part of the computation in F is the dot product. It is easy to find a partition of W: W = ⋃_{i=1}^{M} W_i, where W_i denotes the subset of W co-located with the i-th process. We then define a local operator on W_i:
(2) f_i(X_i, W_i) = ⟨X_i, W_i⟩
We then have the equivalent substitution of F:

(3) F'(X, W) = σ( ⊕_{i=1}^{M} f_i(X_i, W_i) + b ) = F(X, W)
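Equations 2 and 3 can be checked with a toy sketch (illustrative values; a real deployment would distribute the shards and aggregate with AllReduce): partial dot products on local shards, a global sum, then one shared sigmoid reproduce the un-split LR output exactly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

M, b = 2, 0.1
X = [1.0, 0.0, 2.0, 1.0]
W = [0.3, -0.2, 0.5, 0.7]

# Original operator: F(X, W) = sigmoid(<X, W> + b)
full = sigmoid(sum(x * w for x, w in zip(X, W)) + b)

# DES: each process i computes f_i(X_i, W_i) on its co-located shard.
partials = [sum(x * w for x, w in zip(X[i::M], W[i::M])) for i in range(M)]
des = sigmoid(sum(partials) + b)           # AllReduce(SUM), then sigmoid

assert abs(full - des) < 1e-12
```

Only one scalar per process crosses the network per sample, which is what drives the communication-saving ratio derived below.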
Assume that all weights of sparse features are stored in hash tables as float32 values. In the mesh-based strategy, each worker needs to transfer weights with unsigned int64 keys from the hash tables co-located with the other workers. So the total data size transferred through the network by each worker is: S_mesh = 2 · (M−1)/M · (S_k + S_v), where S_k and S_v denote the total size of the feature keys and weights in a batch, respectively.
Using DES, we only need to synchronize a scalar value with the other workers for every sample, so the total data size each worker transfers through the network is: S_DES = 2 · (M−1)/M · N · S_r, where S_r denotes the size of one intermediate result. So the communication-saving ratio for LR is: 1 − S_DES/S_mesh = 1 − N·S_r/(S_k + S_v).
4.2 Factorization Machine
Besides linear interactions among features, FM models pairwise feature interactions as the inner product of latent vectors. FM is both an independent model and an important component of DLRMs such as DeepFM and xDeepFM Lian et al. (2018). The linear interactions are similar to the LR model, so here we only focus on the order-2 operator (denoted by F_o2):
(4) F_o2(X, V) = Σ_{i=1}^{d} Σ_{j=i+1}^{d} ⟨v_i, v_j⟩ x_i x_j = (1/2) Σ_{f=1}^{k} [ (Σ_{i=1}^{d} v_{i,f} x_i)^2 − Σ_{i=1}^{d} v_{i,f}^2 x_i^2 ]

where v_i denotes a latent vector, x_i is the feature value of feature i, and ⟨·, ·⟩ denotes the inner product.

Equation 4 shows another popular form of FM mentioned in Rendle (2010) with only linear complexity. Here we adopt this form to construct our computationally equivalent substitution of FM.
Applying Algorithm 1 to FM, we obtain a partition of V using any partition policy that balances the latent vectors across processes. We then define two local operators, f_i^1 and f_i^2, that process the local subset of weights V_i:

(5) f_i^1(X_i, V_i) = Σ_{j∈P_i} v_j x_j,  f_i^2(X_i, V_i) = Σ_{j∈P_i} v_j^2 x_j^2

where P_i is the set of feature indices co-located with the i-th process.
We then have the equivalent substitution of F_o2:

(6) F'_o2(X, V) = (1/2) Σ_{f=1}^{k} [ (⊕_{i=1}^{M} f_i^1)_f^2 − (⊕_{i=1}^{M} f_i^2)_f ]
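The identity behind equations 4-6 can be verified numerically in a short sketch (toy data; shard layout and helper names are illustrative): the naive pairwise sum equals the linear-complexity form, and both inner sums decompose into per-process partials that can be AllReduce-aggregated.

```python
d, k, M = 6, 3, 2
x = [1.0, 2.0, 0.0, 1.0, 3.0, 0.5]
v = [[0.1 * (i + 1) * (f + 1) for f in range(k)] for i in range(d)]

# Original pairwise operator: sum_{i<j} <v_i, v_j> x_i x_j
pairwise = sum(
    sum(v[i][f] * v[j][f] for f in range(k)) * x[i] * x[j]
    for i in range(d) for j in range(i + 1, d)
)

# DES: per-process partial sums f^1_i and f^2_i over a shard P_i of features.
def shard_sums(idx):
    s1 = [sum(v[i][f] * x[i] for i in idx) for f in range(k)]         # f^1_i
    s2 = [sum((v[i][f] * x[i]) ** 2 for i in idx) for f in range(k)]  # f^2_i
    return s1, s2

shards = [range(p, d, M) for p in range(M)]
sums = [shard_sums(idx) for idx in shards]

# AllReduce(SUM) over both partial tensors, then the cheap final combine.
g1 = [sum(s1[f] for s1, _ in sums) for f in range(k)]
g2 = [sum(s2[f] for _, s2 in sums) for f in range(k)]
linear = 0.5 * sum(g1[f] ** 2 - g2[f] for f in range(k))

assert abs(pairwise - linear) < 1e-9
```

Only two k-dimension vectors per sample are exchanged, instead of all the latent vectors a worker's batch touches.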
In the mesh-based strategy, each worker needs to look up latent vectors by feature ID from the hash tables co-located with the other workers. The total data size each worker transfers through the network is: S_mesh = 2 · (M−1)/M · (S_k + S_lv), where S_k and S_lv denote the total size of the feature keys and latent vectors per batch, respectively.
Using DES, the FM order-2 operators only require all workers to exchange f_i^1 and f_i^2 with each other, so we have: S_DES = 2 · (M−1)/M · N · (S_{f^1} + S_{f^2}), where S_{f^1} and S_{f^2} are the per-sample sizes of the two partial results (each a k-dimension vector). The communication-saving ratio for FM is: 1 − S_DES/S_mesh.
4.3 Deep Neural Network
Recommender systems use DNNs to learn high-order feature interactions. The features are usually categorical and grouped in fields. A DNN starts with an embedding layer that compresses the latent vectors into dense embedding vectors by field, and is usually followed by multiple fully-connected (FC) layers, as shown in Figure 4.
As with FM, in DNNs the majority of the weights come from the embedding layer and the first FC layer:

(7) Y = X_e W

where X_e denotes the concatenated output of the embedding layer and W denotes the weights of the first FC layer.
Using DES, we split X_e and W into partitions along the fields dimension and use blocked matrix multiplication (Figure 5), which is similar to the method proposed by Gholami et al. Gholami et al. (2018). Our strategy differs in the splitting: we divide X_e and W along the same dimension to ensure that the computation and weights do not overlap across the different parts:
(8) X_e W = [X_e^1, X_e^2, …, X_e^M] [W^1; W^2; …; W^M] = Σ_{i=1}^{M} X_e^i W^i

Hence we obtain the partitions of X_e and W: X_e = ⋃_{i=1}^{M} X_e^i and W = ⋃_{i=1}^{M} W^i, where X_e^i and W^i denote the subsets of X_e and W co-located with the i-th process, respectively.
Considering that the embedding layer aggregates the latent vectors by field before concatenating them, we store the latent vectors of the same field on the same process to avoid unnecessary weights exchange. In this way, we also avoid communication during the backpropagation phase.
Using this partition, we define the local operator as follows: f_i(X_e^i, W^i) = X_e^i W^i. The distributed equivalent substitution of F is hence defined as:

(9) F'(X_e, W) = ⊕_{i=1}^{M} f_i(X_e^i, W^i) = Σ_{i=1}^{M} X_e^i W^i
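The blocked multiplication in equations 8-9 can be sketched as follows (toy sizes and random values; pure-Python matrix math stands in for the real tensor ops): splitting X_e and W along the shared dimension and summing the partial products reproduces the full first-FC output.

```python
import random

random.seed(0)
n_cols, d_out, M = 8, 3, 4            # 8 embedding columns, 4 workers
x_e = [random.random() for _ in range(n_cols)]                       # 1 x 8
w = [[random.random() for _ in range(d_out)] for _ in range(n_cols)] # 8 x 3

# Original first-FC operator: y = x_e @ W
full = [sum(x_e[i] * w[i][j] for i in range(n_cols)) for j in range(d_out)]

# DES: worker p holds columns p, p+M, ... of x_e and the matching rows of W.
partials = [
    [sum(x_e[i] * w[i][j] for i in range(p, n_cols, M)) for j in range(d_out)]
    for p in range(M)
]
# AllReduce(SUM) of the M partial (1 x d_out) results.
des = [sum(part[j] for part in partials) for j in range(d_out)]

assert all(abs(a - b) < 1e-9 for a, b in zip(full, des))
```

Each worker ships only a (1 x d_out) partial product per sample, regardless of how wide the embedding layer is.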
In the mesh-based strategy, each worker needs to look up parts of X_e and W by key (unsigned int64) from the hash tables co-located with the other workers. The total data size each worker transfers is: S_mesh = 2 · (M−1)/M · (S_k + S_e + S_w), where S_k, S_e, and S_w denote the per-batch sizes of the feature keys, embedding vectors, and first-FC-layer weights, respectively. Compared to the mesh-based strategy, a DNN using DES only requires all workers to exchange the partial products X_e^i W^i with each other (Figure 4): S_DES = 2 · (M−1)/M · N · S_y, where S_y is the per-sample size of the first FC layer's output. The communication-saving ratio for DNN is: 1 − S_DES/S_mesh.
batch  uniq_feats  LR  FM  DNN 

512  147,664  99.769 %  99.376 %  90.310 % 
1024  257,757  99.735 %  99.285 %  86.226 % 
2048  448,814  99.696 %  99.179 %  81.658 % 
4096  789,511  99.654 %  99.066 %  77.015 % 
8192  1,389,353  99.607 %  98.939 %  72.264 % 
Using DES does not increase the computation compared to the PS/mesh-based strategy, and often leads to a smaller computation load. Table 2 shows the number of unique features per batch as well as the communication-saving ratios for three models at different batch sizes on a real-world recommender system. Compared to the mesh-based strategy, the communication cost when using DES is reduced by between 72.26% (with a batch size of 8192) and 99.77% (with a batch size of 512).
Our analysis here only includes the communication cost of transferring the sparse weights. In fact, for most recommender systems, state-of-the-art stateful optimizers such as FTRL McMahan et al. (2013), AdaGrad Duchi et al. (2011), and Adam Kingma and Ba (2014) require saving and transferring the corresponding state variables along with the sparse weights. When using the DES strategy, these variables are kept local, which reduces the communication cost even further.
Extending to General Models: The previous analysis shows that we can apply DES to several state-of-the-art models for recommender systems. We believe this is not a coincidence. To generalize our observations, we claim that for any DLRM, as long as the computationally equivalent substitution of the weights-rich layers does not surpass linear complexity, we can apply the DES strategy. FM Rendle (2010) is the work that inspired us to find linear substitutions for operators. The linear complexity is O(kn), where n is the size of the feature parameters. Since DES splits an n-dimension feature vector into M parts, where n = qM, q is a constant, and M is the number of DES worker processes, each sub-operator still runs in linear time. We have a simple rule for judging linear complexity: if the computation process of a weights-rich layer satisfies the commutative and associative laws, we can apply the DES strategy to reduce the communication cost in the forward phase and eliminate the gradient aggregation in the backward phase.
5 System Implementation
We choose TensorFlow as the backend for our training framework due to its flexibility and natural distributed-friendliness. More specifically, we implement our system by enhancing TensorFlow in the following two aspects: large-scale sparse features and a dynamic hash table.
Large-scale Sparse Features: As mentioned earlier, an industrial streaming recommender system may have hundreds of billions of dynamic features. With an embedding size of 8 and float32 weights, 10^11 features require at least 3.2 TB of memory. Table 2 shows that, for a single iteration, the weight update over unique features is sparse. To achieve constant-cost data access/update and overcome the memory constraint of a single node, we use a distributed hash table. We use a simple method to distribute the weights: in a cluster with n nodes, the k-th node holds all the weights corresponding to feature field IDs j with j mod n = k. Other methods could achieve better load balancing, but we found this simple method works well in our case.
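The mod-based placement just described amounts to a one-line owner function (hypothetical helper name; the real system applies this inside a distributed hash table):

```python
def owner(field_id: int, n_nodes: int) -> int:
    """Node that stores the weights for `field_id`: id mod n_nodes."""
    return field_id % n_nodes

n_nodes = 4
ids = [17, 1024, 9999999937, 42]
placement = {fid: owner(fid, n_nodes) for fid in ids}

assert placement[17] == 1 and placement[1024] == 0
# Every shard is a valid node index, and shards are disjoint by construction.
assert all(0 <= node < n_nodes for node in placement.values())
```

A consistent-hashing scheme would balance load better under node churn, but as the text notes, plain modulo is sufficient here.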
Dynamic Hash Table: In the DES strategy, there are three places where we operate on hash tables: given a feature ID in a batch of input samples, we look up the corresponding weight; when a new feature ID is given as a key, we insert an initialized weight into the hash table; and given the gradient of a weight, we apply it locally and then update the hash table with the new weight. To achieve this, we provide a modified dynamic hash table implementation in TensorFlow with the key operations adapted to our needs (Figure 6). Compared to alternative design choices, this implementation reuses as many existing TensorFlow features as possible and only introduces hash table operations during batch building and the optimizer phase, because after the lookup, the sparse weights are reformed into dense tensors that are fully compatible with TensorFlow's native training pipeline.
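The three hash-table operations above can be sketched with a toy dict-based stand-in (single-process and illustrative; the real table is a distributed, TensorFlow-integrated implementation): lookup-or-initialize, densify for the normal pipeline, and write back locally applied gradients.

```python
import random

random.seed(1)
table = {}                                 # feature_id -> weight vector
EMB = 4                                    # embedding size

def lookup(feature_ids):
    # Insert an initialized weight the first time a feature ID is seen.
    for fid in feature_ids:
        if fid not in table:
            table[fid] = [random.uniform(-0.01, 0.01) for _ in range(EMB)]
    # Reform sparse weights into a dense batch tensor (list of rows).
    return [table[fid][:] for fid in feature_ids]

def apply_gradients(feature_ids, grads, lr=0.1):
    # Apply gradients locally, then update the table with the new weights.
    for fid, g in zip(feature_ids, grads):
        table[fid] = [w - lr * gi for w, gi in zip(table[fid], g)]

batch = [101, 202, 101]                    # a repeated ID hits the same row
dense = lookup(batch)
assert dense[0] == dense[2]
apply_gradients([101], [[1.0] * EMB])
assert abs(table[101][0] - (dense[0][0] - 0.1)) < 1e-12
```

Because the densified rows feed the rest of the graph unchanged, only the lookup and the optimizer phase ever touch the hash table, matching the design described above.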
6 Experiments And Analysis
Hardware: We ran all experiments in this paper on a testing cluster of four Linux servers, each consisting of two hyper-threaded 24-core Intel Xeon E5-2670 v3 (2.3 GHz) CPUs, 128 GB of host memory, and one Intel 10-Gigabit Ethernet Controller X540-AT2 without RDMA support.
Software: Our DES framework is based on an enhanced version of TensorFlow 1.13.1 and standard OpenMPI 4.0.1. Considering that mesh-based frameworks are a special form of PS-based frameworks and usually have lower communication cost than original PS-based frameworks, we use the mesh-based strategy for comparison. The mesh-based strategy we compare with is implemented using a popular open-source framework, DiFacto Li et al. (2016).
Dataset: In order to verify the performance of DES in a real industrial context, we evaluate our framework on the following two datasets.
1) Criteo Dataset: the public Criteo display advertising challenge dataset.
2) Company* Dataset: We extract a contiguous segment of samples from a recommender system in internal use. On average, each sample contains 950 unique feature values. The total number of samples is 10,809,440. The data is stored on a remote sample server.
Parameter Settings: We set DiFacto to run one worker process on each server, with a batch size of 4,096 and 24 concurrent threads. Correspondingly, both TensorFlow parallelism parameters for DES are set to 24, and the batch size for DES is set to 4,096 when testing AUC. Since for DES all workers train samples from the same batch synchronously in parallel, when testing the communication ratio we set the batch size to 16,384 (for M = 4) to guarantee a fair comparison. We train all models with the same optimizer settings: FTRL for order-1 components, and AdaGrad or Adam for both the embedding and DNN components.
Evaluation Metrics: We use two evaluation metrics in our experiments: AUC (Area Under ROC) and Logloss (cross entropy).
Performance Summary: We compare our framework to the mesh-based implementation on three widely-adopted models in mainstream recommender systems: LR, W&D, and DeepFM. In general, on all three models, DES achieves better AUC in a smaller number of iterations with an order of magnitude smaller communication cost.
model  DiFacto  DES 

LR  0.7906  0.7913 
W&D  0.8015  0.8025 
DeepFM  0.8027  0.8035 
Table 3 shows that during long-term online training, when consuming the same amount of samples with similar distributions, DES shows better average AUC for all three models. One possible explanation is that with DES the training is synchronous, which usually leads to better and faster convergence than asynchronous training. The reason we care about small AUC increases is that in several real-world applications we run internally, even a small increase in AUC is amplified roughly 5x when transferred to the final CTR.
model  PS  DES  

AUC  LogLoss  AUC  LogLoss  
LR  0.7353  0.5037  0.7534  0.4823 
W&D  0.7737  0.4753  0.7822  0.4761 
DeepFM  0.7750  0.4747  0.7924  0.4674 
Table 4 shows the AUC and log loss of the three models using PS-mode asynchronous training and DES-mode fully-synchronous training on TensorFlow, respectively.
Computation vs. Communication Time: Figure 7 shows that in all experiments, the DiFacto framework spends more time on both computation and communication. The absolute total network communication time using the DiFacto framework is 2.7x, 2.3x, and 3.2x larger for LR, W&D, and DeepFM respectively than using DES. The savings in communication time come from the smaller amount of intermediate results sent among workers during the forward phase and the elimination of gradient aggregation during the backward phase. The savings in computation time come from the reduced time complexity of the computationally equivalent substitution, as well as several optimizations we have built into the DES framework.
Throughput: Table 5 compares the throughput of DES and DiFacto. For deep models with high-order components (W&D and DeepFM), DES has larger advantages, which indicates greater benefits when applying DES to future DLRMs.
model  Throughput (samples/sec)  improvement  

PS  DES  
LR  50396.8  78205.3  1.55x 
W&D  11023.9  49837.3  4.52x 
DeepFM  10560.1  41295.5  3.91x 
7 Conclusions and Future Works
We propose a novel framework for models with large-scale sparse dynamic features in streaming recommender systems. Our framework achieves efficient synchronous distributed training thanks to its core component, the Distributed Equivalent Substitution (DES) algorithm. We take advantage of the observation that for all models in recommender systems, the first one or few weights-rich layers participate only in straightforward computation and can be replaced by a group of distributed operators that form a computationally equivalent substitution. Using DES, the intermediate information transferred between workers during the forward phase is reduced, and the AllReduce on gradients between workers during the backward phase is eliminated. The application of DES to popular DLRMs such as FM, DNN, Wide & Deep, and DeepFM shows the general applicability of our algorithm. Experiments on a public dataset and an internal dataset, comparing our implementation with a popular PS-based implementation, show that our framework achieves up to 68.7% communication savings and higher AUC.
Future Work: We have shown in Section 6 that our current implementation of DES is compute-bound. The natural next step is therefore to move the computation of the current bottleneck operators, such as the hash table, to GPUs and to improve the existing kernel implementations. We have also started initial work on applying DES to more models commonly used in industry, such as DCN Wang et al. (2017) and DIN Zhou et al. (2018).
Acknowledgement We appreciate the technical assistance, advice, and machine access from our colleagues at Tencent: Chaonan Guo and Fei Sun.
Footnotes
 More details in Section 4.
 http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/
 We use the FTRL optimizer for the LR model and the Adam optimizer for the other two models.
References
 Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, Republic and Canton of Geneva, CHE, pp. 381–389. External Links: ISBN 9781450349130, Link, Document Cited by: 1st item.
 Revisiting distributed synchronous SGD. CoRR abs/1604.00981. External Links: Link, 1604.00981 Cited by: 2nd item.
 Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, New York, NY, USA, pp. 7–10. External Links: ISBN 9781450347952, Link, Document Cited by: §2.2.
 Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, New York, NY, USA, pp. 191–198. External Links: ISBN 9781450340359, Link, Document Cited by: §2.2.
 Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 1, NIPS’12, USA, pp. 1223–1231. External Links: Link Cited by: §2.1.
 Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159. External Links: ISSN 15324435, Link Cited by: §4.3.
 Integrated model, batch, and domain parallelism in training neural networks. In SPAA’18: 30th ACM Symposium on Parallelism in Algorithms and Architectures, External Links: Link Cited by: 1st item, §2.1, §4.3.
 External Links: Link Cited by: §2.1.
 Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. External Links: Link, 1706.02677 Cited by: §2.1.
 DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 1725–1731. External Links: ISBN 9780999241103, Link Cited by: §2.2.
 GPipe: efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965. External Links: Link, 1811.06965 Cited by: §2.1, §3.2.
 Highly scalable deep learning training system with mixed-precision: training imagenet in four minutes. CoRR abs/1807.11205 (1807.11205v1). External Links: 1807.11205v1 Cited by: §2.1.
 Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR abs/1802.04924. External Links: Link, 1802.04924 Cited by: §2.1, §3.2.
 Adam: a method for stochastic optimization. arXiv e-prints, pp. arXiv:1412.6980. External Links: 1412.6980 Cited by: §4.3.
 One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997. External Links: Link, 1404.5997 Cited by: §2.1, §2.1.
 Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, Berkeley, CA, USA, pp. 583–598. External Links: ISBN 9781931971164, Link Cited by: §2.1.
 DiFacto: distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, New York, NY, USA, pp. 377–386. External Links: ISBN 9781450337168, Link, Document Cited by: §6.
 xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1754–1763. External Links: ISBN 9781450355520, Link, Document Cited by: §2.2, §4.2.
 Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, pp. 1222–1230. External Links: ISBN 9781450321747, Link, Document Cited by: §4.3.
 Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning  Volume 70, ICML’17, pp. 2430–2439. External Links: Link Cited by: §2.1.
 Deep content-based music recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 2643–2651. External Links: Link Cited by: §2.2.
 Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, Washington, DC, USA, pp. 995–1000. External Links: ISBN 9780769542560, Link, Document Cited by: §2.2, §4.2, §4.3.
 Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, New York, NY, USA, pp. 521–530. External Links: ISBN 9781595936547, Link, Document Cited by: §2.2, §4.1.
 Mesh-TensorFlow: deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 10435–10444. External Links: Link Cited by: §2.1.
 Deep & cross network for ad click predictions. CoRR abs/1708.05123. External Links: Link, 1708.05123 Cited by: §7.
 Large batch optimization for deep learning: training BERT in 76 minutes. CoRR abs/1904.00962. External Links: Link, 1904.00962 Cited by: §2.1.
 Deep learning over multi-field categorical data: a case study on user response prediction. ArXiv abs/1601.02376. Cited by: §2.2.
 Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1059–1068. External Links: ISBN 9781450355520, Link, Document Cited by: §2.2, §7.