Keywords: Recommender Systems, Distributed Training

1 Introduction

Large-scale recommender systems are critical tools for enhancing user experience and promoting sales and services on many websites and mobile applications. One essential component of the recommender system pipeline is click-through rate (CTR) prediction. Practitioners typically use machine learning models with tens or even hundreds of billions of parameters to make predictions from large volumes of streaming input data, including user preferences, item features, and past user-item interactions. Current industrial-level recommender systems (RSs) usually have such large parameter sizes that the asynchronous parameter-server (PS) mode has become the only available method for building them.

Ideally, an efficient distributed recommender system should meet three requirements:

  • Dynamic Features: In industrial scenarios, more and more recommender systems run in streaming mode because new users and items arrive continuously in unbounded data streams. In streaming recommender systems Gholami et al. (2018); Chang et al. (2017), the size of the model parameters is usually temporally dynamic and reaches hundreds of GBs or even several TBs. Parameters at this scale naturally require distributed storage.

  • Stable Convergence: Before deep-learning recommendation models (DLRMs) became popular, the negative impact of gradient staleness Chen et al. (2016) in asynchronous training on accuracy was not significant in RSs. As more and more deep learning components are introduced into recommendation models, RSs are required to support fully synchronous training for stable convergence and higher AUC.

  • Real-time Updating: One vital characteristic of streaming recommendation scenarios is the high velocity of inference queries, so an RS needs to update and respond instantly in order to capture users’ real-time intentions and demands. As model size increases over time, it becomes more and more important for RSs to reduce the demand on network transmission to stay timely.

The above requirements are affected by two design choices we make when building a large-scale distributed recommender system: how to parallelize the training pipeline, and how to synchronize the parameters. For parallelization, we can use either data parallelism (to parallelize over the data dimension), or model parallelism (to parallelize computation on parameters on different devices). For synchronization, the system can be synchronous or asynchronous (usually when using PS mode).

However, existing methods cannot be easily adapted to recommender systems for two reasons:

First, for DLRMs with a very large number of parameters, pure data parallelism keeps a replica of the entire model on a single device, which is infeasible because recommender systems usually have very large weights to update in the first few layers (we call these layers weights-rich layers). Also, in the context of recommender systems, the features of different input samples in a batch can differ in length, so pure data parallelism with a linearly scaled batch size is inapplicable. Pure model parallelism usually treats layers and operators as a whole and optimizes load balance through different device placement policies, which does not apply to most large-scale recommender systems today either.

Second, current PS-mode implementations of large-scale recommender systems are essentially a hybrid data-and-model parallelism strategy and always need to trade off update frequency against communication bandwidth. Applying such an asynchronous strategy to current and future models with even larger parameter sizes will make it more difficult for these models to converge to the same performance while keeping training efficient.

To solve the above two issues, we present a novel distributed training framework for recommender systems that achieves faster training with less communication overhead, using a strategy we call distributed equivalent substitution (DES). The key idea of DES is to replace each weights-rich layer with a carefully designed group of sub-operators, each of which updates only its co-located partial weights. The partial computation results are aggregated to form a computationally equivalent substitution for the original operator. To reduce communication, we look for sub-operators that generate smaller partial results to form the equivalent substitution. We empirically show that, for all the weights-rich operators whose parameters dominate the model, it is easy to find an equivalent substitution strategy that reduces communication demand by an order of magnitude. We also discuss how to extend DES to other general models.

The main contributions of this paper are as follows:

  • We present DES training, a distributed training method for recommender systems that achieves better convergence with less communication overhead on large-scale streaming recommendation scenarios.

  • We propose a group of strategies that replace the weights-rich layers in multiple popular recommendation models with computationally equivalent sub-operators, which update only co-located weights and aggregate partial results at a much smaller communication cost.

  • We show that for the different types of models most often used in recommender systems, we can find corresponding substitution strategies for all of their weights-rich layers.

  • We present an implementation of DES training framework that outperforms the state-of-the-art recommender system. In particular, we show that our framework achieves 68.7% communication savings on average compared to other PS-based recommender systems.

2 Related Work

Large-scale recommender systems are distributed systems designed specifically for training recommendation models. This section reviews related works from the perspectives of both fields:

2.1 Large-Scale Distributed Training Systems

Data Parallelism splits training data along the batch dimension and keeps a replica of the entire model on each device. The popularity of ring-based AllReduce Gibiansky (2017) has enabled large-scale data-parallel training Goyal et al. (2017); Jia (2018); You et al. (2019). Parameter Server (PS) is a primary method for training large-scale recommender systems due to its simplicity and scalability Dean et al. (2012); Li et al. (2014). Each worker processes a subset of the input data and is allowed to use stale weights and to update either its own weights or those of a parameter server. Model Parallelism is another commonly used distributed training strategy Krizhevsky (2014); Dean et al. (2012). More recent model parallelism strategies learn the device placement Mirhoseini et al. (2017) or use pipelining Huang et al. (2018). These works usually focus on enabling the system to process complex models with a large number of weights.

Previously, there have been several hybrid data-and-model parallelism strategies. Krizhevsky (2014) proposed a general method for using both data and model parallelism for convolutional neural networks. Gholami et al. (2018) developed an integrated model, data, and domain parallelism strategy. Although it theoretically summarizes several possible ways to distribute the training process, the method focuses only on a limited set of operations such as convolution and is not applicable to fully connected layers. Jia et al. (2018) proposed another integrated parallelism strategy called "layer parallelism". However, it also focuses on a limited set of operations and cannot split the computation of a single operation, which makes it difficult to apply to recommender systems. Mesh-TensorFlow Shazeer et al. (2018) implements a more flexible parameter-server-like architecture, but for recommender systems it could introduce unnecessary weights communication between different operations.

2.2 Recommender Systems

The critical problem a recommender system tries to solve is Click-Through Rate (CTR) prediction. Logistic regression (LR) is one of the first methods applied to it Richardson et al. (2007) and is still common practice today. Factorization machine (FM) Rendle (2010) uses addition and inner product operations to capture the linear and pairwise interactions between features. More recently, deep-learning based recommendation models (DLRMs) have gained more and more attention Zhang et al. (2016); Cheng et al. (2016); Guo et al. (2017); Lian et al. (2018); Zhou et al. (2018). The Wide & Deep (W&D) model combines a general linear model (the wide part) with a deep learning component (the deep part) to enable the recommender to capture both memorization and generalization. DeepFM integrates a factorization machine and a multi-layer perceptron (MLP) to model both high-order and low-order feature interactions. Other applications of DLRMs include music recommendation Oord et al. (2013) and video recommendation Covington et al. (2016). One common characteristic of all existing industrial-level recommender systems is that they have tens or even hundreds of billions of dynamic features. To the best of our knowledge, the dominant way to build a large-scale recommender system today is still the parameter-server based approach.

3 Background and Design Methodology

3.1 Recommender System Overview

The typical process of a recommender system starts when a user-generated query comes in. The recommender system returns a list of items that the user can interact with (by clicking or purchasing) or ignore. These user operations, queries, and interactions are recorded in the log as training data for future use. Because recommender systems serve a large number of simultaneous queries, it is difficult to score each query in detail within the service latency requirement (usually 100 milliseconds). Therefore, we need a recall system to pick a short list of the most relevant items from the global item list, using a combination of machine learning models and manually defined rules. After the candidate pool is reduced, a ranking system ranks all items according to their scores. The score usually represents the probability of a user behavior tag given the features, which include user characteristics (e.g., country, language, demographic), context features (e.g., devices, hours of the day, days of the week), and impression features (e.g., application age, application history statistics). This paper mainly studies the core component of a recommender system: the models used for ranking and online learning.

3.2 Distributed Equivalent Substitution Strategy

Previous PS-based or model parallelism methods usually do not change the operator at the algorithm level. This means that for recommender systems whose first one or more layers are weights-rich, placing operators on different devices still cannot solve the out-of-memory problem for a single weights-rich layer. Some works do split the operator Huang et al. (2018); Jia et al. (2018), but they focus on convolution, which has completely different characteristics from the operators frequently used in recommender systems. Our strategy instead designs a computationally equivalent substitution for the original weights-rich layer: it replaces the layer with a group of computationally equivalent operators, each of which updates only a portion of the weights and processes non-overlapping input data. Since each new operator updates only one portion of the weights, our method breaks through the single-node memory limitation and avoids transmitting a large number of parameters between nodes. This strategy is designed specifically for large-scale recommender systems. In models for such systems, the majority of the parameters participate only in very simple computation in the first few layers. Such models include LR, FM, W&D, and many of their follow-ups.

Definitions and Notations

To help readers follow our contributions in later sections, we list some basic definitions and notations in the context of a distributed training framework for recommender systems. We first define the operation $\bigoplus$ for convenience of description:

$$R = \bigoplus_{i=1}^{N} r_i = r_1 \oplus r_2 \oplus \cdots \oplus r_N$$

In the context of this paper, $\bigoplus$ is one of the MPI-style collective operations: AllReduce. However, it can be any commutative-associative aggregation operation. $r_i$ represents the local values held by processor $i$, and $R$ represents the final result. The following are some definitions we need for the description of DES strategy:

  • $\mathcal{F}$: the original operator function;

  • $f_j$: the $j$-th sub-operator function;

  • $\mathcal{G}$: the computationally equivalent substitution of $\mathcal{F}$;

  • $Y_i$: the local result for one substitution operator of $\mathcal{G}$;

  • $B$: batch size of samples on each iteration;

  • $N$: number of worker processes;

  • $M$: number of sub-operators;

  • $X$: input tensor of an operator;

  • $W$: weights tensor of an operator;

  • $\alpha$: latency of the network;

  • $C$: network bandwidth;

  • $S$: size of features, weights, gradients, or intermediate results in bytes.

Without loss of generality, we suppose that each worker has only one process, so the number of workers equals the number of processes. We also assume that every operator takes only one input tensor $X$ and one weights tensor $W$. With $X_i$ and $W_i$ denoting the partitions of $X$ and $W$ co-located with the $i$-th process, the substitution has to satisfy:

$$\mathcal{F}(X, W) = \mathcal{G}\Big(\bigoplus_{i=1}^{N} f_1(X_i, W_i), \ldots, \bigoplus_{i=1}^{N} f_M(X_i, W_i)\Big) \quad (1)$$

The key observation is that models in recommender systems always have one or more weights-rich layers that hold a dominant portion of the parameters. The core idea of DES strategy is to find a computationally equivalent substitution for the operator of these weights-rich layers, and to find a splitting method that reduces the communication among all the sub-operators.

Figure 1: Forward pass for one operator of PS/Mesh-based strategy (left) and DES strategy (right).

Forward Phase: Figure 1 illustrates the forward pass in the two-worker case and compares our DES strategy with the PS-based strategy. In the PS-based strategy, the weights tensor $W$ is not split, so each operator needs the entire $W$ when doing the computation. Also, $W$ is not co-located with the input $X$ but pulled to the device when needed. In DES strategy, we partition the weights and inputs across processes, perform parallel aggregations on the results of one or more sub-operators, then use the substitution operator to get the final result on each process. Algorithm 1 shows this process:

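  Input: data $X$, weights $W$, number of processes $N$, number of sub-ops $M$
  for all $i$-th process such that $1 \le i \le N$ do
     make $X_i$ and $W_i$ co-located with the $i$-th process
     for $j = 1$ to $M$ do
        $\bar{Y}_j \leftarrow \bigoplus_{i=1}^{N} f_j(X_i, W_i)$ {parallel aggregation}
     end for
  end for
  return $\mathcal{G}(\bar{Y}_1, \ldots, \bar{Y}_M)$ {each process gets the same final results}
Algorithm 1 Distributed Equivalent Substitution Algorithm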

The layers following the weights-rich layer get the same aggregated results on each process, so no further inter-process communication is needed in the subsequent computation of the forward phase. To guarantee the correctness of equation 1, it is very important that the substitution operator is computationally equivalent to the original operator. We observe that for all the popular models for recommender systems, we can always find such sub-operators to form computationally equivalent substitutions. We show in detail how we obtain the substitutions for operators in different models in section 4.
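As an illustration of Algorithm 1, the following Python sketch simulates the processes inside a single program. The names `allreduce_sum`, `des_forward`, and `local_dot` are hypothetical, and a plain sum stands in for the MPI-style AllReduce; this is a minimal sketch under those assumptions, not the framework's implementation.

```python
from typing import Callable, List, Sequence

Partition = Sequence[float]


def allreduce_sum(local_values: Sequence[float]) -> float:
    """Stand-in for AllReduce(sum): every process would receive this total."""
    return sum(local_values)


def des_forward(
    x_parts: Sequence[Partition],
    w_parts: Sequence[Partition],
    sub_ops: Sequence[Callable[[Partition, Partition], float]],
    combine: Callable[..., float],
) -> float:
    """Simulate Algorithm 1 over N co-located (x_i, w_i) partitions.

    sub_ops holds the M local sub-operators; combine plays the role of
    the substitution operator applied to the M aggregated results.
    """
    aggregated: List[float] = []
    for sub_op in sub_ops:
        # Each process evaluates the sub-operator on its own partition only.
        local_results = [sub_op(x_i, w_i) for x_i, w_i in zip(x_parts, w_parts)]
        aggregated.append(allreduce_sum(local_results))
    return combine(*aggregated)


def local_dot(x_i: Partition, w_i: Partition) -> float:
    """Example sub-operator: dot product over the local partition."""
    return sum(x * w for x, w in zip(x_i, w_i))


# Two simulated processes, one sub-operator, identity as the final step.
x_parts = [[1.0, 2.0], [3.0]]
w_parts = [[0.5, 0.25], [2.0]]
des_result = des_forward(x_parts, w_parts, [local_dot], lambda y: y)
```

Only the per-partition results (one float per process here) cross the simulated process boundary; the weights never leave their partition.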

Back-propagation Phase: After the forward phase, each process has the complete aggregated results. Because we do not run AllReduce on the gradients but only on some small intermediate results, and because the aggregation operation distributes gradients equally to all of its inputs, there is no inter-process communication during the back-propagation phase either. Each process simply passes the gradients directly back to its own sub-operators.
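To make the locality of the backward pass concrete, consider the special case where the aggregation is a sum and the sub-operator is a local dot product, so that $y = \sum_i w_i \cdot x_i$. Then $\partial L / \partial w_i = (\partial L / \partial y)\, x_i$, and the upstream gradient $\partial L / \partial y$ is the same on every process because every process holds the same $y$. The sketch below assumes exactly this case; `local_weight_grads` is a hypothetical helper name.

```python
from typing import List, Sequence


def local_weight_grads(
    upstream_grad: float, x_parts: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Per-process weight gradients for y = sum_i dot(w_i, x_i).

    Each inner list is computed from that process's own inputs and the
    shared upstream gradient, so no gradient exchange is needed.
    """
    return [[upstream_grad * x for x in x_i] for x_i in x_parts]


# Two simulated processes sharing the same upstream gradient dL/dy = 0.5.
grads = local_weight_grads(0.5, [[1.0, 2.0], [3.0]])
```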

Performance & Complexity Analysis

PS-based: Weights are distributed across parameter servers, while workers process different batches, each with $B$ samples. The time cost for PS-based mode is:

Mesh-based: A special form of the PS-based strategy is the mesh-based strategy, in which the weights are divided into chunks and co-located with some workers. It has a smaller network cost than the original PS-based strategies. In this strategy, each worker processes one batch; the time cost for $N$ batches in synchronous mode is:

AllReduce: A full replica of weights is stored on each worker. The workers synchronize the gradients every iteration. We use Ring-based AllReduce, the most widely-adopted AllReduce algorithm, as the default algorithm for the scope of this paper. The time cost of the communication is:

Where is the size of gradients for the model.

DES: Each aggregation operation uses AllReduce, and DES may use several such aggregation operations to form the final result, so the time cost of the communication is:

Where is the number of aggregation operations, and is the size of intermediate results for the th operation . Let

and we can see if is satisfied for each , DES will reduce communication cost.

For both PS-mode strategies, the time complexity of the communication is proportional to the batch size. For AllReduce and DES-based strategies, the time complexity of the communication is constant (because the number of aggregation operations is usually smaller than 3).
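For intuition about the AllReduce and DES cases, the sketch below uses the standard latency-bandwidth cost model of ring AllReduce, in which $2(N-1)$ steps each move one $1/N$ chunk of the data. This model is an assumption on our part and not necessarily the exact expression used in this analysis; the function names and all numeric values are hypothetical.

```python
from typing import Sequence


def ring_allreduce_time(
    num_workers: int,
    size_bytes: float,
    latency_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Standard cost model: 2(N-1) steps, each sending a 1/N chunk."""
    if num_workers < 2:
        return 0.0
    steps = 2 * (num_workers - 1)  # reduce-scatter followed by all-gather
    chunk_bytes = size_bytes / num_workers
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def des_comm_time(
    num_workers: int,
    intermediate_sizes: Sequence[float],
    latency_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """One AllReduce per aggregation operation, summed over all of them."""
    return sum(
        ring_allreduce_time(num_workers, size, latency_s, bandwidth_bytes_per_s)
        for size in intermediate_sizes
    )


# Hypothetical numbers: 4 workers, 10 Gbit/s links, 50 microsecond latency.
bandwidth = 1.25e9
grad_time = ring_allreduce_time(4, 1e9, 50e-6, bandwidth)  # 1 GB of gradients
des_time = des_comm_time(4, [4096 * 4, 4096 * 4 * 8], 50e-6, bandwidth)
```

Under this model DES pays the latency term once per aggregation operation, so it wins when every intermediate result is much smaller than the gradients it replaces.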

The benefits of DES strategy are threefold. First, with new operators and their co-located weights, one can split an operator with a huge number of weights into sub-operators with an arbitrarily small number of parameters, given enough workers. This gives our framework better scalability than traditional PS-based frameworks. Second, DES strategy does not send weights but intermediate results from sub-operators, which can be much smaller than the original weights. This can significantly reduce the total amount of communication our framework needs. Third, with the above two improvements, our framework brings synchronous training to large-scale recommender systems. With full synchronization on every iteration, the model converges faster, which makes the training process more efficient.

4 Applications on Models for Recommender Systems

We observe that many models in recommender systems share similar components (Table 1). For example, the LR model is the linear part of the W&D model; almost all models include first-order feature crossover; all FM-based models include second-order feature crossover; and the deep components of the W&D and DeepFM models share similar structures. An optimal DES strategy finds substitutions for first-order, second-order, or higher-order operations, which usually involve simple computation but a large number of weights. The goal is to achieve the same computation with a much smaller communication cost for sending partial results over the network. In this section, we describe how to find such computationally equivalent substitutions for different models.

Model first-order second-order high-order
Table 1: Some common components that are shared among different recommender system models.

4.1 Logistic Regression

Logistic Regression (LR) Richardson et al. (2007) is a generalized linear model that is widely used in recommender systems. Due to its simplicity, scalability, and interpretability, LR can be used not only as an independent model but also as an important component in many DLRMs, such as Wide&Deep and DeepFM. The form of LR is as follows:

$$\mathcal{F}(x, w) = \sigma(w^{\top} x + b)$$

where $x$ and $w$ are two $d$-dimensional vectors representing the inputs and weights respectively, $b$ is the bias, and $\sigma$ is a non-linear transform, usually a sigmoid function for LR. The major part of the computation in $\mathcal{F}$ is the dot product. It is easy to find an $N$-partition of $w$: $w_1, w_2, \ldots, w_N$, where $w_i$ denotes the subset of $w$ co-located with the $i$-th process, and $x_i$ denotes the matching subset of $x$. We then define a local operator $f$ on $w_i$:

$$f(x_i, w_i) = w_i^{\top} x_i$$

We have the equivalent substitution $\mathcal{G}$ of $\mathcal{F}$:

$$\mathcal{G}(x, w) = \sigma\Big(\bigoplus_{i=1}^{N} f(x_i, w_i) + b\Big)$$
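The equivalence for LR can be checked numerically. The sketch below assumes that the aggregation is a sum and that the partition is a simple strided split; `split`, `lr_original`, and `lr_des` are hypothetical names used only for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def split(values: Sequence[float], num_parts: int) -> List[List[float]]:
    """Strided partition: element t goes to part t % num_parts."""
    return [list(values[p::num_parts]) for p in range(num_parts)]


def lr_original(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """LR on a single device: sigmoid(w . x + b)."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)


def lr_des(
    x_parts: Sequence[Sequence[float]], w_parts: Sequence[Sequence[float]], b: float
) -> float:
    """LR via local dot products; only one scalar per process is aggregated."""
    local = [
        sum(xi * wi for xi, wi in zip(x_p, w_p))
        for x_p, w_p in zip(x_parts, w_parts)
    ]
    return sigmoid(sum(local) + b)  # sum stands in for AllReduce


x = [1.0, 0.0, 2.0, 1.0, 0.5]
w = [0.2, -0.4, 0.1, 0.3, -0.6]
b = 0.05
y_full = lr_original(x, w, b)
y_des = lr_des(split(x, 2), split(w, 2), b)
```

The two results agree up to floating-point summation order, while the distributed version exchanges a single scalar per sample instead of the weights.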

Figure 2: Forward pass for LR operator in PS/mesh-based strategy (left) and DES strategy when N=2 (right).

Assume that all weights of sparse features are stored in hash tables as float32. In mesh-based strategy, each worker needs to transfer weights with unsigned int64 keys from the hash tables co-located with other workers. So the total data size to transfer through the network for each worker is:

Where and denote the size of feature keys and weights respectively.

Using DES, we only need to synchronize a scalar value with other workers for every sample, so the total data size to transfer through the network for each worker is:

Where denotes the size of intermediate results.So the communication-saving ratio for LR is:

4.2 Factorization Machine

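Besides linear interactions among features, FM models pairwise feature interactions as inner products of latent vectors. FM is both an independent model and an important component of DLRMs such as DeepFM and xDeepFM Lian et al. (2018). The linear interactions are similar to the LR model, so here we focus only on the order-2 operator (denoted by $\mathcal{F}_2$):

$$\mathcal{F}_2(x, V) = \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle \, x_i x_j \quad (3)$$

where $v_i$ denotes a latent vector, $x_i$ is the feature value of feature $i$, and $\langle \cdot, \cdot \rangle$ represents the inner product operation.

$$\mathcal{F}_2(x, V) = \frac{1}{2} \sum_{f=1}^{k} \left[ \Big( \sum_{i=1}^{d} v_{i,f} x_i \Big)^2 - \sum_{i=1}^{d} v_{i,f}^2 x_i^2 \right] \quad (4)$$

Equation 4 shows another popular form of FM given in Rendle (2010), which has only linear complexity. Here we adopt this equation to form our computationally equivalent substitution of FM.

Applying Algorithm 1 to FM, we get an $N$-partition of $V$ using any partition policy that balances the weights across processes. We then define two local operators, $f_1$ and $f_2$, that process the local subset of weights $V_p$ with feature index set $I_p$:

$$f_1(x_p, V_p) = \Big( \sum_{i \in I_p} v_{i,f} x_i \Big)_{f=1}^{k}, \qquad f_2(x_p, V_p) = \sum_{f=1}^{k} \sum_{i \in I_p} v_{i,f}^2 x_i^2$$

We have the equivalent substitution $\mathcal{G}$ of $\mathcal{F}_2$:

$$\mathcal{G}(x, V) = \frac{1}{2} \left[ \Big\| \bigoplus_{p=1}^{N} f_1(x_p, V_p) \Big\|^2 - \bigoplus_{p=1}^{N} f_2(x_p, V_p) \right]$$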
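The FM order-2 substitution can be checked against the pairwise definition. The sketch below assumes sum aggregation, with each process contributing one vector of $k$ partial sums and one scalar of squared terms; `fm_pairwise`, `fm_local`, and `fm_des` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fm_pairwise(x: Vector, v: Sequence[Vector]) -> float:
    """Reference order-2 term: sum over i<j of <v_i, v_j> x_i x_j."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def fm_local(x_p: Vector, v_p: Sequence[Vector], k: int) -> Tuple[List[float], float]:
    """Local operators: k partial sums of v*x and one scalar sum of (v*x)^2."""
    partial_sum = [0.0] * k
    partial_sq = 0.0
    for xi, vi in zip(x_p, v_p):
        for f in range(k):
            term = vi[f] * xi
            partial_sum[f] += term
            partial_sq += term * term
    return partial_sum, partial_sq


def fm_des(x_parts: Sequence[Vector], v_parts: Sequence[Sequence[Vector]], k: int) -> float:
    """Aggregate the local results (sum stands in for AllReduce), then combine."""
    local = [fm_local(x_p, v_p, k) for x_p, v_p in zip(x_parts, v_parts)]
    sum_vec = [sum(item[0][f] for item in local) for f in range(k)]
    sum_sq = sum(item[1] for item in local)
    return 0.5 * (sum(s * s for s in sum_vec) - sum_sq)


x = [1.0, 2.0, 0.5, 3.0]
v = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
ref = fm_pairwise(x, v)
des = fm_des([x[:2], x[2:]], [v[:2], v[2:]], k=2)
```

Each process sends $k + 1$ floats per sample, independent of how many features it holds.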

Figure 3: Forward pass for FM order-2 operators using DES strategy when N=2.

In mesh-based strategy, each worker needs to lookup latent vectors with feature IDs from the hash tables co-located with other workers. The total data size to transfer through the network for each worker is:

Where and denote the size of feature keys and latent vectors per batch respectively.

Using DES, the FM order-2 operators only require all workers to exchange and among each other, so we have:

The communication-saving ratio for FM is:

4.3 Deep Neural Network

Recommender systems use DNNs to learn high-order feature interactions. The features are usually categorical and grouped into fields. A DNN starts with an embedding layer, which compresses the latent vectors into dense embedding vectors by field, and is usually followed by multiple fully-connected (FC) layers, as shown in Figure 4.

Figure 4: The architecture of a DNN with 2 FC layers under PS-based strategy (left) and DES strategy (right).

Like FM, in DNNs the majority of weights come from the embedding layer and the first FC layer:

$$\mathcal{F}(X, W) = X W$$

where $X$ denotes the concatenated output of the embedding layer and $W$ denotes the weights of the first FC layer.

Using DES, we split $X$ and $W$ into $N$ partitions over the fields dimension and use blocked matrix multiplication (Figure 5), which is similar to the method proposed by Gholami et al. (2018). Our strategy differs in the splitting: we divide $X$ and $W$ along the same dimension to ensure that the computation and weights of different parts do not overlap:

$$X W = \begin{bmatrix} X_1 & X_2 & \cdots & X_N \end{bmatrix} \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_N \end{bmatrix} = \sum_{i=1}^{N} X_i W_i$$

Hence we get the $N$-partitions of $X$ and $W$: $X_1, \ldots, X_N$ and $W_1, \ldots, W_N$, where $X_i$ and $W_i$ denote the subsets of $X$ and $W$ co-located with the $i$-th process respectively.

Figure 5: The blocked matrix multiplication in DNN using DES strategy.

Considering that the embedding layer will aggregate the latent vectors by fields before concatenating them, we store the latent vectors of the same field on the same process to avoid unnecessary weights exchange. In this way, we also avoid communication during the back-propagation phase.

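Using this $N$-partition, we can define the local operator $f$ as follows:

$$f(X_i, W_i) = X_i W_i$$

The distributed equivalent substitution $\mathcal{G}$ of $\mathcal{F}$ is hence defined as:

$$\mathcal{G}(X, W) = \bigoplus_{i=1}^{N} f(X_i, W_i)$$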
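The blocked matrix multiplication behind this substitution can be illustrated with plain Python lists. The sketch assumes that the input columns and the weight rows are split at the same boundaries and that the aggregation is an element-wise sum; `matmul` and `fc_des` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Plain matrix product of a (rows x inner) and b (inner x cols)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][c] for t in range(inner)) for c in range(cols)]
        for row in a
    ]


def fc_des(
    x: Sequence[Sequence[float]],
    w: Sequence[Sequence[float]],
    bounds: Sequence[Tuple[int, int]],
) -> Matrix:
    """First FC layer as a sum of per-process block products.

    bounds lists one (start, end) range per process; the same range
    selects columns of x and rows of w, so blocks never overlap.
    """
    out = [[0.0] * len(w[0]) for _ in x]
    for start, end in bounds:
        x_i = [row[start:end] for row in x]   # local slice of the inputs
        w_i = w[start:end]                    # co-located slice of the weights
        partial = matmul(x_i, w_i)            # local operator
        # Element-wise accumulation stands in for AllReduce(sum).
        out = [[o + p for o, p in zip(o_row, p_row)] for o_row, p_row in zip(out, partial)]
    return out


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]
full = matmul(x, w)
blocked = fc_des(x, w, [(0, 2), (2, 4)])
```

Only the batch-by-output partial products are exchanged; the embedding outputs and FC weights stay on their own process.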


In mesh-based strategy, each worker needs to lookup of and by keys(unsigned int64) from the hash tables co-located with other workers. The total data size to transfer for each worker is:

, and denote the size of feature keys, and per batch respectively. Compared to mesh-based strategy, DNN using DES only requires all workers to exchange among each other (Figure 4):

The communication-saving ratio for DNN is:

batch   uniq_feats   LR         FM         DNN
512     147,664      99.769 %   99.376 %   90.310 %
1024    257,757      99.735 %   99.285 %   86.226 %
2048    448,814      99.696 %   99.179 %   81.658 %
4096    789,511      99.654 %   99.066 %   77.015 %
8192    1,389,353    99.607 %   98.939 %   72.264 %
Table 2: The number of unique features and communication-saving ratio of different models using a 4-node cluster.

Using DES does not increase the computation compared to PS/mesh-based strategy, and often leads to a smaller computation load. Table 2 shows the number of unique features per batch as well as the communication-saving ratio for three models with different batch sizes on a real-world recommender system. Compared to mesh-based strategy, DES reduces the communication cost by between 72.26% (with a batch size of 8192) and 99.77% (with a batch size of 512).
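As a sanity check on Table 2, the sketch below recomputes the LR and FM saving ratios from the batch sizes and unique-feature counts alone. The expressions are a reconstruction under stated assumptions, not formulas taken from the text: 8-byte int64 keys, 4-byte float32 values, a latent vector length of 8, and a factor of 2 for the send and receive halves of AllReduce. With these assumptions the two columns are reproduced to three decimals.

```python
KEY_BYTES = 8      # unsigned int64 feature key
FLOAT_BYTES = 4    # float32 weight or intermediate value
LATENT_DIM = 8     # assumed FM latent vector length

# (batch size, unique features per batch) copied from Table 2
TABLE2_ROWS = [
    (512, 147_664),
    (1024, 257_757),
    (2048, 448_814),
    (4096, 789_511),
    (8192, 1_389_353),
]


def lr_saving(batch: int, uniq: int) -> float:
    """One scalar per sample versus one (key, weight) pair per unique feature."""
    des_bytes = 2 * batch * FLOAT_BYTES
    mesh_bytes = uniq * (KEY_BYTES + FLOAT_BYTES)
    return 1.0 - des_bytes / mesh_bytes


def fm_saving(batch: int, uniq: int, k: int = LATENT_DIM) -> float:
    """k + 1 floats per sample versus a key plus k floats per unique feature."""
    des_bytes = 2 * batch * (k + 1) * FLOAT_BYTES
    mesh_bytes = uniq * (KEY_BYTES + k * FLOAT_BYTES)
    return 1.0 - des_bytes / mesh_bytes


lr_percent = [round(100 * lr_saving(b, u), 3) for b, u in TABLE2_ROWS]
fm_percent = [round(100 * fm_saving(b, u), 3) for b, u in TABLE2_ROWS]
```

The ratio shrinks with larger batches because the intermediate results grow linearly with the batch size while the number of unique features grows more slowly.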

Our analysis here includes only the communication cost of transferring the sparse weights. In fact, for most recommender systems, state-of-the-art stateful optimizers such as FTRL McMahan et al. (2013), AdaGrad Duchi et al. (2011), and Adam Kingma and Ba (2014) require saving and transferring the corresponding state variables in addition to the sparse weights. With DES strategy, these variables are kept local, which reduces the communication cost even further.

Extending to General Models: The previous analysis shows that we can apply DES to several state-of-the-art models for recommender systems. We think this is not a coincidence. To generalize our observations for the above models, we claim that for any DLRM, as long as the computationally equivalent substitution of the weights-rich layers does not surpass linear complexity, we can apply DES strategy. FM Rendle (2010) is the work that inspired us to look for linear substitutions of operators. Here, linear complexity is measured in the size of the feature parameters, which DES splits into one part per DES worker process. We have a simple rule to judge whether an operator has linear complexity: if the computation of the weights-rich layer satisfies the commutative law and the associative law, we can apply DES strategy to reduce the communication cost in the forward phase and eliminate gradient aggregation in the backward phase.

5 System Implementation

We choose TensorFlow as the backend for our training framework due to its flexibility and natural distributed-friendliness. More specifically, we implement our system by enhancing TensorFlow in the following two aspects: large-scale sparse features and dynamic hash table.

Large-scale Sparse Features: As mentioned earlier, an industrial streaming recommender system may have hundreds of billions of dynamic features. Given the embedding size and weight precision, the feature weights require at least 3.2 TB of memory. Table 2 shows that within a single iteration, weight updates on unique features are sparse. To achieve constant-cost data access and update and to overcome the memory constraint of a single node, we use a distributed hash table. We use a simple method to distribute the weights: in a cluster with $N$ nodes, the $i$-th node holds all the weights corresponding to feature field IDs $id$ where $id \bmod N = i$. Other methods could achieve better load balancing, but we found that this simple method works fine in our case.
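A minimal sketch of such a placement rule is shown below. It assumes that the owner of an ID is its remainder modulo the number of nodes; `owner_node` and `shard_ids` are hypothetical names. The memory figure at the end uses assumed values (100 billion features, 8 float32 values each) that happen to multiply to 3.2 TB; they are an illustration, not numbers stated in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def owner_node(feature_id: int, num_nodes: int) -> int:
    """Node index that stores the weights of this ID (assumed modulo rule)."""
    return feature_id % num_nodes


def shard_ids(feature_ids: Iterable[int], num_nodes: int) -> Dict[int, List[int]]:
    """Group IDs by owning node, preserving input order within each node."""
    shards: Dict[int, List[int]] = defaultdict(list)
    for fid in feature_ids:
        shards[owner_node(fid, num_nodes)].append(fid)
    return dict(shards)


shards = shard_ids([0, 1, 2, 3, 4, 5, 10], num_nodes=4)

# Assumed values for illustration only.
assumed_features = 100 * 10**9
assumed_dim = 8
assumed_bytes = assumed_features * assumed_dim * 4  # float32 weights
```

A modulo rule needs no lookup table for placement, at the price of uneven load when ID frequencies are skewed.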

Dynamic Hash Table: In DES strategy, we operate on hash tables in three places: given a feature ID in a batch of input samples, we look up the corresponding weight; when a new feature ID is given as the key, we insert the initialized weight into the hash table; and given the gradient of a weight, we apply it locally and then update the hash table with the new weight. To achieve this, we provide a modified dynamic hash table implementation in TensorFlow with key operations adapted to our needs (Figure 6). Compared to alternative design choices, this implementation reuses as many existing TensorFlow features as possible and introduces hash table operations only during the batch building and optimizer phases, because after the lookup the sparse weights are converted into dense tensors that are fully compatible with TensorFlow's native training pipeline.
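The three operations can be sketched with an ordinary Python dictionary. This is not the TensorFlow implementation described above: the class `DynamicHashTable` and its methods are hypothetical, and plain SGD stands in for stateful optimizers such as FTRL or AdaGrad.

```python
from typing import Dict, List, Sequence


class DynamicHashTable:
    """Toy local hash table: lookup with insert-on-miss, plus local updates."""

    def __init__(self, dim: int, init_value: float = 0.0) -> None:
        self.dim = dim
        self.init_value = init_value
        self._table: Dict[int, List[float]] = {}

    def lookup(self, feature_ids: Sequence[int]) -> List[List[float]]:
        """Return one dense row per ID, inserting initialized rows for new IDs."""
        rows = []
        for fid in feature_ids:
            if fid not in self._table:
                self._table[fid] = [self.init_value] * self.dim
            rows.append(list(self._table[fid]))
        return rows

    def apply_gradients(
        self,
        feature_ids: Sequence[int],
        grads: Sequence[Sequence[float]],
        learning_rate: float,
    ) -> None:
        """Apply gradients locally and write the new weights back."""
        for fid, grad in zip(feature_ids, grads):
            row = self._table[fid]
            self._table[fid] = [w - learning_rate * g for w, g in zip(row, grad)]

    def __len__(self) -> int:
        return len(self._table)


table = DynamicHashTable(dim=2)
dense = table.lookup([42, 7, 42])  # dense rows can feed a normal training step
table.apply_gradients([42], [[1.0, -2.0]], learning_rate=0.5)
```

The rows returned by `lookup` are copies, so the dense side of the pipeline never mutates the table directly.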

Figure 6: Data flow chart with our enhanced TensorFlow. (The two operators of lookup and insert isolate the sparse domain.)

6 Experiments And Analysis

Hardware: We ran all experiments in this paper on a testing cluster of four Linux servers, each with two hyperthreaded 24-core Intel Xeon E5-2670v3 (2.3GHz) CPUs, 128 GB of host memory, and one Intel Ethernet Controller 10-Gigabit X540-AT2 without RDMA support.

Software: Our DES framework is based on an enhanced version of TensorFlow 1.13.1 and standard OpenMPI version 4.0.1. Since mesh-based frameworks are a special form of PS-based frameworks and usually have lower communication cost than the original PS-based frameworks, we use mesh-based strategy for comparison. The mesh-based strategy we compare with is implemented using a popular open-source framework: DiFacto Li et al. (2016).

Dataset: In order to verify the performance of DES in real industrial context, we evaluate our framework on the following two datasets.

1) Criteo Dataset: The Criteo dataset includes 45 million users’ click records with 13 continuous features and 26 categorical features. We use 95% for training and the remaining 5% for testing.

2) Company* Dataset: We extract a continuous segment of samples from a recommender system in use internally. On average, each sample contains 950 unique feature values. The total number of samples is 10,809,440. It is stored in a remote sample server.

Parameter Settings: We set DiFacto to run one worker process on each server, with a batch size of 4,096 and 24 concurrent threads. Correspondingly, the two concurrency parameters of DES on TensorFlow are both set to 24, and the batch size for DES is set to 4,096 when testing AUC. Since all DES workers train on samples from the same batch synchronously in parallel, we set the batch size to 16,384 (for N=4) when testing the communication ratio to guarantee a fair comparison. We train all models with the same optimizer settings: FTRL for order-1 components, and AdaGrad or Adam for both Embedding and DNN components.

Evaluation Metrics: We use two evaluation metrics in our experiments: AUC (Area Under ROC) and Logloss (cross entropy).
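For reference, both metrics can be computed from labels and predicted probabilities as sketched below. The pairwise AUC is quadratic in the number of samples and is meant only to make the definition concrete; `auc` and `log_loss` are hypothetical helper names, not functions of the framework.

```python
import math
from typing import Sequence


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary cross entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_loss = log_loss([1, 0], [0.5, 0.5])
```

In the example, three of the four positive-negative pairs are ranked correctly, and predicting 0.5 everywhere gives a loss of ln 2.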

Performance Summary: We compare our framework to a mesh-based implementation on three widely adopted models in mainstream recommender systems: LR, W&D, and DeepFM. In general, on all three models, DES achieves better AUC in fewer iterations with an order of magnitude lower communication cost.

model DiFacto DES
LR 0.7906 0.7913
W&D 0.8015 0.8025
DeepFM 0.8027 0.8035
Table 3: Average AUC for three models after a 7-day training session on Company* Dataset.

Table 3 shows that during long-term online training, when consuming the same number of samples with a similar distribution, DES shows better average AUC for all three models. One possible explanation is that with DES the training runs in synchronous mode, which usually leads to better and faster convergence than asynchronous mode. We care about small AUC increases because in several real-world applications we run internally, even a small increase in AUC is amplified about 5x when transferred to the final CTR.

model    PS AUC   PS LogLoss   DES AUC   DES LogLoss
LR       0.7353   0.5037       0.7534    0.4823
W&D      0.7737   0.4753       0.7822    0.4761
DeepFM   0.7750   0.4747       0.7924    0.4674
Table 4: Average AUC and log loss for three models using PS (async training) and DES (sync training) with TensorFlow after a one-epoch training session on Criteo Dataset.

Table 4 shows the AUC and log loss for three models using PS-mode asynchronous training and DES-mode fully synchronous training on TensorFlow respectively. The batch size is set to 2,048. As the convergence curve does not change much afterwards, we only show the results after the first epoch. For PS-mode, we use 15 parameter servers (with 10GB memory) and 20 workers (with 5GB memory); for DES-mode, we use 15 workers (with 10GB memory). The one-epoch results show that DES has reached higher AUC on all three models (boosts range from 0.8% to 1.7%) even at a very early stage of training.

Computation vs. Communication Time: Figure 7 shows that in all experiments the DiFacto framework spends more time on both computation and communication. The absolute total network communication time using the DiFacto framework is 2.7x, 2.3x, and 3.2x larger than with DES for LR, W&D, and DeepFM, respectively. The saving in communication time comes from the smaller amount of intermediate results sent among workers during the forward phase and from the elimination of gradient aggregation during the backward phase. The saving in computation time comes from the reduced time complexity of the computationally equivalent substitution, as well as from several optimizations we have put into our DES framework.

Figure 7: Per-iteration computation and communication time for three models.
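The communication-time ratios quoted above can be converted into relative savings with simple arithmetic. The sketch below assumes that "Nx larger" means DiFacto's communication time divided by DES's, so the saving is 1 - 1/N; the function name `saving_from_ratio` is illustrative, and because the ratios are rounded in the text, the results are approximate.

```python
def saving_from_ratio(ratio):
    """Relative time saved by DES when the baseline takes `ratio` times as long."""
    return 1.0 - 1.0 / ratio


# Communication-time ratios (DiFacto / DES) as quoted in the text.
comm_time_ratio = {"LR": 2.7, "W&D": 2.3, "DeepFM": 3.2}

for model, ratio in comm_time_ratio.items():
    print(f"{model}: about {saving_from_ratio(ratio):.1%} less communication time")
```

Under this reading, the DeepFM ratio of 3.2x corresponds to a saving of about 68.75%, which is in line with the "up to 68.7%" communication savings reported in the conclusions.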

Throughput: Table 5 compares the throughput of DES and DiFacto. DES shows the largest gains on deep models with high-order components (W&D and DeepFM), which suggests larger benefits when applying DES to future DLRMs.

model DiFacto DES improvement
LR 50396.8 78205.3 1.55x
W&D 11023.9 49837.3 4.52x
DeepFM 10560.1 41295.5 3.91x
Table 5: Throughput (samples/sec) of DiFacto (PS) and DES on three models.
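The improvement column of Table 5 can be reproduced as a simple ratio of the two throughput columns. The sketch below assumes that the first numeric column is the PS-based baseline and the second is DES, which is the only reading consistent with the reported factors; the names `throughput` and `speedup` are illustrative.

```python
# Throughput in samples/sec, copied from Table 5: (baseline, DES).
# Assumption: the first value is the PS-based baseline, the second is DES.
throughput = {
    "LR": (50396.8, 78205.3),
    "W&D": (11023.9, 49837.3),
    "DeepFM": (10560.1, 41295.5),
}


def speedup(baseline, des):
    """Throughput improvement of DES over the baseline, as a plain ratio."""
    return des / baseline


for model, (baseline, des) in throughput.items():
    print(f"{model}: {speedup(baseline, des):.2f}x")
```

Rounded to two decimals, the ratios match the 1.55x, 4.52x, and 3.91x listed in the table.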

7 Conclusions and Future Works

We propose a novel framework for models with large-scale sparse dynamic features in streaming recommender systems. Our framework achieves efficient synchronous distributed training thanks to its core component, the Distributed Equivalent Substitution (DES) algorithm. We take advantage of the observation that, for all models in recommender systems, the first one or few weights-rich layers only participate in straightforward computation and can be replaced by a group of distributed operators that form a computationally equivalent substitution. With DES, the intermediate information transferred between workers during the forward phase is reduced, and the AllReduce on gradients between workers during the backward phase is eliminated. The application of DES to popular DLRMs such as FM, DNN, Wide&Deep, and DeepFM demonstrates the generality of our algorithm. Experiments on a public dataset and an internal dataset that compare our implementation with a popular PS-based implementation show that our framework achieves up to 68.7% communication savings and higher AUC.
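The idea summarized above can be illustrated for the simplest case, logistic regression, with a single-process simulation. This is a simplified sketch under stated assumptions, not the authors' implementation: the helper names (`shard_of`, `partial_logits`, `all_reduce_sum`, `local_gradient`), the modulo sharding rule, the toy data, and the assumption that every worker sees the same mini-batch are all illustrative. Each simulated worker owns a disjoint shard of the weights, only per-sample partial sums are exchanged in the forward phase, and each worker then computes the gradient of its own shard locally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shard_of(feature_id, num_workers):
    """Hypothetical sharding rule: assign each feature to one worker by modulo."""
    return feature_id % num_workers


def partial_logits(shard, batch):
    """Forward phase on one worker: dot product restricted to its local weights."""
    return [sum(shard.get(f, 0.0) * v for f, v in sample.items()) for sample in batch]


def all_reduce_sum(per_worker_vectors):
    """Stand-in for AllReduce: the element-wise sum every worker would receive."""
    return [sum(values) for values in zip(*per_worker_vectors)]


def local_gradient(worker_id, num_workers, batch, labels, logits):
    """Backward phase on one worker: gradient only for the weights it owns."""
    grad = {}
    for sample, y, z in zip(batch, labels, logits):
        err = sigmoid(z) - y  # derivative of LogLoss with respect to the logit
        for f, v in sample.items():
            if shard_of(f, num_workers) == worker_id:
                grad[f] = grad.get(f, 0.0) + err * v / len(batch)
    return grad


# Toy model and mini-batch (sparse samples as {feature_id: value}).
weights = {0: 0.5, 1: -0.2, 2: 0.1, 3: 0.3, 4: -0.4}
batch = [{0: 1.0, 3: 1.0}, {1: 1.0, 2: 1.0, 4: 1.0}]
labels = [1, 0]
num_workers = 2

# Each worker holds a disjoint shard of the weights.
shards = [
    {f: w for f, w in weights.items() if shard_of(f, num_workers) == k}
    for k in range(num_workers)
]

# Forward: only one partial sum per sample crosses the network.
logits = all_reduce_sum([partial_logits(s, batch) for s in shards])

# Backward: every worker updates its own shard; no gradients are exchanged.
grads = [local_gradient(k, num_workers, batch, labels, logits) for k in range(num_workers)]

print(logits)
print(grads)
```

In this sketch the exchanged data is one scalar per sample rather than one value per parameter, and the per-worker gradients together equal the gradient a single machine would compute on the full weight vector.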

Future Works: We have shown in Section 6 that our current implementation of DES is bounded by computation. The natural next step is therefore to move the computation of current bottleneck operators, such as hash tables, to the GPU and to improve the existing kernel implementations. We have also started initial work on applying DES to more models commonly used in industry, such as DCN Wang et al. (2017) and DIN Zhou et al. (2018).

Acknowledgement: We appreciate the technical assistance, advice, and machine access from colleagues at Tencent: Chaonan Guo and Fei Sun.


  1. More details in Section 4.
  2. http://labs.criteo.com/downloads/2014-kaggle-displayadvertising-challenge-dataset/
  3. We use FTRL optimizer for LR model, and Adam optimizer for the other two models.


  1. Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Republic and Canton of Geneva, CHE, pp. 381–389. ISBN 9781450349130.
  2. Revisiting distributed synchronous SGD. CoRR abs/1604.00981.
  3. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, New York, NY, USA, pp. 7–10. ISBN 978-1-4503-4795-2.
  4. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, New York, NY, USA, pp. 191–198. ISBN 978-1-4503-4035-9.
  5. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, USA, pp. 1223–1231.
  6. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159. ISSN 1532-4435.
  7. Integrated model, batch, and domain parallelism in training neural networks. In SPAA’18: 30th ACM Symposium on Parallelism in Algorithms and Architectures.
  8. External Links: Link Cited by: §2.1.
  9. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR abs/1706.02677.
  10. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 1725–1731. ISBN 978-0-9992411-0-3.
  11. GPipe: efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965.
  12. Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. CoRR abs/1807.11205 (1807.11205v1).
  13. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR abs/1802.04924.
  14. Adam: a method for stochastic optimization. arXiv e-prints, arXiv:1412.6980.
  15. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997.
  16. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, Berkeley, CA, USA, pp. 583–598. ISBN 978-1-931971-16-4.
  17. DiFacto: distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, New York, NY, USA, pp. 377–386. ISBN 978-1-4503-3716-8.
  18. xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1754–1763. ISBN 978-1-4503-5552-0.
  19. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, pp. 1222–1230. ISBN 978-1-4503-2174-7.
  20. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 2430–2439.
  21. Deep content-based music recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 2643–2651.
  22. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, Washington, DC, USA, pp. 995–1000. ISBN 978-0-7695-4256-0.
  23. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, New York, NY, USA, pp. 521–530. ISBN 978-1-59593-654-7.
  24. Mesh-TensorFlow: deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 10435–10444.
  25. Deep & cross network for ad click predictions. CoRR abs/1708.05123.
  26. Large batch optimization for deep learning: training BERT in 76 minutes. CoRR abs/1904.00962.
  27. Deep learning over multi-field categorical data: a case study on user response prediction. ArXiv abs/1601.02376.
  28. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1059–1068. ISBN 978-1-4503-5552-0.