DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
Abstract
Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expert feature engineering. In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture. Compared to the latest Wide & Deep model from Google, DeepFM has a shared input to its “wide” and “deep” parts, with no need of feature engineering besides raw features. Comprehensive experiments are conducted to demonstrate the effectiveness and efficiency of DeepFM over the existing models for CTR prediction, on both benchmark data and commercial data.
Huifeng Guo¹, Ruiming Tang², Yunming Ye¹ (corresponding author), Zhenguo Li², Xiuqiang He²
¹ Shenzhen Graduate School, Harbin Institute of Technology, China
² Noah’s Ark Research Lab, Huawei, China
huifengguo@yeah.net, yeyunming@hit.edu.cn, {tangruiming, li.zhenguo, hexiuqiang}@huawei.com
(This work was done while Huifeng Guo worked as an intern at Noah’s Ark Research Lab, Huawei.)
1 Introduction
The prediction of click-through rate (CTR) is critical in recommender systems, where the task is to estimate the probability that a user will click on a recommended item. In many recommender systems the goal is to maximize the number of clicks, and so the items returned to a user can be ranked by estimated CTR; while in other application scenarios such as online advertising it is also important to improve revenue, and so the ranking strategy can be adjusted as CTR × bid across all candidates, where “bid” is the benefit the system receives if the item is clicked by a user. In either case, it is clear that the key is in estimating CTR correctly.
It is important for CTR prediction to learn implicit feature interactions behind user click behaviors. By our study in a mainstream apps market, we found that people often download apps for food delivery at meal-time, suggesting that the (order-2) interaction between app category and timestamp can be used as a signal for CTR. As a second observation, male teenagers like shooting games and RPG games, which means that the (order-3) interaction of app category, user gender, and age is another signal for CTR. In general, such interactions of features behind user click behaviors can be highly sophisticated, where both low- and high-order feature interactions should play important roles. According to the insights of the Wide & Deep model [?] from Google, considering low- and high-order feature interactions simultaneously brings additional improvement over the cases of considering either alone.
The key challenge is in effectively modeling feature interactions. Some feature interactions can be easily understood, and thus can be designed by experts (like the instances above). However, most other feature interactions are hidden in data and difficult to identify a priori (for instance, the classic association rule “diaper and beer” is mined from data, rather than discovered by experts), and can only be captured automatically by machine learning. Even for easy-to-understand interactions, it seems unlikely for experts to model them exhaustively, especially when the number of features is large.
Despite their simplicity, generalized linear models, such as FTRL [?], have shown decent performance in practice. However, a linear model lacks the ability to learn feature interactions, and a common practice is to manually include pairwise feature interactions in its feature vector. Such a method is hard to generalize to model high-order feature interactions or those that never or rarely appear in the training data [?]. Factorization Machines (FM) [?] model pairwise feature interactions as the inner product of latent vectors between features and show very promising results. While in principle FM can model high-order feature interactions, in practice usually only order-2 feature interactions are considered due to the high complexity.
As a powerful approach to learning feature representations, deep neural networks have the potential to learn sophisticated feature interactions. Some ideas extend CNN and RNN for CTR prediction [?; ?], but CNN-based models are biased towards the interactions between neighboring features while RNN-based models are more suitable for click data with sequential dependency. [?] studies feature representations and proposes the Factorization-machine supported Neural Network (FNN). This model pre-trains FM before applying DNN, and is thus limited by the capability of FM. Feature interaction is studied in [?], by introducing a product layer between the embedding layer and the fully-connected layer, and proposing the Product-based Neural Network (PNN). As noted in [?], PNN and FNN, like other deep models, capture little of the low-order feature interactions, which are also essential for CTR prediction. To model both low- and high-order feature interactions, [?] proposes an interesting hybrid network structure (Wide & Deep) that combines a linear (“wide”) model and a deep model. In this model, two different inputs are required for the “wide part” and “deep part”, respectively, and the input of the “wide part” still relies on expert feature engineering.
One can see that existing models are biased towards low- or high-order feature interactions, or rely on feature engineering. In this paper, we show that it is possible to derive a learning model that is able to learn feature interactions of all orders in an end-to-end manner, without any feature engineering besides raw features. Our main contributions are summarized as follows:

We propose a new neural network model DeepFM (Figure 1) that integrates the architectures of FM and deep neural networks (DNN). It models low-order feature interactions like FM and models high-order feature interactions like DNN. Unlike the Wide & Deep model [?], DeepFM can be trained end-to-end without any feature engineering.

DeepFM can be trained efficiently because its wide part and deep part, unlike [?], share the same input and also the embedding vector. In [?], the input vector can be of huge size as it includes manually designed pairwise feature interactions in the input vector of its wide part, which also greatly increases its complexity.

We evaluate DeepFM on both benchmark data and commercial data, which shows consistent improvement over existing models for CTR prediction.
2 Our Approach
Suppose the data set for training consists of $n$ instances $(\chi, y)$, where $\chi$ is an $m$-fields data record usually involving a pair of user and item, and $y \in \{0, 1\}$ is the associated label indicating user click behaviors ($y = 1$ means the user clicked the item, and $y = 0$ otherwise). $\chi$ may include categorical fields (e.g., gender, location) and continuous fields (e.g., age). Each categorical field is represented as a vector of one-hot encoding, and each continuous field is represented as the value itself, or a vector of one-hot encoding after discretization. Then, each instance is converted to $(x, y)$, where $x = [x_{field_1}, x_{field_2}, \ldots, x_{field_m}]$ is a $d$-dimensional vector, with $x_{field_j}$ being the vector representation of the $j$-th field of $\chi$. Normally, $x$ is high-dimensional and extremely sparse. The task of CTR prediction is to build a prediction model $\hat{y} = \mathrm{CTR\_model}(x)$ to estimate the probability of a user clicking a specific app in a given context.
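To make the field encoding concrete, here is a minimal sketch of converting one record into the sparse vector $x$. The field schema, vocabularies, and the `encode` helper are illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical toy schema: two categorical fields and one continuous field.
# Each categorical field becomes a one-hot segment; the continuous field
# keeps its raw value, so x is high-dimensional and mostly zeros.
FIELDS = {
    "gender": ["female", "male"],          # one-hot, 2 dims
    "location": ["US", "UK", "CN"],        # one-hot, 3 dims
    "age": None,                           # continuous, 1 dim
}

def encode(record):
    parts = []
    for name, vocab in FIELDS.items():
        if vocab is None:                  # continuous field: the value itself
            parts.append(np.array([record[name]], dtype=float))
        else:                              # categorical field: one-hot vector
            v = np.zeros(len(vocab))
            v[vocab.index(record[name])] = 1.0
            parts.append(v)
    return np.concatenate(parts)

x = encode({"gender": "male", "location": "CN", "age": 23.0})
# x = [0, 1, 0, 0, 1, 23]: d = 6 here, but d is enormous for real app stores
```

In a real app store the user-ID field alone contributes up to a billion dimensions, which is why $x$ is handled sparsely in practice.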
2.1 DeepFM
We aim to learn both low- and high-order feature interactions. To this end, we propose a Factorization-Machine based neural network (DeepFM). As depicted in Figure 1¹, DeepFM consists of two components, the FM component and the deep component, that share the same input. For feature $i$, a scalar $w_i$ is used to weigh its order-1 importance, and a latent vector $V_i$ is used to measure its impact of interactions with other features. $V_i$ is fed into the FM component to model order-2 feature interactions, and into the deep component to model high-order feature interactions. All parameters, including $w_i$, $V_i$, and the network parameters ($W^{(l)}$, $b^{(l)}$ below), are trained jointly for the combined prediction model:

¹ In all figures of this paper, a Normal Connection in black refers to a connection with a weight to be learned; a Weight-1 Connection, red arrow, is a connection with weight 1 by default; Embedding, blue dashed arrow, means a latent vector to be learned; Addition means adding all inputs together; Product, including Inner- and Outer-Product, means the output of the unit is the product of two input vectors; the Sigmoid Function is used as the output function in CTR prediction; Activation Functions, such as relu and tanh, are used for non-linearly transforming the signal.
$\hat{y} = \mathrm{sigmoid}(y_{FM} + y_{DNN})$    (1)
where $\hat{y} \in (0, 1)$ is the predicted CTR, $y_{FM}$ is the output of the FM component, and $y_{DNN}$ is the output of the deep component.
FM Component
The FM component is a factorization machine, which is proposed in [?] to learn feature interactions for recommendation. Besides linear (order-1) interactions among features, FM models pairwise (order-2) feature interactions as the inner product of the respective feature latent vectors. It can capture order-2 feature interactions much more effectively than previous approaches, especially when the dataset is sparse. In previous approaches, the parameter of an interaction of features $i$ and $j$ can be trained only when feature $i$ and feature $j$ both appear in the same data record. In FM, it is instead measured via the inner product of their latent vectors $V_i$ and $V_j$. Thanks to this flexible design, FM can train latent vector $V_i$ ($V_j$) whenever $i$ (or $j$) appears in a data record. Therefore, feature interactions that never or rarely appear in the training data are better learnt by FM.
As Figure 2 shows, the output of FM is the summation of an Addition unit and a number of Inner Product units:
$y_{FM} = \langle w, x \rangle + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle V_i, V_j \rangle \, x_i \cdot x_j$    (2)
where $w \in \mathbb{R}^d$ and $V_i \in \mathbb{R}^k$ ($k$ is given)². The Addition unit ($\langle w, x \rangle$) reflects the importance of order-1 features, and the Inner Product units represent the impact of order-2 feature interactions.

² We omit a constant offset for simplicity.
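Eq. (2) can be evaluated in $O(kd)$ time using the standard FM reformulation of the pairwise sum. A minimal sketch (variable names are illustrative; the check against the naive double loop is our own sanity test, not from the paper):

```python
import numpy as np

def fm_component(x, w, V):
    """y_FM = <w, x> + sum_{i<j} <V_i, V_j> x_i x_j  (Eq. 2).
    Uses the O(kd) identity:
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V_{i,f} x_i)^2 - sum_i V_{i,f}^2 x_i^2 ]."""
    linear = w @ x                                     # order-1 term <w, x>
    s = V.T @ x                                        # (k,) = sum_i V_i x_i
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

rng = np.random.default_rng(0)
d, k = 6, 3
x, w, V = rng.random(d), rng.random(d), rng.random((d, k))

# Sanity check against the naive O(d^2 k) double loop over feature pairs.
naive = w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                    for i in range(d) for j in range(i + 1, d))
assert np.isclose(fm_component(x, w, V), naive)
```

The reformulation is what makes FM training linear in both $d$ and $k$, which matters given how sparse and high-dimensional $x$ is.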
Deep Component
The deep component is a feed-forward neural network, which is used to learn high-order feature interactions. As shown in Figure 3, a data record (a vector) is fed into the neural network. Compared to neural networks with image [?] or audio [?] data as input, which is purely continuous and dense, the input of CTR prediction is quite different, which requires a new network architecture design. Specifically, the raw feature input vector for CTR prediction is usually highly sparse³, super high-dimensional⁴, categorical-continuous mixed, and grouped in fields (e.g., gender, location, age). This suggests an embedding layer to compress the input vector to a low-dimensional, dense real-valued vector before further feeding into the first hidden layer; otherwise the network can be overwhelming to train.

³ Only one entry is non-zero for each field vector.
⁴ E.g., in an app store of one billion users, the one field vector for user ID is already of one billion dimensions.
Figure 4 highlights the sub-network structure from the input layer to the embedding layer. We would like to point out two interesting features of this network structure: 1) while the lengths of different input field vectors can be different, their embeddings are of the same size ($k$); 2) the latent feature vectors ($V$) in FM now serve as network weights which are learned and used to compress the input field vectors to the embedding vectors. In [?], $V$ is pre-trained by FM and used as initialization. In this work, rather than using the latent feature vectors of FM to initialize the network as in [?], we include the FM model as part of our overall learning architecture, in addition to the other DNN model. As such, we eliminate the need of pre-training by FM and instead jointly train the overall network in an end-to-end manner. Denote the output of the embedding layer as:
$a^{(0)} = [e_1, e_2, \ldots, e_m]$    (3)
where $e_i$ is the embedding of the $i$-th field and $m$ is the number of fields. Then, $a^{(0)}$ is fed into the deep neural network, and the forward process is:
$a^{(l+1)} = \sigma\!\left(W^{(l)} a^{(l)} + b^{(l)}\right)$    (4)
where $l$ is the layer depth and $\sigma$ is an activation function. $a^{(l)}$, $W^{(l)}$, and $b^{(l)}$ are the output, model weight, and bias of the $l$-th layer. After that, a dense real-valued feature vector is generated, which is finally fed into the sigmoid function for CTR prediction: $y_{DNN} = \sigma(W^{|H|+1} \cdot a^{|H|} + b^{|H|+1})$, where $|H|$ is the number of hidden layers.
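The forward pass of Eqs. (3)–(4) can be sketched as follows. The layer sizes and the relu choice are illustrative assumptions (the experiments also use tanh for some models), not the paper's fixed architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_component(embeddings, hidden, out_W, out_b):
    """a^(0) stacks the m field embeddings; each hidden layer applies
    a^(l+1) = relu(W^(l) a^(l) + b^(l)); the last layer yields y_DNN
    through the sigmoid, as in Eq. (4) and the text after it."""
    a = np.concatenate(embeddings)            # a^(0) = [e_1, ..., e_m]
    for W, b in hidden:                       # the |H| hidden layers
        a = np.maximum(0.0, W @ a + b)        # relu activation
    return float(sigmoid(out_W @ a + out_b))  # y_DNN

rng = np.random.default_rng(0)
k, m = 4, 3                                   # embedding size, number of fields
e = [rng.normal(size=k) for _ in range(m)]    # toy field embeddings
h1 = (rng.normal(size=(8, k * m)), np.zeros(8))
h2 = (rng.normal(size=(8, 8)), np.zeros(8))
y_dnn = deep_component(e, [h1, h2], rng.normal(size=8), 0.0)
```

In DeepFM the same embeddings `e` are also the latent vectors $V_i$ used by the FM component, which is the sharing the next paragraph emphasizes.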
It is worth pointing out that the FM component and the deep component share the same feature embedding, which brings two important benefits: 1) it learns both low- and high-order feature interactions from raw features; 2) there is no need for expert feature engineering of the input, as required in Wide & Deep [?].
2.2 Relationship with the other Neural Networks
Inspired by the enormous success of deep learning in various applications, several deep models for CTR prediction have been developed recently. This section compares the proposed DeepFM with existing deep models for CTR prediction.
FNN: As Figure 5 (left) shows, FNN is an FM-initialized feed-forward neural network [?]. The FM pre-training strategy results in two limitations: 1) the embedding parameters might be over-affected by FM; 2) the efficiency is reduced by the overhead introduced by the pre-training stage. In addition, FNN captures only high-order feature interactions. In contrast, DeepFM needs no pre-training and learns both high- and low-order feature interactions.
PNN: For the purpose of capturing high-order feature interactions, PNN imposes a product layer between the embedding layer and the first hidden layer [?]. According to different types of product operation, there are three variants: IPNN, OPNN, and PNN∗, where IPNN is based on the inner product of vectors, OPNN is based on the outer product, and PNN∗ is based on both inner and outer products.
To make the computation more efficient, the authors proposed approximate computations of both the inner and outer products: 1) the inner product is approximately computed by eliminating some neurons; 2) the outer product is approximately computed by compressing $d$-dimensional feature vectors to a one-dimensional vector. However, we find that the outer product is less reliable than the inner product, since the approximated computation of the outer product loses much information, which makes the result unstable. Although the inner product is more reliable, it still suffers from high computational complexity, because the output of the product layer is connected to all neurons of the first hidden layer. Different from PNN, the output of the product layer in DeepFM only connects to the final output layer (one neuron). Like FNN, all PNNs ignore low-order feature interactions.
Wide & Deep: Wide & Deep (Figure 5 (right)) is proposed by Google to model low- and high-order feature interactions simultaneously. As shown in [?], there is a need for expert feature engineering on the input to the “wide” part (for instance, the cross-product of users’ installed apps and impression apps in app recommendation). In contrast, DeepFM needs no such expert knowledge to handle the input, learning directly from the raw input features.
A straightforward extension to this model is to replace LR by FM (we also evaluate this extension in Section 3). This extension is similar to DeepFM, but DeepFM shares the feature embedding between the FM and deep components. The sharing strategy of feature embedding influences (through back-propagation) the feature representation by both low- and high-order feature interactions, which models the representation more precisely.
Summarizations: The relationship between DeepFM and the other deep models in four aspects is summarized in Table 1. As can be seen, DeepFM is the only model that requires no pre-training and no feature engineering, and captures both low- and high-order feature interactions.
              No             High-order   Low-order   No Feature
              Pre-training   Features     Features    Engineering
FNN           ×              ✓            ×           ✓
PNN           ✓              ✓            ×           ✓
Wide & Deep   ✓              ✓            ✓           ×
DeepFM        ✓              ✓            ✓           ✓
3 Experiments
In this section, we compare our proposed DeepFM with the other state-of-the-art models empirically. The evaluation results indicate that our proposed DeepFM is more effective than any other state-of-the-art model and that the efficiency of DeepFM is comparable to the best ones among the others.
3.1 Experiment Setup
Datasets
We evaluate the effectiveness and efficiency of our proposed DeepFM on the following two datasets.
1) Criteo Dataset: The Criteo dataset⁵ includes 45 million users’ click records. There are 13 continuous features and 26 categorical ones. We split the dataset randomly into two parts: 90% is for training, while the remaining 10% is for testing.

⁵ http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/
2) Company Dataset: In order to verify the performance of DeepFM in real industrial CTR prediction, we conduct experiments on the Company dataset. We collect 7 consecutive days of users’ click records from the game center of the Company App Store for training, and the next 1 day for testing. There are around 1 billion records in the whole collected dataset. In this dataset, there are app features (e.g., identification, category), user features (e.g., the user’s downloaded apps), and context features (e.g., operation time).
Evaluation Metrics
We use two evaluation metrics in our experiments: AUC (Area Under ROC) and Logloss (cross entropy).
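For reference, both metrics can be computed directly. A minimal sketch; the rank-statistic form of `auc` and the clipping constant in `logloss` are our own implementation choices, not from the paper:

```python
import numpy as np

def auc(y_true, y_score):
    """Area under the ROC curve via the rank statistic:
    P(score of a random positive > score of a random negative),
    counting ties as one half."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def logloss(y_true, p, eps=1e-15):
    """Cross entropy between binary labels and predicted click probabilities,
    clipping p away from 0 and 1 for numerical safety."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = [1, 0, 1, 0]
p = [0.9, 0.1, 0.8, 0.3]
score = auc(y, p)       # 1.0: every positive ranked above every negative
loss = logloss(y, p)
```

These pairwise-comparison implementations are quadratic and only meant to pin down the definitions; production evaluation would use a sorting-based AUC.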
Model Comparison
We compare 9 models in our experiments: LR, FM, FNN, PNN (three variants), Wide & Deep, and DeepFM. In the Wide & Deep model, for the purpose of eliminating feature engineering effort, we also adapt the original Wide & Deep model by replacing LR with FM as the wide part. To distinguish these two variants of Wide & Deep, we name them LR & DNN and FM & DNN, respectively.⁶

⁶ We do not use the Wide & Deep API released by Google, as the efficiency of that implementation is very low. We implement Wide & Deep ourselves, simplifying it with a shared optimizer for both the deep and wide parts.
Parameter Settings
To evaluate the models on the Criteo dataset, we follow the parameter settings in [?] for FNN and PNN: (1) dropout: 0.5; (2) network structure: 400-400-400; (3) optimizer: Adam; (4) activation function: tanh for IPNN, relu for the other deep models. To be fair, our proposed DeepFM uses the same settings. The optimizers of LR and FM are FTRL and Adam respectively, and the latent dimension of FM is 10.
To achieve the best performance for each individual model on the Company dataset, we conducted a careful parameter study, which is discussed in Section 3.3.
3.2 Performance Evaluation
In this section, we evaluate the models listed in Section 3.1 on the two datasets to compare their effectiveness and efficiency.
Efficiency Comparison
The efficiency of deep learning models is important to real-world applications. We compare the efficiency of different models on the Criteo dataset by the following ratio: |training time of deep CTR model| / |training time of LR|. The results are shown in Figure 6, including the tests on CPU (left) and GPU (right), where we have the following observations: 1) the pre-training of FNN makes it less efficient; 2) although the speed-up of IPNN and PNN∗ on GPU is higher than that of the other models, they are still computationally expensive because of the inefficient inner product operations; 3) DeepFM achieves almost the best efficiency in both tests.
Effectiveness Comparison
The CTR prediction performance of the different models on the Criteo and Company datasets is shown in Table 2, where we have the following observations:

Learning feature interactions improves the performance of the CTR prediction model. This observation derives from the fact that LR (the only model that does not consider feature interactions) performs worse than the other models. As the best model, DeepFM outperforms LR by 0.86% and 4.18% in terms of AUC (1.15% and 5.60% in terms of Logloss) on the Company and Criteo datasets, respectively.

Learning high- and low-order feature interactions simultaneously and properly improves the performance of the CTR prediction model. DeepFM outperforms the models that learn only low-order feature interactions (namely, FM) or only high-order feature interactions (namely, FNN, IPNN, OPNN, PNN∗). Compared to the second best model, DeepFM achieves improvements of more than 0.37% and 0.25% in terms of AUC (0.42% and 0.29% in terms of Logloss) on the Company and Criteo datasets, respectively.

Learning high- and low-order feature interactions simultaneously while sharing the same feature embedding for both improves the performance of the CTR prediction model. DeepFM outperforms the models that learn high- and low-order feature interactions using separate feature embeddings (namely, LR & DNN and FM & DNN). Compared to these two models, DeepFM achieves improvements of more than 0.48% and 0.33% in terms of AUC (0.61% and 0.66% in terms of Logloss) on the Company and Criteo datasets, respectively.
             Company              Criteo
             AUC      LogLoss     AUC      LogLoss
LR  0.8640  0.02648  0.7686  0.47762 
FM  0.8678  0.02633  0.7892  0.46077 
FNN  0.8683  0.02629  0.7963  0.45738 
IPNN  0.8664  0.02637  0.7972  0.45323 
OPNN  0.8658  0.02641  0.7982  0.45256 
PNN∗  0.8672  0.02636  0.7987  0.45214
LR & DNN  0.8673  0.02634  0.7981  0.46772 
FM & DNN  0.8661  0.02640  0.7850  0.45382 
DeepFM  0.8715  0.02618  0.8007  0.45083 
Overall, our proposed DeepFM model beats the competitors by more than 0.37% and 0.42% in terms of AUC and Logloss on the Company dataset, respectively. In fact, a small improvement in offline AUC evaluation is likely to lead to a significant increase in online CTR. As reported in [?], compared with LR, Wide & Deep improves AUC by 0.275% (offline) and the improvement of online CTR is 3.9%. The daily turnover of the Company’s App Store is millions of dollars; therefore even a lift of a few percent in CTR brings extra millions of dollars each year.
3.3 HyperParameter Study
We study the impact of different hyperparameters of different deep models, on Company dataset. The order is: 1) activation functions; 2) dropout rate; 3) number of neurons per layer; 4) number of hidden layers; 5) network shape.
Activation Function
According to [?], relu and tanh are more suitable for deep models than sigmoid. In this paper, we compare the performance of the deep models when applying relu and tanh. As shown in Figure 7, relu is more appropriate than tanh for all the deep models, except for IPNN. A possible reason is that relu induces sparsity.
Dropout
Dropout [?] refers to the probability that a neuron is kept in the network. Dropout is a regularization technique that trades off the precision and the complexity of the neural network. We set the dropout rate to 1.0, 0.9, 0.8, 0.7, 0.6, and 0.5. As shown in Figure 8, all the models are able to reach their own best performance when the dropout is properly set (from 0.6 to 0.9). The result shows that adding reasonable randomness to the model can strengthen its robustness.
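Under the keep-probability convention above, dropout at training time can be sketched as follows. The rescaling by `1 / keep_prob` (inverted dropout, so no correction is needed at test time) is a common implementation choice, not something the paper specifies:

```python
import numpy as np

def dropout(a, keep_prob, rng):
    """Keep each neuron with probability `keep_prob`, zeroing the rest.
    Survivors are scaled by 1/keep_prob so the expected activation is
    unchanged; keep_prob = 1.0 disables dropout entirely."""
    if keep_prob >= 1.0:
        return a                               # dropout = 1.0: no neurons dropped
    mask = rng.random(a.shape) < keep_prob     # Bernoulli keep-mask
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, 0.8, rng)
# roughly 20% of activations are zeroed, and the mean stays near 1.0
```

With `keep_prob` between 0.6 and 0.9, as in the experiments, the injected randomness is mild enough to regularize without destroying the signal.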
Number of Neurons per Layer
When other factors remain the same, increasing the number of neurons per layer introduces complexity. As we can observe from Figure 9, increasing the number of neurons does not always bring benefit. For instance, DeepFM performs stably when the number of neurons per layer is increased from 400 to 800, while OPNN performs worse when we increase the number of neurons from 400 to 800. This is because an over-complicated model overfits easily. In our dataset, 200 or 400 neurons per layer is a good choice.
Number of Hidden Layers
As presented in Figure 10, increasing the number of hidden layers improves the performance of the models at the beginning; however, performance degrades if the number of hidden layers keeps increasing. This phenomenon is also because of overfitting.
Network Shape
We test four different network shapes: constant, increasing, decreasing, and diamond. When we change the network shape, we fix the number of hidden layers and the total number of neurons. For instance, when the number of hidden layers is 3 and the total number of neurons is 600, the four different shapes are: constant (200-200-200), increasing (100-200-300), decreasing (300-200-100), and diamond (150-300-150). As we can see from Figure 11, the “constant” network shape is empirically better than the other three options, which is consistent with previous studies [?].
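The four shapes of the 3-layer, 600-neuron example can be generated as follows. This is only a sketch that reproduces the split quoted in the text via per-layer weights; the paper gives no general construction:

```python
def network_shape(kind, total=600):
    """Split `total` neurons over 3 hidden layers according to the shape
    name, using fixed per-layer weight ratios (an assumption chosen to
    match the 600-neuron example in the text)."""
    weights = {
        "constant":   [1, 1, 1],   # 200-200-200
        "increasing": [1, 2, 3],   # 100-200-300
        "decreasing": [3, 2, 1],   # 300-200-100
        "diamond":    [1, 2, 1],   # 150-300-150, widest layer in the middle
    }[kind]
    s = sum(weights)
    return [total * w // s for w in weights]

shapes = {k: network_shape(k) for k in
          ("constant", "increasing", "decreasing", "diamond")}
```

Fixing the total neuron budget while varying only the ratios is what lets the comparison isolate shape from capacity.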
4 Related Work
In this paper, a new deep neural network is proposed for CTR prediction. The most related domains are CTR prediction and deep learning in recommender systems. In this section, we discuss related work in these two domains.
CTR prediction plays an important role in recommender systems [?; ?; ?]. Besides generalized linear models and FM, a few other models have been proposed for CTR prediction, such as the tree-based model [?], the tensor-based model [?], support vector machines [?], and the Bayesian model [?].
The other related domain is deep learning in recommender systems. In Section 1 and Section 2.2, several deep learning models for CTR prediction were already mentioned, so we do not discuss them here. Several deep learning models have been proposed for recommendation tasks other than CTR prediction (e.g., [?; ?; ?; ?; ?; ?; ?]). [?; ?; ?] propose to improve Collaborative Filtering via deep learning. The authors of [?; ?] extract content features by deep learning to improve the performance of music recommendation. [?] devises a deep learning network to consider both the image features and the basic features of display advertising. [?] develops a two-stage deep learning framework for YouTube video recommendation.
5 Conclusions
In this paper, we proposed DeepFM, a Factorization-Machine based neural network for CTR prediction, to overcome the shortcomings of the state-of-the-art models and to achieve better performance. DeepFM trains a deep component and an FM component jointly. It gains performance improvement from these advantages: 1) it does not need any pre-training; 2) it learns both high- and low-order feature interactions; 3) it introduces a sharing strategy of feature embedding to avoid feature engineering. We conducted extensive experiments on two real-world datasets (the Criteo dataset and a commercial App Store dataset) to compare the effectiveness and efficiency of DeepFM and the state-of-the-art models. Our experiment results demonstrate that 1) DeepFM outperforms the state-of-the-art models in terms of AUC and Logloss on both datasets; 2) the efficiency of DeepFM is comparable to the most efficient deep model among the state-of-the-art.
There are two interesting directions for future study. One is exploring strategies (such as introducing pooling layers) to strengthen the ability to learn the most useful high-order feature interactions. The other is to train DeepFM on a GPU cluster for large-scale problems.
References
 [Boulanger-Lewandowski et al., 2013] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with recurrent neural networks. In ISMIR, pages 335–340, 2013.
 [Chang et al., 2010] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. JMLR, 11:1471–1490, 2010.
 [Chen et al., 2016] Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, and Xian-Sheng Hua. Deep CTR prediction in display advertising. In MM, 2016.
 [Cheng et al., 2016] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. CoRR, abs/1606.07792, 2016.
 [Covington et al., 2016] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, pages 191–198, 2016.
 [Graepel et al., 2010] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In ICML, pages 13–20, 2010.
 [He et al., 2014] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. Practical lessons from predicting clicks on ads at facebook. In ADKDD, pages 5:1–5:9, 2014.
 [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [Juan et al., 2016] Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware factorization machines for CTR prediction. In RecSys, pages 43–50, 2016.
 [Larochelle et al., 2009] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal Lamblin. Exploring strategies for training deep neural networks. JMLR, 10:1–40, 2009.
 [Liu et al., 2015] Qiang Liu, Feng Yu, Shu Wu, and Liang Wang. A convolutional click prediction model. In CIKM, 2015.
 [McMahan et al., 2013] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.
 [Qu et al., 2016] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. Productbased neural networks for user response prediction. CoRR, abs/1611.00144, 2016.
 [Rendle and SchmidtThieme, 2010] Steffen Rendle and Lars SchmidtThieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81–90, 2010.
 [Rendle, 2010] Steffen Rendle. Factorization machines. In ICDM, 2010.
 [Richardson et al., 2007] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW, pages 521–530, 2007.
 [Salakhutdinov et al., 2007] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. Restricted boltzmann machines for collaborative filtering. In ICML, pages 791–798, 2007.
 [Sedhain et al., 2015] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. Autorec: Autoencoders meet collaborative filtering. In WWW, pages 111–112, 2015.
 [Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
 [van den Oord et al., 2013] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013.
 [Wang and Wang, 2014] Xinxi Wang and Ye Wang. Improving content-based and hybrid music recommendation using deep learning. In ACM MM, pages 627–636, 2014.
 [Wang et al., 2015] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In ACM SIGKDD, pages 1235–1244, 2015.
 [Wu et al., 2016] Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. Collaborative denoising autoencoders for top-n recommender systems. In ACM WSDM, pages 153–162, 2016.
 [Wu et al., 2017] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. In WSDM, pages 495–503, 2017.
 [Zhang et al., 2014] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. Sequential click prediction for sponsored search with recurrent neural networks. In AAAI, 2014.
 [Zhang et al., 2016] Weinan Zhang, Tianming Du, and Jun Wang. Deep learning over multi-field categorical data - A case study on user response prediction. In ECIR, 2016.
 [Zheng et al., 2016] Yin Zheng, Yu-Jin Zhang, and Hugo Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. IEEE Trans. Pattern Anal. Mach. Intell., 38(6):1056–1069, 2016.
 [Zheng et al., 2017] Lei Zheng, Vahid Noroozi, and Philip S. Yu. Joint deep modeling of users and items using reviews for recommendation. In WSDM, pages 425–434, 2017.